DMA (Direct Memory Access) Complete Guide

PCIe DMA transactions, IOMMU integration, scatter-gather, and security

1. What is DMA?

What is Direct Memory Access?

DMA (Direct Memory Access) allows PCIe devices to read from and write to system memory independently, without CPU involvement in the data transfer. The device acts as a bus master, initiating memory transactions directly.

DMA vs PIO (Programmed I/O)

    Aspect           PIO                         DMA
    CPU Involvement  CPU moves each byte         CPU only sets up the transfer
    Efficiency       Low (CPU bound)             High (offloaded)
    Throughput       Limited                     Full link bandwidth
    Latency          Lower for small transfers   Better for large transfers
    Use Case         Config space, registers     Bulk data transfer

2. PCIe DMA Architecture

DMA Transaction Flow

    CPU                 System Memory              PCIe Device
     │                       │                         │
     │ 1. Setup DMA:         │                         │
     │    - Allocate buffer  │                         │
     │    - Program device   │                         │
     │ ──────────────────────────────────────────────► │
     │                       │                         │
     │                       │   2. DMA Read (MRd)     │
     │                       │ ◄────────────────────── │
     │                       │                         │
     │                       │   3. Completion (CplD)  │
     │                       │ ──────────────────────► │
     │                       │                         │
     │                       │   4. DMA Write (MWr)    │
     │                       │ ◄────────────────────── │
     │                       │                         │
     │ ◄─── 5. Interrupt ────────────────────────────── │
     │      (MSI-X)          │                         │
     │                       │                         │
     │ 6. Process results    │                         │

TLP Types for DMA

    Operation            TLP Type                      Credit Class
    DMA Read             Memory Read (MRd)             Non-Posted
    DMA Read Completion  Completion with Data (CplD)   Completion
    DMA Write            Memory Write (MWr)            Posted

3. DMA Addressing

Address Types

    Application Space
    ┌──────────────────────┐
    │   Virtual Address    │ ─── User space pointer
    └──────────┬───────────┘
               │ MMU
               ▼
    ┌──────────────────────┐
    │  Physical Address    │ ─── System memory location
    └──────────┬───────────┘
               │ IOMMU
               ▼
    ┌──────────────────────┐
    │    DMA Address       │ ─── Address seen by device
    │   (Bus Address)      │     (I/O Virtual Address)
    └──────────────────────┘

DMA Address Width

Checking Device DMA Capability

// Linux: set the device's DMA addressing capability.
// pci_set_dma_mask()/pci_set_consistent_dma_mask() are the legacy API;
// modern kernels use dma_set_mask_and_coherent().
int err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));

// If 64-bit addressing fails, fall back to 32-bit
if (err)
    err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));

4. Scatter-Gather DMA

What is Scatter-Gather?

Scatter-Gather DMA allows transferring data to/from non-contiguous memory regions in a single DMA operation, eliminating the need for buffer copying.

Scatter-Gather List (SGL) Structure

    SGL Entry 0        Physical Memory
    ┌──────────┐      ┌──────────────────────┐
    │ Addr: A  │ ───► │ Page 0 (4KB)         │
    │ Len: 4K  │      └──────────────────────┘
    └──────────┘
    SGL Entry 1      
    ┌──────────┐      ┌──────────────────────┐
    │ Addr: B  │ ───► │ Page 1 (4KB)         │
    │ Len: 4K  │      └──────────────────────┘
    └──────────┘
    SGL Entry 2      
    ┌──────────┐      ┌──────────────────────┐
    │ Addr: C  │ ───► │ Page 2 (8KB)         │
    │ Len: 8K  │      │                      │
    └──────────┘      └──────────────────────┘
    
    Total transfer: 16KB across 3 non-contiguous pages

PRP vs SGL (NVMe)

    Feature      PRP (Physical Region Page)   SGL (Scatter-Gather List)
    Entry Size   8 bytes (address only)       16 bytes (addr + len + type)
    Alignment    Page aligned (4KB)           DWord aligned
    Flexibility  Limited                      High (arbitrary lengths)
    Chaining     PRP List                     SGL Segment/Last Segment

5. IOMMU Integration

IOMMU Architecture

    PCIe Device                  IOMMU                 System Memory
         │                          │                        │
         │  DMA Request             │                        │
         │  (IOVA: 0x1000)          │                        │
         │ ────────────────────────►│                        │
         │                          │                        │
         │              ┌───────────┴───────────┐            │
         │              │ I/O Page Table Walk   │            │
         │              │ IOVA → PA translation │            │
         │              │ Permission Check      │            │
         │              └───────────┬───────────┘            │
         │                          │                        │
         │                          │  Translated Request    │
         │                          │  (PA: 0x80001000)      │
         │                          │ ──────────────────────►│
         │                          │                        │
         │                          │ ◄──────────────────────│
         │ ◄────────────────────────│  Response              │
         │                          │                        │

IOMMU Benefits

  • Isolation: each device can only access memory explicitly mapped for it
  • Address translation: 32-bit devices can reach memory above 4GB
  • Contiguous IOVA ranges can back physically scattered buffers
  • Protection against malicious or buggy DMA (see Section 7)

Intel VT-d / AMD IOMMU

    Feature        Intel VT-d          AMD-Vi
    Address Width  48-bit (up to 57)   48-bit (up to 57)
    PASID Support  Yes                 Yes
    ATS Support    Yes                 Yes
    Page Sizes     4KB, 2MB, 1GB       4KB, 2MB, 1GB

6. DMA Descriptor Ring

Ring Buffer Structure

    Descriptor Ring (in Host Memory)
    
    ┌────────────────────────────────────────────────────────────┐
    │ Entry 0 │ Entry 1 │ Entry 2 │ ... │ Entry N-1              │
    └────┬────────┬────────┬─────────────────┬───────────────────┘
         │        │        │                 │
         │        │        │                 └── Wrap around
         │        │        │
    ┌────┴──┐  ┌──┴───┐  ┌─┴────┐
    │ Head  │  │ Curr │  │ Tail │
    │ (HW)  │  │      │  │ (SW) │
    └───────┘  └──────┘  └──────┘
    
    Descriptor Entry:
    ┌──────────────────────────────────────────────────────────────┐
    │ Source Address (64-bit) │ Dest Address (64-bit)             │
    │ Length (32-bit)         │ Control Flags │ Status            │
    └──────────────────────────────────────────────────────────────┘

Producer-Consumer Model

  1. Software (producer) writes descriptors to ring
  2. Software updates Tail pointer (doorbell write)
  3. Hardware (consumer) reads descriptors from Head
  4. Hardware performs DMA operations
  5. Hardware updates Head pointer
  6. Hardware signals completion (MSI-X)

7. DMA Security

DMA Attack Vectors

  • Direct Memory Access: Malicious device reads/writes arbitrary memory
  • DMA Injection: Compromised device modifies kernel memory
  • Thunderclap: Malicious Thunderbolt/PCIe device
  • Evil Maid: Physical access attack

Protection Mechanisms

    Mechanism         Protection
    IOMMU             Address translation and access control
    ACS               Access Control Services - peer-to-peer routing control
    IDE               Integrity and Data Encryption - link/selective TLP encryption
    Boot-time IOMMU   Enable the IOMMU before device enumeration

8. DMA Programming (Linux)

DMA API Usage

// Allocate a coherent DMA buffer (free with dma_free_coherent())
void *buffer;
dma_addr_t dma_handle;
buffer = dma_alloc_coherent(&pdev->dev, size, &dma_handle, GFP_KERNEL);

// Map an existing buffer for streaming DMA; always check for failure
dma_addr_t dma_addr;
dma_addr = dma_map_single(&pdev->dev, buf, size, DMA_TO_DEVICE);
if (dma_mapping_error(&pdev->dev, dma_addr))
    return -ENOMEM;

// Scatter-gather mapping: pass the entry count in, get the number of
// mapped segments back (may be fewer if the IOMMU coalesced entries)
int mapped = dma_map_sg(&pdev->dev, sg, nents, DMA_FROM_DEVICE);

// Unmap when done (dma_unmap_sg() takes the original nents)
dma_unmap_single(&pdev->dev, dma_addr, size, DMA_TO_DEVICE);
dma_unmap_sg(&pdev->dev, sg, nents, DMA_FROM_DEVICE);

Streaming vs Coherent DMA

    Type       Use Case                         Caching
    Coherent   Descriptor rings, shared data    Uncached or cache-coherent
    Streaming  Data buffers, one-way transfer   Cached, requires sync

9. Performance Optimization

DMA Best Practices

  • Use coherent mappings for descriptor rings, streaming mappings for data
  • Batch descriptor updates and ring the doorbell once per batch
  • Align buffers to cache-line and page boundaries where possible
  • Prefer scatter-gather over copying into a contiguous bounce buffer

TLP Size Optimization

    Max Payload Size (MPS) impact:
    
    128B MPS, 4KB transfer: 32 TLPs (32 × 16B header = 512B overhead)
    256B MPS, 4KB transfer: 16 TLPs (16 × 16B header = 256B overhead)
    512B MPS, 4KB transfer:  8 TLPs ( 8 × 16B header = 128B overhead)
    
    Larger MPS = Fewer TLPs = Better efficiency