PCIe DMA transactions, IOMMU integration, scatter-gather, and security
DMA (Direct Memory Access) allows PCIe devices to read from and write to system memory independently, without CPU involvement in the data transfer. The device acts as a bus master, initiating memory transactions directly.
| Aspect | PIO | DMA |
|---|---|---|
| CPU Involvement | CPU moves each byte | CPU only sets up transfer |
| Efficiency | Low (CPU bound) | High (offloaded) |
| Throughput | Limited by CPU load/store rate | Can approach link bandwidth |
| Latency | Lower for small transfers | Setup cost amortized over large transfers |
| Use Case | Config, registers | Bulk data transfer |
CPU System Memory PCIe Device
│ │ │
│ 1. Setup DMA: │ │
│ - Allocate buffer │ │
│ - Program device │ │
│ ──────────────────────────────────────────────► │
│ │ │
│ │ 2. DMA Read (MRd) │
│ │ ◄────────────────────── │
│ │ │
│ │ 3. Completion (CplD) │
│ │ ──────────────────────► │
│ │ │
│ │ 4. DMA Write (MWr) │
│ │ ◄────────────────────── │
│ │ │
│ ◄─── 5. Interrupt ────────────────────────────── │
│ (MSI-X) │ │
│ │ │
│ 6. Process results │ │
| Operation | TLP Type | Credit Class |
|---|---|---|
| DMA Read | Memory Read (MRd) | Non-Posted |
| DMA Read Completion | Completion with Data (CplD) | Completion |
| DMA Write | Memory Write (MWr) | Posted |
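The mapping in the table above can be written down as a small lookup. This is only an illustrative userspace sketch: the enum names, `struct tlp_info`, and `tlp_for_op()` are hypothetical and do not come from any real driver or library.

```c
#include <assert.h>
#include <string.h>

/* Flow-control credit classes used by PCIe TLPs. */
enum credit_class { CREDIT_POSTED, CREDIT_NON_POSTED, CREDIT_COMPLETION };

/* DMA-related operations from the table above (hypothetical names). */
enum dma_op { DMA_OP_READ, DMA_OP_READ_COMPLETION, DMA_OP_WRITE };

struct tlp_info {
    const char *mnemonic;      /* TLP type abbreviation */
    enum credit_class credit;  /* credit class the TLP consumes */
    int expects_completion;    /* 1 if the requester waits for a completion */
};

/* Return the TLP type and credit class for a DMA operation. */
static struct tlp_info tlp_for_op(enum dma_op op)
{
    switch (op) {
    case DMA_OP_READ:
        /* Reads are non-posted: the device must wait for CplD. */
        return (struct tlp_info){ "MRd", CREDIT_NON_POSTED, 1 };
    case DMA_OP_READ_COMPLETION:
        return (struct tlp_info){ "CplD", CREDIT_COMPLETION, 0 };
    case DMA_OP_WRITE:
    default:
        /* Writes are posted: no completion is returned. */
        return (struct tlp_info){ "MWr", CREDIT_POSTED, 0 };
    }
}
```

The `expects_completion` flag captures the practical difference: a DMA read keeps state outstanding at the device until the completion arrives, while a DMA write is fire-and-forget at the transaction layer.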
Application Space
┌──────────────────────┐
│ Virtual Address │ ─── User space pointer
└──────────┬───────────┘
│ MMU
▼
┌──────────────────────┐
│ Physical Address │ ─── System memory location
└──────────┬───────────┘
│ IOMMU
▼
┌──────────────────────┐
│ DMA Address │ ─── Address seen by device
│ (Bus Address) │ (I/O Virtual Address)
└──────────────────────┘
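// Linux: set the device DMA mask (covers both streaming and coherent mappings).
// The older pci_set_dma_mask()/pci_set_consistent_dma_mask() wrappers are
// deprecated and have been removed from recent kernels.
int err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
// If 64-bit addressing is rejected, fall back to 32-bit
if (err)
    err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
if (err)
    return err; // no usable DMA addressing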
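To make the meaning of the mask concrete, here is a minimal userspace sketch of what a DMA mask expresses. `dma_bit_mask()` is a stand-in that mirrors the kernel's `DMA_BIT_MASK(n)` macro, and `dma_range_fits()` is a hypothetical helper introduced only for illustration; the kernel performs an equivalent check internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for DMA_BIT_MASK(n): the low n bits set.
 * n >= 64 is special-cased because shifting a 64-bit value by 64
 * is undefined behavior in C. */
static uint64_t dma_bit_mask(unsigned int n)
{
    return (n >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
}

/* True if every byte of [addr, addr + len) is addressable under mask. */
static bool dma_range_fits(uint64_t addr, uint64_t len, uint64_t mask)
{
    if (len == 0)
        return addr <= mask;

    uint64_t last = addr + len - 1;
    if (last < addr)          /* range wrapped past 2^64 */
        return false;
    return last <= mask;
}
```

A device limited to a 32-bit mask can reach a buffer that ends exactly at the 4 GB boundary, but not one that extends a single byte beyond it. On such systems the kernel typically has to bounce the data or remap it through an IOMMU.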
Scatter-gather DMA transfers data to or from non-contiguous memory regions in a single DMA operation, so the driver does not have to copy the data into one physically contiguous buffer first.
SGL Entry 0 Physical Memory
┌──────────┐ ┌──────────────────────┐
│ Addr: A │ ───► │ Page 0 (4KB) │
│ Len: 4K │ └──────────────────────┘
└──────────┘
SGL Entry 1
┌──────────┐ ┌──────────────────────┐
│ Addr: B │ ───► │ Page 1 (4KB) │
│ Len: 4K │ └──────────────────────┘
└──────────┘
SGL Entry 2
┌──────────┐ ┌──────────────────────┐
│ Addr: C │ ───► │ Pages 2-3 (8KB, contiguous) │
│ Len: 8K │ │ │
└──────────┘ └──────────────────────┘
Total transfer: 16KB across 3 non-contiguous regions
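The list in the diagram can be modeled as an array of address/length pairs. The sketch below is a userspace illustration, not kernel code: `struct sg_entry`, `sg_total_len()`, and `sg_coalesce()` are hypothetical names. The coalescing step shows why a mapped list can end up with fewer entries than the original one when neighboring segments happen to be adjacent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sg_entry {
    uint64_t addr;   /* DMA address of the segment */
    uint32_t len;    /* segment length in bytes */
};

/* Sum of all segment lengths: the size of the whole transfer. */
static uint64_t sg_total_len(const struct sg_entry *sgl, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += sgl[i].len;
    return total;
}

/* Merge entries that are adjacent in memory, in place.
 * Returns the new entry count. */
static size_t sg_coalesce(struct sg_entry *sgl, size_t n)
{
    if (n == 0)
        return 0;

    size_t out = 0;   /* index of the entry currently being extended */
    for (size_t i = 1; i < n; i++) {
        uint64_t end = sgl[out].addr + sgl[out].len;
        uint64_t merged = (uint64_t)sgl[out].len + sgl[i].len;

        if (sgl[i].addr == end && merged <= UINT32_MAX)
            sgl[out].len = (uint32_t)merged;   /* extend current entry */
        else
            sgl[++out] = sgl[i];               /* start a new entry */
    }
    return out + 1;
}
```

For the three regions in the diagram (4 KB, 4 KB, 8 KB at unrelated addresses) nothing merges and the total is 16 KB. If the first two regions were back to back, the list would shrink to two entries of 8 KB each.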
| Feature | PRP (Physical Region Page) | SGL (Scatter-Gather List) |
|---|---|---|
| Entry Size | 8 bytes (address only) | 16 bytes (addr + len + type) |
| Alignment | Page aligned (first entry may carry an offset) | Byte or DWord aligned (controller-dependent) |
| Flexibility | Limited | High (arbitrary lengths) |
| Chaining | PRP List | SGL Segment/Last Segment |
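As a worked example of the PRP side of the table, the sketch below counts how many memory pages a transfer touches and how many entries a PRP list would need. It assumes the NVMe convention that PRP1 covers the first page, PRP2 points directly at the second page when the transfer spans exactly two pages, and PRP2 points at a PRP list when it spans more. It also assumes the list fits in a single page, so list chaining is ignored. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PRP_ENTRY_BYTES 8u    /* one 64-bit address per PRP entry */
#define SGL_DESC_BYTES  16u   /* address + length + type per SGL descriptor */

/* Number of memory pages touched by a transfer starting at addr.
 * page_size must be a power of two and len must be non-zero. */
static uint32_t prp_pages_touched(uint64_t addr, uint32_t len,
                                  uint32_t page_size)
{
    uint64_t offset = addr & (uint64_t)(page_size - 1);
    return (uint32_t)((offset + len + page_size - 1) / page_size);
}

/* Entries needed in a separate PRP list (0 if PRP1/PRP2 suffice).
 * With more than two pages, PRP1 covers the first page and the list
 * holds one entry for each remaining page. */
static uint32_t prp_list_entries(uint64_t addr, uint32_t len,
                                 uint32_t page_size)
{
    uint32_t pages = prp_pages_touched(addr, len, page_size);
    return pages > 2 ? pages - 1 : 0;
}
```

A page-aligned 16 KB transfer with 4 KB pages touches four pages and needs a three-entry list (24 bytes). The same data described as one contiguous SGL data block would need a single 16-byte descriptor, which is the flexibility trade-off the table summarizes.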
PCIe Device IOMMU System Memory
│ │ │
│ DMA Request │ │
│ (IOVA: 0x1000) │ │
│ ────────────────────────►│ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ I/O Page Table Walk │ │
│ │ IOVA → PA translation │ │
│ │ Permission Check │ │
│ └───────────┬───────────┘ │
│ │ │
│ │ Translated Request │
│ │ (PA: 0x80001000) │
│ │ ──────────────────────►│
│ │ │
│ │ ◄──────────────────────│
│ ◄────────────────────────│ Response │
│ │ │
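The translation and permission check in the diagram can be sketched in userspace as follows. This is a deliberately simplified model: a real IOMMU walks a multi-level page table selected by the requester ID, whereas this example uses a flat array of entries. All names (`struct io_pte`, `struct io_domain`, `iommu_translate()`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IO_PAGE_SHIFT 12u
#define IO_PAGE_SIZE  (UINT64_C(1) << IO_PAGE_SHIFT)
#define IO_PERM_READ  0x1u
#define IO_PERM_WRITE 0x2u

/* One mapping: an IOVA page to a physical page, with permissions. */
struct io_pte {
    uint64_t iova_pfn;   /* I/O virtual page frame number */
    uint64_t phys_pfn;   /* physical page frame number */
    uint32_t perms;      /* IO_PERM_* bits */
};

/* Flat stand-in for a device's I/O page table. */
struct io_domain {
    const struct io_pte *ptes;
    size_t count;
};

/* Translate iova for the requested access type.
 * Returns false (a DMA fault) if the page is unmapped or the access
 * is not permitted; otherwise stores the physical address in *pa_out. */
static bool iommu_translate(const struct io_domain *dom, uint64_t iova,
                            uint32_t access, uint64_t *pa_out)
{
    uint64_t pfn = iova >> IO_PAGE_SHIFT;

    for (size_t i = 0; i < dom->count; i++) {
        const struct io_pte *pte = &dom->ptes[i];
        if (pte->iova_pfn != pfn)
            continue;
        if ((pte->perms & access) != access)
            return false;                       /* permission fault */
        *pa_out = (pte->phys_pfn << IO_PAGE_SHIFT) |
                  (iova & (IO_PAGE_SIZE - 1));  /* keep the page offset */
        return true;
    }
    return false;                               /* no mapping: fault */
}
```

With a mapping from IOVA page 0x1 to physical page 0x80001, a request for IOVA 0x1000 resolves to PA 0x80001000, matching the diagram. A write to a read-only mapping or an access to an unmapped IOVA is rejected instead of reaching memory.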
| Feature | Intel VT-d | AMD-Vi |
|---|---|---|
| Address Width | 48-bit (up to 57) | 48-bit (up to 57) |
| PASID Support | Yes | Yes |
| ATS Support | Yes | Yes |
| Page Sizes | 4KB, 2MB, 1GB | 4KB, 2MB, 1GB |
Descriptor Ring (in Host Memory)
┌────────────────────────────────────────────────────────────┐
│ Entry 0 │ Entry 1 │ Entry 2 │ ... │ Entry N-1 │
└────┬────────┬────────┬─────────────────┬───────────────────┘
│ │ │ │
│ │ │ └── Wrap around
│ │ │
┌────┴──┐ ┌──┴───┐ ┌─┴────┐
│ Head │ │ Curr │ │ Tail │
│ (HW) │ │ │ │ (SW) │
└───────┘ └──────┘ └──────┘
Descriptor Entry:
┌──────────────────────────────────────────────────────────────┐
│ Source Address (64-bit) │ Dest Address (64-bit) │
│ Length (32-bit) │ Control Flags │ Status │
└──────────────────────────────────────────────────────────────┘
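A descriptor ring like the one above can be sketched as a fixed-size circular buffer with a hardware-owned head and a software-owned tail. The layout below follows the fields in the diagram, but the field widths for control and status, the `DESC_STATUS_DONE` bit, and all function names are assumptions made for illustration. The device side is simulated by a plain function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE        8u    /* power of two so indices wrap with a mask */
#define DESC_STATUS_DONE 0x1u  /* hypothetical "completed" status bit */

struct dma_desc {
    uint64_t src_addr;
    uint64_t dst_addr;
    uint32_t length;
    uint16_t control;
    uint16_t status;
};

struct desc_ring {
    struct dma_desc desc[RING_SIZE];
    uint32_t head;   /* next descriptor the device will process (HW) */
    uint32_t tail;   /* next free slot the driver will fill (SW) */
};

/* One slot is always left empty so that head == tail means "empty". */
static uint32_t ring_free_slots(const struct desc_ring *r)
{
    return (r->head - r->tail - 1) & (RING_SIZE - 1);
}

/* Driver side: queue one transfer. Returns false if the ring is full. */
static bool ring_post(struct desc_ring *r, uint64_t src, uint64_t dst,
                      uint32_t len)
{
    if (ring_free_slots(r) == 0)
        return false;

    struct dma_desc *d = &r->desc[r->tail];
    d->src_addr = src;
    d->dst_addr = dst;
    d->length = len;
    d->control = 0;
    d->status = 0;
    /* A real driver would issue a write barrier here and then write
     * the new tail to a device doorbell register. */
    r->tail = (r->tail + 1) & (RING_SIZE - 1);
    return true;
}

/* Simulated device side: complete the descriptor at head. */
static bool ring_device_complete_one(struct desc_ring *r)
{
    if (r->head == r->tail)
        return false;   /* nothing pending */

    r->desc[r->head].status = DESC_STATUS_DONE;
    r->head = (r->head + 1) & (RING_SIZE - 1);
    return true;
}
```

Keeping one slot unused is a common way to tell a full ring from an empty one without a separate counter; an 8-entry ring therefore holds at most 7 outstanding descriptors.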
| Mechanism | Protection |
|---|---|
| IOMMU | Address translation and access control |
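| ACS | Access Control Services: controls peer-to-peer routing so P2P traffic can be forced upstream through the IOMMU |
| IDE | Integrity and Data Encryption: encrypts and integrity-protects TLPs per link or per selective stream |
| Boot-time IOMMU | Enable the IOMMU before device enumeration so devices cannot DMA unchecked during early boot |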
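// The snippets below are independent examples of the Linux DMA mapping API.

// Allocate a coherent DMA buffer
void *buffer;
dma_addr_t dma_handle;
buffer = dma_alloc_coherent(&pdev->dev, size, &dma_handle, GFP_KERNEL);
if (!buffer)
    return -ENOMEM;

// Map an existing buffer for streaming DMA
dma_addr_t dma_addr;
dma_addr = dma_map_single(&pdev->dev, buf, size, DMA_TO_DEVICE);
if (dma_mapping_error(&pdev->dev, dma_addr))
    return -ENOMEM;

// Scatter-gather mapping: returns the number of mapped entries, which may be
// smaller than nents; 0 means the mapping failed
int mapped = dma_map_sg(&pdev->dev, sg, nents, DMA_FROM_DEVICE);
if (mapped == 0)
    return -ENOMEM;

// Unmap when done; dma_unmap_sg() takes the original nents, not the mapped count
dma_unmap_single(&pdev->dev, dma_addr, size, DMA_TO_DEVICE);
dma_unmap_sg(&pdev->dev, sg, nents, DMA_FROM_DEVICE);
dma_free_coherent(&pdev->dev, size, buffer, dma_handle);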
| Type | Use Case | Caching |
|---|---|---|
| Coherent | Descriptor rings, shared data | Uncached or cache-coherent |
| Streaming | Data buffers, one-way transfer | Cached; requires dma_sync_*() calls or unmapping to hand ownership between CPU and device |
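Max Payload Size (MPS) impact (assuming a 16B 4DW header per TLP; framing, sequence number, and LCRC add further per-TLP overhead):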
128B MPS, 4KB transfer: 32 TLPs (32 × 16B header = 512B overhead)
256B MPS, 4KB transfer: 16 TLPs (16 × 16B header = 256B overhead)
512B MPS, 4KB transfer: 8 TLPs ( 8 × 16B header = 128B overhead)
Larger MPS = Fewer TLPs = Better efficiency
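The arithmetic above can be reproduced with a few helper functions. This is a minimal sketch that counts only TLP header bytes, as the figures above do; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of TLPs needed to carry transfer_bytes at a given MPS. */
static uint32_t tlp_count(uint32_t transfer_bytes, uint32_t mps)
{
    return (transfer_bytes + mps - 1) / mps;   /* round up */
}

/* Total header bytes for the transfer. */
static uint32_t header_overhead_bytes(uint32_t transfer_bytes, uint32_t mps,
                                      uint32_t header_bytes)
{
    return tlp_count(transfer_bytes, mps) * header_bytes;
}

/* Fraction of transmitted bytes that is payload (headers only). */
static double payload_efficiency(uint32_t transfer_bytes, uint32_t mps,
                                 uint32_t header_bytes)
{
    uint32_t overhead =
        header_overhead_bytes(transfer_bytes, mps, header_bytes);
    return (double)transfer_bytes / ((double)transfer_bytes + overhead);
}
```

For a 4 KB transfer with 16-byte headers, the header-only efficiency works out to roughly 88.9% at 128B MPS, 94.1% at 256B, and 97.0% at 512B.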