DMA (Direct Memory Access) Complete Guide

PCIe DMA transactions, IOMMU integration, scatter-gather, and security

1. What is DMA?

What is Direct Memory Access?

DMA (Direct Memory Access) allows PCIe devices to read from and write to system memory independently, without CPU involvement in the data transfer. The device acts as a bus master, initiating memory transactions directly.

DMA vs PIO (Programmed I/O)

    Aspect           PIO                         DMA
    CPU Involvement  CPU moves each byte         CPU only sets up the transfer
    Efficiency       Low (CPU bound)             High (offloaded)
    Throughput       Limited                     Full link bandwidth
    Latency          Lower for small transfers   Better for large transfers
    Use Case         Config space, registers     Bulk data transfer

2. PCIe DMA Architecture

DMA Transaction Flow

    CPU                 System Memory              PCIe Device
     │                       │                         │
     │ 1. Setup DMA:         │                         │
     │    - Allocate buffer  │                         │
     │    - Program device   │                         │
     │ ──────────────────────────────────────────────► │
     │                       │                         │
     │                       │   2. DMA Read (MRd)     │
     │                       │ ◄────────────────────── │
     │                       │                         │
     │                       │   3. Completion (CplD)  │
     │                       │ ──────────────────────► │
     │                       │                         │
     │                       │   4. DMA Write (MWr)    │
     │                       │ ◄────────────────────── │
     │                       │                         │
     │ ◄─── 5. Interrupt ────────────────────────────── │
     │      (MSI-X)          │                         │
     │                       │                         │
     │ 6. Process results    │                         │

TLP Types for DMA

    Operation            TLP Type                      Credit Class
    DMA Read             Memory Read (MRd)             Non-Posted
    DMA Read Completion  Completion with Data (CplD)   Completion
    DMA Write            Memory Write (MWr)            Posted

3. DMA Addressing

Address Types

    Application Space
    ┌──────────────────────┐
    │   Virtual Address    │ ─── User space pointer
    └──────────┬───────────┘
               │ MMU
               ▼
    ┌──────────────────────┐
    │  Physical Address    │ ─── System memory location
    └──────────┬───────────┘
               │ IOMMU
               ▼
    ┌──────────────────────┐
    │    DMA Address       │ ─── Address seen by device
    │   (Bus Address)      │     (I/O Virtual Address)
    └──────────────────────┘

DMA Address Width

Checking Device DMA Capability

// Linux: set the device's DMA addressing capability.
// pci_set_dma_mask()/pci_set_consistent_dma_mask() are the legacy API;
// modern kernels use dma_set_mask_and_coherent().
int err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));

// If 64-bit addressing fails, fall back to 32-bit
if (err)
    err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));

4. Scatter-Gather DMA

What is Scatter-Gather?

Scatter-Gather DMA allows transferring data to/from non-contiguous memory regions in a single DMA operation, eliminating the need for buffer copying.

Scatter-Gather List (SGL) Structure

    SGL Entry 0        Physical Memory
    ┌──────────┐      ┌──────────────────────┐
    │ Addr: A  │ ───► │ Page 0 (4KB)         │
    │ Len: 4K  │      └──────────────────────┘
    └──────────┘
    SGL Entry 1      
    ┌──────────┐      ┌──────────────────────┐
    │ Addr: B  │ ───► │ Page 1 (4KB)         │
    │ Len: 4K  │      └──────────────────────┘
    └──────────┘
    SGL Entry 2      
    ┌──────────┐      ┌──────────────────────┐
    │ Addr: C  │ ───► │ Page 2 (8KB)         │
    │ Len: 8K  │      │                      │
    └──────────┘      └──────────────────────┘
    
    Total transfer: 16KB across 3 non-contiguous pages

PRP vs SGL (NVMe)

    Feature      PRP (Physical Region Page)   SGL (Scatter-Gather List)
    Entry Size   8 bytes (address only)       16 bytes (addr + len + type)
    Alignment    Page aligned (4KB)           DWord aligned
    Flexibility  Limited                      High (arbitrary lengths)
    Chaining     PRP List                     SGL Segment/Last Segment

5. IOMMU Integration

IOMMU Architecture

    PCIe Device                  IOMMU                 System Memory
         │                          │                        │
         │  DMA Request             │                        │
         │  (IOVA: 0x1000)          │                        │
         │ ────────────────────────►│                        │
         │                          │                        │
         │              ┌───────────┴───────────┐            │
         │              │ I/O Page Table Walk   │            │
         │              │ IOVA → PA translation │            │
         │              │ Permission Check      │            │
         │              └───────────┬───────────┘            │
         │                          │                        │
         │                          │  Translated Request    │
         │                          │  (PA: 0x80001000)      │
         │                          │ ──────────────────────►│
         │                          │                        │
         │                          │ ◄──────────────────────│
         │ ◄────────────────────────│  Response              │
         │                          │                        │

IOMMU Benefits

  • Isolation: each device can only access memory explicitly mapped for it
  • Address translation: 32-bit devices can reach memory above 4GB
  • Contiguous IOVA ranges can back physically scattered buffers
  • Protection against malicious or buggy DMA (see Section 7)

Intel VT-d / AMD IOMMU

    Feature        Intel VT-d          AMD-Vi
    Address Width  48-bit (up to 57)   48-bit (up to 57)
    PASID Support  Yes                 Yes
    ATS Support    Yes                 Yes
    Page Sizes     4KB, 2MB, 1GB       4KB, 2MB, 1GB

6. DMA Descriptor Ring

Ring Buffer Structure

    Descriptor Ring (in Host Memory)
    
    ┌────────────────────────────────────────────────────────────┐
    │ Entry 0 │ Entry 1 │ Entry 2 │ ... │ Entry N-1              │
    └────┬────────┬────────┬─────────────────┬───────────────────┘
         │        │        │                 │
         │        │        │                 └── Wrap around
         │        │        │
    ┌────┴──┐  ┌──┴───┐  ┌─┴────┐
    │ Head  │  │ Curr │  │ Tail │
    │ (HW)  │  │      │  │ (SW) │
    └───────┘  └──────┘  └──────┘
    
    Descriptor Entry:
    ┌──────────────────────────────────────────────────────────────┐
    │ Source Address (64-bit) │ Dest Address (64-bit)             │
    │ Length (32-bit)         │ Control Flags │ Status            │
    └──────────────────────────────────────────────────────────────┘

Producer-Consumer Model

  1. Software (producer) writes descriptors to ring
  2. Software updates Tail pointer (doorbell write)
  3. Hardware (consumer) reads descriptors from Head
  4. Hardware performs DMA operations
  5. Hardware updates Head pointer
  6. Hardware signals completion (MSI-X)

7. DMA Security

DMA Attack Vectors

  • Direct Memory Access: Malicious device reads/writes arbitrary memory
  • DMA Injection: Compromised device modifies kernel memory
  • Thunderclap: Malicious Thunderbolt/PCIe device
  • Evil Maid: Physical access attack

Protection Mechanisms

    Mechanism         Protection
    IOMMU             Address translation and access control
    ACS               Access Control Services - peer-to-peer routing control
    IDE               Integrity and Data Encryption - link/selective TLP encryption
    Boot-time IOMMU   Enable the IOMMU before device enumeration

8. DMA Programming (Linux)

DMA API Usage

// Allocate a coherent DMA buffer (free with dma_free_coherent())
void *buffer;
dma_addr_t dma_handle;
buffer = dma_alloc_coherent(&pdev->dev, size, &dma_handle, GFP_KERNEL);

// Map an existing buffer for streaming DMA; always check for failure
dma_addr_t dma_addr;
dma_addr = dma_map_single(&pdev->dev, buf, size, DMA_TO_DEVICE);
if (dma_mapping_error(&pdev->dev, dma_addr))
    return -ENOMEM;

// Scatter-gather mapping: pass the entry count in, get the number of
// mapped segments back (may be fewer if the IOMMU coalesced entries)
int mapped = dma_map_sg(&pdev->dev, sg, nents, DMA_FROM_DEVICE);

// Unmap when done (dma_unmap_sg() takes the original nents)
dma_unmap_single(&pdev->dev, dma_addr, size, DMA_TO_DEVICE);
dma_unmap_sg(&pdev->dev, sg, nents, DMA_FROM_DEVICE);

Streaming vs Coherent DMA

    Type       Use Case                         Caching
    Coherent   Descriptor rings, shared data    Uncached or cache-coherent
    Streaming  Data buffers, one-way transfer   Cached, requires sync

9. Performance Optimization

DMA Best Practices

  • Use coherent mappings for descriptor rings, streaming mappings for data
  • Batch descriptor updates and ring the doorbell once per batch
  • Align buffers to cache-line and page boundaries where possible
  • Prefer scatter-gather over copying into a contiguous bounce buffer

TLP Size Optimization

    Max Payload Size (MPS) impact:
    
    128B MPS, 4KB transfer: 32 TLPs (32 × 16B header = 512B overhead)
    256B MPS, 4KB transfer: 16 TLPs (16 × 16B header = 256B overhead)
    512B MPS, 4KB transfer:  8 TLPs ( 8 × 16B header = 128B overhead)
    
    Larger MPS = Fewer TLPs = Better efficiency