Cache-aware data placement hints for improved memory subsystem efficiency
TPH (TLP Processing Hints) is an optional PCIe capability, introduced in PCIe 2.1, that lets a device attach hints to its memory requests describing how the data will be used and where it should be placed in the host's cache hierarchy. The root complex and memory subsystem may use these hints when deciding whether, and where, to cache the data.
Without hints, DMA data often pollutes CPU caches with data the CPU will not read soon, or lands in memory when the CPU needs it in cache right away.
Without TPH:

    Network Card                       CPU L3 Cache
    │                              ┌─────────────────┐
    │ DMA Write (packet data)      │ Hot data (app)  │ ← evicted!
    │ ─────────────────────────────│─────────────────│
    │                              │ Packet (cold)   │ ← inserted
    │                              └─────────────────┘

Result: Hot application data evicted for cold network data
With TPH (BiDir hint):

    Network Card                       CPU L3 Cache
    │                              ┌─────────────────┐
    │ DMA Write + BiDir hint       │ Hot data (app)  │ ← preserved!
    │ ─────────────────────────────│─────────────────│
    │                              │                 │
    │                              └─────────────────┘
    │                                       ▼
    │                             Memory (bypass cache)

Result: Network data bypasses cache, app data preserved (how a platform acts on a given hint is implementation-specific)
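The eviction effect in the two diagrams can be reproduced with a toy model. The sketch below assumes a tiny fully associative LRU cache, which is far simpler than a real LLC; the class `LRUCache`, the function `hot_hit_rate`, and all sizes are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache (not a model of any real LLC)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, insert addr and evict the LRU line."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        self.lines[addr] = None
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)
        return False


def hot_hit_rate(dma_allocates, capacity=8, hot_lines=6,
                 packets_per_round=8, rounds=50):
    """Hit rate of the application's hot lines under a stream of DMA writes.

    dma_allocates=True  -> every DMA write is inserted into the cache.
    dma_allocates=False -> DMA writes bypass the cache and go to memory.
    """
    cache = LRUCache(capacity)
    hot = [("app", i) for i in range(hot_lines)]
    for addr in hot:                      # warm the cache with hot data
        cache.access(addr)

    hits = total = 0
    next_packet = 0
    for _ in range(rounds):
        for _ in range(packets_per_round):
            if dma_allocates:
                cache.access(("pkt", next_packet))
            next_packet += 1
        for addr in hot:                  # application touches its hot set
            hits += cache.access(addr)
            total += 1
    return hits / total


print("DMA allocates in cache:", hot_hit_rate(True))
print("DMA bypasses cache:    ", hot_hit_rate(False))
```

With these parameters each burst of packets is as large as the cache, so the hot set is fully evicted between application accesses unless the DMA writes bypass the cache.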
Processing Hint (PH) encodings:

| PH Value | Name | Description | Use Case |
|---|---|---|---|
| 00 | Bi-directional (BiDir) | Data accessed by device and host roughly equally | Shared buffers |
| 01 | Requester (Req) | Data accessed primarily by the Requester (the device) | Device workspace |
| 10 | Target | Data accessed primarily by the target (the host CPU) | Inbound data |
| 11 | Target with Priority | Target access with high temporal locality, so the data should be favored for caching | Latency-sensitive |
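As an illustration of how a driver might pick among these encodings, here is a minimal sketch. The enum values follow the table; the selection policy in `choose_ph` and its parameter names are hypothetical and not taken from any specification or driver.

```python
from enum import IntEnum


class ProcessingHint(IntEnum):
    """PH[1:0] encodings from the table above."""
    BIDIRECTIONAL = 0b00
    REQUESTER = 0b01
    TARGET = 0b10
    TARGET_WITH_PRIORITY = 0b11


def choose_ph(device_reuses: bool, host_consumes: bool,
              latency_sensitive: bool = False) -> ProcessingHint:
    """Hypothetical policy: choose a hint from who touches the buffer next."""
    if host_consumes and not device_reuses:
        if latency_sensitive:
            return ProcessingHint.TARGET_WITH_PRIORITY
        return ProcessingHint.TARGET
    if device_reuses and not host_consumes:
        return ProcessingHint.REQUESTER
    # Both sides access the data (or usage is unknown): use the default hint.
    return ProcessingHint.BIDIRECTIONAL


# Example: an RX packet buffer that only the host will read.
print(choose_ph(device_reuses=False, host_consumes=True).name)
```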
Device ST Table (located in the TPH capability structure or in the MSI-X table, depending on the device; the mappings shown are examples, since Steering Tag meanings are platform-specific):

    ┌─────────────────────────────────────────────┐
    │ Entry 0: ST[15:0] │ (maps to LLC way 0)     │
    ├─────────────────────────────────────────────┤
    │ Entry 1: ST[15:0] │ (maps to LLC way 1)     │
    ├─────────────────────────────────────────────┤
    │ Entry 2: ST[15:0] │ (maps to CPU 0 cache)   │
    ├─────────────────────────────────────────────┤
    │ ...                                         │
    └─────────────────────────────────────────────┘

Software configures the ST Table based on system topology.
The device selects an entry by index and places that entry's Steering Tag value in the TLP header.
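The split between software programming the table and the device reading it can be sketched as follows. `SteeringTagTable` is a hypothetical in-memory model, not a real driver interface; it assumes 8-bit tags by default and 16-bit tags when Extended TPH is in use.

```python
class SteeringTagTable:
    """Toy model of a device ST Table: entry index -> Steering Tag value."""

    def __init__(self, num_entries: int, extended: bool = False):
        # Assumption: 8-bit tags normally, 16-bit tags with Extended TPH.
        self.max_tag = 0xFFFF if extended else 0xFF
        self.entries = [0] * num_entries

    def _check_index(self, index: int) -> None:
        if not 0 <= index < len(self.entries):
            raise IndexError(f"ST Table index {index} out of range")

    def program(self, index: int, tag: int) -> None:
        """Software side: write one entry based on system topology."""
        self._check_index(index)
        if not 0 <= tag <= self.max_tag:
            raise ValueError(f"Steering Tag {tag:#x} does not fit the table")
        self.entries[index] = tag

    def lookup(self, index: int) -> int:
        """Device side: fetch the tag to place in the outgoing TLP."""
        self._check_index(index)
        return self.entries[index]


table = SteeringTagTable(num_entries=4)
table.program(2, 0x2A)           # tag value is made up for the example
print(hex(table.lookup(2)))
```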
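TLP Header with TPH (TH bit = 1), shown for a Memory Write Request with a 32-bit address:

    DW0: ┌────────────────────────────────────────────────────┐
         │ Fmt │ Type │ ... │ TH=1 │ ... │ Length             │
         └────────────────────────────────────────────────────┘
    DW1: ┌────────────────────────────────────────────────────┐
         │ Requester ID │ ST[7:0] (Tag field) │ Byte Enables  │
         └────────────────────────────────────────────────────┘
    DW2: ┌────────────────────────────────────────────────────┐
         │ Address[31:2]                            │ PH[1:0] │
         └────────────────────────────────────────────────────┘

PH[1:0] occupies the two low-order bits of the address field. For Memory Read Requests, ST[7:0] is carried in the Byte Enable field instead of the Tag field. ST[15:8] is used only with Extended TPH and travels in an optional TPH TLP Prefix rather than in the header itself.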
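To make the bit positions concrete, the sketch below packs and unpacks the TPH fields of a 3 DW Memory Write header. It assumes the following placement from the PCIe Base Specification: TH is bit 0 of header byte 1, ST[7:0] of a Memory Write is carried in the Tag field (byte 6), and PH[1:0] sits in the two low-order address bits. The function names are hypothetical, and the sketch ignores everything not needed to show these fields (traffic class, attributes, the 1024 DW length encoding).

```python
import struct

FMT_TYPE_MWR32 = 0x40  # Fmt=010b (3 DW header, with data), Type=00000b


def build_mwr32_tph_header(requester_id, address, length_dw, st, ph,
                           last_be=0xF, first_be=0xF):
    """Return the 12-byte header of a 32-bit Memory Write with TH=1."""
    if not 0 <= requester_id <= 0xFFFF:
        raise ValueError("requester_id must fit in 16 bits")
    if not 0 <= address <= 0xFFFFFFFF or address & 0x3:
        raise ValueError("address must be a DW-aligned 32-bit value")
    if not 1 <= length_dw <= 1023:
        raise ValueError("length_dw must be 1..1023 in this sketch")
    if not 0 <= st <= 0xFF:
        raise ValueError("ST[7:0] must fit in 8 bits")
    if not 0 <= ph <= 0x3:
        raise ValueError("PH must fit in 2 bits")

    dw0 = (FMT_TYPE_MWR32 << 24) | (1 << 16) | length_dw   # TH = byte 1, bit 0
    dw1 = (requester_id << 16) | (st << 8) | (last_be << 4) | first_be
    dw2 = address | ph                                     # PH in address[1:0]
    return struct.pack(">III", dw0, dw1, dw2)


def parse_tph_fields(header):
    """Extract TH, ST[7:0] and PH from a header built by the function above."""
    dw0, dw1, dw2 = struct.unpack(">III", header[:12])
    return {"th": (dw0 >> 16) & 1, "st": (dw1 >> 8) & 0xFF, "ph": dw2 & 0x3}


hdr = build_mwr32_tph_header(requester_id=0x0100, address=0x12345000,
                             length_dw=16, st=0x2A, ph=0b10)
print(hdr.hex(), parse_tph_fields(hdr))
```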
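TPH Requester Extended Capability registers:

| Offset | Register | Description |
|---|---|---|
| 00h | Extended Cap Header | Capability ID = 0017h |
| 04h | TPH Requester Capability | Supported ST modes, ST Table location and size |
| 08h | TPH Requester Control | Requester enable, ST mode selection |
| 0Ch | TPH ST Table | Steering Tag entries (present only when the ST Table is located in this capability structure) |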
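A sketch of how software might locate and decode these registers from a raw dump of configuration space. It assumes the standard extended capability list layout (list starts at offset 100h, capability ID in bits 15:0, next pointer in bits 31:20) and the TPH bit fields as defined in the PCIe Base Specification; the function names and the dictionary keys are hypothetical.

```python
import struct

TPH_EXT_CAP_ID = 0x0017
EXT_CAP_START = 0x100   # extended capabilities begin here in config space


def find_ext_cap(cfg, cap_id):
    """Return the offset of an extended capability, or 0 if it is absent."""
    offset = EXT_CAP_START
    visited = set()                      # guard against malformed loops
    while offset and offset not in visited and offset + 4 <= len(cfg):
        visited.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):    # empty list or absent device
            break
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next capability pointer
    return 0


def decode_tph(cfg, base):
    """Decode the Capability (04h) and Control (08h) registers."""
    (cap,) = struct.unpack_from("<I", cfg, base + 0x04)
    (ctrl,) = struct.unpack_from("<I", cfg, base + 0x08)
    return {
        "no_st_mode": bool(cap & 0x1),
        "interrupt_vector_mode": bool(cap & 0x2),
        "device_specific_mode": bool(cap & 0x4),
        "extended_tph": bool(cap & 0x100),
        # 0 = no table, 1 = in this capability, 2 = in the MSI-X table
        "st_table_location": (cap >> 9) & 0x3,
        # Field holds N-1; only meaningful when a table is present.
        "st_table_entries": ((cap >> 16) & 0x7FF) + 1,
        "st_mode_select": ctrl & 0x7,
        "requester_enable": (ctrl >> 8) & 0x3,
    }


# Synthetic config space: one unrelated capability at 100h, TPH at 148h.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, 0x0001 | (1 << 16) | (0x148 << 20))
struct.pack_into("<I", cfg, 0x148, TPH_EXT_CAP_ID | (1 << 16))
struct.pack_into("<I", cfg, 0x14C, 0x00070203)   # made-up capability value
struct.pack_into("<I", cfg, 0x150, 0x00000101)   # made-up control value
base = find_ext_cap(cfg, TPH_EXT_CAP_ID)
print(hex(base), decode_tph(cfg, base))
```

On Linux, a real dump could be read from the device's `config` file in sysfs, though reading the extended region usually requires root privileges.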
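Intel DDIO (Data Direct I/O) is a platform-level implementation of TPH-like functionality: inbound DMA data is placed directly in the LLC instead of going to memory first, without requiring hints from the device: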
Without DDIO:

    Device ──► Memory ──► L3 Cache ──► CPU
               (slow)

With DDIO:

    Device ──► L3 Cache ──► CPU
               (fast, fewer memory accesses)

NUMA-aware steering on a 2-socket server:

    ┌────────────────────────────────────────────────────────────┐
    │                      2-Socket Server                       │
    │                                                            │
    │   NUMA Node 0               │   NUMA Node 1                │
    │   ┌───────────┐             │   ┌───────────┐              │
    │   │   CPU 0   │             │   │   CPU 1   │              │
    │   │ L3 Cache  │             │   │ L3 Cache  │              │
    │   └─────┬─────┘             │   └─────┬─────┘              │
    │         │                   │         │                    │
    │   ┌─────▼─────┐             │   ┌─────▼─────┐              │
    │   │ Memory 0  │             │   │ Memory 1  │              │
    │   └───────────┘             │   └───────────┘              │
    │                             │                              │
    │   NIC with TPH:             │                              │
    │   ST=0 → CPU 0 L3           │   ST=1 → CPU 1 L3            │
    │   (for CPU 0 traffic)       │   (for CPU 1 traffic)        │
    └────────────────────────────────────────────────────────────┘
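The per-socket steering in the diagram amounts to a lookup from each receive queue to the Steering Tag of the socket that services it. The sketch below assumes the ST values from the diagram (0 and 1); the core numbering, the queue affinity map, and the function name are hypothetical. On a real system the ST values are platform-specific and would come from firmware rather than being hard-coded.

```python
def assign_queue_steering_tags(queue_to_core, core_to_node, node_to_st):
    """Map each RX queue to the ST of the NUMA node that owns its core."""
    return {
        queue: node_to_st[core_to_node[core]]
        for queue, core in queue_to_core.items()
    }


# ST values as drawn in the diagram: ST=0 -> CPU 0 L3, ST=1 -> CPU 1 L3.
node_to_st = {0: 0, 1: 1}
# Hypothetical topology: cores 0-1 on socket 0, cores 2-3 on socket 1.
core_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
# Hypothetical IRQ affinity: which core services each RX queue.
queue_to_core = {0: 0, 1: 2, 2: 1, 3: 3}

queue_st = assign_queue_steering_tags(queue_to_core, core_to_node, node_to_st)
print(queue_st)
```

If a queue's affinity changes, the mapping has to be recomputed and the device's ST Table entry for that queue reprogrammed, otherwise the data keeps landing in the remote socket's cache.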
Proper TPH configuration can reduce memory bandwidth consumption by 20-40% and improve latency for I/O-intensive workloads, although the actual gain depends on the platform, the device, and the workload.