CACHE OPTIMIZATION

TPH (TLP Processing Hints)

Cache-aware data placement hints for improved memory subsystem efficiency

1. What is TPH?

What are TLP Processing Hints?

TPH (TLP Processing Hints) is a PCIe capability that allows devices to provide hints about data placement in the processor's cache hierarchy. These hints help the memory controller make better caching decisions.

TPH Components

TPH has two parts: a 2-bit Processing Hint (PH) that describes the expected access pattern, and an optional Steering Tag (ST) that identifies a specific cache target in the system. Sections 3 and 4 cover each in turn.

2. Why TPH?

Why use Processing Hints?

Without hints, DMA data often pollutes CPU caches unnecessarily or misses cache when it should hit. TPH helps the memory subsystem make optimal placement decisions.

Cache Pollution Problem

    Without TPH:
    
    Network Card                        CPU L3 Cache
         │                              ┌─────────────────┐
         │ DMA Write (packet data)      │ Hot data (app)  │ ← evicted!
         │ ─────────────────────────────│─────────────────│
         │                              │ Packet (cold)   │ ← inserted
         │                              └─────────────────┘
    
    Result: Hot application data evicted for cold network data
    
    With TPH (Req hint):
    
    Network Card                        CPU L3 Cache
         │                              ┌─────────────────┐
         │ DMA Write + Req hint         │ Hot data (app)  │ ← preserved!
         │ ─────────────────────────────│─────────────────│
         │                              │                 │
         │                              └─────────────────┘
         │                                      ▼
         │                              Memory (bypass cache)
    
    Result: Network data bypasses cache, app data preserved
    (Req hint: the device is the primary accessor, so the host
    memory subsystem need not allocate the data in its cache)

Use Cases

Typical beneficiaries are high-rate NICs (steering receive data toward the consuming core's cache), NVMe and other storage controllers, and accelerators that share buffers with host software.

3. Processing Hint Types

    PH   Name                   Description                                    Use Case
    00b  BiDir                  Accessed by device and host roughly equally    Shared buffers
    01b  Req                    Accessed primarily by the requester (device)   Device workspace
    10b  Target                 Accessed primarily by the target (CPU)         Inbound data
    11b  Target High Priority   Target access, higher cache priority           Latency-sensitive

4. Steering Tags

ST Table Structure

    Device ST Table (in BAR space):
    
    ┌─────────────────────────────────────────────┐
    │ Entry 0: ST[15:0] │ (maps to LLC way 0)     │
    ├─────────────────────────────────────────────┤
    │ Entry 1: ST[15:0] │ (maps to LLC way 1)     │
    ├─────────────────────────────────────────────┤
    │ Entry 2: ST[15:0] │ (maps to CPU 0 cache)   │
    ├─────────────────────────────────────────────┤
    │ ...                                         │
    └─────────────────────────────────────────────┘
    
    Software configures ST Table based on system topology
    Device uses ST index in TLP header

ST in TLP Header

    TLP Header with TPH (TH bit = 1):
    
    DW0, byte 1, bit 0 .............. TH = 1  (TPH fields present)
    DW1, byte 6 (Tag field) ......... ST[7:0] (posted writes reuse the Tag byte)
    Last address DW, bits [1:0] ..... PH[1:0]
    
    ST[15:8], when needed, travels in the optional TPH Requester
    TLP Prefix; 8-bit steering tags fit in the base header alone.

5. TPH Extended Capability

    Offset  Register                    Description
    00h     Extended Cap Header         Capability ID = 0017h
    04h     TPH Requester Capability    Supported modes, ST table size
    08h     TPH Requester Control       Enable, mode selection
    0Ch     TPH ST Table                Steering Tag entries

TPH Modes

The capability advertises three ST modes: No ST Mode (hints only, steering tag sent as zero), Interrupt Vector Mode (the ST comes from the MSI-X table entry for the corresponding vector), and Device Specific Mode (the device selects ST Table entries on its own).

6. System Configuration

TPH Setup Sequence

  1. Read TPH Capability to determine support
  2. Query system topology (NUMA, cache structure)
  3. Populate ST Table with appropriate mappings
  4. Enable TPH Requester
  5. Device uses ST index in TLPs

Intel DDIO (Data Direct I/O)

Intel DDIO provides TPH-like behavior at the platform level: inbound DMA writes allocate directly into the LLC instead of landing in memory first:

    Without DDIO:
    Device ──► Memory ──► L3 Cache ──► CPU
              (slow)
    
    With DDIO:
    Device ──► L3 Cache ──► CPU
              (fast, fewer memory accesses)

7. TPH with NUMA

NUMA-Aware Data Placement

    ┌────────────────────────────────────────────────────────────┐
    │                     2-Socket Server                        │
    │                                                            │
    │   NUMA Node 0              │      NUMA Node 1              │
    │   ┌───────────┐            │      ┌───────────┐            │
    │   │   CPU 0   │            │      │   CPU 1   │            │
    │   │  L3 Cache │            │      │  L3 Cache │            │
    │   └─────┬─────┘            │      └─────┬─────┘            │
    │         │                  │            │                  │
    │   ┌─────▼─────┐            │      ┌─────▼─────┐            │
    │   │ Memory 0  │            │      │ Memory 1  │            │
    │   └───────────┘            │      └───────────┘            │
    │                            │                               │
    │   NIC with TPH:            │                               │
    │   ST=0 → CPU 0 L3          │      ST=1 → CPU 1 L3          │
    │   (for CPU 0 traffic)      │      (for CPU 1 traffic)      │
    └────────────────────────────────────────────────────────────┘

8. Best Practices

Performance Impact

Proper TPH configuration has been reported to cut memory bandwidth consumption by roughly 20-40% and to lower latency for I/O-intensive workloads; the exact gain depends on platform, cache topology, and traffic pattern.