CACHE OPTIMIZATION

TPH (TLP Processing Hints)

Cache-aware data placement hints for improved memory subsystem efficiency

1. What is TPH?

What are TLP Processing Hints?

TPH (TLP Processing Hints) is a PCIe capability that allows devices to provide hints about data placement in the processor's cache hierarchy. These hints help the memory controller make better caching decisions.

TPH Components

TPH has two parts: a 2-bit Processing Hint (PH) that describes the expected access pattern, and an optional Steering Tag (ST) that identifies a specific cache target in the system. Sections 3 and 4 cover each in turn.

2. Why TPH?

Why use Processing Hints?

Without hints, DMA data often pollutes CPU caches unnecessarily or misses cache when it should hit. TPH helps the memory subsystem make optimal placement decisions.

Cache Pollution Problem

    Without TPH:
    
    Network Card                        CPU L3 Cache
         │                              ┌─────────────────┐
         │ DMA Write (packet data)      │ Hot data (app)  │ ← evicted!
         │ ─────────────────────────────│─────────────────│
         │                              │ Packet (cold)   │ ← inserted
         │                              └─────────────────┘
    
    Result: Hot application data evicted for cold network data
    
    With TPH (Req hint):
    
    Network Card                        CPU L3 Cache
         │                              ┌─────────────────┐
         │ DMA Write + Req hint         │ Hot data (app)  │ ← preserved!
         │ ─────────────────────────────│─────────────────│
         │                              │                 │
         │                              └─────────────────┘
         │                                      ▼
         │                              Memory (bypass cache)
    
    Result: Network data bypasses cache, app data preserved
    (Req hint: the device is the primary accessor, so the host
    memory subsystem need not allocate the data in its cache)

Use Cases

Typical beneficiaries are high-rate NICs (steering receive data toward the consuming core's cache), NVMe and other storage controllers, and accelerators that share buffers with host software.

3. Processing Hint Types

    PH   Name                   Description                                    Use Case
    00b  BiDir                  Accessed by device and host roughly equally    Shared buffers
    01b  Req                    Accessed primarily by the requester (device)   Device workspace
    10b  Target                 Accessed primarily by the target (CPU)         Inbound data
    11b  Target High Priority   Target access, higher cache priority           Latency-sensitive

4. Steering Tags

ST Table Structure

    Device ST Table (in BAR space):
    
    ┌─────────────────────────────────────────────┐
    │ Entry 0: ST[15:0] │ (maps to LLC way 0)     │
    ├─────────────────────────────────────────────┤
    │ Entry 1: ST[15:0] │ (maps to LLC way 1)     │
    ├─────────────────────────────────────────────┤
    │ Entry 2: ST[15:0] │ (maps to CPU 0 cache)   │
    ├─────────────────────────────────────────────┤
    │ ...                                         │
    └─────────────────────────────────────────────┘
    
    Software configures ST Table based on system topology
    Device uses ST index in TLP header

ST in TLP Header

    TLP Header with TPH (TH bit = 1):
    
    DW0, byte 1, bit 0 .............. TH = 1  (TPH fields present)
    DW1, byte 6 (Tag field) ......... ST[7:0] (posted writes reuse the Tag byte)
    Last address DW, bits [1:0] ..... PH[1:0]
    
    ST[15:8], when needed, travels in the optional TPH Requester
    TLP Prefix; 8-bit steering tags fit in the base header alone.

5. TPH Extended Capability

    Offset  Register                    Description
    00h     Extended Cap Header         Capability ID = 0017h
    04h     TPH Requester Capability    Supported modes, ST table size
    08h     TPH Requester Control       Enable, mode selection
    0Ch     TPH ST Table                Steering Tag entries

TPH Modes

The capability advertises three ST modes: No ST Mode (hints only, steering tag sent as zero), Interrupt Vector Mode (the ST comes from the MSI-X table entry for the corresponding vector), and Device Specific Mode (the device selects ST Table entries on its own).

6. System Configuration

TPH Setup Sequence

  1. Read TPH Capability to determine support
  2. Query system topology (NUMA, cache structure)
  3. Populate ST Table with appropriate mappings
  4. Enable TPH Requester
  5. Device uses ST index in TLPs

Intel DDIO (Data Direct I/O)

Intel DDIO provides TPH-like behavior at the platform level: inbound DMA writes allocate directly into the LLC instead of landing in memory first:

    Without DDIO:
    Device ──► Memory ──► L3 Cache ──► CPU
              (slow)
    
    With DDIO:
    Device ──► L3 Cache ──► CPU
              (fast, fewer memory accesses)

7. TPH with NUMA

NUMA-Aware Data Placement

    ┌────────────────────────────────────────────────────────────┐
    │                     2-Socket Server                        │
    │                                                            │
    │   NUMA Node 0              │      NUMA Node 1              │
    │   ┌───────────┐            │      ┌───────────┐            │
    │   │   CPU 0   │            │      │   CPU 1   │            │
    │   │  L3 Cache │            │      │  L3 Cache │            │
    │   └─────┬─────┘            │      └─────┬─────┘            │
    │         │                  │            │                  │
    │   ┌─────▼─────┐            │      ┌─────▼─────┐            │
    │   │ Memory 0  │            │      │ Memory 1  │            │
    │   └───────────┘            │      └───────────┘            │
    │                            │                               │
    │   NIC with TPH:            │                               │
    │   ST=0 → CPU 0 L3          │      ST=1 → CPU 1 L3          │
    │   (for CPU 0 traffic)      │      (for CPU 1 traffic)      │
    └────────────────────────────────────────────────────────────┘

8. Best Practices

Performance Impact

Proper TPH configuration has been reported to cut memory bandwidth consumption by roughly 20-40% and to lower latency for I/O-intensive workloads; the exact gain depends on platform, cache topology, and traffic pattern.