Header compression for improved bandwidth efficiency in PCIe 6.0/7.0 Flit Mode
OHC (Optimized Header Compression) is a feature in PCIe 6.0/7.0 Flit Mode that compresses TLP headers to reduce overhead, improving effective bandwidth utilization. It removes redundant or predictable header fields.
| Header Type | Size | Mode |
|---|---|---|
| 3DW Header (32-bit addr) | 12 bytes | Legacy/Flit |
| 4DW Header (64-bit addr) | 16 bytes | Legacy/Flit |
| OHC-A (2DW) | 8 bytes | Flit Mode Only |
| OHC-B (1DW) | 4 bytes | Flit Mode Only |
| OHC-C (0.5DW) | 2 bytes | Flit Mode Only |
TLP headers consume a significant portion of link bandwidth, especially for small transfers. OHC recovers part of this overhead; the estimated gain ranges from under 5% for large sequential transfers to roughly 25% for workloads dominated by small TLPs.
64-byte payload with 16-byte header:
Overhead = 16 / (16 + 64) = 20%
64-byte payload with 4-byte OHC header:
Overhead = 4 / (4 + 64) = 5.9%
Improvement: overhead falls by ~14 percentage points (payload efficiency 80% → 94.1%, about 17.6% more payload throughput)
For a 4 KB NVMe transfer carried as 64 TLPs of 64 bytes each:
Traditional: 16B header × 64 TLPs = 1024B overhead
OHC-B: 4B header × 64 TLPs = 256B overhead
Savings: 768 bytes per 4 KB transfer (18.75% of the 4096-byte payload)
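The overhead arithmetic above can be reproduced with a short Python sketch. It counts only header and payload bytes and ignores flit framing, CRC, and FEC, which is a simplification; the function names are illustrative. It also separates two figures that are easy to conflate: the drop in overhead (in percentage points) and the relative gain in payload throughput.

```python
def header_overhead(header_bytes: int, payload_bytes: int) -> float:
    """Fraction of (header + payload) bytes spent on the header."""
    return header_bytes / (header_bytes + payload_bytes)


def relative_gain(old_header: int, new_header: int, payload_bytes: int) -> float:
    """Relative increase in payload throughput when the header shrinks."""
    old_eff = payload_bytes / (old_header + payload_bytes)
    new_eff = payload_bytes / (new_header + payload_bytes)
    return new_eff / old_eff - 1.0


# Header sizes taken from the header size table; payload fixed at 64 bytes.
for name, size in [("4DW", 16), ("OHC-A", 8), ("OHC-B", 4), ("OHC-C", 2)]:
    print(f"{name:6s} overhead: {header_overhead(size, 64):.1%}")

drop = header_overhead(16, 64) - header_overhead(4, 64)
print(f"4DW -> OHC-B: overhead drops {drop * 100:.1f} points, "
      f"throughput gain {relative_gain(16, 4, 64):.1%}")
```

For a 64-byte payload, moving from a 16-byte header to a 4-byte header lowers overhead from 20% to about 5.9% and raises payload throughput by about 17.6%.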
OHC-A, DW0:
┌─────────┬──────┬──────┬──────┬──────────┬──────────┐
│ OH Type │ TC   │ Attr │ TH   │ Len[9:0] │ Tag[9:0] │
│ (3b)    │ (3b) │ (3b) │ (1b) │ (10b)    │ (10b)    │
└─────────┴──────┴──────┴──────┴──────────┴──────────┘
OHC-A, DW1:
┌────────────────────┬────────────┬──────┐
│ Requester ID (16b) │ Tag[13:10] │ Rsvd │
│                    │ (4b)       │      │
└────────────────────┴────────────┴──────┘
Usage: when the full Requester ID and a 14-bit Tag are needed
Supports: Memory Read/Write, Completions
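A minimal sketch of packing and unpacking the two-DW format drawn above. The field order and widths come from the diagram; the exact bit positions are an assumption (fields packed MSB-first, with any bits the diagram does not account for left as zero in the low end of each DW). All function and field names are hypothetical.

```python
# Field order and widths follow the diagram; bit positions are assumed.
OHC_A_DW0 = [("oh_type", 3), ("tc", 3), ("attr", 3), ("th", 1),
             ("length", 10), ("tag_lo", 10)]
OHC_A_DW1 = [("requester_id", 16), ("tag_hi", 4)]


def pack_dword(layout, values):
    """Pack fields MSB-first into a 32-bit word; unused low bits stay zero."""
    word, used = 0, 0
    for name, width in layout:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
        used += width
    return word << (32 - used)


def unpack_dword(layout, word):
    """Inverse of pack_dword for the same layout."""
    fields, shift = {}, 32
    for name, width in layout:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def pack_ohc_a(oh_type, tc, attr, th, length, tag, requester_id):
    """Return (DW0, DW1); the 14-bit tag is split into Tag[9:0] and Tag[13:10]."""
    if not 0 <= tag < (1 << 14):
        raise ValueError("tag must fit in 14 bits")
    dw0 = pack_dword(OHC_A_DW0, dict(oh_type=oh_type, tc=tc, attr=attr, th=th,
                                     length=length, tag_lo=tag & 0x3FF))
    dw1 = pack_dword(OHC_A_DW1, dict(requester_id=requester_id, tag_hi=tag >> 10))
    return dw0, dw1


def unpack_ohc_a(dw0, dw1):
    """Recover the fields and reassemble the 14-bit tag."""
    fields = unpack_dword(OHC_A_DW0, dw0)
    fields.update(unpack_dword(OHC_A_DW1, dw1))
    fields["tag"] = (fields.pop("tag_hi") << 10) | fields.pop("tag_lo")
    return fields


dw0, dw1 = pack_ohc_a(oh_type=0b001, tc=0, attr=0, th=0,
                      length=16, tag=0x2ABC, requester_id=0x0100)
print(f"DW0=0x{dw0:08X} DW1=0x{dw1:08X}")
print(unpack_ohc_a(dw0, dw1))
```

Splitting the tag across the two DWs keeps the low 10 bits in DW0 alongside the other routing fields, matching the Tag[9:0] / Tag[13:10] split in the diagram.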
OHC-B, DW0:
┌─────────┬──────┬──────┬──────────┬───────────┐
│ OH Type │ TC   │ TH   │ Len[9:0] │ Tag[13:0] │
│ (3b)    │ (3b) │ (1b) │ (10b)    │ (14b)     │
└─────────┴──────┴──────┴──────────┴───────────┘
Usage: When Requester ID can be derived from context
Supports: Completions (implied Completer ID)
OHC-C (0.5 DW):
┌─────────┬──────────┬──────────┐
│ OH Type │ Len[3:0] │ Tag[9:0] │
│ (3b)    │ (4b)     │ (10b)    │
└─────────┴──────────┴──────────┘
Usage: Very short completions (≤ 64 bytes)
Maximum payload: 16 DW (64 bytes)
Supports: CplD with constrained parameters
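As a quick consistency check, the sketch below sums the field widths listed in the three format diagrams and compares them with the header sizes given in the size table (2 DW, 1 DW, 0.5 DW). The dictionary and function names are illustrative; the widths are copied from the diagrams, and the Rsvd field of OHC-A DW1 is left out because its width is not stated.

```python
# name: (bits available per the size table, {field: width per the diagram})
FORMATS = {
    "OHC-A DW0": (32, {"OH Type": 3, "TC": 3, "Attr": 3, "TH": 1,
                       "Len[9:0]": 10, "Tag[9:0]": 10}),
    "OHC-A DW1": (32, {"Requester ID": 16, "Tag[13:10]": 4}),
    "OHC-B DW0": (32, {"OH Type": 3, "TC": 3, "TH": 1,
                       "Len[9:0]": 10, "Tag[13:0]": 14}),
    "OHC-C": (16, {"OH Type": 3, "Len[3:0]": 4, "Tag[9:0]": 10}),
}


def bit_budget(formats=FORMATS):
    """Map each format to (bits_listed, bits_available, spare_bits)."""
    report = {}
    for name, (available, fields) in formats.items():
        listed = sum(fields.values())
        report[name] = (listed, available, available - listed)
    return report


for name, (listed, available, spare) in bit_budget().items():
    print(f"{name:10s} {listed:2d}/{available:2d} bits listed, spare {spare:+d}")
```

With the widths as drawn, OHC-A DW0 and OHC-B leave 2 bits and 1 bit unassigned, the 12 spare bits in OHC-A DW1 correspond to its Rsvd field, and the OHC-C fields total 17 bits, one more than the 2 bytes listed for that format. An actual encoding would therefore have to differ from the OHC-C diagram in at least one field width.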
| OH Type | Value | Description |
|---|---|---|
| OHC-A1 | 000 | Memory Read, 64-bit address |
| OHC-A2 | 001 | Memory Write, 64-bit address |
| OHC-A3 | 010 | Completion with Data |
| OHC-B | 100 | Reduced Completion |
| OHC-C | 110 | Minimal Completion |
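The type table maps naturally onto an enumeration. The sketch below is one way a receiver might classify a header from its OH Type field; it assumes the 3-bit field occupies the top bits of the first header byte, which the diagrams suggest but do not state explicitly. The names `OHType`, `HEADER_BYTES`, and `decode_oh_type` are hypothetical.

```python
from enum import IntEnum


class OHType(IntEnum):
    OHC_A1 = 0b000  # Memory Read, 64-bit address
    OHC_A2 = 0b001  # Memory Write, 64-bit address
    OHC_A3 = 0b010  # Completion with Data
    OHC_B = 0b100   # Reduced Completion
    OHC_C = 0b110   # Minimal Completion


# Header sizes in bytes, from the header size table.
HEADER_BYTES = {
    OHType.OHC_A1: 8, OHType.OHC_A2: 8, OHType.OHC_A3: 8,
    OHType.OHC_B: 4, OHType.OHC_C: 2,
}


def decode_oh_type(first_byte: int) -> OHType:
    """Read OH Type from bits [7:5] of the first header byte (assumed position)."""
    code = (first_byte >> 5) & 0b111
    try:
        return OHType(code)
    except ValueError:
        raise ValueError(f"unassigned OH Type {code:03b}") from None


kind = decode_oh_type(0b1000_0000)
print(kind.name, HEADER_BYTES[kind], "bytes")
```

Values 011, 101, and 111 are not assigned in the table, so the sketch rejects them rather than guessing a format.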
| TLP Type | OHC-A | OHC-B | OHC-C |
|---|---|---|---|
| Memory Read (MRd) | Yes | No | No |
| Memory Write (MWr) | Yes | No | No |
| Completion (Cpl) | Yes | No | No |
| Completion w/Data (CplD) | Yes | Yes | Yes |
| Config/IO | No | No | No |
| Messages | No | No | No |
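The support matrix can drive a simple format selector. The following sketch picks the smallest header a TLP qualifies for, combining the matrix with the constraints noted for each format: OHC-C limited to 64-byte payloads and a 10-bit tag, and OHC-B usable only when the Requester ID can be derived from context. Requiring an implied Requester ID for OHC-C as well is an assumption (its layout has no Requester ID field), and every name here is illustrative.

```python
# Which compressed formats each TLP type may use, per the support matrix.
SUPPORT = {
    "MRd": ("OHC-A",),
    "MWr": ("OHC-A",),
    "Cpl": ("OHC-A",),
    "CplD": ("OHC-A", "OHC-B", "OHC-C"),
    "Cfg": (),
    "IO": (),
    "Msg": (),
}
HEADER_BYTES = {"OHC-A": 8, "OHC-B": 4, "OHC-C": 2}


def smallest_header(tlp_type, payload_bytes=0, requester_id_implied=False, tag=0):
    """Return the smallest eligible OHC format, or None for an uncompressed header."""
    eligible = []
    for fmt in SUPPORT.get(tlp_type, ()):
        if fmt == "OHC-B" and not requester_id_implied:
            continue
        if fmt == "OHC-C" and not (
            requester_id_implied and payload_bytes <= 64 and tag < (1 << 10)
        ):
            continue
        eligible.append(fmt)
    return min(eligible, key=HEADER_BYTES.get) if eligible else None


print(smallest_header("CplD", payload_bytes=64, requester_id_implied=True, tag=5))
print(smallest_header("CplD", payload_bytes=128, requester_id_implied=True, tag=5))
print(smallest_header("Cfg"))
```

Returning `None` for Config, IO, and Message TLPs reflects the matrix: those types always fall back to a standard 3DW or 4DW header.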
OHC can compress addresses by encoding only the difference from a base address, when the address falls within a predictable range.
Base Address (maintained in context): 0x0000_0001_0000_0000
TLP 1: Address 0x0000_0001_0000_0100
Delta = 0x100 (fits in 12 bits)
Send: 12-bit delta instead of 64-bit address
TLP 2: Address 0x0000_0001_0000_0200
Delta = 0x200 (fits in 12 bits)
Send: 12-bit delta
Savings: 52 bits per TLP with sequential access patterns
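The delta scheme in this example can be sketched as an encode/decode pair. This assumes an unsigned 12-bit delta measured from a fixed base that both link partners already hold, with a fallback to the full 64-bit address when the delta does not fit; how the base is established or updated is not covered here, and the function names are hypothetical.

```python
DELTA_BITS = 12  # delta width used in the example above


def encode_address(address: int, base: int):
    """Return ("delta", d) if address - base fits in DELTA_BITS, else ("full", address)."""
    delta = address - base
    if 0 <= delta < (1 << DELTA_BITS):
        return ("delta", delta)
    return ("full", address)


def decode_address(kind: str, value: int, base: int) -> int:
    """Rebuild the 64-bit address from either encoding."""
    return base + value if kind == "delta" else value


BASE = 0x0000_0001_0000_0000
for addr in (0x0000_0001_0000_0100, 0x0000_0001_0000_0200, 0x0000_0001_0000_1000):
    kind, value = encode_address(addr, BASE)
    assert decode_address(kind, value, BASE) == addr  # round trip
    print(f"0x{addr:016X} -> {kind} 0x{value:X}")
```

The third address is 0x1000 above the base, which needs 13 bits, so it is sent in full; only accesses within 4 KB above the base benefit in this sketch.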
| Workload | Avg TLP Size | OHC Gain |
|---|---|---|
| NVMe 4KB Random Read | 64-256 B | 15-20% |
| GPU Texture Fetch | 64-128 B | 18-25% |
| Network (100GbE) | 64-1500 B | 5-15% |
| Large Sequential | 4096 B | < 5% |
OHC provides the greatest benefit for workloads with many small TLPs (NVMe, GPU), where header overhead is proportionally larger. Large sequential transfers already have high efficiency.