Non-blocking memory reads that hide the latency of slow memory
Deferrable Memory Read (DfMRd) is a PCIe 6.0+ transaction type that allows a completer to acknowledge a read request immediately and deliver the actual data later. This enables non-blocking access to high-latency memory.
Standard Memory Read:
Requester Completer
│ │
│ ─── MRd (blocking) ──────────────►│
│ │ Wait for data
│ (Requester blocked) │ (high latency)
│ │
│ ◄── CplD (data) ──────────────────│
│ │
Total latency experienced by requester
Deferrable Memory Read:
Requester Completer
│ │
│ ─── DfMRd ───────────────────────►│
│ │
│ ◄── DfCpl (Deferred) ─────────────│ Immediate!
│ │
│ (Requester continues other work) │ Fetch data
│ │ (high latency)
│ ◄── DfCplD (actual data) ─────────│
│ │
Requester not blocked during data fetch
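The difference between the two flows can be modeled in software as a background task that the requester only waits on when it needs the data. The following is a minimal Python sketch using `asyncio`, not a model of real PCIe hardware. The names `completer_fetch` and `requester` and the 10 ms delay are hypothetical, chosen only for illustration.

```python
import asyncio


async def completer_fetch(delay_s: float, data: bytes) -> bytes:
    """Stand-in for the completer's slow media access (hypothetical)."""
    await asyncio.sleep(delay_s)
    return data


async def requester(log: list) -> bytes:
    # Issue the read; create_task returns at once, like receiving DfCpl.
    pending = asyncio.create_task(completer_fetch(0.01, b"\x2a"))
    log.append("DfCpl: request accepted")

    # The requester is free to do unrelated work while the fetch runs.
    log.append("other work done")

    # Only now does the requester wait for the data (the DfCplD step).
    data = await pending
    log.append("DfCplD: data received")
    return data


if __name__ == "__main__":
    events = []
    print(asyncio.run(requester(events)), events)
```

In the blocking case the `await` would sit directly after the request, so no other work could be logged in between.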
Emerging memory technologies (CXL memory, persistent memory, far memory) have higher latency than local DRAM. Deferrable reads let the requester overlap that latency with other work instead of stalling on each access.
| Memory Type | Typical Latency |
|---|---|
| Local DDR5 DRAM | ~80-100 ns |
| CXL Memory (Type 2/3) | ~150-300 ns |
| Persistent Memory | ~200-400 ns |
| Remote NUMA Node | ~150-200 ns |
| Far Memory Pool | ~500-1000+ ns |
Step 1: DfMRd Request
┌───────────┐ ┌───────────────┐
│ Requester │ ─── DfMRd ───────────────►│ Completer │
│ │ Tag=5, Addr=X │ (CXL Device) │
└───────────┘ └───────────────┘
Step 2: Deferred Completion (immediate)
┌───────────┐ ┌───────────────┐
│ Requester │ ◄── DfCpl ────────────────│ Completer │
│ │ Tag=5, Status=Deferred│ │
└───────────┘ └───────────────┘
Requester continues other operations...
Completer fetches data from memory...
Step 3: Data Completion (later)
┌───────────┐ ┌───────────────┐
│ Requester │ ◄── DfCplD ───────────────│ Completer │
│ │ Tag=5, Data │ │
└───────────┘ └───────────────┘
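The three steps above can be read as a small per-tag state machine on the requester side. The sketch below is one possible way to track it in Python. The `Requester` class, `TagState` enum, and method names are assumptions made for illustration and are not taken from any specification or driver.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class TagState(Enum):
    REQUESTED = auto()  # Step 1 done: DfMRd sent
    DEFERRED = auto()   # Step 2 done: DfCpl received, data pending


@dataclass
class Requester:
    # tag -> (state, address); illustrative bookkeeping only
    outstanding: dict = field(default_factory=dict)
    # address -> data delivered by DfCplD
    results: dict = field(default_factory=dict)

    def send_dfmrd(self, tag: int, addr: int) -> None:
        if tag in self.outstanding:
            raise ValueError(f"tag {tag} is already in flight")
        self.outstanding[tag] = (TagState.REQUESTED, addr)

    def on_dfcpl(self, tag: int) -> None:
        state, addr = self.outstanding[tag]
        if state is not TagState.REQUESTED:
            raise RuntimeError(f"unexpected DfCpl for tag {tag}")
        self.outstanding[tag] = (TagState.DEFERRED, addr)

    def on_dfcpld(self, tag: int, data: bytes) -> None:
        # Step 3: the data completion retires the tag.
        _state, addr = self.outstanding.pop(tag)
        self.results[addr] = data
```

For the example in the diagrams, a caller would invoke `send_dfmrd(5, X)`, then `on_dfcpl(5)`, and finally `on_dfcpld(5, data)`, after which tag 5 is no longer outstanding.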
| TLP | Direction | Description |
|---|---|---|
| DfMRd | Requester → Completer | Deferrable Memory Read request |
| DfCpl | Completer → Requester | Deferred completion (no data yet) |
| DfCplD | Completer → Requester | Deferred completion with data |
| Status | Meaning | Action |
|---|---|---|
| Immediate CplD | Data available immediately | Normal completion |
| Deferred (DfCpl) | Request accepted, data later | Wait for DfCplD |
| UR | Unsupported Request | Use standard MRd |
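The status table maps directly onto a dispatch step in requester logic. The following Python sketch restates that mapping. The `Response` and `Action` enums and the `next_action` function are hypothetical names introduced here, not part of any real API.

```python
from enum import Enum


class Response(Enum):
    CPLD = "CplD"    # data returned immediately
    DFCPL = "DfCpl"  # request accepted, data later
    UR = "UR"        # Unsupported Request


class Action(Enum):
    CONSUME_DATA = "normal completion"
    WAIT_FOR_DFCPLD = "wait for DfCplD"
    FALL_BACK_TO_MRD = "use standard MRd"


# One entry per row of the status table.
_ACTIONS = {
    Response.CPLD: Action.CONSUME_DATA,
    Response.DFCPL: Action.WAIT_FOR_DFCPLD,
    Response.UR: Action.FALL_BACK_TO_MRD,
}


def next_action(response: Response) -> Action:
    """Return what the requester should do after the first response."""
    return _ACTIONS[response]
```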
The same Tag is used for the initial request and all related completions.
DfMRd uses Non-Posted credits (like standard MRd)
DfCpl uses Completion credits (1 credit, no data)
DfCplD uses Completion credits (based on data size)
Credit flow:
1. DfMRd sent:
Requester: -1 NP credit
2. DfCpl received:
Requester: holds tag (waiting for data)
3. DfCplD received:
Requester: releases tag
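The credit flow above amounts to two counters on the requester: Non-Posted credits and the pool of tags. The Python sketch below models only what the three steps state. The `RequesterResources` class is a hypothetical name, and the return of Non-Posted credits is deliberately left out because the steps above do not describe it.

```python
class RequesterResources:
    """Illustrative accounting of NP credits and tags (not a spec model)."""

    def __init__(self, np_credits: int, tags: int) -> None:
        self.np_credits = np_credits
        self.free_tags = set(range(tags))
        self.held_tags = set()

    def send_dfmrd(self) -> int:
        # Step 1: sending DfMRd costs one Non-Posted credit and one tag.
        if self.np_credits == 0:
            raise RuntimeError("no Non-Posted credits available")
        if not self.free_tags:
            raise RuntimeError("no free tags")
        self.np_credits -= 1
        tag = min(self.free_tags)
        self.free_tags.remove(tag)
        self.held_tags.add(tag)
        return tag

    def on_dfcpl(self, tag: int) -> None:
        # Step 2: the deferred completion changes nothing; the tag stays held.
        if tag not in self.held_tags:
            raise KeyError(f"DfCpl for tag {tag} that is not held")

    def on_dfcpld(self, tag: int) -> None:
        # Step 3: the data completion releases the tag for reuse.
        self.held_tags.remove(tag)
        self.free_tags.add(tag)
```

One consequence visible in the sketch is that a tag stays occupied for the whole deferral period, so the number of tags bounds how many deferred reads can be in flight at once.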
Deferred completions may arrive out of order relative to other transactions. The requester must therefore match each completion to its outstanding request by Tag rather than by arrival order.
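Matching by tag keeps results correct regardless of arrival order. The following is a minimal Python sketch of that idea. The function `match_completions` and its tuple-based inputs are assumptions made for this example.

```python
def match_completions(requests, completions):
    """Pair completions with requests by tag.

    requests:    list of (tag, addr) in issue order
    completions: list of (tag, data) in arrival order
    Returns a dict mapping addr -> data.
    """
    addr_by_tag = dict(requests)
    data_by_addr = {}
    for tag, data in completions:
        if tag not in addr_by_tag:
            raise KeyError(f"completion for unknown tag {tag}")
        # Retire the tag so a duplicate completion is detected.
        data_by_addr[addr_by_tag.pop(tag)] = data
    return data_by_addr
```

For instance, requests issued with tags 5, 6, 7 whose completions arrive as 7, 5, 6 still end up with each address paired with its own data.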
┌─────────────────────────────────────────────────────────────┐
│ Host CPU │
│ ┌────────────┐ │
│ │ Cache │ │
│ └─────┬──────┘ │
│ │ │
│ ┌─────▼──────┐ │
│ │ Memory │ │
│ │ Controller │ │
│ └─────┬──────┘ │
└──────────────────────────┼──────────────────────────────────┘
│ PCIe/CXL
│
┌──────────────────────────▼──────────────────────────────────┐
│ CXL Memory Expander │
│ ┌───────────┐ ┌───────────┐ ┌───────────────────┐ │
│ │ CXL.io │ │ CXL.cache │ │ CXL.mem │ │
│ │ (Config) │ │ (Future) │ │ (DfMRd target) │ │
│ └───────────┘ └───────────┘ └─────────┬─────────┘ │
│ │ │
│ ┌─────────▼─────────┐ │
│ │ DDR5/HBM/etc │ │
│ │ (Memory Media) │ │
│ └───────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Without DfMRd (blocking):
|─ MRd ─|─────────── wait ───────────|─ CplD ─|─ next op ─|
Total time = request + wait + next_op
With DfMRd (non-blocking):
|─DfMRd─|─DfCpl─|─── other work ───|─DfCplD─|
|────────────────────|
Parallel with memory access
Effective time = max(other_work, memory_access)
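The two formulas above can be compared with concrete numbers. The Python sketch below implements them as written. The function names and the sample values are hypothetical: 10 ns of request overhead, a 500 ns memory access (the low end of the far-memory range quoted earlier), and 300 ns of independent work.

```python
def blocking_total_ns(request_ns: float, wait_ns: float, next_op_ns: float) -> float:
    """Total time = request + wait + next_op (blocking MRd)."""
    return request_ns + wait_ns + next_op_ns


def deferred_effective_ns(other_work_ns: float, memory_access_ns: float) -> float:
    """Effective time = max(other_work, memory_access) (DfMRd)."""
    return max(other_work_ns, memory_access_ns)


if __name__ == "__main__":
    blocking = blocking_total_ns(request_ns=10, wait_ns=500, next_op_ns=300)
    deferred = deferred_effective_ns(other_work_ns=300, memory_access_ns=500)
    print(f"blocking: {blocking} ns, deferred: {deferred} ns")
```

With these inputs the blocking path sums to 810 ns while the deferred path is bounded by the 500 ns memory access, because the 300 ns of other work fits entirely inside it. The benefit shrinks when there is little independent work to overlap.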
Devices that support deferrable memory reads advertise the feature through an extended capability: