Hardware-guaranteed atomic read-modify-write operations across PCIe fabric
AtomicOps are PCIe transactions, carried as Transaction Layer Packets (TLPs), that perform a read-modify-write sequence atomically at a target memory location. Each operation completes as an indivisible unit, so concurrent accesses to the same location cannot observe or corrupt an intermediate state.
| Operation | Fmt / Type (Fmt[2:0] Type[4:0]) | Operand Size | Description |
|---|---|---|---|
| FetchAdd | 01x 01100 | 32/64 bit | Add operand to the value at the target; return the original value |
| Swap | 01x 01101 | 32/64 bit | Write operand to the target; return the original value |
| CAS (Compare & Swap) | 01x 01110 | 32/64/128 bit | Compare the target value to operand 1; if equal, write operand 2; return the original value |
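Hardware atomics guarantee atomicity without software locks. A single AtomicOp replaces the separate read, modify, and write transactions that a lock-protected update needs, which reduces latency. Because no lock is held across the operation, the operation itself cannot deadlock. This enables efficient synchronization between CPUs and devices.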
| Aspect | Software Lock | PCIe Atomic |
|---|---|---|
| Latency | Multiple round trips | One request/completion round trip |
| Contention | Spin/wait required | Hardware arbitration |
| Deadlock | Possible | Not possible for a single AtomicOp |
| Cross-device | Complex | Native support |
Requester                               Completer (Memory)
│                                       │
│  AtomicOp Request (FetchAdd)          │
│  [Addr, Operand=5]                    │
│ ─────────────────────────────────────►│
│                                       │
│                                       │  Read M[Addr] = 10
│                                       │  Write M[Addr] = 10 + 5 = 15
│                                       │
│  Completion with Data                 │
│  [Original value = 10]                │
│ ◄─────────────────────────────────────│
│                                       │
Result: Memory updated to 15, Requester gets original 10
Requester                               Completer (Memory)
│                                       │
│  AtomicOp Request (CAS)               │
│  [Addr, Compare=10, Swap=20]          │
│ ─────────────────────────────────────►│
│                                       │
│                                       │  Read M[Addr] = 10
│                                       │  Compare: 10 == 10? YES
│                                       │  Write M[Addr] = 20
│                                       │
│  Completion with Data                 │
│  [Original value = 10]                │
│ ◄─────────────────────────────────────│
│                                       │
Success: Memory was 10, now 20
--- If memory value doesn't match: ---
│                                       │  Read M[Addr] = 15
│                                       │  Compare: 15 == 10? NO
│                                       │  No write performed
│                                       │
│  Completion with Data                 │
│  [Original value = 15]                │
│ ◄─────────────────────────────────────│
Failure: Memory unchanged, Requester sees actual value (15)
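The two exchanges above can be restated as a small software model of what the Completer does. This is only a sketch: the `model_*` and `cas_succeeded` names are hypothetical, the functions are plain C with no atomicity of their own, and real AtomicOps are executed by hardware as one indivisible step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of the Completer's behavior (illustrative names only). */

/* FetchAdd: memory becomes original + operand; the original is returned. */
static uint32_t model_fetchadd(uint32_t *mem, uint32_t operand) {
    uint32_t original = *mem;
    *mem = original + operand;   /* unsigned add wraps modulo 2^32 */
    return original;
}

/* Swap: memory becomes operand; the original is returned. */
static uint32_t model_swap(uint32_t *mem, uint32_t operand) {
    uint32_t original = *mem;
    *mem = operand;
    return original;
}

/* CAS: write swap_val only if memory equals compare_val.
 * The original value is returned whether or not the write happened. */
static uint32_t model_cas(uint32_t *mem, uint32_t compare_val,
                          uint32_t swap_val) {
    uint32_t original = *mem;
    if (original == compare_val) {
        *mem = swap_val;
    }
    return original;
}

/* The Requester infers CAS success by comparing the returned value
 * with the compare value it sent. */
static bool cas_succeeded(uint32_t returned, uint32_t compare_val) {
    return returned == compare_val;
}
```

With memory at 10, `model_fetchadd(&m, 5)` returns 10 and leaves 15, matching the FetchAdd diagram; `model_cas(&m, 10, 20)` on a value of 15 returns 15 and leaves memory unchanged, matching the failure case.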
Byte 0-3 (DW0):
┌──────────────────────────────────────────────────────────────┐
│ Fmt │ Type  │ TC │ Attr │ TH │ TD │ EP │ Attr │ AT │ Len │
│ 01x │ 011xx │    │      │    │    │    │      │    │     │
└──────────────────────────────────────────────────────────────┘
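Fmt/Type Encoding (Fmt[2:0] Type[4:0]):
FetchAdd, 3 DW header (32-bit address): 010 01100
FetchAdd, 4 DW header (64-bit address): 011 01100
Swap, 3 DW header (32-bit address): 010 01101
Swap, 4 DW header (64-bit address): 011 01101
CAS, 3 DW header (32-bit address): 010 01110
CAS, 4 DW header (64-bit address): 011 01110
The Fmt field selects the address width, not the operand size; the operand size (32, 64, or 128 bits) follows from the Length field.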
Byte 4-7 (DW1):
┌──────────────────────────────────────────────────────────────┐
│ Requester ID (16) │ Tag (8) │ Last BE │ First BE │
└──────────────────────────────────────────────────────────────┘
Byte 8-11 (DW2):
┌──────────────────────────────────────────────────────────────┐
│ Address[31:2] │ Rsvd │
└──────────────────────────────────────────────────────────────┘
Data Payload:
- FetchAdd/Swap: 1 or 2 DW (operand)
- CAS: 2, 4, or 8 DW (compare + swap values)
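The payload sizes listed above follow directly from the operand width: one operand for FetchAdd and Swap, two (compare and swap) for CAS. A minimal sketch of that arithmetic, using a hypothetical `atomic_op_t` enum and helper names that are not part of any real driver API:

```c
/* Hypothetical operation selector for illustration. */
typedef enum { ATOMIC_FETCHADD, ATOMIC_SWAP, ATOMIC_CAS } atomic_op_t;

/* Request payload length in DW (1 DW = 32 bits).
 * Returns 0 for an operand size the operation does not allow. */
static unsigned atomicop_request_dw(atomic_op_t op, unsigned operand_bits) {
    switch (op) {
    case ATOMIC_FETCHADD:
    case ATOMIC_SWAP:
        if (operand_bits == 32 || operand_bits == 64) {
            return operand_bits / 32;        /* a single operand */
        }
        return 0;
    case ATOMIC_CAS:
        if (operand_bits == 32 || operand_bits == 64 || operand_bits == 128) {
            return 2 * (operand_bits / 32);  /* compare value + swap value */
        }
        return 0;
    }
    return 0;
}

/* Completion payload length in DW: the original value, one operand wide. */
static unsigned atomicop_completion_dw(atomic_op_t op, unsigned operand_bits) {
    if (atomicop_request_dw(op, operand_bits) == 0) {
        return 0;
    }
    return operand_bits / 32;
}
```

For example, a 128-bit CAS carries 8 DW in the request but only 4 DW in the completion, because the completion returns just the original value.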
AtomicOps always return a Completion with Data (CplD) containing the original value at the target address before the operation was performed.
Device Capabilities 2 register:

| Bit | Field | Description |
|---|---|---|
| 6 | AtomicOp Routing Supported | Switch or Root Port can route AtomicOps |
| 7 | 32-bit AtomicOp Completer Supported | Completer supports 32-bit FetchAdd, Swap, and CAS |
| 8 | 64-bit AtomicOp Completer Supported | Completer supports 64-bit FetchAdd, Swap, and CAS |
| 9 | 128-bit CAS Completer Supported | Completer supports 128-bit CAS |

Device Control 2 register:

| Bit | Field | Description |
|---|---|---|
| 6 | AtomicOp Requester Enable | Allow this function to issue AtomicOps |
| 7 | AtomicOp Egress Blocking | Block AtomicOps from egressing this port |
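Software typically checks the capability bits before enabling a requester. The sketch below only decodes register values, assuming the first table describes the 32-bit Device Capabilities 2 register and the second the 16-bit Device Control 2 register. The macro and function names are hypothetical, and reading or writing configuration space is platform-specific and left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the tables above; names are illustrative. */
#define DEVCAP2_ATOMIC_ROUTING     (1u << 6)
#define DEVCAP2_ATOMIC_COMP32      (1u << 7)
#define DEVCAP2_ATOMIC_COMP64      (1u << 8)
#define DEVCAP2_ATOMIC_CAS128      (1u << 9)
#define DEVCTL2_ATOMIC_REQ_ENABLE  (1u << 6)
#define DEVCTL2_ATOMIC_EGRESS_BLK  (1u << 7)

/* True if a switch or root port advertises AtomicOp routing. */
static bool routes_atomics(uint32_t devcap2) {
    return (devcap2 & DEVCAP2_ATOMIC_ROUTING) != 0;
}

/* True if the completer advertises support for the given operand width. */
static bool completer_supports(uint32_t devcap2, unsigned operand_bits) {
    switch (operand_bits) {
    case 32:  return (devcap2 & DEVCAP2_ATOMIC_COMP32) != 0;
    case 64:  return (devcap2 & DEVCAP2_ATOMIC_COMP64) != 0;
    case 128: return (devcap2 & DEVCAP2_ATOMIC_CAS128) != 0; /* CAS only */
    default:  return false;
    }
}

/* Returns the Device Control 2 value with AtomicOp Requester Enable set. */
static uint16_t enable_requester(uint16_t devctl2) {
    return (uint16_t)(devctl2 | DEVCTL2_ATOMIC_REQ_ENABLE);
}

/* True if the port is configured to block outgoing AtomicOps. */
static bool egress_blocked(uint16_t devctl2) {
    return (devctl2 & DEVCTL2_ATOMIC_EGRESS_BLK) != 0;
}
```

A capability value of `0x1C0` has bits 6, 7, and 8 set, so it reports routing plus 32-bit and 64-bit completer support, but not 128-bit CAS.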
#include <stdint.h>

// Lock using Swap atomic
// pcie_atomic_swap() is an assumed platform helper (not defined here) that
// issues an AtomicOp Swap and returns the original value.
void acquire_lock(volatile uint32_t *lock_addr) {
    while (1) {
        // AtomicOp Swap: write 1, get original
        uint32_t old = pcie_atomic_swap(lock_addr, 1);
        if (old == 0) {
            return; // Lock acquired
        }
        // Lock was already held: spin and retry
    }
}

void release_lock(volatile uint32_t *lock_addr) {
    *lock_addr = 0; // Simple write to release
}
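The lock pattern can be exercised on a host without PCIe hardware by substituting a C11 atomic for the AtomicOp. In this sketch `atomic_exchange_explicit` stands in for `pcie_atomic_swap()`, since both return the original value; the `swap_standin`, `try_acquire`, `acquire`, and `release` names are illustrative, and the memory orderings are a choice made for the stand-in, not something the source specifies.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an AtomicOp Swap: stores value, returns the original. */
static uint32_t swap_standin(_Atomic uint32_t *addr, uint32_t value) {
    return atomic_exchange_explicit(addr, value, memory_order_acquire);
}

/* One attempt: the lock is ours only if the original value was 0. */
static bool try_acquire(_Atomic uint32_t *lock) {
    return swap_standin(lock, 1) == 0;
}

/* Spin until an attempt succeeds. */
static void acquire(_Atomic uint32_t *lock) {
    while (!try_acquire(lock)) {
        /* spin */
    }
}

/* Release with a store that orders earlier writes before the unlock. */
static void release(_Atomic uint32_t *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

A second `try_acquire` on a held lock returns false because the Swap returns 1; it also rewrites 1 into the lock word, which is harmless because the value is unchanged.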
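#include <stdbool.h>
#include <stdint.h>

// Increment reference count atomically; returns the count before the increment
uint32_t ref_inc(volatile uint32_t *ref_count) {
    return pcie_atomic_fetchadd(ref_count, 1);
}

// Decrement and check if zero
bool ref_dec_and_test(volatile uint32_t *ref_count) {
    // Adding UINT32_MAX wraps modulo 2^32, which is the same as subtracting 1
    uint32_t old = pcie_atomic_fetchadd(ref_count, UINT32_MAX);
    return (old == 1); // Was 1, now 0
}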
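FetchAdd has no subtract form, so the decrement relies on unsigned wraparound. The following sketch checks that behavior on a host, using C11 `atomic_fetch_add` as a stand-in for `pcie_atomic_fetchadd()`; the `fetchadd_standin`, `ref_get`, and `ref_put` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an AtomicOp FetchAdd: adds operand, returns the original. */
static uint32_t fetchadd_standin(_Atomic uint32_t *addr, uint32_t operand) {
    return atomic_fetch_add(addr, operand);
}

/* Take a reference; returns the count before the increment. */
static uint32_t ref_get(_Atomic uint32_t *rc) {
    return fetchadd_standin(rc, 1);
}

/* Drop a reference; returns true when this call released the last one. */
static bool ref_put(_Atomic uint32_t *rc) {
    /* 0xFFFFFFFF + x wraps to x - 1 in 32-bit unsigned arithmetic. */
    return fetchadd_standin(rc, UINT32_MAX) == 1;
}
```

Starting from a count of 1, one `ref_get` followed by two `ref_put` calls brings the count back to 0, and only the second `ref_put` reports that the last reference was dropped.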
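#include <stdbool.h>
#include <stddef.h>

// Push to lock-free queue using CAS
// Assumes pcie_atomic_cas() returns true when the swap was performed and
// alloc_node() returns a node whose next pointer is NULL (NULL on failure).
bool queue_push(queue_t *q, void *item) {
    node_t *new_node = alloc_node(item);
    if (new_node == NULL) {
        return false; // Allocation failed
    }
    while (1) {
        node_t *tail = q->tail;
        node_t *next = tail->next;
        if (tail != q->tail) {
            continue; // Tail moved while we were reading; retry
        }
        if (next == NULL) {
            // Try to link new node
            if (pcie_atomic_cas(&tail->next, NULL, new_node)) {
                // Success, try to update tail (another thread may do it for us)
                pcie_atomic_cas(&q->tail, tail, new_node);
                return true;
            }
        } else {
            // Tail not pointing to last, help advance
            pcie_atomic_cas(&q->tail, tail, next);
        }
    }
}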
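The push routine follows the shape of a Michael-Scott style enqueue. A host-side sketch of the same control flow is shown here, with C11 `atomic_compare_exchange_strong` standing in for `pcie_atomic_cas()`. The `node_t` and `queue_t` layouts and the `queue_init` helper with its dummy head node are assumptions made for this example, since the original does not define them.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed layouts for illustration. */
typedef struct node {
    void *item;
    struct node *_Atomic next;
} node_t;

typedef struct {
    node_t *_Atomic head;  /* always points at a dummy node */
    node_t *_Atomic tail;
} queue_t;

/* Initialize the queue with a single dummy node. */
static bool queue_init(queue_t *q) {
    node_t *dummy = malloc(sizeof *dummy);
    if (dummy == NULL) {
        return false;
    }
    dummy->item = NULL;
    atomic_init(&dummy->next, NULL);
    atomic_init(&q->head, dummy);
    atomic_init(&q->tail, dummy);
    return true;
}

static bool queue_push(queue_t *q, void *item) {
    node_t *n = malloc(sizeof *n);
    if (n == NULL) {
        return false;
    }
    n->item = item;
    atomic_init(&n->next, NULL);

    for (;;) {
        node_t *tail = atomic_load(&q->tail);
        node_t *next = atomic_load(&tail->next);
        if (tail != atomic_load(&q->tail)) {
            continue;  /* snapshot went stale; retry */
        }
        if (next == NULL) {
            node_t *expected = NULL;
            /* Link the new node after the current last node. */
            if (atomic_compare_exchange_strong(&tail->next, &expected, n)) {
                /* Swing tail forward; failure means someone helped us. */
                atomic_compare_exchange_strong(&q->tail, &tail, n);
                return true;
            }
        } else {
            /* Tail is lagging; help advance it, then retry. */
            atomic_compare_exchange_strong(&q->tail, &tail, next);
        }
    }
}
```

The sketch covers only the enqueue side and does not free nodes; a full queue also needs a dequeue path and a safe memory reclamation scheme, which are outside what the original shows.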
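AtomicOps are Non-Posted transactions and follow the standard PCIe ordering rules for Non-Posted requests. The Requester learns the original value only when the Completion arrives, so each AtomicOp costs a full round trip. Switches forward the request and its Completion like any other Non-Posted transaction; they do not stall other traffic until the Completion returns.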
CPU                     GPU
│                       │
│  Write work to        │
│  shared queue         │
│ ─────────────────────►│
│                       │
│  AtomicOp FetchAdd    │
│  (increment work      │
│  counter)             │
│ ─────────────────────►│
│                       │
│ ◄────── CplD ─────────│  GPU sees counter++
│                       │
│                       │  GPU reads work
│                       │  and processes
│                       │
│                       │  GPU AtomicOp
│                       │  FetchAdd (done++)
│ ◄─────────────────────│
│                       │
│  CPU sees completion  │