Atomic Operations Complete Guide

1. What are Atomic Operations?

What are PCIe Atomic Operations?

AtomicOps are PCIe TLP transactions that perform read-modify-write sequences atomically at a target memory location. The operation completes as an indivisible unit, preventing data corruption from concurrent access.

Supported Atomic Operations

Operation	Type[4:0]	Operand Size	Description
FetchAdd	0 1100	32/64 bit	Add operand to the value at the target, return the original value
Swap	0 1101	32/64 bit	Write operand to the target, return the original value
CAS (Compare & Swap)	0 1110	32/64/128 bit	Compare target to operand 1; if equal, write operand 2; return the original value
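The semantics in the table can be sketched as plain C functions. This is only a model of what a Completer does to a 32-bit location; the function names are illustrative and not part of any PCIe API, and a real Completer performs each function body as one indivisible step.

```c
#include <stdint.h>

/* Illustrative model only: the effect of each AtomicOp on a 32-bit target.
 * The model_* names are hypothetical; real hardware does each body atomically. */

/* FetchAdd: add the operand to the target and return the original value. */
uint32_t model_fetchadd32(uint32_t *target, uint32_t add)
{
    uint32_t original = *target;
    *target = original + add;   /* unsigned addition wraps modulo 2^32 */
    return original;
}

/* Swap: write the operand unconditionally and return the original value. */
uint32_t model_swap32(uint32_t *target, uint32_t swap)
{
    uint32_t original = *target;
    *target = swap;
    return original;
}

/* CAS: write swap only if the target equals compare.
 * The original value is returned in both the success and the failure case. */
uint32_t model_cas32(uint32_t *target, uint32_t compare, uint32_t swap)
{
    uint32_t original = *target;
    if (original == compare)
        *target = swap;
    return original;
}
```

In all three cases the Requester learns the value that was in memory before the operation, which is what lets it tell whether a CAS took effect.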

2. Why Atomic Operations?

Why use hardware atomics over software locks?

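Hardware atomics perform the whole read-modify-write at the Completer in a single transaction, so no separate lock has to be acquired and released around the update. This cuts round trips, removes the risk that a stalled lock holder blocks every other agent, and gives CPUs and devices a common synchronization primitive.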

Use Cases

Lock-free Queues: Producer-consumer without mutex
Reference Counting: Atomic increment/decrement
GPU Synchronization: Cross-device semaphores
NVMe: Doorbell optimization
RDMA: Remote memory synchronization
Shared Memory: Device-to-device coordination

Comparison: Atomic vs Software Lock

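Aspect	Software Lock	PCIe Atomic
Latency	Multiple round trips (acquire, update, release)	One request/completion round trip
Contention	Spin/wait required	Serialized by the Completer
Deadlock	Possible if a holder stalls or locks nest	No lock is held, so a single AtomicOp cannot deadlock
Cross-device	Complex	Native support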

3. Atomic Operation Flow

FetchAdd Operation

    Requester                          Completer (Memory)
        │                                      │
        │  AtomicOp Request (FetchAdd)         │
        │  [Addr, Operand=5]                   │
        │ ─────────────────────────────────────►│
        │                                      │
        │                              Read M[Addr] = 10
        │                              Write M[Addr] = 10 + 5 = 15
        │                                      │
        │  Completion with Data                │
        │  [Original value = 10]               │
        │ ◄─────────────────────────────────────│
        │                                      │
    
    Result: Memory updated to 15, Requester gets original 10

Compare-and-Swap (CAS) Operation

    Requester                          Completer (Memory)
        │                                      │
        │  AtomicOp Request (CAS)              │
        │  [Addr, Compare=10, Swap=20]         │
        │ ─────────────────────────────────────►│
        │                                      │
        │                              Read M[Addr] = 10
        │                              Compare: 10 == 10? YES
        │                              Write M[Addr] = 20
        │                                      │
        │  Completion with Data                │
        │  [Original value = 10]               │
        │ ◄─────────────────────────────────────│
        │                                      │
    
    Success: Memory was 10, now 20
    
    --- If memory value doesn't match: ---
    
        │                              Read M[Addr] = 15
        │                              Compare: 15 == 10? NO
        │                              No write performed
        │                                      │
        │  Completion with Data                │
        │  [Original value = 15]               │
        │ ◄─────────────────────────────────────│
    
    Failure: Memory unchanged, Requester sees actual value (15)
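The failure case above is what makes CAS retry loops work: the value returned on failure becomes the next compare value. Below is a minimal sketch of an "atomic max" built that way. The CAS itself is emulated on host memory with C11 atomics; `cas32` and `atomic_max32` are hypothetical names, and a real device path would issue a CAS AtomicOp instead.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a CAS AtomicOp: returns the original value,
 * as the CplD would. Emulated here with C11 atomics on host memory. */
static uint32_t cas32(_Atomic uint32_t *addr, uint32_t compare, uint32_t swap)
{
    uint32_t original = compare;
    /* On failure the current value is stored into original;
     * on success original still equals compare. */
    atomic_compare_exchange_strong(addr, &original, swap);
    return original;
}

/* Raise *addr to at least candidate.
 * Returns the last value observed before any update by this caller. */
uint32_t atomic_max32(_Atomic uint32_t *addr, uint32_t candidate)
{
    uint32_t seen = atomic_load(addr);
    while (seen < candidate) {
        uint32_t original = cas32(addr, seen, candidate);
        if (original == seen)
            break;          /* CAS succeeded: memory now holds candidate */
        seen = original;    /* CAS failed: retry with the returned value */
    }
    return seen;
}
```

Because the failed CAS already returned the current value, the loop does not need a separate read before retrying.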

4. TLP Format

AtomicOp Request Header (32-bit address)

    Byte 0-3 (DW0):
    ┌──────────────────────────────────────────────────────────────┐
    │ Fmt │ Type    │ TC │ Attr │ TH │ TD │ EP │ Attr │ AT │ Len  │
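    │ 010 │ 0 11xx  │    │      │    │    │    │      │    │      │
    └──────────────────────────────────────────────────────────────┘
    
    Fmt/Type Encoding (Fmt[2:0] Type[4:0]):
    FetchAdd, 32-bit address:  010 0 1100
    FetchAdd, 64-bit address:  011 0 1100
    Swap, 32-bit address:      010 0 1101
    Swap, 64-bit address:      011 0 1101
    CAS, 32-bit address:       010 0 1110
    CAS, 64-bit address:       011 0 1110
    
    Fmt selects the header size (3DW for 32-bit addresses, 4DW for 64-bit
    addresses). The operand size (32, 64, or 128 bit) is given by the
    Length field, not by Type.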
    
    Byte 4-7 (DW1):
    ┌──────────────────────────────────────────────────────────────┐
    │    Requester ID (16)     │  Tag (8)  │ Last BE │ First BE   │
    └──────────────────────────────────────────────────────────────┘
    
    Byte 8-11 (DW2):
    ┌──────────────────────────────────────────────────────────────┐
    │                    Address[31:2]                      │ Rsvd │
    └──────────────────────────────────────────────────────────────┘
    
    Data Payload:
    - FetchAdd/Swap: 1 or 2 DW (operand)
    - CAS: 2, 4, or 8 DW (compare + swap values)
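As a rough illustration of the header layout, the sketch below computes the payload length and packs DW0 of a 3DW AtomicOp Request. It assumes the PCIe Base Specification encodings (Fmt 010b for a 3DW header with data, Type 0 1100b / 0 1101b / 0 1110b for FetchAdd / Swap / CAS) and leaves TC, Attr, TH, TD, EP, and AT at zero. The enum and function names are made up for this example.

```c
#include <stdint.h>

/* Type[4:0] values for AtomicOp Requests (names are illustrative). */
enum atomic_type {
    ATOMIC_FETCHADD = 0x0C,   /* 0 1100b */
    ATOMIC_SWAP     = 0x0D,   /* 0 1101b */
    ATOMIC_CAS      = 0x0E    /* 0 1110b */
};

/* Payload length in DW for an operation and operand size in bytes.
 * Returns 0 for combinations that are not architected. */
uint32_t atomicop_length_dw(enum atomic_type type, uint32_t operand_bytes)
{
    uint32_t operand_dw;

    if (operand_bytes == 4 || operand_bytes == 8)
        operand_dw = operand_bytes / 4;
    else if (operand_bytes == 16 && type == ATOMIC_CAS)
        operand_dw = 4;       /* 128-bit operands exist only for CAS */
    else
        return 0;

    /* CAS carries two operands: the compare value, then the swap value. */
    return (type == ATOMIC_CAS) ? 2 * operand_dw : operand_dw;
}

/* Pack DW0 of a 3DW (32-bit address) AtomicOp Request.
 * Callers should reject a zero length before using the result. */
uint32_t atomicop_dw0(enum atomic_type type, uint32_t operand_bytes)
{
    uint32_t fmt = 0x2;       /* 010b: 3DW header with data */
    uint32_t len = atomicop_length_dw(type, operand_bytes);

    return (fmt << 29) | ((uint32_t)type << 24) | (len & 0x3FFu);
}
```

For example, a 32-bit FetchAdd packs to 0x4C000001 and a 128-bit CAS to 0x4E000008 under these assumptions.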

AtomicOp Completion

AtomicOps always return a Completion with Data (CplD) containing the original value at the target address before the operation was performed.

5. Capability Structure

Device Capabilities 2 Register

Bits	Field	Description
6	AtomicOp Routing Supported	Switch can route AtomicOps
7	32-bit AtomicOp Completer Supported	Endpoint supports 32-bit atomics
8	64-bit AtomicOp Completer Supported	Endpoint supports 64-bit atomics
9	128-bit CAS Completer Supported	Endpoint supports 128-bit CAS

Device Control 2 Register

Bits	Field	Description
6	AtomicOp Requester Enable	Allow this function to issue AtomicOps
7	AtomicOp Egress Blocking	Block AtomicOps from egressing port
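The bit positions in the two tables translate directly into masks. The sketch below works on raw register values that the caller has already read from config space; the macro and function names are illustrative, not taken from a particular driver framework.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device Capabilities 2 bits, per the table above. */
#define DEVCAP2_ATOMIC_ROUTE        (1u << 6)
#define DEVCAP2_ATOMIC_COMP32       (1u << 7)
#define DEVCAP2_ATOMIC_COMP64       (1u << 8)
#define DEVCAP2_ATOMIC_COMP128      (1u << 9)

/* Device Control 2 bits, per the table above. */
#define DEVCTL2_ATOMIC_REQ_EN       (1u << 6)
#define DEVCTL2_ATOMIC_EGRESS_BLOCK (1u << 7)

/* Does the function advertise completer support for this operand width? */
bool completer_supports(uint32_t devcap2, unsigned operand_bits)
{
    switch (operand_bits) {
    case 32:  return (devcap2 & DEVCAP2_ATOMIC_COMP32) != 0;
    case 64:  return (devcap2 & DEVCAP2_ATOMIC_COMP64) != 0;
    case 128: return (devcap2 & DEVCAP2_ATOMIC_COMP128) != 0;
    default:  return false;
    }
}

/* Return the Device Control 2 value with AtomicOp Requester Enable set.
 * All other bits are preserved; the caller writes the result back. */
uint16_t devctl2_enable_requester(uint16_t devctl2)
{
    return (uint16_t)(devctl2 | DEVCTL2_ATOMIC_REQ_EN);
}
```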

6. System Requirements

AtomicOp Routing Requirements

All switches in path MUST support AtomicOp Routing
Target endpoint MUST support required atomic size
Requester MUST have AtomicOp Requester Enable set
AtomicOp Egress Blocking MUST NOT be set on path
Root Complex MUST support AtomicOp completion when the target is host memory

Enabling Sequence

Enumerate PCIe tree
Check AtomicOp capabilities at each switch
Check completer capabilities at target
Enable AtomicOp Requester at source
Ensure AtomicOp Egress Blocking is clear on every egress port in the path
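The requirements and the enabling sequence can be sketched as one check over a simplified model of the path. The `struct hop` layout and the idea of passing the path as an array are assumptions made for this example; a real implementation would walk the PCIe hierarchy and read each register from config space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEVCAP2_ATOMIC_ROUTE        (1u << 6)
#define DEVCTL2_ATOMIC_REQ_EN       (1u << 6)
#define DEVCTL2_ATOMIC_EGRESS_BLOCK (1u << 7)

/* Hypothetical model of one device or port on the path. */
enum hop_kind { HOP_REQUESTER, HOP_SWITCH_PORT, HOP_COMPLETER };

struct hop {
    enum hop_kind kind;
    uint32_t devcap2;   /* Device Capabilities 2 value */
    uint16_t devctl2;   /* Device Control 2 value */
};

/* Check every hop, then set AtomicOp Requester Enable on the requester.
 * completer_mask holds the Device Capabilities 2 completer bits required.
 * Nothing is modified unless the whole path qualifies. */
bool atomic_path_enable(struct hop *path, size_t n, uint32_t completer_mask)
{
    struct hop *requester = NULL;
    bool have_completer = false;

    for (size_t i = 0; i < n; i++) {
        struct hop *h = &path[i];
        switch (h->kind) {
        case HOP_REQUESTER:
            requester = h;
            break;
        case HOP_SWITCH_PORT:
            if (!(h->devcap2 & DEVCAP2_ATOMIC_ROUTE))
                return false;   /* port cannot route AtomicOps */
            if (h->devctl2 & DEVCTL2_ATOMIC_EGRESS_BLOCK)
                return false;   /* AtomicOps blocked at this port */
            break;
        case HOP_COMPLETER:
            if ((h->devcap2 & completer_mask) != completer_mask)
                return false;   /* required operand size not supported */
            have_completer = true;
            break;
        }
    }
    if (requester == NULL || !have_completer)
        return false;

    requester->devctl2 = (uint16_t)(requester->devctl2 | DEVCTL2_ATOMIC_REQ_EN);
    return true;
}
```

Checking first and enabling last keeps the requester from issuing AtomicOps into a path that would reject them. On Linux, drivers that target host memory generally use the kernel's `pci_enable_atomic_ops_to_root()` helper rather than walking the tree by hand.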

7. Programming Examples

Spinlock Implementation

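#include <stdint.h>

// Acquire a lock word using the Swap AtomicOp.
// pcie_atomic_swap() is assumed to issue the Swap and return the
// original value carried in the completion.
void acquire_lock(volatile uint32_t *lock_addr) {
    while (1) {
        // Write 1 (locked) and get the previous value back
        uint32_t old = pcie_atomic_swap(lock_addr, 1);
        if (old == 0) {
            return; // Was unlocked, so this caller now owns the lock
        }
        // Was already locked: writing 1 again changed nothing, so spin
    }
}

void release_lock(volatile uint32_t *lock_addr) {
    // Plain write to release; writes made inside the critical section
    // must be ordered before it
    *lock_addr = 0;
}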
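The `pcie_atomic_swap` helper is not defined in this guide. To try the locking logic on a host, it can be stood in for with C11 `atomic_exchange`, as in the sketch below. The `emu_` names are hypothetical, and the bounded try-lock is a variation added here for illustration: over PCIe each failed Swap costs a full round trip, so capping the number of attempts is often preferable to spinning forever.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for pcie_atomic_swap(). */
static uint32_t emu_atomic_swap(_Atomic uint32_t *addr, uint32_t value)
{
    return atomic_exchange(addr, value);
}

/* Try to take the lock, giving up after max_spins Swap attempts. */
bool try_acquire_lock(_Atomic uint32_t *lock, unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        if (emu_atomic_swap(lock, 1) == 0)
            return true;    /* previous value was 0: lock acquired */
    }
    return false;           /* still held by someone else */
}

/* Release the lock with an atomic store of 0. */
void emu_release_lock(_Atomic uint32_t *lock)
{
    atomic_store(lock, 0);
}
```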

Reference Counting

// Increment reference count atomically; returns the count before the increment
uint32_t ref_inc(volatile uint32_t *ref_count) {
    return pcie_atomic_fetchadd(ref_count, 1);
}

// Decrement and check if zero
bool ref_dec_and_test(volatile uint32_t *ref_count) {
    // FetchAdd has no subtract form; adding UINT32_MAX subtracts 1 modulo 2^32
    uint32_t old = pcie_atomic_fetchadd(ref_count, UINT32_MAX);
    return (old == 1); // Was 1, now 0
}
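A host-side version of the same reference counting logic is sketched below, with C11 `atomic_fetch_add` standing in for the undefined `pcie_atomic_fetchadd` helper. The `emu_` names are hypothetical. The point it illustrates is that the decision to free must be based on the value returned by the FetchAdd, not on a later read of the counter.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for pcie_atomic_fetchadd(). */
static uint32_t emu_fetchadd(_Atomic uint32_t *addr, uint32_t operand)
{
    return atomic_fetch_add(addr, operand);
}

/* Take a reference; returns the count before the increment. */
uint32_t emu_ref_inc(_Atomic uint32_t *ref_count)
{
    return emu_fetchadd(ref_count, 1);
}

/* Drop a reference; returns true if this call released the last one. */
bool emu_ref_dec_and_test(_Atomic uint32_t *ref_count)
{
    /* Adding UINT32_MAX is a decrement modulo 2^32. */
    uint32_t old = emu_fetchadd(ref_count, UINT32_MAX);
    return old == 1;
}
```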

Lock-free Queue (CAS)

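// Push to a lock-free queue using CAS (Michael-Scott style enqueue).
// Assumptions: pcie_atomic_cas() returns true when the swap was performed,
// alloc_node() returns NULL on failure and sets next to NULL on success,
// and q->tail always points at a valid node (the queue starts with a dummy).
bool queue_push(queue_t *q, void *item) {
    node_t *new_node = alloc_node(item);
    if (new_node == NULL) {
        return false; // Allocation failed
    }
    while (1) {
        node_t *tail = q->tail;
        node_t *next = tail->next;

        if (tail != q->tail) {
            continue; // Tail moved while we were reading, start over
        }
        if (next == NULL) {
            // Tail is the last node: try to link the new node after it
            if (pcie_atomic_cas(&tail->next, NULL, new_node)) {
                // Linked. Try to advance tail; if this fails,
                // another agent has already advanced it for us
                pcie_atomic_cas(&q->tail, tail, new_node);
                return true;
            }
        } else {
            // Tail is lagging behind the last node, help advance it
            pcie_atomic_cas(&q->tail, tail, next);
        }
    }
}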
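The push above relies on `queue_t`, `node_t`, and `pcie_atomic_cas`, none of which are defined in this guide. The sketch below fills them in with hypothetical host-side definitions, using C11 atomics in place of the CAS AtomicOp, so the enqueue logic can be exercised. It covers only initialization and push, and it does not address node reclamation or the ABA problem, which a complete queue would have to handle.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node and queue layouts for this example. */
typedef struct node {
    void *item;
    struct node *_Atomic next;
} node_t;

typedef struct {
    node_t *_Atomic head;   /* always points at a dummy node */
    node_t *_Atomic tail;
} queue_t;

/* Hypothetical stand-in for pcie_atomic_cas(): true if the swap happened. */
static bool emu_cas_ptr(node_t *_Atomic *addr, node_t *expected, node_t *desired)
{
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

static node_t *emu_alloc_node(void *item)
{
    node_t *n = malloc(sizeof *n);
    if (n != NULL) {
        n->item = item;
        atomic_init(&n->next, NULL);
    }
    return n;
}

/* Start the queue with a dummy node so tail is never NULL. */
bool emu_queue_init(queue_t *q)
{
    node_t *dummy = emu_alloc_node(NULL);
    if (dummy == NULL)
        return false;
    atomic_init(&q->head, dummy);
    atomic_init(&q->tail, dummy);
    return true;
}

bool emu_queue_push(queue_t *q, void *item)
{
    node_t *n = emu_alloc_node(item);
    if (n == NULL)
        return false;
    for (;;) {
        node_t *tail = atomic_load(&q->tail);
        node_t *next = atomic_load(&tail->next);
        if (tail != atomic_load(&q->tail))
            continue;                       /* tail moved, re-read */
        if (next == NULL) {
            if (emu_cas_ptr(&tail->next, NULL, n)) {
                emu_cas_ptr(&q->tail, tail, n);   /* may fail harmlessly */
                return true;
            }
        } else {
            emu_cas_ptr(&q->tail, tail, next);    /* help a lagging tail */
        }
    }
}
```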

8. Rules and Ordering

Normative Rules for AtomicOps

AtomicOps use Non-Posted flow (require completion)
AtomicOps consume Non-Posted credits
AtomicOp address MUST be naturally aligned to the operand size
Completer MUST serialize AtomicOps to same address
AtomicOp completions MUST return original data
Unsupported AtomicOp returns UR (Unsupported Request)
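The natural alignment rule is easy to check before issuing a request. A minimal sketch, assuming operand sizes of 4, 8, or 16 bytes and a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is naturally aligned for an AtomicOp with this operand size.
 * Only the architected operand sizes (4, 8, 16 bytes) are accepted. */
bool atomicop_aligned(uint64_t addr, uint32_t operand_bytes)
{
    if (operand_bytes != 4 && operand_bytes != 8 && operand_bytes != 16)
        return false;
    /* operand_bytes is a power of two, so the mask test is exact. */
    return (addr & (uint64_t)(operand_bytes - 1)) == 0;
}
```

Note that for CAS the alignment follows the operand size, not the payload size, even though the payload carries two operands.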

Ordering

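AtomicOps are Non-Posted transactions and follow the standard PCIe ordering rules for Non-Posted requests. Switches forward them like any other Non-Posted request and route the completion back separately; the Requester keeps the transaction outstanding until the CplD arrives.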

9. Use Case: GPU-CPU Synchronization

    CPU                    GPU                     
     │                      │                      
     │ Write work to        │                      
     │ shared queue         │                      
     │ ─────────────────────►                      
     │                      │                      
     │ AtomicOp FetchAdd    │                      
     │ (increment work      │                      
     │  counter)            │                      
     │ ─────────────────────►                      
     │                      │                      
     │ ◄────── CplD ────────│ GPU sees counter++  
     │                      │                      
     │                      │ GPU reads work      
     │                      │ and processes       
     │                      │                      
     │                      │ GPU AtomicOp        
     │                      │ FetchAdd (done++)   
     │ ◄─────────────────────                      
     │                      │                      
     │ CPU sees completion  │
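The handshake in the diagram can be modeled with two counters: one the CPU increments per submitted work item and one the GPU increments per finished item. The sketch below emulates both sides on a host with C11 atomics; the structure and function names are invented for this example, and on real hardware each increment would be a FetchAdd AtomicOp to shared memory.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared area visible to both CPU and GPU. */
struct sync_area {
    _Atomic uint32_t submitted;   /* incremented by the CPU per work item */
    _Atomic uint32_t done;        /* incremented by the GPU per finished item */
};

/* CPU side: publish one work item and return its sequence number.
 * The work payload must already be in the shared queue at this point. */
uint32_t cpu_submit(struct sync_area *s)
{
    return atomic_fetch_add(&s->submitted, 1);
}

/* GPU side: process everything submitted so far.
 * Returns the number of items processed by this call. */
uint32_t gpu_drain(struct sync_area *s)
{
    uint32_t n = 0;
    while (atomic_load(&s->done) != atomic_load(&s->submitted)) {
        /* ... read and process one work item here ... */
        atomic_fetch_add(&s->done, 1);
        n++;
    }
    return n;
}

/* CPU side: has the item with this sequence number completed?
 * The unsigned subtraction keeps the test correct across counter wraparound. */
bool cpu_is_done(struct sync_area *s, uint32_t seq)
{
    uint32_t done = atomic_load(&s->done);
    return (uint32_t)(done - seq - 1u) < 0x80000000u;
}
```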