ATOMIC TRANSACTIONS

Atomic Operations Complete Guide

Hardware-guaranteed atomic read-modify-write operations across PCIe fabric

1. What are Atomic Operations?

What are PCIe Atomic Operations?

AtomicOps are PCIe TLP transactions that perform read-modify-write sequences atomically at a target memory location. The operation completes as an indivisible unit, preventing data corruption from concurrent access.

Supported Atomic Operations

Operation              Fmt/Type (3DW / 4DW header)   Operand Size     Description
FetchAdd               010 0 1100 / 011 0 1100       32/64 bit        Add operand to the target value; return the original value
Swap                   010 0 1101 / 011 0 1101       32/64 bit        Write operand to the target; return the original value
CAS (Compare & Swap)   010 0 1110 / 011 0 1110       32/64/128 bit    Write the swap operand only if the target equals the compare operand; return the original value

2. Why Atomic Operations?

Why use hardware atomics over software locks?

Hardware atomics provide guaranteed atomicity without requiring software locks, reducing latency, avoiding deadlocks, and enabling efficient synchronization across CPU and devices.

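Use Cases

  1. Spinlocks shared between the CPU and devices
  2. Reference counting on shared objects
  3. Lock-free queues
  4. GPU-CPU work and completion counters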

Comparison: Atomic vs Software Lock

Aspect         Software Lock          PCIe Atomic
Latency        Multiple round trips   One request/completion round trip
Contention     Spin/wait required     Hardware arbitration at the Completer
Deadlock       Possible               None for a single AtomicOp (locks built from AtomicOps can still deadlock)
Cross-device   Complex                Native support

3. Atomic Operation Flow

FetchAdd Operation

    Requester                          Completer (Memory)
        │                                      │
        │  AtomicOp Request (FetchAdd)         │
        │  [Addr, Operand=5]                   │
        │ ─────────────────────────────────────►│
        │                                      │
        │                              Read M[Addr] = 10
        │                              Write M[Addr] = 10 + 5 = 15
        │                                      │
        │  Completion with Data                │
        │  [Original value = 10]               │
        │ ◄─────────────────────────────────────│
        │                                      │
    
    Result: Memory updated to 15, Requester gets original 10

Compare-and-Swap (CAS) Operation

    Requester                          Completer (Memory)
        │                                      │
        │  AtomicOp Request (CAS)              │
        │  [Addr, Compare=10, Swap=20]         │
        │ ─────────────────────────────────────►│
        │                                      │
        │                              Read M[Addr] = 10
        │                              Compare: 10 == 10? YES
        │                              Write M[Addr] = 20
        │                                      │
        │  Completion with Data                │
        │  [Original value = 10]               │
        │ ◄─────────────────────────────────────│
        │                                      │
    
    Success: Memory was 10, now 20
    
    --- If memory value doesn't match: ---
    
        │                              Read M[Addr] = 15
        │                              Compare: 15 == 10? NO
        │                              No write performed
        │                                      │
        │  Completion with Data                │
        │  [Original value = 15]               │
        │ ◄─────────────────────────────────────│
    
    Failure: Memory unchanged, Requester sees actual value (15)
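
The two flows above can be reproduced with a small software model of the Completer. This is only a sketch of the semantics, not how hardware is built: the function names are made up for illustration, the operand is fixed at 32 bits, and atomicity is simply assumed because the model runs single-threaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of a Completer handling 32-bit AtomicOps.
 * Every function returns the ORIGINAL value, as the CplD would. */

/* FetchAdd: M[addr] = M[addr] + operand (wraps modulo 2^32). */
uint32_t model_fetchadd(uint32_t *mem, uint32_t operand) {
    uint32_t original = *mem;
    *mem = original + operand;
    return original;
}

/* Swap: M[addr] = operand, unconditionally. */
uint32_t model_swap(uint32_t *mem, uint32_t operand) {
    uint32_t original = *mem;
    *mem = operand;
    return original;
}

/* CAS: write swap_val only when M[addr] equals compare_val. */
uint32_t model_cas(uint32_t *mem, uint32_t compare_val, uint32_t swap_val) {
    uint32_t original = *mem;
    if (original == compare_val) {
        *mem = swap_val;
    }
    return original;
}

/* The Requester infers CAS success by comparing the returned value
 * with the compare operand it sent. */
bool cas_succeeded(uint32_t returned, uint32_t compare_val) {
    return returned == compare_val;
}
```

With memory at 10, `model_fetchadd(&m, 5)` returns 10 and leaves 15 behind. `model_cas(&m, 10, 20)` against a memory value of 15 returns 15 and changes nothing, which is the failure case in the diagram.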

4. TLP Format

AtomicOp Request Header (32-bit address)

    Byte 0-3 (DW0):
    ┌──────────────────────────────────────────────────────────────┐
    │ Fmt │ Type    │ TC │ Attr │ TH │ TD │ EP │ Attr │ AT │ Len  │
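    │ 010 │ 011xx   │    │      │    │    │    │      │    │      │
    └──────────────────────────────────────────────────────────────┘
    
    Fmt/Type Encoding (Fmt[2:0] Type[4:0]):
    FetchAdd, 32-bit address (3DW header):  010 0 1100
    FetchAdd, 64-bit address (4DW header):  011 0 1100
    Swap, 32-bit address (3DW header):      010 0 1101
    Swap, 64-bit address (4DW header):      011 0 1101
    CAS, 32-bit address (3DW header):       010 0 1110
    CAS, 64-bit address (4DW header):       011 0 1110
    
    Fmt selects the header size (010 = 3DW with data, 011 = 4DW with data).
    The operand size (32, 64, or 128 bit) is conveyed by the Length field,
    not by the Type field.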
    
    Byte 4-7 (DW1):
    ┌──────────────────────────────────────────────────────────────┐
    │    Requester ID (16)     │  Tag (8)  │ Last BE │ First BE   │
    └──────────────────────────────────────────────────────────────┘
    
    Byte 8-11 (DW2):
    ┌──────────────────────────────────────────────────────────────┐
    │                    Address[31:2]                      │ Rsvd │
    └──────────────────────────────────────────────────────────────┘
    
    Data Payload:
    - FetchAdd/Swap: 1 or 2 DW (operand)
    - CAS: 2, 4, or 8 DW (compare + swap values)
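
As a rough illustration of how the header fields combine, the sketch below packs DW0 for an AtomicOp Request. It assumes the PCI Express Base Specification encodings (Fmt 010/011 for a 3DW/4DW header with data, Type 0 1100 for FetchAdd, 0 1101 for Swap, 0 1110 for CAS), treats DW0 as a 32-bit value with byte 0 in the most significant position, and leaves TC, Attr, TH, TD, EP and AT at zero. The function and enum names are invented for this example.

```c
#include <stdint.h>

/* Type[4:0] values for AtomicOp Requests. */
enum atomic_op {
    ATOMIC_FETCHADD = 0x0C, /* 0 1100 */
    ATOMIC_SWAP     = 0x0D, /* 0 1101 */
    ATOMIC_CAS      = 0x0E  /* 0 1110 */
};

/* Payload length in DW, or 0 if the operand size is not architected. */
uint32_t atomicop_length_dw(enum atomic_op op, unsigned operand_bits) {
    if (op == ATOMIC_CAS) {
        /* CAS carries two operands: compare value + swap value. */
        if (operand_bits == 32 || operand_bits == 64 || operand_bits == 128) {
            return 2u * (operand_bits / 32u);
        }
        return 0;
    }
    if (operand_bits == 32 || operand_bits == 64) {
        return operand_bits / 32u;
    }
    return 0;
}

/* Build DW0: Fmt in bits 31:29, Type in bits 28:24, Length in bits 9:0.
 * Returns 0 for an unsupported operand size. */
uint32_t atomicop_dw0(enum atomic_op op, int addr64, unsigned operand_bits) {
    uint32_t len = atomicop_length_dw(op, operand_bits);
    if (len == 0) {
        return 0;
    }
    uint32_t fmt = addr64 ? 0x3u : 0x2u; /* 011 = 4DW+data, 010 = 3DW+data */
    return (fmt << 29) | ((uint32_t)op << 24) | len;
}
```

For example, a FetchAdd with a 32-bit address and a 32-bit operand packs to 0x4C000001, and a 128-bit CAS with a 64-bit address packs to 0x6E000008.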

AtomicOp Completion

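A successful AtomicOp returns a Completion with Data (CplD) containing the original value at the target address before the operation was performed. The completion payload is one operand wide (1, 2, or 4 DW), even for CAS, whose request carries two operands. A request that cannot be serviced returns a Completion without data and an error status such as UR.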

5. Capability Structure

Device Capabilities 2 Register

Bit   Field                                  Description
6     AtomicOp Routing Supported             Switch can route AtomicOps
7     32-bit AtomicOp Completer Supported    Endpoint supports 32-bit atomics
8     64-bit AtomicOp Completer Supported    Endpoint supports 64-bit atomics
9     128-bit CAS Completer Supported        Endpoint supports 128-bit CAS

Device Control 2 Register

Bit   Field                       Description
6     AtomicOp Requester Enable   Allow this function to issue AtomicOps
7     AtomicOp Egress Blocking    Block AtomicOps from egressing port
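
The bit positions in the two tables above translate directly into masks. The sketch below decodes raw register values that are assumed to have been read from config space already. The macro and function names are invented here for illustration and do not come from any particular driver API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device Capabilities 2 bits (32-bit register). */
#define DEVCAP2_ATOMIC_ROUTING   (1u << 6)
#define DEVCAP2_ATOMIC_COMP32    (1u << 7)
#define DEVCAP2_ATOMIC_COMP64    (1u << 8)
#define DEVCAP2_ATOMIC_CAS128    (1u << 9)

/* Device Control 2 bits (16-bit register). */
#define DEVCTL2_ATOMIC_REQ_EN    (1u << 6)
#define DEVCTL2_ATOMIC_EGR_BLOCK (1u << 7)

/* True if the function advertises Completer support for this operand size. */
bool completer_supports(uint32_t devcap2, unsigned operand_bits) {
    switch (operand_bits) {
    case 32:  return (devcap2 & DEVCAP2_ATOMIC_COMP32) != 0;
    case 64:  return (devcap2 & DEVCAP2_ATOMIC_COMP64) != 0;
    case 128: return (devcap2 & DEVCAP2_ATOMIC_CAS128) != 0;
    default:  return false;
    }
}

/* True if the port advertises AtomicOp routing. */
bool routing_supported(uint32_t devcap2) {
    return (devcap2 & DEVCAP2_ATOMIC_ROUTING) != 0;
}

/* True if the port is configured to block AtomicOps on egress. */
bool egress_blocked(uint16_t devctl2) {
    return (devctl2 & DEVCTL2_ATOMIC_EGR_BLOCK) != 0;
}

/* Returns the Device Control 2 value with AtomicOp Requester Enable set;
 * the caller is responsible for writing it back to config space. */
uint16_t with_requester_enabled(uint16_t devctl2) {
    return (uint16_t)(devctl2 | DEVCTL2_ATOMIC_REQ_EN);
}
```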

6. System Requirements

AtomicOp Routing Requirements

  1. All switches in path MUST support AtomicOp Routing
  2. Target endpoint MUST support required atomic size
  3. Requester MUST have AtomicOp Requester Enable set
  4. AtomicOp Egress Blocking MUST NOT be set on path
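  5. Root Complex MUST support AtomicOp completion when host memory is the target, and AtomicOp routing between Root Ports when the request is peer-to-peer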

Enabling Sequence

  1. Enumerate PCIe tree
  2. Check AtomicOp capabilities at each switch
  3. Check completer capabilities at target
  4. Enable AtomicOp Requester at source
  5. Ensure routing not blocked
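
The checks in the two lists above can be sketched as a single path-validation function. This is a simplified model under stated assumptions: each intermediate port has already been reduced to two booleans taken from its Device Capabilities 2 and Device Control 2 registers, and the struct, enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port summary, filled in from config space reads. */
struct hop {
    bool routing_supported; /* Device Capabilities 2, bit 6 */
    bool egress_blocked;    /* Device Control 2, bit 7 */
};

enum path_status {
    PATH_OK,
    PATH_NO_ROUTING,         /* a switch on the path cannot route AtomicOps */
    PATH_EGRESS_BLOCKED,     /* Egress Blocking is set somewhere on the path */
    PATH_NO_COMPLETER,       /* target lacks support for the operand size */
    PATH_REQUESTER_DISABLED  /* Requester Enable still needs to be set */
};

/* Walk the hops from Requester to Completer and report the first problem. */
enum path_status check_atomic_path(const struct hop *hops, size_t n_hops,
                                   bool completer_supports_size,
                                   bool requester_enabled) {
    for (size_t i = 0; i < n_hops; i++) {
        if (!hops[i].routing_supported) {
            return PATH_NO_ROUTING;
        }
        if (hops[i].egress_blocked) {
            return PATH_EGRESS_BLOCKED;
        }
    }
    if (!completer_supports_size) {
        return PATH_NO_COMPLETER;
    }
    if (!requester_enabled) {
        return PATH_REQUESTER_DISABLED;
    }
    return PATH_OK;
}
```

Checking the Requester last mirrors the enabling sequence: there is little point in setting AtomicOp Requester Enable until the rest of the path is known to work.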

7. Programming Examples

Spinlock Implementation

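// Lock using Swap atomic
#include <stdint.h>

// Platform-specific helper (not defined in this guide): issues a 32-bit
// Swap AtomicOp and returns the original value.
uint32_t pcie_atomic_swap(volatile uint32_t *addr, uint32_t value);

void acquire_lock(volatile uint32_t *lock_addr) {
    while (1) {
        // AtomicOp Swap: write 1, get original
        uint32_t old = pcie_atomic_swap(lock_addr, 1);
        if (old == 0) {
            return; // Lock acquired: it was free and now holds 1
        }
        // Lock was already held (1 was rewritten over 1); spin and retry
    }
}

void release_lock(volatile uint32_t *lock_addr) {
    *lock_addr = 0; // Simple write to release; only the lock holder may call this
}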

Reference Counting

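#include <stdbool.h>
#include <stdint.h>

// Platform-specific helper (not defined in this guide): issues a 32-bit
// FetchAdd AtomicOp and returns the original value.
uint32_t pcie_atomic_fetchadd(volatile uint32_t *addr, uint32_t operand);

// Increment reference count atomically; returns the count before the increment
uint32_t ref_inc(volatile uint32_t *ref_count) {
    return pcie_atomic_fetchadd(ref_count, 1);
}

// Decrement and check if zero
bool ref_dec_and_test(volatile uint32_t *ref_count) {
    // FetchAdd has no subtract form: adding 0xFFFFFFFF wraps to "minus one"
    uint32_t old = pcie_atomic_fetchadd(ref_count, UINT32_MAX);
    return (old == 1); // Was 1, now 0
}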

Lock-free Queue (CAS)

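// Push to lock-free queue using CAS.
// Assumes queue_t, node_t, alloc_node() and a pointer-sized pcie_atomic_cas()
// that returns true when the swap took place; none are defined in this guide.
// alloc_node() is assumed to return NULL on failure and to set next to NULL.
bool queue_push(queue_t *q, void *item) {
    node_t *new_node = alloc_node(item);
    if (new_node == NULL) {
        return false; // Allocation failed, nothing was queued
    }
    while (1) {
        node_t *tail = q->tail;
        node_t *next = tail->next;

        if (tail != q->tail) {
            continue; // Tail moved while we were reading; take a fresh snapshot
        }
        if (next == NULL) {
            // Try to link new node
            if (pcie_atomic_cas(&tail->next, NULL, new_node)) {
                // Success, try to update tail (another pusher may already have)
                pcie_atomic_cas(&q->tail, tail, new_node);
                return true;
            }
        } else {
            // Tail not pointing to last, help advance
            pcie_atomic_cas(&q->tail, tail, next);
        }
    }
}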
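
The `pcie_atomic_*` helpers used in this section are not defined in this guide, and how they reach the fabric is platform-specific. To exercise the lock and reference-count logic on a host, one option is to stub them with the GCC/Clang `__atomic` builtins, as sketched below. These stand-ins act on local memory and do not generate AtomicOp TLPs. The `pcie_atomic_cas32` name and its return-the-original convention are choices made for this sketch and differ from the boolean, pointer-sized form the queue example assumes.

```c
#include <stdint.h>

/* Host-only stand-ins built on GCC/Clang __atomic builtins.
 * They model the semantics only; no PCIe traffic is generated. */

/* Swap: store value, return what was there before. */
uint32_t pcie_atomic_swap(volatile uint32_t *addr, uint32_t value) {
    return __atomic_exchange_n(addr, value, __ATOMIC_SEQ_CST);
}

/* FetchAdd: add operand (modulo 2^32), return the previous value. */
uint32_t pcie_atomic_fetchadd(volatile uint32_t *addr, uint32_t operand) {
    return __atomic_fetch_add(addr, operand, __ATOMIC_SEQ_CST);
}

/* CAS: returns the original value, like a CplD would. The swap took
 * place if and only if the return value equals `compare`. */
uint32_t pcie_atomic_cas32(volatile uint32_t *addr, uint32_t compare,
                           uint32_t swap) {
    uint32_t expected = compare;
    __atomic_compare_exchange_n(addr, &expected, swap, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    /* On success `expected` is untouched; on failure the builtin
     * overwrites it with the value actually found in memory. */
    return expected;
}
```

With these stubs, a second `pcie_atomic_swap(&lock, 1)` on a held lock returns 1, which is exactly the condition that keeps `acquire_lock` spinning.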

8. Rules and Ordering

Normative Rules for AtomicOps

  1. AtomicOps use Non-Posted flow (require completion)
  2. AtomicOps consume Non-Posted credits
  3. AtomicOp address MUST be naturally aligned to the operand size (a misaligned request is a Malformed TLP)
  4. Completer MUST serialize AtomicOps to same address
  5. AtomicOp completions MUST return original data
  6. Unsupported AtomicOp returns UR (Unsupported Request)
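
Rule 3 is easy to check in software before a request is issued. A minimal sketch, assuming the operand size is given in bytes and is one of the architected sizes (4, 8, or 16), with a function name invented for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Natural alignment: the address must be a multiple of the operand size.
 * Only the architected operand sizes (4, 8, 16 bytes) are accepted. */
bool atomicop_addr_is_aligned(uint64_t addr, unsigned operand_bytes) {
    if (operand_bytes != 4 && operand_bytes != 8 && operand_bytes != 16) {
        return false;
    }
    /* For a power-of-two size, the low bits must all be zero. */
    return (addr & (uint64_t)(operand_bytes - 1u)) == 0;
}
```

An address of 0x1004 is therefore acceptable for a 32-bit operand but not for a 64-bit one.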

Ordering

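AtomicOps are Non-Posted Requests with data and follow the standard PCIe ordering rules for that class. By default they cannot pass earlier Posted Requests (Memory Writes) on the same path, so data written before an AtomicOp reaches the Completer first. Switches forward AtomicOps like any other Non-Posted Request and do not stall waiting for the completion; only the Requester waits for the CplD to learn the original value.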

9. Use Case: GPU-CPU Synchronization

    CPU                    GPU                     
     │                      │                      
     │ Write work to        │                      
     │ shared queue         │                      
     │ ─────────────────────►                      
     │                      │                      
     │ AtomicOp FetchAdd    │                      
     │ (increment work      │                      
     │  counter)            │                      
     │ ─────────────────────►                      
     │                      │                      
     │ ◄────── CplD ────────│ GPU sees counter++  
     │                      │                      
     │                      │ GPU reads work      
     │                      │ and processes       
     │                      │                      
     │                      │ GPU AtomicOp        
     │                      │ FetchAdd (done++)   
     │ ◄─────────────────────                      
     │                      │                      
     │ CPU sees completion  │
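
The handshake in the diagram can be modeled on a host with two counters. The sketch below is a single-producer, single-consumer simulation that uses C11 `<stdatomic.h>` in place of real AtomicOps. The struct layout, the slot count and the function names are assumptions made for illustration, and the "GPU" is just a function called on the same thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 8

struct shared {
    uint32_t queue[QUEUE_SLOTS]; /* work items written by the CPU */
    _Atomic uint32_t work;       /* items published by the CPU */
    _Atomic uint32_t done;       /* items finished by the GPU */
};

/* CPU side: write the item first, then publish it with a fetch-add.
 * Returns false when every slot is still in use. */
bool cpu_submit(struct shared *s, uint32_t item) {
    uint32_t work = atomic_load(&s->work);
    if (work - atomic_load(&s->done) >= QUEUE_SLOTS) {
        return false;
    }
    s->queue[work % QUEUE_SLOTS] = item;
    atomic_fetch_add(&s->work, 1); /* models FetchAdd on the work counter */
    return true;
}

/* GPU side: consume every published item, adding each to *sum.
 * Returns the number of items processed in this call. */
uint32_t gpu_poll(struct shared *s, uint32_t *sum) {
    uint32_t processed = 0;
    uint32_t done = atomic_load(&s->done);
    while (done != atomic_load(&s->work)) {
        *sum += s->queue[done % QUEUE_SLOTS];
        done = atomic_fetch_add(&s->done, 1) + 1; /* models done++ */
        processed++;
    }
    return processed;
}
```

Writing the queue slot before bumping the counter matters on real hardware too: the Posted write travels ahead of the Non-Posted AtomicOp, so the consumer never sees a count that covers data it cannot yet read.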