Advanced Error Reporting (AER)

Complete Technical Deep-Dive: Error Classification, Signaling, Logging, and Recovery Mechanisms

1. AER Overview

Advanced Error Reporting (AER) is an optional Extended Capability defined in the PCIe specification. It provides more detailed error reporting than the baseline mechanism in the PCI Express Capability structure: fine-grained error classification, error logging with TLP header capture, and per-error masking and severity programming.

Why AER Matters for System Reliability

  • Precise Error Identification: Identifies the specific error type within the correctable, non-fatal, and fatal classes
  • Root Cause Analysis: Captures TLP headers for debugging failed transactions
  • Flexible Severity: Allows system software to program error severity levels
  • Error Containment: Works with DPC for automatic error containment
  • Multi-Error Handling: Tracks multiple simultaneous errors

AER Capability Structure Location

AER Extended Capability ID: 0x0001
Capability Version: 2
Location: Extended Configuration Space (offset > 0xFF)

Extended Capability Header Format:
┌──────────────────┬─────────────┬────────────────────────────┐
│      31:20       │    19:16    │            15:0            │
│   Next Cap Ptr   │ Cap Version │   Capability ID (0x0001)   │
└──────────────────┴─────────────┴────────────────────────────┘
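As an illustration of how software might locate the AER structure, the sketch below walks the extended capability list in a raw configuration-space image. The function name `find_ext_capability` and the synthetic 4 KiB buffer are assumptions made for this example; the sketch also assumes the list starts at offset 0x100 and that a header of all zeros or all ones means no capability is present.

```python
import struct

AER_CAP_ID = 0x0001      # AER Extended Capability ID (from the text above)
EXT_CAP_START = 0x100    # assumed start of the extended capability list


def find_ext_capability(cfg: bytes, cap_id: int):
    """Return (offset, version) of the first matching capability, or None."""
    offset = EXT_CAP_START
    visited = set()                      # guards against a looping list
    while offset and offset not in visited and offset + 4 <= len(cfg):
        visited.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0x00000000, 0xFFFFFFFF):
            return None                  # empty list or absent device
        if header & 0xFFFF == cap_id:    # bits 15:0: Capability ID
            return offset, (header >> 16) & 0xF   # bits 19:16: version
        offset = (header >> 20) & 0xFFC  # bits 31:20: next pointer, DW aligned
    return None


# Synthetic image: a placeholder capability at 0x100 chained to AER at 0x148.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, (0x148 << 20) | (1 << 16) | 0x0003)
struct.pack_into("<I", cfg, 0x148, (0x000 << 20) | (2 << 16) | AER_CAP_ID)
aer = find_ext_capability(bytes(cfg), AER_CAP_ID)
print(aer)
```

On a real system the image would come from a configuration-space read rather than a hand-built buffer.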

2. Error Classification

PCIe errors are classified into three categories based on their impact on system operation and recoverability: Correctable, Uncorrectable Non-Fatal, and Uncorrectable Fatal.

Correctable Errors

Errors that the hardware can recover from without software intervention. These indicate potential issues but don't require immediate action.

  • Receiver Error (Physical Layer)
  • Bad TLP (LCRC error)
  • Bad DLLP (CRC error)
  • REPLAY_NUM Rollover
  • Replay Timer Timeout
  • Advisory Non-Fatal (optional)
  • Corrected Internal Error
  • Header Log Overflow

Uncorrectable Errors

Errors that cannot be automatically recovered and require software/system intervention.

  • Fatal (default severity): Data Link Protocol Error, Surprise Down, Flow Control Protocol Error, Malformed TLP, Receiver Overflow, Uncorrectable Internal Error
  • Non-Fatal (default severity): Poisoned TLP, Completion Timeout, Completer Abort, Unexpected Completion, ECRC Error, Unsupported Request, ACS Violation

Error Severity Matrix

Error Type                    Default Severity  Programmable?  Impact
────────────────────────────  ────────────────  ─────────────  ───────────────────────────────
Data Link Protocol Error      Fatal             Yes            Link unreliable, requires reset
Surprise Down Error           Fatal             Yes            Unexpected link down
Poisoned TLP Received         Non-Fatal         Yes            Data corruption flagged
Flow Control Protocol Error   Fatal             Yes            FC credits corrupted
Completion Timeout            Non-Fatal         Yes            Request never completed
Completer Abort (CA)          Non-Fatal         Yes            Completer rejected request
Unexpected Completion         Non-Fatal         Yes            Completion without request
Receiver Overflow             Fatal             Yes            Buffer overflow
Malformed TLP                 Fatal             Yes            Invalid TLP structure
ECRC Error                    Non-Fatal         Yes            End-to-end CRC mismatch
Unsupported Request (UR)      Non-Fatal         Yes            Request not supported
ACS Violation                 Non-Fatal         Yes            P2P access blocked
Uncorrectable Internal Error  Fatal             Yes            Internal logic error
MC Blocked TLP                Non-Fatal         Yes            Multicast TLP blocked
AtomicOp Egress Blocked       Non-Fatal         Yes            AtomicOp routing blocked
TLP Prefix Blocked Error      Non-Fatal         Yes            TLP Prefix not supported
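The matrix can be read as a bit mask: each error occupies one bit position, and the Severity register decides whether a set status bit counts as fatal. The sketch below is a minimal illustration that assumes the bit positions from the Uncorrectable Error Status layout in section 5.2 and derives a default severity value from the "Fatal" rows of the matrix. The names `DEFAULT_FATAL_BITS` and `split_by_severity` are invented for this example.

```python
# Bit positions of the errors whose default severity is Fatal in the matrix:
# Data Link Protocol (4), Surprise Down (5), Flow Control Protocol (13),
# Receiver Overflow (17), Malformed TLP (18), Uncorrectable Internal (22).
DEFAULT_FATAL_BITS = (4, 5, 13, 17, 18, 22)
DEFAULT_SEVERITY = sum(1 << bit for bit in DEFAULT_FATAL_BITS)


def split_by_severity(status: int, severity: int = DEFAULT_SEVERITY):
    """Split a status value into (fatal_bits, nonfatal_bits)."""
    fatal = status & severity
    nonfatal = status & ~severity & 0xFFFFFFFF
    return fatal, nonfatal


# Malformed TLP (bit 18) and Completion Timeout (bit 14) logged together.
status = (1 << 18) | (1 << 14)
fatal, nonfatal = split_by_severity(status)

# Software may promote Completion Timeout to fatal by setting its severity bit.
promoted = split_by_severity(status, DEFAULT_SEVERITY | (1 << 14))
print(hex(DEFAULT_SEVERITY), hex(fatal), hex(nonfatal), promoted)
```

With these assumptions the derived default is 0x00462030.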

3. Error Signaling Mechanisms

3.1 Error Message Types

Error Messages (Routed to Root Complex):

ERR_COR Message
  • Signals that a correctable error was detected
  • Does NOT require immediate software action
  • Logged for statistical tracking
  • Routed to Root Complex

ERR_NONFATAL Message
  • Signals an uncorrectable non-fatal error
  • Transaction failed, but the link and device remain operational
  • Software should log and potentially retry
  • May trigger an interrupt to the error handler

ERR_FATAL Message
  • Signals an uncorrectable fatal error
  • Link or device is unreliable
  • Requires a reset/recovery procedure
  • May trigger DPC if enabled

3.2 Error Message TLP Format

Error Message TLP Header (4 DW, no data payload):

Byte 0-3 (DW0):
┌─────┬───────┬─────┬─────┬─────┬──────┬──────────────┐
│ Fmt │ Type  │  R  │ TC  │  R  │ Attr │ Length (00h) │
│ 001 │ 10000 │     │ 000 │     │      │  0000000000  │
└─────┴───────┴─────┴─────┴─────┴──────┴──────────────┘
   └─ Msg Request (Type 10rrr), rrr = 000: Routed to Root Complex

Byte 4-7 (DW1):
┌───────────────────────────┬───────────┬──────────┐
│ Requester ID (Source BDF) │ Tag (00h) │ Msg Code │
│          16 bits          │  8 bits   │  8 bits  │
└───────────────────────────┴───────────┴──────────┘

Message Codes:
  ERR_COR:      0x30
  ERR_NONFATAL: 0x31
  ERR_FATAL:    0x33

Byte 8-15 (DW2-DW3):
┌──────────────────────────────────────────────────┐
│               Reserved (all zeros)               │
└──────────────────────────────────────────────────┘
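To make the field packing concrete, the following sketch builds and decodes DW0 and DW1 of an error message. Only those two DWs are modeled. The routing subfield is assumed to be 000 (Routed to Root Complex), and the helper names `bdf_to_id`, `build_error_message`, and `decode_error_message` are hypothetical.

```python
ERR_COR, ERR_NONFATAL, ERR_FATAL = 0x30, 0x31, 0x33
MSG_NAMES = {ERR_COR: "ERR_COR", ERR_NONFATAL: "ERR_NONFATAL", ERR_FATAL: "ERR_FATAL"}


def bdf_to_id(bus: int, dev: int, fn: int) -> int:
    """Pack Bus/Device/Function into a 16-bit Requester ID."""
    return (bus << 8) | (dev << 3) | fn


def build_error_message(requester_id: int, msg_code: int):
    """Return (DW0, DW1) of an error Message Request header."""
    fmt = 0b001          # 4 DW header, no data
    msg_type = 0b10000   # Msg, routing subfield 000 (assumed: to Root Complex)
    dw0 = (fmt << 29) | (msg_type << 24)           # TC, Attr, Length all zero
    dw1 = (requester_id << 16) | (0x00 << 8) | msg_code
    return dw0, dw1


def decode_error_message(dw0: int, dw1: int):
    """Return (source BDF string, message name) from DW0/DW1."""
    msg_type = (dw0 >> 24) & 0x1F
    if msg_type >> 3 != 0b10:                      # Type must be 10rrr
        raise ValueError("not a Message Request")
    rid = dw1 >> 16
    bdf = f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"
    return bdf, MSG_NAMES.get(dw1 & 0xFF, "unknown")


dw0, dw1 = build_error_message(bdf_to_id(1, 0, 0), ERR_NONFATAL)
print(hex(dw0), hex(dw1), decode_error_message(dw0, dw1))
```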

3.3 Error Signaling Flow

                     ┌─────────────────────┐
                     │   Error Detected    │
                     └──────────┬──────────┘
                                │
                                ▼
                 ┌──────────────────────────────┐
                 │  Check Error Mask Register   │
                 │      (Is error masked?)      │
                 └──────────────┬───────────────┘
                                │
                   ┌────────────┴───────────┐
                   │ Yes                    │ No
                   ▼                        ▼
         ┌──────────────────┐  ┌─────────────────────────┐
         │ Status bit set,  │  │ Log in Status Register  │
         │ not signaled     │  │ + Header Log (if appl.) │
         └──────────────────┘  └────────────┬────────────┘
                                            │
                                            ▼
                               ┌──────────────────────────┐
                               │ Determine Error Severity │
                               │ (from Severity Register) │
                               └────────────┬─────────────┘
                                            │
          ┌─────────────────────┬───────────┴─────────┐
          │                     │                     │
          ▼                     ▼                     ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Send ERR_COR      │ │ Send ERR_NONFATAL │ │ Send ERR_FATAL    │
│ Message           │ │ Message           │ │ Message           │
└─────────┬─────────┘ └─────────┬─────────┘ └─────────┬─────────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
                                ▼
                  ┌──────────────────────────┐
                  │       Root Complex       │
                  │  ┌────────────────────┐  │
                  │  │ Root Error Status  │  │
                  │  │ + Source ID        │  │
                  │  │ + Interrupt (opt.) │  │
                  │  └────────────────────┘  │
                  └──────────────────────────┘
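The uncorrectable branch of this flow can be sketched as a small software model. It follows rule R5 in section 7 (masking affects signaling, not detection), so a masked error still sets its status bit but produces no message. The class `AerState`, the function `report_uncorrectable`, and the default severity value are assumptions for illustration; the Device Control reporting-enable bits are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AerState:
    """Hypothetical model of one function's uncorrectable AER registers."""
    uncor_status: int = 0
    uncor_mask: int = 0
    uncor_severity: int = 0x00462030   # assumed default: bits 4, 5, 13, 17, 18, 22
    first_error_ptr: Optional[int] = None


def report_uncorrectable(state: AerState, bit: int) -> Optional[str]:
    """Record an error and return the message to send, or None if masked."""
    flag = 1 << bit
    state.uncor_status |= flag               # detection always sets the status bit
    if state.uncor_mask & flag:
        return None                          # masked: not signaled
    if state.first_error_ptr is None:
        state.first_error_ptr = bit          # remember the first unmasked error
    return "ERR_FATAL" if state.uncor_severity & flag else "ERR_NONFATAL"


state = AerState(uncor_mask=1 << 20)         # Unsupported Request masked
masked = report_uncorrectable(state, 20)     # no message
first = report_uncorrectable(state, 14)      # Completion Timeout
second = report_uncorrectable(state, 18)     # Malformed TLP
print(masked, first, second, state.first_error_ptr, hex(state.uncor_status))
```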

4. Error Logging

4.1 Header Log Capture

When an error is detected on a received TLP, the device captures the TLP header (up to 4 DW, 16 bytes) in the Header Log registers. The captured header identifies the offending transaction, which is critical for debugging.

Header Log Registers (4 DW = 128 bits):

Header Log Register 1 (Offset 1Ch in AER Capability)
  Contains: TLP Header DW0 (Fmt/Type, TC, Attr, Length, etc.)
  Bits [31:0] = First 4 bytes of offending TLP

Header Log Register 2 (Offset 20h)
  Contains: TLP Header DW1 (Requester ID, Tag, Byte Enables for requests; Completer ID, Status, Byte Count for completions)

Header Log Register 3 (Offset 24h)
  Contains: TLP Header DW2 (Address[31:2] for 3 DW headers, Address[63:32] for 4 DW headers)

Header Log Register 4 (Offset 28h)
  Contains: TLP Header DW3 (Address[31:2] for 4 DW headers; unused for 3 DW headers)

// Example: Decoding a captured Malformed TLP error
Header Log [0] = 0x00000001 → Fmt=000, Type=00000, Length=1
Header Log [1] = 0x01000F00 → Requester ID = 01:00.0, Tag = 0x0F
Header Log [2] = 0xFEE00000 → Address (Memory Read to 0xFEE00000)
Header Log [3] = 0x00000000 → Unused (3 DW header)
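The decoding in the example can be automated. The sketch below handles memory request headers only and assumes the standard layout in which a 4 DW header carries Address[63:32] in DW2 and Address[31:2] in DW3. The function name `decode_header_log` and the returned dictionary keys are hypothetical.

```python
def decode_header_log(log):
    """Decode a captured memory request header from four Header Log DWs."""
    dw0, dw1, dw2, dw3 = log
    fmt = (dw0 >> 29) & 0x7
    four_dw = bool(fmt & 0x1)                 # Fmt bit 0 set: 4 DW header
    rid = dw1 >> 16
    if four_dw:
        address = (dw2 << 32) | (dw3 & 0xFFFFFFFC)
    else:
        address = dw2 & 0xFFFFFFFC            # DW3 is not part of the header
    return {
        "fmt": fmt,
        "type": (dw0 >> 24) & 0x1F,
        "length_dw": dw0 & 0x3FF,
        "header_dwords": 4 if four_dw else 3,
        "requester": f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}",
        "tag": (dw1 >> 8) & 0xFF,
        "address": address,
    }


# Values from the example above.
decoded = decode_header_log([0x00000001, 0x01000F00, 0xFEE00000, 0x00000000])
print(decoded)
```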

4.2 TLP Prefix Log

When a TLP with End-End TLP Prefixes causes an error, the prefix information is logged separately:

TLP Prefix Log Registers (4 DW total, one DW per captured prefix):

First TLP Prefix Log (Offset 38h)
  Bits 31:0: First End-End TLP Prefix of the offending TLP

Max TLP Prefixes Logged: 4 (TLP Prefix Log Registers 1-4, offsets 38h-44h)

4.3 Multiple Error Handling

First Error Pointer (in AER Capability):

Advanced Error Capabilities and Control Register (Offset 18h)
  Bits [4:0]: First Error Pointer
              Holds the bit position of the first uncorrectable error logged in the Uncorrectable Error Status register

Value  Error Type
─────  ────────────────────────────────
00h    (Reserved)
04h    Data Link Protocol Error
05h    Surprise Down Error
0Ch    Poisoned TLP Received
0Dh    Flow Control Protocol Error
0Eh    Completion Timeout
0Fh    Completer Abort
10h    Unexpected Completion
11h    Receiver Overflow
12h    Malformed TLP
13h    ECRC Error
14h    Unsupported Request
15h    ACS Violation
16h    Uncorrectable Internal Error
17h    MC Blocked TLP
18h    AtomicOp Egress Blocked
19h    TLP Prefix Blocked Error
1Ah    Poisoned TLP Egress Blocked

Usage: When multiple uncorrectable errors are logged before software clears the status register, the First Error Pointer indicates which error was detected first. The Header Log contains the header from that first error.
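A handler typically extracts the pointer and checks that it still refers to a logged error. The sketch below assumes the pointer is meaningful only while the status bit it points to is set; the function name `first_error` is hypothetical.

```python
def first_error(cap_ctrl: int, uncor_status: int):
    """Return (bit position, valid) for the First Error Pointer."""
    fep = cap_ctrl & 0x1F                       # bits 4:0 of offset 18h
    valid = bool(uncor_status & (1 << fep))     # assumed validity condition
    return fep, valid


# Upper control bits set, pointer = 12h, Malformed TLP (bit 18) logged.
current = first_error(0x000000B2, 1 << 18)
# Same pointer, but only Completion Timeout (bit 14) is logged.
stale = first_error(0x00000012, 1 << 14)
print(current, stale)
```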

5. AER Register Map

5.1 Complete AER Extended Capability Structure

Offset  Size  Register Name
──────  ────  ─────────────────────────────────────────────────────────
00h     4     PCI Express Extended Capability Header
04h     4     Uncorrectable Error Status Register
08h     4     Uncorrectable Error Mask Register
0Ch     4     Uncorrectable Error Severity Register
10h     4     Correctable Error Status Register
14h     4     Correctable Error Mask Register
18h     4     Advanced Error Capabilities and Control Register
1Ch     16    Header Log Registers (4 DW)
2Ch     4     Root Error Command Register (Root Ports only)
30h     4     Root Error Status Register (Root Ports only)
34h     4     Error Source Identification Register (Root Ports only)
38h     4     TLP Prefix Log Register 1
3Ch     4     TLP Prefix Log Register 2
40h     4     TLP Prefix Log Register 3
44h     4     TLP Prefix Log Register 4

5.2 Uncorrectable Error Status Register (Offset 04h)

Bit   Field Name                    Access  Description
────  ────────────────────────────  ──────  ─────────────────────────
3:0   Reserved                      -
4     Data Link Protocol Error      RW1C    Data Link Layer protocol error
5     Surprise Down Error           RW1C    Link went down unexpectedly
11:6  Reserved                      -
12    Poisoned TLP Received         RW1C    EP bit set in TLP
13    Flow Control Protocol Error   RW1C    FC credit error
14    Completion Timeout            RW1C    Completion not received
15    Completer Abort               RW1C    CA status returned
16    Unexpected Completion         RW1C    Completion without request
17    Receiver Overflow             RW1C    Buffer overflow
18    Malformed TLP                 RW1C    Invalid TLP format
19    ECRC Error                    RW1C    ECRC check failed
20    Unsupported Request           RW1C    UR completion sent
21    ACS Violation                 RW1C    P2P blocked by ACS
22    Uncorrectable Internal Error  RW1C    Internal logic error
23    MC Blocked TLP                RW1C    Multicast blocked
24    AtomicOp Egress Blocked       RW1C    AtomicOp routing error
25    TLP Prefix Blocked Error      RW1C    Prefix not supported
26    Poisoned TLP Egress Blocked   RW1C    Poison egress blocked

RW1C: Read/Write-1-to-Clear (write 1 to clear the bit)
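For logging, a raw status value is usually translated into the field names from the table. The following sketch does that with a lookup keyed by bit position; `UNCOR_BITS`, `decode_uncorrectable`, and `unknown_bits` are names chosen for this example.

```python
# Bit position -> field name, taken from the table above.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked Error",
    26: "Poisoned TLP Egress Blocked",
}


def decode_uncorrectable(status: int):
    """Return the names of all set error bits, lowest bit first."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if status & (1 << bit)]


def unknown_bits(status: int) -> int:
    """Return set bits that the table lists as reserved or does not list."""
    known = sum(1 << bit for bit in UNCOR_BITS)
    return status & ~known & 0xFFFFFFFF


names = decode_uncorrectable(0x00044000)   # bits 14 and 18
print(names)
```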

5.3 Correctable Error Status Register (Offset 10h)

Bit  Field Name                Access  Description
───  ────────────────────────  ──────  ──────────────────────────
0    Receiver Error            RW1C    8b/10b or framing error
6    Bad TLP                   RW1C    LCRC error detected
7    Bad DLLP                  RW1C    DLLP CRC error
8    REPLAY_NUM Rollover       RW1C    Max replays exceeded
12   Replay Timer Timeout      RW1C    ACK not received
13   Advisory Non-Fatal Error  RW1C    Converted from uncorrectable
14   Corrected Internal Error  RW1C    Internal error fixed
15   Header Log Overflow       RW1C    Header log full

5.4 Root Error Status Register (Offset 30h) - Root Ports Only

Bit    Field Name                     Access  Description
─────  ─────────────────────────────  ──────  ──────────────────────────────────
0      ERR_COR Received               RW1C    Correctable error message received
1      Multiple ERR_COR Received      RW1C    >1 ERR_COR before clear
2      ERR_FATAL/NONFATAL Received    RW1C    Uncorrectable error message received
3      Multiple ERR_F/NF Received     RW1C    >1 ERR_FATAL/NONFATAL before clear
4      First Uncorrectable Fatal      RW1C    First uncorrectable message was fatal
5      Non-Fatal Error Messages Rcvd  RW1C    One or more ERR_NONFATAL received
6      Fatal Error Messages Received  RW1C    One or more ERR_FATAL received
31:27  Advanced Error Interrupt Msg#  RO      MSI/MSI-X vector
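A Root Port handler usually splits this register into flags before deciding what to do. The sketch below is one way to do that; it assumes the Advanced Error Interrupt Message Number occupies bits 31:27, and the function name `decode_root_status` and its dictionary keys are hypothetical.

```python
def decode_root_status(value: int) -> dict:
    """Split a Root Error Status value into named fields."""
    return {
        "cor_received": bool(value & (1 << 0)),
        "multi_cor": bool(value & (1 << 1)),
        "uncor_received": bool(value & (1 << 2)),
        "multi_uncor": bool(value & (1 << 3)),
        "first_uncor_fatal": bool(value & (1 << 4)),
        "nonfatal_received": bool(value & (1 << 5)),
        "fatal_received": bool(value & (1 << 6)),
        "interrupt_msg_num": (value >> 27) & 0x1F,   # assumed bits 31:27
    }


# One ERR_FATAL received (bits 2, 4, 6), interrupt message number 1.
root = decode_root_status(0x08000054)
print(root)
```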

6. Error Handling Procedures

6.1 Software Error Handler Algorithm

Error Handler Pseudocode:

error_handler(device):
    // Step 1: Read Device Status Register
    dev_status = config_read(device, DEV_STATUS)

    // Step 2: Check for errors
    if dev_status.correctable_error:
        // Read Correctable Error Status
        corr_status = config_read(device, AER_CORR_ERR_STATUS)
        log_correctable_errors(corr_status)
        // Clear by writing 1s to status bits
        config_write(device, AER_CORR_ERR_STATUS, corr_status)

    if dev_status.non_fatal_error OR dev_status.fatal_error:
        // Read Uncorrectable Error Status
        uncorr_status = config_read(device, AER_UNCORR_ERR_STATUS)

        // Read First Error Pointer
        first_error = config_read(device, AER_CAP_CTRL).first_error_ptr

        // Read Header Log for debugging
        header[0] = config_read(device, AER_HEADER_LOG_0)
        header[1] = config_read(device, AER_HEADER_LOG_1)
        header[2] = config_read(device, AER_HEADER_LOG_2)
        header[3] = config_read(device, AER_HEADER_LOG_3)

        // Log and analyze
        analyze_tlp_header(header, first_error)

        // Determine severity and take action
        severity = config_read(device, AER_UNCORR_ERR_SEV)
        if uncorr_status & severity:
            // Fatal error
            perform_device_reset(device)
        else:
            // Non-fatal
            attempt_error_recovery(device)

        // Clear error status
        config_write(device, AER_UNCORR_ERR_STATUS, uncorr_status)

    // Clear Device Status
    config_write(device, DEV_STATUS, dev_status)
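The handler clears each status register by writing back the value it read. A small model of write-1-to-clear behavior shows why this matters: an error that arrives between the read and the write is not lost. The class `Rw1cRegister` and its `hw_set` method are illustrative stand-ins for hardware, not a real API.

```python
class Rw1cRegister:
    """Toy model of a 32-bit write-1-to-clear status register."""

    def __init__(self, value: int = 0):
        self.value = value

    def read(self) -> int:
        return self.value

    def write(self, data: int) -> None:
        # Each 1 written clears that bit; each 0 leaves the bit untouched.
        self.value &= ~data & 0xFFFFFFFF

    def hw_set(self, bit: int) -> None:
        # Stand-in for hardware logging a new error.
        self.value |= 1 << bit


reg = Rw1cRegister()
reg.hw_set(14)            # Completion Timeout logged
snapshot = reg.read()     # handler reads the status
reg.hw_set(18)            # Malformed TLP arrives after the read
reg.write(snapshot)       # handler clears only what it saw
print(hex(reg.read()))
```

Writing 0xFFFFFFFF instead of the snapshot would also clear bit 18, so the second error would never be handled.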

6.2 Root Complex Error Handling

Root Port Error Handler:

root_error_handler(root_port):
    // Step 1: Read Root Error Status
    root_status = config_read(root_port, ROOT_ERR_STATUS)

    // Step 2: Identify error source
    if root_status.err_cor_received:
        // Correctable error from downstream device
        source_id = config_read(root_port, ERR_SRC_ID).corr_src_id
        device = get_device_by_bdf(source_id)
        handle_correctable_error(device)

    if root_status.err_fatal_nonfatal_received:
        // Uncorrectable error from downstream device
        source_id = config_read(root_port, ERR_SRC_ID).uncorr_src_id
        device = get_device_by_bdf(source_id)
        is_fatal = root_status.first_uncorr_fatal
        if is_fatal:
            // Perform fatal error recovery
            // May need to reset entire hierarchy
            perform_hierarchy_reset(root_port)
        else:
            handle_nonfatal_error(device)

    // Step 3: Clear Root Error Status
    config_write(root_port, ROOT_ERR_STATUS, root_status)
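The `corr_src_id` and `uncorr_src_id` fields used above are both Requester IDs packed into the Error Source Identification register. The sketch below assumes the correctable source sits in bits 15:0 and the uncorrectable source in bits 31:16; the function name `decode_error_source_id` is hypothetical.

```python
def decode_error_source_id(value: int) -> dict:
    """Split Error Source Identification into two BDF strings."""

    def to_bdf(rid: int) -> str:
        # Requester ID: bus in bits 15:8, device in 7:3, function in 2:0.
        return f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"

    return {
        "correctable": to_bdf(value & 0xFFFF),            # assumed bits 15:0
        "uncorrectable": to_bdf((value >> 16) & 0xFFFF),  # assumed bits 31:16
    }


sources = decode_error_source_id(0x02080100)
print(sources)
```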

6.3 Linux AER Commands

# View AER errors for a device:
$ lspci -vvv -s 01:00.0 | grep -A 20 "Advanced Error Reporting"

# Check AER status in sysfs:
$ cat /sys/bus/pci/devices/0000:01:00.0/aer_dev_correctable
$ cat /sys/bus/pci/devices/0000:01:00.0/aer_dev_fatal
$ cat /sys/bus/pci/devices/0000:01:00.0/aer_dev_nonfatal

# View kernel AER messages:
$ dmesg | grep -i aer

# Unmask all uncorrectable errors (AER reporting itself is usually enabled by default):
$ setpci -s 01:00.0 ECAP_AER+0x8.L=0x00000000   # Writes 0 to the Uncorrectable Error Mask register

# Example AER error in dmesg:
[  123.456789] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:01:00.0
[  123.456790] pcie 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer
[  123.456791] pcie 0000:01:00.0: device [8086:1234] error status/mask=00000001/00000000
[  123.456792] pcie 0000:01:00.0: [ 0] Receiver Error
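The sysfs counter files can also be consumed by a script. The sketch below parses text of the form "name count" per line; the sample text and its counter names are illustrative only, since the exact names may differ between kernel versions. The function name `parse_aer_counters` is hypothetical.

```python
def parse_aer_counters(text: str) -> dict:
    """Parse 'name count' lines into a dict, skipping malformed lines."""
    counters = {}
    for line in text.splitlines():
        # The count is the last space-separated token; names may contain spaces.
        name, _, count = line.strip().rpartition(" ")
        if name and count.isdigit():
            counters[name.strip()] = int(count)
    return counters


# Illustrative sample, not captured from a real system.
sample = "RxErr 3\nBadTLP 0\nTOTAL_ERR_COR 3\n"
counters = parse_aer_counters(sample)
print(counters)
```

In practice the input text would be read from one of the `aer_dev_*` files shown above.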

7. Normative Rules

AER Implementation Rules

  1. R1: All devices MUST implement error detection for Malformed TLP errors.
  2. R2: Devices MUST signal ERR_FATAL for errors that render the link unreliable.
  3. R3: The First Error Pointer MUST point to the first uncorrectable error detected since last cleared.
  4. R4: Header Log MUST capture the header of the TLP associated with the First Error Pointer.
  5. R5: Error Mask registers MUST NOT affect error detection, only error signaling.
  6. R6: Root Ports MUST be capable of receiving and logging error messages from downstream devices.
  7. R7: Software MUST clear error status bits by writing 1 to the corresponding bit positions.
  8. R8: Error Severity register bits MUST only affect errors that are programmable-severity.
  9. R9: Advisory Non-Fatal errors SHOULD be logged as Correctable unless upgraded by software.
  10. R10: Devices supporting TLP Prefixes MUST log prefix information when prefix-related errors occur.