1. AER Overview
Advanced Error Reporting (AER) is an optional Extended Capability defined in the PCIe specification that provides more detailed error reporting than the baseline PCI Express Capability. AER adds fine-grained error classification, error logging with TLP header capture, and per-error masking and severity programming.
Why AER Matters for System Reliability
- Precise Error Identification: Distinguishes between correctable, non-fatal, and fatal errors
- Root Cause Analysis: Captures TLP headers for debugging failed transactions
- Flexible Severity: Allows system software to program error severity levels
- Error Containment: Works with Downstream Port Containment (DPC) for automatic error containment
- Multi-Error Handling: Tracks multiple simultaneous errors
AER Capability Structure Location
AER Extended Capability ID: 0x0001
Capability Version: 2
Location: Extended Configuration Space (offsets 0x100 to 0xFFF)
Extended Capability Header Format:
┌─────────────────────────────────────────────────────────────────┐
│ 31:20 │ 19:16 │ 15:0 │
│ Next Cap Ptr │ Cap Version │ Capability ID (0x0001) │
└─────────────────────────────────────────────────────────────────┘
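As a minimal sketch of how software might locate the AER structure, the following Python walks the Extended Capability list in a raw configuration-space dump using the header layout above. The function names and the `cfg` byte buffer are illustrative assumptions; a real tool would read configuration space through whatever mechanism the platform provides.

```python
import struct

AER_CAP_ID = 0x0001      # AER Extended Capability ID
EXT_CAP_START = 0x100    # first Extended Capability header lives here

def parse_ext_cap_header(dword):
    """Split a 32-bit Extended Capability header into its three fields."""
    return {
        "cap_id": dword & 0xFFFF,          # bits 15:0
        "version": (dword >> 16) & 0xF,    # bits 19:16
        "next": (dword >> 20) & 0xFFF,     # bits 31:20
    }

def find_ext_cap(cfg, cap_id):
    """Return the offset of `cap_id` in a config-space dump, or None."""
    offset = EXT_CAP_START
    seen = set()                           # guards against looping lists
    while offset >= EXT_CAP_START and offset + 4 <= len(cfg) and offset not in seen:
        seen.add(offset)
        (dword,) = struct.unpack_from("<I", cfg, offset)
        if dword in (0, 0xFFFFFFFF):       # no capability at this offset
            return None
        hdr = parse_ext_cap_header(dword)
        if hdr["cap_id"] == cap_id:
            return offset
        offset = hdr["next"] & 0xFFC       # a next pointer of 0 ends the list
    return None

# Synthetic 4 KiB dump: capability 0x0003 at 0x100 links to AER at 0x148.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, (0x148 << 20) | (1 << 16) | 0x0003)
struct.pack_into("<I", cfg, 0x148, (0x000 << 20) | (2 << 16) | AER_CAP_ID)
aer_offset = find_ext_cap(bytes(cfg), AER_CAP_ID)
```

All AER register offsets used later in this document are relative to the offset this search returns.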
2. Error Classification
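PCIe errors fall into two classes, correctable and uncorrectable, and uncorrectable errors are further split into non-fatal and fatal. This gives three categories, based on impact on system operation and recoverability: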
Correctable Errors
Errors that the hardware can recover from without software intervention. These indicate potential issues but don't require immediate action.
- Receiver Error (Physical Layer)
- Bad TLP (LCRC error)
- Bad DLLP (CRC error)
- REPLAY_NUM Rollover
- Replay Timer Timeout
- Advisory Non-Fatal (optional)
- Corrected Internal Error
- Header Log Overflow
Uncorrectable Errors
Errors that cannot be automatically recovered and require software/system intervention.
- Fatal (by default): Data Link Protocol Error, Surprise Down, Flow Control Protocol Error, Malformed TLP, Receiver Overflow, Uncorrectable Internal Error
- Non-Fatal (by default): Poisoned TLP, Completion Timeout, Completer Abort, Unexpected Completion, ECRC Error, Unsupported Request, ACS Violation
Error Severity Matrix
| Error Type                   | Default Severity | Programmable? | Impact                          |
|------------------------------|------------------|---------------|---------------------------------|
| Data Link Protocol Error     | Fatal            | Yes           | Link unreliable, requires reset |
| Surprise Down Error          | Fatal            | No            | Unexpected link down            |
| Poisoned TLP Received        | Non-Fatal        | Yes           | Data corruption flagged         |
| Flow Control Protocol Error  | Fatal            | Yes           | FC credits corrupted            |
| Completion Timeout           | Non-Fatal        | Yes           | Request never completed         |
| Completer Abort (CA)         | Non-Fatal        | Yes           | Completer rejected request      |
| Unexpected Completion        | Non-Fatal        | Yes           | Completion without request      |
| Receiver Overflow            | Fatal            | Yes           | Buffer overflow                 |
| Malformed TLP                | Fatal            | Yes           | Invalid TLP structure           |
| ECRC Error                   | Non-Fatal        | Yes           | End-to-end CRC mismatch         |
| Unsupported Request (UR)     | Non-Fatal        | Yes           | Request not supported           |
| ACS Violation                | Non-Fatal        | Yes           | P2P access blocked              |
| Uncorrectable Internal Error | Fatal            | Yes           | Internal logic error            |
| MC Blocked TLP               | Non-Fatal        | Yes           | Multicast TLP blocked           |
| AtomicOp Egress Blocked      | Non-Fatal        | Yes           | AtomicOp routing blocked        |
| TLP Prefix Blocked Error     | Non-Fatal        | Yes           | TLP Prefix not supported        |
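The defaults in the matrix can be folded into a single register value. The sketch below is illustrative only: it takes the status-bit positions from the Uncorrectable Error Status layout in section 5.2, builds the severity mask implied by the "Default Severity" column, and then splits a pending status value into fatal and non-fatal bits. The names `UNCORR_BIT`, `DEFAULT_FATAL`, and `split_by_severity` are hypothetical helpers, not part of any specification or driver API.

```python
# Status-bit positions as listed for the Uncorrectable Error Status register.
UNCORR_BIT = {
    "Data Link Protocol Error": 4,
    "Surprise Down Error": 5,
    "Poisoned TLP Received": 12,
    "Flow Control Protocol Error": 13,
    "Completion Timeout": 14,
    "Completer Abort": 15,
    "Unexpected Completion": 16,
    "Receiver Overflow": 17,
    "Malformed TLP": 18,
    "ECRC Error": 19,
    "Unsupported Request": 20,
    "ACS Violation": 21,
    "Uncorrectable Internal Error": 22,
    "MC Blocked TLP": 23,
    "AtomicOp Egress Blocked": 24,
    "TLP Prefix Blocked Error": 25,
}

# Errors whose default severity in the matrix is Fatal.
DEFAULT_FATAL = (
    "Data Link Protocol Error",
    "Surprise Down Error",
    "Flow Control Protocol Error",
    "Receiver Overflow",
    "Malformed TLP",
    "Uncorrectable Internal Error",
)

def default_severity_mask():
    """Build the severity value with a 1 for every default-fatal error."""
    mask = 0
    for name in DEFAULT_FATAL:
        mask |= 1 << UNCORR_BIT[name]
    return mask

def split_by_severity(status, severity):
    """Return (fatal_bits, nonfatal_bits) for the errors set in `status`."""
    return status & severity, status & ~severity & 0xFFFFFFFF

DEFAULT_SEVERITY = default_severity_mask()
# Poisoned TLP (bit 12) and Malformed TLP (bit 18) pending at the same time:
fatal, nonfatal = split_by_severity((1 << 12) | (1 << 18), DEFAULT_SEVERITY)
```

With the defaults from the matrix the mask evaluates to 0x00462030, so in this example only the Malformed TLP bit lands in `fatal`.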
3. Error Signaling Mechanisms
3.1 Error Message Types
Error Messages (implicitly routed to the Root Complex):
┌──────────────────────────────────────────────────────────────┐
│ ERR_COR Message │
│ ───────────────── │
│ • Signals correctable error detected │
│ • Does NOT require immediate software action │
│ • Logged for statistical tracking │
│ • Routed to Root Complex │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ ERR_NONFATAL Message │
│ ───────────────────── │
│ • Signals uncorrectable non-fatal error │
│ • Transaction failed but link/device operational │
│ • Software should log and potentially retry │
│ • May trigger interrupt to error handler │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ ERR_FATAL Message │
│ ───────────────── │
│ • Signals uncorrectable fatal error │
│ • Link or device unreliable │
│ • Requires reset/recovery procedure │
│ • May trigger DPC if enabled │
└──────────────────────────────────────────────────────────────┘
3.2 Error Message TLP Format
Error Message TLP Header (4 DW, no data payload):
Byte 0-3 (DW0):
┌─────┬───────┬─────┬─────┬─────┬──────┬──────────────┐
│ Fmt │ Type  │  R  │ TC  │  R  │ Attr │ Length (00h) │
│ 001 │ 10000 │     │ 000 │     │      │  0000000000  │
└─────┴───────┴─────┴─────┴─────┴──────┴──────────────┘
   └─ Fmt 001 + Type 10000: Msg Request (4 DW header, no data), routed to Root Complex
Byte 4-7 (DW1):
┌───────────────────────────┬───────────┬──────────┐
│ Requester ID (Source BDF) │ Tag (00h) │ Msg Code │
│          16 bits          │  8 bits   │  8 bits  │
└───────────────────────────┴───────────┴──────────┘
Message Codes:
ERR_COR: 0x30
ERR_NONFATAL: 0x31
ERR_FATAL: 0x33
Byte 8-15 (DW2-DW3):
┌──────────────────────────────────────────────────┐
│               Reserved (all zeros)               │
└──────────────────────────────────────────────────┘
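To make the field packing concrete, here is a small sketch that assembles the header DWs for an error message. It assumes the 4 DW header that PCIe Message Requests use, with Fmt = 001, Type = 10000 (routed to Root Complex), and the last two DWs reserved. `build_error_message` and `bdf_to_id` are hypothetical helper names.

```python
MSG_CODES = {"ERR_COR": 0x30, "ERR_NONFATAL": 0x31, "ERR_FATAL": 0x33}

def bdf_to_id(bus, dev, fn):
    """Pack Bus/Device/Function into a 16-bit Requester ID."""
    return ((bus & 0xFF) << 8) | ((dev & 0x1F) << 3) | (fn & 0x7)

def build_error_message(requester_id, kind):
    """Return the four header DWs of an ERR_* message as integers."""
    fmt = 0b001           # 4 DW header, no data payload
    msg_type = 0b10000    # Message Request, routed to Root Complex
    dw0 = (fmt << 29) | (msg_type << 24)          # TC, Attr, Length all zero
    dw1 = ((requester_id & 0xFFFF) << 16) | (0x00 << 8) | MSG_CODES[kind]
    return [dw0, dw1, 0x00000000, 0x00000000]     # DW2 and DW3 are reserved

header = build_error_message(bdf_to_id(1, 0, 0), "ERR_FATAL")
```

For device 01:00.0 reporting a fatal error this yields 0x30000000, 0x01000033, 0, 0.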
3.3 Error Signaling Flow
                   ┌─────────────────────┐
                   │   Error Detected    │
                   └──────────┬──────────┘
                              │
                              ▼
               ┌──────────────────────────────┐
               │  Check Error Mask Register   │
               │      (Is error masked?)      │
               └──────────────┬───────────────┘
                              │
                 ┌────────────┴────────────┐
                 │ Yes                     │ No
                 ▼                         ▼
        ┌─────────────────┐    ┌─────────────────────────┐
        │ Status bit set; │    │ Log in Status Register  │
        │ (not signaled)  │    │ + Header Log (if appl.) │
        └─────────────────┘    └───────────┬─────────────┘
                                           │
                                           ▼
                              ┌──────────────────────────┐
                              │ Determine Error Severity │
                              │ (from Severity Register) │
                              └────────────┬─────────────┘
                                           │
                   ┌───────────────────────┼───────────────────────┐
                   │                       │                       │
                   ▼                       ▼                       ▼
         ┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
         │   Send ERR_COR    │   │ Send ERR_NONFATAL │   │  Send ERR_FATAL   │
         │      Message      │   │      Message      │   │      Message      │
         └─────────┬─────────┘   └─────────┬─────────┘   └─────────┬─────────┘
                   │                       │                       │
                   └───────────────────────┼───────────────────────┘
                                           │
                                           ▼
                              ┌──────────────────────────┐
                              │       Root Complex       │
                              │  ┌────────────────────┐  │
                              │  │ Root Error Status  │  │
                              │  │ + Source ID        │  │
                              │  │ + Interrupt (opt.) │  │
                              │  └────────────────────┘  │
                              └──────────────────────────┘
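The decision in the flow above reduces to two register lookups. The following sketch models it for a single error bit. It is a simplification under stated assumptions: it ignores the reporting-enable bits in the Device Control register and the Advisory Non-Fatal case, and the function names are hypothetical.

```python
def signal_uncorrectable(bit, mask, severity):
    """Pick the message for an uncorrectable error at status bit `bit`.

    Returns None when the error is masked (nothing is sent upstream).
    """
    flag = 1 << bit
    if mask & flag:
        return None
    return "ERR_FATAL" if severity & flag else "ERR_NONFATAL"

def signal_correctable(bit, mask):
    """Correctable errors have no severity bit: either ERR_COR or nothing."""
    return None if mask & (1 << bit) else "ERR_COR"

SEVERITY = 0x00462030   # example severity value with bits 4, 5, 13, 17, 18, 22 set

malformed = signal_uncorrectable(18, mask=0, severity=SEVERITY)
timeout = signal_uncorrectable(14, mask=0, severity=SEVERITY)
masked_timeout = signal_uncorrectable(14, mask=1 << 14, severity=SEVERITY)
receiver_err = signal_correctable(0, mask=0)
```

Raising a severity bit is all it takes to turn a non-fatal error into a fatal one: with bit 14 set in `severity`, the Completion Timeout case above would return ERR_FATAL instead.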
4. Error Logging
4.1 Header Log Capture
When an error is associated with a received TLP, the device captures that TLP's header (up to 4 DW, 16 bytes) in the Header Log registers. The captured header identifies the transaction that failed, which is critical for debugging.
Header Log Registers (4 DW = 128 bits):
┌─────────────────────────────────────────────────────────────────┐
│ Header Log Register 1 (Offset 1Ch in AER Capability) │
│ ─────────────────────────────────────────────────────────── │
│ Contains: TLP Header DW0 (Fmt/Type, Length, TC, Attr, etc.) │
│ Bits [31:0] = First 4 bytes of offending TLP │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Header Log Register 2 (Offset 20h) │
│ ───────────────────────────────── │
│ Contains: TLP Header DW1 (Requester ID, Tag, etc.) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Header Log Register 3 (Offset 24h) │
│ ───────────────────────────────── │
│ Contains: TLP Header DW2 (Address[31:2] for 3DW, Address[63:32] for 4DW) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Header Log Register 4 (Offset 28h) │
│ ───────────────────────────────── │
│ Contains: TLP Header DW3 (Address[31:2] for 4DW headers; unused for 3DW) │
└─────────────────────────────────────────────────────────────────┘
Example (3 DW Memory Read request):
Header Log [0] = 0x00000001 → Fmt=000, Type=00000 (MRd, 3 DW header), Length=1 DW
Header Log [1] = 0x01000F00 → Requester ID = 01:00.0, Tag = 0x0F
Header Log [2] = 0xFEE00000 → Address[31:2] (Memory Read to 0xFEE00000)
Header Log [3] = 0x00000000 → Unused (3 DW header)
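A decoder for the logged header might look like the sketch below. It handles memory request headers only, and it assumes each Header Log value is read as one big-endian header DW (byte 0 of the TLP in bits 31:24), as in the example values above. For 4 DW headers it assumes DW2 carries the upper address bits and DW3 the lower ones. The function names are hypothetical.

```python
def format_bdf(requester_id):
    """Render a 16-bit Requester ID as bus:dev.fn."""
    bus = requester_id >> 8
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"

def decode_mem_request_header(log):
    """Decode the four Header Log DWs of a memory request TLP."""
    dw0, dw1, dw2, dw3 = log
    fmt = (dw0 >> 29) & 0x7
    four_dw = bool(fmt & 0x1)              # Fmt bit 0 selects a 4 DW header
    if four_dw:
        address = (dw2 << 32) | (dw3 & 0xFFFFFFFC)
    else:
        address = dw2 & 0xFFFFFFFC         # DW3 is not part of a 3 DW header
    return {
        "fmt": fmt,
        "type": (dw0 >> 24) & 0x1F,
        "length_dw": dw0 & 0x3FF,
        "requester": format_bdf(dw1 >> 16),
        "tag": (dw1 >> 8) & 0xFF,
        "address": address,
    }

decoded = decode_mem_request_header([0x00000001, 0x01000F00, 0xFEE00000, 0x00000000])
```

Applied to the example values, this reports a 1 DW read from requester 01:00.0 with tag 0x0F at address 0xFEE00000.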
4.2 TLP Prefix Log
When a TLP with End-End TLP Prefixes causes an error, the prefix information is logged separately:
TLP Prefix Log Registers (4 DW, one DW per logged End-End TLP Prefix):
┌─────────────────────────────────────────────────────────────────┐
│ First TLP Prefix Log (Offset 38h) │
│ ─────────────────────────────────── │
│ Bits 31:0: First End-End TLP Prefix of the offending TLP │
│ Valid only when TLP Prefix Log Present (bit 11, Offset 18h) = 1 │
└─────────────────────────────────────────────────────────────────┘
Max TLP Prefixes Logged: 4 (limited by the Max End-End TLP Prefixes field in Device Capabilities 2)
4.3 Multiple Error Handling
First Error Pointer (in AER Capability):
┌─────────────────────────────────────────────────────────────────┐
│ Advanced Error Capabilities and Control Register (Offset 18h) │
│ ───────────────────────────────────────────────────────────── │
│ Bits [4:0]: First Error Pointer │
│ Points to first uncorrectable error bit set │
│ │
│ Value Error Type │
│ ───── ────────────────────────────────────── │
│ 00h (Reserved) │
│ 04h Data Link Protocol Error │
│ 05h Surprise Down Error │
│ 0Ch Poisoned TLP Received │
│ 0Dh Flow Control Protocol Error │
│ 0Eh Completion Timeout │
│ 0Fh Completer Abort │
│ 10h Unexpected Completion │
│ 11h Receiver Overflow │
│ 12h Malformed TLP │
│ 13h ECRC Error │
│ 14h Unsupported Request │
│ 15h ACS Violation │
│ 16h Uncorrectable Internal Error │
│ 17h MC Blocked TLP │
│ 18h AtomicOp Egress Blocked │
│ 19h TLP Prefix Blocked Error │
│ 1Ah Poisoned TLP Egress Blocked │
└─────────────────────────────────────────────────────────────────┘
Usage: When multiple uncorrectable errors are pending, the First Error Pointer
identifies the one that was detected first, and the Header Log holds the header
of the TLP associated with that error. The pointer is meaningful only while the
status bit it names is still set.
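The lookup in the table above is easy to express in code. This sketch maps a First Error Pointer value to an error name and treats the pointer as stale when the status bit it names is not set in the Uncorrectable Error Status register. `UNCORR_ERROR_NAMES` and `first_error` are hypothetical names.

```python
UNCORR_ERROR_NAMES = {
    0x04: "Data Link Protocol Error",
    0x05: "Surprise Down Error",
    0x0C: "Poisoned TLP Received",
    0x0D: "Flow Control Protocol Error",
    0x0E: "Completion Timeout",
    0x0F: "Completer Abort",
    0x10: "Unexpected Completion",
    0x11: "Receiver Overflow",
    0x12: "Malformed TLP",
    0x13: "ECRC Error",
    0x14: "Unsupported Request",
    0x15: "ACS Violation",
    0x16: "Uncorrectable Internal Error",
    0x17: "MC Blocked TLP",
    0x18: "AtomicOp Egress Blocked",
    0x19: "TLP Prefix Blocked Error",
    0x1A: "Poisoned TLP Egress Blocked",
}

def first_error(cap_ctrl, uncorr_status):
    """Name the first uncorrectable error, or None if the pointer is stale."""
    fep = cap_ctrl & 0x1F                      # bits 4:0 of the register at 18h
    if not (uncorr_status >> fep) & 1:
        return None                            # named status bit is not set
    return UNCORR_ERROR_NAMES.get(fep, f"Reserved (bit {fep})")

# Completion Timeout arrived first, Unsupported Request followed.
name = first_error(cap_ctrl=0x0000000E, uncorr_status=(1 << 14) | (1 << 20))
```

Checking the status bit first keeps a handler from attributing a freshly captured header to an error that has already been cleared.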
5. AER Register Map
5.1 Complete AER Extended Capability Structure
Offset Size Register Name
────── ──── ─────────────────────────────────────────────
00h 4 PCI Express Extended Capability Header
04h 4 Uncorrectable Error Status Register
08h 4 Uncorrectable Error Mask Register
0Ch 4 Uncorrectable Error Severity Register
10h 4 Correctable Error Status Register
14h 4 Correctable Error Mask Register
18h 4 Advanced Error Capabilities and Control Register
1Ch 16 Header Log Registers (4 DW)
2Ch 4 Root Error Command Register (Root Ports only)
30h 4 Root Error Status Register (Root Ports only)
34h 4 Error Source Identification Register (Root Ports only)
38h 4 TLP Prefix Log Register 1
3Ch 4 TLP Prefix Log Register 2
40h 4 TLP Prefix Log Register 3
44h 4 TLP Prefix Log Register 4
5.2 Uncorrectable Error Status Register (Offset 04h)
┌───────────────────────────────────────────────────────────────────────┐
│ Bit │ Field Name │ Access │ Description │
├─────┼───────────────────────────────┼────────┼───────────────────────┤
│ 0 │ Reserved │ - │ │
│ 1 │ Reserved │ - │ │
│ 2 │ Reserved │ - │ │
│ 3 │ Reserved │ - │ │
│ 4 │ Data Link Protocol Error │ RW1C │ DLLP/TLP error │
│ 5 │ Surprise Down Error │ RW1C │ Link down unexpect. │
│11:6 │ Reserved │ - │ │
│ 12 │ Poisoned TLP Received │ RW1C │ EP bit set in TLP │
│ 13 │ Flow Control Protocol Error │ RW1C │ FC credit error │
│ 14 │ Completion Timeout │ RW1C │ Cpl not received │
│ 15 │ Completer Abort │ RW1C │ CA status returned │
│ 16 │ Unexpected Completion │ RW1C │ Cpl without request │
│ 17 │ Receiver Overflow │ RW1C │ Buffer overflow │
│ 18 │ Malformed TLP │ RW1C │ Invalid TLP format │
│ 19 │ ECRC Error │ RW1C │ ECRC check failed │
│ 20 │ Unsupported Request │ RW1C │ UR completion sent │
│ 21 │ ACS Violation │ RW1C │ P2P blocked by ACS │
│ 22 │ Uncorrectable Internal Error │ RW1C │ Internal logic error │
│ 23 │ MC Blocked TLP │ RW1C │ Multicast blocked │
│ 24 │ AtomicOp Egress Blocked │ RW1C │ AtomicOp routing err │
│ 25 │ TLP Prefix Blocked Error │ RW1C │ Prefix not supported │
│ 26 │ Poisoned TLP Egress Blocked │ RW1C │ Poison egress blocked │
└───────────────────────────────────────────────────────────────────────┘
RW1C: Read/Write-1-to-Clear (write 1 to clear the bit)
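RW1C semantics are why error handlers write back the exact value they read: only the bits that were observed get cleared, and an error that arrives in between stays pending. The sketch below models this behavior in plain Python; `rw1c_write` is a hypothetical model of the hardware, not a driver API.

```python
def rw1c_write(current, written):
    """Model a write to a RW1C register: each 1 written clears that bit."""
    return current & ~written & 0xFFFFFFFF

status = (1 << 14) | (1 << 18)         # Completion Timeout + Malformed TLP pending
snapshot = status                      # value the handler read and logged
status |= 1 << 20                      # Unsupported Request arrives afterwards
status = rw1c_write(status, snapshot)  # handler writes its snapshot back
```

After the write-back only bit 20 remains set, so the late error is not lost. Writing 0xFFFFFFFF instead would have cleared it silently.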
5.3 Correctable Error Status Register (Offset 10h)
┌───────────────────────────────────────────────────────────────────────┐
│ Bit │ Field Name │ Access │ Description │
├─────┼───────────────────────────────┼────────┼───────────────────────┤
│ 0 │ Receiver Error │ RW1C │ 8b/10b or framing err │
│ 6 │ Bad TLP │ RW1C │ LCRC error detected │
│ 7 │ Bad DLLP │ RW1C │ DLLP CRC error │
│ 8 │ REPLAY_NUM Rollover │ RW1C │ Max replays exceeded │
│ 12 │ Replay Timer Timeout │ RW1C │ ACK not received │
│ 13 │ Advisory Non-Fatal Error │ RW1C │ Converted from uncorr │
│ 14 │ Corrected Internal Error │ RW1C │ Internal error fixed │
│ 15 │ Header Log Overflow │ RW1C │ Header log full │
└───────────────────────────────────────────────────────────────────────┘
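Turning a raw Correctable Error Status value into names is a simple bit scan. The sketch below uses the bit positions from the table above; `CORR_ERROR_NAMES` and `decode_correctable` are hypothetical names.

```python
CORR_ERROR_NAMES = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

def decode_correctable(status):
    """List the names of all error bits set in a Correctable Error Status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(CORR_ERROR_NAMES.get(bit, f"Reserved (bit {bit})"))
    return names

single = decode_correctable(0x00000001)
several = decode_correctable(0x000020C0)
```

Here `single` holds only "Receiver Error", while `several` holds Bad TLP, Bad DLLP, and Advisory Non-Fatal Error.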
5.4 Root Error Status Register (Offset 30h) - Root Ports Only
┌───────────────────────────────────────────────────────────────────────┐
│ Bit │ Field Name │ Access │ Description │
├─────┼───────────────────────────────┼────────┼───────────────────────┤
│ 0 │ ERR_COR Received │ RW1C │ Correctable error msg │
│ 1 │ Multiple ERR_COR Received │ RW1C │ >1 ERR_COR before clr │
│ 2 │ ERR_FATAL/NONFATAL Received │ RW1C │ Uncorrectable err msg │
│ 3 │ Multiple ERR_F/NF Received │ RW1C │ >1 ERR_F/NF before clr│
│ 4 │ First Uncorrectable Fatal │ RW1C │ First was fatal │
│ 5 │ Non-Fatal Error Messages Rcvd │ RW1C │ Non-fatal count > 0 │
│ 6 │ Fatal Error Messages Received │ RW1C │ Fatal count > 0 │
│31:27│ Advanced Error Interrupt Msg# │ RO │ MSI/MSI-X vector │
└───────────────────────────────────────────────────────────────────────┘
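A Root Port handler typically combines the status flags above with the Error Source Identification register at offset 34h. The sketch below assumes that register holds the ERR_COR source in bits 15:0 and the ERR_FATAL/NONFATAL source in bits 31:16, a layout the register tables above do not spell out. The function names are hypothetical.

```python
def format_bdf(source_id):
    """Render a 16-bit source ID as bus:dev.fn."""
    bus = source_id >> 8
    dev = (source_id >> 3) & 0x1F
    fn = source_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"

def decode_root_error(root_status, err_src_id):
    """Return (kind, source BDF) pairs for the messages a Root Port has logged."""
    events = []
    if root_status & (1 << 0):                       # ERR_COR Received
        events.append(("correctable", format_bdf(err_src_id & 0xFFFF)))
    if root_status & (1 << 2):                       # ERR_FATAL/NONFATAL Received
        kind = "fatal" if root_status & (1 << 6) else "nonfatal"
        events.append((kind, format_bdf((err_src_id >> 16) & 0xFFFF)))
    return events

# Bits 0, 2 and 6 set: one correctable and one fatal message were logged.
events = decode_root_error(0x00000045, 0x01000200)
```

In this example the correctable message came from 02:00.0 and the fatal one from 01:00.0.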
6. Error Handling Procedures
6.1 Software Error Handler Algorithm
Error Handler Pseudocode:
error_handler(device):
    # Device Status (PCI Express Capability) summarizes which error classes were detected
    dev_status = config_read(device, DEV_STATUS)

    if dev_status.correctable_error:
        corr_status = config_read(device, AER_CORR_ERR_STATUS)
        log_correctable_errors(corr_status)
        # RW1C: writing back the value read clears only the bits just logged
        config_write(device, AER_CORR_ERR_STATUS, corr_status)

    if dev_status.non_fatal_error OR dev_status.fatal_error:
        uncorr_status = config_read(device, AER_UNCORR_ERR_STATUS)
        first_error = config_read(device, AER_CAP_CTRL).first_error_ptr

        # Header Log holds the TLP header for the error named by first_error
        header[0] = config_read(device, AER_HEADER_LOG_0)
        header[1] = config_read(device, AER_HEADER_LOG_1)
        header[2] = config_read(device, AER_HEADER_LOG_2)
        header[3] = config_read(device, AER_HEADER_LOG_3)
        analyze_tlp_header(header, first_error)

        # A pending error is fatal when its bit is also set in the Severity register
        severity = config_read(device, AER_UNCORR_ERR_SEV)
        if uncorr_status & severity:
            perform_device_reset(device)
        else:
            attempt_error_recovery(device)
        config_write(device, AER_UNCORR_ERR_STATUS, uncorr_status)

    # Clear the summary bits in Device Status last (also RW1C)
    config_write(device, DEV_STATUS, dev_status)
6.2 Root Complex Error Handling
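Root Port Error Handler:
root_error_handler(root_port):
    root_status = config_read(root_port, ROOT_ERR_STATUS)

    if root_status.err_cor_received:
        # Error Source Identification: source of the first ERR_COR
        source_id = config_read(root_port, ERR_SRC_ID).corr_src_id
        device = get_device_by_bdf(source_id)
        handle_correctable_error(device)

    if root_status.err_fatal_nonfatal_received:
        # Error Source Identification: source of the first ERR_FATAL/ERR_NONFATAL
        source_id = config_read(root_port, ERR_SRC_ID).uncorr_src_id
        device = get_device_by_bdf(source_id)
        # Treat as fatal if the first message was fatal or any fatal message was received
        is_fatal = root_status.first_uncorr_fatal OR root_status.fatal_msgs_received
        if is_fatal:
            perform_hierarchy_reset(root_port)
        else:
            handle_nonfatal_error(device)

    # RW1C: clear only the bits that were handled above
    config_write(root_port, ROOT_ERR_STATUS, root_status)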
6.3 Linux AER Commands
# Dump the AER capability registers of a device
$ lspci -vvv -s 01:00.0 | grep -A 20 "Advanced Error Reporting"
# Per-device error counters exported by the kernel through sysfs
$ cat /sys/bus/pci/devices/0000:01:00.0/aer_dev_correctable
$ cat /sys/bus/pci/devices/0000:01:00.0/aer_dev_fatal
$ cat /sys/bus/pci/devices/0000:01:00.0/aer_dev_nonfatal
# Kernel log messages from the AER driver
$ dmesg | grep -i aer
# Unmask all uncorrectable errors (write 0 to the Uncorrectable Error Mask Register, offset 08h)
$ setpci -s 01:00.0 ECAP_AER+0x8.L=0x00000000
Example kernel log output for a correctable error:
[ 123.456789] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:01:00.0
[ 123.456790] pcie 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer
[ 123.456791] pcie 0000:01:00.0: device [8086:1234] error status/mask=00000001/00000000
[ 123.456792] pcie 0000:01:00.0: [ 0] Receiver Error
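For monitoring scripts, the sysfs counter files can be parsed instead of scraping the kernel log. The sketch below assumes each line of an `aer_dev_*` file holds an error name followed by a count, plus a `TOTAL_...` summary line. The exact names vary by kernel version, and the sample text is made up for illustration. `parse_aer_counters` and `nonzero` are hypothetical helpers.

```python
def parse_aer_counters(text):
    """Parse 'name count' lines from an aer_dev_* sysfs file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def nonzero(counters):
    """Keep only individual errors that have actually occurred."""
    return {k: v for k, v in counters.items() if v and not k.startswith("TOTAL_")}

# Illustrative content; on a real system read it from
# /sys/bus/pci/devices/<BDF>/aer_dev_correctable
sample = "RxErr 3\nBadTLP 0\nBadDLLP 1\nTOTAL_ERR_COR 4\n"
counters = parse_aer_counters(sample)
active = nonzero(counters)
```

Polling these counters periodically and alerting on growth is one way to catch a marginal link before it starts producing uncorrectable errors.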
7. Normative Rules
AER Implementation Rules
- R1: All devices MUST implement error detection for Malformed TLP errors.
- R2: Devices MUST signal ERR_FATAL for errors that render the link unreliable.
- R3: The First Error Pointer MUST point to the first uncorrectable error detected since last cleared.
- R4: Header Log MUST capture the header of the TLP associated with the First Error Pointer.
- R5: Error Mask registers MUST NOT affect error detection, only error signaling.
- R6: Root Ports MUST be capable of receiving and logging error messages from downstream devices.
- R7: Software MUST clear error status bits by writing 1 to the corresponding bit positions.
- R8: Error Severity register bits MUST only affect errors that are programmable-severity.
- R9: Advisory Non-Fatal errors SHOULD be logged as Correctable unless upgraded by software.
- R10: Devices supporting TLP Prefixes MUST log prefix information when prefix-related errors occur.