The Bug: Intermittent CAN Communication Loss
| Item | Details |
|---|---|
| Reported issue | CAN bus intermittently drops all messages for 50–200 ms, then self-recovers; vehicle shows sporadic DTC P1234 (CAN bus off) |
| Occurrence rate | ~3 times per hour during normal operation |
| Failed approaches | Logic analyser shows no CAN electrical faults; no DTC snapshot in production log; no correlation with NVH or temperature |
| Available tooling | TC397 engineering sample with LA-7780 probe; MCDS trace; ETM; full DWARF symbols |
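For context on the bus-off condition named in the reported issue, the CAN protocol's fault-confinement rules can be sketched in C. This is a minimal illustration of the standard thresholds, not code from the ECU under test; the type and function names (`can_fault_state_t`, `can_classify`) are hypothetical.

```c
#include <stdint.h>

/* Fault-confinement states defined by the CAN protocol.
 * The type and function names here are hypothetical, for illustration only. */
typedef enum {
    CAN_STATE_ERROR_ACTIVE,
    CAN_STATE_ERROR_PASSIVE,
    CAN_STATE_BUS_OFF
} can_fault_state_t;

/* Classify a node from its transmit (TEC) and receive (REC) error counters.
 * Counters are passed as 16-bit values so that a TEC of 256 is representable. */
can_fault_state_t can_classify(uint16_t tec, uint16_t rec)
{
    if (tec > 255u) {
        return CAN_STATE_BUS_OFF;        /* only the TEC can force bus-off */
    }
    if (tec > 127u || rec > 127u) {
        return CAN_STATE_ERROR_PASSIVE;  /* either counter at 128 or above */
    }
    return CAN_STATE_ERROR_ACTIVE;
}
```

In the common case the TEC rises by 8 on each transmit error and falls by 1 on each successful transmission, so roughly 32 consecutive failed transmissions are enough to take a healthy node from 0 to bus-off. That is why a short burst of failing transmit attempts can produce the symptom in the table.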
Stage 1: Reproduce and Capture with TRACE32
// Stage 1: Set up automatic capture on CAN bus-off event
// Trigger: transition from error-passive to bus-off
// Aurix CAN SFR: CAN0.NODE0SR.BOFF = 1 when bus-off occurs (the same bit the breakpoint below watches)
Break.Set SFR:CAN0.NODE0SR.BOFF /Write /Hardware /CONDition Data.Byte(SFR:CAN0.NODE0SR.BOFF)==1
// Enable ETM trace to capture execution leading up to fault
ETM.ON
ETM.DataTrace NONE
Trace.METHOD Analyzer
Trace.Size 256MB
// Run — will halt when bus-off bit is set
PRINT "Waiting for CAN bus-off event..."
Go
WAIT !STATE.RUN() 600s // wait up to 10 minutes
IF STATE.RUN()
PRINT "Event not detected in 10 min — extend wait or verify trigger"
ELSE
(
PRINT "BUS-OFF EVENT DETECTED at " FORMAT.ADDRESS(Register(PC))
Trace.SAVE canbus_crash_trace.t32t // save full trace for analysis
Data.SAVE.Binary sram_snapshot.bin 0x70000000--0x700FFFFF
)
Stage 2: Root Cause via ETM Trace Analysis
// Stage 2: Analyse saved trace to find what triggered bus-off
// Load saved trace
Trace.LOAD canbus_crash_trace.t32t
Data.LOAD.Binary sram_snapshot.bin 0x70000000
// Step 1: Find the last CAN error counter increment before bus-off
// CAN thresholds: error passive when TEC or REC reaches 128; bus-off when TEC reaches 256
// Look for rapid error count growth in trace
Trace.Find /Address SFR:CAN0.NODE0ECR // find all accesses to error counter register
// Step 2: Examine execution ~5ms before bus-off
// Trace.Chart.TASK shows what was running at T-5ms
Trace.Chart.TASK /ZOOM -5ms 0ms // last 5ms before halt
// Step 3: Find the anomaly
// Expected: ETM shows OsTask_10ms briefly calling CAN transmit at 50kHz rate
// instead of normal 1kHz rate
Trace.Statistics.Func "Can_Write" // call count for Can_Write in trace
// FINDING: Can_Write called 483 times in last 1ms (expected: ~10)
// Step 4: Trace back to what caused the call rate spike
Trace.Find /Entry "Can_Write" // find every call to Can_Write
// All calls originate from: Nm_MainFunction → CanNm_TxConfirmation loop
// Bug: CanNm missing return guard — callback loops under error condition
// Expected finding: CanNm_TxConfirmation called recursively when CAN TX queue full
// CAN flood → bus error counter hits 256 → bus-off
// Fix: add one-shot guard in CanNm_TxConfirmation: if (s_txActive) return;
Stage 3: Fix, Verify, and Prevent Regression
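The one-shot guard proposed at the end of Stage 2 can be sketched in C as follows. This is a simplified model under stated assumptions, not the actual CanNm source: the callback signature is reduced to `void`, `Can_Write_Stub`, `g_can_write_calls` and `g_tx_queue_full` are hypothetical stand-ins, and the stub assumes the driver reports a full queue by invoking the confirmation callback synchronously. Only the `s_txActive` flag comes from the fix described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real stack. */
uint32_t g_can_write_calls = 0u;   /* number of transmit requests issued */
bool     g_tx_queue_full   = true; /* simulates the fault condition */

void CanNm_TxConfirmation_Guarded(void);

/* Stub transmit. Assumption: with a full TX queue the driver calls the
 * confirmation callback synchronously, which is the re-entrance path.
 * The 1000-call cap only keeps the model finite if the guard is removed. */
void Can_Write_Stub(void)
{
    g_can_write_calls++;
    if (g_tx_queue_full && g_can_write_calls < 1000u) {
        CanNm_TxConfirmation_Guarded();
    }
}

/* Guard flag named as in the fix description. */
static bool s_txActive = false;

/* Simplified callback: retransmits once, and ignores nested entries. */
void CanNm_TxConfirmation_Guarded(void)
{
    if (s_txActive) {
        return;             /* already inside the callback: break the loop */
    }
    s_txActive = true;
    Can_Write_Stub();       /* may call back into this function */
    s_txActive = false;
}
```

With the guard in place, one call to `CanNm_TxConfirmation_Guarded()` produces exactly one transmit request even while the queue is full. A plain `bool` is sufficient only if the callback is entered from a single context; if it can also be entered from an interrupt, the flag would need interrupt locking or an atomic test-and-set.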
// Stage 3: Verify fix and add regression guard
// After fix applied and reflashed:
Data.LOAD.Elf build/app_fixed.elf /RELPATH
SYStem.RESetTarget // reset the target; SYStem.RESet would only restore the debugger's SYStem settings
// Re-run bus-off detection for 30 minutes
Break.Set SFR:CAN0.NODE0SR.BOFF /Write /Hardware /CONDition Data.Byte(SFR:CAN0.NODE0SR.BOFF)==1
ETM.ON
Trace.METHOD Analyzer
Trace.Size 256MB
PRINT "Running regression for 30 min..."
Go
WAIT !STATE.RUN() 1800s
IF STATE.RUN()
PRINT "PASS: No bus-off event in 30 minutes"
ELSE
PRINT %ERROR "FAIL: bus-off still occurring after fix"
// Regression test: add Can_Write call rate monitor to CI suite
// (catches if call rate exceeds 50/ms in any 1ms window)
// Add to hil_results: {"id":"CAN-001","name":"Can_Write rate < 50/ms"...}
// Root cause summary:
PRINT "Root cause: CanNm_TxConfirmation recursive entry under CAN TX error"
PRINT "Fix: added s_txActive guard flag; one-shot callback pattern"
PRINT "Regression test: Can_Write call rate monitor added to HIL CI suite"
PRINT "MCSS: TARA entry TS-CAN-03 updated: 'CanNm reentrance' risk closed"
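The Can_Write call rate monitor mentioned in the regression comments could look something like the sketch below. It is one possible design, not the monitor actually added to the HIL suite: all names (`rate_monitor_t`, `rate_monitor_init`, `rate_monitor_on_call`) are hypothetical, and it assumes a free-running microsecond timestamp is available at each call. It keeps the last 51 timestamps and latches a violation when the newest and oldest of them lie less than 1 ms apart, meaning more than 50 calls fell inside one window.

```c
#include <stdbool.h>
#include <stdint.h>

#define RATE_LIMIT_CALLS 50u    /* limit from the text: 50 calls ...   */
#define RATE_WINDOW_US   1000u  /* ... in any 1 ms window              */
#define RATE_SLOTS       (RATE_LIMIT_CALLS + 1u)

/* Hypothetical monitor state: a ring of the most recent call timestamps. */
typedef struct {
    uint32_t stamps[RATE_SLOTS]; /* timestamps in microseconds */
    uint32_t next;               /* slot that will be overwritten next */
    uint32_t count;              /* valid entries, saturates at RATE_SLOTS */
    bool     violated;           /* latched once the limit is exceeded */
} rate_monitor_t;

void rate_monitor_init(rate_monitor_t *m)
{
    m->next = 0u;
    m->count = 0u;
    m->violated = false;
}

/* Record one call at time now_us. Returns true once the limit has been
 * exceeded in any window so far. */
bool rate_monitor_on_call(rate_monitor_t *m, uint32_t now_us)
{
    m->stamps[m->next] = now_us;
    m->next = (m->next + 1u) % RATE_SLOTS;
    if (m->count < RATE_SLOTS) {
        m->count++;
    }
    if (m->count == RATE_SLOTS) {
        /* After the advance, next points at the oldest stored timestamp.
         * Unsigned subtraction stays correct across timer wrap-around. */
        uint32_t oldest = m->stamps[m->next];
        if ((uint32_t)(now_us - oldest) < RATE_WINDOW_US) {
            m->violated = true;
        }
    }
    return m->violated;
}
```

Called from a wrapper around Can_Write, the monitor stays quiet at the nominal rate of about 10 calls per millisecond and trips on a burst like the 483 calls in 1 ms seen in the trace. Latching the flag lets the test harness read a single pass/fail value at the end of the run.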
Summary
This case study applied the complete debugging workflow end to end: an automatic capture triggered on a hardware register bit (CAN bus-off), an ETM trace providing the execution history leading up to the fault, Trace.Statistics.Func revealing the anomalous call rate, and Trace.Find pinpointing the code path responsible. The bug, a CanNm callback re-entrance loop, would likely have taken weeks to find by code review alone, because it manifests only when the CAN TX queue is full. With ETM trace it was found in under 2 hours from first capture. The regression test added to CI guards against the same failure mode being silently re-introduced.
🔬 Deep Dive — Core Concepts Expanded
This section builds on the foundational concepts covered above with additional technical depth, edge cases, and configuration nuances that separate competent engineers from experts. When working on production ECU projects, the details covered here are the ones most commonly responsible for integration delays and late-phase defects.
Key principles to reinforce:
- Configuration over coding: In AUTOSAR and automotive middleware environments, correctness is largely determined by ARXML configuration, not application code. A correctly implemented algorithm can produce wrong results due to a single misconfigured parameter.
- Traceability as a first-class concern: Every configuration decision should be traceable to a requirement, safety goal, or architecture decision. Undocumented configuration choices are a common source of regression defects when ECUs are updated.
- Cross-module dependencies: In tightly integrated automotive software stacks, changing one module's configuration often requires corresponding updates in dependent modules. Always perform a dependency impact analysis before submitting configuration changes.
🏭 How This Topic Appears in Production Projects
- Project integration phase: The concepts covered in this lesson are most commonly encountered during ECU integration testing — when multiple software components from different teams are combined for the first time. Issues that were invisible in unit tests frequently surface at this stage.
- Supplier/OEM interface: This is a topic that frequently appears in technical discussions between Tier-1 ECU suppliers and OEM system integrators. Engineers who can speak fluently about these details earn credibility and are often brought into critical design review meetings.
- Automotive tool ecosystem: Vector CANoe/CANalyzer, dSPACE tools, and ETAS INCA are the standard tools used to validate and measure the correct behaviour of the systems described in this lesson. Familiarity with these tools alongside the conceptual knowledge dramatically accelerates debugging in real projects.
⚠️ Common Mistakes and How to Avoid Them
- Assuming default configuration is correct: Automotive software tools ship with default configurations that are designed to compile and link, not to meet project-specific requirements. Every configuration parameter needs to be consciously set. 'It compiled' is not the same as 'it is correctly configured'.
- Skipping documentation of configuration rationale: In a 3-year ECU project with team turnover, undocumented configuration choices become tribal knowledge that disappears when engineers leave. Document why a parameter is set to a specific value, not just what it is set to.
- Testing only the happy path: Automotive ECUs must behave correctly under fault conditions, voltage variations, and communication errors. Always test the error handling paths as rigorously as the nominal operation. Many production escapes originate in untested error branches.
- Version mismatches between teams: In a multi-team project, the BSW team, SWC team, and system integration team may use different versions of the same ARXML file. Version management of all ARXML files in a shared repository is mandatory, not optional.