Hands-On: Multi-Core Issue Resolution - Debugging & Tracing

Lab: Multi-Core IPC Bug

Setup	Detail
Bug description	Core0 (AUTOSAR OsTask_10ms) and Core2 (MCAL CAN ISR) both update g_canRxCount without synchronisation; the count is intermittently low by 1
Symptom	The CAN receive diagnostic counter periodically shows one message fewer than expected; intermittent, roughly 1 in 1,000 cycles
Target	TC397: Core0 + Core2 involved; Core1 (SafeOS) not affected
Goal	1. Catch the race in hardware. 2. Apply the atomic fix. 3. Verify zero races over 10,000 cycles.
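The "low by 1" symptom is the classic lost update: a plain g_canRxCount++ is a load, an add, and a store, and the two cores can interleave those steps. The following host-side C sketch replays that interleaving deterministically. It is an illustration only: the core_ctx_t type and the step_* and simulate_* helpers are hypothetical names, not part of the lab code, and no real concurrency is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the shared counter from the lab. */
static volatile uint32_t g_canRxCount;

/* Hypothetical per-core context: one register holding the loaded value. */
typedef struct { uint32_t reg; } core_ctx_t;

/* A non-atomic increment split into its three machine-level steps. */
static void step_load(core_ctx_t *c)  { c->reg = g_canRxCount; }
static void step_add(core_ctx_t *c)   { c->reg += 1u; }
static void step_store(core_ctx_t *c) { g_canRxCount = c->reg; }

/* Unlucky schedule: both cores load before either one stores. */
uint32_t simulate_lost_update(uint32_t start)
{
    core_ctx_t core0, core2;
    g_canRxCount = start;
    step_load(&core0);
    step_load(&core2);   /* Core2 reads the same stale value */
    step_add(&core0);
    step_add(&core2);
    step_store(&core0);
    step_store(&core2);  /* overwrites Core0's update */
    return g_canRxCount;
}

/* Same two increments, executed one after the other without overlap. */
uint32_t simulate_serialised(uint32_t start)
{
    core_ctx_t core0, core2;
    g_canRxCount = start;
    step_load(&core0); step_add(&core0); step_store(&core0);
    step_load(&core2); step_add(&core2); step_store(&core2);
    return g_canRxCount;
}

/* Self-check: two increments should add 2, the racy schedule adds only 1. */
void lost_update_demo(void)
{
    assert(simulate_serialised(100u) == 102u);
    assert(simulate_lost_update(100u) == 101u);
}
```

Because the overlap needs both cores to hit the same few instructions at the same moment, it happens rarely, which matches the roughly 1-in-1,000 rate in the table.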

Exercise 1: Catch the Race with MCDS

CMM: mc_ex1_catch_race.cmm

// Use MCDS windowed watchpoint to catch the race in hardware

MCDS.ON
SYStem.Option CORESHARING ON   // synchronised halt across all cores

// Core0 write watchpoint
CORE.select 1.
Break.Set g_canRxCount /Write /Hardware /CORE 1.

// Core2 write watchpoint
CORE.select 3.
Break.Set g_canRxCount /Write /Hardware /CORE 3.

// Cross-core trigger: halt ALL cores if Core0 and Core2 write within 500ns
MCDS.Trigger.Window g_canRxCount /Write /AllCores /Window 500ns

PRINT "Running — waiting for race condition..."
Go
WAIT !STATE.RUN() 60s          // race: may take many seconds; run long enough

IF STATE.RUN()
(
    PRINT "Race NOT detected in 60s — verify watchpoints are configured correctly"
)
ELSE
(
    PRINT "RACE DETECTED at PC=" FORMAT.ADDRESS(Register(PC))
    // One command per line: in PRACTICE, ";" starts a comment, so anything after it would not run
    CORE.select 1.
    PRINT "Core0: " FORMAT.ADDRESS(Register(PC))
    CORE.select 3.
    PRINT "Core2: " FORMAT.ADDRESS(Register(PC))
    Frame.view /Caller   // call stack of the writer on Core2 (the currently selected core)
)
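The windowed trigger in this script fires when writes from two cores land within 500 ns of each other. The same check can be sketched offline in C over exported write timestamps. This is a minimal sketch under stated assumptions: the function name writes_within_window and the idea of per-core timestamp arrays (sorted, in nanoseconds) are hypothetical and not a TRACE32 or MCDS API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if any write from core A and any write from core B are at
 * most window_ns apart. Both timestamp arrays must be sorted ascending. */
bool writes_within_window(const uint64_t *a_ns, size_t a_len,
                          const uint64_t *b_ns, size_t b_len,
                          uint64_t window_ns)
{
    size_t i = 0, j = 0;

    while (i < a_len && j < b_len) {
        uint64_t lo = (a_ns[i] < b_ns[j]) ? a_ns[i] : b_ns[j];
        uint64_t hi = (a_ns[i] < b_ns[j]) ? b_ns[j] : a_ns[i];

        if (hi - lo <= window_ns) {
            return true;            /* overlapping writes: race candidate */
        }
        /* Advance the earlier timestamp: later entries on the other side
         * can only be further away from it. */
        if (a_ns[i] < b_ns[j]) {
            i++;
        } else {
            j++;
        }
    }
    return false;
}

/* Self-check with two writes only 300 ns apart. */
void window_demo(void)
{
    const uint64_t core0[] = { 1000u };
    const uint64_t core2[] = { 1300u };
    assert(writes_within_window(core0, 1, core2, 1, 500u));
}
```

The two-pointer walk visits each timestamp once, so the check stays linear even for long traces.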

Exercise 2: Apply and Verify the Atomic Fix

CMM: mc_ex2_verify_fix.cmm

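// After reflashing with the atomic increment fix:
// g_canRxCount++ replaced with __sync_fetch_and_add(&g_canRxCount, 1)

// 1. Verify LDMST in disassembly (not LOAD+ADD+STORE)
Data.LOAD.Elf multicore_lab_fixed.elf /NoCODE   // symbols only: the code is already in flash
List.Mix CAN_RxIndication        // disassembly of fixed function
// Expected: a single LDMST instruction at the counter update.
// The exact opcode depends on compiler and core; a compare-and-swap retry loop
// is also atomic, whereas a plain load/add/store sequence is not.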

// 2. Re-run MCDS race detector: expect zero triggers (goal: 10,000 cycles)
MCDS.ON
MCDS.Trigger.Window g_canRxCount /Write /AllCores /Window 500ns

Go
WAIT 30s           // 30 s = ~3,000 cycles of the 10 ms task; use WAIT 100s to cover the 10,000-cycle goal
Break

PRINT "Race triggers detected: " MCDS.Trigger.COUNT()
IF MCDS.Trigger.COUNT()==0.
    PRINT "PASS: No races detected after atomic fix"
ELSE
    PRINT %ERROR "FAIL: " MCDS.Trigger.COUNT() " race(s) still detected"

// 3. Verify counter value accuracy
LOCAL &rx_expected &rx_actual
// HW frame count from CAN module (expression written without spaces, since PRACTICE treats a space as a parameter separator)
&rx_expected=Data.Long(SFR:CAN0.SR)&0xFFFF
&rx_actual=Var.VALUE(g_canRxCount)
PRINT "HW count: " &rx_expected " SW count: " &rx_actual
IF &rx_actual==&rx_expected
    PRINT "PASS: SW counter matches HW counter"
ELSE
    PRINT %ERROR "FAIL: counter mismatch by " (&rx_expected-&rx_actual)
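For reference, the C-side change that this script verifies can be sketched as follows. The builtin __sync_fetch_and_add is the one named in the lab; the wrapper function names, and the compare-and-swap variant shown next to it, are hypothetical illustrations. The sketch assumes a GCC-compatible compiler and says nothing about which instruction a particular TriCore toolchain emits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the shared counter from the lab. */
static volatile uint32_t g_canRxCount;

/* Atomic increment using the builtin named in the lab.
 * Returns the counter value from before the add. */
uint32_t can_rx_count_increment(void)
{
    return __sync_fetch_and_add(&g_canRxCount, 1u);
}

/* Equivalent compare-and-swap retry loop (illustrative alternative):
 * retry until no other writer changed the counter between read and swap. */
uint32_t can_rx_count_increment_cas(void)
{
    uint32_t old;
    do {
        old = g_canRxCount;
    } while (__sync_val_compare_and_swap(&g_canRxCount, old, old + 1u) != old);
    return old;
}

/* Helpers for the self-check below (hypothetical, host-side only). */
uint32_t can_rx_count_read(void) { return g_canRxCount; }
void can_rx_count_reset(void)    { g_canRxCount = 0u; }

/* Single-threaded self-check: each call adds exactly one. */
void atomic_increment_demo(void)
{
    can_rx_count_reset();
    assert(can_rx_count_increment() == 0u);
    assert(can_rx_count_increment_cas() == 1u);
    assert(can_rx_count_read() == 2u);
}
```

Either form makes the read-modify-write indivisible from the other core's point of view, which is the property the disassembly check in step 1 is looking for.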

Exercise 3: Spinlock Contention Measurement

CMM: mc_ex3_spinlock.cmm

// Measure spinlock contention: how often does Core0 spin waiting for Core2?
// Instrumented SpinLock_Acquire: counts spin iterations before lock acquired

// STM-based spinlock wait time measurement (instrumentation in the target C code, not PRACTICE commands):
//   PROF_START macro at SpinLock_Acquire entry
//   PROF_END macro at the first acquired state

// After 1000 cycles, check profiling slots:
Go
WAIT 10s
Break
Var.View g_spinlockProfSlot     // shows min/max/avg wait time for spinlock

// High max spinlock wait = Core2 holds lock for too long during CAN ISR
// Fix: shorten critical section in CAN ISR; don't call BSW functions under spinlock

// Check for deadlock risk: a core spinning on a lock that will never be released
Var.View g_spinlockHolders      // which core currently holds each spinlock
// If g_spinlockHolders[SPINLOCK_IPC] == Core0 AND Core0 is itself spinning on SPINLOCK_IPC -> self-deadlock
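The instrumented SpinLock_Acquire that this script inspects could look roughly like the C sketch below. Everything except the name SpinLock_Acquire is an assumption made for illustration: the spinlock_t and spin_prof_t layouts, the core_id and max_spins parameters, and the use of spin iterations rather than STM timestamps as the wait measure. It assumes a GCC-compatible compiler for the __sync builtins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical profiling slot: min/max/total spin iterations per acquire. */
typedef struct {
    uint32_t min_spins;
    uint32_t max_spins;
    uint64_t total_spins;
    uint32_t samples;
} spin_prof_t;

typedef struct {
    volatile uint32_t locked;   /* 0 = free, 1 = held */
    volatile int32_t  holder;   /* core id of the holder, -1 if free */
    spin_prof_t       prof;
} spinlock_t;

void SpinLock_Init(spinlock_t *l)
{
    l->locked = 0u;
    l->holder = -1;
    l->prof.min_spins = UINT32_MAX;
    l->prof.max_spins = 0u;
    l->prof.total_spins = 0u;
    l->prof.samples = 0u;
}

/* Returns false instead of spinning forever when the caller already holds
 * the lock (self-deadlock) or the spin budget is exhausted. */
bool SpinLock_Acquire(spinlock_t *l, int32_t core_id, uint32_t max_spins)
{
    uint32_t spins = 0u;

    if (l->locked != 0u && l->holder == core_id) {
        return false;                       /* would deadlock on itself */
    }
    while (__sync_lock_test_and_set(&l->locked, 1u) != 0u) {
        if (++spins >= max_spins) {
            return false;                   /* contention budget exceeded */
        }
    }
    l->holder = core_id;

    /* Record the wait only for successful acquisitions. */
    if (spins < l->prof.min_spins) { l->prof.min_spins = spins; }
    if (spins > l->prof.max_spins) { l->prof.max_spins = spins; }
    l->prof.total_spins += spins;
    l->prof.samples++;
    return true;
}

void SpinLock_Release(spinlock_t *l)
{
    l->holder = -1;
    __sync_lock_release(&l->locked);
}

/* Average spins per successful acquire; 0 if nothing was recorded yet. */
uint32_t SpinLock_AvgSpins(const spinlock_t *l)
{
    return (l->prof.samples == 0u)
         ? 0u
         : (uint32_t)(l->prof.total_spins / l->prof.samples);
}

/* Self-check: an uncontended acquire succeeds without spinning. */
void spinlock_demo(void)
{
    spinlock_t l;
    SpinLock_Init(&l);
    assert(SpinLock_Acquire(&l, 0, 10u));
    SpinLock_Release(&l);
}
```

A bounded spin budget turns a silent hang into a reportable failure, and a high max_spins value in the profiling slot points at the same problem the script comments describe: a critical section that is held too long.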

Summary

The MCDS windowed watchpoint caught a race condition that occurred about once per 1,000 task cycles, too rare and too timing-sensitive to reproduce reliably with ordinary breakpoints or printf. The atomic fix (LDMST via __sync_fetch_and_add) is verified first in the disassembly, which confirms that the counter update is a single atomic instruction, and then by re-running the MCDS detector for 30 seconds (about 3,000 cycles of the 10 ms task) with zero triggers; covering the full 10,000-cycle goal takes a 100-second run. Spinlock contention measurement is the third pillar: a high maximum wait time indicates that a core is holding a lock too long, adding IPC latency to every task that uses that resource.

🔬 Deep Dive — Core Concepts Expanded

This section builds on the concepts above with additional technical depth, edge cases, and configuration nuances. On production ECU projects, these details are the ones most often responsible for integration delays and late-phase defects.

Key principles to reinforce:

Configuration over coding: In AUTOSAR and automotive middleware environments, correctness is largely determined by ARXML configuration, not application code. A correctly implemented algorithm can produce wrong results due to a single misconfigured parameter.
Traceability as a first-class concern: Every configuration decision should be traceable to a requirement, safety goal, or architecture decision. Undocumented configuration choices are a common source of regression defects when ECUs are updated.
Cross-module dependencies: In tightly integrated automotive software stacks, changing one module's configuration often requires corresponding updates in dependent modules. Always perform a dependency impact analysis before submitting configuration changes.

🏭 How This Topic Appears in Production Projects

Project integration phase: The concepts covered in this lesson are most commonly encountered during ECU integration testing — when multiple software components from different teams are combined for the first time. Issues that were invisible in unit tests frequently surface at this stage.
Supplier/OEM interface: This topic comes up frequently in technical discussions between Tier-1 ECU suppliers and OEM system integrators. Engineers who can discuss these details fluently earn credibility and are often brought into critical design reviews.
Automotive tool ecosystem: Vector CANoe/CANalyzer, dSPACE tools, and ETAS INCA are widely used to validate and measure the behaviour of the systems described in this lesson. Familiarity with these tools, alongside the conceptual knowledge, speeds up debugging in real projects.

⚠️ Common Mistakes and How to Avoid Them

Assuming default configuration is correct: Automotive software tools ship with default configurations that are designed to compile and link, not to meet project-specific requirements. Every configuration parameter needs to be consciously set. 'It compiled' is not the same as 'it is correctly configured'.
Skipping documentation of configuration rationale: In a 3-year ECU project with team turnover, undocumented configuration choices become tribal knowledge that disappears when engineers leave. Document why a parameter is set to a specific value, not just what it is set to.
Testing only the happy path: Automotive ECUs must behave correctly under fault conditions, voltage variations, and communication errors. Always test the error handling paths as rigorously as the nominal operation. Many production escapes originate in untested error branches.
Version mismatches between teams: In a multi-team project, the BSW team, SWC team, and system integration team may use different versions of the same ARXML file. Version management of all ARXML files in a shared repository is mandatory, not optional.

📊 Industry Note

Engineers who master both the theoretical concepts and the practical toolchain skills covered in this course are among the most sought-after professionals in the automotive software industry. The combination of AUTOSAR standards knowledge, safety engineering understanding, and hands-on configuration experience commands premium salaries at OEMs and Tier-1 suppliers globally.