Stack Overflow & Memory Corruption - Debugging & Tracing

Stack Overflow: Causes and Detection

Stack Overflow Growth Patterns

  Task stack (grows down): OsTask_10ms — 4096 bytes
  ──────────────────────────────────────────────
  Stack top (high addr):  0x70008000  ← initial SP
  Normal usage:           0x70007800  ← SP after typical activation (2 kB used, 2 kB remaining as safe margin)
  Warning threshold:      0x70007400  ← SP below this leaves < 1 kB remaining (WCET path hit)
  Stack bottom (guard):   0x70007000  ← DWT watchpoint here
  Below stack (LMU data): 0x70006FFC  ← first word overwritten on overflow
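
The layout arithmetic can be checked with a few lines of C. This is a minimal sketch, assuming a full-descending stack described by its top address and size; `StackRegion_t` and the `Stack_*` helper names are hypothetical and not part of any OS API.

```c
#include <stdint.h>

/* Hypothetical description of a full-descending task stack */
typedef struct {
    uint32_t top;   /* highest address, value of the initial SP */
    uint32_t size;  /* stack size in bytes */
} StackRegion_t;

/* Lowest valid stack address (where a guard would sit) */
static uint32_t Stack_BottomAddress(const StackRegion_t *s)
{
    return s->top - s->size;
}

/* Bytes consumed for a given SP value (the stack grows down from top) */
static uint32_t Stack_UsedBytes(const StackRegion_t *s, uint32_t sp)
{
    return s->top - sp;
}

/* Bytes still free; clamps to 0 once SP has passed the bottom */
static uint32_t Stack_RemainingBytes(const StackRegion_t *s, uint32_t sp)
{
    uint32_t used = Stack_UsedBytes(s, sp);
    return (used >= s->size) ? 0u : (s->size - used);
}
```

With top = 0x70008000 and size = 4096, an SP of 0x70007800 gives 2048 bytes used, and an SP of 0x70007400 leaves 1024 bytes remaining.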

  Common causes:
  1. Unbounded recursion (no depth limit in parser/tree traversal)
  2. Large local arrays (char buf[4096] in a function: allocates entire stack budget)
  3. ISR nesting (ISR stack frame adds ~64 bytes per nesting level)
  4. Function pointer call to wrong address → infinite recursion / garbage frame
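
For cause 1, one common mitigation is an explicit depth limit, so that the worst-case stack use of a recursive routine becomes a known constant. The sketch below is illustrative only: `Node_t`, `Tree_SumBounded` and the limit of 16 are assumptions, and the real limit would have to be derived from the per-frame stack cost and the task's stack budget.

```c
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

#define TREE_MAX_DEPTH 16u  /* assumed limit; derive from frame size and stack budget */

typedef struct Node {
    struct Node *left;
    struct Node *right;
    uint32_t     value;
} Node_t;

/* Adds all node values to *sum.
 * Returns false as soon as the traversal would exceed TREE_MAX_DEPTH,
 * so a too-deep or corrupted (cyclic) tree cannot exhaust the stack. */
static bool Tree_SumBounded(const Node_t *node, uint32_t depth, uint32_t *sum)
{
    if (node == NULL) {
        return true;
    }
    if (depth >= TREE_MAX_DEPTH) {
        return false;  /* depth limit reached: abort instead of recursing further */
    }
    *sum += node->value;
    return Tree_SumBounded(node->left, depth + 1u, sum)
        && Tree_SumBounded(node->right, depth + 1u, sum);
}
```

The caller starts with depth 0 and treats a false return as a data error. Worst-case stack use is then at most TREE_MAX_DEPTH frames, regardless of the input.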

MPU Stack Guard Configuration

mpu_stack_guard.c

/* ARM Cortex-M (ARMv7-M) MPU: configure a stack guard region for each task */
/* An access to the guard raises a hardware fault; the OS then calls ProtectionHook */
#include "Os.h"
#include "arm_mpu.h"

/* Configure MPU region 7 as stack guard for the current task.
 * guard_size_bytes must be a power of 2 and >= 32 bytes (MPU minimum);
 * task_stack_base (lowest stack address) must be aligned to guard_size_bytes. */
void MPU_ConfigureTaskStackGuard(uint32_t task_stack_base,
                                  uint32_t guard_size_bytes)
{
    /* RASR.SIZE encodes the region size as 2^(SIZE+1) bytes; SIZE=4 is 32 bytes */
    uint32_t size_field = 4u;
    while ((size_field < 31u) && ((2u << size_field) < guard_size_bytes)) {
        size_field++;                               /* e.g. 256 bytes -> SIZE = 7 */
    }
    MPU->RBAR = (task_stack_base & ~(guard_size_bytes - 1u))
              | MPU_RBAR_VALID_Msk | (7u);          /* VALID + REGION field select region 7 */
    MPU->RASR = MPU_RASR_ENABLE_Msk                 /* enable */
              | (size_field << MPU_RASR_SIZE_Pos)   /* region size */
              | (0u << MPU_RASR_AP_Pos)             /* AP=000: no access (read+write fault) */
              | MPU_RASR_XN_Msk;                    /* not executable */
    __DSB();  /* ensure MPU config takes effect before next instruction */
    __ISB();
}

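/* ProtectionHook called by OS after a protection violation has been detected */
ProtectionReturnType Os_ProtectionHook(StatusType faultId) {
    /* E_OS_STACKFAULT: OS stack monitoring; E_OS_PROTECTION_MEMORY: MPU violation (guard hit) */
    if ((faultId == E_OS_STACKFAULT) || (faultId == E_OS_PROTECTION_MEMORY)) {
        /* Identify which task overflowed; store it with the event record as needed */
        TaskType faulting_task;
        (void)GetTaskID(&faulting_task);
        Dem_ReportErrorStatus(DEM_EVENT_STACK_OVERFLOW, DEM_EVENT_STATUS_FAILED);
        /* Terminate the faulting OS-Application and restart it.
           Return PRO_SHUTDOWN instead if the safety concept requires an ECU reset. */
        return PRO_KILLAPPL_RESTART;
    }
    /* PRO_IGNORE is only permitted for E_OS_PROTECTION_ARRIVAL; shut down on anything else */
    return PRO_SHUTDOWN;
}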
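
The size and alignment rules for an ARMv7-M MPU region can be validated before the registers are written. The sketch below assumes the ARMv7-M encoding, where the region size is 2^(SIZE+1) bytes with a 32-byte minimum and the base address must be aligned to the region size. `Mpu_SizeField` and `Mpu_IsBaseAligned` are hypothetical helper names, not CMSIS functions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns the RASR.SIZE field value for a region of region_bytes bytes
 * (region size = 2^(SIZE+1)). Returns 0 (a reserved encoding, used here as
 * an error marker) if region_bytes is not a power of 2 or is below 32. */
static uint32_t Mpu_SizeField(uint32_t region_bytes)
{
    bool is_pow2 = (region_bytes != 0u)
                && ((region_bytes & (region_bytes - 1u)) == 0u);
    if (!is_pow2 || (region_bytes < 32u)) {
        return 0u;
    }
    uint32_t log2 = 0u;
    while ((1u << log2) < region_bytes) {
        log2++;
    }
    return log2 - 1u;
}

/* True if base is aligned to region_bytes (region_bytes must be a power of 2) */
static bool Mpu_IsBaseAligned(uint32_t base, uint32_t region_bytes)
{
    return (base & (region_bytes - 1u)) == 0u;
}
```

For example, a 256-byte guard maps to SIZE = 7 and a 512-byte guard to SIZE = 8. A stack base of 0x70007000 is aligned for either size, so the guard covers exactly the lowest bytes of the stack.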

Stack Canary: High-Water Mark Pattern

stack_canary.c

/* Stack canary: fill unused stack with 0xCDCDCDCD pattern at task init */
/* Measure high-water mark offline via TRACE32 or periodic runtime check */
#include <stdint.h>   /* uint8_t, uint32_t */
#include <string.h>   /* memset */

#define STACK_CANARY_PATTERN  0xCDCDCDCDu

typedef struct {
    uint8_t   *base;
    uint32_t   size;
    const char *name;
} StackInfo_t;

extern uint8_t OsTask_10ms_Stack[];
extern const uint32_t OsTask_10ms_StackSize;

/* Call before the task first runs: filling a live stack would destroy its frames */
void Stack_InitCanary(const StackInfo_t *stack) {
    memset(stack->base, 0xCD, stack->size);
}

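uint32_t Stack_GetHighWaterMark(const StackInfo_t *stack) {
    /* Scan upward from the lowest address; the stack grows down toward base */
    const uint32_t *p = (const uint32_t *)stack->base;
    const uint32_t total_words = stack->size / sizeof(uint32_t);
    uint32_t unused_words = 0;
    /* Bound the scan so a completely untouched stack is not read past its end */
    while ((unused_words < total_words) && (p[unused_words] == STACK_CANARY_PATTERN)) {
        unused_words++;
    }
    uint32_t used_bytes = stack->size - (unused_words * sizeof(uint32_t));
    return used_bytes;  /* peak usage in bytes since the canary fill */
}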

/* TRACE32: monitor high-water marks for all tasks:
   Var.View Stack_GetHighWaterMark(&g_taskStackInfo[0])
   Or scan raw stack memory in Data.dump for first non-0xCD byte */
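
The periodic runtime check mentioned above can be sketched as a margin test against a minimum free-byte threshold. This is an illustrative, host-testable variant under stated assumptions: it scans byte-wise instead of word-wise, and the `Canary_*` names and the threshold parameter are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define CANARY_BYTE 0xCDu

/* Fill the whole stack area with the canary byte (before the task runs) */
static void Canary_Fill(uint8_t *base, uint32_t size)
{
    memset(base, (int)CANARY_BYTE, size);
}

/* Count untouched canary bytes starting at the lowest address.
 * The scan is bounded by size, so a fully unused stack is handled. */
static uint32_t Canary_UnusedBytes(const uint8_t *base, uint32_t size)
{
    uint32_t unused = 0u;
    while ((unused < size) && (base[unused] == CANARY_BYTE)) {
        unused++;
    }
    return unused;
}

/* True if at least min_free_bytes of the stack have never been written */
static bool Canary_MarginOk(const uint8_t *base, uint32_t size,
                            uint32_t min_free_bytes)
{
    return Canary_UnusedBytes(base, size) >= min_free_bytes;
}
```

A background task could call `Canary_MarginOk` for each task stack and report a warning event when it returns false. Note that a canary scan can under-report usage if the task happens to write the value 0xCD at its deepest point, so it complements the MPU guard and does not replace it.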

Summary

Stack overflow in embedded automotive code has four main root causes: unbounded recursion, oversized local arrays, ISR nesting, and incorrect function pointer calls. The MPU stack guard is the hardware detection mechanism: a no-access MPU region at the bottom of each task stack faults on the first access that reaches it, and the OS then calls ProtectionHook. Stack canaries (0xCDCDCDCD fill) provide high-water mark measurement, either offline with a debugger or through a periodic runtime check, which is most useful during development. The two mechanisms are complementary: canaries show margin trends over time; the MPU guard provides the immediate run-time detection needed for ASIL compliance.

🔬 Deep Dive — Core Concepts Expanded

This section builds on the foundational concepts covered above with additional technical depth, edge cases, and configuration nuances that separate competent engineers from experts. When working on production ECU projects, the details covered here are the ones most commonly responsible for integration delays and late-phase defects.

Key principles to reinforce:

Configuration over coding: In AUTOSAR and automotive middleware environments, correctness is largely determined by ARXML configuration, not application code. A correctly implemented algorithm can produce wrong results due to a single misconfigured parameter.
Traceability as a first-class concern: Every configuration decision should be traceable to a requirement, safety goal, or architecture decision. Undocumented configuration choices are a common source of regression defects when ECUs are updated.
Cross-module dependencies: In tightly integrated automotive software stacks, changing one module's configuration often requires corresponding updates in dependent modules. Always perform a dependency impact analysis before submitting configuration changes.

🏭 How This Topic Appears in Production Projects

Project integration phase: The concepts covered in this lesson are most commonly encountered during ECU integration testing — when multiple software components from different teams are combined for the first time. Issues that were invisible in unit tests frequently surface at this stage.
Supplier/OEM interface: This is a topic that frequently appears in technical discussions between Tier-1 ECU suppliers and OEM system integrators. Engineers who can speak fluently about these details earn credibility and are often brought into critical design review meetings.
Automotive tool ecosystem: Vector CANoe/CANalyzer, dSPACE tools, and ETAS INCA are the standard tools used to validate and measure the correct behaviour of the systems described in this lesson. Familiarity with these tools alongside the conceptual knowledge dramatically accelerates debugging in real projects.

⚠️ Common Mistakes and How to Avoid Them

Assuming default configuration is correct: Automotive software tools ship with default configurations that are designed to compile and link, not to meet project-specific requirements. Every configuration parameter needs to be consciously set. 'It compiled' is not the same as 'it is correctly configured'.
Skipping documentation of configuration rationale: In a 3-year ECU project with team turnover, undocumented configuration choices become tribal knowledge that disappears when engineers leave. Document why a parameter is set to a specific value, not just what it is set to.
Testing only the happy path: Automotive ECUs must behave correctly under fault conditions, voltage variations, and communication errors. Always test the error handling paths as rigorously as the nominal operation. Many production escapes originate in untested error branches.
Version mismatches between teams: In a multi-team project, the BSW team, SWC team, and system integration team may use different versions of the same ARXML file. Version management of all ARXML files in a shared repository is mandatory, not optional.

📊 Industry Note

Engineers who master both the theoretical concepts and the practical toolchain skills covered in this course are among the most sought-after professionals in the automotive software industry. The combination of AUTOSAR standards knowledge, safety engineering understanding, and hands-on configuration experience commands premium salaries at OEMs and Tier-1 suppliers globally.