[SOLVED] RAM overwritten during debugging, possibly via SWD

thomas.blank · Aug 25th 2023, 6:14pm

Hi,

I am currently debugging a problem on an NXP KW41Z (a Cortex-M0+) via SWD. At random times, typically after several hours of running, the device hardfaults because operating system resources get overwritten. There is no rhyme or reason as to what gets overwritten (the affected addresses change), when it happens, or how to reproduce the bug. I have worked on the issue for weeks now and have covered a lot of bases. The obvious and usual avenues (stack overflow, rogue pointers, etc.) came up with nothing. There are a few additional details that would be too much to write here, but the gist is: my colleagues and I have tried everything, checked everything, and everything we know about microcontrollers has been challenged by this bug. The RAM contents just change at random.

The problem has happened several times at my desk over the last few weeks. At the same time, 34 devices ran the exact same firmware in a testing installation, i.e. without an attached debugger. None of them has reset during that time, which means that none of them had the hardfault that the RAM corruption I saw at my desk would inevitably cause. This leads me to believe that the problem might only occur with an attached debugger.

One of the few consistent things across the ~10 times that I saw the bug and inspected the memory is that the affected spots, 16 bytes aligned to a word boundary, get overwritten with the values 0x23000012 0xE000EDF0 or 0x23000002 0xE000EDF0. Those values show up during SWD communication. I attached a logic analyzer to the SWDIO and SWDCLK pins and found transfers like these between the Segger J-Link and my device during a normal debug session:


"SWD","v1frame",6.60293633,0.00016036,"WData 0x00000000 reg SELECT bits APSEL=0x00, APBANKSEL=0x0, PRESCALER=0x0"
"SWD","v1frame",6.6030967,5.004e-06,"Data parityok"
"SWD","v1frame",6.60310304,4.0084e-05,"Request AccessPort Write CSW"
"SWD","v1frame",6.60314319,5.004e-06,"Turnaround"
"SWD","v1frame",6.6031482,1.5028e-05,"ACK OK"
"SWD","v1frame",6.60316324,5e-06,"Turnaround"
"SWD","v1frame",6.60316883,0.000160356,"WData 0x23000012 reg CSW bits DbgSwEnable=0, Prot=0x23, SPIDEN=0, Mode=0x0, TrInProg=0, DeviceEn=0, AddrInc=Increment single, Size=Word (32 bits)"
"SWD","v1frame",6.6033292,5.004e-06,"Data parityok"
"SWD","v1frame",6.60333518,4.0084e-05,"Request AccessPort Write TAR"
"SWD","v1frame",6.60337532,5.004e-06,"Turnaround"
"SWD","v1frame",6.60338034,1.502e-05,"ACK OK"
"SWD","v1frame",6.60339537,5.004e-06,"Turnaround"
"SWD","v1frame",6.60340096,0.00016036,"WData 0xE000EDF0 reg TAR"
"SWD","v1frame",6.60356133,5.004e-06,"Data parityok"
"SWD","v1frame",6.60356755,4.008e-05,"Request AccessPort Read DRW"
"SWD","v1frame",6.6036077,5.004e-06,"Turnaround"
"SWD","v1frame",6.60361271,1.5028e-05,"ACK OK"
"SWD","v1frame",6.60362774,0.000161188,"WData 0x00000000 reg DRW"
"SWD","v1frame",6.60378894,5.004e-06,"Data parityok"
"SWD","v1frame",6.60380028,4.0084e-05,"Request DebugPort Read RDBUFF"
"SWD","v1frame",6.60384043,5.004e-06,"Turnaround"
"SWD","v1frame",6.60384544,1.5028e-05,"ACK OK"
"SWD","v1frame",6.60386048,0.000161188,"WData 0x01000000 reg RDBUFF"
"SWD","v1frame",6.60402168,5e-06,"Data parityok"
"SWD","v1frame",6.60456573,4.0084e-05,"Request DebugPort Write ABORT"
"SWD","v1frame",6.60460592,5.036e-06,"Turnaround"
"SWD","v1frame",6.60461096,1.5108e-05,"ACK OK"

0x23000012 0xE000EDF0 are the values that get written into the CSW and TAR registers. 0xE000EDF0, on this platform, is the address of the Debug Halting Control and Status Register, DHCSR.
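To make the two corrupting values easier to compare, here is a minimal sketch that splits a CSW word into the fields the logic analyzer prints in the trace above. The bit positions follow my understanding of the ADIv5 MEM-AP CSW layout and should be checked against the ARM specification; the type and function names (`csw_fields`, `csw_decode`, `csw_selfcheck`) are hypothetical and not part of any SEGGER or NXP API.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a MEM-AP CSW word, named as in the analyzer output.
 * Bit positions are an assumption based on the ADIv5 layout. */
typedef struct {
    unsigned size;        /* bits [2:0],  2 = word (32 bits)            */
    unsigned addr_inc;    /* bits [5:4],  0 = off, 1 = increment single */
    unsigned device_en;   /* bit 6                                      */
    unsigned tr_in_prog;  /* bit 7                                      */
    unsigned mode;        /* bits [11:8]                                */
    unsigned spiden;      /* bit 23                                     */
    unsigned prot;        /* bits [30:24]                               */
    unsigned dbg_sw_en;   /* bit 31                                     */
} csw_fields;

static csw_fields csw_decode(uint32_t csw)
{
    csw_fields f;
    f.size       = (unsigned)(csw & 0x7u);
    f.addr_inc   = (unsigned)((csw >> 4) & 0x3u);
    f.device_en  = (unsigned)((csw >> 6) & 0x1u);
    f.tr_in_prog = (unsigned)((csw >> 7) & 0x1u);
    f.mode       = (unsigned)((csw >> 8) & 0xFu);
    f.spiden     = (unsigned)((csw >> 23) & 0x1u);
    f.prot       = (unsigned)((csw >> 24) & 0x7Fu);
    f.dbg_sw_en  = (unsigned)((csw >> 31) & 0x1u);
    return f;
}

/* Cross-check against the analyzer line for WData 0x23000012:
 * Prot=0x23, AddrInc=Increment single, Size=Word. */
static void csw_selfcheck(void)
{
    csw_fields f = csw_decode(0x23000012u);
    assert(f.prot == 0x23u && f.addr_inc == 1u && f.size == 2u);
}
```

Decoded this way, 0x23000012 and 0x23000002 differ only in the AddrInc field (increment single versus off), so both look like ordinary 32-bit CSW setups rather than arbitrary data.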

I don't know much about SWD or how live debugging works under the hood, so I am looking for some guidance here from someone with more experience. But from what I know so far, I have a hypothesis that I would like to test:

I have a whole Eclipse-based IDE running gdb via a Segger J-Link, connected to the device via SWD. Is it feasible that, through some problem, a miscommunication happens on the SWD line, and through that miscommunication those values 0x23000012 0xE000EDF0 do not end up in the CSW and TAR registers as intended, but in the RAM of the device? I am imagining that maybe the memory browser of the IDE tries to access the RAM, gdb tries to access the registers of the debug port, and through some race condition or a shoddy connection, those two separate SWD commands end up as a single SWD command that puts those values into the RAM. Is that possible? Has anyone ever seen something similar? And how would I go about debugging or even just verifying that?

I thought about just leaving the logic analyzer connected while the debug session runs until the problem occurs. I can't do that, though, because the logic analyzer can only record so much data, not the several hours until the problem might happen. I know of no way to connect the debugger and the logic analyzer so that when the RAM corruption occurs and the debugger detects it (e.g. by reaching a breakpoint), the logic analyzer stops recording and saves the last few minutes of data so that I can look at the SWD communication. Is it possible to log the SWD transfers some other way? Does gdb offer such a feature? Could I capture the USB data to the J-Link?

Any advice or comments are welcome. These last few weeks I have been aging at four times the usual speed.

Thanks,

Thomas

SEGGER - Alex · Sep 6th 2023, 9:41pm

So far, I agree with your analysis: the pattern in memory looks too similar to the observed transfers to be a coincidence.

May I ask how you use J-Link?
1) gdb -> OpenOCD -> J-Link
2) gdb -> J-Link GDB Server -> J-Link

Accessing the DHCSR is a common operation because it reports the "halted" state of the core. So it is, for example, accessed while the application is running and the debugger executes an IsHalted() check.

It is hard to imagine how the writes to TAR and CSW could end up in RAM.
Bits would have to flip in the sequence so that the DATA register (DRW in your trace) is accessed instead of TAR/CSW.
This should lead to parity errors, and the debug port would reject the writes.
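The parity argument can be checked with a small sketch. It builds the 8-bit SWD request header for AP writes to CSW, TAR and the data register (DRW), assuming the usual ADIv5 bit order (Start, APnDP, RnW, A[2], A[3], Parity, Stop, Park) and the MEM-AP register offsets 0x00, 0x04 and 0x0C. The function names are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Build an SWD request header; bit 0 is the first bit on the wire:
 * Start(1) APnDP RnW A[2] A[3] Parity Stop(0) Park(1).
 * addr is the register offset within the bank (0x0, 0x4, 0x8 or 0xC). */
static uint8_t swd_request(unsigned ap, unsigned read, unsigned addr)
{
    assert((addr & ~0xCu) == 0u);
    unsigned apndp  = ap & 1u;
    unsigned rnw    = read & 1u;
    unsigned a2     = (addr >> 2) & 1u;
    unsigned a3     = (addr >> 3) & 1u;
    /* Parity bit is 1 when APnDP, RnW, A2, A3 contain an odd number of 1s. */
    unsigned parity = apndp ^ rnw ^ a2 ^ a3;
    return (uint8_t)(1u | (apndp << 1) | (rnw << 2) | (a2 << 3) |
                     (a3 << 4) | (parity << 5) | (1u << 7));
}

/* Number of bit positions in which two headers differ. */
static unsigned bit_distance(uint8_t a, uint8_t b)
{
    unsigned x = (unsigned)(a ^ b), n = 0u;
    while (x != 0u) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

/* Parity bit sent after the 32 data bits: 1 for an odd number of 1s. */
static unsigned swd_data_parity(uint32_t word)
{
    unsigned p = 0u;
    while (word != 0u) {
        p ^= (unsigned)(word & 1u);
        word >>= 1;
    }
    return p;
}
```

Under these assumptions, `swd_request(1, 0, 0x04)` (write TAR) and `swd_request(1, 0, 0x0C)` (write DRW) differ in two bit positions, as do the CSW and DRW write headers. Any single flipped header bit therefore fails the parity check; turning a TAR or CSW write into a DRW write would need an address bit and the parity bit to flip together.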

Indeed, two threads accessing the debug port in parallel without proper locking could cause weird things to happen, but not in this way. What could happen is that TAR is overwritten by the second thread shortly after the first one set it, but this would only cause the following read of DATA to access a different address than expected. It would not cause the value meant for TAR to end up in DATA (which would, by the way, require a write access instead of a read access to DATA).

For J-Link GDB Server, I am confident that this two-thread case cannot happen.
For OpenOCD, we cannot guarantee anything, as it bypasses most of the J-Link logic and has its own logic.

When the issue occurred, was the session just running for several hours and waiting for a breakpoint to hit, or was there a lot of halt / resume / single step / … in that period of time?

If just letting the target run is sufficient, then the following might be worth a try:
Create a super-small application in flash with a small stack. It initializes everything after the stack with 0x00000000, then keeps checking whether any of that initialized memory becomes != 0x00000000, and runs onto a breakpoint if that happens.
If the issue can be seen with it, it should be possible to track this down to a chip / debugger problem.
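A minimal sketch of such a test application could look like the following. It is an illustration under assumptions, not a tested KW41Z project: the linker symbols `__canary_start__` and `__canary_end__`, the `CANARY_ON_TARGET` switch and the function names are all hypothetical and would have to be adapted to the actual linker script and startup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the word range [start, end) with 0x00000000. */
static void canary_fill(volatile uint32_t *start, volatile uint32_t *end)
{
    assert(start <= end);
    for (volatile uint32_t *p = start; p < end; ++p) {
        *p = 0u;
    }
}

/* Return the first word in [start, end) that is no longer zero,
 * or NULL if the whole range is still intact. */
static volatile uint32_t *canary_scan(volatile uint32_t *start,
                                      volatile uint32_t *end)
{
    for (volatile uint32_t *p = start; p < end; ++p) {
        if (*p != 0u) {
            return p;
        }
    }
    return NULL;
}

#ifdef CANARY_ON_TARGET
/* Hypothetical linker symbols bounding the RAM that follows the stack. */
extern uint32_t __canary_start__[];
extern uint32_t __canary_end__[];

void canary_main(void)
{
    canary_fill(__canary_start__, __canary_end__);
    for (;;) {
        volatile uint32_t *hit = canary_scan(__canary_start__, __canary_end__);
        if (hit != NULL) {
            /* Halt in the debugger; 'hit' points at the corrupted word. */
            __asm volatile ("bkpt #0");
        }
    }
}
#endif
```

Keeping the fill and scan routines free of hardware dependencies means they can be compiled and checked on a host first; only `canary_main` depends on the target's memory layout.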

If the problem does not turn up, it might be caused by something in combination with what the target application / your firmware is doing.
Maybe a transition to a low-power mode (WFI instruction) happens at the same time as the debugger issues an access to TAR/DATA/…
Or there is some priority bug inside the chip, where a debugger memory access happening at the same time as DMA activity or something similar leads to garbage on the chip.
