Hi,
I am currently debugging a problem on an NXP KW41Z, i.e. a Cortex-M0+, via SWD. At random times (after several hours of running), the device hardfaults because operating system resources get overwritten. There is no rhyme or reason as to what gets overwritten (the affected addresses change), when it happens, or how to reproduce the bug. I have worked on the issue for weeks now and have covered a lot of bases. The obvious and usual avenues (stack overflow, rogue pointers, etc.) came up with nothing. There are a few additional details that would be too much to write here, but the gist is: my colleagues and I have tried everything, checked everything, and everything we know about microcontrollers has been challenged by this bug. The RAM contents just change at random.
The problem has happened several times at my desk over the last few weeks. During the same period, 34 devices ran the exact same firmware in a testing installation, i.e. without an attached debugger. None of them has reset during that time, which means that none of them hit the hardfault that the RAM corruption I saw at my desk would inevitably cause. This leads me to believe that the problem might only occur with an attached debugger.
One of the few consistent things across the ~10 times that I saw the bug and inspected the memory is that the affected spots, 16 bytes each and word-aligned, get overwritten with the values 0x23000012 0xE000EDF0 or 0x23000002 0xE000EDF0. Those values pop up during SWD communication. I attached a logic analyzer to the SWDIO and SWDCLK pins; the decoded log in this post shows the kind of transfer I see between the Segger J-Link and my device during a normal debug session.
0x23000012 0xE000EDF0 are the values that get written into the CSW and TAR registers. 0xE000EDF0, on this platform, is the address of the Debug Halting Control and Status Register, DHCSR.
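To make the difference between the two CSW values visible, here is a minimal sketch that splits them into bit fields. The field positions are my assumption, taken from the ADIv5 MEM-AP CSW layout and cross-checked against the field names the analyzer prints; `decode_csw` is a made-up helper name.

```python
def decode_csw(value: int) -> dict:
    """Split a 32-bit MEM-AP CSW value into its bit fields (assumed ADIv5 layout)."""
    return {
        "Size": value & 0x7,                 # bits [2:0], 0b010 = 32-bit word
        "AddrInc": (value >> 4) & 0x3,       # bits [5:4], 0 = off, 1 = increment single
        "DeviceEn": (value >> 6) & 0x1,      # bit 6
        "TrInProg": (value >> 7) & 0x1,      # bit 7
        "Mode": (value >> 8) & 0xF,          # bits [11:8]
        "SPIDEN": (value >> 23) & 0x1,       # bit 23
        "Prot": (value >> 24) & 0x7F,        # bits [30:24]
        "DbgSwEnable": (value >> 31) & 0x1,  # bit 31
    }


if __name__ == "__main__":
    for csw in (0x23000012, 0x23000002):
        print(f"0x{csw:08X}", decode_csw(csw))
```

Under that layout, the two values differ only in AddrInc: 0x23000012 selects single auto-increment, 0x23000002 selects no increment. Both request 32-bit word accesses.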
Source Code
- "SWD","v1frame",6.60293633,0.00016036,"WData 0x00000000 reg SELECT bits APSEL=0x00, APBANKSEL=0x0, PRESCALER=0x0"
- "SWD","v1frame",6.6030967,5.004e-06,"Data parityok"
- "SWD","v1frame",6.60310304,4.0084e-05,"Request AccessPort Write CSW"
- "SWD","v1frame",6.60314319,5.004e-06,"Turnaround"
- "SWD","v1frame",6.6031482,1.5028e-05,"ACK OK"
- "SWD","v1frame",6.60316324,5e-06,"Turnaround"
- "SWD","v1frame",6.60316883,0.000160356,"WData 0x23000012 reg CSW bits DbgSwEnable=0, Prot=0x23, SPIDEN=0, Mode=0x0, TrInProg=0, DeviceEn=0, AddrInc=Increment single, Size=Word (32 bits)"
- "SWD","v1frame",6.6033292,5.004e-06,"Data parityok"
- "SWD","v1frame",6.60333518,4.0084e-05,"Request AccessPort Write TAR"
- "SWD","v1frame",6.60337532,5.004e-06,"Turnaround"
- "SWD","v1frame",6.60338034,1.502e-05,"ACK OK"
- "SWD","v1frame",6.60339537,5.004e-06,"Turnaround"
- "SWD","v1frame",6.60340096,0.00016036,"WData 0xE000EDF0 reg TAR"
- "SWD","v1frame",6.60356133,5.004e-06,"Data parityok"
- "SWD","v1frame",6.60356755,4.008e-05,"Request AccessPort Read DRW"
- "SWD","v1frame",6.6036077,5.004e-06,"Turnaround"
- "SWD","v1frame",6.60361271,1.5028e-05,"ACK OK"
- "SWD","v1frame",6.60362774,0.000161188,"WData 0x00000000 reg DRW"
- "SWD","v1frame",6.60378894,5.004e-06,"Data parityok"
- "SWD","v1frame",6.60380028,4.0084e-05,"Request DebugPort Read RDBUFF"
- "SWD","v1frame",6.60384043,5.004e-06,"Turnaround"
- "SWD","v1frame",6.60384544,1.5028e-05,"ACK OK"
- "SWD","v1frame",6.60386048,0.000161188,"WData 0x01000000 reg RDBUFF"
- "SWD","v1frame",6.60402168,5e-06,"Data parityok"
- "SWD","v1frame",6.60456573,4.0084e-05,"Request DebugPort Write ABORT"
- "SWD","v1frame",6.60460592,5.036e-06,"Turnaround"
- "SWD","v1frame",6.60461096,1.5108e-05,"ACK OK"
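If I ever manage to capture the moment of corruption, I would not want to read the export by hand. Below is a rough sketch of a filter for this CSV format. It assumes the export always looks like the excerpt above (text in the fifth column, a "Request ..." row before each "WData ..." row, an optional leading "- "). The function names and the idea of flagging DRW writes that carry the CSW/TAR values are mine, not something the analyzer software provides.

```python
import csv
import re

WDATA = re.compile(r"WData (0x[0-9A-Fa-f]+) reg (\w+)")
# Values that should only ever go to CSW/TAR, never to memory through DRW.
SIGNATURE = {0x23000012, 0x23000002, 0xE000EDF0}


def register_transfers(lines):
    """Yield (timestamp, direction, register, value) for every data phase."""
    direction = None  # direction of the most recent "Request" row
    for raw in lines:
        line = raw.strip().lstrip("- ")
        if not line:
            continue
        row = next(csv.reader([line]))
        if len(row) < 5:
            continue
        text = row[4]
        if text.startswith("Request"):
            direction = "Write" if " Write " in text else "Read"
            continue
        match = WDATA.match(text)
        if match:
            yield float(row[2]), direction, match.group(2), int(match.group(1), 16)


def suspicious_writes(lines):
    """Return DRW writes whose data is one of the CSW/TAR signature values."""
    return [t for t in register_transfers(lines)
            if t[1] == "Write" and t[2] == "DRW" and t[3] in SIGNATURE]


SAMPLE = [
    '- "SWD","v1frame",6.60310304,4.0084e-05,"Request AccessPort Write CSW"',
    '- "SWD","v1frame",6.60316883,0.000160356,"WData 0x23000012 reg CSW bits DbgSwEnable=0, Prot=0x23, SPIDEN=0, Mode=0x0, TrInProg=0, DeviceEn=0, AddrInc=Increment single, Size=Word (32 bits)"',
    '- "SWD","v1frame",6.60333518,4.0084e-05,"Request AccessPort Write TAR"',
    '- "SWD","v1frame",6.60340096,0.00016036,"WData 0xE000EDF0 reg TAR"',
    '- "SWD","v1frame",6.60356755,4.008e-05,"Request AccessPort Read DRW"',
    '- "SWD","v1frame",6.60362774,0.000161188,"WData 0x00000000 reg DRW"',
]

if __name__ == "__main__":
    for transfer in register_transfers(SAMPLE):
        print(transfer)
    print("suspicious:", suspicious_writes(SAMPLE))
```

On a healthy capture like the excerpt, the suspicious list stays empty. A hit would mean one of those values really was sent as memory data over SWD, which would support the hypothesis.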
I don't know much about SWD or how live debugging works under the hood, so I am looking for some guidance here from someone with more experience. But from what I know so far, I have a hypothesis that I would try to test:
I have a whole Eclipse-based IDE running gdb via a Segger J-Link, connected to the device via SWD. Is it feasible that, through some problem or other, a miscommunication happens on the SWD line, and because of it the values 0x23000012 0xE000EDF0 do not end up in the CSW and TAR registers as intended, but in the RAM of the device? I am imagining that maybe the memory browser of the IDE tries to access the RAM while gdb tries to access the registers of the debug port, and through some race condition or a shoddy connection, those two separate SWD commands end up as a single SWD command that puts those values into the RAM. Is that possible? Has anyone ever seen something similar? And how would I go about debugging or even just verifying that?
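One check I can do on my side is to search a RAM dump for the signature rather than eyeballing the memory browser. The following is only a sketch: it assumes a raw little-endian binary dump (for example one written with gdb's `dump binary memory`), and `find_signature` and the base address in the example are placeholders I made up.

```python
import struct

# The two word pairs I keep finding in corrupted RAM.
PAIRS = {(0x23000012, 0xE000EDF0), (0x23000002, 0xE000EDF0)}


def find_signature(dump: bytes, base_addr: int) -> list:
    """Return the addresses of word-aligned signature pairs in a RAM dump."""
    usable = len(dump) - (len(dump) % 4)  # ignore a trailing partial word
    words = [w for (w,) in struct.iter_unpack("<I", dump[:usable])]
    return [base_addr + 4 * i
            for i in range(len(words) - 1)
            if (words[i], words[i + 1]) in PAIRS]


if __name__ == "__main__":
    # Synthetic dump: two clean words, one signature pair, one clean word.
    fake = struct.pack("<5I", 0, 0xDEADBEEF, 0x23000012, 0xE000EDF0, 0)
    hits = find_signature(fake, base_addr=0x20000000)  # placeholder base address
    print([hex(a) for a in hits])
```

The scan only looks at word-aligned positions, which matches what I have seen so far. The base address has to be the address the dump was actually taken from.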
I thought about simply leaving the logic analyzer connected and the debug session running until the problem occurs. I can't do that, though, because the logic analyzer can only record a limited amount of data, not the several hours it might take for the problem to show up. I know of no way to couple the debugger and the logic analyzer so that, when the RAM corruption occurs and the debugger detects it (e.g. by reaching a breakpoint), the logic analyzer stops recording and saves the last few minutes of data for me to inspect the SWD communication. Is it possible to log the SWD transfers some other way? Does gdb offer such a feature? Could I capture the USB data to the J-Link?
Any advice or comments are welcome. These last few weeks I have been aging at four times the usual speed.
Thanks,
Thomas