[SOLVED] RTT usage of SEGGER_RTT_ASM_ARMv7M.S

fraengers · Sep 10th 2022, 6:27pm

Hi,
can you please tell me why and how to use SEGGER_RTT_ASM_ARMv7M.S? There is not much information about it.

What is the advantage, is it faster?
How do I use it? Just setting #define USE_RTT_ASM (1) in SEGGER_RTT_Conf.h doesn't do much, as far as I can tell.

I am working with Dave IDE (gcc) with XMC4400 (Cortex M4)

SEGGER - Alex · Sep 13th 2022, 8:43pm

Usually, there is nothing to do.
The ASM variant is active by default for GCC + Cortex-M4.

Yes, the ASM variant is faster, and it is consistently fast, regardless of the compiler optimization level used for the code around it.

fraengers · Sep 16th 2022, 1:14pm

I tested two versions, 6.40 and 7.80a:

V6.40 has #define USE_RTT_ASM 1 in SEGGER_RTT_Conf.h, but this has no effect.

In both versions the presence of SEGGER_RTT_ASM_ARMv7M.S does not change the output.
Not using SEGGER_RTT.c results in errors.

Since the times for writing 82 chars are far away from what is claimed on the website (<= 1us @ 168MHz Cortex M4), I did a bit of testing.
I'm only running at 120MHz, so my times should be slower by a factor of 1.4. But the best I could do was 3.5 us.
The only way to achieve <1us for just copying bytes is by using an assembler version of memcpy I found on the internet, or by using DMA.

Best case RTT_Write I could do takes 3.5us
82 byte memcpy takes 5 us
82 byte memcpy32 takes 0.8us
82 byte copy loop takes 9.1 us
84 byte copy loop with 32bits at a time takes 2.5 us
84 byte DMA copying 32bits at a time takes 0.8us

So an assembler version of memcpy is fast, but I cannot get RTT to that speed. Please tell me what I'm doing wrong.
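The clock-scaling estimate and the per-byte cost behind these figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `scale_time_us` and `cycles_per_byte` are made up for this example, and the scaling assumes the work is purely core-clock bound, which ignores flash wait states and bus effects.

```c
#include <assert.h>

/* Hypothetical helper: rescale a duration measured at f_from_mhz to the
   duration expected at f_to_mhz, assuming the same number of core cycles
   is needed at both clock rates. */
static double scale_time_us(double t_us, double f_from_mhz, double f_to_mhz) {
    assert(f_from_mhz > 0.0 && f_to_mhz > 0.0);
    return t_us * f_from_mhz / f_to_mhz;
}

/* Hypothetical helper: average core cycles spent per byte
   (microseconds * MHz = cycles). */
static double cycles_per_byte(double t_us, double f_mhz, unsigned num_bytes) {
    assert(f_mhz > 0.0 && num_bytes > 0u);
    return t_us * f_mhz / (double)num_bytes;
}
```

With the figures from this post, `scale_time_us(1.0, 168.0, 120.0)` gives 1.4 us, so 1 us at 168MHz corresponds to roughly 1.4 us at 120MHz. The measured 3.5 us for 82 bytes works out to about 5.1 cycles per byte via `cycles_per_byte(3.5, 120.0, 82)`, while the 0.8 us of memcpy32 is about 1.2 cycles per byte.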

My test code below:

C Source Code

/*
"\"D:\\Programme\\DAVE\\eclipse\\ARM-GCC-49\\bin\\make\"" --output-sync -j12 all
'Building file: ../main.c'
'Invoking: ARM-GCC C Compiler'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-gcc" -MMD -MT "main.o" -DXMC4500_F100x1024 -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/XMCLib/inc" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/CMSIS/Include" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/CMSIS/Infineon/XMC4500_series/Include" -I"D:/Anwendungsdaten/Dave/WS/rtt" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries" -O3 -ffunction-sections -fdata-sections -Wall -std=gnu99 -mfloat-abi=softfp -Wa,-adhlns="main.o.lst" -pipe -c -fmessage-length=0 -MMD -MP -MF"main.d" -MT"main.d main.o" -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mthumb -g -gdwarf-2 -o "main.o" "../main.c"
'Finished building: ../main.c'
'Building target: rtt.elf'
'Invoking: ARM-GCC C Linker'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-gcc" -T"../linker_script.ld" -nostartfiles -Xlinker --gc-sections -specs=nano.specs -specs=nosys.specs -Wl,-Map,"rtt.map" -mfloat-abi=softfp -mfpu=fpv4-sp-d16 -mcpu=cortex-m4 -mthumb -g -gdwarf-2 -o "rtt.elf" "@objects.rsp" -lm
'Finished building target: rtt.elf'
'Invoking: ARM-GCC Create Flash Image'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-objcopy" -O ihex "rtt.elf" "rtt.hex"
'Finished building: rtt.hex'
'Invoking: ARM-GCC Print Size'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-size" --format=berkeley "rtt.elf"
text data bss dec hex filename
5176 168 3276 8620 21ac rtt.elf
'Finished building: rtt.siz'
'Invoking: ARM-GCC Create Listing'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-objdump" -h -S "rtt.elf" > "rtt.lst"
'Finished building: rtt.lst'
*/
#include "xmc_common.h"
#include <xmc_gpio.h>
#include <xmc_dma.h>
//#include "RTT/SEGGER_RTT.h" // 7.80a
#include "RTT_old/SEGGER_RTT.h" // 6.40
#include "memcpy32.h"
char src[84] = "1111111111222222222233333333334444444444555555555566666666667777777777888888888899FF";
char dst[84] = "000000000000000000000000000000000000000000000000000000000000000000000000000000000000";
#define pin_high PORT2->OMR = 0x00000002UL;
#define pin_low PORT2->OMR = 0x00020000UL;
#define delay for (uint32_t i = 0; i < 100; i++) { asm volatile ("NOP"); }
#define reset_dst for (uint32_t i = 0; i < 84; i++) { dst[i] = 0; }
void check(void) {
    if (memcmp(src, dst, 82) != 0)
        __BKPT();
}
void GPDMA0_0_IRQHandler(void) {
    pin_low
    check();
    reset_dst
    __BKPT();
}
int main(void) {
    // GPIO
    XMC_GPIO_CONFIG_t gpio_cfg = {
        .mode = XMC_GPIO_MODE_OUTPUT_PUSH_PULL,
        .output_level = XMC_GPIO_OUTPUT_LEVEL_LOW,
    };
    XMC_GPIO_Init(P2_1, &gpio_cfg);
    // DMA
    XMC_DMA_CH_CONFIG_t dma_ch_config = {
        .block_size = 21,
        .src_addr = (uint32_t)&src[0],
        .dst_addr = (uint32_t)&dst[0],
        .src_transfer_width = XMC_DMA_CH_TRANSFER_WIDTH_32,
        .dst_transfer_width = XMC_DMA_CH_TRANSFER_WIDTH_32,
        .src_address_count_mode = XMC_DMA_CH_ADDRESS_COUNT_MODE_INCREMENT,
        .dst_address_count_mode = XMC_DMA_CH_ADDRESS_COUNT_MODE_INCREMENT,
        .src_burst_length = XMC_DMA_CH_BURST_LENGTH_1,
        .dst_burst_length = XMC_DMA_CH_BURST_LENGTH_1,
        .transfer_flow = XMC_DMA_CH_TRANSFER_FLOW_M2M_DMA,
        .transfer_type = XMC_DMA_CH_TRANSFER_TYPE_SINGLE_BLOCK,
        .enable_interrupt = true
    };
    XMC_DMA_Init(XMC_DMA0);
    XMC_DMA_CH_Init(XMC_DMA0, 0, &dma_ch_config);
    XMC_DMA_CH_EnableEvent(XMC_DMA0, 0, XMC_DMA_CH_EVENT_TRANSFER_COMPLETE);
    NVIC_SetPriority(GPDMA0_0_IRQn, NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 63, 0));
    NVIC_EnableIRQ(GPDMA0_0_IRQn);
    // RTT
    SEGGER_RTT_ConfigUpBuffer(0, NULL, NULL, 0, SEGGER_RTT_MODE_NO_BLOCK_SKIP);
    // Tests
    pin_high
    SEGGER_RTT_Write(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 3.5 us @ -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP
    // RTT 6.40 -O0: 7.7 us
    // RTT 6.40 -O0 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 19.0 us
    // RTT 6.40 -O3 6.6 us
    // RTT 6.40 -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 3.5 us
    // RTT 7.80a -O0: 7.9 us
    // RTT 7.80a -O0 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 19.1 us
    // RTT 7.80a -O3 6.5 us
    // RTT 7.80a -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 6.3 us
    pin_low
    delay
    reset_dst
    pin_high
    memcpy(dst, src, 82); // 5 us @ -O3
    pin_low
    check();
    reset_dst
    delay
    pin_high
    // https://gist.github.com/Erlkoenig90/fa1dc89e52e63a3697ad1eaaad19ee99
    memcpy32(dst, src, 82); // 0.87 us @ -O3
    pin_low
    check();
    reset_dst
    delay
    pin_high
    { // copying 82 x 1 byte 9.2 us @ -O3
        char * d = dst;
        char * s = src;
        for (volatile uint32_t i = 0; i < 82; i++) {
            *d++ = *s++;
        }
    }
    pin_low
    check();
    reset_dst
    delay
    pin_high
    { // copying 21 x 4 bytes 2.58 us @ -O3
        uint32_t * d = (uint32_t *)dst;
        uint32_t * s = (uint32_t *)src;
        for (volatile uint32_t i = 0; i < 21; i++) { // copies 84 char
            *d++ = *s++;
        }
    }
    pin_low
    check();
    reset_dst
    delay
    pin_high
    // copying 21 x 4 bytes with DMA 0.87 us @ -O3
    XMC_DMA_CH_Enable(XMC_DMA0, 0);
    while (1U) {
    }
}
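The difference between the byte loop and the 32-bit loop in the test comes largely from the number of load/store pairs. As a rough illustration only (this is neither the linked memcpy32 nor SEGGER's routine), a word-wise copy with a byte tail could look like the sketch below. `copy_words` is a hypothetical name; the 4-byte `memcpy` calls are used so the sketch does not depend on the alignment of the buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 32 bits at a time, then finish the
   remaining 0..3 bytes one by one. The fixed-size memcpy calls avoid
   alignment and aliasing problems; an optimizing compiler typically
   turns each of them into a single load or store. */
static void copy_words(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* one 32-bit load  */
        memcpy(d, &w, sizeof w);   /* one 32-bit store */
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n > 0) {                /* byte tail */
        *d++ = *s++;
        n--;
    }
}
```

For the 82-byte test string this performs 20 word copies and 2 byte copies instead of 82 byte copies.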


SEGGER - Fabian · Feb 15th 2023, 5:09pm

Hi,
it seems there are some misunderstandings here.
I hope to clear them up with this post.

1) When is the ASM sub module used
The ASM sub module contains an ASM variant of SEGGER_RTT_WriteSkipNoLock().
This function is used by default for most cores (e.g. Cortex-M4).
However, you are using SEGGER_RTT_Write() not SEGGER_RTT_WriteSkipNoLock().
So in your example code the ASM routine is actually not used.
You can check this by looking into the SEGGER_RTT.c and SEGGER_RTT.h files, which contain the compiler switches (asm available or not) and code used by RTT (asm version of SEGGER_RTT_WriteSkipNoLock() used or not).
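One way to see which path a build takes is to test the RTT_USE_ASM switch at compile time and to call SEGGER_RTT_WriteSkipNoLock() directly. The following is a sketch under assumptions, not SEGGER's recommended usage: `HAVE_SEGGER_RTT`, `rtt_asm_enabled` and `rtt_write_fast` are hypothetical names, and the `#else` branch only provides host-side stand-ins so the fragment compiles without the RTT sources.

```c
#include <assert.h>
#include <string.h>

#ifdef HAVE_SEGGER_RTT   /* hypothetical switch: define when building with the RTT sources */
  #include "SEGGER_RTT.h"
#else
  /* Host-side stand-ins, for illustration only */
  #define RTT_USE_ASM 0
  #define SEGGER_RTT_LOCK()
  #define SEGGER_RTT_UNLOCK()
  static char     stub_buf[128];
  static unsigned stub_used;
  static unsigned SEGGER_RTT_WriteSkipNoLock(unsigned BufferIndex,
                                             const void *pBuffer,
                                             unsigned NumBytes) {
      (void)BufferIndex;
      if (NumBytes > sizeof stub_buf - stub_used) {
          return 0u;                      /* does not fit: skip the data */
      }
      memcpy(&stub_buf[stub_used], pBuffer, NumBytes);
      stub_used += NumBytes;
      return 1u;                          /* data has been copied */
  }
#endif

/* Returns 1 if this translation unit sees RTT_USE_ASM enabled, else 0. */
static int rtt_asm_enabled(void) {
#if defined(RTT_USE_ASM) && RTT_USE_ASM
    return 1;
#else
    return 0;
#endif
}

/* Writes a string to up-buffer 0 through the skip/no-lock path,
   with the locking done by the caller side as in the NoLock API. */
static unsigned rtt_write_fast(const char *s) {
    unsigned r;
    assert(s != NULL);
    SEGGER_RTT_LOCK();
    r = SEGGER_RTT_WriteSkipNoLock(0u, s, (unsigned)strlen(s));
    SEGGER_RTT_UNLOCK();
    return r;
}
```

On the target, RTT has to be initialized before the first call, for example by SEGGER_RTT_Init() or by the SEGGER_RTT_ConfigUpBuffer() call used in the test code.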

2) What does output time actually mean
The output time is the time the RTT module needs to output data.
It is measured from the call of the RTT function until the data is in the buffer and therefore available to be read by J-Link, without overhead(!).

The output time can be measured with a scope and an application that toggles a pin:
a) Set the pin (e.g. LED pin).

b) Measure the time (clear to set) of the following calls to get the overhead time:
BSP_ClrLED(0);
SEGGER_RTT_LOCK();
Status = SEGGER_RTT_WriteNoLock(0, 0, 0);
SEGGER_RTT_UNLOCK();
BSP_SetLED(0);

c) Measure the time (clear to set) of the following calls to get the actual output time:
BSP_ClrLED(0);
SEGGER_RTT_LOCK();
Status = SEGGER_RTT_WriteNoLock(0, "01234567890123456789012345678901234567890123456789012345678901234567890123456789\r\n", 82);
SEGGER_RTT_UNLOCK();
BSP_SetLED(0);

I repeated the measurement on the SEGGER Cortex-M Trace Reference Board (168MHz) and the result was as follows:
Measurement 1) (overhead): 2.04us
Measurement 2) (82 chars): 2.70us
=> Time without overhead: 2.70us - 2.04us = 0.66us
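The overhead subtraction from this measurement can be written as a small calculation. This is only a sketch with hypothetical helper names (`net_time_us`, `ns_per_byte`); the numbers plugged in afterwards are the ones reported in this post.

```c
#include <assert.h>

/* Hypothetical helper: time attributable to the payload itself,
   i.e. the full call minus the empty (0-byte) call. */
static double net_time_us(double total_us, double overhead_us) {
    assert(total_us >= overhead_us);
    return total_us - overhead_us;
}

/* Hypothetical helper: average time per byte in nanoseconds. */
static double ns_per_byte(double t_us, unsigned num_bytes) {
    assert(num_bytes > 0u);
    return t_us * 1000.0 / (double)num_bytes;
}
```

`net_time_us(2.70, 2.04)` yields 0.66 us, and `ns_per_byte(0.66, 82)` comes out at roughly 8 ns per byte for the 82-character string.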

3) What factors can impact the output time
The test I did was executed from flash.
As the string is copied from flash to RAM, factors such as the RAM write speed and cache handling (if the core has any) have an impact on the measured time.

BR
Fabian

fraengers · Feb 22nd 2023, 12:13am

The main problem is that I cannot get to the "time to do a single memcopy" speed. Or worded differently: my memcpy is slow.

Here are some results:

Source Code

// J-Link / RTT V7.86
SEGGER_RTT_ConfigUpBuffer(0, NULL, NULL, 0, SEGGER_RTT_MODE_NO_BLOCK_SKIP);
SEGGER_RTT_Write(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 6.58 us @ -O3
SEGGER_RTT_Write(0, "0", 1); // 1.83 us @ -O3
SEGGER_RTT_WriteNoLock(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 6.29 us @ -O3
SEGGER_RTT_WriteNoLock(0, 0, 0); // 1.33 us @ -O3
SEGGER_RTT_WriteSkipNoLock(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 4.5 us @ -O3
SEGGER_RTT_WriteSkipNoLock(0, "0", 1); // 0.75 us @ -O3
memcpy(dst, src, 82); // 4.96 us @ -O3
memcpy32(dst, src, 82); // 0.9 us @ -O3 https://gist.github.com/Erlkoenig90/fa1dc89e52e63a3697ad1eaaad19ee99


Are you using a special memcpy function? (I'm not a software engineer so I can't confidently tell from the code)

If so, why is it not as fast as in your test? The memcpy32 function shows that it can be fast.
If not, why is RTT still faster than calling memcpy manually?

SEGGER - Fabian · Mar 8th 2023, 5:12pm

Hi,
Your question is not related to RTT; it concerns general
C and assembly language programming.

Please understand that this far exceeds the scope of the help we can
provide in this Forum or in our support ticket system.

I suggest that you take the time to check the RTT target side implementation, which you have all sources for.
You could either
a) follow the function calls or
b) step through the code
until you find where the memory is copied in the source.

As a hint:
As long as RTT_USE_ASM is set, you can find the memory copy implementation in SEGGER_RTT_ASM_ARMv7M.S => SEGGER_RTT_ASM_WriteSkipNoLock().

I dare say that it is well enough documented to find the code where the memory is copied.

Regarding the question why some memcpy implementations are faster than others:
I suggest searching the internet for this information. I am sure you will find well-explained answers to these questions.

If you can come up with a faster/more optimized routine to copy the memory,
you are free to adjust the code you are using for RTT.
You have all the sources and are free to change them.
Please understand however, that we can not provide any support or similar for code
adjusted by users.

Please understand that for these reasons we cannot provide any more answers regarding this topic.
We will close this thread now.

BR
Fabian
