[SOLVED] RTT usage of SEGGER_RTT_ASM_ARMv7M.S

fraengers · September 10, 2022 at 6:27 PM

Hi,
can you please tell me why and how to use SEGGER_RTT_ASM_ARMv7M.S? There is not much information on it.

What is the advantage? Is it faster?
How do I use it? Just setting #define USE_RTT_ASM (1) in SEGGER_RTT_Conf.h doesn't do much, as far as I can tell.

I am working with the DAVE IDE (GCC) on an XMC4400 (Cortex-M4).

SEGGER - Alex · September 13, 2022 at 8:43 PM

Usually, there is nothing to do.
The ASM variant is active by default for GCC + Cortex-M4.

Yes, the ASM variant is faster, and it is consistently fast regardless of the compiler optimization level used for the code around it.

fraengers · September 16, 2022 at 1:14 PM

I tested two versions, 6.40 and 7.80a:

V6.40 has #define USE_RTT_ASM 1 in SEGGER_RTT_Conf.h, but this has no effect.

In both versions the presence of SEGGER_RTT_ASM_ARMv7M.S does not change the output.
Not using SEGGER_RTT.c results in errors.

The times for writing 82 chars are far from what is claimed on the website (<= 1us @ 168MHz Cortex M4), so I did a bit of testing.
I'm only running at 120MHz, so my times should be slower by a factor of 1.4. But the best I could achieve was 3.5 us.
The only ways I found to get below 1 us for just copying bytes are an assembler version of memcpy I found on the internet, or DMA.
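For reference, the factor of 1.4 can be written out as a small helper. This is only a sketch: the function name scale_time_ns is made up for illustration, and it assumes the run time scales inversely with the core clock, which ignores flash wait states and bus effects.

```c
#include <stdint.h>

/* Scale a reference time to another core clock, assuming the cycle count
 * stays the same (an assumption, not a measured fact).
 * scale_time_ns is a hypothetical helper name. */
uint32_t scale_time_ns(uint32_t ref_ns, uint32_t ref_mhz, uint32_t target_mhz) {
    /* 64-bit intermediate avoids overflow for larger inputs */
    return (uint32_t)(((uint64_t)ref_ns * ref_mhz) / target_mhz);
}
```

With the numbers above, scale_time_ns(1000, 168, 120) gives 1400, i.e. 1 us at 168 MHz corresponds to about 1.4 us at 120 MHz, still well below the measured 3.5 us.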

Best case for SEGGER_RTT_Write that I could achieve: 3.5 us
82-byte memcpy: 5 us
82-byte memcpy32: 0.8 us
82-byte copy loop: 9.1 us
84-byte copy loop, 32 bits at a time: 2.5 us
84-byte DMA copy, 32 bits at a time: 0.8 us

So an assembler version of memcpy is fast, but I cannot reach that speed with RTT. Please tell me what I'm doing wrong.
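The gap between the byte loop and the 32-bit loop comes down to the number of load/store pairs. A minimal portable C sketch of the two approaches follows. The names copy_bytes and copy_words are made up for illustration, this is not the memcpy32 from the gist (which is assembler), and the word version assumes both buffers are declared as uint32_t arrays so alignment is guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise copy: one load and one store per byte.
 * Returns the number of loop iterations for comparison. */
size_t copy_bytes(uint8_t *dst, const uint8_t *src, size_t n_bytes) {
    size_t i;
    for (i = 0; i < n_bytes; i++) {
        dst[i] = src[i];
    }
    return i;
}

/* Word-wise copy: one 32-bit load and one 32-bit store per four bytes.
 * n_bytes is rounded up to whole words, so both buffers must have room
 * for the rounded-up size (82 bytes -> 21 words = 84 bytes). */
size_t copy_words(uint32_t *dst, const uint32_t *src, size_t n_bytes) {
    size_t n_words = (n_bytes + 3u) / 4u;
    size_t i;
    for (i = 0; i < n_words; i++) {
        dst[i] = src[i];
    }
    return i;
}
```

For 82 bytes, copy_bytes runs 82 iterations and copy_words runs 21. An optimizing compiler may transform either loop, so the measured ratio will not necessarily be 4:1.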

My test code below:

C

/*
"\"D:\\Programme\\DAVE\\eclipse\\ARM-GCC-49\\bin\\make\"" --output-sync -j12 all 
'Building file: ../main.c'
'Invoking: ARM-GCC C Compiler'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-gcc" -MMD -MT "main.o" -DXMC4500_F100x1024 -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/XMCLib/inc" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/CMSIS/Include" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/CMSIS/Infineon/XMC4500_series/Include" -I"D:/Anwendungsdaten/Dave/WS/rtt" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries" -O3 -ffunction-sections -fdata-sections -Wall -std=gnu99 -mfloat-abi=softfp -Wa,-adhlns="main.o.lst" -pipe -c -fmessage-length=0 -MMD -MP -MF"main.d" -MT"main.d main.o" -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mthumb -g -gdwarf-2 -o "main.o" "../main.c" 
'Finished building: ../main.c'


'Building target: rtt.elf'
'Invoking: ARM-GCC C Linker'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-gcc" -T"../linker_script.ld" -nostartfiles -Xlinker --gc-sections -specs=nano.specs -specs=nosys.specs -Wl,-Map,"rtt.map" -mfloat-abi=softfp -mfpu=fpv4-sp-d16 -mcpu=cortex-m4 -mthumb -g -gdwarf-2 -o "rtt.elf" "@objects.rsp"  -lm
'Finished building target: rtt.elf'


'Invoking: ARM-GCC Create Flash Image'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-objcopy" -O ihex "rtt.elf" "rtt.hex"
'Finished building: rtt.hex'


'Invoking: ARM-GCC Print Size'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-size" --format=berkeley "rtt.elf"
   text	   data	    bss	    dec	    hex	filename
   5176	    168	   3276	   8620	   21ac	rtt.elf
'Finished building: rtt.siz'


'Invoking: ARM-GCC Create Listing'
"D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-objdump" -h -S "rtt.elf" > "rtt.lst"
'Finished building: rtt.lst'
*/


#include <string.h>  // memcpy, memcmp
#include "xmc_common.h"
#include <xmc_gpio.h>
#include <xmc_dma.h>
//#include "RTT/SEGGER_RTT.h"  // 7.80a
#include "RTT_old/SEGGER_RTT.h"  // 6.40
#include "memcpy32.h"


char src[84] = "1111111111222222222233333333334444444444555555555566666666667777777777888888888899FF";
char dst[84] = "000000000000000000000000000000000000000000000000000000000000000000000000000000000000";


#define pin_high PORT2->OMR = 0x00000002UL;
#define pin_low PORT2->OMR = 0x00020000UL;
#define delay for (uint32_t i = 0; i < 100; i++) { asm volatile ("NOP"); }
#define reset_dst for (uint32_t i = 0; i < 84; i++) { dst[i] = 0; }


void check(void) {
    if (memcmp(src, dst, 82) != 0)
        __BKPT();
}




void GPDMA0_0_IRQHandler(void) {
    pin_low


    check();
    reset_dst
    __BKPT();
}




int main(void) {
    // GPIO
    XMC_GPIO_CONFIG_t gpio_cfg = {
        .mode = XMC_GPIO_MODE_OUTPUT_PUSH_PULL,
        .output_level = XMC_GPIO_OUTPUT_LEVEL_LOW,
    };
    XMC_GPIO_Init(P2_1, &gpio_cfg);


    // DMA
    XMC_DMA_CH_CONFIG_t dma_ch_config = {
        .block_size = 21,
        .src_addr = (uint32_t)&src[0],
        .dst_addr = (uint32_t)&dst[0],
        .src_transfer_width = XMC_DMA_CH_TRANSFER_WIDTH_32,
        .dst_transfer_width = XMC_DMA_CH_TRANSFER_WIDTH_32,
        .src_address_count_mode = XMC_DMA_CH_ADDRESS_COUNT_MODE_INCREMENT,
        .dst_address_count_mode = XMC_DMA_CH_ADDRESS_COUNT_MODE_INCREMENT,
        .src_burst_length = XMC_DMA_CH_BURST_LENGTH_1,
        .dst_burst_length = XMC_DMA_CH_BURST_LENGTH_1,
        .transfer_flow = XMC_DMA_CH_TRANSFER_FLOW_M2M_DMA,
        .transfer_type = XMC_DMA_CH_TRANSFER_TYPE_SINGLE_BLOCK,
        .enable_interrupt = true
    };
    XMC_DMA_Init(XMC_DMA0);
    XMC_DMA_CH_Init(XMC_DMA0, 0, &dma_ch_config);
    XMC_DMA_CH_EnableEvent(XMC_DMA0, 0, XMC_DMA_CH_EVENT_TRANSFER_COMPLETE);
    NVIC_SetPriority(GPDMA0_0_IRQn, NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 63, 0));
    NVIC_EnableIRQ(GPDMA0_0_IRQn);


    // RTT
    SEGGER_RTT_ConfigUpBuffer(0, NULL, NULL, 0, SEGGER_RTT_MODE_NO_BLOCK_SKIP);


    // Tests
    pin_high
    SEGGER_RTT_Write(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82);     // 3.5 us @ -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP
    // RTT 6.40     -O0:                                    7.7 us
    // RTT 6.40     -O0 + SEGGER_RTT_MEMCPY_USE_BYTELOOP   19.0 us
    // RTT 6.40     -O3                                     6.6 us
    // RTT 6.40     -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP    3.5 us
    // RTT 7.80a    -O0:                                    7.9 us
    // RTT 7.80a    -O0 + SEGGER_RTT_MEMCPY_USE_BYTELOOP   19.1 us
    // RTT 7.80a    -O3                                     6.5 us
    // RTT 7.80a    -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP    6.3 us
    pin_low


    delay
    reset_dst


    pin_high
    memcpy(dst, src, 82); // 5 us @ -O3
    pin_low


    check();
    reset_dst
    delay


    pin_high
    // https://gist.github.com/Erlkoenig90/fa1dc89e52e63a3697ad1eaaad19ee99
    memcpy32(dst, src, 82); // 0.87 us @ -O3
    pin_low


    check();
    reset_dst
    delay


    pin_high
    { // copying 82 x 1 byte: 9.2 us @ -O3
        char * d = dst;
        char * s = src;
        for (volatile uint32_t i = 0; i < 82; i++) {
            *d++ = *s++;
        }
    }
    pin_low


    check();
    reset_dst
    delay


    pin_high
    { // copying 21 x 4 bytes: 2.58 us @ -O3
        uint32_t * d = (uint32_t *)dst;
        uint32_t * s = (uint32_t *)src;
        for (volatile uint32_t i = 0; i < 21; i++) { // copies 84 char
            *d++ = *s++;
        }
    }
    pin_low


    check();
    reset_dst
    delay


    pin_high
    // copying 21 x 4 bytes with DMA: 0.87 us @ -O3
    XMC_DMA_CH_Enable(XMC_DMA0, 0);




    while (1U) {
    }
}


SEGGER - Fabian · February 15, 2023 at 5:09 PM

Hi,
it seems there are some misunderstandings here,
which I hope to clear up with this post.

1) When is the ASM sub module used
The ASM sub module contains an ASM variant of SEGGER_RTT_WriteSkipNoLock().
This function is used by default for most cores (e.g. Cortex-M4).
However, you are using SEGGER_RTT_Write(), not SEGGER_RTT_WriteSkipNoLock().
So in your example code, the ASM routine is not actually used.
You can check this by looking into the SEGGER_RTT.c and SEGGER_RTT.h files, which contain the compiler switches (asm available or not) and code used by RTT (asm version of SEGGER_RTT_WriteSkipNoLock() used or not).
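As a rough way to see which variant a build selects, the switch can be evaluated in your own code. The sketch below assumes the macro is RTT_USE_ASM and that it becomes visible after including SEGGER_RTT.h. RTT_ASM_ACTIVE and rtt_asm_is_active are made-up names for illustration, and the include is left as a comment so the fragment also compiles without the RTT sources (in which case it reports 0).

```c
/* In a project that contains the RTT sources, include the header first so
 * that RTT_USE_ASM is visible here:
 *   #include "SEGGER_RTT.h"
 */
#if defined(RTT_USE_ASM) && (RTT_USE_ASM != 0)
  #define RTT_ASM_ACTIVE 1   /* hypothetical helper macro */
#else
  #define RTT_ASM_ACTIVE 0
#endif

/* Returns 1 if the build was compiled with the ASM variant enabled. */
int rtt_asm_is_active(void) {
    return RTT_ASM_ACTIVE;
}
```

The return value can be checked in the debugger or printed once at startup. Even when it is 1, only calls to SEGGER_RTT_WriteSkipNoLock() go through the ASM routine.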

2) What does output time actually mean
The output time is the time it takes the RTT module to output data.
It is the time from the call of the RTT function until the data is available for J-Link to read, i.e. until it is in the buffer, excluding overhead(!).

The output time can be measured with a scope and an application that toggles a pin:
a) Set the pin (e.g. LED pin).

b) Measure the time (clear to set) of the following calls to get the overhead time:
BSP_ClrLED(0);
SEGGER_RTT_LOCK();
Status = SEGGER_RTT_WriteNoLock(0, 0, 0);
SEGGER_RTT_UNLOCK();
BSP_SetLED(0);

c) Measure the time (clear to set) of the following calls to get the actual output time:
BSP_ClrLED(0);
SEGGER_RTT_LOCK();
Status = SEGGER_RTT_WriteNoLock(0, "01234567890123456789012345678901234567890123456789012345678901234567890123456789\r\n", 82);
SEGGER_RTT_UNLOCK();
BSP_SetLED(0);

I repeated the measurement on the SEGGER Cortex-M Trace Reference Board (168MHz) and the result was as follows:
Measurement 1) (overhead): 2.04us
Measurement 2) (82 chars): 2.70us
=> Time without overhead: 2.70us - 2.04us = 0.66us
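The subtraction above, and what it means in CPU cycles, can be written as two small helpers. This is a sketch for illustration only: net_time_ns and ns_to_cycles are made-up names, and times are given in nanoseconds to stay in integer arithmetic.

```c
#include <stdint.h>

/* Time spent on the payload alone: full call minus empty call, in ns.
 * Clamps to 0 if the empty call happened to measure longer. */
uint32_t net_time_ns(uint32_t with_payload_ns, uint32_t empty_call_ns) {
    return (with_payload_ns > empty_call_ns)
               ? (with_payload_ns - empty_call_ns)
               : 0u;
}

/* Convert a duration to CPU cycles at the given core clock, rounded down. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t core_mhz) {
    return (uint32_t)(((uint64_t)ns * core_mhz) / 1000u);
}
```

With the values from this measurement, net_time_ns(2700, 2040) is 660 ns, and ns_to_cycles(660, 168) is 110 cycles, which works out to roughly 1.35 cycles per byte for the 82-byte string.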

3) What factors can impact the output time
The test I did was executed from flash.
Because the string is copied from flash to RAM, factors such as the RAM write speed and cache handling (if the core has a cache) affect the measured time.

BR
Fabian

fraengers · February 22, 2023 at 12:13 AM

The main problem is that I cannot get to the "time to do a single memcopy" speed. Or worded differently: my memcpy is slow.

Here are some results:

Code

// J-Link / RTT V7.86


SEGGER_RTT_ConfigUpBuffer(0, NULL, NULL, 0, SEGGER_RTT_MODE_NO_BLOCK_SKIP);


SEGGER_RTT_Write(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82);               // 6.58 us @ -O3
SEGGER_RTT_Write(0, "0", 1);                                                                                                // 1.83 us @ -O3
SEGGER_RTT_WriteNoLock(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82);         // 6.29 us @ -O3
SEGGER_RTT_WriteNoLock(0, 0, 0);                                                                                            // 1.33 us @ -O3
SEGGER_RTT_WriteSkipNoLock(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82);     // 4.5  us @ -O3
SEGGER_RTT_WriteSkipNoLock(0, "0", 1);                                                                                      // 0.75 us @ -O3
memcpy(dst, src, 82);                                                                                                       // 4.96 us @ -O3
memcpy32(dst, src, 82);                                                                                                     // 0.9 us @ -O3  https://gist.github.com/Erlkoenig90/fa1dc89e52e63a3697ad1eaaad19ee99


Are you using a special memcpy function? (I'm not a software engineer so I can't confidently tell from the code)

If so, why is it not as fast as in your test? The memcpy32 function shows that it can be fast.
If not, why is RTT still faster than calling memcpy manually?

SEGGER - Fabian · March 8, 2023 at 5:12 PM

Hi,
Your question is not related to RTT itself;
it concerns general C and assembly language programming.

Please understand that this far exceeds the scope of the help we can
provide in this Forum or in our support ticket system.

I suggest that you take the time to check the RTT target side implementation, which you have all sources for.
You could either
a) follow the function calls or
b) step through the code
until you find where the memory is copied in the source.

As a hint:
As long as RTT_USE_ASM is set, you can find the memory copy implementation in SEGGER_RTT_ASM_ARMv7M.S => SEGGER_RTT_ASM_WriteSkipNoLock().

I dare say that it is well enough documented to find the code where the memory is copied.
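For orientation when reading the source, the general shape of a "write or skip" ring buffer routine looks roughly like the following C sketch. It is not SEGGER's code: ring_t, its field names, and ring_write_skip are made up for illustration, and it only shows where the copy happens (once, or twice when the data wraps around the end of the buffer).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ring buffer; names are illustrative, not SEGGER's. */
typedef struct {
    char   *data;    /* backing storage */
    size_t  size;    /* capacity in bytes; one byte always stays unused */
    size_t  wr_off;  /* next write position, advanced by the target */
    size_t  rd_off;  /* next read position, advanced by the reader */
} ring_t;

/* Copy n bytes into the ring, or skip the whole message if it does not fit.
 * Returns 1 if the data was stored, 0 if it was skipped. */
int ring_write_skip(ring_t *r, const void *src, size_t n) {
    size_t rd = r->rd_off;
    size_t wr = r->wr_off;
    size_t free_bytes = (rd > wr) ? (rd - wr - 1u)
                                  : (r->size - 1u - wr + rd);
    if (n > free_bytes) {
        return 0;                       /* not enough room: drop the message */
    }

    size_t to_end = r->size - wr;       /* bytes until the end of the storage */
    if (n < to_end) {
        memcpy(r->data + wr, src, n);   /* single copy, no wrap-around */
        r->wr_off = wr + n;
    } else {
        memcpy(r->data + wr, src, to_end);                        /* first part */
        memcpy(r->data, (const char *)src + to_end, n - to_end);  /* wrapped part */
        r->wr_off = n - to_end;
    }
    return 1;
}
```

A real implementation also has to make sure the data is written before the new write offset becomes visible to the reader. How fast the copy itself runs depends on what stands in for memcpy here.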

Regarding the question why some memcpy implementations are faster than others:
I suggest searching the internet for this information. I am sure you will find well-explained answers to this question.

If you can come up with a faster/more optimized routine to copy the memory,
you are free to adjust the code you are using for RTT.
You have all the sources and are free to change them.
Please understand, however, that we cannot provide support for code
modified by users.

Please understand that for these reasons we cannot provide any more answers regarding this topic.
We will close this thread now.

BR
Fabian
