[SOLVED] RTT usage of SEGGER_RTT_ASM_ARMv7M.S

  • [SOLVED] RTT usage of SEGGER_RTT_ASM_ARMv7M.S

    Hi,
    can you please tell me why and how to use SEGGER_RTT_ASM_ARMv7M.S? There is not much information on it.
    1. What is the advantage? Is it faster?
    2. How do I use it? Just setting #define USE_RTT_ASM (1) in SEGGER_RTT_Conf.h doesn't seem to do much, as far as I can tell.
    I am working with the DAVE IDE (GCC) and an XMC4400 (Cortex-M4).
  • Usually, there is nothing to do.
    The ASM variant is active by default for GCC + Cortex-M4.

    Yes, the ASM variant is faster, and it is consistently fast regardless of the compiler optimization level used for the code around it.
    Please read the forum rules before posting.

    Keep in mind, this is *not* a support forum.
    Our engineers will try to answer your questions between their projects if possible but this can be delayed by longer periods of time.
    Should you be entitled to support you can contact us via our support system: segger.com/ticket/

    Or you can contact us via e-mail.
  • I tested two versions, 6.40 and 7.80a:


    V6.40 has #define USE_RTT_ASM 1 in SEGGER_RTT_Conf.h, but this has no effect.

    In both versions the presence of SEGGER_RTT_ASM_ARMv7M.S does not change the output.
    Not using SEGGER_RTT.c results in errors.

    Since the times for writing 82 chars are far from what is claimed on the website (<= 1 us @ 168 MHz Cortex-M4), I did a bit of testing.
    I'm only running at 120 MHz, so my times should be slower by a factor of 1.4. But the best I could do was 3.5 us.
    The only way to get below 1 us for just copying bytes is to use an assembler version of memcpy I found on the internet, or to use DMA.

    Best-case RTT_Write I could achieve takes 3.5 us
    82-byte memcpy takes 5 us
    82-byte memcpy32 takes 0.8 us
    82-byte copy loop takes 9.1 us
    84-byte copy loop with 32 bits at a time takes 2.5 us
    84-byte DMA copy with 32 bits at a time takes 0.8 us

    So using an assembler version of memcpy is fast, but I cannot get there. Please tell me what I'm doing wrong.
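
    The gap between the byte-wise and the 32-bit loop in the list above comes mostly from the number of load/store pairs: 82 bytes need 82 pairs when copied one byte at a time, but only 20 word pairs plus 2 trailing bytes when copied four bytes at a time. A minimal sketch of such a word-wise copy follows. The name copy_words is hypothetical; it is neither part of RTT nor the memcpy32 from the linked gist, and it assumes a core that tolerates unaligned word accesses (as the Cortex-M4 does by default) or a compiler that lowers the fixed-size memcpy calls safely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes, moving 4 bytes per iteration
 * while at least 4 bytes remain, then the 0..3 leftover bytes. */
static void copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 4) {
        uint32_t w;
        /* Fixed-size memcpy avoids aliasing problems; optimizing compilers
         * typically reduce each call to a single word load or store. */
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += 4;
        s += 4;
        n -= 4;
    }
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

    Unlike a loop that always copies 21 words, this variant copies exactly the requested 82 bytes and leaves the last two bytes of an 84-byte destination untouched.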

    My test code below:

    C Source Code

    /*
    "\"D:\\Programme\\DAVE\\eclipse\\ARM-GCC-49\\bin\\make\"" --output-sync -j12 all
    'Building file: ../main.c'
    'Invoking: ARM-GCC C Compiler'
    "D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-gcc" -MMD -MT "main.o" -DXMC4500_F100x1024 -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/XMCLib/inc" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/CMSIS/Include" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries/CMSIS/Infineon/XMC4500_series/Include" -I"D:/Anwendungsdaten/Dave/WS/rtt" -I"D:/Anwendungsdaten/Dave/WS/rtt/Libraries" -O3 -ffunction-sections -fdata-sections -Wall -std=gnu99 -mfloat-abi=softfp -Wa,-adhlns="main.o.lst" -pipe -c -fmessage-length=0 -MMD -MP -MF"main.d" -MT"main.d main.o" -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mthumb -g -gdwarf-2 -o "main.o" "../main.c"
    'Finished building: ../main.c'
    'Building target: rtt.elf'
    'Invoking: ARM-GCC C Linker'
    "D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-gcc" -T"../linker_script.ld" -nostartfiles -Xlinker --gc-sections -specs=nano.specs -specs=nosys.specs -Wl,-Map,"rtt.map" -mfloat-abi=softfp -mfpu=fpv4-sp-d16 -mcpu=cortex-m4 -mthumb -g -gdwarf-2 -o "rtt.elf" "@objects.rsp" -lm
    'Finished building target: rtt.elf'
    'Invoking: ARM-GCC Create Flash Image'
    "D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-objcopy" -O ihex "rtt.elf" "rtt.hex"
    'Finished building: rtt.hex'
    'Invoking: ARM-GCC Print Size'
    "D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-size" --format=berkeley "rtt.elf"
    text data bss dec hex filename
    5176 168 3276 8620 21ac rtt.elf
    'Finished building: rtt.siz'
    'Invoking: ARM-GCC Create Listing'
    "D:/Programme/DAVE/eclipse/ARM-GCC-49/bin/arm-none-eabi-objdump" -h -S "rtt.elf" > "rtt.lst"
    'Finished building: rtt.lst'
    */
    #include <string.h>                 // memcmp, memcpy
    #include "xmc_common.h"
    #include <xmc_gpio.h>
    #include <xmc_dma.h>
    //#include "RTT/SEGGER_RTT.h"       // 7.80a
    #include "RTT_old/SEGGER_RTT.h"     // 6.40
    #include "memcpy32.h"

    char src[84] = "1111111111222222222233333333334444444444555555555566666666667777777777888888888899FF";
    char dst[84] = "000000000000000000000000000000000000000000000000000000000000000000000000000000000000";

    // Timing pin P2.1: set before and cleared after each measured copy
    #define pin_high  PORT2->OMR = 0x00000002UL;
    #define pin_low   PORT2->OMR = 0x00020000UL;
    // Short gap between measurements so the pulses are separated on the scope
    #define delay     for (uint32_t i = 0; i < 100; i++) { asm volatile ("NOP"); }
    #define reset_dst for (uint32_t i = 0; i < 84; i++) { dst[i] = 0; }

    // Halt in the debugger if the first 82 bytes were not copied correctly
    void check(void) {
        if (memcmp(src, dst, 82) != 0)
            __BKPT();
    }

    // DMA transfer-complete interrupt: ends the DMA measurement
    void GPDMA0_0_IRQHandler(void) {
        pin_low
        check();
        reset_dst
        __BKPT();
    }

    int main(void) {
        // GPIO
        XMC_GPIO_CONFIG_t gpio_cfg = {
            .mode = XMC_GPIO_MODE_OUTPUT_PUSH_PULL,
            .output_level = XMC_GPIO_OUTPUT_LEVEL_LOW,
        };
        XMC_GPIO_Init(P2_1, &gpio_cfg);

        // DMA: one block of 21 x 32 bit (84 bytes), memory to memory
        XMC_DMA_CH_CONFIG_t dma_ch_config = {
            .block_size = 21,
            .src_addr = (uint32_t)&src[0],
            .dst_addr = (uint32_t)&dst[0],
            .src_transfer_width = XMC_DMA_CH_TRANSFER_WIDTH_32,
            .dst_transfer_width = XMC_DMA_CH_TRANSFER_WIDTH_32,
            .src_address_count_mode = XMC_DMA_CH_ADDRESS_COUNT_MODE_INCREMENT,
            .dst_address_count_mode = XMC_DMA_CH_ADDRESS_COUNT_MODE_INCREMENT,
            .src_burst_length = XMC_DMA_CH_BURST_LENGTH_1,
            .dst_burst_length = XMC_DMA_CH_BURST_LENGTH_1,
            .transfer_flow = XMC_DMA_CH_TRANSFER_FLOW_M2M_DMA,
            .transfer_type = XMC_DMA_CH_TRANSFER_TYPE_SINGLE_BLOCK,
            .enable_interrupt = true
        };
        XMC_DMA_Init(XMC_DMA0);
        XMC_DMA_CH_Init(XMC_DMA0, 0, &dma_ch_config);
        XMC_DMA_CH_EnableEvent(XMC_DMA0, 0, XMC_DMA_CH_EVENT_TRANSFER_COMPLETE);
        NVIC_SetPriority(GPDMA0_0_IRQn, NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 63, 0));
        NVIC_EnableIRQ(GPDMA0_0_IRQn);

        // RTT
        SEGGER_RTT_ConfigUpBuffer(0, NULL, NULL, 0, SEGGER_RTT_MODE_NO_BLOCK_SKIP);

        // Tests
        pin_high
        SEGGER_RTT_Write(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 3.5 us @ -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP
        // RTT 6.40 -O0: 7.7 us
        // RTT 6.40 -O0 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 19.0 us
        // RTT 6.40 -O3 6.6 us
        // RTT 6.40 -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 3.5 us
        // RTT 7.80a -O0: 7.9 us
        // RTT 7.80a -O0 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 19.1 us
        // RTT 7.80a -O3 6.5 us
        // RTT 7.80a -O3 + SEGGER_RTT_MEMCPY_USE_BYTELOOP 6.3 us
        pin_low
        delay
        reset_dst

        pin_high
        memcpy(dst, src, 82); // 5 us @ -O3
        pin_low
        check();
        reset_dst
        delay

        pin_high
        // https://gist.github.com/Erlkoenig90/fa1dc89e52e63a3697ad1eaaad19ee99
        memcpy32(dst, src, 82); // 0.87 us @ -O3
        pin_low
        check();
        reset_dst
        delay

        pin_high
        { // copying 82 x 1 byte: 9.2 us @ -O3
            char * d = dst;
            char * s = src;
            // the volatile counter is read and written in memory on every iteration
            for (volatile uint32_t i = 0; i < 82; i++) {
                *d++ = *s++;
            }
        }
        pin_low
        check();
        reset_dst
        delay

        pin_high
        { // copying 21 x 4 bytes: 2.58 us @ -O3
            uint32_t * d = (uint32_t *)dst;
            uint32_t * s = (uint32_t *)src;
            for (volatile uint32_t i = 0; i < 21; i++) { // copies 84 chars
                *d++ = *s++;
            }
        }
        pin_low
        check();
        reset_dst
        delay

        pin_high
        // copying 21 x 4 bytes with DMA: 0.87 us @ -O3
        // (pin is cleared in GPDMA0_0_IRQHandler)
        XMC_DMA_CH_Enable(XMC_DMA0, 0);
        while (1U) {
        }
    }
  • Hi,
    it seems there are some misunderstandings here,
    which I hope to clear up with this post.

    1) When is the ASM sub module used?
    The ASM sub module contains an ASM variant of SEGGER_RTT_WriteSkipNoLock().
    This variant is used by default for most cores (e.g. Cortex-M4).
    However, you are using SEGGER_RTT_Write(), not SEGGER_RTT_WriteSkipNoLock().
    So in your example code the ASM routine is actually not used.
    You can check this by looking into the SEGGER_RTT.c and SEGGER_RTT.h files, which contain the compiler switches (ASM available or not) and the code used by RTT (ASM version of SEGGER_RTT_WriteSkipNoLock() used or not).
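
    To make the role of the "write, skip, no lock" path more concrete, here is a simplified plain-C model of such a routine: it takes no lock, and if the data does not fit into the ring buffer completely, it stores nothing. This is only a sketch under stated assumptions. The type ring_t, its fields and the function ring_write_skip are hypothetical names; the real control block and the ASM routine in the RTT sources are laid out differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified up-buffer; not the RTT control block. */
typedef struct {
    char     buf[16];
    unsigned wr;   /* next write position, advanced by the target */
    unsigned rd;   /* next read position, advanced by the host    */
} ring_t;

/* Returns 1 if all n bytes were stored, 0 if the data was skipped. */
static int ring_write_skip(ring_t *r, const void *data, unsigned n)
{
    unsigned size  = (unsigned)sizeof r->buf;
    /* One slot stays unused so that wr == rd always means "empty". */
    unsigned avail = (r->rd + size - r->wr - 1u) % size;
    unsigned first;

    if (n > avail) {
        return 0;                       /* not enough room: skip everything */
    }
    first = size - r->wr;               /* bytes up to the end of the buffer */
    if (first > n) {
        first = n;
    }
    memcpy(r->buf + r->wr, data, first);
    memcpy(r->buf, (const char *)data + first, n - first);  /* wrap-around part */
    r->wr = (r->wr + n) % size;         /* publish the data last */
    return 1;
}
```

    In this model the cost of a successful call is dominated by the one or two memcpy calls, which is why the speed of the copy loop matters so much for the measured output time.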

    2) What does output time actually mean?
    The output time is the time the RTT module needs to output data.
    It is measured from the call of the RTT function until the data is in the buffer and available to be read by J-Link, excluding overhead(!).

    The output time can be measured with a scope and an application that toggles a pin:
    a) Set the pin (e.g. LED pin).

    b) Measure the time (clear to set) of the following calls to get the overhead time:
    BSP_ClrLED(0);
    SEGGER_RTT_LOCK();
    Status = SEGGER_RTT_WriteNoLock(0, 0, 0);
    SEGGER_RTT_UNLOCK();
    BSP_SetLED(0);

    c) Measure the time (clear to set) of the following calls to get the actual output time:
    BSP_ClrLED(0);
    SEGGER_RTT_LOCK();
    Status = SEGGER_RTT_WriteNoLock(0, "01234567890123456789012345678901234567890123456789012345678901234567890123456789\r\n", 82);
    SEGGER_RTT_UNLOCK();
    BSP_SetLED(0);

    I repeated the measurement on the SEGGER Cortex-M Trace Reference Board (168 MHz) and the result was as follows:
    Measurement 1) (overhead): 2.04 us
    Measurement 2) (82 chars): 2.70 us
    => Time without overhead: 2.70 us - 2.04 us = 0.66 us
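
    The subtraction above can be turned into a small helper that also expresses the result in CPU cycles, which makes results from boards with different clock speeds easier to compare. This is only an illustrative sketch: the function names are hypothetical, times are given in nanoseconds to stay in integer arithmetic, and the cycle conversion assumes that the cycle count does not change with the clock (which ignores effects such as flash wait states).

```c
#include <stdint.h>

/* Net output time in ns: full measurement minus the empty-call overhead. */
static uint32_t net_ns(uint32_t total_ns, uint32_t overhead_ns)
{
    return total_ns - overhead_ns;
}

/* Whole CPU cycles (rounded down) for a time in ns at a clock in MHz. */
static uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_mhz)
{
    return (uint32_t)(((uint64_t)ns * clock_mhz) / 1000u);
}

/* Time in ns (rounded down) for a number of cycles at a clock in MHz. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_mhz)
{
    return (uint32_t)(((uint64_t)cycles * 1000u) / clock_mhz);
}
```

    With the values measured above, net_ns(2700, 2040) gives 660 ns, which corresponds to 110 full cycles at 168 MHz. Under the stated assumption, the same 110 cycles would take about 916 ns at 120 MHz.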

    3) What factors can impact the output time?
    The test I did was executed from flash.
    As the string is copied from flash to RAM, the RAM write speed and the cache handling (if the core has a cache) affect the measured time, for example.

    BR
    Fabian
  • The main problem is that I cannot get to the "time to do a single memcopy" speed. Or worded differently: my memcpy is slow.

    here are some results:

    Source Code

    // J-Link / RTT V7.86
    SEGGER_RTT_ConfigUpBuffer(0, NULL, NULL, 0, SEGGER_RTT_MODE_NO_BLOCK_SKIP);
    SEGGER_RTT_Write(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 6.58 us @ -O3
    SEGGER_RTT_Write(0, "0", 1); // 1.83 us @ -O3
    SEGGER_RTT_WriteNoLock(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 6.29 us @ -O3
    SEGGER_RTT_WriteNoLock(0, 0, 0); // 1.33 us @ -O3
    SEGGER_RTT_WriteSkipNoLock(0, "012345678901234567890123456789012345678901234567890123456789012345678901234567890", 82); // 4.5 us @ -O3
    SEGGER_RTT_WriteSkipNoLock(0, "0", 1); // 0.75 us @ -O3
    memcpy(dst, src, 82); // 4.96 us @ -O3
    memcpy32(dst, src, 82); // 0.9 us @ -O3 https://gist.github.com/Erlkoenig90/fa1dc89e52e63a3697ad1eaaad19ee99
    Are you using a special memcpy function? (I'm not a software engineer, so I can't confidently tell from the code.)

    • If so, why is it not as fast as in your test? The memcpy32 function shows that it can be fast.
    • If not, why is RTT still faster than calling memcpy manually?
  • Hi,
    Your question is not related to RTT but concerns a general topic
    of the C and assembly languages.

    Please understand that this far exceeds the scope of the help we can
    provide in this Forum or in our support ticket system.

    I suggest that you take the time to check the RTT target side implementation, which you have all sources for.
    You could either
    a) follow the function calls or
    b) step through the code
    until you find where the memory is copied in the source.


    As a hint:
    As long as RTT_USE_ASM is set, you can find the memory copy implementation in SEGGER_RTT_ASM_ARMv7M.S => SEGGER_RTT_ASM_WriteSkipNoLock().

    I dare say that it is well enough documented to find the code where the memory is copied.

    Regarding the question of why some memcpy implementations are faster than others:
    I suggest searching the internet for this information. I am sure you will find well-explained answers to this question.

    If you can come up with a faster/more optimized routine to copy the memory,
    you are free to adjust the code you are using for RTT.
    You have all the sources and are free to change them.
    Please understand, however, that we cannot provide any support for code
    modified by users.

    Please understand that for these reasons we cannot provide any more answers regarding this topic.
    We will close this thread now.

    BR
    Fabian