#4739 new defect

Fatal errors running GDB testcase against libdebugger

Reported by: Kévin Le Gouguec
Owned by: Chris Johns
Priority: normal
Milestone:
Component: lib/debugger
Version: 6
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:

Description

Hi,

We observe fatal errors when replaying this GDB testcase against libdebugger on aarch64:

<https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.c>

$ aarch64-rtems6-g++ -o repro          \
    -g -x c++                          \
    next-while-other-thread-longjmps.c \
    rtems_init.c                       \
    -lbsd -lm -ldebugger -Wl,-gc-sections

# start "repro" on a target listening on $HOST:$PORT

$ aarch64-rtems6-gdb repro                             \
    -ex 'break next-while-other-thread-longjmps.c:106' \
    -ex "target remote $HOST:$PORT"                    \
    -ex 'set variable release_debugger = 1'            \
    -ex 'continue'                                     \
    -ex 'next'

This is with RTEMS commit 9ed7103c618 "score: Simplify Chain_Node definition" (2022-09-23). While shrinking the reproducer, we found that the exact error depends on what rtems_init.c does. Attached are two variants, named "minimal" and "lessminimal".

"minimal" yields:

*** FATAL ***
fatal source: 0 (INTERNAL_ERROR_CORE)
CPU: 1
fatal code: 30 (INTERNAL_ERROR_BAD_THREAD_DISPATCH_DISABLE_LEVEL)
RTEMS version: 6.0.0
RTEMS tools: 10.3.1 20210409 (RTEMS 6, RSB 3950b1e2d857a5dba054cdd230f52635f1bad4dc-modified, Newlib 9069cb9)
executing thread ID: 0x0b010001
executing thread name: 

"lessminimal" triggers a different problem, which is the one we initially observed:

*** FATAL ***
fatal source: 9 (RTEMS_FATAL_SOURCE_EXCEPTION)
CPU: 1

X0   = 0x0000000040bf4a90 X17  = 0x00000000400211b0
X1   = 0x000000004036bed8 X18  = 0x0000000000000014
X2   = 0x0000000040326d00 X19  = 0x0000000040bf3f60
X3   = 0x0000000040b98e00 X20  = 0x0000000000000202
X4   = 0x0000000040b98be0 X21  = 0x0000000040326900
X5   = 0x000000004036bf50 X22  = 0x0000000040326c5c
X6   = 0x0000000000000000 X23  = 0x0000000040326d00
X7   = 0x000000004036c068 X24  = 0x0000000000000000
X8   = 0x0000000000000068 X25  = 0x0000000000000201
X9   = 0x0000000000000204 X26  = 0x0000000040326c58
X10  = 0x00000000fffffffa X27  = 0x000000004036bc08
X11  = 0x000000004f83f668 X28  = 0x0000000000000000
X12  = 0x0000000000000014 FP   = 0x0000000040b98be0
X13  = 0x000000004f83f600 LR   = 0x0000000040144a08
X14  = 0x0000000000000000 SP   = 0x0000000040b98be0
X15  = 0x0000000000000000 PC   = 0x0000000040144a08
X16  = 0x0000000040041c20 DAIF = 0x00000000000003c0
VEC  = 0x0000000000000004 CPSR = 0x0000000060000345
ESR  = EC: 0b111100 IL: 0b1 ISS: 0b0000000000000000000000000
       BRK instruction execution in AArch64 state
FAR  = 0x0000000000000000
FPCR = 0x0000000000000000 FPSR = 0x0000000000000000
Q00  = 0x3b3b3b3b3b3b3b3b3b3b3b3b3b3b3b3b
Q01  = 0x2d2e31703a633b743a633b746e6f4376
Q02  = 0x2e31703a346632313466323132003330
Q03  = 0x00000000000000000000000000000000
Q04  = 0x00000000000000000000000000200000
Q05  = 0x00000000000004000000040000000000
Q06  = 0x00000000000000000000000000000000
Q07  = 0x80200802802008028020080280200802
Q08  = 0x00000000000000000000000000000000
Q09  = 0x00000000000000000000000000000000
Q10  = 0x00000000000000000000000000000000
Q11  = 0x00000000000000000000000000000000
Q12  = 0x00000000000000000000000000000000
Q13  = 0x00000000000000000000000000000000
Q14  = 0x00000000000000000000000000000000
Q15  = 0x00000000000000000000000000000000
Q16  = 0x40100401401004014010040140100401
Q17  = 0x00002000000000200000002008000400
Q18  = 0x00000000000000000000000000200000
Q19  = 0x00000000000000000000000000000000
Q20  = 0x00000000000000000000000000000000
Q21  = 0x00000000000000000000000000000000
Q22  = 0x00000000000000000000000000000000
Q23  = 0x00000000000000000000000000000000
Q24  = 0x00000000000000000000000000000000
Q25  = 0x00000000000000000000000000000000
Q26  = 0x00000000000000000000000000000000
Q27  = 0x00000000000000000000000000000000
Q28  = 0x00000000000000000000000000000000
Q29  = 0x00000000000000000000000000000000
Q30  = 0x00000000000000000000000000000000
Q31  = 0x00000000000000000000000000000000
RTEMS version: 6.0.0
RTEMS tools: 10.3.1 20210409 (RTEMS 6, RSB 3950b1e2d857a5dba054cdd230f52635f1bad4dc-modified, Newlib 9069cb9)
executing thread ID: 0x0a010003
executing thread name: TIME
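
For readers decoding the ESR line of the dump by hand: the Armv8-A ESR_ELx layout puts the exception class (EC) in bits [31:26], the instruction-length bit (IL) in bit 25, and the instruction-specific syndrome (ISS) in bits [24:0]. EC 0b111100 (0x3C) is the class for a BRK instruction executed in AArch64 state, and for that class ISS[15:0] holds the BRK immediate, so an ISS of zero corresponds to `brk #0`. Below is a minimal sketch of that decoding; the helper names are illustrative and are not part of RTEMS or libdebugger.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ESR_ELx field layout as defined by the Armv8-A architecture. */
#define ESR_EC_SHIFT       26u
#define ESR_EC_MASK        0x3Fu
#define ESR_IL_SHIFT       25u
#define ESR_ISS_MASK       0x01FFFFFFu
#define ESR_EC_BRK_AARCH64 0x3Cu /* BRK instruction, AArch64 state */

/* Hypothetical helpers, named for this example only. */
static inline uint32_t esr_ec(uint32_t esr)
{
  return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

static inline uint32_t esr_il(uint32_t esr)
{
  return (esr >> ESR_IL_SHIFT) & 1u;
}

static inline uint32_t esr_iss(uint32_t esr)
{
  return esr & ESR_ISS_MASK;
}

/* Rebuild the raw register value from the three fields printed in the dump. */
static inline uint32_t esr_pack(uint32_t ec, uint32_t il, uint32_t iss)
{
  assert(ec <= ESR_EC_MASK && il <= 1u && iss <= ESR_ISS_MASK);
  return (ec << ESR_EC_SHIFT) | (il << ESR_IL_SHIFT) | iss;
}

/* Print the fields; for a BRK, the low 16 bits of ISS are the immediate. */
static inline void esr_describe(uint32_t esr)
{
  printf("EC=0x%02x IL=%u ISS=0x%07x\n",
         (unsigned) esr_ec(esr), (unsigned) esr_il(esr),
         (unsigned) esr_iss(esr));
  if (esr_ec(esr) == ESR_EC_BRK_AARCH64) {
    printf("BRK #%u in AArch64 state\n", (unsigned) (esr_iss(esr) & 0xFFFFu));
  }
}
```

Calling `esr_describe(esr_pack(0x3C, 1, 0))` reproduces the fields shown in the dump. This only identifies the exception class; it says nothing about why the breakpoint instruction was reached outside of libdebugger's control.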

Our config.ini:

[aarch64/xilinx_zynqmp_lp64_qemu]
RTEMS_POSIX_API = True
RTEMS_SMP = True

We have not managed to get a backtrace yet; setting a breakpoint in e.g. bsp_fatal_extension just causes GDB to hang when running "next". Let us know if there is something more we can do on our side to make these errors simpler to investigate.

FWIW, we can reproduce both errors by simplifying next-while-other-thread-longjmps.c to only spawn 1 thread running thread_try_catch, instead of 10 threads running thread_longjmp and thread_try_catch.

Attachments (2)

rtems_init_lessminimal.c (6.0 KB) - added by Kévin Le Gouguec on 10/07/22 at 13:57:59.
rtems_init_minimal.c (4.6 KB) - added by Kévin Le Gouguec on 10/07/22 at 13:58:13.


Change History (8)

Changed on 10/07/22 at 13:57:59 by Kévin Le Gouguec

Attachment: rtems_init_lessminimal.c added

Changed on 10/07/22 at 13:58:13 by Kévin Le Gouguec

Attachment: rtems_init_minimal.c added

comment:1 Changed on 10/08/22 at 14:38:01 by Joel Sherrill

Thank you for the very detailed bug report. The person who did the aarch64 libdebugger port is on holiday so I thought I would ask a couple of questions while they were away.

Is the commit you mentioned from a git bisect or just the random hash you were testing against?

Did you happen to test this with RTEMS built with SMP disabled? I don't think that matters here but you never know.

And were you testing against real hardware or QEMU? I don't recall which architectures QEMU supports well enough to even let libdebugger work. Since it has its own GDB server, often it does not simulate the needed debug registers/capabilities.

If anyone in the community can test this against another architecture, it would be appreciated. Since it appears to be locking related, I suspect this is an architecture-independent failure.

comment:2 Changed on 10/10/22 at 14:12:51 by Kévin Le Gouguec

> The person who did the aarch64 libdebugger port is on holiday so I thought I would ask a couple of questions while they were away.

Thanks for following up in the meantime!

> Is the commit you mentioned from a git bisect or just the random hash you were testing against?

It happened to be the tip of the master branch when I started working on a reproducer; we originally observed this on 6c36cb7a486 "aarch64: always boot into EL1NS" (2022-01-12).

> Did you happen to test this with RTEMS built with SMP disabled?

Not yet; I'll make a note to give it a try.

> And were you testing against real hardware or QEMU? I don't recall which architectures QEMU supports well enough to even let libdebugger work. Since it has its own GDB server, often it does not simulate the needed debug registers/capabilities.

We are testing against QEMU (7.0.0 plus some patches). FWIW, in the context of our debugger testsuite, we are finding libdebugger more reliable than qemu -gdb. The former has its share of issues, but at least results are mostly reproducible from one run to the next; QEMU's built-in GDB server on the other hand is much less deterministic in our experience.

comment:3 Changed on 10/11/22 at 02:24:59 by Chris Johns

In rtems_init_minimal.c I noticed:

/* Debugger lock, there is no way to ask the libdebugger to wait for a
 * connection, that would block the kernel.  Instead use the following global
 * variable and 'set release_debugger = 1' in GDB when ready.  */

I can add a feature to libdebugger to have it wait, but I am not sure what should happen once connected. Do I break and wait for the user via GDB, or continue, or should this be optional?

I think a wait-for-connection option is a good idea.
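
For context, the workaround described in the quoted comment amounts to polling a global flag until GDB sets it. The attachment itself is not reproduced in this ticket, so the following is only a sketch of that pattern under stated assumptions: the function name, the poll interval, and the bounded poll count (added so the example terminates) are all hypothetical. Only the variable name `release_debugger` comes from the ticket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Flag written from GDB once breakpoints are in place, for example with
 * 'set variable release_debugger = 1'. It is volatile so that every poll
 * re-reads memory instead of a cached value.
 */
volatile int release_debugger = 0;

/*
 * Hypothetical helper: poll the flag up to max_polls times, sleeping
 * poll_ms milliseconds between polls. Returns true once the flag is set.
 * Sleeping (rather than spinning) lets other threads, including the
 * debugger's own server thread, keep running.
 */
static inline bool wait_for_debugger_release(unsigned max_polls, long poll_ms)
{
  assert(poll_ms >= 0 && poll_ms < 1000);
  struct timespec delay = { .tv_sec = 0, .tv_nsec = poll_ms * 1000000L };

  for (unsigned i = 0; i < max_polls; ++i) {
    if (release_debugger != 0) {
      return true;
    }
    nanosleep(&delay, NULL);
  }
  return release_debugger != 0;
}
```

A test program would call this right after starting the debugger and before spawning the threads under test. A built-in wait-for-connection option would make both the global and the loop unnecessary.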

comment:4 in reply to:  3 ; Changed on 10/12/22 at 08:42:31 by Kévin Le Gouguec

Replying to Chris Johns:

> I can add a feature to libdebugger to have it wait, but I am not sure what should happen once connected. Do I break and wait for the user via GDB, or continue, or should this be optional?

Thanks for weighing in on this! Yes, I think it would make sense to have an option to wait for GDB to (1) connect and (2) send a "continue". Kind of similar to how QEMU has -gdb/-s to enable the debug stub, and a dedicated switch, -S, to suspend execution at startup until the debugger says to resume?

Don't know how that would translate in terms of API (adding an argument to rtems_debugger_start, if breaking the API is allowed; adding a function to request the wait after rtems_debugger_start has been called; …); should I open a separate feature request for this?

comment:5 in reply to:  4 ; Changed on 10/13/22 at 00:52:37 by Chris Johns

Replying to Kévin Le Gouguec:

> Replying to Chris Johns:
>
> > I can add a feature to libdebugger to have it wait, but I am not sure what should happen once connected. Do I break and wait for the user via GDB, or continue, or should this be optional?
>
> Thanks for weighing in on this! Yes, I think it would make sense to have an option to wait for GDB to (1) connect and (2) send a "continue". Kind of similar to how QEMU has -gdb/-s to enable the debug stub, and a dedicated switch, -S, to suspend execution at startup until the debugger says to resume?

My pleasure and yes I agree it should be optional.

> Don't know how that would translate in terms of API (adding an argument to rtems_debugger_start, if breaking the API is allowed; adding a function to request the wait after rtems_debugger_start has been called; …);

I think breaking the API may be worth the short-term pain. I tend to prefer a const char* options argument that can be NULL. That argument can then be passed around and parsed by backends, transports, etc., and options can come and go without breaking the API.
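
To illustrate the idea of an options string that may be NULL, here is a minimal sketch of how a backend or transport might check for a flag. Everything in it is an assumption made for the example: the comma-separated format, the option name "wait", and the function name are hypothetical and do not describe any existing or planned libdebugger API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `name` appears as a complete
 * comma-separated token in `options`. A NULL `options` means "no options",
 * so callers that do not care can simply pass NULL.
 */
static inline bool debugger_option_set(const char *options, const char *name)
{
  assert(name != NULL);
  size_t name_len = strlen(name);

  if (options == NULL || name_len == 0) {
    return false;
  }

  const char *p = options;
  while (*p != '\0') {
    /* The current token runs up to the next comma or the end of string. */
    const char *end = strchr(p, ',');
    size_t len = (end != NULL) ? (size_t) (end - p) : strlen(p);

    /* Compare whole tokens only, so "waiting" does not match "wait". */
    if (len == name_len && strncmp(p, name, len) == 0) {
      return true;
    }
    if (end == NULL) {
      break;
    }
    p = end + 1;
  }
  return false;
}
```

With a scheme like this, each component looks only for the tokens it understands and ignores the rest, which is what allows options to be added or retired without changing the function signature again.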

> should I open a separate feature request for this?

Yes please and assign to me.

comment:6 in reply to:  5 Changed on 10/13/22 at 10:04:52 by Kévin Le Gouguec

Replying to Chris Johns:

> > should I open a separate feature request for this?
>
> Yes please and assign to me.

Done in #4740.
