#2811 closed enhancement (fixed)

More robust thread dispatching on SMP and ARM Cortex-M

Reported by: Sebastian Huber Owned by: Sebastian Huber
Priority: normal Milestone: 5.1
Component: score Version: 4.11
Severity: normal Keywords:
Cc: Blocked By:
Blocking:

Description

On SMP configurations, it is a fatal error to call blocking operating system with interrupts disabled, since this prevents delivery of inter-processor interrupts. This could lead to executing threads which are not allowed to execute resulting in undefined behaviour.

The ARM Cortex-M port has a similar problem, since the interrupt state is not a part of the thread context.

Add a new CPU port function:

/**
 * @brief Returns true if interrupts are enabled in the specified ISR level,
 * otherwise returns false.
 *
 * @param[in] level The ISR level.
 *
 * @retval true Interrupts are enabled in the ISR level.
 * @retval false Otherwise.
 */
RTEMS_INLINE_ROUTINE bool _CPU_ISR_Is_enabled( uint32_t level )
{
  return false;
}

Use this function to ensure that _Thread_Do_dispatch() is called with an interrupt level with enabled interrupts, otherwise call _Terminate().

Change History (17)

comment:1 Changed on Nov 15, 2016 at 6:47:40 AM by Chris Johns

Is this specific to an SMP build?

comment:2 Changed on Nov 15, 2016 at 6:50:03 AM by Sebastian Huber

I will add _CPU_ISR_Is_enabled() unconditionally. The fatal error will be limited to SMP and ARM Cortex-M. However, its an application bug in general to call blocking operations with interrupts disabled.

comment:3 in reply to:  2 Changed on Nov 15, 2016 at 8:54:24 PM by Chris Johns

Replying to sebastian.huber:

I will add _CPU_ISR_Is_enabled() unconditionally. The fatal error will be limited to SMP and ARM Cortex-M.

I would prefer to see an error code returned. Fatal errors should be limited to internal issues like inconsistencies where the kernel cannot reasonable continue and not user errors with an API. Fatal errors can also be difficult to track down with threading unless you have a thread aware debugger. Not all users or BSPs have this support.

However, its an application bug in general to call blocking operations with interrupts disabled.

Yes I would agree and if a user makes a call to RTEMS that is not valid we should return a suitable error or status code. For a single core system we have allowed this in past release so as a general change this is something that changes our current behavior.

comment:4 Changed on Nov 16, 2016 at 6:17:31 AM by Sebastian Huber

I think a fatal error is more appropriate here.

  • Applications which have this usage error needs to be fixed at compile-time. It makes no sense to ship an SMP application with this bug.
  • Return codes can be ignored. I definitely have seen code like this before:
    /* This cannot fail, we know the identifier is valid */
    (void) pthread_mutex_lock(&mtx);
    
  • This ticket is a result of porting a real world application from uni-processor to SMP. If you are not an expert of the operating system internals and your application has this bug, then you need easily a couple of days to figure out the problem. So, it is important to make sure it gets detected.
  • To figure out what caused a fatal error is easy. The (source, error) pair uniquely identifies the source code location of the error. With a stack trace and the executing thread you get enough information to locate the problem in the code. There is no need for a thread aware debugger.
  • This is a new constraint specific to SMP. Existing software may be simply unaware of this issue. However, its important to detect this constraint violation.
  • _Thread_Do_dispatch() has no return value. Adding this check to other places would be much more difficult, error prone. with more space and time overhead, and labour intensive to test.

comment:5 in reply to:  4 ; Changed on Nov 16, 2016 at 7:43:30 AM by Chris Johns

Replying to sebastian.huber:

I think a fatal error is more appropriate here.

  • Applications which have this usage error needs to be fixed at compile-time. It makes no sense to ship an SMP application with this bug.

A fatal error is still run-time and not a compile time error so you have lost me here.

  • Return codes can be ignored. I definitely have seen code like this before:
    /* This cannot fail, we know the identifier is valid */
    (void) pthread_mutex_lock(&mtx);
    

This is a different issue and a change of topic. We provide the means for errors to be analyzed and that is our boundary.

  • This ticket is a result of porting a real world application from uni-processor to SMP. If you are not an expert of the operating system internals and your application has this bug, then you need easily a couple of days to figure out the problem. So, it is important to make sure it gets detected.

I agree with detecting the issue and there being an error. It is the delivery we are discussing.

The error code should provide some help just like the fatal error code. If one can the other can.

How many fatal errors instance are there in RTEMS in the kernel? Not the number of error code, but the specific locations a fatal error can appear, ie code/line pairs? I have never audited this.

  • To figure out what caused a fatal error is easy. The (source, error) pair uniquely identifies the source code location of the error.

The source location is a line the kernel's core code which means users need to step into this code and figure out the answer. I have been hit by this with SMP and it is hard.

With a stack trace and the executing thread you get enough information to locate the problem in the code. There is no need for a thread aware debugger.

This implies testing will highlight the issue because you have a debugger to give you this data. Currently RTEMS standard or default stack traces that get called on a fatal error provide little if any information that could be used to resolve the exact source, eg the thread id executing or even better an unwinder (dreaming here). Better support for tier 1 archs would help.

  • This is a new constraint specific to SMP. Existing software may be simply unaware of this issue. However, its important to detect this constraint violation.

I agree it is important.

  • _Thread_Do_dispatch() has no return value. Adding this check to other places would be much more difficult, error prone. with more space and time overhead, and labour intensive to test.

There are no other similar tests happening now on the blocking paths?

comment:6 in reply to:  5 ; Changed on Nov 16, 2016 at 8:07:30 AM by Sebastian Huber

Replying to chrisj:

Replying to sebastian.huber:

I think a fatal error is more appropriate here.

  • Applications which have this usage error needs to be fixed at compile-time. It makes no sense to ship an SMP application with this bug.

A fatal error is still run-time and not a compile time error so you have lost me here.

It is an error that must be fixed during development. Otherwise you have a broken product.


  • Return codes can be ignored. I definitely have seen code like this before:
    /* This cannot fail, we know the identifier is valid */
    (void) pthread_mutex_lock(&mtx);
    

This is a different issue and a change of topic. We provide the means for errors to be analyzed and that is our boundary.

  • This ticket is a result of porting a real world application from uni-processor to SMP. If you are not an expert of the operating system internals and your application has this bug, then you need easily a couple of days to figure out the problem. So, it is important to make sure it gets detected.

I agree with detecting the issue and there being an error. It is the delivery we are discussing.

The error code should provide some help just like the fatal error code. If one can the other can.

How many fatal errors instance are there in RTEMS in the kernel? Not the number of error code, but the specific locations a fatal error can appear, ie code/line pairs? I have never audited this.

See Internal_errors_Core_list, we have a test for every fatal internal error.

  • To figure out what caused a fatal error is easy. The (source, error) pair uniquely identifies the source code location of the error.

The source location is a line the kernel's core code which means users need to step into this code and figure out the answer. I have been hit by this with SMP and it is hard.

With a stack trace and the executing thread you get enough information to locate the problem in the code. There is no need for a thread aware debugger.

This implies testing will highlight the issue because you have a debugger to give you this data. Currently RTEMS standard or default stack traces that get called on a fatal error provide little if any information that could be used to resolve the exact source, eg the thread id executing or even better an unwinder (dreaming here). Better support for tier 1 archs would help.

Improved fatal error diagnostics is a different topic. With a debugger is a matter of seconds to figure out the problem spot of a fatal error.

  • This is a new constraint specific to SMP. Existing software may be simply unaware of this issue. However, its important to detect this constraint violation.

I agree it is important.

  • _Thread_Do_dispatch() has no return value. Adding this check to other places would be much more difficult, error prone. with more space and time overhead, and labour intensive to test.

There are no other similar tests happening now on the blocking paths?

No, this is a weak area in RTEMS. For example call rtems_task_wake_after() in an interrupt service routine. You don't get any status information that this is stupid.

For now, I think a fatal error is sufficient. In case there is really a problem with this in the field, we can still improve things. What matters is that this constraint violation gets detected, otherwise you can spend hours on debugging.

comment:7 in reply to:  6 Changed on Nov 17, 2016 at 1:29:24 AM by Chris Johns

Replying to sebastian.huber:

Replying to chrisj:

Replying to sebastian.huber:

I think a fatal error is more appropriate here.

  • Applications which have this usage error needs to be fixed at compile-time. It makes no sense to ship an SMP application with this bug.

A fatal error is still run-time and not a compile time error so you have lost me here.

It is an error that must be fixed during development. Otherwise you have a broken product.

This goes for checking returned error codes as well.

The logical end to this path of discussion is to remove all fatal error checks from the kernel in production because they should never appear. I am sure you would agree this is not practical therefore there is always a finite chance of a fatal error happening and robust systems need to take this issue seriously.

In the case of this specific check or test we need to find what is practical. The issues are how it effects a user who encounters it, what it does to the kernel in terms of overhead and complexity, and what can be implemented now and and what could be implemented given the time and resources.

How many fatal errors instance are there in RTEMS in the kernel? Not the number of error code, but the specific locations a fatal error can appear, ie code/line pairs? I have never audited this.

See Internal_errors_Core_list, we have a test for every fatal internal error.

Where is the user documentation? The C User guide is lacking detail for this and the other generic errors cases.

Looking further into this we have around 300 calls to rtems_fatal_error_occurred less some for noise due to an approximate count. Of this around 200 are in the c/src tree with about 50 under the cpukit. For example we have rtems_fatal_error_occurred(0xdeadbeef); in cpukit/libfs/src/nfsclient/src/nfs.c.

The we have the family of *_Fatal* calls. Getting decent counts is not easy so I will not put them there. We have _CPU_Fatal_halt, _BSP_Fatal_error, _POSIX_Fatal_error, MPCI_Fatal, and _SMP_Fatal. Without decent control we end up with _CPU_Fatal_halt(RTEMS_FATAL_SOURCE_EXCEPTION, 0xECC0); which is in the NIOS2 interrupt handler.

I am concerned about adding to this without a suitable means to provide accessible user documentation on what we do.

This implies testing will highlight the issue because you have a debugger to give you this data. Currently RTEMS standard or default stack traces that get called on a fatal error provide little if any information that could be used to resolve the exact source, eg the thread id executing or even better an unwinder (dreaming here). Better support for tier 1 archs would help.

Improved fatal error diagnostics is a different topic.

I do not agree. If you argue a case for using fatal errors then I see it as reasonable we discuss how this impacts users.

With a debugger is a matter of seconds to figure out the problem spot of a fatal error.

Yes, however the time needed to understand the issue depends on the person looking and you and I are not suitable to judge this.

A fatal error could occur in production simply due to the fact there is code in the build that can generate them, and they do happen when performing system integration when there are no debuggers connected. How the error gets translated by a user with no knowledge of the RTEMS internals is the question being proposed. The more fatal errors we add the more of a problem this becomes and this ticket wants to add another.

  • This is a new constraint specific to SMP. Existing software may be simply unaware of this issue. However, its important to detect this constraint violation.

I agree it is important.

  • _Thread_Do_dispatch() has no return value. Adding this check to other places would be much more difficult, error prone. with more space and time overhead, and labour intensive to test.

There are no other similar tests happening now on the blocking paths?

No, this is a weak area in RTEMS. For example call rtems_task_wake_after() in an interrupt service routine. You don't get any status information that this is stupid.

Yes we have a few cases like this.

For now, I think a fatal error is sufficient.

Ok, however there is a need for documentation on the error aimed at the user being added to our user manuals.

I would like to see better default exception output and after this discussion I can see a real need for the debug server needing to catch any fatal errors and break the system leaving the user in the thread stack frame with the error.

In case there is really a problem with this in the field, we can still improve things.

In terms of the implementation, sure, in terms of supporting users with suitable documentation I believe we have a problem now.

What matters is that this constraint violation gets detected, otherwise you can spend hours on debugging.

Yes this is the most important item to get resolved.

comment:8 Changed on Nov 17, 2016 at 2:51:24 PM by Sebastian Huber

I will focus on fatal error documentation improvements. At least for this particular new one.

comment:9 Changed on Nov 18, 2016 at 7:04:08 AM by Sebastian Huber <sebastian.huber@…>

In 537f00ebe83370b8336361b8ae34d4a71e7023bb/rtems:

score: Restrict task interrupt level to 0 on SMP

Update #2811.

comment:10 Changed on Nov 18, 2016 at 7:04:23 AM by Sebastian Huber <sebastian.huber@…>

In 408609f6b9cd8e03d3886b7c150efbf7e59b5fb0/rtems:

score: Add _ISR_Is_enabled()

In contrast to _ISR_Get_level() the _ISR_Is_enabled() function evaluates
a level parameter and returns a boolean value.

Update #2811.

comment:11 Changed on Nov 23, 2016 at 11:54:10 AM by Sebastian Huber <sebastian.huber@…>

In 84e6f15c828869eb7d293096cfcfa0563b5752b3/rtems:

score: Robust thread dispatch

On SMP configurations, it is a fatal error to call blocking operating
system with interrupts disabled, since this prevents delivery of
inter-processor interrupts. This could lead to executing threads which
are not allowed to execute resulting in undefined behaviour.

The ARM Cortex-M port has a similar problem, since the interrupt state
is not a part of the thread context.

Update #2811.

comment:12 Changed on Dec 2, 2016 at 12:57:09 PM by Sebastian Huber <sebastian.huber@…>

In b07e642a2b1249cd64048c5e9f5e45254df7ae65/rtems:

smpthreadlife01: Fix due to robust thread dispatch

Update #2811.

comment:13 Changed on Dec 9, 2016 at 12:04:53 PM by Sebastian Huber

Resolution: fixed
Status: newclosed

Documentation update is done.

comment:14 Changed on May 11, 2017 at 7:31:02 AM by Sebastian Huber

Milestone: 4.124.12.0

comment:15 Changed on Sep 21, 2017 at 11:33:40 AM by Sebastian Huber <sebastian.huber@…>

In a4bca685/rtems:

bsps/powerpc: Fix robust thread dispatch

Implement thread dispatch code in ppc_exc_wrapup() similar to
ppc_exc_interrupt().

Update #2811.

comment:16 Changed on Oct 9, 2017 at 6:10:35 AM by Sebastian Huber <sebastian.huber@…>

In 80933ab3/rtems:

bsps/powerpc: Fix robust thread dispatch again

Use the saved MSR to account for FPU and AltiVec? settings.

Update #2811.

comment:17 Changed on Nov 9, 2017 at 6:27:14 AM by Sebastian Huber

Milestone: 4.12.05.1

Milestone renamed

Note: See TracTickets for help on using tickets.