= SMP =

[[TOC(Developer/SMP, depth=2)]]

= Status =

The [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] support for RTEMS is a work in progress. Basic support is available for ARM, PowerPC, SPARC and Intel x86.

= Requirements =

No public requirements exist currently.

= Design Issues =

= Low-Level Start =

== Status ==

The low-level start is guided by the per-CPU control state [http://www.rtems.org/onlinedocs/doxygen/cpukit/html/group__PerCPU.html#gabad09777c1e3a7b7f3efcae54938d418 Per_CPU_Control::state].  See also ''_Per_CPU_Change_state()'' and ''_Per_CPU_Wait_for_state()''.

[wiki:File:rtems-smp-low-level-states.png File:rtems-smp-low-level-states.png]

== Future Directions ==

None.

= Processor Affinity =

== Status ==

Thread processor affinity is not supported.

== Future Directions ==

Implement the missing feature.  This consists of two parts:

 * the application level API, and
 * the scheduler support.

There is no POSIX API that covers thread processor affinity.  Linux is the de facto standard system for high performance computing, so it is an obvious choice to provide a Linux compatible API in RTEMS.  Linux provides [http://man7.org/linux/man-pages/man3/CPU_SET.3.html CPU_SET(3)] to manage sets of processors.  Thread processor affinity can be controlled via [http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html SCHED_SETAFFINITY(2)] and [http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html PTHREAD_SETAFFINITY_NP(3)].  RTEMS should provide and implement these APIs.  It is not possible to use the Linux files directly since they have a pure GPL license.

The scheduler support for arbitrary processor affinities is a major challenge.  Schedulability analysis for schedulers with arbitrary processor affinities is a current research topic.

= Atomic Operations =

== Status ==

The C11 ''<stdatomic.h>'' header file is now available in Newlib.  GCC 4.8 supports C11 atomic operations.  Proper atomic operations support for LEON3 is included in GCC 4.9.  According to the SPARC GCC maintainer it is possible to back port this to GCC 4.8.  A GSoC 2013 project works on an atomic operations API for RTEMS.  One part will be a read-write lock using a phase-fair lock implementation.

Example ticket lock with C11 atomics:

{{{
#include <stdatomic.h>

struct ticket {
  atomic_uint ticket;
  atomic_uint now_serving;
};

void acquire(struct ticket *t)
{
  unsigned int my_ticket =
    atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);

  while (atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket) {
    /* Wait */
  }
}

void release(struct ticket *t)
{
  unsigned int current_ticket =
    atomic_load_explicit(&t->now_serving, memory_order_relaxed);

  atomic_store_explicit(&t->now_serving, current_ticket + 1U, memory_order_release);
}
}}}

The generated assembler code looks pretty good.  Please note that GCC generates ''CAS'' instructions and not ''CASA'' instructions.

{{{
	.file	"ticket.c"
	.section	".text"
	.align 4
	.global acquire
	.type	acquire, #function
	.proc	020
acquire:
	ld	[%o0], %g1
	mov	%g1, %g2
.LL7:
	add	%g1, 1, %g1
	cas	[%o0], %g2, %g1
	cmp	%g1, %g2
	bne,a	.LL7
	 mov	%g1, %g2
	add	%o0, 4, %o0
.LL4:
	ld	[%o0], %g1
	cmp	%g1, %g2
	bne	.LL4
	 nop
	jmp	%o7+8
	 nop
	.size	acquire, .-acquire
	.align 4
	.global release
	.type	release, #function
	.proc	020
release:
	ld	[%o0+4], %g1
	add	%g1, 1, %g1
	st	%g1, [%o0+4]
	jmp	%o7+8
	 nop
	.size	release, .-release
	.ident	"GCC: (GNU) 4.9.0 20130917 (experimental)"
}}}

== Future Directions ==

 * Review and integrate the GSoC work.
 * Make use of atomic operations (see the usage sketch below).
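As a usage sketch, the ticket lock above can be exercised with a small host-side test program.  This is a hypothetical example using POSIX threads (not RTEMS-specific) and assumes it is compiled together with the ''acquire()'' and ''release()'' functions from the example above:

{{{
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

struct ticket {
  atomic_uint ticket;
  atomic_uint now_serving;
};

void acquire(struct ticket *t);
void release(struct ticket *t);

static struct ticket lock;
static unsigned long counter;

static void *worker(void *arg)
{
  int i;

  (void) arg;

  for (i = 0; i < 100000; ++i) {
    acquire(&lock);
    ++counter; /* protected by the ticket lock */
    release(&lock);
  }

  return NULL;
}

int main(void)
{
  pthread_t t[4];
  int i;

  for (i = 0; i < 4; ++i)
    pthread_create(&t[i], NULL, worker, NULL);
  for (i = 0; i < 4; ++i)
    pthread_join(t[i], NULL);

  printf("counter = %lu\n", counter); /* expect 400000 */

  return 0;
}
}}}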
= SMP Locks =

== Status ==

SMP locks are implemented as a ticket lock using CPU architecture specific atomic operations.

== Future Directions ==

 * Use a local context to be able to use scalable lock implementations like the Mellor-Crummey and Scott (MCS) queue-based locks.
 * Introduce read-write locks.  Use a phase-fair read-write lock implementation.  This can be used for example by the time management code.  The system time may be read frequently, but updates are infrequent.

= ISR Locks =

== Status ==

On single processor configurations, disabling interrupts ensures mutual exclusion.  This is no longer true on SMP since other processors continue to execute freely.  On SMP the disabling of interrupts must be combined with an SMP lock.  The ISR locks degrade to simple interrupt disable/enable sequences on single processor configurations.  On SMP configurations they use an SMP lock to ensure mutual exclusion throughout the system.

== Future Directions ==

 * See [wiki:#SMP_Locks SMP locks].
 * Ensure via an RTEMS assertion that plain interrupt disable/enable sequences are only used intentionally outside of Giant lock critical sections.  Review the usage of ISR disable/enable sequences in the complete code base.

= Giant Lock vs. Fine Grained Locking =

== Status ==

The operating system state is currently protected by three things:

 * the per-processor thread dispatch disable level,
 * the Giant lock (a global recursive SMP lock), and
 * the disabling of interrupts.

For example, operating system services like ''rtems_semaphore_release()'' follow this pattern:

{{{
operation(id):
  disable_thread_dispatching()
  acquire_giant_lock()
  obj = get_obj_by_id(id)
  status = obj->operation()
  put_object(obj)
  release_giant_lock()
  enable_thread_dispatching()
  return status
}}}

All high level operations are serialized by the Giant lock.  The introduction of the Giant lock allowed a straightforward reuse of the existing single processor code in an SMP configuration.  However, a Giant lock will likely lead to undesirably high worst-case latencies which increase linearly with the processor count (see Björn B. Brandenburg, ''Scheduling and Locking in Multiprocessor Real-Time Operating Systems'', https://www.cs.unc.edu/~anderson/diss/bbbdiss.pdf, 2011).

== Future Directions ==

The Giant lock should be replaced with fine grained locking.  In order to introduce fine grained locking the state transitions inside a section protected by the Giant lock must be examined.  As an example the event send and receive operations are presented in a simplified state diagram.

[wiki:File:rtems-smp-events.png 645x600px]

The colour indicates which locks are held during a state and state transition.

In case of a blocking operation we have four synchronization states which are used to split up the complex blocking operation into parts in order to decrease the interrupt latency:

 * nothing happened,
 * satisfied,
 * timeout, and
 * synchronized.

This state must be available for each thread and resource (e.g. event, semaphore, message queue).  Since this would lead to a prohibitively high memory demand and a complicated implementation, some optimizations have been performed in RTEMS.  For events a global variable is used to indicate the synchronization state.  For resource objects the synchronization state is part of the object.  Since at most one processor can execute a blocking operation (ensured by the Giant lock or the single processor configuration) it is possible to use the state variable only for the executing thread.
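As a sketch, the four synchronization states above can be expressed as a C enumeration.  The names follow the RTEMS naming convention (cf. ''Thread_blocking_operation_States'' in the score); the exact declaration in the code base may differ:

{{{
typedef enum {
  /* The blocking operation is complete, no out-of-order condition occurred */
  THREAD_BLOCKING_OPERATION_SYNCHRONIZED,

  /* The executing thread is about to block, nothing happened so far */
  THREAD_BLOCKING_OPERATION_NOTHING_HAPPENED,

  /* The blocking operation timed out before the thread was enqueued */
  THREAD_BLOCKING_OPERATION_TIMEOUT,

  /* The request was satisfied before the thread was enqueued */
  THREAD_BLOCKING_OPERATION_SATISFIED
} Thread_blocking_operation_States;
}}}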
In case one SMP lock is used per resource, the resource must indicate which thread is about to perform a blocking operation, since more than one executing thread exists.  For example two threads can try to obtain different semaphores on different processors at the same time.

A blocking operation will use not only the resource object, but also other resources like the scheduler shared by all threads of a dispatching domain.  Thus the scheduler needs its own locks.  This will lead to nested locking and deadlocks must be prevented.

= Watchdog Handler =

== Status ==

The [http://www.rtems.org/onlinedocs/doxygen/cpukit/html/group__ScoreWatchdog.html Watchdog Handler] is a global resource which provides time management in RTEMS.  It is a core service of the operating system and used by the high level APIs.  It is used for time of day maintenance and time events based on ticks or wall clock time, for example.  The term ''watchdog'' has nothing to do with hardware watchdogs.

== Future Directions ==

The watchdog uses delta chains to manage time based events.  One global watchdog resource leads to scalability problems.  An SMP system is scalable if additional processors increase the performance metrics.  If we add more processors, then in general the number of timers increases proportionally and thus leads to longer delta chains.  Also the chance for lock contention increases.  Thus the watchdog resource should move to the scheduler scope with per scheduler instances.  Additional processors can then use new scheduler instances.

= Per-CPU Control =

== Status ==

Per-CPU control state is available for each configured CPU in a statically created global table ''_Per_CPU_Information''.  Each per-CPU control is cache aligned to prevent false sharing and to provide simple access via assembly code.  CPU ports can add custom fields to the per-CPU control.  This is used on SPARC for the ISR dispatch disable indication.

== Future Directions ==

None.

= Interrupt Support =

== Status ==

The interrupt support is BSP specific in general.

== Future Directions ==

 * The SMP capable BSPs should implement the [http://www.rtems.org/onlinedocs/doxygen/cpukit/html/group__rtems__interrupt__extension.html Interrupt Manager Extension].
 * Add an interrupt processor affinity API to the ''Interrupt Manager Extension''.  This should use the affinity sets API used for thread processor affinity.

= Global Scheduler =

== Status ==

 * Scheduler selection is a link-time configuration option.
 * Basic SMP support for schedulers is in place, e.g. processor allocation and extended thread state information.
 * Two global fixed priority schedulers (G-FP) are available with SMP support.

== Future Directions ==

 * Add more schedulers, e.g. EDF, phase-fair, early-release fair, etc.
 * Allow thread processor affinity selection for each thread.
 * The scheduler operations have no context parameter; instead they use the global variable ''_Scheduler''.  The scheduler operations signature should be changed to use a parameter for the scheduler specific context to allow more than one scheduler in the system.  This can be used to enable scheduler hierarchies or partitioned schedulers.
 * The scheduler locking is inconsistent.  Some scheduler operations are invoked with interrupts disabled.  Others, like the yield operations, are called without interrupts disabled and must disable interrupts locally.
 * The new SMP schedulers lack test coverage.
 * Implement clustered (or partitioned) scheduling with a sophisticated resource sharing protocol (e.g. A. Burns and A.J. Wellings, ''A Schedulability Compatible Multiprocessor Resource Sharing Protocol - MrsP'', Proceedings of the 25th Euromicro Conference on Real-Time Systems (ECRTS 2013), July 2013).  To benefit from this the Giant lock must be eliminated.
= Clustered Scheduling =

== Status ==

Clustered scheduling is not supported.

== Future Directions ==

Suppose we have a set of processors.  The set of processors can be partitioned into non-empty, pairwise disjoint subsets.  Such a subset is called a dispatching domain.  Dispatching domains should be set up in the initialization phase of the system before application-level threads (and worker threads) can be attached to them.  For now the partition must be specified by the application at link time: this restriction is acceptable in this study project as no change should be allowable to dispatching domains at run time.

Associated with each dispatching domain is exactly one scheduler instance.  This association (which defines the scheduler algorithm for the domain) must be specified by the application at link time, upon or after the creation of the dispatching domain.  So we have clustered scheduling.  At any one time a thread belongs to exactly one dispatching domain.  There will be no scheduling hierarchies and no intersections between dispatching domains.

If no dispatching domain is explicitly specified, then a thread will inherit the one of the executing context to support external code that is not aware of clustered scheduling.  This inheritance feature is erroneous and static checks should be made on the program code and link directives to prevent it from happening in this study project.

Do we need the ability to move a thread from one dispatching domain to another?  Application-level threads do not migrate.  Worker threads do not either.  The only allowable (and desirable) migration is for the execution of application-level threads in global resources - not the full task, only that ''protected'' execution - in accord with MrsP.

= Task Variables =

== Status ==

Task variables cannot be used on SMP.  The alternatives are thread-local storage, POSIX keys or special access functions using the thread control block of the executing thread.  For Newlib the access to the re-entrancy structure is now performed via ''__getreent()'', see also ''__DYNAMIC_REENT__'' in Newlib.

== Future Directions ==

 * Fix the Ada self task reference.
 * Fix the file system environment.
 * Fix the C++ support.

= Non-Preempt Mode for Mutual Exclusion =

== Status ==

Disabling of thread preemption cannot be used to ensure mutual exclusion on SMP.  The non-preempt mode is disabled on SMP and a run-time error will occur if such a mode change request is issued.  The alternatives are mutexes and condition variables.

== Future Directions ==

 * Add condition variables to the Classic API.
 * Fix pthread_once().
 * Fix the block device cache ''bdbuf''.
 * Fix the C++ support.

= Thread Restart =

== Status ==

The restart of threads other than self is not implemented.

== Future Directions ==

Implement the missing feature.  The difficult path is the restart of executing remote threads.  Executing remote threads may perform a context switch at any time.  This would overwrite a newly initialized thread context.  An option is a post-switch handler that performs a self-restart if necessary.  Post-switch handlers have some disadvantages:

 * They add a function call via function pointer overhead to every invocation of ''_Thread_Dispatch()''.  Post-switch handlers have to check on every invocation if they have something to do.
 * Currently post-switch handlers are only used for signal management.  Some work was done to install post-switch handlers only on demand since performance measurements showed that they are indeed a problem.
 * Depending on the low-level interrupt support they may execute with interrupts disabled and thus increase the worst-case interrupt latency.
 * Dynamic add/remove of post-switch handlers is problematic since in this case some lock strategy must be used, which makes ''_Thread_Dispatch()'' more complex and adds an execution overhead.
 * Per-CPU specific post-switch handlers are an option.  They can use the per-CPU lock which must be obtained in ''_Thread_Dispatch()'' anyway.

Why execute post-switch handlers with interrupts disabled on some architectures (e.g. ARM and PowerPC)?  The most common interrupt handling sequence looks like this:

[wiki:File:rtems-smp-isr-1.png File:rtems-smp-isr-1.png]

In this sequence it is not possible to enable interrupts around the ''_Thread_Dispatch()'' call.  This could lead to an unlimited number of interrupt contexts saved on the thread stack.  To overcome this issue some architectures use a flag variable that indicates this particular execution environment (e.g. the SPARC).  Here the sequence looks like this:

[wiki:File:rtems-smp-isr-2.png File:rtems-smp-isr-2.png]

The ISR thread dispatch disable flag must be part of the thread context.  The context switch will save/restore the per-CPU ISR dispatch disable flag to/from the thread context.  Each thread stack must have space for at least two interrupt contexts.  This scheme could be used for all architectures.  Further discussion is necessary.  An alternative to post-switch handlers is currently unknown.

= Thread Delete =

== Status ==

Deletion of threads is not implemented.

== Future Directions ==

Implement the missing feature.  The key problems are lock order reversal issues (Giant vs. allocator lock).  Self deletion imposes the problem that some thread resources like the thread stack and the thread object must be freed by a second party.  On single processor configurations this was done with a hack that doesn't work on SMP.  See also [http://www.rtems.org/pipermail/rtems-devel/2013-July/003509.html Thread Life-Cycle Changes].

= Semaphores and Mutexes =

== Status ==

The semaphore and mutex objects use ''_Objects_Get_isr_disable()''.  On SMP configurations this first acquires the Giant lock and then interrupts are disabled.

== Future Directions ==

Use an ISR lock per object to improve the performance for uncontested operations.  See also [wiki:#Giant_Lock_vs._Fine_Grained_Locking Giant Lock vs. Fine Grained Locking].

= Implementations =

= Tool Chain =

== Binutils ==

A Binutils 2.24 or later release must be used due to the LEON3 support.

== GCC ==

A GCC 4.8 2013-11-25 (e.g. 4.8.3) or later must be used due to the LEON3 support.  The LEON3 support for GCC includes a proper C11 memory model definition for this processor and C11 atomic operations using the CAS instruction.  The backport of the LEON3 support was initiated by EB [http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02255.html].

== GDB ==

Current GDB versions have problems with the debug format generated by GCC 4.8 and later [https://sourceware.org/bugzilla/show_bug.cgi?id=16215].

= Profiling =

== Reason ==

SMP lock and interrupt processing profiling is necessary to fulfill some observability requirements.  Vital timing data can be gathered on a per object basis through profiling.

== RTEMS API Changes ==

None.
== High-Performance CPU Counters ==

In order to measure short time intervals we have to add high-performance CPU counter support to the CPU port API.  This can also be used as a replacement for the BSP specific benchmark timers.  It may also be used to implement busy wait loops which are required by some device drivers.

{{{
/**
 * @brief Integer type for CPU counter values.
 */
typedef XXX CPU_counter;

/**
 * @brief Returns the current CPU counter value.
 */
CPU_counter _CPU_counter_Get( void );

/**
 * @brief Mask for arithmetic operations with the CPU counter value.
 *
 * All arithmetic operations are defined as A = ( C op B ) & MASK.
 */
CPU_counter _CPU_counter_Mask( void );

/**
 * @brief Converts a CPU counter value into nanoseconds.
 */
uint64_t _CPU_counter_To_nanoseconds( CPU_counter counter );
}}}
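A usage sketch for the proposed API, measuring the duration of a code section in nanoseconds (''do_work()'' is a placeholder for the code under measurement):

{{{
extern void do_work( void );

uint64_t measure_work( void )
{
  CPU_counter begin = _CPU_counter_Get();

  do_work();

  CPU_counter end = _CPU_counter_Get();

  /* Arithmetic is defined as A = ( C op B ) & MASK, so the difference
   * must be masked to cope with counter wrap-around. */
  CPU_counter delta = ( end - begin ) & _CPU_counter_Mask();

  return _CPU_counter_To_nanoseconds( delta );
}
}}}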
== SMP Lock Profiling ==

The SMP lock profiling will be an RTEMS build configuration time option (RTEMS_LOCK_PROFILING).  The following statistics are proposed.

{{{
#define SMP_LOCK_STATS_CONTENTION_COUNTS 4

/**
 * @brief SMP lock statistics.
 *
 * The lock acquire attempt instant is the point in time right after the
 * interrupt disable action in the lock acquire sequence.
 *
 * The lock acquire instant is the point in time right after the lock
 * acquisition.  This is the begin of the critical section code execution.
 *
 * The lock release instant is the point in time right before the interrupt
 * enable action in the lock release sequence.
 *
 * The lock section time is the time elapsed between the lock acquire instant
 * and the lock release instant.
 *
 * The lock acquire time is the time elapsed between the lock acquire attempt
 * instant and the lock acquire instant.
 */
struct SMP_lock_Stats {
#ifdef RTEMS_LOCK_PROFILING
  /**
   * @brief The last lock acquire instant in CPU counter ticks.
   *
   * This value is used to measure the lock section time.
   */
  CPU_counter acquire_instant;

  /**
   * @brief The maximum lock section time in CPU counter ticks.
   */
  CPU_counter max_section_time;

  /**
   * @brief The maximum lock acquire time in CPU counter ticks.
   */
  CPU_counter max_acquire_time;

  /**
   * @brief The count of lock uses.
   *
   * This value may overflow.
   */
  uint64_t usage_count;

  /**
   * @brief The counts of lock acquire operations with contention.
   *
   * The contention count for index N corresponds to a lock acquire attempt
   * with an initial queue length of N + 1.  The last index corresponds to all
   * lock acquire attempts with an initial queue length greater than or equal
   * to SMP_LOCK_STATS_CONTENTION_COUNTS.
   *
   * The values may overflow.
   */
  uint64_t contention_counts[SMP_LOCK_STATS_CONTENTION_COUNTS];

  /**
   * @brief Total lock section time in CPU counter ticks.
   *
   * The average lock section time is the total section time divided by the
   * lock usage count.
   *
   * This value may overflow.
   */
  uint64_t total_section_time;
#endif /* RTEMS_LOCK_PROFILING */
};

struct SMP_lock_Control {
  ... lock data ...
  SMP_lock_Stats Stats;
};
}}}

A function should be added to monitor the lock contention.

{{{
/**
 * @brief Called in case of lock contention.
 *
 * @param[in] counter The spin loop iteration counter.
 */
void _SMP_lock_Contention_monitor(
  const SMP_lock_Control *lock,
  int counter
);
}}}

A ticket lock can then look like this:

{{{
void acquire(struct ticket *t)
{
  unsigned int my_ticket =
    atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);
#ifdef RTEMS_LOCK_PROFILING
  int counter = 0;
#endif /* RTEMS_LOCK_PROFILING */

  while (atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket) {
#ifdef RTEMS_LOCK_PROFILING
    ++counter;
    _SMP_lock_Contention_monitor(t, counter);
#endif /* RTEMS_LOCK_PROFILING */
  }
}
}}}

SMP lock statistics can be evaluated using the following method.

{{{
typedef void ( *SMP_lock_Visitor )(
  void *arg,
  SMP_lock_Control *lock,
  SMP_lock_Class lock_class,
  Objects_Name lock_name
);

/**
 * @brief Iterates through all system SMP locks and invokes the visitor for
 * each lock.
 */
void _SMP_lock_Iterate( SMP_lock_Visitor visitor, void *arg );
}}}
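A sketch of a visitor that prints the maximum acquire time of each lock (hypothetical example; it assumes the profiling option is enabled and uses ''_CPU_counter_To_nanoseconds()'' from above):

{{{
#include <inttypes.h>
#include <stdio.h>

static void print_max_acquire_time(
  void *arg,
  SMP_lock_Control *lock,
  SMP_lock_Class lock_class,
  Objects_Name lock_name
)
{
  (void) arg;
  (void) lock_class;
  (void) lock_name;
#ifdef RTEMS_LOCK_PROFILING
  printf(
    "max acquire time: %" PRIu64 " ns\n",
    _CPU_counter_To_nanoseconds( lock->Stats.max_acquire_time )
  );
#else
  (void) lock;
#endif /* RTEMS_LOCK_PROFILING */
}

void print_all_smp_lock_stats( void )
{
  _SMP_lock_Iterate( print_max_acquire_time, NULL );
}
}}}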
== Interrupt and Thread Profiling ==

The interrupt and thread profiling will be an RTEMS build configuration time option (RTEMS_INTERRUPT_AND_THREAD_PROFILING).  The time spent on interrupts and the time of disabled thread dispatching should be monitored per-processor.  The time between the interrupt recognition by the processor and the actual start of the interrupt handler code execution should be monitored per-processor if the hardware supports this.

{{{
/**
 * @brief Per-CPU statistics.
 */
struct Per_CPU_Stats {
#ifdef RTEMS_INTERRUPT_AND_THREAD_PROFILING
  /**
   * @brief The thread dispatch disabled begin instant in CPU counter ticks.
   *
   * This value is used to measure the time of disabled thread dispatching.
   */
  CPU_counter thread_dispatch_disabled_instant;

  /**
   * @brief The last outer-most interrupt begin instant in CPU counter ticks.
   *
   * This value is used to measure the interrupt processing time.
   */
  CPU_counter outer_most_interrupt_instant;

  /**
   * @brief The maximum interrupt delay in CPU counter ticks if supported by
   * the hardware.
   */
  CPU_counter max_interrupt_delay;

  /**
   * @brief The maximum time of disabled thread dispatching in CPU counter
   * ticks.
   */
  CPU_counter max_thread_dispatch_disabled_time;

  /**
   * @brief Count of times when the thread dispatch disable level changes from
   * zero to one in thread context.
   *
   * This value may overflow.
   */
  uint64_t thread_dispatch_disabled_count;

  /**
   * @brief Total time of disabled thread dispatching in CPU counter ticks.
   *
   * The average time of disabled thread dispatching is the total time of
   * disabled thread dispatching divided by the thread dispatch disabled
   * count.
   *
   * This value may overflow.
   */
  uint64_t total_thread_dispatch_disabled_time;

  /**
   * @brief Count of times when the interrupt nest level changes from zero to
   * one.
   *
   * This value may overflow.
   */
  uint64_t interrupt_count;

  /**
   * @brief Total time of interrupt processing in CPU counter ticks.
   *
   * The average time of interrupt processing is the total time of interrupt
   * processing divided by the interrupt count.
   *
   * This value may overflow.
   */
  uint64_t total_interrupt_time;
#endif /* RTEMS_INTERRUPT_AND_THREAD_PROFILING */
};

struct Per_CPU_Control {
  ... per-CPU data ...
  Per_CPU_Stats Stats;
};
}}}

== Status ==

CPU counter support is complete [http://git.rtems.org/rtems/commit/?id=24bf11eca11947d961cc9bb5f7d92dabff169e93].

= Interrupt Support =

== Reason ==

Applications should be able to distribute the interrupt load throughout the system.  In combination with partitioned/clustered scheduling this can reduce the amount of inter-processor synchronization and thread migrations.

== RTEMS API Changes ==

Each interrupt needs a processor affinity set in the RTEMS SMP configuration.  The rtems_interrupt_handler_install() function will not alter the processor affinity set of the interrupt vector.  At system start-up all interrupts except the inter-processor interrupts must be initialized to have an affinity with the initialization processor only.

Two new functions should be added to alter and retrieve the processor affinity sets of interrupt vectors.

{{{
/**
 * @brief Sets the processor affinity set of an interrupt vector.
 *
 * @param[in] vector The interrupt vector number.
 * @param[in] affinity_set_size Size of the specified affinity set buffer in
 * bytes.  This value must be positive.
 * @param[in] affinity_set The new processor affinity set for the interrupt
 * vector.  This pointer must not be @c NULL.  A set bit in the affinity set
 * means that the interrupt can occur on this processor and a cleared bit
 * means the opposite.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID The vector number is invalid.
 * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
 */
rtems_status_code rtems_interrupt_set_affinity(
  rtems_vector vector,
  size_t affinity_set_size,
  const cpu_set_t *affinity_set
);

/**
 * @brief Gets the processor affinity set of an interrupt vector.
 *
 * @param[in] vector The interrupt vector number.
 * @param[in] affinity_set_size Size of the specified affinity set buffer in
 * bytes.  This value must be positive.
 * @param[out] affinity_set The current processor affinity set of the
 * interrupt vector.  This pointer must not be @c NULL.  A set bit in the
 * affinity set means that the interrupt can occur on this processor and a
 * cleared bit means the opposite.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID The vector number is invalid.
 * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
 * current processor affinity set of the interrupt vector.
 */
rtems_status_code rtems_interrupt_get_affinity(
  rtems_vector vector,
  size_t affinity_set_size,
  cpu_set_t *affinity_set
);
}}}
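A usage sketch for the proposed functions, routing an interrupt to one processor (hypothetical example; CPU_ZERO() and CPU_SET() are the CPU_SET(3) macros mentioned earlier):

{{{
rtems_status_code route_interrupt_to_cpu_one( rtems_vector vector )
{
  cpu_set_t affinity;

  CPU_ZERO( &affinity );
  CPU_SET( 1, &affinity );

  return rtems_interrupt_set_affinity(
    vector,
    sizeof( affinity ),
    &affinity
  );
}
}}}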
= Clustered Scheduling =

== Reason ==

Partitioned/clustered scheduling helps to control the worst-case latencies in the system.  The goal is to reduce the amount of shared state in the system and thus prevent lock contention.  Modern multi-processor systems tend to have several layers of data and instruction caches.  With partitioned/clustered scheduling it is possible to honor the cache topology of a system and thus avoid expensive cache synchronization traffic.

== RTEMS API Changes ==

Functions for scheduler management:

{{{
/**
 * @brief Identifies a scheduler by its name.
 *
 * The scheduler name is determined by the scheduler configuration.
 *
 * @param[in] name The scheduler name.
 * @param[out] scheduler_id The scheduler identifier associated with the name.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_NAME Invalid scheduler name.
 */
rtems_status_code rtems_scheduler_ident(
  rtems_name name,
  rtems_id *scheduler_id
);

/**
 * @brief Gets the set of processors owned by the scheduler.
 *
 * @param[in] scheduler_id Identifier of the scheduler.
 * @param[in] processor_set_size Size of the specified processor set buffer in
 * bytes.  This value must be positive.
 * @param[out] processor_set The processor set owned by the scheduler.  This
 * pointer must not be @c NULL.  A set bit in the processor set means that
 * this processor is owned by the scheduler and a cleared bit means the
 * opposite.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid scheduler identifier.
 * @retval RTEMS_INVALID_CPU_SET The processor set buffer is too small for the
 * set of processors owned by the scheduler.
 */
rtems_status_code rtems_scheduler_get_processors(
  rtems_id scheduler_id,
  size_t processor_set_size,
  cpu_set_t *processor_set
);
}}}

Each thread needs a processor affinity set in the RTEMS SMP configuration.  The rtems_task_create() function will use the processor affinity set of the executing thread to initialize the processor affinity set of the created task.  This enables backward compatibility for existing software.

Two new functions should be added to alter and retrieve the processor affinity sets of tasks.

{{{
/**
 * @brief Sets the processor affinity set of a task.
 *
 * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
 * the executing task.
 * @param[in] affinity_set_size Size of the specified affinity set buffer in
 * bytes.  This value must be positive.
 * @param[in] affinity_set The new processor affinity set for the task.  This
 * pointer must not be @c NULL.  A set bit in the affinity set means that the
 * task can execute on this processor and a cleared bit means the opposite.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid task identifier.
 * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
 */
rtems_status_code rtems_task_set_affinity(
  rtems_id task_id,
  size_t affinity_set_size,
  const cpu_set_t *affinity_set
);

/**
 * @brief Gets the processor affinity set of a task.
 *
 * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
 * the executing task.
 * @param[in] affinity_set_size Size of the specified affinity set buffer in
 * bytes.  This value must be positive.
 * @param[out] affinity_set The current processor affinity set of the task.
 * This pointer must not be @c NULL.  A set bit in the affinity set means that
 * the task can execute on this processor and a cleared bit means the
 * opposite.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid task identifier.
 * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
 * current processor affinity set of the task.
 */
rtems_status_code rtems_task_get_affinity(
  rtems_id task_id,
  size_t affinity_set_size,
  cpu_set_t *affinity_set
);
}}}

Two new functions should be added to alter and retrieve the scheduler of tasks.

{{{
/**
 * @brief Sets the scheduler of a task.
 *
 * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
 * the executing task.
 * @param[in] scheduler_id Identifier of the scheduler.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid task identifier.
 * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier.
 *
 * @see rtems_scheduler_ident().
 */
rtems_status_code rtems_task_set_scheduler(
  rtems_id task_id,
  rtems_id scheduler_id
);

/**
 * @brief Gets the scheduler of a task.
 *
 * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
 * the executing task.
 * @param[out] scheduler_id Identifier of the scheduler.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid task identifier.
 */
rtems_status_code rtems_task_get_scheduler(
  rtems_id task_id,
  rtems_id *scheduler_id
);
}}}
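A usage sketch combining these functions (hypothetical example; the scheduler name 'EDF0' and the processor numbers refer to the configuration example in the next subsection):

{{{
rtems_status_code move_self_to_edf_domain( void )
{
  rtems_status_code sc;
  rtems_id scheduler_id;
  cpu_set_t affinity;

  sc = rtems_scheduler_ident(
    rtems_build_name( 'E', 'D', 'F', '0' ),
    &scheduler_id
  );
  if ( sc != RTEMS_SUCCESSFUL )
    return sc;

  sc = rtems_task_set_scheduler( RTEMS_SELF, scheduler_id );
  if ( sc != RTEMS_SUCCESSFUL )
    return sc;

  /* Restrict the executing task to processors 6 and 7 */
  CPU_ZERO( &affinity );
  CPU_SET( 6, &affinity );
  CPU_SET( 7, &affinity );

  return rtems_task_set_affinity( RTEMS_SELF, sizeof( affinity ), &affinity );
}
}}}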
== Scheduler Configuration ==

There are two options for the scheduler instance configuration:

 1. static configuration by means of global data structures, and
 2. configuration at run-time via function calls.

For a configuration at run-time the system must start with a default scheduler.  The global constructors are called in this environment.  The order of global constructor invocation is unpredictable, so it is difficult to create threads in this context since the run-time scheduler configuration may not exist yet.  Since scheduler data structures are allocated from the workspace, the configuration must take a later run-time setup of schedulers into account for the workspace size estimate.  In case the default scheduler is not appropriate it must be replaced, which gives rise to some implementation difficulties.  Since the processor availability is determined by hardware constraints, it is unclear which benefits a run-time configuration has.  For now, run-time configuration of scheduler instances will not be implemented.  The focus is now on static configuration.

Every scheduler needs a control context.  The scheduler API must provide a macro which creates a global scheduler instance specific data structure with a designator name as a mandatory parameter.  The scheduler instance creation macro may require additional scheduler specific configuration options.  For example a fixed-priority scheduler instance must know the maximum priority level to allocate the ready chain control table.

Once the scheduler instances are configured it must be specified for each processor in the system which scheduler instance owns this processor or if this processor is not used by the RTEMS system.  For each processor except the initialization processor a scheduler instance is optional, so that other operating systems can run independent of this RTEMS system on this processor.  It is a fatal error to omit a scheduler instance for the initialization processor.  The initialization processor is the processor which executes the boot_card() function.

{{{
/**
 * @brief Processor configuration.
 *
 * Use RTEMS_CPU_CONFIG_INIT() to initialize this structure.
 */
typedef struct {
  /**
   * @brief Scheduler instance for this processor.
   *
   * It is possible to omit a scheduler instance for this processor by using
   * the @c NULL pointer.  In this case RTEMS will not use this processor and
   * other operating systems may claim it.
   */
  Scheduler_Control *scheduler;
} rtems_cpu_config;

/**
 * @brief Processor configuration initializer.
 *
 * @param scheduler The reference to a scheduler instance or @c NULL.
 *
 * @see rtems_cpu_config.
 */
#define RTEMS_CPU_CONFIG_INIT(scheduler) \
  { ( scheduler ) }
}}}

Scheduler and processor configuration example:

{{{
RTEMS_SCHED_DEFINE_FP_SMP(fp0, rtems_build_name(' ', 'F', 'P', '0'), 256);
RTEMS_SCHED_DEFINE_FP_SMP(fp1, rtems_build_name(' ', 'F', 'P', '1'), 64);
RTEMS_SCHED_DEFINE_EDF_SMP(edf0, rtems_build_name('E', 'D', 'F', '0'));

const rtems_cpu_config rtems_cpu_config_table[] = {
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp0)),
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
  RTEMS_CPU_CONFIG_INIT(NULL),
  RTEMS_CPU_CONFIG_INIT(NULL),
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0)),
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0))
};

const size_t rtems_cpu_config_count =
  RTEMS_ARRAY_SIZE(rtems_cpu_config_table);
}}}

An alternative to the processor configuration table would be to specify in the scheduler instance which processors are owned by the instance.  This would require a static initialization of CPU sets, which is difficult.  Also the schedulers have to be registered somewhere, so some sort of table is needed anyway.
Since a processor can be owned by at most one scheduler instance, this configuration approach enables an additional error source which is avoided by the processor configuration table.

== Scheduler Implementation ==

Currently the scheduler operations have no control context and use global variables instead.  Thus the scheduler operations signatures must change to use a scheduler control context as the first parameter, e.g.

{{{
typedef struct Scheduler_Control Scheduler_Control;

typedef struct {
  [...]
  void ( *set_affinity )(
    Scheduler_Control *self,
    Thread_Control *thread,
    size_t affinity_set_size,
    const cpu_set_t *affinity_set
  );
  [...]
} Scheduler_Operations;

/**
 * @brief General scheduler control.
 */
struct Scheduler_Control {
  /**
   * @brief The scheduler operations.
   */
  Scheduler_Operations Operations;

  /**
   * @brief Size of the owned processor set in bytes.
   */
  size_t owned_cpu_set_size;

  /**
   * @brief Reference to the owned processor set.
   *
   * A set bit means this processor is owned by this scheduler instance, a
   * cleared bit means the opposite.
   */
  cpu_set_t *owned_cpu_set;
};
}}}

Single processor configurations also benefit from this change since it makes all dependencies explicit and easier to access (allows more efficient machine code).

= Multiprocessor Resource Sharing Protocol - MrsP =

== Reason ==

In general, application-level threads are not independent and may indeed share logical resources.  In a partitioned-scheduling system, where capacity allows, resources are allocated on the same processor as the threads that share them: in that case those resources are termed ''local''.  Where needed, resources may also reside on processors other than that of (some of) their sharing threads: in that case those resources are termed ''global''.

For partitioned scheduling of application-level threads and local resources, two choices are possible, which meet the ITT requirement of achieving predictable time behaviour for platform software:

 * (1) fixed-priority scheduling with the immediate priority ceiling protocol for controlling access to local resources, or
 * (2) EDF scheduling with the stack resource protocol for controlling access to local resources.

Choice (1) is more natural with RTEMS.  Both alternatives require preemption, whose disruption to the thread's cache working set may in part be attenuated by the use of techniques known as limited-preemption, which have been successfully demonstrated by the UoP.  Conversely, run-to-completion (non-preemptive) semantics is known to achieve much lower schedulable utilization, which makes it an implausible candidate for performance-hungry systems.

The use of fixed-priority scheduling for each processor where application-level threads locally run allows response time analysis to be used.  On the basis of the worst-case execution time of individual threads and of the graph of resource usage among threads, this analysis determines the worst-case completion time of every individual thread run on that processor, offering absolute guarantees on the schedulable utilization of every individual processor that uses that algorithm.

The determination of the worst-case execution time of software programs running on one processor of an SMP is significantly more complex than its single-processor analogue.  This is because every local execution suffers massive interference effects from its co-runners on other processors, regardless of functional independence.  Simplistically upper bounding the possible interference incurs exceeding pessimism, which causes a massive reduction in the allowable system load.
More accurate interference analysis may incur prohibitive costs unless simplifying assumptions can be safely made, one of which is strictly static partitioning: previous studies run by ESA [http://microelectronics.esa.int/ngmp/MulticoreOSBenchmark-FinalReport_v7.pdf] have indicated possible approaches to that problem.  Global scheduling greatly exacerbates the difficulty of that problem and thus is unwelcome.

With fixed-priority scheduling, each logical resource shared by multiple application-level threads (or their corresponding lock) must be statically attached a ceiling priority attribute computed as an upper bound to the static priority of all threads that may use that resource: when an application-level thread acquires the resource lock, the priority of that thread is raised to the ceiling of the resource; upon relinquishing the lock to that resource, the thread must return to the priority level that it had prior to acquiring the lock.  (It has been shown that the implementation of this protocol does not need resource locks at all, as the mere fact for a thread to be running implies that all of the resources that it may want to use are free at this time and none can be claimed by lower-priority threads.  However, it may be easier for an operating system to attach the ceiling value to the resource lock than to any other data structures.)  This simple protocol allows application-level threads to acquire multiple resources without risking deadlock.

When global resources are used, which is very desirable to alleviate the complexity of task allocation, the resource access control protocol in use, whether (1) or (2) in the earlier taxonomy, must be specialized, so that global resources can be syntactically told apart from local resources.  Luckily, two solutions have been proposed very recently to address this problem.  One solution, known as Optimal Migratory Priority Inheritance (OMIP), was proposed by Björn Brandenburg in his work entitled ''A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications'', presented at the ECRTS conference in July 2013.  The other solution, known as the Multiprocessor Resource Sharing Protocol (MrsP), was proposed by Alan Burns and Andy Wellings in their work entitled ''A Schedulability Compatible Multiprocessor Resource Sharing Protocol'', presented at the very same ECRTS 2013 conference.

OMIP aims to solve the problem of guaranteed bounded blocking in global resource sharing in clustered scheduling on a multiprocessor, where a cluster is a collection of at least one processor.  MrsP aims to achieve the smallest bound to the cumulative duration of blocking suffered by threads waiting to access a global resource in partitioned scheduling (clusters with exactly one processor), while allowing schedulability analysis to be performed per processor, using response time analysis, which is a technique fairly well known to industry.  No other global resource sharing protocol on SMP, including OMIP, is able to guarantee that.  For this reason MrsP is the choice for this study.

The MrsP protocol requires that:

 * (a) threads waiting for global resources spin at ceiling priority on their local processor, with a ceiling value greater than that of any other local resource on that processor, and
 * (b) the execution within the global resource may migrate to other processors where application-level threads waiting to access that resource are spinning.
Feature (a) prevents lower-priority threads from running in preference to the waiting higher-priority thread and stealing resources that it might want to use in the future as part of the current execution; should that stealing happen, the blocking penalty potentially suffered on access to global resources would skyrocket to untenable levels.

Feature (b) brings in the sole welcome extent of migration in the proposed model.  It is useful when higher-priority tasks running on the processor of the global resource prevent it from completing execution; in that case, the slack allowed for by local spinning on other processors, where other threads are waiting, is used to speed up the completion of the execution in the global resource and therefore reduce blocking.

== RTEMS API Changes ==

For MrsP we need the ability to specify the priority ceilings per scheduler domain.

{{{
typedef struct {
  rtems_id scheduler_id;
  rtems_task_priority priority;
} rtems_task_priority_by_scheduler;

/**
 * @brief Sets the priority ceilings per scheduler for a semaphore with
 * priority ceiling protocol.
 *
 * @param[in] semaphore_id Identifier of the semaphore.
 * @param[in] priority_ceilings A table with priority ceilings by scheduler.
 * In case one scheduler appears multiple times, the setting with the highest
 * index will be used.  This semaphore object is then bound to the specified
 * scheduler domains.  It is an error to use this semaphore object on other
 * scheduler domains.  The specified schedulers must be compatible, e.g.
 * migration from one scheduler domain to another must be defined.
 * @param[in] priority_ceilings_count Count of priority ceilings by scheduler
 * pairs in the table.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid semaphore identifier.
 * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier in the table.
 * @retval RTEMS_INVALID_PRIORITY Invalid task priority in the table.
 */
rtems_status_code rtems_semaphore_set_priority_ceilings(
  rtems_id semaphore_id,
  const rtems_task_priority_by_scheduler *priority_ceilings,
  size_t priority_ceilings_count
);
}}}

== Implementation ==

The critical part in MrsP is the migration of the lock holder in case of preemption by a higher-priority thread.  A post-switch action is used to detect this event.  The post-switch action will remove the thread from the current scheduler domain and add it to the scheduler domain of the first waiting thread which is executing.  A resource release will remove this thread from the temporary scheduler domain and move it back to the original scheduler domain.
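A usage sketch for the proposed function (hypothetical example; the scheduler identifiers are assumed to come from ''rtems_scheduler_ident()'' and the ceiling values are made up):

{{{
rtems_status_code set_global_semaphore_ceilings(
  rtems_id semaphore_id,
  rtems_id fp0_scheduler_id,
  rtems_id fp1_scheduler_id
)
{
  const rtems_task_priority_by_scheduler priority_ceilings[] = {
    { fp0_scheduler_id, 10 },
    { fp1_scheduler_id, 5 }
  };

  return rtems_semaphore_set_priority_ceilings(
    semaphore_id,
    priority_ceilings,
    RTEMS_ARRAY_SIZE( priority_ceilings )
  );
}
}}}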
= Fine Grained Locking =

== Reason ==

Fine grained locking is of utmost importance to get a scalable operating system that can guarantee reasonable worst-case latencies.  With the current Giant lock in RTEMS the worst-case latencies of every operating system service increase with each processor added to the system.  Since the Giant lock state is shared among all processors, a huge cache synchronization overhead contributes to the worst-case latencies.  Fine grained locking in combination with partitioned/clustered scheduling helps to avoid these problems since the operating system state is then distributed, allowing true parallelism of independent components.

== RTEMS API Changes ==

None.

== Locking Protocol Analysis ==

As a sample operating system operation the existing mutex obtain/release/timeout sequences will be analysed.  All ISR disable/enable points (highlighted with colours) must be turned into an appropriate SMP lock (e.g. a ticket spin lock).  One goal is that an uncontested mutex obtain will use no SMP locks except the one associated with the mutex object itself.  Is this possible with the current structure?

{{{
mutex_obtain(id, wait, timeout):
  level = ISR_disable()
  mtx = mutex_get(id)
  executing = get_executing_thread()
  wait_control = executing.get_wait_control()
  wait_control.set_status(SUCCESS)
  if !mtx.is_locked():
    mtx.lock(executing)
    if mtx.use_ceiling_protocol():
      thread_dispatch_disable()
      ISR_enable(level)
      executing.boost_priority(mtx.get_ceiling())
      thread_dispatch_enable()
    else:
      ISR_enable(level)
  else if mtx.is_holder(executing):
    mtx.increment_nest_level()
    ISR_enable(level)
  else if !wait:
    ISR_enable(level)
    wait_control.set_status(UNSATISFIED)
  else:
    wait_queue = mtx.get_wait_queue()
    wait_queue.set_sync_status(NOTHING_HAPPENED)
    executing.set_wait_queue(wait_queue)
    thread_dispatch_disable()
    ISR_enable(level)
    if mtx.use_inherit_priority():
      mtx.get_holder().boost_priority(executing.get_priority())
    level = ISR_disable()
    if executing.is_ready():
      executing.set_state(MUTEX_BLOCKING_STATE)
      scheduler_block(executing)
    else:
      executing.add_state(MUTEX_BLOCKING_STATE)
    ISR_enable(level)
    if timeout:
      timer_start(timeout, executing, mtx)
    level = ISR_disable()
    search_thread = wait_queue.first()
    while search_thread != wait_queue.tail():
      if executing.priority() <= search_thread.priority():
        break
      ISR_enable(level)
      level = ISR_disable()
      if search_thread.is_state_set(MUTEX_BLOCKING_STATE):
        search_thread = search_thread.next()
      else:
        search_thread = wait_queue.first()
    sync_status = wait_queue.get_sync_status()
    if sync_status == NOTHING_HAPPENED:
      wait_queue.set_sync_status(SYNCHRONIZED)
      wait_queue.enqueue(search_thread, executing)
      executing.set_wait_queue(wait_queue)
      ISR_enable(level)
    else:
      executing.set_wait_queue(NULL)
      if executing.is_timer_active():
        executing.deactivate_timer()
        ISR_enable(level)
        executing.remove_timer()
      else:
        ISR_enable(level)
      level = ISR_disable()
      if executing.is_state_set(MUTEX_BLOCKING_STATE):
        executing.clear_state(MUTEX_BLOCKING_STATE)
        if executing.is_ready():
          scheduler_unblock(executing)
      ISR_enable(level)
    thread_dispatch_enable()
  return wait_control.get_status()

mutex_release(id):
  thread_dispatch_disable()
  mtx = mutex_get(id)
  executing = get_executing_thread()
  nest_level = mtx.decrement_nest_level()
  if nest_level == 0:
    if mtx.use_ceiling_protocol() or mtx.use_inherit_priority():
      executing.restore_priority()
    wait_queue = mtx.get_wait_queue()
    thread = NULL
    level = ISR_disable()
    thread = wait_queue.dequeue()
    if thread != NULL:
      thread.set_wait_queue(NULL)
      if thread.is_timer_active():
        thread.deactivate_timer()
        ISR_enable(level)
        thread.remove_timer()
      else:
        ISR_enable(level)
      level = ISR_disable()
      if thread.is_state_set(MUTEX_BLOCKING_STATE):
        thread.clear_state(MUTEX_BLOCKING_STATE)
        if thread.is_ready():
          scheduler_unblock(thread)
        ISR_enable(level)
      else:
        ISR_enable(level)
      level = ISR_disable()
    if thread == NULL:
      sync_status = wait_queue.get_sync_status()
      if sync_status == TIMEOUT or sync_status == NOTHING_HAPPENED:
        wait_queue.set_sync_status(SATISFIED)
        thread = executing
    ISR_enable(level)
    if thread != NULL:
      mtx.new_holder(thread)
      if mtx.use_ceiling_protocol():
        thread.boost_priority(mtx.get_ceiling())
    else:
      mtx.unlock()
  thread_dispatch_enable()

mutex_timeout(thread, mtx):
  level = ISR_disable()
  wait_queue = thread.get_wait_queue()
  if wait_queue != NULL:
    sync_status = wait_queue.get_sync_status()
    if sync_status != SYNCHRONIZED and thread.is_executing():
      if sync_status != SATISFIED:
        wait_queue.set_sync_status(TIMEOUT)
        wait_control = thread.get_wait_control()
        wait_control.set_status(TIMEOUT)
      ISR_enable(level)
    else:
      ISR_enable(level)
      level = ISR_disable()
      wait_queue = thread.get_wait_queue()
      if wait_queue != NULL:
        wait_queue.extract(thread)
        if thread.is_timer_active():
          thread.deactivate_timer()
          ISR_enable(level)
          thread.remove_timer()
        else:
          ISR_enable(level)
        level = ISR_disable()
        if thread.is_state_set(MUTEX_BLOCKING_STATE):
          thread.clear_state(MUTEX_BLOCKING_STATE)
          if thread.is_ready():
            scheduler_unblock(thread)
        ISR_enable(level)
      else:
        ISR_enable(level)
  else:
    ISR_enable(level)
}}}

The thread.remove_timer() operation is quite complex and a problem area of its own.  See the discussion about the [wiki:Watchdog_Handler Watchdog Handler].

The scheduler operation points are in good shape.  Here we can easily use one SMP lock for the thread state and one SMP lock for the scheduler state.

The big problem is that the mutex object state changes and the thread enqueue operation are split up into several parts.  This was done to ensure a next to optimal interrupt latency with only constant-time sections of disabled interrupts.  The trade-off is that we have a very complex blocking sequence.  After the first mutex state change in mutex_obtain() the mutex doesn't know directly which thread is about to block on that mutex.  Some sequences assume exactly one executing thread in the system, which is not true on an SMP system with more than one processor.

With an SMP lock per mutex object all state information for this mutex must be present in the mutex object.  So the change from a locked mutex to a locked mutex with a waiting thread must be atomic with respect to the mutex SMP lock.  Later mutex timeout or release operations can then get the waiting thread and deal with it accordingly.

== Implementation ==

The blocking operation synchronization state must move from the synchronization object (e.g. mutex, message queue) to the thread wait information (Thread_Wait_information), since it is clear that the thread has to block and not the object.  There may also be multiple threads on different processors which queue up on a mutex at the same time.

Blocking threads must be registered in the object under the SMP lock of the object and must be independent of the scheduler data structures.  Thus we can no longer use the normal chain node of the thread and instead have to add a chain node to the thread wait information.

The thread queue operations must no longer use internal locks (e.g. ISR disable/enable).  This simplifies them considerably.  The thread queue operations must be performed under the SMP lock of the object.  The drawback is that the time of disabled interrupts increases.  The FIFO thread queue operations are trivial doubly-linked list operations.  The priority thread queue operations execute in a worst-case time which depends only on the maximum number of priorities.
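A sketch of this data structure direction (field names hypothetical, not the current RTEMS declaration):

{{{
typedef struct {
  /**
   * @brief Node for the thread queue of the synchronization object.
   *
   * This node is independent of the chain node used by the scheduler data
   * structures, so a thread can be registered in the object under the SMP
   * lock of the object.
   */
  Chain_Node Queue_node;

  /**
   * @brief Blocking operation synchronization state.
   *
   * Moved here from the synchronization object since the state describes the
   * thread and not the object.
   */
  Thread_blocking_operation_States sync_state;

  /* ... existing wait information fields ... */
} Thread_Wait_information;
}}}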
= Post-Switch Actions =

== Reason ==

Currently threads are assigned to processors for execution by the scheduler responsible for this thread.  It is unknown to the system when a thread actually starts or terminates execution on its assigned processor.  The termination event is important for the following features:

 * explicit thread migration, e.g. if a thread should move from one scheduler domain to another,
 * thread deletion, since the thread stack is in use until the thread stopped execution, or
 * restart of threads executing on a remote processor.

== RTEMS API Changes ==

None.

== Implementation ==

One approach to do post-switch actions could be to spin on the per-processor variable reflecting the executing thread.  This has at least two problems:

 1. it doesn't work if the executing thread wants to alter its own state, and
 2. this spinning must be done with the scheduler lock held and interrupts disabled, which is a disaster for the interrupt latency.

The proposed solution is to use an optional action handler which is active in case the thread execution termination matters.

In _Thread_Dispatch() we already have the post-switch extensions invoked after a thread switch.  Unfortunately they execute after thread dispatching is enabled again, and at this point the current processor may have already changed due to a thread migration requested by an interrupt.  We need a context which executes right after the thread context switch on the current processor, but without the per-processor lock acquired (to prevent lock order reversal problems and keep the interrupt latency small).  For this we introduce a post-switch action chain (PSAC).  Each thread will have its own PSAC control.  The PSAC operations like addition to, removal from and iteration over the chain are protected by the corresponding thread lock.  Each action will have a local context.  The heir thread will execute the action handlers on behalf of the thread of interest.  Since thread dispatching is disabled, action handlers cannot block.  The execution time of post-switch actions increases the worst-case thread dispatch latency since the heir thread must do work for another thread.

On demand post-switch actions help to implement the Multiprocessor Resource Sharing Protocol (MrsP) proposed by Burns and Wellings.  Threads executing a global critical section can add a post-switch action which will trigger the thread migration in case of pre-emption by a local high-priority thread.

{{{
thread_dispatch:
  again = true
  while again:
    level = ISR.disable()
    current_cpu = get_current_cpu()
    current_cpu.disable_thread_dispatch()
    ISR.enable(level)
    executing = current_cpu.get_executing()
    current_cpu.acquire()
    if current_cpu.is_thread_dispatch_necessary():
      heir = current_cpu.get_heir()
      current_cpu.set_thread_dispatch_necessary(false)
      current_cpu.set_executing(heir)
      executing.set_executing(false)
      heir.set_executing(true)
      if executing != heir:
        last = switch(executing, heir)
        current_cpu = get_current_cpu()
        actions = last.get_actions()
        if actions.is_empty():
          again = false
        else:
          current_cpu.release()
          last.acquire()
          if last.get_cpu() == current_cpu:
            while !actions.is_empty():
              action = actions.pop()
              action.do(current_cpu, last)
          last.release()
          current_cpu.enable_thread_dispatch()
  current_cpu.enable_thread_dispatch()
  current_cpu.release()
}}}

It is important to check that the thread is still assigned to the current processor, since after the release of the per-processor lock we have a new executing thread and the thread of interest may have migrated to another processor already.

Since the heir thread now has a reference to the thread of interest we have to make sure that deletion requests are deferred until the post-switch actions have been executed.

An efficient way to get the last executing thread (the thread of interest) throughout the context switch is to return the context pointer of the last executing thread.  With a simple offset operation we get the thread control block.
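A sketch of the PSAC data structures (hypothetical names, not an existing RTEMS API):

{{{
typedef struct Thread_Post_switch_action Thread_Post_switch_action;

/**
 * @brief Handler executed by the heir thread on behalf of the thread of
 * interest.
 *
 * Thread dispatching is disabled during execution, so the handler must not
 * block.
 */
typedef void ( *Thread_Post_switch_handler )(
  Thread_Post_switch_action *action,
  Per_CPU_Control *cpu,
  Thread_Control *thread_of_interest
);

struct Thread_Post_switch_action {
  /**
   * @brief Node for the post-switch action chain (PSAC) of the thread.
   *
   * Addition, removal and iteration are protected by the thread lock.
   */
  Chain_Node Node;

  /**
   * @brief The action handler.
   */
  Thread_Post_switch_handler handler;
};
}}}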
= Thread Delete/Restart =

== Reason ==

Deletion of threads may be required by some parallel libraries.

== RTEMS API Changes ==

None.

== Implementation ==

The current implementation to manage a thread life-cycle in RTEMS has some weaknesses that turn into severe problems on SMP.  It also leads to POSIX and C++ standard conformance defects in some cases.  Currently the thread life-cycle changes are protected by the thread dispatch disable level and in some parts by the allocator mutex.  Since the thread dispatch disable level is actually a giant mutex on SMP, this leads in combination with the allocator mutex to lock order reversal problems.  The usage of unified work areas is also broken at the moment [https://www.rtems.org/bugzilla/show_bug.cgi?id=2152].  There is also an outstanding thread cancellation bug [https://www.rtems.org/bugzilla/show_bug.cgi?id=2035].  One problematic path is the destruction of threads.  Here we currently have the following sequence:
  1. Obtain the allocator mutex.
  2. Disable thread dispatching.
  3. Invalidate the object identifier.
  4. Enable thread dispatching.
  5. Call the thread delete extensions in the context of the deleting thread (not necessarily the deleted thread).  The POSIX cleanup handlers are called here from the POSIX delete extension.  POSIX mandates that the cleanup handlers are executed in the context of the corresponding thread.  So here we have a POSIX violation [http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_03].
  6. Remove the thread from the scheduling and watchdog resources.
  7. Delete scheduling, floating-point, stack and extensions resources. Now the deleted thread may execute on a freed thread stack!
  8. Free the object. Now the object (thread control block) is available for re-use, but it is still used by the thread! Only the disabled thread dispatching prevents chaos.
  9. Release the allocator mutex.  Now we have a lock order reversal (see steps 1 and 2).
  10. Enable thread dispatching.  Here a deleted executing thread disappears.  On SMP we also have a race-condition here.  This step looks in detail:
{{{
if ( _Thread_Dispatch_decrement_disable_level() == 0 ) {
  /*
   * Here another processor may re-use resources of a deleted executing
   * thread, e.g. the stack.
   */
  _Thread_Dispatch();
}
}}}
To overcome the issues we need considerable implementation changes in the Score.  The thread life-cycle state must be explicit and independent of the thread dispatch disable level and allocator mutex protection.  The thread life-cycle is determined by the following actions:

; CREATE : A thread is created.
; START : Starts a thread.  The thread must be dormant to get started.
; RESTART : Restarts a thread.  The thread must not be dormant to get restarted.
; SUSPEND : Suspends a thread.
; RESUME : Resumes a thread.
; DELETE : Deletes a thread.
; SET_PROTECTION : Sets the new protection state and returns the previous one.  This action is new.

The following thread life-cycle states are proposed.  These states are orthogonal to the blocking states, e.g. DORMANT, SUSPENDED etc.:

; PROTECTED : The thread is protected from immediate restart, delete and suspend actions.  Can be controlled by pthread_setcancelstate().