Changes between Version 30 and Version 31 of Developer/SMP

Timestamp:: 01/13/14 16:09:10 (10 years ago)
Author:: Sh
Comment:: /* Implementation */

Developer/SMP

-                      v30
+                      v31
 Use an ISR lock per object to improve the performance for uncontested
 operations.  See also [wiki:#Giant_Lock_vs._Fine_Grained_Locking Giant Lock vs. Fine Grained Locking].
+=  Implementation  =
+=  Testing  =
+=  Overview  =
+[http://en.wikipedia.org/wiki/Test-driven_development Test-driven development]
+will be used to implement the accepted changes and features.  New test cases
+will follow the patterns present in existing test cases of the RTEMS test
+suite.  The RTEMS project has a script framework to run the RTEMS test suite.
+This works best if the tests can run on a fast simulator.
+=  Worst-Case Timings  =
+The RTEMS test suite has some timing tests (testsuites/tmtests and testsuites/psxtmtests).  These tests yield only some sample timing values with no statistical significance.  For a real-time operating system, reliable worst-case timing values of critical functions are of interest.  The presence of instruction and data caches must be taken into account.  Measurements for the worst-case timing should run in the following context:
+ *  instruction caches are invalidated,
+ *  data caches are completely dirty with unrelated data, since in this case data for the section of interest must first evict dirty cache lines and then load the values from main memory,
+ *  on SMP configurations all other processors should try to saturate the bus.
+Timing values should be acquired multiple times to get a statistically significant value.
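The repeated acquisition of timing values can be summarised as in the following minimal sketch.  The type `timing_summary` and the function `timing_summarize()` are hypothetical names used for illustration only and are not part of the RTEMS test suite.  The sketch only aggregates samples; establishing the cache and bus conditions listed above is up to the test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of repeated timing samples (not an RTEMS type). */
typedef struct {
  uint64_t min;   /* smallest observed sample */
  uint64_t max;   /* largest observed sample, the observed worst case */
  uint64_t total; /* sum of all samples, used for the average */
  size_t   count; /* number of samples */
} timing_summary;

/* Aggregates the samples of one measured critical section. */
timing_summary timing_summarize(const uint64_t *samples, size_t count)
{
  timing_summary s = { UINT64_MAX, 0, 0, 0 };

  assert(samples != NULL || count == 0);

  for (size_t i = 0; i < count; ++i) {
    if (samples[i] < s.min) {
      s.min = samples[i];
    }
    if (samples[i] > s.max) {
      s.max = samples[i];
    }
    s.total += samples[i];
    ++s.count;
  }

  return s;
}
```

The observed maximum is only a lower bound of the true worst case, which is why the measurement context matters.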
+Currently most operating system functions are monolithic global functions which can only be tested as a single unit.  This leads to very complex test setups since interrupts must be used to trigger different paths through these functions.  In order to implement fine grained locking, large parts of core operating system functions must be restructured.  As part of this process the critical sections should be implemented as inline functions which can be tested independently.  All critical sections for a specific lock should be tested in isolation so that worst-case timing values can be determined.  Since all the core operating system functions are aggregations of critical sections, it is possible to give worst-case timing estimates provided the worst-case timings are known for each critical section on its own.
+It is not the goal of the project to finely characterise all RTEMS services in combination with all possible load figures on the processors.  However, instrumentation of critical items such as SMP locks and interrupt disabling will give a good idea of the system behaviour when the system is stressed by all sorts of RTEMS tests, demonstrators and parallel library tests.
+=  References  =
+<references/>
+= Implementations =
+=  Tool Chain  =
+==  Binutils  ==
+A Binutils 2.24 or later release must be used due to the LEON3 support.
+==  GCC  ==
+GCC 4.8 as of 2013-11-25 (e.g. 4.8.3) or later must be used due to the LEON3 support.  The LEON3 support for GCC includes a proper C11 memory model definition for this processor and C11 atomic operations using the CAS instruction.  The backport of the LEON3 support was initiated by EB [http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02255.html].
+==  GDB  ==
+Current GDB versions have problems with the debug format generated by GCC 4.8 and later [https://sourceware.org/bugzilla/show_bug.cgi?id=16215].
+=  Profiling  =
+==  Reason  ==
+SMP lock and interrupt processing profiling is necessary to fulfill some
+observability requirements.  Vital timing data can be gathered on a per object
+basis through profiling.
+==  RTEMS API Changes  ==
+None.
+==  High-Performance CPU Counters  ==
+In order to measure short time intervals we have to add high-performance CPU
+counter support to the CPU port API.  This can also be used as a
+replacement for the BSP-specific benchmark timers.  It may also be used to
+implement busy wait loops which are required by some device drivers.
+ /**
+  * @brief Integer type for CPU counter values.
+  */
+ typedef XXX CPU_counter;
+ /**
+  * @brief Returns the current CPU counter value.
+  */
+ CPU_counter _CPU_counter_Get( void );
+ /**
+  * @brief Mask for arithmetic operations with the CPU counter value.
+  *
+  * All arithmetic operations are defined as A = ( C op B ) & MASK.
+  */
+ CPU_counter _CPU_counter_Mask( void );
+ /**
+  * @brief Converts a CPU counter value into nanoseconds.
+  */
+ uint64_t _CPU_counter_To_nanoseconds( CPU_counter counter );
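The masked arithmetic rule A = ( C op B ) & MASK can be illustrated with a small sketch.  It assumes a 24-bit counter running at 50 MHz and uses `uint32_t` in place of the unspecified counter type.  The `EXAMPLE_*` constants and `example_*` functions are hypothetical and only stand in for the proposed CPU port functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: uint32_t stands in for the unspecified counter type. */
typedef uint32_t CPU_counter;

/* Assumption: a 24-bit wide counter clocked at 50 MHz. */
#define EXAMPLE_COUNTER_MASK 0x00ffffffU
#define EXAMPLE_COUNTER_HZ   50000000ULL

/*
 * Elapsed ticks between two counter values.  This is A = ( C op B ) & MASK
 * with op being the subtraction, so a single counter wrap-around between
 * begin and end is handled correctly.
 */
CPU_counter example_counter_elapsed(CPU_counter begin, CPU_counter end)
{
  return (end - begin) & EXAMPLE_COUNTER_MASK;
}

/* Converts counter ticks into nanoseconds for the assumed frequency. */
uint64_t example_counter_to_nanoseconds(CPU_counter ticks)
{
  return ((uint64_t) ticks * 1000000000ULL) / EXAMPLE_COUNTER_HZ;
}
```

Intervals longer than one full counter period cannot be recovered this way, so the counter width limits the measurable interval.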
+==  SMP Lock Profiling  ==
+The SMP lock profiling will be an RTEMS build configuration time option
+(RTEMS_LOCK_PROFILING).  The following statistics are proposed.
+ #define SMP_LOCK_STATS_CONTENTION_COUNTS 4
+ /**
+  * @brief SMP lock statistics.
+  *
+  * The lock acquire attempt instant is the point in time right after the
+  * interrupt disable action in the lock acquire sequence.
+  *
+  * The lock acquire instant is the point in time right after the lock
+  * acquisition.  This is the beginning of the critical section code execution.
+  *
+  * The lock release instant is the point in time right before the interrupt
+  * enable action in the lock release sequence.
+  *
+  * The lock section time is the time elapsed between the lock acquire instant
+  * and the lock release instant.
+  *
+  * The lock acquire time is the time elapsed between the lock acquire attempt
+  * instant and the lock acquire instant.
+  */
+ struct SMP_lock_Stats {
+ #ifdef RTEMS_LOCK_PROFILING
+   /**
+    * @brief The last lock acquire instant in CPU counter ticks.
+    *
+    * This value is used to measure the lock section time.
+    */
+   CPU_counter acquire_instant;
+   /**
+    * @brief The maximum lock section time in CPU counter ticks.
+    */
+   CPU_counter max_section_time;
+   /**
+    * @brief The maximum lock acquire time in CPU counter ticks.
+    */
+   CPU_counter max_acquire_time;
+   /**
+    * @brief The count of lock uses.
+    *
+    * This value may overflow.
+    */
+   uint64_t usage_count;
+   /**
+    * @brief The counts of lock acquire operations with contention.
+    *
+    * The contention count for index N corresponds to a lock acquire attempt
+    * with an initial queue length of N + 1.  The last index corresponds to all
+    * lock acquire attempts with an initial queue length greater than or equal
+    * to SMP_LOCK_STATS_CONTENTION_COUNTS.
+    *
+    * The values may overflow.
+    */
+   uint64_t contention_counts[SMP_LOCK_STATS_CONTENTION_COUNTS];
+   /**
+    * @brief Total lock section time in CPU counter ticks.
+    *
+    * The average lock section time is the total section time divided by the
+    * lock usage count.
+    *
+    * This value may overflow.
+    */
+   uint64_t total_section_time;
+ #endif /* RTEMS_LOCK_PROFILING */
+ };
+ struct SMP_lock_Control {
+   ... lock data ...
+   SMP_lock_Stats Stats;
+ };
+A function should be added to monitor the lock contention.
+ /**
+  * @brief Called in case of lock contention.
+  *
+  * @param[in] lock The SMP lock control.
+  * @param[in] counter The spin loop iteration counter.
+  */
+ void _SMP_lock_Contention_monitor(
+   const SMP_lock_Control *lock,
+   int counter
+ );
+A ticket lock can then look like this:
+ void acquire(struct ticket *t)
+ {
+        unsigned int my_ticket = atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);
+ #ifdef RTEMS_LOCK_PROFILING
+        int counter = 0;
+ #endif /* RTEMS_LOCK_PROFILING */
+        while (atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket) {
+ #ifdef RTEMS_LOCK_PROFILING
+                ++counter;
+                _SMP_lock_Contention_monitor(t, counter);
+ #endif /* RTEMS_LOCK_PROFILING */
+        }
+ }
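For completeness, a self-contained variant of the ticket lock with the matching release operation is sketched here using C11 atomics.  The names `example_ticket_lock`, `example_ticket_init()`, `example_ticket_acquire()` and `example_ticket_release()` are hypothetical.  Instead of calling a contention monitor, the acquire function simply returns the number of spin loop iterations.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock, not the RTEMS SMP lock implementation. */
typedef struct {
  atomic_uint ticket;      /* next ticket to hand out */
  atomic_uint now_serving; /* ticket which currently owns the lock */
} example_ticket_lock;

void example_ticket_init(example_ticket_lock *t)
{
  atomic_init(&t->ticket, 0);
  atomic_init(&t->now_serving, 0);
}

/* Acquires the lock and returns the number of spin loop iterations. */
unsigned int example_ticket_acquire(example_ticket_lock *t)
{
  unsigned int my_ticket =
    atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);
  unsigned int spins = 0;

  while (
    atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket
  ) {
    ++spins;
  }

  return spins;
}

/* Releases the lock.  Only the lock owner writes now_serving. */
void example_ticket_release(example_ticket_lock *t)
{
  unsigned int next =
    atomic_load_explicit(&t->now_serving, memory_order_relaxed) + 1;

  atomic_store_explicit(&t->now_serving, next, memory_order_release);
}
```

The release store pairs with the acquire load in the spin loop, so writes made inside the critical section are visible to the next owner.  Tickets are served in the order of the fetch-and-add, which gives FIFO fairness.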
+SMP lock statistics can be evaluated using the following method.
+ typedef void ( *SMP_lock_Visitor )(
+   void *arg,
+   SMP_lock_Control *lock,
+   SMP_lock_Class lock_class,
+   Objects_Name lock_name
+ );
+ /**
+  * @brief Iterates through all system SMP locks and invokes the visitor for
+  * each lock.
+  */
+ void _SMP_lock_Iterate( SMP_lock_Visitor visitor, void *arg );
+==  Interrupt and Thread Profiling  ==
+The interrupt and thread profiling will be an RTEMS build configuration time
+option (RTEMS_INTERRUPT_AND_THREAD_PROFILING).
+The time spent on interrupts and the time of disabled thread dispatching should
+be monitored per-processor.  The time between the interrupt recognition by the
+processor and the actual start of the interrupt handler code execution should
+be monitored per-processor if the hardware supports this.
+ /**
+  * @brief Per-CPU statistics.
+  */
+ struct Per_CPU_Stats {
+ #ifdef RTEMS_INTERRUPT_AND_THREAD_PROFILING
+   /**
+    * @brief The thread dispatch disabled begin instant in CPU counter ticks.
+    *
+    * This value is used to measure the time of disabled thread dispatching.
+    */
+   CPU_counter thread_dispatch_disabled_instant;
+   /**
+    * @brief The last outer-most interrupt begin instant in CPU counter ticks.
+    *
+    * This value is used to measure the interrupt processing time.
+    */
+   CPU_counter outer_most_interrupt_instant;
+   /**
+    * @brief The maximum interrupt delay in CPU counter ticks if supported by
+    * the hardware.
+    */
+   CPU_counter max_interrupt_delay;
+   /**
+    * @brief The maximum time of disabled thread dispatching in CPU counter
+    * ticks.
+    */
+   CPU_counter max_thread_dispatch_disabled_time;
+   /**
+    * @brief Count of times when the thread dispatch disable level changes from
+    * zero to one in thread context.
+    *
+    * This value may overflow.
+    */
+   uint64_t thread_dispatch_disabled_count;
+   /**
+    * @brief Total time of disabled thread dispatching in CPU counter ticks.
+    *
+    * The average time of disabled thread dispatching is the total time of
+    * disabled thread dispatching divided by the thread dispatch disabled
+    * count.
+    *
+    * This value may overflow.
+    */
+   uint64_t total_thread_dispatch_disabled_time;
+   /**
+    * @brief Count of times when the interrupt nest level changes from zero to
+    * one.
+    *
+    * This value may overflow.
+    */
+   uint64_t interrupt_count;
+   /**
+    * @brief Total time of interrupt processing in CPU counter ticks.
+    *
+    * The average time of interrupt processing is the total time of interrupt
+    * processing divided by the interrupt count.
+    *
+    * This value may overflow.
+    */
+   uint64_t total_interrupt_time;
+ #endif /* RTEMS_INTERRUPT_AND_THREAD_PROFILING */
+ };
+ struct Per_CPU_Control {
+   ... per-CPU data ...
+   Per_CPU_Stats Stats;
+ };
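The interrupt related counters could be maintained as in the following sketch.  The structure `example_per_cpu` and the `example_interrupt_*` functions are hypothetical.  In particular the nest level field is an assumption about how the outer-most interrupt is detected, and a full-width 32-bit counter is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a free-running, full-width 32-bit counter. */
typedef uint32_t CPU_counter;

/* Hypothetical subset of the per-CPU data needed for interrupt profiling. */
typedef struct {
  unsigned int isr_nest_level;
  CPU_counter  outer_most_interrupt_instant;
  uint64_t     interrupt_count;
  uint64_t     total_interrupt_time;
} example_per_cpu;

/* To be called on interrupt entry with the current counter value. */
void example_interrupt_entry(example_per_cpu *cpu, CPU_counter now)
{
  if (cpu->isr_nest_level == 0) {
    /* The nest level changes from zero to one: outer-most interrupt. */
    cpu->outer_most_interrupt_instant = now;
    ++cpu->interrupt_count;
  }
  ++cpu->isr_nest_level;
}

/* To be called on interrupt exit with the current counter value. */
void example_interrupt_exit(example_per_cpu *cpu, CPU_counter now)
{
  assert(cpu->isr_nest_level > 0);

  --cpu->isr_nest_level;
  if (cpu->isr_nest_level == 0) {
    cpu->total_interrupt_time +=
      (CPU_counter) (now - cpu->outer_most_interrupt_instant);
  }
}

/* Average interrupt processing time in counter ticks. */
uint64_t example_average_interrupt_time(const example_per_cpu *cpu)
{
  if (cpu->interrupt_count == 0) {
    return 0;
  }
  return cpu->total_interrupt_time / cpu->interrupt_count;
}
```

Nested interrupts contribute to the time of the outer-most interrupt and are not counted separately, which matches the description of the interrupt count above.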
+=  Interrupt Support  =
+==  Reason  ==
+Applications should be able to distribute the interrupt load throughout the
+system.  In combination with partitioned/clustered scheduling this can reduce
+the amount of inter-processor synchronization and thread migrations.
+==  RTEMS API Changes  ==
+Each interrupt needs a processor affinity set in the RTEMS SMP configuration.  The
+rtems_interrupt_handler_install() function will not alter the processor
+affinity set of the interrupt vector.  At system start-up all interrupts except
+the inter-processor interrupts must be initialized to have an affinity with the
+initialization processor only.
+Two new functions should be added to alter and retrieve the processor affinity
+sets of interrupt vectors.
+ /**
+  * @brief Sets the processor affinity set of an interrupt vector.
+  *
+  * @param[in] vector The interrupt vector number.
+  * @param[in] affinity_set_size Size of the specified affinity set buffer in
+  * bytes.  This value must be positive.
+  * @param[in] affinity_set The new processor affinity set for the interrupt
+  * vector.  This pointer must not be @c NULL.  A set bit in the affinity set
+  * means that the interrupt can occur on this processor and a cleared bit
+  * means the opposite.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID The vector number is invalid.
+  * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
+  */
+ rtems_status_code rtems_interrupt_set_affinity(
+   rtems_vector vector,
+   size_t affinity_set_size,
+   const cpu_set_t *affinity_set
+ );
+ /**
+  * @brief Gets the processor affinity set of an interrupt vector.
+  *
+  * @param[in] vector The interrupt vector number.
+  * @param[in] affinity_set_size Size of the specified affinity set buffer in
+  * bytes.  This value must be positive.
+  * @param[out] affinity_set The current processor affinity set of the
+  * interrupt vector.  This pointer must not be @c NULL.  A set bit in the
+  * affinity set means that the interrupt can occur on this processor and a
+  * cleared bit means the opposite.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID The vector number is invalid.
+  * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
+  * current processor affinity set of the interrupt vector.
+  */
+ rtems_status_code rtems_interrupt_get_affinity(
+   rtems_vector vector,
+   size_t affinity_set_size,
+   cpu_set_t *affinity_set
+ );
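The parameter checks described for these two functions might look like the sketch below.  It is not the RTEMS implementation.  A plain byte buffer stands in for `cpu_set_t`, the system is assumed to have 12 processors, and the status codes and `example_*` names are hypothetical.  The proposal does not define what makes an affinity set invalid; the sketch assumes that an empty set or a set naming a non-existent processor is invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 12 processors, so an affinity set needs two bytes. */
#define EXAMPLE_CPU_COUNT 12
#define EXAMPLE_SET_BYTES ((EXAMPLE_CPU_COUNT + 7) / 8)

/* Hypothetical stand-ins for the RTEMS status codes. */
typedef enum {
  EXAMPLE_SUCCESSFUL,
  EXAMPLE_INVALID_CPU_SET
} example_status;

/* Affinity of one vector, initially the initialization processor only. */
static unsigned char example_vector_affinity[EXAMPLE_SET_BYTES] = { 0x01 };

example_status example_set_affinity(size_t size, const unsigned char *set)
{
  unsigned char accepted[EXAMPLE_SET_BYTES] = { 0 };
  int any = 0;

  assert(size > 0 && set != NULL);

  for (size_t cpu = 0; cpu < size * 8; ++cpu) {
    if (((set[cpu / 8] >> (cpu % 8)) & 1) != 0) {
      if (cpu >= EXAMPLE_CPU_COUNT) {
        return EXAMPLE_INVALID_CPU_SET; /* non-existent processor */
      }
      accepted[cpu / 8] |= (unsigned char) (1U << (cpu % 8));
      any = 1;
    }
  }

  if (!any) {
    return EXAMPLE_INVALID_CPU_SET; /* the interrupt could occur nowhere */
  }

  memcpy(example_vector_affinity, accepted, sizeof(accepted));
  return EXAMPLE_SUCCESSFUL;
}

example_status example_get_affinity(size_t size, unsigned char *set)
{
  assert(size > 0 && set != NULL);

  if (size < EXAMPLE_SET_BYTES) {
    return EXAMPLE_INVALID_CPU_SET; /* buffer too small */
  }

  memset(set, 0, size);
  memcpy(set, example_vector_affinity, EXAMPLE_SET_BYTES);
  return EXAMPLE_SUCCESSFUL;
}
```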
+=  Clustered Scheduling  =
+==  Reason  ==
+Partitioned/clustered scheduling helps to control the worst-case latencies in
+the system.  The goal is to reduce the amount of shared state in the system and
+thus prevent lock contention.  Modern multi-processor systems tend to
+have several layers of data and instruction caches.  With partitioned/clustered
+scheduling it is possible to honor the cache topology of a system and thus
+avoid expensive cache synchronization traffic.
+==  RTEMS API Changes  ==
+Functions for scheduler management.
+ /**
+  * @brief Identifies a scheduler by its name.
+  *
+  * The scheduler name is determined by the scheduler configuration.
+  *
+  * @param[in] name The scheduler name.
+  * @param[out] scheduler_id The scheduler identifier associated with the name.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_NAME Invalid scheduler name.
+  */
+ rtems_status_code rtems_scheduler_ident(
+   rtems_name name,
+   rtems_id *scheduler_id
+ );
+ /**
+  * @brief Gets the set of processors owned by the scheduler.
+  *
+  * @param[in] scheduler_id Identifier of the scheduler.
+  * @param[in] processor_set_size Size of the specified processor set buffer in
+  * bytes.  This value must be positive.
+  * @param[out] processor_set The processor set owned by the scheduler.  This
+  * pointer must not be @c NULL.  A set bit in the processor set means that
+  * this processor is owned by the scheduler and a cleared bit means the
+  * opposite.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID Invalid scheduler identifier.
+  * @retval RTEMS_INVALID_CPU_SET The processor set buffer is too small for the
+  * set of processors owned by the scheduler.
+  */
+ rtems_status_code rtems_scheduler_get_processors(
+   rtems_id scheduler_id,
+   size_t processor_set_size,
+   cpu_set_t *processor_set
+ );
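The name based identification can be pictured as a lookup in a table of configured scheduler instances.  The sketch below is an illustration under assumptions: `EXAMPLE_BUILD_NAME()` packs four characters into a 32-bit value in the manner of rtems_build_name(), the table contents and identifiers are invented, and the return values 0 and -1 stand in for the status codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs four characters into a 32-bit name, first character in the MSB. */
#define EXAMPLE_BUILD_NAME(c1, c2, c3, c4) \
  (((uint32_t) (c1) << 24) | ((uint32_t) (c2) << 16) | \
   ((uint32_t) (c3) << 8) | (uint32_t) (c4))

/* Hypothetical scheduler table entry. */
typedef struct {
  uint32_t name;
  uint32_t id;
} example_scheduler;

/* Hypothetical configured scheduler instances with invented identifiers. */
static const example_scheduler example_schedulers[] = {
  { EXAMPLE_BUILD_NAME(' ', 'F', 'P', '0'), 1 },
  { EXAMPLE_BUILD_NAME(' ', 'F', 'P', '1'), 2 },
  { EXAMPLE_BUILD_NAME('E', 'D', 'F', '0'), 3 }
};

/*
 * Returns 0 and stores the identifier if the name is known, otherwise
 * returns -1 and leaves the identifier untouched.
 */
int example_scheduler_ident(uint32_t name, uint32_t *scheduler_id)
{
  size_t count = sizeof(example_schedulers) / sizeof(example_schedulers[0]);

  assert(scheduler_id != NULL);

  for (size_t i = 0; i < count; ++i) {
    if (example_schedulers[i].name == name) {
      *scheduler_id = example_schedulers[i].id;
      return 0;
    }
  }

  return -1;
}
```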
+Each thread needs a processor affinity set in the RTEMS SMP configuration.  The
+rtems_task_create() function will use the processor affinity set of the
+executing thread to initialize the processor affinity set of the created
+task.  This enables backward compatibility for existing software.
+Two new functions should be added to alter and retrieve the processor affinity
+sets of tasks.
+ /**
+  * @brief Sets the processor affinity set of a task.
+  *
+  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
+  * the executing task.
+  * @param[in] affinity_set_size Size of the specified affinity set buffer in
+  * bytes.  This value must be positive.
+  * @param[in] affinity_set The new processor affinity set for the task.  This
+  * pointer must not be @c NULL.  A set bit in the affinity set means that the
+  * task can execute on this processor and a cleared bit means the opposite.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID Invalid task identifier.
+  * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
+  */
+ rtems_status_code rtems_task_set_affinity(
+   rtems_id task_id,
+   size_t affinity_set_size,
+   const cpu_set_t *affinity_set
+ );
+ /**
+  * @brief Gets the processor affinity set of a task.
+  *
+  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
+  * the executing task.
+  * @param[in] affinity_set_size Size of the specified affinity set buffer in
+  * bytes.  This value must be positive.
+  * @param[out] affinity_set The current processor affinity set of the task.
+  * This pointer must not be @c NULL.  A set bit in the affinity set means that
+  * the task can execute on this processor and a cleared bit means the
+  * opposite.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID Invalid task identifier.
+  * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
+  * current processor affinity set of the task.
+  */
+ rtems_status_code rtems_task_get_affinity(
+   rtems_id task_id,
+   size_t affinity_set_size,
+   cpu_set_t *affinity_set
+ );
+Two new functions should be added to alter and retrieve the scheduler of tasks.
+ /**
+  * @brief Sets the scheduler of a task.
+  *
+  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
+  * the executing task.
+  * @param[in] scheduler_id Identifier of the scheduler.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID Invalid task identifier.
+  * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier.
+  *
+  * @see rtems_scheduler_ident().
+  */
+ rtems_status_code rtems_task_set_scheduler(
+   rtems_id task_id,
+   rtems_id scheduler_id
+ );
+ /**
+  * @brief Gets the scheduler of a task.
+  *
+  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
+  * the executing task.
+  * @param[out] scheduler_id Identifier of the scheduler.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID Invalid task identifier.
+  */
+ rtems_status_code rtems_task_get_scheduler(
+   rtems_id task_id,
+   rtems_id *scheduler_id
+ );
+==  Scheduler Configuration  ==
+There are two options for the scheduler instance configuration:
+# static configuration by means of global data structures, and
+# configuration at run-time via function calls.
+For a configuration at run-time the system must start with a default scheduler.
+The global constructors are called in this environment.  The order of global
+constructor invocation is unpredictable so it is difficult to create threads in
+this context since the run-time scheduler configuration may not exist yet.
+Since scheduler data structures are allocated from the workspace the
+configuration must take a later run-time setup of schedulers into account for
+the workspace size estimate.  In case the default scheduler is not appropriate
+it must be replaced, which gives rise to some implementation difficulties.
+Since the processor availability is determined by hardware constraints it is
+unclear which benefits a run-time configuration has.  For now run-time
+configuration of scheduler instances will not be implemented.
+The focus is now on static configuration.  Every scheduler needs a control
+context.  The scheduler API must provide a macro which creates a global
+scheduler instance specific data structure with a designator name as a
+mandatory parameter.  The scheduler instance creation macro may require
+additional scheduler specific configuration options.  For example a
+fixed-priority scheduler instance must know the maximum priority level to
+allocate the ready chain control table.
+Once the scheduler instances are configured it must be specified for each
+processor in the system which scheduler instance owns this processor or if this
+processor is not used by the RTEMS system.
+For each processor except the initialization processor a scheduler instance is
+optional so that other operating systems can run independent of this RTEMS
+system on this processor.  It is a fatal error to omit a scheduler instance for
+the initialization processor.  The initialization processor is the processor
+which executes the boot_card() function.
+ /**
+  * @brief Processor configuration.
+  *
+  * Use RTEMS_CPU_CONFIG_INIT() to initialize this structure.
+  */
+ typedef struct {
+   /**
+    * @brief Scheduler instance for this processor.
+    *
+    * It is possible to omit a scheduler instance for this processor by using
+    * the @c NULL pointer.  In this case RTEMS will not use this processor and
+    * other operating systems may claim it.
+    */
+   Scheduler_Control *scheduler;
+ } rtems_cpu_config;
+ /**
+  * @brief Processor configuration initializer.
+  *
+  * @param scheduler The reference to a scheduler instance or @c NULL.
+  *
+  * @see rtems_cpu_config.
+  */
+ #define RTEMS_CPU_CONFIG_INIT(scheduler) \
+   { ( scheduler ) }
+Scheduler and processor configuration example:
+ RTEMS_SCHED_DEFINE_FP_SMP(fp0, rtems_build_name(' ', 'F', 'P', '0'), 256);
+ RTEMS_SCHED_DEFINE_FP_SMP(fp1, rtems_build_name(' ', 'F', 'P', '1'), 64);
+ RTEMS_SCHED_DEFINE_EDF_SMP(edf0, rtems_build_name('E', 'D', 'F', '0'));
+ const rtems_cpu_config rtems_cpu_config_table[] = {
+   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp0)),
+   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
+   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
+   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
+   RTEMS_CPU_CONFIG_INIT(NULL),
+   RTEMS_CPU_CONFIG_INIT(NULL),
+   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0)),
+   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0))
+ };
+ const size_t rtems_cpu_config_count =
+   RTEMS_ARRAY_SIZE(rtems_cpu_config_table);
+An alternative to the processor configuration table would be to specify in the
+scheduler instance which processors are owned by the instance.  This would
+require a static initialization of CPU sets which is difficult.  Also the
+schedulers have to be registered somewhere, so some sort of table is needed
+anyway.  Since a processor can be owned by at most one scheduler instance this
+configuration approach enables an additional error source which is avoided by
+the processor configuration table.
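With a processor configuration table, the set of processors owned by a scheduler instance can be derived at start-up, and each processor has at most one owner by construction.  The following sketch shows this derivation and the check for the initialization processor.  The types `example_scheduler` and `example_cpu_config` are simplified stand-ins for Scheduler_Control and rtems_cpu_config, a 32-bit mask is used instead of a CPU set, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for Scheduler_Control. */
typedef struct {
  const char *name;
} example_scheduler;

/* Simplified stand-in for rtems_cpu_config. */
typedef struct {
  const example_scheduler *scheduler; /* NULL: processor not used by RTEMS */
} example_cpu_config;

/* Returns a mask in which bit N is set if processor N is owned. */
uint32_t example_owned_processors(
  const example_cpu_config *table,
  size_t count,
  const example_scheduler *scheduler
)
{
  uint32_t owned = 0;

  assert(table != NULL && count <= 32);

  for (size_t cpu = 0; cpu < count; ++cpu) {
    if (table[cpu].scheduler == scheduler) {
      owned |= UINT32_C(1) << cpu;
    }
  }

  return owned;
}

/* The initialization processor must have a scheduler instance. */
int example_config_is_valid(
  const example_cpu_config *table,
  size_t count,
  size_t init_cpu
)
{
  return init_cpu < count && table[init_cpu].scheduler != NULL;
}
```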
+==  Scheduler Implementation  ==
+Currently the scheduler operations have no control context and use global
+variables instead.  Thus the scheduler operations signatures must change to use
+a scheduler control context as the first parameter, e.g.
+ typedef struct Scheduler_Control Scheduler_Control;
+ typedef struct {
+   [...]
+   void ( *set_affinity )(
+     Scheduler_Control *self,
+     Thread_Control *thread,
+     size_t affinity_set_size,
+     const cpu_set_t *affinity_set
+   );
+   [...]
+ } Scheduler_Operations;
+ /**
+  * @brief General scheduler control.
+  */
+ struct Scheduler_Control {
+   /**
+    * @brief The scheduler operations.
+    */
+   Scheduler_Operations Operations;
+   /**
+    * @brief Size of the owned processor set in bytes.
+    */
+   size_t owned_cpu_set_size;
+   /**
+    * @brief Reference to the owned processor set.
+    *
+    * A set bit means this processor is owned by this scheduler instance, a
+    * cleared bit means the opposite.
+    */
+   cpu_set_t *owned_cpu_set;
+ };
+Single processor configurations also benefit from this change since it makes
+all dependencies explicit and easier to access (allows more efficient machine
+code).
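The effect of an explicit control context can be shown with a reduced sketch in which two scheduler instances share one operation but keep independent state.  All names are hypothetical, and the `enqueue` operation and its fields are invented for illustration.  The sketch assumes that a lower value means a higher priority.

```c
#include <assert.h>

typedef struct example_scheduler example_scheduler;

/* Hypothetical operation table: every operation gets the context first. */
typedef struct {
  void ( *enqueue )( example_scheduler *self, int priority );
} example_scheduler_ops;

/* Hypothetical scheduler control with instance specific state. */
struct example_scheduler {
  example_scheduler_ops ops;
  int ready_count;      /* number of ready threads in this instance */
  int highest_priority; /* assumption: lower value means higher priority */
};

/* Operates only on the state reachable via self, no global variables. */
void example_fp_enqueue( example_scheduler *self, int priority )
{
  ++self->ready_count;
  if ( priority < self->highest_priority ) {
    self->highest_priority = priority;
  }
}
```

Since all state is reached through `self`, several instances can exist side by side, which is what clustered scheduling requires.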
+=  Multiprocessor Resource Sharing Protocol - MrsP  =
+==  Reason  ==
+In general, application-level threads are not independent and may indeed share logical resources. In a partitioned-scheduling system, where capacity allows, resources are allocated on the same processor as the threads that share it: in that case those resources are termed ''local''.  Where needed, resources may also reside on processors other than that of (some of) their sharing threads: in that case those resources are termed ''global''.
+For partitioned scheduling of application-level threads and local resources, two choices are possible, which meet the ITT requirement of achieving predictable time behaviour for platform software:
+ *  (1) fixed-priority scheduling with the immediate priority ceiling for controlling access to local resources, or
+ *  (2) EDF scheduling with the stack resource protocol for controlling access to local resources.
+Choice (1) is more natural with RTEMS. Both alternatives require preemption, whose disruption to the thread's cache working set may in part be attenuated by the use of techniques known as limited-preemption, which have been successfully demonstrated by the UoP. Conversely, run-to-completion (non-preemptive) semantics is known to achieve much lower schedulable utilization, which cannot make it a plausible candidate for performance-hungry systems. The use of fixed-priority scheduling for each processor where application-level threads locally run allows response time analysis to be used, which, on the basis of the worst-case execution time of individual threads and on the graph of resource usage among threads, determines the worst-case completion time of every individual thread run on that processor, offering absolute guarantees on the schedulable utilization of every individual processor that uses that algorithm.
+The determination of worst-case execution time of software programs running on one processor of an SMP is significantly more complex than its single-processor analogue.  This is because every local execution suffers massive interference effects from its co-runners on other processors, regardless of functional independence.  Simplistically upper bounding the possible interference incurs excessive pessimism, which causes massive reduction in the allowable system load.  More accurate interference analysis may incur prohibitive costs unless simplifying assumptions can be safely made, one of which is strictly static partitioning: previous studies run by ESA [http://microelectronics.esa.int/ngmp/MulticoreOSBenchmark-FinalReport_v7.pdf]
+have indicated possible approaches to that problem.  Global scheduling greatly exacerbates the difficulty of that problem and thus is unwelcome.  With fixed-priority scheduling, each logical resource shared by multiple application-level threads (or their corresponding lock) must be statically assigned a ceiling priority attribute computed as an upper bound to the static
+priority of all threads that may use that resource: when an application-level thread acquires the resource lock, the priority of that thread is raised to the
+ceiling of the resource; upon relinquishing the lock to that resource, the thread must return to the priority level that it had prior to acquiring the lock.  (It has been shown that the implementation of this protocol does not need resource locks at all, as the mere fact that a thread is running implies that all of the resources that it may want to use are free at this time and none can be claimed by lower-priority threads.  However, it may be easier for an operating system to attach the ceiling value to the resource lock than to
+any other data structures.)  This simple protocol allows application-level threads to acquire multiple resources without risking deadlock.
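The priority handling of the immediate priority ceiling protocol described above can be sketched as follows.  The types and functions are hypothetical.  The sketch assumes that a numerically larger value means a higher priority, that resources are released in the reverse order of their acquisition, and that at most eight resources are nested.  It models only the priority changes and no actual locking or scheduling.

```c
#include <assert.h>

#define EXAMPLE_MAX_NESTING 8

/* Hypothetical thread: current priority plus a stack of saved priorities. */
typedef struct {
  int priority; /* assumption: larger value means higher priority */
  int saved[EXAMPLE_MAX_NESTING];
  int depth;
} example_thread;

/* Hypothetical resource with its statically assigned ceiling priority. */
typedef struct {
  int ceiling;
} example_resource;

/* Raises the thread to the resource ceiling on acquisition. */
void example_resource_obtain(example_thread *t, const example_resource *r)
{
  assert(t->depth < EXAMPLE_MAX_NESTING);

  t->saved[t->depth] = t->priority;
  ++t->depth;

  if (r->ceiling > t->priority) {
    t->priority = r->ceiling;
  }
}

/* Restores the priority the thread had prior to the acquisition. */
void example_resource_release(example_thread *t)
{
  assert(t->depth > 0);

  --t->depth;
  t->priority = t->saved[t->depth];
}
```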
+When global resources are used, which is very desirable to alleviate the complexity of task allocation, the resource access control protocol in use, whether (1) or (2) in the earlier taxonomy, must be specialized, so that global resources can be syntactically told apart from local resources.  Luckily, two solutions have been very recently proposed to address this problem.  One solution, known as Optimal Migratory Priority Inheritance (OMIP), was proposed by Björn Brandenburg in his work entitled ''A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications'', presented at the ECRTS conference in July 2013.  The other solution, known as Multiprocessor Resource Sharing Protocol (MrsP), was proposed by Alan Burns and Andy Wellings in their work entitled ''A Schedulability Compatible Multiprocessor Resource Sharing Protocol'', presented at the very same ECRTS 2013 conference.  OMIP
+aims to solve the problem of guaranteed bounded blocking in global resource sharing in clustered scheduling on a multiprocessor, where a cluster is a collection of at least one processor.  MrsP aims to achieve the smallest bound to the cumulative duration of blocking suffered by threads waiting to access a global resource in partitioned scheduling (clusters with exactly one
+processor), while allowing schedulability analysis to be performed per processor, using response time analysis, which is a technique fairly well known to industry.  No other global resource sharing protocol on SMP, including OMIP, is able to guarantee that.  For this reason MrsP is the choice for this study.
+The MrsP protocol requires that
+ *  (a) threads waiting for global resources spin at ceiling priority on their local processor, and with a ceiling value greater than any other local resource on that processor, and
+ *  (b) the execution within the global resource may migrate to other processors where application-level threads waiting to access that resource are spinning.
+Feature (a) prevents lower-priority threads from running in preference to the waiting higher-priority thread and stealing resources that it might want to use in the future as part of the current execution; should that stealing happen, the blocking penalty potentially suffered on access to global resources would skyrocket to untenable levels.
+Feature (b) brings in the sole welcome extent of migration in the proposed model.  It is useful when higher-priority tasks running on the processor of the global resource prevent it from completing execution; in that case, the slack allowed for by local spinning on other processors where other threads are waiting is used to speed up the completion of the execution in the global resource and therefore reduce blocking.
+==  RTEMS API Changes  ==
+For MrsP we need the ability to specify the priority ceilings per scheduler
+domain.
+ typedef struct {
+   rtems_id scheduler_id;
+   rtems_task_priority priority;
+ } rtems_task_priority_by_scheduler;
+ /**
+  * @brief Sets the priority ceilings per scheduler for a semaphore with
+  * priority ceiling protocol.
+  *
+  * @param[in] semaphore_id Identifier of the semaphore.
+  * @param[in] priority_ceilings A table with priority ceilings by scheduler.
+  * In case one scheduler appears multiple times, the setting with the highest
+  * index will be used.  This semaphore object is then bound to the specified
+  * scheduler domains.  It is an error to use this semaphore object on other
+  * scheduler domains.  The specified schedulers must be compatible, e.g.
+  * migration from one scheduler domain to another must be defined.
+  * @param[in] priority_ceilings_count Count of priority ceilings by scheduler
+  * pairs in the table.
+  *
+  * @retval RTEMS_SUCCESSFUL Successful operation.
+  * @retval RTEMS_INVALID_ID Invalid semaphore identifier.
+  * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier in the table.
+  * @retval RTEMS_INVALID_PRIORITY Invalid task priority in the table.
+  */
+ rtems_status_code rtems_semaphore_set_priority_ceilings(
+   rtems_id semaphore_id,
+   const rtems_task_priority_by_scheduler *priority_ceilings,
+   size_t priority_ceilings_count
+ );
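The rule that the setting with the highest index wins when a scheduler appears multiple times can be implemented by searching the table backwards.  The sketch below uses hypothetical names and plain `uint32_t` values in place of rtems_id and rtems_task_priority.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rtems_task_priority_by_scheduler. */
typedef struct {
  uint32_t scheduler_id;
  uint32_t priority;
} example_priority_by_scheduler;

/*
 * Looks up the priority ceiling for a scheduler.  The table is searched from
 * the highest index downwards, so the last entry of a scheduler wins.
 * Returns 1 if the scheduler is in the table, otherwise 0.
 */
int example_ceiling_for_scheduler(
  const example_priority_by_scheduler *table,
  size_t count,
  uint32_t scheduler_id,
  uint32_t *ceiling
)
{
  assert(ceiling != NULL);

  for (size_t i = count; i > 0; --i) {
    if (table[i - 1].scheduler_id == scheduler_id) {
      *ceiling = table[i - 1].priority;
      return 1;
    }
  }

  return 0;
}
```

A lookup which fails corresponds to the case in which the semaphore is used on a scheduler domain it is not bound to, which the description above treats as an error.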
+==  Implementation  ==
+The critical part in the MrsP is the migration of the lock holder in case of
+preemption by a higher-priority thread.  A post-switch action is used to detect
+this event.  The post-switch action will remove the thread from the current
+scheduler domain and add it to the scheduler domain of the first waiting thread
+which is executing.  A resource release will remove this thread from the
+temporary scheduler domain and move it back to the original scheduler domain.
+=  Fine Grained Locking  =
+==  Reason  ==
+Fine grained locking is of utmost importance to get a scalable operating system
+that can guarantee reasonable worst-case latencies.  With the current Giant
+lock in RTEMS the worst-case latencies of every operating system service
+increase with each processor added to the system.  Since the Giant lock state
+is shared among all processors a huge cache synchronization overhead
+contributes to the worst-case latencies.  Fine grained locking in combination
+with partitioned/clustered scheduling helps to avoid these problems since now
+the operating system state is distributed allowing true parallelism of
+independent components.
+==  RTEMS API Changes  ==
+None.
+==  Locking Protocol Analysis  ==
+As a sample operating system operation, the existing mutex
+obtain/release/timeout sequences will be analysed.  All ISR disable/enable
+points (highlighted with colours) must be turned into an appropriate SMP lock
+(e.g. a ticket spin lock).  One goal is that an uncontested mutex obtain will
+use no SMP locks except the one associated with the mutex object itself.  Is
+this possible with the current structure?
+ mutex_obtain(id, wait, timeout):
+        <span style="color:red">level = ISR_disable()</span>
+        mtx = mutex_get(id)
+        executing = get_executing_thread()
+        wait_control = executing.get_wait_control()
+        wait_control.set_status(SUCCESS)
+        if !mtx.is_locked():
+                mtx.lock(executing)
+                if mtx.use_ceiling_protocol():
+                        thread_dispatch_disable()
+                        <span style="color:red">ISR_enable(level)</span>
+                        executing.boost_priority(mtx.get_ceiling())
+                        thread_dispatch_enable()
+                else:
+                        <span style="color:red">ISR_enable(level)</span>
+        else if mtx.is_holder(executing):
+                mtx.increment_nest_level()
+                <span style="color:red">ISR_enable(level)</span>
+        else if !wait:
+                <span style="color:red">ISR_enable(level)</span>
+                wait_control.set_status(UNSATISFIED)
+        else:
+                wait_queue = mtx.get_wait_queue()
+                wait_queue.set_sync_status(NOTHING_HAPPENED)
+                executing.set_wait_queue(wait_queue)
+                thread_dispatch_disable()
+                <span style="color:red">ISR_enable(level)</span>
+                if mtx.use_inherit_priority():
+                        mtx.get_holder().boost_priority(executing.get_priority())
+                <span style="color:fuchsia">level = ISR_disable()</span>
+                if executing.is_ready():
+                        executing.set_state(MUTEX_BLOCKING_STATE)
+                        scheduler_block(executing)
+                else:
+                        executing.add_state(MUTEX_BLOCKING_STATE)
+                <span style="color:fuchsia">ISR_enable(level)</span>
+                if timeout:
+                        timer_start(timeout, executing, mtx)
+                <span style="color:blue">level = ISR_disable()</span>
+                search_thread = wait_queue.first()
+                while search_thread != wait_queue.tail():
+                        if executing.priority() <= search_thread.priority():
+                                break
+                        <span style="color:blue">ISR_enable(level)</span>
+                        <span style="color:blue">level = ISR_disable()</span>
+                        if search_thread.is_state_set(MUTEX_BLOCKING_STATE):
+                                search_thread = search_thread.next()
+                        else:
+                                search_thread = wait_queue.first()
+                sync_status = wait_queue.get_sync_status()
+                if sync_status == NOTHING_HAPPENED:
+                        wait_queue.set_sync_status(SYNCHRONIZED)
+                        wait_queue.enqueue(search_thread, executing)
+                        executing.set_wait_queue(wait_queue)
+                        <span style="color:blue">ISR_enable(level)</span>
+                else:
+                        executing.set_wait_queue(NULL)
+                        if executing.is_timer_active():
+                                executing.deactivate_timer()
+                                <span style="color:blue">ISR_enable(level)</span>
+                                executing.remove_timer()
+                        else:
+                                <span style="color:blue">ISR_enable(level)</span>
+                        <span style="color:fuchsia">level = ISR_disable()</span>
+                        if executing.is_state_set(MUTEX_BLOCKING_STATE):
+                                executing.clear_state(MUTEX_BLOCKING_STATE)
+                                if executing.is_ready():
+                                        scheduler_unblock(executing)
+                        <span style="color:fuchsia">ISR_enable(level)</span>
+                thread_dispatch_enable()
+        return wait_control.get_status()
+ mutex_release(id):
+        thread_dispatch_disable()
+        mtx = mutex_get(id)
+        executing = get_executing_thread()
+        nest_level = mtx.decrement_nest_level()
+        if nest_level == 0:
+                if mtx.use_ceiling_protocol() or mtx.use_inherit_priority():
+                        executing.restore_priority()
+                wait_queue = mtx.get_wait_queue()
+                thread = NULL
+                <span style="color:red">level = ISR_disable()</span>
+                thread = wait_queue.dequeue()
+                if thread != NULL:
+                        thread.set_wait_queue(NULL)
+                        if thread.is_timer_active():
+                                thread.deactivate_timer()
+                                <span style="color:red">ISR_enable(level)</span>
+                                thread.remove_timer()
+                        else:
+                                <span style="color:red">ISR_enable(level)</span>
+                        <span style="color:fuchsia">level = ISR_disable()</span>
+                        if thread.is_state_set(MUTEX_BLOCKING_STATE):
+                                thread.clear_state(MUTEX_BLOCKING_STATE)
+                                if thread.is_ready():
+                                        scheduler_unblock(thread)
+                        <span style="color:fuchsia">ISR_enable(level)</span>
+                else:
+                        <span style="color:red">ISR_enable(level)</span>
+                <span style="color:blue">level = ISR_disable()</span>
+                if thread == NULL:
+                        sync_status = wait_queue.get_sync_status()
+                        if sync_status == TIMEOUT or sync_status == NOTHING_HAPPENED:
+                                wait_queue.set_sync_status(SATISFIED)
+                                thread = executing
+                <span style="color:blue">ISR_enable(level)</span>
+                if thread != NULL:
+                        mtx.new_holder(thread)
+                        if mtx.use_ceiling_protocol():
+                                thread.boost_priority(mtx.get_ceiling())
+                else:
+                        mtx.unlock()
+        thread_dispatch_enable()
+ mutex_timeout(thread, mtx):
+        <span style="color:red">level = ISR_disable()</span>
+        wait_queue = thread.get_wait_queue()
+        if wait_queue != NULL:
+                sync_status = wait_queue.get_sync_status()
+                if sync_status != SYNCHRONIZED and thread.is_executing():
+                        if sync_status != SATISFIED:
+                                wait_queue.set_sync_status(TIMEOUT)
+                                wait_control = thread.get_wait_control()
+                                wait_control.set_status(TIMEOUT)
+                        <span style="color:red">ISR_enable(level)</span>
+                else:
+                        <span style="color:red">ISR_enable(level)</span>
+                        <span style="color:lime">level = ISR_disable()</span>
+                        wait_queue = thread.get_wait_queue()
+                        if wait_queue != NULL:
+                                wait_queue.extract(thread)
+                                if thread.is_timer_active():
+                                        thread.deactivate_timer()
+                                        <span style="color:lime">ISR_enable(level)</span>
+                                        thread.remove_timer()
+                                else:
+                                        <span style="color:lime">ISR_enable(level)</span>
+                                <span style="color:fuchsia">level = ISR_disable()</span>
+                                if thread.is_state_set(MUTEX_BLOCKING_STATE):
+                                        thread.clear_state(MUTEX_BLOCKING_STATE)
+                                        if thread.is_ready():
+                                                scheduler_unblock(thread)
+                                <span style="color:fuchsia">ISR_enable(level)</span>
+                        else:
+                                <span style="color:lime">ISR_enable(level)</span>
+        else:
+                <span style="color:red">ISR_enable(level)</span>
+The thread.remove_timer() operation is quite complex and a problem area of its
+own.  See discussion about the [wiki:Watchdog_Handler Watchdog Handler].
+The scheduler operation points are in a good shape.  Here we can easily use one
+SMP lock for the thread state and one SMP lock for the scheduler state.
+The big problem is that the mutex object state changes and the thread enqueue
+operation are split up into several parts.  This was done to ensure a next to
+optimal interrupt latency with only constant-time sections of disabled
+interrupts.  The trade-off is that we have a very complex blocking sequence.
+After the first mutex state change in mutex_obtain() the mutex doesn't know
+directly which thread is about to block on that mutex.  Some sequences assume
+exactly one executing thread in the system, which is not true on an SMP system
+with more than one processor.  With an SMP lock per mutex object all state
+information for this mutex must be present in the mutex object.  So the locked
+mutex to locked mutex with a waiting thread change must be atomic with respect
+to the mutex SMP lock.  Later mutex timeout or release operations can then get
+the waiting thread and deal with it accordingly.
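+The ticket spin lock named above as a possible replacement for the ISR disable/enable points can be sketched with C11 atomics.  This is a minimal illustration, not the RTEMS SMP lock: the type and function names are hypothetical, and a real implementation would also disable interrupts on the local processor before spinning.

```c
#include <stdatomic.h>

/* Hypothetical ticket spin lock; the names are not taken from RTEMS. */
typedef struct {
  atomic_uint next_ticket; /* ticket handed to the next arriving processor */
  atomic_uint now_serving; /* ticket that currently owns the lock */
} ticket_lock;

static void ticket_lock_init(ticket_lock *lock)
{
  atomic_init(&lock->next_ticket, 0);
  atomic_init(&lock->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock *lock)
{
  /* Draw a ticket; arrivals are served strictly in ticket order. */
  unsigned int my_ticket =
    atomic_fetch_add_explicit(&lock->next_ticket, 1, memory_order_relaxed);

  while (
    atomic_load_explicit(&lock->now_serving, memory_order_acquire)
      != my_ticket
  ) {
    /* Busy wait until our ticket is served. */
  }
}

static void ticket_lock_release(ticket_lock *lock)
{
  /* Only the lock holder writes now_serving, so a relaxed load suffices. */
  unsigned int current =
    atomic_load_explicit(&lock->now_serving, memory_order_relaxed);

  atomic_store_explicit(&lock->now_serving, current + 1, memory_order_release);
}
```

+Because tickets are served in FIFO order, the waiting time of a processor is bounded by the number of processors ahead of it times the longest critical section, which is the property that matters for worst-case latency estimates.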
+==  Implementation  ==
+The blocking operation synchronization state must move from the synchronization
+object (e.g. mutex, message queue) to the thread wait information
+(Thread_Wait_information) since it is clear that the thread has to block and
+not the object.  There may also be multiple threads on different processors
+which queue up on a mutex at the same time.
+Blocking threads must be registered in the object under the SMP lock of the
+object and must be independent of the scheduler data structures.  Thus we can
+no longer use the normal chain node of the thread and instead have to add a
+chain node to the thread wait information.
+The thread queue operations must no longer use internal locks (e.g. ISR
+disable/enable).  This simplifies them considerably.  The thread queue
+operations must be performed under the SMP lock of the object.  The drawback is
+that the time of disabled interrupts increases.  The FIFO thread queue
+operations are trivial doubly-linked list operations.  The priority thread
+queue operations execute in a worst-case time which depends only on the maximum
+number of priorities.
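+A priority thread queue with these properties can be sketched as one FIFO chain per priority, with the chain node embedded in the thread wait information.  The following is a minimal sketch under stated assumptions: all names are hypothetical, PRIORITY_COUNT is kept small for illustration, a lower value means a higher priority as in the Classic API, and the caller is assumed to hold the SMP lock of the object.

```c
#include <stddef.h>

#define PRIORITY_COUNT 8 /* assumption: small value for illustration */

typedef struct wait_node {
  struct wait_node *next;
  struct wait_node *previous;
} wait_node;

/* Hypothetical stand-in for the thread wait information. */
typedef struct {
  wait_node node;        /* own chain node, independent of the scheduler */
  unsigned int priority; /* must be less than PRIORITY_COUNT */
} thread_wait_information;

typedef struct {
  wait_node heads[PRIORITY_COUNT]; /* one circular FIFO chain per priority */
} priority_wait_queue;

static void wait_queue_init(priority_wait_queue *queue)
{
  for (size_t p = 0; p < PRIORITY_COUNT; ++p) {
    queue->heads[p].next = &queue->heads[p];
    queue->heads[p].previous = &queue->heads[p];
  }
}

/* Constant time: append to the tail of the chain of this priority. */
static void wait_queue_enqueue(
  priority_wait_queue *queue,
  thread_wait_information *wait
)
{
  wait_node *head = &queue->heads[wait->priority];
  wait_node *node = &wait->node;

  node->next = head;
  node->previous = head->previous;
  head->previous->next = node;
  head->previous = node;
}

/* Bounded by PRIORITY_COUNT: take the first waiter of the best priority. */
static thread_wait_information *wait_queue_dequeue(priority_wait_queue *queue)
{
  for (size_t p = 0; p < PRIORITY_COUNT; ++p) {
    wait_node *head = &queue->heads[p];
    wait_node *node = head->next;

    if (node != head) {
      node->previous->next = node->next;
      node->next->previous = node->previous;
      /* The node is the first member, so this conversion is valid. */
      return (thread_wait_information *) node;
    }
  }

  return NULL;
}
```

+The dequeue loop never depends on the number of waiting threads, only on PRIORITY_COUNT.  A bit map over the non-empty priorities would shorten the search further, at the cost of extra state to keep consistent under the object lock.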
+=  Post-Switch Actions  =
+==  Reason  ==
+Currently threads are assigned to processors for execution by the scheduler
+responsible for this thread.  It is unknown to the system when a thread
+actually starts or terminates execution on its assigned processor.  The
+termination event is important for the following features:
+ *  explicit thread migration, e.g. if a thread should move from one scheduler domain to another,
+ *  thread deletion, since the thread stack is in use until the thread has stopped execution, or
+ *  restart of threads executing on a remote processor.
+==  RTEMS API Changes  ==
+None.
+==  Implementation  ==
+One approach to do post-switch actions could be to spin on the per-processor
+variable reflecting the executing thread.  This has at least two problems:
+# it doesn't work if the executing thread wants to alter its own state, and
+# this spinning must be done with the scheduler lock held and interrupts disabled, which is a disaster for the interrupt latency.
+The proposed solution is to use an optional action handler which is active in
+case the thread execution termination matters.  _Thread_Dispatch() already
+invokes the post-switch extensions after a thread switch.  Unfortunately they
+execute after thread dispatching is enabled again, and at this point the
+current processor may have already changed due to a thread migration requested
+by an interrupt.
+We need a context which executes right after the thread context switch on the
+current processor, but without the per-processor lock acquired (to prevent lock
+order reversal problems and keep the interrupt latency small).  For this we
+introduce a post-switch action chain (PSAC).  Each thread will have its own
+PSAC control.  The PSAC operations like addition to, removal from and iteration
+over the chain are protected by the corresponding thread lock.  Each action
+will have a local context.  The heir thread will execute the action handlers on
+behalf of the thread of interest.  Since thread dispatching is disabled, action
+handlers cannot block.
+The execution time of post-switch actions increases the worst-case thread
+dispatch latency since the heir thread must do work for another thread.
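+The post-switch action chain described above can be sketched as a singly-linked chain of actions hanging off the thread control block.  This is a minimal sketch under stated assumptions: every name is hypothetical, the thread lock that protects the chain is only mentioned in comments, and count_invocation() is a sample handler that exists purely to make the example observable.

```c
#include <stddef.h>

typedef struct thread_control thread_control;
typedef struct post_switch_action post_switch_action;

typedef void (*post_switch_handler)(
  thread_control *thread,
  post_switch_action *action
);

struct post_switch_action {
  post_switch_action *next;
  post_switch_handler handler;
  void *context; /* local context of this action */
};

struct thread_control {
  post_switch_action *actions; /* PSAC; protected by the thread lock */
};

/* Appends an action.  The caller is assumed to hold the thread lock. */
static void psac_add(thread_control *thread, post_switch_action *action)
{
  post_switch_action **link = &thread->actions;

  while (*link != NULL) {
    link = &(*link)->next;
  }

  action->next = NULL;
  *link = action;
}

/*
 * Runs and removes all actions of the last executing thread.  In the
 * proposal this is done by the heir thread with the thread lock held and
 * thread dispatching disabled, so the handlers must not block.
 */
static void psac_run(thread_control *last)
{
  while (last->actions != NULL) {
    post_switch_action *action = last->actions;

    last->actions = action->next;
    action->next = NULL;
    action->handler(last, action);
  }
}

/* Sample handler: counts how often it ran via the local context. */
static void count_invocation(thread_control *thread, post_switch_action *action)
{
  (void) thread;
  ++*(int *) action->context;
}
```

+Each action is removed before its handler runs, so an on-demand action executes once per request and has to be added again if it is needed after the next context switch.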
+On-demand post-switch actions help to implement the Multiprocessor Resource
+Sharing Protocol (MrsP) proposed by Burns and Wellings.  Threads executing a
+global critical section can add a post-switch action which will trigger the
+thread migration in case of preemption by a local high-priority thread.
+ thread_dispatch:
+        again = true
+        while again:
+                level = ISR.disable()
+                current_cpu = get_current_cpu()
+                current_cpu.disable_thread_dispatch()
+                ISR.enable(level)
+                executing = current_cpu.get_executing()
+                current_cpu.acquire()
+                if current_cpu.is_thread_dispatch_necessary():
+                        heir = current_cpu.get_heir()
+                        current_cpu.set_thread_dispatch_necessary(false)
+                        current_cpu.set_executing(heir)
+                        executing.set_executing(false)
+                        heir.set_executing(true)
+                        if executing != heir:
+                                last = switch(executing, heir)
+                                current_cpu = get_current_cpu()
+                                actions = last.get_actions()
+                                if actions.is_empty():
+                                        again = false
+                                else:
+                                        current_cpu.release()
+                                        last.acquire()
+                                        if last.get_cpu() == current_cpu:
+                                                while !actions.is_empty():
+                                                        action = actions.pop()
+                                                        action.do(current_cpu, last)
+                                        last.release()
+                                        current_cpu.enable_thread_dispatch()
+        current_cpu.enable_thread_dispatch()
+        current_cpu.release()
+It is important to check that the thread is still assigned to the current
+processor, since after the release of the per-processor lock we have a new
+executing thread and the thread of interest may already have migrated to
+another processor.  Since the heir thread now has a reference to the thread of
+interest, we have to make sure that deletion requests are deferred until the
+post-switch actions have been executed.
+An efficient way to get the last executing thread (the thread of interest)
+throughout the context switch is to return the context pointer of the last
+executing thread.  With a simple offset operation we get the thread control
+block.
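+The offset operation can be sketched with offsetof().  This is an illustration only: the structure layouts and the function name are hypothetical and do not reflect the real thread control block.

```c
#include <stddef.h>

/* Hypothetical register context saved by the context switch. */
typedef struct {
  void *stack_pointer;
} context_control;

/* Hypothetical thread control block embedding the context. */
typedef struct {
  int id;
  context_control registers;
} example_thread;

/*
 * Recovers the thread control block from the context pointer returned by
 * the context switch by subtracting the offset of the embedded member.
 */
static example_thread *thread_from_context(context_control *context)
{
  return (example_thread *)
    ((char *) context - offsetof(example_thread, registers));
}
```

+The offset is a compile-time constant, so the conversion costs a single subtraction and needs no lookup of per-processor state after the switch.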
+=  Thread Delete/Restart  =
+==  Reason  ==
+Deletion of threads may be required by some parallel libraries.
+==  RTEMS API Changes  ==
+None.
+==  Implementation  ==
+The current implementation to manage a thread life-cycle in RTEMS has some
+weaknesses that turn into severe problems on SMP.  It also leads to POSIX and
+C++ standard conformance defects in some cases.  Currently the thread
+life-cycle changes are protected by the thread dispatch disable level and some
+parts by the allocator mutex.  Since the thread dispatch disable level is
+actually a giant mutex on SMP, this leads, in combination with the allocator
+mutex, to lock order reversal problems.
+The usage of unified work areas is also broken at the moment
+[https://www.rtems.org/bugzilla/show_bug.cgi?id=2152].
+There is also an outstanding thread cancellation bug
+[https://www.rtems.org/bugzilla/show_bug.cgi?id=2035].
+One problematic path is the destruction of threads.  Currently we have the
+following sequence:
+<ol>
+<li>Obtain the allocator mutex.</li>
+<li>Disable thread dispatching.</li>
+<li>Invalidate the object identifier.</li>
+<li>Enable thread dispatching.</li>
+<li>Call the thread delete extensions in the context of the deleting thread
+(not necessarily the deleted thread).  The POSIX cleanup handlers are called
+here from the POSIX delete extension.  POSIX mandates that the cleanup handlers
+are executed in the context of the corresponding thread.  So here we have a
+POSIX violation
+[http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_03].
+</li>
+<li>Remove the thread from the scheduling and watchdog resources.</li>
+<li>Delete scheduling, floating-point, stack and extensions resources.  Now the
+deleted thread may execute on a freed thread stack!</li>
+<li>Free the object.  Now the object (thread control block) is available for
+re-use, but it is still used by the thread!  Only the disabled thread
+dispatching prevents chaos.</li>
+<li>Release the allocator mutex.  Now we have a lock order reversal (see steps 1
+and 2).</li>
+<li>Enable thread dispatching.  Here a deleted executing thread disappears.  On
+SMP we also have a race condition here.  In detail, this step looks as follows:
+{{{
+if ( _Thread_Dispatch_decrement_disable_level() == 0 ) {
+        /*
+         * Here another processor may re-use resources of a deleted executing
+         * thread, e.g. the stack.
+         */
+        _Thread_Dispatch();
+}
+}}}
+</li>
+</ol>
+To overcome the issues we need considerable implementation changes in Score.
+The thread life-cycle state must be explicit and independent of the thread
+dispatch disable level and allocator mutex protection.
+The thread life-cycle is determined by the following actions:
+; CREATE : A thread is created.
+; START : Starts a thread.  The thread must be dormant to get started.
+; RESTART : Restarts a thread.  The thread must not be dormant to get restarted.
+; SUSPEND : Suspends a thread.
+; RESUME : Resumes a thread.
+; DELETE : Deletes a thread.
+; SET_PROTECTION : Sets the new protection state and returns the previous.  This action is new.
+The following thread life-cycle states are proposed.  These states are
+orthog