    1 [[TOC(Developer/SMP, depth=3)]]
    2 
    3 =  Status of Effort  =
    4 
    5 
    6 The [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] support for RTEMS is a work in progress.  Basic support is available for ARM, PowerPC, SPARC and Intel x86.
    7 
    8 
    9 
|| '''Design Issues''' || '''Summary''' || '''Notes''' ||
|| [#Low-LevelStart Low-Level Start] || TBD || ||
|| [#ProcessorAffinity Processor Affinity] || Ongoing || ||
|| [#Atomic_Operations Atomic Operations] || TBD || ||
|| [#SMPLocks SMP Locks] || TBD || ||
|| [#ISRLocks ISR Locks] || TBD || ||
|| [#GiantLockvs.FineGrainedLocking Giant Lock vs. Fine Grained Locking] || TBD || ||
|| [#WatchdogHandler Watchdog Handler] || TBD || ||
|| [#Per-CPUControl Per-CPU Control] || TBD || ||
|| [#InterruptSupport Interrupt Support] || TBD || ||
|| [#GlobalScheduler Global Scheduler] || TBD || ||
|| [#ClusteredScheduling Clustered Scheduling] || TBD || ||
|| [#TaskVariables Task Variables] || TBD || ||
|| [#Non-PreemptModeforMutualExclusion Non-Preempt Mode for Mutual Exclusion] || TBD || ||
|| [#ThreadRestart Thread Restart] || TBD || ||
|| [#ThreadDelete Thread Delete] || TBD || ||
|| [#SemaphoresAndMutexes Semaphores and Mutexes] || TBD || ||
    28 
    29 
    30 
|| '''Implementation Status''' || '''Summary''' || '''Notes''' ||
|| [#ToolChain Tool Chain] || TBD || ||
|| [#Profiling Profiling] || Complete || CPU counter support and profiling support are complete. ||
|| [#InterruptSupport1 Interrupt Support] || TBD || ||
|| [#ClusteredScheduling1 Clustered Scheduling] || Complete || ||
|| [#MultiprocessorResourceSharingProtocol-MrsP Multiprocessor Resource Sharing Protocol - MrsP] || Complete || ||
|| [#FineGrainedLocking Fine Grained Locking] || TBD || ||
|| [#Post-SwitchActions Post-Switch Actions] || Complete || Implemented as post-switch thread actions. ||
|| [#ThreadDeleteRestart Thread Delete/Restart] || Complete || ||
|| [#BarrierSynchronization Barrier Synchronization] || Complete || ||
|| [#Low-LevelBroadcasts Low-Level Broadcasts] || TBD || ||
|| [#TermiosFramework Termios Framework] || Complete || ||
    43 =  Requirements  =
    44 
    45 
    46  *  Implementation Language
    47   *  The implementation language shall be C11 (ISO/IEC 9899:2011) or assembler.
  *  The CPU architecture shall support lock-free atomic operations for unsigned long integers (a run-time check is sketched after this list).
    49  *  SMP Synchronization Primitives
    50   *  Atomic operations and fences shall be used to implement higher-level synchronization primitives.
    51   *  An SMP lock which ensures mutual exclusion and FIFO ordering shall be provided.
    52   *  An SMP read-write lock shall be provided which offers phase-fair ordering [#BrandenburgAnderson2010].
    53   *  A re-usable SMP barrier shall be provided.
  *  SMP synchronization primitives may spin forever without progress in case other processors execute erroneous code.
    55  *  System Initialization and Shutdown
  *  Before execution starts at the entry symbol on at least one processor, a boot loader must load the text and read-only sections.
    57   *  A read-only application configuration shall select the boot processor.
    58   *  The boot processor shall initialize the data and BSS sections if not already performed by a boot loader.
  *  A CPU architecture or BSP specific method shall ensure that objects resident in the data or BSS section are not accessed before the boot processor or boot loader has initialized these sections.
    60   *  The boot processor shall execute the serial system initialization.
    61   *  It shall be possible to shutdown the system anytime after data and BSS section initialization from any processor.
    62   *  The fatal extension handler shall be invoked during system shutdown on each processor.
  *  Invocation of the shutdown procedure while holding SMP synchronization primitives may lead to deadlock.
    64  *  Thread Life Cycle
    65   *  Asynchronous thread deletion shall be possible.
    66   *  Concurrent thread deletion shall be possible.  At most one thread shall start the deletion process successfully, other threads shall observe an error status.
  *  Using a thread identifier that has been re-used by a newly created thread in order to access the deleted thread may lead to unpredictable results.
    68   *  Threads shall have a method to change and restore the ability for asynchronous thread deletion of the executing thread.
    69   *  The POSIX cleanup handler shall be available for all RTEMS build configurations.
    70   *  The POSIX cleanup handler shall execute in the context of the deleted thread.
    71   *  The POSIX cleanup handler shall execute in case a thread is re-started in the context of the re-started thread.
    72   *  The POSIX keys shall be available for all RTEMS build configurations.
    73   *  The POSIX key destructor shall execute in the context of the deleted thread.
    74   *  The POSIX key destructor shall execute in case a thread is re-started in the context of the re-started thread.
  *  If a thread still owns semaphore objects after the cleanup procedure, then this shall result in a fatal error.
    76  *  Non-Thread Object Life Cycle
  *  Concurrent object deletion covers, for example, deletion requests issued by multiple threads for one object, or usage of an object while a deletion request is in progress.
    78   *  Concurrent object deletion may have unpredictable results.
    79   *  Usage of objects during deletion of this object may have unpredictable results.
    80  *  Classic API
    81   *  Usage of task variables shall lead to a compile time error.
    82   *  Usage of task non-preempt mode shall lead to a compile time error.
  *  Usage of interrupt disable/enable shall lead to a run-time error if RTEMS debug is enabled and the executing context is unsafe.
    84   *  All other RTEMS object services shall behave like in the single-processor configuration.
    85  *  Profiling
  *  The profiling shall be an RTEMS build configuration option.
    87   *  It shall be possible to measure time intervals up to the system tick interval with little overhead in every execution context.
    88   *  It shall be possible to obtain profiling information for the lowest-level system operations like thread dispatch disabled sections, interrupt processing and SMP locks.
    89   *  There shall be a method for the application to retrieve profiling information of the system.
    90  *  Interrupt Support
    91   *  Interrupts shall have an interrupt affinity to the boot processor by default.
    92   *  It shall be possible to set the interrupt affinity of interrupt sources.
    93  *  !Clustered/Partitioned Scheduling
    94   *  RTEMS shall allow the set of processors in a system to be partitioned into pairwise disjoint subsets.  Each subset of processors shall be owned by exactly one scheduler instance.
    95   *  A clustered/partitioned fixed-priority scheduler shall be provided.
    96   *  The application configuration shall provide a set of processors which may be used to run the application.
    97   *  The application configuration shall define the scheduler instances.
    98   *  The CPU architecture or BSP shall be able to reduce the set of processors provided by the application configuration to reflect the actual hardware.
    99   *  The application configuration of a scheduler instance shall specify if the set of processors can be reduced.
    100   *  The application configuration of a scheduler instance shall specify if the set of processors can be expanded with processors available by the actual hardware and not assigned to other scheduler instances.
    101  *  Fine Grained Locking
    102   *  No giant lock protecting the system state shall be necessary.
    103   *  Non-blocking operations shall use only an object specific lock.
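The atomic operations requirement above can be checked at run time with C11 facilities.  A minimal sketch (the function name is illustrative):

{{{
#!c
#include <assert.h>
#include <stdatomic.h>

/* Verify that operations on an atomic unsigned long are lock-free as
 * required and perform one relaxed update. */
static void check_lock_free_atomics( void )
{
  static atomic_ulong counter;

  assert( atomic_is_lock_free( &counter ) );
  atomic_fetch_add_explicit( &counter, 1, memory_order_relaxed );
}
}}}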
    104 =  Application Impact of SMP  =
    105 
    106 TBD
    107 
    108  *  Task Variables
    109  *  Interrupt !Disable/Enable
    110  *  No Preempt
    111  *  Keys
    112  *  Thread Local Storage
    113 =  Design Issues  =
    114 
    115 ==  Low-Level Start  ==
    116 
    117 === Status ===
    118 
    119 
    120 The low-level start is guided by the per-CPU control state
    121 [http://www.rtems.org/onlinedocs/doxygen/cpukit/html/group__PerCPU.html#gabad09777c1e3a7b7f3efcae54938d418 Per_CPU_Control::state].  See also {{{_Per_CPU_Change_state()}}} and
    122 {{{_Per_CPU_Wait_for_state()}}}.
    123 
    124 
    125 [[Image(rtems-smp-low-level-states.png)]]
    126 
    127 
    128 === Future Directions ===
    129 
    130 
    131 None.
    132 ==  Processor Affinity  ==
    133 
    134 ===  Status  ===
    135 
    136 
    137 As of 15 April 2014, a scheduler with support for managing the data associated with thread processor affinity has been added. The actual affinity logic in the scheduler remains to be implemented.
    138 
    139 The APIs for <sys/cpuset.h>, pthread affinity, and Classic API task affinity have been community reviewed and patches have been posted. The current work is "data complete" in that one can get and set using the APIs but no scheduler currently supports affinity.
    140 ===  Background  ===
    141 
    142 
There are multiple pieces of work involved in supporting thread processor affinity:
    144 
    145  *  the POSIX application level API,
    146  *  the Classic application level API,
    147  *  scheduler framework modifications, and
    148  *  actual scheduler support.
    149 
    150 There is no POSIX API available that covers thread processor affinity.  Linux is the de facto standard system for high performance computing so it is obvious to choose a Linux compatible API for RTEMS.  Linux provides
    151 [http://man7.org/linux/man-pages/man3/CPU_SET.3.html CPU_SET(3)] to manage sets of processors.  Thread processor affinity can be controlled via
    152 [http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html SCHED_SETAFFINITY(2)]
    153 and
    154 [http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html PTHREAD_SETAFFINITY_NP(3)].
    155 
 *  Once the work is complete, RTEMS will provide these APIs.  The implementation shares no code with GNU/Linux: the Linux files cannot be used directly because they are licensed under the pure GPL.
    157 
    158 Work remains to be done on adding affinity to a new SMP Scheduler variant. The current plan calls for having the schedulers without affinity ignore input on a set affinity operation and return a cpu set with all CPUs set on all get affinity operations. A new SMP scheduler variant will be added to support affinity. Initially, this scheduler will only be sufficient to properly manage the thread's CPU set as a data element. Then an assignment algorithm will be added.
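As an illustration of the planned Linux-compatible API, a thread could pin itself to processor 0 as follows.  This is a sketch; the function name is illustrative and the header providing the CPU_* macros differs between systems (<sys/cpuset.h> on RTEMS, <sched.h> on Linux).

{{{
#!c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_self_to_processor_zero( void )
{
  cpu_set_t set;

  CPU_ZERO( &set );
  CPU_SET( 0, &set );

  /* Restrict the calling thread to processor 0. */
  return pthread_setaffinity_np( pthread_self(), sizeof( set ), &set );
}
}}}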
    159 ===  Theory  ===
    160 
    161 
    162 The scheduler support for arbitrary processor affinities is a major challenge.
    163 Schedulability analysis for schedulers with arbitrary processor affinities is a
    164 current research topic [#GujaratiCerqueiraBandenburg2013].
    165 
Support for arbitrary processor affinities may lead to massive thread
migrations in case a new thread is scheduled.  Suppose we have M processors
0, ..., M - 1 and M + 1 threads 0, ..., M.  Thread i can run on processors i
and i + 1 for i = 0, ..., M - 2.  Thread M - 1 runs only on processor M - 1.
Thread M runs only on processor 0.  Thread i has a higher priority than
thread i + 1 for i = 0, ..., M - 1, i.e. thread 0 is the highest priority
thread.  Suppose at time T0 threads 0, ..., M - 2 and M are scheduled.
Thread i runs on processor i + 1 for i = 0, ..., M - 2 and thread M runs on
processor 0.  Now at time T1 thread M - 1 becomes ready.  It casts out thread
M since this is the lowest priority scheduled thread.  Since thread M - 1 can
run only on processor M - 1, thread i has to migrate from processor i + 1 to
processor i for i = 0, ..., M - 2.  So one thread becomes ready and the
threads of all but one processor must migrate.  The threads forced to migrate
all have a higher priority than the newly ready thread M - 1.
Example for M = 3 (an X marks the processor on which the thread executes):

|| '''Time''' || '''Thread''' || '''Processor 0''' || '''Processor 1''' || '''Processor 2''' ||
|| T0 || 0 ||  || X ||  ||
|| T0 || 1 ||  ||  || X ||
|| T0 || 2 ||  ||  ||  ||
|| T0 || 3 || X ||  ||  ||
|| T1 || 0 || X ||  ||  ||
|| T1 || 1 ||  || X ||  ||
|| T1 || 2 ||  ||  || X ||
|| T1 || 3 ||  ||  ||  ||
    231 
    232 In the example above the Linux Push and Pull scheduler would not find a
    233 processor for thread M - 1 = 2 and thread M = 3 would continue to execute even
    234 though it has a lower priority.
    235 
The general scheduling problem with arbitrary processor affinities is a
matching problem in a bipartite graph.  There are two disjoint vertex sets:
the set of ready threads and the set of processors.  The edges indicate that
a thread can run on a processor.  The scheduler must find a maximum matching
which fulfills additional constraints.  For example the highest priority
threads should be scheduled first and the performed thread migrations should
be minimal.  The augmenting path algorithm needs O(VE) time to find a maximum
matching, where V is the ready thread count plus the processor count and E is
the number of edges.  It is particularly bad that the time complexity depends
on the ready thread count.  It is an open problem whether an algorithm exists
that is good enough for real-time applications.
    247 
    248 
    249 [[Image(rtems-smp-affinity.png)]]
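To make the matching problem concrete, the following sketch applies the augmenting path algorithm to the thread-to-processor graph.  All names and the fixed-size tables are illustrative assumptions; threads are indexed by priority (0 is the highest) so the highest priority threads are matched first.

{{{
#!c
#include <stdbool.h>
#include <string.h>

#define THREAD_COUNT 4
#define PROCESSOR_COUNT 3

/* can_run[ t ][ p ] is an edge of the bipartite graph: thread t may run
 * on processor p. */
static bool can_run[ THREAD_COUNT ][ PROCESSOR_COUNT ];

/* matched_thread[ p ] is the thread scheduled on processor p, or -1. */
static int matched_thread[ PROCESSOR_COUNT ];

static bool visited[ PROCESSOR_COUNT ];

/* Search an augmenting path which schedules thread t, possibly migrating
 * already scheduled threads to other processors. */
static bool augment( int t )
{
  for ( int p = 0; p < PROCESSOR_COUNT; ++p ) {
    if ( can_run[ t ][ p ] && !visited[ p ] ) {
      visited[ p ] = true;

      if ( matched_thread[ p ] < 0 || augment( matched_thread[ p ] ) ) {
        matched_thread[ p ] = t;

        return true;
      }
    }
  }

  return false;
}

/* Match threads in priority order; the overall run time is O(VE). */
static int schedule_all( void )
{
  int scheduled = 0;

  memset( matched_thread, -1, sizeof( matched_thread ) );

  for ( int t = 0; t < THREAD_COUNT; ++t ) {
    memset( visited, 0, sizeof( visited ) );

    if ( augment( t ) ) {
      ++scheduled;
    }
  }

  return scheduled;
}
}}}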
    250 
    251 ==  Atomic Operations  ==
    252 
    253 ===  Status  ===
    254 
    255 
The <stdatomic.h> header file is now available in Newlib.  GCC 4.8 supports C11 atomic operations.  Proper atomic operations support for LEON3 is included in
GCC 4.9.  According to the SPARC GCC maintainer it is possible to back-port this to GCC 4.8.  A GSoC 2013 project is working on an atomic operations API for RTEMS.
One part will be a read-write lock using a phase-fair lock implementation.
    259 
Example ticket lock with C11 atomics:
{{{
#!c
#include <stdatomic.h>

struct ticket {
        atomic_uint ticket;
        atomic_uint now_serving;
};

void acquire(struct ticket *t)
{
        /* Draw the next ticket; the monotonically increasing tickets
         * provide FIFO ordering of the waiting processors. */
        unsigned int my_ticket = atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);

        /* Busy wait until our ticket is served; the acquire order pairs
         * with the release store in release(). */
        while (atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket) {
                /* Wait */
        }
}

void release(struct ticket *t)
{
        /* Only the lock owner modifies now_serving, thus a relaxed load
         * is sufficient here. */
        unsigned int current_ticket = atomic_load_explicit(&t->now_serving, memory_order_relaxed);

        atomic_store_explicit(&t->now_serving, current_ticket + 1U, memory_order_release);
}
}}}
    286 
    287 The generated assembler code looks pretty good.  Please note that GCC generates ''CAS'' instructions and not ''CASA'' instructions.
    288 {{{
    289 #!asm
    290         .file   "ticket.c"
    291         .section        ".text"
    292         .align 4
    293         .global acquire
    294         .type   acquire, #function
    295         .proc   020
    296 acquire:
    297         ld      [%o0], %g1
    298         mov     %g1, %g2
    299 .LL7:
    300         add     %g1, 1, %g1
    301         cas     [%o0], %g2, %g1
    302         cmp     %g1, %g2
    303         bne,a   .LL7
    304          mov    %g1, %g2
    305         add     %o0, 4, %o0
    306 .LL4:
    307         ld      [%o0], %g1
    308         cmp     %g1, %g2
    309         bne     .LL4
    310          nop
    311         jmp     %o7+8
    312          nop
    313         .size   acquire, .-acquire
    314         .align 4
    315         .global release
    316         .type   release, #function
    317         .proc   020
    318 release:
    319         ld      [%o0+4], %g1
    320         add     %g1, 1, %g1
    321         st      %g1, [%o0+4]
    322         jmp     %o7+8
    323          nop
    324         .size   release, .-release
    325         .ident  "GCC: (GNU) 4.9.0 20130917 (experimental)"
    326 }}}
    327 
    328 === Future Directions ===
    329 
    330 
    331  *  Review and integrate the GSoC work.
    332  *  Make use of atomic operations.
    333 ==  SMP Locks  ==
    334 
    335 ===  Status  ===
    336 
    337 
SMP locks are implemented as a ticket lock using CPU architecture specific atomic operations.  The SMP locks use a local context to be able to use scalable lock implementations like the Mellor-Crummey and Scott (MCS) queue-based locks.
    339 === Future Directions ===
    340 
    341 
Introduce read-write locks.  Use a phase-fair read-write lock implementation.  This can be used for example by the time management code: the system time may be read frequently, but updates are infrequent.
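A sketch of a phase-fair ticket read-write lock in the style of [#BrandenburgAnderson2010] follows.  The names and the counter layout are illustrative assumptions, not an RTEMS API.

{{{
#!c
#include <stdatomic.h>

#define PF_RINC 0x100U  /* reader tickets live in the upper bits */
#define PF_WBITS 0x3U   /* writer presence and phase bits */
#define PF_PRES 0x2U    /* a writer is present */
#define PF_PHID 0x1U    /* writer phase identifier */

struct pf_rwlock {
        atomic_uint rin;   /* reader entries plus writer bits */
        atomic_uint rout;  /* reader exits */
        atomic_uint win;   /* writer ticket counter */
        atomic_uint wout;  /* writer serving counter */
};

void read_acquire(struct pf_rwlock *l)
{
        unsigned int w = atomic_fetch_add_explicit(&l->rin, PF_RINC, memory_order_acquire) & PF_WBITS;

        /* Spin until the writer present at entry has finished its phase. */
        while (w != 0 && w == (atomic_load_explicit(&l->rin, memory_order_acquire) & PF_WBITS)) {
                /* Wait */
        }
}

void read_release(struct pf_rwlock *l)
{
        atomic_fetch_add_explicit(&l->rout, PF_RINC, memory_order_release);
}

void write_acquire(struct pf_rwlock *l)
{
        /* Writers serialize among themselves with a ticket lock. */
        unsigned int t = atomic_fetch_add_explicit(&l->win, 1, memory_order_relaxed);

        while (atomic_load_explicit(&l->wout, memory_order_acquire) != t) {
                /* Wait */
        }

        /* Announce writer presence and phase, then wait until all readers
         * which entered beforehand have left. */
        unsigned int readers = atomic_fetch_add_explicit(&l->rin, PF_PRES | (t & PF_PHID), memory_order_acquire) & ~PF_WBITS;

        while (atomic_load_explicit(&l->rout, memory_order_acquire) != readers) {
                /* Wait */
        }
}

void write_release(struct pf_rwlock *l)
{
        /* Clear the writer bits to start the next reader phase. */
        atomic_fetch_and_explicit(&l->rin, ~PF_WBITS, memory_order_release);
        atomic_fetch_add_explicit(&l->wout, 1, memory_order_release);
}
}}}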
    343 ==  ISR Locks  ==
    344 
    345 ===  Status  ===
    346 
    347 
    348 On single processor configurations disabling of interrupts ensures mutual
    349 exclusion.  This is no longer true on SMP since other processors continue to
    350 execute freely.  On SMP the disabling of interrupts must be combined with an
    351 SMP lock.  The ISR locks degrade to simple interrupt disable/enable sequences
    352 on single processor configurations.  On SMP configurations they use an SMP lock
    353 to ensure mutual exclusion throughout the system.
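A usage sketch with the interrupt lock API of the ''Interrupt Manager Extension'' (the device structure and function names are illustrative):

{{{
#!c
#include <rtems.h>

typedef struct {
  rtems_interrupt_lock lock;
  unsigned int counter;
} my_device;

static void my_device_update( my_device *dev )
{
  rtems_interrupt_lock_context lock_context;

  /* On single processor configurations this only disables interrupts,
   * on SMP configurations it additionally acquires an SMP lock. */
  rtems_interrupt_lock_acquire( &dev->lock, &lock_context );
  ++dev->counter;
  rtems_interrupt_lock_release( &dev->lock, &lock_context );
}
}}}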
    354 === Future Directions ===
    355 
    356 
    357  *  See [#SMPLocks  SMP locks].
 *  Ensure via an RTEMS assertion that plain interrupt disable/enable sequences are only used intentionally outside of the Giant lock critical sections.  Review the usage of ISR disable/enable sequences in the complete code base.
    359 ==  Giant Lock vs. Fine Grained Locking  ==
    360 
    361 ===  Status  ===
    362 
    363 
The operating system state is currently protected by three things:
    365 
    366  *  the per-processor thread dispatch disable level,
    367  *  the Giant lock (a global recursive SMP lock), and
    368  *  the disabling of interrupts.
    369 
For example operating system services like ''rtems_semaphore_release()'' follow
this pattern:
    372 
    373 {{{
    374 operation(id):
    375         disable_thread_dispatching()
    376         acquire_giant_lock()
    377         obj = get_obj_by_id(id)
    378         status = obj->operation()
    379         put_object(obj)
    380         release_giant_lock()
    381         enable_thread_dispatching()
    382         return status
    383 }}}
    384 
All high level operations are serialized by the Giant lock.  The
introduction of the Giant lock allowed a straightforward usage of the existing
single processor code in an SMP configuration.  However, a Giant lock will
likely lead to undesirably high worst-case latencies which increase linearly
with the processor count [#Brandenburg2011].
    390 === Future Directions ===
    391 
    392 
    393 The Giant lock should be replaced with fine grained locking.  In order to
    394 introduce fine grained locking the state transitions inside a section protected
    395 by the Giant lock must be examined.  As an example the event send and receive
    396 operations are presented in a simplified state diagram.
    397 
    398 [[Image(rtems-smp-events.png, 600px)]]
    399 
The color indicates which locks are held during a state and state transition.
In case of a blocking operation we have four synchronization states which are
used to split up the complex blocking operation into parts in order to decrease
the interrupt latency:
    404 
    405  *  nothing happened,
    406  *  satisfied,
    407  *  timeout, and
    408  *  synchronized.
    409 
This state must be available for each thread and resource (e.g. event,
semaphore, message queue).  Since this would lead to a prohibitively high
memory demand and a complicated implementation, some optimizations have been
performed in RTEMS.  For events a global variable is used to indicate the
synchronization state.  For resource objects the synchronization state is part
of the object.  Since at most one processor can execute a blocking operation
(ensured by the Giant lock or the single processor configuration) it is
possible to use the state variable only for the executing thread.
    418 
    419 In case one SMP lock is used per resource the resource must indicate which
    420 thread is about to perform a blocking operation since more than one executing
    421 thread exists.  For example two threads can try to obtain different semaphores
    422 on different processors at the same time.
    423 
    424 A blocking operation will use not only the resource object, but also other
    425 resources like the scheduler shared by all threads of a dispatching domain.
    426 Thus the scheduler needs its own locks.  This will lead to nested locking and
    427 deadlocks must be prevented.
    428 ==  Watchdog Handler  ==
    429 
    430 ===  Status  ===
    431 
    432 
The [http://www.rtems.org/onlinedocs/doxygen/cpukit/html/group__ScoreWatchdog.html Watchdog Handler] is a global resource which provides time management in
RTEMS.  It is a core service of the operating system used by the high level
APIs, for example for time of day maintenance and for time events based on
ticks or wall clock time.  The term ''watchdog'' has nothing to do with
hardware watchdogs.
    436 === Future Directions ===
    437 
    438 
The watchdog uses delta chains to manage time based events.  One global
watchdog resource leads to scalability problems: an SMP system is scalable only
if additional processors increase its performance.  If we add more processors,
then in general the number of timers increases proportionally and thus leads to
longer delta chains, and the chance of lock contention increases as well.  Thus
the watchdog resource should move to the scheduler scope with one instance per
scheduler.  Additional processors can then use new scheduler instances.
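A minimal delta chain insertion sketch (the types are illustrative and not the RTEMS implementation):

{{{
#!c
#include <stddef.h>
#include <stdint.h>

typedef struct watchdog {
  struct watchdog *next;
  uint64_t delta;  /* expiration ticks relative to the previous node */
} watchdog;

/* Insert a watchdog which expires in the given number of ticks.  A clock
 * tick only decrements the head delta; the insertion below is the linear
 * operation which grows with the timer count. */
static void delta_chain_insert( watchdog **head, watchdog *w, uint64_t ticks )
{
  watchdog **link = head;

  while ( *link != NULL && ( *link )->delta <= ticks ) {
    ticks -= ( *link )->delta;
    link = &( *link )->next;
  }

  w->delta = ticks;
  w->next = *link;

  /* The successor is now relative to the new node. */
  if ( *link != NULL ) {
    ( *link )->delta -= ticks;
  }

  *link = w;
}
}}}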
    442 ==  Per-CPU Control  ==
    443 
    444 ===  Status  ===
    445 
    446 
    447 Per-CPU control state is available for each configured CPU in a statically
    448 created global table {{{_Per_CPU_Information}}}.  Each per-CPU control is cache
    449 aligned to prevent false sharing and to provide simple access via assembly
    450 code.  CPU ports can add custom fields to the per-CPU control.  This is used on
    451 SPARC for the ISR dispatch disable indication.
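The structure below sketches the idea with illustrative fields and an assumed cache line size; the real {{{Per_CPU_Control}}} differs.

{{{
#!c
#include <stdint.h>

#define EXAMPLE_CACHE_LINE_SIZE 32  /* architecture dependent assumption */

/* Cache alignment ensures that writes by one processor to its control do
 * not invalidate cache lines used by other processors (false sharing) and
 * lets assembly code index the table with a simple shift. */
typedef struct {
  uint32_t state;
  uint32_t isr_nest_level;
  uint32_t thread_dispatch_disable_level;
} __attribute__(( __aligned__( EXAMPLE_CACHE_LINE_SIZE ) )) example_per_cpu;

static example_per_cpu example_per_cpu_table[ 4 ];
}}}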
    452 === Future Directions ===
    453 
    454 
    455 None.
    456 ==  Interrupt Support  ==
    457 
    458 ===  Status  ===
    459 
    460 
    461 The interrupt support is BSP specific in general.
    462 === Future Directions ===
    463 
    464 
    465  *  The SMP capable BSPs should implement the [http://www.rtems.org/onlinedocs/doxygen/cpukit/html/group__rtems__interrupt__extension.html Interrupt Manager Extension].
 *  Add an interrupt processor affinity API to the ''Interrupt Manager Extension''.  This should use the affinity set API used for thread processor affinity.
    467 ==  Global Scheduler  ==
    468 
    469 ===  Status  ===
    470 
    471 
    472  *  Scheduler selection is a link-time configuration option.
    473  *  Basic SMP support for schedulers like processor allocation and extended thread state information.
    474  *  Two global fixed priority schedulers (G-FP) are available with SMP support.
    475 === Future Directions ===
    476 
    477 
    478  *  Add more schedulers, e.g. EDF, phase-fair, early-release fair, etc.
    479  *  Allow thread processor affinity selection for each thread.
 *  The scheduler operations have no context parameter; instead they use the global variable {{{_Scheduler}}}.  The scheduler operation signatures should be changed to take a parameter for the scheduler specific context to allow more than one scheduler in the system.  This can be used to enable scheduler hierarchies or partitioned schedulers.
 *  The scheduler locking is inconsistent.  Some scheduler operations are invoked with interrupts disabled.  Others, like the yield operations, are called without interrupts disabled and must disable interrupts locally.
    482  *  The new SMP schedulers lack test coverage.
    483  *  Implement clustered (or partitioned) scheduling with a sophisticated resource sharing protocol (e.g. [#BurnsWellings2013]).  To benefit from this the Giant lock must be eliminated.
    484 ==  Clustered Scheduling  ==
    485 
    486 ===  Status  ===
    487 
    488 
Clustered scheduling is now implemented; see the [#ClusteredScheduling1 implementation status] below.
    490 === Future Directions ===
    491 
    492 
Suppose we have a set of processors.  The set of processors can be partitioned
into non-empty, pairwise disjoint subsets.  Such a subset is called a
dispatching domain.  Dispatching domains should be set up in the initialization
phase of the system before application-level threads (and worker threads) can
be attached to them.  For now the partition must be specified by the
application at link time: this restriction is acceptable in this study project
as no change should be allowable to dispatching domains at run time.
Associated with each dispatching domain is exactly one scheduler instance.
This association (which defines the scheduler algorithm for the domain) must be
specified by the application at link time, upon or after the creation of the
dispatching domain.  So we have clustered scheduling.
    504 
    505 At any one time a thread belongs to exactly one dispatching domain.  There will
    506 be no scheduling hierarchies and no intersections between dispatching domains.
    507 If no dispatching domain is explicitly specified, then it will inherit the one
    508 from the executing context to support external code that is not aware of
    509 clustered scheduling.  This inheritance feature is erroneous and static checks
    510 should be made on the program code and link directives to prevent it from
    511 happening in this study project.
    512 
Do we need the ability to move a thread from one dispatching domain to another?
Application-level threads do not migrate.  Worker threads do not either.  The
only allowable (and desirable) migration is for the execution of
application-level threads in global resources - not the full task, only that
''protected'' execution - in accord with [#BurnsWellings2013].
    519 ==  Task Variables  ==
    520 
    521 ===  Status  ===
    522 
    523 
Task variables cannot be used on SMP.  The alternatives are thread-local
storage, POSIX keys or special access functions using the thread control block
of the executing thread.  For Newlib the access to the re-entrancy structure is
now performed via {{{__getreent()}}}, see also {{{__DYNAMIC_REENT__}}} in Newlib.  The POSIX keys and the POSIX once function are now available in all RTEMS build configurations (they no longer depend on POSIX being enabled).  Task variables have been replaced with POSIX keys for the RTEMS shell, the file system environment and the C++ support.
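A sketch of the POSIX key alternative to a task variable (all names are illustrative):

{{{
#!c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t my_key;

/* Runs in the context of the deleted or restarted thread. */
static void my_key_destructor( void *value )
{
  free( value );
}

static void my_module_initialize( void )
{
  pthread_key_create( &my_key, my_key_destructor );
}

/* Returns the thread-specific value, creating it on first use. */
static int *my_value_get( void )
{
  int *value = pthread_getspecific( my_key );

  if ( value == NULL ) {
    value = calloc( 1, sizeof( *value ) );
    pthread_setspecific( my_key, value );
  }

  return value;
}
}}}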
    528 === Future Directions ===
    529 
    530 
    531  *  Fix Ada self task reference.
    532 ==  Non-Preempt Mode for Mutual Exclusion  ==
    533 
    534 ===  Status  ===
    535 
    536 
    537 Disabling of thread preemption cannot be used to ensure mutual exclusion on SMP.
    538 The non-preempt mode is disabled on SMP and a run-time error will occur if such
    539 a mode change request is issued.  The alternatives are mutexes and condition
    540 variables.
    541 === Future Directions ===
    542 
    543 
    544  *  Add condition variables to the Classic API.
    545  *  Fix pthread_once().
    546  *  Fix the block device cache ''bdbuf''.
    547  *  Fix C++ support.
    548 ==  Thread Restart  ==
    549 
    550 ===  Status  ===
    551 
    552 
    553 The restart of threads is implemented.
    554 === Future Directions ===
    555 
    556 
The restart of threads is implemented via post-switch thread actions.  The post-switch thread actions use the per-CPU lock and have very little overhead if no post-switch actions must be performed.  Most architectures should be updated to use an interrupt epilogue similar to the SPARC one to avoid long interrupt disable times.
    558 
    559 Why execute post-switch handlers with interrupts disabled on some architectures
    560 (e.g. ARM and PowerPC)?  The most common interrupt handling sequence looks like
    561 this:
    562 
    563 [[Image(rtems-smp-isr-1.png)]]
    564 
    565 In this sequence it is not possible to enable interrupts around the
    566 {{{_Thread_Dispatch()}}} call.  This could lead to an unlimited number of interrupt
    567 contexts saved on the thread stack.  To overcome this issue some architectures
    568 use a flag variable that indicates this particular execution environment (e.g.
    569 the SPARC).  Here the sequence looks like this:
    570 
    571 [[Image(rtems-smp-isr-2.png)]]
    572 
    573 The ISR thread dispatch disable flag must be part of the thread context.  The
    574 context switch will save/restore the per-CPU ISR dispatch disable flag to/from
    575 the thread context.  Each thread stack must have space for at least two
    576 interrupt contexts.
    577 
    578 This scheme could be used for all architectures.
    579 ==  Thread Delete  ==
    580 
    581 === Status ===
    582 
    583 
    584 Deletion of threads is implemented.
    585 === Future Directions ===
    586 
    587 
    588 None.
    589 ==  Semaphores and Mutexes  ==
    590 
    591 === Status ===
    592 
    593 
The semaphore and mutex objects use {{{_Objects_Get_isr_disable()}}}.  On SMP
configurations this first acquires the Giant lock and then disables
interrupts.
    597 === Future Directions ===
    598 
    599 
    600 Use an ISR lock per object to improve the performance for uncontested
    601 operations.  See also [#GiantLockvs.FineGrainedLocking Giant Lock vs. Fine Grained Locking].
    602 =  Implementations  =
    603 
    604 ==  Tool Chain  ==
    605 
    606 ===  Binutils  ===
    607 
    608 
    609 A Binutils 2.24 or later release must be used due to the LEON3 support.
    610 ===  GCC  ===
    611 
    612 
A GCC 4.8 snapshot from 2013-11-25 (e.g. 4.8.3) or later must be used due to the LEON3 support.  The LEON3 support in GCC includes a proper C11 memory model definition for this processor and C11 atomic operations using the CAS instruction.  The backport of the LEON3 support was initiated by EB [http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02255.html].
    614 ===  GDB  ===
    615 
    616 A GDB 7.8.2 or later must be used since earlier versions have problems with the
    617 debug format generated by GCC 4.8 and later
    618 [https://sourceware.org/bugzilla/show_bug.cgi?id=16215].
    619 
    620 ==  Profiling  ==
    621 
    622 === Implementation ===
    623 
    624 
SMP lock and interrupt processing profiling is necessary to fulfill some
observability requirements.  Vital timing data can be gathered on a per-object
basis through profiling.
    628 === Status ===
    629 
    630 
    631 CPU counter support is complete [24bf11eca11947d961cc9bb5f7d92dabff169e93/rtems].  The profiling support is complete [4dad4b84112d57cf6e77409f8e267706db446ec0/rtems].
    632 === RTEMS API Changes ===
    633 
    634 
    635 None.
    636 ===  High-Performance CPU Counters  ===
    637 
    638 
In order to measure short time intervals we have to add high-performance CPU
counter support to the CPU port API.  This can also be used as a
replacement for the BSP specific benchmark timers.  It may also be used to
implement busy wait loops which are required by some device drivers.
    643 
{{{
#!c
/**
 * @brief Integer type for CPU counter values.
 */
typedef XXX CPU_counter;

/**
 * @brief Returns the current CPU counter value.
 */
CPU_counter _CPU_counter_Get( void );

/**
 * @brief Mask for arithmetic operations with the CPU counter value.
 *
 * All arithmetic operations are defined as A = ( C op B ) & MASK.
 */
CPU_counter _CPU_counter_Mask( void );

/**
 * @brief Converts a CPU counter value into nanoseconds.
 */
uint64_t _CPU_counter_To_nanoseconds( CPU_counter counter );
}}}
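Hypothetical usage of the proposed API to measure a short time interval:

{{{
#!c
/* Measure the execution time of work() in nanoseconds.  The difference
 * is masked as defined by _CPU_counter_Mask(). */
uint64_t measure_work( void ( *work )( void ) )
{
  CPU_counter begin = _CPU_counter_Get();

  ( *work )();

  return _CPU_counter_To_nanoseconds(
    ( _CPU_counter_Get() - begin ) & _CPU_counter_Mask()
  );
}
}}}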
    668 
    669 ===  SMP Lock Profiling  ===
    670 
    671 
The SMP lock profiling will be an RTEMS build-time configuration option
(RTEMS_LOCK_PROFILING).  The following statistics are proposed.
    674 
    675 {{{
    676 #!c
    677 #define SMP_LOCK_STATS_CONTENTION_COUNTS 4
    678 
    679 /**
    680  * @brief SMP lock statistics.
    681  *
    682  * The lock acquire attempt instant is the point in time right after the
    683  * interrupt disable action in the lock acquire sequence.
    684  *
    685  * The lock acquire instant is the point in time right after the lock
    686  * acquisition.  This is the begin of the critical section code execution.
    687  *
    688  * The lock release instant is the point in time right before the interrupt
    689  * enable action in the lock release sequence.
    690  *
    691  * The lock section time is the time elapsed between the lock acquire instant
    692  * and the lock release instant.
    693  *
    694  * The lock acquire time is the time elapsed between the lock acquire attempt
    695  * instant and the lock acquire instant.
    696  */
    697 struct SMP_lock_Stats {
    698 #ifdef RTEMS_LOCK_PROFILING
    699   /**
    700    * @brief The last lock acquire instant in CPU counter ticks.
    701    *
    702    * This value is used to measure the lock section time.
    703    */
    704   CPU_counter acquire_instant;
    705 
    706   /**
    707    * @brief The maximum lock section time in CPU counter ticks.
    708    */
    709   CPU_counter max_section_time;
    710 
    711   /**
    712    * @brief The maximum lock acquire time in CPU counter ticks.
    713    */
    714   CPU_counter max_acquire_time;
    715 
    716   /**
    717    * @brief The count of lock uses.
    718    *
    719    * This value may overflow.
    720    */
    721   uint64_t usage_count;
    722 
    723   /**
    724    * @brief The counts of lock acquire operations with contention.
    725    *
    726    * The contention count for index N corresponds to a lock acquire attempt
    727    * with an initial queue length of N + 1.  The last index corresponds to all
    728    * lock acquire attempts with an initial queue length greater than or equal
    729    * to SMP_LOCK_STATS_CONTENTION_COUNTS.
    730    *
    731    * The values may overflow.
    732    */
    733   uint64_t contention_counts[SMP_LOCK_STATS_CONTENTION_COUNTS];
    734 
    735   /**
    736    * @brief Total lock section time in CPU counter ticks.
    737    *
    738    * The average lock section time is the total section time divided by the
    739    * lock usage count.
    740    *
    741    * This value may overflow.
    742    */
    743   uint64_t total_section_time;
    744 #endif /* RTEMS_LOCK_PROFILING */
};
    746 
    747 struct SMP_lock_Control {
    748   ... lock data ...
    749   SMP_lock_Stats Stats;
    750 };
    751 }}}
    752 
    753 A function should be added to monitor the lock contention.
    754 
    755 {{{
    756 #!c
    757 /**
    758  * @brief Called in case of lock contention.
    759  *
    760  * @param[in] counter The spin loop iteration counter.
    761  */
    762 void _SMP_lock_Contention_monitor(
    763   const SMP_lock_Control *lock,
    764   int counter
    765 );
    766 }}}
    767 
    768 A ticket lock can then look like this:
    769 
    770 {{{
    771 #!c
    772 void acquire(struct ticket *t)
    773 {
    774         unsigned int my_ticket = atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);
    775 #ifdef RTEMS_LOCK_PROFILING
    776         int counter = 0;
    777 #endif /* RTEMS_LOCK_PROFILING */
    778 
    779         while (atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket) {
    780 #ifdef RTEMS_LOCK_PROFILING
    781                 ++counter;
    782                 _SMP_lock_Contention_monitor(t, counter);
    783 #endif /* RTEMS_LOCK_PROFILING */
    784         }
    785 }
    786 }}}
    787 
SMP lock statistics can be evaluated using the following method.
    789 
    790 {{{
    791 #!c
    792 typedef void ( *SMP_lock_Visitor )(
    793   void *arg,
    794   SMP_lock_Control *lock,
    795   SMP_lock_Class lock_class,
    796   Objects_Name lock_name
    797 );
    798 
    799 /**
    800  * @brief Iterates through all system SMP locks and invokes the visitor for
    801  * each lock.
    802  */
    803 void _SMP_lock_Iterate( SMP_lock_Visitor visitor, void *arg );
    804 }}}
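A hypothetical visitor using this iteration method to print some of the proposed statistics:

{{{
#!c
#include <inttypes.h>
#include <stdio.h>

static void print_lock_stats(
  void *arg,
  SMP_lock_Control *lock,
  SMP_lock_Class lock_class,
  Objects_Name lock_name
)
{
  (void) arg;
  (void) lock_class;
  (void) lock_name;

#ifdef RTEMS_LOCK_PROFILING
  printf(
    "lock %p: %" PRIu64 " uses, max section time %" PRIu64 " ticks\n",
    (void *) lock,
    lock->Stats.usage_count,
    (uint64_t) lock->Stats.max_section_time
  );
#endif /* RTEMS_LOCK_PROFILING */
}

/* Usage: _SMP_lock_Iterate( print_lock_stats, NULL ); */
}}}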
    805 
    806 ===  Interrupt and Thread Profiling  ===
    807 
    808 
The interrupt and thread profiling will be an RTEMS build-time configuration
option (RTEMS_INTERRUPT_AND_THREAD_PROFILING).
    811 
The time spent on interrupts and the time of disabled thread dispatching should
be monitored per-processor.  The time between the interrupt recognition by the
processor and the actual start of the interrupt handler code execution should
be monitored per-processor if the hardware supports this.
    816 
    817 {{{
    818 #!c
    819 /**
    820  * @brief Per-CPU statistics.
    821  */
    822 struct Per_CPU_Stats {
    823 #ifdef RTEMS_INTERRUPT_AND_THREAD_PROFILING
    824   /**
    825    * @brief The thread dispatch disabled begin instant in CPU counter ticks.
    826    *
    827    * This value is used to measure the time of disabled thread dispatching.
    828    */
    829   CPU_counter thread_dispatch_disabled_instant;
    830 
    831   /**
    832    * @brief The last outer-most interrupt begin instant in CPU counter ticks.
    833    *
    834    * This value is used to measure the interrupt processing time.
    835    */
    836   CPU_counter outer_most_interrupt_instant;
    837 
    838   /**
    839    * @brief The maximum interrupt delay in CPU counter ticks if supported by
    840    * the hardware.
    841    */
    842   CPU_counter max_interrupt_delay;
    843 
    844   /**
    845    * @brief The maximum time of disabled thread dispatching in CPU counter
    846    * ticks.
    847    */
    848   CPU_counter max_thread_dispatch_disabled_time;
    849 
    850   /**
    851    * @brief Count of times when the thread dispatch disable level changes from
    852    * zero to one in thread context.
    853    *
    854    * This value may overflow.
    855    */
    856   uint64_t thread_dispatch_disabled_count;
    857 
    858   /**
    859    * @brief Total time of disabled thread dispatching in CPU counter ticks.
    860    *
    861    * The average time of disabled thread dispatching is the total time of
    862    * disabled thread dispatching divided by the thread dispatch disabled
    863    * count.
    864    *
    865    * This value may overflow.
    866    */
    867   uint64_t total_thread_dispatch_disabled_time;
    868 
    869   /**
    870    * @brief Count of times when the interrupt nest level changes from zero to
    871    * one.
    872    *
    873    * This value may overflow.
    874    */
    875   uint64_t interrupt_count;
    876 
    877   /**
    878    * @brief Total time of interrupt processing in CPU counter ticks.
    879    *
    880    * The average time of interrupt processing is the total time of interrupt
    881    * processing divided by the interrupt count.
    882    *
    883    * This value may overflow.
    884    */
    885   uint64_t total_interrupt_time;
    886 #endif /* RTEMS_INTERRUPT_AND_THREAD_PROFILING */
};
    888 
    889 struct Per_CPU_Control {
    890   ... per-CPU data ...
    891   Per_CPU_Stats Stats;
    892 };
    893 }}}
    894 
    895 ==  Interrupt Support  ==
    896 
    897 === Implementation ===
    898 
    899 
    900 Applications should be able to distribute the interrupt load throughout the
    901 system.  In combination with partitioned/clustered scheduling this can reduce
    902 the amount of inter-processor synchronization and thread migrations.
    903 === Status ===
    904 
    905 
    906 This is TBD.
    907 === RTEMS API Changes ===
    908 
    909 
Each interrupt needs a processor affinity set in the RTEMS SMP configuration.
The rtems_interrupt_handler_install() function will not alter the processor
affinity set of the interrupt vector.  At system start-up all interrupts except
the inter-processor interrupts must be initialized to have an affinity with the
initialization processor only.
    915 
    916 Two new functions should be added to alter and retrieve the processor affinity
    917 sets of interrupt vectors.
    918 
    919 {{{
    920 #!c
    921 /**
    922  * @brief Sets the processor affinity set of an interrupt vector.
    923  *
    924  * @param[in] vector The interrupt vector number.
    925  * @param[in] affinity_set_size Size of the specified affinity set buffer in
    926  * bytes.  This value must be positive.
    927  * @param[in] affinity_set The new processor affinity set for the interrupt
    928  * vector.  This pointer must not be @c NULL.  A set bit in the affinity set
    929  * means that the interrupt can occur on this processor and a cleared bit
    930  * means the opposite.
    931  *
    932  * @retval RTEMS_SUCCESSFUL Successful operation.
    933  * @retval RTEMS_INVALID_ID The vector number is invalid.
    934  * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
    935  */
    936 rtems_status_code rtems_interrupt_set_affinity(
    937   rtems_vector vector,
    938   size_t affinity_set_size,
    939   const cpu_set_t *affinity_set
    940 );
    941 
    942 /**
    943  * @brief Gets the processor affinity set of an interrupt vector.
    944  *
    945  * @param[in] vector The interrupt vector number.
    946  * @param[in] affinity_set_size Size of the specified affinity set buffer in
    947  * bytes.  This value must be positive.
    948  * @param[out] affinity_set The current processor affinity set of the
    949  * interrupt vector.  This pointer must not be @c NULL.  A set bit in the
    950  * affinity set means that the interrupt can occur on this processor and a
    951  * cleared bit means the opposite.
    952  *
    953  * @retval RTEMS_SUCCESSFUL Successful operation.
    954  * @retval RTEMS_INVALID_ID The vector number is invalid.
    955  * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
    956  * current processor affinity set of the interrupt vector.
    957  */
    958 rtems_status_code rtems_interrupt_get_affinity(
    959   rtems_vector vector,
    960   size_t affinity_set_size,
    961   cpu_set_t *affinity_set
    962 );
    963 }}}
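Hypothetical usage of the proposed functions to route an interrupt vector to processor 1:

{{{
#!c
#include <rtems.h>
#include <sys/cpuset.h>

rtems_status_code route_interrupt_to_processor_one( rtems_vector vector )
{
  cpu_set_t affinity_set;

  CPU_ZERO( &affinity_set );
  CPU_SET( 1, &affinity_set );

  return rtems_interrupt_set_affinity(
    vector,
    sizeof( affinity_set ),
    &affinity_set
  );
}
}}}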
    964 
    965 ==  Clustered Scheduling  ==
    966 
    967 === Implementation ===
    968 
    969 
Partitioned/clustered scheduling helps to control the worst-case latencies in
the system.  The goal is to reduce the amount of shared state in the system and
thus prevent lock contention.  Modern multi-processor systems tend to
have several layers of data and instruction caches.  With partitioned/clustered
scheduling it is possible to honor the cache topology of a system and thus
avoid expensive cache synchronization traffic.
    976 === Status ===
    977 
    978 
    979 Support for clustered/partitioned scheduling is complete [c5831a3f9af11228dbdaabaf01f69d37e55684ef/rtems].
    980 === RTEMS API Changes ===
    981 
    982 
    983 Functions for scheduler management.
    984 
    985 {{{
    986 #!c
    987 /**
    988  * @brief Identifies a scheduler by its name.
    989  *
    990  * The scheduler name is determined by the scheduler configuration.
    991  *
    992  * @param[in] name The scheduler name.
    993  * @param[out] scheduler_id The scheduler identifier associated with the name.
    994  *
    995  * @retval RTEMS_SUCCESSFUL Successful operation.
    996  * @retval RTEMS_INVALID_NAME Invalid scheduler name.
    997  */
    998 rtems_status_code rtems_scheduler_ident(
    999   rtems_name name,
    1000   rtems_id *scheduler_id
    1001 );
    1002 
    1003 /**
    1004  * @brief Gets the set of processors owned by the scheduler.
    1005  *
    1006  * @param[in] scheduler_id Identifier of the scheduler.
    1007  * @param[in] processor_set_size Size of the specified processor set buffer in
    1008  * bytes.  This value must be positive.
    1009  * @param[out] processor_set The processor set owned by the scheduler.  This
    1010  * pointer must not be @c NULL.  A set bit in the processor set means that
    1011  * this processor is owned by the scheduler and a cleared bit means the
    1012  * opposite.
    1013  *
    1014  * @retval RTEMS_SUCCESSFUL Successful operation.
    1015  * @retval RTEMS_INVALID_ID Invalid scheduler identifier.
    1016  * @retval RTEMS_INVALID_CPU_SET The processor set buffer is too small for the
    1017  * set of processors owned by the scheduler.
    1018  */
    1019 rtems_status_code rtems_scheduler_get_processors(
    1020   rtems_id scheduler_id,
    1021   size_t processor_set_size,
    1022   cpu_set_t *processor_set
    1023 );
    1024 }}}
    1025 
    1026 Each thread needs a processor affinity set in the RTEMS SMP configuration.  The
    1027 rtems_task_create() function will use the processor affinity set of the
    1028 executing thread to initialize the processor affinity set of the created
    1029 task.  This enables backward compatibility for existing software.
    1030 
    1031 Two new functions should be added to alter and retrieve the processor affinity
    1032 sets of tasks.
    1033 
    1034 {{{
    1035 #!c
    1036 /**
    1037  * @brief Sets the processor affinity set of a task.
    1038  *
    1039  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
    1040  * the executing task.
    1041  * @param[in] affinity_set_size Size of the specified affinity set buffer in
    1042  * bytes.  This value must be positive.
    1043  * @param[in] affinity_set The new processor affinity set for the task.  This
    1044  * pointer must not be @c NULL.  A set bit in the affinity set means that the
    1045  * task can execute on this processor and a cleared bit means the opposite.
    1046  *
    1047  * @retval RTEMS_SUCCESSFUL Successful operation.
    1048  * @retval RTEMS_INVALID_ID Invalid task identifier.
    1049  * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
    1050  */
    1051 rtems_status_code rtems_task_set_affinity(
    1052   rtems_id task_id,
    1053   size_t affinity_set_size,
    1054   const cpu_set_t *affinity_set
    1055 );
    1056 
    1057 /**
    1058  * @brief Gets the processor affinity set of a task.
    1059  *
    1060  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
    1061  * the executing task.
    1062  * @param[in] affinity_set_size Size of the specified affinity set buffer in
    1063  * bytes.  This value must be positive.
    1064  * @param[out] affinity_set The current processor affinity set of the task.
    1065  * This pointer must not be @c NULL.  A set bit in the affinity set means that
    1066  * the task can execute on this processor and a cleared bit means the
    1067  * opposite.
    1068  *
    1069  * @retval RTEMS_SUCCESSFUL Successful operation.
    1070  * @retval RTEMS_INVALID_ID Invalid task identifier.
    1071  * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
    1072  * current processor affinity set of the task.
    1073  */
    1074 rtems_status_code rtems_task_get_affinity(
    1075   rtems_id task_id,
    1076   size_t affinity_set_size,
    1077   cpu_set_t *affinity_set
    1078 );
    1079 }}}
    1080 
    1081 Two new functions should be added to alter and retrieve the scheduler of tasks.
    1082 
    1083 {{{
    1084 #!c
    1085 /**
    1086  * @brief Sets the scheduler of a task.
    1087  *
    1088  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
    1089  * the executing task.
    1090  * @param[in] scheduler_id Identifier of the scheduler.
    1091  *
    1092  * @retval RTEMS_SUCCESSFUL Successful operation.
    1093  * @retval RTEMS_INVALID_ID Invalid task identifier.
    1094  * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier.
    1095  *
    1096  * @see rtems_scheduler_ident().
    1097  */
    1098 rtems_status_code rtems_task_set_scheduler(
    1099   rtems_id task_id,
    1100   rtems_id scheduler_id
    1101 );
    1102 
    1103 /**
    1104  * @brief Gets the scheduler of a task.
    1105  *
    1106  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
    1107  * the executing task.
    1108  * @param[out] scheduler_id Identifier of the scheduler.
    1109  *
    1110  * @retval RTEMS_SUCCESSFUL Successful operation.
    1111  * @retval RTEMS_INVALID_ID Invalid task identifier.
    1112  */
    1113 rtems_status_code rtems_task_get_scheduler(
    1114   rtems_id task_id,
    1115   rtems_id *scheduler_id
    1116 );
    1117 }}}
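Hypothetical usage of the proposed functions to move a task to the scheduler instance named ''FP1'' (the scheduler name is an assumption of this example):

{{{
#!c
#include <rtems.h>

rtems_status_code move_task_to_fp1( rtems_id task_id )
{
  rtems_id scheduler_id;
  rtems_status_code sc;

  sc = rtems_scheduler_ident(
    rtems_build_name( ' ', 'F', 'P', '1' ),
    &scheduler_id
  );
  if ( sc != RTEMS_SUCCESSFUL ) {
    return sc;
  }

  return rtems_task_set_scheduler( task_id, scheduler_id );
}
}}}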
    1118 
    1119 ===  Scheduler Configuration  ===
    1120 
    1121 
There are two options for the scheduler instance configuration:

 1. static configuration by means of global data structures, and
 2. configuration at run time via function calls.
    1126 
For a configuration at run time the system must start with a default scheduler.
The global constructors are called in this environment.  The order of global
constructor invocation is unpredictable, so it is difficult to create threads
in this context since the run-time scheduler configuration may not exist yet.
Since scheduler data structures are allocated from the workspace, the
configuration must take a later run-time setup of schedulers into account for
the workspace size estimate.  In case the default scheduler is not appropriate
it must be replaced, which gives rise to some implementation difficulties.
Since the processor availability is determined by hardware constraints, it is
unclear which benefits a run-time configuration has.  For now run-time
configuration of scheduler instances will not be implemented.
    1138 
    1139 The focus is now on static configuration.  Every scheduler needs a control
    1140 context.  The scheduler API must provide a macro which creates a global
    1141 scheduler instance specific data structure with a designator name as a
    1142 mandatory parameter.  The scheduler instance creation macro may require
    1143 additional scheduler specific configuration options.  For example a
    1144 fixed-priority scheduler instance must know the maximum priority level to
    1145 allocate the ready chain control table.
    1146 
    1147 Once the scheduler instances are configured it must be specified for each
    1148 processor in the system which scheduler instance owns this processor or if this
    1149 processor is not used by the RTEMS system.
    1150 
For each processor except the initialization processor a scheduler instance is
optional so that other operating systems can run independently of this RTEMS
system on this processor.  It is a fatal error to omit a scheduler instance for
    1154 the initialization processor.  The initialization processor is the processor
    1155 which executes the boot_card() function.
    1156 
    1157 {{{
    1158 #!c
    1159 /**
    1160  * @brief Processor configuration.
    1161  *
    1162  * Use RTEMS_CPU_CONFIG_INIT() to initialize this structure.
    1163  */
    1164 typedef struct {
    1165   /**
    1166    * @brief Scheduler instance for this processor.
    1167    *
    1168    * It is possible to omit a scheduler instance for this processor by using
    1169    * the @c NULL pointer.  In this case RTEMS will not use this processor and
    1170    * other operating systems may claim it.
    1171    */
    1172   Scheduler_Control *scheduler;
    1173 } rtems_cpu_config;
    1174 
    1175 /**
    1176  * @brief Processor configuration initializer.
    1177  *
    1178  * @param scheduler The reference to a scheduler instance or @c NULL.
    1179  *
    1180  * @see rtems_cpu_config.
    1181  */
    1182 #define RTEMS_CPU_CONFIG_INIT(scheduler) \
    1183   { ( scheduler ) }
    1184 }}}
    1185 
    1186 Scheduler and processor configuration example:
    1187 
    1188 {{{
    1189 #!c
    1190 RTEMS_SCHED_DEFINE_FP_SMP(fp0, rtems_build_name(' ', 'F', 'P', '0'), 256);
    1191 RTEMS_SCHED_DEFINE_FP_SMP(fp1, rtems_build_name(' ', 'F', 'P', '1'), 64);
    1192 RTEMS_SCHED_DEFINE_EDF_SMP(edf0, rtems_build_name('E', 'D', 'F', '0'));
    1193 
    1194 const rtems_cpu_config rtems_cpu_config_table[] = {
    1195   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp0)),
    1196   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
    1197   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
    1198   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
    1199   RTEMS_CPU_CONFIG_INIT(NULL),
    1200   RTEMS_CPU_CONFIG_INIT(NULL),
    1201   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0)),
  RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0))
};

const size_t rtems_cpu_config_count =
  RTEMS_ARRAY_SIZE(rtems_cpu_config_table);
    1208 }}}
    1209 
An alternative to the processor configuration table would be to specify in the
scheduler instance which processors are owned by the instance.  This would
require a static initialization of CPU sets, which is difficult.  Also the
schedulers have to be registered somewhere, so some sort of table is needed
anyway.  Since a processor can be owned by at most one scheduler instance, this
configuration approach introduces an additional error source which is avoided
by the processor configuration table.
    1217 === Scheduler Implementation  ===
    1218 
    1219 
    1220 Currently the scheduler operations have no control context and use global
    1221 variables instead.  Thus the scheduler operations signatures must change to use
    1222 a scheduler control context as the first parameter, e.g.
    1223 
{{{
#!c
typedef struct Scheduler_Control Scheduler_Control;

typedef struct {
  [...]
  void ( *set_affinity )(
    Scheduler_Control *self,
    Thread_Control *thread,
    size_t affinity_set_size,
    const cpu_set_t *affinity_set
  );
  [...]
} Scheduler_Operations;

/**
 * @brief General scheduler control.
 */
struct Scheduler_Control {
  /**
   * @brief The scheduler operations.
   */
  Scheduler_Operations Operations;

  /**
   * @brief Size of the owned processor set in bytes.
   */
  size_t owned_cpu_set_size;

  /**
   * @brief Reference to the owned processor set.
   *
   * A set bit means this processor is owned by this scheduler instance, a
   * cleared bit means the opposite.
   */
  cpu_set_t *owned_cpu_set;
};
}}}

Single-processor configurations also benefit from this change, since it makes all dependencies explicit and easier to access (allowing more efficient machine code).
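
As a minimal sketch of the resulting call style (the wrapper name below is hypothetical), an operation is reached through the scheduler control passed as the first parameter instead of through global variables:

{{{
#!c
/* Hypothetical wrapper: all scheduler state is reachable via the first
 * parameter, so no global variables are involved. */
static inline void _Scheduler_Set_affinity(
  Scheduler_Control *scheduler,
  Thread_Control *thread,
  size_t affinity_set_size,
  const cpu_set_t *affinity_set
)
{
  ( *scheduler->Operations.set_affinity )(
    scheduler,
    thread,
    affinity_set_size,
    affinity_set
  );
}
}}}
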
==  Multiprocessor Resource Sharing Protocol - MrsP  ==

=== Implementation ===


In general, application-level threads are not independent and may indeed share logical resources. In a partitioned-scheduling system, where capacity allows, resources are allocated on the same processor as the threads that share them: such resources are termed ''local''.  Where needed, resources may also reside on processors other than those of (some of) their sharing threads: such resources are termed ''global''.

For partitioned scheduling of application-level threads and local resources, two choices are possible which meet the ITT requirement of achieving predictable timing behaviour for platform software:
 *  (1) fixed-priority scheduling with the immediate priority ceiling protocol for controlling access to local resources, or
 *  (2) EDF scheduling with the stack resource protocol for controlling access to local resources.

Choice (1) is more natural with RTEMS. Both alternatives require preemption, whose disruption to the thread's cache working set may in part be attenuated by the use of techniques known as limited preemption, which have been successfully demonstrated by the UoP. Conversely, run-to-completion (non-preemptive) semantics is known to achieve much lower schedulable utilization, which makes it an implausible candidate for performance-hungry systems. The use of fixed-priority scheduling for each processor where application-level threads locally run allows response time analysis to be used. On the basis of the worst-case execution time of individual threads and of the graph of resource usage among threads, this analysis determines the worst-case completion time of every individual thread run on that processor, offering absolute guarantees on the schedulable utilization of every individual processor that uses that algorithm.

The determination of the worst-case execution time of software programs running on one processor of an SMP is significantly more complex than its single-processor analogue.  This is because every local execution suffers massive interference effects from its co-runners on other processors, regardless of functional independence.  Simplistically upper-bounding the possible interference incurs excessive pessimism, which causes a massive reduction in the allowable system load.  More accurate interference analysis may incur prohibitive costs unless simplifying assumptions can be safely made, one of which is strictly static partitioning: previous studies run by ESA [http://microelectronics.esa.int/ngmp/MulticoreOSBenchmark-FinalReport_v7.pdf] have indicated possible approaches to that problem.  Global scheduling greatly exacerbates the difficulty of that problem and thus is unwelcome.

With fixed-priority scheduling, each logical resource shared by multiple application-level threads (or its corresponding lock) must be statically attached a ceiling priority attribute computed as an upper bound of the static priority of all threads that may use that resource: when an application-level thread acquires the resource lock, the priority of that thread is raised to the ceiling of the resource; upon relinquishing the lock, the thread must return to the priority level that it had prior to acquiring it.  (It has been shown that an implementation of this protocol does not need resource locks at all, as the mere fact that a thread is running implies that all of the resources it may want to use are free at this time and none can be claimed by lower-priority threads.  However, it may be easier for an operating system to attach the ceiling value to the resource lock than to any other data structure.)  This simple protocol allows application-level threads to acquire multiple resources without risking deadlock.

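A minimal sketch of these ceiling mechanics follows; all names are hypothetical, and nested resource acquisition (which would need a stack of previous priorities) is ignored.

{{{
#!c
/* Hedged sketch of the immediate priority ceiling protocol.  The ceiling
 * is a static attribute of the resource. */
typedef struct {
  rtems_task_priority ceiling_priority;
  rtems_task_priority previous_priority;
} resource_lock;

void resource_obtain( resource_lock *r, Thread_Control *executing )
{
  /* Remember the current priority so it can be restored on release */
  r->previous_priority = thread_get_priority( executing );

  /* Raise the thread to the ceiling before it enters the resource */
  thread_set_priority( executing, r->ceiling_priority );
}

void resource_release( resource_lock *r, Thread_Control *executing )
{
  /* Return to the priority level held prior to acquiring the lock */
  thread_set_priority( executing, r->previous_priority );
}
}}}
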
When global resources are used, which is very desirable to alleviate the complexity of task allocation, the resource access control protocol in use, whether (1) or (2) in the earlier taxonomy, must be specialized so that global resources can be syntactically told apart from local resources.  Luckily, two solutions have been proposed recently to address this problem.  One solution, known as Optimal Migratory Priority Inheritance (OMIP), was proposed by Björn Brandenburg in his work entitled ''A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications'', presented at the ECRTS conference in July 2013.  The other solution, known as the Multiprocessor Resource Sharing Protocol (MrsP), was proposed by Alan Burns and Andy Wellings in their work entitled ''A Schedulability Compatible Multiprocessor Resource Sharing Protocol'', presented at the very same ECRTS 2013 conference.  OMIP aims to solve the problem of guaranteed bounded blocking in global resource sharing under clustered scheduling on a multiprocessor, where a cluster is a collection of at least one processor.  MrsP aims to achieve the smallest bound on the cumulative duration of blocking suffered by threads waiting to access a global resource under partitioned scheduling (clusters with exactly one processor), while allowing schedulability analysis to be performed per processor, using response time analysis, a technique fairly well known to industry.  No other global resource sharing protocol on SMP, including OMIP, is able to guarantee that.  For this reason MrsP is the choice for this study.

The MrsP protocol requires that
 *  (a) threads waiting for global resources spin at ceiling priority on their local processor, with a ceiling value greater than the ceiling of any local resource on that processor, and
 *  (b) the execution within the global resource may migrate to other processors where application-level threads waiting to access that resource are spinning.

Feature (a) prevents lower-priority threads from running in preference to the waiting higher-priority thread and stealing resources that it might want to use in the future as part of the current execution; should that stealing happen, the blocking penalty potentially suffered on access to global resources would skyrocket to untenable levels.
Feature (b) brings in the sole welcome extent of migration in the proposed model; it is useful when higher-priority tasks running on the processor of the global resource holder prevent the holder from completing its execution.  In that case, the slack allowed for by local spinning on the other processors where threads are waiting is used to speed up the completion of the execution within the global resource and therefore reduce blocking.
=== Status ===


Implementation is complete [5c3d2509593476869e791111cd3d93cc1e840b3a/rtems].
=== RTEMS API Changes ===


A new semaphore attribute enables MrsP.

{{{
#!c
/**
 * @brief Semaphore attribute to select the multiprocessor resource sharing
 * protocol MrsP.
 *
 * This attribute is mutually exclusive with RTEMS_PRIORITY_CEILING and
 * RTEMS_INHERIT_PRIORITY.
 */
#define RTEMS_MULTIPROCESSOR_RESOURCE_SHARING 0x00000100
}}}

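A hedged usage sketch: assuming the new attribute combines with RTEMS_BINARY_SEMAPHORE and RTEMS_PRIORITY in the same way as the existing RTEMS_PRIORITY_CEILING attribute (an assumption of this sketch), a MrsP semaphore could be created as follows.

{{{
#!c
#include <rtems.h>
#include <assert.h>

/* Hedged sketch: create a MrsP semaphore with an initial priority ceiling.
 * The attribute combination mirrors the existing ceiling protocol usage
 * and is an assumption of this sketch. */
rtems_id create_mrsp_semaphore( rtems_task_priority ceiling )
{
  rtems_id id;
  rtems_status_code sc = rtems_semaphore_create(
    rtems_build_name( 'M', 'R', 'S', 'P' ),
    1,
    RTEMS_MULTIPROCESSOR_RESOURCE_SHARING
      | RTEMS_BINARY_SEMAPHORE
      | RTEMS_PRIORITY,
    ceiling,
    &id
  );
  assert( sc == RTEMS_SUCCESSFUL );

  return id;
}
}}}
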
For MrsP we need the ability to specify the priority ceilings per scheduler domain.

{{{
#!c
typedef struct {
  rtems_id scheduler_id;
  rtems_task_priority priority;
} rtems_task_priority_by_scheduler;

/**
 * @brief Sets the priority ceilings per scheduler for a semaphore with
 * priority ceiling protocol.
 *
 * @param[in] semaphore_id Identifier of the semaphore.
 * @param[in] priority_ceilings A table with priority ceilings by scheduler.
 * In case one scheduler appears multiple times, the setting with the highest
 * index will be used.  This semaphore object is then bound to the specified
 * scheduler domains.  It is an error to use this semaphore object on other
 * scheduler domains.  The specified schedulers must be compatible, e.g.
 * migration from one scheduler domain to another must be defined.
 * @param[in] priority_ceilings_count Count of priority ceilings by scheduler
 * pairs in the table.
 *
 * @retval RTEMS_SUCCESSFUL Successful operation.
 * @retval RTEMS_INVALID_ID Invalid semaphore identifier.
 * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier in the table.
 * @retval RTEMS_INVALID_PRIORITY Invalid task priority in the table.
 */
rtems_status_code rtems_semaphore_set_priority_ceilings(
  rtems_id semaphore_id,
  const rtems_task_priority_by_scheduler *priority_ceilings,
  size_t priority_ceilings_count
);
}}}

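A hedged usage sketch of the proposed directive, reusing the scheduler names from the configuration example above and obtaining the scheduler identifiers with rtems_scheduler_ident(); the ceiling values are arbitrary.

{{{
#!c
#include <rtems.h>
#include <assert.h>

/* Hedged sketch: bind a MrsP semaphore to the FP0 and FP1 scheduler
 * domains with per-domain priority ceilings. */
void set_ceilings( rtems_id semaphore_id )
{
  rtems_id fp0_id;
  rtems_id fp1_id;
  rtems_status_code sc;

  sc = rtems_scheduler_ident( rtems_build_name( ' ', 'F', 'P', '0' ), &fp0_id );
  assert( sc == RTEMS_SUCCESSFUL );

  sc = rtems_scheduler_ident( rtems_build_name( ' ', 'F', 'P', '1' ), &fp1_id );
  assert( sc == RTEMS_SUCCESSFUL );

  const rtems_task_priority_by_scheduler ceilings[] = {
    { fp0_id, 3 },  /* ceiling in the FP0 scheduler domain */
    { fp1_id, 5 }   /* ceiling in the FP1 scheduler domain */
  };

  sc = rtems_semaphore_set_priority_ceilings(
    semaphore_id,
    ceilings,
    RTEMS_ARRAY_SIZE( ceilings )
  );
  assert( sc == RTEMS_SUCCESSFUL );
}
}}}
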
===  Implementation Details  ===


The critical part of MrsP is the migration of the lock holder in case of preemption by a higher-priority thread.  A post-switch action is used to detect this event.  The post-switch action will remove the thread from its current scheduler domain and add it to the scheduler domain of the first waiting thread that is executing.  A resource release will remove this thread from the temporary scheduler domain and move it back to the original scheduler domain.
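
A hedged sketch of the post-switch action (all names are hypothetical):

{{{
#!c
/* Hedged sketch of the MrsP lock holder migration performed as a
 * post-switch action after the holder was preempted. */
void mrsp_migrate_holder( Thread_Control *holder, MRSP_Control *mrsp )
{
  /* Find the first waiting thread that is currently executing, i.e. one
   * that spins at ceiling priority on its own processor */
  Thread_Control *spinning = mrsp_first_executing_waiter( mrsp );

  if ( spinning != NULL ) {
    /* Remove the preempted holder from its current scheduler domain ... */
    scheduler_remove( holder->scheduler, holder );

    /* ... and let it continue in the domain of the spinning waiter, whose
     * busy waiting provides the processor time */
    scheduler_add( spinning->scheduler, holder );
  }
}
}}}
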
==  Fine Grained Locking  ==

=== Implementation ===


Fine-grained locking is of utmost importance to get a scalable operating system that can guarantee reasonable worst-case latencies.  With the current Giant lock in RTEMS the worst-case latency of every operating system service increases with each processor added to the system.  Since the Giant lock state is shared among all processors, a huge cache synchronization overhead contributes to the worst-case latencies.  Fine-grained locking in combination with partitioned/clustered scheduling helps to avoid these problems, since the operating system state is then distributed, allowing true parallelism of independent components.
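
Fine-grained locking needs a cheap per-object lock; a minimal sketch of a ticket spin lock with C11 atomics (all names are hypothetical) provides mutual exclusion with FIFO ordering:

{{{
#!c
#include <stdatomic.h>

/* Hedged sketch of a ticket spin lock: mutual exclusion with FIFO
 * ordering, suitable as a per-object lock. */
typedef struct {
  atomic_uint next_ticket;
  atomic_uint now_serving;
} ticket_lock;

static void ticket_lock_acquire( ticket_lock *lock )
{
  /* Draw a ticket; the fetch-and-add establishes the FIFO order */
  unsigned int my_ticket =
    atomic_fetch_add_explicit( &lock->next_ticket, 1U, memory_order_relaxed );

  /* Spin until it is our turn; the acquire load pairs with the release
   * store in ticket_lock_release() */
  while (
    atomic_load_explicit( &lock->now_serving, memory_order_acquire )
      != my_ticket
  ) {
    /* busy wait */
  }
}

static void ticket_lock_release( ticket_lock *lock )
{
  /* Only the lock holder updates now_serving, so a plain increment via
   * load and store is sufficient */
  unsigned int next_serving =
    atomic_load_explicit( &lock->now_serving, memory_order_relaxed ) + 1U;
  atomic_store_explicit( &lock->now_serving, next_serving, memory_order_release );
}
}}}
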
=== Status ===


This is TBD.
=== RTEMS API Changes ===


None.
===  Locking Protocol Analysis  ===


As a sample operating system operation, the existing mutex obtain/release/timeout sequences will be analyzed.  All ISR disable/enable points (highlighted with colors) must be turned into an appropriate SMP lock (e.g. a ticket spin lock).  One goal is that an uncontested mutex obtain will use no SMP locks except the one associated with the mutex object itself.  Is this possible with the current structure?

{{{
#!html
<pre>
mutex_obtain(id, wait, timeout):
        <span style="color:red">level = ISR_disable()</span>
        mtx = mutex_get(id)
        executing = get_executing_thread()
        wait_control = executing.get_wait_control()
        wait_control.set_status(SUCCESS)
        if !mtx.is_locked():
                mtx.lock(executing)
                if mtx.use_ceiling_protocol():
                        thread_dispatch_disable()
                        <span style="color:red">ISR_enable(level)</span>
                        executing.boost_priority(mtx.get_ceiling())
                        thread_dispatch_enable()
                else:
                        <span style="color:red">ISR_enable(level)</span>
        else if mtx.is_holder(executing):
                mtx.increment_nest_level()
                <span style="color:red">ISR_enable(level)</span>
        else if !wait:
                <span style="color:red">ISR_enable(level)</span>
                wait_control.set_status(UNSATISFIED)
        else:
                wait_queue = mtx.get_wait_queue()
                wait_queue.set_sync_status(NOTHING_HAPPENED)
                executing.set_wait_queue(wait_queue)
                thread_dispatch_disable()
                <span style="color:red">ISR_enable(level)</span>
                if mtx.use_inherit_priority():
                        mtx.get_holder().boost_priority(executing.get_priority())
                <span style="color:fuchsia">level = ISR_disable()</span>
                if executing.is_ready():
                        executing.set_state(MUTEX_BLOCKING_STATE)
                        scheduler_block(executing)
                else:
                        executing.add_state(MUTEX_BLOCKING_STATE)
                <span style="color:fuchsia">ISR_enable(level)</span>
                if timeout:
                        timer_start(timeout, executing, mtx)
                <span style="color:blue">level = ISR_disable()</span>
                search_thread = wait_queue.first()
                while search_thread != wait_queue.tail():
                        if executing.priority() <= search_thread.priority():
                                break
                        <span style="color:blue">ISR_enable(level)</span>
                        <span style="color:blue">level = ISR_disable()</span>
                        if search_thread.is_state_set(MUTEX_BLOCKING_STATE):
                                search_thread = search_thread.next()
                        else:
                                search_thread = wait_queue.first()
                sync_status = wait_queue.get_sync_status()
                if sync_status == NOTHING_HAPPENED:
                        wait_queue.set_sync_status(SYNCHRONIZED)
                        wait_queue.enqueue(search_thread, executing)
                        executing.set_wait_queue(wait_queue)
                        <span style="color:blue">ISR_enable(level)</span>
                else:
                        executing.set_wait_queue(NULL)
                        if executing.is_timer_active():
                                executing.deactivate_timer()
                                <span style="color:blue">ISR_enable(level)</span>
                                executing.remove_timer()
                        else:
                                <span style="color:blue">ISR_enable(level)</span>
                        <span style="color:fuchsia">level = ISR_disable()</span>
                        if executing.is_state_set(MUTEX_BLOCKING_STATE):
                                executing.clear_state(MUTEX_BLOCKING_STATE)
                                if executing.is_ready():
                                        scheduler_unblock(executing)
                        <span style="color:fuchsia">ISR_enable(level)</span>
                thread_dispatch_enable()
        return wait_control.get_status()

mutex_release(id):
        thread_dispatch_disable()
        mtx = mutex_get(id)
        executing = get_executing_thread()
        nest_level = mtx.decrement_nest_level()
        if nest_level == 0:
                if mtx.use_ceiling_protocol() or mtx.use_inherit_priority():
                        executing.restore_priority()
                wait_queue = mtx.get_wait_queue()
                thread = NULL
                <span style="color:red">level = ISR_disable()</span>
                thread = wait_queue.dequeue()
                if thread != NULL:
                        thread.set_wait_queue(NULL)
                        if thread.is_timer_active():
                                thread.deactivate_timer()
                                <span style="color:red">ISR_enable(level)</span>
                                thread.remove_timer()
                        else:
                                <span style="color:red">ISR_enable(level)</span>
                        <span style="color:fuchsia">level = ISR_disable()</span>
                        if thread.is_state_set(MUTEX_BLOCKING_STATE):
                                thread.clear_state(MUTEX_BLOCKING_STATE)
                                if thread.is_ready():
                                        scheduler_unblock(thread)
                        <span style="color:fuchsia">ISR_enable(level)</span>
                else:
                        <span style="color:red">ISR_enable(level)</span>
                <span style="color:blue">level = ISR_disable()</span>
                if thread == NULL:
                        sync_status = wait_queue.get_sync_status()
                        if sync_status == TIMEOUT or sync_status == NOTHING_HAPPENED:
                                wait_queue.set_sync_status(SATISFIED)
                                thread = executing
                <span style="color:blue">ISR_enable(level)</span>
                if thread != NULL:
                        mtx.new_holder(thread)
                        if mtx.use_ceiling_protocol():
                                thread.boost_priority(mtx.get_ceiling())
                else:
                        mtx.unlock()
        thread_dispatch_enable()

mutex_timeout(thread, mtx):
        <span style="color:red">level = ISR_disable()</span>
        wait_queue = thread.get_wait_queue()
        if wait_queue != NULL:
                sync_status = wait_queue.get_sync_status()
                if sync_status != SYNCHRONIZED and thread.is_executing():
                        if sync_status != SATISFIED:
                                wait_queue.set_sync_status(TIMEOUT)
                                wait_control = thread.get_wait_control()
                                wait_control.set_status(TIMEOUT)
                        <span style="color:red">ISR_enable(level)</span>
                else:
                        <span style="color:red">ISR_enable(level)</span>
</pre>
}}}

= References =

* [https://www.cs.unc.edu/~anderson/diss/bbbdiss.pdf Björn B. Brandenburg, Scheduling and Locking in Multiprocessor Real-Time Operating Systems, 2011.][=#Brandenburg2011]
* [http://www.mpi-sws.org/~bbb/papers/pdf/rtsj11.pdf Björn B. Brandenburg and James H. Anderson, Spin-Based Reader-Writer Synchronization for Multiprocessor Real-Time Systems, July 2010.][=#BrandenburgAnderson2010]
* A. Burns and A. J. Wellings, A Schedulability Compatible Multiprocessor Resource Sharing Protocol - MrsP, Proceedings of the 25th Euromicro Conference on Real-Time Systems (ECRTS 2013), July 2013.[=#BurnsWellings2013]
* Arpan Gujarati, Felipe Cerqueira, and Björn Brandenburg, Schedulability Analysis of the Linux Push and Pull Scheduler with Arbitrary Processor Affinities, Proceedings of the 25th Euromicro Conference on Real-Time Systems (ECRTS 2013), July 2013.[=#GujaratiCerqueiraBandenburg2013]

RTEMS SMP is ready for production systems, see [http://microelectronics.esa.int/ngmp/RTEMS-SMP-StatusReportEmbBrains-rev3-2016-12.pdf RTEMS SMP Status Report].