Changes between Version 30 and Version 31 of Developer/SMP


Ignore:
Timestamp:
Jan 13, 2014, 4:09:10 PM (6 years ago)
Author:
Sh
Comment:

/* Implementation */

Legend:

Unmodified
Added
Removed
Modified
  • Developer/SMP

    v30 v31  
    504504Use an ISR lock per object to improve the performance for uncontested
    505505operations.  See also [wiki:#Giant_Lock_vs._Fine_Grained_Locking Giant Lock vs. Fine Grained Locking].
    506 =  Implementation  =
    507 
    508 =  Testing  =
    509 
    510 =  Overview  =
    511 
    512 
    513 [http://en.wikipedia.org/wiki/Test-driven_development Test-driven development]
    514 will be used to implement the accepted changes and features.  New tests cases
    515 will follow the patterns present in existing test cases of the RTEMS test
    516 suite.  The RTEMS project has a script framework to run the RTEMS test suite.
    517 This works best if the tests can run on a fast simulator.
    518 =  Worst-Case Timings  =
    519 
    520 
    521 The RTEMS test suite has some timing tests (testsuites/tmtests and testsuites/psxtmtests).  These tests yield only some sample timing values with no statistical significance.  For a real-time operating system reliable worst-case timing values of critical functions are interesting.  A presence of instruction and data caches must be taken into account.  Measurements for the worst-case timing should run in the following context
    522  *  instruction caches are invalidated,
    523  *  data caches are completely dirty with unrelated data, since in this case data for the section of interest must first evict dirty cache lines and then load the values from main memory,
    524  *  on SMP configurations all other processors should try to saturate the bus.
    525 Timing values should be acquired multiple times to get a statistical significant value.
    526 
    527 Currently most operating system functions are monolithic global functions which can be only tested as a single unit.  This leads to very complex test setups since interrupts must be used to trigger different paths through these functions.  In order to implement fine grained locking large parts of core operating system functions must be restructured.  As part of this process the critical sections should be implemented as inline functions which can be tested independently.  All critical sections for a specific lock should be tested in isolation so that worst-case timing values can be determined.  Since all the core operating system functions are aggregations of critical sections it is possible to give worst-case timing estimates provided the worst-case timings are known each critical section on its own.
    528 
    529 It is not the goal of the project to characterise finely all RTEMS services in combination of the all possible load figures in the processors.  However, instrumentation of critical items such as SMP locks and interrupt disabling will allow collecting a good idea about the system behaviour when the system is stressed by all sorts of RTEMS tests, demonstrators and parallel library tests.
    530 =  References  =
    531 
    532 
    533 <references/>
     506= Implementations =
     507
     508=  Tool Chain  =
     509
     510==  Binutils  ==
     511
     512
     513A Binutils 2.24 or later release must be used due to the LEON3 support.
     514==  GCC  ==
     515
     516
     517A GCC 4.8 2013-11-25 (e.g. 4.8.3) or later must be used due to the LEON3 support.  The LEON3 support for GCC includes a proper C11 memory model definition for this processor and C11 atomic operations using the CAS instruction.  The backport of the LEON3 support was initiated by EB [http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02255.html].
     518==  GDB  ==
     519
     520
     521Current GDB versions have problems with the debug format generated by GCC 4.8 and later [https://sourceware.org/bugzilla/show_bug.cgi?id=16215].
     522=  Profiling  =
     523
     524==  Reason  ==
     525
     526
     527SMP lock and interrupt processing profiling is necessary to fulfill some
     528observability requirements.  Vital timing data can be gathered on a per object
     529basis through profiling.
     530==  RTEMS API Changes  ==
     531
     532
     533None.
     534==  High-Performance CPU Counters  ==
     535
     536
     537In order to measure short time intervals we have to add a high-performance CPU
     538counter support to the CPU port API.  This is can be also used as an
     539replacement for the BSP specific benchmark timers.  It may also be used to
     540implement busy wait loops which are required by some device drivers.
     541
     542 /**
     543  * @brief Integer type for CPU counter values.
     544  */
     545 typedef XXX CPU_counter;
     546 
     547 /**
     548  * brief Returns the current CPU counter value.
     549  */
     550 CPU_counter _CPU_counter_Get()
     551 
     552 /**
     553  * brief Mask for arithmetic operations with the CPU counter value.
     554  *
     555  * All arithmetic operations are defined as A = ( C op B ) & MASK.
     556  */
     557 CPU_counter _CPU_counter_Mask()
     558 
     559 /**
     560  * brief Converts a CPU counter value into nanoseconds.
     561  */
     562 uint64_t _CPU_counter_To_nanoseconds( CPU_counter counter )
     563==  SMP Lock Profiling  ==
     564
     565
     566The SMP lock profiling will be a RTEMS build configuration time option
     567(RTEMS_LOCK_PROFILING).  The following statistics are proposed.
     568
     569 #define SMP_LOCK_STATS_CONTENTION_COUNTS 4
     570 
     571 /**
     572  * @brief SMP lock statistics.
     573  *
     574  * The lock acquire attempt instant is the point in time right after the
     575  * interrupt disable action in the lock acquire sequence.
     576  *
     577  * The lock acquire instant is the point in time right after the lock
     578  * acquisition.  This is the begin of the critical section code execution.
     579  *
     580  * The lock release instant is the point in time right before the interrupt
     581  * enable action in the lock release sequence.
     582  *
     583  * The lock section time is the time elapsed between the lock acquire instant
     584  * and the lock release instant.
     585  *
     586  * The lock acquire time is the time elapsed between the lock acquire attempt
     587  * instant and the lock acquire instant.
     588  */
     589 struct SMP_lock_Stats {
     590 #ifdef RTEMS_LOCK_PROFILING
     591   /**
     592    * @brief The last lock acquire instant in CPU counter ticks.
     593    *
     594    * This value is used to measure the lock section time.
     595    */
     596   CPU_counter acquire_instant;
     597 
     598   /**
     599    * @brief The maximum lock section time in CPU counter ticks.
     600    */
     601   CPU_counter max_section_time;
     602 
     603   /**
     604    * @brief The maximum lock acquire time in CPU counter ticks.
     605    */
     606   CPU_counter max_acquire_time;
     607 
     608   /**
     609    * @brief The count of lock uses.
     610    *
     611    * This value may overflow.
     612    */
     613   uint64_t usage_count;
     614 
     615   /**
     616    * @brief The counts of lock acquire operations with contention.
     617    *
     618    * The contention count for index N corresponds to a lock acquire attempt
     619    * with an initial queue length of N + 1.  The last index corresponds to all
     620    * lock acquire attempts with an initial queue length greater than or equal
     621    * to SMP_LOCK_STATS_CONTENTION_COUNTS.
     622    *
     623    * The values may overflow.
     624    */
     625   uint64_t contention_counts[SMP_LOCK_STATS_CONTENTION_COUNTS];
     626 
     627   /**
     628    * @brief Total lock section time in CPU counter ticks.
     629    *
     630    * The average lock section time is the total section time divided by the
     631    * lock usage count.
     632    *
     633    * This value may overflow.
     634    */
     635   uint64_t total_section_time;
     636 #endif /* RTEMS_LOCK_PROFILING */
     637 }
     638 
     639 struct SMP_lock_Control {
     640   ... lock data ...
     641   SMP_lock_Stats Stats;
     642 };
     643
     644A function should be added to monitor the lock contention.
     645
     646 /**
     647  * @brief Called in case of lock contention.
     648  *
     649  * @param[in] counter The spin loop iteration counter.
     650  */
     651 void _SMP_lock_Contention_monitor(
     652   const SMP_lock_Control *lock,
     653   int counter
     654 );
     655
     656A ticket lock can then look like this:
     657
     658 void acquire(struct ticket *t)
     659 {
     660        unsigned int my_ticket = atomic_fetch_add_explicit(&t->ticket, 1, memory_order_relaxed);
     661 #ifdef RTEMS_LOCK_PROFILING
     662        int counter = 0;
     663 #endif /* RTEMS_LOCK_PROFILING */
     664 
     665        while (atomic_load_explicit(&t->now_serving, memory_order_acquire) != my_ticket) {
     666 #ifdef RTEMS_LOCK_PROFILING
     667                ++counter;
     668                _SMP_lock_Contention_monitor(t, counter);
     669 #endif /* RTEMS_LOCK_PROFILING */
     670        }
     671 }
     672
     673SMP lock statistics can be evaluated use the following method.
     674
     675 typedef void ( *SMP_lock_Visitor )(
     676   void *arg,
     677   SMP_lock_Control *lock,
     678   SMP_lock_Class lock_class,
     679   Objects_Name lock_name
     680 );
     681 
     682 /**
     683  * @brief Iterates through all system SMP locks and invokes the visitor for
     684  * each lock.
     685  */
     686 void _SMP_lock_Iterate( SMP_lock_Visitor visitor, void *arg );
     687==  Interrupt and Thread Profiling  ==
     688
     689
     690The interrupt and thread profiling will be a RTEMS build configuration time
     691option (RTEMS_INTERRUPT_AND_THREAD_PROFILING).
     692
     693The time spent on interrupts and the time of disabled thread dispatching should
     694be monitored per-processor.  The time between the interrupt recognition by the
     695processor and the actuals start of the interrupt handler code execution should
     696be monitored per-processor if the hardware supports this.
     697
     698 /**
     699  * @brief Per-CPU statistics.
     700  */
     701 struct Per_CPU_Stats {
     702 #ifdef RTEMS_INTERRUPT_AND_THREAD_PROFILING
     703   /**
     704    * @brief The thread dispatch disabled begin instant in CPU counter ticks.
     705    *
     706    * This value is used to measure the time of disabled thread dispatching.
     707    */
     708   CPU_counter thread_dispatch_disabled_instant;
     709 
     710   /**
     711    * @brief The last outer-most interrupt begin instant in CPU counter ticks.
     712    *
     713    * This value is used to measure the interrupt processing time.
     714    */
     715   CPU_counter outer_most_interrupt_instant;
     716 
     717   /**
     718    * @brief The maximum interrupt delay in CPU counter ticks if supported by
     719    * the hardware.
     720    */
     721   CPU_counter max_interrupt_delay;
     722 
     723   /**
     724    * @brief The maximum time of disabled thread dispatching in CPU counter
     725    * ticks.
     726    */
     727   CPU_counter max_thread_dispatch_disabled_time;
     728 
     729   /**
     730    * @brief Count of times when the thread dispatch disable level changes from
     731    * zero to one in thread context.
     732    *
     733    * This value may overflow.
     734    */
     735   uint64_t thread_dispatch_disabled_count;
     736 
     737   /**
     738    * @brief Total time of disabled thread dispatching in CPU counter ticks.
     739    *
     740    * The average time of disabled thread dispatching is the total time of
     741    * disabled thread dispatching divided by the thread dispatch disabled
     742    * count.
     743    *
     744    * This value may overflow.
     745    */
     746   uint64_t total_thread_dispatch_disabled_time;
     747 
     748   /**
     749    * @brief Count of times when the interrupt nest level changes from zero to
     750    * one.
     751    *
     752    * This value may overflow.
     753    */
     754   uint64_t interrupt_count;
     755 
     756   /**
     757    * @brief Total time of interrupt processing in CPU counter ticks.
     758    *
     759    * The average time of interrupt processing is the total time of interrupt
     760    * processing divided by the interrupt count.
     761    *
     762    * This value may overflow.
     763    */
     764   uint64_t total_interrupt_time;
     765 #endif /* RTEMS_INTERRUPT_AND_THREAD_PROFILING */
     766 }
     767 
     768 struct Per_CPU_Control {
     769   ... per-CPU data ...
     770   Per_CPU_Stats Stats;
     771 };
     772=  Interrupt Support  =
     773
     774==  Reason  ==
     775
     776
     777Applications should be able to distribute the interrupt load throughout the
     778system.  In combination with partitioned/clustered scheduling this can reduce
     779the amount of inter-processor synchronization and thread migrations.
     780==  RTEMS API Changes  ==
     781
     782
     783Each interrupt needs a processor affinity set in the RTEMS SMP configuration.  The
     784rtems_interrupt_handler_install() function will not alter the processor
     785affinity set of the interrupt vector.  At system start-up all interrupts except
     786the inter-processor interrupts must be initialized to have a affinity with the
     787initialization processor only.
     788
     789Two new functions should be added to alter and retrieve the processor affinity
     790sets of interrupt vectors.
     791
     792 /**
     793  * @brief Sets the processor affinity set of an interrupt vector.
     794  *
     795  * @param[in] vector The interrupt vector number.
     796  * @param[in] affinity_set_size Size of the specified affinity set buffer in
     797  * bytes.  This value must be positive.
     798  * @param[in] affinity_set The new processor affinity set for the interrupt
     799  * vector.  This pointer must not be @c NULL.  A set bit in the affinity set
     800  * means that the interrupt can occur on this processor and a cleared bit
     801  * means the opposite.
     802  *
     803  * @retval RTEMS_SUCCESSFUL Successful operation.
     804  * @retval RTEMS_INVALID_ID The vector number is invalid.
     805  * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
     806  */
     807 rtems_status_code rtems_interrupt_set_affinity(
     808   rtems_vector vector,
     809   size_t affinity_set_size,
     810   const cpu_set_t *affinity_set
     811 );
     812
     813 /**
     814  * @brief Gets the processor affinity set of an interrupt vector.
     815  *
     816  * @param[in] vector The interrupt vector number.
     817  * @param[in] affinity_set_size Size of the specified affinity set buffer in
     818  * bytes.  This value must be positive.
     819  * @param[out] affinity_set The current processor affinity set of the
     820  * interrupt vector.  This pointer must not be @c NULL.  A set bit in the
     821  * affinity set means that the interrupt can occur on this processor and a
     822  * cleared bit means the opposite.
     823  *
     824  * @retval RTEMS_SUCCESSFUL Successful operation.
     825  * @retval RTEMS_INVALID_ID The vector number is invalid.
     826  * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
     827  * current processor affinity set of the interrupt vector.
     828  */
     829 rtems_status_code rtems_interrupt_get_affinity(
     830   rtems_vector vector,
     831   size_t affinity_set_size,
     832   cpu_set_t *affinity_set
     833 );
     834=  Clustered Scheduling  =
     835
     836==  Reason  ==
     837
     838
     839Partitioned/clustered scheduling helps to control the worst-case latencies in
     840the system.  The goal is to reduce the amount of shared state in the system and
     841thus prevention of lock contention.  Modern multi-processor systems tend to
     842have several layers of data and instruction caches.  With partitioned/clustered
     843scheduling it is possible to honor the cache topology of a system and thus
     844avoid expensive cache synchronization traffic.
     845==  RTEMS API Changes  ==
     846
     847
     848Functions for scheduler management.
     849
     850 /**
     851  * @brief Identifies a scheduler by its name.
     852  *
     853  * The scheduler name is determined by the scheduler configuration.
     854  *
     855  * @param[in] name The scheduler name.
     856  * @param[out] scheduler_id The scheduler identifier associated with the name.
     857  *
     858  * @retval RTEMS_SUCCESSFUL Successful operation.
     859  * @retval RTEMS_INVALID_NAME Invalid scheduler name.
     860  */
     861 rtems_status_code rtems_scheduler_ident(
     862   rtems_name name,
     863   rtems_id *scheduler_id
     864 );
     865
     866 /**
     867  * @brief Gets the set of processors owned by the scheduler.
     868  *
     869  * @param[in] scheduler_id Identifier of the scheduler.
     870  * @param[in] processor_set_size Size of the specified processor set buffer in
     871  * bytes.  This value must be positive.
     872  * @param[out] processor_set The processor set owned by the scheduler.  This
     873  * pointer must not be @c NULL.  A set bit in the processor set means that
     874  * this processor is owned by the scheduler and a cleared bit means the
     875  * opposite.
     876  *
     877  * @retval RTEMS_SUCCESSFUL Successful operation.
     878  * @retval RTEMS_INVALID_ID Invalid scheduler identifier.
     879  * @retval RTEMS_INVALID_CPU_SET The processor set buffer is too small for the
     880  * set of processors owned by the scheduler.
     881  */
     882 rtems_status_code rtems_scheduler_get_processors(
     883   rtems_id scheduler_id,
     884   size_t processor_set_size,
     885   cpu_set_t *processor_set
     886 );
     887
     888Each thread needs a processor affinity set in the RTEMS SMP configuration.  The
     889rtems_task_create() function will use the processor affinity set of the
     890executing thread to initialize the processor affinity set of the created
     891task.  This enables backward compatibility for existing software.
     892
     893Two new functions should be added to alter and retrieve the processor affinity
     894sets of tasks.
     895
     896 /**
     897  * @brief Sets the processor affinity set of a task.
     898  *
     899  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
     900  * the executing task.
     901  * @param[in] affinity_set_size Size of the specified affinity set buffer in
     902  * bytes.  This value must be positive.
     903  * @param[in] affinity_set The new processor affinity set for the task.  This
     904  * pointer must not be @c NULL.  A set bit in the affinity set means that the
     905  * task can execute on this processor and a cleared bit means the opposite.
     906  *
     907  * @retval RTEMS_SUCCESSFUL Successful operation.
     908  * @retval RTEMS_INVALID_ID Invalid task identifier.
     909  * @retval RTEMS_INVALID_CPU_SET Invalid processor affinity set.
     910  */
     911 rtems_status_code rtems_task_set_affinity(
     912   rtems_id task_id,
     913   size_t affinity_set_size,
     914   const cpu_set_t *affinity_set
     915 );
     916
     917 /**
     918  * @brief Gets the processor affinity set of a task.
     919  *
     920  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
     921  * the executing task.
     922  * @param[in] affinity_set_size Size of the specified affinity set buffer in
     923  * bytes.  This value must be positive.
     924  * @param[out] affinity_set The current processor affinity set of the task.
     925  * This pointer must not be @c NULL.  A set bit in the affinity set means that
     926  * the task can execute on this processor and a cleared bit means the
     927  * opposite.
     928  *
     929  * @retval RTEMS_SUCCESSFUL Successful operation.
     930  * @retval RTEMS_INVALID_ID Invalid task identifier.
     931  * @retval RTEMS_INVALID_CPU_SET The affinity set buffer is too small for the
     932  * current processor affinity set of the task.
     933  */
     934 rtems_status_code rtems_task_get_affinity(
     935   rtems_id task_id,
     936   size_t affinity_set_size,
     937   cpu_set_t *affinity_set
     938 );
     939
     940Two new functions should be added to alter and retrieve the scheduler of tasks.
     941
     942 /**
     943  * @brief Sets the scheduler of a task.
     944  *
     945  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
     946  * the executing task.
     947  * @param[in] scheduler_id Identifier of the scheduler.
     948  *
     949  * @retval RTEMS_SUCCESSFUL Successful operation.
     950  * @retval RTEMS_INVALID_ID Invalid task identifier.
     951  * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier.
     952  *
     953  * @see rtems_scheduler_ident().
     954  */
     955 rtems_status_code rtems_task_set_scheduler(
     956   rtems_id task_id,
     957   rtems_id scheduler_id
     958 );
     959
     960 /**
     961  * @brief Gets the scheduler of a task.
     962  *
     963  * @param[in] task_id Identifier of the task.  Use @ref RTEMS_SELF to select
     964  * the executing task.
     965  * @param[out] scheduler_id Identifier of the scheduler.
     966  *
     967  * @retval RTEMS_SUCCESSFUL Successful operation.
     968  * @retval RTEMS_INVALID_ID Invalid task identifier.
     969  */
     970 rtems_status_code rtems_task_get_scheduler(
     971   rtems_id task_id,
     972   rtems_id *scheduler_id
     973 );
     974==  Scheduler Configuration  ==
     975
     976
     977There are two options for the scheduler instance configuration
     978
     979# static configuration by means of global data structures, and
     980# configuration at run-time via function calls.
     981
     982For a configuration at run-time the system must start with a default scheduler.
     983The global constructors are called in this environment.  The order of global
     984constructor invocation is unpredictable so it is difficult to create threads in
     985this context since the run-time scheduler configuration may not exist yet.
     986Since scheduler data structures are allocated from the workspace the
     987configuration must take a later run-time setup of schedulers into account for
     988the workspace size estimate.  In case the default scheduler is not appropriate
     989it must be replaced which gives raise to some implementation difficulties.
     990Since the processor availability is determined by hardware constraints it is
     991unclear which benefits a run-time configuration has.  For now run-time
     992configuration of scheduler instances will be not implemented.
     993
     994The focus is now on static configuration.  Every scheduler needs a control
     995context.  The scheduler API must provide a macro which creates a global
     996scheduler instance specific data structure with a designator name as a
     997mandatory parameter.  The scheduler instance creation macro may require
     998additional scheduler specific configuration options.  For example a
     999fixed-priority scheduler instance must know the maximum priority level to
     1000allocate the ready chain control table.
     1001
     1002Once the scheduler instances are configured it must be specified for each
     1003processor in the system which scheduler instance owns this processor or if this
     1004processor is not used by the RTEMS system.
     1005
     1006For each processor except the initialization processor a scheduler instance is
     1007optional so that other operating systems can run independent of this RTEMS
     1008system on this processor.  It is a fatal error to omit a scheduler instance for
     1009the initialization processor.  The initialization processor is the processor
     1010which executes the boot_card() function.
     1011
     1012 /**
     1013  * @brief Processor configuration.
     1014  *
     1015  * Use RTEMS_CPU_CONFIG_INIT() to initialize this structure.
     1016  */
     1017 typedef struct {
     1018   /**
     1019    * @brief Scheduler instance for this processor.
     1020    *
     1021    * It is possible to omit a scheduler instance for this processor by using
     1022    * the @c NULL pointer.  In this case RTEMS will not use this processor and
     1023    * other operating systems may claim it.
     1024    */
     1025   Scheduler_Control *scheduler;
     1026 } rtems_cpu_config;
     1027 
     1028 /**
     1029  * @brief Processor configuration initializer.
     1030  *
     1031  * @param scheduler The reference to a scheduler instance or @c NULL.
     1032  *
     1033  * @see rtems_cpu_config.
     1034  */
     1035 #define RTEMS_CPU_CONFIG_INIT(scheduler) \
     1036   { ( scheduler ) }
     1037
     1038Scheduler and processor configuration example:
     1039
     1040 RTEMS_SCHED_DEFINE_FP_SMP(fp0, rtems_build_name(' ', 'F', 'P', '0'), 256);
     1041 RTEMS_SCHED_DEFINE_FP_SMP(fp1, rtems_build_name(' ', 'F', 'P', '1'), 64);
     1042 RTEMS_SCHED_DEFINE_EDF_SMP(edf0, rtems_build_name('E', 'D', 'F', '0'));
     1043 
     1044 const rtems_cpu_config rtems_cpu_config_table[] = {
     1045   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp0)),
     1046   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
     1047   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
     1048   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_FP_SMP(fp1)),
     1049   RTEMS_CPU_CONFIG_INIT(NULL),
     1050   RTEMS_CPU_CONFIG_INIT(NULL),
     1051   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0)),
     1052   RTEMS_CPU_CONFIG_INIT(RTEMS_SCHED_REF_EDF_SMP(edf0)
     1053 };
     1054 
     1055 const size_t rtems_cpu_config_count =
     1056
     1057   RTEMS_ARRAY_SIZE(rtems_cpu_config_table);
     1058
     1059An alternative to the processor configuration table would be to specify in the
     1060scheduler instance which processors are owned by the instance.  This would
     1061require a static initialization of CPU sets which is difficult.  Also the
     1062schedulers have to be registered somewhere, so some sort of table is needed
     1063anyway.  Since a processor can be owned by at most one scheduler instance this
     1064configuration approach enables an additional error source which is avoided by
     1065the processor configuration table.
     1066==  Scheduler Implementation  ==
     1067
     1068
     1069Currently the scheduler operations have no control context and use global
     1070variables instead.  Thus the scheduler operations signatures must change to use
     1071a scheduler control context as the first parameter, e.g.
     1072
     1073 typedef struct Scheduler_Control Scheduler_Control;
     1074 
     1075 typedef struct {
     1076   [...]
     1077   void ( *set_affinity )(
     1078     Scheduler_Control *self,
     1079     Thread_Control *thread,
     1080     size_t affinity_set_size,
     1081     const cpu_set_t *affinity_set
     1082   );
     1083   [...]
     1084 } Scheduler_Operations;
     1085 
     1086 /**
     1087  * @brief General scheduler control.
     1088  */
     1089 struct Scheduler_Control {
     1090   /**
     1091    * @brief The scheduler operations.
     1092    */
     1093   Scheduler_Operations Operations;
     1094 
     1095   /**
     1096    * @brief Size of the owned processor set in bytes.
     1097    */
     1098   size_t owned_cpu_set_size
     1099 
     1100   /**
     1101    * @brief Reference to the owned processor set.
     1102    *
     1103    * A set bit means this processor is owned by this scheduler instance, a
     1104    * cleared bit means the opposite.
     1105    */
     1106   cpu_set_t *owned_cpu_set;
     1107 };
     1108
     1109Single processor configurations benefit also from this change since it makes
     1110all dependencies explicit and easier to access (allows more efficient machine
     1111code).
     1112=  Multiprocessor Resource Sharing Protocol - MrsP  =
     1113
     1114==  Reason  ==
     1115
     1116
     1117In general, application-level threads are not independent and may indeed share logical resources. In a partitioned-scheduling system, where capacity allows, resources are allocated on the same processor as the threads that share it: in that case those resources are termed ''local''.  Where needed, resources may also reside on processors other than that of (some of) their sharing threads: in that case those resources are termed ''global''.
     1118
     1119For partitioned scheduling of application-level threads and local resources, two choices are possible, which meet the ITT requirement of achieving predictable time behaviour for platform software
     1120 *  (1) fixed-priority scheduling with the immediate priority ceiling for controlling access to local resources, or
     1121 *  (2) EDF scheduling with the stack resource protocol for controlling access to local resources.
     1122
     1123Choice (1) is more natural with RTEMS. Both alternatives require preemption, whose disruption to the thread's cache working set may in part be attenuated by the use of techniques known as limited-preemption, which have been successfully demonstrated by the UoP. Conversely, run-to-completion (non-preemptive) semantics is known to achieve much lower schedulable utilization, which cannot make it a plausible candidate for performance-hungry systems. The use of fixed-priority scheduling for each processor where application-level threads locally run allows response time analysis to be used, which, on the basis of the worst-case execution time of individual threads and on the graph of resource usage among threads, determines the worst-case completion time of every individual thread run on that processor, offering absolute guarantees on the schedulable utilization of every individual processor that uses that algorithm.
     1124
     1125The determination of worst-case execution time of software programs running on one processor of an SMP is significantly more complex than its single-processor analogous.  This is because every local execution suffers massive interference effects from its co-runners on other processors, regardless of functional independence.  Simplistically upper bounding the possible interference incurs exceeding pessimism, which causes massive reduction in the allowable system load.  More accurate interference analysis may incur prohibitive costs unless simplifying assumptions can be safely made, one of which is strictly static partitioning: previous studies run by ESA [http://microelectronics.esa.int/ngmp/MulticoreOSBenchmark-FinalReport_v7.pdf]
     1126have indicated possible approaches to that problem.  Global scheduling greatly exacerbates the difficulty of that problem and thus is unwelcome.  With fixed-priority scheduling, each logical resource shared by multiple application-level threads (or their corresponding lock) must be statically attached a ceiling priority attribute computed as an upper bound to the static
     1127priority of all threads that may use that resource: when an application-level thread acquires the resource lock, the priority of that thread is raised to the
     1128ceiling of the resource; upon relinquishing the lock to that resource, the thread must return to the priority level that it had prior to acquiring the lock.  (It has been shown that the implementation of this protocol does not need resource locks at all, as the mere fact for a thread to be running implies that all of the resources that it may want to use are free at this time and none can be claimed by lower-priority threads.  However, it may be easier for an operating system to attach the ceiling value to the resource lock than to
     1129any other data structures.)  This simple protocol allows application-level threads to acquire multiple resources without risking deadlock.
     1130
     1131When global resources are used, which is very desirable to alleviate the complexity of task allocation, the resource access control protocol in use, whether (1) or (2) in the earlier taxonomy, must be specialized, so that global resources can be syntactically told apart from local resources.  Luckily, two solutions have been very recently proposed to address this problem.  One solution, known as Optimal Migratory Priority Inheritance (OMIP), was proposed by Björn Brandenburg in his work entitled ''A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications'', presented at the ECRTS conference in July 2013.  The other solution, known as Multiprocessor Resource Sharing Protocol (MrsP), was proposed by Alan Burns and Andy Wellings in their work entitled ''A Schedulability Compatible Multiprocessor Resource Sharing Protocol'', presented at the very same ECRTS 2013 conference.  OMIP
     1132aims to solve the problem of guaranteed bounded blocking in global resource sharing in clustered scheduling on a multiprocessor, where a cluster is a collection of at least one processor.  MrsP aims to achieve the smallest bound to the cumulative duration of blocking suffered by threads waiting to access a global resource in partitioned scheduling (clusters with exactly one
     1133processor), while allowing schedulability analysis to be performed per processor, using response time analysis, which is a technique fairly well known to industry.  No other global resource sharing protocol on SMP, including OMIP, is able to guarantee that.  For this reason MrsP is the choice for this study.
     1134
     1135The MrsP protocol requires that
     1136 *  (a) threads waiting for global resources spin at ceiling priority on their local processor, and with a ceiling value greater than any other local resource on that processor, and
     1137 *  (b) the execution within the global resource may migrate to other processors where application-level threads waiting to access that resource are spinning.
     1138
     1139Feature (a) prevents lower-priority threads from running in preference to the waiting higher-priority thread and stealing resources that it might want to use in the future as part of the current execution; should that stealing happen, the blocking penalty potentially suffered on access to global resources would skyrocket to untenable levels.
     1140Feature (b), which brings in the sole welcome extent of migration in the proposed model, which is useful when higher-priority tasks running on the processor of the global resource prevent it from completing execution; in that case, the slack allowed for by local spinning on other processors where other threads are waiting, is used to speed up the completion of the execution in the global resource and therefore reduce blocking.
     1141==  RTEMS API Changes  ==
     1142
     1143
     1144For MrsP we need the ability to specify the priority ceilings per scheduler
     1145domain.
     1146
     1147 typedef struct {
     1148   rtems_id scheduler_id;
     1149   rtems_task_priority priority;
     1150 } rtems_task_priority_by_scheduler;
     1151
     1152 /**
     1153  * @brief Sets the priority ceilings per scheduler for a semaphore with
     1154  * priority ceiling protocol.
     1155  *
     1156  * @param[in] semaphore_id Identifier of the semaphore.
     1157  * @param[in] priority_ceilings A table with priority ceilings by scheduler.
     1158  * In case one scheduler appears multiple times, the setting with the highest
     1159  * index will be used.  This semaphore object is then bound to the specified
     1160  * scheduler domains.  It is an error to use this semaphore object on other
     1161  * scheduler domains.  The specified schedulers must be compatible, e.g.
     1162  * migration from one scheduler domain to another must be defined.
     1163  * @param[in] priority_ceilings_count Count of priority ceilings by scheduler
     1164  * pairs in the table.
     1165  *
     1166  * @retval RTEMS_SUCCESSFUL Successful operation.
     1167  * @retval RTEMS_INVALID_ID Invalid semaphore identifier.
     1168  * @retval RTEMS_INVALID_SECOND_ID Invalid scheduler identifier in the table.
     1169  * @retval RTEMS_INVALID_PRIORITY Invalid task priority in the table.
     1170  */
     1171 rtems_status_code rtems_semaphore_set_priority_ceilings(
     1172   rtems_id semaphore_id,
     1173   const rtems_task_priority_by_scheduler *priority_ceilings,
     1174   size_t priority_ceilings_count
     1175 );
     1176==  Implementation  ==
     1177
     1178
     1179The critical part in the MrsP is the migration of the lock holder in case of
     1180preemption by a higher-priority thread.  A post-switch action is used to detect
     1181this event.  The post-switch action will remove the thread from the current
     1182scheduler domain and add it to the scheduler domain of the first waiting thread
     1183which is executing.  A resource release will remove this thread from the
     1184temporary scheduler domain and move it back to the original scheduler domain.
     1185=  Fine Grained Locking  =
     1186
     1187==  Reason  ==
     1188
     1189
     1190Fine grained locking is of utmost importance to get a scalable operating system
     1191that can guarantee reasonable worst-case latencies.  With the current Giant
     1192lock in RTEMS the worst-case latencies of every operating system service
     1193increase with each processor added to the system.  Since the Giant lock state
     1194is shared among all processors a huge cache synchronization overhead
     1195contributes to the worst-case latencies.  Fine grained locking in combination
     1196with partitioned/clustered scheduling helps to avoid these problems since now
     1197the operating system state is distributed allowing true parallelism of
     1198independent components.
     1199==  RTEMS API Changes  ==
     1200
     1201
     1202None.
     1203==  Locking Protocol Analysis  ==
     1204
     1205
     1206As a sample operating system operation the existing mutex
     1207obtain/release/timeout sequences will be analysed.  All ISR disable/enable
     1208points (highlighted with colours) must be turned into an appropriate SMP clock
     1209(e.g. a ticket spin lock).  One goal is that an uncontested mutex obtain will
     1210use no SMP locks except the one associated with the mutex object itself.  Is
     1211this possible with the current structure?
     1212
     1213 mutex_obtain(id, wait, timeout):
     1214        <span style="color:red">level = ISR_disable()</span>
     1215        mtx = mutex_get(id)
     1216        executing = get_executing_thread()
     1217        wait_control = executing.get_wait_control()
     1218        wait_control.set_status(SUCCESS)
     1219        if !mtx.is_locked():
     1220                mtx.lock(executing)
     1221                if mtx.use_ceiling_protocol():
     1222                        thread_dispatch_disable()
     1223                        <span style="color:red">ISR_enable(level)</span>
     1224                        executing.boost_priority(mtx.get_ceiling())
     1225                        thread_dispatch_enable()
     1226                else:
     1227                        <span style="color:red">ISR_enable(level)</span>
     1228        else if mtx.is_holder(executing):
     1229                mtx.increment_nest_level()
     1230                <span style="color:red">ISR_enable(level)</span>
     1231        else if !wait:
     1232                <span style="color:red">ISR_enable(level)</span>
     1233                wait_control.set_status(UNSATISFIED)
     1234        else:
     1235                wait_queue = mtx.get_wait_queue()
     1236                wait_queue.set_sync_status(NOTHING_HAPPENED)
     1237                executing.set_wait_queue(wait_queue))
     1238                thread_dispatch_disable()
     1239                <span style="color:red">ISR_enable(level)</span>
     1240                if mtx.use_inherit_priority():
     1241                        mtx.get_holder().boost_priority(executing.get_priority()))
     1242                <span style="color:fuchsia">level = ISR_disable()</span>
     1243                if executing.is_ready():
     1244                        executing.set_state(MUTEX_BLOCKING_STATE)
     1245                        scheduler_block(executing)
     1246                else:
     1247                        executing.add_state(MUTEX_BLOCKING_STATE)
     1248                <span style="color:fuchsia">ISR_enable(level)</span>
     1249                if timeout:
     1250                        timer_start(timeout, executing, mtx)
     1251                <span style="color:blue">level = ISR_disable()</span>
     1252                search_thread = wait_queue.first()
     1253                while search_thread != wait_queue.tail():
     1254                        if executing.priority() <= search_thread.priority():
     1255                                break
     1256                        <span style="color:blue">ISR_enable(level)</span>
     1257                        <span style="color:blue">level = ISR_disable()</span>
     1258                        if search_thread.is_state_set(MUTEX_BLOCKING_STATE):
     1259                                search_thread = search_thread.next()
     1260                        else:
     1261                                search_thread = wait_queue.first()
     1262                sync_status = wait_queue.get_sync_status()
     1263                if sync_state == NOTHING_HAPPENED:
     1264                        wait_queue.set_sync_status(SYNCHRONIZED)
     1265                        wait_queue.enqueue(search_thread, executing)
     1266                        executing.set_wait_queue(wait_queue)
     1267                        <span style="color:blue">ISR_enable(level)</span>
     1268                else:
     1269                        executing.set_wait_queue(NULL)
     1270                        if executing.is_timer_active():
     1271                                executing.deactivate_timer()
     1272                                <span style="color:blue">ISR_enable(level)</span>
     1273                                executing.remove_timer()
     1274                        else:
     1275                                <span style="color:blue">ISR_enable(level)</span>
     1276                        <span style="color:fuchsia">level = ISR_disable()</span>
     1277                        if executing.is_state_set(MUTEX_BLOCKING_STATE):
     1278                                executing.clear_state(MUTEX_BLOCKING_STATE)
     1279                                if executing.is_ready():
     1280                                        scheduler_unblock(executing)
     1281                        <span style="color:fuchsia">ISR_enable(level)</span>
     1282                thread_dispatch_enable()
     1283        return wait_control.get_status()
     1284 
     1285 mutex_release(id):
     1286        thread_dispatch_disable()
     1287        mtx = mutex_get(id)
     1288        executing = get_executing_thread()
     1289        nest_level = mtx.decrement_nest_level()
     1290        if nest_level == 0:
     1291                if mtx.use_ceiling_protocol() or mtx.use_inherit_priority():
     1292                        executing.restore_priority()
     1293                wait_queue = mtx.get_wait_queue()
     1294                thread = NULL
     1295                <span style="color:red">level = ISR_disable()</span>
     1296                thread = wait_queue.dequeue()
     1297                if thread != NULL:
     1298                        thread.set_wait_queue(NULL)
     1299                        if thread.is_timer_active():
     1300                                thread.deactivate_timer()
     1301                                <span style="color:red">ISR_enable(level)</span>
     1302                                thread.remove_timer()
     1303                        else:
     1304                                <span style="color:red">ISR_enable(level)</span>
     1305                        <span style="color:fuchsia">level = ISR_disable()</span>
     1306                        if thread.is_state_set(MUTEX_BLOCKING_STATE):
     1307                                thread.clear_state(MUTEX_BLOCKING_STATE)
     1308                                if thread.is_ready():
     1309                                        scheduler_unblock(thread)
     1310                        <span style="color:fuchsia">ISR_enable(level)</span>
     1311                else:
     1312                        <span style="color:red">ISR_enable(level)</span>
     1313                <span style="color:blue">level = ISR_disable()</span>
     1314                if thread == NULL:
     1315                        sync_status = wait_queue.get_sync_status()
     1316                        if sync_status == TIMEOUT || sync_status == NOTHING_HAPPENED:
     1317                                wait_queue.set_sync_status(SATISFIED)
     1318                                thread = executing
     1319                <span style="color:blue">ISR_enable(level)</span>
     1320                if thread != NULL:
     1321                        mtx.new_holder(thread)
     1322                        if mtx.use_ceiling_protocol():
     1323                                thread.boost_priority(mtx.get_ceiling())
     1324                else:
     1325                        mtx.unlock()
     1326        thread_dispatch_enable()
     1327 
     1328 
     1329 mutex_timeout(thread, mtx):
     1330        <span style="color:red">level = ISR_disable()</span>
     1331        wait_queue = thread.get_wait_queue()
     1332        if wait_queue != NULL:
     1333                sync_status = wait_queue.get_sync_status()
     1334                if sync_status != SYNCHRONIZED and thread.is_executing():
     1335                        if sync_status != SATISFIED:
     1336                                wait_queue.set_sync_status(TIMEOUT)
     1337                                wait_control = executing.get_wait_control()
     1338                                wait_control.set_status(TIMEOUT)
     1339                        <span style="color:red">ISR_enable(level)</span>
     1340                else:
     1341                        <span style="color:red">ISR_enable(level)</span>
     1342                        <span style="color:lime">level = ISR_disable()</span>
     1343                        wait_queue = thread.get_wait_queue()
     1344                        if wait_queue != NULL:
     1345                                wait_queue.extract(executing)
     1346                                if thread.is_timer_active():
     1347                                        thread.deactivate_timer()
     1348                                        <span style="color:lime">ISR_enable(level)</span>
     1349                                        thread.remove_timer()
     1350                                else:
     1351                                        <span style="color:lime">ISR_enable(level)</span>
     1352                                <span style="color:fuchsia">level = ISR_disable()</span>
     1353                                if thread.is_state_set(MUTEX_BLOCKING_STATE):
     1354                                        thread.clear_state(MUTEX_BLOCKING_STATE)
     1355                                        if thread.is_ready():
     1356                                                scheduler_unblock(thread)
     1357                                <span style="color:fuchsia">ISR_enable(level)</span>
     1358                        else:
     1359                                <span style="color:lime">ISR_enable(level)</span>
     1360        else:
     1361                <span style="color:red">ISR_enable(level)</span>
     1362
     1363The thread.remove_timer() operation is quite complex and a problem area of its
     1364own.  See discussion about the [wiki:Watchdog_Handler Watchdog Handler].
     1365
     1366The scheduler operation points are in a good shape.  Here we can easily use one
     1367SMP lock for the thread state and one SMP lock for the scheduler state.
     1368
     1369The big problem is that the mutex object state changes and the thread enqueue
     1370operation are split up into several parts.  This was done to ensure a next to
     1371optimal interrupt latency with only constant-time sections of disabled
     1372interrupts.  The trade-off is that we have a very complex blocking sequence.
     1373After the first mutex state change in the mutex_obtain() the mutex doesn't know
     1374directly which thread is about to block on that mutex.  Some sequences assume
     1375exactly one executing thread in the system, which is not true on a SMP system
     1376with more than one processor.  With an SMP lock per mutex object all state
     1377information for this mutex must be present in the mutex object.  So the locked
     1378mutex to locked mutex with a waiting thread change must be atomic with respect
     1379to the mutex SMP lock.  Later mutex timeout or release operations can then get
     1380the waiting thread and deal with it accordingly.
     1381==  Implementation  ==
     1382
     1383
     1384The blocking operation synchronization state must move from the synchronization
     1385object (e.g. mutex, message queue) to the thread wait information
     1386(Thread_Wait_information) since it is clear that the thread has to block and
     1387not the object.  There may be also multiple threads on different processors
     1388which queue up on a mutex at the same time.
     1389
     1390Blocking threads must be registered in the object under the SMP lock of the
     1391object and must be independent of the scheduler data structures.  Thus we can
     1392no longer use the normal chain node of the thread and instead have to add a
     1393chain node to the thread wait information.
     1394
     1395The thread queue operations must no longer use internal locks (e.g. ISR
     1396disable/enable).  This simplifies them considerable.  The thread queue
     1397operations must be performed under the SMP lock of the object.  The drawback is
     1398that the time of disabled interrupts increases.  The FIFO thread queue
     1399operations are trivial doubly-linked list operations.  The priority thread
     1400queue operations execute in a worst-case time which depends only on the maximum
     1401number of priorities.
     1402=  Post-Switch Actions  =
     1403
     1404==  Reason  ==
     1405
     1406
     1407Currently threads are assigned to processors for execution by the scheduler
     1408responsible for this thread.  It is unknown to the system when a thread
     1409actually starts or terminates execution on its assigned processor.  The
     1410termination event is important for the following features
     1411 *  explicit thread migration, e.g. if a thread should move from one scheduler domain to another,
     1412 *  thread deletion, since the thread stack is in use until the thread stopped execution, or
     1413 *  restart of threads executing on a remote processor.
     1414==  RTEMS API Changes  ==
     1415
     1416
     1417None.
     1418==  Implementation  ==
     1419
     1420
     1421One approach to do post-switch actions could be to spin on the per-processor
     1422variable reflecting the executing thread.  This has at least two problems
     1423# it doesn't work if the executing thread wants to alter its own state, and
     1424# this spinning must be done with the scheduler lock held and interrupts disabled, this is a disaster for the interrupt latency,
     1425
     1426The proposed solution is to use an optional action handler which is active in
     1427case the thread execution termination matters.  In _Thread_Dispatch() we have
     1428already the post-switch extensions invoked after a thread switch.
     1429Unfortunately they execute after thread dispatching is enabled again and at
     1430this point the current processor may have already changed due to thread
     1431migration requested by an interrupt.
     1432
     1433We need a context which executes right after the thread context switch on the
     1434current processor, but without the per-processor lock acquired (to prevent lock
     1435order reversal problems and keep the interrupt latency small).  For this we
     1436introduce a post-switch action chain (PSAC).  Each thread will have its own
     1437PSAC control.  The PSAC operations like addition to, removal from and iteration
     1438over the chain are protected by the corresponding thread lock.  Each action
     1439will have a local context.  The heir thread will execute the action handlers on
     1440behalf of the thread of interest.  Since thread dispatching is disabled action
     1441handlers cannot block.
     1442
     1443The execution time of post-switch actions increases the worst-case thread
     1444dispatch latency since the heir thread must do work for another thread.
     1445
     1446On demand post-switch actions help to implement the Multiprocessor Resource
     1447Sharing Protocol (MrsP) proposed by Burns and Wellings.  Threads executing a
     1448global critical section can add a post-switch action which will trigger the
     1449thread migration in case of pre-emption by a local high-priority thread.
     1450
     1451 thread_dispatch:
     1452        again = true
     1453        while again:
     1454                level = ISR.disable()
     1455                current_cpu = get_current_cpu()
     1456                current_cpu.disable_thread_dispatch()
     1457                ISR.enable(level)
     1458                executing = current_cpu.get_executing()
     1459                current_cpu.acquire()
     1460                if current_cpu.is_thread_dispatch_necessary():
     1461                        heir = current_cpu.get_heir()
     1462                        current_cpu.set_thread_dispatch_necessary(false)
     1463                        current_cpu.set_executing(heir)
     1464                        executing.set_executing(false)
     1465                        heir.set_executing(true)
     1466                        if executing != heir:
     1467                                last = switch(executing, heir)
     1468                                current_cpu = get_current_cpu()
     1469                                actions = last.get_actions()
     1470                                if actions.is_empty():
     1471                                        again = false
     1472                                else:
     1473                                        current_cpu.release()
     1474                                        last.acquire()
     1475                                        if last.get_cpu() == current_cpu:
     1476                                                while !actions.is_empty():
     1477                                                        action = actions.pop()
     1478                                                        action.do(current_cpu, last)
     1479                                        last.release()
     1480                                        current_cpu.enable_thread_dispatch()
     1481        current_cpu.enable_thread_dispatch()
     1482        current_cpu.release()
     1483
     1484It is important to check that the thread is still assigned to the current
     1485processor, since after the release of the per-processor lock we have a new
     1486executing thread and the thread of interest may migrated to another processor
     1487already.  Since the heir thread has now a reference to the thread of interest
     1488we have to make sure that deletion requests are deferred until the post-switch
     1489actions have been executed.
     1490
     1491An efficient why to get the last executing thread (the thread of interest)
     1492throughout the context switch is to return the context pointer of the last
     1493executing thread.  With a simple offset operation we get the thread control
     1494block.
     1495=  Thread Delete/Restart  =
     1496
     1497==  Reason  ==
     1498
     1499
     1500Deletion of threads may be required by some parallel libraries.
     1501==  RTEMS API Changes  ==
     1502
     1503
     1504None.
     1505==  Implementation  ==
     1506
     1507
     1508The current implementation to manage a thread life-cycle in RTEMS has some
     1509weaknesses that turn into severe problems on SMP.  It leads also to POSIX and
     1510C++ standard conformance defects in some cases.  Currently the thread
     1511life-cycle changes are protected by the thread dispatch disable level and some
     1512parts by the allocator mutex.  Since the thread dispatch disable level is
     1513actually a giant mutex on SMP this leads in combination with the allocator
     1514mutex to lock order reversal problems.
     1515
     1516The usage of a unified work areas is also broken at the moment
     1517[https://www.rtems.org/bugzilla/show_bug.cgi?id=2152].
     1518
     1519There is also an outstanding thread cancellation bug
     1520[https://www.rtems.org/bugzilla/show_bug.cgi?id=2035].
     1521
     1522One problematic path is the destruction of threads.  Here we have currently the
     1523following sequence:
     1524
     1525<ol>
     1526<li>Obtain the allocator mutex.</li>
     1527<li>Disable thread dispatching.</li>
     1528<li>Invalidate the object identifier.</li>
     1529<li>Enable thread dispatching.</li>
     1530<li>Call the thread delete extensions in the context of the deleting thread
     1531(not necessarily the deleted thread).  The POSIX cleanup handlers are called
     1532here from the POSIX delete extension.  POSIX mandates that the cleanup handler
     1533are executed in the context of the corresponding thread.  So here we have a
     1534POSIX violation
     1535[http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_03].
     1536</li>
     1537<li>Remove the thread from the scheduling and watchdog resources.</li>
     1538<li>Delete scheduling, floating-point, stack and extensions resources.  Now the
     1539deleted thread may execute on a freed thread stack!</li>
     1540<li>Free the object.  Now the object (thread control block) is available for
     1541re-use, but it is still used by the thread!  Only the disabled thread
     1542dispatching prevents chaos.</li>
     1543<li>Release the allocator mutex.  Now we have a lock order reversal (see step 1.
     1544and 2.).</li>
     1545<li>Enable thread dispatching.  Here a deleted executing thread disappears.  On
     1546SMP we have also a race-condition here.  This step looks in detail:
     1547{{{
     1548if ( _Thread_Dispatch_decrement_disable_level() == 0 )
     1549        /*
     1550         * Here another processor may re-use resources of a deleted executing
     1551         * thread, e.g. the stack.
     1552         */
     1553        _Thread_Dispatch();
     1554}
     1555}}}
     1556</li>
     1557</ol>
     1558
     1559To overcome the issues we need considerable implementation changes in Score.
     1560The thread life-cycle state must be explicit and independent of the thread
     1561dispatch disable level and allocator mutex protection.
     1562
     1563The thread life-cycle is determined by the following actions:
     1564
     1565; CREATE : A thread is created.
     1566; START : Starts a thread.  The thread must be dormant to get started.
     1567; RESTART : Restarts a thread.  The thread must not be dormant to get restarted.
     1568; SUSPEND : Suspends a thread.
     1569; RESUME : Resumes a thread.
     1570; DELETE : Deletes a thread.
     1571; SET_PROTECTION : Sets the new protection state and returns the previous.  This action is new.
     1572
     1573The following thread life-cycle states are proposed.  These states are
     1574orthog