source: rtems-docs/c-user/symmetric_multiprocessing_services.rst @ c2ee227

Last change on this file since c2ee227 was c2ee227, checked in by Sebastian Huber <sebastian.huber@…>, on 03/07/18 at 13:06:48

c-user: Promote clustered scheduler configuration

Add own section for the clustered scheduler configuration.

.. comment SPDX-License-Identifier: CC-BY-SA-4.0

.. COMMENT: COPYRIGHT (c) 2014.
.. COMMENT: On-Line Applications Research Corporation (OAR).
.. COMMENT: Copyright (c) 2017 embedded brains GmbH.
.. COMMENT: All rights reserved.

.. index:: Symmetric Multiprocessing
.. index:: SMP

Symmetric Multiprocessing (SMP)
*******************************

Introduction
============

The Symmetric Multiprocessing (SMP) support of RTEMS is available on

- ARMv7-A,

- PowerPC, and

- SPARC.

.. warning::

   The SMP support must be explicitly enabled via the ``--enable-smp``
   configure command line option for the :term:`BSP` build.

RTEMS is supposed to be a real-time operating system.  What does this mean in
the context of SMP?  The RTEMS interpretation of real-time on SMP is the
support for :ref:`ClusteredScheduling` with priority based schedulers and
adequate locking protocols.  One aim is to enable a schedulability analysis
under the sporadic task model :cite:`Brandenburg:2011:SL`
:cite:`Burns:2013:MrsP`.

The directives provided by the SMP support are:

- rtems_get_processor_count_ - Get processor count

- rtems_get_current_processor_ - Get current processor index

Background
==========

Application Configuration
-------------------------

By default, the maximum processor count is set to one in the application
configuration.  To enable SMP, the application configuration option
:ref:`CONFIGURE_MAXIMUM_PROCESSORS <CONFIGURE_MAXIMUM_PROCESSORS>` must be
defined to a value greater than one.  It is recommended to use the smallest
value suitable for the application in order to save memory.  For example, each
processor needs an idle thread and an interrupt stack.

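A minimal sketch of an application configuration which allows up to four
processors could look as follows.  The value four and the other ``CONFIGURE_*``
options shown are illustrative assumptions.  A real application needs the
options which match its drivers and objects, and it must provide the
initialization task function.

.. code-block:: c

    #include <rtems.h>

    /* Assumption: the application uses the clock driver and Classic tasks */
    #define CONFIGURE_APPLICATION_NEEDS_CLOCK_DRIVER

    /* Use the smallest processor count suitable for the application */
    #define CONFIGURE_MAXIMUM_PROCESSORS 4

    /* Assumption: eight tasks are sufficient for this application */
    #define CONFIGURE_MAXIMUM_TASKS 8

    #define CONFIGURE_RTEMS_INIT_TASKS_TABLE

    #define CONFIGURE_INIT

    #include <rtems/confdefs.h>
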
The default scheduler for SMP applications supports up to 32 processors and is
a global fixed priority scheduler, see also :ref:`ConfigurationSchedulersClustered`.

The following compile-time test can be used to check if the SMP support is
available or not.

.. code-block:: c

    #include <rtems.h>

    #ifdef RTEMS_SMP
    #warning "SMP support is enabled"
    #else
    #warning "SMP support is disabled"
    #endif

Examples
--------

For example applications see `testsuites/smptests
<https://git.rtems.org/rtems/tree/testsuites/smptests>`_.

Uniprocessor versus SMP Parallelism
-----------------------------------

Uniprocessor systems have long been used in embedded systems. In this hardware
model, there are some system execution characteristics which have long been
taken for granted:

- one task executes at a time

- hardware events result in interrupts

There is no true parallelism. Even when interrupts appear to occur at the same
time, they are largely processed in a serial fashion.  This is true even when
the interrupt service routines are allowed to nest.  From a tasking viewpoint,
it is the responsibility of the real-time operating system to simulate
parallelism by switching between tasks.  These task switches occur in response
to hardware interrupt events and explicit application events such as blocking
for a resource or delaying.

With symmetric multiprocessing, the presence of multiple processors allows for
true concurrency and provides for cost-effective performance
improvements. Uniprocessors tend to increase performance by increasing clock
speed and complexity. This tends to lead to hot, power hungry microprocessors
which are poorly suited for many embedded applications.

The true concurrency is in sharp contrast to the single task and interrupt
model of uniprocessor systems. This results in a fundamental change to the
uniprocessor system characteristics listed above. Developers are faced with a
different set of characteristics which, in turn, break some existing
assumptions and result in new challenges. In an SMP system with N processors,
these are the new execution characteristics:

- N tasks execute in parallel

- hardware events result in interrupts

There is true parallelism with a task executing on each processor and the
possibility of interrupts occurring on each processor. Thus in contrast to
there being one task and one interrupt to consider on a uniprocessor, there are
N tasks and potentially N simultaneous interrupts to consider on an SMP system.

This increase in hardware complexity and presence of true parallelism results
in the application developer needing to be even more cautious about mutual
exclusion and shared data access than in a uniprocessor embedded system. Race
conditions that never or rarely happened when an application executed on a
uniprocessor system become much more likely due to multiple threads executing
in parallel. On a uniprocessor system, these race conditions would only happen
when a task switch occurred at just the wrong moment. Now there are N-1 other
tasks executing in parallel all the time and this results in many more
opportunities for small windows in critical sections to be hit.

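The following sketch illustrates such a race window with a shared counter which
is incremented by several threads in parallel.  It uses POSIX threads and
:term:`C11` atomic operations so that it is not tied to a particular RTEMS
service.  The names ``worker`` and ``run_parallel_increments`` and the
constants are hypothetical.  A plain ``counter = counter + 1`` would be a data
race on an SMP system, whereas the atomic read-modify-write operation is not.

.. code-block:: c

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WORKER_COUNT 4
    #define INCREMENTS_PER_WORKER 100000

    /* Shared counter updated by all workers */
    static atomic_uint_fast32_t shared_counter;

    static void *worker( void *arg )
    {
      (void) arg;

      for ( uint32_t i = 0; i < INCREMENTS_PER_WORKER; ++i ) {
        /*
         * With a non-atomic increment, two processors could load the same old
         * value and one of the updates would be lost.
         */
        atomic_fetch_add_explicit( &shared_counter, 1, memory_order_relaxed );
      }

      return NULL;
    }

    /* Runs the workers in parallel and returns the final counter value */
    uint32_t run_parallel_increments( void )
    {
      pthread_t workers[ WORKER_COUNT ];
      int created = 0;

      atomic_store( &shared_counter, 0 );

      while (
        created < WORKER_COUNT
          && pthread_create( &workers[ created ], NULL, worker, NULL ) == 0
      ) {
        ++created;
      }

      for ( int i = 0; i < created; ++i ) {
        pthread_join( workers[ i ], NULL );
      }

      return (uint32_t) atomic_load( &shared_counter );
    }

If all workers were created, the function returns ``WORKER_COUNT *
INCREMENTS_PER_WORKER``, independent of how the threads are distributed over
the processors.
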
.. index:: task affinity
.. index:: thread affinity

Task Affinity
-------------

RTEMS provides services to manipulate the affinity of a task. Affinity is used
to specify the subset of processors in an SMP system on which a particular task
can execute.

By default, tasks have an affinity which allows them to execute on any
available processor.

Task affinity is a possible feature to be supported by SMP-aware
schedulers. However, only a subset of the available schedulers support
affinity. Although the behavior is scheduler specific, if the scheduler does
not support affinity, it is likely to ignore all attempts to set affinity.

The scheduler with support for arbitrary processor affinities uses a proof of
concept implementation.  See https://devel.rtems.org/ticket/2510.

.. index:: task migration
.. index:: thread migration

Task Migration
--------------

With more than one processor in the system, tasks can migrate from one
processor to another.  There are four reasons why tasks migrate in RTEMS.

- The scheduler changes explicitly via
  :ref:`rtems_task_set_scheduler() <rtems_task_set_scheduler>` or similar
  directives.

- The task processor affinity changes explicitly via
  :ref:`rtems_task_set_affinity() <rtems_task_set_affinity>` or similar
  directives.

- The task resumes execution after a blocking operation.  On a priority based
  scheduler it will evict the lowest priority task currently assigned to a
  processor in the processor set managed by the scheduler instance.

- The task moves temporarily to another scheduler instance due to locking
  protocols like the :ref:`MrsP` or the :ref:`OMIP`.

Task migration should be avoided so that the working set of a task can stay on
the most local cache level.

.. _ClusteredScheduling:

Clustered Scheduling
--------------------

The scheduler is responsible for assigning processors to some of the threads
which are ready to execute.  Trouble starts if more ready threads than
processors exist at the same time.  There are various rules how the processor
assignment can be performed attempting to fulfill additional constraints or
yield some overall system properties.  As a matter of fact it is impossible to
meet all requirements at the same time.  The way a scheduler works
distinguishes real-time operating systems from general purpose operating
systems.

We have clustered scheduling in case the set of processors of a system is
partitioned into non-empty pairwise-disjoint subsets of processors.  These
subsets are called clusters.  Clusters with a cardinality of one are
partitions.  Each cluster is owned by exactly one scheduler instance.  In case
the cluster size equals the processor count, it is called global scheduling.

Modern SMP systems have multi-layer caches.  An operating system which neglects
cache constraints in the scheduler will not yield good performance.  Real-time
operating systems usually provide priority (fixed or job-level) based
schedulers so that each of the highest priority threads is assigned to a
processor.  Priority based schedulers have difficulties in providing cache
locality for threads and may suffer from excessive thread migrations
:cite:`Brandenburg:2011:SL` :cite:`Compagnin:2014:RUN`.  Schedulers that use local run
queues and some sort of load-balancing to improve the cache utilization may not
fulfill global constraints :cite:`Gujarati:2013:LPP` and are more difficult to
implement than one would normally expect :cite:`Lozi:2016:LSDWC`.

Clustered scheduling was implemented for RTEMS SMP to best use the cache
topology of a system and to keep the worst-case latencies under control.  The
low-level SMP locks use FIFO ordering.  So, the worst-case run-time of
operations increases with each processor involved.  The scheduler configuration
is quite flexible and done at link-time, see
:ref:`ConfigurationSchedulersClustered`.  It is possible to re-assign
processors to schedulers during run-time via
:ref:`rtems_scheduler_add_processor() <rtems_scheduler_add_processor>` and
:ref:`rtems_scheduler_remove_processor() <rtems_scheduler_remove_processor>`.
The schedulers are implemented in an object-oriented fashion.

The problem is to provide synchronization primitives for inter-cluster
synchronization (more than one cluster is involved in the synchronization
process).  RTEMS currently provides the following means:

- events,

- message queues,

- mutexes using the :ref:`OMIP`,

- mutexes using the :ref:`MrsP`, and

- binary and counting semaphores.

The clustered scheduling approach enables separation of functions with
real-time requirements and functions that profit from fairness and high
throughput provided the scheduler instances are fully decoupled and adequate
inter-cluster synchronization primitives are used.

To set the scheduler of a task see :ref:`rtems_scheduler_ident()
<rtems_scheduler_ident>` and :ref:`rtems_task_set_scheduler()
<rtems_task_set_scheduler>`.

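The two directives can be combined as in the following sketch, which moves the
calling task to a scheduler instance identified by name.  The helper name
``move_self_to_scheduler``, the scheduler name ``WRK0`` and the priority value
are hypothetical.  The sketch assumes the three-argument variant of
``rtems_task_set_scheduler()`` which takes the task priority within the new
scheduler instance.  It only builds for an RTEMS target.

.. code-block:: c

    #include <rtems.h>

    /* Hypothetical helper: move the calling task to the named scheduler */
    rtems_status_code move_self_to_scheduler(
      rtems_name          scheduler_name,
      rtems_task_priority priority
    )
    {
      rtems_status_code sc;
      rtems_id          scheduler_id;

      /* Map the configured scheduler name to a scheduler identifier */
      sc = rtems_scheduler_ident( scheduler_name, &scheduler_id );

      if ( sc != RTEMS_SUCCESSFUL ) {
        return sc;
      }

      return rtems_task_set_scheduler( RTEMS_SELF, scheduler_id, priority );
    }

A task could then call ``move_self_to_scheduler( rtems_build_name( 'W', 'R',
'K', '0' ), 10 )``, provided that a scheduler instance with this name exists
in the application configuration.
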
OpenMP
------

OpenMP support for RTEMS is available via the GCC provided libgomp.  There is
libgomp support for RTEMS in the POSIX configuration of libgomp since GCC 4.9
(requires a Newlib snapshot after 2015-03-12). In GCC 6.1 or later (requires a
Newlib snapshot after 2015-07-30 for ``<sys/lock.h>`` provided self-contained
synchronization objects) there is a specialized libgomp configuration for RTEMS
which offers a significantly better performance compared to the POSIX
configuration of libgomp.  In addition, application configurable thread pools
for each scheduler instance are available in GCC 6.1 or later.

The run-time configuration of libgomp is done via environment variables
documented in the `libgomp manual <https://gcc.gnu.org/onlinedocs/libgomp/>`_.
The environment variables are evaluated in a constructor function which
executes in the context of the first initialization task before the actual
initialization task function is called (just like a global C++ constructor).
To set application specific values, a higher priority constructor function must
be used to set up the environment variables.

.. code-block:: c

    #include <stdlib.h>

    /* Runs before the libgomp constructor evaluates the environment */
    void __attribute__((constructor(1000))) config_libgomp( void )
    {
        setenv( "OMP_DISPLAY_ENV", "VERBOSE", 1 );
        setenv( "GOMP_SPINCOUNT", "30000", 1 );
        setenv( "GOMP_RTEMS_THREAD_POOLS", "1$2@SCHD", 1 );
    }

The environment variable ``GOMP_RTEMS_THREAD_POOLS`` is RTEMS-specific.  It
determines the thread pools for each scheduler instance.  The format for
``GOMP_RTEMS_THREAD_POOLS`` is a list of optional
``<thread-pool-count>[$<priority>]@<scheduler-name>`` configurations separated
by ``:`` where:

- ``<thread-pool-count>`` is the thread pool count for this scheduler instance.

- ``$<priority>`` is an optional priority for the worker threads of a thread
  pool according to ``pthread_setschedparam``.  In case a priority value is
  omitted, then a worker thread will inherit the priority of the OpenMP master
  thread that created it.  The priority of the worker thread is not changed by
  libgomp after creation, even if a new OpenMP master thread using the worker
  has a different priority.

- ``@<scheduler-name>`` is the scheduler instance name according to the RTEMS
  application configuration.

In case no thread pool configuration is specified for a scheduler instance,
then each OpenMP master thread of this scheduler instance will use its own
dynamically allocated thread pool.  To limit the worker thread count of the
thread pools, each OpenMP master thread must call ``omp_set_num_threads``.

Let us suppose we have three scheduler instances ``IO``, ``WRK0``, and ``WRK1``
with ``GOMP_RTEMS_THREAD_POOLS`` set to ``"1@WRK0:3$4@WRK1"``.  Then there are
no thread pool restrictions for scheduler instance ``IO``.  In the scheduler
instance ``WRK0`` there is one thread pool available.  Since no priority is
specified for this scheduler instance, the worker thread inherits the priority
of the OpenMP master thread that created it.  In the scheduler instance
``WRK1`` there are three thread pools available and their worker threads run at
priority four.

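To make the format of ``GOMP_RTEMS_THREAD_POOLS`` more concrete, the following
sketch splits such a string into its items.  This is not the parser used by
libgomp.  The type ``pool_config`` and the functions ``parse_number``,
``parse_item`` and ``parse_thread_pools`` are hypothetical, and the limit of
four characters for a scheduler name is an assumption based on the usual RTEMS
object names.

.. code-block:: c

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
      unsigned long pool_count;
      bool          has_priority;
      unsigned long priority;
      char          scheduler_name[ 5 ]; /* Assumed: up to four characters */
    } pool_config;

    /* Parses a decimal number at *cursor and advances *cursor past it */
    static bool parse_number( const char **cursor, unsigned long *value )
    {
      char *end;

      if ( !isdigit( (unsigned char) **cursor ) ) {
        return false;
      }

      *value = strtoul( *cursor, &end, 10 );
      *cursor = end;
      return true;
    }

    /* Parses one "<count>[$<priority>]@<name>" item at *cursor */
    static bool parse_item( const char **cursor, pool_config *config )
    {
      size_t name_length;

      if ( !parse_number( cursor, &config->pool_count ) ) {
        return false;
      }

      config->has_priority = false;
      config->priority = 0;

      if ( **cursor == '$' ) {
        ++( *cursor );

        if ( !parse_number( cursor, &config->priority ) ) {
          return false;
        }

        config->has_priority = true;
      }

      if ( **cursor != '@' ) {
        return false;
      }

      ++( *cursor );
      name_length = strcspn( *cursor, ":" );

      if (
        name_length == 0 || name_length >= sizeof( config->scheduler_name )
      ) {
        return false;
      }

      memcpy( config->scheduler_name, *cursor, name_length );
      config->scheduler_name[ name_length ] = '\0';
      *cursor += name_length;
      return true;
    }

    /* Returns the count of parsed items, or -1 for a malformed string */
    int parse_thread_pools( const char *value, pool_config *configs, int max )
    {
      int count = 0;

      while ( *value != '\0' ) {
        if ( count == max || !parse_item( &value, &configs[ count ] ) ) {
          return -1;
        }

        ++count;

        if ( *value == ':' ) {
          ++value;
        }
      }

      return count;
    }

For the string ``"1@WRK0:3$4@WRK1"`` this yields two items: one thread pool
without a priority for ``WRK0`` and three thread pools with priority four for
``WRK1``.
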
Application Issues
==================

Most operating system services provided by the uni-processor RTEMS are
available in SMP configurations as well.  However, applications designed for a
uni-processor environment may need some changes to correctly run in an SMP
configuration.

As discussed earlier, SMP systems have opportunities for true parallelism which
was not possible on uni-processor systems. Consequently, multiple techniques
that provided adequate critical sections on uni-processor systems are unsafe on
SMP systems. In this section, some of these unsafe techniques will be
discussed.

In general, applications must use proper operating system provided mutual
exclusion mechanisms to ensure correct behavior.

Task variables
--------------

Task variables are ordinary global variables with a dedicated value for each
thread.  During a context switch from the executing thread to the heir thread,
the value of each task variable is saved to the thread control block of the
executing thread and restored from the thread control block of the heir thread.
This is inherently broken if more than one executing thread exists.
Alternatives to task variables are POSIX keys and :term:`TLS`.  All use cases
of task variables in the RTEMS code base were replaced with alternatives.  The
task variable API has been removed in RTEMS 5.1.

Highest Priority Thread Never Walks Alone
-----------------------------------------

On a uni-processor system, it is safe to assume that when the highest priority
task in an application executes, it will execute without being preempted until
it voluntarily blocks. Interrupts may occur while it is executing, but there
will be no context switch to another task unless the highest priority task
voluntarily initiates it.

Given the assumption that no other tasks will have their execution interleaved
with the highest priority task, it is possible for this task to be constructed
such that it does not need to acquire a mutex for protected access to shared
data.

In an SMP system, it cannot be assumed that only a single task is executing at
a time. It should be assumed that every processor is executing another
application task. Further, those tasks will be ones which would not have been
executed in a uni-processor configuration and should be assumed to have data
synchronization conflicts with what was formerly the highest priority task
which executed without conflict.

Disabling of Thread Preemption
------------------------------

A thread which disables preemption prevents a higher priority thread from
taking over its processor involuntarily.  In uni-processor configurations, this
can be used to ensure mutual exclusion at thread level.  In SMP configurations,
however, more than one executing thread may exist.  Thus, it is impossible to
ensure mutual exclusion using this mechanism.  In order to prevent applications
which use preemption for this purpose from showing inappropriate behaviour,
this feature is disabled in SMP configurations and its use would cause run-time
errors.

Disabling of Interrupts
-----------------------

A low-overhead means to ensure mutual exclusion in uni-processor
configurations is the disabling of interrupts around a critical section.  This
is commonly used in device driver code.  In SMP configurations, however,
disabling the interrupts on one processor has no effect on other processors.
So, this is insufficient to ensure system-wide mutual exclusion.  The macros

* :ref:`rtems_interrupt_disable() <rtems_interrupt_disable>`,

* :ref:`rtems_interrupt_enable() <rtems_interrupt_enable>`, and

* :ref:`rtems_interrupt_flash() <rtems_interrupt_flash>`

are disabled in SMP configurations and their use will cause compile-time
warnings and link-time errors.  In the unlikely case that interrupts must be
disabled on the current processor, the

* :ref:`rtems_interrupt_local_disable() <rtems_interrupt_local_disable>`, and

* :ref:`rtems_interrupt_local_enable() <rtems_interrupt_local_enable>`

macros are now available in all configurations.

Since disabling of interrupts is insufficient to ensure system-wide mutual
exclusion on SMP, a new low-level synchronization primitive was added:
interrupt locks.  The interrupt locks are a simple API layer on top of the SMP
locks used for low-level synchronization in the operating system core.
Currently, they are implemented as a ticket lock.  In uni-processor
configurations, they degenerate to simple interrupt disable/enable sequences by
means of the C pre-processor.  It is disallowed to acquire a single interrupt
lock in a nested way.  This will result in an infinite loop with interrupts
disabled.  While converting legacy code to interrupt locks, care must be taken
to avoid this situation.

.. code-block:: c
    :linenos:

    #include <rtems.h>

    void legacy_code_with_interrupt_disable_enable( void )
    {
      rtems_interrupt_level level;

      rtems_interrupt_disable( level );
      /* Critical section */
      rtems_interrupt_enable( level );
    }

    RTEMS_INTERRUPT_LOCK_DEFINE( static, lock, "Name" )

    void smp_ready_code_with_interrupt_lock( void )
    {
      rtems_interrupt_lock_context lock_context;

      rtems_interrupt_lock_acquire( &lock, &lock_context );
      /* Critical section */
      rtems_interrupt_lock_release( &lock, &lock_context );
    }

An alternative to the RTEMS-specific interrupt locks are POSIX spinlocks.  The
:c:type:`pthread_spinlock_t` is defined as a self-contained object, i.e. the
user must provide the storage for this synchronization object.

.. code-block:: c
    :linenos:

    #include <assert.h>
    #include <pthread.h>

    pthread_spinlock_t lock;

    void smp_ready_code_with_posix_spinlock( void )
    {
      int error;

      error = pthread_spin_lock( &lock );
      assert( error == 0 );
      /* Critical section */
      error = pthread_spin_unlock( &lock );
      assert( error == 0 );
    }

In contrast to the POSIX spinlock implementations on Linux or FreeBSD, it is
not allowed to call blocking operating system services inside the critical
section.  A recursive lock attempt is a severe usage error resulting in an
infinite loop with interrupts disabled.  Nesting of different locks is allowed.
The user must ensure that no deadlock can occur.  As a non-portable feature the
locks are zero-initialized, e.g. statically initialized global locks reside in
the ``.bss`` section and there is no need to call :c:func:`pthread_spin_init`.

Interrupt Service Routines Execute in Parallel With Threads
-----------------------------------------------------------

On a machine with more than one processor, interrupt service routines (this
includes timer service routines installed via :ref:`rtems_timer_fire_after()
<rtems_timer_fire_after>`) and threads can execute in parallel.  Interrupt
service routines must take this into account and use proper locking mechanisms
to protect critical sections from interference by threads (interrupt locks or
POSIX spinlocks).  This likely requires code modifications in legacy device
drivers.

Timers Do Not Stop Immediately
------------------------------

Timer service routines run in the context of the clock interrupt.  On
uni-processor configurations, it is sufficient to disable interrupts and remove
a timer from the set of active timers to stop it.  In SMP configurations,
however, the timer service routine may already run and wait on an SMP lock
owned by the thread which is about to stop the timer.  This opens the door to
subtle synchronization issues.  During destruction of objects, special care
must be taken to ensure that timer service routines cannot access (partly or
fully) destroyed objects.

False Sharing of Cache Lines Due to Objects Table
-------------------------------------------------

The Classic API and most POSIX API objects are indirectly accessed via an
object identifier.  The user-level functions validate the object identifier and
map it to the actual object structure which resides in a global objects table
for each object class.  So, unrelated objects are packed together in a table.
This may result in false sharing of cache lines.  The effect of false sharing
of cache lines can be observed with the `TMFINE 1
<https://git.rtems.org/rtems/tree/testsuites/tmtests/tmfine01>`_ test program
on a suitable platform, e.g. QorIQ T4240.  High-performance SMP applications
need full control of the object storage :cite:`Drepper:2007:Memory`.
Therefore, self-contained synchronization objects are now available for RTEMS.

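With control over the object storage, an application can place unrelated
objects on different cache lines.  The following sketch shows the idea with
two plain :term:`C11` atomic counters instead of real synchronization objects.
The type names are hypothetical and the cache line size of 64 bytes is an
assumption, since the real value is target-specific.

.. code-block:: c

    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stddef.h>

    /* Assumed cache line size in bytes, the real value is target-specific */
    #define CACHE_LINE_SIZE 64

    /* Packed layout: both counters likely share one cache line */
    typedef struct {
      atomic_uint counter_a;
      atomic_uint counter_b;
    } packed_counters;

    /* Padded layout: each counter starts on its own cache line */
    typedef struct {
      alignas( CACHE_LINE_SIZE ) atomic_uint counter_a;
      alignas( CACHE_LINE_SIZE ) atomic_uint counter_b;
    } padded_counters;

    /* The layout is checked at compile-time */
    _Static_assert(
      offsetof( padded_counters, counter_b )
        - offsetof( padded_counters, counter_a ) >= CACHE_LINE_SIZE,
      "counters must not share a cache line"
    );

If one processor updates ``counter_a`` and another processor updates
``counter_b``, then the padded layout avoids that the two processors compete
for the same cache line, at the cost of a larger object.
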
Directives
==========

This section details the symmetric multiprocessing services.  A subsection is
dedicated to each of these services and describes the calling sequence, related
constants, usage, and status codes.

.. raw:: latex

   \clearpage

.. _rtems_get_processor_count:

GET_PROCESSOR_COUNT - Get processor count
-----------------------------------------

CALLING SEQUENCE:
    .. code-block:: c

        uint32_t rtems_get_processor_count(void);

DIRECTIVE STATUS CODES:
    The count of processors in the system that can be run. The value returned
    is the highest numbered processor index of all processors available to the
    application (if a scheduler is assigned) plus one.

DESCRIPTION:
    In uni-processor configurations, a value of one will be returned.

    In SMP configurations, this returns the value of a global variable set
    during system initialization to indicate the count of utilized processors.
    The processor count depends on the physically or virtually available
    processors and application configuration.  The value will always be less
    than or equal to the maximum count of application configured processors.

NOTES:
    None.

.. raw:: latex

   \clearpage

.. _rtems_get_current_processor:

GET_CURRENT_PROCESSOR - Get current processor index
---------------------------------------------------

CALLING SEQUENCE:
    .. code-block:: c

        uint32_t rtems_get_current_processor(void);

DIRECTIVE STATUS CODES:
    The index of the current processor.

DESCRIPTION:
    In uni-processor configurations, a value of zero will be returned.

    In SMP configurations, an architecture specific method is used to obtain the
    index of the current processor in the system.  The set of processor indices
    is the range of integers starting with zero up to the processor count minus
    one.

    Outside of sections with disabled thread dispatching the current processor
    index may change after every instruction since the thread may migrate from
    one processor to another.  Sections with disabled interrupts are sections
    with thread dispatching disabled.

NOTES:
    None.

Implementation Details
======================

This section covers some implementation details of the RTEMS SMP support.

Low-Level Synchronization
-------------------------

All low-level synchronization primitives are implemented using :term:`C11`
atomic operations, so no target-specific hand-written assembler code is
necessary.  Four synchronization primitives are currently available:

* ticket locks (mutual exclusion),

* :term:`MCS` locks (mutual exclusion),

* barriers, implemented as a sense barrier, and

* sequence locks :cite:`Boehm:2012:Seqlock`.

A vital requirement for low-level mutual exclusion is :term:`FIFO` fairness
since we are interested in a predictable system and not maximum throughput.
With this requirement, there are only a few options to resolve this problem.
For reasons of simplicity, the ticket lock algorithm was chosen to implement
the SMP locks.  However, the API is capable of supporting MCS locks, which may
be interesting in the future for systems with a processor count in the range of
32 or more, e.g.  :term:`NUMA`, many-core systems.

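The ticket lock algorithm can be sketched with :term:`C11` atomic operations
as follows.  This is a simplified illustration and not the RTEMS
implementation.  The type ``ticket_lock`` and the function names are
hypothetical, and interrupt handling and profiling are left out.  Each
processor draws a ticket and waits until its ticket is served, which yields
the FIFO fairness.

.. code-block:: c

    #include <stdatomic.h>

    typedef struct {
      atomic_uint next_ticket; /* Ticket for the next arriving processor */
      atomic_uint now_serving; /* Ticket which may enter the section */
    } ticket_lock;

    static inline void ticket_lock_init( ticket_lock *lock )
    {
      atomic_init( &lock->next_ticket, 0 );
      atomic_init( &lock->now_serving, 0 );
    }

    static inline void ticket_lock_acquire( ticket_lock *lock )
    {
      /* Draw a ticket, the arrival order defines the service order */
      unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket,
        1,
        memory_order_relaxed
      );

      /* Busy wait until the ticket is served */
      while (
        atomic_load_explicit( &lock->now_serving, memory_order_acquire )
          != my_ticket
      ) {
        /* Wait */
      }
    }

    static inline void ticket_lock_release( ticket_lock *lock )
    {
      /* Only the lock owner writes now_serving, so a relaxed load suffices */
      unsigned int current = atomic_load_explicit(
        &lock->now_serving,
        memory_order_relaxed
      );

      /* Hand over to the next ticket and publish the critical section */
      atomic_store_explicit(
        &lock->now_serving,
        current + 1,
        memory_order_release
      );
    }

Since every waiting processor spins on the same ``now_serving`` variable, the
cache line traffic grows with the processor count.  This is one reason why
MCS locks, in which each processor spins on its own variable, may be
interesting for larger systems.
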
The test program `SMPLOCK 1
<https://git.rtems.org/rtems/tree/testsuites/smptests/smplock01>`_ can be used
to gather performance and fairness data for several scenarios.  The SMP lock
performance and fairness measured on the QorIQ T4240 follows as an example.
This chip contains three L2 caches.  Each L2 cache is shared by eight
processors.

.. image:: ../images/c_user/smplock01perf-t4240.*
   :width: 400
   :align: center

.. image:: ../images/c_user/smplock01fair-t4240.*
   :width: 400
   :align: center

Internal Locking
----------------

In SMP configurations, the operating system uses non-recursive SMP locks for
low-level mutual exclusion.  The locking domains are roughly:

* a particular data structure,
* the thread queue operations,
* the thread state changes, and
* the scheduler operations.

For a good average-case performance it is vital that every high-level
synchronization object, e.g. mutex, has its own SMP lock.  In the average-case,
only this SMP lock should be involved to carry out a specific operation, e.g.
obtain/release a mutex.  In general, the high-level synchronization objects
have a thread queue embedded and use its SMP lock.

In case a thread must block on a thread queue, then things get complicated.
The executing thread first acquires the SMP lock of the thread queue and then
figures out that it needs to block.  The procedure to block the thread on this
particular thread queue involves state changes of the thread itself, and for
this, thread-specific SMP locks must be used.

In order to determine if a thread is blocked on a thread queue or not,
thread-specific SMP locks must be used.  A thread priority change must
propagate this to the thread queue (possibly recursively).  Care must be taken
to not have a lock order reversal between thread queue and thread-specific SMP
locks.

Each scheduler instance has its own SMP lock.  For the scheduler helping
protocol multiple scheduler instances may be in charge of a thread.  It is not
possible to acquire two scheduler instance SMP locks at the same time,
otherwise deadlocks would happen.  A thread-specific SMP lock is used to
synchronize the thread data shared by different scheduler instances.

The thread state SMP lock protects various things, e.g. the thread state, join
operations, signals, post-switch actions, the home scheduler instance, etc.

648Profiling
649---------
650
651To identify the bottlenecks in the system, support for profiling of low-level
652synchronization is optionally available.  The profiling support is a BSP build
653time configuration option (``--enable-profiling``) and is implemented with an
654acceptable overhead, even for production systems.  A low-overhead counter for
655short time intervals must be provided by the hardware.
656
657Profiling reports are generated in XML for most test programs of the RTEMS
658testsuite (more than 500 test programs).  This gives a good sample set for
659statistics.  For example the maximum thread dispatch disable time, the maximum
660interrupt latency or lock contention can be determined.
661
662.. code-block:: xml
663
   <ProfilingReport name="SMPMIGRATION 1">
     <PerCPUProfilingReport processorIndex="0">
       <MaxThreadDispatchDisabledTime unit="ns">36636</MaxThreadDispatchDisabledTime>
       <MeanThreadDispatchDisabledTime unit="ns">5065</MeanThreadDispatchDisabledTime>
       <TotalThreadDispatchDisabledTime unit="ns">3846635988
         </TotalThreadDispatchDisabledTime>
       <ThreadDispatchDisabledCount>759395</ThreadDispatchDisabledCount>
       <MaxInterruptDelay unit="ns">8772</MaxInterruptDelay>
       <MaxInterruptTime unit="ns">13668</MaxInterruptTime>
       <MeanInterruptTime unit="ns">6221</MeanInterruptTime>
       <TotalInterruptTime unit="ns">6757072</TotalInterruptTime>
       <InterruptCount>1086</InterruptCount>
     </PerCPUProfilingReport>
     <PerCPUProfilingReport processorIndex="1">
       <MaxThreadDispatchDisabledTime unit="ns">39408</MaxThreadDispatchDisabledTime>
       <MeanThreadDispatchDisabledTime unit="ns">5060</MeanThreadDispatchDisabledTime>
       <TotalThreadDispatchDisabledTime unit="ns">3842749508
         </TotalThreadDispatchDisabledTime>
       <ThreadDispatchDisabledCount>759391</ThreadDispatchDisabledCount>
       <MaxInterruptDelay unit="ns">8412</MaxInterruptDelay>
       <MaxInterruptTime unit="ns">15868</MaxInterruptTime>
       <MeanInterruptTime unit="ns">3525</MeanInterruptTime>
       <TotalInterruptTime unit="ns">3814476</TotalInterruptTime>
       <InterruptCount>1082</InterruptCount>
     </PerCPUProfilingReport>
     <!-- more reports omitted -->
     <SMPLockProfilingReport name="Scheduler">
       <MaxAcquireTime unit="ns">7092</MaxAcquireTime>
       <MaxSectionTime unit="ns">10984</MaxSectionTime>
       <MeanAcquireTime unit="ns">2320</MeanAcquireTime>
       <MeanSectionTime unit="ns">199</MeanSectionTime>
       <TotalAcquireTime unit="ns">3523939244</TotalAcquireTime>
       <TotalSectionTime unit="ns">302545596</TotalSectionTime>
       <UsageCount>1518758</UsageCount>
       <ContentionCount initialQueueLength="0">759399</ContentionCount>
       <ContentionCount initialQueueLength="1">759359</ContentionCount>
       <ContentionCount initialQueueLength="2">0</ContentionCount>
       <ContentionCount initialQueueLength="3">0</ContentionCount>
     </SMPLockProfilingReport>
   </ProfilingReport>
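
The figures in the sample report can be cross-checked with a little arithmetic.
The following sketch is not part of RTEMS; the helper names
``demo_mean_ns()`` and ``demo_contention_sum()`` are hypothetical.  It assumes
that each mean value is the total time divided by the corresponding count using
truncating integer division, which is consistent with the sample, and it sums
the per-queue-length contention counters of an SMP lock.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mean in whole nanoseconds: total divided by count, truncated.  That the
 * report truncates rather than rounds is an assumption; it happens to match
 * every mean value in the sample report.
 */
uint64_t demo_mean_ns(uint64_t total_ns, uint64_t count)
{
  return count == 0 ? 0 : total_ns / count;
}

/* Sum of the ContentionCount entries of one SMP lock report. */
uint64_t demo_contention_sum(const uint64_t *counts, unsigned n)
{
  uint64_t sum = 0;

  for (unsigned i = 0; i < n; ++i) {
    sum += counts[i];
  }

  return sum;
}
```

For processor 0, ``demo_mean_ns(3846635988, 759395)`` yields 5065, the reported
``MeanThreadDispatchDisabledTime``.  For the ``Scheduler`` lock, the four
``ContentionCount`` values add up to 1518758, the reported ``UsageCount``.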

Scheduler Helping Protocol
--------------------------

The scheduler provides a helping protocol to support locking protocols like the
:ref:`OMIP` or the :ref:`MrsP`.  Each thread has a scheduler node for each
scheduler instance in the system.  These nodes are located in its :term:`TCB`.
A thread has exactly one home scheduler instance which is set during thread
creation.  The home scheduler instance can be changed with
:ref:`rtems_task_set_scheduler() <rtems_task_set_scheduler>`.  Due to the
locking protocols a thread may gain access to scheduler nodes of other
scheduler instances.  This allows the thread to temporarily migrate to another
scheduler instance in case of preemption.

The scheduler infrastructure is based on an object-oriented design.  The
scheduler operations for a thread are defined as virtual functions.  For the
scheduler helping protocol the following operations must be implemented by an
SMP-aware scheduler:

* ask a scheduler node for help,
* reconsider the help request of a scheduler node, and
* withdraw a scheduler node.
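
In C, virtual functions of this kind are commonly expressed as a table of
function pointers.  The following sketch is a simplified, hypothetical model
and not the RTEMS implementation: the ``demo_*`` types, the
``idle_processors`` counter and the behaviour of each operation are
assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; these are not the RTEMS types. */
typedef struct demo_scheduler demo_scheduler;

typedef struct demo_node {
  demo_scheduler *scheduler; /* scheduler instance this node belongs to */
  bool scheduled;            /* node currently owns a processor */
  bool help_requested;       /* a help request is registered for the node */
} demo_node;

/* The three helping operations as a table of "virtual functions". */
typedef struct demo_ops {
  bool (*ask_for_help)(demo_node *node);
  void (*reconsider_help_request)(demo_node *node);
  void (*withdraw_node)(demo_node *node);
} demo_ops;

struct demo_scheduler {
  const demo_ops *ops;
  int idle_processors;       /* processors the instance could hand out */
};

/* Succeeds if an idle processor exists, otherwise registers the request. */
bool demo_ask_for_help(demo_node *node)
{
  demo_scheduler *s = node->scheduler;

  if (s->idle_processors > 0) {
    --s->idle_processors;
    node->scheduled = true;
    node->help_requested = false;
    return true;
  }

  node->help_requested = true;
  return false;
}

/* In this toy model every request still registered here is stale. */
void demo_reconsider_help_request(demo_node *node)
{
  node->help_requested = false;
}

/* Takes the node away from the thread and frees its processor, if any. */
void demo_withdraw_node(demo_node *node)
{
  if (node->scheduled) {
    ++node->scheduler->idle_processors;
    node->scheduled = false;
  }

  node->help_requested = false;
}

const demo_ops demo_simple_ops = {
  .ask_for_help = demo_ask_for_help,
  .reconsider_help_request = demo_reconsider_help_request,
  .withdraw_node = demo_withdraw_node
};
```

Callers only go through ``scheduler->ops``, so a scheduler variant can be
substituted by providing a different table.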

All currently available SMP-aware schedulers use a framework which is
customized via inline functions.  This eases the implementation of scheduler
variants.  Up to now, only priority-based schedulers are implemented.

In case a thread is allowed to use more than one scheduler node, it will ask
these nodes for help

* in case of preemption,
* in case an unblock did not schedule the thread, or
* in case a yield was successful.

The actual ask for help scheduler operations are carried out as a side-effect
of the thread dispatch procedure.  Once a need for help is recognized, a help
request is registered on one of the processors related to the thread and a
thread dispatch is issued.  This indirection leads to a better decoupling of
scheduler instances.  Unrelated processors are not burdened with extra work for
threads which participate in resource sharing.  Each ask for help operation
indicates whether it could help or not.  The procedure stops after the first
successful ask for help.  Unsuccessful ask for help operations will register
this need in the scheduler context.
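
The "stop after the first success, register the need otherwise" rule can be
sketched as a simple loop.  This is an illustrative model only: the
``demo_context`` type and both functions are hypothetical, and a scheduler
context is reduced to a counter of idle processors and a counter of registered
help requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler context. */
typedef struct demo_context {
  int idle_processors;     /* processors this instance could provide */
  int registered_requests; /* help requests recorded in this context */
} demo_context;

/* One ask for help operation: help if possible, else register the need. */
bool demo_context_ask(demo_context *ctx)
{
  if (ctx->idle_processors > 0) {
    --ctx->idle_processors;
    return true;
  }

  ++ctx->registered_requests;
  return false;
}

/*
 * Ask the contexts in order and stop after the first one that helps.
 * Returns the index of the helping context or -1 if none could help.
 */
int demo_ask_nodes_for_help(demo_context *const contexts[], size_t count)
{
  for (size_t i = 0; i < count; ++i) {
    if (demo_context_ask(contexts[i])) {
      return (int) i;
    }
  }

  return -1;
}
```

Contexts after the first successful one are never asked, so they carry no
extra work for the thread in question.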

After a thread dispatch the reconsider help request operation is used to clean
up stale help registrations in the scheduler contexts.

The withdraw operation takes away scheduler nodes once the thread is no longer
allowed to use them, e.g. after it released a mutex.  The availability of
scheduler nodes for a thread is controlled by the thread queues.

Thread Dispatch Details
-----------------------

This section gives background information to developers interested in the
interrupt latencies introduced by thread dispatching.  A thread dispatch
consists of all work which must be done to stop the currently executing thread
on a processor and hand over this processor to an heir thread.

In SMP systems, scheduling decisions on one processor must be propagated
to other processors through inter-processor interrupts.  A thread dispatch
which must be carried out on another processor does not happen instantaneously.
Thus, several thread dispatch requests might be in flight and it is possible
that some of them are out of date before the corresponding processor has
time to deal with them.  The thread dispatch mechanism uses three per-processor
variables:

- the executing thread,

- the heir thread, and

- a boolean flag indicating if a thread dispatch is necessary or not.
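
A minimal sketch of these three variables, using C11 atomics, is shown here.
It is a hypothetical single-file model and not the RTEMS per-processor
control: all ``demo_*`` names are invented, the flag is set directly instead of
by an inter-processor interrupt handler, and the chosen memory orderings are an
assumption.  It illustrates that a newer request simply overwrites an
out-of-date heir and that the dispatch work is skipped when the flag is clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct demo_thread {
  int id;
} demo_thread;

/* Hypothetical model of the three per-processor variables. */
typedef struct demo_per_cpu {
  demo_thread *executing;         /* only touched by the owning processor */
  demo_thread *_Atomic heir;      /* may be updated by other processors */
  atomic_bool dispatch_necessary; /* set when the heir has changed */
  unsigned heavy_path_runs;       /* statistics for this example only */
} demo_per_cpu;

void demo_per_cpu_init(demo_per_cpu *cpu, demo_thread *first)
{
  cpu->executing = first;
  atomic_init(&cpu->heir, first);
  atomic_init(&cpu->dispatch_necessary, false);
  cpu->heavy_path_runs = 0;
}

/* Lock-free notification: store the heir, then raise the flag. */
void demo_request_dispatch(demo_per_cpu *cpu, demo_thread *new_heir)
{
  atomic_store_explicit(&cpu->heir, new_heir, memory_order_relaxed);
  atomic_store_explicit(&cpu->dispatch_necessary, true, memory_order_release);
}

/* Runs on the owning processor; returns true if a dispatch was done. */
bool demo_dispatch_if_necessary(demo_per_cpu *cpu)
{
  if (!atomic_load_explicit(&cpu->dispatch_necessary, memory_order_acquire)) {
    return false; /* fast path: nothing to do */
  }

  atomic_store_explicit(&cpu->dispatch_necessary, false, memory_order_relaxed);
  cpu->executing = atomic_load_explicit(&cpu->heir, memory_order_relaxed);
  ++cpu->heavy_path_runs;
  return true;
}
```

If two requests arrive before the processor reacts, only the latest heir is
dispatched and the dispatch work runs once.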

Updates of the heir thread are done via a normal store operation.  The thread
dispatch necessary indicator of another processor is set as a side-effect of an
inter-processor interrupt.  So, this change notification works without the use
of locks.  The thread context is protected by a :term:`TTAS` lock embedded in
the context to ensure that it is used on at most one processor at a time.
Normally, only thread-specific or per-processor locks are used during a thread
dispatch.  This implementation turned out to be quite efficient and no lock
contention was observed in the test suite.  The heavy-weight thread dispatch
sequence is only entered in case the thread dispatch indicator is set.
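
For readers unfamiliar with the term, a test and test-and-set (TTAS) lock first
reads the lock word and only attempts the atomic test-and-set when the lock
looks free.  The following is a generic C11 sketch of the technique with
hypothetical ``demo_ttas_*`` names; it is not the lock embedded in the RTEMS
thread context, which is architecture-specific.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct demo_ttas_lock {
  atomic_bool locked;
} demo_ttas_lock;

void demo_ttas_init(demo_ttas_lock *lock)
{
  atomic_init(&lock->locked, false);
}

/* Returns true if the lock was acquired by this call. */
bool demo_ttas_try_acquire(demo_ttas_lock *lock)
{
  /* Test: a plain read which does not write to the shared lock word. */
  if (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
    return false;
  }

  /* Test-and-set: the previous value tells us whether we won the race. */
  return !atomic_exchange_explicit(&lock->locked, true, memory_order_acquire);
}

/* Spin until the lock is acquired. */
void demo_ttas_acquire(demo_ttas_lock *lock)
{
  while (!demo_ttas_try_acquire(lock)) {
    /* busy wait */
  }
}

void demo_ttas_release(demo_ttas_lock *lock)
{
  atomic_store_explicit(&lock->locked, false, memory_order_release);
}
```

While the lock is held, waiters spin on the read only, which avoids repeated
atomic read-modify-write operations on the contended lock word.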

The context switch is performed with interrupts enabled.  During the transition
from the executing to the heir thread, neither the stack of the executing
thread nor the stack of the heir thread may be used during interrupt
processing.  For this purpose a temporary per-processor stack is set up which
may be used by the interrupt prologue before the stack is switched to the
interrupt stack.