source: rtems-docs/c-user/symmetric_multiprocessing_services.rst @ 2e0a2a0

Last change on this file since 2e0a2a0 was 2e0a2a0, checked in by Sebastian Huber <sebastian.huber@…>, on 02/02/17 at 09:49:56

c-user: Add SMP implementation details section

.. comment SPDX-License-Identifier: CC-BY-SA-4.0

.. COMMENT: COPYRIGHT (c) 2014.
.. COMMENT: On-Line Applications Research Corporation (OAR).
.. COMMENT: All rights reserved.

Symmetric Multiprocessing Services
==================================
Introduction
------------

The Symmetric Multiprocessing (SMP) support of RTEMS 4.11.0 and later is
available on

- ARM,

- PowerPC, and

- SPARC.

It must be explicitly enabled via the ``--enable-smp`` configure command line
option.  To enable SMP in the application configuration see :ref:`Enable SMP
Support for Applications`.  The default scheduler for SMP applications supports
up to 32 processors and is a global fixed priority scheduler, see also
:ref:`Configuring Clustered Schedulers`.  For example applications see
``testsuites/smptests``.

.. warning::

   The SMP support in this release of RTEMS is a work in progress.  Before you
   start using this RTEMS version for SMP ask on the RTEMS mailing list.
This chapter describes the services related to Symmetric Multiprocessing
provided by RTEMS.

The application level services currently provided are:

- rtems_get_processor_count_ - Get processor count

- rtems_get_current_processor_ - Get current processor index
Background
----------

Uniprocessor versus SMP Parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Uniprocessor systems have long been used in embedded systems.  In this
hardware model, there are some system execution characteristics which have
long been taken for granted:

- one task executes at a time

- hardware events result in interrupts

There is no true parallelism.  Even when interrupts appear to occur at the
same time, they are processed in largely a serial fashion.  This is true even
when the interrupt service routines are allowed to nest.  From a tasking
viewpoint, it is the responsibility of the real-time operating system to
simulate parallelism by switching between tasks.  These task switches occur in
response to hardware interrupt events and explicit application events such as
blocking for a resource or delaying.
With symmetric multiprocessing, the presence of multiple processors allows for
true concurrency and provides for cost-effective performance improvements.
Uniprocessors tend to increase performance by increasing clock speed and
complexity.  This tends to lead to hot, power hungry microprocessors which are
poorly suited for many embedded applications.
The true concurrency is in sharp contrast to the single task and interrupt
model of uniprocessor systems.  This results in a fundamental change to the
uniprocessor system characteristics listed above.  Developers are faced with a
different set of characteristics which, in turn, break some existing
assumptions and result in new challenges.  In an SMP system with N processors,
these are the new execution characteristics:

- N tasks execute in parallel

- hardware events result in interrupts

There is true parallelism with a task executing on each processor and the
possibility of interrupts occurring on each processor.  Thus in contrast to
there being one task and one interrupt to consider on a uniprocessor, there
are N tasks and potentially N simultaneous interrupts to consider on an SMP
system.
This increase in hardware complexity and presence of true parallelism results
in the application developer needing to be even more cautious about mutual
exclusion and shared data access than in a uniprocessor embedded system.  Race
conditions that never or rarely happened when an application executed on a
uniprocessor system, become much more likely due to multiple threads executing
in parallel.  On a uniprocessor system, these race conditions would only happen
when a task switch occurred at just the wrong moment.  Now there are N-1 tasks
executing in parallel all the time and this results in many more opportunities
for small windows in critical sections to be hit.
Task Affinity
~~~~~~~~~~~~~

.. index:: task affinity
.. index:: thread affinity

RTEMS provides services to manipulate the affinity of a task.  Affinity is
used to specify the subset of processors in an SMP system on which a
particular task can execute.

By default, tasks have an affinity which allows them to execute on any
available processor.

Task affinity is a possible feature to be supported by SMP-aware schedulers.
However, only a subset of the available schedulers support affinity.  Although
the behavior is scheduler specific, if the scheduler does not support
affinity, it is likely to ignore all attempts to set affinity.

The scheduler with support for arbitrary processor affinities uses a proof of
concept implementation.
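
A sketch of setting an affinity, assuming the configured scheduler supports
processor affinities (otherwise the directive may return an error or the
request may be ignored); the function name is illustrative only:

.. code-block:: c

    #include <assert.h>
    #include <rtems.h>

    /* Sketch: restrict the executing task to processors 0 and 1. */
    void restrict_self_to_first_two_processors( void )
    {
      cpu_set_t         cpuset;
      rtems_status_code sc;

      CPU_ZERO( &cpuset );
      CPU_SET( 0, &cpuset );
      CPU_SET( 1, &cpuset );

      sc = rtems_task_set_affinity( RTEMS_SELF, sizeof( cpuset ), &cpuset );
      assert( sc == RTEMS_SUCCESSFUL );
    }

The ``cpu_set_t`` handling follows the ``<sys/cpuset.h>`` conventions also
used by :ref:`rtems_task_set_affinity() <rtems_task_set_affinity>` itself.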
Task Migration
~~~~~~~~~~~~~~

.. index:: task migration
.. index:: thread migration

With more than one processor in the system tasks can migrate from one
processor to another.  There are four reasons why tasks migrate in RTEMS.

- The scheduler changes explicitly via
  :ref:`rtems_task_set_scheduler() <rtems_task_set_scheduler>` or similar
  directives.

- The task processor affinity changes explicitly via
  :ref:`rtems_task_set_affinity() <rtems_task_set_affinity>` or similar
  directives.

- The task resumes execution after a blocking operation.  On a priority based
  scheduler it will evict the lowest priority task currently assigned to a
  processor in the processor set managed by the scheduler instance.

- The task moves temporarily to another scheduler instance due to locking
  protocols like the :ref:`MrsP` or the :ref:`OMIP`.

Task migration should be avoided so that the working set of a task can stay on
the most local cache level.
Clustered Scheduling
~~~~~~~~~~~~~~~~~~~~

The scheduler is responsible for assigning processors to some of the threads
which are ready to execute.  Trouble starts if more ready threads than
processors exist at the same time.  There are various rules how the processor
assignment can be performed attempting to fulfill additional constraints or
yield some overall system properties.  As a matter of fact it is impossible to
meet all requirements at the same time.  The way a scheduler works
distinguishes real-time operating systems from general purpose operating
systems.

We have clustered scheduling in case the set of processors of a system is
partitioned into non-empty pairwise-disjoint subsets of processors.  These
subsets are called clusters.  Clusters with a cardinality of one are
partitions.  Each cluster is owned by exactly one scheduler instance.  In case
the cluster size equals the processor count, it is called global scheduling.
Modern SMP systems have multi-layer caches.  An operating system which
neglects cache constraints in the scheduler will not yield good performance.
Real-time operating systems usually provide priority (fixed or job-level)
based schedulers so that each of the highest priority threads is assigned to a
processor.  Priority based schedulers have difficulties in providing cache
locality for threads and may suffer from excessive thread migrations
:cite:`Brandenburg:2011:SL` :cite:`Compagnin:2014:RUN`.  Schedulers that use
local run queues and some sort of load-balancing to improve the cache
utilization may not fulfill global constraints :cite:`Gujarati:2013:LPP` and
are more difficult to implement than one would normally expect
:cite:`Lozi:2016:LSDWC`.
Clustered scheduling was implemented for RTEMS SMP to best use the cache
topology of a system and to keep the worst-case latencies under control.  The
low-level SMP locks use FIFO ordering.  So, the worst-case run-time of
operations increases with each processor involved.  The scheduler
configuration is quite flexible and done at link-time, see :ref:`Configuring
Clustered Schedulers`.  It is possible to re-assign processors to schedulers
during run-time via :ref:`rtems_scheduler_add_processor()
<rtems_scheduler_add_processor>` and :ref:`rtems_scheduler_remove_processor()
<rtems_scheduler_remove_processor>`.  The schedulers are implemented in an
object-oriented fashion.
The problem is to provide synchronization primitives for inter-cluster
synchronization (more than one cluster is involved in the synchronization
process).  In RTEMS there are currently some means available:

- events,

- message queues,

- mutexes using the :ref:`OMIP`,

- mutexes using the :ref:`MrsP`, and

- binary and counting semaphores.

The clustered scheduling approach enables separation of functions with
real-time requirements and functions that profit from fairness and high
throughput provided the scheduler instances are fully decoupled and adequate
inter-cluster synchronization primitives are used.

To set the scheduler of a task see :ref:`rtems_scheduler_ident()
<rtems_scheduler_ident>` and :ref:`rtems_task_set_scheduler()
<rtems_task_set_scheduler>`.
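
A sketch of both directives in combination; the scheduler name ``WRK0`` and
the priority value are assumptions for illustration, and the exact
``rtems_task_set_scheduler()`` signature has varied between RTEMS versions:

.. code-block:: c

    #include <assert.h>
    #include <rtems.h>

    /* Sketch: move a task to the scheduler instance configured under the
     * hypothetical name "WRK0" and give it priority one there. */
    void move_task_to_cluster( rtems_id task_id )
    {
      rtems_id          scheduler_id;
      rtems_status_code sc;

      sc = rtems_scheduler_ident(
        rtems_build_name( 'W', 'R', 'K', '0' ),
        &scheduler_id
      );
      assert( sc == RTEMS_SUCCESSFUL );

      sc = rtems_task_set_scheduler( task_id, scheduler_id, 1 );
      assert( sc == RTEMS_SUCCESSFUL );
    }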
Scheduler Helping Protocol
~~~~~~~~~~~~~~~~~~~~~~~~~~

The scheduler provides a helping protocol to support locking protocols like
*Migratory Priority Inheritance* or the *Multiprocessor Resource Sharing
Protocol*.  Each ready task can use at least one scheduler node at a time to
gain access to a processor.  Each scheduler node has an owner, a user and an
optional idle task.  The owner of a scheduler node is determined at task
creation and never changes during the lifetime of a scheduler node.  The user
of a scheduler node may change due to the scheduler helping protocol.  A
scheduler node is in one of the four scheduler help states:
:dfn:`help yourself`
    This scheduler node is solely used by the owner task.  This task owns no
    resources using a helping protocol and thus does not take part in the
    scheduler helping protocol.  No help will be provided for other tasks.

:dfn:`help active owner`
    This scheduler node is owned by a task actively owning a resource and can
    be used to help out tasks.  In case this scheduler node changes its state
    from ready to scheduled and the task executes using another node, then an
    idle task will be provided as a user of this node to temporarily execute on
    behalf of the owner task.  Thus lower priority tasks are denied access to
    the processors of this scheduler instance.  In case a task actively owning
    a resource performs a blocking operation, then an idle task will be used
    also in case this node is in the scheduled state.

:dfn:`help active rival`
    This scheduler node is owned by a task actively obtaining a resource
    currently owned by another task and can be used to help out tasks.  The
    task owning this node is ready and will give away its processor in case the
    task owning the resource asks for help.

:dfn:`help passive`
    This scheduler node is owned by a task obtaining a resource currently owned
    by another task and can be used to help out tasks.  The task owning this
    node is blocked.
The following scheduler operations return a task in need for help:

- unblock,

- change priority,

- yield, and

- ask for help.

A task in need for help is a task that encounters a scheduler state change
from scheduled to ready (this is a pre-emption by a higher priority task) or a
task that cannot be scheduled in an unblock operation.  Such a task can ask
tasks which depend on resources owned by this task for help.

In case it is not possible to schedule a task in need for help, then the
scheduler nodes available for the task will be placed into the set of ready
scheduler nodes of the corresponding scheduler instances.  Once a state change
from ready to scheduled happens for one of the scheduler nodes it will be used
to schedule the task in need for help.

The ask for help scheduler operation is used to help tasks in need for help
returned by the operations mentioned above.  This operation is also used in
case the root of a resource sub-tree owned by a task changes.

The run-time of the ask for help procedures depends on the size of the
resource tree of the task needing help and other resource trees in case tasks
in need for help are produced during this operation.  Thus the worst-case
latency in the system depends on the maximum resource tree size of the
application.
OpenMP
~~~~~~

OpenMP support for RTEMS is available via the GCC provided libgomp.  There is
libgomp support for RTEMS in the POSIX configuration of libgomp since GCC 4.9
(requires a Newlib snapshot after 2015-03-12).  In GCC 6.1 or later (requires
a Newlib snapshot after 2015-07-30 for ``<sys/lock.h>`` provided
self-contained synchronization objects) there is a specialized libgomp
configuration for RTEMS which offers a significantly better performance
compared to the POSIX configuration of libgomp.  In addition, application
configurable thread pools for each scheduler instance are available in GCC 6.1
or later.

The run-time configuration of libgomp is done via environment variables
documented in the libgomp manual.  The environment variables are evaluated in
a constructor function which executes in the context of the first
initialization task before the actual initialization task function is called
(just like a global C++ constructor).  To set application specific values, a
higher priority constructor function must be used to set up the environment
variables.
.. code-block:: c

    #include <stdlib.h>

    void __attribute__(( constructor( 1000 ) )) config_libgomp( void )
    {
        setenv( "OMP_DISPLAY_ENV", "VERBOSE", 1 );
        setenv( "GOMP_SPINCOUNT", "30000", 1 );
        setenv( "GOMP_RTEMS_THREAD_POOLS", "1$2@SCHD", 1 );
    }
The environment variable ``GOMP_RTEMS_THREAD_POOLS`` is RTEMS-specific.  It
determines the thread pools for each scheduler instance.  The format for
``GOMP_RTEMS_THREAD_POOLS`` is a list of optional
``<thread-pool-count>[$<priority>]@<scheduler-name>`` configurations separated
by ``:`` where:

- ``<thread-pool-count>`` is the thread pool count for this scheduler
  instance.

- ``$<priority>`` is an optional priority for the worker threads of a thread
  pool according to ``pthread_setschedparam``.  In case a priority value is
  omitted, then a worker thread will inherit the priority of the OpenMP master
  thread that created it.  The priority of the worker thread is not changed by
  libgomp after creation, even if a new OpenMP master thread using the worker
  has a different priority.

- ``@<scheduler-name>`` is the scheduler instance name according to the RTEMS
  application configuration.

In case no thread pool configuration is specified for a scheduler instance,
then each OpenMP master thread of this scheduler instance will use its own
dynamically allocated thread pool.  To limit the worker thread count of the
thread pools, each OpenMP master thread must call ``omp_set_num_threads``.
Let us suppose we have three scheduler instances ``IO``, ``WRK0``, and
``WRK1`` with ``GOMP_RTEMS_THREAD_POOLS`` set to ``"1@WRK0:3$4@WRK1"``.  Then
there are no thread pool restrictions for scheduler instance ``IO``.  In the
scheduler instance ``WRK0`` there is one thread pool available.  Since no
priority is specified for this scheduler instance, the worker thread inherits
the priority of the OpenMP master thread that created it.  In the scheduler
instance ``WRK1`` there are three thread pools available and their worker
threads run at priority four.
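
The effect of ``omp_set_num_threads`` can be sketched with a small, generic
OpenMP program (not RTEMS-specific; compile with ``-fopenmp``, although the
code below also degrades gracefully to a single thread without it):

.. code-block:: c

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    /* Count the threads in one parallel team; with OpenMP enabled the
     * master thread limits its team to at most two workers. */
    int team_size( void )
    {
      int counted = 0;

    #ifdef _OPENMP
      omp_set_num_threads( 2 );
    #endif

      #pragma omp parallel reduction( + : counted )
      counted = 1; /* each team member contributes one */

      return counted; /* the actual team size; 1 without OpenMP */
    }

    int main( void )
    {
      printf( "team size: %d\n", team_size() );
      return 0;
    }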
Application Issues
------------------

Most operating system services provided by the uni-processor RTEMS are
available in SMP configurations as well.  However, applications designed for a
uni-processor environment may need some changes to correctly run in an SMP
configuration.

As discussed earlier, SMP systems have opportunities for true parallelism
which was not possible on uni-processor systems.  Consequently, multiple
techniques that provided adequate critical sections on uni-processor systems
are unsafe on SMP systems.  In this section, some of these unsafe techniques
will be discussed.

In general, applications must use proper operating system provided mutual
exclusion mechanisms to ensure correct behavior.
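
As a portable sketch of this advice (generic POSIX threads, not an
RTEMS-specific API; the names are illustrative), a shared counter protected by
a mutex stays correct even when the incrementing threads really run in
parallel:

.. code-block:: c

    #include <pthread.h>

    static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    static void *increment( void *arg )
    {
      (void) arg;

      for ( long i = 0; i < 100000; ++i ) {
        /* Without the mutex, increments of the two threads could be lost
         * due to a read-modify-write race on SMP. */
        pthread_mutex_lock( &counter_mutex );
        ++counter;
        pthread_mutex_unlock( &counter_mutex );
      }

      return NULL;
    }

    /* Run two incrementing threads and return the final counter value. */
    long run_two_incrementers( void )
    {
      pthread_t a;
      pthread_t b;

      pthread_create( &a, NULL, increment, NULL );
      pthread_create( &b, NULL, increment, NULL );
      pthread_join( a, NULL );
      pthread_join( b, NULL );

      return counter; /* 200000 when properly serialized */
    }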
Task variables
~~~~~~~~~~~~~~

Task variables are ordinary global variables with a dedicated value for each
thread.  During a context switch from the executing thread to the heir
thread, the value of each task variable is saved to the thread control block
of the executing thread and restored from the thread control block of the heir
thread.  This is inherently broken if more than one executing thread exists.
Alternatives to task variables are POSIX keys and :ref:`TLS <TLS>`.  All use
cases of task variables in the RTEMS code base were replaced with
alternatives.  The task variable API has been removed in RTEMS 4.12.
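
The thread-local storage alternative can be sketched portably with the C11
``_Thread_local`` keyword; the names below (``task_variable``, ``tls_demo``)
are illustrative only:

.. code-block:: c

    #include <pthread.h>

    /* Each thread gets its own private copy of this "task variable". */
    static _Thread_local int task_variable;

    static void *worker( void *arg )
    {
      /* Writing the private copy cannot disturb other threads. */
      task_variable = *(int *) arg;
      *(int *) arg = task_variable + 1;
      return NULL;
    }

    /* Run two workers in parallel and check that neither of them touched
     * the main thread's copy of the variable. */
    int tls_demo( void )
    {
      int a = 10;
      int b = 20;
      pthread_t ta;
      pthread_t tb;

      task_variable = 1; /* the main thread's private copy */
      pthread_create( &ta, NULL, worker, &a );
      pthread_create( &tb, NULL, worker, &b );
      pthread_join( ta, NULL );
      pthread_join( tb, NULL );

      /* 11 + 21 + 1 = 33 if each thread used its own copy */
      return a + b + task_variable;
    }

Unlike task variables, no context switch hook is involved: the address of
``task_variable`` simply differs per thread.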
Highest Priority Thread Never Walks Alone
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On a uni-processor system, it is safe to assume that when the highest priority
task in an application executes, it will execute without being preempted until
it voluntarily blocks.  Interrupts may occur while it is executing, but there
will be no context switch to another task unless the highest priority task
voluntarily initiates it.

Given the assumption that no other tasks will have their execution interleaved
with the highest priority task, it is possible for this task to be constructed
such that it does not need to acquire a mutex for protected access to shared
data.

In an SMP system, it cannot be assumed there will never be a single task
executing.  It should be assumed that every processor is executing another
application task.  Further, those tasks will be ones which would not have been
executed in a uni-processor configuration and should be assumed to have data
synchronization conflicts with what was formerly the highest priority task
which executed without conflict.
Disabling of Thread Pre-Emption
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A thread which disables pre-emption prevents a higher priority thread from
involuntarily taking over its processor.  In uni-processor configurations,
this can be used to ensure mutual exclusion at thread level.  In SMP
configurations, however, more than one executing thread may exist.  Thus, it
is impossible to ensure mutual exclusion using this mechanism.  To prevent
applications which use pre-emption for this purpose from showing inappropriate
behaviour, this feature is disabled in SMP configurations and its use causes
run-time errors.
Disabling of Interrupts
~~~~~~~~~~~~~~~~~~~~~~~

A low overhead means to ensure mutual exclusion in uni-processor
configurations is the disabling of interrupts around a critical section.  This
is commonly used in device driver code.  In SMP configurations, however,
disabling the interrupts on one processor has no effect on other processors.
So, this is insufficient to ensure system-wide mutual exclusion.  The macros

* :ref:`rtems_interrupt_disable() <rtems_interrupt_disable>`,

* :ref:`rtems_interrupt_enable() <rtems_interrupt_enable>`, and

* :ref:`rtems_interrupt_flash() <rtems_interrupt_flash>`

are disabled in SMP configurations and their use will cause compile-time
warnings and link-time errors.  In the unlikely case that interrupts must be
disabled on the current processor, the

* :ref:`rtems_interrupt_local_disable() <rtems_interrupt_local_disable>` and

* :ref:`rtems_interrupt_local_enable() <rtems_interrupt_local_enable>`

macros are now available in all configurations.
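
A minimal sketch of a processor-local critical section using these macros;
note that this does not provide system-wide mutual exclusion in SMP
configurations:

.. code-block:: c

    #include <rtems.h>

    void processor_local_critical_section( void )
    {
      rtems_interrupt_level level;

      rtems_interrupt_local_disable( level );
      /* Interrupts are now disabled on the current processor only; other
       * processors continue to run and may service interrupts. */
      rtems_interrupt_local_enable( level );
    }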
Since the disabling of interrupts is insufficient to ensure system-wide mutual
exclusion on SMP, a new low-level synchronization primitive was added -- the
interrupt locks.  The interrupt locks are a simple API layer on top of the SMP
locks used for low-level synchronization in the operating system core.
Currently, they are implemented as a ticket lock.  In uni-processor
configurations, they degenerate to simple interrupt disable/enable sequences
by means of the C pre-processor.  It is disallowed to acquire a single
interrupt lock in a nested way.  This will result in an infinite loop with
interrupts disabled.  While converting legacy code to interrupt locks, care
must be taken to avoid this situation.
.. code-block:: c
    :linenos:

    #include <rtems.h>

    void legacy_code_with_interrupt_disable_enable( void )
    {
      rtems_interrupt_level level;

      rtems_interrupt_disable( level );
      /* Critical section */
      rtems_interrupt_enable( level );
    }

    RTEMS_INTERRUPT_LOCK_DEFINE( static, lock, "Name" )

    void smp_ready_code_with_interrupt_lock( void )
    {
      rtems_interrupt_lock_context lock_context;

      rtems_interrupt_lock_acquire( &lock, &lock_context );
      /* Critical section */
      rtems_interrupt_lock_release( &lock, &lock_context );
    }
An alternative to the RTEMS-specific interrupt locks are POSIX spinlocks.  The
:c:type:`pthread_spinlock_t` is defined as a self-contained object, i.e. the
user must provide the storage for this synchronization object.
.. code-block:: c
    :linenos:

    #include <assert.h>
    #include <pthread.h>

    pthread_spinlock_t lock;

    void smp_ready_code_with_posix_spinlock( void )
    {
      int error;

      error = pthread_spin_lock( &lock );
      assert( error == 0 );
      /* Critical section */
      error = pthread_spin_unlock( &lock );
      assert( error == 0 );
    }
In contrast to the POSIX spinlock implementations on Linux or FreeBSD, it is
not allowed to call blocking operating system services inside the critical
section.  A recursive lock attempt is a severe usage error resulting in an
infinite loop with interrupts disabled.  Nesting of different locks is
allowed.  The user must ensure that no deadlock can occur.  As a non-portable
feature the locks are zero-initialized, i.e. statically initialized global
locks reside in the ``.bss`` section and there is no need to call
:c:func:`pthread_spin_init`.
Interrupt Service Routines Execute in Parallel With Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On a machine with more than one processor, interrupt service routines (this
includes timer service routines installed via :ref:`rtems_timer_fire_after()
<rtems_timer_fire_after>`) and threads can execute in parallel.  Interrupt
service routines must take this into account and use proper locking mechanisms
to protect critical sections from interference by threads (interrupt locks or
POSIX spinlocks).  This likely requires code modifications in legacy device
drivers.
Timers Do Not Stop Immediately
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Timer service routines run in the context of the clock interrupt.  On
uni-processor configurations, it is sufficient to disable interrupts and
remove a timer from the set of active timers to stop it.  In SMP
configurations, however, the timer service routine may already run and wait on
an SMP lock owned by the thread which is about to stop the timer.  This opens
the door to subtle synchronization issues.  During destruction of objects,
special care must be taken to ensure that timer service routines cannot access
(partly or fully) destroyed objects.
False Sharing of Cache Lines Due to Objects Table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Classic API and most POSIX API objects are indirectly accessed via an
object identifier.  The user-level functions validate the object identifier
and map it to the actual object structure which resides in a global objects
table for each object class.  So, unrelated objects are packed together in a
table.  This may result in false sharing of cache lines.  The effect of false
sharing of cache lines can be observed with the ``TMFINE 1`` test program on a
suitable platform, e.g. QorIQ T4240.  High-performance SMP applications need
full control of the object storage :cite:`Drepper:2007:Memory`.  Therefore,
self-contained synchronization objects are now available for RTEMS.
Implementation Details
----------------------

Thread Dispatch Details
~~~~~~~~~~~~~~~~~~~~~~~

This section gives background information to developers interested in the
interrupt latencies introduced by thread dispatching.  A thread dispatch
consists of all work which must be done to stop the currently executing thread
on a processor and hand over this processor to an heir thread.

In SMP systems, scheduling decisions on one processor must be propagated to
other processors through inter-processor interrupts.  A thread dispatch which
must be carried out on another processor does not happen instantaneously.
Thus, several thread dispatch requests might be in the air and it is possible
that some of them may be out of date before the corresponding processor has
time to deal with them.  The thread dispatch mechanism uses three
per-processor variables:

- the executing thread,

- the heir thread, and

- a boolean flag indicating if a thread dispatch is necessary or not.
Updates of the heir thread are done via a normal store operation.  The thread
dispatch necessary indicator of another processor is set as a side-effect of
an inter-processor interrupt.  So, this change notification works without the
use of locks.  The thread context is protected by a TTAS lock embedded in the
context to ensure that it is used on at most one processor at a time.
Normally, only thread-specific or per-processor locks are used during a thread
dispatch.  This implementation turned out to be quite efficient and no lock
contention was observed in the testsuite.  The heavy-weight thread dispatch
sequence is only entered in case the thread dispatch indicator is set.

The context-switch is performed with interrupts enabled.  During the
transition from the executing to the heir thread neither the stack of the
executing nor the heir thread must be used during interrupt processing.  For
this purpose a temporary per-processor stack is set up which may be used by
the interrupt prologue before the stack is switched to the interrupt stack.
Directives
----------

This section details the symmetric multiprocessing services.  A subsection is
dedicated to each of these services and describes the calling sequence,
related constants, usage, and status codes.

.. raw:: latex

   \clearpage

.. _rtems_get_processor_count:

GET_PROCESSOR_COUNT - Get processor count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CALLING SEQUENCE:
    .. code-block:: c

        uint32_t rtems_get_processor_count(void);

DIRECTIVE STATUS CODES:
    The count of processors in the system.

DESCRIPTION:
    In uni-processor configurations, a value of one will be returned.

    In SMP configurations, this returns the value of a global variable set
    during system initialization to indicate the count of utilized processors.
    The processor count depends on the physically or virtually available
    processors and application configuration.  The value will always be less
    than or equal to the maximum count of application configured processors.

NOTES:
    None.
.. raw:: latex

   \clearpage

.. _rtems_get_current_processor:

GET_CURRENT_PROCESSOR - Get current processor index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CALLING SEQUENCE:
    .. code-block:: c

        uint32_t rtems_get_current_processor(void);

DIRECTIVE STATUS CODES:
    The index of the current processor.

DESCRIPTION:
    In uni-processor configurations, a value of zero will be returned.

    In SMP configurations, an architecture specific method is used to obtain
    the index of the current processor in the system.  The set of processor
    indices is the range of integers starting with zero up to the processor
    count minus one.

    Outside of sections with disabled thread dispatching the current processor
    index may change after every instruction since the thread may migrate from
    one processor to another.  Sections with disabled interrupts are sections
    with thread dispatching disabled.

NOTES:
    None.
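
A small sketch using both directives together (for illustration only; as
noted above, the printed index may already be stale by the time it appears):

.. code-block:: c

    #include <inttypes.h>
    #include <stdio.h>
    #include <rtems.h>

    /* Report which processor the executing thread currently runs on. */
    void show_processor( void )
    {
      uint32_t processor_count = rtems_get_processor_count();
      uint32_t current = rtems_get_current_processor();

      printf(
        "executing on processor %" PRIu32 " of %" PRIu32 "\n",
        current,
        processor_count
      );
    }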