.. comment SPDX-License-Identifier: CC-BY-SA-4.0

.. COMMENT: COPYRIGHT (c) 2014.
.. COMMENT: On-Line Applications Research Corporation (OAR).
.. COMMENT: Copyright (c) 2017 embedded brains GmbH.
.. COMMENT: All rights reserved.

.. index:: Symmetric Multiprocessing
.. index:: SMP

Symmetric Multiprocessing (SMP)
*******************************

Introduction
============

The Symmetric Multiprocessing (SMP) support of RTEMS is available on

- ARMv7-A,

- PowerPC, and

- SPARC.

.. warning::

    The SMP support must be explicitly enabled via the ``--enable-smp``
    configure command line option for the :term:`BSP` build.

RTEMS is supposed to be a real-time operating system.  What does this mean in
the context of SMP?  The RTEMS interpretation of real-time on SMP is the
support for :ref:`ClusteredScheduling` with priority based schedulers and
adequate locking protocols.  One aim is to enable a schedulability analysis
under the sporadic task model :cite:`Brandenburg:2011:SL`
:cite:`Burns:2013:MrsP`.

The directives provided by the SMP support are:

- rtems_get_processor_count_ - Get processor count

- rtems_get_current_processor_ - Get current processor index

Background
==========

Application Configuration
-------------------------

By default, the maximum processor count is set to one in the application
configuration.  To enable SMP, the application configuration option
:ref:`CONFIGURE_MAXIMUM_PROCESSORS <CONFIGURE_MAXIMUM_PROCESSORS>` must be
defined to a value greater than one.  It is recommended to use the smallest
value suitable for the application in order to save memory.  Each processor
needs an idle thread and an interrupt stack, for example.
---|

The default scheduler for SMP applications supports up to 32 processors and
is a global fixed priority scheduler, see also
:ref:`ConfigurationSchedulersClustered`.

The following compile-time test can be used to check if the SMP support is
available or not.

.. code-block:: c

    #include <rtems.h>

    #ifdef RTEMS_SMP
    #warning "SMP support is enabled"
    #else
    #warning "SMP support is disabled"
    #endif

Examples
--------

For example applications see `testsuites/smptests
<https://git.rtems.org/rtems/tree/testsuites/smptests>`_.

Uniprocessor versus SMP Parallelism
-----------------------------------

Uniprocessor systems have long been used in embedded systems.  In this
hardware model, there are some system execution characteristics which have
long been taken for granted:

- one task executes at a time

- hardware events result in interrupts

There is no true parallelism.  Even when interrupts appear to occur at the
same time, they are processed in largely a serial fashion.  This is true even
when the interrupt service routines are allowed to nest.  From a tasking
viewpoint, it is the responsibility of the real-time operating system to
simulate parallelism by switching between tasks.  These task switches occur
in response to hardware interrupt events and explicit application events such
as blocking for a resource or delaying.

With symmetric multiprocessing, the presence of multiple processors allows
for true concurrency and provides for cost-effective performance
improvements.  Uniprocessors tend to increase performance by increasing clock
speed and complexity.  This tends to lead to hot, power-hungry
microprocessors which are poorly suited for many embedded applications.

The true concurrency is in sharp contrast to the single task and interrupt
model of uniprocessor systems.  This results in a fundamental change to the
uniprocessor system characteristics listed above.  Developers are faced with
a different set of characteristics which, in turn, break some existing
assumptions and result in new challenges.  In an SMP system with N
processors, these are the new execution characteristics:

- N tasks execute in parallel

- hardware events result in interrupts

There is true parallelism with a task executing on each processor and the
possibility of interrupts occurring on each processor.  Thus, in contrast to
there being one task and one interrupt to consider on a uniprocessor, there
are N tasks and potentially N simultaneous interrupts to consider on an SMP
system.

This increase in hardware complexity and presence of true parallelism results
in the application developer needing to be even more cautious about mutual
exclusion and shared data access than in a uniprocessor embedded system.
Race conditions that never or rarely happened when an application executed on
a uniprocessor system become much more likely due to multiple threads
executing in parallel.  On a uniprocessor system, these race conditions would
only happen when a task switch occurred at just the wrong moment.  Now there
are N-1 tasks executing in parallel all the time and this results in many
more opportunities for small windows in critical sections to be hit.

.. index:: task affinity
.. index:: thread affinity

Task Affinity
-------------

RTEMS provides services to manipulate the affinity of a task.  Affinity is
used to specify the subset of processors in an SMP system on which a
particular task can execute.

By default, tasks have an affinity which allows them to execute on any
available processor.

Task affinity is a possible feature to be supported by SMP-aware schedulers.
However, only a subset of the available schedulers support affinity.
Although the behavior is scheduler-specific, if the scheduler does not
support affinity, it is likely to ignore all attempts to set affinity.

The scheduler with support for arbitrary processor affinities uses a proof of
concept implementation.  See https://devel.rtems.org/ticket/2510.
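
As a sketch of the affinity API, the following function pins the calling task
to processor zero.  This assumes a scheduler with affinity support; with
other schedulers the request may be ignored or rejected, as described above.
It is RTEMS-specific code and does not build on a host system.

```c
#include <rtems.h>
#include <assert.h>

/* Sketch: restrict the calling task to processor 0.  Requires a
 * scheduler with affinity support; otherwise the request may be
 * ignored or rejected. */
void pin_self_to_processor_zero( void )
{
  rtems_status_code sc;
  cpu_set_t         cpuset;

  CPU_ZERO( &cpuset );
  CPU_SET( 0, &cpuset );

  sc = rtems_task_set_affinity( RTEMS_SELF, sizeof( cpuset ), &cpuset );
  assert( sc == RTEMS_SUCCESSFUL );
}
```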
---|

.. index:: task migration
.. index:: thread migration

Task Migration
--------------

With more than one processor in the system, tasks can migrate from one
processor to another.  There are four reasons why tasks migrate in RTEMS.

- The scheduler changes explicitly via
  :ref:`rtems_task_set_scheduler() <rtems_task_set_scheduler>` or similar
  directives.

- The task processor affinity changes explicitly via
  :ref:`rtems_task_set_affinity() <rtems_task_set_affinity>` or similar
  directives.

- The task resumes execution after a blocking operation.  On a priority based
  scheduler, it will evict the lowest priority task currently assigned to a
  processor in the processor set managed by the scheduler instance.

- The task moves temporarily to another scheduler instance due to locking
  protocols like the :ref:`MrsP` or the :ref:`OMIP`.

Task migration should be avoided so that the working set of a task can stay
on the most local cache level.

.. _ClusteredScheduling:

Clustered Scheduling
--------------------

The scheduler is responsible for assigning processors to some of the threads
which are ready to execute.  Trouble starts if more ready threads than
processors exist at the same time.  There are various rules for how the
processor assignment can be performed, attempting to fulfill additional
constraints or yield some overall system properties.  As a matter of fact, it
is impossible to meet all requirements at the same time.  The way a scheduler
works distinguishes real-time operating systems from general purpose
operating systems.

We have clustered scheduling in case the set of processors of a system is
partitioned into non-empty pairwise-disjoint subsets of processors.  These
subsets are called clusters.  Clusters with a cardinality of one are
partitions.  Each cluster is owned by exactly one scheduler instance.  In
case the cluster size equals the processor count, it is called global
scheduling.

Modern SMP systems have multi-layer caches.  An operating system which
neglects cache constraints in the scheduler will not yield good performance.
Real-time operating systems usually provide priority (fixed or job-level)
based schedulers so that each of the highest priority threads is assigned to
a processor.  Priority based schedulers have difficulties in providing cache
locality for threads and may suffer from excessive thread migrations
:cite:`Brandenburg:2011:SL` :cite:`Compagnin:2014:RUN`.  Schedulers that use
local run queues and some sort of load-balancing to improve the cache
utilization may not fulfill global constraints :cite:`Gujarati:2013:LPP` and
are more difficult to implement than one would normally expect
:cite:`Lozi:2016:LSDWC`.

Clustered scheduling was implemented for RTEMS SMP to best use the cache
topology of a system and to keep the worst-case latencies under control.  The
low-level SMP locks use FIFO ordering, so the worst-case run-time of
operations increases with each processor involved.  The scheduler
configuration is quite flexible and done at link-time, see :ref:`Configuring
Clustered Schedulers`.  It is possible to re-assign processors to schedulers
during run-time via :ref:`rtems_scheduler_add_processor()
<rtems_scheduler_add_processor>` and :ref:`rtems_scheduler_remove_processor()
<rtems_scheduler_remove_processor>`.  The schedulers are implemented in an
object-oriented fashion.

The problem is to provide synchronization primitives for inter-cluster
synchronization (more than one cluster is involved in the synchronization
process).  In RTEMS, the following means are currently available:

- events,

- message queues,

- mutexes using the :ref:`OMIP`,

- mutexes using the :ref:`MrsP`, and

- binary and counting semaphores.

The clustered scheduling approach enables the separation of functions with
real-time requirements from functions that profit from fairness and high
throughput, provided the scheduler instances are fully decoupled and adequate
inter-cluster synchronization primitives are used.

To set the scheduler of a task, see :ref:`rtems_scheduler_ident()
<rtems_scheduler_ident>` and :ref:`rtems_task_set_scheduler()
<rtems_task_set_scheduler>`.
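
The two directives are typically combined as in the following sketch, which
moves a task to a scheduler instance.  The scheduler name ``WRK0`` and the
priority value are assumptions taken for illustration; the three-argument
form of ``rtems_task_set_scheduler()`` (with a priority for the task on the
new scheduler) matches newer RTEMS versions.

```c
#include <rtems.h>
#include <assert.h>

/* Sketch: move a task to the scheduler instance named "WRK0".  The
 * scheduler name is a placeholder; use a name from your application
 * configuration. */
void move_task_to_worker_scheduler( rtems_id task_id )
{
  rtems_status_code sc;
  rtems_id          scheduler_id;

  sc = rtems_scheduler_ident(
    rtems_build_name( 'W', 'R', 'K', '0' ),
    &scheduler_id
  );
  assert( sc == RTEMS_SUCCESSFUL );

  sc = rtems_task_set_scheduler( task_id, scheduler_id, 1 );
  assert( sc == RTEMS_SUCCESSFUL );
}
```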
---|

OpenMP
------

OpenMP support for RTEMS is available via the GCC-provided libgomp.  There is
libgomp support for RTEMS in the POSIX configuration of libgomp since GCC 4.9
(requires a Newlib snapshot after 2015-03-12).  In GCC 6.1 or later (requires
a Newlib snapshot after 2015-07-30 for <sys/lock.h> provided self-contained
synchronization objects) there is a specialized libgomp configuration for
RTEMS which offers a significantly better performance compared to the POSIX
configuration of libgomp.  In addition, application-configurable thread pools
for each scheduler instance are available in GCC 6.1 or later.

The run-time configuration of libgomp is done via environment variables
documented in the `libgomp manual <https://gcc.gnu.org/onlinedocs/libgomp/>`_.
The environment variables are evaluated in a constructor function which
executes in the context of the first initialization task before the actual
initialization task function is called (just like a global C++ constructor).
To set application-specific values, a higher priority constructor function
must be used to set up the environment variables.

.. code-block:: c

    #include <stdlib.h>
    void __attribute__((constructor(1000))) config_libgomp( void )
    {
      setenv( "OMP_DISPLAY_ENV", "VERBOSE", 1 );
      setenv( "GOMP_SPINCOUNT", "30000", 1 );
      setenv( "GOMP_RTEMS_THREAD_POOLS", "1$2@SCHD", 1 );
    }

The environment variable ``GOMP_RTEMS_THREAD_POOLS`` is RTEMS-specific.  It
determines the thread pools for each scheduler instance.  The format for
``GOMP_RTEMS_THREAD_POOLS`` is a list of optional
``<thread-pool-count>[$<priority>]@<scheduler-name>`` configurations
separated by ``:`` where:

- ``<thread-pool-count>`` is the thread pool count for this scheduler
  instance.

- ``$<priority>`` is an optional priority for the worker threads of a thread
  pool according to ``pthread_setschedparam``.  In case a priority value is
  omitted, then a worker thread will inherit the priority of the OpenMP
  master thread that created it.  The priority of the worker thread is not
  changed by libgomp after creation, even if a new OpenMP master thread using
  the worker has a different priority.

- ``@<scheduler-name>`` is the scheduler instance name according to the RTEMS
  application configuration.

In case no thread pool configuration is specified for a scheduler instance,
then each OpenMP master thread of this scheduler instance will use its own
dynamically allocated thread pool.  To limit the worker thread count of the
thread pools, each OpenMP master thread must call ``omp_set_num_threads``.

Let us suppose we have three scheduler instances ``IO``, ``WRK0``, and
``WRK1`` with ``GOMP_RTEMS_THREAD_POOLS`` set to ``"1@WRK0:3$4@WRK1"``.  Then
there are no thread pool restrictions for scheduler instance ``IO``.  In the
scheduler instance ``WRK0`` there is one thread pool available.  Since no
priority is specified for this scheduler instance, the worker thread inherits
the priority of the OpenMP master thread that created it.  In the scheduler
instance ``WRK1`` there are three thread pools available and their worker
threads run at priority four.
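
A minimal OpenMP work-sharing kernel that could run under such a
configuration is sketched below.  It is plain C that also compiles without
OpenMP support, in which case the pragma is ignored and the loop runs
serially with the same result.

```c
/* Sum an array, splitting the iterations across OpenMP worker threads
 * when OpenMP is enabled (e.g. with -fopenmp); the reduction clause
 * makes the result independent of the thread count. */
int parallel_sum( const int *values, int n )
{
  int sum = 0;
  int i;

  #pragma omp parallel for reduction( +: sum )
  for ( i = 0; i < n; ++i ) {
    sum += values[ i ];
  }

  return sum;
}
```

For the input ``{ 1, 2, 3, 4, 5 }`` the result is 15 regardless of how many
worker threads the configured thread pools provide.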
---|

Application Issues
==================

Most operating system services provided by the uni-processor RTEMS are
available in SMP configurations as well.  However, applications designed for
a uni-processor environment may need some changes to correctly run in an SMP
configuration.

As discussed earlier, SMP systems have opportunities for true parallelism
which was not possible on uni-processor systems.  Consequently, multiple
techniques that provided adequate critical sections on uni-processor systems
are unsafe on SMP systems.  In this section, some of these unsafe techniques
will be discussed.

In general, applications must use proper operating system provided mutual
exclusion mechanisms to ensure correct behavior.

Task variables
--------------

Task variables are ordinary global variables with a dedicated value for each
thread.  During a context switch from the executing thread to the heir
thread, the value of each task variable is saved to the thread control block
of the executing thread and restored from the thread control block of the
heir thread.  This is inherently broken if more than one executing thread
exists.  Alternatives to task variables are POSIX keys and :term:`TLS`.  All
use cases of task variables in the RTEMS code base were replaced with
alternatives.  The task variable API has been removed in RTEMS 5.1.
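
As a sketch of the TLS alternative, the C11 ``_Thread_local`` keyword (the
GCC ``__thread`` extension behaves similarly) gives every thread its own
instance of a variable without any save/restore work at context-switch time:

```c
/* Each thread sees its own instance of this counter.  Unlike the
 * removed task variable API, no copying into or out of the thread
 * control block happens on a context switch.  _Thread_local is a
 * C11 keyword, so no header is required. */
static _Thread_local int per_thread_counter;

int bump_counter( void )
{
  return ++per_thread_counter;
}
```

Two threads calling ``bump_counter()`` concurrently each count from one in
their own copy; no mutual exclusion is needed for the counter itself.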
---|

Highest Priority Thread Never Walks Alone
-----------------------------------------

On a uni-processor system, it is safe to assume that when the highest
priority task in an application executes, it will execute without being
preempted until it voluntarily blocks.  Interrupts may occur while it is
executing, but there will be no context switch to another task unless the
highest priority task voluntarily initiates it.

Given the assumption that no other tasks will have their execution
interleaved with the highest priority task, it is possible for this task to
be constructed such that it does not need to acquire a mutex for protected
access to shared data.

In an SMP system, it cannot be assumed that only a single task is executing
at any point in time.  It should be assumed that every processor is executing
another application task.  Further, those tasks will be ones which would not
have been executed in a uni-processor configuration and should be assumed to
have data synchronization conflicts with what was formerly the highest
priority task which executed without conflict.

Disabling of Thread Preemption
------------------------------

A thread which disables preemption prevents a higher priority thread from
involuntarily taking over its processor.  In uni-processor configurations,
this can be used to ensure mutual exclusion at thread level.  In SMP
configurations, however, more than one executing thread may exist.  Thus, it
is impossible to ensure mutual exclusion using this mechanism.  In order to
prevent applications which use preemption for this purpose from showing
inappropriate behaviour, this feature is disabled in SMP configurations and
its use would cause run-time errors.
---|

Disabling of Interrupts
-----------------------

A low-overhead means of ensuring mutual exclusion in uni-processor
configurations is the disabling of interrupts around a critical section.
This is commonly used in device driver code.  In SMP configurations, however,
disabling the interrupts on one processor has no effect on other processors.
So, this is insufficient to ensure system-wide mutual exclusion.  The macros

* :ref:`rtems_interrupt_disable() <rtems_interrupt_disable>`,

* :ref:`rtems_interrupt_enable() <rtems_interrupt_enable>`, and

* :ref:`rtems_interrupt_flash() <rtems_interrupt_flash>`

are disabled in SMP configurations and their use will cause compile-time
warnings and link-time errors.  In the unlikely case that interrupts must be
disabled on the current processor, the

* :ref:`rtems_interrupt_local_disable() <rtems_interrupt_local_disable>` and

* :ref:`rtems_interrupt_local_enable() <rtems_interrupt_local_enable>`

macros are now available in all configurations.

Since the disabling of interrupts is insufficient to ensure system-wide
mutual exclusion on SMP, a new low-level synchronization primitive was added
-- interrupt locks.  The interrupt locks are a simple API layer on top of the
SMP locks used for low-level synchronization in the operating system core.
Currently, they are implemented as a ticket lock.  In uni-processor
configurations, they degenerate to simple interrupt disable/enable sequences
by means of the C pre-processor.  It is disallowed to acquire a single
interrupt lock in a nested way.  This will result in an infinite loop with
interrupts disabled.  While converting legacy code to interrupt locks, care
must be taken to avoid this situation.

.. code-block:: c
    :linenos:

    #include <rtems.h>

    void legacy_code_with_interrupt_disable_enable( void )
    {
      rtems_interrupt_level level;

      rtems_interrupt_disable( level );
      /* Critical section */
      rtems_interrupt_enable( level );
    }

    RTEMS_INTERRUPT_LOCK_DEFINE( static, lock, "Name" )

    void smp_ready_code_with_interrupt_lock( void )
    {
      rtems_interrupt_lock_context lock_context;

      rtems_interrupt_lock_acquire( &lock, &lock_context );
      /* Critical section */
      rtems_interrupt_lock_release( &lock, &lock_context );
    }

An alternative to the RTEMS-specific interrupt locks are POSIX spinlocks.
The :c:type:`pthread_spinlock_t` is defined as a self-contained object,
i.e. the user must provide the storage for this synchronization object.

.. code-block:: c
    :linenos:

    #include <assert.h>
    #include <pthread.h>

    pthread_spinlock_t lock;

    void smp_ready_code_with_posix_spinlock( void )
    {
      int error;

      error = pthread_spin_lock( &lock );
      assert( error == 0 );
      /* Critical section */
      error = pthread_spin_unlock( &lock );
      assert( error == 0 );
    }

In contrast to the POSIX spinlock implementations on Linux or FreeBSD, it is
not allowed to call blocking operating system services inside the critical
section.  A recursive lock attempt is a severe usage error resulting in an
infinite loop with interrupts disabled.  Nesting of different locks is
allowed.  The user must ensure that no deadlock can occur.  As a non-portable
feature, the locks are zero-initialized, i.e. statically initialized global
locks reside in the ``.bss`` section and there is no need to call
:c:func:`pthread_spin_init`.
---|

Interrupt Service Routines Execute in Parallel With Threads
-----------------------------------------------------------

On a machine with more than one processor, interrupt service routines (this
includes timer service routines installed via :ref:`rtems_timer_fire_after()
<rtems_timer_fire_after>`) and threads can execute in parallel.  Interrupt
service routines must take this into account and use proper locking
mechanisms to protect critical sections from interference by threads
(interrupt locks or POSIX spinlocks).  This likely requires code
modifications in legacy device drivers.

Timers Do Not Stop Immediately
------------------------------

Timer service routines run in the context of the clock interrupt.  On
uni-processor configurations, it is sufficient to disable interrupts and
remove a timer from the set of active timers to stop it.  In SMP
configurations, however, the timer service routine may already run and wait
on an SMP lock owned by the thread which is about to stop the timer.  This
opens the door to subtle synchronization issues.  During the destruction of
objects, special care must be taken to ensure that timer service routines
cannot access (partly or fully) destroyed objects.

False Sharing of Cache Lines Due to Objects Table
-------------------------------------------------

The Classic API and most POSIX API objects are indirectly accessed via an
object identifier.  The user-level functions validate the object identifier
and map it to the actual object structure which resides in a global objects
table for each object class.  So, unrelated objects are packed together in a
table.  This may result in false sharing of cache lines.  The effect of false
sharing of cache lines can be observed with the `TMFINE 1
<https://git.rtems.org/rtems/tree/testsuites/tmtests/tmfine01>`_ test program
on a suitable platform, e.g. QorIQ T4240.  High-performance SMP applications
need full control of the object storage :cite:`Drepper:2007:Memory`.
Therefore, self-contained synchronization objects are now available for
RTEMS.
---|

Directives
==========

This section details the symmetric multiprocessing services.  A subsection is
dedicated to each of these services and describes the calling sequence,
related constants, usage, and status codes.

.. raw:: latex

    \clearpage

.. _rtems_get_processor_count:

GET_PROCESSOR_COUNT - Get processor count
-----------------------------------------

CALLING SEQUENCE:
    .. code-block:: c

        uint32_t rtems_get_processor_count(void);

DIRECTIVE STATUS CODES:
    The count of processors in the system that can be run.  The value
    returned is the highest numbered processor index of all processors
    available to the application (if a scheduler is assigned) plus one.

DESCRIPTION:
    In uni-processor configurations, a value of one will be returned.

    In SMP configurations, this returns the value of a global variable set
    during system initialization to indicate the count of utilized
    processors.  The processor count depends on the physically or virtually
    available processors and the application configuration.  The value will
    always be less than or equal to the maximum count of application
    configured processors.

NOTES:
    None.

.. raw:: latex

    \clearpage

.. _rtems_get_current_processor:

GET_CURRENT_PROCESSOR - Get current processor index
---------------------------------------------------

CALLING SEQUENCE:
    .. code-block:: c

        uint32_t rtems_get_current_processor(void);

DIRECTIVE STATUS CODES:
    The index of the current processor.

DESCRIPTION:
    In uni-processor configurations, a value of zero will be returned.

    In SMP configurations, an architecture-specific method is used to obtain
    the index of the current processor in the system.  The set of processor
    indices is the range of integers starting with zero up to the processor
    count minus one.

    Outside of sections with disabled thread dispatching, the current
    processor index may change after every instruction since the thread may
    migrate from one processor to another.  Sections with disabled interrupts
    are sections with thread dispatching disabled.

NOTES:
    None.

Implementation Details
======================

This section covers some implementation details of the RTEMS SMP support.

Low-Level Synchronization
-------------------------

All low-level synchronization primitives are implemented using :term:`C11`
atomic operations, so no target-specific hand-written assembler code is
necessary.  Four synchronization primitives are currently available:

* ticket locks (mutual exclusion),

* :term:`MCS` locks (mutual exclusion),

* barriers, implemented as a sense barrier, and

* sequence locks :cite:`Boehm:2012:Seqlock`.
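
The sequence lock idea can be sketched with C11 atomics.  The following is a
simplified, hypothetical illustration and not the RTEMS implementation: a
writer makes the sequence counter odd before updating the data and even
again afterwards, while a reader retries whenever it observes an odd or
changed counter.

.. code-block:: c

    #include <stdatomic.h>

    /* Hypothetical sequence lock protecting two values updated together.
     * Simplified sketch: it assumes a single writer (or an external
     * writer lock) and glosses over formal data-race rules. */
    typedef struct {
      atomic_uint sequence; /* odd while a write is in progress */
      unsigned    a;
      unsigned    b;
    } seqlock;

    static void seqlock_write(seqlock *lock, unsigned a, unsigned b)
    {
      unsigned seq =
        atomic_load_explicit(&lock->sequence, memory_order_relaxed);

      /* Make the sequence odd to signal a write in progress. */
      atomic_store_explicit(&lock->sequence, seq + 1, memory_order_relaxed);
      atomic_thread_fence(memory_order_release);
      lock->a = a;
      lock->b = b;
      atomic_thread_fence(memory_order_release);
      /* Make the sequence even again to publish the update. */
      atomic_store_explicit(&lock->sequence, seq + 2, memory_order_relaxed);
    }

    static void seqlock_read(const seqlock *lock, unsigned *a, unsigned *b)
    {
      unsigned seq_before;
      unsigned seq_after;

      do {
        seq_before =
          atomic_load_explicit(&lock->sequence, memory_order_acquire);
        *a = lock->a;
        *b = lock->b;
        atomic_thread_fence(memory_order_acquire);
        seq_after =
          atomic_load_explicit(&lock->sequence, memory_order_relaxed);
        /* Retry if a write was in progress (odd sequence) or completed
         * while the values were being read (changed sequence). */
      } while (seq_before % 2 != 0 || seq_before != seq_after);
    }

Readers never block writers with this scheme, which is why sequence locks
suit data that is read often and written rarely.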

A vital requirement for low-level mutual exclusion is :term:`FIFO` fairness
since we are interested in a predictable system and not maximum throughput.
With this requirement, only a few options remain to solve this problem.  For
reasons of simplicity, the ticket lock algorithm was chosen to implement the
SMP locks.  However, the API is capable of supporting MCS locks, which may
become interesting in the future for systems with a processor count in the
range of 32 or more, e.g. :term:`NUMA` and many-core systems.
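
For illustration, a ticket lock can be sketched with C11 atomics.  This is a
simplified, hypothetical version and not the actual RTEMS code: each
acquirer draws a ticket with an atomic fetch-and-add and spins until the
now-serving counter reaches its ticket, which yields the desired FIFO
fairness.

.. code-block:: c

    #include <stdatomic.h>

    /* Hypothetical ticket lock providing FIFO-fair mutual exclusion. */
    typedef struct {
      atomic_uint next_ticket; /* next ticket to hand out */
      atomic_uint now_serving; /* ticket currently allowed to enter */
    } ticket_lock;

    static void ticket_lock_acquire(ticket_lock *lock)
    {
      /* Atomically draw a ticket; this orders the acquirers. */
      unsigned ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

      /* Spin until it is this ticket's turn. */
      while (atomic_load_explicit(&lock->now_serving, memory_order_acquire)
          != ticket) {
        /* busy wait */
      }
    }

    static void ticket_lock_release(ticket_lock *lock)
    {
      /* Only the lock owner updates now_serving, so a non-atomic
       * read-modify-write of it is safe here. */
      unsigned next =
        atomic_load_explicit(&lock->now_serving, memory_order_relaxed) + 1;

      /* Hand the lock to the next waiter in FIFO order. */
      atomic_store_explicit(&lock->now_serving, next, memory_order_release);
    }

Note that every waiter spins on the same ``now_serving`` variable; MCS locks
avoid this by letting each waiter spin on its own cache line, which is why
they scale better at high processor counts.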

The test program `SMPLOCK 1
<https://git.rtems.org/rtems/tree/testsuites/smptests/smplock01>`_ can be used
to gather performance and fairness data for several scenarios.  The SMP lock
performance and fairness measured on the QorIQ T4240 follows as an example.
This chip contains three L2 caches.  Each L2 cache is shared by eight
processors.

.. image:: ../images/c_user/smplock01perf-t4240.*
    :width: 400
    :align: center

.. image:: ../images/c_user/smplock01fair-t4240.*
    :width: 400
    :align: center

Internal Locking
----------------

In SMP configurations, the operating system uses non-recursive SMP locks for
low-level mutual exclusion.  The locking domains are roughly

* a particular data structure,
* the thread queue operations,
* the thread state changes, and
* the scheduler operations.

For good average-case performance it is vital that every high-level
synchronization object, e.g. a mutex, has its own SMP lock.  In the average
case, only this SMP lock should be involved to carry out a specific
operation, e.g. obtain/release a mutex.  In general, the high-level
synchronization objects have a thread queue embedded and use its SMP lock.

In case a thread must block on a thread queue, things get more complicated.
The executing thread first acquires the SMP lock of the thread queue and
then figures out that it needs to block.  The procedure to block the thread
on this particular thread queue involves state changes of the thread itself,
and for this, thread-specific SMP locks must be used.

In order to determine whether a thread is blocked on a thread queue or not,
thread-specific SMP locks must be used.  A thread priority change must be
propagated to the thread queue (possibly recursively).  Care must be taken
not to have a lock order reversal between thread queue and thread-specific
SMP locks.

Each scheduler instance has its own SMP lock.  For the scheduler helping
protocol, multiple scheduler instances may be in charge of a thread.  It is
not possible to acquire two scheduler instance SMP locks at the same time,
otherwise deadlocks would happen.  A thread-specific SMP lock is used to
synchronize the thread data shared by different scheduler instances.

The thread state SMP lock protects various things, e.g. the thread state,
join operations, signals, post-switch actions, the home scheduler instance,
etc.

Profiling
---------

To identify the bottlenecks in the system, support for profiling of
low-level synchronization is optionally available.  The profiling support is
a BSP build-time configuration option (``--enable-profiling``) and is
implemented with an acceptable overhead, even for production systems.  A
low-overhead counter for short time intervals must be provided by the
hardware.

Profiling reports are generated in XML for most test programs of the RTEMS
testsuite (more than 500 test programs).  This gives a good sample set for
statistics.  For example, the maximum thread dispatch disabled time, the
maximum interrupt latency, or the lock contention can be determined.

.. code-block:: xml

    <ProfilingReport name="SMPMIGRATION 1">
      <PerCPUProfilingReport processorIndex="0">
        <MaxThreadDispatchDisabledTime unit="ns">36636</MaxThreadDispatchDisabledTime>
        <MeanThreadDispatchDisabledTime unit="ns">5065</MeanThreadDispatchDisabledTime>
        <TotalThreadDispatchDisabledTime unit="ns">3846635988</TotalThreadDispatchDisabledTime>
        <ThreadDispatchDisabledCount>759395</ThreadDispatchDisabledCount>
        <MaxInterruptDelay unit="ns">8772</MaxInterruptDelay>
        <MaxInterruptTime unit="ns">13668</MaxInterruptTime>
        <MeanInterruptTime unit="ns">6221</MeanInterruptTime>
        <TotalInterruptTime unit="ns">6757072</TotalInterruptTime>
        <InterruptCount>1086</InterruptCount>
      </PerCPUProfilingReport>
      <PerCPUProfilingReport processorIndex="1">
        <MaxThreadDispatchDisabledTime unit="ns">39408</MaxThreadDispatchDisabledTime>
        <MeanThreadDispatchDisabledTime unit="ns">5060</MeanThreadDispatchDisabledTime>
        <TotalThreadDispatchDisabledTime unit="ns">3842749508</TotalThreadDispatchDisabledTime>
        <ThreadDispatchDisabledCount>759391</ThreadDispatchDisabledCount>
        <MaxInterruptDelay unit="ns">8412</MaxInterruptDelay>
        <MaxInterruptTime unit="ns">15868</MaxInterruptTime>
        <MeanInterruptTime unit="ns">3525</MeanInterruptTime>
        <TotalInterruptTime unit="ns">3814476</TotalInterruptTime>
        <InterruptCount>1082</InterruptCount>
      </PerCPUProfilingReport>
      <!-- more reports omitted -->
      <SMPLockProfilingReport name="Scheduler">
        <MaxAcquireTime unit="ns">7092</MaxAcquireTime>
        <MaxSectionTime unit="ns">10984</MaxSectionTime>
        <MeanAcquireTime unit="ns">2320</MeanAcquireTime>
        <MeanSectionTime unit="ns">199</MeanSectionTime>
        <TotalAcquireTime unit="ns">3523939244</TotalAcquireTime>
        <TotalSectionTime unit="ns">302545596</TotalSectionTime>
        <UsageCount>1518758</UsageCount>
        <ContentionCount initialQueueLength="0">759399</ContentionCount>
        <ContentionCount initialQueueLength="1">759359</ContentionCount>
        <ContentionCount initialQueueLength="2">0</ContentionCount>
        <ContentionCount initialQueueLength="3">0</ContentionCount>
      </SMPLockProfilingReport>
    </ProfilingReport>

Scheduler Helping Protocol
--------------------------

The scheduler provides a helping protocol to support locking protocols like
the :ref:`OMIP` or the :ref:`MrsP`.  Each thread has a scheduler node for
each scheduler instance in the system; these nodes are located in its
:term:`TCB`.  A thread has exactly one home scheduler instance, which is set
during thread creation.  The home scheduler instance can be changed with
:ref:`rtems_task_set_scheduler() <rtems_task_set_scheduler>`.  Due to the
locking protocols, a thread may gain access to scheduler nodes of other
scheduler instances.  This allows the thread to temporarily migrate to
another scheduler instance in case of preemption.

The scheduler infrastructure is based on an object-oriented design.  The
scheduler operations for a thread are defined as virtual functions.  For the
scheduler helping protocol, the following operations must be implemented by
an SMP-aware scheduler:

* ask a scheduler node for help,
* reconsider the help request of a scheduler node, and
* withdraw a scheduler node.

All currently available SMP-aware schedulers use a framework which is
customized via inline functions.  This eases the implementation of scheduler
variants.  Up to now, only priority-based schedulers are implemented.

In case a thread is allowed to use more than one scheduler node, it will ask
these nodes for help

* in case of preemption,
* in case an unblock did not schedule the thread, or
* in case a yield was successful.

The actual ask-for-help scheduler operations are carried out as a
side-effect of the thread dispatch procedure.  Once a need for help is
recognized, a help request is registered in one of the processors related to
the thread and a thread dispatch is issued.  This indirection leads to a
better decoupling of scheduler instances.  Unrelated processors are not
burdened with extra work for threads which participate in resource sharing.
Each ask-for-help operation indicates whether it could help or not.  The
procedure stops after the first successful ask for help.  Unsuccessful
ask-for-help operations register this need in the scheduler context.

After a thread dispatch, the reconsider help request operation is used to
clean up stale help registrations in the scheduler contexts.

The withdraw operation takes away scheduler nodes once the thread is no
longer allowed to use them, e.g. after it released a mutex.  The
availability of scheduler nodes for a thread is controlled by the thread
queues.

Thread Dispatch Details
-----------------------

This section gives background information to developers interested in the
interrupt latencies introduced by thread dispatching.  A thread dispatch
consists of all work which must be done to stop the currently executing
thread on a processor and hand over this processor to an heir thread.

In SMP systems, scheduling decisions on one processor must be propagated to
other processors through inter-processor interrupts.  A thread dispatch
which must be carried out on another processor does not happen
instantaneously.  Thus, several thread dispatch requests might be in flight,
and it is possible that some of them may be out of date before the
corresponding processor has time to deal with them.  The thread dispatch
mechanism uses three per-processor variables:

- the executing thread,

- the heir thread, and

- a boolean flag indicating whether a thread dispatch is necessary or not.

Updates of the heir thread are done via a normal store operation.  The
thread dispatch necessary indicator of another processor is set as a
side-effect of an inter-processor interrupt, so this change notification
works without the use of locks.  The thread context is protected by a
:term:`TTAS` lock embedded in the context to ensure that it is used on at
most one processor at a time.  Normally, only thread-specific or
per-processor locks are used during a thread dispatch.  This implementation
turned out to be quite efficient, and no lock contention was observed in the
testsuite.  The heavy-weight thread dispatch sequence is only entered in
case the thread dispatch indicator is set.
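
The interplay of the three per-processor variables can be pictured as
follows.  This is a hypothetical, heavily simplified sketch of the idea;
the names and types do not match the RTEMS internals, and the actual
inter-processor interrupt and context switch are only hinted at in comments.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical thread control block. */
    typedef struct thread_control {
      int id;
    } thread_control;

    /* Hypothetical per-processor state used for thread dispatching. */
    typedef struct {
      thread_control *executing;          /* thread running on this CPU */
      thread_control *heir;               /* thread that should run next */
      atomic_bool     dispatch_necessary; /* set remotely before an IPI */
    } per_cpu_control;

    /* Executed by the scheduler, possibly on another processor: install a
     * new heir with a plain store and request a dispatch.  A real system
     * would now send an inter-processor interrupt to the target CPU. */
    static void request_dispatch(per_cpu_control *cpu, thread_control *heir)
    {
      cpu->heir = heir; /* may be overwritten by a newer decision */
      atomic_store_explicit(&cpu->dispatch_necessary, true,
                            memory_order_release);
    }

    /* Executed on the owning processor, e.g. on interrupt return: enter
     * the heavy-weight dispatch sequence only if a dispatch was requested,
     * picking up the latest heir. */
    static bool dispatch_if_necessary(per_cpu_control *cpu)
    {
      if (!atomic_exchange_explicit(&cpu->dispatch_necessary, false,
                                    memory_order_acquire)) {
        return false; /* nothing to do, stale requests cost little */
      }

      cpu->executing = cpu->heir; /* the context switch would happen here */
      return true;
    }

A stale request is harmless with this scheme: the processor merely observes
that the heir already equals the executing thread and the extra dispatch is
cheap, which matches the out-of-date requests mentioned above.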

The context switch is performed with interrupts enabled.  During the
transition from the executing thread to the heir thread, neither the stack
of the executing thread nor that of the heir thread may be used for
interrupt processing.  For this purpose, a temporary per-processor stack is
set up, which may be used by the interrupt prologue before the stack is
switched to the interrupt stack.