Opened 2 years ago

Closed 23 months ago

Last modified 14 months ago

#2271 closed enhancement (fixed)

Improved timestamp implementation

Reported by: Sebastian Huber
Owned by: Sebastian Huber
Priority: normal
Milestone: 4.11
Component: cpukit
Version: 4.11
Severity: normal
Keywords:
Cc: Alexander Krutwig

Description (last modified by Sebastian Huber)

Benefit

Improved average-case and worst-case performance. Uni-processor configurations will also benefit (simpler clock drivers, better NTP support, support for PPS).

Problem Description

Timekeeping is an important part of an operating system. It includes

  • timer services,
  • timeout options for operating system operations, and
  • time of day services, e.g. timestamps.

Timestamps are frequently used, for example, by the network stack to manage timeouts in various network protocols.

On RTEMS the timekeeping is implemented using

  • a basic time representation in 64-bit nanoseconds (or struct timespec),
  • watchdog delta chains for timer and timeout services,
  • a clock tick function rtems_clock_tick(), and
  • an optional nanoseconds extension to get the nanoseconds elapsed since the last clock tick.

TOD Lock Contention

The time of day (TOD) lock protects the current time of day and the uptime values (both are in 64-bit nanoseconds). During the rtems_clock_tick() procedure these values are updated. In combination with the nanoseconds extension they deliver a time resolution below the clock tick if supported by the clock driver.

Profiling reveals lock contention on the time of day (TOD) lock. For example, for the test program SMPMRSP 1 on the 200MHz NGMP, we have the following SMP lock profile for the TOD lock:

<SMPLockProfilingReport name="TOD">
  <MaxAcquireTime unit="ns">2695</MaxAcquireTime>
  <MaxSectionTime unit="ns">1785</MaxSectionTime>
  <MeanAcquireTime unit="ns">499</MeanAcquireTime>
  <MeanSectionTime unit="ns">734</MeanSectionTime>
  <TotalAcquireTime unit="ns">1535529630</TotalAcquireTime>
  <TotalSectionTime unit="ns">2254450075</TotalSectionTime>
  <UsageCount>3071324</UsageCount>
  <ContentionCount initialQueueLength="0">2049757</ContentionCount>
  <ContentionCount initialQueueLength="1">1018726</ContentionCount>
  <ContentionCount initialQueueLength="2">2840</ContentionCount>
  <ContentionCount initialQueueLength="3">1</ContentionCount>
</SMPLockProfilingReport>

This shows that the time of day (TOD) lock is heavily used in this test program and that contention is high: about one third of the acquisitions (1021567 of 3071324) did not find the lock immediately free. The lock is used on every context switch and is the last global data structure in the thread dispatch path. This shared state among all processors is a performance penalty.
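The contention share can be recomputed from the counters in the profile. The following is a minimal sketch, not RTEMS code: the helper name contended_percent() is hypothetical, and it assumes that the initialQueueLength="0" counter holds the acquisitions that obtained the lock without waiting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counter values copied from the TOD lock profile above. */
static const uint64_t tod_usage_count = 3071324;
static const uint64_t tod_contention_counts[] = {
  2049757, /* initialQueueLength="0" */
  1018726, /* initialQueueLength="1" */
  2840,    /* initialQueueLength="2" */
  1        /* initialQueueLength="3" */
};

/*
 * Hypothetical helper: percentage (rounded down) of acquisitions that
 * started with a non-zero queue length.  Assumes counts[0] holds the
 * acquisitions that got the lock without waiting.
 */
static unsigned contended_percent(
  const uint64_t *counts,
  size_t          n,
  uint64_t        usage_count
)
{
  uint64_t contended = 0;
  size_t   i;

  assert(usage_count != 0);

  for (i = 1; i < n; ++i) {
    contended += counts[i];
  }

  return (unsigned) ((contended * 100) / usage_count);
}
```

Calling contended_percent(tod_contention_counts, 4, tod_usage_count) with the values above returns 33, so under the stated assumption roughly one in three acquisitions had to wait.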

Problem Report #2180

The nanoseconds extension and the standard clock drivers are broken on SMP, see #2180. The problem is that the time reference points for the operating system (global TOD structure, containing the time of day and uptime) and the clock driver (nanoseconds extension) are inconsistent for a non-zero length time interval during the clock tick procedure.

The software-updated reference point for the nanoseconds extension makes clock drivers difficult to implement. Naively written clock drivers usually fail in the test program SPNSEXT 1.

Expensive Operation to Convert 64-bit Nanoseconds

Converting 64-bit nanoseconds values into the common struct timeval or struct timespec formats requires a 64-bit division to get the seconds value. This is a potentially expensive operation depending on the hardware support (this is the case for SPARC, see __divdi3() in libgcc.a).
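To make the cost concrete, here is a minimal sketch of such a conversion. The helper ns_to_timespec() is hypothetical and not an RTEMS API. It only shows where the 64-bit division and modulo operations occur.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SECOND UINT64_C(1000000000)

/*
 * Hypothetical helper: split a 64-bit nanoseconds value into seconds and
 * nanoseconds.  On targets without a hardware 64-bit divider, the division
 * and the modulo are typically compiled into calls to libgcc helper
 * routines (the ticket names __divdi3() for the SPARC case).
 */
static struct timespec ns_to_timespec(uint64_t ns)
{
  struct timespec ts;

  ts.tv_sec = (time_t) (ns / NS_PER_SECOND);
  ts.tv_nsec = (long) (ns % NS_PER_SECOND);

  assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

  return ts;
}
```

For example, ns_to_timespec(1500000000) yields tv_sec == 1 and tv_nsec == 500000000. A representation that already stores the seconds separately avoids this division on every conversion.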

Problem Solution

Use FreeBSD timecounters. This also enables proper support for the Network Time Protocol (NTP) and Pulse Per Second (PPS).

In order to use the timecounters, the platform must provide

  • one periodic interval interrupt to trigger rtems_clock_tick(), and
  • one free running global counter with a resolution below the clock tick interval.

This change makes it necessary to touch every clock driver in the RTEMS sources. There are 40 clock drivers (20 of them with a nanoseconds extension) using the clock driver shell header file and 23 clock drivers (3 of them with a nanoseconds extension) with a custom implementation structure.

This free running global counter is an additional requirement, so it may be impossible to convert every clock driver. However, it is feasible to adjust the FreeBSD timecounters implementation, which uses ten timehands by default, to avoid this additional requirement. Platforms lacking a free running global counter can reduce the timehands to one. In this case, the periodic timer used to generate the clock tick interrupt can be used (as in the current nanoseconds extension).

Attachments (8)

bintime_requests.png (39.5 KB) - added by Alexander Krutwig 2 years ago.
Number of bintime requests with hardware periphery
bintime_requests_wo.png (37.0 KB) - added by Alexander Krutwig 2 years ago.
Number of bintime requests without hardware periphery
uptime_requests.png (39.2 KB) - added by Alexander Krutwig 2 years ago.
Number of uptime requests with hardware periphery
uptime_requests_wo.png (37.3 KB) - added by Alexander Krutwig 2 years ago.
Number of uptime requests without hardware periphery
new_bintime_without.png (36.3 KB) - added by Alexander Krutwig 2 years ago.
new_bintime_with.png (45.4 KB) - added by Alexander Krutwig 2 years ago.
new_uptime_with.png (36.5 KB) - added by Alexander Krutwig 2 years ago.
new_uptime_without.png (40.0 KB) - added by Alexander Krutwig 2 years ago.


Change History (21)

comment:1 Changed 2 years ago by Sebastian Huber

Cc: Alexander Krutwig added
Status: new → accepted

Changed 2 years ago by Alexander Krutwig

Attachment: bintime_requests.png added

Number of bintime requests with hardware periphery

Changed 2 years ago by Alexander Krutwig

Attachment: bintime_requests_wo.png added

Number of bintime requests without hardware periphery

Changed 2 years ago by Alexander Krutwig

Attachment: uptime_requests.png added

Number of uptime requests with hardware periphery

Changed 2 years ago by Alexander Krutwig

Attachment: uptime_requests_wo.png added

Number of uptime requests without hardware periphery

comment:2 Changed 2 years ago by Alexander Krutwig

Current status

Several tests were executed to compare the performance of the FreeBSD timecounters with the current RTEMS nanoseconds extension. The tests ran on a Freescale T4240RDB evaluation board, a multiprocessor system which features 24 processor cores organized in three clusters of eight cores each.
Four example plots follow, in two groups: tests with hardware periphery and tests without hardware periphery.
The tests with hardware periphery read the real hardware counter values for the FreeBSD timecounters and the RTEMS clock uptime. Without periphery, dummy values are returned instead, so that only the mechanism itself is exercised.
The test environment called the corresponding time-obtaining functions as often as possible without doing any other work. Each test ran for 1 second.

Plots with hardware periphery (T4240RDB)

Number of bintime requests with hardware periphery
Number of uptime requests with hardware periphery

All plots feature a box plot (http://en.wikipedia.org/wiki/Box_plot) derived from the number of requests completed by each active core, as well as an indicator for the sum of the individual requests. For the FreeBSD binuptime requests, the sum of requests increases continuously in clusters. From one processor to three, it nearly triples, and it reaches nearly five times the single-processor value once all processors are in use. It starts at about 6 million requests for one processor and tops out at about 28 million requests for all processors.
In contrast, the RTEMSUptime requests do not increase with the number of processors and tend to decrease as more processors are used. The sum of requests drops from 2.4 million for one processor to 2.2 million when all processors are used.

Plots without hardware periphery (T4240RDB)

Number of bintime requests without hardware periphery
Number of uptime requests without hardware periphery

These tests check the mechanism itself, without any influence of hardware effects. For the bintime requests, dummy values were returned when the timekeeping functions were called. For the RTEMSUptime requests, the corresponding nanoseconds extension was initially set to 0.
The number of bintime requests increases linearly from 16 million for one processor to 500 million for all 24 processors.
In contrast, the RTEMSUptime requests behave much as they did with hardware. For one processor, the sum of requests is about 5 million, and it decreases once more processors are used. Up to the full processor count, the behaviour shows small deviations with a tendency towards a slight decrease.

Conclusion

Based on the findings of the previous two sections, the FreeBSD timecounters perform much better than the current RTEMS nanoseconds extension. With the FreeBSD timecounters, the sum of requests grows continuously with the number of active processors, whereas with the RTEMS nanoseconds extension it decreases slightly in the same scenario.

Last edited 2 years ago by Sebastian Huber

comment:4 Changed 2 years ago by Chris Johns

Are the watchdog change and the FreeBSD timecounter change linked in any way, or are they separate patch sets?

comment:5 Changed 2 years ago by Sebastian Huber

Description: modified
Summary: Improved Timekeeping → Improved timestamp implementation

Changed 2 years ago by Alexander Krutwig

Attachment: new_bintime_without.png added

Changed 2 years ago by Alexander Krutwig

Attachment: new_bintime_with.png added

Changed 2 years ago by Alexander Krutwig

Attachment: new_uptime_with.png added

Changed 2 years ago by Alexander Krutwig

Attachment: new_uptime_without.png added

comment:6 Changed 2 years ago by Alexander Krutwig

Plots with hardware periphery (NGMP)

These plots, and those in the following section, use the same test environment as before, but on an NGMP board with four processor cores. Profiling was enabled and the frequency was 50MHz.



The bintime plot shows a linear increase in requests, which nearly quadruple from 600000 for one processor to nearly 2.2 million for four processors.
The RTEMSUptime requests also start to grow linearly, but with four processors a drop can be observed. This behaviour is expected to continue once more processors are in use or the frequency increases. For one processor there are about 120000 requests, and the peak of about 250000 is reached with three processors.

Plots without hardware periphery (NGMP)



Without hardware periphery, the plots resemble those described in the previous section. For the bintime requests, the increase is again linear, whereas the increase drops off for the RTEMS nanoseconds extension when four processors are used. Overall, the absolute number of requests is higher than with hardware because of the simplified return logic.

Last edited 2 years ago by Alexander Krutwig

comment:7 Changed 2 years ago by Gedare

Can you please scale down the images? They don't need to be so large to be comprehensible. Thanks for all the details though, this is great!

comment:8 in reply to:  7 Changed 2 years ago by Chris Johns

Replying to gedare:

Can you please scale down the images? They don't need to be so large to be comprehensible. Thanks for all the details though, this is great!

To scale add ", 50%" after the image macro.

comment:9 Changed 2 years ago by Alexander Krutwig

Requirements

Introduction

A central aspect of the performance tests was determining the minimum frequency at which the FreeBSD timecounters have to operate.
The timecounters are based on a ring of ten timehands that has to be updated continuously. Time can be represented in the bintime, microtime and nanotime formats. Each format is a struct with two members: a time_t for the seconds, plus either a 64-bit variable for the fractional part (bintime) or a long for the microseconds or nanoseconds (microtime and nanotime, respectively).
The bintime of the current timehand is updated by adding the product (th->th_scale * tc_delta(th)) to the previous bintime value.
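The update step can be sketched as follows. This is a simplified, self-contained model and not the FreeBSD or RTEMS source: the type and function names (struct bintime_sketch, bintime_sketch_addx(), timehand_advance()) are hypothetical, and the scale and delta arguments stand in for th->th_scale and tc_delta(th).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of a bintime: seconds plus a 64-bit binary fraction. */
struct bintime_sketch {
  int64_t  sec;  /* whole seconds */
  uint64_t frac; /* fraction of a second in units of 2^-64 s */
};

/* Add a 64-bit fraction and carry into the seconds on wrap-around. */
static void bintime_sketch_addx(struct bintime_sketch *bt, uint64_t x)
{
  uint64_t old_frac = bt->frac;

  bt->frac += x;

  if (bt->frac < old_frac) {
    ++bt->sec;
  }
}

/*
 * Advance the bintime by "delta" counter ticks.  "scale" is assumed to be
 * the length of one counter tick in units of 2^-64 s.  The product must
 * fit into 64 bits, otherwise the result is wrong.
 */
static void timehand_advance(
  struct bintime_sketch *bt,
  uint64_t               scale,
  uint32_t               delta
)
{
  assert(scale != 0);
  bintime_sketch_addx(bt, scale * delta);
}
```

The carry in bintime_sketch_addx() only handles a wrap-around of the addition. It cannot detect an overflow that already happened inside the multiplication scale * delta.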

Problem and limiting factors

The limiting factor of the binuptime operation is the possible overflow of the 64-bit fraction part of the bintime bt. th->th_scale is determined by the counter frequency, and tc_delta(th) grows with the number of counter ticks elapsed since the last update. When one second has passed since the last update, the product th->th_scale * tc_delta(th) reaches the maximum 64-bit value, which results in an overflow and consequently in miscalculations. Therefore, a timehand should never hold a timestamp that is older than one second.
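The overflow condition can be checked with a few lines of arithmetic. The sketch below assumes that the scale is approximately 2^64 divided by the counter frequency. Both helper names are hypothetical, and the 1MHz frequency in the usage note is an illustrative value, not a figure from the tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Approximate 2^64 / frequency without needing a 65-bit constant.
 * Assumption: this mirrors how the scale relates to the counter frequency.
 */
static uint64_t scale_for_frequency(uint64_t frequency)
{
  assert(frequency != 0);
  return ((UINT64_C(1) << 63) / frequency) * 2;
}

/* Return true if scale * delta no longer fits into 64 bits. */
static bool fraction_overflows(uint64_t scale, uint64_t delta)
{
  return delta != 0 && scale > UINT64_MAX / delta;
}
```

With scale_for_frequency(1000000), a delta of 500000 ticks (half a second) does not overflow, while a delta of 2000000 ticks (two seconds) does. The boundary lies at about one second of elapsed counter ticks, independent of the chosen frequency.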

Consequence

As the mechanism consists of a ring of ten timehands, each timehand should thus be updated every 1s/10 = 0.1s. Therefore, the RTEMS system clock should run at 10Hz or more so that the timecounters can work properly. To leave a margin above this absolute minimum, a minimum frequency of 20Hz should be chosen.

comment:10 Changed 23 months ago by Sebastian Huber

Milestone: 4.11.1 → 4.11

comment:11 Changed 23 months ago by Alexander Krutwig <alexander.krutwig@…>

In 75acd9e69f906cbd880a17ee4ca705ad7caa92c0/rtems:

bsps: Convert clock drivers to use a timecounter

Update #2271.

comment:13 Changed 14 months ago by Sebastian Huber <sebastian.huber@…>

In 01b32d44a41e2959927dea4dafd786a11afc901b/rtems:

score: Delete obsolete CPU_TIMESTAMP_* defines

Update #2271.
