#2274 closed enhancement (fixed)

Enable libgomp build in GCC

Reported by: Sebastian Huber Owned by: Sebastian Huber
Priority: normal Milestone: 4.11.1
Component: tool/gcc Version: 4.11
Severity: normal Keywords:
Cc: Blocked By:
Blocking:

Description

libgomp is the support library for OpenMP code emitted by GCC. Adding support for RTEMS needs roughly the following steps:

  • Move the <semaphore.h> header file from RTEMS to Newlib. Due to a license issue, use the one provided by FreeBSD and modify it accordingly.
  • Add Autoconf code to detect presence of Newlib <semaphore.h>.
  • Add RTEMS tweaks to libgomp configure script.
  • Add RTEMS specific link-time configuration to select a special memory allocator for libgomp.
  • Add ability to control thread scheduler, priority, stack size, etc. via application configuration options/handler.
  • Add standard OpenMP tests to RTEMS testsuite.
  • Add documentation to user manual.
  • Do performance tests.
  • Add dedicated low-overhead barriers.
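
For context, a minimal sketch of the kind of OpenMP code that libgomp supports at run time. The function name sum_of_squares is made up for illustration and is not part of this ticket or of the RTEMS testsuite. Built with -fopenmp, GCC turns the pragma into calls into libgomp; built without it, the pragma is ignored and the loop runs serially with the same result.

```c
/* Illustrative only: sum of i*i for i = 1..n using an OpenMP parallel loop.
 * With -fopenmp the loop iterations are distributed over a thread team and
 * the per-thread partial sums are combined by the reduction clause. */
long sum_of_squares(long n)
{
  long sum = 0;

  #pragma omp parallel for reduction(+:sum)
  for (long i = 1; i <= n; ++i) {
    sum += i * i;
  }

  return sum;
}
```

Calling sum_of_squares(10) adds 1 + 4 + ... + 100, independent of the number of threads in the team.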

Attachments (5)

libgomp-parallel-bench-posix-malloc.png (34.1 KB) - added by Sebastian Huber on 07/07/15 at 07:55:59.
libgomp-parallel-bench-posix-no-malloc.png (36.6 KB) - added by Sebastian Huber on 07/16/15 at 08:14:43.
init.c (2.9 KB) - added by Sebastian Huber on 07/16/15 at 08:33:22.
Microbench
libgomp-parallel-bench-sys-lock.png (45.1 KB) - added by Sebastian Huber on 07/23/15 at 07:44:54.
lock.h (5.0 KB) - added by Sebastian Huber on 07/23/15 at 08:01:05.


Change History (14)

comment:1 Changed on 02/18/15 at 14:47:43 by Sebastian Huber

Status: new → accepted

comment:2 Changed on 07/07/15 at 07:38:02 by Sebastian Huber

OpenMP using the POSIX configuration is available in GCC 4.9.3 and later. It is enabled in the RSB for RTEMS 4.11 on ARM, PowerPC and SPARC.

Changed on 07/07/15 at 07:55:59 by Sebastian Huber

comment:3 Changed on 07/07/15 at 07:59:33 by Sebastian Huber

The microbenchmark posted here

https://gcc.gnu.org/ml/gcc-patches/2008-03/msg00930.html

shows a significant overhead due to malloc/free in the team create/destroy path.


The next step is to get this fixed in upstream libgomp.
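
To illustrate what a team create/destroy measurement looks like, here is a rough sketch of a parallel-region loop. It is not the microbenchmark from the linked mailing list post; the names parallel_bench and now_seconds are assumptions for this example. Each loop iteration enters and leaves one parallel region, so the elapsed time is dominated by the team create/destroy path.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time with OpenMP; CPU time via clock() as a fallback when the
 * file is built without -fopenmp (the pragmas are then ignored). */
static double now_seconds(void)
{
#ifdef _OPENMP
  return omp_get_wtime();
#else
  return (double) clock() / CLOCKS_PER_SEC;
#endif
}

/* Enters `iterations` parallel regions. Returns how many times a thread
 * executed the region body in total and stores the elapsed seconds. */
long parallel_bench(long iterations, double *elapsed_seconds)
{
  long entered = 0;
  double start = now_seconds();

  for (long i = 0; i < iterations; ++i) {
    /* The reduction gives the region a visible side effect. */
    #pragma omp parallel reduction(+:entered)
    {
      entered += 1;
    }
  }

  *elapsed_seconds = now_seconds() - start;
  return entered;
}
```

The returned count is at least the iteration count, since every region is executed by at least one thread.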


comment:4 Changed on 07/08/15 at 08:30:40 by Sebastian Huber

The results of the microbenchmark obtained on a T4240 using only two processors (unmodified GCC 4.9.3, RTEMS 8406d94283cc704df2c0d8aa017310e3e4ad0919):

barrier bench 20.6147 seconds
parallel bench 16.8791 seconds
static bench 0.852061 seconds
dynamic bench 0.292199 seconds

Changed on 07/16/15 at 08:14:43 by Sebastian Huber

comment:5 Changed on 07/16/15 at 08:32:03 by Sebastian Huber

The malloc() problem is solved in GCC 6.0:

https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=225811

Microbench (2 processor T4240):

barrier bench 23.3409 seconds
parallel bench 9.60804 seconds
static bench 0.472419 seconds
dynamic bench 0.223881 seconds
guided bench 0.00999273 seconds
runtime bench 0.229282 seconds
single bench 2.18316 seconds


Microbench (24 processor T4240):

barrier bench 783.888 seconds
parallel bench 115.901 seconds
static bench 5.7876 seconds
dynamic bench 0.262251 seconds
guided bench 0.0133215 seconds
runtime bench 0.261378 seconds
single bench 57.3227 seconds

There is a significant overhead due to the creation/destruction of POSIX mutexes and semaphores. In particular, there is high contention on the allocator lock. The next step is to provide self-contained objects defined in Newlib <sys/lock.h>, which can be used to implement the libgomp primitives and avoid the creation/destruction overhead. In addition, a spin-based barrier implementation based on the Linux futex barrier will be provided.
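
The idea of a self-contained object can be sketched as follows. This does not reproduce the actual Newlib <sys/lock.h> interface; sketch_lock and its functions are hypothetical names, and a C11 spinlock stands in for a real lock. The point is only that the object's whole state lives in the structure itself and can be initialized statically, so no create/destroy call and no allocator lock are involved.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical self-contained lock: all state is inside the object. */
typedef struct {
  atomic_flag locked;
} sketch_lock;

/* Static initializer, no run-time creation or allocation needed. */
#define SKETCH_LOCK_INITIALIZER { ATOMIC_FLAG_INIT }

/* Returns true if the lock was obtained without waiting. */
static inline bool sketch_lock_try_acquire(sketch_lock *lock)
{
  return !atomic_flag_test_and_set_explicit(&lock->locked,
                                            memory_order_acquire);
}

/* Spins until the lock is obtained. */
static inline void sketch_lock_acquire(sketch_lock *lock)
{
  while (!sketch_lock_try_acquire(lock)) {
    /* busy wait */
  }
}

static inline void sketch_lock_release(sketch_lock *lock)
{
  atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

A user would declare such an object as a plain variable, for example `static sketch_lock lock = SKETCH_LOCK_INITIALIZER;`, and use it directly.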

Changed on 07/16/15 at 08:33:22 by Sebastian Huber

Attachment: init.c added

Microbench

Changed on 07/23/15 at 07:44:54 by Sebastian Huber

comment:6 Changed on 07/23/15 at 07:59:54 by Sebastian Huber

Performance with self-contained objects defined in Newlib <sys/lock.h>. The barrier implementation is virtually identical to the Linux futex barrier present in libgomp.

Microbench (2 processor T4240):

barrier bench 0.387543 seconds
parallel bench 0.258221 seconds
static bench 0.0215772 seconds
dynamic bench 0.224599 seconds
guided bench 0.00639818 seconds
runtime bench 0.229863 seconds
single bench 0.0711802 seconds


Microbench (24 processor T4240):

barrier bench 5.74687 seconds
parallel bench 2.38893 seconds
static bench 0.118236 seconds
dynamic bench 0.2516 seconds
guided bench 0.00146854 seconds
runtime bench 0.250789 seconds
single bench 0.543456 seconds

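This is a major improvement compared to the previous versions. On 24 processors, for example, the barrier bench drops from 783.888 to 5.74687 seconds and the parallel bench from 115.901 to 2.38893 seconds. In the parallel bench profile, the only operating system function is _Futex_Wake() with 13% processor utilization. This is acceptable, since barrier operations are heavily used in this test case.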
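
For readers unfamiliar with generation-based barriers, here is a simplified sketch of the general technique. It is neither the libgomp nor the RTEMS code, and the spin_barrier names are assumptions. It only spins, while a futex-style barrier would block waiters and have the last arrival wake them up.

```c
#include <stdatomic.h>

/* Hypothetical centralized barrier with a generation counter. */
typedef struct {
  unsigned total;         /* number of participating threads */
  atomic_uint awaited;    /* arrivals still missing in this round */
  atomic_uint generation; /* incremented once per completed round */
} spin_barrier;

void spin_barrier_init(spin_barrier *b, unsigned total)
{
  b->total = total;
  atomic_init(&b->awaited, total);
  atomic_init(&b->generation, 0);
}

void spin_barrier_wait(spin_barrier *b)
{
  /* Remember the current round before announcing the arrival. */
  unsigned gen = atomic_load_explicit(&b->generation, memory_order_acquire);

  if (atomic_fetch_sub_explicit(&b->awaited, 1, memory_order_acq_rel) == 1) {
    /* Last arrival: re-arm the counter, then release the waiters. */
    atomic_store_explicit(&b->awaited, b->total, memory_order_relaxed);
    atomic_fetch_add_explicit(&b->generation, 1, memory_order_release);
  } else {
    /* Other arrivals spin until the round is completed. */
    while (atomic_load_explicit(&b->generation, memory_order_acquire) == gen) {
      /* busy wait */
    }
  }
}
```

The counter is re-armed before the generation is advanced, so a thread that leaves the barrier and immediately enters it again always sees a freshly reset counter.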

Last edited on 07/23/15 at 08:08:00 by Sebastian Huber

Changed on 07/23/15 at 08:01:05 by Sebastian Huber

Attachment: lock.h added

comment:8 Changed on 09/04/15 at 11:45:56 by Sebastian Huber <sebastian.huber@…>

Resolution: fixed
Status: accepted → closed

comment:9 Changed on 10/10/17 at 05:58:26 by Sebastian Huber

Component: GCC → tool/gcc