
Version 46 (modified by Phipse, on Sep 19, 2013 at 5:30:50 PM) (diff)

Introduction

GSOC 2013 - Paravirtualization of RTEMS

The project had two goals: first, to introduce a virtualization layer into RTEMS that eases virtualizing RTEMS across different hypervisors, and second, to implement a proof of concept on POK. POK is a partitioned operating system that separates applications from each other in different software partitions. This is not exactly what one would call a hypervisor, but as POK is ARINC 653 compliant and RTEMS is not, becoming standard compliant this way would be a great benefit.

The first half of this page discusses obstacles and solutions inside RTEMS and briefly reviews the approach taken in the previous year's project. The difficulties I would face in POK weren't clear to me when I wrote the proposal and discussed the design. The second half of the page explains the implementation, the difficulties, the remaining work and future steps.

The project proposal can be found on Google Docs.

Up-to-date source code can be found in my RTEMS repo in the virt-pok branch and in the POK repo on GitHub.

Partitioned OS Kernel - POK

This paper explains POK in detail. <ref>J. Delange and Laurent Lec. POK, an ARINC653-compliant operating system released under the BSD license. In - 13th Real-Time Linux Workshop. http://julien.gunnm.org/data/publications/article-dl11-osadl11.pdf</ref>

Architecture Analysis and Design Language

AADL is used in POK to configure and specify the system's architecture. The model must specify the size of the memory, the time slice, and the communication ports of each partition. If a communication port is not defined in the model, an exception is raised at run time when the application tries to access it. If a fault occurs, the kernel calls a handler function inside the partition that caused the fault.

As explained in sections 5.2 and 5.3 of the OSADL11 paper, there are several AADL keywords. They divide into two categories:

Kernel and partition specification

  • processor
  • virtual processor
  • process
    • feature
  • memory

Behavior code

  • thread
  • data
  • subprogram

Services

  • Time Management -> provides time-related functions to partitions
  • Fault Handling -> catches errors and calls the handler of the faulting partition
  • Inter-partition communication -> explicitly defined during configuration, kernel supervised

RTEMS

Target Architectures

  • x86 (proof of concept)
  • Sparc
  • PowerPC
  • ARM

Paravirtualization layer

The interface between RTEMS and the hypervisor / host OS is provided by a library. Central to the library is a header file defining all necessary functions, e.g. to connect to an IRQ source. The host has to implement the functions specified in the header file and compile them into a library, which is passed to RTEMS. At RTEMS link time the library is included and all remaining undefined references are resolved.

Function list

The listed functions describe the interface provided by the host system. The list is separated into CPU-dependent functions and common BSP functions. This allows the CPU part to be redefined per architecture, while the BSP functionality can perhaps be reused on other architectures. The files can be found in c/src/lib/libbsp/i386/virtualpok/include/.

CPU functions (i386)

  • _CPU_Virtual_Irq_request( int vector )
  • _CPU_Virtual_Irq_detach( int vector )
  • _CPU_Virtual_Interrupts_enable( int _level )
  • _CPU_Virtual_Interrupts_disable( int _level )
  • _CPU_Virtual_Interrupts_flash( int _level )
  • _CPU_Virtual_Interrupts_open(void)
  • _CPU_Virtual_Interrupts_close(void)
  • _CPU_Virtual_idle_thread(void)
  • _CPU_Virtual_exec_stop_error( int _error )

BSP functions (i386)

  • _BSP_Virtual_Console_init(void)
  • _BSP_Virtual_Console_read(void) -- not used, as POK can't read from console
  • _BSP_Virtual_Console_write(char* c)
  • _BSP_Virtual_faulthandler(void)
  • _BSP_Virtual_Clock_init(void)
  • _BSP_Virtual_Clock_read(void)
  • _BSP_Virtual_getworkspacearea(void)

RTEMS startup as a guest

We settled with the following design:

  • Compile POK including the partition, leading to libpart.a in generated-code/cpu/partX/
  • Copy libpart.a to the RTEMS BSP virtualpok/ and compile RTEMS for i386 and the virtualpok BSP
  • The resulting binaries from the samples directory can be used as pok partition binaries
  • Copy, for example, hello.exe to part1/part1.elf and run the kernel compilation again
  • The kernel will start the partition and therefore RTEMS

Compile POK partitions

POK expects ELF binaries to be included in the final linking stage. If we can provide RTEMS with enough information (read: include files) to build a valid partition binary, we may be able to set the entry point into the RTEMS binary and get POK to execute it as it would any other partition. As far as I can see, the ELF file compiled in the partX directories is taken and merged with the kernel binary. At run time the kernel then loads the partition_size table and loads the ELF binaries into memory. I haven't come across any checks of whether the binary is a POK one.

I replaced part1/part1.elf with the RTEMS hello.exe in the generated-code/cpu/Makefile and introduced a new Makefile target just invoking $(TARGET):

{{{
export ARCH=x86
export BSP=x86-qemu
...
TARGET=$(shell pwd)/pok.elf
#PARTITIONS= part1/part1.elf
PARTITIONS= part1/hello.exe
...
KERNEL=kernel/kernel.lo
all: build-kernel partitions $(TARGET)

last: $(TARGET)
}}}

Invoking with ''make last'' produces the expected result: The size.c file contains the size of hello.exe and nm partitions.bin shows the RTEMS symbols. 

The hello world works fine. Read this [http://phipse.github.io/rtems/blog/2013/07/08/HelloWorld/ blog post] for further information.
==  Build process  ==

The build process will be as follows:
 *  Design the POK system via an AADL model.
 *  Keep the size of the '''final''' binary, including RTEMS, in mind.
 *  Build the POK container for the RTEMS code --> Library 
 *  Take the library and pass it to RTEMS at compile time.
 *  Use last year's pok_rtems_combine script to add the final binary as a partition.

This is a clean approach on both sides.
POK is configured with the AADL model, and the partition binary implements the POK side of the communication interface.
As POK starts partitions by loading the ELF binary and jumping to the entry point specified in the ELF header, RTEMS should start fine.
On the RTEMS side the virtualization layer functions work without issues, as the function implementations are passed in via the library.

Read [http://phipse.github.io/rtems/blog/2013/07/08/HelloWorld/ this blog post] on how to build hello world.
=  Virtual CPU Issue  =


The following three options are possible to enable RTEMS to support a virtual CPU.
We have chosen the libcpu/score split.
==  libcpu/score split  ==

===  Structure  ===

The CPU-dependent code is split into virtualization-sensitive and insensitive parts.
The insensitive parts go into ''cpukit/score/cpu/${arch}/''; the sensitive parts go into ''c/src/lib/libcpu/${arch}/${arch}virt/''.

The CPU is selected through the BSP, hence additional virtual BSPs of the form ''${bsp_name}virt'' are introduced.

Therefore no changes to the configuration scripts besides the additional BSP names are necessary.
The target names stay the same.

In the end there is one virtual CPU model and one BSP per virtualized architecture.
===  Configuration  ===

The only change to the RTEMS configuration scripts will be additional names for the ''--enable-rtemsbsp='' option.
===  Questionable parts  ===

All files are below cpukit/score/cpu/i386/ or c/src/lib/libcpu/i386/.
This table lists the name, the file it is in, and the instruction(s):
{| class="wikitable" border="1"
|-
! Name
! File
! Instruction
! Description
|-
| _CPU_ISR_Set_level
| rtems/score/cpu.h
| cli, sti
|
|-
| _CPU_Fatal_halt
| rtems/score/cpu.h
| hlt
|
|-
| _CPU_Thread_Idle_body
| cpu.c
| hlt
|
|-
| CPU_EFLAGS_INTERRUPTS_ON/_OFF 
| rtems/score/cpu.h 
| 
| 
|-
| interrupt.h
| rtems/score/interrupt.h
| 
| Critical.
|- 
| rdtsc
| libcpu: cpuModel.h
|
| No direct access possible.
|}
==  Collective directory ''virt''  ==

===  Structure  ===

To prevent cluttering the BSP and CPU directories with additional virtual CPU models, a collective directory is added.

 *  ''c/src/lib/libbsp/virt/<arch>/<bsp_name>''
 *  ''cpukit/score/cpu/virt/<arch>''

The layout inside these directories is the same as without virtualization.
The names for CPU and BSP stay the same.

The code necessary for the virtualization is shared among the BSPs and CPUs and goes into:
 *  ''c/src/lib/libbsp/virt/shared''
 *  ''cpukit/score/cpu/virt/shared''

The Makefiles have to cover these directories.
===  Configuration  ===

To configure RTEMS for virtual execution of the binary, a new flag is introduced. 

 *  ''--enable-virt:''

It tells autoconf to assume a different directory structure.
The other configuration parameters, which are deduced from ''--target'' and ''--enable-rtemsbsp'', are not touched.

==  Introduce new target  ==


I used this approach to bring RTEMS on L4Re. 
I will explain it with the aid of this implementation.
The architecture in use is x86 and I used the i386 CPU and BSP directory as a starting point.

[https://github.com/phipse/L4RTEMS L4RTEMS source code]
===  Structure  ===


A new target called ''l4vcpu'' was introduced and the corresponding directories:
 *  c/src/lib/libbsp/l4vcpu/
 *  cpukit/score/cpu/l4vcpu/
were added.

These directories are copies of the i386 directories, and only code that produced visible faults was touched and changed.
To provide a point where data can be shared, a so-called ''sharedVariableStruct'' was defined, which accommodates e.g. a pointer to the vCPU structure and a pointer to the l4re_env (L4Re environment).
This is passed to RTEMS in a register at startup, like the multiboot information, and is saved before anything else is executed.

The BSP startup was boiled down, as hardware initialization isn't necessary.
Also some privileged instructions are skipped. 
It's still work in progress.
===  Configuration  ===

Some configuration files were also adapted; see the doc file in the source code.

To configure RTEMS ''l4vcpu-rtems4.11'' must be used as a target and ''pc386'' as BSP. 
===  Compilation & Start up  ===


RTEMS compiles and links without errors.
The resulting ELF binary, e.g. hello.exe, is passed to L4Re as a command line argument.
It is loaded into the application's address space and the vCPU is supplied with EIP and ESP.
=  ARINC 653  =


The ARINC 653 standard defines "a software specification for space and time partitioning in Safety-critical avionics Real-time operating systems".<ref>https://en.wikipedia.org/wiki/ARINC_653</ref> 
These specifications are enforced by an additional layer called APEX (APplication EXecutive).

As POK is ARINC compliant and RTEMS is not, a paravirtualized RTEMS on top of POK would be a way to achieve compliance.
To make use of this compliance, RTEMS needs to be able to communicate with other partitions on POK by using ''inter-partition communication''.


=  GSoC 2012 Project  =

Source code: [https://github.com/jolkaczad/rtems_hyper by Wiktor Langowski ]

The project used syscalls to access POK resources from RTEMS.
To put the code together, the RTEMS binary is compiled - which fails.
The generated .ralf file is then added to POK by rewriting the partition.bin file and fixing the size section in the POK binary.

The code uses a hack:
by naming a function ''bsp_start'' in both POK and RTEMS, the function is somehow executed twice - at least that's what the output suggests.
From my point of view that's not an approach destined to be reused.

=  RTEMS_ENABLE_HYPERVISOR  =


With this guard, all CPU-dependent code is wrapped.
If it is false, the normal score code is used.
If it is true, no CPU code from score is passed to the preprocessor; in this case the POK BSP provides the functionality.
At the moment, the CPU code in the BSP is partly real i386 code and partly stubs copied from no_cpu.
=  Interrupt handling  =

The whole interrupt control chain of RTEMS was abandoned and a separate registration system established.
A dedicated array holds the registered handlers, alongside the POK communication code.


=  Implementation  =


'''More implementation details''', including source code samples, can be found in the project's [http://phipse.github.io/rtems/ blog!]
=  Changes to RTEMS  =


The score/libcpu split is described in the preceding sections.
The implementation can be found in [https://github.com/phipse/rtems/tree/virt-bsp/c/src/lib/libcpu/i386 my github repo]. All changes to RTEMS reside in the virt-bsp branch.

'''ISSUE:''' The split isn't final, as cpukit files are not allowed to include files from outside cpukit.
libcpu is in c/src/ and not in cpukit/, so this is due to change.
Currently, a new configuration option is being discussed.
It would move parts of this back to cpukit and branch conditionally on the configure option.

=  virtualpok BoardSupportPackage  =


The virtualpok BSP is located in c/src/lib/libbsp/i386/ and is based on the BSP of the 2012 project.
The BSP brings with it a console driver, a clock driver, special interrupt functions and custom startup code. 
==  Configuration  ==


The command line to configure RTEMS using this BSP is:
 
$ ../rtems/configure --target=i386-rtems4.11 --enable-rtemsbsp=virtualpok --disable-cxx --disable-networking --enable-posix --enable-maintainer-mode --enable-tests --disable-multiprocessing USE_COM1_AS_CONSOLE=1 BSP_PRESS_KEY_FOR_RESET=0

virtualpok/make/custom/virtualpok.cfg defines the CPU model to be used with this BSP.
The configuration file of libcpu/i386/ checks for this CPU model and builds the corresponding makefile.
This way one target architecture can have several BSPs with distinct CPUs.

==  Startup  ==

virtualpok/start/_start.S defines the standard GNU entry point "start", which is called by POK. 
Normally there would be some hardware checks and initialization, but as we run virtualized, we can switch directly to the standard RTEMS startup procedure boot_card().
When boot_card() returns, it is time to clean up and reset/shut down the board.

boot_card() initializes the RTEMS core structures and then proceeds with the BSP-specific startup by calling bsp_start(), which is located in virtualpok/startup/bspstart.c.
This function is responsible for initializing the interrupt management and all other board specifics, e.g. the clock driver.

One special thing about this BSP are the two virtualization layer files virtualizationlayercpu.h and virtualizationlayerbsp.h in virtualpok/include.
They define the virtualization layer described above.

==  Drivers  ==


The BSP brings a console and clock driver with it.
Both drivers are dependent on calls to the virtualization layer, but fit into the default RTEMS structures.

The console driver defines inbyte() and outbyte(), which call the virtualization layer. 
Termios functions are "supported", meaning the functions calling termios are either stubs or forward to inbyte/outbyte.
The implementation of the virtualization layer is hypervisor-dependent.
In this case it is POK, and since POK doesn't support reading from a console, inbyte won't be of use.

The clock driver is based on the pc386 implementation, but omits all calibration functionality, as we require the host to do this with the normal clock tick. A timer driver is not implemented (see the interrupt issue below). The driver registers its ISR with RTEMS and is called when C_dispatch_isr looks for a driver to deliver the interrupt to. This driver's ISR then calls the default RTEMS clock ISR.

==  Interrupts  ==


The interrupt functionality is not exactly part of the BSP.
It is implemented in libcpu/i386/[virtual|native]/.
To RTEMS it looks the same, but the implementation of i386_enable/disable_interrupts calls the virtualization layer instead of executing sti/cli.

Additionally, there are functions to directly open and close interrupts in _CPU_ISR_Set_level().
This was necessary as the virtualization layer behaves slightly differently than real hardware.
I observed the _level variable taking obscure values, most likely due to the address space switch (user space -> kernel space).
Therefore, the hypervisor counts the enable/disable calls and decreases/increases an internal counter, as opposed to setting the counter to the value defined by _level.
Open and close set this internal counter to 0 or 1 (see [https://github.com/phipse/pok/blob/master/kernel/arch/x86/x86-qemu/bsp.c bsp.c]).
=  Changes to POK  =

=  Virtualization layer & libpart.a  =

The virtualization layer is implemented in the user code part of the partition's code.
Mostly it makes a syscall and passes the arguments through.

libpart.a is built after libpok.a, but with the same
[https://github.com/phipse/pok/blob/master/misc/mk/rules-partition.mk makefile rule].
libpart.a consists of libpok.a plus the objects from the user code files.

=  Interrupt design  =

I redesigned the interrupt handling in POK.
The way interrupts were handled before, and my changes, are described [http://phipse.github.io/rtems/blog/2013/08/17/pok-hardware-interrupt-handling/ here].
Previously there was no way of knowing which interrupt number occurred, as the IDT directly invoked the corresponding handler.
I replaced all 16 hardware interrupt handlers with predefined handlers, which know their interrupt number.
Then a meta handler is called to look up the registered handlers for this vector number; it invokes first a kernel handler and then the partition handlers.
The lookup table is an array of objects, each with fields for the vector number, a handler list sized for the number of configured partitions plus one (the kernel), and a list indicating whether each partition is waiting for an interrupt.
The waiting flag is necessary because a partition needs to process an interrupt before a new one can be delivered, thus imitating hardware behaviour.

To register a handler, I introduced new syscalls to POK to register and unregister a handler, to enable and disable interrupt delivery to the partition, and to acknowledge an interrupt.
The acknowledge syscall sets the waiting flag mentioned above.
(see [https://github.com/phipse/pok/blob/master/kernel/include/core/syscall.h syscall.h] and 
[https://github.com/phipse/pok/blob/master/kernel/core/syscall.c syscall.c] and 
[https://github.com/phipse/pok/blob/master/kernel/arch/x86/x86-qemu/bsp.c bsp.c] )

The functions corresponding to these syscalls are implemented in the x86 bsp.c file.
There are new functions: 
 *  pok_bsp_irq_register_hw,
 *  pok_bsp_irq_unregister_hw,
 *  pok_bsp_irq_partition_ack,
 *  pok_bsp_irq_partition_enable and
 *  pok_bsp_irq_partition_disable.
pok_bsp_irq_register_hw and pok_bsp_irq_unregister_hw only accept IRQ values for the hardware IRQ lines (0-15), and pok_bsp_irq_register can no longer be used to register handlers for hardware IRQs.
The enable and disable functions decrease/increase a counter and return the previous state, so the previous interrupt level can be restored.

=  Be careful with GDB  =


GDB isn't designed for this kind of low-level debugging.
It seems to me it isn't able to handle memory segments.
When you single-step through the kernel space - user space transition and inspect a memory address on the user stack, GDB shows nothing, just 0x00000000.
But if you load the value written to %gs:(%ebx) back into %eax and inspect EAX, it shows the right value.
Debugging this piece of code is only possible by loading values into a register and printing them step by step.
I asked a question about this on [http://stackoverflow.com/questions/18869624/how-can-gdb-print-segmentoffset-addresses stack overflow].

How to use GDB with POK is described in the pok-devel guide in the section "GDB'ing POK with QEMU".
The guide is located in the doc directory of the repo.

=  Remaining Issues  =
==  Forwarding interrupts to user space handlers  ==


The transition to user space proved very challenging.
I was not able to implement it in a reliable way.
In my design the transition to user space is made only once.
The kernel stack is cleaned up after the transition, so no memory leaks occur there.
To make this transition, the interrupted context and the vector are passed to the user space handler, which is responsible for finding the right handler for this interrupt number; before returning to the point of interruption it has to restore the register and stack state.
The return instruction of the handler is used to pop the interrupted EIP from the stack.
Then user space shall proceed as if no interrupt occurred.

This doesn't work well.
It is implemented with manual stack manipulation and magic values to find the data on the stack.
This leads to general protection faults (0xd), as the EIP points to invalid instructions.
This needs debugging - but GDB is of no help, because it can't handle segmentation.
At the time of writing, I need to write documentation before I can go back to debugging.

==  Delivering interrupts that occur while the partition cannot receive them  ==


This can mean two things: either the partition has registered an interrupt and is not scheduled at the moment, or it is busy servicing another interrupt that occurred earlier.
I [http://phipse.github.io/rtems/blog/2013/07/09/how-late-is-it/ discussed this] earlier under the term pending interrupts, as undelivered interrupts '''mustn't''' get lost and therefore have to be stored for later delivery.
I didn't get around to implementing this, but a simple ''unsigned array[16]'' per partition, counting the interrupts that occurred but weren't delivered, should do it.

The scheduler then has to check this array when it schedules the partition again and take care of delivering all pending interrupts, either one by one or by delivering the count.
If the latter is chosen, the partition needs to be aware of it.
For instance, a time_warp function is needed for the clock interrupt.
Claudio da Silva has provided one [https://gist.github.com/cdcs/5874932 here].

=  More Hypervisors (L4Re) =


The concept of the virtualization layer should be portable. 
I write ''should be'' because I haven't tested it.
From my point of view, it should be an easy task to port the virtualpok BSP to L4Re.
L4Re provides a full vCPU interface and library functions to control it. 
Compared to POK this is heaven.

In my previous attempt to virtualize RTEMS on L4Re, before this project, I made L4Re load the RTEMS binary and then configured the vCPU to start at the binary's entry point.
This approach required the RTEMS binary to be analysed at compile time of the L4Re application.
RTEMS was modified as well: back then I introduced a new CPU target called l4vcpu, which was mostly a copy of the i386 CPU target.
The L4RTEMS project can be found on [https://github.com/phipse/L4RTEMS github]

With the virtualization layer this changes.
RTEMS needs a library provided by L4Re to compile, and the binary can then be loaded the same way.
So it doesn't need to be present at compile time.
Furthermore, we should be able to reuse the virtualpok BSP, provided we replace libpart.a with a library provided by L4Re.

=  References  =

<references/>

=  Misc  =

''Under construction''

''' Virtual CPU state/model '''

In a virtual environment the CPU is shared among several virtual machines.
There are privileged instructions (e.g. CLI/STI) which would allow a VM to prevent the hypervisor from switching execution to another VM.
Besides that, there are instructions altering the CPU state without the hypervisor or another VM noticing.
To prevent these disturbances it is common practice to provide each VM with its own virtual CPU implemented in software.
The VM can then, for example, only disable interrupts on its virtual CPU.
This state change is persistent, but only in the virtual CPU model; it isn't written out to the hardware CPU.
This ensures the separation of all VMs in the system.
Additionally, the hypervisor can inspect the virtual CPU state and alter it in case of errors.

'''How is it in POK?'''

I haven't found a pleasing alternative for the virtual CPU yet. 

'''Paravirtualization'''

Paravirtualization is the method of running a guest system on a host system while modifying the guest source code to access host system functionality directly.

'''Paravirtualization layer'''

A set of function headers providing a defined interface to RTEMS as well as to the host system.
RTEMS calls these functions instead of the hardware.
The host needs to provide enough source code to the guest to implement these functions, even if they just emit a call to a host function.
Hopefully, the compiler optimizes this additional function call away.

Alternatively, the host provides a library to be included at link-time, to resolve all missing references.