Version 5 (modified by Youren Shen, on 08/18/14 at 18:14:34) (diff)

GSOC 2014 - Paravirtualization of RTEMS

This year's project builds on the results of last year's project. Both share the same goal: to introduce a virtualization layer into RTEMS. The difference is that this year we focus more on the hypervisor. We designed two mechanisms to connect the guest OS and the host OS: the hypercall, which sends requests from the guest OS to the host OS, and the notification mechanism, which sends requests from the host OS to the guest OS. With these two mechanisms implemented in POK, RTEMS can cooperate well with POK. You can find the wiki page for last year's project.

The proposal for this year is also available.

The source code can be found in my GitHub repository.

Partitioned OS kernel – POK

POK is a partitioned OS kernel compliant with ARINC 653. Our goal is to adapt the POK kernel into a hypervisor suitable for RTEMS paravirtualization. To adapt the POK kernel, the essential prerequisite is to understand how POK works.
= The POK startup flow =

How a kernel starts up on the x86 architecture is common knowledge. We should still look at how POK sets up the GDT, in order to know the privilege levels and segment settings. It is also important to understand the interrupt handling mechanism in POK when dealing with interrupt delivery.

{{{
pok_ret_t pok_arch_init ()
{
  pok_gdt_init ();
  pok_event_init ();
  return (POK_ERRNO_OK);
}
}}}

POK initializes the GDT and the IDT in these two functions. For more details, please see these two blog posts: POK Startup Flow and The syscall system in POK.
= The POK context switch function =

This function is interesting because it differs from its counterpart in other operating systems. It uses the structure context_t to emulate the behavior of an interrupt and an interrupt return.

For more details please see this blog.

The separation between POK kernel and virtualization

Separating the POK BSP into x86-qemu and x86-qemu-vmm benefits the paravirtualization work: changes to x86-qemu-vmm do not affect the normal x86-qemu, and the changes made for x86-qemu-vmm are easier to identify. Where to change: there are two situations. First, in some files only a few lines are specific to virtualization. In this situation, we use the macro POK_NEEDS_X86_VMM to control compilation. Second, if a whole file is for virtualization only, we select it in the Makefile. Here are two examples:
= Separate in one source file =

In arch.c, the function pok_arch_event_register has two variants: one for x86-qemu and the other for x86-qemu-vmm.
{{{
diff --git a/kernel/arch/x86/arch.c b/kernel/arch/x86/arch.c
index 917f3f3..1d54dda 100644
--- a/kernel/arch/x86/arch.c
+++ b/kernel/arch/x86/arch.c
@@ -58,6 +58,7 @@ pok_ret_t pok_arch_idle()
 }
 
 
+#ifdef POK_NEEDS_X86_VMM
 extern void pok_irq_prologue_0(void);
 extern void pok_irq_prologue_1(void);
 extern void pok_irq_prologue_2(void);
@@ -119,6 +120,20 @@ pok_ret_t pok_arch_event_register  (uint8_t vector,
   }
 }
 
+#else
+pok_ret_t pok_arch_event_register  (uint8_t vector,
+                                    void (*handler)(void))
+{
+  pok_idt_set_gate (vector,
+                   GDT_CORE_CODE_SEGMENT << 3,
+               (uint32_t) handler,
+                   IDTE_TRAP,
+                   3);
+
+  return (POK_ERRNO_OK);
+}
+#endif /* POK_NEEDS_X86_VMM */
}}}
=  Separate in Makefile  =

If one whole file is for virtualization only, we can change the Makefile to separate it. Here is an example:
{{{
diff --git a/kernel/arch/x86/Makefile b/kernel/arch/x86/Makefile
index c486d47..f9cba01 100644
--- a/kernel/arch/x86/Makefile
+++ b/kernel/arch/x86/Makefile
@@ -13,10 +13,14 @@ LO_OBJS=   arch.o      \
            space.o     \
            syscalls.o  \
            interrupt.o \
-          interrupt_prologue.o    \
            pci.o       \
            exceptions.o
 
+ifeq ($(BSP),x86-qemu-vmm)
+
+LO_OBJS+= interrupt_prologue.o
+
+endif
 LO_DEPS=   $(BSP)/$(BSP).lo
 
 all: $(LO_TARGET)
}}}
I hope this is clear enough. These changes are used in the following steps.
=  Hypercall  =

The hypercall, a mechanism modeled on the syscall, is a way to use the hypervisor's resources or to notify the hypervisor of events.
Here are the changes made in the POK kernel.
 
# Add a pok_hypercall_init call in pok_event_init, guarded by POK_NEEDS_X86_VMM.
# Add the two header files, kernel/include/core/hypercall.h and libpok/include/core/hypercall.h, and define the corresponding structures and declarations.
# Implement the corresponding functions in the corresponding .c files. That is:
# kernel/arch/x86/hypercall.c, which builds the hypercall_gate.
# kernel/core/hypercall.c, in which pok_core_hypercall handles the hypercall.
# Modify kernel/include/arch/x86/interrupt.h to add support for the hypercall handler.
# Add libpok/arch/x86/hypercall.c, which implements pok_do_hypercall. This function raises the soft interrupt.
# Modify the related Makefiles so that these files are built when the BSP is x86-qemu-vmm and do not affect the normal POK when the BSP is not x86-qemu-vmm.
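
The steps above can be sketched as a small user-space model of the hypercall path. This is only an illustration under assumptions: every `demo_` name, the argument layout, and the return codes are hypothetical stand-ins, not the POK definitions, and the guest side calls the kernel function directly where the real code would raise a soft interrupt.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the POK types; not the real definitions. */
typedef enum {
  DEMO_ERRNO_OK = 0,
  DEMO_ERRNO_EINVAL = 1
} demo_ret_t;

typedef enum {
  DEMO_HYPERCALL_IRQ_REGISTER_VCPU = 30,
  DEMO_HYPERCALL_IRQ_UNREGISTER_VCPU = 31
} demo_hypercall_id_t;

typedef struct {
  uint32_t arg1;
  uint32_t arg2;
} demo_hypercall_args_t;

/* Records the last request so the effect of a hypercall can be observed. */
static uint32_t demo_last_vector;
static uint32_t demo_last_handler;

/* Kernel side: decode the hypercall number and run the matching service. */
static demo_ret_t demo_core_hypercall(demo_hypercall_id_t id,
                                      const demo_hypercall_args_t *args)
{
  switch (id) {
  case DEMO_HYPERCALL_IRQ_REGISTER_VCPU:
    demo_last_vector = args->arg1;
    demo_last_handler = args->arg2;
    return DEMO_ERRNO_OK;
  case DEMO_HYPERCALL_IRQ_UNREGISTER_VCPU:
    demo_last_vector = 0;
    demo_last_handler = 0;
    return DEMO_ERRNO_OK;
  default:
    return DEMO_ERRNO_EINVAL;   /* unknown hypercall number */
  }
}

/* Guest side: pack the arguments and enter the kernel. The real code would
   raise a soft interrupt here instead of calling the function directly. */
static demo_ret_t demo_do_hypercall(demo_hypercall_id_t id,
                                    uint32_t arg1, uint32_t arg2)
{
  demo_hypercall_args_t args = { arg1, arg2 };
  return demo_core_hypercall(id, &args);
}
```

Keeping the packing on the guest side and the decoding on the kernel side mirrors the split between the libpok and kernel files listed above.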

For more details please see this [http://huaiyusched.github.io/2014/05/30/build-a-new-hypercall-system-by-imitating-the-syscall/ blog].
=  vCPU in partition  =


The vCPU is the part of the VMM scheduler that manages the processor, and its arch-dependent structure (arch_vcpu) is tied to the current partition.
Therefore the vCPU structure as a whole belongs to processor management and should be placed in the kernel.

I created a new file vcpu.h in kernel/include and put the vcpu structure definition in it. Then I created arch_vcpu.h in kernel/arch/x86 and put the arch_vcpu structure in it. This structure uses context_t to hold user_regs.
The arch_vcpu structure also contains an irq_desc struct to store interrupt information.
Then I created a new file vcpu.c in kernel/core and implemented the alloc_vcpu function in it. This function relies on some arch-dependent functions, such as alloc_vcpu_struct and vcpu_initialize, so I created a new file arch_vcpu.c in pok_kernel/arch/x86 and put the arch-dependent functions there.

I also modified some files, such as pok/kernel/include/core/partition.h, where I plan to add a vcpu list head to each partition. Another modified file is kernel/core/sched.c, where I added some empty functions, because scheduling of vCPUs is not necessary yet.

Finally, I added the alloc_vcpu call to partition_init.
All of these functions will be tested this week.

A few things should be noted:
#The space allocated by alloc_vcpu_struct cannot be freed, so once a vCPU has been allocated it cannot be destroyed. As a result, the vCPU cannot be dynamic. Maybe we can allocate it from the AADL file in the future.
#In the function vcpu_initialize, we planned to set up the schedule function, but since scheduling of vCPUs is not essential yet, the function is empty for now.
#The function declarations in the header files are omitted in this blog.
New files
#pok/kernel/core/vcpu.c
#pok/kernel/arch/x86/arch_vcpu.c
#pok/kernel/include/core/vcpu.h
#pok/kernel/include/arch/x86/arch_vcpu.h
Modified files
#pok/kernel/core/sched.c
#pok/kernel/include/partition.h
Reused structure
#The context_t is reused in arch_vcpu, to put the user_regs.
#The interrupt_frame is reused in arch_vcpu, to put the interrupt information.
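
The layout described above can be sketched as follows. This is a minimal model under assumptions, not the POK source: all `demo_` names, the field lists, and the partition count are hypothetical, and a fixed static pool is used here only to illustrate why an allocated vCPU is never freed.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NB_PARTITIONS 2   /* hypothetical partition count */
#define DEMO_NB_IRQ_DESC   16  /* slots per vCPU, as in the loops on this page */

/* Hypothetical register context; the real context_t is defined by POK. */
typedef struct {
  uint32_t eax, eip, esp;
} demo_context_t;

typedef struct {
  uint8_t  vector;    /* 0 means the slot is free */
  uint8_t  pending;
  uint32_t counter;   /* number of deliveries still owed to the guest */
} demo_irq_desc_t;

/* Arch-dependent part of the vCPU. */
typedef struct {
  demo_context_t  user_regs;
  demo_irq_desc_t irqdesc[DEMO_NB_IRQ_DESC];
  uint32_t        handler;   /* common interrupt entry of the guest */
} demo_arch_vcpu_t;

/* Arch-independent part of the vCPU. */
typedef struct {
  uint8_t          partition_id;
  uint8_t          pending;
  demo_arch_vcpu_t arch;
} demo_vcpu_t;

/* Static pool: zero-initialized, one vCPU per partition, never released. */
static demo_vcpu_t demo_vcpu_pool[DEMO_NB_PARTITIONS];
static size_t      demo_vcpu_used;

/* Hand out the next vCPU from the pool, or NULL when the pool is exhausted. */
static demo_vcpu_t *demo_alloc_vcpu(uint8_t partition_id)
{
  if (demo_vcpu_used >= DEMO_NB_PARTITIONS)
    return NULL;
  demo_vcpu_t *v = &demo_vcpu_pool[demo_vcpu_used++];
  v->partition_id = partition_id;
  return v;
}
```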

For more details please see this [http://huaiyusched.github.io/2014/06/10/the-design-of-vcpu-in-pok/ blog].
=  Register interrupt handler for vCPU  =


The Guest OS should register its interrupt handlers first. In the paravirtualization layer, all native interrupt registration functions in RTEMS should be replaced with this register function for the vCPU.

This function is implemented as a hypercall. We add a new hypercall and implement the core function.
=  Add a new hypercall  =

New hypercall number
{{{
POK_HYPERCALL_IRQ_REGISTER_VCPU          =  30,
POK_HYPERCALL_IRQ_UNREGISTER_VCPU        =  31,
}}}
New case in pok_core_hypercall
{{{
pok_ret_t pok_core_hypercall (const pok_hypercall_id_t       hypercall_id,
                            const pok_hypercall_args_t*    args,
                            const pok_hypercall_info_t*    infos)
{
....
  /* register interrupt delivery to vcpu */
   case POK_HYPERCALL_IRQ_REGISTER_VCPU:
       return pok_bsp_irq_register_vcpu(args->arg1,(void(*)(uint8_t)) ((args->arg2 + infos->base_addr)));
       break;
   /* unregister interrupt delivery to vcpu */
   case POK_HYPERCALL_IRQ_UNREGISTER_VCPU:
       return pok_bsp_irq_unregister_vcpu(args->arg1);
       break;
....
}
}}}
=  Register Function for vCPU  =


The corresponding function pok_bsp_irq_register_vcpu is implemented in pok/kernel/arch/x86/x86-qemu-vmm/bsp.c:
{{{
/*
 * Register irq in vCPU.
 * This irq must register in POK kernel first.
 * The parameter vector is the irq number.
 * The handle_irq is a common Entry from Guest OS.
 */
pok_ret_t pok_bsp_irq_register_vcpu(uint8_t vector,void (*handle_irq)(uint8_t))
{
  uint8_t i;
  struct vcpu *v;
  
  if(vector < 32)
    return POK_ERRNO_EINVAL;
  v = pok_partitions[POK_SCHED_CURRENT_PARTITION].vcpu;
  for (i=0; i<16; i++)
  {
    if(v->arch.irqdesc[i].vector == 0)
    {
      v->arch.irqdesc[i].vector=vector;
      v->arch.handler=(uint32_t) handle_irq;
      return POK_ERRNO_OK;
    }
  }
  /* No free irq_desc slot is left in this vCPU. */
  return POK_ERRNO_EINVAL;
}
}}}
As we can see, an irq_desc slot in the vCPU does not correspond to a fixed IRQ in the POK kernel. When the guest needs to handle some interrupt in the vCPU, it invokes this hypercall. The function then finds an empty irq_desc, assigns it the IRQ passed in the hypercall, and sets the handler of the Guest OS.
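
The unregister counterpart, pok_bsp_irq_unregister_vcpu, is not listed on this page. The following is a minimal sketch of how a register/unregister pair over such a slot table could look, assuming that unregistering simply clears the matching slot. All `demo_` names and the integer return values are hypothetical.

```c
#include <stdint.h>

#define DEMO_NB_IRQ_DESC 16   /* same slot count as the loop above */

typedef struct {
  uint8_t  vector;    /* 0 means the slot is free */
  uint8_t  pending;
  uint32_t counter;
} demo_irq_desc_t;

typedef struct {
  demo_irq_desc_t irqdesc[DEMO_NB_IRQ_DESC];
  uint32_t        handler;   /* common interrupt entry of the guest */
} demo_vcpu_t;

/* Claim the first free slot. Returns 0 on success, -1 on failure. */
static int demo_irq_register(demo_vcpu_t *v, uint8_t vector, uint32_t handler)
{
  int i;

  if (vector < 32)            /* x86 vectors 0..31 are reserved for exceptions */
    return -1;
  for (i = 0; i < DEMO_NB_IRQ_DESC; i++) {
    if (v->irqdesc[i].vector == 0) {
      v->irqdesc[i].vector = vector;
      v->handler = handler;
      return 0;
    }
  }
  return -1;                  /* every slot is in use */
}

/* Release the slot that holds this vector. Returns 0 on success, -1 if absent. */
static int demo_irq_unregister(demo_vcpu_t *v, uint8_t vector)
{
  int i;

  if (vector < 32)
    return -1;
  for (i = 0; i < DEMO_NB_IRQ_DESC; i++) {
    if (v->irqdesc[i].vector == vector) {
      v->irqdesc[i].vector = 0;
      v->irqdesc[i].pending = 0;
      v->irqdesc[i].counter = 0;
      return 0;
    }
  }
  return -1;
}
```

Using vector 0 as the "free" marker works only because vectors below 32 are rejected on registration.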

For more details please see this [http://huaiyusched.github.io/2014/07/29/the-interrupt-register-function-for-vcpu blog].
=  The interrupt delivery  =

This part currently crashes the kernel. However, if the mask function is not used (i.e. no interrupt is pending in the vCPU), the upcall_irq function returns to the guest OS. (Note: this was observed because of a bug that left the do_IRQ part unused while upcall_irq was still invoked. This bug is now fixed.)
=  The POK part  =


To deliver an interrupt to RTEMS, we first mark the corresponding interrupt as pending in the vCPU when it occurs. Then, when the vCPU resumes, we go to user space to execute the interrupt handler of RTEMS. See this [http://huaiyusched.github.io/rtems/2014/04/07/the-brief-design-and-outline/ blog].
=  The interrupt mask function  =


The interrupt mask function is do_IRQ. Whenever an interrupt occurs, it is invoked to check the irq_desc entries of every vCPU. For example:
{{{
INTERRUPT_HANDLER(pit_interrupt)
{
   //uint8_t vector;
   //vector = 32;
   (void) frame;
   pok_pic_eoi (PIT_IRQ);
   do_IRQ(32);
   CLOCK_HANDLER;
}
}}}
Let's grab do_IRQ and check it out.
{{{
#ifdef POK_NEEDS_X86_VMM

/*
 * Deal with the interrupt if the interrupt should be handler by guest
 */
void do_IRQ(uint8_t vector)
{
  do_IRQ_guest(vector);
}

/*
 * Decide the interrupt should be send to guest or not
 */
void do_IRQ_guest(uint8_t vector)
{
  uint8_t i,j;
  struct vcpu *v;
  for(i = 0 ; i < POK_CONFIG_NB_PARTITIONS ; i++)
  {
    v = pok_partitions[i].vcpu;
    for (j = 0 ; j< 16; j++)
    {
      if(v->arch.irqdesc[j].vector == vector)
      {
        v->arch.irqdesc[j].pending = TRUE;
        v->pending = TRUE;
        v->arch.irqdesc[j].count++;   /* index the slot with j, not the partition index i */
      }
    }
  }
}

#endif /* POK_NEEDS_X86_VMM */
}}}
do_IRQ is supposed to perform some common interrupt processing, but at present it does none. do_IRQ_guest, as described above, checks the irq_desc array of each vCPU; if this interrupt has been registered there, it marks the interrupt as pending and increments its counter.
For more details please see this [http://huaiyusched.github.io/2014/07/29/the-interrupt-register-function-for-vcpu blog].
=  The upcall function  =

The name of this upcall function may be confusing. When the vCPU resumes, the POK kernel checks the vCPU state; if any interrupt is pending, it makes a call up to RTEMS in user space. After RTEMS has handled the interrupt, control should return to the POK kernel and then to the normal program so that it can continue its work. This is the most difficult part.
==  When the upcall will be invoked?  ==


When the partition resumes, the pending interrupts of this partition should be handled. So the upcall is invoked when both of the following hold:
# The corresponding partition resumes.
# An interrupt is pending in this vCPU.
==  What exactly does upcall_irq do?  ==


{{{
uint32_t upcall_irq(interrupt_frame* frame)
{
  struct vcpu *v;
  uint8_t i;
  uint32_t _eip;
  uint32_t user_space_handler;
  v = pok_partitions[POK_SCHED_CURRENT_PARTITION].vcpu;
  _eip = frame->eip;       // if no interrupt happened, return the point of normal program;
  user_space_handler = v->arch.handler;
  user_space_handler -= pok_partitions[POK_SCHED_CURRENT_PARTITION].base_addr;
  if(v->pending != 0)
  {
    for(i=0;i<16;i++)      /* irqdesc has 16 slots, as in the register function */
    {
      if(v->arch.irqdesc[i].counter != 0)
      {
        save_interrupt_vcpu(v,frame);
        __upcall_irq(frame, i, (uint32_t) user_space_handler);
        v->arch.irqdesc[i].counter --;
        return user_space_handler;  // an interrupt is pending: return the address of the interrupt handler
      }
    }
  }
  return _eip;
}

}}}

# Check the pending flag in the vCPU. If it indicates that an interrupt is pending, go to step 2.
# Check each irq_desc.
# If the counter is not zero, save the interrupted context (the interrupt frame) in the vCPU and invoke __upcall_irq.
==  What does __upcall_irq do?  ==


A function named __some_function is usually the core or assembly part of some_function. So __upcall_irq is the key to understanding the mechanism of upcall_irq, and it is important to understand this function.

{{{
void __upcall_irq(interrupt_frame* frame,uint8_t vector, uint32_t handler)
{
  frame->eax = vector;        //put the irq number to eax
  frame->eip = handler;       //Set the eip as handler
}
}}}

Then it returns to the common interrupt handler entry of the Guest OS.

As you can see, __upcall_irq sets the eax register to the vector of the current interrupt. It also performs a very important step: it changes the eip value that is stored on the stack for iret.

As we all know, iret restores eip, cs and eflags from the stack. So we change the eip on the stack, and when iret executes, the program continues at the address this eip points to. What is this eip? The handler stored in the vCPU. The original eip has been saved by the save_interrupt_vcpu function.

This is all of the work done in the POK kernel for interrupt delivery.
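
The redirect-and-restore idea can be tried out in user space with a plain struct standing in for the interrupt frame. This is a sketch under assumptions: the `demo_` types hold only the two fields discussed here, whereas the real interrupt_frame and the real save_interrupt_vcpu are defined by POK and are not shown on this page.

```c
#include <stdint.h>

/* Hypothetical subset of the interrupt frame: only the fields used here. */
typedef struct {
  uint32_t eax;
  uint32_t eip;
} demo_frame_t;

typedef struct {
  demo_frame_t saved;   /* context of the interrupted guest program */
} demo_vcpu_t;

/* Keep a copy of the interrupted context inside the vCPU. */
static void demo_save_interrupt_vcpu(demo_vcpu_t *v, const demo_frame_t *f)
{
  v->saved = *f;
}

/* Put the interrupted context back into the frame. */
static void demo_restore_interrupt_vcpu(const demo_vcpu_t *v, demo_frame_t *f)
{
  *f = v->saved;
}

/* Redirect the return path: iret takes eip from the frame, so overwriting it
   makes execution continue in the guest handler, with the vector in eax. */
static void demo_upcall(demo_vcpu_t *v, demo_frame_t *f,
                        uint8_t vector, uint32_t handler)
{
  demo_save_interrupt_vcpu(v, f);
  f->eax = vector;
  f->eip = handler;
}
```

For example, after `demo_upcall(&v, &f, 32, 0x1234)` the frame resumes at `0x1234`, and a later `demo_restore_interrupt_vcpu(&v, &f)` brings back the original `eip` and `eax`.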
=  The RTEMS Part  =


We assume that the upcall function works well and that eip now points to the handler that RTEMS registered before.
=  The Handler of RTEMS  =

The handler we register is a common entry for all RTEMS interrupts. The POK kernel passes the IRQ vector in the eax register,
so the handler should first read the IRQ number from that register.
Here is an example of a handler in the Guest OS:
{{{
void handle_irq()
{
  uint32_t irq=0;
  do{
  /* add reads and writes irq, so the operand must be "+m" rather than "=m" */
  asm volatile(
      "add %%eax,%0  \n"	\
      :"+m"(irq)	\
      :
      :"%eax");
  }while(0);
  switch(irq)
  {
case PIT_IRQ:
    tick_counter++;
    printf( "Clock gettick: %u \n",tick_counter);
    pok_hypercall1( POK_HYPERCALL_IRQ_DO_IRET,0);    
    break;

    
default:
    pok_hypercall1( POK_HYPERCALL_IRQ_DO_IRET,0);    
  }
}
}}}

As we can see, the common entry can invoke the corresponding RTEMS interrupt handler in user space.
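
As the number of interrupt sources grows, the switch in the common entry could be replaced by a lookup table. The following is a sketch of that idea, not code from the project: all `demo_` names are hypothetical, the IRQ number is passed as a parameter instead of being read from eax, and the hypercall that returns to the hypervisor is replaced by a stub that only counts calls.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NB_IRQ 16   /* hypothetical number of guest interrupt sources */

typedef void (*demo_isr_t)(void);

static demo_isr_t demo_isr_table[DEMO_NB_IRQ];
static uint32_t   demo_tick_counter;
static uint32_t   demo_iret_calls;

/* Example service routine for the clock interrupt. */
static void demo_clock_isr(void)
{
  demo_tick_counter++;
}

/* Stub for the hypercall that hands control back to the hypervisor. */
static void demo_hypercall_do_iret(void)
{
  demo_iret_calls++;
}

/* Install a service routine. Returns 0 on success, -1 if irq is out of range. */
static int demo_install_isr(uint32_t irq, demo_isr_t isr)
{
  if (irq >= DEMO_NB_IRQ)
    return -1;
  demo_isr_table[irq] = isr;
  return 0;
}

/* Common entry: run the installed routine, if any, then always leave
   through the hypervisor. */
static void demo_handle_irq(uint32_t irq)
{
  if (irq < DEMO_NB_IRQ && demo_isr_table[irq] != NULL)
    demo_isr_table[irq]();
  demo_hypercall_do_iret();
}
```

Calling the return hypercall in one place avoids repeating it in every case, as the switch version has to.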

However, we cannot simply use the iret instruction in user space to return from this interrupt handler. We need a specific iret.
=  Do_IRET in Hypercall  =

After handling the interrupt, we should invoke POK_HYPERCALL_IRQ_DO_IRET. This hypercall invokes do_iret in the POK kernel, so the hypercall brings us back into kernel space.

What does do_iret do?
{{{
/*
 * This do_iret will check the irq_desc,and according to the irq_desc, construct interrupt frame, then iret to execute handler of Guest OS
 */
pok_ret_t do_iret(interrupt_frame *frame)
{
  struct vcpu *v;
  uint8_t i;
  uint32_t user_space_handler;

  v = pok_partitions[POK_SCHED_CURRENT_PARTITION].vcpu;

  user_space_handler = v->arch.handler;
  user_space_handler -= pok_partitions[POK_SCHED_CURRENT_PARTITION].base_addr;
  if(v->pending != 0)
  {
    for(i=0;i<16;i++)      /* irqdesc has 16 slots, as in the register function */
    {
      if(v->arch.irqdesc[i].counter != 0)
      {
        __upcall_irq(frame, i, (uint32_t) user_space_handler);
        v->arch.irqdesc[i].counter--;
        return POK_ERRNO_OK;
      }
    }
    v->pending = 0;
  }
  else if(v->pending == 0)
  {
    restore_interrupt_vcpu(v, frame);
  }

  return POK_ERRNO_OK;
}
}}}
The do_iret function is similar to upcall_irq to some extent, so I will introduce it only briefly.

The do_iret function checks the pending flag and the irq_desc structures. If no more interrupts are pending in this vCPU, it restores the interrupt context from the vCPU; otherwise, it returns to the handler of the Guest OS again.
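
The draining behaviour described in the paragraph above can be modelled in user space as follows. This is a sketch under assumptions, with hypothetical `demo_` names and a two-field frame. Unlike the listing, it restores the saved context as soon as no counter is left, which is the behaviour the paragraph describes.

```c
#include <stdint.h>

#define DEMO_NB_IRQ_DESC 16

/* Hypothetical subset of the interrupt frame. */
typedef struct {
  uint32_t eax;
  uint32_t eip;
} demo_frame_t;

typedef struct {
  uint8_t      pending;                     /* any delivery still owed?  */
  uint32_t     counter[DEMO_NB_IRQ_DESC];   /* owed deliveries per slot  */
  uint32_t     handler;                     /* common guest entry        */
  demo_frame_t saved;                       /* interrupted guest context */
} demo_vcpu_t;

/* Returns 1 if the frame was pointed at the guest handler again,
   0 if the interrupted context was restored. */
static int demo_do_iret(demo_vcpu_t *v, demo_frame_t *f)
{
  int i;

  if (v->pending) {
    for (i = 0; i < DEMO_NB_IRQ_DESC; i++) {
      if (v->counter[i] != 0) {
        v->counter[i]--;
        f->eax = (uint32_t) i;   /* slot index, as passed in the listing */
        f->eip = v->handler;
        return 1;
      }
    }
    v->pending = 0;              /* every counter has been drained */
  }
  *f = v->saved;                 /* resume the interrupted program */
  return 0;
}
```

With two deliveries owed on one slot, the first two calls return 1 and re-enter the handler, and the third returns 0 with the original `eip` back in the frame.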

For more details, please see this [http://huaiyusched.github.io/2014/08/08/the-current-workflow-of-interrupt-handling blog].
=  Summary  =


Here is an illustration:
[wiki:File:The_work_flow_of_interrupt_handler_in_vCPU.jpg File:The work flow of interrupt handler in vCPU.jpg]

As this illustration shows, the work flow of the interrupt handler in the vCPU is clear.
=  To be improved in the future  =
=  In POK  =

# The vCPU in the POK kernel currently works only for interrupt handling. We should improve it to be a part of the POK processor manager, for example by adding a macro for the CURRENT vCPU, a schedule function, etc.
# The upcall_irq is not working for now. I promise I will fix it and get it running on RTEMS. Completing the hypervisor this year will benefit next year's successor.
=  In RTEMS  =

# For now, we use the timer interrupt to test interrupt virtualization. In the future, however, the timer interrupt should not be delivered to RTEMS, because the cost is too high. This requires us to build a virtual time system.
# When upcall_irq works, we should use the paravirtualization layer's API to rebuild the syscalls in RTEMS.