RTEMS-NFS
=========

An NFS-V2 client implementation for the RTEMS real-time executive.

Author: Till Straumann, 2002

Copyright 2002, Stanford University and Till Straumann

Stanford Notice
***************

Acknowledgement of sponsorship
* * * * * * * * * * * * * * * *

This software was produced by the Stanford Linear Accelerator Center,
Stanford University, under Contract DE-AC03-76SFO0515 with the Department
of Energy.

Contents
--------

 I    Overview
      1) Performance
      2) Reference Platform / Test Environment

 II   Usage
      1) Initialization
      2) Mounting Remote Server Filesystems
      3) Unmounting
      4) Unloading
      5) Dumping Information / Statistics

 III  Implementation Details
      1) RPCIOD
      2) NFS
      3) RTEMS Resources Used By NFS/RPCIOD
      4) Caveats & Bugs

 IV   Licensing & Disclaimers


I Overview
----------

This package implements a simple non-caching NFS client for RTEMS. Most of
the system calls are supported, with the exception of 'mount', i.e. it is
not possible to mount another FS on top of NFS (mostly because of the
difficulty that arises when mount points are deleted on the server). It
shouldn't be hard to do, though.

Note: this client supports NFS vers. 2 / MOUNT vers. 1; NFS version 3 or
higher is NOT supported.

The package consists of two modules: RPCIOD and NFS itself.

 - RPCIOD is a UDP/RPC multiplexor daemon. It takes RPC requests from
   multiple local client threads, funnels them through a single socket to
   multiple servers and dispatches the replies back to the (blocked)
   requestor threads. RPCIOD does packet retransmission, handles timeouts
   etc. Note, however, that it does NOT do any XDR marshalling - it is up
   to the requestor threads to do the XDR encoding/decoding. RPCIOD _is_
   RPC specific, though, because its message dispatching is based on the
   RPC transaction ID.

 - The NFS package maps RTEMS filesystem calls to the proper RPCs, does
   the XDR work and hands marshalled RPC requests to RPCIOD. All of the
   calls are synchronous, i.e. they block until they get a reply.

1) Performance
- - - - - - - -

Performance sucks (due to the lack of read-ahead/delayed write and
caching). On a fast (100Mb/s) ethernet, it takes about 20s to copy a 10MB
file from NFS to NFS. I found, however, that VxWorks' NFS client doesn't
seem to be any faster...

Since there is no buffer cache with read-ahead implemented, all NFS reads
are synchronous RPC calls. Every read operation involves sending a request
and waiting for the reply. As long as the overhead (sending the request +
processing it on the server) is significant compared to the time it takes
to transfer the actual data, increasing the amount of data per request
results in better throughput. The UDP packet size limit imposes a limit of
8k per RPC call, hence reading from NFS in chunks of 8k is better than
chunks of 1k [but chunks >8k are not possible, i.e., simply not honoured:
read(a_nfs_fd, buf, 20000) returns 8192]. This is similar to the old Linux
days (mount with rsize=8k). You can let stdio take care of the buffering
or use 8k buffers with explicit read(2) operations. Note that stdio
honours the filesystem's st_blksize field if newlib is compiled with
HAVE_BLKSIZE defined. In this case, stdio transparently uses 8k buffers
for files on NFS. The blocksize NFS reports can be tuned with a global
variable setting (see nfs.c for details).

Further increase of throughput can be achieved with read-ahead (issuing
RPC calls in parallel [send out the request for block n+1 while you are
waiting for the data of block n to arrive]). Since this is not handled by
the file system itself, you would have to code this yourself, e.g., using
parallel threads to read a single file from interleaved offsets. Another
obvious improvement can be achieved if processing the data takes a
significant amount of time. Then, having a pipeline of threads for reading
data and processing them makes sense [thread b processes chunk n while
thread a blocks in read(chunk n+1)].
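As an illustration, a minimal single-threaded read loop using 8k transfers
might look like the sketch below (the file name is invented for the
example and error handling is abridged; this is not code from the package
itself):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Read a file residing on an NFS mount in 8k chunks.
     * "/mnt/remote/bigfile" is an example path only.
     */
    static long slurp(const char *path)
    {
        static char buf[8192];      /* 8k matches the UDP/RPC size limit */
        int         fd;
        long        total = 0;
        ssize_t     got;

        if ( (fd = open(path, O_RDONLY)) < 0 ) {
            perror("open");
            return -1;
        }
        /* every read() below is one synchronous RPC round-trip */
        while ( (got = read(fd, buf, sizeof buf)) > 0 )
            total += got;
        close(fd);
        return total;
    }

    /* e.g.:  printf("read %li bytes\n", slurp("/mnt/remote/bigfile")); */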
Some performance figures:

  Software: src/nfsTest.c: nfsReadTest() [data not processed in any way]
  Hardware: MVME6100
  Network:  100baseT-FD
  Server:   Linux-2.6/RHEL4-smp [Dell Precision 420]
  File:     10MB

  Results:
    Single threaded ('normal') NFS read, 1k buffers:  3.46s (2.89 MB/s)
    Single threaded ('normal') NFS read, 8k buffers:  1.31s (7.63 MB/s)
    Multi threaded; 2 readers, 8k buffers/xfers:      1.12s (8.9  MB/s)
    Multi threaded; 3 readers, 8k buffers/xfers:      1.04s (9.6  MB/s)

2) Reference Platform
- - - - - - - - - - -

RTEMS-NFS was developed and tested on

 o RTEMS-ss20020301 (local patches applied)
 o PowerPC G3, G4 on Synergy SVGM series boards (custom 'SVGM' BSP, to be
   released soon)
 o PowerPC 604 on MVME23xx (powerpc/shared/motorola-powerpc BSP)
 o Test Environment:
    - RTEMS executable running CEXP
    - rpciod/nfs dynamically loaded from TFTPfs
    - EPICS application dynamically loaded from NFS; the executing IOC
      accesses all of its files on NFS.

II Usage
--------

After linking into the system and proper initialization (rtems-NFS
supports 'magic' module initialization when loaded into a running system
with the CEXP loader), you are ready for mounting NFSes from a server (I
avoid the term NFS filesystem because NFS already stands for 'Network File
System'). You should also read the sections

 - "RTEMS Resources Used By NFS/RPCIOD"
 - "CAVEATS & BUGS"

below.

1) Initialization
- - - - - - - - -

NFS consists of two modules which must be initialized:

 a) the RPCIO daemon package, by calling

      rpcUdpInit();

    note that this step must be performed prior to initializing NFS.

 b) NFS is initialized by calling

      nfsInit( smallPoolDepth, bigPoolDepth );

    if you supply 0 (zero) values for the pool depths, the compile-time
    default configuration is used, which should work fine.

NOTE: when using CEXP to load these modules into a running system,
initialization will be performed automagically.
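For illustration, explicit initialization from application code could look
like this minimal sketch (the surrounding function and failure handling
are invented; both routines are assumed to return 0 on success, and zero
pool depths select the compile-time defaults as described above):

    #include <stdio.h>
    /* prototypes for rpcUdpInit()/nfsInit() come with this package */

    /* Bring up the NFS client; the network stack must already be running. */
    void start_nfs_client(void)
    {
        if ( rpcUdpInit() ) {
            fprintf(stderr, "rpcUdpInit() failed\n");
            return;
        }
        if ( nfsInit(0, 0) ) {      /* 0/0: compile-time default pools */
            fprintf(stderr, "nfsInit() failed\n");
        }
    }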
2) Mounting Remote Server Filesystems
- - - - - - - - - - - - - - - - - - -

There are two interfaces for mounting an NFS:

 - The (non-POSIX) RTEMS 'mount()' call:

     mount( &mount_table_entry_pointer,
            &filesystem_operations_table_pointer,
            options,
            device,
            mount_point )

   Note that you must specify a 'mount_table_entry_pointer' (use a dummy)
   - RTEMS' mount() doesn't grok a NULL for the first argument.

   o for the 'filesystem_operations_table_pointer', supply

       &nfs_fs_ops

   o options are constants (see RTEMS headers) for specifying read-only /
     read-write mounts.

   o the 'device' string specifies the remote filesystem which is to be
     mounted. NFS expects a string conforming to the following format
     (EBNF syntax):

       [ <uid> '.' <gid> '@' ] <host> ':' <path>

     The first, optional part of the string allows you to specify the
     credentials to be used for all subsequent transactions with this
     server. If this part is omitted, the EUID/EGID of the executing
     thread (i.e. the thread performing the 'mount') are used - NFS will
     still 'remember' these values and use them for all future
     communication with this server.

     The <host> part denotes the server IP address in standard 'dot'
     notation. It is followed by a colon and the (absolute) <path> on the
     server. Note that no extra characters or whitespace must be present
     in the string. Example 'device' strings are:

       "300.99@192.168.44.3:/remote/rtems/root"
       "192.168.44.3:/remote/rtems/root"

   o the 'mount_point' string identifies the local directory (most
     probably on IMFS) where the NFS is to be mounted. Note that the
     mount point must already exist with proper permissions.

 - Alternate 'mount' interface. NFS offers a more convenient wrapper
   taking three string arguments:

     nfsMount(uidgid_at_host, server_path, mount_point)

   This interface does DNS lookup (see the reentrancy note below) and
   creates the mount point if necessary.

   o the first argument specifies the server and, optionally, the uid/gid
     to be used for authentication. The semantics are exactly as described
     above:

       [ <uid> '.' <gid> '@' ] <host>

     The <host> part may be either a host _name_ or an IP address in 'dot'
     notation. In the former case, nfsMount() uses 'gethostbyname()' to do
     a DNS lookup. IMPORTANT NOTE: gethostbyname() is NOT reentrant/
     thread-safe and 'nfsMount()' (if not provided with an IP/dot address
     string) is hence subject to race conditions.

   o the 'server_path' and 'mount_point' arguments are as described above.
     NOTE: if the mount point does not exist yet, nfsMount() tries to
     create it.

   o if nfsMount() is called with a NULL 'uidgid_at_host' argument, it
     lists all currently mounted NFSes.

3) Unmounting
- - - - - - -

An NFS can be unmounted using the RTEMS 'unmount()' call (yep, it is
unmount() - not umount()):

    unmount(mount_point)

Note that you _must_ supply the mount point (string argument). It is _not_
possible to specify the 'mountee' when unmounting. NFS implements no
convenience wrapper for this (yet), essentially because (although this
sounds unbelievable) it is non-trivial to look up the path leading to an
RTEMS filesystem directory node.

4) Unloading
- - - - - - -

After unmounting all NFSes from the system, the NFS and RPCIOD modules may
be stopped and unloaded. Just call 'nfsCleanup()' and 'rpcUdpCleanup()',
in this order. You should evaluate the return value of these routines,
which is non-zero if either of them refuses to yield (e.g. because there
are still mounted filesystems). Again, when unloading is done by CEXP,
this is handled transparently.

5) Dumping Information / Statistics
- - - - - - - - - - - - - - - - - -

Rudimentary RPCIOD statistics are printed to a file (stdout when NULL is
passed) by

    int rpcUdpStats(FILE *f)

A list of all currently mounted NFSes can be printed to a file (stdout if
NULL is passed) using

    int nfsMountsShow(FILE *f)

For convenience, this routine is also called by nfsMount() when supplying
NULL arguments.
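Putting this section together, a complete session might look like the
following sketch (server address, remote path and mount point are invented
for the example, error handling is largely omitted, and the 0-on-success
return convention for nfsMount() is an assumption):

    #include <stdio.h>
    /* prototypes for the NFS/RPCIOD entry points come with this package;
     * unmount() is declared in the RTEMS libio headers.
     */

    static void nfs_session_demo(void)
    {
        /* init - mount - use - unmount - cleanup, in that order */
        rpcUdpInit();
        nfsInit(0, 0);

        /* mount with uid 300 / gid 99; "/mnt/remote" is created if needed */
        if ( nfsMount("300.99@192.168.44.3",
                      "/remote/rtems/root",
                      "/mnt/remote") ) {
            fprintf(stderr, "nfsMount() failed\n");
            return;
        }

        nfsMountsShow(NULL);        /* list mounted NFSes on stdout */
        rpcUdpStats(NULL);          /* dump RPCIOD statistics       */

        /* ... access files below /mnt/remote with the usual calls ... */

        unmount("/mnt/remote");     /* must name the mount point    */

        /* unload only after everything is unmounted;
         * a non-zero return value means the module refuses to yield
         */
        if ( nfsCleanup() || rpcUdpCleanup() )
            fprintf(stderr, "cleanup refused - still in use?\n");
    }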
III Implementation Details
--------------------------

1) RPCIOD
- - - - -

RPCIOD was created to

 a) avoid non-reentrant librpc calls.
 b) support 'asynchronous' operation over a single socket.

RPCIOD is a daemon thread handling 'transaction objects' (XACTs) through a
UDP socket. XACTs are marshalled RPC calls/replies associated with RPC
servers and requestor threads.

   requestor thread:              network:

        XACT                       packet
          |                          |
          V                          V
    message queue                ( socket )
          |                        |   ^
           \                      /    |
            '------> RPCIOD <----'     |
                        |              |
                        '--------------'
                    timeout / (re)transmission

A requestor thread drops a transaction into the message queue and goes to
sleep. The XACT is picked up by RPCIOD, which listens for events from
three sources:

 o the request queue
 o packet arrival at the socket
 o timeouts

RPCIOD sends the XACT to its destination server and enqueues the pending
XACT into an ordered list of outstanding transactions.

When a packet arrives, RPCIOD looks up the matching XACT (based on the RPC
transaction ID) and wakes up the requestor, who can then XDR-decode the
RPC results found in the XACT object's buffer.

When a timeout expires, RPCIOD examines the outstanding XACT that is
responsible for the timeout. If its lifetime has not expired yet, RPCIOD
resends the request. Otherwise, the XACT's error status is set and the
requestor is woken up. RPCIOD dynamically adjusts the retransmission
intervals based on the average round-trip time measured (on a per-server
basis).

Having the requestors event driven (rather than blocking e.g. on a
semaphore) is geared to having many different requestors (one
synchronization object per requestor would be needed otherwise).
Requestors who want to do asynchronous IO need a different interface,
which will be added in the future.

1.a) Reentrancy
- - - - - - - -

RPCIOD makes no non-reentrant librpc calls.

1.b) Efficiency
- - - - - - - -

We shouldn't bother about efficiency until pipelining (read-ahead/delayed
write) and caching are implemented. The round-trip delay associated with
every single RPC transaction clearly is a big performance killer.
Nevertheless, I could not resist the temptation to eliminate the extra
copy step involved with socket IO: a user data object has to be XDR
encoded into a buffer; the buffer is then given to the socket, where it is
copied into MBUFs (the network chip driver might even do more copying).
Likewise, on reception, 'recvfrom' copies MBUFs into a user buffer, which
is then XDR decoded into the final user data object.

Eliminating the copying into (possibly multiple) MBUFs by 'sendto()' is
actually a piece of cake: RPCIOD uses the 'sosend()' routine [properly
wrapped], supplying a single MBUF header which directly points to the
marshalled buffer :-)

Getting rid of the extra copy on reception was (only a little) harder: I
derived an 'XDR-mbuf' stream from SUN's xdr_mem which allows for
XDR-decoding directly out of an MBUF chain obtained by soreceive().

2) NFS
- - - -

The actual NFS implementation is straightforward and essentially 'passive'
(no threads created). Any RTEMS task executing a filesystem call
dispatched to NFS (such as 'opendir()', 'lseek()' or 'unlink()') ends up
XDR encoding the arguments, dropping an XACT into RPCIOD's message queue
and going to sleep. When woken up by RPCIOD, the XACT is decoded (using
the XDR-mbuf stream mentioned above) and the properly cooked-up results
are returned.
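No NFS-specific API is involved on the caller's side; for instance, a task
can simply do the following (the directory name below is an example only):

    #include <stdio.h>
    #include <dirent.h>

    /* Ordinary POSIX calls on a path below an NFS mount point are
     * dispatched to the NFS client and end up as synchronous RPCs
     * handled by RPCIOD. "/mnt/remote" is an example mount point.
     */
    static void list_nfs_dir(void)
    {
        DIR           *d;
        struct dirent *e;

        if ( (d = opendir("/mnt/remote")) ) {
            while ( (e = readdir(d)) )
                printf("%s\n", e->d_name);
            closedir(d);
        }
    }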
3) RTEMS Resources Used By NFS/RPCIOD
- - - - - - - - - - - - - - - - - - -

The RPCIOD/NFS package uses the following resources. Some parameters are
compile-time configurable - consult the source files for details.

RPCIOD:
 o 1 task
 o 1 message queue
 o 1 socket/file descriptor
 o 2 semaphores (a third one is temporarily created during
   rpcUdpCleanup()).
 o 1 RTEMS EVENT (by default RTEMS_EVENT_30). IMPORTANT: this event is
   used by _every_ thread executing NFS system calls and hence is
   RESERVED.
 o 3 events used only by RPCIOD itself, i.e. these must not be sent to
   RPCIOD by any other thread (except for the intended use, of course).
   The events involved are 1, 2 and 3.
 o preemption-disabled sections: NONE
 o sections with interrupts disabled: NONE
 o NO 'timers' are used (timer code would run in IRQ context)
 o memory usage: n/a

NFS:
 o 2 message queues
 o 2 semaphores
 o 1 semaphore per mounted NFS
 o 1 slot in the driver entry table (for the major number)
 o preemption-disabled sections: NONE
 o sections with interrupts disabled: NONE
 o 1 task + 1 semaphore temporarily created when listing mounted
   filesystems (rtems_filesystem_resolve_location())

4) CAVEATS & BUGS
- - - - - - - - -

Unfortunately, some bugs crawl around in the filesystem generics. (Some of
them might already be fixed in versions later than rtems-ss-20020301.) I
recommend using the patch distributed with RTEMS-NFS.

 o RTEMS uses/used (Joel said it has been fixed already) a 'short' ino_t,
   which is not enough for NFS. The driver detects this problem and
   enables a workaround. In rare situations (mainly involving 'getcwd()')
   improper inode comparison may result (due to the restricted size,
   stat() returns st_ino modulo 2^16). In most cases, however, st_dev is
   compared along with st_ino, which will give correct results (different
   files may yield identical st_ino but they will have different st_dev).
   However, there is code (in getcwd(), for example) which assumes that
   files residing in one directory must be hosted by the same device and
   hence omits the st_dev comparison. In such a case, the workaround will
   fail.

   NOTE: changing the size (sys/types.h) of ino_t from 'short' to 'long'
   is strongly recommended. It is NOT included in the patch, however, as
   this is a major change requiring ALL of your sources to be recompiled.
   THE ino_t SIZE IS FIXED IN GCC-3.2/NEWLIB-1.10.0-2 DISTRIBUTED BY OAR.

 o You may work around most filesystem bugs by observing the following
   rules:

   * never use chroot() (fixed by the patch)
   * never use getpwent(), getgrent() & friends - they are NOT THREAD
     SAFE (fixed by the patch)
   * NEVER use rtems_libio_share_private_env() - not even with the patch
     applied. Just DON'T - it is broken by design.
   * All threads who have their own user environment (i.e. who have
     called rtems_libio_set_private_env()) SHOULD 'chdir("/")' before
     terminating (see the sketch at the end of this section). Otherwise,
     i.e. if their cwd is on NFS, it will be impossible to unmount the
     NFS involved.

 o The patch slightly changes the semantics of 'getpwent()', 'getgrent()'
   & friends (to what is IMHO correct anyways - the patch is also needed
   to fix another problem, however): with the patch applied, the passwd
   and group files are always accessed from the 'current' user
   environment, i.e. a thread that has changed its 'root' or 'uid' might
   not be able to access these files anymore.

 o NOTE: RTEMS 'mount()' / 'unmount()' are NOT THREAD SAFE.

 o The NFS protocol has no 'append' or 'seek_end' primitive. The client
   must query the current file size (this client uses cached info) and
   adjust the local file pointer accordingly (in 'O_APPEND' mode).
   Obviously, this involves a race condition, and hence multiple clients
   writing the same file may lead to corruption.
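As mentioned above, a task using a private user environment with its
working directory on NFS might follow the pattern sketched below (the task
body and directory name are invented, and the exact header declaring
rtems_libio_set_private_env() may differ between RTEMS versions):

    #include <unistd.h>
    #include <rtems.h>
    #include <rtems/libio.h>

    /* Sketch: a task with its own user environment that works below an
     * NFS mount point and releases its cwd before terminating, so that
     * the NFS can still be unmounted afterwards.
     */
    rtems_task worker(rtems_task_argument arg)
    {
        (void)arg;

        rtems_libio_set_private_env();   /* task gets its own cwd/root  */
        chdir("/mnt/remote/work");       /* example directory on NFS    */

        /* ... do the actual work ... */

        chdir("/");                      /* leave NFS so it can be
                                            unmounted later             */
        rtems_task_delete(RTEMS_SELF);
    }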
IV Licensing & Disclaimers
--------------------------

NFS is distributed under the SLAC License - consult the separate 'LICENSE'
file.

Government disclaimer of liability
- - - - - - - - - - - - - - - - -

Neither the United States nor the United States Department of Energy, nor
any of their employees, makes any warranty, express or implied, or assumes
any legal liability or responsibility for the accuracy, completeness, or
usefulness of any data, apparatus, product, or process disclosed, or
represents that its use would not infringe privately owned rights.

Stanford disclaimer of liability
- - - - - - - - - - - - - - - -

Stanford University makes no representations or warranties, express or
implied, nor assumes any liability for the use of this software.

Maintenance of notice
- - - - - - - - - - -

In the interest of clarity regarding the origin and status of this
software, Stanford University requests that any recipient of it maintain
this notice affixed to any distribution by the recipient that contains a
copy or derivative of this software.