libFFIO(3)							    libFFIO(3)



NAME
       libFFIO - Describes the usage of the libFFIO layers

DESCRIPTION
       The libFFIO (Flexible File I/O) system supports two layers, eie and
       event, through which user I/O data can be passed.  The layers are
       invoked by specifying their options and numerics in the FF_IO_OPTS
       environment variable and overloading glibc with the libFFIO.so
       library via the Linux LD_PRELOAD mechanism (see ld.so(8)).  The
       executable need not be modified, recompiled, or relinked, as long as
       it is dynamically linked with the libc library to allow the overload.


   Environment Variables
       There are three environment variables that affect the  actions  of  the
       libFFIO library:

       1)  FF_IO_LOGFILE.   The libFFIO library interprets $FF_IO_LOGFILE as a
       file to be opened as the destination for FFIO diagnostics.  The
       eie.diag and event.summary output, if requested, is sent to
       $FF_IO_LOGFILE.  There are two special values of FF_IO_LOGFILE, stderr
       and stdout.  If FF_IO_LOGFILE is set to either of these values, the
       libFFIO library uses the respective standard I/O stream for output
       rather than a file named stderr or stdout.  By default, libFFIO
       overwrites the log file if it exists.  To append to the log file
       instead, prefix the log file name with a "+".

       example: setenv FF_IO_LOGFILE +/usr/tmp/ffio.log

       2)  FF_IO_OPEN_DIAGS.   If FF_IO_OPEN_DIAGS is set to any value, and
       FF_IO_LOGFILE is also set, diagnostic messages concerning template
       name matching and layer invocations are written to $FF_IO_LOGFILE.
       Setting this variable may produce a lot of output in the log file, but
       it provides a convenient check of all files used by a user program
       that can be trapped by the libFFIO library.

       3) FF_IO_OPTS.  This is a critical environment variable for the libFFIO
       library.	 It  is in FF_IO_OPTS that the user indicates which layers are
       to be invoked for selected files. The format of FF_IO_OPTS  is  a  file
       name template followed by layer specifications enclosed in parentheses.

       " *fct ( event.summary | eie.mem | system ) "
         |__| |__________________________________|
          |                    |
          |                    |__ layer specification string
          |
          |_____ filename template

       As each file is opened by ffopen, ffopen attempts to match the incoming
       file name with the templates supplied in FF_IO_OPTS, reading left to
       right.  Upon finding a match, ffopen invokes the layers that are
       specified between the next pair of parentheses.  More than one template
       may be used for each layer specification string, and there may be more
       than one pair of template strings/layer specifications.  A more general
       FF_IO_OPTS follows:

       " *fct *scr* (event)  fort.1* fort.2* ( eie | event ) "
         |________| |_____|  |_____________| |_____________|
             |         |            |               |__ layer spec string #2
             |         |            |__ file template #2
             |         |
             |         |____ layer spec string #1
             |_____ file template #1
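
       The rules above for FF_IO_LOGFILE and FF_IO_OPTS can be sketched in
       Python as follows.  This is an illustration of the documented
       behaviour only, not libFFIO code: the function names resolve_logfile
       and parse_ff_io_opts are hypothetical, and accepting both "|" and ","
       between layers is an assumption based on the two separators that
       appear in the examples on this page.

```python
import re


def resolve_logfile(value):
    """Classify an FF_IO_LOGFILE value as (kind, name, open_mode)."""
    # The two special values select a standard stream, not a file.
    if value in ("stderr", "stdout"):
        return ("stream", value, None)
    # A leading "+" requests append instead of the default overwrite.
    if value.startswith("+"):
        return ("file", value[1:], "a")
    return ("file", value, "w")


def parse_ff_io_opts(opts):
    """Split an FF_IO_OPTS string into (templates, layers) pairs."""
    pairs = []
    # Each group is: whitespace-separated templates, then "( layers )".
    for templates, layers in re.findall(r"([^()]*)\(([^()]*)\)", opts):
        names = templates.split()
        # Assumption: layers are separated by "|" or ",".
        specs = [s.strip() for s in re.split(r"[|,]", layers) if s.strip()]
        pairs.append((names, specs))
    return pairs


print(resolve_logfile("+/usr/tmp/ffio.log"))
for names, specs in parse_ff_io_opts(
        "*fct *scr* (event)  fort.1* fort.2* ( eie | event )"):
    print(names, "->", specs)
```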


   Template Matching
       When FF_IO_OPEN_DIAGS and FF_IO_LOGFILE are both set, a diagnostic line
       similar to the following will be written	 to  $FF_IO_LOGFILE  for  each
       open of a file.

       ffopen(/usr/tmp/probname.fct) ft:0 gt:0

       The  string  between  the  parentheses is the path and file name of the
       file to be opened. The file tag of the file follows the	ft:,  and  the
       group tag follows the gt:.

       File templates are matched against the file name only; the path is
       ignored.  Therefore, for the above ffopen, templates such as the
       following would result in a match:
                prob*
                *fct
       Note that template matching is case sensitive.
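
       A minimal Python sketch of the matching rule, for illustration only.
       It assumes that templates behave like shell wildcards as implemented
       by Python's fnmatch module; this page only shows "*", so support for
       "?" and "[...]" in libFFIO is not confirmed.  The names
       template_matches and select_layers are hypothetical.

```python
import fnmatch
import os.path


def template_matches(template, path):
    """Return True if the file name part of path matches template."""
    # Only the file name takes part in matching; the directory is ignored.
    name = os.path.basename(path)
    # fnmatchcase compares case sensitively on every platform.
    return fnmatch.fnmatchcase(name, template)


def select_layers(pairs, path):
    """Return the layers of the first (templates, layers) pair that matches."""
    # Pairs and templates are tried left to right; the first match wins.
    for templates, layers in pairs:
        if any(template_matches(t, path) for t in templates):
            return layers
    return None


pairs = [(["*fct", "*scr*"], ["event"]),
         (["fort.1*", "fort.2*"], ["eie", "event"])]
print(select_layers(pairs, "/usr/tmp/probname.fct"))
print(template_matches("PROB*", "/usr/tmp/probname.fct"))
```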


   MPI Rank Dependent Setting
       The user can set a private FF_IO_OPTS_RANK#, where # is the MPI rank
       number, for each of the processes generated by the SGI MPT (Message
       Passing Toolkit, ia64 only), HP MPI (x86_64 only), or LAM/MPI
       library.  This requires the SGI_MPI, HP_MPI, or LAM_MPI environment
       variable to be set to the directory containing the libmpi.so library,
       which needs to contain mpi_comm_rank_.  For SGI MPT, /usr/lib is the
       directory that contains libmpi.so.  The rank-dependent setting does
       not work, and may even cause a core dump, with an executable that is
       not linked with one of the above MPI libraries.  If none of SGI_MPI,
       HP_MPI, or LAM_MPI is set, the value of the FF_IO_OPTS environment
       variable is applied to all MPI ranks.

       Example:
	      setenv SGI_MPI /usr/lib		      # ia64 only
		or
	      setenv LAM_MPI /usr/local/lam-7.0.4/lib

	      setenv FF_IO_OPTS_RANK0 ...
	      setenv FF_IO_OPTS_RANK1 ...
	      ...
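
       The selection between FF_IO_OPTS and FF_IO_OPTS_RANK# can be
       pictured with the following Python sketch.  It is illustrative only;
       opts_for_rank is a hypothetical name, and falling back to FF_IO_OPTS
       for a rank that has no FF_IO_OPTS_RANK# of its own is an assumption
       that this page does not confirm.

```python
def opts_for_rank(env, rank):
    """Pick the layer options that would apply to one MPI rank."""
    mpi_vars = ("SGI_MPI", "HP_MPI", "LAM_MPI")
    # Rank-specific variables are honoured only when one of the MPI
    # library variables is set.
    if any(name in env for name in mpi_vars):
        # Assumption: a rank without its own variable uses FF_IO_OPTS.
        return env.get("FF_IO_OPTS_RANK%d" % rank, env.get("FF_IO_OPTS"))
    # Otherwise FF_IO_OPTS applies to all ranks.
    return env.get("FF_IO_OPTS")


env = {
    "LAM_MPI": "/usr/local/lam-7.0.4/lib",
    "FF_IO_OPTS": "*fct (event)",
    "FF_IO_OPTS_RANK0": "*fct (eie | event)",
}
for rank in range(2):
    print(rank, opts_for_rank(env, rank))
```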


   Layer Descriptions
       The libFFIO layers are controlled by options and numeric values
       specified in the layer specification.

            Options:
            o indicate to the layer the selection of a feature.
            o are delimited by a period (.).
            o are order independent.
            o may be part of mutually exclusive sets.  Selecting an option may
              deselect other options.  In the layer descriptions that follow,
              options that are mutually exclusive are listed together with the
              default value underlined.

            Numerics:
            o pass an integer value to the layer.
            o are delimited by a colon (:).
            o are order dependent.
            o may be specified in:
                hexadecimal (leading 0x)
                decimal
              example:
              eie.diag:92:0x20

              The first numeric to the eie layer is 92 (decimal), the page
              size in blocks.  The second numeric to the eie layer is 32
              (decimal), the number of pages.
            o may be specified in an alternate format that eliminates the
              order dependency by referring to the numeric value by its
              keyword.

	      example:
	      eie.diag.page_size=92.num_page=20

	      The numeric with the keyword page_size is set to 92.
	      The numeric with the keyword num_page is set to 20.
            o are updated as the layer specification string is processed left
              to right, taking on the value of the rightmost specification.

	      example:
	      eie.page_size=200.page_size=200.page_size=100
	      will result in a page_size of 100

              Basic math can be performed on numeric values.  The numeric
              string may include the operators +, -, and *.

	      examples:
	      All of the following result in a page_size of 200 blocks.

		       eie.page_size=200
		       eie.page_size=100+75+25
		       eie.page_size=100*2
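
       The option and numeric rules above can be sketched in Python as
       follows.  This is an illustration, not libFFIO's parser.  The names
       eval_numeric and parse_layer are hypothetical, and evaluating the
       operators strictly left to right is an assumption; this page does
       not state an operator precedence, and all of its examples give the
       same result either way.

```python
import re


def eval_numeric(text):
    """Evaluate a numeric such as "0x20" or "100+75+25"."""
    tokens = re.findall(r"0[xX][0-9a-fA-F]+|[0-9]+|[-+*]", text)
    # Expect number, operator, number, ... with nothing left over.
    if len(tokens) % 2 == 0 or "".join(tokens) != text:
        raise ValueError("bad numeric: %r" % text)

    def number(tok):
        # A leading 0x marks hexadecimal; everything else is decimal.
        return int(tok, 16) if tok.lower().startswith("0x") else int(tok, 10)

    value = number(tokens[0])
    # Assumption: operators are applied left to right.
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        n = number(operand)
        if op == "+":
            value += n
        elif op == "-":
            value -= n
        else:
            value *= n
    return value


def parse_layer(spec, keywords):
    """Parse e.g. "eie.diag:92:0x20" into (name, options, numerics)."""
    head, *positional = spec.split(":")
    name, *fields = head.split(".")
    options, numerics = [], {}
    for field in fields:
        if "=" in field:
            key, text = field.split("=", 1)
            # A later assignment overwrites an earlier one.
            numerics[key] = eval_numeric(text)
        else:
            options.append(field)
    # Positional numerics map onto the layer's keywords in order.
    for key, text in zip(keywords, positional):
        numerics[key] = eval_numeric(text)
    return name, options, numerics


print(parse_layer("eie.diag:92:0x20", ["page_size", "num_page"]))
print(parse_layer("eie.page_size=200.page_size=100", []))
print(eval_numeric("100+75+25"), eval_numeric("100*2"))
```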

       In the following layer descriptions, the numerics are listed in their
       positional order, each with its associated keyword.

   EIE Layer
       The eie (Enhanced Intelligence Engineering) layer performs two main
       functions.  The first is caching of user I/O requests in user memory.
       The primary benefit of this caching is that smaller program I/O
       requests are buffered into full cache page requests to the kernel.
       This can dramatically cut down on the number of logical I/O requests,
       reducing both system CPU time and I/O wait time.  The second function
       of the eie layer is to detect sequential access of the file by the
       program.  Once sequential access is detected, the cache
       asynchronously preloads cache pages with file data in the direction
       of the detected access.

       Numerics:
              1.page_size
                The number of blocks per cache page.  If the requested
                page_size is not a multiple of the sector size of the
                underlying file system, page_size is automatically rounded up
                to the next multiple of the sector size.  If the cache is a
                shared cache, the page_size is rounded up to a multiple of 4
                blocks.  For direct IO, the valid page size is between 64KB
                and 256MB in this libFFIO distribution.
              2.num_page
                The number of pages in the cache.  If num_page is entered as a
                negative value, it is interpreted as the total size of the
                cache, and the number of pages is calculated by dividing the
                absolute value of num_page by page_size.
              3.max_lead
                The maximum number of pages the cache is to asynchronously
                read ahead of the program requests once sequential access has
                been detected.
              4.share
                Indicates a private or shared cache.  A private cache,
                share=0, is used only by the file requesting the cache.  A
                shared cache allows up to 255 files to share the cache pages
                between the files.  The first file to request the cache
                defines the page_size, num_page, and cache residency.  All
                other files using the cache use it as defined by the first
                file.  By default, if the file usage count drops to zero, the
                memory space allocated to the cache is returned to the
                system.  If the shared cache is opened again, the space must
                be reallocated.  To prevent this, use the norls option, which
                tells the shared cache not to release its space even if the
                file usage count goes to zero.
              5.stride
                By default, the eie cache attempts to detect sequential access
                of cache pages with a stride of 1 page.  The user may select
                an alternate stride to be detected via the stride numeric.
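
       The page_size and num_page adjustments described above amount to
       the following arithmetic, shown here as a Python sketch.  The name
       cache_pages is hypothetical.  The sketch assumes that a negative
       num_page is expressed in the same units as page_size (blocks), that
       the division truncates, and that the shared-cache rounding is
       applied before the division; none of these details is stated on
       this page.  Rounding to the sector size is left out.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of multiple."""
    return -(-value // multiple) * multiple


def cache_pages(page_size, num_page, shared=False):
    """Return (page_size, number_of_pages) after adjustment."""
    if shared:
        # A shared cache rounds page_size up to a multiple of 4 blocks.
        page_size = round_up(page_size, 4)
    if num_page < 0:
        # A negative num_page is the total cache size, not a page count.
        # Assumption: the division truncates.
        num_page = abs(num_page) // page_size
    return page_size, num_page


print(cache_pages(4096, 128))
print(cache_pages(4096, -524288))
print(cache_pages(93, 20, shared=True))
```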

       Options:
              nodirect, direct
                Between the FFIO cache and the file system, the eie layer can
                do either direct (direct) or buffered (nodirect) IO via the
                Linux kernel.  Note that the user IO buffer is always copied
                to/from the FFIO cache.  Therefore, the eie direct option
                does not bypass the FFIO cache, and a bcopy is always done in
                a job that uses FFIO.
              nodiag, diag
                The cache is to report cache usage statistics (diag) or not
                to report statistics (nodiag).  The eie_close lines in the
                Example section show the format of the report.
              bytes, mbytes, gbytes, words, mwords, gwords, blocks
                Selects which units are to be used for reporting data
                transferred values to $FF_IO_LOGFILE.  For units other than
                bytes, the quantities are always rounded up to the next
                multiple of the units.
              wb, nowb, hldwb
                The eie cache has logic to asynchronously write out dirty
                cache pages.  This logic is on by default.  It may be
                completely disabled with nowb.  The write-behind logic may be
                delayed until the cache runs out of unused pages by
                specifying hldwb (hold write-behind).
              save, scr
                The save option indicates that the file is to be valid upon
                closing.  The scr option indicates that the file need not be
                valid upon closing, allowing the cache to skip some flushing
                of data.  Note that the Linux IO buffer cache has the
                advantage of flushing the file asynchronously after it is
                closed.  The eie cache has to flush the file synchronously at
                closing time.
              rls, norls
                These options apply only to a shared cache.  The default
                option, rls, has the cache release the memory space that the
                cache pages reside in when the file count (the number of
                files currently opened and using the shared cache) goes to
                zero.  The norls option has the cache remain allocated even
                when the file count goes to zero.
              nobpons, bpons (bypass on no space)
                If the eie cache is unable to allocate sufficient space for
                the cache pages, the bpons option allows the open to continue
                without the cache.  By default, nobpons, the open fails.


   EVENT Layer
       The event layer monitors the I/O occurring between layers, much as
       strace monitors I/O events between the program and the kernel.  The
       event layer can be requested to gather total statistics.

       Options:
              notrace
                This option is required to turn off the yet-to-be-implemented
                event tracer in libFFIO.
              rtc, cpc
                Indicates which type of clock to use for the times reported
                by the event layer.  rtc uses the real-time clock.  cpc uses
                the CPU clock.
              summary
                Information on event types that occur at least once is
                reported.
              bytes, mbytes, gbytes, words, mwords, gwords, blocks
                Selects which units are to be used for reporting data
                transferred values to $FF_IO_LOGFILE.  For units other than
                bytes, the quantities are always rounded up to the next
                multiple of the units.
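
       The rounding rule for the reporting units can be illustrated with a
       short Python sketch.  The name in_units is hypothetical.  The sketch
       assumes that mbytes and gbytes are binary multiples (2**20 and 2**30
       bytes), which is consistent with the Example section of this page,
       where a file of 2147483664 bytes is reported as 2049 mbytes.  The
       word and block sizes are not given on this page, so those units are
       left out.

```python
# Assumption: mbytes and gbytes are binary multiples of a byte.
UNIT_BYTES = {"bytes": 1, "mbytes": 2**20, "gbytes": 2**30}


def in_units(nbytes, unit):
    """Convert a byte count to unit, rounding up to a whole unit."""
    size = UNIT_BYTES[unit]
    # Ceiling division: a partial unit counts as a whole one.
    return -(-nbytes // size)


print(in_units(2147483664, "bytes"))
print(in_units(2147483664, "mbytes"))
print(in_units(2147483664, "gbytes"))
```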


   System Calls Intercepted by libFFIO
       The following functions in glibc are intercepted by libFFIO and
       replaced with the FFIO equivalent: open, fopen, open64, __open,
       __open64, read, fread, pread, pread64, fgets, write, fwrite, pwrite,
       pwrite64, fputs, fputc, lseek, fseek, lseek64, close, fclose, fflush,
       mkstemp, mkstemp64, ftruncate, ftruncate64, fprintf, fscanf.  Glibc
       calls that operate on files in /proc, /etc, and /dev, or on stdin,
       stdout, and stderr, are not intercepted.


   LibFFIO Compatible Runtimes
       The libFFIO.so library has been tested against the Intel compiler (v8.0
       and later) runtime libraries for C/C++ and Fortran.


   Exceptions
       This FFIO implementation of pread/pwrite does not conform to the pread
       and pwrite conventions: the file pointer offset changes with the
       pread/pwrite call instead of staying constant.  The user should also
       avoid doing IO on a socket with FFIO, as the program may be caught in
       an infinite loop.
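
       A program that relies on the file pointer staying put across a
       positioned read could save and restore the offset itself.  The
       following Python sketch shows the pattern; the same idea applies in
       C with lseek(2) and pread(2).  The name pread_keep_offset is
       hypothetical, and the sketch has not been tested against libFFIO.

```python
import os
import tempfile


def pread_keep_offset(fd, length, offset):
    """Positioned read that leaves the file pointer where it was."""
    # Remember where the file pointer is before the positioned read.
    saved = os.lseek(fd, 0, os.SEEK_CUR)
    try:
        return os.pread(fd, length, offset)
    finally:
        # Put the pointer back in case the call moved it.
        os.lseek(fd, saved, os.SEEK_SET)


# Small demonstration on an unbuffered temporary file.
with tempfile.TemporaryFile(buffering=0) as f:
    f.write(b"0123456789")
    fd = f.fileno()
    os.lseek(fd, 3, os.SEEK_SET)
    data = pread_keep_offset(fd, 4, 5)
    pos = os.lseek(fd, 0, os.SEEK_CUR)

print(data, pos)
```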


   Example
       The following simple Fortran IO program demonstrates how libFFIO works.
       The event.summary and the eie.diag output are printed to the screen  by
       default.

	      PROGRAM writeit ! Write out 2 1-Gbyte records to a Fortran
	      implicit none   ! unformatted, sequential access file.
	      integer iter
	      parameter (iter=2)
	      integer n,nx,ny,nz
	      parameter (n=1,nx=1024,ny=1024,nz=256)
	      integer w
	      dimension w(n,nx,ny,nz)
	      character*20 fname
	      integer i,j,k,l, error
	      integer wiostat
	      integer*8 nbytes,four
	      wiostat = -1
	      four=4
	      DO k=1,nz
		DO j=1,ny
		  DO i=1,nx
		    DO l=1,n
		      w(l,i,j,k)=(i-1)*ny*nz+(j-1)*nz+k
		    ENDDO
		  ENDDO
		ENDDO
	      ENDDO
	      OPEN(20,FILE='scratch',FORM='unformatted',STATUS='unknown')
	      DO i=1,iter
		WRITE(20,IOSTAT=wiostat) w
	      ENDDO
	      close(20)
	      WRITE(*,*) wiostat
	      END

	# ifort -O3 ./example.f
	# setenv LD_PRELOAD /usr/lib/libFFIO.so
        # setenv FF_IO_OPTS 'scr*(eie.direct.diag.mbytes:4096:128:6:1:1:0,event.summary.mbytes.notrace)'
	# ./a.out

        event_close(/tmp/scratch)      eie <-->syscall   (   2049 mbytes)/(   5.84 s)=  350.62 mbytes/s

	   oflags=0x0000000000004042=RDWR+CREAT+DIRECT
	   sector size =4096(bytes)
	   cblks =0  cbits =0x0000000000000000
	   current file size =2049 mbytes   high water file size =2049 mbytes

           function        times    wall     all     mbytes     mbytes      min      max      avg     ill
                          called    time  hidden  requested  delivered  request  request  request  formed
           open                1    0.00
           seek                4    0.00
           writea            131    0.01       0       2049       2049        1       16       16       0
           fcntl
             recall
               writea        131    5.82
             other             5    0.00
           flush               1    0.00
           weod                1    0.01       1
           close               1    0.00
           extends           130

        eie_close EIE final stats for file /tmp/scratch
        eie_close  Used shared eie cache 1
        eie_close  128 mem pages of 4096 blocks (4096 sectors), max_lead = 6 pages
        eie_close  advance reads used/started :      127/     139   91.37%   (  0.00 seconds wasted)
        eie_close  write hits/total           :     4228/    4230   99.95%
        eie_close  read  hits/total           :        0/       0    0.00%
        eie_close  mbytes transferred    parent --> eie --> child       sync      async
        eie_close                             2049        2049             0        131
        eie_close                                0           0             0          0 (0,0)
        eie_close                        parent <-- eie <-- child

        eie_close EIE stats for Shared cache 1
        eie_close  128 mem pages of 4096 blocks
        eie_close  advance reads used/started :      127/     139   91.37%   (  0.00 seconds wasted)
        eie_close  write hits/total           :     4228/    4230   99.95%
        eie_close  read  hits/total           :        0/       0    0.00%
        eie_close  mbytes transferred    parent --> eie --> child       sync      async
        eie_close                             2049        2049             0        131
        eie_close                                0           0             0          0 (0,0)
        eie_close                        parent <-- eie <-- child
		   0
	# ls -l scratch
	-rw-r--r--  1 root sys 2147483664 2006-05-18 20:30 scratch


FILES
       /usr/lib/libFFIO.so (ia64 only), /usr/lib64/libFFIO.so (x86_64 only)


SEE ALSO
       ld.so(8), ffopen(3C)



				  June, 2006			    libFFIO(3)