Troubleshooting: Tips and Tricks for using the LRZ HPC Systems
This document is a collection of hints which may help you to solve problems encountered when running programs on LRZ's HPC systems.
Table of contents
- Common Problems with the Compilers
- Why does my IA32 executable fail when static arrays are very large?
- Code fails to link ("Relocation truncated to fit")
- Using large temporary arrays in subroutines fails
- icc and icpc fail to compile my (assembler) code
- My program stops in I/O when writing or reading large files
- I've got a lot of Fortran-unformatted Files from VPP, Sun, IBM, ... Can I use those?
- Reading unformatted direct access files generated on other HPC platforms
- Maximum record length for unformatted direct access I/O for Intel ifort
- Compiler does not optimize as specified
- Gradual underflow optimization: -ftz compiler option may improve performance
- Why does the Intel compiler complain about missing GETARG (or other extension subroutine)?
- When starting my binary, it complains about missing libpgc.so (IA32/Opteron/EM64T)
- On IA64, running my binary produces "unaligned access" or "assist fault"
- Intel C, C++ or Fortran compilers: Linkage fails
- When starting my binary, it complains about missing symbol (e.g., 0_memcopyA)
- How to get an error traceback for your code
- Common Problems using OpenMP
- Common Problems using MPI
- Hybrid MPI and OpenMP/threaded programs
- Problems with parallel tracing (Intel Tracing Tools / Vampirtrace)
- Problems with using existing binaries
- Issues with batch queuing and batch jobs
- Jobs using shared memory parallelism (SGE)
- Names of Job Scripts (SGE)
- Batch Scripts in DOS/Unicode format / Unprintable Characters in Scripts (SGE and PBS)
- Submission: qsub with binary file as argument Refused (SGE)
- Can't open output file (SGE)
- My jobs fail with strange error messages involving I/O (SGE and PBS)
- I've deleted my job, but it is still listed in qstat (SGE and PBS)
- Common Problems with Scripting Languages (perl, python, R, ruby)
- Installing your own program packages on supermuc
Common Problems with the Compilers
This section covers the Intel, PGI, Pathscale and GNU compilers, as indicated in each subsection.
Why does my IA32 executable fail when static arrays are very large?
Note: This is becoming (mostly) obsolete upon migrating to 64 bit.
With more than about 1 GByte of static arrays, a program may fail with a segmentation fault at startup. On IA32 it should, however, be possible to use data up to the process memory limit of 2 GByte. At present this is a problem only with the g95 compiler on 32 bit systems, but workarounds for other compilers remain documented below.
Workaround:
-
Use static linkage: This is the recommended workaround. The static linkage flag is -Bstatic for the PGI compiler, and -static for the GNU and Intel compilers.
-
Use dynamic allocation: Replace
    DOUBLE PRECISION A(150000000)
by
    DOUBLE PRECISION, ALLOCATABLE :: A(:)
    INTEGER :: ISTAT
    ... (further declarations)
    ALLOCATE(A(150000000), STAT=ISTAT)
    IF (ISTAT /= 0) THEN
      STOP 'ALLOCATION FAILURE'
    END IF
    ... (further code, until A not used any more)
    DEALLOCATE(A, STAT=ISTAT)
    IF (ISTAT /= 0) THEN
      STOP 'DEALLOCATION FAILURE'
    END IF
This is the right way to do things in Fortran 90, at least for newly developed code; you can now adjust A to the size really needed.
-
Intel Compiler: Add a common statement to your array declaration as shown here
    DOUBLE PRECISION A(150000000)
    COMMON /COM_INTEL/ A
(more than one array may of course appear in the same common block) and specify the -Qdyncom"COM_INTEL" option to the Intel Compiler. Please note that this will not work together with the -g debug option. This is a known bug.
-
pgf77 Compiler: Add a dynamic common statement and allocate that
    DOUBLE PRECISION A(150000000)
    COMMON, ALLOCATABLE /COM_PGI/ A
    ... (further declarations)
    ALLOCATE(/COM_PGI/, STAT=ISTAT)
    IF (ISTAT .NE. 0) THEN
      STOP 'ALLOCATION FAILURE'
    END IF
    ... (etc. as above)
This uses a Fortran extension available only for pgf77, not for pgf90.
-
g77 Compiler: A compilation problem has been fixed since release 3.1 of gcc.
Code fails to link ("Relocation truncated to fit")
This may happen on x86_64 based systems if your data segment becomes larger than 2 GBytes. Please use the compiler options -mcmodel=medium -shared-intel to build and link your code. The -fpic option should be avoided.
Using large temporary arrays in subroutines fails
Since the 8.x releases of the Intel compiler, using large automatic arrays as in
    subroutine foo(n, u)
      integer :: n
      real(rk) :: u(n)
      ...
      real(rk) :: temp(n)
      ...
    end subroutine
leads to segmentation faults and/or signal 11 crashes of the generated executables. The reason is that automatic arrays are placed on the stack, and the stack limit may be too low.
Workarounds:
-
Use the -heap-arrays compiler switch to move allocation to the heap. You can also specify a size modifier if only arrays above a certain size should be treated this way.
-
Increase the stack limit via the command ulimit -s unlimited. Note that special measures might be needed for MPI parallel programs to propagate this setting across nodes.
-
Change over to use dynamic allocation:
    subroutine foo(n, u)
      integer :: n
      real(rk) :: u(n)
      ...
      real(rk), allocatable :: temp(:)
      ...
      allocate(temp(n), ...)   ! allocation status query omitted here, please check to be safe
      ...
      deallocate(temp)
    end subroutine
This will use the heap for the required storage.
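The stack limit workaround can be tried out interactively before any code is changed. The following is a minimal sketch, assuming a shell in which ulimit -s is available (bash, for example): it prints the current soft and hard stack limits and raises the soft limit as far as the hard limit permits. Note that ulimit -s unlimited can only succeed if the hard limit is itself unlimited.

```shell
#!/bin/sh
# Sketch: inspect and raise the stack limit for the current shell session.
# Values are reported in kB, or as the word "unlimited".
soft_before=$(ulimit -S -s)
hard=$(ulimit -H -s)
echo "soft stack limit before: $soft_before"
echo "hard stack limit:        $hard"

# Raising the soft limit up to the hard limit is always permitted.
ulimit -S -s "$hard"
soft_after=$(ulimit -S -s)
echo "soft stack limit after:  $soft_after"
```

The new limit only applies to programs started from this shell afterwards.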
icc and icpc fail to compile my (assembler) code
icc and icpc do their best to behave like the GNU compilers. However, they do not support assembler. Please use the -no-gcc compiler switch. This will disable the gcc macros, and hence suppress the assembler statements, which are (usually) shielded by macro invocations.
My program stops in I/O when writing or reading large files
Your file may be larger than 2 GBytes and hence beyond the 32 bit file offsets supported by the traditional open() system call. Linux nowadays (kernel 2.4.x and higher, glibc 2.2.x and higher) does support file sizes larger than 2 GBytes; however, you may need to recompile your program to use this feature.
-
GNU C compiler (gcc): Please recompile all sources containing I/O calls using the preprocessor macro _FILE_OFFSET_BITS=64, i.e.
gcc -c -D_FILE_OFFSET_BITS=64 (... other options) foo.c
See this page for further details (some of which may be outdated).
-
PGI Fortran compiler: Please use the -Mlfs compiler switch when linking.
-
Intel Fortran compiler: Automatically supports large files. However, there are limits on the record sizes.
-
64 Bit Systems: On Itanium and Opteron/EM64T systems in 64 bit mode no problems should occur since large files should be supported by default. Note however that there still may be limits for accessing large files via NFS.
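To find out in advance whether a file is affected, its size can be compared with the largest offset a signed 32 bit integer can hold (2^31 - 1 bytes). The following is a minimal sketch; the helper name exceeds_32bit_offset is an assumption made for illustration, and a small temporary file stands in for real data.

```shell
#!/bin/sh
# Sketch: does a size in bytes exceed the largest signed 32 bit file offset?
# exceeds_32bit_offset is a hypothetical helper name, not a system command.
LIMIT_32BIT=2147483647   # 2^31 - 1 bytes

exceeds_32bit_offset() {
    [ "$1" -gt "$LIMIT_32BIT" ]
}

# Example: check a small temporary file (wc -c reports its size in bytes).
tmpfile=$(mktemp)
printf 'hello' > "$tmpfile"
size=$(wc -c < "$tmpfile" | tr -d ' ')
if exceeds_32bit_offset "$size"; then
    echo "$tmpfile ($size bytes) needs large file support"
else
    echo "$tmpfile ($size bytes) fits within 32 bit offsets"
fi
rm -f "$tmpfile"
```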
I've got a lot of Fortran-unformatted Files from VPP, Sun, IBM, ... Can I use those?
Yes. There are two variants of this situation:
-
Portability of unformatted data. In this situation you want to use both Intel (little endian) and other (big endian) platforms concurrently.
Compiler | Action
PGI | Use the compilation switch -Mbyteswapio
Intel | Set the following environment variable (under sh, ksh, bash) before running your executable: export F_UFMTENDIAN="big" (in this case all unformatted files are operated on in big endian mode)
-
Migration from one platform to the other. Here you need to write a program to convert your data from (or to) big endian once and for all. In the following we shall assume that conversion happens from big endian to little endian, and unit 22 is used to read in the big endian unformatted data.
Compiler | Action
PGI | Use the OPEN statement specifier CONVERT in your source: OPEN(22, FILE='mysundata', FORM='UNFORMATTED', CONVERT='BIG_ENDIAN')
Intel | Set the following environment variable (under sh, ksh, bash) before running your executable: export F_UFMTENDIAN="little;big:22" (this will switch I/O to big endian on unit 22 only)
Please note that you need to perform testing on data files from more exotic big endian platforms because assumptions still are made on IEEE conformance and Fortran record layout.
Generally the Intel Compiler gives you more flexible handling since the functionality is supported by the run time environment and no code recompile is required. You can also specify more than one unit via a list of comma-separated values, or a range of units, e.g. 10-20.
As an illustration, three example programs writedat.f90, readdat.f90 and readdat_pgi2.f90 as well as a Makefile are provided. To run the example, please first run make big on a Sun or SGI to produce an unformatted file data.unf, then run make linux on an Intel Linux box to produce three variants of output (ifc.out, pgi.out and pgi2.out) corresponding to the various possibilities mentioned above.
Please also refer to a section below on how to use this functionality in conjunction with MPI.
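For the Intel run time, a value covering several units can be assembled in the shell before the executable is started. The sketch below keeps little endian as the default and switches only the listed units or unit ranges to big endian; the helper name big_endian_units and the chosen unit numbers are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: build an F_UFMTENDIAN value that leaves the default at little endian
# and switches only the listed units (or unit ranges) to big endian.
# big_endian_units is a hypothetical helper name.
big_endian_units() {
    list=""
    for unit in "$@"; do
        # Join the arguments with commas: 10-20 22 -> 10-20,22
        list="${list:+$list,}$unit"
    done
    printf 'little;big:%s' "$list"
}

F_UFMTENDIAN=$(big_endian_units 10-20 22)
export F_UFMTENDIAN
echo "F_UFMTENDIAN=$F_UFMTENDIAN"
# A Fortran executable built with the Intel compiler and started from this
# shell would now treat units 10 to 20 and unit 22 as big endian.
```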
Reading unformatted direct access files generated on other HPC platforms
While the method described above works fine for unformatted sequential files, care must be taken when reading unformatted direct-access files generated on other platforms. When a direct-access file is opened, the specifiers ...,access='DIRECT', recl=irecl,... are required, with recl giving the record length. The unit in which irecl is measured is implementation dependent: e.g., 4 byte words on Itanium using the Intel compiler, 8 byte words on the VPP. It is therefore good practice to determine irecl before the open call via the inquire statement.
Assume the largest record one wants to write is an array rval, which was declared as
real,dimension(n) :: rval
Then one should add the following line before the open call:
inquire(iolength=irecl) rval
and use irecl in the following open statement. Thus the assigned record length for the direct-access file becomes independent of the implementation.
Maximum record length for unformatted direct access I/O for Intel ifort
Up to compiler release 10.1, Intel's documentation does not provide any information on this. The maximum value is 2 GBytes (2^31 bytes) for each record; note that the storage unit used is 4 bytes unless the switch -assume byterecl is specified, in which case the storage unit is 1 byte.
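The relation between the size of a record in bytes and the value to pass as RECL follows directly from the storage unit. The following sketch works this out for both settings; the helper name recl_for_bytes and the example record of 1000 double precision values are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: RECL value needed for a record of a given size in bytes.
# The storage unit is 4 bytes by default and 1 byte with -assume byterecl.
# recl_for_bytes is a hypothetical helper name.
MAX_RECORD_BYTES=2147483648   # 2^31 bytes, the limit quoted above

recl_for_bytes() {
    bytes=$1
    unit=$2
    if [ "$bytes" -gt "$MAX_RECORD_BYTES" ]; then
        echo "record of $bytes bytes exceeds the 2 GByte limit" >&2
        return 1
    fi
    # Round up to a whole number of storage units.
    echo $(( (bytes + unit - 1) / unit ))
}

# 1000 double precision values occupy 8000 bytes.
recl_default=$(recl_for_bytes 8000 4)
recl_byterecl=$(recl_for_bytes 8000 1)
echo "default storage unit: recl=$recl_default"
echo "-assume byterecl:     recl=$recl_byterecl"
```

In Fortran code itself, the inquire(iolength=...) approach from the previous section avoids this arithmetic altogether.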
Compiler does not optimize as specified
The Intel compiler may occasionally give the complaint "fortcom: Warning: Optimization suppressed due to excessive resource requirements; contact Intel Premier Support". In this case, please try the -override-limits switch. However, this may lead to very long compilation time and/or considerable memory usage. If system resources are overstrained, the compilation may fail anyway. If compilation completes, the generated code may be incorrect. In the latter two cases please send your source file(s) to the LRZ support team.
Gradual underflow optimization: -ftz compiler option may improve performance
This applies to usage of the Intel compilers; the material is drawn from the SGI document "Linux Application Tuning Guide", Chapter 2, "The SGI Compiling Environment".
Many processors do not handle denormalized arithmetic (for gradual underflow) in hardware. The support of gradual underflow is implementation-dependent. Use the -ftz option with the Intel compilers to force the flushing of denormalized results to zero.
Note that frequent gradual underflow arithmetic in a program causes the program to run very slowly, consuming large amounts of system time (this can be determined with the 'time' command). In this case, it is best to trace the source of the underflows and fix the code. Gradual underflow is often a source of reduced accuracy anyway.
Why does the Intel compiler complain about missing GETARG (or other extension subroutine)?
This subsection discusses how to link to the portability library (-Vaxlib, -lPEPCF90, -lifport). The information below is obtained from the Intel Software Development web page.
The Portability Library contains library functions that are not built-in to the Fortran language. These functions are in the user name space. These functions are linked only when there is no global definition in your program that satisfies the global reference. For the 7.x compilers, you must use the /4Yportlib (Windows*) or -Vaxlib (Linux*) options to link in the appropriate libraries.
While not required, inserting a USE statement to interface to these functions is recommended as the easiest way to pick up the correct interface and avoid hard-to-debug errors due to type mismatches. If pollution of the user namespace is a concern, specify the ONLY clause with the USE statement.
If you want to include interfaces for all the routines in your program, you can either, depending on the Intel(R) Fortran Compiler version:
-
include the file iflport.f90 (7.x) or ifport.f90 (8.x) from the INCLUDE directory of your compiler distribution, or
-
add a USE IFLPORT (7.x) or USE IFPORT (8.x) statement to access the INTERFACEs for all portability functions.
Some of the commonly used functions included in this library are: flush, dtime, etime, getarg. A complete list of portability library functions is given in the Intel Fortran Library Reference.
When starting my binary, it complains about missing libpgc.so (IA32/Opteron/EM64T)
This is a PGI compiler issue and hence only relevant for x86 and possibly x86_64 systems. Please load the environment module fortran/pgi/x.y, or ccomp/pgi/x.y using the version number x.y you used for compiling the application. This should correctly set the LD_LIBRARY_PATH variable.
On IA64, running my binary produces "unaligned access" or "assist fault"
The error messages while running are of the form
myprog.exe(251757): unaligned access to 0x600000000610801c, ip=0x400000000076d30
or
myprog.exe(34443): floating-point assist fault at ip 400000000062ada1, isr 0000020000000008
Both unaligned data and usage of IEEE denormal numbers (see also above) can have a considerable negative performance impact; especially treatment of denormal numbers (which induces floating point assists) requires use of a kernel software trap, using hundreds or thousands of cycles per operation. In fact, you will only see the console output if many of these events accumulate, giving you a hint that you need to fix your program or your compilation procedure.
Suggestions for fixing:
- Do not use unaligned (e.g. packed) data types. Compiler flags may help in providing suitable padding.
- Use Intel compilers. GCC is not always good at avoiding alignment issues.
- If FP underflows occur frequently, this may indicate a defect in your algorithm or data.
- In most cases, compiling with the -ftz (flush-to-zero) switch should be fine with respect to numerical stability, and will avoid use of denormal numbers. If this is not the case, please contact HPC Support for advice.
If messages still occur and are considered a nuisance:
You can run your program under the control of prctl: For example,
prctl --unaligned=silent <program-name>
will suppress the unaligned access messages. Please consult the prctl(1) man page for details.
Intel C, C++ or Fortran compilers: Linkage fails
This happens fairly often if you need to link against system libraries (e.g., libX11, libpthread, ...). Of course there are many possible reasons:
-
Check whether you have specified all needed libraries
-
Check whether you are trying to link 32 bit objects into a 64 bit executable. This is not possible.
-
If you use the -static option of the compiler in your linkage command, please remove it or replace it by -i-static (for 9.0 and newer Intel compilers) to only link the Intel libraries statically.
See also the linkage problems with MPI below for further information.
When starting my binary, it complains about missing symbol (e.g., 0_memcopyA)
This can be a problem when using non-default versions of the Intel Compilers, or mixing different versions of the C and Fortran compilers. When doing e.g., a
module switch fortran fortran/9.0
for compilation, this setting must also be performed before execution of the program. Otherwise the wrong base library may be bound at run time; in fact if the order of library entries in $LD_LIBRARY_PATH is wrong it may happen that the wrong library is bound from the C installation for a Fortran program (or vice versa). There are a number of possibilities to deal with this problem:
-
If you use e.g., Fortran only, always perform the command sequence
    module unload ccomp
    module switch fortran fortran/9.0
before either compiling or running your executable.
-
For compiler releases 9.0 and higher, the -i-static link time switch is provided which statically links in the Intel libraries.
-
Use the -Xlinker -rpath [path_to_libraries] switch at linkage to fix the path chosen for resolution of the shared libraries. We are considering making this the default setting in the compiler configuration file.
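To see why the order of entries matters, it can help to walk through a search path in the same order and report the first directory that holds a given library. The following is a minimal sketch of that idea, not the run-time linker's actual algorithm; the helper name first_dir_with, the library name libdemo.so and the two temporary directories are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: report the first directory in a colon-separated search path that
# contains a given file, in the order in which $LD_LIBRARY_PATH is searched.
# first_dir_with and libdemo.so are hypothetical names.
first_dir_with() {
    lib=$1
    path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $path; do
        if [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            echo "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Demonstration with two temporary directories standing in for the C and
# Fortran compiler installations; both contain a library of the same name.
base=$(mktemp -d)
mkdir "$base/ccomp" "$base/fortran"
touch "$base/ccomp/libdemo.so" "$base/fortran/libdemo.so"

found=$(first_dir_with libdemo.so "$base/ccomp:$base/fortran")
echo "libdemo.so would be taken from: $found"
rm -rf "$base"
```

For a real executable, running ldd on it shows which shared libraries are actually resolved under the current environment.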
How to get an error traceback for your code
If you are using Intel version 8.1 (and higher) compilers, the -traceback option should get you a traceback if your code fails. Adding the -g option may provide source line information as well. You can also add -fpe0 if you suspect that your code fails due to floating point exception error. Note that all of the above can (and perhaps should) be specified in addition to any options used for the production code. Example:
      program sample
      real a(10), b(10)
      do i=1,10
        b(i)=0.0
        a(i)=a(i)/b(i)
      end do
      stop
      end
$ ifort -fpe0 -traceback -g sample.f
$ ulimit -c unlimited
$ a.out
forrtl: error (65): floating invalid
Image          PC                 Routine    Line       Source
a.out          4000000000002D11   MAIN__     6          sample.f90
a.out          4000000000002A80   Unknown    Unknown    Unknown
libc.so.6.1    2000000000435C50   Unknown    Unknown    Unknown
a.out          40000000000027C0   Unknown    Unknown    Unknown
Note that the ulimit -c setting is necessary if you want to investigate a core dump.
Common Problems using OpenMP
My OpenMP program (Fortran, C or C++) segfaults upon or shortly after startup
All static arrays and variables are put on the stack (this ensures thread-safeness if -openmp is used). Hence one of the following needs to be done:
-
Increase the stack limit via e.g., ulimit -s 1000000 (for 1 GB of stack).
-
If you use the Intel compilers and the stack size is already adjusted, perform export KMP_STACKSIZE=32m to adjust the thread individual stack size to a higher value than the default 4 MB (here 32 MB).
-
Convert large static arrays to allocatable and allocate storage dynamically. For private entities this may not always be feasible, though.
Note that specifying -save (or the SAVE attribute for large arrays) may also be possible in some instances, but you need to check that no problems with thread-safeness ensue since SAVEd storage will be in shared scope by default. Also, there may be additional limits (e.g. within the kernel): do not count on being able to use more than 2 GByte on the stack even if limits are set appropriately on a 64 bit system.
Common Problems using MPI
sgi MPT consumes too much memory
See the specific page about SGI's MPI implementation and how to control it.
sgi MPT does not allow static linking
The message passing toolkit from sgi is delivered with shared libraries only. Hence it is not possible to perform static linking (-static or -fast switch) of MPI programs in the mpi.altix environment.
My MPI tasks cannot get the values of environment variables even though I've exported them
Under MPICH, environment variables are not exported to the compute nodes. Parastation MPI (used on the 2 and 4-way cluster systems) here offers the possibility to use the PSI_EXPORTS variable, which by default already contains a number of entries, such as F_UFMTENDIAN or PATH. To add your own variable MY_FOO, please type
export PSI_EXPORTS=$PSI_EXPORTS,MY_FOO
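If this line is executed more than once, or while PSI_EXPORTS is still empty, the list ends up with duplicates or a leading comma. The following sketch appends a name only if it is missing; the helper name add_psi_export is an assumption made for illustration, and the initial list is only a stand-in for the site default.

```shell
#!/bin/sh
# Sketch: add a variable name to the comma-separated PSI_EXPORTS list only
# if it is not already there. add_psi_export is a hypothetical helper name.
add_psi_export() {
    name=$1
    case ",${PSI_EXPORTS:-}," in
        *",$name,"*) ;;   # already listed, nothing to do
        *) PSI_EXPORTS="${PSI_EXPORTS:+$PSI_EXPORTS,}$name" ;;
    esac
    export PSI_EXPORTS
}

PSI_EXPORTS="F_UFMTENDIAN,PATH"   # stand-in for the site default
MY_FOO=42
export MY_FOO
add_psi_export MY_FOO
add_psi_export MY_FOO             # second call leaves the list unchanged
echo "PSI_EXPORTS=$PSI_EXPORTS"
```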
My MPI program crashes. What do I do?
The symptom will look somewhat like this (sgi MPT):
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal x
(x may e.g. be 11)
Even if your program appeared to run correctly on another machine/with a different number of CPUs, there still may be bugs in the program. There also may be bugs in the MPI implementation, but that is less probable. To find out where bad things are happening, please perform a traceback procedure as described below.
MPI crash due to incorrect header information (any MPI)
If debugging shows that MPI calls very obviously deliver incorrect results (especially administrative calls), please check whether you've got a file called mpi.h, mpif.h or mpi.mod somewhere in your private include path which interferes with the corresponding files in the system include path. This may lead to errors since different MPI implementations are not binary or even source compatible. Please either remove the spurious files or change your include path so these files are not referenced.
MPI crash due to exceeding internal limits (sgi MPT)
Note that if the crash is initiated by a message similar to
*** MPI has run out of unexpected request entries.
*** The current allocation level is:
***     MPI_REQUEST_MAX = 16384
this is typically due to exceeding an MPT internal limit. In this case, you simply need to set the referenced environment variable (in the above case, MPI_REQUEST_MAX) to a value sufficient to cover your application's needs. Some experimenting may be necessary; also consult the mpi(1) manual page for the functionality and possible side effects of the referenced variable.
Traceback for parallel codes
The following recipe works for SGI's MPI implementation (MPT).
First, build your application as described in the section about obtaining an error traceback (in the serial case), except that you should use mpif90, mpicc etc. Then, perform the following command sequence
$ module load intel_idb
$ mpirun -np 32 ./myparprog.exe
to trace back to the point in the code where the crash happens. Note that on some systems the intel_idb module is loaded by default.
Master-Slave Codes
Some user applications run in master-slave mode. This may be a configuration where e.g. the process with MPI rank 0 does not actually do any computational work but is only responsible for administrative stuff. Here are some hints on how to deal with this situation.
Notes:
-
Please refer to the section on Controlling Parastation MPI execution for information on Parastation environment variables and their handling.
-
For analogous information on SGI's MPI implementation, see man mpi.
Configuration of Batch Jobs
Depending on the master's computational load it may be reasonable to specify one CPU less for your job script than actually used in the mpirun call. This will cause one of the slave processes to share resources with the master. If you use Parastation MPI, you need to set the PSI_OVERBOOK environment variable to enable this overbooking of resources.
Master consumes CPU Resources
On some MPI variants, the Master will consume CPU resources even though it is actually only waiting e.g., for an incoming message (spinlock). Depending on your code, this may lead to a performance and/or scalability problem if you configure for resource sharing as described in the previous subsection. Here are some suggestions on how to deal with this situation:
-
Parastation MPI: Switch off shared-memory communication via
export PSP_SHAREDMEM=0
However, depending on your communication patterns, this measure in itself may degrade performance.
-
SGI MPT: A setting of e.g., export MPI_NAP=100 will put idling processes to sleep after 100 milliseconds.
-
Self-regulatory: Teach the master to do a renice on itself. But note that this is not reversible.
If nothing helps, you will need to return to allotting the master its own CPU. In any case, please check your performance with suitable test scenarios before burning lots of low-quality cycles.
-fast compiler switch prevents linking on Altix
You cannot use the -fast compiler switch at the linking stage on the sgi systems. See the section "sgi MPT does not allow static linking" above for an explanation.
Hybrid MPI and OpenMP/threaded programs
Performance of Hybrid Code
This section is especially relevant for users migrating their code from the Hitachi SR8000.
Before going into production run with a code which supports hybrid mode, either via OpenMP or via automatic parallelization, please check whether performance is not better running with one thread per MPI process.
Please note: Removing -openmp altogether may improve the performance of hybrid MPI+OpenMP codes that are run as pure MPI codes. If you run with OMP_NUM_THREADS set to 1 because you want the "pure" MPI case, your code may perform better when it is compiled and linked without the -openmp flag. With the flag, the code may be penalized by OpenMP overhead even though OpenMP is not used, since the compiler may produce less optimized code due to the OpenMP-induced code transformations.
For codes that have explicit calls to OpenMP functions, either shield the calls with !$ directives, or compile them for the "pure" MPI case using the -openmp_stubs option instead of -openmp. A code compiled with -openmp_stubs will not work if OMP_NUM_THREADS is set to a value greater than 1.
Note that there may well be cases in which retaining hybrid functionality may give a performance advantage e.g. if your code becomes cache-bound and little shared-memory synchronization is required. But you need to check this, and optimize the number of threads used if you do decide in favour of hybrid mode.
Problems with parallel tracing (Intel Tracing Tools / Vampirtrace)
Why can I only resolve MPI calls, but not my own subroutine calls?
Automatic subroutine tracing is supported in the newest releases of tracing tools (7.1) and compilers (10.0). Use the -tcollect compiler switch in addition to -vtrace after loading the appropriate modules. To reduce overhead, you can also use the VT API to manually insert instrumentation into your source.
My program runs, but crashes when trying to write the trace file
A typical error message might look like
[0] Intel Trace Collector INFO: Writing tracefile a.out.stf in /home/cluster/a2832ba
PSIlogger: Child with rank 1 exited on signal 15.
PSIlogger: Child with rank 0 exited on signal 11.
(for Parastation MPI; sgi MPT produces a stack traceback). The trace data are mostly corrupt. The reason for this behaviour may be that you have a global symbol which clashes with a system call used by the tracing library, e.g., a global variable
double *time;
Please rename your symbol so as to not clash, or convert it to local scope.
Problems with using existing binaries
Failing to link with an existing object file
There may be various issues here:
-
First run the file command on the object(s):
file foo.o
The result must be consistent with the platform you're working on (e.g. ELF 64-bit LSB relocatable, IA-64, version 1 (GNU/Linux), not stripped on an Itanium-based system).
-
If you get an error message like
undefined reference to `__ctype_b'
you need to change the source code of your application and recompile. Symbols beginning with __ should not be used at all if possible since they are only meant for internal glibc usage and are not exported any more in newer glibc releases.
A workaround in the above case may be to replace __ctype_b by *__ctype_b_loc() in your code.
Issues with batch queuing and batch jobs
Jobs using shared memory parallelism (SGE)
Note that OpenMP parallel programs may need a suitable setting for the environment variable OMP_NUM_THREADS. The default is usually 1 thread.
Some programs (e.g., Gaussian) are multithreaded not via OpenMP, but via shared memory used by processes or via explicit pthread programming. For these you need to study the program documentation for hints on how to configure a parallel run. Setting OMP_NUM_THREADS will usually not have any effect for these programs.
For MPI programs running via shared memory, setting OMP_NUM_THREADS will also not have any effect.
Names of Job Scripts (SGE)
Batch scripts must not have a number as first character of their name. E.g., a script of the form 01ismyjob will not be correctly started. Please use one of the characters a-z, A-Z as first character of your job script name.
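The naming rule can be checked before submission. The following is a minimal sketch; the helper name valid_job_script_name and the example file names other than 01ismyjob are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: check that a job script name starts with a letter before submitting.
# valid_job_script_name is a hypothetical helper name.
valid_job_script_name() {
    name=$(basename "$1")
    case "$name" in
        [a-zA-Z]*) return 0 ;;
        *)         return 1 ;;
    esac
}

for script in 01ismyjob myjob01 /some_dir/username/2nd_run.sh; do
    if valid_job_script_name "$script"; then
        echo "$script: ok"
    else
        echo "$script: rename before submitting"
    fi
done
```

Only the file name itself is checked, so leading directories do not influence the result.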
Batch Scripts in DOS/Unicode format / Unprintable Characters in Scripts (SGE and PBS)
Scripts which have been edited under DOS/Windows may contain carriage-return/line-feed line endings; these will not work under SGE or PBS. The same applies to scripts which have been written in Unicode (e.g., UTF format) by modern editors. Furthermore, characters that merely look like whitespace, for example in the
#! /bin/sh
specification, could lead to problems with SGE. Scripts like these will fail to execute and may even block a queue altogether! Please remove such special whitespace characters.
Determination of file format and fixing of format problems
-
Run the file command on the script:
file my_script
The result should be something like my_script: Bourne-Again shell script text. If this is not the case, but instead a format like UTF-8 is reported, then please run the iconv command:
iconv -c -f UTF-8 -t ASCII my_script > fixed_script
(the result is written to standard output, which is redirected in this example, so fixed_script should now be ASCII, while my_script is unchanged).
-
Edit the script with vi(=vim). In the status line you will see the string [dos] if the file happens to contain carriage returns/linefeeds.
For conversion from DOS to UNIX format the tool recode may be used:
recode ibmpc..lat1 my_script
Alternatively, the dos2unix command can also be used:
dos2unix my_script
These commands perform the necessary changes in-place (i.e., the file is modified).
-
If neither of the two items above helps, you can also perform an octal dump of your script. You should see the following
$ od -c myscript | less
0000000 # ! / b i n / b a s h \n # $ - o
etc. etc. ...
If any strange numbers or "\r \n" sequences occur, the format is incorrect and must be fixed (e.g. via multi-lingual emacs editing).
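Where recode or dos2unix are not at hand, the check and the conversion can also be done with standard tools. The sketch below detects carriage returns with grep and writes a cleaned copy with tr; the helper name has_carriage_returns is an assumption made for illustration, and a temporary file stands in for a real job script.

```shell
#!/bin/sh
# Sketch: detect and remove carriage returns in a job script.
# has_carriage_returns is a hypothetical helper; the files are temporary demos.
CR=$(printf '\r')

has_carriage_returns() {
    grep -q "$CR" "$1"
}

script=$(mktemp)
fixed=$(mktemp)
printf '#!/bin/sh\r\necho hello\r\n' > "$script"   # DOS-style line endings

if has_carriage_returns "$script"; then
    echo "carriage returns found, writing a converted copy"
    tr -d '\r' < "$script" > "$fixed"              # original stays unchanged
else
    cp "$script" "$fixed"
fi

if has_carriage_returns "$fixed"; then
    status="still DOS format"
else
    status="clean"
fi
echo "converted copy: $status"
rm -f "$script" "$fixed"
```

Unlike dos2unix, this variant does not modify the original file.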
Submission: qsub with binary file as argument Refused (SGE)
Vanilla SGE supports the following command:
qsub myscript myprog.exe
where myprog.exe is a binary file and processed via $1 within the script. This functionality is not supported at LRZ.
Can't open output file (SGE)
If you receive a mail from SGE which contains an error message of the form
Shepherd error: 01/09/2006 21:41:20 [20542:31945]: error: can't open output file "/some_dir/username/xyz": No such file or directory
in its last few lines, something may be wrong with the configuration of your job script. Please check
-
whether the directory you have targeted the SGE output for actually exists (on the target machine).
-
whether the SGE script is located in a shared file system, preferably within your HOME directory tree. Note that especially the combination of having the script on $TMPDIR on the submit host and using the -cwd switch will not work!
-
also note that the directory specified for the output/error files of SGE must exist prior to the job starting its run. Having an appropriate mkdir command in the script section is not sufficient.
Another possible reason for the above error message is contained in the section about jobs failing with I/O errors.
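Since the output directory has to exist before the job starts, it is safest to create it on the submit host before calling qsub. The following is a minimal sketch: the temporary directory is only a stand-in (in practice, use a directory in a shared file system, e.g. below your HOME directory), and the script name myscript in the commented qsub line is a placeholder.

```shell
#!/bin/sh
# Sketch: make sure the SGE output directory exists before submission.
# The directory and the commented qsub line are placeholders.
base=$(mktemp -d)          # stand-in; choose a shared directory in practice
outdir="$base/output"

mkdir -p "$outdir" || { echo "cannot create $outdir" >&2; exit 1; }
if [ -d "$outdir" ] && [ -w "$outdir" ]; then
    echo "output directory ready: $outdir"
    # qsub -o "$outdir" -e "$outdir" myscript
fi
```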
My jobs fail with strange error messages involving I/O (SGE and PBS)
The error messages typically are "Can't open output file" and/or "file too large".
The reasons for this may be
- you have exceeded a file system quota
- you have exceeded the per-directory limit for the number of files
Please consult the file systems document appropriate for the system you are using: Cluster or HLRB-II. The remedy usually requires removing files and/or restructuring your directory hierarchy.
I've deleted my job, but it is still listed in qstat (SGE and PBS)
With SGE, you might typically see an entry
x64_serial@lx64a389.cos B 0/1/4 -NA- lx26-amd64 au
    10226 50.04043 myjob xxxyyy dr 09/21/2012 13:06:29 1
when issuing qstat -f -u $USER some time after having deleted your job using qdel. With PBS, the qstat output has a different format, but the same problem may surface. The trouble is that the master cannot distinguish a node crash from a temporary network outage, and so cannot remove the job from its internal tables. It might or might not need to do some janitorial work on the client node!
There is of course a catch here: if you resubmit the job (operating on the same data), there is a chance that processes from the deleted job which are still running on the client node will overwrite newly generated data. The chance of this is not large, since in most cases we observe node crashes rather than long-term network outages, but it is not zero. The only sure way to avoid this is to run the new job on a separate data set.
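One way to follow this advice is to give every submission its own working directory, so that leftover processes of a deleted job cannot touch the new job's files. A minimal sketch, assuming the job reads its input from a directory that can simply be copied; the directory names runs and input and the RUN_BASE / RUN_INPUT overrides are placeholders for this example.

```shell
#!/bin/bash
# Placeholder locations; adjust to your own layout.
base="${RUN_BASE:-$HOME/runs}"
input="${RUN_INPUT:-$HOME/input}"

mkdir -p "$base"
# mktemp -d creates a fresh, uniquely named directory on every call.
rundir=$(mktemp -d "$base/run.XXXXXX")

# Copy the input data if it exists, so the new job works on its own copy.
if [ -d "$input" ]; then
    cp -r "$input/." "$rundir/"
fi

echo "submit the new job from: $rundir"
```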
Common Problems with Scripting Languages (perl, python, R, ruby)
This section covers the scripting facilities for the HLRB-II and the Linux Cluster systems.
I need additional modules for perl. How can I install them for my user account?
To give our users maximum freedom, we do not install perl modules system-wide. Modules have to be installed on a per-user basis, unless a module is so important and so widely used among our users that a system-wide installation is justified. We are always pleased to help our users in case of any problems.
You can install all the modules that you require easily in the following way:
- create local perl module directory:
> mkdir ~/myperl
- start cpan
> cpan
You will be in the initial configuration dialog. The defaults are ok. Just press enter.
When the installation dialog asks for the download sites, enter: 11 9 4
This might take a while (several minutes).
Then set the following options in cpan:
cpan> o conf makepl_arg "LIB=~/myperl/lib \
 INSTALLMAN1DIR=~/myperl/man/man1 \
 INSTALLMAN3DIR=~/myperl/man/man3 \
 INSTALLSCRIPT=~/myperl/bin \
 INSTALLBIN=~/myperl/bin"
cpan> o conf mbuildpl_arg "--lib=~/myperl/lib \
 --installman1dir=~/myperl/man/man1 \
 --installman3dir=~/myperl/man/man3 \
 --installscript=~/myperl/bin \
 --installbin=~/myperl/bin"
cpan> o conf mbuild_install_arg "--install_path lib=~/myperl"
cpan> o conf prerequisites_policy automatically
cpan> o conf commit
cpan> quit
- Now you can install whatever modules you like in your local directory ~/myperl.
For example, for the bioperl module, type in the following commands while being in the cpan shell:
cpan> d /bioperl/
CPAN: Storable loaded ok
Going to read /home/bosborne/.cpan/Metadata
Database was generated on Mon, 20 Nov 2006 05:24:36 GMT
....
Distribution B/BI/BIRNEY/bioperl-1.2.tar.gz
Distribution B/BI/BIRNEY/bioperl-1.4.tar.gz
Distribution C/CJ/CJFIELDS/BioPerl-1.6.0.tar.gz
Now install:
cpan> force install C/CJ/CJFIELDS/BioPerl-1.6.0.tar.gz
Some additional tricks:
- In case the download is slow, edit the file ~/.cpan/CPAN/MyConfig.pm and insert the following line into $CPAN::Config:
'dontload_hash' => { "Net::FTP" => 1, "LWP" => 1 },
- In case something goes wrong, you can delete ~/.cpan and start over again.
The perl shell is very helpful; you can easily obtain it by installing:
cpan> install Psh
cpan> install IO::String
It is then available as ~/myperl/bin/psh. For example, try:
psh% use Bio::Perl;
psh% $seq_object = get_sequence('genbank',"ROA1_HUMAN");
psh% write_sequence(">roa1.fasta",'fasta',$seq_object);
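Perl only finds the locally installed modules if ~/myperl/lib is on its module search path. The text above does not say how this is set up on the LRZ systems; a common approach, shown here as an assumption, is to extend the PERL5LIB and PATH environment variables, for example in your shell startup file.

```shell
#!/bin/bash
# Assumption: modules were installed below ~/myperl as configured above.
myperl="$HOME/myperl"

# Prepend the local library directory to Perl's search path,
# keeping any value that was already set.
export PERL5LIB="$myperl/lib${PERL5LIB:+:$PERL5LIB}"

# Make locally installed scripts such as psh available.
export PATH="$myperl/bin:$PATH"

echo "PERL5LIB=$PERL5LIB"
```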
I need additional modules for R. How can I install them for my user account?
You can install R modules in your home directory. R automatically offers to put the libraries in a directory in your home when it cannot write to the system-wide installation directory. To install additional modules, start up R and run install.packages:
$ R
R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("XML")
R then asks you for a download site. Use the local (München) download site. If you want special compilers, or want to install the package on the HLRB-II, where no direct internet connection is possible, you have to download the tar.gz file first (e.g. from CRAN) and then install it via the following command:
> install.packages(
c("XML_0.99-5.tar.gz",
"../../Interfaces/Perl/RSPerl_0.8-0.tar.gz"),
repos = NULL,
configure.args = c(XML = '--with-xml-config=xml-config',
RSPerl = "--with-modules='IO Fcntl'"))
Please also have a look at the help pages in R (for example, ?install.packages).
I need additional modules for python. How can I install them for my user account?
To give our users maximum freedom, we do not install python modules system-wide. Modules have to be installed on a per-user basis, unless a module is so important and so widely used among our users that a system-wide installation is justified. We are always pleased to help our users in case of any problems.
The idea behind the 'home scheme' is that you build and maintain a personal stash of Python modules. This scheme's name is derived from the idea of a home directory on Unix, since it's not unusual for a Unix user to make their home directory have a layout similar to /usr/ or /usr/local/. This scheme can be used by anyone, regardless of the operating system they are installing for.
You can install all the modules that you require easily in the following way:
- Download the tar.gz file to your home directory, unpack it and cd to the installation directory. You should have a setup.py file in this directory.
- Create a library directory for your library files:
$ mkdir ~/mypython
- Install the library files in your own library directory:
$ python setup.py install --home=~/mypython
While in most cases setting the --home option will do what you want, in some cases you want to install modules for another python interpreter. For example, many Linux distributions put Python in /usr rather than the more traditional /usr/local. This is entirely appropriate, since in those cases Python is part of 'the system' rather than a local add-on. However, if you are installing Python modules from source, you probably want them to go in /usr/local/lib/python2.X rather than /usr/lib/python2.X. This can be done with
$ /usr/bin/python setup.py install --prefix=/usr/local
Finally, if you want full control over the installation procedure, you can also override the individual installation directories:
$ python setup.py install --home=~/mypython \
--install-purelib=~/mypython/lib \
--install-platlib='lib.$PLAT' \
--install-scripts=~/mypython/scripts \
--install-data=~/mypython/data
Another way to install packages under python is by using the easy_install command. If you want to install packages only for your own user account, please use the following options:
easy_install --install-dir ~/mypython/lib/ --script-dir ~/mypython/scripts
and don't forget to set the python path accordingly:
export PYTHONPATH=~/mypython/lib
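To check that the setting took effect, you can ask the interpreter for its search path. A small sketch, assuming a python (or python3) executable is on your PATH; the check is skipped otherwise. Note that with the --home scheme shown earlier, distutils places modules in ~/mypython/lib/python, so that directory is added here as well.

```shell
#!/bin/bash
pylib="$HOME/mypython/lib"

# The easy_install target directory, plus the lib/python subdirectory
# that "setup.py install --home=~/mypython" uses.
export PYTHONPATH="$pylib:$pylib/python${PYTHONPATH:+:$PYTHONPATH}"

# Print the interpreter's search path if an interpreter is available.
py=$(command -v python || command -v python3)
if [ -n "$py" ]; then
    "$py" -c 'import sys; print("\n".join(sys.path))'
fi
```

Directories listed in PYTHONPATH only appear in the printed search path if they actually exist.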
Installing your own program packages on supermuc
Sometimes you want to install your own software packages from the internet by using svn or via an installation script that fetches files via http or ftp. This cannot be done easily due to the restrictions of the supermuc firewall on external connections. We propose the following solutions:
Copy all needed installation files to supermuc
The easiest way to install a software package is to copy all needed files into a directory on supermuc, unpack them and run the configure script. Be sure that you have resolved all the dependencies and then you can compile the software and install it in a directory in your home. Most of the time this will include an option like this:
$ ./configure --prefix=/home/<group>/<account>/mydir
$ make
$ make install
This should do the job in most of the cases.
Mount a directory of supermuc on your local machine
Sometimes you need internet access in order to install software. In that case you can mount a supermuc directory on your local machine (if you are running SUSE SLES11 or another compatible linux) and run the installation on your local machine.
To do this, create a directory in your HOME directory on supermuc, e.g. $HOME/mydir, and a directory $HOME/mydir on your local machine, and then mount the former on the latter. You need sshfs on your local machine for this.
$ sshfs login01.sm-gw.lrz.de:/home/<group>/<account>/mydir ./mydir
Then install the software on your local machine in the directory $HOME/mydir. It will be automatically mirrored to the supermuc directory and can then be used there.
Create a tunnel for internet access
In case the last method fails you can create a tunnel via ssh from your local machine to supermuc and use a proxy at your local machine to re-direct internet connections from supermuc to the internet.
Get an HTTP proxy program (e.g. Tiny HTTP Proxy from http://www.voidtrance.net/2010/01/simple-python-http-proxy/) and unpack it on your local machine.
Then start the local proxy using:
$ python TinyHTTPProxy.py -p 1234
Then create a tunnel to supermuc via:
$ ssh -l <account> -R 1234:<your hostname>:1234 login02.sm-gw.lrz.de
On supermuc you can then set http_proxy to localhost:1234 and have full access to the HTTP protocol (e.g. for firefox).
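On the supermuc side, the proxy is typically selected through an environment variable. A minimal sketch, assuming the tunnel from the previous step listens on port 1234; command-line tools such as wget and curl honour http_proxy, while graphical browsers usually need the proxy entered in their own settings.

```shell
#!/bin/bash
# Port of the reverse tunnel opened with "ssh -R 1234:...".
proxy_port=1234

# Point HTTP clients at the supermuc end of the tunnel.
export http_proxy="http://localhost:${proxy_port}"

echo "http_proxy=$http_proxy"
# Example use (requires the tunnel and the local proxy to be running):
# wget http://www.example.org/
```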