Globus url copy
Using globus-url-copy with LRZ resources
The tool globus-url-copy is a command line client provided by the Globus Toolkit in order to move data using the GridFTP transfer protocol.
In this subsection we describe how you can use globus-url-copy to
- copy data between your local workstation and SuperMUC
- copy data from a remote workstation to SuperMUC (also called third party transfer)
After that we give some concrete examples that show how to use the globus-url-copy command.
There are some prerequisites that have to be satisfied before you can move files:
- You must have a Grid certificate
- You must create a proxy certificate: this is a short-lived credential generated from your Grid certificate (see previous bullet) by means of the command grid-proxy-init and the password of your private key. The command and the related output are:
localhost:~ johndoe$ grid-proxy-init
If you are comfortable with this tool, you may want to store your credentials on a remote server (MyProxy) and retrieve your proxy certificate from there.
- You must have a Globus installation available on your workstation. The easiest way to satisfy this requirement is to install the precompiled packages for your Linux distribution provided by the IGE Project. Alternatively, you can install Globus from the source code; the Globus Toolkit website explains how to do that.
Copying data between your local workstation and SuperMUC
First of all, remember that this operation can only succeed if you have network access to SuperMUC, that is, if the IP address of your workstation has been registered in the firewall of SuperMUC. In addition, you have to set the ephemeral port range to 20000 to 25000. In (t)csh and (ba)sh syntax, respectively, this is done as follows:
|(t)csh|(ba)sh|
|setenv GLOBUS_TCP_RANGE 20000,25000|export GLOBUS_TCP_RANGE=20000,25000|
|setenv GLOBUS_TCP_PORT_RANGE 20000,25000|export GLOBUS_TCP_PORT_RANGE=20000,25000|
|setenv GLOBUS_TCP_SOURCE_RANGE 20000,25000|export GLOBUS_TCP_SOURCE_RANGE=20000,25000|
|setenv GLOBUS_UDP_PORT_RANGE 20000,25000|export GLOBUS_UDP_PORT_RANGE=20000,25000|
|setenv GLOBUS_UDP_SOURCE_RANGE 20000,25000|export GLOBUS_UDP_SOURCE_RANGE=20000,25000|
The basic syntax of the command is:
globus-url-copy <sourceURL> <destinationURL>
where <sourceURL> and <destinationURL> can be one of the following:
- file://<absolute path to your file>, if you are referring to a file local on the machine where you call this command;
- gsiftp://<remote machine><absolute path to your file>, if you are referring to a file on a remote workstation.
Options are placed right after the command name, for example -vb to enable verbose mode and monitor the transfer. To achieve maximal throughput you should specify the -p 10 option; this uses 10 parallel streams for the data transfer and can increase your transfer speed from 200 MB/s to 900 MB/s on SuperMUC. However, it only works if <destinationURL> is a gsiftp:// URL.
Assuming that you want to copy a file named foo located in the home folder of your workstation to the home folder you have on SuperMUC, what you have to do is:
localhost:~ johndoe$ globus-url-copy -vb -p 10 file:///home/johndoe/foo gsiftp://supermuc.lrz.de/~/
The following picture shows the output of the command:
You can see that:
- it is not necessary to specify the name of the destination file (even though you can do that, e.g., globus-url-copy -vb -p 10 file:///home/johndoe/foo gsiftp://supermuc.lrz.de/~/bar);
- it is possible to specify the home folder in the remote location using the '~' shortcut (very convenient, since the absolute path of the home folder on the remote system is not easy to remember);
- since the verbose option is specified, some statistics and the progress of the transfer are shown;
- with the -p 10 option you can reach much higher transfer speeds, up to 900 MB/s, so its use is recommended.
The transfer can also go in the opposite direction, from SuperMUC to your local workstation. So, if you want to copy the file back to your local workstation and name the copy bar, you have to type:
localhost:~ johndoe$ globus-url-copy -vb gsiftp://supermuc.lrz.de/~/foo file:///home/johndoe/bar
Finally, you can also transfer a folder. You have to specify two additional options: -r, for a recursive transfer and -cd, to create the destination directory, if necessary.
So, if you want to copy the foo_dir folder from your local workstation to SuperMUC, the complete command to issue is:
localhost:~ johndoe$ globus-url-copy -vb -cd -r -p 10 file:///home/johndoe/foo_dir/ gsiftp://supermuc.lrz.de/~/foo_dir/
Please note that the folder name in <sourceURL> ends with a trailing '/' character and that the same folder name is also appended to <destinationURL>; globus-url-copy creates this destination folder for you on the fly.
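The trailing-slash convention is easy to forget, so it can help to normalise both URLs before building the command. The following is only a sketch: with_trailing_slash and dir_copy_cmd are hypothetical helper names, not part of the Globus Toolkit, and the function prints the command instead of running it so that you can inspect it first.

```shell
#!/usr/bin/env bash
# Sketch only: with_trailing_slash and dir_copy_cmd are hypothetical helpers.

# Print the given URL, adding a trailing '/' if it is missing.
with_trailing_slash() {
    local url="$1"
    case "$url" in
        */) printf '%s\n' "$url" ;;
        *)  printf '%s/\n' "$url" ;;
    esac
}

# Print (do not run) a recursive globus-url-copy command for a folder.
dir_copy_cmd() {
    local src dst
    src="$(with_trailing_slash "$1")"
    dst="$(with_trailing_slash "$2")"
    printf 'globus-url-copy -vb -cd -r -p 10 %s %s\n' "$src" "$dst"
}
```

For example, dir_copy_cmd 'file:///home/johndoe/foo_dir' 'gsiftp://supermuc.lrz.de/~/foo_dir' prints the command shown above, with both trailing slashes in place.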
Copying data from a remote workstation to SuperMUC
globus-url-copy also allows you to move data between a remote workstation (i.e., a machine other than your local one that runs a GridFTP server) and SuperMUC, provided that:
- your credentials are valid on the remote workstation and you have a valid account there;
- the remote workstation also has network access to SuperMUC (i.e., its IP address is allowed in the firewall of SuperMUC).
Supposing that all these prerequisites are met and you want to move a file named foo from your home folder on SuperMUC to your home folder on lxgt2.lrz.de, changing the name to bar, you will have to type:
localhost:~ johndoe$ globus-url-copy -vb -p 10 gsiftp://supermuc.lrz.de/~/foo gsiftp://lxgt2.lrz.de/~/bar
Please note that if the GridFTP server on the remote workstation runs on a port different from the standard one (2811), then you have to specify the port on the URL, which would become gsiftp://<remote workstation>:<port>/<absolute path to your file>. For example, if the GridFTP port on lxgt2.lrz.de were 2812, then the destination URL of the previous use case would be gsiftp://lxgt2.lrz.de:2812/~/bar.
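The URL rule above can be captured in a small helper. This is a minimal sketch under the assumption that the path argument already starts with '/'; gsiftp_url is a hypothetical function name introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: gsiftp_url is a hypothetical helper, not a Globus command.
# Usage: gsiftp_url <host> <path starting with '/'> [port]
gsiftp_url() {
    local host="$1" path="$2" port="${3:-2811}"
    if [ "$port" = "2811" ]; then
        # Standard GridFTP port: it can be left out of the URL.
        printf 'gsiftp://%s%s\n' "$host" "$path"
    else
        printf 'gsiftp://%s:%s%s\n' "$host" "$port" "$path"
    fi
}
```

With this helper, gsiftp_url lxgt2.lrz.de '/~/bar' 2812 prints gsiftp://lxgt2.lrz.de:2812/~/bar, while omitting the port prints the plain gsiftp://lxgt2.lrz.de/~/bar form.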
Advanced usage and performance enhancement
For a massive transfer, it is good practice to pass globus-url-copy the option -rst (to restart interrupted operations) together with -df <filename>. The file specified with the -df flag is the so-called dump file, which contains the URLs that still have to be transferred. If globus-url-copy returns to the prompt before finishing, entering exactly the same command again, taking care to preserve the mentioned options, will resume the transfer starting from the first incomplete file. In that case any source file or path on the command line is ignored and globus-url-copy reads the content of the dump file instead. However, the following restrictions apply:
- both options, -rst and -df <filename>, must be present from the first globus-url-copy call, otherwise the dump file cannot be populated;
- when restarted, globus-url-copy cannot resume a file transfer at an arbitrary point, but moves incomplete file(s) again from the beginning.
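Re-entering the same command by hand can be automated with a small loop. The following is a sketch rather than a recommended procedure: resumable_copy, the GUC and MAX_TRIES variables and the dump file name are assumptions made for illustration, and the loop simply relies on globus-url-copy returning a non-zero exit status when it stops early.

```shell
#!/usr/bin/env bash
# Sketch only: resumable_copy, GUC and MAX_TRIES are illustrative names.
GUC="${GUC:-globus-url-copy}"     # command to run (can be overridden)
MAX_TRIES="${MAX_TRIES:-5}"       # give up after this many attempts

# Usage: resumable_copy <dump file> <sourceURL> <destinationURL>
resumable_copy() {
    local dumpfile="$1" src="$2" dst="$3"
    local try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        # The same options, including -rst and -df, are used on every call.
        if "$GUC" -vb -cd -r -rst -df "$dumpfile" "$src" "$dst"; then
            return 0
        fi
        echo "attempt $try failed, retrying" >&2
        try=$((try + 1))
    done
    return 1
}
```

A call could look like resumable_copy transfer.dump file:///home/johndoe/foo_dir/ gsiftp://supermuc.lrz.de/~/foo_dir/, where transfer.dump is an arbitrary file name chosen for this example.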
Another way to resume a transfer is the -sync flag. In this case, globus-url-copy performs a check before moving a file or a folder. The default behaviour is to move the source only if its timestamp is more recent than that of the destination. The user can choose among different synchronisation mechanisms by setting the numerical value of the option -sync-level according to the following mapping (taken from the man page of globus-url-copy):
- 0: transfer the source only if it does not exist on the destination machine;
- 1: the source is copied if not present on the target machine or if the size does not match;
- 2 (default if -sync-level not specified): move the source if it is newer than what is there at the destination side (and of course if it does not exist at all);
- 3: compute the checksum of the source and of the destination, performing the transfer only if the two are different.
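As an illustration of the mapping above, the sketch below builds a synchronising folder transfer for a given level and rejects values outside 0 to 3. The helper name sync_copy_cmd is hypothetical, and the command is printed rather than executed so that it can be checked before use.

```shell
#!/usr/bin/env bash
# Sketch only: sync_copy_cmd is a hypothetical helper.
# Usage: sync_copy_cmd <level 0-3> <sourceURL> <destinationURL>
sync_copy_cmd() {
    local level="$1" src="$2" dst="$3"
    case "$level" in
        0|1|2|3) ;;   # the four levels listed above
        *) echo "sync level must be 0, 1, 2 or 3" >&2; return 1 ;;
    esac
    printf 'globus-url-copy -vb -cd -r -sync -sync-level %s %s %s\n' \
        "$level" "$src" "$dst"
}
```

For instance, sync_copy_cmd 1 'file:///home/johndoe/foo_dir/' 'gsiftp://supermuc.lrz.de/~/foo_dir/' prints a command that copies only files that are missing on SuperMUC or differ in size.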
Network performance can be influenced and enhanced by tuning the mechanisms of parallelism and concurrency. In particular:
- parallelism is the number of streams used to move each single file. The flag to use is -p <number of parallel streams>. Reasonable numbers could be 4, 8, 12, and 16. Usually, beyond 16 performance reaches a plateau (or even gets worse again). You have to experiment yourself to find the optimum value, as this value depends on many LRZ-external factors;
- concurrency is the number of GridFTP servers that are started on the target machine. In other words, it is the number of files that are transferred simultaneously. The related option on the command line is -cc <number of parallel files>. Reasonable numbers could be 2 or 4.
Both mechanisms can be combined to achieve optimal results. However, it is important to remember that there is no single recipe for all cases and that the final result depends on the network bandwidth and on the machines (CPU power, memory) involved in the operation. As a rule of thumb, if there are few big files, then it is better to use only parallelism with a high number of streams, e.g., 16. If the transfer is made up of many small files, then it can be beneficial to introduce concurrency, e.g., 2 or 4, possibly combined with a low level of parallelism, e.g., 2 or 4. Please note that, for example, with -cc 8 and -p 4, globus-url-copy moves 8 files at a time and employs 4 streams for each one, leading to a total of 8 * 4 = 32 streams from the client to the destination. The client and the server should have enough resources in terms of computation and memory buffers to sustain that. The advice is to experiment with the parameters (for instance using -vb to display the current and average speed) and find the optimal compromise.
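The stream arithmetic is worth checking before starting a large transfer. The snippet below is a small sketch; total_streams is a hypothetical helper that only multiplies the two option values and does not contact any server.

```shell
#!/usr/bin/env bash
# Sketch only: total_streams is a hypothetical helper.
# Usage: total_streams <value of -cc> <value of -p>
total_streams() {
    local cc="$1" p="$2"
    echo $(( cc * p ))
}

total_streams 8 4    # prints 32, the -cc 8 -p 4 case from the text
```

The same function can be used to compare candidate settings, for example total_streams 4 2 for a transfer of many small files.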
For the sake of completeness, we mention here that in case both of the following conditions apply:
- server to server transfer, i.e., both the source and the destination are specified as gsiftp://
- the destination server is made up of multiple different nodes (e.g., the login nodes of a supercomputer) sharing the same filesystem
then the most efficient way to perform a transfer is to use so-called striping, specifying the option -stripe instead of -cc and -p. It is not necessary to enter the number of stripes; this is communicated by the target GridFTP server. Please verify with the administrators of the resources whether striping has been configured and can be used. More specifically, SuperMUC supports a striping level of 7, since this is the number of login nodes whose GridFTP services have been configured in an interconnected way.