I recently had to copy a large number of files from one S3 folder to another using s3cmd, but found the process very slow because s3cmd is single threaded. I looked into how to do this in a more parallel manner and discovered GNU Parallel, which was the answer to my dreams.
To use it I simply create a list of the files that need to be moved (a sketch of one way to build that list is shown after the notes below) and pass them to parallel along with the appropriate copy command:
- The -j20 switch tells it to use 20 parallel threads
- --halt 1 means that if one task fails, no new tasks are started, any tasks already running are allowed to finish, and the command exits with a failure status
- {/} is a GNU Parallel-specific replacement string which denotes the basename of the line read from the file passed to parallel, i.e. given s3://bucket/filename.txt, {/} returns filename.txt. Note that this requires a recent version of GNU Parallel, so install the latest stable version from source if necessary.
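If you do not already have filelist.txt, one way to build it is to list the source folder with s3cmd and keep only the object URLs. This is just a sketch: it assumes the folder contains only files (no subfolders), relies on the default four-column output of s3cmd ls, and uses placeholder bucket and folder names.

# List the source folder and keep the fourth column, which is the s3:// URL
s3cmd ls s3://bucket/folder1/ | awk '{print $4}' > filelist.txt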
# filelist.txt just contains a list of s3 files i.e. s3://bucket/filename.ext
# Parallel creates a new thread to handle each line in the file (up to the limit -j)
# Note also that we escape the $ within the command passed to parallel. If we did not
# escape it then the variable would be treated as a variable in the scope of the
# calling script rather than within the parallel call.
cat filelist.txt | parallel -j20 --halt 1 "filename={/};s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename;"
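If some copies do fail, it can also be handy to record every job's exit status with --joblog and then rerun only the failures with --resume-failed. A sketch of that variation (copy.log is just an arbitrary log file name):

# First pass: record each job's exit status in copy.log
cat filelist.txt | parallel -j20 --joblog copy.log "filename={/};s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename;"
# Second pass: rerun only the jobs that copy.log says failed
cat filelist.txt | parallel -j20 --joblog copy.log --resume-failed "filename={/};s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename;"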
If you want to perform a more complex task in parallel, it can be cumbersome and unreadable to put it all on the command line. To get around this we can just use a bash function:
#!/bin/bash

function dosomethingabitmorecomplex {
  echo "Doing something with arg $1"
  sleep 10
  echo "finished doing something with arg $1"
}

# Since parallel creates subshells for each of its threads we need to
# export the function to ensure it can be accessed by the subshell
export -f dosomethingabitmorecomplex

# testfile.txt just contains lines of text
# the {} token represents the line passed to parallel from the text file
parallel -j20 "dosomethingabitmorecomplex {}" < testfile.txt
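The same pattern extends to functions that take more than one argument: with the --colsep option each input line is split into fields, which are then available as {1}, {2}, and so on. A quick sketch, where args.csv and the copyrename function are hypothetical examples rather than anything from the post above:

#!/bin/bash

# Hypothetical helper that copies one object to a new location/name
function copyrename {
  echo "Copying $1 to $2"
  s3cmd cp "$1" "$2"
}
export -f copyrename

# Each line of args.csv looks like: s3://bucket/folder1/a.txt,s3://bucket/folder2/b.txt
parallel -j20 --colsep ',' "copyrename {1} {2}" < args.csv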
It is somewhat unusual to use ‘less’ instead of cat. And if you use {/} directly, you can avoid a lot of quoting:
cat filelist.txt | parallel -j20 --halt 1 s3cmd cp s3://bucket/folder1/{/} s3://bucket/folder2/{/}
Can’t you use this? You then don’t need Parallel, right?

$ aws configure set default.s3.max_concurrent_requests 100
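For context, that setting only raises the AWS CLI’s own concurrency limit; the copy itself would then be done with a single recursive command rather than with parallel. A sketch, assuming the AWS CLI is installed and configured, with placeholder bucket and folder names:

# Raise the CLI's internal concurrency, then let it parallelise a recursive copy itself
$ aws configure set default.s3.max_concurrent_requests 100
$ aws s3 cp s3://bucket/folder1/ s3://bucket/folder2/ --recursive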