
Moving / Copying lots of S3 files quickly using GNU Parallel

I recently had to copy a large number of files from one S3 folder to another using s3cmd, but found that the process was very slow because s3cmd copies files one at a time (it is single-threaded). I looked into how to do this in a more parallel manner and discovered GNU Parallel, which was the answer to my dreams.

To use it, I simply create a list of the files that need to be moved and pass them to parallel with the appropriate copy command:

  • The -j20 switch tells it to run up to 20 jobs in parallel
  • --halt 1 means that if one job fails, no new jobs are started, the jobs already running are allowed to finish, and the overall command exits with an error
  • {/} is a GNU Parallel-specific token which denotes the basename of the line passed to parallel, i.e. given s3://bucket/filename.txt, {/} returns filename.txt (see the quick demo below). Note that this requires a recent version of GNU Parallel, so install the latest stable version from source if necessary.
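
To see what {/} does in isolation, you can feed parallel a literal value with its ::: argument syntax (a quick sketch, using a made-up S3 path):

# {/} strips everything up to and including the last / from the input
parallel echo {/} ::: s3://bucket/folder1/filename.txt
# prints: filename.txt

The full copy command then looks like this: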
# filelist.txt just contains a list of s3 files i.e. s3://bucket/filename.ext
# Parallel runs a job for each line in the file (up to the limit set by -j)
# Note also that we escape the $ within the command passed to parallel. If we did not
# escape it then the variable would be expanded in the scope of the calling shell
# (where it is unset) rather than inside the subshell that parallel spawns for each job.
cat filelist.txt | parallel -j20 --halt 1 "filename={/}; s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename"
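
Before running a command like this against a real bucket, it is worth checking what parallel will actually execute. GNU Parallel's --dry-run flag prints each generated command instead of running it (a sketch, reusing the placeholder bucket and file list from above):

# Prints one fully substituted command per input line, without executing anything
cat filelist.txt | parallel --dry-run -j20 --halt 1 "filename={/}; s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename"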

If you want to perform a more complex task in parallel, it can be cumbersome and unreadable to cram everything onto the command line. To get around this we can use a bash function.

#!/bin/bash
function dosomethingabitmorecomplex {
    echo "Doing something with arg $1"
    sleep 10
    echo "finished doing something with arg $1"
}

# Since parallel runs each job in a subshell, we need to export
# the function so that it is visible inside those subshells
export -f dosomethingabitmorecomplex
# testfile.txt just contains lines of text
# the {} token represents the line passed to parallel from the text file
parallel -j20 "dosomethingabitmorecomplex {}" < testfile.txt
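
Combining the two ideas, the copy job from the first example can be wrapped in a function as well, which avoids the $-escaping entirely (a sketch, reusing the same placeholder bucket and folder names):

#!/bin/bash
function copys3file {
    # $1 is a full s3 path read from filelist.txt, e.g. s3://bucket/filename.ext
    filename=$(basename "$1")
    s3cmd cp "s3://bucket/folder1/$filename" "s3://bucket/folder2/$filename"
}
export -f copys3file
parallel -j20 --halt 1 "copys3file {}" < filelist.txt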

A simple script to list s3 bucket sizes

This script uses the excellent s3cmd to list all buckets in an account along with their sizes, rounded down to the nearest GB, MB, and byte. Amazingly, there doesn't seem to be an easy way to do this within the S3 web interface.

#!/bin/sh
# .yours3cfgfile is your s3cmd configuration file
buckets=`s3cmd -c .yours3cfgfile ls | awk '{print $3}'`
for bucket in $buckets
do
    size=`s3cmd -c .yours3cfgfile du "$bucket" | awk '{print $1}'`
    sizemb=`expr $size / \( 1024 \* 1024 \)`
    sizegb=`expr $sizemb / 1024`
    echo "$bucket ${sizegb} GB ${sizemb} MB ${size} bytes"
done
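
If GNU coreutils is available, numfmt can produce a human-readable size directly and saves the expr arithmetic (a sketch, with s3://mybucket standing in for a real bucket name):

# numfmt reads the raw byte count on stdin and prints e.g. 1.2G
s3cmd -c .yours3cfgfile du s3://mybucket | awk '{print $1}' | numfmt --to=iec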

Hope someone finds this useful.