Moving / Copying lots of s3 files quickly using gnu parallel

I recently had to copy a large volume of files from one s3 folder to another using s3cmd but found that the process was very slow as it is single threaded. I looked into how to do this in a more parallel manner and discovered GNU Parallel which was the answer to my dreams.

To use it I simply create a list of the files that need to be moved and pass them to parallel along with the copy command:

  • The -j20 switch tells it to run up to 20 jobs in parallel
  • --halt 1 means that if one job fails then no new jobs are started; the jobs already running are allowed to finish and the command then exits with a failure status
  • {/} is a GNU Parallel specific token which denotes the basename of the line from the file passed to parallel i.e. given s3://bucket/filename.txt {/} returns filename.txt. Note that this requires a recent version of GNU parallel so install the latest stable version from source if necessary.
# filelist.txt just contains a list of s3 files i.e. s3://bucket/filename.ext
# Parallel starts a new job to handle each line in the file (up to the limit set by -j)
# Note also that we escape the $ within the command passed to parallel. If we did not
# escape it then the variable would be expanded in the scope of the
# calling script rather than within the parallel call.
cat filelist.txt | parallel -j20 --halt 1 "filename={/};s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename"
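
The escaping point is easy to get wrong, so here is a minimal sketch of it that does not need parallel or s3cmd at all. It uses sh -c as a stand-in for the shell that parallel starts for each job, and the variable names and values are made up purely for illustration.

```shell
# Hypothetical values, for illustration only.
filename="outer.txt"

# Unescaped: the calling shell expands $filename while building the string.
unescaped="echo $filename"

# Escaped: the literal text $filename survives, so the child shell expands it.
escaped="echo \$filename"

# The child shell is given its own value of filename via the environment.
filename="inner.txt" sh -c "$unescaped"   # prints outer.txt
filename="inner.txt" sh -c "$escaped"     # prints inner.txt
```

The first call prints the value from the calling script because the substitution already happened before the child shell started, which is exactly what the escaped \$filename in the parallel command avoids.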

If you want to perform a more complex task in parallel it can be cumbersome and unreadable to put everything on the command line. To get around this we can just use a bash function.

function dosomethingabitmorecomplex {
    echo "Doing something with arg $1"
    sleep 10
    echo "finished doing something with arg $1"
}

# Since parallel runs each of its jobs in a new shell we need to
# export the function to ensure it can be accessed by that shell
export -f dosomethingabitmorecomplex 
# testfile.txt just contains lines of text
# the {} token represents the line passed to parallel from the text file
parallel -j20 "dosomethingabitmorecomplex {}" < testfile.txt
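
To see why the export matters, here is a small sketch that leaves parallel out and uses a plain bash -c child process instead, on the assumption that it behaves like the shells parallel starts when bash is the shell in use. The function name describe_item is hypothetical.

```shell
# Hypothetical function, for illustration only.
describe_item() {
    echo "item: $1"
}

# Before the export a child bash process cannot see the function,
# so the call fails and the fallback message is printed instead.
bash -c 'describe_item a' 2>/dev/null || echo "not visible before export"

# export -f is a bash feature that places the function definition
# in the environment of child processes.
export -f describe_item

# After the export the child process inherits the definition.
bash -c 'describe_item a'
```

Note that export -f is specific to bash, so this only helps when the jobs are also run by bash.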

A simple script to list s3 bucket sizes

This script uses the excellent s3cmd to list all the buckets in an account along with their sizes in GB and MB (both rounded down) and in bytes. Amazingly, there doesn't seem to be an easy way to do this within the S3 web interface.

# The third column of "s3cmd ls" output is the bucket URL
buckets=`s3cmd -c .yours3cfgfile ls | awk '{print $3}'`
for bucket in $buckets
do
    # The first column of "s3cmd du" output is the size in bytes
    size=`s3cmd -c .yours3cfgfile du "$bucket" | awk '{print $1}'`
    sizemb=`expr $size / \( 1024 \* 1024 \)`
    sizegb=`expr $sizemb / 1024`
    echo "$bucket ${sizegb} GB ${sizemb} MB ${size} bytes"
done
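
If you want to check the size arithmetic without touching S3, the conversion can be pulled out into a small helper. This is only a sketch: format_size is a hypothetical name, it uses shell arithmetic rather than expr, and the byte count passed in is made up.

```shell
# format_size is a hypothetical helper; it takes a size in bytes.
format_size() {
    bytes=$1
    # Integer division, so both values are rounded down.
    mb=$(( bytes / (1024 * 1024) ))
    gb=$(( mb / 1024 ))
    echo "${gb} GB ${mb} MB ${bytes} bytes"
}

# 3221225472 bytes is exactly 3 * 1024 * 1024 * 1024.
format_size 3221225472
```

Anything under 1 MB comes out as 0 GB 0 MB, which is why the byte count is printed as well.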

Hope someone finds this useful.