Moving / Copying lots of S3 files quickly using GNU Parallel

I recently had to copy a large number of files from one S3 folder to another using s3cmd, but found the process very slow because it is single-threaded. I looked into how to run the copies in parallel and discovered GNU Parallel, which was the answer to my dreams.

To use it, I simply create a list of the files that need to be moved and pass it to parallel along with the copy command:

  • The -j20 switch tells it to run up to 20 jobs in parallel
  • --halt 1 means that if one job fails, no new jobs are started; the jobs already running are allowed to finish and the command as a whole fails
  • {/} is a GNU Parallel replacement string which denotes the basename of the line read from the file passed to parallel, e.g. given s3://bucket/filename.txt, {/} yields filename.txt. Note that this requires a recent version of GNU Parallel, so install the latest stable version from source if necessary.
# filelist.txt just contains a list of S3 files, e.g. s3://bucket/filename.ext
# Parallel starts a new job to handle each line in the file (up to the limit set by -j)
# Note also that we escape the $ within the command passed to parallel. If we did not
# escape it, the variable would be expanded in the scope of the
# calling script rather than within the parallel call.
cat filelist.txt | parallel -j20 --halt 1 "filename={/}; s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename"
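
Before running the real copy, it can help to preview the source and destination paths that each input line would produce. The following is a minimal sketch in plain bash, not part of the original command: the function name preview_copy is hypothetical, the two sample paths are made up, and the bucket and folder names are the placeholders used above. It assumes paths without a trailing slash, for which the parameter expansion gives the same basename as {/}.

```bash
#!/bin/bash
# preview_copy is a hypothetical helper: it prints the s3cmd command that
# would be run for one S3 path, without executing anything.
preview_copy() {
    local path=$1
    # ${path##*/} strips everything up to and including the last slash,
    # which matches what {/} yields for paths without a trailing slash.
    local filename=${path##*/}
    echo "s3cmd cp s3://bucket/folder1/$filename s3://bucket/folder2/$filename"
}

# Print one command per input line (two made-up sample paths here;
# in practice the input would come from filelist.txt)
printf '%s\n' s3://bucket/a.txt s3://bucket/b.csv | while IFS= read -r line; do
    preview_copy "$line"
done
```

GNU Parallel also has a --dry-run option, which prints the commands it would run instead of executing them, so the same check can be done with parallel itself.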

If you want to perform a more complex task in parallel, it can be cumbersome and unreadable to put everything on the command line. To get around this we can just use a bash function.

#!/bin/bash
function dosomethingabitmorecomplex {
    echo "Doing something with arg $1"
    sleep 10
    echo "finished doing something with arg $1"
}

# Since parallel runs each of its jobs in a subshell, we need to
# export the function to ensure it can be accessed by the subshell
export -f dosomethingabitmorecomplex
# testfile.txt just contains lines of text
# the {} token represents the line passed to parallel from the text file
parallel -j20 "dosomethingabitmorecomplex {}" < testfile.txt
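
Since --halt 1 reacts to a job's exit status, a function that does real work should return non-zero when it fails. Below is a minimal sketch of the copy task written that way. The names copy_one and S3_COPY are hypothetical and not from the original script; S3_COPY defaults to "echo s3cmd cp" as an assumption so that the sketch runs without s3cmd installed.

```bash
#!/bin/bash
# copy_one is a hypothetical function name; S3_COPY is an assumed variable
# that defaults to "echo s3cmd cp" so the sketch runs without s3cmd.
S3_COPY=${S3_COPY:-echo s3cmd cp}

copy_one() {
    local filename=$1
    # Treat a blank input line as a failure rather than copying nothing
    if [ -z "$filename" ]; then
        echo "copy_one: empty filename" >&2
        return 1
    fi
    # The function's exit status is that of the copy command, which is
    # what parallel would see as the job's exit status
    $S3_COPY "s3://bucket/folder1/$filename" "s3://bucket/folder2/$filename"
}

# Export both the function and the variable so subshells can see them
export -f copy_one
export S3_COPY

# Direct calls, to show the exit status each case produces
copy_one report.txt; echo "exit status: $?"
copy_one "" 2>/dev/null; echo "exit status: $?"
```

The function would then be handed to parallel in the same way as above, for example parallel -j20 --halt 1 "copy_one {}" < filelist.txt, with S3_COPY set to "s3cmd cp" for a real run.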

3 thoughts on “Moving / Copying lots of S3 files quickly using GNU Parallel”

  1. Ole Tange

    It is somewhat unusual to use ‘less’ instead of cat. And if you use {/} directly, you can avoid a lot of quoting:

    cat filelist.txt | parallel -j20 --halt 1 s3cmd cp s3://bucket/folder1/{/} s3://bucket/folder2/{/}

  2. Pingback: parallel @ Savannah: GNU Parallel 20140622 (‘Brazil’) released | Open World

  3. Hasib

    Can’t you use this? You then don’t need Parallel, right?

    $ aws configure set default.s3.max_concurrent_requests 100
