Category Archives: Shell

Moving / Copying lots of s3 files quickly using gnu parallel

I recently had to copy a large volume of files from one S3 folder to another using s3cmd, but found the process very slow because it is single threaded. I looked into how to do this in a more parallel manner and discovered GNU Parallel, which was the answer to my dreams.

To use it I simply create a list of the files that need to be moved and pass it to parallel with the correct copy command:

The -j20 switch tells it to run up to 20 jobs in parallel.
--halt 1 means that if one job fails, no new jobs are started; the jobs already running are allowed to finish and the command then exits with a failure status.
{/} is a GNU Parallel specific token which denotes the basename of the line read from the file passed to parallel, e.g. given s3://bucket/filename.txt, {/} returns filename.txt. Note that this requires a recent version of GNU Parallel, so install the latest stable version from source if necessary.

# filelist.txt just contains a list of s3 files, one per line, e.g. s3://bucket/filename.ext
# Parallel starts a new job to handle each line in the file (up to the -j limit)
# Note also that we escape the $ within the command passed to parallel (\$filename).
# If we did not escape it then the variable would be expanded in the scope of the
# calling script rather than within the parallel call.
cat filelist.txt | parallel -j20 --halt 1 "filename={/}; s3cmd cp s3://bucket/folder1/\$filename s3://bucket/folder2/\$filename"
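
The effect of that escaping can be checked without touching S3. The sketch below is only an illustration: a plain bash -c child shell stands in for the shell that parallel starts, and the variable values are made up.

```shell
# Illustration only: bash -c stands in for the shell that parallel starts.
filename="outer"   # a variable that happens to exist in the calling script

# Unescaped: the calling shell substitutes $filename before the child runs.
unescaped=$(bash -c "filename=inner; echo $filename")

# Escaped: the child receives a literal $filename and expands it itself.
escaped=$(bash -c "filename=inner; echo \$filename")

echo "unescaped: $unescaped"   # prints "unescaped: outer"
echo "escaped: $escaped"       # prints "escaped: inner"
```

Only the escaped form picks up the value that was set inside the command, which is what the copy command relies on.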

If you want to perform a more complex task in parallel, it can be cumbersome and unreadable to put everything on the command line. To get around this we can just use a bash function.

#!/bin/bash
function dosomethingabitmorecomplex {
    echo "Doing something with arg $1"
    sleep 10
    echo "finished doing something with arg $1"
}

# Since parallel creates subshells for each of its threads we need to 
# export the function to ensure it can be accessed by the subshell
export -f dosomethingabitmorecomplex 
# testfile.txt just contains lines of text
# the {} token represents the line passed to parallel from the text file
parallel -j20 "dosomethingabitmorecomplex {}" < testfile.txt
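
Why the export is needed can be seen with a small experiment. This is a sketch under assumptions: bash -c stands in for a shell started by parallel, and greet_demo is a made-up function name used only here.

```shell
#!/bin/bash
# Illustration only: bash -c stands in for a shell started by parallel,
# and greet_demo is a made-up function name.
greet_demo() {
    echo "hello $1"
}

# Before exporting, a child bash cannot see the function.
before=$(bash -c 'greet_demo world' 2>/dev/null || echo "not found")

# After export -f, the definition is passed on through the environment.
export -f greet_demo
after=$(bash -c 'greet_demo world')

echo "before export: $before"   # prints "before export: not found"
echo "after export: $after"     # prints "after export: hello world"
```

Note that export -f is a bash feature, so the calling script needs to run under bash rather than plain sh.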

A simple script to list s3 bucket sizes

Mounting a Windows 7 Share in Linux


I run an Ubuntu VM using VMware on my Windows 7 machine, and I find it very useful to be able to access files on the Windows machine from inside Linux. This is actually quite easy to achieve, and you can be up and running in about five minutes if nothing goes wrong.

First, in Windows, share the folder that you want to access from your Linux box. Make sure that the user you want to grant access to has read and write permissions (by default this is the current admin user, so most likely you won't have to worry about this).

Now create an .smbcredentials file somewhere on your linux machine. I just used :

/home/andrew/.smbcredentials

but you can put it wherever floats your boat.

Now edit the .smbcredentials file and add the username and password for the Windows machine in this format:

username=andrew
password=andrewspassword
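
Since this file holds a password in plain text, it is worth making it readable only by its owner. A minimal sketch, where secure_credentials is a made-up helper name and a temporary file stands in for the real .smbcredentials:

```shell
# secure_credentials is a hypothetical helper, not part of any package.
secure_credentials() {
    # $1: path to the credentials file; create it if missing, then
    # restrict it to read/write for the owner only.
    touch "$1" && chmod 600 "$1"
}

# Demo on a throwaway file that starts out world-readable.
demo_file=$(mktemp)
chmod 644 "$demo_file"
secure_credentials "$demo_file"
ls -l "$demo_file"
```

For the real file you would pass its path instead, for example secure_credentials /home/andrew/.smbcredentials.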

Next you need to create the entry in your /etc/fstab which will mount the directory on startup.
Add this line to your /etc/fstab file, replacing the first field with your share (in //host/share form), the second with your mount point, and the credentials path with your own:

///dev /mnt/ smbfs iocharset=utf8,credentials=/path/to/.smbcredentials,uid=1000 0 0

Next you need to create the mount point and make sure you can access it as a non-root user:

sudo mkdir /mnt/
sudo chmod -R 775 /mnt/
sudo chown -R :root /mnt/

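Finally, ensure that smbfs is installed on your system (on newer Ubuntu releases it has been replaced by the cifs-utils package, and the filesystem type in /etc/fstab becomes cifs). In Ubuntu just use: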

sudo apt-get install smbfs

Now just mount it:

sudo mount -a

If you get an error like this :

mount: wrong fs type, bad option, bad superblock on //mywindows/myshare,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might....

You probably haven't installed smbfs correctly, so run:

sudo apt-get install smbfs

again.
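
Before reinstalling, it can help to check whether the mount helper is present at all. mount hands filesystem types such as smbfs and cifs to a helper program named mount.<type>, normally found under /sbin; the sketch below assumes that layout, and has_mount_helper is a name made up for this example.

```shell
# has_mount_helper is a hypothetical helper name used for illustration.
has_mount_helper() {
    # $1: filesystem type as written in /etc/fstab, e.g. smbfs or cifs
    [ -x "/sbin/mount.$1" ] || command -v "mount.$1" >/dev/null 2>&1
}

for fstype in smbfs cifs; do
    if has_mount_helper "$fstype"; then
        echo "mount.$fstype found"
    else
        echo "mount.$fstype missing"
    fi
done
```

If the helper for the type named in your fstab line is missing, the package that provides it has not been installed properly.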

A simple “treesize” shell script for Linux


One of my favorite pieces of software on Windows is a little app called TreeSize Free by JAM Software. It basically gives you a simple list of how much disk space each directory is taking up. This is really useful when you are trying to work out where all the space on your 500 GB disk has gone.

I am always finding myself looking for a similar piece of software for Linux which can be run simply from the command line, but alas I could not find one, so I decided to create a simple shell script to do a similar job, and here it is:

#!/bin/sh
du -k --max-depth=1 | sort -nr | awk '
     BEGIN {
        split("KB,MB,GB,TB", Units, ",");
     }
     {
        u = 1;
        while ($1 >= 1024) {
           $1 = $1 / 1024;
           u += 1
        }
        $1 = sprintf("%.1f %s", $1, Units[u]);
        print $0;
     }
    '

Just put this code into a file /bin/treesize and make it executable (chmod +x /bin/treesize). Then any user on the system can list the size of each subdirectory of the current directory by running treesize from there.
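
To see what the awk loop in the treesize script does to each size, the same conversion can be tried on a single number. This is a sketch rather than part of the script: humanize_kb is a made-up name, and it adds a guard (u < 4) so the unit index never runs past TB.

```shell
# humanize_kb is a hypothetical helper that mirrors the awk loop above.
humanize_kb() {
    # $1: a size in kilobytes, as printed by du -k
    awk -v kb="$1" 'BEGIN {
        split("KB,MB,GB,TB", units, ",")
        u = 1
        # Divide by 1024 until the value fits the current unit.
        while (kb >= 1024 && u < 4) {
            kb = kb / 1024
            u += 1
        }
        printf "%.1f %s\n", kb, units[u]
    }'
}

humanize_kb 500       # prints "500.0 KB"
humanize_kb 1536      # prints "1.5 MB"
humanize_kb 1048576   # prints "1.0 GB"
```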

Andrew Clarke's Blog!

Random thoughts
