Jaymatter

I do not fear computers. I fear the lack of them.

rc.local

Part of running a cluster means automating tasks across many computers. Putting executable scripts in a repo of some kind (I use a private GitHub one) makes it easy to make changes for testing, and you definitely want a script that automatically checks out and runs your executables at boot. Many folks will already know how to do this, so this may not be useful to many. I might just be posting this here for posterity.

In /etc/init.d/rc.local do something like:

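# Wait two minutes after boot before doing anything else.
sleep 120s
# The rest is what you would type on the command line.
# -p keeps mkdir from failing if the directory already exists.
sudo mkdir -p /home/user/dest
# Check out the repo, then start the app from the checkout.
git clone git@github.com:your-repo/files.git /dest/path
node /dest/path/app.js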

The sleep makes the script wait for two minutes before running the rest. I’ve found that without it, things don’t always work as expected. There are 25th-level Linux wizards out there who could tell me why this is necessary - if you’re one and you feel like sharing some education, hit me up on Twitter.

Everything else is pretty much what you would run on the command line.
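One likely reason the sleep helps (this is my assumption, not something confirmed above) is that rc.local can run before the network is up, so the git clone has nothing to talk to. Here is a minimal sketch of polling for a condition instead of sleeping for a fixed time. `wait_for` is a hypothetical helper name, and the `getent hosts github.com` check in the usage comment is only a guess at what the script actually needs:

```shell
#!/bin/bash
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retry COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_for() {
    local timeout=$1
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use in place of "sleep 120s" (assumes the clone needs DNS for github.com):
# wait_for 120 getent hosts github.com || exit 1
```

With something like this a node would start as soon as the check passes, and would still give up after two minutes if it never does.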

Test your script by running the rc.local file as you would any other bash script:

/etc/init.d/rc.local

Once the script is doing what you expect, you can do whatever you do to spin up your cluster. Each node should sleep for two minutes after it boots and then run your commands.
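One thing to watch when testing by hand: git clone refuses to clone into a directory that already exists and is not empty, so a second run (or a reboot of a node that keeps its disk) will fail at that step. Here is a rough sketch of a clone-or-update step. `sync_repo` is a hypothetical helper, and the `GIT` variable is an assumed override that exists only so the logic can be dry-run:

```shell
#!/bin/bash
# sync_repo REPO_URL DEST
# Hypothetical helper: clone REPO_URL into DEST the first time, and
# fast-forward the existing checkout on later runs.
# GIT is an assumed override (defaults to git) for dry runs.
sync_repo() {
    local repo=$1
    local dest=$2
    if [ -d "$dest/.git" ]; then
        # A checkout already exists: update it in place.
        "${GIT:-git}" -C "$dest" pull --ff-only
    else
        # No checkout yet: clone creates DEST itself.
        "${GIT:-git}" clone "$repo" "$dest"
    fi
}

# Possible use, with the placeholder repo and path from the script above:
# sync_repo git@github.com:your-repo/files.git /dest/path
```

Running it with `GIT=echo` prints the git arguments that would be used without touching the network.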

Happy clustering.

Moving Millions of Files.

These past months I’ve been working to gather large data sets from the internet using a cluster on AWS and Rackspace. (Just a quick note - I’ve been totally polite about wielding the cluster of 200 virtual machines. More about how I did this later.) The end result of this data gathering was 33 million HTML files. Since our team has been running pretty lean to start, the 800GB data set was stored on a 2006 Xserve, where it sat for a couple of weeks on an HFS+ formatted drive. Having only 1GB of memory, this machine was useful for not much more than storing the files.

This week I set up a new server that was more capable of running the tasks I wanted to run over the millions of files. I could have physically moved the drive, but I didn’t want to shift my work focus to learning the intricacies of either rebuilding the Linux kernel with HFS+ support, or removing journaling from the drive so that it would mount in Linux, along with the caveats I’ve read that solution brings. Not my job. Plus, I felt that a purely network-oriented solution was more appropriate for the cloud space I’m working in.

The obvious (to me) solution here was rsync, which I thought would be a snap. I first used this:

rsync -avze ssh user@ip-address:/source/path/* /local/folder/

Considering I had 33 million files, I expected the incremental file list to take a long time to generate. After hitting enter, my terminal responded with a blank line, which I guessed was okay since rsync had to think about all those files. I left it overnight in a tmux session. When I came back the next day, I had an error: “Argument list too long”. A quick Google search showed me that I didn’t want to use the wildcard to specify every file in the directory - I could specify just the directory. This worked for me:

rsync -avze ssh user@ip-address:/source/path/. /local/folder/

After entering my password for the remote server, rsync immediately reported that it was sending the incremental file list. Much better than the blank line I was getting before, and after about 20 minutes the files started to transfer. Hopefully this post helps others out.
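For anyone curious about the error: "Argument list too long" is what a system reports when the combined size of a command's arguments goes over a kernel limit. My understanding (an assumption about this particular case) is that the `*` was expanded into millions of file names on the remote side, which blew past that limit, while the `/.` form passes a single path. Here is a rough sketch for checking the limit and estimating how big a glob would be. `glob_bytes` is a hypothetical helper, and its number is only an approximation of what the kernel counts:

```shell
#!/bin/bash
# Maximum combined size, in bytes, of arguments plus environment for one command.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX is $arg_max bytes"

# glob_bytes DIR
# Hypothetical helper: estimate the bytes that "DIR/*" would expand to,
# counting each path plus one separator byte. Dotfiles are skipped because
# a plain * does not match them. Uses find so that the estimate itself
# never builds a huge argument list.
glob_bytes() {
    find "$1" -mindepth 1 -maxdepth 1 ! -name '.*' -print | wc -c
}

# Possible use on the machine that holds the files:
# glob_bytes /source/path
```

If the estimate comes out anywhere near the ARG_MAX value, pointing rsync at the directory itself is the safer form.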