My days as a big data engineer have grown darker lately. Almost all the critical data that our aggregations, partners and customers want to access now lives in a bucket on Amazon's S3. Since that gets horribly expensive for the vast amounts of data we create on a daily basis, we lean towards keeping not much more than a month of active objects in the bucket. To meet our SLAs regarding backup and recovery, we keep them in Glacier for about a year afterwards.
That sounds pretty cool until shit hits the fan, because when that happens you have to restore data from Glacier. I was pretty shocked when I found out that Glacier needs one API request per object you want to restore. In this case a consumer wanted just one day of data back; unfortunately, that one day added up to ~75k objects.
Obviously I flatly refused to click the objects back into the bucket one by one, so I wrote a small bash script that essentially did this:
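# Read one key per line so that keys containing spaces stay intact
while IFS= read -r key; do
  # stdin is redirected so the command can never swallow the key list
  if ! aws s3api restore-object --region us-east-1 --bucket very.large.bucket --key "$key" --restore-request Days=3 < /dev/null; then
    echo "^ was on file: $key"
  fi
done < files_to_restore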
Each CLI request took around 2s to finish. At ~75k objects, that amounts to about 42h to restore 24h of data. Sounds stupid? Yeah, that's because it is.
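For the curious, the back-of-the-envelope math behind that number can be sketched in a few lines of shell arithmetic. The variable names are made up for illustration; the figures are the rough ones from above.

```shell
objects=75000        # roughly one day of data
secs_per_request=2   # observed duration of a single CLI call

# One request at a time: total seconds is simply objects * seconds per request
total_secs=$(( objects * secs_per_request ))
hours=$(( total_secs / 3600 ))
minutes=$(( (total_secs % 3600) / 60 ))

echo "Sequential restore: ${hours}h ${minutes}m"
# prints: Sequential restore: 41h 40m
```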
With GNU parallel not available on my shiny new MacBook, I looked for an alternative and found that xargs can do just that. Using its parallelisation option (-P), I ended up with around 24 requests instead of 1 within 2s.
Look at that beautiful bash one-liner, which should work with any AWS CLI command:
cat files_to_restore | xargs -t -I {} -P 5 aws s3api restore-object --region us-east-1 --bucket very.large.bucket --key {} --restore-request Days=3
Here are the options explained in detail (if you're too lazy to check man xargs):
-t writes each command to STDERR before executing it (very helpful if something goes wrong)
-I sets the placeholder (here {}) that gets replaced with each input line
-P the maximum number of processes allowed to run in parallel (I've set this to the number of CPUs + 1)
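If you'd rather not hard-code the 5, here is a minimal sketch of how the same call could derive "CPUs + 1" itself and also remember which keys failed, like the loop version did. This is not what I actually ran: the helper name restore_all, the failed_keys log file and the fallback of 4 CPUs are all made up for illustration, and it assumes getconf _NPROCESSORS_ONLN is available on your machine.

```shell
# CPUs + 1 workers; the fallback of 4 is an arbitrary assumption
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
jobs=$(( cpus + 1 ))

# restore_all KEYFILE [FAILED_LOG]  (hypothetical helper)
# KEYFILE holds one object key per line; failing keys are appended to FAILED_LOG
restore_all() {
  keyfile=$1
  failed=${2:-failed_keys}
  # sh -c receives the key as $1 and the log file as $2,
  # so a failed request can record which key it belonged to
  xargs -I {} -P "$jobs" sh -c '
    aws s3api restore-object --region us-east-1 --bucket very.large.bucket \
      --key "$1" --restore-request Days=3 || echo "$1" >> "$2"
  ' _ {} "$failed" < "$keyfile"
}
```

Calling restore_all files_to_restore would then leave any keys that need a second attempt in failed_keys, which can be fed straight back in as the next key file.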