In order to reduce the costs associated with bandwidth, I wrote a simple download script a while ago that caches the most recently accessed files from Amazon S3 on a less expensive hosting plan. This script was written using a combination of Python, Nginx, and Flask, but the same effect could be accomplished in any language / framework that supports setting response headers.
At the time of writing, Amazon charges at least 12 cents per gigabyte of data transfer, versus the terabytes of transfer routinely offered by dedicated server companies for less than $100 per month. In my particular case, 32TB of traffic was clustered around the most popular files on any given day. Managing the storage on each individual server would be impractical without using S3 as a backend, but it didn't make financial sense to pay disproportionate transfer costs for a small subset of files that changed predictably.
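For a rough sense of scale, at $0.12 per gigabyte, 32 TB (about 32,768 GB) of transfer out of S3 comes to just under $4,000 per month, compared with roughly $100 per month for a dedicated server that already includes the bandwidth.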
Quickstart
If you have docker installed, you can quickly see my script in action by running
docker run -p 8000:80 -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e S3_BUCKET=$S3_BUCKET -t bluelaguna/s3cache
Where $AWS_ACCESS_KEY_ID, $AWS_SECRET_ACCESS_KEY, and $S3_BUCKET are stored in environment variables or replaced with their actual values.
You should then be able to access any file stored in $S3_BUCKET using http://localhost:8000/download/path/to/file. This will redirect to Amazon S3 the first time, but send the file directly the second time.
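If you want to verify that behaviour from code rather than a browser, the sketch below (Python 2, like the script later in this post) issues the same request twice; replace path/to/file with a real key in your bucket.

# Minimal cache-behaviour check (a sketch, not part of the app itself).
# Assumes the container above is listening on localhost:8000 and that
# "path/to/file" is replaced with a real key in $S3_BUCKET.
import httplib
import time

def head(path):
    conn = httplib.HTTPConnection("localhost", 8000)
    conn.request("HEAD", path)
    return conn.getresponse()

first = head("/download/path/to/file")
print("%d %s" % (first.status, first.getheader("Location")))  # 302 plus a signed S3 URL

time.sleep(5)  # give the background sync a moment (longer for large files)

second = head("/download/path/to/file")
print(second.status)  # 200, served straight from the local cache

The first response carries a signed S3 URL in its Location header; once the background sync has finished, the same request is answered directly from the cache.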
Setting up the download script
If you want to follow along with what I did, or create your own version of this script, you’ll need to install a few things first. This all assumes a fairly recent version of Ubuntu / Debian, but the same packages can also be found on other operating systems.
Pre-requisites
sudo apt-get install build-essential python-pip python-dev
sudo pip install flask boto uwsgi
download.py
Next, create a new Python file in the directory of your choice. I decided to go with /home/s3cache/download.py. You should probably place this app in its own folder for now.
from flask import Flask, redirect
import boto
import boto.s3
import os
import os.path
import threading
import urllib
from time import time
from boto.s3.key import Key
from flask import make_response

app = Flask(__name__)
app.debug = True

'''Change the constants below to your specific details'''
CACHE_ROOT = "/var/www"

#Timeout in seconds
CACHE_TIMEOUT = 3600 * 24 * 30

#AWS environment variables
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
S3_BUCKET = os.environ["S3_BUCKET"]

access_times = {}


@app.route("/download/<path:filename>")
def download(filename):
    path = os.path.join(CACHE_ROOT, filename)
    access_times[path] = time()

    # Check if the file exists and a file sync is not in progress
    if os.access(path, os.F_OK) is False and os.access(path + ".lock", os.F_OK) is False:
        conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        bucket = conn.get_bucket(S3_BUCKET)
        k = Key(bucket, filename)
        url = k.generate_url(3600, query_auth=True, force_http=True)

        # Attempt to resync the missing file
        sync_thread = threading.Thread(target=sync_file, args=(url, path))
        sync_thread.start()

        # Clear cache of files that haven't been accessed recently
        clear_thread = threading.Thread(target=clear_cache)
        clear_thread.start()

        return redirect(url, code=302)
    else:
        response = make_response()
        response.headers['Content-Type'] = ""
        response.headers['Content-Disposition'] = 'attachment; filename="%s"' % (os.path.basename(filename),)
        response.headers['X-Accel-Redirect'] = os.path.join("/internal-redirect", filename)
        return response


def sync_file(source, destination):
    if os.access(destination + ".lock", os.F_OK) is False:
        if os.access(os.path.dirname(destination), os.F_OK) is False:
            os.makedirs(os.path.dirname(destination))

        lock = open(destination + ".lock", 'w')
        lock.write("")
        lock.flush()
        lock.close()

        try:
            urllib.urlretrieve(source, destination)
        except Exception as e:
            print(e)
        finally:
            os.remove(destination + ".lock")


def clear_cache():
    '''
    Removes files contained within CACHE_ROOT that have not been
    accessed within the CACHE_TIMEOUT period.
    '''
    for root, dirs, files in os.walk(CACHE_ROOT):
        for name in files:
            path = os.path.join(root, name)
            if path in access_times:
                access_time = access_times[path]
            else:
                access_time = os.stat(path).st_atime
            if (time() - access_time) > CACHE_TIMEOUT:
                os.remove(path)


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080)
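As a quick sanity check of the cache-hit branch, you can drive the route with Flask's built-in test client. This is only a sketch: it assumes the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET environment variables are set (they are read at import time) and that CACHE_ROOT already contains a file, here hypothetically named example.txt.

# Sketch only: example.txt stands in for any file already present in CACHE_ROOT.
from download import app

with app.test_client() as client:
    resp = client.get("/download/example.txt")
    # On a cache hit the app sends no body of its own; it only sets
    # X-Accel-Redirect and leaves the actual file serving to nginx.
    print(resp.status_code)                        # 200
    print(resp.headers.get("X-Accel-Redirect"))    # /internal-redirect/example.txt

Note that a cache hit only sets headers; the file itself is streamed by nginx, so running the Flask development server on its own will return an empty body here.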
Setting Up the Nginx web server
The nginx web server now comes packaged with most Linux distributions. In Ubuntu / Debian, you’ll want to run the following command to install the version that includes all the features we need.
sudo apt-get install nginx-full
Site Config
Create a new file in /etc/nginx/site-enabled (for example /etc/nginx/sites-enabled/s3cache.conf)
server {
    listen 80;
    server_name localhost;
    charset utf-8;
    client_max_body_size 75M;

    location / {
        try_files $uri @s3cache;
    }

    location @s3cache {
        include uwsgi_params;
        uwsgi_pass unix:/home/s3cache/uwsgi.sock;
    }

    location /internal-redirect/ {
        internal;
        alias /var/www/;
    }
}
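To make the internal redirect concrete: a cache hit for a hypothetical request to /download/videos/clip.mp4 responds with X-Accel-Redirect: /internal-redirect/videos/clip.mp4, which the alias directive rewrites to /var/www/videos/clip.mp4, and nginx then streams that file itself without involving the Flask app again.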
Be sure to adjust the value of server_name, point the uwsgi_pass socket path at your app folder (/home/s3cache here), and make the alias under /internal-redirect/ match CACHE_ROOT in download.py. From there, restart nginx or reload the configuration for the changes to take effect.
sudo /etc/init.d/nginx restart
However, if you try to access the site now, you’ll get an error. If you look through the configuration, you’ll notice a reference to a socket connection (uwsgi.sock). We’ll need to configure uwsgi to create it by adding a uwsgi.ini file in the same folder as our app (/home/s3cache).
uwsgi.ini
[uwsgi]
#application's base folder
base = /home/s3cache

#python module to import
app = download
module = %(app)

pythonpath = %(base)

#socket file's location
socket = /home/s3cache/%n.sock

#permissions for the socket file
chmod-socket = 666

#the variable that holds the flask application inside the module imported above
callable = app

#location of log files
logto = /tmp/%n.log

#Enable threads so that syncing / removing files will work
enable-threads = True
To start the app, simply run uwsgi uwsgi.ini in the app folder. You should then be able to access any file stored in $S3_BUCKET using http://localhost/download/path/to/file. This will redirect to Amazon S3 the first time, but send the file directly the second time. In a production environment, you would want to adjust uwsgi.ini to run as another user and set up a uwsgi emperor to manage the process.
John David Reaver
Oct 20, 2014
Excellent! This is a clear example of how to cache S3 downloads, and how to use boto to allow users to download files.