In order to reduce the costs associated with bandwidth, I wrote a simple download script a while ago that caches the most recently accessed files from Amazon S3 on a less expensive hosting plan. This script was written using a combination of Python, Nginx, and Flask, but the same effect could be accomplished in any language / framework that supports setting response headers.
At the time of writing, Amazon charges at least 12 cents per gigabyte of data transfer, versus the terabytes of transfer routinely offered by dedicated server companies for less than $100 per month. In my particular case, 32TB of traffic was clustered around the most popular files on any given day. Managing the storage on each individual server would be impractical without using S3 as a backend, but it didn't make financial sense to pay disproportionate transfer costs for a small subset of files that changed predictably.
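For a rough sense of scale, at $0.12 per gigabyte, 32 TB (about 32,768 GB) of transfer out of S3 comes to just under $4,000 per month, compared with roughly $100 per month for a dedicated server that already includes the bandwidth.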
Quickstart
If you have docker installed, you can quickly see my script in action by running
docker run -p 8000:80 -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e S3_BUCKET=$S3_BUCKET -t bluelaguna/s3cache
Where $AWS_ACCESS_KEY_ID, $AWS_SECRET_ACCESS_KEY, and $S3_BUCKET are stored in environment variables or replaced with their actual values.
You should then be able to access any file stored in $S3_BUCKET using http://localhost:8000/download/path/to/file. This will redirect to Amazon S3 the first time, but send the file directly the second time.
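If you want to verify that behaviour from code rather than a browser, the sketch below (Python 2, like the script later in this post) issues the same request twice; replace path/to/file with a real key in your bucket.

# Minimal cache-behaviour check (a sketch, not part of the app itself).
# Assumes the container above is listening on localhost:8000 and that
# "path/to/file" is replaced with a real key in $S3_BUCKET.
import httplib
import time

def head(path):
    conn = httplib.HTTPConnection("localhost", 8000)
    conn.request("HEAD", path)
    return conn.getresponse()

first = head("/download/path/to/file")
print("%d %s" % (first.status, first.getheader("Location")))  # 302 plus a signed S3 URL

time.sleep(5)  # give the background sync a moment (longer for large files)

second = head("/download/path/to/file")
print(second.status)  # 200, served straight from the local cache

The first response carries a signed S3 URL in its Location header; once the background sync has finished, the same request is answered directly from the cache.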
Setting up the download script
If you want to follow along with what I did, or create your own version of this script, you’ll need to install a few things first. This all assumes a fairly recent version of Ubuntu / Debian, but the same packages can also be found on other operating systems.
Pre-requisites
sudo apt-get install build-essential python-pip python-dev
sudo pip install flask boto uwsgi
download.py
Next, create a new Python file in the directory of your choice. I decided to go with /home/s3cache/download.py. You should probably place this app in its own folder for now.
from flask import Flask, redirect
import boto
import boto.s3
import os
import os.path
import threading
import urllib
from time import time
from boto.s3.key import Key
from flask import make_response

app = Flask(__name__)
app.debug = True

'''Change the constants below to your specific details'''
CACHE_ROOT = "/var/www"

#Timeout in seconds
CACHE_TIMEOUT = 3600 * 24 * 30

#AWS environment variables
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
S3_BUCKET = os.environ["S3_BUCKET"]

access_times = {}


@app.route("/download/<path:filename>")
def download(filename):
    path = os.path.join(CACHE_ROOT, filename)
    access_times[path] = time()

    # Check if the file exists and a file sync is not in progress
    if os.access(path, os.F_OK) is False and os.access(path + ".lock", os.F_OK) is False:
        conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        bucket = conn.get_bucket(S3_BUCKET)
        k = Key(bucket, filename)
        url = k.generate_url(3600, query_auth=True, force_http=True)

        # Attempt to resync the missing file
        sync_thread = threading.Thread(target=sync_file, args=(url, path))
        sync_thread.start()

        # Clear cache of files that haven't been accessed recently
        clear_thread = threading.Thread(target=clear_cache)
        clear_thread.start()

        return redirect(url, code=302)
    else:
        response = make_response()
        response.headers['Content-Type'] = ""
        response.headers['Content-Disposition'] = 'attachment; filename="%s"' % (os.path.basename(filename),)
        response.headers['X-Accel-Redirect'] = os.path.join("/internal-redirect", filename)
        return response


def sync_file(source, destination):
    if os.access(destination + ".lock", os.F_OK) is False:
        if os.access(os.path.dirname(destination), os.F_OK) is False:
            os.makedirs(os.path.dirname(destination))

        lock = open(destination + ".lock", 'w')
        lock.write("")
        lock.flush()
        lock.close()

        try:
            urllib.urlretrieve(source, destination)
        except Exception as e:
            print(e)
        finally:
            os.remove(destination + ".lock")


def clear_cache():
    '''
    Removes files contained within CACHE_ROOT that have not been
    accessed within the CACHE_TIMEOUT period.
    '''
    for root, dirs, files in os.walk(CACHE_ROOT):
        for name in files:
            path = os.path.join(root, name)
            if path in access_times:
                access_time = access_times[path]
            else:
                access_time = os.stat(path).st_atime
            if (time() - access_time) > CACHE_TIMEOUT:
                os.remove(path)


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080)
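As a quick sanity check of the cache-hit branch, you can drive the route with Flask's built-in test client. This is only a sketch: it assumes the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET environment variables are set (they are read at import time) and that CACHE_ROOT already contains a file, here hypothetically named example.txt.

# Sketch only: example.txt stands in for any file already present in CACHE_ROOT.
from download import app

with app.test_client() as client:
    resp = client.get("/download/example.txt")
    # On a cache hit the app sends no body of its own; it only sets
    # X-Accel-Redirect and leaves the actual file serving to nginx.
    print(resp.status_code)                        # 200
    print(resp.headers.get("X-Accel-Redirect"))    # /internal-redirect/example.txt

Note that a cache hit only sets headers; the file itself is streamed by nginx, so running the Flask development server on its own will return an empty body here.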
Setting Up the Nginx web server
The nginx web server now comes packaged with most Linux distributions. In Ubuntu / Debian, you’ll want to run the following command to install the version that includes all the features we need.
sudo apt-get install nginx-full
Site Config
Create a new file in /etc/nginx/site-enabled (for example /etc/nginx/sites-enabled/s3cache.conf)
server {
    listen 80;
    server_name localhost;
    charset utf-8;
    client_max_body_size 75M;

    location / {
        try_files $uri @s3cache;
    }

    location @s3cache {
        include uwsgi_params;
        uwsgi_pass unix:/home/s3cache/uwsgi.sock;
    }

    location /internal-redirect/ {
        internal;
        alias /var/www/;
    }
}
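To make the internal redirect concrete: a cache hit for a hypothetical request to /download/videos/clip.mp4 responds with X-Accel-Redirect: /internal-redirect/videos/clip.mp4, which the alias directive rewrites to /var/www/videos/clip.mp4, and nginx then streams that file itself without involving the Flask app again.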
Be sure to adjust the value of server_name, point the uwsgi_pass socket path at your app folder (/home/s3cache here), and make the alias under /internal-redirect/ match CACHE_ROOT in download.py. From there, restart nginx or reload the configuration for the changes to take effect.
sudo /etc/init.d/nginx restart
However, if you try to access the site now, you’ll get an error. If you look through the configuration, you’ll notice a reference to a socket connection (uwsgi.sock). We’ll need to configure uwsgi to create it by adding a uwsgi.ini file in the same folder as our app (/home/s3cache).
uwsgi.ini
[uwsgi]
#application's base folder
base = /home/s3cache

#python module to import
app = download
module = %(app)

pythonpath = %(base)

#socket file's location
socket = /home/s3cache/%n.sock

#permissions for the socket file
chmod-socket = 666

#the variable that holds the flask application inside the module imported above
callable = app

#location of log files
logto = /tmp/%n.log

#Enable threads so that syncing / removing files will work
enable-threads = True
To start the app, simply run uwsgi uwsgi.ini in the app folder. You should then be able to access any file stored in $S3_BUCKET using http://localhost/download/path/to/file. This will redirect to Amazon S3 the first time, but send the file directly the second time. In a production environment, you would want to adjust uwsgi.ini to run as another user and set up a uwsgi emperor to manage the process.
John David Reaver
Oct 20, 2014
Excellent! This is a clear example of how to cache S3 downloads, and how to use boto to allow users to download files.