Amazon AWS Lambda S3 I/O Python Example

What is Lambda?

I’m a newbie to the Amazon Lambda/AWS service.  If you haven’t heard of it, it’s a cloud service that can run Node.js, Python, or Java code for free in the cloud.  I stumbled onto it while exploring using the cloud to speed up lengthy pattern compiles for work.  Amazon AWS customers get 1 million free Lambda “events” per month.  A Lambda “event” is any short-running (~3 min or less) task that you can run in the cloud.  There are times when I have some task I want to do on a large number of files, and I’d rather not wait for my laptop to slog through all that work.  So I’ve been thinking the Lambda service would be just the thing for that type of work.

The nice thing about Lambda is that it scales… if you have 10,000 tasks to do, it just starts 10,000 of them in parallel.  Keep in mind that at some point the “free” million events plus compute/RAM seconds add up, and the meter will charge your credit card.  But for small-scale tasks it’s an interesting concept.  Free, massively scalable compute power in the cloud, you say?

It’s Not Easy

Turns out the Lambda cloud requires some experience with Python or Node.js.  I hadn’t used Python since perhaps the early 1.0 days, and it has evolved into this massively powerful object-oriented scripting language.  Who knew?  It’s a bit more massive in concepts than Perl, in my opinion.  But Amazon has some nifty tutorials, and I thought I’d dig into how difficult it would be to leverage Lambda for my customers.  I spent a few days back in the first quarter… got about 10% down the road, and gave up, as I have a day job which requires I make customers happy.  I had some spare cycles today and dug in further, so I have been able to get an S3 Python script that can OPEN an S3 bucket (input file), read bytes from that file, and copy them a line at a time to another S3 output file.  This seems trivial to the guru programmers out there… but it seemed massively difficult to me.


Amazon provides an API (Application Programming Interface) for accessing AWS resources in the Amazon cloud.  I have an AWS Amazon account, and have set up some S3 (Simple Storage Service) buckets in the cloud.  There is a Command Line Interface (CLI) and some plug-ins for Visual Studio to store/retrieve files to/from S3 storage.  And it’s a nice place to securely store files in the cloud.  The boto3 interface allows Python scripts, both locally and in the cloud, to access S3 resources.  When a Python script runs in the Lambda cloud, the Lambda account setup provides all the required authentication via IAM (Identity and Access Management) keys.  If you want to run the Python script on your laptop, the secret keys to the cloud must be passed to the boto3 API.

There is a way to specify the “bucket” and “key”, which essentially are the path to the file you want to read… but you don’t get a “FILE” object, you get some kind of “StreamingBody” object.  That was the killer here.  Some of the Amazon examples show copying the S3 file to a temporary local Unix file before having the Python script operate on it.  I didn’t want to do that, so I had to fight to get something that would do buffered reads (4K bytes at a time) from the S3 cloud.  I will probably up that to 64K, but anyway… I have something that works.
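The buffered-read idea can be sketched as a small generator that works against anything with a `read(size)` method — the function name `iter_lines` is my own, and `io.BytesIO` stands in for the StreamingBody here so the sketch runs locally:

```python
import io

def iter_lines(body, chunk_size=4096):
    """Yield complete lines from a read(size)-style object (such as an S3
    StreamingBody), buffering partial lines across chunk boundaries."""
    unfinished = b''
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        chunk = unfinished + chunk
        lines = chunk.split(b'\n')
        unfinished = lines.pop()   # last piece may be an incomplete line
        for line in lines:
            yield line
    if unfinished:                 # flush a trailing line with no newline
        yield unfinished

# io.BytesIO plays the role of the StreamingBody in this local test
fake_body = io.BytesIO(b'alpha\nbeta\ngamma')
print(list(iter_lines(fake_body, chunk_size=4)))  # → [b'alpha', b'beta', b'gamma']
```

The 4096 chunk size matches the 4K reads mentioned above; bumping it to 64K is just a matter of changing the default.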

I can OPEN an S3 bucket, read lines in, then OPEN another S3 output bucket and save an identical copy of the file to that bucket.  This is the first step to having any kind of file-processing utility automated.  The idea is: put a file of type X into the cloud, and the cloud modifies it and produces a file of type “Y” that you can fetch.  If you need to process 100 files of type X, just upload them to the cloud.
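As a hypothetical illustration of the “type X in, type Y out” idea, a per-line transform slotted between the read and the write might look like this — the function names and the uppercasing transform are my own placeholders, not part of the script below:

```python
def transform_line(line):
    # Placeholder transform: turning "type X" into "type Y" is just
    # uppercasing here; swap in whatever real processing you need.
    return line.upper()

def process_text(text):
    # Apply the transform line by line, preserving the line structure.
    return '\n'.join(transform_line(line) for line in text.split('\n'))

print(process_text('hello\nworld'))  # prints HELLO and WORLD on two lines
```

In the Lambda version, `transform_line` would sit inside the copy loop, so each line is modified on its way from the input bucket to the output bucket.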

That’s the idea.  I will probably publish this sucker on GitHub when I get around to it.  But I’m lazy, so I’m just going to paste it here:

from __future__ import print_function

import json
import urllib
import uuid
import boto3
import re

# ReadOnce object (adapted from an online example): wraps a
# StreamingBody-style key so it can be handed to code that expects a
# file-like object and reads it exactly once.
class ReadOnce(object):
    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data

print('Loading IO function')

s3 = boto3.client('s3')

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))

    # Get the object from the event and show its content type
    inbucket = event['Records'][0]['s3']['bucket']['name']
    outbucket = "outlambda"
    inkey = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    outkey = "out" + inkey
    try:
        infile = s3.get_object(Bucket=inbucket, Key=inkey)
    except Exception as e:
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(inkey, inbucket))
        raise e

    inbody = infile['Body']
    tmp_path = '/tmp/{}{}'.format(uuid.uuid4(), "tmp.txt")
    # upload_path = '/tmp/resized-{}'.format(key)

    # Buffered 4K reads from the StreamingBody, split into lines,
    # carrying any partial line over to the next chunk.
    with open(tmp_path, 'w') as out:
        unfinished_line = ''
        bytes = inbody.read(4096)
        while bytes:
            bytes = unfinished_line + bytes
            # split on whatever, or use a regex with re.split()
            lines = bytes.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                print("line %s" % line)
                out.write(line + '\n')
            bytes = inbody.read(4096)
        if unfinished_line:
            out.write(unfinished_line)

    # Upload the file to S3
    try:
        tmp = open(tmp_path, "r")
        outfile = s3.put_object(Bucket=outbucket, Key=outkey, Body=tmp)
    except Exception as e:
        print('Error putting object {} to bucket {}. Make sure the bucket exists and is in the same region as this function.'.format(outkey, outbucket))
        raise e




