tl;dr: You can download files from S3 with requests.get() (whole or streamed) or with the boto3 library. There are slight differences in speed, but network I/O dictates more than how you implement the download.
I'm working on an application that needs to download relatively large objects from S3. Some files are gzipped and their size hovers around 1MB to 20MB (compressed).
So what's the fastest way to download them? In chunks, all in one go, or with the boto3 library? I should warn that if the object we're downloading is not publicly exposed, I don't actually know how to download it other than with the boto3 library. In this experiment I'm only concerned with publicly available objects.
The Functions
f1()
The simplest one first. Note that in a real application you would do something more with r.content than just return its size. In fact, you might want to get the decoded text out instead, since r.content is the raw (encoded) bytes.
def f1(url):
    r = requests.get(url)
    return len(r.content)
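To illustrate that side note, here's a hypothetical variant that is not part of the benchmark; it only makes sense for plain-text objects, so for the gzipped objects in this post you would stick with r.content:

def f1_text(url):
    # Hypothetical variant of f1() that returns the size of the
    # decoded text instead of the raw bytes.
    r = requests.get(url)
    return len(r.text)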
f2()
If you stream it you can minimize memory bloat in your application, since you can re-use the chunks of memory as long as you do something with each chunk as it arrives. In this case, though, the buffer is just piled up in memory, 512 bytes at a time.
def f2(url):
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            buffer.write(chunk)
    return len(buffer.getvalue())
I did put a counter into that for-loop to see how many times it writes, and if you multiply that by 512 or 1024 respectively it does add up.
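That counter looked something like this (a sketch, not the exact code used; the print is only there for eyeballing the numbers):

def f2_counted(url):
    # Same as f2(), but counts how many chunks get written so that
    # count * 512 can be compared with the file size.
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    count = 0
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            buffer.write(chunk)
            count += 1
    print(count, 'chunks of 512 bytes')
    return len(buffer.getvalue())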
f3()
Same as f2() but with twice as large chunks.
def f3(url):  # same as f2 but bigger chunk size
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            buffer.write(chunk)
    return len(buffer.getvalue())
f4()
I'm actually quite new to boto3 (the cool thing used to be boto) and from some StackOverflow surfing I found this solution, which supports downloading gzipped or non-gzipped objects into a buffer:
def f4(url):
    # `s3` is assumed to be set up elsewhere, e.g. s3 = boto3.resource('s3')
    _, bucket_name, key = urlparse(url).path.split('/', 2)
    obj = s3.Object(
        bucket_name=bucket_name,
        key=key
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    try:
        got_text = GzipFile(None, 'rb', fileobj=buffer).read()
    except OSError:
        buffer.seek(0)
        got_text = buffer.read()
    return len(got_text)
Note how it doesn't try to find out whether the buffer is gzipped; it simply assumes it is and falls back when the exception is raised.
This feels clunky around the gunzipping, but it's probably quite representative of a final solution.
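If the try/except bothers you, an alternative (not what this post benchmarks) is to peek at the two-byte gzip magic number before deciding; maybe_gunzip here is a hypothetical helper:

def maybe_gunzip(buffer):
    # Check the gzip magic number (0x1f 0x8b) instead of relying on
    # a raised exception to tell gzipped from non-gzipped content.
    head = buffer.read(2)
    buffer.seek(0)
    if head == b'\x1f\x8b':
        return GzipFile(None, 'rb', fileobj=buffer).read()
    return buffer.read()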
The Results
At first I ran this on my laptop, on my decent home broadband, whilst having lunch. The results were very similar to what I later found on EC2, just 7-10 times slower at home. So let's focus on the results from within an EC2 node in us-west-1c.
The raw numbers are as follows (showing median values):
| Function | 18MB file | Std Dev | 1MB file | Std Dev |
|---|---|---|---|---|
| f1 | 1.053s | 0.492s | 0.395s | 0.104s |
| f2 | 1.742s | 0.314s | 0.398s | 0.064s |
| f3 | 1.393s | 0.727s | 0.388s | 0.080s |
| f4 | 1.135s | 0.090s | 0.264s | 0.079s |
I ran each function 20 times. It's interesting, but not totally surprising that the function that was fastest for the large file wasn't necessarily the fastest for the smaller file.
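For reference, here is a minimal sketch of the kind of harness that could produce numbers like these (not the exact script used for this post; the URL and run count are placeholders):

import statistics
import time

def benchmark(func, url, times=20):
    # Run func(url) repeatedly and report the median and standard deviation.
    measurements = []
    for _ in range(times):
        t0 = time.time()
        func(url)
        measurements.append(time.time() - t0)
    return statistics.median(measurements), statistics.stdev(measurements)

# e.g. benchmark(f1, 'https://s3-us-west-1.amazonaws.com/some-bucket/some-key.gz')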
The winners are f1() and f4(), each taking gold for one of the file sizes. That makes sense, because it's often faster to do big things over the network all at once.
Or, are there winners at all?
By a tiny margin, f1() and f4() are slightly faster, but they are not as convenient because they don't stream. In f2() and f3() you have the ability to do something constructive with the stream as it comes in. As a matter of fact, in my application I want to download the S3 object and parse it line by line, so I can use response.iter_lines(), which makes this super convenient.
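A minimal sketch of that line-by-line approach (assuming the object is plain text; for a gzipped object you would decompress first):

def parse_lines(url):
    # Stream the response and handle it one line at a time,
    # without buffering the whole body in memory.
    r = requests.get(url, stream=True)
    count = 0
    for line in r.iter_lines():
        if line:  # filter out keep-alive chunks
            count += 1  # a real application would parse the line here
    return count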
But most importantly, I think we can conclude that it doesn't matter much how you do it. Network I/O is still king.
Lastly, the boto3 solution has the advantage that, with the credentials set up right, it can also download objects from a private S3 bucket.
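For completeness, a sketch of how that could look for a private bucket, assuming credentials are available via the usual boto3 mechanisms (environment variables, ~/.aws/credentials or an instance role); the bucket and key names here are made up:

import io
import boto3

def download_private(bucket_name, key):
    # Nothing is hard-coded here; boto3 finds the credentials on its own.
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name=bucket_name, key=key)
    buffer = io.BytesIO(obj.get()["Body"].read())
    return len(buffer.getvalue())

# e.g. download_private('my-private-bucket', 'some/key.gz')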
Bonus Thought!
This experiment was conducted on an m3.xlarge in us-west-1c. That 18MB file is a compressed file that, when unpacked, is 81MB. This little Python code basically managed to download and unpack 81MB worth of data in about 1 second. Yay!! The future is here and it's awesome.
Comments
Sure looks like you should have picked f1 and f4 as your winners.
Yes. Grrr. Typo. Will fix when I'm on a computer.
Nice, I was looking for exactly this kind of analysis.
With that size I wouldn't even bother about performance. Large files, to me, start at hundreds of megabytes. In other words, something that does not fit into a Lambda's memory when read in one chunk.
buffer = io.BytesIO(obj.get()["Body"].read())
This line reads the file into memory.
Put a print statement before and after and try on a large file and you will see.
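For what it's worth, here is a sketch of how one could avoid pulling the whole body into memory at once, using the streaming interface of the response body (the chunk size is arbitrary):

import boto3

def stream_object(bucket_name, key, chunk_size=1024):
    # The "Body" of a GetObject response is a streaming object,
    # so it can be consumed in chunks instead of all at once.
    s3 = boto3.resource('s3')
    body = s3.Object(bucket_name=bucket_name, key=key).get()["Body"]
    total = 0
    for chunk in body.iter_chunks(chunk_size=chunk_size):
        total += len(chunk)  # a real application would process the chunk here
    return total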