Introduction to Boto3 and S3

At a time when datasets are growing larger and business needs are growing more expensive, many have turned to cloud computing to mitigate both issues. With this increased interest, it is no wonder that Amazon Web Services (AWS) has become increasingly popular over the past few years. Cloud computing provides a level of flexibility and scalability that lets users and businesses adjust resources to their needs.

AWS offers hundreds of services, ranging from cloud storage with S3 and notification services with SNS to sentiment analysis with Comprehend. Boto3 is the library that lets a user create and configure AWS services using Python. In this blog, I will cover the basics of using Boto3, based on information from DataCamp’s “Introduction to AWS Boto in Python” course.

Intro to Boto3 and S3

Boto3 is a package that must be installed and imported before use. You also need an AWS account with the appropriate authentication credentials: initializing a service requires a valid access key id and secret access key, much like a username and password. These can be generated through the Identity and Access Management (IAM) Console or the AWS Command Line Interface (CLI). We can also create IAM sub-users to control who has access to our account’s AWS resources.

Creating an S3 client using our credentials in Boto3.

First, we go into a little more detail about cloud storage with S3 and how it works. The key feature of S3 is that it lets us put any file into the cloud and have it be accessible anywhere through a URL. S3 is organized around so-called Buckets and Objects. Buckets act like folders in a file system, and each must have a globally unique name. We can perform operations on buckets such as creating new ones, listing all the buckets in our account, and deleting them.

Creating multiple buckets with Bucket names and the response to confirm creation of ‘gim-staging’ bucket.

We store objects within our buckets. An object can be any type of file, from a simple text file to video and audio files. An important feature of an object is its key, which is the object’s path from the bucket’s root. An object can only be in one parent bucket at a time. We can upload, download, and delete files from our buckets, as well as pull metadata about any object using the head_object method. Listing all of the objects within a bucket is done with the list_objects method. We can also use this method to search for specific objects, such as ones whose keys start with a particular prefix.

Uploading a csv file to gid_staging and retrieving its ContentLength through the head_object method.
Listing all objects within the bucket that start with ‘2018/final_’ and deleting them from the bucket.

Uploading and Sharing Files Securely

While working with objects, we may have certain files that we wish to keep private or restrict access to. By default, newly uploaded files in S3 are private: only requests signed with the account’s key and secret can access them. This means we must explicitly opt objects into public access rather than having them public by default. There are multiple ways to control permissions within S3.

Again, we can use IAM to limit or allow specific users’ access to AWS services and objects. Bucket policies allow us to control permissions on a particular bucket and any objects in it. We also have access control lists (ACLs) which let us denote permissions on specific objects within a bucket.

When we upload a file to a bucket, its default ACL is set to ‘private’. We can set the ACL to ‘public-read’, allowing anyone to download the file, either through the put_object_acl method or by passing in an extra argument when uploading the file. Once an object is public, it can be accessed at a public URL of the format https://&lt;bucket_name&gt;.s3.amazonaws.com/&lt;object_key&gt;.

Passing in an extra argument for public-read ACL upon uploading a file.

In addition, you can grant temporary access to private files using pre-signed URLs: links that expire after a set time limit. They are generated through the generate_presigned_url method, which takes the number of seconds after which the link expires.

Generating a pre-signed URL that expires in one hour.

S3 is also able to host static HTML pages, which allows us to share our analysis and results with other people. For instance, we can take one of our DataFrames and render it as an HTML table using Pandas’ to_html method. We can even use this to create an index page listing multiple HTML links and files that we wish to share. We do this by listing the objects within a bucket that we wish to share, converting the response to a DataFrame, and then building URLs from a base URL plus each object’s key.

Converting a dataframe to HTML and having its URLs converted to hyperlinks.
Creating an index page.
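A sketch of building such an index page; the bucket, keys, and base URL are examples, and the DataFrame is constructed inline rather than from a live list_objects response:

```python
import pandas as pd

# Normally built from a list_objects response, e.g.
# pd.DataFrame(response['Contents']); the keys here are examples.
objects_df = pd.DataFrame(
    {'Key': ['2018/final_report.csv', '2018/raw_data.csv']})

# Build a public URL for each object from a base URL plus its key
base_url = 'https://gid-reports.s3.amazonaws.com/'
objects_df['Link'] = base_url + objects_df['Key']

# render_links=True turns the URL strings into clickable <a> tags
objects_df.to_html('index.html', columns=['Link'], render_links=True)
```

Uploading the resulting index.html to a public bucket makes it a browsable listing of the shared files.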

This was just a small sample of what S3 in AWS can do and how Boto3 is used to interact with our instances. In the next blog, I will cover how AWS automates notification services through SNS and how AWS Rekognition works with text comprehension and image detection. Thank you for reading!

