Simple Storage Service (S3) — Part 1
I’m starting with something I’m sort of familiar with. Of all the AWS services, this is probably one of the older ones, and it’s probably something we’ve all used without realizing it.
This is not an exhaustive article; I’m learning and writing about this in one day (an hour or so, ideally).
Do visit https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html if you want to dive deeper into things.
I also recommend Bernard Golden’s Ultimate AWS Certified Solutions Architect Associate course on Udemy.
Amazon S3 — Simple Storage Service
Unstructured data storage, high-level theory, and some practical stuff via the AWS console.
What is S3?
Amazon’s own elevator pitch for S3 is:
Amazon Simple Storage Service is storage for the Internet. It is designed to make web-scale computing easier for developers. (https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)
I’ve come to think of it as a giant filing cabinet you can stick anything inside: a bucket [folder] system that can hold any object [file] you want, for any reason, uploadable from anywhere and downloadable from anywhere [that you permit].
Some fun facts
- Designed for 99.99% availability
- AWS guarantees 99.9% availability
- AWS designs for 99.999999999% durability (the famous “eleven nines”)
Durability is about not losing your objects. To put that into some context: at eleven nines, if you store 10,000,000 objects, you can expect to lose a single object about once every 10,000 years. So it’s pretty reliable.
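The arithmetic behind that claim is worth doing once yourself. A minimal sketch of the expected-loss maths (the object count is the one from AWS’s own example; everything else falls out of the durability figure):

```python
# Eleven nines of durability means an annual loss probability per object of
# roughly 1 - 0.99999999999 = 1e-11.
annual_loss_probability = 1 - 0.99999999999

objects = 10_000_000  # the figure AWS uses in its own example

# Expected number of objects lost per year across the whole set.
expected_losses_per_year = objects * annual_loss_probability  # ~0.0001

# Flip it around: how many years, on average, per single lost object?
years_per_lost_object = 1 / expected_losses_per_year  # ~10,000 years
```

Ten million objects, one expected loss every ten thousand years: the headline number checks out.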
Buckets have a few characteristics worth mentioning right now:
- They are located in a region (this means they have geolocation, which affects how quickly someone can access them)
- You can replicate your bucket to other regions
- All bucket names are unique worldwide, no matter which region they are created in.
- A name becomes available again if you delete a bucket
- They can support versioning for the objects which are added to them.
- I read somewhere that the default limit is 100 buckets per account, but you can request an increase if you need more.
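Because names are globally unique and constrained, it can help to sanity-check a candidate name before you try to create the bucket. A rough sketch of the published naming rules (3–63 characters; lowercase letters, digits, dots, and hyphens; must start and end with a letter or digit; must not be shaped like an IP address). The helper is my own, not part of any SDK:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough check against the published S3 bucket naming rules."""
    # 3-63 chars, lowercase alphanumerics/dots/hyphens,
    # starting and ending with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        return False
    # Names must not look like an IP address, e.g. 192.168.0.1.
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True
```

Even if the name is valid, creation can still fail because someone else in the world already owns it.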
Objects are stored as key-value pairs (the key identifies the object, the value is its data). Some cliff notes on objects; they have:
- a key (name), which is used to identify it within the bucket
- other typical metadata such as Content-Type, Content-Length, Date (modified), etc.
- custom metadata, which you can add when creating an object if you want it for later
- the max size of an object is currently 5TB.
- objects can be encrypted
- can be versioned
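On the custom metadata point: S3 exposes user-defined metadata as HTTP headers prefixed with `x-amz-meta-`. A small sketch showing how your custom keys map onto the headers you’d see when fetching the object back (the helper and the example keys are mine, not part of any SDK):

```python
def to_user_metadata_headers(metadata: dict) -> dict:
    """Map user-defined metadata keys onto S3's x-amz-meta-* header form."""
    headers = {}
    for key, value in metadata.items():
        key = key.lower()  # header names are case-insensitive; S3 lowercases them
        if not key.startswith("x-amz-meta-"):
            key = "x-amz-meta-" + key
        headers[key] = str(value)  # metadata values travel as strings
    return headers

headers = to_user_metadata_headers({"department": "radiology", "Retention-Years": 30})
# {'x-amz-meta-department': 'radiology', 'x-amz-meta-retention-years': '30'}
```

One gotcha worth knowing: this metadata is set at upload time, so “changing” it means rewriting the object, which ties into the consistency discussion below.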
The object data consistency model
From personal experience, this is something worth knowing about. Your objects are replicated, even inside a single region so that you have redundancy and high availability.
This poses a potential problem when “updating” your objects. The first thing to note is that you don’t really UPDATE an object; you create a new one. If you have versioning turned on, your older object is still there under its own version ID, but otherwise, I don’t know where it goes… S3 bucket heaven?
The problem is that the new object needs time to replicate, or ‘propagate’, across all the locations that hold copies of what this object ‘used’ to be. So, for that reason:
Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all Regions
Requests for an object are atomic, which means you will get either the new data or the old data, but never something corrupted or mid-write.
This is referred to as the ‘eventual consistency read’ model.
The exception to eventual consistency is when creating brand-new objects. AWS guarantees read-after-write consistency, so we don’t get a null/undefined object right after creating it.
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket
This is referred to as the ‘consistent read’ model.
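The overwrite behaviour is easier to internalise with a toy model. This is emphatically not how S3 is implemented; it’s just a few lines simulating the observable effect, where a write lands on one replica immediately and the others catch up later:

```python
import itertools

class ToyBucket:
    """A toy model (not real S3) of eventually consistent overwrites."""
    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]
        self._pick = itertools.cycle(range(replicas))  # reads hit replicas in turn

    def put(self, key, value):
        self.replicas[0][key] = value          # the write hits one replica first

    def propagate(self):
        for replica in self.replicas[1:]:      # replication catches up later
            replica.update(self.replicas[0])

    def get(self, key, default=None):
        return self.replicas[next(self._pick)].get(key, default)

bucket = ToyBucket()
bucket.put("report.txt", "v1")
bucket.propagate()                             # "v1" is everywhere
bucket.put("report.txt", "v2")                 # overwrite PUT
reads = {bucket.get("report.txt") for _ in range(3)}
# before propagation, different reads can see "v1" or "v2"
bucket.propagate()
reads_after = {bucket.get("report.txt") for _ in range(3)}  # only "v2" now
```

The middle read set is exactly the eventual-consistency hazard: stale but never corrupt, and it converges once replication finishes.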
Amazon does quite a good job of helping you only pay for what you need. This is a consistent theme of why people really like using it (something something economy at scale).
One of those mechanisms in S3 is the tiering of your objects, so that they are put into storage that suits the level of update and access you need. I’m yet to leverage this personally, so I won’t talk extensively about it, but it’s worth knowing about.
In most use cases you’re going to just put the ‘standard’ storage class on your object and forget about it. However, there are also options for ‘infrequently accessed’ storage classes (IA).
The S3 Standard-IA and S3 One Zone-IA storage classes are designed for long-lived and infrequently accessed data. (IA stands for infrequent access.)
One use case for this would be backups. The trade-offs are typically price vs durability and availability.
Another way to think about it: you save money on the storage, but pay to get your objects back.
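That trade-off is simple enough to put into numbers. A back-of-the-envelope cost model for the backup use case; the per-GB prices here are illustrative placeholders, not current AWS pricing (which varies by region and changes over time), and it ignores request charges and minimum storage durations:

```python
def monthly_cost(gb_stored, gb_retrieved, storage_per_gb, retrieval_per_gb=0.0):
    """Simplified monthly cost: storage plus retrieval only."""
    return gb_stored * storage_per_gb + gb_retrieved * retrieval_per_gb

backup_gb = 500  # hypothetical backup set

# Backups sit idle most months, so assume zero retrieval.
standard = monthly_cost(backup_gb, 0, storage_per_gb=0.023)
ia_quiet = monthly_cost(backup_gb, 0, storage_per_gb=0.0125)

# A full restore in a bad month: IA charges per GB retrieved.
ia_restore = monthly_cost(backup_gb, backup_gb,
                          storage_per_gb=0.0125, retrieval_per_gb=0.01)
```

In quiet months the IA class roughly halves the bill; even a full restore month stays at or below Standard, which is why rarely-read data is the sweet spot.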
Is this the same as Glacier?
As far as I can tell, no. The difference is real-time access.
Glacier is its own service now and is special in the sense that it is for long-term storage, and the data is accessed via its own APIs.
There are Glacier storage classes too, which is a little confusing in my opinion. From what I’ve read, your items remain in S3 but are ‘not available for real-time access’.
Essentially you’re putting your objects on ICE and you can restore them / access them later. We used to have tape backups at a previous company, and ship those tapes to another office as a disaster recovery plan. Glacier is an ideal option for this kind of process.
Where I’m working right now at PushDoctor, this is where we’d probably put our medical data which we need to store for 30 years.
Getting this data back can take hours unless you expedite the retrieval.
Reduced Redundancy Storage
This is another storage class, one where you give up some durability in exchange for lower cost.
Another one I’ve never used yet, but if I’m dealing with objects I don’t mind losing (perhaps a high volume of generated data that I could just as easily regenerate), it might make sense here.
There is an Intelligent-Tiering storage class too, which, for a fee, will move your objects between these tiers based on access and update frequency. I could see this being very useful if you’re dealing with high-volume buckets that no person could reasonably manage by hand. I hope one day I have a good reason to use it.
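If you’d rather describe those moves yourself than pay for Intelligent-Tiering, S3 supports lifecycle rules per bucket. A sketch of the JSON shape such a configuration takes (the rule ID and prefix are made-up examples); this would transition matching objects to Standard-IA after 30 days and Glacier after 90:

```json
{
  "Rules": [
    {
      "ID": "archive-old-backups",
      "Filter": {"Prefix": "backups/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ]
    }
  ]
}
```

You apply a document like this to a bucket once, and S3 does the tiering on a schedule rather than by observed access patterns.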
I’m going to wrap this up here for today… S3 is turning out to be quite a deep topic for an hour or so’s time per day.
I thought I ‘knew it’, but I’m actually really glad I took the time to deep dive into it and look around. I really DIDN’T know it that well as a whole. Tomorrow I’ll get a little more practice, I hope.
Thanks for reading — if you see a misunderstanding, shout at me for it. If you want to impart knowledge and experiences about S3 buckets, some horror stories, or success stories, go for it.
Every day is a school day.