Resilient Data Processes with Azure Blob Storage

Introduction

As we grow as a company, we’re continuously assessing how we provision the IT infrastructure which allows us to deliver services and projects for our clients.

Cloud services have their advantages and disadvantages, but one area in which they provide a clear win for us is data storage. In this blog, I’ll discuss how we’re using Azure Blob Storage to save time, money, and headaches.

What is Azure Blob Storage?

Azure Blob Storage is a cloud data storage service from Microsoft. ‘Blob’ is a backronym for ‘binary large object’. The service lets us set up a storage account in which data of any kind (including any type of file) is stored and retrieved as ‘blobs’, organised into separate named containers. It uses a pay-as-you-go pricing model based on the volume of data stored, the storage options selected, and the number of operations performed.
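
To make that concrete, here is a minimal sketch of the basic workflow using the azure-storage-blob Python package (the connection string, container name, and file name are placeholders for this example, not our real configuration):

```python
from azure.storage.blob import BlobServiceClient

# Connect to the storage account (the connection string would normally come
# from an environment variable or a key vault, not source code).
service = BlobServiceClient.from_connection_string("<connection-string>")

# Containers group related blobs within the account.
container = service.get_container_client("example-container")

# Upload a local file as a blob...
with open("report.csv", "rb") as data:
    container.upload_blob(name="report.csv", data=data, overwrite=True)

# ...and read it back later.
contents = container.download_blob("report.csv").readall()
```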

What are we using it for?

We have two main data storage requirements which are particularly well-suited to cloud solutions.

Large database backups

We work with a lot of time-series data, which is stored in large databases as part of our internal data warehouse solution. We have a policy of regularly taking backups of these databases, and those backups quickly occupy a lot of storage space.

ETL workload auditing and recoverability

The time-series data we handle goes through an Extract, Transform, Load (ETL) pipeline to be provisioned to our applications and analysts via our data warehouse. We receive extracts of client data as files in various formats. These files are then transformed into another format before being loaded into the data warehouse. If the data subsequently available from the warehouse is not as expected, we need to be able to answer these questions quickly:

  • What did we receive from the client?
  • Did something unexpected happen during either the transformation or loading process?
  • If something went wrong, how can we identify and recover data which might need to be reprocessed?

Furthermore, if we ever experience an incident which requires restoring a database from a backup, any time-series data ingested since that backup was taken will need to be re-processed through the ETL pipeline. Accordingly, we need to retain fast access to recently ingested source data to minimise downtime after an incident.

Our solution to these problems uses several blob containers to act as ‘checkpoints’ in the ETL process and to archive data once processed.
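
To give a flavour of how this works, here is a simplified sketch, assuming one container per checkpoint (the container names and helper functions are illustrative rather than our exact implementation):

```python
from datetime import datetime
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# One container per 'checkpoint' in the pipeline, plus an archive.
received = service.get_container_client("etl-received")
transformed = service.get_container_client("etl-transformed")
archive = service.get_container_client("etl-archive")


def checkpoint_received(filename: str, payload: bytes) -> None:
    """Store the client extract exactly as we received it."""
    received.upload_blob(name=filename, data=payload, overwrite=True)


def checkpoint_transformed(filename: str, payload: bytes) -> None:
    """Store the transformed file before it is loaded into the warehouse."""
    transformed.upload_blob(name=filename, data=payload, overwrite=True)


def archive_source(filename: str) -> None:
    """Once loaded successfully, copy the original extract to the archive.

    Copies within the same storage account are authorised by the same
    credentials; a cross-account copy would need a SAS token instead.
    """
    source = received.get_blob_client(filename)
    archive.get_blob_client(filename).start_copy_from_url(source.url)


def blobs_since(container, cutoff: datetime) -> list[str]:
    """List blobs ingested since a given (timezone-aware) time, e.g. the last
    database backup, so we know exactly which files need re-processing."""
    return [b.name for b in container.list_blobs() if b.last_modified >= cutoff]
```

With checkpoints like these, answering “what did we receive?” is a matter of inspecting the relevant container, and re-processing after a restore reduces to re-running the pipeline over the blobs returned by blobs_since.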

Why are we using it?

As a small but growing enterprise, cloud data storage on Azure offers some significant advantages to us:

  • Scalability: as mentioned above, pricing is pay as you go, based on the volume of data stored, the storage options selected, and the number of operations performed. This allows us to respond very quickly to increased data demands without having to invest in and set up new hardware to expand our storage capacity. By using appropriate access tiers, we pay lower storage costs for data we don’t need fast access to (e.g., older database backups), while retaining fast access where it is essential (e.g., recent files in the ETL pipeline).
  • Local and geo-redundancy: data can be stored across multiple disks or even multiple data centres, so that the failure of one disk, or even one centre, does not result in total data loss. To achieve local redundancy with on-premises hardware, we would need RAID or similar technology, which means investing in relatively expensive hardware and multiple extra disks. Geo-redundancy wouldn’t be achieved unless we expanded to open another site overseas!
  • Security: Strong integration with other Azure services via managed identities allows us to tightly restrict access. For example, one of our client-facing applications uses a database served from an Azure virtual machine (VM). The storage account used to store and archive backups of this database is restricted to the same virtual network, so it can only be accessed from this VM and not over the public internet.
  • Python support: We use Python extensively, including in our ETL pipeline. Being able to control all of Azure Blob Storage’s features programmatically through the provided Python libraries makes it a good fit for our team and easy to integrate with our processes (see the sketch after this list).
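
For example, uploading a database backup might look roughly like the sketch below. It authenticates with the VM’s managed identity via DefaultAzureCredential (from the azure-identity package) rather than a stored key, and puts the backup straight into the Cool access tier since we rarely need to read it back; the account URL, container name, and file name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, StandardBlobTier

# On the VM, DefaultAzureCredential resolves to its managed identity, so no
# storage keys need to live alongside the backup scripts.
service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

backups = service.get_container_client("database-backups")

# Backups are rarely read, so upload straight into a cheaper access tier.
with open("warehouse_backup.bak", "rb") as data:
    backups.upload_blob(
        name="warehouse_backup.bak",
        data=data,
        overwrite=True,
        standard_blob_tier=StandardBlobTier.COOL,
    )
```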

In conclusion

Azure Blob Storage is helping us to ensure our data processes are resilient and scalable. By taking care of the hardware for us, it allows our engineers to focus on solving more interesting problems. As our products continue to evolve, I look forward to learning more about how we can best leverage both on-premises and cloud technologies to keep them running smoothly.