An Experimental and Pragmatic Approach to Data Infrastructure

Written by Andrés Quintero Moreano | Sep 25, 2024 8:10:10 PM

A case study in reducing costs by 95% on an IoT data pipeline.

Introduction

In the seemingly endless landscape of data engineering, it can be challenging to find an architecture that balances cost, performance, and maintainability. A pragmatic and experimental approach to building data infrastructure is an absolute necessity if cost and performance are important to your organization.

This approach encourages you to design data infrastructure with the following principles in mind:

  • Architecture > Technologies
  • Understand the types of workload for your pipelines
  • Experimentation, not assumptions

This article tells the story of how we reduced the cost of a high-volume IoT data streaming pipeline from $9,000 per month to less than $500. It gives a high-level overview of the technical details of the solution.

Optimizing a Pipeline for Cost and Performance

The pipeline was designed for an IoT solution that moves data from hundreds of thousands of sensors to an analytical database. The pipeline had to manipulate raw XML data from IoT sensors and perform the necessary transformations to properly store the data for analytics and modeling purposes.

The Initial Architecture

Our client used Microsoft Azure as their main cloud provider, so the initial architecture was based around Azure’s services. Many possible architectures were proposed initially, but we ended up settling on one that made heavy use of Azure Functions. Azure Functions are serverless, on-demand functions that are supposed to scale with load, making them well suited for heavy workloads.

However, Azure’s hard-to-digest documentation and unclear pricing made it difficult to predict with a high degree of certainty what the final cost of the solution would be. After a couple of rounds of experimentation, we arrived at an estimated cost of $9,000 USD per month.

The pipeline would receive raw XML files in cloud blob storage. Each new file triggered an Azure Event Grid event, which sent a task to the Azure Functions; these read the XML file, deserialized it, did some cleanup, and serialized it into a JSON format that was then written to a document database. The new documents in the database would then be copied by another set of Azure Functions into an analytical database solution.
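
To make the transformation step concrete, here is a minimal sketch of what that deserialize-clean-serialize flow might look like; the element names and the cleanup rule are hypothetical, since the client's actual schema is not shown here:

```python
import json
import xml.etree.ElementTree as ET


def transform(xml_payload: bytes) -> str:
    """Deserialize a raw sensor XML payload, drop incomplete
    readings, and serialize the result to JSON."""
    root = ET.fromstring(xml_payload)
    readings = []
    for node in root.iter("reading"):  # hypothetical element name
        value = node.findtext("value")
        if value is None:  # cleanup: skip readings with no value
            continue
        readings.append({
            "sensor_id": node.get("id"),
            "timestamp": node.findtext("timestamp"),
            "value": float(value),
        })
    return json.dumps({"readings": readings})


if __name__ == "__main__":
    sample = b"""<batch>
      <reading id="s-1">
        <timestamp>2024-01-01T00:00:00Z</timestamp>
        <value>21.5</value>
      </reading>
      <reading id="s-2">
        <timestamp>2024-01-01T00:00:00Z</timestamp>
      </reading>
    </batch>"""
    print(transform(sample))  # the second reading is dropped
```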

The Flaws with the Initial Architecture

There were two main flaws with the initial design. The first was simply the high cost of the architecture. Even though it was under the budget for the project, increases in load or added complexity could increase the cost significantly and in unpredictable ways. The second was the rate of failure of the functions. Since we were using the Azure Event Grid as a means of triggering the functions, we were subject to its behavior. We quickly learned that the Event Grid is very sensitive to response times. Therefore, our solution had high failure rates due to latency, resulting in unpredictable write patterns to the warehousing solution.

Why Was Latency High? (IO and CPU-Bound Operations)

The XML files being processed were not particularly light, and the deserialization, cleanup, and serialization steps took up a significant amount of CPU time. These operations are fundamentally blocking, so the "simply add async" solution was not viable.

The CPU could keep up with the IO operations, such as serving HTTP requests and sending data over to the document database, but the CPU-bound steps constantly blocked incoming requests and caused latency to spike. The Azure Functions would then try to scale up and, without budgeting constraints, could easily incur thousands of dollars in compute alone.
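
As a sketch of why `async` alone does not help, and of the pattern that does, consider the following; `parse_and_clean` here is a stand-in for the real deserialize-clean-serialize step, not the production code:

```python
import asyncio
import hashlib
from concurrent.futures import ProcessPoolExecutor


def parse_and_clean(payload: bytes) -> bytes:
    # Stand-in for the real CPU-bound step: it never yields control,
    # so wrapping it in `async def` would still block the event loop.
    digest = payload
    for _ in range(200_000):
        digest = hashlib.sha256(digest).digest()
    return digest


async def handle_event(payload: bytes, pool: ProcessPoolExecutor) -> bytes:
    loop = asyncio.get_running_loop()
    # Offloading the blocking work to a worker process keeps the event
    # loop free to acknowledge Event Grid deliveries quickly.
    return await loop.run_in_executor(pool, parse_and_clean, payload)


async def main() -> None:
    with ProcessPoolExecutor(max_workers=6) as pool:
        results = await asyncio.gather(
            *(handle_event(f"sensor-{i}".encode(), pool) for i in range(12))
        )
    print(f"processed {len(results)} payloads without blocking the loop")


if __name__ == "__main__":
    asyncio.run(main())
```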

Building Our Own Queuing Service

For many experienced data engineers, the solution is to include some sort of external queuing system that can handle IO and respond to the Event Grid quickly and efficiently. This is exactly what we did; however, we opted to build our own queue instead of relying on an out-of-the-box solution.

Why Build Your Own Queue?

Even though it may seem like a daunting task to build your own queuing solution, as long as the scope is well defined, it can be done with less effort than deploying and maintaining an out-of-the-box solution.

The whole purpose of this queue was to separate the CPU-bound and IO-bound operations into different processes. We opted for an 8-core virtual machine, which let us specialize each core for a different purpose: six cores were dedicated exclusively to CPU-bound/blocking operations, one core to handling incoming requests from the Azure Event Grid, and one core to writing the serialized data to the document database.

The code for the CPU-bound operations was the same Python code we already had in the Azure Functions, so that step was completed quickly. Afterwards, implementing the request handling and write operations consisted of a queue (built directly with asyncio primitives) and a couple of worker tasks that poll it for new work, sketched below.
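
The shape of that queue-and-workers core was roughly the following; `write_to_document_db` is a placeholder for the real client call, not the actual implementation:

```python
import asyncio


async def write_to_document_db(event: dict) -> None:
    # Placeholder for the real document-database write.
    await asyncio.sleep(0.01)


async def writer_worker(queue: asyncio.Queue) -> None:
    # Polls the queue and writes each serialized document downstream.
    while True:
        event = await queue.get()
        try:
            await write_to_document_db(event)
        finally:
            queue.task_done()


async def main() -> None:
    # A bounded queue built from asyncio primitives: the request
    # handler enqueues and returns immediately, which is what lets
    # the service answer Event Grid within its latency budget.
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    workers = [asyncio.create_task(writer_worker(queue)) for _ in range(2)]

    # Stand-in for the HTTP handler receiving Event Grid deliveries.
    for i in range(10):
        queue.put_nowait({"blob_url": f"https://example.invalid/batch-{i}.xml"})

    await queue.join()  # wait until every queued event has been written
    for w in workers:
        w.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```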

These processes were then deployed through Docker Swarm to properly limit the amount of CPU and memory each could use. Voilà: we had a queue that responded to the Azure Event Grid with very low latency (never more than 250 ms) and processed the data as effectively as the Azure Functions had.
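
In a Swarm stack file, those limits live under each service's deploy section. The snippet below is illustrative only; the service names, images, and exact limits are assumptions, not the production configuration:

```yaml
# Illustrative stack file for `docker stack deploy`; names and limits
# are placeholders, not the real configuration.
version: "3.8"
services:
  cpu-workers:
    image: pipeline-worker:latest   # hypothetical image name
    deploy:
      resources:
        limits:
          cpus: "6.0"    # the CPU-bound deserialize/clean/serialize pool
          memory: 8G
  ingest:
    image: pipeline-ingest:latest
    deploy:
      resources:
        limits:
          cpus: "1.0"    # answers the Azure Event Grid quickly
          memory: 1G
  writer:
    image: pipeline-writer:latest
    deploy:
      resources:
        limits:
          cpus: "1.0"    # streams serialized data to the document DB
          memory: 1G
```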

The entire service came to about 500 lines of Python, plus the Docker configuration needed to limit and distribute load.

This solution ended up handling all of the load without exceeding 80% CPU usage, while staying within the memory constraints of the VPS. If more IoT devices are added in the future, it would be trivial to allocate more compute resources to the VPS, or to scale horizontally by adding another VPS and load balancing between them.

The total cost for the VPS was $250 USD per month, a great improvement over the previous estimate.

Conclusion

By adopting a pragmatic and experimental approach, we were able to drastically reduce the cost of our client’s IoT data pipeline while maintaining performance and scalability. Building our own queuing service allowed us to tailor the solution precisely to our needs, avoiding the pitfalls of relying solely on off-the-shelf cloud services with unpredictable costs and behaviors.

This case study demonstrates the value of understanding your workload, questioning assumptions, and being willing to experiment with different architectural approaches. The journey doesn't end here; as technology evolves and new challenges arise, continuous experimentation and adaptation remain key to optimizing data infrastructure.

Are you ready to rethink your data pipeline architecture and explore innovative solutions that balance cost, performance, and maintainability? Stay tuned for more insights and detailed guides on building efficient data infrastructures that can propel your organization forward.