Obtaining an optimal and cost-effective solution in the field of Data Engineering and Architecture in the cloud can be a complex process. At ixpantia, we believe that the success of such a solution is based on ensuring robust architecture processes through a strong culture of experimentation, the adoption of DevOps and DataOps practices, the implementation of a metrics monitoring system, and the use of a declarative language to facilitate infrastructure management as code, such as Terraform.
In our data science projects, one of the initial steps typically involves designing a high-level architecture solution tailored to the client's specific needs.
Among the factors to consider for the initial architecture design are the data source (whether it comes from IoT devices, transactional databases, data lakes, etc.), the data format (XML, JSON, CSV, etc.), the frequency of data arrival, the chosen cloud platform, budget constraints, the required levels of data versioning, as well as the capabilities and technological knowledge of the data science teams within the companies we support.
Considering all these elements, we typically propose an initial architecture for the problem at hand: a solution that includes the proposed technologies, monitoring solutions, approximate total cost, and the definition of service agreements in collaboration with the client.
It is important to highlight that the implementation process should be treated from the beginning as a phase of continuous adaptation and improvement. In this context, data architecture and engineering solutions, however well-founded and planned, may require adjustments during implementation to optimize performance. This is due to the inherent complexity of these systems and to unforeseen factors such as unexpected data load spikes, files arriving in corrupted or unexpected formats, changes in the technological environment, or shifts in a company's available budget, among others.
At ixpantia, we believe that bringing scientific experimentation into these more engineering-oriented disciplines is vital to maintaining optimal, cost-effective, and efficient data pipeline innovation cycles. The following figure shows, at a high level, the flow we use:
Figure 1. Flow during the development and implementation of our Data Engineering and Architecture solutions
When we talk about agreeing, we mean creating service agreements that outline the needs and objectives that our architecture must fulfill. This requires a series of meetings with the client to inventory the company's needs. This best practice is based on Site Reliability Engineering (SRE). Under this methodology, we translate the client's requirements into service agreements with their respective indicators and objectives, understanding that these may evolve over time.
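To make the SRE idea concrete, here is a minimal sketch of how a client requirement ("the service should succeed for 99.9% of requests") can be translated into a service level indicator (SLI) and objective (SLO) and checked programmatically. All names and numbers are illustrative assumptions, not part of any specific client agreement.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """A hypothetical SLO: an indicator name and its target value."""
    indicator: str
    target: float  # e.g. minimum fraction of successful requests

def availability_sli(successful: int, total: int) -> float:
    """Service level indicator: fraction of requests served successfully."""
    return successful / total if total else 1.0

def meets_objective(sli_value: float, slo: ServiceLevelObjective) -> bool:
    return sli_value >= slo.target

# Illustrative numbers only: 9,985 of 10,000 requests succeeded.
slo = ServiceLevelObjective(indicator="availability", target=0.999)
sli = availability_sli(successful=9_985, total=10_000)
print(meets_objective(sli, slo))  # 0.9985 < 0.999, so the SLO is missed
```

Because indicators and objectives are expected to evolve, keeping them as explicit data like this makes renegotiating them with the client a matter of changing a value, not rewriting checks.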
Planning involves procuring the necessary resources, both technological and human, to achieve our goal within the defined financial constraints (budget). In this step, we arrive at one or several initial proposals, which we present to the client. Our role here is to help them make an informed decision based on a concise and comprehensive presentation of the proposal(s).
Next, we have the experimentation level. In this phase, we propose a series of experiments with different configurations for the various parameters that make up our solution. These parameters can include, for example, the number of instances to use, the required compute and memory capacity (CPU, RAM), and the number of operations needed in the databases, among others.
Finally, we reach the evaluation phase. At this point, we collect the metrics from the experiments conducted to determine and/or project the most suitable configurations for the parameters that make up our solution, while also identifying opportunities for performance and cost improvements.
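The experimentation and evaluation phases can be sketched as a parameter sweep: run each configuration, record its metrics, and select the cheapest configuration that meets the agreed objective. The parameter grid, the latency and cost formulas inside `run_experiment`, and the 50 ms target below are all stand-in assumptions so the sweep logic is runnable; in practice each "experiment" would be a real deployment measured through monitoring.

```python
from itertools import product

# Hypothetical parameter grid: instance count, memory per instance (GB),
# and provisioned database operations per second.
grid = {
    "instances": [1, 2, 4],
    "memory_gb": [4, 8],
    "db_ops_per_sec": [100, 400],
}

def run_experiment(config: dict) -> dict:
    """Placeholder for a real deployment plus load test; the latency and
    cost values here are fabricated so the example runs on its own."""
    latency_ms = 200 / (config["instances"] * config["memory_gb"] ** 0.5)
    cost_usd = 30 * config["instances"] * config["memory_gb"] + 0.1 * config["db_ops_per_sec"]
    return {**config, "latency_ms": latency_ms, "cost_usd": cost_usd}

def evaluate(results: list, latency_target_ms: float) -> dict:
    """Evaluation phase: cheapest configuration meeting the latency objective."""
    viable = [r for r in results if r["latency_ms"] <= latency_target_ms]
    return min(viable, key=lambda r: r["cost_usd"])

results = [run_experiment(dict(zip(grid, values))) for values in product(*grid.values())]
best = evaluate(results, latency_target_ms=50.0)
print(best)
```

The point of structuring experiments this way is that the evaluation criterion (here, cost subject to a latency objective) is explicit and traceable, rather than an ad hoc judgment after the fact.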
Here are some reasons why we focus our data engineering and architecture strategy around experimentation:
The main characteristics we have defined for the experiments we conduct are as follows:
At a higher level, within experimentation in data engineering and architecture, it is also important to mention the following concepts on which we base our methodology: DevOps and DataOps, Infrastructure as Code, and Monitoring.
DevOps and DataOps
DevOps and DataOps methodologies are always reflected in the practices we use to develop our experiments. Some elements that must always be taken into account are:
A significant part of what we need to do when designing specific final solutions for a client involves creating, decommissioning, and experimenting with different configurations or types of resources. Take, for example, the need to serialize data coming from a sensor in a typical Internet of Things (IoT) solution. This requires developing code, possibly in Python, R, or another language, that can serialize file types such as XML or JSON.
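As a minimal sketch of that serialization step, the following Python code converts a simple XML sensor reading into JSON using only the standard library. The payload shape (`device` attribute, `temperature`/`humidity` children) is a hypothetical example; real messages vary by device vendor.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sensor payload; real IoT messages vary by device vendor.
xml_payload = """
<reading device="sensor-01">
  <temperature unit="C">21.4</temperature>
  <humidity unit="%">55</humidity>
</reading>
"""

def xml_to_json(payload: str) -> str:
    """Convert a simple XML sensor reading into a flat JSON document."""
    root = ET.fromstring(payload)
    record = {"device": root.attrib["device"]}
    for child in root:
        record[child.tag] = {
            "value": float(child.text),
            "unit": child.attrib.get("unit"),
        }
    return json.dumps(record)

print(xml_to_json(xml_payload))
```

The same function could run unchanged in very different execution environments, which is exactly why the choice of technology to host it is a separate decision, as discussed next.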
However, the choice of technology to execute this code is a separate decision. We might consider a serverless technology, such as a Function App in Azure, that links file reception to an event emitter to manage the file queue in real time. This option could be particularly efficient for certain scenarios. Alternatively, it might be appropriate to run the code on a Virtual Machine in the cloud, which processes the data after reading it from a storage account. The choice between these technologies depends on several factors, and it will often be necessary to experiment with different approaches to identify the most viable and cost-effective solutions according to the specific needs at hand.
One of the problems that can arise here is that creating and deleting resources manually can be a cumbersome and laborious process, involving numerous "clicks" on cloud platform portals. This is not only tedious but also limits the traceability of tests. The recommended practice is to define the infrastructure as code (IaC) and manage it in separate repositories. This significantly reduces the time spent in experimentation cycles, thus optimizing the development process.
For managing infrastructure as code (IaC), we have found that Terraform is an extremely versatile tool that allows us to:
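As an illustration, a minimal Terraform configuration can stand up, and just as easily tear down, the scaffolding for an experiment like the serialization one above. The resource names, region, and sizes are assumptions for the sketch, not a recommended production setup.

```hcl
# Illustrative only: names, region, and tiers are assumptions.
# `terraform apply` creates the experiment; `terraform destroy` removes it,
# keeping every configuration change traceable in version control.
resource "azurerm_resource_group" "experiment" {
  name     = "rg-serialization-experiment"
  location = "eastus"
}

resource "azurerm_storage_account" "landing" {
  name                     = "stserializationexp"
  resource_group_name      = azurerm_resource_group.experiment.name
  location                 = azurerm_resource_group.experiment.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```

Because the entire experiment lives in a repository, each configuration tried is a commit, which gives the traceability that manual portal clicks cannot.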
In this context, an essential part of an experimentation-based workflow is monitoring (included in the Evaluation phase), as it is the way we test our hypotheses, propose new experiments, or identify improvement points. The tools we can use for monitoring will obviously depend on the cloud platform we are working with.
For example, in Azure, we have Log Analytics and Application Insights, both tools capable of granularly monitoring almost any resource. However, if they are used, it is highly advisable to test them in very controlled environments to avoid unexpected costs.
Another very useful tool in Azure is Azure Monitor, which can provide detailed information on key metrics, such as CPU usage, HTTP 404 errors, the volume of files received, and data input and output statistics. Additionally, Azure Monitor facilitates the tracking of input/output operations, both write and read, among other crucial aspects for system performance management and optimization.
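Once metrics like CPU usage have been exported from a monitoring tool, evaluating them is ordinary code. The sketch below, with fabricated sample values and an assumed 80% threshold, shows the kind of rule we might derive from monitored metrics; it does not call any Azure SDK.

```python
from statistics import mean

# Hypothetical CPU samples (percent) exported from a monitoring tool
# such as Azure Monitor; a real export would be a timestamped series.
cpu_samples = [35.0, 42.5, 38.0, 91.0, 95.5, 40.0]

def should_scale_out(samples, threshold: float = 80.0, min_breaches: int = 2) -> bool:
    """Flag a scale-out when enough samples breach the CPU threshold."""
    breaches = sum(1 for s in samples if s > threshold)
    return breaches >= min_breaches

print(f"mean CPU: {mean(cpu_samples):.1f}%")
print("scale out:", should_scale_out(cpu_samples))
```

Rules like this close the loop of Figure 1: monitored metrics feed the evaluation phase, which in turn proposes the next experiment or configuration change.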
As a bonus, some resources in Azure have a “Diagnose and Solve Problems” section, which can provide us with a wealth of information not only about potential issues during the deployment of applications or resources but also about the number of executions, errors, and more according to the monitored resource.
In other platforms like GCP and AWS, there are also specific tools for monitoring and evaluating the performance of our cloud resources in a granular way. In GCP, Cloud Logging and Cloud Monitoring are essential tools that enable the collection and visualization of metrics, logs, and events in real-time, helping to identify and resolve performance issues and errors. Meanwhile, in AWS, CloudWatch is the primary monitoring tool, offering capabilities for data and log collection, detailed metrics, and configurable alarms to proactively detect and assess systems.
At ixpantia, we believe that experimentation is an essential process in all stages of developing data engineering and architecture solutions, from conception and implementation to tuning and optimization. We recognize that taking this step can be challenging for many companies, often limiting their ability to fully leverage the advantages of cloud technologies and develop increasingly efficient and cost-effective solutions. However, we firmly believe that this is an achievable goal through team collaboration and alignment, the promotion of a culture of version control and constant documentation, the adoption of infrastructure as code, and the continuous monitoring of resources with the right technologies in controlled environments.