Obtaining an optimal and cost-effective solution in the field of Data Engineering and Architecture in the cloud can be a complex process. At ixpantia, we believe that the success of such a solution is based on ensuring robust architecture processes through a strong culture of experimentation, the adoption of DevOps and DataOps practices, the implementation of a metrics monitoring system, and the use of a declarative language to facilitate infrastructure management as code, such as Terraform.
In our data science projects, one of the initial steps typically involves designing a high-level architecture solution tailored to the client's specific needs.
Among the factors to consider for the initial architecture design are the data source (whether it comes from IoT devices, transactional databases, data lakes, etc.), the data format (XML, JSON, CSV, etc.), the frequency of data arrival, the chosen cloud platform, budget constraints, the required levels of data versioning, as well as the capabilities and technological knowledge of the data science teams within the companies we support.
Considering all these elements, we typically propose an initial architecture for the problem at hand: a solution that includes the proposed technologies, monitoring solutions, approximate total cost, and the definition of service agreements in collaboration with the client.
It is important to highlight that the implementation process should be treated from the beginning as a phase of continuous adaptation and improvement. In this context, data architecture and engineering solutions, however well-founded and planned, may require adjustments during implementation to optimize performance. This is due to the inherent complexity of these systems and to unforeseen factors such as unexpected data load spikes, files arriving in corrupted or unexpected formats, changes in the technological environment, or shifts in a company's available budget, among others.
At ixpantia, we believe that bringing scientific experimentation into these more engineering-oriented disciplines is vital to maintaining optimal, cost-effective, and efficient data pipeline innovation cycles. The following figure shows, at a high level, the flow we use:
Figure 1. Flow during the development and implementation of our Data Engineering and Architecture solutions
When we talk about agreeing, we mean creating service agreements that outline the needs and objectives that our architecture must fulfill. This requires a series of meetings with the client to inventory the company's needs. This best practice is based on Site Reliability Engineering (SRE). Under this methodology, we translate the client's requirements into service agreements with their respective indicators and objectives, understanding that these may evolve over time.
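To make the SRE idea concrete, here is a minimal sketch of how a client requirement ("the service should succeed for 99.9% of requests") can be translated into a service level indicator (SLI) and objective (SLO) and checked programmatically. All names and numbers are illustrative assumptions, not part of any specific client agreement.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """A hypothetical SLO: an indicator name and its target value."""
    indicator: str
    target: float  # e.g. minimum fraction of successful requests

def availability_sli(successful: int, total: int) -> float:
    """Service level indicator: fraction of requests served successfully."""
    return successful / total if total else 1.0

def meets_objective(sli_value: float, slo: ServiceLevelObjective) -> bool:
    return sli_value >= slo.target

# Illustrative numbers only: 9,985 of 10,000 requests succeeded.
slo = ServiceLevelObjective(indicator="availability", target=0.999)
sli = availability_sli(successful=9_985, total=10_000)
print(meets_objective(sli, slo))  # 0.9985 < 0.999, so the SLO is missed
```

Because indicators and objectives are expected to evolve, keeping them as explicit data like this makes renegotiating them with the client a matter of changing a value, not rewriting checks.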
Planning involves procuring the necessary resources, both technological and human, to achieve our goal within the defined financial constraints (budget). In this step, we arrive at one or several initial proposals, which we present to the client. Our role here is to help them make an informed decision based on a concise and comprehensive presentation of the proposal(s).
Next, we have the experimentation level. In this phase, we propose a series of experiments with different configurations for the various parameters that make up our solution. These parameters can include, for example, the number of instances to use, the required compute and memory capacity (CPU, RAM), and the number of operations needed in the databases, among others.
Finally, we reach the evaluation phase. At this point, we collect the metrics from the experiments conducted to determine and/or project the most suitable configurations for the parameters that make up our solution, while also identifying opportunities for performance and cost improvements.
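The experimentation and evaluation phases can be sketched as a parameter sweep: run each configuration, record its metrics, and select the cheapest configuration that meets the agreed objective. The parameter grid, the latency and cost formulas inside `run_experiment`, and the 50 ms target below are all stand-in assumptions so the sweep logic is runnable; in practice each "experiment" would be a real deployment measured through monitoring.

```python
from itertools import product

# Hypothetical parameter grid: instance count, memory per instance (GB),
# and provisioned database operations per second.
grid = {
    "instances": [1, 2, 4],
    "memory_gb": [4, 8],
    "db_ops_per_sec": [100, 400],
}

def run_experiment(config: dict) -> dict:
    """Placeholder for a real deployment plus load test; the latency and
    cost values here are fabricated so the example runs on its own."""
    latency_ms = 200 / (config["instances"] * config["memory_gb"] ** 0.5)
    cost_usd = 30 * config["instances"] * config["memory_gb"] + 0.1 * config["db_ops_per_sec"]
    return {**config, "latency_ms": latency_ms, "cost_usd": cost_usd}

def evaluate(results: list, latency_target_ms: float) -> dict:
    """Evaluation phase: cheapest configuration meeting the latency objective."""
    viable = [r for r in results if r["latency_ms"] <= latency_target_ms]
    return min(viable, key=lambda r: r["cost_usd"])

results = [run_experiment(dict(zip(grid, values))) for values in product(*grid.values())]
best = evaluate(results, latency_target_ms=50.0)
print(best)
```

The point of structuring experiments this way is that the evaluation criterion (here, cost subject to a latency objective) is explicit and traceable, rather than an ad hoc judgment after the fact.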
Here are some reasons why we focus our data engineering and architecture strategy around experimentation:
The main characteristics we have defined for the experiments we conduct are as follows:
At a higher level, within experimentation in data engineering and architecture, it is also important to mention the following concepts on which we base our methodology: DevOps and DataOps, Infrastructure as Code, and Monitoring.
DevOps and DataOps
DevOps and DataOps methodologies are always reflected in the practices we use to develop our experiments. Some elements that must always be taken into account are:
A significant part of what we need to do when designing specific final solutions for a client involves creating, decommissioning, and experimenting with different configurations or types of resources. Take, for example, the need to serialize data coming from a sensor in a typical Internet of Things (IoT) solution. This requires developing code, possibly in Python, R, or another language, that can serialize file types such as XML or JSON.
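As a minimal sketch of that serialization step, the following Python code converts a simple XML sensor reading into JSON using only the standard library. The payload shape (`device` attribute, `temperature`/`humidity` children) is a hypothetical example; real messages vary by device vendor.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sensor payload; real IoT messages vary by device vendor.
xml_payload = """
<reading device="sensor-01">
  <temperature unit="C">21.4</temperature>
  <humidity unit="%">55</humidity>
</reading>
"""

def xml_to_json(payload: str) -> str:
    """Convert a simple XML sensor reading into a flat JSON document."""
    root = ET.fromstring(payload)
    record = {"device": root.attrib["device"]}
    for child in root:
        record[child.tag] = {
            "value": float(child.text),
            "unit": child.attrib.get("unit"),
        }
    return json.dumps(record)

print(xml_to_json(xml_payload))
```

The same function could run unchanged in very different execution environments, which is exactly why the choice of technology to host it is a separate decision, as discussed next.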
However, the choice of technology to execute this code is a separate decision. We might consider a serverless technology, such as a Function App in Azure, that links file reception to an event emitter to manage the file queue in real time. This option could be particularly efficient for certain scenarios. Alternatively, it might be appropriate to run the code on a Virtual Machine in the cloud, which processes the data after reading it from a storage account. The choice between these technologies depends on several factors, and it will often be necessary to experiment with different approaches to identify the most viable and cost-effective solutions according to the specific needs at hand.
One of the problems that can arise here is that creating and deleting resources manually can be a cumbersome and laborious process, involving numerous "clicks" on cloud platform portals. This is not only tedious but also limits the traceability of tests. The recommended practice is to define the infrastructure as code (IaC) and manage it in separate repositories. This significantly reduces the time spent in experimentation cycles, thus optimizing the development process.
For managing infrastructure as code (IaC), we have found that Terraform is an extremely versatile tool that allows us to:
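As an illustration, a minimal Terraform configuration can stand up, and just as easily tear down, the scaffolding for an experiment like the serialization one above. The resource names, region, and sizes are assumptions for the sketch, not a recommended production setup.

```hcl
# Illustrative only: names, region, and tiers are assumptions.
# `terraform apply` creates the experiment; `terraform destroy` removes it,
# keeping every configuration change traceable in version control.
resource "azurerm_resource_group" "experiment" {
  name     = "rg-serialization-experiment"
  location = "eastus"
}

resource "azurerm_storage_account" "landing" {
  name                     = "stserializationexp"
  resource_group_name      = azurerm_resource_group.experiment.name
  location                 = azurerm_resource_group.experiment.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```

Because the entire experiment lives in a repository, each configuration tried is a commit, which gives the traceability that manual portal clicks cannot.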
In this context, an essential part of an experimentation-based workflow is monitoring (included in the Evaluation phase), as it is the way we test our hypotheses, propose new experiments, or identify improvement points. The tools we can use for monitoring will obviously depend on the cloud platform we are working with.
For example, in Azure, we have Log Analytics and Application Insights, both tools capable of granularly monitoring almost any resource. However, if they are used, it is highly advisable to test them in very controlled environments to avoid unexpected costs.
Another very useful tool in Azure is Azure Monitor, which can provide detailed information on key metrics, such as CPU usage, HTTP 404 errors, the volume of files received, and data input and output statistics. Additionally, Azure Monitor facilitates the tracking of input/output operations, both write and read, among other crucial aspects for system performance management and optimization.
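Once metrics like CPU usage have been exported from a monitoring tool, evaluating them is ordinary code. The sketch below, with fabricated sample values and an assumed 80% threshold, shows the kind of rule we might derive from monitored metrics; it does not call any Azure SDK.

```python
from statistics import mean

# Hypothetical CPU samples (percent) exported from a monitoring tool
# such as Azure Monitor; a real export would be a timestamped series.
cpu_samples = [35.0, 42.5, 38.0, 91.0, 95.5, 40.0]

def should_scale_out(samples, threshold: float = 80.0, min_breaches: int = 2) -> bool:
    """Flag a scale-out when enough samples breach the CPU threshold."""
    breaches = sum(1 for s in samples if s > threshold)
    return breaches >= min_breaches

print(f"mean CPU: {mean(cpu_samples):.1f}%")
print("scale out:", should_scale_out(cpu_samples))
```

Rules like this close the loop of Figure 1: monitored metrics feed the evaluation phase, which in turn proposes the next experiment or configuration change.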
As a bonus, some resources in Azure have a “Diagnose and Solve Problems” section, which can provide us with a wealth of information not only about potential issues during the deployment of applications or resources but also about the number of executions, errors, and more according to the monitored resource.
In other platforms like GCP and AWS, there are also specific tools for monitoring and evaluating the performance of our cloud resources in a granular way. In GCP, Cloud Logging and Cloud Monitoring are essential tools that enable the collection and visualization of metrics, logs, and events in real-time, helping to identify and resolve performance issues and errors. Meanwhile, in AWS, CloudWatch is the primary monitoring tool, offering capabilities for data and log collection, detailed metrics, and configurable alarms to proactively detect and assess systems.
At ixpantia, we believe that experimentation is an essential process in all stages of developing data engineering and architecture solutions, from conception and implementation to tuning and optimization. We recognize that taking this step can be challenging for many companies, often limiting their ability to fully leverage the advantages of cloud technologies and develop increasingly efficient and cost-effective solutions. However, we firmly believe that this is an achievable goal through team collaboration and alignment, the promotion of a culture of version control and constant documentation, the adoption of infrastructure as code, and the continuous monitoring of resources with the right technologies in controlled environments.