Efficiently Processing Gigantic JSON Objects with Rust, Serde, and Tokio

JSON is now a ubiquitous format for data interchange; it's impossible to avoid. Despite all of its benefits, JSON tends to appear in places it probably shouldn't. One challenging scenario is working with Healthcare Price Transparency datasets and similar files, where a single JSON object can run to hundreds of gigabytes. In this blog post, we'll outline the high-level strategies we've employed to process massive JSON objects using less than one gigabyte of memory.

The Stack

For this ETL pipeline, we leveraged Rust, Tokio, Serde, and Parquet. We designed the pipeline for maximum compute efficiency because we had numerous huge JSON files, each exceeding memory capacity.

The Challenge: Drowning in Data

The primary challenge was the sheer size of the JSON files: tens of gigabytes compressed and hundreds of gigabytes uncompressed. Traditional JSON parsers fail in these situations; attempting to load such a file into memory typically crashes the application before processing can even begin.

Even distributed systems like Spark struggle due to JSON's inherent structure. JSON's tree-like structure hinders parallel processing since each node typically depends on its parent node, forcing a sequential parsing approach.

The Strategy: Divide and Conquer with Streaming

Since JSON parsing is inherently sequential due to nesting and dependencies, we adopted a streaming approach, parsing data incrementally into manageable chunks.

We employed the Visitor pattern to traverse the JSON tree, extracting only the necessary data. The Visitor pattern separates algorithms from the data structures they operate on. Think of navigating a city laid out in a strict grid: you don't need every building’s address to explore the city, only a clear path and turning rules.

Serde is ideal for implementing this, as it's built around the Visitor pattern, allowing us to utilize its primitives effectively.
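
To make the pattern concrete, here is a minimal sketch of our own (not code from the production pipeline; the ElementCount name is purely illustrative): a Serde Visitor that walks a huge JSON array and counts its elements without ever materializing them.

```rust
use serde::de::{Deserializer, IgnoredAny, SeqAccess, Visitor};
use serde::Deserialize;
use std::fmt;

// Illustrative only: count the elements of a huge JSON array without
// materializing them, visiting each element and discarding it in place.
struct ElementCount(usize);

impl<'de> Deserialize<'de> for ElementCount {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct CountVisitor;

        impl<'de> Visitor<'de> for CountVisitor {
            type Value = ElementCount;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("a JSON array")
            }

            fn visit_seq<A>(self, mut seq: A) -> Result<ElementCount, A::Error>
            where
                A: SeqAccess<'de>,
            {
                let mut count = 0;
                // IgnoredAny parses and immediately discards each element.
                while seq.next_element::<IgnoredAny>()?.is_some() {
                    count += 1;
                }
                Ok(ElementCount(count))
            }
        }

        deserializer.deserialize_seq(CountVisitor)
    }
}
```

In our pipeline the visitor does real work rather than counting, but the shape is the same: Serde drives the traversal, and the visitor decides what to keep.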

Technique 1: Serde's Streaming Deserialization (DeserializeSeed)

Typically, deserialization is an all-at-once operation: you deserialize an entire JSON string into a single object or an array of objects in one call. This is the approach Serde (Rust's de facto standard serialization and deserialization library) usually takes, and it means reading the entire structure into memory, which is impractical for massive datasets like the Healthcare Price Transparency files we are working with.

Rust's type system allowed us to inject custom stateful deserializers into the JSON parsing operation through the DeserializeSeed trait. Stateful deserialization means we can hold onto things like channel senders, which lets us defer processing to other tasks.
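
Here is a rough sketch of the idea, with a hypothetical Rate record standing in for the real Price Transparency schema: the seed owns a Tokio channel sender and forwards each array element as soon as it is parsed, so only one element is held in memory at a time.

```rust
use serde::de::{DeserializeSeed, Deserializer, SeqAccess, Visitor};
use serde::Deserialize;
use std::fmt;
use tokio::sync::mpsc;

// Hypothetical record type; the real Price Transparency schema is far larger.
#[derive(Deserialize, Debug)]
struct Rate {
    billing_code: String,
    negotiated_rate: f64,
}

// A stateful "seed": it owns a channel sender and streams every element of a
// JSON array into the channel instead of collecting the array into a Vec.
struct RateStream {
    tx: mpsc::Sender<Rate>,
}

impl<'de> DeserializeSeed<'de> for RateStream {
    // Nothing is returned; the useful work is the side effect of sending
    // each parsed item down the channel.
    type Value = ();

    fn deserialize<D>(self, deserializer: D) -> Result<(), D::Error>
    where
        D: Deserializer<'de>,
    {
        deserializer.deserialize_seq(self)
    }
}

impl<'de> Visitor<'de> for RateStream {
    type Value = ();

    fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str("an array of rate objects")
    }

    fn visit_seq<A>(self, mut seq: A) -> Result<(), A::Error>
    where
        A: SeqAccess<'de>,
    {
        // next_element parses exactly one array element per call, so only a
        // single Rate is held in memory at any point.
        while let Some(rate) = seq.next_element::<Rate>()? {
            // The parser runs on a blocking thread (see the next technique),
            // so blocking_send is safe to call here.
            self.tx
                .blocking_send(rate)
                .map_err(serde::de::Error::custom)?;
        }
        Ok(())
    }
}
```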

Technique 2: Asynchronous Processing with Tokio & Channels

Tokio is an asynchronous runtime for Rust, designed to handle many tasks concurrently without significant overhead. Our DeserializeSeed implementations push parsed items immediately into Tokio channels, which pass data efficiently between threads and tasks. One side of the channel is the sender, and the other is the receiver. Because sending across a channel is cheap, the deserializer can process JSON incrementally and forward items to other tasks for further processing.

Separate Tokio tasks concurrently process items from these channels. This asynchronous, decoupled setup allows parsing to run uninterrupted, enhancing throughput and efficiency.
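
The sketch below shows how the two sides might fit together, reusing the hypothetical Rate and RateStream types from the previous sketch and assuming the input is a top-level JSON array: the synchronous parser runs on a blocking thread via spawn_blocking, while an async task drains the channel concurrently.

```rust
use std::{fs::File, io::BufReader, path::PathBuf};

use serde::de::DeserializeSeed;
use tokio::sync::mpsc;

// Hypothetical glue: `Rate` and `RateStream` come from the sketch above.
async fn run_pipeline(path: PathBuf) {
    let (tx, mut rx) = mpsc::channel::<Rate>(10_000);

    // serde_json's parser is synchronous and CPU-bound, so it runs on a
    // dedicated blocking thread instead of stalling the async runtime.
    let parser = tokio::task::spawn_blocking(move || {
        let reader = BufReader::new(File::open(path).expect("open input file"));
        let mut de = serde_json::Deserializer::from_reader(reader);
        RateStream { tx }
            .deserialize(&mut de)
            .expect("streaming parse failed");
    });

    // Consumer side: items arrive on the channel as soon as they are parsed.
    while let Some(rate) = rx.recv().await {
        // ... batch, transform, and write (see the techniques below) ...
        let _ = rate;
    }

    parser.await.expect("parser task panicked");
}
```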

Technique 3: Batching & Efficient Output (Parquet)

Writing millions of individual items to disk or cloud storage is inefficient. We solved this by batching items in memory and writing them as bulk operations. Additionally, we converted these batches into Apache Parquet format for efficient storage and analytics. Parquet’s columnar structure enhances compression and enables faster queries, improving analytical efficiency significantly.

We utilized the Apache Arrow libraries, which are well supported in Rust, to convert Rust structs into Parquet efficiently.
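
As a rough sketch of that conversion (reusing the hypothetical Rate struct from earlier; the field names are illustrative and the real schema is much wider), each batch of parsed records becomes an Arrow RecordBatch with one array per column, which the parquet crate's ArrowWriter writes out as a Parquet file:

```rust
use std::{fs::File, path::Path, sync::Arc};

use arrow::array::{ArrayRef, Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

// Sketch of the batch-to-Parquet step for a batch of parsed `Rate` structs.
fn write_batch_to_parquet(path: &Path, batch: &[Rate]) -> parquet::errors::Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("billing_code", DataType::Utf8, false),
        Field::new("negotiated_rate", DataType::Float64, false),
    ]));

    // Columnar layout: every field becomes one contiguous Arrow array.
    let codes = StringArray::from_iter_values(batch.iter().map(|r| r.billing_code.as_str()));
    let rates = Float64Array::from_iter_values(batch.iter().map(|r| r.negotiated_rate));
    let record_batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(codes) as ArrayRef, Arc::new(rates) as ArrayRef],
    )?;

    // One Parquet file per batch keeps peak memory bounded by the batch size.
    let file = File::create(path).expect("create output file");
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&record_batch)?;
    writer.close()?; // finalizes the Parquet footer
    Ok(())
}
```

Writing one file per batch is only one option; the same ArrowWriter accepts repeated writes, accumulating multiple batches into larger row groups before the file is closed.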

Technique 4: Asynchronous I/O for Output

Even batched Parquet outputs involve slow I/O operations, potentially blocking processing tasks. To avoid this bottleneck, we leveraged Tokio's asynchronous I/O capabilities. Our tasks asynchronously initiate writes or uploads, immediately returning to process the next batch, while Tokio handles the actual data transfer in the background.

Dedicated Tokio tasks manage uploads, using asynchronous storage clients like cloud-storage, keeping the pipeline non-blocking and efficient.
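
A simplified sketch of such a dedicated upload task follows; upload_to_bucket is a hypothetical stand-in for whatever async storage client is in use.

```rust
use std::path::PathBuf;
use tokio::sync::mpsc;

// Sketch of a dedicated upload task: finished Parquet files arrive on a
// channel, and each upload is spawned so the sender never waits on I/O.
async fn uploader(mut rx: mpsc::Receiver<PathBuf>) {
    while let Some(path) = rx.recv().await {
        tokio::spawn(async move {
            // Read the finished batch file without blocking the runtime.
            let bytes = tokio::fs::read(&path).await.expect("read parquet batch");
            upload_to_bucket(&path, bytes).await.expect("upload failed");
        });
    }
}

// Hypothetical helper; in practice this would wrap an async storage client
// such as the `cloud-storage` crate mentioned above.
async fn upload_to_bucket(path: &std::path::Path, bytes: Vec<u8>) -> std::io::Result<()> {
    // ... client.put_object(...).await ...
    let _ = (path, bytes);
    Ok(())
}
```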

Putting It All Together: The Pipeline

At a high level, our pipeline is designed as a sequence of streamlined, interconnected components, emphasizing efficiency, concurrency, and low memory usage. The architecture flows naturally from data ingestion to asynchronous output, ensuring continuous processing without bottlenecks:

  1. Input & Decompression: Stream JSON data directly, optionally decompressing on-the-fly.
  2. Incremental Parsing & Streaming: Employ incremental parsing methods to break down large JSON data into manageable streams.
  3. Dispatch & Concurrency: Immediately dispatch parsed data through channels, enabling concurrent, asynchronous task execution.
  4. Batch Processing & Transformation: Group items into memory-efficient batches, transforming them for optimized storage.
  5. Efficient Storage (Parquet): Convert batches into Parquet format using columnar storage for improved analytics and compression.
  6. Asynchronous Output: Independently and asynchronously handle batch uploads, eliminating I/O bottlenecks.

This streamlined, modular architecture ensures maximum throughput while minimizing memory usage, demonstrating an effective and scalable approach to handling enormous JSON datasets.

Results & Conclusion

By thoughtfully integrating pieces of Rust's ecosystem (Serde's incremental deserialization, Tokio's asynchronous processing capabilities, and Apache Parquet for efficient storage), we successfully handled massive JSON datasets without exceeding a gigabyte of memory.

Processing gigantic data files in formats like JSON requires clever engineering and adherence to best practices for writing efficient, performant code. The initial investment in understanding and implementing these practices pays off significantly, resulting in a robust and scalable pipeline capable of handling vast amounts of data effectively.