Data Engineering with Rust

FREE course intro on using Rust for Data Engineering


Two months ago I wrote a short article about exploring Rust from the perspective of a long time Python developer. That article prompted many emails, contact requests and mostly positive messages from many people around the world.

The gist of those exchanges is that there is a growing interest in exploring Rust as a performant alternative for problems that were usually reserved for languages like Python.

The motivations behind those exchanges fell in the category of "how can I do X better and faster" rather than "how can I do X in a different way", meaning that the interest in Rust is founded in a genuine desire to make something better, rather than just toying with the shiny new programming language du jour.

Rust is here to stay

Now it's clear that Rust is here to stay but what's not so clear is how to get people working with Data interested in introducing Rust into their current workflows or even considering trying it out. The Data landscape is currently mostly dominated by Python and gargantuan tools that are usually very complicated to operate (Kafka, Spark, ...), which don't yet all interface seamlessly with Rust.

The argument against Python is tough when only viewed through the lens of "performance", since popular packages like NumPy or SciPy have backends implemented in languages like C or C++ that Python calls into for the heavy lifting, which should make them performant, in theory.

There is still an interface that ends up "touching" Python code (I/O, data transformation, etc.), which in most cases offsets the benefits gained from writing things in C/C++. Benchmarks only tell part of the story.

Besides those considerations, tools like Apache Kafka, Apache Spark, Apache Flink, Apache Beam (mostly written in Java/Scala) are absolutely complex and very taxing to operate and use. They're all great ideas and I respect the work that went in there, but they should be the last resort when every other possible approach has failed and you're ready to bite the bullet.

They require a complete shift in mindset and an adjustment of your pain tolerance level, so much so that whole industries are based only on making them simple to use or abstracting them away somehow. Then you end up with systems that are just too bloated and fragile to operate, and of which one single person would never be able to grasp the top 10% of failure modes, as you might be aware if you ever worked with those systems.

I think Rust might be able to help here too, as I'm willing to wager that developers writing in Rust would come up with very different interfaces, designs and system choices than Java developers. This is not a cheap diss at Java, but to put it in the words of Clay Shirky: "When you adopt a tool you also adopt the embedded management philosophy within that tool", and so far I've yet to see a single "Big" Data tool written in Java (enterprise philosophy) that is simple to deploy, operate and debug in 1 business day.

Using the right tool(s)

Moving on, a common thing when using Python is that it is always very inviting and tempting to over-engineer a particular solution to duct tape over any of the language's shortcomings. Take package management or typing for example and ask different engineers what they use for those problems or even which tools they prefer. Some will say Poetry, Pipenv, or just pip, others will add so many types in their Python code that it ends up no longer looking like Python but more like Java "syntactically salted" with Python module calls.

There's no clear cut "one thing" that works across the board as soon as you want to do something Python was not intended to do in the first place, like type checking, for instance. There are a lot of add-ons necessary to make basic functionality work reliably in a context as deterministic as dealing with data, especially around operations.

Even after all is said and done, the Python code ends up in Docker containers and all the typing stuff remains a mystery until a corner case is detected at runtime. "Runtime is fun time" only goes so far, and the only way to hope to catch things like that is to run enough tests and introduce even more tooling, more friction. All things considered, the "implementation speed" inherent to dynamic languages like Python is lost when an unbounded number of tools are added to make them safe and do things right.
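To make the "corner case at runtime" point a bit more concrete, here is a minimal sketch of how Rust pushes that kind of problem to compile time. The `Reading` struct, the `parse_reading` and `mean` functions and the "sensor,value" line format are all made up for illustration; the point is only that a missing value is modelled as an `Option` and a bad value as a `Result`, so the compiler won't let you forget either case.

```rust
/// A hypothetical sensor reading; the field names are invented for this example.
#[derive(Debug, PartialEq)]
struct Reading {
    sensor: String,
    // `None` models a missing value, so every caller has to deal with it explicitly.
    value: Option<f64>,
}

/// Parse a line of the (assumed) form "sensor,value", where value may be empty.
fn parse_reading(line: &str) -> Result<Reading, String> {
    let mut parts = line.split(',');
    let sensor = parts
        .next()
        .filter(|s| !s.is_empty())
        .ok_or("missing sensor name")?;
    let raw = parts.next().ok_or("missing value column")?.trim();
    let value = if raw.is_empty() {
        None
    } else {
        // A value that is present but not a number is an error, not a silent NaN.
        Some(raw.parse::<f64>().map_err(|e| format!("bad value {raw:?}: {e}"))?)
    };
    Ok(Reading { sensor: sensor.to_string(), value })
}

/// Mean of the values that are present; `None` if there are none at all.
fn mean(readings: &[Reading]) -> Option<f64> {
    let values: Vec<f64> = readings.iter().filter_map(|r| r.value).collect();
    if values.is_empty() {
        None
    } else {
        Some(values.iter().sum::<f64>() / values.len() as f64)
    }
}

fn main() {
    let lines = ["temp,21.5", "temp,", "temp,22.5", "temp,n/a"];
    let mut readings = Vec::new();
    for line in lines {
        match parse_reading(line) {
            Ok(reading) => readings.push(reading),
            Err(err) => eprintln!("skipping {line:?}: {err}"),
        }
    }
    println!("mean of present values: {:?}", mean(&readings));
}
```

Nothing here needs an extra type checker or linter: if you drop the `Err` arm or try to use `value` as a plain `f64`, the code simply doesn't compile.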

Python and Rust are a perfect combination

Don't get me wrong, I absolutely love Python and built my career around it, and I'll continue using it whenever I just need things to work. But I'll use it as it is and not with all the very heavy modern tooling around. Only pip and virtualenv for this fella over here. What Python has going for it is the phenomenal community and the plethora of libraries, examples and guides available for anything you can imagine. I'm sure Rust will make up for that difference when enough people get interested and start using it.

Both these languages make a fantastic combination for different milestones in any given project. The evaluation phase at the beginning of a project, when you search the problem space and try to understand what you're dealing with, is a perfect fit for Python. When it's time for production and you'd like something more reliable, fast and efficient, you can use Rust. With the right precautions, you can make sure that your Rust code still keeps a certain "Pythonic" spirit, by which I mean that it can be performant without ending up looking like C++.

However, many things are still needed in the Rust ecosystem. While I'm certain the shift from Python to Rust will not happen overnight, the day is not far off when Python data pipelines will be replaced by Rust ones and the benefits made evident. Don't just take my word for it, though, here are a few videos and resources of people preaching the benefits of Rust.
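As a small taste of what I mean by "Pythonic" Rust, here is a sketch using only the standard library. The function names and the sample data are made up for illustration; the Python one-liners in the comments are what I'd reach for in a notebook, and the Rust versions read almost the same.

```rust
use std::collections::BTreeMap;

/// Python: [x * x for x in xs if x % 2 == 0]
fn squares_of_evens(xs: &[i64]) -> Vec<i64> {
    xs.iter().filter(|x| **x % 2 == 0).map(|x| x * x).collect()
}

/// Sum the amounts per category, skipping negative amounts.
/// Python: a loop over rows filling a `defaultdict(float)`.
fn totals_by_category<'a>(rows: &[(&'a str, f64)]) -> BTreeMap<&'a str, f64> {
    let mut totals = BTreeMap::new();
    for &(category, amount) in rows.iter().filter(|row| row.1 >= 0.0) {
        // `entry` inserts 0.0 the first time a category shows up.
        *totals.entry(category).or_insert(0.0) += amount;
    }
    totals
}

fn main() {
    println!("{:?}", squares_of_evens(&[1, 2, 3, 4]));

    // Hypothetical (category, amount) rows.
    let rows = [("books", 12.0), ("games", 30.0), ("books", 8.5), ("games", -5.0)];
    for (category, total) in totals_by_category(&rows) {
        println!("{category}: {total}");
    }
}
```

I picked a `BTreeMap` over a `HashMap` here only so the output order is stable; for larger workloads a `HashMap` is usually the first thing to try.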

1357x improvement for Simulation workload by using Rust

Are We Learning Yet? (reference website): The comprehensive guide to the state of machine learning in Rust. This site catalogs ML frameworks, data structures, data cleaning and analysis, and other tools and libraries that are essential to machine learning ecosystems.

Data challenges of the future

In my opinion these accounts of successful Rust implementations will only increase in the future as the challenges presented by an ever growing appetite for Data get more complex by the day.

You might have heard it in the news: AI models are reaching mainstream usage and gaining a lot of popularity and attention (ChatGPT), especially from big cloud providers. While the code for these AI models is usually available and open source, what is not shared or won't be available is the data and compute resources required to train these models, which is where the real power lies. It isn't within the reach of everyone to manage and maintain huge systems like that, especially with the current tools in the Data space, without completely breaking the bank or getting lost in a sea of YAML files...

This is where Rust could potentially make the difference in terms of efficient resource usage and a future-proof approach to data processing. At big scales like AI and ML workloads, most of the time is spent waiting for models to run backpropagation (on GPUs) or mangling/shuffling/labelling data (in RAM/Disk). As mentioned earlier, Python helps mostly with implementation speed (YMMV), which doesn't have an immediate impact on these numbers. Rust, while taking longer to implement, might make the investment worth it when the workloads start running and things get hairy. It's a systems programming language made exactly for that.

Now Rust is definitely not a silver bullet. What I'm saying is that it might just be a great bet for data intensive tasks. Besides, what you learn in Rust for data work, you can also use for other things like embedded programming, WASM, and many, many more. Think of it as a very good multi purpose tool.

Why I'm doing this

From the feedback I received, it's currently very daunting and at times very confusing to start using Rust. Some guides seem outdated, others a bit "over complicated" and some just don't work anymore (outdated books).

As far as the complicated part is concerned, I believe it boils down to a matter of perspective. A lot of very intelligent and skilled people work with Rust, which makes most of the guides out there biased towards people who already know how to use Rust. There's an underrepresented audience which would like to just get started and solve business needs without even needing to fully appreciate or understand the inner workings of Rust.

I think a light touch, or "shallow dive" approach is best suited to integrating Rust in a Data Engineering space, to get as many people as possible interested and benefiting from the experience of "fast wins". Put simply, I think just changing stuff and seeing what happens is the best way, at least for me, to learn a new programming concept or tool.

Changing stuff and seeing what happens

This is why I'm releasing a work in progress called Data With Rust. It will contain most of what I've learned building data tools with Rust, coming from a Python background. It's an ongoing effort and things might not be 100% streamlined yet, but it'll get there.

You can access it for free and if you subscribe to my blog you'll get discounts on whatever paid resources (videos, pdf or paperback copy) might come out of it in the future.

Things constantly change, so having simple points of reference in a new, unknown field is always helpful. I want to make my contribution to the Rust community, so I'm releasing this free website that contains resources, tutorials and guides for getting started using Rust for Data Engineering.

Full disclosure, I'm by no means an expert in Rust, but I've been using it enough lately to feel confident that I can share what I know. Expect a new chapter every 2 to 3 weeks.

If you have feedback or any ideas of what I should add or improve, just reach out. I'm not hard to find online. :)