this post was submitted on 21 May 2024

This is my first try at anything open source so any feedback is welcome :)

beeng@discuss.tchncs.de 1 points 5 months ago (1 children)
kato@programming.dev 5 points 5 months ago

ETL stands for extract, transform, and load. It is a widely used architecture for data pipelines: you read some data from different sources (like an S3 or GCS bucket), apply some transformation logic to either aggregate the data or reshape it in some other way, like changing the schema, and then output the result as a different data product.
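To make the three stages concrete, here is a minimal sketch in plain Rust using only the standard library. Everything in it is made up for illustration (the `Record` type, the `category,amount` input format, and the function names); it is not the API of this library, DataFusion, or delta-rs, and a real pipeline would read from an object store instead of a string.

```rust
use std::collections::BTreeMap;

/// One parsed input row (hypothetical schema: a category and an amount).
#[derive(Debug, Clone, PartialEq)]
struct Record {
    category: String,
    amount: f64,
}

/// Extract: parse `category,amount` lines and skip anything malformed.
/// Assumption: the raw text stands in for data fetched from S3, GCS, etc.
fn extract(raw: &str) -> Vec<Record> {
    raw.lines()
        .filter_map(|line| {
            let (category, amount) = line.split_once(',')?;
            let amount = amount.trim().parse::<f64>().ok()?;
            Some(Record {
                category: category.trim().to_string(),
                amount,
            })
        })
        .collect()
}

/// Transform: aggregate the amounts per category.
fn transform(records: &[Record]) -> BTreeMap<String, f64> {
    let mut totals = BTreeMap::new();
    for record in records {
        *totals.entry(record.category.clone()).or_insert(0.0) += record.amount;
    }
    totals
}

/// Load: write the result with a new schema (`category,total`).
/// Here the "sink" is just a CSV string, sorted by category.
fn load(totals: &BTreeMap<String, f64>) -> String {
    let mut out = String::from("category,total\n");
    for (category, total) in totals {
        out.push_str(&format!("{category},{total:.2}\n"));
    }
    out
}
```

Calling `load(&transform(&extract(raw)))` chains the three stages. The aggregation in `transform` is the kind of step a query engine would normally express as a `GROUP BY` instead of a hand-written loop.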

These pipelines usually run on a schedule or get triggered, so that they periodically output data for different time periods. That way you can deal with large sets of data by breaking them down into more manageable pieces, for a downstream data science team or a team of data analysts for example.
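As a rough sketch of the "break it down by time period" idea, the snippet below splits a time range into fixed-size windows that a scheduler could process one run at a time. The `Window` type and `split_into_windows` function are hypothetical names for illustration, and timestamps are plain Unix seconds to keep it dependency-free.

```rust
/// A half-open time window `[start, end)` in Unix seconds (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Window {
    start: u64,
    end: u64,
}

/// Split `[start, end)` into consecutive windows of at most `step` seconds.
/// The last window is shortened so it never runs past `end`.
fn split_into_windows(start: u64, end: u64, step: u64) -> Vec<Window> {
    assert!(step > 0, "step must be positive");
    let mut windows = Vec::new();
    let mut cursor = start;
    while cursor < end {
        // saturating_add avoids overflow near u64::MAX
        let next = cursor.saturating_add(step).min(end);
        windows.push(Window { start: cursor, end: next });
        cursor = next;
    }
    windows
}
```

Each scheduled run would then handle exactly one `Window`, for example one day with `step = 86_400`, which keeps every run small and makes it possible to re-run a single period on its own.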

What this library aims to do is combine the querying capabilities of DataFusion, which is a query parser and query engine, with the Delta Lake protocol to provide a pretty capable framework for building these pipelines in a short amount of time. I've used both DataFusion and delta-rs for some time and I really love these projects, as they enable me to use Rust in my day job as a data engineer, which is usually a Python-dominated field.

However, they are quite complex because they cover a wide variety of use cases. This library tries to reduce the complexity of using them by constraining them to the use case of building simple data pipelines.