6 posts tagged with "Data Engineering"

Introducing Starlake.ai

· 2 min read
Abdelhamide El Arib
Starlake Core Team

We're excited to unveil Starlake.ai, a groundbreaking platform designed to streamline your data workflows and unlock the full potential of your data. 🚀

The Challenges We Solve

In the modern data landscape, businesses often face these challenges:

  • Overwhelming complexity in managing data pipelines
  • Inefficiencies in transforming and orchestrating data workflows
  • Lack of robust governance and data quality assurance

Starlake tackles these problems head-on, offering a declarative data pipeline solution that simplifies the entire data lifecycle.


Our Mission

  1. Simplify Data Management: Automate ingestion, transformation, and orchestration to reduce manual effort.
  2. Enhance Data Quality: Enforce schema validation, governance rules, and SLA tracking to ensure data consistency.
  3. Accelerate Insights: Deliver clean, transformed, and analytics-ready data faster and more reliably.

Core Capabilities

  • No-Code Data Ingestion: Seamlessly ingest and validate data from any source—no coding required.
  • Low-Code Transformations: Use YAML and SQL to apply transformation rules at scale.
  • Automated Workflow Orchestration: Simplify dependencies with Airflow and Dagster integrations.
  • Data Governance and Quality: Ensure schema enforcement, validation, and SLA tracking.
  • Multi-Cloud and On-Prem Support: Run on your preferred infrastructure, including Snowflake, Databricks, and BigQuery.

Why Starlake Stands Out

Starlake doesn’t just manage your data pipelines—it transforms them. Here’s how we compare:

Feature | Starlake.ai | Traditional ETL Tools
No-Code Ingestion
Declarative Transformations
Automated Orchestration
Built-in Governance

Start Your Journey with Starlake

Getting started with Starlake is effortless:

  1. Join the Waitlist: Be among the first to explore Starlake Cloud.
  2. Integrate Your Data Sources: Quickly connect to your preferred databases and warehouses.
  3. Streamline Your Workflows: Start experiencing the power of automated, no-code pipelines.

Ready to Transform Your Data Workflows?

Starlake.ai is here to revolutionize how you manage, transform, and govern your data. Join the movement and discover a smarter way to handle data pipelines.

Learn more at Starlake.ai.

How to Load and Transform into BigQuery Wildcard Tables

· 5 min read
Hayssam Saleh
Starlake Core Team

BigQuery Wildcard Tables

When loading files into BigQuery, you may need to split your data into multiple partitions to reduce data size, improve query performance, and lower costs. However, BigQuery’s native partitioning only supports columns with date/time or integer values. While partitioning on string columns isn’t directly supported, BigQuery provides a workaround with wildcard tables, offering nearly identical benefits.

In this example, we demonstrate how Starlake simplifies the process by seamlessly loading your data into wildcard tables.

The Problem

Assume your business receives daily transactions as CSV files from branches across various countries. These files are named in the format transactions_YYYYMMDD.csv, and the goal is to load them into BigQuery, partitioned by country, region and date.

The CSV files contain data in the following format:

store_id, customer_id, trans_id, trans_date, trans_time, country, region, ...
1, 1, 1, 2024-12-19, 12:00:00, US, CA
1, 2, 2, 2024-12-19, 13:01:00, US, CA
2, 3, 3, 2024-12-19, 13:01:00, US, NY
2, 4, 4, 2024-12-19, 13:01:00, US, NY

With hundreds of millions of transactions generated daily, querying the data efficiently by country and region is critical. Partitioning the data in this way would reduce storage size and significantly speed up queries.

However, since BigQuery doesn’t support partitioning by string columns, we can leverage wildcard tables to achieve the same benefits. The desired result is to create separate tables in BigQuery for each country and region, structured as follows:

sales.transactions_US_CA
sales.transactions_US_NY

Each table would store transactions for a specific country and region, while also being partitioned by date for optimal performance.

The Solution with Starlake Load

First, we need to instruct Starlake to infer the schema of the table from the CSV files and load them into BigQuery. This can be achieved with the following command:

starlake infer-schema --domain sales \
    --table transactions \
    --input transactions_20241219.csv

This command infers the schema of the CSV file and generates a YAML file that defines the schema and loading instructions. The resulting YAML file would look like this:

transactions.sl.yml
version: 1
table:
  name: "transactions"
  pattern: "transactions_.*.csv"
  attributes:
    - name: store_id
      type: int
    - name: customer_id
      type: int
    - name: trans_id
      type: int
    - name: trans_date
      type: timestamp
    - name: trans_time
      type: timestamp
    - name: country
      type: string
    - name: region
      type: string
  metadata:
    format: DSV
    withHeader: true
    separator: ","
    writeStrategy:
      type: APPEND

Additionally, a second YAML file specifies where the input files are located:

_config_.sl.yml
version: 1
load:
  name: "sales"
  metadata:
    directory: "{{incoming_path}}/sales"

To load the data into BigQuery, follow these steps:

  1. Place the transaction files, following the expected filename pattern, into the directory {{incoming_path}}/sales. The {{incoming_path}} variable is defined in the environment configuration file.
  2. Run the following command:
starlake load --domain sales --table transactions
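Before dropping files into the incoming directory, it can help to check which of them would match the table pattern. The sketch below is only an illustration in plain Python: the pattern is modelled on the transactions_YYYYMMDD.csv naming used in this post, and matching the whole filename is an assumption, not a statement about how Starlake evaluates patterns internally.

```python
import re

# Assumed pattern, modelled on the transactions_YYYYMMDD.csv naming used in this post.
PATTERN = re.compile(r"transactions_.*\.csv")


def matches_table_pattern(filename: str) -> bool:
    """Return True when the whole filename matches the table pattern."""
    return PATTERN.fullmatch(filename) is not None


# Hypothetical incoming files: one valid, one misspelled, one with the wrong extension.
incoming = [
    "transactions_20241219.csv",
    "transations_20241219.csv",
    "transactions_20241219.json",
]
accepted_files = [name for name in incoming if matches_table_pattern(name)]
print(accepted_files)
```

A misspelled filename is silently left out of the accepted list, which is why it is worth checking names before running the load.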

Since we aim to partition the data by country, region, and transaction date, and BigQuery does not support string columns for partitioning, we can use wildcard tables to achieve the same benefits.

To implement this, we need to update the metadata section in the transactions.sl.yml file with additional instructions for creating wildcard tables.

transactions.sl.yml
...
metadata:
  ...
  sink:
    sharding: [country, region]
    partition: [trans_date]

The sharding key tells Starlake to create a wildcard table for each unique combination of country and region. The partition key tells Starlake to partition the data in each wildcard table by transaction date.
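To make the naming concrete, here is a minimal Python sketch of how rows could be routed to one table per combination of sharding columns. The function name shard_table_name and the in-memory grouping are hypothetical and only illustrate the resulting table names; they are not Starlake's implementation.

```python
from collections import defaultdict


def shard_table_name(table: str, row: dict, sharding: list[str]) -> str:
    """Build the sharded table name: base name plus one suffix part per sharding column."""
    suffix = "_".join(str(row[column]) for column in sharding)
    return f"{table}_{suffix}"


# Sample rows taken from the CSV excerpt shown earlier in this post.
rows = [
    {"trans_id": 1, "trans_date": "2024-12-19", "country": "US", "region": "CA"},
    {"trans_id": 2, "trans_date": "2024-12-19", "country": "US", "region": "CA"},
    {"trans_id": 3, "trans_date": "2024-12-19", "country": "US", "region": "NY"},
    {"trans_id": 4, "trans_date": "2024-12-19", "country": "US", "region": "NY"},
]

# Group rows by target table, mimicking one table per (country, region) pair.
shards = defaultdict(list)
for row in rows:
    shards[shard_table_name("transactions", row, ["country", "region"])].append(row)

for name, members in sorted(shards.items()):
    print(name, [member["trans_id"] for member in members])
```

The part after the common prefix (US_CA, US_NY) is what BigQuery exposes as _TABLE_SUFFIX when the tables are queried through a wildcard.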

We can now load the data into BigQuery and Starlake will create the following tables:

sales.transactions_US_CA
sales.transactions_US_NY

That's it - we've successfully loaded your data into BigQuery using wildcard tables!

Let's query the data to ensure it's been loaded correctly:

SELECT * FROM `sales.transactions_*`
WHERE _TABLE_SUFFIX = 'US_CA' AND trans_date = '2024-12-19';

SELECT * FROM `sales.transactions_*`
WHERE _TABLE_SUFFIX IN ('US_CA', 'US_NY') AND trans_date = '2024-12-19';

The Solution with Starlake Transform

In addition to loading data, Starlake also supports transforming data using SQL queries. To create the wildcard tables during the transformation process, we use the exact same attributes in the sink section of the transform YAML file as we did in the load process. Starlake will automatically create the wildcard tables in BigQuery when the transformation is executed.

Conclusion

This section illustrates how Starlake streamlines the process of loading data into BigQuery using wildcard tables.

By leveraging Starlake’s declarative approach, you can effortlessly partition your data by string columns, enhancing query performance while minimizing costs. Additionally, Starlake’s support for both data loading and transformation enables you to seamlessly create wildcard tables as part of the transformation process.

Also, Starlake’s integrated approach ensures consistency by allowing you to reuse the same attributes in the sink section of the transformation YAML file as those defined during the data loading process.

Finally, Starlake’s support for wildcard tables in BigQuery comes with zero additional cost. The data is sharded on the fly at load time in a single step, reducing cost and improving the performance of load and transform jobs.

This simplifies workflows and promotes efficiency.

Incremental models, the easy way.

· 2 min read
Hayssam Saleh
Starlake Core Team

One of the key advantages of Starlake is its ability to handle incremental models without requiring state management. This is a significant benefit of it being an integrated declarative data stack. Not only does it use the same YAML DSL for both loading and transforming activities, but it also leverages the backfill capabilities of your target orchestrator.

When running transformations, we encounter two cases:

Full-refresh

The existing data is overwritten, or a new table is created if it did not exist before.

(Figure: full refresh)

Incremental

As data arrives periodically, we have to build KPIs on that new data only. For example, the daily_sales KPI does not need the whole input dataset but just the data for the day we are computing the KPI. The computed daily KPIs are then appended to the existing dataset, greatly reducing the amount of data processed.

(Figures: Day 1, Day 2, Day 3)

Starlake allows you to write your query as follows:

(Figure: query example, image not available)

The sl_start_date and sl_end_date variables represent the start and end of the day for which the query is being executed. These are reserved environment variables that are automatically passed to the transformation process by the orchestrator (e.g., Airflow or Dagster).
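As a rough illustration of the incremental pattern, the sketch below appends one day of KPIs to a daily_sales table using a start and end bound. It runs on an in-memory SQLite database so that it is self-contained; the table layouts, sample rows, and the use of bind parameters named after sl_start_date and sl_end_date are assumptions for this example and not Starlake's actual substitution syntax.

```python
import sqlite3

# Throwaway in-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (trans_id INTEGER, trans_ts TEXT, amount REAL);
    CREATE TABLE daily_sales (sales_day TEXT, total_amount REAL);
    INSERT INTO transactions VALUES
        (1, '2024-12-18 09:00:00', 10.0),
        (2, '2024-12-19 12:00:00', 20.0),
        (3, '2024-12-19 13:01:00', 5.0),
        (4, '2024-12-20 08:30:00', 7.5);
""")

# Only rows inside [sl_start_date, sl_end_date) are read, and the result is appended.
INCREMENTAL_SQL = """
    INSERT INTO daily_sales
    SELECT date(trans_ts) AS sales_day, SUM(amount) AS total_amount
    FROM transactions
    WHERE trans_ts >= :sl_start_date AND trans_ts < :sl_end_date
    GROUP BY date(trans_ts)
"""


def run_interval(sl_start_date: str, sl_end_date: str) -> None:
    """Compute the KPI for one interval and append it to daily_sales."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            INCREMENTAL_SQL,
            {"sl_start_date": sl_start_date, "sl_end_date": sl_end_date},
        )


run_interval("2024-12-19 00:00:00", "2024-12-20 00:00:00")
daily_sales = conn.execute("SELECT sales_day, total_amount FROM daily_sales").fetchall()
print(daily_sales)
```

Because the query never looks outside the interval it is given, the job itself holds no state: whoever supplies the two bounds decides which day is computed.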

You might wonder what happens if the job fails to run for a specific day. There's no need to worry: Starlake handles this by leveraging Airflow's catchup mechanism. By configuration, Starlake asks the orchestrator to catch up on any missed intervals. To enable this, add the catchup flag to your YAML DAG definition in Starlake; the orchestrator will then run the missed intervals with the sl_start_date and sl_end_date variables set accordingly.

(Figure: YAML DAG definition with the catchup flag, image not available)
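Conceptually, catching up means enumerating every interval that has elapsed since the last successful run and executing the job once per interval. The following Python sketch shows that enumeration only; the function missed_intervals is hypothetical and does not use any Airflow or Dagster API.

```python
from datetime import datetime, timedelta


def missed_intervals(last_success_end: datetime, now: datetime,
                     step: timedelta = timedelta(days=1)):
    """List the (start, end) intervals fully elapsed since the last successful run."""
    intervals = []
    start = last_success_end
    while start + step <= now:
        intervals.append((start, start + step))
        start += step
    return intervals


# Hypothetical situation: the last good run covered data up to Dec 17, it is now Dec 20, 06:00.
last_ok = datetime(2024, 12, 17)
current_time = datetime(2024, 12, 20, 6, 0)
to_run = missed_intervals(last_ok, current_time)
for start, end in to_run:
    # Each pair would be passed to the job as its start and end bounds.
    print(start.date(), "->", end.date())
```

Each pair plays the role of the sl_start_date and sl_end_date values for one run, so a query written for a single day needs no change to be replayed over several days.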

What if you want to backfill a large period of time? Your orchestrator handles that too:

(Figure: backfill from the orchestrator, image not available)

Conclusion

In conclusion, Starlake's integrated and declarative approach simplifies the management of incremental and full-refresh transformations, eliminating the need for complex state management.

By leveraging reserved environment variables like sl_start_date and sl_end_date, Starlake integrates with orchestrators such as Airflow and Dagster to handle backfills, missed intervals, and large periods of historical data with ease. This keeps responsibilities clearly separated between Starlake and the orchestrator: teams focus on writing queries and delivering insights, while scheduling, catch-up, and backfill are delegated to the orchestrator.

How to unit test your data pipelines

· 6 min read
Bounkong Khamphousone
Starlake Core Team

In today's data-driven landscape, ensuring the reliability and accuracy of your data warehouse is paramount. The cost of not testing your data can be astronomical, leading to critical business decisions based on faulty data and eroding trust. 

The path to rigorous data testing comes with its own set of challenges. In this article, I will highlight how you can confidently deploy your data pipelines by leveraging Starlake JSQLTranspiler and DuckDB, while also reducing costs. We will go beyond testing transforms, which are usually written in SQL, and see how we can also test ingestion jobs.

The art of mastering data pipelines

Mastering your data pipeline is a challenging art. A data pipeline generally contains the following phases:

  • Collection: Extracting data from sources
  • Ingestion: Loading the extracted data into the data warehouse
  • Transformation: A phase that ultimately adds value to the collected data

The table below summarizes the tests run by Starlake on Load & Transform jobs:

Check to run on | Ingestion Test | Transform Test
Validate the filename pattern
Validate the file structure (number and types of attributes, input file format - CSV / JSON / XML / FIXED-WIDTH)
Check if loaded files or transform SQL SELECT statements are materialized according to the defined strategy (APPEND / OVERWRITE / UPSERT_BY_KEY, SCD2 …)
Check for missing or unexpected records in the resulting table
Check if the resulting table has a correct schema
Check all expectations
Check time based query output with time freeze

The results of these automated tests are designed for both human review and CI/CD integration. For human review, a website is generated to help users easily identify failures and their causes. For CI/CD integration, a JUnit report is generated, and there is an option to specify a minimum coverage threshold. If the evaluated coverage falls below this threshold, the command will result in an error, and any failing tests will also trigger errors.
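The gating rule described above can be sketched in a few lines of Python. How Starlake actually computes coverage is not detailed here, so the definition used below (share of testable items that have at least one test) is an assumption, as are the names CheckResult and ci_exit_code.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str      # item covered by the test, e.g. "sales.transactions"
    passed: bool


def ci_exit_code(results, testable_items, min_coverage: float) -> int:
    """Return 1 when any test fails or coverage is below the threshold, else 0."""
    tested = {result.name for result in results}
    items = set(testable_items)
    # Assumed definition: fraction of testable items with at least one test.
    coverage = len(tested & items) / len(items) if items else 1.0
    any_failure = any(not result.passed for result in results)
    return 1 if (any_failure or coverage < min_coverage) else 0


# Hypothetical project with three testable items, two of which have passing tests.
items = ["sales.transactions", "sales.daily_sales", "hr.employees"]
results = [CheckResult("sales.transactions", True), CheckResult("sales.daily_sales", True)]

print(ci_exit_code(results, items, min_coverage=0.5))  # coverage is 2/3, above 0.5
print(ci_exit_code(results, items, min_coverage=0.8))  # coverage is 2/3, below 0.8
```

Returning a non-zero exit code is what lets a CI job stop the pipeline, whether the cause is a failing test or insufficient coverage.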

Untested SQL costs

Thanks to data pipeline unit testing, we drastically reduce development costs. Skipping tests seems to allow high project velocity and quick delivery of value. However, the cost of a feature does not stop at its initial development; it encompasses all the effort put in until the feature is complete. Below are some hidden costs.

  • Identifying bugs
    • Without unit tests, verifying development requires deploying the project. This deployment raises challenging questions about the deployment strategy and its rollback or correction procedures.
    • Verification might be carried out by a separate QA team, sometimes even outside the project team. This can lead to the use of feature flags to avoid deploying to production, complicating the implementation. Additionally, waiting for feedback from the QA team introduces delays, increasing the cost of fixing any bugs that arise.
    • Depending on the deployment strategy, verification may also be incomplete due to a lack of control over the test data used.
  • Maintenance and Evolution Complexity
    • Many of us have faced a massive query and struggled to make modifications without disrupting existing functionality, all while aiming for improvements like optimizing processing time. Rigorous unit tests can help with this. They allow us to enhance the expected outcomes in current datasets, create new ones, and compare the modified query results with these expectations. This significantly reduces the risk of regression.
  • Decreased productivity
    • The absence of automated tests often means manually re-running parts or all of the system to ensure correct integration, which can lead to spending time fixing collateral bugs and thus reducing overall productivity. As the project advances, more components need verification, making the process even more time-consuming. This significantly diminishes the willingness to refactor or revise code.
  • Promoting expertise
    • Without unit tests, teams often assign the same tasks to the same people, which hinders skill development and increases the risk of knowledge loss due to turnover.
  • Customer dissatisfaction
    • A project with uncertain product output quality often leads to dissatisfaction, frustration, and a loss of trust in the individual, the team, or the product.

We are all aware of the hidden costs associated with the absence of tests; in my opinion, these are the most significant. With that in mind, let's look at how to test a data pipeline.

Writing unit tests in Starlake

Suppose we have the following transform.

(Figure: Starlake transform folder hierarchy)

We test it by creating the following hierarchy:

(Figure: Starlake test transform folder hierarchy)

This is how it is then executed:

(Figure: Data pipeline unit test lifecycle)
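The lifecycle can be approximated as: load fixture data into a throwaway local database, run the transform, and compare the output with the expected records. The sketch below follows those steps with Python's built-in sqlite3 standing in for DuckDB so that it runs without extra dependencies; the function run_transform_test, the fixture data, and the report layout are hypothetical and are not Starlake's test runner.

```python
import sqlite3


def run_transform_test(fixtures_sql: str, transform_sql: str, expected: list) -> dict:
    """Run a transform on fixture data and report missing and unexpected records."""
    conn = sqlite3.connect(":memory:")  # isolated database, discarded after the test
    try:
        conn.executescript(fixtures_sql)
        actual = set(conn.execute(transform_sql).fetchall())
    finally:
        conn.close()
    expected_set = set(expected)
    return {
        "missing": sorted(expected_set - actual),      # expected but not produced
        "unexpected": sorted(actual - expected_set),   # produced but not expected
    }


fixtures = """
CREATE TABLE transactions (store_id INTEGER, amount REAL);
INSERT INTO transactions VALUES (1, 10.0), (1, 5.0), (2, 7.5);
"""
transform = "SELECT store_id, SUM(amount) FROM transactions GROUP BY store_id"

# The second expected record is deliberately wrong to show what a failure report looks like.
report = run_transform_test(fixtures, transform, expected=[(1, 15.0), (2, 8.0)])
print(report)
```

An empty report on both keys means the transform produced exactly the expected records; anything else points directly at the rows that differ.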

Starlake unit tests benefits

Running tests on a local DuckDB database instead of the target Data Warehouse has the following advantages:

  • Fast Feedback: Local execution is significantly faster than using a remote database due to network latency. Additionally, the local environment might be better suited for handling small volumes of test data.
  • No Execution Cost: Depending on the pricing model of the target database, creating temporary resources and executing queries can incur both execution and storage costs.
  • Setup and Cleanup of Automated Tests: Guarantee of resource isolation.
  • Credential Issues: Running tests against a target database requires credentials, which may pose security risks.

Conclusion

In this article, we have demonstrated how adopting unit testing, a crucial practice for software engineers, can significantly enhance the quality of our data pipelines. This approach not only reduces overall costs in the medium to long term but also ensures the maintenance of dynamic and enduring documentation. Additionally, implementing unit tests is essential for rigorous CI/CD processes, enabling seamless continuous data pipeline deployment.

If you encounter any issues while performing your tests locally, please report them on the Starlake GitHub repository. Your feedback is invaluable in improving local test coverage, empowering more data engineers to deploy their work confidently and smoothly. For further discussions and support, join our team on Slack.

We greatly appreciate your contributions. If you found this article helpful, please star the project on GitHub and share it on your social networks to help us reach a broader audience.

Polars versus Spark

· 6 min read
Hayssam Saleh
Starlake Core Team

Introduction

Polars is often compared to Spark. In this post, I will highlight the main differences and the best use cases for each in my data engineering activities.

As a Data Engineer, I primarily focus on the following goals:

  1. Parsing files, validating their input, and loading the data into the target data warehouse.
  2. Once the data is loaded, applying transformations by joining and aggregating the data to build KPIs.

However, on a daily basis, I also need to develop on my laptop and test my work locally before delivering it to the CI pipeline and then to production.

What about my fellow data scientist colleagues? They need to run their workload on production data through their favorite notebook environment.

This post addresses the following points:

  • How suitable each tool is for loading files into your data warehouse.
  • How easy and powerful each tool is for performing transformations.
  • How easy it is to test your code locally before deploying to the cloud.

Load tasks

Data loading seems easy at first glance, as almost all databases and APIs offer some sort of single-line command to load a few lines or millions of records quickly. However, this simplicity disappears when you encounter real-world cases such as:

  • Fixed-width fields: These files are typically exported from mainframe databases.
  • XML files: I sometimes work in the finance industry where SWIFT is a common XML file format that will be around for some time.
  • Multi-character CSV: For example, where the separator consists of two characters like ||.
  • File validation: You cannot trust the files you receive and need to check their content thoroughly.
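To give a feel for two of these cases, here is a small Python sketch that parses a fixed-width record and a record using a two-character separator. The field widths and sample lines are made up for illustration; real layouts come from the file specification you receive.

```python
def parse_fixed_width(line: str, widths: list[int]) -> list[str]:
    """Cut a line into fields of the given widths and trim the padding."""
    fields, position = [], 0
    for width in widths:
        fields.append(line[position:position + width].strip())
        position += width
    return fields


def parse_multichar(line: str, separator: str = "||") -> list[str]:
    """Split a line on a separator made of several characters."""
    return [field.strip() for field in line.split(separator)]


# Hypothetical layout: store id (4 chars), country (4), region (2), amount (8).
fixed_fields = parse_fixed_width("0001US  CA   12.50", [4, 4, 2, 8])
multi_fields = parse_multichar("1||US||CA||12.50")
print(fixed_fields)
print(multi_fields)
```

Neither format carries its own structure, so the widths or the separator must be supplied from outside, which is exactly what makes these files awkward for single-line load commands.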

Loading correct CSV or JSON files using Spark or Polars is straightforward. However, it is also straightforward using your favorite database command line utility, making this capability somewhat redundant and even slower than the native data warehouse load feature.

However, in real-world scenarios, you want to ensure the incoming file adheres to the expected schema and load a variety of file formats not supported natively. This is where Spark excels compared to Polars, as it allows for the parallel loading of your JSONL or CSV files and offers through map operations local and distributed validation of your incoming file.
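The idea of validating each record through a map operation can be sketched in plain Python. Here the built-in map stands in for a Spark map over a distributed dataset, and the schema, field names, and sample records are assumptions for the example.

```python
from datetime import date

# Assumed schema: field name -> parser that raises ValueError on bad input.
SCHEMA = {"store_id": int, "trans_date": date.fromisoformat, "country": str}


def validate(record: dict):
    """Parse one record against the schema and collect per-field errors."""
    parsed, errors = {}, []
    for name, parser in SCHEMA.items():
        raw = record.get(name)
        try:
            if raw is None or raw == "":
                raise ValueError("missing value")
            parsed[name] = parser(raw)
        except ValueError as exc:
            errors.append(f"{name}: {exc}")
    return parsed, errors


records = [
    {"store_id": "1", "trans_date": "2024-12-19", "country": "US"},
    {"store_id": "abc", "trans_date": "2024-12-19", "country": "US"},
    {"store_id": "2", "trans_date": "19/12/2024", "country": "FR"},
]

# The same function is applied independently to every record.
validated = list(map(validate, records))
accepted = [parsed for parsed, errors in validated if not errors]
rejected = [errors for _, errors in validated if errors]
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Because the function only looks at one record at a time, the same logic can run on a laptop or be distributed across partitions without change.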

As XML, multi-character-separator CSV, and multiline CSV are only supported by Spark, Spark is definitely the most suitable solution for file parsing during data loading.

Transform tasks

Three interfaces are provided to transform data: SQL, DataFrames and Datasets.

SQL

SQL is the preferred language for most data analysts when it comes to computing aggregations. One of its key benefits is its widespread understanding, which eliminates the learning curve.

You can use SQL with both Spark and Polars, but Polars has significant limitations. It does not support SQL Window functions nor does it support the SQL statements for update, insert, delete, or merge operations.

DataFrames

Both Polars and Spark offer excellent support for DataFrames. The main added value of using DataFrames is the ability to reuse portions of code and access features not available in standard SQL, such as lambda functions, JSON handling and array manipulation.

Datasets

As a software engineer, I particularly appreciate Datasets. Datasets are typed DataFrames, meaning that syntax, column names, and types are checked at compile time. If you believe that statically typed languages greatly enhance code quality, then you understand the added value of datasets.

Datasets are only supported by Spark, allowing you to write clean, reusable, and statically typed transformations. They are available exclusively to statically typed languages such as Java or Scala.

Spark stands out as the only tool with complete support for SQL, DataFrames, and Datasets. Polars’ limited support for SQL makes it less suitable for my data engineering tasks.

Cloud Era

At least 60% of the projects I have worked on are cloud-based, primarily targeting Amazon Redshift, Google BigQuery, Databricks, Snowflake, or Azure Synapse. All of these platforms offer serverless support for Spark, making it incredibly easy to run workloads by simply providing your PySpark script and letting the cloud provider handle the rest.

In the cloud environment, Spark is definitely the tool of choice as I see it.

Single node

There has been much discussion about Spark being slow or too heavy for single-node computers. I was particularly interested in running this test since I currently execute most of my workloads on a single-node Google Cloud Run job with Spark embedded in my Docker image.

I decided to conduct this test on my almost 3-year-old MacBook Pro M1 Max with 64GB of memory. The test involved loading 27GB of CSV data, selecting a few attributes, computing metrics on those selected attributes, and then saving the results to a Parquet file.

I ran Spark with default settings and without any fine-tuning. This means it utilized all 10 cores of my MacBook Pro M1 Max but only 1 gigabyte of memory for execution.

note

I could have optimized my Spark workload, but given the small size of the dataset (27GB of CSV), it didn't make sense. The default settings were sufficient.

Here are the results after a cold restart of my laptop before each test to ensure the test did not benefit from any operating system cache.

  • Apache Spark pipeline took: 29 seconds

  • Polars pipeline took: 56 seconds

note

Rerunning Polars gave me 23 seconds instead of 56 seconds. This discrepancy is probably due to filesystem caching by the operating system.

Load and Save Test: Load a 27GB CSV file with 193 columns per record and save the result as parquet.

  • Apache Spark pipeline took: 2mn 18s

  • Polars pipeline took: 2mn 32s

Load parquet and filter on column value then return count: Load a 74-million-record parquet file with 193 columns, filter on the 'model' column and return the count.

  • Apache Spark pipeline took: 3 seconds

  • Polars pipeline took: 28 seconds

The table below summarises the results

| Task | Spark (1GB memory) | Polars (all the available memory) |
|---|---|---|
| Load CSV, aggregate and save the aggregation as parquet | 29s | 56s |
| Load CSV and save parquet | 2mn 18s | 2mn 32s |
| Load Parquet, filter and count | 3s | 28s |

Conclusion

I don’t think it is time to switch from Spark to Polars, at least for those of us accustomed to the JVM, running workloads in the cloud, or even working on small datasets. However, Polars may be a perfect fit for those familiar with pandas.

As of today, Spark is the only framework I see that can handle both local and distributed workloads, adapt to on-premise and cloud serverless jobs, and provide the complete SQL support required for most of our transformations.

Starlake OSS - Bringing Declarative Programming to Data Engineering and Analytics

· 6 min read
Hayssam Saleh
Starlake Core Team

Introduction

The advent of declarative programming through tools like Ansible and Terraform has revolutionized infrastructure deployment by allowing developers to achieve intended goals without specifying the order of code execution.

This paradigm shift brings forth benefits such as reduced error rates, significantly shortened development cycles, enhanced code readability, and increased accessibility for developers of all levels.

This is the story of how a small team of developers crafted a platform that goes beyond the boundaries of conventional data engineering by applying a declarative approach to data extraction, loading, transformation and orchestration.

Starlake

The Genesis

Back in 2015, at the helm of ebiznext, a boutique data engineering company, we faced a daunting challenge. Our client, a prominent entity in need of a robust big data solution, sought to harness the power of Hadoop and Spark. Despite our modest size (20 people), we dared to compete against industry giants with vastly greater resources (100,000+ headcount).

Our only chance to succeed was to innovate: we needed a data platform that could exponentially outperform the traditional ETL solutions pushed by our competitors. To build data pipelines, these GUI-based ETLs require an effort that is proportional to the number and complexity of the sources.

Determined to disrupt this norm, we embarked on a quest to devise a DevOps friendly platform capable of lightning-fast data ingestion from any source, without the drawbacks of ETLs or specialized engineering skills.

The day of the tender, our ability to deliver a solution that could load data in a few weeks instead of many months allowed us to stand out from the competition and win the project.

Episode 1: Smartlake Emerges

note

The basic idea behind Smartlake was that no data warehouse would stay clean if data quality is checked after the data has been loaded, and that these pre-load quality checks needed to be handled by data owners.

This left us with little choice but to embrace the declarative approach. Empowering business users, we devised a system where data formats and transformations could be described in simple JSON files. Smartlake wasn’t merely a code generator; it was a versatile engine, seamlessly ingesting diverse data formats, executing transformations, and orchestrating operations with unparalleled efficiency.

To streamline user interaction, we devised an intuitive Excel-to-JSON converter, enabling effortless specification of input formats. Thanks to Smartlake and its declarative approach, the business users were able to define load and transformation operations in a matter of minutes.

Smartlake Standout features

  • Load almost any file format at Spark speed (CSV, JSON, XML, FIXED WIDTH, Multi-record types …) or Kafka topic

  • Validate fields using user-defined schemas with user defined semantic types

  • Apply transformations on the fly to data being loaded (GDPR, normalisation, computed fields ...) with and without schema evolution

  • Sink to almost any target including Spark, Kafka, Elasticsearch.

Episode 2: Evolution to Starlake

note

The basic idea behind Starlake was to bring in all Smartlake benefits to the cloud by leveraging serverless services and Cloud Datawarehouses capabilities while minimising development and execution costs.

As the data landscape evolved, so did our vision. Cloud data warehouses emerged as formidable competitors to Spark for query execution. Recognizing this shift, we evolved Smartlake into Starlake, preserving its declarative essence while embracing YAML for enhanced readability. We maintained Spark’s prowess to run inside single or multiple container(s) for data ingestion, leveraging cloud data warehouses for query execution.

This strategic blend allowed us to optimize performance and cost-effectiveness based on specific workload requirements. The result was a reimagined platform, tailored for the cloud era, yet grounded in the principles of efficiency and simplicity that defined its inception.

The result is the Starlake OSS project that you can find on GitHub.

The capabilities of Starlake are extensively described here.

The people behind Starlake

Smartlake, the precursor to Starlake, owes its existence to the collective efforts of numerous individuals, but a select few stand out for their exceptional contributions:

  • Sam Bessalah With Sam’s presence, rallying others became effortless. His visionary outlook and knack for simplifying complexities proved transformative, setting a new standard for implementation.

  • Olivier Girardot Every team has its coding wizard, and Olivier filled that role impeccably. From leveraging Spark codegen to exploring mathematical frameworks like matryoshka, he pushed the boundaries, mentoring the team with his expertise spanning Docker, Ansible, Python, Scala and Spark internals.

  • Valentin Kasas Valentin championed functional programming in Scala. Introducing concepts like recursion schemes, he empowered the team to craft code that was not just functional but also elegant and maintainable.

As the journey progressed towards the cloud, long time data experts joined and made Starlake what it is today:

  • Bounkong Khamphousone The speed and efficiency of Starlake’s extraction and load engines owe much to his contributions.

  • Mohamad Kassir His direct involvement with customer projects and his in-depth knowledge of cloud platforms and business needs have been major assets in the evolution of Starlake.

  • Abdelhamide EL ARIB An early contributor to the load engine whose foresight and execution prowess played a significant role in shaping the platform’s current capabilities.

  • Stephane Manciot The developer behind Starlake’s declarative workflows on top of Airflow and Dagster, he was pivotal in shaping its operational backbone.

  • Cyrille Chépélov A master of codebase optimisation, Cyrille’s rewrite efforts were instrumental in ensuring the reentrant nature of Starlake’s API.

The next journey

Today with hundreds of gigabytes of data loaded and transformed daily into thousands of tables in various data warehouses, we can confidently say that Starlake is battle tested and ready for the most demanding data engineering & analytics challenges.

As Starlake is, and will always be, open source, join us in building a supportive community. Your insights and feature requests aren’t just welcome, they guide our roadmap.

Get Started with Starlake:

Join our community on GitHub

P.S. Please star the repository: https://github.com/starlake-ai/starlake. Also, if you run into any issue or have an enhancement in mind, please report it. We will gladly take care of it.


Declarative Data Pipelines: The Future of ETL with Starlake

Learn how declarative programming is revolutionizing data engineering. See how Starlake.ai simplifies ETL and data transformation.