
From Data Engineer to YAML Engineer (Part II)

· 6 min read
Julien Hurault

Bonjour!

I'm Julien, freelance data engineer based in Geneva 🇨🇭.

Every week, I research and share ideas about the data engineering craft.

Software has always been a matter of abstraction.

Over the years, the industry has constructed layers upon layers to develop increasingly complex software.

The same trend is happening in the data world.

More and more tools are emerging to standardize the construction of data pipelines, pushing towards a declarative paradigm.

Engineers spend less and less time coding and more and more parametrizing and coordinating functional building blocks.

In the first version of this post (co-written with Benoît Pimpaud), we highlighted signs of this trend (AWS Pipes, Snowflake Dynamic Tables, and YAML-driven orchestration with Kestra).

We called it provocatively: From Data Engineer to YAML Engineer.

From Data Engineer to YAML Engineer — Julien Hurault and Benoit Pimpaud, November 22, 2023

One year later, the movement has only accelerated.

So, let’s keep the exploration going with:

  • Declarative ingestion & dlt
  • Declarative transformation & SQLMesh
  • Declarative BI & Rill
  • Declarative data platform & Starlake

—

Thanks to Starlake for sponsoring this post and supporting this discussion of declarative data tooling.

1- Declarative Data Ingestion: dlt

ELT tools have always been declarative—you define your connector and target, and let the tool handle the rest.

And for common sources with well-supported, battle-tested connectors, this works great:

Connectors available in Airbyte

However, when faced with an obscure API, a legacy MSSQL server, or an internal system lacking a reliable connector...

You're forced to write custom Python code to handle pagination, retries, and other complexities.

This is the main frustration with data ingestion: it's often all or nothing.

You either have a smooth declarative workflow or write boilerplate code from scratch.

This is where dlt enters the picture.

It's an open-source ELT tool that comes as a Python library.

It offers a declarative DSL for defining ingestion pipelines while maintaining the flexibility to use imperative Python code when necessary.

Here's what you can define declaratively:

  • Source (pre-built or custom connector) / Destination
  • Normalization rules
  • Data contract enforcement

In the case of an API, the configuration looks like this:
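Something along these lines, sketched with dlt's REST API source (the base URL, resource names, and DuckDB destination are placeholders):

import dlt
from dlt.sources.rest_api import rest_api_source

# Declarative description of the source: endpoints and client settings
# are expressed as configuration instead of hand-written request loops.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",
        # auth and an explicit paginator can be declared here when needed
    },
    "resources": ["posts", "comments"],
})

pipeline = dlt.pipeline(
    pipeline_name="example_api",
    destination="duckdb",
    dataset_name="raw",
)
pipeline.run(source)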

Because it’s native Python, it’s easy to switch to imperative mode when needed—for example, to extend a connector or tweak normalization logic.
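For instance, here is a rough sketch of a hand-rolled resource for a hypothetical internal API with link-based pagination, still plugged into the same dlt pipeline:

import dlt
import requests

# Hypothetical internal endpoint with no pre-built connector
@dlt.resource(table_name="invoices", write_disposition="merge", primary_key="id")
def invoices():
    url = "https://internal.example.com/api/invoices"
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        payload = response.json()
        yield payload["items"]          # hand records to dlt for normalization and loading
        url = payload.get("next_page")  # follow pagination links until exhausted

pipeline = dlt.pipeline(pipeline_name="internal_invoices", destination="duckdb")
pipeline.run(invoices())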

And yes, true to this article’s title, generating ingestion pipelines dynamically from a (TOML) config file is possible.

That’s precisely what was done in this example:

From data engineer to TOML engineer
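The rough idea looks like this (the TOML layout and keys are hypothetical, not the code from that post):

import tomllib  # Python 3.11+

import dlt
from dlt.sources.rest_api import rest_api_source

# pipelines.toml (hypothetical) declares one [sources.<name>] table per API
with open("pipelines.toml", "rb") as f:
    config = tomllib.load(f)

for name, spec in config["sources"].items():
    source = rest_api_source({
        "client": {"base_url": spec["base_url"]},
        "resources": spec["resources"],
    })
    pipeline = dlt.pipeline(pipeline_name=name, destination="duckdb", dataset_name=name)
    pipeline.run(source)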

2- Declarative Data Transformation: SQLMesh

Let’s move further down the data flow: transformation.

But instead of focusing on SQL syntax, I want to look at this from the orchestration angle.

dbt was one of the first frameworks to popularize this declarative approach, especially for defining how models should be materialized.

dbt handles the SQL logic for creating the model and managing incremental updates.

No need to manually write SQL to handle MERGE statements or deduplication—it’s abstracted away.

{{ config(
    materialized='incremental',
    unique_key='id'
) }}

SELECT ...

However, dbt has a limitation: it's stateless.

It therefore has limited awareness of execution history and timing.

Determining which models actually need to run is challenging and requires comparing run artifacts between invocations.

SQLMesh advances the declarative paradigm by introducing stateful orchestration.

It executes models and maintains a complete execution history, automatically determining what needs to be re-run based on code changes and data freshness.

All this happens without requiring manual DAG configuration in your orchestrator or job scheduler.

You say:

MODEL (
  name my.model,
  cron '5 4 1,15 * *' -- Run at 04:05 on the 1st and 15th of each month
);

SELECT * FROM ...

And SQLMesh tracks the last run, checks the model frequency, and decides whether to execute.
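In practice, the day-to-day workflow boils down to two SQLMesh CLI commands (a minimal sketch):

sqlmesh plan   # review and apply model changes, backfilling affected intervals
sqlmesh run    # execute only the models whose cron schedule is due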

It bridges the gap between transformation and orchestration—you stay in the declarative world the whole time.

3- Declarative BI: Rill

Let's continue our journey down the data flow—this time arriving in the BI world.

With traditional BI tools, the software engineering mindset seems to stop just before BI begins.

Cross that frontier and you're met with endless clicking: no version control, no reproducible environments, no modular logic.

You're left building dashboards by hand, from scratch, every single time.

I'm excited to see BI finally embrace software engineering principles through BI-as-code tools like Rill, Lightdash, and Evidence.

A Rill project, for example, consists of YAML files defining dashboards, metrics, and sources:
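As a rough sketch, loosely modeled on Rill's YAML conventions (the file name, columns, and measures are illustrative):

# metrics/orders_metrics.yaml
type: metrics_view
model: orders
timeseries: order_date
dimensions:
  - column: country
measures:
  - name: total_revenue
    expression: SUM(revenue)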

You get interactive charts and dashboards that are reproducible, version-controlled, and easy to share across environments.

4- Declarative Data Platform: Starlake

Let’s flip the script and look at Starlake, an open-source tool combining both ingestion and transformation in a unified declarative framework.

Starlake doesn’t rely on external libraries or frameworks.

Instead, it ships its own ingestion engine, transformation framework (with a custom SQL parser), and YAML interface.

This unified approach allows users to define their entire pipeline in a single YAML file:

extract:
  connectionRef: "pg-adventure-works-db"
  # Additional extraction settings...

---
load:
  pattern: "my_pattern"
  schedule: "daily"
  metadata:
    # Metadata configurations...

---
transform:
  default:
    writeStrategy:
      type: "OVERWRITE"
  tasks:
    - name: most_profitable_products
      writeStrategy:
        type: "UPSERT_BY_KEY_AND_TIMESTAMP"
        timestamp: signup
        key: [id]
Building both the ingestion and transformation frameworks from scratch makes Starlake a direct competitor to many existing tools.

Here's a recap of how it positions itself against dlt for ingestion:

And against dbt and SQLMesh for transformation:

Finally, the open-source version of Starlake comes with a UI where users can directly edit the YAML configs and SQL transformations (with an AI assistant).

Starlake UI is open source as well

The main advantage of such an approach is that it provides a consistent interface for the whole data lifecycle without the need to learn and manage many different tools.

Check out their GitHub to get started with Starlake or learn more.


Thanks for reading, and thanks, Starlake, for supporting my work and this article.

-Ju

Follow me on LinkedIn

I would be grateful if you could help me improve this newsletter.

Don't hesitate to share what you liked or disliked, and the topics you would like me to tackle.

P.S. You can reply to this email; it will get to me.

