
Dbt Fusion vs. Starlake AI: Why Openness Wins

· 5 min read
Hayssam Saleh
Starlake Core Team

Starlake vs. Dbt Fusion: Why Openness Wins

Dbt recently launched Dbt Fusion, a performance-oriented upgrade to their transformation tooling.
It’s faster, smarter, and offers features long requested by the community — but it comes bundled with tighter control, paid subscriptions, and runtime lock-in.

We've been there all along but without the trade-offs.

At Starlake, we believe great data engineering doesn’t have to come with trade-offs.

We've taken a different approach from the start:

Free and open-source core (Apache 2)
No runtime lock-in
Auto-generated orchestration for Airflow, Dagster, Snowflake Tasks, and more
Production-grade seed and transform tools

Let’s break it down.

Feature-by-Feature Comparison

note

Disclaimer: Dbt offers a free tier for teams with fewer than 15 users. This comparison focuses on organizations with more than 15 users, where most of Dbt Fusion’s advanced features are gated behind a paid subscription.

  • Fast engine: Starlake uses a Scala-based engine for lightning-fast performance, while Dbt Fusion relies on a Rust-based engine.
  • Database offloading: Dbt Fusion uses SDF and DataFusion, while Starlake leverages JSQLParser and DuckDB for cost-effective SQL transformations and database offloading.
  • Native SQL comprehension: Both tools enable real-time error detection, SQL autocompletion, and context-aware assistance without needing to hit the data warehouse. The difference? With Dbt Fusion, it’s a paid feature. With Starlake, it’s free and open.
  • State-aware orchestration: Dbt Fusion's orchestration is limited to Dbt's SaaS offering, while Starlake generates DAGs for any orchestrator, with ready-made templates for Airflow, Dagster, and Snowflake Tasks.
  • Lineage & governance: Dbt Fusion offers lineage and governance features in its paid tier, while Starlake provides these capabilities free and open-source.
  • Web-based visual editor: Dbt Fusion comes with a YAML editor only as part of its paid tier, while Starlake offers, in addition to a YAML editor, a free web-based visual editor.
  • Platform integration, i.e. a consistent experience across all interfaces: Dbt Fusion's platform integration is available only in its paid tier, while Starlake provides free integration with various platforms.
  • Data seeding: Dbt Fusion supports CSV-only data seeding, while Starlake offers full support for various data formats (CSV, JSON, XML, fixed-length ...) with schema validation and user-defined materialization strategies.
  • On-Premise / BYO Cloud: Dbt Fusion does not offer an on-premise or BYO cloud option, while Starlake supports both, allowing you to use the same tools and codebase across environments.
  • VSCode extension: Dbt Fusion's VSCode extension is free for up to 15 users, while Starlake's extension is always free.
  • SaaS Offering: Dbt Fusion is a SaaS offering, while Starlake is open-source with a SaaS offering coming soon.
  • MCP Server: Dbt Fusion's MCP Server requires a paid subscription to use its tools, while Starlake provides a free, full-fledged MCP Server for managing your data pipelines.
  • SQL Productivity tools: Dbt comes with DBT Canva, a paid product; at Starlake this is handled by Starlake Copilot through English prompts, free and open-source.
| Feature | Dbt Fusion | Starlake.ai |
|---|---|---|
| Fast engine | Yes (Rust-based) | Yes (Scala-based) |
| State-aware orchestration | Limited to Dbt's own orchestrator | Yes, on Airflow, Dagster, Snowflake Tasks, etc. |
| Native SQL comprehension | Based on SDF | Based on JSQLParser/JSQLTranspiler |
| Database offloading | DataFusion | DuckDB |
| Lineage & governance | Paid tier | Free |
| Web-based visual editor | No | Yes, always free |
| Platform integration | Paid tier | Free |
| Data seeding | Tiny CSV files only | Production-grade support for various formats with schema validation |
| On-Premise / BYO Cloud | Not available | Yes |
| VSCode extension | Paid tier | Always free |
| MCP Server | Paid tier | Yes (free) |
| SQL Productivity tools | Paid product (DBT Canva) | Free and open-source (Starlake Copilot) |
| SaaS Offering | Yes | Coming soon |

Strategy Matters As Much As Features

Many tools advertise flexibility - but in practice, they quietly funnel users into proprietary runtimes.
Dbt Fusion is no exception.

Their orchestrator is gated behind a paid cloud platform, and most features require a subscription once your team grows.

Starlake doesn’t play that game.

We provide:

  • A single declarative YAML layer for extract, ingest, transform, validate, and orchestrate
  • One config = Multiple warehouses (BigQuery, Snowflake, Redshift…)
  • Your orchestrator = Your choice, fully integrated
  • Auto-generated DAGs, no manual workflow wiring
  • Run it locally, in the cloud, or anywhere in between
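
As a rough, hedged sketch, here is what that single YAML layer can look like; the connection names, patterns, schedules, and write strategies below are illustrative values, not a complete reference.

extract:
  connectionRef: "my-postgres-db"          # illustrative source connection

---
load:
  pattern: "orders_*.csv"                  # files matching this pattern are loaded and validated
  schedule: "daily"                        # picked up by the generated DAG

---
transform:
  default:
    writeStrategy:
      type: "OVERWRITE"
  tasks:
    - name: revenue_summary                # a SQL model materialized in your warehouse
      writeStrategy:
        type: "UPSERT_BY_KEY_AND_TIMESTAMP"
        timestamp: order_date
        key: [order_id]

From this one declarative description, Starlake derives the DAGs for whichever orchestrator you pick: Airflow, Dagster, or Snowflake Tasks.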

Who Should Choose Starlake?

Starlake is ideal for:

  • Data teams who want speed without lock-in
  • Enterprises who need production-grade on premise and cloud data pipelines without vendor lock-in
  • Startups who want open-source pricing and cloud-scale performance
  • Teams who prefer Airflow, Dagster, Google Cloud Composer, AWS Managed Airflow, Astronomer, Snowflake Tasks, or any engine they already trust

Whether you're building your first pipeline or managing thousands across clouds, Starlake lets you grow on your terms.


Final Thought

Dbt Fusion makes bold claims — and to their credit, they’ve pushed the modern data stack forward.

But openness without freedom is just marketing.

Starlake gives you both.
✅ Open-source.
✅ Free to use.
✅ Orchestrate anywhere.

👉 Ready to experience the freedom of open-source, no-lock-in data engineering? Visit starlake.ai, check out our documentation to get started, or join our community to learn more.

Starlake Iceberg integration

· 3 min read
Hayssam Saleh
Starlake Core Team

Introduction

This is a quick walkthrough of how to use Starlake to load data into Iceberg and how to run transformations on Iceberg tables with Starlake. To make sure it all works, we will query those tables with DuckDB.

Project setup

We will use the starlake bootstrap command to create a new project.

$ mkdir starlake-iceberg
$ cd starlake-iceberg
$ starlake bootstrap

This will create a new project with a default configuration.

Let's update the application.sl.yml file to use Iceberg. We do not need to include any extra library, since the Iceberg jars are distributed with Starlake.

application:
  defaultWriteFormat: iceberg
  spark:
    sql.extensions: "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    sql.catalog.spark_catalog.type: hadoop
    sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    sql.catalog.local.type: hadoop
    sql.catalog.spark_catalog.warehouse: "{{SL_ROOT}}/warehouse"
    sql.catalog.local.warehouse: "{{SL_ROOT}}/warehouse"
    sql.defaultCatalog: local

Iceberg metadata is stored in the warehouse folder.

By default, Starlake uses the DuckDB connection. Let's switch the active connection for Iceberg. Create the file env.ICEBERG.sl.yml in the metadata directory with the following content:

version: 1
env:
  activeConnection: spark_local

Loading data into Iceberg

The bootstrap comes with sample files. We just need to run the following command to load the data into Iceberg.

$ export SL_ENV=ICEBERG # to use definitions in the env.ICEBERG.sl.yml file
$ starlake autoload

That's it! We have loaded the data into Iceberg.

Run transformations with Starlake on Iceberg tables

Let's create a new transformation.

$ mkdir metadata/transform/kpi
$ touch metadata/transform/kpi/revenue_summary.sql

Edit the file metadata/transform/kpi/revenue_summary.sql with the following content:

SELECT
    o.order_id,
    o.timestamp AS order_date,
    SUM(ol.quantity * ol.sale_price) AS total_revenue
FROM
    starbake.orders o
    JOIN starbake.order_lines ol ON o.order_id = ol.order_id
GROUP BY
    o.order_id, o.timestamp

Let's first preview the results of the transformation.

$ starlake transform  --name kpi.revenue_summary --interactive table
+--------+-----------------------+------------------+
|order_id| order_date| total_revenue|
+--------+-----------------------+------------------+
| 40|2024-02-11 06:49:28.665| 68.24|
| 8|2024-01-23 20:47:53.667| 8.68|
| 27|2024-02-26 01:12:45.282| 30.0|
| 46|2024-02-10 18:27:05.732| 45.0|
| 56|2024-01-30 07:33:08.621| 75.0|
| 35|2024-01-17 00:30:21.277| 18.18|
| 3|2024-02-10 23:10:30.685| 16.84|
| 54|2024-02-05 08:03:21.197| 115.64|
| 48|2024-02-17 10:05:36.367| 17.06|
| 45|2024-01-16 04:21:01.494| 45.44|
| 98| 2024-01-16 10:47:28.92| 72.72|
| 78| 2024-01-07 07:48:02.53| 45.0|
| ...
+--------+-----------------------+------------------+

Now, let's run the transformation.

$ starlake transform  --name kpi.revenue_summary

You should see the loaded data in the Iceberg tables and the transformation results in the kpi.revenue_summary Iceberg table. The resulting directory structure is the following:

warehouse/
├── audit
│   ├── audit
│   └── rejected
├── kpi
│   └── revenue_summary
└── starbake
    ├── order_lines
    ├── orders
    └── products

Querying data

To query the data, we can use DuckDB.

$ duckdb
v1.1.1 af39bd0dcf
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

> INSTALL iceberg;
> LOAD iceberg;
> WITH o AS (
    SELECT * FROM iceberg_scan('warehouse/starbake/orders')
  ),
  ol AS (
    SELECT * FROM iceberg_scan('warehouse/starbake/order_lines')
  )
  SELECT
    o.order_id,
    o.timestamp AS order_date,
    SUM(ol.quantity * ol.sale_price) AS total_revenue
  FROM o
  JOIN ol ON o.order_id = ol.order_id
  GROUP BY o.order_id, o.timestamp;

That's it! We have queried the data from Iceberg using DuckDB.

Full code is available here.

Configure, Don't Code: How Declarative Data Stacks Enable Enterprise Scale

· 15 min read
Simon Späti
Data Engineer & Technical Author at ssp.sh


Imagine building enterprise data infrastructure where you write 90% less code but deliver twice the value. This is the promise of declarative data stacks.

The open and modern data stack freed us from vendor lock-in, allowing teams to select best-of-breed tools for ingestion, ETL, and orchestration. But this freedom comes at a cost: fragmented governance, security gaps, and potential technical debt when stacking disconnected tools across your organization.

On the flip side, closed-source platforms offer unified experiences but trap you in their ecosystems, where you can't access the code or extend beyond their feature sets when the need arises.

What if we could have the best of both worlds?

Enter declarative data stacks – open-source solutions that seamlessly integrate powerful orchestration tools like Airflow and Dagster while covering your entire data lifecycle. These are complete data platforms with "batteries included" where configuration replaces coding, dramatically reducing complexity while implementing best practices by default.

In this article, we'll explore how declarative data stacks manage enterprise data complexity. You'll discover how Starlake, an open-source declarative data stack, enables data transformations through simple configurations, helping teams deliver production-ready platforms in days instead of months while automatically maintaining governance and lineage.

At its core, Starlake serves as your data warehouse orchestrator – a unified control center that manages your entire data ecosystem through a single, coherent lineage. Not just a tool, but an end-to-end OSS data platform that makes "configure, don't code" a practical reality for enterprise teams.

What Is a Declarative Data Stack?

A declarative data stack is essentially an end-to-end data engineering tool that is made for complexity. It has all the features you need included. It extracts, loads, transforms, and orchestrates. The mantra: Code less, deliver more.

What's the catch? You need to align with a common configuration style. You need a framework, a tool, something that abstracts and does the hard work, so that we, the users, can simply configure and run it. This is where YAML comes in. Some also call it the YAML Engineer.

In this Part 1, we will go through the benefits and anatomy of declarative data stacks, and see how Starlake can help us with that. In Part 2 we'll have a look at an end-to-end data engineering project that does ingestion from any configurable source, using SQL transformation and orchestrating it with Airflow, Dagster, or even Snowflake Tasks (you choose!), showing a declarative data stack in action.

And then finally, we'll choose a BI dashboard to visualize the data (not part of Starlake).

The Evolution Toward Declarative Data Engineering

If we look back at where we come from, we can see that 20 years back, everyone just picked a vendor like SAP, Oracle, or Microsoft, and ran with their database or business intelligence solutions. You had lots of options, but if you needed anything specific, you had to request that feature and hope someone implemented it. You were captured in a monolithic system.

Because these were widely spread, you had a good chance that what you wanted was already there. But there was always this one customer that wanted that one extra thing. That's why a little later, the boom with open-source happened. All of a sudden, plenty of US startups were sharing their tools like Hadoop and its whole ecosystem. Everyone got used to sharing their code and became excited about solving tough problems together.

This also took over the data space and became one of the starting points for data engineering when Maxime Beauchemin started sharing his trilogy about functional data engineering. More and more tools got open-sourced and this is where we landed a while back with the explosion of tools and the Modern Data Stack, or Open Data Stack, a modular system. Usually the deployment and DevOps involved are quite a massive effort; think Terraform, Helm Charts, or Kustomize and how these democratized declarative engineering over the last decades.

Today we are entering the next phase of the cycle. We are back to one platform ruling them all, or better, bringing together the strengths from both of these worlds: having one platform integrated with open-source tools. But how do we achieve this? Exactly, you guessed it right, with declarative data stacks.

What Makes a Data Stack Declarative?

This transition from a monolithic to a modular system, back to a monolithic platform but with OSS plug-and-play tools to choose from with integrated templates, is a great evolution learning from the past. And declarative data stacks are the key to that.

But maybe you haven't heard of that term, or are unsure what it means. I have written about it at length in The Rise of the Declarative Data Stack, but here is the short version of it:

A declarative data stack is a set of tools whose configuration, taken as a whole, can be thought of as a single function such as run_stack(serve(transform(ingest))) that can recreate the entire data stack.

Instead of having one framework per piece, we want multiple tools combined into a single declarative data stack. Like the Modern Data Stack, but integrated the way Kubernetes integrates all infrastructure into a single deployment defined in YAML.

We focus on the end-to-end data engineering lifecycle, from ingestion to visualization. But what does the combination with declarative mean? Think of functional data engineering, which leaves us in a place of confident reproducibility with few side effects (hopefully none) and uses idempotency to restart functions, recover, and reinstate a particular state with confidence, or roll back to a specific version.

At its core, we can define the full data stack with a single configuration file (or files). And instead of defining the how, we can configure what we want to achieve. Essentially code less and deliver more with the benefit of a more reproducible and maintainable data stack.

Breaking Free from Black Boxes

Declarative data stacks also break free from black boxes of different OSS tools that you have a hard time integrating. Instead, you get a data-aware data platform that integrates through parametric configuration into the end-to-end data pipeline.

This contrasts with Airflow, where the orchestrator has no idea what is happening inside each task: every task is a black box. As more complex data platforms become the norm, knowing whether a task has finished is not enough. With a declarative approach, you focus more on the what: the data assets. This means we define them, their logic, and their dependencies, down to their column-level lineage and types. This opens up the black box for us in an easy and approachable way, mainly in YAML.

The YAML configuration approach makes the internal workings more explicit and accessible, rather than hiding them, making the configuration and produced assets transparent and accessible to us and our users.

The YAML configs also make it approachable to non-programmers, as YAML is the universal configuration language. YAML is also easy to integrate with LLMs, more so as it's a superset of JSON and easily integrates with JSON Schema. This means each YAML can be converted to JSON and integrated with the powerful data serialization language of JSON Schema. This leads to a (longer) context window where AI and humans can iterate on and integrate into any code editor or architecture. Ultimately, it helps to make English engineering a reality.
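
To make that concrete, here is a toy illustration (deliberately simplified, not Starlake's actual schema): a small YAML fragment, followed by a JSON Schema, itself written in YAML, that tooling or an LLM could use to validate and autocomplete it.

# A toy YAML configuration fragment
load:
  schedule: "daily"
  tables:
    - name: orders
      format: CSV

---
# A JSON Schema for that fragment, written in YAML (convertible 1:1 to JSON)
$schema: "https://json-schema.org/draft/2020-12/schema"
type: object
required: [load]
properties:
  load:
    type: object
    properties:
      schedule: { type: string }
      tables:
        type: array
        items:
          type: object
          required: [name]
          properties:
            name: { type: string }
            format: { enum: [CSV, JSON, XML] }

Because the schema is machine-readable, the same file can drive editor autocompletion, validation in CI, and richer context for AI assistants.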

Don't Code, Declare.

This is the main mantra behind declarative data stacks: don't code, but declare and get a fully-fledged data platform.

The Anatomy of an Open Declarative Data Stack

But what does that data platform include, and what's the anatomy of such a data stack? A declarative data stack does not only support one part of the data engineering lifecycle but the entire process end-to-end.

We go from extracting data to orchestrating them all in one platform:

  • Extract: Configuration-driven data ingestion
  • Load: Zero-code data loading with built-in validations
  • Transform: SQL-based transformations with automatic optimizations
  • Orchestrate: Integration with existing workflow managers
  • Visualization: Present and visualize created data assets
  • Lineage: Visibility across the entire stack

If you will, these are the core bones and skeleton of a declarative data stack. There are now many ways one could implement this. In the following chapters, we'll analyze how Starlake implemented these components, and how it can help us support complex data landscapes.

Starlake: An Open Source Declarative Data Platform

Starlake gets you all the discussed benefits of a declarative data stack in one single platform, configurable by YAML. It offers a powerful open-source solution to embrace declarative data engineering end-to-end.


In a nutshell, Starlake is a comprehensive declarative data platform that:

  • Is based in France, is entirely open-source, and had its first commit on May 21 2018 by Hayssam Saleh (LinkedIn).
  • Runs on JVM (built with Scala) for cross-platform compatibility, bringing type safety and performance
  • Uses a YAML-based configuration approach (similar to how Terraform works for infrastructure)
  • Covers the entire data lifecycle: extract, load, transform, test, and orchestrate
  • Supports multiple data warehouse engines (Snowflake, BigQuery, Databricks, etc.)
  • Integrates with popular orchestrators like Airflow, Dagster, Snowflake Tasks, and Databricks scheduler (coming soon)
  • Eliminates complex coding with a "declare, don't code" philosophy
  • Features built-in data governance, validation, and lineage tracking
  • Enables development in-memory or on local DuckDB, as well as deployment to any supported platform
  • Supports multi-engine and cross-engine capabilities (query one warehouse, write to another)

It has automated data quality checks and validation, native integration with major data warehouses, and no-code and low-code paradigms support. It also comes with enterprise-grade security and governance out of the box and supports both batch and real-time processing.

Starlake bridges the gap between open-source flexibility and enterprise-ready data platforms, allowing organizations to build robust data pipelines with minimal code while maintaining complete visibility and control. It builds and orchestrates data warehouses with the complexity of data in mind from the get-go.

Why Starlake for Enterprises?

But why would you use Starlake at a large enterprise? What are the immediate benefits? In condensed form, you could say it helps address these three main problems that we face in the field of data engineering:

  • Overwhelming complexity in managing data pipelines
  • Inefficiencies in transforming and orchestrating data workflows
  • Lack of robust governance and data quality assurance

Starlake simplifies the data management effort with configuration-driven ingestion, transformation, and orchestration, and reduces manual implementation of each data pipeline.

This enhances data quality and the overall reliability of the platform with enforced governance, schema validation, rules, and SLAs across the system. It accelerates time to insights and opens up opportunities to create additional data pipelines with less technical people through the UI and configuration interface.

If we compare it to traditional ETL tools, the table below will help us understand the differences:

| Feature | Starlake.ai | Traditional ETL Tools | Benefits |
|---|---|---|---|
| Core Architecture | Declarative (YAML-based) | Imperative (code-heavy) | Reduced maintenance burden, improved readability |
| No-Code Ingestion | ✅ | ❌ (requires custom coding) | Faster implementation, accessible to non-programmers |
| Declarative Transformations | ✅ | ❌ | Simplified maintenance, consistent patterns |
| Automated Orchestration | ✅ (integrates with Airflow/Dagster) | ❌ (requires manual setup) | Reduced setup time, automated dependency management |
| Built-in Governance | ✅ | ❌ | Enforced data quality and consistency |
| Cross-Engine Capabilities | ✅ | ❌ | Flexibility to work across different platforms |
| Multi-Engine Support | ✅ | ❌ | Prevents vendor lock-in |
| Automated Schema Evolution | ✅ | ❌ | Adapts to changing data structures automatically |
| SQL Transpilation | ✅ | ❌ | Write once, deploy anywhere capability |
| Local Development | ✅ (with DuckDB) | ❌ | Faster development cycles, reduced cloud costs |
| Data Quality Validation | ✅ (built-in) | ⚠️ (limited/add-on) | Higher data reliability |
| End-to-End Lineage | ✅ (column-level) | ⚠️ (typically table-level) | Enhanced visibility and troubleshooting |
| Row/Column Level Security | ✅ | ⚠️ (often requires add-ons) | Better compliance capabilities |
| Gen AI Readiness | ✅ (via JSON Schema) | ❌ | Future-proof architecture |
| Learning Curve | Moderate (YAML) | High (multiple languages/tools) | Faster team onboarding |
| Implementation Time | Fast | Slow | Quicker time to insights |
| Maintenance Burden | Low | High | Reduced technical debt |

In the end, it manages complexity and the return on investment in the platform grows with rising complexity. The reason is the declarative nature and the encapsulation of complexity into the platform (Starlake), allowing the data engineer and user of the platform to focus on the business requirements and analytics.

Contrast that with manually writing imperative Python scripts to schedule your data pipelines, or managing your infrastructure by hand-installing multiple tools and making them work together. This is the main challenge the current landscape faces, and the hidden cost of imperative data work.

That's why integrated systems are on the rise and where configuration-driven stacks bridge the gap to become easier to maintain and more reliable.

When Is Starlake the Wrong Option?

If you have a simple POC, know for a fact that the platform will never get wider use or grow bigger, or the requirements are in constant flux, then a custom imperative solution is easier to build. You can still use Starlake to start quickly with all batteries included, but it is a better fit for abstracting a big, complex data landscape into a manageable, governed data platform.

If your focus is on visualization, note that Starlake ships no presentation tool, though it is easy to hook one up to the generated data assets. The best option is to stay in the code-first, declarative approach with BI solutions such as Rill, Evidence, or Lightdash.

Starlake is especially strong with large enterprises, which shows in its current customer base: mostly large organizations with high volumes (100 GB/day and 100 TB in BigQuery), multi-cloud setups such as Google Cloud with reporting on Azure, or, at the other end of the spectrum, streaming-like requirements with new small files arriving every minute, ingested into Redshift and visualized in Superset.

Multiple Verticals into Data Engineering

Starlake spans across multiple verticals within the data landscape and isn't limited to one aspect of data engineering. Just as YAML has become the universal configuration language for various domains, declarative data stacks integrate several critical verticals of the data engineering ecosystem under a unified configuration approach.

  • Orchestration: Rather than writing complex DAG code, workflows are defined in YAML and automatically translated to Airflow or Dagster execution plans. This abstracts away the complexity of pipeline scheduling and dependency management.
  • Transformation: SQL is already declarative by nature, but Starlake enhances it with automated optimizations and transpilation across different warehouse dialects—write once, run anywhere.
  • Infrastructure: Similar to how Terraform revolutionized infrastructure management, Starlake applies declarative principles to data infrastructure, enabling version-controlled data assets.
  • API Integration: Much like OpenAPI Specification uses YAML to define REST APIs, Starlake allows declarative configuration of data sources and destinations.
  • Governance: Data quality rules, schema validation, and access controls are all defined declaratively rather than scattered across multiple codebases.

When to Choose a Declarative Approach

That's it: that's why you should configure rather than code if you are optimizing for a large enterprise. This article is part one of three. Before we see Starlake in action in Part 2 and analyze what's possible today with declarative data stacks, as well as their future, including how to integrate with the latest GenAI and GenBI capabilities, let's wrap up this article.

We have learned that declarative data stacks represent a paradigm shift in enterprise data engineering - "Declare your intent, don't code it." By embracing configuration over custom coding, organizations gain reproducibility, maintainability, and governance without sacrificing flexibility. With tools like Starlake, you can develop and test locally on DuckDB using your target warehouse's SQL dialect, reducing costs and enabling faster development cycles while ensuring consistent quality through automated testing.

Breaking free from black boxes doesn't mean abandoning the open data stack, but rather integrating best-in-class tools under a unified, configuration-driven approach. The result is a data platform that scales with your complexity, reduces technical debt, and democratizes data engineering by making it accessible to YAML engineers alongside traditional coders. As data landscapes grow increasingly complex, the declarative approach offers a sustainable path forward that balances enterprise requirements with development agility.

I believe the future of data engineering will change significantly, but with a declarative data stack, we are ready for whatever changes come with an adaptable configuration-first system that nicely abstracts implementation logic from business requirements. This allows the future of data pipeline development to be adaptable, governed, and ready to evolve with organizational needs.

From Data Engineer to YAML Engineer (Part II)

· 9 min read

Bonjour!

I'm Julien, freelance data engineer based in Geneva 🇨🇭.

Every week, I research and share ideas about the data engineering craft.


Software has always been a matter of abstraction.

Over the years, the industry has constructed layers upon layers to develop increasingly complex software.

The same trend is happening in the data world.

More and more tools are emerging to standardize the construction of data pipelines, pushing towards a declarative paradigm.

Engineers spend less and less time coding and more and more parametrizing and coordinating functional building blocks.

In the first version of this post (co-written with Benoît Pimpaud), we highlighted signs of this trend (AWS Pipes, Snowflake Dynamic Tables, and YAML-driven orchestration with Kestra).

We called it provocatively: From Data Engineer to YAML Engineer.


One year later, the movement has only accelerated.

So, let’s keep the exploration going with:

  • Declarative ingestion & dlt
  • Declarative transformation & SQLMesh
  • Declarative BI & Rill
  • Declarative data platform & Starlake

Thanks to Starlake for sponsoring this post and supporting such a discussion on declarative data tooling.

1- Declarative Ingestion: dlt

ELT tools have always been declarative—you define your connector and target and let the tool handle the rest.

And for common sources with well-supported, battle-tested connectors, this works great:

Connectors available in Airbyte

However, when faced with an obscure API, legacy MSSQL server, or an internal system lacking a reliable connector...

You're forced to write custom Python code to handle pagination, retries, and other complexities.

This is the main frustration with data ingestion: it's often all or nothing.

You either have a smooth declarative workflow or write boilerplate code from scratch.

This is where dlt enters the picture.

It's an open-source ELT tool that comes as a Python library.

It offers a declarative DSL for defining ingestion pipelines while maintaining the flexibility to use imperative Python code when necessary.

Here's what you can define declaratively:

  • Source (pre-built or custom connector) / Destination
  • Normalization rules
  • Data contract enforcement

In the case of an API, the configuration looks like this:

Because it’s native Python, it’s easy to switch to imperative mode when needed—for example, to extend a connector or tweak normalization logic.

And yes, true to this article’s title, generating ingestion pipelines dynamically from a (TOML) config file is possible.

That’s precisely what was done in this example:

From data engineer to TOML engineer

2- Declarative Data Transformation: SQLMesh

Let’s move further down the data flow: transformation.

But instead of focusing on SQL syntax, I want to look at this from the orchestration angle.

dbt was one of the first frameworks to popularize this declarative approach, especially for defining how models should be materialized.

dbt handles the SQL logic for creating the model and managing incremental updates.

No need to manually write SQL to handle MERGE statements or deduplication—it’s abstracted away.

{{
  config(
    materialized='incremental',
    unique_key='id'
  )
}}

SELECT ...

However, dbt has a limitation: it's stateless.

It has, therefore, limited awareness of execution history and timing.

Determining which models need to run is challenging, requiring comparisons of run artifacts.

SQLMesh advances the declarative paradigm by introducing stateful orchestration.

It executes models and maintains a complete execution history, automatically determining what needs to be re-run based on code changes and data freshness.

All this happens without requiring manual DAG configuration in your orchestrator or job scheduler.

You say:

MODEL (
  name my.model,
  cron '5 4 1,15 * *' -- Run at 04:05 on the 1st and 15th of each month
)

SELECT * FROM ...

And SQLMesh tracks the last run, checks the model frequency, and decides whether to execute.

It bridges the gap between transformation and orchestration—you stay in the declarative world the whole time.

3- Declarative BI: Rill

Let's continue our journey down the data flow—this time arriving in the BI world.

With traditional BI tools, the software engineering mindset seems to stop just before BI begins.

Cross that frontier, and you'll be met with endless clicking: there is no version control, reproducible environments, or modular logic.

You're left building dashboards by hand, from scratch, every single time.

I'm excited to see BI finally embrace software engineering principles through BI-as-code tools like Rill, Lightdash, and Evidence.

A Rill project, for example, consists of YAML files defining dashboards, metrics, and sources:
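
For a flavor of what that looks like, here is an illustrative sketch of such files; the field names approximate the BI-as-code idea and are not guaranteed to match Rill's exact syntax.

# sources/orders.yaml -- where the dashboard data comes from (illustrative)
type: source
connector: duckdb
sql: SELECT * FROM read_parquet('data/orders.parquet')

---
# dashboards/orders.yaml -- metrics and dimensions declared, not clicked together
type: metrics_view
title: Orders overview
timeseries: order_date
dimensions:
  - column: customer_country
measures:
  - name: total_revenue
    expression: SUM(quantity * sale_price)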

You get interactive charts and dashboards that are reproducible, version-controlled, and easy to share across environments.

4- Declarative Data Platform: Starlake

Let’s flip the script and look at Starlake, an open-source tool combining both ingestion and transformation in a unified declarative framework.

Starlake doesn’t rely on external libraries or frameworks.

Instead, they've built their own ingestion engine, transformation framework (with a custom SQL parser), and YAML interface.

This unified approach allows users to define their entire pipeline in a single YAML file:

extract:
  connectionRef: "pg-adventure-works-db"
  # Additional extraction settings...

---
load:
  pattern: "my_pattern"
  schedule: "daily"
  metadata:
    # Metadata configurations...

---
transform:
  default:
    writeStrategy:
      type: "OVERWRITE"
  tasks:
    - name: most_profitable_products
      writeStrategy:
        type: "UPSERT_BY_KEY_AND_TIMESTAMP"
        timestamp: signup
        key: [id]

Building both the ingestion and transformation frameworks from scratch makes Starlake a direct competitor to many other players.

Here's a recap of how they position themselves vs dlt for the ingestion:

And vs dbt and SQLMesh for the transformation:

Finally, the open-source version of Starlake comes with a UI where users can directly edit the YAML config and SQL transformations (with an AI assistant).

Starlake UI is open source as well

The main advantage of such an approach is that it provides a consistent interface for the whole data lifecycle without the need to learn and manage many different tools.

Check out their GitHub to get started with Starlake or learn more.


Thanks for reading, and thanks, Starlake, for supporting my work and this article.

-Ju

Follow me on Linkedin

I would be grateful if you could help me to improve this newsletter.

Don’t hesitate to share with me what you liked/disliked and the topic you would like to be tackled.

P.S. You can reply to this email; it will get to me.

Thanks for reading Ju Data Engineering Newsletter! Subscribe for free to receive new posts and support my work.


Snowflake Data Loading

· 7 min read
Hayssam Saleh
Starlake Core Team

Summary

Snowflake offers powerful tools for data loading and transformation, so why consider Starlake? What distinguishes Starlake, and why is it important? This article delves into these questions, exploring how Starlake complements and enhances your Snowflake experience. Specifically, this article tackles the challenges of loading files into Snowflake.

Although Starlake supports transformation activities, the scope of this article is limited to data loading.

Introducing Starlake.ai

· 2 min read
Abdelhamide El Arib
Starlake Core Team

We're excited to unveil Starlake.ai, a groundbreaking platform designed to streamline your data workflows and unlock the full potential of your data. 🚀

The Challenges We Solve

In the modern data landscape, businesses often face these challenges:

  • Overwhelming complexity in managing data pipelines
  • Inefficiencies in transforming and orchestrating data workflows
  • Lack of robust governance and data quality assurance

Starlake tackles these problems head-on, offering a declarative data pipeline solution that simplifies the entire data lifecycle.

How to Load and Transform into BigQuery Wildcard Tables

· 5 min read
Hayssam Saleh
Starlake Core Team

Sharding

BigQuery Wildcard Tables

When loading files into BigQuery, you may need to split your data into multiple partitions to reduce data size, improve query performance, and lower costs. However, BigQuery’s native partitioning only supports columns with date/time or integer values. While partitioning on string columns isn’t directly supported, BigQuery provides a workaround with wildcard tables, offering nearly identical benefits.

In this example, we demonstrate how Starlake simplifies the process by seamlessly loading your data into wildcard tables.

Incremental models, the easy way.

· 3 min read
Hayssam Saleh
Starlake Core Team

Incremental models, the easy way.

One of the key advantages of Starlake is its ability to handle incremental models without requiring state management. This is a significant benefit of it being an integrated declarative data stack. Not only does it use the same YAML DSL for both loading and transforming activities, but it also leverages the backfill capabilities of your target orchestrator.
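
As a hedged sketch (the task name, key, and timestamp column are illustrative), an incremental model in that YAML DSL boils down to declaring a write strategy on the task:

transform:
  tasks:
    - name: revenue_summary
      writeStrategy:
        type: "UPSERT_BY_KEY_AND_TIMESTAMP"   # incremental upsert, no separate state store to manage
        timestamp: order_date
        key: [order_id]

Backfills are then delegated to the target orchestrator (Airflow, Dagster, Snowflake Tasks) rather than tracked by Starlake itself.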

How to unit test your data pipelines

· 6 min read
Bounkong Khamphousone
Starlake Core Team

In today's data-driven landscape, ensuring the reliability and accuracy of your data warehouse is paramount. The cost of not testing your data can be astronomical, leading to critical business decisions based on faulty data and eroding trust. 

The path to rigorous data testing comes with its own set of challenges. In this article, I will highlight how you can confidently deploy your data pipelines by leveraging Starlake JSQLTranspiler and DuckDB, while also reducing costs. We will go beyond testing your transforms, usually written in SQL, and see how we can also test our ingestion jobs.

Polars versus Spark

· 6 min read
Hayssam Saleh
Starlake Core Team

Introduction

Polars is often compared to Spark. In this post, I will highlight the main differences and the best use cases for each in my data engineering activities.

As a Data Engineer, I primarily focus on the following goals:

  1. Parsing files, validating their input, and loading the data into the target data warehouse.
  2. Once the data is loaded, applying transformations by joining and aggregating the data to build KPIs.

However, on a daily basis, I also need to develop on my laptop and test my work locally before delivering it to the CI pipeline and then to production.

What about my fellow data scientist colleagues? They need to run their workload on production data through their favorite notebook environment.