2 posts tagged with "dbt"

Dbt Fusion vs. Starlake AI: Why Openness Wins

· 5 min read
Hayssam Saleh
Starlake Core Team

Dbt recently launched Dbt Fusion, a performance-oriented upgrade to their transformation tooling.
It’s faster, smarter, and offers features long requested by the community — but it comes bundled with tighter control, paid subscriptions, and runtime lock-in.

We've been there all along, without the trade-offs.

At Starlake, we believe great data engineering doesn’t have to come with trade-offs.

We've taken a different approach from the start:

Free and open-source core (Apache 2)
No runtime lock-in
Auto-generated orchestration for Airflow, Dagster, Snowflake Tasks, and more
Production-grade seed and transform tools

Let’s break it down.

Feature-by-Feature Comparison

note

Disclaimer: Dbt offers a free tier for teams with fewer than 15 users. This comparison focuses on organizations with more than 15 users, where most of Dbt Fusion’s advanced features are gated behind a paid subscription.

  • Fast engine: Starlake uses a Scala-based engine for lightning-fast performance, while Dbt Fusion relies on a Rust-based engine.
  • Database offloading: Dbt Fusion uses SDF and DataFusion, while Starlake leverages JSQLParser and DuckDB for cost-effective SQL transformations and database offloading.
  • Native SQL comprehension: Both tools enable real-time error detection, SQL autocompletion, and context-aware assistance without needing to hit the data warehouse. The difference? With Dbt Fusion, it’s a paid feature. With Starlake, it’s free and open.
  • State-aware orchestration: Dbt Fusion's orchestration is limited to Dbt's own SaaS offering, while Starlake generates DAGs for any orchestrator, with ready-made templates for Airflow, Dagster, and Snowflake Tasks.
  • Lineage & governance: Dbt Fusion offers lineage and governance features in their paid tier, while Starlake provides these capabilities free and open-source.
  • Web-based visual editor: Dbt Fusion comes with a YAML editor only, as part of their paid tier, while Starlake offers, in addition to a YAML editor, a free web-based visual editor.
  • Platform integration (a consistent experience across all interfaces): Dbt Fusion's platform integration is available in their paid tier, while Starlake provides free integration with various platforms.
  • Data seeding: Dbt Fusion supports CSV-only data seeding, while Starlake offers full support for various data formats (CSV, JSON, XML, fixed-length, ...) with schema validation and user-defined materialization strategies.
  • On-Premise / BYO Cloud: Dbt Fusion does not offer an on-premise or BYO cloud option, while Starlake supports both, allowing you to use the same tools and codebase across environments.
  • VSCode extension: Dbt Fusion's VSCode extension is free for up to 15 users, while Starlake's extension is always free.
  • SaaS Offering: Dbt Fusion is a SaaS offering, while Starlake is open-source with a SaaS offering coming soon.
  • MCP Server: Dbt Fusion's MCP Server requires a paid subscription for tool use, while Starlake provides a free, full-fledged MCP Server for managing your data pipelines.
  • SQL Productivity tools: Dbt comes with Dbt Canvas, a paid product, while at Starlake this is handled by Starlake Copilot through English prompts, free and open-source.
| Feature | Dbt Fusion | Starlake.ai |
| --- | --- | --- |
| Fast engine | Yes (Rust-based) | Yes (Scala-based) |
| State-aware orchestration | Limited to Dbt's own orchestrator | Yes, on Airflow, Dagster, Snowflake Tasks, etc. |
| Native SQL comprehension | Based on SDF | Based on JSQLParser/JSQLTranspiler |
| Database offloading | DataFusion | DuckDB |
| Lineage & governance | Paid tier | Free |
| Web-based visual editor | No | Yes, always free |
| Platform integration | Paid tier | Free |
| Data seeding | Tiny CSV files only | Production-grade support for various formats with schema validation |
| On-Premise / BYO Cloud | Not available | Yes |
| VSCode extension | Paid tier | Always free |
| MCP Server | Paid tier | Yes (free) |
| SQL Productivity tools | Paid product (Dbt Canvas) | Free and open-source (Starlake Copilot) |
| SaaS Offering | Yes | Coming soon |

Strategy Matters As Much As Features

Many tools advertise flexibility - but in practice, they quietly funnel users into proprietary runtimes.
Dbt Fusion is no exception.

Their orchestrator is gated behind a paid cloud platform, and most features require a subscription once your team grows.

Starlake doesn’t play that game.

We provide:

  • A single declarative YAML layer for extract, ingest, transform, validate, and orchestrate
  • One config = Multiple warehouses (BigQuery, Snowflake, Redshift…)
  • Your orchestrator = Your choice, fully integrated
  • Auto-generated DAGs, no manual workflow wiring
  • Run it locally, in the cloud, or anywhere in between

Who Should Choose Starlake?

Starlake is ideal for:

  • Data teams who want speed without lock-in
  • Enterprises who need production-grade on premise and cloud data pipelines without vendor lock-in
  • Startups who want open-source pricing and cloud-scale performance
  • Teams who prefer Airflow, Dagster, Google Cloud Composer, AWS Managed Airflow, Astronomer, Snowflake Tasks, or any engine they already trust

Whether you're building your first pipeline or managing thousands across clouds, Starlake lets you grow on your terms.


Final Thought

Dbt Fusion makes bold claims — and to their credit, they’ve pushed the modern data stack forward.

But openness without freedom is just marketing.

Starlake gives you both.
✅ Open-source.
✅ Free to use.
✅ Orchestrate anywhere.

👉 Ready to experience the freedom of open-source, no-lock-in data engineering? Visit starlake.ai, check out our documentation to get started, or join our community to learn more.

From Data Engineer to YAML Engineer (Part II)

· 9 min read

Bonjour!

I'm Julien, freelance data engineer based in Geneva 🇨🇭.

Every week, I research and share ideas about the data engineering craft.


Software has always been a matter of abstraction.

Over the years, the industry has constructed layers upon layers to develop increasingly complex software.

The same trend is happening in the data world.

More and more tools are emerging to standardize the construction of data pipelines, pushing towards a declarative paradigm.

Engineers spend less and less time coding and more and more parametrizing and coordinating functional building blocks.

In the first version of this post (co-written with Benoît Pimpaud), we highlighted signs of this trend: AWS Pipes, Snowflake Dynamic Tables, and YAML-driven orchestration with Kestra.

We called it provocatively: From Data Engineer to YAML Engineer.

From Data Engineer to YAML Engineer, by Julien Hurault and Benoit Pimpaud (November 22, 2023).

One year later, the movement has only accelerated.

So, let’s keep the exploration going with:

  • Declarative ingestion & dlt
  • Declarative transformation & SQLMesh
  • Declarative BI & Rill
  • Declarative data platform & Starlake

Thanks to Starlake for sponsoring this post and supporting this discussion of declarative data tooling.

1- Declarative Ingestion: dlt

ELT tools have always been declarative—you define your connector and target and let the tool handle the rest.

And for common sources with well-supported, battle-tested connectors, this works great:

Connectors available in Airbyte

However, when faced with an obscure API, legacy MSSQL server, or an internal system lacking a reliable connector...

You're forced to write custom Python code to handle pagination, retries, and other complexities.

This is the main frustration with data ingestion: it's often all or nothing.

You either have a smooth declarative workflow or write boilerplate code from scratch.

This is where dlt enters the picture.

It's an open-source ELT tool that comes as a Python library.

It offers a declarative DSL for defining ingestion pipelines while maintaining the flexibility to use imperative Python code when necessary.

Here's what you can define declaratively:

  • Source (pre-built or custom connector) / Destination
  • Normalization rules
  • Data contract enforcement

In the case of an API, the configuration looks like this:
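The original post shows this as a screenshot; as a minimal sketch (the base URL, token key, resource name, and pagination type below are illustrative assumptions, not values from the article), a dlt rest_api_source configuration looks roughly like this:

import dlt
from dlt.sources.rest_api import rest_api_source

# Declarative description of the API: client settings plus the resources to load.
# Every concrete value here (URL, secret key, endpoint, pagination) is a placeholder.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",
        "auth": {"type": "bearer", "token": dlt.secrets["api_token"]},
        "paginator": {"type": "header_link"},
    },
    "resources": [
        {
            "name": "orders",
            "endpoint": {"path": "orders", "params": {"per_page": 100}},
            "primary_key": "id",
            "write_disposition": "merge",
        },
    ],
})

# dlt handles pagination, retries, normalization, and loading into the destination.
pipeline = dlt.pipeline(pipeline_name="example_api", destination="duckdb", dataset_name="raw")
pipeline.run(source)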

Because it’s native Python, it’s easy to switch to imperative mode when needed—for example, to extend a connector or tweak normalization logic.

And yes, true to this article’s title, generating ingestion pipelines dynamically from a (TOML) config file is possible.

That’s precisely what was done in this example:

From data engineer to TOML engineer
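To give a flavor of that pattern, here is a small sketch under my own assumptions (the TOML layout and keys are invented for illustration, not copied from the linked example): each TOML entry is turned into its own dlt pipeline.

import tomllib  # standard library since Python 3.11

import dlt
from dlt.sources.rest_api import rest_api_source

# pipelines.toml (hypothetical layout):
#
# [github]
# base_url = "https://api.github.com/repos/dlt-hub/dlt/"
# resources = ["issues", "pulls"]

with open("pipelines.toml", "rb") as f:
    config = tomllib.load(f)

for name, spec in config.items():
    # Plain strings are a valid resource shorthand: the endpoint path defaults to the name.
    source = rest_api_source({
        "client": {"base_url": spec["base_url"]},
        "resources": spec["resources"],
    })
    pipeline = dlt.pipeline(pipeline_name=name, destination="duckdb", dataset_name=name)
    pipeline.run(source)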

2- Declarative Data Transformation: SQLMesh

Let’s move further down the data flow: transformation.

But instead of focusing on SQL syntax, I want to look at this from the orchestration angle.

dbt was one of the first frameworks to popularize this declarative approach, especially for defining how models should be materialized.

dbt handles the SQL logic for creating the model and managing incremental updates.

No need to manually write SQL to handle MERGE statements or deduplication—it’s abstracted away.

{{
  config(
    materialized='incremental',
    unique_key='id'
  )
}}

SELECT ...

However, dbt has a limitation: it's stateless.

It has, therefore, limited awareness of execution history and timing.

Determining which models need to run is challenging, requiring comparisons of run artifacts.

SQLMesh advances the declarative paradigm by introducing stateful orchestration.

It executes models and maintains a complete execution history, automatically determining what needs to be re-run based on code changes and data freshness.

All this happens without requiring manual DAG configuration in your orchestrator or job scheduler.

You say:

MODEL (
  name my.model,
  cron '5 4 1,15 * *' -- Run at 04:05 on the 1st and 15th of each month
)

SELECT * FROM ...

And SQLMesh tracks the last run, checks the model frequency, and decides whether to execute.

It bridges the gap between transformation and orchestration—you stay in the declarative world the whole time.

3- Declarative BI: Rill

Let's continue our journey down the data flow—this time arriving in the BI world.

With traditional BI tools, the software engineering mindset seems to stop right where BI begins.

Cross that frontier, and you'll be met with endless clicking: no version control, no reproducible environments, no modular logic.

You're left building dashboards by hand, from scratch, every single time.

I'm excited to see BI finally embrace software engineering principles through BI-as-code tools like Rill, Lightdash, and Evidence.

A Rill project, for example, consists of YAML files defining dashboards, metrics, and sources:
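As a rough, hypothetical illustration (the file names and keys below are assumptions based on Rill's documented project layout and may differ between versions), a source definition and a metrics/dashboard definition could look something like this:

# sources/orders.yaml -- where the data comes from (illustrative)
connector: s3
uri: s3://my-bucket/orders/*.parquet

# dashboards/orders.yaml -- metrics and dimensions for the explore view (illustrative)
type: metrics_view
title: Orders
model: orders
timeseries: order_date
dimensions:
  - column: country
    label: Country
measures:
  - expression: SUM(revenue)
    label: Revenue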

You get interactive charts and dashboards that are reproducible, version-controlled, and easy to share across environments.

4- Declarative Data Platform: Starlake

Let’s flip the script and look at Starlake, an open-source tool combining both ingestion and transformation in a unified declarative framework.

Starlake doesn’t rely on external libraries or frameworks.

Instead, they've built their own ingestion engine, transformation framework (with a custom SQL parser), and YAML interface.

This unified approach allows users to define their entire pipeline in a single YAML file:

extract:
  connectionRef: "pg-adventure-works-db"
  # Additional extraction settings...

---
load:
  pattern: "my_pattern"
  schedule: "daily"
  metadata:
    # Metadata configurations...

---
transform:
  default:
    writeStrategy:
      type: "OVERWRITE"
  tasks:
    - name: most_profitable_products
      writeStrategy:
        type: "UPSERT_BY_KEY_AND_TIMESTAMP"
        timestamp: signup
        key: [id]
Building both the ingestion and transformation frameworks from scratch makes Starlake a direct competitor to many existing tools.

Here's a recap of how they position themselves vs dlt for the ingestion:

And vs dbt and SQLMesh for the transformation:

Finally, the open-source version of Starlake comes with a UI where users can directly edit the YAML configs and SQL transformations (with an AI assistant).

Starlake UI is open source as well

The main advantage of such an approach is that it provides a consistent interface for the whole data lifecycle without the need to learn and manage many different tools.

Check out their GitHub to get started with Starlake or learn more.


Thanks for reading, and thanks, Starlake, for supporting my work and this article.

-Ju

Follow me on Linkedin

I would be grateful if you could help me to improve this newsletter.

Don’t hesitate to share what you liked or disliked, and the topics you would like to see tackled.

P.S. You can reply to this email; it will get to me.
