
Beyond Basic ETL: Enterprise Data Capabilities Without the Complexity

· 17 min read
Simon Späti
Data Engineer & Technical Author at ssp.sh


Most data teams spend 80% of their time wrestling with infrastructure—writing custom UPSERT logic, building incremental loading from scratch, or debugging why their Airflow DAGs failed again at 2 AM. Meanwhile, the business is still waiting for those critical data insights.

This article is a follow-up to Part 1, where we explored declarative data stacks. Here, we dive into the specific capabilities that let you skip most of the engineering challenges and focus on delivering business value. You'll discover how to extract and load data from multiple sources with built-in quality controls, configure complex write strategies like SCD2 and UPSERT with simple configuration changes, and transform data using SQL that transpiles seamlessly across BigQuery, Snowflake, Databricks, and DuckDB. We'll also explore how to generate production-ready orchestration workflows without writing custom code, all while testing locally before deploying to production.

By the end, you'll see how declarative approaches can eliminate custom development while giving you enterprise features that usually require dedicated platform teams, from automated data quality validation to intelligent dependency management across multiple execution engines.

In Part 3, we'll explore advanced use cases and unique features that make Starlake particularly powerful for complex data scenarios.

Why Declarative Data Stacks Go Beyond the Modern Data Stack

Let's quickly recap why you'd want a declarative data stack. The modern data stack is a concept rather than a single physical product. Generally, it's defined as a modular stack where you can flexibly exchange and combine data tools, integrating them into an end-to-end data platform.

This is also where most of the frustration comes from: some tools don't work well together, and keeping up with constant changes in each of them while you build is a lot of work. Though the code and licenses cost nothing, the stack is expensive to set up yourself if you don't have the expertise.

This is where a declarative stack comes in handy. It's a pre-built framework to stitch together the best-in-class tools. It's opinionated toward tools that work well together, but that comes with a big efficiency gain. Plus, you can get started in minutes, as everything is declarative.

Starlake, for example, lets you define your entire stack through configuration alone. It feels as if it came from one vendor or platform: you have a unified configuration layer to set up and configure your data stack, even though various tools from the data stack run behind the scenes.

The Configuration-First Approach to Data Platform Building

A declarative data stack lets you focus on the business challenges while giving you automation, orchestration, and other complex capabilities, usually added as an afterthought, from day one. Because Starlake is open-source and builds on the so-called Modern Data Stack, there's no vendor lock-in.

Starlake uses external engines to run its compute. With SQL transpilers and support for multiple engines, you can easily switch compute while keeping the same business logic and integration in Starlake. Plus, you get instant scaling because compute runs on those external engines.

Where declarative data stacks shine is when complexity grows. With a predefined setup that's built and used at some of the biggest companies in France and elsewhere, you are prepared for that growth. Starlake was built from the ground up to address this with declarative configurations: instead of glue code, you get type-safe definitions, built-in orchestration integration, and full cloud flexibility, all open-source.

If you want to know more about Starlake's open-source declarative data platform and why you might consider it for an enterprise use case, I suggest checking out Part 1 first. Otherwise, let's explore Starlake in action before we focus on the key features that make it very interesting for data engineers later in this article.

Complete ETL and Data Engineering Capabilities

In this chapter, we cover capabilities that every data engineer needs for a successful analytics solution: solving the fundamental data engineering jobs and delivering analytics results with capabilities that come out of the box with Starlake and go beyond basic ETL.

*Illustration of Starlake with its modular structure from extract, load, and transform, connected by orchestration.*

Powerful features include:

  • No-Code Data Ingestion: Connect to any data source through simple YAML configurations without writing custom extraction code.
  • Data Governance and Quality at every step: Built-in validation rules and lineage tracking ensure data reliability from ingestion to analytics.
  • Automated Workflow Orchestration: Dependencies and scheduling are handled automatically based on your declarative configurations.

In the next chapters, we go through them step by step and elaborate on how a declarative data stack like Starlake works and the capabilities it offers.

Extract & Load: From Source to Warehouse

Extract pulls data directly from your source databases. Extraction lets you export tables from a database as a set of files, either in one shot or incrementally.

Load, on the other hand, is when you load from local files or files on S3/R2, for example. This is loading your Parquet, CSV, or JSON files.
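
As a rough sketch of what an incremental extraction could look like, consider the following configuration. The keys are simplified for illustration and are not necessarily Starlake's exact schema:

```yaml
# Illustrative sketch only -- simplified keys, not necessarily Starlake's exact schema.
extract:
  connectionRef: "postgres_prod"     # JDBC connection defined elsewhere in the project
  jdbcSchemas:
    - schema: "sales"
      tables:
        - name: "orders"
          fullExport: false          # incremental: only rows newer than the last extract
          partitionColumn: "updated_at"
        - name: "customers"
          fullExport: true           # small table, re-extract in one shot
```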

Supported Data Sources

Starlake supports a comprehensive range of data sources:

| Category | Data Sources |
| --- | --- |
| File Formats | Structured: CSV, TSV, PSV (with multi-char separators)<br>Semi-structured: JSON, XML, YAML<br>Binary: Parquet, Avro, ORC<br>Compressed: ZIP, GZIP, BZIP2<br>Other: fixed-width position files, Excel spreadsheets |
| Cloud Storage | Amazon S3, Google Cloud Storage, Azure Blob Storage, local filesystem |
| Databases & Warehouses | Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, MySQL, Oracle, SQL Server, DuckDB/DuckLake, any JDBC database |
| Streaming & Events | Apache Kafka, Amazon Kinesis, Google Pub/Sub, Azure Event Hubs |

For some warehouses and cloud providers, deep integrations exist. With Snowflake, for example, you simply install Starlake as a Snowflake Native App and get all its features inside Snowflake. This avoids sharing your Snowflake credentials with an external SaaS platform, prevents data exfiltration, and lets you use your Snowflake compute credits to run Starlake with a unified bill.

You work with declarative markup on top of Snowflake and avoid DevOps. With Starlake as a native app, there are no security headaches: credentials are never shared, and no data is exfiltrated. Everything is SQL, and the data stays secure inside Snowflake. You save a lot of time by not having to do this work yourself.

Organized by Domain

In Starlake, the data is organized by domains. A domain corresponds to a database schema, also called a dataset on BigQuery.

Built-in Quality Features

The loading capabilities go beyond moving files. For example, we can declare data types and their formats when we load, so they are applied consistently for everyone from the very start. We can set attributes and write strategies such as APPEND, OVERWRITE, UPSERT_BY_KEY, UPSERT_BY_TIMESTAMP, DELETE_THEN_INSERT, SCD2, or even ADAPTIVE. These are all very powerful on their own, as you'd typically need to write extensive custom code, especially for incremental loads, Slowly Changing Dimensions (Type 2), or UPSERTs. Having these integrated, and being able to switch by simply changing the write strategy, is handy and significant.
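
To make this concrete, here is a hedged sketch of a load configuration that applies an SCD2 write strategy. The strategy names come from the list above; the surrounding keys are simplified and may differ from Starlake's exact schema:

```yaml
# Illustrative sketch only -- strategy names follow the article,
# surrounding keys are simplified and may differ from Starlake's exact schema.
table:
  name: "customers"
  pattern: "customers.*.csv"        # incoming files matching this pattern land in the table
  metadata:
    format: "DSV"
    separator: ";"
    writeStrategy:
      type: "SCD2"                  # one-line switch to APPEND, OVERWRITE, UPSERT_BY_KEY, ...
      key: ["customer_id"]          # business key used to detect changes
      timestamp: "updated_at"       # change-tracking column for SCD2 validity ranges
  attributes:
    - name: "customer_id"
      type: "long"
      required: true
    - name: "email"
      type: "string"
    - name: "updated_at"
      type: "timestamp"
```

Switching from SCD2 to, say, UPSERT_BY_KEY then becomes a one-line configuration change instead of rewriting MERGE logic by hand.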

*Example of how to load data and its configurations in the Starlake GUI*

Dedup Strategy for Better Data Quality

Write strategies help ensure users never see duplicated rows where deduplication is specified. For example, you can declare that a customer table should be deduplicated, and Starlake upserts the data accordingly.

Type validation and expectations are two features that improve data quality out of the box. Type validation checks the types of the data you are loading against the declared schema, while expectations let you test whether the resulting table contains the expected data.

Transform on load lets you quickly add columns that the source database doesn't provide, such as a load timestamp or a locally formatted timestamp.
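
As a hedged illustration of how expectations and transform-on-load might sit side by side in a table definition (keys simplified, not a verbatim Starlake configuration):

```yaml
# Illustrative sketch only -- expectation and transform keys are simplified placeholders,
# not a verbatim Starlake configuration.
table:
  name: "orders"
  attributes:
    - name: "order_id"
      type: "long"
      required: true                 # type validation: reject rows failing the declared type
    - name: "load_ts"
      type: "timestamp"
      script: "current_timestamp()"  # transform on load: column added at ingestion time
  expectations:
    - "count_rows() > 0"             # flag an empty load
    - "is_unique(order_id)"          # guard against duplicate business keys
```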

For BigQuery and Databricks, there is built-in support for user-defined clustering and partitioning to improve query performance on large datasets.

Access Control with ACL or row-level security for Snowflake, BigQuery or Spark is built in. This is pretty cool.


All of these configurations live in one place, and they are propagated downstream to all your consumers.

Other powerful and advanced enterprise features that you'd spend much time and effort to build otherwise are:

| No-Code Data Ingestion | Cloud-Native Integration | Security & Governance |
| --- | --- | --- |
| Real-time streaming via Kafka | Native AWS, GCP, Azure support | Column-level encryption |
| Built-in error handling & recovery | Cloud-agnostic deployment | Data masking & anonymization |
| Alerting, Freshness, Schedule | Serverless execution | Audit logging & compliance |
| | Multi-cloud data sync | GDPR & CCPA support |

Beyond core data processing, Starlake excels in development workflow and platform integration:

| CI/CD Integration | Version Control & Testing | Platform Integrations |
| --- | --- | --- |
| Git-based version control | Automated testing with DuckDB | GitHub Actions |
| Pipeline validation workflows | Infrastructure as Code support | GitLab CI |
| | | Azure DevOps |
| | | Jenkins |

Data Modeling and Schema Definition

We can also define foreign keys and primary keys in the load menu. This helps us model the data when we combine different sources and create new tables, or when the data source does not have proper relations defined.

*An example of the many options when loading data*

Transform: Business Logic, Simplified

Transforms hold the critical business logic. We define them mostly in SQL or Python, and custom transformations are also possible. They can be written directly in the worksheet.

*Starlake worksheet data transformation example.*

We can then preview our transformation to check that we have the right granularity or result. Beyond that, we can transpile into other SQL dialects, for example to run the DuckDB SQL we tested locally on our Spark or BigQuery cluster. This lets us test SQL against an in-memory DuckDB database without hitting the data warehouse, giving us more interactivity at lower cost.

Clicking preview shows the transpiled code that will run on the currently configured engine, or we can simply check the lineage for that aggregation.

*Preview our data aggregation, or hit transpiled or lineage to see more detailed information.*

These transformations can be stacked upon each other, building your data application with newly created data models and powerful aggregations. Check out a demo here.

*An example of column-level lineage with its dependencies and an overview of your data assets in the web UI. Hovering with the mouse reveals the transformation applied (see SUM() above).*

Lineage from plain SQL

Thanks to its deep integration with open-source projects like JSQLParser and JSQLTranspiler (both funded by Starlake), the platform can automatically infer data lineage directly from your SQL statements. This enables Starlake to generate orchestrator DAGs and manage dependencies without requiring manual Jinja references or explicit dependency declarations. As a result, orchestration order and lineage tracking are handled by the platform, freeing developers from boilerplate and reducing the risk of errors.

This also makes it easy to copy and share queries with other team members, streamlining collaboration and reducing friction in your workflow.

Tests: Ensuring Data Quality at Scale

The next step is to test, orchestrate, and deploy. Testing is especially important in data engineering and analytics platforms: if we ship false key performance indicators (KPIs) or inconsistent data, business consumers won't trust the data anymore.

That's where extensive tests come in, from unit tests to full load or transform task runs. These tests check the correctness of the data transformations, the data loading process, and the data quality.

Type validation is built in and done with DuckDB. For example, you can get quite creative in defining your types and have them validated ahead of runtime against a test dataset.
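
A hedged sketch of such a custom type definition might look like the following; the key names are simplified and may not match Starlake's exact types file:

```yaml
# Illustrative sketch only -- a custom semantic type validated at load time.
# Key names are simplified and may not match Starlake's exact types file.
types:
  - name: "email"
    primitiveType: "string"
    pattern: '^[^@\s]+@[^@\s]+\.[^@\s]+$'   # rows that don't match end up in the rejected table
  - name: "iso_date"
    primitiveType: "date"
    pattern: "yyyy-MM-dd"                   # parsed and validated as a date, not a plain string
```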

When rows are rejected during a production run, they can be written to dedicated rejected tables together with a report summary of why they invalidated the definition. This lets only valid data into the system and provides an elegant way to send a report based on the rejected table to the domain owner for further analysis.

The best part is that Starlake lets you test load tasks locally (using DuckDB) before deploying to production. This saves many development cycles, since you avoid running a long job overnight only to find a typo in the morning.

Orchestrate: Automated Workflow Management

Orchestration runs the transformations on a regular schedule using your orchestration tool of choice; Dagster, Airflow, and Snowflake Tasks are integrated, for example.

Starlake doesn't ship its own orchestrator but relies on the major open-source ones. It natively integrates with Dagster or Airflow, or you can use Snowflake Tasks (Snowflake's native orchestrator) to run compute. The motto: always use the orchestrator that already exists or is already in use.

No-Code DAG Generation with Templates

Furthermore, Starlake supports templating for customizing DAG generation using Jinja2 and Python. Similar tasks can share the same template, which keeps a large codebase consistent.

DAG generation relies on command line tools, configurations, templates, and the starlake-orchestration framework to manage dependencies. Templates for data loading and transformation can be customized with runtime parameters using Python variables like jobs and user_defined_macros. Dependencies between tasks can be managed either inline (generating all dependencies) or through external state changes using Airflow datasets or Dagster Assets.

The best part is that users can define advanced schedules with code-free templating, also called no-code: you just select one of the predefined or custom orchestrator templates and let Starlake generate and deploy your DAGs on your chosen orchestrator. See some examples in the Orchestration Tutorial.
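
To give a feel for what such a no-code DAG definition could look like, here is a hedged sketch. The template name and option keys are placeholders for illustration, not Starlake's exact identifiers:

```yaml
# Illustrative sketch only -- template name and option keys are placeholders,
# not Starlake's exact identifiers.
dag:
  comment: "Nightly load of the sales domain"
  template: "airflow_scheduled_load.py.j2"   # swap for a Dagster or Snowflake Tasks template
  filename: "airflow_sales_load.py"          # generated DAG file deployed to the orchestrator
  options:
    cron: "0 2 * * *"                        # schedule handed to the generated DAG
    load_dependencies: "true"                # let lineage drive the task dependencies
```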

Furthermore, you get intelligent and visual dependency management with parallel execution when needed, and automated error handling and recovery. It's a key component of Starlake, as it not only defines the data stack via flexible configurations but can also run it with whatever execution engine your data organization uses.

Multi-Engine Flexibility and Cross-Platform Support

In fact, Starlake is multi-engine and works across different execution engines.

What does that mean? This concept is what differentiates it from typical data stacks that can run only on a proprietary engine. You keep a well-defined configuration for your data stack and business logic, with the flexibility to swap in newer or faster engines.

*Cross-engine support to use any compute you want. As the image shows, you could use Snowflake or hook it up to Excel.*

Further, you are not locked into a vendor: you can use different engines when needed, or change if an organizational decision requires it. By default, Starlake lets you specify a default engine for your entire project while also giving you the flexibility to override it on a per-model basis.

Let's say you need the native warehouse engine in production for speed and security reasons, while leveraging DuckDB in the development environment or using Spark to process thousands of XML files in the test environment. Starlake is also cross-engine capable, enabling transformations that query one data warehouse and write results to another. This is especially useful for export tasks to formats like CSV, Parquet, or Excel, as well as for integrations with external analytical or operational databases.
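
A hedged sketch of how a project-wide default with a per-model override could be expressed; the keys are simplified and not necessarily Starlake's exact schema:

```yaml
# Illustrative sketch only -- simplified keys, not necessarily Starlake's exact schema.
application:
  connectionRef: "duckdb_local"         # project default: develop and test against local DuckDB
---
# A single transform overriding the default to run on the production warehouse:
task:
  name: "kpi_daily_revenue"
  connectionRef: "bigquery_prod"        # this model runs on BigQuery instead of the default
  sink:
    connectionRef: "postgres_reporting" # cross-engine: write the result to another database
```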

Deploy: From Development to Production

The last step is deploying your project. Maybe you run it locally with Docker and want to move it to Kubernetes or Starlake Cloud. For that, you simply copy the configurations and you're up and running. That's the beauty of being declarative.

Visualize: Connecting to Your Analytics Stack

Visualization is usually not covered by a declarative data stack, but you can plug in any notebook, data app, or business intelligence tool to use the data marts and data artifacts you've created.

Every data analytics project needs a visualization tool. Starlake leaves this open, so you could use any you want.

Key Takeaways and What's Next

So whether you're a business user, data analyst, or data engineer, Starlake, with its declarative approach and low-code interface, makes it easy for everyone to create a sophisticated data platform with minimal configuration and all the bells and whistles included.

It allows efficient data transformation with the engine of your choice, supporting SQL and Python defined in YAML. It provides automatic handling of advanced write methodologies such as incremental and SCD2, built-in support for complex data types and error handling, and everything is version-controllable with simple YAML. Best of all, Starlake is fully open-source.

I believe the future is declarative data engineering, and Starlake can play a big part in this. Some also talk about the move From Data Engineer to YAML Engineer and explain how declarative stacks free up engineers to focus on higher-value work.

In Part 3, we'll discuss use cases such as transpiling SQL between dialects and data quality and validation with expectations in action. We'll demonstrate the built-in, declarative security features and showcase DAG generation using templates that integrate with Airflow, Dagster, and Snowflake tasks out of the box, including column-level lineage support. We'll also analyze how the end goal of moving toward English engineering through LLMs could look—a road to self-service data platforms with a declarative backbone.

I hope you enjoyed this. Please let me know in the comments or write me if you have any feedback or experiences with declarative data stacks.


Want to get started with Starlake? Check out the GitHub repo or try a demo on demo.starlake.ai. Again, Starlake is fully open-source, not only its engine but also its SQL transpilers and the tools it uses, but you can also use its cloud hosting to get started immediately. If you use VS Code, check out the extension for it too.

Configure, Don't Code: How Declarative Data Stacks Enable Enterprise Scale

· 15 min read
Simon Späti
Data Engineer & Technical Author at ssp.sh


Imagine building enterprise data infrastructure where you write 90% less code but deliver twice the value. This is the promise of declarative data stacks.

The open and modern data stack freed us from vendor lock-in, allowing teams to select best-of-breed tools for ingestion, ETL, and orchestration. But this freedom comes at a cost: fragmented governance, security gaps, and potential technical debt when stacking disconnected tools across your organization.

On the flip side, closed-source platforms offer unified experiences but trap you in their ecosystems, where you can't access the code or extend beyond their feature set when you need to.

What if we could have the best of both worlds?

Enter declarative data stacks – open-source solutions that seamlessly integrate powerful orchestration tools like Airflow and Dagster while covering your entire data lifecycle. These are complete data platforms with "batteries included" where configuration replaces coding, dramatically reducing complexity while implementing best practices by default.

In this article, we'll explore how declarative data stacks manage enterprise data complexity. You'll discover how Starlake, an open-source declarative data stack, enables data transformations through simple configurations, helping teams deliver production-ready platforms in days instead of months while automatically maintaining governance and lineage.

At its core, Starlake serves as your data warehouse orchestrator – a unified control center that manages your entire data ecosystem through a single, coherent lineage. Not just a tool, but an end-to-end OSS data platform that makes "configure, don't code" a practical reality for enterprise teams.

What Is a Declarative Data Stack?

A declarative data stack is essentially an end-to-end data engineering tool that is made for complexity. It has all the features you need included. It extracts, loads, transforms, and orchestrates. The mantra: Code less, deliver more.

What's the catch? You need to align with a common configuration style. You need a framework, a tool, something that abstracts and does the hard work, so that we, the users, can simply configure and run it. This is where YAML comes in. Some also call it the YAML Engineer.

In this Part 1, we will go through the benefits and anatomy of declarative data stacks, and see how Starlake can help us with that. In Part 2 we'll have a look at an end-to-end data engineering project that does ingestion from any configurable source, using SQL transformation and orchestrating it with Airflow, Dagster, or even Snowflake Tasks (you choose!), showing a declarative data stack in action.

And then finally, we'll choose a BI dashboard to visualize the data (not part of Starlake).

The Evolution Toward Declarative Data Engineering

If we look back at where we come from, 20 years ago everyone just picked a vendor like SAP, Oracle, or Microsoft and ran with their database or business intelligence solutions. You had lots of options, but if you needed anything specific, you had to request that feature and hope someone implemented it. You were locked into a monolithic system.

Because these platforms were widely used, there was a good chance that what you wanted was already there. But there was always that one customer who wanted one extra thing. That's why, a little later, the open-source boom happened. All of a sudden, plenty of US startups were sharing tools like Hadoop and its whole ecosystem. Everyone got used to sharing code and became excited about solving tough problems together.

This also took over the data space and became one of the starting points of data engineering when Maxime Beauchemin started sharing his trilogy about functional data engineering. More and more tools were open-sourced, and this is where we landed a while back: an explosion of tools and the Modern Data Stack, or Open Data Stack, a modular system. The deployment and DevOps involved are usually a massive effort; think of Terraform, Helm charts, or Kustomize, and how these democratized declarative engineering over the last decades.

Today we are entering the next phase of the cycle. We are back to one platform ruling them all, or better, bringing together the strengths from both of these worlds: having one platform integrated with open-source tools. But how do we achieve this? Exactly, you guessed it right, with declarative data stacks.

What Makes a Data Stack Declarative?

This transition from a monolithic to a modular system, and back to a unified platform with OSS plug-and-play tools and integrated templates to choose from, is an evolution that learns from the past. And declarative data stacks are the key to it.

But maybe you haven't heard of that term, or are unsure what it means. I have written about it at length in The Rise of the Declarative Data Stack, but here is the short version of it:

A declarative data stack is a set of tools whose configs, more precisely, can be thought of as a single function such as run_stack(serve(transform(ingest))) that can recreate the entire data stack.

Instead of one framework for each piece, we want multiple tools combined into a single declarative data stack; like the Modern Data Stack, but integrated the way Kubernetes integrates all infrastructure into a single deployment, defined in YAML.

We focus on the end-to-end data engineering lifecycle, from ingestion to visualization. But what does the combination with declarative mean? Think of functional data engineering, which gives us confident reproducibility with few side effects (hopefully none) and uses idempotency so functions can be restarted to recover, to reinstate a particular state with confidence, or to roll back to a specific version.

At its core, we can define the full data stack with a single configuration file (or a set of files). And instead of describing the how, we configure what we want to achieve. Essentially, code less and deliver more, with the benefit of a more reproducible and maintainable data stack.
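
As a deliberately generic, hypothetical sketch (not Starlake syntax), such a "what, not how" file could look like this: the file states the desired outcome, and the platform decides how to execute it:

```yaml
# Hypothetical, tool-agnostic sketch of a declarative pipeline: the "what", not the "how".
pipeline:
  ingest:
    source: "s3://landing/orders/*.json"
    target: "warehouse.staging.orders"
  transform:
    - name: "daily_revenue"
      sql: "SELECT order_date, SUM(amount) AS revenue FROM staging.orders GROUP BY order_date"
  schedule: "daily"                    # the platform turns this into orchestrator DAGs
  quality:
    - "no_duplicate_order_ids"         # declared expectation, enforced by the platform
```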

Breaking Free from Black Boxes

Declarative data stacks also break free from the black boxes of different OSS tools that are hard to integrate. Instead, you get a data-aware data platform that integrates through parametric configuration into the end-to-end data pipeline.

This is unlike Airflow, where each task is a black box: the orchestrator has no idea what is happening inside it. As more complex data platforms become the norm, knowing whether a task has finished is no longer enough. With a declarative approach, you focus more on the what: the data assets. We define them, their logic, and their dependencies, down to column-level lineage and types. This opens up the black box for us in an easy and approachable way, mainly in YAML.

The YAML configuration approach makes the internal workings more explicit and accessible, rather than hiding them, making the configuration and produced assets transparent and accessible to us and our users.

The YAML configs also make the stack approachable to non-programmers, as YAML is the universal configuration language. YAML is also easy to integrate with LLMs, all the more so as it's a superset of JSON and works naturally with JSON Schema. Each YAML file can be converted to JSON and validated with the powerful data serialization vocabulary of JSON Schema. This yields a (longer) context that AI and humans can iterate on together and integrate into any code editor or architecture. Ultimately, it helps make English engineering a reality.
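
A small example makes the point: the same table definition expressed in YAML maps one-to-one onto JSON, which is exactly what makes JSON Schema validation and LLM tooling straightforward (the table definition itself is illustrative):

```yaml
# The same (illustrative) table definition expressed as YAML...
table:
  name: "customers"
  attributes:
    - { name: "customer_id", type: "long", required: true }

# ...is structurally identical to this JSON, so it can be validated with a JSON Schema
# and handed verbatim to LLM or editor tooling:
# {"table": {"name": "customers",
#            "attributes": [{"name": "customer_id", "type": "long", "required": true}]}}
```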

Don't Code, Declare.

This is the main mantra behind declarative data stacks: don't code, but declare and get a fully-fledged data platform.

The Anatomy of an Open Declarative Data Stack

But what does that data platform include, and what's the anatomy of such a data stack? A declarative data stack does not only support one part of the data engineering lifecycle but the entire process end-to-end.

We go from extracting data to orchestrating them all in one platform:

  • Extract: Configuration-driven data ingestion
  • Load: Zero-code data loading with built-in validations
  • Transform: SQL-based transformations with automatic optimizations
  • Orchestrate: Integration with existing workflow managers
  • Visualization: Present and visualize created data assets
  • Lineage: Visibility across the entire stack

These are, if you will, the bones and skeleton of a declarative data stack. There are many ways one could implement it. In the following chapters, we'll analyze how Starlake implements these components and how it can help us handle complex data landscapes.

Starlake: An Open Source Declarative Data Platform

Starlake gives you all the discussed benefits of a declarative data stack in one single platform, configurable through YAML. It offers a powerful open-source way to embrace declarative data engineering end to end.


In a nutshell, Starlake is a comprehensive declarative data platform that:

  • Is based in France, is entirely open-source, and had its first commit on May 21 2018 by Hayssam Saleh (LinkedIn).
  • Runs on JVM (built with Scala) for cross-platform compatibility, bringing type safety and performance
  • Uses a YAML-based configuration approach (similar to how Terraform works for infrastructure)
  • Covers the entire data lifecycle: extract, load, transform, test, and orchestrate
  • Supports multiple data warehouse engines (Snowflake, BigQuery, Databricks, etc.)
  • Integrates with popular orchestrators like Airflow, Dagster, Snowflake Tasks, and Databricks scheduler (coming soon)
  • Eliminates complex coding with a "declare, don't code" philosophy
  • Features built-in data governance, validation, and lineage tracking
  • Enables in-memory development with local DuckDB, as well as deployment to any supported platform
  • Supports multi-engine and cross-engine capabilities (query one warehouse, write to another)

It has automated data quality checks and validation, native integration with major data warehouses, and support for no-code and low-code paradigms. It also comes with enterprise-grade security and governance out of the box and supports both batch and real-time processing.

Starlake bridges the gap between open-source flexibility and enterprise-ready data platforms, allowing organizations to build robust data pipelines with minimal code while maintaining complete visibility and control. It builds and orchestrates data warehouses with the complexity of data in mind from the get-go.

Why Starlake for Enterprises?

But why would you use Starlake at a large enterprise? What are the immediate benefits? In condensed form, you could say it helps address these three main problems that we face in the field of data engineering:

  • Overwhelming complexity in managing data pipelines
  • Inefficiencies in transforming and orchestrating data workflows
  • Lack of robust governance and data quality assurance

Starlake simplifies the data management effort with configuration-driven ingestion, transformation, and orchestration, and reduces manual implementation of each data pipeline.

This enhances data quality and the overall reliability of the platform by enforcing governance, schema validation, rules, and SLAs across the system. It accelerates time to insight and makes it possible for less technical people to create additional data pipelines through the UI and configuration interface.

If we compare it to traditional ETL tools, the table below will help us understand the differences:

| Feature | Starlake.ai | Traditional ETL Tools | Benefits |
| --- | --- | --- | --- |
| Core Architecture | Declarative (YAML-based) | Imperative (code-heavy) | Reduced maintenance burden, improved readability |
| No-Code Ingestion | ✅ | ❌ (requires custom coding) | Faster implementation, accessible to non-programmers |
| Declarative Transformations | ✅ | ❌ | Simplified maintenance, consistent patterns |
| Automated Orchestration | ✅ (integrates with Airflow/Dagster) | ❌ (requires manual setup) | Reduced setup time, automated dependency management |
| Built-in Governance | ✅ | ❌ | Enforced data quality and consistency |
| Cross-Engine Capabilities | ✅ | ❌ | Flexibility to work across different platforms |
| Multi-Engine Support | ✅ | ❌ | Prevents vendor lock-in |
| Automated Schema Evolution | ✅ | ❌ | Adapts to changing data structures automatically |
| SQL Transpilation | ✅ | ❌ | Write once, deploy anywhere capability |
| Local Development | ✅ (with DuckDB) | ❌ | Faster development cycles, reduced cloud costs |
| Data Quality Validation | ✅ (built-in) | ⚠️ (limited/add-on) | Higher data reliability |
| End-to-End Lineage | ✅ (column-level) | ⚠️ (typically table-level) | Enhanced visibility and troubleshooting |
| Row/Column Level Security | ✅ | ⚠️ (often requires add-ons) | Better compliance capabilities |
| Gen AI Readiness | ✅ (via JSON Schema) | ❌ | Future-proof architecture |
| Learning Curve | Moderate (YAML) | High (multiple languages/tools) | Faster team onboarding |
| Implementation Time | Fast | Slow | Quicker time to insights |
| Maintenance Burden | Low | High | Reduced technical debt |

In the end, it manages complexity, and the return on investment in the platform grows as complexity rises. The reason is the declarative nature and the encapsulation of complexity inside the platform (Starlake), which lets data engineers and platform users focus on business requirements and analytics.

Contrast that with manually writing imperative Python scripts to schedule your data pipeline, or managing your infrastructure by manually installing multiple tools and making them work together. This is the main challenge the current landscape faces, and the hidden cost of imperative data work.

That's why integrated systems are on the rise and where configuration-driven stacks bridge the gap to become easier to maintain and more reliable.

When Is Starlake the Wrong Option?

If you have a simple POC, you know for a fact that the platform will never see wider use or grow bigger, or the requirements are constantly changing, then a custom imperative solution is easier. You can still use Starlake to get started quickly with all batteries included, but it's a better fit for abstracting a big, complex data landscape into a manageable, governed data platform.

If your focus is on visualization, note that Starlake comes with no presentation tool, though it's easy to hook one up to the generated data assets. The best fit would be to stay code-first and declarative with BI solutions such as Rill, Evidence, or Lightdash.

Starlake is especially strong with large enterprises, which shows in its current customer base: mostly large organizations with high volumes (100 GB/day and 100 TB in BigQuery), multi-cloud setups such as Google Cloud with reporting on Azure, or, at the other end of the spectrum, streaming-like requirements with small new files arriving every minute, ingested into Redshift and visualized in Superset.

Multiple Verticals into Data Engineering

Starlake spans across multiple verticals within the data landscape and isn't limited to one aspect of data engineering. Just as YAML has become the universal configuration language for various domains, declarative data stacks integrate several critical verticals of the data engineering ecosystem under a unified configuration approach.

  • Orchestration: Rather than writing complex DAG code, workflows are defined in YAML and automatically translated to Airflow or Dagster execution plans. This abstracts away the complexity of pipeline scheduling and dependency management.
  • Transformation: SQL is already declarative by nature, but Starlake enhances it with automated optimizations and transpilation across different warehouse dialects—write once, run anywhere.
  • Infrastructure: Similar to how Terraform revolutionized infrastructure management, Starlake applies declarative principles to data infrastructure, enabling version-controlled data assets.
  • API Integration: Much like OpenAPI Specification uses YAML to define REST APIs, Starlake allows declarative configuration of data sources and destinations.
  • Governance: Data quality rules, schema validation, and access controls are all defined declaratively rather than scattered across multiple codebases.

When to Choose a Declarative Approach

That's it: that's why you shouldn't code but configure if you want to optimize for a large enterprise. This article is part one of three. Before we see Starlake in action in Part 2 and analyze what's possible today with declarative data stacks, as well as their future, including how to integrate with the latest GenAI and GenBI capabilities, let's wrap up this article.

We have learned that declarative data stacks represent a paradigm shift in enterprise data engineering: "Declare your intent, don't code it." By embracing configuration over custom coding, organizations gain reproducibility, maintainability, and governance without sacrificing flexibility. With tools like Starlake, you can develop and test locally on DuckDB using your target warehouse's SQL dialect, reducing costs and enabling faster development cycles while ensuring consistent quality through automated testing.

Breaking free from black boxes doesn't mean abandoning the open data stack, but rather integrating best-in-class tools under a unified, configuration-driven approach. The result is a data platform that scales with your complexity, reduces technical debt, and democratizes data engineering by making it accessible to YAML engineers alongside traditional coders. As data landscapes grow increasingly complex, the declarative approach offers a sustainable path forward that balances enterprise requirements with development agility.

I believe the future of data engineering will change significantly, but with a declarative data stack, we are ready for whatever changes come with an adaptable configuration-first system that nicely abstracts implementation logic from business requirements. This allows the future of data pipeline development to be adaptable, governed, and ready to evolve with organizational needs.