Blog | Starlake

Incremental models, the easy way.

December 16, 2024 · 3 min read

Starlake Core Team

One of the key advantages of Starlake is its ability to handle incremental models without requiring state management. This follows from Starlake being an integrated declarative data stack: it uses the same YAML DSL for both loading and transforming, and it leverages the backfill capabilities of your target orchestrator.
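To make the idea of a stateless incremental model concrete, here is a minimal sketch in plain Python. It is not Starlake's API: the table names, the `run_increment` function and the use of SQLite are assumptions chosen for illustration. The only input is the time window that an orchestrator would pass in, so no watermark or "last processed" state is stored anywhere.

```python
import sqlite3
from datetime import date


def run_increment(conn, start, end):
    """Rebuild daily totals for the half-open window [start, end).

    The window is assumed to come from the orchestrator (hypothetical
    parameters). Delete-then-insert keeps the run idempotent, so the
    same window can be replayed safely during a backfill.
    """
    window = (start.isoformat(), end.isoformat())
    conn.execute("DELETE FROM daily_sales WHERE day >= ? AND day < ?", window)
    conn.execute(
        "INSERT INTO daily_sales (day, total) "
        "SELECT day, SUM(amount) FROM orders "
        "WHERE day >= ? AND day < ? GROUP BY day",
        window,
    )
    conn.commit()


def build_demo():
    """Create an in-memory database with a few sample orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
    conn.execute("CREATE TABLE daily_sales (day TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("2024-12-01", 10.0), ("2024-12-01", 5.0), ("2024-12-02", 7.0)],
    )
    return conn
```

Running `run_increment` twice for the same window leaves the target table unchanged, which is the property that lets a backfill replay past intervals.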

How to unit test your data pipelines

July 5, 2024 · 6 min read

Bounkong Khamphousone

Starlake Core Team

In today's data-driven landscape, ensuring the reliability and accuracy of your data warehouse is paramount. The cost of not testing your data can be astronomical, leading to critical business decisions based on faulty data and eroding trust.

The path to rigorous data testing comes with its own set of challenges. In this article, I will highlight how you can confidently deploy your data pipelines by leveraging Starlake JSQLTranspiler and DuckDB, while also reducing costs. We will go beyond testing transformations, which are usually written in SQL, and see how we can also test our ingestion jobs.
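As a rough illustration of testing a SQL transformation locally, the sketch below runs a query against a small in-memory fixture and compares the result with the expected rows. It uses Python's built-in SQLite module as a stand-in for DuckDB so that it runs without extra dependencies. The query, table and function names are hypothetical, and no transpilation step is shown.

```python
import sqlite3

# Hypothetical transformation under test: revenue per customer.
TRANSFORM_SQL = """
SELECT customer_id, SUM(amount) AS revenue
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""


def run_transform(rows):
    """Load fixture rows into an in-memory table and run the transform."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return conn.execute(TRANSFORM_SQL).fetchall()
    finally:
        conn.close()


def test_revenue_per_customer():
    """Compare the transform output with hand-computed expected rows."""
    fixture = [(1, 10.0), (1, 2.5), (2, 4.0)]
    assert run_transform(fixture) == [(1, 12.5), (2, 4.0)]


test_revenue_per_customer()
```

Because the fixture lives in memory, the test needs no warehouse connection and costs nothing to run in CI.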

Polars versus Spark

May 28, 2024 · 6 min read

Hayssam Saleh

Starlake Core Team

Introduction

Polars is often compared to Spark. In this post, I will highlight the main differences and the best use cases for each in my data engineering activities.

As a Data Engineer, I primarily focus on the following goals:

Parsing files, validating their input, and loading the data into the target data warehouse.
Once the data is loaded, applying transformations by joining and aggregating the data to build KPIs.

However, on a daily basis, I also need to develop on my laptop and test my work locally before delivering it to the CI pipeline and then to production.

What about my fellow data scientist colleagues? They need to run their workload on production data through their favorite notebook environment.

Starlake OSS - Bringing Declarative Programming to Data Engineering and Analytics

May 14, 2024 · 6 min read

Hayssam Saleh

Starlake Core Team

Introduction

The advent of declarative programming through tools like Ansible and Terraform has revolutionized infrastructure deployment by allowing developers to achieve intended goals without specifying the order of code execution.

This paradigm shift brings forth benefits such as reduced error rates, significantly shortened development cycles, enhanced code readability, and increased accessibility for developers of all levels.

This is the story of how a small team of developers crafted a platform that goes beyond the boundaries of conventional data engineering by applying a declarative approach to data extraction, loading, transformation and orchestration.

Starlake

Column and Row Level Security in BigQuery

February 15, 2022 · 4 min read

Hayssam Saleh

Starlake Core Team

Data exposition strategies

Data may be exposed using views or authorized views and, more recently, using row-level and column-level security.

Historically, to restrict access to specific columns or rows in BigQuery, one could create an (authorized) view with a SQL query like the one below:

CLS / RLS using Views
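The view-based approach can be sketched as follows. This is an illustrative stand-in, not the original BigQuery query: it uses Python's built-in SQLite module, and the `customers` table, its columns and the `customers_fr` view are hypothetical. SQLite has no authorized views or access control, so the sketch only shows how a view hides columns (the column-level part) and filters rows (the row-level part).

```python
import sqlite3


def restricted_view_demo():
    """Return (column names, rows) exposed by a restricting view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER, name TEXT, email TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, "Alice", "alice@example.com", "FR"),
            (2, "Bob", "bob@example.com", "US"),
        ],
    )
    # Column restriction: the email column is not selected.
    # Row restriction: only rows with country = 'FR' are exposed.
    conn.execute(
        "CREATE VIEW customers_fr AS "
        "SELECT id, name FROM customers WHERE country = 'FR'"
    )
    cursor = conn.execute("SELECT * FROM customers_fr ORDER BY id")
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    conn.close()
    return columns, rows
```

Consumers who are only granted access to the view never see the hidden column or the filtered rows, which is the effect the view-based strategy relies on.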

Handling Dynamic Partitioning and Merge with Spark on BigQuery

December 15, 2021 · 7 min read

Hayssam Saleh

Starlake Core Team

Data Loading strategies

When loading data into BigQuery, you may want to:

Overwrite the existing data and replace it with the incoming data.
Append the incoming data to the existing data.
Dynamic partition overwrite, where only the partitions that the incoming data belongs to are overwritten.
Merge incoming data with existing data by keeping the newest version of each record.
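The four strategies above can be sketched on plain Python lists of dictionaries. This is only a conceptual illustration under stated assumptions: the function names and the `partition_key`, `key` and `timestamp` parameters are hypothetical, and it is neither the Spark nor the BigQuery implementation.

```python
def overwrite(existing, incoming):
    """Replace all existing rows with the incoming rows."""
    return list(incoming)


def append(existing, incoming):
    """Keep existing rows and add the incoming rows after them."""
    return existing + incoming


def overwrite_partitions(existing, incoming, partition_key):
    """Replace only the partitions that appear in the incoming rows."""
    touched = {row[partition_key] for row in incoming}
    kept = [row for row in existing if row[partition_key] not in touched]
    return kept + incoming


def merge_latest(existing, incoming, key, timestamp):
    """Keep one row per key, preferring the newest timestamp.

    On equal timestamps the later row in iteration order wins,
    so an incoming row replaces an existing one.
    """
    latest = {}
    for row in existing + incoming:
        current = latest.get(row[key])
        if current is None or row[timestamp] >= current[timestamp]:
            latest[row[key]] = row
    return list(latest.values())
```

Note that a partition overwrite drops every existing row in a touched partition, while a merge keeps existing rows whose key is absent from the incoming data.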

For performance reasons, tables holding huge amounts of data are usually split into multiple partitions. BigQuery supports range partitioning, which is uncommon, and date/time partitioning, which is the most widely used type of partitioning.

Bonjour

September 18, 2021 · One min read

Hayssam Saleh

Starlake Core Team

Pipelining fast data is big. Pipelining big data fast is bigger. :)
