
How to Load and Transform into BigQuery Wildcard Tables

Hayssam Saleh · Starlake Core Team · 5 min read


When loading files into BigQuery, you may need to split your data into multiple partitions to reduce data size, improve query performance, and lower costs. However, BigQuery’s native partitioning only supports columns with date/time or integer values. While partitioning on string columns isn’t directly supported, BigQuery provides a workaround with wildcard tables, offering nearly identical benefits.

In this example, we demonstrate how Starlake simplifies the process by seamlessly loading your data into wildcard tables.

The Problem

Assume your business receives daily transactions as CSV files from branches across various countries. These files are named in the format transactions_YYYYMMDD.csv, and the goal is to load them into BigQuery, partitioned by country, region and date.

The CSV files contain data in the following format:

store_id, customer_id, trans_id, trans_date, trans_time, country, region, ...
1, 1, 1, 2024-12-19, 12:00:00, US, CA
1, 2, 2, 2024-12-19, 13:01:00, US, CA
2, 3, 3, 2024-12-19, 13:01:00, US, NY
2, 4, 4, 2024-12-19, 13:01:00, US, NY

With hundreds of millions of transactions generated daily, querying the data efficiently by country and region is critical. Partitioning the data in this way would reduce storage size and significantly speed up queries.

However, since BigQuery doesn’t support partitioning by string columns, we can leverage wildcard tables to achieve the same benefits. The desired result is to create separate tables in BigQuery for each country and region, structured as follows:

sales.transactions_US_CA
sales.transactions_US_NY

Each table would store transactions for a specific country and region, while also being partitioned by date for optimal performance.

The Solution with Starlake Load

First, we need Starlake to infer the table schema from a sample CSV file and generate the load configuration. This can be achieved with the following command:

starlake infer-schema --domain sales \
                      --table transactions \
                      --input transactions_20241219.csv

This command infers the schema of the CSV file and generates a YAML file that defines the schema and loading instructions. The resulting YAML file would look like this:

transactions.sl.yml
version: 1
table:
  name: "transactions"
  pattern: "transactions_.*.csv"
  attributes:
    - name: store_id
      type: int
    - name: customer_id
      type: int
    - name: trans_id
      type: int
    - name: trans_date
      type: timestamp
    - name: trans_time
      type: timestamp
    - name: country
      type: string
    - name: region
      type: string
  metadata:
    format: DSV
    withHeader: true
    separator: ","
    writeStrategy:
      type: APPEND

Additionally, a second YAML file specifies where the input files are located:

_config_.sl.yml
version: 1
load:
  name: "sales"
  metadata:
    directory: "{{incoming_path}}/sales"

To load the data into BigQuery, follow these steps:

  1. Place the transaction files, following the expected filename pattern, into the directory {{incoming_path}}/sales. The {{incoming_path}} variable is defined in the environment configuration file (a sketch follows these steps).
  2. Run the following command:
starlake load --domain sales --table transactions
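
For reference, here is a minimal sketch of what such an environment configuration file might look like. The file name env.sl.yml and the /landing/incoming value are illustrative assumptions, not values taken from this post:

env.sl.yml
version: 1
env:
  # Illustrative value; point this at your actual landing directory.
  incoming_path: "/landing/incoming"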

We aim to organize the data by country, region, and transaction date. Since BigQuery does not support partitioning on string columns, we shard by country and region using wildcard tables and partition each shard by date.

To implement this, we need to update the metadata section in the transactions.sl.yml file with additional instructions for creating wildcard tables.

transactions.sl.yml
...
metadata:
  ...
  sink:
    sharding: [country, region]
    partition: [trans_date]

The sharding key tells Starlake to create a wildcard table for each unique combination of country and region. The partition key tells Starlake to partition the data in each wildcard table by transaction date.
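
Conceptually, the result is one date-partitioned table per country/region pair. As a hand-written illustration (not DDL emitted by Starlake), the US/CA shard would behave as if created like this:

-- Illustrative DDL for a single shard, partitioned by transaction date.
CREATE TABLE IF NOT EXISTS sales.transactions_US_CA (
  store_id    INT64,
  customer_id INT64,
  trans_id    INT64,
  trans_date  TIMESTAMP,
  trans_time  TIMESTAMP,
  country     STRING,
  region      STRING
)
PARTITION BY DATE(trans_date);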

We can now load the data into BigQuery and Starlake will create the following tables:

sales.transactions_US_CA
sales.transactions_US_NY

That's it - we've successfully loaded the data into BigQuery using wildcard tables!

Let's query the data to ensure it's been loaded correctly:

SELECT *
FROM `sales.transactions_*`
WHERE _TABLE_SUFFIX = 'US_CA'
  AND trans_date = '2024-12-19';

SELECT *
FROM `sales.transactions_*`
WHERE _TABLE_SUFFIX IN ('US_CA', 'US_NY')
  AND trans_date = '2024-12-19';
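
Because every shard shares the transactions_ prefix, a single wildcard query can also aggregate across all countries and regions. As a quick, hand-written sanity check (not from the original post):

-- Count the day's transactions in each country/region shard.
SELECT
  _TABLE_SUFFIX AS shard,
  COUNT(*) AS txn_count
FROM `sales.transactions_*`
WHERE trans_date = '2024-12-19'
GROUP BY shard
ORDER BY shard;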

The Solution with Starlake Transform

In addition to loading data, Starlake also supports transforming data using SQL queries. To create wildcard tables during the transformation process, we use the exact same attributes in the sink section of the transform YAML file as we did in the load process. Starlake will automatically create the wildcard tables in BigQuery when the transformation is executed.
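
As a rough sketch, a transform task configuration might carry the same sink attributes as the load configuration above. Everything here except the sink block is an illustrative assumption; consult the Starlake documentation for the exact transform layout:

transactions_by_region.sl.yml
version: 1
task:
  # Hypothetical task name and structure, shown for illustration only.
  name: "transactions_by_region"
  sink:
    sharding: [country, region]
    partition: [trans_date]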

Conclusion

This section illustrates how Starlake streamlines the process of loading data into BigQuery using wildcard tables.

By leveraging Starlake’s declarative approach, you can effortlessly partition your data by string columns, enhancing query performance while minimizing costs. Additionally, Starlake’s support for both data loading and transformation enables you to seamlessly create wildcard tables as part of the transformation process.

Also, Starlake’s integrated approach ensures consistency by allowing you to reuse the same attributes in the sink section of the transformation YAML file as those defined during the data loading process.

Finally, Starlake’s support for wildcard tables in BigQuery comes at zero additional cost. The data is sharded on the fly at load time in a single step, reducing cost and improving the performance of both load and transform jobs.

This simplifies workflows and promotes efficiency.