Optimize Queries

To reduce query costs and improve performance when working with Numia’s datasets, it's essential to minimize the number of bytes scanned. Below are key recommendations to help you optimize your queries effectively.

Avoid `SELECT *`

BigQuery uses column-based storage, meaning the number of columns you query directly affects the number of bytes scanned. Specify only the columns you need in your query to reduce unnecessary data processing.

Leverage Partitions and Clusters

Most Numia tables are partitioned by ingestion_timestamp (the time data is ingested into BigQuery) and clustered by event_type and attribute_key. To optimize performance:

Use these columns in your WHERE clause to narrow down the dataset and minimize data scans.
Avoid using ingestion_timestamp in production environments, as it may not align with the actual block_timestamp. It’s best reserved for testing purposes.

Perform Late Aggregation

Late aggregation refers to delaying summary operations (e.g., GROUP BY, COUNT, or SUM) until you’ve filtered or processed the data as much as possible. This minimizes the amount of data being aggregated, resulting in faster queries.

Aggregate data as late as possible in your query workflow to avoid unnecessary intermediate computations.
Avoid using GROUP BY clauses in subqueries unless the data has been significantly reduced before joining or further processing.

Join Tables Efficiently

Efficient table joins are critical when working with large relational datasets. Poorly optimized joins can drastically increase query costs and processing times.

Order Matters: Always join tables in decreasing order of size, starting with the largest table first. This ensures the most data-intensive operations are handled upfront, streamlining subsequent steps.
Filter Before Joining: Use WHERE clauses to remove unnecessary data from all tables involved in the join. Applying filters early reduces the size of the datasets being joined, improving performance and reducing costs.

Optimize `WHERE` Clause Order

BigQuery processes filters in the order they appear in your WHERE clause. Place the filter that eliminates the largest amount of data first to improve query efficiency.

Example Optimized Query

SELECT event_type, COUNT(*) 
FROM `numia-data.osmosis.events`
WHERE ingestion_timestamp BETWEEN '2023-01-01' AND '2023-01-31'
  AND event_type = 'transfer'
GROUP BY event_type

Filters are applied early (ingestion_timestamp, event_type).
Only relevant columns (event_type, COUNT(*)) are selected.
Aggregation (GROUP BY) occurs after filtering to reduce data being processed.

Avoid SELECT *​

Leverage Partitions and Clusters​

Perform Late Aggregation​

Join Tables Efficiently​

Optimize WHERE Clause Order​

Example Optimized Query​