Optimize Queries
To reduce query costs and improve performance when working with Numia’s datasets, it's essential to minimize the number of bytes scanned. Below are key recommendations to help you optimize your queries effectively.
Avoid SELECT *
BigQuery uses column-based storage, meaning the number of columns you query directly affects the number of bytes scanned. Specify only the columns you need in your query to reduce unnecessary data processing.
Leverage Partitions and Clusters
Most Numia tables are partitioned by ingestion_timestamp
(the time data is ingested into BigQuery) and clustered by event_type
and attribute_key
. To optimize performance:
- Use these columns in your
WHERE
clause to narrow down the dataset and minimize data scans. - Avoid using
ingestion_timestamp
in production environments, as it may not align with the actualblock_timestamp
. It’s best reserved for testing purposes.
Perform Late Aggregation
Late aggregation refers to delaying summary operations (e.g., GROUP BY
, COUNT
, or SUM
) until you’ve filtered or processed the data as much as possible. This minimizes the amount of data being aggregated, resulting in faster queries.
- Aggregate data as late as possible in your query workflow to avoid unnecessary intermediate computations.
- Avoid using
GROUP BY
clauses in subqueries unless the data has been significantly reduced before joining or further processing.
Join Tables Efficiently
Efficient table joins are critical when working with large relational datasets. Poorly optimized joins can drastically increase query costs and processing times.
- Order Matters: Always join tables in decreasing order of size, starting with the largest table first. This ensures the most data-intensive operations are handled upfront, streamlining subsequent steps.
- Filter Before Joining: Use
WHERE
clauses to remove unnecessary data from all tables involved in the join. Applying filters early reduces the size of the datasets being joined, improving performance and reducing costs.
Optimize WHERE
Clause Order
BigQuery processes filters in the order they appear in your WHERE
clause. Place the filter that eliminates the largest amount of data first to improve query efficiency.
Example Optimized Query
SELECT event_type, COUNT(*)
FROM `numia-data.osmosis.events`
WHERE ingestion_timestamp BETWEEN '2023-01-01' AND '2023-01-31'
AND event_type = 'transfer'
GROUP BY event_type
- Filters are applied early (
ingestion_timestamp
,event_type
). - Only relevant columns (
event_type
,COUNT(*)
) are selected. - Aggregation (
GROUP BY
) occurs after filtering to reduce data being processed.