Elevating the recipe

This is a follow-up to an earlier post, The right recipe, where I used GitHub Actions to push data to a Google spreadsheet that updated a Looker Studio dashboard.

The data we were handling didn't scale with Google Sheets: both ingestion and access got slower, and we needed features around data access controls and onboarding new clients. So we migrated to Supabase (a Postgres-based platform), and this is the new pipeline:

  1. Hourly API pulls that store the raw JSON response in a Postgres table
  2. Flatten part of the JSON, a subset of key-value pairs, into another table (a materialized view; a sketch follows this list)
  3. A dashboard connected to this view
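
Concretely, steps 1 and 2 look roughly like the sketch below. telemetry_table and json_dump are the real names that appear in the query plans later in this post; telemetry_flat, id, and pulled_at are illustrative:

-- Step 1: raw API responses land in a plain table
CREATE TABLE telemetry_table (
    id        bigserial   PRIMARY KEY,
    pulled_at timestamptz NOT NULL DEFAULT now(),
    json_dump json        NOT NULL
);

-- Step 2: flatten a subset of key-value pairs into a materialized view
CREATE MATERIALIZED VIEW telemetry_flat AS
SELECT
  json_dump -> 'total' -> 'time' AS time,
  json_dump -> 'total' -> 'value' AS energy
FROM telemetry_table
WHERE json_dump -> 'total' IS NOT NULL;

-- The hourly job then refreshes the view after each pull
REFRESH MATERIALIZED VIEW telemetry_flat;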

Infrastructure on Supabase is managed with SQLAlchemy models, Alembic for migrations, and GitHub Actions for automation. Everything is version controlled, with separate development and production environments.

New pipeline, new challenges. As the data grew, refreshing the materialized view took longer and longer, and the GitHub Action failed once the flattening query exceeded Supabase's statement timeout. This is the query I was running:

SELECT
  json_dump -> 'total' -> 'time' AS time,
  json_dump -> 'total' -> 'value' AS energy
FROM telemetry_table
WHERE json_dump -> 'total' IS NOT NULL;

We would raise the timeouts on the Supabase roles, and within a week or two the run would fail again. With close to 35k entries, refreshing the view took almost 20 seconds.
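
Supabase exposes these timeouts as plain Postgres role settings, so each adjustment was a one-liner along these lines (the role and the value here are examples, not our exact configuration):

-- Raise the statement timeout for the role the job connects as
ALTER ROLE authenticated SET statement_timeout = '30s';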

At some point adjusting the timeout didn't cut it, and I started looking into the query with EXPLAIN ANALYZE. Just parsing the JSON column was the biggest factor, and every additional field I extracted pushed the execution time up further.
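
The plans below come from prefixing each variant with EXPLAIN ANALYZE. For the baseline I left the JSON column out entirely, something along these lines (the id column is an assumption here):

-- Baseline: scan the table without touching the JSON column
EXPLAIN ANALYZE
SELECT id
FROM telemetry_table;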

Here is the result without the JSON column in the query:

QUERY PLAN
Seq Scan on telemetry_table (cost=0.00..6843.29 rows=31629 width=8) (actual time=0.014..13.774 rows=31424 loops=1)
Planning Time: 0.230 ms
Execution Time: 15.178 ms

Here is the result when using the JSON column, filtering out null rows, and extracting fields:

QUERY PLAN
Seq Scan on telemetry_table (cost=0.00..8181.60 rows=31471 width=8) (actual time=0.524..11074.447 rows=26012 loops=1)
Filter: ((json_dump -> 'total'::text) IS NOT NULL)
Rows Removed by Filter: 5412
Planning Time: 0.249 ms
Execution Time: 11079.209 ms

Both regular search and LLMs suggested using JSONB instead of JSON, along with GIN indexes. I created a backup table, altered the column to JSONB, and that alone reduced the refresh time to less than 2 seconds, a roughly 10x improvement:

QUERY PLAN
Seq Scan on experimental_telemetry_table (cost=0.00..10696.63 rows=43026 width=8) (actual time=0.753..405.313 rows=37030 loops=1)
Filter: ((json_dump -> 'total'::text) IS NOT NULL)
Rows Removed by Filter: 6212
Planning Time: 0.778 ms
Execution Time: 931.380 ms
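
The column change itself is a single statement with a USING cast (in our setup it went through an Alembic migration; shown here as the raw SQL against the experimental table):

-- Convert the stored JSON to its binary (jsonb) representation
-- Note: this rewrites the whole table, so expect a lock while it runs
ALTER TABLE experimental_telemetry_table
  ALTER COLUMN json_dump TYPE jsonb
  USING json_dump::jsonb;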

This was good enough for us, and I didn't need a GIN index, since I am not filtering rows on specific fields inside the JSON data.
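
If we ever do start filtering on fields inside the JSON, that is where a GIN index would pay off. A hypothetical example (the status field is made up for illustration):

-- jsonb_path_ops indexes support containment (@>) queries
CREATE INDEX telemetry_json_gin
  ON experimental_telemetry_table
  USING GIN (json_dump jsonb_path_ops);

-- A query this index could accelerate
SELECT *
FROM experimental_telemetry_table
WHERE json_dump @> '{"status": "error"}';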

I will keep working on this as we onboard new clients. We know the limitations of the existing system: despite our best efforts, onboarding a new client still involves some manual steps; we have functions tied to the project; the data is time-series, and there are databases purpose-built for that; and the API response JSON varies for each client. Let's see how the project and the solution (the recipe) evolve.