tl;dr give more context, let ai get better, repeat until its good enough or money and willpower runs out
so you’ve heard the pitch; “we’re going to generate sql for you using the schema in your warehouse/db” and you think to yourself… how are we going to infer what prod_stripe_reports_<productname3>_q4_demo_final in the warehouse means or what sandbox_id ends up joining against? how are we going to get the context to begin to understand this without spending so much time crawling and onboarding that we’re basically doing manual labor for every incremental customer.
sql-to-text on query history
all the banked queries that were previously run at the company that is stored in redshift or snowflake or tableau can be run through sql-to-text to understand what they did, what tables they referenced, who it was useful for, and manually checked off by the people who wrote it. then it is added as further context to the model.
100 domain specific pre-trained queries
cs, marketing, and related teams often ask the same style of questions. starting with question bank of 100 questions per domain to get started with, we can poll the our users on which ones they want to get first (providing them inspriation for what they can ask). from there, we can work with our model to create supervised sql queries that we can fine tune on with validation.
calls to any warehouse table or database follow a power law distribution, 80% of the calls are made to 20% of the tables. going to the data team and documenting which 20% of the tables are called the most and where do they come from isn’t impossible.
semantic layers and dbt
the semantic layer provided by companies like dbt join tables together that are frequently joined together in a layer on top of snowflake to make things easier to query. this provides yet more context to our model.
data dictionary generation and alation | collibra
the prior history of queries run and existing documentation lets us automate data dictionary generation upon onboarding a customer. we can manually clean the dictionary and regularly scan their existing knowledge base to add more context over time. if a customer already has a well organized data dictionary, it gives us an immediate edge to start leveraging for better query quality.
fine tuning on apis
every company has a chunk of sfdc data in their warehouse. every other company has a chunk of stripe data. these are usually stored in reliable formats thanks to connectors like fivetran sending them with set lables, with few customizable columns. there is publically available documentation on those labels and their associated meanings we can fine tune our model on.
regex error catching
(yea yea i know) but if the sql produced calls on tables / columns that don’t exist in the schema, we can flag it pretty quickly and auto re-run. ****
unspecified way of adding more context we’ll come up with next week
description of how we’ll execute on this new way of adding context
if nothing else works?
we iterate and improve the experience while continuing to do things that don’t scale until gpt-4 comes in in a couple months, which hopefully gets us there….
and if that still doesn’t work?
we iterate and hire forward deployed engineers to spend more time onboarding customers and generating code documentation of their data (via genai), and then feed that documentation back into our model… for another 3 years… utnil gpt-5 comes out…. which hopefully gets us there.
you’re just going to say this recursively huh?
yes. we’re technologists; we are investing our time and capital and lives into technology. what does it say about our faith in ai if we don’t believe that we can get it to a state where it can understand and write sql better than a human can…. when we’ve already seen it paint art?
there are 100s of levers we can pull to improve the fidelity of a model or allow a human to better tune and validate its outputs. we’re going to keep pulling new ones until we solve this problem, we run out of cash, or we hit the heat death of the universe.