1. First of all, thanks for outlining how you trained the model here in the repo: https://github.com/NumbersStationAI/DuckDB-NSQL?tab=readme-o...! I didn't know about `sqlglot`; that's a pretty cool lib. Which part of the project was the most challenging or time-consuming: generating the training data, the actual training, or testing? How did you iterate on, improve, and test the model?
2. How would you suggest using this model effectively if we have custom data in our DBs? For example, we might have a column called `purpose` that's a custom-defined enum (i.e. not a concept that's well known outside our business). Currently, we've fed it in as context by defining all the possible values it can have (roughly as in the sketch after these questions). Do you have any other recommendations on how to tune our prompts so that this model is just as effective with our own custom data?
3. Similar to the above: do you know whether the same model can work effectively across tens or even hundreds of tables? I've used multiple question-SQL example pairs as context, but I've found that I need 15-20 of them to be effective for even one table, let alone tens of tables.
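For concreteness on point 2, here's roughly the kind of context we feed in today. The table, the enum values, and the prompt framing are all made up for illustration; this isn't a format the model requires:

```python
# Rough sketch of how we inline a custom enum's values into the schema
# context. The table and the enum values are made up for illustration;
# the prompt framing is not a format the model requires.
PURPOSE_VALUES = ["onboarding", "renewal", "churn_review", "upsell"]

schema_context = f"""
CREATE TABLE accounts (
    id INTEGER,
    name VARCHAR,
    -- custom enum, one of: {', '.join(PURPOSE_VALUES)}
    purpose VARCHAR
);
""".strip()

question = "How many accounts have a renewal purpose?"

# Schema first, then the question; the model is expected to complete the SQL.
prompt = f"{schema_context}\n\n-- {question}\n"
print(prompt)
```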
Hi, Till here; I worked on the DuckDB-NSQL model on the MotherDuck side.
1. Definitely the training data (for me); we explored about 10 different directions before settling on the current approach. It's easy to underestimate the effect of training data on the quality of the model. The starting point was the benchmark dataset, though, which we assembled manually (to avoid data pollution, and also because there was simply no text2sql benchmark that covers anything other than plain old SQL SELECT statements with a handful of aggregate functions). And training is not a one-off thing either: with large datasets it is hard to evaluate the quality of the dataset without actually training a few epochs on it and running the benchmark (see the first sketch at the end of this comment).
3. No way. I see a common stack emerging (take a look at companies like https://vanna.ai/, https://www.dataherald.com/, or https://www.waii.ai) that is mainly centered around foundation models like GPT-4 with strong in-context learning capabilities (that's kind of a must to make these approaches work, and it comes with long inference times and higher costs). These solutions wrap the model with things like embedding-based schema filtering (see the second sketch below), options for users to enrich metadata about tables and columns, inclusion of previous related queries in the context, and so on. I'd say it's a bit of a different problem from the one we aimed to solve.
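Since you mentioned `sqlglot`: a minimal sketch of the kind of automated validity check it enables when iterating on generated queries. This illustrates the idea only; it's not our exact evaluation harness:

```python
# Minimal sketch: check that generated SQL at least parses in DuckDB's
# dialect before scoring or executing it. Illustrative only; not the
# actual evaluation harness.
import sqlglot
from sqlglot.errors import ParseError

def parses_as_duckdb(sql: str) -> bool:
    try:
        sqlglot.parse_one(sql, read="duckdb")
        return True
    except ParseError:
        return False

print(parses_as_duckdb("SELECT unnest([1, 2, 3])"))      # True
print(parses_as_duckdb("SELECT * FROM orders WHERE ("))  # False (unbalanced paren)
```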
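And to make the embedding-based schema filtering from point 3 concrete: embed each table's DDL once, embed the incoming question, and keep only the top-k most similar tables for the prompt. The schema here is hypothetical and sentence-transformers is just a stand-in; any embedding model works, and the production systems mentioned above layer metadata and prior queries on top of this:

```python
# Sketch of embedding-based schema filtering: keep only the tables whose
# DDL is most similar to the question. Schema and embedding model are
# illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tables = [  # hypothetical schema
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER, total DOUBLE)",
    "CREATE TABLE customers (id INTEGER, name VARCHAR, country VARCHAR)",
    "CREATE TABLE web_logs (ts TIMESTAMP, path VARCHAR, status INTEGER)",
]

question = "Which country generates the most revenue?"

# Normalized embeddings make cosine similarity a plain dot product.
table_vecs = model.encode(tables, normalize_embeddings=True)
q_vec = model.encode([question], normalize_embeddings=True)[0]
scores = table_vecs @ q_vec

top_k = 2
keep = [tables[i] for i in np.argsort(-scores)[:top_k]]
print("\n".join(keep))  # the schema subset that goes into the prompt
```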
I didn't see this in the blog post, but did you train this from scratch or finetune an existing base model?
If from scratch, it's quite impressive that the model is capable of understanding natural-language prompts (presumably English) given such a small, targeted training set.