
A new tool makes it easier for database users to perform complicated statistical analyses of tabular data without needing to understand what is going on behind the scenes.
GenSQL, a generative AI system for databases, could help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a few keystrokes.
For instance, if the system were used to analyze medical data from a patient who has always had high blood pressure, it could catch a blood pressure reading that is low for that particular patient but would otherwise be in the normal range.
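GenSQL reaches that judgment with a learned probabilistic model rather than a hand-written rule, but the per-patient intuition can be sketched in ordinary SQL. In the hedged sketch below, the readings table and its columns are hypothetical, and the fixed two-standard-deviation rule merely stands in for what the model learns:

```sql
-- Hedged plain-SQL sketch (hypothetical readings(patient_id, reading_date,
-- systolic) table). GenSQL uses a learned probabilistic model; the fixed
-- two-standard-deviation rule here only illustrates the per-patient idea.
SELECT r.patient_id, r.reading_date, r.systolic
FROM readings r
JOIN (
    SELECT patient_id,
           AVG(systolic)    AS mean_sys,
           STDDEV(systolic) AS sd_sys
    FROM readings
    GROUP BY patient_id
) b ON b.patient_id = r.patient_id
WHERE r.systolic < b.mean_sys - 2 * b.sd_sys      -- low for this patient
  AND r.systolic BETWEEN 90 AND 120;              -- yet population-normal
```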
GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model, which can account for uncertainty and adjust its decision-making based on new data.
Moreover, GenSQL can be used to produce and analyze synthetic data that mimic the real data in a database. This could be especially useful in situations where sensitive data cannot be shared, such as patient health records, or when real data are sparse.
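As a rough illustration of such a query, the hedged, GenSQL-inspired sketch below generates synthetic patient records; the GENERATE phrasing and the patient_model name are assumptions for illustration, not the system's verified grammar:

```sql
-- Hypothetical, GenSQL-flavored query (the GENERATE keywords and the
-- patient_model name are assumptions, not verified GenSQL grammar).
-- It asks the learned model for 1,000 synthetic rows that follow the
-- same statistical patterns as the real table, so analysts can work
-- without touching any actual patient record.
SELECT * FROM GENERATE age, systolic, diagnosis UNDER patient_model
LIMIT 1000;
```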
This new tool is built on top of SQL, a programming language for database creation and manipulation that was introduced in the late 1970s and is used by millions of developers worldwide.
“Historically, SQL taught the business world what a computer could do. They didn’t have to write custom programs, they just had to ask questions of a database in a high-level language. We think that, when we move from just querying data to asking questions of models and data, we are going to need an analogous language that teaches people the coherent questions you can ask a computer that has a probabilistic model of the data,” said senior author Vikash Mansinghka, Ph.D., principal research scientist and leader of the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences.
When the researchers compared GenSQL to popular AI-based approaches for data analysis, they found that it was not only faster but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, so users can read and edit them.
“Looking at the data and trying to find some meaningful patterns by just using some simple statistical rules might miss important interactions. You really want to capture the correlations and the dependencies of the variables, which can be quite complicated, in a model. With GenSQL, we want to enable a large set of users to query their data and their model without having to know all the details,” said lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.
Here is an exclusive Tech Briefs interview, edited for length and clarity, with Huot.
Tech Briefs: What was the biggest technical challenge you faced while developing GenSQL?
Huot: I think it was being able to compactly represent a whole range of problems you want to solve in a way that's going to be accessible: how do you avoid developing 20 different tools, and instead offer one set of ideas that people can reuse? The point was to be concise and accessible to a large audience. So, how do you make that possible and unify all these different kinds of questions? That was definitely a challenge.
Tech Briefs: Can you explain in simple terms how it works?
Huot: You usually have a model; if not, one can be automatically synthesized for you, given some data. Then, given your model and the data of interest (either what you trained the model on or a different set), you can type questions in this formal language. We also have a prototype natural-language integration, where you can just write in English and it is translated into the formal language. Then, your query is run in the formal language, and you get specialized answers, which can be tables, numbers, and so on.
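To give a flavor of such a formal-language question, here is a hedged, GenSQL-style sketch; the PROBABILITY OF ... UNDER construct and the patients and patient_model names are assumptions written to match the description above, not the exact grammar:

```sql
-- Hedged, GenSQL-style sketch (keywords and names are assumptions).
-- It returns one number per row: how probable the recorded reading is
-- under the model, so the least probable rows surface as candidate
-- anomalies.
SELECT patient_id, systolic,
       PROBABILITY OF systolic UNDER patient_model AS p_reading
FROM patients
ORDER BY p_reading ASC;
```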
Tech Briefs: How did this work come about? What was the catalyst for the project?
Huot: We've had, in our team, a history of looking at data science and statistics. So, people were aware of the different challenges in applying existing tools to certain problems. Either it was too specialized, or the automation wasn't good, or the model needed too much expertise. So, there was a growing desire in the team to produce something more accessible.
Maybe having a hybrid team, with some people more interested in applications and others thinking about language design, was the key to seizing this opportunity and creating something new that solves a problem by being more accessible to data scientists.
Tech Briefs: What are your next steps? Do you have plans for further research, work, etc. to accomplish these goals?
Huot: There are several technical challenges, which I won't go into the details of. But one challenge is how to scale these things up: how to build a better backend system that can create richer and faster models to query. That's one part.
Another is making it more accessible, for example, through language modeling. So, you speak or write in English, and we convert it to a formal query. But in the process, you sometimes find ambiguities in the language. One example is, ‘Show me all the Democratic voters who voted a certain way in the U.S. in a certain year.’ But those voters might be registered with that party or just identify as being in it. So, the question can be ambiguous, depending on the dataset. And you may not know that up front.
Here, the system could show you the two possible queries and let you pick the one you actually meant, as in the sketch below.
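Concretely, with a hypothetical voters(voter_id, registered_party, identifies_as, candidate, year) table, that one English question splits into two plain SQL queries; the :candidate and :year placeholders stand for whatever race and year the user meant:

```sql
-- Reading 1: "Democratic voters" means formally registered with the party.
SELECT voter_id
FROM voters
WHERE registered_party = 'Democratic'
  AND candidate = :candidate
  AND year = :year;

-- Reading 2: "Democratic voters" means people who identify with the party.
SELECT voter_id
FROM voters
WHERE identifies_as = 'Democratic'
  AND candidate = :candidate
  AND year = :year;
```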
On a larger scale, we have some people looking at how to apply these tools to build models for prostate cancer data. The challenge there is that many experiments are small in scale, like a single hospital's, and the data can't really be shared. So, if you train a model on that, it doesn't tell you enough to make many predictions. There are a lot of these experiments freely accessible on the internet, but they're usually not very clean, or they use different conventions; you can't magically train a model on all of these datasets at once and then solve all your problems.
But the tools we develop can help. We're pushing to make them able to construct a model that looks at, for example, two different clinical trials whose columns don't exactly match. One may group people by age, another by category; the way people are assigned to a group, and the labels for the groups, might differ. So, we are automating data harmonization, as in the sketch below.
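A minimal sketch of that harmonization step in plain SQL, assuming two hypothetical tables, trial_a with a numeric age column and trial_b with coarse age labels, could map both onto one shared grouping before pooling the records:

```sql
-- Hypothetical tables: trial_a(patient_id, age, outcome) records numeric
-- ages; trial_b(patient_id, age_group, outcome) records coarse labels.
-- Map both onto one shared convention before pooling the rows.
SELECT patient_id,
       CASE WHEN age < 50 THEN 'under_50'
            WHEN age < 65 THEN '50_to_64'
            ELSE '65_plus' END AS age_group,
       outcome
FROM trial_a
UNION ALL
SELECT patient_id,
       CASE age_group WHEN 'young'  THEN 'under_50'
                      WHEN 'middle' THEN '50_to_64'
                      ELSE '65_plus' END AS age_group,
       outcome
FROM trial_b;
```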
Tech Briefs: Do you have any advice for researchers or engineers aiming to bring their ideas to fruition, broadly speaking?
Huot: If you want to create something to apply to a problem, I think it's great to clearly know what the problem is. It may sound extremely simple, but it's really hard in research.
You have to spend, for example, six months in the mud to really get to know the problem and what people actually want, instead of what you think they want. That can be really hard to do. Usually, it's going to take way longer to solve that problem than you might think; creating these tools takes years. Start from the bottom, be humble.