Top 4 Challenges of Using RAG with LLMs to Query Databases (Text-to-SQL) and How to Solve Them

Howard Chi
Published in WrenAI
6 min read · Apr 17, 2024


The advent of LLMs has shown that machines can comprehend natural language. This capability has helped engineers do many amazing things, such as writing code documentation and performing code reviews, and one of the most common use cases is code generation. GitHub Copilot has demonstrated AI's ability to understand an engineer's intent and generate code accordingly in languages such as Python, JavaScript, and SQL.

Using LLMs to solve Text-to-SQL

Building on the code-generation capability of LLMs, many people have started considering them as a solution to a long-standing hurdle: retrieving data from databases using natural language, sometimes called "Text-to-SQL." The idea of Text-to-SQL is not new, but with the emergence of Retrieval-Augmented Generation (RAG) and the latest breakthroughs in LLM models, Text-to-SQL has a new opportunity: combining LLM comprehension with RAG techniques to understand internal data and knowledge.

Text-to-SQL through RAG architecture

Challenges of Text-to-SQL using RAG

In the Text-to-SQL scenario, users need precision, security, and stability before they can trust LLM-generated results. However, building an executable, accurate, and security-controlled Text-to-SQL solution is not simple. Here we identify the four key technical challenges of using LLMs with RAG to query databases through natural language: context collection, retrieval, SQL generation, and collaboration.

Four key challenges of using LLM with RAG for Text-to-SQL

Challenge 1: Context Collection Challenges

  • Interoperability Across Diverse Sources: Searched and integrated information must be generalized and normalized seamlessly across varied data sources, metadata services, and APIs.
  • Complex Linking of Data and Metadata: Data must be associated with its metadata in a document store, which means storing metadata, schema, and context such as relationships, calculations, and aggregations.
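To make the metadata-linking challenge concrete, here is a minimal sketch of how a table's schema, column descriptions, and relationships could be packaged into one retrievable document before indexing. The class names and field layout are illustrative assumptions, not any specific product's API.

```python
from dataclasses import dataclass, field

# Hypothetical structures showing how data, schema, and business context
# can be linked into a single document for a document/vector store.
@dataclass
class ColumnMeta:
    name: str
    dtype: str
    description: str = ""

@dataclass
class TableDocument:
    table: str
    columns: list[ColumnMeta]
    # Relationships expressed as simple "fk -> pk" strings for illustration.
    relationships: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten data + metadata into one chunk suitable for embedding."""
        cols = ", ".join(f"{c.name} {c.dtype} ({c.description})" for c in self.columns)
        rels = "; ".join(self.relationships) or "none"
        return f"table {self.table}: columns [{cols}]; relationships: {rels}"

doc = TableDocument(
    table="orders",
    columns=[ColumnMeta("id", "int", "order id"),
             ColumnMeta("amount", "decimal", "order total in USD")],
    relationships=["orders.customer_id -> customers.id"],
)
print(doc.to_text())
```

The point of the flattened `to_text` form is that relationships and calculations travel together with the schema, so a later retrieval step can surface all the context the LLM needs in one hit.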

Challenge 2: Retrieval Challenges

  • Optimization of Vector Stores: Developing and implementing optimization techniques for vector stores, such as indexing and chunking, is critical for enhancing search efficiency and precision.
  • Precision in Semantic Search: The challenge lies in the nuances of understanding queries in context, which can significantly affect the accuracy of results. This usually involves techniques such as query rewriting and re-ranking.
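A toy pipeline can show how query rewriting and re-ranking fit together. The synonym table and token-overlap scoring below are stand-ins for a real rewrite model and embedding similarity; every name here is an assumption for illustration.

```python
# Toy retrieval pipeline: rewrite the query, score candidates, re-rank.
# Jaccard token overlap stands in for embedding-based similarity.
SYNONYMS = {"sales": "revenue", "clients": "customers"}  # assumed rewrite rules

def rewrite(query: str) -> str:
    """Normalize user wording toward the vocabulary used in the index."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in query.lower().split())

def score(query: str, doc: str) -> float:
    q, d = set(query.split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def retrieve(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    q = rewrite(query)
    ranked = sorted(docs, key=lambda d: score(q, d), reverse=True)
    return ranked[:top_k]

docs = [
    "orders table with amount and created_at tracks revenue per order",
    "customers table with id name and region",
    "inventory table with sku and quantity",
]
print(retrieve("total sales by clients", docs, top_k=2))
```

Without the rewrite step, "sales" and "clients" would match nothing in the index; after rewriting to "revenue" and "customers," the relevant schema documents rank to the top, which is exactly the precision gain the bullet above describes.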

Challenge 3: SQL Generation Challenges

  • Accuracy and Executability of SQL Queries: Generating SQL queries that are both accurate and executable poses a significant challenge. This requires the LLM to have an in-depth understanding of SQL syntax, database schemas, and the specific dialects of different database systems.
  • Adaptation to Query Engine Dialects: Databases often have unique dialects and nuances in SQL implementation. Designing LLMs that can adapt to these differences and generate compatible queries across various systems adds another layer of complexity to the challenge.
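One practical guard against inaccurate generation is to check that a generated query at least compiles against the target schema before returning it. The sketch below uses SQLite's `EXPLAIN`, which parses and plans a statement without executing it; a production system would validate against the actual target engine, since dialects differ. The schema and queries are illustrative.

```python
import sqlite3

def is_executable(sql: str, schema_ddl: str) -> bool:
    """Return True if the SQL parses and plans against the given schema.

    EXPLAIN makes SQLite compile the statement, so syntax errors and
    unknown tables/columns surface without running the query.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

SCHEMA = "CREATE TABLE orders (id INTEGER, amount REAL, created_at TEXT);"
print(is_executable("SELECT SUM(amount) FROM orders", SCHEMA))        # valid query
print(is_executable("SELECT amount FROM missing_table", SCHEMA))      # unknown table
```

A failed check can be fed back to the LLM with the error message, giving it a chance to repair the query before the user ever sees a broken result.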

Challenge 4: Collaboration Challenges

  • Collective Knowledge Accumulation: The challenge lies in creating a mechanism that can effectively gather, integrate, and utilize the collective insights and feedback of a diverse user base to enhance the accuracy and relevance of the data retrieved by the LLM.
  • Access Control: Once data is finally being retrieved, the next critical challenge is ensuring that existing organizational data-access policies and privacy regulations also apply to the new LLM and RAG architecture.
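The access-control point can be sketched as a filter applied before retrieved context ever reaches the LLM: if a role may not see a table, its schema documents are dropped from the prompt. The role names and document shape below are made up for illustration.

```python
# Sketch: enforce existing access policies at retrieval time, so the LLM
# never sees schema a user is not allowed to query. Roles are illustrative.
ACL = {
    "analyst": {"orders", "customers"},
    "support": {"customers"},
}

def allowed_tables(role: str) -> set[str]:
    return ACL.get(role, set())

def filter_context(role: str, retrieved: list[dict]) -> list[dict]:
    """Drop retrieved schema documents the user's role may not access."""
    permitted = allowed_tables(role)
    return [doc for doc in retrieved if doc["table"] in permitted]

retrieved = [{"table": "orders", "text": "orders schema ..."},
             {"table": "salaries", "text": "salaries schema ..."}]
print(filter_context("analyst", retrieved))  # only the orders document survives
```

Filtering at the retrieval boundary, rather than after generation, means a restricted table cannot leak into the generated SQL or the explanation shown to the user.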

Interested in learning more? We plan to share more about each challenge in future posts; follow us on Medium to get notified!

How can we solve these challenges? A semantic layer for LLMs.

To solve the challenges above, we need a layer between LLMs and data sources that allows LLMs to learn the context of business semantics and metadata from those sources. This layer also needs to map the semantics onto the physical data structure; it is often called the "semantic layer." Besides connecting semantics to data structure, the semantic layer must coordinate access control and identity management, ensuring that only the right person accesses the right data.

What should be included in the semantic layer for LLMs? Here we group the requirements into a few aspects.

Data Interpretation and Presentation

  1. Business Terminology and Concepts: The semantic layer includes definitions of business terms and concepts. For example, a term like “revenue” is defined in the semantic layer, so when business users query their BI tool for “revenue,” the system knows exactly what data to retrieve and how to calculate it based on the underlying data sources.
  2. Data Relationships: It defines the relationships between different data entities. For instance, how customer data relates to sales data or how product data is linked to inventory data. These relationships are crucial for performing complex analyses and generating insights.
  3. Calculations and Aggregations: The semantic layer often includes predefined calculations and aggregation rules. This means that users don’t need to know how to write complex formulas to, for example, calculate year-to-date sales; the semantic layer handles these operations based on the definitions and rules it contains.

Data Access and Security

  1. Security and Access Controls: It can also manage who has access to what data, ensuring that users can only see and analyze data that they are authorized to access. This is crucial for maintaining data privacy and compliance with regulations.

Data Structure and Organization

  1. Data Source Mapping: The semantic layer maps the business terms and concepts to the actual data sources. This includes specifying which database tables and columns correspond to each business term, allowing the BI tool to retrieve the correct data.
  2. Multidimensional Models: In some BI systems, the semantic layer includes multidimensional models (like OLAP cubes) that allow for complex analyses and data slicing/dicing. These models organize data into dimensions and measures that users can easily explore and analyze.
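A multidimensional model can be sketched as a declaration of dimensions and measures that compiles into a GROUP BY query, which is the essence of slicing and dicing. The table and column names below are made up for illustration.

```python
# Sketch: a cube-style model declared as dimensions + measures,
# compiled into a GROUP BY query. Names are illustrative.
def build_cube_query(table: str, dimensions: list[str], measures: dict[str, str]) -> str:
    """Compile dimensions and measure expressions into an aggregate query."""
    select = dimensions + [f"{expr} AS {name}" for name, expr in measures.items()]
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

print(build_cube_query("sales", ["region", "product"], {"total": "SUM(amount)"}))
```

Users (or the LLM) then only choose dimensions and measures by name; the layer guarantees the grouping and aggregation are assembled correctly.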

Metadata

  1. Metadata Management: It manages metadata, which is data about the data. This includes descriptions of data sources, transformations, data lineage, and any other information that helps users understand the data they are working with.

Introducing WrenAI

WrenAI Overview

WrenAI makes your database RAG-ready.

WrenAI is open-source. You can deploy WrenAI anywhere, with your own data, LLM APIs, and environment. It comes with intuitive onboarding and a user interface that lets you connect to your data sources and model your data within minutes.

Under the hood of WrenAI, we developed a framework called "Wren Engine," the semantic layer for LLMs that I described in the previous section. The Wren Engine is also open-sourced on GitHub. If you are interested in Wren Engine, please leave a comment; I plan to share more about its architecture and design in future posts.

Modeling in WrenAI

After you connect your data sources, WrenAI automatically collects all the metadata. Through the WrenAI UI, you can then add business semantics and relationships; WrenAI automatically updates your vector store for future semantic search.

Data modeling in WrenAI

Ask and Follow-up Questions

After modeling, you can start asking your business questions; WrenAI will retrieve the three most relevant results for you. Once you choose one of the options, it breaks the answer down into step-by-step explanations of where the data comes from, with summaries, so you can gain more confidence in the result WrenAI suggests.

Once you get a result back from WrenAI, you can ask follow-up questions based on it for deeper insights or analytics.

Ask and follow-up questions in WrenAI.

Try WrenAI today on GitHub and our community!

👉 GitHub: https://github.com/Canner/WrenAI

👉 Discord: https://discord.gg/5DvshJqG8Z
