Introduction
Here at The Data Refinery, one of our key objectives is to empower users to get the most from their data without needing to be data experts.
We have achieved that through automated report generation combined with well-designed segmentation and dashboard-building tools. Even with this approach there is still a hurdle to overcome: familiarity with the data and its structure.
A lot of tools on the market leverage AI to generate SQL, which can then be used to query a data source. This works well, but it can lead to further confusion if the results are not as expected, because the user would ideally need to check the query to validate that it aligns with the question asked, which is challenging without developer experience.
We wanted to push our levels of data accessibility further by allowing our users to ask a question to which we provide an answer along with the actual data. That response can then be used to populate charts and dashboards.
This simplification of insight building removes the technical hurdle of needing a comprehensive understanding of a data model before getting valuable insight from data. It also helps our users become more familiar with our dashboard-building tools.
The requirements for our AI Assistant were as follows:
- Must accept free text questions
- Must respond with an answer to the question
- Must provide a useful summary of the answer, based on the question asked
- Must provide accompanying charts or tables to visualise the data
- Must allow a user to ask further questions or adjust the query we generate
In addition to the above, we had a crucial requirement: the AI must produce a JSON model that matches our existing API contract, ensuring the outputs from the AI can be re-used across our entire application.
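To make that concrete, here is a minimal, hypothetical sketch of the kind of query-definition JSON we are talking about. The property names are purely illustrative and are not our actual API contract:

```python
import json

# Purely illustrative query definition -- the real property names follow
# our existing API contract, which isn't reproduced here.
query_definition = {
    "question": "How many customers opted in to marketing?",
    "tables": ["Customer"],
    "measures": ["Customer.count"],
    "filters": [
        {"column": "Customer.OptInLevel", "operator": "equals", "values": ["Marketing"]}
    ],
}

# Serialised, this is the JSON payload the rest of the application consumes.
print(json.dumps(query_definition, indent=2))
```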
Enter: The Refinery Bot (or Geoff, as he is also known):

Our journey to get to this point has been somewhat stop/start, as we had to spend an extended period of time evaluating and validating our new AI product to ensure it behaves as we expect and is consistent in returning valid results.
We also invested heavily in our underlying data tier, as detailed in our first blog post. Having a highly consistent, metadata-rich data warehouse greatly helped our AI agents complete the tasks assigned to them.
Below is a list of key learnings from our journey:
Prompt Structure
Having a consistent structure for your prompts is common sense: it means you should get consistent behaviour from your AI model, and having a convention or standard ensures maintainability.
Here is our current prompt template:
You are an expert {role} responsible for {responsibilities}
Given the below {context item}
--------------------------------------------------------------------------
{data context}
--------------------------------------------------------------------------
Perform the following {action(s)}
Return only {response requirement}
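As an illustration, a template like this can be filled in with plain string formatting. The values below are placeholders we have made up for this post, not our production prompts:

```python
# Minimal sketch: filling the prompt template with plain string formatting.
# All values below are illustrative placeholders, not our production prompts.
PROMPT_TEMPLATE = """You are an expert {role} responsible for {responsibilities}
Given the below {context_item}
--------------------------------------------------------------------------
{data_context}
--------------------------------------------------------------------------
Perform the following {actions}
Return only {response_requirement}"""

prompt = PROMPT_TEMPLATE.format(
    role="data analyst",
    responsibilities="selecting the tables needed to answer a question",
    context_item="table metadata",
    data_context="Customer: Use this table to answer questions about customer characteristics",
    actions="actions: choose the most suitable tables for the user's question",
    response_requirement="a JSON array of table names",
)
```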
General Prompt Commands
There are also some general pointers on how to instruct your AI model:
- JSON Mode: We wanted our AI assistant to return JSON data; to do that we included the following key prompt:
"You are a database api that can only respond with valid JSON that can be parsed"
- Being Succinct: AI models tend to over-share when responding to questions, and we needed our AI to return just the data, nothing more. To ensure this, we used the prompt:
"Return your response directly without any explanations and code snippet formatting."
- Consistent Tone: To avoid double negatives and conflicting logic, ensure all prompts use the same tone of instruction, e.g. "Do this" rather than "Don't do that"; mixing instruction styles often confuses the model.
- Temperature & Top P: These two parameters were very important for us as we didn't want our AI having any license to stray away from our metadata schema. In early attempts our AI would do a great job building a data structure, however it would inject fictional properties which sounded like a property that would answer a user’s question. To control this behaviour, we dropped down the Temperature to 0.01 and set the TopP to be 0.1. Having these properties as low as this can be ensured our AI only used the data we wanted it to and was much more consistent in its responses.
- Leverage a RAG framework: We found interacting directly with the various AI APIs quite clunky, so we invested time in evaluating frameworks to help. Leveraging a framework makes the coding side of this task much easier, as you can further standardise the inputs and outputs of your process flows and AI interactions. For reference, we would recommend both LlamaIndex and PhiData as excellent frameworks for building AI workflows. See our future blog posts on how we use those tools.
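To show how the JSON-mode prompt and the sampling parameters fit together, here is a minimal sketch assuming the OpenAI Python SDK; any chat-completion API with equivalent parameters would work, and the model name and question are illustrative:

```python
import json
from openai import OpenAI  # assumption: OpenAI Python SDK; any comparable API works

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a database api that can only respond with valid JSON that can be parsed. "
    "Return your response directly without any explanations and code snippet formatting."
)

response = client.chat.completions.create(
    model="gpt-4o",                           # illustrative model name
    temperature=0.01,                         # keep the model from straying off-schema
    top_p=0.1,
    response_format={"type": "json_object"},  # ask for parseable JSON only
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Which tables answer: how many customers opted in to marketing?"},
    ],
)

# Fail fast if the model still returns something that isn't valid JSON.
payload = json.loads(response.choices[0].message.content)
```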
Data Schema
We implement a semantic layer (Cube) on top of our data warehouse. This level of abstraction allows us to store a wealth of metadata describing the schema of our warehouse and the use cases for specific tables and columns.
This metadata was originally intended to give our users useful information to guide them through interacting with our warehouse model; as it turns out, our AI model also found it very useful.
We did change the structure/language of our metadata slightly to benefit our AI. The metadata was tweaked from:
"This table contains information about customers"
To:
"Use this table to answer questions about customers"
The change in language made the difference for us. That might just be linked to our prompt-engineering style, but having instructional metadata gave our AI agents better direction when making decisions. Furthermore, we updated the column, metric and join metadata in the same way.
- Table Name: e.g. "Customer"
- Table Description: e.g. "Use this table to answer questions about customer characteristics"
- Column Name: e.g. "OptInLevel"
- Column Description: e.g. "Use this column to filter the table by the level of marketing consent"
- Column Type: e.g. string
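To illustrate, here is a hypothetical, simplified representation of that instructional metadata and how it might be flattened into prompt context. This is our own illustration for this post, not Cube's actual schema definition format:

```python
# Hypothetical, simplified metadata structure -- not Cube's schema format.
CUSTOMER_TABLE = {
    "name": "Customer",
    "description": "Use this table to answer questions about customer characteristics",
    "columns": [
        {
            "name": "OptInLevel",
            "type": "string",
            "description": "Use this column to filter the table by the level of marketing consent",
        }
    ],
}

def format_table_metadata(table: dict) -> str:
    """Flatten table metadata into lines that can be injected into an agent prompt."""
    lines = [f"Table {table['name']}: {table['description']}"]
    for column in table["columns"]:
        lines.append(f"  Column {column['name']} ({column['type']}): {column['description']}")
    return "\n".join(lines)

print(format_table_metadata(CUSTOMER_TABLE))
```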
Agent Workflow
In the early days we created one comprehensive prompt to handle the majority of the logic we wanted our AI model to perform. This actually worked well, but as time progressed and we conducted more testing, we found that such a complex prompt led to issues when fine-tuning our outputs. This was mainly due to the flow of our prompts: in some cases it was too easy for the AI to fall back to earlier instructions, which led to confusing outputs.
To resolve this, we opted to implement a workflow of agents, each of which had a single responsibility. This allowed us to test each step of the workflow and build up our desired output in phases. This pattern also allowed us to exit early if we didn't get a valid response from our agents, which saves time and token spend.
This workflow is more commonly known as a RAG (Retrieval-Augmented Generation) architecture, which enables an AI model to arrive at a desired answer by marshalling the data it should reason over. To implement this architecture, each agent had defined inputs and outputs, with the outputs from one agent passed to the next. At a high level, the workflow we settled on looks like this:
Agent 1: Evaluate the question history and determine whether the most recent question is a new question or a follow-up question.
Agent 2: Determine the most suitable tables that can be used to answer the question asked.
Agent 3: Determine the most suitable columns to query.
Agent 4: Construct a query definition to be issued against the data warehouse.
Agent 5: Validate filter values, if present.
Agent 6: Evaluate the response to the query and provide an answer summary.
This workflow is depicted below:

For each stage of this workflow, a few things happen (sketched in code after this list):
- The context for the agent is set, e.g. a question to answer, or a question along with some metadata or data context to be used when generating an answer.
- A prompt is constructed from the instructions, with the key context and data attributes injected into it.
- The AI model is called using the generated prompt.
- The AI model's answer is evaluated, and either the next stage of the workflow is executed or the process returns a response back to the user.
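As a minimal sketch of this pattern (the stage and helper names here are illustrative, not our production code), each stage can be written as a function that either returns an updated context to hand to the next agent or signals failure, which lets the workflow exit early:

```python
from typing import Callable, Optional

# Each stage takes the accumulated context and returns an updated context,
# or None if it could not produce a valid result (triggering an early exit).
Stage = Callable[[dict], Optional[dict]]

def run_workflow(stages: list[Stage], context: dict) -> dict:
    """Run agent stages in order, stopping as soon as one fails to return a result."""
    for stage in stages:
        result = stage(context)
        if result is None:
            return {"error": f"Could not complete stage: {stage.__name__}"}
        context = result
    return context

# Illustrative stages. In our real workflow each of these constructs a prompt,
# calls the model in JSON mode, and validates the response before passing it on.
def classify_question(context: dict) -> Optional[dict]:
    return {**context, "is_follow_up": False}

def choose_tables(context: dict) -> Optional[dict]:
    return {**context, "tables": ["Customer"]}

result = run_workflow(
    [classify_question, choose_tables],
    {"question": "How many customers opted in to marketing?"},
)
```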
Summary
Even with all the above suggestions, there are still some hard yards and a pretty big investment of time needed to test and validate your AI model; there isn't really any getting away from that currently. As responsible AI engineers, we have to ensure that our users are not exposed to harmful or incorrect results from an AI interaction, so make sure you factor this into any AI engineering work you plan to undertake.
The next 12 months will be very interesting. A valuable progression in this space would be a mechanism to further standardise how we interact with AI models, ensuring consistent inputs and outputs. Prompt engineering in its current form feels difficult to get right simply because there is so little standardisation currently available.