
Building a Simple Retrieval-Augmented Generation (RAG) System with RAGTools

Let's build a Retrieval-Augmented Generation (RAG) chatbot, tailored to navigate and interact with the DataFrames.jl documentation. "RAG" is probably the most common and valuable pattern in Generative AI at the moment.

If you're not familiar with "RAG", start with this article.

```julia
using LinearAlgebra, SparseArrays
using PromptingTools
using PromptingTools.Experimental.RAGTools
## Note: the RAGTools module is still experimental and will change in the future. Ideally, it will be cleaned up and moved to a dedicated package.
using JSON3, Serialization, DataFramesMeta
using Statistics: mean
const PT = PromptingTools
const RT = PromptingTools.Experimental.RAGTools
```

RAG in Two Lines

Let's put together a few text pages from the DataFrames.jl docs. Simply go to the DataFrames.jl docs and copy & paste a few pages into separate text files. Save them in the examples/data folder (see the example pages provided). Ideally, delete all the noise (like headers, footers, etc.) and keep only the text you want to use for the chatbot. Remember: garbage in, garbage out!

```julia
files = [
    joinpath("examples", "data", "database_style_joins.txt"),
    joinpath("examples", "data", "what_is_dataframes.txt"),
]
# Build an index of chunks, embed them, and create a lookup index of metadata/tags for each chunk
index = build_index(files; extract_metadata = false);
```
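Under the hood, the first thing `build_index` does is split each file into chunks of at most `max_length` characters. A naive sketch of that idea (the real splitter in RAGTools is smarter about where it cuts; `naive_chunks` is a made-up name for illustration):

```julia
# Naive fixed-size chunker -- illustrative only; the real build_index splitter
# is more careful about where it cuts the text.
function naive_chunks(text::AbstractString; max_length::Int = 100)
    [text[i:min(i + max_length - 1, end)] for i in 1:max_length:lastindex(text)]
end

text = "We often need to combine two or more data sets."
chunks = naive_chunks(text; max_length = 16)
# Every character ends up in exactly one chunk, so join(chunks) == text
```

Smaller `max_length` gives more focused chunks but less context per chunk; it's one of the main knobs to tune later.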

Let's ask a question

```julia
# Embeds the question, finds the closest chunks in the index, and generates an answer from the closest chunks
answer = airag(index; question = "I like dplyr, what is the equivalent in Julia?")
```

AIMessage("The equivalent package in Julia to dplyr in R is DataFramesMeta.jl. It provides convenience functions for data manipulation with syntax similar to dplyr.")

First RAG in two lines? Done!

What does it do?

  • build_index will chunk the documents into smaller pieces, embed them into numbers (to be able to judge the similarity of chunks) and, optionally, create a lookup index of metadata/tags for each chunk

    • index is the result of this step and it holds your chunks, embeddings, and other metadata! Just show it 😃

  • airag will

    • embed your question

    • find the closest chunks in the index (use parameters top_k and minimum_similarity to tweak the "relevant" chunks)

    • [OPTIONAL] extract any potential tags/filters from the question and apply them to filter down the potential candidates (use extract_metadata=true in build_index; you can also provide some filters explicitly via tag_filter)

    • [OPTIONAL] re-rank the candidate chunks (define and provide your own rerank_strategy, eg, Cohere ReRank API)

    • build a context from the closest chunks (use chunks_window_margin to tweak whether preceding and succeeding chunks are included as well; see ?build_context for more details)

    • generate an answer from the closest chunks (use return_all=true to see under the hood and debug your application)
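The retrieval step above boils down to cosine similarity between the question embedding and the chunk embeddings. A minimal sketch with hand-made 2-dimensional vectors standing in for real model embeddings (the actual index stores much higher-dimensional vectors):

```julia
using LinearAlgebra

# Toy "index": each column stands in for one chunk's embedding
chunk_embeddings = [0.9 0.1 0.3;
                    0.1 0.9 0.2]
chunk_embeddings = chunk_embeddings ./ mapslices(norm, chunk_embeddings; dims = 1)

# "Embed" the question (a hand-made vector here) and normalize it
question_embedding = normalize([1.0, 0.2])

# On normalized vectors, cosine similarity is just a dot product
scores = chunk_embeddings' * question_embedding

# Keep the top_k most similar chunks, dropping anything below a threshold
top_k, minimum_similarity = 2, 0.5
candidates = sortperm(scores; rev = true)[1:top_k]
candidates = filter(i -> scores[i] >= minimum_similarity, candidates)
```

The `top_k` and `minimum_similarity` variables mirror the keyword arguments mentioned above: one caps how many chunks you pass on, the other drops chunks that are merely "least bad" rather than actually relevant.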

You should save the index for later to avoid re-embedding / re-extracting the document chunks!

```julia
serialize("examples/index.jls", index)
index = deserialize("examples/index.jls");
```

Evaluations

Next, we want to evaluate the quality of the system. For that, we need a set of questions and answers. Ideally, we would handcraft a set of high-quality Q&A pairs. However, this is time-consuming and expensive. Let's generate them from the chunks in our index!

Generate Q&A pairs

We need to provide the chunks and their sources (file paths for future reference):

```julia
evals = build_qa_evals(RT.chunks(index),
    RT.sources(index);
    instructions = "None.",
    verbose = true);
```

[ Info: Q&A Sets built! (cost: $0.102)

In practice, you would review each item in this golden evaluation set (and delete any generic/poor questions). It will determine the future success of your app, so you need to make sure it's good!

```julia
# Save the evals for later
JSON3.write("examples/evals.json", evals)
evals = JSON3.read("examples/evals.json", Vector{RT.QAEvalItem});
```

Explore one Q&A pair

Let's explore one evals item – it's not the best quality but gives you the idea!

```julia
evals[1]
```

QAEvalItem:
 source: examples/data/database_style_joins.txt
 context: Database-Style Joins
Introduction to joins
We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:

julia> using DataFrames
 question: What is the purpose of joining two or more data sets together?
 answer: The purpose of joining two or more data sets together is to provide a complete picture of the topic being studied.

Evaluate this Q&A pair

Let's evaluate this QA item with a "judge model" (often GPT-4 is used as a judge).

```julia
# Note that we used the same question, but generated a different context and answer via `airag`
ctx = airag(index; evals[1].question, return_all = true);
# ctx is a RAGContext object that keeps all intermediate states of the RAG pipeline for easy evaluation
judged = aiextract(:RAGJudgeAnswerFromContext;
    ctx.context,
    ctx.question,
    ctx.answer,
    return_type = RT.JudgeAllScores)
judged.content
```

Dict{Symbol, Any} with 6 entries:
  :final_rating => 4.8
  :clarity => 5
  :completeness => 4
  :relevance => 5
  :consistency => 5
  :helpfulness => 5
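Note that the final_rating above is consistent with a plain average of the five category scores. Whether the judge computes it exactly this way is up to the prompt template, so treat this as an illustration of how to sanity-check the judge's output:

```julia
using Statistics: mean

# clarity, completeness, relevance, consistency, helpfulness
category_scores = [5, 4, 5, 5, 5]
final_rating = mean(category_scores)
# matches the :final_rating of 4.8 returned by the judge above
```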

We can also run the generation + evaluation in a function (a few more metrics are available, eg, retrieval score):

```julia
x = run_qa_evals(evals[10], ctx;
    parameters_dict = Dict(:top_k => 3), verbose = true, model_judge = "gpt4t")
```

QAEvalResult:
 source: examples/data/database_style_joins.txt
 context: outerjoin: the output contains rows for values of the key that exist in any of the passed data frames.
semijoin: Like an inner join, but output is restricted to columns from the first (left) argument.
 question: What is the difference between outer join and semi join?
 answer: The purpose of joining two or more data sets together is to combine them in order to provide a complete picture or analysis of a specific topic or dataset. By joining data sets, we can combine information from multiple sources to gain more insights and make more informed decisions.
 retrieval_score: 0.0
 retrieval_rank: nothing
 answer_score: 5
 parameters: Dict(:top_k => 3)

Fortunately, we don't have to do this one by one – let's evaluate all our Q&A pairs at once.

Evaluate the Whole Set

Let's run each question & answer through our eval loop asynchronously (we do it only for the first 10 to save time). See ?airag for the parameters you can tweak, eg, top_k.

```julia
results = asyncmap(evals[1:10]) do qa_item
    # Generate an answer -- often you want the model_judge to be the highest quality possible, eg, "GPT-4 Turbo" (alias "gpt4t")
    msg, ctx = airag(index; qa_item.question, return_all = true,
        top_k = 3, verbose = false, model_judge = "gpt4t")
    # Evaluate the response
    # Note: you can log key parameters for easier analysis later
    run_qa_evals(qa_item, ctx; parameters_dict = Dict(:top_k => 3), verbose = false)
end
## Note that "failed" evals can show up as `nothing` (failed as in there was some API error or parsing error), so make sure to handle them.
results = filter(x -> !isnothing(x.answer_score), results);
```

Note: You could also use the vectorized version results = run_qa_evals(evals) to evaluate all items at once.

```julia
# Let's take a simple average to calculate our score
@info "RAG Evals: $(length(results)) results, Avg. score: $(round(mean(x -> x.answer_score, results); digits = 1)), Retrieval score: $(100 * round(Int, mean(x -> x.retrieval_score, results)))%"
```

[ Info: RAG Evals: 10 results, Avg. score: 4.6, Retrieval score: 100%

Note: The retrieval score is 100% only because we have two small documents and are evaluating only 10 items. In practice, you would have a much larger document set and a much larger eval set, which would result in a more representative retrieval score.
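Conceptually, the retrieval score reflects whether the gold context chunk made it into the retrieved candidates, with retrieval_rank recording where it landed (nothing means it was missed, as in the single-item eval above). A toy re-computation under that reading – the exact definition inside run_qa_evals may differ, and the ranks below are made up:

```julia
using Statistics: mean

# Hypothetical retrieval ranks for four eval items; `nothing` = gold chunk not retrieved
retrieval_ranks = [1, 2, nothing, 1]
hits = [!isnothing(r) for r in retrieval_ranks]
retrieval_score = mean(hits)  # fraction of items where the gold chunk was found
```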

You can also analyze the results in a DataFrame:

```julia
df = DataFrame(results)
```

10×8 DataFrame (column types: String, String, String, SubStrin…, Float64, Int64, Float64, Dict…)

| Row | source | context | question | answer | retrieval_score | retrieval_rank | answer_score | parameters |
|---|---|---|---|---|---|---|---|---|
| 1 | examples/data/database_style_joins.txt | Database-Style Joins\nIntroduction to joins\nWe often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:\n\njulia> using DataFrames | What is the purpose of joining two or more data sets together? | The purpose of joining two or more data sets together is to combine the data sets based on a common key and provide a complete picture of the topic being studied. | 1.0 | 1 | 5.0 | Dict(:top_k=>3) |
| 2 | examples/data/database_style_joins.txt | julia> people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])\n2×2 DataFrame\n Row │ ID Name\n │ Int64 String\n─────┼─────────────────\n 1 │ 20 John Doe\n 2 │ 40 Jane Doe | What is the DataFrame called 'people' composed of? | The DataFrame called 'people' consists of two columns: 'ID' and 'Name'. The 'ID' column contains integers, and the 'Name' column contains strings. | 1.0 | 1 | 4.0 | Dict(:top_k=>3) |
| 3 | examples/data/database_style_joins.txt | julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])\n2×2 DataFrame\n Row │ ID Job\n │ Int64 String\n─────┼───────────────\n 1 │ 20 Lawyer\n 2 │ 40 Doctor | What are the jobs and IDs listed in the dataframe? | The jobs and IDs listed in the dataframe are as follows:\n\nID: 20\nJob: Lawyer\n\nID: 40\nJob: Doctor | 1.0 | 1 | 4.67 | Dict(:top_k=>3) |
| 4 | examples/data/database_style_joins.txt | We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the innerjoin function: | How can we combine the names and jobs for each ID in a larger data set? | We can use the `innerjoin` function to combine the names and jobs for each ID in a larger data set. | 1.0 | 1 | 4.33333 | Dict(:top_k=>3) |
| 5 | examples/data/database_style_joins.txt | julia> innerjoin(people, jobs, on = :ID)\n2×3 DataFrame\n Row │ ID Name Job\n │ Int64 String String\n─────┼─────────────────────────\n 1 │ 20 John Doe Lawyer\n 2 │ 40 Jane Doe Doctor | What is the name of the person with the ID 40 and their job? | The name of the person with the ID 40 is Jane Doe and their job is Doctor. | 1.0 | 2 | 4.67 | Dict(:top_k=>3) |
| 6 | examples/data/database_style_joins.txt | In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.\n\nThe following functions are provided to perform seven kinds of joins: | What are the different kinds of joins? | The different kinds of joins are:\n\n1. Inner Join: Returns only the rows that have matching values in both data frames.\n2. Left Join: Returns all rows from the left data frame and the matching rows from the right data frame.\n3. Right Join: Returns all rows from the right data frame and the matching rows from the left data frame.\n4. Full Outer Join: Returns all rows from both data frames and fills in missing values with null.\n5. Cross Join: Returns the cartesian product of the rows from both data frames.\n6. Semi Join: Returns only the rows from the left data frame that have matching values in the right data frame.\n7. Anti Join: Returns only the rows from the left data frame that do not have matching values in the right data frame. | 1.0 | 1 | 4.66667 | Dict(:top_k=>3) |
| 7 | examples/data/database_style_joins.txt | innerjoin: the output contains rows for values of the key that exist in all passed data frames. | What does the output of the inner join operation contain? | The output of the inner join operation contains only the rows for values of the key that exist in all passed data frames. | 1.0 | 1 | 5.0 | Dict(:top_k=>3) |
| 8 | examples/data/database_style_joins.txt | leftjoin: the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument. | What is the purpose of the left join operation? | The purpose of the left join operation is to combine data from two tables based on a common key, where all rows from the left (first) table are included in the output, regardless of whether there is a match in the right (second) table. | 1.0 | 1 | 4.66667 | Dict(:top_k=>3) |
| 9 | examples/data/database_style_joins.txt | rightjoin: the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument. | What is the purpose of the right join operation? | The purpose of the right join operation is to include all the rows from the second (right) argument, regardless of whether a match is found in the first (left) argument. | 1.0 | 1 | 4.67 | Dict(:top_k=>3) |
| 10 | examples/data/database_style_joins.txt | outerjoin: the output contains rows for values of the key that exist in any of the passed data frames.\nsemijoin: Like an inner join, but output is restricted to columns from the first (left) argument. | What is the difference between outer join and semi join? | The difference between outer join and semi join is that outer join includes rows for values of the key that exist in any of the passed data frames, whereas semi join is like an inner join but only outputs columns from the first argument. | 1.0 | 1 | 4.66667 | Dict(:top_k=>3) |

We're done for today!

What would we do next?

  • Review your evaluation golden data set and keep only the good items

  • Play with the chunk sizes (max_length in build_index) and see how it affects the quality

  • Explore using metadata/key filters (extract_metadata=true in build_index)

  • Add filtering for semantic similarity (embedding distance) to make sure we don't pick up irrelevant chunks in the context

  • Use multiple indices or a hybrid index (add a simple BM25 lookup from TextAnalysis.jl)

  • Data processing is the most important step - properly parsed and split text can work wonders

  • Add re-ranking of context (see rerank function, you can use Cohere ReRank API)

  • Improve the question embedding (eg, rephrase it, generate hypothetical answers and use them to find better context)

... and much more! See some ideas in Anyscale RAG tutorial
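On the hybrid-index idea: even before wiring in a proper BM25 lookup from TextAnalysis.jl, you can get a feel for mixing lexical and semantic signals. A crude sketch that blends a pretend cosine score with simple word overlap – the function name, weights, and scores here are all made up for illustration:

```julia
# Fraction of the question's words that also appear in the chunk (crude lexical signal)
function keyword_overlap(question::AbstractString, chunk::AbstractString)
    q_words = Set(split(lowercase(question)))
    c_words = Set(split(lowercase(chunk)))
    length(intersect(q_words, c_words)) / max(length(q_words), 1)
end

chunks = ["innerjoin combines rows matching in all data frames",
          "DataFrames.jl is a package for tabular data"]
cosine_scores = [0.9, 0.3]  # pretend embedding similarities
question = "how to combine data frames with innerjoin"

# Weighted blend of semantic and lexical scores (weights are arbitrary)
hybrid = [0.7 * cosine_scores[i] + 0.3 * keyword_overlap(question, chunks[i])
          for i in eachindex(chunks)]
best = argmax(hybrid)
```

A real hybrid index would replace `keyword_overlap` with BM25 scoring and tune the blend weights on your eval set.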


This page was generated using Literate.jl.