🤖 Fine-Tuning or RAG: What I Chose for my Latest Project and Why

Mar 3, 2024

If you are a software developer getting started in the Generative AI and LLM world for a software project, then THIS IS YOUR ARTICLE 🎯.

Disclaimer: I am no expert whatsoever in Artificial Intelligence. In this article I just share my experience and some of the learnings I have gathered about the topic as a 👨‍💻 developer using LLMs in his projects.

Before we start, I assume you have already played around with ChatGPT at some point, asking it questions or having it solve a challenge. When doing so, you were basically asking a structured question with little to no context, assuming ChatGPT would be able to answer you. This is what is called Prompting.

In my last article, I talked about my latest project, which consisted of summarizing and rewording in simple words the content and new laws published by the Spanish Government on their website. Check my article here.

Right before diving deep into the topic, keep in mind one important learning that I want you to take from this article:

LLMs love to make up stuff, and there is a reason: they don’t store facts, they store probabilities.

PROMPTING

I am going to talk to the developers out there: once you have played around with prompting (as I did in the first version of my project), you see that prompts become more elaborate over time, because you want them to give you a better and more accurate output. You end up with longer prompts, which translates into more money spent on each interaction. This is what was happening to me with OctoBOE. Even though the task was simple, I wanted the model to return a more precise output with as few errors as possible on every interaction (an error means retrying, and that means more money spent).

As the prompt kept growing, I realized that a well-structured and functional prompt had these parts:

1️⃣ Preparation: Give the model a role to “impersonate” and some background for it.
2️⃣ Tone and Audience: Establish the difficulty of the vocabulary to use in the output and indicate what your end user looks like. In my case, I wanted really easy vocabulary, as the output was aimed at people who may not be familiar with the legal world.
3️⃣ Output Format: Specify the structure of the response you expect to receive. In my case I was using ChatGPT as just another API and, therefore, I expected ChatGPT to return the response in valid JSON format.
4️⃣ Ideal interaction examples: Include one or a few examples of the ideal interaction in the prompt. This is a technique that you have probably used without knowing (at least I didn’t know) that it has a name: “Few-shot prompting”. LLMs like to make up stuff about what they are being asked. Either they don’t respect the output format, they reply with something unrelated, or they just do things wrong, and there is a reason for it: LLMs don’t store facts but rather probabilities.
I recommend you also indicate what to do when an interaction fails: the model should know how to respond when it is not capable of doing what you request.
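To make the four parts above a bit more concrete, here is a minimal sketch of how they could be assembled into messages for a chat-style API. Everything in it is an assumption for illustration: the wording of each part, the JSON shape and the build_messages helper are hypothetical and are not the actual OctoBOE prompt.

```python
import json

# Hypothetical wording for each part of the prompt (not the real OctoBOE prompt).
ROLE = "You are a legal assistant who explains new Spanish laws."
TONE = "Use very simple vocabulary; the reader has no legal background."
OUTPUT_FORMAT = (
    'Reply only with valid JSON shaped like {"summary": "..."}. '
    'If you cannot summarize the text, reply with {"error": "..."}.'
)

# Few-shot part: one ideal interaction as a (user input, ideal reply) pair.
FEW_SHOT = [
    (
        "Law text: The minimum wage is raised by 5% from January.",
        json.dumps({"summary": "The lowest legal salary goes up 5% in January."}),
    ),
]


def build_messages(law_text: str) -> list[dict]:
    """Assemble system prompt, few-shot examples and the real input, in that order."""
    system = " ".join([ROLE, TONE, OUTPUT_FORMAT])
    messages = [{"role": "system", "content": system}]
    for user_example, assistant_example in FEW_SHOT:
        messages.append({"role": "user", "content": user_example})
        messages.append({"role": "assistant", "content": assistant_example})
    messages.append({"role": "user", "content": f"Law text: {law_text}"})
    return messages
```

The resulting list has the same role/content shape as the Chat Completions messages, so every extra example you add makes each call longer, which is exactly the cost problem described above.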

FINE-TUNING

Prompting is fine but, as I explained, it has its limitations once you want a better and more precise output at a lower cost. What would be ideal is to take the ideal interaction I added to the prompt and instead prepare many of these ideal interactions, covering many scenarios, and feed them to the model. Well, this is fundamentally what fine-tuning your model means. In my case, I had used prompting for many days and had real-life interactions that I really liked, so I knew which ones to pick for feeding the model.

This is what I applied for the newest version of OctoBOE: I grabbed 30 of the best interactions I had seen the model produce and, by following the simple guide that the 📖 OpenAI docs offer, I started to fine-tune my model.

An example of the format of the data to feed is explained in the docs. It should follow the same format as an interaction using the Chat Completions API (A.K.A. “the conversational chat API everybody usually uses”), with one JSON object per line:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
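If you collect your favourite interactions by hand, it helps to generate and sanity-check these lines with a small script before uploading them. The following is only a sketch under my own assumptions: to_jsonl_line and validate_jsonl_line are hypothetical helpers, and the check is a simplified one (it only covers plain system/user/assistant text messages), not the validation OpenAI itself runs.

```python
import json

# Simplified assumption: only these three roles, all with text content.
ALLOWED_ROLES = {"system", "user", "assistant"}


def to_jsonl_line(system: str, user: str, assistant: str) -> str:
    """Serialize one ideal interaction as a single JSONL line."""
    example = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }
    return json.dumps(example, ensure_ascii=False)


def validate_jsonl_line(line: str) -> bool:
    """Return True if the line parses as JSON and has the expected shape."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(example, dict):
        return False
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if message.get("role") not in ALLOWED_ROLES:
            return False
        if not isinstance(message.get("content"), str):
            return False
    # The last message should be the reply we want the model to learn.
    return messages[-1]["role"] == "assistant"
```

Writing the training file is then just a matter of writing one validated line per interaction to a .jsonl file.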

So I am going to explain to you why I started using fine-tuning and when you should start using it as well. Or, better said, why you should use it:

It refines the style, the format and the tone of your interaction outputs. The more data you feed the model, the more precisely it will answer and the more it learns to use similar vocabulary and tone.
It narrows down the range of possible outputs. It is less probable that the model makes up its response and doesn’t comply with your instructions.
It reduces your prompt length, as the examples are already given beforehand, so you basically get a shorter prompt with a more accurate output.
It doesn’t require much data initially. In my case I fed it 30 examples, but you can start with as few as 10 of them to see improvements. After all, we saw before that with the few-shot technique we were just providing a couple of examples, and that was enough context to get more accurate results.
It is not expensive anymore. At least OpenAI offers pricing that is rather affordable for everybody. Also, on every interaction you are saving tokens (keep that in mind).

To finish up this part, I’ll just mention that there are 2 possible strategies to follow for fine-tuning. “Quality first”: train a model to optimize the output. “Speed and Cost Optimization”: aim to perform at a higher level while reducing costs. This second strategy requires a larger dataset.

RAG (Retrieval Augmented Generation)

Prompting and fine-tuning are fine when you have a use case like mine where what matters is the tone, the output structure and that it can learn from previous interactions to get better.

But sometimes your use case requires the model to work with data that belongs specifically to your company, or with a topic that is private and that the LLM was not trained on. This data can be a PDF, a web page, the Confluence of your company, a knowledge-sharing document, etc.

Most of the time these documents are not public, or they cover topics on which the model was trained using outdated data. Let me explain: until GPT-4, OpenAI models were trained on data only as recent as January 2022, which means there is a 2-year gap with today, and that can lead to outdated outputs.

RAG consists of providing this specific data so that, as mentioned before, the probability of choosing the right output increases. Nowadays, GPT-4 allows you to attach documents to the chat to give some context to the model and ask questions about them. This is essentially what RAG is. Even though it is not what I used in my project, let’s dive a bit into how RAG works:

STEP 1: Store your data

Take your documents/data and split them into chunks of the size you decide. A chunk can be a paragraph, a chapter, a page or maybe just a sentence. The size is up to you (of course, granularity will affect the performance, costs, etc.).
Take these chunks and store them in a vector database. Basically, using an embedding model, you convert these chunks into vector representations and store those.
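Step 1 can be sketched in a few lines. Keep in mind this is a toy version under loud assumptions: toy_embed is a hypothetical stand-in (a hashed bag of words) for a real embedding model, a plain Python list plays the role of the vector database, and I chose paragraphs as the chunk size just for the example.

```python
import hashlib
import math
import re

DIM = 64  # size of the toy vectors; real embedding models use far more dimensions


def split_into_chunks(text: str) -> list[str]:
    """Split on blank lines so that each paragraph becomes one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vector = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def build_store(document: str) -> list[dict]:
    """In-memory stand-in for a vector database: one record per chunk."""
    return [
        {"text": chunk, "vector": toy_embed(chunk)}
        for chunk in split_into_chunks(document)
    ]
```

Each record keeps the original text next to its vector, because what you eventually hand to the LLM is the text, not the numbers.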

STEP 2: Interaction

Whenever a user interacts with your model, you take that inquiry and embed it using the same embedding model, so it becomes a vector.
Then you ask your vector database to retrieve the X most relevant vectors based on the input vector (kind of like getVectorsByGivenVector([x,y,z]), or at least that is how I imagine it).
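The imagined getVectorsByGivenVector([x,y,z]) call could look roughly like this. It is a sketch, not how any particular vector database does it: I am assuming cosine similarity as the relevance measure and a brute-force scan over a small in-memory list, whereas real databases use approximate indexes. The store contents and the 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def get_vectors_by_given_vector(query: list[float], store: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k records whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, record["vector"]), record) for record in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]


# Made-up records standing in for chunks that were embedded in step 1.
store = [
    {"text": "chunk about taxes", "vector": [1.0, 0.0, 0.0]},
    {"text": "chunk about housing", "vector": [0.0, 1.0, 0.0]},
    {"text": "chunk about tax deadlines", "vector": [0.9, 0.1, 0.0]},
]
results = get_vectors_by_given_vector([1.0, 0.0, 0.0], store, top_k=2)
print([record["text"] for record in results])
# prints ['chunk about taxes', 'chunk about tax deadlines']
```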

This is what a simple RAG would look like: the text chunks behind the retrieved vectors are added to the prompt as context for the LLM. It can get as complex and optimized as desired. For step 2, you can get more accurate results if you add intermediate steps where, for example, you ask the LLM to choose the most applicable one out of the n results returned by the database. Or you could refine the output by asking the LLM to rewrite the answer more accurately. You could add iterations to refine it, you could combine fine-tuning with RAG… you could do many things, but you get the point. RAG is, in the simplest way to define it:
PROMPT + An Authoritative Knowledge Base.
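In code, that “PROMPT + knowledge base” idea boils down to pasting the retrieved chunks into the prompt before the question. Here is a minimal sketch; build_rag_prompt and the instruction wording are my own assumptions, and real setups usually tune this template quite a bit.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user question into one prompt."""
    # Number the chunks so the model (and you) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The string it returns is what you would send to the LLM as the user message, instead of sending the bare question.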

✍️ Conclusions

This is a very simple summary of the differences between RAG, Fine-Tuning and pure prompting. I keep learning every day about these topics as I improve the model used for my project and reduce its costs, but in the meantime I just wanted to share with you some of my learnings after reading and watching content about these topics. Some of the resources I consumed when researching are:

https://www.youtube.com/watch?v=iOJD1hw2xaw&t=17s 👉 When to use RAG or Fine-tuning.
https://www.youtube.com/watch?v=bjCdsnkQ6Dw&t=1459s&pp=ygUMY29kZWx5dHYgcmFn 👉 In Spanish but very valuable resource about RAG, Fine-tuning and prompting.
https://www.youtube.com/watch?v=YVWxbHJakgg 👉 Great summary about these topics. Extremely valuable.
https://www.linkedin.com/pulse/exploring-power-retrieval-augmented-generation-rag-dr-patrick-j-wolf-syzoc/?utm_source=share&utm_medium=member_ios&utm_campaign=share_via 👉 Intro to RAG
https://www.databricks.com/glossary/retrieval-augmented-generation-rag 👉 All about RAG