WordMap, where Wordle meets Semantic Search
We all know Wordle or have played it before; I personally love it ❤️. It’s that fun game where you try to guess a five-letter word within six tries.
A few weeks ago, while using semantic search in a RAG (if you want to know more about what a RAG is, check out my previous post), an idea came to mind.
What if there was a game like Wordle, but instead of guessing the word based on letter positions, you guessed the word of the day by how close your guess is in meaning? You’d input different words, and the game would score how semantically similar each one is to the word of the day, that is, how related they are in meaning and context. The goal would be to guess the word in the fewest tries, though you could take as many as you want.
That’s how ☀️ WordMap came to be!
To develop the game, I knew I needed to embed the user’s input word and the target word for that day, then calculate how semantically close those two words were. Then I would normalize the result to a score between 0 and 100 and show it to the user in a nice, intuitive UI.
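Conceptually, the whole game boils down to something like the sketch below. This is just my illustration of the flow, not the actual WordMap code: embed and cosineSimilarity are declared as placeholders here and are fleshed out in the sections that follow.

```typescript
// Placeholder declarations: how the embedding is obtained and how cosine
// similarity is computed is covered later in the post.
declare function embed(text: string): Promise<number[]>;
declare function cosineSimilarity(a: number[], b: number[]): number;

async function scoreGuess(guess: string, wordOfTheDay: string): Promise<number> {
  // Embed both the player's guess and the hidden word of the day.
  const [guessVec, targetVec] = await Promise.all([
    embed(guess),
    embed(wordOfTheDay),
  ]);

  // Cosine similarity lands roughly in [-1, 1]; higher means closer in meaning.
  const similarity = cosineSimilarity(guessVec, targetVec);

  // Map it to an intuitive 0–100 score for the UI (normalization is revisited later).
  return Math.round(Math.max(0, Math.min(1, similarity)) * 100);
}
```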
The embedding part (Tricky one)
RAGs are very popular nowadays for retrieving relevant data based on an input. The problem in this case is that we are dealing with single words, not full paragraphs, so there is very little context to work with.
Based on the granularity of the input, there are two types of embeddings: word-level embeddings and sentence-level embeddings. We are going to see the difference between the two but, as you can guess… we are going to use sentence-level embeddings. You are probably thinking, “Why!? Why don’t you just use word-level embeddings, which seem perfect for this case?”
Well, we will get to that later. For now, let’s look at the main characteristics of each one without going into too much detail. I was not an expert on this before, so I dug in a bit, with some help from ChatGPT and other LLMs.
Word-level Embeddings
Word-level embeddings represent individual words as vectors in a vector space. These embeddings are based on the idea that words with similar meanings appear in similar contexts.
Key Characteristics
🅰️ Granularity: Each word gets its own vector.
📀 Training Data: Trained on a large text corpus, using each word’s surrounding context as the training signal.
⚙️ How it works: The model learns one static vector per word in the vocabulary.
- Words that appear in similar contexts tend to have similar vectors.
- Relationships like word similarity or analogy can be computed by measuring distances between vectors (e.g., cosine similarity).
Example
| Word  | Embedding (Vector)     |
|-------|------------------------|
| king  | [0.21, -0.45, …, 0.12] |
| queen | [0.19, -0.47, …, 0.14] |
| apple | [0.78, 0.24, …, -0.35] |
| …     | …                      |
similarity(king, queen) ≈ 0.92
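To make those numbers concrete, here is a toy cosine-similarity calculation in TypeScript. The three-dimensional vectors are invented purely for illustration (real word embeddings have hundreds of dimensions, and the values in the table above are illustrative too):

```typescript
// Cosine similarity: the dot product of two vectors divided by the product
// of their magnitudes. Values close to 1 mean the vectors point in almost
// the same direction, i.e. the words are semantically close.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors, made up for this example.
const king = [0.8, 0.6, 0.1];
const queen = [0.75, 0.65, 0.15];
const apple = [0.1, 0.2, 0.9];

console.log(cosineSimilarity(king, queen).toFixed(2)); // close to 1
console.log(cosineSimilarity(king, apple).toFixed(2)); // much lower
```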
Some popular examples are Word2Vec (from Google) and GloVe (from Stanford).
Biggest Weakness
It has a big weakness, though: it treats each word in isolation. The same word can have different meanings in different contexts (e.g., “bank” as a financial institution vs. “bank” as the side of a river), but it always gets the same vector.
Sentence-level Embeddings
Sentence embeddings represent entire sentences (or paragraphs) as vectors. These embeddings aim to capture the meaning of the entire sentence, accounting for the order and relationships between words.
Key Characteristics
🅰️ Granularity: Each sentence gets its own vector.
⚙️ How it works: Instead of representing individual words, the model learns a vector for the whole sentence.
- The sentence embedding captures the context and meaning of the sentence.
- These embeddings are often used in tasks like sentence similarity, paraphrase detection, and document classification.
Biggest Weakness
Requires more computational resources compared to word embeddings. Longer sentences may lose some granularity.
Main differences
In short: word-level embeddings give each word a single, static vector regardless of context, while sentence-level embeddings encode a whole sentence, capturing word order and context at the cost of more compute.
So why the hell did you use Sentence Embeddings!?
The simple answer is simplicity of implementation. Most of the embedding models that are easily available nowadays are sentence-embedding models, like OpenAI’s text-embedding-3-large.
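For reference, getting an embedding from that model with the official openai Node package looks roughly like this. This is my own sketch, not WordMap’s actual code, and it assumes OPENAI_API_KEY is set in the environment:

```typescript
import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-large",
    input: text,
  });
  // One embedding is returned per input; we only send a single string.
  return response.data[0].embedding;
}
```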
I looked into using Word2Vec instead, but I didn’t find an easy solution that didn’t involve loading a pre-trained model that is quite big in size. In Python there are easier ways to do it, since Google already provides a pre-trained model built on Google News data. And again, a Word2Vec model needs to be trained on tons of data to be precise.
Again, it is not that the results are totally wrong when using sentence embeddings; it just comes with some limitations, which we are going to see next.
What limitations did I find?
First of all, accuracy, of course. The model has not been trained to embed single words, but rather sentences in a given context.
To improve accuracy a bit and give the model more context, I decided to embed the word together with a dictionary definition of it. Again, this brings back a limitation I explained before, which is that a word can have several meanings (a point to improve), but for now I grab the first meaning that appears in the dictionary.
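In code, that means the string that actually gets embedded is the word plus its first definition. Here is a sketch of the idea, where lookupFirstDefinition is a hypothetical helper standing in for whatever dictionary source is used (its name and shape are my assumption, not the real implementation):

```typescript
// Hypothetical helper: returns the first dictionary definition of a word.
// How the dictionary is queried (an external API, a local dataset, ...) is
// an implementation detail not shown here.
declare function lookupFirstDefinition(word: string): Promise<string>;

// Pair the word with its first definition so the sentence-embedding model
// gets some context, e.g. "bank: a financial institution that accepts deposits".
async function buildEmbeddingInput(word: string): Promise<string> {
  const definition = await lookupFirstDefinition(word);
  return `${word}: ${definition}`;
}
```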
The other main limitation is also related to accuracy. Most guesses that are semantically close do not surpass a cosine similarity of 0.45, which would look very low in a normal scoring scenario. I didn’t want to show users a score of (following the previous example) 45 out of 100; that would make it seem like they are pretty far from the answer. Instead, I normalized the score returned by the API so that it feels more realistic, which meant fine-tuning the multiplication factor a bit.
Again, I know this is not ideal and makes an already less-than-accurate similarity score even fuzzier, but it is something that can be improved in the future.
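As a rough illustration of that normalization (the 2.0 scaling factor below is a placeholder, not the value WordMap actually uses, which was tuned by hand):

```typescript
// Raw cosine similarities for good guesses rarely exceed ~0.45 with this
// setup, so stretch them before mapping to 0–100.
const SCALE_FACTOR = 2.0; // placeholder value, not the tuned one

function normalizeScore(similarity: number): number {
  const stretched = similarity * SCALE_FACTOR;
  // Clamp to [0, 1] so strong guesses can reach 100 without overshooting.
  return Math.round(Math.min(1, Math.max(0, stretched)) * 100);
}

console.log(normalizeScore(0.45)); // 90 — now it actually feels "close"
console.log(normalizeScore(0.2));  // 40
```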
The final result 🎉
You can find the final result at https://wordmap.vercel.app/
📣 Big shoutout to v0, which I totally recommend for building UIs from scratch and speeding you up 10x.
Hope you like it!! 🙌