Understanding BM25 Retriever in LangChain (Made Simple!) ✨✅✨

Dec 26, 2024

Hello there! Are you curious about how computers fetch the right information from a sea of text? Let me introduce you to the BM25 Retriever, a super cool tool you can use in LangChain to level up your search game. And don’t worry, I’ll keep it simple and fun! ☕✨✉

What’s BM25, Anyway? 🕵‍♂️✨✅

BM25 stands for Best Matching 25, and it’s one of the most popular algorithms used for information retrieval. Imagine you have a library with tons of books, and you’re searching for specific ones using a few keywords. BM25 is like a super-smart librarian who quickly finds the books most relevant to your query. 🔖☕🕵‍

The algorithm ranks documents by how well they match your keywords, while considering things like:

How often the keyword appears (more is better, though each extra occurrence counts a little less).
How unique the keyword is (rare keywords are weighted more heavily).
How long the document is (it avoids unfairly favoring long documents).
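For the curious, here is how those three ingredients are usually combined. Treat this as one commonly cited form of the BM25 score, not necessarily the exact variant your library implements. The symbols are the usual textbook ones: f(t, D) is how many times term t appears in document D, |D| is the document's length in words, avgdl is the average document length, N is the number of documents, and n(t) is how many documents contain t. The knobs k_1 and b are tuning parameters (values around 1.2 to 2.0 for k_1 and 0.75 for b are typical).

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```

The fraction handles keyword frequency and document length, and the IDF factor handles how rare the keyword is.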

Why Use BM25 in LangChain? ✨⚙✨

LangChain is a framework that helps you build AI apps with ease. It’s great for handling large amounts of text, and pairing it with a BM25 Retriever makes keyword-based lookup within that text fast and simple, with no embedding model required. 🌐✅🔖

Whether you’re building a chatbot, search engine, or knowledge-based app, BM25 helps surface the passages that best match the words in your users’ questions. 🕵‍⚖☕

Let’s Get Practical: An Example 🔄✅🕵

Now, let’s dive into some code to see BM25 in action. Don’t worry if you’re not a coding pro — I’ll guide you through it step by step. ✨☕🌐

Step 1: Install LangChain and Dependencies ⚡✅✨

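First, make sure you’ve got everything set up. You’ll need LangChain, its community and OpenAI integration packages, the wikipedia package for loading sample data, and a library called rank_bm25, which does the actual scoring. 🌐⚡🔖

pip install -q langchain langchain-community langchain-openai langchain-text-splitters wikipedia rank_bm25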

Step 2: Import What You Need 🌐✅⚡

Start by importing the necessary tools. 🔖⚡🌐

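# ChatOpenAI lives in the langchain-openai package (pip install langchain-openai)
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WikipediaLoader
# The text splitters live in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
# RetrievalQA is a legacy chain; on newer LangChain releases it may live in
# the separate langchain-classic package instead
from langchain.chains import RetrievalQA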

Step 3: Load Your Data ⚡✨🌐

Let’s say you want to search through Wikipedia articles about India. Load up to 50 of them, then split them into small overlapping chunks so each search hit is a focused passage: ☕✨✉

loader = WikipediaLoader(query="India", load_max_docs=50)
documents = loader.load()

chunk_size = 300     # maximum characters per chunk
chunk_overlap = 100  # characters shared between neighboring chunks
# Split each article into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
docs = text_splitter.split_documents(documents)
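If chunk_size and chunk_overlap feel abstract, here is a tiny plain-Python sketch of the idea. The `sliding_chunks` helper is made up for illustration and is much cruder than RecursiveCharacterTextSplitter, which tries to break on paragraph and sentence boundaries; it only shows how neighboring chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(700))  # 700 characters of dummy text
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
# The last 100 characters of one chunk are the first 100 of the next
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still shows up whole in at least one chunk, at the cost of storing some text twice.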

Step 4: Create the BM25 Retriever ✨✅⚡

Now, we’ll use the BM25 Retriever to index these documents and make them searchable. 🌐✨🔖

retriever = BM25Retriever.from_documents(docs)
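To get a feel for what a BM25 retriever does with your chunks, here is a from-scratch sketch in plain Python. It is not LangChain's code: `tokenize` and `bm25_top_k` are hypothetical helpers, the lowercasing is my own simplification, and the k1 and b values are just common defaults. It scores every text against the query and returns the top k.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase + whitespace split; real tokenizers are usually smarter.
    return text.lower().split()


def bm25_top_k(query: str, texts: list[str], k: int = 2,
               k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return the k texts with the highest BM25 score for the query."""
    docs = [tokenize(t) for t in texts]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))

    def score(doc: list[str]) -> float:
        tf = Counter(doc)  # term counts within this document
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    ranked = sorted(range(n_docs), key=lambda i: score(docs[i]), reverse=True)
    return [texts[i] for i in ranked[:k]]


texts = [
    "The Prime Minister is the head of government of India",
    "The Ganges is a major river in India",
    "Cricket is the most popular sport in India",
]
print(bm25_top_k("prime minister of India", texts, k=1))
```

Notice that matching is purely word-based: a query only scores against documents that contain the same tokens, so how you tokenize matters a lot.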

Step 5: Query the Retriever 🌐✨☕

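Ask a question, and the retriever pulls out the most relevant chunks! Here we wrap it in RetrievalQA, which hands those chunks to an LLM to write the answer (you’ll need your OpenAI API key set in the OPENAI_API_KEY environment variable). ☕✅✨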

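# temperature=0 keeps the answers as repeatable as possible
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_llm(
    llm, retriever=retriever
)
# invoke() replaces the older, deprecated style of calling the chain directly
qa_chain.invoke({"query": "Name of PM"})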

# OUTPUT
#{'query': 'Name of PM',
#'result': 'The Prime Minister of India is Narendra Modi.'}
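Conceptually, a retrieval QA chain does two things: fetch the top chunks, then paste them into a prompt for the LLM. Here is a plain-Python sketch of that second step. The `build_prompt` helper and the prompt wording are hypothetical, made up for illustration; the template LangChain actually uses is different.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-ins for the chunks a retriever would return
retrieved = [
    "New Delhi is the capital of India.",
    "The Ganges flows through northern India.",
]
prompt = build_prompt("What is the capital of India?", retrieved)
print(prompt)
```

This is why retrieval quality matters so much: the LLM can only answer from the chunks that made it into the prompt.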

How Does It Work Behind the Scenes? 🕵‍⚖✨

BM25 scores each document based on your query. It uses fancy math to: ☕✨🌐

Give higher scores to documents that use your query keywords more often.
Adjust the score for document length by comparing each document to the average length in the collection.
Weigh rare keywords more heavily since they’re likely more meaningful.
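You can watch each of these three effects with a few lines of Python. This is a sketch of the per-keyword part of a common BM25 formulation; the `idf` and `term_score` helpers and the k1 = 1.5, b = 0.75 values are assumptions chosen for illustration, and real libraries may differ in the details.

```python
import math


def idf(n_docs: int, docs_with_term: int) -> float:
    # Rarer terms (small docs_with_term) get a larger weight.
    return math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)


def term_score(tf: int, doc_len: int, avgdl: float, idf_value: float,
               k1: float = 1.5, b: float = 0.75) -> float:
    # Contribution of a single query keyword to one document's score.
    length_norm = 1 - b + b * doc_len / avgdl
    return idf_value * tf * (k1 + 1) / (tf + k1 * length_norm)


common = idf(1000, 500)  # keyword found in half of 1000 documents
rare = idf(1000, 5)      # keyword found in only 5 of them

# 1. More occurrences help, but with diminishing returns.
gains = [term_score(tf, 100, 100, rare) for tf in (1, 2, 10)]
# 2. The same keyword count in a longer document scores lower.
short_doc = term_score(2, 50, 100, rare)
long_doc = term_score(2, 400, 100, rare)
# 3. A rare keyword outweighs a common one.
print(gains, short_doc, long_doc, common, rare)
```

Ten occurrences score higher than one, but nowhere near ten times higher, which is what keeps keyword-stuffed documents from dominating the ranking.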

Where Can You Use BM25? ⚡✨⚙

BM25 is perfect for: ✨☕⚖

Chatbots: Retrieve the most relevant knowledge base articles.
Search Engines: Help users find the content they’re looking for.
Summarization: Find key passages to summarize text more effectively.

Wrapping Up ✨⚖☕

BM25 is a fantastic way to add solid keyword retrieval to your LangChain projects. It’s fast, effective, and easy to use. With just a few lines of code, you can build apps that surface relevant passages for your users’ queries. ✨⚡✅

So, go ahead and give it a try! And if you have any questions or need help, don’t hesitate to ask. Happy coding, my friend! ☕✨✉