Hypothetical Document Embeddings (HyDE) Made Easy 🚀🎩✨ Explained in Simple English

Digvijay Bhakuni
3 min read · Dec 21, 2024


Hey there, 👋 Let’s dive into the fascinating world of Hypothetical Document Embeddings (HyDE) and see how they can supercharge your Retrieval-Augmented Generation (RAG) systems. 🚀


Understanding the Basics:

Imagine you’re building a smart assistant that fetches information for users. 🕵️‍♂️ To make it efficient, you combine two powerful techniques:

  1. Retrieval: Pulling in relevant documents from a vast collection. 📚
  2. Generation: Crafting human-like responses based on the retrieved info. 📝

This combo is known as Retrieval-Augmented Generation (RAG). But how can we make the retrieval part even smarter? Enter HyDE! 🎩✨

What is HyDE?

Hypothetical Document Embeddings (HyDE) is like giving your system a crystal ball. 🔮 When a user asks a question, HyDE generates a “hypothetical” document that it imagines would answer the query. This imagined document captures the essence of the question, making it easier to find real documents that match. 🧩

How Does HyDE Work?

  1. Generate a Hypothetical Document: When a query comes in, HyDE uses a language model (like GPT-3.5) to create a detailed response as if it already knows the answer. 🧠💡
  2. Create an Embedding: This generated document is transformed into a vector (a list of numbers) that represents its meaning. Think of it as placing the document in a multi-dimensional space where similar meanings are closer together. 🌌
  3. Retrieve Real Documents: The system then searches for real documents whose embeddings are close to the hypothetical one. This means they’re likely relevant to the user’s query (see the sketch below). 🎯
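
In code, those three steps boil down to something like this minimal sketch. The generate, embed, and vector_search helpers here are hypothetical stand-ins for whatever LLM, encoder, and vector store you actually use:

def hyde_retrieve(query, generate, embed, vector_search, k=3):
    # 1. Generate a hypothetical document that "answers" the query.
    hypothetical_doc = generate(f"Write a passage that answers: {query}")
    # 2. Embed the hypothetical document instead of the raw query.
    doc_vector = embed(hypothetical_doc)
    # 3. Find real documents whose embeddings sit closest to it.
    return vector_search(doc_vector, k=k)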

Why Use HyDE?

  • Zero-Shot Retrieval: HyDE doesn’t need prior training on specific datasets. It can handle new topics on the fly, making it versatile and adaptable. 🦾
  • Enhanced Understanding: By generating a hypothetical document, HyDE grasps the context and nuances of a query better, leading to more accurate results. 🎓
  • Multilingual Support: HyDE works across different languages, broadening its applicability. 🌐

Implementing HyDE:

To get started with HyDE, you’ll need:

  • A language model to generate hypothetical documents. 🖥️
  • An encoder to create embeddings from these documents. 🔢

Frameworks like Haystack and LangChain have integrated HyDE into their pipelines, offering tools to simplify the implementation process. 🛠️

Implementing Hypothetical Document Embeddings (HyDE) in LangChain

Now, let’s get hands-on and see how to implement a simple HyDE Retriever using LangChain, a powerful framework for developing applications with language models.

pip install -q langchain langchain_openai langchain_community langchain_huggingface langchain_chroma wikipedia

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

llm = ChatOpenAI(model="gpt-4", temperature=0.2)
embedding_function = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Loading Documents from Wikipedia
loader = WikipediaLoader(query="Turing test", load_max_docs=5)
documents = loader.load()

# Splitting Documents
chunk_size = 300
chunk_overlap = 100
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
docs = text_splitter.split_documents(documents=documents)

# Create Vector Store using Chroma (index the split chunks, not the raw pages)
vector_store = Chroma.from_documents(docs, embedding_function)
base_retriever = vector_store.as_retriever(search_kwargs={"k": 2})

# Create Chat Prompt for the HyDE Retriever
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert in various domains."),
    ("human", "Provide a detailed explanation on: '{query}'")
])

# Create the HyDE chain and generate a hypothetical document for the query
hyde_chain = prompt | llm
query = "Explain the significance of the Turing Test."
hypothetical_doc_ans = hyde_chain.invoke({"query": query})
hypo_doc = hypothetical_doc_ans.content

# Retrieve real documents similar to the hypothetical one
rel_docs = base_retriever.invoke(hypo_doc)
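
With the relevant chunks in hand, you can close the RAG loop by passing them back to the LLM as grounding context. Here’s a minimal sketch (the prompt wording below is my own, not part of any HyDE API):

# Build a context string from the retrieved chunks and ask the LLM
# to answer the original query grounded in that context.
context = "\n\n".join(doc.page_content for doc in rel_docs)
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {query}")
])
final_answer = (answer_prompt | llm).invoke({"context": context, "query": query})
print(final_answer.content)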

`HypotheticalDocumentEmbedder` Example

from langchain.chains import HypotheticalDocumentEmbedder

hyde_embedding_function = HypotheticalDocumentEmbedder.from_llm(
    llm=llm, base_embeddings=embedding_function, prompt_key="web_search"
)

# Default prompt keys: ['web_search', 'sci_fact', 'arguana', 'trec_covid', 'fiqa', 'dbpedia_entity', 'trec_news', 'mr_tydi']

doc_db = Chroma.from_documents(docs, hyde_embedding_function)
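
With this setup, the stored chunks are embedded with the base encoder, while each search query is first expanded into a hypothetical document by the LLM behind the scenes. A quick sanity check:

# The query below is expanded into a hypothetical document first,
# then matched against the stored chunk embeddings.
results = doc_db.similarity_search("Explain the significance of the Turing Test.", k=2)
for doc in results:
    print(doc.page_content[:200])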

Things to Keep in Mind:

The quality of the generated hypothetical document is crucial. If it doesn’t accurately reflect the query’s intent, the retrieval might not be effective. Ensuring your language model is well-configured and, if possible, fine-tuned for your specific domain can make a big difference. 🎯
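
One way to steer generation toward your domain in LangChain is a custom HyDE prompt: from_llm accepts a custom_prompt in place of a prompt_key (a sketch, assuming a recent LangChain version; the wording below is just an illustration):

from langchain.prompts import PromptTemplate

# A domain-tuned HyDE prompt; it should expose exactly one input variable,
# which receives the user's query at search time.
domain_prompt = PromptTemplate(
    input_variables=["question"],
    template="As a subject-matter expert, write a short passage that answers:\n{question}",
)
domain_hyde = HypotheticalDocumentEmbedder.from_llm(
    llm=llm, base_embeddings=embedding_function, custom_prompt=domain_prompt
)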

In a nutshell, HyDE adds a layer of intelligence to your RAG systems, helping them understand and retrieve information more effectively. It’s like giving your system a superpower to “imagine” the perfect answer and then find it! 🦸‍♂️

Happy coding! 💻✨
