Although this might sound like just another way to make money on LLM API calls, the good folks at @AnthropicAI just introduced Contextual Retrieval, and it's a significant yet logical step up from simple Retrieval-Augmented Generation (RAG)!
Here are the steps to implement Contextual Retrieval based on Anthropic's approach:
1. Preprocess the knowledge base:
- Break down documents into smaller chunks (typically a few hundred tokens each).
- Generate contextual information for each chunk using Claude 3 Haiku with a specific prompt.
- Prepend the generated context (usually 50-100 tokens) to each chunk.
2. Create embeddings and a BM25 index:
- Use an embedding model (Gemini or Voyage recommended) to convert contextualized chunks into vector embeddings.
- Create a BM25 index using the contextualized chunks.
3. Set up the retrieval process:
- Implement a system to search both the vector embeddings and the BM25 index.
- Use rank fusion techniques to combine and deduplicate results from both searches.
4. Implement reranking (optional but recommended):
- Retrieve the top 150 potentially relevant chunks initially.
- Use a reranking model (e.g., Cohere reranker) to score these chunks based on relevance to the query.
- Select the top 20 chunks after reranking.
5. Integrate with the generative model:
- Add the top 20 chunks (or top K, based on your specific needs) to the prompt sent to the generative model.
6. Optimize for your use case:
- Experiment with chunk sizes, boundary selection, and overlap.
- Consider creating custom contextualizer prompts for your specific domain.
- Test different numbers of retrieved chunks (5, 10, 20) to find the optimal balance.
7. Leverage prompt caching:
- Use Claude's prompt caching feature to reduce costs when generating contextualized chunks.
- Cache the reference document once and reference it for each chunk, rather than passing it repeatedly.
8. Evaluate and iterate:
- Measure retrieval accuracy on a representative set of queries, and tune chunking, prompts, and K based on the results.
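At their core, steps 2–3 are hybrid search plus rank fusion. Here's a minimal, self-contained Python sketch of that part of the pipeline. To keep it runnable without API keys, `contextualize` is a stub standing in for the real Claude 3 Haiku call, and the bag-of-words `embed` stands in for a real embedding model (Gemini or Voyage); all function names here are illustrative, not Anthropic's actual code:

```python
import math
from collections import Counter

def contextualize(chunk, doc_summary):
    # Stub for the Claude 3 Haiku call that generates 50-100 tokens of
    # situating context per chunk; here we simply prepend a document summary.
    return f"{doc_summary} {chunk}"

def tokenize(text):
    return text.lower().split()

class BM25:
    """Tiny in-memory BM25 index over a list of (contextualized) chunks."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query, idx):
        doc, tf, s = self.docs[idx], Counter(self.docs[idx]), 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            idf = math.log((self.N - self.df[t] + 0.5) / (self.df[t] + 0.5) + 1)
            s += idf * tf[t] * (self.k1 + 1) / (
                tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl))
        return s

def embed(text):
    # Toy bag-of-words "embedding" so the sketch runs offline; swap in a
    # real embedding model in production.
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def reciprocal_rank_fusion(rankings, k=60):
    # Each chunk's fused score is the sum of 1/(k + rank) across rankings;
    # this combines and deduplicates the BM25 and embedding result lists.
    scores = Counter()
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] += 1.0 / (k + rank + 1)
    return [idx for idx, _ in scores.most_common()]

def retrieve(query, chunks, top_k=20):
    bm25 = BM25(chunks)
    bm25_rank = sorted(range(len(chunks)), key=lambda i: -bm25.score(query, i))
    q = embed(query)
    emb_rank = sorted(range(len(chunks)), key=lambda i: -cosine(q, embed(chunks[i])))
    return reciprocal_rank_fusion([bm25_rank, emb_rank])[:top_k]
```

In a full implementation you'd call `retrieve` with a larger top_k (e.g., 150), pass those candidates through a reranker such as Cohere's, and keep the top 20 for the generation prompt, as in steps 4–5 above.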