Hypothetical Document Embeddings

https://www.jiang.jp/posts/20230430_HyDE/

The paper “Precise Zero-Shot Dense Retrieval without Relevance Labels” tackles dense document retrieval when no relevance labels are available. It introduces a new approach, Hypothetical Document Embeddings (HyDE), that achieves effective fully zero-shot retrieval. Below is a section-by-section summary:

  1. Introduction
  • Dense retrieval has been successful in tasks such as web search, question answering, and fact verification. However, building an effective retrieval system with no relevance labels at all, in a fully zero-shot setting, has been considered difficult. This paper proposes HyDE, which retrieves zero-shot by generating hypothetical documents: given a query, an instruction-following language model writes a document that captures relevance patterns even if it is not factually accurate, and an encoder trained with contrastive learning maps it into a dense vector space where the actual corpus is searched. This allows HyDE to outperform conventional zero-shot retrieval methods across a range of tasks and languages.
  2. Related Work
  • Dense retrieval techniques have advanced alongside large-scale transformer language models since 2019. Existing methods improve search accuracy mainly through negative sampling over labeled data or through distillation, so zero-shot retrieval remains challenging. Recent studies have shown that instruction-following generative models (e.g., GPT-3) can produce sensible responses to a wide variety of prompts zero-shot. HyDE leverages such models to generate hypothetical relevant documents and compares document similarities with an encoder trained by contrastive learning, improving retrieval performance in a zero-shot environment.
  3. Methodology

3.1 Background Knowledge

  • In zero-shot retrieval there is no reference data, so the mapping that captures the relevance between queries and documents in a shared vector space cannot be learned directly; this is what makes the fully zero-shot setting so hard.
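For reference, the standard dense retrieval setup behind this (notation mine, not the paper's): two encoders map queries and documents into one vector space and score relevance by an inner product, and it is exactly this mapping that has nothing to train on without labeled query-document pairs.

```latex
\mathrm{sim}(q, d) \;=\; \bigl\langle\, \mathrm{enc}_q(q),\ \mathrm{enc}_d(d) \,\bigr\rangle
```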

3.2 HyDE Design

  • HyDE uses a document encoder pre-trained with contrastive learning to vectorize generated hypothetical documents. Given a question, an instruction-following generative model (such as InstructGPT) is prompted to write a hypothetical document; the generated document does not need to be factually accurate, it only needs to capture relevance patterns. Its embedding is then compared against the embeddings of the actual documents in the corpus to retrieve the most relevant ones (a minimal sketch of this pipeline follows below).
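To make the design concrete, here is a minimal sketch of the pipeline in Python. This is not the authors' code: the prompt wording and the `llm_generate` stub are assumptions, while `facebook/contriever` is the publicly released encoder checkpoint, pooled by the usual mean over token embeddings.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Unsupervised, contrastively trained encoder used in the paper.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool token embeddings (the standard pooling for Contriever)."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state        # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def llm_generate(prompt: str, n: int) -> list[str]:
    """Hypothetical stub: sample n answer passages from any
    instruction-following LLM (InstructGPT in the paper)."""
    raise NotImplementedError("plug in an LLM API of your choice")

def hyde_query_vector(question: str, n_docs: int = 4) -> torch.Tensor:
    prompt = f"Write a passage that answers the question: {question}"
    hypo_docs = llm_generate(prompt, n_docs)
    # As in the paper, the query itself is averaged together with the
    # generated hypothetical documents.
    return embed(hypo_docs + [question]).mean(dim=0)

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    scores = embed(corpus) @ hyde_query_vector(question)  # inner-product similarity
    return [corpus[i] for i in scores.topk(min(k, len(corpus))).indices]
```

Swapping the stub for a real LLM call and `corpus` for a pre-computed document index yields a workable zero-shot retriever.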
  4. Experiments

4.1 Experimental Setup

  • HyDE is implemented on top of InstructGPT and Contriever (or mContriever, its multilingual variant). While Contriever itself is English-only, mContriever lets HyDE handle multilingual retrieval tasks. Performance was evaluated on retrieval tasks such as TREC DL19/20 and the BEIR datasets.

4.2 Performance in Web Search

  • HyDE significantly outperformed the unsupervised Contriever baseline, especially on TREC DL19/20, and achieved substantial accuracy improvements over BM25. Its performance is competitive with models trained on relevance data, and in some cases exceeds them.

4.3 Performance in Low-Resource Tasks

On the BEIR datasets, HyDE shows significant improvements over the Contriever baseline even on low-resource tasks. It is also competitive with established baselines such as BM25, DPR, and ANCE, surpassing them on several tasks.

4.4 Performance in Multilingual Environments

HyDE outperformed the mContriever baseline on multilingual tasks covering languages such as Swahili, Korean, Japanese, and Bengali. However, a small gap remains compared with the fine-tuned multilingual Contriever model trained with supervision.
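If one wanted to reproduce the multilingual setting with the sketch from Section 3.2, only the encoder checkpoint changes; `facebook/mcontriever` is the multilingual Contriever released on Hugging Face (mapping the paper's setup to this public checkpoint is my assumption).

```python
from transformers import AutoModel, AutoTokenizer

# Multilingual variant: the rest of the HyDE pipeline is unchanged.
tokenizer = AutoTokenizer.from_pretrained("facebook/mcontriever")
encoder = AutoModel.from_pretrained("facebook/mcontriever")
```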

5. Analysis

5.1 Impact of Different Generation Models

HyDE has been evaluated with generation models other than InstructGPT, such as Cohere's models and FLAN-T5. The results suggest that differences in model size and training method (for example, instruction tuning) affect retrieval performance.
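For illustration, swapping in an open generator only requires replacing the `llm_generate` stub from the sketch in Section 3.2; here is a hedged example using FLAN-T5 through the transformers pipeline (the checkpoint choice and sampling settings are assumptions, not the paper's exact configuration).

```python
from transformers import pipeline

# One public instruction-tuned checkpoint; the paper evaluated several sizes.
generator = pipeline("text2text-generation", model="google/flan-t5-xl")

def llm_generate(prompt: str, n: int) -> list[str]:
    # Sample n diverse hypothetical documents for the same prompt.
    return [
        generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.9)[0]["generated_text"]
        for _ in range(n)
    ]
```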

5.2 Combining HyDE with Fine-Tuned Encoders

HyDE has been combined with fine-tuned encoders such as Contriever-FT. With a powerful, well-instructed generation model, HyDE further improves performance even on top of the fine-tuned encoder, whereas small-scale generation models can degrade it.
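Reproducing this combination with the earlier sketch would again be a one-line encoder swap; `facebook/contriever-msmarco` is the MS MARCO fine-tuned Contriever on Hugging Face, which I am assuming corresponds to the paper's Contriever-FT.

```python
from transformers import AutoModel, AutoTokenizer

# Supervised (fine-tuned) encoder; pair with a strong generator per Section 5.2.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")
```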

6. Conclusion

HyDE provides a robust approach to zero-shot retrieval: by pairing a generation model with an encoder trained via contrastive learning, it outperforms conventional baselines without requiring any relevance data. The authors see potential for applying the approach to more complex tasks, multi-stage retrieval, and conversational search in future work. HyDE is expected to serve as a practical tool in the early stages of building a new search system and in low-resource environments.