The detailed summary of each section of the paper “Generate then Retrieve: Conversational Response Retrieval Using LLMs as Answer and Query Generators” is as follows:

  1. Introduction: In Conversational Information Seeking (CIS), answering a complex user question often requires reasoning over multiple passages. The paper proposes the “Generate-Retrieve” (GR) approach to address the limitations of traditional approaches that generate a single query for such questions. In this approach, a Large Language Model (LLM) first generates an initial response to the user’s information request, and that response is then used to retrieve relevant passages. Experimental results show that this method improves passage retrieval accuracy for complex queries. The results also suggest that breaking the LLM response down into multiple queries and searching with each is more effective than single-query rewriting.

  2. Related Work: CIS is a significant topic in Information Retrieval (IR) and Natural Language Processing (NLP). Various approaches have been explored, particularly focusing on query rewriting and modeling dialog contexts to accurately understand user intentions and search for appropriate passages. While existing methods mainly generate and search with a single query, this study proposes a new approach utilizing LLM knowledge and reasoning abilities to generate responses based on dialog context. It converts these responses into multiple searchable queries for retrieval.

  3. Methodology: Three models based on the GR approach are proposed in this study:

  • AQ Model (Answer as Query): Uses the initial LLM response to the user’s utterance as a single long query for searching.
  • MQ Model (Multiple Query generation): Generates multiple queries directly based on the user’s utterance and dialog context, conducts individual searches, and synthesizes the results to generate an answer.
  • MQA Model (Multiple Query generation from Answer): Like the AQ Model, it first generates an initial response, then derives multiple queries from that response for searching. The results retrieved for these queries are combined and re-ranked against the initial LLM response.

  Experiments are conducted on multiple datasets for each model, and the results are compared.
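
The three variants above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `llm`, `retrieve`, and `score` are hypothetical callables standing in for an LLM, a first-stage retriever, and a re-ranker; the prompts are placeholders; and the round-robin `interleave` merge is an assumed way of combining per-query result lists, which the summary does not specify.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions for illustration only).
LLM = Callable[[str], str]               # prompt -> generated text
Retriever = Callable[[str], List[str]]   # query -> ranked passage ids
Scorer = Callable[[str, str], float]     # (reference text, passage id) -> relevance


def interleave(rankings: List[List[str]]) -> List[str]:
    """Round-robin merge of several ranked lists, dropping duplicates."""
    merged: List[str] = []
    seen = set()
    for depth in range(max((len(r) for r in rankings), default=0)):
        for ranking in rankings:
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                merged.append(ranking[depth])
    return merged


def split_queries(text: str, phi: int) -> List[str]:
    """Read one query per non-empty line of LLM output, keeping at most phi."""
    queries = [line.strip() for line in text.splitlines() if line.strip()]
    return queries[:phi]


def aq(context: str, llm: LLM, retrieve: Retriever) -> List[str]:
    """AQ: the LLM's answer is used as a single long query."""
    answer = llm(f"Answer the last user utterance.\n{context}")
    return retrieve(answer)


def mq(context: str, llm: LLM, retrieve: Retriever, phi: int) -> List[str]:
    """MQ: queries are generated directly from the dialog context."""
    raw = llm(f"Write up to {phi} search queries, one per line.\n{context}")
    return interleave([retrieve(q) for q in split_queries(raw, phi)])


def mqa(context: str, llm: LLM, retrieve: Retriever,
        score: Scorer, phi: int) -> List[str]:
    """MQA: answer first, then queries from the answer, then re-rank."""
    answer = llm(f"Answer the last user utterance.\n{context}")
    raw = llm(f"Write up to {phi} search queries, one per line, for this answer.\n{answer}")
    pool = interleave([retrieve(q) for q in split_queries(raw, phi)])
    # Re-rank the combined pool against the initial LLM answer.
    return sorted(pool, key=lambda pid: score(answer, pid), reverse=True)
```

Passing the models as plain callables keeps the sketch independent of any particular LLM or search library; in practice `retrieve` might wrap a BM25 index and `score` a cross-encoder, as in the experimental setup described in the summary.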

  4. Experimental Setup: The study uses the TREC CAsT 2020, TREC CAsT 2022, and TREC iKAT 2023 datasets and adopts a two-stage retrieval pipeline: first-stage retrieval with BM25, followed by re-ranking with a cross-encoder. Retrieval performance is assessed with nDCG, Recall, and MRR, and GPT-4 and LLaMA models serve as the LLMs. The experiments also explore the impact of varying the number of generated queries (ϕ).
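
For reference, the three metrics named above can be computed per query as in the following sketch. It assumes graded relevance judgments given as a `qrels` dictionary (passage id to grade, a hypothetical format), a ranking without duplicate passage ids, and the linear-gain form of nDCG; the summary does not state cutoffs or which nDCG variant the paper uses, so evaluation tools may differ in those details.

```python
import math
from typing import Dict, List


def reciprocal_rank(ranking: List[str], qrels: Dict[str, int]) -> float:
    """1 / rank of the first relevant passage, or 0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, pid in enumerate(ranking, start=1):
        if qrels.get(pid, 0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of all relevant passages that appear in the top k."""
    relevant = {pid for pid, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


def dcg(grades: List[int]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ordering."""
    gains = [qrels.get(pid, 0) for pid in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```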

  5. Results and Discussion: Experimental results demonstrate that the proposed GR approach improves retrieval accuracy compared to traditional Query Rewriting (QR) approaches. In particular, the MQ Model, which uses multiple queries, achieves higher retrieval accuracy, confirming that multiple queries gather information more effectively than a single query. The results also suggest that tuning the number of queries could improve accuracy further. The iKAT dataset highlights the need for multiple queries because of the complexity of its user utterances.

  6. Conclusion: The study proposes three retrieval models based on the GR approach, which understand the dialogue context and generate responses by leveraging the internal knowledge of LLMs. The proposed method is shown to handle complex queries more effectively than traditional methods, with generating multiple queries improving retrieval performance. Optimizing the number of queries and improving the quality of response generation remain tasks for future research.

  • Limitations: The proposed method relies on the quality of the responses generated by the LLM. If the LLM produces inaccurate responses, retrieval accuracy may decrease as a result. The biases inherent in the model must also be taken into account, which calls for caution when applying the method to real search systems.

  • Ethical Considerations: The study points out the ethical implications of biases in LLM-generated data. When applying the system to real-world scenarios, it is therefore essential to carefully evaluate the impact of these biases and consider how they may affect the final output.