From Search to Reasoning:
A Five-Level RAG Capability Framework for Enterprise Data
Abstract
Retrieval-Augmented Generation (RAG) has emerged as the standard paradigm for answering questions on enterprise data. Traditionally, RAG has centered on text-based semantic search and re-ranking. However, this approach falls short for questions that go beyond summarization or that involve non-text data. This has led to various attempts to supplement RAG in order to bridge the gap between RAG, the implementation paradigm, and the question-answering problem that enterprise users expect it to solve.
Given that contemporary RAG is a collection of techniques rather than a defined implementation, discussion of RAG and related question-answering systems benefits from a problem-oriented understanding.
We propose a new classification framework (L1–L5) to categorize systems based on data modalities and task complexity of the underlying question answering problems: L1 (Surface Knowledge of Unstructured Data) through L4 (Reflective and Reasoned Knowledge) and the aspirational L5 (General Intelligence). We also introduce benchmarks aligned with these levels and evaluate four state-of-the-art platforms: LangChain, Azure AI Search, OpenAI, and Corvic AI. Our experiments highlight the value of multi-space retrieval and dynamic orchestration for enabling L1–L4 capabilities. We empirically validate our findings using diverse datasets indicative of enterprise use cases.
1 Introduction
Large Language Models (LLMs) have introduced a new paradigm in data processing and utilization (Brown et al., 2020). In enterprise environments, where private data represents a significant untapped asset, LLMs promise to transform vast stores of unexploited data into actionable insights. The potential applications are wide-ranging, from enhancing automated customer interactions to facilitating sophisticated decision-making processes. By leveraging a deeper understanding of data through LLMs, enterprises can augment their workforce with AI automation, thereby boosting productivity and operational efficiency.
However, the use of LLMs in enterprise scenarios has challenges. One significant hurdle is the tendency of LLMs to generate “hallucinations”, that is, fabricated content not present in the input data (Huang et al., 2025). Hallucination, coupled with data incompleteness, can lead to unreliable outputs that jeopardize business decisions with potentially catastrophic consequences. Because enterprises rely heavily on data integrity, misinformation generated by LLMs presents a serious risk.
To address these challenges, recent focus has shifted towards mitigating the hallucination phenomenon. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) has emerged as a promising approach, enhancing LLMs by integrating them with contextually relevant, enterprise-specific data (Izacard and Grave, 2021; Karpukhin et al., 2020). This integration aims to ground an LLM’s responses in known facts and reduce the probability of fabrication (Shuster et al., 2021). Yet, despite its potential, RAG still grapples with the same challenges as using LLMs without RAG (Gao et al., 2024), such as maintaining response accuracy and relevance. This paper studies these issues, exploring strategies to harness the full potential of LLMs in enterprise settings while ensuring the accuracy and reliability that business applications require.


2 Background
Retrieval-Augmented Generation (RAG) (Fig. 1) has emerged as a standard way to address the challenges of hallucination and incomplete responses of Large Language Models (LLMs) in data-sensitive environments. It adds a retrieval mechanism that fetches contextually relevant information from private datasets. The original query is then augmented with this additional context before being passed to an LLM for question answering (Ram et al., 2023). This method not only curtails the generation of hallucinated content but also enhances the completeness and factual accuracy of the responses with respect to the private data (Lewis et al., 2020). Despite these advancements, RAG is not a panacea. For example, Magesh et al. (2024) report hallucination and incompleteness rates between 10% and 60% across systems on legal questions. Similar findings have been reported in other problem domains (Amugongo et al., 2025).
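The retrieve-then-augment step described above can be sketched in a few lines. This is a minimal illustration, not any platform's implementation: the bag-of-words `embed` function stands in for a real embedding model, the three chunks are invented toy text, and the prompt template is an assumption.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def augment(query: str, context: list[str]) -> str:
    """Build the prompt that would be passed to the LLM (hypothetical template)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"

# Invented toy chunks, for illustration only.
chunks = [
    "DEF stands for Diesel Exhaust Fluid.",
    "Check tire pressure monthly.",
    "Fuel filler door: driver side.",
]
question = "What is the DEF?"
top = retrieve(question, chunks, k=1)
prompt = augment(question, top)
```

The generation step is omitted on purpose: whatever model consumes `prompt`, it can only be grounded in what `retrieve` returned.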
One limitation of RAG is that retrieval is through an embedding space that only captures language understanding and semantic similarity (Xiong et al., 2021). While useful for certain kinds of problems, this single-faceted approach falls short when dealing with complex queries that require nuanced understanding and multi-faceted contextual grounding. In an attempt to address this issue, most research on RAG systems has focused on re-ranking algorithms (Nogueira and Cho, 2020; Nogueira et al., 2019; Zhuang and Zuccon, 2021), which refine the selection of retrieved documents based on their perceived relevance to the input query (Guu et al., 2020; Karpukhin et al., 2020).
However, the efficacy of re-ranking is inherently limited by the quality of the initial retrieval phase. If the retrieval component fails to bring forward the relevant context, even the most sophisticated re-ranker cannot compensate for the initial deficiency. This decoupling of retrieval and re-ranking phases often leads to an accuracy bottleneck (Yu et al., 2023), which highlights the need for a different approach. Enhancing the retrieval mechanisms to consider multiple dimensions of semantic relevance and context variability (Khattab and Zaharia, 2020; Santhanam et al., 2022) paves the way for more robust and reliable outputs from LLMs, particularly in complex and dynamic information environments. Ultimately, an LLM output is only as good as its retrieved inputs (Shi et al., 2023). If pertinent information cannot be retrieved, no amount of LLM sophistication can compensate.
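The recall ceiling that first-stage retrieval imposes on re-ranking can be shown with a toy example. The corpus, the term-overlap retriever, and the oracle relevance scores below are all hypothetical; the point is only that a re-ranker reorders candidates and cannot add a document the retriever never returned.

```python
def first_stage(query_terms: set[str], corpus: dict[str, set[str]], k: int) -> list[str]:
    """Cheap retriever: rank document ids by term overlap and keep the top k."""
    ranked = sorted(corpus, key=lambda d: len(query_terms & corpus[d]), reverse=True)
    return ranked[:k]

def rerank(candidates: list[str], relevance: dict[str, float]) -> list[str]:
    """Idealised re-ranker: an oracle that orders candidates by true relevance."""
    return sorted(candidates, key=lambda d: relevance[d], reverse=True)

# Hypothetical corpus: d3 is the relevant document but shares no terms with the query.
corpus = {
    "d1": {"bac", "chart", "weight"},
    "d2": {"limit", "speed", "highway"},
    "d3": {"blood", "alcohol", "concentration", "threshold"},
}
relevance = {"d1": 0.4, "d2": 0.1, "d3": 0.9}
query = {"bac", "limit"}

narrow = rerank(first_stage(query, corpus, k=2), relevance)  # d3 was never retrieved
wide = rerank(first_stage(query, corpus, k=3), relevance)    # d3 is retrieved, then promoted
```

Even with a perfect re-ranker, `narrow` cannot contain `d3`; only a change to the retrieval stage (here, a larger `k`) brings it into play.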
3 Capability Levels: A Five-Level Framework
While recent advancements in Retrieval-Augmented Generation (RAG) have delivered encouraging results, we argue that a more meaningful framing for enterprise use cases is: What kinds of questions can be reliably answered from private data? In enterprise settings, private data is not a free variable; it must be acquired, maintained, and protected. Much of the knowledge required for enterprise decision-making exists only in proprietary sources. Therefore, RAG capabilities should be measured not by end accuracy per se but by the depth and reliability of answers across increasingly complex data landscapes.
To standardize this perspective, we propose a five-level classification system organized around increasing complexity of reasoning and understanding required to support real-world enterprise tasks.
3.1 L1: Surface Knowledge of Unstructured Data
At L1, systems operate on unstructured text and perform basic retrieval using semantic similarity or keyword matching. They can answer factoid or lookup-style questions when the information exists explicitly in the source. However, they are unaware of any structural or contextual relationships among facts and cannot resolve ambiguity or synthesize across documents. These systems are effective for simple use cases such as semantic search, FAQ answering, and document summarization, where the required knowledge is shallow and localized.
3.2 L2: Surface Knowledge of Multifaceted Data
L2 systems can process a wider range of multi-modal and multi-structural data types, including tabular data, knowledge graphs, text, images, and metadata. They incorporate awareness of document structure and modality, enabling better matching for more complex formats. However, their responses remain surface-level: they retrieve based on local matches without deeper inference or understanding. These systems are context-aware, but not reasoning-capable. They are suitable for moderately structured content like contracts, handbooks, or technical manuals, where surface context is helpful but not sufficient for deep understanding.
3.3 L3: Implicative Knowledge of Multifaceted Data
L3 introduces reasoning capabilities. Systems at this level can synthesize information across multiple sources, infer relationships between disparate data points, and respond to queries that involve ambiguity, inconsistency, or incompleteness. They move beyond surface matching to link entities, draw conclusions across documents, and bridge questions with their many possible interpretations. This level is essential for enterprise scenarios involving diagnostic analysis, compliance validation, or decision support where answers depend on diverse, distributed, and contextualized knowledge.
3.4 L4: Reflective and Reasoned Knowledge
At L4, the system becomes self-aware in its reasoning process. It can reflect on the sufficiency and reliability of its own answers, identify missing context, and revise its retrieval or reasoning strategy accordingly. It may dynamically invoke tools (e.g., calculators, database queries, visual parsers) or select from multiple data sources depending on the task. This level introduces true agentic behavior: the ability to plan, adapt, and reason recursively. It supports high-stakes applications such as enterprise copilots, compliance agents, and automated report generation where accuracy, traceability, and adaptive reasoning are non-negotiable.
3.5 L5: General Intelligence
L5 represents an aspirational future state: a system capable of general-purpose reasoning across any domain, modality, or task type. It would possess the capacity to generalize knowledge, solve novel problems, and operate with minimal task-specific tuning or prompting. While it remains beyond current system capabilities, it serves as a direction for long-term research and system design.
Level | Capability | Data Types Handled | Knowledge Capability Description | Example Tasks |
---|---|---|---|---|
L1 | Surface Knowledge of Unstructured Data | Primarily text-based unstructured sources | Retrieves explicit facts from text using keyword or semantic similarity; no structural or multi-source reasoning. | FAQ answering, document search, basic summarization |
L2 | Surface Knowledge of Multifaceted Data | Unstructured + semi-structured (tables, metadata, images) | Retrieves from diverse formats with structural context awareness, but answers remain surface-level without inference or synthesis. | Table lookup, contract clause search, metadata-based filtering |
L3 | Implicative Knowledge of Multifaceted Data | Unstructured + semi-structured + structured (databases, graphs) | Synthesizes across sources and modalities, infers relationships, and handles ambiguity, incompleteness, and inconsistencies. | Root cause analysis, compliance checks, product comparisons |
L4 | Reflective and Reasoned Knowledge | Multi-modal (text, tables, APIs, images, code) | Self-aware reasoning: adapts retrieval strategies, orchestrates tools, and verifies or revises answers for high-stakes tasks. | Regulatory report generation, enterprise copilots, design validation |
L5 | General Intelligence | Any domain, modality, or data structure | Aspirational: domain-agnostic, autonomous reasoning and general-purpose problem-solving across all modalities. | Open-ended research, autonomous decision-making, novel problem solving |
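As a rough illustration of how the table can be used to triage a question, the sketch below maps three yes/no traits of a question to the lowest level that covers it. The three traits and the names `QuestionProfile` and `required_level` are simplifying assumptions introduced here, not part of the framework itself; L5 is omitted because it is aspirational.

```python
from dataclasses import dataclass
from enum import IntEnum

class Level(IntEnum):
    L1 = 1  # surface knowledge of unstructured data
    L2 = 2  # surface knowledge of multifaceted data
    L3 = 3  # implicative knowledge of multifaceted data
    L4 = 4  # reflective and reasoned knowledge

@dataclass(frozen=True)
class QuestionProfile:
    """Hypothetical summary of what a question demands from the system."""
    needs_non_text_data: bool           # tables, images, metadata, graphs
    needs_cross_source_inference: bool  # synthesis, ambiguity, incompleteness
    needs_reflection_or_tools: bool     # re-planning, tool calls, self-checks

def required_level(profile: QuestionProfile) -> Level:
    """Return the lowest level whose description covers the profile."""
    if profile.needs_reflection_or_tools:
        return Level.L4
    if profile.needs_cross_source_inference:
        return Level.L3
    if profile.needs_non_text_data:
        return Level.L2
    return Level.L1

faq_lookup = required_level(QuestionProfile(False, False, False))
table_lookup = required_level(QuestionProfile(True, False, False))
root_cause = required_level(QuestionProfile(True, True, False))
```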
4 Methodology: Enabling L4 Systems
Achieving L4 (reflective and reasoned knowledge) requires a fundamental departure from traditional dense-vector search and static chunk-based retrieval. While L1 through L3 suffice for simple lookup, context-aware retrieval, and even cross-source synthesis, enterprise environments often demand capabilities that are dynamic, multi-modal, and self-correcting. L4 systems must not only retrieve relevant information but also evaluate, adapt, and orchestrate their reasoning to ensure accuracy and completeness in mission-critical settings.
To meet these demands, an L4 system must:
- Accurately represent and unify complex structured, semi-structured, and unstructured data
- Retrieve context across multiple semantic, structural, and metadata dimensions
- Dynamically adapt retrieval and reasoning strategies to the query’s intent
- Seamlessly orchestrate multiple tools, models, and knowledge views
- Self-reflect, re-plan, and re-execute retrieval steps to “connect the dots”
We introduce Corvic AI, a platform designed from the ground up to operate at L4. Corvic AI integrates three pillars, (1) Structure-Aware Data Representation, (2) Mixture of Spaces, and (3) Adaptive Chain of Actions, transforming static pipelines into highly accurate agentic reasoning systems.
4.1 Structure-Aware Data Representation
Enterprise knowledge ecosystems rarely exist in a single format. They blend unstructured content (e.g., manuals, reports, narrative text) with structured elements (e.g., tables, forms, relational databases, graphs). L4 performance depends on a unified representation that captures the multi-faceted nature of the data.
The Structure-Aware Data Representation approach parses each document into an enriched intermediate form encoding:
- Document hierarchies, e.g., section, subsection, heading, paragraph hierarchies
- Embedded tables, lists, and field-level forms
- Local and global metadata, e.g., document type, version, author
- Cross-references, e.g., “See Table 1”, “as shown above”
This structure-aware representation bridges the gap between unstructured and structured modalities, enabling precise section or field-specific retrieval, cross-format linking, and context mapping across disparate data types. It is modular and extensible, and it forms the foundation for the techniques that follow (as shown in Fig. 2).
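One way to picture such an intermediate form is a tree of typed nodes that carry hierarchy, metadata, and cross-reference labels. The sketch below is an assumption-laden illustration: the `Node` fields, the `walk` and `resolve` helpers, and the sample document are all hypothetical and do not describe Corvic AI's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional

@dataclass
class Node:
    kind: str                    # e.g. "section", "paragraph", "table"
    text: str = ""               # heading text, body text, or caption
    label: Optional[str] = None  # referable label such as "Table 1"
    metadata: dict = field(default_factory=dict)  # e.g. document type, version
    refs: list = field(default_factory=list)      # labels this node cites
    children: list = field(default_factory=list)  # nested Node objects

def walk(node: Node, path: tuple = ()) -> Iterator[tuple]:
    """Yield (heading_path, node) pairs in document order."""
    here = path + (node.text,) if node.kind == "section" else path
    yield here, node
    for child in node.children:
        yield from walk(child, here)

def resolve(root: Node, label: str) -> Optional[Node]:
    """Return the first node carrying the given label, if any."""
    return next((n for _, n in walk(root) if n.label == label), None)

# Invented sample document, for illustration only.
doc = Node("section", "Emissions", metadata={"doc_type": "manual"}, children=[
    Node("paragraph", "Refill intervals are listed in Table 1.", refs=["Table 1"]),
    Node("table", "Refill intervals", label="Table 1"),
])
para_path, para = list(walk(doc))[1]
target = resolve(doc, para.refs[0])
```

Here the paragraph keeps its heading path, and its reference to "Table 1" resolves to the table node, which is the kind of cross-format link the text describes.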
4.2 Mixture of Spaces
Traditional RAG systems (as shown in Fig. 1) compress all content into a single semantic vector space, losing critical structural and contextual cues. The Mixture of Spaces (MoS) approach builds multiple parallel representations of the same document:
- A semantic space for meaning-rich embeddings
- A structural space modeling layout, hierarchy, and relationships
- A metadata space encoding titles, tags, and annotations
This multi-view indexing allows the system to retrieve relevant context through different pathways. For example, if a semantic search misses a passage, a structural or metadata search may still retrieve it based on its heading or relational position. This redundancy increases both recall and precision, which is key for high-stakes decision-making.
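The recovery path described above can be sketched with a toy multi-space index. Term overlap stands in for real embedding or structural similarity, and the index contents and function names are hypothetical; this is an illustration of the idea rather than the Mixture of Spaces implementation.

```python
def search_space(index: dict[str, set[str]], query_terms: set[str]) -> list[str]:
    """Rank chunk ids in one space by term overlap, dropping chunks with no overlap."""
    scored = [(len(query_terms & terms), cid) for cid, terms in index.items()]
    return [cid for score, cid in sorted(scored, key=lambda s: -s[0]) if score > 0]

def mixture_search(spaces: dict[str, dict[str, set[str]]], query_terms: set[str]) -> dict[str, list[str]]:
    """Union the hits from every space and record which spaces found each chunk."""
    found: dict[str, list[str]] = {}
    for name, index in spaces.items():
        for cid in search_space(index, query_terms):
            found.setdefault(cid, []).append(name)
    return found

# Hypothetical views of the same two chunks.
spaces = {
    "semantic":   {"c1": {"exhaust", "fluid"},   "c2": {"refill", "interval"}},
    "structural": {"c1": {"emissions"},          "c2": {"maintenance", "schedule"}},
    "metadata":   {"c1": {"manual", "2023"},     "c2": {"manual", "2023"}},
}
found = mixture_search(spaces, {"maintenance", "fluid"})
```

In this example the semantic space only finds `c1`, while `c2` is recovered through its heading terms in the structural space.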
4.3 Adaptive Chain of Actions
Most RAG pipelines follow a rigid retrieve, augment, and generate sequence. In contrast, the Adaptive Chain of Actions (ACoA) framework dynamically assembles query-specific retrieval and reasoning plans.
Adaptive Chain of Actions can:
- Select the optimal retrieval spaces based on query intent
- Sequence multiple retrieval and enrichment steps (e.g., schema lookup → graph traversal → synthesis)
- Integrate specialized tools for computation, visualization, or external data access
By adapting its strategy on the fly, ACoA supports reflective reasoning: the ability to detect gaps, revise plans based on discovered knowledge, and re-execute and refine steps until the answer meets completeness and reliability criteria.
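The detect-gap, re-plan, re-execute loop can be sketched as follows. This is a minimal illustration under stated assumptions: the retrievers are stub functions returning invented evidence strings, the sufficiency check is a simple predicate, and the fallback rule (try any space or tool not yet used) is a placeholder for a real planner.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]

def adaptive_answer(
    query: str,
    retrievers: dict[str, Retriever],
    initial_plan: list[str],
    is_sufficient: Callable[[list[str]], bool],
    max_steps: int = 4,
) -> tuple[list[str], list[str]]:
    """Run retrieval steps until the evidence is judged sufficient or steps run out."""
    evidence: list[str] = []
    trace: list[str] = []
    pending = list(initial_plan)
    while pending and len(trace) < max_steps:
        step = pending.pop(0)
        trace.append(step)
        for item in retrievers[step](query):
            if item not in evidence:
                evidence.append(item)
        if is_sufficient(evidence):
            break
        if not pending:
            # Re-plan: fall back to any space or tool that has not been tried yet.
            pending = [name for name in retrievers if name not in trace]
    return evidence, trace

# Stub retrievers with invented outputs, for illustration only.
retrievers = {
    "semantic": lambda q: [],
    "table_lookup": lambda q: ["BAC chart row: 120 lb, 2 drinks"],
    "metadata": lambda q: ["doc: California handbook"],
}
evidence, trace = adaptive_answer(
    "estimated BAC for a 120-pound woman after 2 drinks",
    retrievers,
    initial_plan=["semantic"],
    is_sufficient=lambda ev: any("BAC chart" in e for e in ev),
)
```

The semantic step returns nothing, the sufficiency check fails, and the loop re-plans to the table lookup before stopping, which mirrors the reflective behaviour described above in miniature.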
In combination, Structure-Aware Data Representation, Mixture of Spaces, and Adaptive Chain of Actions enable Corvic AI to operate at L4 (reflective and reasoned knowledge), bridging structured and unstructured sources, adapting to varied enterprise queries, and delivering accurate, explainable, and high-trust outputs (as shown in Fig. 2). The following section evaluates Corvic AI’s performance against other leading RAG platforms.
5 Experimental Results
5.1 Datasets
Conventional benchmarks for Retrieval-Augmented Generation (RAG) typically evaluate performance on preprocessed, clean text chunks. This approach, however, is not representative of real-world enterprise scenarios, where information is predominantly stored in unstructured documents. For a more precise performance assessment on realistic enterprise tasks, our evaluation employs four datasets with source documents in PDF format. Each dataset is paired with a corresponding question corpus, formulated to probe for information embedded within the documents’ complex structure. (All datasets are hosted on HuggingFace for convenient download and use; see Table 2.)
- DelucionQA: The DelucionQA dataset (Sadat et al., 2023; https://github.com/boschresearch/DelucionQA) is a benchmark for question answering that uses the Jeep 2023 Gladiator car manual as its knowledge base. It is designed to evaluate a system’s ability to answer specific, technical questions based on a complex, real-world document.
- FinTabNet: A table-centric dataset of financial reports that combine text and tables (see Table 2).
- DMV Handbooks: The DMV Handbooks dataset is a Corvic AI-created dataset comprising the official DMV handbooks from all 50 US states. This collection of 50 long-form, structurally complex PDFs serves as the source for answering comprehensive sets of official state DMV test questions. It evaluates a system’s ability to perform precise information retrieval across a large and heterogeneous document collection, mirroring a common enterprise requirement.
- Architectural Manuals: The Architectural Manuals dataset is a Corvic AI-created dataset containing real-world complex architectural and engineering documentation. This includes CAD drawings, technical schematics, circuit diagrams, and building project blueprints that combine intricate 2D visual elements with embedded technical text. The dataset is designed to test a system’s ability to jointly process and reason over multi-modal inputs where precise visual–text alignment is critical.
Dataset | Modality | #Q | #Files | #Pages | Enterprise Use Case |
---|---|---|---|---|---|
DelucionQA | Text | 184 | 1 | 384 | Basic retrieval from manuals/product guides; customer support and FAQ automation. |
FinTabNet | Text + Tables | 1,000 | 591 | 591 | Financial analysis from reports; risk assessment, KPI computation, and compliance reporting. |
DMV Handbooks | Text + Images + Tables | 135 | 50 | 4,689 | Regulatory/compliance document search; onboarding, training, and policy lookup. |
Architectural Manuals | Images + Diagrams + Embedded Text | 43 | 8 | 36 | Engineering/design review; reasoning over blueprints, CAD diagrams, and technical schematics. |
Level | Dataset | Example Query | Ground Truth | Complexity |
---|---|---|---|---|
L1 | DelucionQA | What is the DEF? | Diesel Exhaust Fluid | Simple question on simple text search; answer explicitly stated in plain text. |
L2 | FinTabNet | For PPL, which portfolio had the highest percentage of assets allocated to debt securities in 2015? | Growth Portfolio (Rationale: Debt Securities = 13%) | Direct question but requires searching and reading tables from unstructured data accurately to locate the answer. |
L3 | DMV Handbooks | What is the estimated BAC for a 120-pound woman in California after 2 drinks? | 0.11% (Rationale: BAC chart entry with 120 lb, 2 drinks) | Direct question but answer is not in plain text; requires combining table lookup with multimodal (chart) parsing and implicit knowledge generation. |
L4 | Architectural Manuals | Find an enclosure with three knockouts. | Enclosed Series KT7 Motor Controller (Rationale: From Figure 3, the enclosure with three circular knockouts) | Indirect question requiring deeper domain understanding; needs multimodal retrieval and reasoning across diagrams, labels, and text. |
The examples in Table 3 illustrate the progressive complexity of retrieval-augmented generation (RAG) tasks as the capability level increases. At L1, as in the DelucionQA dataset, the question is straightforward and the answer exists verbatim in the text, requiring only basic semantic search over unstructured content. FinTabNet (L2) introduces structured elements (in this case, a table) that demand accurate parsing and retrieval of non-textual formats, but answers are still directly factual. DMV Handbooks (L3) raise the bar by requiring multi-modal comprehension: the system must interpret a chart, map its contents to the query, and apply implicit reasoning to extract the correct result. Finally, Architectural Manuals (L4) represent an indirect, domain-specific query where the answer is not explicitly stated in any single source; instead, it requires understanding technical terminology (“knockouts”), combining information from diagrams and text, and synthesizing a correct and contextually appropriate answer. This progression underscores how higher levels demand richer data representations, more sophisticated retrieval strategies, and reasoning capabilities that extend well beyond surface-level keyword or vector search.
5.2 Evaluation
For our experiments, we selected four representative implementations aligned with the capability levels. LangChain was used to implement an L1 (surface knowledge of unstructured data) system, focused on semantic search over unstructured text. OpenAI RAG and Azure AI Search represent L2 (surface knowledge of multifaceted data) systems, capable of handling unstructured and semi-structured data with surface-level structural awareness but without deep reasoning. Corvic AI is representative of an L4 (reflective and reasoned knowledge) system. It incorporates structure-aware representation, Mixture of Spaces, and Adaptive Chain of Actions for reflective and adaptive reasoning.
To ensure consistent comparison of response quality, we used gpt-4.1 for the final answer generation and summarization step. Depending on the system, a variety of parameters affect retrieval. To obtain representative behavior, we used the default retrieval settings. LangChain and Corvic AI use OpenAI’s text-embedding-3-large embeddings, while OpenAI RAG uses its native embedding solution. LangChain uses ChromaDB as the vector database. For Azure AI Search, we used the default parameters, which use OpenAI’s text-embedding-3-small model for embeddings.
For evaluation, we employed an LLM-as-a-judge approach (Zheng et al., 2023) using the ragas library (Es et al., 2024; https://github.com/explodinggradients/ragas), with gemini-2.0-flash as the evaluation model to score the outputs. Because the judge model is from a different family than the generation model used in the pipelines, this choice reduces the risk of self-preference bias in the assessment.
Note: In this study, our primary goal was to measure the performance advantage that L4 capabilities bring to a knowledge search and management use case primarily involving unstructured data; future work will extend this evaluation to multi-structured datasets along with unstructured ones.
Framework | DelucionQA | FinTabNet | DMV Handbooks | Architectural Manuals |
---|---|---|---|---|
LangChain | 69.83 | 38.23 | 64.63 | 36.63 |
Azure AI Search | 67.11 | 41.75 | 51.30 | 40.12 |
OpenAI RAG | 67.53 | 52.55 | 71.85 | 58.72 |
Corvic AI | 82.74 | 63.83 | 79.44 | 65.12 |




5.3 Key Observations
Across all four datasets, Corvic AI, which is designed to solve L4 (reflective and reasoned knowledge) problems, delivers the strongest accuracy, even when the corpora are primarily unstructured. The combination of Structure-Aware Data Representation, Mixture of Spaces, and Adaptive Chain of Actions consistently recovers evidence that single-space pipelines miss (e.g., passages discoverable via headings or layout structure rather than pure semantics) and then verifies or augments that evidence through iterative retrieval. In practice, this manifests as higher groundedness and fewer omissions on long, layout-heavy PDFs, where relevant content is often contextually implied by section hierarchy or table locality rather than surface phrasing.
Within the same capability tier, performance can vary markedly, e.g., between the two L2 (surface knowledge of multifaceted data) systems (Azure AI Search vs. OpenAI RAG). This divergence highlights the impact of design choices: vector vs. keyword vs. hybrid retrieval (and their weighting), domain vs. general-purpose embedding models, cross-encoder vs. bi-encoder re-rankers, and the summarization LLM’s length control and citation style. Small configuration shifts can change which passages are retrieved, how they are ordered, and what ultimately makes it into the context window. This, in turn, leads to materially different answer quality without changing the nominal “level” of the system.
Dataset complexity acts as an amplifier. On simpler, manual-like material (DelucionQA), the gap between L1 (surface knowledge of unstructured data) and higher-level systems is narrower because many answers are explicitly stated and semantically proximate. As structure and reasoning demands grow, from the DMV dataset (mixed modalities and long-range references) to the more complicated FinTabNet (table-centric, numeric, and multi-step computation) and Architectural Manuals (visual–text fusion with layout noise) datasets, the advantage of Corvic AI over the L2 systems widens on average. Multi-space retrieval and reflective planning help recover table cells, respect schema and column context, and follow cross-figure references, while systems designed for lower levels suffer recall drops or over-summarization errors.
A related pattern is robustness. Corvic AI not only achieves higher accuracy but also exhibits smaller swings across datasets. Reflective re-planning reduces brittleness to corpus idiosyncrasies (e.g., varied headings, inconsistent table markup, OCR artifacts) by retrying with alternate spaces or tools when evidence is sparse. In contrast, systems designed for L1 and L2 show larger sensitivity to modality mix and layout density because they rely more heavily on a single retrieval view and fixed ordering.
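The gap and spread statements above can be checked directly against the reported scores. The snippet below copies the numbers from the results table and computes, per dataset, the difference between Corvic AI and the mean of the two L2 systems, plus each system's max-minus-min spread across datasets. The helper names are introduced here for illustration.

```python
from statistics import mean

DATASETS = ["DelucionQA", "FinTabNet", "DMV Handbooks", "Architectural Manuals"]
# Scores copied from the results table, in the same dataset order.
SCORES = {
    "LangChain":       [69.83, 38.23, 64.63, 36.63],
    "Azure AI Search": [67.11, 41.75, 51.30, 40.12],
    "OpenAI RAG":      [67.53, 52.55, 71.85, 58.72],
    "Corvic AI":       [82.74, 63.83, 79.44, 65.12],
}
L2_SYSTEMS = ["Azure AI Search", "OpenAI RAG"]

def gap_to_l2_mean(system: str = "Corvic AI") -> dict[str, float]:
    """Per-dataset score of `system` minus the mean score of the L2 systems."""
    gaps = {}
    for i, name in enumerate(DATASETS):
        l2_mean = mean(SCORES[s][i] for s in L2_SYSTEMS)
        gaps[name] = SCORES[system][i] - l2_mean
    return gaps

def spread(system: str) -> float:
    """Difference between a system's best and worst dataset score."""
    return max(SCORES[system]) - min(SCORES[system])

gaps = gap_to_l2_mean()
spreads = {system: spread(system) for system in SCORES}
```

Printing `gaps` and `spreads` gives a quick view of how the margin varies by dataset and which system swings least across the four corpora.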
Finally, the error taxonomy differs by level. LangChain (L1) tends to miss cross-section or cross-document cues and may retrieve semantically similar but contextually wrong passages. L2 systems improve recall but are brittle: changes in hybrid weights or re-ranker choice can flip top passages, and table/figure evidence is often underutilized. Corvic AI mitigates both classes of errors by (1) retrieving through multiple views (semantic, structural, metadata), (2) verifying coverage against the question intent, and (3) iterating when gaps are detected, which collectively reduces hallucinations and incomplete answers on enterprise-style documents.
6 Discussion
Our results show that techniques for addressing L4 (reflective and reasoned knowledge) problems deliver a clear and consistent advantage for enterprise knowledge search and management even when corpora are predominantly unstructured. The architectural ingredients behind L4 (Structure-Aware Data Representation, Mixture of Spaces, and Adaptive Chain of Actions) translate into higher accuracy and more reliable grounding across diverse datasets, while lower levels exhibit wider swings.
The study also clarifies how to interpret “levels” versus “implementations”. A level specifies a capability envelope. Where a concrete system lands within that envelope depends on design choices, e.g., retrieval strategy, embeddings, re-rankers, summarization LLM. Thus, the classification provides the ceiling and direction of travel, and implementation determines realized performance within a level.
Finally, as task and data complexity rise, the value of higher levels grows. This suggests a pragmatic roadmap: organizations starting with simpler use cases can begin at lower levels but should plan upgrades toward L4 as modality mix, reasoning depth, and reliability requirements increase.
Scope. This study focused on question answering over datasets composed primarily of unstructured data. Future work will extend to structured and multi-modal corpora and include L3 baselines and targeted ablations to map accuracy–latency–cost trade-offs more completely.
7 Conclusion
We introduced a five-level RAG classification that organizes system capabilities from L1 (surface knowledge of unstructured data) through to L4 (reflective and reasoned knowledge) and the aspirational L5 (general intelligence). This is intended both as a methodological framework and as a practical guide for aligning enterprise requirements with implementation choices.
At lower levels, results reveal that choices of retrieval strategy, embeddings, re-rankers, and summarization LLMs can shift outcomes. This underscores the value of system classification for setting appropriate expectations.
Corvic AI consistently outperforms representative L1 and L2 systems on knowledge search and management tasks, even on predominantly unstructured datasets. Since Corvic AI was designed to solve L4 problems, this highlights the leverage of addressing complex problems upfront.
As enterprises adopt RAG for mission-critical workflows, framing the problem correctly becomes essential for determining the right system capabilities, and Corvic AI exemplifies how higher-level problem solving can deliver measurable enterprise value today.
References
- Amugongo et al., (2025) Amugongo, L. M., Mascheroni, P., Brooks, S., Doering, S., and Seidel, J. (2025). Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4(6):1–33.
- Brown et al., (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Es et al., (2024) Es, S., James, J., Espinosa Anke, L., and Schockaert, S. (2024). RAGAs: Automated evaluation of retrieval augmented generation. In Aletras, N. and De Clercq, O., editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta. Association for Computational Linguistics.
- Gao et al., (2024) Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. (2024). Retrieval-augmented generation for large language models: A survey.
- Guu et al., (2020) Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M.-W. (2020). Realm: retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org.
- Huang et al., (2025) Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55.
- Izacard and Grave, (2021) Izacard, G. and Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. In Merlo, P., Tiedemann, J., and Tsarfaty, R., editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.
- Karpukhin et al., (2020) Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense passage retrieval for open-domain question answering. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- Khattab and Zaharia, (2020) Khattab, O. and Zaharia, M. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, pages 39–48, New York, NY, USA. Association for Computing Machinery.
- Lewis et al., (2020) Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.
- Magesh et al., (2024) Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., and Ho, D. E. (2024). Hallucination-free? Assessing the reliability of leading AI legal research tools.
- Nogueira and Cho, (2020) Nogueira, R. and Cho, K. (2020). Passage re-ranking with BERT.
- Nogueira et al., (2019) Nogueira, R., Yang, W., Cho, K., and Lin, J. (2019). Multi-stage document ranking with BERT.
- Ram et al., (2023) Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. (2023). In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.
- Sadat et al., (2023) Sadat, M., Zhou, Z., Lange, L., Araki, J., Gundroo, A., Wang, B., Menon, R., Parvez, M., and Feng, Z. (2023). DelucionQA: Detecting hallucinations in domain-specific question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 822–835, Singapore. Association for Computational Linguistics.
- Santhanam et al., (2022) Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. (2022). ColBERTv2: Effective and efficient retrieval via lightweight late interaction.
- Shi et al., (2023) Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., Zettlemoyer, L., and Yih, W.-t. (2023). REPLUG: Retrieval-augmented black-box language models.
- Shuster et al., (2021) Shuster, K., Poff, S., Chen, M., Kiela, D., and Weston, J. (2021). Retrieval augmentation reduces hallucination in conversation. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Xiong et al., (2021) Xiong, L., Xiong, C., Li, Y., Tang, K.-F., Liu, J., Bennett, P. N., Ahmed, J., and Overwijk, A. (2021). Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.
- Yu et al., (2023) Yu, W., Iter, D., Wang, S., Xu, Y., Ju, M., Sanyal, S., Zhu, C., Zeng, M., and Jiang, M. (2023). Generate rather than retrieve: Large language models are strong context generators. In The Eleventh International Conference on Learning Representations.
- Zhao et al., (2024) Zhao, W., Feng, H., Liu, Q., Tang, J., Wu, B., Liao, L., Wei, S., Ye, Y., Liu, H., Zhou, W., Li, H., and Huang, C. (2024). TabPedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems.
- Zheng et al., (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.
- Zheng et al., (2020) Zheng, X., Burdick, D., Popa, L., Zhong, X., and Wang, N. X. R. (2020). Global table extractor (GTE): A framework for joint table identification and cell structure recognition using visual context.
- Zhuang and Zuccon, (2021) Zhuang, S. and Zuccon, G. (2021). TILDE: Term independent likelihood model for passage re-ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, pages 1483–1492, New York, NY, USA. Association for Computing Machinery.