JAIT 2025 Vol.16(6): 819-829
doi: 10.12720/jait.16.6.819-829

Hybrid Retrieval for Retrieval Augmented Generation in the German Language Production Domain

Simon Knollmeyer 1,*, Sebastian Pfaff 2, Muhammad Uzair Akmal 1, Leonid Koval 1, Saara Asif 1, Selvine G. Mathias 1, and Daniel Großmann 1
1. AImotion Bavaria, Technische Hochschule Ingolstadt, Ingolstadt, Germany
2. Information and Technology, TUM School of Computation, Technische Universität München, Garching bei München, Germany
Email: Simon.Knollmeyer@thi.de (S.K.); S.Pfaff@tum.de (S.P.); MuhammadUzair.Akmal (M.U.A.); Leonid.Koval@thi.de (L.K.); Saara.Asif@thi.de (S.A.); SelvineGeorge.Mathias@thi.de (S.G.M.); Daniel.Grossmann@thi.de (D.G.)
*Corresponding author

Manuscript received January 20, 2025; revised February 18, 2025; accepted March 14, 2025; published June 12, 2025.

Abstract—Retrieval Augmented Generation (RAG) is an emerging method for applying Artificial Intelligence to knowledge management, particularly within specialized domains. In this study, we evaluate the effect of hybrid retrieval techniques on German technical documents, which are widely used in the production and engineering departments of German companies. RAG employs a Large Language Model (LLM) that answers user queries by retrieving the most relevant text passages, known as chunks, from an extensive information store held in a database. The efficiency of this retrieval process is crucial for the overall performance of the RAG system. Classical RAG employs dense vector embeddings and nearest neighbor search for information retrieval. State-of-the-art off-the-shelf embedding models tend to struggle with non-English and highly domain-specific texts. Given the language, complexity, and specificity of technical production planning documents, we propose a hybrid retrieval approach that combines full-text search with the vector search common in RAG. We constructed 2,000 question-and-answer pairs for each language, German and English, from a representative corpus of identical texts. Our proposed hybrid approach consistently improves retrieval performance for German documents by 20% over a purely vector-based search, entirely erasing the deficiencies of embedding models for German texts and thus demonstrating its significant potential to improve knowledge management in technical and industrial contexts.
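
The hybrid idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the full-text and vector rankings are combined, so reciprocal rank fusion (RRF) is assumed here as one common choice. The function names, the example chunks, the TF-IDF-style keyword score (a stand-in for a real full-text index), and the hand-written three-dimensional vectors (placeholders for a real embedding model) are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real full-text index would also stem and strip punctuation.
    return text.lower().split()


def keyword_scores(query, chunks):
    # TF-IDF-style overlap score per chunk (assumed stand-in for full-text search).
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(chunks)
    scores = []
    for doc in docs:
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for d in docs if term in d)
            if df:
                score += doc[term] * math.log(1 + n / df)
        scores.append(score)
    return scores


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(scores):
    # Chunk indices ordered from best to worst score.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    # Assumed fusion rule: each ranking contributes 1 / (k + position) per chunk.
    fused = Counter()
    for ranking in rankings:
        for position, idx in enumerate(ranking, start=1):
            fused[idx] += 1.0 / (k + position)
    return sorted(fused, key=lambda i: fused[i], reverse=True)


def hybrid_retrieve(query, query_vec, chunks, chunk_vecs, top_k=2):
    # Fuse the full-text ranking with the vector ranking and keep the top_k chunks.
    keyword_ranking = rank(keyword_scores(query, chunks))
    vector_ranking = rank([cosine(query_vec, v) for v in chunk_vecs])
    return reciprocal_rank_fusion([keyword_ranking, vector_ranking])[:top_k]


# Hypothetical German production-domain chunks with toy embeddings.
chunks = [
    "Das Anzugsdrehmoment der Schraube M8 beträgt 25 Nm.",
    "Die Spindel wird alle 500 Betriebsstunden geschmiert.",
    "Der Sicherheitsschalter befindet sich an der Fronttür.",
    "Die Schraube M8 wird mit Loctite gesichert.",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.1, 0.2, 0.9], [0.0, 0.1, 0.9]]
query = "Anzugsdrehmoment M8"
query_vec = [0.5, 0.6, 0.1]  # toy vector simulating a weak embedding of the query

vector_only = rank([cosine(query_vec, v) for v in chunk_vecs])
hybrid = hybrid_retrieve(query, query_vec, chunks, chunk_vecs)
print("vector only:", vector_only)
print("hybrid:", hybrid)
```

In this toy setup the vector ranking alone places the unrelated spindle chunk first, while the fused ranking places the torque chunk first, because the exact domain terms in the query match it in the full-text ranking. The constant `k=60` is a conventional default for RRF and would need tuning on real data.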
 
Keywords—hybrid retrieval, retrieval augmented generation, production domain, German documents

Cite: Simon Knollmeyer, Sebastian Pfaff, Muhammad Uzair Akmal, Leonid Koval, Saara Asif, Selvine G. Mathias, and Daniel Großmann, "Hybrid Retrieval for Retrieval Augmented Generation in the German Language Production Domain," Journal of Advances in Information Technology, Vol. 16, No. 6, pp. 819-829, 2025. doi: 10.12720/jait.16.6.819-829

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
