Hybrid Question-Answering System: A FAISS and BM25 Approach for Extracting Information from Technical Document
Özlem Hakdağlı
Teracity Yazılım Teknolojileri A.Ş.
https://orcid.org/0000-0002-3637-4309
DOI: https://doi.org/10.56038/oprd.v5i1.535
Keywords: Information Extraction, Question-Answering Systems, FAISS, BM25, Technical Documents, Corporate Knowledge Management
Abstract
In this study, a hybrid question-answering system was developed to accelerate access to information contained in corporate technical documents and to generate appropriate responses to user queries. The system combines dense vector-based retrieval (FAISS) and sparse text-based retrieval (BM25) methods, integrated with the XLM-RoBERTa Large model. Evaluations conducted on a dataset consisting of 23 technical documents demonstrated the system's effectiveness in responding to both semantic and keyword-based queries. This study presents an innovative approach that enables fast and accurate access to information from technical documents, enhancing the efficiency of corporate knowledge management processes.
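The abstract does not state how the dense and sparse results are combined, so the following is only a minimal sketch of one common option: min-max normalise both score lists and take a weighted sum. Everything in it is an assumption made for illustration, including the function names, the weight `alpha`, the BM25 parameters `k1` and `b`, the whitespace tokenizer, and the hand-written toy vectors. A brute-force cosine similarity stands in for the FAISS index, and no XLM-RoBERTa model is loaded.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    k1 and b are conventional defaults, not values taken from the study.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query_vec, doc_vecs):
    """Brute-force cosine similarity; a stand-in for a dense vector index."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query_vec)
    scores = []
    for vec in doc_vecs:
        denom = q_norm * norm(vec)
        dot = sum(q * d for q, d in zip(query_vec, vec))
        scores.append(dot / denom if denom else 0.0)
    return scores


def min_max(values):
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(sparse, dense, alpha=0.5):
    """Fuse normalised scores: alpha * dense + (1 - alpha) * sparse.

    Returns document indices sorted best-first, plus the fused scores.
    """
    s, d = min_max(sparse), min_max(dense)
    fused = [alpha * di + (1 - alpha) * si for si, di in zip(s, d)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order, fused


# Toy corpus and hand-written 2-d "embeddings" (hypothetical data).
docs = [
    "reset the router password from the admin panel",
    "firmware update procedure for the router",
    "annual leave policy for employees",
]
doc_vecs = [[0.9, 0.1], [0.6, 0.4], [0.0, 1.0]]
query, query_vec = "router password reset", [1.0, 0.0]

order, fused = hybrid_rank(bm25_scores(query, docs),
                           cosine_scores(query_vec, doc_vecs))
print(order, [round(f, 3) for f in fused])
```

In a fuller pipeline the passages ranked highest by such a fusion step would be handed to the reader model for answer extraction. The weight `alpha` is the main tuning knob in this sketch: values near 1 favour semantic matches, values near 0 favour exact keyword matches.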