Visual Discovery in Retail: Operationalizing AI-Powered Visual Search at Boyner
Mert Alacan
Boyner
https://orcid.org/0000-0003-3893-6309
Seza Dursun
Boyner
https://orcid.org/0000-0003-1389-072X
Bahar Önel
Boyner
https://orcid.org/0009-0007-4597-6591
Tülin Işıkkent
Boyner
https://orcid.org/0009-0005-5775-0093
Sedat Çelik
Boyner
https://orcid.org/0009-0003-0335-6440
DOI: https://doi.org/10.56038/oprd.v7i1.742
Keywords: Visual Search, Multimodal AI, GroundingDINO, SigLIP, Milvus, Retail Intelligence, Semantic Search, AI in E-Commerce, Omnichannel Retail, Customer Experience
Abstract
In today's retail landscape, where millions of products and visual stimuli compete for customer attention, integrating artificial intelligence into visual search has emerged as a crucial lever of operational efficiency. This paper presents Boyner Group's AI-powered visual discovery system, which lets customers search with photos instead of keywords, making product discovery more intuitive and visually engaging. The architecture takes a hybrid approach, combining Large Language Models (LLMs), the open-set object detector GroundingDINO, and SigLIP vision-language embeddings indexed in the Milvus vector database to deliver scalable, high-accuracy image retrieval. The system, currently operational across the Boyner.com.tr ecosystem, supports enhanced filtering and storytelling capabilities, increasing customer satisfaction and conversion rates. The paper examines the implementation process, system components, and operational results of this large-scale AI integration, highlighting its transformative impact within omnichannel retail.
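The detect-embed-retrieve flow summarized in the abstract can be sketched as follows. This is a minimal, library-free illustration, not Boyner's actual implementation: `detect` stands in for a GroundingDINO crop, `embed` for a SigLIP image embedding, and the brute-force cosine scan for a Milvus approximate-nearest-neighbour query. All function names, SKUs, and the toy character-count embedding are hypothetical.

```python
import math

def detect(image):
    """Stand-in for GroundingDINO: would return the main product crop."""
    return image  # the real detector returns a bounding-box region

def embed(image):
    """Stand-in for SigLIP: maps an 'image' to a dense vector.
    Toy embedding: letter-frequency counts over a fixed alphabet."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [image.count(c) for c in alphabet]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def visual_search(query_image, catalog, top_k=3):
    """Detect -> embed -> similarity search (Milvus stand-in)."""
    q = embed(detect(query_image))
    scored = [(pid, cosine(q, vec)) for pid, vec in catalog.items()]
    return [pid for pid, _ in sorted(scored, key=lambda s: -s[1])[:top_k]]

# Offline indexing step: in production, Milvus stores these vectors.
catalog = {pid: embed(img) for pid, img in {
    "sku-001": "red leather handbag",
    "sku-002": "blue denim jacket",
    "sku-003": "brown leather boots",
}.items()}

print(visual_search("black leather handbag", catalog, top_k=1))
```

In the production system each stage is a separate service: detection crops the query photo to the dominant product, the embedding model maps the crop into the same vector space as the indexed catalog, and the vector database returns the nearest SKUs, onto which business filters (size, price, availability) are then applied.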
