Culinary Culture: A Global Exploration of Health and Diversity in Cuisine
December 1, 2025·,
,·
0 min read
Mubaswira Ibnat Zidney
Anik Kumar Sannyashi
Raihan Tanvir
Faisal Muhammad Shah
Abstract
Accurate classification of food by cuisine and dietary categories is pivotal for
advancing personalized nutrition and intelligent recommendation systems, yet
unimodal approaches often struggle with label inconsistencies and cultural
diversity in recipes. This study presents a novel multimodal deep learning
methodology that synergistically integrates textual ingredient semantics with
visual food image features to jointly predict cuisine and diet, establishing it as
a robust solution to existing limitations. We refine a dataset of 4,986 recipes
by consolidating over 76 regional cuisine labels into 30 country-level classes,
enhancing semantic coherence and class balance. Our proposed framework employs
transformer-based encoders to distill contextual ingredient information and
advanced visual encoders to extract image representations, which are fused via an
optimized average projection and dropout mechanism to maximize predictive
accuracy. Evaluated on the refined dataset, this multimodal approach achieves 81%
accuracy for cuisine and 79% for diet, significantly surpassing text-only
baselines (up to 40% cuisine accuracy) and image-only baselines (up to 25%
cuisine accuracy) by 15-40%. Ablation studies underscore the efficacy of our
fusion strategy in addressing noisy labels, positioning this methodology as a
scalable foundation for applications in dietary assessment, smart kitchen systems,
and food informatics.
Type
Publication
2025 28th International Conference on Computer and Information Technology (ICCIT)

Authors
Senior Lecturer
I am Raihan Tanvir, currently serving as a Senior Lecturer in the Department of Computer Science and Engineering at Ahsanullah University of Science and Technology (AUST) in Dhaka, Bangladesh. My research spans Computer Vision, Natural Language Processing (NLP), Large Language Models (LLMs), Vision-Language Models (VLMs), and multimodal deep learning.