Tourism and travel planning increasingly rely on digital assistance, yet
existing multimodal AI systems often lack specialized knowledge and contextual
understanding of urban environments. We present TraveLLaMA, a specialized
multimodal language model designed for urban scene understanding and travel
assistance. Our work addresses the fundamental challenge of developing
practical AI travel assistants through a novel large-scale dataset of 220k
question-answer pairs. This comprehensive dataset uniquely combines 130k text
QA pairs meticulously curated from authentic travel forums with GPT-enhanced
responses, alongside 90k vision-language QA pairs specifically focused on map
understanding and scene comprehension. Through extensive fine-tuning
experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL,
Shikra), we demonstrate significant performance improvements ranging from
6.5\%-9.4\% in both pure text travel understanding and visual question
answering tasks. Our model exhibits exceptional capabilities in providing
contextual travel recommendations, interpreting map locations, and
understanding place-specific imagery while offering practical information such
as operating hours and visitor reviews. Comparative evaluations show TraveLLaMA
significantly outperforms general-purpose models in travel-specific tasks,
establishing a new benchmark for multi-modal travel assistance systems.
Este artículo explora los viajes en el tiempo y sus implicaciones.
Descargar PDF:



