JAIT 2024 Vol.15(7): 822-837
doi: 10.12720/jait.15.7.822-837

A Cross-Modal Transformer Based Model for Box-office Revenue Prediction

Canaan T. Madongo *, Zhongjun Tang, and Jahanzeb Hassan
School of Economics and Management, Beijing Modern Manufacturing Development,
Beijing University of Technology, Beijing, China
Email: ctmadongo@yahoo.co.uk (C.T.M.); tangzhongjun@bjut.edu.cn (Z.J.T.); jahanzab.hassan@gmail.com (J.H.)
*Corresponding author

Manuscript received January 13, 2024; revised February 24, 2024; accepted March 13, 2024; published July 8, 2024.

Abstract—In the dynamic entertainment industry, predicting a movie’s opening box office revenue remains critical for filmmakers and studios. To address this challenge, we present a novel Cross-modal Transformer and Hierarchical Fusion Neural Network (CHFNN) model that predicts movie box office earnings from multimodal features extracted from movie trailers, posters, and reviews. The Cross-modal Transformer component of the CHFNN model captures intricate inter-modal relationships by performing a cross-modal fusion of the extracted features. It employs self-attention mechanisms to dynamically weigh the importance of each modality’s information, so that the model learns to focus on the most relevant information from trailers, posters, and reviews and adapts to the unique characteristics of each movie. The Hierarchical Fusion Neural Network within CHFNN further refines the fused features, enabling a deeper understanding of the inherent hierarchical structure of multimodal data. By hierarchically combining the cross-modal features, the model learns to capture both global and local interactions, enhancing its predictive capacity. We evaluate the CHFNN model on an Internet Movie Dataset comprising metadata for 50,186 movies released from the 1990s to 2022, including movie trailers, posters, and review data. Our results demonstrate that the CHFNN model outperforms existing models, achieving 95.80% prediction accuracy. The CHFNN model provides state-of-the-art predictive power and offers interpretability through attention mechanisms, allowing insights into the factors contributing to a movie’s box office success.
Keywords—box-office, movie posters, movie trailers, movie reviews, cross-modal transformers, predictions
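
The attention-based fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' CHFNN implementation: the function names (`cross_modal_attention`, `fuse_modalities`), the feature dimensions, the mean pooling, and the use of plain scaled dot-product attention without learned projections are all assumptions made for illustration. Each modality's feature sequence attends to the features of the other two modalities, and the attended outputs are pooled and concatenated into one fused vector that a downstream regressor could consume.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, context_feats):
    # Scaled dot-product attention: rows of query_feats attend to rows of
    # context_feats. No learned projections are used (simplifying assumption).
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ context_feats, weights

def fuse_modalities(trailer, poster, review):
    # Hypothetical fusion step: every modality attends to the other two,
    # the attended sequence is mean-pooled, and the pooled vectors are
    # concatenated into one fused representation.
    feats = {"trailer": trailer, "poster": poster, "review": review}
    pooled = []
    for name, query in feats.items():
        context = np.concatenate(
            [v for k, v in feats.items() if k != name], axis=0
        )
        attended, _ = cross_modal_attention(query, context)
        pooled.append(attended.mean(axis=0))
    return np.concatenate(pooled)

# Random stand-ins for extracted features (sequence length x feature size).
rng = np.random.default_rng(0)
trailer = rng.normal(size=(8, 16))  # e.g. 8 trailer frame embeddings
poster = rng.normal(size=(4, 16))   # e.g. 4 poster region embeddings
review = rng.normal(size=(6, 16))   # e.g. 6 review token embeddings

fused = fuse_modalities(trailer, poster, review)
print(fused.shape)  # (48,)
```

The attention weights returned by `cross_modal_attention` are the quantities one would inspect for the kind of interpretability the abstract mentions, since each row shows how strongly one modality's element draws on the elements of another.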

Cite: Canaan T. Madongo, Zhongjun Tang, and Jahanzeb Hassan, "A Cross-Modal Transformer Based Model for Box-office Revenue Prediction," Journal of Advances in Information Technology, Vol. 15, No. 7, pp. 822-837, 2024.

Copyright © 2024 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.