Multimodal Machine Translation for Children’s Picture Books: Combining Vision and NLP to Support Chinese-to-English Bilingual Literacy in Primary Education

Xueer Tian; Xin Qiao

doi:10.6180/jase.202608_31.071

Computer Science and Information Engineering

Xueer Tian¹ and Xin Qiao²

¹School of Culture and Communication, Hubei Preschool Teachers College, Wuhan, Hubei, 430223, China.

²College of Artificial Intelligence, Hubei Preschool Teachers College, Wuhan, Hubei, 430223, China.

Received: December 18, 2025
Accepted: March 4, 2026
Publication Date: March 27, 2026

Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.

Multimodal machine translation is increasingly recognized as an effective tool for early bilingual literacy, especially for primary-school students reading picture-story books. The proposed Hybrid Transformer-ViT framework unifies text encoding and visual grounding, producing coherent translations without redundant visual-linguistic features. Its key components are advanced text preprocessing, image normalization, multimodal alignment, and cross-attention fusion, enhanced by Target-Visual Consistency (TVC) and Bilingual-Visual Consistency (BiVC) mechanisms. Hyperparameters were optimized with the Grey Wolf Optimizer to accelerate convergence and make cross-modal representation learning more robust. In experiments on the 3AM dataset, containing 15,728 image-text pairs from 436 children’s books, the model achieved a BLEU-4 of 50.8, METEOR of 52.9, ROUGE-L of 64.1, and CIDEr of 1.73, outperforming the text-only Transformer (BLEU-4 of 37.8). Human evaluation confirmed gains in fluency (+18.4%), adequacy (+30.6%), image-text consistency (+65.5%), and child readability (+31.4%). Optimized GPU deployment reduced latency to 27 ms, enabling real-time translation and enhanced bilingual story comprehension.

Keywords: Multimodal Machine Translation; Hybrid Transformer Model; Bilingual-Visual Consistency; Children’s Picture Books; Grey Wolf Optimizer; Target-Visual Consistency
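
The abstract names the Grey Wolf Optimizer as the hyperparameter search method but does not give its configuration. As a rough illustration only, the sketch below implements the standard Grey Wolf update rule (the three best candidates guide the rest of the pack while an exploration factor decays from 2 to 0) on a hypothetical two-parameter loss surface. The function names, the search bounds, and the quadratic stand-in for validation loss are all assumptions made for this example, not the authors' setup.

```python
import random


def grey_wolf_optimize(objective, bounds, n_wolves=20, n_iters=200, seed=0):
    """Minimize `objective` over the box `bounds` = [(low, high), ...].

    Needs n_wolves >= 3 so that alpha, beta, and delta leaders exist.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial pack inside the search box.
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_score = None, float("inf")

    for it in range(n_iters):
        # Rank the pack; copy the three best wolves (alpha, beta, delta).
        ranked = sorted(wolves, key=objective)
        leaders = [list(w) for w in ranked[:3]]
        score = objective(leaders[0])
        if score < best_score:
            best_pos, best_score = list(leaders[0]), score

        # Exploration factor `a` decays linearly from 2 towards 0.
        a = 2.0 * (1.0 - it / n_iters)
        for w in wolves:
            for d in range(dim):
                lo, hi = bounds[d]
                pulled = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a  # step scale in [-a, a]
                    C = 2.0 * rng.random()          # leader emphasis in [0, 2]
                    dist = abs(C * leader[d] - w[d])
                    pulled += leader[d] - A * dist
                # Average the three leader-guided moves and clip to the box.
                w[d] = min(max(pulled / 3.0, lo), hi)

    # The last update may have produced a better wolf than any ranked so far.
    final = min(wolves, key=objective)
    if objective(final) < best_score:
        best_pos, best_score = list(final), objective(final)
    return best_pos, best_score


def toy_validation_loss(params):
    """Hypothetical stand-in for a validation loss; minimum at (-3.5, 0.2)."""
    log10_lr, dropout = params
    return (log10_lr + 3.5) ** 2 + (dropout - 0.2) ** 2


if __name__ == "__main__":
    # Assumed search space: log10(learning rate) in [-5, -2], dropout in [0, 0.5].
    pos, loss = grey_wolf_optimize(toy_validation_loss, [(-5.0, -2.0), (0.0, 0.5)])
    print("best log10(lr), dropout:", pos, "loss:", loss)
```

In a real tuning run the objective would train and validate the translation model for each candidate, so the pack size and iteration count would be far smaller than the defaults used here.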

