MFSM: Chinese-English sentence alignment based on multi-feature self-attention mechanism fusion

Baolong Li

doi:10.6180/jase.202510_28(10).0005



School of Foreign Languages, Zhengzhou University of Science and Technology, Zhengzhou 450064, China

Received: October 8, 2024
Accepted: November 1, 2024
Publication Date: January 13, 2025

Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.


Bilingual parallel corpora are a fundamental resource for statistical natural language processing. Chinese-English bilingual text contains cross alignments and empty alignments, which readily degrade the quality of sentence alignment. We therefore propose a novel Chinese-English sentence alignment method based on multi-feature self-attention mechanism fusion. First, the long features of Chinese-English bilingual sentences are integrated into the GloVe word vectors, and a bidirectional gated recurrent unit encodes the resulting feature word vectors to obtain finer-grained local sentence information. Second, an interactive attention mechanism is introduced to extract global information from the bilingual sentences, ensuring that contextual semantic features are used effectively. Finally, the Kuhn-Munkres (KM) algorithm is applied on top of a multi-layer perceptron, which allows the model to handle non-monotonically aligned text and improves its generalization ability. Experiments show that the F-measure of the proposed method exceeds 90%, and that the method effectively improves the precision and recall of sentence alignment as well as the efficiency of constructing Chinese-English parallel corpora.

Keywords: Chinese-English sentence alignment; multi-feature self-attention mechanism fusion; bidirectional gated recurrent unit; Kuhn-Munkres algorithm
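
The final step of the abstract, using the Kuhn-Munkres algorithm to handle non-monotonic alignment, can be illustrated with a minimal sketch. It is not the author's implementation. It assumes that an upstream scorer (the multi-layer perceptron in the abstract) has already produced a matrix `scores[i][j]` giving the similarity between Chinese sentence `i` and English sentence `j`. The function names `km_assign` and `align`, the `min_score` threshold used to model empty alignments, and the example numbers are all hypothetical.

```python
def km_assign(cost):
    """Minimum-cost assignment via the Kuhn-Munkres (Hungarian) method.

    `cost` is a rectangular list of lists with rows <= columns.
    Returns (row, column) pairs, one per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so one more edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def align(scores, min_score=0.0):
    """Maximize total similarity, then drop weak pairs.

    Pairs scoring below `min_score` are treated as empty alignments
    (an assumption of this sketch, not something the abstract specifies).
    """
    rows, cols = len(scores), len(scores[0])
    transposed = rows > cols
    if transposed:  # km_assign expects rows <= columns
        scores = [list(col) for col in zip(*scores)]
    cost = [[-s for s in row] for row in scores]  # maximize = minimize negation
    pairs = km_assign(cost)
    if transposed:
        scores = [list(col) for col in zip(*scores)]
        pairs = [(j, i) for i, j in pairs]
    return sorted((i, j) for i, j in pairs if scores[i][j] >= min_score)


# Hypothetical similarity scores: source sentences 1 and 2 are swapped
# in the target text, i.e. a cross (non-monotonic) alignment.
example_scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.1, 0.8],
    [0.1, 0.7, 0.3],
]
print(align(example_scores, min_score=0.5))  # [(0, 0), (1, 2), (2, 1)]
```

Because the assignment is solved globally rather than left to right, the crossed pair (1, 2) and (2, 1) is recovered without any monotonicity assumption. Raising `min_score` leaves low-similarity sentences unaligned.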
