SMS Spam Classification–Simple Deep Learning Models With Higher Accuracy Using BUNOW And GloVe Word Embedding

Surajit Giri; Sayak Das; Sutirtha Bharati Das; Siddhartha Banerjee

doi:10.6180/jase.202310_26(10).0015

Computer Science and Information Engineering

Surajit Giri, Sayak Das, Sutirtha Bharati Das, and Siddhartha Banerjee

Department of Computer Science, Ramakrishna Mission Residential College, Narendrapur, West Bengal, India

Received: March 16, 2022
Accepted: November 30, 2022
Publication Date: February 21, 2023

Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.

https://doi.org/10.6180/jase.202310_26(10).0015

ABSTRACT

Unwanted text messages are called spam SMSs. Machine learning models have been shown to categorize spam messages efficiently and with high accuracy. However, because of the lack of proper spam filtering software or the misclassification of genuine SMS as spam by existing software, spam detection applications have not become popular. In this paper, we propose multiple deep neural network models to classify spam messages. Tiago’s Dataset is used for this research. First, a preprocessing step is applied to the messages in the dataset: lowercasing, tokenization, lemmatization, and removal of numbers, punctuation, and stop words. The preprocessed messages are fed into two deep learning models with simple architectures, namely a Convolutional Neural Network (CNN) and a hybrid of a Convolutional Neural Network and a Long Short-Term Memory network (CNN-LSTM), for classification. To increase the accuracy of these two simple architectures, the BUNOW and GloVe word embedding techniques are incorporated into the deep learning models. BUNOW and GloVe are popular choices in sentiment analysis; in this work, these two word embedding techniques are applied to text classification to improve accuracy. The best accuracy of 98.44% is achieved by the CNN LSTM BUNOW model after 15 epochs on a 70%-30% train-test split. The proposed model can be used in many practical applications, such as real-time SMS spam detection, email spam detection, sentiment analysis, and text categorization.

Keywords: SMS Spam; Machine Learning; CNN; CNN-LSTM; Word Embedding; GloVe; BUNOW
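
The preprocessing steps and the 70%-30% split named in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It uses only the Python standard library, the stop-word list is a tiny hypothetical sample, and `naive_lemma` is a crude stand-in for a real lemmatizer. The function names, the order of the steps, and the random seed are all assumptions made for illustration.

```python
import random
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for",
              "you", "your", "on", "at"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a plural 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(message):
    """Lowercase, drop digits and punctuation, tokenize, drop stop words, lemmatize."""
    text = message.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace anything but letters/whitespace
    tokens = text.split()                  # whitespace tokenization
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    print(preprocess("WINNER!! You have won 2 free tickets to the Cup finals."))
    train, test = train_test_split(range(10))
    print(len(train), len(test))
```

Under these assumptions, the example message reduces to `['winner', 'have', 'won', 'free', 'ticket', 'cup', 'final']`. A practical pipeline would replace the placeholder lemmatizer and stop-word list with a proper NLP library before mapping tokens to BUNOW or GloVe vectors.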
