Phishing Message Detection Based on Keyword Matching


Keng-Theen Tham
Kok-Why Ng
Su-Cheng Haw


Keywords: keyword matching, phishing detection, Naïve Bayes, natural language processing, stemming


This paper proposes a Naïve Bayes-based algorithm for phishing detection, specifically in spam emails. It compares probability-based and frequency-based approaches and investigates the impact of imbalanced datasets and of stemming, a natural language processing (NLP) technique. Results show that the two approaches perform similarly in spam detection, so the choice between them depends on factors such as efficiency and scalability. Accuracy is influenced by the dataset configuration and by stemming. With imbalanced datasets, the algorithms achieve higher accuracy on emails in the majority class but struggle to classify minority-class emails. In contrast, balanced datasets yield high overall accuracy for both spam and ham (legitimate) emails. Stemming has only a minor impact on performance and occasionally decreases accuracy because it groups distinct words together. Balancing the dataset is therefore crucial for accurate spam email detection. Both probability-based and frequency-based Naïve Bayes algorithms are effective for phishing detection when trained on balanced datasets. The frequency-based approach, with a balanced dataset and stemming, achieves a balance between recall and precision, while the probability-based approach, with a balanced dataset and no stemming, prioritises overall accuracy.
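The ingredients named in the abstract (keyword counts per class, optional stemming, dataset balancing, and precision/recall) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define its probability-based and frequency-based variants, so the code below assumes a standard multinomial Naïve Bayes with Laplace smoothing. The function names, the toy suffix stripper `crude_stem` (a stand-in for a real stemmer), the undersampling strategy in `balance`, and the `TRAIN` messages are all hypothetical.

```python
import math
import random
from collections import Counter

PUNCT = ".,!?:;\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [w for w in (raw.strip(PUNCT) for raw in text.lower().split()) if w]

def crude_stem(word):
    """Toy suffix stripper (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def balance(messages, seed=0):
    """Undersample so every class keeps as many messages as the smallest class."""
    by_label = {}
    for text, label in messages:
        by_label.setdefault(label, []).append((text, label))
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    return [m for group in by_label.values() for m in rng.sample(group, n)]

def train(messages, stem=False):
    """Count keyword occurrences per class; messages are (text, label) pairs."""
    words = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in messages:
        tokens = tokenize(text)
        if stem:
            tokens = [crude_stem(t) for t in tokens]
        words[label].update(tokens)
        docs[label] += 1
    vocab = set(words["spam"]) | set(words["ham"])
    return {"words": words, "docs": docs, "vocab": vocab, "stem": stem}

def classify(text, model):
    """Return the class with the higher log-posterior (Laplace-smoothed).

    Assumes both classes appear at least once in the training data.
    """
    tokens = tokenize(text)
    if model["stem"]:
        tokens = [crude_stem(t) for t in tokens]
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in ("spam", "ham"):
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(model["docs"][label] / total_docs)  # class prior
        for t in tokens:
            if t in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[t] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data, for illustration only.
TRAIN = [
    ("Win a free prize now", "spam"),
    ("Claim your free reward, click now", "spam"),
    ("Urgent: verify your account to win", "spam"),
    ("Are we meeting for lunch tomorrow", "ham"),
    ("Please review the attached report", "ham"),
    ("Lunch at noon works for me", "ham"),
]

model = train(balance(TRAIN), stem=False)
print(classify("Free prize, click to win", model))
```

Passing `stem=True` to `train` applies the same stemming at classification time, which makes it easy to compare the two configurations on one dataset. Because stemming merges different surface forms into one count, it can also merge words that carry different signals, which is one plausible reading of the accuracy drop the abstract attributes to word grouping.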




