Improving Phishing Email Detection Using the Hybrid Machine Learning Approach

Naveen Palanichamy
Yoga Shri Murti


machine learning, phishing email detection, hybrid classification


Phishing emails pose a severe risk to online users, so effective identification methods are needed to safeguard digital communication. Detection techniques are continuously researched to keep pace with evolving phishing strategies. Machine learning (ML) is a powerful tool for automated phishing email detection, but existing techniques such as support vector machines and Naive Bayes have proven slow or ineffective at spam filtering. This study aims to provide a reliable phishing email detector and classifier using a hybrid ML classifier with term frequency-inverse document frequency (TF-IDF) and an effective feature extraction technique (FET) on a real-world dataset from Kaggle. Exploratory data analysis is conducted to better understand the dataset and to identify conspicuous errors and outliers before detection. The FET converts the email text into a numerical representation that ML algorithms can use. The model’s performance is evaluated using accuracy, precision, recall, F1 score, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The findings indicate that the hybrid model utilising TF-IDF achieved superior performance, with an accuracy of 87.5%. The paper offers practical insight into using ML to identify phishing emails and highlights the importance of combining multiple models.
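
As an illustration of two steps named in the abstract, the sketch below shows how TF-IDF can turn email text into numeric weights and how accuracy, precision, recall and F1 score follow from confusion-matrix counts. It is a minimal sketch, not the study's implementation: the weighting variant (term frequency times log(N / document frequency)), the whitespace tokeniser, the function names `tfidf` and `classification_metrics`, and the toy emails and labels are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / doc length) * log(N / df).
    Library implementations often add smoothing and normalisation.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data, made up for illustration only (not from the Kaggle dataset).
emails = [
    "verify your account now",
    "meeting agenda for monday",
    "verify your password now",
]
vectors = tfidf(emails)
# Hypothetical labels and predictions: 1 = phishing, 0 = legitimate.
metrics = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy corpus, a term that appears in only one email ("account") receives a higher weight than a term shared by two emails ("verify"), which is the property that lets a classifier focus on distinctive wording.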

