Efficient clustering of e-mails by applying supervised machine learning algorithms.
Abstract
In today’s digital age, effective detection of unwanted emails, commonly known as "spam",
has become a priority for individuals and organisations alike. As email inboxes fill up with
unsolicited messages, it has become evident that the predefined rules and heuristics used by
traditional spam filters have lost their effectiveness. This persistent problem poses challenges
at both the personal and business level.
Despite efforts to protect email accounts with anti-virus software, which in many cases comes
at a cost, spam remains a growing concern. For businesses, implementing costly firewalls can be
an unnecessary burden. The problem of spam persists, and its impact on the efficiency and
security of email communication is indisputable.
The primary objective of this paper is to investigate and evaluate machine learning algorithms
specifically designed to address the challenge of automatic spam detection. This is achieved by
using text classification techniques applied to mail servers and personal computers. In particular,
three key algorithms are examined: Random Forest, Decision Tree and Naive Bayes, with
the intention of determining their applicability in both environments.
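To make the text classification idea concrete, the following is a minimal sketch of a multinomial Naive Bayes spam classifier built on tokenised word counts, using only the Python standard library. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names (`tokenise`, `train_naive_bayes`, `predict`), the add-one smoothing, and the four toy messages are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(messages, labels):
    """Collect per-class word counts, class counts and the vocabulary."""
    word_counts = {}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts.setdefault(label, Counter()).update(tokenise(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-posterior for the message."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_messages in class_counts.items():
        # Log prior: share of training messages that carry this label.
        score = math.log(n_messages / total_messages)
        # Add-one (Laplace) smoothing avoids zero probabilities.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenise(text):
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data (not taken from the study).
train_messages = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_messages, train_labels)
```

Calling `predict(model, "claim your free prize")` on this toy model yields `"spam"`, because every known word in the message is more frequent in the spam class. A Random Forest or Decision Tree would consume the same word-count features but learn threshold rules over them instead of per-class word probabilities.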
This study relies on two research methods. The first is feature selection, a crucial process
that identifies the variables most relevant to mail classification, including keywords and
word frequencies. The second is performance evaluation, which uses metrics such as accuracy,
recall and F1-score to measure how well the machine learning models distinguish spam from
legitimate emails.
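The metrics named above can be computed directly from a model's predictions. The sketch below assumes spam is the positive class and uses a small made-up set of labels; the function name `evaluate` and the example data are hypothetical, not the study's results. Precision is computed as well because the F1-score is the harmonic mean of precision and recall.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 spam and 5 legitimate ("ham") messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
scores = evaluate(y_true, y_pred)
```

In this example one spam message is missed and one legitimate message is flagged, so accuracy is 6/8 = 0.75 while precision, recall and F1 are each 2/3. The gap shows why accuracy alone can look better than the spam-specific metrics when legitimate mail dominates the data.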
The results of this study are presented in the form of comparative tables showing the hit and
miss rates of the three models evaluated. Notably, it is determined that the Random Forest
model, when applied in conjunction with tokenisation techniques, exhibits superior efficiency
compared to the other two models.
The choice of the right machine learning model is critical to ensure efficiency in email
classification, and this study provides a solid basis for making informed decisions in the
implementation of email security systems in real-world business environments. Spam detection,
supported by machine learning algorithms, remains an evolving field and offers a promising solution to
address a persistent problem in the digital world.
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.