Using Huffman Trees in Features Selection to Enhance Performance in Spam Detection
Abstract
Detecting spam is far more costly than sending it. Most approaches aim for higher accuracy and leave classification performance in the background, which can cause problems such as bottlenecks in the e-mail system, large infrastructure investments, and waste of pooled resources. To avoid these problems, this paper proposes a hierarchical organization of spam features using Huffman trees, in which the most important features stay closer to the root. Reducing these trees by pruning leaves significantly shrinks the feature space and speeds up e-mail classification. In the experiments, classification was 60 times faster than with SpamAssassin.
References
K. Kleiner. Happy Spamiversary! Spam Reaches 30 [Online]. Available: https://www.newscientist.com/article/dn13777-happy-spamiversary-spam-reaches-30/
B. Hoanca, "How good are our weapons in the spam wars?", IEEE Technology and Society Magazine, vol. 25, issue 1, pp. 22-30, 2006.
B. Whitworth and E. Whitworth, "Spam and the social technical gap", IEEE Computer, vol. 37, issue 10, pp. 38-45, 2004.
Symantec, "January 2011 Intelligence Report" [Online]. Available: https://www.navixia.com/images/pdf/newsletter/MLI_2011_01_January_Final_enus.pdf
Symantec, "May 2013 Intelligence Report" [Online]. Available: [link].
Symantec, "Internet Security Threat Report" [Online]. Rep. 21, Apr. 2016. Available: https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf
C. K. Olivo, A. O. Santin and L. S. Oliveira, "Obtaining the Threat Model for E-mail Phishing", Applied Soft Computing, vol. 13, issue 12, pp. 4841-4848, 2013.
J. Klensin. (2001, April). RFC 2821 - Simple Mail Transfer Protocol [Online]. Available: http://www.ietf.org/rfc/rfc2821.txt
N. Freed and N. Borenstein. (1996, November). RFC 2045 - Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies [Online]. Available: http://tools.ietf.org/rfc/rfc2045.txt
R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd edition, Wiley-Interscience, 2000.
There are 600,426,974,379,824,381,952 ways to spell Viagra [Online]. Available: http://cockeyed.com/lessons/viagra/viagra.html
The Unicode Standard - Technical Introduction [Online]. Available: http://www.unicode.org/standard/principles.html
C. Liu and S. Stamm, "Fighting Unicode-Obfuscated Spam", Proceedings of the Anti-Phishing Working Group - 2nd Annual eCrime Researchers Summit, pp. 45-59, ACM, 2007.
C. Bishop, Pattern Recognition and Machine Learning, 1st edition, Springer, 2007.
K. Schneider, "A comparison of event models for Naive Bayes anti-spam e-mail filtering", Proceedings of the 10th conference on European chapter of the Association for Computational Linguistics, vol. 1, pp. 307-314, ACM, 2003.
I. Androutsopoulos, G. Paliouras and E. Michelakis, "Learning to Filter Unsolicited Commercial E-Mail" [Online], NCSR "Demokritos" Technical Rep. 2004/2, Mar. 2004. Available: http://nlp.cs.aueb.gr/pubs/TR2004_updated.pdf
C. Chen, Y. Tian and C. Zhang, "Spam Filtering with Several Novel Bayesian Classifiers", IEEE 19th International Conference on Pattern Recognition, pp. 1-4, 2008.
E. Frank, M. Hall and B. Pfahringer, "Locally Weighted Naive Bayes", Proceedings of the 19th conference on Uncertainty in Artificial Intelligence, ACM, pp. 249-256, 2002.
H. Zhang, L. Jiang and J. Su, "Hidden Naive Bayes", Proceedings of the 20th National Conference on Artificial Intelligence, vol. 2, ACM, pp. 919-924, 2005.
G. Webb, J. Boughton, and Z. Wang, "Not so Naive Bayes: Aggregating One-Dependence Estimators", Machine Learning, vol. 58, issue 1, pp. 5-24, 2005.
H. Drucker, S. Wu and V. Vapnik, "Support Vector Machines for Spam Categorization", IEEE Transactions on Neural Networks, vol. 10, issue 5, pp. 1048-1054, 1999.
R. S. S. Kiran and I. Atmosukarto, "Spam or Not Spam - That is the Question" [Online], Technical Report, University of Washington, 2005. Available: [link].
T. Guzella and W. Caminhas, "A Review of Machine Learning Approaches to Spam Filtering", Expert Systems with Applications, Elsevier, vol. 36, issue 7, pp. 10206-10222, 2009.
B. Nelson, B. Rubinstein, L. Huang, A. Joseph and J. Tygar, "Classifier Evasion: Models and Open Problems", Privacy and Security Issues in Data Mining and Machine Learning, vol. 6549, Lecture Notes in Computer Science, pp. 92-98, Springer, 2011.
M. Barreno, B. Nelson, R. Sears, A. Joseph and J. Tygar, "Can Machine Learning be Secure?", Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pp. 16-25, 2006.
TF-IDF: A Single Page Tutorial - Information Retrieval and Text Mining [Online]. Available: http://www.tfidf.com
M. Sharma, "Compression Using Huffman Coding", International Journal of Computer Science and Network Security, vol. 10, no. 5, pp. 133-141, 2010.
V. Metsis, I. Androutsopoulos and G. Paliouras, "Spam Filtering with Naive Bayes - Which Naive Bayes?", Proceedings of the 3rd Conference on E-mail and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.
The Enron-Spam Datasets [Online]. Available: http://www.aueb.gr/users/ion/data/enron-spam
G. Cormack, "E-mail Spam Filtering: A Systematic Review", Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335-455, 2006.
Untroubled Spam Archive [Online]. Available: http://untroubled.org/spam/
Natural Language Toolkit - NLTK 3.0 documentation [Online]. Available: http://www.nltk.org
R. Fan, K. Chang, C. Hsieh, X. Wang and C. Lin, "LIBLINEAR: A Library for Large Linear Classification", Journal of Machine Learning Research, vol. 9, 2008.
LIBLINEAR: A Library for Large Linear Classification [Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/liblinear
C. Hsu, C. Chang and C. Lin. A Practical Guide to Support Vector Classification [Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Z. Xu, G. Huang and K. Weinberger, "Gradient Boosted Feature Selection", Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 522-531, 2014.
SpamAssassin - The #1 Enterprise Open-Source Spam Filter [Online]. Available: http://spamassassin.apache.org
SpamAssassin Configuration File [Online]. Available: [link].
A. Jain and D. Zongker, "Feature Selection: Evaluation, Application, and Small Sample Performance", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, issue 2, pp. 153-158, 1997.
J. Meng, H. Lin and Y. Yu, "A Two-stage Feature Selection Method for Text Categorization", Computers and Mathematics with Applications, Elsevier, pp. 2793-2800, 2011.
N. Wiratunga, I. Koychev and S. Massie, "Feature Selection and Generalization for Retrieval of Textual Cases", Proceedings of the 7th European Conference on Case-Based Reasoning, Springer Verlag, pp. 806-820.
S. Trivedi and S. Dey, "Effect of Feature Selection Methods on Machine Learning Classifiers for Detecting Email Spams", Proceedings of the 2013 Research in Adaptive and Convergent Systems, ACM, pp. 35-40, 2013.
Published
November 6, 2017
How to Cite
OLIVO, Cleber K.; SANTIN, Altair O.; OLIVEIRA, Luiz E. S. Using Huffman Trees in Features Selection to Enhance Performance in Spam Detection. In: SIMPÓSIO BRASILEIRO DE SEGURANÇA DA INFORMAÇÃO E DE SISTEMAS COMPUTACIONAIS (SBSEG), 17., 2017, Brasília. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2017. p. 278-291. DOI: https://doi.org/10.5753/sbseg.2017.19506.