Adaptive Spam Filtering with the Minimum Description Length Principle
Abstract
The spread of spamming has motivated the adoption of filters that learn to separate unwanted messages from legitimate ones. This paper presents a classification scheme based on the Minimum Description Length (MDL) principle. Experiments conducted on the TREC 2005 and TREC 2006 public email corpora show that MDL classification achieves results comparable to those of state-of-the-art techniques while outperforming them in execution speed. The good performance of MDL-based classification has led us to develop UnBeaten: a spam filter aimed at email users and integrated into the Mozilla Thunderbird mail client.
References
Assis, F., Yerazunis, W., Siefkes, C., and Chhabra, S. (2006). Exponential Differential Document Count: A feature selection factor for improving Bayesian filters accuracy. In The Fifteenth Text REtrieval Conference (TREC) Proceedings.
Bratko, A. and Filipič, B. (2005). Spam filtering using character-level Markov models: Experiments for the TREC 2005 spam track. In The Fourteenth Text REtrieval Conference (TREC) Proceedings, Gaithersburg, MD.
Bratko, A., Filipič, B., Cormack, G. V., Lynam, T. R., and Zupan, B. (2006). Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7:2673–2698.
Cormack, G. (2006a). TREC 2006 spam track overview. In The Fifteenth Text REtrieval Conference (TREC) Proceedings, Gaithersburg, MD.
Cormack, G. (2006b). TREC 2006 spam track results. In The Fifteenth Text REtrieval Conference (TREC) Proceedings, Gaithersburg, MD. [link] (Accessed December 2007).
Cormack, G. and Lynam, T. (2005a). TREC 2005 spam track overview. In The Fourteenth Text REtrieval Conference (TREC) Proceedings, Gaithersburg, MD.
Cormack, G. and Lynam, T. (2005b). TREC 2005 spam track results. In The Fourteenth Text REtrieval Conference (TREC) Proceedings, Gaithersburg, MD. [link] (Accessed November 2007).
Drucker, H., Wu, D., and Vapnik, V. (1999). Support Vector Machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Laboratories Palo Alto.
Graham, P. (2002). A plan for spam. [link] (Accessed November 2007).
Grünwald, P. (2005). A Tutorial Introduction to the Minimum Description Length Principle. The MIT Press.
Jung, J. and Sit, E. (2004). An empirical study of spam traffic and the use of DNS black lists. In IMC ’04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 370–375, New York, NY, USA. ACM.
Leiba, B. and Fenton, J. (2007). DomainKeys Identified Mail (DKIM): Using digital signatures for domain verification. In Fourth Conference on Email and Anti-Spam (CEAS), Palo Alto, CA.
Matsubara, E. T., Monard, M. C., and Prati, R. C. (2007). Exploring unclassified texts using multi-view semi-supervised learning. Idea Publishing, Hershey, PA, USA.
Metsis, V., Androutsopoulos, I., and Paliouras, G. (2006). Spam filtering with naive Bayes – Which naive Bayes? In Third Conference on Email and Anti-Spam (CEAS), Palo Alto, CA.
Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin. AAAI Technical Report WS-98-05.
Sculley, D. and Wachman, G. (2007). Relaxed online SVMs for spam filtering. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–422. ACM.
Siefkes, C., Assis, F., Chhabra, S., and Yerazunis, W. S. (2004). Combining Winnow and Orthogonal Sparse Bigrams for incremental spam filtering. In PKDD ’04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 410–421, New York, NY, USA. Springer-Verlag New York, Inc.
Published
2008-07-12
How to Cite
BRAGA, Ígor Assis; LADEIRA, Marcelo. Adaptive Spam Filtering with the Minimum Description Length Principle. In: SBC UNDERGRADUATE RESEARCH CONTEST (CTIC-SBC), 27., 2008, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2008. p. 11-20.