Adaptive Spam Filtering with the Minimum Description Length Principle
Abstract
The proliferation of spamming in e-mail has motivated the adoption of filters that learn to automatically separate legitimate messages from unwanted ones. This paper presents a classification scheme based on the Minimum Description Length (MDL) principle. Experiments on the TREC 2005 and TREC 2006 message corpora show that the scheme achieves results comparable to those of state-of-the-art techniques, while standing out for its message processing speed. The strong performance of MDL classification inspired the creation of UnBeaten: a filter aimed at e-mail users and integrated with the Mozilla Thunderbird e-mail reader.
Published
12/07/2008
How to Cite
BRAGA, Ígor Assis; LADEIRA, Marcelo. Filtragem Adaptativa de Spam com o Princípio Minimum Description Length. In: CONCURSO DE TRABALHOS DE INICIAÇÃO CIENTÍFICA DA SBC (CTIC-SBC), 27., 2008, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2008. p. 11-20.