Analysis of Machine Learning Methods for Automatic Spam Host Detection

  • Renato Moraes Silva UNICAMP
  • Tiago A. Almeida UFSCar
  • Akebo Yamakami UNICAMP

Abstract


Web spamming is one of the main problems affecting the quality of search engines. A growing number of web pages use this technique to obtain better positions in search results. The main motivation is profit from the online advertising market, along with attacks on Internet users through malware that steals information to facilitate bank fraud. This work therefore presents an analysis of machine learning techniques applied to spam host detection. Experiments on a real, public, large-scale dataset indicate that ensembles of tree-based methods are promising for the task of detecting spam hosts.

Published
19 November 2012
SILVA, Renato Moraes; ALMEIDA, Tiago A.; YAMAKAMI, Akebo. Análise de Métodos de Aprendizagem de Máquina para Detecção Automática de Spam Hosts. In: SIMPÓSIO BRASILEIRO DE SEGURANÇA DA INFORMAÇÃO E DE SISTEMAS COMPUTACIONAIS (SBSEG), 12., 2012, Curitiba. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 2-15. DOI: https://doi.org/10.5753/sbseg.2012.20532.