Analysis of Machine Learning Methods for Automatic Detection of Spam Hosts (Análise de Métodos de Aprendizagem de Máquina para Detecção Automática de Spam Hosts)

  • Renato Moraes Silva UNICAMP
  • Tiago A. Almeida UFSCar
  • Akebo Yamakami UNICAMP

Abstract


Web spamming is one of the main problems affecting the quality of search engines. A growing number of web pages use this technique to achieve better positions in search results. The main motivation is profit from the online advertising market, along with attacks on Internet users through malware that steals information to facilitate bank theft. Given this scenario, this paper presents an analysis of machine learning techniques for detecting spam hosts. Experiments performed on a large, real, public dataset indicate that ensembles of decision trees are promising for spam host detection.

Published
2012-11-19
SILVA, Renato Moraes; ALMEIDA, Tiago A.; YAMAKAMI, Akebo. Análise de Métodos de Aprendizagem de Máquina para Detecção Automática de Spam Hosts. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 12., 2012, Curitiba. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 2-15. DOI: https://doi.org/10.5753/sbseg.2012.20532.
