Real-Time Classification of Data Streams under Uncertain Training

  • Alexandre Davis UFMG
  • Adriano Veloso UFMG

Abstract


The rapid expansion of social networks has increased the demand for machine learning algorithms (e.g., classifiers) that can operate on real-time data flows, such as message streams. The rapidly changing vocabulary of such streams makes it difficult to create adequate training sets, and keeping them up to date is an even greater challenge. In this paper, we propose a novel way to perform real-time classification on dynamic data streams. Our proposal automatically generates training sets by applying the Expectation-Maximization approach to a set of messages composed of positive and uncertain examples. To demonstrate that our solution is general, we present two application scenarios that use the proposed real-time classifier: disambiguating named entity references and performing sentiment analysis in data streams. We also assess the efficiency of the technique. The Observatório da Web project currently runs the proposed approach for real-time disambiguation in its operational pipeline.
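The idea of generating a training set from positive and uncertain examples can be illustrated with a minimal sketch. The abstract does not specify the underlying classifier, so the code below substitutes a two-class multinomial naive Bayes model purely for illustration; it is not the authors' implementation. The function names (`train_nb`, `posterior_pos`, `em_label`), the toy messages, the Laplace smoothing, and the choice to initialize uncertain messages as tentative negatives are all assumptions.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab):
    """M-step: fit a two-class naive Bayes model from soft labels.

    weights[i] is the current estimate of P(positive | docs[i]).
    """
    # Smoothed class prior, always strictly between 0 and 1.
    prior_pos = (sum(weights) + 1.0) / (len(docs) + 2.0)
    counts = {True: Counter(), False: Counter()}
    for doc, w in zip(docs, weights):
        for tok in doc:
            counts[True][tok] += w
            counts[False][tok] += 1.0 - w
    loglik = {}
    for label in (True, False):
        total = sum(counts[label].values()) + len(vocab)  # Laplace smoothing
        loglik[label] = {
            tok: math.log((counts[label][tok] + 1.0) / total) for tok in vocab
        }
    return prior_pos, loglik


def posterior_pos(doc, prior_pos, loglik):
    """E-step for one message: P(positive | doc) under the current model."""
    log_p = math.log(prior_pos) + sum(loglik[True][t] for t in doc)
    log_n = math.log(1.0 - prior_pos) + sum(loglik[False][t] for t in doc)
    m = max(log_p, log_n)  # subtract the max for numerical stability
    e_p, e_n = math.exp(log_p - m), math.exp(log_n - m)
    return e_p / (e_p + e_n)


def em_label(positive, uncertain, iterations=10):
    """Return P(positive) for each uncertain message after EM refinement."""
    docs = positive + uncertain
    vocab = {tok for doc in docs for tok in doc}
    # Assumption: uncertain messages start as tentative negatives.
    weights = [1.0] * len(positive) + [0.0] * len(uncertain)
    for _ in range(iterations):
        prior_pos, loglik = train_nb(docs, weights, vocab)
        # Known positives keep weight 1; only uncertain messages are re-estimated.
        weights = [1.0] * len(positive) + [
            posterior_pos(doc, prior_pos, loglik) for doc in uncertain
        ]
    return weights[len(positive):]


# Hypothetical toy stream: is "apple" a reference to the company?
positive = [["apple", "iphone"], ["apple", "ipad"], ["apple", "mac"]]
uncertain = [
    ["apple", "iphone", "launch"],
    ["apple", "pie", "recipe"],
    ["pie", "recipe", "sugar"],
]
scores = em_label(positive, uncertain)
```

In this sketch, uncertain messages whose score ends up high could be added to the training set as positives and those with a low score as negatives. Where to place that cut-off is a design choice the sketch leaves open.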

Published
2012-07-16
DAVIS, Alexandre; VELOSO, Adriano. Real-Time Classification of Data Streams under Uncertain Training. In: SBC UNDERGRADUATE RESEARCH CONTEST (CTIC-SBC), 31., 2012, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 21-30.