Preprocessing Profiling Model for Visual Analytics
Abstract
Analyzing and managing raw data remain challenging parts of the data analysis process, particularly during data preprocessing. Although some studies propose design implications or recommendations for visualization solutions in data analysis, they do not focus on the challenges of the preprocessing phase. Likewise, current Visual Analytics processes do not treat preprocessing as an equally important stage. With this study, we aim to contribute to the discussion of how visualization and data mining methods can be used and combined to assist data analysts during preprocessing activities. To that end, we introduce the Preprocessing Profiling Model for Visual Analytics, which comprises a set of features intended to inspire the implementation of new solutions. These features were designed from a list of insights we obtained in an interview study with thirteen data analysts. In summary, our contributions offer resources to promote a shift toward visual preprocessing.