Special Section on SIBGRAPI 2021
Contrastive analysis for scatterplot-based representations of dimensionality reduction
Introduction
The analysis of high-dimensional datasets through dimensionality reduction (DR) [1], [2], [3] presents unprecedented opportunities to understand various phenomena. Using scatterplot representations of DR results, analysts inspect clusters to understand data nuances and features’ contribution to the layout organization in the projected space.
One promising strategy to analyze DR results, called contrastive analysis [4], [5], is understanding how clusters differ. Thus, the motivation is to find and comprehend the unique characteristics of each cluster. For instance, a tool for labeling textual data would benefit from contrastive analysis to highlight the differences among the clusters (e.g., topics or terms)—these clusters would then represent candidates for new classes because each cluster has its unique characteristics. Another important application is understanding which features describe two separated groups of patients after a medical experiment [6].
Only a few works focus on providing contrastive analysis [4], [5]. Instead, the literature presents a handful of studies that support the analysis of DR results through the visualization of global information [7], [8], [9], [10], [11], emphasizing the importance of data features (attributes) given by DR techniques to organize the embedded space. A few main approaches [7], [8] use the principal components (PCs) from PCA [12] to find which features contribute to cluster formation. However, these contributions do not emphasize the unique characteristics of the clusters. On the other hand, ccPCA [4] finds contrastive information (i.e., the unique characteristics) for each cluster through the systematic application of contrastive PCA [13]. ccPCA's main limitation is inherited from PCA: its run-time execution is prohibitive for high-dimensional datasets, since the cost depends on the number of dimensions. Another approach, ContraVis [5], is only applied to textual data and cannot help interpret DR layouts, since it is itself a dimensionality reduction approach. Furthermore, the study does not present a strategy to understand the terms' contribution to the layout organization of the visual space.
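To make the idea behind contrastive PCA [13] concrete, the following is a minimal sketch, not the implementation used by ccPCA [4]. It assumes the standard formulation in which contrastive directions are the leading eigenvectors of the target covariance minus a scaled background covariance. The function name `contrastive_pca`, the parameter `alpha`, and the toy data are illustrative assumptions.

```python
import numpy as np


def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Leading eigenpairs of cov(target) - alpha * cov(background).

    Hypothetical helper for illustration; `alpha` weights the background.
    """
    target = np.asarray(target, dtype=float)
    background = np.asarray(background, dtype=float)
    c_target = np.cov(target, rowvar=False)
    c_background = np.cov(background, rowvar=False)
    contrast = c_target - alpha * c_background
    # eigh is meant for symmetric matrices; eigenvalues are returned ascending.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]


# Toy data (assumed): both groups vary strongly along feature 0,
# but only the target group has extra variance along feature 2.
rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 1.0])
target = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 3.0])

vals, vecs = contrastive_pca(target, background, alpha=1.0, n_components=2)
top_feature = int(np.argmax(np.abs(vecs[:, 0])))  # feature dominating the first direction
```

In this toy setting, plain PCA on the target would also weight feature 0 heavily, whereas the contrastive direction is dominated by the feature that varies only in the target. The dense eigendecomposition of a d × d matrix generally costs on the order of d³ operations, which is one reason such methods become slow as the number of dimensions grows.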
In this work, we propose cExpression, an approach to analyze DR results using contrastive analysis along with a carefully designed visualization technique. More precisely, we use statistical variables (p-values and t-scores) to find the most distinctive features of clusters (t-scores) and the confidence of the results (p-values). Although the t-test is a common method for feature selection in machine learning and for defining cell types in bioinformatics, it is not well explored for analyzing dimensionality reduction results. In our visualization design, users can interact with scatterplot representations of multidimensional datasets to visualize the clusters' summaries, which were designed after the definition of several requirements. We use focus+context interaction on a bipartite graph to communicate the relationship between t-scores and p-values. The focus+context interaction helps users explore a larger amount of information while inspecting small multiples of features' histograms. A heatmap representation of the most distinctive features for each cluster also helps to overview the structures. Finally, we propose an encoding strategy to simultaneously communicate the distribution of feature values in the scatterplot representation.
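The statistical step described above can be illustrated with a short sketch. It is not the cExpression implementation: it assumes a one-vs-rest comparison (each cluster against all remaining points) and Welch's variant of the t-test, neither of which is confirmed by the text here. The helper names `contrastive_scores` and `most_distinctive` and the toy data are hypothetical.

```python
import numpy as np
from scipy import stats


def contrastive_scores(X, labels, cluster):
    """Per-feature t-test: points in `cluster` versus all other points.

    Returns (t_scores, p_values), one entry per column of X.
    Assumption: one-vs-rest comparison with Welch's t-test.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    inside = X[labels == cluster]
    outside = X[labels != cluster]
    # equal_var=False selects Welch's t-test (no equal-variance assumption).
    t_scores, p_values = stats.ttest_ind(inside, outside, axis=0, equal_var=False)
    return t_scores, p_values


def most_distinctive(X, labels, cluster, k=3):
    """Indices of the k features with the largest absolute t-score."""
    t_scores, _ = contrastive_scores(X, labels, cluster)
    return np.argsort(-np.abs(t_scores))[:k]


# Toy data (assumed): feature 0 separates cluster 1, feature 1 is noise.
rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(100, 2))
X[labels == 1, 0] += 5.0

t, p = contrastive_scores(X, labels, cluster=1)
top = most_distinctive(X, labels, cluster=1, k=1)
```

Here a large absolute t-score marks a feature whose values in the cluster deviate strongly from the rest of the data, and a small p-value indicates that the deviation is unlikely to be due to chance. The sign of the t-score tells whether the feature is higher or lower inside the cluster.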
As we demonstrate in the numerical experiments, cExpression can be applied to various data types and scales to big datasets with thousands of dimensions. While the computational components of our approach generate contrastive information rapidly, our visualization design is simple and effective for analyzing even complex textual data.
In summary, our contributions are:
- A strategy to analyze and interpret dimensionality reduction through clusters using contrastive analysis;
- Novel visualization strategies to analyze the relationship among statistical variables and simultaneously visualize various features in the scatterplot;
- An annotated dataset of COVID-19 tweets retrieved from March 2020 to August 2020.
This work is organized as follows: Section 2 presents the related works; Section 3 delineates our methodology along with its motivation and the visualization design; Section 4 shows the case studies; Section 5 shows the numerical evaluation; Section 6 discusses the work; and Section 7 concludes it.
Related works
To support analysis of dimensionality reduction (DR) results, layout enrichment strategies [14] unite visualization approaches and valuable information extracted from the data in the high-dimensional space concerning low-dimensional representations. Examples include using bar charts and color encoding to understand three-dimensional projections [9] or encoding attribute variation using Delaunay triangulation to assess neighborhood relations in projections [15]. Another
cExpression—Tool for contrastive analysis of DR results
Before detailing our approach in the following sections, Fig. 1 shows the workflow for using cExpression to interpret clusters after dimensionality reduction. First, the user has to preprocess a high-dimensional dataset by applying a dimensionality reduction technique and annotating the clusters perceived in the visual space, which results in the state (A). Then, our approach uses a t-test to compute a measure of deviation (t-score) and associated confidence level (p-value) for each pair of
Case studies
To validate the proposed technique, we explore two document collections: a dataset of news articles from 2011 collected from different sources, and a dataset of tweets about COVID-19 symptoms collected within the state of São Paulo (Brazil) from March 2020 to August 2020. We also analyze multivariate data using a medical dataset in the Supplementary File.
Evaluation
To further assess how well cExpression supports the analysis of dimensionality reduction results, we compare it against well-known topic extraction techniques using the cohesion metric. Finally, we assess the run-time execution of cExpression and ccPCA [4] for various numbers of dimensions on a document collection.
Discussions and limitations
Contrastive analysis of dimensionality reduction results offers important mechanisms to understand how clusters differ in the projected space. Although the literature already presents a method for this task, we demonstrated that interactive visualizations with well-known statistics can enhance interpretation of the differences among clusters. There are a few other aspects regarding our approach, which we discuss in the following.
Cell-based encoding of scatterplot. To facilitate local and global
Conclusion
Understanding the influence of features on the formation of clusters and sub-clusters is a promising approach when analyzing dimensionality reduction results represented by scatterplots. Existing methods that address this task emphasize global characteristics, which cannot differentiate clusters. On the other hand, current methods for contrastive analysis require run times that are unrealistic for practical applications.
This paper presents a novel approach for contrastive analysis of
CRediT authorship contribution statement
Wilson E. Marcílio-Jr: Conceptualization, Methodology, Software, Data curation, Visualization, Writing – original draft. Danilo M. Eler: Conceptualization, Methodology, Supervision. Rogério E. Garcia: Conceptualization, Methodology, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research work was supported by FAPESP, Brazil (São Paulo Research Foundation), grants #2018/17881-3 and #2018/25755-8, and by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil (CAPES), grant #88887.487331/2020-00. We also thank the anonymous reviewers for their valuable comments.
References (49)
- et al. Using multiple attribute-based explanations of multidimensional projections to explore high-dimensional data. Comput Grap (2021)
- et al. Visual text mining using association rules. Comput Grap (2007)
- et al. Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Grap (2008)
- et al. Visualizing high-dimensional data using t-SNE. J Mach Learn Res (2008)
- et al. UMAP: Uniform manifold approximation and projection for dimension reduction (2018)
- et al. Supporting analysis of dimensionality reduction results with contrastive learning. IEEE Trans Vis and Comp Graph (2019)
- et al. ContraVis: Contrastive and visual topic modeling for comparing document collections
- et al. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nature Commun (2018)
- et al. Representative factor generation for the interactive visual analysis of high-dimensional data. IEEE Trans Vis Comput Graph (2012)
- et al. Uncovering representative groups in multidimensional projections. CGF (2015)
- Explaining three-dimensional dimensionality reduction plots. Inform Vis
- Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. IEEE Trans on Vis and Comp Graph
- Understanding attribute variability in multidimensional projections
- Principal component analysis
- Multiple correspondence analysis. Encycl Meas Stat
- Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment. IEEE Trans Vis Comput Graphics
- Attribute-based visual explanation of multidimensional projections
- Clustervision: Visual supervision of unsupervised clustering. IEEE Trans Vis Comput Graphics
- Linear discriminative star coordinates for exploring class and cluster separation of high dimensional data. Comput Graph Forum
- Linear discriminant analysis
- Multidimensional scaling
- Topic hypergraph: hierarchical visualization of thematic structures in long documents. Sci China Inf Sci
- Analysis of document pre-processing effects in text and opinion mining. Information