Elsevier

Computers & Graphics

Volume 101, December 2021, Pages 46-58
Special Section on SIBGRAPI 2021
Contrastive analysis for scatterplot-based representations of dimensionality reduction

https://doi.org/10.1016/j.cag.2021.08.014

Highlights

  • A new methodology to find contrastive features in datasets.

  • Hypothesis testing employed to find descriptive terms in document collections.

  • Novel visualization techniques to interpret contrastive information of multidimensional datasets.

Abstract

Cluster interpretation after dimensionality reduction (DR) is a ubiquitous part of exploring multidimensional datasets. DR results are frequently represented by scatterplots, where spatial proximity encodes similarity among data samples. In the literature, techniques support the understanding of scatterplots’ organization by visualizing the importance of the features for cluster definition with layout enrichment strategies. However, current approaches usually focus on global information, hampering the analysis whenever the focus is to understand the differences among clusters. Thus, this paper introduces a methodology to visually explore DR results and interpret clusters’ formation based on contrastive analysis. We also introduce a bipartite graph to visually interpret and explore the relationship between the statistical variables employed to understand how the data features influence cluster formation. Our approach is demonstrated through case studies, in which we explore two document collections related to news articles and tweets about COVID-19 symptoms. Finally, we evaluate our approach through quantitative results to demonstrate its robustness to support multidimensional analysis.

Introduction

The analysis of high-dimensional datasets through dimensionality reduction (DR) [1], [2], [3] presents unprecedented opportunities to understand various phenomena. Using scatterplot representations of DR results, analysts inspect clusters to understand data nuances and features’ contribution to the layout organization in the projected space.

One promising strategy to analyze DR results, called contrastive analysis [4], [5], is understanding how clusters differ. Thus, the motivation is to find and comprehend the unique characteristics of each cluster. For instance, a tool for labeling textual data would benefit from contrastive analysis to highlight the differences among the clusters (e.g., topics or terms)—these clusters would then represent candidates for new classes because each cluster has its unique characteristics. Another important application is understanding which features describe two separated groups of patients after a medical experiment [6].

Only a few works focus on providing contrastive analysis [4], [5]. Instead, the literature presents a handful of studies that support the analysis of DR results through the visualization of global information [7], [8], [9], [10], [11], emphasizing the importance that DR techniques give to data features (attributes) when organizing the embedded space. A few main approaches [7], [8] use the principal components (PCs) from PCA [12] to find which features contribute to cluster formation. However, these contributions do not emphasize the unique characteristics of the clusters. On the other hand, ccPCA [4] finds contrastive information (i.e., the unique characteristics) for each cluster through the systematic application of contrastive PCA [13]. ccPCA’s main limitation is inherited from PCA: its run time is prohibitive for high-dimensional datasets, since it takes O(m^3) in the number of dimensions m. Another approach, ContraVis [5], is only applied to textual data and cannot help interpret DR layouts since it is itself a dimensionality reduction approach. Furthermore, that study does not present a strategy to understand the terms’ contribution to the layout organization of the visual space.
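To make the cost argument concrete, the following is a minimal sketch of contrastive PCA as it is commonly described: take the leading eigenvectors of the target covariance minus a weighted background covariance. It is not the ccPCA implementation from [4]. The function name `contrastive_pca`, the parameter `alpha`, and the synthetic data are assumptions made for illustration only. The eigendecomposition of the m × m matrix is the step that scales cubically with the number of dimensions.

```python
import numpy as np


def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Leading eigenvectors of cov(target) - alpha * cov(background).

    Hypothetical helper for illustration; `alpha` weights the background.
    """
    target = np.asarray(target, dtype=float)
    background = np.asarray(background, dtype=float)
    c_target = np.cov(target, rowvar=False)          # m x m
    c_background = np.cov(background, rowvar=False)  # m x m
    # Symmetric m x m eigendecomposition: the O(m^3) step.
    eigvals, eigvecs = np.linalg.eigh(c_target - alpha * c_background)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


# Synthetic data (assumption): feature 0 has high variance everywhere,
# feature 1 has extra variance only in the target group.
rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 3)) * np.array([4.0, 1.0, 1.0])
target = rng.normal(size=(2000, 3)) * np.array([4.0, 3.0, 1.0])

components = contrastive_pca(target, background, alpha=1.0, n_components=1)
print(np.round(components[:, 0], 2))
```

In this toy setting, ordinary PCA on the target group alone would favor feature 0, whereas the contrastive direction is dominated by feature 1, the feature whose variance is specific to the target group.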

In this work, we propose cExpression, an approach to analyze DR results using contrastive analysis along with a carefully designed visualization technique. More precisely, we use statistical variables (p-values and t-scores) to find the most distinctive features of clusters (t-scores) and the confidence of the results (p-values). Although the t-test is a common method for feature selection in machine learning and for defining cell types in bioinformatics, it has not been well explored for analyzing dimensionality reduction results. In our visualization design, which follows from several requirements we define, users can interact with scatterplot representations of multidimensional datasets to visualize the clusters’ summaries. We use focus+context interaction on a bipartite graph to communicate the relationship between t-scores and p-values. The focus+context interaction helps users explore more information while inspecting small multiples of features’ histograms. A heatmap representation of the most distinctive features for each cluster also provides an overview of the structures. Finally, we propose an encoding strategy to simultaneously communicate the distribution of feature values in the scatterplot representation.
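As a rough illustration of how t-scores and p-values can rank the distinctive features of a cluster, the sketch below runs a two-sample t-test per feature. It is not the authors' implementation. Comparing each cluster against all remaining samples, using Welch's variant (`equal_var=False`), the threshold `alpha=0.05`, and the helper names `contrastive_scores` and `top_features` are all assumptions of this example.

```python
import numpy as np
from scipy import stats


def contrastive_scores(X, labels):
    """Per cluster, t-test every feature: samples in the cluster vs. the rest.

    Returns {cluster: (t_scores, p_values)} with one value per feature.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    results = {}
    for cluster in np.unique(labels):
        inside = X[labels == cluster]
        outside = X[labels != cluster]
        # Welch's t-test is an assumption of this sketch.
        t, p = stats.ttest_ind(inside, outside, axis=0, equal_var=False)
        results[int(cluster)] = (t, p)
    return results


def top_features(t_scores, p_values, k=3, alpha=0.05):
    """Up to k feature indices with the largest positive t-score and p < alpha."""
    order = np.argsort(-t_scores)
    keep = [int(j) for j in order if t_scores[j] > 0 and p_values[j] < alpha]
    return keep[:k]


# Synthetic data (assumption): two annotated clusters, five features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
labels = np.repeat([0, 1], 100)
X[labels == 0, 2] += 3.0  # feature 2 is elevated only in cluster 0
X[labels == 1, 4] += 3.0  # feature 4 is elevated only in cluster 1

scores = contrastive_scores(X, labels)
for cluster, (t, p) in scores.items():
    print(cluster, top_features(t, p))
```

Each feature is tested independently, so the cost grows linearly with the number of dimensions, in contrast to an eigendecomposition over all features.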

As we demonstrate in the numerical experiments, cExpression can be applied to various data types and scales to big datasets with thousands of dimensions. While our approach’s computational components help generate contrastive information rapidly, our visualization design is simple and effective for analyzing even complex textual data.

In summary, our contributions are:

  • A strategy to analyze and interpret dimensionality reduction through clusters using contrastive analysis;

  • Novel visualization strategies to analyze the relationship among statistical variables and simultaneously visualize various features in the scatterplot;

  • An annotated dataset of COVID-19 tweets retrieved from March 2020 to August 2020.

This work is organized as follows: Section 2 presents the related works; Section 3 describes our methodology together with its motivation and the visualization design; Section 4 shows the case studies; Section 5 presents the numerical evaluation; Section 6 discusses the work; and Section 7 concludes it.

Section snippets

Related works

To support analysis of dimensionality reduction (DR) results, layout enrichment strategies [14] unite visualization approaches and valuable information extracted from the data in the high-dimensional space concerning low-dimensional representations, usually in ℝ². Examples include using bar charts and color encoding to understand three-dimensional projections [9] or encoding attribute variation using Delaunay triangulation to assess neighborhood relations in projections in ℝ² [15]. Another

cExpression—Tool for contrastive analysis of DR results

Before detailing our approach in the following sections, Fig. 1 shows the workflow for using cExpression to interpret clusters after dimensionality reduction. First, the user has to preprocess a high-dimensional dataset by applying a dimensionality reduction technique and annotating the clusters perceived in the visual space, which results in the state (A). Then, our approach uses a t-test to compute a measure of deviation (t-score) and an associated confidence level (p-value) for each pair of

Case studies

To validate the proposed technique, we explore two document collections, a dataset of news articles from 2011 collected from different sources and a dataset of tweets about COVID-19 symptoms collected inside the São Paulo state (Brazil) territory from March 2020 to August 2020. We also analyze multivariate data using a medical dataset in the Supplementary File.

Evaluation

To further assess how well cExpression supports the analysis of dimensionality reduction results, we compare it against well-known topic extraction techniques using the cohesion metric. Finally, we assess the run time of cExpression and ccPCA [4] for various numbers of dimensions on a document collection.

Discussions and limitations

Contrastive analysis of dimensionality reduction results offers important mechanisms to understand how clusters differ in the projected space. Although the literature already presents a method for this task, we demonstrated that interactive visualizations with well-known statistics can enhance interpretation of the differences among clusters. There are a few other aspects regarding our approach, which we discuss in the following.

Cell-based encoding of scatterplot. To facilitate local and global

Conclusion

Understanding the influence of features on the formation of clusters and sub-clusters is a promising approach when analyzing dimensionality reduction results represented by scatterplots. Existing methods that address this task emphasize global characteristics, which cannot differentiate clusters. On the other hand, current methods for contrastive analysis have run times that are impractical for real applications.

This paper presents a novel approach for contrastive analysis of

CRediT authorship contribution statement

Wilson E. Marcílio-Jr: Conceptualization, Methodology, Software, Data curation, Visualization, Writing – original draft. Danilo M. Eler: Conceptualization, Methodology, Supervision. Rogério E. Garcia: Conceptualization, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research work was supported by FAPESP, Brazil (São Paulo Research Foundation), grants #2018/17881-3 and #2018/25755-8, and by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil (CAPES), grant #88887.487331/2020-00. We also thank the anonymous reviewers for their valuable comments.

References (49)

  • Tian Z. et al. Using multiple attribute-based explanations of multidimensional projections to explore high-dimensional data. Comput Graph (2021)
  • Lopes A. et al. Visual text mining using association rules. Comput Graph (2007)
  • Paulovich F.V. et al. Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph (2008)
  • Maaten L.J.P. et al. Visualizing high-dimensional data using t-SNE. J Mach Learn Res (2008)
  • McInnes L. et al. UMAP: Uniform manifold approximation and projection for dimension reduction (2018)
  • Fujiwara T. et al. Supporting analysis of dimensionality reduction results with contrastive learning. IEEE Trans Vis Comput Graph (2019)
  • Le T. et al. ContraVis: Contrastive and visual topic modeling for comparing document collections
  • Abid A. et al. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nature Commun (2018)
  • Turkay C. et al. Representative factor generation for the interactive visual analysis of high-dimensional data. IEEE Trans Vis Comput Graph (2012)
  • Joia P. et al. Uncovering representative groups in multidimensional projections. Comput Graph Forum (2015)
  • Coimbra D.B. et al. Explaining three-dimensional dimensionality reduction plots. Inform Vis (2016)
  • Stahnke J. et al. Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. IEEE Trans Vis Comput Graph (2016)
  • de Carvalho Pagliosa L. et al. Understanding attribute variability in multidimensional projections
  • Jolliffe I. Principal component analysis (1986)
  • Abdi H. et al. Multiple correspondence analysis. Encycl Meas Stat (2007)
  • Nonato L.G. et al. Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment. IEEE Trans Vis Comput Graph (2018)
  • Silva R.R.O.d. et al. Attribute-based visual explanation of multidimensional projections
  • Marcilio WE, Eler DM, Garcia RE. An approach to perform local analysis on multidimensional projection. In: 30th...
  • Kwon B. et al. Clustervision: Visual supervision of unsupervised clustering. IEEE Trans Vis Comput Graph (2018)
  • Wang Y. et al. Linear discriminative star coordinates for exploring class and cluster separation of high dimensional data. Comput Graph Forum (2017)
  • Izenman A.J. Linear discriminant analysis
  • Kruskal J. et al. Multidimensional scaling (1978)
  • Wang G. et al. Topic hypergraph: hierarchical visualization of thematic structures in long documents. Sci China Inf Sci (2013)
  • Eler D. et al. Analysis of document pre-processing effects in text and opinion mining. Information (2018)