Elsevier

Computers & Graphics

Volume 102, February 2022, Pages 233-244
Computers & Graphics

Special Section on SIBGRAPI 2021
Fast and reliable incremental dimensionality reduction for streaming data

https://doi.org/10.1016/j.cag.2021.08.009Get rights and content

Highlights

  • A novel incremental Dimensionality Reduction technique for streaming scenarios.

  • A re-projection strategy to update visual representation without revisiting input data.

  • A visualization that presents better trade-off between quality and runtime compared to other approaches.

Abstract

Streaming data applications are becoming more common due to the ability of different information sources to continuously capture or produce data, such as sensors and social media. Although there are recent advances, most visualization approaches, particularly Dimensionality Reduction (DR) techniques, cannot be directly applied in such scenarios due to the transient nature of streaming data. A few DR methods currently address this limitation using online or incremental strategies, continuously updating the visualization as data is received. Despite their relative success, most impose the need to store and access the data multiple times to produce a complete projection, not being appropriate for streaming where data continuously grow. Others do not impose such requirements but cannot update the position of the data already projected, potentially resulting in visual artifacts. This paper presents Xtreaming, a novel incremental DR technique that continuously updates the visual representation to reflect new emerging structures or patterns without visiting the high-dimensional data more than once. Our tests show that in streaming scenarios where data is not fully stored in-memory, Xtreaming is competitive in terms of quality compared to other streaming and incremental techniques while being orders of magnitude faster.

Introduction

Prompted by the massive quantity of available data, our society is facing a paradigm shift towards making decisions based on more data-driven processes [1]. Although a positive trend, it presents several challenges for the existing visualization techniques, from visual and computational scalability to latency issues. One particular scenario that presents requirements difficult to fulfill involves streaming data. Compared to more conventional static applications, where data is first recorded in persistent tables to be later processed, in streaming applications, the data needs to be processed as it is received since it is typically discarded after that [2]. Examples include environmental monitoring [3], network intrusion detection [4], and real-time social media analysis [5].

Among the existing visualization techniques devoted to processing significant amounts of data, multidimensional projection or dimensionality reduction techniques [6] are emerging as fundamental tools by allowing the identification and analysis of similarity and neighborhood relationships and patterns [7]. Although many projection techniques can successfully handle large datasets using out-of-sample strategies [7], [8], [9], [10], [11], [12], most are not appropriate for the streaming scenario given its transient nature. A few methods currently address this limitation by using online or incremental strategies, continuously processing data, and updating the visualization. Despite their success, such methods present significant bottlenecks that impair their practical usage. Some methods impose the storage and multiple accesses to all (received) high-dimensional data  [13], [14], [15], [16], [17], [18], [19], [20], being appropriate to incremental and progressive scenarios but not streaming where data continuously grow. Others create one projection function at the beginning of the process and use it for all data  [21], [22], [23], in general, resulting in low-quality layouts since the function is not adapted to new structures or patterns that emerge over time. Finally, a few methods evolve the projection function to capture new incoming structures and patterns  [24], [25], [26], [27], [28], [29], [30]. However, the already projected instances are not updated to comply with the new function, potentially resulting in visual artifacts produced by overlapping subsequent misaligned projections.

In this paper, we present Xtreaming, a novel incremental projection technique to address the challenging design conflict of updating over time the position of the already projected instances without storing or revisiting all high-dimensional data that has already been processed. Xtreaming combines a change detection approach, an out-of-sample projection technique, and a novel re-projection strategy to update the projection without revisiting the high-dimensional input data. Xtreaming is competitive in terms of global distance preservation compared to other streaming and incremental techniques, but it is orders of magnitude faster. In summary, the main contributions of this paper are:

  • a precise and fast dimensionality reduction technique, called Xtreaming that can be efficiently and effectively used in streaming scenarios where data needs to be processed as received;

  • a novel strategy to re-project the already processed data to comply with a new projection function without the need for revisiting the original high-dimensional data, a critical aspect of any technique designed to handle streaming data that is usually discarded after being processing.

The rest of the paper is organized as follows. In Section 2, we discuss related work regarding dimensionality reduction techniques for streaming scenarios. In Section 3, we formalize the problem and present a complete overview of the Xtreaming algorithm and framework. In Section 4, we present an extensive set of experiments that attest to the stability, sensibility, and quality of the proposed technique. A case study is described in Section 5, showing how Xtreaming can be used for sentiment analysis in social media. In Section 6 we discuss Xtreaming limitations and future work, and the paper is concluded in Section 7.

Section snippets

Related work

Dimensionality reduction or multidimensional projection techniques create computational models that map data instances into graphical elements preserving pairwise distances or neighborhoods [6], [31]. One existing classification splits them into two groups, the in-sample and the out-of-sample techniques [10]. While in-sample techniques produce layouts processing the data instances altogether, the out-of-sample initially select and project a small sample of the dataset and then map the remaining

Problem formalization

To set notation, let X={x1,x2,,xn}Rm be a multidimensional dataset with δ(xi,xj) a dissimilarity function between two data instances, and Y={y1,y2,,yn}R2 its mapping to the visual space with d(yi,yj) a distance function between two graphical elements. A batch (in-sample or out-of-sample) multidimensional projection technique is a bijective function f:XY that maps X into Y preserving the dissimilarity or neighborhood structure of multidimensional datasets that are fully stored in persistent

Technique analysis and comparisons

This section presents a series of tests and comparisons to attest the proposed technique’s stability and sensibility and confirm its quality compared to other in-sample, out-of-sample, and incremental/online techniques. In these tests, we use datasets with different sizes and dimensionalities, enabling the analysis of different scenarios. The first dataset, shuttle [51], is composed of 43,500 instances with 9 dimensions representing log information. The mammals [51] is an artificially generated

Case studies

In this section, we describe two case studies that showcase the suitability of using Xtreaming to support the visual analysis of time-varying multidimensional datasets, the first with an artificially-designed dataset, and the second with sentiment and stance analysis on data from Twitter.

Discussions and limitations

Rate of streaming. One potential problem is related to the rate in which the data is captured or produced in time-dependent streaming applications. In these scenarios, it is vital to produce the projection layout on-the-fly as the data is received. However, if the data production rate is larger than the processing time imposed by Xtreaming, some data may be discarded. One possible solution is to use a double-buffer strategy, where parts of the data can be stored while others are processed. This

Conclusions

This paper proposes a novel multidimensional projection technique called Xtreaming, which is shown to be one of the first reliable approaches for projecting streaming data applications. Xtreaming presents different novel strategies, enabling the processing of data as it is received and to adjust the projection to new emerging structures without the need for visiting the high-dimensional data more than once. The set of comparisons we provide shows that Xtreaming is comparable to existing

CRediT authorship contribution statement

Tácito Trindade de Araújo Tiburtino Neves: Conceptualization, Software, Investigation, Writing. Rafael Messias Martins: Investigation, Writing– review & editing. Danilo Barbosa Coimbra: Writing– review & editing. Kostiantyn Kucher: Writing– review & editing. Andreas Kerren: Writing– review & editing. Fernando V. Paulovich: Supervision, Writing– review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Research carried out using the computational resources of the Center for Mathematical Sciences Applied to Industry (CeMEAI) funded by Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP) (grant 2013/07375-0). We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) .

References (64)

  • JoiaP. et al.

    Local affine multidimensional projection

    IEEE Trans Vis Comput Graphics

    (2011)
  • De SilvaV. et al.

    Sparse multidimensional scaling using landmark pointsTech. Rep.

    (2004)
  • SilvaV.d. et al.

    Global versus local methods in nonlinear dimensionality reduction

  • BengioY. et al.

    Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering

  • EspadotoM. et al.

    Deep learning multidimensional projections

    Inf Vis

    (2020)
  • PaulovichF.V. et al.

    Piece wise Laplacian-based projection for interactive data exploration and organization

    Comput Graph Forum

    (2011)
  • BasalajW.

    Incremental multidimensional scaling method for database visualization

  • Alsakran J, Chen Y, Zhao Y, Yang J, Luo D. STREAMIT: Dynamic visualization and interactive exploration of text streams....
  • JenkinsO.C. et al.

    A spatio-temporal extension to isomap nonlinear dimension reduction

  • LawM.H. et al.

    Nonlinear manifold learning for data stream

  • LawM.H.C. et al.

    Incremental nonlinear dimensionality reduction by manifold learning

    IEEE Trans Pattern Anal Mach Intell

    (2006)
  • Schuon S, Durkovic M, Diepold K, Scheuerle J, Markward S. Truly incremental locally linear embedding. In: 1st...
  • RauberP.E. et al.

    Visualizing time-dependent data using dynamic t-SNE

  • PaulovichF. et al.

    Two-phase mapping for projecting massive data sets

    Vis Comput Graph IEEE Trans

    (2010)
  • SaulL.K. et al.

    Think globally, fit locally: Unsupervised learning of low dimensional manifolds

    J Mach Learn Res

    (2003)
  • MahapatraS. et al.

    S-isomap++: Multi manifold learning from streaming data

  • Leng Y, Zhang L, Yang J. Locally Linear Embedding algorithm based on OMP for incremental learning. In: 2014...
  • RossD.A. et al.

    Incremental learning for robust visual tracking

    Int J Comput Vis

    (2008)
  • LeveyA. et al.

    Sequential karhunen-loeve basis extraction and its application to images

    IEEE Trans Image Process

    (2000)
  • FujiwaraT. et al.

    An incremental dimensionality reduction method for visualizing streaming multidimensional data

    IEEE Trans Vis Comput Graphics

    (2020)
  • YeJ. et al.

    IDR/QR: An incremental dimension reduction algorithm via QR decomposition

    IEEE Trans Knowl Data Eng

    (2005)
  • HintonG.E. et al.

    Reducing the dimensionality of data with neural networks

    Science

    (2006)
  • View full text