Special Section on SIBGRAPI 2021Fast and reliable incremental dimensionality reduction for streaming data
Graphical abstract
Introduction
Prompted by the massive quantity of available data, our society is facing a paradigm shift towards making decisions based on more data-driven processes [1]. Although a positive trend, it presents several challenges for the existing visualization techniques, from visual and computational scalability to latency issues. One particular scenario that presents requirements difficult to fulfill involves streaming data. Compared to more conventional static applications, where data is first recorded in persistent tables to be later processed, in streaming applications, the data needs to be processed as it is received since it is typically discarded after that [2]. Examples include environmental monitoring [3], network intrusion detection [4], and real-time social media analysis [5].
Among the existing visualization techniques devoted to processing significant amounts of data, multidimensional projection or dimensionality reduction techniques [6] are emerging as fundamental tools by allowing the identification and analysis of similarity and neighborhood relationships and patterns [7]. Although many projection techniques can successfully handle large datasets using out-of-sample strategies [7], [8], [9], [10], [11], [12], most are not appropriate for the streaming scenario given its transient nature. A few methods currently address this limitation by using online or incremental strategies, continuously processing data, and updating the visualization. Despite their success, such methods present significant bottlenecks that impair their practical usage. Some methods impose the storage and multiple accesses to all (received) high-dimensional data [13], [14], [15], [16], [17], [18], [19], [20], being appropriate to incremental and progressive scenarios but not streaming where data continuously grow. Others create one projection function at the beginning of the process and use it for all data [21], [22], [23], in general, resulting in low-quality layouts since the function is not adapted to new structures or patterns that emerge over time. Finally, a few methods evolve the projection function to capture new incoming structures and patterns [24], [25], [26], [27], [28], [29], [30]. However, the already projected instances are not updated to comply with the new function, potentially resulting in visual artifacts produced by overlapping subsequent misaligned projections.
In this paper, we present Xtreaming, a novel incremental projection technique to address the challenging design conflict of updating over time the position of the already projected instances without storing or revisiting all high-dimensional data that has already been processed. Xtreaming combines a change detection approach, an out-of-sample projection technique, and a novel re-projection strategy to update the projection without revisiting the high-dimensional input data. Xtreaming is competitive in terms of global distance preservation compared to other streaming and incremental techniques, but it is orders of magnitude faster. In summary, the main contributions of this paper are:
- •
a precise and fast dimensionality reduction technique, called Xtreaming that can be efficiently and effectively used in streaming scenarios where data needs to be processed as received;
- •
a novel strategy to re-project the already processed data to comply with a new projection function without the need for revisiting the original high-dimensional data, a critical aspect of any technique designed to handle streaming data that is usually discarded after being processing.
The rest of the paper is organized as follows. In Section 2, we discuss related work regarding dimensionality reduction techniques for streaming scenarios. In Section 3, we formalize the problem and present a complete overview of the Xtreaming algorithm and framework. In Section 4, we present an extensive set of experiments that attest to the stability, sensibility, and quality of the proposed technique. A case study is described in Section 5, showing how Xtreaming can be used for sentiment analysis in social media. In Section 6 we discuss Xtreaming limitations and future work, and the paper is concluded in Section 7.
Section snippets
Related work
Dimensionality reduction or multidimensional projection techniques create computational models that map data instances into graphical elements preserving pairwise distances or neighborhoods [6], [31]. One existing classification splits them into two groups, the in-sample and the out-of-sample techniques [10]. While in-sample techniques produce layouts processing the data instances altogether, the out-of-sample initially select and project a small sample of the dataset and then map the remaining
Problem formalization
To set notation, let be a multidimensional dataset with a dissimilarity function between two data instances, and its mapping to the visual space with a distance function between two graphical elements. A batch (in-sample or out-of-sample) multidimensional projection technique is a bijective function that maps into preserving the dissimilarity or neighborhood structure of multidimensional datasets that are fully stored in persistent
Technique analysis and comparisons
This section presents a series of tests and comparisons to attest the proposed technique’s stability and sensibility and confirm its quality compared to other in-sample, out-of-sample, and incremental/online techniques. In these tests, we use datasets with different sizes and dimensionalities, enabling the analysis of different scenarios. The first dataset, shuttle [51], is composed of 43,500 instances with dimensions representing log information. The mammals [51] is an artificially generated
Case studies
In this section, we describe two case studies that showcase the suitability of using Xtreaming to support the visual analysis of time-varying multidimensional datasets, the first with an artificially-designed dataset, and the second with sentiment and stance analysis on data from Twitter.
Discussions and limitations
Rate of streaming. One potential problem is related to the rate in which the data is captured or produced in time-dependent streaming applications. In these scenarios, it is vital to produce the projection layout on-the-fly as the data is received. However, if the data production rate is larger than the processing time imposed by Xtreaming, some data may be discarded. One possible solution is to use a double-buffer strategy, where parts of the data can be stored while others are processed. This
Conclusions
This paper proposes a novel multidimensional projection technique called Xtreaming, which is shown to be one of the first reliable approaches for projecting streaming data applications. Xtreaming presents different novel strategies, enabling the processing of data as it is received and to adjust the projection to new emerging structures without the need for visiting the high-dimensional data more than once. The set of comparisons we provide shows that Xtreaming is comparable to existing
CRediT authorship contribution statement
Tácito Trindade de Araújo Tiburtino Neves: Conceptualization, Software, Investigation, Writing. Rafael Messias Martins: Investigation, Writing– review & editing. Danilo Barbosa Coimbra: Writing– review & editing. Kostiantyn Kucher: Writing– review & editing. Andreas Kerren: Writing– review & editing. Fernando V. Paulovich: Supervision, Writing– review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
Research carried out using the computational resources of the Center for Mathematical Sciences Applied to Industry (CeMEAI) funded by Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP) (grant 2013/07375-0). We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) .
References (64)
- et al.
Processing of massive audit data streams for real-time anomaly intrusion detection
Comput Commun
(2008) - et al.
Incremental locally linear embedding
Pattern Recognit
(2005) - et al.
A scalable supervised algorithm for dimensionality reduction on streaming data
Inform Sci
(2006) - et al.
Chapter 16 - cluster validity
Sentiment analysis: Detecting valence, emotions, and other affectual states from text
Everything, everywhere
Nature
(2006)A survey on learning from data streams: current and future trends
Prog Artif Intell
(2012)- et al.
Real-time environmental monitoring and notification for public safety
IEEE MultiMedia
(2010) - et al.
Tweet analysis for real-time event detection and earthquake reporting system development
IEEE Trans Knowl Data Eng
(2013) - et al.
Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment
IEEE Trans Vis Comput Graphics
(2019)
Local affine multidimensional projection
IEEE Trans Vis Comput Graphics
Sparse multidimensional scaling using landmark pointsTech. Rep.
Global versus local methods in nonlinear dimensionality reduction
Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering
Deep learning multidimensional projections
Inf Vis
Piece wise Laplacian-based projection for interactive data exploration and organization
Comput Graph Forum
Incremental multidimensional scaling method for database visualization
A spatio-temporal extension to isomap nonlinear dimension reduction
Nonlinear manifold learning for data stream
Incremental nonlinear dimensionality reduction by manifold learning
IEEE Trans Pattern Anal Mach Intell
Visualizing time-dependent data using dynamic t-SNE
Two-phase mapping for projecting massive data sets
Vis Comput Graph IEEE Trans
Think globally, fit locally: Unsupervised learning of low dimensional manifolds
J Mach Learn Res
S-isomap++: Multi manifold learning from streaming data
Incremental learning for robust visual tracking
Int J Comput Vis
Sequential karhunen-loeve basis extraction and its application to images
IEEE Trans Image Process
An incremental dimensionality reduction method for visualizing streaming multidimensional data
IEEE Trans Vis Comput Graphics
IDR/QR: An incremental dimension reduction algorithm via QR decomposition
IEEE Trans Knowl Data Eng
Reducing the dimensionality of data with neural networks
Science
Cited by (5)
Foreword to the special section on SIBGRAPI 2021
2022, Computers and Graphics (Pergamon)Quantum-PSO based unsupervised clustering of users in social networks using attributes
2024, Cluster ComputingA Parallel Framework for Streaming Dimensionality Reduction
2024, IEEE Transactions on Visualization and Computer GraphicsNeural network training fingerprint: visual analytics of the training process in classification neural networks
2022, Journal of Visualization