Assessing Depth Perception in Virtual Environments: A Comprehensive Framework

Understanding how humans perceive depth and interact with virtual environments is a challenging task. It involves investigating how features of these environments affect depth perception, which is crucial for tasks such as object manipulation and navigation that require interpreting spatial information. This article presents a comprehensive (general, extensible and flexible) framework to assess depth perception in different virtual environments, supporting the development of more effective and immersive virtual experiences. The approach can assist developers in decision-making regarding different approaches for assessing depth perception in virtual environments, considering stereoscopic and monoscopic visualization techniques. The framework considers parameters such as the distance between the user and virtual objects and the sizes of virtual objects. Metrics such as hit rate, response time, and presence questionnaire responses were used to assess depth perception. Previous experiments (anaglyph and shutter glasses) are presented, as well as new experiments considering CAVE environments with and without anaglyph glasses.


Introduction
An Immersive Virtual Environment (IVE) is a Virtual Reality (VR) system composed of a three-dimensional (3D) environment developed to provide interaction between a human participant and a world simulated by a computer [Slater and Usoh, 1993]. An IVE should offer participants or users the feeling of being in a world different from the one where their real bodies are physically located. This goal may be accomplished by using visual, auditory and haptic devices, which provide sensory inputs to users.
Understanding how people explore IVEs is crucial for many applications, such as designing VR content, developing new image compression algorithms, or learning computational models of saliency or visual attention [Sitzmann et al., 2017]. Stereoscopic techniques, such as color filtering, light shutter and polarized light, are visualization resources that can offer the feeling of immersion in these environments.
Stereoscopy is the acquisition and projection of images of a scene to the left and right eyes at the same time, for conversion into a single image by the brain [Yanoff and Duker, 2018]. This mechanism occurs because humans have binocular vision: the two eyes capture two different images of a scene, and the brain interprets these images to provide the perception of the depth of the observed scene.
In a previous systematic review [Silva et al., 2016], we found studies concerning the assessment of effects provided by stereoscopic techniques. Certain effects have been evaluated: immersion [McMahan et al., 2006; Slater et al., 2010] and depth [Vinnikov and Allison, 2014; Livatino et al., 2015]. The objective of evaluating such effects usually refers to checking the realism and usefulness of IVEs. The effects are studied according to parameters, also called effects in some studies, such as distance, which refers to the interval perceived from the viewer to the target (egocentric) or from one target to another (exocentric) [Geuss et al., 2012]. Immersion refers to how much technology can provide an inclusive, extensive, surrounding, and vivid illusion of reality to the senses of an observer [Slater and Usoh, 1993]. Finally, depth refers to the 3D visual perception of a scene [Armbrüster et al., 2008].
The relationship between movement and vision in IVEs has been fairly well explored under several evaluation approaches. The literature encompasses studies aimed at validating specific IVEs, investigating depth, distance, immersion and some variations [Cecotti, 2022; Hattori et al., 2022; Leopardi et al., 2021; Ochs et al., 2019; Thalmann et al., 2016; dos Santos et al., 2017], as well as works aimed at investigating such effects from IVEs built exclusively to execute such evaluations [Ng et al., 2016; Lin et al., 2019; e Silva and Nunes, 2015]. However, we have not found studies proposing more general methods that could be replicated in different experiments.
In the literature, there is no consensus on the use of stereoscopy in IVEs for the performance of some tasks, nor on the degree of adequacy of different stereoscopic techniques for different contexts. As stereoscopic techniques become more diverse, it is necessary to establish methods capable of measuring and comparing the depth perception provided by different techniques within different contexts, also comparing them with the monoscopic technique.
This article presents a framework to evaluate depth perception in different IVEs, comparing multiple visualization techniques, especially stereoscopic ones. A developer can select objective and subjective metrics, the features to evaluate, and a small number of users to test. Although some understanding of the variables used in an IVE is desirable, the developer does not need previous research experience, since a guide with examples and detailed explanations is provided together with the framework itself.
The paper is organized as follows: Section 2 discusses related work on the evaluation of stereoscopic techniques; Section 3 presents the framework; Section 4 presents two experiments conducted to illustrate the framework's application; the results are shown in Section 5; Section 6 indicates the framework's benefits and constraints; and Section 7 presents some final remarks.

Related Work
Current literature presents evaluation studies that compare stereoscopic technologies in IVEs, as well as validate hypotheses regarding the influence that depth perception has on users' performance. Considering a first scope, some studies can be characterized by ad-hoc experiments designed and conducted within the context of each IVE [Leopardi et al., 2021; Ochs et al., 2019; Thalmann et al., 2016; dos Santos et al., 2017]. In contrast, considering a second scope, few studies make efforts to propose evaluation methods applied to a particular context, where IVEs are developed strictly to investigate specific effects and depth perception [Cecotti, 2022; Hattori et al., 2022; Lin et al., 2019; Zhao et al., 2020; Vienne et al., 2020].
The studies conducted by Cecotti [2022] and Hattori et al. [2022] are examples of the first-mentioned scope. According to Cecotti [2022], VR has a key impact on users' immersion in learning activities. In his work, a serious game in fully immersive VR related to astronomy education was proposed and assessed with undergraduate students. Hattori et al. [2022] evaluated users' performance in dental training simulators. They found that unique characteristics of VR, such as the simulated cutting sensation and the simulated 3D images created by stereo viewers, affect performance.
Considering the same scope, some studies adapt presence questionnaires available in the previous literature [Slater et al., 1994; Witmer and Singer, 1998; Lessiter et al., 2001; Schubert et al., 2001] and use them to analyze subjective data on Likert scales. These studies usually evaluate IVEs such as a virtual medical training room [Ochs et al., 2019], a virtual museum system [Leopardi et al., 2021] and a virtual volleyball game [Thalmann et al., 2016], comparing Head-Mounted Display (HMD), desktop, Oculus Rift, Cave Automatic Virtual Environment (CAVE) and auto-stereoscopic displays.
Few studies analyze objective data in addition to subjective data, such as Dos Santos et al.'s work [dos Santos et al., 2017] and their IVE to teach robotics. They used automatic reports, time spent and movement precision as quantitative data, as well as a questionnaire for qualitative analysis, to compare different stereoscopic technologies. These comparisons aim to investigate which one, for example, increases precision and decreases time in tasks.
In a study conducted by Lin et al. [2019], virtual targets rendered using an HMD and a Stereoscopic Widescreen Display (SWD) were presented to participants, who had to estimate distances by direct reaching, while accuracy and task completion time were computed. Zhao et al. [2020] evaluated distance stereoacuity, where participants executed a searching task aiming to analyze the distance between separate images, based on red-green anaglyphs, polarized light technology, active shutter and autostereoscopic displays. In Vienne et al. [2020], participants executed manipulation tasks to judge and adjust the angles of a virtual dihedral in an L-shaped VR system, which considered an HMD and a CAVE for a depth perception evaluation. These studies are examples of the second-mentioned scope.
As observed, studies evaluate stereo effects considering different stereoscopic techniques, and the evaluation is generally specific to one system. To the best of our knowledge, there is no systematic, flexible and extensible method that considers both objective and subjective data to evaluate different IVEs with different techniques. The previously cited studies contributed to building our approach, since they indicated parameters to be evaluated, the types of environments and tasks that the framework should consider, as well as tools to gather data for the subjective score.

Framework Description
The framework offers a way to evaluate depth perception by comparing different visualization techniques, especially stereoscopic techniques, in the same IVE. It is intended for IVE developers, who are responsible for choosing and implementing stereoscopic techniques.
The following concepts are considered in this work. Effect is the product of the stereoscopic technique (here, depth perception). Parameters are factors that can influence an effect; in this work, the parameters considered are the distance and size of virtual objects (Section 3.1), since we identified these factors as the most evaluated in the literature [Silva et al., 2016]. Metrics are data collected during the execution of a task to indicate qualitative and quantitative results; here, the metrics considered are hit rate, error rate, time and presence questionnaire responses.
Considering that immersion and presence are studied in the context of virtual environments, immersion can be defined as a medium's technological capacity to provide realistic experiences that can put users in another reality, removing them from their physical reality. Certain features, such as audio and visual quality, frame rate, field of view and stereoscopy, can influence the immersion offered by a system. Presence is the subjective experience of these users in the mediated virtual environment [Oh et al., 2018]. It is the feeling of being in another place, a virtual place different from the physical one where the user actually is, a sensation of being in the virtual environment as opposed to the real one [Meehan et al., 2002]. Presence is traditionally considered as the psychological perception of "being" in the virtual environment in which one is immersed (Heeter [1992]; Sheridan et al. [1992]; Steuer [1992]; Witmer and Kline [1998]), and it can be measured using questionnaires, which can show how connected and engaged a user feels within the virtual space [Grassini and Laumann, 2020]. Presence is about how users perceive and experience a virtual environment on a cognitive and emotional level. It is not solely dependent on technological factors but is also influenced by individual perception and engagement.
Figure 1 presents the main steps of the framework, which are further detailed in the following subsections.

Environment Preparation
This first step is the preparation of the IVE that will be used in the experiment, in order to gather the objective and subjective metric data. The IVE must comply with the following requirements: (i) to have versions that use different visualization techniques (stereoscopic and monoscopic), one version for each technique to be evaluated; (ii) to include tasks that allow gathering performance data related to the tasks to be completed (e.g., error/hit rates and time spent); and (iii) to allow the creation of at least two different scenarios through variations in the parameters of objects with respect to the observer's point of view.
The different scenarios aim to avoid potential bias in the evaluation of a technique due to the characteristics of the IVE. Thus, when we vary parameters such as the size and distance of the virtual objects relative to the virtual camera, we favor a more impartial evaluation. Besides the requirements above, the experimental design must be planned and implemented carefully in order to avoid the interference of confounding factors, i.e., overlooked experimental conditions whose effects cannot be distinguished from those of the techniques to be compared [Oehlert, 2010]. For example, the order of techniques to which users are submitted may influence the results, insofar as successive interactions of users with the IVE may lead to effects of fatigue or adaptation along the tasks. To mitigate this risk, an experimental block design must be applied, where each block (group of users) is characterized by a sequence of techniques to be used by the participants [Oehlert, 2010]. The significance test procedure proposed in Section 3.4.1 includes extensions for block designs. In Section 4.3, we show examples of varying such parameters, in which four different scenarios were generated by changing objects' distances and sizes.
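As an illustration, such a counterbalanced block design can be sketched in a few lines of Python. This is a minimal sketch of ours, not part of the framework's code; the names (TECHNIQUES, assign_blocks) are illustrative:

```python
# Minimal sketch of a counterbalanced block design (illustrative names).
# Each block is one ordering of the techniques; users are assigned to the
# orderings round-robin, so every ordering is used equally often and
# fatigue/adaptation effects are balanced across techniques.
from itertools import permutations

TECHNIQUES = ["TAT", "CAT", "SGT"]

def assign_blocks(n_users, techniques=TECHNIQUES):
    orders = list(permutations(techniques))
    return [orders[i % len(orders)] for i in range(n_users)]
```

With three techniques there are six possible orderings; assigning users to them cyclically makes each technique appear in each position a similar number of times.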

Objective Data Acquisition
The objective data acquisition step consists of gathering data from users during their interactions with the IVE.
The objective evaluation consists of capturing performance metrics related to the users' interactions with the system. In the experiments conducted (Section 4), the metrics considered were error rate (positioning objects in the wrong place), hit rate (collisions with suitable objects) and time to complete the task. Nonetheless, the framework is extensible to other metrics: data may be gathered manually by an external observer during the user's interactions, or the source code may be changed to collect additional data automatically.

Subjective Data Acquisition
The subjective data acquisition step consists of gathering data from users after their IVE interactions. When users have completed all tasks with the same visualization technique, they answer a questionnaire with their opinions about the perceived depth.
Based on the literature, we defined a questionnaire (Tables 5 and 6) for our experiments, which considers ten levels of possible responses, adapted from Witmer and Singer's Presence Questionnaire (PQ) [Witmer and Singer, 1998]. The statistical analysis routines included in the framework are able to deal with any questionnaire composed of ordered single-answer questions, on any ordered scale (i.e., lower levels representing more negative answers and upper levels more positive answers, or vice-versa).

Statistical Model
Once the experimental phase has finished and usage data has been gathered, the statistical analysis is conducted in order to assign objective and subjective scores to each technique. Essentially, the scores are computed through pairwise comparisons between techniques, in which each one earns or loses points if it is significantly better or worse than the other, at a prescribed significance level. The net balance of each technique (wins − losses) is then converted into a more intuitive scale.

Significance Test Procedure
The significance test procedure for comparisons between techniques is based on randomization tests, a subclass of the statistical tests called permutation tests [Edgington and Onghena, 2007]. In a permutation test, the p-value is given as the proportion of data permutations providing a test statistic (e.g., the difference between sample means) at least as extreme as that obtained in the experimental results. Randomization tests share the principles of permutation tests, except that the p-value is not computed over all data permutations (which is usually unfeasible even for moderate sample sizes), but over a subset of randomly generated permutations.
In contrast to traditional parametric tests (e.g., t-tests and ANOVA (Analysis of Variance)), randomization tests have several theoretical advantages: (i) they allow valid statistical inferences about experimental treatment effects on non-probabilistic samples (also called "convenience samples"), which are typical in experiments conducted in the Computer Science field; (ii) they do not require any assumptions about the distribution of the variables being tested; (iii) they are less sensitive to skewed distributions and outliers, which are frequent for some metrics in our context (e.g., time to complete the task); and (iv) they do not depend on asymptotic approximations valid only for large sample sizes [Edgington and Onghena, 2007].
Algorithm 1 presents a simplified version of the randomization test procedure adopted in our framework ("RANDTEST") for the pairwise comparison between measurements provided by two visualization techniques, regarding a metric. x 1 and x 2 denote vectors of size N (the number of users), where x j i denotes the observation of user i under technique j, i = 1, . . ., N and j = 1, 2. Briefly, the procedure starts by computing the test statistic S x from the original observations, as detailed in the next paragraph. Next, it randomly swaps metric values between techniques, i.e., some users are randomly drawn and their observations are swapped between techniques 1 and 2 (lines 5-8); the corresponding value of the test statistic for the permuted data, S y , is then computed (line 9). This step is repeated B times and the p-value is computed as the proportion of permuted data such that |S y | ≥ |S x | (lines 10-11).
The computation of the statistic S x is described in the second procedure ("STATISTICS") of Algorithm 1. For quantitative (discrete/continuous) variables, namely users' performance metrics, S x is the average of the differences between the techniques' outcomes (line 3). For ordinal variables, such as Likert-type variables, S x is the standardized difference N pos − N neg , where N pos and N neg denote, respectively, the number of times x 1 i > x 2 i and x 1 i < x 2 i (lines 5-7); ties between x 1 i and x 2 i are ignored. This statistic is based on Putter's sign test [Putter, 1955], a robust procedure that consistently holds its significance level and provides good comparative test power, even under a moderate prevalence of ties [Coakley and Heise, 1996].
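The RANDTEST and STATISTICS procedures can be sketched in Python as follows. This is a hedged reimplementation based on the description above, not the authors' original code; the function names simply mirror the algorithm for readability:

```python
# Sketch of RANDTEST/STATISTICS for two paired vectors x1, x2 of one metric.
import numpy as np

def statistics(x1, x2, vtype):
    """S_x: mean difference for quantitative metrics, standardized sign
    statistic (Putter's sign test; ties ignored) for ordinal ones."""
    if vtype == "quantitative":
        return np.mean(np.asarray(x1, float) - np.asarray(x2, float))
    n_pos = np.sum(np.asarray(x1) > np.asarray(x2))
    n_neg = np.sum(np.asarray(x1) < np.asarray(x2))
    n = n_pos + n_neg
    return 0.0 if n == 0 else (n_pos - n_neg) / np.sqrt(n)

def randtest(x1, x2, vtype="quantitative", B=10_000, rng=None):
    """Two-tailed randomization test; returns (S_x, p-value)."""
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    s_x = statistics(x1, x2, vtype)
    count = 0
    for _ in range(B):
        # Randomly draw users and swap their observations between techniques
        swap = rng.random(len(x1)) < 0.5
        y1 = np.where(swap, x2, x1)
        y2 = np.where(swap, x1, x2)
        if abs(statistics(y1, y2, vtype)) >= abs(s_x):
            count += 1
    return s_x, count / B
```

Identical vectors yield S x = 0 and a p-value of 1, while a consistent difference across all users yields a p-value near 0.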

Algorithm 1
Randomization test procedure for pairwise comparison between techniques (repeated measures, two-tailed test)

To determine B, Jockel [1986] established a criterion based on the test power, defined as the probability of a significance test procedure rejecting the null hypothesis when it is false. The author derived an upper bound for the decrease in the power of a randomization test compared with its analogous complete permutation test, as a function of B and the significance level for rejection of the null hypothesis, α. In this work, we considered B = 10,000 and a significance level α = 0.1, which led to a decrease in power of less than 2%.
The randomization test version presented in Algorithm 1 has some simplifications. It only considers a two-tailed test, since the comparisons between S y and S x are made in absolute value and therefore ignore the sign of the differences. The extension to one-tailed tests is straightforward, by properly adapting the condition in line 10. Vectors x 1 and x 2 are assumed to have only one observation per user. As mentioned earlier, it is advisable that the objective metrics be gathered under different scenarios (e.g., permutations of the objects' size and distance parameters), which implies that each vector has multiple observations per user, more precisely, K observations per user, where K denotes the number of scenarios. The extension to this case is also straightforward: all data from the same user are swapped between x 1 and x 2 once the user has been randomly selected. This guarantees that the data for each user are treated in blocks, in such a way that all configurations of object size and distance are equally distributed between both techniques, thus preventing performance differences due to object sizes and/or distances from being confounded with differences due to techniques. These extensions were considered and implemented in our framework.
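The block-wise extension described above, where each user's K observations are swapped as a unit, can be sketched like this (again an illustrative reimplementation of ours, assuming observations are stored contiguously by user and a quantitative metric):

```python
# Sketch of the block-design extension: all K observations of a randomly
# selected user are swapped together between the two techniques.
import numpy as np

def randtest_blocked(x1, x2, K, B=10_000, rng=None):
    """x1, x2 have shape (N*K,), grouped by user; returns (S_x, p-value)."""
    rng = np.random.default_rng() if rng is None else rng
    x1 = np.asarray(x1, float).reshape(-1, K)   # one row per user
    x2 = np.asarray(x2, float).reshape(-1, K)
    s_x = np.mean(x1 - x2)
    count = 0
    for _ in range(B):
        swap = rng.random(len(x1)) < 0.5            # one decision per user
        y1 = np.where(swap[:, None], x2, x1)        # swap the whole K-block
        y2 = np.where(swap[:, None], x1, x2)
        if abs(np.mean(y1 - y2)) >= abs(s_x):
            count += 1
    return s_x, count / B
```

Because the swap decision is made per user, every scenario configuration remains equally represented in both techniques under every permutation.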

Score Computation
Two scores, computed from objective and subjective data, are assigned to each visualization technique. Algorithm 2 briefly presents the score computation procedure. Each metric is identified by an index m ∈ {1, 2, . . ., M}, where M denotes the number of metrics, and each technique is identified by an index t ∈ {1, 2, . . ., T}, where T denotes the number of techniques. X is a matrix of N rows (or N * K, when K scenarios are considered for each user) and T * M columns, where X m,t denotes the column of X containing user records of metric m under technique t. VType is the type of the metrics under analysis. Netb is a vector of size T, where Netb t denotes the net balance (wins − losses) earned by technique t. Score is the vector of final scores for all techniques. The procedure performs all possible pairwise comparisons between techniques, under all metrics. For each metric m and each pair of techniques (t 1 , t 2 ) with t 1 < t 2 , the randomization test assesses the significance of S x (lines 6-8). If no significant difference is found (p-value > α), neither technique earns or loses any points. Otherwise, each technique earns (loses) one point if it wins (loses) the comparison (lines 10-11).

Algorithm 2
Score computation procedure for pairwise comparison between techniques (repeated measures, two-tailed test)

1: procedure ScoreComp(X, T, M, VType, α, B)
2: Netb t ← 0, t = 1, . . ., T
3: for m ← 1 to M do ▷ Iterations over metrics
4: for each (t 1 , t 2 ) ∈ {1, . . ., T } 2 , t 1 < t 2 do
5: ▷ Iterations over pairs of techniques
6: return Score t

After all iterations, the vector Netb is multiplied by a normalization constant c, in such a way that the final scores (vector Score) range from −10 to 10 (line 13). In the experiments presented in this work, c is given as follows: for the objective scores, we have T = 3, M = 2, resulting in c = 2.5; for the subjective scores, we have T = 3, M = 12, resulting in c ≈ 0.417. Notice that extreme scores occur when one technique loses or wins all comparisons.
The procedure shown in Algorithm 2 assumes that all metrics are positively ordered, i.e., the higher the value, the better the technique; dealing with negative ordering is straightforward, by multiplying X m,t1 and X m,t2 by (−1). This extension is also implemented; the developer can set the ordering sign for each metric.
Notwithstanding its simplicity, this approach is sufficiently flexible to allow other extensions, such as handling other metrics, computing metrics with different weights, setting different weights for objective and subjective scores, adapting scale limits according to domain needs or customs, and setting other criteria for score assignment.
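A minimal Python sketch of Algorithm 2, assuming a randtest function with the interface described in Section 3.4.1 (passed in as an argument here); the names and the extra signs argument for negatively ordered metrics are ours:

```python
# Sketch of ScoreComp: pairwise comparisons over all metrics, net balances
# normalized so the final scores range from -10 to 10.
import numpy as np

def score_comp(X, T, M, vtypes, signs, alpha, randtest):
    """X[m][t] holds the user measurements of metric m under technique t;
    signs[m] is +1 (higher is better) or -1 (lower is better)."""
    netb = np.zeros(T)
    for m in range(M):
        for t1 in range(T):
            for t2 in range(t1 + 1, T):
                x1 = signs[m] * np.asarray(X[m][t1], float)
                x2 = signs[m] * np.asarray(X[m][t2], float)
                s_x, p = randtest(x1, x2, vtypes[m])
                if p <= alpha:                     # significant difference
                    winner, loser = (t1, t2) if s_x > 0 else (t2, t1)
                    netb[winner] += 1
                    netb[loser] -= 1
    # A technique can win at most (T-1) comparisons per metric, over M
    # metrics; this maps winning (losing) everything to +10 (-10).
    # E.g., T=3, M=2 gives c=2.5, matching the text.
    c = 10.0 / ((T - 1) * M)
    return c * netb
```

For T = 3 and M = 12 this yields c = 10/24 ≈ 0.417, as reported for the subjective scores.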

Verdict Graph
This last step consists of plotting the objective and subjective scores of all techniques in a two-dimensional graph, in order to allow a visual analysis of their relative performances with respect to the provision of depth perception. Each technique t is represented by a coordinate (Score o t , Score s t ), where Score o t is the objective score and Score s t is the subjective score of technique t, t = 1, . . ., T.
Figure 1 (Step 5 - Verdict Graph) also presents the base graph representing the space of possible coordinates. We consider four quadrants, each one corresponding to a verdict about the techniques' performances:
• Weak: techniques in this quadrant are those with negative objective and subjective scores, indicating poor performance in comparison with the other competitors; unless a technique in this quadrant is near the center of the graph (coordinate (0, 0)), its use should be considered only for non-critical systems and when budget constraints preclude the use of better (and more costly) techniques;
• Regular: this verdict includes the two quadrants in which objective and subjective scores are negatively correlated, representing techniques with good relative performance under one criterion but weak performance under the other. Their use should also be considered with caution. Nonetheless, these quadrants should have a lower probability density, insofar as techniques providing better depth perception (consequently, with higher ratings in the questionnaire) should yield better performance during the tasks (and therefore higher ratings in the objective metrics);
• Strong: techniques in this quadrant have, on average, performed better than their competitors, and should be the preferable choice. The closer their positions to the upper right corner (coordinate (10, 10)), the more evident their superiority.
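The quadrant-to-verdict mapping can be expressed as a small helper function. This is an illustrative sketch; in particular, assigning scores exactly at zero to the Strong quadrant is our assumption, not specified in the framework:

```python
def verdict(obj_score, subj_score):
    """Map a technique's (objective, subjective) scores to a verdict.
    Positive on both axes: Strong; negative on both: Weak; mixed: Regular."""
    if obj_score >= 0 and subj_score >= 0:
        return "Strong"
    if obj_score < 0 and subj_score < 0:
        return "Weak"
    return "Regular"
```

Plotting each technique at its (objective, subjective) coordinate and labeling it with this verdict reproduces the visual analysis described above.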

Framework Validation
As an illustration of our framework's application, we present four experiments conducted at different moments to compare visualization techniques (three techniques in the first two experiments and two techniques in the last two), within two different IVEs: a simulator in the health area and an endless racing game (called "3D Running Squirrel").
Dental training simulations have drawn attention as an educational strategy during the Covid-19 pandemic [Hattori et al., 2022]. The simulator used in our experiment is an immersive tool, based on 3D interaction, to train dental anesthesia (Figure 5). The user's goal is to manipulate a virtual syringe and insert its needle in a specific region, to inject the anesthetic that blocks the nerve's electrical signals. Better performance is achieved when the user completes this task with fewer errors (insertions in other regions) and in a shorter time. Additionally, the simulator allows users to navigate the environment using the keyboard, changing their viewpoints during the interaction to explore virtual objects. In this case, better performance means finding objects and details in a shorter time.
The game "3D Running Squirrel" is an immersive endless racing game available for desktops and mobile devices (Figure 6). The player can observe the performance achieved from the number of hits (walnuts captured without falling off the path) and the elapsed time. The greater the number of hits and the longer the time spent, the better the player's performance. From now on, we will refer to this IVE as the "game".

Participants
Two groups of participants were recruited for the experiments.
The participants of the first two experiments (Group 1) comprised 20 students and teachers in the Computer Science area: sixteen male and four female participants (average age of twenty-nine years). The gender distribution is consistent with that found among Brazilian students in higher education courses within the computing field [Maciel et al., 2018]. More than half (65%) of the participants mentioned having some type of eye problem; therefore, they used visual correction with their respective eyeglasses during the experiments. All participants of Group 1 said that they had some experience with 3D virtual environments.
The participants of the last two experiments (Group 2) comprised nine students in the Computer Science area: seven male and two female participants (average age of twenty-one years). Forty percent (40%) of the participants mentioned having some type of eye problem; they also used visual correction with their respective eyeglasses during the experiments. All participants of Group 2 said that they had some experience with 3D virtual environments.

Visualization Techniques and Devices
Five visualization techniques, one monoscopic and four stereoscopic, were evaluated in our experiments: the color filtering technique by means of true anaglyph glasses, here named "True Anaglyph Technique" (TAT); the color filtering technique by means of color anaglyph glasses, named "Color Anaglyph Technique" (CAT); light shutter using shutter glasses, here named "Shutter Glasses Technique" (SGT); CAVE without glasses (CT); and CAVE with glasses (CGT). The glasses used in CGT were specifically the color anaglyph glasses.
The glasses for TAT are made of lenses with red and blue filters. Similarly, the glasses for CAT use red and cyan lenses. On the other hand, SGT uses lenses that alternate the scene for each eye at a frequency synchronized with the monitor or projector refresh rate. Figure 2 illustrates an example of the evaluation scenario utilized for conducting experiments with Group 1. CGT is a CAVE in which the participants used anaglyph glasses for 3D visualization. CT did not include the glasses and was considered a monoscopic environment, although it provides several viewpoints of the environment. CT and CGT were specifically designed to create an engaging environment within a classroom setting. Four multimedia projectors were strategically positioned in the classroom to provide a seamless visual experience for the participants. The projectors were carefully adjusted and calibrated to project images onto the targeted walls, ensuring a cohesive and synchronized display. This configuration allowed a substantial portion of the classroom walls to be utilized as a canvas for the virtual environment. While the other two walls of the classroom were not directly covered by the projectors, they still contributed to the overall immersive experience. The ambient lighting in the room was adjusted to minimize distractions and enhance the perception of being enveloped in the virtual environment while performing navigation tasks. Figure 3 illustrates an example of the evaluation scenario utilized for conducting experiments with Group 2.
These techniques were chosen due to their differences in cost-benefit ratio, especially the differences between TAT/CAT and SGT/CT/CGT. This difference is evidenced by the fact that shutter glasses and the projectors required for building the CAVE are high-cost equipment, whereas anaglyph glasses can be made from inexpensive materials. Since they present different cost levels, our investigation intends to verify whether a technique is suitable for a given system even if it presents a low cost. However, other devices, such as modern head-mounted displays (HMDs), may be used.
To perform the procedures offered by the simulator, a Leap Motion device was used to capture movements, which are transferred to a syringe in the virtual space. To physically represent the syringe, a common straw was used (Figure 4). In the simulator, it is possible to navigate in the virtual environment using the keyboard, with keys to move the viewpoint or the virtual camera (translation and rotation). Projection equipment was used to enable the correct operation of SGT, which requires the visualization device to operate at a refresh rate of 120 Hz. Concerning the game, we used a standard mouse to control the squirrel's actions in the virtual space.

Design
Both experiments considered distance and size as parameters. To assess users' performance, collecting hits (game), errors (simulator), and time (game and simulator) under different configurations of parameters, four different scenarios were built in each IVE. In the game, the scenarios were configured as follows (Table 2):
• Longer distance / Larger size: the virtual camera is far from the squirrel and the nuts, and the path has a wide width;
• Longer distance / Smaller size: the virtual camera is far from the squirrel and the nuts, and the path has a narrow width;
• Shorter distance / Larger size: the virtual camera is closer to the squirrel and the nuts, and the path has a wide width;
• Shorter distance / Smaller size: the virtual camera is closer to the squirrel and the nuts, and the path has a narrow width.
In the simulator, each scenario was obtained by varying: (i) the distance between the virtual camera (representing the user's viewpoint) and the virtual patient; (ii) the size of the target, the region to be reached. The target region is a yellow sphere placed at the virtual patient's inner mouth surface that shows the location of the nerve direction to be reached. Table 1 and Figure 5 show these configurations.
In the game, each scenario was obtained by varying: (i) the distance between the virtual camera and the virtual objects, and (ii) the sizes of the virtual objects (squirrel, walnuts and the path), as shown in Table 2 and Figure 6. Both IVEs were rendered with the techniques mentioned in Subsection 4.2, as shown in Figures 7, 8, 9 and 10. Each participant of Group 1 performed twelve tasks within each IVE, each corresponding to a stereoscopy technique and a scenario. The experiments were conducted between 1 p.m. and 6 p.m., under artificial lighting for all users, to avoid lighting influencing user perception during the experiments.
Each participant of Group 2 performed eight tasks, each corresponding to a visualization technique (monoscopic and stereoscopic) and a scenario. The experiments were conducted between 3 p.m. and 9 p.m., also under artificial lighting for all users.
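The scenario configurations described above can be sketched as the Cartesian product of the two varied factors. This is a hypothetical illustration; the names `DISTANCES`, `SIZES` and `build_scenarios` are ours, not identifiers from the framework's source:

```python
from itertools import product

# Camera distance and object/path size: the two parameters varied in the experiments.
DISTANCES = ("longer", "shorter")  # camera far from / close to the scene
SIZES = ("larger", "smaller")      # wide vs. narrow path (game); big vs. small target (simulator)

def build_scenarios():
    """Return the four parameter configurations used in each IVE."""
    return [{"distance": d, "size": s} for d, s in product(DISTANCES, SIZES)]

scenarios = build_scenarios()
print(len(scenarios))  # 4
```

Enumerating the grid this way makes it explicit that every combination of the two factors is tested once per technique.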

Tasks and Metrics
In the simulator, in the first two experiments the task consisted of manipulating a syringe in the virtual space to reach a small target and a larger version of the same target, from two different camera distances. During the interaction, the system collected two metrics: the error rate (the frequency with which the needle missed the target) and the time spent to complete the task.
In one of the last two experiments, related to the simulator, the task consisted of navigating the virtual environment to find an object: a yellow sphere placed on the inner surface of the virtual patient's mouth.
The task related to the game, in all experiments, consisted of controlling a squirrel that runs along a virtual wall to capture as many walnuts as possible without falling off the path (a narrow one and a wider one), from two different camera distances. During the interaction, the system collected two metrics: the hit rate (the frequency with which the squirrel correctly turned left or right and captured walnuts) and the time elapsed until the squirrel fell off the path.
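As an illustration, the per-task collection of these metrics could be organized as follows. This is a sketch under our own naming (the class and methods are not from the systems' source code):

```python
import time
from dataclasses import dataclass, field

@dataclass
class TaskMetrics:
    """Collects hits/misses and elapsed time for one task, as described above."""
    hits: int = 0
    misses: int = 0
    start: float = field(default_factory=time.monotonic)

    def record_attempt(self, success: bool) -> None:
        if success:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        """Frequency of successful attempts (the game's metric)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def error_rate(self) -> float:
        """Frequency of missed attempts (the simulator's metric)."""
        return 1.0 - self.hit_rate if (self.hits + self.misses) else 0.0

    def elapsed(self) -> float:
        """Seconds since the task started."""
        return time.monotonic() - self.start
```

Keeping both rates derived from the same hit/miss counters guarantees they stay consistent with each other.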

Procedure
This experiment was carried out with the approval of the Committee of Ethics in Research with Human Beings of the Faculty of Medicine of the University of São Paulo (CAAE 54691916.3.0000.0065). Participants signed a Consent Form and were informed beforehand about the tasks to be performed and the devices to be used. The participants of Group 1 first tested the simulator and then the game. To minimize bias and account for factors such as participant fatigue, the testing order of the visualization techniques was alternated for each volunteer during each test session, for each of the IVEs. The last two experiments, conducted with participants from the second group, were carried out with the order of the systems alternated as well.
After initial explanations by the researcher, for each technique, each participant alternated between performing a task and answering a questionnaire. During all sessions, the researcher gave instructions to the participant whenever necessary until the end of the session. Each session lasted 40 minutes on average.
The participants of Group 1 (the first two experiments) were invited to sit facing the video monitor and record their personal data. Next, a pair of stereoscopic visualization glasses (CAT, TAT or SGT) was provided, the participant sat close to a table (in front of the projection), and the tests started. After completing all tasks with the same stereoscopy technique, participants answered the evaluation questionnaire.
The participants of Group 2 (the last two experiments) were invited to sit facing the video monitor and record their personal data. Next, the participants entered a virtual room. A pair of stereoscopic visualization glasses was provided to participants to interact with the simulator and the game when the experiment involved CGT; for CT, the visualization was monoscopic. Finally, the participants completed the tasks and answered the evaluation questionnaire.

Results
With respect to the objective metrics, Tables 3 and 4 present the results of pairwise comparisons between CAT, TAT and SGT (Group 1, the first two experiments), as well as between CGT and CT (Group 2, the last two experiments). Each entry contains the average difference of ratings in each metric and its respective p-value. Significant differences (p-value < 0.1) are highlighted in bold.
In both IVEs (Group 1, the first two experiments), the relative performance between techniques showed similar patterns. Although CAT yielded a slight superiority over TAT (lower error rate and time averages within the simulator; higher hit rate and time averages within the game), no significant difference was found in any metric. On the other hand, the significant differences found in most pairwise comparisons CAT × SGT and TAT × SGT reveal that SGT stands out with the best performance.
In both IVEs (Group 2, the last two experiments), CT obtained superior performance compared with CGT. It is important to mention that the tasks and systems were different, and that only time was used as a metric for the simulator. Significant differences between CT and CGT were found in both experiments, for both systems.
Concerning the subjective metrics, Tables 5 and 6 present the results of pairwise comparisons between CAT, TAT and SGT (the first two experiments, Group 1) and between CGT and CT (the last two experiments, Group 2).
The relative performance of the techniques on the subjective metrics approximately reproduced the pattern found on the objective metrics. In the case of Group 1 (the first two experiments), CAT and TAT obtained very similar ratings, and significant differences (in favor of CAT) were found in only one of the 12 questions, for both IVEs. SGT, in turn, achieved a strong superiority over CAT and TAT, with most pairwise comparisons showing significant differences: within the simulator, 10/12 for CAT × SGT and 12/12 for TAT × SGT; within the game, 9/12 for CAT × SGT and 9/12 for TAT × SGT. In the case of Group 2 (the last two experiments), CT obtained a strong superiority over CGT: within the simulator, 9/12 pairwise comparisons showed significant differences for CGT × CT; within the game, 3/12 did.
The computation of the objective and subjective scores is straightforward, as described in Section 3.4.2. For example, to compute the objective score of CAT in the simulator, we note that it obtained 2 ties and 2 losses (Table 3), resulting in a net balance of −2. Multiplying the net balance by the normalization constant c = 2.5 (Section 3.4.2) yields its objective score Score_o(CAT) = −5.0. As for the subjective score, CAT obtained 13 ties, 1 win and 10 losses (Table 5), resulting in a net balance of −9. Multiplying the net balance by the normalization constant c ≈ 0.417 yields the subjective score Score_s(CAT) ≈ −3.8.

Figure 11 presents the scores for CAT (Color Anaglyph), TAT (True Anaglyph) and SGT (Shutter Glasses), each one with its subjective and objective scores combined in a triangle, as well as the respective quadrant-based verdicts, considering Group 1 and the first two experiments. As noticed in the pairwise comparisons, SGT has shown a remarkable superiority over CAT and TAT, and is therefore considered a strong stereoscopy technique. CAT and TAT, in their turn, have shown very similar performances, well below SGT. Insofar as neither of them has shown a convincing superiority over the other, both are considered weak techniques in our experiments. In both IVEs, CAT presents a slight superiority over TAT, evidenced by its smaller distance from the center of the graph. Nonetheless, its use in the simulator should be considered with caution, due to the system's criticality. On the other hand, for non-critical systems such as the game, CAT is preferable over TAT in case cost constraints preclude the acquisition of more expensive devices.
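The worked example above can be reproduced in a few lines. This is a sketch under our own naming; the constants c are the normalization constants quoted from Section 3.4.2:

```python
def net_balance(wins: int, losses: int) -> int:
    """Wins minus losses; ties contribute nothing to the balance."""
    return wins - losses

def score(wins: int, losses: int, c: float) -> float:
    """Scale the net balance by the normalization constant c."""
    return net_balance(wins, losses) * c

# CAT in the simulator: 0 wins, 2 ties, 2 losses on the objective side (Table 3)
objective = score(wins=0, losses=2, c=2.5)       # -5.0
# ...and 1 win, 13 ties, 10 losses on the subjective side (Table 5)
subjective = score(wins=1, losses=10, c=0.417)   # about -3.75; reported as approx. -3.8, since c is itself approximate
```

Because ties cancel out, only the win/loss imbalance moves a technique away from the center of the verdict graph.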
Figure 12 presents the scores for CGT (CAVE Glasses Technique) and CT (CAVE Technique), considering Group 2 and the last two experiments. CT was superior to CGT for both the simulator and the game, although the tasks differed for each system. We can observe that one technique (CGT) was considered weak and the other (CT) strong.
It is worth mentioning that, as described in Section 4.2, although participants in Groups 1 and 2 were submitted to the same IVEs (the dental training simulator and the 3D running squirrel game), the visualization techniques used by each group were different, which precludes a joint or comparative analysis of their results.

Discussion
The existing evaluation methods in the literature are executed for particular scenarios and, in general, are embedded in a single IVE, with no possibility of reuse. Our framework can be applied to evaluate IVEs involving object manipulation and navigation in virtual environments, and its application depends neither on the scope of the IVE nor on the visualization techniques to be evaluated. Besides, the framework is able to incorporate different objective metrics, parameters and questionnaires (as long as the latter are composed exclusively of ordered single-answer questions). It is important to mention that the framework allows comparing stereoscopic techniques with monoscopic ones, which lack the presentation of distinct images to each eye.
Furthermore, the literature reports objective and subjective evaluations, but generally performed separately. The two-way evaluation is based on the consideration that taking into account only the objective or only the subjective dimension may result in biased or incomplete assessments. Thus, our framework integrates both dimensions to form a verdict about the depth perceived by an individual when using different techniques. Such a verdict (Figure 1) is an initial proposal that can be adapted to the developer's needs.
Our framework requires a questionnaire to assess the users' point of view concerning the sensations perceived during the performance of tasks. Among the several questionnaires available in the literature, we adapted the one proposed by Witmer and Singer [Witmer and Singer, 1998]. The authors suggest that "involvement" is an important determinant of presence in virtual environments, so we adapted some questions from their questionnaire about this factor, resulting in questions 3 to 12 (Tables 5 and 6). Besides, two new questions (1 and 2) were included to assess visual strain and users' comfort when using visualization devices. The framework can incorporate other questionnaires, such as the Presence Questionnaire [Witmer and Singer, 1998], the ITC-Sense of Presence Inventory (ITC-SOPI) [Lessiter et al., 2001], and the Igroup Presence Questionnaire (IPQ) [Schubert et al., 2001].
Although we intend to continue acquiring data considering other techniques, new IVEs and a greater number of participants, the number of participants in our experiments is in line with the literature. Several studies evaluating presence in virtual environments have carried out experiments with questionnaires and physiological measures using different numbers of participants: 10 [Clemente et al., 2013b], 14 [Clemente et al., 2013a], 18 [Anderson et al., 2017], 19 [Poels et al., 2012], and 20 participants [Burns and Fairclough, 2015]. Thus, we believe the results obtained in the first two experiments presented here are consistent with the literature. Nevertheless, in the last two experiments, CAVE without glasses (monoscopic) showed superior results compared with CAVE with glasses (stereoscopic) in both tasks (manipulation and navigation). Although CAVE without glasses used a monoscopic technique, the participant had several viewpoints of the environment. This may have contributed to task performance: no pair of glasses was necessary, decreasing the cognitive load and facilitating the participants' actions compared with CAVE with glasses. Thus, it is possible to observe that various aspects can influence depth perception.
The results suggest that, in the context of our experiments, the use of stereoscopic glasses can impose an additional cognitive burden on participants, as they need to adapt to the stereoscopic visualization. The absence of glasses in the monoscopic environment can alleviate this burden and facilitate the execution of tasks, leading to better performance. Moreover, individual preferences can also influence depth perception: some participants may adapt more effectively to stereoscopic viewing, while others may prefer monoscopic viewing.
Besides, the permutation test procedure presented properly handles quantitative and ordinal metrics and does not rely on asymptotic convergence. Additionally, the method is easily extensible to experimental block designs.
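As an illustration of such a procedure, a standard paired (sign-flipping) permutation test can be sketched as follows. This is our assumption of the general form of the test, not the framework's exact implementation:

```python
import random

def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided p-value for the mean paired difference between conditions a and b.

    Under the null hypothesis the two condition labels are exchangeable within
    each participant, so the sign of each paired difference is flipped at
    random. The test works for quantitative metrics and for ordinal ones coded
    as ranks, and makes no asymptotic (large-sample) assumptions.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_perm):
        mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(mean) >= observed:
            extreme += 1
    return extreme / n_perm

# e.g., hypothetical task times (s) for the same participants under two techniques
p = paired_permutation_test([12.1, 9.8, 14.3, 11.0], [13.5, 10.9, 15.1, 12.2])
```

With few participants the full set of sign patterns could be enumerated exactly instead of sampled; the Monte Carlo version above scales to larger groups.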
Regarding the tasks, manipulation tasks were addressed in the first two experiments (Group 1). Specifically, in the first experiment, participants manipulated a virtual syringe in the simulator using the Leap Motion controller; the second experiment involved manipulating a virtual squirrel in the game using a mouse. In the last two experiments (Group 2), both manipulation and navigation tasks were addressed: participants navigated through the virtual dental office using the keyboard, and manipulated the virtual squirrel in the game using a mouse. Comparing the results (task completion time) between tasks in the simulator (Group 1 versus Group 2), the times of the navigation task were larger than those of the manipulation task. This suggests that the type of task can influence the results obtained when evaluating techniques. Therefore, when comparing different techniques, it is essential to ensure that the same tasks are applied in order to obtain fairer and more meaningful results, as conducted in the mentioned experiments.
The variation of parameters of the virtual environments presented in the experiments is important for analyzing depth perception. By manipulating parameters such as the size and distance of virtual objects in relation to the virtual camera, researchers can assess how these changes impact participants' perception of depth. This allows a more comprehensive evaluation of the effectiveness of different visualization techniques, such as stereoscopic and monoscopic, in creating a sense of depth and immersion. Additionally, by creating different scenarios with varying parameters, potential biases in the evaluation process can be minimized, ensuring a more objective assessment of the techniques being compared. These variations help researchers gather valuable data on how different factors influence depth perception, enhancing our understanding of human perception in immersive virtual environments.
Considering all four experiments (two groups), the results of CAVE with glasses (Group 2, the last two experiments) were inferior to all other results. Thus, the techniques based on anaglyphs were classified in the weak quadrant of the verdict graph.
Finally, evaluation methods remain relevant as new technologies are developed and new visual effects become possible. The study conducted by Thalmann et al. [2016] showed that the Oculus Rift was slightly better than a CAVE and a stereoscopic display.
In the following subsections, we present the constraints and benefits of the framework and our experiments.

Limitations
In the comprehensive framework, the parameters most cited in the literature were considered: distance and object size. Rendering the IVEs entails building different scenarios with variations of these parameters, generating different versions of the same system. Such scenarios are required to assess to what extent variations of the parameters facilitate or hinder the manipulation of objects or navigation in the environment. Considering that each IVE has already been duly studied to define the adequate object size and camera distance for its purpose, the suggested variations could have an impact on the errors, hits and times measured during an evaluation. Consequently, they enable assessing how users' perception behaves when the parameters change.
Although the experiments conducted were not the main focus of this work, some limitations should be mentioned. Considering the last two experiments, executed by Group 2, the results suggest that stereoscopy techniques may not be necessary in some VR systems; however, a CAVE was used, and other systems could be assessed. Only distance and object size variations were considered as parameters for the scenarios, and only the hit rate, error rate and time metrics were included in the comparisons between techniques. These choices followed the literature; in future applications, however, other parameters and metrics can be addressed.
The results and verdicts obtained in the experiments are valid only in the context of the experiment. Therefore, their extrapolation to other contexts (even to IVEs with characteristics similar to those tested here) is not straightforward.
Our results, obtained with users with different characteristics, are similar to results reported in the literature by Zhao et al. [2020] and Livatino et al. [2015], where different techniques are compared. However, new experiments with developers from visualization and related areas could be useful to assess the framework, especially to check the magnitudes of the differences between the techniques.

Benefits
We have not found in the literature an extensible evaluation method, applicable to different systems, that provides comparative performance assessments of visualization techniques, especially stereoscopic techniques, within the same IVE, evaluating depth perception. In this sense, the comprehensive framework is a novelty aiming to address this gap in the VR area.
Besides, the comprehensive framework is flexible in the sense that it allows one to: (i) incorporate any ordered objective metric, in any scale ordering, without changing any routines of the source code, by using a dictionary of metrics provided by the developer; (ii) incorporate any questionnaire composed of ordered single-answer questions, also via a dictionary of questions; (iii) incorporate other IVE scenarios based on parameter variations beyond distance and object size; and (iv) make changes or extensions to the score computation routines with minor changes in the source code, such as handling metrics with different weights, setting different weights for objective and subjective scores, adapting scale limits according to domain needs or customs, and setting other criteria for score assignment.
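Item (i) can be illustrated with a minimal sketch, assuming a dictionary in which each metric declares its scale ordering. The names here are hypothetical, not the framework's actual API:

```python
# Each metric declares its ordering; the comparison routine stays unchanged
# when the developer registers new entries.
METRICS = {
    "hit_rate":   {"higher_is_better": True},
    "error_rate": {"higher_is_better": False},
    "time":       {"higher_is_better": False},
}

def beats(metric: str, a: float, b: float) -> bool:
    """True if value a outperforms value b under the metric's declared ordering."""
    higher = METRICS[metric]["higher_is_better"]
    return a > b if higher else a < b

# Adding a metric requires only a new dictionary entry, with no code changes:
METRICS["presence_rating"] = {"higher_is_better": True}
```

The same pattern extends to the questionnaire dictionary of item (ii): each question carries its ordered answer scale, and the pairwise comparison logic never needs to know which questions exist.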

Conclusion
The framework integrates objective and subjective evaluations, gathering data that are analyzed to identify significant differences between stereoscopic and monoscopic techniques.
The comprehensive framework is applicable to different systems within the context of IVEs with object manipulation and navigation; it is flexible, allowing the incorporation of any objective and subjective metrics (real-valued or ordinal) and of several scenarios based on parameter variations, as well as changes to the criteria and parameters of the score computation. It aims to support VR system developers in deciding whether a stereoscopy technique should be used in a VR system, and which one. Based on the scores of candidate techniques, developers can choose a technique based on its cost and effectiveness within a specific domain. Further, the framework can help the developer or another professional analyze a stereoscopic technique's performance on each metric, identifying its strengths and weaknesses.
In future work, we intend to conduct new experiments considering other parameters found in the literature, such as shading, texture, and lighting. Additionally, other metrics can be included, such as speed and length. We also propose to analyze other IVEs, applying the framework in other contexts, as well as other techniques with other visualization devices. Developers, as experts in the field, could evaluate the framework, although some of our results are similar to those found in the literature when techniques are compared. Such collaboration with developers is planned as part of our future work.

Figure 2 .
Figure 2. A participant from Group 1 actively engaged in the experiment by playing the game.

Figure 3 .
Figure 3. A participant from Group 2 involved in the experiment, using the simulator in the navigation task. At that moment, he was passing through the modeled ceiling in the virtual dental office environment.

Figure 4 .
Figure 4. Environment with the use of a straw (red circle) representing the virtual syringe and a Leap Motion (blue circle) for virtual tool tracking. To prevent interference caused by the infrared rays emitted by the Leap Motion device on the shutter glasses, a specially designed apparatus made of cardboard and ethyl vinyl acetate (EVA) sheets was used.

Figure 11 .
Figure 11. Verdict graph results for each stereoscopy technique classification, in the first two experiments (Group 1)

Figure 12 .
Figure 12. Verdict graph results for each technique classification, in the last two experiments (Group 2)

Table 1 .
Configurations of parameters for the simulator. Longer distance/Larger size: the virtual camera is far from the patient and the target (yellow sphere) is bigger. Longer distance/Smaller size: the virtual camera is far from the patient and the target (yellow sphere) is smaller. Shorter distance/Larger size: the virtual camera is closer to the patient and the target (yellow sphere) is bigger. Shorter distance/Smaller size: the virtual camera is closer to the patient and the target (yellow sphere) is smaller.

Table 2 .
Configurations of parameters for the game. Longer distance/Larger size: the virtual camera is far from the squirrel and the nuts, and the path is wide. Longer distance/Smaller size: the virtual camera is far from the squirrel and the nuts, and the path is narrow. Shorter distance/Larger size: the virtual camera is closer to the squirrel and the nuts, and the path is wide. Shorter distance/Smaller size: the virtual camera is closer to the squirrel and the nuts, and the path is narrow.

Table 3 .
Average differences with respect to (w.r.t.) objective metrics for pairwise comparisons between stereoscopic techniques within the Simulator: CAT × TAT, CAT × SGT and TAT × SGT in the first two experiments (Group 1); CGT × CT in the last two experiments (Group 2). CAT = Color Anaglyph Technique; TAT = True Anaglyph Technique; SGT = Shutter Glasses Technique; CGT = CAVE Glasses Technique; CT = CAVE Technique.

Table 4 .
Average differences w.r.t. objective metrics for pairwise comparisons between stereoscopic techniques within the Game: CAT × TAT, CAT × SGT and TAT × SGT in the first two experiments (Group 1); CGT × CT in the last two experiments (Group 2). CAT = Color Anaglyph Technique; TAT = True Anaglyph Technique; SGT = Shutter Glasses Technique; CGT = CAVE Glasses Technique; CT = CAVE Technique.

Table 5 .
Differences w.r.t. subjective metrics for pairwise comparisons between stereoscopic techniques within the Simulator: CAT × TAT, CAT × SGT and TAT × SGT in the first two experiments (Group 1); CGT × CT in the last two experiments (Group 2).

Table 6 .
Differences w.r.t. subjective metrics for pairwise comparisons between stereoscopic techniques within the Game: CAT × TAT, CAT × SGT and TAT × SGT in the first two experiments (Group 1); CGT × CT in the last two experiments (Group 2). Question 10: How much did the manipulation of objects in the virtual environment seem consistent with the manipulation of objects in the real world?