Virtual look around: comparing presence, cybersickness and usability for virtual tours across different devices

Virtual Reality has become readily available in the last few years through different devices, from desktop computers to head-mounted displays (HMD). Also, virtual tours became popular with 360º panoramic photographs and video clips on online social media, so people could visit remote locations without being exposed to crowded transportation or long travels. Also, virtual tours demonstrate considerable potential as a form of escapism and even for remote teaching. Since we lack studies that evaluate the User Experience (UX) in virtual tours on different devices, this article aims to compare aspects of the User Experience (regarding sense of presence, cybersickness, and usability) in a virtual tour website developed in WebXR across different devices. To achieve our objective, we developed a virtual tour based on 360º pictures using WebXR API and React 360 framework and conducted an experiment with 41 undergraduate students using four different devices: a laptop computer, a smartphone, a Google Cardboard headset, and a Samsung Gear VR HMD. We evaluated users' perceptions by adapting and translating the Suitability Evaluation Questionnaire (SEQ) and users' performance by measuring the time to fulfill a set of tasks. The main findings from this study include that (i) the overall self-reported experience using Google Cardboard is worse than using other devices, (ii) the users' performance is quite similar between the platforms, (iii) there is evidence of unexpected cybersickness symptoms in tests with the smartphone, and (iv) the development of a plausible hypothesis concerning low usability having an effect upon the sense of presence. Additional contributions of our research are the adaptation, translation into Portuguese, psychometric analysis, and revised scoring procedures of the SEQ.


Introduction
Although Virtual Reality (VR) was created more than 40 years ago according to Costa and Ribeiro (2009), it has since evolved and become more accessible. Parisi (2015) and Jerald (2016) state that VR aims to convince users that they are somewhere else, using the illusion of presence and immersion to change their physiological and psychological condition. For Jerald (2016), immersion is a flexible aspect of VR because it is more closely linked to the technology that leads people to perceive and interpret sensory stimuli in a wide, coherent, vivid, and interactive way. Therefore, one VR environment combined with different devices could create different levels of the sense of "being there".
One example of a VR application is the "virtual tour", which, according to Lee et al. (2013) and Osman et al. (2009), allows users to navigate within a simulated environment that contains virtual reality elements, thus offering the opportunity to look around places far away in space and time. One of the first virtual tours was an installation in a British Museum in 1994, as presented by Boland and Johnson (1996) and Pujol (2004): the representation of Dudley Castle (England) as it had been in the year 1550. When this kind of application is available on the web, it usually simulates places through 360º pictures or videos, as is done by many universities worldwide according to Osman et al. (2009).
Nowadays, the COVID-19 outbreak and the physical distancing policies caused a sevenfold increase in searches for "virtual tour" terms on search engines. Many websites have become hubs for these tour experiences, as stated by Bloom (2020) and CatracaLivre (2020). According to Tarcia (2020), this scenario makes clear the escapism needs of isolated people and the search for didactic resources for remote teaching, since Klippel et al. (2019) say that virtual tours are a good alternative (in many aspects) to the experience of real trips.
Considering the increasing search for virtual tours, the compatibility between VR APIs (Application Programming Interfaces) and web browsers, and the number of available visualization devices, it is necessary to assess aspects of the User Experience (UX) of virtual tours on different platforms to ensure that users are able to make a good decision based on cost-benefit relations and usage context.
Most scientific papers on virtual tours do not directly address the comparison of platforms, as they focus either on isolated usability evaluation, like Osman et al. (2009) and Oprean et al. (2018), or on application development, like Sathe et al. (2017), Butcher and Ritsos (2017), and Ye et al. (2017). Moreover, the most significant studies on this subject, performed by Lee et al. (2013) and Klippel et al. (2019), address, respectively, the comparison of different visualization modes using tablets in a tour and the comparison of real and virtual field trips, but neither contrasts different devices. Finally, none of the related studies discusses how the evaluation results differ when researchers choose a holistic approach (evaluating the experience as a whole) or a multidimensional analysis (evaluating the distinct experience dimensions separately, e.g., presence, usability, flow, etc.).
Considering the gaps in the mentioned papers, we pose the following research questions regarding the tested devices (two non-immersive devices and two immersive VR devices): (1) Is there any significant difference in the experience reported by users of virtual tours among different devices? (2) Is there any difference between performing a holistic and a multidimensional UX analysis? (3) Is there any significant difference in the users' performance among different devices?
So, this research aims to compare the User Experience (regarding sense of presence, cybersickness, and usability) in a virtual tour website developed in WebXR and used on four different devices: a laptop computer, a smartphone, a Google Cardboard platform, and a Samsung Gear VR HMD.
To perform this comparison: (1) we developed a virtual tour website for the Federal University of Pampa (UNIPAMPA) using 360º pictures; (2) we performed user tests with 41 participants using the four devices; (3) we collected users' opinions regarding the virtual tour experience through a standardized questionnaire that had been translated and adapted; (4) we recorded performance data on participants; and (5) we analyzed the results through inferential statistics. The Suitability Evaluation Questionnaire (SEQ) of Gil-Gómez et al. (2013) was chosen for the translation and adaptation process because of its multidimensional perspective, its small number of items, and other features presented in Subsection 3.1.
Finally, this paper is organized as follows: Section 2 describes the main studies related to our scope that were found in scientific databases; Section 3 includes the apparatus and the methods applied in this research, the experimental procedures, and the description of the developed virtual tour; Section 4 presents the detailed results of our study and the answers to the research questions; and Section 5 shows the key contributions of this research, the limitations faced, and suggestions for further work.

Related Work
Few works focused on virtual tours have been developed in recent years. We highlight the studies found in scientific databases in this field that address different perspectives.
Cho et al. (2002) wrote an academic essay on the effects and implications of a virtual web tour in tourism marketing. Based on theory and evidence from the scientific literature about tourists' experience, the authors claim that it is necessary to keep the good experience, the interaction, and the sharpness at high levels to (1) lead users to a player status instead of a spectator one, (2) allow more effective information search, (3) create a proper and evaluative envisioning of a destination, (4) allow users to evaluate their expectations regarding a real destination, and (5) cause satisfaction.
Osman et al. (2009) developed and evaluated the usability of a virtual tour of four Malaysian places. Two experiments were performed: the first aimed to find usability issues and to receive feedback from 10 participants through an interview carried out after some tasks had been performed, and the second aimed to measure the satisfaction of 5 participants with regard to movement speed, image quality, sounds, trip attractiveness, terminology, text descriptions, and navigability. The ad hoc questionnaire developed by the authors for the second study is not available in the paper.
Osman and Wahab (2011) also evaluated the feasibility of using virtual tours with children in kindergarten. Twelve children experienced two different panoramas (a playground and a zoo) using a Flash program. This application was projected on the wall, and the children used either a mouse or a keyboard to interact. The environments had animations and sounds. The authors observed the children's reactions during the interaction, and they identified that children with previous computer experience preferred to use the mouse, while inexperienced children chose the keyboard more often. Furthermore, the general experience with the panoramas was positive and well accepted by the children, even though cognitive benefits were not assessed.
Lee et al. (2013) reported the development and the user evaluation of an Augmented Reality (AR) virtual tour of the Antarctic. Miniaturized versions of some Antarctic regions were mapped onto a 90,000 square meter space in a park, and users could visualize these environments in AR through tablets. The tour was composed of 3D virtual models representing Antarctic elements in AR, a map that allowed top-view interaction, and some pictures, videos, and panoramas that could be seen without AR. The authors collected data from 50 participants through an ad hoc questionnaire on usability and an adapted version of the GEQ (Game Experience Questionnaire). The analysis of the subjective measures followed the recommended scoring procedures for the GEQ in a subscale approach: competence, immersion, flow, tension/annoyance, challenge, negative affect, and positive affect. Differences between the use of the application in an open environment (the park) and a closed environment (a booth) were not noticed, but the authors observed that younger users reported a more negative experience even though they felt more confident in fulfilling tasks. Finally, most users reported that panoramic pictures were their favorite feature, and the open environment users spent more time touring in AR.
Sundar et al. (2017) carried out a detailed analysis of immersive journalism and how it affects our perceptions and mental processes. The authors exposed 129 participants to two stories from the New York Times (which differ in emotional intensity) in three media: text read on desktop computers, 360º video also watched on desktop computers, and VR accessed by using a Cardboard VR headset and a smartphone. They measured dispositions and outcomes through a broad range of questionnaires (some of them shortened or adapted by the authors), including the Interpersonal Reactivity Index, Arrival and Departure telepresence questionnaires, and the Reality Judgement and Presence Questionnaire. Story recall was also assessed. The sense of presence was approached from a multidimensional perspective (sense of being there, interaction, and realism). While the 360º video experience and the VR experience presented higher scores than the text experience on the sense of presence scales, recall was slightly better for participants who had read the stories. Also, source credibility, empathetic link, and sharing intention were all significantly higher among participants who experienced 360º video and VR.
Oprean et al. (2018) evaluated the differences in a virtual field trip in a settlement using three device configurations: an HTC Vive HMD with joysticks on a Unity application, a Google Cardboard platform also on a Unity application, and a website developed using WebVR. The number of participants (Architecture students) was not reported, and the data collection approach was an informal interview. The main problems that the authors identified were related to image quality, lack of georeferencing, lack of camera settings in the VR environment, lack of sounds, user controls on the Cardboard version of the tour, and lack of user control in the videos. No comparison between the devices was conducted.
Klippel et al. (2019) carried out a study that contrasted conventional field trips with immersive virtual field trips. Moreover, the authors proposed a taxonomy for the area. Thirty-seven students took part in the experiment in a between-subjects design. They visited an outcrop either physically or virtually by using an HTC Vive device. The users of the virtual trip were guided through 14 scenes with 360º images, and a series of ad hoc questionnaires was applied to assess technological satisfaction, learning experience, orientation abilities, and sense of presence. The subjective measures were analyzed from a multidimensional perspective with scores for each factor. The results pointed out that immersive virtual trips present advantages regarding satisfaction, learning, and grades when compared to conventional field trips.
Sathe et al. (2017), Butcher and Ritsos (2017), and Ye et al. (2017) developed different web VR applications: a shopping website, a prototype of an app for data visualization, and a system for visiting control system facilities that also allows observing experiments. The three systems were developed using WebVR and other technologies, and all of them run on web browsers. Even though the overall software architecture is detailed by Sathe et al. (2017) and by Ye et al. (2017), user testing is not presented by any of these authors.

Virtual Tour Case Study
We adopted the WebXR API to develop the virtual tour because it assures that the same virtual environment can be viewed on different devices through a web browser. The React 360 framework, created by the Facebook team, was also used as a productivity tool since it is a popular framework for developing immersive and semi-immersive web scenes.
We also created a scene mesh with 90 scenes to represent UNIPAMPA in the virtual tour. Navigation between scenes was done by teleportation when the user selected a navigational element. Each scene is composed of a 360º picture taken with a Samsung Gear 360 camera, a scene title to aid users' orientation, one or more navigation elements to reach adjacent scenes, and (occasionally) one or more information texts to better explain the details of the place that the user is seeing. The pictures had originally been captured in a spherical shape, so they were converted to a panoramic shape (the only one supported in the React 360 framework) with a size of 5472 x 2736 pixels and a resolution of 96 dpi. We compressed the scene images into JPEG files with 90% or higher quality, resulting in files of about 1 MB.
Four tasks regarding the virtual tour were performed by users in experiment sessions. Each task was performed in 10 minutes or less. The four tasks had different complexity levels: easy, medium, hard, and long trip. The complexity of each task was directly related to the course size and to the difficulty in locating the right elements in each scene.
The easy task required the user to navigate through two scenes from the first scene (in the shortest path) and to count the number of dogs in the scene. The medium task required the user to navigate through at least six scenes and to count the number of students in UNIPAMPA's library. The hard task required the user to navigate through at least 17 scenes and to identify four distinct geometric shapes located in the university's main staircase. Finally, the long trip required the participant to travel through more than 18 scenes at UNIPAMPA and to read aloud the information text of an item located in the last scene. For each performed task, the participant used a different device. The full sequence for the easy task ("How many dogs can be found at the Secondary Entrance?") is presented in Figure 1 and described in detail next:
1. At the beginning of the virtual tour, the participant sees the first scene (Figure 1a).
2. Next, the participant selects the navigation element (1) in Figure 1a and is teleported to the second scene (Figure 1b).
3. Next, the participant selects the navigation element (2) in Figure 1b and is teleported to the third scene (Figure 1c).
4. In Figure 1c, the participant is able to count the number of dogs in the scene (three), and the task is finished.
To avoid creating a bias, we rotated devices and tasks among participants by using two configuration models based on the Latin Squares design, as recommended by Zaiontz (2018) (Tables 1 and 2). We point out that the pattern restarts after the 4th participant in Table 1 and after the 16th participant in Table 2.
After the participant had accomplished a task, the experimenter handed them the adapted SEQ questionnaire and wrote down the time taken to conclude that trip. If the user did not finish a task within the time limit (10 minutes), the experimenter would interrupt them to deliver the adapted SEQ questionnaire and note down the time limit.
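The rotation of devices across participants mentioned above can be illustrated with a cyclic Latin square. This is only a minimal sketch: the actual orderings used in the study are those of Tables 1 and 2, and the function names and the cyclic construction below are assumptions made for illustration.

```python
from typing import List

# The four devices evaluated in the study.
DEVICES = ["laptop", "smartphone", "Google Cardboard", "Samsung Gear VR"]


def cyclic_latin_square(items: List[str]) -> List[List[str]]:
    """Build an n x n Latin square by cyclically shifting the item list.

    Every item appears exactly once in each row (one participant's
    sequence) and exactly once in each column (one position in the session).
    """
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def device_order(participant_index: int, square: List[List[str]]) -> List[str]:
    """Return the device sequence for a participant (0-based index).

    The pattern restarts after the last row, so participant 4 receives
    the same sequence as participant 0 in a 4 x 4 square.
    """
    return square[participant_index % len(square)]


square = cyclic_latin_square(DEVICES)
for index in range(5):
    print(index + 1, device_order(index, square))
```

A cyclic square balances the position of each device but not which device precedes which; a design that also rotates tasks, as the study did, needs a second square or a larger combined pattern.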
The complete flow of activities during a test session is summarized next:
1. Greetings: the experimenters introduce themselves and thank the guest for their availability.
2. Experiment presentation: the experimenters describe and present the experiment and the devices.
3. Screening: the guest fills out the Screening Form (SF) and might be prevented from carrying on with the experiment if there is any risk.
4. Terms and Profile: the screened guest who agrees to take part in the experiment fills out the Informed Consent Form (ICF) and the Demographic Survey (DS).
5. 1st task: the participant receives instructions about the first device and the first task and tries to accomplish the task. Next, the participant fills out the adapted SEQ while the experimenter writes down the elapsed time.
6. Other tasks: the previous step is repeated for the remaining three devices and three tasks.
7. Acknowledgments: the experimenter thanks the participant and the test session is finished. The participant might also experience other VR apps on the Samsung Gear VR device.
We also searched scientific databases to identify standardized questionnaires related to virtual experience measurement, since these tools would support the gathering of participants' opinions on the experience and the interaction with the virtual tour. The Suitability Evaluation Questionnaire (SEQ) of Gil-Gómez et al. (2013) is noteworthy for its multidimensional perspective (user satisfaction, sense of presence, perceived success, perceived control, realism, comprehensibility of instructions, cybersickness symptoms, and general discomfort), for its small number of items, for being based on another standardized instrument, and for having a prior psychometric assessment. The questionnaire was based on the Short Feedback Questionnaire (SFQ) of Kizony et al. (2005), which in turn was based on Witmer and Singer's Presence Questionnaire. Although both the SEQ and the SFQ were created to evaluate the VR experience in the context of rehabilitation systems, the latter is much simpler than the former and does not contain specific items to assess the user's perception of progress in rehabilitation. Also, we understand that suitability represents the degree of appropriateness of a system designed for a particular domain, and that it covers a subset of the UX construct according to the original SEQ study. Thus, in this paper we use "suitability" as a synonym for "User Experience" regarding the UX factors assessed through the questionnaire. The items of the original SEQ are presented in Table 3. Details on the translation, adaptation, and assessment process can be found in Subsection 4.3.
We applied two additional questionnaires in our experiment: a Demographic Survey (DS) and a Screening Form (SF) to ensure the participants' safety during and after using VR devices. The latter questionnaire was crucial for identifying risks in using the Google Cardboard and the Samsung Gear VR, and it was based on the instruction manuals of the main VR hardware, which point out the user profiles more sensitive to intense side effects from immersive VR experiences. The screening procedure aimed to prevent the following profiles from taking part in the experiment: pregnant people, people under the effects of psychoactive medication or other substances, people with psychiatric or neurological issues, people who feel ill, and people with vision impairment, among others. We point out that the guests never disclosed the exact condition that could keep them from taking part in the experiment; they only reported that one or more items in the SF blocked their access to the experiment, and thus they were immediately dismissed from the testing session after the acknowledgments.

Evaluated Devices
The chosen immersive devices are the Google Cardboard and the Samsung Gear VR. On the other hand, the non-immersive devices are a laptop computer and a smartphone.
In Figure 2a, we present the use of the Google Cardboard viewer. It was operated with an Asus Zenfone 3 Zoom smartphone (5.5" screen with 1080 x 1920 pixels and ∼401 ppi resolution) and a Dell WM126 wireless mouse that worked as a clicker.
In Figure 2b, we show the use of the Samsung Gear VR HMD. It was operated with a Samsung Galaxy S6 smartphone (5.1" screen with 1440 x 2560 pixels and ∼577 ppi resolution). The side touchpad was used for navigation.
In Figure 2c, we represent the use of the laptop computer Asus model K46CA with Windows 8.1 and 14" screen (1366 x 768 resolution). The same Dell wireless mouse was used to control navigation.
Finally, in Figure 2d, we show the use of the mentioned Asus Zenfone 3 Zoom smartphone. The navigation was made by touching the screen.

Ethical Considerations
All participants were informed about the details of the experiment and about the risks regarding the side effects of using VR devices. The aforementioned SF was applied to every guest to reduce the hazard to participants.
The guests authorized to take part in the experiment filled out and signed an Informed Consent Form (ICF), and they were given copies of Confidentiality Agreement (CA) and ICF signed by the experimenters. Also, all the documents given to participants contained the phone number of the researchers and some emergency recommendations in case of presenting VR side effects after leaving the experimental installation.
All the protocols and experimental procedures have been designed in consonance with the recommendations of the UNIPAMPA's Ethics Committee on Research.
Table 3. Items of the original SEQ.
Items answered on a five-point scale, (1) (2) (3) (4) (5) Very much:
Q2. How much did you sense to be in the environment of the system?
Q3. How successful were you in the system?
Q4. To what extent were you able to control the system?
Q5. How real is the virtual environment of the system?
Q6. Is the information provided by the system clear?
Q7. Did you feel discomfort during your experience with the system?
Q8. Did you experience dizziness or nausea during your practice with the system?
Q9. Did you experience eye discomfort during your practice with the system?
Q10. Did you feel confused or disoriented during your experience with the system?
Q11. Do you think that this system will be helpful for your rehabilitation?
Items answered on a five-point scale, Very easy (1) (2) (3) (4) (5) Very difficult:
Q12. Did you find the task difficult?
Q13. Did you find the devices of the system difficult to use?
Item with open response:
Q14. If you felt uncomfortable during the task, please indicate the reasons.

Software Tools
The virtual tour was developed based on the React 360 framework v1.0.0 using the Sublime IDE.
The conversion from sphere shape to the panoramic shape of the 360º pictures taken with the Samsung Gear 360 camera was made through the Cyberlink ActionDirector software.
The following web browsers were adopted to experience the virtual tour in each device: Google Chrome for Windows 8.1 on the laptop, Oculus Browser on Gear VR, and Google Chrome for Android on the smartphone and the Cardboard.
We used Google Forms to create digital versions of the adapted SEQ questionnaire, the DS, and the survey of the adapted SEQ translation and back-translation process. The time for task accomplishment was also registered on a Google Drive spreadsheet.
Adapted SEQ and time data were imported into RStudio (version 1.4.1106) for performing psychometric analysis, descriptive statistics, and inferential statistics. We used the R programming language (version 4.0.5) and the R packages psych (version 2.1.3), GPArotation (version 2014.11.1), and irrCAC (version 1.0).

Results
This section includes the demographic profile of the participants, the gathered data, the statistical methods applied, and the research question answers.

Participants
Fifty-one undergraduate students were invited to participate in this study by personal contact or by social media contact. Of these 51 guests, 41 took part in the experiment after the screening phase. Sixty-one percent of the participants (n = 25) were male.
Furthermore, the participants were distributed in undergraduate programs as follows:
• 18 students of Software Engineering
• 10 students of Civil Engineering
• 7 students of Electrical Engineering
• 3 students of Telecommunications Engineering
• 2 students of Agricultural Engineering
• 1 student of Computer Science
The average age of the participants was 23 years (mean = 23.0, SD = 3.578, minimum = 18, maximum = 38).
We also verified the sensitivity of participants to motion sickness in vehicles. Forty-four percent (n = 18) of the participants reported that they had already experienced sickness while being transported by car, bus, ship, or airplane. Among these people, only 1 participant reported that it happens often. This student was once again informed about the experiment risks regarding cybersickness and asked if he wanted to carry on with the test nevertheless.

SEQ adaptation, translation and psychometric analysis
The original SEQ developed by Gil-Gómez et al. (2013) aims to evaluate usability, suitability, and safety aspects of VR experiences for rehabilitation software. To use the SEQ for evaluating a virtual tour, we adapted it. The adaptation process only required removing one question of the original SEQ that is exclusively related to rehabilitation systems: "Q11. Do you think that this system will be helpful for your rehabilitation?" The remaining items compose the broad and multidimensional evaluation tool presented in Subsection 3.1, and they went through a translation process whose results can be seen in Table 4.
The SEQ translation (except for Q11 from the original questionnaire) into Brazilian Portuguese was made by the authors of this paper (one of them proficient in English). Next, the translation quality was assessed in a translation and back-translation process based on the recommendations of Coster and Mancini (2015). The evaluators had no prior knowledge of the SEQ and had no access to the complete instrument during the assessment, which helped avoid bias in the evaluation process.
Three independent evaluators, researchers in Human-Computer Interaction (HCI) and proficient in English, assessed the semantic equivalence between the original SEQ items and the translated items. Based on the answers to a digital form that had been sent to the evaluators, we achieved an agreement percentage of 89.7% and a Gwet's (2008) AC1 agreement coefficient of .886 (p < .001), which is considered very good according to Altman's benchmark scale described by Gwet (2016). Also, there was a majority of "Yes" answers to the question "Are the two items below semantically equivalent?" for each pair of sentences consisting of the original SEQ item in English and the same item translated into Brazilian Portuguese.
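For readers unfamiliar with the two agreement measures, the sketch below computes the pairwise agreement percentage and Gwet's AC1 for several raters and nominal categories. It is a plain-Python illustration of the commonly published multi-rater formulas, not the irrCAC routine used in the study, and the ratings in the usage example are invented toy data.

```python
from typing import Dict, Sequence, Tuple


def agreement_and_ac1(
    ratings: Sequence[Sequence[str]], categories: Sequence[str]
) -> Tuple[float, float]:
    """Return (percent agreement, Gwet's AC1).

    ratings[i] holds the label given by each rater to item i.
    Assumes every item was rated by at least two raters.
    """
    n_items = len(ratings)
    q = len(categories)
    pa_sum = 0.0
    prevalence: Dict[str, float] = {c: 0.0 for c in categories}

    for item in ratings:
        r = len(item)
        counts = {c: list(item).count(c) for c in categories}
        # Share of rater pairs that agree on this item.
        pa_sum += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r

    pa = pa_sum / n_items
    # Mean share of ratings falling in each category.
    pi = {c: total / n_items for c, total in prevalence.items()}
    # Chance agreement as defined for AC1.
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return pa, (pa - pe) / (1 - pe)


# Toy example (not study data): three raters, four items.
toy = [
    ["Yes", "Yes", "Yes"],
    ["Yes", "Yes", "Yes"],
    ["Yes", "Yes", "No"],
    ["Yes", "Yes", "Yes"],
]
pa, ac1 = agreement_and_ac1(toy, ["Yes", "No"])
print(round(pa, 3), round(ac1, 3))
```

AC1 is often preferred over kappa-type coefficients when almost all answers fall in one category, as with mostly "Yes" judgments, because its chance-agreement term stays small in that situation.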
Next, an independent professional translator performed the back-translation of the Brazilian Portuguese SEQ into English.
Lastly, three independent evaluators, experienced in HCI research and proficient in English, assessed the semantic equivalence between the original SEQ items and the back-translated items. Once again, we collected the responses through a digital form that we had sent to the evaluators. There was a majority of "Yes" responses to the question "Are the two items below semantically equivalent?" for each pair of sentences consisting of the original SEQ item and the back-translated item, both in English, except for Q6, which received two "No" responses out of three. Thus, we achieved an agreement percentage of 59% and a Gwet's AC1 agreement coefficient of .364³ (p = .098), which is considered fair according to Altman's benchmark.
The evaluators' comments about Q6 were anonymized and sent to another expert translator for inspection. The analysis included checking the verb tense and the redundancy, since the back-translated version is "Was the information provided by the system clear enough?" After receiving the expert's feedback, we concluded that the disagreement is fair, but our Portuguese translation of Q6 is reliable and closer to the original SEQ version than the back-translated version.
Ultimately, we performed an exploratory psychometric analysis of the adapted and translated SEQ using the answers of all participants of our experiment. So, we raised validity evidence based on the internal structure of the questionnaire.
We carried out an Exploratory Factor Analysis (EFA) procedure to identify the general organization of the adapted SEQ structure and its factors. This technique is based on the analysis of the covariation of the observable variables, according to Nunnally and Bernstein (1994) and Bandalos and Finney (2010). Firstly, we confirmed the adequacy of our sample for EFA procedures through the Kaiser-Meyer-Olkin measure (KMO = .812) and Bartlett's sphericity test (p < .001). Considering the ordinal nature of our data and the expectation of correlations between the factors, we chose the extraction method of Principal Axis Factoring with Direct Oblimin oblique rotation and Kaiser's normalization for EFA, as recommended by Costello and Osborne (2005).
All the communalities were higher than .3, and the total explained variance was 53% using three factors found through the scree plot analysis of the eigenvalues (Figure 3). Table 5 presents the distribution of the adapted SEQ questions in each factor (items with an * had their scores inverted, and bold numbers represent the factor to which those items are significantly linked).
We can observe in Table 5 that, for every question, just one factor loading is either higher than .32 or lower than -.32, thus fitting Costello and Osborne's (2005) threshold, which represents about 10% of overlapping variance. However, we notice that Q6 is only slightly above the minimum loading proposed by Costello and Osborne (2005). Since it is not strongly loaded on any factor (i.e., a factor loading higher than .5 or lower than -.5 in one factor) and perspicuity is also covered by the tasks' difficulty (Q12), we suggest as future work identifying the cause of this effect and considering the removal of this question in a process of continuous improvement.
³ This coefficient is different from the one in our original article because we redid the calculations with new software and included Q6 ratings even though they had been analyzed separately. Nonetheless, the results in our previous work remain valid and the conclusions unchanged.
Furthermore, Cronbach's alpha coefficient (α), which is used for estimating the reliability of the instrument through the intercorrelation of its items according to Nunnally and Bernstein (1994) and Hutz et al. (2015), indicated good internal consistency for the adapted Brazilian Portuguese SEQ (α = .844). This measure surpasses the acceptable coefficient found for the original SEQ (α = .700) and indicates a more cohesive internal structure after the adaptation process and within the experiment context (virtual tours).
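As a reminder of how the coefficient is obtained, the following minimal sketch implements the textbook formula for Cronbach's alpha from a respondents-by-items score matrix. It is not the psych package routine used in the study, and the scores in the usage example are invented.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a matrix with one row per respondent.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    Reverse-scored items must already be inverted before calling this.
    """
    k = len(scores[0])
    item_variances: List[float] = [
        variance([row[j] for row in scores]) for j in range(k)
    ]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy example (not study data): three respondents, two items.
toy_scores = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(toy_scores), 3))
```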
The original SEQ scoring procedure consists of adding up the items' scores to achieve a global score that ranges from 13 (poor suitability) to 65 (excellent suitability). The items Q7, Q8, Q9, Q10, Q12, and Q13 (Table 3) need to be reversed before adding since they are all negative items.
According to Avila et al. (2015), the methodological procedures to evaluate the factor structure of the adapted SEQ in light of Classical Test Theory (CTT) and the original scoring strategy allow us to classify the questionnaire as a reflective measurement model. Considering that simple summation is the most commonly used technique for this type of model, we decided to keep it in our analyses. However, the multidimensional structure of the SEQ requires additional investigation into the scoring procedures: should we report one total score or the factor scores separately? Furthermore, is this imperative to the interpretation of results?
We used Haberman's approach described by Reise et al. (2013) to evaluate whether the test total score (TOT_X) is a better predictor of the collection of true factor scores (SUB_true) than the individual factor scores (SUB_X). The strategy involves computing "the proportional reduction in mean squared error based on total scores (PRMSE_TOT) and compare it with PRMSE_S [the proportional reduction in mean squared error based on subscale scores estimated through the Cronbach's α for each factor]."
First, we computed the Cronbach's α (PRMSE_S) for each factor: .812 (presence), .762 (cybersickness), and .761 (usability). The standard deviations for the total score and the factor scores are 6.870 (SEQ), 3.300 (presence), 2.002 (cybersickness), and 3.366 (usability). The factor intercorrelations are .394 (usability-cybersickness), .513 (usability-presence), and .322 (cybersickness-presence).
Following the Reise et al. (2013) algorithm, we obtained the following PRMSE_TOT values: .666 (presence), .426 (cybersickness), and .708 (usability). Then we compared these values to the original PRMSE_S values: .812 (presence), .762 (cybersickness), and .761 (usability). Since the latter are larger than the former, we can argue that the factors' total scores are a better indicator of the factors' true scores and should be reported instead of the test score. Considering this piece of evidence, we have reasonable grounds for comparing the interpretation of results based on the total score (the original SEQ approach) with the one based on factor scores (the alternative approach).
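The comparison above can be sketched numerically. The snippet below uses one common formulation of Haberman's PRMSE for the total score, assuming that measurement errors are uncorrelated across factors, so that the covariance between a factor's true score and the observed total equals the factor's true variance plus its covariances with the other factors. The function and variable names are illustrative assumptions, and the exact steps followed by the authors may differ; the inputs are the values reported in the text.

```python
# Values reported in the text.
alpha = {"presence": 0.812, "cybersickness": 0.762, "usability": 0.761}
sd = {"presence": 3.300, "cybersickness": 2.002, "usability": 3.366}
sd_total = 6.870
corr = {
    frozenset(("usability", "cybersickness")): 0.394,
    frozenset(("usability", "presence")): 0.513,
    frozenset(("cybersickness", "presence")): 0.322,
}


def prmse_total(factor):
    """Squared correlation between a factor's true score and the total score.

    Assumption: errors are uncorrelated across factors, so
    Cov(true factor, total) = Var(true factor) + sum of Cov(factor, other).
    """
    true_variance = alpha[factor] * sd[factor] ** 2
    cov_with_total = true_variance + sum(
        corr[frozenset((factor, other))] * sd[factor] * sd[other]
        for other in sd
        if other != factor
    )
    return cov_with_total ** 2 / (true_variance * sd_total ** 2)


for name in alpha:
    # PRMSE_S is estimated by the factor's Cronbach's alpha.
    better = "factor score" if alpha[name] > prmse_total(name) else "total score"
    print(f"{name}: PRMSE_TOT={prmse_total(name):.3f}, "
          f"PRMSE_S={alpha[name]:.3f} -> report {better}")
```

Under these assumptions, the sketch yields PRMSE_TOT values that agree with the reported ones to about three decimals, and every factor's PRMSE_S exceeds its PRMSE_TOT.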

Adapted SEQ analysis
Our sample contains 164 fully answered questionnaires without any missing data 4 . The descriptive statistics for each adapted SEQ item can be found in Table 7, including sample means (X), standard deviations (SD), medians (Mdn), minimum and maximum values (Min. and Max.), skewness and kurtosis measures, and the W and p values of the Shapiro-Wilk normality tests 5 .

Analysis based on the total score
We computed the total score of the adapted SEQ (Table 4) of each participant by adding up the numerical value of every response. The sum was direct for items Q1, Q2, Q3, Q4, Q5, and Q6. Questions Q7, Q8, Q9, Q10, Q12, and Q13 have a negative tone and had to be reversed (i.e., an answer with value 1 is mapped to value 5, an answer with value 2 is mapped to value 4, and so on). The total score for the adapted SEQ goes from 12 points (worst user experience) to 60 points (best user experience). The item Q14 is assessed separately from the others since it is an optional, open-ended question that explains the discomfort faced by the users while performing the tasks during a virtual trip. Figure 4 shows the boxplot of the adapted SEQ scores on the different devices.
We performed a Shapiro-Wilk test on the adapted SEQ samples, and the result (p < .01) indicated that all samples come from non-normal distributions.
Since nonparametric tests are more appropriate for comparing non-normal samples, we carried out a Quade test (p < .001) to check for an overall difference and multiple Wilcoxon signed-rank tests to analyze individual differences between the scores of all devices. Table 6 presents the sample size for each pair in a within-subject perspective (N), the Z score (Z), the two-tailed probability value (p), and the correlation that represents the effect size (r).
We can notice that there is a significant difference only for the Computer-Cardboard, Gear VR-Cardboard, and Smartphone-Cardboard pairs (p < .05). By looking at the medians in Figure 4, we can also remark that the Google Cardboard scores are lower than the others, hence revealing an experience significantly worse than with the other devices. Furthermore, it is possible to claim a clear contrast since the effect size for the observed differences is moderate (r > .3) or large (r > .5) according to Pallant (2016).
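The scoring rule described above can be sketched as follows. This is an illustrative implementation, not the authors' script: the function names and the dictionary-based answer format are assumptions, while the item lists and the 1-to-5 reversal follow the text.

```python
DIRECT_ITEMS = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")
REVERSED_ITEMS = ("Q7", "Q8", "Q9", "Q10", "Q12", "Q13")


def reverse(value):
    """Map a 1-5 answer onto the opposite end of the scale (1->5, 2->4, ...)."""
    return 6 - value


def seq_total_score(answers):
    """Sum the 12 scored items of the adapted SEQ (range 12 to 60).

    `answers` maps item labels such as "Q1" to integer answers from 1 to 5.
    Q14 is an open-ended question and is not part of the score.
    """
    direct = sum(answers[item] for item in DIRECT_ITEMS)
    reversed_sum = sum(reverse(answers[item]) for item in REVERSED_ITEMS)
    return direct + reversed_sum


# Hypothetical best-case participant: top answers on positive items,
# lowest answers on negative items.
best_case = {item: 5 for item in DIRECT_ITEMS}
best_case.update({item: 1 for item in REVERSED_ITEMS})
best_score = seq_total_score(best_case)
```

A participant with the opposite answer pattern would obtain the minimum score of 12.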
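As a rough illustration of how the effect size in Table 6 relates to the Z score, the sketch below uses the common conversion r = |Z| / sqrt(N). Treating N as the total number of observations in the pair is an assumption here, and the Z and N values in the example are hypothetical, not taken from Table 6. The labels mirror the thresholds quoted in the text.

```python
import math


def effect_size_r(z_score, n_observations):
    """Convert a Wilcoxon Z score into the correlation-style effect size r."""
    return abs(z_score) / math.sqrt(n_observations)


def effect_size_label(r):
    """Label r using the thresholds quoted in the text (r > .3, r > .5)."""
    if r > 0.5:
        return "large"
    if r > 0.3:
        return "moderate"
    return "low"


# Hypothetical example values.
example_r = effect_size_r(-4.0, 40)
example_label = effect_size_label(example_r)
```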

Analysis based on subscale scores
We also computed the factor scores of the adapted SEQ of each participant according to the EFA output structure. The presence factor was computed by adding up the raw scores of items Q1, Q2, Q5, and Q6, ranging from 4 (worst sense of presence) to 20 (best sense of presence). The cybersickness factor was computed by adding up the raw scores of Q7, Q8, and Q9, ranging from 3 (no cybersickness symptoms) to 15 (intense cybersickness symptoms). Finally, the usability factor was computed by adding up the raw scores of items Q3 and Q4 and the reversed scores of Q10, Q12, and Q13, ranging from 5 (worst usability) to 25 (best usability).
We want to call attention to the way the cybersickness factor score is calculated. While computing the adapted SEQ total score demanded the inversion of items Q7, Q8, and Q9, calculating the cybersickness score did not. This happens because the cybersickness items are combined with the items of the other factors when calculating the total score: since higher SEQ scores represent a better user experience, the responses "(5) Very much" should be reversed when they come from negative items like "Q7. Did you feel discomfort during your experience with the system?" in order to contribute positively to the final score. On the other hand, the cybersickness factor score does not depend on other factors when computed alone. Thus, the lower the score, the better the user experience for this particular case. Figure 5 presents the boxplot of the factor scores on the different devices.
Shapiro-Wilk tests on each factor score allowed us to identify that all samples come from non-normal distributions (p < .01). Therefore, we performed Quade tests (presence factor p < .001, cybersickness factor p < .001, and usability factor p = .013) and multiple Wilcoxon signed-rank tests to analyze the differences among devices. Table 8 shows, for every factor, the sample size of each pair in a within-subject perspective (N), the Z score (Z), the two-tailed probability value (p), and the effect size (r).
We confirm a significant difference in all pairs containing the Cardboard device (p < .05). For all factors (presence, cybersickness, and usability), the mean scores of Google Cardboard are lower than the others and the effect size for these differences is moderate (r > .3) or large (r > .5), except for the usability factor of the pairs Computer-Cardboard and Smartphone-Cardboard.
Nevertheless, the pairs Computer-Gear VR and Gear VR-Smartphone also presented significant differences in the presence factor mean scores (p < .05) with a moderate effect size (r > .3). Also, the pair Computer-Gear VR achieved a significant difference in cybersickness factor mean scores (p < .05) with a low effect size (r ≤ .3).
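The factor scoring described above can be sketched in a few lines. The function name and the dictionary-based answer format are illustrative assumptions; the item groupings and the choice to leave the cybersickness items unreversed follow the text.

```python
def seq_factor_scores(answers):
    """Compute the three factor scores of the adapted SEQ.

    `answers` maps item labels such as "Q1" to integer answers from 1 to 5.
    """
    # Presence: raw sum of four positive items (range 4 to 20).
    presence = sum(answers[q] for q in ("Q1", "Q2", "Q5", "Q6"))
    # Cybersickness: raw sum, NOT reversed, so lower is better (range 3 to 15).
    cybersickness = sum(answers[q] for q in ("Q7", "Q8", "Q9"))
    # Usability: two positive items plus three reversed negative items
    # (range 5 to 25).
    usability = answers["Q3"] + answers["Q4"] + sum(
        6 - answers[q] for q in ("Q10", "Q12", "Q13")
    )
    return {
        "presence": presence,
        "cybersickness": cybersickness,
        "usability": usability,
    }


# Hypothetical participant with the best possible experience.
ideal_answers = {q: 5 for q in ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")}
ideal_answers.update({q: 1 for q in ("Q7", "Q8", "Q9", "Q10", "Q12", "Q13")})
ideal_scores = seq_factor_scores(ideal_answers)
```

Note that the best experience combines the maximum presence and usability scores with the minimum cybersickness score, which is why the factor scores cannot simply be read on a common "higher is better" scale.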

Analysis of the open response question (Q14)
As question 14 (Table 4) demanded a full written answer, we chose to copy 6 the comments from all participants into Table 9. We can notice in Table 9 that only computer use did not cause vision stress, dizziness, or headache. This result was expected, since the computer is a stable, immobile object with a big screen that relies on mouse interaction. The long loading times might indeed be caused by the rendering time of a scene on a big screen, since all devices used the same wireless Internet connection. Moreover, the need to change the direction of the field of view after the users had been positioned backward (Cardboard, 6th item in Table 9) applies to all devices and is caused by the same usability problem: sometimes, the user is positioned facing a different, predefined direction when they enter a new scene.
Regarding the Gear VR device, we observe that one participant reported a dizziness effect. It is a common problem in immersive devices, and it is also a cybersickness symptom addressed by many authors such as Kennedy et al. (1993) and LaViola Jr. (2000). The temporary frame rate slowdown might be caused by overheating, since cellphones usually deal with this problem through thermal throttling and other similar approaches that degrade performance. Besides, the blurred lens happened because of the skin's warmth around the HMD region and the cold, wet weather in the facilities.
Smartphone use resulted in an unexpected statement of discomfort. Even though we did not expect any cybersickness symptoms, since the smartphone is a non-immersive device, the report of headache and eye pain is closely related to oculomotor issues as presented by Kennedy et al. (1993). We believe, in this case, that the look-around movements that participants performed while focusing on a small screen to accomplish every task could have caused such sickness symptoms. The second reported issue in Table 9 was already discussed in the paragraph about the computer device, and it is a usability problem shared by all devices.
Finally, the Cardboard device accumulated problems: blurred lenses, user dizziness, and poor image quality. Although we consider that the smartphone used together with the Google Cardboard has a decent resolution (Subsection 3.2), we are not sure whether the main cause of discomfort is the smartphone resolution, the lens quality of the cheaper viewer, or even the rendering mechanism of the web browser.

Table 9. Answers from all participants for question Q14 of the adapted SEQ after using all devices.
Computer
1. Just after I'd clicked to access other areas, the system took a considerable amount of time to respond. The use itself didn't cause discomfort, but it did as I needed to wait for new scenes to appear.
2. Sometimes, while navigating, the next scene was loaded turned to where I'd been, so it was necessary that I rotate my view through 180° to go on.
Gear VR
1. At some moments, the information labels, the scene title, and the navigation buttons were blurred, thus causing discomfort.
2. A little bit uncomfortable cause the slight system slowdowns, but it didn't really disturb either the experience or the task performance.
3. Dizziness.
Smartphone
1. I guess that being too much time doing the task caused me a bit of headaches and eye pain.
2. I felt uncomfortable regarding the delay in showing the images, and I also felt confused regarding being turned to one direction and, after I clicked a button and entered the next scene, getting this direction reversed (I needed to turn myself back to the path I'd been following).
Cardboard
1. Because of the visualization through the device, the images were quite blurred at some moments, and it forced me to make more effort to see, thus causing discomfort.
2. Low quality, I couldn't see a thing.
3. I felt uncomfortable regarding the visual quality of the images in which the green buttons were nearly impossible to read. I was able to accomplish the task just because I'd recalled from the previous task that it was written "secondary entrance" in that same button.
4. Because of the low-quality visualization, it made me feel uncomfortable due to forcing my vision a lot.
5. This device made me feel a little dizzy because of the low resolution.
6. Poor image. In some scenes, I entered out of order. For instance: I'd entered some scene facing forward and, in the next scene, I was facing backward.
7. Low resolution caused me discomfort.
8. The images are blurred, and the text unreadable.

Time Analysis
We present an overall view of the elapsed time for completing tasks in each device in Figure 6 7 . It is worth mentioning that the time limit for each task was 10 minutes.
Again, a Shapiro-Wilk test applied to the time samples revealed non-normal distributions (p < .01), but we were unable to identify significant differences through a Quade test (p = .912). We can notice that the Samsung Gear VR device was the only one with which participants accomplished all the tasks without exceeding the time limit (it does not reach the top of the boxplot chart in Figure 6) but, nonetheless, the difference is not substantial.
Since the tasks have different levels of complexity and therefore unequal times for accomplishment, we performed a fairer analysis task by task. An overview of the test results can be seen in Figure 7.
We applied Kruskal-Wallis tests (easy task p = .256, medium task p = .054, hard task p = .758, and long trip p = .105) to assess differences among devices in each task type. We adopted this approach because all samples represent temporal data with similar non-normal distributions and because the comparison between devices in each task is performed in a between-subject perspective (a single participant appears just once in one of the two compared samples). Even though the tests did not reveal significant dissimilarity, we chose to present the overall tendencies via multiple Mann-Whitney U tests 8 . Table 10 presents the size of each compared sample (NN), the Z score (Z), the two-tailed probability value (p), and the effect size (r).
Next, we looked into the discomfort reported by participants in adapted SEQ Q14 (Table 9) to help us interpret such results, since the differences in Table 10 were not supported by the previous Kruskal-Wallis tests.
7 Data with elapsed time for each participant and each device is available at https://figshare.com/s/b2fe9fd05d5373333f90.
8 We also decided to keep Table 10 to maintain consistency with the original paper.
Regarding the medium task and long trip, participants using the Google Cardboard device declared discomfort related to readability problems (items 2, 6, and 8 in Table 9) while participants using Samsung Gear VR and computer devices did not report any problems. In this case, we suppose the existence of rendering issues on the Android Chrome web browser for VR mode (where the screen image is split into two separate images, one for each eye) resulting in lower visual quality. However, it remains unclear why participants were not equally affected in easy and hard tasks.
Unfortunately, there were no discomfort records concerning the easy task that could help us better understand the difference between Gear VR and the smartphone.

Discussion
(1) Is there any significant difference in the experience reported by users of virtual tours among different devices? Yes. The Google Cardboard platform presented a significantly worse User Experience than the computer, smartphone, and Samsung Gear VR. While the main measures for this conclusion have been the adapted SEQ total score (by measuring self-reported user satisfaction, sense of presence, perceived success, perceived control, realism, comprehensibility of instructions, cybersickness symptoms, and general discomfort) and the adapted SEQ factor scores (presence, cybersickness, and usability), the discomfort reported by participants on each device corroborates the score analysis. We also observed that the key negative factor in Cardboard was the readability problems caused, most likely, by rendering issues on the web browser used and by the low-budget lenses of the platform. We are not sure about the impact of the screen resolution, which is slightly lower in the smartphone used with the Cardboard.
We were not able to identify any substantial difference among the other devices (computer, smartphone, and Gear VR) by the adapted SEQ overall score, but the scores of the presence factor alone indicate that the sense of being there was higher when users experienced the virtual tour through the Samsung Gear VR. This result was expected since the Gear VR is the only immersive, high-quality device under testing. Sundar et al. (2017) also identified higher presence scores in VR experiences using a different questionnaire despite using a Cardboard as VR equipment. Since the experiments of Sundar and colleagues (2017) involved watching VR content passively, the participants probably did not face many usability issues related to control, like blurred navigation widgets and hard-to-read instructions.
Furthermore, the scores of the cybersickness factor show that symptoms are more intense for Gear VR users in comparison with desktop computer users. This was also anticipated because desktop computers are more stable and thus unlikely to induce cybersickness symptoms, especially when compared to immersive VR devices.
(2) Is there any difference in performing a holistic or a multidimensional UX analysis? Yes. The multidimensional analysis based on factor scores revealed differences among devices that otherwise would go unnoticed. Some of these differences are obvious, like the higher sense of presence for Samsung Gear VR users and the weaker cybersickness symptoms for desktop computer users when compared to immersive devices. On the other hand, there is some new insight about factor correlations. It is important to notice that even if the presence and usability factors are strongly correlated (r = .513), a lower sense of presence does not seem to be enough to impact the overall usability, as we can observe in the pairs Computer-Gear VR and Gear VR-Cardboard. This corroborates the results of Chow (2016), who found the same .51 correlation coefficient between the "perceived ease of use" and "presence" factors, but no strong direct effect based on Structural Equation Modeling analysis.
Since there are two cases of a different sense of presence and indistinguishable usability (Gear VR compared to non-immersive devices) and there is no case of the opposite, we hypothesize that the interaction noise and obstacles faced by users (namely usability issues) affect the sense of being there, but not the contrary. Another piece of evidence supporting this hypothesis is the consistently lower presence scores for the Google Cardboard, although it provides an immersive VR experience. Sundar et al. (2017) got higher presence scores from Google Cardboard users in comparison with desktop computer users (which contradicts our findings), but their stimuli contained just 360° videos with almost no interaction and thus no room for substantial usability issues.
(3) Is there any significant difference in the users' performance among different devices? We did not observe any significant difference in our case study. Even though the analysis of the elapsed time to accomplish tasks presented dissimilarity for three cases in tasks with distinct complexity levels, the overall test showed no statistical significance, and a detailed analysis revealed that the issues faced by participants (causing lower performance) were not constant. Thus, we cannot claim that the readability problems consistently affected users of the Cardboard platform, since the decreased performance in two out of twelve comparisons (Table 10) does not impact the overall performance with the device.
Besides, we did not identify any performance difference among the distinct interaction equipment for selection: mouse on the computer, trackpad on the Gear VR, touch screen on the smartphone, and clicker on the Cardboard.

Conclusion
As the demand for web virtual tours grows and the compatibility between web Virtual Reality and visualization devices increases, it is important to check the differences in the users' experiences and performance on distinct platforms, since this leads the users to a better choice based on cost-benefit and context of use.
Thus, we developed a virtual tour in UNIPAMPA using WebXR and React 360 technologies, and carried out a case study with 41 participants. Four distinct platforms (a laptop computer with a mouse, a smartphone, a Google Cardboard viewer with a mouse that was used as a clicker, and a Samsung Gear VR HMD) were alternately used by all participants while performing tasks with different levels of complexity.
We conclude from this research that the adapted SEQ scores point to a significantly worse overall experience on the Google Cardboard viewer when compared to the others in our virtual tour context. Moreover, the sense of presence was higher for Samsung Gear VR users and the cybersickness symptoms were weaker for desktop computer users, as expected. Finally, differences in the elapsed time to accomplish tasks are indistinguishable among the platforms.
The key contributions of our case study are (1) the identification of a worse user experience on Google Cardboard in virtual tours, mainly related to visualization problems, (2) the observation of similar users' time performance among the platforms, (3) the discovery of cybersickness symptoms while using the smartphone for the virtual trip in a non-immersive context, and (4) the possibility of low usability exerting a direct effect upon the sense of presence (the lower the usability, the lower the presence).
An important secondary contribution of our paper is the translation, adaptation, and psychometric analysis of a standardized questionnaire to assess UX aspects related to presence, cybersickness, and usability in virtual tours. The adapted SEQ presents enough evidence for being used as a standardized instrument in a multidimensional perspective by measuring user satisfaction, sense of presence, perceived success, perceived control, realism, comprehensibility of instructions, cybersickness symptoms, and general discomfort.
Based on the factor structure of the adapted SEQ, we also identified that computing factor scores is a more reliable strategy for interpreting results than computing a total score. The scoring procedure analysis as presented by Reise et al. (2013) is, to the best of our knowledge, innovative for User Experience standardized questionnaires. It is also important to mention that this approach is only feasible for multidimensional questionnaires.