An Analysis of the Evaluation Methods being Applied to Serious Games for Autistic Children

Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that significantly impacts social communication and interaction and involves behavioral impairments, including restricted and repetitive patterns of behavior, interests, or activities. In recent years, numerous studies have proposed serious games as a way to aid in the therapy of children with ASD. Hence, it is crucial to evaluate the effectiveness of such games and obtain robust evidence of their positive influence on this type of treatment. In this study, we explore the evaluation of games for autistic children by conducting a Systematic Literature Review. We analyze the methods used to evaluate these games, how they were applied and combined, the quality aspects assessed, and the number and characteristics (e.g., age and special need) of the participants involved in the evaluation process. Furthermore, we present a compilation of the study findings for each evaluation method. Our findings reveal that there is no standardized methodology: different methods have been used and combined in various ways to evaluate serious games that support the treatment of children with ASD. As contributions, this paper provides valuable insights into how serious games have been evaluated in this context and can be useful for researchers and game designers working in the field.


Introduction
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized mainly by persistent deficits in social communication and social interaction, along with repetitive and restrictive patterns of behavior, interests, or activities. According to studies by the Centers for Disease Control and Prevention of the United States, conducted on 8-year-old children in 2018, the prevalence of ASD is 1 in every 44 children [Maenner et al., 2021]. In Brazil, there are no official studies that estimate the prevalence of this disorder in the country [Sukiennik et al., 2021].
In recent years, different studies have investigated technologies to support the treatment of individuals with ASD [Cordeiro et al., 2018; Virnes et al., 2015]. In this sense, different types of technologies and solutions have been explored. For example, Jouaiti and Henaff [2019] investigated robot-based solutions for the motor rehabilitation of children with ASD. Khowaja et al. [2020] examined the use of augmented reality to improve various skills of children and adolescents with ASD. In turn, Glaser and Schmidt [2021] investigated virtual reality intervention design patterns for individuals with ASD. Additionally, multiple researchers have explored the use of serious games, with different focuses, aimed at individuals with ASD (e.g., Xianmei [2017]; Hassan et al. [2021]; Kirst et al. [2022]; Ghanouni and Lucyshyn [2021]; Carreño-León et al. [2021]; Vallefuoco et al. [2021]). A game can be defined as a "serious game" if its purpose is not only to entertain the player but also to develop a skill [Ritterfeld et al., 2009; Susi et al., 2007].
Serious games are a type of technological solution that has been widely investigated to support the treatment of ASD [Tsikinas and Xinogalos, 2019; Hassan et al., 2021]. Consequently, it is crucial that such games are evaluated to obtain solid evidence of their positive impact on this type of treatment.
Thus, this research aims to investigate how serious games for children with Autism Spectrum Disorder are being evaluated. To achieve this goal, we conducted a Systematic Literature Review (SLR). This work is an extended and revised version of a previous conference paper [de Carvalho et al., 2022], which analyzed which evaluation methods were used, how these methods were applied and combined, what quality properties were evaluated, and the number and profile of the participants involved in the different evaluation types. In addition to the results of these analyses, this paper includes a characterization of the stakeholders involved in the evaluations and a compilation of the study findings for each evaluation method.
The results presented and discussed in this work contribute to: 1) advancing knowledge about how serious games aimed at children with ASD have been evaluated, and 2) supporting serious game researchers and developers in understanding the aspects considered in game evaluations and in making choices about how to conduct evaluations in their own projects.
The article is organized as follows: Section 2 provides an overview of ASD. Related work is discussed in Section 3. The methodology adopted to conduct the literature review and answer the research questions is detailed in Section 4. Section 5 presents and discusses the results of our analysis, and Section 6 presents a characterization of the stakeholders involved in the evaluations.

Related work
Our research focuses on serious games aimed at children with ASD. In this context, different studies have investigated various aspects of these games. For example, Noor et al. [2012] conducted a systematic review and presented an overview of serious games for children with ASD, focusing on the purpose of the game, its type, and the technologies used to develop it. Zakari et al. [2014] classified serious games for autistic children with respect to the technological platform, the purpose of the game, the type of graphics (i.e., 2D or 3D), game aspects, and user interaction devices. Tsikinas et al. [2016] classified serious games for people with an intellectual disability or ASD based on the adaptive behavior and intellectual functioning skills that the games aim to develop and their potential effects. Meanwhile, Xianmei [2017] presented an overview of somatic games (i.e., video games operated by body movements) aimed at autistic children, focusing on the game features, the implementation of interventions, and their effectiveness. Kousar et al. [2019] presented a comparison of serious games for autistic children, focusing on the purpose of the game, type of autism, technological platform, age, type of graphics (i.e., 2D or 3D), and category. Tsikinas and Xinogalos [2019] studied the effects of computer serious games on people with an intellectual disability or ASD. Hassan et al. [2021] evaluated the design of serious games aimed at improving the social and emotional intelligence of children with ASD. Silva et al. [2021] compared the use of serious games and entertainment games in interventions for treating ASD.
Although these works addressed serious games for children with autism, only some of them [Tsikinas et al., 2016; Tsikinas and Xinogalos, 2019; Hassan et al., 2021] investigated aspects related to game evaluation. On the other hand, different literature reviews have explored aspects related to the evaluation of serious games. For example, Calderón and Ruiz [2015] conducted a systematic review to investigate the state of the art of procedures, techniques, and methods used to evaluate serious games in different domains. In addition, the authors analyzed the specific context of evaluating serious games aimed at the software project management area. Yanez-Gomez et al. [2017] conducted a systematic review on usability evaluation in serious games. Petri and Gresse von Wangenheim [2017] investigated, in their literature review, how games aimed at teaching computing are evaluated. In turn, Marques and Monte [2022] conducted a systematic mapping of the literature to investigate how software technologies are being evaluated with users with autism.
Even though there are various studies analyzing serious games and technologies for individuals with ASD or evaluating serious games, none of them have specifically focused on the evaluation of serious games for children with ASD. The few that have addressed the subject [Tsikinas et al., 2016; Tsikinas and Xinogalos, 2019; Hassan et al., 2021] did so tangentially, amidst other issues. Therefore, this study has a more specific focus and extends the contributions of these works to the evaluation of serious games for children with ASD. As a distinguishing contribution, we present an overview of the different methodologies employed in the studies; we map these methodologies to the category of the evaluated game; we describe how they have been combined and applied; and we describe the profile and number of participants in the user evaluations and, finally, the criteria used to evaluate the games' quality.

Methodology
This study followed the guidelines for conducting an SLR proposed by Kitchenham [2004]. In this section, we present the methodology adopted in our SLR.

Research questions
Since the aim of this work is to analyze the methods applied in evaluating serious games aimed at children with ASD, the following research questions were formulated:
RQ1: What evaluation methods have been used?
RQ2: What methods have been used to evaluate each game category?
RQ3: How have the methods been applied in evaluations?
RQ4: What is the sample size and profile of the participants in the evaluations?
RQ5: What quality properties have been evaluated in the studies?

Search process
We acknowledge that the topic addressed in the SLR is multidisciplinary and can be approached from different areas of knowledge, such as education and health. However, this work focuses on analyzing how serious games have been evaluated from the perspective of the Human-Computer Interaction (HCI) area, to help in the future design and evaluation of games. For this reason, the searches were conducted in some of the main repositories and event proceedings that store relevant Computer Science works related to HCI and serious games: IEEE Xplore, ACM Digital Library, Entertainment Computing, SBGames (Brazilian Symposium on Games and Digital Entertainment), SBSC (Brazilian Symposium on Collaborative Systems), SBIE (Brazilian Symposium on Informatics in Education), and JIS (Journal on Interactive Systems). Since the ACM and IEEE libraries allow automated searches with the application of filters, we defined a search string to select the publications from these databases.
It is worth noting that the scope of our research was the analysis of serious games aimed at children with autism spectrum disorder, including other aspects besides their evaluation. Thus, we defined a more comprehensive search string to collect studies related to this context.
The search string considered was: (autism AND children AND game). Preliminary tests showed that articles using other terms to refer to ASD also contained the word autism; for this reason, only this term was included in the string. Subsequently, to conduct the analyses presented in this work, we filtered from the selected studies only those that presented a game evaluation process.
The SLR was conducted following these steps: (1) initial search; (2) elimination by title and abstract; (3) elimination by diagonal reading (i.e., introduction and conclusion); and (4) complete reading and data extraction. These steps were conducted following inclusion, exclusion, and quality criteria. In each step, the researchers recorded the data of interest in control spreadsheets.
In the initial search, for the event proceedings and repositories that did not allow automated search, manual searches were conducted and all available publications were included in the initial set of articles. In turn, the automated searches executed in the digital libraries that supported filters selected the works that included the search-string terms in the article content.
The following inclusion criteria were applied: studies from January 2010 to March 2020 that presented serious games specifically for children with autism or that included this population (e.g., games aimed at people with neurodevelopmental disorders). When more than one study focused on the same game, the most complete study was considered, and the others were discarded. As for the exclusion criteria, studies not written in English or Portuguese were eliminated. Robot-related studies were also eliminated, as they were outside the scope of our research.
For the complete reading step (step 4), the following quality criteria were considered: the work must clearly present its goal, research questions, methodology, results and contributions, and the evaluation process of the games. Articles that did not meet these criteria were excluded.
It is important to note that some precautions were taken to minimize interpretation biases during the elimination and selection of articles in the SLR. In each stage, two researchers analyzed the works, and then a consolidation analysis was carried out between the two. Additionally, if there were any doubts about whether an article should be eliminated, it was kept to be analyzed in the next stage. In the last step, in case of doubts regarding a particular article, a third researcher participated in the discussion and decision-making.

Data extraction
The following data were extracted from the selected studies: article title, reference, the skill(s) the game aims to improve, target audience, game name, sample size, profile and age range of the evaluation participants, methods used in the evaluation, and criteria used to evaluate the games. The extracted data are available at https://docs.google.com/spreadsheets/d/14u3skqiRQvdUAOoABwxFciHlkeq8ACoSjLJdPqSQTu0/edit?usp=sharing.
In the initial search, the automated search in the IEEE and ACM libraries returned 479 articles. From the other repositories and event proceedings that did not have this mechanism, all available studies were initially selected, resulting in 3522 works. In the elimination by reading title and abstract (step 2), 4001 articles were analyzed and 239 passed to the next step. In the elimination by diagonal reading step, 162 studies passed to the final step. Finally, after the complete reading and quality assessment of the articles, 70 were considered relevant for our analysis.
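As a quick sanity check, the selection-funnel figures reported in the text are internally consistent; a minimal sketch using only the numbers stated above:

```python
# Selection funnel figures as reported in the text.
automated_search = 479   # IEEE Xplore + ACM Digital Library
manual_search = 3522     # all studies from the venues without automated search

# Step 2 screens the union of both searches by title and abstract.
screened_by_title_abstract = automated_search + manual_search
print(screened_by_title_abstract)  # 4001

# Articles surviving each subsequent step, as reported.
after_title_abstract = 239
after_diagonal_reading = 162
after_full_reading = 70
assert after_title_abstract > after_diagonal_reading > after_full_reading
```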
In three of these articles, more than one game was presented and evaluated. In addition, only one game was the subject of two articles (but since one presented the game and a preliminary evaluation, and the other focused on a more detailed evaluation of the game, both were kept). Thus, at the end of our analysis, we considered the evaluations of 75 games.

Findings
In this section, the findings of the studies are discussed in relation to the research questions that were defined.

RQ1: What evaluation methods have been used?
From our analysis, we identified eight main methods used in the game evaluations, as presented below. It is worth observing that no new method was identified; all the methods described in the studies are traditionally used in the Human-Computer Interaction area.
• Observation: participants are observed while using the game, with field notes or audio/video recordings supporting the analysis.
• In-game evaluation: data is collected during the use of the game (e.g., performance logs or in-game surveys).
• Pre-post test: participants are assessed before and after using the game to measure its effect.
• Questionnaire: participants answer an ad hoc or standardized questionnaire about the game.
• Interview: feedback is collected through interviews with participants or other stakeholders.
• Experiment: evaluation of game usage in a controlled environment, with a statistical analysis of the results (usually involving a hypothesis test).
• Expert review: specialists (e.g., in ASD or HCI) inspect and assess the game.
• Focus group: a group discussion is held with stakeholders to evaluate the game.
To classify the method(s) used for each game, we registered the method(s) explicitly mentioned by the authors or, in case the authors did not mention a method, we classified it into one of the categories above, according to the description of the method used.
From the collected studies (i.e., 70 articles that evaluated 75 games), only five presented solely the evaluation of the algorithm implemented in the game [Rouhi et al., 2019; Frutos et al., 2011; Rapela et al., 2012; Dapogny et al., 2018; Dantas et al., 2020]. The remaining 70 games were evaluated using some method. Figure 1 shows the methods used in these evaluations. If the method was not clearly presented, the study was excluded from the visualization. For instance, in the study by Carvalho and da Cunha [2019], the authors present the parents' opinions about the game; however, they do not specify how these data were collected, so the article was not included in the graph depicted in Figure 1. Table 1 presents the classification of the games' evaluations according to the methods used. The most frequently used methods were: observation (35), followed by in-game evaluation (19), pre-post test (18), questionnaire (15), interview (11), experiment (9), expert review (7), and focus group (2).
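To illustrate how such per-method frequencies can be tallied from an extraction spreadsheet, the sketch below counts methods over a few invented study records (the game names and method lists are hypothetical, not the actual SLR data):

```python
from collections import Counter

# Hypothetical extraction records: one list of evaluation methods per game
# (invented examples for illustration only).
games = {
    "Game A": ["observation", "in-game evaluation", "pre-post test"],
    "Game B": ["questionnaire"],
    "Game C": ["observation", "interview", "questionnaire"],
    "Game D": ["experiment"],
}

# Each game contributes at most once to each method's count.
method_counts = Counter(m for methods in games.values() for m in set(methods))

for method, count in method_counts.most_common():
    print(f"{method}: {count}")
```

Sorting by `most_common()` yields the same kind of frequency ranking reported above (observation first, focus group last in the actual data).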

RQ2: What methods have been used to evaluate each game category?
We also investigated the relationship between the methods used in the evaluations and the category of the game (i.e., the skill(s) the game aims to improve). To do this, we classified the selected games into categories. The categories were obtained from another study we conducted, which focused on categorizing serious games for children with ASD regarding which skills they aim to develop, how their activities were operationalized, and what customization options were offered to users [de Carvalho et al., 2023]. Therefore, the following categories were used:
• Academic skills: skills associated with school activities (e.g., reading).
• Motor skills: skills related to moving oneself, or moving and interacting with objects.
• Social and socio-emotional skills: verbal and nonverbal skills for communication and interaction with others.
• General: associated with two or more skills or general purposes (i.e., not focused on a specific skill).
Table 2 presents the classification of the games according to the skills they focus on. Figure 2 presents the methods used to evaluate the games in each category. The y-axis shows the skills, followed by the number of evaluations conducted focusing on that category. The x-axis contains the methods identified in the analysis (described in RQ1). It is worth noting that some studies used a combination of methods to evaluate a game. As seen in Figure 2, there is no clear relationship between the methods used and the skill the game focuses on: for each skill, different types of methods were used.
It is worth noting that games in the Social and socio-emotional skills category were evaluated with all of the identified methods. In the other categories, various evaluation methods were used, but not all of them. However, the Social and socio-emotional skills category has a much larger number of games than the others: 27 of the 70 games address this category, while 14 address the General category (the second most frequent).
The questionnaire was the only method used to evaluate all game categories. The observation and pre-post test methods were also widely used, with only one category as an exception; however, that category has few games (only 5 of the 70).

RQ3: How have the methods been applied in evaluations?
To answer this question, we present two analyses: the first provides an overview of the moment in the design process at which the games were evaluated, the locations where the evaluations took place, and how different methods were combined (5.3.1); the second describes how the evaluations with each type of method were conducted (5.3.2).

Overview of method application
The methods used in the evaluations varied according to the moment at which they were applied: some evaluated the developed game (summative evaluation), while others evaluated prototypes during the design process (formative evaluation). The evaluations also varied in relation to where they were carried out: therapeutic centers (e.g., [Loiacono et al., 2018; Crovari et al., 2019; Aruanno et al., 2018; Spitale et al., 2019]), schools (e.g., [Gomez et al., 2018; Kurniawati et al., 2019; Pistoljevic and Hulusic, 2017]), research laboratories (e.g., [Boyd et al., 2018]), or the participants' homes (e.g., [Ringland et al., 2019; Carlier et al., 2019]). Table 3 presents the classification of the games' evaluations according to the number of methods used. Of the collected studies, 36 games were evaluated using a single evaluation method, and 34 games were evaluated using a combination of different methods, as follows: 23 games were evaluated using two methods, 10 games using three methods, and only one study combined four methods.
Most of the studies that used the observation method combined it with other methods (28/35). The same occurred with the in-game evaluation (14/19), pre-post test (13/18), questionnaire (11/15), and interview (9/11) methods. The other methods were used more often as a single method in the evaluations.
The graph in Figure 3(a) shows how the methods were combined in evaluations that used two methods. Each method is represented as a vertex, and its label represents the number of games evaluated using that method. An edge indicates that the two connected methods were used together, and its width and weight represent the number of games evaluated with that pair of methods. The graph in Figure 3(b) shows only the studies that used three or four methods in the evaluation of the games. Each method is represented as a blue round vertex, and each game evaluation is represented as a yellow square vertex. Seven different combinations of methods were used in the evaluations. The most used combination was observation, in-game evaluation, and pre-post test, applied in the evaluation of four games (the Space Game, Shape Game, and Bubble Game presented in [Bartoli et al., 2014], and Kirana [Sharma et al., 2018]). The second most used combination was observation, interview, and questionnaire, used in the evaluation of two games (SpokeIt [Duval et al., 2018] and CopyMe [Harrold et al., 2014]). Each of the other combinations was used in only one study.
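The edge weights of such a co-occurrence graph can be computed by counting, for every pair of methods, the games whose evaluation used both; a minimal sketch over invented records (the method lists below are hypothetical, not the actual SLR data):

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-game method lists (invented for illustration).
games = [
    ["observation", "in-game evaluation", "pre-post test"],
    ["observation", "in-game evaluation", "pre-post test"],
    ["observation", "interview", "questionnaire"],
    ["observation", "questionnaire"],
]

# Edge weight = number of games in which the pair of methods co-occurred.
edge_weights = Counter()
for methods in games:
    for pair in combinations(sorted(set(methods)), 2):  # normalize pair order
        edge_weights[pair] += 1

for (a, b), weight in edge_weights.most_common():
    print(f"{a} -- {b}: {weight}")
```

Sorting each game's methods before pairing ensures that, for example, (observation, questionnaire) and (questionnaire, observation) count as the same edge.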
Section 6 details how each method was combined with others.

Description of the Evaluation Process
Next, we present an overview of the different methodologies employed in the studies.Additionally, we discuss the similarities and differences of how the methods were used.
[Figure 3 captions: (a) Games that were evaluated using a combination of two methods. (b) Games that were evaluated using a combination of three or four methods.]
Regarding the studies that used the observation method, it can be pointed out that in some of them field notes were taken, whereas in others video recordings were made to support an in-depth analysis. Some observations were made in the context of the user and others in laboratories. We next describe some of these different examples of how observation was applied.
The study by Aruanno et al. [2018] presents HoloLearn, a wearable mixed reality application aimed at improving the ability of individuals with Neurodevelopmental Disorders (NDD) to perform typical household activities and increase their autonomy. In the exploratory study, 20 participants with NDD were divided into two groups: (i) severe-level participants (11) and (ii) moderate-level participants (9). The sessions were held at a therapeutic center, where the participants used HoloLearn with the support of their caregivers while two other professionals observed the sessions and took notes. At the end of each session, the users answered a questionnaire about the device's acceptability, interaction mode usability, task complexity, virtual assistant function, friendliness, and satisfaction. In the study by Crovari et al. [2019], the evaluation was based on audio and video recordings of the sessions. The researchers conducted an exploratory study to analyze how individuals diagnosed with NDD interact with SAM, an intelligent dolphin-shaped toy, and to identify requirements to improve the prototype. To stimulate autonomous exploration of SAM and active reaction to it, five individuals with NDD played and tested the game in sessions that were recorded and later analyzed. The second prototype was discussed with therapists in a workshop, and the authors observed therapeutic sessions using SAM at an institution for people with NDD. A third prototype, SAM 3D, was developed and evaluated regarding the learning benefits for the users. For this purpose, a long-term (one-year) empirical study was carried out in two different therapeutic centers.
In the in-game evaluation, data is collected during the use of the game. The analysis of the studies showed that different types of data can be collected, both quantitative (e.g., performance in a game) and qualitative (e.g., user opinion surveys). For example, in the study by Sharma et al. [2018], both quantitative and qualitative data were collected to evaluate the game Balloons, which focuses on promoting the social activity of joint attention, where colored balloons can be selected through gesture-based interaction. To evaluate it, data were collected regarding the total time spent by each participant in each session, the number of balloons selected with assistance, and the number of balloons selected by the children themselves. In addition, the notes from the moderator's observation were also used to validate the game. In the study by Carlier et al. [2019], in-game evaluation was one of the combined methods used to evaluate New Horizon, a game that aims to help reduce stress and anxiety in children with ASD. Three children were invited to play the game at their homes for two weeks. At the beginning of each game session, participants were asked to indicate their mood on a five-point Likert scale that displayed smiling faces ranging from very happy to very angry. This information was combined with data collected through other methods (pre- and post-interviews with parents and a standard questionnaire for parents and children) to evaluate the game's impact on the children.
Although the goal of the pre-post test method is to measure the effect of the game, the approaches to what to measure and how varied greatly among the articles. Some studies applied a knowledge test at the pre- and post-test stages. For example, in the study by Sharma et al. [2018], the authors evaluated Kirana, a game that aims to teach the daily living activity of grocery shopping. The researchers combined the pre-post test with observation (of the participants playing the game and in a real-life context) and in-game evaluation (game log analysis). The pre-post test evaluation consisted of mathematics tests performed by the children at the beginning of the evaluation process and again at the end.
Other studies that employed the pre-post test method compared a control group to a treatment group. For example, in the study [Bartoli et al., 2014], the authors evaluated ten children with ASD regarding their initial functional profile; the children were then randomly divided into two groups: a control group (continuation of regular treatment) and a treatment group (participation in extra sessions in which they played games); finally, the children were re-evaluated to compare the results between the two groups. Other studies compared pairs that combined children with different developmental conditions: ASD and typical development (TD). For example, in the study by Wade et al. [2017], the researchers organized twenty-four individuals, eighteen TD and six diagnosed with ASD, into ASD-TD or TD-TD pairs. After familiarizing the participants individually with the game in the pre-test, each pair played the game side by side sharing a computer. Players had the same game goal and individual characters, and could help each other. Then, the pairs played different game modes in separate rooms but could communicate through Skype. Finally, in the post-test stage, each pair played the same game side by side once again, as in the pre-test stage. Changes in gameplay metrics, both individually and in pairs, were analyzed, as well as changes in verbal communication.
In most of the studies that used the questionnaire method, the researchers created their own questionnaire to assess the usability and user experience of the participants (12/15). Half of them (6 articles) reported using a Likert scale. Only three studies used standardized questionnaires, with different focuses, namely: the Adolescent/Adult Sensory Profile (AASP) [Koirala et al., 2019], the System Usability Scale (SUS) [Garcia-Garcia et al., 2019], and the Social Communication Questionnaire (SCQ) and Social Responsiveness Scale, Second Edition (SRS-2) [Wade et al., 2017]. For example, the study by Koirala et al. [2019] used a standardized questionnaire combined with an experiment to validate the feasibility of the Sensory Assessment VR System (SAVR), a game for assessing sensory abnormalities in children with ASD. Before playing the game, all participants (six children with ASD and six typically developing children) filled out the AASP questionnaire, a standard sensory profile assessment tool. Then, the children interacted with the game. The researchers analyzed whether the results from the SAVR system correlated with the results from the AASP questionnaire, which would indicate alignment between the measure constructed within the system and the sensory profile of the participants as evaluated by a standard psychological instrument.
The interviews were conducted in various ways and involved different participants in each study, such as children, parents, therapists, teachers, and even game development students. For example, the study by Mei and Guo [2018] conducted a preliminary evaluation of an adaptive virtual environment therapy system for children with autism spectrum attention deficit. To collect feedback on the system, two ASD therapists and six game development students aged 19 to 22 were interviewed. The therapists evaluated the game's potential to be used in therapy, and the students evaluated the attention detection.
The studies that applied the experiment method conducted evaluations that collected metrics as participants played the game. In some studies, one or more hypotheses were raised, and the experiment aimed to confirm or reject each of them (e.g., [Chen et al., 2019; Mei et al., 2018]). Other studies did not present hypotheses but collected data, performed a statistical analysis of these data, and then discussed the results (e.g., [Li et al., 2018; Rahman et al., 2010]). To illustrate, in the study [Chen et al., 2019], the researchers conducted an experiment to evaluate Guided Play Blocks, a game to improve symbolic play skills in children with autism spectrum disorder. The aim was to verify the following hypotheses: (H1) a child with restrictive and repetitive behaviors in a physical activity can also exhibit similar patterns in a digital replica of the activity, and (H2) interventions carried out with the digital activity can impact the child's behavior in the physical world. Six children with autism spectrum disorder participated in an experiment that involved playing the game in the free and guided modes. Quantitative and qualitative data were collected for analysis. The game mode was the independent variable tested (i.e., free game mode versus guided game mode), and the dependent variables considered were: i) the percentage of representational constructions (i.e., constructions made with the blocks that resemble a real-life object); ii) the number of representational constructions; iii) the total number of symbolic categories of the constructions (i.e., a category consists of constructions with the same symbolic meaning, e.g., animals, letters); and iv) compliance with the guidelines (i.e., how well a child follows the system's guidance).
The works that conducted expert reviews typically involved specialists in ASD (therapists, teachers, and psychologists) [Thordarson and Vilhjálmsson, 2019; Weilun et al., 2011; Carvalho and da Cunha, 2019; Moura et al., 2016; Sturm et al., 2016] or HCI [Sousa et al., 2012], or both [Marwecki et al., 2013]. For example, to identify usability problems in TEO [Moura et al., 2016], a suite of interactive games aimed at supporting the learning of various fundamental concepts in ASD treatment, five therapists and psychologists were invited to conduct an evaluation. The specialists tested TEO and answered a questionnaire about the game. In the study by Sousa et al. [2012], the WorldTour game was developed to support the learning process of children with ASD. HCI specialists evaluated the game's usability and communicability using Heuristic Evaluation, the Semiotic Inspection Method (SIM), and the Communicability Evaluation Method (CEM).
The studies that used focus groups carried out the evaluations with specialists in ASD, such as teachers and therapists [Loiacono et al., 2018; Soysa and Al Mahmud, 2020]. For example, in the study by Soysa and Al Mahmud [2020], five special education teachers and 20 children with ASD analyzed the usability of a tangible interface-based game prototype. The children were paired with the teachers, who chose the activities to be performed according to the needs of each child. At the end of the sessions, a focus group was held with the professionals to discuss the difficulties faced during the game sessions and to identify improvements to the game.

RQ4: What is the sample size and profile of the participants in the evaluations?
The studies conducted their evaluations with different stakeholders (e.g., therapists, psychologists, and teachers) and not just with the intended users, that is, individuals with ASD or NDD. In this section, we analyze the participants of the evaluations and organize our results in two sub-sections - the first regarding participants who were end-users, and the second regarding evaluations with stakeholders.

Characterizing End-User Participants
For the evaluations that involved end-users, we performed an analysis to characterize the evaluations according to the number of participants involved, their profile in terms of their special needs, and their age. Articles that did not include information about the participants when presenting the evaluation were not considered. A total of 67 evaluations were described, with two different evaluations conducted for one of the games [Marchi et al., 2019]; one evaluation was not considered [Sturm et al., 2016] because it did not describe the number and age range of the participants. Table 4 presents the references of the games' evaluations according to the end-user participants' profiles. Figure 4 presents a visualization that shows, for each evaluated game, the number, profile, and age range of the evaluation participants. The vast majority of the studies performed evaluations with participants representative of the target audience, with ages ranging from 2 to 20 years but concentrated in the range between 5 and 15 years. Although our SLR focused on games for children with ASD, we can see in Figure 4 that evaluations of nine games also included adults (we consider as adults any participant over twenty-one years old). Some of the games described included children in their target user group but were not intended only for children (e.g., [Crovari et al., 2019], [Aruanno et al., 2018], [Spitale et al., 2019], and two of the three games - Balloons and HOPE - in [Sharma et al., 2018]). Thus, in these cases, it made sense to also include adults in their evaluation. However, four studies [Duval et al., 2018; Thordarson and Vilhjálmsson, 2019; Finkelstein et al., 2010; Sharma et al., 2018] presented games specifically aimed at children but performed the evaluations with adults. The authors of three of these articles [Sharma et al., 2018; Thordarson and Vilhjálmsson, 2019; Finkelstein et al., 2010] did not give any reason for this choice. In contrast, Duval et al. [2018] argued that although the game SpokeIt was made for children with developmental and speech disabilities, it was difficult to access these individuals; thus, their solution was to test with adults with this profile.
The sample size ranged from 2 to 98 participants, with an average of 15 and a standard deviation of 17. The majority of evaluations were carried out with individuals with ASD, but some studies also included individuals with NDD and neurotypical individuals. The majority of evaluations with individuals with NDD included at least one person with ASD (7/12), but the remaining five works did not specify the developmental disorders of the participants. Moreover, 10 studies carried out evaluations that encompassed not only individuals with ASD but also individuals with typical development (TD) [Porayska-Pomsta et al., 2018; Koirala et al., 2019; Harrold et al., 2014; Wade et al., 2017; Zhang et al., 2018; Li et al., 2018; Golestan et al., 2019; Zhao et al., 2018; Carvalho and da Cunha, 2019; Sturm et al., 2016].
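Descriptive statistics like those reported above (range, mean, and standard deviation of the sample sizes) can be computed as in the minimal sketch below; the sample sizes listed are hypothetical, not the ones extracted in this review.

```python
from statistics import mean, stdev

# Hypothetical number of end-user participants in five evaluations.
sample_sizes = [3, 5, 8, 12, 41]

print(f"range: {min(sample_sizes)} to {max(sample_sizes)}")
print(f"mean: {mean(sample_sizes):.1f}")
print(f"sample standard deviation: {stdev(sample_sizes):.1f}")
```

Note that a single large evaluation (here, 41 participants) can make the standard deviation exceed the mean, as in several of the distributions reported in this review.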
The studies by Porayska-Pomsta et al. [2018], Koirala et al. [2019], Li et al. [2018], and Sturm et al. [2016] explained that the reason for choosing both profiles was to identify and compare differences between them (e.g., behavioral differences, sensory processing patterns). On the other hand, the works by Wade et al. [2017], Zhang et al. [2018], and Zhao et al. [2018] developed and evaluated games that focused on social interaction and communication and organized the participants into ASD-TD dyads, and also TD-TD dyads [Wade et al., 2017; Zhang et al., 2018]. The reason pointed out by the researchers for not grouping the participants in ASD-ASD pairs was that the goal of the intervention included improving relationships between individuals with ASD and their neurotypical peers, since it is more common for people with ASD to interact with neurotypical individuals in their daily lives. In relation to the studies that organized the participants into TD-TD pairs, the authors described that this allowed them to identify differences between the interactions of neurotypical and ASD individuals and also to verify the applicability of the games for individuals with and without ASD.
The remaining studies that included the two groups did not explain the reasons for this evaluation configuration [Harrold et al., 2014; Golestan et al., 2019; Carvalho and da Cunha, 2019]. In two studies, the evaluations were carried out only with neurotypical individuals. Neither of them explained why the games were not tested with the respective target audience, but both indicated this objective as future work [Thordarson and Vilhjálmsson, 2019; Finkelstein et al., 2010].
The visualization depicted in Figure 5 indicates the total number of studies that used a given evaluation method that included end-users, along with the profile of the participants involved. For instance, Figure 5 shows that 35 studies applied the observation method, and of these applications, 22 studies applied this method with participants with ASD; 9 studies conducted observation with participants with NDD; and 4 studies with participants with ASD as well as TD. It is important to remember that some studies used a combination of methods to evaluate a serious game, so they are counted for each method applied.
We can observe that most of the methods involved participants with ASD. The questionnaire method was the only method that included applications with only TD participants. As previously reported, these two studies did not describe why the games were not tested with the game's target audience [Thordarson and Vilhjálmsson, 2019; Finkelstein et al., 2010]. The questionnaire was also the most used method with ASD+TD users. In these cases, the goal was to identify and compare differences between profiles [Koirala et al., 2019] or to evaluate games focused on children's interaction and social communication [Harrold et al., 2014; Golestan et al., 2019]. Two studies did not describe the motivation to include participants with ASD+TD [Wade et al., 2017; Zhao et al., 2018].
We can also observe that the techniques that involve less intervention by the evaluator during the evaluation, such as observation and in-game evaluation, had a more significant number of studies that explored evaluations with ASD and NDD participants. In contrast, methods such as the questionnaire and the interview, which demand greater interaction between the evaluator and the participants, were less used with ASD and NDD participants. A possible reason for this could be the challenges in collecting data through direct communication with the target audience.
Regarding the focus groups, their intent was to collect data on the end-users' experience, albeit indirectly. For instance, in the study by Soysa and Al Mahmud [2020], 20 children with ASD used the serious game POMA under the supervision of special education teachers. At the end of the sessions, a focus group was conducted with the teachers to discuss the problems faced by the children when playing the game and to identify possible improvements to mitigate such problems. Likewise, in the article by Loiacono et al. [2018], 10 children with NDD played a memory-like virtual reality game to enhance social skills under the supervision of their therapists. At the end of the study, two focus groups were conducted with NDD specialists to understand the children's and therapists' needs and to identify the main characteristics and parameters of the virtual reality experience that were critical during its use.

Characterizing Stakeholder Participants
We also performed an analysis of the stakeholders involved in the game evaluations from whom data was collected. In this case, the data was not collected directly from the end-users, but rather from other people who had an interest in the game, such as parents, therapists, psychologists, and teachers.
For example, in the studies, the stakeholders performed the following actions: i) in the evaluations with the questionnaire method, the stakeholders answered the questionnaire; ii) in the evaluations using the interview method, the stakeholders were interviewed; iii) in the evaluations using the focus group method, the stakeholders took part in the focus group discussion; iv) in the evaluations with the pre-post test method, the stakeholders played the game together with the participants or provided information about the participants during the data collection performed before and after the participant played the game; and v) in the evaluations with the expert review method, the stakeholders analyzed the game.
Figure 6 presents a visualization that shows, for each evaluation method, the number and profile of stakeholders involved in the evaluation of each game. The following stakeholders participated in the evaluations: caregiver, educational professional (including teacher, educator, special education teacher, and educational adviser), game development student, interaction expert, lecturer in game analysis, parent, practitioner, professional game designer, psychologist, and therapist (including occupational therapist, autism therapist, and NDD specialist). We included a psychologist+therapist profile in the visualization since the study presented by Moura et al. [2016] only described that five psychologists and therapists participated in the evaluation but did not detail the exact number of participants with each of these profiles.
The number of stakeholders involved in the studies ranged from 1 to 9 individuals. These studies used 5 of the 8 identified evaluation methods: expert review, focus group, interview, questionnaire, and pre-post test. Studies with the expert review method used a greater variety of stakeholder profiles, which may be due to the nature of the technique: in evaluations with experts, there is a tendency for the study to include different profiles of domain experts. The interview and the questionnaire involved a more significant number of parents. These techniques require direct interaction between the evaluator and the participant; when it is impossible to apply these methods directly with the children, the parents can act as the children's representatives. The challenges in including the children directly could be due to the vulnerable nature of children with ASD or NDD or, in some cases, their difficulties in communicating. Parents and educational professionals were the stakeholder profiles included most often in the evaluations, probably because they are the stakeholders with the most significant interaction with the games' target audience.

RQ5: What quality properties have been evaluated in the studies?
This research question aimed to investigate which quality properties the researchers considered in their evaluations of serious games. Next, we present the seven quality properties we identified from our analysis and their descriptions:
• Communicability: evaluations that analyze the serious game's communicability (i.e., the game's ability to effectively and efficiently convey to the user the intentions and interaction principles that guided its design) [de Souza, 2005].
• Engagement: evaluations that investigate the serious game's ability to involve the user.
• Feasibility: evaluations that investigate whether the serious game is suitable for use in therapy, both in terms of the serious game's proposal and in terms of acceptance of use due to some physical device that the serious game requires or due to the type of interaction (e.g., immersion in virtual or augmented reality).
• Impact: evaluations that investigate the effects of the serious game on users in relation to the skill the serious game aims to enhance.
• Metrics effectiveness: encompasses both studies that propose metrics and evaluate them based on serious games (i.e., verify whether the proposed metrics can achieve the goal of measuring some autistic characteristic of users) and studies that analyze the impact of a specific component of the serious game.
• Usability: evaluations that investigate the serious game's solution proposal (i.e., evaluation of the game's functionalities), the ease of use of the serious game, or the user's performance in the task proposed in the serious game. It is worth noting that evaluations that focused on analyzing the user's performance in a task and only superficially investigated user satisfaction, for example, through a questionnaire, were classified as usability evaluations.
• User experience: evaluations that investigate the user's behavior and emotions when using the serious game. This category also includes evaluations that investigated user satisfaction more deeply.
To classify the quality property (or properties) evaluated for each game, we registered the property explicitly mentioned by the researchers. If the researchers did not directly mention this information, we inferred the property being considered based on the description of the evaluation's objective, the process conducted in the study, and the results obtained. Table 5 presents the classification of the games' evaluations according to the quality property considered.
Figure 7 presents the quality properties evaluated in the games. The most frequently investigated properties were: impact (29), usability (22), feasibility (14), metrics effectiveness (5), user experience (5), engagement (4), and communicability (1). The evaluations that investigated the impact of the games involved both quantitative analyses, which measured metrics related to the game's objective, and qualitative ones. Regarding the evaluations that analyzed metrics effectiveness, most (4/5) are of the evaluation category [Koirala et al., 2019; Kołakowska et al., 2017; Li et al., 2018; Iyer et al., 2017], i.e., they focus on metrics generated by the game that evaluate autistic characteristics of children. Only one article in this category analyzed the impact of a game component, namely the impact of having a custom versus a non-custom virtual character on engagement [Mei et al., 2018].

Characterizing the Use of the Evaluation Methods
While Section 5 presented an overview of the evaluation methods and the relationships between them through a discussion of the research questions, this section presents each evaluation method in depth. Thus, for each method, we present: i) the categories of games that were evaluated; ii) the methods that were combined with it; iii) the profile of the participants and stakeholders involved in the evaluations; iv) the sample size (minimum, maximum, and average) and age range of the participants; and v) the evaluated quality properties. It is worth noting that most studies that used a combination of methods do not specify the quality properties evaluated through each method. Instead, the studies usually describe the quality properties that were investigated in the work as a whole, using the set of evaluation methods. Next, we present the evaluation methods organized from the most frequently used in the studies to the least used.

Observation
The observation method was used to evaluate games from 7 out of the 8 identified game categories; it was not used only for games from the evaluation and measurement category. Among the 35 studies that applied the observation method, 28 used this method in combination with other evaluation methods, and 7 studies applied observation in an isolated manner.
In cases where the method was used in conjunction with other evaluation approaches, it was combined with one, two, or three other distinct methods. In 18 studies, it was used in conjunction with only one other method: 5 studies combined it with in-game evaluation; 4 studies with pre-post test; 4 studies with questionnaire; 3 studies with interview; 1 study with focus group; and 1 study with expert review. Another 9 studies combined the application of the observation method with two other methods: 4 studies combined it with in-game evaluation and pre-post test; 2 studies with interview and questionnaire; 1 study with pre-post test and interview; and 1 study with in-game evaluation and interview. In a single study, the observation method was combined with three other methods, namely in-game evaluation, pre-post test, and questionnaire.
In studies where the observation method was used, both in isolation and in combination with other approaches, the sample size varied from 3 to 41 participants, with an average of 10.7 participants (standard deviation of 17.4). The ages of the participants ranged from 2 to 44 years. In most of these studies, the participants were individuals with ASD (22 studies), followed by NDD (9 studies) and ASD+TD (4 studies). No stakeholders were involved in the evaluations with the observation method.
With regard to the evaluated quality properties, studies that used the observation method in an isolated manner assessed the following quality properties: user experience (2), feasibility (2), engagement (1), usability (1), and impact (1). In turn, studies in which the observation method was combined with other evaluation approaches evaluated the same five quality properties: impact (15), usability (12), feasibility (4), engagement (2), and user experience (1).

In-game Evaluation
The In-game evaluation method was used to evaluate games in the following categories: general (7), social and socio-emotional skills (6), motor skills (2), evaluation and measurement (2), academic skills, and daily living skills (1). Most studies collected data on the participant's performance when using the game; however, the data collected highly depends on the game. For example, the game Kirana [Sharma et al., 2018] - a game that aims to teach activities of daily living related to grocery shopping - collected task times, items bought, and monetary transaction details. The vrSocial game [Boyd et al., 2018] - an immersive virtual reality game aimed at improving the social communication skills of children with ASD - collected each user's distance from the avatar, volume, and duration of talking.
Among the 19 studies that applied the In-game evaluation method, the majority (14/19) used this method in combination with other methods. In these cases, it was combined with one, two, or three other distinct methods. In 7 studies, the In-game evaluation method was used with one other method: 5 studies combined it with observation; 1 study with interview; and 1 study with pre-post test. Other studies combined the use of the In-game evaluation method with two other methods: 4 studies combined it with observation and pre-post test; 1 study with observation and interview; and 1 study with pre-post test and interview. In a single study, the In-game evaluation method was combined with three other methods, namely observation, pre-post test, and questionnaire.
In studies where the In-game evaluation method was used (either alone or in combination with other methods), the average sample size was 16.6 participants, with a standard deviation of 17.9. The sample size ranged from 3 to 50 participants, with ages ranging from 4 to 44 years. In most of these studies, the participants were individuals with ASD (13 studies), followed by NDD (5 studies) and ASD+TD (1 study). No stakeholders were involved in the evaluations using the In-game evaluation method.
Regarding the evaluated quality properties, the 5 studies that used the In-game evaluation method as the only method evaluated the following quality properties: metrics effectiveness (2 studies), usability (2 studies), and impact (1 study). In studies where the In-game evaluation method was combined with other evaluation approaches, the following quality properties were evaluated: impact (10 studies), usability (3 studies), engagement (1 study), feasibility (1 study), and user experience (1 study). It is noteworthy that two studies evaluated two quality properties.

Pre-post test
The pre-post test method was used to evaluate games from 7 of the 8 identified game categories; the only category for which it was not used was the evaluation and measurement category. The pre-post test method was used in 18 studies, with 13 studies using it in conjunction with other evaluation approaches and 5 studies using it in isolation.
In 6 studies, the pre-post test method was used in conjunction with only one other method: observation (4 studies); in-game evaluation (1 study); and questionnaire (1 study). Another 6 studies combined the pre-post test method with two other methods: 4 studies combined it with observation and in-game evaluation; 1 study with in-game evaluation and interview; and 1 study with observation and interview. One study combined the pre-post test method with three methods, namely observation, in-game evaluation, and questionnaire.
In the studies that used the pre-post test method, the average sample size was 13.2 participants, with a standard deviation of 17.8. The sample size ranged from 3 to 28 participants, and the ages of the participants ranged from 2 to 39 years old. The participants had the following profiles: ASD (11 studies), NDD (3 studies), and ASD+TD (4 studies). Two studies [Dragomir et al., 2018; Carlier et al., 2019] also involved stakeholders in the evaluations with the pre-post test method. In the study by Dragomir et al. [2018], a game was presented and evaluated to help autistic children engage in solitary pretend play. The evaluation involved the participation of 6 practitioners and 7 children with ASD. The children participated in 5 play sessions over 5 weeks. In the first and last sessions, the children played with a pre-defined set of toys alone and then with the practitioner. In the intermediate sessions, each child and a practitioner played the game.
Regarding the evaluated quality properties, studies that used the pre-post test method in isolation evaluated the following quality properties: impact (4 studies) and usability (1 study). Studies that combined the pre-post test method with other methods evaluated the following quality properties: impact (13 studies), user experience (2 studies), and usability (1 study). Two of these studies evaluated more than one quality property.

Questionnaire
The questionnaire method was the only method used to evaluate games in all identified categories. Among the 15 studies that applied the questionnaire method, 11 used this method in combination with other evaluation methods, and 4 studies applied the questionnaire method in isolation.
In 7 studies, it was used in combination with only one other method: 4 studies combined it with observation; 1 study with pre-post test; 1 study with experiment; and 1 study with expert review. Another 3 studies combined the application of the questionnaire method with two other methods: 2 studies combined it with observation and interview; and 1 study with observation and expert review. Only one study combined the questionnaire method with three other methods, namely observation, in-game evaluation, and pre-post test.
In the 15 studies where the questionnaire method was used, the sample size ranged from 3 to 24 participants, with an average of 9.9 participants (standard deviation of 15.4). The participants were aged between 3 and 50 years. Regarding the profile of the participants, 6 studies involved individuals with ASD and TD, 5 studies involved only individuals with ASD, 2 studies involved individuals with NDD, and 2 studies involved only participants with TD. Some of the studies that used the questionnaire method also involved stakeholders in the evaluations. In these studies, the questionnaires were filled out only by stakeholders or by both stakeholders and end-users. In three evaluations, the questionnaire was filled out only by stakeholders.
For example, in the study by Gomez et al. [2018], 9 children aged 3 to 8 and their teachers participated in the evaluation of the game "Leo con Lula" (original name in Spanish). The teachers used the game in their classes once a day for a week. At the end of the week, the teachers answered a questionnaire to report their feelings and experiences about the game. In the study by Golestan et al. [2019], evaluations were conducted with the games CARAUTI - a speech-therapy game - and BEESAUTI - a hand-eye coordination game. The evaluations aimed to verify whether the children were attracted to the games, whether the parents could use the games and easily interact with their children, and whether the games seemed useful as therapy for ASD. To do so, 8 children with ASD and TD (aged 3 to 7 years) and their parents were recruited to play the two games at home. After using the games, the parents answered a questionnaire about each game. The questionnaires contained questions about the child's interest, the parents' difficulty in playing, and a rating of the game's graphics, among others.
In two other studies, the questionnaires were filled out by end-users and stakeholders [Aruanno et al., 2018; Garcia-Garcia et al., 2019]. In the study by Aruanno et al. [2018], the evaluation involved 20 people with NDD, aged 16 to 43, and their 3 caregivers. The evaluation was divided into sessions, in which each participant used the game for approximately 10 minutes with the help of their caregiver. During the evaluation, an evaluator observed and took notes. At the end of each session, the participants and caregivers answered a questionnaire. The caregivers' questionnaire contained questions about device acceptability, usability of the interaction mode, task complexity, and the virtual assistant's role. The participants' questionnaire had two questions about likability and satisfaction. In the study by Garcia-Garcia et al. [2019], EmoTEA, a serious game in the form of a mobile application, was evaluated. In this study, 3 children, aged 8 to 10 years old, were observed performing the following tasks: i) browsing the application; and ii) playing the first level of the three games. The children were accompanied by an educator throughout the evaluation to help or calm them down if they tried to get up from the chair during the assessment. At the end of the evaluation, the three participants filled out the System Usability Scale (SUS) questionnaire. The educator also filled out the SUS questionnaire to provide feedback from their point of view. The questionnaire consisted of 10 questions that evaluated various aspects related to the game.
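For reference, the standard SUS scoring procedure maps the ten 1-5 responses to a 0-100 score: odd-numbered (positively worded) items contribute (response - 1), even-numbered (negatively worded) items contribute (5 - response), and the sum is multiplied by 2.5. The sketch below uses hypothetical responses, not data from the studies above.

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 responses."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # 0-based index: even index = odd-numbered item
        for i, r in enumerate(responses)
    )
    return total * 2.5

# Hypothetical responses from one participant.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # 85.0
```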
Regarding the evaluated quality properties, the studies that used the questionnaire method alone evaluated the following quality properties: usability (1 study) and feasibility (3 studies). In studies where the questionnaire method was combined with other evaluation approaches, 6 out of the 7 identified quality properties were evaluated: usability (6), feasibility (3), impact (3), user experience (2), engagement (1), and metrics effectiveness (1). Communicability was the only property for which the questionnaire method was not used (only 1 study assessed communicability).

Interview
The interview method was used in 11 studies, which evaluated games in the following categories: social and socio-emotional skills (7 studies); cognitive skills (2 studies); motor skills (1 study); and communication skills (1 study).
Two studies used the interview method alone, and 9 combined it with other methods. In 4 studies, it was combined with only one other method: observation (3 studies) and in-game evaluation (1 study). Another 5 studies combined the application of the interview method with two other methods, namely: observation and questionnaire (2 studies); in-game evaluation and observation (1 study); in-game evaluation and pre-post test (1 study); and observation and pre-post test (1 study).
Seven of the 11 studies that applied the interview method involved end-users as evaluation participants. In these studies, sample sizes ranged from 3 to 11 participants, with a mean of 7.1 participants (standard deviation of 2.7). The ages of the participants ranged from 5 to 31 years. In most of these studies, the participants were individuals with ASD (5 studies), followed by NDD (1 study) and ASD+TD (1 study). Four studies [Gotsis et al., 2010; Giusti et al., 2011; Boyd et al., 2018; Jain et al., 2012] interviewed both end-users (i.e., children with ASD) and stakeholders. The stakeholders involved in these studies were parents [Gotsis et al., 2010; Boyd et al., 2018; Jain et al., 2012] and therapists [Giusti et al., 2011].
For example, in the study by Jain et al. [2012], 9 children with ASD, aged between 5 and 12 years, were recruited to evaluate a serious game to teach facial expressions. In the session, the children could play the game for as long as they wanted. At the end, the children and their parents were interviewed about their experience with the game. Another 4 studies conducted interviews only with stakeholders [Ringland et al., 2019; Carlier et al., 2019; Boyd et al., 2017; Mei and Guo, 2018]. Parents [Ringland et al., 2019; Carlier et al., 2019], teachers [Boyd et al., 2017], and therapists and game development students [Mei and Guo, 2018] took part in these studies.
Regarding the evaluated quality properties, studies that used the interview method alone evaluated user experience (1 study) and feasibility (1 study). In turn, studies where the interview method was combined with other approaches evaluated impact (5), usability (2), feasibility (2), and engagement (2).

Experiment
The experiment method was used in 9 studies: 3 studies evaluated games in the social and socio-emotional skills category; 2 studies evaluated games in the general category; 2 studies evaluated games in the evaluation and measurement category; 1 study evaluated a game in the cognitive skills category; and 1 study evaluated a game in the communication skills category.
Most of these studies (8/9) used the experiment as the sole evaluation method. In these 8 studies, the following quality properties were evaluated: impact (5 studies), metrics effectiveness (2 studies), and usability (1 study). Only one study combined the experiment with another method: in this study, the experiment method was combined with the questionnaire method to evaluate the quality property of metrics effectiveness.
In the 9 studies where the experiment method was used, the number of participants ranged from 6 to 98, with an average sample size of 25.8 participants (SD = 19.7). The ages of these participants varied from 2 to 17 years old. In most of these studies, the participants were individuals with ASD (6), followed by ASD+TD (2) and NDD (1). No stakeholders were involved in the evaluations using the experiment method.

Expert Review
The expert review method was used to evaluate games in the following categories: social and socio-emotional skills (2 studies); general (3 studies); academic skills (1 study); and evaluation and measurement (1 study). Among the 7 studies that applied the method, 4 studies used it in isolation to evaluate the following quality properties: usability (3 studies), feasibility (1 study), and communicability (1 study). One of these articles evaluated two quality properties.
Moreover, 3 studies used the expert review method in conjunction with other evaluation approaches: 1 study combined it with a questionnaire; 1 combined it with observation; and 1 combined it with both observation and a questionnaire. These studies evaluated usability (3 studies) and feasibility (2 studies). It is noteworthy that two of these studies evaluated two quality properties.
The studies that used the expert review method involved experts with different profiles, namely: psychologists (Thordarson and Vilhjálmsson [2019]; [2011]), parents (Carvalho and da Cunha [2019]), and HCI experts (Sousa et al. [2012]). Two studies that used the expert review method also involved end-users. In the study by Sousa et al. [2012], three HCI experts evaluated the usability and communicability of the WorldTour game, using Heuristic Evaluation, the Semiotic Inspection Method (SIM), and the Communicability Evaluation Method (CEM). Since the CEM evaluates the designer's communication with the user, through the interface, during real-time interaction, the study counted on the participation of two potential users: a 9-year-old girl diagnosed with pervasive developmental disorder-not otherwise specified (PDD-NOS) and a 7-year-old boy diagnosed with autism. In the study by Dragomir et al. [2018], three psychologists analyzed the emot-iCan game and reported their perspectives on the design of the configuration interface aimed at the game's administrator and of the game's reward system. In addition, the psychologists provided feedback on how the players reacted to the game. Both players with typical development (TD) and players with ASD participated in the test. The study did not report the number or age range of the participants; it only stated that the TD and ASD participant groups had a similar age range.

Focus Group
The focus group method was used in only two studies, which evaluated games in the social and socio-emotional skills and general categories. In the study by Loiacono et al. [2018], a focus group was conducted with therapists, and the observation method was also used to evaluate the game's usability. In the study by Soysa and Al Mahmud [2020], 20 children with ASD, aged 3 to 6 years, and 5 teachers participated in the evaluation. The children played the game and, at the end of the sessions, a focus group was conducted with the teachers to discuss usability issues with the game.

Threats to validity
In any systematic literature review, there are threats to the validity of the results. Therefore, we sought to identify potential threats and apply strategies to mitigate their impacts.
First, we are aware that the review's subject can be addressed in other areas. However, as our focus is to analyze how serious games have been evaluated from the perspective of the HCI area, to assist in the future design and evaluation of games, we focused on analyzing HCI publications that are primarily in computing repositories. For this purpose, we selected some of the main repositories and conference proceedings that store relevant works in the field of Computer Science related to serious games. Still, it is possible that existing relevant studies were not considered. Furthermore, as the search for publications took place until March 2020, other studies aligned with this research objective may have been published after that date but were not included in this SLR. These new studies may present new findings, such as evaluation methods and quality properties that are not included here.
Moreover, although we followed the methodology and carried out a systematic process of article selection and analysis, the extraction process involves subjective interpretation and is based on the researchers' decisions. To minimize this bias, two researchers carried out each step of the process individually, including the data extraction, and the results were consolidated between them. In case of divergences or doubts, a third researcher was involved in the final decision-making.
The studies did not present all steps of their research with the same level of detail. Thus, in some steps of the analysis, inferences were made based on what was explicitly presented. For example, in the study by Sharma et al. [2018], the authors described that the evaluation of the Kirana game was developed in four phases. The pre- and post-evaluation took place in phases I and III, respectively. During phase IV, they observed the participants' behavior in a real situation. In phase II, the authors stated that the participants had played the game. Although the authors did not give further details about the participants' gaming session (phase II), they mentioned that the observer's notes were examined when presenting the analysis of their evaluation. Thus, in our analysis of the study, we assumed that the participants' sessions were observed as part of the evaluation.
In our analysis, we classified the method used according to when or how the data was collected. Although this analysis provides an overview of the methods used for evaluation, it does not take into account other aspects related to data collection. For example, the evaluation methods grouped as in-game evaluation may include different types of data collected during the game, such as quantitative metrics (e.g., performance data) and qualitative data (e.g., the user's mood collected through questions presented during the game). To indicate the diversity of aspects involved in each evaluation method, we pointed them out in the discussion and presented some of the different approaches among their applications. For a more detailed analysis of the evaluation methods, other aspects related to evaluation (e.g., type of data collected, moment in the design process, instrument/technology used) could be considered. However, a challenge in doing this is precisely the difference in the level of detail of this information across the studies, because these evaluations are often presented in a short section within an article that focuses on the development of the serious game. So, sometimes, very little (or even no) information about the evaluation methods used is presented.

Conclusions
This work investigated which evaluation methods are being applied to assess serious games for children with ASD. Through a systematic literature review, it was possible to analyze the state of the art of the literature regarding: 1) the methods used in the evaluations; 2) the set of methods used to evaluate each game category; 3) how the evaluation methods have been applied; 4) the number and profile of the participants involved in the evaluations; and 5) the quality properties considered in the games' evaluation.
Our findings indicate that there is no consolidated methodology specifically for evaluating serious games for children with ASD. Different existing methods have been used and combined in different ways to evaluate this type of game. Moreover, we did not identify a clear relationship between a game's category and the methods used in its evaluation. However, it is important to note that the distribution of games across categories is very uneven (i.e., two categories account for about 58% of the games), making a more significant analysis difficult. Observation was the most used method and the one most often combined with other methods, possibly because it allows the evaluator to collect data about the child's experience without much interference in the sessions. On the other hand, the questionnaire was the only method used in evaluations across all game categories, perhaps because it is low-cost and easy to combine with other methods.
It is known that ASD has different levels of severity [American Psychiatric Association, 2013]. Still, most of the analyzed articles do not specify the severity level of the games' target audience. Articles usually only describe that the serious game is aimed at children with ASD, or encompasses children with ASD, and define the skill the game aims to improve. Future work can investigate how to take the different levels of severity of autism into consideration in the design and use of serious games.
As presented, a large proportion of the games (65/75) included representative participants of their users in the evaluation. In some situations, neurotypical individuals were also included; in these cases, the main objective was to identify differences between individuals with ASD and neurotypicals or to evaluate games aimed at stimulating interaction between children with ASD and neurotypical individuals. In addition, 19 evaluations included therapists, psychologists, teachers, or parents. These profiles represent important stakeholders, as some games are developed with the intention of being used as part of the children's treatment, and these stakeholders would be responsible for using the games with the children. In these cases, the methods used most frequently were questionnaire, expert review, interview, and focus group. In three evaluations, only stakeholders (and not users) were included.
In the evaluations, the maximum number of quality properties investigated was two. However, as these properties are not mutually exclusive but complementary, it would make sense to evaluate several (or even all) of them for each game. Although evaluating several properties could be too costly, focusing on distinct properties at different times in the development or implementation of the game might be feasible and interesting; for example, evaluating the game's usability during its design and, once the game is ready, assessing its impact. One could argue that the ultimate goal is for serious games to impact children's treatment positively. However, our findings show that the results of the works that evaluated the impact of games are still incipient. They considered the impact of a particular game on a specific skill (e.g., [Sharma et al., 2018; Dragomir et al., 2018; Wade et al., 2017]), so the results are very dependent on the context considered. Nonetheless, they suggest that games have the potential to improve the skills they focused on. Only one study reported an inconclusive evaluation and described that more tests would be needed to prove or disprove the investigated hypothesis (Hypothesis 1: The child indicates reduced levels of stress and/or anxiety after a gaming session) [Carlier et al., 2019]. It is worth highlighting that no study indicated any negative impacts of games on participants. At any rate, it is still necessary to advance the investigation of the effectiveness of games, and this work can help as an initial step in this direction.
Our findings indicate that the skill addressed by a game or its type is not enough to determine which method(s) would be best for its evaluation. The choice of method or combination of methods, user profiles, number of participants, and quality properties must take into account the objective, context, and available resources. Although this conclusion is not new to the field of HCI, it indicates the importance of having ways to support serious game project teams in deciding which evaluation methods would be appropriate or interesting for their context. Thus, this work brings relevant contributions to the evaluation of serious games for children with ASD, as it characterizes and discusses relevant aspects to be decided in these evaluations, based on the practice of what has been done in the field. Therefore, it contributes to researchers and professionals who develop serious games for this context, who can draw on the discussion presented in this article both to expand their knowledge about how evaluation methods have been used and as a starting point to guide their decisions regarding the evaluations that would be most interesting in their own contexts.
Based on the results of this work, future work can delve into more specific analyses, either focusing on one method or on games addressing a specific ability, generating a more detailed characterization for the focus in question, or producing support materials for these contexts (e.g., questionnaires or impact metrics used). Another interesting direction is to investigate how ethical issues are being addressed in research related to serious games for children with ASD. Additionally, future work can extend the search to other repositories (e.g., from the health field), including those with an interdisciplinary focus. Finally, another relevant future work would be to extend the analysis period beyond 2020. Updating the SLR for the period after 2020 could generate complementary and new results, such as the emergence of new approaches to evaluate serious games, especially considering that the COVID-19 pandemic may have shifted the focus or means of evaluation.

• Expert review: evaluation based on analyses by experts in ASD (e.g., therapists, teachers, and psychologists) or experts in games.
• Focus group: evaluation based on a group discussion facilitated by a researcher.
• In-game evaluation: evaluation based on game data/logs (e.g., performance data, eye tracker).
• Interview: evaluation based on structured and semi-structured interviews.
• Observation: evaluation based on observations and field notes taken during game sessions or by analyzing audio and video recordings of these sessions.
• Pre-post test: evaluation that collects user data before and after they play the game and then discusses and analyzes the improvements.
• Questionnaire: evaluation based on questionnaires that can be developed by the researchers or based on standard evaluation instruments (e.g., the Adolescent/Adult Sensory Profile questionnaire, a standard sensory profile evaluation instrument).

Figure 1. Number of games evaluated by each method.

Figure 2. Evaluation methods used by game categories.

Figure 3. Combination of methods in the evaluations of the games. Edges whose weight is omitted correspond to weight 1.

Figure 4. Characterization of the end-user participants in the evaluations. NA columns: games that did not specify the age range.

Figure 5. Characterization of the evaluations performed with each method and each user profile.

Figure 6. Characterization of the stakeholders involved in the evaluations. Marker ?: studies that did not specify the number of stakeholders involved in the evaluation.

Figure 7. Quality properties evaluated in the games.

Table 1. Classification of the games' evaluations according to the methods used.

Table 2. Classification of the games according to the skills they focus on.