How are test smells treated in the wild? A tale of two empirical studies

Developing test code may be a time-consuming process that requires much effort and cost, especially when done manually. In addition, during this process, developers and testers are likely to adopt bad design choices, which may lead to the introduction of so-called test smells in the test code. As test code containing test smells grows in size, those tests may become more complex and, consequently, much more challenging to understand and evolve correctly. Therefore, test smells may harm test code quality and maintenance and disrupt software testing activities as a whole. In this context, this study aims to understand whether software testing practitioners unintentionally insert test smells when they implement test code. We first carried out an expert survey to analyze the usage frequency of a set of practices related to test smells and then interviews to reach a deeper understanding of how practitioners deal with test smells. Sixty professionals participated in the survey, and fifty professionals participated in the interviews. The results indicate that experienced professionals introduce test smells during their daily programming tasks, even when using their companies' standardized practices. Additionally, although tools support test development and quality improvement, most interviewees were not aware of the concept of test smells.


Introduction
Software projects, both commercial and open-source ones, commonly include a set of automated test suites as one crucial support to verify software quality (Garousi and Felderer, 2016). However, creating test code may require high effort and cost (Wiederseiner et al., 2010; Yusifoğlu et al., 2015; Garousi and Felderer, 2016). Automated test generation tools, such as Randoop, JWalk, and Evosuite, emerge as alternatives to facilitate and streamline this activity. If designed with high quality, automated testing offers benefits over manual testing, such as repeatability, predictability, and efficient test runs, requiring less effort and cost (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). Therefore, tests should be concise, repeatable, robust, sufficient, necessary, clear, efficient, specific, independent, maintainable, and traceable (Meszaros et al., 2003).
However, the development of well-designed test code is neither straightforward nor a simple task. Developers are usually under time pressure and must deal with constrained budgets, which can stimulate antipatterns in test code, leading to the occurrence of the so-called test smells. Test smells are indicators of poor implementation solutions and problems in test code design (Greiler et al., 2013). The presence of test smells may reduce test code quality and, consequently, prevent the test suite from reaching its expected capability of finding bugs while remaining understandable, maintainable, and so on (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). The literature reports 196 test smell types classified into the following groups (Garousi and Küçük, 2018): behavior, logic, design-related, issues in test steps, mock- and stub-related, association with production code, code-related, and dependencies.
The literature presents studies that identify and analyze the effects of test smells on software projects in several aspects (Greiler et al., 2013; Garousi and Felderer, 2016; Van Rompaey et al., 2006). In those studies, the authors introduce test smells as non-functional quality attributes within the Software Test Code Engineering process. In addition, they discuss existing test smell types and their consequences in terms of test code maintenance (Garousi and Felderer, 2016). Some authors attempted to correlate metrics with the presence of test smells (Greiler et al., 2013). However, few discussions about daily practices and programming styles that may contribute to inserting test smells exist in the literature. Understanding the relationship between development practices and the introduction of test smells may support improving the activity of test creation.
This study extends our previous investigation (Silva Junior et al., 2020), which aimed to understand whether software testing practitioners unintentionally insert test smells. We used an expert survey with sixty practitioners from Brazilian companies to analyze which practices that might introduce test smells they adopt, and how often, during test creation and execution. In this extension, we sought to understand (i) how much the practitioners know about test smells and (ii) how the practitioners deal with test code quality regarding test smells. To identify whether and to what extent the practitioners know about test smells and how they deal with them, we interviewed fifty practitioners. The results from both studies are complementary. We found that most of the interviewees did not know the concept of test smells. They commonly used practices that introduced test smells, but they hardly removed them from the test code.
We mapped which daily programming practices would be associated with each test smell for both test creation and execution. Then, we asked the practitioners whether they used those practices, without the need to name the test smells. We used the interviews to complement the survey and analyze the practitioners' unit test creation, maintenance, and quality verification activities. In addition, we investigated the practitioners' knowledge about test smells and how they treat those smells during unit test creation and maintenance.
Our study may provide insights into how and which practices may introduce test smells in test code. In addition, we present the practitioners' point of view on activities related to unit test code and their beliefs about the treatment of test smells. Thus, we investigated the following research questions:

RQ1: Do practitioners use test case design practices that might lead to the introduction of test smells? We investigated whether bad design choices may be related to test smells.

RQ2: Which practices are present in practitioners' daily activities that lead to introducing test smells? We investigated which test smells are associated with the practitioners' most frequent practices.

RQ3: Does the practitioners' experience interfere with the introduction of test smells? We investigated whether, over time, practitioners improve the activity of test creation.

RQ4: How aware of test smells are the practitioners? We investigated the practitioners' knowledge of test smells.

RQ5: What practices have practitioners employed to treat test smells? We investigated how the practitioners deal with test smells in their daily activities.
The remainder of this article is structured as follows: Section 2 introduces the concept of test smells; Section 3 details the research method applied in this study; Section 4 presents the survey's design and results; Section 5 presents the interview's design and results; Section 6 discusses the main findings of this investigation; Section 7 presents the threats to validity; Section 8 discusses related work; and Section 9 draws concluding remarks.

Test Smells
Automated tests may generate more efficient results when compared to manually executed ones. Due to their repeatability and non-human interference, automated tests might reduce execution time and effort (Yusifoğlu et al., 2015; Garousi and Küçük, 2018). However, developing test code is not a trivial task, and automated tools may not ensure system quality because they can generate poorly designed tests (Palomba et al., 2016; Virgínio et al., 2019). In real-world practice, developers are likely to use antipatterns during test creation and evolution, leading to errors in implementing test code (Van Deursen et al., 2001; Bavota et al., 2012). These antipatterns may negatively impact test code maintenance (Van Rompaey et al., 2006).
Several studies investigated different types of test smells. Initially, Van Deursen et al. (2001) defined a catalog of 11 test smells and refactorings (to remove test smells from the test code). After that, other authors extended this catalog and analyzed the effects of the smells on the production and test code (Van Deursen et al., 2001; Meszaros et al., 2003; Van Rompaey et al., 2006; Bavota et al., 2012; Greiler et al., 2013; Bavota et al., 2015; Garousi and Felderer, 2016; Palomba et al., 2016; Peruma, 2018; Virgínio et al., 2019; Virgínio et al., 2020). For example, Garousi and Küçük (2018) identified more than 190 test smells in a literature review of 166 studies.
In this study, we selected 14 types of test smells frequently studied and implemented in cutting-edge test smell detection tools (Van Deursen et al., 2001; Meszaros et al., 2003; Peruma, 2018): Assertion Roulette, Conditional Test Logic, Constructor Initialization, Duplicated Test, Eager Test, Empty Test, For Testers Only, General Fixture, Indirect Test, Magic Number, Mystery Guest, Redundant Print, Resource Optimism, and Test Run War.

Research Method
We carried out two empirical studies in this investigation: a survey and an interview study (Miles et al., 2014). Figure 1 shows the methodological steps employed in this study. Initially, we designed our study by defining the research questions and the suitable research methods to investigate them (Fig. 1, Design). We used the survey research method to identify which programming practices respondents (practitioners who participated in the survey) adopt that might insert test smells in the test code (Fig. 1, Survey). We next applied the interview study method to identify how the interviewees (practitioners who participated in the interview) deal with test smells during test creation and execution (Fig. 1, Interview). We compared the results obtained from the survey and the interviews to contrast the adoption of practices that might lead to introducing test smells with the practitioners' knowledge about test smells from different perspectives (Fig. 1, Data Comparison).
For the survey, we adopted an observational case-control design. Case-control is a descriptive design used to investigate previous situations to support the understanding of a current phenomenon (Pfleeger and Kitchenham, 2001). It encompasses activities for the design, application, and analysis of a survey questionnaire. We designed the questionnaire not to require specific knowledge about test smells: we correlated each test smell to a set of programming practices, which the participants should read and analyze. Section 4 details the survey study.
To complement the findings of the survey questionnaire, we carried out semi-structured interviews (Singer et al., 2008; Gubrium et al., 2012). The interview structure aims to capture the interviewees' perception of test smells. As we needed the interviewees to know the definition of test smells to elaborate on how they deal with them, we first introduced them to the concept. Section 5 details the interview study. The survey and interview instruments were written and applied in Portuguese with Brazilian practitioners. Finally, the data comparison summarizes the survey and interview results to answer the research questions (Creswell and Clark, 2018). Section 6 presents the results.

Survey Study
We applied the survey research method to investigate how the respondents commonly insert test smells in the test code when designing or implementing their software projects (Melegati and Wang, 2020). Throughout this section, we provide readers with detailed information about the research design and data analysis. All material used in the survey study, including the dataset, is publicly available (Junior et al., 2021).

Design
We structured the questionnaire so that the respondents were not required to be aware of test smells beforehand; thus, we covered a larger number of potential practitioners. We correlated the concepts of test smells to commonly applied test creation and execution practices. Table 1 shows examples of those practices. For instance, the For Testers Only smell is associated with the statements "I have already created a test to validate some feature that will not be used in the production environment" (creation) and "I run some tests to validate features that will not be used in the production environment" (execution); another example is the statement "Some tests fail, and it is not possible to identify the failure cause." The practices associated with Conditional Test Logic (CTL) use loops or conditions in the test code. In this case, the respondents should analyze the practices to determine whether and how often they adopt them. For CTL, the respondents should indicate how often they create tests with those structures or face them during test execution.
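To make this correlation concrete, the following listing sketches the kind of structure the CTL statements refer to. It is an illustrative example only: the DiscountCalculator class and the values are hypothetical and do not reproduce any questionnaire item.

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import java.util.List;

    import org.junit.Test;

    public class DiscountCalculatorTest {

        // Conditional Test Logic: the loop and the if statement decide at
        // run time which assertions execute, so a green run does not tell
        // the reader which inputs were actually verified.
        @Test
        public void discountForSeveralOrders() {
            List<Integer> totals = Arrays.asList(50, 150, 300);
            for (int total : totals) {
                if (total > 100) {
                    assertEquals(total * 0.10, DiscountCalculator.discount(total), 0.001);
                }
            }
        }
    }

A respondent answering "always" or "frequently" to the CTL statements would be reporting precisely this style of test.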

Questionnaire Instrument
The questionnaire comprises three blocks of questions. The first block characterizes the respondents (profile) and has thirteen questions to identify their age, gender, education degree, and software testing/programming skills.
The second block has fourteen statements and six complementary questions (four objective and two open-ended questions). The statements describe creation practices related to test smells. We structured those statements on a five-point Likert scale, where the respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable. On this scale, always indicates the adoption of bad practices for test creation. For example, the statement "I have already created a test to validate a feature that would not be used in the production environment" corresponds to the For Testers Only test smell. Therefore, the answer "always" means that the respondent usually uses that practice in her daily tasks; as a consequence, she likely inserts that test smell in the test code unintentionally. We designed the six complementary questions to understand how the practitioners deal with the test creation activity.
The third block has fourteen statements and one additional question. Those statements describe execution practices related to test smells. Like the former block, we structured those statements on a five-point Likert scale. The respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable, where always indicates that the respondent frequently comes across test smells. We designed the complementary question to understand which problems the respondents deal with when executing tests. The survey was available from April 3rd, 2019, to June 3rd, 2019. Appendix A includes all the questionnaire statements and questions used in this study.

Pilot Application
We ran a pilot survey with four practitioners to identify improvement opportunities. Based on the responses, we improved the questionnaire before running the survey. It is worth mentioning that we did not include the data gathered in the pilot application in the research results.

Participants
We sent invitations and one questionnaire copy (C1–C8) to practitioners from eight Brazilian companies on a convenience sampling basis. The questionnaire's different versions served to control the number of respondents per company. Those companies have 4 to 66 practitioners who perform manual and automated tests (Table 2). In addition, we also sent the questionnaire through direct messages (D1) and posted it in a Facebook group dedicated to discussing software testing (G1). In total, we contacted 305 practitioners, and 60 of them participated in the survey (#S1–#S60).

Analysis Procedure
To answer RQ1, we analyzed the objective questions (statements) on test creation (second block) and execution (third block). To answer RQ2, we grouped the practices by frequency to identify the most commonly used ones. The practices may be associated with test smells according to their characteristics, such as external file usage, conditional structures, and programming style. To answer RQ3, we compared professional experience with the frequency of use of practices related to test smells. We also used the same answer format as RQ1 but only considered test creation (second block), since during test execution respondents identify test smells instead of creating them.
We analyzed the three open-ended questions through coding and continuous comparison (Kitchenham et al., 2015). The objective was to understand why the respondents use practices that may insert test smells. In addition, we also intended to understand which difficulties they encounter when creating and executing tests. Two researchers performed the coding task and validated it by consensus. We also associated some practices with the test code characteristics defined by Meszaros et al. (2003).
We employed open coding on the collected data to identify additional reasons why the respondents may use bad practices in their software testing activities. The obtained codes were peer-reviewed and changed upon agreement among the paper's authors. We used coding to complement our results because the open-ended questions were optional.

Results
We received 60 answers (out of 305 potential respondents) from three Brazilian states: 40 respondents from Bahia (66.7%), 19 from São Paulo (31.7%), and one from Paraná (1.6%). The respondents ranged from 22 to 41 years old, and their experience with quality assurance ranged from 0 to 13 years (5.16 on average). Experience as software developers also ranged from 0 to 13 years (1.67 on average). Regarding gender, 35 respondents were male (65%), 19 were female (32%), and two were non-binary (3%). Most of the respondents hold a degree in Computer Science-related courses (50 respondents, 83.3%), six respondents (10%) hold a degree in other STEM (Science, Technology, Engineering, and Mathematics) courses, and four respondents (6.7%) hold a degree in other areas. Most of the respondents (54 respondents, 90%) pursued higher education degrees, as follows: 40 respondents hold a bachelor's degree (66.7%), 13 hold a graduate degree (21.7%), and one holds a postdoctoral degree (1.6%).
In terms of domain, 39 respondents claimed they test mobile applications (65%), and 36 respondents test web applications (60%). We also identified the following domains: 14 respondents work with embedded systems (23.33%), 11 with cloud computing (18.33%), seven with information security (11.67%), and four with Internet of Things systems (6.67%). They also mentioned other domains: big data, retail, artificial intelligence, cable TV, bioinformatics, commercial information, desktop systems, and payment solutions, with one respondent each (1.67% each).

Test creation and execution practices
We asked whether the respondents search for test duplication and whether it was a personal or company practice. Twenty-nine respondents (48.3%) answered that it was only an individual activity. Eleven (18.3%) responded that it was only a company practice, and three respondents (5%) claimed that it was both a personal and a company activity. However, seventeen respondents (28.3%) do not apply this activity. Checking for tests with the same objective reduces the Test Code Duplication (TCD) test smell.
In addition, we established a relationship between the test creation and execution practices and the occurrence of test smells using the collected data. Figures 2 and 3 show the usage frequency of test smells during the test creation and execution activities, respectively.
During test creation, the Conditional Test Logic (CTL) and General Fixture (GF) test smells were the most reported ones. The former obtained 28 (47%) Always and Frequently responses, and the latter 27 (45%) (Figure 2). The high rate of those responses may indicate common everyday use of practices related to CTL and GF. We also analyzed why developers create tests with bad practices (one open-ended non-mandatory question answered by 27 respondents, 45%). The main reasons were related to company or personal standards, limited time, and the attempt to reach better coverage and efficiency.
We also asked whether they modified existing test sets when they came across tests containing any of the problematic test patterns illustrated in the survey. We found that seven respondents (11.7%) always perform test code changes, twenty-three (38.3%) frequently change, sixteen (26.6%) rarely change, seven (11.7%) never edit test code, and seven (11.7%) answered not applicable. Among the reasons to modify tests, eighteen respondents reported ambiguity reduction (30%), sixteen claimed execution speed improvement (26.7%), fourteen stated adequacy to company standards (23.3%), eight did not understand the test objective (13.3%), and four stated the evolution of the corresponding production class (6.7%).
In addition, the respondents pointed out that they used to face test structure problems. Thirty-one respondents indicated that some tests depended on third-party resources (52%), 29 reported tests that were hard to understand (48%), 24 claimed tests contained unnecessary information (40%), 24 mentioned ambiguous information (40%), 20 reported dependence on external files (33%), and six pointed to the use of an external configuration file (10%). One respondent mentioned resource limitations (2%).
Regarding difficulties in creating test cases (one open-ended non-mandatory question answered by 23 respondents, 38%), requirement issues were the most frequent, reported by twelve respondents (52%). Other problems were related to difficulties in test code reuse, lack of knowledge, production code issues, code coverage, test environment problems, and time and resource limitations.
The test execution questions also presented a sequence of statements about ordinary situations developers usually face, which respondents should answer according to frequency. The CTL (52%) and GF (47%) test smells were also the most cited during test execution (Figure 3). Those test smells obtained 31 and 28 Always and Frequently answers, respectively. Regarding difficulties in running test cases (one open-ended non-mandatory question answered by 29 respondents, 48%), ten respondents reported the test environment as a problem related to test execution (34%), such as test environment unavailability, demand for third-party features, and low-performance environments. The second most common problem was understanding the test purpose (28%): eight respondents reported tests that were poorly written and lacked a standard, allowing multiple interpretations. The lack of test maintenance was the third problem (24%), which involves outdated and incomplete tests due to system code evolution (7 respondents).

Professional Experience
Although most respondents reported that they both create and execute tests, our investigation revealed a different scenario as testers become more experienced. Figure 4 shows the daily activities according to professional experience, with the following highlights: 10 respondents (16.7%) with 4 to 6 years of experience and 5 respondents (8.3%) with 8 to 10 years of experience create and execute tests in the same proportion. Eight respondents (13.4%) with less than two years of experience, six respondents (10%) with 2 to 4 years, and four respondents (6.7%) with 6 to 8 years only run tests or run tests more often than they create them. Three respondents (5%) with more than 12 years of experience mostly create rather than run tests. Therefore, less experienced respondents run tests more than they create them, and more experienced respondents create tests more than they run them.

We also analyzed whether the use of good practices to create tests increases as respondents become more experienced. We provided the respondents with thirteen statements containing illustrative scenarios of problems with test cases. Each scenario relates to a given test smell. The respondents had to answer how often they experienced each scenario. Table 3 shows the number of respondents grouped by experience time (in years) and the number of valid responses. Figure 5 presents the frequency of test smells grouped by professional experience. When we analyzed the first experience range (0–2 years), 71 answers (50%) indicated that the respondents could not identify the adoption of practices related to test smells (Not applicable); 9 answers (6%) pointed out that respondents Always adopted some practice related to test smells, 16 (11%) answered Frequently, 29 (20%) Rarely, and 18 (13%) Never. When we extended that analysis to the following experience ranges, we could not observe any increase in Never and Rarely responses with professional experience, indicating that experience might not influence the adoption of practices that lead to the introduction of test smells.

Interview Study
After carrying out the survey study, we interviewed software engineers to gather further evidence on how practitioners develop unit test code and deal with test smells during test creation and maintenance. The interview dataset, including the interview transcriptions, interviewees' profiles, and coding summary, is publicly available (Junior et al., 2021).

Design
We employed a semi-structured interview approach, guided by a set of sixteen questions, as Table 4 shows.

Interview Organization
We organized the interview into three blocks:

• Warm-up block (#1–3): questions about the professional background, such as the learning resources on software test code the interviewees commonly use, as well as the programming language they often use to implement test code, if any;

• Technical block (#4–9): questions about how they create, maintain, and assess the quality of the unit tests they develop;

• Test Smell block (#10–15): questions about the interviewees' awareness of test smells and how they handle them in test case creation and maintenance.
The interviewees could also ask for more information or give additional information and suggestions to improve the interview quality (question #16). Unlike the survey, in the interview we employed the actual term test smell in the questions related to the concept, instead of a transitive approach through statements containing practices embedded with test smells. When the participants were not aware of the term or asked for more information on test smells, we presented the concept and two test smell samples, namely CTL and EpT (Virgínio et al., 2020). Those test smells were related to the most and the least frequently used programming practices in the survey results, respectively. There were no questions about challenges or problems involved in creating and maintaining test code. The interviewees answered the questions in Table 4 according to their experiences, concepts, and the information shared during the meeting.
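For reference, the Empty Test (EpT) sample can be sketched as below: a test method without executable statements always passes and conveys a false sense of coverage. The class and method names are hypothetical and are not taken from the interview material.

    import org.junit.Test;

    public class PaymentServiceTest {

        // Empty Test (EpT): the body contains no executable statement,
        // so the test always passes and verifies nothing.
        @Test
        public void refundIsRegistered() {
            // TODO: implement this scenario
        }
    }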
The interviewer and interviewees did not access any test code from interviewees to analyze the presence of test smells.
At the beginning of the interview, the practitioners answered a professional profile form covering academic background and professional experience. They also provided an email address so that we could resolve eventual doubts or collect more data during the analysis. We conducted the interviews between June 3rd and June 30th. Due to the pandemic, we used online meeting tools, such as Skype and Google Meet, upon the participants' request. We recorded the interviews with either the Skype conversation recording tool or the Google Meet screen capture feature. Additionally, we used an external voice recorder for every interview.

Participants
Initially, we contacted practitioners from the survey who had agreed to keep contributing to the research; hence, some interviewees had also participated in the survey study. We also applied the snowballing technique (Kitchenham et al., 2015) to reach further participants. Unlike the survey, we opted only for test code developers whose focus was creating and maintaining unit tests, including the treatment of test smells. Next, we used LinkedIn to invite other potential participants, using the "unit testing" expression in the profile ability search LinkedIn provides. A total of 50 practitioners accepted the invitation (#I1–#I50).

Pilot Study
We performed two pilot interviews with practitioners to measure the interview length and analyze whether it would be necessary to modify any part of the predefined instrument. As a result, there was no need to perform any changes in the instrument. The average interview length was around 30 minutes.

Analysis Procedure
The first author was responsible for transcribing the interviews. From the transcriptions, we performed open coding (Corbin and Strauss, 2014) to answer the research questions. The remaining co-authors analyzed the transcriptions to understand how the practitioners develop tests and deal with test smells. First, we analyzed and validated the coding until we reached a consensus. Then, two authors individually reviewed the proposed coding. In the end, one expert researcher reviewed the final coding.

Results
The interviewees could answer open-ended questions in different ways, according to their reality. Therefore, when presenting the results, some percentages sum to more than 100% in the quantitative analysis.
The respondents' ages ranged from 20 to 48 years old, with most between 25 and 34 years old (60%). Regarding their education, six respondents completed high school (12%), 31 completed an undergraduate program (62%), and 13 hold a graduate degree (26%). Additionally, 48 respondents either have a degree or were studying in a Computer Science-related course (96%), one respondent holds a degree in Applied Business (2%), and one holds a degree in Psychology (2%).
The respondents worked in companies of different sizes, as follows: (i) 10 respondents worked in small companies (fewer than 50 employees; 20%); (ii) 5 worked in medium-sized companies (from 50 to 99 employees; 10%); and (iii) 35 worked in large companies (more than 99 employees; 70%). Additionally, the interviewees were responsible for different tasks within their companies related to their current roles (Table 5). They created unit tests for mobile, desktop, and web platforms using different programming languages (Table 6). Their experience in software development tasks varied from 1 to 20 years, of which more than 50% were in the 1–6 years range. Two of them were not working with unit test creation when we interviewed them; in such cases, we asked them to consider their previous experience.
For the open coding analysis, we compared and analyzed the information and grouped it into codes using sentences, paragraphs, or the entire document. For example, when we asked about their unit test creation process, interviewee #I47 answered: "When I worked only with Java [...]". We did not consider the warm-up block answers (#1–3), as we used them to stimulate the interviewees to provide as much information as possible. We used the technical block answers (#4–9) to analyze how the interviewees created, maintained, and verified test quality, complementing and comparing with the survey's supplementary questions. We used the answers to question #10 to analyze which information the interviewees presented about test smells; therefore, we could answer RQ4. Questions #11 and #12 complemented question #10. We used the answers to questions #13 and #14 to analyze the strategies for dealing with test smells and answer RQ5. Then, we analyzed the answers to question #16 to understand how the interviewees believe it is possible to avoid introducing test smells. Those questions let us better understand how they create, maintain, and verify unit test code and how they deal with and possibly avoid test smells.

Unit test code creation and maintenance
We found that the developers usually create unit test code using Test-Driven Development (TDD) (48%), Test-Last Development (TLD) (42%), or Behavior-Driven Development (BDD) (16%). The usage of those strategies was motivated by the project task or the developer's knowledge about the project's programming language or architecture. For example, interviewee #I16 stated that he used TDD when he mastered the programming language; otherwise, the functional software code was created first and then tested (TLD). Interviewee #I25 claimed that she created unit tests according to the stories from the BDD scenario; when there was no scenario, she used TDD. The method adoption could also depend on whether the software was new or legacy. Interviewee #I32 pointed out that TDD was used on new projects when possible, and he used a BDD variation before the software code creation.
While describing test code creation, four interviewees (8%) mentioned using mocks to simulate components, and two interviewees (4%) adopted clean code practices. For instance, interviewee #I22 claimed he creates easy-to-read and easy-to-understand, fast, and independent test code. Interviewee #I36 uses code patterns and creates less verbose tests. Additionally, four interviewees (8%) focus on test coverage. Interviewee #I12 claimed that he identifies "interesting features" to test. According to interviewee #I43, the test code should cover 80% of the software code. Moreover, interviewee #I10 mentioned the SOLID principles, and interviewee #I15 adopts the Model-View-ViewModel (MVVM) project pattern as practices during test creation.
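As an illustration of the mock-based style those interviewees described, the sketch below uses Mockito to simulate a collaborator so the unit under test runs in isolation. The UserRepository and UserService types are hypothetical, introduced only for this example.

    import static org.junit.Assert.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import org.junit.Test;

    public class UserServiceTest {

        // The repository is replaced by a mock, so the test does not
        // depend on a real data source.
        @Test
        public void returnsTheUserName() {
            UserRepository repository = mock(UserRepository.class);
            when(repository.findNameById(1L)).thenReturn("Alice");

            UserService service = new UserService(repository);
            assertEquals("Alice", service.userName(1L));
        }
    }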
When we asked whether there was any document that standardized unit test creation, nine interviewees (18%) indicated the use of templates or some other documentation. Interviewees #I5 and #I9 mentioned a test template in their projects that the team members could adopt. Interviewee #I29 claimed his team followed Microsoft's official documentation, but there was no internal document. Interviewee #I39 mentioned using a Domain-Specific Language (DSL) to share project information: "On project day 0, we create and standardize an official DSL for the code. You have prerogatives, you have the test, and you have the result." In addition, some interviewees answered that there was no documented standard, but they adopted the Given-When-Then (GWT) pattern and the Arrange-Act-Assert (AAA) programming practices.
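The Arrange-Act-Assert structure the interviewees mentioned (and its Given-When-Then counterpart) can be sketched as follows; the ShoppingCart class is hypothetical and serves only to make the three sections visible.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class ShoppingCartTest {

        @Test
        public void totalReflectsAddedItem() {
            // Arrange (Given): set up the object under test
            ShoppingCart cart = new ShoppingCart();

            // Act (When): exercise the behavior being verified
            cart.addItem("book", 25.0);

            // Assert (Then): one explicit expectation
            assertEquals(25.0, cart.total(), 0.001);
        }
    }

Making each section explicit keeps the test focused on a single behavior, which works against smells such as Eager Test and Conditional Test Logic.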
After unit test creation, test quality assessment was performed through code review (78%) by one or more developers in the project team. This activity was usually supported by tools, such as Pull Panda. For example, interviewee #I2 claimed: "Pull Panda is a tool used to randomly assign one or more developers to perform the code review. [...]". Furthermore, two other interviewees (#I4 and #I16) reported that they performed peer review (4%), and four interviewees claimed they commonly verify test code quality through pair programming (8%). Other identified practices were: test coverage (30%), metric analysis tools (24%) (e.g., SonarQube), review by continuous integration tools (16%), test execution (10%), application of programming practices (10%) (reuse, clean code, and libraries), mutation testing tools (6%), test validation by an external Quality Assurance team (2%), and static validation (2%). Three interviewees reported that there were no test quality assurance activities, either because there were not enough tests or because the company did not support them.
The interviewees adopted various test maintenance types, distributed among corrective (62%), adaptive (36%), preventive (4%), and perfective (4%) maintenance. Four interviewees claimed there was no test code maintenance due to: (i) no defined maintenance process (interviewee #I22); (ii) participation in a new project where no maintenance task was required (interviewee #I24); (iii) absence of maintenance activity because of time shortage (interviewees #I24 and #I36); and (iv) the project environment (interviewee #I45).

Test smells treatment
We asked the interviewees about their knowledge of test smells to understand whether they comprehended the study subject. Figure 6 summarizes the results. Seven interviewees (14%) demonstrated some knowledge of test smells. For example, interviewee #I2 answered: "I know a few things. I consider these as bad practices, bad choices that you make in your test code that make its maintenance and evolution difficult." Twenty-three interviewees (46%) related test smells to code smells but claimed they had never heard of test smells. Interviewee #I16 mentioned: "Test smell, I do not know the concept. A code smell is a problem that the static analysis tool finds in the program. Would a test smell be that same analysis on top of the test code?" Finally, twenty interviewees (40%) did not know test smells and did not relate them to any smell type.
We presented the definition and examples of two test smells (CTL and EpT) to the interviewees who did not know about test smells or asked for more information. Table 7 shows how they prevent test smells during test code creation and how they treat test smells during test code creation and maintenance. For example, during test code creation, code review was the most recommended practice (38%), followed by tool usage (26%) and programming practices (24%). When developing the test code, the developer should follow the programming practices to prevent test smells; tools and code reviews help to check for the insertion of test smells at an early stage of development. Two interviewees believed there were no test smells in their repositories. For example, interviewee #I39 said: "I think we do not have this problem (test smells) in the recent project because of its difficulty level; we follow a coding standard. We educate people on how we code it [...]". Interviewee #I11 also said: "As I am the only one working on the project, I coded, understood, and never had this vision of test smells. I do not think I have any problem with that."
Regarding maintenance, we asked how the interviewees treated test smells during test code maintenance. The answers were similar to the previous question (Table 7). For test code maintenance, code review was also the most recommended practice (28%), followed by refactoring (20%) and tool usage (18%). As the test code was already developed and might contain test smells, they suggested using tools to help detect test smells and refactoring techniques to remove them from the test code. The code review practice can double-check the test code to treat test smells during maintenance.
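As one possible illustration of such a refactoring, the Conditional Test Logic example sketched in Section 4 could be rewritten by replacing the loop and the condition with explicit test cases; the expected values below are hypothetical and follow the same assumed DiscountCalculator.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class DiscountCalculatorRefactoredTest {

        // After refactoring, each input is an explicit test case, so no
        // conditional logic decides at run time which assertion executes.
        @Test
        public void discountForTotalOf150() {
            assertEquals(15.0, DiscountCalculator.discount(150), 0.001);
        }

        @Test
        public void discountForTotalOf300() {
            assertEquals(30.0, DiscountCalculator.discount(300), 0.001);
        }
    }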
We also asked the interviewees how to prevent test smells during test code creation (Table 7). For test smell prevention, tool usage was the most recommended practice (44%), followed by developers' skills (28%) and code review (20%). Developers' skills are related to developing know-how in testing by following good practices, guidelines, and coding patterns, which should help developers identify and prevent flaws in designing and implementing test code. Tool usage can support developers during test code development by identifying possible test smells. Code review is a manual analysis of the test code that double-checks it for test smell prevention.
At the end of the interviews, the interviewees could either provide or ask for further information about test smells and test code quality assurance. Interviewee #I29 claimed: "For me, it is a quality guarantee in terms of exemption from dependencies, in terms of development, cohesion, coupling, and fundamental architecture. From the moment you have unit testing or even TDD, it helps you improve the code and architecture." Interviewee #I35 demonstrated interest in our study: "I would like to know more about the study; we can talk about it later if you want. [...] I thought the term 'test smell' is complicated; at least it does not seem to be a common industry expression."

Discussion
This section discusses the results obtained from the survey and the interviews to answer the research questions. RQ1, RQ2, and RQ3 are related to the survey, and RQ4 and RQ5 are related to the interviews.

RQ1: Do practitioners use test case design practices that might lead to the introduction of test smells?
From the results, we observed that each of the 14 practices related to test smells was pointed out by at least one respondent. We analyzed those practices during test creation and maintenance to identify which types of test smells the participants frequently insert in the test code.
Regarding test creation, we observed that every test smell presented at least three out of the four possible answers (Always, Frequently, Rarely, and Never). We classified the data into two groups: the Commonly used practices group (CPG) and the Unused practices group (UPG). CPG contains test smells that mostly received Always and Frequently answers, and UPG those that mostly received Rarely and Never answers. We considered a test smell as belonging to one group when the difference between the Always/Frequently rate and the Rarely/Never rate was greater than 10%. For example, the Empty Test, For Testers Only, Test Run War, Constructor Initialization, Resource Optimism, Redundant Print, Magic Number, and Indirect Test test smells belong to UPG, which means practitioners rarely insert those smells during testing activities.
On the other hand, the respondents frequently adopt practices related to the General Fixture test smell, the only member of CPG, indicating that they usually create tests with that smell. Still, four test smells presented a similar pertinence frequency to both groups (less than 10% of difference); for those, there was no pattern among respondents. For instance, the Eager Test test smell obtained 38% for CPG and 40% for UPG.
For test execution, UPG contains the Empty Test, Eager Test, Assertion Roulette, Redundant Print, Duplicated Test, Test Run War, For Testers Only, Mystery Guest, Constructor Initialization, and Resource Optimism test smells, which means that the respondents rarely face those smells during test execution. Conversely, the respondents frequently find practices related to two test smells, General Fixture and Conditional Test Logic, which compose the CPG group. In addition, we did not perceive a significant difference among respondents for two other test smells, Indirect Test and Magic Number, which presented similar pertinence frequencies for both groups.
We also investigated the reasons that lead the respondents to adopt the practices presented in the survey. Thus, we analyzed the open-ended questions and identified 16 different tags. The most common ones were company standard, personal standard, project politics, professional experience, saving time, and improving coverage. For example, respondent #S26 reported applying company standards when creating tests that may insert smells and commonly using bad practices "to match company development standards." In another situation, respondent #S54 reported using personal standards: "I group tests by modules to execute them sequentially without compromising effectiveness." This behavior suggests that participants may have misunderstood the test smells definition: when grouping tests, it is possible to insert the Assertion Roulette test smell and compromise test independence. A similar situation occurred with respondents #S14, #S16, #S27, #S50, and #S59.
In general, our study identified that all test smells appeared in the testing activities: all of them were cited by respondents, even if rarely.
Practitioners adopt test case design practices that introduce test smells. Usually, those practices come from improper personal and company standards.

RQ2: Which practices are present in practitioners' daily activities that lead to introducing test smells?
Although there are specific tools to support test automation (Fraser and Arcuri, 2011; Smeets and Simons, 2011), 62% of the respondents perform more manual than automated tests. Besides, 55% have little experience with software development (less than two years of experience); still, as discussed under RQ3, this lack of knowledge does not appear to influence the adoption of bad practices in the test code.
According to the practices explored in the survey, we identified that the respondents usually come across: (i) the use of generic configuration data, which produces the General Fixture test smell (the most frequent in the test creation and execution activities; CPG); and (ii) the use of conditional or repetition structures, directly associated with the Conditional Test Logic test smell (the second most detected in the test execution activity; CPG).
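To illustrate the first case, a General Fixture typically arises when a shared setup method builds more data than each test needs; the sketch below uses hypothetical Customer, Order, and Invoice types introduced only for this example.

    import static org.junit.Assert.assertEquals;

    import org.junit.Before;
    import org.junit.Test;

    public class ReportGeneratorTest {

        private Customer customer;
        private Order order;
        private Invoice invoice;

        // General Fixture: the shared setup creates data for every test,
        // even though each test exercises only part of it.
        @Before
        public void setUp() {
            customer = new Customer("Alice");
            order = new Order(customer, 3);
            invoice = new Invoice(order);
        }

        @Test
        public void customerNameIsStored() {
            // Uses only the customer; order and invoice are dead fixture here.
            assertEquals("Alice", customer.getName());
        }
    }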
The respondents indicated that they usually face several problems with tests, such as poorly written tests and outdated and incomplete test procedures. According to them, when tests rely on generic configuration data, test cases are hard to understand and may produce incorrect results. Moreover, test coverage of the production code becomes unclear due to the presence of conditional logic in the tests. Understanding which practices are most prevalent in the professionals' activities supports improving test quality. Other identified problems are related to incomplete, outdated, or missing documentation, which may hinder the traceability, evolution, and maintenance of testing tasks.
The practices most present in practitioners' daily activities that lead to test smell insertion were the use of conditional or repetition structures and of generic configuration data.

RQ3: Does the practitioners' experience interfere with the introduction of test smells?
In the survey study, we analyzed the respondents' experience and its influence on adopting practices that might lead to inserting test smells in their projects. As a result, we did not identify any clear cause-effect correlation. For example, the Always option indicates that respondents always use harmful practices. When we analyzed the answer frequencies for this option, the usage rate did not decrease over time; instead, Figure 5 shows that respondents with 8 to 10 years of experience achieved a higher usage rate for this frequency. We also identified that behavior when we analyzed the other usage frequencies. However, we could not infer that inexperienced practitioners introduce more test smells than experienced ones in the test creation activity. On the one hand, when testers are inexperienced programmers, they may write lower-quality tests. On the other hand, more experienced testers may carry programming biases that contain bad practices. Thus, the absence of a tendency indicates no behavioral change between less and more experienced practitioners.
Experienced practitioners may not produce fewer test smells than inexperienced ones.

RQ4: How aware of test smells are the practitioners?

Commonly, the interviewees did not know what test smells are. As reported in Section 5, only seven interviewees (14%) demonstrated some knowledge of test smells, twenty-three (46%) related them to code smells without having heard the term, and twenty (40%) did not know test smells at all.

Most practitioners are not aware of the concept of test smells; only a minority demonstrated any knowledge of it.

RQ5: What practices have practitioners employed to treat test smells?

After we explained the concepts during the interview, the interviewees could understand and explain how they deal with test smells in their daily activities. They reported adopting a set of project activities (e.g., code review, pair programming, and technical debt management) and programming practices during the test creation and maintenance processes (e.g., the clean code approach and the Given-When-Then (GWT) and Arrange-Act-Assert (AAA) patterns) to either prevent or treat test smells. The interviewees tended to develop unit tests according to their skills. Professional abilities also determine the result of code review: interviewees who have not learned about test smells or programming practices can approve a submitted package containing these issues. Code review was the most reported activity to treat test smells in test creation (38%) and the most common activity performed by the interviewees (78%) during test quality verification. In this activity, one or more practitioners analyze the submitted code; the reviewer's knowledge therefore determines whether the code is good enough to be merged into the repository.
Each team adopts different strategies to perform code reviews, varying the number of reviewers, the number of approvals, and the professional experience involved. Although some interviewees reported that only experienced members review software and test code, the review may not keep test smells out of the project repository, mainly because both experienced and inexperienced practitioners adopt practices that introduce them.
When we asked about test smell treatment during test maintenance, some interviewees reported creating a technical debt item to refactor the test smell at another moment (interviewees #I08, #I09, #I22, #I25, and #I50). This behavior may indicate that test smell correction is not a priority, and technical debt creation may also be the reason why test smells remain in the repository. For example, interviewee #I09 said: "There is nearly no treatment for test smells. [...]". The interviewees hardly addressed the technical debt and failing tests because they needed to prioritize other tasks, such as software code development. With less time for testing, test smells would be introduced in the test code during test creation and maintenance and would remain in the repository as maintenance activities are postponed.
We did not know beforehand whether the practitioners had learned about test smells; thus, we adopted the concepts of test smells from the literature, and the validation of those concepts was out of scope. Although we did not specifically ask whether the interviewees considered test smells a problem or agreed that the given examples were smells, during their answers about test smell treatment, part of them described how they treat at least one of the given examples. For example, interviewee #I07 said: "Despite not having worked exactly with this type of concept, Sonar itself warned us about these two problems; when the logic was very complex, with a lot of 'if', it warned us to break it into different methods, things like that. Moreover, I remember that it identified comments and commented code and sent a warning."
Regarding the Conditional Test Logic smell example, #I37 said: "This specific code falls into a specific clean code case. This test may be doing more than it should." According to these comments, the interviewees consider test smells, including the given examples, structures to fix.
Practitioners adopt a set of project activities and programming practices to treat test smells. As they do not know the test smell concepts well, it is not possible to guarantee that those strategies treat test smells appropriately.
Threats to Validity

Internal validity. Although there are more than 100 test smells, this study only considered 14 of them. However, we selected the test smells most frequently discussed in the literature. In addition, the test smells were presented in the survey as practices; to mitigate ambiguities and text comprehension issues, we ran a pilot with four testers from different companies. We used professional social networking to reach as many respondents as possible from demographically distributed Brazilian companies for the survey and interview execution.
External validity. Our survey and interview respondents may not adequately represent the practices adopted by practitioners in the wider software engineering industry. Although our results may not generalize, they provide an initial view of the practices adopted by testers. There is agreement among the practitioners' responses, indicating that additional data might not reveal new insights.
Construct validity. The survey did not inform the respondents that the questions referred to test smells, in order to investigate whether practitioners non-intentionally insert test smells; this prevented respondent bias when identifying the adopted practices. Complementarily, to investigate how the practitioners deal with test smells, we presented the concept to the interviewees who did not know the subject. After learning about test smells, the respondents were interested in finding solutions for this "problem" (test smells). We collected answers to open-ended questions and performed a peer-reviewed coding process to avoid biases. The survey and interview instruments were written in Portuguese and translated to English by one author and reviewed by the others.
Conclusion validity. The data analysis was an exhaustive process, which depends on the researchers' interpretation of the answers to the open-ended questions. To prevent biases, we performed the data analysis in three steps: (i) two researchers analyzed the data in pairs to discuss the identification of the codes; (ii) two researchers analyzed the data individually, checking whether new codes could emerge; and (iii) all researchers discussed and compiled the results from steps (i) and (ii). Additionally, to increase transparency, the raw survey and interview data are available online for other researchers to validate and replicate our study.

Related work

Bavota et al. (2015) presented a case study to investigate the impact of test smells on maintenance activities. In that study, developers and students analyzed testing code to compare whether their experience would make a difference in test smell identification. As a result, they found that the intensity of the test smells' impact differs across levels of experience: the number of impacting test smells is higher for students than for industry professionals. Additionally, they found that test smells have a significantly negative impact on maintenance activities. Conversely, our survey found that the practitioners' experience does not interfere with test smell introduction during test creation and execution activities. Moreover, the interviews revealed that the practitioners are not aware of test smells, reinforcing that experience does not influence test smell insertion in the test code.

Tufano et al. (2016) proposed an interview study with 19 participants to investigate developers' perception of test smells. They performed an empirical investigation to analyze where test smells occur in the source code. The results showed that developers generally do not recognize test smells and that test smells are present since the first code commit in the repository. Similarly, our interviews indicated a lack of awareness among developers about the underlying concept of test smells. Additionally, we did not find any study investigating how professional practices affect test smell introduction; therefore, we investigated it through a survey.

Spadini et al. (2020) surveyed developers to evaluate severity thresholds for detecting test smells and to investigate the perceived impact of test smells on test suite maintainability. The developers had to classify whether a test smell instance was valid and rate its importance for maintainability. The evaluation of test smell instances requires knowledge about the topic; therefore, our survey presented practices that might lead to test smell insertion, and our interviews provided information about test smells to level the respondents on the topic.

In our previous work (Silva Junior et al., 2020), we conducted an expert survey to understand whether practitioners unintentionally insert test smells. We surveyed sixty Brazilian practitioners regarding fourteen bad practices that might lead to test smell insertion during test code creation and execution. The results indicated that the practitioners' experience might not influence test smell insertion; usually, practices that lead to test smell insertion came from improper personal and company standards. The current study complements the previous one by investigating the practitioners' knowledge about test smells and how they deal with test code quality regarding the presence of test smells. We conducted interviews with fifty Brazilian practitioners to ask them about the test code creation and maintenance processes. As a result, the interviewees indicated a set of practices that might be useful to treat test smells. However, as they do not know the test smell concepts, those practices need further investigation for test smell treatment.

Conclusion
Test smells may decrease test code quality and maintainability. Our study aimed to identify whether practitioners unintentionally insert test smells in the test code and how they treat them. Therefore, we applied two complementary research methods: a survey and an interview study.
We surveyed sixty respondents to investigate the unintentional insertion of test smells in test code. They evaluated a set of practices related to test smell insertion. The results indicated that the respondents adopt bad practices that might lead to inserting test smells, and that the adoption of bad practices is more related to improper company standards than to the respondents' experience with test code development.
To investigate how the practitioners treat test smells, we interviewed fifty respondents. They answered questions on how they prevent and treat test smells during test code development. The results indicated an overall lack of knowledge about test smells; for most of the interviewees, it was their first contact with the subject. However, after we explained one test smell, the respondents recognized it in their test code and identified practices they adopted to deal with it. Among the recommended practices, we highlight the adoption of tools, coding patterns, programming practices, code review, and training to improve developers' skills and expertise.
After analyzing the answers to the survey and the interviews, we identified that practitioners did not know about test smells. Thus, they insert different types of test smells, even the experienced practitioners. They have tried to treat test smells through some strategies; however, as they have not learned about the subject, they keep inserting test smells in their test code, and the strategies may not be enough to avoid that. These studies are starting points for research that considers practitioners as agents in test smell treatment.
As future work, we aim to follow the Grounded Theory methodology (Corbin and Strauss, 1990) to build a common understanding of how receptive the software industry is to improving test code quality by taking test smells into consideration. We intend to validate the respondents' practices to prevent and treat test smells and to elaborate a checklist for test code quality development and assurance with an in-depth study.
Appendix A

Q29. "I already changed a test by identifying one of the previous points."
Q30. If you answered "always", "frequently", or "rarely" in the previous questions, why were the tests created with these standards?
Q31. If you changed any tests according to the design standards above, why were they edited?
Q32. What problems in the test structure have you encountered?
Q33. What difficulties do you often encounter when creating test cases?

Block 3: Test Execution
Evaluate the following statements according to the frequency found in daily activities:
Q34. "A test case fails due to unavailability of access to a configuration file."
Q35. "Repeat a test case because it previously failed due to competition with some other test case that was running at the same time."
Q36. "Execute tests that could be executed more quickly when modifying the contents of the configuration file."
Q37. "Run a test without understanding its purpose."
Q38. "Some test fails and it is not possible to identify the cause of the failure."
Q39. "Run a test that depends on an external resource that does not have a test for direct validation."
Q40. "A test case fails due to unavailability of access to any external resource."
Q41. "Run a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even if it makes repetition difficult."
Q42. "Run a test to validate a feature that will not be used in the production environment."
Q43. "Find a duplicate test (with the same or different writing)."
Q44. "Run a test with a conditional or repetitive structure."
Q45. "Find an empty test, with no executable instruction."
Q46. "Run a test with printing or display of results in a redundant or unnecessary way."
Q47. "Run a test considering the existence of a resource, without checking its existence or availability."
What difficulties do you usually encounter when running test cases?