An Approach for the Generation of Multi-Objective Algorithms Applied to the Integration and Test Order Problem

Multi-Objective Evolutionary Algorithms (MOEAs) have been successfully applied to solve hard, real software engineering problems. However, choosing and designing a MOEA is considered a difficult task, since there are several parameters and components to be configured. These aspects directly impact the generated solutions and the performance of MOEAs. In this sense, this paper proposes an approach for the automatic generation of MOEAs applied to the Integration and Test Order (ITO) problem. Such a problem refers to the generation of optimal sequences of units for integration testing. The approach includes a set of parameters and components of different MOEAs, and is implemented with two design algorithms: Grammatical Evolution (GE) and Iterated Racing (irace). Evaluation results are presented, comparing the MOEAs generated by both design algorithms. Furthermore, the generated MOEAs are compared to two well-known MOEAs used in the literature to solve the ITO problem. Results show that the MOEAs generated with GE and irace perform similarly, and both outperform traditional MOEAs. The approach can reduce the effort spent to design and configure MOEAs, and serves as a basis for implementing solutions to other software engineering problems.


Introduction
The testing activity is usually performed in different phases. In the unit testing phase, each module (unit) is individually tested. After this, in the integration phase, the units are combined and tested in order to identify software design faults related to the interaction of units. In many cases, there are dependency relations between the units, that is, to test a unit A, another unit B is required. When dependency cycles among units exist, it is necessary to break the cycle and to construct a stub for B, if B is not available. The stubbing process may be an expensive and error-prone task. Hence, to reduce stubbing costs, it is very important to determine the best sequence of units to be tested and integrated. Additionally, there are some factors to take into consideration, such as the number of required methods, attributes, and parameters to be emulated in the stub, besides other ones associated with the software development (Assunção et al., 2014), making this task a multi-objective problem that cannot be easily solved by the tester in a short time. This problem is known in the literature as the Integration and Testing Order (ITO) problem and appears in different contexts, such as component-based development, object-oriented development, aspect-oriented development, and software product line engineering.
The ITO problem has been investigated in the SBSE (Search-Based Software Engineering) field (Harman et al., 2012) by using different search-based algorithms (Wang et al., 2011). The most promising ones are the Multi-Objective Evolutionary Algorithms (MOEAs) (Assunção et al., 2014; Vergilio et al., 2012a), as they can solve hard real-world problems impacted by many conflicting objectives. Moreover, MOEAs are widely used for software engineering problems in general. Surveys of the SBSE field (Harman et al., 2012; Colanzi et al., 2019) report that evolutionary algorithms are the preferred and most used ones.
In spite of this preference and large use, the design of a MOEA is not always easy. MOEAs are distinguished by different components, which directly affect the generated solutions. Furthermore, they have a wide range of parameters to be configured (Eiben and Smit, 2011). For example, MOEAs have as components the search operators and the replacement and archiving procedures. They also have as parameters the population size and the crossover and mutation probabilities. In this work, the word design refers to the choice and implementation of the best configuration of components and parameters. The great number of alternatives makes the MOEA design a hard task, and the best combination of components and parameters may depend on the problem being solved.
Moreover, it is not always possible to know a priori what is the best choice among existing MOEAs. In the SBSE literature, the best MOEA for a problem is usually determined by conducting evaluation experiments, which requires effort and increases costs. Regarding the ITO problem, Assunção et al. (2014) conducted experiments with three different MOEAs (NSGA-II, SPEA and PAES), and none of them proved to be the best to solve the problem considering all instances and contexts.
To reduce such difficulties faced by software engineers, in this paper we propose an approach for the automatic generation of MOEAs applied to the ITO problem. The approach includes an offline training process, performed by a design algorithm that receives as input an instance of the ITO problem reported in the literature (Assunção et al., 2014). The output is a MOEA that can be used by the tester on other instances. To allow the use of different design algorithms, the approach encompasses a set of MOEA parameters and components to be combined. It is implemented with irace (from Iterated Racing algorithms) (López-Ibáñez et al., 2011) and Grammatical Evolution (GE) (Ryan et al., 1998). irace defines a "parameter space" in which the parameters, their types, ranges and constraints are defined. GE is a type of Genetic Programming (GP) (Koza, 1992) that uses a grammar containing a set of rules and values to guide the evolutionary process in the generation of programs. In this sense, by using the grammar of GE or the parameter space of irace, it is possible to map the components and parameters to be used in the automatic design.
The approach is evaluated with seven systems and produces results statistically better (in terms of hypervolume) than MOEAs commonly used to solve the ITO problem in the literature. A comparison between both design algorithms shows that they present similar results. These results provide some evidence for the benefits of automating the design and configuration of algorithms in SBSE. Furthermore, our results can be seen as evidence for the effectiveness of GE and the competitive results between GE and irace, which are ongoing matters of discussion in the community (Whigham et al., 2017b; Ryan, 2017; Whigham et al., 2017a; O'Neill and Nicolau, 2017).
An empirical evaluation of our approach was conducted using seven testing instances of the ITO problem in order to answer the following research questions:
• RQ1: How are the results of the MOEAs generated by our approach in comparison to the MOEAs used in the literature?
• RQ2: How are the results of the MOEAs generated using GE in comparison to the results of the MOEAs generated using irace?
RQ1 concerns comparing MOEAs generated by our approach with the algorithms NSGA-II and SPEA, used in the ITO problem literature. RQ2 concerns comparing the design algorithms GE and irace. As a result, we observe that both design algorithms have similar performance and that our approach generates MOEAs that are better than or similar to the traditional ones, considering some statistical tests and hypervolume, the main quality indicator used in the optimization field to compare MOEAs (Zitzler et al., 2003, 2007).
Some works on automatic design of Evolutionary Algorithms (EAs) are based on GE (Lourenço et al., 2012, 2013, 2015) and irace (Bezerra et al., 2014, 2015). These papers use different parameters and components, which are successfully combined to evolve better algorithms. One of these papers addresses the automatic design of MOEAs specialized in some benchmark instances using irace (Bezerra et al., 2014, 2015). With respect to the use of such algorithms in SBSE, the work of Mariani et al. (2016) proposes a GE-based hyper-heuristic, named GEMOITO, for generating MOEAs to solve the ITO problem. The encouraging results of all the aforementioned works motivated the work herein described. In this sense, our work has the following main contributions:
• Use of a set of components and parameters for MOEA design that includes elements not used in related work;
• Introduction of an approach that can be used to solve software engineering problems formulated as permutation problems. The main idea is to ease the design of MOEAs, reducing effort; and
• Application and evaluation of the approach to the ITO problem. Such an evaluation reports results comparing two design algorithms: GE and irace. We have not found works comparing them.
This paper is organized as follows. Section 2 reviews subjects related to this work: the ITO problem, automatic design of algorithms, and MOEAs. Section 3 introduces the proposed approach, describing its main elements. Section 4 reports the conducted empirical evaluation and the discussion of the results. Section 5 contains related work. Finally, Section 6 presents the conclusion and describes some future work.

Background
This section introduces the ITO problem and reviews background on automatic design of algorithms and MOEAs.

Integration and Testing Order problem
A testing strategy usually includes a set of phases with distinct goals. First of all, the unit testing focuses on each unit, the smallest part of a system to be tested. After this, an integration testing phase is performed to find problems in the interaction among the units. In this phase, the units are combined and tested according to the testing plans. During such integration it is necessary to determine an order to integrate and test the units. Such an order impacts the sequence in which units are developed; the design and execution of test cases; the order in which integration faults are revealed; and the number of required stubs for the units that are possibly not available, but on which the unit being tested depends.
The stubbing process can be expensive and error-prone. The cost can be impacted by many factors, such as the number of attributes and parameters to be emulated, the number of return types and so on. When there is a cyclic dependency between two classes, such a dependency needs to be broken and a stub needs to be built. In this sense, it is important that the class associated with the smaller cost, given for instance by the number of attributes and operations, appears first in the integration order. Due to this, we find in the literature many strategies to solve the Integration and Testing Order (ITO) problem (Wang et al., 2011; Briand et al., 2002a). The most promising works are based on Multi-Objective Evolutionary Algorithms (MOEAs) (Vergilio et al., 2012a; Assunção et al., 2014). The proposed strategies are generally based on graphs that represent the dependencies among the units. A dependency cycle in the graph, which needs to be broken, corresponds to a required stub. The problem is to determine the best sequence of units associated with minimal stubbing cost.
In this work, we use the formulation of the ITO problem proposed by Assunção et al. (2014), as well as the same benchmark. A static model representing the dependencies among the units and costs (associated with the number of public class attributes and methods to be emulated) is provided to the algorithm by using matrices. These matrices are generated previously and reused during the optimization process. If the tester wants to use a new program, the static model of this program must be generated and given as input (as described in Assunção et al. (2014)).
Since the ITO problem is related to permutations of units, which form testing orders, the chromosome is represented by a vector of integers where each vector position corresponds to a unit. The size of the chromosome is equal to the number of units of each system. Thus, letting each unit be represented by a number, an example of a valid solution for a problem with five units is {2, 4, 3, 1, 5}. In this example, the first unit to be tested and integrated would be the unit represented by number 2, the second would be the unit represented by number 4, and so on. Each unit can appear only once.
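The permutation encoding above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `is_valid_order` and the convention that unit IDs run from 1 to the number of units are assumptions made for the example, not part of the original formulation.

```python
def is_valid_order(order, n_units):
    """Return True if `order` is a permutation of the unit IDs 1..n_units.

    Assumption for this sketch: units are numbered 1..n_units.
    """
    # Every unit must appear exactly once, so the sorted chromosome
    # must equal the full list of IDs.
    return len(order) == n_units and sorted(order) == list(range(1, n_units + 1))


# The five-unit example from the text: unit 2 is integrated first, then 4, ...
print(is_valid_order([2, 4, 3, 1, 5], 5))
# An invalid chromosome: unit 2 appears twice and unit 5 is missing.
print(is_valid_order([2, 4, 3, 1, 2], 5))
```

A check of this kind matters mostly for the search operators: crossover and mutation on permutations must keep each unit exactly once.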
We use two objective functions to evaluate the solutions (Assunção et al., 2014), which measure the dependencies between server and client units. Hence, considering that: (i) a unit is a module to be tested that can be either a class or an aspect; (ii) m_i and m_j are two coupled modules; and (iii) the term operation represents class methods, aspect methods and/or aspect advice, the coupling measures used are defined as follows:
• Number of attributes (A) = The number of attributes locally declared in m_j when references or pointers to instances of m_j appear in the argument list of some operations in m_i, as the type of their return value, in the list of attributes (data members) of m_i, or as local parameters of operations of m_i (adapted from Vergilio et al. (2012a); Briand et al. (2002a)). This complexity measure counts the (maximum) number of public attributes that would have to be handled in the stub if the dependency were broken.
• Number of methods (O) = The number of operations/methods (including constructors) locally declared in m_j that are invoked by operations of m_i (adapted from Briand et al. (2002a)). This complexity measure counts the number of public operations that would have to be emulated if the dependency were broken.
In the literature, the ITO problem is usually defined in terms of the testing process cost (Assunção et al., 2014; Assunção et al., 2013). By reducing the number of attributes and operations to be emulated, the tester can reduce the overall cost of the integration testing activity, i.e., these metrics serve as surrogates for the cost of the stubbing process. It is hard to estimate the actual human effort for emulating such items, but using metrics at this granularity (number of attributes and number of operations/methods), as opposed to the number of stubs, can give the tester a better way of estimating the cost. Moreover, the ITO problem formulation commonly found in the literature does not address the fault-revealing capability of the testing process, just the cost of stubbing.
The input data of the MOEAs consist of matrices that are read from a text file. They are matrices associated with (i) dependencies between units; (ii) measure A; and (iii) measure O. The dependency matrix is used to define precedence constraints and the others to calculate the fitness of each solution, where the sum of dependencies between the classes for each measure corresponds to an objective. It is important to highlight that there are classes with a great number of operations and few attributes, and vice versa, so it is necessary to establish a trade-off between the two measures; the goal is to minimize both objectives.
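As a rough illustration of how the three matrices could be turned into the two objectives, the sketch below sums measures A and O over every dependency that an order breaks. Several things here are assumptions made for the example and are not taken from the source: the function name `stubbing_costs`, 0-based unit IDs, the convention that `dep[i][j] == 1` means "unit i depends on unit j", and the rule that a stub is needed whenever i is integrated before j.

```python
def stubbing_costs(order, dep, attr, ops):
    """Return (A, O) for a test order, under the assumptions stated above.

    order: permutation of unit IDs 0..n-1 (integration sequence)
    dep:   dep[i][j] == 1 if unit i depends on unit j (assumed convention)
    attr:  attr[i][j] = measure A for the dependency i -> j
    ops:   ops[i][j]  = measure O for the dependency i -> j
    """
    position = {unit: idx for idx, unit in enumerate(order)}
    cost_a = cost_o = 0
    for i in order:
        for j in order:
            # The dependency i -> j is broken if j is integrated after i,
            # so j has to be emulated by a stub when i is tested.
            if dep[i][j] and position[j] > position[i]:
                cost_a += attr[i][j]
                cost_o += ops[i][j]
    return cost_a, cost_o


# Hypothetical 3-unit system with the dependency cycle 0 -> 1 -> 2 -> 0.
dep = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
attr = [[0, 2, 0], [0, 0, 1], [4, 0, 0]]
ops = [[0, 3, 0], [0, 0, 5], [1, 0, 0]]

for order in ([2, 1, 0], [0, 2, 1], [1, 0, 2]):
    print(order, stubbing_costs(order, dep, attr, ops))
```

With these made-up matrices, each order breaks the cycle at a different dependency and yields (4, 1), (2, 3) and (1, 5) respectively. None of the three is better on both measures, which is the kind of trade-off between attributes and operations described above.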

Automatic Design of Algorithms
The design of an algorithm is related to all decisions taken during its definition and considering a specific problem (Eiben and Smit, 2011). Algorithms used for solving hard optimization problems commonly have several parameters and components to be defined by the user (López-Ibáñez et al., 2011). The chosen values have a great influence on the algorithm performance, but there is no generic set of values, since the optimal set depends on the problem being solved (Eiben and Smit, 2011). These values are then usually chosen based on common wisdom of the community (López-Ibáñez et al., 2011), since finding the appropriate values for the parameters and components is one of the greatest challenges in the evolutionary computation field (Eiben and Smit, 2011).
In the context of EAs, examples of parameters are the population size and the crossover and mutation probabilities. Examples of components are the search operators, such as selection, crossover and mutation ones. In this sense, the automatic design of algorithms can be very useful, since there is a great number of alternatives for these parameters and components. To this end, we find in the literature different methods for the automatic design of EAs. More details about existing methods can be found in the surveys of Smit (2011, 2012).
Our approach works with irace and GE. We chose such algorithms because they are widely used in the literature of automatic configuration of metaheuristics (Smit, 2011, 2012). The GE algorithm has already been used for generating EAs (Lourenço et al., 2012, 2013, 2015) and presented encouraging results. This is one of the main motivations for its usage. Another advantage of GE is that, by using a grammar, it allows a flexible and context-free definition of the programs to be generated. On the other hand, irace uses a mechanism based on statistical tests that has shown good results for multi-objective algorithms (López-Ibáñez and Stützle, 2012; Bezerra et al., 2014, 2015). Both algorithms are described in the following.

irace
The Iterated Racing algorithm (here called irace) works by sequentially evaluating the candidate configurations and excluding the statistically worse ones. According to López-Ibáñez et al. (2011), it consists of three main steps, repeated until a stopping criterion is satisfied:
1. sampling new configurations according to a particular distribution;
2. selecting the best configurations from the newly sampled ones by means of racing; and
3. updating the sampling distribution in order to bias the sampling towards the best configurations.
To sample new configurations (Step 1), the irace algorithm considers two types of parameters, numerical and categorical, and the sampling distribution of each parameter depends on the parameter type. Numerical parameters have a normal distribution and categorical parameters have a discrete distribution. The update (Step 3) changes the sampling by updating the mean and standard deviation for normal distributions, and the probabilities for the discrete distributions. This update process guides the distribution in order to increase the probability of selecting the parameters used in the best configurations when generating new ones (López-Ibáñez et al., 2011).
In order to select the best configurations (Step 2), the candidate ones are evaluated at each step on a single instance. After each step, the candidates that are significantly worse than at least one other candidate are excluded. The race is repeated with the surviving configurations, and continues until a stopping criterion is met. This criterion is generally related to a number of surviving configurations, a number of used instances or a predefined computational budget (López-Ibáñez et al., 2011).
In this work we use the irace package 1 of the R Project 2 , the same one used in Bezerra et al. (2014, 2015). In order to be executed, the irace package requires three inputs: a configuration file, the set of instances to be used and the parameter space. The parameter space is where the parameters used in the automatic configuration, their types, ranges and constraints should be defined (López-Ibáñez et al., 2011). Moreover, the statistical analysis is performed using the non-parametric Friedman statistical test (Derrac et al., 2011).
Figure 1 presents an example of a parameter space defined in a parameter file for the irace package. This example is for the automatic configuration of an Ant Colony Optimization (ACO) algorithm, and is included in the irace package along with other configuration files as a usage sample. Each parameter has a name, a switch value, a type, values and conditions (optional). The name of the parameter is an identifier for later usage in the definition of the conditions. A condition (after the "|" delimiter) is used to define constraints among the parameters. For instance, in the example, the parameter "nnls" is only used when a local search (parameter "localsearch") has received a value between 1 and 3. The type identifier "c" represents the categorical parameters, whereas "r" (real) and "i" (int) represent numerical parameters. While a categorical parameter can only receive one of the values listed in its "value" field, a numerical parameter can receive any value in the range specified in its "value" field. The "switch" field of a parameter is what irace actually passes as an argument when executing the algorithm.
The irace package builds a sequence of arguments and values, and each such sequence forms an algorithm configuration. A script defined in the package configuration file receives this argument sequence and executes the algorithm using such arguments. In the end, the irace algorithm reads the result printed by the generated algorithm and updates its state.
Because a statistical test is used to compare the algorithms of each irace run, at the end of its execution irace returns a set of statistically equivalent algorithms. However, in this paper, we only use the one with the best hypervolume and execute the designed algorithm several times instead. We only choose the best algorithm because the procedure of generating MOEAs in our approach needs an output of only one algorithm.

Grammatical Evolution
A GE algorithm can be considered a type of GP, since it is similarly used to evolve programs (Ryan et al., 1998). However, while a conventional GP algorithm typically uses a tree as representation for an individual and applies search operators to those trees (Koza, 1992), a GE algorithm uses an array of integers or bits and evolves the solutions similarly to a conventional EA (Ryan et al., 1998). Moreover, aside from the usual parameters of an EA (e.g. population size, maximum number of fitness evaluations and others), a GE receives a grammar file, usually in BNF, to map each solution into a program. This mapping is called genotype-phenotype mapping (GPM) (Barros et al., 2013).
The evolution is applied to the chromosome (genotype level), but only the program (phenotype level) can be executed and evaluated by a fitness function. Therefore, the GPM procedure is needed by the GE algorithm to transform each chromosome into an executable program. One of the advantages of GPM is that the genotype space can be freely explored, maintaining the validity of the phenotype (Barros et al., 2013). Furthermore, this allows the design of phenotype-neutral crossover and mutation operators, where different genotypes can be mapped to the same phenotype. Next, we present more details of how this is done, and the default structure of a GE algorithm.
The common solution representation used by GE algorithms is an array of integers or bits. If an array of bits is used, it is first mapped to an array of integers and then this array of integers is mapped to a program. It is possible to skip this step and use an array of integers directly. In either case, the algorithm reads the grammar file, interprets the grammatical rules and then uses the integer values of the array to decide which option is selected for each rule. If the algorithm reaches the end of the integer array but still needs more genes to map into production rules, then the wrapping process is applied. Wrapping consists in consuming genes (when needed) starting again from the beginning of the array when the end of the array is reached.
To illustrate the GPM process, the BNF grammar of Figure 2 is given for evolving simple mathematical expressions. The items between ⟨ and ⟩ are non-terminal rules, | represents the logical operator OR, ::= means that the rule on its left can take any of the options that follow, and the remaining items are terminal nodes. For instance, the rule ⟨var⟩ can take either the value x or y when mapped to a program. On the other hand, ⟨expr⟩ can take the value of a single ⟨var⟩ or a composition of ⟨expr⟩ ⟨op⟩ ⟨expr⟩. The choice between the options is given by the genes of the chromosome array of each individual.
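The mapping with wrapping can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the options of ⟨op⟩ (here + and -) are assumed because they are not listed in the text, the modulo rule for choosing an option is the usual GE convention, and the `max_wraps` limit and the `None` return value for a failed mapping are choices made for the example.

```python
# Grammar for simple expressions, following the rules described in the text.
GRAMMAR = {
    "<expr>": [["<var>"], ["<expr>", "<op>", "<expr>"]],
    "<op>": [["+"], ["-"]],  # assumed options; not given in the text
    "<var>": [["x"], ["y"]],
}


def map_genotype(genes, start="<expr>", max_wraps=2):
    """Map an integer array to an expression string, or None on failure."""
    symbols = [start]  # pending symbols, leftmost first
    output = []        # terminals produced so far
    used = 0           # number of genes consumed, including wrapped reads
    limit = len(genes) * (max_wraps + 1)
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)  # terminal: goes straight to the program
            continue
        if used >= limit:
            return None  # too many wraps: the mapping is considered invalid
        options = GRAMMAR[symbol]
        gene = genes[used % len(genes)]  # the modulo implements wrapping
        used += 1
        # The gene value selects one option of the current rule.
        symbols = list(options[gene % len(options)]) + symbols
    return " ".join(output)


print(map_genotype([1, 0, 0, 1, 0, 1]))  # consumes each gene exactly once
print(map_genotype([1, 0, 0]))           # needs wrapping to finish
print(map_genotype([1]))                 # never terminates, so None
```

In this sketch every rule has two options, so one gene is consumed per expansion. The three-gene chromosome runs out of genes after producing `x` and reuses its genes from the start to complete the expression.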
In this work we use an integer array of variable size as representation for the solutions. This is an existing strategy (Ryan et al., 1998) that might help the algorithm to eliminate useless genes or reinsert new genes into the chromosomes as the evolution proceeds. To this end, the GE algorithm employs two distinct search operators: i) a gene duplication operator; and ii) a gene pruning/deletion operator. The duplication operator selects a random subarray of the chromosome and copies it to the end of the chromosome. The prune operator, on the other hand, selects an index at which to truncate the array. These operators are usually applied with the same probability as the mutation operator and as additional steps in the evolution.
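The two variable-length operators can be sketched as below. The function names, the uniform choice of the subarray bounds and of the cut point, and the decision to always keep at least one gene after pruning are assumptions made for this illustration; the source does not specify these details.

```python
import random


def duplicate(genes, rng):
    """Copy a random contiguous subarray to the end of the chromosome."""
    start = rng.randrange(len(genes))
    end = rng.randrange(start + 1, len(genes) + 1)  # exclusive, so non-empty
    return genes + genes[start:end]


def prune(genes, rng):
    """Truncate the chromosome at a random index, keeping at least one gene."""
    cut = rng.randrange(1, len(genes) + 1)
    return genes[:cut]


rng = random.Random(42)  # seeded only to make the example repeatable
chromosome = [7, 3, 12, 5, 9]
print(duplicate(chromosome, rng))
print(prune(chromosome, rng))
```

Both functions return new lists and leave the parent chromosome untouched, which keeps them easy to apply with a given probability after mutation.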
Summarizing, Algorithm 1 presents the pseudocode of a conventional GE algorithm. As the algorithm shows, it is very similar to an EA, aside from the evaluation of the population (GE has the GPM procedure) and the application of the duplication and prune operators.

Multi-Objective Evolutionary Algorithms
Multi-objective optimization problems have more than one objective to be optimized (Coello et al., 2007). In this kind of problem, usually two or more objectives are in conflict and cannot be optimized at the same time, i.e., by optimizing one objective, the value of the other is degraded. Many real-world problems are multi-objective. For example, a car driver might assume two objectives when choosing the best route between two points: time needed to complete the route and travel cost. In this context, the Pareto dominance concept is employed (Coello et al., 2007). In a minimization problem, a solution x is said to dominate a solution y (x ≺ y) if ∀z ∈ Z : z(x) ≤ z(y) and ∃z ∈ Z : z(x) < z(y), where z is an objective in the set of considered objectives Z. If neither x dominates y nor y dominates x, then such solutions are said to be non-dominated. The set of all possible non-dominated solutions in the search space of the problem being optimized is called the Pareto front (PF). In most cases it is not possible to determine the PF for a given problem. Hence, the algorithms try to find an approximation of this front, here called the best-known Pareto front (PF_known). Differently from a mono-objective EA, which yields a single solution at the end of its execution, the result of a MOEA is a set of non-dominated solutions. Thus, engineers usually have to decide which solution better fits their needs and/or preferences.
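For a minimization problem, Pareto dominance (x is no worse than y on every objective and strictly better on at least one) and the extraction of the non-dominated solutions can be written in a few lines of Python. The function names and the representation of a solution as a tuple of objective values are assumptions made for this sketch.

```python
def dominates(x, y):
    """True if objective vector x Pareto-dominates y (minimization)."""
    no_worse = all(a <= b for a, b in zip(x, y))
    strictly_better = any(a < b for a, b in zip(x, y))
    return no_worse and strictly_better


def nondominated(points):
    """Keep only the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (A, O) values of four test orders.
candidates = [(4, 1), (2, 3), (1, 5), (3, 4)]
print(nondominated(candidates))
```

Here (3, 4) is removed because (2, 3) is better on both objectives, while the remaining three points trade one objective for the other and are therefore all kept. The quadratic filter is enough for illustration; MOEAs such as NSGA-II use more efficient sorting procedures.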
To evaluate the performance of a MOEA, the most used quality indicator is the hypervolume (Zitzler et al., 2003). The hypervolume of a PF is the area dominated by this front with respect to a reference point. Considering a two-objective optimization problem, each point of the Pareto front defines a rectangle in the objective space, with the reference point as its opposite corner. The hypervolume corresponds to the area covered by the union of all these rectangles.
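For two objectives under minimization, the dominated area can be computed with a simple sweep over the front. The sketch below is an illustration only, not the indicator implementation used in the experiments: the function name, the tuple representation of points and the reference point in the example are assumptions.

```python
def hypervolume_2d(front, reference):
    """Area dominated by `front` and bounded by `reference` (minimization)."""
    # Only points strictly better than the reference contribute any area.
    points = sorted(p for p in front
                    if p[0] < reference[0] and p[1] < reference[1])
    area = 0.0
    best_f2 = reference[1]  # lowest second objective seen so far
    for f1, f2 in points:   # sweep in increasing order of the first objective
        if f2 < best_f2:
            # New horizontal strip not covered by the previous rectangles.
            area += (reference[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


front = [(1, 5), (2, 3), (4, 1)]
print(hypervolume_2d(front, reference=(6, 6)))
```

Adding the area strip by strip avoids counting the overlap between rectangles twice. Dominated points such as (3, 4) add nothing, so a larger value indicates a front that is closer to the ideal region, more spread out, or both.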
In mono-objective EAs the engineer can easily decide which solutions are the best ones according to their fitness values at any given moment. However, MOEAs cannot do this in a straightforward way, because they have several objectives to evaluate and potentially a great number of non-dominated solutions. Therefore, some strategies may be applied to help the decisions that MOEAs take during the evolution process. Usually a fitness calculation is applied to the solutions so that the comparison between them becomes possible. For instance, the SPEA2 algorithm (Zitzler et al., 2001) uses the concepts of Combined Dominance Strength (how many solutions a solution dominates and how many solutions dominate it) and Kth Nearest Neighbour (the distance to the kth nearest neighbour) to assign a fitness value to each solution, whereas NSGA-II (Deb et al., 2002) uses the concepts of Dominance Depth (rank of the sub-front) and Crowding Distance (density estimation).
Notice that the fitness evaluation in both examples uses two kinds of evaluations: i) a convergence assessment (Combined Dominance Strength and Dominance Depth); and ii) a diversity assessment (Kth Nearest Neighbour and Crowding Distance). The convergence regards how close the solutions of a front are to a reference front (usually optimal or best-known). A good convergence in the search process can guide the algorithm towards solutions closer to the reference Pareto front and can improve the overall performance of the algorithm. On the other hand, a good search diversity can prevent the algorithm from falling into a local optimum and provide a better exploration of the search space. The idea is to balance both factors during the optimization process in order to optimize the resulting output. However, this might be a difficult task and may come with some drawbacks (e.g. computational cost).
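As an example of a diversity assessment, the crowding distance of NSGA-II (Deb et al., 2002) can be sketched as follows. The function name and the tuple representation of solutions are assumptions made for the illustration, and the sketch handles a single front only.

```python
def crowding_distance(front):
    """Crowding distance of each point in one front (larger = less crowded)."""
    n = len(front)
    distance = [0.0] * n
    if n == 0:
        return distance
    for m in range(len(front[0])):  # one pass per objective
        idx = sorted(range(n), key=lambda i: front[i][m])
        lowest, highest = front[idx[0]][m], front[idx[-1]][m]
        # Boundary points are always kept, so they get an infinite distance.
        distance[idx[0]] = distance[idx[-1]] = float("inf")
        if highest == lowest:
            continue  # all values equal: this objective adds no information
        for k in range(1, n - 1):
            # Normalized gap between the two neighbours along objective m.
            gap = front[idx[k + 1]][m] - front[idx[k - 1]][m]
            distance[idx[k]] += gap / (highest - lowest)
    return distance


print(crowding_distance([(1, 5), (2, 3), (4, 1)]))
```

Solutions in sparse regions of the front receive larger values, so preferring them during selection or replacement pushes the search towards a more evenly spread front. A convergence measure such as Dominance Depth would be computed separately and used as the primary criterion.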
In addition, there are other parameters and components that must be taken into account when designing a MOEA. Some of the most distinguishable parameters are population size, archive size, and mutation and crossover probabilities. The archive size refers to the size of the archive used by some MOEAs (Coello et al., 2007) as an external population to support the evolution process. Furthermore, the MOEA components can have different implementations. For instance, the replacement strategy can be generational or ranking-based. All these details increase the algorithm complexity and require a lot of effort from an engineer who is a novice or a non-expert in the optimization field.
Unfortunately, these details largely influence the performance of the algorithms, and their design and tuning are themselves an optimization problem (Eiben and Smit, 2011). This serves as the main motivation for the approach proposed in the next section.

Proposed Approach
The proposed approach uses offline training to automatically design a MOEA specialized in the ITO problem. The training can be performed by using two different algorithms. Figure 3 shows how the proposed approach works 3 .
Two inputs are used by the approach: the training instance and the set of components and parameters, which are described in detail in Sections 3.2 and 3.3, respectively. The instance of the problem is provided by the user, who also chooses a design algorithm that can be either GE or irace. The set of components and parameters is defined in a representation compatible with the selected design algorithm (grammar or parameter space). This set is predefined, but can be modified or extended if desired. Then, the design algorithm is executed with the training instance, and at the end the best MOEA is returned. This MOEA is used by the tester to solve other ITO instances.
The design algorithms work with a population where each individual is a MOEA. The fitness of an individual is given by a quality indicator calculated on the fronts that it obtains for the ITO problem. In this work we use the hypervolume indicator (Zitzler et al., 2003).
It is important to emphasise that the GE and irace algorithms work on a higher level of the search. Instead of trying to find the solution for the problem directly, these two algorithms try to generate the best MOEA that can in turn solve the problem. Hence, these two algorithms search for MOEAs in the "MOEA space" using the conventional search methods with which they were proposed (López-Ibáñez et al., 2011; Ryan et al., 1998). Next, we describe the structure of the generated MOEAs by detailing their main components and the representation used by each design algorithm.

MOEA
Algorithm 2 shows the structure of a standard MOEA manipulated and returned by the approach. The components of each step and their respective parameters are selected by the design algorithm. These steps are: initialization of the population (Line 2), evaluation of the population (Lines 3 and 9), selection of parents (Line 6), crossover operator (Line 7), mutation operator (Line 8), replacement (Line 10) and archiving of the individuals (Lines 4 and 11).
We divide the fitness assignment into three independent components: one for selection, another for replacement and one for archiving. By changing these assignments separately, the design algorithms can find better MOEAs by focusing on one kind of search at each step. For instance, a selection component can focus on mating more diversified parents, whereas a replacement component can focus on making the most converged solutions survive.

Training Instance
A training instance of the ITO problem must be given by the user, so that the design algorithms can execute the generated MOEAs. This instance must contain the matrices mentioned in Section 2. This information is later used to formulate a permutation problem, where each unit is represented by its ID. During problem solving, the order in which the units appear in the chromosome determines the order in which they are integrated and tested. Because ITO is a permutation problem, the MOEA components and parameters used in the proposed approach are focused on the permutation representation.

Components and Parameters
We chose the parameters and components based on experiments and tuning conducted in related work (Assunção et al., 2014; Guizzo et al., 2015; Briand et al., 2002a). For instance, in Guizzo et al. (2015) the authors used three permutation crossover operators in their online operator selection, which are also included here.
The parameters and components are categorized according to the following MOEA steps: population initialization, selection, mating, replacement and archiving. Moreover, some fitness assignment strategies are used in the selection, replacement and archiving to guide the evolution. In this sense, the Fitness Assignment component is defined as a mechanism to identify which solution is the best one according to all objectives, but its three usages are completely independent and can vary according to the best outcome.
Most components and parameters are generic enough to be used for any problem and by any algorithm (e.g. crossover and selection operator, type of population initialisation), but a few components are more specific and were extracted from existing algorithms (e.g. replacement strategy, archiving of solutions, fitness evaluation mechanism). We have extracted and implemented the components and parameters from the following algorithms: NSGA-II (Deb et al., 2002), SPEA (Zitzler et al., 2001), SPEA2 (Zitzler et al., 2001), Multi-Objective Genetic Algorithm (MOGA) (Fonseca and Fleming, 1993), Pareto Archived Evolution Strategy (PAES) (Knowles and Corne, 2000), and Indicator-Based Evolutionary Algorithm (IBEA) with hypervolume (Zitzler et al., 2003). Hence, our approach is able (although unlikely) to generate each of those algorithms by joining their components together during the evolutionary process. If testers want to adapt our approach, for instance by including the components of their own algorithm, they just need to implement the components and add them to the grammar (explained in more detail in Section 3.4). Next, we present all these elements in detail.

Population
This element is composed of a Population Size and an Initialization procedure. Population Size specifies the number of individuals in the population. Initialization defines how the first population is initialized. It can be done by using Random or Parallel Diversification (Talbi, 2009). The latter aims at generating diversified solutions by initializing the individuals in a way that an integer number cannot be repeated at the same position in another individual of the population.
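The constraint stated for Parallel Diversification (no unit ID repeated at the same position in two individuals) can be illustrated with a minimal sketch. It assumes the population is no larger than the number of units, in which case distinct cyclic shifts of one random permutation are enough. The function name and the shifting scheme are illustrative assumptions, not necessarily the procedure of Talbi (2009) or of the implementation used in this work.

```python
import random


def parallel_diversified_population(num_units, pop_size, seed=None):
    """Build pop_size permutations of range(num_units) such that no value
    occupies the same position in two different individuals.

    Assumption of this sketch: pop_size <= num_units, so that distinct
    cyclic shifts of a single random permutation are sufficient.
    """
    if pop_size > num_units:
        raise ValueError("this sketch needs pop_size <= num_units")
    rng = random.Random(seed)
    base = list(range(num_units))
    rng.shuffle(base)
    # Distinct rotation offsets guarantee distinct values in every column.
    shifts = rng.sample(range(num_units), pop_size)
    return [base[s:] + base[:s] for s in shifts]


population = parallel_diversified_population(num_units=6, pop_size=4, seed=1)
```

For larger populations the constraint cannot hold for every pair of individuals, so a real implementation would have to relax it, for example by falling back to random permutations.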

Selection
The Selection element is related to the selection of parents to be recombined and defines the Source and Selection Operator components. Source specifies from where the parents are extracted. That way, the parents can be selected only from the current population, or from the archive and population combined. Selection Operator specifies the strategy used to select the parents. Random randomly selects two solutions to be recombined. K-Tournament performs k binary tournaments (comparisons) between random solutions and chooses the best two to be recombined. In such a case, the parameter k is also selected. Roulette Wheel gives a probability based on the fitness value of a parent, and performs a selection complying with the probabilities of each parent. Ranking classifies the solutions based on the fitness value and selects the best ones to be recombined.
The K-Tournament, Roulette Wheel and Ranking selection operators use the fitness of each solution to aid the selection. That way, they use the Fitness Assignment component.

Fitness Assignment
The Fitness Assignment element encompasses the Convergence Strategy component to assess the quality regarding the convergence of the solutions, and the Diversity Strategy component for the tie-breaking of solutions with the same convergence value. Another possibility is the usage of only one kind of strategy, either convergence or diversity, for the evaluation. If no Convergence Strategy component is selected, then the Diversity Strategy component becomes the primary metric for fitness assignment.
There are four possible components for the Convergence Strategy. Dominance Depth (NSGA-II) (Deb et al., 2002) assesses the convergence quality of the solutions using Pareto fronts. The first Pareto front has all the non-dominated solutions of the population. The second Pareto front has all the non-dominated solutions excluding the ones in the first Pareto front. Such a process is performed until there are no more solutions. The Dominance Depth of a solution is the number of the Pareto front in which it is present, thus the lower the Dominance Depth value, the better. Dominance Strength (SPEA) (Zitzler et al., 2001) computes the number of solutions that a solution dominates. If a solution dominates many others (greater strength), then it may indicate that this solution dominates a great area of the objective space. Raw Fitness (SPEA2) (Zitzler et al., 2001) assesses the convergence quality by computing the sum of the strength values of all the solutions that dominate a solution. Thus, the lower this value, the less likely a solution is to be in an easily dominated area of the objective space. Dominance Rank (MOGA) (Fonseca and Fleming, 1993) computes the number of solutions that dominate a solution, thus the lower this value, the better.
Regarding the Diversity Strategy, there are four possibilities. Crowding Distance (NSGA-II) (Deb et al., 2002) is based on the distance between neighbouring solutions in the objective space. A low crowding distance value means that the solution is in a crowded area of the objective space, and possibly brings low diversity to the search (e.g. if used as a parent for recombination). Kth Nearest Neighbour (SPEA2) (Zitzler et al., 2001) assesses the distance from a solution to its k-th nearest neighbour solution in the objective space. The parameter k is defined as in (Zitzler et al., 2001), $k = \sqrt{N + \overline{N}}$, where $N$ is the population size and $\overline{N}$ is the archive size. Similar to Crowding Distance, it is necessary to maximize the value of the Kth Nearest Neighbour diversity strategy. Adaptive Grid (PAES) (Knowles and Corne, 2000) divides the objective space into grids to trace the crowding degree of different regions. It makes it possible to diversify the non-dominated solutions and helps to remove excessive non-dominated solutions located in the crowded grids. The adaptive grid value of a solution is the number of solutions in its grid, thus the lower the value, the more isolated the solution is. Hypervolume Contribution (Zitzler et al., 2003) is based on the hypervolume quality indicator. Briefly, the hypervolume contribution of a solution measures how much a solution strictly contributes to the hypervolume of the front in which it is contained. Therefore, the greater the hypervolume contribution of a solution, the bigger the space dominated only by this solution.
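The four convergence strategies can be written directly from their definitions. The following is a minimal sketch that assumes every objective is minimized and that solutions are given as tuples of objective values; the function names are illustrative and do not come from the jMetal-based implementation used in this work.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization assumed)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_strength(points):
    """Number of solutions that each solution dominates (higher is better)."""
    return [sum(dominates(p, q) for q in points) for p in points]


def dominance_rank(points):
    """Number of solutions that dominate each solution (lower is better)."""
    return [sum(dominates(q, p) for q in points) for p in points]


def raw_fitness(points):
    """Sum of the strengths of the dominators of each solution (lower is better)."""
    strength = dominance_strength(points)
    return [sum(strength[j] for j, q in enumerate(points) if dominates(q, p))
            for p in points]


def dominance_depth(points):
    """Index (starting at 1) of the Pareto front that holds each solution."""
    depth = [0] * len(points)
    remaining = set(range(len(points)))
    front = 1
    while remaining:
        # Current front: solutions not dominated by any remaining solution.
        current = {i for i in remaining
                   if not any(dominates(points[j], points[i]) for j in remaining)}
        for i in current:
            depth[i] = front
        remaining -= current
        front += 1
    return depth


points = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(dominance_depth(points))     # [1, 1, 1, 2, 3]
print(dominance_strength(points))  # [1, 2, 1, 1, 0]
print(raw_fitness(points))         # [0, 0, 0, 2, 5]
print(dominance_rank(points))      # [0, 0, 0, 1, 4]
```

The quadratic pairwise comparisons keep the sketch short; production implementations normally use faster non-dominated sorting.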
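Two of the diversity strategies can be sketched compactly. The code below is an illustration under stated assumptions: objective vectors are tuples, crowding distance is normalized per objective as in NSGA-II, and the k-th nearest neighbour value is the plain Euclidean distance, with k taken as the square root of the combined population and archive size as in SPEA2. All function names are hypothetical.

```python
import math


def crowding_distance(points):
    """NSGA-II style crowding distance; boundary solutions get infinity."""
    n = len(points)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][m])
        lo, hi = points[order[0]][m], points[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # all solutions share this objective value
        for pos in range(1, n - 1):
            gap = points[order[pos + 1]][m] - points[order[pos - 1]][m]
            dist[order[pos]] += gap / (hi - lo)
    return dist


def spea2_k(pop_size, archive_size):
    """k = sqrt(N + N_bar), truncated to an integer."""
    return int(math.sqrt(pop_size + archive_size))


def kth_nearest_distance(points, k):
    """Distance from each solution to its k-th nearest neighbour (k >= 1)."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(distances[k - 1])
    return result


front = [(1, 4), (2, 2), (4, 1)]
print(crowding_distance(front))  # [inf, 2.0, inf]
print(spea2_k(100, 100))         # 14
```

In both cases larger values indicate more isolated solutions, which is why both strategies are maximized.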

Mating
Mating Strategy (Talbi, 2009) defines how, and how many, offspring are created in each generation. Generational One Child and Generational Two Children generate N children at each generation. The former generates one child in each recombination and the latter generates two. Steady State generates only one offspring at each generation. Even though Steady State is usually defined as a replacement strategy (Eiben and Smith, 2003), here it is defined as a reproduction strategy, and the replacement component is responsible for inserting the generated solution in the population according to the replacement strategy.
Mating Operators are composed of the Crossover Operator, the Mutation Operator (Eiben and Smith, 2003) and their probabilities. Since in this paper we are addressing a permutation problem, all the crossover and mutation operators used here are for permutation representations.
The Crossover Operator creates one or more offspring by combining the genes of two parents. In this approach, no crossover or one of four crossover operators can be selected. Single Point Crossover selects a random point and cuts the parents at this point. One half of each parent is merged into different children. Two Points Crossover has a similar strategy, but two random points are chosen to cut the parents. The subarray between the two cutting points of a parent is copied to a solution, and the remaining genes are selected from the other parent. Partially Mapped Crossover (PMX) also cuts the parents at two points, but it uses a more complex cyclic procedure to select the genes of the subarrays. Cycle Crossover works by dividing the elements into cycles. A cycle is a subset of elements where each element always occurs paired with another element of the same cycle when two parents are aligned. The offspring are created by selecting alternate cycles from each parent.
A Mutation Operator applies some kind of transformation to the individual. In this approach, the alternatives for that component are no mutation or one of the four following mutation operators: Swap Mutation randomly selects two genes and swaps their values. Insert Mutation randomly selects a gene and moves it to a random position in the chromosome. Scramble Mutation randomly selects two genes and shuffles the values between them. Inversion Mutation randomly selects two genes and inverts the order of the values appearing between such genes.
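As an illustration of the two permutation-specific operators, the sketch below implements textbook versions of PMX and Cycle Crossover. The cut points are passed explicitly so the result is reproducible, and the function names are hypothetical; the implementation used in this work may differ in details such as how the cut points are drawn.

```python
def pmx(parent1, parent2, cut1, cut2):
    """Partially Mapped Crossover producing one child.

    The child keeps parent1[cut1:cut2]; the other positions come from
    parent2, and conflicts are resolved by following the segment mapping.
    """
    size = len(parent1)
    child = [None] * size
    child[cut1:cut2] = parent1[cut1:cut2]
    segment = set(parent1[cut1:cut2])
    mapping = {parent1[i]: parent2[i] for i in range(cut1, cut2)}
    for i in list(range(cut1)) + list(range(cut2, size)):
        gene = parent2[i]
        while gene in segment:  # gene already used inside the copied segment
            gene = mapping[gene]
        child[i] = gene
    return child


def cycle_crossover(parent1, parent2):
    """Cycle Crossover producing two children from alternating cycles."""
    size = len(parent1)
    position_in_p1 = {gene: i for i, gene in enumerate(parent1)}
    child1, child2 = [None] * size, [None] * size
    take_from_first = True
    for start in range(size):
        if child1[start] is not None:
            continue  # position already covered by an earlier cycle
        i = start
        while child1[i] is None:
            if take_from_first:
                child1[i], child2[i] = parent1[i], parent2[i]
            else:
                child1[i], child2[i] = parent2[i], parent1[i]
            i = position_in_p1[parent2[i]]
        take_from_first = not take_from_first
    return child1, child2


p1 = [1, 2, 3, 4, 5, 6, 7, 8]
print(pmx(p1, [3, 7, 5, 1, 6, 8, 2, 4], 3, 6))
# [3, 7, 8, 4, 5, 6, 2, 1]
print(cycle_crossover(p1, [8, 5, 2, 1, 3, 6, 4, 7]))
# ([1, 5, 2, 4, 3, 6, 7, 8], [8, 2, 3, 1, 5, 6, 4, 7])
```

Both operators always return valid permutations, which is the property that makes them suitable for the ITO representation.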
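The four mutation operators are short enough to sketch in full. The code assumes chromosomes are Python lists of unit IDs and that the two selected genes delimit an inclusive segment for Scramble and Inversion; that boundary convention and the function names are assumptions of this sketch.

```python
import random


def swap_mutation(perm, rng):
    """Exchange the values of two distinct random positions."""
    child = list(perm)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def insert_mutation(perm, rng):
    """Remove one random gene and reinsert it at a random position."""
    child = list(perm)
    gene = child.pop(rng.randrange(len(child)))
    child.insert(rng.randrange(len(child) + 1), gene)
    return child


def scramble_mutation(perm, rng):
    """Shuffle the values between two random positions (inclusive)."""
    child = list(perm)
    i, j = sorted(rng.sample(range(len(child)), 2))
    segment = child[i:j + 1]
    rng.shuffle(segment)
    child[i:j + 1] = segment
    return child


def inversion_mutation(perm, rng):
    """Reverse the order of the values between two random positions (inclusive)."""
    child = list(perm)
    i, j = sorted(rng.sample(range(len(child)), 2))
    child[i:j + 1] = reversed(child[i:j + 1])
    return child


rng = random.Random(42)
order = list(range(8))
mutants = [op(order, rng) for op in
           (swap_mutation, insert_mutation, scramble_mutation, inversion_mutation)]
```

Each operator copies the parent and returns a new list, so every mutant remains a valid integration order.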

Replacement
Replacement represents the strategy to define which individuals survive and compose the next generation. The Generational strategy defines that the parents are always replaced by the offspring in the next generation. This replacement takes into account the idea of elitism. Briefly, elitism forces the survival of a predetermined number of best parents for the next generation. The Elitism Size is also selected by the algorithm. If elitism is indeed selected, then the Fitness Assignment component is used to determine the best parents for survival. The Ranking replacement creates a ranking of the solutions, based on the fitness values that are measured by the Fitness Assignment component. That way, the best solutions are selected to survive, regardless of being parents or children.

Archiving
The Archiving element is related to the archiving procedure used to store the solutions. This is employed by some algorithms such as SPEA2 (Zitzler et al., 2001) and PAES (Knowles and Corne, 2000). Sometimes the archive is called external population, but the purpose is the same: to aid the search process with another source of solutions for reproduction, or simply to store the best solutions found so far. At each generation, the archive is updated with the newly generated solutions. In this work the Ranking component is always used to rank the solutions using the Fitness Assignment component, and the best solutions are kept in the archive. It is able to store dominated and non-dominated solutions. Moreover, the Archive Size parameter is used to define how many solutions are stored in the archive. In this work, a set of population size percentages is used. If such a value is zero, no archive is used by the MOEA.

Design Algorithms
For the generation, execution, evaluation and selection of MOEAs the tester can choose between GE and irace. These algorithms have the same purpose: to generate MOEAs based on several trials. However, they require different representations for the elements and components presented in the previous section and specialized for the ITO (permutation) problems. The GE algorithm generates each MOEA using the following grammar. In this grammar, the symbol λ means empty component. If an alternative for a component is not wanted by the tester, then it can be removed from the grammar and it will not be used by the GE algorithm. Similarly, if the tester already has a MOEA and only wants to configure some parameters, then it is necessary to set the components of the MOEA in the grammar without other alternatives. For instance, to apply the NSGA-II algorithm and automatically configure it, the tester must set the replacement rule with only the "Ranking" alternative, the fitness assignment component with only "Dominance Depth" and "Crowding Distance", and any other element that must not change.
Similarly, the parameter space of irace is defined with all the parameters and components, and can be changed if desired. The parameter space used by our approach is structured as follows.

Empirical Evaluation
As mentioned before, our experimental evaluation was guided by two research questions. RQ1 compares the MOEAs generated using our approach with the traditional MOEAs used in the ITO literature. RQ2 compares both design algorithms, GE and irace.
To answer the questions, we use eight real-world systems, also explored in related work (Assunção et al., 2014; Guizzo et al., 2015). One was used for training and seven for testing. They are implemented in Java and AspectJ. Table 1 shows the number of units, dependencies and LOC of each system. The number of dependencies directly impacts the number of existing solutions for the problem. AJHSQLDB (first row) is used only for the training. The empirical evaluation is performed in two phases. In the training phase, the proposed approach is executed on the training instance AJHSQLDB to automatically generate a set of MOEAs. Then, in the testing phase, the generated MOEAs are executed on all testing instances of the problem and are evaluated.
To answer RQ1, we execute two MOEAs successfully used in related work to solve the ITO problem (Assunção et al., 2014): i) NSGA-II (Deb et al., 2002); and ii) SPEA2 (Zitzler et al., 2001). In order to perform a fair comparison, we use the proposed approach with GE to tune these algorithms. This is another advantage of this approach: it can also be used to tune algorithms and not only to design new ones. For this tuning, we adapted a grammar for each algorithm, fixing some key components and parameters of the algorithms, and only letting the other ones vary. We fixed the replacement as Ranking with Dominance Depth and Crowding Distance for the NSGA-II grammar, whereas for the SPEA2 grammar we fixed this component as Generational with no elitism. If we let the GE tune all elements of NSGA-II and SPEA2, then they could lose their main features and become totally different algorithm implementations. All the components were implemented with jMetal (Durillo and Nebro, 2011). The approach was executed once for each algorithm and for the same number of evaluations (10,000). In the end, the best configuration was selected and then used in the testing phase for 60,000 fitness evaluations, as were the generated MOEAs. Table 2 shows the configuration of each algorithm chosen by the approach.

Training Phase
The system AJHSQLDB was chosen as the training instance of our approach, because it is the biggest instance. In preliminary experiments, we observed that using an easily solvable instance usually results in a weak training. Furthermore, we use only one instance of the problem for the training because otherwise it would increase the cost of such a phase. We acknowledge that the use of several instances would increase the training quality, but this is a subject for future work focused only on this trade-off. We execute each design algorithm (GE and irace) 10 times. That way, each run returns one MOEA, for a total of 20 MOEAs. For the GE configuration, we define the parameters based on the literature (Lourenço et al., 2013). Table 3 presents such configuration.
The irace algorithm requires as parameter only a training budget (number of MOEA evaluations), for which we use the same amount given to GE: 10,000. In addition, each generated MOEA is executed by the algorithms for 2,000 fitness evaluations. We use few fitness evaluations for the training due to the great computational time needed for this task. Summarizing, each design algorithm is executed 10 times with 10,000 evaluations in each independent run, and each generated MOEA receives a budget of 2,000 fitness evaluations. We analyse the parameter values and the components of the MOEAs returned in all runs. Based on the 20 obtained MOEAs, we compute the frequencies with which the values appear. Table 4 shows, for each parameter and component and for each design algorithm, the values that appear most often in the MOEAs.
As seen in Table 4, some components and parameters are clearly dominant, since they appear in the design of the best MOEAs very often, regardless of which design algorithm is being used. For example, Steady State mating, Ranking selection operator and Dominance Strength for the selection procedure are used for all the 20 MOEAs. Other components and parameters such as Parallel Diversified initialization, 100% mutation probability, Archive and Population selection source and Ranking archiving are used for almost every obtained MOEA.
Even though the design algorithms obtain similar MOEAs, some differences can be noted by analysing these frequencies. For instance, using GE, the best MOEAs always use Inversion Mutation, whereas using irace the best MOEAs always use Swap Mutation. In addition, irace presents greater crossover probabilities and a convergence strategy for the archiving procedure, while the GE algorithm generates MOEAs with the lowest crossover probability 50% of the time and it does not use a convergence strategy for archiving more than half of the time.

Testing Phase
In order to answer the research questions, the 20 MOEAs generated by our approach in the training phase, and the traditional algorithms NSGA-II and SPEA2, are executed using all the seven testing instances of the problem.
For each instance and MOEA, 30 independent runs are executed for 60,000 fitness evaluations. In the comparison, we use the hypervolume indicator (Zitzler et al., 2003) to assess the quality of each front obtained after each execution. We do not know the real Pareto fronts of the problems, thus this indicator can be used because it does not require such a front. Furthermore, if the hypervolume value of a front A is greater than the value of a front B, then A is not worse than B. These characteristics make hypervolume suitable for the context of this experimentation.
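To make the indicator concrete, the following minimal sketch computes the hypervolume of a front with two minimized objectives as the area dominated by the front and bounded by a reference point. The restriction to two objectives, the choice of reference point and the function name are assumptions for illustration; the experiments rely on the hypervolume of Zitzler et al. (2003), whose normalization is not reproduced here.

```python
def hypervolume_2d(front, reference):
    """Area dominated by `front` and bounded by `reference` (minimization).

    Points that do not improve on the reference point in both objectives
    are ignored; dominated points add no area.
    """
    inside = sorted(p for p in front
                    if p[0] <= reference[0] and p[1] <= reference[1])
    area = 0.0
    best_y = reference[1]
    for x, y in inside:  # sweep from left to right
        if y < best_y:
            area += (reference[0] - x) * (best_y - y)
            best_y = y
    return area


front_a = [(1, 3), (2, 2), (3, 1)]
front_b = [(2, 3), (3, 2)]
print(hypervolume_2d(front_a, (4, 4)))  # 6.0
print(hypervolume_2d(front_b, (4, 4)))  # 3.0
```

In this example every point of front_b is dominated by a point of front_a, and front_a obtains the larger hypervolume, which agrees with the property stated above.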
In order to compare the algorithms, we calculate the hypervolume value of the 30 runs of each algorithm in each instance, and use the Kruskal-Wallis statistical test at 95% confidence (Derrac et al., 2011) to assess statistical differences in the hypervolume values. Moreover, for each instance, we calculate the rank of the algorithms based on their mean hypervolume. In this ranking process, algorithms without statistical difference are considered tied. In the end, we calculate the mean hypervolume and mean rank for each algorithm across all instances. We also compute the Friedman statistical test at 95% confidence over the seven mean hypervolume values of each algorithm to determine if there is any overall statistical difference between their results. Finally, we compute the Vargha-Delaney $\hat{A}_{12}$ effect size (Vargha and Delaney, 2000).

Table 4. Frequency of most used parameter values and components. Some frequencies are not multiples of 10 because in some executions the corresponding components/parameters were not applicable, since they depended on other components that were not selected.
We use multiple-group p-value analysis (Kruskal-Wallis and Friedman) due to the number of algorithms used in the experimentation. Using the Mann-Whitney U test instead would result in a large number of pairwise p-values. While Kruskal-Wallis is used to compute differences in a given experimental subject, Friedman is used on the means of the algorithms to give an overall statistical analysis across multiple systems. Furthermore, for the post-hoc test we use the suggestion of Siegel and Castellan (1988). It is also the default post-hoc technique used for such statistical tests in R.
For the sake of succinctness, we select and report only some of the generated algorithms. In the next paragraphs we present the results for the best, median and worst algorithms of each method according to their mean rank in terms of hypervolume. The best, median, and worst algorithms obtained by the GE algorithm are named GE_Best, GE_Median, and GE_Worst, respectively. Similarly, the best, median, and worst algorithms obtained by irace are I_Best, I_Median, and I_Worst, respectively. Table 5 presents the mean hypervolume for each instance and algorithm, and its mean rank in parentheses. The last row presents the overall mean hypervolume and mean rank. The last column presents the p-value obtained using the Kruskal-Wallis test, except in the last row, where the Friedman test result is shown. The best values (greatest for hypervolume and lowest for ranks) and the values statistically equivalent to the best ones are highlighted in bold.
In general, the results obtained by the best and median generated algorithms are better than the ones obtained by the conventional ones. GE_Best and I_Best are able to obtain the best results, or results equivalent to the best according to the Kruskal-Wallis test, in almost every instance of the problem. Even the worst generated algorithms (GE_Worst and I_Worst) perform better than NSGA-II and obtain results competitive with SPEA2 overall.
I_Best performs better in general, which results in the best mean rank (3.14). GE_Best obtains the second best results, with a mean rank difference of only 0.07 to I_Best and with no statistical difference according to the Friedman and Kruskal-Wallis tests. The only system (TollSystem) in which GE_Best is statistically worse than the other algorithms is in fact the smallest one. The best known solution for this system is always found by GE_Median. Furthermore, the median and worst algorithms obtained by the GE algorithm perform better than the median and worst algorithms of irace, respectively. The Friedman test presents differences only between GE_Best and NSGA-II, GE_Median and NSGA-II, and GE_Best and I_Worst. Even though I_Best obtains the best results (equal hypervolume to GE_Best and GE_Median, but better rank), the Friedman test result shows no statistical differences between I_Best and any other algorithm. Perhaps with more systems in the experiment, the statistical power would be higher and we could see more differences between the algorithms. With only seven systems, the more conservative statistical assessment of the non-parametric Friedman test takes place, preventing us from assuming and reporting other differences.
Figures 4-6 show the fronts obtained by the algorithms. We present only the fronts of the three most complex systems based on the number of dependencies presented in Table 1. The other instances are omitted because the algorithms found similar solutions for them. Based on the presented fronts, for the AJHotDraw instance, the solutions found by NSGA-II and SPEA2 are dominated by the ones found using the other algorithms. Moreover, the GE algorithms obtain better solutions. For the MyBatis instance, the best solutions are obtained by I_Best and GE_Best. For the JHotDraw instance, I_Best found two non-dominated solutions, and one of them is also found by GE_Best and GE_Worst. In general, with exception of JHot

In order to provide another way to analyse the differences between the algorithms, we present the Vargha-Delaney $\hat{A}_{12}$ effect size (Vargha and Delaney, 2000; Arcuri and Briand, 2014). The $\hat{A}_{12}$ statistic estimates the probability that a value drawn from a group X is greater than a value drawn from another group Y, where an $\hat{A}_{12}$ value of 1.0 means that X always outperforms Y, 0.0 means that Y always outperforms X, and 0.5 means that X and Y are equally good. According to Vargha and Delaney (2000), the $\hat{A}_{12}$ values can be read as follows:

This statistical test is computed over the 30 hypervolume values of each algorithm and compared to the other ones in each instance. Because this is a pairwise comparison, there would be 196 effect size values to report: 28 values for each of the seven instances. In this sense, for a cleaner view of such data, we summarize these results in Figures 7 and 8, and in Table 6. The figures present, respectively, the box plots for the $\hat{A}_{12}$ effect size values obtained in the comparisons between I_Best and the other algorithms, and between GE_Best and the other algorithms in all instances. We chose these two algorithms because they were the best algorithms obtained by each design algorithm. Any $\hat{A}_{12}$ value above 0.5 is in favour of the main algorithm.

I_Best performs similarly to GE_Best and, as expected, both perform better than NSGA-II and SPEA2. Furthermore, the differences between the best, median and worst algorithms become clearer. For example, in Figure 7 we can see that GE_Best obtained the closest results to I_Best, followed by GE_Median and then GE_Worst. This "ladder" is also visible in Figure 8 regarding I_Best, I_Median and I_Worst. Table 6 presents the number of non-negligible favourable effect size values minus the number of non-negligible unfavourable effect size values for each algorithm in a given row when compared to another algorithm in the respective column. For example, GE_Best (first row) obtained 5 favourable, 1 negligible and 1 unfavourable effect size values when compared to GE_W (fourth column), hence the difference is 4 (5 favourable minus 1 unfavourable). For each cell, the maximum value is 7 and the minimum is -7, representing, respectively, that the row algorithm was significantly better or worse than the column algorithm in all systems. The last column shows the sum of the differences. We could observe some ties in the effect size counts between some of the algorithms. GE_Best tied with GE_Median and I_Best, whereas I_Best only tied with GE_Best and not with GE_Median. Furthermore, GE_Median also obtained the best total sum of count differences, even though its hypervolume average is the same as, and its mean rank is worse than, those of GE_Best and I_Best. This indicates that these algorithms performed very similarly in the testing set used in this work, which can also be observed in the proximity between their Pareto fronts shown in Figures 4-6. NSGA-II shows the worst results in the effect size comparisons, whereas SPEA2 is more competitive with the other algorithms, overcoming the worst generated algorithms.
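The effect size and the per-cell counts of Table 6 can be sketched as follows. The $\hat{A}_{12}$ function follows the standard definition (ties count as half). The 0.06 margin around 0.5 used to call a difference negligible is an assumption of this sketch, based on the commonly cited 0.56 threshold for a small effect, and the function names are hypothetical.

```python
def a12(x, y):
    """Vargha-Delaney A12: chance that a value from x exceeds one from y,
    with ties counting as half."""
    greater = sum(1 for a in x for b in y if a > b)
    equal = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * equal) / (len(x) * len(y))


def net_effect_count(row_runs, col_runs, negligible=0.06):
    """Favourable minus unfavourable non-negligible effect sizes.

    row_runs and col_runs hold one list of hypervolume values per system
    for the row and column algorithm. The `negligible` margin is an
    assumed threshold, not a value taken from the paper.
    """
    net = 0
    for x, y in zip(row_runs, col_runs):
        delta = a12(x, y) - 0.5
        if delta >= negligible:
            net += 1
        elif delta <= -negligible:
            net -= 1
    return net


# Seven systems: 5 favourable, 1 negligible and 1 unfavourable comparison.
row = [[2, 2]] * 5 + [[1, 2]] + [[1, 1]]
col = [[1, 1]] * 5 + [[1, 2]] + [[2, 2]]
print(net_effect_count(row, col))  # 4
```

With seven systems the result of net_effect_count ranges from -7 to 7, matching the bounds described for each cell of Table 6.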
Comparing the generated algorithms and the traditional ones, we observe that all generated MOEAs are able to outperform NSGA-II and obtain results competitive with SPEA2. Additionally, some algorithms performed significantly better than the traditional ones, especially I_Best, GE_Best and GE_Median. In this sense, we can answer RQ1 by asserting that the proposed approach is capable of generating MOEAs that are better than conventional ones in the literature.
Regarding RQ2, it is not possible to conclude which design algorithm is better for this problem. irace obtains the best overall MOEA, but GE generates a better set of algorithms considering the median and worst generated algorithms. In other words, both best and median algorithms of GE are able to statistically outperform NSGA-II and sometimes SPEA2, which occurs less often for the MOEAs generated by irace. However, irace generated the best algorithm considering hypervolume values, rankings, and statistical differences across all systems.

Further Discussion
An advantage of GE is that its grammar provides a flexible way of designing algorithms. By using such a grammar, the tester can create complex structures by changing the arrangement of the grammar rules. This enables not only the generation of MOEAs with a common template, but also MOEAs using different procedures (e.g. multiple crossover and mutation). On the other hand, GE needs more configuration (number of parameters) than irace.
Considering the structure of the generated MOEAs, a tester would probably not think about designing such uncommon algorithms, since they are quite different from the well-known MOEAs presented in the literature. Table 7 presents the components and parameters used by the best and worst algorithms generated by GE and irace. For instance, GE_Best uses a mutation operator with 100% probability and no crossover operator. Of course, these characteristics are dependent on the training instance we used, but they contradict the "common wisdom" of low mutation and great crossover probabilities. Different training scenarios could provide a more human-like designed algorithm, but the manual design of MOEAs may still not be as powerful as automated mechanisms such as the approach proposed herein. In fact, as stated by Eiben and Smit (2011), the design and configuration of an EA is an optimization problem itself, hence optimization algorithms may be successfully applied in this context.
Beyond its promising results, the proposed approach also reduces the effort required to design and configure a multi-objective algorithm. In a conventional scenario, the tester might avoid the boring tasks of configuring or tuning the MOEA, and may simply use a common MOEA with arbitrary parameters and components, which can yield poor results. The proposed approach can solve this problem by automatically generating MOEAs that present the best results.
The downside of the approach is the great amount of computational resources used in the training phase. In this paper we executed the design algorithms for 10,000 evaluations, and assigned 2,000 training evaluations to each MOEA, resulting in 20 million total fitness evaluations in each training run. Each training run took 113 hours on average to execute due to the great number of fitness evaluations during the training. A cheaper strategy could be adopted by reducing the design budget from 10,000 to 5,000 evaluations, and the training budget of each MOEA from 2,000 to 1,000 fitness evaluations. However, we would still be using 5 million fitness evaluations for the training (around 28 hours of training time). This elevated cost is not exclusive to our approach, but rather inherent to offline design algorithms (Bezerra et al., 2014; Guizzo et al., 2015; Burke et al., 2012).
However, one should bear in mind that the 20 million fitness evaluations were also assigned to the tuning of NSGA-II and SPEA2. Even when we assign the same amount of resources, automatically tuning an existing algorithm leads to suboptimal results. If we consider the manual effort of designing a MOEA or tuning an existing one, then this effort must be added to the computational resources spent in the tuning process. In such a case, the tester would have to perform experiments and compare the algorithms manually, which is already done automatically by our approach. All in all, letting our approach design and configure a new algorithm seems like the best option, despite its cost.
Although a great number of MOEA evaluations were used for the training, we discovered that the best algorithm is found, on average for the 30 training executions, around the 7,120th evaluation out of 10,000. In some executions the best algorithm was found as early as the first 3,000 evaluations, but sometimes it was not found until the last 100 evaluations. This data will be important for future work, where we are going to investigate a better stopping criterion for the training in order to reduce the cost of this phase.
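The budget figures above follow from a simple product, which the sketch below reproduces. It assumes that training time scales linearly with the number of fitness evaluations, using the reported average of 113 hours for 20 million evaluations as the rate; the function name and this linear model are assumptions of the sketch.

```python
def training_cost(design_evaluations, moea_budget, hours_per_million=113 / 20):
    """Return (total fitness evaluations, estimated hours) for one training run.

    hours_per_million is derived from the reported 113 hours for
    20 million evaluations and assumes linear scaling.
    """
    total = design_evaluations * moea_budget
    return total, total / 1_000_000 * hours_per_million


print(training_cost(10_000, 2_000))  # 20 million evaluations, about 113 hours
print(training_cost(5_000, 1_000))   # 5 million evaluations, about 28 hours
```

Halving both budgets divides the total by four, which is why the cheaper setting still needs roughly a quarter of the original training time.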

Threats to Validity
The experiments were performed using only seven systems. In order to minimise this external threat, we extracted systems of different sizes and characteristics from related work (Assunção et al., 2014; Mariani et al., 2016; Guizzo et al., 2015). It is worth mentioning that the systems are implemented in two different paradigms: object-oriented and aspect-oriented. These details help mitigate the threat and improve the generalisation of our results.
We used only AJHSQLDB as the training instance for both GE and irace. Even though we observed similar results using MyBatis as the training instance in an earlier, preliminary experiment, the choice of training instance may affect the final output. We expect the results to degrade when smaller training instances are used, and we intend to test this in future work.
The results are susceptible to the components and parameters used in the training phase. In this sense, a greater number of components and parameters would result in different MOEAs and, probably, better algorithms. In order to minimise this threat and improve generalisation, we used a comprehensive set of components from multiple different algorithms in the literature.
Another possible threat is that, due to the execution costs, we executed each design algorithm (GE and irace) only 10 times in the training phase. This number was chosen based on other works in the literature on automatic algorithm design (??). We think it is enough to show the behaviour of our approach and to offer a preliminary evaluation.
During the testing phase we used only the hypervolume indicator, due to its compatibility with our scenario and because it is considered the only Pareto-dominance compliant indicator (Zitzler et al., 2007). We tried to mitigate this threat by also presenting the Pareto fronts found during the evaluation and by making our experimental results available.
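For readers unfamiliar with the hypervolume indicator, the following minimal sketch computes it for a two-objective minimisation front. It is an illustration only: the function name, the example front and the reference point are assumptions of ours, and the experiments may use a different number of objectives, normalisation and reference point.

```python
def hypervolume_2d(front, reference):
    """Area dominated by `front` and bounded by `reference` (both objectives minimised)."""
    # Only points that are strictly better than the reference point contribute.
    points = sorted(p for p in front if p[0] < reference[0] and p[1] < reference[1])
    area = 0.0
    best_f2 = reference[1]
    for f1, f2 in points:
        # Points sorted by f1: a point adds area only if it improves f2.
        if f2 < best_f2:
            area += (reference[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


front = [(1, 3), (2, 2), (3, 1)]
print(hypervolume_2d(front, reference=(4, 4)))
```

Adding a dominated point such as (3, 3) leaves the value unchanged, which illustrates why the indicator rewards only solutions that improve the front.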

Related Work
This section first reviews studies addressing the ITO problem and then related work on the automatic design of algorithms.

Studies addressing the ITO problem
In the literature the ITO problem has been addressed by many studies and in different contexts. The work of Hewett and Kijsanayothin (2009) focuses on the integration order of components, which can be subsystems, modules, or object-oriented classes. The authors use a graph-based algorithm to solve the problem and reduce the number of required stubs. The heuristics were applied to test dependency graphs and showed an improvement over other approaches. The authors also stated that their algorithm executed faster than these approaches. Other works, such as Abdurazik and Offutt (2009) and the ones reviewed by Briand et al. (2003), also use graph-based algorithms and similar heuristics to solve this problem in the Object-Oriented (OO) context. The approach of Jiang et al. (2019) introduces a strategy to consider the control coupling complexity, estimated by using the concept of transitive relationship. Ré et al. (2007) proposed an extension to encompass Aspect-Oriented (AO) programs.
Even though the mentioned works can solve the integration and testing order problem, sometimes they may only find suboptimal solutions (Assunção et al., 2014; Briand et al., 2002b). To overcome this situation, some works, such as Assunção et al. (2013), Briand et al. (2002b) and Vergilio et al. (2012b), use search-based approaches. Briand et al. (2002b) proposed a way to optimally solve this problem by using GAs. However, the authors used an aggregation of coupling measures as the fitness function, in contrast to Assunção et al. (2013) and Vergilio et al. (2012b), where multi-objective approaches were used and good results were obtained. Cabral et al. (2010) also used a multi-objective search-based algorithm, more specifically Ant Colony Optimization (ACO), and obtained better results than an approach with aggregation of objectives.
We can observe that MOEAs obtained the best results in solving the ITO problem, which means that it is a suitable problem for applying our MOEA design approach. Among all existing works addressing the ITO problem, the work of Assunção et al. (2014) introduced an approach that can be applied in different contexts of the problem, such as AO and OO programs. For this reason, we used such a formulation in our work.
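As a reminder of what the approaches reviewed in this subsection try to minimise, the sketch below counts the stubs required by a given integration order. It is a deliberately simplified illustration: the unit names and the dependency map are hypothetical, and the multi-objective formulations cited above measure stubbing cost with coupling measures rather than with a plain stub count.

```python
def count_stubs(order, dependencies):
    """Count the units that must be stubbed for a given integration order.

    `dependencies[u]` lists the units that `u` requires in order to be tested.
    """
    position = {unit: index for index, unit in enumerate(order)}
    stubbed = set()
    for unit, required_units in dependencies.items():
        for required in required_units:
            # A stub is needed when the required unit is integrated later.
            if position[required] > position[unit]:
                stubbed.add(required)
    return len(stubbed)


# Hypothetical system with a dependency cycle A -> B -> C -> A.
dependencies = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}

print(count_stubs(["C", "B", "A", "D"], dependencies))
print(count_stubs(["D", "A", "B", "C"], dependencies))
```

Because of the cycle, no order avoids stubs entirely, but the first order needs far fewer than the second, which is the kind of difference an optimisation algorithm exploits.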

Works on automatic design of algorithms
In the literature we find works on the automatic design and configuration of metaheuristics that use different techniques (Bezerra et al., 2014, 2015; López-Ibáñez and Stützle, 2012; Smit et al., 2010; Dréo, 2009). Some of them focus only on the automatic configuration of parameters of mono-objective algorithms (Smit et al., 2010; Dréo, 2009). The works most closely related to ours are those that use GE and irace.
We can find in the literature some works that use GE for the automatic design and configuration of heuristics (Lourenço et al., 2012, 2015; Burke et al., 2012; Mascia et al., 2014; Marshall et al., 2014b,a; Hutter et al., 2007). They focus on simple heuristics (Marshall et al., 2014b,a) and metaheuristics, such as local search algorithms (Burke et al., 2012; Mascia et al., 2014; Hutter et al., 2007) and EAs (Lourenço et al., 2012, 2013, 2015). We would like to mention the works of Lourenço et al. (2012, 2013, 2015), which propose a GE approach to automatically design and configure mono-objective EAs. Their grammar encompasses the main components and parameters of EAs, such as population size and crossover and mutation operators. The EAs returned by the approach obtained better results than traditional ones on the Royal Roads problem. However, we did not find works applying GE in the context of MOEAs, or even for other types of multi-objective algorithms.
Regarding the use of irace, related work (Bezerra et al., 2014, 2015; López-Ibáñez and Stützle, 2012) also considers EAs and a set of components and parameters. The studies of Bezerra et al. (2014, 2015) are the only ones addressing the context of MOEAs. They use many MOEA components and parameters, such as replacement, archiving and fitness assignment. As a result, the automatically generated MOEAs could outperform traditional ones on instances of combinatorial and continuous problems. Our work is mainly inspired by Bezerra et al. (2014, 2015) and Lourenço et al. (2012), because the idea of using GE and irace for the automatic design of EAs seems very promising. Furthermore, in the automatic design of MOEAs, there are many parameters and components that can be explored. However, our approach differs from theirs by using a different set of components and parameters, such as initialization procedures, source of parent selection, replacement procedures and the possibility of elitism. The set proposed herein is broader and more suitable for the MOEA context.
Another difference is the application of automatic design of MOEAs in the SBSE area. None of the works mentioned above generate algorithms specialized for solving software engineering problems, such as ITO.
Regarding the ITO problem, a great number of approaches found in the literature use search-based algorithms (Wang et al., 2011; Briand et al., 2002a). The most promising are based on MOEAs (Assunção et al., 2014; Vergilio et al., 2012a; Guizzo et al., 2015). However, implementing and configuring a MOEA solution can be very difficult for the tester, who is not usually an expert in the optimization field. To ease such a task, the work of Mariani et al. (2016) introduces a hyper-heuristic based on GE to derive MOEAs for solving the ITO problem. Such a hyper-heuristic was also used to select products for Software Product Line (SPL) testing (Jakubovski Filho et al., 2017). These works found good results and serve as motivation for our approach, which now encompasses a set of elements and components that allow the use of both GE and irace and the comparison of both design algorithms. As far as we know, there is no work in the literature that applies and compares both.

Concluding Remarks
This paper presented an approach for the automatic design of MOEAs applied to the ITO problem. For this purpose, the main components and parameters of MOEAs were identified and a set of alternatives was assigned to each of them. They were formally mapped to a grammar (GE) and to a parameter space (irace) to be used by the design algorithms.
An empirical evaluation was conducted in two phases. The first was the training phase, where the GE and irace algorithms were executed using one instance of ITO, generating ten MOEAs each. In the testing phase, the twenty obtained MOEAs were executed using all seven instances of the problem. Furthermore, the traditional MOEAs NSGA-II and SPEA2 were also executed using such instances. Their parameters were tuned using the proposed approach.
The MOEAs generated by the approach with both design algorithms were compared using the hypervolume quality indicator and three statistical tests: Vargha-Delaney's Â12 effect size, Kruskal-Wallis and Friedman. They obtained similar results and also showed some similarities regarding the most selected components. In addition, the generated MOEAs are able to outperform the traditional ones with statistically significant difference.
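As an illustration of the effect size measure mentioned above, the following minimal sketch computes the Vargha-Delaney Â12 statistic from its standard definition: the probability that a value drawn from the first sample is larger than one drawn from the second, with ties counted as one half. The function name and the sample values are hypothetical and are not taken from our experiments.

```python
def vargha_delaney_a12(first, second):
    """Probability that a value from `first` exceeds one from `second` (ties count 0.5)."""
    greater = sum(1 for a in first for b in second if a > b)
    ties = sum(1 for a in first for b in second if a == b)
    return (greater + 0.5 * ties) / (len(first) * len(second))


# Hypothetical hypervolume values of two algorithms over a few runs.
generated = [0.82, 0.85, 0.88]
traditional = [0.70, 0.75, 0.82]

print(vargha_delaney_a12(generated, traditional))
```

A value of 0.5 means that neither sample tends to be larger, while values close to 1 indicate that the first sample is larger in almost every pairwise comparison.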
In short, the introduced approach can successfully generate MOEAs that better solve the ITO problem, improving on the performance of the traditional algorithms and reducing effort by freeing the tester from tasks such as choosing among existing MOEAs, implementing them and configuring them.
We intend to apply the approach to other ITO instances, testing contexts and problems. Some of the mentioned limitations should be addressed, for example by using other training instances and quality indicators in the fitness evaluation.
In addition, the approach serves as a basis for future work that can extend the proposed set of elements to cover other desired characteristics of MOEAs and to be used in other domains, helping SBSE practitioners implement solutions for other SE problems. Ideally, the generated MOEAs could be generalized to other SE problems without the need for retraining.