A Data-centric Model Transformation Approach using Model2GraphFrame Transformations

Data-centric (Dc) approaches are being used for data processing in several application domains, such as distributed systems, natural language processing, and others. There are different data processing frameworks that ease the task of parallel and distributed data processing. However, few research approaches study how to execute model manipulation operations, such as model transformations, on such frameworks. In addition, it is often necessary to extract XMI-based formats into possibly distributed models. In this paper, we present a Model2GraphFrame operation to extract a model in a modeling technical space into the Apache Spark framework and its supported GraphFrame format. It generates a GraphFrame from the input models, which can be used for partitioning and for processing model operations. We used two model partitioning strategies: one based on subgraphs and one based on clustering. The approach allows performing model analysis by applying operations on the generated graphs, as well as Model Transformations (MT). The proof-of-concept results, covering the Model2GraphFrame extraction, GraphFrame partitioning, GraphFrame connectivity, and GraphFrame model transformations, indicate that our model extraction can be used in various application domains, since it enables the specification of analytical expressions on graphs. Furthermore, its model graph elements are used in model transformations on a scalable platform.


Introduction
Model Transformations (MTs) are key artifacts for existing MDE (Model-Driven Engineering) approaches, since they implement operations between models (Brambilla et al., 2012). Nevertheless, the transformation of models via parallel and/or distributed processing is still a challenging question in MDE platforms. There are recent initiatives that aim to improve existing solutions by adapting the computation models, for instance, using MapReduce (Dean and Ghemawat, 2008) to integrate model transformation approaches within data-intensive computing models. Works such as Burgueno et al. (2016), Pagán et al. (2015), Benelallam et al. (2015) and Tisi et al. (2013) aim at providing solutions for this new scenario using frameworks such as Linda and MapReduce. Even when adopting these frameworks, model processing is not a straightforward task, since models are semi-structured and can have self-contained or inter-contained elements, unlike flat data structures with linear space usage, such as logs, text files, and others.
The need for performing complex processing on large volumes of data has led to the re-evaluation of the utilization of different kinds of data structures (Raman, 2015). Very Large Models (VLMs) are composed of millions of elements. VLMs are present in specific domains such as the automotive industry, civil engineering, Software Product Lines, and the modernization of legacy systems. Furthermore, new applications are emerging in domains such as the Internet of Things (IoT), open data repositories, and social networks, demanding intensive and scalable computing for manipulating their artifacts (Ahlgren et al., 2016).
There is a wide range of model transformation approaches (Kahani et al., 2018), such as QVT (OMG, 2016), ATL, ETL (Kolovos et al., 2008), and VIATRA (Varró et al., 2016), among others. However, most of these approaches adopt local and sequential execution as their strategy for transforming models, limiting the processing of models with large numbers of elements (VLMs) to the capacity of the execution environment.
Given the nature of models and metamodels, they can have elements that are densely interconnected. This complicates the processing of transformation rules, mainly when executing a pattern matching step (Jouault et al., 2008). Moreover, distributed Model Transformation (MT) requires strategies for partitioning and distributing the model elements on distinct nodes, while at the same time ensuring the consistency among their elements (Benelallam et al., 2018).
A large part of model-based tools uses a graph-oriented data model. These tools have been designed to help users in specifying and executing model-graph manipulation operations efficiently in a variety of domains (Xin et al., 2013; Szárnyas et al., 2014; Junghanns et al., 2016; Shkapsky et al., 2016; Li et al., 2017; Benelallam et al., 2018; Tomaszek et al., 2018; Azzi et al., 2018). The extraction of large semi-structured data under a graph perspective can be useful in choosing a strategy to design distributed/parallel MTs, graph data processing, and model partitioning, in analyzing model interconnectivity, and in offering graph-structured information to different contexts. Nevertheless, graph processing in the MT context requires more research, involving implicit parallelism, parallel/distributed environments, lazy evaluation, and other mechanisms for model processing. For these reasons, in this paper, we present an evaluation study on the application of a Data-centric (Dc) approach for model extraction and MT in the Spark framework, based on GraphFrames (Apache, 2019). We consider that mechanisms such as implicit parallelism, lazy evaluation, model partitioning, and a scalable framework can compose an approach for MT.
First, we inject the input model into a DataFrame, which is a format supported by Apache Spark. Second, we implement in Scala a model extraction with graph generation from the DataFrame and its schema. It translates the input models from a DataFrame into a GraphFrame, through a Model2GraphFrame transformation, which allows us to process them. We evaluate how to query the graph elements using its native query language, and also how to specify different kinds of operations over GraphFrames. We focus on partitioning the graphs in GraphFrames into subgraphs, as well as on clustering their vertices, both of which are used in Model Transformations. We provide the following contributions:
• We produce an automated mechanism for data translations between the MDE technical space and the DataFrame and GraphFrame formats, which allows the execution of different operations (including MT) over the models in the GraphFrame;
• We use two semi-automated strategies for partitioning models in a GraphFrame, one based on the Motif algorithm and another on clustering using the Infomap framework. The model partitioning result is used in MT, aiming to improve execution performance;
• To validate our approach, we implemented a proof of concept, in which we compared the partitioning strategies in MT executions on top of Spark, a scalable framework.
This paper is organized into 6 sections. In Section 2, we introduce the context for this work with the DataFrame and GraphFrames APIs and their data formats, as well as Model Transformations using graphs. In Section 3, we present the specifications of our approach, including extraction, translation, partitioning, and model transformations. In Section 4, we describe the proof of concept for validating our approach. In Section 5, we present related work. In Section 6, we conclude with future work.

Context
In this section, we present DataFrame, a distributed collection of data organized into named columns, and GraphFrames, a graph processing library based on DataFrames, both for Apache Spark. We also introduce: MT, the key artifact for existing MDE approaches; the Model Extractor (ME), for extracting model elements from different technical spaces; and Graph, a data structure composed of vertices and edges, which may be used in MT.

Data Structures on GraphFrame
Apache Spark (Apache, 2019) is a general-purpose data processing engine providing a set of APIs that allow the implementation of several types of computations, such as interactive queries, data and stream processing, and graph processing. The Spark DataFrame API uses distributed Datasets. A Dataset is a strongly-typed data structure organized in collections. The Dataset API allows the definition of a distributed collection of structured data from JVM objects, and its manipulation using functional transformations such as map, flatMap, filter, and others.
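As a minimal Scala sketch of this API (the case class, session name, and sample data are ours, for illustration only), a strongly-typed Dataset can be built from JVM objects and manipulated with functional transformations:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
import spark.implicits._

// A strongly-typed, distributed Dataset built from JVM objects
val peopleDS = Seq(Person("March", 45), Person("Sailor", 38)).toDS()

// filter and map are lazily evaluated; show() triggers the computation
peopleDS.filter(_.age >= 40).map(_.name).show()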
Structurally, a DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Each row in a DataFrame is a single record, which is represented by Spark as an object of type Row. Each DataFrame contains data grouped into named columns, and keeps track of its own schema. Summarizing, a DataFrame is similar to a table in a relational database, with one difference: its columns allow the manipulation of multi-valued attributes. A DataFrame can be transformed into new DataFrames using various relational operators available in its API and expressions based on SQL-like functions. DataFrames and Datasets are (distributed) table-like collections with well-defined rows and columns. Each column must have the same number of rows, and each column has type information that must be consistent for every row in the collection. DataFrames and Datasets represent immutable and lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output (Chambers and Zaharia, 2018). Figure 1 shows an example of a DataFrame. It is formed by three rows and five columns, and contains data extracted from the Families model (rows with the March, Sailor, and Camargo families). A Row can have Columns with different types, such as String, Integer, Date, Boolean, and Array.

Another possible way to describe elements and their relationships is the creation of graphs, due to their high expressiveness. Spark provides the GraphX and GraphFrames APIs to process data in graph formats. In the GraphFrames API, the GraphFrame class is used for instantiating graphs. In Figure 2, we present a simple illustrative example of a Family model, using the March family elements in a GraphFrame instance. It can be created from vertex (nameVerticesDF) and edge (roleEdgesDF) DataFrames. A vertex DataFrame has to contain a special column named "id", which specifies a unique ID for each vertex in the graph. An edge DataFrame should contain two special columns: "src" (the source vertex ID of the edge) and "dst" (the destination vertex ID of the edge) (Chambers and Zaharia, 2018; Apache, 2019).
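The following sketch shows how such a GraphFrame could be instantiated (the sample rows are illustrative and do not reproduce Figure 2 exactly; spark.implicits._ is assumed to be in scope):

import org.graphframes.GraphFrame

// Vertex DataFrame: the "id" column is mandatory
val nameVerticesDF = Seq(
  ("1", "March"), ("2", "Jim"), ("3", "Brenda")
).toDF("id", "value")

// Edge DataFrame: "src" and "dst" must reference vertex ids
val roleEdgesDF = Seq(
  ("1", "2", "father"), ("1", "3", "mother")
).toDF("src", "dst", "key")

val marchGF = GraphFrame(nameVerticesDF, roleEdgesDF)
marchGF.vertices.show()
marchGF.edges.show()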
The GraphFrame model supports user-defined attributes within each vertex and edge. The GraphFrames API provides the same operations as the DataFrame API, such as map, select, filter, join, and others. It has a set of built-in graph algorithms, such as breadth-first search (BFS), label propagation, PageRank, and others. The GraphFrames and DataFrame APIs are based on the concept of a Resilient Distributed Dataset (RDD), which is an immutable collection of records partitioned across a number of computers or nodes. To provide fault tolerance, the lineage of each RDD is logged so that it can be reconstructed (Tang et al., 2019). When a data partition of an RDD is lost due to a node failure, the RDD can recompute that partition with the full information on how it was generated from other RDD partitions (Apache, 2019).
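For illustration, the built-in algorithms mentioned above can be invoked directly on a GraphFrame instance (gf stands for any GraphFrame, such as the one above; the vertex predicates are hypothetical):

// Breadth-first search between two vertex predicates
val paths = gf.bfs.fromExpr("value = 'March'").toExpr("value = 'Brenda'").run()

// PageRank with a fixed number of iterations
val ranks = gf.pageRank.resetProbability(0.15).maxIter(10).run()

// Label propagation for community detection
val communities = gf.labelPropagation.maxIter(5).run()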

Model Transformations using Graphs
A directed graph may be represented by G(V, E), where V represents the set of vertices and E the set of edges of the graph G. A subgraph S of a graph G is a graph whose vertex set V(S) is a subset of the vertex set V(G), that is, V(S) ⊆ V(G), and whose edge set E(S) is a subset of the edge set E(G), that is, E(S) ⊆ E(G). Extensions of this basic representation have been proposed to define the graph as a data model (Junghanns et al., 2016; Barquero et al., 2018).
Graphs are useful for modeling computational problems. They can be adopted to model relationships among objects. A graph can be used as a representation format for models, enabling the abstraction of model features. In model transformation processes, graphs can be used to translate instances from one modeling language to another, since the structures of a language can be represented by a type of graph. The Triple Graph Grammars approach (Schürr, 1995) is a way to specify translators of data structures and to check their consistency. In addition to model transformation, there is a variety of graph-based algorithms used for processing graph models in different domains, such as complex network structures, network analysis, business intelligence, and others (Junghanns et al., 2016; Löwe, 2018).
Graph transformation has been widely used for expressing model transformations, since graphs are well suited to describe the underlying structures of models and metamodels.
Operations are implemented as model transformations solving different tasks. A transformation is a set of rules that describe how a model in the source language can be transformed into a model in the target language (Rutle et al., 2012). Extraction is a process that transcribes model/metamodel elements from the native source platform to the target platform (Jia and Jones, 2015). This is necessary mainly when the input model comes from a different technical space (e.g., the input model is in the XMI format and the transformation platform works on data collections).

A Data-centric Approach for MT
In a previous work (Camargo and Fabro, 2019), we presented a study on applying a data-centric language called Bloom (Alvaro et al., 2011) to develop model transformations. There are three major differences between the previous study and this paper: a) we defined a specific format based on RDF (W3C, 2014) and used it in the injection/extraction operations for translating the source model into a new modeling domain; b) we implemented the RDF models in data collections and specified transformation rules, mapping the source and target metamodel and model elements as Ruby classes; and c) we chose the Bloom language, a Data-centric declarative language, since it is based on collections (unordered sets of facts) and provides implicit parallelism. On the other hand, the use of the Data-centric approach and parallel model transformations are the main similarities between these works.
The approach proposed in this work is built on top of the Apache Spark framework, using Dc aspects such as high-level programming and parallel/distributed environments, and considering that a model element is a set of data. It allows extracting models and metamodels in different formats and transforming them to a directed graph, which is assigned to a GraphFrame. The transformation output is the input to process graph operations and model transformations. In order to improve the performance of transformation executions, we use two different strategies for partitioning models from the GraphFrame. Figure 3 shows an overview of our approach. There are arrows between Spark components, mainly in the Spark Context, which is responsible for managing all executions on the Spark framework. The arrows among the approach modules (2, 3, and 4) represent the interaction between them and their outputs, forming a workflow. All the steps of the workflow are automated, except for the Operation on Graph used for the partitioning of models (semi-automated). We describe these steps in the next sections.
The Driver Node controls the execution of a Spark Application and maintains all states of a Spark cluster. It exchanges messages with the Cluster Manager in order to obtain physical resources and launch executors (Worker Nodes). The Executor is the process that performs the tasks assigned by the Spark driver. The Executors have the responsibility to receive the tasks (Task) assigned by the driver, run them, and report back their state and results. The interaction between the Worker Nodes and the Spark Context is supported by a Cluster Manager, which is responsible for maintaining a cluster of machines (nodes) that will run one or more Spark Applications (Chambers and Zaharia, 2018; Apache, 2019). In our approach, modules 2 and 3 are executed on the Driver Node. The Injector module is responsible for extracting the input model to the DataFrame, which is transformed into a GraphFrame by the Model Translator module. The Model Transformation (module 4) is executed on the Worker Node(s).
For Module 3, we create a metamodel to instantiate the result of the translation of the input model to a graph model. It is necessary for assuring the conformance and consistency of the translation output. This metamodel is based on the GraphDB metamodel proposed by Daniel et al. (2016), which focuses on NoSQL graph databases. Figure 4 depicts our Graph Metamodel, where GraphElement represents all elements of a graph. Its subtypes, GraphVertex and GraphEdge, express the vertices and edges, respectively. A GraphVertex has an Id attribute, meaning that each vertex is unique. Also, there are type and value attributes to represent the model element properties, forming a triple. In contrast, the GraphEdge type has a string attribute key for identifying the elements from src and dst links, which are represented by the src (source) and dst (destination) associations between the GraphVertex and GraphEdge classes.
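As a hedged illustration (not released code), the Graph Metamodel could be rendered as the following Scala types; the names follow Figure 4, except that the type attribute is renamed elemType because type is a reserved word in Scala:

sealed trait GraphElement

// Each vertex is unique (id); elemType and value represent the model
// element property, forming a triple with the id
case class GraphVertex(id: Long, elemType: String, value: String) extends GraphElement

// An edge is identified by a string key and links a source (src)
// vertex to a destination (dst) vertex
case class GraphEdge(key: String, src: Long, dst: Long) extends GraphElement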
We use the Graph Metamodel as a schema to instantiate model elements and their relationships by means of the GraphVertex and GraphEdge classes. Their properties, such as attributes and associations, indicate the model element structures. The GraphVertex and GraphEdge classes are instantiated into a GraphFrame, and from the GraphFrame it is possible to specify operations and queries to manipulate them. An instance of the Graph Metamodel is shown in Subfigures 5a and 5b. A set of operations can be executed over the graph elements of a GraphFrame, such as the Motif algorithm to split the graph into subgraphs, the graph degree to compute the valency of a vertex (the number of edges incident to it), queries, and others. In addition to such executions, the Model2GraphFrame (M2G) output is also used as input by the Model Transformation module, which transforms the input model elements in a directed-graph format to the target model.
In the next sections, we present the steps to extract and transform models, as well as two alternatives for model partitioning.

Extracting model elements into a DataFrame
The initial step consists of the extraction of the input model elements into a DataFrame model. It starts when the user submits (1 in Figure 3) the input model, with its name and location (path), to the Driver Node. The Injector Module (2 in Figure 3) assigns the input model, in formats such as XMI or JSON, to a variable (modelPath) which is read for loading the input model. Next, the input model is parsed (DataFrame API) and its elements are assigned to a DataFrame (modelDF). Every DataFrame has a schema describing its data structures, in this case those of the input model. Thus, a schema is formed according to the input data structures. Listing 2 shows an example of a DataFrame schema. We choose to use the DataFrame in this step because of its schema: it preserves the input data structures, easing the translation of the input models to the GraphFrame through the reuse of these structures. Furthermore, it is not necessary to implement a parser for loading the input model into the DataFrame.
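A minimal sketch of this step is shown below; it assumes that the spark-xml data source used later for writing (Section 3.4) is also used for reading, and that "Families" is the row tag of the input file (both are assumptions, as are the paths):

val modelPath = "/models/families.xmi"  // hypothetical location

val modelDF = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Families")         // assumed root element of the excerpt
  .load(modelPath)

// The inferred schema mirrors the XMI element structure (cf. Listing 2)
modelDF.printSchema()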
We use the Family model excerpt from the ATL Zoo (Eclipse, 2019) to illustrate the extraction into the DataFrame, and we then describe how model elements are represented in a DataFrame. In Spark, the operations on data are made by means of Transformations and Actions. A Transformation is formed by a set of instructions to manipulate data, and an Action is specified to trigger the computation on data. When it is called, it notifies the Spark Engine to compute a result from a series of transformations (Chambers and Zaharia, 2018). Listing 3 illustrates the extraction result from the Family model (excerpt) in XMI format (Listing 1) to a DataFrame, whose structure is supported by the DataFrame Schema shown in Listing 2.

Listing 2: Family Schema Excerpt
root
 |-- Family: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- daughters: struct (nullable = true)
 |    |    |    |-- firstName: string (nullable = true)
 ...

According to Figure 3, the model elements are structured in a set of columns with an unspecified number of rows, since a schema defines the column names and types of a DataFrame. The rows are unspecified because the reading of the model elements is a lazily-evaluated operation (lazy evaluation (Michael, 2016)). The schema does not require the rows to be identified explicitly.

Although a DataFrame Schema can be specified manually, we opt for the schema generated by the parser during the read operation of the input model (Extraction step). In this schema, the structures of the input model elements are preserved in a tree format by the translation process. Listing 2 shows a translation example, where the DataFrame Schema is rooted in the element root and its rows are represented by the Family element. The multi-valued elements are represented by arrays (array) and their elements are represented by structs that may have one or more elements, including null values (containsNull). These elements represent the leaves (e.g., lastName) and have a type (e.g., string). All elements represented in the DataFrame Schema have the nullable attribute assigned as true by default. This fits how the Spark framework handles DataFrame columns, with the nullable attribute true or false. The columns are logical constructions that represent a value computed by means of programmatic expressions. Thus, to have a real value for a column, we need to have a row, and consequently a DataFrame. Therefore, since the input model was translated to a DataFrame, it can be transformed according to the transformation domains of the user.
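For comparison, a manual specification of the Listing 2 excerpt would look like the following sketch (nullable is kept true throughout, as discussed above):

import org.apache.spark.sql.types._

val familySchema = StructType(Seq(
  StructField("Family", ArrayType(
    StructType(Seq(
      StructField("lastName", StringType, nullable = true),
      StructField("daughters", StructType(Seq(
        StructField("firstName", StringType, nullable = true)
      )), nullable = true)
    )),
    containsNull = true
  ), nullable = true)
))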

Translating the input DataFrame to GraphFrame
In a second step, the Model Translator Module (3 in Figure 3) translates the input model, which was assigned to a DataFrame, into a GraphFrame. We use the model elements in the DataFrame as input to the Model Translator. In addition to the elements, the schema associated with the DataFrame, which describes the model element structures, is essential for our Model Translator, since we use it to reproduce these element structures in a graph, assigning them to the GraphFrame. We create an algorithm for translating a DataFrame to a GraphFrame, conforming to the metamodel of Figure 4. Algorithm 1 is responsible for such translation. As input, the algorithm receives a DataFrame, which is processed by combining its content and schema. Algorithm 1 contains the functions model2GraphFrame and model2GraphSchema; their source code is available online². Since the modelDF DataFrame contains all model elements, it is passed as a parameter to the model2GraphFrame function, which is responsible for starting the transformation process. For simplicity's sake, we omit the specification of the model2GraphSchema function in Algorithm 1; it is called (line 2) with the model elements and the DataFrame schema as parameters. It performs the processing of model elements and their structures together with the respective schema columns of the DataFrame in a recursive way, assigning its result to the verticesDF and edgesDF DataFrames (lines 3 and 4). We use the wildcard parameters (_1 and _2) and the toDF function with the respective DataFrame column names ("id", "value"). Thus, the first elements are separated into the verticesDF DataFrame and the remaining elements into the edgesDF DataFrame. Both DataFrames shape the vertices and edges and are assigned to the GraphFrame (GF, line 7) by the model2GraphFrame function.
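The structure of Algorithm 1 can be outlined in Scala as follows; this is a sketch consistent with the description above, not the released implementation, and model2GraphSchema is only stubbed:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType
import org.graphframes.GraphFrame

// Recursive walk over the rows and schema columns (omitted, as in
// Algorithm 1); returns vertex tuples (_1) and edge tuples (_2)
def model2GraphSchema(df: DataFrame, schema: StructType)
    : (Seq[(Long, String)], Seq[(Long, Long, String)]) = ???

def model2GraphFrame(modelDF: DataFrame)(implicit spark: SparkSession): GraphFrame = {
  import spark.implicits._
  val result = model2GraphSchema(modelDF, modelDF.schema)
  val verticesDF = result._1.toDF("id", "value")       // lines 3 and 4
  val edgesDF    = result._2.toDF("src", "dst", "key")
  GraphFrame(verticesDF, edgesDF)                      // GF, line 7
}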
² https://github.com/lzcamargo/extracSpk

We use some Family model elements (Listing 2) as input to present a translation example (an execution of Algorithm 1). To access the vertex and edge contents, we execute the commands GF.vertices.show() and GF.edges.show(). Their outputs are represented in Figures 5a and 5b. The values of the Family model elements from the DataFrame are instantiated into graph vertices. The model element names are assigned to graph edges as keys. The links (src and dst) among vertices and edges establish the relationships of the model elements. In Figure 5 we use circles and rectangles to illustrate the model element structures and their relationships. For example, the vertices and edges marked in red demonstrate the structure of the lastName Sailor element, and the blue ones denote the firstName David element. The relationship between these two elements is marked on an edge (Figure 5b), where the src column value is noted in red and the value of the dst column is noted in blue. The join of these structures (the match between the id, src, and dst columns) allows identifying that David is a son (sons) and belongs to the Sailor Family. Thus, the model elements are structured into GraphFrames so that they can be queried and processed for different purposes.
In the first two steps, we obtain the extraction of the input model to the modelDF DataFrame and its translation to the GraphFrame GF. We consider the result of these operations as the transformation of the input model to a graph, in particular the Model2GraphFrame transformation. In the next steps, we use the GraphFrame contents for Model Partitioning and Model Transformations.

Model Partitioning
In this step, we present the two strategies that we use for partitioning models from the GraphFrame: one based on the model key-element names with the Motif algorithm, and another using clustering. First we present their implementation; in the next section, we present a proof of concept on using these strategies. We choose the first strategy because it allows us to use the transformation rule names with an algorithm implemented in the GraphFrames API itself, in this case the Motif algorithm. Regarding clustering, we choose it to link the model elements in clusters by means of the related vertices (src to dst) in the edges contained in the GraphFrame. We use the clusters as parameters for the Spark framework partitions in the processing of the Model Transformations. In a graph, a motif can be defined as a pattern of interconnections of edges that occurs in a graph (Milo et al., 2002). We are interested in finding patterns in a graph for a given purpose, forming subgraphs as partitions of this graph. Thus, we consider the following definition: a graph G′ is a subgraph of a graph G if V(G′) ⊆ V(G) and E(G′) ⊆ E(G). In our context, consider a scenario with the following transformation rule names: Package2Schema, Class2Table, Att2Col, and Family2Person. From each rule name, we use its prefix (i.e., Package, Class, Att, and Family) as a parameter (key-element) in graph partitioning using the Motif algorithm, particularly for the key column of the edges. This means that these prefixes are points of interest in the graph.
In a GraphFrame, Motif Finding is implemented in a Domain-Specific Language (DSL) for expressing structural queries. For example, graph.find("(a)-[e]->(b); (b)-[e2]->(a)") will search for pairs of vertices a, b connected by edges in both directions. It will return a DataFrame of all the structures in the graph, with columns for each of the named elements (vertices or edges) in the motif. The returned columns will be the vertices a, b and the edges e, e2 (Apache, 2019).
We specify the subgraph extraction by combining Motif Finding and a filter. This means that, depending on the input model, it is necessary to adjust the Motif algorithm parameters and/or the filter, which characterizes the model partitioning as semi-automated. Listing 4 shows the implementation in Spark Scala for the Class elements through the tag "classes", which were mapped to the key column of the edgesDF DataFrame. Graph motifs are patterns that occur repeatedly in the graphs and represent the relationships among the vertices. In a GraphFrame, Motif Finding uses a declarative DSL for expressing structural queries that find patterns among edges and vertices by means of the find() function. Therefore, we choose it for easing the subgraph extractions. We believe that its characteristics can generate consistent subgraphs from key model elements (the prefixes of rule names). Line 3 of Listing 4 is the specification of a query searching for pairs of vertices between (a,b), (b,c), and (c,d), which are respectively connected by edges e, ea, and eb. We also use a filter for delimiting the vertex pairs, starting from an edge whose key property is equal to the tag "classes". This means that the execution of this expression will return in motifsDF all the structures (vertices and edges) related to the filtered property (classes) in the graph, which are arranged in the a, e, b, ea, c, eb, and d columns. We select the edges contained in motifsDF and assign them to the subE immutable variable (line 5). We use it as the set of edges composing the subG subgraph, whose vertices are the same as in the GF graph. We apply the dropIsolatedVertices() function to exclude the isolated vertices (i.e., vertices with degree zero, if there are any), ensuring the links among vertices and edges in the subG subgraph. In this case, Listing 4 allows us to get all the Class elements and their associated elements from the GraphFrame that represents a Class model, producing a subgraph.
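Since Listing 4 itself is only described in the text, the following is a hedged reconstruction of it from that description (the exact selection of the matched edges may differ in the original code):

val motifsDF = GF.find("(a)-[e]->(b); (b)-[ea]->(c); (c)-[eb]->(d)")
  .filter($"e.key" === "classes")          // line 3: motif query plus filter

// line 5: collect the matched edges into the immutable variable subE
val subE = motifsDF.select($"e.src", $"e.dst", $"e.key")
  .union(motifsDF.select($"ea.src", $"ea.dst", $"ea.key"))
  .union(motifsDF.select($"eb.src", $"eb.dst", $"eb.key"))

// subgraph over the same vertices as GF, dropping degree-zero vertices
val subG = GraphFrame(GF.vertices, subE).dropIsolatedVertices()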
Now we present the utilization of clustering as a strategy, implementing it using Infomap from the MapEquation framework (Bohlin et al., 2014). There are other alternatives for such an implementation, such as the k-means algorithm (MacQueen, 1967), one of the most commonly used clustering algorithms. We could also adapt Apache Spark MLlib, the machine learning (ML) library, which provides various ML-based operations, including clustering. Infomap is a fast stochastic and recursive search algorithm with a heuristic method, Louvain (Blondel et al., 2008), based on the optimization of modularity. When it is executed with the vertices and edges of a graph, the neighbor nodes are joined into modules, which are subsequently joined into supermodules and so on, clustering tightly interconnected nodes into modules. Infomap has been used in community partition problems (Aslak et al., 2018; Edler et al., 2017), for detecting communities in large networks, and to help in the analysis of complex systems. In addition, Infomap operates on graph structures in the Pajek format (file.net), which can be easily extracted from the GraphFrame as input to Infomap. For example, Listing 5 shows an excerpt of the File.net extracted from the Class0 model, and Listing 6 shows the .clu output file, the clustering result, where the nodes are gathered in the respective clusters (node and cluster columns). The flow column contains cluster indices for each node, but they are discarded when the .clu file is injected into a DataFrame by a loading operation and used in clustering model elements. However, the clustering from the GraphFrame using the Infomap framework is a semi-automated operation, since we do not implement an integration between our approach and the Infomap framework (Operations on Graph, Figure 3).

Listing 5: Class0 File.net
*Vertices 50031
0 0
1 1
2 2
...
*Arcs 50030
1 2
4 5
4 6
...
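A minimal sketch of how such a .net file could be extracted from the GraphFrame is shown below; it assumes integer vertex ids used directly as Pajek labels, and it collects the graph to the driver, which is acceptable only as an illustration:

import java.io.PrintWriter

val vertexIds = GF.vertices.select($"id").collect().map(_.get(0))
val arcs = GF.edges.select($"src", $"dst").collect()

val out = new PrintWriter("class-0.net")   // hypothetical output path
out.println(s"*Vertices ${vertexIds.length}")
vertexIds.foreach(id => out.println(s"$id $id"))
out.println(s"*Arcs ${arcs.length}")
arcs.foreach(r => out.println(s"${r.get(0)} ${r.get(1)}"))
out.close()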
Later, we present the use of Infomap and the model parti tioning in Section 4.

MT using GraphFrame
In the last step, we specify a set of operations and transformation rules to transform the source model in the GraphFrame into a target model. They are executed as parallel tasks on the Worker Nodes of the Spark framework, through the Model Transformation module (4 in Figure 3). The source code of the operations and transformation rules is available online⁴. Listing 7 shows the Family2Person rule written in Scala as a singleton object (object Family2Person). We separate the male and female elements in the maleEdgesDF and femaleEdgesDF DataFrames. They contain the target values (dstm, dstf, dst) that link each last name with its first names. We use the select, join, and filter functions to select the last and first names from maleEdgesDF. For each join operation, we use the filter function (lines 4, 6, 12, and 14) to ensure the accurate selection of model elements, since they are formed by relationships among edges and vertices ("dstm" === "id"). In lines 7 and 15, we use the select and concat functions to assign the last name (lastName) and the respective first names (value) as the full name (fullName column) to the maleFullNamesDF DataFrame.

Listing 7: The Family2Person rule
1  object Family2Person {
2    val maleFullNamesDF = maleEdgesDF
3      .select($"dstm", $"dst").join(GF.vertices)
4      .filter($"dstm" === $"id")
5      .select($"value".alias("lastName"), $"dst")
6      .join(GF.vertices).filter($"dst" === $"id")
7      .select(concat($"lastName", lit(" "), $"value")
8        as "fullName")
9
10   val femaleFullNamesDF = femaleEdgesDF
11     .select($"dstf", $"dst").join(GF.vertices)
12     .filter($"dstf" === $"id")
13     .select($"value".alias("lastName"), $"dst")
14     .join(GF.vertices).filter($"dst" === $"id")
15     .select(concat($"lastName", lit(" "), $"value")
16       as "fullName")
17 }

For the femaleFullNamesDF DataFrame (lines 10 to 16), we use the same idea applied to the maleFullNamesDF DataFrame. These DataFrames are merged (union function) into the personDF DataFrame, each one with a new column Gender (withColumn("Gender")) to ensure the gender distinction among persons. Next, we specify an operation using the coalesce(1) method to instantiate the transformation output in a single partition. This means that the output tasks will be reduced to a single partition (a distinct output) as the final result of the transformation. The example in Listing 8 is obtained with the write function and the tags (root and row) of the databricks spark-xml library, indicating that the format was assigned as xml (we sketch these merge and write steps below). We separate these commands (write operations on the target model) from the loading rules for better code legibility. Since the target model is stored in a repository, it is possible to load the output in xml/xmi format and instantiate it back into a GraphFrame. Listing 8 shows a portion of the persons.xml file content. It represents the Family2Person transformation result, using the Family model presented in Listing 1 as the source model. In this section we described our approach. In the next section, we perform the proof of concept in order to validate its feasibility.

⁴ https://github.com/lzcamargo/transformSpk
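The merge and write steps referenced above can be sketched as follows; the literal gender values and the output path are our assumptions:

import org.apache.spark.sql.functions.lit

val personDF = maleFullNamesDF.withColumn("Gender", lit("Male"))
  .union(femaleFullNamesDF.withColumn("Gender", lit("Female")))

// coalesce(1) reduces the output to a single partition before writing
personDF.coalesce(1).write
  .format("com.databricks.spark.xml")
  .option("rootTag", "root").option("rowTag", "row")
  .save("/output/persons.xml")   // hypothetical repository path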

Implementation
We implemented a Proof of Concept (PoC) (Kendig, 2016) using GraphFrames to demonstrate the feasibility of our approach and to show its usefulness under the following aspects: the processing of Model2GraphFrame outputs, the partitioning of graphs contained in the GraphFrame, the connectivity among model elements in a set of GraphFrames, and the execution of model transformations using the GraphFrames.
We run the PoC on a single machine with the following software stack: Ubuntu 18.04, Spark 2.4, and Scala 2.3. It is hosted by an Intel Core i5-4210U CPU (1600 MHz) with 8096 MB of RAM; the processor has two cores. As input, we use both the Class and Family models in XMI format. There are four models with the following specifications:
• Class0, a class model with no attributes or methods, only Package and Class elements. This kind of model is used in Domain Modeling, useful to understand the ideas and concepts of the domain (Larman, 2004);
• Class3, a class model with Package and Class elements, where each Class contains from 1 to 3 methods and attributes;
• Class6, as the previous item, but each Class contains from 1 to 6 methods and attributes;
• Family, a model with 0 to 3 sons and daughters. Its elements are self-contained in LastName elements and their attributes.
We obtained the Class models, each one with 10000 classes, from a benchmark created for the Class2Relational transformation case studies in parallel transformations using Lintra (Burgueno et al., 2015). These models have references among their elements established by attributes. For instance, the Class0 model has 10 Package elements and each Package has 1000 Class elements. The Family model has 10000 LastName elements, which we created for this proof of concept. We consider these elements as self-contained (Class0 and Family). However, there are models (Class3 and Class6) that, besides self-contained elements, also contain interconnected elements, where Class elements are referenced by one or more Class elements contained in other Packages. Attributes such as super and type establish such references. The models used in the PoC have different densities (Class0, Family, and Class6) and interconnectivity (Class3 and Class6) among their elements. This means that we validate our approach in relation to these model aspects.
To measure the execution times in seconds, we use the System.currentTimeMillis() function from the Scala language, on a dedicated machine with no UI interactions. Once the input model elements are extracted to a GraphFrame, they must be available: each model element in the GraphFrame vertices has to be linked to its properties through GraphFrame edges.
We have defined three research questions to validate the PoC implementation and its main aspects.

Q1: How can we check whether the Model2GraphFrame output is available for processing?
To address this question, we use the directed-graph property (DGP) to check the totals of edges and vertices in a directed graph G, where |V(G)| − 1 = |E(G)|. When this property holds for a directed graph, it is considered a simple directed graph (Hochbaum, 2008). A directed graph is no longer simple if there are multiple edges or loops; hence the V(G) total becomes less than the E(G) total (|V(G)| < |E(G)|). In addition, we execute a set of queries on the GraphFrame to validate the contents of vertices and edges, whose input models contain 100 classes and 100 families. This means that we take a set of model elements contained in the GraphFrame and compare it with its input model elements.
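In GraphFrame terms, the DGP check amounts to comparing two counts, as in this sketch:

val vTotal = gf.vertices.count()
val eTotal = gf.edges.count()

// |V(G)| - 1 == |E(G)| holds for a simple directed graph
if (vTotal - 1 == eTotal)
  println(s"simple directed graph (V=$vTotal, E=$eTotal)")
else
  println(s"not a simple directed graph (V=$vTotal, E=$eTotal)")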
Although the M2G outputs are directed graphs in a GraphFrame, we need to know whether it is feasible to use them in model transformations. To address this issue, we define question Q2.

Q2: Is it possible to perform MT using GraphFrame?
We address this question in order to use GraphFrames in Model Transformations. Our goal is to verify how the source models in GraphFrames can be transformed into target models. We specify operations and rules using methods and functions in Scala for manipulating vertices and edges in the GraphFrame (e.g., Listing 7). They are similar to transformation specifications in ATL, the ATLAS Transformation Language (Jouault et al., 2008), where Helpers and Transformation Rules are the constructs used to specify the transformation functionality.
Finally, the last question is about the performance of MT executions using clusters.

Q3: Does executing model transformations using model partitioning improve performance?

We address this question in order to verify whether executions of model transformations using model partitioning improve performance, since we adopted two partitioning strategies for this approach: partitioning the input model in the GraphFrame into subgraphs, and generating clusters from the GraphFrame vertices. In the following sections, we present the proof of concept, the results, and the answers to the above questions, as well as further discussions.

Processing Model2GraphFrame Outputs
To check the GraphFrame outputs with respect to the input models, we obtain the totals of vertices and edges and we use the DGP to check their amounts. Columns V(G) and E(G) of Table 2 show the totals of vertices and edges from the input models (Model column). The number of vertices V(G) minus 1 equals the number of edges E(G) for the Class0 and Family models, demonstrating that they are simple directed graphs. However, the totals of vertices and edges from the Class3 and Class6 models indicate that they are not simple directed graphs (V(G) < E(G)). In addition, we execute queries such as the one shown below, and their results are compared to the input model elements to validate the M2G consistency. It returns the values of class properties such as name, isAbstract, and visibility from the GraphFrame vertices. It does not return Attributes and Methods, because the key-element (key) is assigned the "classes" value.

gf.edges.where($"key" === "classes")
  .select($"dst".as("dstv")).join(gf.edges)
  .filter($"dstv" === $"src").select($"dst")
  .join(gf.vertices).filter($"dst" === $"id").show()

Listings 9 and 10 show excerpts of the Class0 model elements and the query output. They represent an example of our validation. In this case, the relation among classes and their properties is established by the GraphFrame edges (gf.edges src and dst), whereas the value of each property is assigned to the GraphFrame vertices (gf.vertices).
Listing 9: Class0 Model (excerpt)
<classes name="Class14" isAbstract="true"
  visibility="public">
<classes name="Class15" isAbstract="true" ...

Table 2 (the first four columns) and the query outputs (example in Listing 10) show that the M2G results from the input models seem correct. This comparison complements the quantitative checking through the DGP. For example, when using the DGP for the Class models, we identify that the Class0 model is a simple directed graph, since it only has Package and Class elements. Furthermore, there is a single relationship between Package and Class elements that links them; in this case, 1000 Classes in each Package, since the Class0 model has 10 Package elements. On the other hand, the Class3 and Class6 models have Package, Class, Attribute, and Method elements, and the relationships among them include Datatype elements. This means that these models, when transformed to graphs, have a total of edges larger than the total of vertices. Therefore, we answer question Q1 by validating the totals of vertices and edges for each input model, as well as the respective contents. We continue the proof of concept by executing the Motif algorithm for all input models (Class and Family) using the strategy shown in Listing 4. Next, we measure the connectivity of the models assigned to the GraphFrames.

Measuring GraphFrame Connectivity
The idea of measuring the GraphFrame connectivity is a strategy to reveal how complex the input models that we use are, with respect to the graph elements (vertices and edges). It can help in choosing the strategy to be adopted for the partitioning and for operations over models with the GraphFrame. Furthermore, a set of functions can be executed from the GraphFrame, for example the outDegrees and inDegrees functions.
We execute the outDegrees and inDegrees functions for all vertices of the GraphFrame models (e.g., outDeg = gf.outDegrees). These functions determine the number of outward-directed (outDegrees) and inward-directed (inDegrees) graph edges for each GraphFrame vertex. Once the degree is calculated for all vertices of the graph, the vertices are grouped by degree and counted. Table 1 shows the execution results. The degrees for each GraphFrame Model are listed in descending order (only the first four or six are shown) in the OutDegrees and InDegrees columns, and the total of vertices for each calculated degree is in the Total Vertices column. It is worth mentioning that no sink vertex was found (vertex with out-degree equal to 0) in the GraphFrame models, but one source vertex was found (vertex with in-degree equal to 0), in this case the vertex with id equal to 0, which represents the root vertex.
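A sketch of this computation: outDegrees returns an (id, outDegree) DataFrame, which we group by degree to obtain the (degree, total vertices) pairs reported in Table 1 (inDegrees is handled symmetrically):

val outDeg = gf.outDegrees
outDeg.groupBy($"outDegree").count()
  .orderBy($"outDegree".desc).show()

val inDeg = gf.inDegrees
inDeg.groupBy($"inDegree").count()
  .orderBy($"inDegree".desc).show()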
In the InDegrees column (Table 1), we can see that the Class0 and Family models (GraphFrame Models column) have degrees equal to 1 for most of their vertices (except for vertex 0). The degrees calculated for the outgoing edges of the vertices (OutDegrees column) for these models show their characteristics. For instance, a degree equal to 1000 and a total of vertices equal to 10 mean that there are 10 vertices with 1000 outgoing edges. In particular, they represent the directed links between the Package and Class elements of the Class0, Class3, and Class6 models, since there are 10 Packages and 10000 Classes in each Class model. Therefore, the results obtained from the Class0 and Family models indicate that they are simple directed graphs and weakly connected. On the other hand, the results in the OutDegrees and InDegrees columns show that the Class3 and Class6 models are directed graphs and strongly connected, since there are directed links among Package, Class, and Attribute elements. In addition, the ingoing edges from GraphFrame vertex elements, such as Datatype and Type, are also represented in the InDegrees column for these models. The results from the inDegrees and outDegrees functions are useful to evaluate how complex the models are.
In the next section, we present the results of our two partitioning strategies over the GraphFrames. Furthermore, in the following sections, we discuss the influence of these strategies on model transformation executions, and we describe the distribution of model elements over the executor processes (Worker nodes) on the Spark framework in local mode executions.

Partitioning M2G Outputs
Our approach provides model partitioning with two different strategies (Section 3.3): the Motif Find algorithm available in the GraphFrames API, and the Infomap framework. From the GraphFrame, the Motif algorithm finds patterns among edges and vertices for producing subgraphs. Using the vertices and edges from the GraphFrame, the Infomap framework generates clusters of vertices from format files (.net). We use them for partitioning the operations on the GraphFrame in the process of model transformations.
In Section 3.3, we showed a specification of how the Motif algorithm may be used in graph partitioning, where a subgraph (G′) is formed from the graph edges that contain a key element extracted from the rule name (object Package2Schema). For example, for the key element "Package" (K-element column in Table 2) there are 30 vertices (V(G′) column) and 20 edges (E(G′) column) in the subgraphs from the input (Class0, Class3, and Class6) model transformation (M2G) outputs. In addition to Package names, their nearest neighbor elements are partitioned together in the respective edges (in this case, Class elements). Listings 11 and 12 show edge and vertex samples of a subgraph (SG), where vertex 17 contains the value Pck0. Listing 11 has this property, with the nearest neighbors linked in two edges (16,17; 16,18). In this manner, for each Package element, the partition contains three vertices and two edges. All Class models have 10 Package elements; this justifies the amount of 30 vertices and 20 edges in the subgraphs for the key-element Package. The Class0 model has 10000 Class elements, composed of name, isAbstract, and visibility attributes. For each attribute an edge is created, whose source (src) vertex is a StructType. Thus, each Class element has four vertices and three edges. Keeping the quantifiable amount of model elements in mind, we execute the Motif algorithm for the key elements Package, Classes, and Attributes (methods were not partitioned).

As seen in the previous section, the Class3 and Class6 models are directed graphs and strongly connected. Their Package, Class, and Attribute elements have links to each other through type and super attributes. There are links among Attribute and DataType elements such as float, string, integer, and others. For Class elements there are edges containing attributes, such as (19,22,name), (19,23,super), (19,24,visibility), ..., (99,101,name).
These attributes also connect structures between the Class elements when the types of the attributes are classes. For instance, to establish links among Class elements using attributes, edges such as (23,101,lnk) are formed to link the super attribute of a Class element to the name attribute of another Class element. These edges are joined with their vertices, forming the Classes subgraph. The Attributes subgraph is formed in the same way, and its links to other elements (e.g., DataType) are established via type attributes. Consequently, the number of edges (E(G′)) is larger than the number of vertices (V(G′)) for the subgraphs (partitions) of Classes and Attributes.
The Class element has 6 attributes, which are assigned to vertices. For the structure type (StructType) of a Class element, more than one vertex is assigned. These are linked to the source (src) vertex of the class properties (18,19,"0").
Thus, for each Class element in a Class subgraph there are 7 vertices, explaining the 70000 vertices. Regarding the Female and Male subgraphs, they were partitioned from the Family model transformation (M2G) output. The lastName element and its structure are duplicated into these subgraphs. Thus, the totals of vertices (V(G′)) and edges (E(G′)) are larger than the totals of vertices (V(G)) and edges (E(G)) of the Family M2G output. Table 2 (the last four columns) shows the partitioning results for each model translated into a GraphFrame. Each model partition (V(G′) and E(G′) subgraph) is related to a key element (K-element). We extract the prefixes from the transformation rule names, such as Package, Classes, and Attribute, and we execute the Motif algorithm for each of them, except the key element Attribute for the model Class0. This partitioning strategy is dependent on the key attribute of the GraphFrame edges. It requires that all links among model elements are instantiated into edges; otherwise the partitioning will not be correct.

Now considering the graph clustering strategy, we extracted from the GraphFrame the vertices and edges in the Pajek format (.net), as required by the Infomap framework. Once the .net file is available for processing, we execute it with a call to Infomap. The runtime arguments are -z, -N 10, --directed, --clu, meaning respectively: start with vertex equal to zero; iterate ten times over the vertices and edges; the input is a directed graph; and the output will be a file containing clusters of vertices. The Infomap framework execution output is a text file (.clu) containing a list of pairs formed by vertices and clusters ((11,1), (12,1), (13,1), (21,1), (22,1)). This list is formed according to the incidence of each vertex in the edges. All links shaped from a vertex are grouped in a single cluster, and thus a vertex belongs only to a single cluster. We execute the Infomap framework for the four input models using the .net files. These files contain vertices and edges extracted from the GraphFrames. Table 3 displays the execution results of the Infomap framework using the .clu files. In the Infomap output (Clusters(G) column), we can note that the number of clusters generated from the Class0 (10012) and Family (3354) models is much larger than for the Class3 (7) and Class6 (13) models. This is related to the density of each model. The higher the number of interconnections among model elements (edges), the lower the number of clusters generated, because each vertex in an interconnection is associated to the same cluster. As we saw in the previous section, the Class0 and Family GraphFrame models are simple directed graphs and weakly connected. This corroborates the cluster granularity in both cases. In contrast, the Class3 and Class6 models, when translated to GraphFrames, are directed graphs as well, but the graphs are not simple. They are in fact strongly connected. As a result, the number of clusters generated for these models is much smaller when compared to the Class0 and Family models.
Once Infomap has generated the .clu cluster file for each model, we load them into their respective DataFrames. As an example, for loading the Class0 clusters:

val clusterPath = "/Infomap/output/class-0.clu"
val clusterInput = spark.read.option("header", "true")
  .option("delimiter", " ").csv(clusterPath)
  .select($"node", $"cluster")

The clusters are used in the execution of model transformations to direct the model element partitions in the parallel Spark execution (Section 4.4). We can also use them in distributed Spark executions (clusters of machines, in future executions). For example, the number of nodes provided by the user or obtained from machine clusters can be used as the denominator in a division of the number of clusters among the partitions. On the other hand, the partitioning and distribution of data done by the Spark partitioner can be improved in terms of data dependency among the environment nodes, since the clustering and the subgraphs tend to keep linked model elements close together. The partitioning and distribution of data in Spark is done at runtime using the RDD (Resilient Distributed Datasets) API.
In Table 3, we report again the numbers of vertices and edges of the models (Table 2) to show the relation between the number of clusters for each model and their total vertices and edges.

Table 3: Infomap clustering results
Model   | V(G)   | E(G)   | Clusters(G)
Class0  | 40031  | 40030  | 10012
Class3  | 318789 | 350006 | 7
Class6  | 740365 | 880776 | 13
Family  | 160481 | 160480 | 3354

The results from the Motif algorithm executions showed that Motif Find can be used as a partitioning strategy for graphs represented in GraphFrames. Regarding Infomap, it uses the GraphFrame output as input for processing the clustering, and the result is injected back into Spark through a DataFrame. This process requires an integration between the Infomap and Spark frameworks. There is a drawback in the way that we adopt both partitioning strategies, since we do not consider data balancing. Even though we know that it is difficult to treat densely connected models, we believe that we may explore this challenge in future work. In the next section we execute the model transformations using GraphFrames.

Executing Model Transformations using GraphFrame
We execute the Class to Relational (C2R) and the Family to Person (F2P) transformations using Class0, Class3, Class6, and Family as source GraphFrame models. Once they are transformed to GraphFrames, their elements are used as input in filtering operations and transformation rules (as shown in Listing 7), which we submit to the Spark framework for execution. For each GraphFrame containing the source models, we execute the transformations considering:
• No Partition (NP): executions without any partitioning strategy, where we execute the model transformations for the whole model in the GraphFrame;
• Motif: running the model transformations using the subgraphs from the Motif partitioning strategy (Table 2);
• Infomap: executing the model transformations using the clusters of vertices from the Infomap framework.
An operation that we used in the F2P transformation has the following specification:

val lastNameFamilyDF = edgesDF
  .filter("key = 'lastName'")
  .select($"src", $"dst", $"key")

It selects the edges (src and dst) for each family last name (lastName) and assigns them to the lastNameFamilyDF DataFrame. Table 4 shows the model transformation results, which include the times in seconds. These times were computed as the average of 7 transformation executions for every input GraphFrame model, having discarded the first 3 executions, which were considered as warm-up phases for the virtual machine of the Spark framework. Before performing model transformations using the clustering strategy, we validated the cluster partitions using the same procedure that we applied to the Motif partitioning (Section 4.3). We identified that the clustering of the GraphFrame is consistent only when Infomap is executed without the level parameter (visualizing the cluster output in modules), particularly for the Class3 and Class6 models. Otherwise, vertices of the same model elements were placed in different clusters due to the cluster modularity aspect of the Infomap algorithm. We consider consistency the main requirement for the partitioning results, but at the same time we recognize that the partitioning obtained through Infomap is conservative when evaluating the total of vertex clusters in Table 3. The vertices of the Class3 and Class6 GraphFrame models were clustered into 7 and 13 clusters, respectively. This means that densely interconnected models require a more efficient partitioning strategy with regard to the balancing and consistency of model elements.
We run the model transformations on the Spark framework in local mode using four nodes, which execute the parallel tasks in memory. A task is the smallest unit of schedulable work in a Spark program. A stage is a set of tasks that can be run together (Apache, 2019). In this manner, the requests for data manipulation operations (transformations) and actions (requests for output, for instance) are coordinated by the Spark framework. Thus, we establish traceability between source and target elements, assigning the links among model elements from the GraphFrame to a DataFrame (to write the target model output). For instance, the following expression selects all links (vertices of last and first names) of sons elements.
val sonsNamesLinksDF = maleFamilyDF
  .where($"key" === "sons")
  .select($"dstm", $"dst".alias("dstt"))
  .join(edgesDF)
  .filter($"dstt" === $"src")
  .select($"dstm", $"dst")

These links are inserted into the sonsNamesLinksDF DataFrame, and we can use it to obtain the complete names of sons in a parallel Spark framework execution. For each source model (Table 4) in the GraphFrame, we submit to the Spark context all the operation expressions and transformation rules needed by the transformation of the model in question.

According to Table 4, the model partitioning strategy with the Motif algorithm penalized the performance of the subgraph transformation executions (Motif), due to memory consumption and a possible negative effect on the mechanisms of the Spark framework that minimize the data exchange among the executors (data shuffles). The executions with no partitioning strategy (NP) showed better performance than the subgraph executions (NP does not interfere with the Spark partitioner). In the cluster partition executions (Infomap), we interfered with the Spark partitioner, submitting the partitions derived from the cluster model elements to the Spark framework to improve performance. This strategy showed the best performance when compared to the other executions (NP and Motif). The results of using these strategies are presented in this section.

Regarding the execution times of the model transformations, we observe that the models we consider weakly connected have the lowest execution times (Class0 and Family columns) when compared to the execution times of the Class3 and Class6 models. Transformation times increase as the amount of interconnected elements grows. In the transformations using Infomap, we use the clusters to repartition the default Spark partitioning at runtime. For each expression, we add the clusterInput DataFrame (node and cluster columns) via a join and invoke the repartition function with the cluster column as parameter, in order to interfere with the Spark partitioner. This is necessary because the input model clusters were generated beforehand by the Infomap framework. The expression below shows an example of interference in the partitioning of Spark through the repartition() function.
val lastNameFamily = clusterInput
  .select($"node", $"cluster")
  .join(edgesDF)
  .where($"node" === $"dst" && $"key" === "lastName")
  .select($"src", $"dst", $"key", $"cluster")
  .repartition($"cluster")

We create the partitions of the model elements from the clusters. This means that for the strongly connected models the number of partitions at runtime is smaller. Consequently, the amount of shuffling (the operation Spark uses to redistribute data across partitions) also diminishes. However, the execution times for Motif are higher than in the other executions for all input models used in this PoC. This is due to the memory consumed to process the Motif partitioning and the model transformations, since all the steps of our approach were executed in memory. In addition, all the lazily evaluated operations are only triggered when an action is invoked by the program.
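To make the lazy evaluation explicit, the following minimal sketch (with a hypothetical output path) shows that the expression above is only a deferred plan until an action such as count() or a write is requested:

// No computation has happened yet; lastNameFamily is a deferred plan.
val total = lastNameFamily.count()  // action: triggers the filter, join, and repartition
lastNameFamily.write.mode("overwrite").json("/tmp/lastNameFamily")  // hypothetical output path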
A Spark application consists of a driver process (Driver node) and executor processes (Worker nodes). The driver runs, analyzes, and distributes work across the executors. A partition is a logical chunk of a large distributed data set (Apache, 2019). In our case, when the models are submitted for execution on the Spark framework, they are partitioned automatically when there is no interference via code (repartition() or coalesce()). For example, when a class element requires one or more model elements that are on other nodes, these elements are shuffled and distributed between nodes, and then processed. The results are made available to the Worker nodes, where they can be used in a subsequent operation.
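As an illustration of this interference point, a minimal sketch of inspecting and overriding the automatic partitioning (the partition counts are arbitrary):

val defaultPartitions = edgesDF.rdd.getNumPartitions  // partitioning chosen by Spark
val widened  = edgesDF.repartition(8)   // explicit interference: full shuffle into 8 partitions
val narrowed = widened.coalesce(1)      // collapse without a full shuffle, e.g., for a single output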
Concerning the influence of the partitioning strategies on the model transformations, we note that the generation of subgraphs with the Motif execution penalizes the performance of the model transformation executions. Since they run together in local mode, the subgraphs were in the same JVM (Java Virtual Machine) as the transformation code. We believe that having the model elements required by operations and transformation rules together in subgraphs (GraphFrames) may diminish the amount of shuffles during the model transformation executions. However, this strategy can be better explored and made more efficient in distributed executions that prioritize data locality (data and code stored together on the same Worker node). As for the clustering strategy, we see that the use of clusters as a parameter for repartitioning the source models (Spark manages data using partitions) for model transformation processing is favorable to the performance of model transformations, since the clusters help parallelize data processing by minimizing shuffles between executors (Worker nodes). This is a consequence of each cluster and its vertices being distributed as a single partition at runtime. However, this strategy is susceptible to data skew when the data is unbalanced. In a distributed execution, where each cluster of model elements has its own partition and location, the model transformation processing can be minimized, with less network traffic overhead for sending data between executors (Worker nodes). In both strategies there are open issues, such as data balancing (Le et al., 2014), data skew processing (Gao et al., 2017), and data locality (Jin et al., 2011), that need to be contemplated in our approach.
Although there are open questions, we answer question Q2 by admitting that it is feasible to use GraphFrame for model transformations. According to the execution times of the model transformations in Table 4, the model partitioning based on clustering performed better when compared with the other times. On the other hand, the graph partitioning using the Motif Find algorithm presented the worst performance in this PoC, which answers question Q3. In the next section, we present further discussion about this work.

Discussion
DataFrames and GraphFrames are flexible, structured, and collection-based. These aspects allow us to extract metamodels and models from different formats, such as XMI and JSON. Moreover, their characteristics may ease the data modeling for distinct transformation scenarios (local and distributed/parallel). Syntactically and semantically, the functional constructs of the DataFrame and GraphFrames APIs are relatively simple, though proper usage of some constructs of the Resilient Distributed Dataset (RDD) may require more skill in functional programming.
In addition to the APIs, a DataFrame has a schema that allows interacting with its data structures and eases data operations. The capability to process different data formats, such as JSON and XMI, can be considered a differential of our approach compared to those that accept only the XMI format. Another essential aspect is the transformation from a technical modeling space to the GraphFrame space, easing different operations over model graphs through the GraphFrames API. However, in our proofs of concept, we observe that the Model2GraphFrame transformations consume a considerable amount of memory, as the input model is loaded and held in memory during the recursive processing. Even when using a platform such as Spark, this memory usage needs to be addressed, in particular when using recursive processing, or by avoiding it altogether.
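As an aside, the DataFrame schema mentioned above can be inspected directly; the column names here are illustrative, as they depend on the input metamodel:

verticesDF.printSchema()  // e.g., id, type, name, ... as produced by the Injector
edgesDF.printSchema()     // e.g., src, dst, key, ...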
Regarding the GraphFrames API, it has a set of functions and built-in algorithms that can be used from different languages. Its GraphFrame data representation (vertex and edge DataFrames) allowed the manipulation of model elements while preserving their references and the connectivity of the models. The links between model elements are assigned to GraphFrame edges, and from them it is possible to identify and process the elements and their links, such as when using the Motif algorithm to generate subgraphs, or via functions that measure the connectivity of GraphFrames, among other operations over GraphFrames. On the other hand, the model elements that are in GraphFrame vertices and edges can be combined into a single DataFrame using functions such as join and union. When there is a need for a single output from parallel/distributed processing, a reduce-like step can be executed using functions such as repartition(1) or coalesce(1).
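A minimal sketch of this combination step (the projected columns are assumptions; union requires both sides to share a schema, so we project vertices and edges to a common shape first):

import org.apache.spark.sql.functions.lit  // $-interpolation assumes spark.implicits._

val vertexPart = classGF.vertices.select($"id".alias("element"), lit("vertex").alias("kind"))
val edgePart   = classGF.edges.select($"src".alias("element"), lit("edge").alias("kind"))
val combined   = vertexPart.union(edgePart)
combined.coalesce(1).write.mode("overwrite").csv("/tmp/model_elements")  // single-partition output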
The model partitioning strategies used in fully connected models need to be better investigated and integrated with the model partitioning in the Spark framework, mainly the clustering. Issues of balancing the partitioning outputs, data skew, and data locality need to be treated in the context of distributed/parallel model transformations. Our approach can contribute to scalable MDE, since the Spark framework provides mechanisms for such a context. Nevertheless, we have not yet explored distributed processing in our approach. Furthermore, more experiments involving a diversified set of transformation scenarios are necessary.
The results show that parallel model transformation is feasible in our approach, but it is necessary to explore other aspects, such as the learning cost of using it, the semantics for specifying the transformation rules, and how difficult it is to use our approach compared to ATL-based approaches.

Related Work
Data extraction and operations on directed graphs are used in most application domains, such as MT, Reverse Engineering, Software Evolution, and others. We report some works that highlight the Dc approach in MT, parallel/distributed MT, some extraction processes, as well as works that process graphs on the Spark framework.
MDE approaches have already been reinterpreted under different views; for instance, Batory and Azanza (2017) reinterpret them in the context of relational databases. To ease the understanding of MDE approaches, they employ a Dc approach and a declarative language for model transformations. They map metamodels to relational tables and use the Prolog language to write declarative constraints in m2m transformations. This approach applies a Dc approach to declarative styles. However, it was elaborated to help explain MDE concepts under relational perspectives, whereas we seek to offer an approach for model transformation in a Dc approach.
In Wischenbart et al. (2012), an approach is proposed to derive social network schemas from social network data. They derive schema information expressed in JSON data. For this, schema extraction strategies are provided for integration tools that build on different technical spaces. Moreover, they propose to apply techniques from MDE to transform schemas into instance data. The extraction strategy used in this work is similar to the one adopted in our ME, in which we use the DataFrame schema from the output of the Injector module to preserve the consistency of model elements during their translation into GraphFrame. However, they only use JSON as input to the extraction and apply it on three social networks. Our approach using DataFrame eases the extraction of data from more than one format, such as XMI and JSON, and translates it into a graph model for model transformations or other GraphFrame operations.
Triple Graph Grammars (TGG) are considered a standard framework for graph-based model transformation (Tomaszek et al., 2018). Their expressiveness and mathematical basis are relevant aspects in graph transformation, since a single set of triple rules is sufficient to generate the operational rules for the forward and backward model transformations (Hermann et al., 2014). Tools such as eMoflon, MoTE, TGG Interpreter, and EMorF are TGG-based (Kahani et al., 2018; Edgar et al., 2014). However, optimization is still a challenge for applications based on TGG, which may face a trade-off between expressiveness and scalability (Anjorin et al., 2016). Our approach uses directed graphs as a means of representing the model elements and easing the operations over them. In addition, the aspects of the platform that we use can be a differential for the development of parallel/distributed model transformations.
Bollati et al. (2013) introduced MeTAGeM, a methodological and technical framework for the development of model transformations, which bundles a set of Domain Specific Languages (DSL) for modeling model transformations with a set of metamodel transformations, in order to bridge these languages in (semi-)automated model transformation development. Among the aspects of MeTAGeM, we relate two to our work: the concern with interoperability of different languages in the transformation process; and the Platform Dependent Transformation (PDT) model, which allows the use of Injectors/Extractors for model-to-text transformations. However, the injector/extractor are based on Textual Concrete Syntax (TCS), which provides a DSL for the specification of the correspondence between the metamodel of a given DSL and its textual representation. This means that, for each DSL, it is necessary to recover its corresponding TCS, in case one already exists for the DSL; otherwise, a TCS needs to be developed. Vara and Marcos (2012) developed a textual editor and model extractor for Oracle OR models using the TCS language to support textual editing of models and the extraction of models from legacy code, for validating a systematic study and a technical solution for MDE development of information systems. Our approach extracts model elements in XMI/JSON format and transforms them into the directed-graph format using the GraphFrames API from the Spark framework. The extraction output is used in model partitioning, as well as in model transformations. In addition, it may be used for general purposes in graph-oriented applications.
Distributed/parallel graph processing has been applied as a way to optimize graph operations. Imre and Mezei (2012) introduced an algorithm to perform graph transformations in a parallel way using threads on a GPU (Graphics Processing Unit). The transformation is executed by this algorithm in two phases: matching and modification. Although the algorithm can take advantage of multi-core processors, the modification phase executes the modifications sequentially on a single thread. In our approach, a graph can be processed in a distributed and/or parallel way, since we utilize the implicitly parallel operations of a general-purpose cluster computing framework. When some operation and/or transformation is required by a program on the Spark framework, the model is automatically split into partitions (a step the developer can override) and processed on the nodes by a set of in-memory tasks. Furthermore, the graph operations can be specified in an SQL-like declarative style and/or a functional style.
Benelallam et al. (2015) present ATL-MapReduce, a distributed MT engine. They embed ATL in the MapReduce framework to obtain an implicit distribution of ATL rules, achieving distributed execution. From a static analysis of transformation rules, they propose a model partitioning for balancing and preserving the dependencies among model elements by means of a greedy distribution algorithm. The strategy is relevant for applying an algorithm that balances the partitioning. The ATL-MapReduce solution is dependent on the MapReduce framework and an implicit distribution of models, as well as on transformation executions in two phases (map and reduce). Our partitioning strategies are based on directed graphs and seek to split the model into subgraphs and clusters of vertices. Our approach uses the GraphFrame as a bridge between input models and model transformations; it depends on the Spark framework.
NeoEMF is a scalable model persistence framework based on a modular architecture enabling model storage in multiple data stores. This framework, proposed by Daniel et al. (2017), provides model-to-database mappings for persistence solutions and enables models to be stored in graph, key-value, and column databases. It provides an API compatible with the Eclipse Modeling Framework (EMF) API, meaning that NeoEMF accepts only models from EMF. Furthermore, NeoEMF focuses on scalable model persistence, whereas our approach aims at scalable transformation of models, supporting input models in XMI or JSON formats.
Junghanns et al. (2016) propose the Extended Property Graph Model (EPGM), a graph data model that supports single graphs and graph collections with heterogeneous vertices and edges. They implemented a set of analytic operators using a DSL on top of Apache Flink for graph processing of single graph representations (e.g., in a collection), and to provide general-purpose operators (e.g., select, count, ...) on graphs. The graph representation in EPGM contains three object types (GraphHead, Vertex, and Edge), whereas ours has two object types: GraphNode and GraphEdge, which are subtypes of GraphElement (Figure 4) and are contained in the GraphFrames API. In EPGM, the input data format for processing is not reported. Our extractor processes different data formats (XMI and JSON). To process the graphs, we use the operators from the GraphFrames API itself (specific for graphs) in a function-like style, assembling a lazily evaluated pipeline of transformed data. The vertex and edge instances are the input for model transformations.
Szárnyas et al. (2014) handle MDE scalability issues, proposing an architecture (IncQueryD) for a distributed and incremental model query framework by adapting incremental graph pattern matching techniques to a distributed cloud-based infrastructure. This architecture evaluates graph patterns over EMF models using the Rete algorithm in a distributed environment. It focuses on a distributed data store and a distributed query evaluation network for model transformations. The Rete algorithm uses tuples to represent the vertices, edges, and subgraphs of the graph. This graph representation is similar to our approach, which uses DataFrames for representing vertices and edges in a GraphFrame (graph instance). Furthermore, IncQueryD uses incremental queries based on joins to specify transformation rules, similar to the specifications (transformation rules) of our approach.
The works of Szárnyas et al. (2014), Junghanns et al. (2016), and Benelallam et al. (2018) address scalability with model partitioning and/or graph pattern techniques in MT. Our approach includes these aspects on Spark, a scalable framework. Furthermore, the lazy evaluation of monotonic operations, the implicit parallelism, transformation rules in declarative specifications (SQL-like functions), data collections, and the parallel/distributed environment establish the technical space of our approach.

Conclusion
We applied a Dc approach for model transformations through the GraphFrames API, including model extraction. We evaluated the API, together with an implemented extraction procedure, to assess whether they are a valid alternative for directed-graph operations, including model transformations. We developed a Model Extraction from technical modeling spaces to the Apache Spark framework in its DataFrame and GraphFrame formats. From the GraphFrame, we developed two partitioning strategies, one based on the Motif algorithm and another based on clustering using the Infomap framework. Both may be used for partitioning models, but their outputs are not balanced.
We also developed a set of operations and transformation rules in the Scala language and validated them with a proof of concept using the Spark framework on four nodes, in local mode. The results obtained indicate that the extraction of large semi-structured data under a directed-graph perspective can be useful in choosing a strategy to design model transformations on a scalable platform, such as the Spark framework. In addition, the model GraphFrame may be used for model partitioning, graph-data processing, and analyzing model interconnectivity, as well as for offering graph-structured information to different contexts. However, further studies are needed to apply more sophisticated strategies in model partitioning and to improve the integration with the Spark framework.
As future work, we plan to run transformation rules using GraphFrame in distributed environments such as cloud computing, aiming for a benchmark with Very Large Models on top of scalable frameworks, to evaluate the scalability and model partitioning strategies, whilst prioritizing load balancing, minimizing data skew, and improving the data locality of submodels. The benchmark can be based on works such as Varro et al. (2005) and Szárnyas et al. (2018). Furthermore, it is also worth assessing whether our approach is practical, or too difficult for a typical developer.