Applying Analogy to Schema Generation

To support the generation of database schemas of information systems, a five-step design process is proposed that explores the notions of generic and blended spaces and favours the reuse of predefined schemas. The use of generic and blended spaces is essential to achieve the passage from the source space into the target space in such a way that differences and conflicts can be detected and, whenever possible, conciliated. The convenience of working with multiple source schemas to cover distinct aspects of a target schema, as well as the possibility of creating schemas at the generic and blended spaces, are also considered.


INTRODUCTION
Designers of information systems soon learn that reusing their previous experience, and also that of other designers, is a rewarding strategy.
Motivated by this remark, we have been working [2,3] on methods and tools that start from some predefined database schema, regarded as a source schema, and abstract from it a pattern that captures its structure; the pattern is then repeatedly used to generate one or more target schemas. What makes this strategy viable is the intuitive perception of an analogy between source and target, expressed by saying that the latter is like the former.
Additionally, the source schema should be a typical example among those that are analogously structured, and the terminology of its underlying domain should be familiar even to the less experienced designers. If these requirements are satisfied, it will be possible to instantiate the positions occupied by variables in the pattern, by prompting the designer to indicate which names in the target schema being generated correspond to each name in the example source schema.
In the present paper, we expand our earlier method and introduce a five-step process that takes four spaces into consideration: the source, target, generic and blended spaces, as proposed in [9] for widely different areas. We adopt the familiar Entity-Relationship (ER) model [5] and use the weak entity concept to illustrate the process.
The diagram in figure 1 represents the four spaces and shows how they are articulated in view of the process, whereby, starting from the source, the target is gradually constructed.
Informally, the generic space originates from the source by importing, in a generalized format, the elements for which corresponding elements in the target will eventually be characterized. In practice, both the source and the target will contain other non-corresponding elements, since analogy is rarely bijective. Viewing the diagram as a lattice [17], the generic constitutes the meet of the source and the target spaces and denotes the elements that correspond to each other in these two spaces. By contrast, the blended space reflects the join of source and target and inherits all their elements, corresponding or not. Again informally, the blend is the space wherein one can detect whatever is incomparable or conflicting when putting together source and target, often calling for some creative form of adaptation to be remedied or conciliated [9,21]. Goguen [10] formalized blending in category theory.

Figure 1: The four-space approach (source, target, generic, blend)
The text is organized as follows. Section 2 details the five-step process we propose and is the thrust of the paper. Sections 3 and 4 briefly discuss, respectively, the advantages of bringing in a multiplicity of source schemas for designing distinct aspects of a target schema, and the possibility of also creating schemas directly from elements at the generic or blended spaces. Section 5 contains the conclusions.

EXAMPLE
We adopt a simple example to illustrate the proposed schema generation process. We start with a schema fragment, specifying employees and their dependents, which is probably the most frequently mentioned illustration of the weak entity concept in ER modelling. As a fragment, it only needs the elements relevant to characterize weak entities.
We express schemas with the help of clauses such as those below that introduce two entity classes, employee and dependent:

The identifying attribute of employee is empno, whereas dependent, being a weak entity, relies on the identifying relationship isdepof, combined with the discriminating attribute depno. The identifying relationship is 1 to n, being total with respect to dependent and partial with respect to employee; these properties are indicated by associating pairs of minimum and maximum values for the participation of instances of each entity in relationship instances: at least 0 and at most n dependents can be related with exactly one employee. The relationship isdepof has attribute family_tie, with values such as spouse or child. Note that the fragment does not include certain basic properties of employee, such as those referring to the employment aspect itself, since they are not essential to the characterization of weak entities.
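As an illustration only, the Emp_Dep fragment just described might be held in a plain data structure such as the one sketched here. The key names (`identifier`, `discriminator`, `participation`, and so on) and the two helper functions are assumptions made for this sketch; they are not the clause notation used in the paper.

```python
# Hypothetical encoding of the Emp_Dep fragment as plain Python data.
# Key names are assumptions of this sketch, not the paper's clause syntax.
EMP_DEP = {
    "entities": {
        "employee": {"identifier": "empno"},
        "dependent": {  # weak entity
            "discriminator": "depno",
            "identifying_relationship": "isdepof",
        },
    },
    "relationships": {
        "isdepof": {
            # (min, max) relationship instances per instance of each entity
            "participation": {
                "dependent": (1, 1),   # total: exactly one employee
                "employee": (0, "n"),  # partial: zero or more dependents
            },
            "attributes": ["family_tie"],
        },
    },
}


def is_total(schema, relationship, entity):
    """A relationship is total w.r.t. an entity if its minimum is at least 1."""
    minimum, _maximum = schema["relationships"][relationship]["participation"][entity]
    return minimum >= 1


def is_weak(schema, entity):
    """Weak entities are identified through a relationship plus a discriminator."""
    return "identifying_relationship" in schema["entities"][entity]
```

With this encoding, `is_total(EMP_DEP, "isdepof", "dependent")` holds while the same test for `employee` does not, mirroring the total/partial distinction stated in the text.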
This schema will be used as the source schema, from which target schemas based on the weak entity concept can be derived, through five consecutive steps, to be described in the sequel.
As will be noticed, the process takes into due consideration some domain-independent consistency rules inherent in the ER model, such as the following, among others:
1. all entity classes must have identifying properties;
2. relationships can only be defined between defined entity classes;
3. the deletion of an entity instance implies the deletion of all its properties;
4. if a relationship R is total with respect to one of its participating entity classes E, an instance of R cannot be deleted if it is the only one involving a given existing instance of E.
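Rules 1 and 2 are static and can be checked directly on a schema. The following is a minimal sketch under assumed conventions: a schema is a dictionary with hypothetical keys `entities` and `relationships`, a strong entity carries an `identifier`, and a weak entity carries a `discriminator` together with an `identifying_relationship`. Rules 3 and 4 concern deletions and are not covered here.

```python
def check_schema(schema):
    """Return a list of violations of ER rules 1 and 2 (empty if none)."""
    problems = []
    entities = schema["entities"]
    # Rule 1: every entity class needs identifying properties.
    for name, props in entities.items():
        strong = "identifier" in props
        weak = "discriminator" in props and "identifying_relationship" in props
        if not (strong or weak):
            problems.append(f"rule 1: entity {name} has no identifying properties")
    # Rule 2: relationships may only connect defined entity classes.
    for name, participants in schema["relationships"].items():
        for entity in participants:
            if entity not in entities:
                problems.append(f"rule 2: {name} refers to undefined entity {entity}")
    return problems


good = {
    "entities": {
        "employee": {"identifier": "empno"},
        "dependent": {"discriminator": "depno",
                      "identifying_relationship": "isdepof"},
    },
    "relationships": {"isdepof": ("dependent", "employee")},
}
bad = {
    "entities": {"employee": {}},  # no identifier, and dependent is undefined
    "relationships": {"isdepof": ("dependent", "employee")},
}
```

Calling `check_schema(good)` yields an empty list, whereas `check_schema(bad)` reports one violation of each rule.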

STEP 1 - GENERATING THE PATTERN
From the source schema Emp_Dep, the Weak Entity pattern is obtained (Figure 2) by consistently substituting variables for the names of entities, relationships and attributes.
Besides clauses built from those of the source schema, the pattern contains mappings, associating the variables introduced with the corresponding source schema names. Consistent substitution implies that, to give one example, variable A refers to entity employee wherever it occurs in the clauses of the pattern.
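Consistent substitution can be sketched in a few lines. The clause shapes used here (`entity`, `weak_entity`, `relationship`, `attribute`) are hypothetical stand-ins for the paper's clauses, and variables are simply assigned as upper-case letters in order of first occurrence, so the sketch is limited to 26 names and its letters need not coincide with those of Figure 2.

```python
from string import ascii_uppercase


def abstract_pattern(clauses):
    """Replace each name by a variable, using the same variable everywhere."""
    name_to_var = {}

    def var_for(name):
        # A name seen before keeps the variable it was first given.
        if name not in name_to_var:
            name_to_var[name] = ascii_uppercase[len(name_to_var)]
        return name_to_var[name]

    pattern = []
    for pred, *args in clauses:
        # Only names (strings) are abstracted; other constants are kept.
        new_args = [var_for(a) if isinstance(a, str) else a for a in args]
        pattern.append((pred, *new_args))
    mappings = {var: name for name, var in name_to_var.items()}
    return pattern, mappings


# Hypothetical clause forms for the Emp_Dep source schema.
SOURCE = [
    ("entity", "employee", "empno"),
    ("weak_entity", "dependent", "depno", "isdepof"),
    ("relationship", "isdepof", "dependent", "employee"),
    ("attribute", "isdepof", "family_tie"),
]

pattern, mappings = abstract_pattern(SOURCE)
```

The returned `mappings` dictionary plays the role of the pattern mappings: it records, for every variable, the source schema name it stands for.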

STEP 2 - GENERATING THE TARGET SCHEMA
Suppose the designer wants to specify a Bk_Ed schema, about book editions, and realizes that this too involves the weak entity concept: the editions of a book are comparable to the dependents of an employee, in that to identify an instance of edition, the indication of the book in question is needed, besides the edition number, edno, as discriminating attribute. The generation (Figure 3) is basically done by specializing the clauses of the pattern (belonging to the generic space), but the diagram also refers to the originating source space, to stress that the names in the pattern mappings were extracted from it.
Specializing the clauses of the pattern is done by replacing each pattern variable by an appropriate name belonging to the underlying domain of Bk_Ed. Relying on the assumption of a widespread intuitive understanding of the analogy between the two domains, the designer is prompted to supply the target schema names through queries of the form: -What corresponds to <name in the source schema>?
In our example, this would instantiate the pattern mappings as follows:

We note that the designer may, with limitations, deny one or more correspondences by replying nil. So it may happen, at this stage, that nothing corresponding to the attribute family_tie comes to mind:

family_tie → nil
This is indeed the only element in this case that can be absent. Having informed book as corresponding to entity employee, the designer should be aware that the indication of what corresponds to empno is mandatory, since no entity can lack an identifier (cf. rule 1, stated for the ER model at the end of section 2.1). Likewise, if nothing corresponds to dependent, the indication of isedof as corresponding to isdepof would be an error, because a binary relationship requires the presence of two participating entities (cf. rule 2). The absence of isedof, on the other hand, would defeat the purpose of the entire process, since the weak entity concept makes no sense without an identifying relationship.
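The instantiation and the restrictions on nil answers can be sketched as follows. The pattern clauses, the set `MANDATORY`, and the use of `None` for nil are assumptions of this sketch; the identifier name `isbn` is hypothetical as well, since the text does not name the identifying attribute of book.

```python
# Hypothetical pattern and mappings (variable -> source schema name).
PATTERN = [
    ("entity", "A", "B"),
    ("weak_entity", "C", "D", "E"),
    ("relationship", "E", "C", "A"),
    ("attribute", "E", "F"),
]
MAPPINGS = {"A": "employee", "B": "empno", "C": "dependent",
            "D": "depno", "E": "isdepof", "F": "family_tie"}

# Clauses that cannot be dropped without violating ER rules 1 and 2
# or losing the identifying relationship.
MANDATORY = {"entity", "weak_entity", "relationship"}


def instantiate(pattern, mappings, answers):
    """Specialize the pattern; answers maps source names to target names or None."""
    subst = {var: answers.get(src) for var, src in mappings.items()}
    target = []
    for pred, *args in pattern:
        names = [subst[a] for a in args]
        if None in names:
            if pred in MANDATORY:
                missing = [mappings[a] for a in args if subst[a] is None]
                raise ValueError(f"no counterpart for {missing} in {pred} clause")
            continue  # optional clause, e.g. a relationship attribute
        target.append((pred, *names))
    return target


answers = {"employee": "book", "empno": "isbn",  # 'isbn' is a made-up name
           "dependent": "edition", "depno": "edno",
           "isdepof": "isedof", "family_tie": None}
bk_ed = instantiate(PATTERN, MAPPINGS, answers)
```

Replying nil for `family_tie` merely drops the attribute clause, while a nil for `empno`, `dependent`, `depno` or `isdepof` is rejected with an error.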
After inspecting the resulting target schema, the designer's knowledge of the target domain must be used to check its clauses, with special attention to: a. additions to the target schema that have no correspondence in the source schema; b. modifications to be made in the generated clauses of the target schema.

STEP 3 - BLENDING THE SOURCE AND TARGET SCHEMAS
The blended space is pictured as a confluence of the source and the target spaces, taking into consideration the correspondences registered between them. In the database schema-generation process, elements are obtained by joining each entity and relationship of the source schema with its counterpart in the target schema. To begin with, all information about each entity and relationship, contained in the various clauses of the two schemas, is collected in separate frames, structured as lists of property:value pairs.
Each property of an entity E is represented either by an attribute name, or by a binary relationship name tagged with 1 or 2 to indicate, respectively, whether E is the first or the second participant in the relationship. Since in the present example no restrictions are being imposed on the values, all value positions are filled with an underscore, a usual convention for an anonymous variable.
The properties of a relationship R are similarly represented. They include the identifying attributes of the two participating entities, the minimum and maximum occurrences for the first and for the second participant, and other relationship attributes if any. We shall introduce here a join operation on frames, specifying that, when applied to entity or relationship frames F1 and F2, a frame J results, whose property:value pairs comprise:
a. pairs p1:v1 from F1, for each property p1 not corresponding to any property in F2;
b. pairs p2:v2 from F2, for each property p2 not corresponding to any property in F1;
c. pairs p1-p2:v1-2, for each two corresponding properties p1 and p2 in F1 and F2, respectively.
Value v1-2 in item c is obtained by, in turn, joining the two values v1 and v2, according to the following criterion: if the values are identical constants, or at least one of them is a variable, v1-2 is the result of their unification [13]; otherwise the result is a term formed by the two values prefixed by an asterisk to indicate that they are in conflict.
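The join operation can be sketched as follows, under simplifying assumptions: frames are dictionaries, values are atomic (so unification reduces to an equality test with `"_"` as the anonymous variable), and properties bearing the same name in both frames, such as min-1, are taken to correspond and keep a single name. The frame contents and the identifier name `isbn` are illustrative, not taken from the paper's listings.

```python
ANY = "_"  # anonymous variable


def join_values(v1, v2):
    """Unify two atomic values, or flag them as conflicting."""
    if v1 == ANY:
        return v2
    if v2 == ANY:
        return v1
    if v1 == v2:
        return v1
    return f"*({v1},{v2})"  # conflict marker


def join_frames(f1, f2, corr):
    """Join frames f1 and f2; corr maps properties of f1 to properties of f2."""
    joined = {}
    matched = set()
    for p1, v1 in f1.items():
        p2 = corr.get(p1, p1)  # same-named properties correspond by default
        if p2 in f2:
            matched.add(p2)
            name = p1 if p1 == p2 else f"{p1}-{p2}"
            joined[name] = join_values(v1, f2[p2])  # item c
        else:
            joined[p1] = v1  # item a
    for p2, v2 in f2.items():
        if p2 not in matched:
            joined[p2] = v2  # item b
    return joined


f_isdepof = {"empno": ANY, "depno": ANY, "min-1": 0, "max-1": "n",
             "min-2": 1, "max-2": 1, "family_tie": ANY}
f_isedof = {"isbn": ANY, "edno": ANY, "min-1": 1, "max-1": "n",
            "min-2": 1, "max-2": 1}
blend = join_frames(f_isdepof, f_isedof, {"empno": "isbn", "depno": "edno"})
```

In this run the only conflict is the pair `min-1: *(0,1)`, and `family_tie` survives in the blend as a source property with no counterpart.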

family_tie:_]
A disclaimer is in order here. We have considered only one simple type of conflict. If the designer is allowed to perform arbitrary modifications to the target schema initially obtained by instantiating the pattern variables (cf. step 2), other types of conflict may occur, calling for the specification of appropriate criteria to handle them. As noted in [9], blending is, in general, a particularly complex task, requiring a great deal of creativity on the part of the designer, who may have to devise ad hoc ways to achieve consistency. Moreover, conflicts detected through blending may affect the design of application-oriented operations on the generated schemas (a topic briefly addressed in section 2.7).

STEP 4 - REVISING THE TARGET (AND SOURCE) SCHEMAS
The resulting blended space can be reinjected into the derived target space, and even into the originating source space, if the designer admits the possibility of also reconsidering it (Figure 5).

Figure 5: Revising the target (and source) schemas
In our example, a convenient way to call the designer's attention to what was not used from the source schema is to display together, in frame format, the entire list of current properties of each entity and relationship in the target schema, expanded as the result of blending. Such frames are directly obtained from the blend frames by reducing the paired names assigned to corresponding properties to their original names in the target space, while, naturally, keeping the names of the source space properties until now disregarded:

Surely, the designer may or may not judge it appropriate to reconsider what was initially left out, in this case the relationship attribute family_tie. Would there be different "ties" between edition and book? Ironically, the remark that "so-and-so is a revised edition of his father" is not uncommon, a playful but expressive metaphoric connection between the domain of human beings, underlying employee, and the domain of books, which would bring to mind that an edition may be classified as revised, corrected, expanded, abridged, and also simply as regular, which are some of the possible values for a new ed_type attribute for the isedof relationship.
The reconsideration of a source schema, such as Emp_Dep, for expansion is more rarely desirable, especially if one wishes to keep it as a fragment containing only the features necessary to characterize weak entities. But in the event that the designer wants to examine the possibility, the blend frames can be alternatively renamed as follows:

What can be the "subject" of an employee? The subject of a book can be some fictional genre, but it can also be a professional field, such as engineering, or accounting, which may suggest a new attribute profession for the employee entity, with possible values including engineer and accountant, among others.
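Both renamings of the blend frames amount to projecting each paired name onto one of its two sides. In the sketch below, paired names are kept as tuples `(source_name, target_name)` so that hyphenated names remain unambiguous; this representation, the frame contents and the identifier name `isbn` are assumptions made for illustration.

```python
SOURCE_SIDE, TARGET_SIDE = 0, 1


def reduce_blend(blend, side):
    """Rename paired properties to their source or target names.

    Unpaired properties (plain strings) keep their names, which is how
    elements so far disregarded on the other side become visible.
    """
    return {(key[side] if isinstance(key, tuple) else key): value
            for key, value in blend.items()}


# Hypothetical blend frame for the isdepof/isedof relationship.
blend = {
    ("empno", "isbn"): "_",
    ("depno", "edno"): "_",
    "family_tie": "_",  # present only in the source
}

target_view = reduce_blend(blend, TARGET_SIDE)
source_view = reduce_blend(blend, SOURCE_SIDE)
```

The target view lists `isbn`, `edno` and the inherited `family_tie`, which is the kind of display that invites the designer to ask what the counterpart of a family tie might be for editions.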
A further reduction of Emp_Dep to suppress the family_tie attribute is more likely to happen. This would become advisable if the attribute is systematically disregarded, even at this revision step, in a long series of target schema generations. Reconsidering a source schema, and consequently the pattern abstracted from it (as covered in step 5), is a case of double-loop learning [1]: the continuing use of a model providing clues for its correction and refinement.

STEP 5 - REVISING THE PATTERN
Since the generic space is often intended as a help to generate a plurality of target spaces, conflicts located at the blended space, as well as changes made at the source space from suggestions motivated by observing the blend, may entail the reconsideration of the generic space (Figure 6).
In our example, the blend mirrors the fact that an identifying relationship must be total with respect to the weak entity, but no such requirement is imposed with respect to the entity on which it relies for identification. So the conflict registered in the property:value pair min-1:*(0,1) of the frame resulting from the join of F_isdepof with F_isedof should motivate the insertion of a hotspot [19] in the Weak Entity pattern, i.e., a place where the specification becomes flexible. The adopted notation, using a question mark as prefix, will signal that the designer should be queried about the min-1 property of the relationship denoted by variable E, and that the value supplied must be chosen as 0 or 1.
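A hotspot of this kind might be resolved at instantiation time as sketched here. The clause shape, the `ask` callback standing in for the interactive query, and the `allowed` table are all assumptions of this sketch; only the question-mark prefix and the admissible values 0 and 1 come from the text.

```python
def resolve_hotspots(clause, ask, allowed):
    """Replace each '?name' argument by a value obtained from the designer.

    ask(name) returns the designer's answer; allowed[name] is the set of
    values the hotspot admits.
    """
    resolved = []
    for arg in clause:
        if isinstance(arg, str) and arg.startswith("?"):
            name = arg[1:]
            value = ask(name)
            if value not in allowed[name]:
                raise ValueError(f"{name} must be one of {sorted(allowed[name])}")
            resolved.append(value)
        else:
            resolved.append(arg)
    return tuple(resolved)


# Hypothetical pattern clause with a hotspot on the min-1 property of E.
clause = ("cardinality", "E", "?min-1", "n")
allowed = {"min-1": {0, 1}}

# A canned answer replaces the interactive prompt in this example.
for_bk_ed = resolve_hotspots(clause, lambda name: 1, allowed)
```

Answering 1 yields the total participation required for Bk_Ed, answering 0 reproduces the Emp_Dep case, and any other answer is rejected.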
Moreover, if at step 4 a new attribute such as profession is added to the source schema, or if the family_tie relationship attribute is removed from it, the pattern must be modified accordingly, so that it will continue to reflect the Emp_Dep schema.

TOWARDS THE DESIGN OF OPERATIONS
In [6] we added, both to schemas and patterns, clauses defining operations in terms of their pre- and post-conditions [8].
Without going into details, we now give one example of the repercussion of conflicts detected at the blending stage on the design of operations. Suppose that an operation named end_coverage has been defined over the source schema, which removes a child C of an employee E from the list of dependents of E, if the birth_year of C (an additional attribute of dependent) precedes a currently determined limit. Note that indicating the deletion of the literal dependent([E,C]) should cause the deletion of all properties of the entity instance C, in view of ER rule 3. On the other hand, note that the repeated execution of end_coverage is allowed, legitimately, to leave an employee with no dependents.

Also suppose that, during step 2 of the interactive process, the designer reacted favourably when prompted to introduce an operation corresponding to end_coverage, with the purpose of analogously discarding editions whose year of publication, ed_year (again a new attribute, corresponding to birth_year), came before a currently designated year. In the context of library management, this is a well documented practice, known as weeding library collections [20].
A conservative librarian would very likely demand that systematic discarding be restricted to regular editions, a requirement that can be easily expressed if attribute ed_type has been supplied as a counterpart to family_tie, as considered earlier.
However, straightforward renaming and the replacement of child by regular are not sufficient here to avoid a conflict of the generated weed operation with specific characteristics of the target schema registered when blending, namely, the totality property of isedof with respect to book, combined here with ER rule 4. One solution to the conflict is illustrated in the version of weed shown below, which can be repeatedly applied to discard any number of non-special editions, provided that the book itself remains, by keeping its newest edition, to adopt a usual criterion. Further refined versions may specify different values of ed_ylimit for different subjects, in view of constantly updated studies to determine the period of obsolescence for publications belonging to each so-called Dewey class [14].
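The effect of such a guarded weed operation can be illustrated procedurally. The sketch below is not the pre- and post-condition clause form of the paper; it assumes that editions are records with hypothetical fields `book`, `edno`, `ed_year` and `ed_type`, and it never discards the newest edition of a book, so that the totality of isedof with respect to book is preserved.

```python
def weed(editions, ed_ylimit):
    """Discard regular editions published before ed_ylimit.

    The newest edition of every book is always kept, so no book is left
    without editions (cf. ER rule 4 and the totality of isedof).
    """
    newest = {}
    for e in editions:
        current = newest.get(e["book"])
        if current is None or e["ed_year"] > current["ed_year"]:
            newest[e["book"]] = e

    kept = []
    for e in editions:
        discard = (e["ed_type"] == "regular"
                   and e["ed_year"] < ed_ylimit
                   and newest[e["book"]] is not e)
        if not discard:
            kept.append(e)
    return kept


editions = [
    {"book": "b1", "edno": 1, "ed_year": 1990, "ed_type": "regular"},
    {"book": "b1", "edno": 2, "ed_year": 2000, "ed_type": "regular"},
    {"book": "b2", "edno": 1, "ed_year": 1980, "ed_type": "revised"},
    {"book": "b2", "edno": 2, "ed_year": 1985, "ed_type": "regular"},
]
remaining = weed(editions, ed_ylimit=2010)
```

Here only the first edition of b1 is discarded: the second edition of each book is its newest, and the first edition of b2 is not a regular one.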

COVERING DIFFERENT ASPECTS THROUGH MULTIPLE SOURCE SCHEMAS
Patterns to model the same concept can be obtained from different source schemas. We chose the Emp_Dep example to construct the Weak Entity pattern, but other examples could be selected, from which a family of versions of the pattern would be obtained and made available to designers. Originating from source schemas featuring different sets of names, the mapping section of each version would differ from that of the others.
More importantly, not all clauses might be identical, which reflects permissible structural variations, according to which the versions could be classified. A designer would then have a chance to choose the version appearing more congenial to the case on hand. For instance, a schema Prod_Comp, treating the components of products as another example of Weak Entity, would come equipped with operations such as repair and replace as alternative ways to handle a component found to be defective. Thanks to the availability of such operations, Prod_Comp would seem a better source than Emp_Dep for generating a schema Bk_Vol dealing with volumes of books, inevitably susceptible to damage in the everyday functioning of a library environment.
Repeating the pattern generation process with a second version is another advantage of keeping several examples around, since this provides a means to check the result. Assume, for instance, that a version of Weak Entity is available wherein the identifying relationship is total with respect to both participating entities. If the designer of Bk_Ed had not noted at step 2 (see section 2.3) the need to correct the specification of isedof, blending it with the schema generated from this second version of the pattern would reveal the conflict.
But the application of more than one source must also be considered along a separate line of reasoning. Early studies on analogy and metaphor [15] already argued in favour of the use of multiple sources to provide a fuller characterization of a target possessing many properties, which might however be grouped into a manageable number of meaningful clusters. Morgan [18] used a set of eight metaphors to explore the concept of organization from the viewpoints of different competing theories.
We worked with Emp_Dep as source schema to characterize a structural feature of the Bk_Ed schema, namely the reliance on an identifying relationship to designate instances of weak entities. Many other sources can be brought in to suggest other types of properties and operations; integrity constraints, expressed e.g. in first-order logic notation, could also be added. Here we previously treated books as library items, but clearly they can also be seen as products, merchandise, objects of intellectual property, etc.
On the other hand, the name of the source schema used to derive a certain set of properties of a concept serves to designate a distinct aspect of the concept. Following the orientation prescribed in [11], when performing a problem-solving algorithm of exponential or high polynomial complexity, one can establish that only the properties of the involved entities that have been derived from the one (or the few) designated source(s) will be considered, thereby reducing the computational effort.

CATEGORIZATIONS FROM THE GENERIC AND THE BLENDED SPACES
Whereas the patterns at the generic space are preserved to help in the future creation of any number of target schemas, the frames composed at the blended space are only used in connection with a specific source-target pair, and can in principle be discarded after the generation process terminates.
Yet both the generic and the blended spaces, whose role is no more than auxiliary in the derivation of targets from sources, can give rise to new full-fledged conceptual spaces, through a process sometimes called categorization [9]. This is more easily accomplished when generic and blend represent the confluence of spaces associated with the same underlying domain.
Entities employee and student provide an example of this situation, since both have human beings as underlying domain. As a convenience, their corresponding properties can be identically named, so that they can more appropriately be called common properties, to be factored out to characterize a person entity, which is, in a sense, a materialization of the generic space. Both the common and the exclusive properties of employee and student are, in turn, inherited by the trainee entity, which materializes the blended space. In [3] we represented these four entity classes as nodes of the lattice induced by is-a links, and showed that, their properties being so specified, the meet and the join of the frames of employee and student yield, respectively, the frames of person and trainee.
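When corresponding properties are identically named and values are left unrestricted, the meet and the join of two frames reduce to the intersection and the union of their property sets. The sketch below illustrates this under that assumption; the property names (`name`, `birth_date`, `empno`, `salary`, `regno`, `course`) are hypothetical.

```python
def meet(f1, f2):
    """Common properties of two frames (materializes the generic space)."""
    return {p: f1[p] for p in f1 if p in f2}


def join(f1, f2):
    """All properties of both frames (materializes the blended space)."""
    return {**f1, **f2}


# Hypothetical frames; '_' stands for an unrestricted value.
employee = {"name": "_", "birth_date": "_", "empno": "_", "salary": "_"}
student = {"name": "_", "birth_date": "_", "regno": "_", "course": "_"}

person = meet(employee, student)
trainee = join(employee, student)
```

In this example `person` holds only `name` and `birth_date`, while `trainee` inherits all six properties, matching the is-a lattice in which person lies above, and trainee below, both employee and student.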
When different underlying domains are involved, categorization can still be envisaged. The resulting blend is then populated with hybrid entities, which may either appear realistic or fantastic, depending on the context. Conflating persons, objects or events is a powerful literary practice, and, surprisingly, sometimes offers intuitive clues to solve problems, as in the Buddhist monk riddle expounded in [11]. A blend conflating persons and books, for instance, might make sense in a cartoon universe, as a Digital Storytelling application aiming to teach children how to use the facilities of a library. Apart from Information Systems, on which the present paper concentrates, and Digital Storytelling, other Computer Science areas such as Software Engineering have drawn significantly from the notions of analogy [4] and blending [12].

CONCLUDING REMARKS
We were able to run experiments employing the current version of the five-step process, with the help of an interactive logic programming tool. Also, although simple, the weak entity example helped us gain a better understanding of design by analogy and blending.
Much work remains to be done, especially to extend the process as described in section 2, in order to cope with an ampler variety of conflicts, and to develop semi-automatic algorithms or heuristics to recommend adequate strategies for handling the different situations that may arise in practice.
The topics broadly sketched in sections 3 and 4 should also be included as objectives for future research, aiming at their integration in a more comprehensive treatment of the schema generation problem.