XML Schemas and UML Class Diagrams

Draft May 18, 2000

XML and UML are both important technologies, and integrating them well requires some analysis. We discuss our point of view on the integration of XML schemas and UML class diagrams. Our view is that either

XML schemas should be the starting point of a design, and UML class diagrams should be derived from the XML schemas using the systematic transformation outlined below, or
UML class diagrams should be written in a certain style so that it is straightforward to generate the XML schema. The style is needed because UML class diagrams are less expressive than XML schemas; the style supplies the expressiveness that the diagram notation lacks.

It is unfortunate that the XML schema notations cannot express what is common in UML class diagrams: names of parts (instance variables or data members). It is equally unfortunate that the UML class diagram notation cannot express what is common in XML schemas: the ordering of the parts. This mismatch could easily have been avoided, so that one common notation would express both XML schemas and UML class diagrams.
What such a common notation could look like is explored in this paper.

This note is organized as follows. First we compare XML schemas and UML class diagrams and then we sketch translation algorithms in both directions.

XML offers several notations (called schema notations) for defining grammars for markup languages. The markup languages defined by those schemas describe objects textually. An XML schema defines both a set of documents conforming to the schema and a set of objects (for example in the DOM).
UML class diagrams, on the other hand, only define a set of objects. They don't define a language for describing the objects, except through the constructor call notation and other notations of the programming language that will be used in the project.

	defines object set	defines document set	plays grammar role
XML schema	partially***	yes	yes
UML class diagram	yes	no	no

*** XML does not support naming the parts with names other than the type names. If all part types per class are unique, then the schema defines an object set in which all parts can be uniquely identified by name.

Because both XML schemas and UML class diagrams define sets of objects, a natural question arises: how should an organization manage its XML schema repository and its UML class diagram repository? From the above discussion it is apparent that there is a danger of significant overlap between the XML schemas and the UML class diagrams that are needed for the same application.

Because XML schemas are more expressive than UML class diagrams, it would be natural to start the modeling with XML schemas. Currently, however, the XML schema notations have some deficiencies that keep them from being an ideal object modeling notation. Workarounds are available, and we expect those deficiencies to disappear over time (written in May 2000). The XML schema notation should also be used to describe "functional" objects, such as visitor objects, that are not directly related to business concepts. The XML schema should be written with the intent that it will be used to implement the functionality of the application and not just to describe the structure of the business data.

XML schemas have the essential capabilities to model class structures. The essential capabilities are:

modeling classes with a fixed number of parts
modeling classes with a variable number of parts (collection classes)
defining abstract classes with a fixed number of parts

To weigh the advantages and disadvantages of XML schemas as an object modeling notation, consider prefix expressions. A prefix expression (Exp) is either Simple or Compound. A Simple expression is just a Number. A Compound expression consists of an operator (Op) and two arguments that are themselves expressions (Exp). An operator (Op) is either an addition operator (Add) or a multiplication operator (Mul). In the following we use a tool called XML Authority from Extensibility Inc. to create schemas graphically and to print them in a schema notation known as DTD. The above description of prefix expressions is expressed by the following schema:

<!ELEMENT Compound (Op , Exp , Exp )>
<!ELEMENT Simple (Number )>
<!ELEMENT Exp (Simple | Compound )>
<!ELEMENT Op (Add | Mul )>
<!ELEMENT Number (#PCDATA )>
<!ELEMENT Add (#PCDATA )>
<!ELEMENT Mul (#PCDATA )>

The following is an example of a document that describes the prefix expression * 3 5 (3*5 in ordinary notation).

<Compound>
  <Op>
    <Mul>*</Mul>
  </Op>
  <Exp>
    <Simple>
      <Number>3</Number>
    </Simple>
  </Exp>
  <Exp>
    <Simple>
      <Number>5</Number>
    </Simple>
  </Exp>
</Compound>
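
To illustrate how the schema doubles as a definition of objects, the following minimal sketch parses the example document with Python's standard xml.etree.ElementTree module and evaluates it. The function name evaluate and the use of Python are our own choices for illustration and are not part of the original tool chain. Note that the code has to rely on the position of the children of Compound, because the schema gives the two arguments no names.

```python
import xml.etree.ElementTree as ET

# The example document from the text, describing the prefix expression * 3 5.
DOCUMENT = """
<Compound>
  <Op><Mul>*</Mul></Op>
  <Exp><Simple><Number>3</Number></Simple></Exp>
  <Exp><Simple><Number>5</Number></Simple></Exp>
</Compound>
"""


def evaluate(element):
    """Evaluate an element that conforms to the prefix-expression DTD."""
    if element.tag == "Exp":
        # Exp (Simple | Compound): the single child is the chosen alternative.
        return evaluate(element[0])
    if element.tag == "Simple":
        # Simple (Number): the Number element holds the digits as text.
        return int(element[0].text)
    if element.tag == "Compound":
        # Compound (Op, Exp, Exp): the parts are identified by position only.
        op, first, second = element
        operator = op[0].tag  # "Add" or "Mul"
        a, b = evaluate(first), evaluate(second)
        return a + b if operator == "Add" else a * b
    raise ValueError(f"unexpected element: {element.tag}")


result = evaluate(ET.fromstring(DOCUMENT))
print(result)
```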

The diagram shown above is very close to a UML class diagram for representing prefix expressions. The nodes are classes and the edges show relationships between classes. The rectilinear connections show directed associations from left to right. The connections from Op to Add and Mul and from Exp to Simple and Compound are inheritance edges. There is one detail missing (besides the missing edge from Compound to Exp): the association ends are missing in the schema. We would like to say that a Compound expression has two subexpressions called argument1 and argument2, both of type Exp. Unfortunately, we cannot express this in the schema, while it can easily be expressed in a UML class diagram. The workaround would be to introduce two extra elements, called Argument1 and Argument2, and to define them to contain an Exp. This introduces two extra nodes and two extra edges in the schema, which clutters it. We call this problem the PartNaming problem of XML schemas.
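
The PartNaming problem becomes visible as soon as the schema is mapped to classes. The sketch below is an illustration under our own assumptions (Python dataclasses; the names build, op, and number are hypothetical). The classes carry the part names argument1 and argument2 that a UML class diagram would show, and the builder must supply those names itself, from the position of the children, because the DTD does not provide them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Union


@dataclass
class Simple:
    number: int


@dataclass
class Compound:
    op: str             # "Add" or "Mul"
    argument1: "Exp"    # part name that the DTD cannot express
    argument2: "Exp"    # part name that the DTD cannot express


# Exp plays the role of the abstract class with subclasses Simple and Compound.
Exp = Union[Simple, Compound]


def build(element: ET.Element) -> Exp:
    """Turn a parsed element into an object with named parts."""
    if element.tag == "Exp":
        return build(element[0])
    if element.tag == "Simple":
        return Simple(number=int(element[0].text))
    if element.tag == "Compound":
        op, first, second = element  # names are assigned by position
        return Compound(op=op[0].tag,
                        argument1=build(first),
                        argument2=build(second))
    raise ValueError(f"unexpected element: {element.tag}")


tree = build(ET.fromstring(
    "<Compound><Op><Mul>*</Mul></Op>"
    "<Exp><Simple><Number>3</Number></Simple></Exp>"
    "<Exp><Simple><Number>5</Number></Simple></Exp></Compound>"
))
print(tree.argument1, tree.argument2)
```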

Besides the PartNaming problem, a second problem creates systematic differences between XML schemas and UML class diagrams: the ObjectLinking problem. Consider the following XML schema that describes a network of partners using a graph structure with labels on edges (LinkInfo) and nodes (PartnerInfo).

The above schema defines documents in which Partner structures refer to other partners through the PartnerId in the PartnerLinkInfo objects. In a UML class diagram, we would like to represent the Partner structures as a linked structure, which means that the PartnerId in PartnerLinkInfo should be replaced by a direct reference to the Partner object.

Because XML schemas have the essential capabilities to model class structures, they can be translated to UML class diagrams by using the following systematic process:

Eliminate superfluous elements introduced because of the PartNaming problem.
Introduce object linking where it is needed for efficiency reasons. The linked objects can be automatically created from the parsed XML documents by using a suitable tool.
Identify the abstract classes in the XML schema and mark them in the UML class diagram.
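
The object linking step can be sketched as a two-pass resolution over the parsed document. The element names Partner, PartnerInfo, PartnerLinkInfo, LinkInfo, and PartnerId come from the text, but the exact document layout used below (a Partners root, and a PartnerId child directly inside each Partner) is an assumption made only for this illustration. The class and function names are hypothetical as well.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Assumed layout: each Partner carries its own PartnerId, a PartnerInfo label,
# and PartnerLinkInfo children that name the target partner by PartnerId.
DOCUMENT = """
<Partners>
  <Partner>
    <PartnerId>p1</PartnerId>
    <PartnerInfo>Supplier</PartnerInfo>
    <PartnerLinkInfo><PartnerId>p2</PartnerId><LinkInfo>delivers to</LinkInfo></PartnerLinkInfo>
  </Partner>
  <Partner>
    <PartnerId>p2</PartnerId>
    <PartnerInfo>Retailer</PartnerInfo>
    <PartnerLinkInfo><PartnerId>p1</PartnerId><LinkInfo>orders from</LinkInfo></PartnerLinkInfo>
  </Partner>
</Partners>
"""


@dataclass(eq=False)  # identity comparison; the object graph may be cyclic
class Link:
    info: str
    target: "Partner"  # a direct object reference instead of a PartnerId


@dataclass(eq=False)
class Partner:
    partner_id: str
    info: str
    links: list = field(default_factory=list)


def link_partners(root):
    """Create one Partner object per element, then replace ids by references."""
    by_id = {}
    # Pass 1: create all Partner objects so that forward references resolve.
    for el in root.iter("Partner"):
        pid = el.findtext("PartnerId")
        by_id[pid] = Partner(pid, el.findtext("PartnerInfo"))
    # Pass 2: turn every PartnerId inside a PartnerLinkInfo into a reference.
    for el in root.iter("Partner"):
        source = by_id[el.findtext("PartnerId")]
        for link_el in el.findall("PartnerLinkInfo"):
            target = by_id[link_el.findtext("PartnerId")]
            source.links.append(Link(link_el.findtext("LinkInfo"), target))
    return by_id


partners = link_partners(ET.fromstring(DOCUMENT))
print(partners["p1"].links[0].target.info)
```

Two passes are used so that a link may point to a partner that appears later in the document.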

From UML class diagrams to XML schemas

It is useful to translate UML class diagrams to XML schemas, provided the UML class diagrams have been written with the intent that they will play a grammar role. Diagrams written with such an intent can be translated easily. What restrictions must a UML class diagram satisfy so that it is easily translated into an XML schema?

All nodes in the class diagram must be inductive (see the definition in chapter 12 of the AP book). This rule disallows classes that can have only circular objects. Such classes would not define any XML documents and would therefore be "useless".
Classes are partitioned into abstract classes and concrete classes. Abstract classes have at least one subclass.
The class diagram should follow the Abstract Super Class rule (ASR). That means that all superclasses should be abstract.
The parts of each class need to be ordered.
There must be a policy for adding the parts of a superclass to the parts of its subclasses in a systematic way.
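
The first restriction can be checked mechanically. The sketch below is our reading of the rule, not the definition from the AP book: a class counts as inductive if it admits at least one finite, non-circular object. Under that assumption, a concrete class is inductive once all of its required parts are, and an abstract class is inductive once at least one subclass is. The function name and the dictionary encoding of the diagram are hypothetical.

```python
def inductive_classes(concrete, abstract):
    """Return the set of class names that admit at least one finite object.

    concrete maps a class name to the list of its required part types.
    abstract maps a class name to the list of its subclasses.
    """
    inductive = set()
    changed = True
    while changed:  # iterate until no further class can be added (fixpoint)
        changed = False
        for name, parts in concrete.items():
            if name not in inductive and all(p in inductive for p in parts):
                inductive.add(name)
                changed = True
        for name, subclasses in abstract.items():
            if name not in inductive and any(s in inductive for s in subclasses):
                inductive.add(name)
                changed = True
    return inductive


# The prefix-expression classes: every class is inductive.
concrete = {
    "Number": [], "Add": [], "Mul": [],
    "Simple": ["Number"],
    "Compound": ["Op", "Exp", "Exp"],
}
abstract = {"Exp": ["Simple", "Compound"], "Op": ["Add", "Mul"]}
good = inductive_classes(concrete, abstract)

# A class whose only required part is itself has only circular objects.
bad = inductive_classes({"Loop": ["Loop"]}, {})
print(sorted(good), sorted(bad))
```

Optional and repeated parts are left out of the part lists in this encoding, since they can always be empty and therefore never prevent a finite object.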

The translation algorithm roughly proceeds as follows: Flatten the UML class diagram, i.e., push all parts of abstract classes down to concrete classes. Replace undirected associations by two directed associations. Follow the ordering of parts and the ordering policy. A concrete class A is translated into a schema element of the form A (B1, B2, B3, ...). An abstract class A is translated into a schema element of the form A (B1 | B2 | B3 | ...). An optional part B is represented as B?. A repeated part B is represented as B+ (one or more) or B* (zero or more).
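
The translation can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not a complete implementation: the class and function names are hypothetical, the ordering policy places inherited parts before a class's own parts, a class without parts is emitted as #PCDATA, and the replacement of undirected associations is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    type_name: str
    multiplicity: str = ""  # "" (exactly one), "?", "+" or "*"


@dataclass
class ClassDef:
    name: str
    abstract: bool = False
    superclass: Optional[str] = None
    parts: list = field(default_factory=list)  # ordered list of Part


def all_parts(cls, classes):
    """Flatten: inherited parts first (assumed policy), then the own parts."""
    inherited = all_parts(classes[cls.superclass], classes) if cls.superclass else []
    return inherited + cls.parts


def to_dtd(class_list):
    """Emit one DTD element declaration per class, in the given order."""
    classes = {c.name: c for c in class_list}
    lines = []
    for c in class_list:
        if c.abstract:
            # Abstract class: alternation of its direct subclasses.
            # The rules above guarantee at least one subclass.
            model = " | ".join(s.name for s in class_list if s.superclass == c.name)
        else:
            # Concrete class: ordered sequence of its flattened parts.
            parts = all_parts(c, classes)
            model = ", ".join(p.type_name + p.multiplicity for p in parts) or "#PCDATA"
        lines.append(f"<!ELEMENT {c.name} ({model})>")
    return "\n".join(lines)


PREFIX_CLASSES = [
    ClassDef("Exp", abstract=True),
    ClassDef("Simple", superclass="Exp", parts=[Part("Number")]),
    ClassDef("Compound", superclass="Exp",
             parts=[Part("Op"), Part("Exp"), Part("Exp")]),
    ClassDef("Op", abstract=True),
    ClassDef("Add", superclass="Op"),
    ClassDef("Mul", superclass="Op"),
    ClassDef("Number"),
]

print(to_dtd(PREFIX_CLASSES))
```

Applied to the prefix-expression classes, the sketch yields declarations equivalent to the DTD shown earlier, up to the order of the declarations and whitespace.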

Summary:

Given the current state of the art of XML and UML technology, it seems useful to develop an integration of XML schemas and UML class diagrams. The combined notation should start either with an XML Schema Notation or the XMI notation (or similar notation) for class diagrams and extend it with the missing information. If we start with XML schemas, we need to add part names. If we start with UML class diagrams we need to add ordering information and we need to follow a certain style.

Because it is easier to add part names to an XML schema notation than to add more information to a UML class diagram, we prefer to take XML schemas as the starting point of a design. Starting with UML class diagrams also makes sense, however.

Demeter Home Page

Professor Karl J. Lieberherr

College of Computer Science, Northeastern University

Avenue of the Arts

Cullinane Hall, Boston, MA 02115-9959

lieberherr@ccs.neu.edu

Phone: (617) 373 2077 / Fax: (617) 373 5121