Maintaining XML documents

XML documents are hard to maintain under changing schemas (DTDs or XML schemas) because each document encodes many of the details of the DTD into each document. There is a strong need to have tools that help with the maintenance of XML documents. The purpose of this project is to provide a tool that automatically updates documents after a manual correspondence between schemas has been established.

The proposed approach can benefit from earlier work on correspondences between models. See [pottinger:vldb03] for work on schema correspondences.

[pottinger:vldb03] is about merging schemas based on given correspondences. The work proposed here is also about merging schemas, including instances of schemas. This is new based on the quote from [pottinger:vldb03]: " Note that these works, like ours, consider merging only the models, not the instances of the models".

Idea: Consider the schema evolution from schema 1 (now called L1) to schema 2 (now called L2) in the purchase order example.

Can you come up with LL(1) grammars LL-L1 and LL-L2 such that L1 and LL-L1 are object-equivalent and L2 and LL-L2 are object-equivalent and such that LL-L1 defines a sublanguage of LL-L2? If you can find such grammars, the XML documents can be maintained automatically. Notice that if you have 1000 documents for L1 that you need to move to L2, it is easier to write two grammars LL-L1 and LL-L2 than editing 1000 documents. Provide a tool that helps to find LL-L1 and LL-L2.

For the definition of object-equivalence, including an adaptive implementation, see: the AP book. The original paper was Bergstein's OOPSLA 91 paper.

Under which circumstances do LL-L1 and LL-L2 exist given L1 and L2? Informally, they seem to exist if L2 is an "enhancement" of L1. If L1 and L2 are too different then LL-L1 and LL-L2 are likely not to exist. The choice of LL(1) grammars is intentional. See the AP book for the advantages of LL(1) grammars in this context. In prinicple we could also use another grammar family provided there is a bijection between objects and sentences and provided one sentence describes a "large" family of objects following the slogan: concrete syntax is more abstract than abstract syntax.

Design an XML document maintenance service that reads in an old schema, a new schema, a set of documents for the old schema and that returns a set of updated documents for the new schema. The user needs to interact with the service to construct LL-L1 and LL-L2.

What are opportunities of the project? For many schema evolutions happening in practice it should be easy to use the approach described.

What are possible risks of the project? Testing whether L(G1) is a sublanguage of L(G2) is in general undecidable. But the G1 and G2 occurring in practice are "closely" related. Embedding an XML schema into an LL(1) grammar that is object-equivalent requiresa bit of research. The concept of object-equivalence will need to be extended because XML schemas are more complex than class dictionaries. On the other hand, class dictionaries model the essential ingredients of XML schemas.

For further information (the document is a little dated), see regression testing with XML documents.

A simple formulation of the problem is: Given two class dictionaries cg1 and cg2, do there exist class dictionaries cg3 and cg4 such that cg1 is object-equivalent to cg3 and cg2 is object-equivalent to cg4 and L(cg3) is a sublanguage of L(cg4). All class dictionaries are assumed to be LL(1) and not left-recursive. L(cg) is the set of objects Objects(cg) in printed form.

References:
@INPROCEEDINGS{pottinger:vldb03,
AUTHOR = "Rachel Pottinger and Philip Bernstein",
TITLE = "Merging Models Based on Given Correspondences",
BOOKTITLE = "29th VLDB Conference",
YEAR = "2003",
ADDRESS = "Berlin, Germany",
PAGES = "",
EDITOR = "",
PUBLISHER = ""
}