8.7 XML Schema Languages 231
• A talk might be contributed, in which case the title and the authors of the
contribution are listed, or invited, in which case the title and the speaker
are listed.
• Between sessions there might be breaks.
These requirements describe a tree language over the alphabet consisting of the
tags like conference, track, etc. used in the document. In these considerations
the data is usually ignored and only the structure of the document is taken into
account. Such a description of a tree language is also called a schema. A
document is said to be valid w.r.t. a schema if it (seen as a tree) belongs to the
tree language defined by the schema.
There are several languages, called XML schema languages, that allow to
describe such requirements on XML documents, and in the following we discuss
some of them. This should not be considered as a reference for the syntax
and semantics of these schema languages but we rather fo cus on the theoretical
aspects and the connections to automata theory. In particular, we are not
considering anything connected to the actual data represented in the documents,
as e.g. data types, because we are, as already mentioned, only interested in the
structure of the documents.
In particular, we are interested in the expressive power of these languages.
Furthermore, we consider the decision problems from Section 8.5 because they
correspond to natural questions on XML documents.
• The membership problem corresponds to check whether a given document
is valid w.r.t. to a fixed schema. For example, checking whether an HTML
document conforms to the specification is an instance of such a problem.
• The uniform membership problem corresponds to validation of a document
against some given schema. The schema might for example be specified
in the header of the document.
• The emptiness problem corresponds to the question whether for a schema
there is any tree valid w.r.t. this schema. This can one the one hand be
used for sanity checks on schema descriptions but also as subproblem for
solving other questions like the inclusion problem.
• The inclusion problem is the question whether all documents that are valid
w.r.t. one schema are also valid w.r.t. another schema. This is, e.g., of
interest when merging archives of documents that are valid w.r.t. different
schemas.
The practical formalisms that we present in the following all have a syntax
that is closer to tree grammars than to hedge automata (compare Chapter 2
for ranked tree grammars). But it is convenient to use hedge automata as a
reference model for the expressiveness, and to use the results from Section 8.5
to obtain algorithms and complexity results for the different formalisms in a
uniform framework. Furthermore, the knowledge on hedge automata (and word
automata) has certainly influenced the development of the formalisms presented
here.
TATA — November 18, 2008 —