XML formats

This page is mainly intended for developers, as ordinary users can create resources through a graphical user interface. The XML formats, which satisfy a collection of DTDs, are explained by means of examples. To see the DTDs, download the software and look in the doc directory.

See also the PDF output of this example.

The corpus file

A file corpus.xml could contain:

<?xml version="1.0" encoding="UTF-8"?>
<corpus type="Ancient Egyptian" name="My corpus">
<text location="texts/Peasant.xml"/>
<tree kind="texts">
 <leaf label="Eloquent Peasant"
      name="Eloquent Peasant" location="texts/Peasant.xml"
      post=""/>
 <leaf label="Peasant"
      name="Eloquent Peasant" location="texts/Peasant.xml"
      post=""/>
</tree>
<tree kind="books">
 <internal label="De Buck, Readingbook">
  <leaf label="pp. 88-99"
       name="Eloquent Peasant" location="texts/Peasant.xml"
       post=""/>
 </internal>
 <internal label="Sethe, Ägyptische Lesestücke">
  <leaf label="2"
       name="Eloquent Peasant" location="texts/Peasant.xml"
       post=""/>
 </internal>
</tree>
</corpus>

The third line mentions the (currently only) file in the corpus. The rest is an automatically generated tree, which can be used to navigate through the corpus. This tree should never be manually edited.
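
For developers who only want to read a corpus file, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function names text_locations and leaf_paths are my own and not part of the software, and the embedded XML is a shortened copy of the example above (the XML declaration is left out because the data is passed as a string).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the corpus.xml example; only the "books" tree is kept.
CORPUS_XML = """\
<corpus type="Ancient Egyptian" name="My corpus">
<text location="texts/Peasant.xml"/>
<tree kind="books">
 <internal label="De Buck, Readingbook">
  <leaf label="pp. 88-99"
       name="Eloquent Peasant" location="texts/Peasant.xml"
       post=""/>
 </internal>
</tree>
</corpus>
"""


def text_locations(corpus):
    """Return the location of every text file listed in the corpus."""
    return [t.get("location") for t in corpus.findall("text")]


def leaf_paths(node, prefix=()):
    """Yield (tuple of labels, location) for every leaf below node."""
    for child in node:
        labels = prefix + (child.get("label"),)
        if child.tag == "leaf":
            yield labels, child.get("location")
        elif child.tag == "internal":
            yield from leaf_paths(child, labels)


corpus = ET.fromstring(CORPUS_XML)
locations = text_locations(corpus)
books = corpus.find("tree[@kind='books']")
paths = list(leaf_paths(books))
```

The sketch only reads the tree, in keeping with the advice that the tree should not be edited by hand.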

A text

A file for a text (say Peasant.xml in directory texts) could be:

<?xml version="1.0" encoding="UTF-8"?>
<text description="hiero; eng; translit">
<primary language="eng" name="Eloquent Peasant"/>
<primary language="nld" name="De welsprekende boer"/>
<secondary language="eng" name="Peasant"/>
<collection kind="books" collect="De Buck, Readingbook"
    section="pp. 88-99"/>
<collection kind="books" collect="Sethe, Ägyptische Lesestücke"
    section="2"/>
<resource location="../resources/PeasantHi.xml"/>
<resource location="../resources/PeasantTr.xml"/>
<precedence
  location="../align/Peasant.xml"
  resource1="../resources/PeasantHi.xml"
  resource2="../resources/PeasantTr.xml"/>
</text>

The description attribute in the second line is generated automatically and summarizes the available resources (here hieroglyphic, English translation, and transliteration). Next come primary names in English and Dutch, and a secondary name.

Then a number of collections may group the text together with other texts.

Two textual resources are available for this text. Their names end in Hi (for hieroglyphic) and Tr (for translation; the file in fact also contains transliteration). I would further recommend using Al (alphabetic) for transliteration-only files and Lx (lexicon) for lexical annotation. After Tr one could append a three-letter language code for translations in languages other than English.

Note that the path names of the textual resources are relative to the file of the text. Avoidance of absolute path names makes it easier to exchange data with others.
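
As an illustration of how such relative path names might be resolved, here is a small Python sketch. The helper resolve and the directory name corpus are hypothetical; only the location attributes come from the example above.

```python
import posixpath
import xml.etree.ElementTree as ET

# Shortened copy of the text file example (names and collections left out).
TEXT_XML = """\
<text description="hiero; eng; translit">
<resource location="../resources/PeasantHi.xml"/>
<resource location="../resources/PeasantTr.xml"/>
<precedence
  location="../align/Peasant.xml"
  resource1="../resources/PeasantHi.xml"
  resource2="../resources/PeasantTr.xml"/>
</text>
"""


def resolve(text_file, location):
    """Interpret location relative to the directory of the text file."""
    base = posixpath.dirname(text_file)
    return posixpath.normpath(posixpath.join(base, location))


# Hypothetical place where the text file is stored.
text_file = "corpus/texts/Peasant.xml"

text = ET.fromstring(TEXT_XML)
resources = [resolve(text_file, r.get("location"))
             for r in text.findall("resource")]
alignments = [resolve(text_file, p.get("location"))
              for p in text.findall("precedence")]
```

Because every path is computed from the location of the text file, the whole directory can be moved or shared without changing the XML.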

Lastly, there is an alignment file linking positions in different textual resources, explained below.

Again, normally one would not manually edit any of the above, except in the case of alignment files that are imported from elsewhere.

Textual resource

A textual resource is for example the following (PeasantHi.xml in directory resources):

<?xml version="1.0" encoding="UTF-8"?>
<egyptian
  creator="Mark-Jan Nederhof"
  name="Parkinson"
  labelname="Pa"
  created="2009-08-17"
  modified="2009-08-17"
  version="R"
  scheme=""
  language="">
<header>
<p>
Some hieroglyphs to illustrate the file formats.
</p>
</header>
<bibliography>
<li>
R.B. Parkinson. <i>The Tale of the Eloquent Peasant</i>. Griffith
Institute, Ashmolean Museum, Oxford, 1991.
</li>
</bibliography>
<tier name="hieroglyphic" mode="shown"/>
<tier name="transliteration" mode="ignored"/>
<tier name="translation" mode="ignored"/>
<tier name="lexical" mode="ignored"/>
<segment>
<texthi>
<coord id="1.1"/><note symbol="5">A footnote about the hare
<hi>wn</hi>.</note><pos symbol="3" id="0"/>z:A1*Z1-p*W-wn:n
</texthi>
</segment>
</egyptian>

At the beginning, a number of properties of the resource are listed. The header contains a description, and the bibliography may list some references. Each tier element then indicates whether that tier is shown or ignored; the other possible values are omitted (the corresponding lines are left out of the interlinear output) and erased (the interlinear output contains whitespace for the lines, to be filled in as a student exercise). One can put tiers with hieroglyphic, transliteration, translation and lexical annotation in one resource, but this is not encouraged. In the above example, only hieroglyphic is included.
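
The tier modes could be read as in the following Python sketch. The function tier_modes and the set KNOWN_MODES are my own names; the four mode values are the ones mentioned above, and the XML is a shortened copy of the example without header, bibliography and segment.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the PeasantHi.xml example.
RESOURCE_XML = """\
<egyptian creator="Mark-Jan Nederhof" name="Parkinson" labelname="Pa"
  created="2009-08-17" modified="2009-08-17" version="R"
  scheme="" language="">
<tier name="hieroglyphic" mode="shown"/>
<tier name="transliteration" mode="ignored"/>
<tier name="translation" mode="ignored"/>
<tier name="lexical" mode="ignored"/>
</egyptian>
"""

# The four mode values described in the text.
KNOWN_MODES = {"shown", "ignored", "omitted", "erased"}


def tier_modes(resource):
    """Map each tier name to its mode, rejecting unknown modes."""
    modes = {}
    for tier in resource.findall("tier"):
        mode = tier.get("mode")
        if mode not in KNOWN_MODES:
            raise ValueError("unknown tier mode: %r" % mode)
        modes[tier.get("name")] = mode
    return modes


resource = ET.fromstring(RESOURCE_XML)
modes = tier_modes(resource)
shown = [name for name, mode in modes.items() if mode == "shown"]
```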

The segment starts with a coordinate, which is a line number in the manuscript.

The position <pos symbol="3" id="0"/> attaches a position with label "0" to hieroglyph number 3 of the following hieroglyphic. (Numbering is from 0.) This is used for alignment, discussed further below.

A footnote marks hieroglyph number 5.
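
To make the numbering concrete, the following Python sketch looks up the hieroglyphs that the position and the footnote refer to. It rests on a loud assumption: that splitting the encoding on the characters -, : and * yields the individual hieroglyphs. That holds for this short example, but a real encoding may contain constructs that need a proper parser. The function glyphs is my own name.

```python
import re
import xml.etree.ElementTree as ET

# The texthi element from the example above.
TEXTHI_XML = """\
<texthi>
<coord id="1.1"/><note symbol="5">A footnote about the hare
<hi>wn</hi>.</note><pos symbol="3" id="0"/>z:A1*Z1-p*W-wn:n
</texthi>
"""


def glyphs(encoding):
    """Split an encoding into glyph names.

    Assumption: '-', ':' and '*' are the only separators, which is
    enough for this example but not for hieroglyphic encoding in general.
    """
    return [g for g in re.split(r"[-:*]", encoding.strip()) if g]


texthi = ET.fromstring(TEXTHI_XML)

# The hieroglyphic is the text directly inside texthi, that is, the text
# before the first child plus the text following each child element.
parts = [texthi.text or ""] + [child.tail or "" for child in texthi]
glyph_list = glyphs("".join(parts))

# Numbering is from 0, as stated in the text.
positions = {p.get("id"): glyph_list[int(p.get("symbol"))]
             for p in texthi.findall("pos")}
notes = [(glyph_list[int(n.get("symbol"))], " ".join("".join(n.itertext()).split()))
         for n in texthi.findall("note")]
```

Under this assumption, position "0" is attached to p and the footnote to wn, the hare that the footnote is about.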

Another resource with transliteration and translation is the following (PeasantTr.xml in directory resources):

<?xml version="1.0" encoding="UTF-8"?>
<egyptian
  creator="Mark-Jan Nederhof"
  name="Nederhof"
  labelname="Ne"
  created="2009-08-17"
  modified="2009-08-17"
  version="R"
  scheme=""
  language="eng">
<header>
<p>
This is my translation.
</p>
</header>
<bibliography>
<li>
R. Hannig. <i>Grosses Handwörterbuch Ägyptisch-Deutsch: die
Sprache der Pharaonen (2800-950 v.Chr.)</i>. Verlag Philipp von
Zabern, 1995.
</li>
</bibliography>
<tier name="hieroglyphic" mode="ignored"/>
<tier name="transliteration" mode="shown"/>
<tier name="translation" mode="shown"/>
<tier name="lexical" mode="ignored"/>
<segment>
<textal>
<coord id="1.1"/>s <pos id="0"/>pw wn
<pos id="1"/>^xw.n-^jnpw<note>Cf. p. 356 of Allen (2000).</note>
</textal>
<texttr>
<coord id="1.1"/> There was a man called <pos id="2"/>Khunanup
</texttr>
<prec id1="1" id2="2"/>
<prec id1="2" id2="1"/>
</segment>
</egyptian>

Here only one segment is given; in general a resource includes multiple segments. How to divide a text into segments is a matter of taste. It is recommended to make segments short enough to fit on a page of reasonable width, and shorter when possible. (Hieroglyphic is normally divided into segments according to the line numbers, with one segment for each line.)

In the example there is a footnote following ^xw.n-^jnpw. There are also labelled positions. These are linked by the lines at the end of the segment, which say that position "1" should occur to the left of position "2" and position "2" should occur to the left of position "1", which effectively means the positions should be aligned one below the other.
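
The effect of two mutual prec elements can be sketched in Python as follows. The function aligned_pairs is my own name; it merely reports which pairs of positions are constrained in both directions and so end up one below the other.

```python
import xml.etree.ElementTree as ET

# The segment from the example above.
SEGMENT_XML = """\
<segment>
<textal>
<coord id="1.1"/>s <pos id="0"/>pw wn
<pos id="1"/>^xw.n-^jnpw<note>Cf. p. 356 of Allen (2000).</note>
</textal>
<texttr>
<coord id="1.1"/> There was a man called <pos id="2"/>Khunanup
</texttr>
<prec id1="1" id2="2"/>
<prec id1="2" id2="1"/>
</segment>
"""


def aligned_pairs(segment):
    """Return the pairs of positions that precede each other mutually."""
    before = {(p.get("id1"), p.get("id2")) for p in segment.findall("prec")}
    mutual = {tuple(sorted(pair)) for pair in before
              if (pair[1], pair[0]) in before}
    return sorted(mutual)


segment = ET.fromstring(SEGMENT_XML)
pairs = aligned_pairs(segment)
```

A single prec element without its mirror image would not appear in the result, as it only constrains the order and not the alignment.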

Lexical annotation

In a resource, one may also include lexical annotations. An example of an entry in its most general form is:

<lx 
    texthi="R8-O6" 
    textal="Hwt-nTr" 
    texttr="Heiligtum"
    textfo="honorific transposition"
    cite="Dictionary of Ancient Egypt"
    href="http://somesite.org/~someuser/someprogram#Hwt"
    keyhi="O6" 
    keyal="Hwt"
    keytr="enclosure" 
    keyfo="noun, sing." 
    dicthi="R8-O6-X1:O1"
    dictal="Hwt ntr" 
    dicttr="tempel"
    dictfo="dir. genitive"/>

The attributes starting with text refer to an occurrence of a word or phrase in the text at hand. One may indicate the hieroglyphic, the transliteration, the translation, some indication of the orthographic or syntactic form of the occurrence, any combination of these, or none at all.

The value of cite refers to a dictionary. This attribute is typically omitted if the only objective of lexical annotation is to develop one's own word list. There may also be a hyperlink belonging to the dictionary, or specifically to an entry within the dictionary.

The key is the word under which the phrase may be found in a dictionary. Typically, keyal would be either equal to textal or to a substring of it (except for example in the case of verbs, where keyal could contain a weak consonant that is not found in textal). As key one may include any combination of hieroglyphic, transliteration or translation. One may also specify aspects of the form of the key, such as gender and verb class.

Further, by means of the attributes starting with dict, one may provide the phrase as it is found in the dictionary, or as one wants to have it found in one's own word list.

All of the attributes are optional. An example of annotation with omitted attributes is:

<lx
    textal="abt=f" 
    cite="Faulkner" 
    keyal="iab"
    dicttr="join s'one"/>
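
Since all attributes are optional, code reading an lx element has to cope with any of them being absent. The following Python sketch groups the attributes by their prefix (text, key, dict) and suffix (hi, al, tr, fo). The function parse_lx and the layout of the resulting dictionary are my own choices.

```python
import xml.etree.ElementTree as ET

# The second lx example from above, with most attributes omitted.
LX_XML = """\
<lx
    textal="abt=f"
    cite="Faulkner"
    keyal="iab"
    dicttr="join s'one"/>
"""

GROUPS = ("text", "key", "dict")
KINDS = ("hi", "al", "tr", "fo")


def parse_lx(lx):
    """Group the attributes of an lx element; absent ones are skipped."""
    entry = {group: {} for group in GROUPS}
    for group in GROUPS:
        for kind in KINDS:
            value = lx.get(group + kind)
            if value is not None:
                entry[group][kind] = value
    # cite and href are None when the attribute is absent.
    entry["cite"] = lx.get("cite")
    entry["href"] = lx.get("href")
    return entry


entry = parse_lx(ET.fromstring(LX_XML))
```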

Alignment file

An alignment file may be the following (e.g. Peasant.xml in directory align):

<?xml version="1.0" encoding="UTF-8"?>
<precedence
  created="2009-08-17"
  modified="">
<prec1 id1="0" id2="0"/>
<prec2 id1="0" id2="0"/>
<prec2 id1="1" type1="after" id2="2"/>
</precedence>

This indicates that position "0" from the first resource should occur to the left of position "0" from the second resource, and vice versa, which requires the two positions to be aligned one below the other. Further, the location just after position "1" from the second resource should occur to the left of position "2" from the first resource.
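
The reading just given can be sketched in Python as follows. It assumes, on the basis of the description above, that in prec1 the attribute id1 belongs to the first resource and id2 to the second, and that in prec2 it is the other way around. The function constraints and the tuple layout are my own, and an absent type1 is simply stored as None, without any claim about a default value.

```python
import xml.etree.ElementTree as ET

# The alignment file from the example above.
PRECEDENCE_XML = """\
<precedence
  created="2009-08-17"
  modified="">
<prec1 id1="0" id2="0"/>
<prec2 id1="0" id2="0"/>
<prec2 id1="1" type1="after" id2="2"/>
</precedence>
"""


def constraints(precedence):
    """List ((resource, id, type), (resource, id)) pairs: left before right."""
    result = []
    for element in precedence:
        if element.tag == "prec1":
            left, right = 1, 2   # id1 in first resource, id2 in second
        elif element.tag == "prec2":
            left, right = 2, 1   # id1 in second resource, id2 in first
        else:
            continue
        result.append(((left, element.get("id1"), element.get("type1")),
                       (right, element.get("id2"))))
    return result


found = constraints(ET.fromstring(PRECEDENCE_XML))
```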

Etc elements

The <etc/> tag indicates that some material was omitted in a tier. Automatic alignment between tiers can make use of this.

Mapping between schemes

For the Eloquent Peasant, there are two numbering schemes for the line numbers. A special resource can be created to map between such numbering schemes, to obtain alignment e.g. between coordinate "1" in the old scheme and coordinate "1.1" in the new scheme. This is a specialized use and is not discussed further here. If anyone is interested, I would advise them to look at the Eloquent Peasant as it is encoded in the St Andrews corpus, which offers a self-explanatory example.