AELalign

A proposed standard for representing and exchanging electronic resources on Egyptian texts

In this document I propose an XML format for representing resources on Egyptian texts. These resources include hieroglyphic representations, transliterations, translations, and lexical information.

I first discuss the motivation behind the format, then I discuss the XML format itself, giving a formal description of the syntax in terms of a DTD, and an informal description of the meaning. Then I describe an implementation that translates different resources in the proposed format to an aligned HTML representation. Lastly, I discuss examples of files in the XML format, and the form resulting from applying the implementation.

If the section on the XML format is too tedious to read, just jump ahead and look at the section with the example.


Motivation

I commonly encounter the following situation: I wish to study a text, say The Eloquent Peasant. Several resources are available to me. We have the book by Parkinson with the transcription into hieroglyphic of the 4 known hieratic manuscripts, we have several books containing translations (e.g. Lichtheim, Simpson and Parkinson), and scores of grammar books that contain isolated examples taken from this text.

For studying a text on which few publications exist, one may open, say, two books containing resources on that text, and use the index fingers of the left and right hands to study the books in parallel, matching e.g. the hieroglyphic form of the text to the translation of the text. However, this strategy is hardly effective for The Eloquent Peasant, and one is forced to study one or two resources at a time instead of studying all available resources in parallel.

The purpose of these pages is to investigate electronic means of alleviating this problem. We propose a format for representing resources on Egyptian texts that allows automatic alignment on the screen. The aligned representation offers a way to study several resources in parallel.

The present implementation is a strong case against systems that are WYSIWYG (`What You See Is What You Get'). Although for hieroglyphic, WYSIWYG editors are arguably more convenient to work with than those that are not WYSIWYG, other kinds of textual data, such as translations and transliterations, can be typed in just as easily by using one's favourite editor. After such resources have been put in electronic form, their combination can easily be achieved by simple implementations such as AELalign, described below. This represents a very flexible and efficient way of dealing with textual data, since:


The format

A resource is an XML file that satisfies the DTD AELalign.0.1.dtd. Below we will informally discuss this format.

A resource typically has the form:

<?xml version="1.0"?> 
<!DOCTYPE resource SYSTEM "AELalign.0.1.dtd">
<resource>
<created>
Created by me today.
</created>
<header name="Name of the resource" 
        url="http://www.my_site/my_file.html">
...header text...
</header>
<body>

<texthi>
<coord version="B1" pos="239"/>
D54-w-i-n-r:f-sxt-t:y-A1
<coord pos="240"/>
....
</texthi>

<textal>
...
<coord version="B1" pos="10"/>
sxtj pn Dd=f
<coord pos="11"/>
...
</textal>

<texttr>
...
<coord version = "B1" pos="10"/>
this peasant. He said
<coord version="B1" pos="11"/>
...
<coord version="R" pos="29"/>
and he loaded his donkeys
...
</texttr>

....

</body>
</resource>
The very first line of a resource is just there to tell us that the document is XML; the exact form of this line deserves no further attention here. The line of the resource starting with <!DOCTYPE contains the version number of the above-mentioned DTD. We add this to keep track of changes in the DTD that are bound to take place in the future.

The first piece of real information in a resource is text mentioning the person who created the file, the date of initial creation, and possibly the dates of subsequent changes. The name of the resource, which is part of the header, is typically a short string that represents for example the last name of the author of the resource. Optionally, one may also include the URL of the site where the XML file is located.

Below that, in the header text itself, one can add a description of the resource. It is recommended that at least the following pieces of information are included:

In the header a limited number of HTML tags are allowed, viz. <ul>, <li>, <i>, and <p>, and the corresponding closing tags. Note that in XML every opening tag should be matched by its closing tag.

In the body of the file, we find several (to be precise, zero or more) blocks of text between pairs of tags such as <texthi> and </texthi>, <textal> and </textal>, <texttr> and </texttr>, and (not indicated in the example above) <textlx> and </textlx>. Here, hi stands for `hieroglyphic', al stands for `alphabetic', i.e. transliteration using the (extended) Roman alphabet, tr stands for `translation', and lx stands for `lexical'. For ease of reference, we will continue to refer to such a piece of text between any of these four pairs of tags as a block.

Although for expositional reasons the example file above contains blocks of several distinct types, it is often better that different types be physically separated into different files, which enhances reuse. A notable exception is when the different types of text were extracted together from the same source; consider in particular examples from grammar books, where hieroglyphic, transliteration and translation occur close together. Apart from this exception, a resource typically contains one single block, and thereby only one type of text. (Actually, for representing concordances a resource may contain no block at all; see further below.)
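To illustrate the overall layout, a resource can be inspected with any generic XML parser. The following is a minimal sketch in Python, assuming a small resource of the form shown above; the function name list_blocks and the sample string are hypothetical, and the DTD is not consulted, so nothing is validated.

```python
import xml.etree.ElementTree as ET

# A small resource following the layout described above (sample content only).
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE resource SYSTEM "AELalign.0.1.dtd">
<resource>
<created>Created by me today.</created>
<header name="Example">header text</header>
<body>
<texthi><coord version="B1" pos="239"/>D54-w-i-n-r:f</texthi>
<texttr><coord version="B1" pos="10"/>this peasant. He said</texttr>
</body>
</resource>"""

BLOCK_TAGS = ("texthi", "textal", "texttr", "textlx")

def list_blocks(xml_text):
    """Return (text type, number of coord tags) for each block in the body."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    result = []
    for child in body:
        if child.tag in BLOCK_TAGS:
            # "texthi" -> "hi", "texttr" -> "tr", etc.
            result.append((child.tag[4:], len(child.findall("coord"))))
    return result

blocks = list_blocks(SAMPLE)
print(blocks)
```

Only direct children of a block are counted here; coordinates nested deeper (if the DTD allows that at all) would need a recursive search.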

We will first discuss the four types of text that may occur in a block, and then we will explain the specification of textual coordinates.

Hieroglyphic (hi)

A line of hieroglyphic text has the form where $hitext$ is itself a line of hieroglyphic text, and $vgroup$ is a vertical group.

A vertical group has the form

where $vgroup$ is itself a vertical group, and $hgroup$ is a horizontal group. Informally, the operator : means ``arrange the arguments vertically on top of each other''.

A horizontal group has the form

where $hgroup$ is itself a horizontal group, and $bgroup$ is a basic group. Informally, the operator * means ``arrange the arguments horizontally next to each other''.

A basic group has the form

where $vgroup$ is a vertical group, and $glyph$ is a glyph.

A glyph has one of the following forms:

where $upper_letter$ is a letter from A to Z, $number$ is a non-empty sequence of digits not starting with 0 (in other words, the usual notation of an integer larger than 0), $opt_lower_letter$ is either the empty string or a letter from a to z, and $letters$ is a non-empty sequence of upper-case or lower-case letters.

The glyphs of the first and second kinds above correspond to the so-called Gardiner codes. The glyphs of the third kind are mnemonics for the more conventional Gardiner codes; use of these mnemonics other than in the user interfaces of hieroglyphic editors is discouraged, since there is no stable and well-publicized standard for the mapping to Gardiner codes; furthermore, mnemonics are redundant for the purposes of exchanging data. (I readily admit however that my recommendation of avoiding mnemonics in internal representations may be difficult to follow if one uses a hieroglyphic editor that does not automatically map between mnemonics and Gardiner codes.)

Examples of glyphs are: A20, W24a, Aa27f, xpr and mAa; the last two are mnemonics for L1 and Aa11, respectively. A glyph is also a basic group; another example of a basic group is (X1:Z4). Glyphs and basic groups are also horizontal groups; other examples of horizontal groups are Q3*X1 and Q3*(X1:Z4). All of the above are also vertical groups; another example of a vertical group is Q3*(X1:Z4):N1.
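The nesting of groups can be made concrete with a small recursive-descent parser. This is only a sketch in Python: the productions in the comment are my reading of the prose and the examples, not a copy of the formal syntax, and the use of - to separate vertical groups is inferred from the sample resource. All function names are hypothetical. White-space between tokens is simply skipped.

```python
import re

# Assumed productions, inferred from the prose and examples (not authoritative):
#   hitext ::= vgroup | hitext "-" vgroup
#   vgroup ::= hgroup | vgroup ":" hgroup
#   hgroup ::= bgroup | hgroup "*" bgroup
#   bgroup ::= glyph  | "(" vgroup ")"
# A glyph is a Gardiner code (A20, W24a, Aa27f) or a mnemonic (xpr, mAa).
TOKEN = re.compile(r"\s*((?:Aa|[A-Z])[1-9][0-9]*[a-z]?|[A-Za-z]+|[-:*()])")

def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_chain(tokens, i, operator, label, parse_operand):
    """Parse operand (operator operand)*; a single operand is returned unwrapped."""
    part, i = parse_operand(tokens, i)
    parts = [part]
    while i < len(tokens) and tokens[i] == operator:
        part, i = parse_operand(tokens, i + 1)
        parts.append(part)
    return (parts[0] if len(parts) == 1 else (label, parts)), i

def parse_vgroup(tokens, i):
    return parse_chain(tokens, i, ":", "v", parse_hgroup)

def parse_hgroup(tokens, i):
    return parse_chain(tokens, i, "*", "h", parse_bgroup)

def parse_bgroup(tokens, i):
    if i >= len(tokens) or tokens[i] in ("-", ":", "*", ")"):
        raise ValueError(f"glyph or '(' expected at token {i}")
    if tokens[i] == "(":
        group, i = parse_vgroup(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("missing ')'")
        return group, i + 1
    return tokens[i], i + 1

def parse_hitext(text):
    """Return the list of top-level groups of a line of hieroglyphic text."""
    tokens = tokenize(text)
    group, i = parse_vgroup(tokens, 0)
    groups = [group]
    while i < len(tokens) and tokens[i] == "-":
        group, i = parse_vgroup(tokens, i + 1)
        groups.append(group)
    if i != len(tokens):
        raise ValueError(f"unexpected token {tokens[i]!r}")
    return groups

print(parse_hitext("Q3*(X1:Z4):N1"))
```

Chains of the same operator are flattened into one list, so the sketch does not record whether the grammar is left- or right-recursive; for display purposes that distinction should not matter.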

No elegant solution has yet been found for denoting cartouches. The traditional ASCII notation using < and > (see Manuel de Codage, or unofficial mirror) is incompatible with XML.

White-space may be inserted into the encoding of hieroglyphic text, anywhere except inside of the denotation of glyphs. When it is used, white-space has the purpose of delimiting ``words'' (words in hieroglyphic are then taken to roughly correspond to characters between white space in the transliteration). Visualization tools for hieroglyphic texts should ignore white-space in the ASCII encoding.

The use of exclamation marks (as in the end-of-line marker ``!'' and in the end-of-page marker ``!!'') plays no role here, and should be avoided for the purposes of AELalign.

Several other possibilities of ASCII encoding of hieroglyphic, such as the use of different colours, fine-tuning of positioning of signs and scaling, are not supported at the present time.

In the near future, we will remove the Manuel de Codage representation of hieroglyphic entirely and replace it with the superior RES representation.

Inside of a block with hieroglyphic text, one may insert normal text, delimited by <text> and </text>, and footnotes delimited by <no> and </no>, as for example in

... Q3*(X1:Z4):N1 <text>(sic!)</text> C4<no>As confirmed by Gardiner, this sign <hi>C4</hi> reads <al>DHwtj</al>, i.e. <tr>Thoth</tr>.</no> T8a ...
The former of these two possibilities is intended for very short pieces of text, typically single words. The footnotes however have fewer restrictions and may contain nested pieces of text delimited by <hi> and </hi>, <al> and </al>, and <tr> and </tr>, for hieroglyphic text, transliterations and translations, respectively, as demonstrated in the example above.

Transliteration (al)

We assume an ASCII encoding of the transliteration alphabet. Apart from the constraints imposed by this encoding, the user is completely free to use any other characters he wishes, such as ``.'', ``='', ``-'' or ``?''; no special meaning will be assigned to these. The user is however advised to adhere to some set of common conventions.

Also a block with transliterations may contain normal text, delimited by <text> and </text>, and footnotes delimited by <no> and </no>.

Translations (tr)

We now come to the most straightforward of the types of text in a block. A translation may be written in any language that is expressed within the limits of the ASCII code, i.e. the expected use will be in English, German, French, and possibly other western European languages. For special characters, the normal conventions for HTML will be used; e.g. &auml; stands for: ä, and &szlig; stands for: ß.

Translations may contain pieces of transliteration delimited by <al> and </al>. This is especially useful when no exact translation of a word is available, as in:

He sees a <al>narw</al>-bird.

Translations may contain footnotes, but no text delimited by <text> and </text>, which would make little sense.

Lexical information (lx)

There are roughly two motivations for the existence of this type of information in a resource, consisting of what we will call lexical entries:

Software for the second application is not provided here, but such software can be developed with minimal effort. In the example below of the use of AELalign for The Eloquent Peasant, we apply lexical entries to provide information found in dictionaries.

A lexical entry in its most general form looks like:

<lx
texthi="R8-O6"
textal="Hwt-nTr"
texttr="Heiligtum"
textform="honorific transposition"
cite="Dictionary of Ancient Egypt"
keyhi="O6"
keyal="Hwt"
keytr="enclosure"
keyform="noun, sing."
dicthi="R8-O6-X1:O1"
dictal="Hwt ntr"
dicttr="tempel"
dictform="dir. genitive"/>
The attributes text.. refer to a phrase in the text at hand. One may indicate the hieroglyphic, the transliteration, the translation, some indication of the orthographic or syntactic form of the phrase, any combination of these, or none at all.

The value of cite refers to the dictionary. This attribute is typically omitted if the only objective of the lexical entries is to develop one's own word list.

The key is the word under which the phrase may be found in the dictionary. Typically, keyal would be either equal to textal or a substring of it (except for example in the case of verbs, where keyal could contain a weak consonant that is not found in textal). As key one may include any combination of hieroglyphic, transliteration and translation, as the values of key..; one may also specify aspects of the form of the key, such as gender and verb class. Further, by means of the values of dict.., one may provide the phrase as it is found in the dictionary, or as one wants to have it found in one's own word list.

All of the attributes are optional. Typically, one will include only a few attributes. An example from The Eloquent Peasant:

....
<coord version="B1" pos="228"/>

<lx
textal="abt=f"
cite="Faulkner"
keyal="iab"
dicttr="join s'one"/>
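As an illustration of how such entries might be collected into a word list, here is a minimal Python sketch. It assumes that lx elements and coord tags are direct children of a textlx block, which is my reading of the example above; the function name word_list is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample block; the nesting of <lx> inside <textlx> is an assumption.
BLOCK = """<textlx>
<coord version="B1" pos="228"/>
<lx textal="abt=f" cite="Faulkner" keyal="iab" dicttr="join s'one"/>
</textlx>"""

def word_list(block_xml):
    """Map each keyal to the (version, pos, textal, dicttr) tuples filed under it."""
    entries = {}
    version, pos = "", ""
    for elem in ET.fromstring(block_xml):
        if elem.tag == "coord":
            # An omitted version is inherited; an omitted position is empty.
            version = elem.get("version", version)
            pos = elem.get("pos", "")
        elif elem.tag == "lx":
            # All attributes are optional, so every lookup has a default.
            key = elem.get("keyal", "")
            entries.setdefault(key, []).append(
                (version, pos, elem.get("textal", ""), elem.get("dicttr", "")))
    return entries

words = word_list(BLOCK)
print(words)
```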

Coordinates and alignment

Alignment is admittedly the most complicated aspect of AELalign, but also the most important. Alignment is the process of fitting together the related parts from different resources for the same text. Alignment is essential for the software that is to allow us to study various resources simultaneously instead of one after the other, as outlined in the Motivation section above.

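There are two kinds of tag that are necessary for alignment, the coordinate tag and the alignment tag. The general forms are:

<coord version="$version$" pos="$position$"/>
<align version="$version$" pos="$position$"/>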

For The Eloquent Peasant, $version$ may for example be B1 or R, which refer to different manuscripts of the same text, or it may be B1 Old which is a different version name for the same manuscript as B1, but refers to an older numbering scheme for that version; $position$ indicates a position within a version of the text, and may for example be 27 or 7.6.

The most obvious use of coord tags is within blocks. Such a tag then specifies that the following text belongs to a certain version, and occurs just after the specified position. This information is valid until we encounter the next coordinate tag in the block, which may specify a different version and a different position.

If one omits a value for the version in a coordinate tag, the version from the most closely preceding coordinate tag within the same block is assumed. E.g.

<texttr>
<coord version="B1" pos="26"/>
...some text...
<coord pos="27"/>
is equivalent to:
<texttr>
<coord version="B1" pos="26"/>
...some text...
<coord version="B1" pos="27"/>

If the position is omitted, this is equivalent to stating pos="", i.e. the position is taken to be the empty string. E.g.

<coord version="B1"/>
is equivalent to:
<coord version="B1" pos=""/>
For an align tag, all missing attributes are taken to specify the empty string. E.g.
<align pos="27"/>
is equivalent to:
<align version="" pos="27"/>
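
The defaulting rules above can be summarised in a few lines of Python. This is a sketch only; the function name resolve_tags and the tuple representation are hypothetical, and treating a missing version on the very first coordinate tag of a block as the empty string is my own assumption, since the text does not cover that case.

```python
def resolve_tags(tags):
    """Make implicit attributes explicit for the tags of one block.

    tags: list of (kind, attrs), with kind "coord" or "align" and attrs a dict.
    Returns a list of (kind, version, pos).
    """
    resolved = []
    last_version = ""  # assumption: no version known before the first coord tag
    for kind, attrs in tags:
        if kind == "coord":
            # A coord tag inherits the version of the preceding coord tag.
            version = attrs.get("version", last_version)
            last_version = version
        else:
            # An align tag never inherits: missing attributes are empty strings.
            version = attrs.get("version", "")
        resolved.append((kind, version, attrs.get("pos", "")))
    return resolved

example = resolve_tags([
    ("coord", {"version": "B1", "pos": "26"}),
    ("coord", {"pos": "27"}),
    ("coord", {"version": "B1"}),
    ("align", {"pos": "27"}),
])
print(example)
```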
Before we can explain the meaning of coordinate and alignment tags more closely, we first have to introduce the concept of stream. We define a stream to be the text of one single type (hieroglyphic, transliteration, translation, or lexical information) for one single version name that occurs in one single resource. This means that one stream cannot contain information from different resources (= different files), nor can it contain information for different version names, nor can it contain different types of text.

Software that processes the XML files will group together all text that belongs to the same stream. For a given stream, the text is arranged in the same order as it occurs in the file. For example, if a file contains:

<texttr>
<coord version="B1" pos="26"/>
...first bit of text from version B1...
<coord version="R" pos="7.6"/>
...other text not belonging to the same version...
</texttr>

<textal>
<coord version="B1" pos="26"/>
...yet another stream since text is of different type...
</textal>

<texttr>
<coord version="B1" pos="27"/>
...same stream as first text line above...
</texttr>
then the last text line of the example will be arranged in the proper stream directly after the first text line of the example, since both lines belong to the same version and occur in blocks of the same type ``tr''.
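A minimal Python sketch of this grouping follows, assuming a body like the one above. The function name collect_streams and the resource name example.xml are hypothetical. The sketch is deliberately simplified: it only looks at the text that directly follows each coordinate tag, so text before the first coordinate and text following other nested elements (such as footnotes or alignment tags) is dropped.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

BODY = """<body>
<texttr><coord version="B1" pos="26"/>first bit<coord version="R" pos="7.6"/>other text</texttr>
<textal><coord version="B1" pos="26"/>transliteration</textal>
<texttr><coord version="B1" pos="27"/>last bit</texttr>
</body>"""

BLOCK_TAGS = ("texthi", "textal", "texttr", "textlx")

def collect_streams(body_xml, resource="example.xml"):
    """Group (pos, text) pairs by stream, i.e. by (resource, text type, version)."""
    streams = defaultdict(list)
    for block in ET.fromstring(body_xml):
        if block.tag not in BLOCK_TAGS:
            continue
        text_type = block.tag[4:]
        version = ""  # version inheritance starts afresh in every block
        for elem in block:
            if elem.tag != "coord":
                continue
            version = elem.get("version", version)
            text = (elem.tail or "").strip()  # text directly after the tag
            streams[(resource, text_type, version)].append((elem.get("pos", ""), text))
    return dict(streams)

streams = collect_streams(BODY)
for key, items in streams.items():
    print(key, items)
```

Since blocks are visited in file order, the two B1 translation segments end up adjacent in their stream even though another block separates them in the file.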

The function of coordinates is twofold: first, they are to be included in the output streams by whatever software turns the XML files into a more readable format. This means they are not only capable of specifying the streams to which text belongs, they are themselves text elements in the streams. Secondly, they are there to tell the software to arrange all streams for the same version (but for different resources and/or different text types) such that identical positions are aligned. In the above example, the translation following position 26 of version B1 is to be aligned with the transliteration following position 26 of the same version.

However, coordinate tags cannot contribute to the alignment of different versions. For this purpose, the alignment tags are used. Alignment tags are also elements in streams, but are typically not visible by themselves in the output of the software. What is visible however is that text following an alignment tag in a stream is aligned with the specified position and version as they occur in other streams. Consider for example a file containing:

<coord version="B1" pos="26"/>
the peasant said to this wife of his:
<align version="C" pos="6.7"/>
see, I'm going to Egypt to fetch
<coord version="B1" pos="27"/>
food for the children
...

<coord version="C" pos="6.6"/>
thereupon his wife said to him:
<coord version="C" pos="6.7"/>
you have to fetch food for the children
Here, the alignment tag in the former of the two streams states that the beginning of ``see, I'm going'' in version B1 is to be aligned with position 6.7 of version C.

In the above example, alignment tags occur in blocks of translations. They may however also occur in other types of text. The best place for alignment tags is undoubtedly in hieroglyphic, because it is there that alignment can be specified most accurately. Note that if the alignment of the hieroglyphic of two versions is specified in this way, one does not also need to add alignment tags to the translation or transliteration for these two versions, since enough information is already present for software to align all streams belonging to the two versions. However, the more alignment tags are added, the more fine-grained the alignment may be.

The coordinate and alignment tags were until now assumed to occur within blocks, but they may also occur outside of blocks. If it occurs outside of blocks, a coordinate tag specifies that at most four blocks that follow with different types (before the next coordinate tag outside of a block is encountered) are to begin with that coordinate tag. If two or more blocks that follow are of the same type, then the coordinate tag is only inserted at the beginning of the first; the text in the second and following blocks of that type is however written to the same stream. For example,

<coord version="B1 Old" pos="83"/>

<textal>
wnn.k Hr rdit di.tw n.f aqw
</textal>

<texttr>
thou shalt cause provisions
</texttr>

<texttr>
to be given to him.
</texttr>
is equivalent to:
<textal>
<coord version="B1 Old" pos="83"/>
wnn.k Hr rdit di.tw n.f aqw
</textal>

<texttr>
<coord version="B1 Old" pos="83"/>
thou shalt cause provisions
to be given to him.
</texttr>
This representation makes sense in particular if the resource contains short examples from grammar books, comprising two or more types of text (hieroglyphic, transliteration, or translation) for each example. Note: one should generally avoid using coordinate tags both within and outside of blocks in one and the same file, since this may lead to confusion.

Another use of coordinate tags outside of blocks is in conjunction with alignment tags outside of blocks. If an alignment tag is used in this way, it means that the specified version and position is to be aligned with the version and position of the most closely preceding coordinate tag. This is useful for specifying concordances. An example of concordances for The Eloquent Peasant is the following:

...
<coord version="R Old" pos="145"/> <align version="R" pos="20.7"/>
<coord version="R Old" pos="146"/> <align version="R" pos="21.1"/>
<coord version="R Old" pos="160"/> <align version="R" pos="23.1"/>
<coord version="R Old" pos="161"/> <align version="R" pos="23.2"/>
...
This states that any occurrence of for example position 20.7 for version name R is to be aligned with position 145 for version name R Old. It is not advisable to combine this special use of the tags with the more common use inside blocks, in one and the same file, since this may lead to confusion.
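Reading such a concordance amounts to pairing every alignment tag with the most closely preceding coordinate tag. The following Python sketch shows one way to do so; the function name concordance is hypothetical, and the sketch assumes the tags are direct children of the body, outside of any block.

```python
import xml.etree.ElementTree as ET

BODY = """<body>
<coord version="R Old" pos="145"/> <align version="R" pos="20.7"/>
<coord version="R Old" pos="146"/> <align version="R" pos="21.1"/>
</body>"""

def concordance(body_xml):
    """Map each aligned (version, pos) to the (version, pos) of the preceding coord."""
    pairs = {}
    current = None  # most closely preceding coordinate outside of blocks
    for elem in ET.fromstring(body_xml):
        if elem.tag == "coord":
            inherited = current[0] if current else ""
            current = (elem.get("version", inherited), elem.get("pos", ""))
        elif elem.tag == "align" and current is not None:
            pairs[(elem.get("version", ""), elem.get("pos", ""))] = current
    return pairs

table = concordance(BODY)
print(table[("R", "20.7")])
```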

Although a position can be an arbitrary string, there is one symbol that has a special meaning: if a position starts with @, then a coordinate tag in which it occurs should not be output by the software. This can be used to force alignment using `dummy' positions, positions that do not correspond to a natural boundary in the manuscript such as a column break, but rather represent for example segmentation into syntactic units.

In most cases, use of explicit dummy positions should be avoided, since they may lead to resources that are paired to other resources by means of position names that are specific to one author, and that are thereby difficult to reuse by others. However, dummy positions can also be written implicitly, to align different types of text within one resource. This is done by writing a coordinate tag outside of a block with the special position @anon, which stands for ``anonymous''. The result of such a tag is that a unique dummy position is generated, and the resulting coordinate tag is inserted at the beginning of each of at most four following blocks of different types. If a second block (or third block, etc.) of the same type is seen, a new unique dummy position is generated. Thus:

<coord version="B1" pos="@anon"/>

<textal>
...some transliteration <coord pos="121"/> and some more ...
</textal>
<texttr>
...some translation <coord pos="121"/> and some more ...
</texttr>

<textal>
...more transliteration <coord pos="122"/>
</textal>
<texttr>
...more translation <coord pos="122"/>
</texttr>
is roughly equivalent to:
<textal>
<coord version="B1" pos="@fresh_dummy1"/>
...some transliteration <coord pos="121"/> and some more ...
</textal>
<texttr>
<coord version="B1" pos="@fresh_dummy1"/>
...some translation <coord pos="121"/> and some more ...
</texttr>

<textal>
<coord version="B1" pos="@fresh_dummy2"/>
...more transliteration <coord pos="122"/>
</textal>
<texttr>
<coord version="B1" pos="@fresh_dummy2"/>
...more translation <coord pos="122"/>
</texttr>
possibly with another string instead of fresh_dummy, which in any case should remain invisible to the user.

In general, at most one coordinate tag with position @anon is needed in a resource, but in some cases, for example when for certain parts of the text only transliterations are available and for some others only translations, it may be required to manually generate new unique dummy positions by means of several coordinate tags of this form. Thus:

<coord version="B1" pos="@anon"/>
<textal>
...transliteration <coord pos="121"/> ...
</textal>

<coord pos="@anon"/>
<texttr>
...translation for another part <coord pos="122"/> ...
</texttr>
is roughly equivalent to:
<textal>
<coord version="B1" pos="@fresh_dummy1"/>
...transliteration <coord pos="121"/> ...
</textal>

<texttr>
<coord version="B1" pos="@fresh_dummy2"/>
...translation for another part <coord pos="122"/> ...
</texttr>
Had we instead omitted the second coordinate tag with position @anon, the beginnings of the two blocks would be erroneously aligned by means of the same dummy position @fresh_dummy1.
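
The generation of dummy positions can be sketched as follows in Python. The representation of the input as ("anon", version) and ("block", type) pairs and the function names are hypothetical; only coordinate tags with position @anon outside of blocks are modelled, and the name fresh_dummy is taken from the examples above.

```python
import itertools

def expand_anon(items):
    """Assign a dummy position to each block that follows an @anon coordinate.

    items: ("anon", version or None) for <coord pos="@anon"/> outside a block,
           ("block", text_type) for a block of type hi, al, tr or lx.
    Returns (text_type, version, dummy position) for every block.
    """
    counter = itertools.count(1)
    version, dummy, seen = "", None, set()
    result = []
    for kind, value in items:
        if kind == "anon":
            version = value or version  # an omitted version is inherited
            dummy, seen = f"@fresh_dummy{next(counter)}", set()
        elif kind == "block" and dummy is not None:
            if value in seen:
                # A second block of the same type starts a new dummy position.
                dummy, seen = f"@fresh_dummy{next(counter)}", set()
            seen.add(value)
            result.append((value, version, dummy))
    return result

def is_visible(pos):
    """Positions starting with @ are not shown in the output."""
    return not pos.startswith("@")

print(expand_anon([("anon", "B1"), ("block", "al"), ("block", "tr"),
                   ("block", "al"), ("block", "tr")]))
```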

Concluding remarks concerning alignment:


Implementation

The main reason why I discuss a concrete implementation is to demonstrate that the ideas underlying the XML format above can indeed be realised. However, I cannot stress enough that the implementation exists by virtue of the format, not the other way around. In particular, if our implementation does not behave according to the specification above, then it is the implementation that is incorrect, not the format. Furthermore, I encourage others to make alternative implementations that can be used on other platforms.

The current implementation was developed for Unix, and is programmed in Perl. It can be called from the Unix command line as:

AELalign $config$
or
perl AELalign $config$
Here AELalign is a Perl script. For a call of the first type, the file AELalign should be made executable, and the path for Perl in the first line of the program should be adapted to concur with that on the local machine. Internally in the code, the program mdc2html from Geoffrey Watson is called to translate hieroglyphic codes to their pictorial representation. This needs to be installed along with AELalign.

The configuration file $config$ is a file that provides information about which resources are to be used and how. (As before, the $ delimit a variable name that is used for the purposes of this document. However, these symbols $ are not themselves to be typed in.)

A configuration file $config$, called e.g. peasant.conf, may look like this:

name = The Eloquent Peasant
directory = ../Texts/HTML
file = page
header = peas-header.html
resource = peas-concordances.xml
resource = Bauer-Graefe.xml
stop = "B2" "60" hi
start = "B2" "65" hi
stop = "B2" "100" al
stop = "B1" "10"
resource = Peas-Gardiner.xml
stop = "R Old" "59"
resource = Peas-Allen.xml
stop = "Bt" "25"
stop = "B2" "65" tr
resource = peas-hi.xml
resource = peas-lx.xml
break = "R"
break = "B1"
The order of the lines in a configuration file is irrelevant, except for the following:

We now discuss the respective types of line.

name = $document_name$

Here, $document_name$ is the name of the Egyptian document to which the resources refer. This name is used as title of the HTML files, and is printed at the top and bottom of each page. It makes no sense to have more than one line of this form in a given configuration file. If there is no line of this form, the default value of name is unnamed.

directory = $out_path$

Here, $out_path$ specifies the path of the directory where the created HTML files are to be stored. One should take care that the directory indeed exists before invoking the script, since the script does not create directories itself. It makes no sense to have more than one line of this form. If there is no line of this form, the default value of directory is ``.'' (current directory).

file = $html_name$

Here, $html_name$ specifies the prefix of the HTML files that are to be created. A file that is created will typically be named $html_name$.html or $html_name$$number$.html, for some number $number$. It makes no sense to have more than one line of this form. If there is no line of this form, the default value of file is unnamed.

header = $header_file$

Here, $header_file$ is a file that contains HTML text, and that is located in the same directory as the configuration file itself. The contents of the $header_file$ is inserted in the root page of the HTML output, called $html_name$.html. It makes no sense to have more than one line of this form. If there is no line of this form, no header will be added.

resource = $resource_file$

Here, $resource_file$ is a resource on an Egyptian text, satisfying the XML format specified above. The configuration file may contain one or more lines of this form. A $resource_file$ should be located in the same directory as the configuration file itself.

stop = "$version$" "$pos$" $type$

stop = "$version$" "$pos$"

start = "$version$" "$pos$" $type$

start = "$version$" "$pos$"

Here, $version$ and $pos$ are a version and a position specified in a tag of the form <coord version="$version$" pos="$pos$"/> in a resource file specified immediately above a sequence of lines of the above forms, and the <coord .../> should occur within a pair <texthi> and </texthi> if $type$ is hi, or within a pair <textal> and </textal> if $type$ is al, etc. A line of the second or fourth forms above is shorthand for 4 lines of the first or third forms, where $type$ is respectively hi, al, tr, lx.

The meaning of stop lines is that upon finding a coordinate of the specified form in a resource, all input for the specified version is ignored, until the end of the resource or until a coordinate is found that was specified in a start line, and then normal processing is resumed. For a given resource and version, it makes no sense to specify a start line without having specified a stop line for a preceding position.
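The effect of stop and start lines on the input for one text type of one resource can be sketched as follows in Python. This is not the actual Perl code; the function name apply_stop_start and the segment representation are hypothetical, and I assume that the text following the stop coordinate is itself already ignored, while the text following the start coordinate is kept.

```python
def apply_stop_start(segments, stops, starts):
    """Filter segments of one text type from one resource.

    segments: list of (version, pos, text) in file order.
    stops, starts: sets of (version, pos) taken from the configuration file.
    """
    ignoring = set()  # versions whose input is currently being ignored
    kept = []
    for version, pos, text in segments:
        if (version, pos) in stops:
            ignoring.add(version)
        elif (version, pos) in starts:
            ignoring.discard(version)
        if version not in ignoring:
            kept.append((version, pos, text))
    return kept

segments = [("B2", "59", "a"), ("B2", "60", "b"), ("B2", "61", "c"),
            ("B1", "1", "x"), ("B2", "65", "d"), ("B2", "66", "e")]
kept = apply_stop_start(segments, stops={("B2", "60")}, starts={("B2", "65")})
print(kept)
```

Other versions in the same resource are unaffected, which is why the B1 segment survives although it occurs between the stop and start coordinates of B2.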

break = "$version$"

break = "@anon"

The software generally tries to choose line breaks such that lines start with coordinates. If one wants the software to deviate from the standard strategy, and to try to choose line breaks in conjunction with coordinates from one or more particular versions, then one or more lines of the first form above can be used. If one wants the software to attempt to choose line breaks in conjunction with anonymous dummy positions, the second line can be used, possibly together with lines of the first form.
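To make the line types concrete, here is a minimal Python sketch of a reader for configuration files of this form. It is not the actual Perl code; the function name parse_config and the dictionary layout are hypothetical. The defaults are those stated above, and stop and start lines are attached to the resource line most closely preceding them.

```python
import re

TYPES = ("hi", "al", "tr", "lx")
# "version" "pos" followed by an optional text type
RANGE = re.compile(r'"([^"]*)"\s+"([^"]*)"(?:\s+(\w+))?$')

def parse_config(text):
    config = {"name": "unnamed", "directory": ".", "file": "unnamed",
              "header": None, "resources": [], "breaks": []}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in ("name", "directory", "file", "header"):
            config[key] = value
        elif key == "resource":
            config["resources"].append({"file": value, "stop": [], "start": []})
        elif key in ("stop", "start"):
            match = RANGE.match(value)
            if match is None or not config["resources"]:
                raise ValueError(f"bad line: {line!r}")
            version, pos, text_type = match.groups()
            # Without a type, the line is shorthand for all four types.
            for t in ([text_type] if text_type else TYPES):
                config["resources"][-1][key].append((version, pos, t))
        elif key == "break":
            config["breaks"].append(value.strip('"'))
    return config

SAMPLE = '''name = The Eloquent Peasant
resource = Bauer-Graefe.xml
stop = "B2" "60" hi
start = "B2" "65" hi
stop = "B1" "10"
break = "R"
'''
config = parse_config(SAMPLE)
print(config["resources"][0])
```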

Further notes on the implementation:


Example

Consider the configuration file peas.conf, which refers to the header file peas-header.html and to the following resources:

By calling:
AELalign peas.conf
these resources are compiled into a number of HTML files, the root of which is page0.html. It contains the header file, an index of versions and positions, and information from the headers of the resources.


Work in progress

After two years of experience with AELalign, we are planning a next version. The differences with version 0.1 will include the following:


Acknowledgements

Chris Macksey and Mark Wilson contributed essential ideas, which were incorporated into the present XML format. Many thanks go to Geoffrey Watson for adapting his program mdc2html to simplify integration into AELalign, and to Jenny Carrington for providing hieroglyphic text that I used for testing the software. Deidre Lonergan helped me to get the XML format right.