I first discuss the motivation behind the format. I then present the XML format itself, giving a formal description of the syntax in terms of a DTD and an informal description of the meaning. Next I describe an implementation that translates different resources in the proposed format to an aligned HTML representation. Lastly, I discuss examples of files in the XML format, and the output that results from applying the implementation.
If the section on the XML format is too tedious to read, just jump ahead and look at the section with the example.
To study a text on which few publications exist, one may open, say, two books containing resources on that text, and use the index fingers of the left and right hands to study the books in parallel, matching e.g. the hieroglyphic form of the text to the translation of the text. However, this strategy is not very effective for The Eloquent Peasant, and one is forced to study one or two resources at a time instead of studying all available resources in parallel.
The purpose of these pages is to investigate electronic means of alleviating this problem. We propose a format for representing resources on Egyptian texts that allows automatic alignment on the screen. The aligned representation offers a way to study several resources in parallel.
The present implementation is a strong case against systems that are WYSIWYG (`What You See Is What You Get'). Although for hieroglyphic, WYSIWYG editors are arguably more convenient to work with than those that are not WYSIWYG, other kinds of textual data, such as translations and transliterations, can be typed in just as easily by using one's favourite editor. After such resources have been put in electronic form, their combination can easily be achieved by simple implementations such as AELalign, described below. This represents a very flexible and efficient way of dealing with textual data, since:
A resource typically has the form:
<?xml version="1.0"?>
<!DOCTYPE resource SYSTEM "AELalign.0.1.dtd">
<resource>
<created>
Created by me today.
</created>
<header name="Name of the resource" url="http://www.my_site/my_file.html">
...header text...
</header>
<body>
<texthi>
<coord version="B1" pos="239"/>
D54-w-i-n-r:f-sxt-t:y-A1
<coord pos="240"/>
....
</texthi>
<textal>
...
<coord version="B1" pos="10"/>
sxtj pn Dd=f
<coord pos="11"/>
...
</textal>
<texttr>
...
<coord version = "B1" pos="10"/>
this peasant. He said
<coord version="B1" pos="11"/>
...
<coord version="R" pos="29"/>
and he loaded his donkeys
...
</texttr>
....
</body>
</resource>
The very first line of a resource is just there to tell us that the document is XML; the exact form of this line deserves no further attention here. The line of the resource starting with <!DOCTYPE contains the version number of the above-mentioned DTD. We add this to keep track of changes in the DTD that are bound to take place in the future.
The first piece of real information in a resource is text mentioning the person who created the file, the date of initial creation, and possibly the dates of subsequent changes. The name of the resource, which is part of the header, is typically a short string that represents for example the last name of the author of the resource. Optionally, one may also include the URL of the site where the XML file is located.
Below that, in the header text itself, one can add a description of the resource. It is recommended that at least the following pieces of information are included:
In the header a limited number of HTML tags are allowed, viz. <ul>, <li>, <i>, and <p>, and the corresponding closing tags. Note that in XML every opening tag should be matched by its closing tag.
In the body of the file, we find several (to be precise, zero or more) blocks of text between pairs of tags such as <texthi> and </texthi>, <textal> and </textal>, <texttr> and </texttr>, and (not indicated in the example above) <textlx> and </textlx>. Here, hi stands for `hieroglyphic', al stands for `alphabetic', i.e. transliteration using the (extended) Roman alphabet, tr stands for `translation', and lx stands for `lexical'. For ease of reference, we will continue to refer to such a piece of text between any of these four pairs of tags as a block.
Although for expositional reasons the example file above contains blocks of several distinct types, it is often better that different types be physically separated into different files, which enhances reuse. A notable exception is when the different types of text were extracted together from the same source; consider in particular examples from grammar books, where hieroglyphic, transliteration and translation occur close together. Apart from this exception, a resource typically contains one single block, and thereby only one type of text. (Actually, for representing concordances a resource may contain no block at all; see further below.)
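The layout of a resource described above can be explored with a few lines of Python. The following is only a minimal sketch, not part of AELalign: it uses the standard library module xml.etree.ElementTree, the function name summarize_resource is hypothetical, and the sample is a cut-down version of the example resource with the DOCTYPE line left out so that no DTD file is needed.

```python
import xml.etree.ElementTree as ET

# Cut-down resource modelled on the example above (DOCTYPE omitted on purpose).
SAMPLE = """<?xml version="1.0"?>
<resource>
<created>Created by me today.</created>
<header name="Name of the resource">...header text...</header>
<body>
<texthi><coord version="B1" pos="239"/> D54-w-i-n-r:f-sxt-t:y-A1 <coord pos="240"/></texthi>
<texttr><coord version="B1" pos="10"/> this peasant. He said <coord version="B1" pos="11"/></texttr>
</body>
</resource>"""

def summarize_resource(xml_text):
    """Return the resource name and, per block, its tag and raw coordinates."""
    root = ET.fromstring(xml_text)
    name = root.find("header").get("name")
    blocks = []
    for block in root.find("body"):
        # A version of None means the attribute was omitted in the file.
        coords = [(c.get("version"), c.get("pos")) for c in block.iter("coord")]
        blocks.append((block.tag, coords))
    return name, blocks

name, blocks = summarize_resource(SAMPLE)
for tag, coords in blocks:
    print(name, tag, coords)
```

The sketch reads the attributes exactly as written; it does not yet fill in omitted versions, and it does not validate the file against the DTD.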
We will first discuss the four types of text that may occur in a block, and then we will explain the specification of textual coordinates.
A vertical group has the form
A horizontal group has the form
A basic group has the form
A glyph has one of the following forms:
The glyphs of the first and second kinds above correspond to the so-called Gardiner codes. The glyphs of the third kind are mnemonics for the more conventional Gardiner codes; use of these mnemonics other than in the user interfaces of hieroglyphic editors is discouraged, since there is no stable and well-publicized standard for the mapping to Gardiner codes; furthermore, mnemonics are redundant for the purposes of exchanging data. (I readily admit however that my recommendation of avoiding mnemonics in internal representations may be difficult to follow if one uses a hieroglyphic editor that does not automatically map between mnemonics and Gardiner codes.)
Examples of glyphs are: A20, W24a, Aa27f, xpr and mAa; the last two are mnemonics for L1 and Aa11, respectively. A glyph is also a basic group; another example of a basic group is (X1:Z4). Glyphs and basic groups are also horizontal groups; other examples of horizontal groups are Q3*X1 and Q3*(X1:Z4). All of the above are also vertical groups; another example of a vertical group is Q3*(X1:Z4):N1.
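The nesting in these examples can be illustrated with a small recursive parser. This is a sketch under assumptions read off the examples only, not a rendering of the DTD or of the Manuel de Codage: a vertical group is taken to be horizontal groups joined by ":", a horizontal group to be basic groups joined by "*", and a basic group to be a glyph or a bracketed vertical group. Glyph names are not checked against any sign list, and white-space and the "-" separator are not handled. All function names are hypothetical.

```python
import re

# Glyph names (Gardiner codes or mnemonics) and the four structural symbols.
TOKEN = re.compile(r"[A-Za-z0-9]+|[:*()]")

def parse_group(code):
    """Parse one group into nested tuples; a lone glyph is returned as a string."""
    tokens = TOKEN.findall(code)
    if "".join(tokens) != code:
        raise ValueError("unexpected character in %r" % code)
    tree, end = _vertical(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing input in %r" % code)
    return tree

def _sequence(tokens, i, separator, label, parse_part):
    """Parse parts joined by `separator`; wrap them only if there are several."""
    part, i = parse_part(tokens, i)
    parts = [part]
    while i < len(tokens) and tokens[i] == separator:
        part, i = parse_part(tokens, i + 1)
        parts.append(part)
    return (parts[0] if len(parts) == 1 else (label, parts)), i

def _vertical(tokens, i):
    return _sequence(tokens, i, ":", "vert", _horizontal)

def _horizontal(tokens, i):
    return _sequence(tokens, i, "*", "horiz", _basic)

def _basic(tokens, i):
    if i >= len(tokens):
        raise ValueError("group ends too early")
    if tokens[i] == "(":
        tree, i = _vertical(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("missing closing bracket")
        return tree, i + 1
    if tokens[i] in ":*)":
        raise ValueError("glyph expected, found %r" % tokens[i])
    return tokens[i], i + 1

print(parse_group("Q3*(X1:Z4):N1"))
```

Under these assumptions Q3*(X1:Z4):N1 comes out as a vertical group whose first part is the horizontal group Q3*(X1:Z4) and whose second part is the glyph N1.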
No elegant solution has yet been found for denoting cartouches. The traditional ASCII notation using < and > (see Manuel de Codage, or unofficial mirror) is incompatible with XML.
White-space may be inserted into the encoding of hieroglyphic text, anywhere except inside of the denotation of glyphs. When it is used, white-space has the purpose of delimiting ``words'' (words in hieroglyphic are then taken to roughly correspond to characters between white space in the transliteration). Visualization tools for hieroglyphic texts should ignore white-space in the ASCII encoding.
The use of exclamation marks (as in the end-of-line marker ``!'' and in the end-of-page marker ``!!'') plays no role here, and should be avoided for the purposes of AELalign.
Several other possibilities of ASCII encoding of hieroglyphic, such as the use of different colours, fine-tuning of positioning of signs and scaling, are not supported at the present time.
In the near future, we will remove the Manuel de Codage representation of hieroglyphic entirely and replace it with the superior RES representation.
Inside of a block with hieroglyphic text, one may insert normal text, delimited by <text> and </text>, and footnotes delimited by <no> and </no>, as for example in
... Q3*(X1:Z4):N1 <text>(sic!)</text> C4<no>As
confirmed by Gardiner, this sign <hi>C4</hi>
reads <al>DHwtj</al>,
i.e. <tr>Thoth</tr>.</no> T8a ...
The former of these two possibilities is intended for very short pieces of text, typically single words. The footnotes however have fewer restrictions and may contain nested pieces of text delimited by <hi> and </hi>, <al> and </al>, and <tr> and </tr>, for hieroglyphic text, transliterations and translations, respectively, as demonstrated in the example above.
Also a block with transliterations may contain normal text, delimited by <text> and </text>, and footnotes delimited by <no> and </no>.
Translations may contain pieces of transliteration delimited by <al> and </al>. This is especially useful when no exact translation of a word is available, as in:
He sees a <al>narw</al>-bird.
Translations may contain footnotes, but no text delimited by <text> and </text>, which would make little sense.
A lexical entry in its most general form looks like:
<lx
texthi="R8-O6"
textal="Hwt-nTr"
texttr="Heiligtum"
textform="honorific transposition"
cite="Dictionary of Ancient Egypt"
keyhi="O6"
keyal="Hwt"
keytr="enclosure"
keyform="noun, sing."
dicthi="R8-O6-X1:O1"
dictal="Hwt ntr"
dicttr="tempel"
dictform="dir. genitive"/>
The attributes text.. refer to a phrase in the text at hand. One may indicate the hieroglyphic, the transliteration, the translation, some indication of the orthographic or syntactic form of the phrase, any combination of these, or none at all.
The value of cite refers to the dictionary. This attribute is typically omitted if the only objective of the lexical entries is to develop one's own word list.
The key is the word under which the phrase may be found in the dictionary. Typically, keyal would be either equal to textal or a substring of it (except for example in the case of verbs, where keyal could contain a weak consonant that is not found in textal). As key one may include any combination of hieroglyphic, transliteration and translation, as the values of key..; one may also specify aspects of the form of the key, such as gender and verb class. Further, by means of the values of dict.., one may provide the phrase as it is found in the dictionary, or as one wants to have it found in one's own word list.
All of the attributes are optional. Typically, one will include only a few attributes. An example from The Eloquent Peasant:
....
<coord version="B1" pos="228"/>
<lx
textal="abt=f"
cite="Faulkner"
keyal="iab"
dicttr="join s'one"/>
There are two kinds of tag that are necessary for alignment. The general forms are:
The most obvious use of coord tags is within blocks. Such a tag then specifies that the following text belongs to a certain version, and occurs just after the specified position. This information is valid until we encounter the next coordinate tag in the block, which may specify a different version and a different position.
If one omits a value for the version in a coordinate tag, the version from the most closely preceding coordinate tag within the same block is assumed. E.g.
<texttr>
<coord version="B1" pos="26"/>
...some text...
<coord pos="27"/>
is equivalent to:
<texttr>
<coord version="B1" pos="26"/>
...some text...
<coord version="B1" pos="27"/>
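The inheritance of an omitted version can be sketched in Python as follows. This is an illustration only, assuming the block is available as a well-formed XML string; the function name resolve_versions is hypothetical and does not come from AELalign.

```python
import xml.etree.ElementTree as ET

def resolve_versions(block_xml):
    """Return (version, pos) for each coord tag, filling in omitted versions."""
    block = ET.fromstring(block_xml)
    current = None  # no version is known before the first tag that names one
    resolved = []
    for coord in block.iter("coord"):
        # An omitted version is taken from the most closely preceding coord tag.
        current = coord.get("version", current)
        # An omitted position is treated as the empty string.
        resolved.append((current, coord.get("pos", "")))
    return resolved

BLOCK = ('<texttr><coord version="B1" pos="26"/>...some text...'
         '<coord pos="27"/></texttr>')
print(resolve_versions(BLOCK))
```

Because the loop runs within one block only, a version never leaks from one block into the next, in line with the rule stated above.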
If the position is omitted, this is equivalent to stating pos="", i.e. the position is taken to be the empty string. E.g.
<coord version="B1"/>
is equivalent to:
<coord version="B1" pos=""/>
For an align tag, all missing attributes are taken to specify the empty string. E.g.
<align pos="27"/>
is equivalent to:
<align version="" pos="27"/>
Before we can explain the meaning of coordinate and alignment tags more closely, we first have to introduce the concept of stream. We define a stream to be the text of one single type (hieroglyphic, transliteration, translation, or lexical information) for one single version name that occurs in one single resource. This means that one stream cannot contain information from different resources (= different files), nor can it contain information for different version names, nor can it contain different types of text.
Software that processes the XML files will group together all text that belongs to the same stream. For a given stream, the text is arranged in the same order as it occurs in the file. For example, if a file contains:
<texttr>
<coord version="B1" pos="26"/>
...first bit of text from version B1...
<coord version="R" pos="7.6"/>
...other text not belonging to the same version...
</texttr>
<textal>
<coord version="B1" pos="26"/>
...yet another stream since text is of different type...
</textal>
<texttr>
<coord version="B1" pos="27"/>
...same stream as first text line above...
</texttr>
then the last text line of the example will be arranged in the proper stream directly after the first text line of the example, since both lines belong to the same version and occur in blocks of the same type ``tr''.
The function of coordinates is twofold: first, they are to be included in the output streams by whatever software turns the XML files into a more readable format. This means they are not only capable of specifying the streams to which text belongs, they are themselves text elements in the streams. Secondly, they are there to tell the software to arrange all streams for the same version (but for different resources and/or different text types) such that identical positions are aligned. In the above example, the translation following position 26 of version B1 is to be aligned with the transliteration following position 26 of the same version.
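The grouping into streams and the alignment on identical positions can be sketched in Python. This is a simplified illustration and not the AELalign code: it handles a single resource (so the resource is left out of the stream key), ignores text before the first coordinate of a block, and does not handle footnotes or other nested tags. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_streams(body_xml):
    """Map (text type, version) to the ordered coordinates and text of a stream."""
    streams = defaultdict(list)
    for block in ET.fromstring(body_xml):
        text_type = block.tag[len("text"):]          # "texttr" -> "tr"
        version = None
        for coord in block.iter("coord"):
            version = coord.get("version", version)  # inherit an omitted version
            stream = streams[(text_type, version)]
            stream.append(("coord", coord.get("pos", "")))
            text = (coord.tail or "").strip()
            if text:
                stream.append(("text", text))
    return dict(streams)

def shared_positions(streams, version):
    """Positions of `version` that occur in more than one stream."""
    counts = defaultdict(int)
    for (text_type, stream_version), items in streams.items():
        if stream_version == version:
            for pos in {value for kind, value in items if kind == "coord"}:
                counts[pos] += 1
    return sorted(pos for pos, n in counts.items() if n > 1)

BODY = """<body>
<texttr>
<coord version="B1" pos="26"/>
...first bit of text from version B1...
<coord version="R" pos="7.6"/>
...other text not belonging to the same version...
</texttr>
<textal>
<coord version="B1" pos="26"/>
...yet another stream since text is of different type...
</textal>
<texttr>
<coord version="B1" pos="27"/>
...same stream as first text line above...
</texttr>
</body>"""

streams = build_streams(BODY)
print(streams[("tr", "B1")])
print(shared_positions(streams, "B1"))
```

The positions returned by shared_positions are the points at which the streams of one version would be lined up; for the example this is position 26 of version B1.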
However, coordinate tags cannot contribute to the alignment of different versions. For this purpose, the alignment tags are used. Alignment tags are also elements in streams, but are typically not visible by themselves in the output of the software. What is visible however is that text following an alignment tag in a stream is aligned with the specified position and version as they occur in other streams. Consider for example a file containing:
<coord version="B1" pos="26"/>
the peasant said to this wife of his:
<align version="C" pos="6.7"/>
see, I'm going to Egypt to fetch
<coord version="B1" pos="27"/>
food for the children
...
<coord version="C" pos="6.6"/>
thereupon his wife said to him:
<coord version="C" pos="6.7"/>
you have to fetch food for the children
Here, the alignment tag in the former of the two streams states that the beginning of ``see, I'm going'' in version B1 is to be aligned with position 6.7 of version C.
In the above example, alignment tags occur in blocks of translations. They may however also occur in other types of text. The best place for alignment tags is undoubtedly in hieroglyphic, because it is there that alignment can be specified most accurately. Note that if the alignment of the hieroglyphic of two versions is specified in this way, one does not also need to add alignment tags to the translation or transliteration for these two versions, since enough information is already present for software to align all streams belonging to the two versions. However, the more alignment tags are added, the more fine-grained the alignment may be.
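How alignment tags inside a block might be collected is sketched below in Python. This is an assumption-laden illustration rather than the AELalign implementation: the block is wrapped in a texttr element for the purpose of the example, nested tags are not handled, and the function name alignment_links is hypothetical.

```python
import xml.etree.ElementTree as ET

def alignment_links(block_xml):
    """Return (own version, (target version, target pos), text after the tag)."""
    version = None
    links = []
    for tag in ET.fromstring(block_xml):
        if tag.tag == "coord":
            version = tag.get("version", version)
        elif tag.tag == "align":
            # Missing attributes of an align tag are taken to be the empty string.
            target = (tag.get("version", ""), tag.get("pos", ""))
            links.append((version, target, (tag.tail or "").strip()))
    return links

BLOCK = """<texttr>
<coord version="B1" pos="26"/>
the peasant said to this wife of his:
<align version="C" pos="6.7"/>
see, I'm going to Egypt to fetch
<coord version="B1" pos="27"/>
food for the children
</texttr>"""

print(alignment_links(BLOCK))
```

Each link records which version the tag occurs in, which version and position it points to, and the text that is to be placed at that point.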
The coordinate and alignment tags were until now assumed to occur within blocks, but they may also occur outside of blocks. If it occurs outside of blocks, a coordinate tag specifies that at most four blocks that follow with different types (before the next coordinate tag outside of a block is encountered) are to begin with that coordinate tag. If two or more blocks that follow are of the same type, then the coordinate tag is only inserted at the beginning of the first; the text in the second and following blocks of that type is however written to the same stream. For example,
<coord version="B1 Old" pos="83"/>
<textal>
wnn.k Hr rdit di.tw n.f aqw
</textal>
<texttr>
thou shalt cause provisions
</texttr>
<texttr>
to be given to him.
</texttr>
is equivalent to:
<textal>
<coord version="B1 Old" pos="83"/>
wnn.k Hr rdit di.tw n.f aqw
</textal>
<texttr>
<coord version="B1 Old" pos="83"/>
thou shalt cause provisions
to be given to him.
</texttr>
This representation makes sense in particular if the resource contains short examples from grammar books, comprising two or more types of text (hieroglyphic, transliteration, or translation) for each example. Note: one should generally avoid using coordinate tags both within and outside of blocks in one and the same file, since this may lead to confusion.
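The rule for coordinate tags outside of blocks can be sketched in Python as follows. The sketch assumes the tags are direct children of a body element and that blocks contain plain text only; the function name push_coords_into_blocks is hypothetical.

```python
import xml.etree.ElementTree as ET

def push_coords_into_blocks(body_xml):
    """Return (block tag, leading coordinate or None, text) for each block."""
    pending = None   # the most recent coordinate tag seen outside of a block
    seen = set()     # block types that have already received that coordinate
    result = []
    for child in ET.fromstring(body_xml):
        if child.tag == "coord":
            pending = (child.get("version"), child.get("pos", ""))
            seen = set()
        elif child.tag.startswith("text"):
            lead = None
            if pending is not None and child.tag not in seen:
                lead = pending          # only the first block of each type gets it
                seen.add(child.tag)
            result.append((child.tag, lead, (child.text or "").strip()))
    return result

BODY = """<body>
<coord version="B1 Old" pos="83"/>
<textal>wnn.k Hr rdit di.tw n.f aqw</textal>
<texttr>thou shalt cause provisions</texttr>
<texttr>to be given to him.</texttr>
</body>"""

for row in push_coords_into_blocks(BODY):
    print(row)
```

The second block of type tr receives no coordinate of its own; its text simply continues the stream started by the first.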
Another use of coordinate tags outside of blocks is in conjunction with alignment tags outside of blocks. If an alignment tag is used in this way, it means that the specified version and position is to be aligned with the version and position of the most closely preceding coordinate tag. This is useful for specifying concordances. An example of concordances for The Eloquent Peasant is the following:
...
<coord version="R Old" pos="145"/> <align version="R" pos="20.7"/>
<coord version="R Old" pos="146"/> <align version="R" pos="21.1"/>
<coord version="R Old" pos="160"/> <align version="R" pos="23.1"/>
<coord version="R Old" pos="161"/> <align version="R" pos="23.2"/>
...
This states that any occurrence of for example position 20.7 for version name R is to be aligned with position 145 for version name R Old. It is not advisable to combine this special use of the tags with the more common use internal in blocks, in one and the same file, since this may lead to confusion.
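Reading such a concordance amounts to pairing every alignment tag with the coordinate tag that most closely precedes it. A minimal Python sketch, assuming the tags are direct children of a body element and using the hypothetical function name read_concordance:

```python
import xml.etree.ElementTree as ET

def read_concordance(body_xml):
    """Pair each align tag outside blocks with the preceding coord tag."""
    pairs = []
    anchor = None
    for tag in ET.fromstring(body_xml):   # direct children only, i.e. outside blocks
        if tag.tag == "coord":
            anchor = (tag.get("version"), tag.get("pos", ""))
        elif tag.tag == "align" and anchor is not None:
            pairs.append((anchor, (tag.get("version", ""), tag.get("pos", ""))))
    return pairs

BODY = """<body>
<coord version="R Old" pos="145"/> <align version="R" pos="20.7"/>
<coord version="R Old" pos="146"/> <align version="R" pos="21.1"/>
</body>"""

print(read_concordance(BODY))
```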
Although a position can be an arbitrary string, there is one symbol that has a special meaning: if a position starts with @, then a coordinate tag in which it occurs should not be output by the software. This can be used to force alignment using `dummy' positions, positions that do not correspond to a natural boundary in the manuscript such as a column break, but rather represent for example segmentation into syntactic units.
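In code, the special role of @ reduces to a filter applied when a stream is written out. The sketch below is illustrative only: it assumes a stream is held as a list of (kind, value) pairs, and both the function name visible_items and the position @clause1 are hypothetical.

```python
def visible_items(stream):
    """Drop coordinates whose position starts with '@'; keep everything else."""
    return [(kind, value) for kind, value in stream
            if not (kind == "coord" and value.startswith("@"))]

STREAM = [("coord", "@clause1"), ("text", "sxtj pn"), ("coord", "27"), ("text", "Dd=f")]
print(visible_items(STREAM))
```

The dummy coordinate still takes part in alignment before this filter is applied; only its printed form is suppressed.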
In most cases, use of explicit dummy positions should be avoided, since they may lead to resources that are paired to other resources by means of position names that are specific to one author, and that are thereby difficult to reuse by others. However, dummy positions can also be written implicitly, to align different types of text within one resource. This is done by writing a coordinate tag outside of a block with the special position @anon, which stands for ``anonymous''. The result of such a tag is that a unique dummy position is generated, and the resulting coordinate tag is inserted at the beginning of each of at most four following blocks of different types. If a second block (or third block, etc.) of the same type is seen, a new unique dummy position is generated. Thus:
<coord version="B1" pos="@anon"/>
<textal>
...some transliteration <coord pos="121"/> and some more ...
</textal>
<texttr>
...some translation <coord pos="121"/> and some more ...
</texttr>
<textal>
...more transliteration <coord pos="122"/>
</textal>
<texttr>
...more translation <coord pos="122"/>
</texttr>
is roughly equivalent to:
<textal>
<coord version="B1" pos="@fresh_dummy1"/>
...some transliteration <coord pos="121"/> and some more ...
</textal>
<texttr>
<coord version="B1" pos="@fresh_dummy1"/>
...some translation <coord pos="121"/> and some more ...
</texttr>
<textal>
<coord version="B1" pos="@fresh_dummy2"/>
...more transliteration <coord pos="122"/>
</textal>
<texttr>
<coord version="B1" pos="@fresh_dummy2"/>
...more translation <coord pos="122"/>
</texttr>
possibly with another string instead of fresh_dummy, which in any case should remain invisible to the user.
In general, at most one coordinate tag with position @anon is needed in a resource, but in some cases, for example when for certain parts of the text only transliterations are available and for some others only translations, it may be required to manually generate new unique dummy positions by means of several coordinate tags of this form. Thus:
<coord version="B1" pos="@anon"/>
<textal>
...transliteration <coord pos="121"/> ...
</textal>
<coord pos="@anon"/>
<texttr>
...translation for another part <coord pos="122"/> ...
</texttr>
is roughly equivalent to:
<textal>
<coord version="B1" pos="@fresh_dummy1"/>
...transliteration <coord pos="121"/> ...
</textal>
<texttr>
<coord version="B1" pos="@fresh_dummy2"/>
...translation for another part <coord pos="122"/> ...
</texttr>
Had we instead omitted the second coordinate tag with position @anon, the beginnings of the two blocks would be erroneously aligned by means of the same dummy position @fresh_dummy1.
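The generation of fresh dummy positions for @anon can be sketched in Python. This is one possible reading of the rules above, not the AELalign code: the dummy names follow the @fresh_dummy pattern used in the examples, an @anon tag without a version is assumed to keep the version of the previous one, and the function name expand_anon is hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

def expand_anon(body_xml):
    """Return (block tag, version, dummy position) for blocks governed by @anon."""
    fresh = ("@fresh_dummy%d" % n for n in itertools.count(1))
    version = dummy = None
    seen = set()     # block types that already use the current dummy position
    result = []
    for child in ET.fromstring(body_xml):
        if child.tag == "coord" and child.get("pos") == "@anon":
            version = child.get("version", version)
            dummy = next(fresh)
            seen = set()
        elif child.tag.startswith("text") and dummy is not None:
            if child.tag in seen:        # a repeated block type starts a new round
                dummy = next(fresh)
                seen = set()
            seen.add(child.tag)
            result.append((child.tag, version, dummy))
    return result

BODY = """<body>
<coord version="B1" pos="@anon"/>
<textal>...transliteration <coord pos="121"/> ...</textal>
<coord pos="@anon"/>
<texttr>...translation for another part <coord pos="122"/> ...</texttr>
</body>"""

print(expand_anon(BODY))
```

With the second @anon tag removed from BODY, both blocks would receive @fresh_dummy1, which is the erroneous alignment described above.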
Concluding remarks concerning alignment:
arr<coord version="B1" pos="67"/>yt
is distinct from
arr <coord version="B1" pos="67"/> yt
Only the former has the intended meaning in this case, viz. that there is one single word arryt that is broken up in the manuscript by the transition between two columns. Software for AELalign may in this case however print yt before the coordinate, i.e. append it after arr, to improve the readability.
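Whether a coordinate tag sits inside a word can be decided from the characters directly around it. The Python sketch below is an illustration under the assumption that the block is wrapped in a textal element and contains no nested tags other than coord; the function name coords_inside_words is hypothetical.

```python
import xml.etree.ElementTree as ET

def coords_inside_words(block_xml):
    """For each coord tag, tell whether it splits a word (no white-space around it)."""
    block = ET.fromstring(block_xml)
    before = block.text or ""            # text in front of the first child tag
    flags = []
    for child in block:
        after = child.tail or ""         # text between this tag and the next one
        if child.tag == "coord":
            joined_left = before != "" and not before[-1].isspace()
            joined_right = after != "" and not after[0].isspace()
            flags.append(joined_left and joined_right)
        before = after
    return flags

print(coords_inside_words('<textal>arr<coord version="B1" pos="67"/>yt</textal>'))
print(coords_inside_words('<textal>arr <coord version="B1" pos="67"/> yt</textal>'))
```

A tool could use such a test to decide when the remainder of a word may be printed before the coordinate.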
The current implementation was developed for Unix, and is programmed in Perl. It can be called from the Unix command line as:
AELalign $config$
or
perl AELalign $config$
Here AELalign is a Perl script. For a call of the first type, the file AELalign should be made executable, and the path for Perl in the first line of the program should be adapted to concur with that on the local machine. Internally in the code, the program mdc2html from Geoffrey Watson is called to translate hieroglyphic codes to their pictorial representation. This needs to be installed along with AELalign.
The configuration file $config$ is a file that provides information about which resources are to be used and how. (As before, the $ delimit a variable name that is used for the purposes of this document. However, these symbols $ are not themselves to be typed in.)
A configuration file $config$, called e.g. peasant.conf, may look like this:
name = The Eloquent Peasant
directory = ../Texts/HTML
file = page
header = peas-header.html
resource = peas-concordances.xml
resource = Bauer-Graefe.xml
stop = "B2" "60" hi
start = "B2" "65" hi
stop = "B2" "100" al
stop = "B1" "10"
resource = Peas-Gardiner.xml
stop = "R Old" "59"
resource = Peas-Allen.xml
stop = "Bt" "25"
stop = "B2" "65" tr
resource = peas-hi.xml
resource = peas-lx.xml
break = "R"
break = "B1"
The order of the lines in a configuration file is irrelevant, except for the following:
The meaning of stop lines is that upon finding a coordinate of the specified form in a resource, all input for the specified version is ignored, until the end of the resource or until a coordinate is found that was specified in a start line, and then normal processing is resumed. For a given resource and version, it makes no sense to specify a start line without having specified a stop line for a preceding position.
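Reading a configuration file and applying stop and start lines can be sketched in Python. Several points here are assumptions rather than documented behaviour: that stop and start lines belong to the most closely preceding resource line, that a line without a text type applies to all types, and that the stop coordinate itself is dropped while the start coordinate is kept. The break lines are only collected. The sketch is written in Python rather than the Perl of the actual implementation, and the function names are hypothetical.

```python
import shlex

def parse_config(text):
    """Split a configuration into general settings, resources with rules, and breaks."""
    settings, resources, breaks = {}, [], []
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "resource":
            resources.append({"file": value, "rules": []})
        elif key in ("stop", "start"):
            fields = shlex.split(value)          # '"R Old" "59"' -> ['R Old', '59']
            text_type = fields[2] if len(fields) > 2 else None
            # Assumption: the rule belongs to the preceding resource line.
            resources[-1]["rules"].append((key, fields[0], fields[1], text_type))
        elif key == "break":
            breaks.append(shlex.split(value)[0])
        else:
            settings[key] = value
    return settings, resources, breaks

def filter_stream(items, version, text_type, rules):
    """Drop items from a matching stop coordinate up to the next matching start."""
    active = True
    kept = []
    for kind, value in items:
        if kind == "coord":
            for rule, r_version, r_pos, r_type in rules:
                if r_version == version and r_pos == value and r_type in (None, text_type):
                    active = (rule == "start")
        if active:
            kept.append((kind, value))
    return kept

CONFIG = """name = The Eloquent Peasant
resource = peas-concordances.xml
resource = Bauer-Graefe.xml
stop = "B2" "60" hi
start = "B2" "65" hi
stop = "B1" "10"
break = "R"
"""

settings, resources, breaks = parse_config(CONFIG)
print(settings["name"], [r["file"] for r in resources], breaks)
```

A stream is represented here as a list of (kind, value) pairs, with kind being either coord or text; filter_stream is then called once per stream with the rules of the resource the stream comes from.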
AELalign peas.conf
these resources are compiled into a number of HTML files, the root of which is page0.html. It contains the header file, an index of versions and positions, and information from the headers of the resources.