Conversion of Manuel de Codage into RES

The Manuel de Codage (MdC) is too flawed to be used if it can be avoided, but it would be a waste not to benefit from existing hieroglyphic texts that are encoded in MdC. The implementation of AELalign is to include an automatic converter from a MdC file to a file containing snippets of RES, mixed with AELalign tags. Further post-editing is needed to turn this into a complete and valid file, in say the AELalign format.

Introduction

Automatic conversion from MdC to RES is marred by a number of problems. The first is that the last printed version of the Manuel de Codage from 1988 (referred to as MdC-88) is so vague:

Jan Buurman, Nicolas Grimal, Michael Hainsworth, Jochen Hallof, and Dirk van der Plas. Inventaire des signes hiéroglyphiques en vue de leur saisie informatique. Institut de France, Paris, 1988.

Later on-line documentation by Hans van den Berg from 1997 (referred to as MdC-97) contains partly contradictory information, and is just as vague.

A second problem is that no one is using the Manuel de Codage as documented. All tools in use today implement some dialect of the 'standard', consisting of some interpretation of the constructions in the scanty official documents, plus various additional constructions that attempt to make up for the inadequacies in the power of MdC, with moderate success so far.

The grammar below tries to capture a large number of MdC dialects, for the purpose of conversion to RES. Given this objective, the grammar is necessarily overgenerating, i.e. it may accept files that would not be accepted by any MdC tool.

The sources of information I've relied on include:

The MdC parser in Java of Serge Rosmorduc's JSesh.
A description of MdC by Serge Rosmorduc.

Notation

Literal strings will be enclosed in double quotes.

Expressions of the form ( ... )*, ( ... )+, [ ... ] have the same meaning as in the RES grammar. The pipe symbol "|", as in A | B, separates alternatives.

An expression of the form character_except_X means any character except X. We use for example character_except_plus, where plus stands for the symbol "+".

The MdC syntax is so chaotic that an unambiguous grammar could be rather ugly. In the case of ambiguities, we assume a tokenizer with the usual longest-match strategy.
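The longest-match assumption can be illustrated with a minimal sketch. The list of literals below is a hypothetical, partial subset chosen only for illustration; it is not the token set of the actual implementation, and the function name tokenize is likewise an assumption.

```python
# Hypothetical, partial list of MdC literals (an assumption for illustration).
LITERALS = ["!!", "!", "??", "?", "##", "#", "&&", "&", "..", ".", ":", "*", "-"]


def tokenize(text):
    """Split text into tokens, preferring the longest literal at each position."""
    # Trying longer literals first implements longest-match for fixed strings.
    ordered = sorted(LITERALS, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for literal in ordered:
            if text.startswith(literal, i):
                tokens.append(literal)
                i += len(literal)
                break
        else:
            # Anything else (e.g. a glyph name) runs up to the next literal.
            j = i + 1
            while j < len(text) and not any(text.startswith(l, j) for l in ordered):
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens
```

With this strategy "A1##B1" yields the single operator "##" rather than two occurrences of "#", which is the behaviour assumed throughout the grammar.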

Grammar

MdC file


mdc_file ->
	( sep | top_group )*

sep ->
	"-" | whitespace | "_" | "=" | toggle

An MdC file consists of a sequence of top_groups, interspersed with hyphens, whitespace, grammar symbols (single and double underscores and the equals sign), or toggles. Spurious occurrences of such symbols may appear at the beginning and end of the file. The notation with underscores seems to be from Winglyph.

Outside of top_groups, there is no further hierarchical structure in the file, whereas the top_groups may include comments, transliterations and translations. One implication is that it would be hard to collect material for the prelude of an AELalign document. The conversion therefore tries to collect valid snippets of RES, interspersed with other material that must be reworked during post-editing.

groups


top_group ->
	quadrat |
	box |
	"^" |
	"?" |
	"??" |
	"|" ( character_except_minus )* |
	"!" [ "=" integer "%" ] |
	"!!" |
	"+s" |
	"+t" subtext |
	"+" ( "+" | "l" | "i" | "c" | "g" | "r" | "h" | "b" ) subtext |
	"{l" integer "," integer "}" |
	"{L" integer "," integer "}" |
	"?" integer 

subtext ->
	( character_except_plus )*

The top_groups occur at the top level of MdC files.

The pattern "^" can be translated into RES as, for example, "<" * "-" * "-" * ">".

The pattern "?" can be translated into RES as, for example, "-" * "-", and the pattern "??" as, for example, "-" * "-" * "-" * "-".

The character "|" denotes that the following string, up to the next "-", is a position in the manuscript. It can best be translated using a coordinate-tag from AELalign.

The pattern "!" stands for end of line. The addition "=" integer "%" is from MdC-97, and influences the distance between the present line and the following. The pattern "!!" stands for end of page. Both patterns should be ignored in the translation to RES.

The pattern "+s" indicates that the following is hieroglyphic. The pattern "+t" indicates transliteration, up to the next "+" (alternatively, up to the next pattern of the form "+s", "+t", etc.); according to the JSesh manual, "+" inside subtext can be escaped as "\+". Transliterations can be enclosed in corresponding XML tags from AELalign.

Patterns "++", "+l", etc., signal other types of text, which are best translated verbatim, awaiting post-editing. The pattern "+b" is in MdC-97, but not in MdC-88. For robustness, one might accept all patterns consisting of "+" followed by a lower-case letter.
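The switching between hieroglyphic, transliteration and other text types could be handled roughly as sketched below. This is only an illustration under stated assumptions: it follows the robust reading in which any "+" followed by a lower-case letter (or a second "+") starts a new segment, it honours the "\+" escape, and it assumes that a file starts in hieroglyphic mode. The names SWITCH and split_segments are hypothetical.

```python
import re

# A "+" not preceded by a backslash, followed by a lower-case letter or "+".
SWITCH = re.compile(r"(?<!\\)\+([a-z+])")


def split_segments(text, initial="s"):
    """Return (kind, content) pairs, where kind is the character after "+"."""
    segments = []
    kind, start = initial, 0
    for match in SWITCH.finditer(text):
        if match.start() > start:
            segments.append((kind, text[start:match.start()]))
        kind, start = match.group(1), match.end()
    if start < len(text):
        segments.append((kind, text[start:]))
    # Undo the escaping of "+" inside the collected subtexts.
    return [(k, s.replace("\\+", "+")) for k, s in segments]
```

Segments of kind "s" would then be passed to the hieroglyphic parser, segments of kind "t" wrapped in transliteration tags, and the remaining kinds copied verbatim for post-editing.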

The non-MdC patterns starting with "{l" or "{L", mentioned in the JSesh documentation, denote horizontal lines, and are ignored here.

The non-MdC pattern "?" integer, mentioned in the JSesh documentation, denotes that the following should occur some distance from the left margin. This is further ignored here.

quadrats


quadrat ->
	horizontal_group ( ( sep )* ":" ( sep )* horizontal_group )* [ ( sep )* shading ]

horizontal_group ->
	basic_group ( ( sep )* "*" ( sep )* basic_group )* 

basic_group ->
	hieroglyph |
	stack |
	ligature |
	philology |
	subgroup 

subgroup ->
	"(" ( sep | simple_group )* ")"

simple_group ->
	quadrat |
	box

The MdC documents are unclear as to whether toggles should be allowed within quadrats. We allow them between basic_groups and operators ":" and "*" however, as part of seps.

A simple_group is more restricted than a top_group. Note however that hyphens are allowed in simple_groups, and they may best be translated to colons within RES groups.

Undocumented are the operators "^^^" and "&&&" in the JSesh parser. I have a hunch these might mean something similar to ":[fit]" and "*[fit]".

stacks


stack ->
	hieroglyph ( sep )* ( "#" | "##" ) ( sep )* hieroglyph

Two hieroglyphs (which may also be shades) can be stacked. The corresponding RES expression uses stack.

The notation with '#' is from MdC-88. JSesh prefers use of '##' for stacking to avoid confusion with the use of '#' for shading. Note that since "1" is a hieroglyph (see glyph_name below), the MdC expression "A1#1" is ambiguous. We assume the tokenizer would interpret this as an occurrence of shading.

The JSesh parser accepts the undocumented operators "**" and "^^". In the current implementation, these can be used instead of "#" and "##".

ligatures


ligature ->
	glyph ( "&" glyph )+ |
	glyph [ ligature_pos ] ( "&&" glyph [ ligature_pos ] )+

ligature_pos ->
	"{{" integer "," integer "," integer "}}"

The construction with "&" is the most harmful feature introduced in MdC dialects after 1988. A ligature of this form is a combination of signs with relative positioning defined within the tool being used. Often the result of say A&B is similar to a RES expression of the form insert[te](A,B), but it may also be insert[ts](A,B), insert(A,B) or stack(A,B). The only viable strategy is therefore to collect common ligatures, together with their intended meanings. The EGPZ 1.0 Sign List contains no fewer than 400 such ligatures (which is a wasteful and ludicrous pollution of the character set). In time, we hope to implement automatic conversion from all of these ligatures to a more sustainable form, starting with:

EGPZ ligature	RES
A14&Z2	insert[bs,sep=0.5](A14,Z1*[sep=1.5]Z1*[sep=1.5]Z1)
A17&Z2	insert[be](A17*[sep=0.0]empty[width=0.2,height=0.0],Z1*Z1*Z1)
A24&Z2d	insert[te,sep=0.5](A24*[sep=0.0]empty[width=0.3],Z1*[sep=0.5]Z1*[sep=0.5]Z1)
A51&X1	insert[b,sep=0.2](A51,X1)
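The strategy of collecting common ligatures together with their intended meanings amounts to a lookup table. A minimal sketch, using only the four entries listed above; the names LIGATURES and convert_ligature are hypothetical, and returning None for an unknown ligature is an assumption meant to let the caller flag the group for post-editing.

```python
# The four ligatures listed above, mapped to their RES equivalents.
LIGATURES = {
    "A14&Z2": "insert[bs,sep=0.5](A14,Z1*[sep=1.5]Z1*[sep=1.5]Z1)",
    "A17&Z2": "insert[be](A17*[sep=0.0]empty[width=0.2,height=0.0],Z1*Z1*Z1)",
    "A24&Z2d": "insert[te,sep=0.5](A24*[sep=0.0]empty[width=0.3],"
               "Z1*[sep=0.5]Z1*[sep=0.5]Z1)",
    "A51&X1": "insert[b,sep=0.2](A51,X1)",
}


def convert_ligature(mdc):
    """Return the RES expression for a known ligature, or None if unknown."""
    return LIGATURES.get(mdc)
```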

The construction with "&&" allows manual specification of the positions (horizontal and vertical distance between top-left of the group and top-left of the sign) and scaling of the sign. Distances are in 1/1000 of the unit size.

This construction is quite alien to the philosophy of RES, as change of the font will be detrimental to the appearance. Nevertheless, it could in principle be translated automatically into RES. For example, stp{{0,0,100}}&&n{{0,800,100}}&&ra{{700,0,70}} corresponds roughly to stack[x=0.8,y=0.2](stack[y=0.9](stack[y=0.4](empty,stp),n),ra[scale=0.7]). Needless to say, there are better solutions, without the use of stack, but these are difficult to realise automatically, and the current implementation ignores the pattern between "{{" and "}}" altogether.
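Although the current implementation ignores the positions, reading them is straightforward. The sketch below parses a ligature with "&&" into glyph names with their offsets and scale. It assumes that the first two integers are the horizontal and vertical distances in 1/1000 of the unit size and that the third is a scaling percentage, as the example suggests; the names PART and parse_positioned_ligature are hypothetical.

```python
import re

# A glyph, optionally followed by "{{" integer "," integer "," integer "}}".
PART = re.compile(r"([^&{}]+)(?:\{\{(\d+),(\d+),(\d+)\}\})?")


def parse_positioned_ligature(text):
    """Return a list of (glyph, x, y, scale); x, y, scale are None if absent."""
    parts = []
    for piece in text.split("&&"):
        match = PART.fullmatch(piece)
        if match is None:
            raise ValueError("malformed ligature part: " + piece)
        glyph, x, y, scale = match.groups()
        if x is None:
            parts.append((glyph, None, None, None))
        else:
            # Distances are converted to unit sizes, the scale to a factor.
            parts.append((glyph, int(x) / 1000, int(y) / 1000, int(scale) / 100))
    return parts
```

Turning these coordinates into a good RES expression remains the hard part, for the reasons given above.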

The JSesh parser also accepts the undocumented pattern "[" ( character_except_square_bracket_close )* "]". Such patterns will be ignored, just as the pattern with "{{" and "}}".

hieroglyphs


hieroglyph ->
	glyph |
	shade |
	blank


glyph ->
	glyph_name ( glyph_modifier )*

glyph_name ->
	[ "@" ] category non_zero_natural_number [ upper_letter ] |
	mnemonic |
	"o" | "O" |
	"1" | "2" | "3" | "4" | "5" | 
	"20" | "30" | "40" | "50" |
	"200" | "300" | "400" | "500"

For the definitions of category, etc., see the grammar of RES. Note however that MdC glyph_names have a suffix that is an upper-case letter, which needs to be converted to a lower-case letter in RES.

In JSesh, the prefix "@" signals a provisional Gardiner code. This has no equivalent in RES. The best translation may be empty^"C", where C is the provisional code.
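The treatment of Gardiner codes can be sketched as follows. The set of category names in the regular expression (A to I, K to Z, Aa, NL, NU) is an assumption made for this illustration; the authoritative definition of category is in the RES grammar. The names NAME and glyph_name_to_res are hypothetical, and mnemonics, verse points and numerals are deliberately left to other parts of the converter.

```python
import re

# Optional "@", an assumed category name, a non-zero number, optional letter.
NAME = re.compile(r"(@?)((?:Aa|NL|NU|[A-IK-Z])[1-9][0-9]*)([A-Z]?)")


def glyph_name_to_res(mdc_name):
    """Convert an MdC Gardiner code to RES; return None for anything else."""
    match = NAME.fullmatch(mdc_name)
    if match is None:
        return None
    provisional, code, suffix = match.groups()
    if provisional:
        # Provisional codes have no RES equivalent: keep the code as a note.
        return 'empty^"' + code + suffix + '"'
    # RES writes the trailing letter in lower case.
    return code + suffix.lower()
```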

Further, most modern MdC dialects seem to use the MdC-97 mnemonics, which differ from the MdC-88 mnemonics. The additional mnemonics need to be translated to corresponding Gardiner codes in RES, avoiding the extended library (which should be considered as outdated in the light of the recent Unicode proposal):

MdC-97 mnemonic	MdC meaning	RES if different
M	Aa15
R	D153	D26
nDs	G37
1000	M12
nn	M22A	M22*M22
qnbt	O38A	O38[mirror]
nTrw	R8A	R8*[sep=0,fix]R8*[sep=0,fix]R8
K	S56	S7
wa	T21

The verse points "o" and "O" can be translated to "o"[red] and "o"[black], respectively.

The numerals need to be translated into appropriate combinations of signs:

MdC-97 mnemonic	RES
1	Z1
2	Z1*Z1
3	Z1*Z1*Z1
4	Z1*Z1*Z1*Z1
5	.*[sep=0]Z1*Z1*[sep=0].:Z1*Z1*Z1
20	10*10
30	10*10*10
40	10*10*10*10
50	.*[sep=0]10*10*[sep=0].:10*10*10
200	100*100
300	100*100*100
400	100*100*100*100
500	.*[sep=0]100*100*[sep=0].:100*100*100
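The table of numerals is regular enough to be generated rather than stored. A minimal sketch, in which the function name numeral_to_res is hypothetical; it returns None for anything outside the numerals listed in the glyph_name rule, including "10" and "100", which are assumed to be handled as ordinary mnemonics.

```python
def numeral_to_res(mdc):
    """Translate an MdC numeral such as "3", "40" or "500" into RES."""
    if not mdc.isdigit():
        return None
    digit, zeros = mdc[0], mdc[1:]
    if digit not in "12345" or zeros.strip("0") or len(zeros) > 2:
        return None
    count = int(digit)
    if count == 1 and zeros:
        return None  # "10" and "100" are not numerals in the glyph_name rule
    sign = ["Z1", "10", "100"][len(zeros)]
    if count == 5:
        # Two signs centred above three, with empty signs as padding.
        return ".*[sep=0]{0}*{0}*[sep=0].:{0}*{0}*{0}".format(sign)
    return "*".join([sign] * count)
```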


glyph_modifier ->
	"\" | 
	"\r1" | "\r2" | "\r3" |
	"\t1" | "\t2" | "\t3" |
	"\" integer |
	"\R" [ "-" ] integer |
	"\red" |
	"\i" |
	"\l"

Modifier "\" is translated into RES as [mirror]. The modifiers "\r1", "\r2", "\r3", "\t1", "\t2", "\t3", which were introduced with MdC-97, are translated into RES as [rotate=270], [rotate=180], [rotate=90], [mirror,rotate=90], [mirror,rotate=180], [mirror,rotate=270].

The modifier "\" integer is from MdC-97, and denotes scaling in percentages. E.g. "A1\50" is translated into A1[scale=0.5].

The modifier "\R" [ "-" ] integer is not in MdC-88 or MdC-97. According to the JSesh documentation, the appropriate translation into RES seems to be [rotate=X], where X is the integer, or 360 minus that integer if "-" is used.

The non-MdC modifier "\red" is from JSesh, and its meaning is clearly [red].

The non-MdC modifier "\i" is from JSesh, and its meaning is [gray], or perhaps [silver].

The non-MdC modifier "\l" is from JSesh. A similar meaning would be described quite differently in RES, and translation is not straightforward. For example, "xt\l:x\80*t\80*D54\80" in MdC could be translated into xt[scale=2.0]:[size=inf]x*t*D54 in RES.

The JSesh manual recommends that other patterns of the form "\" followed by alphanumeric characters are reserved for future extensions. Consequently all such patterns are ignored in the translation to RES.
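The translations of the individual modifiers can be summarised in a small function. This is a sketch only: the names FIXED and modifier_to_res are hypothetical, "\i" is rendered as gray (silver being the alternative mentioned above), and None is returned both for "\l", whose translation depends on context, and for the reserved patterns that are ignored.

```python
import re

# Modifiers with a fixed translation into a RES argument.
FIXED = {
    "\\": "mirror",
    "\\r1": "rotate=270", "\\r2": "rotate=180", "\\r3": "rotate=90",
    "\\t1": "mirror,rotate=90", "\\t2": "mirror,rotate=180",
    "\\t3": "mirror,rotate=270",
    "\\red": "red",
    "\\i": "gray",
}


def modifier_to_res(modifier):
    """Return the RES argument for an MdC glyph modifier, or None if ignored."""
    if modifier in FIXED:
        return FIXED[modifier]
    match = re.fullmatch(r"\\(\d+)", modifier)
    if match:
        # Scaling is given in percentages.
        return "scale={:g}".format(int(match.group(1)) / 100)
    match = re.fullmatch(r"\\R(-?)(\d+)", modifier)
    if match:
        angle = int(match.group(2)) % 360
        if match.group(1):
            angle = (360 - angle) % 360
        return "rotate={}".format(angle)
    return None
```

The arguments of several modifiers on one glyph would be joined with commas inside a single pair of square brackets.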


shade ->
	"//" | "h/" | "v/" | "/"

In RES, "//", "h/", "v/", "/" may be translated as empty[shade], empty[t], empty[s] and empty[ts] if they occur on the top level. However, in general these expressions are difficult to translate automatically into RES, as the context determines the size of the shaded area. For example "h/-h/:n:n" in MdC might correspond to empty[t]-empty[shade]:n:n in RES.

Even more problematic is the combination of "#" and a shade. Straightforward translation of e.g. "A1#v/" to RES would be stack(A1,empty[s]), but the RES philosophy would prefer A1[s], avoiding stack where possible.


blank ->
	( "." | ".." ) [ blank_modifier ]

blank_modifier ->
	"\" integer

The symbol "." can be translated into empty[width=0.5,height=0.5] and ".." can be translated into empty. The addition of a blank modifier is from MdC-97, and allows scaling of the size of a blank, where integer is in percentages. For example, ".\70" is to be translated to empty[width=0.7,height=0.7].
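A sketch of the translation of blanks follows. The function name blank_to_res is hypothetical. The text above only gives an example of a modifier on "."; treating a modifier on ".." in the same way is an assumption of this sketch.

```python
import re


def blank_to_res(mdc):
    """Translate ".", ".." and their scaled forms into RES empty signs."""
    match = re.fullmatch(r"(\.\.?)(?:\\(\d+))?", mdc)
    if match is None:
        return None
    dots, percent = match.groups()
    if percent is not None:
        # The modifier gives the size as a percentage of the unit size.
        size = int(percent) / 100
        return "empty[width={0:g},height={0:g}]".format(size)
    if dots == "..":
        return "empty"
    return "empty[width=0.5,height=0.5]"
```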

boxes


box ->
	"<" box_start_tag simple_groups box_end_tag ">" [ ( sep )* shading ]

box_start_tag ->
	[ "S" | "F" | "H" ] [ "b" | "m" | "e" ] |
	[ "s" | "f" | "h" ] [ "0" | "1" | "2" | "3" ]

box_end_tag ->
	[ "s" | "f" | "h" ] [ "0" | "1" | "2" | "3" ]

The following combinations of tags have a corresponding type of box in RES.

box_start_tag	box_end_tag	documentation	RES box_type
(none)	(none)	MdC-88	cartouche
1	1	MdC-97	oval
2	1	MdC-97	cartouche[mirror]
S	(none)	MdC-88	serekh
s2	s1	JSesh	serekh[mirror]
F	(none)	MdC-88	inb
H	(none)	MdC-88	Hwtcloseunder
h1	h2	MdC-97	Hwtcloseunder
h1	h3	MdC-97	Hwtcloseover
h2	h1	MdC-97	Hwtopenunder
h3	h1	MdC-97	Hwtopenover

shading


shading ->
	"#" [ "1" ] [ "2" ] [ "3" ] [ "4" ]

This construction was introduced with MdC-97, which is extremely vague as to the context in which it may occur. The prevailing interpretation seems to be that it can only occur at the end of a quadrat, and not within, and shading is relative to the entire quadrat. Here it is also allowed after a box.

If only the "#" occurs, the quadrat is to be shaded in its entirety. If the "1" is present, this means shading of the top left quarter, "2" stands for top right quarter, "3" stands for bottom left quarter, and "4" stands for bottom right quarter. Combinations of the four numbers indicate corresponding combinations of shading of the quarters.
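The interpretation of the digits can be made concrete with a short sketch, which only decodes the shading pattern and does not attempt the translation into RES. The names QUARTERS and shaded_quarters are hypothetical.

```python
import re

QUARTERS = {
    "1": "top left",
    "2": "top right",
    "3": "bottom left",
    "4": "bottom right",
}


def shaded_quarters(shading):
    """Return the quarters of the quadrat shaded by a pattern such as "#13"."""
    # The grammar requires the digits to occur in increasing order.
    match = re.fullmatch(r"#(1?)(2?)(3?)(4?)", shading)
    if match is None:
        raise ValueError("not a shading pattern: " + shading)
    digits = "".join(match.groups())
    if not digits:
        # A bare "#" shades the quadrat in its entirety.
        digits = "1234"
    return [QUARTERS[d] for d in digits]
```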

It is difficult to translate shading automatically into RES, as in the case of shade. Post-editing is needed in most cases.

A tokenizer might have difficulty distinguishing between "#" as shading and as part of a stack, which may lead to parsing errors. It is recommended to change the shading pattern "#" by hand into "#1234".

toggles


toggle ->
	"$" | "$r" | "$b" |
	"-#" | "-#b" | "-#e"

The toggle "$" is from MdC-88, and switches to the red mode if the current mode of printing is black, or vice versa. This is obviously silly and MdC-97 offers "$r" and "$b" to switch to the red and black modes, respectively, regardless of the current mode. As RES has nothing similar to "$", one may translate this to say '![blue]' so that the need for post-editing is clearly signalled.

Something similar holds for "#", which signals a toggle between hieroglyphs (and the spaces between them) being, or not being, shaded. Again, MdC-97 offers a slight improvement with "#b" and "#e", which switch on and off, respectively, the shading, regardless of the current mode.

The reason why we assume "#", "#b" and "#e" to be preceded by "-" is to reduce the confusion with (too many) other uses of "#". Regrettably, a tokenizer may have difficulty with hieroglyphic starting with e.g. "#-A1", with the intended meaning of a toggle of shading followed by quadrats.

philology


philology ->
	"[&" | "&]" |
	"[{" | "}]" |
	"[[" | "]]" |
	"[\" | "\]" |
	"['" | "']"

These markers can be translated into angled brackets, braces, square brackets, the patterns "["*[fix]"|" and "|"*[fix]"]", and single quotes, respectively.

auxiliaries


integer ->
	( digit )+

digit ->
	"0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Depending on the context, an integer can indicate scaling (in percentages) or degrees of rotation.


whitespace ->
        " " | 
	<HORIZONTAL_TAB> | 
	<NEWLINE> | 
	<CARRIAGE_RETURN> | 
	<FORM_FEED>

All whitespace is treated equally.