Hebrew Cantillation Marks And Their Encoding

by Helmut Richter



V. Unicode Problems


1. Error pertaining to the characters U+0598 and U+05AE

In UnicodeData 3.0.1, the two entries read:

0598;HEBREW ACCENT ZARQA;Mn;230;NSM;;;;;N;;*;;; (230=above)
05AE;HEBREW ACCENT ZINOR;Mn;228;NSM;;;;;N;;;;;  (228=above left)
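The fields of these two entries can be pulled apart mechanically. The following is a minimal sketch in Python that splits the semicolon-separated lines quoted above. The names Entry and parse_unicode_data_line, and the choice to keep only the first four fields, are illustrative assumptions; the trailing parenthetical remarks are left out because they are explanations added in this article and not part of the data file.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """The first four fields of a UnicodeData-style line (hypothetical helper type)."""
    code_point: int
    name: str
    general_category: str
    combining_class: int


def parse_unicode_data_line(line: str) -> Entry:
    """Split one semicolon-separated line and keep the first four fields."""
    fields = line.strip().split(";")
    return Entry(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
    )


# The two entries as quoted from UnicodeData 3.0.1 in this article.
ZARQA_LINE = "0598;HEBREW ACCENT ZARQA;Mn;230;NSM;;;;;N;;*;;;"
ZINOR_LINE = "05AE;HEBREW ACCENT ZINOR;Mn;228;NSM;;;;;N;;;;;"

zarqa = parse_unicode_data_line(ZARQA_LINE)
zinor = parse_unicode_data_line(ZINOR_LINE)

for entry in (zarqa, zinor):
    print(f"U+{entry.code_point:04X} {entry.name}: combining class {entry.combining_class}")
```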

This has obviously been changed since Unicode 1.0, but in the wrong direction, so that the entries are now more consistently wrong.

All sources other than code tables, that is, various grammars of Biblical Hebrew and Breuer's book on cantillation marks (Mordekhai Broyer (Breuer): Taamey hammiqra be-21 sfarim uvesifrey eme"t; Jerusalem, TShM"B (=1981)), which I took as the ultimate referee, agree on the following:

The table of cantillation marks on page 867 of volume 18 of the Hebrew Encyclopedia contains only the accents of the 21 books and therefore no Tsinorit. Hence, it is not a source that could resolve the confusion of Tsinorit with Zarqa=Tsinor. That "Tsinori" instead of "Tsinor" is given as a synonym of Zarqa adds a little to that confusion but does not contradict the above summary.

In contrast to these findings, Unicode (here following Israeli national standard SI 1311-2) makes a distinction between ZARQA and ZINOR (sic!) where ZINOR seems to play the rôle of Tsinorit, as the much more similar names suggest. This interpretation, namely that ZINOR should have been TSINORIT, is also supported by the order of the accents: first all distinctive accents in decreasing strength, then the conjunctive accents; in each class first the accents for the 21 books (or for all books), then those for the 3 books. From this order, one sees that U+0598 was intended to be a distinctive accent of medium strength in the 21 books - exactly what Zarqa is. One can safely conclude that ZARQA indeed means Zarqa=Tsinor and that ZINOR means Tsinorit.

However, the glyph chart shows the two characters swapped, and the combining classes (whose impact on normalisation is minimal in this particular case) agree with the glyph chart and not with the above interpretation of the character names. After Unicode 1.0, but before Unicode 3.0, a remark "=zinorit" was added to the character ZARQA, in conformance with the glyphs. Opinions are divided as to whether this remark is sufficient to safely refute the initial interpretation that ZARQA means Zarqa and ZINOR means Tsinorit. In any case, both interpretations have been taken for granted by several people in the recent discussion, so it is not unfair to assert that a significant ambiguity remains.
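How small the impact of the combining classes on normalisation is can be checked with a short Python sketch. It relies on the Unicode data bundled with the Python interpreter, which is a later version than the Unicode 3.0.1 discussed here, and it assumes that the combining classes of the two characters are still 230 and 228 in that data. The variable names and the helper describe are illustrative only.

```python
import unicodedata

ALEF = "\u05D0"    # an arbitrary base letter chosen for the illustration
ZARQA = "\u0598"   # HEBREW ACCENT ZARQA, assumed combining class 230 (above)
ZINOR = "\u05AE"   # HEBREW ACCENT ZINOR, assumed combining class 228 (above left)


def describe(text: str) -> list:
    """Return (code point, combining class) pairs for every character of text."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# A single accent on a base letter is left untouched by normalisation.
single = ALEF + ZARQA
single_nfd = unicodedata.normalize("NFD", single)

# Only when both accents sit on the same base letter does canonical
# ordering come into play: the mark with the lower class is moved first.
both = ALEF + ZARQA + ZINOR
both_nfd = unicodedata.normalize("NFD", both)

print(describe(single_nfd))
print(describe(both))
print(describe(both_nfd))
```

The combining classes therefore matter only for a base letter that carries both accents at once, a constellation that is not to be expected in real text.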

Criteria for the encoding

Here are the criteria according to which problems are identified and possible solutions are evaluated in this article:

  1. Unicode (and also SI 1311-2) has the strategy that graphically equal, but semantically different characters are to be treated as the same character and not as distinct characters (Unicode 3.0, p.17), with a number of exceptions that do not apply here. This strategy is also followed with other cantillation marks: Tipeha=Tarha, Merkha=Yored, Meteg=Siluq, etc. Hence, Zarqa and Tsinor have to be treated as the same character. Whether Tsinorit is to be treated as distinct from Zarqa=Tsinor depends on whether one considers its different position relative to the base letter as a feature of the character or as a detail of the rendering process. As there are now two code points assigned, and there are glyph variants that are definitely wrong for Tsinorit but not for Zarqa=Tsinor, it would be a step backwards to unify all three.

  2. Given a sequence of characters, the standard must specify unambiguously how to encode it.

    However, there may be some unavoidable ambiguity when the standard specifies characters as distinct although they may have similar appearance in some renderings. Examples of such a situation could be:

    In such cases, where absolute uniqueness of encoding cannot be achieved, it is at least required that the standard be unambiguous once the users of the standard have set up their policy on how to treat the pertinent characters. In other words: ambiguities about how to apply the standard in a given situation may be unavoidable, but there should not be ambiguities about what the standard says.

  3. Even though glyphs are not normative for characters, the use of glyphs in the standard must not violate the character identity (Unicode 3.0, p.40, item D2). This principle is currently violated: if one of the two glyphs now used for ZARQA and ZINOR denotes a Zarqa and the other does not, then it is unambiguously not the glyph of ZARQA.

  4. Names of Unicode characters must no longer be changed (policy on the Unicode WWW site, also in the standard document?).

  5. Changes of properties of Unicode characters are to be kept to a minimum (Unicode 3.0, p.73).

  6. It is desirable but not mandatory that the set of cantillation marks in Unicode follow Israeli standard SI 1311-2. If not, the remark on p.187 of Unicode 3.0 has to be modified.

  7. It is desirable but not mandatory that the cantillation marks appear in the code in an order which appears to be logical, given the semantics of the marks.

The problem and possible solutions

When we start with criterion 1 above, we find that two different Unicode character names, ZARQA and ZINOR, denote the same character and a unique name of a character to be encoded, TSINORIT, is missing as the name of a Unicode character. Because of criterion 4, this problem cannot be fixed. Similarly, criterion 3 cannot be completely fulfilled as it is impossible to give two character names a different identity and different glyphs when in reality they are names of the same character. A solution will consist in providing enough comment for the user that, despite the unavoidable inaccuracy of character names, criterion 2 is fulfilled, and to define the character identities so that criterion 3 is not too grossly violated.

If one takes seriously the statement that Unicode defines characters, not glyphs, in particular as expressed in principle D2, then one has to change the glyph to match the character name instead of leaving both the glyph and the character name as they are: if the glyph picture of "LATIN CAPITAL LETTER A" shows a "B", then the picture is wrong, rather than "LATIN CAPITAL LETTER A" being a somewhat outlandish name for a "B". In this spirit, this page originally contained a request to change one name (ZINOR->TSINORIT), and then, in order to comply with the policy not to change names, a request that the glyphs be restored to the correct order although the names cannot be straightened out. The fact that the glyphs shown are already being implemented in fonts (so that they are de facto treated as normative) and that at least one name cannot be made entirely correct could be used as an argument to keep the two characters swapped as they are now in the glyph chart.

Now, there are four ways to proceed:

Initially, solution 1 was suggested here as the only solution. Now, as an exact match between character names and character identity cannot be achieved anyway, solution 2 might be a fair compromise that strikes a balance between the stability of the standard and the plausibility of the definitions contained therein. I leave it to the various standards bodies to find the solution they consider most consistent with the standards' policies. My personal preference is still for solution 1 (or even the more consistent solution 4), but it is much more important that the standard become unambiguous as soon as possible, also in the context of the users (people who encode text and process encoded text) and not only in the context of the font designers.

Solution 1:

Solution 2:

Solution 3:

Solution 4:


2. Order of characters between Holam and Vav

In the case that a vowel is represented in vocalised text by both a vowel point and a consonant (a mater lectionis), Unicode 3.0 fails to define the order of these two characters. In nearly all cases, this order is evident from the typographical appearance which is the same as if one of the consonants had no vowel point. In the case of the combination of Holam and Vav, however, there is a need to define the intended sequence. Whatever the desired sequence of Unicode characters, it has an influence on the definition of character VAV WITH HOLAM (U+FB4B). Example:

Is the word "shalom" to be spelt as
  SHIN + SHIN DOT + QAMATS
  LAMED + HOLAM
  VAV
  FINAL MEM
or as
  SHIN + SHIN DOT + QAMATS
  LAMED
  VAV + HOLAM
  FINAL MEM
and is
  SHIN WITH SHIN DOT + QAMATS
  LAMED
  VAV WITH HOLAM
  FINAL MEM
equivalent?
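The three candidate spellings can be written down and compared under canonical normalisation. The following Python sketch does so with the Unicode data bundled with the interpreter, which is later than Unicode 3.0, so its outcome describes that data and not necessarily the state of the standard discussed here. The variable names are illustrative assumptions.

```python
import unicodedata

SHIN, SHIN_DOT, QAMATS = "\u05E9", "\u05C1", "\u05B8"
LAMED, HOLAM, VAV, FINAL_MEM = "\u05DC", "\u05B9", "\u05D5", "\u05DD"
SHIN_WITH_SHIN_DOT = "\uFB2A"   # precomposed presentation form
VAV_WITH_HOLAM = "\uFB4B"       # precomposed presentation form

# First spelling: Holam attached to the Lamed, Vav without vowel point.
holam_on_lamed = SHIN + SHIN_DOT + QAMATS + LAMED + HOLAM + VAV + FINAL_MEM
# Second spelling: Holam attached to the Vav.
holam_on_vav = SHIN + SHIN_DOT + QAMATS + LAMED + VAV + HOLAM + FINAL_MEM
# Third spelling: the same word written with the precomposed characters.
precomposed = SHIN_WITH_SHIN_DOT + QAMATS + LAMED + VAV_WITH_HOLAM + FINAL_MEM


def nfd(text: str) -> str:
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)


def code_points(text: str) -> str:
    """Render a string as a space-separated list of code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


for label, word in [("holam on lamed", holam_on_lamed),
                    ("holam on vav", holam_on_vav),
                    ("precomposed", precomposed)]:
    print(f"{label:15} {code_points(nfd(word))}")

print("second and third equivalent:", nfd(holam_on_vav) == nfd(precomposed))
print("first and second equivalent:", nfd(holam_on_lamed) == nfd(holam_on_vav))
```

With the data assumed here, the second and third spellings normalise to the same sequence, whereas the first one stays distinct. Normalisation alone therefore cannot decide which of the two orders is the intended one.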

A good place to insert such additional clarification into the Unicode standard is the paragraph near the end of p.186 which begins with "Vowels". It could be enhanced with the following explanation appended. In its wording, the same strategy was followed as with other scripts explained in chapters 7 to 11 of the standard: the principles of the script are explained in a bit more detail than is needed for readers that are already acquainted with the script:

These vowel points are used in liturgical texts including the Bible, in poems, in dictionaries, and whenever the exact vocalisation must be uniquely specified. In most other texts, they are omitted. Independently of the presence of vowel points, vowels are frequently represented by the letters U+05D0 HEBREW LETTER ALEF, U+05D5 HEBREW LETTER VAV, U+05D9 HEBREW LETTER YOD, and, restricted to the end of a word, U+05D4 HEBREW LETTER HE. When vowel points are present, they do not only denote vowels that are not represented by one of these letters, but they also determine which occurrences of these letters serve as substitutes for vowels, and if so, for which vowels. A vowel may thus be represented by both a letter and a vowel point. In case of the vowel shuruq, i.e. the vowel /u/ in an open or a word-final syllable, the vowel point U+05BC HEBREW POINT DAGESH OR MAPIQ is applied to the vav which acts as a substitute for the vowel. In all other cases the vowel point is applied to the preceding consonant, and the letter representing the vowel remains without vowel point.

If this interpretation is not wanted, the text has to be modified to read:

These vowel points [...] both a letter and a vowel point. If the letter is vav, the vowel point (either U+05B9 HEBREW POINT HOLAM or U+05BC HEBREW POINT DAGESH OR MAPIQ) is applied to the vav. In all other cases the vowel point is applied to the preceding consonant, and the letter representing the vowel remains without vowel point.

The first of these alternatives requires that the definition of VAV WITH HOLAM (U+FB4B) be changed to HOLAM + VAV; otherwise the character is of no use.
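The order in which the definition of U+FB4B lists its components can be inspected directly. This is a minimal Python sketch that uses the Unicode data bundled with the interpreter, a later version than Unicode 3.0; the variable names are illustrative.

```python
import unicodedata

VAV_WITH_HOLAM = "\uFB4B"

# The decomposition mapping is returned as a string of hexadecimal code points.
mapping = unicodedata.decomposition(VAV_WITH_HOLAM)
components = [chr(int(code, 16)) for code in mapping.split()]

print(unicodedata.name(VAV_WITH_HOLAM), "->", mapping)
for ch in components:
    print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

In the data assumed here the mapping lists the Vav first and the Holam second, which is the order that fits the second alternative and not the first.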

On the other hand, the second of the alternatives has the consequence that HOLAM may be applied to the last character of a word so that, theoretically, it interferes typographically with an accent that is positioned above and to the left of the word, like a Zarqa (whatever the Unicode name of that accent will become). In practice it does not interfere, as such a Holam is written on top of the Vav and thus to the right of the accent, as the combining class also specifies.


3. Unclear glyphs

The pictures in the glyph chart are unclear for the following accents. Here is what they should look like:


© Helmut Richter      published here 2000-12-06; last update 2001-04-09      http://www.lrz.de/~hr/teamim/unicode.html