Han Unification

Han Unification (2002-04-06)

Han ideographs (Kanji in Japanese, Hanzi in Chinese, and Hanja in Korean) have a distinctive property: variants, called itaiji (異体字) in Japanese. A variant is a concept that lies between a character and a glyph.

Character codes encode characters, not glyphs. Thus, all character codes (as far as I know) assign different codepoints to different characters, but not to different glyphs. But what about variants?

Most character codes, including Unicode and JIS, adopt the principle of "one common codepoint for variants". However, character code standards may differ on the following points.

Rule of unification (housetsu, 包摂, in Japanese). What counts as a character and what counts as a variant? In other words, where is the borderline between characters and variants? (Forms that are unified into one character are called variants.)
Implicit representative variant. Among the various variants, which one should be the representative? Although any glyph within the range of variants complies with the standard, font products using such glyphs might not be useful in daily life. Not all variants are well known to ordinary people. Which variants are mainly used, and how the proper variant is chosen for a given purpose, depends strongly on the (local) culture. Strictly speaking, this is a problem of implementation rather than of the standard itself. However, a standard can be designed so as to avoid (or produce) implementation difficulties.
Exceptional variants. There may be exceptions to the "one common codepoint for variants" rule when the distinction between certain variants is important. Note that the importance of a distinction depends strongly on the (local) culture.

On the problem of exceptions, Unicode introduces no problems beyond those of the CJK national standards, thanks to its "round-trip conversion compatibility" (or "source separation") principle. Under this principle, variants that are separated and have different codepoints in a CJK local standard are also separated and have different codepoints in Unicode. Thus, all exceptional variants in CJK local standards (those with separate codepoints) also have separate codepoints in Unicode. For example, the following Unicode characters come from separated variants in JIS X 0208: U+8FBA (辺), U+9089 (邉), and U+908A (邊).
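
The example above can be checked with a minimal Python sketch. It assumes Python's built-in euc_jp codec as a stand-in for a JIS X 0208-based encoding (the codec choice and the helper names are my assumptions, not something the standards prescribe) and shows that the three variants remain distinct after a round trip.

```python
# Three variants that JIS X 0208 encodes separately: U+8FBA, U+9089, U+908A.
variants = "\u8fba\u9089\u908a"

def codepoints(text):
    """Return the Unicode codepoints of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

def round_trip(text, encoding="euc_jp"):
    """Encode text into a legacy encoding and decode it back to Unicode."""
    return text.encode(encoding).decode(encoding)

print(codepoints(variants))
# If source separation holds, nothing is merged on the way there and back.
print(round_trip(variants) == variants)
```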

However, there is a newer Japanese local standard, JIS X 0213, which was developed after the initial version of Unicode, and, strictly speaking, the unification rules of JIS X 0213 and Unicode are partly incompatible for fewer than 200 ideographs. Another example is the higher planes (3 and above) of the Taiwanese CNS 11643 standard. Note that round-trip conversion compatibility is guaranteed for JIS X 0213 through Unicode's CJK COMPATIBILITY IDEOGRAPHS. I don't know about CNS 11643.

The problem of the implicit representative variant will incur the largest real implementation cost. Although local standards such as JIS unify many variants, font developers have little difficulty choosing one variant to adopt for each character, because for each character there is one variant that is mainly used. However, which variant is mainly used, or (when multiple variants are in use) how the proper variant is chosen in a given situation, depends strongly on culture. The choice of variant can therefore become a real implementation problem when the same unification rule is applied to an international character code standard.

A problematic situation can occur when one variant of a character is used in one country and a different variant is used in another. In some cases, the variants that Chinese and Korean people use are unfamiliar to Japanese people (for example, U+6D77 (海)), or even unreadable (for example, U+76F4 (直)).

Variants of a character can have very similar shapes or very different ones. Because some clearly distinct ideographs also have very similar shapes (like 土 and 士), visual similarity alone cannot help a reader faced with an unfamiliar variant.

Thus, it is impossible to design a single worldwide font for the CJK Unified Ideographs of Unicode. All you can do is develop a "Japanese version" of a Unicode font, a "Chinese version", or a "Korean version". A "world version" of a computer system has to include Japanese, Chinese, and Korean versions of its fonts.

Such systems then need an algorithm to determine which version of the fonts to use. I think the following procedure is a good solution.

1. If the document specifies a language (by a Unicode language tag, an SGML language tag, or any other means), that information should be used.
2. If case 1 does not apply and the system knows the user's preferred language (for example, from a localized version of the operating system or the LANG variable), that information should be used.
3. If neither case 1 nor case 2 applies, a Japanese font would be a good fallback, because Japanese people tend not to know (or feel uneasy about) the variants that Chinese and Korean people mainly use. I may be wrong on this point, because I don't know how much the average person in China or Korea knows.
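
The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the font names are placeholders, the function names are hypothetical, and treating a non-CJK document language (such as "en") as "no information" is my own simplification.

```python
import os

# Placeholder font names; a real system would map these to installed fonts.
FONT_BY_LANGUAGE = {"ja": "Japanese font", "zh": "Chinese font", "ko": "Korean font"}

def primary_language(tag):
    """Reduce a tag such as 'ja_JP.UTF-8' or 'zh-TW' to 'ja' or 'zh'."""
    if not tag:
        return None
    for separator in ".@_-":
        tag = tag.split(separator, 1)[0]
    return tag.lower() or None

def choose_han_font(document_lang=None, user_lang=None):
    """Pick a font version: document language, then user language, then Japanese."""
    for candidate in (document_lang, user_lang):
        language = primary_language(candidate)
        if language in FONT_BY_LANGUAGE:
            return FONT_BY_LANGUAGE[language]
    # Neither source names a CJK language: fall back to the Japanese font.
    return FONT_BY_LANGUAGE["ja"]

# Example: no language tag in the document, so the LANG variable decides.
print(choose_han_font(document_lang=None, user_lang=os.environ.get("LANG")))
```

Keeping the fallback as the last line of the function makes the priority order explicit: the document's own declaration always outranks the user's environment.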

There is (so far) no method for a Unicode (or JIS) text file to specify a variant. However, a mechanism for specifying variants appears to be planned: the variation selectors in Unicode 3.2.

FE00 VARIATION SELECTOR-1
FE01 VARIATION SELECTOR-2
FE02 VARIATION SELECTOR-3
FE03 VARIATION SELECTOR-4
FE04 VARIATION SELECTOR-5
FE05 VARIATION SELECTOR-6
FE06 VARIATION SELECTOR-7
FE07 VARIATION SELECTOR-8
FE08 VARIATION SELECTOR-9
FE09 VARIATION SELECTOR-10
FE0A VARIATION SELECTOR-11
FE0B VARIATION SELECTOR-12
FE0C VARIATION SELECTOR-13
FE0D VARIATION SELECTOR-14
FE0E VARIATION SELECTOR-15
FE0F VARIATION SELECTOR-16
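
The table above can be reproduced from the Unicode character database. Here is a minimal Python sketch using the standard unicodedata module; the function name is only for illustration, and the sketch lists the selectors without claiming anything about which variants they select.

```python
import unicodedata

def variation_selectors():
    """Return (codepoint, name) pairs for U+FE00 through U+FE0F."""
    return [(cp, unicodedata.name(chr(cp))) for cp in range(0xFE00, 0xFE10)]

for codepoint, name in variation_selectors():
    print(f"{codepoint:04X} {name}")
```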

There is a standardized list of variants. It reads:

At this time no Han variants exist. When they do, a table will be inserted here.

Thus, we can expect that variation selectors will become usable for specifying Han variants in a future version of Unicode.

Also, many variants have recently been added to Unicode as "COMPATIBILITY IDEOGRAPHS". They were added to achieve "round-trip conversion" compatibility with JIS X 0213, CNS 11643, and other standards that assign separate codepoints to additional variants.
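
As an illustration, Python's unicodedata module shows how a compatibility ideograph relates to its unified counterpart. The sketch below uses U+FA30, which I take to be one of the ideographs added for JIS X 0213 (an assumption on my part); the helper name is hypothetical.

```python
import unicodedata

def unified_counterpart(ch):
    """Return the unified ideograph a compatibility ideograph maps to, or None."""
    mapping = unicodedata.decomposition(ch)  # e.g. '4FAE', or '' if there is none
    parts = mapping.split()
    # Only a single-codepoint mapping without a <tag> points at a unified ideograph.
    if len(parts) != 1 or parts[0].startswith("<"):
        return None
    return chr(int(parts[0], 16))

compat = "\ufa30"
print(unicodedata.name(compat))
print(unified_counterpart(compat))
# An ordinary unified ideograph such as U+8FBA has no such mapping.
print(unified_counterpart("\u8fba"))
```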

Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>