
Problems on Interoperability between Unicode and CJK Local Encodings

This page describes problems in conversion between Unicode and national CJK encodings, mainly those involving non-letter symbols.

Note on encodings and coded character sets: JIS X 0201, JIS X 0208, and JIS X 0212 are coded character sets (CCS). EUC-JP, Shift_JIS, and ISO-2022-JP are encodings, also called character encoding schemes (CES). An encoding is the concrete code actually used in text files. An encoding consists of one or more coded character sets. 8-bit encodings such as ISO-8859-1 usually consist of a single coded character set. Multibyte encodings, on the other hand, usually consist of several coded character sets; for example, EUC-JP consists of ISO/IEC 646 IRV (also known as US-ASCII), JIS X 0208, JIS X 0201 Kana, and JIS X 0212.
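The difference between a coded character set and an encoding can be seen by encoding one JIS X 0208 character with several encodings. The following is only a minimal sketch: it assumes the codecs bundled with Python (`euc_jp`, `shift_jis`, `iso2022_jp`), and `encode_all` is a hypothetical helper name introduced for illustration.

```python
# Sketch: one character from one coded character set (JIS X 0208),
# represented by three different encodings. Uses Python's bundled codecs;
# encode_all is a hypothetical helper, not part of any library.
CODECS = ("euc_jp", "shift_jis", "iso2022_jp")


def encode_all(text):
    """Return {codec name: encoded bytes} for each Japanese encoding."""
    return {codec: text.encode(codec) for codec in CODECS}


if __name__ == "__main__":
    # HIRAGANA LETTER A (U+3042) is row 4, cell 2 of JIS X 0208.
    for codec, data in encode_all("\u3042").items():
        print(f"{codec:<11}{data.hex()}")
```

The character is the same in all three cases; only the bytes differ. EUC-JP sets the high bit of both bytes, Shift_JIS rearranges the rows, and ISO-2022-JP surrounds the 7-bit code with escape sequences that switch between coded character sets.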

Notes on Japanese coded character sets:

JIS X 0208 is a national standard that includes Hiragana, Katakana, the Latin alphabet, numerals, Greek and Cyrillic letters, 1st level Kanji, 2nd level Kanji, and non-letter symbols. It is the most important character set for Japanese. It is a 94x94 character set that complies with ISO 2022.
JIS X 0201 consists of two parts: JIS X 0201 Roman and JIS X 0201 Kana.
JIS X 0201 Roman is the Japanese version of ISO 646, in which 0x5c is YEN SIGN instead of REVERSE SOLIDUS and 0x7e is OVERLINE instead of TILDE.
JIS X 0201 Kana includes Katakana and a few punctuation marks. It is a 94-character set that complies with ISO 2022.
JIS X 0212 is a 94x94 ISO 2022-compliant character set that contains additional Kanji intended to be used together with JIS X 0208. JIS X 0208 and JIS X 0212 have been source character sets of the Unicode CJK unified ideographs since Unicode version 1.0.
JIS X 0213 is a new Japanese standard released in 2000. It is a superset of JIS X 0208, consists of two 94x94 planes, and complies with ISO 2022. Its additions to JIS X 0208 include non-letter symbols, Hiragana, Katakana, accented Latin letters, 3rd level Kanji, and 4th level Kanji.
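The 94x94 layout and the two JIS X 0201 Roman differences described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the helper names (`kuten_to_jis`, `kuten_to_euc_jp`, `decode_jisx0201_roman`) are hypothetical, and the only library feature used is Python's bundled `euc_jp` codec, which serves to check the result.

```python
# Sketch with hypothetical helper names. A 94x94 ISO 2022 set addresses
# each character by a row and a cell, both in the range 1..94.

def kuten_to_jis(row, cell):
    """Map a JIS X 0208 row/cell pair to its 7-bit two-byte code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


def kuten_to_euc_jp(row, cell):
    """EUC-JP uses the same code with the high bit set on both bytes."""
    return bytes(b | 0x80 for b in kuten_to_jis(row, cell))


# JIS X 0201 Roman differs from ASCII at exactly these two positions.
JISX0201_ROMAN_DIFF = {0x5C: "\u00a5", 0x7E: "\u203e"}  # YEN SIGN, OVERLINE


def decode_jisx0201_roman(data):
    """Decode 7-bit bytes as JIS X 0201 Roman instead of ASCII."""
    if any(b > 0x7F for b in data):
        raise ValueError("JIS X 0201 Roman is a 7-bit set")
    return "".join(JISX0201_ROMAN_DIFF.get(b, chr(b)) for b in data)


if __name__ == "__main__":
    # Row 16, cell 1 is the first 1st level Kanji.
    print(kuten_to_euc_jp(16, 1).decode("euc_jp"))
    print(decode_jisx0201_roman(b"\\100"))
```

The same byte 0x5c therefore reads as a backslash under ASCII and as a yen sign under JIS X 0201 Roman, which is why a conversion to Unicode has to decide which of the two sets the text was written in.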

There are three major problems for Japanese users who want to use Unicode in daily life: Han unification, the mapping problem, and the width problem. There are also other problems, such as EUC-JP round-trip conversion and the yen sign problem. These problems need to be solved before Japanese users can use Unicode conveniently.
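The mapping and width problems can be observed directly. The sketch below is only an illustration: it assumes Python's bundled `shift_jis` and `cp932` codecs as two examples of differing conversion tables, and the standard `unicodedata` module for width classes. `decode_both` and `width_class` are hypothetical helper names.

```python
# Sketch: the same Shift_JIS bytes decoded with two conversion tables,
# and the East Asian Width class of a few characters.
import unicodedata


def decode_both(data):
    """Decode bytes with Python's shift_jis and cp932 tables."""
    return data.decode("shift_jis"), data.decode("cp932")


def width_class(ch):
    """Return the East Asian Width class (e.g. 'Na', 'W', 'A')."""
    return unicodedata.east_asian_width(ch)


if __name__ == "__main__":
    # 0x8160 is a JIS X 0208 non-letter symbol (row 1, cell 33).
    for ch in decode_both(b"\x81\x60"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # ASCII letter, Hiragana, and a Greek letter from JIS X 0208.
    for ch in "A\u3042\u03b1":
        print(f"U+{ord(ch):04X} {width_class(ch)}")
```

With these two tables the bytes 0x81 0x60 become WAVE DASH (U+301C) in one case and FULLWIDTH TILDE (U+FF5E) in the other, so text converted by one table and converted back by the other does not survive unchanged. The Greek letter is reported as class `A` (ambiguous): it occupies two columns in a JIS X 0208 context but one column elsewhere, and Unicode alone does not say which.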

Han Unification (2002-04-06)
Conversion tables differ between vendors (2002-04-04)
Width problems (2002-03-31)
EUC-JP round-trip compatibility (2001-04-30)
ASCII and JIS X 0201 Roman (2001-04-30)

The following are old documents.

2001-11-01: added a Japanese translation.
2001-09-06: added comments on the release of Unicode 3.1.1.
2001-05-07: fixed a typo (JIS X 0201 Kana -> JIS X 0201 Roman) in the explanation of nkf and qkc.

Comments on the release of Unicode 3.1.1 (2001-09-06)
Conversion tables differ between vendors (2001-04-30)
Width problems (2001-04-30)
JIS X 0213 (2001-04-30)

Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>