
Problems on Interoperability between Unicode and CJK Local Encodings

This page describes problems in conversion between Unicode and national CJK encodings, mainly concerning non-letter symbols.

Note about encodings and coded character sets: JIS X 0201, JIS X 0208, and JIS X 0212 are coded character sets (CCS). EUC-JP, Shift_JIS, and ISO-2022-JP are encodings, also called character encoding schemes (CES). An encoding is the actual byte-level code used in a text file, and it is built from one or more coded character sets. 8-bit encodings such as ISO-8859-1 usually consist of a single coded character set. Multibyte encodings, on the other hand, usually combine several: EUC-JP, for example, consists of ISO/IEC 646 IRV (also known as US-ASCII), JIS X 0208, JIS X 0201 Kana, and JIS X 0212.
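
The distinction between a coded character set and an encoding can be made concrete with a minimal sketch. It assumes Python 3 and its bundled codecs `euc_jp`, `shift_jis`, and `iso2022_jp`. The sample characters, the `encode_hex` helper, and the `table` name are illustrative choices, not part of the original page. Each sample character comes from a different coded character set, and the same character is then serialized by three encodings.

```python
# Each sample character is drawn from a different coded character set (CCS).
SAMPLES = {
    "A": "ISO/IEC 646 IRV (US-ASCII)",
    "あ": "JIS X 0208",
    "ｱ": "JIS X 0201 Kana",
}
CODECS = ("euc_jp", "shift_jis", "iso2022_jp")


def encode_hex(text, codec):
    """Return the encoded bytes as a hex string, or None if the codec rejects the text."""
    try:
        return text.encode(codec).hex()
    except UnicodeEncodeError:
        return None


# table[character][codec] -> hex string of the encoded bytes (or None)
table = {ch: {codec: encode_hex(ch, codec) for codec in CODECS} for ch in SAMPLES}

for ch, ccs in SAMPLES.items():
    print(f"U+{ord(ch):04X} ({ccs})")
    for codec in CODECS:
        print(f"  {codec:<11} {table[ch][codec]}")
```

The JIS X 0208 character あ is packaged differently by each encoding, and the half-width katakana ｱ from JIS X 0201 Kana takes two bytes in EUC-JP but a single byte in Shift_JIS. The character sets are shared, while the byte-level schemes differ.
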
Note for Japanese coded character sets:

For everyday use of Unicode, Japanese users face three major problems: Han unification, the mapping problem, and the width problem. There are also other problems, such as EUC-JP round-trip conversion and the yen problem. These problems must be solved before Unicode can be used conveniently by Japanese users.

The following are old documents.


Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>