
ASCII and JIS X 0201 Roman (2001-04-30)

When converting between EUC-JP and Shift_JIS, the handling of 0x5c and 0x7e can be a problem. Both encodings have a long history and Japanese people have a lot of experience in handling them, so I will describe that practice here.

The solution is very simple: just regard YEN SIGN and REVERSE SOLIDUS as different glyphs of the same character. Then the distinction between ASCII and JIS X 0201 Roman can be ignored.
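A minimal sketch of this idea in Python follows. The names `fold_roman` and `same_text` are hypothetical, and folding OVERLINE into TILDE for 0x7e is an assumption made by analogy with 0x5c. The two glyph variants are folded to a single code point before comparing:

```python
# Hypothetical helper: treat YEN SIGN / REVERSE SOLIDUS (and, by assumption,
# OVERLINE / TILDE) as glyph variants of one character by folding them to
# the ASCII code points.
_FOLD = {
    0x00A5: 0x005C,  # YEN SIGN -> REVERSE SOLIDUS (byte 0x5c)
    0x203E: 0x007E,  # OVERLINE -> TILDE (byte 0x7e)
}


def fold_roman(text: str) -> str:
    """Return text with JIS X 0201 Roman variants folded to ASCII."""
    return text.translate(_FOLD)


def same_text(a: str, b: str) -> bool:
    """Compare two strings, ignoring the ASCII / JIS X 0201 Roman distinction."""
    return fold_roman(a) == fold_roman(b)


print(same_text("C:\u00a5WINDOWS", "C:\\WINDOWS"))  # True
```

With such a fold applied, text that came from either encoding compares equal regardless of which glyph the author had in mind.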

Thus, when a Japanese person says "Shift_JIS", it almost always means "CP932". (Most Japanese people know nothing about encodings; a certain number of them, namely Windows and Macintosh users, know the word "Shift_JIS" as the only usable encoding.)

Please don't blame Japanese people who are not aware of the distinction between Shift_JIS and CP932. The only difference between Shift_JIS and CP932 used to be that CP932 has extension characters. It was the introduction of Unicode, and conversion to and from it, that brought a confusing incompatibility in non-letter symbols between Shift_JIS and CP932.

Here is why I wrote that "Shift_JIS" almost always means "CP932" when a Japanese person says it. DOS/Windows programmers write YEN SIGN + "n" to mean a newline (in Shift_JIS or, strictly speaking, CP932), and DOS/Windows uses YEN SIGN (0x5c) as the directory separator. This is why Microsoft cannot convert 0x5c in CP932 into any character other than U+005C.
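For illustration, Python's built-in `cp932` codec follows the same mapping: byte 0x5c decodes to U+005C, so a DOS path survives a round trip through Unicode. A small sketch (the sample path is made up):

```python
# Byte 0x5c in CP932 decodes to U+005C REVERSE SOLIDUS, whatever glyph
# a Japanese font draws for it.
raw = b"C:\x5cWINDOWS\x5cSYSTEM"   # made-up sample path as CP932 bytes
path = raw.decode("cp932")

parts = path.split("\\")            # split on U+005C, the directory separator
roundtrip = path.encode("cp932")    # encode back to CP932 bytes

print(parts)
print(roundtrip == raw)
```

If 0x5c were instead decoded to U+00A5, the split on U+005C would no longer find any separator.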

Not only Windows users but also UNIX users regard 0x5c in Shift_JIS as ambiguous between YEN SIGN and REVERSE SOLIDUS. For example, popular Japanese encoding converters such as nkf and qkc don't care about the distinction between ASCII (0x21-0x7e in EUC-JP) and JIS X 0201 Roman (0x21-0x7e in Shift_JIS). I often use TeraTerm, a telnet/ssh client for Windows, and when I see a YEN SIGN there I read it as a REVERSE SOLIDUS where the context calls for it. (When the writer is Japanese, it means YEN SIGN in most cases. When the writer is not Japanese, it always means REVERSE SOLIDUS.)
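The behaviour of such converters can be sketched with Python's built-in `euc_jp` and `shift_jis` codecs, which in a standard CPython build also pass the bytes 0x21-0x7e through unchanged. This is only an illustration and not how nkf or qkc are implemented; the function name `euc_to_sjis` is hypothetical.

```python
def euc_to_sjis(data: bytes) -> bytes:
    """Convert EUC-JP bytes to Shift_JIS bytes by going through Unicode."""
    return data.decode("euc_jp").encode("shift_jis")


# Every byte in 0x21-0x7e, including 0x5c and 0x7e, should come out unchanged.
roman = bytes(range(0x21, 0x7F))
print(euc_to_sjis(roman) == roman)

# Double-byte Japanese text is re-encoded as usual.
kanji = "日本語".encode("euc_jp")
print(euc_to_sjis(kanji).decode("shift_jis"))
```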

Thus, I don't complain if 0x5c in Shift_JIS is mapped to U+005C. Rather, distinguishing them (i.e., being strict about the official standards) might confuse many Japanese people.


Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>