Conversion tables differ between vendors (2001-04-30)

Many CESs (Character Encoding Schemes) share a common CCS (Coded Character Set). For example, CESs such as EUC-JP, Shift_JIS, and CP932 all include JIS X 0208 as a CCS.

For such CESs, a character from the same CCS should map to the same UCS character regardless of which CES carries it. However, for dozens of characters this is not the case.

The following table lists characters for which the same character in JIS X 0208 (or another CCS) is mapped to different code points depending on the conversion table used.

---------------------------------------------------------------------------------------------
ORIGINAL                      Converted** to U+????/EastAsianWidth
CCS     Shift_JIS* EUC-JP*    0208    SJIS    CP932   APPLE   0221A   0221B   JAVAA   JAVAB
---------------------------------------------------------------------------------------------
[ASCII]
0x5C    ----       0x5C       ----    ----    ----    ----    ----    005C/Na ----    005C/Na
0x7E    ----       0x7E       ----    ----    ----    ----    ----    007E/Na ----    007E/Na
[JISX0201 Roman]
0x5C    0x5C       ----       ----    00A5/Na 005C/Na 00A5/Na 00A5/Na ----    005C/Na 00A5/Na
0x7E    0x7E       ----       ----    203E/N  007E/Na 007E/Na 203E/N  ----    007E/Na 203E/N
[JISX0208]
0x2131  0x81 0x50  0xA1 0xB1  FFE3/F  FFE3/F  FFE3/F  FFE3/F  FFE3/F  203E/N  FFE3/F  FFE3/F
0x213D  0x81 0x5C  0xA1 0xBD  2015/A  2015/A  2015/A  2014/A  2014/A  2014/A  2015/A  2015/A
0x2140  0x81 0x5F  0xA1 0xC0  005C/Na 005C/Na FF3C/F  FF3C/F  005C/Na FF3C/F  FF3C/F  FF3C/F
0x2141  0x81 0x60  0xA1 0xC1  301C/W  301C/W  FF5E/F  301C/W  301C/W  301C/W  301C/W  301C/W
0x2142  0x81 0x61  0xA1 0xC2  2016/A  2016/A  2225/A  2016/A  2016/A  2016/A  2016/A  2016/A
0x215D  0x81 0x7C  0xA1 0xDD  2212/N  2212/N  FF0D/F  2212/N  2212/N  2212/N  2212/N  2212/N
0x216F  0x81 0x8F  0xA1 0xEF  FFE5/F  FFE5/F  FFE5/F  FFE5/F  FFE5/F  00A5/Na FFE5/F  FFE5/F
0x2171  0x81 0x91  0xA1 0xF1  00A2/Na 00A2/Na FFE0/F  00A2/Na 00A2/Na 00A2/Na 00A2/Na 00A2/Na
0x2172  0x81 0x92  0xA1 0xF2  00A3/Na 00A3/Na FFE1/F  00A3/Na 00A3/Na 00A3/Na 00A3/Na 00A3/Na
0x224C  0x81 0xCA  0xA2 0xCC  00AC/Na 00AC/Na FFE2/F  00AC/Na 00AC/Na 00AC/Na 00AC/Na 00AC/Na
[JISX0212]
0x2217  ----       0x8F,A2,97 ----    ----    ----    ----    007E/Na FF5E/F  ----    ----
---------------------------------------------------------------------------------------------

Note 1 This table mentions Japanese encodings only.

Note 2 This table doesn't contain vendors' extended characters (characters that are invalid in formal EUC-JP and Shift_JIS).

Note * Converted from ASCII, JISX0201 Roman, and JISX0208 algorithmically. The algorithm for EUC-JP is described in http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT. The algorithm to convert from JIS X 0208 to Shift_JIS is:

out1 = ((in1 - 1) >> 1) + ((in1 <= 0x5e) ? 0x71 : 0xb1);
out2 = in2 + ((in1 & 1) ? ((in2 < 0x60) ? 0x1f : 0x20) : 0x7e);

where in1 and in2 are the 1st and 2nd bytes of the JIS X 0208 code, and out1 and out2 are the 1st and 2nd bytes of the Shift_JIS code. The Shift_JIS value is used as the original code for the "SJIS", "CP932", "Win98", and "Apple" conversions, because all of these (other than Shift_JIS itself) are supersets of Shift_JIS.
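
The byte arithmetic above can be checked against the Shift_JIS and EUC-JP columns of the table. The following is a minimal Python sketch of the same expressions, with the conditional offsets parenthesized explicitly. The function names jis0208_to_sjis and jis0208_to_eucjp are illustrative only, and the range check is an added assumption that both input bytes lie in 0x21..0x7E.

```python
def jis0208_to_sjis(in1, in2):
    """Convert a JIS X 0208 byte pair (0x21..0x7E each) to Shift_JIS bytes."""
    if not (0x21 <= in1 <= 0x7E and 0x21 <= in2 <= 0x7E):
        raise ValueError("not a JIS X 0208 code point")
    # Two JIS rows share one Shift_JIS lead byte; rows above 0x5E jump
    # past the 0xA0..0xDF range, so their lead bytes start at 0xE0.
    out1 = ((in1 - 1) >> 1) + (0x71 if in1 <= 0x5E else 0xB1)
    if in1 & 1:
        # Odd row: trail bytes start at 0x40 and skip 0x7F.
        out2 = in2 + (0x1F if in2 < 0x60 else 0x20)
    else:
        # Even row: trail bytes start at 0x9F.
        out2 = in2 + 0x7E
    return out1, out2


def jis0208_to_eucjp(in1, in2):
    """EUC-JP sets the high bit of both JIS X 0208 bytes."""
    return in1 | 0x80, in2 | 0x80
```

For example, jis0208_to_sjis(0x21, 0x41) returns (0x81, 0x60) and jis0208_to_eucjp(0x21, 0x41) returns (0xA1, 0xC1), which agrees with the 0x2141 row of the table.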

Note **

0208 = http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT (Version 0.9, 1994-03-08)
SJIS = http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT (Version 0.9, 1994-03-08)
CP932 = http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT (Version 2.01, 1998-04-15)
APPLE = http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT (1999-09-22)
0221A = JIS X 0221 annex 3 (JIS X 0201), from http://www.ingrid.org/java/i18n/unicode.html (downloaded 2001-04-13). JIS X 0221 is a Japanese national standard corresponding to ISO 10646.
0221B = JIS X 0221 annex 3 (ISO/IEC 646-IRV), from http://www.ingrid.org/java/i18n/unicode.html.
JAVAA = Java (SJIS & EUCJIS), from http://www.ingrid.org/java/i18n/unicode.html.
JAVAB = Java (JIS), from http://www.ingrid.org/java/i18n/unicode.html.
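
Each converted cell in the table has the form "code point/East Asian Width class". As a rough way to reproduce that notation, the sketch below looks up the width class with Python's standard unicodedata module. The helper name describe is hypothetical, and the Unicode data bundled with a given Python release is newer than the data available in 2001, so some classes may differ from the table.

```python
import unicodedata


def describe(code_point):
    """Format a code point like the table cells, e.g. 'FF5E/F'."""
    width = unicodedata.east_asian_width(chr(code_point))
    return f"{code_point:04X}/{width}"
```

For instance, describe(0xFF5E) gives "FF5E/F" and describe(0x301C) gives "301C/W", the two values that appear in the 0x2141 row.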

Thus, the same character in Japanese encodings is mapped to different Unicode characters depending on the conversion table. In particular, CP932, which differs from the other tables in relatively many places, is called Shift_JIS in Microsoft OSes and is very widely used. This is likely to cause serious problems as Unicode becomes more popular in Japan.

Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>