Conversion tables differ between vendors (2002-04-04)

Introduction
Software Used for the Research
Results
Discussion
Additional Comments

Introduction

A character code used for text files is called an encoding. For example, EUC-JP, Shift_JIS, ISO-2022-JP, ISO-8859-1, and UTF-8 are encodings. An encoding consists of one or more coded character sets; "coded character set" is often abbreviated to CCS. For example, the EUC-JP encoding consists of the coded character sets ISO 646 IRV (aka US-ASCII), JIS X 0208, JIS X 0201 Kana, and JIS X 0212. The ISO-8859-1 encoding consists of the single coded character set ISO-8859-1.

Several local encodings are widely used in Japan, and all of them use JIS X 0208 as a component. In other words, a character that belongs to JIS X 0208 is by definition the same character whichever encoding it appears in, so such characters can be exchanged safely between different encodings. For example, Shift_JIS 0x82 0xA0 and EUC-JP 0xA4 0xA2 both correspond to JIS X 0208 0x2422, and they are strictly the same character. This is Hiragana "A", which is normally mapped to U+3042 (あ).
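As a small illustration of the Hiragana "A" example, the following sketch recovers the JIS X 0208 codepoint from the EUC-JP bytes and decodes both byte sequences to Unicode. Python's bundled `shift_jis` and `euc_jp` codecs are used only as an assumed stand-in for a mapping table; they are not among the tables examined in this research.

```python
# Hiragana "A" as bytes in two encodings (values from the text above).
SJIS_A = bytes([0x82, 0xA0])
EUC_A = bytes([0xA4, 0xA2])

# EUC-JP stores a JIS X 0208 code with the high bit set on both bytes,
# so subtracting 0x80 from each byte recovers the raw JIS X 0208 codepoint.
jis_from_euc = ((EUC_A[0] - 0x80) << 8) | (EUC_A[1] - 0x80)

# Assumption: Python's bundled codecs stand in for "some" mapping table.
u_from_sjis = SJIS_A.decode("shift_jis")
u_from_euc = EUC_A.decode("euc_jp")

print(hex(jis_from_euc))            # raw JIS X 0208 codepoint
print(u_from_sjis == u_from_euc)    # same Unicode character either way
print("U+%04X" % ord(u_from_sjis))
```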

However, since Unicode was developed, this peaceful situation has been threatened. Strictly identical characters (those with the same codepoint in the same JIS X 0208 coded character set) are mapped to different Unicode codepoints depending on the encoding.

For example, CP932 (aka Windows-31J) is an encoding used by Microsoft Windows. It is an extension of Shift_JIS with Microsoft's additional private characters. Macintosh uses a different extension of Shift_JIS, with Apple's additional private characters. In other words, Shift_JIS is a subset of both of these encodings. It is only natural that the additional characters are not compatible, but even the JIS X 0208 part of these encodings loses compatibility once mapping to Unicode is taken into account. For example, Shift_JIS 0x81 0x60 (0x2141 in JIS X 0208) is mapped to U+301C (WAVE DASH, 〜) by Macintosh and GNU libc, while it is mapped to U+FF5E (FULLWIDTH TILDE, ～) by Microsoft Windows. Because Shift_JIS is a subset of Apple's and Microsoft's encodings, any Shift_JIS text can also be regarded as CP932 text.

A strictly identical JIS X 0208 character can thus be mapped to different Unicode codepoints merely by regarding the text as Shift_JIS or as CP932, or merely by changing the software used for the mapping. These mapping tables broke the inter-encoding compatibility of JIS X 0208 characters.
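The effect can be reproduced with a few lines of Python. This is only a sketch: it assumes that Python's `shift_jis` codec behaves like a plain Shift_JIS table and that its `cp932` codec behaves like Microsoft's table for this one character, which is not something this research verified.

```python
# Shift_JIS 0x81 0x60 is JIS X 0208 0x2141 (example from the text above).
raw = bytes([0x81, 0x60])

# Assumption: these two Python codecs stand in for a plain Shift_JIS
# table and for Microsoft's CP932 table respectively.
as_sjis = raw.decode("shift_jis")
as_cp932 = raw.decode("cp932")

for label, ch in (("shift_jis", as_sjis), ("cp932", as_cp932)):
    print("%-9s -> U+%04X" % (label, ord(ch)))

# The same two input bytes no longer compare equal after conversion.
print("identical after decoding:", as_sjis == as_cp932)
```

Nothing about the input bytes changes between the two calls; only the label attached to them does.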

From the Unicode Consortium's point of view, encodings such as Shift_JIS and CP932 are totally different encodings, and under that interpretation differences between their mapping tables are not a problem. However, this implies that CP932, Macintosh's encoding, and so on are not based on JIS X 0208, and thus there is no guarantee that the JIS X 0208 part of these encodings is common. Losing that guarantee is a huge inconvenience.

The differences between mapping tables mean that Unicode fonts must have glyphs for every Unicode codepoint that any of these mapping tables can produce, even if the fonts are intended to be used only in Japan. This is why mlterm has a command line option to use CP932 as its Unicode/JIS mapping table. The differences (and, moreover, the Unicode Consortium's standpoint that it is not responsible for mapping tables) make the width problem even more difficult and complex. Furthermore, files converted to Unicode are not guaranteed to be identical, which causes trouble for MD5 hashes, the diff command, and so on.
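A rough sketch of both consequences follows. The two strings are hypothetical file contents that differ only in which Unicode codepoint JIS X 0208 0x2141 was converted to, and the width classes are whatever the Unicode database bundled with the running Python reports.

```python
import hashlib
import unicodedata

# The two Unicode codepoints that JIS X 0208 0x2141 is mapped to,
# per the WAVE DASH / FULLWIDTH TILDE example earlier in this text.
WAVE_DASH = "\u301c"
FULLWIDTH_TILDE = "\uff5e"

def md5_of_utf8(text):
    """Return the MD5 hex digest of text encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical file content: identical except for the converted character.
doc_a = "10" + WAVE_DASH + "20"
doc_b = "10" + FULLWIDTH_TILDE + "20"

print(md5_of_utf8(doc_a))
print(md5_of_utf8(doc_b))
print("same digest:", md5_of_utf8(doc_a) == md5_of_utf8(doc_b))

# East Asian Width property of some of the competing codepoints.
for ch in (WAVE_DASH, FULLWIDTH_TILDE, "\u00a2", "\uffe0"):
    print("U+%04X %s" % (ord(ch), unicodedata.east_asian_width(ch)))
```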

However, for conversion from Unicode to Japanese encodings, a many-to-one mapping can solve the problem. (This works only as long as no two different characters are mapped to the same Unicode codepoint by different Japanese encodings.)
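A minimal sketch of such a many-to-one mapping is shown below. The table is partial and was assembled for illustration from codepoint pairs listed in the Results section of this document; the name `UNICODE_TO_JIS` and the function `to_jis` are hypothetical.

```python
# Both Unicode variants of each character are sent to the same
# JIS X 0208 codepoint. This is a partial, illustrative table,
# not any vendor's actual table.
UNICODE_TO_JIS = {
    0x301C: 0x2141,  # WAVE DASH
    0xFF5E: 0x2141,  # FULLWIDTH TILDE
    0x2016: 0x2142,  # DOUBLE VERTICAL LINE
    0x2225: 0x2142,  # PARALLEL TO
    0x2212: 0x215D,  # MINUS SIGN
    0xFF0D: 0x215D,  # FULLWIDTH HYPHEN-MINUS
}

def to_jis(text):
    """Map each character to a JIS X 0208 codepoint via the table."""
    return [UNICODE_TO_JIS[ord(ch)] for ch in text]

print([hex(j) for j in to_jis("\u301c\uff5e")])
```

The approach breaks down where one Unicode codepoint is claimed by two different legacy characters. U+005C is such a case: in the results it appears both for the single byte 0x5C and for JIS X 0208 0x2140.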

Software Used for the Research

To discuss mapping tables between Japanese encodings and Unicode, we need the mapping tables themselves. I examined the following mapping tables, all of which are freely available via the Internet. However, I did not check whether they are exactly the same as the mapping tables implemented in widely used systems.

ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
This mapping table was probably made by matching Unicode character names against the character names written in the JIS X 0208 standard. In the previous research I wrote that this file had been released by the Unicode Consortium and was now obsolete. That was not accurate: the Unicode Consortium has never released any mapping tables. The same applies to SHIFTJIS.TXT.
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
This is a mapping table between Shift_JIS and Unicode. The Shift_JIS encoding consists of JIS X 0201 Roman and JIS X 0208. The JIS X 0208 part of this table seems to be exactly the same as JIS0208.TXT above.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
This mapping table seems to be the one used by the Japanese version of Microsoft Windows, which would make it the most popular table in Japan. Originally, CP932 should be Shift_JIS plus Microsoft's private additional characters. However, the JIS X 0208 part of this table differs from SHIFTJIS.TXT above and from the other tables.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
This mapping table seems to be the one used by the Japanese version of Apple Macintosh. Originally, it should be Shift_JIS plus Apple's private additional characters. However, the JIS X 0208 part of this table differs from SHIFTJIS.TXT above and from the other tables.
http://www.jca.apc.org/~earthian/aozora/0213/jisx0213code.zip
This file is introduced on, and can be downloaded from, the JISX0213 InfoCenter page. I examined jisx0213code.txt, which is obtained by unzipping the archive. JIS X 0213:2000 is a superset of JIS X 0208:1997. This table was probably made by matching Unicode character names against the character names written in the JIS X 0213 standard. I examined the JIS X 0208 part of this table.
http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip
This file was downloaded from the IRG Reports page. I examined IBM1394toUCS4-GLY.txt, which is obtained by unzipping the archive. IBM1394 is Shift_JISX0213 with IBM's private additional characters. Shift_JISX0213 is an encoding described in the (informative) Annex 1 of the JIS X 0213 standard and is a superset of Shift_JIS. I examined the JIS X 0208 part of this table.
http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip
This is the same archive as above. I examined another file, IBM1394toUCS4-IRV.txt, which is obtained by unzipping the archive. This table interprets the 0x21-0x7e part of IBM1394 as ISO 646 IRV, not as JIS X 0201 Roman.
locales 2.2.5-4 package from Debian GNU/Linux
I examined the /usr/share/i18n/charmaps/EUC-JP.gz file in the package, which is a packaged version of the GNU C Library. This file determines the behavior of iconv(3) in GNU libc; i.e., it is a de facto standard on Linux.
locales 2.2.5-4 package from Debian GNU/Linux
I examined the /usr/share/i18n/charmaps/SHIFTJIS.gz file in the package.

All of them use JIS X 0208 as a component.

JIS0208.TXT and jisx0213code.txt give raw JIS X 0208 codepoints.
EUC-JP.gz gives JIS X 0208 codepoints in their EUC-JP representation. To obtain the raw JIS X 0208 codepoint, subtract 0x80 from both the upper and the lower byte.
The other files give JIS X 0208 codepoints in their Shift_JIS representation. The following is the conversion from JIS X 0208 to Shift_JIS:
```
s1 = ((j1 - 1) >> 1) + ((j1 <= 0x5e) ? 0x71 : 0xb1);
s2 = j2 + ((j1 & 1) ? ((j2 < 0x60) ? 0x1f : 0x20) : 0x7e);
```
and the following is the conversion from Shift_JIS to JIS X 0208:
```
j1 = (s1 << 1) - (s1 <= 0x9f ? 0xe0 : 0x160) - (s2 < 0x9f ? 1 : 0);
j2 = s2 - 0x1f - (s2 >= 0x7f ? 1 : 0) - (s2 >= 0x9f ? 0x5e : 0);
```
where s1 and s2 are the 1st and 2nd bytes of Shift_JIS, and j1 and j2 are the 1st (upper) and 2nd (lower) bytes of JIS X 0208.
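The two conversions, together with the EUC-JP subtraction, can be checked against each other with a short Python port. This is a sketch of the formulas above and not part of the research script; the function names are made up for the example.

```python
def jis_to_sjis(j1, j2):
    """Convert JIS X 0208 bytes (j1, j2) to Shift_JIS bytes (s1, s2)."""
    s1 = ((j1 - 1) >> 1) + (0x71 if j1 <= 0x5E else 0xB1)
    if j1 & 1:
        s2 = j2 + (0x1F if j2 < 0x60 else 0x20)
    else:
        s2 = j2 + 0x7E
    return s1, s2

def sjis_to_jis(s1, s2):
    """Convert Shift_JIS bytes (s1, s2) to JIS X 0208 bytes (j1, j2)."""
    j1 = (s1 << 1) - (0xE0 if s1 <= 0x9F else 0x160) - (1 if s2 < 0x9F else 0)
    j2 = s2 - 0x1F - (1 if s2 >= 0x7F else 0) - (0x5E if s2 >= 0x9F else 0)
    return j1, j2

def euc_to_jis(e1, e2):
    """EUC-JP representation to raw JIS X 0208: subtract 0x80 per byte."""
    return e1 - 0x80, e2 - 0x80

# Round-trip every cell of the 94 x 94 grid (0x21..0x7E on both bytes).
for j1 in range(0x21, 0x7F):
    for j2 in range(0x21, 0x7F):
        assert sjis_to_jis(*jis_to_sjis(j1, j2)) == (j1, j2)

print([hex(b) for b in jis_to_sjis(0x24, 0x22)])   # Hiragana "A"
print([hex(b) for b in sjis_to_jis(0x81, 0x60)])   # the 0x2141 example
```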

The Java conversion table was checked in the previous research, but it is omitted this time because I could not obtain the mapping table.

Here is the script I used for this research.

#!/usr/bin/perl

$JIS = 1;
$EUC = 2;
$SJIS = 3;

sub readmapfile ($$$$$) {
    my($file, $name, $legacy, $ucs, $type, $line, @column, $u, $l);
    ($file, $name, $legacy, $ucs, $type) = @_;

    print "reading $file ...\n";

    open(FILE, $file) || die "Cannot open $file.\n";
    while ($line = <FILE>) {
	if ($line =~ /^\#/) {next;}
	if ($line =~ m+^\/+) {next;}
	$line =~ s/^\s+//;
	@column = split(/\s+/, $line);
	if ($name eq "JISX0213") {
	    if ($line =~ /^2-/) {next;}
	    # hex() parses the "0x..." strings; int() would return 0 for them.
	    $u = $column[$ucs];  $u =~ s/u-/0x/;  $u = hex($u);
	    $l = $column[$legacy];  $l =~ s/j-/0x/;  $l = hex($l);
	} elsif ($name =~ /IBM1394/) {
	    $u = hex($column[$ucs]);
	    $l = hex($column[$legacy]);
	} elsif ($name =~ /GLIBC/) {
	    if ($line !~ /^<U/) {next;}
	    $u = $column[$ucs];  $u =~ s/<U//;  $u =~ s/>//;  $u = hex($u);
	    $l = $column[$legacy];  $l =~ s+/x++g;  $l = hex($l);
	} else {
	    $u = hex($column[$ucs]);
	    $l = hex($column[$legacy]);
	}
	if ($type == $JIS) {
	    # do nothing
	} elsif ($type == $EUC) {
	    if ($l > 0x100) {$l -= 0x8080;}
	} else {
	    if ($l > 0x100) {
		$h = $l >> 8 & 0xff;
		$l = $l & 0xff;
		$out1 = ($h << 1) - ($h <= 0x9f ? 0xe0 : 0x160) 
		    - ($l < 0x9f ? 1 : 0);
		$out2 = $l - 0x1f - ($l >= 0x7f ? 1 : 0) 
		    - ($l >= 0x9f ? 0x5e : 0);
		$l = ($out1 << 8) + $out2;
	    }
	}
	${"map" . $name}[$l] = $u;
    }
}

# ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
&readmapfile("JIS0208.TXT", "JISX0208", 1, 2, $JIS);

# ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
&readmapfile("SHIFTJIS.TXT", "SHIFTJIS", 0, 1, $SJIS);

# ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
&readmapfile("CP932.TXT", "CP932", 0, 1, $SJIS);

# ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
&readmapfile("JAPANESE.TXT", "APPLE", 0, 1, $SJIS);

# http://www.jca.apc.org/~earthian/aozora/0213/jisx0213code.zip
&readmapfile("jisx0213code.txt", "JISX0213", 1, 4, $JIS);

# http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip
&readmapfile("IBM1394toUCS4-GLY.txt", "IBM1394", 0, 1, $SJIS);

# http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip
&readmapfile("IBM1394toUCS4-IRV.txt", "IBM1394I", 0, 1, $SJIS);

# GNU libc 2.2.5 from Debian libc6 2.2.5-4 package (2002-04-03)
&readmapfile("EUC-JP", "GLIBCEUC", 1, 0, $EUC);

# GNU libc 2.2.5 from Debian libc6 2.2.5-4 package (2002-04-03)
&readmapfile("SHIFT_JIS", "GLIBCSJIS", 1, 0, $SJIS);

print "JIS      0208   SJIS   CP932  APPLE  0213   IBMGLY IBMIRV G-EUC  G-SJIS\n";

for ($r = 0x0; $r <= 0x28; $r++) {
    if ($r > 0 && $r < 0x21) {next;}
    for ($c = 0x21; $c < 0x7f; $c++) {
	$l = ($r << 8) + $c;
	if ($l != 0x5c && $l != 0x7e && $l < 0x100) {next;}
	if ($l > 0x100 && $mapJISX0208[$l] == 0) {next;}
	if ($mapJISX0208[$l] != $mapSHIFTJIS[$l] ||
	    $mapJISX0208[$l] != $mapCP932[$l] ||
	    $mapJISX0208[$l] != $mapAPPLE[$l] ||
	    $mapJISX0208[$l] != $mapJISX0213[$l] ||
	    $mapJISX0208[$l] != $mapIBM1394[$l]) {
	    printf "0x%04X   ",$l;
	    foreach $i ("JISX0208", "SHIFTJIS", "CP932", "APPLE", "JISX0213",
			"IBM1394", "IBM1394I", "GLIBCEUC", "GLIBCSJIS") {
		if (${"map" . $i}[$l] == 0) {print "------ ";}
		else {printf "U+%04X ",${"map" . $i}[$l];}
	    }
	    print "\n";
	}
    }
}

Results

The following table is the result.

JIS      0208   SJIS   CP932  APPLE  0213   IBMGLY IBMIRV G-EUC  G-SJIS
-----------------------------------------------------------------------
0x005C   ------ U+00A5 U+005C U+00A5 ------ U+00A5 U+005C U+005C U+00A5 
0x007E   ------ U+203E U+007E U+007E ------ U+203E U+007E U+007E U+203E 
0x2131   U+FFE3 U+FFE3 U+FFE3 U+FFE3 U+203E U+FFE3 U+FFE3 U+FFE3 U+FFE3 
0x213D   U+2015 U+2015 U+2015 U+2014 U+2014 U+2014 U+2014 U+2015 U+2015 
0x2140   U+005C U+005C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C 
0x2141   U+301C U+301C U+FF5E U+301C U+301C U+301C U+301C U+301C U+301C 
0x2142   U+2016 U+2016 U+2225 U+2016 U+2016 U+2016 U+2016 U+2016 U+2016 
0x215D   U+2212 U+2212 U+FF0D U+2212 U+2212 U+2212 U+2212 U+2212 U+2212 
0x216F   U+FFE5 U+FFE5 U+FFE5 U+FFE5 U+00A5 U+FFE5 U+FFE5 U+FFE5 U+FFE5 
0x2171   U+00A2 U+00A2 U+FFE0 U+00A2 U+00A2 U+FFE0 U+FFE0 U+00A2 U+00A2 
0x2172   U+00A3 U+00A3 U+FFE1 U+00A3 U+00A3 U+FFE1 U+FFE1 U+00A3 U+00A3 
0x224C   U+00AC U+00AC U+FFE2 U+00AC U+00AC U+FFE2 U+FFE2 U+00AC U+00AC

A note on the rows 0x005C and 0x007E: neither of them belongs to JIS X 0208. They should be interpreted as ISO 646 IRV for G-EUC and IBMIRV, and as JIS X 0201 Roman for the other mapping tables.

There are ten JIS X 0208 codepoints which are mapped to different Unicode codepoints depending on the table.

0x2131
jisx0213code.txt maps it to U+203E (OVERLINE, ‾), while the others map it to U+FFE3 (FULLWIDTH MACRON, ￣).
0x213D
APPLE JAPANESE.TXT, jisx0213code.txt, and the two IBM1394 tables map it to U+2014 (EM DASH, —), while the others map it to U+2015 (HORIZONTAL BAR, ―).
0x2140
JIS0208.TXT and SHIFTJIS.TXT map it to U+005C (REVERSE SOLIDUS, \), while the others map it to U+FF3C (FULLWIDTH REVERSE SOLIDUS, ＼). When JIS X 0208 is used as a component coded character set of the EUC-JP encoding, U+005C is not appropriate because it should be used for ISO 646 IRV 0x5C. On the other hand, when JIS X 0208 is used as a component coded character set of the Shift_JIS encoding, U+005C can be used here. This does not mean that U+005C must be used; rather, I think it should not be used, in order to keep compatibility between Shift_JIS and EUC-JP.
0x2141
CP932 maps it to U+FF5E (FULLWIDTH TILDE, ～), while the others map it to U+301C (WAVE DASH, 〜).
0x2142
CP932 maps it to U+2225 (PARALLEL TO, ∥), while the others map it to U+2016 (DOUBLE VERTICAL LINE, ‖).
0x215D
CP932 maps it to U+FF0D (FULLWIDTH HYPHEN-MINUS, －), while the others map it to U+2212 (MINUS SIGN, −).
0x216F
jisx0213code.txt maps it to U+00A5 (YEN SIGN, ¥), while the others map it to U+FFE5 (FULLWIDTH YEN SIGN, ￥). When JIS X 0208 is used as a component coded character set of the Shift_JIS encoding, U+00A5 is not appropriate because it should be used for JIS X 0201 Roman. On the other hand, when JIS X 0208 is used as a component coded character set of the EUC-JP encoding, U+00A5 can be used here. This does not mean that U+00A5 must be used; rather, I think it should not be used, in order to keep compatibility between Shift_JIS and EUC-JP.
0x2171
CP932 and the two IBM1394 tables map it to U+FFE0 (FULLWIDTH CENT SIGN, ￠), while the others map it to U+00A2 (CENT SIGN, ¢).
0x2172
CP932 and the two IBM1394 tables map it to U+FFE1 (FULLWIDTH POUND SIGN, ￡), while the others map it to U+00A3 (POUND SIGN, £).
0x224C
CP932 and the two IBM1394 tables map it to U+FFE2 (FULLWIDTH NOT SIGN, ￢), while the others map it to U+00AC (NOT SIGN, ¬).
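The groupings listed above can be derived mechanically from the results table. The following Python sketch transcribes the ten JIS X 0208 rows of that table and groups the tables by the codepoint they choose; the helper names are hypothetical and the code is independent of the Perl script.

```python
# Results table from this document, rows 0x2131..0x224C, transcribed
# as hexadecimal Unicode codepoints in column order.
COLUMNS = ["0208", "SJIS", "CP932", "APPLE", "0213",
           "IBMGLY", "IBMIRV", "G-EUC", "G-SJIS"]
ROWS = {
    0x2131: "FFE3 FFE3 FFE3 FFE3 203E FFE3 FFE3 FFE3 FFE3",
    0x213D: "2015 2015 2015 2014 2014 2014 2014 2015 2015",
    0x2140: "005C 005C FF3C FF3C FF3C FF3C FF3C FF3C FF3C",
    0x2141: "301C 301C FF5E 301C 301C 301C 301C 301C 301C",
    0x2142: "2016 2016 2225 2016 2016 2016 2016 2016 2016",
    0x215D: "2212 2212 FF0D 2212 2212 2212 2212 2212 2212",
    0x216F: "FFE5 FFE5 FFE5 FFE5 00A5 FFE5 FFE5 FFE5 FFE5",
    0x2171: "00A2 00A2 FFE0 00A2 00A2 FFE0 FFE0 00A2 00A2",
    0x2172: "00A3 00A3 FFE1 00A3 00A3 FFE1 FFE1 00A3 00A3",
    0x224C: "00AC 00AC FFE2 00AC 00AC FFE2 FFE2 00AC 00AC",
}

def group_by_codepoint(row):
    """Return {unicode codepoint: [table names]} for one table row."""
    groups = {}
    for name, cell in zip(COLUMNS, row.split()):
        groups.setdefault(int(cell, 16), []).append(name)
    return groups

def disagreements(name_a, name_b):
    """Count rows where two tables map to different codepoints."""
    a, b = COLUMNS.index(name_a), COLUMNS.index(name_b)
    return sum(1 for row in ROWS.values()
               if row.split()[a] != row.split()[b])

for jis, row in ROWS.items():
    parts = ["U+%04X: %s" % (u, ",".join(names))
             for u, names in group_by_codepoint(row).items()]
    print("0x%04X  %s" % (jis, " | ".join(parts)))

print("CP932 vs SJIS:", disagreements("CP932", "SJIS"))
print("G-EUC vs G-SJIS:", disagreements("G-EUC", "G-SJIS"))
```

Counting pairwise disagreements this way gives a rough feel for how far apart two tables are within these ten rows; it says nothing about the vendor-private characters outside JIS X 0208.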

Discussion

I think unifying the mapping tables into a single one is much more important than discussing which mapping tables are better or worse. Unfortunately, each vendor has to stay compatible with its own previous products, and I imagine none of them can accept a modification of its mapping table. Several years have passed since these mapping tables were released in the early days of Unicode, and it is too late to unify them. I feel like blaming the people who were involved in the early days of Unicode for this problem, but I will not, because doing so brings neither a solution nor any benefit to Japanese people.

At the very least, the appearance of further incompatible mapping tables must be prevented. To achieve this, I think the Unicode Consortium needs to recognize one (or several) orthodox reference mapping table(s) and release them via the Internet.

There are other reasons why we need one (or several) orthodox reference mapping table(s):

When a person or a company that does not have its own mapping table wants to write new software, a reference mapping table is needed.
Especially in the Open Source world, how to define a mapping table is a real problem.
When people want to exchange JIS X 0208 data between systems from different vendors, a common mapping table may be needed.
To discuss the width problem theoretically rather than by impression, reference mapping table(s) are needed.
Even on a system with a non-orthodox mapping table, all the user has to think about is the difference from the orthodox mapping table. This is much simpler than the current situation, where the user has to think about all the differences between all the mapping tables.
And finally, as I wrote above, recognizing mapping tables will help discourage anyone from developing yet another mapping table and making the situation even worse.
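The simplification gained from a single reference table can be put in numbers. The sketch below assumes, purely for illustration, that the reference would be one of the nine tables compared in this research.

```python
from math import comb

# Nine mapping tables were compared in this research.
TABLES = 9

# Without a reference, every pair of tables may have to be compared.
pairwise = comb(TABLES, 2)

# With one table designated as the reference, each remaining table
# only needs its differences from that reference recorded.
against_reference = TABLES - 1

print(pairwise, against_reference)
```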

However, the Unicode Consortium's position is that it does not take responsibility for mapping tables. Indeed, the Unicode Consortium is mainly an association of major companies, and these companies are to blame for this confusing situation. Releasing an orthodox mapping table would imply that the other mapping tables are wrong, which would hurt the pride of these companies. Moreover, there seem to be opinions that mapping tables should be maintained by the bodies behind the other encoding standards, not by the Unicode Consortium.

I think it is acceptable that the Unicode Consortium cannot take ultimate responsibility for mapping tables; the Japanese standards body may take it instead. However, JIS can be responsible only for the JIS mapping table, not for vendors' mapping tables. Only the Unicode Consortium is in a position to bring the various standards bodies and companies together.

Thus, the following is my proposal.

The Unicode Consortium to recognize one or several mapping table(s) between Unicode and the encodings which use JIS X 0208.
The Unicode Consortium to indicate which standards bodies or companies maintain those encodings and mapping tables.
The Unicode Consortium to publish the mapping tables or explain how to obtain them. I hope the Unicode Consortium will give these standards bodies and companies web space to release their mapping tables.
Unicode users to use one of the mapping tables which the Unicode Consortium recognizes.

Additional Comments

I sent a mail to a Unicode member and received a reply saying that this problem will be proposed to the Unicode Technical Committee (2002-04-05).

I examined only Japanese encodings. I imagine the same problem exists for encodings in China, Taiwan, and Korea. In particular, BIG5, which is widely used in Taiwan and in Chinese communities outside mainland China, is a de facto standard, so I have heard that its situation is much worse.

Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>