
Conversion tables differ between vendors (2002-04-04)


Introduction

A character code used for text files is called an encoding. For example, EUC-JP, Shift_JIS, ISO-2022-JP, ISO-8859-1, and UTF-8 are encodings. An encoding consists of one or more coded character sets, often abbreviated to CCS. For example, the EUC-JP encoding consists of the coded character sets ISO 646 IRV (aka US-ASCII), JIS X 0208, JIS X 0201 Kana, and JIS X 0212. The ISO-8859-1 encoding consists of the single coded character set ISO-8859-1.

Several local encodings are widely used in Japan, and all of them include JIS X 0208 as a component. In other words, a character that belongs to JIS X 0208 is by definition the same character no matter which encoding it appears in. Thus, these characters can be exchanged safely between different encodings. For example, Shift_JIS 0x82 0xA0 and EUC-JP 0xA4 0xA2 both correspond to JIS X 0208 0x2422, so they are strictly the same character. This is Hiragana "A", which is normally mapped to U+3042 (あ).
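
The relationship between the three code values in this example is plain arithmetic on the two JIS X 0208 bytes. The following is a minimal Python sketch of it. The function names jis_to_euc and jis_to_sjis are illustrative and do not come from any library, and only the 94x94 JIS X 0208 range is handled.

```python
def jis_to_euc(jis):
    """Convert a 16-bit JIS X 0208 code (e.g. 0x2422) to its EUC-JP value."""
    # EUC-JP sets the high bit of both bytes.
    return jis | 0x8080


def jis_to_sjis(jis):
    """Convert a 16-bit JIS X 0208 code to its Shift_JIS value."""
    j1, j2 = jis >> 8, jis & 0xFF
    # Two JIS rows share one Shift_JIS lead byte.
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40  # skip 0xA0-0xDF, which Shift_JIS uses for JIS X 0201 Kana
    if j1 & 1:
        # Odd row: trail bytes start at 0x40 and skip 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even row: trail bytes start at 0x9F.
        s2 = j2 + 0x7E
    return (s1 << 8) | s2


for jis in (0x2422, 0x2141):
    print("JIS 0x%04X -> EUC-JP 0x%04X, Shift_JIS 0x%04X"
          % (jis, jis_to_euc(jis), jis_to_sjis(jis)))
```

Because both conversions are reversible and involve no table, the identity of a JIS X 0208 character is preserved when text moves between these encodings.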

However, since Unicode was developed, this peaceful situation has been under threat. Strictly identical characters (those that have the same codepoint in the same JIS X 0208 coded character set) are mapped to different Unicode codepoints depending on the encoding.

For example, CP932 (aka Windows-31J) is the encoding used by Microsoft Windows. It is an extension of Shift_JIS with additional Microsoft private characters. Macintosh also uses its own extension of Shift_JIS, with additional Apple private characters. In other words, Shift_JIS is a subset of both encodings. It goes without saying that these additional characters are not compatible with each other, but even the JIS X 0208 part of these encodings loses compatibility once mapping to Unicode is considered. For example, Shift_JIS 0x81 0x60 (0x2141 in JIS X 0208) is mapped to U+301C (WAVE DASH, 〜) by Macintosh and GNU libc, while it is mapped to U+FF5E (FULLWIDTH TILDE, ～) by Microsoft Windows. Note that, because Shift_JIS is a subset of Apple's and Microsoft's encodings, any Shift_JIS text can also be regarded as CP932 text.
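
The discrepancy can be observed with any converter that ships both tables. As a small illustration, assuming Python's built-in shift_jis, cp932, and euc_jp codecs (which are not among the tables studied in this article), the same JIS X 0208 character decodes to different codepoints:

```python
raw = b"\x81\x60"  # JIS X 0208 0x2141 in Shift_JIS form

as_sjis = raw.decode("shift_jis")
as_cp932 = raw.decode("cp932")
as_euc = b"\xa1\xc1".decode("euc_jp")  # the same JIS X 0208 character in EUC-JP

for label, text in (("shift_jis", as_sjis), ("cp932", as_cp932), ("euc_jp", as_euc)):
    print("%-9s -> U+%04X" % (label, ord(text)))

# The input denotes one JIS X 0208 character, yet the decoded strings differ.
print("same string:", as_sjis == as_cp932)
```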

A strictly identical JIS X 0208 character thus ends up at different Unicode codepoints merely because the text is regarded as Shift_JIS rather than CP932, or merely because different software performs the mapping. These mapping tables have broken the inter-encoding compatibility of JIS X 0208 characters.

From the Unicode Consortium's point of view, encodings such as Shift_JIS and CP932 are entirely different encodings. Under this interpretation, differences between mapping tables are not a problem. However, it implies that CP932, the Macintosh encoding, and so on are not based on JIS X 0208, and so there is no guarantee that the JIS X 0208 part of these encodings is common to them. Losing that guarantee is a huge inconvenience.

Because the mapping tables differ, Unicode fonts must have glyphs for every Unicode codepoint that any of these tables can produce, even if the fonts are intended for use only in Japan. This is why mlterm has a command line option to use CP932 as its Unicode/JIS mapping table. The differences between mapping tables (and, further, the Unicode Consortium's standpoint that it is not responsible for mapping tables) make the width problem even more difficult and complex. Moreover, files converted to Unicode are no longer guaranteed to be identical, which causes difficulty for MD5 hashes, the diff command, and so on.
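
The effect on file identity can be illustrated as follows. This sketch uses Python's standard hashlib module and the two codepoints that the tables assign to JIS X 0208 0x2141. The surrounding text "1〜2" is an arbitrary example, and the variable names are illustrative.

```python
import hashlib

# One legacy text containing JIS X 0208 0x2141, converted with two different tables.
via_jis_table = "1\u301c2"    # WAVE DASH
via_cp932_table = "1\uff5e2"  # FULLWIDTH TILDE

for label, text in (("U+301C", via_jis_table), ("U+FF5E", via_cp932_table)):
    data = text.encode("utf-8")
    print(label, data.hex(), hashlib.md5(data).hexdigest())
```

The two UTF-8 byte sequences differ in three bytes, so the digests differ and a byte-wise diff reports a change, although the legacy source text was the same.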

In the case of conversion from Unicode to Japanese encodings, however, a many-to-one mapping can solve the problem. (This holds only as long as different characters in different Japanese encodings are not mapped to the same Unicode codepoint.)
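
A many-to-one reverse table can be sketched as below. The codepoint pairs are taken from the results table in this article; the names VARIANTS, REVERSE, and to_jis are hypothetical, and a real converter would cover the whole character set. The rows for 0x2131, 0x2140, and 0x216F are left out because one of their codepoints (U+203E, U+005C, U+00A5) is also produced by a one-byte code in some tables, which is exactly the case where a many-to-one mapping stops working.

```python
# Unicode codepoints that different tables assign to one JIS X 0208 code.
VARIANTS = {
    0x213D: (0x2015, 0x2014),
    0x2141: (0x301C, 0xFF5E),
    0x2142: (0x2016, 0x2225),
    0x215D: (0x2212, 0xFF0D),
    0x2171: (0x00A2, 0xFFE0),
    0x2172: (0x00A3, 0xFFE1),
    0x224C: (0x00AC, 0xFFE2),
}

# Build the reverse table: every variant points back to the same JIS code.
REVERSE = {}
for jis, codepoints in VARIANTS.items():
    for cp in codepoints:
        REVERSE[cp] = jis


def to_jis(char):
    """Return the JIS X 0208 code for one of the listed characters."""
    return REVERSE[ord(char)]


print(hex(to_jis("\u301c")), hex(to_jis("\uff5e")))
```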


Software Used for the Research

To discuss mapping tables between Japanese encodings and Unicode, we need the tables themselves. I examined mapping tables that are freely available via the Internet; their sources are listed in the comments of the script below. However, I did not check whether these tables are exactly the same as the ones implemented in widely used systems.

All of them include JIS X 0208 as a component.

The Java conversion table was checked in the previous research, but it is omitted this time because I could not obtain the mapping table.

Here is the script I used for this research.

#!/usr/bin/perl

$JIS = 1;
$EUC = 2;
$SJIS = 3;

# Read one mapping file into the global array @map<NAME>, indexed by JIS code.
# $legacy and $ucs are the column numbers of the legacy code and the Unicode
# codepoint in each line; $type says how the legacy code is encoded.
sub readmapfile ($$$$$) {
    my($file, $name, $legacy, $ucs, $type, $line, @column, $u, $l);
    my($h, $out1, $out2);
    ($file, $name, $legacy, $ucs, $type) = @_;

    print "reading $file ...\n";

    open(FILE, $file) || die "Cannot open $file.\n";
    while ($line = <FILE>) {
        # Skip comment lines.
        if ($line =~ /^\#/) {next;}
        if ($line =~ m+^\/+) {next;}
        $line =~ s/^\s+//;
        @column = split(/\s+/, $line);
        # Hexadecimal fields are parsed with hex(); int() does not
        # understand "0x..." strings and would return 0.
        if ($name eq "JISX0213") {
            if ($line =~ /^2-/) {next;}
            $u = $column[$ucs];  $u =~ s/u-/0x/;  $u = hex($u);
            $l = $column[$legacy];  $l =~ s/j-/0x/;  $l = hex($l);
        } elsif ($name =~ /IBM1394/) {
            $u = hex($column[$ucs]);
            $l = hex($column[$legacy]);
        } elsif ($name =~ /GLIBC/) {
            if ($line !~ /^<U/) {next;}
            # "<U3042>" -> "3042", "/xa4/xa2" -> "a4a2"
            $u = $column[$ucs];  $u =~ tr/<U>//d;  $u = hex($u);
            $l = $column[$legacy];  $l =~ s+/x++g;  $l = hex($l);
        } else {
            $u = hex($column[$ucs]);
            $l = hex($column[$legacy]);
        }
        # Normalize the legacy code to a JIS code.
        if ($type == $JIS) {
            # do nothing
        } elsif ($type == $EUC) {
            if ($l > 0x100) {$l -= 0x8080;}
        } else {
            # Shift_JIS: convert lead/trail bytes back to JIS row/column.
            if ($l > 0x100) {
                $h = $l >> 8 & 0xff;
                $l = $l & 0xff;
                $out1 = ($h << 1) - ($h <= 0x9f ? 0xe0 : 0x160)
                    - ($l < 0x9f ? 1 : 0);
                $out2 = $l - 0x1f - ($l >= 0x7f ? 1 : 0)
                    - ($l >= 0x9f ? 0x5e : 0);
                $l = ($out1 << 8) + $out2;
            }
        }
        # Symbolic reference: this script must not run under "use strict".
        ${"map" . $name}[$l] = $u;
    }
    close(FILE);
}

# ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS0208.TXT
&readmapfile("JIS0208.TXT", "JISX0208", 1, 2, $JIS);

# ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/SHIFTJIS.TXT
&readmapfile("SHIFTJIS.TXT", "SHIFTJIS", 0, 1, $SJIS);

# ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
&readmapfile("CP932.TXT", "CP932", 0, 1, $SJIS);

# ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
&readmapfile("JAPANESE.TXT", "APPLE", 0, 1, $SJIS);

# http://www.jca.apc.org/~earthian/aozora/0213/jisx0213code.zip
&readmapfile("jisx0213code.txt", "JISX0213", 1, 4, $JIS);

# http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip
&readmapfile("IBM1394toUCS4-GLY.txt", "IBM1394", 0, 1, $SJIS);

# http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip
&readmapfile("IBM1394toUCS4-IRV.txt", "IBM1394I", 0, 1, $SJIS);

# GNU libc 2.2.5 from Debian libc6 2.2.5-4 package (2002-04-03)
&readmapfile("EUC-JP", "GLIBCEUC", 1, 0, $EUC);

# GNU libc 2.2.5 from Debian libc6 2.2.5-4 package (2002-04-03)
&readmapfile("SHIFT_JIS", "GLIBCSJIS", 1, 0, $SJIS);

print "JIS      0208   SJIS   CP932  APPLE  0213   IBMGLY IBMIRV G-EUC  G-SJIS\n";

for ($r = 0x0; $r <= 0x28; $r++) {
    if ($r > 0 && $r < 0x21) {next;}
    for ($c = 0x21; $c < 0x7f; $c++) {
	$l = ($r << 8) + $c;
	if ($l != 0x5c && $l != 0x7e && $l < 0x100) {next;}
	if ($l > 0x100 && $mapJISX0208[$l] == 0) {next;}
	if ($mapJISX0208[$l] != $mapSHIFTJIS[$l] ||
	    $mapJISX0208[$l] != $mapCP932[$l] ||
	    $mapJISX0208[$l] != $mapAPPLE[$l] ||
	    $mapJISX0208[$l] != $mapJISX0213[$l] ||
	    $mapJISX0208[$l] != $mapIBM1394[$l]) {
	    printf "0x%04X   ",$l;
	    foreach $i ("JISX0208", "SHIFTJIS", "CP932", "APPLE", "JISX0213",
			"IBM1394", "IBM1394I", "GLIBCEUC", "GLIBCSJIS") {
		if (${"map" . $i}[$l] == 0) {print "------ ";}
		else {printf "U+%04X ",${"map" . $i}[$l];}
	    }
	    print "\n";
	}
    }
}
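
The Shift_JIS branch of readmapfile turns each two-byte Shift_JIS code back into its JIS X 0208 code before storing the mapping. As a cross-check, here is the same arithmetic as a small Python sketch, applied to three sample lines written in the column layout that the calls above assume for JIS0208.TXT, SHIFTJIS.TXT, and CP932.TXT. The sample lines and the function names are illustrative and are not copied from the files.

```python
def sjis_to_jis(sjis):
    """Convert a two-byte Shift_JIS code to its JIS X 0208 code."""
    hi, lo = sjis >> 8, sjis & 0xFF
    j1 = (hi << 1) - (0xE0 if hi <= 0x9F else 0x160) - (1 if lo < 0x9F else 0)
    j2 = lo - 0x1F - (1 if lo >= 0x7F else 0) - (0x5E if lo >= 0x9F else 0)
    return (j1 << 8) | j2


def parse_line(line, legacy_col, ucs_col, is_sjis):
    """Return (JIS code, Unicode codepoint) for one line, or None for a comment."""
    if line.startswith("#"):
        return None
    cols = line.split()
    legacy = int(cols[legacy_col], 16)
    ucs = int(cols[ucs_col], 16)
    if is_sjis and legacy > 0x100:
        legacy = sjis_to_jis(legacy)
    return legacy, ucs


samples = [
    ("0x8160\t0x2141\t0x301C\t# WAVE DASH", 1, 2, False),  # JIS0208.TXT layout
    ("0x8160\t0x301C\t# WAVE DASH", 0, 1, True),           # SHIFTJIS.TXT layout
    ("0x8160\t0xFF5E\t#FULLWIDTH TILDE", 0, 1, True),      # CP932.TXT layout
]

results = [parse_line(*sample) for sample in samples]
for jis, ucs in results:
    print("JIS 0x%04X -> U+%04X" % (jis, ucs))
```

All three lines end up under the same JIS key, which is what lets the comparison loop place the Unicode codepoints from different tables side by side.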


Results

The following table is the result.

JIS      0208   SJIS   CP932  APPLE  0213   IBMGLY IBMIRV G-EUC  G-SJIS
-----------------------------------------------------------------------
0x005C   ------ U+00A5 U+005C U+00A5 ------ U+00A5 U+005C U+005C U+00A5 
0x007E   ------ U+203E U+007E U+007E ------ U+203E U+007E U+007E U+203E 
0x2131   U+FFE3 U+FFE3 U+FFE3 U+FFE3 U+203E U+FFE3 U+FFE3 U+FFE3 U+FFE3 
0x213D   U+2015 U+2015 U+2015 U+2014 U+2014 U+2014 U+2014 U+2015 U+2015 
0x2140   U+005C U+005C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C 
0x2141   U+301C U+301C U+FF5E U+301C U+301C U+301C U+301C U+301C U+301C 
0x2142   U+2016 U+2016 U+2225 U+2016 U+2016 U+2016 U+2016 U+2016 U+2016 
0x215D   U+2212 U+2212 U+FF0D U+2212 U+2212 U+2212 U+2212 U+2212 U+2212 
0x216F   U+FFE5 U+FFE5 U+FFE5 U+FFE5 U+00A5 U+FFE5 U+FFE5 U+FFE5 U+FFE5 
0x2171   U+00A2 U+00A2 U+FFE0 U+00A2 U+00A2 U+FFE0 U+FFE0 U+00A2 U+00A2 
0x2172   U+00A3 U+00A3 U+FFE1 U+00A3 U+00A3 U+FFE1 U+FFE1 U+00A3 U+00A3 
0x224C   U+00AC U+00AC U+FFE2 U+00AC U+00AC U+FFE2 U+FFE2 U+00AC U+00AC 

Note on the rows 0x005C and 0x007E: neither code belongs to JIS X 0208. They should be interpreted as ISO 646 IRV for G-EUC and IBMIRV, and as JIS X 0201 Roman for the other mapping tables.

There are ten JIS X 0208 codepoints that are mapped to different Unicode codepoints depending on the mapping table.
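
The table can also be summarized per mapping table. The following Python sketch re-enters the ten JIS X 0208 rows from the table above and counts, for each column, how many entries differ from the 0208 column. Using the 0208 column as the reference is a choice made for this illustration only and does not imply that it is the correct table.

```python
COLUMNS = ["0208", "SJIS", "CP932", "APPLE", "0213",
           "IBMGLY", "IBMIRV", "G-EUC", "G-SJIS"]

TABLE = """\
0x2131 U+FFE3 U+FFE3 U+FFE3 U+FFE3 U+203E U+FFE3 U+FFE3 U+FFE3 U+FFE3
0x213D U+2015 U+2015 U+2015 U+2014 U+2014 U+2014 U+2014 U+2015 U+2015
0x2140 U+005C U+005C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C U+FF3C
0x2141 U+301C U+301C U+FF5E U+301C U+301C U+301C U+301C U+301C U+301C
0x2142 U+2016 U+2016 U+2225 U+2016 U+2016 U+2016 U+2016 U+2016 U+2016
0x215D U+2212 U+2212 U+FF0D U+2212 U+2212 U+2212 U+2212 U+2212 U+2212
0x216F U+FFE5 U+FFE5 U+FFE5 U+FFE5 U+00A5 U+FFE5 U+FFE5 U+FFE5 U+FFE5
0x2171 U+00A2 U+00A2 U+FFE0 U+00A2 U+00A2 U+FFE0 U+FFE0 U+00A2 U+00A2
0x2172 U+00A3 U+00A3 U+FFE1 U+00A3 U+00A3 U+FFE1 U+FFE1 U+00A3 U+00A3
0x224C U+00AC U+00AC U+FFE2 U+00AC U+00AC U+FFE2 U+FFE2 U+00AC U+00AC
"""

# Parse each row into {column name: codepoint string}, keyed by JIS code.
rows = {}
for line in TABLE.splitlines():
    fields = line.split()
    rows[int(fields[0], 16)] = dict(zip(COLUMNS, fields[1:]))

# Count the entries in each column that differ from the 0208 column.
counts = {
    name: sum(1 for row in rows.values() if row[name] != row["0208"])
    for name in COLUMNS[1:]
}

print("rows:", len(rows))
for name, count in counts.items():
    print("%-7s differs from 0208 in %d row(s)" % (name, count))
```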


Discussion

I think it is far more important to unify the mapping tables into a single one than to debate which of them is better or worse. Unfortunately, each vendor has to stay compatible with its own earlier products, and I imagine no vendor can accept changes to its mapping table. Several years have passed since these mapping tables were released in the early days of Unicode, and it is now too late to unify them. I feel like blaming the people who worked on Unicode in those early days for this problem, but I will not, because doing so would bring neither a solution nor any benefit to Japanese people.

At the very least, the appearance of further incompatible mapping tables must be prevented. To achieve this, I think the Unicode Consortium needs to recognize one (or several) orthodox reference mapping table(s) and publish the table(s) via the Internet.

There are other reasons why we need one (or several) orthodox reference mapping table(s):

However, the Unicode Consortium's position is that it takes no responsibility for mapping tables. Indeed, the Unicode Consortium is mainly a union of major companies, and these companies are to blame for this confusing situation. Releasing an orthodox mapping table would imply that the other mapping tables are wrong, which would hurt the pride of these companies. Moreover, there seems to be an opinion that mapping tables should be maintained by the bodies behind the other encoding standards, not by the Unicode Consortium.

I think it is acceptable that the Unicode Consortium cannot take ultimate responsibility for mapping tables; the Japanese standards body may take it instead. However, JIS can be responsible only for the JIS mapping table, not for vendors' mapping tables. Only the Unicode Consortium is in a position to bring the various standards bodies and companies together.

Thus, the following is my proposal.


Additional Comments

I sent a mail to a Unicode member and received a reply saying that this problem will be raised in the Unicode Technical Committee (2002-04-05).

I researched only Japanese encodings. I imagine the same problem exists for encodings used in China, Taiwan, and Korea. In particular, BIG5, which is widely used in Taiwan and in Chinese communities outside mainland China, is only a de facto standard, so I have heard that BIG5's situation is much worse.


Tomohiro KUBOTA <debian at tmail dot plala dot or dot jp>