kanji collation.

For a while, I've been irritating everyone I know with an interest in either Japanese or information processing by observing that the iPhone's sorting algorithm for kanji is basically nonexistent. As far as I can tell, it sorts according to Unicode code point order (probably normalized, but I haven't gone to the trouble of checking). This is deterministic, but not, like, logical or intuitive.

For example, I have some music with titles of the form "first composition," "second composition," and so on. Naturally these bog-standard ordinary numbers sort in this order: 一、七、三、九、二、五、八、六、四. (The Unicode locale explorer does the same thing.)
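To make the code point claim concrete, here is a minimal sketch in Python, whose default string comparison is by code point. It only reproduces the ordering above; it says nothing about what the iPhone actually does internally, which I'm still guessing at.

```python
# The numerals one through nine, in the order a human would expect.
numerals = ["一", "二", "三", "四", "五", "六", "七", "八", "九"]

# Python compares strings by code point, so a plain sorted() shows
# what "code point order" does to them.
by_code_point = sorted(numerals)

for ch in by_code_point:
    # Print each numeral next to its code point to show why it lands there.
    print(ch, f"U+{ord(ch):04X}")
```

The printed code points climb from U+4E00 (一) to U+56DB (四), which is exactly the order quoted above.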

I consider this unreasonable. They're not even the fancy anti-forgery numbers. Now, I only have the loosest idea of how kanji ordering in the absence of pronunciation is supposed to work, but my understanding is that classical dictionaries assign each kanji a "primary radical" and then sort on stroke count and/or other radicals.

The JIS collation standard, JIS X 4061, doesn't appear to address this problem. (Standard disclaimer about reading Japanese documents applies.) However, there is an overlay mechanism in the Unicode spec that's designed to deal with this; the spec calls it tailoring. (See UTS #10, although I don't recommend actually reading it.) I propose that someone figure out the "classical" collation for at least the Joyo kanji and the most common others, and order them as an overlay. This cannot be difficult, assuming that my base assumption about dictionary order is correct.

And special-case the numbers if necessary. This is ridiculous.
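For what it's worth, the special case is not much code. The sketch below is my own illustration in Python, not anything from UTS #10 or JIS X 4061: `NUMERAL_VALUE`, `char_key`, and `overlay_key` are hypothetical names, and the choice to park all the numerals at the position of 一 is an assumption made to keep the example short. Everything that isn't a numeral stays in code point order.

```python
# Hypothetical override table: numeral kanji mapped to numeric value.
NUMERAL_VALUE = {
    "一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
    "六": 6, "七": 7, "八": 8, "九": 9,
}

def char_key(ch):
    """Sort key for one character: numerals by value, the rest by code point."""
    value = NUMERAL_VALUE.get(ch)
    if value is not None:
        # Assumption: all numerals collate together where 一 would sit.
        return (ord("一"), value)
    return (ord(ch), 0)

def overlay_key(text):
    """Sort key for a whole string, built character by character."""
    return [char_key(ch) for ch in text]

titles = ["三番", "一番", "九番", "二番"]
in_order = sorted(titles, key=overlay_key)
print(in_order)
```

This works one character at a time, so it does not parse compound numbers like 二十 as twenty. A real overlay would have to handle those, along with every kanji that isn't a number.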

words from chris, 2013-11-23 14:16:34, los angeles

