Converting from Traditional to Simplified Chinese

Posted by janiner on 2009-06-25 11:16

I support a website written in Tcl which displays data in Traditional Chinese (big5). We then have a Java servlet, using the translation code from mandarintools.com, to translate the result of a page request into Simplified Chinese. The conversion as specified to the translation code is from UTF-8 to UTF-8S; Java is apparently correctly translating the data to UTF-8 as it comes in.

The Java translation code works but is slow, and since the website is written in Tcl someone on another list suggested I try using that. Unfortunately, Tcl doesn't support UTF-8S and I have been unable to figure out what translation to use in its place. I've tried gb2312, gb2312-raw, gb1988, euc-cn, cp936... all result in gibberish. My assumption is that Tcl is also translating to UTF-8 as it comes in, though I have tried converting from big5/cp950 first and it doesn't help.

My test code looks like this:

set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body

I have also tried

set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 [encoding convertfrom big5 $page_body]]
ns_write $translated_page_body

But it didn't change anything.

Any pointers? I'm using Tcl 8.5.7.

jeffh (ActiveState Staff) | Thu, 2009-06-25 13:22

I see that you are using AOLServer (or NaviServer) from the ns_write. It's not clear what you want to do that would require UTF-8S, which is not a standardized encoding.

I'm not completely familiar with AOLServer encoding handling, so this might best be asked on that mailing list.

Data in Tcl is generally utf-8, with the String Tcl_Obj type being UCS-2. As such it does not handle non-BMP characters, if that is what you are trying to deal with.

I suspect that you are double-converting regular utf-8 at some point, or even utf-8s. The key is what format of data ns_httpget and ns_write are expecting. Is it binary data, or something else? In any case, you are not looking for simple transcoding, which is what the 2nd example above does. That is trivial in Tcl. You require a translation of Traditional to Simplified Chinese; beyond that, it is merely a matter of getting the encodings right as the data is passed along. Is it possible to call the mandarintools.com code directly from Tcl? That would eliminate one possible (hairy) level of errors.
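To illustrate the point about transcoding, here is a minimal sketch, assuming a stock Tcl installation that ships the big5 encoding. The variable names are made up for the example. It shows that changing the encoding only changes the bytes; the Traditional character itself comes out the other side unchanged.

```tcl
# U+6F22 is a Traditional character; U+6C49 is its Simplified counterpart.
set traditional "\u6f22"
set simplified  "\u6c49"

# Encode to Big5 bytes and decode again, as happens when a Big5 page is read.
set big5_bytes [encoding convertto big5 $traditional]
set decoded    [encoding convertfrom big5 $big5_bytes]

# Re-encode as UTF-8 and decode once more: a different byte layout, same character.
set utf8_bytes [encoding convertto utf-8 $decoded]
set redecoded  [encoding convertfrom utf-8 $utf8_bytes]

puts [expr {$redecoded eq $traditional}]  ;# true: still the Traditional character
puts [expr {$redecoded eq $simplified}]   ;# false: transcoding never simplifies it
```

No choice of target encoding in the test code above can turn one character into the other, because they are separate characters rather than two encodings of one.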

janiner | Thu, 2009-06-25 13:59

I chose UTF-8S only because it is the encoding that worked with the Mandarin Tools encoder. From what I understand it was created by Oracle and isn't well supported in other places, but it was what worked at the time. I'm certainly not tied to it if something else will work better.

Actually I have discussed this on the AOLserver list and that's where the suggestion to use 'encoding convertto' came from. :) Sorry I forgot to mention AOLserver; that is the only environment in which I use Tcl and I tend to forget that what we do is not entirely standard.

I don't know if I'm trying to use non-BMP characters or not, sorry! This Unicode/translation stuff is all fairly new to me. The most I've had to do in the past is deal with smart-quote issues.

The translation tool is written in Java so it can only be used as a servlet (what I'm doing now) or as a command line tool that I would have to exec out to. My hope was to get Tcl to do this for me as it seemed likely to be faster.

I will continue investigating but any suggestions would be welcome!

janiner | Sat, 2009-06-27 15:46

And the only response I've received so far says that using "encoding convertto" won't do what I want it to do anyway. Which is frustrating, considering that's where I was advised to do this in the first place (albeit by a different person).

So which is it - would this work, if I got the encodings right, or is it a dead end?

jeffh (ActiveState Staff) | Mon, 2009-06-29 08:42

The person on the AOLServer list was correct, but you don't seem to be aware of the technical issues involved. Converting from Traditional to Simplified is NOT simply a matter of encoding; it is a (simplified) case of language translation. The same word is written with different Unicode code points in each form. You can easily create a string map for this translation. See also the simple docs and working code for Perl at:

http://search.cpan.org/~divec/Lingua-ZH-HanConvert-0.12/HanConvert.pm
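A string map along those lines might look like the sketch below. This is an illustration under assumptions, not a complete converter: the four character pairs are a tiny sample, trad2simp and the variable names are hypothetical, and a usable table would need thousands of pairs (for instance, taken from the data in the Perl module above). The page bytes are simulated here rather than fetched with ns_httpget.

```tcl
# Flat list of Traditional/Simplified pairs, in the form [string map] expects.
# Illustrative subset only; a real table is far larger.
set t2s_map [list \
    \u6f22 \u6c49 \
    \u8a9e \u8bed \
    \u570b \u56fd \
    \u9ad4 \u4f53]

# trad2simp is a hypothetical helper name, not a built-in command.
proc trad2simp {text} {
    global t2s_map
    return [string map $t2s_map $text]
}

# Stand-in for the Big5 bytes of a fetched page.
set big5_bytes [encoding convertto big5 "\u6f22\u8a9e"]

# Decode the bytes, map the characters, then encode for output.
set text       [encoding convertfrom big5 $big5_bytes]
set simplified [trad2simp $text]
set out_bytes  [encoding convertto utf-8 $simplified]
```

The mapping step works on decoded characters, so the encodings only matter at the edges: once on the way in and once on the way out. Characters that are not in the table pass through untouched.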

janiner | Tue, 2009-06-30 10:45

I was actually aware that it's a translation, but several people had suggested that changing the encoding would do this for me. I guess not (which was my assumption in the first place, but they had me second-guessing myself). I'll check out the link, thanks.