After a long, complex and detailed debug session on UTF8 and TinyMCE, here are my findings.
First, UTF8 in TWiki topic contents is pretty much fundamentally broken, as far as I can see. When a topic is read from disk, the stream is set to binmode, so the string representing the topic contents is a byte string; this effectively ignores the utf-ness of the string. (If you open the file with the :utf8 layer instead, and the content is not correctly UTF-8 encoded, you get an error.)
Because the strings are processed as byte strings, there is a considerable risk during topic rendering that the second (or third, or fourth) byte of a multi-byte character will match a 7-bit character that TWiki treats specially; for example, a < sign.
When this byte string is presented to the browser, the browser just receives a string of bytes, from which it is able to reassemble Unicode characters. However, when the browser sends that same string back to TWiki in a REST parameter value, it arrives in TWiki marked as a UTF-8 encoded string. Characters that were previously represented using two bytes in the byte string are now represented using a single wide character. If you then try to print the string to STDOUT (it is still the same string), you get a "Wide character in print" warning, which causes an error in the TinyMCEPlugin.
It's not clear to me why the string is accepted by perl as a byte string on form submit, but when the value is taken in JS and passed back using XMLHttpRequest it is seen as a UTF-8 string. The headers on the two requests are identical, and the data in both cases comes from the same textarea.
My familiarity with utf8 is not good; I have been avoiding this area like the plague. So I am looking for advice on how best to handle this. It is awfully tempting just to binmode(STDOUT, ':utf8') in the REST handler. To work around it I have done exactly that, setting STDOUT to utf8 mode in the REST handler, but I don't think that's a deep solution.
Richard, any words of wisdom for me?
- 13 Sep 2007
Turns out this is due to use of the wrong encoding function when sending form parameters using XMLHttpRequest. As such, it's specific to TinyMCE.
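For illustration of how a wrong encoding function produces this symptom (this shows the classic pitfall pair, not necessarily the exact function TinyMCE used): JavaScript's legacy escape() percent-encodes raw code points, while encodeURIComponent() percent-encodes the UTF-8 bytes, so the server decodes two different strings from the same input:

```javascript
const s = "café"; // 'é' is U+00E9, which is two bytes (0xC3 0xA9) in UTF-8

// encodeURIComponent emits the UTF-8 bytes, percent-encoded,
// matching what a normal form submit on a UTF-8 page sends:
console.log(encodeURIComponent(s)); // "caf%C3%A9"

// The legacy escape() emits the raw code point instead:
console.log(escape(s)); // "caf%E9"
```

A server expecting UTF-8 octets but receiving "%E9" (or vice versa) will reconstruct a different string than the one in the textarea, which would explain two requests with identical headers carrying data that Perl interprets differently.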
Glad this is fixed. Generally, TWiki server code should not be doing any opening of files with the :utf8 layer until we do full UnicodeSupport (which we really need to do, though...). A couple of definitions are useful, though Perl has crap terminology here:
- In "Perl Unicode mode", a Perl Unicode character is a single unit whose internal representation happens to be 1 to 4 UTF-8 bytes. It is not a 16-bit value (and Unicode is 21 bits these days), but that doesn't matter - the key thing is that when you step through the string, the 2 to 4 UTF-8 bytes in a multi-byte character are skipped over as one unit. This is not the default; it must be enabled by reading with the :utf8 layer or various other techniques. Sometimes a package such as CGI.pm can turn this on by itself, which is a bug for TWiki purposes.
- In "Perl normal mode", UTF-8 bytes are just like any other bytes, processed at the byte level - they just happen to conform to the UTF-8 encoding (which, by the way, is ASCII-safe: bytes 2 through 4 of a multi-byte character always have the high bit set, unlike some legacy multi-byte encodings - for a list of non-ASCII-safe encodings see TWiki:Codev.JapaneseAndChineseSupport).
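The two modes can be mimicked in a quick JavaScript sketch (an analogy, not Perl internals): iterating characters treats a multi-byte character as one unit, while iterating the UTF-8 octets exposes its individual bytes:

```javascript
const s = "café";

// "Unicode mode" analogue: step through characters; 'é' is one unit.
console.log([...s].length); // 4

// "Normal mode" analogue: step through UTF-8 octets; 'é' is two units,
// both with the high bit set (0xC3, 0xA9), so neither can collide
// with a 7-bit character like '<'.
const octets = Buffer.from(s, "utf8");
console.log(octets.length);           // 5
console.log(octets[3].toString(16));  // "c3"
console.log(octets[4].toString(16));  // "a9"
```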
No TWiki sites should be using Unicode, even as UTF-8 bytes, except for some Japanese/Chinese sites that only need English WikiWords to link automatically - see the install guide linked from I18N for more. Once we have full UnicodeSupport everything will 'just work', but getting to that point could take a while.
Setting STDOUT to utf8 mode would really just be covering up for TWiki getting into Perl Unicode mode, rather than fixing the root cause.
- 14 Sep 2007