
Item4617: Printing UTF-8 chars fries REST handlers (and my brain)

Item Form Data

AppliesTo: Extension
Component: TinyMCEPlugin
Priority: Urgent
CurrentState: Closed
WaitingFor:
TargetRelease: n/a
ReleasedIn:


Detail

After a long, complex and detailed debug session on UTF8 and TinyMCE, here are my findings.

First, UTF8 in TWiki topic contents is pretty much fundamentally broken, as far as I can see. When a topic is read from the disc, the stream is set to binmode, so the string representing the topic contents is a byte string. This effectively ignores the utf-ness of the string. If you instead open the file with :utf8, and the content is not correctly UTF-8 encoded, you will get an error.
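Editor's illustration of the two ways of reading described above. This is a minimal sketch in Python, not TWiki's Perl code; the function name read_topic and the sample strings are made up for the example, and Python's strict decoding is only an analogue of opening a file with a decoding layer in Perl.

```python
def read_topic(raw: bytes):
    """Return the untouched byte view and, if possible, the decoded text.

    raw stands in for the bytes of a topic file read in binary mode.
    """
    try:
        decoded = raw.decode("utf-8")  # strict: rejects malformed input
    except UnicodeDecodeError:
        decoded = None                 # content is not valid UTF-8
    return raw, decoded


valid = "caf\u00e9".encode("utf-8")      # correctly encoded topic text
legacy = "caf\u00e9".encode("latin-1")   # same text saved in a legacy charset

for data in (valid, legacy):
    byte_view, text = read_topic(data)
    print(len(byte_view), "bytes ->", repr(text))
```

Reading as bytes always succeeds but knows nothing about characters; decoding gives characters but fails on content that was saved in another encoding.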

Because the strings are processed as byte strings, there is a considerable risk during topic rendering that the second (or third, or fourth) byte of a UTF-8 encoded character will match a 7-bit character treated specially in TWiki; for example, a < sign.

When this byte string is presented to the browser, it just receives a string of bytes. From that it is able to reassemble unicode characters. However, when the browser sends that same string back to TWiki in a REST parameter value, it arrives in TWiki marked as a utf8 encoded string. The characters that were previously represented using two 7-bit characters in the byte string are now represented using a single 16-bit character. If you then try to print the string to STDOUT (it is still the same string) you will get a warning about the string containing wide characters, which causes an error in the TinyMCEPlugin.
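Editor's illustration of the byte-string versus character-string difference described above, again as a Python sketch rather than Perl. The helper write_text is hypothetical. Python raises an exception where Perl only warns, so the failure mode is an analogue, not a reproduction; the second call corresponds roughly to the workaround of putting the output stream into UTF-8 mode.

```python
import io


def write_text(text: str, encoding: str) -> bytes:
    """Write text through a stream that encodes with `encoding`.

    Returns the bytes that reached the underlying buffer.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


raw = "\u20ac".encode("utf-8")  # euro sign as UTF-8 bytes
text = raw.decode("utf-8")      # the same content as a character string

print(len(raw), len(text))      # byte count versus character count

try:
    write_text(text, "ascii")   # stream not prepared for non-ASCII characters
except UnicodeEncodeError as err:
    print("cannot print:", err.reason)

# A stream in UTF-8 mode turns the character back into the original bytes.
print(write_text(text, "utf-8") == raw)
```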

It's not clear to me why the string is accepted by Perl as a byte string on form submit, but when the value is taken in JS and passed back using XMLHttpRequest it is seen as a UTF-8 string. The headers on the two requests are identical, and the data in both cases comes from the same textarea.

My familiarity with utf8 is not good; I have been avoiding this area like the plague. So I am looking for advice on how best to handle this. It is awfully tempting just to use bytes in the REST handler. To work around it I have set STDOUT to utf8 mode in the REST handler, but I don't think that's a deep solution.

Richard, any words of wisdom for me?

-- TWiki:Main/CrawfordCurrie - 13 Sep 2007

Turns out this is due to use of the wrong encoding function when sending form parameters using XMLHttpRequest. As such, it's specific to TinyMCE.
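Editor's illustration of how the choice of encoding function changes what the server receives. The item does not record which function was involved, so this is an assumption: the sketch compares the output format of the browser's legacy escape() with UTF-8 percent-encoding as done by encodeURIComponent(), mimicked in Python. The helper legacy_escape is hypothetical and only handles characters in the Basic Multilingual Plane.

```python
from urllib.parse import quote

# Characters that the legacy escape() function leaves untouched.
ESCAPE_SAFE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789@*_+-./"
)


def legacy_escape(text: str) -> str:
    """Mimic the output format of escape() (BMP characters only)."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in ESCAPE_SAFE:
            out.append(ch)
        elif code < 256:
            out.append("%%%02X" % code)   # one Latin-1 code point, not UTF-8
        else:
            out.append("%%u%04X" % code)  # non-standard %uXXXX form
    return "".join(out)


def utf8_percent_encode(text: str) -> str:
    """Percent-encode the UTF-8 bytes, as encodeURIComponent() does."""
    return quote(text, safe="!*'()")


for sample in ("\u00e9", "\u20ac"):
    print(legacy_escape(sample), utf8_percent_encode(sample))
```

The two functions put different escape sequences on the wire for the same character, so server code that expects one form will misread the other.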

CC

Glad this is fixed. Generally, TWiki server code should not be doing any opening of files with the :utf8 layer until we do full UnicodeSupport (which we really need to do though...). A couple of definitions are useful, though Perl has crap terminology here:

  • In "Perl Unicode mode", a Perl Unicode character is a single unit whose internal representation happens to be 1 to 4 UTF-8 bytes, not a 16-bit value (and Unicode is 21 bits these days), but that doesn't matter - the key thing is that when you step through the string, the 2 to 4 UTF-8 bytes of a non-ASCII character are skipped over as one unit. This is not the default; it must be enabled by reading with the :utf8 layer or by various other techniques. Sometimes a package such as CGI.pm can turn this on, though, which is a bug for TWiki purposes.
  • In "Perl normal mode", UTF-8 bytes are just like any other bytes, processed at the byte level - they just happen to conform to the UTF-8 encoding (which, by the way, is ASCII-safe: bytes 2 through 4 of a multi-byte character all have the high bit set, unlike some legacy multi-byte encodings - for a list of non-ASCII-safe encodings see TWiki:Codev.JapaneseAndChineseSupport).
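Editor's illustration of the ASCII-safety point, as a Python sketch. Shift_JIS is used here as a well-known legacy multi-byte encoding; whether it appears on the referenced list is an assumption. The helper has_ascii_byte is made up for the example.

```python
def has_ascii_byte(encoded: bytes) -> bool:
    """True if any byte falls in the 7-bit ASCII range."""
    return any(b < 0x80 for b in encoded)


sample = "\u8868"                   # a single CJK character
utf8 = sample.encode("utf-8")       # every byte has the high bit set
sjis = sample.encode("shift_jis")   # second byte is 0x5C, the ASCII backslash

print(utf8.hex(), has_ascii_byte(utf8))
print(sjis.hex(), has_ascii_byte(sjis))
```

Because no byte of a multi-byte UTF-8 character lies in the ASCII range, byte-level processing cannot mistake part of such a character for < or any other ASCII character, whereas the Shift_JIS form of the same character contains a byte that is indistinguishable from a backslash.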

No TWiki sites should be using Unicode even as UTF-8 bytes, except for some Japanese/Chinese sites that only need English WikiWords to link automatically - see install guide linked from I18N here for more. Once we have full UnicodeSupport everything will 'just work', but getting to this point could take a while.

use bytes would really just be covering up for TWiki getting into Perl Unicode mode, rather than fixing the root cause.

-- TWiki:Main.RichardDonkin - 14 Sep 2007

ItemTemplate
Summary Printing UTF-8 chars fries REST handlers (and my brain)
ReportedBy TWiki:Main.CrawfordCurrie
Codebase

SVN Range TWiki-4.2.0, Sat, 08 Sep 2007, build 14780
AppliesTo Extension
Component TinyMCEPlugin
Priority Urgent
CurrentState Closed
WaitingFor

Checkins

TargetRelease n/a
ReleasedIn

Topic revision: r3 - 2007-09-14 - RichardDonkin
 