
Item4622: Saving a topic with malformed UTF-8 pasted into TinyMCE causes TWiki to suicide

Item Form Data

AppliesTo: Engine
Component: I18N
Priority: Urgent
CurrentState: Closed
WaitingFor:
TargetRelease: n/a
ReleasedIn:


Detail

This isn't a bug with TinyMCE AFAICT, it's a bug in TWiki.

When you paste UTF-8 characters into the TinyMCE editor and then save the topic, TWiki dies. The reason is that the UTF-8 encoding is carried through to the write of the actual file in RcsFile, which blows up because the stream is not open for UTF-8. I have been unable to work out how a standard form submission can carry UTF-8 through to that stage, especially considering that the same content is passed through from a textarea without triggering this problem.

Setting Normal priority, as no-one seems much interested in UTF-8. Not interested to fix any of the bugs, that is.

Testcase is Item4583 (currently disabled for WYSIWYG editing).

-- TWiki:Main.CrawfordCurrie - 13 Sep 2007

This is a serious bug because it deleted the whole topic when it happened to me on Item4583. We need to fix it regardless of where it's located otherwise we can't use TMCE.

As I have said many times, until we get the time/effort to do TWiki:Codev.UnicodeSupport, which is a significant piece of work, particularly optimising performance back to normal if possible, and dealing with Perl Unicode issues, any use of Perl Unicode characters in TWiki is a bug and must be eliminated. Since TMCE is introducing the Unicode characters it's TMCE that should avoid this regression.

If someone pastes UTF-8 into TMCE, it's essential that TMCE (or more likely the backend plugins) converts this to the site character set. This is not hard, there's already a TWiki.pm routine to do exactly this.
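To illustrate the kind of conversion meant here, the following is a minimal sketch in Python rather than TWiki's Perl. It is not the TWiki.pm routine mentioned above; the function name `to_site_charset` and the choice of iso-8859-1 as the site character set are assumptions made for the example. Characters the site charset cannot represent are turned into numeric entities instead of being passed through as wide characters.

```python
def to_site_charset(text: str, site_charset: str = "iso-8859-1") -> bytes:
    """Encode text for storage in the (assumed) site character set.

    Characters the charset cannot represent become numeric entities
    such as &#8364; instead of raising an error downstream.
    """
    return text.encode(site_charset, errors="xmlcharrefreplace")


# UTF-8 bytes as they might arrive from a paste: "café €5"
pasted = b"caf\xc3\xa9 \xe2\x82\xac5".decode("utf-8")

stored = to_site_charset(pasted)
print(stored)  # é fits in iso-8859-1; the euro sign becomes an entity
```

Whether entities are an acceptable fallback depends on how the topic text is rendered later; the point is only that nothing outside the site charset reaches the file write.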

The other part of this is to find out why Perl Unicode mode is being turned on (there are sites running in UTF-8 today for Japanese/Chinese etc - breaks WikiWords but doesn't turn on these characters).

This is a release blocker for 4.2 if TMCE is included. You can't have pasting I18N text causing a topic to be deleted, requiring an administrator to do a delRev.

-- RichardDonkin - 14 Sep 2007

Update: it only happens on Mozilla. For once, IE does something better than Moz!

-- TWiki:Main.CrawfordCurrie - 14 Sep 2007

OK, I think I found all the cases in which it can break on FF and IE. There may be more on Safari or Opera.

No, I didn't. I'm going to have to give up on this, I just don't understand what is going on, and am more likely to break than fix. If anyone else wants to try, then take a copy of the raw contents of Item4835 and paste it (in the plaintext editor) into a new topic. Then edit that new topic in TMCE and save it. Watch the fireworks.

As noted in Item4636 the "solution" I tried is not a solution, because it breaks high-bit characters, used in most Western European languages.

CC

I think I understand all the nuances now.

There were two sources of error; the XMLHttpRequest data transfer, and the entity conversion in the translator. Here's what was happening:

Steffen had somehow created a string that contained an entity, &#65533;, and a number of Unicode characters, and he had placed this string in a verbatim block (in Item4583) to demonstrate an error.

TMCE was receiving this string correctly, and then passing it back to the REST handler for conversion to HTML. This involved passing an application/x-www-form-urlencoded POST back to the server. To do this, I had URI-encoded the string, which is the recommended practice. I had also set the transfer encoding to the site charset.

Perl was picking up this string but, not recognising the encoding, was encoding it again as UTF-8. This resulted in a corrupt, double-encoded UTF-8 string containing wide characters.

When this string was converted to HTML and posted back to the client, it was collapsing with a "Wide character in print" error, due to the socket not having a UTF-8 layer.
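The failure chain described above can be reproduced in miniature. This is a sketch in Python, not the TWiki Perl code; the sample string and the iso-8859-1 charset are assumptions for the example, and Python's `UnicodeEncodeError` merely stands in for Perl's complaint about wide characters on a handle without a UTF-8 layer.

```python
import io

original = "na\u00efve \u65e5\u672c"      # "naïve" plus two CJK characters
wire_bytes = original.encode("utf-8")      # what the browser actually sends

# The server misreads the UTF-8 bytes as a single-byte charset ...
misread = wire_bytes.decode("iso-8859-1")
# ... and encodes the result as UTF-8 again: the string is now double-encoded.
double_encoded = misread.encode("utf-8")

print(len(wire_bytes), len(double_encoded))  # the corrupted copy is longer

# Writing text with characters outside the stream's charset fails,
# much like printing wide characters to a handle with no UTF-8 layer.
stream = io.TextIOWrapper(io.BytesIO(), encoding="iso-8859-1")
try:
    stream.write(original)
    stream.flush()
    write_failed = False
except UnicodeEncodeError:
    write_failed = True

print(write_failed)
```

Every byte at or above 0x80 in the original UTF-8 becomes two bytes after the second encode, which is why the corrupted string grows and no longer decodes to the text the user pasted.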

All this was problem 1, and was solved by double-encoding the string in the client to protect wide characters in the transfer. This makes the transfer independent of the encoding (though much larger).
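One reading of "double-encoding in the client" is to URI-encode the string twice, so that after the server's form layer decodes it once the payload is still pure ASCII, whatever charset that layer assumes. The sketch below shows that idea in Python; it is an assumption about the approach, not the code that was checked in, and the variable names are invented for the example.

```python
from urllib.parse import quote, unquote

text = "na\u00efve \u65e5\u672c"

# Client side: percent-encode twice. The second pass escapes the '%' signs.
single = quote(text, safe="")
wire = quote(single, safe="")

# Server form layer: decodes once, with whatever charset it believes in.
# The result is still ASCII, so the charset guess cannot corrupt it.
after_form_layer = unquote(wire, encoding="iso-8859-1")

# Handler: decodes the second layer explicitly as UTF-8.
recovered = unquote(after_form_layer, encoding="utf-8")

print(len(single), len(wire))
print(recovered == text)
```

The cost is the size increase noted above: every `%XX` triplet from the first pass becomes `%25XX` on the wire.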

Once this is solved and we are able to edit the actual strings in the topic, we hit problem 2. When TinyMCE saves it posts HTML back to the server, which then converts that HTML to TML. It parses the HTML, and then reconstructs the TML from the parse tree.

When you have a verbatim block, any HTML inside the verbatim is entity-encoded. During expansion of a verbatim block, the translator decodes entities in the strings embedded in sub-structures under the verbatim. Normally this process is repeated at each level in the parse tree; but in a verbatim, if you do this then the first call correctly decodes the string, and any subsequent call will incorrectly decode any entity strings embedded in the verbatim.

In this case this double-decode resulted in a malformed UTF-8 character being embedded in the string, which caused TWiki to fall over when the topic file containing the string was saved.

The solution to this is to ensure that verbatim blocks are only entity-decoded once.
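The difference between decoding once and decoding twice is easy to see on a small example. This is a sketch using Python's standard `html` module, not the translator's Perl code, and it assumes the entity in question is the numeric reference for U+FFFD typed literally inside the verbatim block.

```python
import html

# What the author typed inside <verbatim>: a literal entity, meant to be shown as-is.
verbatim_source = "&#65533;"

# In the HTML sent to the editor, verbatim content is entity-encoded.
as_html = html.escape(verbatim_source)

# Decoding once restores exactly what the author typed.
once = html.unescape(as_html)

# Decoding again at another level of the parse tree expands the literal
# entity into an actual U+FFFD character that was never in the topic.
twice = html.unescape(once)

print(repr(as_html), repr(once), repr(twice))
```

After the second decode the topic text contains a character outside the site charset, which is the condition that makes the later file write fail.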

Phew!

CC

I was expecting news of a recent China-Syndrome in the east-northern Europe this morning, but it seems things are stable? :-)

Preliminary testing suggests more predictable results between IE and Firefox now; will update where behaviour has changed.

Thanks for looking into this - I am very happy that we now have two UTF-8 experts on the team (?) :-)

-- TWiki:Main.SteffenPoulsen - 16 Sep 2007

ItemTemplate
Summary Saving a topic with malformed UTF-8 pasted into TinyMCE causes TWiki to suicide
ReportedBy TWiki:Main.CrawfordCurrie
Codebase

SVN Range TWiki-4.2.0, Sat, 08 Sep 2007, build 14780
AppliesTo Engine
Component I18N
Priority Urgent
CurrentState Closed
WaitingFor

Checkins TWikirev:14866 TWikirev:14878
TargetRelease n/a
ReleasedIn

Topic revision: r11 - 2007-09-16 - SteffenPoulsen