This isn't a bug with TinyMCE AFAICT, it's a bug in TWiki.
When you paste UTF-8 characters into the TinyMCE editor and then save the topic, TWiki dies. The reason is that the UTF-8 encoding is carried through to the write of the actual file in RcsFile, which blows up because the stream is not open for UTF-8. I have been unable to work out how a standard form submission can carry UTF-8 through to that stage, especially considering that the same content is passed through from a textarea without triggering this problem.
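The failure mode described above can be reproduced outside TWiki. The following is a minimal Python sketch, not TWiki's Perl code: the `write_topic` helper is hypothetical, and the Latin-1 stream is only an assumed stand-in for a file handle that was opened without a UTF-8 layer.

```python
import io

def write_topic(text: str, encoding: str) -> bytes:
    """Write text through a stream opened with the given encoding (hypothetical helper)."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the underlying buffer alone once we have the bytes
    return data

# Pasted text containing characters that Latin-1 cannot represent.
pasted = "caf\u00e9 \u65e5\u672c\u8a9e"

try:
    write_topic(pasted, "iso-8859-1")
    failed = False
except UnicodeEncodeError:
    # Analogous to the store write blowing up on wide characters.
    failed = True

# The same text goes through cleanly when the stream expects UTF-8.
utf8_bytes = write_topic(pasted, "utf-8")
```

The point of the sketch is only that the text and the stream have to agree on an encoding; which side should change in TWiki is the question the rest of this report argues about.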
Setting Normal priority, as no-one seems much interested in UTF-8. Not interested in fixing any of the bugs, that is.
Testcase is Item4583 (currently disabled for WYSIWYG editing)
-- TWiki:Main.CrawfordCurrie - 13 Sep 2007
This is a serious bug because it deleted the whole topic when it happened to me on Item4583. We need to fix it regardless of where it's located, otherwise we can't use TMCE.
As I have said many times, until we get the time/effort to do TWiki:Codev.UnicodeSupport, which is a significant piece of work, particularly optimising performance back to normal if possible, and dealing with Perl Unicode issues, any use of Perl Unicode characters in TWiki is a bug and must be eliminated. Since TMCE is introducing the Unicode characters it's TMCE that should avoid this regression.
If someone pastes UTF-8 into TMCE, it's essential that TMCE (or more likely the backend plugins) converts this to the site character set. This is not hard, there's already a TWiki.pm routine to do exactly this.
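As a rough illustration of that conversion step, here is a Python sketch under stated assumptions. It is not the TWiki.pm routine mentioned above: `to_site_charset` is a hypothetical name, ISO-8859-1 is an assumed site character set, and falling back to numeric entities for unrepresentable characters is my assumption rather than documented TWiki behaviour.

```python
def to_site_charset(utf8_bytes: bytes, site_charset: str = "iso-8859-1") -> bytes:
    """Re-encode UTF-8 input into the site character set (hypothetical helper)."""
    text = utf8_bytes.decode("utf-8")
    # Characters the site charset cannot hold become numeric entities
    # (an assumed fallback policy) instead of raising an error.
    return text.encode(site_charset, errors="xmlcharrefreplace")

pasted = "na\u00efve \u65e5\u672c".encode("utf-8")
converted = to_site_charset(pasted)
# converted is b"na\xefve &#26085;&#26412;"
```

With this policy the high-bit Western European character survives as a single Latin-1 byte, and only the characters outside the site charset are rewritten.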
The other part of this is to find out why Perl Unicode mode is being turned on (there are sites running in UTF-8 today for Japanese/Chinese etc; this breaks WikiWords but doesn't turn on these characters).
This is a release blocker for 4.2 if TMCE is included. You can't have pasting I18N text causing a topic to be deleted, requiring an administrator to do a delRev.
-- RichardDonkin - 14 Sep 2007
Update: it only happens on Mozilla. For once, IE does something better than Moz!
-- TWiki:Main.CrawfordCurrie - 14 Sep 2007
OK, I think I found all the cases in which it can break on FF and IE. There may be more on Safari or Opera.
No, I didn't. I'm going to have to give up on this, I just don't understand what is going on, and am more likely to break than fix. If anyone else wants to try, then take a copy of the raw contents of Item4835 and paste it (in the plaintext editor) into a new topic. Then edit that new topic in TMCE and save it. Watch the fireworks.
As noted in Item4636 the "solution" I tried is not a solution, because it breaks high-bit characters, used in most Western European languages.
CC
I think I understand all the nuances now.
There were two sources of error; the XMLHttpRequest data transfer, and the entity conversion in the translator. Here's what was happening:
Steffen had somehow created a string that contained an entity, &#65533;, and a number of Unicode characters, and he had placed this string in a verbatim block (in Item4583) to demonstrate an error.
TMCE was receiving this string correctly, and then passing it back to the REST handler for conversion to HTML. This involved passing an application/x-www-form-urlencoded POST back to the server. To do this, I had URI-encoded the string, which is the recommended practice. I had also set the transfer encoding to the site charset on the transfer.
Perl was picking up this string, but not recognising the encoding, so it was encoding it again as UTF-8. This resulted in a corrupt UTF-8 string containing wide characters.
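The double-encoding step is easy to see on a single character. This Python sketch is a generic illustration, not a trace of what Perl actually did inside TWiki; the assumption that the receiver treated the bytes as ISO-8859-1 is mine.

```python
original = "\u00e9"                        # e with acute accent

# The sender encodes the character as UTF-8: two bytes.
wire = original.encode("utf-8")            # b"\xc3\xa9"

# The receiver does not recognise the encoding and reads each byte
# as a separate Latin-1 character (assumed misinterpretation).
misread = wire.decode("iso-8859-1")        # two characters, not one

# Encoding that misread string as UTF-8 again doubles the damage.
double_encoded = misread.encode("utf-8")   # b"\xc3\x83\xc2\xa9"
```

One character has become four bytes, and decoding them as UTF-8 yields two wrong characters instead of the original one.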
When this string was converted to HTML and posted back to the client, it was collapsing with a "Wide character in print" error, due to the socket not having a UTF-8 layer.
All this was problem 1, and was solved by double-encoding the string in the client to protect wide characters in the transfer. This makes the transfer independent of the encoding (though much larger).
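One way to read "double-encoding in the client" is percent-encoding the payload twice, so that it is still plain ASCII after the server's automatic form decoding. The Python sketch below works under that assumption; `client_encode` and `server_decode` are hypothetical names, and the actual client code is JavaScript that is not shown in this report.

```python
from urllib.parse import quote, unquote

def client_encode(text: str) -> str:
    """Percent-encode twice so the transfer carries only ASCII (hypothetical)."""
    once = quote(text, safe="")    # UTF-8 bytes become %XX escapes
    return quote(once, safe="")    # the '%' signs themselves become %25

def server_decode(body: str) -> str:
    """Undo both layers; the second decode is explicitly UTF-8 (hypothetical)."""
    once = unquote(body)           # stands in for the automatic form decoding
    return unquote(once, encoding="utf-8")

text = "verbatim \ufffd \u65e5\u672c"
wire = client_encode(text)
round_trip = server_decode(wire)
```

After the first decode the payload still contains only ASCII escapes, so no charset guess on the server can corrupt it. The cost is size: every non-ASCII byte grows from three characters on the wire to five.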
Once this is solved and we are able to edit the actual strings in the topic, we hit problem 2. When TinyMCE saves, it posts HTML back to the server, which then converts that HTML to TML. It parses the HTML, and then reconstructs the TML from the parse tree.
When you have a verbatim block, any HTML inside the verbatim is entity-encoded. During expansion of a verbatim block the translator decodes entities in the strings embedded in sub-structures under the verbatim. Normally this process is repeated at each level in the parse tree; but in a verbatim, if you do this then the first call correctly decodes the string, but then any subsequent call will incorrectly decode any entity strings embedded in the verbatim.
In this case this double-decode resulted in a malformed UTF-8 character being embedded in the string, which caused TWiki to fall over when the topic file containing the string was saved.
The solution to this is to ensure that verbatim blocks are only entity-decoded once.
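The difference between decoding once and decoding at every level can be sketched in a few lines of Python. The tuple-based tree and both function names are hypothetical and much simpler than the translator's real parse tree; only `html.unescape` is a real library call.

```python
import html

def flatten_buggy(node):
    """Decode entities at every level of the tree (the failing behaviour)."""
    if isinstance(node, str):
        return html.unescape(node)
    _tag, children = node
    text = "".join(flatten_buggy(child) for child in children)
    return html.unescape(text)     # decodes again on the way back up

def flatten_once(node):
    """Decode entities only at the leaves, so each string is decoded once."""
    if isinstance(node, str):
        return html.unescape(node)
    _tag, children = node
    return "".join(flatten_once(child) for child in children)

# Literal entity text inside a verbatim block arrives entity-encoded.
tree = ("verbatim", [("span", ["&amp;#65533;"])])

kept = flatten_once(tree)      # the entity text the author typed
broken = flatten_buggy(tree)   # the entity has been turned into a character
```

In the single-decode version the verbatim block keeps the literal entity text; in the per-level version the second pass turns that text into U+FFFD, which is how an unexpected wide character can end up in the topic.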
Phew!
CC
I was expecting news of a recent China-Syndrome in north-eastern Europe this morning, but it seems things are stable?
Preliminary testing suggests more predictable results between IE and Firefox now, will update where behaviour has changed.
Thanks for looking into this - I am very happy that we now have two UTF-8 experts on the team (?)
-- TWiki:Main.SteffenPoulsen - 16 Sep 2007