I have just found out that TinyMCEPlugin has started working with I18N webs and topics - great!
But there is a problem: if pages contain images, their URLs are apparently changed into non-working encodings upon save.
If an image initially looks like this:
<img src="%ATTACHURLPATH%/%e6%f8%e5.png" alt="æøå.png" width='392' height='129' />
- After an edit/save cycle it looks like this:
<img width="392" alt="æøå.png" src="%PUBURL%/Ã�bleGrÃ%u017Ed/PÃ¥nKake/Ã%u0160Ã%u017EÃ¥.png" height="129" />
Which is a broken URL.
This also varies from browser to browser; the above case is with Firefox 2/Windows. IE7 just loses the image completely and leaves an empty tag. Actually, as long as the image is not touched in the editor, an FF2 edit cycle will leave the image alone.
The above is an example where the image name itself has international chars, but the problem appears as soon as the web or topic name has international chars in it. Thereafter, each time Tiny touches the topic it "double-encodes" the chars again, leaving more and more of a mess behind.
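The "double-encoding" described above can be reproduced in plain JavaScript; each pass UTF-8-encodes the string one more time. This is a sketch of the symptom only - `utf8Mangle` is an illustrative name, not a TinyMCEPlugin function:

```javascript
// Sketch of the observed mangling: each save cycle UTF-8-encodes the
// string again. utf8Mangle is an illustrative name, not plugin code.
function utf8Mangle(s) {
  // encodeURIComponent emits UTF-8 %-escapes; unescape() then decodes
  // them byte-wise, yielding the UTF-8 bytes as Latin-1 characters.
  return unescape(encodeURIComponent(s));
}

utf8Mangle("å");              // "Ã¥" - once-mangled, as in "PÃ¥nKake"
utf8Mangle(utf8Mangle("å"));  // mangled again on the next save cycle

escape("ž");                  // "%u017E" - the non-standard escape form
```

Note that the `%u017E`-style sequences in the broken URL are the non-standard escapes that `escape()` emits for characters above U+00FF, which suggests `escape()`/`unescape()` is involved somewhere in the chain.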
Perhaps, to avoid data loss, TinyMCEPlugin could be given a user-configurable option which, when set, would disable the plugin if it finds non-US-ASCII chars in the topic path (or in images in the topic)?
- 08 Sep 2007
It looks like there is some UTF-8 encoding going on here, but the entity-encoded 65533 bit is a bit confusing. Generally, it's crucial to set up TWiki to always use a single charset in both the locale and the site charset - the latest installation-with-I18N guide is correct. JavaScript does support Unicode, so maybe it is assuming any 8-bit-high characters are really UTF-8? My JS I18N knowledge is not very strong; I just read about it a while back.
Since images are attachments, it may be that any images attached before the fix to Item3652 was finalised are also causing problems. The original image path looks OK as a URL-encoded string (3 x ISO-8859-1 characters).
So... is TMCE doing some URL-decoding somewhere and then re-encoding to UTF-8?
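If that is what's happening, it would look roughly like this round trip (a sketch of the suspected behaviour, not actual plugin code; variable names are illustrative):

```javascript
// Sketch of the suspected decode/re-encode round trip.
var original = "%e6%f8%e5.png"; // "æøå.png", %-encoded as ISO-8859-1 bytes

// decodeURIComponent assumes the %-escapes are UTF-8 and rejects these bytes:
var threw = false;
try { decodeURIComponent(original); } catch (e) { threw = e instanceof URIError; }

// unescape() decodes them byte-wise instead:
var decoded = unescape(original); // "æøå.png"

// ...but re-encoding with encodeURIComponent emits UTF-8 escapes, so the
// result no longer matches the path the server actually stores:
var reencoded = encodeURIComponent(decoded); // "%C3%A6%C3%B8%C3%A5.png"
```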
It should be possible to get TMCE to do the right thing here. Disabling TMCE when it finds I18N characters is a bit hard on all those I18N users, and could be hard to do - any %-encoded characters in any URL (even one external to TWiki) might disable TMCE here.
- 12 Sep 2007
Extract from IRC chat with CDot to understand what may be going on:
[15:01] <RichardDonkin> the fact that it varies between IE and others is interesting - only difference should be in client JS, as the URL is pre-encoded
[15:02] <RichardDonkin> ... so browser's preference on how to encode URLs makes no difference
[15:02] <CDot> on the TML->HTML side, the content is first encoded into the textarea. That content is then XMLHttpRequested back to the server, where WysiwygPlugin/TML2HTML.pm converts it to HTML
[15:02] <CDot> this is then sent back to the browser where it is converted to DOM.
[15:03] <CDot> Then during an edit, a popup window accepts URLs for images
[15:03] <CDot> it does some filename manipulation (TinyMCEPlugin/pub/...../twiki.js)
[15:03] <CDot> but only when a user enters a name
[15:04] <CDot> finally, the edited HTML is post-processed by WysiwygPlugin/HTML2TML.pm - that is the most likely source of the problem
[15:04] <CDot> there is a URL-rewriting function in WysiwygPlugin/HTML2TML.pm called postConvertURL that would be my main suspect
[15:04] <CDot> but, need to nail down where in the process it's going wrong first
- 12 Sep 2007
Having looked at the twiki.js file in latest SVN, I think one possible problem is that it assumes the URL's web and topic name match \w+ at line 97. For a URL that points to an I18N web or topic name, that part will be URL-encoded. The else branch may help, but it may not. Something like also matching %-encoded characters may work, in place of the plain \w+. Of course, \w+ is not really correct anyway, as it includes underscores - better to use a JS equivalent of the TWiki name pattern, and so on.
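One way to widen the match along those lines might look like this - a hedged sketch, not the actual twiki.js code - accepting %XX-encoded bytes alongside plain letters and digits, and deliberately excluding underscore:

```javascript
// Assumed sketch: a web/topic path segment that may contain %XX escapes,
// instead of bare \w+ (which fails on URL-encoded I18N names and wrongly
// admits underscores).
var segment = "(?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+";
var webTopic = new RegExp("^(" + segment + ")/(" + segment + ")$");

webTopic.test("Sandbox/WebHome");        // true
webTopic.test("%c5bleGr%f8d/P%e5nKake"); // true - encoded I18N names match
webTopic.test("Web_Name/Topic");         // false - underscore is excluded
```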
The same applies to attachment names, after the Item3652 fix.
Although TWiki.Codev.InternationalisationGuidelines is somewhat server and Perl oriented, a lot of the same principles still apply - using \w is still a red flag. TWiki.Codev.EncodeURLsWithUTF8 ensures that all attachment URLs are URL-encoded by the server side in the site charset, so the TMCE processing chain needs to take account of this at all points. This only applies to attachment URLs, since they are served directly by the web server from the pub directory - all other URLs are served by TWiki scripts, which means the inbound URL is UTF-8 encoded (if the browser prefers) or site-charset encoded, and then converted by the TWiki.pm convert-UTF8-to-site-charset routine.
- 12 Sep 2007
I coded an exclusion regular expression that filters out chars known to be illegal in URLs. Even if I could access the TWiki regular expressions, I want something more relaxed than that.
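The exclusion approach would look something like this - the exact expression committed may differ; this sketch follows RFC 3986's excluded set plus control characters:

```javascript
// Sketch of an exclusion expression: reject only characters that are
// never legal in a URL, rather than whitelisting name characters.
var illegalInUrl = /[\x00-\x20<>"{}|\\^`\x7f]/;

illegalInUrl.test("%e6%f8%e5.png"); // false - %-escaped I18N name is allowed
illegalInUrl.test("bad name.png");  // true - the embedded space is illegal
```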
Can someone please test on a UTF8-enabled TWiki?
- 12 Sep 2007
OK, I see Kenneth tested on a non-UTF-8 enabled TWiki, and found a problem.
I had to revert the change described above, as it broke the wikiword recognition code.
My comment was lost due to Item4622, but I was pointing out that the filter on line 97 of twiki.js is not an exclusion expression, so my suggestion looks relevant, although I don't understand the whole processing chain here. Were you talking about the same line of code?
Also, there's no point testing on a UTF-8 wiki, as TMCE should convert the UTF-8 to a site charset. See my comments on Item4622; full Unicode support is a major project, not a bug fix.
- 14 Sep 2007
While getting Tiny ready for I18N is a very attractive thought, I am not sure it is realistic.
I was just hoping that adding an option to Tiny to have it choose a fallback instead of data corruption in I18N surroundings was a doable thing (hopefully necessary only until we get Tiny working correctly with those special chars).
- 14 Sep 2007
After a nightmare three-day debug and fix, I believe this is all working correctly now.