
Item4583: I18N: Image URLs are altered / lost

Item Form Data

AppliesTo: Extension
Component: TinyMCEPlugin
Priority: Normal
CurrentState: Closed
WaitingFor:
TargetRelease: n/a
ReleasedIn:


Detail

I have just found out that TinyMCEPlugin has started working with I18N webs and topics - great!

But there is a problem: if a page contains images, their URLs are apparently rewritten into a non-working encoding on save.

If an image up front looks like this:

<img src="%ATTACHURLPATH%/%e6%f8%e5.png" alt="æøå.png" width='392' height='129' />

- After an edit/save cycle it looks like this:

<img width="392" alt="æøå.png" src="%PUBURL%/Ã&#65533;bleGrÃ%u017Ed/PÃ¥nKake/Ã%u0160Ã%u017EÃ¥.png" height="129" />

Which is a broken URL.

This also varies from browser to browser; the case above is with Firefox 2 on Windows. IE7 just loses the image completely and leaves an empty

 <img />
tag. Actually, as long as the image is not touched in the editor, an FF2 edit cycle leaves the image alone.

The above is an example where the image name itself has international chars, but the problem appears as soon as the web or the topic name contains international chars. From then on, each time Tiny touches the topic it continues to "double-encode" the chars, leaving more and more of a mess behind.
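To illustrate what "double-encoding" can look like, here is a minimal sketch. It assumes the corruption comes from UTF-8 bytes being misread as single-byte characters on each edit/save cycle; the thread has not confirmed that this is the actual mechanism, and the helper name `misreadUtf8AsLatin1` is hypothetical.

```javascript
// Hypothetical helper: encode a string as UTF-8, then misread every
// resulting byte as one Latin-1 character (the classic mojibake step).
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Simulate three edit/save cycles on a file name with one non-ASCII char.
let name = "\u00e5.png"; // "å.png"
const rounds = [name];
for (let i = 0; i < 3; i++) {
  name = misreadUtf8AsLatin1(name);
  rounds.push(name);
}

// The name grows on every cycle.
console.log(rounds.map((r) => r.length)); // [ 5, 6, 8, 12 ]
```

After the first cycle "å" has become "Ã¥", and every later cycle doubles the number of non-ASCII characters, which matches the "more and more of a mess" behaviour in general shape only.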

Perhaps, to avoid data loss, TinyMCEPlugin could be given a user-configurable option that, when set, makes it disable itself if it finds non-US-ASCII chars in the topic path (and images in the topic)?

-- TWiki:Main/SteffenPoulsen - 08 Sep 2007

It looks like there is some UTF-8 encoding going on here, but it's a bit confusing re the entity-encoded 65533 bit. Generally, it's crucial to set up TWiki to always use a single charset in both locale and {Site}{CharSet} - the latest installation with I18N guide is correct.

Worth checking whether TMCE is converting data to/from UTF-8. Also, JavaScript does support Unicode, so maybe it is assuming any 8-bit-high characters are really UTF-8? My JS I18N is not very strong, just read about it a while back.

Since images are attachments, it may be that any images attached before the fix to Item3652 was finalised are also causing problems. The original image path looks OK as a URL-encoded string (3 x ISO-8859-1 characters).

So... Is TMCE doing some URL-decoding somewhere and then re-encoding to UTF-8?
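As a rough illustration of that hypothesis (not a trace of what TMCE actually does), the sketch below shows how standard JavaScript treats the original path. `decodeURIComponent` expects UTF-8 and rejects the ISO-8859-1 sequence, while a byte-wise decode followed by `encodeURIComponent` silently yields a UTF-8 URL. The helper names are hypothetical.

```javascript
// Attachment name from the report: three ISO-8859-1 bytes, percent-encoded.
const siteEncoded = "%e6%f8%e5.png";

// decodeURIComponent assumes UTF-8 and throws URIError on invalid sequences.
function tryDecodeUtf8(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    return null;
  }
}

// Hypothetical helper: decode each %XX as one Latin-1 code point.
function decodeLatin1(s) {
  return s.replace(/%([0-9a-fA-F]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const asUtf8 = tryDecodeUtf8(siteEncoded);      // null: not valid UTF-8
const asLatin1 = decodeLatin1(siteEncoded);     // "æøå.png"
const reEncoded = encodeURIComponent(asLatin1); // "%C3%A6%C3%B8%C3%A5.png"

console.log(asUtf8, asLatin1, reEncoded);
```

The re-encoded URL no longer matches a file stored under a site-charset name in pub, so any decode/re-encode step in the chain has to preserve the site charset.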

It should be possible to get TMCE to do the right thing here. Disabling TMCE if it finds I18N characters is a bit hard on all those I18N users, and could be hard to do - any %-encoded characters in any URL (even external to TWiki) might disable TMCE here.

-- TWiki:Main.RichardDonkin - 12 Sep 2007

Extract from IRC chat with CDot to understand what may be going on:

[15:01] <RichardDonkin> the fact that it varies between IE and others is interesting - only difference should be in client JS, as the URL is pre-encoded 
[15:02] <RichardDonkin> ... so browser's preference on how to encode URLs makes no difference

[15:02] <CDot> on the TML->HTML side, the content is first encoded into the textarea. That content is then XMLHttpRequested back to the server, where WysiwygPlugin/TML2HTML.pm converts it to HTML
[15:02] <CDot> this is then sent back to the browser where it is converted to DOM.

[15:03] <CDot> Then during an edit, a popup window accepts URLs for images
[15:03] <CDot> it does some filename manipulation (TinyMCEPlugin/pub/...../twiki.js)
[15:03] <CDot> but only when a user enters a name

[15:04] <CDot> finally, the edited HTML is post-processed by WysiwygPlugin/HTML2TML.pm - that is the most likely source of the problem
[15:04] <CDot> there is a URL-rewriting function in WysiwygPlugin/HTML2TML.pm called postConvertURL that would be my main suspect
[15:04] <CDot> but, need to nail down where in the process it's going wrong first

-- TWiki:Main.RichardDonkin - 12 Sep 2007

Having looked at the twiki.js file in latest SVN, I think one possible problem is that it assumes at line 97 that the URL's web and topic name match \w+. A URL that points to an I18N web or topic name will be URL-encoded, so it will not match. The else branch may help but it may not. Something like matching (?:\w|%[0-9a-f]{2})+ may work, in place of the \w. Of course, \w is not really correct either, as it includes underscores - better to use the JS equivalent of the TWiki webNameRegex pattern, and so on.

The same applies to attachment names, after Item3652 fix.
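A quick sketch of the difference between the two patterns, run against made-up names. The actual code at line 97 of twiki.js is not quoted in this item, so the anchoring and the sample names here are assumptions.

```javascript
// Plain \w+ as described above, versus the suggested alternative that
// also accepts %XX escapes. Anchors and the "i" flag are assumptions.
const plainName = /^\w+$/;
const encodedName = /^(?:\w|%[0-9a-f]{2})+$/i;

// Hypothetical web/topic/attachment name fragments as they appear in URLs.
const samples = ["WebHome", "%e6%f8%e5", "P%e5nKake"];

const results = samples.map((s) => ({
  name: s,
  plain: plainName.test(s),
  encoded: encodedName.test(s),
}));

console.log(results);
```

Only the second pattern accepts the URL-encoded names. As noted above, \w is still looser than the real web name rules because it admits underscores.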

Although TWiki.Codev.InternationalisationGuidelines is somewhat server and Perl oriented, a lot of the same principles still apply - using \w is still a red flag. TWiki.Codev.EncodeURLsWithUTF8 ensures that all attachment URLs are URL-encoded by server side in the site charset, so the TMCE processing chain needs to take account of this at all points. This only applies to attachment URLs, since they are served directly by the web server from pub - all other URLs are served by TWiki scripts, which means that the inbound URL is UTF-8 encoded (if the browser prefers) or site charset encoded, and then converted by the TWiki.pm convert-UTF8-to-site-charset routine.

-- TWiki:Main.RichardDonkin - 12 Sep 2007

I coded an exclusion expression - that filters chars known to be illegal in URLs. Even if I could access the TWiki regular expressions, I want something more relaxed than that.
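For readers unfamiliar with the term, an exclusion expression rejects a set of known-bad characters instead of listing the allowed ones. The sketch below is only an illustration of the idea; the character set is an assumption and is not the expression that was checked in.

```javascript
// Accept any non-empty string that contains none of the characters below:
// whitespace, control characters, and < > " { } | \ ^ `
// This set is an assumption chosen for illustration.
const noIllegalUrlChars = /^[^\s<>"{}|\\^`\x00-\x1f\x7f]+$/;

const candidates = ["%e6%f8%e5", "\u00e6\u00f8\u00e5", "My Web", "Web<Home>"];
const accepted = candidates.filter((c) => noIllegalUrlChars.test(c));

console.log(accepted);
```

Both the percent-encoded and the raw international name pass, which is the "more relaxed" behaviour, while names with spaces or angle brackets are rejected.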

Can someone please test on a UTF8-enabled TWiki?

-- TWiki:Main.CrawfordCurrie - 12 Sep 2007

OK, I see Kenneth tested on a non-UTF-8 enabled TWiki, and found a problem.

I had to revert the change described above, as it broke the wikiword recognition code.

CC

My comment was lost due to Item4622, but I was pointing out that the filter on line 97 of twiki.js is not an exclusion expression, so my suggestion looks relevant, although I don't understand the whole processing chain here. Were you talking about the same line of code?

Also, there's no point testing on a UTF-8 TWiki, as TMCE should convert the UTF-8 to the site charset. See my comments on Item4622: full Unicode support is a major project, not a bug fix.

-- TWiki:Main.RichardDonkin - 14 Sep 2007

While getting Tiny ready for I18N is a very attractive thought, I am not sure it is realistic.

I was just hoping that adding an option to Tiny to have it choose a fallback instead of data corruption in I18N surroundings would be doable (hopefully necessary only until we get Tiny to work correctly for those special chars).

-- TWiki:Main.SteffenPoulsen - 14 Sep 2007

After a nightmare 3-day debug and fix, I believe this is all working correctly now.

CC

ItemTemplate
Summary I18N: Image URLs are altered / lost
ReportedBy TWiki:Main.SteffenPoulsen
Codebase

SVN Range TWiki-4.2.0, Sat, 08 Sep 2007, build 14780
AppliesTo Extension
Component TinyMCEPlugin
Priority Normal
CurrentState Closed
WaitingFor

Checkins TWikirev:14844 TWikirev:14848
TargetRelease n/a
ReleasedIn

Topic revision: r15 - 2007-09-16 - CrawfordCurrie
 