
Item4946: urlDecode() not working for characters represented by Unicode code points

Item Form Data

AppliesTo: Engine
Component: I18N
Priority: Urgent
CurrentState: Closed
WaitingFor:
TargetRelease: patch
ReleasedIn:

Detail

TinyMCEPlugin uses urlDecode() in TWiki.pm, but urlDecode() does not handle characters represented as Unicode code points in the %uXXXX format.

If you use UTF-8 as the site character encoding, non-ASCII characters are corrupted when edited with TinyMCEPlugin's WYSIWYG editor.

The following patch solves the problem.

--- lib/TWiki.pm        2007-03-03 23:45:57.000000000 +0900
+++ lib/TWiki.pm.patched        2007-10-23 15:46:56.000000000 +0900
@@ -2157,6 +2202,9 @@
     my $text = shift;

     $text =~ s/%([\da-f]{2})/chr(hex($1))/gei;
+    use encoding "utf8";         
+    $text =~ s/%u([\da-f]{4})/chr(hex($1))/gei;
+    no encoding;

     return $text;
 } 

Since urlDecode() is rarely used, this patch should have only a minimal performance impact.

Since this site uses ISO-8859-15, it's not practical to edit Chinese, Japanese, and Korean characters in raw text editing, because those characters show up as &#DDDDD;. In WYSIWYG editing, however, &#DDDDD; is handled under the hood, so those East Asian characters remain editable.

-- TWiki:Main/HideyoImazu - 09 Nov 2007

marking as urgent to get it some attention

-- SvenDowideit - 15 Nov 2007

Sorry, I'm just not interested in I18N. I have other requirements that are higher priority, and I don't have the knowledge to know if the above patch is a good idea or not.

-- CC

I believe good I18N support is important for the future of TWiki. How should I move forward?

-- TWiki:Main.HideyoImazu - 16 Nov 2007

I will need to look at this a bit more. Doing full TWiki:Codev.UnicodeSupport is the real solution, but maybe we can do something specific like this so that TMCE works better with UTF-8 as the site charset. I would rather see something that explicitly uses encode or decode from CPAN:Encode so that it's clear what is going on; more generally, the existing routine for UTF-8 conversion should be extended or added to, so that we don't end up with UTF-8 conversion in several places.
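For illustration, an Encode-based variant might look something like this (a hypothetical sketch only, not the shipped fix; the function name and the way the site charset is passed in are assumptions):

    use Encode ();

    sub urlDecodeWithUnicode {
        my ( $text, $siteCharSet ) = @_;
        # standard %XX decoding, as in the existing urlDecode()
        $text =~ s/%([\da-f]{2})/chr(hex($1))/gei;
        # decode %uXXXX escapes to characters, then re-encode them
        # explicitly into the site charset via CPAN:Encode
        $text =~ s/%u([\da-f]{4})/Encode::encode( $siteCharSet, chr( hex($1) ) )/gei;
        return $text;
    }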

This patch is not general enough as a fix, as it will only work for UTF8 sites at present, I think.

I don't understand Hideyo's last paragraph - are you referring to TWiki.org or something? If your site is not running with UTF-8 as charset, you shouldn't really expect this URL encoding charset hack to work anyway - it's not a good way of doing I18N.

-- TWiki:Main.RichardDonkin - 16 Nov 2007

Richard, you are right. I wasn't considering other site charsets than UTF-8.

I mean that http://develop.twiki.org/~twiki4/ is employing ISO-8859-15 as far as I can see. I put that paragraph in just in case somebody tries entering East Asian characters on this site and finds that TinyMCEPlugin's WYSIWYG editor can handle those characters. But that's not because urlDecode() is fine; it's because East Asian characters end up being represented as &#DDDDD;, which makes raw text editing impractical.

-- TWiki:Main.HideyoImazu - 17 Nov 2007

I've come up with a better patch. This should work with both Perl 5.6 and 5.8, regardless of the site charset.

@@ -2157,6 +2202,12 @@
     my $text = shift;

     $text =~ s/%([\da-f]{2})/chr(hex($1))/gei;
+    $text =~ s/%u([\da-f]{4})/chr(hex($1))/gei;
+       # chr($unicode_codepoint) works w/o a pragma in Perl 5.8 and 5.6
+    unless ( $TWiki::cfg{Site}{CharSet} =~ /^utf-?8$/i ) {
+       my $t = $TWiki::Plugins::SESSION->UTF82SiteCharSet( $text );
+       $text = $t if ( $t );
+    }

     return $text;
 }

-- TWiki:Main.HideyoImazu - 28 Nov 2007

I just had a phone conversation with Hideyo-san. He said that he has tested the patch thoroughly and that it also works with European characters. May I suggest accepting this patch for 4.2 even though there are no formal tests?

I asked Hideyo-san if he can get more involved with I18N; he said that he is very busy at least until February next year. Let's do our best to release 4.2 with good UTF-8 support.

-- TWiki:Main.PeterThoeny - 05 Dec 2007

I think this is the only real bug open marked urgent as of Monday evening 10 Dec 2007.

So far the only qualified reviewer of these patches has been Harald.

I would let it be Harald's call whether the patch is to be included in 4.2.0.

-- TWiki:Main.KennethLavrsen - 10 Dec 2007

Sorry, that's too much of an honour. I had only reported problems with the Euro character but never have tested Asian characters. But alas:

  • I failed to see any difference with and without the patch with my "traditional" character set of ISO-8859-1.
  • I then tried UTF-8 as site-charset, and that's indeed very weird. As Hideyo-san observed, TMCE (or WysiwygPlugin, to be precise) is using urlDecode based on the assumption that the "Text is assumed to be URL-encoded". However, I failed to find the spot where the encoding takes place. Whatever, the encoding seems to be a bit selective.

Without the patch, if I create a page with Asian characters like %u30CF%u30A4 and edit it with TMCE, I see %u30CF%u30A4. That's what I think led Hideyo-san to examine the code and provide the patch.

With the patch, the same characters are correctly displayed on edit, and saved in the topic.

With the patch, any stray occurrence of %u30CF%u30A4 in a topic is also converted to, and then saved as, the corresponding Asian characters. One may argue that an occurrence of %u30CF is not very likely to happen in normal text, but still it seems that whoever does the encoding doesn't do it consistently. Maybe it is even browser dependent, or at least the opinions on what needs to be encoded differ.

In my opinion the patch does a necessary job for Asian characters in UTF-8 environments and no harm in any other use case. I will not object to the patch but am very far from having done serious testing.

-- TWiki:Main.HaraldJoerg - 11 Dec 2007

A stray occurrence of %uXXXX being converted is unavoidable, because there is no way for urlDecode() to distinguish a Unicode character encoded as %uXXXX from a literal "%uXXXX" string.

Besides, there should be no practical harm. %uXXXX is not meant to be used on an HTML page; &#DDDDD; is the way to represent a character that the content encoding cannot represent.
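For example, katakana HA (U+30CF, which appears in the test strings above) is %u30CF in the escape format that urlDecode() receives, but &#12495; as a numeric entity in page content (0x30CF is 12495 in decimal):

    # illustration only: two representations of the same character
    my $escape_form = '%u30CF';    # JavaScript-style escape, seen by urlDecode()
    my $entity_form = '&#12495;';  # numeric HTML entity used in page content
    print hex('30CF'), "\n";       # prints 12495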

-- TWiki:Main.HideyoImazu - 12 Dec 2007

FWIW, the patch doesn't seem to do any harm here.

-- TWiki:Main.SteffenPoulsen - 12 Dec 2007

Well, the patch has one problem; using the Plugins::SESSION here is a really bad idea, as it is likely not to be set up yet. Rewritten, incorporated, added unit test.

-- TWiki:Main.CrawfordCurrie - 13 Dec 2007

Reopened this bug as perl segfaults on the recent patch. See Item5144. HideyoImazu, can you verify that and perhaps come up with a better fix?

-- TWiki:Main.MichaelDaum - 18 Dec 2007

Michael, could you please provide some details of your environment? It seems that there are a lot of factors influencing the behaviour, but so far no segfault has been observed:

  • What is your Perl version?
  • What are your OS char set and TWiki site charset?
  • What are the actions which lead to a segfault?

-- TWiki:Main.HaraldJoerg - 18 Dec 2007

I believe it was Michael's revert that also made the Pickaxe problem become less of a problem.

-- TWiki:Main.KennethLavrsen - 19 Dec 2007

No, this should be irrelevant to WYSIWYG now, since I have started entity-encoding all UTF-8 characters.

-- TWiki:Main.CrawfordCurrie - 22 Dec 2007

I neutralised Michael's unit testcase for the segfault. Run the UTF8Tests.

-- TWiki:Main.CrawfordCurrie - 22 Dec 2007

Crawford, when you say "I have started entity-encoding all UTF-8 characters", do you include East Asian characters such as Chinese, Japanese, Korean characters? Is it applicable to TinyMCEPlugin 16057 and/or WysiwygPlugin 15991?

I installed TinyMCEPlugin 16057 and WysiwygPlugin 15991 in my environment. And I see accented Latin characters are converted into entity representations when they are edited by TinyMCEPlugin's WYSIWYG editor. But Japanese characters are not affected.

I wonder if converting accented Latin characters has inconvenient side effects on searching. There are existing pages, edited in raw text editing, that have those characters as their code points rather than in entity encoding. And there are people who prefer raw text editing. Does TWiki's search feature treat both representations equally?

Incidentally, entity-encoding East Asian characters makes it impractical to edit a topic in raw text editing since those characters are represented in &#DDDDD; (DDDDD is a decimal number of the code point). It's really bad for TWiki sites employing UTF-8 encoding. If you use UTF-8 as the site encoding, you can edit UTF-8 characters w/o any problem on raw text editing. If you use ISO-8859-? as the site encoding, you don't see the difference between forcing entity encoding in WYSIWYG mode and not doing so. But if you use UTF-8, it makes a difference.

The crux of the problem is that there are two kinds of % in a string handed to urlDecode() from TinyMCEPlugin. 1) literal % used for TWiki markups and other uses, and 2) % for character encoding (%XX and %uXXXX). There is no way for urlDecode() to distinguish one from the other. So % of the former kind should not be there.

In any case, meanwhile, I'll try modifying the patch to avoid segfault.

-- TWiki:Main.HideyoImazu - 26 Dec 2007

The entity encoding is done by TinyMCE; the only change I made was to prevent TWiki from undoing the encoding. So, a Kanji character should be converted to an entity by TinyMCE on save, and TWiki should respect that encoding.

8-bit Latin-1 characters are not encoded.

Yes, I understand the constraints on editing topics with entity-encoded characters. Unfortunately it is too easy to create a topic with broken UTF-8 in it, which causes perl to segfault in a way that cannot be recovered from.

"There are two kinds of % in a string handed to urlDecode() from TinyMCEPlugin" - no. When a topic is saved from TMCE, the topic is first entity-encoded. It is then URL-encoded in the HTTP POST request. When the POST is received by the server, the CGI module performs the URL decoding of the POST parameters. This is the only URL decoding done by the WysiwygPlugin; urlDecode is not called.
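A small illustration of that layering (not actual plugin code; URI::Escape stands in here for the browser's URL-encoding of the POST body):

    use URI::Escape qw(uri_escape uri_unescape);

    my $topic_text = '&#12495;';                # entity-encoded by TinyMCE on the client
    my $posted     = uri_escape($topic_text);   # '%26%2312495%3B' in the HTTP POST body
    my $received   = uri_unescape($posted);     # CGI undoes only the URL layer: '&#12495;'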

There is a double-encoding used when the REST handler is called, but this is only applicable when the 'pickaxe' is used to convert to plain-text editing, and it is consistent with a double-encoding on the client side (i.e. %ACTION will never be fed to urlDecode; it will always arrive as %25ACTION).

-- TWiki:Main.CrawfordCurrie - 29 Dec 2007

Patch checked into the TWikiRelease04x02 branch. It was actually already checked into MAIN. It seems we had all forgotten.

Unit tests pass. I did not even disable any of those.

-- TWiki:Main.KennethLavrsen - 08 Jan 2008

Argh, this brings back the code that segfaults perl under certain circumstances. I removed it 18 Dec. See Item5144.

Reopened. Also made Item5114 urgent so that it does not slip under the radar again. :-(

-- TWiki:Main.MichaelDaum - 08 Jan 2008

Sorry about that. At the release meeting yesterday the others asked for the patch to be applied, and I just offered to do the logistics. I have no clue what the code does.

-- TWiki:Main.KennethLavrsen - 08 Jan 2008

I have confirmed that this and the error in Item5248 are related. I do not think this fix is good. It seems it does not work correctly when called via the rest script.

-- TWiki:Main.KennethLavrsen - 15 Jan 2008

Reverting this "fix". It does a lot of damage, especially to the rest script; I have seen several errors when using the Wysiwyg heavily while searching for something else, and reverting is the cure for these problems. It brings back the problem with the %uXXXX codes, but of the two evils this is the lesser.

-- TWiki:Main.KennethLavrsen - 16 Jan 2008

The following patch does nothing for people not using UTF-8. For people using UTF-8 (me included), it fixes the problem.

@@ -2157,6 +2202,10 @@
     my $text = shift;
 
     $text =~ s/%([\da-f]{2})/chr(hex($1))/gei;
+    if ( $TWiki::cfg{Site}{CharSet} =~ /^utf-?8$/i ) {
+       $text =~ s/%u([\da-f]{4})/chr(hex($1))/gei;
+       # chr($unicode_codepoint) works w/o a pragma in Perl 5.8 and 5.6
+    }
 
     return $text;
 }

-- TWiki:Main.HideyoImazu - 16 Jan 2008

I've tested the last patch in my TWiki 4.2 (Chinese, UTF-8). It works partly.

The Chinese chars appear as they should. But some of them disappear (the last 3 Chinese words). ;-( If I press the "Edit TWiki markup" button in TMCE, then some more chars (the last 4 Chinese words) disappear. If I press "Edit HTML source" instead, then I don't lose chars.

BxZhTW.jpg

  • Upper left: original text
  • Upper middle: TMCE edit
  • Upper right: Edit TWiki markup
  • Lower: Edit HTML source
  • See also: Item5314

-- TWiki:Main.ThYang - 02 Feb 2008

I've tried all 3 patches listed above. They all have the same problem in losing chars.

-- TWiki:Main.ThYang - 02 Feb 2008

The lost-char problem is specific to Firefox prior to version 2.0.0.12. The patch works fine for IE7 and Firefox 2.0.0.12.

-- TWiki:Main.ThYang - 04 Mar 2008

This does not work for me on OSX Firefox 2.0.0.12 or Win32 Firefox 2.0.0.12

-- TWiki:Main.TimothyChen - 04 Mar 2008

It would be really nice if someone using Latin chars could reproduce these problems. See Item5314.

I would really like the 1.5 billion Chinese/Korean/Japanese etc. people to be able to use TWiki in their own language. But it is very hard when you do not even know how to type a Chinese word on a Danish/English keyboard.

What we really need is a Chinese-speaking programmer to give a hand, at least someone with some knowledge of how things work when dealing with Unicode and Chinese.

-- TWiki:Main.KennethLavrsen - 06 Mar 2008

Tim,

I did the test again. The lost-chars problem occurred when I connected to my TWiki 4.2 via a proxy. It did not occur when I edited directly without a proxy.

To sum up -

  • OK with Firefox 2.0.0.12 (under WinXP, OS X 10.5.2) and IE7, with a direct link and no proxy
  • Broken under Firefox (both WinXP and OS X) and IE7 with a proxy, or under Safari (even without a proxy)

Kenneth,

The best way to test a CJK environment is:

1. Set your TWiki charset to UTF-8.
2. Install CJK fonts.
3. Copy & paste CJK text from the web.
4. Compare the result with the image.
5. Try to re-edit it again after you save.
6. Compare the result with the image again.

Inputting CJK is a little bit complicated.

-- TWiki:Main.ThYang - 07 Mar 2008

I put some Chinese with images for you to copy & paste

http://hanix.twbbs.org/twiki.html

-- TWiki:Main.ThYang - 07 Mar 2008

I am actually working on this, but it takes time because I finally found out that my test machine, which runs Fedora Core 4, simply cannot use UTF-8. The OS is too old, so I need to reinstall the entire server first.

But I am determined to at least provide some analysis on the UTF issues as I committed at last release meeting.

-- TWiki:Main.KennethLavrsen - 26 Mar 2008

OK. I now have a site that runs UTF8.

I assume it does not matter that it is en_US.UTF8 as long as it is UTF8? I need this confirmed please.

I can copy paste Chinese characters into the Wysiwyg and I can see the Chinese text. I can also go to raw text and back.

When I go to raw text (pickaxe) the text is entity encoded. (&#DDDDD;)

Before the changes done on 31 Mar 2008 I would get %u stuff. So something got better.

I also tried the latest proposed patch, and with the TWiki code as of 31 Mar 2008 I cannot see ANY change whatsoever when applying the patch.

So I guess the problem now is that we do not see the Chinese characters in the Raw Editor.

Also when I edit raw, and then save, and edit raw again, what I see is the entity encoded stuff and not the Chinese characters.

In the topic file it is the entity coded character that is saved.

So how is this supposed to work? I feel we are fumbling in the dark because no one has given a clear spec of how it is supposed to work.

  • Is it correct that it is OK to set the locale to en_US.UTF8?
  • Am I right that the desired behavior is to see the Chinese characters in raw view and raw edit? (Answer: Yes.)
  • What is supposed to be stored in the topic file? (Answer: the characters in UTF-8 format.)

-- TWiki:Main.KennethLavrsen - 01 Apr 2008

More analysis.

I am totally confused. It turns out there are some hidden configure settings that I had not seen. I had set {Site}{Locale} and missed {Site}{CharSet}, which is hidden.

This is so totally confusing. Why do we still have all these stupid settings in configure that overlap? There are 4 settings all related to the charset and locale. There should be ONE.

The help text in configure assumes you know how international characters are represented in every detail. No normal person is ever going to find out how to set this up.

This needs to be simplified, and we cannot hide part of the configuration if it is this important.

Fixed in Item3715.

-- TWiki:Main.KennethLavrsen - 01 Apr 2008

OK. So I tried to change the $TWiki::cfg{Site}{CharSet} to utf8. Also tried UTF8. And I tried leaving it unset.

Result: the Wysiwyg editor never loads the page but waits forever at "Please wait... retrieving page from server.", so it seems UTF-8 is totally broken. The only {Site}{CharSet} that allows the Wysiwyg editor to fully load is iso-8859-1. The problem is that the code needs to see the string as "utf-8", with a dash.
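For reference, the settings being discussed live in lib/LocalSite.cfg; a minimal sketch of the combination under test (adjust the locale to one that actually exists in the 'locale -a' output on your server):

    # lib/LocalSite.cfg (excerpt) - sketch only
    $TWiki::cfg{Site}{Locale}  = 'en_US.utf-8';  # must match an installed server locale
    $TWiki::cfg{Site}{CharSet} = 'utf-8';        # lowercase, with the dash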

Look at this help text from configure: "Change this only if you must match a specific locale (from 'locale -a') whose character set is not supported by your chosen conversion module (i.e. Encode for Perl 5.8 or higher, or Unicode::MapUTF8 for other Perl versions). For example, if the locale 'ja_JP.eucjp' exists on your system but only 'euc-jp' is supported by Unicode::MapUTF8, set this to 'euc-jp'. If you don't define it, it will automatically be defaulted."

Totally useless help text. Who on earth is going to be able to put the right value in this field based on this information? There may be fewer than 100 people in this world who understand it and know this much detail about specific Perl CPAN modules. It has to be changed, or at least improved, in Item3715.

First step: we need Crawford to find out why the Wysiwyg editor does not load with utf8 set. Answer: missing dash in "utf-8". The code should be made more robust one sunny day. Help text added to give the syntax in Item3715.

-- TWiki:Main.KennethLavrsen - 01 Apr 2008

I started to look through the TWiki.pm code. I found that in at least one place a check was hardcoded to look for "utf-8", so I tried that. Fixed the place I found it - KJL, Item3715.

Then the Wysiwyg editor loads again.

And now the raw edit actually works in Chinese. And the Wysiwyg works in Chinese. Only the pickaxe does not work in Chinese but still shows the entity-encoded chars. So we are getting closer.

But for sure we have some work to do with configure and with some code that assumed the exact string "utf-8", lowercase with a dash.

-- TWiki:Main.KennethLavrsen - 01 Apr 2008

The pickaxe behavior is very similar to the problem I see in Item5467. In both cases the problem is that when you enter pickaxe mode, the Chinese characters are shown entity-encoded and the ATTACHURL is shown as its value instead of the literal ATTACHURL.

The data sent from the server to the browser when going to pickaxe is not correctly translated. You should see what is equivalent to "Save" followed by "Raw Edit" except that the actual topic is not saved.

-- TWiki:Main.KennethLavrsen - 01 Apr 2008

OK. Played more.

Yes, I can add Chinese. But when setting the locale to da_DK.utf8 or en_US.utf-8 and {Site}{CharSet} to utf-8, it is impossible to write Danish characters in the Wysiwyg editor. The minute I save the topic, the ÆØÅ characters become garbage. I can save OK in raw edit. If I Wysiwyg-edit a topic that already has ÆØÅ, the ÆØÅ becomes garbage letters as well.

The WysiwygPlugin conversion garbles Danish letters (and also German/French/Spanish etc.) during the conversion when using UTF-8.

So UTF-8 fails completely with anything other than English or Chinese?? So we are no further along in making UTF-8 work.

Note that I have also worked on Item3715, but this observation that Danish does not work with UTF-8 is seen both with code straight from SVN and with the version I am working on related to Item3715.

I am not sure how much further I get in this.

-- TWiki:Main.KennethLavrsen - 02 Apr 2008

I'm being asked for feedback but it's not clear what the question is. My obvious question is "how many of the above remarks relate to the code uploaded this week"? Because I changed the behaviour of pickaxe a lot, which included testing Chinese and Russian in pickaxe mode, and they work. Can you please summarise current WYSIWYG issues against this latest code?

-- TWiki:Main.CrawfordCurrie - 04 Apr 2008

I know there is a lot of analysis and that it is messy. But I add things as I find them so I will not forget.

Let me summarise.

First I was totally confused: I did not know if my configure setup was correct, and I did not know how things are supposed to work.

After a lot of experiments I assume the following.

  • With TWiki set up for UTF-8 and a client with the proper fonts installed, the client should see the Chinese characters as Chinese characters in all of ...
    • Normal view
    • Raw view
    • Edit Raw
    • Wysiwyg editor
    • Pickaxe editor
  • A working UTF-8 setup would be
    • {Site}{CharSet} set to "utf-8"
    • {Site}{Locale} set to en_US.utf-8

  • QUESTION: Please confirm that all of this is correct, including the fact that with utf-8 it should not matter what is in front of the ".utf-8".

NEXT. I was confused about the language settings and syntax. As part of fixing Item3715 the {Site}{Lang} and {Site}{FullLang} are GONE. Less confusion! And {Site}{CharSet} is no longer hidden as an expert setting, and the help text has been improved, so at least people are not in doubt about the syntax for utf-8.

Now we are down to 3 problems related to utf-8

  • The Wysiwyg editor never loads the page but waits forever at "Please wait... retrieving page from server." if {Site}{CharSet} is not understood by the code. Not an urgent bug now that I have improved the configure help text; it should have its own bug report.
  • When going from Wysiwyg to Pickaxe, the utf-8 strings written in Chinese are converted to entities. This should not happen when TWiki is set up for utf-8; I should see the Chinese characters sent as utf-8 to the browser, as happens when you raw edit. The topics are stored OK from Raw Edit and from Wysiwyg edit when adding Chinese characters with TWiki set up for UTF-8, so this is a Pickaxe conversion problem that needs to be fixed.
  • And finally the most serious issue. When setting the locale to da_DK.utf8 or en_US.utf-8 and {Site}{CharSet} to utf-8, it is impossible to write non-English Latin characters such as French, German, and Danish characters in the Wysiwyg editor. The minute I save the topic, the ÆØÅ characters become garbage. I can save OK in raw edit. If I Wysiwyg-edit a topic that already has ÆØÅ, the ÆØÅ becomes garbage letters as well. Pure English and Chinese seem to work in utf-8 mode, so some conversion seems to be destroying the 8-bit non-English characters when using utf-8. Note that French/German/Danish works fine when TWiki is configured for iso-8859-1 or iso-8859-15.

This is as far as I can get with the analysis. If the Pickaxe conversion problem is fixed and a utf-8 configured site can be made to work with non-English Latin characters like ÆØÅéèöä etc., then we are close to having a pretty well-working utf-8 feature. All my testing is done on the latest TWikiRelease04x02 SVN checkout, with both IE and FF. Back to you, Crawford. Hope the summary helps.

-- KennethLavrsen - 07 Apr 2008

"The Wysiwyg editor never loads the page but waits forever..." suggests that TWiki is crashing in the background when the TML-to-HTML conversion is attempted. The edit page is loaded with the topic text embedded in the textarea, with UTF-8 characters represented as octets (individual bytes). This text is then sent to the server, where the octets are decoded to wide chars. After rendering to HTML, any remaining wide chars are converted to entities, because perl print barfs on the wide characters. The only part of this process that is sensitive to {Site}{CharSet} is the rendering, and that process is largely shared with the standard view rendering.

The conversion to entities is necessary because if the site charset doesn't support the wide character, perl print will fail mysteriously with "Wide character in print". Thus if you don't use entities, the content on the TWiki site becomes critically dependent on the selected site charset, and a topic ported to another TWiki site that uses a different site charset will crash the new site. Entities are much, much safer. Anyone who wants to can write a plugin that compacts entities back to characters as a beforeEditHandler, if editing native UTF-8 chars in raw edit mode is a big issue, but I suspect that any such plugin will be sensitive to the site charset.
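As a rough sketch of the fallback described above (illustration only, not the actual WysiwygPlugin code):

    sub entityEncodeWideChars {
        my $text = shift;
        # any character outside the single-byte range is written out as a
        # numeric entity, which avoids 'Wide character in print' errors
        $text =~ s/([^\x00-\xff])/'&#' . ord($1) . ';'/ge;
        return $text;
    }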

I haven't tried setting the site charset as you describe; I will try that, but TBH I suspect the problem has moved on from WysiwygPlugin.

-- CrawfordCurrie - 08 Apr 2008

I am able to reproduce this using locale en_GB.utf8

Note that if you have a topic that contains high-bit characters, and you select utf8 as {Site}{CharSet}, you cannot expect it to work; in fact it may well crash TWiki, as the high-bit characters overlap with characters used in the UTF8 encoding.

Having said that, there is an issue inserting high-bit characters into UTF8 content. I have:

  • {Site}{CharSet} set to utf8
  • {Site}{Locale} set to en_GB.utf8
I edit a new topic and, using the TinyMCE symbols button, enter some high-bit characters. I then save.

The topic text is sent to the server with those characters entered as HTML entities. They are then erroneously mapped to high-bit characters (which won't work in UTF8)

-- CrawfordCurrie - 11 Apr 2008

I have tested with utf8 and 8859 char sets, and I'm pretty confident now. Note that setting the locale is not enough; TWiki uses the {Site}{CharSet} and ignores the charset given in the locale.

-- CrawfordCurrie - 12 Apr 2008

Whatever was fixed for a short period is now broken again. I cannot write Chinese characters again, even though the site runs UTF-8.

-- TWiki:Main.KennethLavrsen - 21 Apr 2008

Please don't re-open this report; it's too big and has covered too much ground already. If there are more problems, open a new report. Thanks.

-- CrawfordCurrie - 22 Apr 2008

I won't comment on all of this, but if you want WikiWords with accented I18N characters to work at all, you should not be using UTF-8 as the site character set. That's why utf-8 in 'utf-8 bytes mode' (see TWiki:Codev.UseUTF8, key concepts part) is only half-supported in all current TWiki versions - you simply can't get UTF-8 and WikiWords that include I18N characters. You should only use UTF-8 for Japanese, Chinese and other languages where you don't care about WikiWords using letters other than unaccented Roman A to Z.

The comment about 'it doesn't matter what comes before .utf-8' is misleading. It is essential that this matches a working locale on the server (see the 'locale -a' output on the server). Although, given how broken locales plus Perl Unicode mode are (very buggy), it may be no bad thing if the 'use locale' fails.

What is a bit concerning is that there is (or was) an attempt to do UseUTF8, aka UnicodeSupport, via local changes rather than looking at the whole picture. Crawford's UseUTF8 page is the way to go, as it looks at a wide enough set of issues and hopefully will be done as a project (i.e. it can and will break things before it fixes them, and should be on a separate branch).

-- TWiki:Main.RichardDonkin - 26 Jun 2008

ItemTemplate
Summary urlDecode() not working for characters represented by Unicode code points
ReportedBy TWiki:Main.HideyoImazu
Codebase 4.1.2, 4.2.0, ~twiki4
SVN Range TWiki-4.3.0, Fri, 12 Oct 2007, build 15261
AppliesTo Engine
Component I18N
Priority Urgent
CurrentState Closed
WaitingFor

Checkins TWikirev:15985 TWikirev:16031 TWikirev:16091 TWikirev:16121 TWikirev:16171 TWikirev:16237 TWikirev:16238 TWikirev:16651 TWikirev:16718 TWikirev:16719 TWikirev:16720 TWikirev:16721
TargetRelease patch
ReleasedIn

Topic attachments
BxZhTW.jpg (JPEG, 26.7 K, 2008-02-02 15:14, UnknownUser) - Broken Chinese