reported by TWiki:Main/ZhengLingxiang on TWiki:Plugins/BlogPluginDev:


Does it support UTF-8? I created a blog page with Chinese content. I entered the Chinese characters in the BlogEntryForm. It works fine when the topic is saved the first time, but if you edit it again, some Chinese characters are displayed as something like "#nnnn;". The same characters display and edit fine in a WikiPage. I installed TWiki (Dakar) on Red Hat Linux AS4. Please check the attached files for more information.

  • Display OK after the first save: [image]

  • Edit wrong: [image]

I tested a simple topic with a TWiki form and all plugins disabled. The error still exists: when editing a page with Chinese text in the form, some Chinese characters may become "#nnnn;".

-- ZL

The symptoms you are seeing indicate that the character set/encoding is getting mis-configured somewhere. For example, if the page is in the ISO-8859-1 character set and you paste in a Chinese character in UTF-8, you get Numeric Character References (NCRs) such as &#8249; (meaning ‹). Try Google:twiki+numeric+character+reference for some ideas, but the main thing is to carefully check the HTTP headers, the HTML and the browser for the character encoding, to work out where it's being set to something other than UTF-8.
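To illustrate how such an NCR is formed, here is a minimal sketch. It assumes the text is held as decoded Perl characters, and the helper name to_ncr_for_latin1 is hypothetical (it is not TWiki or browser code). Any character that ISO-8859-1 cannot represent is replaced by "&#", its decimal code point, and ";".

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside ISO-8859-1
# (code point above 0xFF) with a decimal numeric character reference,
# roughly what happens when a form's charset cannot hold the character.
sub to_ncr_for_latin1 {
    my ($text) = @_;
    $text =~ s/([^\x{00}-\x{FF}])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# "caf" + e-acute (U+00E9, fits in ISO-8859-1) + space + U+4E0B (does not fit)
my $ncr = to_ncr_for_latin1("caf\x{E9} \x{4E0B}");
print "$ncr\n";
```

The accented Latin letter passes through unchanged, while the Chinese character comes out as an NCR.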

Also, see TWiki:TWiki03.TWikiInstallationGuide - this has a section on how to debug internationalisation setup that largely applies to TWiki 4.0, except that options are now set through configure. You should only need to use configure to set character sets; setting the CHARSET variable in TWikiPreferences is a mistake.

Having said all that, this is specific to form values, so maybe there's some code in Dakar that has not been I18N'ed - check the Form code and see if there's anything looking for \w or [a-z] etc. Also, see TWiki:Codev.InternationalisationGuidelines.
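As a sketch of the kind of pattern to look for, the snippet below contrasts an ASCII-only character class with a Unicode property class. It assumes the value is a decoded Perl character string, which may not be how TWiki held form values at the time, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# "abc" followed by two Chinese characters (U+4E2D, U+6587).
my $value = "abc\x{4E2D}\x{6587}";

# An ASCII-only class stops at the first non-ASCII character.
my ($ascii_only) = $value =~ /^([A-Za-z]+)/;

# A Unicode "letter" property class matches the whole value.
my ($unicode) = $value =~ /^(\p{L}+)/;

printf "[A-Za-z] matched %d chars, \\p{L} matched %d chars\n",
    length($ascii_only), length($unicode);
```

Code that validates or cleans field values with the first kind of pattern would silently truncate or reject non-ASCII input.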

TWiki:Codev.InternationalCharactersInFormFields may be relevant - it's about field names not values but there may be a similar bug in that module (SVNget:lib/TWiki/Forms.pm).

-- RD

Actually TWiki:Main/ZhengLingxiang verified the issue on a very simple example disabling all plugins:

So why is it specific to the BlogPlugin?

MD

I didn't say it was specific to BlogPlugin, that was just one idea early in my comment (now updated) - by the end I had decided it was probably in Forms.pm, similar to TWiki:Codev.InternationalCharactersInFormFields but for form field values not names.

RD

TWiki:Main/ZhengLingxiang's report is at TWiki:Support.EditExistedFormTextInUtf8

-- PTh

I've had a look at the code in Form.pm and its use of CGI.pm - here are some comments:

  1. The _cleanField routine should be patched for Chinese sites that want to use Chinese only field names (not the issue here, same as TWiki:Codev.InternationalCharactersInFormFields)
  2. The Form.pm module calls CGI::textfield to render the text field. On looking at CPAN:CGI version 3.10, I found a routine called escapeHTML - haven't traced the code path but it's highly suspicious that it can create &#8249; numeric character references as a special case if the charset is ISO-8859-1. I suspect what may be happening is that we are not using CGI.pm to create the initial HTTP headers and start_html parts, so CGI.pm is defaulting to ISO-8859-1 internally, resulting in this bug.

To resolve this issue, I think we should:

  1. (Test by ZL) Put some debug code into the escapeHTML routine in CGI.pm, to verify that's what is causing this issue, and check what it thinks the charset is. You could try commenting out the line that looks like this as a very quick test: $toencode =~ s{\x8b}{&#8249;}gso;
  2. (Quick workaround) Replace the call to CGI::textfield with simple HTML generation within Form.pm.
  3. (Real solution) Look at how we are using CGI.pm calls, and either (a) tell it which charset we are using via CGI::charset (which might avoid many other issues), (b) reduce our use of CGI.pm to the minimum due to its various I18N issues or (c) increase our use of CGI.pm so that we are using it in the way in which it 'wants' to be used. Option (a) is the simplest if it turns out that CGI.pm thinks it is in ISO-8859-1 mode when TWiki is using UTF-8.
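A minimal sketch of why only some Chinese characters would be affected under this hypothesis. The function simulate_latin1_escape is a hypothetical stand-in that applies only the single substitution quoted in step 1; it is not CGI.pm's actual escapeHTML. The point is that the byte 0x8B can occur as a continuation byte inside a UTF-8 sequence, so a byte-level replacement corrupts exactly those characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Latin-1 special case quoted in step 1.
# This is NOT the real escapeHTML, only the one substitution.
sub simulate_latin1_escape {
    my ($bytes) = @_;
    $bytes =~ s{\x8b}{&#8249;}g;
    return $bytes;
}

# U+4E0B is encoded in UTF-8 as the bytes E4 B8 8B;
# U+4E2D is encoded as E4 B8 AD.
my $hit  = simulate_latin1_escape("\xE4\xB8\x8B");   # contains byte 0x8B
my $miss = simulate_latin1_escape("\xE4\xB8\xAD");   # no 0x8B byte

print "with 0x8B:    ", ($hit  =~ /&#8249;/ ? "mangled" : "intact"), "\n";
print "without 0x8B: ", ($miss =~ /&#8249;/ ? "mangled" : "intact"), "\n";
```

If this is what happens, the first save looks fine because the text is stored untouched, and the damage appears only when the stored value is rendered back into the edit form.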

Hope that helps - this requires a little verification before we spend more time debugging it, but it's not the first time CGI.pm has caused I18N issues with TWiki. However, I'm hopeful that calling CGI::charset is all that's needed.

I also just noticed this part of the TWiki 4.0 release notes, which might well be relevant...

Question for ZL - what version of CPAN:CGI are you using?

RD

Thanks! I have fixed this bug by adding the following line in Form.pm

CGI::charset("utf-8");

--ZhengLingxiang

Good to hear this worked, and thanks for letting me know! We'll have to figure out the best way to put this into the TWiki code, but clearly CGI::charset is the way to go. I think that around line 365 of TWiki.pm, in the 'locale setup' part, would be good - this is just telling CGI.pm some of the same info passed in the POSIX::setlocale call. Code would look something like:

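   # Tell CGI.pm which charset the site uses, so that it does not
   # fall back to its internal ISO-8859-1 default.
   require CGI;
   import CGI ();
   CGI::charset($TWiki::cfg{Site}{CharSet});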

No real error checking needed since this routine simply sets a value. However, we might need to know whether the CGI module supports this - some older versions might not have the charset routine.
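One possible way to guard against a CGI.pm that lacks the routine is sketched below. The helper name set_cgi_charset is hypothetical, and the approach (attempt the call inside eval and treat failure as "unsupported") is an assumption, not what the attached patch does. It avoids relying on can(), which may give false negatives if a CGI.pm release loads its routines lazily.

```perl
use strict;
use warnings;

# Hypothetical helper: tell CGI.pm the site charset if CGI.pm can be
# loaded and the charset routine can actually be called.
# Returns 1 when the charset was set, 0 otherwise.
sub set_cgi_charset {
    my ($charset) = @_;
    return 0 unless eval { require CGI; 1 };
    return eval { CGI::charset($charset); 1 } ? 1 : 0;
}

my $was_set = set_cgi_charset('utf-8');
print $was_set ? "CGI charset set\n" : "CGI::charset unavailable\n";
```

In TWiki the argument would presumably be $TWiki::cfg{Site}{CharSet}, as in the snippet above, rather than a literal.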

Above code is somewhat tested on CGI.pm v3.04 and TWiki 4.0.0 on Debian Linux - i.e. it runs OK, but haven't had time to test with Chinese characters in forms. Patch is attached. Needs cleaning up re the require/import of CGI, not sure of best way to do this since otherwise TWiki.pm doesn't need this module, but it's always loaded by some other module anyway.

ZhengLingxiang - could you apply this patch (after you've removed your change) and let me know if it works.

--RD

I have applied your patch. It works. I tested it on CGI.pm v3.05 and TWiki 4.0.2 on Red Hat Linux AS 4.

--Main.ZhengLingxiang

I've done an improved patch that removes any performance hit from ISO-8859-1 sites but is otherwise the same. Note that this patch also fixes a bug in ISO-8859-15 conversion in another routine.

UPDATE/WARNING: This patch needs some real testing for the UTF-8 case, along with reading of the CGI.pm implementation code - there's a good chance that CGI.pm will set all data to internal Perl UTF-8 characters, which will cause issues in TWiki since it's not really set up to handle these. It might 'just work' but in my experience it won't - or we might find that UTF-8 sites slow down a lot...

--RD

The patch works on CGI.pm v3.15, TWiki 4.1.1 on Fedora Core 6.

-- Main.ThYang - 07 Feb 2007

I don't believe this patch made it into the MAIN branch - it's not in 4.1.2 anyway.

RD

I merged the patch to MAIN. I can't test, so I'll just have to assume it works - unless someone screams!

CC

Why only compare against -1 instead of both -1 and -15?

-    if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
+    if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?1$/i ) {

-- WillNorris - 14 May 2007
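To make the difference between the two patterns in the diff concrete, here is a small sketch that runs both against a few charset names. The list of names is chosen for illustration only.

```perl
use strict;
use warnings;

my $old = qr/^iso-?8859-?15?$/i;   # the removed pattern: -1 or -15
my $new = qr/^iso-?8859-?1$/i;     # the added pattern: -1 only

for my $charset (qw(iso-8859-1 ISO8859-1 iso-8859-15 utf-8)) {
    printf "%-12s old=%-5s new=%s\n", $charset,
        ($charset =~ $old ? 'match' : 'no'),
        ($charset =~ $new ? 'match' : 'no');
}
```

The trailing "5?" in the old pattern is what made it accept ISO-8859-15 as well; the new pattern accepts only ISO-8859-1.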

There should only be one default character set, and ISO-8859-1 is currently it, because it maps easily to and from UTF-8 URLs, amongst other things. Supporting -15 as well is more complex, and the default has been set inconsistently, as -1 in some places and -15 in others (e.g. in configure)... another reason for having only one default.

-- RichardDonkin - 15 May 2007

OK, and shouldn't -15 be that default?

-- TWiki:Main.WillNorris - 15 May 2007

Can anyone confirm that this is still a release blocker, or can we downgrade it to Normal?

If there's only the -1/-15 discussion left, this item should go in the release notes and the -1/-15 discussion should be moved to another bug.

-- TWiki:Main.SteffenPoulsen - 03 Jun 2007

My comment about problems of mapping UTF-8 to and from ISO-8859-15 still applies - ISO-8859-1 is the best default as we can (and do) algorithmically map from UTF-8 URLs (which are very common now) into ISO-8859-1.
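A minimal sketch of what "algorithmically map" can mean here, under the assumption that only the code points U+0080 to U+00FF matter. The helper name utf8_to_latin1 is hypothetical and this is not TWiki's actual routine. In UTF-8 those code points are always two bytes with a lead byte of C2 or C3, so the ISO-8859-1 byte can be recovered with bit arithmetic alone.

```perl
use strict;
use warnings;

# Hypothetical helper: convert UTF-8 bytes to ISO-8859-1 bytes.
# Only two-byte sequences led by C2 or C3 (U+0080..U+00FF) are mapped;
# all other bytes are left untouched.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s{([\xC2\xC3])([\x80-\xBF])}
               {chr( ((ord($1) & 0x1F) << 6) | (ord($2) & 0x3F) )}ge;
    return $bytes;
}

my $latin1 = utf8_to_latin1("caf\xC3\xA9");   # UTF-8 bytes for "cafe" with e-acute
printf "%d bytes in, %d bytes out\n", length("caf\xC3\xA9"), length($latin1);
```

ISO-8859-15 differs from ISO-8859-1 in eight positions (for example, 0xA4 is the euro sign, U+20AC), so those positions would need a lookup table on top of this arithmetic.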

So let's just set the default to ISO-8859-1 everywhere and leave the -15 discussion for another bug, where we can consider the code changes needed to support it (these may not be that large, but we would need to investigate the differences between -1 and -15 and convert those algorithmically).

-- TWiki:Main.RichardDonkin - 19 Jun 2007

The patch attached to Item3652 also fixes the ISO-8859-1 vs -15 conversion issue, and much else besides; however, it doesn't address Item2032 specifically.

-- TWiki:Main.RichardDonkin - 19 Jun 2007

Possibly Not related to Bugs:Item4419.

-- TWiki:Main.SteffenPoulsen - 20 Aug 2007

Updated Steffen's comment above, not related to that bug.

-- RichardDonkin - 23 Aug 2007

I am closing this bug for now: the original reporter has not shown interest in it since February, and from the little information we have the problem appears to be solved.

Re-open as new bug if further work is necessary.

-- TWiki:Main.SteffenPoulsen - 09 Sep 2007

ItemTemplate
Summary Some UTF8 characters in form values broken (CGI.pm interaction)
ReportedBy TWiki:Main.MichaelDaum
Codebase

SVN Range Mon, 27 Mar 2006 build 9563
AppliesTo Engine
Component I18N
Priority Urgent
CurrentState Closed
WaitingFor

Checkins TWikirev:9795 TWikirev:13714
TargetRelease minor
ReleasedIn 4.2.0
Topic attachments
Attachment | History | Size | Date | Who | Comment
CGI-charset-utf8-v2.patch | r1 | 1.8 K | 2006-04-12 - 09:57 | UnknownUser | Updated patch to tell CGI.pm what charset we are using
CGI-charset-utf8.patch | r1 | 0.5 K | 2006-04-09 - 08:15 | UnknownUser | Patch to tell CGI.pm what charset we are using
chinese_form_error.JPG | r1 | 16.9 K | 2006-04-06 - 11:26 | UnknownUser |
display_ok.JPG | r1 | 15.6 K | 2006-04-05 - 14:20 | UnknownUser |
edit_wrong.jpg | r1 | 76.1 K | 2006-04-05 - 14:21 | UnknownUser |
Topic revision: r34 - 2008-01-22 - KennethLavrsen
 