Reported by TWiki:Main/ZhengLingxiang on TWiki:Plugins/BlogPluginDev:
Does it support UTF-8? I created a blog page with Chinese content. I entered the Chinese characters in the
BlogEntryForm, and it works fine when the topic is saved the first time. But if you edit it again, some Chinese characters are displayed as something like "#nnnn;". The same characters display and edit correctly in an ordinary
WikiPage. I installed TWiki (Dakar) on Red Hat Linux AS4. Please check the attached file for more information.
- Display is OK after the first save.
I tested a simple topic with
TwikiForm and all plugins disabled. The error still exists: when editing a page with Chinese text in the form, some Chinese characters may become "#nnnn;".
--
ZL
The symptoms you are seeing indicate that the character set/encoding is somehow being misconfigured - e.g. if the page uses the ISO-8859-1 character set and you paste in a Chinese character in UTF-8, you get Numeric Character References (NCRs) such as
&#8249;
meaning ‹. Try
Google:twiki+numeric+character+reference
for some ideas, but the main thing is to carefully check the HTTP headers, HTML and the browser for the character encoding, to work out where it's getting set to something other than UTF-8.
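As a rough illustration of where such references can come from (a sketch, not TWiki code): Perl's core Encode module can be asked to substitute a decimal NCR for any character the target charset cannot represent, which is similar to what happens when Chinese input meets an ISO-8859-1 page. The helper name `to_charset_with_ncr` is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a Perl character string into $charset,
# replacing unrepresentable characters with decimal NCRs (&#NNNN;).
sub to_charset_with_ncr {
    my ($charset, $chars) = @_;
    return encode($charset, $chars, Encode::FB_HTMLCREF());
}

my $chinese = "\x{4e2d}\x{6587}";    # two Chinese characters (U+4E2D, U+6587)

# ISO-8859-1 cannot hold these characters, so each becomes an NCR.
print to_charset_with_ncr('iso-8859-1', $chinese), "\n";    # &#20013;&#25991;

# UTF-8 can represent them, so no NCRs appear (three bytes per character).
print length(to_charset_with_ncr('UTF-8', $chinese)), "\n"; # 6
```

If NCRs like these show up in saved form data, some layer between the browser and the topic file is probably working in a charset other than UTF-8.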
Also, see
TWiki:TWiki03.TWikiInstallationGuide
- this has a section on how to debug internationalisation setup that largely applies to TWiki 4.0, except that the way you configure options is now through
configure
. You should only need to use
configure
to set character sets; setting the CHARSET variable in
TWikiPreferences is a mistake.
Having said all that, this is specific to form values, so maybe there's some code in Dakar that has not been
I18N'ed - check the Form code and see if there's anything looking for
\w
or
[a-z]
etc. Also, see
TWiki:Codev.InternationalisationGuidelines
.
TWiki:Codev.InternationalCharactersInFormFields
may be relevant - it's about field
names not values but there may be a similar bug in that module (SVNget:lib/TWiki/Forms.pm).
--
RD
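To show why patterns such as \w or [a-z] are worth looking for, here is a small sketch (not taken from the TWiki Form code; `looks_like_word` is a hypothetical stand-in). It assumes default Perl regex semantics with no `use locale`: raw UTF-8 bytes are not treated as word characters, while the same text decoded to Perl characters is.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical check, similar in spirit to a /^\w+$/ test in form-handling code.
sub looks_like_word {
    my ($text) = @_;
    return $text =~ /^\w+$/ ? 1 : 0;
}

my $utf8_bytes = "\xe4\xb8\xad\xe6\x96\x87";    # raw UTF-8 bytes for U+4E2D U+6587
my $chars      = decode('UTF-8', $utf8_bytes);  # the same text as Perl characters

print looks_like_word('abc'), "\n";        # 1
print looks_like_word($utf8_bytes), "\n";  # 0: high bytes are not word characters here
print looks_like_word($chars), "\n";       # 1: decoded ideographs are word characters
```

Results can differ under `use locale`, so this only illustrates the kind of mismatch to look for, not a diagnosis of this bug.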
Actually
TWiki:Main/ZhengLingxiang
verified the issue on a very simple example disabling all plugins:
So why is it specific to the
BlogPlugin?
MD
I didn't say it was specific to
BlogPlugin, that was just one idea early in my comment (now updated) - by the end I had decided it was probably in Forms.pm, similar to
TWiki:Codev.InternationalCharactersInFormFields
but for form field values not names.
RD
TWiki:Main/ZhengLingxiang's report is at
TWiki:Support.EditExistedFormTextInUtf8
--
PTh
I've had a look at the code in Form.pm and its use of CGI.pm - here are some comments:
- The
_cleanField
routine should be patched for Chinese sites that want to use Chinese only field names (not the issue here, same as TWiki:Codev.InternationalCharactersInFormFields
)
- The Form.pm module calls
CGI::textfield
to render the text field. On looking at CPAN:CGI
version 3.10, I found a routine called escapeHTML
- haven't traced the code path but it's highly suspicious that it can create &#8249;
numeric character references as a special case if the charset is ISO-8859-1. I suspect what may be happening is that we are not using CGI.pm to create the initial HTTP headers and start_html
parts, so CGI.pm is defaulting to ISO-8859-1 internally, resulting in this bug.
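The suspected mechanism can be sketched as follows. This is a simplified stand-in written for illustration, not the real CGI.pm source; the function name `escape_html_sketch` and the exact charset test are assumptions. It shows how a byte-level substitution of \x8b would damage a UTF-8 encoded Chinese character whose encoding happens to contain that byte.

```perl
use strict;
use warnings;

# Sketch only: mimics the special case described above, NOT the real CGI.pm code.
sub escape_html_sketch {
    my ($text, $charset) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    if ($charset =~ /^iso-8859-1$/i) {
        # Byte 0x8b is treated as a Windows-1252 angle quote and turned into an NCR.
        $text =~ s/\x8b/&#8249;/g;
    }
    return $text;
}

my $xia = "\xe4\xb8\x8b";    # UTF-8 bytes of U+4E0B; the last byte is 0x8b

# With ISO-8859-1 assumed, the third byte is replaced and the character is broken.
print escape_html_sketch($xia, 'ISO-8859-1') eq $xia ? "intact\n" : "corrupted\n";

# With UTF-8 declared, the bytes pass through untouched.
print escape_html_sketch($xia, 'utf-8') eq $xia ? "intact\n" : "corrupted\n";
```

If this is what happens, it would also explain why only some Chinese characters are affected: only those whose UTF-8 encoding contains the byte in question.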
To resolve this issue, I think we should:
- (Test by ZL) Put some debug code into the
escapeHTML
routine in CGI.pm, to verify that's what is causing this issue, and check what it thinks the charset is. As a very quick test, you could try commenting out the line that looks like this: $toencode =~ s{\x8b}{&#8249;}gso;
- (Quick workaround) Replace the call to CGI::textfield with simple HTML generation within Form.pm.
- (Real solution) Look at how we are using CGI.pm calls, and either (a) inform it of the charset we are using via
CGI::charset
(which might avoid many other issues), (b) reduce our use of CGI.pm to the minimum due to its various I18N issues or (c) increase our use of CGI.pm so that we are using it in the way in which it 'wants' to be used. Option (a) is the simplest option if it turns out it thinks that it is in ISO-8859-1 mode when TWiki is using UTF-8.
Hope that helps - this requires a little verification before we spend more time debugging it, but it's not the first time CGI.pm has caused
I18N issues with TWiki. However, I'm hopeful that calling
CGI::charset
is all that's needed.
I also just noticed
this part of the TWiki 4.0 release notes
, which might well be relevant...
Question for
ZL - what version of
CPAN:CGI
are you using?
RD
Thanks! I have fixed this bug by adding the following line in Form.pm:
CGI::charset("utf-8");
--
ZhengLingxiang
Good to hear this worked, and thanks for letting me know! We'll have to figure out best way to put this into TWiki code, but clearly
CGI::charset
is the way to go. I think that around line 365 of TWiki.pm, in the 'locale setup' part, would be a good place - this just tells CGI.pm some of the same info passed in the POSIX::setlocale call. The code would look something like:
# Tell CGI.pm which charset the site uses, so it does not assume ISO-8859-1
require CGI;
import CGI ();
CGI::charset($TWiki::cfg{Site}{CharSet});
No real error checking needed since this routine simply sets a value. However, we might need to know whether the CGI module supports this - some older versions might not have the
charset
routine.
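One way to guard against a CGI.pm that lacks the routine is sketched below. The wrapper name `set_cgi_charset` is hypothetical, and the sketch uses `eval` rather than `CGI->can`, on the assumption that some CGI.pm versions load their routines on demand. In TWiki the argument would be $TWiki::cfg{Site}{CharSet}; a literal is used here so the example stands alone.

```perl
use strict;
use warnings;

# Hypothetical wrapper: set CGI.pm's charset if the module loads and the call
# succeeds. Returns 1 if the charset was set, 0 otherwise.
sub set_cgi_charset {
    my ($charset) = @_;
    return 0 unless eval { require CGI; 1 };             # CGI.pm not installed
    return eval { CGI::charset($charset); 1 } ? 1 : 0;   # 0 if the routine is missing
}

print set_cgi_charset('utf-8')
    ? "charset set\n"
    : "charset routine unavailable\n";
```

A return value of 0 would leave CGI.pm on its built-in default, so a caller might want to log that case.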
The above code is lightly tested on CGI.pm v3.04 and TWiki 4.0.0 on Debian Linux - i.e. it runs OK, but I haven't had time to test with Chinese characters in forms. Patch is attached. It needs cleaning up re the require/import of CGI; I'm not sure of the best way to do this, since TWiki.pm doesn't otherwise need this module, but it's always loaded by some other module anyway.
ZhengLingxiang - could you apply this patch (after you've removed your change) and let me know if it works?
--
RD
I have applied your patch and it works. I tested it on CGI.pm v3.05 and TWiki 4.0.2 on
RedHat Linux AS 4.
--
Main.ZhengLingxiang
I've done an improved patch that removes any performance hit from ISO-8859-1 sites but is otherwise the same. Note that this patch also fixes a bug in ISO-8859-15 conversion in another routine.
UPDATE/WARNING: This patch needs some real testing for the UTF-8 case, along with reading of the CGI.pm implementation code - there's a good chance that CGI.pm will set all data to internal Perl UTF-8 characters, which will cause issues in TWiki since it's not really set up to handle these. It might 'just work' but in my experience it won't - or we might find that UTF-8 sites slow down a lot...
--
RD
The patch works on CGI.pm v3.15, TWiki 4.1.1 on Fedora Core 6.
--
Main.ThYang - 07 Feb 2007
I don't believe this patch made it into the MAIN branch - it's not in 4.1.2 anyway.
RD
I merged the patch to MAIN. I can't test it, so I'll just have to assume it works - unless someone screams!
CC
Why only compare against -1 instead of both -1 and -15?
- if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
+ if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?1$/i ) {
--
WillNorris - 14 May 2007
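For reference, the difference between the two patterns in the diff above can be checked with a short script. The list of charset names is chosen only for illustration.

```perl
use strict;
use warnings;

my $old = qr/^iso-?8859-?15?$/i;    # pattern removed by the patch: matches -1 and -15
my $new = qr/^iso-?8859-?1$/i;      # pattern added by the patch: matches -1 only

# Report which sample charset names each pattern accepts.
for my $charset (qw(iso-8859-1 ISO8859-1 iso-8859-15 utf-8)) {
    printf "%-12s old=%d new=%d\n", $charset,
        ($charset =~ $old ? 1 : 0),
        ($charset =~ $new ? 1 : 0);
}
```

Only the iso-8859-15 row differs between the two columns, which is the change being asked about.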
There should only be one default character set, and ISO-8859-1 is currently it, because it maps easily to and from UTF-8 URLs, amongst other things. Supporting -15 as well is more complex, and in some places (e.g.
configure
) the default has been set as -1 and -15 in different places... another reason for having only one default.
--
RichardDonkin - 15 May 2007
OK, and shouldn't -15 be that default?
--
TWiki:Main.WillNorris
- 15 May 2007
Can anyone confirm that this is still a release blocker, or can we downgrade it to Normal?
If there's only the -1/-15 discussion left, this item should go in the release notes and the 1/15 discussion put to another bug.
--
TWiki:Main.SteffenPoulsen
- 03 Jun 2007
My comment about problems of mapping UTF-8 to and from ISO-8859-15 still applies - ISO-8859-1 is the best default as we can (and do) algorithmically map from UTF-8 URLs (which are very common now) into ISO-8859-1.
So let's just set the default to ISO-8859-1 everywhere and leave the -15 discussion for another bug, where we can consider the code changes needed to support it (these may not be that large, but we would need to investigate the differences between -1 and -15 and convert those algorithmically).
--
TWiki:Main.RichardDonkin
- 19 Jun 2007
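The point about algorithmic mapping can be illustrated with a short sketch (not TWiki's actual conversion code; `utf8_to_latin1` is a hypothetical helper). Every ISO-8859-1 byte value equals the Unicode code point of the same character, so the conversion is a direct copy of code points, whereas ISO-8859-15 places characters such as the euro sign at positions that need a lookup.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 bytes to ISO-8859-1 bytes, or return undef
# if any character lies outside the ISO-8859-1 range (code points 0-255).
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    return undef if $chars =~ /[^\x00-\xff]/;
    # Byte value == code point, so each character maps straight to one byte.
    return pack('C*', map { ord } split //, $chars);
}

my $e_acute = "\xc3\xa9";        # UTF-8 bytes of U+00E9
my $euro    = "\xe2\x82\xac";    # UTF-8 bytes of U+20AC (euro sign)

printf "%02x\n", ord(utf8_to_latin1($e_acute));                           # e9
print defined utf8_to_latin1($euro) ? "mapped\n" : "not in ISO-8859-1\n";
printf "%02x\n", ord(encode('iso-8859-15', decode('UTF-8', $euro)));      # a4
```

The last line relies on Encode's own ISO-8859-15 table, which is the kind of per-character difference a -15 default would have to handle.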
The patch attached to Item3652 also fixes the ISO-8859-1 vs -15 conversion issue, and much else besides; however, it doesn't address Item2032 specifically.
--
TWiki:Main.RichardDonkin
- 19 Jun 2007
Possibly Not related to Bugs:Item4419.
--
TWiki:Main.SteffenPoulsen
- 20 Aug 2007
Updated Steffen's comment above; this is not related to that bug.
--
RichardDonkin - 23 Aug 2007
I am closing this bug for now: the original reporter has not shown interest in it since February, and from the little information we have the problem appears to be solved.
Re-open as new bug if further work is necessary.
--
TWiki:Main.SteffenPoulsen
- 09 Sep 2007