
TWiki currently converts the UTF-8 string into Perl's internal Unicode representation, which causes the string to be treated as ISO-8859-1 if it contains no characters with a code point above 255. From perluniintro(1):

       Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example
       Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are
       0xFF or less, Perl uses the native eight-bit character set.  Otherwise, it uses UTF-8.

In other words, if you use UTF-8 but a given string has no characters with Unicode code points above 255, that string will be turned into Latin-1. I ran into this with the locale pt_BR.utf8 while working on Item782 (Portuguese has no characters above 255).
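The effect can be reproduced outside TWiki. The following is only a minimal sketch, assuming a Perl 5.8 or later with the core Encode module; the variable names and the sample text are illustrative and do not come from TWiki.pm.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 encoded bytes for the Portuguese fragment "ção" (illustrative sample).
my $bytes = "\xC3\xA7\xC3\xA3o";

# Turn the UTF-8 bytes into Perl characters, as Encode::decode does.
my $chars = Encode::decode( 'utf8', $bytes );

# Every code point is <= 0xFF, so the decoded string compares equal to
# the same text written as Latin-1 bytes.
my $same_as_latin1 = ( $chars eq "\xE7\xE3o" ) ? 1 : 0;

# Written to a handle with no encoding layer, such a string goes out as
# one byte per character; downgrading a copy shows those bytes.
my $raw = $chars;
utf8::downgrade($raw);

printf "input bytes: %d, characters: %d, output bytes: %d\n",
    length($bytes), length($chars), length($raw);
```

The five UTF-8 input bytes come back as three Latin-1 bytes, which is the corruption seen on a site whose charset is UTF-8.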

Suggested change (for my own tracking, this was committed locally as r11130):

=== TWiki.pm
==================================================================
--- TWiki.pm    (revision 11127)
+++ TWiki.pm    (local)
@@ -491,6 +491,20 @@
     # If not UTF-8 - assume in site character set, no conversion required
     return undef unless( $text =~ $regex{validUtf8StringRegex} );

+    # If site charset is already UTF-8 no need to convert anything:
+    if ( $TWiki::cfg{Site}{CharSet} =~ /^utf-?8$/i ) {
+        # Convert into internal Unicode characters if on Perl 5.8 or higher.
+        if( $] <  5.008 ) {
+            $this->writeWarning( 'UTF-8 not supported on Perl '.$].
+                                 ' - use Perl 5.8 or higher..' );
+        }
+
+        # SMELL: is this true yet?
+        $this->writeWarning( 'UTF-8 not yet supported as site charset -'.
+                             'TWiki is likely to have problems' );
+        return $text;
+    }
+
     # Convert into ISO-8859-1 if it is the site charset
     if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
         # ISO-8859-1 maps onto first 256 codepoints of Unicode
@@ -498,18 +512,6 @@
         $text =~ s/ ([\xC2\xC3]) ([\x80-\xBF]) /
           chr( ord($1) << 6 & 0xC0 | ord($2) & 0x3F )
             /egx;
-    } elsif ( $TWiki::cfg{Site}{CharSet} eq 'utf-8' ) {
-        # Convert into internal Unicode characters if on Perl 5.8 or higher.
-        if( $] >= 5.008 ) {
-            require Encode;            # Perl 5.8 or higher only
-            # 'decode' into UTF-8
-            $text = Encode::decode('utf8', $text);
-        } else {
-            $this->writeWarning( 'UTF-8 not supported on Perl '.$].
-                                 ' - use Perl 5.8 or higher..' );
-        }
-        $this->writeWarning( 'UTF-8 not yet supported as site charset -'.
-                             'TWiki is likely to have problems' );
     } else {
         # Convert from UTF-8 into some other site charset
         if( $] >= 5.008 ) {
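
The control flow of the patch can be tried in isolation. This is a rough sketch, not the TWiki.pm code: the function name utf8_to_site_charset and its $site_charset argument are hypothetical stand-ins for the patched routine and $TWiki::cfg{Site}{CharSet}, the UTF-8 validity check and the warnings are left out, and charsets other than UTF-8 and ISO-8859-1 are not handled.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched branch; $site_charset plays the
# role of $TWiki::cfg{Site}{CharSet}.
sub utf8_to_site_charset {
    my ( $text, $site_charset ) = @_;

    # Site charset is already UTF-8: hand the bytes back untouched.
    return $text if $site_charset =~ /^utf-?8$/i;

    # ISO-8859-1: fold two-byte UTF-8 sequences into single bytes,
    # using the same substitution as the surrounding TWiki.pm code.
    if ( $site_charset =~ /^iso-?8859-?15?$/i ) {
        $text =~ s/ ([\xC2\xC3]) ([\x80-\xBF]) /
            chr( ord($1) << 6 & 0xC0 | ord($2) & 0x3F )
              /egx;
        return $text;
    }

    # Other charsets are out of scope for this sketch.
    return;
}

my $utf8_bytes = "\xC3\xA7\xC3\xA3o";    # "ção" as UTF-8 bytes
my $kept       = utf8_to_site_charset( $utf8_bytes, 'UTF8' );
my $folded     = utf8_to_site_charset( $utf8_bytes, 'iso-8859-1' );
```

Note that the new test is a case-insensitive pattern, so spellings such as 'utf8' and 'UTF-8' take the pass-through branch, whereas the removed code compared against the exact string 'utf-8'.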

I'm waiting for comments, especially from RD.

AT

I see the bug now. Your change would remove the Encode::decode('utf8', $text) line. That line converts UTF-8 byte sequences into internal Perl UTF-8 characters, there's no support yet in TWiki for such characters, and it seems to be causing this issue, so I think your change should be fine. I did put in the decode as part of the UTF-8 charset support work, but unfortunately never finished it and left this dangling.

Without code to convert all data into internal UTF-8 characters (plus a fix for the performance problems with UTF-8, and a complete reworking of the dynamic 'use locale' code at the start of all modules, to avoid mixing locales and Unicode), it's best to just get rid of the decode as you are doing.

You should also remove the comment line that reads "# Convert into internal Unicode characters if on Perl 5.8 or higher."

If someone could comment on the two current issues in TWiki:Support about character set problems (Umlaute on Windows etc.) and point them here, that would be helpful. I'm horribly busy at the moment, but it may be that their problem is related. Commenting out the decode line in Cairo might help...

RD

Committed to DEVELOP as r7374.

AT

ItemTemplate
Summary: UTF82SiteCharset should do nothing when {Site}{CharSet} is already UTF-8
ReportedBy: AntonioTerceiro
AppliesTo: Engine
Component:
Priority: Normal
CurrentState: Closed
WaitingFor:
Checkins: 7374
Topic revision: r4 - 2005-11-08 - AntonioTerceiro
 