If you use

$TWiki::cfg{Site}{CharSet} = 'utf8';

then

---++ This is a headline with some _emphasis_ in it

will render

<h2><a name="This is a headline with some <em>em"></a> This is a headline with some _emphasis</em> in it </h2>

which is fatal.

Patch:

--- lib/TWiki/Render.pm (revision 11645)
+++ lib/TWiki/Render.pm (working copy)
@@ -400,10 +400,7 @@

     # For most common alphabetic-only character encodings (i.e. iso-8859-*),
     # remove non-alpha characters
-    if( defined($TWiki::cfg{Site}{CharSet}) &&
-          $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?/i ) {
-        $anchorName =~ s/[^$TWiki::regex{mixedAlphaNum}]+/_/g;
-    }
+    $anchorName =~ s/[^$TWiki::regex{mixedAlphaNum}]+/_/g;
     $anchorName =~ s/__+/_/g;           # remove excessive '_' chars
     if ( !$compatibilityMode ) {
         $anchorName =~ s/^[\s\#\_]*//;  # no leading space nor '#', '_'
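
For illustration, the effect of applying the substitution unconditionally can be sketched in a standalone script. This is only a minimal sketch: `$mixedAlphaNum` is a stand-in for `$TWiki::regex{mixedAlphaNum}` (whose real value depends on the site's locale configuration), and `normalise_anchor` is a hypothetical helper name, not a function in Render.pm.

```perl
use strict;
use warnings;

# ASSUMPTION: stand-in for $TWiki::regex{mixedAlphaNum}; the real value
# is built by TWiki from the site's locale settings.
my $mixedAlphaNum = 'a-zA-Z0-9';

# Hypothetical helper that mirrors the three substitutions shown in the
# patched code (non-compatibility mode).
sub normalise_anchor {
    my ($anchorName) = @_;
    $anchorName =~ s/[^$mixedAlphaNum]+/_/g;  # runs of other chars become '_'
    $anchorName =~ s/__+/_/g;                 # remove excessive '_' chars
    $anchorName =~ s/^[\s\#\_]*//;            # no leading space nor '#', '_'
    return $anchorName;
}

# The heading from the report, as written in the topic text.
print normalise_anchor('This is a headline with some _emphasis_ in it'), "\n";
# prints: This_is_a_headline_with_some_emphasis_in_it
```

With the stand-in character class, the emphasis underscores are folded into the separators, so nothing is left in the anchor name for the renderer to turn into `<em>` markup.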

Why are iso-8859-* charsets treated specially?

MD

I believe it was due to performance, as discussed in Item2032 - not sure how much of an impact we are talking about, though.

-- SP

Hm, but the check above to distinguish iso-*** charsets from others only happens when normalizing anchor names (replacing suspicious chars with an underscore). So it can't be related to form values.

MD

This isn't a performance issue. However, answering MD's question is surprisingly complex since this is a complex area...

On a side note: I'm surprised there aren't more bugs with use of underscores for emphasis in TOC entries - never thought this would work, and semantics of what TWikiML is allowed in TOC entries should be better defined anyway.

I18N of TOC entries is quite painful and really needs some work - what it should really do is look at the intended language of the page (e.g. French or Chinese) and then check whether that language is alphabetic. (This could be configured per site perhaps, as {langType} set to alphabetic or nonalphabetic).

  • UPDATE: Although we can't mix Perl locale features with Perl Unicode support features (see TWiki:Codev.UnicodeSupport for reason), we could just use current locale setting to get a simple view of site-wide language settings, even though the UnicodeSupport will ignore the locale. However, this doesn't address sites where multiple translated versions of pages are available, in which case the language of the page can sometimes be deduced from that. It also doesn't address pages that include multiple languages, of course.

Then, for TOCs in pages (or just specific TOC entries) using alphabetic languages, all non-alphabetic characters (except for allowed TWikiML such as emphasis) are stripped, so that the anchor is normalised (but can still include accented characters).

For non-alphabetic languages such as Chinese (see TWiki:Support.TOCnotWorkingForChineseHeadings), all non-script characters need to be stripped in a similar way, but with a different regex. This really needs full Unicode support turned on in Perl (TWiki:Codev.UnicodeSupport) to be practical - e.g. the Unicode regexes enable [\p{Letter}\p{Mark}] to match any letter or accent from alphabetic and non-alphabetic scripts.
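
As a rough sketch of what such a Unicode-aware filter could look like in plain Perl (outside TWiki): `unicode_anchor` is a hypothetical helper, and the code assumes the heading text has already been decoded into Perl character strings, which is exactly what TWiki does not do without Unicode support.

```perl
use strict;
use warnings;

# Hypothetical helper: keep any letter or combining mark from any script,
# replace every other run of characters with a single '_'.
# ASSUMPTION: $text is a decoded character string, not raw UTF-8 bytes.
sub unicode_anchor {
    my ($text) = @_;
    $text =~ s/[^\p{Letter}\p{Mark}]+/_/g;  # non-script characters become '_'
    $text =~ s/^_+|_+$//g;                  # trim leading/trailing '_'
    return $text;
}

binmode( STDOUT, ':encoding(UTF-8)' );

# Chinese heading written with \x{...} escapes so the script itself is ASCII.
print unicode_anchor("\x{4E2D}\x{6587} \x{6807}\x{9898}!"), "\n";

# Latin text with a combining acute accent (U+0301), matched by \p{Mark}.
print unicode_anchor("Cafe\x{301} menu"), "\n";
```

Note that this character class also strips digits; a real implementation would probably want to add `\p{Number}` or similar, which is a design choice not discussed above.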

Without full Unicode support in TWiki, non-alphabetic mode would need to be done in a less safe 'filter out bad characters only' mode, as now - otherwise most Chinese TOC entries simply don't work as they are all 'invalid' characters. See TWiki:Codev.InternationalisationIssues for links to Chinese TOC issues, and TWiki:Support.TOCnotWorkingForChineseHeadings in particular.

Mixing alphabetic and non-alphabetic languages is no worse than just non-alphabetic and would definitely need Unicode.

Today, UTF-8 can of course be used with alphabetic or non-alphabetic characters - however, since I18N for TWiki doesn't support Unicode yet (WikiWord I18N doesn't work, for example), it's best to assume that you are using non-alphabetic characters.

So - the test for iso-8859-* is a rather Western-European-biased and sub-optimal way of checking for "current language is alphabetic" that will not work with UTF-8 (or with many 8-bit alphabetic character sets used for Cyrillic and so on).
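
To make the bias concrete, here is a small sketch that runs the regex from the original check against a few charset names. The list of names and the helper `passes_iso_test` are chosen purely for illustration; only the regex itself comes from the code quoted in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex from the original check in Render.pm.
sub passes_iso_test {
    my ($charset) = @_;
    return ( defined($charset) && $charset =~ /^iso-?8859-?/i ) ? 1 : 0;
}

# ASSUMPTION: illustrative charset names, not taken from any TWiki config.
for my $cs ( 'iso-8859-1', 'ISO8859-15', 'utf-8', 'koi8-r', 'windows-1251' ) {
    printf "%-14s %s\n", $cs,
      passes_iso_test($cs) ? 'anchor normalised' : 'anchor left as-is';
}
```

Only the two iso-8859 names pass, so sites using UTF-8 or a Cyrillic 8-bit charset never get the non-alphabetic characters stripped from their anchors, even though their languages are alphabetic.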

UPDATE: There's a comment from me on the topic of non-alphabetic languages in TWiki:Codev.InternationalCharactersInFormFields.

-- RD


Perhaps related: Item2455

  • RD - not related, that's a configuration error.

AC

In fact, configuring the following is also an error - please configure based on the docs using locale, then try again:

$TWiki::cfg{Site}{CharSet} = 'utf8';

RD

Some updates above, flagged as UPDATE - the more I think about this, the harder it is to really determine the language used in a specific TOC entry. Some compromise is necessary.

RD

This topic was lost from the lists due to not having a codebase field. Rediscovered 3/2/07. Just set it to "No Action" if it is dead.

CC

In any case, please make sure to keep anchor links compatible; there are many URLs out there pointing to .../SomeWeb/SomeTopic#Some_auto_generated_anchor_from_subject

-- PTh

It was agreed at the release meeting on 02 Jul 2007 that, to fix this, it is OK to filter out chars and break old anchor links to headings with strange formatting in them.

KJL

No interest in this issue since February, assuming the patch is working.

Closing, re-open as new bug if more work needs to be done.

-- TWiki:Main.SteffenPoulsen - 09 Sep 2007

ItemTemplate
Summary: I18N: Using UTF8 in headers breaks header anchors
ReportedBy: TWiki:Main.MichaelDaum
Codebase:
SVN Range: TWiki-4.1, Sat, 23 Sep 2006, build 11571
AppliesTo: Engine
Component: I18N
Priority: Urgent
CurrentState: Closed
WaitingFor:
Checkins:
TargetRelease: minor
ReleasedIn: 4.2.0
Topic revision: r18 - 2008-01-22 - KennethLavrsen