We already enforce some restrictions on filenames. To keep the %ATTACHURL% construct easy to use in I18N environments, 8-bit content should be stripped from filenames.

For Latin-1 I currently do it the following way, which seems to work very well (I used a file name like this for testing):


(Latin-1 has other characters like ¤¦¨´¸¼½¾ which I haven't tested).

This patch normalizes characters somewhat like the registration form does (for Scandinavian/German characters). Unfortunately, it only kicks in when Unicode::Normalize is available, which introduces a new dependency:

Index: lib/TWiki/Sandbox.pm
--- lib/TWiki/Sandbox.pm        (revision 11997)
+++ lib/TWiki/Sandbox.pm        (working copy)
@@ -186,6 +186,35 @@
     my $origName = $fileName;
     # Change spaces to underscore
     $fileName =~ s/ /_/go;
+    # If in iso8859 surroundings and Unicode::Normalize is available, let's get rid of 8-bit chars in filenames
+    if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
+        if( $] >= 5.008 && eval { require Unicode::Normalize } ) {
+            require Encode;
+            use Unicode::Normalize;
+            # Some normalizations need to be intercepted early
+            $fileName =~ s/\xc4/AE/g;
+            $fileName =~ s/\xc5/AA/g;
+            $fileName =~ s/\xd6/OE/g;
+            $fileName =~ s/\xdc/UE/g;
+            $fileName =~ s/\xe4/ae/g;
+            $fileName =~ s/\xe5/aa/g;
+            $fileName =~ s/\xf6/oe/g;
+            $fileName =~ s/\xfc/ue/g;
+            #  convert to Unicode
+            $fileName = NFD( $fileName );  # decompose (Unicode Normalization Form D)
+            $fileName =~ s/\pM//g;         # strip combining characters
+            # normalizations, Latin-1
+            $fileName =~ s/\x{00c6}/AE/g;
+            $fileName =~ s/\x{00d8}/OE/g;
+            $fileName =~ s/\x{00df}/ss/g;
+            $fileName =~ s/\x{00e6}/ae/g;
+            $fileName =~ s/\x{00f8}/oe/g;
+            $fileName =~ s/\x{0152}/OE/g;
+            $fileName =~ s/\x{0153}/oe/g;
+            # clear everything left that is 8-bit
+            $fileName =~ s/[^\0-\x7f]//g;
+        }
+    }
     # Remove problematic chars
     $fileName =~ s/$TWiki::cfg{NameFilter}//goi;
     # Append .txt to some files
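
As a standalone illustration of the steps in the patch, here is a minimal sketch that can be run outside TWiki. The helper name ascii_filename is hypothetical and not part of TWiki. The sketch assumes the input is a Latin-1 (iso-8859-1) byte string, decodes it explicitly before normalizing, and loads Unicode::Normalize unconditionally for brevity.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize ();    # loaded unconditionally here, for brevity only

# Hypothetical helper, not part of TWiki: reduce a Latin-1 byte string
# to US-ASCII, following the same steps as the patch.
sub ascii_filename {
    my ($bytes) = @_;

    # Bytes -> characters, so that \x{...} and \pM match code points.
    my $name = Encode::decode( 'iso-8859-1', $bytes );

    my %map = (
        # Digraphs that must be applied before NFD, which would
        # otherwise reduce these letters to the bare base letter.
        "\x{c4}" => 'AE', "\x{c5}" => 'AA', "\x{d6}" => 'OE', "\x{dc}" => 'UE',
        "\x{e4}" => 'ae', "\x{e5}" => 'aa', "\x{f6}" => 'oe', "\x{fc}" => 'ue',
        # Letters that have no canonical decomposition at all.
        "\x{c6}" => 'AE', "\x{d8}" => 'OE', "\x{df}" => 'ss',
        "\x{e6}" => 'ae', "\x{f8}" => 'oe',
    );
    $name =~ s/([\x{80}-\x{ff}])/exists $map{$1} ? $map{$1} : $1/ge;

    $name = Unicode::Normalize::NFD($name);    # e.g. e-acute -> e + combining acute
    $name =~ s/\pM//g;                         # drop the combining marks
    $name =~ s/[^\x00-\x7f]//g;                # drop anything still non-ASCII
    return $name;
}

print ascii_filename("Bl\xe5b\xe6rgr\xf8d_\xe9.txt"), "\n";    # prints Blaabaergroed_e.txt
```

Characters that are neither mapped nor decomposable, such as the currency sign or the vulgar fractions, are simply removed by the last substitution.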

If there is a better way of doing this within the dependencies we already have, let me know.

Setting this to waiting for feedback.

-- SP

Committing as is; please redo with fewer requirements if possible.

-- SP

The use Unicode::Normalize; statement is a problem because it loads the module at compile time, regardless of whether we are in an I18N environment. We cannot raise the number of required libs just like this. Please make this conditional.

-- PTh

Thanks, you are right about that, sorry. I have put the use statement in an eval:

-            use Unicode::Normalize;
+            eval { use Unicode::Normalize };

-- SP
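
A caveat on the eval form: Perl processes a use statement at compile time even when it sits inside an eval block, so only require (or a string eval) defers the load to run time. A minimal sketch of run-time loading follows; the helper name optional_nfd is hypothetical and not a TWiki function.

```perl
use strict;
use warnings;

# Hypothetical helper: return a reference to NFD, or undef if the
# module cannot be loaded. Nothing is loaded until this sub is called.
sub optional_nfd {
    # require runs at run time; the trailing 1 makes success explicit.
    my $loaded = eval { require Unicode::Normalize; 1 };
    return $loaded ? \&Unicode::Normalize::NFD : undef;
}

my $fileName = "caf\x{e9}.txt";
if ( my $nfd = optional_nfd() ) {
    $fileName = $nfd->($fileName);    # called via reference, no import needed
    $fileName =~ s/\pM//g;            # strip combining marks
}
print "$fileName\n";
```

Because the function is reached through a fully qualified name, no import is required, and a site without the module simply skips the normalization.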

4.1.0 released


I don't agree that I18N characters should be stripped from filenames - in fact I spent some time making sure they weren't! Once the real fix to Item3652 is done, this fix to 3163 should be re-examined. I've also commented about this fix under Item3652.

For reasonable security in what characters are allowed, how about stripping filename characters that don't match the already defined $regex{filenameRegex}? That will ensure only reasonable filename characters are included - just add dashes, spaces and underscores and you're done, with much less code. This should be protected with a {Site}{LangAlphabetic} boolean in config: if true, do this filtering-in; if false, do filtering-out (or perhaps do filtering-in and use Unicode regexes locally to ensure we only have Chinese letters or whatever). I think normalisation is probably over the top, as the regexes will work either way. See TWiki:Codev.UnicodeSupport and also here regarding the langAlphabetic idea.
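
A rough sketch of the filter-in versus filter-out idea might look like the following. Everything here is an assumption for illustration: the character class stands in for whatever $regex{filenameRegex} really contains, the flag models the proposed {Site}{LangAlphabetic} setting, the removal list stands in for {NameFilter}, and the dot is allowed so that file extensions survive.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the character class behind $regex{filenameRegex}:
# any Unicode letter or digit. The real TWiki definition may differ.
my $allowed = '\p{L}\p{N}';

# $lang_alphabetic models the proposed {Site}{LangAlphabetic} boolean.
sub filter_filename {
    my ( $name, $lang_alphabetic ) = @_;
    if ($lang_alphabetic) {
        # Filter in: keep letters and digits plus dot, space, underscore, dash.
        $name =~ s/[^${allowed}. _-]//g;
    }
    else {
        # Filter out: remove only a fixed set of troublesome characters
        # (hypothetical list, standing in for {NameFilter}).
        $name =~ s{[\\/*?"<>|:]}{}g;
    }
    return $name;
}

binmode( STDOUT, ':encoding(UTF-8)' );
print filter_filename( "\x{17d}lut\x{e9} (v2)?.txt", 1 ), "\n";
print filter_filename( "\x{17d}lut\x{e9} (v2)?.txt", 0 ), "\n";
```

With the flag set, the accented letters are kept and only the parentheses and question mark are dropped; no normalization module is involved.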


Summary: I18N: Strip uploaded filenames for 8-bit characters (Allow only US-ASCII)
ReportedBy: TWiki:Main.SteffenPoulsen
Codebase: ~twiki4
SVN Range: TWiki-4.1, Thu, 09 Nov 2006, build 11947
AppliesTo: Engine
Priority: Normal
CurrentState: Closed
Checkins: 12015 12020
TargetRelease: minor

Topic revision: r12 - 2007-03-04 - RichardDonkin