We are already enforcing some restrictions on filenames. To keep the %ATTACHURL% construct easy to use in I18N environments, 8-bit content should be stripped from filenames.

For Latin-1 I am currently doing it this way, which seems to work very nicely (I used a file name like this for testing):


(Latin-1 has other characters like ¤¦¨´¸¼½¾ which I haven't tested).

This patch normalizes characters somewhat like the registration form does (for Scandinavian/German characters). Unfortunately, it needs Unicode::Normalize in order to kick in, which introduces a new dependency:

Index: lib/TWiki/Sandbox.pm
--- lib/TWiki/Sandbox.pm        (revision 11997)
+++ lib/TWiki/Sandbox.pm        (working copy)
@@ -186,6 +186,35 @@
     my $origName = $fileName;
     # Change spaces to underscore
     $fileName =~ s/ /_/go;
+    # If in iso8859 surroundings and Unicode::Normalize is available, let's get rid of 8-bit chars in filenames
+    if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
+        if( $] >= 5.008 && eval { require Unicode::Normalize } ) {
+            require Encode;
+            use Unicode::Normalize;
+            # Some normalizations need to be intercepted early
+            $fileName =~ s/\xc4/AE/g;
+            $fileName =~ s/\xc5/AA/g;
+            $fileName =~ s/\xd6/OE/g;
+            $fileName =~ s/\xdc/UE/g;
+            $fileName =~ s/\xe4/ae/g;
+            $fileName =~ s/\xe5/aa/g;
+            $fileName =~ s/\xf6/oe/g;
+            $fileName =~ s/\xfc/ue/g;
+            #  convert to Unicode
+            $fileName = NFD( $fileName );  # decompose (Unicode Normalization Form D)
+            $fileName =~ s/\pM//g;         # strip combining characters
+            # normalizations, Latin-1
+            $fileName =~ s/\x{00c6}/AE/g;
+            $fileName =~ s/\x{00d8}/OE/g;
+            $fileName =~ s/\x{00df}/ss/g;
+            $fileName =~ s/\x{00e6}/ae/g;
+            $fileName =~ s/\x{00f8}/oe/g;
+            $fileName =~ s/\x{0152}/OE/g;
+            $fileName =~ s/\x{0153}/oe/g;
+            # clear everything left that is 8-bit
+            $fileName =~ s/[^\0-\x7f]//g;
+        }
+    }
     # Remove problematic chars
     $fileName =~ s/$TWiki::cfg{NameFilter}//goi;
     # Append .txt to some files

If there's a better way of doing this within the dependencies we already have, let me know.
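
One possible dependency-free direction, as a minimal sketch only: a hand-written Latin-1 lookup table using nothing but core Perl. The helper name asciify_latin1 and the table are hypothetical and not part of TWiki. It assumes the filename is a byte string in ISO-8859-1, and it covers only the letters handled by the patch above plus the plain accented vowels, C, N and Y.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TWiki.
# Letters that expand to two ASCII characters (same mappings as the patch).
my %TWO_CHAR = (
    "\xc4" => 'AE', "\xc5" => 'AA', "\xc6" => 'AE', "\xd6" => 'OE',
    "\xd8" => 'OE', "\xdc" => 'UE', "\xdf" => 'ss', "\xe4" => 'ae',
    "\xe5" => 'aa', "\xe6" => 'ae', "\xf6" => 'oe', "\xf8" => 'oe',
    "\xfc" => 'ue',
);

sub asciify_latin1 {
    my ($name) = @_;

    # Two-character expansions first, so that e.g. 0xE5 becomes "aa"
    $name =~ s/([\xc4\xc5\xc6\xd6\xd8\xdc\xdf\xe4\xe5\xe6\xf6\xf8\xfc])/$TWO_CHAR{$1}/g;

    # Remaining accented letters: drop the accent, keep the base letter
    $name =~ tr/\xc0-\xc3\xc7\xc8-\xcb\xcc-\xcf\xd1\xd2-\xd5\xd9-\xdb\xdd/AAAACEEEEIIIINOOOOUUUY/;
    $name =~ tr/\xe0-\xe3\xe7\xe8-\xeb\xec-\xef\xf1\xf2-\xf5\xf9-\xfb\xfd\xff/aaaaceeeeiiiinoooouuuyy/;

    # Anything still outside US-ASCII is removed
    $name =~ s/[^\x00-\x7f]//g;
    return $name;
}
```

The trade-off is that the table must be maintained by hand and is tied to one charset, whereas Unicode::Normalize handles any decomposable character.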

Setting this to waiting for feedback.

-- SP

Committing as is; please redo with fewer requirements if possible.

-- SP

The use Unicode::Normalize; statement is a problem, since it loads the module at compile time regardless of whether we are in an I18N environment or not. We cannot raise the number of required libs just like this. Please make this conditional.

-- PTh

Thanks, you are right about that, sorry. I have put the use statement in an eval:

-            use Unicode::Normalize;
+            eval { use Unicode::Normalize };

-- SP
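
A note on the eval form: Perl processes a use statement at compile time even when it sits inside an eval block, so eval { use ... } still fails to compile if the module is missing. A minimal sketch of run-time loading follows, using require inside eval and a fully qualified call so that no import is needed. The helper names have_unicode_normalize and strip_marks are hypothetical, not TWiki functions.

```perl
use strict;
use warnings;

# Hypothetical helper: true if Unicode::Normalize can be loaded at run time.
# Nothing is loaded when this file is compiled.
sub have_unicode_normalize {
    return eval { require Unicode::Normalize; 1 } ? 1 : 0;
}

# Hypothetical helper: strip combining marks if the module is available,
# otherwise return the text unchanged.
sub strip_marks {
    my ($text) = @_;
    return $text unless have_unicode_normalize();

    # Fully qualified call, so neither 'use' nor an import is required
    $text = Unicode::Normalize::NFD($text);
    $text =~ s/\pM//g;    # remove the combining characters left by NFD
    return $text;
}
```

A string eval (eval "use Unicode::Normalize; 1") would also defer loading to run time, at the cost of compiling a string.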

4.1.0 released


I don't agree that I18N characters should be stripped from filenames - in fact I spent some time making sure they weren't! Once the real fix to Item3652 is done, this fix to 3163 should be re-examined. I've also commented about this fix under Item3652.

For reasonable security in what characters are allowed, how about stripping filename characters that don't match the already defined $regex{filenameRegex}? That will ensure only reasonable filename characters are included - just add dashes, spaces and underscores and you're done, with much less code. This should be protected with a {Site}{LangAlphabetic} boolean in config - if true, do this filtering, if false, do filtering out (or perhaps do filtering-in and use Unicode regexes locally to ensure we only have Chinese letters or whatever.) I think normalisation is probably over the top as the regexes will work either way. See TWiki:Codev.UnicodeSupport and also here re langAlphabetic idea.
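
The filtering-in idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the character class stands in for whatever $regex{filenameRegex} really allows, the blacklist stands in for {NameFilter}, and the flag is passed as a plain argument instead of being read from {Site}{LangAlphabetic}. It assumes a Latin-1 byte string.

```perl
use strict;
use warnings;

# Assumed stand-in for the allowed set: ASCII letters and digits, the
# Latin-1 letters, plus dot, underscore, space and dash.
my $NOT_ALLOWED = qr/[^A-Za-z0-9\xc0-\xd6\xd8-\xf6\xf8-\xff._ -]/;

# Assumed stand-in for a filter-out blacklist (not the real NameFilter).
my $NAME_FILTER = qr{[\x00-\x1f<>|&;*?"'/\\]};

# Hypothetical helper: filter in when the site language is alphabetic,
# filter out otherwise.
sub filter_filename {
    my ( $name, $langAlphabetic ) = @_;
    if ($langAlphabetic) {
        $name =~ s/$NOT_ALLOWED//g;    # keep only known-good characters
    }
    else {
        $name =~ s/$NAME_FILTER//g;    # remove only known-bad characters
    }
    return $name;
}
```

With the flag set, I18N letters survive and only characters outside the allowed set are dropped; no normalisation step is involved.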


Summary I18N: Strip uploaded filenames for 8-bit characters (Allow only US-ASCII)
ReportedBy TWiki:Main.SteffenPoulsen
Codebase ~twiki4
SVN Range TWiki-4.1, Thu, 09 Nov 2006, build 11947
AppliesTo Engine

Priority Normal
CurrentState Closed

Checkins 12015 12020
TargetRelease minor

