We already enforce some restrictions on filenames. To keep the %ATTACHURL% construct easy to use in I18N environments, 8-bit content should be stripped from filenames.
For Latin-1 I currently do it as shown below, which seems to work very nicely (I used a file name like this for testing):
ßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ.txt
(Latin-1 has other characters like
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
which I haven't tested).
This patch normalizes characters somewhat like the registration form does (for Scandinavian/German characters). Unfortunately, it only kicks in when Unicode::Normalize is available, which introduces a dependency on that module:
Index: lib/TWiki/Sandbox.pm
===================================================================
--- lib/TWiki/Sandbox.pm (revision 11997)
+++ lib/TWiki/Sandbox.pm (working copy)
@@ -186,6 +186,35 @@
my $origName = $fileName;
# Change spaces to underscore
$fileName =~ s/ /_/go;
+ # If in iso8859 surroundings and Unicode::Normalize is available, let's get rid of 8-bit chars in filenames
+ if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
+ if( $] >= 5.008 && eval { require Unicode::Normalize } ) {
+ require Encode;
+ use Unicode::Normalize;
+ # Some normalizations need to be intercepted early
+ $fileName =~ s/\xc4/AE/g;
+ $fileName =~ s/\xc5/AA/g;
+ $fileName =~ s/\xd6/OE/g;
+ $fileName =~ s/\xdc/UE/g;
+ $fileName =~ s/\xe4/ae/g;
+ $fileName =~ s/\xe5/aa/g;
+ $fileName =~ s/\xf6/oe/g;
+ $fileName =~ s/\xfc/ue/g;
+ # convert to Unicode
+ $fileName = NFD( $fileName ); # decompose (Unicode Normalization Form D)
+ $fileName =~ s/\pM//g; # strip combining characters
+ # normalizations, Latin-1
+ $fileName =~ s/\x{00c6}/AE/g;
+ $fileName =~ s/\x{00d8}/OE/g;
+ $fileName =~ s/\x{00df}/ss/g;
+ $fileName =~ s/\x{00e6}/ae/g;
+ $fileName =~ s/\x{00f8}/oe/g;
+ $fileName =~ s/\x{0152}/OE/g;
+ $fileName =~ s/\x{0153}/oe/g;
+ # clear everything left that is 8-bit
+ $fileName =~ s/[^\0-\x7f]//g;
+ }
+ }
# Remove problematic chars
$fileName =~ s/$TWiki::cfg{NameFilter}//goi;
# Append .txt to some files
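For trying the normalization steps outside of TWiki, here is a minimal standalone sketch. The helper name ascii_file_name is hypothetical and not part of Sandbox.pm, and the sketch assumes the file name holds one character per Latin-1 byte, as the patch does. It mirrors the order used above: digraph replacements first, then decomposition, then the letters that have no canonical decomposition.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of TWiki): reduce a Latin-1 name to 7-bit ASCII.
sub ascii_file_name {
    my ($name) = @_;

    # These must run before decomposition, otherwise NFD plus mark
    # stripping would turn e.g. a-umlaut into a plain "a".
    my %early = (
        "\x{c4}" => 'AE', "\x{c5}" => 'AA', "\x{d6}" => 'OE', "\x{dc}" => 'UE',
        "\x{e4}" => 'ae', "\x{e5}" => 'aa', "\x{f6}" => 'oe', "\x{fc}" => 'ue',
    );
    $name =~ s/([\x{c4}\x{c5}\x{d6}\x{dc}\x{e4}\x{e5}\x{f6}\x{fc}])/$early{$1}/g;

    $name = NFD($name);    # split accented letters into base + combining mark
    $name =~ s/\pM//g;     # drop the combining marks

    # Letters without a canonical decomposition need explicit replacements.
    my %late = (
        "\x{c6}" => 'AE', "\x{d8}" => 'OE', "\x{df}" => 'ss',
        "\x{e6}" => 'ae', "\x{f8}" => 'oe',
    );
    $name =~ s/([\x{c6}\x{d8}\x{df}\x{e6}\x{f8}])/$late{$1}/g;

    $name =~ s/[^\x00-\x7f]//g;    # remove anything still outside 7-bit ASCII
    return $name;
}

print ascii_file_name("Gr\x{fc}\x{df}e_\x{e9}t\x{e9}.txt"), "\n";
```

With this input the u-umlaut becomes "ue" in the first step, the e-acute loses its accent in the second, and the sharp s becomes "ss" in the third.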
If there is a better way of doing this using only the dependencies we already have, let me know.
Setting this to waiting for feedback.
--
SP
Committing as is; please redo with fewer requirements if possible.
--
SP
The
use Unicode::Normalize;
is a problem since it loads the module at compile time, regardless of whether we are in an
I18N environment or not. We cannot raise the number of required libs just like this. Please make this conditional.
--
PTh
Thanks, you are right about that, sorry. I have put the use statement in an eval:
- use Unicode::Normalize;
+ eval { use Unicode::Normalize };
--
SP
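A note on making the load conditional: Perl processes a use statement at compile time even when it sits inside an eval block, so a missing module would still abort compilation. One way to get a truly conditional load is a run-time require inside the eval, calling NFD by its fully qualified name. The following is only a sketch under that assumption; the names $have_normalize and strip_accents are hypothetical and do not appear in Sandbox.pm.

```perl
use strict;
use warnings;

# require runs at run time, so a missing module is caught by eval
# instead of breaking compilation of the whole file.
my $have_normalize = eval {
    require Unicode::Normalize;
    1;
};

# Hypothetical helper: strip accents when the module is available,
# otherwise return the name unchanged.
sub strip_accents {
    my ($name) = @_;
    return $name unless $have_normalize;

    # Fully qualified call, since nothing was imported at compile time.
    $name = Unicode::Normalize::NFD($name);
    $name =~ s/\pM//g;    # remove combining marks left by decomposition
    return $name;
}

print strip_accents("caf\x{e9}.txt"), "\n";
```

On installations without Unicode::Normalize the function degrades to a no-op, so the module does not become a hard requirement.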
4.1.0 released
KJL
I don't agree that
I18N characters should be stripped from filenames - in fact I spent some time making sure they weren't! Once the real fix to
Item3652 is done, this fix to 3163 should be re-examined. I've also commented about this fix under
Item3652.
For reasonable security in what characters are allowed, how about stripping filename characters that don't match the already defined
$regex{filenameRegex}
? That will ensure only reasonable filename characters are included - just add dashes, spaces and underscores and you're done, with much less code. This should be protected with a
{Site}{LangAlphabetic}
boolean in config: if true, do this filtering-in; if false, do filtering-out (or perhaps still filter in, using Unicode regexes locally to ensure we only have Chinese letters or whatever). I think normalisation is probably over the top, as the regexes will work either way. See
TWiki:Codev.UnicodeSupport and also
here re langAlphabetic idea.
RD
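To make the filtering-in idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the allowlist stands in for $regex{filenameRegex} (whose real definition is not shown in this item), the $name_filter pattern stands in for {NameFilter}, and the $alphabetic argument plays the role of the proposed {Site}{LangAlphabetic} setting. It also assumes the name is a character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Assumed allowlist: Unicode letters and digits plus dot, underscore,
# dash and space. A stand-in for $regex{filenameRegex}.
my $allowed = qr/[\p{L}\p{N}._\- ]/;

# Assumed denylist, a stand-in for $TWiki::cfg{NameFilter}.
my $name_filter = qr/[\x00-\x1f*?|<>"]/;

# Hypothetical helper: $alphabetic mimics {Site}{LangAlphabetic}.
sub filter_file_name {
    my ( $name, $alphabetic ) = @_;
    if ($alphabetic) {
        # Filter in: keep only characters that match the allowlist.
        $name = join '', $name =~ /($allowed)/g;
    }
    else {
        # Filter out: remove only the characters known to be problematic.
        $name =~ s/$name_filter//g;
    }
    return $name;
}

my $clean = filter_file_name( "caf\x{e9} menu?.txt", 1 );
print length($clean), "\n";
```

In this sketch the accented letter survives filtering-in because it is a Unicode letter, while the question mark is dropped; no normalisation step is involved.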