search CGI hangs when queried by Googlebot/2.1
I recently set up Nagios to monitor the CPU load and disk space on one of the TWiki webservers that I manage.
Interestingly, immediately after doing this, I started to get reports that the CPU load was hanging around 1.2 - 2.2 for hours at a time.
After some investigating, I discovered that the TWiki 'search' CGI was consuming all of the CPU, for example:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
25168 apache 25 0 146M 134M 1432 R 99.9 13.3 6:43 1 search
28940 apache 25 0 36104 35M 1788 R 80.9 3.5 0:51 0 search
Expecting the worst (somebody trying to hack TWiki?), I started digging through the logs to find out which query was causing the search CGI to consume so much CPU for so long.
I eventually came across log entries from Googlebot/2.1 that looked like this:
184.108.40.206 - - [21/Nov/2008:10:31:33 -0500] "GET /twiki/bin/search/TWiki/?scope=topic&regex=on&bookview=on&search=%5C.* HTTP/1.1" 500 614 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
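For anyone digging through their own logs for the same thing, here is a rough sketch of how such entries could be filtered out. It assumes the Apache "combined" log format and that the garbled parameter in the entry above is regex=on; the names LOG_RE and search_hits are made up for this example.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed Apache "combined" log format; adjust if your LogFormat differs.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def search_hits(lines, script="/twiki/bin/search"):
    """Yield (user agent, status, decoded query) for requests to the search CGI."""
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that are not in the assumed format
        parts = urlsplit(m["target"])
        if parts.path.startswith(script):
            yield m["agent"], m["status"], parse_qs(parts.query)

# The log entry quoted above, with the query string written out in full.
SAMPLE = (
    '184.108.40.206 - - [21/Nov/2008:10:31:33 -0500] '
    '"GET /twiki/bin/search/TWiki/?scope=topic&regex=on&bookview=on&search=%5C.* HTTP/1.1" '
    '500 614 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

hits = list(search_hits([SAMPLE]))
for agent, status, query in hits:
    print(status, query.get("search"), agent)
```

Feeding it the lines of an access log instead of SAMPLE should list every request to the search script along with the decoded search term.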
I attempted to replicate this behavior, so first I killed any 'search' processes that were hanging around. Then, I visited a URL like this: http://mytwikisite.org/twiki/bin/search/TWiki/?scope=topic&regex=on&bookview=on&search=\
Sure enough, I saw the 'search' CGI using up 100% of the CPU. Of course, I'd expect 'search' to be a relatively CPU intensive process, but would not expect it to hang around for hours eating up all of the CPU.
The installation that I am experiencing this problem with is "TWiki Release 4.0.0 (Dakar), 01 Feb 2006"
An older TWiki web that I manage is (still) running "01-Sep-2004 Release (Cairo)". When I visit the search CGI in my web browser using the same query string, I actually get an "aggregated" page of everything that is contained in that TWiki web. This is returned relatively quickly, and doesn't cause search to hang around for hours using up all of the CPU.
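The "everything" result on Cairo is what I would expect from the query itself: the decoded search term is the regular expression \.*, meaning "zero or more literal dots", which matches any text at all, even an empty string. A minimal sketch in Python, assuming TWiki's regex search treats this pattern the same way Python's re module does (the topic names and texts are made up):

```python
import re

PATTERN = r"\.*"  # decoded value of search=%5C.*

# Hypothetical topic texts; none of these come from a real TWiki web.
topics = {
    "WebHome": "Welcome to the web.",
    "NoDots": "plain text without a single dot",
    "Empty": "",
}

# A topic "matches" if the pattern is found anywhere in its text.
matching = [
    name for name, text in topics.items()
    if re.search(PATTERN, text) is not None
]
print(matching)
```

Every topic ends up in the result, so with bookview=on a single request asks the search script to render the whole web in one page.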
Another TWiki installation that I have is running "4.2 (Freetown), 22 Jan 2008". Interestingly, for whatever reason, I receive the message "TWiki detected an internal error - please check your TWiki logs and webserver logs for more information. Can't open access.log: Permission denied". So I checked my webserver logs and found "Can't open access.log: Permission denied at /a/twiki/lib/TWiki/Plugins/AccessStatsPlugin.pm line 318". Not sure why it wants to read access.log, but I went ahead and ran chmod 644 /var/log/apache2/access.log. Now I get a totally different error -- "Insecure dependency in eval while running with -T switch at /usr/lib/perl5/GD.pm line 95", which is a completely different problem related to ChartPlugin. I believe this is a known issue, but I have never followed up on it, so I cannot test on my TWiki 4.2 deployment.
Is this some kind of "trick" that Googlebot uses when it encounters a TWiki site, in an attempt to "speed up" indexing by issuing a single request to search and then parsing and making sense of the result? (I really hope not!)
In any case, Google is not able to index my TWiki site that is having this problem. For example, if I search Google for "site:mytwikisite.org", I only get a single result, with no context. This is not desirable, as it is a public project page that we would certainly like to be indexed ...
- 11 Mar 2009
I thought this might be because of my Apache 2.0 installation, as per the comment in LocalLib:
# -------------- Only needed to work around an Apache 2.0 bug on Unix
# OPTIONAL
# If you are running TWiki on Apache 2.0 on Unix you might experience
# TWiki scripts hanging forever. This is a known Apache 2.0 bug. A fix is
# available at http://issues.apache.org/bugzilla/show_bug.cgi?id=22030.
# You are recommended to patch your Apache installation.
However, after uncommenting the workaround that redirects to /tmp/error.log, I found that nothing shows up at all in error.log. My Apache (2.0.40) does seem to be affected (when I try the test script they give at https://issues.apache.org/bugzilla/show_bug.cgi?id=22030), but I am not convinced that this is the problem.
When I bring up the URL that triggers the problem, after about 26 minutes the search process finally dies...(maybe running out of memory after getting up to over 400MB?)
My TWiki site is not particularly big -- there is about 9.6MB of stuff in /data/ and about 295MB in /pub/ ...
# while (true); do date; ps auxwww | grep search | grep -v grep; sleep 10; done
Wed Mar 11 13:22:47 EST 2009
apache 19582 92.5 3.2 35340 33832 ? R 13:21 1:27 /usr/bin/perl -wT /var/www/virtual_servers/hpn/twiki/bin/search scope=topic\&regex=on\&bookview=on\&search=\\.\*
...
Wed Mar 11 13:38:21 EST 2009
apache 19582 52.2 40.3 417668 415528 ? S 13:21 8:56 /usr/bin/perl -wT /var/www/virtual_servers/hpn/twiki/bin/search scope=topic\&regex=on\&bookview=on\&search=\\.\*
Wed Mar 11 13:38:31 EST 2009
apache 19582 51.6 0.0 0 0 ? Z 13:21 8:56 [search <defunct>]
- 11 Mar 2009
Don't spend time on the search script; it is still there but deprecated. Search is now done with a %SEARCH embedded in TWiki.WebSearch.
Tell your GSA via the robots.txt file to ignore /twiki/bin/search and all URLs that have URL parameters. The only thing you need to index is /twiki/bin/view/SomeWeb/SomeTopic (without any URL parameters).
- 11 Mar 2009
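For reference, the robots.txt advice above could look roughly like the rules embedded in the sketch below. This is an assumption-laden example, not a tested configuration for this site: the /twiki/bin/ paths follow the URLs quoted in this thread, and the `/*?` wildcard rule relies on pattern matching that Google documents for Googlebot but that not every crawler honors. The check uses Python's standard urllib.robotparser, which only does plain prefix matching, so it exercises the first rule and not the wildcard one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; paths follow the /twiki/bin/ layout used in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /twiki/bin/search
Disallow: /*?
"""

def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# The stdlib parser matches rules as plain path prefixes, so only the
# /twiki/bin/search rule is really tested here; the wildcard rule is not.
search_allowed = parser.can_fetch("Googlebot", "/twiki/bin/search/TWiki/")
view_allowed = parser.can_fetch("Googlebot", "/twiki/bin/view/SomeWeb/SomeTopic")
print("search allowed:", search_allowed)
print("view allowed:", view_allowed)
```

The robots.txt file itself would go in the web server's document root; whether the parameter rule behaves as intended is best confirmed with the crawler's own testing tools.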
Setting state to "no action required".
- 11 Mar 2009