
Item6212: search CGI hangs when queried by Googlebot/2.1

Item Form Data

AppliesTo: Engine
Component: search
Priority: Normal
CurrentState: No Action Required
WaitingFor:
TargetRelease: n/a
ReleasedIn:


Detail

search CGI hangs when queried by Googlebot/2.1

I recently set up Nagios to monitor the CPU load and disk space on one of the TWiki webservers that I manage.

Interestingly, immediately after doing this, I started to get reports that the CPU load was hanging around 1.2 - 2.2 for hours at a time.

After some investigating, I discovered that the TWiki 'search' CGI was consuming all of the CPU, for example:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND                                                                    
25168 apache    25   0  146M 134M  1432 R    99.9 13.3   6:43   1 search                                                                     
28940 apache    25   0 36104  35M  1788 R    80.9  3.5   0:51   0 search             

Expecting the worst (somebody trying to hack TWiki?), I started digging through the logs, trying to find out what query was being issued to the search CGI to cause it to consume so much CPU, and for so long.

I eventually came across log entries from Googlebot/2.1 that looked like this:

66.249.71.54 - - [21/Nov/2008:10:31:33 -0500] "GET /twiki/bin/search/TWiki/?scope=topic&regex=on&bookview=on&search=%5C.* HTTP/1.1" 500 614 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
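
To find entries like this one without reading the log by hand, a short script can pull the search requests out of an Apache access log. The following is a minimal sketch, assuming the combined log format of the entry above; the names LOG_RE and parse_search_hit are made up for this illustration and are not part of TWiki or Apache.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: Apache "combined" log format, as in the entry quoted above.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_search_hit(line):
    """Return details of a request to the search script, or None (hypothetical helper)."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    parts = urlsplit(match.group("target"))
    if "/bin/search" not in parts.path:
        return None
    # parse_qs percent-decodes the values, so %5C.* becomes \.*
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": match.group("host"),
        "status": int(match.group("status")),
        "agent": match.group("agent"),
        "query": query,
    }


# The log entry from this report, used as sample input.
sample = (
    '66.249.71.54 - - [21/Nov/2008:10:31:33 -0500] '
    '"GET /twiki/bin/search/TWiki/?scope=topic&regex=on&bookview=on&search=%5C.* HTTP/1.1" '
    '500 614 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
hit = parse_search_hit(sample)
print(hit["status"], hit["query"])
```

Running parse_search_hit over every line of the access log and keeping the non-None results would list each client that called the search script, together with the decoded search pattern.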

I attempted to replicate this behavior, so first I killed any 'search' processes that were hanging around. Then, I visited a URL like this: http://mytwikisite.org/twiki/bin/search/TWiki/?scope=topic&regex=on&bookview=on&search=\.*

Sure enough, I saw the 'search' CGI using up 100% of the CPU. Of course, I'd expect 'search' to be a relatively CPU intensive process, but would not expect it to hang around for hours eating up all of the CPU.

The installation that I am experiencing this problem with is "TWiki Release 4.0.0 (Dakar), 01 Feb 2006"

An older TWiki web that I manage is (still) running "01-Sep-2004 Release (Cairo)". When I visit the search CGI in my web browser using the same query string, I actually get an "aggregated" page of everything that is contained in that TWiki web. This is returned relatively quickly, and doesn't cause search to hang around for hours using up all of the CPU.
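
A possible explanation for the "aggregated" page, offered as an assumption rather than something confirmed in the TWiki source: the pattern \.* means "zero or more literal dots", so it also matches the empty string and therefore matches every topic, and bookview=on then appears to request each matching topic in full. The sketch below uses Python's re module only to illustrate the matching rule; TWiki itself is written in Perl, and the topic names and texts here are invented.

```python
import re

# "\.*" = zero or more literal dots, so an empty match is always possible.
pattern = re.compile(r"\.*")

# Invented sample topics; the names and texts are placeholders.
topics = {
    "WebHome": "Welcome to the web",
    "EmptyTopic": "",
    "Dots": "...",
}

# A topic "matches" if the pattern is found anywhere in its text.
matching = [name for name, text in topics.items() if pattern.search(text)]
print(matching)
```

Every topic ends up in the result, including the empty one, which is consistent with the search returning the whole web.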

Another TWiki installation that I have is running "4.2 (Freetown), 22 Jan 2008". Interestingly, for whatever reason, I receive the message "TWiki detected an internal error - please check your TWiki logs and webserver logs for more information. Can't open access.log: Permission denied". So I checked my webserver logs and found "Can't open access.log: Permission denied at /a/twiki/lib/TWiki/Plugins/AccessStatsPlugin.pm line 318". Not sure why it wants to read access.log, but I went ahead and ran chmod 644 /var/log/apache2/access.log. Now I get a totally different error -- "Insecure dependency in eval while running with -T switch at /usr/lib/perl5/GD.pm line 95", which is a completely different problem related to ChartPlugin. I believe this is a known issue, but I have never followed up on it so I cannot test on my TWiki 4.2 deployment.

Is this some kind of "trick" that Googlebot uses when it encounters a TWiki site, in an attempt to "speed up" indexing by issuing a single request to search and then parsing the result to make sense of it all? (I really hope not!)

In any case, Google is not able to index the TWiki site that is having this problem. For example, if I search Google for "site:mytwikisite.org", I only get a single result, with no context. This is not desirable, as it is a public project page that we would certainly like to be indexed ...

-- TWiki:Main.ChristopherTracy - 11 Mar 2009

I thought this might be because of my Apache 2.0 installation, as per the comment in LocalLib.cfg:

# -------------- Only needed to work around an Apache 2.0 bug on Unix
# OPTIONAL
# If you are running TWiki on Apache 2.0 on Unix you might experience
# TWiki scripts hanging forever. This is a known Apache 2.0 bug. A fix is
# available at http://issues.apache.org/bugzilla/show_bug.cgi?id=22030.
# You are recommended to patch your Apache installation.

However, after uncommenting the workaround that redirects to /tmp/error.log, I found that nothing shows up at all in error.log. My Apache (2.0.40) does seem to be affected (when I try the test script they give at https://issues.apache.org/bugzilla/show_bug.cgi?id=22030), but I am not convinced that this is the problem.

When I bring up the URL that triggers the problem, the search process finally dies after about 26 minutes (maybe it runs out of memory after growing to over 400MB?).

My TWiki site is not particularly big -- there is about 9.6MB of stuff in /data/ and about 295MB in /pub/ ...

# while (true); do date; ps auxwww | grep search | grep -v grep; sleep 10; done
Wed Mar 11 13:22:47 EST 2009
apache   19582 92.5  3.2 35340 33832 ?       R    13:21   1:27 /usr/bin/perl -wT /var/www/virtual_servers/hpn/twiki/bin/search scope=topic\&regex=on\&bookview=on\&search=\\.\*
...
Wed Mar 11 13:38:21 EST 2009
apache   19582 52.2 40.3 417668 415528 ?     S    13:21   8:56 /usr/bin/perl -wT /var/www/virtual_servers/hpn/twiki/bin/search scope=topic\&regex=on\&bookview=on\&search=\\.\*
Wed Mar 11 13:38:31 EST 2009
apache   19582 51.6  0.0     0    0 ?        Z    13:21   8:56 [search <defunct>]

-- TWiki:Main.ChristopherTracy - 11 Mar 2009

Don't spend time on the search script; it is still there but deprecated. Search is now done with a %SEARCH embedded in TWiki.WebSearch.

Tell your GSA via the robots.txt file to ignore /twiki/bin/search and all URLs that have URL parameters. The only thing you need to index is /twiki/bin/view/SomeWeb/SomeTopic (without any URL parameters).
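
The robots.txt advice above could look roughly like the rule set embedded in the sketch below, which checks it with Python's standard urllib.robotparser. Assumptions: the /twiki/bin/ script path and the host name are taken from the URLs earlier in this report, TWiki/WebHome stands in for any topic, and the helper name allowed is hypothetical. That parser does plain prefix matching, so the sketch only covers the search script. Excluding every URL with parameters would need a wildcard rule (for example a line like "Disallow: /*?"), which Google's crawler is documented to understand but which, as far as I know, this parser does not evaluate.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the path prefix comes from the logged request.
ROBOTS_TXT = """\
User-agent: *
Disallow: /twiki/bin/search
"""


def allowed(url, agent="Googlebot"):
    """Return True if the rules above let `agent` fetch `url` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


search_url = (
    "http://mytwikisite.org/twiki/bin/search/TWiki/"
    "?scope=topic&regex=on&bookview=on&search=%5C.*"
)
view_url = "http://mytwikisite.org/twiki/bin/view/TWiki/WebHome"

print(allowed(search_url))
print(allowed(view_url))
```

With these rules the search URL from the log is refused while a plain view URL stays crawlable, which is the split the comment above recommends.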

-- TWiki:Main.PeterThoeny - 11 Mar 2009

Setting state to "no action required".

-- TWiki:Main.PeterThoeny - 11 Mar 2009

ItemTemplate
Summary search CGI hangs when queried by Googlebot/2.1
ReportedBy TWiki:Main.ChristopherTracy
Codebase 4.0.0
SVN Range TWiki-5.0.0, Mon, 23 Feb 2009, build 17838
AppliesTo Engine
Component search
Priority Normal
CurrentState No Action Required
WaitingFor

Checkins

TargetRelease n/a
ReleasedIn

Topic revision: r3 - 2009-03-11 - PeterThoeny
 