stix's load is currently running around 3 and that's mostly postgresql and index.cgi. It seems to occasionally get like this and requires a Pg & Apache restart.
OK, I'm sorry to have to do this but if I don't get a response to planning how to fix this load issue I'm going to take London Openguides offline this evening.
I'm looking for a plan with a deadline, not a solution by this evening.
Paul
On 6/20/07, Kake L Pugh kake@earth.li wrote:
On Tue 05 Jun 2007, Paul Makepeace paulm@paulm.com wrote:
OpenGuides London is frequently bursting 100% of a 3Ghz HT CPU which is causing stix (the server) load problems,
Did this issue get resolved or ameliorated, and if so what was the fix?
Kake
On 25/07/07, Paul Makepeace paulm@paulm.com wrote:
stix's load is currently running around 3 and that's mostly postgresql and index.cgi. It seems to occasionally get like this and requires a Pg & Apache restart.
OK, I'm sorry to have to do this but if I don't get a response to planning how to fix this load issue I'm going to take London Openguides offline this evening.
I'm sorry to hear that, Paul. I'm afraid the mechanics of why this is happening are beyond me. Can anyone on the list suggest methods I can use to investigate? However, my sysadmin skills are weak to non-existent, so please speak slowly and with small words and plenty of explanation (i.e. "use valgrind, man!!1!" won't be of much use to me).
It could be waves of spam, rogue crawlers gone mad or who knows what, but unfortunately I do not have access to any Apache log files on stix, so trying to find out anything that way is a non-starter.
If Paul is forced to take London.og offline, I will try and produce a low-overhead script that only reads and displays pages in a bland way so at least incoming links don't all break.
Earle
On Wed, 25 Jul 2007, Earle Martin wrote:
I'm sorry to hear that, Paul. I'm afraid the mechanics of why this is happening are beyond me. Can anyone on the list suggest methods I can use to investigate? However, my sysadmin skills are weak to non-existent, so please speak slowly and with small words and plenty of explanation (i.e. "use valgrind, man!!1!" won't be of much use to me).
a) Use the latest code (then you can use the spam blocking stuff, which may help). http://svn.randomness.org.uk/trunk/london.randomness.org.uk/scripts/lib/RGL/... is RGL's.
b) Provide us with a schema (no data) dump of the database, so we can check that the right indexes are in place (a sketch of the dump command is below).
c) OpenGuides is now mostly mod_perl safe. However, I think there might be an issue with closing Postgres DB connections, so mod_perl might be an option, although I think it will stop working once it runs out of connections and will then need a restart, so it may not be the answer.
d) If it's Google, tell it to back off a bit with the Google Webmaster Tools, or indeed use robots.txt to block search crawlers.
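For (b), something along these lines should produce a schema-only dump you can send round (the database name here is just a guess -- adjust it to whatever the guide actually uses):

    # assumption: the database is called openguides_london
    pg_dump --schema-only --no-owner openguides_london > og-london-schema.sql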
It could be waves of spam, rogue crawlers gone mad or who knows what, but unfortunately I do not have access to any Apache log files on stix, so trying to find out anything that way is a non-starter.
Indeed, the log files will help.
# chmod a+r /var/log/apache/*openguides*
What I'm looking for right now is a commitment to solve the problem, and a date by which that commitment is to be fulfilled. I'm sorry to put it in such strong language, but I have literally been asking for this for two, maybe three, years, and no dates have ever been forthcoming.
OG is at the point where it's interfering with other customers, and it's increasingly difficult for me to keep saying "well, I have asked them not to do that".
I'm looking for a response along the lines of,
"We will test and thrash a local instance of OG London with FastCGI and Apache Bench on our home boxen by date YYYY-MM-DD. At this point we'd like to set up a test instance on stix on date YYYY-MM-DD, and let it bake for a week. If that's OK, we'll move the install over on YYYY-MM-DD."
You'll notice the prevalence of YYYY-MM-DD, which thus far has been missing from any responses :-)
P
On 7/25/07, Earle Martin openguides@downlode.org wrote:
On 25/07/07, Paul Makepeace paulm@paulm.com wrote:
stix's load is currently running around 3 and that's mostly postgresql and index.cgi. It seems to occasionally get like this and requires a Pg & Apache restart.
OK, I'm sorry to have to do this but if I don't get a response to planning how to fix this load issue I'm going to take London Openguides offline this evening.
I'm sorry to hear that, Paul. I'm afraid the mechanics of why this is happening are beyond me. Can anyone on the list suggest methods I can use to investigate? However, my sysadmin skills are weak to non-existent, so please speak slowly and with small words and plenty of explanation (i.e. "use valgrind, man!!1!" won't be of much use to me).
It could be waves of spam, rogue crawlers gone mad or who knows what, but unfortunately I do not have access to any Apache log files on stix, so trying to find out anything that way is a non-starter.
If Paul is forced to take London.og offline, I will try and produce a low-overhead script that only reads and displays pages in a bland way so at least incoming links don't all break.
Earle
-- Earle Martin http://downlode.org/ http://purl.org/net/earlemartin/
On Wed, 25 Jul 2007, Paul Makepeace wrote:
I'm looking for a response along the lines of,
"We will test and thrash a local instance of OG London with FastCGI and Apache Bench on our home boxen by date YYYY-MM-DD. At this point we'd like to set up a test instance on stix on date YYYY-MM-DD, and let it bake for a week. If that's OK, we'll move the install over on YYYY-MM-DD."
In which case we need more info. Give us the specs of the box it's on, what the access numbers are like currently, and what versions of Postgres and Apache it's running.
Also, is there iowait on the box?
Is the database being vacuumed (just to make sure)? Has the database setup been tweaked?
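If it isn't being vacuumed already, the sort of thing I mean is a nightly cron job along these lines (the database name is just an example):

    # run nightly from cron, as a user with rights on the database
    vacuumdb --analyze --quiet openguides_london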
Not that I really care, but no one else seems to have these issues. Well, Boston may have done, but then crschmidt implemented mod_perl, memcached and better indexes, and indeed uses MySQL, I think.
I've certainly seen that some workloads don't cope very well on Postgres.
On Wed, Jul 25, 2007 at 01:25:52PM +0100, Bob Walker wrote:
Not that I really care, but no one else seems to have these issues. Well, Boston may have done, but then crschmidt implemented mod_perl, memcached and better indexes, and indeed uses MySQL, I think.
I've definitely seen problems of this type -- in fact, I still do.
The big things to look for, however, require more knowledge of the usage pattern, so if we can get that, great. If you can get hold of the Apache logs, the things to look for are accesses of pages like ?action=index with no index_type/index_value, or repeated accesses to large categories, like the 'Restaurants' or 'Bars' category.
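A rough sketch of the sort of log-grovelling I mean, assuming one of the /var/log/apache/*openguides* files Paul mentioned and a standard combined log format (the filename is a guess):

    # bare index requests (no index_type/index_value)
    grep 'action=index' /var/log/apache/openguides-access.log | grep -v 'index_type' | wc -l

    # top 10 client IPs by request count
    awk '{print $1}' /var/log/apache/openguides-access.log | sort | uniq -c | sort -rn | head -10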
If it's really a case of 'fix it or kill it', I would bet that looking at:
} elsif ($action eq 'index') {
    $guide->show_index(
        type   => $q->param("index_type") || "Full",
        value  => $q->param("index_value") || "",
        format => $format,
    );
}
and killing show_index (and making it instead return a 'sorry! You can't do that!' page) would significantly lower the load.
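Just to be concrete, the sort of short-circuit I mean is something like this (the wording and the bare-bones HTML are made up; a real version would probably go through the usual templates):

    } elsif ($action eq 'index') {
        # Temporarily refuse index requests instead of calling show_index,
        # since building the full index is the expensive part.
        print $q->header( -status => '503 Service Unavailable' );
        print "<html><body><p>Sorry! The index pages are switched off"
            . " while we sort out some load problems.</p></body></html>";
    }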
However, I can't guarantee that -- another possibility is that spiders are hammering the crap out of the site with lots of requests, in which case blocking spiders and working on that aspect of it might help.
As Bob mentioned, the indexes are Serious Business: it would help if a small section of query log from the site could be looked at, so that the queries can be 'explain analyze'd to check for sequential scans -- especially on the metadata table, which is likely > 1 million rows if it's anything like Boston -- and indexes added where needed.
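For example -- with the caveat that the column and index names below are guesses based on the generic Wiki::Toolkit-style schema, so check them against a real schema dump first:

    -- does a typical metadata lookup do a sequential scan?
    EXPLAIN ANALYZE
        SELECT node_id FROM metadata
        WHERE metadata_type = 'category' AND metadata_value = 'Restaurants';

    -- if so, an index along these lines should help
    CREATE INDEX metadata_type_value_idx ON metadata (metadata_type, metadata_value);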
I guess this is really directed towards Earle rather than towards Paul.
Regards,
On 25/07/07, Christopher Schmidt crschmidt@crschmidt.net wrote:
As Bob mentioned, the indexes are Serious Business: it would help if a small section of query log from the site could be looked at, so that the queries can be 'explain analyze'd to check for sequential scans -- especially on the metadata table, which is likely > 1 million rows if it's anything like Boston -- and indexes added where needed.
Well, London.og is back up, and every now and then, so is the load. I think the problem is definitely one to do with the database; as soon as a few wiki.cgi's start being run simultaneously, we start getting Postgres' "postmaster" process running continuously and taking up 95% of CPU.
Paul, could you grant me access to the Postgres log so I can follow Christopher's suggestion?
Thanks,
Earle.
On 28/07/07, Earle Martin openguides@downlode.org wrote:
Well, London.og is back up
Forgot to mention: I added logging to my spam filter. In the last two days it's blocked over 500 attempts at spam. \m/
On 28/07/07, Earle Martin openguides@downlode.org wrote:
Well, London.og is back up, and every now and then, so is the load.
That said, when I logged in five hours or so ago (just before writing that mail) there was a load spike going on, at up to 16(!). However, since then, everything's been quiet, and we've rarely passed 0.2. This leads me to think that the database isn't coping well when we have sudden heavy attacks of spam, and that I should probably re-implement the load cutoff that Paul suggested some time ago.
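For reference, the kind of load cutoff I mean is just a few lines at the very top of wiki.cgi, before anything expensive happens -- a rough, untested sketch (the threshold of 5 is arbitrary, and /proc/loadavg is Linux-specific):

    # Bail out early if the box is already struggling.
    if ( open my $loadavg_fh, '<', '/proc/loadavg' ) {
        my ($one_minute_load) = split ' ', <$loadavg_fh>;
        close $loadavg_fh;
        if ( $one_minute_load > 5 ) {
            print "Status: 503 Service Unavailable\r\n",
                  "Content-Type: text/plain\r\n\r\n",
                  "Sorry, the server is overloaded right now; please try again in a few minutes.\n";
            exit;
        }
    }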
Bob,
If you'd like to help here I can grant you root for a while and you can take a poke around. The load is up around 2.5 and there's no obvious (to me) cause. iostat -kx5 isn't showing anything dramatic. OG gets about a hit a second which doesn't seem like it ought to cause a problem but with cgi...
I shut off Pg and the load stayed the same. Taking apache-perl down it went to 0%us. Bringing apache-perl & Pg back up it's ok again. So it suggests something's getting wedged. Of course, it's possible it's not actually OpenGuides, but OG gets by far the most number of hits, and has the most index.cgi processes hanging around, and it is the only Pg user.
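For anyone curious, this is roughly how I'm spotting the stragglers (nothing clever, just ps):

    # long-lived index.cgi processes, with elapsed time and CPU
    ps -eo pid,etime,pcpu,args | grep '[i]ndex.cgi'

    # what the Postgres backends are up to
    ps -u postgres -o pid,etime,pcpu,args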
As an aside, if you have raidtools/mdadm skillz that might come in handy here too.
P
On 7/25/07, Bob Walker bob@randomness.org.uk wrote:
On Wed, 25 Jul 2007, Paul Makepeace wrote:
I'm looking for a response along the lines of,
"We will test and thrash a local instance of OG London with FastCGI and Apache Bench on our home boxen by date YYYY-MM-DD. At this point we'd like to set up a test instance on stix on date YYYY-MM-DD, and let it bake for a week. If that's OK, we'll move the install over on YYYY-MM-DD."
In which case we need more info. Give us the specs of the box it's on, what the access numbers are like currently, and what versions of Postgres and Apache it's running.
Also, is there iowait on the box?
Is the database being vacuumed (just to make sure)? Has the database setup been tweaked?
Not that I really care, but no one else seems to have these issues. Well, Boston may have done, but then crschmidt implemented mod_perl, memcached and better indexes, and indeed uses MySQL, I think.
I've certainly seen that some workloads don't cope very well on Postgres.
-- Bob Walker http://london.randomness.org.uk/ For great beery Justice!
On 25/07/07, Paul Makepeace paulm@paulm.com wrote:
I shut off Pg and the load stayed the same. Taking apache-perl down it went to 0%us. Bringing apache-perl & Pg back up it's ok again. So it suggests something's getting wedged. Of course, it's possible it's not actually OpenGuides, but OG gets by far the most number of hits, and has the most index.cgi processes hanging around, and it is the only Pg user.
Okay, a quick perusal of the latest logfile shows a couple of notable things.
1) Somebody at 80.68.93.162 has been slurping the whole site, indexes, pages, format varieties (rdf, raw) and all with WWW::Mechanize at the rate of one request per second.
2) Spam. Lots and lots of spam.
So what I'm going to do is this:
a) Shut down the site immediately and put it in "maintenance mode" (i.e. the "out of order" sign)
b) Bring it back up as read-only
c) Install a totally clean new version from the latest release, with spam prevention features
d) Try and find as many different kinds of spam as I can from the database and get them into the filters
e) Clean the database of spam and vacuum it
f) Bring the site back up - warning Paul first - and monitor load levels.
a) and b) are going to happen this afternoon. c) d) and e) hopefully within a day or two. f) as soon as e) is complete. If load *still* goes too high after the spamwall has been raised, it may be indicative of a deeper problem meriting heavy investigation, and I'll put the site back into maintenance mode again.
Incidental thought: since robots.txt doesn't allow wildcards, perhaps we should put 'rel="nofollow"' on all our links to resources useless to search engines. I believe the better-behaved robots should respect this.
Thanks,
Earle.
This one time, at band camp, Earle Martin wrote:
1) Somebody at 80.68.93.162 has been slurping the whole site, indexes, pages, format varieties (rdf, raw) and all with WWW::Mechanize at the rate of one request per second.
This is something we can fix! That IP == ivorw.vm.bytemark.co.uk. Ivor, care to back off slurping for a bit?
a) and b) are going to happen this afternoon. c) d) and e) hopefully within a day or two. f) as soon as e) is complete. If load *still* goes too high after the spamwall has been raised, it may be indicative of a deeper problem meriting heavy investigation, and I'll put the site back into maintenance mode again.
I think a rudimentary publishing mode would be helpful. The front page doesn't need to be built from the database every time, so having it published might be good. I suspect putting the latest changes on there causes a fair bit of pain.
Incidental thought: since robots.txt doesn't allow wildcards, perhaps we should put 'rel="nofollow"' on all our links to resources useless to search engines. I believe the better-behaved robots should respect this.
Indeed, this should most certainly be done for all revisions of a page except the current, so that if someone reverts spam without the admin password, it's not indexed by the crawlers.
PS: I strongly doubt it's Google causing problems. Google is a very well behaved bot. Others like the MSN one are much less well behaved.
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
Indeed, this should most certainly be done for all revisions of a page except the current, so that if someone reverts spam without the admin password, it's not indexed by the crawlers.
PS: I strongly doubt it's Google causing problems. Google is a very well behaved bot. Others like the MSN one are much less well behaved.
I'm not convinced of that.
Google routinely and regularly fetches *large* pages on the Open Guide to Boston that almost never change. Think Category Restaurant page -- 1MB page, changes maybe once a week, Google fetches it daily.
Granted, OG Boston is particularly poorly optimized for this because we use index_list in our category pages. The actual index_value, etc. mode in wiki.cgi is significantly more lightweight. (Bad decision on my part.) But you don't have to have someone fetching much data to hurt a site, and even if Google is only requesting things slowly, they can still exceed the return rate of the server.
Regards,
On Wed, Jul 25, 2007 at 09:03:46PM -0400, Christopher Schmidt wrote:
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
Indeed, this should most certainly be done for all revisions of a page except the current, so that if someone reverts spam without the admin password, it's not indexed by the crawlers.
PS: I strongly doubt it's Google causing problems. Google is a very well behaved bot. Others like the MSN one are much less well behaved.
I'm not convinced of that.
Google routinely and regularly fetches *large* pages on the Open Guide to Boston that almost never change. Think Category Restaurant page -- 1MB page, changes maybe once a week, Google fetches it daily.
Granted, OG Boston is particularly poorly optimized for this because we use index_list in our category pages. The actual index_value, etc. mode in wiki.cgi is significantly more lightweight. (Bad decision on my part.) But you don't have to have someone fetching much data to hurt a site, and even if Google is only requesting things slowly, they can still exceed the return rate of the server.
I believe this is what Sitemaps are for?
https://www.google.com/webmasters/tools/docs/en/about.html
Sorry not really following most of this conversation due to flooding...
Dominic.
On 7/26/07, Christopher Schmidt crschmidt@crschmidt.net wrote:
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
Indeed, this should most certainly be done for all revisions of a page except the current, so that if someone reverts spam without the admin password, it's not indexed by the crawlers.
PS: I strongly doubt it's Google causing problems. Google is a very well behaved bot. Others like the MSN one are much less well behaved.
I'm not convinced of that.
Google routinely and regularly fetches *large* pages on the Open Guide to Boston that almost never change. Think Category Restaurant page -- 1MB page, changes maybe once a week, Google fetches it daily.
I spoke with the crawler guys here, and your site changes more often than you seem to think. That page also has a high PageRank, which affects crawl frequency.
Your HTTP headers could help more: try using If-Modified-Since. Consider also a reverse caching proxy to reduce load. You can reduce the crawl frequency with the webmaster console if you still think it's too much.
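To illustrate the If-Modified-Since side of that, here's a bare-bones CGI sketch -- not OpenGuides code, and in the real thing the timestamp would come from the node's last-edited time in the database:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTTP::Date qw( str2time time2str );

    my $last_modified = time() - 3600;    # stand-in for the page's real mtime

    my $ims = $ENV{HTTP_IF_MODIFIED_SINCE};
    if ( defined $ims and ( str2time($ims) || 0 ) >= $last_modified ) {
        # The client's cached copy is still good: answer 304 with no body.
        print "Status: 304 Not Modified\r\n\r\n";
        exit;
    }

    print "Status: 200 OK\r\n";
    print "Last-Modified: ", time2str($last_modified), "\r\n";
    print "Content-Type: text/html\r\n\r\n";
    print "<html><body><p>Page body goes here.</p></body></html>\n";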
You could restructure the page to not be a megabyte too, of course ;-)
HTH, Paul (not speaking as a representative of his employer, just trying to help out)
Granted, OG Boston is particularly poorly optimized for this because we use index_list in our category pages. The actual index_value, etc. mode in wiki.cgi is significantly more lightweight. (Bad decision on my part.) But you don't have to have someone fetching much data to hurt a site, and even if Google is only requesting things slowly, they can still exceed the return rate of the server.
Regards,
Christopher Schmidt Web Developer
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
I think a rudimentary publishing mode would be helpful. The front page doesn't need to be built from the database every time, so having it published might be good. I suspect putting the latest changes on there causes a fair bit of pain.
That's certainly something that could be done just every few minutes, or whenever there's an edit. In fact, the same idea could be applied to every page - *all* user requests would get served static HTML, and those static pages would be regenerated automatically whenever the data behind them is changed. So that way, only searches and edits hit the database. On a normal wiki, those are a very small proportion of total requests.
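Even a crude interim version of that would help -- e.g. refetching the front page into a static file every few minutes and having Apache serve that to most visitors (the paths here are made up, and Apache would need a rewrite or alias pointing at the static copy):

    # run from cron every few minutes
    wget -q -O /var/www/og-static/index.html.tmp 'http://london.openguides.org/' \
      && mv /var/www/og-static/index.html.tmp /var/www/og-static/index.html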
A good halfway house might be to use something like Squid to proxy all incoming requests and to cache very common pages, such as the home page, and serve them from the cache instead of hitting the OG code at all. I'd be happy to help with configgering this.
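For what it's worth, the rough shape of the Squid config I have in mind, written from memory in Squid 2.6-style accelerator syntax -- hostnames, ports and cache times are all illustrative, so don't treat this as copy-and-paste ready:

    # Squid answers on port 80 and passes cache misses to Apache on 8080
    http_port 80 accel defaultsite=london.openguides.org
    cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=og

    acl og_site dstdomain london.openguides.org
    http_access allow og_site
    cache_peer_access og allow og_site

    # keep the front page for up to ten minutes even without cache headers
    refresh_pattern -i ^http://london\.openguides\.org/$ 10 20% 10
    refresh_pattern . 0 20% 4320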
This one time, at band camp, David Cantrell wrote:
A good halfway house might be to use something like Squid to proxy all incoming requests and to cache very common pages, such as the home page, and serve them from the cache instead of hitting the OG code at all. I'd be happy to help with configgering this.
I'm not a caching expert by any stretch but I don't see how this would work. Squid would ask the web server if the content has changed, and unless you're publishing out static pages OG will always return that it's changed. Otherwise it's gotta do a database hit to see if it's changed anyway, and on the homepage that means the expensive Recent Changes database hit.
Rev Simon Rumble wrote:
I'm not a caching expert by any stretch but I don't see how this would work. Squid would ask the web server if the content has changed, and unless you're publishing out static pages OG will always return that it's changed. Otherwise it's gotta do a database hit to see if it's changed anyway, and on the homepage that means the expensive Recent Changes database hit.
Like Simon, I'm interested to learn and discuss rather than being an expert!
Do proxying caches remember the Last-Modified: and Expires: headers when they fetch a page? Or do they always go back to the main site and ask it for the headers for the page? And if they ask for the headers, might that end up as two hits?
On Fri, Jul 27, 2007 at 07:29:24AM +0100, Andrew Black - lists wrote:
Do proxying caches remember the Last-Modified: and Expires: headers when they fetch a page? Or do they always go back to the main site and ask it for the headers for the page? And if they ask for the headers, might that end up as two hits?
This is why the If-Modified-Since request header exists.
On Fri, Jul 27, 2007 at 01:15:32AM +0100, Rev Simon Rumble wrote:
This one time, at band camp, David Cantrell wrote:
A good halfway house might be to use something like Squid to proxy all incoming requests and to cache very common pages, such as the home page, and serve them from the cache instead of hitting the OG code at all. I'd be happy to help with configgering this.
I'm not a caching expert by any stretch but I don't see how this would work. Squid would ask the web server if the content has changed, ...
I believe you can configure it to *always* serve stuff from the cache, with timeouts before it goes back to the webserver.
On 25/07/07, Paul Makepeace paulm@paulm.com wrote:
# chmod a+r /var/log/apache/*openguides*
Brilliant, thank you.
I am currently downloading the access log and will follow up shortly with an analysis and consequent plan of action.