Subject: Re: UKNM: Crashing Sites
From: Sean Phelan
Date: Sun, 5 Sep 1999 09:52:22 +0100

Aha!! A subject close to my heart.

My guess is that when an auction site is down for hours, it is undergoing
a database rebuild and integrity check. Auction sites have big, complex
relational databases, and when you use the standard tools to fix corrupt
databases it can take *hours*.

The only other thing that is really difficult to fix quickly is an outage
at the DNS farm responsible for your domain name (back-hoe through the
fibre, hurricane takes out the power for >12 hours, etc.). Even that can
be fixed in two or three hours if you are courageous, by moving to
another DNS provider.

I think the longest outage we've ever suffered at Multimap.com was when
a hurricane in the US took out one of MCI's facilities. We used to have
co-located servers in the US and the UK, but with DNS in the US. The
DNS farm was on three hosts, spread across two different tail circuits,
but both circuits came from MCI. The hurricane took out power in the
area; the local MCI POP went on to battery back-up and started its
generator, but the generator was out of fuel. It spluttered to a halt,
and when the batteries died, so did all MCI's T3 circuits in the area.
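
The trap in that setup (three hosts, two circuits, one carrier) can be
sketched in a few lines of Python. This is only an illustration: the host
and dependency names are made up, and which host sat on which circuit is
an assumption. It lists anything that every "redundant" host depends on.

```python
# Hypothetical model of the DNS farm described above. All names are
# illustrative; the host-to-circuit mapping is an assumption.

def shared_dependencies(hosts):
    """Return the dependencies common to every host.

    Anything in the result is a candidate single point of failure,
    however many hosts or circuits sit in front of it.
    """
    dep_sets = [set(deps) for deps in hosts.values()]
    return set.intersection(*dep_sets) if dep_sets else set()


dns_hosts = {
    "ns1": ["circuit-a", "carrier-mci", "pop-local"],
    "ns2": ["circuit-a", "carrier-mci", "pop-local"],
    "ns3": ["circuit-b", "carrier-mci", "pop-local"],
}

for dep in sorted(shared_dependencies(dns_hosts)):
    print("shared by all hosts:", dep)
```

Neither circuit shows up in the result, but the carrier and its local POP
do, which is roughly what bit us.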

The frustrating thing was that it wasn't our ISP's fault; I think our
decision to rely on the multiple resilience of the US ISP was defensible
at the time. Now we've moved everything to PSINet and co-locate in
Telehouse; it is a single point of failure, but if Telehouse goes down
then we all have bigger problems than maps. Our hardware is all
replicated, and we are moving to realtime load-balancing across
multiple front-end and back-end servers.

All that stuff - replicated servers, RAID disks, redundant circuits,
etc. - is kinda ho-hum and I can't imagine that the big auction sites
get this wrong.

My suspicion - and it is only speculation - is that the underlying cause
of big outages at high-profile sites is that they are growing so fast,
racing towards market dominance, that they have a risk-taking culture
and they give lots of power to inexperienced staff. This means that
they are not necessarily writing rock-solid code, and that their DB
administration may not be up to the standards of a bank or financial
institution. So an "operator error" by a DB admin or a program that
writes corrupt records to the database really can have catastrophic
results.

Cheers
Sean


>For the past couple of weeks I have been surfing auction sites quite a bit,
>for research, and the thing I have noticed more than anything is that they
>crash more than many other big consumer sites on the net. This could
>be pure coincidence.
>
>But my question is this: these companies are usually valued in the billions
>and they know they have millions of users. Their whole organisation relies
>on the site being up 100% of the time. It is possible to create a site that's
>up close to 100% of the time; it goes down once a month for 5 minutes, maybe,
>but that should be it (and even this is bad). So these web sites should be
>treated as business/mission critical to the organisation.
>
>What goes wrong? Does anyone have any experience of a huge site going
>down, and the reasons?
>
>cheers
>
>imano

=============================================================
Sean Phelan seanatmultimap [dot] com http://www.multimap.com
phone (within UK): 0171 433 0460 fax (UK): 0171 209 5194
phone (Int'l): +44 171 433 0460 fax: +44 171 209 5194