Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance, 2013-01-13 02:45 pm

Site outage over

Hi all,

The site outage is over. My apologies for the downtime.

One of our databases filled up its disk and went offline, and this caused the site to stop responding. We failed over to the backup database and everything is now back up and running.

Everything should be working. Please let us know if you see any trouble.

We will need to schedule a maintenance window soon to handle the full database and rebuild the cluster so we have a pair again. Stay tuned to this account to watch for announcements about that.



Some time last year we realized that our master database pair was filling up its disk, so as part of another downtime we were taking, we cleaned up the slave database and brought it down to around 40% disk usage -- well within comfort.

At the time, we couldn't clean up the master database without taking the site down again or extending the downtime even more, so we decided to wait. (Also, it's generally good to separate your maintenance work on pairs -- that way if you do something bad and don't notice it, the problem has time to surface before you touch the second machine.)

Anyway, the idea was that later we would take another downtime, switch the databases, and then clean up the second machine. That didn't happen though, and the result was that today that database finally ran out of disk space.
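
For anyone curious, a check along these lines is the sort of thing that can flag a filling disk before it takes a database offline. This is only a minimal sketch in Python, not the monitoring Dreamwidth actually runs: the function names and the 70% and 90% thresholds are assumptions chosen for illustration, and only the 40% "comfortable" figure comes from the cleanup described above.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent: float, warn_at: float = 70.0, critical_at: float = 90.0) -> str:
    """Map a usage percentage to a status label.

    The thresholds are hypothetical defaults, not values from the post.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # "/" is a placeholder; point this at the database's data directory.
    percent = disk_usage_percent("/")
    print(f"{percent:.1f}% used: {classify_usage(percent)}")
```

Run on both machines in a pair, a check like this would have shown one side sitting comfortably around 40% while the other kept climbing.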

[personal profile] lawless523 2013-01-13 11:00 pm (UTC)
I just tried posting a comment on someone else's journal and wasn't able to.

[staff profile] denise 2013-01-13 11:09 pm (UTC)
When you say "not able to", what happened?

[personal profile] lawless523 2013-01-13 11:14 pm (UTC)
I got a message from the web browser (Chrome) that the DW web address in question couldn't be accessed. This even though I'd been able to access my reading list and post an entry that I'd originally tried to post when the site was down using the same web browser.

[staff profile] denise 2013-01-13 11:18 pm (UTC)
Hm. That sounds like it could be a few different things. Start by restarting Chrome and try again? If that doesn't work, let me know the entry you're trying to reply to and I'll poke at it a bit more.

[personal profile] dil 2013-01-13 11:40 pm (UTC)
JFYI: During the downtime today I got a "connection reset while page was loading" error message in Firefox several times when I tried to access my Reading Page (/read).
It was not the "unable to connect" message which is displayed when the server does not respond.

After about a minute the server started responding, but returned error 404 until it was finally fixed.
Edited 2013-01-13 23:41 (UTC)

[staff profile] denise 2013-01-13 11:42 pm (UTC)
Yeah, I'm pretty sure different browsers respond differently to the type of downtime we were having. But it should all be fixed now.
(screened comment)

(frozen comment)

[staff profile] denise 2013-01-14 12:37 am (UTC)
Please don't yell at or shame somebody for reporting a problem! We'd much rather be told about something that we already know about than not be told about something that's a problem.

[personal profile] lawless523 2013-01-14 12:29 am (UTC)
It worked, although I can't tell whether that's due to the passage of time or restarting Chrome.

[staff profile] denise 2013-01-14 12:33 am (UTC)
*nod* My theory was that the browser had cached the timeout and a restart would fix it. Glad it did!