dw_maintenance | (no subject)

(no subject)

InsaneJournal has had a hardware failure that means their service is temporarily offline. To avoid sending traffic to them while they're down, we've temporarily disabled them as a crossposting and importing source. We'll re-enable them when they're back up.

Good luck to

squeaky with the recovery!

Flat | Top-Level Comments Only

Does Dreamwidth have a disaster recovery plan for such an event?

For example, redundant hardware to fail over to, either immediately (everything is running twice already, and one failure will just take out the redundancy until it's repaired/replaced), quickly (the fallback is running continually and synchronising but isn't "live" until it's activated when needed), or standing around in a corner waiting for last night's full backup to be loaded onto it?

I suppose those are all expensive to keep around 24/7, for the reasonably unlikely event of a catastrophic failure.

Or a war fund/insurance for purchasing replacement hardware?

Disclaimer: I am not the person who handles this stuff, so this is "to the best of my knowledge" and may be slightly out of date or inaccurate!

Pretty much everything is redundant! The webservers are multiply redundant (they go in and out all the time because of load, runaway processes, etc, and y'all never notice), and we have a failover database server that gets all site activity replicated to it automatically so it's ready and waiting to take over. (I can't remember if it's an automatic failover or if somebody would have to flip the switch to activate it. I think it's flip-the-switch-to-activate.) We also do daily offsite backups.

I won't get into the exact sequence of events that would have to happen for us to have permanent, total, and irrevocable data loss, because I don't want to tempt fate, but the way more likely "massive disaster" scenario for us would be a natural disaster that took out the datacenter our servers are hosted in, causing us to have to restore from offsite backup; in the absolute worst case scenario of that scenario (disaster happened right before the nightly offsite backups kicked off) we'd have lost about a day of data, and it would take us a few days to bring the site back up from backup. We could definitely have better failsafes for some of those "total disaster" scenarios, but it would be prohibitively expensive; we settle for "pretty good" risk mitigation instead of perfect.

We also keep an "in case of disaster" cash reserve of no less than six months' operating expenses (this has saved our bacon several times before!) and have a healthy line of credit in case of a sudden need to spend even more than that on really short notice.

Sounds pretty good. Thanks for the response!

(reply from suspended user)

Flat | Top-Level Comments Only