Life In 19x19 • We're Back! Extended Downtime Postmortem and Future Plans

We're Back! Extended Downtime Postmortem and Future Plans

Posted: Sat Feb 18, 2017 4:38 pm
by apetresc
Hey everyone! As you've all certainly noticed by now, Life in 19x19 was down for about 11 days. The good news is that we're now back and (hopefully) better than ever!


What Happened

Without going into too much technical detail, there was some sort of quota that was hit on the database server hosted by GoDaddy. Jordus (the admin who owned the actual hosting account) was having trouble getting timely action from GoDaddy, so not much progress was being made. After about a week, Brian Kirby and I offered to move everything to my AWS account, where we would have complete access to both the hardware and software, instead of being at the mercy of GoDaddy's tech support. At first we were scraping together old backups from Kirby's hard drive and what was left of the FTP server (which were over a year old! :( ), but eventually Jordus saved the day and gave us a raw dump of the database just prior to the outage. After some scripting, I had the full site up on an EC2 instance (for phpBB/Apache) and RDS instance (for MySQL).

The Future
Obviously, the length of this downtime re-emphasized the need to reduce our bus factor. There were a couple of us with admin access to the board, but only two who could directly connect to the database, only one who controlled the domain, and none of us had root on the physical hardware everything ran on (thanks GoDaddy). Going forward, we're going to be:
  • Posting the phpBB source, with all our theming and modifications, to a public GitHub organization, so anyone can clone it and help develop plugins and themes.
  • Scheduling automated daily backups of the DB. (This is actually already done as of today)
  • Making a sanitized (i.e., PMs and password hashes removed, etc.) copy of this daily backup publicly available, as Linus Torvalds himself recommends ;)
  • Giving a few more technically savvy admins access to the AWS account this is running on. More on this soon.
  • Upgrading phpBB. We're running on a downright ancient version (3.0.8, from November 2010!!). Once everything's calmed down, I've verified the backup procedures work, and the versioning to GitHub is complete, I'm going to run an upgrade to the latest phpBB 3.2. Security and performance are the main benefits. This should be complete by end of day Monday, February 20th.
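
For anyone wondering what "sanitized" could mean in practice, here is a minimal sketch, not the actual script used for the site. It assumes a stock phpBB 3.0 schema (the table names phpbb_privmsgs, phpbb_privmsgs_to and phpbb_users, and the columns user_password and user_email), and it uses Python's built-in sqlite3 as a stand-in for the real MySQL database. Blanking e-mail addresses is an assumption about what the "etc." would cover.

```python
import sqlite3

# Assumed table names, modelled on a stock phpBB 3.0 schema.
# Verify them against the real database before relying on this.
PM_TABLES = ("phpbb_privmsgs", "phpbb_privmsgs_to")


def sanitize(conn: sqlite3.Connection) -> None:
    """Strip private data, in place, from a COPY of the forum database."""
    cur = conn.cursor()
    for table in PM_TABLES:
        # Remove every private message and its recipient records.
        cur.execute(f"DELETE FROM {table}")
    # Blank out password hashes and e-mail addresses; keep public profile data.
    cur.execute("UPDATE phpbb_users SET user_password = '', user_email = ''")
    conn.commit()
```

The important design point is that this runs against a restored copy of the daily backup, never against the live database, and only the sanitized copy gets published.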

Are we missing anything? Are there any other points of failure the community wants to see plugged?

Known Issues
Pretty much every feature that existed before the downtime has been restored. The below are the issues that I'm aware of, but decided weren't urgent enough to delay launch any further:
  • The database backup we were operating on had some character encoding issues; as such, you may notice that some UTF-8 characters (e.g., accented names like Törmänen, or CJK ones like 古力) in usernames and topics are malformed. If that is the case, please contact an admin/mod and we will take care of it. I've fixed a couple of the glaring ones already, but I'm sure some have escaped me. NOTE: This should not affect actual post text, since that was binary-encoded.
  • The search index is being rebuilt overnight. Until then, search terms won't work for any topics older than today's.
  • For the next ~48 hours, your DNS settings may flip back and forth, leading you back to the old site (i.e., the error page). This will just solve itself within a day or two, as the new address reaches all the corners of the world.
If you notice any other problems not on this list, please reply to this thread with it!
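
For the curious, here is a rough sketch of how the kind of encoding damage described in the first item above typically arises and how it can sometimes be reversed. It assumes the common case of UTF-8 bytes being misread as Latin-1; the thread does not say which encodings were actually involved, and MySQL's "latin1" is really closer to Windows-1252, so treat this as an illustration only. The function names are made up for the example.

```python
def garble(text: str) -> str:
    """Reproduce the damage: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo that damage; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid UTF-8, so it was not garbled in this particular way.
        return text


if __name__ == "__main__":
    for name in ("Törmänen", "古力"):
        broken = garble(name)
        print(repr(name), "->", repr(broken), "->", repr(repair(broken)))
```

Because repair() falls back to returning its input, it is safe to run over names that were never damaged, which matters when only some rows in a table are affected.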


Good luck and have fun, everyone :D
-Adrian

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sat Feb 18, 2017 8:34 pm
by jeromie
Glad the site is back. There are both content and members here that would be sorely missed if the go community were to lose them.

I'm moderately tech-savvy; let me know if there's anything I can do to help.

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sat Feb 18, 2017 9:15 pm
by Kirby
I'd like to extend thanks to Adrian, once again, who offered to host us on his AWS instance, and who also set things up with a pretty quick turnaround.

He received the database backup yesterday, and we are up and running today!

With multiple people having access to the AWS account, along with a daily backup of the database publicly available, we should be able to avoid the problem we had this time around: we won't be bottlenecked in fixing the site if something unexpected fails.

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 2:43 am
by bayu
Thank you all for resolving this!

I've got one point to add to the list:
Hopefully there won't be a next time, but I'd appreciate it in case of long downtime if there was, after some delay of course, a more informative error message saying "give us a week or two" or pointing to senseis or reddit where there was some status update.

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 6:02 am
by ez4u
Many, many, many thanks to Adrian and Brian (Apetresc and Kirby)!!!
:clap: :clap: :clap: :clap: :clap: :clap:

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 6:37 am
by apetresc
Marcel Grünauer wrote:When logged in, my surname "Grünauer", shows up as "Grünauer" in the upper left and also in the username for this post.

Yup, that's the sort of encoding problem I was referring to in the "Known Issues" part. Thanks for pointing it out, I've fixed it now :)

bayu wrote:Hopefully there won't be a next time, but I'd appreciate it in case of long downtime if there was, after some delay of course, a more informative error message saying "give us a week or two" or pointing to senseis or reddit where there was some status update.

Yeah, for sure. Sometimes it's not possible to put an error message on the lifein19x19.com URL itself, depending on the nature of the problem, but at the very least we can post status updates on Reddit/SL.

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 8:28 am
by Gomoto
Thanks a lot, keep up your outstanding work!

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 9:01 am
by joellercoaster
*dances*

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 4:53 pm
by apetresc
Progress Report

  • Fixed a serious bug with the [sgf] tag when referencing an attachment rather than an inline SGF. The links had been generated, at that time, with the /forum path, but we no longer use that. So I added a rewrite rule, and they all work now.
  • We have a GitHub organization, and the forum code is up there.
  • The search index has been rebuilt, so your searches should now work!
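
To illustrate what the rewrite rule for the old links does, here is a small Python sketch of the same mapping. The real fix lives in the web server configuration, which isn't shown here; this only models its effect, assuming old links begin with /forum and the board now sits at the site root. The sample path in the comment is a hypothetical example.

```python
import re

# Assumption: pre-migration links start with "/forum" followed by "/" or
# the end of the path, and the board is now served from "/".
_OLD_PREFIX = re.compile(r"^/forum(?=/|$)")


def rewrite(path: str) -> str:
    """Map a pre-migration path to its current location."""
    new_path = _OLD_PREFIX.sub("", path, count=1)
    return new_path or "/"  # a bare "/forum" maps to the site root


if __name__ == "__main__":
    # Hypothetical attachment link in the old style.
    print(rewrite("/forum/download/file.php?id=1"))
```

The lookahead keeps unrelated paths that merely begin with the same letters (say, /forums) from being touched.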

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 5:53 pm
by Bonobo
Wow, SO happy to be here again.

HUGE THANKS to everybody involved

I had reloaded my “view unread posts” tab almost every hour, only to get “The host does not exist.”, but just a minute ago Schachus wrote on the German DGoB forum that the URL without the WWW, i.e. http://lifein19x19.com/, DOES work, while the URL with WWW doesn't.

So:
http://www.lifein19x19.com/ BAD
http://lifein19x19.com/ GOOD
Would be nice to have this resolved, too.

• Also, could we get https?

• And about the “Donate” button … does it point to the correct PayPal acct already?


Thanks folks, you’re cool!

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 6:10 pm
by apetresc
Bonobo wrote:I had reloaded my “view unread posts” tab almost every hour, only to get “The host does not exist.”, but just a minute ago Schachus wrote on the German DGoB forum that the URL without the WWW, i.e. http://lifein19x19.com/, DOES work, while the URL with WWW doesn't.

So:
http://www.lifein19x19.com/ BAD
http://lifein19x19.com/ GOOD
Would be nice to have this resolved, too.

Good catch. I've just added the DNS entry and VirtualHost for www.lifein19x19.com too, it should start working in a few hours, again as DNS propagates. Thanks! :)
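
If you want to check from your own machine whether both names resolve yet, something like the following would do. It's a sketch using Python's standard socket module; the helper name resolve_all and the injectable resolver argument are made up for this example (the latter just makes it easy to test without a network).

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_all(
    hosts: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Return {host: IPv4 address, or None if it does not resolve yet}."""
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


if __name__ == "__main__":
    for host, addr in resolve_all(["lifein19x19.com", "www.lifein19x19.com"]).items():
        print(host, "->", addr or "does not resolve (yet)")
```

Keep in mind that your operating system and ISP cache DNS answers, so a name that fails now may start resolving later without anything changing on the server side.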

Bonobo wrote:• Also, could we get https?

Yes! Now that Let's Encrypt is giving out free certificates, that's a possibility. I'll add that to the roadmap.

Bonobo wrote:• And about the “Donate” button … does it point to the correct PayPal acct already?

Nope, haven't sorted that part out at all yet. I guess I could remove the button for now.

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 6:27 pm
by Bonobo
apetresc wrote:[..]

I've just added the DNS entry and VirtualHost for http://www. [..]
Awesome :)

• [..] https?
[..] I'll add that to the roadmap.
:-)

• [..] “Donate” button [..]?
[..] I guess I could remove the button for now.
NOOOOOO, totally unacceptable, a new and valid button, please :twisted:

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 6:45 pm
by jeromie
Thanks for getting everything running again!

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Sun Feb 19, 2017 10:45 pm
by xiayun
Thanks so much for all the efforts to get the site back online!

Re: We're Back! Extended Downtime Postmortem and Future Plan

Posted: Mon Feb 20, 2017 3:17 am
by daal
I just want to add my appreciation that Jordus, Brian and Adrian had the will and technical wherewithal to turn a database dump back into L19 gold. Thanks for your time and work!