Service outage over the last days

andi

As some of you noticed, the dokuwiki.org website and all related subdomains were offline over the last few days.

Long story short: old server broke, we're back online on a new one.

If you'd like to hear the whole story, read on:

Thursday night we got an email from our server that one of the two hard disks had failed. We are using a RAID setup, so all data is stored synchronized on both disks and losing one disk shouldn't be a big problem. In fact, it has happened a few times over the last years.

The procedure is simple: I open a ticket at our hoster and ask them to replace the faulty disk. For that the server is shut down, the disk is replaced and the server is restarted. The RAID will then automatically recover by copying the data back onto the new disk. From that point on, all data is duplicated and safe again.

I started this procedure on Friday morning (luckily I don't work on Fridays). Unfortunately, after our provider reported that the disk had been replaced, the server was still not reachable.

I inquired about the status and they connected a "LARA", which is basically a network-connected monitor/keyboard thingy which lets me see what's on the screen (even when there's no network connection to the server itself). It uses a Java program that needs some security settings, so getting it to run took me a while. When it finally worked, I saw nothing.

After a bit more back and forth with the provider, they reset the video settings and I could finally see something.

Not good. The file system check had found some errors it could not recover from automatically. So I ran the file system check again and said yes to every repair suggestion it made. A reboot later and the server was online again.

It looked good for a few minutes, until I noticed more error messages about hard disk problems in the system logs. By that time it was early afternoon and I decided it was time to finally get some breakfast...

When I returned, the EXT4 filesystem had failed to write its journal (a crucial part to ensure integrity) and had remounted the file system read-only. Not good. Definitely not good.
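For readers who want to catch this kind of read-only remount early: on Linux the current mount options are listed in `/proc/mounts`. The following is only a minimal sketch, not what we actually ran on the server. The function name `read_only_mounts` and the sample device names are made up for illustration.

```python
def read_only_mounts(mounts_text, fs_types=("ext4",)):
    """Return mount points of the given filesystem types that are mounted read-only.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, filesystem type, comma-separated options, two numbers.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # "ro" appears as its own option when the filesystem is read-only
        if fs_type in fs_types and "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format (device names are invented)
sample = (
    "/dev/sda2 / ext4 ro,relatime 0 0\n"
    "/dev/sda1 /boot ext4 rw,relatime 0 0\n"
    "proc /proc proc rw,nosuid 0 0\n"
)
print(read_only_mounts(sample))
```

On a real system you would pass in the contents of `/proc/mounts` instead of the sample text and raise an alert whenever the returned list is not empty.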

It seemed like one of the disks was still faulty. The problem was: which one? The RAID had rebuilt by then and reported absolutely no errors. And SMART (a system to check various health indicators for modern hard disks) reported no problems at all.

I contacted the hoster again and they offered to take the server down again and run diagnostics on both disks. They did, but reported back that they had found no errors whatsoever.

They had restarted the server, but again it was not reachable. So I requested a LARA connection again. It was evening by then and Andrwe (one of our admins) joined me in diagnostics. We booted into a recovery system, checked RAID and SMART reports, found nothing noteworthy, ran a file system repair again and rebooted the server.

Again, a few minutes later the file system remounted read-only and the logs were full of errors. There was very obviously some hardware error.

Back to the hoster, who offered us three options:

* put our disks into a different server
** this might have worked if the RAID controller or something other than the disks was at fault
* give us a replacement server
** this would mean we would need to restore the server from backups
* run extensive hardware diagnostics for 10 to 14 hours
** this would take us offline without knowing if anything would come out of it and might still have required one of the above actions afterwards

After some discussion with Andrwe and Frank (our other admin) we decided to go with a fourth option:

* order a completely new (and better) server and copy the data from the read-only old server

We decided on an Intel® Core™ i7-6700 Quad-Core Skylake with 64GB of RAM and two 250GB SSDs (using a software RAID this time).

The server was provided within half an hour and we began with the basic setup Friday night.

On Saturday I worked on bringing the various services back online, with the forum and the wiki being the first. Most of the data could be copied over from the old server. However, some forum tables were corrupted and the wiki change log was lost. I restored those from a backup.

While doing this I also set up a dedicated DokuWiki just for the documentation of the server. This should help the admin team to better coordinate in the future.

Today the rest of the services followed, and Andrwe helped me with setting up the additional stuff like the firewall, monitoring and backups.
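Since the new server uses a software RAID, one thing monitoring can watch is `/proc/mdstat`, where the Linux md driver reports the state of each array. The sketch below is only an illustration under that assumption and not our actual monitoring setup. The function name `degraded_arrays` and the sample array and partition names are invented. It also only catches a degraded array; as this outage showed, a RAID can report no error while the hardware is still misbehaving.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member.

    In /proc/mdstat each array has a status like "[2/2] [UU]";
    an underscore (for example "[U_]") marks a missing or failed device.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)  # remember which array we are in
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample in /proc/mdstat format (names and sizes are invented)
sample_mdstat = """Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      242153472 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0]
      523712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""
print(degraded_arrays(sample_mdstat))
```

A check like this would be combined with other signals, such as the kernel log and the mount state, rather than trusted on its own.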

By now everything should be back to normal on a better, faster server (it's even a bit cheaper). And everything is better documented as well. So let's count that as a win :-)

PS: Thanks to all the people who sent me mails to alert me that the server was down :-) I highly recommend following the dokuwiki account on Twitter. This way you stay up to date with what's going on.

ach

Thank you, Andi, for doing all the hard work and extra hours!
I will buy you a beer or two the next time we see each other.
