Planned maintenance - Friday, August 23rd, 2019.

Status
Not open for further replies.

Sinnex

Donator
Joined
Oct 14, 2017
Messages
2,375
Location
IT
the fucked thing is that they didn't give any lead/notice time, so everyone's up to their neck right now
 

fr0st

SysAdmin
System Administrator
Joined
Jan 28, 2011
Messages
432
Location
Texas
8/23/2019 Maintenance Postmortem

So, why did the maintenance take so damn long? This is a postmortem/writeup about what went wrong. It's a long read.

Timeline (all times in CDT)
- 10:30 PM
-> Maintenance prep begins. Grabbing a list of all installed RPMs and PHP modules, and performing backups of non-essential data.
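For the record, that inventory step boils down to something like this (the output filenames are just illustrative):
    # dump a sorted list of every installed RPM and every loaded PHP module
    rpm -qa | sort > /root/rpm-list.txt
    php -m > /root/php-modules.txt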

- 11:00 PM
-> Maintenance begins. All services are turned off to ensure data consistency. Backups/packaging of all cPanel accounts are done (/scripts/pkgacct) and placed on our secondary drive
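For the curious, packaging every account is basically a loop over cPanel's pkgacct script, something along these lines (the destination path here is illustrative, not our real mount point):
    # package each cPanel account and drop the archive on the secondary drive
    for user in $(ls /var/cpanel/users); do
        /scripts/pkgacct "$user" /mnt/secondary/backups
    done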

- 11:50 PM
-> Backups finally complete. I initiate a 1:1 mirror from our primary drive to the secondary drive using rsync.
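The mirror itself is nothing fancy; it was roughly the standard full-system rsync (mount points here are placeholders):
    # 1:1 mirror of the primary drive onto the secondary, preserving permissions,
    # hardlinks, ACLs and xattrs, and deleting anything that no longer exists
    rsync -aHAXv --delete \
        --exclude={"/dev/*","/proc/*","/sys/*","/run/*","/tmp/*","/mnt/*"} \
        / /mnt/secondary/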

- 12:10 AM
-> Primary drive to secondary drive mirror/sync completes.

- 12:20 AM
-> The server says its last goodbye (https://i.imgur.com/LbGphgJ.png). I power it down.

- 12:23 AM
-> I log on to the IPMI console (a remote management utility for servers), load the CentOS 7 Minimal image, boot into the ISO/ATEN virtual CD, and hit "Install and test media"

I initially notice the ISO (which is 811MB in size) is transferring at 2Mbps, which is painfully slow. I figure the install may take two hours.

- 1:02 AM
-> After checking the IPMI console, I notice the install is hanging at "Started service Enabling Compression RAM with zRAM". I review the outgoing bandwidth on the TAP adapter on my system; it's 0.0 Kbps TX (transmit), so I conclude it has hung/frozen completely.

- 1:10 AM
-> After some thought, I tried a different CentOS image (NetInstall) which configures the adapter to connect to the internet and retrieve the rest of the files. It's a little smaller (500MB) than the Minimal ISO, which I figured might give us a better chance.

- 1:55 AM
-> Checking the console/bandwidth graph again, I notice the install is once again frozen, this time on a different line. I figure this must be due to the slow upload causing unexpected race conditions or service timeouts, which hang the install. The only way I'm ever going to get through the install is by installing from a location closer to the server. Something closer would have lower latency and faster upload, and could possibly fix the issue.

- 2:16 AM
-> I find a cloud hosting provider (Vultr) that can spin up Windows virtual machines in London for practically no cost. I spin up a VM with 4GB of RAM & 2 vCPUs, and install Java (required for IPMI) and Chrome. I notice the CPU and RAM are completely maxed out even when everything is closed. This is no good.

- 2:30 AM
-> After figuring out the initial resources for the VPS aren't good enough to install our server from, I upgrade it to 8GB of RAM & 4 vCPUs, which appears to resolve the slowness.

- 2:45 AM
-> Upon attempting to install OpenVPN Connect (a VPN client required to use our server's IPMI), the installer refuses to launch. Whenever I open it, nothing happens. Not even an antivirus warning. Weird. I try disabling Windows Defender on the London VPS, thinking it might be blocking it silently. But nope, it wasn't.

- 2:55 AM
-> I finally figure out that Windows itself wasn't trusting the MSI installer because of its security context ("This file came from another computer"). I fixed it by going to Properties, ticking "Unblock" at the bottom, and hitting Apply. And just like that, it installed fine! Finally, I'm on my way to getting this damned server installed... or so I thought.

- 3:20 AM
-> I finally get to open the IPMI console from the London VPS I spun up/configured. I load the CentOS NetInstall image, and to my surprise, it transfers at a whole 70Mbps! That's an amazing improvement over my measly 2Mbps from the U.S. I finally get the server to the install screen, and I thought all of my problems were over.

- 3:35 AM
-> After getting to the install screen, I notice that the mouse is acting oddly. I tried switching the IPMI mouse mode between Relative and Absolute, but it made no difference. Clicking on things was exceedingly annoying. Most of the time a single click wouldn't register, and it thought I was trying to drag the install UI around. Anyway, I work past this slowly and painfully, and get to the "Network setup" screen (which is required for the NetInstall), where I notice the keyboard isn't registering my input correctly.

I then thought I might be able to configure the network from a CLI/command line instead. I dropped straight into a shell from the install screen, but my keypresses weren't working correctly there either. They either weren't registering at all, or were printing odd characters and escape sequences. I even tried the remote/onscreen keyboard on IPMI, but no joy.

- 3:40 AM
-> Puzzled by the keyboard, I thought I might go into the BIOS to see if there were any settings I could check to control keyboard behavior. Unfortunately, the keyboard wasn't working correctly in the BIOS either. I decided to open a ticket with our host asking if they could provide any guidance on resolving this; they're more familiar with the hardware and its nuances/quirks than I am.

- 3:48 AM
-> The host replies to my ticket swiftly and tells me they've reset the IPMI/server (which I can't do myself, no privileges). I boot the server into the CentOS Minimal install; I figured I'd skip the NetInstall image, since the London VPS can upload the larger Minimal ISO quickly anyway. I discover the keyboard & mouse now work just fine, and I'm relieved. The install finally begins!

- 4:35 AM
-> The install finishes and prompts me to reboot, so I do that. I disconnect the CentOS ISO virtual media (since the install is done), and reboot.
Upon reboot, I discover that the OS isn't found under either the Legacy or UEFI boot options. I'm puzzled and annoyed. I check around in the BIOS some more and attempt to boot from both of my drives, but it doesn't work. Could the CentOS install really not have installed a bootloader/UEFI partition? I guess so...

- 5:00 AM
-> I decide to reinstall the server, but this time I boot the CentOS ISO in UEFI mode instead of legacy (booting the installer in legacy mode means it sets up BIOS/MBR boot rather than an EFI system partition, which would explain the firmware not finding the OS). I hit install.
At this point, I'm already tired and annoyed, but I still see a long road ahead. So while the install runs, I run to the convenience store, pick up some energy drinks and snacks, and come back.

- 5:30 AM
-> The install completes. I reboot the server, and finally the OS is found by the BIOS! I select it as the boot option, boot into it, and it works! Finally some light at the end of the tunnel.

- 5:35 AM
-> After the install, I'm able to log in to the new OS. I begin configuring the network (/etc/sysconfig/network-scripts/ifcfg*), which is the first step of the process. I fully configure it, set the hostname (/etc/hostname), and reboot.
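For anyone who hasn't done this by hand on CentOS 7, the ifcfg file looks roughly like this (the interface name and addresses below are placeholders, not our real ones):
    # /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=203.0.113.10
    PREFIX=24
    GATEWAY=203.0.113.1
    DNS1=1.1.1.1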

- 5:39 AM
-> I'm able to log in to the server via SSH from my own computer. Finally some progress, and now I have a TTY. This is my time to shine. I begin configuring and updating the server: I run system/RPM updates (via yum), install the 5.2 kernel (CentOS ships with 3.10, yuck), import the sysctl settings (custom buffer sizes, FQ, BBR, etc.), set up basic firewall rules & change the SSH port. Then I reboot the server.
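The sysctl part is the usual FQ + BBR setup plus bigger buffers; roughly this (the exact values below are examples, not our precise tuning). BBR needs a 4.9+ kernel, which is part of why the stock 3.10 had to go.
    # /etc/sysctl.d/99-network-tuning.conf
    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 67108864
    net.ipv4.tcp_wmem = 4096 65536 67108864
    # apply with: sysctl --system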

- 5:50 AM
-> After the reboot, I set up our drives in /etc/fstab for auto-mount and set noatime, which improves drive performance. All goes well and everything is there.
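An fstab entry with noatime looks like this (UUID and mount point are made up for the example); noatime just stops the kernel from writing an access-time update on every read:
    # /etc/fstab
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/secondary  ext4  defaults,noatime  0 2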

- 6:01 AM
-> After all of the basics are complete, I decide it's time to install DirectAdmin (our new control panel software). I run the DA setup script, and wait patiently while everything compiles and installs.
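The install itself is just DirectAdmin's setup script; this is roughly the documented procedure at the time (it asks for your license details along the way):
    # fetch and run the DirectAdmin installer
    wget https://www.directadmin.com/setup.sh
    chmod 755 setup.sh
    ./setup.sh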

- 6:37 AM
-> The DirectAdmin install completes. I look over all of the software it installed and the versions. I see MySQL 5.5 is installed; that's no good. I decide to upgrade to MariaDB 10.3 & change the password (the auto-generated one was weak). After that, I install multiple versions of PHP: one for production, one for legacy stuff.
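For reference, the MariaDB and PHP changes go through CustomBuild. From memory it was something like the below; exact option names can differ between CustomBuild versions, so treat this as a sketch rather than a copy/paste recipe:
    cd /usr/local/directadmin/custombuild
    # switch from MySQL to MariaDB 10.3 and rebuild it
    ./build set mysql_inst mariadb
    ./build set mariadb 10.3
    ./build update
    ./build mariadb
    # two PHP instances: one current, one for legacy scripts
    ./build set php1_release 7.3
    ./build set php2_release 5.6
    ./build php n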

- 7:10 AM
-> After I install the basic software/DA, I figure it's time to import the accounts. I located a handy script made by the DA developers to import cPanel generated backups. I installed the script, then had it start importing all of our accounts. I knew it was going to take at least an hour, so while it ran, I watched some Mindhunter, which is a great show, if you haven't seen it.

- 8:05 AM
-> The backups are imported! I have a look over everything, and I discover that none of the MySQL databases were imported at all, which puzzles me.
I review the import logs and see that the MySQL import was throwing an "Invalid password" error. It turns out that changing the MySQL password didn't update the one DirectAdmin stores; it needed updating in /usr/local/directadmin/conf/mysql.conf as well. After a facepalm and deleting the accounts/data, I started the import again.
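That file is just where DA keeps its own MySQL credentials; it's only a couple of lines, roughly like this (password made up, obviously):
    # /usr/local/directadmin/conf/mysql.conf
    user=da_admin
    passwd=the_new_password_here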

- 9:02 AM
-> The backups import successfully again, and this time the SQL data is there. Great! I take a look over the IPs that DA assigned to the sites and see they were assigned incorrectly. After a lot of annoyance and troubleshooting, I finally figure out how to assign multiple sites to the same IP.

- 9:20 AM
-> After getting everything configured on the IP/Apache side, Becca and I see the RC-RP site has some problems: blank pages on the forum, errors on the main site. I skim through the logs and figure out there are some PHP extensions missing. Turns out adding new extensions isn't as easy as I thought, so compiling libmemcached/imagick for PHP took me a little while.
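If you're ever in the same spot, the generic (non-DA-specific) way to get those two extensions is the dev packages plus pecl; something like the below. This is a sketch of the idea, not necessarily the exact route I took through CustomBuild:
    # build dependencies for the extensions
    yum install -y ImageMagick-devel libmemcached-devel zlib-devel
    # compile the PHP extensions (pecl/phpize come with the PHP build)
    pecl install imagick
    pecl install memcached
    # then enable them in the relevant php.ini:
    #   extension=imagick.so
    #   extension=memcached.so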

- 9:40 AM
-> After the sites look good, I begin to look at the SA-MP server. It refuses to start and crashes with "Segmentation fault", which is expected on a fresh install (the SA-MP server is a 32-bit binary, so a clean 64-bit OS doesn't have its libraries). I had to install glibc.i686, libz.i686, and openssl-devel. From there, it started up fine. Since everything loaded fine, I decided to keep the server up.
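In package terms that's roughly the following (note: on CentOS the 32-bit zlib package is actually named zlib.i686):
    # 32-bit runtime libraries for the SA-MP server binary
    yum install -y glibc.i686 zlib.i686 openssl-devel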

- 10:11 AM
-> Some people report they've lost their houses. We immediately take down the server and begin investigating.
While investigating, we discover property wipes were triggered even though the owners weren't actually inactive. Odd...

- 10:15 AM
-> I discover the server time is set to September 24th, not August 24th. After wanting to die for a solid minute, I begin investigating that.

- 10:20 AM
-> After investigating the time issue, I figure out the server got its time from the BIOS, not an NTP/time server. I'm guessing this happened during the BIOS madness or the IPMI/server reset. I power down the server, adjust the time, and power it back up.
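To stop this happening again, the usual fix on CentOS 7 is to make the OS sync from NTP instead of trusting the hardware clock, roughly:
    # install and enable chrony, then turn NTP sync on
    yum install -y chrony
    systemctl enable --now chronyd
    timedatectl set-ntp true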

- 10:26 AM
-> Upon power-up, I begin restoring a backup of both the forums & gameserver. The forums had posts in the future (Sept 24th) due to the time issue, so they had to be rolled back too.

- 10:50 AM
-> Backups are fully restored. The server comes back online, but remains locked while we check things over.

- 10:57 AM
-> Everything checks out, server is opened up.

- 11:30 AM
-> Miscellaneous things are fixed, like 2FA on the forum (another time sync issue), UCP stuff, and some of our scripts/automation get re-enabled.

- 12:23 PM
-> Python 3.6 & RudyBot dependencies are re-installed and RudyBot is started back up.

- 12:25 PM
-> Maintenance finally completes.


These are the events that led to the extended downtime/maintenance for the server, and my 13 hours of hell.

So, what have we learned? Technology can sometimes be unpredictable, unreliable, or outright bad. Pretty much everything that went wrong was outside of our control. There's no blueprint for handling situations like these; you just have to make adjustments and adapt as you go.

Hope you've enjoyed reading this, and here's to a bright future without cPanel in it.

❤️ fr0st
 