I often explain that connecting to a server requires much more than an Internet connection on both ends. A lot happens in the middle, and today provided a great example of how sometimes it’s the Internet connections in the middle that matter.
One of our servers had “issues” today. I filed a ticket with our data center saying that the server was “having a snit.” After repeatedly viewing evidence that the server had been running flawlessly all day, I suggested that the problem was geographically close to the server but not on the server itself. It turns out I was right, but this post is not about gloating (though it’s nice to be right). It’s about what I knew and how I knew it.
One of the sites that I personally manage sent a “server down” message. Website monitors are great, but sometimes they send a false positive (reporting a problem that doesn’t exist). As my next step, I tried to view the website, but CloudFlare reported the site as unreachable. That’s when my website problem became a logic puzzle.
The data center told me that the server was running and that there were no signs of a problem, now or earlier. I tried again over the next few hours with very mixed results, so I applied my systems and networking knowledge to the problem. In a nutshell, the server reported normal operation, but a handful of users and CloudFlare reported the server as down. How can both things be true?
Information “hops” across the Internet from one network to the next. If the server was up but some people could not reach it, something had to be broken along the path between them and the server. It also meant that the problem could be hard to track down from the data center, because they see the server as a local part of their network. They didn’t see a problem.
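The logic of that puzzle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observer labels and the diagnose function are hypothetical, not a real monitoring tool, and the "local" flag simply marks an observer that sits on the server's own network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    vantage: str      # who is looking (labels are illustrative only)
    local: bool       # True if the observer is on the server's own network
    reachable: bool   # True if this observer could reach the server


def diagnose(observations: List[Observation]) -> str:
    """Guess where a problem lies from who can and cannot reach the server."""
    local = [o.reachable for o in observations if o.local]
    remote = [o.reachable for o in observations if not o.local]

    # An observer next to the server cannot reach it: suspect the server side.
    if local and not all(local):
        return "server-side problem"
    # Every remote observer got through: the alert may be a false positive.
    if all(remote):
        return "no problem seen"
    # Remote failures with no local view: not enough information to decide.
    if not local:
        return "unknown: no local view"
    # Local view is healthy but remote views fail: look at the path between.
    return "problem in the middle"


reports = [
    Observation("data-center", local=True, reachable=True),
    Observation("cloudflare", local=False, reachable=False),
    Observation("user-a", local=False, reachable=False),
    Observation("user-b", local=False, reachable=True),
]
print(diagnose(reports))
```

With a healthy local view and failing remote views, the sketch points at the path in the middle rather than at the server, which is the same conclusion the reasoning above reaches.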
The solution required catching another outage. I received another failure notice. Immediately, I went to file a ticket with our data center and couldn’t. It was my smoking gun and the warm body, all gift-wrapped and delivered at my feet. Whatever plagued our server now plagued the data center. If they had some incentive to fix my problem earlier, they now had a critical situation in which less creative customers might just give up.
I called the technician on the phone. He traced the problem to one of several backbone providers that connect their servers to the broader Internet. It was my lucky day. Only one of our servers was connected through that backbone, and its traffic could be re-routed. I also got some satisfaction from providing evidence of a problem before it got any worse. As it turned out, that particular network was losing 60% of data. It wasn’t confirmed, but the technician suggested an active Distributed Denial of Service (DDoS) attack.
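A rough bit of arithmetic shows why a 60% loss rate produces “very mixed results” rather than a clean outage. The sketch below assumes each packet is lost independently and ignores retransmission, which real connections do attempt, so treat it as an illustration rather than a model of what actually happened. The function name is made up for this example.

```python
def chance_all_arrive(loss_rate: float, packets: int) -> float:
    """Chance that every one of `packets` packets survives the lossy link,
    assuming independent losses and no retransmission (a simplification)."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if packets < 0:
        raise ValueError("packets must not be negative")
    return (1.0 - loss_rate) ** packets


# Print the chance for a few exchange sizes at 60% loss.
for n in (1, 2, 4):
    print(n, "packets:", f"{chance_all_arrive(0.60, n):.2%}")
```

Under those assumptions, a single packet gets through 40% of the time, two in a row about 16% of the time, and four in a row under 3% of the time. Some requests squeak through while most fail, which looks like a server that is up one minute and down the next.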
The important thing was that we knew the server was up and running and the data center could take action to keep data flowing.