Debugging Kubernetes: Unable to connect to the server: EOF
Posted on April 25, 2020 • 3 minutes • 468 words
We received an EC2 instance retirement notice from AWS for our Kubernetes master node. I thought to myself: we can simply terminate it and launch a new instance. I've done it many times. It's no big deal.
However, this time, when our infra engineer did that, we were greeted with this error when trying to access the cluster:
Unable to connect to the server: EOF
All the apps were still fine, thanks to Kubernetes's design, so we had all the time we needed to fix this.
So kubectl is unable to connect to the Kubernetes API. The API endpoint is a CNAME to the API load balancer in Route53, so that's where we looked first.
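A quick way to see what kubectl is actually talking to is to pull the API server URL out of the kubeconfig and resolve it; the hostname below is a placeholder, not our real endpoint.
# show the API server endpoint kubectl is configured to use
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# resolve the hostname behind it (placeholder name)
dig +short api.cluster.example.com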
Route53 records are wrong
OK, there are many problems that can cause this error. One of the first things I noticed was that the Route53 DNS record for etcd was wrong: it still pointed to the old master's IP address. Could it be that the init script was somehow unable to update it?
Our first attempt at a fix was to manually update the DNS record for etcd to the new instance's IP address. Nope, the error was still the same.
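If you prefer the CLI to the console, an UPSERT against the hosted zone looks roughly like this; the zone ID, record name, and IP address are all placeholders, not values from our cluster.
# check the current records first (placeholder zone ID)
aws route53 list-resource-record-sets --hosted-zone-id Z1234567890ABC
# point the etcd record at the new master (placeholder name and IP)
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890ABC --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"etcd-a.internal.cluster.example.com","Type":"A","TTL":60,"ResourceRecords":[{"Value":"10.0.12.34"}]}}]}'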
ELB marks master node as OutOfService
We looked a little more into the ELB for the API server. The instance was marked OutOfService. I thought: this is it, it makes sense. But what could cause the API server to be down this time? We'd done this process many times before.
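You can check that health status without clicking through the console; for a classic ELB it's something like this, with a placeholder load balancer name.
# show instance health for the API load balancer (placeholder name)
aws elb describe-instance-health --load-balancer-name api-cluster-example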
We SSHed into the master instance and issued docker ps -a. There was nothing, zero containers whatsoever.
We checked systemctl and there it was: cloud-final.service had failed. We checked the logs with journalctl -u cloud-final.service.
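For reference, the inspection on the master looks roughly like this:
# list failed units, then dig into the one that failed
systemctl --failed
systemctl status cloud-final.service
journalctl -u cloud-final.service -e --no-pager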
We noticed from the logs that many required packages (ebtables, etc.) were missing when the nodeup script ran.
Manual apt update
So if we could fix that issue, everything should be OK, right? We issued apt update manually and saw this:
E: Release file for http://cloudfront.debian.net/debian/dists/jessie-backports/InRelease is expired (invalid since ...). Updates for this repository will not be applied.
OK, this still makes sense. Our cluster is old and the Release file has expired. If we manually update it, it should work again, right? We ran apt update with the Acquire::Check-Valid-Until option set to false.
apt-get -o Acquire::Check-Valid-Until=false update
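If you need anything else (such as nodeup itself) to run apt update with the same behavior, you could also persist the option in an apt config snippet instead of passing the flag each time; this is just a sketch, and the file name is arbitrary.
# persist the setting so a plain apt-get update also skips the Valid-Until check
echo 'Acquire::Check-Valid-Until "false";' | sudo tee /etc/apt/apt.conf.d/99check-valid-until
apt-get update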
Restart cloud-final service
Restart cloud-final.service, or manually run the nodeup script again with:
/var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8
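The restart route is just the usual systemd commands; checking the status afterwards tells you whether the unit completed this time.
systemctl restart cloud-final.service
systemctl status cloud-final.service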
At this point, docker ps -a should show all the containers running again. Wait for a while (about 30 seconds) and kubectl should be able to communicate with the API server again.
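A quick sanity check from your workstation:
# the EOF error should be gone and the node list should come back
kubectl get nodes
kubectl cluster-info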
Final
While your problem may not be exactly the same as this one, I thought I would share my debugging experience in case it helps someone out there.
In our case, the problem was fixed with just two commands, but the actual debugging process took more than an hour.