We had an EC2 instance retirement notice email from AWS. It was our Kubernetes master node. I thought to myself: we can simply just terminate and launch a new instance. I’ve done it many times. It’s no big deal.
However, this time, when our infra engineer did that, we were greeted with this error when trying to access our cluster.
Unable to connect to the server: EOF
All the apps are still fine. Thanks to Kubernetes’s design. We can have all the time we need to fix this.
kubectl is unable to connect to Kubernetes’s API. It’s a CNAME to API load balancer in Route53. That’s where we look first.
Route53 records are wrong
So ok. There are many problems which can cause this error. One of the first thing I notice is the Route53 DNS record for
etcd is not correct. It was the old master IP address. Could it be somehow the init script unable to update it?
So our first attempt to fix it was manually update the DNS record for
etcd to the new instance’s IP address. Nope, the error is still the same.
ELB marks master node as
We look a little bit more into the ELB for API server. The instance was masked
OutOfService. I thought this is it. It makes sense. But what could cause the API server to be down this time? We’ve done this process many times before.
We sshed into our master instance and issue
docker ps -a. There is nothing. Zero container whatsoever.
systemctl and there it is, the
cloud-final.service failed. We check the logs with
journalctl -u cloud-final.service.
We noticed from the logs that many required packages were missing like
ebtables, etc… when
nodeup script ran.
So if we can fix that issue, it should be ok right? We issue
apt update manually and saw this
E: Release file for http://cloudfront.debian.net/debian/dists/jessie-backports/InRelease is expired (invalid since ...). Updates for this repository will not be applied.
Ok, this still makes sense. Our cluster is old and the release file is expire. If we manually update it, it should work again right? We do apt update with
valid until flag set to
apt-get -o Acquire::Check-Valid-Until=false update
Restart cloud-final service
cloud-final.service or manually run the
nodeup script again with
/var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8
docker ps -a at this point should show all the containers are running again. Wait for awhile (30seconds) and
kubectl should be able to communicate with the API server again.
While your problem may not be exactly same as this, I thought I would just share my debugging experience in case it could help someone out there.
In our case, the problem was fixed with just 2 commands but the actual debugging process takes more than an hour.