A few weeks ago my UniFi switch at home died.

Not slowly, with no warning signs. It was perfectly fine one moment and perfectly dead the next.

Everything went down at once. The homelab servers, the access points, the cameras, even basic connectivity between machines on the network.

At work, we spend a lot of time talking about redundancy. Multiple switches, redundant power supplies, and so on. The gear also sits in climate-controlled rooms with proper power and cooling.

But at home, things are different. Let's just say my equipment budget is smaller at home, and in my case, everything depended on that single switch.

My home network supports quite a few things:

  • Several hypervisors hosting VMs
  • A couple of wireless access points
  • Security cameras
  • IoT devices
  • Streaming devices like Roku and Apple TV
  • A few smaller switches connected to the main switch

When the switch went down, the entire network disappeared.

The fix wasn’t complicated. I plugged everything that needed network connectivity into a small 5-port switch. That brought some systems back online, but I quickly ran out of ports, and devices that required PoE (cameras, access points, and so on) were still offline.

At that point the network was only partially alive. Some services were back online, but wireless was still down.

Thankfully I had a small 8-port PoE switch in the living room that already had an access point plugged into it. After moving a few cables around, I was able to get wireless connectivity back up. Not ideal, but good enough for the moment.

It was a good reminder of something I think about often in infrastructure design: there’s always a single point of failure somewhere.

In a homelab, I’m generally okay with that. The cost and complexity of building a fully redundant network at home probably doesn't make much sense.

But it did raise a few questions:

How do I fix this without breaking the bank?

I kept the solution simple.

I didn't go out and buy the latest 10-gig switch. I replaced it with a similar switch, mainly because I only have a 1 Gbps connection at home anyway. I’m not doing video editing or moving massive datasets locally, so there wasn’t much benefit to upgrading.

What do I do when this happens again?

There are a few things in place now that should make recovery quicker next time.

I keep a cheaper non-PoE switch sitting below the main switch in the rack that can be powered on if the primary one fails. There are also a couple of PoE injectors plugged into that backup switch so I can quickly bring cameras or access points back online if needed.

It’s not true redundancy, but it’s a simple way to reduce downtime.
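
To take the guesswork out of a failover like this, it helps to have a quick way to see which critical devices are back online after moving cables. Here's a minimal sketch in Python; it assumes Linux-style ping flags, and the device names and addresses are placeholders, not my actual network:

```python
#!/usr/bin/env python3
"""Quick reachability check for critical homelab devices."""
import subprocess

# Placeholder addresses; substitute your own hypervisors, APs, cameras, etc.
CRITICAL_DEVICES = {
    "hypervisor-1": "192.168.1.10",
    "access-point": "192.168.1.20",
    "camera-front": "192.168.1.30",
}


def is_reachable(ip: str, timeout_s: int = 2) -> bool:
    """Send a single ping (Linux flags) and report whether the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for name, ip in CRITICAL_DEVICES.items():
        status = "UP" if is_reachable(ip) else "DOWN"
        print(f"{name:<15} {ip:<16} {status}")
```

Running something like this after plugging devices into the backup switch gives an instant list of what's up and what still needs attention.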

Maybe the network core at home shouldn't be a single switch?

I have to be honest: I don't have a solution for this yet.

Right now it’s more of a budget problem than a systems problem. There are other priorities that come before expanding the homelab, like family. But at least I now have something in place that can take over quickly if another failure happens.

Sometimes that’s good enough for a homelab.

And that budget constraint isn’t necessarily a bad thing. In fact, it forces me to think about ways to help budget-conscious organizations stay focused on their mission.

Because as I was reminded recently, every single point of failure will eventually fail.

Lessons Learned

A few things this reminded me of:

  • My homelab could benefit from having basic recovery options
  • A cheap backup switch and a couple of PoE injectors can dramatically reduce downtime

You don’t need full redundancy to improve resiliency.

Sometimes simple contingency planning is enough.