I wrote about backups before: https://www.familie-kleinman.nl/brain/index.php/2025/04/12/backup-stories-what-was-needed-to-get-to-never-again/
Backups are the last line of defense. If everything else fails, you fall back on them. But that raises the question: what’s the first line of defense, and what can you build in between?
Having your data on multiple hard drives, in RAID, is such a defense. One drive can fail and the data is still safe. But what happens when the server that houses the RAID array crashes? The data might still exist, but it’s not available to the network at that point.
That’s one of the key points: availability.
Proxmox
In my case, the first line was a single Proxmox node — reliable, but still a single point of failure. It only takes one power hiccup, hardware issue, or accidental press of the power button (yes, that happened recently) to take everything offline.
So while backups protect the data, and RAID protects against disk failure, neither solves the problem of service uptime. If the host is down, every VM and container it runs is down as well.
That’s where High Availability comes in.
Highly available what?!
High Availability (HA) means removing the single point of failure. Instead of one server, you need multiple servers — the exact number depends on the implementation.
In Proxmox this is done with Corosync, which requires at least three servers to maintain something called quorum.
Quorum?
Yes, quorum. Basically, every node talks to every other node:
“Hey, I’m here. I’ve got this data. I’m okay.”
If one node crashes, dies, suffers a hardware failure, or is down for maintenance, the others keep asking:
“Hey, you still there? Hey?”
After a short timeout, one of the remaining nodes is elected to take over the missing workloads. Depending on the configuration, the load is either spread across the remaining machines or placed on just one of them.
Once the failed server comes back online, quorum is fully restored and the missing VMs are synced back into the cluster.
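To make that voting a bit more concrete, here’s a tiny Python sketch of the majority rule. It’s nothing more than an illustration of the idea; Corosync obviously does a lot more under the hood:

```python
# Minimal sketch of majority-based quorum, purely to illustrate the idea.
# This is NOT how Corosync is implemented internally.

def quorum(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1

def has_quorum(nodes_online: int, total_nodes: int) -> bool:
    return nodes_online >= quorum(total_nodes)

for total in (2, 3):
    for online in range(total, 0, -1):
        print(f"{online}/{total} nodes online -> quorum: {has_quorum(online, total)}")

# Output:
# 2/2 nodes online -> quorum: True
# 1/2 nodes online -> quorum: False   <- a 2-node cluster cannot survive any failure
# 3/3 nodes online -> quorum: True
# 2/3 nodes online -> quorum: True    <- a 3-node cluster survives one failed node
# 1/3 nodes online -> quorum: False
```

That’s why three servers is the practical minimum: with only two, losing one node means losing quorum as well.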
Corosync only takes care of the control plane — deciding which node should run which VM or container.
The next question is: what about the data?
Data
If the VM disks only live on the failed server, the other nodes can’t simply run them. This is where distributed storage comes in.
I can hear you thinking: “Why not put all VM disks on a NAS? Then, when one server goes down, the data is still available and another node can take over.”
Yes, that works, but it’s still not truly highly available. Because what happens when the NAS itself is rebooted or fails?
And that’s where Ceph comes in.
Ceph
Ceph is an open-source, software-defined storage platform that spreads data across multiple nodes. It’s designed from the ground up for scalability and high availability, and the beauty is: it runs on almost any hardware.
Instead of one central box, Ceph replicates data automatically across the cluster. If a disk or even a full server fails, the system heals itself — no single hub, no single point of failure.
Ceph is like having a few close friends who all keep a copy of your photo album. If one of them goes on holiday, you still have the full album at hand. And when that friend returns, they simply sync the missing photos that were added while they were away.
So in a sense, Ceph can be compared to a kind of RAID-5(ish) solution. The difference is that instead of spreading data across disks in a single server, Ceph distributes it across disks in multiple servers. That way, the cluster as a whole acts like one big, resilient storage pool.
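To get a feel for what that replication means in practice, here’s a rough back-of-the-envelope sketch in Python. The layout (three nodes with one or two 3.84 TB drives each) is just my planned setup, and the 3-way replication is Ceph’s default; real pools lose a bit more to metadata and safety margins:

```python
# Rough back-of-the-envelope numbers for a 3-way replicated Ceph pool.
# Assumes my planned layout (3 nodes, 3.84 TB NVMe drives); real-world
# overhead (metadata, the "nearfull" safety margin) eats a bit more.

REPLICA_COUNT = 3          # Ceph default: every object is stored 3 times
DRIVE_TB      = 3.84       # capacity per NVMe drive
NODES         = 3

for drives_per_node in (1, 2):
    raw_tb    = NODES * drives_per_node * DRIVE_TB
    usable_tb = raw_tb / REPLICA_COUNT
    print(f"{drives_per_node} drive(s)/node: {raw_tb:.2f} TB raw -> ~{usable_tb:.2f} TB usable")

# 1 drive(s)/node: 11.52 TB raw -> ~3.84 TB usable
# 2 drive(s)/node: 23.04 TB raw -> ~7.68 TB usable

# The same factor works against the SSDs: every byte a VM writes
# ends up being written REPLICA_COUNT times somewhere in the cluster.
```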
Networking
Let’s start with the network; I’ll get to the hardware in a minute.
Because all data between nodes needs to stay in sync, Ceph requires a fast network. On top of that there’s Corosync, where all nodes constantly talk to each other about their health. And of course, there’s the network I actually use to interact with my VMs and services.
So, to boil it down, there are three critical networking channels in a Proxmox HA cluster:
- Ceph → storage replication traffic
- Corosync → cluster health and quorum
- Management / Data → the “normal” traffic: VM access, backups, and admin work
The storage replication traffic is heavy and needs high bandwidth. Corosync, on the other hand, needs a stable link with very low latency between nodes.
And here’s the catch: those two requirements can interfere with each other.
Imagine Ceph is busy replicating data and completely saturates the network link…
Corosync: “Hey, you still there? Hey?” Other nodes: “…No answer… Crap, that node is down! Shit, we need to fail over!”
Of course, the node isn’t actually down — it just can’t get a word in because Ceph is hogging the bandwidth.
That’s why it’s best practice to separate Ceph traffic from Corosync traffic, either with dedicated NICs, VLANs, or even a tiny separate switch for Corosync.
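To put some numbers on “hogging the bandwidth”, here’s a quick, simplified comparison of link speeds versus what a single NVMe drive can push. The ~3 GB/s figure is a rough assumption for an enterprise NVMe under heavy replication or recovery, not a measured value:

```python
# Quick, simplified comparison of network line rates vs. NVMe throughput.
# Figures are rough and ignore protocol overhead; the point is the order
# of magnitude, not exact numbers.

def gbit_to_mbytes(gbit_per_s: float) -> float:
    return gbit_per_s * 1000 / 8   # Gbit/s -> MB/s (decimal)

links = {"1 GbE": 1, "10 GbE SFP+": 10}
nvme_write_mb_s = 3000             # assumed: ~3 GB/s sustained from one enterprise NVMe

for name, gbit in links.items():
    cap = gbit_to_mbytes(gbit)
    print(f"{name}: ~{cap:.0f} MB/s -> one busy NVMe ({nvme_write_mb_s} MB/s) "
          f"needs {nvme_write_mb_s / cap:.1f}x that link")

# 1 GbE: ~125 MB/s -> one busy NVMe (3000 MB/s) needs 24.0x that link
# 10 GbE SFP+: ~1250 MB/s -> one busy NVMe (3000 MB/s) needs 2.4x that link
```

In other words: during a recovery or backfill, Ceph can fill even a 10G pipe, and a Corosync heartbeat stuck behind that traffic is exactly the false-failover scenario from above.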
My solution (for now)
I’ve separated the traffic into three networks, each on its own port:
- Ceph → isolated on a 10G SFP+ connection, running over a dedicated VLAN on my 10G switch.
- Corosync → connected through a simple, cheap 1Gbit switch that is completely separated from the rest of my network.
- Data → a second 10G SFP+ connection on the same 10G switch, also connected to my uplink and my NAS.
Now, that 1Gbit Corosync switch is itself a single point of failure. To cover that, I’ll configure Corosync on both networks 2 and 3 (the Corosync network and the data network). If the Corosync switch fails, the 10G data network will automatically take over cluster communication.
And yes — technically the 10G switch itself is also a single point of failure…
but for now, that’s a problem I don’t have a solution for (yet).
Hardware
And now… the hardware!
I already have node 1 (i9-9900T) and node 2 (i5-12400) up and running. The 9900T is my current production machine, while the i5 serves as my Proxmox testbed at the office.
The missing piece is node 3, which will be built similarly to node 2. Once that’s done, I’ll rebuild nodes 1 and 2 into new cases with proper power supplies, so I end up with three identical machines neatly lined up.
Because yes — esthetics matter! 😛
Networking
For high-speed links I’ll try to find some cheap Intel X710-DA2 cards. They come with dual SFP+ ports, are well supported in Proxmox, and consume less power compared to a classic CAT6a solution (10GBASE-T copper).
For the Corosync network on nodes 2 and 3 I’ll just use the onboard NIC. On node 1, however, the onboard I219-M NIC sucks monkey balls in Proxmox, so I’ll re-use my trusty Intel I350-T2 card for that job.
Storage
And now for the painful part…
Ceph storage.
Ceph needs multiple fast and reliable drives across all nodes, which adds up quickly in both cost and capacity planning. I’m looking at enterprise-grade NVMe drives like the Samsung PM9A3, Intel D7-5520, or Kioxia CD8 to handle the heavy write load and replication that Ceph throws at them.
Hard drives are simply not an option. Ceph does a lot of random reads and writes, and if there is one thing hard drives suck at, it’s random I/O.
For the same reason, consumer-grade SSDs are also risky. While they’re faster than spinning disks, their write endurance simply isn’t up to the task. Every byte written by a VM to Ceph is replicated to the other servers — with extra overhead. That multiplies the stress on the drives, which is exactly why enterprise NVMe is the only real choice here. I’ll dive into that in more detail later.
So yes — storage is by far the most painful part of this build.
Fast, reliable, enterprise-grade NVMe drives don’t come cheap, and Ceph needs multiple of them across all nodes.
Hey Samsung, Intel, or Kioxia (or any other supplier for that matter) — if you happen to read this, I could really use a storage sponsor 😅
And just how expensive?
- Samsung PM9A3, 3.84 TB: €475
- Kioxia CD8-R, 3.84 TB: €608
- Intel D7-5520, 3.84 TB: €501
See the pattern?
For reference:
- Samsung 870 EVO, consumer SATA: €282
- Samsung 990 Pro, consumer NVMe: €289
(Prices taken from https://pricewatch.tweakers.net)
TBW — And why does it matter?
The Samsung 870 EVO has a TBW (Terabytes Written) rating of 2400 TB. That means Samsung guarantees the drive can handle at least 2400 terabytes written before it wears out.
2400 TB sounds like a lot. For a normal desktop or even a heavy workstation, that’s more than enough to last decades.
But for Ceph, things are different. My server currently writes between 150 and 250 GB of data per day. With Ceph replication to all nodes, that is amplified to around 825–950 GB per day across the cluster.
Do the math:
2400 TB ÷ 0.95 TB/day ≈ 2500 days.
That’s still almost 7 years of lifetime — which doesn’t sound too bad at all. But workloads tend to grow, and so will the daily write volume.
Now take the Samsung PM9A3: with 3.84 TB of storage it has a TBW of 7008 TB. That’s almost three times higher. At the same daily write load, it would last a lot longer.
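The same math in a small script, so you can plug in your own numbers. The TBW ratings are the vendor figures quoted above, the daily write volume is my own measurement, and like the calculation above it simplifies things by assuming the full daily volume lands on a single drive:

```python
# Endurance estimate: how long until a drive's TBW rating is used up
# at a given daily write volume. TBW figures are the vendor ratings
# quoted above; the daily volume is my own measurement. Simplification:
# the full daily volume is assumed to hit a single drive.

DAILY_WRITES_TB = 0.95          # ~950 GB/day across the cluster (after replication)

drives = {
    "Samsung 870 EVO (consumer SATA)": 2400,       # TBW in TB
    "Samsung PM9A3 3.84TB (enterprise NVMe)": 7008,
}

for name, tbw in drives.items():
    days = tbw / DAILY_WRITES_TB
    print(f"{name}: {days:,.0f} days (~{days / 365:.1f} years)")

# Samsung 870 EVO (consumer SATA): 2,526 days (~6.9 years)
# Samsung PM9A3 3.84TB (enterprise NVMe): 7,377 days (~20.2 years)
```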
And that’s exactly why enterprise NVMe with higher TBW and better endurance starts to make sense. Not to mention the extra enterprise features: things like power-loss protection, consistent steady-state performance, and firmware built for 24/7 datacenter workloads.
Cost
The issue is simple:
I don’t need just one or two drives. I need at least three (one per server) — and preferably six, so each node has two drives to spread the load.
See the problem?
6 × €475 = €2850… YIKES! 😅
So once again: Hey Samsung, Intel, or Kioxia (or any other supplier for that matter) — if you happen to read this, I could really use a storage sponsor.
Power consumption
Yeah, good question — and definitely a hot topic.
With three nodes and a 10G switch, power usage will go up. I’m aiming for around 25–30 W idle per machine, plus roughly 20 W for networking. So yes, the electricity bill will rise.
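To put a rough number on that rise, here’s a quick estimate. The €0.30 per kWh is purely a placeholder assumption; plug in your own tariff:

```python
# Rough yearly cost estimate for the cluster's idle power draw.
# The 0.30 EUR/kWh price is an assumption / placeholder; use your own tariff.

NODES             = 3
WATTS_PER_NODE    = 30      # upper end of the 25-30 W idle target
WATTS_NETWORKING  = 20      # 10G switch + small Corosync switch
PRICE_PER_KWH_EUR = 0.30    # assumption

total_watts  = NODES * WATTS_PER_NODE + WATTS_NETWORKING
kwh_per_year = total_watts * 24 * 365 / 1000
print(f"~{total_watts} W continuous -> {kwh_per_year:.0f} kWh/year "
      f"-> ~EUR {kwh_per_year * PRICE_PER_KWH_EUR:.0f}/year")

# ~110 W continuous -> 964 kWh/year -> ~EUR 289/year
```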
But here’s the thing: a server that’s down or crashed is far worse. A little extra power is a small price to pay for reliability and uptime.
As an example: when our server once crashed due to a botched Windows update (thank you, Microsoft), it took me a few days to get everything back up and running.
How valuable is that lost time?
Let me tell you: the electricity is cheaper.
Next steps
The journey isn’t finished yet. The next steps are:
- Selecting and preparing all the hardware
- Finding a solution for the switch (single point of failure)
- Building node 3
- Upgrading my NAS to 10G
- And most importantly… creating room for the servers 😅
More soon!
Brain, again playing the Sysadmin, out!