⛵ flatops
A GitOps-managed Kubernetes homelab cluster running on Talos Linux.
📋 Overview
This repository contains the declarative configuration for kantai, a bare-metal Kubernetes cluster. The cluster is designed for home infrastructure workloads with a focus on:
GitOps-driven operations via FluxCD
Advanced networking with Cilium, Envoy Gateway, external-dns, Cloudflare, and cert-manager
Distributed storage using Rook-Ceph
GPU workloads with NVIDIA GPU Operator
Comprehensive observability using VictoriaMetrics and Grafana
Continuous integration via Renovate
🏗️ Cluster Architecture
Nodes
| Node | Role | Hardware |
| --- | --- | --- |
| kantai1 | Hyper-converged control plane and workloads | AMD EPYC 7443P, 256 GiB<br>NVIDIA RTX 4000 Ada Generation, 24 GB<br>Micron 9300 PRO, 4 TB, x7<br>Seagate Exos X20, 18 TB, x15<br>NVIDIA ConnectX-5<br>LSI 9500-8e<br>45Drives HL-15 |
| kantai2 | Virtual arm64 control plane and workloads | Apple M2 Mac Mini, 16 GB (mem), 500 GB (block)<br>UTM + QEMU hypervisor |
| kantai3 | Hyper-converged control plane and workloads | AMD Ryzen Embedded V1500B, 32 GB<br>NVIDIA T400, 4 GB<br>Seagate Exos X18, 18 TB, x6<br>NVIDIA ConnectX-3<br>QNAP TS-673A |
Network
kantai is connected to an all-Ubiquiti network, with a Hi-Capacity Aggregation as the top-of-rack switch and a Dream Machine Pro as the gateway/router/firewall. Recent versions of Unifi Network and Unifi OS support BGP, which is used to advertise load balancer addresses and thus provide node-balanced services to the network. The cluster’s virtual network is dual-stack IPv4 and IPv6.
The cluster uses kantai.xyz as its public domain. It is registered with Cloudflare, which also acts as the DNS authority. Cloudflare also proxies requests for services available from the public internet and tunnels them to the cluster for DDoS and privacy protection.
The cluster integrates with a Tailscale tailnet for secure private access from anywhere.
IPv4
Cluster nodes are connected to the main Ubiquiti network which uses 10.1.0.0/16.
Cilium advertises routes to load-balanced services using BGP.
A Unifi network matching the load balancer CIDR (10.11.0.0/16) is programmed to prevent unnecessary NAT hairpinning and to allow flows through the firewall.
Cilium masquerades pod addresses to node addresses.
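The load balancer addressing and BGP advertisement described above can be sketched with Cilium CRDs. This is an illustrative example only: the ASNs, the peer address, and the service selector are assumptions, not the cluster's real values.

```yaml
# Hypothetical sketch: a Cilium LB IPAM pool plus a BGP peering policy
# that advertises its addresses to the Unifi gateway.
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: main
spec:
  blocks:
    - cidr: 10.11.0.0/16
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: main
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux
  virtualRouters:
    - localASN: 64512                 # assumed private ASN
      exportPodCIDR: false
      neighbors:
        - peerAddress: 10.1.0.1/32    # Unifi gateway (assumed)
          peerASN: 64513              # assumed private ASN
      serviceSelector:                # match-all idiom from the Cilium docs
        matchExpressions:
          - { key: somekey, operator: NotIn, values: ["never-used-value"] }
```

With this in place, the Unifi gateway learns a route for each allocated LoadBalancer IP and clients reach services without hairpinning through NAT.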
IPv6
For IPv6 networking, I decided to use globally routable addresses for pods, services, and LB IPAM. This means no masquerading is necessary, which is more in the spirit of IPv6. Routes and firewalls must still be programmed for traffic to flow.
Cluster nodes are connected to the main Ubiquiti network which receives an IPv6 /64 prefix via prefix delegation and assigns addresses to clients via SLAAC.
3 additional /64 prefixes are manually reserved for pods, services, and Cilium LB IPAM.
Cilium advertises routes to load-balanced services using BGP (same as IPv4).
A Unifi network matching the load balancer CIDR is programmed to prevent unnecessary NAT hairpinning and allow flows through the firewall (same as IPv4).
IPv6 masquerading is disabled.
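A minimal sketch of the corresponding Cilium Helm values for this dual-stack behavior, assuming the chart's standard keys (routing mode and the other settings are omitted):

```yaml
# Illustrative Cilium Helm values for dual-stack with IPv6 masquerading off.
ipv4:
  enabled: true
ipv6:
  enabled: true
enableIPv4Masquerade: true    # pod IPv4 is masqueraded to node addresses
enableIPv6Masquerade: false   # pods use globally routable IPv6, no NAT
kubeProxyReplacement: true
bgpControlPlane:
  enabled: true
```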
🔧 Core Components
GitOps & Cluster Management
FluxCD
The cluster is managed entirely through GitOps using FluxCD. All resources are declared in this repository and automatically reconciled to the cluster. The Flux Operator manages the FluxCD instance.
Kustomizations define the desired state of each application
HelmReleases manage Helm chart deployments
OCIRepositories pull charts from OCI registries
Drift detection ensures cluster state matches Git
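The HelmRelease/OCIRepository pattern looks roughly like this; the names, registry URL, and versions are placeholders, and chartRef requires a reasonably recent Flux release:

```yaml
# Hypothetical example: an OCIRepository pulls a chart from an OCI registry,
# and a HelmRelease deploys it with drift detection handled by Flux.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 1h
  url: oci://ghcr.io/example/charts/app   # placeholder registry path
  ref:
    tag: 1.2.3
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: app
  values:
    replicaCount: 1
```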
tuppr
Automated Talos and Kubernetes upgrades are managed by tuppr. Upgrade CRDs (TalosUpgrade, KubernetesUpgrade) define version targets with health checks that ensure VolSync backups complete and Ceph cluster health is OK before proceeding.
Renovate
The repository is constantly updated using Renovate and flux-local. Minor and patch updates are applied automatically while major releases require human approval.
Networking
Cilium
Cilium serves as the CNI in kube-proxy replacement mode.
Envoy Gateway
Envoy Gateway provides a complete and up-to-date implementation of the Kubernetes Gateway API with advanced extensions.
An external Gateway is used for routes that should be available from the public internet (via a Cloudflare Tunnel), while an internal Gateway is used for routes that should only be accessible on the local network or on my tailnet.
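A Gateway plus an HTTPRoute bound to it might look like the following sketch; the GatewayClass name, certificate secret, and hostnames are assumptions:

```yaml
# Illustrative Gateway API objects; names and hostnames are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: internal
spec:
  gatewayClassName: envoy            # Envoy Gateway's GatewayClass (assumed name)
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "*.kantai.xyz"
      tls:
        mode: Terminate
        certificateRefs:
          - name: kantai-xyz-tls     # wildcard cert secret (assumed name)
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app
spec:
  parentRefs:
    - name: internal                 # attach to the internal Gateway
  hostnames:
    - app.kantai.xyz
  rules:
    - backendRefs:
        - name: app
          port: 80
```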
external-dns
external-dns automatically manages DNS records for services:
Cloudflare records for external Gateway routes
Unifi records for internal Gateway routes, using @kashalls’s excellent Unifi provider
Tailscale
The Tailscale Operator integrates the cluster with my tailnet.
API Server Proxy - The Kubernetes API server is accessible over the tailnet via Tailscale’s API server proxy in auth mode, enabling API server access with tailnet authn/authz.
Split-Horizon DNS - A k8s-gateway deployment serves as a kantai.xyz split-horizon DNS server on the tailnet for all HTTPRoute resources with a kantai.xyz hostname, making them resolvable on the tailnet (but not reachable, since the Envoy Gateway services use the Cilium BGP LoadBalancer class; see next). The k8s-gateway service itself is exposed to the tailnet using a Tailscale load balancer service.
The Unifi gateway is connected to the tailnet and programmed as a subnet router for the Cilium BGP LoadBalancer’s IPv4 CIDR, making all such services reachable over the tailnet.
Multus
Multus CNI enables attaching multiple network interfaces to pods. Used for workloads requiring direct LAN access via macvlan interfaces with dual-stack networking support.
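A macvlan attachment of the kind described above can be sketched as a NetworkAttachmentDefinition; the interface name and IPAM choice are assumptions:

```yaml
# Hypothetical macvlan attachment giving a pod a direct LAN presence.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: lan
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "dhcp"
      }
    }
```

A pod then requests the extra interface with the annotation `k8s.v1.cni.cncf.io/networks: lan`.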
Secrets Management
external-secrets + 1Password
external-secrets synchronizes secrets from 1Password into Kubernetes using the 1Password Connect server. A ClusterSecretStore provides cluster-wide access to secrets.
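The store-plus-secret pair might be sketched as follows; the vault name, item key, and secret names are assumptions:

```yaml
# Illustrative 1Password Connect-backed ClusterSecretStore and a consumer.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: onepassword
spec:
  provider:
    onepassword:
      connectHost: http://onepassword-connect.external-secrets.svc:8080
      vaults:
        homelab: 1                        # vault name -> priority (assumed)
      auth:
        secretRef:
          connectTokenSecretRef:
            name: onepassword-connect-token
            namespace: external-secrets
            key: token
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  secretStoreRef:
    name: onepassword
    kind: ClusterSecretStore
  target:
    name: app-credentials               # resulting Kubernetes Secret
  dataFrom:
    - extract:
        key: app                        # 1Password item name (assumed)
```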
Certificate Management
cert-manager + trust-manager
cert-manager automates certificate lifecycle management:
Maintains a wildcard certificate for kantai.xyz using Let’s Encrypt DNS challenge (Cloudflare API)
trust-manager distributes CA bundles across namespaces
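The wildcard certificate flow can be sketched with a ClusterIssuer and a Certificate; the contact email and secret names are placeholders:

```yaml
# Illustrative Let's Encrypt DNS-01 issuer (Cloudflare) and wildcard cert.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: acme@kantai.xyz              # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kantai-xyz-wildcard
spec:
  secretName: kantai-xyz-tls
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
  dnsNames:
    - kantai.xyz
    - "*.kantai.xyz"
```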
Identity & Authentication
Pocket ID
Pocket ID serves as the in-cluster OIDC provider, enabling:
Kubernetes API server OIDC authentication
OIDC authentication for apps that do not natively support it via Envoy Gateway’s SecurityPolicy extension
Centralized identity management for applications
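Fronting an app with OIDC via Envoy Gateway looks roughly like this SecurityPolicy sketch; the issuer URL, client ID, and referenced names are assumptions:

```yaml
# Hypothetical SecurityPolicy adding OIDC login in front of an HTTPRoute.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: app-oidc
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: app
  oidc:
    provider:
      issuer: https://id.kantai.xyz     # Pocket ID issuer (assumed URL)
    clientID: app
    clientSecret:
      name: app-oidc-client             # Secret holding the client secret
```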
Storage
Rook-Ceph
Rook-Ceph provides distributed storage across the cluster:
Block Storage (ceph-block) - Default storage class with 3-way replication, LZ4 compression
Object Storage (ceph-bucket) - S3-compatible storage with erasure coding (2+1)
Dashboard exposed via Envoy Gateway
Encrypted OSDs for data-at-rest security
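The replicated, compressed block pool can be sketched as a CephBlockPool; the pool parameters shown are illustrative:

```yaml
# Illustrative 3-way replicated pool with LZ4 compression enabled.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ceph-block
  namespace: rook-ceph
spec:
  failureDomain: host        # spread replicas across nodes
  replicated:
    size: 3
  parameters:
    compression_mode: aggressive
    compression_algorithm: lz4
```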
OpenEBS ZFS
OpenEBS ZFS LocalPV exposes existing ZFS pools on nodes as Kubernetes storage:
Provides access to large media and data pools
Supports ZFS features (compression, snapshots, datasets)
Used for workloads requiring high-capacity local storage
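A StorageClass exposing an existing pool might look like this; the pool name is a placeholder:

```yaml
# Hypothetical StorageClass backed by an existing ZFS pool on a node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-media
provisioner: zfs.csi.openebs.io
parameters:
  poolname: tank             # existing ZFS pool (assumed name)
  fstype: zfs                # provision ZFS datasets, not zvols
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```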
Samba
Samba deployments on storage nodes share ZFS-backed volumes to the local network via SMB, enabling access to cluster-managed data from non-Kubernetes clients.
VolSync + Kopia
VolSync backs up persistent volumes to Cloudflare R2 using Kopia:
Daily snapshots with 7 daily, 4 weekly, 12 monthly retention
Clone-based backups (no application downtime)
Zstd compression for efficient storage
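A ReplicationSource along these lines would express the schedule and retention; the kopia mover's field names here are assumed by analogy with VolSync's restic mover, so check the VolSync documentation for the exact schema:

```yaml
# Hypothetical VolSync backup of a PVC to a Kopia repository on R2.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: app-backup
spec:
  sourcePVC: app-data          # PVC to back up (assumed name)
  trigger:
    schedule: "0 3 * * *"      # daily
  kopia:
    repository: app-kopia      # Secret with repository credentials (assumed)
    copyMethod: Clone          # clone-based, no application downtime
    retain:
      daily: 7
      weekly: 4
      monthly: 12
```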
Database
CloudNative-PG
CloudNative-PG manages PostgreSQL clusters for applications:
PostgreSQL 18 with vchord vector extensions for AI/ML workloads
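A CloudNative-PG cluster for an application can be sketched as follows; the instance count, image tag, and storage values are assumptions:

```yaml
# Illustrative CloudNative-PG cluster backed by Ceph block storage.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:18   # hypothetical tag
  storage:
    size: 20Gi
    storageClass: ceph-block
```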
GPU Compute
NVIDIA GPU Operator
The NVIDIA GPU Operator enables GPU workloads.
Observability
Metrics: VictoriaMetrics
The VictoriaMetrics Operator manages the metrics stack.
Dashboards: Grafana Operator
The Grafana Operator manages Grafana instances and dashboards via GrafanaDashboard CRDs.
Logs: fluent-bit
fluent-bit collects container logs from all nodes, running as a DaemonSet in the observability-agents namespace.
kube-prometheus-stack
The kube-prometheus-stack provides:
ServiceMonitors for Kubernetes components (API server, kubelet, etcd, scheduler, controller-manager)
kube-state-metrics for resource metrics
Dashboards via Grafana Operator integration
Note: Prometheus and Alertmanager from this stack are disabled in favor of VictoriaMetrics. The stack is primarily used for its comprehensive ServiceMonitor definitions and dashboards.
📁 Repository Structure
🚀 Getting Started
Bootstrap
Bootstrap is currently broken and unusable. I love my pets.
Maintenance
Update Talos node configuration:
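A hypothetical flow using talhelper-style tooling might look like the following; the node address and rendered file name are assumptions, not the repository's actual commands:

```sh
# Hypothetical maintenance flow (addresses and paths are assumptions):
# re-render machine configs, then apply the result to one node.
talhelper genconfig
talosctl apply-config \
  --nodes 10.1.0.11 \
  --file clusterconfig/kantai-kantai1.yaml
```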
🔒 Security
📊 Monitoring
Lots of dashboards available on the on-cluster Grafana instance. Alerts go out to Discord.
🙏 Acknowledgments