So I built a homelab. Not the "throw Docker on a Raspberry Pi" kind of homelab, but a proper production-grade personal cloud infrastructure that I'd be comfortable showing to enterprise architects. This is the story of how three months of eBay hunting, some hard-learned networking lessons, and way too much coffee turned into a 17-service platform that handles everything from media streaming to family photo management with enterprise-level security.
Fair warning: this is going to be long. We're talking multi-site backup strategies, ZFS pool encryption, VPN-only admin access, and automated monitoring. If you're looking for a quick "Docker Compose and done" tutorial, this isn't it. But if you want to understand how to build something that actually works in production while learning from my mistakes along the way, grab your beverage of choice and settle in.
The project started the way most homelab projects do—with the realization that my collection of cloud subscriptions was costing more per month than a decent used server would cost per year. But more than that, I wanted something I controlled. As someone who works in digital forensics, the idea of having complete visibility into my infrastructure, proper logging, and the ability to practice enterprise security concepts at home was appealing.
I spent about three months watching eBay for deals on enterprise hardware. The goal was simple: find something powerful enough to run a dozen containerized services, reliable enough to trust with family data, and cheap enough that I wouldn't feel bad if I completely bricked it during experimentation.
What I ended up with was a Dell T3620 workstation. Not the sexiest hardware, but it checked all the boxes: an Intel i7 processor and, the real kicker, an upgrade to 64GB of DDR4 ECC RAM. That memory upgrade alone changed the scope of what became possible. With that much headroom, I could run comprehensive monitoring, allocate a generous ZFS ARC cache, and never worry about memory pressure even with a dozen services running simultaneously.
For storage, I grabbed three 8TB WD Red drives for the main RAID-Z1 array, a 1TB SSD for the OS, and a 1TB NVMe drive for VM storage. The whole setup came together piece by piece as deals appeared, and by the time I had all the components, I'd spent roughly what six months of my previous cloud subscriptions would have cost.
Oh, and I picked up an APC Back-UPS 600 for power protection. Nothing fancy, but it gives me about 15-20 minutes of runtime—enough for graceful shutdowns during power events. More on why that matters later.
I went with TrueNAS Scale as the foundation OS. Being Debian-based means it has proper Docker support while still giving me enterprise-grade ZFS management. The installation itself was straightforward—flash the ISO to USB, boot, install to the dedicated SSD, configure a static IP, and you're running.
I set up two ZFS pools: the main storage array in RAID-Z1 configuration across the three 8TB drives (giving me about 16TB usable), and the NVMe as a high-performance pool for databases and temporary processing. Both pools got full AES-256-GCM encryption because, well, why wouldn't you? Here's where having 64GB of RAM became important: I could allocate 20GB to ZFS ARC cache. That might seem excessive, but when you're serving media to multiple streams while Immich is processing photo uploads and Plex is generating thumbnails, that cache hit rate makes the difference between smooth operation and stuttering performance.
The dataset structure took some planning. I created separate datasets for each major service—Plex, Immich, user files, media libraries, downloads, backups, and configuration files. ZFS datasets are like having individual filesystems with their own properties, which becomes important later when you're doing snapshots and setting different compression levels.
I enabled LZ4 compression across the board. It's essentially free on modern CPUs and typically nets 20-30% space savings on compressible data like text and configuration files. Media files are already compressed and see little benefit, but ZFS leaves incompressible blocks untouched, so there's no penalty for enabling it everywhere and the effective capacity gain on everything else is pure upside.
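The dataset layout and compression settings above can be sketched with a few `zfs` commands. The pool name `tank` and the dataset names here are illustrative, not my exact paths:

```shell
# Illustrative per-service datasets on a pool named "tank" (names are hypothetical)
zfs create -o compression=lz4 tank/config
zfs create -o compression=lz4 tank/users
zfs create -o compression=lz4 -o recordsize=1M tank/media   # large sequential files
zfs create -o compression=lz4 tank/downloads
zfs create -o compression=lz4 tank/backups

# Check what the compression is actually buying you
zfs get compressratio tank
```

Because each dataset is its own filesystem, properties like `recordsize` and snapshot schedules can differ per service without touching the others.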
My plan was solid: use Tailscale for VPN access, NGINX Proxy Manager for SSL termination and reverse proxying, and Pi-hole for DNS filtering. Standard stuff, right? Well, the execution taught me some valuable lessons.
First major issue: after getting everything configured and running beautifully, I rebooted. Simple maintenance reboot. When the system came back up, nothing worked. Containers couldn't resolve hostnames, Docker couldn't pull images, and I was staring at "temporary failure in name resolution" errors everywhere.
TrueNAS was overwriting my DNS configuration in /etc/resolv.conf on every boot. I'd set up proper nameservers, everything worked, but the system wasn't persisting the configuration. The fix was to set the DNS configuration through the TrueNAS web interface (Network → Global Configuration) rather than editing files directly. Once I configured it properly through the UI with Google DNS and Cloudflare as fallbacks, it persisted across reboots.
Lesson learned: respect the abstraction layers. TrueNAS has its own configuration management system, and fighting it just creates problems.
The second major headache was getting Tailscale working for VPN access. I wanted to advertise my entire home network through Tailscale so I could access all services remotely without exposing anything to the public internet. Simple concept, frustrating execution.
The container kept restarting. Over and over. Exit code 1, restart, exit code 1, restart. The issue? I was using both the TS_ROUTES environment variable and the --advertise-routes flag in TS_EXTRA_ARGS. Tailscale didn't like the duplication and threw a fit.
The fix was to use only TS_EXTRA_ARGS with the full route advertisement configuration, and completely remove the TS_ROUTES variable. Once I made that change, Tailscale came up stable and has been rock-solid ever since. I can now access every service through the VPN, and nothing is exposed directly to the internet except the VPN endpoint itself.
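A minimal sketch of the working container definition, assuming the official `tailscale/tailscale` image and a 192.168.1.0/24 home subnet (your subnet and auth-key handling will differ, and the key below is a placeholder):

```yaml
services:
  tailscale:
    image: tailscale/tailscale:latest
    hostname: homelab
    environment:
      - TS_AUTHKEY=tskey-auth-xxxxx        # placeholder auth key
      - TS_STATE_DIR=/var/lib/tailscale
      - TS_EXTRA_ARGS=--advertise-routes=192.168.1.0/24
      # NOTE: no TS_ROUTES variable here -- setting both was the crash loop
    volumes:
      - ./tailscale/state:/var/lib/tailscale
      - /dev/net/tun:/dev/net/tun
    cap_add:
      - NET_ADMIN
    restart: unless-stopped
```

The advertised route still has to be approved in the Tailscale admin console before clients can use it.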
Third lesson came from qBittorrent. I set it up, configured the downloads, pointed it at the correct storage... and got "Unauthorized" errors whenever I tried to access the web interface. The credentials were correct, the container was running, but authentication just failed.
Turns out the issue was port mapping. I had the external port mapped to 8083 but the internal port set differently in the container configuration. qBittorrent's web interface is particular about this—the WEBUI_PORT environment variable needs to exactly match your port mapping. Once I set both to 8083, authentication worked immediately.
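In Compose terms, the fix is just keeping both sides of the mapping in agreement with the environment variable (image shown is the LinuxServer.io build I'd expect here; adjust to whatever image you run):

```yaml
services:
  qbittorrent:
    image: lscr.io/linuxserver/qbittorrent:latest
    environment:
      - WEBUI_PORT=8083   # must match the container side of the mapping below
    ports:
      - "8083:8083"       # host:container -- both aligned with WEBUI_PORT
```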
These weren't catastrophic failures, but each one cost me hours of troubleshooting. And that's fine—every production system has its quirks, and learning them now means they won't bite me during an actual emergency.
With the foundation stable, I started deploying services. Everything runs in Docker containers managed through Portainer, with a single docker-compose stack orchestrating the whole setup. This gives me version control on the entire infrastructure configuration and makes disaster recovery straightforward—restore the compose file and data volumes, run docker-compose up, and you're back.
The service architecture breaks down into four tiers:
At the bottom, we have the core platform services that everything else depends on. Portainer provides the container management interface. Tailscale handles VPN mesh networking. NGINX Proxy Manager terminates SSL and routes traffic to the appropriate backends. Pi-hole filters DNS queries network-wide.
This first tier starts before everything else, and if any of these services fails the whole platform is compromised, so they get careful attention in the startup order and aggressive health checks.
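In Compose terms, the ordering and health checks look roughly like this. The Pi-hole probe shown is a reasonable assumption based on the image's own health check, not necessarily my exact configuration:

```yaml
services:
  pihole:
    image: pihole/pihole:latest
    healthcheck:
      test: ["CMD", "dig", "@127.0.0.1", "pi.hole", "+short"]
      interval: 30s
      timeout: 5s
      retries: 3

  nginx-proxy-manager:
    image: jc21/nginx-proxy-manager:latest
    depends_on:
      pihole:
        condition: service_healthy   # don't start until DNS is answering
```

`condition: service_healthy` is what turns a health check into an actual startup-order guarantee rather than just a dashboard indicator.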
The second tier is the services that family members actually use, accessible through SSL-secured domain names. Plex streams media. Immich handles photo management with AI-powered organization. Overseerr provides a clean interface for requesting new movies and TV shows. FileBrowser gives web-based file access. Vaultwarden serves as the family password manager.
Each of these services is accessible via HTTPS through properly configured reverse proxies. Let's Encrypt handles certificate automation, and everything renews automatically. Users never see the underlying complexity—they just access photos.homelab.example or requests.homelab.example and everything works.
The third tier is VPN-only. No public access, period. The download stack—qBittorrent routed through ProtonVPN, plus Radarr, Sonarr, and Prowlarr for media automation. Pi-hole's admin interface for DNS management. Portainer for container management. NGINX Proxy Manager's admin interface.
The security model here is simple: if you're not on the VPN, these services don't exist. Even on my local network, they're locked down to only respond to VPN traffic. This gives me secure remote access without the attack surface of public-facing admin interfaces.
The fourth tier is where the platform becomes self-maintaining. Uptime Kuma monitors every service with real-time health checks, tracks response times, and maintains historical data. It's got a clean dashboard accessible at dashboard.homelab.example (VPN-only, naturally) that shows the health of the entire infrastructure at a glance.
Watchtower runs in monitor-only mode, checking daily for container updates but not auto-applying them. This gives me visibility into available updates without the risk of automatic deployments breaking something at 3 AM. I review the updates manually, test them, and deploy during maintenance windows.
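The monitor-only behavior is a single environment variable on the Watchtower container; a minimal sketch (the schedule shown is an assumption, pick your own):

```yaml
services:
  watchtower:
    image: containrrr/watchtower:latest
    environment:
      - WATCHTOWER_MONITOR_ONLY=true     # report available updates, never apply them
      - WATCHTOWER_SCHEDULE=0 0 6 * * *  # daily check at 06:00 (6-field cron)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

With `WATCHTOWER_MONITOR_ONLY=true`, Watchtower still needs the Docker socket to inspect running containers, but it only ever notifies; applying the update stays a human decision.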
I wrote custom shell scripts for deeper monitoring—SSL certificate expiration tracking, API connectivity validation, system resource monitoring, and backup verification. These run on cron schedules and output to JSON status files that get parsed into an administrative dashboard. The dashboard updates every 30 minutes and shows me everything from disk usage to the last successful backup timestamp.
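For flavor, here's a stripped-down sketch of the certificate-expiry check. The threshold and the JSON shape are illustrative; the real script derives the days-remaining value from `openssl x509 -enddate` rather than taking it as an argument:

```shell
#!/bin/sh
# Classify a certificate by days-until-expiry.
cert_status() {
    days=$1
    if [ "$days" -le 0 ]; then
        echo "expired"
    elif [ "$days" -le 7 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}

# Emit one JSON status line for the dashboard to parse.
# $1 = hostname, $2 = days remaining
emit_json() {
    printf '{"host":"%s","days_remaining":%d,"status":"%s"}\n' \
        "$1" "$2" "$(cert_status "$2")"
}

# In the real script the number comes from openssl; here we just
# demonstrate the output format with a hypothetical host.
emit_json "photos.homelab.example" 42
```

Each check appends lines like this to a status file, and the dashboard job just parses the JSON every 30 minutes.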
One of the cooler pieces of the setup is the media acquisition pipeline. I wanted something fully automated but also privacy-conscious, which meant routing all torrent traffic through a VPN.
The solution was to use Gluetun as a VPN gateway container, specifically configured for ProtonVPN. qBittorrent runs with its network stack attached to the Gluetun container, meaning all of its traffic—every single packet—routes through the encrypted VPN tunnel. If the VPN connection drops, qBittorrent loses network access entirely. No leaks, no fallback to the WAN connection. I have noticed significantly slower download speeds with this configuration.
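The wiring is mostly one Compose directive: qBittorrent declares no network of its own and borrows Gluetun's stack. A minimal sketch, with placeholder credentials:

```yaml
services:
  gluetun:
    image: qmcgaw/gluetun:latest
    cap_add:
      - NET_ADMIN
    devices:
      - /dev/net/tun:/dev/net/tun
    environment:
      - VPN_SERVICE_PROVIDER=protonvpn
      - OPENVPN_USER=xxxxx       # placeholder ProtonVPN credentials
      - OPENVPN_PASSWORD=xxxxx

  qbittorrent:
    image: lscr.io/linuxserver/qbittorrent:latest
    network_mode: "service:gluetun"   # every packet rides the VPN tunnel
    depends_on:
      - gluetun
```

One consequence of `network_mode: "service:gluetun"`: any port you want reachable (like the web UI) must be published on the gluetun container, since qBittorrent no longer has its own network namespace. It's also why a dropped tunnel means no connectivity at all rather than a leak.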
Radarr and Sonarr monitor for new content and automatically send downloads to qBittorrent through the VPN gateway. When downloads complete, the media files get organized into the appropriate Plex libraries, metadata gets fetched, and Overseerr notifies whoever requested the content. The whole pipeline is zero-touch once configured.
This setup gives me plausible deniability and proper security. All torrent traffic is encrypted and anonymized through ProtonVPN. The ISP sees VPN traffic and nothing else. And because everything is automated, there's no manual intervention needed—request something through Overseerr, and it shows up in Plex automatically.
Running a homelab for yourself is straightforward. Running it for family members who shouldn't have admin access? That's where things get interesting. I needed to provide services to a spouse who wants to stream media and upload photos, but definitely shouldn't be able to access download management or SSH into containers.
The solution required thinking about isolation at multiple layers...
Some services got configured with managed user accounts. Plex has a proper user management system, so I created a managed user account with library access but no admin capabilities. They can watch content, manage their own watch history, but can't see server settings or make configuration changes.
Overseerr integrates with Plex authentication and has granular permission controls. The user account can request content, but those requests require approval. There are daily limits to prevent runaway disk consumption if someone decides to request the entire IMDb catalog: movie requests are capped at five per day, TV series at two, individual seasons at five.
Immich got set up with completely separate user libraries. Each user account has their own photo storage with their own AI models and facial recognition. There's no cross-contamination of personal photos between accounts, and each user's data is genuinely isolated.
At the ZFS level, I created separate datasets for user-specific data with proper permissions. The admin user has full access everywhere because, well, I'm the admin. The family user has read-write access to their own directory, read-only access to shared media libraries, and no access whatsoever to admin directories, configuration files, or download management.
SMB shares enforce these permissions at the network level. The family user can map their personal share and the media library, but they can't even see the admin directories or download folders. Access Based Share Enumeration means if you don't have permissions, the share doesn't appear in your network browser.
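In Samba terms (TrueNAS exposes these as share-level options in the UI), the behavior comes down to a couple of parameters. Share names and paths here are hypothetical:

```ini
[family]
    path = /mnt/tank/users/family
    valid users = family
    read only = no
    access based share enum = yes   ; shares you can't access don't appear
    hide unreadable = yes           ; files you can't read don't appear

[media]
    path = /mnt/tank/media
    valid users = family, admin
    read only = yes
    write list = admin              ; admin keeps write access to the library
```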
FileBrowser got configured with a scoped view—the user account sees only their personal directory and shared content. The admin account sees everything. Same tool, different views based on authentication.
The really sensitive services don't have user accounts at all—they're simply not accessible without VPN access. Download management, container administration, system monitoring, Pi-hole configuration—all of these require being on the VPN, which requires admin credentials.
This creates a clean separation: family members access public-facing services through standard HTTPS. Administrators access everything through VPN. There's no confusion about what should or shouldn't be accessible, and there's no risk of accidentally exposing admin interfaces.
Living in an area with somewhat questionable power reliability meant UPS integration wasn't optional. The APC BE600M1 connects via USB and gets monitored through Network UPS Tools (NUT).
The configuration handles three scenarios: power loss (log it and monitor battery levels), low battery (initiate graceful container shutdown and create emergency ZFS snapshots), and power restoration (log the event and verify service health). The shutdown sequence is carefully ordered—user-facing services stop first, then databases, then infrastructure services, ensuring nothing gets corrupted during power events.
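The low-battery handler is essentially an ordered teardown. A simplified sketch of the kind of script NUT invokes (container names and the snapshot label are illustrative):

```shell
#!/bin/sh
# Invoked by NUT on a low-battery event.
# Order matters: user-facing services first, databases next, infra last.

docker stop plex immich overseerr filebrowser vaultwarden   # user-facing
docker stop immich-postgres                                 # databases
docker stop qbittorrent radarr sonarr prowlarr              # automation

# Emergency point-in-time snapshot before the lights go out
zfs snapshot -r "tank@emergency-$(date +%Y%m%d-%H%M%S)"

docker stop npm pihole portainer tailscale                  # infrastructure
shutdown -h now
```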
The UPS gives me about 15-20 minutes of runtime with typical load. That's more than enough for graceful shutdowns, and it's saved me multiple times during brief power flickers that would have otherwise meant hard crashes and potential filesystem corruption.
As someone who deals with data recovery professionally, I'm perhaps unreasonably paranoid about backups. The strategy here is multi-layered:
Automated snapshots run hourly, daily, and weekly with appropriate retention policies: hourlies are kept for 24 hours, dailies for a week, weeklies for a month. This gives me point-in-time recovery for accidental deletions or configuration mistakes. Rollback is instant—literally just a ZFS command and you're back to any previous snapshot.
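Recovery really is that direct. For example, assuming a pool named `tank` and an illustrative snapshot name:

```shell
# See what point-in-time copies exist for the config dataset
zfs list -t snapshot -r tank/config

# Roll the whole dataset back (discards anything newer than the snapshot)
zfs rollback tank/config@daily-2024-01-15

# Or, non-destructively, copy a single file out of the hidden snapshot dir
cp /mnt/tank/config/.zfs/snapshot/daily-2024-01-15/docker-compose.yml /tmp/
```

The hidden `.zfs/snapshot` directory is usually the safer first move, since `zfs rollback` throws away everything written after the snapshot.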
Every week, a script backs up all container configurations, docker-compose files, and service configurations to encrypted archives. These get stored on the main storage pool but also copied to an external drive. If I need to rebuild from scratch, I have everything needed to restore the exact configuration.
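The weekly job is conceptually a one-pipeline script. A hedged sketch (paths are placeholders, and a real cron invocation would need non-interactive passphrase handling via `gpg --batch`):

```shell
#!/bin/sh
# Weekly config backup: archive, encrypt, copy off the pool.
STAMP=$(date +%Y-%m-%d)

tar czf - /mnt/tank/config /opt/stacks/docker-compose.yml \
    | gpg --symmetric --cipher-algo AES256 \
          --output "/mnt/tank/backups/config-$STAMP.tar.gz.gpg"

# Second copy on the external drive
cp "/mnt/tank/backups/config-$STAMP.tar.gz.gpg" /mnt/external/
```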
I run a local Docker registry that maintains copies of all running container images. Every day, a script pulls the latest versions of critical images, saves them to the registry, and also creates compressed tar archives as a fallback.
This means if Docker Hub goes down, or an image gets pulled, or something breaks in a new version, I have local copies of everything. I can restore any container from local storage without needing internet access. Combined with the configuration backups, I can rebuild the entire platform from local resources.
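The daily mirror job per image looks roughly like this. The registry address `registry.local:5000` and the backup paths are placeholders for wherever yours live:

```shell
#!/bin/sh
# Mirror one critical image into the local registry, plus a tar fallback.
IMAGE="lscr.io/linuxserver/qbittorrent:latest"

docker pull "$IMAGE"
docker tag  "$IMAGE" "registry.local:5000/qbittorrent:latest"
docker push "registry.local:5000/qbittorrent:latest"

# Belt-and-suspenders compressed archive, restorable with `docker load`
docker save "$IMAGE" | gzip > /mnt/tank/backups/images/qbittorrent.tar.gz
```

Restoring from the tar path is just `gunzip -c … | docker load`, which works with no registry and no internet at all.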
The monitoring scripts verify that backups are actually happening and are restorable. They check that ZFS snapshots exist, that configuration backups completed successfully, that the Docker registry is accessible, and that tar archives are present and not corrupted. If any of these checks fail, I get alerted immediately.
Everything uses proper SSL certificates. NGINX Proxy Manager integrates with Let's Encrypt and handles automatic certificate generation and renewal. Each service gets its own certificate rather than using a wildcard, which provides better security and clearer logging.
DNS is handled through Cloudflare with proper A records pointing to my public IP. NGINX Proxy Manager sits behind the router and handles all incoming HTTPS traffic, routing it to the appropriate backend services. The public services are accessible from anywhere, while admin services remain VPN-only regardless of the DNS configuration.
The automated monitoring includes SSL certificate expiration checking. I get warnings seven days before any certificate expires, though with Let's Encrypt automation, they should renew automatically. Belt and suspenders.
After three months of acquisition, two weeks of intense setup, and another couple weeks of debugging and optimization, the system has been running stable for weeks now. Seventeen containers operating with 99%+ uptime. The only downtime has been during planned maintenance windows.
Resource utilization is comfortable—memory usage hovers around 35GB with all services running, leaving plenty of headroom. The 20GB ZFS ARC cache is well-utilized with excellent hit rates. Storage is at about 9% utilization with 1.5TB used out of 16TB available, giving me years of growth runway.
The automated media pipeline works flawlessly. Someone requests content through Overseerr, I approve it, Radarr or Sonarr finds it, qBittorrent downloads it through the VPN, files get organized automatically, and Plex updates its library. Zero manual intervention needed beyond the approval step.
Family members use the services without even realizing they're running on local infrastructure. Photos upload to Immich, media streams through Plex, passwords sync through Vaultwarden. From their perspective, it's just another cloud service. From my perspective, it's a platform I completely control with full visibility into everything happening.
If I were starting over, I'd do a few things differently:
First, I'd spend more time up front on planning the network architecture. The DNS issues and Tailscale problems were avoidable if I'd properly understood how TrueNAS manages network configuration. Read the documentation, understand the platform's layers, and work with them rather than against them.
Second, I'd implement monitoring from day one rather than adding it later. Having Uptime Kuma and the monitoring scripts running from the start would have saved debugging time. When something breaks, you want historical data to reference, not just current state.
Third, I'd be more systematic about documentation. I kept notes, but they were scattered across multiple files and formats. Having a proper wiki or documentation system from the beginning would have been valuable, especially when trying to remember why I made specific configuration choices.
That said, the project was successful. I built something that works in production, handles real user traffic, and hasn't required emergency intervention. The security architecture is sound, the backup strategy is comprehensive, and the whole thing is maintainable without constant attention.
The platform is production-ready, but there's always room for improvement. On the immediate roadmap:
Adding proper off-site backup. Currently everything is local or local-plus-external-drive. I want encrypted backups going to cloud storage for true disaster recovery capability.
Expanding monitoring with Prometheus and Grafana for better metrics visualization and historical trending. Uptime Kuma is great for service health, but I want deeper visibility into resource utilization patterns.
Implementing more sophisticated automation workflows. The current setup handles media requests well, but there's potential for automated maintenance tasks, health remediation, and smarter resource management.
Adding a second host for high-availability and load distribution. The Dell T3620 is solid, but having a backup host would enable proper failover and zero-downtime updates.
Building a homelab at this scale takes time, costs money, and requires dealing with frustrating problems. But the result is worth it—a platform that provides cloud-level functionality with complete control and visibility. No monthly subscriptions, no wondering what the provider is doing with your data, no depending on someone else's uptime.
For someone in forensics and security, it's also a learning laboratory. I can experiment with security concepts, practice incident response, and understand enterprise architecture patterns in a low-stakes environment. The skills learned here translate directly to professional work.
And honestly? There's something satisfying about running your own infrastructure well. When family members seamlessly stream media or quickly find old photos, they don't think about the ZFS arrays, VPN tunnels, and automated pipelines making it possible. They just see services that work. And that's exactly the point.
If you're considering building something similar, my advice is simple: start with solid hardware, plan your network architecture carefully, implement monitoring early, and expect to learn through problems. Document everything, back up religiously, and don't be afraid to rebuild when something goes fundamentally wrong.
The homelab journey isn't about having the flashiest setup or the most services. It's about building something that works reliably, understanding how all the pieces fit together, and maintaining control over your own infrastructure. Three months of eBay hunting and two weeks of intense setup led to a platform that will serve for years. Not a bad investment.