
Building a Free AI Image Generator on 7 GPUs: Architecture Deep Dive

DEV Community · by Biricik Biricik · April 2, 2026 · 11 min read

Most tutorials about running AI inference at scale assume you have access to cloud GPU clusters, Kubernetes, and a team of infrastructure engineers. I had none of that. What I had was a single workstation with 7 NVIDIA RTX 5090 GPUs, a fiber internet connection, and a goal: serve free AI image generation to anyone on the internet without a signup wall.

This is the architecture that makes ZSky AI work. Every design decision here came from a real production failure or bottleneck, not from a whiteboard exercise.

The Constraints That Shaped Everything

Before diving into the architecture, here are the constraints that ruled out most "standard" approaches:

  • Single machine. All 7 GPUs live in one box. No cluster networking, no distributed training frameworks.

  • Consumer hardware. RTX 5090s, not A100s or H100s. Consumer drivers, consumer cooling, consumer power delivery.

  • Real-time serving. Users expect results in under 4 seconds. Batch processing is not an option.

  • Mixed workloads. Image generation and video generation share the same GPU pool. Video uses 28GB+ VRAM; images use 12-22GB. They cannot coexist on the same GPU simultaneously.

  • Zero downtime tolerance. If one GPU crashes, the remaining six must continue serving without dropping requests.

What I Tried and Abandoned

Kubernetes with NVIDIA Device Plugin

The enterprise-approved approach. I spent two weeks on it. On a single machine, it added 200-400ms of pod scheduling overhead per request, consumed non-trivial memory for the control plane, and made GPU debugging significantly harder because every error was wrapped in three layers of Kubernetes abstraction.

For a multi-node cluster, Kubernetes makes sense. For 7 GPUs on one machine, it is pure overhead.

Ray Serve

Ray is purpose-built for multi-GPU Python workloads. It worked better than Kubernetes, but its actor model introduced indirection I did not need. Model loading through Ray's object store was measurably slower than direct VRAM loading, and debugging distributed state across Ray actors when a GPU hung was painful.

Ray is excellent for distributed computing across machines. For local multi-GPU inference, it is overkill.

Triton Inference Server

NVIDIA's own inference server. It is fast and mature, but it assumes a model-serving paradigm where you deploy one model per endpoint. I needed dynamic model loading and eviction across 7 GPUs with shared state, which Triton's static configuration model does not handle well.

The Architecture That Stuck

After three failed approaches, I arrived at something embarrassingly simple:


Seven independent Python processes, each pinned to a single GPU via CUDA_VISIBLE_DEVICES. They share nothing except Redis. A supervisor process monitors heartbeats and restarts crashed workers.

That is the entire orchestration layer. No service mesh, no container runtime, no scheduler. One process per GPU, one Redis instance, one supervisor.
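As a minimal sketch of that orchestration layer (the `worker.py` entry point and CLI arguments are illustrative, not the production code):

```python
import os
import subprocess

def worker_env(gpu_id: int) -> dict:
    """Environment for a worker pinned to one GPU: inside the process,
    CUDA sees exactly one device (cuda:0), which is physical GPU gpu_id."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env

def spawn_worker(gpu_id: int) -> subprocess.Popen:
    # worker.py is a hypothetical entry point: it connects to Redis,
    # pops jobs routed to this GPU, and publishes heartbeats.
    return subprocess.Popen(["python", "worker.py", str(gpu_id)],
                            env=worker_env(gpu_id))
```

The supervisor holds the `Popen` handles, so a crashed worker is just a dead child process to reap and respawn.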

GPU Queue Management: Model-Affinity Routing

The naive approach is a single FIFO queue: jobs go in, the first available GPU picks one up. This works until you realize that loading a model into VRAM takes 4-6 seconds. If GPU 3 already has the image generation model loaded and GPU 5 has the video model loaded, sending an image generation job to GPU 5 means the user waits an extra 4 seconds while the model loads.

Model-affinity routing fixes this:

```python
# Priority 1: GPU that already has the model loaded
# AND has room in its queue
for gpu_id in range(self.num_gpus):
    state = self.get_gpu_state(gpu_id)
    if model_needed in state["loaded_models"]:
        if state["queue_depth"] < self.max_queue_depth:
            return gpu_id

# Priority 2: GPU with the most free VRAM
# (can load the model without evicting)
for gpu_id in sorted(range(self.num_gpus),
                     key=lambda g: self.get_free_vram(g), reverse=True):
    if self.get_free_vram(gpu_id) >= MODEL_VRAM[model_needed]:
        return gpu_id

# Priority 3: Least-busy GPU (will need to evict)
return min(range(self.num_gpus),
           key=lambda g: self.get_gpu_state(g)["queue_depth"])
```

Each GPU worker publishes its state to Redis every 5 seconds: loaded models, queue depth, current VRAM usage, temperature. The router reads this state to make decisions.
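The publish side can be sketched roughly like this, assuming redis-py; the key names and field names are illustrative:

```python
import json
import time

def gpu_state_payload(gpu_id, loaded_models, queue_depth, free_vram_mb, temp_c):
    """Snapshot one worker's state for the router to read."""
    return {
        "gpu_id": gpu_id,
        "loaded_models": loaded_models,
        "queue_depth": queue_depth,
        "free_vram_mb": free_vram_mb,
        "temperature_c": temp_c,
        "updated_at": time.time(),
    }

def publish_state(redis_client, payload, ttl_s=15):
    # TTL slightly above the 5s publish interval: if a worker dies,
    # its stale state expires instead of attracting jobs forever.
    redis_client.setex(f"gpu_state:{payload['gpu_id']}",
                       ttl_s, json.dumps(payload))
```

Expiring keys rather than overwriting them means a silent worker simply disappears from the router's view.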

This single optimization -- sending jobs to GPUs that already have the right model cached -- cut median latency by 30%. It is the highest-impact change in the entire system.

VRAM Management: The LRU Eviction Problem

Each RTX 5090 has 32GB of VRAM. A large image generation model uses about 22GB. A video model uses 28GB. Utility models (upscaling, background removal) use 1-2GB each.

The challenge: when a video generation request arrives at a GPU that currently holds an image model plus two utility models (22 + 2 + 1.5 = 25.5GB), the worker must evict everything to make room for the 28GB video model.

I use LRU eviction with a twist -- model priority. Utility models are always evicted before generation models, regardless of recency:

```python
while free < required_mb and self.loaded_models:
    # Evict lowest-priority, then least-recently-used
    victim = min(self.loaded_models.values(),
                 key=lambda m: (m.priority, m.last_used))
    self.unload(victim)
    torch.cuda.empty_cache()
    free = self.get_free_vram_mb()
```

Priority-weighted eviction prevents a common pathological case: a burst of small utility requests evicting the main generation model, followed by an image generation request that must reload it from disk. Without priority weighting, this thrashing pattern added 4+ seconds to one in every ten requests during peak traffic.

Load Balancing Across 7 GPUs

True load balancing across heterogeneous GPU workloads is harder than it sounds. Image generation takes 2-3 seconds; video generation takes 60-90 seconds. If you balance purely on queue depth, a GPU processing a video job shows queue depth 1 for 90 seconds, while image GPUs show queue depth 0 between 3-second jobs. The video GPU looks "available" but is not.

I balance on estimated completion time rather than queue depth.

Duration estimates come from historical data by model and resolution, stored in a simple rolling average. This gives the router a much more accurate picture of actual GPU availability.
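A minimal sketch of such an estimator and the routing decision; the class and method names are my own, not the production code, and the EMA stands in for whatever rolling average the real system uses:

```python
class DurationEstimator:
    """Rolling-average job durations keyed by (model, resolution)."""
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # EMA smoothing factor
        self.avg = {}

    def record(self, model: str, resolution: str, seconds: float) -> None:
        key = (model, resolution)
        prev = self.avg.get(key)
        self.avg[key] = seconds if prev is None else (
            self.alpha * seconds + (1 - self.alpha) * prev)

    def estimate(self, model: str, resolution: str, default: float = 5.0) -> float:
        return self.avg.get((model, resolution), default)

def pick_gpu(queues: dict, est: DurationEstimator) -> int:
    """queues maps gpu_id -> list of (model, resolution) jobs queued or
    running; pick the GPU with the lowest estimated completion time."""
    def eta(gpu_id: int) -> float:
        return sum(est.estimate(m, r) for m, r in queues[gpu_id])
    return min(queues, key=eta)
```

Under this scheme a GPU 30 seconds into a 90-second video job correctly looks busier than an idle image GPU, even though both report a short queue.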

Thermal Management: The Underrated Problem

Seven GPUs in one case generate approximately 3,500 watts of heat under full load. Consumer GPU coolers are designed for one or two cards with adequate airspace, not seven cards packed together.

Thermal throttling at 83C reduced my throughput by 15-25% during sustained loads. The fixes, in order of impact:

Fan curve override. Default fan curves prioritize noise over cooling. I run all fans at 80%+ whenever any GPU exceeds 70C, using a systemd service that polls nvidia-smi and sets fan speeds via NVML.
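A sketch of that controller, assuming pynvml for the NVML calls (setting fan speed requires root, and the exact curve here is illustrative):

```python
import time

def fan_duty_for(max_temp_c: float) -> int:
    """Duty cycle in percent: 80%+ once any GPU passes 70C, ramping to 100%."""
    if max_temp_c < 70:
        return 50  # quiet default below the threshold
    return min(100, 80 + int(max_temp_c - 70) * 2)

def run_fan_controller(poll_s: float = 5.0) -> None:
    # nvmlDeviceSetFanSpeed_v2 overrides the vendor fan curve.
    import pynvml
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        temps = [pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
                 for h in handles]
        duty = fan_duty_for(max(temps))
        for h in handles:
            for fan in range(pynvml.nvmlDeviceGetNumFans(h)):
                pynvml.nvmlDeviceSetFanSpeed_v2(h, fan, duty)
        time.sleep(poll_s)
```

Driving all fans from the hottest GPU is deliberate: in a shared case, one hot card heats its neighbors.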

Thermal-aware job routing. The router penalizes hot GPUs. A GPU at 80C gets a 2x weight penalty on its estimated wait time, which naturally diverts traffic to cooler GPUs.

Physical airflow. This is the least interesting but most effective fix. I removed the side panel, added two 140mm intake fans blowing directly across the GPU backplates, and ensured adequate spacing between cards using riser cables for the most constrained slots. This dropped peak temperatures by 8-12C.

After these changes, sustained full-load operation runs at 72-76C across all seven GPUs. No throttling.

Latency Optimization: Where the Milliseconds Go

In a typical image generation request, the milliseconds go to queue wait, model loading (on a cache miss), prompt encoding, denoising, and PNG encoding.

The optimizations that matter most:

  • Model caching eliminates 4,200ms on cache hits. Cache hit rate is 94% in production because most requests use the same model.

  • torch.compile with mode="reduce-overhead" cuts denoising by ~350ms by eliminating Python overhead in the inference loop.

  • Prompt embedding cache saves 180ms on repeated prompts. About 15% of prompts are exact repeats.

  • CUDA graphs for the most common resolution (1024x1024) saves another ~140ms by replaying a captured GPU execution plan.

  • Async PNG encoding does not reduce individual latency but frees the GPU 45ms sooner for the next request.
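The async encoding item can be sketched with a small thread pool; this is a hypothetical shape, assuming a PIL-style image object with a `save` method:

```python
import io
from concurrent.futures import ThreadPoolExecutor

encoder_pool = ThreadPoolExecutor(max_workers=2)

def encode_png(image) -> bytes:
    """PNG-encode an image object into bytes on a worker thread."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return buf.getvalue()

def submit_encode(image):
    # Returns a Future immediately; the GPU loop moves on to the next
    # job while the HTTP layer collects the encoded bytes later.
    return encoder_pool.submit(encode_png, image)
```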

Failure Handling: What Breaks in Production

In four months of production operation, here is what has actually failed:

| Failure | Count | Detection | Recovery |
|---|---|---|---|
| OOM on video generation | 4 | Worker crash | Auto-restart, job re-queued |
| Stuck worker (unknown cause) | 2 | Heartbeat timeout (120s) | Supervisor kill + restart |
| CUDA context corruption | 1 | Garbled output detected by QC | Worker restart |
| Power interruption | 1 | All workers died | Full system restart via systemd |
| NVLink error | 0 | N/A (no NVLink) | N/A |
| Driver crash | 0 | N/A | N/A |

The supervisor is the critical component. Every worker sends a Redis heartbeat every 10 seconds. If the heartbeat stops for 120 seconds, the supervisor kills the process, clears its GPU memory, and spawns a new worker. Jobs that were in-progress are re-queued automatically.

The most subtle failure mode is CUDA context corruption, which does not crash the worker but produces garbled images. I added a lightweight quality check on every output -- variance below a threshold (indicating a solid-color or corrupted image) triggers a worker restart and job retry.
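The variance check can be sketched in a few lines of NumPy; the threshold value here is illustrative and should be calibrated against known-good outputs:

```python
import numpy as np

def looks_corrupted(image, var_threshold: float = 25.0) -> bool:
    """Near-zero pixel variance means a solid-color frame, a symptom of
    CUDA context corruption rather than a plausible generation."""
    pixels = np.asarray(image, dtype=np.float32)
    return float(pixels.var()) < var_threshold
```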

Production Performance

Current production numbers on 7x RTX 5090:

| Metric | Value |
|---|---|
| Median image generation latency | 2.5s |
| p99 image generation latency | 4.1s |
| Median video generation latency | 67s |
| Sustained image throughput | ~2,400 images/hour |
| Model cache hit rate | 94% |
| GPU utilization (avg) | 41% |
| Uptime (last 90 days) | 99.7% |

The 41% average GPU utilization reflects real traffic patterns -- demand is bursty, with peaks during US business hours and valleys overnight. During peak hours, utilization hits 85-90%.

What I Would Do Differently

  • Start with per-GPU queues from day one. I initially used a single shared queue and retrofitted per-GPU queues for affinity routing. The refactor was messy.

  • Instrument everything from the start. I added Prometheus metrics after launch. Having generation-time histograms, queue-depth gauges, and VRAM-utilization metrics from day one would have caught the thermal throttling issue weeks earlier.

  • Do not try to batch real-time requests. Batching improves throughput but increases latency for the first request in the batch. For user-facing inference, single-request processing with model caching is strictly better.

  • Budget for thermal engineering from the beginning. I treated cooling as an afterthought and paid for it with two weeks of debugging intermittent throughput drops that turned out to be thermal throttling.

Conclusion

The final architecture is a Redis job queue with one Python process per GPU, a model-affinity router, an LRU VRAM manager, and a heartbeat-based supervisor. No Kubernetes, no Ray, no Triton. For a single-machine multi-GPU inference setup, simplicity is not just easier -- it is faster and more reliable.

The complexity lives where it should: in VRAM management and job routing, not in orchestration frameworks.

If you want to see the end result, ZSky AI serves 50 free generations per day with no signup required. The architecture described here is what is running behind the scenes.

I build AI infrastructure at ZSky AI, where we run AI image and video generation on self-hosted GPUs. If you are building something similar or just want to talk about GPU inference architecture, find me in the comments.
