BuildWithAI: Architecting a Serverless DR Toolkit on AWS
Overview
I'd been getting more involved in disaster recovery planning lately and kept running into the same gap — a lot of teams on AWS have backups, but not a real Disaster Recovery (DR) plan. No documented runbooks, no tested failover procedures, no RTO/RPO targets tied to business impact. So that became the motivation for this side project: six AI-powered tools that automate the tedious parts of DR planning, built entirely on AWS.
In part one of this three-part series, we will walk through the architecture — the serverless stack, the central model config, and the 5-layer cost guardrail system that keeps everything under $10/month (of course, you can set your own threshold; that's just what felt right for this side project). The next two parts will cover prompt engineering for each tool and the lessons learned building this side project.
Here is a look at what we're going to build. You can try out the live version at https://dr-toolkit.thecloudspark.com.
While this was implemented with the help of Kiro — AWS's spec-driven AI IDE — this series will focus on the DR toolkit, Amazon Bedrock, and the underlying AWS architecture, rather than Kiro itself.
What the toolkit does
Six tools, same workflow: provide input, Lambda calls Amazon Bedrock, get formatted output.
| # | Tool | Default Model | What it does |
|---|------|---------------|--------------|
| 1 | Runbook Generator | Nova Pro | Paste IaC → get a full DR runbook |
| 2 | RTO/RPO Estimator | Nova Lite | Fill a form → get recovery targets and DR tier |
| 3 | DR Strategy Advisor | Nova Lite | Answer questions → get an AWS DR architecture pattern |
| 4 | Post-Mortem Writer | Nova Lite | Paste incident notes → get a structured post-mortem |
| 5 | DR Checklist Builder | Nova Lite | Pick your AWS services → get a tailored audit checklist |
| 6 | Template DR Reviewer | Nova Pro | Paste IaC → get a gap analysis with fix snippets |
The live demo at DR Toolkit currently runs on Amazon Nova models. But these are just the defaults — the toolkit supports any model in the Bedrock Model Catalog. You can mix and match: Nova Lite for simple tools, Claude Sonnet for complex ones, or go all-in on a single provider. Just update models.config.json and redeploy.
Architecture
Here’s the big picture. I kept the architecture intentionally simple: a straightforward AWS serverless setup. A few Lambda functions, one API Gateway, one DynamoDB table, one SNS topic, and S3 + CloudFront for the frontend.
So when someone opens the toolkit, CloudFront serves the static frontend from a private S3 bucket. When they submit a tool form, the request goes through API Gateway to one of six tool Lambda functions. Each Lambda runs through the guardrail checks against DynamoDB before calling Amazon Bedrock's invoke_model. Separately, if the monthly AWS Budget hits $10, an SNS alert triggers the budget_shutoff Lambda, which flips tools_enabled=False in DynamoDB. Every tool checks that flag before doing anything else.
```
Browser
│
├── GET ──▶ CloudFront (security headers + URL rewrite)
│             └──▶ S3 (private bucket, OAC only)
│
└── POST ──▶ API Gateway (HTTP API, 10 req/s, burst 25)
              │
              ▼
        AWS Lambda (Python 3.14)
        ├── guardrails.py    ← 5-layer cost protection
        ├── model_config.py  ← reads models.config.json
        ├── Amazon Bedrock (cross-region inference profiles)
        └── DynamoDB (daily counters + IP rate limits + kill switch)

AWS Budget $10/mo ──▶ SNS ──▶ Lambda (flips kill switch)
```
| Layer | What | Why |
|-------|------|-----|
| Frontend | Next.js 16 + Tailwind CSS v3 | Static export, zero server cost |
| Frontend hosting | S3 (private, OAC) + CloudFront | Security headers, HTTPS, URL rewrite |
| API | API Gateway HTTP API | Built-in throttling, cheaper than REST API |
| Compute | Lambda (Python 3.14) | One function per tool + shared layer |
| AI | Amazon Bedrock | Cross-region inference profiles |
| Database | DynamoDB (on-demand) | Counters + feature flag + per-IP rate limits |
| Alerts | SNS + AWS Budgets | Auto-shutoff at $10/month |
| IaC | Serverless Framework | Single serverless.yml |
Central config: models.config.json
Every tool's model, token limit, daily cap, and word count are controlled by one JSON file at the repo's root:
```json
{
  "region": "ap-southeast-1",
  "tools": {
    "runbook-generator": {
      "modelId": "apac.amazon.nova-pro-v1:0",
      "displayLabel": "Nova Pro",
      "badgeColor": "blue",
      "toolLimit": 50,
      "maxTokens": 800,
      "maxWords": 600
    },
    "rto-estimator": {
      "modelId": "apac.amazon.nova-lite-v1:0",
      "displayLabel": "Nova Lite",
      "badgeColor": "green",
      "toolLimit": 50,
      "maxTokens": 400,
      "maxWords": 300
    }
  }
}
```
This config is consumed at deploy time by three things:
- Lambda handlers — via a shared model_config.py module
- Frontend — a slim copy with just displayLabel + badgeColor for the UI badges
- serverless-models.js — auto-generates IAM resource ARNs so Bedrock permissions stay scoped to exactly the models in use
The handlers auto-detect the model provider from the modelId and use the correct Bedrock request format — Anthropic's anthropic_version + system string format for Claude, or Amazon's schemaVersion: messages-v1 + system array format for Nova. You can mix providers freely within the same deployment. IAM permissions update automatically on deploy — no manual policy edits needed.
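As a rough sketch of how that detection can work (the function and variable names here are illustrative, not the toolkit's actual code; the request shapes follow Anthropic's Bedrock schema and Amazon's native messages-v1 schema as I understand them):

```python
def build_request(model_id: str, system_prompt: str, user_input: str, max_tokens: int) -> dict:
    """Pick the Bedrock invoke_model body format based on the model provider.

    Illustrative sketch: the real toolkit keeps this logic in a shared Lambda layer.
    """
    if "anthropic" in model_id:
        # Claude on Bedrock: anthropic_version field + system as a plain string
        return {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": [{"role": "user", "content": [{"type": "text", "text": user_input}]}],
        }
    # Amazon Nova: schemaVersion messages-v1 + system as an array of text blocks
    return {
        "schemaVersion": "messages-v1",
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_input}]}],
        "inferenceConfig": {"max_new_tokens": max_tokens},
    }
```

The returned dict is what would get serialized into the `body` of a `bedrock_runtime.invoke_model(...)` call, so switching providers in models.config.json changes the wire format without touching the handlers.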
Want to switch from Nova to Claude? Swap the modelId:
```json
"runbook-generator": {
  "modelId": "global.anthropic.claude-sonnet-4-6",
  "displayLabel": "Sonnet 4.6",
  ...
}
```
Redeploy and that's it 🚀. The Model Selection Guide in the repo has copy-paste-ready model IDs for every supported option.
The 5-layer cost guardrail system
Running a free public tool on Bedrock with no authentication means you need cost protection in layers. Five guardrail layers is probably overkill for most projects. But for a free public demo where anyone can hit the endpoint, I'd rather over-protect than wake up to a surprise bill. All five checks run before Bedrock ever gets called.
Layer 1 — API Gateway throttling
Configured in serverless.yml:
```yaml
HttpApiStage:
  Properties:
    DefaultRouteSettings:
      ThrottlingRateLimit: 10
      ThrottlingBurstLimit: 25
```
This is the first line of defense. Abuse gets 429s from API Gateway before Lambda even runs. Zero Bedrock cost.
Layer 2 — Daily usage counters
DynamoDB atomic conditional increments, both global (200/day) and per-tool (50/day for most tools, 30 for DR Reviewer since Nova Pro costs more per call):
```python
table.update_item(
    Key={"pk": f"usage#{today}", "sk": sk},
    UpdateExpression="ADD run_count :inc SET #d = :date",
    ConditionExpression="attribute_not_exists(run_count) OR run_count < :limit",
    ExpressionAttributeNames={"#d": "date"},  # "date" is a DynamoDB reserved word
    ExpressionAttributeValues={":inc": 1, ":limit": limit, ":date": today},
)
```
Layer 3 — Per-IP rate limiting
3 requests per minute per IP, using DynamoDB TTL'd counters:
```python
minute_bucket = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M")
pk = f"ratelimit#{source_ip}#{minute_bucket}"

table.update_item(
    Key={"pk": pk, "sk": "ALL"},
    UpdateExpression="ADD run_count :inc SET expires_at = :exp",
    ConditionExpression="attribute_not_exists(run_count) OR run_count < :limit",
    ExpressionAttributeValues={
        ":inc": 1,
        ":limit": IP_RATE_LIMIT,
        ":exp": int(time.time()) + 120,  # DynamoDB TTL cleans up the counter after the window
    },
)
```
Layer 4 — Bedrock token caps
Hard max_tokens per tool (400–800 depending on the tool). Input is also truncated to 8,000 characters before it reaches Bedrock. Most templates I tested were well under 3,000 characters, so the cap rarely triggers, but it bounds the worst case.
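A minimal sketch of the input side of that bound (the constant and function names are illustrative):

```python
MAX_INPUT_CHARS = 8_000  # hard cap applied before the prompt ever reaches Bedrock

def bound_input(raw: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Truncate oversized input so the worst-case token cost stays bounded."""
    if len(raw) <= max_chars:
        return raw
    return raw[:max_chars] + "\n[truncated]"
```

Together with the per-tool maxTokens setting in models.config.json, this bounds both sides of every Bedrock call: input size and output size.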
Layer 5 — Budget auto-shutoff
AWS Budget at $10/month → SNS → Lambda sets tools_enabled = false in DynamoDB:
```python
def handler(event, context):
    table.put_item(Item={
        "pk": "config",
        "sk": "global",
        "tools_enabled": False,
        "disabled_reason": "Monthly budget threshold reached.",
    })
```
Every handler checks this flag first. Worst case: tools temporarily unavailable. But never a surprise bill. (There's up to a ~5 minute lag between the budget alert and shutoff, so in-flight requests at alarm time aren't blocked. But at these volumes, the overshoot is negligible.)
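The flag check itself is a single DynamoDB read. A sketch of what each handler might run first (names are illustrative; the real code lives in the shared layer):

```python
def tools_enabled(table) -> tuple[bool, str]:
    """Read the global kill switch; a missing config item defaults to enabled."""
    item = table.get_item(Key={"pk": "config", "sk": "global"}).get("Item")
    if item and item.get("tools_enabled") is False:
        return False, item.get("disabled_reason", "Tools are temporarily disabled.")
    return True, ""
```

If the first value is False, the handler short-circuits with an error response (say, a 503) carrying the reason string, so a tripped budget never produces another Bedrock call.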
Security hardening
A few key controls worth highlighting:
IAM least privilege. bedrock:InvokeModel is scoped to specific inference profile and foundation model ARNs, auto-generated from models.config.json by serverless-models.js. No wildcards on any IAM policy.
S3 private + OAC. No public access. Only CloudFront can read from the bucket.
CORS. API Gateway allowedOrigins is restricted to the CloudFront domain. The Lambda response headers themselves use Access-Control-Allow-Origin: * because the response helper doesn't know the domain; the API relies on rate limiting and daily caps (not auth tokens) for protection, and the gateway-level restriction is the meaningful one.
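For reference, the gateway-level restriction might look like this in serverless.yml (a sketch following the Serverless Framework's httpApi.cors schema; the exact keys in the real project may differ):

```yaml
provider:
  httpApi:
    cors:
      allowedOrigins:
        - https://dr-toolkit.thecloudspark.com  # CloudFront domain only
      allowedMethods:
        - POST
      allowedHeaders:
        - Content-Type
```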
Prompt injection defense. All handlers use Bedrock's system parameter to separate instructions from user input. More on this in Part 2.
Full details in the Security Assessment doc in the repo.
What's next
That covers the architecture: the serverless stack, the central config, the 5-layer cost guardrails, and the security controls.
In the next part, we'll look at the tools themselves: the prompts behind each one, how to choose the right model per tool, the system prompt pattern for prompt injection defense, and the patterns that are reusable in any Bedrock project.
Try it / Fork it:
Live Demo: https://dr-toolkit.thecloudspark.com
DR Toolkit on AWS
AI-powered disaster recovery planning tool for AWS builders. Plan, document, and audit your DR posture with Amazon Bedrock. Resilience planning, accelerated by generative AI.
Tools
| # | Tool | Endpoint | Model | Daily Limit |
|---|------|----------|-------|-------------|
| 1 | Runbook Generator | POST /runbook | Nova Pro | 50/day |
| 2 | RTO/RPO Estimator | POST /rto-estimator | Nova Lite | 50/day |
| 3 | DR Strategy Advisor | POST /dr-advisor | Nova Lite | 50/day |
| 4 | Post-Mortem Writer | POST /postmortem | Nova Lite | 50/day |
| 5 | DR Checklist Builder | POST /checklist | Nova Lite | 50/day |
| 6 | Template DR Reviewer | POST /dr-reviewer | Nova Pro | 30/day |
Architecture
- Frontend: Next.js 16 (static export) + Tailwind CSS → S3 + CloudFront
- Backend: AWS Lambda (Python 3.14) → API Gateway HTTP API
- AI: Amazon Bedrock — Nova Lite (Tools 2–5), Nova Pro (Tools 1, 6)
- Database: DynamoDB single table dr-toolkit-usage (usage counters + feature flag)
- IaC: Serverless Framework v3 (serverless.yml)
- Region: ap-southeast-1 (Singapore)
Project Structure
```
dr-toolkit/
├── serverless.yml    # Serverless Framework
…
```
References:
- Disaster Recovery of Workloads on AWS — AWS Whitepaper
- Amazon Bedrock Developer Guide
- Amazon Bedrock Model Catalog
- Amazon Bedrock Cross-Region Inference
- Amazon Bedrock — Anthropic Claude Parameters
- CloudFront Origin Access Control
Originally published on DEV Community: https://dev.to/aws-builders/buildwithai-architecting-a-serverless-dr-toolkit-on-aws-123d
