Stories by Manjula Liyanage on Medium

Don’t Outsource Your Intuition to AI

Manjula Liyanage — Fri, 10 Apr 2026 19:15:51 GMT

I’ve been thinking about this lately.

Human intuition is one of the oldest tools we have. Before books, before systems, before frameworks — we had instinct. It helped us survive. Knowing when something feels off. Making quick decisions without overthinking. That was an evolutionary advantage.

In today’s world, we still use intuition every day. Sometimes you just know what to do. You can’t fully explain it, but your judgment feels right. That’s experience compressed into a feeling.

But not all intuition is useful anymore.

Some of it is still wired for survival in the savanna, not for modern life. We overreact, we fear things that are not real threats, we make biased decisions. So intuition is powerful, but not always correct. We need to know when to trust it — and when to pause and think properly.

Tools like ChatGPT are everywhere. And honestly, they are incredible. You can ask anything — work problems, life decisions, technical questions, even emotional stuff.

But I’m starting to see a pattern.

We ask AI for everything.

Instead of deciding, we validate.
Instead of acting, we research more.
Instead of trusting ourselves, we double-check with AI.

And slowly, we get stuck.

We feel like we are being smart — gathering more data, more opinions — but sometimes we are just avoiding the decision. Avoiding responsibility. Avoiding the discomfort of being wrong.

That’s where intuition still matters.

Not as the only tool — but as one of the tools.

Use AI when you need knowledge, perspective, or structure. But don’t forget — you already have something powerful inside you. Your intuition is built from your experiences, your failures, your wins.

Sometimes, it’s enough.

So maybe the goal is not choosing between AI and intuition.

It’s knowing when to switch.

Sometimes you ask AI.
Sometimes you trust yourself.

And sometimes — you just move.

Forget Courses. Start Skillmaxxing.

Manjula Liyanage — Fri, 20 Mar 2026 23:54:44 GMT


I see this pattern again and again.

People want to learn tech.
They start with a course.

They open Udemy, YouTube, maybe even Coursera.
Watch a few videos. Feel productive. Take some notes.

But after a few weeks… nothing.

No product.
No real understanding.
Just more “I think I know this”.

This is the problem.

Courses make you feel like you’re learning.
But building is what actually teaches you.

That’s where I think this idea of Skillmaxxing fits really well.

Not learning everything.
Not becoming an expert.

Just going hard on one skill, with real output.

What is Skillmaxxing (in our world)?

For me, Skillmaxxing means:

Pick one skill (not 10 things)
Use AI as your assistant
Learn by building something real
Repeat until it clicks

That’s it.

No long theory.
No perfect roadmap.

Just focused, practical learning.

Let’s take a real example: System Design

Now this is where many beginners get scared.

“System design is for senior engineers”
“I need years of experience”
“I should first learn backend, frontend, databases…”

No.

You don’t need to master everything.

You just need to understand how things connect.

That’s system design.

Not diagrams.
Not fancy words.

Just answering simple questions:

Where does my data go?
How does the user interact with my system?
What happens when things scale?
What can break?

How to Skillmaxx System Design (with AI)

Let’s say you want to build something simple.

Example:
A “founder task tracker” (like a lightweight Jira for founders)

Now, instead of watching a course, do this:

Step 1 — Ask AI to design it

Open ChatGPT and ask:

“Design a simple system architecture for a task tracking app for founders. Keep it simple, explain components.”

You’ll get something like:

Frontend (web app)
Backend API
Database
Authentication
Notifications

Good.

Don’t try to understand everything.

Just get familiar.

Step 2 — Go deeper, one piece at a time

Now pick one part.

For example: Database

Ask:

“What database should I use for this and why?”

Then:

“Show me a simple schema for tasks”

Then:

“How does this scale if I have 10,000 users?”

You are not memorising.

You are exploring.

Step 3 — Draw your own version

Now take a sheet of paper (or open Miro).

Draw:

User → Frontend → Backend → Database

Add arrows.

Add small notes.

This step is very important.

Because now you are thinking, not just reading.

Step 4 — Build a tiny version

Use any AI coding tool.

Doesn’t matter if it’s messy.

Just build:

Create task
Save task
Show task list

Now suddenly everything becomes real.

Database is not theory anymore.
API is not theory anymore.
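
To show how small “tiny” can be, here is a minimal sketch in TypeScript. Treat it as an illustration rather than the one right answer: the `Task` fields, the `TaskStore` class, and the in-memory array standing in for a database are all assumptions made for this example.

```typescript
// A tiny in-memory task tracker: create, save, list.
// The Task shape and the TaskStore name are illustrative assumptions.

type TaskStatus = "todo" | "doing" | "done";

interface Task {
  id: number;
  title: string;
  status: TaskStatus;
  createdAt: Date;
}

// Stand-in for the database: swap this class for real storage later.
class TaskStore {
  private tasks: Task[] = [];
  private nextId = 1;

  // "Create task" and "Save task" in one step.
  create(title: string): Task {
    const trimmed = title.trim();
    if (trimmed.length === 0) {
      throw new Error("Task title must not be empty");
    }
    const task: Task = {
      id: this.nextId++,
      title: trimmed,
      status: "todo",
      createdAt: new Date(),
    };
    this.tasks.push(task);
    return task;
  }

  // "Show task list": returns a copy of the array so callers
  // cannot add or remove stored tasks by accident.
  list(): Task[] {
    return [...this.tasks];
  }
}

const store = new TaskStore();
store.create("Email first 10 beta users");
store.create("Fix signup bug");
console.log(store.list().map((t) => `${t.id}. [${t.status}] ${t.title}`));
```

Replacing the array with a real database later is the kind of change that makes the “Database” box in your drawing concrete.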

Step 5 — Break it (this is where real learning happens)

Ask:

What if 1,000 users log in at the same time?
What if the database is slow?
What if a notification fails?

Then ask AI:

“How do I improve this system for scale?”

Now you are doing system design.

Not watching.

Not memorising.

The mindset shift

Courses try to give you complete knowledge.

But real skill comes from:

Partial understanding
Repeated exposure
Real problems

You don’t need to know everything.

You need to know enough to move forward.

That’s Skillmaxxing.

A Simple 4–8 Week Skillmaxxing Plan (System Design)

If you want something practical, follow this.

Week 1–2: Foundations (but lightweight)

Learn what frontend, backend, and database mean (use AI explanations)
Ask AI to show 3–4 simple architectures
Draw them in your own way

Week 3–4: First system

Pick one idea (task app, booking system, notification app)
Ask AI to design it
Break it into components
Build a basic version (very simple)

Week 5–6: Go deeper

Add one feature (auth, notifications, file upload)
Ask “how does this work internally?”
Learn only what you need

Week 7–8: Scale thinking

Ask “what happens with 10k users?”
Learn caching, queues (just basics)
Improve your design

Final thought

You don’t become good by finishing courses.

You become good by:

Trying
Breaking
Fixing
Repeating

AI just makes this faster.

So yeah…

Forget courses.
Start Skillmaxxing.

How I Keep My MVP Running for Free by Hopping Cloud Accounts (And Why You Can Too)

Manjula Liyanage — Tue, 27 Jan 2026 12:50:04 GMT

The cloud providers want you to succeed — so badly that they’ll give you hundreds of dollars in free credits just to try their platform.

Google Cloud gives you $300 for 90 days. AWS gives you a free tier for 12 months. Azure gives you $200 for 30 days. And nothing is stopping you from creating a new account when those credits run out.

I’ve been running OmniTrackr, a file monitoring and SLA tracking platform, on Google Cloud Platform’s free trial credits. When my first $300 ran out, I didn’t reach for my credit card. I created a new Google account, activated a fresh free trial, and migrated my entire non-production infrastructure over in a single afternoon.

No data lost. No extended downtime. No panic.

But here’s the catch: this only works if you set up your infrastructure the right way from day one. If you hardcoded project IDs, manually configured servers through the console, or stored secrets in your codebase, you’d be looking at days of painful migration instead of hours.

This post breaks down exactly how I structured my infrastructure to make cloud account hopping trivial, and how you can do the same. Even if you never plan to hop accounts, these practices will save you when you need to spin up a new environment, recover from a disaster, or hand off your project to a real DevOps team.

Why This Matters for Bootstrapped Founders

Let’s do the math.

A minimal non-prod environment on GCP with a Cloud SQL database, a Cloud Run service, a load balancer, and some storage costs roughly $80–100/month. That’s $960–1,200/year before you’ve made a single dollar.

For a bootstrapped founder validating product-market fit, that’s money better spent on marketing, user research, or keeping the lights on.
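
The math above, plus how far a $300 trial credit stretches at that spend, as a quick sketch. The monthly range and the credit amount are the figures quoted in this post; the function names are made up for illustration.

```typescript
// Monthly cost range quoted above, in US dollars.
const monthlyCost = { low: 80, high: 100 };

// Yearly cost is twelve months of the same bill.
function yearlyRange(monthly: { low: number; high: number }) {
  return { low: monthly.low * 12, high: monthly.high * 12 };
}

// How many months a fixed credit covers at a given monthly spend.
function monthsCovered(credit: number, monthlySpend: number): number {
  return credit / monthlySpend;
}

const yearly = yearlyRange(monthlyCost);
console.log(`Yearly: $${yearly.low}-$${yearly.high}`); // Yearly: $960-$1200
console.log(monthsCovered(300, monthlyCost.high)); // 3
console.log(monthsCovered(300, monthlyCost.low)); // 3.75
```

At this spend, a $300 credit lasts roughly 3 to 3.75 months, which is about the length of the 90-day trial window.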

With account hopping, I’ve run my demo environment for months at zero cost. That’s a real runway. And I’m not cutting corners — I’m running a full production-grade setup with:

- PostgreSQL database with automated backups

- API service with auto-scaling on Cloud Run

- Three background worker jobs on scheduled triggers

- A global HTTPS load balancer with managed SSL

- Frontend SPA hosting on Cloud Storage

- CI/CD pipelines deploying automatically on every push

- Secrets managed through a dedicated secrets manager

All of it migrated to a new account by changing three values in my Terraform files and running a handful of commands.

The Foundation: Infrastructure as Code from Day One

The single decision that makes account hopping possible is **Infrastructure as Code (IaC)**. If you build your MVP on a cloud platform and you’re clicking around the console to set things up, stop. Go learn the basics of Terraform. It will be the highest-ROI hour you spend on your entire project.

What Terraform Actually Does

Terraform lets you describe your entire cloud infrastructure in plain text files. Instead of clicking “Create Database” in the GCP console, you write:

```hcl
resource "google_sql_database_instance" "postgres" {
  name             = "omnitrackr-staging-db"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier            = "db-f1-micro"
    disk_size       = 10
    disk_type       = "PD_SSD"
    disk_autoresize = true
  }
}
```

Run `terraform apply`, and the database exists. Run it again on a different account, and an identical database exists there too. Run `terraform destroy`, and it’s all gone — cleanly, completely, no orphaned resources silently burning your credits.

My Terraform Setup

Here’s what OmniTrackr’s Terraform configuration manages:

```
infrastructure/terraform/staging/
  main.tf              # Provider config, backend state, API enablement
  variables.tf         # Parameterized values (project ID, region, etc.)
  terraform.tfvars     # Environment-specific values
  cloud_sql.tf         # PostgreSQL database
  cloud_run.tf         # API service
  cloud_run_jobs.tf    # Background workers + Cloud Scheduler triggers
  secrets.tf           # Secret Manager (DB password, JWT secret)
  iam.tf               # Service accounts and permissions
  storage.tf           # Frontend hosting bucket
  load_balancer.tf     # HTTPS load balancer with managed SSL
  outputs.tf           # Export IPs, URLs, connection strings
```

Every resource is defined in code. Every permission, every firewall rule, every cron schedule. Nothing is manually configured.

The Key: Parameterize Everything

The magic is in `variables.tf`, `terraform.tfvars`, and a technique called partial backend configuration. Every value that could change between accounts is either a variable or a CLI flag — nothing is hardcoded.

Here’s the catch most people miss: Terraform’s `backend` block does not support variables. You can’t write `bucket = var.state_bucket`. So if you hardcode the state bucket name in `main.tf`, switching between projects means editing source files — which is fragile and error-prone.

The solution is to leave the bucket empty and pass it at init time:

```hcl
# main.tf — partial backend config
terraform {
  backend "gcs" {
    prefix = "staging"
    # bucket provided via: terraform init -backend-config="bucket=BUCKET_NAME"
  }
}
```

```hcl
# variables.tf — parameterized project ID
variable "project_id" {
  description = "GCP Project ID"
  default     = "omnitrackr-staging-v2"
}
```

Then keep separate `.tfvars` files for each target:

```hcl
# terraform.tfvars — active project
project_id  = "omnitrackr-staging-v2"
region      = "us-central1"
environment = "staging"
```

```hcl
# terraform.tfvars.old — old project (kept for cleanup)
project_id  = "omnitrackr-staging"
region      = "us-central1"
environment = "staging"
```

When I migrate, I create the new `.tfvars`, update the default in `variables.tf`, and pass the new bucket at init:

```bash
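# -reconfigure makes init ignore the backend this directory was previously initialized with
terraform init -backend-config="bucket=omnitrackr-terraform-state-v2" -reconfigure
terraform apply # uses terraform.tfvars by default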
```

Every resource — database, services, load balancer, secrets — gets created in the new project. No manual configuration. And when it’s time to tear down the old project, I don’t edit a single file:

```bash
terraform init -backend-config="bucket=omnitrackr-terraform-state" -reconfigure
terraform destroy -var-file="terraform.tfvars.old"
```

Clean setup, clean teardown, using the same Terraform files throughout.

Environment Separation Isn’t Optional

In my previous article about mistakes non-technical founders make when building with AI, I wrote about Mistake #3: ignoring multi-environment configuration from the start. That lesson is central to what makes account hopping work.

If your staging and production environments aren’t cleanly separated, migrating one without breaking the other is a nightmare. Here’s what proper separation looks like:

Separate Everything

Separate GCP projects:

- `omnitrackr-staging` (or `omnitrackr-staging-v2` after migration)
- `omnitrackr-prod` (untouched during staging migration)

Separate GitHub Actions secrets:

- `GCP_PROJECT_ID_STAGING` / `GCP_PROJECT_ID_PROD`
- `GCP_SA_KEY_STAGING` / `GCP_SA_KEY_PROD`
- `DB_HOST_STAGING` / `DB_HOST_PROD`

Separate CI/CD workflows:

- `deploy-staging.yml` triggers on push to `develop`
- `deploy-prod.yml` triggers on push to `main`

Separate Terraform state:

- Staging: `terraform init -backend-config="bucket=omnitrackr-terraform-state-v2"` + `terraform.tfvars`
- Production: `terraform init -backend-config="bucket=omnitrackr-terraform-state-prod"` + `terraform.tfvars.prod`

When I migrated staging to a new GCP account, production wasn’t affected at all. Different project, different secrets, different workflows, different state. The two environments have zero coupling.
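
One way to keep those bucket and var-file pairings straight is a small lookup that prints the commands for a target. This is a hypothetical helper, not part of OmniTrackr: the `targets` map, the `staging-old` label, and the `terraformCommands` function are names invented for illustration. The bucket names, var-files, and Terraform flags are the ones used in this post.

```typescript
// Each target is one state bucket plus one var-file.
interface Target {
  stateBucket: string;
  varFile: string;
}

const targets: Record<string, Target> = {
  staging: { stateBucket: "omnitrackr-terraform-state-v2", varFile: "terraform.tfvars" },
  "staging-old": { stateBucket: "omnitrackr-terraform-state", varFile: "terraform.tfvars.old" },
  prod: { stateBucket: "omnitrackr-terraform-state-prod", varFile: "terraform.tfvars.prod" },
};

type Action = "plan" | "apply" | "destroy";

// Build the two commands to run, in order, for a given target.
function terraformCommands(targetName: string, action: Action): string[] {
  const target = targets[targetName];
  if (target === undefined) {
    throw new Error(`Unknown target: ${targetName}`);
  }
  return [
    `terraform init -backend-config="bucket=${target.stateBucket}" -reconfigure`,
    `terraform ${action} -var-file="${target.varFile}"`,
  ];
}

console.log(terraformCommands("staging-old", "destroy").join("\n"));
```

Because the bucket and the var-file always travel together, it becomes harder to run `destroy` against one project’s state with another project’s variables.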

Secrets Management Done Right

This is where I see a lot of founders cut corners. They hardcode database passwords in environment files, commit API keys to Git, or share secrets across environments.

OmniTrackr uses Google Secret Manager for all sensitive values. Terraform creates the secrets automatically:

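```hcl
# Generate a fresh random password for this environment
resource "random_password" "db_password" {
  length  = 32
  special = true
}

# The secret itself (a named container for versions)
resource "google_secret_manager_secret" "db_password" {
  secret_id = "db-password"

  # Nested blocks cannot go inside a single-line block, so this is spelled out
  replication {
    auto {}
  }
}

# The secret's value, stored as a version of the secret above
resource "google_secret_manager_secret_version" "db_password" {
  secret      = google_secret_manager_secret.db_password.id
  secret_data = random_password.db_password.result
}
```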

When I create a new project, Terraform generates fresh passwords and secrets. I never copy passwords between accounts. I never store them in files that could leak. The Cloud Run services pull secrets at runtime from Secret Manager with least-privilege IAM permissions.

The only place secrets live outside GCP is in GitHub Actions secrets — and those are updated manually during migration, one at a time, through the GitHub UI.

The Actual Migration: What It Looks Like

Here’s the condensed version of what an account hop involves. The full migration plan lives in my GCP Staging Migration Plan, but this is the essence.

Step 1: Backup Your Data

```bash
# Export database
pg_dump "postgresql://user:pass@old-host:5432/omnitrackr" \
  --format=custom \
  --file=backup.dump
```

This is the only manual step that involves actual data. Everything else is infrastructure that gets recreated from code.

Step 2: Set Up the New Account

Create a new Google account. Activate the free trial. Then run a few `gcloud` commands to set up the project shell:

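```bash
gcloud projects create omnitrackr-staging-v2
# Make the new project the active one so the next commands target it
gcloud config set project omnitrackr-staging-v2
gcloud services enable run.googleapis.com sqladmin.googleapis.com ...
gsutil mb gs://omnitrackr-terraform-state-v2
```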

Step 3: Update the tfvars and Init Against the New Backend

```bash
# Copy old tfvars for later cleanup, update the active one
cp terraform.tfvars terraform.tfvars.old
# Edit terraform.tfvars: project_id = "omnitrackr-staging-v2"
# Edit variables.tf default to match
```

No need to touch `main.tf` — the backend bucket is passed via CLI, not hardcoded.

Step 4: Apply

```bash
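# -reconfigure makes init ignore the old state bucket this directory was initialized with
terraform init -backend-config="bucket=omnitrackr-terraform-state-v2" -reconfigure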
terraform apply
```

Terraform creates the entire infrastructure from scratch in the new project: database, services, load balancer, secrets, IAM, storage bucket, scheduler jobs — everything.

Step 5: Restore Data and Update DNS

```bash
# Restore database
pg_restore --host=NEW_IP --dbname=omnitrackr backup.dump

# Update DNS to point to new load balancer IP
gcloud dns record-sets update staging.omnitrackr.dev. \
  --zone=DNS_ZONE_NAME --type=A --rrdatas=NEW_LB_IP
```

Step 6: Update GitHub Secrets and Deploy

Update the handful of secrets in GitHub (new project ID, new service account key), push to `develop`, and your CI/CD pipeline deploys everything to the new account.

Step 7: Tear Down the Old Account

```bash
# Point Terraform at the old state bucket and destroy using the old tfvars
terraform init -backend-config="bucket=omnitrackr-terraform-state" -reconfigure
terraform destroy -var-file="terraform.tfvars.old"
```

Clean. Complete. No orphaned resources.

What Makes This Work (And What Breaks It)

This works because:

1. Every resource is defined in Terraform. Nothing was clicked into existence through the console. If it’s not in `.tf` files, it doesn’t exist.

2. All configuration is parameterized. Project IDs, regions, and environment names are variables, not hardcoded strings scattered across 20 files.

3. Environments are fully isolated. Staging and production share nothing — no projects, no secrets, no state files, no CI/CD triggers.

4. Secrets are generated, not copied. Terraform creates fresh passwords for each environment. No password reuse, no secret files floating around.

5. CI/CD reads from environment-specific secrets. GitHub Actions workflows reference `GCP_PROJECT_ID_STAGING` (not a hardcoded project name), so updating one secret propagates everywhere.

6. DNS is the only external dependency. The domain registrar and DNS zone are the only things that bridge old and new accounts. Everything else is self-contained.

This breaks if:

1. You configured resources manually through the cloud console. If you added a firewall rule by clicking buttons, Terraform doesn’t know about it, and it won’t exist in the new project.

2. You hardcoded project IDs or bucket names in source files. If your `main.tf` has `bucket = "my-specific-bucket"` or your API has `project: "omnitrackr-staging"` buried in a service file, you’ll have to edit files every time you switch targets.

3. You share secrets across environments. If staging and production use the same JWT secret, rotating one breaks the other.

4. Your Terraform state is local. If your `.tfstate` file lives on your laptop instead of a cloud bucket, you can’t collaborate, and you can’t cleanly re-initialize for a new project.
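
A rough way to catch point 2 before it bites is to scan your files for values that should only live in `.tfvars` files or CLI flags. The sketch below is a hypothetical check, not something from the OmniTrackr repo: `findHardcoded`, the list of allowed files, and the sample file contents are assumptions for illustration.

```typescript
interface Finding {
  file: string;
  line: number; // 1-based line number
  value: string;
}

// files maps a path to its text; reading from disk is left to the caller.
// allowed lists the files where these values are expected to appear.
function findHardcoded(
  files: Record<string, string>,
  values: string[],
  allowed: string[]
): Finding[] {
  const findings: Finding[] = [];
  for (const file of Object.keys(files)) {
    if (allowed.indexOf(file) !== -1) continue;
    const lines = files[file].split("\n");
    lines.forEach((text, index) => {
      for (const value of values) {
        if (text.indexOf(value) !== -1) {
          findings.push({ file, line: index + 1, value });
        }
      }
    });
  }
  return findings;
}

// Sample input: two offenders and one file where the value belongs.
const report = findHardcoded(
  {
    "main.tf": 'terraform {\n  backend "gcs" {\n    bucket = "omnitrackr-terraform-state"\n  }\n}',
    "api/src/storage.ts": 'const client = { project: "omnitrackr-staging" };',
    "terraform.tfvars": 'project_id = "omnitrackr-staging-v2"',
  },
  ["omnitrackr-staging", "omnitrackr-terraform-state"],
  ["terraform.tfvars", "terraform.tfvars.old", "variables.tf"]
);
console.log(report);
```

With the sample input, the check flags the bucket in `main.tf` and the project ID in the API file, and skips `terraform.tfvars`, where the value is supposed to live.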

Lessons from the Trenches

Start with Terraform Even If You Don’t Plan to Migrate

Even if you never hop accounts, Terraform pays for itself:

- Disaster recovery. Your entire infrastructure can be recreated from code. If something breaks catastrophically, `terraform apply` gets you back.

- Documentation. Your `.tf` files are living documentation of exactly what infrastructure exists and how it’s configured.

- Onboarding. When you hire a DevOps engineer, they can read your Terraform files and understand your entire setup in minutes instead of spelunking through the cloud console.

- Cost visibility. Every resource you pay for is listed in your codebase, so nothing keeps running (and billing) without you knowing it exists.

Keep DNS Separate

I made a deliberate choice to manage DNS through the GCP Console rather than Terraform. This means `terraform destroy` doesn’t touch my DNS zone. During migration, I just update the A record to point to the new load balancer IP.

DNS zones are cheap (around $0.20/month on Google Cloud DNS), so keeping the zone in the old account while everything else moves to the new one is fine as a transitional strategy. You can migrate DNS to a registrar like GoDaddy or Cloudflare later at your convenience.

Accept the Brief Downtime

During migration, there’s a window where DNS is pointing to the new load balancer but the SSL certificate hasn’t provisioned yet. For staging, this is a non-issue. For production, you’d want to plan this during a maintenance window and set DNS TTLs low (60 seconds) beforehand.

For a pre-revenue MVP, an hour of downtime on staging is nothing. Don’t over-engineer the zero-downtime migration if you have five users.

Document the Migration Process

I wrote a detailed migration plan before touching anything. Every command, every file change, every secret to update. This isn’t just for the current migration — it’s the playbook for the next one. When this free trial runs out, I’ll spend half the time because the process is documented.

This connects back to my earlier point about documenting architectural decisions — the same discipline that helps you build maintainable code helps you build maintainable infrastructure.

The Broader Point: Build Your MVP Like It’s Going to Move

Cloud account hopping is just one application of a deeper principle: build your infrastructure to be portable from day one.

You might need to move because:

- Free credits ran out (the account hopping scenario)

- You’re switching cloud providers entirely (GCP to AWS)

- A client requires infrastructure in a specific region or cloud

- You need to spin up isolated environments for enterprise customers

- You’re handing off to a DevOps team that has their own standards

In every case, the founders who invested in Infrastructure as Code, environment separation, and proper secrets management will adapt in days. The founders who clicked through consoles and hardcoded values will be stuck for weeks.

Quick-Start Checklist for New MVPs

If you’re starting a new project today and want to keep your options open:

- Use Terraform (or Pulumi/CDK) from day one. Even a basic setup is better than clicking through consoles.
- Parameterize your project ID, region, and environment name. Never hardcode these values.
- Use partial backend configuration. Don’t hardcode your state bucket in `main.tf`; pass it via `terraform init -backend-config="bucket=…"`. This lets you switch between projects and environments without editing source files.
- Use separate `.tfvars` files per environment (e.g., `terraform.tfvars`, `terraform.tfvars.prod`). Load them with `terraform apply -var-file="…"`. One Terraform directory, multiple environments, no copy-pasting `.tf` files.
- Store Terraform state in a cloud bucket, not on your laptop. Use a separate bucket per environment.
- Create separate secrets for each environment in your CI/CD system (`_STAGING` / `_PROD` suffixes).
- Use your cloud’s secrets manager for passwords and API keys. Never commit secrets to Git.
- Generate fresh secrets per environment. Don’t copy passwords between accounts.
- Keep DNS management separate from your main infrastructure so it survives teardowns.
- Document your infrastructure. Future you (or your future DevOps hire) will thank you.
- Write a migration playbook the first time you set things up, while the process is fresh.

Final Thoughts

There’s a certain irony in using engineering best practices to save money on cloud bills. The same infrastructure discipline that makes your MVP production-ready also makes it free-trial-ready.

But that’s the thing about doing things right: it pays off in ways you don’t expect. I set up Terraform because I wanted reproducible deployments. I separated environments because I didn’t want staging bugs in production. I used Secret Manager because I didn’t want credentials in Git.

And as a side effect, I can migrate my entire staging environment to a fresh GCP account in an afternoon and keep building without spending a dollar on infrastructure.

That’s the kind of leverage bootstrapped founders need.

Need Help Getting Your Technical Foundation Right?

I’ve been through the pain of setting up cloud infrastructure, making architectural decisions, and building an MVP from scratch — so you don’t have to learn these lessons the hard way.

If you’re a non-technical founder building a SaaS product and you need help with:

- Cloud infrastructure setup — Terraform, CI/CD pipelines, environment separation

- Architecture decisions — database design, API structure, secrets management

- Technical strategy— choosing the right stack, planning for scale, avoiding costly mistakes early

- AI-assisted development — getting the most out of tools like Claude and Cline without accumulating technical debt

I offer a Fractional CTO Service — helping founders make the right technical decisions from day one so they can focus on finding product-market fit instead of fighting infrastructure fires.

Book a free consultation and let’s talk about your project.

5 Critical Mistakes Non-Technical Founders Make When Building MVPs with AI (And How to Avoid Them)

Manjula Liyanage — Tue, 25 Nov 2025 22:17:29 GMT


Introduction

You’ve got a brilliant SaaS idea. You’re ready to build. But there’s one problem: you don’t have a technical co-founder, and hiring a development team would burn through your runway before you even validate product–market fit.

Enter AI dev tools like Claude, GitHub Copilot, and ChatGPT. They promise to democratize software development, letting anyone build production-ready applications with the right prompts. And they deliver — to a point.

I recently built OmniTrackr, a file monitoring and SLA tracking platform, using Claude as my primary development partner. The AI helped me ship a full-stack application with a React frontend, Express API, PostgreSQL database, background workers, and automated CI/CD pipelines — all without writing most of the code myself.

But here’s what nobody tells you: the quality of your AI-built MVP depends entirely on how well you prompt it. Make the wrong assumptions early, and you’ll spend days (and thousands of tokens) refactoring. Skip critical architectural decisions, and you’ll hand off a codebase that real developers will want to rewrite from scratch.

After building OmniTrackr and accumulating over 300KB of documentation, countless refactoring sessions, and valuable lessons learned, I’ve identified the 5 most critical mistakes non-technical founders make when “vibe coding” with AI.

These aren’t just theoretical pitfalls — they’re real mistakes I made, complete with the technical debt they created and how I fixed them. Whether you’re building a proof-of-concept or an MVP you plan to scale, avoiding these mistakes will save you time, money, and your sanity.

Mistake #1: Skipping Architecture Documentation and Design Decisions

The Mistake

When you start building with AI, it’s tempting to jump straight into feature development. You type “build me a user dashboard” and watch the code materialize. It feels like magic.

But here’s the trap: AI generates code that works now, but you need a codebase that can be understood and maintained later — preferably by a team of real developers you’ll hire once you’ve validated your idea.

Without proper documentation of your architectural decisions, database schema design, and system design, you’re creating a mystery box that even you won’t understand in six months.

Why This Matters

You’re not building a throwaway prototype. You’re building the foundation for a company. Future developers need to understand:

Why certain architectural decisions were made
What trade-offs were considered
How different components interact
When to use specific patterns or approaches

What I Did Wrong

Initially, I focused solely on getting features working. The AI would suggest an approach, I’d approve it, and we’d move on. No documentation, no decision logs.

Three weeks later, when I wanted to refactor the file source connection logic, I couldn’t remember why we chose a certain database schema. Did we consider alternatives? Were there trade-offs? I had no idea.

The Fix

I started instructing the AI to document every major decision.

Prompt template:

Before implementing [feature], create a design document that covers:
1. Problem statement and requirements
2. Proposed solution with alternatives considered
3. Database schema design with justification
4. API contract specifications
5. Security considerations
6. Testing strategy
7. Deployment plan
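
If you reuse this template often, it can help to generate it from a function so every feature gets the same seven sections. A minimal sketch in TypeScript: `designDocPrompt` and `designDocSections` are names invented for this example, and the wording is copied from the template above.

```typescript
// The seven sections from the prompt template.
const designDocSections: string[] = [
  "Problem statement and requirements",
  "Proposed solution with alternatives considered",
  "Database schema design with justification",
  "API contract specifications",
  "Security considerations",
  "Testing strategy",
  "Deployment plan",
];

// Fill in the feature name and number the sections.
function designDocPrompt(feature: string): string {
  const header = `Before implementing ${feature}, create a design document that covers:`;
  const numbered = designDocSections.map((section, i) => `${i + 1}. ${section}`);
  return [header].concat(numbered).join("\n");
}

console.log(designDocPrompt("the S3 file source"));
```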

Real Examples from OmniTrackr

Some of the design docs that ended up saving me countless hours:

Schema Split Design — Documents the decision to split a monolithic file_sources table into normalized tables (source_connections, schedules, watchers). Includes alternatives considered, migration strategy, and trade-off analysis.
Credential Storage Architecture — Details the multi-layered approach to storing AWS credentials (Secrets Manager + encrypted database fallback), including security considerations and key rotation strategy.
Data Model and API Plan — Comprehensive overview of the entire data model with entity relationships, API endpoints, and business logic.
S3 File Source Design — A deep-dive into handling S3 file sources, including credential validation, bucket access patterns, and error handling strategies.
Repository Structure — Documents the monorepo structure, package dependencies, and architectural patterns.

Action Items

✅ Create a /docs folder from day one
✅ Document before building major features
✅ Include alternatives considered to show thought process
✅ Explain trade-offs so future developers understand constraints
✅ Update docs during refactoring to reflect learnings

Mistake #2: Not Thinking About Component Reusability and Organization

The Mistake

You prompt the AI: “Build a page where users can create, read, update, and delete file watchers.”

The AI delivers. It creates a beautiful feature with inline forms to create schedules, departments, and connections right from the watcher creation flow. Everything works perfectly.

But here’s the problem: the AI puts all the code in /src/components/watchers/ because that’s where you told it to build the feature.

Now you need those same schedule, department, and connection components on other pages. But they’re tightly coupled to the watcher workflow, nested three folders deep in watcher-specific code.

Why This Matters

Features you build in isolation often need to be reused elsewhere. When you don’t plan for component reusability upfront:

Components become tightly coupled to specific workflows
You end up duplicating code across different pages
Refactoring costs multiply (time, tokens, bugs)
You introduce inconsistent UX when duplicate components diverge

What I Did Wrong

I asked the AI to build inline creation for schedules, connections, and departments within the watcher creation flow — before building standalone pages for these entities.

The result? All the components ended up in /src/components/watchers/:

src/components/watchers/
├── ConnectionCombobox.tsx
├── ConnectionForm.tsx
├── DepartmentCombobox.tsx
├── DepartmentForm.tsx
├── ScheduleCombobox.tsx
├── ScheduleForm.tsx
├── WatcherForm.tsx
└── WatcherSheet.tsx

These should have been in:

/src/components/connections/
/src/components/departments/
/src/components/schedules/

When I later needed a standalone Connections page, I had to refactor everything. This created:

Token cost for extensive refactoring
Time waste re-generating similar code
Bugs from missed references during the move
Inconsistencies between old and new versions

The Fix

Think through your entire feature set BEFORE building individual features.

Better prompt strategy:

I need to build a watcher management system that will eventually include:
1. Standalone pages for Connections, Schedules, and Departments
2. Inline creation of these entities within other workflows
3. Reusable components across multiple pages

Please structure the codebase with:
- Shared components in /src/components/[entity]/
- Each entity gets: Combobox, Form, Sheet, Table components
- Components should accept onSuccess callbacks for flexibility
- Keep business logic separate from UI components

Proper Component Structure

Here’s how OmniTrackr should have been structured from the start:

src/components/
├── connections/
│   ├── ConnectionCombobox.tsx
│   ├── ConnectionForm.tsx
│   └── ConnectionSheet.tsx
├── departments/
│   ├── DepartmentCombobox.tsx
│   ├── DepartmentForm.tsx
│   └── DepartmentSheet.tsx
├── schedules/
│   ├── ScheduleCombobox.tsx
│   ├── ScheduleForm.tsx
│   └── ScheduleSheet.tsx
└── watchers/
    ├── WatcherForm.tsx
    └── WatcherSheet.tsx
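The onSuccess idea from the prompt can be sketched without any UI framework. This is a minimal illustration, not OmniTrackr's actual code: `ConnectionRecord`, `createConnectionForm`, and the in-memory `saveInMemory` function are hypothetical names I made up for the example. The point is that the form logic knows nothing about watchers, and each caller decides what happens after a save.

```typescript
// Hypothetical shapes for illustration only; not taken from the OmniTrackr codebase.
interface ConnectionRecord {
  id: number;
  name: string;
}

interface ConnectionFormOptions {
  // Persists the connection; a real app would call the API here.
  save: (connectionName: string) => ConnectionRecord;
  // The caller decides what to do with the newly created record.
  onSuccess: (created: ConnectionRecord) => void;
}

// Form logic that would live with the entity (for example in /src/components/connections/).
// It has no knowledge of the workflow that embeds it.
function createConnectionForm(options: ConnectionFormOptions) {
  return {
    submit(connectionName: string): void {
      const trimmed = connectionName.trim();
      if (trimmed.length === 0) {
        throw new Error('Connection name is required');
      }
      options.onSuccess(options.save(trimmed));
    },
  };
}

// In-memory stand-in for the backend.
const savedConnections: ConnectionRecord[] = [];
const saveInMemory = (connectionName: string): ConnectionRecord => {
  const record = { id: savedConnections.length + 1, name: connectionName };
  savedConnections.push(record);
  return record;
};

// Caller 1: the standalone Connections page refreshes its list.
const standalonePageState = { refreshCount: 0 };
const standalonePageForm = createConnectionForm({
  save: saveInMemory,
  onSuccess: () => { standalonePageState.refreshCount += 1; },
});

// Caller 2: the watcher flow pre-selects the connection that was just created.
const watcherFlowState: { selectedConnectionId: number | null } = { selectedConnectionId: null };
const watcherInlineForm = createConnectionForm({
  save: saveInMemory,
  onSuccess: (created) => { watcherFlowState.selectedConnectionId = created.id; },
});

standalonePageForm.submit('Reports bucket');
watcherInlineForm.submit('  Invoices bucket  ');
```

Both callers reuse the same form logic; only the callback differs. That is what lets the component move out of the watchers folder without changing its behaviour.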

Action Items

✅ Map your domain model before building features
✅ Identify reusable entities (users, departments, settings, etc.)
✅ Build shared components first, before feature-specific workflows
✅ Use composition with callbacks rather than tight coupling
✅ Review component organization after each major feature

Mistake #3: Ignoring Multi-Environment Configuration from the Start

The Mistake

You’re building locally. Everything works. The AI hardcodes database credentials, API endpoints, and configuration values directly in the code.

Then you want to deploy to staging. Suddenly, you need different database credentials, different AWS buckets, and different secrets. You start doing find-and-replace across your codebase.

This is a nightmare.

Why This Matters

Modern applications need to run in multiple environments:

Development — Your local machine with test data
Staging — Cloud environment that mirrors production (I actually use mine for testing, so “staging” is the wrong name; naming environments inaccurately was another mistake)
Production — Real users with real data
CI/CD — Automated testing and deployment pipelines

Each environment needs different:

Database connections
API keys and secrets
Feature flags
Logging levels
Third-party service endpoints

Hardcoding these values or handling them inconsistently leads to:

Accidentally using production data in development
Secrets committed to Git
Broken deployments
Security vulnerabilities

What I Did Wrong

Initially, I had a single .env file with hardcoded values. When deploying to Google Cloud Run, I realized:

The staging database shouldn’t be the same as dev
Environment variables needed to be injected via Cloud Run configuration
Local development needed Cloud SQL Proxy for secure database access
Different environments needed different logging levels and error handling

The Fix

I restructured the configuration system with environment-specific files and validation.

Key pattern:

// packages/api/knexfile.ts
import path from 'path';
import dotenv from 'dotenv';

// Load environment-specific .env files with priority:
// .env.{NODE_ENV}.local > .env.{NODE_ENV} > .env
// dotenv does not overwrite variables that are already set, so the
// highest-priority file has to be loaded first.

const nodeEnv = process.env.NODE_ENV || 'development';

dotenv.config({ path: path.resolve(__dirname, `.env.${nodeEnv}.local`) });
dotenv.config({ path: path.resolve(__dirname, `.env.${nodeEnv}`) });
dotenv.config({ path: path.resolve(__dirname, '.env') });

Environment files:

.env                    # Default values, safe to commit
.env.development        # Dev-specific values
.env.development.local  # Your personal overrides (gitignored)
.env.staging            # Staging configuration
.env.production         # Production configuration
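To make the precedence rule concrete, here is a small sketch of the layering as a pure function. `mergeEnvLayers` and the sample values are hypothetical; the function only illustrates the rule ".env.{NODE_ENV}.local beats .env.{NODE_ENV} beats .env" and is not how dotenv itself is implemented.

```typescript
type EnvLayer = Record<string, string | undefined>;

// Merge layers from lowest to highest priority; later layers win.
// Undefined values are skipped so a missing key never erases a lower layer.
function mergeEnvLayers(...layersLowToHigh: EnvLayer[]): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const layer of layersLowToHigh) {
    for (const key of Object.keys(layer)) {
      const value = layer[key];
      if (value !== undefined) {
        merged[key] = value;
      }
    }
  }
  return merged;
}

// Hypothetical file contents, already parsed into key/value pairs.
const baseEnv: EnvLayer = { LOG_LEVEL: 'info', DB_NAME: 'app' };            // .env
const stagingEnv: EnvLayer = { LOG_LEVEL: 'debug', DB_HOST: 'staging-db' }; // .env.staging
const localOverrides: EnvLayer = { DB_HOST: '127.0.0.1' };                  // .env.staging.local

const effectiveEnv = mergeEnvLayers(baseEnv, stagingEnv, localOverrides);
// effectiveEnv: LOG_LEVEL = 'debug', DB_NAME = 'app', DB_HOST = '127.0.0.1'
```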

Validation at startup:

function validateEnvVars(requiredVars: string[]): void {
  const missing = requiredVars.filter((varName) => !process.env[varName]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(', ')}\n` +
      `Make sure you have created .env.${nodeEnv}.local file`
    );
  }
}

if (nodeEnv === 'staging' || nodeEnv === 'production') {
  validateEnvVars(['DB_HOST', 'DB_NAME', 'DB_USER', 'DB_PASSWORD']);
}
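Checking that a variable exists is only half the job; the values also need the right shape. Below is a rough sketch of a typed loader that reports every problem at once. `loadDatabaseSettings`, `DB_PORT`, and `LOG_LEVEL` are my own illustrative names and variables, not part of the knexfile above, and the environment is passed in as a parameter so the function is easy to test.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface DatabaseSettings {
  host: string;
  port: number;
  name: string;
  logLevel: LogLevel;
}

const LOG_LEVELS: LogLevel[] = ['debug', 'info', 'warn', 'error'];

function isLogLevel(value: string): value is LogLevel {
  return LOG_LEVELS.indexOf(value as LogLevel) !== -1;
}

// Parses and validates settings, collecting all problems before failing.
function loadDatabaseSettings(env: Record<string, string | undefined>): DatabaseSettings {
  const problems: string[] = [];

  const host = env.DB_HOST;
  if (!host) problems.push('DB_HOST is missing');

  const name = env.DB_NAME;
  if (!name) problems.push('DB_NAME is missing');

  // Number('abc') is NaN and Number('') is 0, so both fail the range check.
  const port = Number(env.DB_PORT ?? '5432');
  if (!(port > 0 && port < 65536 && Math.floor(port) === port)) {
    problems.push('DB_PORT must be an integer between 1 and 65535');
  }

  const logLevel = env.LOG_LEVEL ?? 'info';
  if (!isLogLevel(logLevel)) {
    problems.push(`LOG_LEVEL must be one of: ${LOG_LEVELS.join(', ')}`);
  }

  if (problems.length > 0 || !host || !name || !isLogLevel(logLevel)) {
    throw new Error(`Invalid configuration: ${problems.join('; ')}`);
  }
  return { host, port, name, logLevel };
}
```

In an app you would call it once at startup with the process environment, so a bad deploy fails immediately instead of at the first database query.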

Cloud SQL Proxy support:

connection: (() => {
  const useProxy = process.env.USE_CLOUD_SQL_PROXY === 'true';
  if (useProxy) {
    return {
      host: '127.0.0.1',
      port: parseInt(process.env.CLOUD_SQL_PROXY_PORT || '5433', 10),
      // Proxy handles encryption, no SSL needed
    };
  }
  return {
    host: process.env.DB_HOST,
    // Encrypts the connection but skips certificate verification;
    // supply the server CA instead if you need to verify the server identity.
    ssl: { rejectUnauthorized: false },
  };
})()

Real-World Example

The staging environment now runs with:

A separate PostgreSQL database
Google Cloud Run deployment
Cloud SQL Proxy for secure connections
Environment-specific secrets in Google Secret Manager
Automated deployments on push to the develop branch

Prompt Strategy

From the beginning, tell your AI:

Structure this application to support multiple environments (dev, staging, production):

1. Use environment variables for ALL configuration
2. Create .env.example with all required variables
3. Support layered config: .env → .env.{NODE_ENV} → .env.{NODE_ENV}.local
4. Validate required environment variables at startup
5. Never commit actual secrets to Git
6. Support both local development and cloud deployment
7. Use Cloud SQL Proxy for secure database connections in development

Action Items

✅ Use environment variables from day one
✅ Create .env.example documenting all required variables
✅ Implement config validation that fails fast with helpful errors
✅ Add .env*.local to .gitignore
✅ Support Cloud SQL Proxy for safe remote database access
✅ Document environment setup in your README

Mistake #4: Not Designing Your Database Schema Properly Upfront

The Mistake

You tell the AI: “Create a database table for file sources that stores S3 connection info, file patterns, and scheduling configuration.”

The AI creates a single file_sources table with 30+ columns mixing:

Connection credentials
Bucket configuration
File patterns and filters
Scheduling settings
Monitoring configuration
Status and tracking data

It works! You ship the feature.

Two weeks later, you realize you need to:

Reuse the same S3 connection across multiple file sources
Create schedules that apply to multiple resources
Track file changes separately from configuration

Now you’re in trouble. You need to split one monolithic table into 5 normalized tables, migrate existing data, update all your API endpoints, refactor the frontend, and update documentation.

This is exactly what happened to OmniTrackr.

Why This Matters

Database schema is the foundation of your application. Poor database design leads to:

Data duplication — Same S3 credentials stored in multiple rows
Update anomalies — Changing a schedule requires updating multiple records
Deletion problems — Can’t delete a schedule without losing file source config
Query complexity — Retrieving related data requires massive JOINs or multiple queries
Scaling issues — Table grows unwieldy with mixed concerns
Migration nightmares — Refactoring requires complex data migrations

What I Did Wrong

I started with a monolithic file_sources table that mixed everything:

CREATE TABLE file_sources (
  id SERIAL PRIMARY KEY,
  name TEXT,
  type TEXT,

  -- Connection config (should be separate table)
  aws_access_key_id TEXT,
  aws_secret_access_key TEXT,
  aws_region TEXT,
  bucket_name TEXT,

  -- Scheduling (should be separate table)
  schedule_type TEXT,
  schedule_value TEXT,
  schedule_timezone TEXT,

  -- File patterns (should be separate table)
  file_patterns JSONB,
  file_filters JSONB,

  -- Monitoring (should be separate table)
  sla_max_age_hours INTEGER,
  sla_alert_recipients JSONB,

  -- Status tracking (should be separate table)
  last_checked_at TIMESTAMPTZ,
  last_file_detected_at TIMESTAMPTZ,
  status TEXT

  -- ... 15 more columns
);

Problems this created:

Creating multiple watchers for the same S3 bucket meant duplicating credentials
Changing a schedule from “daily at 2 AM” to “every 6 hours” required updating multiple file sources
Couldn’t share departments or tags across different file sources
Couldn’t track file-level changes independently from configuration
The table had 30+ columns and kept growing

The Fix

I had to perform a major schema redesign, splitting the monolithic table into a normalized structure.

New normalized schema:

-- Reusable connection configuration
CREATE TABLE source_connections (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  type TEXT NOT NULL,  -- 's3', 'azure', 'gcp'
  connection_config JSONB NOT NULL,  -- Credentials and config
  created_by INTEGER REFERENCES users(id)
);

-- Reusable schedules
CREATE TABLE schedules (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  type TEXT NOT NULL,  -- 'daily', 'interval', 'cron'
  config JSONB NOT NULL,  -- { hour: 14, minute: 30 } or { interval_hours: 6 }
  timezone TEXT DEFAULT 'UTC'
);

-- Specific file watchers
CREATE TABLE watchers (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  connection_id INTEGER REFERENCES source_connections(id),
  schedule_id INTEGER REFERENCES schedules(id),
  department_id INTEGER REFERENCES departments(id),

  -- Watcher-specific config
  file_pattern TEXT NOT NULL,
  file_filters JSONB,
  sla_max_age_hours INTEGER,

  status TEXT DEFAULT 'active'
);

-- File-level tracking
CREATE TABLE file_tracking (
  id SERIAL PRIMARY KEY,
  watcher_id INTEGER REFERENCES watchers(id),
  file_key TEXT NOT NULL,
  file_size BIGINT,
  last_modified TIMESTAMPTZ,
  first_seen_at TIMESTAMPTZ DEFAULT NOW(),
  last_seen_at TIMESTAMPTZ DEFAULT NOW()
);

-- Audit logs
CREATE TABLE watcher_logs (
  id SERIAL PRIMARY KEY,
  watcher_id INTEGER REFERENCES watchers(id),
  event_type TEXT NOT NULL,  -- 'check', 'error', 'sla_breach'
  message TEXT,
  details JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Benefits of normalized design:

✅ Reusability — one S3 connection used by multiple watchers
✅ Consistency — update a schedule in one place
✅ Flexibility — easy to add new connection types or schedule patterns
✅ Performance — smaller tables with better indexes
✅ Clarity — each table has a single, clear purpose
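Here is a tiny in-memory sketch of the "update a schedule in one place" benefit. The row shapes are trimmed-down versions of the tables above, and `describeWatchers` plus the sample rows are hypothetical; a real implementation would run a JOIN in PostgreSQL rather than loop in application code.

```typescript
// Trimmed-down row shapes based on the schedules and watchers tables.
interface ScheduleRow {
  id: number;
  name: string;
  type: 'daily' | 'interval' | 'cron';
  config: Record<string, number | string>;
}

interface WatcherRow {
  id: number;
  name: string;
  schedule_id: number;
  file_pattern: string;
}

const scheduleRows: ScheduleRow[] = [
  { id: 1, name: 'Nightly', type: 'daily', config: { hour: 2, minute: 0 } },
];

// Two watchers share the same schedule through schedule_id.
const watcherRows: WatcherRow[] = [
  { id: 10, name: 'Invoices', schedule_id: 1, file_pattern: 'invoices/*.csv' },
  { id: 11, name: 'Reports', schedule_id: 1, file_pattern: 'reports/*.pdf' },
];

// Roughly: SELECT w.name, s.name FROM watchers w JOIN schedules s ON s.id = w.schedule_id
function describeWatchers(watchers: WatcherRow[], schedules: ScheduleRow[]): string[] {
  const scheduleById: Record<number, ScheduleRow | undefined> = {};
  for (const schedule of schedules) {
    scheduleById[schedule.id] = schedule;
  }
  return watchers.map((watcher) => {
    const schedule = scheduleById[watcher.schedule_id];
    return `${watcher.name} -> ${schedule ? schedule.name : '(no schedule)'}`;
  });
}

const beforeChange = describeWatchers(watcherRows, scheduleRows);

// Change "daily at 2 AM" to "every 6 hours" by touching exactly one row.
scheduleRows[0] = { id: 1, name: 'Every 6 hours', type: 'interval', config: { interval_hours: 6 } };

const afterChange = describeWatchers(watcherRows, scheduleRows);
```

With the monolithic table, the same change would have meant updating the schedule columns on every affected file source.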

The Migration Nightmare

Refactoring required:

Creating new tables with migration scripts
Migrating data from the old table to the new normalized structure
Updating API endpoints to handle new relationships
Refactoring frontend components to work with separate entities
Rewriting queries to use JOINs
Updating tests for the new data structure
Dropping old tables after validation

It cost multiple days of refactoring and thousands of tokens, introduced bugs, and delayed feature development.
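The data migration step is where most of the risk sits, so here is a simplified sketch of the core idea: rows that share the same credentials and bucket collapse into one connection, and each old row becomes a watcher pointing at it. `splitFileSources` and the row shapes are hypothetical and cover only a few columns; the real migration ran against the database, not in memory.

```typescript
// A few of the columns from the old monolithic table (simplified).
interface LegacyFileSource {
  id: number;
  name: string;
  aws_access_key_id: string;
  aws_region: string;
  bucket_name: string;
  file_pattern: string;
}

interface MigratedConnection {
  id: number;
  name: string;
  type: 's3';
  aws_access_key_id: string;
  aws_region: string;
  bucket_name: string;
}

interface MigratedWatcher {
  id: number;
  name: string;
  connection_id: number;
  file_pattern: string;
}

function splitFileSources(rows: LegacyFileSource[]) {
  const connections: MigratedConnection[] = [];
  const watchers: MigratedWatcher[] = [];
  const connectionIdByKey: Record<string, number | undefined> = {};

  for (const row of rows) {
    // Rows with the same key, region, and bucket share one connection.
    const key = [row.aws_access_key_id, row.aws_region, row.bucket_name].join('|');
    let connectionId = connectionIdByKey[key];
    if (connectionId === undefined) {
      connectionId = connections.length + 1;
      connectionIdByKey[key] = connectionId;
      connections.push({
        id: connectionId,
        name: row.bucket_name,
        type: 's3',
        aws_access_key_id: row.aws_access_key_id,
        aws_region: row.aws_region,
        bucket_name: row.bucket_name,
      });
    }
    watchers.push({
      id: row.id,
      name: row.name,
      connection_id: connectionId,
      file_pattern: row.file_pattern,
    });
  }
  return { connections, watchers };
}
```

Running something like this against a copy of production data first, and comparing row counts before and after, is a cheap way to catch missed references before dropping the old table.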

How to Avoid This

Prompt your AI to design a proper schema before coding:

I'm building a file monitoring system. Before writing any code,
create a database schema design document that:

1. Lists all entities and their relationships
2. Shows entity-relationship diagrams
3. Identifies what should be normalized vs denormalized
4. Considers future extensibility (new connection types, schedule patterns)
5. Plans for audit logging and change tracking
6. Includes proper indexes for query performance
7. Documents why each design decision was made

Entities to consider:
- Users and authentication
- Source connections (S3, Azure, GCP) - REUSABLE
- Schedules (daily, interval, cron) - REUSABLE
- Departments/teams - REUSABLE
- File watchers (specific file patterns to monitor)
- File tracking (individual file metadata)
- Audit logs and events
- SLA definitions and breach tracking

Ask the AI to explain trade-offs:

“Why did you choose this design over alternative X?”
“What are the pros/cons of normalizing this data?”
“How will this schema handle [specific future requirement]?”

Action Items

✅ Design your schema before coding features
✅ Identify reusable entities vs feature-specific tables
✅ Normalize appropriately — avoid massive tables mixing concerns
✅ Plan for extensibility — new types, patterns, configurations
✅ Document your schema with ERD diagrams and explanations
✅ Use migrations properly — never alter production schema manually
✅ Add proper indexes based on expected queries

Mistake #5: Neglecting Type Safety and Input Validation from Day One

The Mistake

You’re building fast with AI. You tell it to create an API endpoint to accept user input. The AI generates the route, the controller, the database query.

It works! You can create records, update them, and delete them.

But you didn’t tell the AI to:

Validate input formats (is that email actually an email?)
Sanitize user input (hello, SQL injection!)
Handle edge cases (what if the user sends null? Empty strings? Negative numbers?)
Provide helpful error messages (just “Invalid input” isn’t helpful)
Enforce type safety across frontend and backend

Result: Your app works in happy-path scenarios but breaks in production when users inevitably send unexpected data.

Why This Matters

Without proper validation and type safety:

Security vulnerabilities — SQL injection, XSS, command injection
Data corruption — Invalid data gets into your database
Cryptic errors — Users see “500 Internal Server Error” with no context
Debugging nightmares — Runtime errors that should have been caught at compile time
Frontend/backend drift — Types don’t match, fields are missing or renamed
Poor user experience — No helpful validation messages

What I Got Right (After Learning the Hard Way)

OmniTrackr uses strict TypeScript across the entire stack with comprehensive validation at every boundary.

1. Strict TypeScript Configuration

/tsconfig.json:

{
  "compilerOptions": {
    "strict": true,
    "noImplicitAny": true,
    "strictNullChecks": true,
    "strictFunctionTypes": true,
    "forceConsistentCasingInFileNames": true
  }
}

This catches type errors at compile time, prevents undefined/null bugs, and makes refactoring safer.
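A small example of what strictNullChecks buys you. The names here (`WatcherSummary`, `findWatcherById`, `describeSla`) and the 24-hour fallback are made up for illustration.

```typescript
interface WatcherSummary {
  id: number;
  name: string;
  slaMaxAgeHours?: number; // optional: not every watcher has an SLA
}

function findWatcherById(watchers: WatcherSummary[], id: number): WatcherSummary | undefined {
  for (const watcher of watchers) {
    if (watcher.id === id) {
      return watcher;
    }
  }
  return undefined;
}

function describeSla(watchers: WatcherSummary[], id: number): string {
  const watcher = findWatcherById(watchers, id);

  // With strictNullChecks on, reading `watcher.name` before this check is a
  // compile error, because `watcher` may be undefined.
  if (watcher === undefined) {
    return `Watcher ${id} not found`;
  }

  // The optional field also needs an explicit fallback (24 is an arbitrary default).
  const hours = watcher.slaMaxAgeHours ?? 24;
  return `${watcher.name}: files must be newer than ${hours}h`;
}
```

Without strict mode, the missing-watcher case compiles fine and only shows up as a runtime "cannot read properties of undefined" error.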

2. Shared Types Package

/packages/shared/src/types/:

export interface SourceConnection {
  id: number;
  name: string;
  type: SourceType;  // 's3' | 'azure' | 'gcp'
  connection_config: S3ConnectionConfig | AzureConnectionConfig;
  status: ConnectionStatus;  // 'active' | 'inactive' | 'error'
  created_at: string;
  updated_at: string;
  created_by: number;
}

export enum SourceType {
  S3 = 's3',
  Azure = 'azure',
  GCP = 'gcp'
}

export interface S3ConnectionConfig {
  aws_access_key_id: string;
  aws_secret_access_key: string;
  aws_region: string;
  bucket_name: string;
}

Benefits:

Frontend knows exactly what fields the API returns
Backend and worker share the same type definitions
Refactoring a field name updates everywhere
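One way to get even more out of shared types is a discriminated union, so the compiler knows which config fields exist for which provider. This is a sketch under my own assumptions: the `kind` field, the Azure field names, and `describeConnectionTarget` are hypothetical and not part of the types shown above.

```typescript
interface S3Target {
  kind: 's3';
  bucket_name: string;
  aws_region: string;
}

interface AzureTarget {
  kind: 'azure';
  account_name: string;   // assumed field name
  container_name: string; // assumed field name
}

type ConnectionTarget = S3Target | AzureTarget;

function describeConnectionTarget(target: ConnectionTarget): string {
  switch (target.kind) {
    case 's3':
      // Inside this branch the compiler only allows S3 fields.
      return `s3://${target.bucket_name} (${target.aws_region})`;
    case 'azure':
      return `azure://${target.account_name}/${target.container_name}`;
    default: {
      // If a new provider is added to the union and not handled above,
      // this assignment stops compiling.
      const unhandled: never = target;
      return unhandled;
    }
  }
}
```

Adding a GCP variant to the union would then produce compile errors in every switch that forgot to handle it, on both frontend and backend.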

3. Joi Validation with Custom Error Messages

/packages/api/src/middleware/validation.ts:

import Joi from 'joi';

export const createS3ConnectionSchema = Joi.object({
  name: Joi.string().trim().min(1).max(100).required()
    .messages({
      'string.empty': 'Connection name is required',
      'string.max': 'Connection name must be at most 100 characters',
    }),

  aws_access_key_id: Joi.string().trim().required()
    .pattern(/^[A-Z0-9]{20}$/)
    .messages({
      'string.pattern.base': 'Invalid AWS Access Key ID format. Must be 20 uppercase alphanumeric characters.',
      'any.required': 'AWS Access Key ID is required',
    }),

  aws_secret_access_key: Joi.string().trim().required()
    .pattern(/^[A-Za-z0-9/+=]{40}$/)
    .messages({
      'string.pattern.base': 'Invalid AWS Secret Access Key format',
    }),

  aws_region: Joi.string().trim().required()
    .valid('us-east-1', 'us-west-2', 'eu-west-1', 'ap-southeast-1')
    .messages({
      'any.only': 'AWS region must be one of: us-east-1, us-west-2, eu-west-1, ap-southeast-1',
    }),

  bucket_name: Joi.string().trim().required()
    .pattern(/^[a-z0-9][a-z0-9.-]*[a-z0-9]$/)
    .min(3).max(63)
    .messages({
      'string.pattern.base': 'Invalid S3 bucket name format',
      'string.min': 'Bucket name must be at least 3 characters',
      'string.max': 'Bucket name cannot exceed 63 characters',
    }),
}).options({
  stripUnknown: true,
  abortEarly: false,
});

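// Express types keep the middleware valid under strict TypeScript (noImplicitAny).
import type { Request, Response, NextFunction } from 'express';

export const validate = (schema: Joi.Schema) => {
  return (req: Request, res: Response, next: NextFunction) => {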
    const { error, value } = schema.validate(req.body);

    if (error) {
      const errors = error.details.map((detail) => ({
        field: detail.path.join('.'),
        message: detail.message,
      }));

      return res.status(400).json({
        success: false,
        error: {
          code: 'VALIDATION_ERROR',
          message: 'Invalid input data',
          details: errors,
        },
      });
    }

    req.body = value;
    next();
  };
};

Used in routes:

router.post(
  '/connections',
  authenticate,
  validate(createS3ConnectionSchema),
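  async (req, res, next) => {
    try {
      const connection = await connectionService.create(req.body);
      res.json({ success: true, data: connection });
    } catch (err) {
      // Express 4 does not catch rejected promises from async handlers,
      // so forward failures to the global error handler explicitly.
      next(err);
    }
  }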
);
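On the client side, the details array in the 400 response is what makes inline form errors possible. A minimal sketch, assuming the response shape produced by the validate middleware above; `toFieldErrorMap` and the interface names are hypothetical.

```typescript
interface ApiFieldError {
  field: string;
  message: string;
}

interface ApiValidationFailure {
  success: false;
  error: {
    code: string;
    message: string;
    details?: ApiFieldError[];
  };
}

// Turns the details array into a lookup a form can use: field name -> message.
function toFieldErrorMap(response: ApiValidationFailure): Record<string, string> {
  const byField: Record<string, string> = {};
  for (const detail of response.error.details ?? []) {
    // Keep the first message per field so each input shows a single hint.
    if (!(detail.field in byField)) {
      byField[detail.field] = detail.message;
    }
  }
  return byField;
}

const sampleFailure: ApiValidationFailure = {
  success: false,
  error: {
    code: 'VALIDATION_ERROR',
    message: 'Invalid input data',
    details: [
      { field: 'name', message: 'Connection name is required' },
      { field: 'bucket_name', message: 'Bucket name must be at least 3 characters' },
      { field: 'bucket_name', message: 'Invalid S3 bucket name format' },
    ],
  },
};

const fieldErrors = toFieldErrorMap(sampleFailure);
```

Because abortEarly is false, the user sees every problem in one round trip instead of fixing fields one at a time.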

4. Connection Validation Before Storage

validateS3Connection middleware:

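import type { Request, Response, NextFunction } from 'express';
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

export const validateS3Connection = async (req: Request, res: Response, next: NextFunction) => {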
  const { aws_access_key_id, aws_secret_access_key, aws_region, bucket_name } = req.body;

  try {
    const s3Client = new S3Client({
      region: aws_region,
      credentials: {
        accessKeyId: aws_access_key_id,
        secretAccessKey: aws_secret_access_key,
      },
    });

    await s3Client.send(new ListObjectsV2Command({
      Bucket: bucket_name,
      MaxKeys: 1,
    }));

    next();
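  } catch (err) {
    // Catch variables are `unknown` under strict mode; narrow before reading fields.
    const error = err instanceof Error ? err : new Error(String(err));
    if (error.name === 'NoSuchBucket') {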
      return res.status(400).json({
        success: false,
        error: {
          code: 'INVALID_BUCKET',
          message: `S3 bucket '${bucket_name}' does not exist or is not accessible`,
        },
      });
    }

    if (error.name === 'InvalidAccessKeyId') {
      return res.status(400).json({
        success: false,
        error: {
          code: 'INVALID_CREDENTIALS',
          message: 'AWS Access Key ID is invalid',
        },
      });
    }

    return res.status(400).json({
      success: false,
      error: {
        code: 'CONNECTION_FAILED',
        message: 'Failed to connect to S3 bucket. Please verify credentials and permissions.',
        details: { error: error.message },
      },
    });
  }
};

This prevents storing invalid credentials and gives immediate, precise feedback.

5. Custom Error Classes with Consistent Format

/packages/api/src/utils/errors.ts:

export class AppError extends Error {
  constructor(
    public statusCode: number,
    public message: string,
    public code: string = 'INTERNAL_ERROR',
    public details?: Record<string, unknown>
  ) {
    super(message);
    this.name = this.constructor.name;
    Error.captureStackTrace(this, this.constructor);
  }
}

export class ValidationError extends AppError {
  constructor(message: string, details?: Record<string, unknown>) {
    super(400, message, 'VALIDATION_ERROR', details);
  }
}

export class NotFoundError extends AppError {
  constructor(resource: string, identifier?: string | number) {
    const message = identifier
      ? `${resource} with ID ${identifier} not found`
      : `${resource} not found`;
    super(404, message, 'NOT_FOUND');
  }
}

export class UnauthorizedError extends AppError {
  constructor(message: string = 'Unauthorized') {
    super(401, message, 'UNAUTHORIZED');
  }
}

Global error handler:

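import type { Request, Response, NextFunction } from 'express';

// Express recognizes error-handling middleware by its four parameters, so keep `next` even if unused.
export const errorHandler = (err: Error, req: Request, res: Response, next: NextFunction) => {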
  console.error('Error occurred:', {
    name: err.name,
    message: err.message,
    stack: process.env.NODE_ENV === 'development' ? err.stack : undefined,
    path: req.path,
    method: req.method,
  });

  if (err instanceof AppError) {
    return res.status(err.statusCode).json({
      success: false,
      error: {
        code: err.code,
        message: err.message,
        details: err.details,
      },
    });
  }

  const message = process.env.NODE_ENV === 'production'
    ? 'Internal server error'
    : err.message;

  return res.status(500).json({
    success: false,
    error: {
      code: 'INTERNAL_ERROR',
      message,
    },
  });
};

How to Get This Right from the Start

Prompt your AI with validation and type safety requirements:

Build this feature with comprehensive type safety and validation:

1. Define TypeScript interfaces in /packages/shared/src/types/
2. Use strict TypeScript mode (noImplicitAny, strictNullChecks)
3. Create Joi (or Zod) validation schemas with:
   - Regex patterns for format validation
   - Custom error messages that help users
   - Input sanitization (trim, stripUnknown)
   - Return all errors at once (abortEarly: false)
4. Test external connections before accepting credentials
5. Use custom error classes (ValidationError, NotFoundError, etc.)
6. Implement global error handler with environment-aware messages
7. Never return sensitive details in error messages
8. Log errors with context for debugging

Action Items

✅ Enable strict TypeScript from day one
✅ Create a shared types package for frontend/backend consistency
✅ Use Joi or Zod for runtime validation
✅ Write custom error messages that help users fix issues
✅ Test external connections before accepting credentials
✅ Use custom error classes for consistent error handling
✅ Implement a global error handler with environment-aware logging
✅ Validate at boundaries (API inputs, environment variables, external data)

Conclusion: Build Smart, Not Just Fast

AI dev tools like Claude can help you build an MVP incredibly fast — but speed without strategy creates technical debt.

The difference between a throwaway prototype and a maintainable codebase comes down to how well you prompt the AI and whether you plan for the future.

The 5 Mistakes Recap

Skipping Documentation → Document architecture decisions, database schema, and trade-offs
Ignoring Reusability → Plan component structure before building features
No Environment Strategy → Support dev/staging/production from day one
Poor Database Design → Normalize your schema upfront, avoid monolithic tables
No Type Safety/Validation → Use strict TypeScript and comprehensive input validation

The Meta-Lesson

You’re not just prompting the AI to write code. You’re prompting it to make architectural decisions.

The better you understand software architecture, the better you can guide the AI. If you don’t know what questions to ask, the AI will make reasonable guesses — but those guesses might not align with your future needs.

What to Do Next

If you’re building an MVP with AI:

Start with documentation — design docs, schema design, architecture decisions
Plan your domain model — what entities exist, and how they relate
Structure for reusability — think about component organization early
Support multiple environments — even if you only use dev at first
Validate everything — type safety and input validation prevent bugs
Ask the AI to explain — “Why this approach? What are the alternatives?”

Remember: Building with AI isn’t about blindly accepting every suggestion. It’s about being a thoughtful product architect who uses AI as a highly capable implementation partner.

Happy building! 🚀

Need Help Getting Your Technical Foundation Right?

I’ve been through the pain of setting up cloud infrastructure, making architectural decisions, and building an MVP from scratch — so you don’t have to learn these lessons the hard way.

If you’re a non-technical founder building a SaaS product and you need help with:

- Cloud infrastructure setup — Terraform, CI/CD pipelines, environment separation

- Architecture decisions — database design, API structure, secrets management

- Technical strategy — choosing the right stack, planning for scale, avoiding costly mistakes early

- AI-assisted development — getting the most out of tools like Claude and Copilot without accumulating technical debt

I offer a Fractional CTO service — helping founders make the right technical decisions from day one so they can focus on finding product-market fit instead of fighting infrastructure fires.

Book a free consultation and let’s talk about your project.

Vibe Architecture for Vibe Coders: From Idea to Users

Manjula Liyanage — Thu, 14 Aug 2025 12:44:39 GMT

Photo by Lance Anderson on Unsplash

Vibe coding gets your app built fast. Vibe architecture keeps it alive.

You’ve already proven you can turn ideas into working software without touching a line of traditional code. But once it’s in the hands of real users, the rules change — suddenly you’re thinking about uptime, scaling, bug fixes, and making sure one tweak doesn’t break the whole damn thing. That’s where vibe architecture comes in.

This post is your roadmap for taking an app from “just built” to “used and loved by real people” — without burning out or breaking everything.

Who this is for

If you can build with vibe coding tools (AI-assisted coding) but struggle with deploying, maintaining, and scaling your apps, this is for you.

https://www.reddit.com/r/replit/comments/1mpxddr/why_your_app_works_for_you_but_fails_for_real/

The Lifecycle Model

The simplest way to think about the software journey is:

Develop → Build → Ship → Maintain → Scale → Improve

Here’s the model I use:

Build → Write, test, and package your app in a way that’s easy to ship.
Deployment Infrastructure → Get it running reliably for real users.
Integrations → Plug in all the useful services that make it more powerful.

Let’s walk through each capability, with examples and tools you can use today.

1. Develop & Build

Goal: Get clean, testable code and repeatable builds.

Vibe Coding Tools

You don’t need any explanation here; these are the obvious tools such as Claude Code, Windsurf, Lovable, Base44, Cursor, Replit, GitHub Codespaces, and VS Code + Copilot.

Start with an LLM‑assisted design doc

Before you generate a single screen or API, ask ChatGPT (or your favourite LLM) to produce a detailed design doc for your app. This becomes the source of truth you’ll feed back into your AI tools while building.

What to include (in plain English):

App purpose & problem statement
User personas (primary + secondary)
Top 3 user journeys (happy path + edge cases)
Scope & constraints (MVP vs later)
Feature list grouped by “must / should / could”
Data model (entities, relationships, example records)
API surface (endpoints, request/response examples)
Integrations (email, SMS, auth, payments, analytics, LLM use)
Non‑functional requirements (security, performance, availability)
Preferred tech stack (or “let the tool choose with reasoning”)
Metrics & logging (what we will track and why)
Risks & open questions

Your architecture guardrails (tell the LLM to include these explicitly):

Code Quality & Standards: TypeScript, ESLint, Prettier, Conventional Commits, minimal folder structure.
Unit Tests for critical paths: auth, payment, scheduling, timezone logic.
Database Best Practices: migrations (Prisma/Drizzle), foreign keys, indexes, seed data, pgvector if using embeddings. Use separate databases for Dev, Test, and Production with proper environment config.
Environment Parity: dev/staging/prod mirrored configs; env vars not hardcoded.
Error Handling & Structured Logging: JSON logs, correlation/request IDs, consistent error shapes.
Observability: Sentry for errors, basic request latency metric, uptime checks.
Security Baseline: secret manager, security headers, rate limits, input validation.
Performance & Reliability: caching where safe, retries with backoff for external APIs, idempotency keys.
Availability & Backups: managed Postgres with daily backups, restore drill monthly.
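To show what "JSON logs with request IDs" means in practice, here is a minimal sketch. `formatLogLine`, `LogSeverity`, and `LogContext` are hypothetical names; in a real app you would more likely reach for an existing logging library, but the output shape is the point.

```typescript
type LogSeverity = 'info' | 'warn' | 'error';

interface LogContext {
  requestId: string; // the same ID on every line written while handling one request
  [key: string]: string | number | boolean;
}

// Returns one JSON object per log line so log tools can filter by any field.
// `now` is injectable to keep the function deterministic in tests.
function formatLogLine(
  severity: LogSeverity,
  message: string,
  context: LogContext,
  now: Date = new Date()
): string {
  return JSON.stringify({
    ...context,
    timestamp: now.toISOString(),
    severity,
    message,
  });
}

const exampleLine = formatLogLine(
  'error',
  'Payment provider timed out',
  { requestId: 'req-123', userId: 42 },
  new Date('2025-01-01T00:00:00.000Z')
);
```

When a user reports a failure, searching your logs for one requestId gives you every line from that request, in order, across services.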

Documentation (Just Enough)

Save the output of the previous step as /docs/design.md and commit it. Your AI coding tool will keep referencing it while it generates code, tests, and configs.

A Reality Check for Non-Programmers Using AI Tools

Let me be blunt: AI will not magically build your app while you sip coffee and watch Netflix.

If you don’t know programming and you think you can just throw prompts at ChatGPT, copy-paste the output, and end up with a beautiful, maintainable app… you’re in for a rude awakening.

What you’ll actually get is a messy, brittle codebase that works “just enough” for your first demo, but turns into a house of cards the moment you fix one bug or add a new feature.

You’ve got two real options:

1. Learn programming with the AI as your pair programmer.

Let the AI write code, but read it, ask “why?”, and tweak it yourself.
Use it as a mentor, not a vending machine.
This is slower at first, but you’ll gain the skills to maintain and improve your own app without panic.

2. Admit you’ll need to hire real developers later.

Focus on validating your idea.
Keep your scope tight so when you hand it over, you’re not giving a dev team a flaming garbage pile to rewrite from scratch.

Either way, stop pretending AI is a push-button “build my business” machine. It’s a powerful multiplier — but only if you bring knowledge, discipline, and architecture to the table.

Unit Tests

Unit tests verify that individual components of your code (such as functions, classes, or modules) work exactly as intended in isolation. By testing small, focused units of logic, you can detect errors early in the development process, often before the code is integrated into the larger system. They are a must-have if you want to avoid the house-of-cards codebase described above.

Why they’re useful:

Early Bug Detection: Issues are caught before they reach users, saving time and reducing costly fixes later.
Confidence in Changes: You can safely refactor or extend your code knowing that unit tests will flag any unintended side effects.
Faster Debugging: When a test fails, it pinpoints exactly where the issue lies, making troubleshooting faster.
Improved Code Quality: Writing unit tests encourages developers to write cleaner, more modular, and more maintainable code.

In short, unit tests act as a safety net, ensuring your code’s foundation is solid before it’s deployed — preventing small mistakes from turning into big problems for your users.

Tools: Jest, Vitest, Pytest.
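Here is roughly what a unit test looks like, using a date calculation as the unit. `isSlaBreached` and the test cases are hypothetical, and the hand-rolled loop stands in for a test runner so the example runs on its own; with Jest or Vitest you would wrap each case in `test(...)` and compare with `expect(...)`.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// The unit under test: has the newest file been missing for longer than allowed?
function isSlaBreached(lastSeenAt: Date, maxAgeHours: number, now: Date): boolean {
  return now.getTime() - lastSeenAt.getTime() > maxAgeHours * HOUR_MS;
}

// Fixed "now" so the tests do not depend on when they run.
const referenceNow = new Date('2025-08-14T12:00:00Z');

const slaCases: Array<{ label: string; lastSeenAt: string; maxAgeHours: number; expected: boolean }> = [
  { label: 'fresh file', lastSeenAt: '2025-08-14T11:00:00Z', maxAgeHours: 6, expected: false },
  { label: 'exactly at the limit', lastSeenAt: '2025-08-14T06:00:00Z', maxAgeHours: 6, expected: false },
  { label: 'one second too old', lastSeenAt: '2025-08-14T05:59:59Z', maxAgeHours: 6, expected: true },
];

// Returns the labels of failing cases; an empty array means everything passed.
function runSlaCases(): string[] {
  const failures: string[] = [];
  for (const testCase of slaCases) {
    const actual = isSlaBreached(new Date(testCase.lastSeenAt), testCase.maxAgeHours, referenceNow);
    if (actual !== testCase.expected) {
      failures.push(testCase.label);
    }
  }
  return failures;
}
```

Notice that two of the three cases sit right on the boundary. Edge cases like "exactly at the limit" are where AI-generated code tends to be off by one, and where a test pays for itself.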

Development Workflow

Think of your workflow as the rules of the road for how code moves from your laptop to your users. Without a clear process, you’ll end up with “It works on my machine” disasters, broken features in production, and late-night panic fixes.

Step 1 — Get on GitHub (or GitLab/Bitbucket, but GitHub is easiest)

Create a repo for your project.
Commit your /docs/design.md and your initial AI-generated code here.
Turn on branch protection rules for main so nobody (including you) can push directly without a review.

Step 2 — Use Branches for Every Change

main branch = your production-ready code.
develop branch (optional) = where features merge before going to staging.
Feature branches = feature/scheduler-ui or fix/timezone-bug.
Branch naming rule: feature/…, fix/…, or chore/….
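If you want to enforce the naming rule rather than just remember it, a check like this could run in CI or a Git hook. It is only a sketch: `isValidBranchName` is a hypothetical helper, and the lowercase kebab-case part after the slash is my own assumption, not something the rule above requires.

```typescript
// Allowed: feature/..., fix/..., chore/... followed by lowercase kebab-case.
const BRANCH_NAME_PATTERN = /^(feature|fix|chore)\/[a-z0-9]+(-[a-z0-9]+)*$/;

function isValidBranchName(branch: string): boolean {
  return BRANCH_NAME_PATTERN.test(branch);
}
```

For example, `feature/scheduler-ui` and `fix/timezone-bug` pass, while `hotfix/login` and `feature/` are rejected.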

Step 3 — Release Management for Test & Prod

Staging environment: a near-copy of production where you test new features with real data (or good fake data) before public release.
Production environment: what your real users see.
Each environment should have its own database, storage, and API keys so you can break staging without fear.
Deploy staging on every merge to develop (or main if skipping develop).
Deploy production only when you merge from develop → main or tag a release.

Step 4 — Automate Everything with CI/CD

Once you’ve got GitHub and branches sorted, let automation handle the boring (and error-prone) stuff:

1. Continuous Integration (CI)

Runs on every PR or commit.
Does linting (npm run lint), type checking, and unit tests.
Builds the app to catch early errors.

2. Continuous Delivery/Deployment (CD)

Automatically deploys staging after a successful PR merge.
Automatically deploys production when you merge to main (or create a GitHub Release).
Can run smoke tests (via Playwright or Cypress) against the deployed URL to confirm nothing is broken.

Example Flow:

feature/new-onboarding → PR → CI checks pass → merge to develop → auto deploy to staging
Test staging → merge develop → main → auto deploy to production
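The flow above boils down to a small decision table. Here it is as a sketch in code; `decidePipelineAction` and its types are hypothetical, and in practice this logic lives in your CI configuration (for example branch filters in GitHub Actions) rather than in application code.

```typescript
type PipelineAction = 'blocked' | 'run-checks-only' | 'deploy-staging' | 'deploy-production';

interface PipelineEvent {
  branch: string;        // branch that received the push or merge
  checksPassed: boolean; // lint, type check, unit tests, build
}

function decidePipelineAction(event: PipelineEvent): PipelineAction {
  // Nothing deploys unless CI is green.
  if (!event.checksPassed) return 'blocked';
  if (event.branch === 'main') return 'deploy-production';
  if (event.branch === 'develop') return 'deploy-staging';
  // feature/..., fix/..., chore/... branches only get CI feedback.
  return 'run-checks-only';
}
```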

Recommended Stack for Vibe Coders:

GitHub Flow (simple and popular)
GitHub Actions (free for small projects)
Vercel Previews (instant preview URL for every PR)
Protected branches + required reviews (forces good habits)

Vulnerability Scanning

Every package you install is code you didn’t write — and it can have bugs, security holes, or even malicious updates that put your users (and your data) at risk. Vulnerability scanning is your early warning system.

Step 1 — Built-in AI Tool Scanners

Many modern AI coding tools already have security awareness baked in.

Cursor, for example, can integrate with Snyk, which scans your code and dependencies in real-time as you work.
Some AI tools will even warn you before committing if they detect API keys, passwords, or other secrets in your code.
Treat these warnings seriously — they’re not nagging you, they’re saving you from embarrassment or a data breach.

Step 2 — GitHub Security Tools

Dependabot automatically checks for outdated or vulnerable packages and opens PRs to update them.
GitHub Advanced Security (paid) adds secret scanning, code scanning, and dependency review.
At a minimum, enable Dependabot alerts in your repo settings.

Step 3 — CI/CD Integration

Add npm audit (or yarn audit / pnpm audit) to your CI pipeline so builds fail if critical vulnerabilities are found.
If using Snyk, connect it to your GitHub Actions workflow so every PR gets a vulnerability report before merge.
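If you want more control than a pass/fail exit code, a small script can read the audit report and decide for itself. This sketch assumes the report comes from npm audit --json and that it carries severity counts under metadata.vulnerabilities; check the output of your npm version before relying on that shape:

```python
import json


def count_critical(audit_json: str) -> int:
    """Read the number of critical vulnerabilities from an audit report."""
    report = json.loads(audit_json)
    # Assumed report shape: {"metadata": {"vulnerabilities": {"critical": N, ...}}}
    counts = report.get("metadata", {}).get("vulnerabilities", {})
    return int(counts.get("critical", 0))


def should_fail_build(audit_json: str) -> bool:
    """Fail the build only when at least one critical issue is reported."""
    return count_critical(audit_json) > 0
```

In CI, you would pipe the audit output into this script and exit with a non-zero status when should_fail_build returns True.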

2. Deployment Infrastructure

Goal: Release with confidence, keep prod safe.

Some platforms (like Replit, Base44, Vercel, or Render) have built-in deployment: push code, and it’s live. Others require you to set up and manage hosting yourself (AWS, Azure, custom VPS). Either way, these are the core pieces you need to understand and configure:

Environments (Test & Prod)

Separate sandboxes for testing and live users so you can break staging without breaking production.

Example: staging.example.com with its own DB.

Go-to: Vercel (app) + Supabase (DB), staging + prod.

DNS & Certificates

Maps your domain name to your app and encrypts traffic with HTTPS.

Go-to: Cloudflare DNS + Vercel auto TLS.

Security

Keeps your app and data safe with secret management, headers, and firewalls.

Example: API keys in secret manager, security headers enabled.

Go-to: Managed env vars + basic WAF.

Remote Access

Allows secure, controlled access to servers or containers for debugging.

Go-to: Tailscale, a jump host, or AWS SSM Session Manager.

Testing in Deployment

Runs automated checks on the deployed app to confirm everything works before users see it.

Example: Playwright smoke tests run after deploy.

Observability

Tracks errors, performance, and user behavior so you can spot issues early.

Example: Sentry alerts for 500 errors, PostHog for user flows, CloudWatch alerts, etc.

Storage

Handles file and media storage, often with a CDN for speed.

Go-to: R2/S3 for files + Cloudflare CDN.

Certificates & Backups

Ensures secure HTTPS connections and recoverability if data is lost.

Example: Daily DB backups with monthly restore test.

Performance & Reliability

Keeps the app fast and resilient under load with queues, caching, and retries.

Example: Queue SMS jobs, retry failed sends.
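Retrying failed sends usually means waiting a little longer after each failure. Here is a minimal sketch of retry with exponential backoff; the function name, the defaults, and the injectable sleep parameter are my own choices for illustration:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A queue worker would wrap each SMS send in retry(...), so a brief provider outage does not lose the message.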

Availability

Prevents downtime by removing single points of failure and enabling scaling.

Example: Multi-AZ DB, health checks, auto-scaling.

3. Integrations (Make It Useful)

Email Delivery: Postmark, Mailjet, SendGrid.
SMS: Twilio.
Push Notifications: Firebase Cloud Messaging.
Authentication: Clerk, Firebase Auth, or Supabase Auth.
Payments: Stripe.
Analytics: PostHog.
CRM: HubSpot Free.
LLMs: OpenAI + pgvector.
Other Services: Svix for webhooks, n8n/Zapier for automations.

4. Three Starter Stacks You Can Copy

Weekend MVP (fastest path)

Next.js on Vercel or Railway
Supabase (DB + Auth + storage)
Postmark, Stripe, PostHog, Sentry

Indie SaaS (cheap to scale)

Next.js + Vercel or Railway
Neon Postgres + Prisma
Clerk, R2 (assets), PostHog, Sentry, Stripe, BullMQ

Enterprise-Friendly

AWS (EC2/EB/ECS), Google Cloud Run, Azure App Service
RDS Postgres, CloudFront, Route53
AWS Secrets Manager, CloudWatch, Auth0, SES/SNS, Stripe, Terraform

5. Anti-Crash Checklist

[ ] Staging environment with separate DB
[ ] Health check route & smoke test after deploy
[ ] Error tracking on (Sentry)
[ ] Daily DB backups + restore test
[ ] Rate limiting + retries on external APIs
[ ] Security headers + secret manager
[ ] Observability dashboard (errors, latency)
[ ] Feature flags for risky changes

Final thought: Vibe coding lets you build in days. Vibe architecture keeps your app alive for years. Use this checklist, pick a starter stack, and start shipping without the fear of breaking everything.

A simple RAG Application

Manjula Liyanage — Fri, 06 Jun 2025 16:01:18 GMT

Simple RAG Application

Architecture

This simple Retrieval-Augmented Generation (RAG) application would read a large PDF, semantically chunk the content, embed it using OpenAI, and store those embeddings in MongoDB Atlas for efficient vector search. When a user asks a question, relevant context is retrieved and passed to an LLM to generate accurate, grounded answers.

My notebook file

1. Load the PDF using PyPDFLoader

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('NET-Microservices.pdf')
pages = loader.load()  # Load the PDF into a list of Document objects, one per page
pages

2. Just checking whether the number of pages is correct by checking the length of the pages list.

len(pages)  # Number of pages loaded

3. Perform semantic chunking

This splits the text at natural breakpoints (e.g., sentence or paragraph ends) and keeps some overlap between chunks to preserve context.

Here, I used the RecursiveCharacterTextSplitter with sentence- and paragraph-level separators, a chunk size of 1000 characters, and an overlap of 200 characters. Strictly speaking, this is not true semantic chunking. I wanted to try the Greg Kamradt semantic similarity method, but I haven’t got to it yet. That’s for the next one.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Join all pages 
full_text = " ".join([p.page_content for p in pages])

# Use RecursiveCharacterTextSplitter with sentence-level separators
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", "!", "?", ",", " "],
    chunk_size=1000,
    chunk_overlap=200,
)

chunks = text_splitter.create_documents([full_text])

Let’s check the number of chunks, so we know that it worked.

len(chunks)  # Number of chunks created
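To see what the chunk size and overlap settings mean, here is a stripped-down sketch that cuts fixed-size windows with overlap. It is a simplification I wrote for illustration: the real RecursiveCharacterTextSplitter also looks for separators before cutting, which this version ignores.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With these defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of a chunk reappear at the start of the next.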

4. Let’s embed the chunks and insert them into the MongoDB collection

from langchain_openai import OpenAIEmbeddings  # the langchain.embeddings import path is deprecated
from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch

import os
from dotenv import load_dotenv
load_dotenv() 

# Get the connection string, database name, and collection name from the .env file

client = MongoClient(os.getenv("MONGODB_CONNECTION_STRING"))
collection = client[os.getenv("MONGODB_NAME")][os.getenv("MONGODB_COLLECTION")]

# Set the embedding model
embedding_model = OpenAIEmbeddings()

# Insert the documents in MongoDB Atlas with their embedding
docsearch = MongoDBAtlasVectorSearch.from_documents(
    chunks, embedding_model, collection=collection, index_name=os.getenv("MONGODB_INDEX_NAME")
)

Now we can go to the Atlas console and see the MongoDB collection. The data should be populated.

MongoDB collection where the embeddings are stored. It holds the original text in the text field and the embedding in an array (1536 elements).

5. Index the Vector Embeddings

To index the collection, we need to create a Vector Search Index

Select Vector Search for semantic search and AI applications as the search type. Then select the collection where our embeddings are stored.

Next, we can use the JSON Editor for configuration.

Configuration: The path is the name of the field in the collection where we stored the embeddings. numDimensions is the length of the embedding array.

Similarity is the function type that will be used to perform the similarity search.
This is some math shit that my brain couldn’t comprehend, but I “believe” in maths, so that gives me confidence it will do the job even though I don’t understand how it works. It’s quite the opposite of the popular Niels Bohr horseshoe story 😂
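For anyone who wants a peek behind the math, here is a rough sketch of the index settings described above, written as a Python dict, next to a plain-Python cosine similarity. The field name "embedding" and the choice of "cosine" are assumptions on my part; use whatever your collection and index actually contain.

```python
import math

# Assumed values: "embedding" as the field path and "cosine" as the function;
# 1536 matches the array length seen in the collection.
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1536,
            "similarity": "cosine",
        }
    ]
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

The idea is that two chunks with similar meaning get embeddings pointing in similar directions, so their cosine similarity is close to 1.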

6. Create a Retrieve function

# Reuse the vector store created in step 4 for retrieval.
# (Calling from_documents again would insert the same chunks a second time.)
vectorStore = docsearch

Check if the vector search is working by performing a similarity search.

# perform a similarity search between the embedding of the query and the embeddings of the documents
myquery = "What is docker and how does it work?"
context_from_vector_store = vectorStore.similarity_search(query=myquery, k=5)

context_from_vector_store

It works

Now, set up the retriever, output parser, and LLM Model that we are going to use.

from langchain_openai import OpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.messages.base import BaseMessage
from langchain import hub

llm = OpenAI(model_name="gpt-3.5-turbo-instruct")

retriever = vectorStore.as_retriever()
output_parser = StrOutputParser()

rag_prompt = hub.pull("rlm/rag-prompt")

Then we have these two functions

The format_docs(docs) function takes a list of document objects (each with a page_content attribute) and returns a single string.

The get_question(input) function ensures that, regardless of how a question is passed, the system can extract a clean question string for downstream processing (like retrieval or prompting).

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def get_question(input):
    if not input:
        return None
    elif isinstance(input,str):
        return input
    elif isinstance(input,dict) and 'question' in input:
        return input['question']
    elif isinstance(input,BaseMessage):
        return input.content
    else:
        raise Exception("string or dict with 'question' key expected as RAG chain input.")

Next, set up the Retrieval-Augmented Generation (RAG) pipeline using LangChain’s composable Runnable framework.

rag_chain = (
        {
            "context": RunnableLambda(get_question) | retriever | format_docs,
            "question": RunnablePassthrough()
        }
        | rag_prompt
        | llm
        | output_parser
)

Here’s a breakdown of what it’s doing:

{
    "context": RunnableLambda(get_question) | retriever | format_docs,
    "question": RunnablePassthrough()
}

This sets up a dictionary-based input processor.

"question": The original input (e.g. "What is Docker?") is passed directly via RunnablePassthrough().

"context":

get_question: The function that extracts the input question.

retriever: Uses the result to retrieve relevant documents from a vector store (e.g., MongoDB Vector Search).

format_docs: Formats the documents into a string using "\n\n".join(...).

| rag_prompt: This combines the context and question into a prompt template (e.g., "Use the following context to answer the question...").

| llm: Sends the final prompt to the LLM (here, OpenAI’s gpt-3.5-turbo-instruct) for response generation.

| output_parser: Converts the LLM response into a plain string.
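If the pipe syntax feels like magic, the data flow can be sketched in plain Python without LangChain. Everything here is a stand-in written for illustration: Doc, fake_retriever, and render_prompt are hypothetical, and the real chain uses the vector store retriever and the prompt pulled from the hub.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    """Hypothetical stand-in for the vector store retriever."""
    return [
        Doc("Docker packages applications into containers."),
        Doc("Containers share the host operating system kernel."),
    ]


def format_docs(docs: list[Doc]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt_inputs(question: str) -> dict:
    """Mirror the dict step: same input feeds both 'context' and 'question'."""
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,  # passed through unchanged
    }


def render_prompt(inputs: dict) -> str:
    """Hypothetical prompt template combining context and question."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"
```

The point is that the same question goes down two paths: one fetches and formats context, the other passes the question through untouched, and both meet again in the prompt.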

Now, let’s test this. Pass a question and invoke the RAG chain.

rag_chain.invoke("What is Docker?")

It works 🤸‍♂️

A Roadmap to Becoming a Generative AI Engineer

Manjula Liyanage — Tue, 25 Feb 2025 22:29:37 GMT

Photo by Google DeepMind on Unsplash

Do you want to become an AI engineer but are struggling to find a clear starting point? This guide provides a structured roadmap to help you focus and avoid distractions while mastering Generative AI development.

Step 1: Introduction to NLP

Natural Language Processing (NLP) is the foundation of Generative AI. Begin by understanding the basics of NLP, including how machines process and interpret human language.

Step 2: Master Text Processing in NLP

Familiarize yourself with essential text processing concepts:

Tokenization — Breaking text into words or subwords.
Stop Words Removal — Filtering out common words like “the,” “is,” etc.
Stemming & Lemmatization — Reducing words to their base or root forms.
Lowercasing and Punctuation Removal — Standardizing text for better processing.
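As a rough illustration, lowercasing, punctuation removal, tokenization, and stop word removal can be sketched with the standard library alone. The stop word list here is a tiny made-up sample, and stemming and lemmatization are left out because they need a dedicated NLP library.

```python
import string

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punctuation.split()
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The cat is on the mat."))
```

Libraries such as NLTK or spaCy do the same steps with better tokenizers, but it helps to see how little is going on underneath.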

Step 3: Learn Part-of-Speech (POS) Tagging

Understand how NLP models classify words into their respective parts of speech (nouns, verbs, adjectives, etc.) to improve language understanding.

Step 4: Understand Named Entity Recognition (NER)

Learn how models extract useful information, such as names of people, locations, and organizations, from text.

Step 5: Study Text Vectorization

Transform text into numerical representations using:

Bag of Words (BoW)
TF-IDF (Term Frequency-Inverse Document Frequency)
Word Embeddings (e.g., Word2Vec, GloVe, FastText)
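To make the first two methods concrete, here is a small sketch of Bag of Words and TF-IDF in plain Python. It uses the textbook formula, term frequency times log(N / document frequency); libraries usually apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def bag_of_words(tokens: list[str]) -> Counter:
    """Count how often each token appears in one document."""
    return Counter(tokens)


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term per document: (count / doc length) * log(N / doc frequency)."""
    total_docs = len(docs)
    doc_frequency = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(total_docs / doc_frequency[term])
            for term, count in counts.items()
        })
    return scores
```

A term that appears in every document scores zero, which is exactly the point: common words carry little information about any one document.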

Watch this tutorial to understand text vectorization: YouTube Video

Step 6: Understand How Large Language Models (LLMs) Work

LLMs are built on the Transformer architecture. To understand them, watch an explanation of Google’s Attention is All You Need paper: YouTube Video

Step 7: Experiment with LLMs

Try using different models to get hands-on experience:

Paid Models: OpenAI’s GPT, Gemini by Google
Open-Source Models: models hosted on Hugging Face or run locally with Ollama

Step 8: Learn Prompt Engineering

Understand how to craft effective prompts to get the best results from LLMs. Check out these resources:

Step 9: Explore Advanced Topics

Quantization — Optimize models for efficiency.
Fine-Tuning LLMs — Train models with custom datasets. Follow this guide: YouTube Video

Step 10: Learn LangChain and LangGraph

These tools help build applications using LLMs.

Step 11: Build a Basic AI Application

Start with a chatbot or another AI-driven project. You’ll need:

Backend Development: Learn FastAPI for LLM deployments.
FastAPI Docs & Tutorials: YouTube Playlist
Database Knowledge: Learn how to store and retrieve data efficiently.
Authentication Methods: Understand OAuth, access tokens, and user authentication.

Step 12: Build the Frontend (Optional)

For a full-stack AI application, learn:

HTML, CSS, Tailwind (UI styling)
JavaScript Frameworks: React, Angular, Vue
However, if your primary goal is to become an AI engineer, focus more on backend development.

Step 13: Deploy Your AI Application

Learn how to host and deploy your project for real-world use.

Step 14: Solve Real-World Problems

Build AI solutions that address practical challenges. The best way to master AI is to apply it!

By following this roadmap, you can systematically develop your expertise in Generative AI. Stay focused, avoid distractions, and most importantly, keep experimenting and building!

AWS Lambda Use Cases

Manjula Liyanage — Sun, 11 Dec 2022 11:30:36 GMT

What is AWS Lambda and Why?

AWS Lambda is a service provided by Amazon Web Services (AWS) that allows developers to run their code without worrying about the underlying infrastructure. Therefore, developers can focus on writing and deploying their code rather than managing and scaling servers.

AWS Lambda functions are executed in response to specific events or triggers, such as user requests or data changes in a database. These functions are independent and can be written in various languages, including JavaScript, Python, and Java.

One advantage of using AWS Lambda is its great flexibility and scalability. Since each function is independent, it can be scaled up or down separately from the rest of the application. Additionally, cloud-native services allow easy integration with other cloud-based services, such as databases and message queues. This can make it easier to build complex, scalable applications.

So, below are some highlights:

Removes the need for traditional computing services
Reduces operational cost and simplifies operational management
Faster development
Auto-scaling — Lambda can scale from a single request to hundreds of thousands per second. Read this to learn more about how Lambda handles Scaling.
Pay as you go
Lambda functions are naturally microservice-sized, which gives you an easy microservice implementation model.

Even with these benefits, you should still know the limitations of Lambda. Also take note of known Lambda pitfalls and cost optimization tips.

Here are some Lambda use-cases

Building Serverless web apps

An architecture of a Serverless web app

Lambda functions can be used to build serverless web applications, where the application logic is split into individual tasks executed in response to specific events. This can make it easier to build scalable, flexible applications.

You can create a web API with an HTTP endpoint for your Lambda function by using Amazon API Gateway. API Gateway provides tools for creating and documenting web APIs that route HTTP requests to Lambda functions.

Other Serverless services used in this architecture:

Cognito User Pools provides features to control user sign-up, sign-in, email or SMS verification and user management with more advanced features. Users can also sign in through social identity providers like Google, Facebook, and SAML identity providers.

Cognito Identity Pool will provide temporary credentials to AWS resources like S3 using the token received on successful login.

Cognito will authorize the user with the necessary permissions for the IAM role. Learn more about user authentication and authorization with AWS Cognito.

Route 53 for managing DNS (custom domain names)

Use a CloudFront distribution with S3 to serve the web app. The website must be a client-side web app, for example one built with React, Angular, or plain JavaScript.

Certificate manager to provision an SSL certificate and associate with CloudFront

Associate the custom domain and the SSL with the API gateway

Amazon DynamoDB is a fully-managed, high-performance, NoSQL database service that is easy to set up, operate, and scale. It is used to persist session data, such as the shopping cart and the product database. The same architecture can be extended to develop complex applications like serverless e-commerce systems.

An architecture of an e-commerce app

I’m using the Microsoft cloud-native microservice reference application with AWS services.

The application also uses a related pattern called Backends-for-Frontends (BFF), which recommends creating separate API gateways for each front-end client. The reference architecture demonstrates breaking up the API gateways based on whether the request comes from a web or mobile client.

Other Serverless services used in this architecture:

Amazon Personalize: Amazon Personalize provides similar item recommendations, search re-ranking based on user preferences, and product recommendations based on user-item interactions.

Amazon Pinpoint: Amazon Pinpoint adds the ability to dynamically send welcome messages, abandoned cart messages, and messages with personalized product recommendations to the customers.

Amazon ElastiCache: This is a session store for volatile data and a caching layer for the product catalogue to reduce I/O (and cost) on DynamoDB.

Amazon Simple Notification Service (Amazon SNS) is a managed service that delivers messages from publishers to subscribers (also known as producers and consumers).

Amazon Simple Queue Service (SQS) lets you send, store, and receive messages between services at any volume without losing messages or requiring other services to be available.

EventBridge enables you to decouple your architectures to make it faster to build and innovate, using routing rules to deliver events to selected targets.

How to achieve high availability

Document conversion

Suppose your application provides documents (invoices, purchase orders, receipts, prescriptions). In that case, your application should be capable of serving those documents in different formats, such as PDF, HTML, and CSV, depending on the requirement and the device. Instead of storing documents in all forms, you can use AWS Lambda to create the desired format of the document and serve it to the customer for a download or a display on a page.

Data processing

Lambda functions can process large amounts of data, such as logs or events. This can be useful for data transformation, aggregation, or analysis tasks.

An excellent example of this use case is a Centralized logging system where Lambda is used for logging transformation.

You can attach a Lambda function to Kinesis to transform data before it is sent to the destination. Lambda can structure the data in the format the destination accepts, for example by adding metadata or combining data from another source. The metadata can include SourceName, SourceType, and the AWS account number and name. Lambda can also transform raw logs to JSON if required, and it can read logs from an S3 bucket, as in the case of load balancer logs. Many other sources can be integrated with Kinesis via EventBridge.

AWS Log transformation and ingestion

File processing

Lambda functions can be used to process files as they are uploaded to the cloud. For example, a Lambda function could be used to resize images, extract text from documents, or transcribe audio files.

Objects are uploaded into Amazon S3. Then the S3 bucket publishes an event notification to an Amazon SNS topic.

Amazon SNS can fan-out messages to multiple subscribers.

N number of Lambda functions can be created to process data, all without the need to provision or manage servers.

After processing, data can be sent to different AWS services for storage, further processing or analytics.
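Each Lambda subscribed to the SNS topic receives the S3 notification wrapped inside an SNS message. Here is a minimal sketch of a handler that unwraps it; the function names are my own, and the actual processing (resizing, text extraction, and so on) is left as a print statement.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an SNS event that wraps S3 notifications."""
    objects = []
    for sns_record in event.get("Records", []):
        # The S3 notification arrives as a JSON string in the SNS message body.
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys are URL-encoded in S3 notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: process every uploaded object in the event."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"Processing s3://{bucket}/{key}")  # placeholder for real work
    return {"processed": len(objects)}
```

Because SNS fans out, several functions with this same unwrapping step can each do a different job on the same upload.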

A famous example of this is how Netflix does its video encoding at scale using Lambda.

With Lambda, Netflix can use rules triggered by the movement of video assets to launch and configure the necessary processing to encode 60 different parallel streams and can use the rules and events to aggregate and deploy after all the parts are processed.

Automation

Lambda functions can be used to automate a wide range of tasks, such as backups, data migration, or data synchronization. These functions can be triggered by a schedule or in response to specific events, making it easy to automate complex processes.

Sending security alerts and email automation are two common examples.

Sending security alerts

Automation of security alerts and raising incidents

You can write a Lambda function to send an alert on a specific event from CloudWatch/CloudTrail AWS activity logs. It will notify your designated on-call staff via email, or you could even write code that triggers the Lambda to call you on your phone.

Email automation

You can automate newsletters, email campaigns, and transactional emails using Lambda functions and Amazon Simple Email Service (SES).

5 Lambda pitfalls

Manjula Liyanage — Fri, 26 Aug 2022 05:30:49 GMT

1. Lambda Overloading

Sometimes you use the same Lambda function for multiple purposes.

Avoid using one Lambda function for two different tasks because both of these workloads can scale independently and would require different memory configurations.

It also makes monitoring harder. When one function is invoked for more than one purpose, you cannot tell from the monitoring metrics which kind of invocation produced them.

Use purpose-built and Small Lambda Functions

One Lambda should do only one thing. Single Responsibility Principle

Function size is part of the cost

On every cold start, the Lambda runtime downloads the code from S3 or a container image registry. The larger the code, the longer the wait, and the more you pay.

Since December 2020, Lambda functions have been billed in one-millisecond increments, so you save money for every millisecond your function runs faster.

2. Using CloudWatch PutMetricData API

The PutMetricData API publishes metric data points to Amazon CloudWatch.

Collecting metrics in a Lambda function is a common requirement, and a lot of people call the PutMetricData API from the function to publish them to CloudWatch.

Your Lambda makes a synchronous HTTP call to the PutMetricData API, so the function sits waiting for the response, and you pay for that waiting time.

Instead of using the CloudWatch PutMetricData API, use metric filters.

You can simply emit a log line that contains the name of the metric and the value you want to record for it.

Example: You want to count the number of illegal argument exceptions in your Lambda. You can log illegalarguments=1

Then you can set up a metric filter with a rule that parses that log line format and publishes the value to CloudWatch metrics.

This happens asynchronously and is not part of the Lambda execution.
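On the function side, this is nothing more than printing a line in a predictable format. The sketch below shows one way to do it; emit_metric and parse_metric are hypothetical helpers, and parse_metric only imitates locally what a metric filter would extract (real filters use CloudWatch's own pattern syntax, which is not shown here).

```python
import re

# Assumed log line format: name=value, with an integer value.
METRIC_LINE = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)=(?P<value>\d+)$")


def emit_metric(name: str, value: int = 1) -> str:
    """Write a metric log line; Lambda forwards stdout to CloudWatch Logs."""
    line = f"{name}={value}"
    print(line)
    return line


def parse_metric(line: str):
    """Local stand-in for the filter: return (name, value) or None."""
    match = METRIC_LINE.match(line)
    if match is None:
        return None
    return match["name"], int(match["value"])
```

The function pays only for a print call, and the metric extraction happens later on the CloudWatch side.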

3. Chaining lambda functions synchronously

In a synchronous Lambda call, the first function waits for the second one to finish, and you pay for the waiting time.

Don’t call a Lambda from another Lambda. Run the second function using a service like Step Functions.

4. Using lambda functions as an Orchestrator

A lot of people combine Lambda with SQS because it's an easy way to move messages around from your Lambda and let the output of one function be the input of another.

With this pattern, it can be very hard to analyse what is actually going on in your application when your data is bouncing around between many Lambda functions.

Each Lambda function writes to its own log streams, so it's difficult to pair up all the logs that relate to a particular invocation.

Use Step Functions for Lambda Orchestration

Step Functions lets you build workflows with fallbacks, error handling, and retries, and the AWS Step Functions console makes it very easy to analyse what happened in a workflow.

5. Not using batch processing in SQS and Lambda functions for event processing workloads

You can set up SQS to invoke your Lambda with a single message at a time. When you have many messages, that means a lot of invocations, and it becomes costly.

The ideal way to do this is to use SQS and Lambda batch processing.

Lambda pulls multiple messages from the queue at once and passes all of them in the event body of a single function invocation.

This way you can iterate over all the messages in the input and process them independently.

But you have to be careful with this approach because partial failures can occur. When some messages in a batch fail, they have to go back into the queue to be retried later.

There is an easy way to handle partial failures: failure reporting. Return the IDs of the failed messages from your Lambda, which tells SQS to put those messages back into the queue. This way only the failed messages reappear in the queue for processing, and the successful ones are deleted from the queue.
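Here is a minimal sketch of a batch handler that reports partial failures. It assumes the ReportBatchItemFailures setting is enabled on the SQS event source mapping; process_message is a hypothetical placeholder for your own logic.

```python
def process_message(body: str) -> None:
    """Hypothetical business logic; raises when a message cannot be handled."""
    if not body:
        raise ValueError("empty message body")


def handler(event, context):
    """Process an SQS batch and report only the messages that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_message(record["body"])
        except Exception:
            # Only this message returns to the queue for a retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty batchItemFailures list tells Lambda that the whole batch succeeded. If the handler itself raises, the entire batch is retried.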

Check out my Lambda cost optimization tips here.

Migrating 3 tier web application to AWS (Re-hosting/lift-and-shift)

Manjula Liyanage — Tue, 26 Apr 2022 12:59:46 GMT

The Current Architecture

Before migrating any workload to the cloud, you need to assess the right migration strategy for you. I wrote some questions that you should ask to evaluate your systems.

Deployment Overview

The current application will be migrated as it is to the AWS Cloud. There will be no code changes required other than changing the connection config.

The Oracle database will be deployed in an EC2 instance. (Oracle RDS was not chosen for this migration because code changes might be required. Also, there are possible dependencies on some Oracle services and DB server file system)

Tape backups will be replaced with scheduled backups and an S3 deep archive. You can use the AWS Storage Gateway file interface to directly back up the database to Amazon S3. AWS Storage Gateway file interface provides an NFS mount for S3 buckets. Oracle Recovery Manager (RMAN) backups written into the Network File System (NFS) mount are automatically copied to S3 buckets by the AWS Storage Gateway instance.

Backups will be stored in S3 and moved to the S3 deep archive to save costs. An Application Load balancer will replace the on-prem load balancer.

New Architecture

High availability

The application will be deployed in two AWS Availability Zones (AZs), so it stays available even if one AZ goes down. The primary DB will sync with a standby instance in the other AZ. If AZ 1 fails, the standby DB will automatically be brought up to serve requests from the application servers. An Application Load Balancer (ALB) will divide the load between the two AZs and send traffic only to available, healthy servers. It will stop sending traffic to a web server if that server's health check fails.

Performance

Auto-scaling cannot be used in the current application because the web application is storing the user session (stateful application). The application needs to be refactored to support auto-scaling.

Amazon CloudFront can be used to cache static content in different geographical locations to provide faster access.

Security

Authenticating users: On-prem AD and SSO will authenticate the organisation’s employees to the cloud infra.

Use IAM policies and best practices to provide only the required amount of access to the users.

Data security in transit: All client applications will communicate over HTTPS. SSL/TLS will be deployed on the Application Load Balancer, and only HTTPS traffic will be allowed from the Internet. AWS Certificate Manager is used to manage the SSL certificates associated with the ALB.

Data security at rest: All EC2 instance volumes will be encrypted. Furthermore, data in EFS and S3 will be encrypted.

Application security: The web application firewall protects against attacks such as DDoS, cross-site scripting, cookie poisoning, and parameter tampering.

All application servers will be deployed in a private subnet that allows only inbound traffic from the ALB in the public subnet.

Moreover, four security groups will enable managing traffic flows securely.

Web DMZ security group will allow port 443 traffic from the internet.
In the Web SG — Allow traffic from the Web DMZ SG
In the App SG — Allow traffic from Web SG
In the DB SG — Allow traffic from App SG
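The four rules form a chain in which each tier accepts traffic only from the tier in front of it. As a rough model for reasoning about it (the short group names and the helper are hypothetical, not AWS API calls):

```python
# Each security group lists the only sources it accepts traffic from.
ALLOWED_SOURCES = {
    "web-dmz": {"internet"},  # port 443 from the internet
    "web": {"web-dmz"},
    "app": {"web"},
    "db": {"app"},
}


def is_allowed(source: str, destination: str) -> bool:
    """Return True when the destination group accepts traffic from source."""
    return source in ALLOWED_SOURCES.get(destination, set())


if __name__ == "__main__":
    print(is_allowed("internet", "web-dmz"))  # True
    print(is_allowed("internet", "db"))       # False
```

Because the database only trusts the app tier, a request from the internet has to pass through every layer before it can reach the data.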

A NAT gateway will be used to download security patches for the web and app servers. This is required to keep security and software patches up to date.

Application support and maintenance: application engineers can perform code deployments from the corporate network to the servers via the site-to-site VPN. This connectivity will be used for application support and deployments.

Monitoring and alerts

CloudWatch will be used to monitor system performance and health. If any issue occurs, CloudWatch will trigger alerts to the application support team. CloudTrail can be used to monitor user activity for audit purposes.

Backups

Scheduled backups will be enabled in the database with an appropriate retention period. After the retention period, the backups will be moved to S3 Glacier deep archive to save costs using S3 life cycle management.

In addition, periodic manual snapshots of the DB can be taken and moved to S3 Glacier Deep Archive.

Disaster recovery

The same infrastructure set can be deployed in another AWS region for disaster recovery purposes. The region should be selected based on corporate compliance policies.