01

You click "Create VPC". Now what?

A tour of what this codebase actually is — and what happens the moment a customer clicks a button.

What is this thing?

FPT Cloud BSS is the cloud portal that lets FPT's customers spin up virtual machines, networks, Kubernetes clusters, S3-style storage, and backups — without ever touching the underlying hardware.

The folder you're looking at, portal-api-dev, is the brain of the backend. It's not one program — it's a whole town of cooperating Python services.

🏙️
Mental model: a town, not a building

Think of this codebase as a small town. Each neighborhood (service) does one job. They mail letters to each other through a central post office (the API gateway). Nobody does everything — and that's the whole point.

The numbers, in one glance

📁

10,167 files

Spread across half a dozen Python services that boot independently.

🧩

64,000+ symbols

Functions, classes, and methods — the moving parts of the town.

🔗

226,000 relationships

How those parts call, import, and depend on each other.

🌊

300 execution flows

Distinct end-to-end journeys a request can take through the town.

If that sounds intimidating — relax. You don't walk every street to know a city. You learn the shape of it. That's what this course does.

A click is the start of a long journey

Imagine a customer logs into the portal, fills out a form to make a new VPC, and clicks Create. Their browser sends one HTTP request. Look what's actually waiting on the other end:

🖱️
Browser
🚪
Kong
🧠
User Svc
⏳
Celery
🏗️
IaaS
Click "Next Step" to follow a click through the system

That single click triggered five different processes on three different machines. None of them know about all the others — and yet, somehow, a working network appears two minutes later. The rest of this course explains how.

Why you should care

🎯

Steer AI better

When you tell an AI agent "add a new endpoint", you'll know which of the five services it belongs in.

🐛

Debug faster

"It worked then nothing happened" usually means the request succeeded but a background task died. You'll know where to look.

🧱

Make architecture calls

Should this be a real-time API or a queued task? After this course, the answer will feel obvious.

02

Meet the cast

The five services that share the work — and which one to blame when something breaks.

It's a relay race, not a marathon

This codebase isn't one giant program. It's a relay team. Each runner carries the baton for their leg of the race, then passes it on. If you yell at the wrong runner, nothing improves.

Here's the team. Memorize these five names — half of all your debugging will be "which one of these is the problem?"

🚪

Kong (the gate)

The API gateway. Every request from the outside world hits Kong first. It checks your login token and forwards the request to the right service.

🧠

portal-api-svc-user

The brain. A Flask app that owns customers, projects, VPCs, and most business rules. When in doubt, the bug is probably here.

🏗️

portal-api-svc-iaas-vmw

The hands. Talks to VMware and OpenStack to actually create networks, VMs, and disks in the data center.

⏳

portal-celery

The patient one. Celery workers that pull slow jobs off a queue — provisioning, billing rollups, sending emails — so users never have to wait.

📅

portal-celery-cron

The town clock. Wakes up on a schedule (every minute, every hour, every midnight) and triggers recurring jobs — quota checks, expiry warnings, summary emails.

A peek at the brain booting up

Here is the actual code that starts the User service in development. This is the literal entry point — what runs when a developer presses Run in PyCharm.

CODE · index_dev.py
import os, sys
from dotenv import load_dotenv
load_dotenv()
sys.path.append("../portal-api-core")
sys.path.append("../fptcloud-api")
env = os.environ
env["FLASK_DEBUG"] = "1"
import service
svc = service.Service()
app = svc.app
if __name__ == "__main__":
    svc.run(port=5001, threaded=True)
PLAIN ENGLISH

Bring in tools to read settings from disk and tweak Python's import path.

Load secrets and config from a hidden .env file — like opening a sealed envelope of passwords before starting work.

Tell Python "also look in those two sister folders for code" — that's how this service borrows shared utilities from portal-api-core.

Flip on debug mode so errors show full details instead of a polite "something went wrong" page.

Now actually build the service object — this is where Flask wakes up and registers every URL.

Start the web server on port 5001, and let it handle several requests at the same time (threaded=True).

Who reports to whom?

None of these services know about all the others. The relationships are deliberately one-directional — that's what keeps the town from collapsing into chaos.

1

Browser → Kong

2

Kong → User Svc

3

User Svc → DB + Queue

4

Celery → IaaS Svc

5

IaaS Svc → VMware

💡
Why split it up?

If everything lived in one giant program, every tiny VMware hiccup would crash the login page. Splitting by responsibility means a slow VMware call only slows VMware-related requests — the rest of the portal keeps humming.

Quiz: which runner gets the baton?

A customer triggers "send me a billing report by email". You want to add this feature. Where does the email-sending logic actually live?

A bug report says "VPCs are being created with the wrong owner". Which service do you open first?

03

How they actually talk

HTTP for the urgent stuff. A queue for the slow stuff. A database for everyone to look at.

Two channels, very different rules

Services need to coordinate — but not all coordination is the same. Some things must happen now ("is this user logged in?"). Some things can happen eventually ("provision this VM, it'll take 90 seconds").

This codebase uses two distinct channels for those two situations:

📞

HTTP — the phone call

Synchronous. Sender waits for the answer. Used when the caller can't continue without a reply: "validate this token", "fetch this VPC's details".

📬

Celery + Redis — the mailbox

Asynchronous. Sender drops a job in the mailbox and walks away. A worker picks it up minutes (or seconds) later. Used for slow, retry-able work.

📮
The metaphor that sticks

HTTP is a phone call — both people are on the line, and someone is waiting. A Celery task is a postcard — you drop it in the mailbox and trust it'll get delivered. You wouldn't phone someone to say "happy birthday next year"; you wouldn't mail a postcard to ask "is my house on fire right now?".
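The two channels can be sketched with nothing but the standard library. This is a toy stand-in, not the real Flask and Celery wiring: `validate_token`, `worker`, and the in-process queue are all hypothetical names for illustration.

```python
import queue
import threading

job_queue = queue.Queue()  # stands in for the Redis-backed Celery queue

def validate_token(token: str) -> bool:
    """The 'phone call': synchronous, the caller blocks until it has an answer."""
    return token == "valid"

results = []

def worker() -> None:
    """A Celery-style worker: pulls slow jobs off the mailbox, one at a time."""
    while True:
        job = job_queue.get()
        results.append(f"provisioned {job['name']}")
        job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# The HTTP handler's job: do the urgent check inline, drop the slow work
# in the mailbox, and return at once -- the user never waits for provisioning.
assert validate_token("valid")        # phone call: must happen now
job_queue.put({"name": "vpc-demo"})   # postcard: happens eventually
job_queue.join()                      # only this demo waits; a real handler returns here

assert results == ["provisioned vpc-demo"]
```

The real system swaps the `queue.Queue` for Redis and the thread for a separate Celery worker process — but the shape of the decision (block on it, or enqueue it?) is exactly this.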

A real conversation, group-chat style

Here's what the services actually "say" to each other when a customer creates a VPC. Tap through the messages:

The shared whiteboard: the database

Notice how nobody in that conversation says "here's the VPC data, please pass it along". Instead, everyone reads and writes the same database row. The DB is the shared whiteboard the whole team works on.

CODE · vpc_mgr.py
vpc = md.VPC(
    id=self.create_uuid(),
    created_at=date_util.utc_now(),
)
ctx.args["_vpc"] = vpc
data_util.assign_model_data(
    vpc,
    dict(
        status=md.VPCStatus.INIT,
        type=type, name=name,
        project_id=project_id,
        cidr=cidr,
    ),
)
PLAIN ENGLISH

Build a brand-new VPC record in memory and stamp it with a unique ID and the current time.

Store the new record in a "context bag" so other parts of this request can find it without re-fetching.

Fill in the actual fields — name, project, network range — and most importantly mark status = INIT.

That status is the signal: this VPC exists on paper, but the real network hasn't been built yet. Celery will flip it to READY later.

🪄
The status field is the magic

This pattern — write a row with status=INIT, return success immediately, then have a background worker flip the status when the real work finishes — is a form of eventual consistency. It's everywhere in cloud systems. Once you see it, you'll see it everywhere.
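The whole pattern fits in a few lines. A minimal sketch, assuming stand-in `VPC` and `VPCStatus` classes (the real ones live in the `md` models module and are persisted to Postgres):

```python
from dataclasses import dataclass
from enum import Enum

class VPCStatus(Enum):
    INIT = "INIT"    # exists on paper only
    READY = "READY"  # the real network is built
    ERROR = "ERROR"  # provisioning failed; oncall gets paged

@dataclass
class VPC:
    name: str
    status: VPCStatus = VPCStatus.INIT

def handle_create_request(name: str) -> VPC:
    """The HTTP side: write the row with status=INIT and return immediately."""
    return VPC(name=name)  # committed to Postgres in the real code

def provision_task(vpc: VPC) -> None:
    """The Celery side: do the slow work, then flip the status."""
    try:
        # ... the real VMware/OpenStack calls happen here ...
        vpc.status = VPCStatus.READY
    except Exception:
        vpc.status = VPCStatus.ERROR

vpc = handle_create_request("demo")
assert vpc.status is VPCStatus.INIT   # the user already saw "success"
provision_task(vpc)
assert vpc.status is VPCStatus.READY  # minutes later, the network exists
```

The frontend just polls (or subscribes to) that status field — which is why "the portal said success but nothing appeared" always means: go look at what the worker did between INIT and READY.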

Quiz: spot the right channel

You're adding a "resize disk" feature. Should the actual resize run inside the HTTP request, or in a background Celery task?

After Celery finishes provisioning a VPC, how does the frontend know it's done?

04

The outside world

The portal doesn't actually own a single server. It's a polite middleman to half a dozen other systems.

The portal owns nothing — it just orchestrates

This is the secret most people miss when they first see a cloud portal: nearly all the hard work is done by other systems. The portal is a conductor in front of an orchestra — it doesn't play any instruments itself.

🔐

Keycloak

The bouncer. Owns customer identities and logins. The portal asks Keycloak "is this person who they say they are?" via SSO.

🖥️

VMware vSphere

The actual servers. When the portal wants a VM, it asks VMware politely (over a SOAP API). VMware says "fine" and creates it on real hardware in the data center.

☁️

OpenStack (OSP)

An alternative cloud backend. Some customers run on VMware, others on OpenStack — the portal speaks both languages so users don't have to know.

🪣

S3-compatible storage

The filing cabinet. Object storage for backups, snapshots, and large blobs. Speaks Amazon's S3 protocol so any S3 tool just works.

🗃️

PostgreSQL + Redis

The whiteboard and the mailbox. Postgres holds the long-term truth (VPCs, users, billing). Redis holds short-term scratchpad data and the Celery queue.

📊

SigNoz / Sentry

The black-box recorder. Captures errors and request traces so when something explodes at 2am, oncall has a flight recorder to read.

Why "calls an outside system" is a big deal

Every time the portal calls something it doesn't own, five new failure modes appear at once. As an AI-coding-tool driver, you need to feel these in your gut:

latency It's slow. A VMware call can take seconds. The portal must never block on it directly — that's why Celery exists.
flakiness It can fail randomly. Network glitches happen. Code that calls outside services must be ready to retry — and ready to not retry when the operation already half-succeeded.
rate limit It can throttle you. Most APIs cap how many calls per second you can make. Bursting too hard gets you banned.
cost Every call may cost money or quota. S3 charges per request. VMware burns CPU. Naive loops can rack up huge bills.
credentials You need a key, token, or certificate. Lose it or commit it to git and you have an incident.

A scheduled job, in the wild

Here's how the cron service decides when to run a recurring job. Real code from portal-celery/index.py:

CODE · portal-celery/index.py
def init_crontab_from_expression(expression: str) -> crontab:
    fields = expression.split(" ")
    return crontab(
        minute=fields[0],
        hour=fields[1],
        day_of_month=fields[2],
        month_of_year=fields[3],
        day_of_week=fields[4],
    )
PLAIN ENGLISH

Take a string like "0 3 * * *" — the famous cron expression format from Unix.

Split it into its five pieces by spaces.

Hand each piece to Celery's scheduler so it knows when to wake up and trigger the task.

Why pull the schedule from a string in the environment? So ops can change "run nightly" to "run hourly" without redeploying code — just edit a config and restart.
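Wiring that helper into a schedule might look like this. A sketch only: `parse_cron_fields` mirrors the helper without the Celery import so the five-field split is visible on its own, and the `BILLING_CRON` variable name is hypothetical.

```python
import os

def parse_cron_fields(expression: str) -> dict:
    """Mirrors init_crontab_from_expression, minus Celery, to show the split."""
    minute, hour, dom, month, dow = expression.split(" ")
    return dict(minute=minute, hour=hour, day_of_month=dom,
                month_of_year=month, day_of_week=dow)

# Ops owns the schedule: edit the env var, restart, done. No redeploy.
os.environ.setdefault("BILLING_CRON", "0 3 * * *")  # nightly at 03:00

fields = parse_cron_fields(os.environ["BILLING_CRON"])
assert fields == {
    "minute": "0", "hour": "3", "day_of_month": "*",
    "month_of_year": "*", "day_of_week": "*",
}
# In the real service these fields feed celery.schedules.crontab, and the
# result is registered in the beat schedule under the task's name.
```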

⚙️
A vibe-coding takeaway

When you ask AI to add "a job that runs every night", the right pattern in this codebase is: register a Celery beat schedule whose expression comes from an environment variable. Now you can tell the AI agent exactly that, in those words.

05

When things break

A real bug from this codebase, and the debugging instincts you'll steal from it.

A true story: the duplicate VPC bug

Customers reported that sometimes, after clicking Create VPC, they ended up with two identical VPCs in their account — same name, same network range, same created-at timestamp down to the second.

This actually happened in this codebase. The fingerprints in the database were unmistakable: identical timestamps. Two things had been created at the exact same instant. Why?

🚨
The smell test

Whenever you see "duplicates with identical timestamps", your first hypothesis should be: the same operation ran twice in parallel. Not "ran twice in sequence" — that would have different timestamps. Parallel.

The two suspects

Two patterns in this codebase, working together, made the bug possible:

🖱️🖱️

Suspect #1: double-click

A user impatiently clicks "Create" twice in a row. The frontend fires two HTTP requests. Both arrive within milliseconds.

🔁

Suspect #2: aggressive Celery retries

Celery tasks in this codebase retry on any exception, up to 5 times. If the first attempt actually finished but a network blip made it look like a failure, Celery dutifully runs it again. And again.

Either suspect alone is annoying. Together, they're a duplicate factory. And neither piece of code was "wrong" in isolation — that's what makes these bugs hard.

The fix has a name: idempotency

The cure for "the same operation accidentally ran twice" is to make sure that running it twice produces the same result as running it once. Engineers call this idempotency — and it's one of the most important concepts in this whole course.

🛗
Mental model: an elevator button

Pressing an elevator call button five times doesn't summon five elevators. The button is idempotent — extra presses are ignored once the request has been registered. Code that creates resources in the cloud should behave the same way.

The most common ways to make an operation idempotent:

unique key Add a database constraint: "you can't have two VPCs with the same (project_id, name)". The second insert fails loudly instead of quietly succeeding.
request ID The frontend generates a UUID for each click and sends it in the request. The backend remembers seen IDs and refuses to act on the same one twice.
check-then-act Before creating a VPC, look for one that already matches. If you find it, return it instead of creating a new one. (Watch out: this still races without a lock.)
narrow retries Don't retry on every exception — only on specific, transient ones (timeouts, 503s). A retry on "constraint violated" makes things worse, not better.
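The "request ID" technique, for instance, fits in a dozen lines. This is a toy version: the real system would store seen IDs in Postgres or Redis, because an in-process Python set neither survives restarts nor protects two workers racing.

```python
seen_request_ids = set()  # Postgres/Redis in real life; a set races across workers
created_vpcs = []

def create_vpc(request_id: str, name: str) -> str:
    """Idempotent create: the same request_id never creates twice."""
    if request_id in seen_request_ids:
        return f"already handled {request_id}"  # retry or double-click: no-op
    seen_request_ids.add(request_id)
    created_vpcs.append(name)                   # stands in for the DB insert
    return f"created {name}"

rid = "req-7f3a"  # the frontend would mint a fresh UUID per click
assert create_vpc(rid, "vpc-demo") == "created vpc-demo"
assert create_vpc(rid, "vpc-demo") == "already handled req-7f3a"  # double-click
assert created_vpcs == ["vpc-demo"]  # still exactly one VPC
```

Pair this with a database unique constraint as the backstop: the request-ID check catches most duplicates cheaply, and the constraint catches the race the check can't.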

Spot the bug

Here's a simplified version of the kind of Celery task that caused the duplicates. Click the line that's the root problem.

1 @celery_app.task(bind=True, max_retries=5)
2 def create_vpc_task(self, name, project_id, cidr):
3     try:
4         vpc = VPC(name=name, project_id=project_id, cidr=cidr)
5         db.session.add(vpc); db.session.commit()
6     except Exception as e:
7         raise self.retry(exc=e)

Hint: it's not any single line being "wrong" — it's what the function signature is missing that makes the retries dangerous.

Quiz: debugging instincts

A user reports duplicate VPCs. You've confirmed it's caused by double-click. What's the most durable fix?

A customer says: "I clicked Create. The portal said success. But the VPC never appeared." Where do you look first?

You're writing a new Celery task that calls VMware. What's the correct retry policy?

You made it through the town

You now know the five services, the two channels they talk on, the outside systems they orchestrate, and the most common way they fail. That's already more architectural literacy than most developers who've worked on a codebase like this for a month.

Next time you tell an AI agent to "add a feature to the portal", you can be specific: which service, which channel, what idempotency key. That specificity is the difference between AI that builds the right thing and AI that confidently builds the wrong thing.