Architecture & System Design

Most architecture problems aren't about choosing the wrong pattern. They're about the architecture living in one person's head: undocumented, unreviewed, and invisible to everyone else. The real leverage of architectural practice is making the thinking visible so it can be shared, challenged, and improved.


Why this matters

Architecture is where S&P's values of Teamwork and Evolution become structural. A system that only one person understands is a system the team cannot improve, cannot safely change, and cannot hand over to a client with confidence. On a small team, it feels like everyone "knows" the architecture, until someone leaves, a new developer joins, or the client asks for documentation. Then you discover it was never written down.

Documented architecture is architecture you can evolve. Undocumented architecture is a liability that grows more expensive every month.


The standard

How we think about architecture

S&P builds client-facing products with small to medium teams: typically 2-5 developers including a delivery manager, QA, and a DevOps-capable engineer. Architecture practice must be proportional to this reality. There are no enterprise architecture boards. There are no month-long design phases. There is a team of people who need to build something well, and who need enough shared understanding to do it without constant synchronisation.

Architecture is not a phase that ends when sprint 1 starts. The initial design happens before development begins, but architectural decisions happen throughout a project's life: when you add an integration, when you split a service, when load patterns change, when the client's requirements shift.

The core practice is three things:

  1. Make it visible. Diagrams and written descriptions that show how the system works, not just how the code is organised.
  2. Make it reviewable. An architecture review process that catches problems before they become expensive to fix.
  3. Make it findable. Consistent documentation locations so anyone on the team (or joining the team) knows where to look.

This is the "Write it down" principle from Engineering Principles applied to system design.

C4 model as the diagramming standard

C4 is the standard for system architecture diagrams at S&P. It provides a zoom-level hierarchy (Context, Containers, Components, Code) that matches how people actually think about systems: from "what does this system do?" down to "how does this service work internally?"

Level 1, System Context: always create this.

Shows the system as a single box, surrounded by its users (people, external systems) and the interactions between them. This is what you show the client when they ask "what does the system do?" Every project gets a Level 1 diagram before development starts.

Level 2, Container: always create for multi-service systems.

Shows the major technical building blocks (API server, web app, database, message queue, third-party integrations) and how they communicate. This is the equivalent of a traditional High-Level Design (HLD). It answers "what are the moving parts and how do they talk to each other?"

Level 3, Component: only when needed.

Shows the major components inside a single container (e.g., modules in a NestJS application). Use this sparingly, only for services complex enough that the Container diagram doesn't convey the internal structure. Most S&P projects don't need Level 3 diagrams for every container.

Level 4, Code: almost never.

The code itself is the documentation at this level. Only create a Level 4 diagram when onboarding someone to a particularly complex subsystem where the code structure is genuinely hard to navigate.
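One way to keep a Level 2 (Container) diagram reviewable is to hold it as data and render it to text. The sketch below is illustrative only: the `ContainerDiagram` model, the `toMermaid` function, and the booking-system example are hypothetical names, not an S&P standard. It emits the macros used by Mermaid's C4 diagram syntax (`Person`, `Container`, `ContainerDb`, `System_Ext`, `Rel`) and rejects relationships that point at elements missing from the diagram.

```typescript
// Hypothetical in-memory model of a C4 Level 2 (Container) diagram.
type ElementKind = "person" | "container" | "database" | "external";

interface C4Element {
  id: string; // alias referenced by relationships
  kind: ElementKind;
  label: string;
  technology?: string; // only rendered for containers and databases
  description: string;
}

interface C4Relationship {
  from: string;
  to: string;
  label: string;
  technology?: string;
}

interface ContainerDiagram {
  title: string;
  elements: C4Element[];
  relationships: C4Relationship[];
}

// Mermaid C4 macro name for each element kind.
const MACROS: Record<ElementKind, string> = {
  person: "Person",
  container: "Container",
  database: "ContainerDb",
  external: "System_Ext",
};

// Wrap a value in double quotes, swapping inner double quotes for single ones.
const quote = (s: string): string => `"${s.replace(/"/g, "'")}"`;

function toMermaid(diagram: ContainerDiagram): string {
  // Fail early if a relationship references an element that is not on the diagram.
  const ids = new Set(diagram.elements.map((e) => e.id));
  for (const r of diagram.relationships) {
    if (!ids.has(r.from) || !ids.has(r.to)) {
      throw new Error(`Relationship ${r.from} -> ${r.to} references an unknown element`);
    }
  }

  const lines = ["C4Container", `  title ${diagram.title}`];
  for (const e of diagram.elements) {
    const hasTech = e.kind === "container" || e.kind === "database";
    const args = hasTech
      ? [e.id, quote(e.label), quote(e.technology ?? ""), quote(e.description)]
      : [e.id, quote(e.label), quote(e.description)];
    lines.push(`  ${MACROS[e.kind]}(${args.join(", ")})`);
  }
  for (const r of diagram.relationships) {
    const args = [r.from, r.to, quote(r.label)];
    if (r.technology) args.push(quote(r.technology));
    lines.push(`  Rel(${args.join(", ")})`);
  }
  return lines.join("\n");
}

// Invented example system, for illustration only.
const example: ContainerDiagram = {
  title: "Container diagram for a hypothetical booking system",
  elements: [
    { id: "customer", kind: "person", label: "Customer", description: "Books appointments" },
    { id: "web", kind: "container", label: "Web App", technology: "React", description: "Browser UI" },
    { id: "api", kind: "container", label: "API Server", technology: "NestJS", description: "Business logic" },
    { id: "db", kind: "database", label: "Database", technology: "PostgreSQL", description: "Bookings and users" },
    { id: "payments", kind: "external", label: "Payment Provider", description: "Third-party card processing" },
  ],
  relationships: [
    { from: "customer", to: "web", label: "Uses", technology: "HTTPS" },
    { from: "web", to: "api", label: "Calls", technology: "JSON/HTTPS" },
    { from: "api", to: "db", label: "Reads and writes", technology: "SQL" },
    { from: "api", to: "payments", label: "Charges cards", technology: "HTTPS" },
  ],
};

const mermaidText = toMermaid(example);
```

The resulting `mermaidText` can be pasted into any Mermaid renderer. Keeping the model in the repo means a diagram change shows up in a diff and can be reviewed like code.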

Flow diagrams

C4 diagrams show static structure: what exists and how it connects. You also need flow diagrams to show dynamic behaviour:

  • Sequence diagrams for new API flows that involve multiple services or components.
  • Activity diagrams for E2E user journeys that cross service boundaries.
  • Event flow diagrams for async or event-driven processing (queues, pub/sub, webhooks).

These complement C4 diagrams. They answer "what happens when a user does X?" rather than "what is the system made of?"

HLD and LLD for client deliverables

Some clients require traditional HLD/LLD documents. When they do, map C4 levels to client expectations:

| Client asks for | Deliver (C4) |
|-------------------------|---------------------------------------------|
| High-Level Design (HLD) | Level 1 (Context) + Level 2 (Container) |
| Low-Level Design (LLD) | Level 3 (Component) for relevant containers |

The content is the same; the packaging differs. Use the S&P HLD template on Confluence as the starting point for client-facing documents.

Diagramming tools

S&P teams use four tools depending on context:

  • OneModel: Data-centric diagrams and system maps. Templates are available.
  • Mermaid: All-in-one editor for diagrams (flowcharts, C4, architecture).
  • Eraser: Fast, collaborative, good for iterating on designs in real time using AI.
  • Draw.io: Detailed diagrams, embeds well in Confluence, good for polished client deliverables.

The standard is the output (C4 levels with clear labels, consistent notation, and up-to-date content), not the tool. Use whichever fits your workflow. Store diagrams alongside the project documentation (Confluence for client-facing, repo for internal).

Architecture decision records

Architectural decisions need to be captured, not just made. S&P uses a dual system:

DACI records on Confluence for decisions that cross team boundaries or involve non-engineering stakeholders: cloud provider choices, major technology changes, client-facing architectural proposals, cross-project standards. The DACI template and process are defined in Engineering Principles. Use the S&P DACI board on Confluence.

Lightweight ADRs in the project repo for decisions scoped to a single project that primarily concern engineers: library choices, API design decisions, database schema decisions, patterns adopted. These live in docs/decisions/ in the project repository.

The routing rule: If the decision crosses team boundaries or involves non-engineering stakeholders, use DACI on Confluence. If it's scoped to one project and affects engineers only, use an ADR in the repo.
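The routing rule reduces to two yes/no questions. A minimal sketch, assuming a hypothetical `Decision` shape and `routeDecision` helper (neither is part of any S&P tooling):

```typescript
type RecordType = "DACI on Confluence" | "ADR in repo";

// Hypothetical description of a decision; the two flags mirror the routing rule.
interface Decision {
  title: string;
  crossesTeamBoundaries: boolean;
  involvesNonEngineeringStakeholders: boolean;
}

// Either flag alone is enough to require a DACI record.
function routeDecision(d: Decision): RecordType {
  return d.crossesTeamBoundaries || d.involvesNonEngineeringStakeholders
    ? "DACI on Confluence"
    : "ADR in repo";
}

// Examples taken from the categories above.
const cloudProvider = routeDecision({
  title: "Choose cloud provider",
  crossesTeamBoundaries: true,
  involvesNonEngineeringStakeholders: true,
});

const libraryChoice = routeDecision({
  title: "Pick a date-handling library",
  crossesTeamBoundaries: false,
  involvesNonEngineeringStakeholders: false,
});
```

Here `cloudProvider` resolves to a DACI record and `libraryChoice` to a repo ADR. When in doubt about either flag, treating it as true errs on the side of visibility.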

The ADR template below is deliberately lighter than the DACI format. It captures just enough context for project-scoped technical decisions without the stakeholder tracking (Driver, Approver, Contributors, Informed) that DACI provides for cross-team decisions.

ADR template

# ADR-XXXX: [Short title]

| Field | Value |
|------------|------------------------------------------------------------|
| **Status** | Proposed / Accepted / Deprecated / Superseded by ADR-YYYY |
| **Date** | YYYY-MM-DD |
| **Author** | [Name] |

## Context

What is the technical situation that requires a decision? What constraints exist?

## Decision

What did we decide and why?

## Consequences

What are the positive and negative consequences of this decision?

## Alternatives considered

What other approaches did we evaluate and why did we reject them?

ADRs are immutable. When a decision changes, write a new ADR that supersedes the old one rather than editing its content; the only change to the old record is its Status, which becomes "Superseded by ADR-YYYY". This creates a decision trail: you can always trace why the architecture evolved the way it did.
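Because superseded ADRs stay in the repo, finding the decision that is currently in force means following the "Superseded by" links. The sketch below shows one way to do that. The `Adr` shape, the `currentDecision` helper, and the sample records are hypothetical; only the status values come from the template above.

```typescript
interface Adr {
  id: string; // e.g. "ADR-0003"
  title: string;
  status: "Proposed" | "Accepted" | "Deprecated" | "Superseded";
  supersededBy?: string; // set only when status is "Superseded"
}

// Follow the supersession chain from startId to the ADR that is not superseded.
function currentDecision(adrs: Adr[], startId: string): Adr {
  const byId = new Map<string, Adr>();
  for (const a of adrs) byId.set(a.id, a);

  const seen = new Set<string>();
  let id = startId;
  for (;;) {
    // A well-formed decision trail never loops back on itself.
    if (seen.has(id)) throw new Error(`Supersession cycle at ${id}`);
    seen.add(id);

    const adr = byId.get(id);
    if (!adr) throw new Error(`Unknown ADR: ${id}`);
    if (adr.status !== "Superseded" || !adr.supersededBy) return adr;
    id = adr.supersededBy;
  }
}

// Invented decision trail, for illustration only.
const trail: Adr[] = [
  { id: "ADR-0001", title: "Store sessions in the database", status: "Superseded", supersededBy: "ADR-0004" },
  { id: "ADR-0004", title: "Store sessions in Redis", status: "Superseded", supersededBy: "ADR-0007" },
  { id: "ADR-0007", title: "Use stateless JWT sessions", status: "Accepted" },
];

const inForce = currentDecision(trail, "ADR-0001");
```

Starting from `ADR-0001`, `inForce` is `ADR-0007`, and the two older records remain readable as the history of how the team got there.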

For decisions that affect multiple people, build consensus before writing the ADR, not after. This is the Nemawashi principle from Engineering Principles: informal alignment first, formal record second.

API design principles

This section covers how we think about APIs at a conceptual level. Framework-specific implementation (NestJS decorators, Spectral linting, Zalando rule enforcement) belongs in the Backend appendix.

Contract-first design. Define the API contract (OpenAPI specification) before writing implementation code. This aligns frontend and backend teams early, catches design issues before code is written, and enables API type generation from day one. The backend generates the OpenAPI spec; the frontend consumes it to generate typed clients.

Consistency over cleverness. APIs across S&P projects should feel similar. Consistent resource naming, consistent error response shapes, consistent pagination patterns. When a developer moves between projects, the APIs should not feel foreign. The Zalando RESTful API Guidelines are the reference standard: adopt their naming conventions and patterns, adapted to S&P's stack.

URL path versioning. Use /api/v1/ for API versioning. Header-based versioning adds complexity that is not justified for most S&P projects. When a breaking change is unavoidable, version only the affected endpoints rather than the entire API surface.
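As an illustration of versioning per endpoint, the sketch below builds paths from a small route table. The `Route` shape, the `routePath` helper, and the `users` and `orders` resources are hypothetical. Only `orders` has a second version, and its v1 path stays in place next to it.

```typescript
interface Route {
  resource: string; // e.g. "users"
  version: number; // bumped only when this endpoint has a breaking change
}

// Build a path in the /api/v{n}/{resource} style described above.
function routePath(route: Route): string {
  return `/api/v${route.version}/${route.resource}`;
}

const routes: Route[] = [
  { resource: "users", version: 1 },
  { resource: "orders", version: 1 },
  { resource: "orders", version: 2 }, // breaking change to orders only; users stays on v1
];

const paths = routes.map(routePath);
// paths: ["/api/v1/users", "/api/v1/orders", "/api/v2/orders"]
```

Clients of `users` are unaffected by the `orders` change, which is the point of not bumping the whole API surface.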

Consistent error responses. All APIs should return errors in a predictable shape so frontend code can handle them generically. The specific error format and error handling patterns are defined in Code Standards; the principle here is that error consistency is an architectural decision, not an afterthought.

Capacity planning and load testing

Architecture decisions should be grounded in numbers, not instinct. Before committing to an infrastructure shape or a scaling strategy, do the math: even rough math is better than none.

Back-of-the-envelope estimation

At project kickoff (or before any major architectural change), estimate the key numbers that drive infrastructure decisions:

| What to estimate | Why it matters |
|------------------|----------------|
| Expected concurrent users (peak) | Determines instance sizing, connection pool limits, WebSocket capacity |
| Requests per second (peak) | Drives API server count, rate limiting, queue throughput |
| Data volume and growth rate | Shapes database choice, storage tier, backup strategy, retention policy |
| Payload sizes (API responses, file uploads) | Affects bandwidth costs, CDN needs, timeout configuration |
| Third-party API rate limits | Constrains integration design; you may need queues or caching layers you didn't plan for |

How to do it: Start with the number of users the client expects at launch and at 6/12/24 months. Multiply through the user journey to estimate requests, storage, and bandwidth. Use cloud provider pricing calculators to convert estimates into cost ranges. Document the assumptions: they are as valuable as the numbers because they tell you what to revisit when conditions change.
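The multiplication itself is simple enough to write down as code, which also forces the assumptions into the open. The sketch below is a rough model, not a sizing tool: the field names, the 30-day month, and every input value in the example are assumptions chosen for illustration and should be replaced with the client's own figures.

```typescript
// Every field is an assumption to be documented alongside the result.
interface Assumptions {
  registeredUsers: number;
  dailyActiveRatio: number; // share of registered users active on a given day
  requestsPerUserPerDay: number;
  peakToAverageRatio: number; // how much busier the peak is than the daily average
  avgResponseBytes: number;
  storedBytesPerUserPerDay: number;
}

interface Estimate {
  dailyActiveUsers: number;
  avgRequestsPerSecond: number;
  peakRequestsPerSecond: number;
  egressGbPerMonth: number;
  storageGrowthGbPerMonth: number;
}

const SECONDS_PER_DAY = 86_400;
const DAYS_PER_MONTH = 30; // rough month, good enough for an envelope estimate
const BYTES_PER_GB = 1e9;

function estimate(a: Assumptions): Estimate {
  const dailyActiveUsers = a.registeredUsers * a.dailyActiveRatio;
  const requestsPerDay = dailyActiveUsers * a.requestsPerUserPerDay;
  const avgRequestsPerSecond = requestsPerDay / SECONDS_PER_DAY;
  return {
    dailyActiveUsers,
    avgRequestsPerSecond,
    peakRequestsPerSecond: avgRequestsPerSecond * a.peakToAverageRatio,
    egressGbPerMonth: (requestsPerDay * a.avgResponseBytes * DAYS_PER_MONTH) / BYTES_PER_GB,
    storageGrowthGbPerMonth:
      (dailyActiveUsers * a.storedBytesPerUserPerDay * DAYS_PER_MONTH) / BYTES_PER_GB,
  };
}

// Illustrative launch scenario; none of these numbers come from a real project.
const launch = estimate({
  registeredUsers: 50_000,
  dailyActiveRatio: 0.2,
  requestsPerUserPerDay: 100,
  peakToAverageRatio: 5,
  avgResponseBytes: 20_000,
  storedBytesPerUserPerDay: 50_000,
});
```

With these inputs the model gives 10,000 daily active users, roughly 12 requests per second on average and roughly 58 at peak, about 600 GB of egress per month, and about 15 GB of storage growth per month. Rerunning it with the 6, 12, and 24 month projections shows which number moves first.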

The S&P Cloud Providers Comparison template on Confluence includes a cost estimation structure with user-tier breakdowns (1K, 50K, 200K, 800K+ users). Use it as a starting point.

When to revisit estimates: After the first real usage data comes in (post-launch), when the client's user projections change, and before any infrastructure scaling decision.

Load testing

Estimates tell you what should work. Load testing tells you what actually works. Run load tests to validate architectural assumptions before they become production problems.

When to load test:

  • Before the first production deployment of a new system.
  • Before a launch with a significant expected user base.
  • After major architectural changes (new service, database migration, caching layer).
  • When the system will handle spiky or unpredictable traffic (marketing campaigns, seasonal peaks).

What to test:

  • Baseline load: Sustained traffic at expected average usage. Does the system hold steady without resource exhaustion or degrading response times?
  • Peak load: Traffic at estimated peak concurrency. Do connection pools, queues, and databases handle the spike?
  • Stress test: Traffic beyond expected peak. Where does the system break first? Understanding the failure mode matters more than the exact breaking point.
  • Endurance (soak) test: Sustained moderate load over hours. Catches memory leaks, connection pool exhaustion, and log storage overflow that don't appear in short bursts.

Tools: k6 for scripted load tests (JavaScript-based, good fit for S&P's stack), Artillery as an alternative. For simple endpoint checks, ab or wrk work. Cloud providers offer managed load testing services for large-scale scenarios.
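The four test types above differ mainly in how many virtual users they ramp to and for how long. The sketch below expresses each as a list of stages in the shape k6 expects for `options.stages` (a `duration` string and a `target` number of virtual users). The `stagesFor` helper, the durations, and the 1.5x and 2x stress multipliers are illustrative assumptions, not recommended values.

```typescript
// One entry of k6's options.stages: ramp to `target` virtual users over `duration`.
interface Stage {
  duration: string;
  target: number;
}

type Profile = "baseline" | "peak" | "stress" | "soak";

interface LoadNumbers {
  averageUsers: number; // expected average concurrent users
  peakUsers: number; // estimated peak concurrent users
}

// Durations and multipliers are placeholders; tune them per project.
function stagesFor(profile: Profile, n: LoadNumbers): Stage[] {
  switch (profile) {
    case "baseline": // sustained traffic at expected average usage
      return [
        { duration: "2m", target: n.averageUsers },
        { duration: "10m", target: n.averageUsers },
        { duration: "1m", target: 0 },
      ];
    case "peak": // ramp from average to estimated peak and hold
      return [
        { duration: "2m", target: n.averageUsers },
        { duration: "2m", target: n.peakUsers },
        { duration: "5m", target: n.peakUsers },
        { duration: "1m", target: 0 },
      ];
    case "stress": // step beyond peak to find what breaks first
      return [
        { duration: "3m", target: n.peakUsers },
        { duration: "3m", target: Math.round(n.peakUsers * 1.5) },
        { duration: "3m", target: n.peakUsers * 2 },
        { duration: "2m", target: 0 },
      ];
    case "soak": // moderate load held for hours
      return [
        { duration: "5m", target: n.averageUsers },
        { duration: "4h", target: n.averageUsers },
        { duration: "5m", target: 0 },
      ];
  }
}

const numbers: LoadNumbers = { averageUsers: 100, peakUsers: 500 };
const stressStages = stagesFor("stress", numbers);
```

In a k6 script, the returned array would be assigned to the `stages` property of the exported `options` object. Deriving `averageUsers` and `peakUsers` from the capacity estimates keeps the load tests tied to the same assumptions as the architecture.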

Capture the results. Load test findings feed directly into architecture decisions. If a test reveals that the database becomes the bottleneck at 500 concurrent users, that's an ADR waiting to happen: document the finding, the options (read replicas, caching, query optimisation), and the decision.

Architecture review

This is not a gate. It is a knowledge-sharing mechanism that catches problems early and spreads architectural understanding across the team.

When to hold an architecture review:

  • Before starting a new project (after the initial C4 diagrams are drafted).
  • When adding a new service or major component to an existing system.
  • When making an infrastructure change that affects the system topology.
  • When a decision record with high impact is proposed.

Format: async-first.

The proposer writes a one-page document: a C4 Level 2 diagram plus a brief narrative explaining what the system does, why it's structured this way, and what trade-offs were made. Share it for async review with a 2-3 business day deadline for written feedback. If async feedback raises unresolved questions, follow up with a 30-minute synchronous discussion.

Who reviews:

At minimum, one engineer not on the project team (for cross-pollination and fresh perspective) and the relevant DRI. For infrastructure changes, include the engineer handling DevOps responsibilities.

Capture the outcome. After the review, create or update the relevant ADR or DACI record. The review discussion itself doesn't need formal minutes; the decision record captures what was decided and why.


Critical thinking

  • Architecture diagrams rot. A diagram that hasn't been updated in six months is worse than no diagram: it's actively misleading. Treat diagrams like code: update them when the system changes, and review them periodically. If you can't keep a diagram current, simplify it until you can.

  • Don't design for a team you don't have. Microservices, event sourcing, CQRS, and domain-driven design solve real problems at scale. A 3-person team building a CRUD application does not need a message bus between six services. Start with a modular monolith. Extract services when you have evidence that a module needs independent scaling, deployment, or team ownership, not before.

  • The 12-Factor App is a starting point, not a religion. Most 12-Factor App principles (config in environment variables, stateless processes, disposable instances) are good defaults for S&P's cloud-deployed applications. But not every factor applies to every project. Apply them where they reduce complexity; don't contort your architecture to satisfy a checklist.

  • Architecture review and code review serve different purposes. Architecture review is about direction: are we building the right thing in the right way? Code review is about execution: is this change correct, clear, and consistent? Don't use code review to re-litigate architectural decisions that were already made and recorded.

  • Client requirements shape the documentation, not the thinking. Some clients want formal HLD/LLD documents. Others just want the system to work. Adjust the documentation format to the audience, but always do the architectural thinking regardless. A C4 Level 2 diagram is valuable even if the client never sees it.

  • Share what you learn across projects. An ADR in one project's repo is invisible to other teams. When you solve an interesting architectural problem or discover a useful pattern, share it: in a knowledge-sharing session, a Slack post, or a playbook update. The best architecture decisions are the ones you only need to make once.


Checklist

Before starting a new project

  • A C4 Level 1 (System Context) diagram exists showing the system, its users, and external dependencies.
  • A C4 Level 2 (Container) diagram exists showing the major technical building blocks and how they communicate.
  • An architecture review has been conducted with at least one engineer outside the project team.
  • Key architectural decisions are captured as ADRs in the repo or DACI records on Confluence.
  • The API contract (OpenAPI spec) is drafted for core endpoints.
  • Back-of-the-envelope capacity estimates exist (concurrent users, requests/sec, data volume, costs).
  • The diagramming tool and documentation location are agreed upon by the team.

Before shipping a major change

  • Diagrams reflect the current system architecture, not the architecture from three months ago.
  • New API endpoints were designed contract-first with an OpenAPI spec before implementation.
  • Any new architectural decisions are documented (ADR or DACI, depending on scope).
  • If the change introduces a new service, component, or integration, the relevant C4 diagram has been updated.
  • If the change affects capacity or performance characteristics, load testing has been run or scheduled.
  • The change has been reviewed by someone with enough context to challenge the approach.

AI tips

  • Generating C4 diagrams: Describe your system's components and their interactions to AI and ask it to generate a C4 Level 2 diagram in Mermaid, PlantUML, or Structurizr DSL. AI tends to get the structure right but misses nuances in communication patterns and data flow direction, so review and refine the output.
  • Drafting ADRs: Describe the decision context and the options you considered. AI is good at structuring the pros/cons comparison and identifying consequences you might miss. It doesn't know your team's specific constraints; always add those yourself.
  • Reviewing architecture proposals: Paste your C4 diagram or architecture narrative and ask AI to identify single points of failure, scalability bottlenecks, and security gaps. Use this as a pre-review sanity check, not a replacement for human review.
  • Back-of-the-envelope estimation: Give AI your user projections and describe the user journey. Ask it to estimate requests/sec, storage growth, and bandwidth. AI is good at the multiplication and unit conversion; you supply the assumptions and sanity-check the output against real-world benchmarks.
  • System design exploration: When evaluating a new technical approach, describe your constraints (team size, timeline, existing stack) and ask AI to compare architectural options. AI is particularly useful for surfacing trade-offs you haven't considered. Cross-reference against the System Design Primer for depth.

Resources