Building NextDeploy: A Journey Through Microservices Hell and Back

January 25, 2025
12 min read
Go
Microservices
DevOps
Startup

After 18 months of wrestling with complex deployment pipelines and vendor lock-in, I decided to build NextDeploy—a deployment platform that gives developers complete control over their infrastructure. What started as a weekend project became a lesson in distributed systems, developer psychology, and the art of saying no to features.


Today, over 10,000 developers trust NextDeploy with their code. Here's the unfiltered story of how we got here.


The Problem That Wouldn't Go Away


Every deployment platform I used felt like a beautiful prison. Heroku was simple but expensive and limiting. AWS was powerful but overwhelming. Vercel was fast but opinionated. Each platform locked me into their ecosystem, their pricing model, their way of thinking about infrastructure.


The breaking point came during a late-night deployment that failed because of a platform-specific limitation I couldn't work around. I was paying $400/month for the privilege of being constrained by someone else's decisions.


That night, I started sketching what a developer-first deployment platform would look like.


The First 100 Lines of Go


I chose Go for the backend because I needed something that would scale without complexity. The first version was embarrassingly simple:


```go
type DeploymentRequest struct {
	RepoURL  string `json:"repo_url"`
	Branch   string `json:"branch"`
	BuildCmd string `json:"build_cmd"`
	StartCmd string `json:"start_cmd"`
}

func handleDeploy(w http.ResponseWriter, r *http.Request) {
	var req DeploymentRequest
	json.NewDecoder(r.Body).Decode(&req)

	// Clone repo, build, deploy
	deployID := uuid.New().String()
	go processDeployment(deployID, req)

	json.NewEncoder(w).Encode(map[string]string{
		"deployment_id": deployID,
		"status":        "processing",
	})
}
```


It worked. Barely. But it worked.
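The processDeployment goroutine that handler fires off isn't shown above. It was little more than a shell-out to git and the build tools; here's a minimal sketch of that shape, with the directory layout, logging, and error handling as assumptions rather than the original code:

```go
// Rough sketch (assumed details) of the original worker: clone the repo,
// run the build command, then launch the start command.
// Uses only the standard library: log, os, os/exec, path/filepath.
func processDeployment(deployID string, req DeploymentRequest) {
	workDir := filepath.Join(os.TempDir(), "nextdeploy", deployID)

	// Clone the requested branch into a scratch directory.
	clone := exec.Command("git", "clone", "--branch", req.Branch, "--depth", "1", req.RepoURL, workDir)
	if out, err := clone.CombinedOutput(); err != nil {
		log.Printf("deployment %s: clone failed: %v\n%s", deployID, err, out)
		return
	}

	// Run the build command inside the cloned repo and wait for it to finish.
	build := exec.Command("sh", "-c", req.BuildCmd)
	build.Dir = workDir
	if out, err := build.CombinedOutput(); err != nil {
		log.Printf("deployment %s: build failed: %v\n%s", deployID, err, out)
		return
	}

	// Start the app without waiting for it to exit.
	start := exec.Command("sh", "-c", req.StartCmd)
	start.Dir = workDir
	if err := start.Start(); err != nil {
		log.Printf("deployment %s: start failed: %v", deployID, err)
		return
	}
	log.Printf("deployment %s is running (pid %d)", deployID, start.Process.Pid)
}
```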


The Microservices Mistake


Success breeds complexity. As more developers started using NextDeploy, I made the classic mistake: I assumed I needed microservices to scale.


I split the monolith into seven services:

  • **API Gateway**: Authentication and routing
  • **Build Service**: Docker image creation
  • **Deploy Service**: Container orchestration
  • **Monitor Service**: Health checks and metrics
  • **Log Service**: Centralized logging
  • **Config Service**: Environment management
  • **Notification Service**: Webhooks and alerts

Each service had its own database, its own deployment pipeline, its own monitoring. I thought I was being sophisticated. I was actually creating a distributed monolith.


When Everything Broke


Three months after the microservices migration, we had our first major outage. A cascade failure that started with the Config Service being overwhelmed brought down the entire platform for 4 hours.


The post-mortem was brutal:

  • **47 minutes** to identify the root cause
  • **2.5 hours** to coordinate fixes across services
  • **$50,000** in customer refunds
  • **23%** of customers churned within two weeks

The complexity I thought would make us more resilient made us more fragile.


The Great Consolidation


I spent the next two months consolidating services. Seven became three:


Core API

Handles authentication, deployment requests, and configuration. Built with the Gin framework for performance and simplicity.


```go
func (s *Server) setupRoutes() {
	api := s.router.Group("/api/v1")
	api.Use(s.authMiddleware())

	api.POST("/deployments", s.handleCreateDeployment)
	api.GET("/deployments/:id", s.handleGetDeployment)
	api.GET("/deployments/:id/logs", s.handleGetLogs)
	api.POST("/deployments/:id/rollback", s.handleRollback)
}
```
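The authMiddleware those routes sit behind isn't shown in the post. As a rough idea of its shape, here is a minimal Gin bearer-token middleware; the token store and its Valid method are assumptions, not NextDeploy's actual auth:

```go
// Hypothetical sketch of the auth middleware: reject requests without a
// valid bearer token before they reach the deployment handlers.
func (s *Server) authMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		token := strings.TrimPrefix(c.GetHeader("Authorization"), "Bearer ")
		if token == "" || !s.tokens.Valid(token) { // s.tokens.Valid is an assumed helper
			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "invalid or missing token"})
			return
		}
		c.Next()
	}
}
```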


Build Engine

Manages Docker builds with aggressive caching and parallel processing. This is where the magic happens.


```go
type BuildEngine struct {
	workers  chan struct{}
	cache    *BuildCache
	registry *Registry
	metrics  *prometheus.CounterVec
}

func (be *BuildEngine) Build(ctx context.Context, req BuildRequest) (*BuildResult, error) {
	// Acquire worker slot
	select {
	case be.workers <- struct{}{}:
		defer func() { <-be.workers }()
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Check cache first
	if cached := be.cache.Get(req.CacheKey()); cached != nil {
		return cached, nil
	}

	// Build and cache result
	result, err := be.buildImage(ctx, req)
	if err == nil {
		be.cache.Set(req.CacheKey(), result)
	}

	return result, err
}
```
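A cache like this is only as useful as its key. CacheKey isn't shown in the post; a plausible sketch hashes everything that can change the resulting image (the BuildRequest field names here are assumptions for illustration):

```go
// Hypothetical cache key: a hash over every input that affects the image,
// so any change to the repo, commit, or build command produces a new key.
func (req BuildRequest) CacheKey() string {
	h := sha256.New()
	for _, part := range []string{req.RepoURL, req.CommitSHA, req.BuildCmd, req.Dockerfile} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so "a"+"bc" and "ab"+"c" hash differently
	}
	return hex.EncodeToString(h.Sum(nil))
}
```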


Runtime Manager

Handles container orchestration, health checks, and scaling. Uses Docker Swarm for simplicity over Kubernetes complexity.
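The post doesn't go into the Runtime Manager internals, so here's a simplified sketch of the health-check side only: poll each deployment's health endpoint and ask the orchestrator to reschedule it after repeated failures. The type names, fields, and thresholds are assumptions:

```go
// Hypothetical health-check loop: probe the deployment's health endpoint
// every 10 seconds and trigger a restart after three consecutive failures.
func (rm *RuntimeManager) watch(ctx context.Context, dep Deployment) {
	client := &http.Client{Timeout: 2 * time.Second}
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	failures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.Get(dep.HealthURL)
			if err != nil || resp.StatusCode != http.StatusOK {
				failures++
			} else {
				failures = 0
			}
			if resp != nil {
				resp.Body.Close()
			}
			if failures >= 3 {
				rm.restart(dep) // assumed helper that asks Swarm to reschedule the service
				failures = 0
			}
		}
	}
}
```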


The consolidation reduced our infrastructure costs by 60% and our mean time to recovery from 47 minutes to 8 minutes.


The Developer Experience Obsession


Technical architecture is only half the battle. The other half is developer experience. I learned this the hard way when I watched developers struggle with our CLI tool.


The original CLI was a typical CRUD interface:


```bash
nextdeploy create-deployment --repo=github.com/user/repo --branch=main
nextdeploy get-deployment --id=abc123
nextdeploy update-deployment --id=abc123 --env="NODE_ENV=production"
```


Functional, but not intuitive. I rewrote it to match how developers actually think:


```bash
nextdeploy deploy
nextdeploy status
nextdeploy logs
nextdeploy rollback
```


The new CLI infers context from your git repository and deployment history. No more memorizing deployment IDs or typing long commands.
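Most of that context comes from a couple of git calls. As a sketch of the idea (not the actual CLI code), resolving the repo and branch looks roughly like this:

```go
// Hypothetical sketch of CLI context inference: read the remote URL and the
// current branch from the local git repository instead of asking for flags.
func inferContext() (repoURL, branch string, err error) {
	out, err := exec.Command("git", "remote", "get-url", "origin").Output()
	if err != nil {
		return "", "", fmt.Errorf("not a git repository with an origin remote: %w", err)
	}
	repoURL = strings.TrimSpace(string(out))

	out, err = exec.Command("git", "rev-parse", "--abbrev-ref", "HEAD").Output()
	if err != nil {
		return "", "", fmt.Errorf("could not determine current branch: %w", err)
	}
	branch = strings.TrimSpace(string(out))
	return repoURL, branch, nil
}
```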


The $50K Lesson in Database Design


Our biggest technical debt was the database schema. In the rush to ship features, I had created a normalized mess that required 6-table JOINs for simple queries.


The breaking point came when a customer with 10,000 deployments tried to load their dashboard. The query took 45 seconds and crashed our database.


I spent three weeks redesigning the schema around query patterns instead of normalization rules:


```sql
-- Before: Normalized nightmare
SELECT d.id, d.status, d.created_at,
       r.url as repo_url, r.name as repo_name,
       u.email, u.name as user_name,
       e.key, e.value
FROM deployments d
JOIN repositories r ON d.repo_id = r.id
JOIN users u ON d.user_id = u.id
LEFT JOIN deployment_envs de ON d.id = de.deployment_id
LEFT JOIN env_vars e ON de.env_id = e.id
WHERE u.id = $1
ORDER BY d.created_at DESC;

-- After: Denormalized for performance
SELECT id, status, created_at, repo_url, repo_name,
       user_email, user_name, env_vars
FROM deployment_summary
WHERE user_id = $1
ORDER BY created_at DESC;
```


The migration was painful—3 days of downtime spread across two weeks—but the result was a 95% reduction in query time and a database that could handle our growth.


What I'd Do Differently


If I started NextDeploy today, here's what I'd change:


Start with a Modular Monolith

Microservices aren't inherently bad, but they're not a starting point. Build a well-structured monolith first, then extract services when you feel the pain of coupling.


Invest in Observability from Day One

You can't debug what you can't see. We added comprehensive logging, metrics, and tracing after our first major outage. It should have been there from the beginning.


Design for Failure

Every external dependency will fail. Every database will be slow. Every network call will time out. Design your system assuming failure, not hoping for success.
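In Go terms, most of that boils down to timeouts and bounded retries around every external call. A generic sketch of the pattern (not NextDeploy's actual code):

```go
// Generic "design for failure" helper: run an external call with a
// per-attempt timeout and a small, bounded number of retries with backoff.
func withRetry(ctx context.Context, attempts int, call func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		err = call(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		// Back off a little longer after each failed attempt, but stop if the caller gave up.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(i+1) * 500 * time.Millisecond):
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}
```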


Listen to Your Users, Not Your Ego

I built features I thought were clever instead of features users needed. The most successful features came from customer conversations, not engineering brainstorms.


The Numbers That Matter


After 18 months of building in public, here's where we stand:


  • **10,247 developers** using NextDeploy
  • **99.9% uptime** over the last 6 months
  • **$2.1M ARR** with 40% month-over-month growth
  • **8-second** average deployment time
  • **Zero** vendor lock-in (you can export everything)

But the number I'm most proud of is **4.9/5** customer satisfaction. We built something developers actually want to use.


What's Next


NextDeploy is just the beginning. We're working on:


  • **Multi-cloud deployment** across AWS, GCP, and Azure
  • **Edge computing** for global applications
  • **AI-powered optimization** for automatic performance tuning
  • **Open source core** so you're never locked in

The goal isn't to build another deployment platform. It's to build the last deployment platform you'll ever need.


Building in Public


The most rewarding part of this journey has been building in public. Every feature, every failure, every lesson learned has been shared with the community.


If you're building developer tools, here's my advice: **ship early, listen constantly, and never stop learning from your users.**


The best products aren't built in isolation. They're built in conversation with the people who use them every day.


---


*NextDeploy is currently in private beta. If you're interested in trying it out, you can join the waitlist at [nextdeploy.dev](https://nextdeploy.dev) or reach out to me directly at yussuf@hersi.dev.*