Laravel Queues and Scheduler on AWS Fargate: What Toy Repos Leave Out

Locally, everything works. docker compose up, the app responds, a job gets processed, the scheduler ticks. Then the same image goes to AWS Fargate, and three things happen quietly: the queue grows instead of shrinking, the scheduler never runs, and the next deploy loses a job mid-processing. None of it throws an error. It just does not work.

In the comparison of ECS and EKS, the recommendation for mid-sized companies landed clearly on ECS Fargate. But that platform has no host, no systemd, no crontab, and no Supervisor in the classic sense. The exact building blocks a Laravel deploy relies on when running on a normal server are missing. This article shows how to wire a Laravel app on Fargate so that queues run, the scheduler ticks, and a deploy does not eat your jobs.

What it is not: an introduction to Docker or ECS. If you have never deployed a container, start with Docker in Production and the ECS vs EKS comparison. The examples assume Laravel 13 and use eu-central-1 as the reference region, as of June 2026.

A Laravel App Is Three Processes, Not One

On a classic server, three things run side by side without much thought: php-fpm or Octane serves HTTP, one or more queue:work processes sit under Supervisor, and cron calls schedule:run every minute. Three process types, one server, done.

On Fargate, this becomes the same image three times, run as three separate ECS services with different commands and different scaling. The web service sits behind the load balancer and scales on request load. The worker service has no load balancer and scales on queue depth. The scheduler is a single process that never scales.

The temptation to pack everything into one container is strong. A Supervisor holding FPM plus worker plus cron in one Fargate task sounds like less work. It breaks the platform's assumptions, though. A container has one main process and one health signal. Scale up the web part, and the workers scale with it by accident. If the worker dies, the health signal does not notice, because it only watches FPM. And on deploy, FPM, worker, and cron all get one SIGTERM and have to share the same window to clean up. That goes fine for a while and then falls apart under load.

The rule is simple: one process type per service, one image for all three, the command and task definition decide the behavior.

The Web Service

Health Check With /up

Since Laravel 11, the framework ships the /up endpoint out of the box. That is the natural health check for the ALB target group, no custom status endpoint needed. One detail that often goes wrong: the health check should verify that the app boots, not that the database is reachable. Tie the health check to the database, and a brief database hiccup pulls the entire web service out of the load balancer, even though the app itself is perfectly fine.
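For orientation, a sketch of where the route comes from, assuming the stock Laravel 11+ application skeleton: the health argument in bootstrap/app.php registers /up, and by default the route answers with HTTP 200 as long as the application boots without an exception. The file paths are those of the default skeleton and may differ in a customized app.

```php
<?php
// bootstrap/app.php (as generated by the Laravel 11+ skeleton)

use Illuminate\Foundation\Application;
use Illuminate\Foundation\Configuration\Exceptions;
use Illuminate\Foundation\Configuration\Middleware;

return Application::configure(basePath: dirname(__DIR__))
    ->withRouting(
        web: __DIR__.'/../routes/web.php',
        commands: __DIR__.'/../routes/console.php',
        // Registers GET /up; it answers 200 once the app has booted.
        // By default no database query is involved.
        health: '/up',
    )
    ->withMiddleware(function (Middleware $middleware) {
        //
    })
    ->withExceptions(function (Exceptions $exceptions) {
        //
    })
    ->create();
```

If you want deeper checks, keep them out of this route and put them into monitoring, so that the load balancer decision stays tied to "the app boots".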

FPM or Octane

Nginx plus php-fpm in the container is the robust default and the right choice for most setups. Octane with Swoole or FrankenPHP brings more throughput but demands state discipline, because the worker stays in memory between requests (details in the Octane article). For Fargate, FrankenPHP is interesting because a single process serves HTTP and PHP, sparing you the Nginx sidecar. I would start with FPM and only move to Octane when throughput really demands it.

config:cache in the Image Build

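config:cache and route:cache belong in the Docker build, not the entrypoint. Compile the config at build time, and no task has to rebuild it on start, so startup is faster. This has two consequences that regularly cause confusion. First, with cached config Laravel no longer loads the .env file and serves values from the compiled file, so an env() call deep in the code returns null for anything that only lived in .env. All environment values must go through config() files, otherwise the very optimization breaks access to your own variables. Second, the compiled file freezes whatever env() returned when config:cache ran. A value that only exists at task start, such as a secret injected by ECS, is not part of a cache built during docker build. If your config depends on such values, run config:cache at container start instead and keep route:cache in the build.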
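A minimal sketch of the pattern, assuming a hypothetical config file config/billing.php and hypothetical variables BILLING_API_KEY and BILLING_TIMEOUT: env() appears only inside the config file, never in application code.

```php
<?php
// config/billing.php (hypothetical file): the only place that calls env()

return [
    // Evaluated when the config is loaded or cached, not on every request.
    'api_key' => env('BILLING_API_KEY'),

    // The second argument is the default used when the variable is not set.
    'timeout' => (int) env('BILLING_TIMEOUT', 10),
];
```

Application code then reads config('billing.api_key') instead of env('BILLING_API_KEY'), which behaves the same with and without the cache.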

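Here is the web task definition, shown in the Octane/FrankenPHP variant on port 8000, with /up as the health check and awslogs for logs. The container health check calls curl, so curl has to be installed in the image: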

{
  "family": "laravel-web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/laravel-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/laravel:1.4.0",
      "command": ["php", "artisan", "octane:frankenphp", "--host=0.0.0.0", "--port=8000"],
      "essential": true,
      "portMappings": [{ "name": "http", "containerPort": 8000, "protocol": "tcp" }],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8000/up || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 15
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/laravel-web",
          "awslogs-region": "eu-central-1",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}

The Queue Worker as Its Own Service

queue:work or Horizon

queue:work with your own Supervisor pattern is minimal and works, but gives no visibility into the queue. Horizon adds a dashboard, metrics, tags, and automatic balancing across multiple queues, but needs Redis or Valkey as its connection. For most mid-sized setups, Horizon is worth the effort just for the view into throughput and wait time. When I get woken up at night because the queue is backing up, I do not want to guess which job type is stuck.

The Queue Backend

Redis or Valkey on ElastiCache. As of June 2026, Valkey is the default AWS recommends for new clusters, around 20 percent cheaper than Redis OSS and protocol-compatible. For a new Laravel app, there is little reason to still pick Redis OSS. SQS is the alternative if you want to pull the queue out of your own operations entirely, but it costs you the Horizon view, because Horizon needs Redis.

Memory Hygiene With max-time and max-jobs

Worker processes live long and accumulate memory along the way, because PHP does not release everything between jobs. Eventually the process hits the memory limit and gets killed hard. --max-time and --max-jobs let the worker restart cleanly after a defined runtime or job count, before that happens. --max-time=3600 plus --max-jobs=1000 is a sensible starting point. The most common beginner mistake here is a different one: starting queue:work as a background process in the web container. The worker then eventually dies unnoticed, because the health signal only checks FPM, and nobody notices until the queue overflows.

The worker definition with Horizon as the command, plus the queue:work variant as a note:

{
  "name": "worker",
  "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/laravel:1.4.0",
  "command": ["php", "artisan", "horizon"],
  "essential": true,
  "stopTimeout": 120,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/laravel-worker",
      "awslogs-region": "eu-central-1",
      "awslogs-stream-prefix": "worker"
    }
  }
}

Without Horizon, replace the command with ["php", "artisan", "queue:work", "redis", "--max-time=3600", "--max-jobs=1000", "--tries=3"] and let the service scale through the desired count.
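With Horizon, the restart limits are not passed as command-line flags; they live in config/horizon.php. A sketch assuming the keys of the published Horizon config file (maxTime, maxJobs, memory, timeout). The supervisor name and the concrete numbers are placeholders to adapt:

```php
<?php
// config/horizon.php (excerpt)

return [
    // ...
    'environments' => [
        'production' => [
            'supervisor-1' => [
                'connection' => 'redis',
                'queue' => ['default'],
                'balance' => 'auto',
                'maxProcesses' => 10,
                'maxTime' => 3600,  // restart a worker process after one hour
                'maxJobs' => 1000,  // or after 1000 jobs, whichever comes first
                'memory' => 128,    // MB per worker process
                'timeout' => 90,    // seconds per job, kept below stopTimeout
                'tries' => 3,
            ],
        ],
    ],
];
```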

The Scheduler Problem on Fargate

This is where the biggest hole is. Laravel's scheduler expects a cron entry that calls php artisan schedule:run every minute. On Fargate there is no crontab and no host to write it into. Writing * * * * * php artisan schedule:run into a file means looking for a place that does not exist on Fargate. Three ways work.

Option A: A Dedicated schedule:work Service

schedule:work is a long-running process that internally runs schedule:run every minute. A single Fargate task with this command fully replaces the crontab. The most important rule, and the most practically valuable sentence in this article: the desired count must be exactly 1. With two tasks, every scheduled job runs twice. Reports go out twice, sync jobs run twice, and in the worst case you only notice it through duplicate invoices.

As an extra safeguard, define scheduled tasks with onOneServer. That relies on a cache lock in Redis or Valkey and protects against the brief moment during a rolling deploy when two scheduler tasks can run at once for a few seconds:

// routes/console.php
use Illuminate\Support\Facades\Schedule;
 
Schedule::command('reports:daily')
    ->dailyAt('06:00')
    ->onOneServer()
    ->withoutOverlapping();
 
Schedule::command('subscriptions:renew')
    ->hourly()
    ->onOneServer();

The trade-off: a permanently running, almost always idle task that still costs vCPU and memory. For a small task, that is negligible.

Option B: EventBridge Scheduler With RunTask

EventBridge Scheduler is a serverless cron that starts a short-lived ECS task with the schedule:run command every minute. There is no permanently running container, and you pay only for the short runs. That sounds more elegant but has snags. The task start adds latency every minute, schedule:run must finish well under 60 seconds, and the RunTask call occasionally fails, for capacity or networking reasons. So you need a retry policy and should monitor failed starts.

resource "aws_scheduler_schedule" "laravel_scheduler" {
  name                = "laravel-schedule-run"
  schedule_expression = "rate(1 minute)"
 
  flexible_time_window {
    mode = "OFF"
  }
 
  target {
    arn      = aws_ecs_cluster.laravel.arn
    role_arn = aws_iam_role.scheduler.arn
 
    ecs_parameters {
      task_definition_arn = aws_ecs_task_definition.scheduler.arn
      launch_type         = "FARGATE"
 
      network_configuration {
        subnets          = var.private_subnet_ids
        security_groups  = [aws_security_group.laravel_tasks.id]
        assign_public_ip = false
      }
    }
 
    retry_policy {
      maximum_retry_attempts = 2
    }
  }
}

Option C: Supercronic as a Sidecar

Supercronic is a cron-compatible runner for containers that triggers schedule:run every minute. Pragmatic if a team is attached to its crontab syntax, but it adds another component you have to maintain.

Recommendation

Option	Idle cost	Complexity	Best scenario
schedule:work service (count 1)	one task always on	low	default for most setups
EventBridge with RunTask	per run only	medium	many jobs, cost pressure, short schedule:run
Supercronic sidecar	low	medium	teams with existing crontab logic

For most mid-sized apps, Option A is the right choice. It is the easiest to understand and the hardest to misconfigure, as long as the desired count stays at 1. Regardless of the option, one rule holds: long work does not belong in schedule:run itself, it gets dispatched from there as a queue job. The scheduler triggers, the worker works. A report that computes for three minutes would otherwise block the next scheduler run.
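A sketch of that split, assuming a hypothetical queued job class App\Jobs\GenerateDailyReport: the scheduler only dispatches, and the three-minute computation runs on the worker service.

```php
<?php
// routes/console.php

use App\Jobs\GenerateDailyReport; // hypothetical job implementing ShouldQueue
use Illuminate\Support\Facades\Schedule;

// schedule:run returns right after pushing the job onto the queue,
// so the next scheduler tick is not blocked by the report itself.
Schedule::job(new GenerateDailyReport)
    ->dailyAt('06:00')
    ->onOneServer();
```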

Graceful Shutdown and Deploy Safety

SIGTERM, stopTimeout, SIGKILL

When a task stops, which happens on every deploy, every scale-in, and every Spot reclaim, ECS sends a SIGTERM to the main process. If the process has not exited within stopTimeout, a SIGKILL follows. The default is 30 seconds, and on Fargate the maximum is 120 seconds, which also matches the two-minute warning Fargate Spot gives before it reclaims a task. Since December 2025, ECS Fargate additionally reads the STOPSIGNAL from the image config, in case a process reacts cleanly to a different signal. Without one, it stays SIGTERM.

Stopping Worker and Horizon Cleanly

queue:work catches SIGTERM and finishes the currently running job before it stops. For that to work, stopTimeout must be larger than the longest typical job runtime. The math is merciless: a job that needs 50 seconds against the default stopTimeout of 30 seconds ends in a SIGKILL mid-job, up to 20 seconds short of finishing. The job then counts as unfinished and runs again, in the worst case with side effects that already half happened. That is why the worker definition above sets stopTimeout: 120.
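On the Laravel side, the counterpart to stopTimeout is the job's own timeout. A sketch with a hypothetical job class, assuming the Queueable trait from Laravel 11 and later. The idea is to keep $timeout below the 120 seconds of stopTimeout, so that a job either finishes or is aborted by the worker before ECS sends SIGKILL. The numbers are illustrative.

```php
<?php

namespace App\Jobs;

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;

// Hypothetical job used only for illustration.
class ExportLargeReport implements ShouldQueue
{
    use Queueable;

    // Maximum runtime in seconds before the worker aborts the job.
    // Kept below the task's stopTimeout of 120 seconds.
    public $timeout = 90;

    // Attempts before the job lands in failed_jobs.
    public $tries = 3;

    public function handle(): void
    {
        // ... long-running export work ...
    }
}
```

The retry_after value of the queue connection in config/queue.php should stay above the longest job timeout, otherwise a job that is still running can be handed out a second time.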

Horizon ships horizon:terminate, which stops the master cleanly and lets running jobs finish. A trap in deploy scripts: the command returns immediately and does not wait for the workers to finish. With ECS, the clean path is a different one anyway. ECS sends SIGTERM to the Horizon master when it stops the task, and stopTimeout gives it enough time to wind down. That only works if Horizon is the container's main process, as with the exec-form command above; a shell wrapper in between can swallow the signal. You do not need a manual terminate in the entrypoint for that.

Idempotency as a Safety Net

Because a job can run again after a hard abort, jobs must be idempotent. That is not a Fargate quirk, but Fargate deploys make the case frequent instead of rare. A job that triggers a payment without deduplication will charge twice on Fargate; it is a question of when, not if. Unique job keys, withoutOverlapping, and a check whether the work is already done cost little and save a lot.
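The three safeguards can be sketched in one job. Everything specific here is an assumption for illustration: the ChargeInvoice job, the Invoice model with a paid_at column, and the lock durations. The interfaces and the middleware are Laravel's own (ShouldBeUnique, WithoutOverlapping).

```php
<?php

namespace App\Jobs;

use App\Models\Invoice; // hypothetical Eloquent model with a paid_at column
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;
use Illuminate\Queue\Middleware\WithoutOverlapping;

// Hypothetical job used only for illustration.
class ChargeInvoice implements ShouldQueue, ShouldBeUnique
{
    use Queueable;

    // Release the unique lock after an hour even if the worker was killed.
    public $uniqueFor = 3600;

    public function __construct(public int $invoiceId)
    {
    }

    // Unique job key: only one ChargeInvoice per invoice until it has finished.
    public function uniqueId(): string
    {
        return (string) $this->invoiceId;
    }

    // No two workers process the same invoice at the same time.
    public function middleware(): array
    {
        return [
            (new WithoutOverlapping((string) $this->invoiceId))->expireAfter(180),
        ];
    }

    public function handle(): void
    {
        $invoice = Invoice::findOrFail($this->invoiceId);

        // Check whether the work is already done before touching anything.
        if ($invoice->paid_at !== null) {
            return;
        }

        // ... call the payment provider here, ideally passing an idempotency
        // key derived from the invoice ID, then persist the result ...
        $invoice->forceFill(['paid_at' => now()])->save();
    }
}
```

The expiry values matter on Fargate in particular: a lock held by a worker that received SIGKILL is never released by that worker, so it has to time out on its own.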

Secrets and Configuration

The anti-pattern is widespread: DB password and APP_KEY as plaintext in the environment section of the task definition. Anyone with ecs:DescribeTaskDefinition then reads them in the clear. The right way is the secrets section, pointing at AWS Secrets Manager or the SSM Parameter Store. Only the ARN sits in the task definition, ECS injects the value at start as an environment variable:

"secrets": [
  {
    "name": "APP_KEY",
    "valueFrom": "arn:aws:ssm:eu-central-1:123456789012:parameter/laravel/app-key"
  },
  {
    "name": "DB_PASSWORD",
    "valueFrom": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:laravel/db-AbCdEf:password::"
  }
]

The choice between the two is mostly a cost question. Secrets Manager supports rotation and bills per secret, which makes it ideal for DB credentials that RDS rotates automatically. The SSM Parameter Store with SecureString is cheaper and sufficient for values that rarely change, like the APP_KEY. The APP_KEY must not be baked into the image, because that makes rotation impossible and turns every copy of the image into a leak risk.

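And once more the interaction with config:cache: read the values injected by ECS through config() files, not via env() deep in the code. Keep in mind that the cached config contains whatever env() returned at the moment config:cache ran. If the cache was built during the image build, a secret that ECS injects only at task start is missing from it, and config() returns null for it. The error only shows up in production, because locally you work without the cache.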

Migrations and Zero-Downtime Deploys

migrate in the container entrypoint is a classic that bites under load. On a rolling deploy with three new web tasks, each task starts migrate at the same time. In the best case the database lock blocks and two tasks wait, in the worst case the migrations collide. The right way is a single ECS RunTask with migrate --force as a separate pipeline step, after the image push and before the service update:

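TASK_ARN=$(aws ecs run-task \
  --cluster laravel-production \
  --task-definition laravel-migrate \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SG],assignPublicIp=DISABLED}" \
  --overrides '{"containerOverrides":[{"name":"app","command":["php","artisan","migrate","--force"]}]}' \
  --query 'tasks[0].taskArn' --output text)
# run-task returns immediately, so block until the migration task has stopped
aws ecs wait tasks-stopped --cluster laravel-production --tasks "$TASK_ARN"
# a non-zero exit code of the migration container fails the pipeline step
[ "$(aws ecs describe-tasks --cluster laravel-production --tasks "$TASK_ARN" --query 'tasks[0].containers[0].exitCode' --output text)" = "0" ]

Only when this task has stopped with exit code 0 does the CI/CD pipeline update the three services to the new image. The container name in the override, app here, has to match the container name in the laravel-migrate task definition.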

Two rolling-update parameters steer the behavior during a deploy. minimumHealthyPercent sets how far the number of running tasks may drop below the desired count, maximumPercent how far it may rise above it. 100 and 200 means bring up all new tasks first, then take the old ones down: full capacity, but briefly double the cost. 50 and 100 is leaner but can cut capacity to half during the deploy. On top comes a discipline at the schema level: during a deploy, only additive migrations, that is, add a column instead of renaming one. Otherwise the old code sees, for a moment, a schema it does not know. That is the expand-contract pattern, only sketched here.
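As an illustration of the expand step, a sketch of an additive migration. The table customers and the column display_name are made-up names. The new column is nullable, so old code that does not know it keeps working while both versions run side by side.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    // Expand: add the new column next to the old one instead of renaming.
    public function up(): void
    {
        Schema::table('customers', function (Blueprint $table) {
            $table->string('display_name')->nullable();
        });
    }

    public function down(): void
    {
        Schema::table('customers', function (Blueprint $table) {
            $table->dropColumn('display_name');
        });
    }
};
```

Dropping the old column is the contract step and belongs in a later deploy, once no running code reads it anymore.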

Worker Autoscaling on Queue Depth

You scale workers on queue depth, not on CPU. The reason is the nature of the work: a worker waiting on a slow external API call has low CPU load and a growing queue. CPU-based autoscaling reports in that situation that all is calm, while the queue fills up. It scales in the wrong direction.

With SQS, the native metric is ApproximateNumberOfMessagesVisible, perfect for target tracking. With Redis or Valkey and Horizon, you publish the queue length or the Horizon wait time as a custom CloudWatch metric, for example through a small Lambda that reads the queue depth every minute. A target-tracking policy on the SQS metric looks like this:

resource "aws_appautoscaling_policy" "worker_queue_depth" {
  name               = "worker-scale-on-queue"
  policy_type        = "TargetTrackingScaling"
  resource_id        = "service/laravel-production/laravel-worker"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
 
  target_tracking_scaling_policy_configuration {
    target_value = 100  # target: about 100 visible messages in the queue in total, not per task
 
    customized_metric_specification {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      statistic   = "Average"
      dimensions {
        name  = "QueueName"
        value = "laravel-default"
      }
    }
  }
}
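For the Redis or Valkey case, the custom metric has to come from somewhere. Instead of the Lambda mentioned above, a swapped-in alternative is to let the Laravel scheduler publish it. The sketch assumes the AWS SDK for PHP (aws/aws-sdk-php) is installed and the task role allows cloudwatch:PutMetricData. The namespace Laravel/Queue, the metric name QueueDepth, and the queue name default are made-up values.

```php
<?php
// routes/console.php

use Aws\CloudWatch\CloudWatchClient;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\Facades\Schedule;

Schedule::call(function () {
    // Size of the "default" queue as reported by the queue driver.
    $depth = Queue::connection('redis')->size('default');

    // Credentials come from the ECS task role via the SDK's default chain.
    $cloudWatch = new CloudWatchClient([
        'region' => 'eu-central-1',
        'version' => 'latest',
    ]);

    $cloudWatch->putMetricData([
        'Namespace' => 'Laravel/Queue', // assumed custom namespace
        'MetricData' => [[
            'MetricName' => 'QueueDepth', // assumed metric name
            'Dimensions' => [['Name' => 'Queue', 'Value' => 'default']],
            'Value' => $depth,
            'Unit' => 'Count',
        ]],
    ]);
})->everyMinute()->name('publish-queue-depth')->onOneServer();
```

A scaling policy for the worker service would then reference this namespace and metric instead of AWS/SQS.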

Always keep at least one worker running instead of scaling to zero. Otherwise the first message after an idle period pays the task start latency, and to the user that feels like a hang. Scale-to-zero only makes sense when latency really does not matter. The scheduler stays out of all this, it never scales and stays at count 1.

Observability

Logs go through the awslogs driver to CloudWatch Logs, that is the simple default. For structured logs and routing to multiple targets, FireLens with Fluent Bit is the next step. Important on the Laravel side: set the logging channel to stderr, so logs land as container logs and do not disappear into a file in the ephemeral container filesystem, which dies with the task.

The Horizon dashboard gives the live view of throughput, wait time, and failed jobs. The failed_jobs table needs an alarm, because a growing failed count is an incident, not a detail. At least two CloudWatch alarms belong in place: one on queue depth that fires when the queue fills up, and one on the failed-jobs rate. More on this in the upcoming article on observability in production.

Anti-Patterns

The patterns I see again and again in technical audits:

Everything in one container. FPM, worker, and cron under one Supervisor in one Fargate task. Breaks the health signal, scaling, and deploy behavior all at once.

queue:work in the web container. The worker as a background process next to FPM dies unnoticed, because the health signal only checks the web server.

Scheduler with desired count 2. Every scheduled job runs twice. The most common reason for duplicate emails and duplicate sync runs.

migrate in the entrypoint. Race across N tasks on a rolling deploy. Belongs in a single RunTask.

Secrets as plaintext env. Readable by anyone with DescribeTaskDefinition. Belongs in Secrets Manager or SSM.

No stopTimeout tuning. Default 30 seconds plus jobs over 30 seconds means SIGKILL mid-job, on every deploy.

Relying on SIGKILL. Without SIGTERM handling and without idempotency, every deploy loses or duplicates jobs.

Host cron on Fargate. Does not exist, there is no host. Anyone who wants to write a crontab is on the wrong platform or needs one of the three scheduler options.

Long work directly in schedule:run. Blocks the scheduler, belongs in a dispatched queue job.

Reference Setup

The scenario: a Laravel 13 app, an ALB-fronted web layer, Horizon for queues on Valkey, a dedicated scheduler, RDS PostgreSQL, secrets in the SSM Parameter Store. Three ECS services from one image.

Three Services From One Image

The web service runs with FrankenPHP or nginx plus FPM, sits in the ALB target group with /up as the health check, and scales on request count. The worker service runs with horizon, has no load balancer, scales on queue depth, and sets stopTimeout to 120. The scheduler service runs with schedule:work, has a fixed desired count of 1, and uses onOneServer locking. Three services, one image, three commands.

Deploy Pipeline

The order is half the battle. Build the image with config:cache and route:cache, push to ECR, run the RunTask with migrate --force, and only then update the three services to the new image in a rolling fashion. Run the migration after the service update, and you have new code on an old schema for a moment.

# excerpt from the Dockerfile
# note: config:cache bakes in the env() values that are present at build time
RUN php artisan config:cache \
 && php artisan route:cache \
 && php artisan event:cache

EventBridge Alternative

If you want to save the permanently running scheduler task, replace the scheduler service with the EventBridge schedule from Option B. The rest of the setup stays identical. That mainly pays off when cost pressure is high and schedule:run finishes fast.

For scale: three small Fargate tasks plus one small Valkey node land, for a typical mid-sized app, in the low triple digits per month. The pricing depth sits in the ECS vs EKS article, the values are for eu-central-1, and for your own region you best check them in the AWS Pricing Calculator.

Conclusion

ECS Fargate is the right platform for Laravel in mid-sized companies, but it demands a shift in thinking: the app is three processes, not one container. The three places where starter repos and Forge clones stay silent are exactly the three that hurt first in production. The scheduler without host cron, the worker without graceful shutdown, the deploy without a clean migration.

Wire those three cleanly, and you have a setup that survives deploys without losing jobs and scales with the load. The effort is not in the complexity, it is in the details you only learn once you have gotten them wrong. That is exactly the part a maintained IaC setup encapsulates once and correctly, instead of cobbling it together fresh in every project.

Is your Laravel app running on Fargate, Forge, or your own setup, and you are not sure about the queue, scheduler, or deploy behavior? Contact me for a technical audit that examines it in two to three days.

I am currently building a maintained IaC kit for production Laravel on AWS (CDK and Terraform). If you want to know early, you can get on the list without obligation. No newsletter barrage, just one message when it is ready.