Observability in Production: CloudWatch, OpenTelemetry, and Grafana

A user reports the app is slow. On the dashboard, everything is green. CPU is normal, there is no error in the log, the health check answers politely. And yet one specific request takes eight seconds. Without traces, that is a search in the dark, with a few SSH sessions and a lot of guessing. With traces, it is one look: an N-plus-1 query in a rarely used endpoint that only shows up under load.
Observability is not the dashboard you build. It is the ability to answer a question you did not ask in advance. A dashboard shows what you expected. An incident is almost always what you did not expect. Three pillars deliver the answer: logs say what happened, metrics say how much and how often, and traces say where in the system the time is lost.
In 2026, a few things on AWS have shifted that make the decision easier. OpenTelemetry has established itself as the standard, and AWS has caught up on the native integration. This article shows the AWS-native path, the OpenTelemetry path, and when each one fits. All figures are as of June 2026, with eu-central-1 as the reference region.
The Three Pillars: Logs, Metrics, Traces
The three pillars answer different questions, and none replaces another.
Logs say what happened, with context. They belong structured as JSON, not as free text, otherwise they are not searchable. A log with fields for request ID, user ID, and duration can be filtered, a glued-together string cannot.
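To make the difference concrete, here is a minimal sketch in Python. The helper names `log_line` and `slow_requests` and the field names are made up for illustration; the point is only that a JSON line can be filtered by field, while a free-text line would need a regex.

```python
import json

def log_line(message, **fields):
    """Serialize one log event as a single JSON object per line."""
    return json.dumps({"message": message, **fields}, sort_keys=True)

# Three example events with the fields named in the text (values are invented).
lines = [
    log_line("request finished", request_id="req-1", user_id=42, duration_ms=120),
    log_line("request finished", request_id="req-2", user_id=7, duration_ms=8000),
    log_line("request finished", request_id="req-3", user_id=42, duration_ms=95),
]

def slow_requests(raw_lines, threshold_ms):
    """Filter on a field; this only works because every line parses as JSON."""
    events = (json.loads(line) for line in raw_lines)
    return [e["request_id"] for e in events if e["duration_ms"] >= threshold_ms]

print(slow_requests(lines, 1000))  # ['req-2']
```

The same filter on a glued-together string would depend on the exact wording of the message, and break the day someone rewords it.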
Metrics are aggregated numbers over time. Request rate, error rate, latency percentiles, queue depth. They are cheap to store and fast to aggregate, but they have no detail. A metric says the error rate has risen to five percent, not why.
Traces show the path of a single request through all services and dependencies, with the time per span. That is the pillar that answers "why is exactly this request slow", and the one missing in most mid-sized setups.
The actual value lies in the connection. An alarm on a metric leads to the trace, the trace leads to the log with the same trace ID. Only when the three connect do three data sources become a tool. That connection used to be manual work and has become standard with OpenTelemetry.
The Standard in 2026: OpenTelemetry
What OpenTelemetry Solves
OpenTelemetry is a vendor-neutral standard to generate, collect, and export logs, metrics, and traces. The point is the separation: you instrument the app once to the standard, and the backend is a configuration question, not a code decision. Instrument with OpenTelemetry today, and you can switch from CloudWatch to Grafana or Datadog tomorrow without touching a line of application code.
X-Ray in Maintenance Mode
This is no longer a theoretical advantage. The X-Ray SDKs and the X-Ray daemon formally entered maintenance mode in 2026. AWS recommends migrating to AWS Distro for OpenTelemetry, ADOT for short, the AWS distribution of the OpenTelemetry Collector. Anyone setting up a new project no longer instruments with the X-Ray SDK, but with OpenTelemetry. Anyone with an existing one plans the migration.
What CloudWatch Caught Up On in 2026
AWS has built out the native side considerably. CloudWatch now accepts OpenTelemetry across all three pillars, via OTLP, the OpenTelemetry protocol. One protocol for logs, metrics, and traces. The part that matters most in practice: OTLP metrics allow up to 150 labels, compared to the 30-dimension limit of classic CloudWatch custom metrics. That difference is what makes high-cardinality Kubernetes and microservice workloads manageable at all. On top comes PromQL support, so you can query CloudWatch metrics with the Prometheus query language.
The ADOT Collector at the Core
The Pipeline
The collector is the middleman between app and backend. The app sends its telemetry as OTLP to the collector, over gRPC on port 4317 or HTTP on port 4318. The collector processes it and exports it onward. The app itself does not know the backend, it only knows the collector.
The pipeline has three stages. A receiver takes the OTLP data. One or more processors batch, filter, and sample. One or more exporters send the data to the target, which can be CloudWatch, Amazon Managed Prometheus, or both in parallel. A minimal collector config looks like this:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
exporters:
  awsemf:
    namespace: laravel-production
    log_group_name: /ecs/laravel/metrics
  prometheusremotewrite:
    endpoint: https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/ws-xxxx/api/v1/remote_write
    auth:
      authenticator: sigv4auth
extensions:
  # the authenticator referenced above has to be declared and enabled
  sigv4auth:
    region: eu-central-1
    service: aps
service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsemf, prometheusremotewrite]

Sidecar on Fargate, DaemonSet on EKS
Where the collector runs depends on the platform. On Fargate it runs as a sidecar container in the same task definition as the app. The app sends to localhost:4317, the sidecar exports outward. On EKS the collector runs as a DaemonSet, that is, one collector pod per node, or as a central deployment for aggregating tasks.
Why the collector at all, instead of exporting straight from the app? Because sampling, filtering, and batching belong in one place, not in every app and every language. Switch the backend, and you change a collector config and redeploy the sidecar, instead of touching every app. The sidecar in the Fargate setup is one additional container in the task definition, nothing more.
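What "batching in one place" means can be shown with a toy model. The following Python sketch is not the collector's implementation, only an illustration of the idea behind the batch processor: buffer items and flush when the batch is full or when the oldest item has waited long enough. The class name `BatchProcessor`, the `max_size` parameter, and the injectable `clock` are assumptions made for this example.

```python
import time

class BatchProcessor:
    """Toy batch stage: buffer items, flush on size or on age."""

    def __init__(self, export, max_size=3, timeout_s=10.0, clock=time.monotonic):
        self.export = export        # callable that receives one list per batch
        self.max_size = max_size
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example needs no sleep
        self.buffer = []
        self.first_item_at = None

    def add(self, item):
        if not self.buffer:
            self.first_item_at = self.clock()
        self.buffer.append(item)
        self._maybe_flush()

    def tick(self):
        """Call periodically so a half-full batch still goes out."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self.buffer:
            return
        too_big = len(self.buffer) >= self.max_size
        too_old = self.clock() - self.first_item_at >= self.timeout_s
        if too_big or too_old:
            self.export(self.buffer)
            self.buffer = []
            self.first_item_at = None

# Usage with a fake clock: three spans fill a batch, the fourth leaves on timeout.
sent = []
now = [0.0]
proc = BatchProcessor(sent.append, max_size=3, timeout_s=10.0, clock=lambda: now[0])
for span in ["a", "b", "c", "d"]:
    proc.add(span)
now[0] = 10.0
proc.tick()
print(sent)  # [['a', 'b', 'c'], ['d']]
```

If this logic lived in every app, each language and each service would carry its own copy with its own tuning. In the collector it exists once.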
Instrumenting a Laravel App
Auto-Instrumentation
The pleasant part: you do not have to fit a Laravel app with spans by hand. The opentelemetry-auto-laravel package captures HTTP requests, Eloquent queries, Redis operations, and queue jobs automatically, without changing code per route or per model. You install the Composer package plus the OpenTelemetry PHP extension and configure the rest through environment variables:
composer require open-telemetry/opentelemetry-auto-laravel

# environment variables, set in the task definition
OTEL_PHP_AUTOLOAD_ENABLED=true
OTEL_SERVICE_NAME=laravel-web
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

The OTLP endpoint points to the collector sidecar on localhost. From there, traces flow for every request, every query, and every job, with no further code.
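When nothing arrives at the collector, it helps to know which URLs the app actually calls. With the HTTP protocol, the OTLP exporter specification appends a per-signal path to the base endpoint. The small Python sketch below only derives those URLs; the function name `otlp_http_urls` is made up for this example, and the values mirror the environment variables above.

```python
def otlp_http_urls(base_endpoint):
    """Derive the per-signal OTLP/HTTP URLs from the base endpoint."""
    base = base_endpoint.rstrip("/")
    return {signal: f"{base}/v1/{signal}" for signal in ("traces", "metrics", "logs")}

# Same values as in the task definition above.
env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
}

urls = otlp_http_urls(env["OTEL_EXPORTER_OTLP_ENDPOINT"])
print(urls["traces"])   # http://localhost:4318/v1/traces
print(urls["metrics"])  # http://localhost:4318/v1/metrics
```

Port 4318 belongs to the HTTP receiver. If the protocol is switched to gRPC, the endpoint has to point to port 4317 instead.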
Connecting Logs and Traces
This is the point that makes the difference in an incident. With the OTLP log driver, Laravel injects the current trace ID into the log context automatically. An error log then carries the trace ID of the request that triggered it. In an incident you jump from the slow or failing trace straight to the matching log, without hunting through timestamps. What you add by hand beyond that are custom spans for your own critical paths, for example an external API call or an expensive computation that auto-instrumentation does not know about.
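The mechanism behind that correlation is simple and not specific to Laravel. The following sketch shows it in plain Python with the standard logging module, not with the Laravel log driver: a filter copies the current trace ID from the request context into every log record. The names `current_trace_id`, `TraceIdFilter`, and `JsonFormatter` are assumptions for this example, and the trace ID is set by hand where a tracing SDK would normally provide it.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record that passes through."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace ID."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# In a real app the tracing SDK sets this at the start of the request.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.error("payment provider timed out")

entry = json.loads(stream.getvalue())
print(entry["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

With the trace ID as a log field, the jump from a trace to its logs is one filter on that field.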
The AWS-Native Path: CloudWatch
Container Insights and Logs Insights
For many setups, CloudWatch alone is the right choice. CloudWatch Logs with Logs Insights allows ad-hoc queries over structured logs, without a second system. A query for the number of 5xx errors per route takes only a few lines:
fields @timestamp, route, status
| filter status >= 500
| stats count() as errors by route
| sort errors desc
Container Insights provides the container layer for ECS and EKS, and since 2026 with OpenTelemetry and Prometheus metrics, queryable via PromQL. That blurs the old line between the "CloudWatch world" and the "Prometheus world", because the same PromQL queries work on both sides.
Application Signals
Perhaps the most underrated building block is Application Signals. It delivers APM and SLOs from standard OpenTelemetry instrumentation, in many cases without a single code change, only through an environment variable:
OTEL_AWS_APPLICATION_SIGNALS_ENABLED=true
With that you get a service map, latency and error rates per service, and SLO tracking. For a mid-sized team without its own SRE department, that is a lot of value for very little effort. These alarms also feed the automatic rollback on deploy, which is the topic of the upcoming zero-downtime article.
When CloudWatch alone is enough: for the majority of mid-sized setups. One to a few services, one team, no existing Grafana stack. You save yourself two additional systems and the work of running them.
The OpenTelemetry Path: Prometheus and Grafana
Amazon Managed Prometheus
If you need more, combine ADOT with Amazon Managed Prometheus for metrics. AMP is pay-per-use, with no upfront commitment. As of June 2026 it costs around 0.003 USD per million ingested samples plus query and collector costs. PromQL is native, and anyone coming from the Kubernetes world feels immediately at home. The ADOT collector exports the metrics to AMP via remote write, the logs go to CloudWatch in parallel.
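Because billing is per ingested sample, the number of active time series decides the bill. The following Python sketch is a rough estimate under stated assumptions: the price is the figure quoted above and should be checked against the current price list, the 60-second scrape interval and the label counts are invented, and query and storage costs are left out. The function names are hypothetical.

```python
def monthly_samples(active_series, scrape_interval_s, days=30):
    """Each active series produces one sample per scrape interval."""
    scrapes_per_month = days * 24 * 3600 // scrape_interval_s
    return active_series * scrapes_per_month

def ingestion_cost_usd(samples, usd_per_million):
    """Ingestion cost only; query and storage costs are not included."""
    return samples / 1_000_000 * usd_per_million

PRICE_PER_MILLION = 0.003  # figure quoted in the article, treat as an assumption

# 50 routes x 5 status classes = 250 series
per_route = monthly_samples(active_series=50 * 5, scrape_interval_s=60)

# the same metric with an extra label for 10,000 users = 2.5 million series
per_user = monthly_samples(active_series=50 * 5 * 10_000, scrape_interval_s=60)

print(per_route)  # 10800000
print(per_user)   # 108000000000
print(round(ingestion_cost_usd(per_user, PRICE_PER_MILLION), 2))  # 324.0
```

The absolute numbers depend entirely on the assumed price. The ratio does not: one additional label with 10,000 values multiplies the sample count by 10,000.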
Amazon Managed Grafana
For visualization, Amazon Managed Grafana comes in. Billing runs per active user per workspace, as of June 2026 around 9 USD per editor and 5 USD per viewer per month. Grafana visualizes AMP metrics, CloudWatch logs, and traces in one interface, and anyone who already has Grafana dashboards from other projects keeps using them.
When this pays off: with existing Grafana knowledge and dashboards, with high-cardinality metrics, with multiple data sources that should converge in one view, or with a multi-cloud strategy that does not want to be tied to CloudWatch. The honest flip side: more power means more moving parts. For a single app on one account, the Grafana stack is overhead you run without using it.
What You Actually Measure
Golden Signals and RED
Collecting telemetry is easy. Measuring the right thing is the actual work. The Golden Signals from the Google SRE book are a good anchor: latency, traffic, errors, and saturation. For request-driven services you often condense them into RED, that is, rate, errors, and duration. Those are the three numbers that count first in an incident.
You measure latency as percentiles, never as an average. The average hides exactly the slow requests users feel. If 95 percent of requests finish in 100 ms and 5 percent in 4 seconds, the average comes out at roughly 295 ms and looks unremarkable, and yet every twentieth user has a bad experience. p50, p95, and p99 expose that tail; the average does not.
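The numbers from that example are quick to check. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring backends may interpolate differently. The helper `percentile` is written for this example.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p percent
    of all values at or below it. p is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 95 requests at 100 ms, 5 requests at 4 seconds
latencies_ms = [100] * 95 + [4000] * 5

print(statistics.mean(latencies_ms))  # about 295 ms
print(percentile(latencies_ms, 50))   # 100
print(percentile(latencies_ms, 95))   # 100
print(percentile(latencies_ms, 99))   # 4000
```

With exactly 5 percent slow requests, p95 still reports 100 ms under this definition and only p99 shows the 4 seconds. That is the reason to watch several percentiles and not just one.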
SLOs Instead of Vanity Metrics
A service-level objective gives alerting a purpose. "99 percent of requests under 300 ms over 30 days" is an SLO you can measure the system against, and against which an alarm is justified. Application Signals tracks SLOs natively. For workers and queues, queue depth, job wait time, and the failed-jobs rate belong here, the same ones that were the basis for autoscaling in the Fargate article. What does not belong is everything measurable just because it is measurable. Every unused metric is noise and costs money.
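An SLO like the one above translates directly into an error budget. The following Python sketch assumes the raw request durations for the window are at hand, which in practice a backend such as Application Signals computes for you. The function `slo_report` and its field names are made up for illustration.

```python
def slo_report(durations_ms, threshold_ms=300, target_percent=99):
    """Evaluate a latency SLO: target_percent of requests under threshold_ms."""
    total = len(durations_ms)
    good = sum(1 for d in durations_ms if d < threshold_ms)
    bad = total - good
    allowed_bad = total * (100 - target_percent) / 100  # the error budget
    return {
        "compliance": good / total,
        "budget_used": bad / allowed_bad,  # 1.0 means the budget is exhausted
        "slo_met": good * 100 >= target_percent * total,
    }

# 10,000 requests in the window, 150 of them slower than 300 ms
burned = slo_report([120] * 9850 + [800] * 150)
print(burned["compliance"], burned["budget_used"], burned["slo_met"])

# the same window with only 50 slow requests
healthy = slo_report([120] * 9950 + [800] * 50)
print(healthy["compliance"], healthy["budget_used"], healthy["slo_met"])
```

In the first window, 100 slow requests were allowed and 150 occurred, so the budget is used one and a half times and an alarm is justified. In the second, half the budget is still left.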
Cost Discipline
Observability eats itself if you let it, and that is the direct bridge to the AWS money pits. Three levers keep the costs in check.
Trace sampling: you do not have to trace every request. Tail sampling in the collector keeps, for example, 10 percent of requests plus all that contain an error. Head sampling alone cannot do that, because it decides at the start of a request, before it is known whether an error will occur. That way you see every problem and pay only a fraction.
Log retention: debug logs do not belong kept forever. A sensible retention per log group saves quietly and steadily.
Metric cardinality: high-cardinality labels like user ID or request ID do not belong in metrics, but in traces. A metric with one label per user explodes in cardinality, and with AMP you pay per million samples.
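The sampling rule from the first lever, 10 percent plus every error, can be sketched in a few lines of Python. This is an illustration of the decision logic, not the collector's sampling processor: the function name `keep_trace` and the hashing scheme are assumptions. The decision is made after the trace is complete, so the error flag is known.

```python
import hashlib

def keep_trace(trace_id, has_error, sample_percent=10):
    """Tail-sampling decision, made once the whole trace is known."""
    if has_error:
        return True  # every failing trace is kept
    # Hash the trace ID so the decision is stable for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent

trace_ids = [f"trace-{i}" for i in range(10_000)]
kept_ok = sum(keep_trace(t, has_error=False) for t in trace_ids)
kept_errors = sum(keep_trace(t, has_error=True) for t in trace_ids)

print(kept_errors)               # 10000
print(kept_ok / len(trace_ids))  # close to 0.10
```

Hashing the trace ID and not drawing a random number makes the decision repeatable: every collector instance that sees the same trace reaches the same verdict.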
The rough comparison: CloudWatch is simple and without license costs, but scales in price at high volume. AMP plus AMG has user and sample costs, but is more predictable at large metric volume. Datadog and comparable vendors are more powerful and considerably more expensive. For mid-sized companies the order is usually clear: first exhaust CloudWatch, then extend deliberately.
Which Stack for Whom
Four questions settle the choice before you set up anything. How many services and teams, now and in twelve months? Are there already Grafana dashboards or PromQL knowledge on the team? Do you need high-cardinality metrics, for example per tenant or per customer? Do you want to be tied to CloudWatch or stay portable?
| Profile | Recommendation |
|---|---|
| One to a few services, one team, AWS-only | CloudWatch plus Application Signals |
| Existing Grafana stack, PromQL knowledge | ADOT to AMP plus AMG, logs in CloudWatch |
| High-cardinality per-tenant metrics | AMP for metrics, CloudWatch for logs |
| Multi-cloud or portability important | OTel collector, backend interchangeable |
| Maximum depth, budget available | Datadog or New Relic, but check the cost |
The nice thing about the OpenTelemetry layer: this decision is no longer final. Instrument with OTel, and you can switch the stack later without touching the app.
Anti-Patterns
Logs as the only pillar. Without metrics and traces, every incident is a grep session, and performance problems stay invisible until a user calls.
Average latency. Hides the slow requests that count. Always percentiles.
Tracing everything without sampling. The trace bill explodes, the insight barely grows. 10 percent plus all errors is almost always enough.
High-cardinality labels in metrics. User ID or request ID as a metric dimension blows up cardinality and cost. That information belongs in traces.
A Grafana stack for a single-service app. Many moving parts, little gain over CloudWatch.
Building dashboards nobody looks at. Observability is for incidents and SLOs, not wall decoration in the office.
X-Ray SDK in new projects. It is in maintenance mode. New instrumentation goes through OpenTelemetry.
Telemetry without alarms. Data is collected, but nobody gets notified when the error rate rises. Then you find out about the outage only when the user calls.
Conclusion
Observability is the ability to answer questions you did not ask in advance. Three pillars, connected through the trace ID, deliver it: logs for the what, metrics for the how much, traces for the where. Without the connection they are three separate data heaps, with it a tool.
In 2026, the instrumentation decision has gotten easier. OpenTelemetry has prevailed, X-Ray is in maintenance mode, and CloudWatch accepts OTLP across all three pillars. Instrument with OpenTelemetry, and you are not tied to CloudWatch or any vendor, and can switch the stack later without code changes. For mid-sized companies, CloudWatch plus Application Signals is the pragmatic default, with Grafana and Prometheus added when the requirements carry them.
In the end, what counts is not the amount of data collected, but the answer in the incident and the alarm that arrives before the user calls.
Are you already collecting telemetry, but the answer is still missing in an incident, or the CloudWatch bill grows faster than the insight? Contact me for an observability audit that reviews and prioritizes instrumentation, alarms, and SLOs in a few days.
I am currently building an IaC kit for production Laravel on AWS, including an ADOT collector sidecar and sensible base alarms. If you want to know early, you can get on the list without obligation. No newsletter barrage, just one message when it is ready.