Observability & Tracing
Master observability in microservices with comprehensive metrics, logging, and distributed tracing strategies.
Introduction
What is observability in microservices
Observability is the ability to understand the internal state and behavior of your distributed system by examining its external outputs, primarily through metrics, logs, and traces. In microservices architectures, observability becomes critical because applications are distributed across multiple services, making it impossible to understand system behavior by looking at any single component in isolation. Unlike traditional monitoring which focuses on predefined alerts and dashboards, observability provides the tools and data needed to ask arbitrary questions about your system's behavior and debug unknown problems as they arise. Effective observability enables teams to quickly identify, understand, and resolve issues in complex distributed systems while also providing insights for performance optimization and capacity planning.
Why metrics, logs, and tracing are essential
Metrics provide quantitative measurements of system behavior over time, such as request rates, error counts, response times, and resource utilization, giving you a high-level view of system health and performance trends. Logs capture detailed records of individual events and transactions, providing the context needed to understand what happened during specific incidents or to debug complex business logic failures. Traces follow individual requests as they flow through multiple services, showing the complete journey of a transaction and enabling you to identify bottlenecks, failures, and dependencies across service boundaries. Together, these three pillars of observability provide complementary views into your system: metrics show you what's happening, logs explain why it's happening, and traces reveal how it's happening across your distributed architecture.
Overview of monitoring and alerting
Monitoring involves continuously collecting, storing, and analyzing observability data to maintain awareness of system health and performance, while alerting automatically notifies teams when predefined conditions indicate problems or anomalies. Effective monitoring strategies combine real-time dashboards for operational awareness, historical analysis for trend identification, and proactive alerting for incident response, creating a comprehensive view of system behavior that enables both reactive problem-solving and proactive optimization. Modern monitoring approaches focus on building observability into applications from the ground up rather than trying to monitor systems from the outside, ensuring that the data needed for effective troubleshooting and optimization is available when needed. The goal is to create monitoring systems that provide actionable insights without overwhelming teams with false positives or irrelevant alerts, enabling quick response to real issues while maintaining confidence in system reliability.
Metrics
Metrics collection with Micrometer
Micrometer is Spring Boot's metrics facade that provides a vendor-neutral interface for collecting application metrics, automatically integrating with popular monitoring backends like Prometheus, InfluxDB, and CloudWatch (with Grafana typically layered on top for visualization). It acts as a dimensional metrics library that abstracts away the differences between various metrics backends, allowing you to write metric collection code once and export to multiple monitoring systems without code changes. Micrometer automatically collects JVM metrics, HTTP request metrics, database connection pool metrics, and other infrastructure-level measurements out of the box, while providing APIs for custom business metrics. The library integrates seamlessly with Spring Boot's auto-configuration, requiring minimal setup to start collecting comprehensive metrics about your application's behavior and performance.
Example: exposing custom counters and gauges
Custom counters track the number of times specific events occur in your application, such as user registrations, order completions, or payment failures, providing insights into business-level activities and operational patterns. Gauges represent current values that can go up or down, such as active user sessions, queue sizes, or cache hit rates, giving you real-time visibility into system state and resource utilization. Micrometer's fluent API makes it easy to create and manage custom metrics with proper tagging for dimensional analysis, enabling you to slice and dice metrics by various attributes like user type, geographic region, or service version. These custom metrics bridge the gap between technical infrastructure metrics and business KPIs, providing the data needed to understand both system performance and business impact.
// Micrometer imports; domain types (Order, OrderRepository, UserLoginEvent, ...) are elided
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final MeterRegistry meterRegistry;
    private final OrderRepository orderRepository;
    private final Counter orderCreatedCounter;
    private final Timer orderProcessingTimer;

    public OrderService(MeterRegistry meterRegistry, OrderRepository orderRepository) {
        this.meterRegistry = meterRegistry;
        this.orderRepository = orderRepository;

        // Counter for successful orders (tags are fixed at registration time)
        this.orderCreatedCounter = Counter.builder("orders.created")
                .description("Number of orders successfully created")
                .tag("service", "order-service")
                .register(meterRegistry);

        // Gauge for the current number of active orders; Micrometer calls the
        // supplied function whenever the registry is scraped
        Gauge.builder("orders.active", this, OrderService::getActiveOrderCount)
                .description("Number of currently active orders")
                .register(meterRegistry);

        // Timer for order processing duration
        this.orderProcessingTimer = Timer.builder("orders.processing.duration")
                .description("Time taken to process orders")
                .register(meterRegistry);
    }

    public Order createOrder(CreateOrderRequest request) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            Order order = processOrder(request);
            orderCreatedCounter.increment();
            // Counters whose tag values vary per request are resolved through the
            // registry; a separate metric name avoids mixing tag sets on one meter
            meterRegistry.counter("orders.created.by.type",
                    "user_type", request.getUserType(),
                    "order_type", request.getOrderType())
                    .increment();
            return order;
        } catch (PaymentException e) {
            meterRegistry.counter("orders.failed", "reason", "payment_failed").increment();
            throw e;
        } catch (InventoryException e) {
            meterRegistry.counter("orders.failed", "reason", "insufficient_inventory").increment();
            throw e;
        } finally {
            // Record the processing duration regardless of outcome
            sample.stop(orderProcessingTimer);
        }
    }

    @EventListener
    public void handleUserLogin(UserLoginEvent event) {
        meterRegistry.counter("user.logins",
                "user_type", event.getUserType(),
                "login_method", event.getLoginMethod())
                .increment();
    }

    private double getActiveOrderCount() {
        return orderRepository.countByStatus(OrderStatus.ACTIVE);
    }
}
Integrating metrics with Prometheus/Grafana dashboards
Prometheus integration enables automatic scraping of Micrometer metrics through Spring Boot Actuator endpoints, providing a scalable time-series database for storing and querying metrics data with powerful PromQL query language capabilities. The integration exposes all application metrics in Prometheus format at the /actuator/prometheus endpoint, which Prometheus servers can scrape at regular intervals to build comprehensive time-series datasets. Grafana dashboards consume Prometheus data to create rich visualizations including graphs, heatmaps, and alerts that provide both real-time operational visibility and historical analysis capabilities. This combination creates a powerful observability stack where Micrometer handles metric collection, Prometheus provides storage and querying, and Grafana delivers visualization and alerting, enabling teams to build sophisticated monitoring solutions tailored to their specific needs.
# application.yml - Prometheus integration
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  metrics:
    export:
      prometheus:
        enabled: true
    distribution:
      percentiles-histogram:
        http.server.requests: true
        orders.processing.duration: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
        orders.processing.duration: 0.5, 0.95, 0.99
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active}
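Once Prometheus scrapes /actuator/prometheus, the custom metrics from the earlier example appear in the text exposition format. The excerpt below is illustrative only (names follow Micrometer's Prometheus naming conventions; the values and exact label sets will vary with your configuration):

# HELP orders_created_total Number of orders successfully created
# TYPE orders_created_total counter
orders_created_total{application="order-service",service="order-service"} 1287.0
# HELP orders_active Number of currently active orders
# TYPE orders_active gauge
orders_active{application="order-service"} 42.0
# Timer series (orders_processing_duration_seconds_count/_sum/_bucket) follow the same pattern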
@Configuration
public class MetricsConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        // Implementation-Version can be null when running outside a packaged jar
        String version = getClass().getPackage().getImplementationVersion();
        return registry -> registry.config()
                .commonTags("application", "order-service")
                .commonTags("version", version != null ? version : "unknown");
    }

    @Bean
    @ConditionalOnProperty(value = "management.metrics.export.prometheus.enabled", havingValue = "true")
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}
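The TimedAspect bean enables Micrometer's @Timed annotation, so method timings can be declared instead of managed by hand with Timer samples. A minimal sketch; the PaymentService class, its method, and the metric/tag names are hypothetical:

import io.micrometer.core.annotation.Timed;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    // TimedAspect intercepts this call and records its duration as the
    // "payments.charge.duration" timer with the extra tag provider=stripe
    @Timed(value = "payments.charge.duration", extraTags = {"provider", "stripe"})
    public PaymentResult charge(PaymentRequest request) {
        // payment provider call elided
        return PaymentResult.success();
    }
}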
Logging
Logging strategies in microservices
Effective logging strategies in microservices focus on creating consistent, searchable, and contextual log entries that can be correlated across service boundaries to understand complex distributed transactions. Each service should log at appropriate levels (ERROR, WARN, INFO, DEBUG) with consistent formatting and include correlation IDs that link related log entries across multiple services involved in the same business operation. Logging should capture not just errors and exceptions, but also important business events, performance milestones, and security-relevant activities, providing a comprehensive audit trail of system behavior. The strategy must balance providing sufficient detail for troubleshooting with avoiding log volume that overwhelms storage systems or makes analysis difficult, often achieved through careful level configuration and log sampling techniques.
Centralized logging (ELK/EFK stack overview)
The ELK (Elasticsearch, Logstash, Kibana) and EFK (Elasticsearch, Fluentd, Kibana) stacks provide centralized logging solutions where logs from all microservices are collected, processed, stored, and visualized in a unified system. Elasticsearch serves as the scalable search and analytics engine that stores log data and enables fast querying across massive datasets, while Logstash or Fluentd act as log processing pipelines that collect, parse, transform, and forward logs from various sources. Kibana provides the web-based interface for searching, filtering, and visualizing log data through dashboards, graphs, and real-time monitoring capabilities that enable teams to quickly find relevant information during incidents. This centralized approach eliminates the need to SSH into individual servers or containers to view logs, instead providing a single interface where operations teams can search across all services simultaneously and correlate activities across the entire distributed system.
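To feed such a stack, each service typically emits JSON logs that a shipper (Filebeat, Fluentd, or a direct TCP appender) forwards into the pipeline. A minimal logback-spring.xml sketch using the logstash-logback-encoder library; the Logstash host/port and service name are assumptions for a local setup:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- JSON logs to stdout, picked up by a node-level shipper such as Fluentd or Filebeat -->
    <appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <!-- MDC fields such as correlationId are included automatically -->
            <customFields>{"service":"order-service"}</customFields>
        </encoder>
    </appender>

    <!-- Optional: ship directly to Logstash over TCP -->
    <appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
        <destination>localhost:5000</destination>
        <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON_CONSOLE"/>
        <appender-ref ref="LOGSTASH"/>
    </root>
</configuration>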
Structured logging for easier tracing
Structured logging uses consistent formats like JSON to make log entries machine-readable and searchable, enabling automated analysis and correlation across distributed systems. Instead of writing free-form text messages, structured logging captures log data as key-value pairs that can be indexed and queried efficiently, including fields like timestamp, service name, correlation ID, user ID, and business context. This approach enables powerful log analysis capabilities such as filtering by specific users, tracing requests across services, aggregating error patterns, and building automated alerting based on log content patterns. Structured logs also integrate seamlessly with log processing pipelines and observability platforms, enabling automated parsing and enrichment that would be difficult or impossible with unstructured text logs.
// kv(...) is a static import of net.logstash.logback.argument.StructuredArguments.kv
// and requires the logstash-logback-encoder dependency on the classpath
@Slf4j
@RestController
public class OrderController {

    private final OrderService orderService;
    private final Logger structuredLogger = LoggerFactory.getLogger("STRUCTURED");

    public OrderController(OrderService orderService) {
        this.orderService = orderService;
    }

    @PostMapping("/orders")
    public ResponseEntity<Order> createOrder(@RequestBody CreateOrderRequest request,
                                             HttpServletRequest httpRequest) {
        String correlationId = UUID.randomUUID().toString();
        String userId = extractUserId(httpRequest);

        // Add correlation ID to MDC so all subsequent logs on this thread carry it
        MDC.put("correlationId", correlationId);
        MDC.put("userId", userId);
        MDC.put("operation", "createOrder");
        try {
            log.info("Creating order for user: {} with correlation: {}", userId, correlationId);
            Order order = orderService.createOrder(request);

            // Structured business event logging
            structuredLogger.info("ORDER_CREATED",
                    kv("correlationId", correlationId),
                    kv("userId", userId),
                    kv("orderId", order.getId()),
                    kv("orderValue", order.getTotalAmount()),
                    kv("itemCount", order.getItems().size()),
                    kv("timestamp", Instant.now()),
                    kv("event", "ORDER_CREATED"));

            return ResponseEntity.ok(order);
        } catch (PaymentException e) {
            log.error("Payment failed for user: {} correlation: {}", userId, correlationId, e);
            structuredLogger.error("PAYMENT_FAILED",
                    kv("correlationId", correlationId),
                    kv("userId", userId),
                    kv("errorType", "PAYMENT_FAILED"),
                    kv("errorMessage", e.getMessage()),
                    kv("timestamp", Instant.now()));
            throw new OrderCreationException("Order creation failed", e);
        } finally {
            MDC.clear(); // Clean up MDC so context does not leak across pooled threads
        }
    }
}
# application.yml - logging configuration (correlation ID included in every log line via MDC)
logging:
  level:
    com.example.orderservice: INFO
    STRUCTURED: INFO
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level [%X{correlationId:-}] %logger{36} - %msg%n"
    file: "%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level [%X{correlationId:-}] %logger{36} - %msg%n"

# Expose the loggers actuator endpoint so log levels can be adjusted at runtime
management:
  endpoints:
    web:
      exposure:
        include: loggers
@Component
public class LoggingFilter implements Filter {

    private static final Logger log = LoggerFactory.getLogger(LoggingFilter.class);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        HttpServletResponse httpResponse = (HttpServletResponse) response;

        String correlationId = httpRequest.getHeader("X-Correlation-ID");
        if (correlationId == null) {
            correlationId = UUID.randomUUID().toString();
        }

        // Set correlation ID in response header and MDC
        httpResponse.setHeader("X-Correlation-ID", correlationId);
        MDC.put("correlationId", correlationId);
        MDC.put("requestUri", httpRequest.getRequestURI());
        MDC.put("httpMethod", httpRequest.getMethod());

        long startTime = System.currentTimeMillis();
        try {
            chain.doFilter(request, response);
        } finally {
            long duration = System.currentTimeMillis() - startTime;
            log.info("Request completed - Status: {} Duration: {}ms",
                    httpResponse.getStatus(), duration);
            MDC.clear();
        }
    }
}
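With a JSON encoder such as the one sketched earlier, each of these entries is emitted as a single JSON document that the pipeline can index field by field. The example below is purely illustrative of the general shape; exact field names and values depend on the encoder configuration:

{
  "@timestamp": "2024-01-15T10:23:45.123Z",
  "level": "INFO",
  "logger_name": "STRUCTURED",
  "message": "ORDER_CREATED",
  "correlationId": "5f0c9c1e-6e2a-4f6e-9a3a-2d9a1c7b4e10",
  "userId": "user-123",
  "orderId": "ORD-9001",
  "orderValue": 149.99,
  "itemCount": 3,
  "event": "ORDER_CREATED"
}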
Distributed Tracing
What is distributed tracing and why it is important
Distributed tracing tracks individual requests as they flow through multiple services in a microservices architecture, creating a complete picture of the request's journey including timing, dependencies, and any errors that occur along the way. Each trace represents a single transaction or operation initiated by a user or external system, while spans represent individual operations within that trace, such as database queries, HTTP calls, or business logic execution. This visibility is crucial in microservices because a single user action might trigger dozens of internal service calls, making it impossible to understand performance bottlenecks or failure points without seeing the complete request flow. Distributed tracing enables teams to quickly identify which service is causing latency, where errors are occurring, and how changes to one service affect the overall system performance.
Using OpenTelemetry with Spring Boot
OpenTelemetry is the industry-standard observability framework that provides APIs, libraries, and instrumentation for collecting traces, metrics, and logs from applications in a vendor-neutral way. Spring Boot integrates seamlessly with OpenTelemetry through auto-instrumentation that automatically creates spans for HTTP requests, database calls, messaging operations, and other common frameworks without requiring code changes. The integration captures detailed timing information, propagates trace context across service boundaries, and exports trace data to various backends like Jaeger, Zipkin, or cloud-native tracing services. OpenTelemetry's standardized approach ensures that tracing works consistently across different languages and frameworks, enabling end-to-end visibility even in polyglot microservices environments.
<!-- Maven dependencies for OpenTelemetry -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-jaeger</artifactId>
</dependency>
// OpenTelemetry API imports; domain types and collaborators are elided
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.instrumentation.annotations.WithSpan;

@Service
@Slf4j
public class OrderService {

    private final Tracer tracer;
    private final PaymentService paymentService;
    private final InventoryService inventoryService;
    private final NotificationService notificationService;

    public OrderService(Tracer tracer, PaymentService paymentService,
                        InventoryService inventoryService, NotificationService notificationService) {
        this.tracer = tracer;
        this.paymentService = paymentService;
        this.inventoryService = inventoryService;
        this.notificationService = notificationService;
    }

    @WithSpan("order.create") // annotation-driven span created by the instrumentation
    public Order createOrder(CreateOrderRequest request) {
        // Manual child span carrying business attributes
        Span span = tracer.spanBuilder("order.processing")
                .setAttribute("user.id", request.getUserId())
                .setAttribute("order.type", request.getOrderType())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            log.info("Processing order for user: {}", request.getUserId());

            // Validate inventory - creates a child span
            validateInventory(request.getItems());

            // Process payment - instrumented HTTP/DB clients create child spans automatically
            PaymentResult paymentResult = paymentService.processPayment(request.getPaymentInfo());
            if (!paymentResult.isSuccessful()) {
                span.setStatus(StatusCode.ERROR, "Payment failed");
                throw new PaymentException("Payment processing failed");
            }

            // Create order record
            Order order = createOrderRecord(request, paymentResult);
            span.setAttribute("order.id", String.valueOf(order.getId()));
            span.setAttribute("order.total", order.getTotalAmount().toString());

            // Send confirmation asynchronously
            sendOrderConfirmation(order);
            return order;
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            log.error("Order creation failed", e);
            throw e;
        } finally {
            span.end();
        }
    }

    private void validateInventory(List<OrderItem> items) {
        Span span = tracer.spanBuilder("inventory.check")
                .setAttribute("item.count", items.size())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            for (OrderItem item : items) {
                boolean available = inventoryService.checkAvailability(item.getProductId(), item.getQuantity());
                if (!available) {
                    span.setStatus(StatusCode.ERROR, "Product not available");
                    span.setAttribute("unavailable.product", String.valueOf(item.getProductId()));
                    throw new InsufficientStockException("Product not available: " + item.getProductId());
                }
            }
            span.setAttribute("validation.result", "success");
        } finally {
            span.end();
        }
    }

    @Async
    public void sendOrderConfirmation(Order order) {
        // NOTE: carrying trace context into @Async threads requires a context-aware
        // executor (e.g. wrapping the task executor with Context.taskWrapping)
        Span span = tracer.spanBuilder("order.confirmation")
                .setAttribute("order.id", String.valueOf(order.getId()))
                .setAttribute("notification.type", "email")
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            notificationService.sendOrderConfirmation(order);
            span.setAttribute("notification.sent", true);
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
        } finally {
            span.end();
        }
    }
}
Visualizing traces across services
Trace visualization tools like Jaeger and Zipkin provide web-based interfaces that display traces as interactive timelines showing the sequence and duration of operations across multiple services. These visualizations reveal the complete request flow including service dependencies, parallel operations, and bottlenecks, making it easy to identify which services contribute most to overall latency. The timeline view shows spans as bars with duration proportional to their execution time, while service maps display the overall architecture and communication patterns discovered from trace data. Advanced features include trace comparison for performance analysis, error highlighting, and drill-down capabilities that allow teams to examine specific operations in detail, enabling quick identification of performance issues and system inefficiencies.
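For local experimentation, Jaeger's all-in-one image bundles the collector, query service, and UI in a single container. A minimal docker-compose sketch; the image tag and ports are the commonly used defaults and should be verified against the Jaeger version you deploy:

# docker-compose.yml - local Jaeger for trace visualization (illustrative)
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI - open http://localhost:16686 to browse traces
      - "14250:14250"   # gRPC collector endpoint used by the Jaeger exporter configured below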
Detecting latency, bottlenecks, and errors
Distributed tracing enables systematic detection of performance issues by analyzing span durations, identifying operations that consistently take longer than expected, and highlighting services that contribute disproportionately to overall request latency. Error detection becomes straightforward as failed spans are clearly marked in traces, showing exactly where in the request flow errors occur and providing context about the operations that were in progress when failures happened. Bottleneck analysis involves examining traces to find operations that block other operations or services that become overloaded under high traffic, while dependency analysis reveals critical paths and services that affect overall system performance. Teams can use trace data to establish performance baselines, set up alerting for anomalous latency patterns, and prioritize optimization efforts based on actual user impact rather than guessing about performance problems.
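Trace data pinpoints where latency originates, while the metrics pipeline from earlier can alert on it. The following is a hedged sketch of a Prometheus alerting rule on the p95 of the order-processing timer; the metric name assumes Micrometer's Prometheus naming plus the percentile histogram enabled earlier, and the threshold is illustrative:

# prometheus-alerts.yml (illustrative)
groups:
  - name: latency-alerts
    rules:
      - alert: OrderProcessingP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(orders_processing_duration_seconds_bucket{application="order-service"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 order processing latency above 2s for 10 minutes"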
# application.yml - OpenTelemetry configuration
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% for development, reduce in production

otel:
  exporter:
    jaeger:
      endpoint: http://localhost:14250
  resource:
    attributes:
      service.name: order-service
      service.version: 1.0.0
      environment: ${spring.profiles.active}
  instrumentation:
    spring-webmvc:
      enabled: true
    jdbc:
      enabled: true
    kafka:
      enabled: true
@Configuration
public class TracingConfig {

    @Bean
    public OpenTelemetry openTelemetry() {
        return OpenTelemetrySdk.builder()
                .setTracerProvider(
                        SdkTracerProvider.builder()
                                .addSpanProcessor(BatchSpanProcessor.builder(
                                        JaegerGrpcSpanExporter.builder()
                                                .setEndpoint("http://localhost:14250")
                                                .build())
                                        .build())
                                .setResource(Resource.getDefault()
                                        .merge(Resource.builder()
                                                .put(ResourceAttributes.SERVICE_NAME, "order-service")
                                                .put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
                                                .build()))
                                .build())
                .build();
    }

    @Bean
    public Tracer tracer(OpenTelemetry openTelemetry) {
        return openTelemetry.getTracer("order-service");
    }
}
Lesson Summary
In this lesson, we explored observability and distributed tracing for microservices architectures. Here's a comprehensive summary of all the concepts and implementation approaches covered:
Observability Fundamentals
- Definition: Ability to understand internal system state by examining external outputs through metrics, logs, and traces
- Importance in microservices: Critical for distributed systems where no single component provides complete visibility
- Three pillars: Metrics (what's happening), logs (why it's happening), traces (how it's happening across services)
- Benefits: Quick issue identification, debugging unknown problems, performance optimization, and capacity planning
Metrics Collection with Micrometer
- Purpose: Vendor-neutral metrics facade providing dimensional metrics for Spring Boot applications
- Integration: Seamless Spring Boot auto-configuration with multiple monitoring backends (Prometheus, InfluxDB, CloudWatch), visualized through tools like Grafana
- Built-in metrics: JVM metrics, HTTP requests, database connections, and infrastructure measurements
- Custom metrics: Counters for event counting, gauges for current values, timers for duration measurement
Custom Metrics Implementation
- Counters: Track business events like order completions, user registrations, and payment failures
- Gauges: Monitor real-time values like active sessions, queue sizes, and cache hit rates
- Timers: Measure operation durations with percentile distributions for performance analysis
- Tagging: Dimensional analysis with attributes like user type, geographic region, and service version
Prometheus and Grafana Integration
- Prometheus scraping: Automatic metric collection through /actuator/prometheus endpoint
- Time-series storage: Scalable data storage with powerful PromQL query language capabilities
- Grafana visualization: Rich dashboards with graphs, heatmaps, and real-time monitoring
- Alerting capabilities: Proactive notifications based on metric thresholds and anomaly detection
Logging Strategies
- Consistency: Uniform formatting and correlation IDs across all microservices
- Context preservation: Correlation IDs linking related log entries across service boundaries
- Appropriate levels: ERROR, WARN, INFO, DEBUG for different types of information
- Balance: Sufficient detail for troubleshooting without overwhelming storage systems
Centralized Logging
- ELK/EFK stacks: Elasticsearch for storage, Logstash/Fluentd for processing, Kibana for visualization
- Unified interface: Single location for searching logs across all microservices
- Correlation capabilities: Cross-service log analysis and real-time monitoring dashboards
- Operational benefits: Eliminates need for individual server access and enables automated analysis
Structured Logging
- Machine-readable format: JSON-based logging with consistent key-value pairs for automated analysis
- Enhanced searchability: Efficient filtering and querying capabilities across large datasets
- Integration benefits: Seamless processing with log pipelines and observability platforms
- Context enrichment: Correlation IDs, user IDs, and business context in structured format
Distributed Tracing Concepts
- Purpose: Track individual requests across multiple services showing complete journey and timing
- Traces and spans: Traces represent transactions, spans represent individual operations within traces
- Critical visibility: Essential for understanding performance bottlenecks and failure points in distributed systems
- Request flow analysis: Complete picture of service dependencies and interaction patterns
OpenTelemetry Implementation
- Industry standard: Vendor-neutral observability framework for traces, metrics, and logs
- Auto-instrumentation: Automatic span creation for HTTP, database, messaging without code changes
- Context propagation: Automatic trace context transmission across service boundaries
- Backend flexibility: Export to Jaeger, Zipkin, or cloud-native tracing services
Trace Visualization and Analysis
- Timeline views: Interactive displays showing operation sequence and duration across services
- Service maps: Architecture visualization and communication pattern discovery from trace data
- Performance analysis: Bottleneck identification and latency analysis tools
- Error detection: Clear marking of failed operations with context and error propagation
Observability Best Practices
- Sampling strategies: Balance between data completeness and system performance impact
- Correlation IDs: Consistent request tracking across all observability pillars
- Business metrics: Bridge technical metrics with business KPIs for comprehensive visibility
- Alerting hygiene: Actionable alerts without false positives for effective incident response
Production Deployment Considerations
- Performance impact: Minimal overhead from instrumentation and data collection
- Data retention: Appropriate storage policies for metrics, logs, and traces
- Security: Sensitive data handling in logs and traces with proper redaction
- Scalability: Observability infrastructure scaling with application growth
Key Takeaways
- Observability is essential for understanding and operating complex microservices architectures effectively
- Metrics, logs, and traces provide complementary views that together enable comprehensive system understanding
- Micrometer and OpenTelemetry provide industry-standard, vendor-neutral observability instrumentation
- Structured logging and correlation IDs are crucial for effective distributed system troubleshooting
- Proper observability implementation enables proactive monitoring, quick incident response, and data-driven optimization