
Monitoring & Logging

Best practices for observability using Prometheus, Grafana, and ELK stack.

Why Observability Matters

Observability is crucial for:

  • Understanding system behavior
  • Detecting and diagnosing issues
  • Performance optimization
  • Capacity planning
  • SLA/SLO tracking

The Three Pillars of Observability

  1. Metrics - Numerical measurements over time
  2. Logs - Discrete events with context
  3. Traces - Request flow through distributed systems

Prometheus

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability.

Installation

Using Docker:

docker run -d \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

Configuration

Create prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['localhost:3000']

Instrumenting Your Application

Node.js Example:

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Registry
const register = new promClient.Registry();

// Add default metrics
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Middleware to measure request duration
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.path, status_code: res.statusCode });
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);

PromQL Queries

Common Prometheus queries:

# CPU usage (per process, from the client library's default metrics)
rate(process_cpu_seconds_total[5m])

# Memory usage
process_resident_memory_bytes / 1024 / 1024

# HTTP request rate
rate(http_requests_total[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
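
These queries can also be run programmatically against Prometheus's HTTP API (GET /api/v1/query). Below is a minimal sketch using Node 18+'s built-in fetch, assuming Prometheus is reachable at localhost:9090:

// Query the Prometheus HTTP API for the current HTTP request rate.
// Assumes Node 18+ (global fetch) and Prometheus at localhost:9090.
const query = 'rate(http_requests_total[5m])';

async function runQuery() {
  const url = `http://localhost:9090/api/v1/query?query=${encodeURIComponent(query)}`;
  const res = await fetch(url);
  const body = await res.json();

  // body.data.result is an array of { metric: {...labels}, value: [timestamp, "value"] }
  for (const series of body.data.result) {
    console.log(series.metric, series.value[1]);
  }
}

runQuery().catch(console.error);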

Grafana

Installation

Using Docker:

docker run -d \
  -p 3000:3000 \
  --name=grafana \
  grafana/grafana

Connecting to Prometheus

  1. Navigate to Configuration → Data Sources
  2. Add Prometheus data source
  3. Set URL to http://prometheus:9090
  4. Click "Save & Test"
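
The same data source can also be created programmatically through Grafana's HTTP API. Below is a minimal sketch, assuming Grafana runs locally with the default admin/admin credentials (change these in any real deployment) and Prometheus is reachable at http://prometheus:9090:

// Create a Prometheus data source via Grafana's HTTP API.
// Assumes Node 18+ (global fetch) and Grafana at localhost:3000.
async function addPrometheusDataSource() {
  const res = await fetch('http://localhost:3000/api/datasources', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Basic ' + Buffer.from('admin:admin').toString('base64')
    },
    body: JSON.stringify({
      name: 'Prometheus',
      type: 'prometheus',
      url: 'http://prometheus:9090',
      access: 'proxy'
    })
  });
  console.log(res.status, await res.json());
}

addPrometheusDataSource().catch(console.error);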

Creating Dashboards

Example dashboard JSON:

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ],
        "type": "graph"
      }
    ]
  }
}

Importing Pre-built Dashboards

Popular community dashboards can be imported from grafana.com by ID:

  • Node Exporter Full (ID: 1860)
  • Kubernetes Cluster Monitoring (ID: 7249)
  • Docker Container Metrics (ID: 193)

ELK Stack (Elasticsearch, Logstash, Kibana)

Docker Compose Setup

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # disable auth/TLS for local development only
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    ports:
      - "5000:5000"
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:

Logstash Configuration

Create logstash.conf:

input {
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  if [level] == "ERROR" {
    mutate {
      add_tag => ["error"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
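
Because the input above accepts JSON over TCP, you can verify the pipeline end to end before wiring up a logging library. Below is a minimal sketch using Node's built-in net module, assuming Logstash is reachable on localhost:5000:

const net = require('net');

// Send one newline-delimited JSON event to the Logstash TCP input.
const socket = net.createConnection({ host: 'localhost', port: 5000 }, () => {
  const event = {
    level: 'ERROR',                    // will be tagged "error" by the filter above
    message: 'Test event from Node',
    timestamp: new Date().toISOString()
  };
  socket.write(JSON.stringify(event) + '\n');
  socket.end();
});

socket.on('error', (err) => console.error('Could not reach Logstash:', err.message));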

Application Logging

Node.js with Winston:

const winston = require('winston');
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;

const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash',
      port: 5000
    })
  ]
});

logger.info('Application started');

// Inside an error handler, where `err` is the caught error:
logger.error('An error occurred', { error: err.message });

Loki (Lightweight Alternative to ELK)

Docker Compose with Loki

version: '3.8'

services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yaml
    command: -config.file=/etc/promtail/config.yaml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

Alerting

Prometheus Alerting Rules

Create alerts.yml and reference it from prometheus.yml under rule_files so Prometheus loads it:

groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} for {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (process_resident_memory_bytes / 1024 / 1024) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}MB"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"

Alertmanager Configuration

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'alerts@example.com'
        from: 'prometheus@example.com'
        smarthost: smtp.gmail.com:587
        auth_username: 'prometheus@example.com'
        auth_password: 'password'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
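
Alertmanager can also deliver notifications to a custom receiver configured with webhook_configs, which is useful for routing alerts into internal tooling. Below is a minimal Express sketch of such a receiver; the /alerts path and port 5001 are assumptions, so point the webhook_configs url at wherever you actually run it:

const express = require('express');

const app = express();
app.use(express.json());

// Alertmanager POSTs a JSON payload with an `alerts` array to the webhook URL.
app.post('/alerts', (req, res) => {
  for (const alert of req.body.alerts || []) {
    console.log(
      `[${alert.status}] ${alert.labels.alertname}:`,
      alert.annotations.description
    );
  }
  res.sendStatus(200);
});

app.listen(5001, () => console.log('Alert webhook listening on :5001'));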

Best Practices

1. Label Cardinality

Avoid high-cardinality labels:

Don't:

counter.inc({ user_id: userId }); // Too many unique values

Do:

counter.inc({ user_type: userType }); // Limited set of values
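
For context, here is a hedged sketch of how such a counter might be defined with prom-client; user_logins_total and user_type are illustrative names, the point being that the label has a small, fixed set of values:

const promClient = require('prom-client');

// A counter with a bounded label set: only a handful of user types exist,
// so the number of time series stays small and predictable.
const loginCounter = new promClient.Counter({
  name: 'user_logins_total',
  help: 'Total number of user logins',
  labelNames: ['user_type']
});

loginCounter.inc({ user_type: 'free' });
loginCounter.inc({ user_type: 'premium' });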

2. Structured Logging

Always use structured logs:

logger.info('User login', {
  userId: user.id,
  ip: req.ip,
  timestamp: new Date().toISOString()
});
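
A hedged sketch of a logger configured to emit structured JSON with Winston (the console transport and field values are just for illustration):

const winston = require('winston');

// Emit each log entry as a single JSON object with a timestamp,
// so downstream systems (Logstash, Loki, etc.) can parse fields directly.
const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()]
});

logger.info('User login', { userId: 42, ip: '203.0.113.7' });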

3. Log Levels

Use appropriate log levels (a configuration sketch follows the list):

  • DEBUG: Detailed diagnostic information
  • INFO: General informational messages
  • WARN: Warning messages for potentially harmful situations
  • ERROR: Error events that might still allow the application to continue
  • FATAL: Severe errors that cause premature termination
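
Most loggers only emit entries at or above a configured threshold. Below is a minimal sketch with Winston, assuming the active level is driven by a LOG_LEVEL environment variable:

const winston = require('winston');

// Only messages at or above the configured level are emitted;
// e.g. with level 'info', debug() calls are dropped.
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  transports: [new winston.transports.Console()]
});

logger.debug('Cache lookup details');   // suppressed unless LOG_LEVEL=debug
logger.info('Request handled');
logger.error('Upstream call failed');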

4. Sampling

For high-traffic applications, sample logs:

if (Math.random() < 0.1) { // 10% sampling
  logger.debug('Request processed', { requestId });
}
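
Purely random sampling can keep some log lines for a request and drop others, which makes debugging sampled traffic harder. A hedged alternative sketch that samples consistently per request by hashing the request ID (the 10% rate and requestId field are assumptions):

const crypto = require('crypto');

// Deterministic sampling: all log lines for a given requestId are either
// all kept or all dropped, so sampled requests remain fully traceable.
function shouldSample(requestId, rate = 0.1) {
  const hash = crypto.createHash('sha256').update(requestId).digest();
  // Map the first 4 bytes of the hash to a value in [0, 1).
  const bucket = hash.readUInt32BE(0) / 2 ** 32;
  return bucket < rate;
}

if (shouldSample(requestId)) {
  logger.debug('Request processed', { requestId });
}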

5. Retention Policies

Set appropriate data retention:

# Prometheus retention is set via command-line flags, not prometheus.yml
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=50GB

Distributed Tracing

Jaeger Setup

version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "9411:9411"
    environment:
      - COLLECTOR_ZIPKIN_HTTP_PORT=9411

OpenTelemetry Integration

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({
      serviceName: 'my-service',
    })
  )
);
provider.register();
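
Once the provider is registered, spans can be created anywhere in the application through the OpenTelemetry API. Below is a minimal sketch; the tracer name, span name, and attribute are illustrative:

const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function handleRequest() {
  // startActiveSpan makes the span current for everything run inside the callback.
  return tracer.startActiveSpan('handle-request', async (span) => {
    try {
      span.setAttribute('http.route', '/users');
      // ... do the actual work here ...
    } finally {
      span.end();   // always end the span so it gets exported to Jaeger
    }
  });
}

handleRequest();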