website-and-documentation/content/en/docs/edp/deployment/infrastructure/stacks/observability.md
Martin McCaffery 41e3306942
feat(docs): Restructure entire documentation
2025-12-18 10:25:07 +01:00

title: Observability
linkTitle: Observability
weight: 50
description: Comprehensive monitoring, metrics, and logging for Kubernetes infrastructure

Overview

The Observability stack provides monitoring, metrics collection, and logging for the Edge Developer Platform. Built on VictoriaMetrics and Grafana, it ships with pre-configured dashboards, alerting, and SSO integration.

The stack deploys VictoriaMetrics for metrics storage and querying, Grafana for visualization, VictoriaLogs for log aggregation, and VMAuth for authenticated access to monitoring endpoints.

Key Features

  • Metrics Collection: VictoriaMetrics-based Kubernetes monitoring with long-term storage
  • Visualization: Grafana with pre-built dashboards for ArgoCD, Ingress-Nginx, and infrastructure components
  • Log Aggregation: VictoriaLogs for centralized logging with Grafana integration
  • SSO Integration: OAuth authentication through Dex with role-based access control
  • Alerting: Alertmanager with email notifications for critical events
  • Secure Access: TLS-enabled ingress with authentication proxy (VMAuth)
  • Persistent Storage: Encrypted volumes with configurable retention policies

Repository

Code: Observability Stack Templates

Documentation:

Getting Started

Prerequisites

  • Kubernetes cluster with ArgoCD installed (provided by core stack)
  • Ingress controller configured (provided by otc stack)
  • cert-manager for TLS certificate management (provided by otc stack)
  • Dex SSO provider (provided by core stack)
  • Infrastructure deployed through Infra Deploy

Quick Start

The Observability stack is deployed as part of the EDP installation process:

  1. Trigger Deploy Pipeline

    • Go to Infra Deploy Pipeline
    • Click on Run workflow
    • Enter a name in "Select environment directory to deploy". The name must be DNS-compatible. For example, if you enter test-me, the domains will be vmauth.test-me.t09.de and grafana.test-me.t09.de.
    • Execute workflow
  2. ArgoCD Synchronization

    ArgoCD automatically deploys:

    • VictoriaMetrics Operator and components
    • VictoriaMetrics Single (metrics storage)
    • VMAuth (authentication proxy)
    • Alertmanager (alerting)
    • Grafana Operator
    • Grafana instance with OAuth
    • VictoriaLogs datasource
    • Pre-configured dashboards
    • Ingress configurations with TLS
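
The naming rule from step 1 can be checked before triggering the pipeline. The following is a minimal sketch, assuming that "DNS-compatible" means a single RFC 1123 label (lowercase letters, digits, and hyphens, starting and ending with a letter or digit, at most 63 characters). The function names is_dns_label and derived_hosts are hypothetical and not part of the EDP tooling, and the base domain t09.de is taken from the example above and may differ in other setups.

```shell
#!/bin/sh
# Hypothetical helper, not part of the EDP tooling.
# Succeeds if the argument is a valid RFC 1123 DNS label (assumed rule).
is_dns_label() (
  LC_ALL=C   # keep the a-z range ASCII-only; runs in a subshell, so nothing leaks
  name=$1
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 63 ] || exit 1
  case $name in
    *[!a-z0-9-]*|-*|*-) exit 1 ;;   # forbidden character, or leading/trailing hyphen
  esac
  exit 0
)

# Prints the hostnames implied by the example in the text for a given name.
# The default base domain is an assumption taken from that example.
derived_hosts() {
  env_name=$1
  base=${2:-t09.de}
  if ! is_dns_label "$env_name"; then
    echo "invalid environment name: $env_name" >&2
    return 1
  fi
  printf '%s\n' "vmauth.${env_name}.${base}" "grafana.${env_name}.${base}"
}

derived_hosts test-me
```

Running the sketch with test-me prints the two hostnames from the example; a name such as Test_Me is rejected before any hostname is printed.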

Verification

Verify the Observability deployment:

# Check ArgoCD applications status
kubectl get application grafana-operator -n argocd
kubectl get application victoria-k8s-stack -n argocd

# Verify VictoriaMetrics components are running
kubectl get pods -n observability

# Check Grafana instance status
kubectl get grafana grafana -n observability

# Verify ingress configurations
kubectl get ingress -n observability

Access the monitoring interfaces:

  • Grafana: https://grafana.{DOMAIN_O12Y}

Architecture

Component Architecture

The Observability stack consists of multiple integrated components:

VictoriaMetrics Components:

  • VictoriaMetrics Operator: Manages VictoriaMetrics custom resources
  • VictoriaMetrics Single: Standalone metrics storage with 20Gi storage and 1-month retention
  • VMAgent: Scrapes metrics from Kubernetes components (kubelet, CoreDNS, kube-apiserver, etcd)
  • VMAuth: Authentication proxy on port 8427 for secure metrics access
  • VMAlertmanager: Handles alert routing and notifications

Grafana Components:

  • Grafana Operator: Manages Grafana instances and dashboards as Kubernetes resources
  • Grafana Instance: Web application for metrics visualization with OAuth authentication
  • Pre-configured Dashboards: ArgoCD, Ingress-Nginx, VictoriaLogs monitoring

Logging:

  • VictoriaLogs: Log aggregation service integrated as Grafana datasource

Storage:

  • VictoriaMetrics Single: 20Gi persistent storage on csi-disk storage class
  • Grafana: 10Gi persistent storage on csi-disk storage class with KMS encryption
  • Configurable retention: 1 month for metrics (retentionPeriod: "1"); VictoriaMetrics enforces a minimum retention of 24 hours

Networking:

  • Nginx ingress with TLS termination for Grafana and VMAuth
  • cert-manager integration for automatic certificate management
  • Internal ClusterIP services for component communication

Configuration

VictoriaMetrics Configuration

Key configuration in stacks/observability/victoria-k8s-stack/values.yaml:

Operator Settings:

victoria-metrics-operator:
  enabled: true
  operator:
    enable_converter_ownership: true
  admissionWebhooks:
    certManager:
      enabled: true
      issuer:
        name: main

Storage Configuration:

vmsingle:
  enabled: true
  spec:
    retentionPeriod: "1"  # no unit suffix means months
    storage:
      storageClassName: csi-disk
      resources:
        requests:
          storage: 20Gi
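
The retentionPeriod value is easy to misread because a bare number is counted in months, while suffixed values such as 15d use explicit units. The sketch below illustrates that interpretation for whole numbers with the suffixes h, d, w, and y. The function name describe_retention is hypothetical, and the sketch deliberately covers only this subset of what VictoriaMetrics may accept.

```shell
#!/bin/sh
# Hypothetical helper: explains how a retentionPeriod string is read.
# Assumption: only whole numbers with an optional h/d/w/y suffix are handled here.
describe_retention() {
  value=$1
  case $value in
    *[hdwy]) number=${value%?}; suffix=${value#"$number"} ;;
    *)       number=$value;     suffix= ;;
  esac
  case $number in
    ''|*[!0-9]*)
      echo "unsupported retentionPeriod in this sketch: $value" >&2
      return 1 ;;
  esac
  case $suffix in
    h) unit="hour(s)" ;;
    d) unit="day(s)" ;;
    w) unit="week(s)" ;;
    y) unit="year(s)" ;;
    *) unit="month(s)" ;;   # no suffix: the value is counted in months
  esac
  echo "$number $unit"
}

describe_retention 1     # prints: 1 month(s)
describe_retention 15d   # prints: 15 day(s)
```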

VMAuth Configuration:

vmauth:
  enabled: true
  spec:
    port: "8427"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: "{{{ .Env.DOMAIN_O12Y }}}"
    tls:
      - secretName: vmauth-tls-secret
        hosts:
          - "{{{ .Env.DOMAIN_O12Y }}}"
    annotations:
      cert-manager.io/cluster-issuer: main

Monitoring Targets:

  • Kubelet (cadvisor, probes, resources metrics)
  • CoreDNS
  • etcd
  • kube-apiserver

Disabled Collectors (to avoid alerts on managed clusters):

  • kube-controller-manager
  • kube-scheduler
  • kube-proxy

Alertmanager Configuration

Email alerting configured in values.yaml:

alertmanager:
  spec:
    externalURL: "https://{{{ .Env.DOMAIN_O12Y }}}"
    configSecret: vmalertmanager-config
  config:
    route:
      routes:
        - matchers:
            - severity =~ "critical|major"
          receiver: mail
    receivers:
      - name: 'mail'
        email_configs:
          - to: 'alerts@example.com'
            from: 'monitoring@example.com'
            smarthost: 'mail.mms-support.de:465'
            auth_username:
              name: email-user-credentials
              key: username
            auth_password:
              name: email-user-credentials
              key: password
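
The matcher severity =~ "critical|major" is a regular expression match, and Alertmanager anchors such expressions, so only the exact values critical and major are routed to the mail receiver. The sketch below mimics that behaviour for a few sample label values; routes_to_mail is a hypothetical name used only for illustration, and the sample severities are made up.

```shell
#!/bin/sh
# Hypothetical illustration of the anchored matcher severity =~ "critical|major".
# An anchored alternation is equivalent to an exact match on either value.
routes_to_mail() {
  case $1 in
    critical|major) return 0 ;;
    *)              return 1 ;;
  esac
}

for severity in critical major warning criticality; do
  if routes_to_mail "$severity"; then
    echo "$severity -> mail"
  else
    echo "$severity -> not matched by this route"
  fi
done
```

Note that a value such as criticality does not match, because the expression must cover the whole label value.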

Grafana Configuration

Grafana instance configuration in stacks/observability/grafana-operator/manifests/grafana.yaml:

OAuth/SSO Integration:

config:
  auth.generic_oauth:
    enabled: "true"
    disable_login_form: "true"
    client_id: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_ID}"
    client_secret: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET}"
    scopes: "openid email profile offline_access groups"
    auth_url: "https://dex.{DOMAIN}/auth"
    token_url: "https://dex.{DOMAIN}/token"
    api_url: "https://dex.{DOMAIN}/userinfo"
    role_attribute_path: "contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'"
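
The role_attribute_path expression maps the groups claim from Dex to a Grafana role: membership in DevFW yields Admin, anything else yields Viewer. The following sketch reproduces that decision in plain shell to make the mapping explicit. It is an illustration only, not how Grafana evaluates the expression; grafana_role_for_groups is a hypothetical name, and the group names other than DevFW are made up.

```shell
#!/bin/sh
# Hypothetical illustration of:
#   contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'
# Each argument stands for one entry of the user's groups claim.
grafana_role_for_groups() {
  for group in "$@"; do
    if [ "$group" = "DevFW" ]; then
      echo "Admin"
      return 0
    fi
  done
  echo "Viewer"
}

grafana_role_for_groups developers DevFW   # prints: Admin
grafana_role_for_groups developers         # prints: Viewer
```

A user without any groups also ends up as Viewer, since the loop never finds DevFW.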

Storage:

deployment:
  spec:
    template:
      spec:
        volumes:
          - name: grafana-data
            persistentVolumeClaim:
              claimName: grafana-pvc

persistentVolumeClaim:
  spec:
    storageClassName: csi-disk
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi

Ingress:

ingress:
  spec:
    ingressClassName: nginx
    rules:
      - host: "{{{ .Env.DOMAIN_GRAFANA }}}"
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: grafana-service
                  port:
                    number: 3000
    tls:
      - hosts:
          - "{{{ .Env.DOMAIN_GRAFANA }}}"
        secretName: grafana-tls-secret

ArgoCD Application Configuration

Grafana Operator Application (template/stacks/observability/grafana-operator.yaml):

  • Name: grafana-operator
  • Chart: grafana-operator v5.18.0 from ghcr.io/grafana/helm-charts
  • Automated sync with self-healing enabled
  • Namespace: observability

VictoriaMetrics Stack Application (template/stacks/observability/victoria-k8s-stack.yaml):

  • Name: victoria-k8s-stack
  • Chart: victoria-metrics-k8s-stack v0.48.1 from https://victoriametrics.github.io/helm-charts/
  • Automated self-healing enabled
  • Creates namespace automatically

Usage Examples

Accessing Grafana

Access Grafana through SSO:

  1. Navigate to Grafana

    open https://grafana.${DOMAIN_GRAFANA}
    
  2. Authenticate via Dex

    • Click "Sign in with OAuth"
    • Authenticate through configured identity provider
    • Users in the DevFW group receive the Admin role; all other users receive the Viewer role

Querying Metrics

Query VictoriaMetrics directly:

# Instant query through the VMAuth endpoint (-s hides the progress meter)
curl -s -u username:password "https://vmauth.${DOMAIN_O12Y}/api/v1/query" \
  --data-urlencode 'query=up' | jq

# Query pod CPU usage
curl -s -u username:password "https://vmauth.${DOMAIN_O12Y}/api/v1/query" \
  --data-urlencode 'query=container_cpu_usage_seconds_total' | jq

# Query with time range
curl -s -u username:password "https://vmauth.${DOMAIN_O12Y}/api/v1/query_range" \
  --data-urlencode 'query=container_memory_usage_bytes' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T23:59:59Z' \
  --data-urlencode 'step=5m' | jq

Creating Custom Dashboards

Create custom Grafana dashboards as Kubernetes resources:

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: custom-app-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  # spec.json holds the dashboard model itself, without an API-style "dashboard" wrapper
  json: |
    {
      "title": "Custom Application Metrics",
      "panels": [
        {
          "title": "Request Rate",
          "targets": [
            {
              "expr": "rate(http_requests_total[5m])",
              "datasource": "VictoriaMetrics"
            }
          ]
        }
      ]
    }

Apply the dashboard:

kubectl apply -f custom-dashboard.yaml

Viewing Logs in Grafana

Access VictoriaLogs through Grafana:

  1. Navigate to Grafana https://grafana.${DOMAIN_GRAFANA}
  2. Go to Explore
  3. Select "VictoriaLogs" datasource
  4. Use LogsQL queries:
    {namespace="default"}
    {app="nginx"} error
    {namespace="observability"} | unpack_json | filter level:error
    

Setting Up Custom Alerts

Create custom alert rules using VMRule:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: custom-app-alerts
  namespace: observability
spec:
  groups:
    - name: custom-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} requests/sec"

Commit and push the alert rule to the stacks instances so that ArgoCD synchronizes it to the cluster.
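
To make the threshold in the HighErrorRate rule concrete: rate() yields an average per-second increase over the window, so a threshold of 0.05 over [5m] corresponds to more than 15 matching requests per time series within 300 seconds. The small calculation below shows the arithmetic; min_increase_for_rate is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: converts a per-second rate threshold and a window
# length in seconds into the equivalent counter increase over that window.
min_increase_for_rate() {
  awk -v rate="$1" -v window="$2" 'BEGIN { print rate * window }'
}

# 0.05 requests/sec over a 5-minute (300 s) window
min_increase_for_rate 0.05 300   # prints: 15
```

The same calculation helps when choosing a threshold for services with very low traffic, where a handful of errors can already exceed the configured rate.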

Integration Points

  • Core Stack: Depends on ArgoCD for deployment orchestration
  • OTC Stack: Requires ingress-nginx controller and cert-manager for external access and TLS
  • Dex (SSO): Integrated for Grafana authentication with role-based access control
  • All Platform Services: Automatically collects metrics from Kubernetes components and platform services
  • Application Stacks: Provides monitoring for Coder, Forgejo, and other deployed services

Troubleshooting

VictoriaMetrics Pods Not Starting

Problem: VictoriaMetrics components remain in Pending or CrashLoopBackOff state

Solution:

  1. Check VictoriaMetrics resources:

    kubectl get vmsingle,vmagent,vmalertmanager -n observability
    kubectl describe vmsingle vmsingle -n observability
    
  2. Verify persistent volume claims:

    kubectl get pvc -n observability
    kubectl describe pvc vmstorage-vmsingle-0 -n observability
    
  3. Check operator logs:

    kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator
    

Grafana Not Accessible

Problem: Grafana web interface is not accessible at configured URL

Solution:

  1. Verify Grafana instance status:

    kubectl get grafana grafana -n observability
    kubectl describe grafana grafana -n observability
    
  2. Check Grafana pod logs:

    kubectl logs -n observability -l app=grafana
    
  3. Verify ingress configuration:

    kubectl get ingress -n observability
    kubectl describe ingress grafana-ingress -n observability
    
  4. Check TLS certificate status:

    kubectl get certificate -n observability
    kubectl describe certificate grafana-tls-secret -n observability
    

OAuth Authentication Failing

Problem: Cannot authenticate to Grafana via SSO

Solution:

  1. Verify Dex is running:

    kubectl get pods -n core -l app=dex
    kubectl logs -n core -l app=dex
    
  2. Check OAuth client secret:

    kubectl get secret dex-grafana-client -n observability
    kubectl describe secret dex-grafana-client -n observability
    
  3. Review Grafana OAuth configuration:

    kubectl get grafana grafana -n observability -o yaml | grep -A 20 auth.generic_oauth
    
  4. Check Grafana logs for OAuth errors:

    kubectl logs -n observability -l app=grafana | grep -i oauth
    

Metrics Not Appearing

Problem: Metrics not showing up in Grafana or VictoriaMetrics

Solution:

  1. Check VMAgent scraping status:

    kubectl get vmagent -n observability
    kubectl logs -n observability -l app.kubernetes.io/name=vmagent
    
  2. Verify that scrape resources (VMServiceScrape and VMPodScrape) are created:

    kubectl get vmservicescrape -n observability
    kubectl get vmpodscrape -n observability
    
  3. Check target endpoints:

    # Access VMAgent UI (port-forward if needed)
    kubectl port-forward -n observability svc/vmagent 8429:8429
    open http://localhost:8429/targets
    
  4. Verify VictoriaMetrics Single is accepting data:

    kubectl logs -n observability -l app.kubernetes.io/name=vmsingle
    

Alerts Not Sending

Problem: Alertmanager not sending email notifications

Solution:

  1. Verify Alertmanager configuration:

    kubectl get vmalertmanager -n observability
    kubectl describe vmalertmanager vmalertmanager -n observability
    
  2. Check email credentials secret:

    kubectl get secret email-user-credentials -n observability
    kubectl describe secret email-user-credentials -n observability
    
  3. Review Alertmanager logs:

    kubectl logs -n observability -l app.kubernetes.io/name=vmalertmanager
    
  4. Test alert firing manually:

    # Access Alertmanager UI
    kubectl port-forward -n observability svc/vmalertmanager 9093:9093
    open http://localhost:9093
    

High Storage Usage

Problem: VictoriaMetrics storage running out of space

Solution:

  1. Check current storage usage:

    kubectl exec -it -n observability vmsingle-0 -- df -h /storage
    
  2. Reduce retention period in values.yaml:

    vmsingle:
      spec:
        retentionPeriod: "15d"  # Reduce from 1 month
    
  3. Increase PVC size (this only works if the storage class allows volume expansion):

    kubectl patch pvc vmstorage-vmsingle-0 -n observability \
      -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
    
  4. Monitor storage metrics in Grafana for capacity planning

Additional Resources