website-and-documentation/content/en/docs/edp/deployment/infrastructure/stacks/observability.md at 41e33069423fa8f4007d2b6987a22a354410ee24

Martin McCaffery 41e3306942

feat(docs): Restructure entire documentation

2025-12-18 10:25:07 +01:00

title	linkTitle	weight	description
Observability	Observability	50	Comprehensive monitoring, metrics, and logging for Kubernetes infrastructure

Overview

The Observability stack provides monitoring, metrics collection, and logging for the Edge Developer Platform. It is built on VictoriaMetrics and Grafana and ships with pre-configured dashboards, alerting, and SSO integration.

The stack deploys VictoriaMetrics for metrics storage and querying, Grafana for visualization, VictoriaLogs for log aggregation, and VMAuth for authenticated access to monitoring endpoints.

Key Features

Metrics Collection: VictoriaMetrics-based Kubernetes monitoring with long-term storage
Visualization: Grafana with pre-built dashboards for ArgoCD, Ingress-Nginx, and infrastructure components
Log Aggregation: VictoriaLogs for centralized logging with Grafana integration
SSO Integration: OAuth authentication through Dex with role-based access control
Alerting: Alertmanager with email notifications for critical events
Secure Access: TLS-enabled ingress with authentication proxy (VMAuth)
Persistent Storage: Encrypted volumes with configurable retention policies

Repository

Code: Observability Stack Templates

Documentation:

Getting Started

Prerequisites

Kubernetes cluster with ArgoCD installed (provided by core stack)
Ingress controller configured (provided by otc stack)
cert-manager for TLS certificate management (provided by otc stack)
Dex SSO provider (provided by core stack)
Infrastructure deployed through Infra Deploy

Quick Start

The Observability stack is deployed as part of the EDP installation process:

Trigger Deploy Pipeline
- Go to Infra Deploy Pipeline
- Click on Run workflow
- Enter a name in "Select environment directory to deploy". The name must be DNS compatible because it becomes part of the hostnames: if you enter test-me, the domains will be vmauth.test-me.t09.de and grafana.test-me.t09.de.
- Execute workflow
ArgoCD Synchronization: ArgoCD automatically deploys:
- VictoriaMetrics Operator and components
- VictoriaMetrics Single (metrics storage)
- VMAuth (authentication proxy)
- Alertmanager (alerting)
- Grafana Operator
- Grafana instance with OAuth
- VictoriaLogs datasource
- Pre-configured dashboards
- Ingress configurations with TLS
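
The naming rule from the first step can be checked before the pipeline is triggered. The following is a minimal sketch, assuming that "DNS compatible" means a single RFC 1123 label (lowercase letters, digits, and hyphens, at most 63 characters) and that the base domain is t09.de as in the example above. `is_dns_label` and `stack_hosts` are hypothetical helper names, not part of the pipeline.

```shell
# Hypothetical helpers (not part of the EDP pipeline).

# Succeeds if $1 looks like a single RFC 1123 DNS label (assumed rule).
is_dns_label() {
  [ "${#1}" -le 63 ] || return 1
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
}

# Prints the hostnames derived from environment name $1.
# $2 is the base domain; the default t09.de is taken from the example above.
stack_hosts() {
  is_dns_label "$1" || { echo "not DNS compatible: $1" >&2; return 1; }
  printf 'vmauth.%s.%s\ngrafana.%s.%s\n' "$1" "${2:-t09.de}" "$1" "${2:-t09.de}"
}
```

For example, `stack_hosts test-me` prints vmauth.test-me.t09.de and grafana.test-me.t09.de, while `stack_hosts Test_Me` is rejected.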

Verification

Verify the Observability deployment:

# Check ArgoCD applications status
kubectl get application grafana-operator -n argocd
kubectl get application victoria-k8s-stack -n argocd

# Verify VictoriaMetrics components are running
kubectl get pods -n observability

# Check Grafana instance status
kubectl get grafana grafana -n observability

# Verify ingress configurations
kubectl get ingress -n observability
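
The application checks above can also be scripted. This is a sketch only, assuming the two ArgoCD applications report the usual `status.sync.status` and `status.health.status` fields; `app_status` and `apps_ready` are hypothetical helper names.

```shell
# Prints "<sync>/<health>" for an ArgoCD Application, e.g. "Synced/Healthy".
app_status() {
  kubectl get application "$1" -n argocd \
    -o jsonpath='{.status.sync.status}/{.status.health.status}'
}

# Succeeds only if every named application is Synced and Healthy.
apps_ready() {
  for app in "$@"; do
    state=$(app_status "$app") || return 1
    if [ "$state" != "Synced/Healthy" ]; then
      echo "$app is $state" >&2
      return 1
    fi
  done
}
```

A possible call is `apps_ready grafana-operator victoria-k8s-stack && echo "observability stack is ready"`.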

Access the monitoring interfaces:

Grafana: https://{DOMAIN_GRAFANA} (the host configured in the Grafana ingress)

Architecture

Component Architecture

The Observability stack consists of multiple integrated components:

VictoriaMetrics Components:

VictoriaMetrics Operator: Manages VictoriaMetrics custom resources
VictoriaMetrics Single: Standalone metrics storage with 20Gi storage and 1-month retention
VMAgent: Scrapes metrics from Kubernetes components (kubelet, CoreDNS, kube-apiserver, etcd)
VMAuth: Authentication proxy on port 8427 for secure metrics access
VMAlertmanager: Handles alert routing and notifications

Grafana Components:

Grafana Operator: Manages Grafana instances and dashboards as Kubernetes resources
Grafana Instance: Web application for metrics visualization with OAuth authentication
Pre-configured Dashboards: ArgoCD, Ingress-Nginx, VictoriaLogs monitoring

Logging:

VictoriaLogs: Log aggregation service integrated as Grafana datasource

Storage:

VictoriaMetrics Single: 20Gi persistent storage on csi-disk storage class
Grafana: 10Gi persistent storage on csi-disk storage class with KMS encryption
Retention: configurable, set to 1 month for metrics; VictoriaMetrics enforces a minimum of 24 hours

Networking:

Nginx ingress with TLS termination for Grafana and VMAuth
cert-manager integration for automatic certificate management
Internal ClusterIP services for component communication

Configuration

VictoriaMetrics Configuration

Key configuration in stacks/observability/victoria-k8s-stack/values.yaml:

Operator Settings:

victoria-metrics-operator:
  enabled: true
  operator:
    enable_converter_ownership: true
  admissionWebhooks:
    certManager:
      enabled: true
      issuer:
        name: main

Storage Configuration:

vmsingle:
  enabled: true
  spec:
    retentionPeriod: "1"
    storage:
      storageClassName: csi-disk
      resources:
        requests:
          storage: 20Gi
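
The retentionPeriod value "1" above has no unit suffix. In VictoriaMetrics a bare number is interpreted as months, while the suffixes h, d, w, and y select hours, days, weeks, and years. The sketch below converts such a value to hours so that it can be compared with the 24-hour minimum. It assumes that a month counts as 31 days and a year as 365 days, and it handles whole numbers only; `retention_hours` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a retentionPeriod value to hours.
# Assumes 1 month = 31 days and 1 year = 365 days; integers only.
retention_hours() {
  value=${1%[hdwy]}     # numeric part, with any unit suffix removed
  unit=${1#"$value"}    # the suffix, empty when the value means months
  case "$unit" in
    h) factor=1 ;;
    d) factor=24 ;;
    w) factor=168 ;;
    y) factor=8760 ;;
    *) factor=744 ;;    # no suffix: months (31 days * 24 hours)
  esac
  echo $((value * factor))
}
```

With these assumptions, `retention_hours 1` prints 744 and `retention_hours 15d` prints 360, both above the 24-hour minimum.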

VMAuth Configuration:

vmauth:
  enabled: true
  spec:
    port: "8427"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: "{{{ .Env.DOMAIN_O12Y }}}"
    tls:
      - secretName: vmauth-tls-secret
        hosts:
          - "{{{ .Env.DOMAIN_O12Y }}}"
    annotations:
      cert-manager.io/cluster-issuer: main

Monitoring Targets:

Kubelet (cadvisor, probes, resources metrics)
CoreDNS
etcd
kube-apiserver

Disabled Collectors (on managed clusters these control-plane components are usually not reachable for scraping, so their alerts would fire permanently):

kube-controller-manager
kube-scheduler
kube-proxy

Alertmanager Configuration

Email alerting configured in values.yaml:

alertmanager:
  spec:
    externalURL: "https://{{{ .Env.DOMAIN_O12Y }}}"
    configSecret: vmalertmanager-config
  config:
    route:
      routes:
        - matchers:
            - severity =~ "critical|major"
          receiver: mail
    receivers:
      - name: 'mail'
        email_configs:
          - to: 'alerts@example.com'
            from: 'monitoring@example.com'
            smarthost: 'mail.mms-support.de:465'
            auth_username:
              name: email-user-credentials
              key: username
            auth_password:
              name: email-user-credentials
              key: password
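
The route above sends an alert to the mail receiver when its severity label matches the regular expression critical|major. Alertmanager anchors regular-expression matchers, so the whole label value has to match. The following is a minimal sketch of that decision in shell; `routes_to_mail` is a hypothetical name, and what happens to non-matching alerts depends on the default route, which is not shown here.

```shell
# Hypothetical illustration of the matcher: severity =~ "critical|major".
# grep -x requires the whole line to match, mirroring the anchored regex.
routes_to_mail() {
  printf '%s\n' "$1" | grep -Eqx 'critical|major'
}
```

Under this sketch, `routes_to_mail critical` and `routes_to_mail major` succeed, while `routes_to_mail warning` and `routes_to_mail criticality` do not.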

Grafana Configuration

Grafana instance configuration in stacks/observability/grafana-operator/manifests/grafana.yaml:

OAuth/SSO Integration:

config:
  auth.generic_oauth:
    enabled: "true"
    disable_login_form: "true"
    client_id: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_ID}"
    client_secret: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET}"
    scopes: "openid email profile offline_access groups"
    auth_url: "https://dex.{DOMAIN}/auth"
    token_url: "https://dex.{DOMAIN}/token"
    api_url: "https://dex.{DOMAIN}/userinfo"
    role_attribute_path: "contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'"
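
The role_attribute_path expression above is JMESPath: if the groups claim contains the exact string DevFW, the user becomes Admin, otherwise Viewer. Grafana evaluates the expression itself; the sketch below only mirrors its logic in plain shell for illustration, and `grafana_role` is a hypothetical name.

```shell
# Hypothetical illustration of:
#   contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'
# Arguments are the user's group names, one per argument.
grafana_role() {
  for group in "$@"; do
    if [ "$group" = "DevFW" ]; then
      echo Admin
      return 0
    fi
  done
  echo Viewer
}
```

The comparison is case sensitive, so a group named devfw maps to Viewer in this sketch.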

Storage:

deployment:
  spec:
    template:
      spec:
        volumes:
          - name: grafana-data
            persistentVolumeClaim:
              claimName: grafana-pvc

persistentVolumeClaim:
  spec:
    storageClassName: csi-disk
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi

Ingress:

ingress:
  spec:
    ingressClassName: nginx
    rules:
      - host: "{{{ .Env.DOMAIN_GRAFANA }}}"
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: grafana-service
                  port:
                    number: 3000
    tls:
      - hosts:
          - "{{{ .Env.DOMAIN_GRAFANA }}}"
        secretName: grafana-tls-secret

ArgoCD Application Configuration

Grafana Operator Application (template/stacks/observability/grafana-operator.yaml):

Name: grafana-operator
Chart: grafana-operator v5.18.0 from ghcr.io/grafana/helm-charts
Automated sync with self-healing enabled
Namespace: observability

VictoriaMetrics Stack Application (template/stacks/observability/victoria-k8s-stack.yaml):

Name: victoria-k8s-stack
Chart: victoria-metrics-k8s-stack v0.48.1 from https://victoriametrics.github.io/helm-charts/
Automated self-healing enabled
Creates namespace automatically

Usage Examples

Accessing Grafana

Access Grafana through SSO:

Navigate to Grafana
```
open "https://${DOMAIN_GRAFANA}"
```
Authenticate via Dex
- Click "Sign in with OAuth"
- Authenticate through configured identity provider
- Users in DevFW group receive Admin role, others receive Viewer role

Querying Metrics

Query VictoriaMetrics directly:

# Instant query through the VMAuth endpoint: which scrape targets are up
curl -s -u username:password "https://${DOMAIN_O12Y}/api/v1/query" \
  --data-urlencode 'query=up' | jq

# Per-pod CPU usage (the raw metric is a cumulative counter, so take its rate)
curl -s -u username:password "https://${DOMAIN_O12Y}/api/v1/query" \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)' | jq

# Range query over a fixed time window
curl -s -u username:password "https://${DOMAIN_O12Y}/api/v1/query_range" \
  --data-urlencode 'query=container_memory_usage_bytes' \
  -d 'start=2024-01-01T00:00:00Z' \
  -d 'end=2024-01-01T23:59:59Z' \
  -d 'step=5m' | jq
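
Fixed timestamps like the ones in the range query go stale quickly. The sketch below computes a window relative to the current time and passes it as Unix timestamps, which the Prometheus-compatible query API accepts. `range_window` and `query_last_hours` are hypothetical helpers, and VM_URL, VM_USER, and VM_PASSWORD are assumed variables holding the VMAuth base URL and credentials used above.

```shell
# Hypothetical helper: print "<start> <end>" as Unix timestamps for a window
# covering the last $1 hours. $2 optionally fixes "now" (useful for testing).
range_window() {
  end=${2:-$(date +%s)}
  echo "$((end - $1 * 3600)) $end"
}

# Range query for expression $1 over the last $2 hours.
# VM_URL, VM_USER and VM_PASSWORD are assumed to be set by the caller.
query_last_hours() {
  # Re-pack the arguments as: expression, start, end.
  set -- "$1" $(range_window "$2")
  curl -s -u "$VM_USER:$VM_PASSWORD" "$VM_URL/api/v1/query_range" \
    --data-urlencode "query=$1" \
    -d "start=$2" -d "end=$3" -d 'step=5m'
}
```

A call such as `query_last_hours 'container_memory_usage_bytes' 6 | jq` would then return the last six hours at a 5-minute step.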

Creating Custom Dashboards

Create custom Grafana dashboards as Kubernetes resources:

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: custom-app-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "title": "Custom Application Metrics",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "targets": [
            {
              "expr": "rate(http_requests_total[5m])",
              "datasource": "VictoriaMetrics"
            }
          ]
        }
      ]
    }

Apply the dashboard:

kubectl apply -f custom-dashboard.yaml

Viewing Logs in Grafana

Access VictoriaLogs through Grafana:

Navigate to Grafana at https://${DOMAIN_GRAFANA}
Go to Explore
Select "VictoriaLogs" datasource

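Use LogsQL queries (VictoriaLogs uses LogsQL rather than Loki's LogQL; the available field names depend on how logs are shipped):

{namespace="default"}
{app="nginx"} error
{namespace="observability"} | unpack_json | filter level:error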

Setting Up Custom Alerts

Create custom alert rules using VMRule:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: custom-app-alerts
  namespace: observability
spec:
  groups:
    - name: custom-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} requests/sec"

Commit the rule and push it to the stacks instances repository so that ArgoCD syncs it to the cluster.
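
Note that the HighErrorRate expression compares an absolute rate of 5xx responses (requests per second, averaged over 5 minutes) with 0.05; it is not a ratio of failed to total requests. A threshold of 0.05 requests per second corresponds to more than 15 errors within 5 minutes, and the `for: 5m` clause requires the condition to hold for a further 5 minutes before the alert fires. The sketch below works through that arithmetic, assuming the counter grows by a known number of errors during the window; `would_fire` is a hypothetical name.

```shell
# Hypothetical illustration: approximate rate(counter[5m]) from the counter
# increase over a 300-second window and compare it with the 0.05 threshold.
# $1 is the number of 5xx responses counted during the window.
would_fire() {
  awk -v errors="$1" 'BEGIN { exit !(errors / 300 > 0.05) }'
}
```

Here `would_fire 30` succeeds (0.1 requests per second), while `would_fire 10` does not (about 0.033 requests per second).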

Integration Points

Core Stack: Depends on ArgoCD for deployment orchestration
OTC Stack: Requires ingress-nginx controller and cert-manager for external access and TLS
Dex (SSO): Integrated for Grafana authentication with role-based access control
All Platform Services: Automatically collects metrics from Kubernetes components and platform services
Application Stacks: Provides monitoring for Coder, Forgejo, and other deployed services

Troubleshooting

VictoriaMetrics Pods Not Starting

Problem: VictoriaMetrics components remain in Pending or CrashLoopBackOff state

Solution:

Check VictoriaMetrics resources:

kubectl get vmsingle,vmagent,vmalertmanager -n observability
kubectl describe vmsingle vmsingle -n observability

Verify persistent volume claims:

kubectl get pvc -n observability
kubectl describe pvc vmstorage-vmsingle-0 -n observability

Check operator logs:

kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator

Grafana Not Accessible

Problem: Grafana web interface is not accessible at configured URL

Solution:

Verify Grafana instance status:

kubectl get grafana grafana -n observability
kubectl describe grafana grafana -n observability

Check Grafana pod logs:

kubectl logs -n observability -l app=grafana

Verify ingress configuration:

kubectl get ingress -n observability
kubectl describe ingress grafana-ingress -n observability

Check TLS certificate status:

kubectl get certificate -n observability
kubectl describe certificate grafana-tls-secret -n observability

OAuth Authentication Failing

Problem: Cannot authenticate to Grafana via SSO

Solution:

Verify Dex is running:

kubectl get pods -n core -l app=dex
kubectl logs -n core -l app=dex

Check OAuth client secret:

kubectl get secret dex-grafana-client -n observability
kubectl describe secret dex-grafana-client -n observability

Review Grafana OAuth configuration:

kubectl get grafana grafana -n observability -o yaml | grep -A 20 auth.generic_oauth

Check Grafana logs for OAuth errors:

kubectl logs -n observability -l app=grafana | grep -i oauth

Metrics Not Appearing

Problem: Metrics not showing up in Grafana or VictoriaMetrics

Solution:

Check VMAgent scraping status:

kubectl get vmagent -n observability
kubectl logs -n observability -l app.kubernetes.io/name=vmagent

Verify service monitors are created:

kubectl get vmservicescrape -n observability
kubectl get vmpodscrape -n observability

Check target endpoints:

# Run the port-forward in a separate terminal (it blocks), then open the targets page
kubectl port-forward -n observability svc/vmagent 8429:8429
open http://localhost:8429/targets

Verify VictoriaMetrics Single is accepting data:

kubectl logs -n observability -l app.kubernetes.io/name=vmsingle

Alerts Not Sending

Problem: Alertmanager not sending email notifications

Solution:

Verify Alertmanager configuration:

kubectl get vmalertmanager -n observability
kubectl describe vmalertmanager vmalertmanager -n observability

Check email credentials secret:

kubectl get secret email-user-credentials -n observability
kubectl describe secret email-user-credentials -n observability

Review Alertmanager logs:

kubectl logs -n observability -l app.kubernetes.io/name=vmalertmanager

Open the Alertmanager UI to check whether alerts are firing:

# Run the port-forward in a separate terminal (it blocks), then open the UI
kubectl port-forward -n observability svc/vmalertmanager 9093:9093
open http://localhost:9093
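
To exercise the email route end to end, a synthetic alert can be posted to the Alertmanager v2 API through the port-forward above. The sketch below builds such a payload. The alert name ManualTestAlert and the helpers `test_alert_payload` and `send_test_alert` are hypothetical; the severity label defaults to critical so that it matches the mail route shown in the Alertmanager configuration.

```shell
# Hypothetical helper: JSON body for POST /api/v2/alerts.
# $1 is the severity label (defaults to critical).
test_alert_payload() {
  cat <<EOF
[{"labels": {"alertname": "ManualTestAlert", "severity": "${1:-critical}"},
  "annotations": {"summary": "Manual test alert, safe to ignore"}}]
EOF
}

# Sends the test alert to a port-forwarded Alertmanager on localhost:9093.
send_test_alert() {
  test_alert_payload "$1" | curl -s -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- http://localhost:9093/api/v2/alerts
}
```

After `send_test_alert critical`, the alert should appear in the UI, and the Alertmanager logs show whether the email was delivered.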

High Storage Usage

Problem: VictoriaMetrics storage running out of space

Solution:

Check current storage usage:

kubectl exec -it -n observability vmsingle-0 -- df -h /storage

Reduce retention period in values.yaml:

vmsingle:
  spec:
    retentionPeriod: "15d"  # Reduce from 1 month

Increase the PVC size (this only works if the storage class allows volume expansion):

kubectl patch pvc vmstorage-vmsingle-0 -n observability \
  -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'

Monitor storage metrics in Grafana for capacity planning
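
The storage check from the first step can be turned into a simple threshold test. This is a sketch under the assumption that `df -P` is available in the vmsingle container, so that the usage percentage is the fifth column of the second output line; `usage_percent` and `storage_is_high` are hypothetical helper names.

```shell
# Hypothetical helper: read `df -P <path>` output on stdin and print the
# usage percentage (column 5 of the second line) without the % sign.
usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds if the usage read from stdin is at or above threshold $1 (default 80).
storage_is_high() {
  [ "$(usage_percent)" -ge "${1:-80}" ]
}
```

A possible call is `kubectl exec -n observability vmsingle-0 -- df -P /storage | storage_is_high 80 && echo "storage above 80%"`.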
