---
title: "Observability"
linkTitle: "Observability"
weight: 50
description: >
  Comprehensive monitoring, metrics, and logging for Kubernetes infrastructure
---

## Overview

The Observability stack provides enterprise-grade monitoring, metrics collection, and logging capabilities for the Edge Developer Platform. Built on VictoriaMetrics and Grafana, it offers a complete observability solution with pre-configured dashboards, alerting, and SSO integration.

The stack deploys VictoriaMetrics for metrics storage and querying, Grafana for visualization, VictoriaLogs for log aggregation, and VMAuth for authenticated access to monitoring endpoints.

## Key Features

* **Metrics Collection**: VictoriaMetrics-based Kubernetes monitoring with long-term storage
* **Visualization**: Grafana with pre-built dashboards for ArgoCD, Ingress-Nginx, and infrastructure components
* **Log Aggregation**: VictoriaLogs for centralized logging with Grafana integration
* **SSO Integration**: OAuth authentication through Dex with role-based access control
* **Alerting**: Alertmanager with email notifications for critical events
* **Secure Access**: TLS-enabled ingress with authentication proxy (VMAuth)
* **Persistent Storage**: Encrypted volumes with configurable retention policies

## Repository

**Code**: [Observability Stack Templates](https://edp.buildth.ing/DevFW-CICD/stacks/src/branch/main/template/stacks/observability)

**Documentation**:

* [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
* [Grafana Documentation](https://grafana.com/docs/)
* [Grafana Operator Documentation](https://grafana.github.io/grafana-operator/)

## Getting Started

### Prerequisites

* Kubernetes cluster with ArgoCD installed (provided by `core` stack)
* Ingress controller configured (provided by `otc` stack)
* cert-manager for TLS certificate management (provided by `otc` stack)
* Dex SSO provider (provided by `core` stack)
* Infrastructure deployed through [Infra
Deploy](https://edp.buildth.ing/DevFW/infra-deploy)

### Quick Start

The Observability stack is deployed as part of the EDP installation process:

1. **Trigger Deploy Pipeline**
   - Go to [Infra Deploy Pipeline](https://edp.buildth.ing/DevFW/infra-deploy/actions?workflow=deploy.yaml)
   - Click on Run workflow
   - Enter a name in "Select environment directory to deploy". The name must be DNS-compatible. For example, if you enter `test-me`, the domains will be `vmauth.test-me.t09.de` and `grafana.test-me.t09.de`.
   - Execute workflow

2. **ArgoCD Synchronization**

   ArgoCD automatically deploys:
   - VictoriaMetrics Operator and components
   - VictoriaMetrics Single (metrics storage)
   - VMAuth (authentication proxy)
   - Alertmanager (alerting)
   - Grafana Operator
   - Grafana instance with OAuth
   - VictoriaLogs datasource
   - Pre-configured dashboards
   - Ingress configurations with TLS

### Verification

Verify the Observability deployment:

```bash
# Check ArgoCD applications status
kubectl get application grafana-operator -n argocd
kubectl get application victoria-k8s-stack -n argocd

# Verify VictoriaMetrics components are running
kubectl get pods -n observability

# Check Grafana instance status
kubectl get grafana grafana -n observability

# Verify ingress configurations
kubectl get ingress -n observability
```

Access the monitoring interfaces:

* Grafana: `https://grafana.{DOMAIN_O12Y}`

## Architecture

### Component Architecture

The Observability stack consists of multiple integrated components:

**VictoriaMetrics Components**:

- **VictoriaMetrics Operator**: Manages VictoriaMetrics custom resources
- **VictoriaMetrics Single**: Standalone metrics storage with 20Gi storage and 1-month retention
- **VMAgent**: Scrapes metrics from Kubernetes components (kubelet, CoreDNS, kube-apiserver, etcd)
- **VMAuth**: Authentication proxy on port 8427 for secure metrics access
- **VMAlertmanager**: Handles alert routing and notifications

**Grafana Components**:

- **Grafana Operator**: Manages Grafana instances and dashboards
as Kubernetes resources
- **Grafana Instance**: Web application for metrics visualization with OAuth authentication
- **Pre-configured Dashboards**: ArgoCD, Ingress-Nginx, VictoriaLogs monitoring

**Logging**:

- **VictoriaLogs**: Log aggregation service integrated as Grafana datasource

**Storage**:

- VictoriaMetrics Single: 20Gi persistent storage on `csi-disk` storage class
- Grafana: 10Gi persistent storage on `csi-disk` storage class with KMS encryption
- Configurable retention: 1 month for metrics, minimum 24 hours enforced

**Networking**:

- Nginx ingress with TLS termination for Grafana and VMAuth
- cert-manager integration for automatic certificate management
- Internal ClusterIP services for component communication

## Configuration

### VictoriaMetrics Configuration

Key configuration in `stacks/observability/victoria-k8s-stack/values.yaml`:

**Operator Settings**:

```yaml
victoria-metrics-operator:
  enabled: true
  operator:
    enable_converter_ownership: true
  admissionWebhooks:
    certManager:
      enabled: true
      issuer:
        name: main
```

**Storage Configuration**:

```yaml
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "1"
    storage:
      storageClassName: csi-disk
      resources:
        requests:
          storage: 20Gi
```

**VMAuth Configuration**:

```yaml
vmauth:
  enabled: true
  spec:
    port: "8427"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: "{{{ .Env.DOMAIN_O12Y }}}"
    tls:
      - secretName: vmauth-tls-secret
        hosts:
          - "{{{ .Env.DOMAIN_O12Y }}}"
    annotations:
      cert-manager.io/cluster-issuer: main
```

**Monitoring Targets**:

- Kubelet (cadvisor, probes, resources metrics)
- CoreDNS
- etcd
- kube-apiserver

**Disabled Collectors** (to avoid alerts on managed clusters):

- kube-controller-manager
- kube-scheduler
- kube-proxy

### Alertmanager Configuration

Email alerting is configured in `values.yaml`:

```yaml
alertmanager:
  spec:
    externalURL: "https://{{{ .Env.DOMAIN_O12Y }}}"
    configSecret: vmalertmanager-config
  config:
    route:
      routes:
        - matchers:
            - severity =~ "critical|major"
          receiver: mail
    receivers:
      - name: 'mail'
        email_configs:
          - to: 'alerts@example.com'
            from: 'monitoring@example.com'
            smarthost: 'mail.mms-support.de:465'
            auth_username:
              name: email-user-credentials
              key: username
            auth_password:
              name: email-user-credentials
              key: password
```

### Grafana Configuration

Grafana instance configuration in `stacks/observability/grafana-operator/manifests/grafana.yaml`:

**OAuth/SSO Integration**:

```yaml
config:
  auth.generic_oauth:
    enabled: "true"
    disable_login_form: "true"
    client_id: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_ID}"
    client_secret: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET}"
    scopes: "openid email profile offline_access groups"
    auth_url: "https://dex.{DOMAIN}/auth"
    token_url: "https://dex.{DOMAIN}/token"
    api_url: "https://dex.{DOMAIN}/userinfo"
    role_attribute_path: "contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'"
```

**Storage**:

```yaml
deployment:
  spec:
    template:
      spec:
        volumes:
          - name: grafana-data
            persistentVolumeClaim:
              claimName: grafana-pvc
persistentVolumeClaim:
  spec:
    storageClassName: csi-disk
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
```

**Ingress**:

```yaml
ingress:
  spec:
    ingressClassName: nginx
    rules:
      - host: "{{{ .Env.DOMAIN_GRAFANA }}}"
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: grafana-service
                  port:
                    number: 3000
    tls:
      - hosts:
          - "{{{ .Env.DOMAIN_GRAFANA }}}"
        secretName: grafana-tls-secret
```

### ArgoCD Application Configuration

**Grafana Operator Application** (`template/stacks/observability/grafana-operator.yaml`):

- Name: `grafana-operator`
- Chart: `grafana-operator` v5.18.0 from `ghcr.io/grafana/helm-charts`
- Automated sync with self-healing enabled
- Namespace: `observability`

**VictoriaMetrics Stack Application** (`template/stacks/observability/victoria-k8s-stack.yaml`):

- Name: `victoria-k8s-stack`
- Chart: `victoria-metrics-k8s-stack` v0.48.1 from `https://victoriametrics.github.io/helm-charts/`
- Automated self-healing enabled
- Creates namespace automatically

## Usage Examples

### Accessing Grafana

Access Grafana through SSO:

1. **Navigate to Grafana**

   ```bash
   open https://grafana.${DOMAIN_GRAFANA}
   ```

2. **Authenticate via Dex**
   - Click "Sign in with OAuth"
   - Authenticate through the configured identity provider
   - Users in the `DevFW` group receive the Admin role; all others receive the Viewer role

### Querying Metrics

Query VictoriaMetrics directly:

```bash
# Access VMAuth endpoint
curl -u username:password "https://vmauth.${DOMAIN_O12Y}/api/v1/query" \
  -d 'query=up' | jq

# Query pod CPU usage
curl -u username:password "https://vmauth.${DOMAIN_O12Y}/api/v1/query" \
  -d 'query=container_cpu_usage_seconds_total' | jq

# Query with time range
curl -u username:password "https://vmauth.${DOMAIN_O12Y}/api/v1/query_range" \
  -d 'query=container_memory_usage_bytes' \
  -d 'start=2024-01-01T00:00:00Z' \
  -d 'end=2024-01-01T23:59:59Z' \
  -d 'step=5m' | jq
```

### Creating Custom Dashboards

Create custom Grafana dashboards as Kubernetes resources:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: custom-app-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Custom Application Metrics",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              {
                "expr": "rate(http_requests_total[5m])",
                "datasource": "VictoriaMetrics"
              }
            ]
          }
        ]
      }
    }
```

Apply the dashboard:

```bash
kubectl apply -f custom-dashboard.yaml
```

### Viewing Logs in Grafana

Access VictoriaLogs through Grafana:

1. Navigate to Grafana `https://grafana.${DOMAIN_GRAFANA}`
2. Go to Explore
3. Select "VictoriaLogs" datasource
4. Use LogsQL queries (the stream field names `namespace` and `app` depend on how logs are ingested):

   ```
   {namespace="default"}
   {app="nginx"} "error"
   {namespace="observability"} | unpack_json | filter level:error
   ```

### Setting Up Custom Alerts

Create custom alert rules using VMRule:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: custom-app-alerts
  namespace: observability
spec:
  groups:
    - name: custom-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} requests/sec"
```

Push the alert rule to the [stacks instances](https://edp.buildth.ing/DevFW-CICD/stacks-instances/src/branch/main/otc/observability.t09.de/stacks/observability/victoria-k8s-stack/manifests) repository.

## Integration Points

* **Core Stack**: Depends on ArgoCD for deployment orchestration
* **OTC Stack**: Requires ingress-nginx controller and cert-manager for external access and TLS
* **Dex (SSO)**: Integrated for Grafana authentication with role-based access control
* **All Platform Services**: Automatically collects metrics from Kubernetes components and platform services
* **Application Stacks**: Provides monitoring for Coder, Forgejo, and other deployed services

## Troubleshooting

### VictoriaMetrics Pods Not Starting

**Problem**: VictoriaMetrics components remain in `Pending` or `CrashLoopBackOff` state

**Solution**:

1. Check VictoriaMetrics resources:

   ```bash
   kubectl get vmsingle,vmagent,vmalertmanager -n observability
   kubectl describe vmsingle vmsingle -n observability
   ```

2. Verify persistent volume claims:

   ```bash
   kubectl get pvc -n observability
   kubectl describe pvc vmstorage-vmsingle-0 -n observability
   ```

3. Check operator logs:

   ```bash
   kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator
   ```

### Grafana Not Accessible

**Problem**: Grafana web interface is not accessible at configured URL

**Solution**:

1. Verify Grafana instance status:

   ```bash
   kubectl get grafana grafana -n observability
   kubectl describe grafana grafana -n observability
   ```

2. Check Grafana pod logs:

   ```bash
   kubectl logs -n observability -l app=grafana
   ```

3. Verify ingress configuration:

   ```bash
   kubectl get ingress -n observability
   kubectl describe ingress grafana-ingress -n observability
   ```

4. Check TLS certificate status:

   ```bash
   kubectl get certificate -n observability
   kubectl describe certificate grafana-tls-secret -n observability
   ```

### OAuth Authentication Failing

**Problem**: Cannot authenticate to Grafana via SSO

**Solution**:

1. Verify Dex is running:

   ```bash
   kubectl get pods -n core -l app=dex
   kubectl logs -n core -l app=dex
   ```

2. Check OAuth client secret:

   ```bash
   kubectl get secret dex-grafana-client -n observability
   kubectl describe secret dex-grafana-client -n observability
   ```

3. Review Grafana OAuth configuration:

   ```bash
   kubectl get grafana grafana -n observability -o yaml | grep -A 20 auth.generic_oauth
   ```

4. Check Grafana logs for OAuth errors:

   ```bash
   kubectl logs -n observability -l app=grafana | grep -i oauth
   ```

### Metrics Not Appearing

**Problem**: Metrics not showing up in Grafana or VictoriaMetrics

**Solution**:

1. Check VMAgent scraping status:

   ```bash
   kubectl get vmagent -n observability
   kubectl logs -n observability -l app.kubernetes.io/name=vmagent
   ```

2. Verify service monitors are created:

   ```bash
   kubectl get vmservicescrape -n observability
   kubectl get vmpodscrape -n observability
   ```

3. Check target endpoints:

   ```bash
   # Access VMAgent UI (port-forward if needed)
   kubectl port-forward -n observability svc/vmagent 8429:8429
   open http://localhost:8429/targets
   ```

4. Verify VictoriaMetrics Single is accepting data:

   ```bash
   kubectl logs -n observability -l app.kubernetes.io/name=vmsingle
   ```

### Alerts Not Sending

**Problem**: Alertmanager not sending email notifications

**Solution**:

1. Verify Alertmanager configuration:

   ```bash
   kubectl get vmalertmanager -n observability
   kubectl describe vmalertmanager vmalertmanager -n observability
   ```

2. Check email credentials secret:

   ```bash
   kubectl get secret email-user-credentials -n observability
   kubectl describe secret email-user-credentials -n observability
   ```

3. Review Alertmanager logs:

   ```bash
   kubectl logs -n observability -l app.kubernetes.io/name=vmalertmanager
   ```

4. Test alert firing manually:

   ```bash
   # Access Alertmanager UI
   kubectl port-forward -n observability svc/vmalertmanager 9093:9093
   open http://localhost:9093
   ```

### High Storage Usage

**Problem**: VictoriaMetrics storage running out of space

**Solution**:

1. Check current storage usage:

   ```bash
   kubectl exec -it -n observability vmsingle-0 -- df -h /storage
   ```

2. Reduce retention period in `values.yaml`:

   ```yaml
   vmsingle:
     spec:
       retentionPeriod: "15d"  # Reduce from 1 month
   ```

3. Increase PVC size:

   ```bash
   kubectl patch pvc vmstorage-vmsingle-0 -n observability \
     -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
   ```

4. Monitor storage metrics in Grafana for capacity planning

## Additional Resources

* [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
* [VictoriaMetrics Operator Documentation](https://docs.victoriametrics.com/operator/)
* [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
* [Grafana Operator Documentation](https://grafana.github.io/grafana-operator/docs/)
* [VictoriaLogs Documentation](https://docs.victoriametrics.com/victorialogs/)
* [Prometheus Querying Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/)
* [MetricsQL (PromQL-compatible query language for VictoriaMetrics)](https://docs.victoriametrics.com/metricsql/)
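
The Quick Start asks for a DNS-compatible environment name and derives the `vmauth` and `grafana` hostnames from it. The following is a minimal sketch of that naming rule, which may be useful for checking a name locally before triggering the pipeline. The function names `is_dns_label` and `observability_hosts` are hypothetical and not part of the EDP tooling. The sketch assumes that "DNS-compatible" means a single RFC 1123 label and that `t09.de` is the base domain, as in the `test-me` example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of the EDP tooling.

# Succeeds if the argument is a valid RFC 1123 DNS label:
# 1-63 characters, lowercase letters, digits and hyphens,
# starting and ending with a letter or digit.
# (Assumption: this is what "DNS-compatible" means for the pipeline.)
is_dns_label() {
  local LC_ALL=C   # keep [a-z] limited to ASCII lowercase
  local name="$1"
  [[ ${#name} -le 63 && "$name" =~ ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$ ]]
}

# Prints the VMAuth and Grafana hostnames for an environment name,
# following the vmauth.<env>.<base> / grafana.<env>.<base> pattern
# from the Quick Start example. The base domain defaults to t09.de.
observability_hosts() {
  local env_name="$1"
  local base_domain="${2:-t09.de}"
  if ! is_dns_label "$env_name"; then
    echo "error: '$env_name' is not a valid DNS label" >&2
    return 1
  fi
  printf '%s\n' \
    "vmauth.${env_name}.${base_domain}" \
    "grafana.${env_name}.${base_domain}"
}
```

For example, `observability_hosts test-me` prints `vmauth.test-me.t09.de` and `grafana.test-me.t09.de`, one per line, while a name such as `test_me` is rejected with a non-zero exit status.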