| title | linkTitle | weight | description |
|---|---|---|---|
| Observability | Observability | 50 | Comprehensive monitoring, metrics, and logging for Kubernetes infrastructure |
Overview
The Observability stack provides enterprise-grade monitoring, metrics collection, and logging capabilities for the Edge Developer Platform. Built on VictoriaMetrics and Grafana, it offers a complete observability solution with pre-configured dashboards, alerting, and SSO integration.
The stack deploys VictoriaMetrics for metrics storage and querying, Grafana for visualization, VictoriaLogs for log aggregation, and VMAuth for authenticated access to monitoring endpoints.
Key Features
- Metrics Collection: VictoriaMetrics-based Kubernetes monitoring with long-term storage
- Visualization: Grafana with pre-built dashboards for ArgoCD, Ingress-Nginx, and infrastructure components
- Log Aggregation: VictoriaLogs for centralized logging with Grafana integration
- SSO Integration: OAuth authentication through Dex with role-based access control
- Alerting: Alertmanager with email notifications for critical events
- Secure Access: TLS-enabled ingress with authentication proxy (VMAuth)
- Persistent Storage: Encrypted volumes with configurable retention policies
Repository
Code: Observability Stack Templates
Documentation:
Getting Started
Prerequisites
- Kubernetes cluster with ArgoCD installed (provided by core stack)
- Ingress controller configured (provided by otc stack)
- cert-manager for TLS certificate management (provided by otc stack)
- Dex SSO provider (provided by core stack)
- Infrastructure deployed through Infra Deploy
Quick Start
The Observability stack is deployed as part of the EDP installation process:
- Trigger Deploy Pipeline
  - Go to Infra Deploy Pipeline
  - Click on Run workflow
  - Enter a name in "Select environment directory to deploy". The name must be DNS-compatible. For example, if you enter test-me, the domains will be vmauth.test-me.t09.de and grafana.test-me.t09.de.
  - Execute workflow
- ArgoCD Synchronization
  ArgoCD automatically deploys:
  - VictoriaMetrics Operator and components
  - VictoriaMetrics Single (metrics storage)
  - VMAuth (authentication proxy)
  - Alertmanager (alerting)
  - Grafana Operator
  - Grafana instance with OAuth
  - VictoriaLogs datasource
  - Pre-configured dashboards
  - Ingress configurations with TLS
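The naming rule from the first step can be checked locally before triggering the pipeline. The following is a minimal sketch, not part of the deploy workflow itself: the helper names `is_dns_label` and `edp_domains` are hypothetical, and the `t09.de` base domain is taken from the example in this section and may differ in other installations.

```shell
#!/bin/sh
# Hypothetical helpers; they are not part of the EDP pipeline.

# Returns 0 if $1 is a valid DNS label: lowercase letters, digits and hyphens,
# starting and ending with a letter or digit, at most 63 characters long.
is_dns_label() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'
}

# Prints the two hostnames derived from an environment name.
# The base domain defaults to t09.de (assumption taken from the example).
edp_domains() {
  env_name=$1
  base_domain=${2:-t09.de}
  if ! is_dns_label "$env_name"; then
    echo "invalid environment name: $env_name" >&2
    return 1
  fi
  printf 'vmauth.%s.%s\n' "$env_name" "$base_domain"
  printf 'grafana.%s.%s\n' "$env_name" "$base_domain"
}

edp_domains test-me
```

Names containing uppercase letters or underscores, such as Test_Me, are rejected by this check.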
Verification
Verify the Observability deployment:
# Check ArgoCD applications status
kubectl get application grafana-operator -n argocd
kubectl get application victoria-k8s-stack -n argocd
# Verify VictoriaMetrics components are running
kubectl get pods -n observability
# Check Grafana instance status
kubectl get grafana grafana -n observability
# Verify ingress configurations
kubectl get ingress -n observability
Access the monitoring interfaces:
- Grafana: https://grafana.{DOMAIN_O12Y}
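The ArgoCD checks above can be wrapped in a small script that fails when an application is not yet synced and healthy. This is a sketch under the assumption that the two applications live in the argocd namespace, as in the commands above; the function names are hypothetical. It reads the standard `.status.sync.status` and `.status.health.status` fields of the ArgoCD Application resource.

```shell
#!/bin/sh
# Prints "<sync>/<health>" for an ArgoCD Application, e.g. "Synced/Healthy".
app_state() {
  kubectl get application "$1" -n argocd \
    -o jsonpath='{.status.sync.status}/{.status.health.status}'
}

# Returns 0 only when every listed application is Synced and Healthy.
check_apps() {
  rc=0
  for app in "$@"; do
    state=$(app_state "$app") || state="unknown"
    echo "$app: $state"
    [ "$state" = "Synced/Healthy" ] || rc=1
  done
  return $rc
}

# Usage (requires cluster access):
#   check_apps grafana-operator victoria-k8s-stack
```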
Architecture
Component Architecture
The Observability stack consists of multiple integrated components:
VictoriaMetrics Components:
- VictoriaMetrics Operator: Manages VictoriaMetrics custom resources
- VictoriaMetrics Single: Standalone metrics storage with 20Gi storage and 1-month retention
- VMAgent: Scrapes metrics from Kubernetes components (kubelet, CoreDNS, kube-apiserver, etcd)
- VMAuth: Authentication proxy on port 8427 for secure metrics access
- VMAlertmanager: Handles alert routing and notifications
Grafana Components:
- Grafana Operator: Manages Grafana instances and dashboards as Kubernetes resources
- Grafana Instance: Web application for metrics visualization with OAuth authentication
- Pre-configured Dashboards: ArgoCD, Ingress-Nginx, VictoriaLogs monitoring
Logging:
- VictoriaLogs: Log aggregation service integrated as Grafana datasource
Storage:
- VictoriaMetrics Single: 20Gi persistent storage on csi-disk storage class
- Grafana: 10Gi persistent storage on csi-disk storage class with KMS encryption
- Configurable retention: 1 month for metrics, minimum 24 hours enforced
Networking:
- Nginx ingress with TLS termination for Grafana and VMAuth
- cert-manager integration for automatic certificate management
- Internal ClusterIP services for component communication
Configuration
VictoriaMetrics Configuration
Key configuration in stacks/observability/victoria-k8s-stack/values.yaml:
Operator Settings:
victoria-metrics-operator:
  enabled: true
  operator:
    enable_converter_ownership: true
  admissionWebhooks:
    certManager:
      enabled: true
      issuer:
        name: main
Storage Configuration:
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "1"
    storage:
      storageClassName: csi-disk
      resources:
        requests:
          storage: 20Gi
VMAuth Configuration:
vmauth:
  enabled: true
  spec:
    port: "8427"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: "{{{ .Env.DOMAIN_O12Y }}}"
    tls:
      - secretName: vmauth-tls-secret
        hosts:
          - "{{{ .Env.DOMAIN_O12Y }}}"
    annotations:
      cert-manager.io/cluster-issuer: main
Monitoring Targets:
- Kubelet (cadvisor, probes, resources metrics)
- CoreDNS
- etcd
- kube-apiserver
Disabled Collectors (to avoid alerts on managed clusters):
- kube-controller-manager
- kube-scheduler
- kube-proxy
Alertmanager Configuration
Email alerting configured in values.yaml:
alertmanager:
  spec:
    externalURL: "https://{{{ .Env.DOMAIN_O12Y }}}"
    configSecret: vmalertmanager-config
  config:
    route:
      routes:
        - matchers:
            - severity =~ "critical|major"
          receiver: mail
    receivers:
      - name: 'mail'
        email_configs:
          - to: 'alerts@example.com'
            from: 'monitoring@example.com'
            smarthost: 'mail.mms-support.de:465'
            auth_username:
              name: email-user-credentials
              key: username
            auth_password:
              name: email-user-credentials
              key: password
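The route above sends an alert to the mail receiver only when its severity label matches the regular expression. Alertmanager anchors regex matchers, so only the exact values critical and major match. A minimal sketch of that decision, with a hypothetical function name and a placeholder for the root receiver, which is not shown in the configuration above:

```shell
#!/bin/sh
# Mirrors the matcher severity =~ "critical|major" from the route above.
# "root-receiver" is a placeholder for whatever the root route defines.
receiver_for_severity() {
  case "$1" in
    critical|major) echo "mail" ;;
    *) echo "root-receiver" ;;
  esac
}

receiver_for_severity critical
receiver_for_severity warning
```

An alert labelled severity: warning therefore does not trigger an email with this configuration.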
Grafana Configuration
Grafana instance configuration in stacks/observability/grafana-operator/manifests/grafana.yaml:
OAuth/SSO Integration:
config:
  auth.generic_oauth:
    enabled: "true"
    disable_login_form: "true"
    client_id: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_ID}"
    client_secret: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET}"
    scopes: "openid email profile offline_access groups"
    auth_url: "https://dex.{DOMAIN}/auth"
    token_url: "https://dex.{DOMAIN}/token"
    api_url: "https://dex.{DOMAIN}/userinfo"
    role_attribute_path: "contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'"
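The role_attribute_path value is a JMESPath expression evaluated against the user's claims: if the groups list contains the exact, case-sensitive entry DevFW, the user becomes Admin, otherwise Viewer. The following sketch emulates that mapping in plain shell for illustration only; the function name is hypothetical and Grafana itself evaluates the JMESPath expression, not this code.

```shell
#!/bin/sh
# Emulates: contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'
# Arguments are the group names from the user's "groups" claim.
grafana_role() {
  for group in "$@"; do
    if [ "$group" = "DevFW" ]; then
      echo "Admin"
      return 0
    fi
  done
  echo "Viewer"
}

grafana_role developers DevFW
grafana_role developers
```

The first call prints Admin and the second prints Viewer. A group named devfw would not match, because the comparison is case-sensitive.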
Storage:
deployment:
  spec:
    template:
      spec:
        volumes:
          - name: grafana-data
            persistentVolumeClaim:
              claimName: grafana-pvc
persistentVolumeClaim:
  spec:
    storageClassName: csi-disk
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
Ingress:
ingress:
  spec:
    ingressClassName: nginx
    rules:
      - host: "{{{ .Env.DOMAIN_GRAFANA }}}"
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: grafana-service
                  port:
                    number: 3000
    tls:
      - hosts:
          - "{{{ .Env.DOMAIN_GRAFANA }}}"
        secretName: grafana-tls-secret
ArgoCD Application Configuration
Grafana Operator Application (template/stacks/observability/grafana-operator.yaml):
- Name: grafana-operator
- Chart: grafana-operator v5.18.0 from ghcr.io/grafana/helm-charts
- Automated sync with self-healing enabled
- Namespace: observability
VictoriaMetrics Stack Application (template/stacks/observability/victoria-k8s-stack.yaml):
- Name: victoria-k8s-stack
- Chart: victoria-metrics-k8s-stack v0.48.1 from https://victoriametrics.github.io/helm-charts/
- Automated self-healing enabled
- Creates namespace automatically
Usage Examples
Accessing Grafana
Access Grafana through SSO:
- Navigate to Grafana
  open https://grafana.${DOMAIN_GRAFANA}
- Authenticate via Dex
  - Click "Sign in with OAuth"
  - Authenticate through configured identity provider
  - Users in the DevFW group receive the Admin role, others receive the Viewer role
Querying Metrics
Query VictoriaMetrics directly:
# Access VMAuth endpoint
curl -u username:password https://vmauth.${DOMAIN_O12Y}/api/v1/query \
-d 'query=up' | jq
# Query pod CPU usage
curl -u username:password https://vmauth.${DOMAIN_O12Y}/api/v1/query \
-d 'query=container_cpu_usage_seconds_total' | jq
# Query with time range
curl -u username:password https://vmauth.${DOMAIN_O12Y}/api/v1/query_range \
-d 'query=container_memory_usage_bytes' \
-d 'start=2024-01-01T00:00:00Z' \
-d 'end=2024-01-01T23:59:59Z' \
-d 'step=5m' | jq
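Hard-coded dates as in the last example go stale quickly. The sketch below computes a range covering the last N hours as Unix timestamps, which the Prometheus-compatible query API accepts for start and end. The function names and the `VM_USER` and `VM_PASSWORD` variables are hypothetical; the endpoint is the one used in the examples above.

```shell
#!/bin/sh
# Prints the start timestamp for a window of $1 hours ending at epoch $2.
range_start() {
  echo $(( $2 - $1 * 3600 ))
}

# Runs a range query over the last N hours (default step: 5m).
# VM_USER and VM_PASSWORD are assumed to hold the VMAuth credentials.
vm_query_range() {
  query=$1
  hours=$2
  step=${3:-5m}
  end=$(date +%s)
  start=$(range_start "$hours" "$end")
  curl -s -u "${VM_USER}:${VM_PASSWORD}" \
    "https://vmauth.${DOMAIN_O12Y}/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    -d "start=${start}" -d "end=${end}" -d "step=${step}"
}

# Usage (requires network access and credentials):
#   vm_query_range 'container_memory_usage_bytes' 6 | jq
```

Using `--data-urlencode` for the query keeps expressions with braces, quotes or spaces intact.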
Creating Custom Dashboards
Create custom Grafana dashboards as Kubernetes resources:
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: custom-app-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Custom Application Metrics",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              {
                "expr": "rate(http_requests_total[5m])",
                "datasource": "VictoriaMetrics"
              }
            ]
          }
        ]
      }
    }
Apply the dashboard:
kubectl apply -f custom-dashboard.yaml
Viewing Logs in Grafana
Access VictoriaLogs through Grafana:
- Navigate to Grafana https://grafana.${DOMAIN_GRAFANA}
- Go to Explore
- Select "VictoriaLogs" datasource
- Use LogsQL queries (VictoriaLogs uses LogsQL, not Loki's LogQL):
  {namespace="default"}
  {app="nginx"} error
  {namespace="observability"} | unpack_json | filter level:error
Setting Up Custom Alerts
Create custom alert rules using VMRule:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: custom-app-alerts
  namespace: observability
spec:
  groups:
    - name: custom-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} requests/sec"
Push the alert rule to the stacks instances repository so that ArgoCD syncs it to the cluster.
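The threshold in the HighErrorRate expression is an absolute per-second rate, not a percentage. As a rough worked example with made-up counter values: if the 5xx counter of one series grows from 1200 to 1230 within the 5-minute window, the rate is approximately 30 / 300 = 0.1 requests per second, which exceeds 0.05. The sketch below reproduces that arithmetic; it simplifies how rate() handles sample boundaries and counter resets, and the function names are hypothetical.

```shell
#!/bin/sh
# Approximate per-second rate of a counter over a window:
# (last_value - first_value) / window_seconds
per_second_rate() {
  awk -v first="$1" -v last="$2" -v window="$3" \
    'BEGIN { printf "%.3f\n", (last - first) / window }'
}

# Returns 0 (true) when the rate is above the threshold.
exceeds() {
  awk -v rate="$1" -v threshold="$2" 'BEGIN { exit !(rate > threshold) }'
}

rate=$(per_second_rate 1200 1230 300)   # 30 errors in 5 minutes
echo "$rate"
if exceeds "$rate" 0.05; then
  echo "HighErrorRate condition is true"
fi
```

Because of for: 5m, the condition has to stay true for five minutes before the alert fires.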
Integration Points
- Core Stack: Depends on ArgoCD for deployment orchestration
- OTC Stack: Requires ingress-nginx controller and cert-manager for external access and TLS
- Dex (SSO): Integrated for Grafana authentication with role-based access control
- All Platform Services: Automatically collects metrics from Kubernetes components and platform services
- Application Stacks: Provides monitoring for Coder, Forgejo, and other deployed services
Troubleshooting
VictoriaMetrics Pods Not Starting
Problem: VictoriaMetrics components remain in Pending or CrashLoopBackOff state
Solution:
- Check VictoriaMetrics resources:
  kubectl get vmsingle,vmagent,vmalertmanager -n observability
  kubectl describe vmsingle vmsingle -n observability
- Verify persistent volume claims:
  kubectl get pvc -n observability
  kubectl describe pvc vmstorage-vmsingle-0 -n observability
- Check operator logs:
  kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator
Grafana Not Accessible
Problem: Grafana web interface is not accessible at configured URL
Solution:
- Verify Grafana instance status:
  kubectl get grafana grafana -n observability
  kubectl describe grafana grafana -n observability
- Check Grafana pod logs:
  kubectl logs -n observability -l app=grafana
- Verify ingress configuration:
  kubectl get ingress -n observability
  kubectl describe ingress grafana-ingress -n observability
- Check TLS certificate status:
  kubectl get certificate -n observability
  kubectl describe certificate grafana-tls-secret -n observability
OAuth Authentication Failing
Problem: Cannot authenticate to Grafana via SSO
Solution:
- Verify Dex is running:
  kubectl get pods -n core -l app=dex
  kubectl logs -n core -l app=dex
- Check OAuth client secret:
  kubectl get secret dex-grafana-client -n observability
  kubectl describe secret dex-grafana-client -n observability
- Review Grafana OAuth configuration:
  kubectl get grafana grafana -n observability -o yaml | grep -A 20 auth.generic_oauth
- Check Grafana logs for OAuth errors:
  kubectl logs -n observability -l app=grafana | grep -i oauth
Metrics Not Appearing
Problem: Metrics not showing up in Grafana or VictoriaMetrics
Solution:
- Check VMAgent scraping status:
  kubectl get vmagent -n observability
  kubectl logs -n observability -l app.kubernetes.io/name=vmagent
- Verify scrape configurations are created:
  kubectl get vmservicescrape -n observability
  kubectl get vmpodscrape -n observability
- Check target endpoints:
  # Access VMAgent UI (port-forward if needed)
  kubectl port-forward -n observability svc/vmagent 8429:8429
  open http://localhost:8429/targets
- Verify VictoriaMetrics Single is accepting data:
  kubectl logs -n observability -l app.kubernetes.io/name=vmsingle
Alerts Not Sending
Problem: Alertmanager not sending email notifications
Solution:
- Verify Alertmanager configuration:
  kubectl get vmalertmanager -n observability
  kubectl describe vmalertmanager vmalertmanager -n observability
- Check email credentials secret:
  kubectl get secret email-user-credentials -n observability
  kubectl describe secret email-user-credentials -n observability
- Review Alertmanager logs:
  kubectl logs -n observability -l app.kubernetes.io/name=vmalertmanager
- Test alert firing manually:
  # Access Alertmanager UI
  kubectl port-forward -n observability svc/vmalertmanager 9093:9093
  open http://localhost:9093
High Storage Usage
Problem: VictoriaMetrics storage running out of space
Solution:
- Check current storage usage:
  kubectl exec -it -n observability vmsingle-0 -- df -h /storage
- Reduce retention period in values.yaml:
  vmsingle:
    spec:
      retentionPeriod: "15d" # Reduce from 1 month
- Increase PVC size:
  kubectl patch pvc vmstorage-vmsingle-0 -n observability \
    -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
- Monitor storage metrics in Grafana for capacity planning
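The storage usage check can be turned into a simple threshold test by parsing POSIX-format df output. This is a sketch only: the function names and the 80 percent default are assumptions, and the pod name vmsingle-0 and the /storage path are taken from the command above.

```shell
#!/bin/sh
# Reads `df -P` output on stdin and prints the used percentage of the first
# listed filesystem, without the % sign.
parse_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Reads `df -P` output on stdin and returns 1 when usage is at or above the
# limit given as $1 (default 80).
check_usage() {
  limit=${1:-80}
  pct=$(parse_usage_pct)
  if [ "$pct" -ge "$limit" ]; then
    echo "WARNING: storage is ${pct}% full"
    return 1
  fi
  echo "OK: storage is ${pct}% full"
}

# Usage (requires cluster access):
#   kubectl exec -n observability vmsingle-0 -- df -P /storage | check_usage 80
```

The `-P` flag keeps each filesystem on a single line, which makes the fifth column reliable to parse.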