diff --git a/content/en/docs/components/orchestration/stacks/observability.md b/content/en/docs/components/orchestration/stacks/observability.md index 8b51478..2a7b3a4 100644 --- a/content/en/docs/components/orchestration/stacks/observability.md +++ b/content/en/docs/components/orchestration/stacks/observability.md @@ -2,126 +2,580 @@ title: "Observability" linkTitle: "Observability" weight: 50 -description: Observability +description: > + Comprehensive monitoring, metrics, and logging for Kubernetes infrastructure --- -{{% alert title="Draft" color="warning" %}} -**Editorial Status**: This page is currently being developed. - -* **Jira Ticket**: [TBD] -* **Assignee**: [Name or Team] -* **Status**: Draft -* **Last Updated**: YYYY-MM-DD -* **TODO**: - * [ ] Add detailed component description - * [ ] Include usage examples and code samples - * [ ] Add architecture diagrams - * [ ] Review and finalize content -{{% /alert %}} - ## Overview -[Detailed description of the component - what it is, what it does, and why it exists] +The Observability stack provides enterprise-grade monitoring, metrics collection, and logging capabilities for the Edge Developer Platform. Built on VictoriaMetrics and Grafana, it offers a complete observability solution with pre-configured dashboards, alerting, and SSO integration. + +The stack deploys VictoriaMetrics for metrics storage and querying, Grafana for visualization, VictoriaLogs for log aggregation, and VMAuth for authenticated access to monitoring endpoints. 
## Key Features -* [Feature 1] -* [Feature 2] -* [Feature 3] - -## Purpose in EDP - -[Explain the role this component plays in the Edge Developer Platform and how it contributes to the overall platform capabilities] +* **Metrics Collection**: VictoriaMetrics-based Kubernetes monitoring with long-term storage +* **Visualization**: Grafana with pre-built dashboards for ArgoCD, Ingress-Nginx, and infrastructure components +* **Log Aggregation**: VictoriaLogs for centralized logging with Grafana integration +* **SSO Integration**: OAuth authentication through Dex with role-based access control +* **Alerting**: Alertmanager with email notifications for critical events +* **Secure Access**: TLS-enabled ingress with authentication proxy (VMAuth) +* **Persistent Storage**: Encrypted volumes with configurable retention policies ## Repository -**Code**: [Link to source code repository] +**Code**: [Observability Stack Templates](https://edp.buildth.ing/DevFW-CICD/stacks/src/branch/main/template/stacks/observability) -**Documentation**: [Link to component-specific documentation] +**Documentation**: +* [VictoriaMetrics Documentation](https://docs.victoriametrics.com/) +* [Grafana Documentation](https://grafana.com/docs/) +* [Grafana Operator Documentation](https://grafana.github.io/grafana-operator/) ## Getting Started ### Prerequisites -* [Prerequisite 1] -* [Prerequisite 2] +* Kubernetes cluster with ArgoCD installed (provided by `core` stack) +* Ingress controller configured (provided by `otc` stack) +* cert-manager for TLS certificate management (provided by `otc` stack) +* Dex SSO provider (provided by `core` stack) +* Infrastructure deployed through [Infra Deploy](https://edp.buildth.ing/DevFW/infra-deploy) ### Quick Start -[Step-by-step guide to get started with this component] +The Observability stack is deployed as part of the EDP installation process: -1. [Step 1] -2. [Step 2] -3. [Step 3] +1. 
**Trigger Deploy Pipeline** + - Go to [Infra Deploy Pipeline](https://edp.buildth.ing/DevFW/infra-deploy/actions?workflow=deploy.yaml) + - Click on Run workflow + - Enter a name in "Select environment directory to deploy". The name must be DNS-compatible (for example, if you enter `test-me`, the domains will be `vmauth.test-me.t09.de` and `grafana.test-me.t09.de`). + - Execute the workflow + +2. **ArgoCD Synchronization** + ArgoCD automatically deploys: + - VictoriaMetrics Operator and components + - VictoriaMetrics Single (metrics storage) + - VMAuth (authentication proxy) + - Alertmanager (alerting) + - Grafana Operator + - Grafana instance with OAuth + - VictoriaLogs datasource + - Pre-configured dashboards + - Ingress configurations with TLS ### Verification -[How to verify the component is working correctly] - -## Usage Examples - -### [Use Case 1] - -[Example with code/commands showing common use case] +Verify the Observability deployment: ```bash -# Example commands +# Check ArgoCD applications status +kubectl get application grafana-operator -n argocd +kubectl get application victoria-k8s-stack -n argocd + +# Verify VictoriaMetrics components are running +kubectl get pods -n observability + +# Check Grafana instance status +kubectl get grafana grafana -n observability + +# Verify ingress configurations +kubectl get ingress -n observability ``` -### [Use Case 2] - -[Another common scenario] - -## Integration Points - -* **[Component A]**: [How it integrates] -* **[Component B]**: [How it integrates] -* **[Component C]**: [How it integrates] +Access the monitoring interfaces: +* Grafana: `https://grafana.{DOMAIN_O12Y}` ## Architecture -[Optional: Add architectural diagrams and descriptions] +### Component Architecture -### Component Architecture (C4) +The Observability stack consists of multiple integrated components: -[Add C4 Container or Component diagrams showing the internal structure] +**VictoriaMetrics Components**: +- **VictoriaMetrics Operator**: Manages VictoriaMetrics
custom resources +- **VictoriaMetrics Single**: Standalone metrics storage with 20Gi storage and 1-month retention +- **VMAgent**: Scrapes metrics from Kubernetes components (kubelet, CoreDNS, kube-apiserver, etcd) +- **VMAuth**: Authentication proxy on port 8427 for secure metrics access +- **VMAlertmanager**: Handles alert routing and notifications -### Sequence Diagrams +**Grafana Components**: +- **Grafana Operator**: Manages Grafana instances and dashboards as Kubernetes resources +- **Grafana Instance**: Web application for metrics visualization with OAuth authentication +- **Pre-configured Dashboards**: ArgoCD, Ingress-Nginx, VictoriaLogs monitoring -[Add sequence diagrams showing key interaction flows with other components] +**Logging**: +- **VictoriaLogs**: Log aggregation service integrated as Grafana datasource -### Deployment Architecture +**Storage**: +- VictoriaMetrics Single: 20Gi persistent storage on `csi-disk` storage class +- Grafana: 10Gi persistent storage on `csi-disk` storage class with KMS encryption +- Configurable retention: 1 month for metrics, minimum 24 hours enforced -[Add infrastructure and deployment diagrams showing how the component is deployed] +**Networking**: +- Nginx ingress with TLS termination for Grafana and VMAuth +- cert-manager integration for automatic certificate management +- Internal ClusterIP services for component communication ## Configuration -[Key configuration options and how to set them] +### VictoriaMetrics Configuration + +Key configuration in `stacks/observability/victoria-k8s-stack/values.yaml`: + +**Operator Settings**: +```yaml +victoria-metrics-operator: + enabled: true + operator: + enable_converter_ownership: true + admissionWebhooks: + certManager: + enabled: true + issuer: + name: main +``` + +**Storage Configuration**: +```yaml +vmsingle: + enabled: true + spec: + retentionPeriod: "1" + storage: + storageClassName: csi-disk + resources: + requests: + storage: 20Gi +``` + +**VMAuth Configuration**: 
+```yaml +vmauth: + enabled: true + spec: + port: "8427" + ingress: + enabled: true + ingressClassName: nginx + hosts: + - name: "{{{ .Env.DOMAIN_O12Y }}}" + tls: + - secretName: vmauth-tls-secret + hosts: + - "{{{ .Env.DOMAIN_O12Y }}}" + annotations: + cert-manager.io/cluster-issuer: main +``` + +**Monitoring Targets**: +- Kubelet (cadvisor, probes, resources metrics) +- CoreDNS +- etcd +- kube-apiserver + +**Disabled Collectors** (to avoid alerts on managed clusters): +- kube-controller-manager +- kube-scheduler +- kube-proxy + +### Alertmanager Configuration + +Email alerting configured in `values.yaml`: + +```yaml +alertmanager: + spec: + externalURL: "https://{{{ .Env.DOMAIN_O12Y }}}" + configSecret: vmalertmanager-config + config: + route: + routes: + - matchers: + - severity =~ "critical|major" + receiver: mail + receivers: + - name: 'mail' + email_configs: + - to: 'alerts@example.com' + from: 'monitoring@example.com' + smarthost: 'mail.mms-support.de:465' + auth_username: + name: email-user-credentials + key: username + auth_password: + name: email-user-credentials + key: password +``` + +### Grafana Configuration + +Grafana instance configuration in `stacks/observability/grafana-operator/manifests/grafana.yaml`: + +**OAuth/SSO Integration**: +```yaml +config: + auth.generic_oauth: + enabled: "true" + disable_login_form: "true" + client_id: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_ID}" + client_secret: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET}" + scopes: "openid email profile offline_access groups" + auth_url: "https://dex.{DOMAIN}/auth" + token_url: "https://dex.{DOMAIN}/token" + api_url: "https://dex.{DOMAIN}/userinfo" + role_attribute_path: "contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'" +``` + +**Storage**: +```yaml +deployment: + spec: + template: + spec: + volumes: + - name: grafana-data + persistentVolumeClaim: + claimName: grafana-pvc + +persistentVolumeClaim: + spec: + storageClassName: csi-disk + accessModes: + - ReadWriteOnce + resources: + 
requests: + storage: 10Gi +``` + +**Ingress**: +```yaml +ingress: + spec: + ingressClassName: nginx + rules: + - host: "{{{ .Env.DOMAIN_GRAFANA }}}" + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: grafana-service + port: + number: 3000 + tls: + - hosts: + - "{{{ .Env.DOMAIN_GRAFANA }}}" + secretName: grafana-tls-secret +``` + +### ArgoCD Application Configuration + +**Grafana Operator Application** (`template/stacks/observability/grafana-operator.yaml`): +- Name: `grafana-operator` +- Chart: `grafana-operator` v5.18.0 from `ghcr.io/grafana/helm-charts` +- Automated sync with self-healing enabled +- Namespace: `observability` + +**VictoriaMetrics Stack Application** (`template/stacks/observability/victoria-k8s-stack.yaml`): +- Name: `victoria-k8s-stack` +- Chart: `victoria-metrics-k8s-stack` v0.48.1 from `https://victoriametrics.github.io/helm-charts/` +- Automated self-healing enabled +- Creates namespace automatically + +## Usage Examples + +### Accessing Grafana + +Access Grafana through SSO: + +1. **Navigate to Grafana** + ```bash + open https://grafana.${DOMAIN_GRAFANA} + ``` + +2. 
**Authenticate via Dex** + - Click "Sign in with OAuth" + - Authenticate through configured identity provider + - Users in `DevFW` group receive Admin role, others receive Viewer role + +### Querying Metrics + +Query VictoriaMetrics directly: + +```bash +# Access VMAuth endpoint +curl -u username:password https://vmauth.${DOMAIN_O12Y}/api/v1/query \ + -d 'query=up' | jq + +# Query pod CPU usage +curl -u username:password https://vmauth.${DOMAIN_O12Y}/api/v1/query \ + -d 'query=container_cpu_usage_seconds_total' | jq + +# Query with time range +curl -u username:password https://vmauth.${DOMAIN_O12Y}/api/v1/query_range \ + -d 'query=container_memory_usage_bytes' \ + -d 'start=2024-01-01T00:00:00Z' \ + -d 'end=2024-01-01T23:59:59Z' \ + -d 'step=5m' | jq +``` + +### Creating Custom Dashboards + +Create custom Grafana dashboards as Kubernetes resources: + +```yaml +apiVersion: grafana.integreatly.org/v1beta1 +kind: GrafanaDashboard +metadata: + name: custom-app-dashboard + namespace: observability +spec: + instanceSelector: + matchLabels: + dashboards: "grafana" + json: | + { + "dashboard": { + "title": "Custom Application Metrics", + "panels": [ + { + "title": "Request Rate", + "targets": [ + { + "expr": "rate(http_requests_total[5m])", + "datasource": "VictoriaMetrics" + } + ] + } + ] + } + } +``` + +Apply the dashboard: + +```bash +kubectl apply -f custom-dashboard.yaml +``` + +### Viewing Logs in Grafana + +Access VictoriaLogs through Grafana: + +1. Navigate to Grafana `https://grafana.${DOMAIN_GRAFANA}` +2. Go to Explore +3. Select "VictoriaLogs" datasource +4. 
Use LogsQL queries (VictoriaLogs uses LogsQL, not Loki's LogQL): + ``` + {namespace="default"} + {app="nginx"} error + {namespace="observability"} | unpack_json | filter level:error + ``` + +### Setting Up Custom Alerts + +Create custom alert rules using VMRule: + +```yaml +apiVersion: operator.victoriametrics.com/v1beta1 +kind: VMRule +metadata: + name: custom-app-alerts + namespace: observability +spec: + groups: + - name: custom-app + interval: 30s + rules: + - alert: HighErrorRate + expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05 + for: 5m + labels: + severity: critical + annotations: + summary: "High error rate detected" + description: "Error rate is {{ $value }} requests/sec" +``` + +Push the alert rule to [stacks instances](https://edp.buildth.ing/DevFW-CICD/stacks-instances/src/branch/main/otc/observability.t09.de/stacks/observability/victoria-k8s-stack/manifests) + +## Integration Points + +* **Core Stack**: Depends on ArgoCD for deployment orchestration +* **OTC Stack**: Requires ingress-nginx controller and cert-manager for external access and TLS +* **Dex (SSO)**: Integrated for Grafana authentication with role-based access control +* **All Platform Services**: Automatically collects metrics from Kubernetes components and platform services +* **Application Stacks**: Provides monitoring for Coder, Forgejo, and other deployed services ## Troubleshooting -### [Common Issue 1] +### VictoriaMetrics Pods Not Starting -**Problem**: [Description] +**Problem**: VictoriaMetrics components remain in `Pending` or `CrashLoopBackOff` state -**Solution**: [How to fix] +**Solution**: +1. Check VictoriaMetrics resources: + ```bash + kubectl get vmsingle,vmagent,vmalertmanager -n observability + kubectl describe vmsingle vmsingle -n observability + ``` -### [Common Issue 2] +2. Verify persistent volume claims: + ```bash + kubectl get pvc -n observability + kubectl describe pvc vmstorage-vmsingle-0 -n observability + ``` -**Problem**: [Description] +3. 
Check operator logs: + ```bash + kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator + ``` -**Solution**: [How to fix] +### Grafana Not Accessible -## Status +**Problem**: Grafana web interface is not accessible at configured URL -**Maturity**: [Production / Beta / Experimental] +**Solution**: +1. Verify Grafana instance status: + ```bash + kubectl get grafana grafana -n observability + kubectl describe grafana grafana -n observability + ``` + +2. Check Grafana pod logs: + ```bash + kubectl logs -n observability -l app=grafana + ``` + +3. Verify ingress configuration: + ```bash + kubectl get ingress -n observability + kubectl describe ingress grafana-ingress -n observability + ``` + +4. Check TLS certificate status: + ```bash + kubectl get certificate -n observability + kubectl describe certificate grafana-tls-secret -n observability + ``` + +### OAuth Authentication Failing + +**Problem**: Cannot authenticate to Grafana via SSO + +**Solution**: +1. Verify Dex is running: + ```bash + kubectl get pods -n core -l app=dex + kubectl logs -n core -l app=dex + ``` + +2. Check OAuth client secret: + ```bash + kubectl get secret dex-grafana-client -n observability + kubectl describe secret dex-grafana-client -n observability + ``` + +3. Review Grafana OAuth configuration: + ```bash + kubectl get grafana grafana -n observability -o yaml | grep -A 20 auth.generic_oauth + ``` + +4. Check Grafana logs for OAuth errors: + ```bash + kubectl logs -n observability -l app=grafana | grep -i oauth + ``` + +### Metrics Not Appearing + +**Problem**: Metrics not showing up in Grafana or VictoriaMetrics + +**Solution**: +1. Check VMAgent scraping status: + ```bash + kubectl get vmagent -n observability + kubectl logs -n observability -l app.kubernetes.io/name=vmagent + ``` + +2. Verify service monitors are created: + ```bash + kubectl get vmservicescrape -n observability + kubectl get vmpodscrape -n observability + ``` + +3. 
Check target endpoints: + ```bash + # Access VMAgent UI (port-forward if needed) + kubectl port-forward -n observability svc/vmagent 8429:8429 + open http://localhost:8429/targets + ``` + +4. Verify VictoriaMetrics Single is accepting data: + ```bash + kubectl logs -n observability -l app.kubernetes.io/name=vmsingle + ``` + +### Alerts Not Sending + +**Problem**: Alertmanager not sending email notifications + +**Solution**: +1. Verify Alertmanager configuration: + ```bash + kubectl get vmalertmanager -n observability + kubectl describe vmalertmanager vmalertmanager -n observability + ``` + +2. Check email credentials secret: + ```bash + kubectl get secret email-user-credentials -n observability + kubectl describe secret email-user-credentials -n observability + ``` + +3. Review Alertmanager logs: + ```bash + kubectl logs -n observability -l app.kubernetes.io/name=vmalertmanager + ``` + +4. Test alert firing manually: + ```bash + # Access Alertmanager UI + kubectl port-forward -n observability svc/vmalertmanager 9093:9093 + open http://localhost:9093 + ``` + +### High Storage Usage + +**Problem**: VictoriaMetrics storage running out of space + +**Solution**: +1. Check current storage usage: + ```bash + kubectl exec -it -n observability vmsingle-0 -- df -h /storage + ``` + +2. Reduce retention period in `values.yaml`: + ```yaml + vmsingle: + spec: + retentionPeriod: "15d" # Reduce from 1 month + ``` + +3. Increase PVC size: + ```bash + kubectl patch pvc vmstorage-vmsingle-0 -n observability \ + -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}' + ``` + +4. 
Monitor storage metrics in Grafana for capacity planning ## Additional Resources -* [Link to external documentation] -* [Link to community resources] -* [Link to related components] - -## Documentation Notes - -[Instructions for team members filling in this documentation - remove this section once complete] +* [VictoriaMetrics Documentation](https://docs.victoriametrics.com/) +* [VictoriaMetrics Operator Documentation](https://docs.victoriametrics.com/operator/) +* [Grafana Documentation](https://grafana.com/docs/grafana/latest/) +* [Grafana Operator Documentation](https://grafana.github.io/grafana-operator/docs/) +* [VictoriaLogs Documentation](https://docs.victoriametrics.com/victorialogs/) +* [Prometheus Querying Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/) +* [MetricsQL (VictoriaMetrics Query Language)](https://docs.victoriametrics.com/metricsql/)