added observability stacks docs

Manuel Ganter 2025-12-16 10:56:33 +01:00
parent 5be5493015
commit eb1aaec0bc

@@ -2,126 +2,580 @@
title: "Observability"
linkTitle: "Observability"
weight: 50
description: Observability
description: >
Comprehensive monitoring, metrics, and logging for Kubernetes infrastructure
---
{{% alert title="Draft" color="warning" %}}
**Editorial Status**: This page is currently being developed.
* **Jira Ticket**: [TBD]
* **Assignee**: [Name or Team]
* **Status**: Draft
* **Last Updated**: YYYY-MM-DD
* **TODO**:
* [ ] Add detailed component description
* [ ] Include usage examples and code samples
* [ ] Add architecture diagrams
* [ ] Review and finalize content
{{% /alert %}}
## Overview
[Detailed description of the component - what it is, what it does, and why it exists]
The Observability stack provides monitoring, metrics collection, and logging for the Edge Developer Platform. Built on VictoriaMetrics and Grafana, it ships with pre-configured dashboards, alerting, and SSO integration.
The stack deploys VictoriaMetrics for metrics storage and querying, Grafana for visualization, VictoriaLogs for log aggregation, and VMAuth for authenticated access to monitoring endpoints.
## Key Features
* [Feature 1]
* [Feature 2]
* [Feature 3]
## Purpose in EDP
[Explain the role this component plays in the Edge Developer Platform and how it contributes to the overall platform capabilities]
* **Metrics Collection**: VictoriaMetrics-based Kubernetes monitoring with long-term storage
* **Visualization**: Grafana with pre-built dashboards for ArgoCD, Ingress-Nginx, and infrastructure components
* **Log Aggregation**: VictoriaLogs for centralized logging with Grafana integration
* **SSO Integration**: OAuth authentication through Dex with role-based access control
* **Alerting**: Alertmanager with email notifications for critical events
* **Secure Access**: TLS-enabled ingress with authentication proxy (VMAuth)
* **Persistent Storage**: Encrypted volumes with configurable retention policies
## Repository
**Code**: [Link to source code repository]
**Code**: [Observability Stack Templates](https://edp.buildth.ing/DevFW-CICD/stacks/src/branch/main/template/stacks/observability)
**Documentation**: [Link to component-specific documentation]
**Documentation**:
* [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
* [Grafana Documentation](https://grafana.com/docs/)
* [Grafana Operator Documentation](https://grafana.github.io/grafana-operator/)
## Getting Started
### Prerequisites
* [Prerequisite 1]
* [Prerequisite 2]
* Kubernetes cluster with ArgoCD installed (provided by `core` stack)
* Ingress controller configured (provided by `otc` stack)
* cert-manager for TLS certificate management (provided by `otc` stack)
* Dex SSO provider (provided by `core` stack)
* Infrastructure deployed through [Infra Deploy](https://edp.buildth.ing/DevFW/infra-deploy)
### Quick Start
[Step-by-step guide to get started with this component]
The Observability stack is deployed as part of the EDP installation process:
1. [Step 1]
2. [Step 2]
3. [Step 3]
1. **Trigger Deploy Pipeline**
   - Go to [Infra Deploy Pipeline](https://edp.buildth.ing/DevFW/infra-deploy/actions?workflow=deploy.yaml)
   - Click **Run workflow**
   - Enter a name in "Select environment directory to deploy". The name must be DNS compatible. For example, entering `test-me` yields the domains `vmauth.test-me.t09.de` and `grafana.test-me.t09.de`.
   - Execute the workflow
2. **ArgoCD Synchronization**

   ArgoCD automatically deploys:
   - VictoriaMetrics Operator and components
   - VictoriaMetrics Single (metrics storage)
   - VMAuth (authentication proxy)
   - Alertmanager (alerting)
   - Grafana Operator
   - Grafana instance with OAuth
   - VictoriaLogs datasource
   - Pre-configured dashboards
   - Ingress configurations with TLS
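
Because the environment name becomes part of the hostnames, it can help to check it locally before running the workflow. The following is a minimal sketch, assuming the name has to be a single RFC 1123 DNS label and that the resulting hosts follow the `vmauth.<name>.t09.de` and `grafana.<name>.t09.de` pattern from the example above. The function names `is_dns_label` and `env_domains` are hypothetical helpers, not part of the pipeline.

```bash
#!/usr/bin/env bash
# Sketch: validate an environment name as a DNS label and print the
# hostnames it is expected to produce (pattern taken from the example above).

is_dns_label() {
  local name="$1"
  # 1-63 characters, lowercase letters, digits or '-',
  # starting and ending with a letter or digit (RFC 1123 label)
  local re='^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'
  [[ "$name" =~ $re ]]
}

env_domains() {
  local name="$1"
  if ! is_dns_label "$name"; then
    echo "invalid environment name: $name" >&2
    return 1
  fi
  printf 'vmauth.%s.t09.de\n' "$name"
  printf 'grafana.%s.t09.de\n' "$name"
}
```

Calling `env_domains test-me` prints the two hostnames, while a name such as `Test_Me` is rejected with a non-zero exit status.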
### Verification
[How to verify the component is working correctly]
## Usage Examples
### [Use Case 1]
[Example with code/commands showing common use case]
Verify the Observability deployment:
```bash
# Example commands
# Check ArgoCD applications status
kubectl get application grafana-operator -n argocd
kubectl get application victoria-k8s-stack -n argocd
# Verify VictoriaMetrics components are running
kubectl get pods -n observability
# Check Grafana instance status
kubectl get grafana grafana -n observability
# Verify ingress configurations
kubectl get ingress -n observability
```
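
The application checks above can also be scripted. This is a minimal sketch, assuming the two ArgoCD applications are named as in the commands above and that a working `kubectl` context is available. It reads the standard ArgoCD `status.sync.status` and `status.health.status` fields; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: report whether an ArgoCD Application is Synced and Healthy.

app_ok() {
  local app="$1" state
  # Prints e.g. "Synced/Healthy"; returns 2 if kubectl itself fails
  state="$(kubectl get application "$app" -n argocd \
    -o jsonpath='{.status.sync.status}/{.status.health.status}')" || return 2
  echo "$app: $state"
  [[ "$state" == "Synced/Healthy" ]]
}

check_observability_apps() {
  local rc=0 app
  for app in grafana-operator victoria-k8s-stack; do
    app_ok "$app" || rc=1
  done
  return "$rc"
}
```

Running `check_observability_apps` prints one line per application and exits non-zero if either one is not yet synced and healthy.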
### [Use Case 2]
[Another common scenario]
## Integration Points
* **[Component A]**: [How it integrates]
* **[Component B]**: [How it integrates]
* **[Component C]**: [How it integrates]
Access the monitoring interfaces:
* Grafana: `https://{DOMAIN_GRAFANA}` (for example, `https://grafana.test-me.t09.de`)
## Architecture
[Optional: Add architectural diagrams and descriptions]
### Component Architecture
### Component Architecture (C4)
The Observability stack consists of multiple integrated components:
[Add C4 Container or Component diagrams showing the internal structure]
**VictoriaMetrics Components**:
- **VictoriaMetrics Operator**: Manages VictoriaMetrics custom resources
- **VictoriaMetrics Single**: Standalone metrics storage with 20Gi storage and 1-month retention
- **VMAgent**: Scrapes metrics from Kubernetes components (kubelet, CoreDNS, kube-apiserver, etcd)
- **VMAuth**: Authentication proxy on port 8427 for secure metrics access
- **VMAlertmanager**: Handles alert routing and notifications
### Sequence Diagrams
**Grafana Components**:
- **Grafana Operator**: Manages Grafana instances and dashboards as Kubernetes resources
- **Grafana Instance**: Web application for metrics visualization with OAuth authentication
- **Pre-configured Dashboards**: ArgoCD, Ingress-Nginx, VictoriaLogs monitoring
[Add sequence diagrams showing key interaction flows with other components]
**Logging**:
- **VictoriaLogs**: Log aggregation service integrated as Grafana datasource
### Deployment Architecture
**Storage**:
- VictoriaMetrics Single: 20Gi persistent storage on `csi-disk` storage class
- Grafana: 10Gi persistent storage on `csi-disk` storage class with KMS encryption
- Retention: 1 month for metrics by default; configurable, with an enforced minimum of 24 hours
[Add infrastructure and deployment diagrams showing how the component is deployed]
**Networking**:
- Nginx ingress with TLS termination for Grafana and VMAuth
- cert-manager integration for automatic certificate management
- Internal ClusterIP services for component communication
## Configuration
[Key configuration options and how to set them]
### VictoriaMetrics Configuration
Key configuration in `stacks/observability/victoria-k8s-stack/values.yaml`:
**Operator Settings**:
```yaml
victoria-metrics-operator:
  enabled: true
  operator:
    enable_converter_ownership: true
  admissionWebhooks:
    certManager:
      enabled: true
      issuer:
        name: main
```
**Storage Configuration**:
```yaml
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "1"
    storage:
      storageClassName: csi-disk
      resources:
        requests:
          storage: 20Gi
```
**VMAuth Configuration**:
```yaml
vmauth:
  enabled: true
  spec:
    port: "8427"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - name: "{{{ .Env.DOMAIN_O12Y }}}"
    tls:
      - secretName: vmauth-tls-secret
        hosts:
          - "{{{ .Env.DOMAIN_O12Y }}}"
    annotations:
      cert-manager.io/cluster-issuer: main
```
**Monitoring Targets**:
- Kubelet (cadvisor, probes, resources metrics)
- CoreDNS
- etcd
- kube-apiserver
**Disabled Collectors** (to avoid alerts on managed clusters):
- kube-controller-manager
- kube-scheduler
- kube-proxy
### Alertmanager Configuration
Email alerting configured in `values.yaml`:
```yaml
alertmanager:
  spec:
    externalURL: "https://{{{ .Env.DOMAIN_O12Y }}}"
    configSecret: vmalertmanager-config
  config:
    route:
      routes:
        - matchers:
            - severity =~ "critical|major"
          receiver: mail
    receivers:
      - name: 'mail'
        email_configs:
          - to: 'alerts@example.com'
            from: 'monitoring@example.com'
            smarthost: 'mail.mms-support.de:465'
            auth_username:
              name: email-user-credentials
              key: username
            auth_password:
              name: email-user-credentials
              key: password
```
### Grafana Configuration
Grafana instance configuration in `stacks/observability/grafana-operator/manifests/grafana.yaml`:
**OAuth/SSO Integration**:
```yaml
config:
  auth.generic_oauth:
    enabled: "true"
    disable_login_form: "true"
    client_id: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_ID}"
    client_secret: "$__env{GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET}"
    scopes: "openid email profile offline_access groups"
    auth_url: "https://dex.{DOMAIN}/auth"
    token_url: "https://dex.{DOMAIN}/token"
    api_url: "https://dex.{DOMAIN}/userinfo"
    role_attribute_path: "contains(groups[*], 'DevFW') && 'Admin' || 'Viewer'"
```
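
The `role_attribute_path` expression maps group membership to a Grafana role. For illustration only, the same decision can be sketched in shell: a user whose group list contains `DevFW` becomes `Admin`, everyone else becomes `Viewer`. Grafana evaluates the JMESPath expression itself; the function `grafana_role_for` below is a hypothetical stand-in to make the logic explicit.

```bash
#!/usr/bin/env bash
# Illustration only: mirrors the role_attribute_path decision.
# Arguments are the group names from the user's "groups" claim.

grafana_role_for() {
  local group
  for group in "$@"; do
    # contains(groups[*], 'DevFW') matches an exact list element
    if [[ "$group" == "DevFW" ]]; then
      echo "Admin"
      return 0
    fi
  done
  # Fallback branch of the expression
  echo "Viewer"
}
```

For example, `grafana_role_for users DevFW` yields the admin role, while a user with no matching group falls through to the viewer role.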
**Storage**:
```yaml
deployment:
  spec:
    template:
      spec:
        volumes:
          - name: grafana-data
            persistentVolumeClaim:
              claimName: grafana-pvc
persistentVolumeClaim:
  spec:
    storageClassName: csi-disk
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
```
**Ingress**:
```yaml
ingress:
  spec:
    ingressClassName: nginx
    rules:
      - host: "{{{ .Env.DOMAIN_GRAFANA }}}"
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: grafana-service
                  port:
                    number: 3000
    tls:
      - hosts:
          - "{{{ .Env.DOMAIN_GRAFANA }}}"
        secretName: grafana-tls-secret
```
### ArgoCD Application Configuration
**Grafana Operator Application** (`template/stacks/observability/grafana-operator.yaml`):
- Name: `grafana-operator`
- Chart: `grafana-operator` v5.18.0 from `ghcr.io/grafana/helm-charts`
- Automated sync with self-healing enabled
- Namespace: `observability`
**VictoriaMetrics Stack Application** (`template/stacks/observability/victoria-k8s-stack.yaml`):
- Name: `victoria-k8s-stack`
- Chart: `victoria-metrics-k8s-stack` v0.48.1 from `https://victoriametrics.github.io/helm-charts/`
- Automated self-healing enabled
- Creates namespace automatically
## Usage Examples
### Accessing Grafana
Access Grafana through SSO:
1. **Navigate to Grafana**
```bash
# DOMAIN_GRAFANA is the full ingress host, e.g. grafana.test-me.t09.de
open "https://${DOMAIN_GRAFANA}"
```
2. **Authenticate via Dex**
- Click "Sign in with OAuth"
- Authenticate through configured identity provider
- Users in `DevFW` group receive Admin role, others receive Viewer role
### Querying Metrics
Query VictoriaMetrics directly:
```bash
# Access VMAuth endpoint (DOMAIN_O12Y is the full VMAuth ingress host)
curl -u username:password "https://${DOMAIN_O12Y}/api/v1/query" \
  -d 'query=up' | jq
# Query pod CPU usage (per-second rate over the last 5 minutes)
curl -u username:password "https://${DOMAIN_O12Y}/api/v1/query" \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)' | jq
# Query with time range
curl -u username:password "https://${DOMAIN_O12Y}/api/v1/query_range" \
  -d 'query=container_memory_usage_bytes' \
  -d 'start=2024-01-01T00:00:00Z' \
  -d 'end=2024-01-01T23:59:59Z' \
  -d 'step=5m' | jq
```
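
Range queries usually target a recent window rather than fixed dates. The sketch below computes a window covering the last N hours as Unix timestamps, which the Prometheus-compatible `query_range` endpoint accepts for `start` and `end`. The helper name `last_hours_window` and the `VM_URL` placeholder are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: print "start=<epoch> end=<epoch>" for the last N hours.

last_hours_window() {
  local hours="$1" end start
  end="$(date -u +%s)"                 # now, seconds since the epoch
  start=$(( end - hours * 3600 ))      # N hours earlier
  echo "start=${start} end=${end}"
}

# Usage (VM_URL is a placeholder for the VMAuth base URL):
#   read -r start end <<<"$(last_hours_window 6)"
#   curl -u username:password "${VM_URL}/api/v1/query_range" \
#     -d 'query=up' -d "$start" -d "$end" -d 'step=5m' | jq
```

The two fields are emitted in `key=value` form so they can be passed directly as `-d` arguments to `curl`.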
### Creating Custom Dashboards
Create custom Grafana dashboards as Kubernetes resources:
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: custom-app-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  # The json field holds the dashboard model itself (no "dashboard" wrapper)
  json: |
    {
      "title": "Custom Application Metrics",
      "panels": [
        {
          "title": "Request Rate",
          "targets": [
            {
              "expr": "rate(http_requests_total[5m])",
              "datasource": "VictoriaMetrics"
            }
          ]
        }
      ]
    }
```
Apply the dashboard:
```bash
kubectl apply -f custom-dashboard.yaml
```
### Viewing Logs in Grafana
Access VictoriaLogs through Grafana:
1. Navigate to Grafana at `https://${DOMAIN_GRAFANA}`
2. Go to Explore
3. Select "VictoriaLogs" datasource
4. Use LogsQL queries (VictoriaLogs uses LogsQL, not Loki's LogQL):
```
{namespace="default"}
{app="nginx"} error
{namespace="observability"} | unpack_json | filter level:error
```
### Setting Up Custom Alerts
Create custom alert rules using VMRule:
```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: custom-app-alerts
  namespace: observability
spec:
  groups:
    - name: custom-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} requests/sec"
```
Commit and push the alert rule to the [stacks instances](https://edp.buildth.ing/DevFW-CICD/stacks-instances/src/branch/main/otc/observability.t09.de/stacks/observability/victoria-k8s-stack/manifests) repository.
## Integration Points
* **Core Stack**: Depends on ArgoCD for deployment orchestration
* **OTC Stack**: Requires ingress-nginx controller and cert-manager for external access and TLS
* **Dex (SSO)**: Integrated for Grafana authentication with role-based access control
* **All Platform Services**: Automatically collects metrics from Kubernetes components and platform services
* **Application Stacks**: Provides monitoring for Coder, Forgejo, and other deployed services
## Troubleshooting
### [Common Issue 1]
### VictoriaMetrics Pods Not Starting
**Problem**: [Description]
**Problem**: VictoriaMetrics components remain in `Pending` or `CrashLoopBackOff` state
**Solution**: [How to fix]
**Solution**:
1. Check VictoriaMetrics resources:
```bash
kubectl get vmsingle,vmagent,vmalertmanager -n observability
kubectl describe vmsingle vmsingle -n observability
```
### [Common Issue 2]
2. Verify persistent volume claims:
```bash
kubectl get pvc -n observability
kubectl describe pvc vmstorage-vmsingle-0 -n observability
```
**Problem**: [Description]
3. Check operator logs:
```bash
kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator
```
**Solution**: [How to fix]
### Grafana Not Accessible
## Status
**Problem**: Grafana web interface is not accessible at configured URL
**Maturity**: [Production / Beta / Experimental]
**Solution**:
1. Verify Grafana instance status:
```bash
kubectl get grafana grafana -n observability
kubectl describe grafana grafana -n observability
```
2. Check Grafana pod logs:
```bash
kubectl logs -n observability -l app=grafana
```
3. Verify ingress configuration:
```bash
kubectl get ingress -n observability
kubectl describe ingress grafana-ingress -n observability
```
4. Check TLS certificate status:
```bash
kubectl get certificate -n observability
kubectl describe certificate grafana-tls-secret -n observability
```
### OAuth Authentication Failing
**Problem**: Cannot authenticate to Grafana via SSO
**Solution**:
1. Verify Dex is running:
```bash
kubectl get pods -n core -l app=dex
kubectl logs -n core -l app=dex
```
2. Check OAuth client secret:
```bash
kubectl get secret dex-grafana-client -n observability
kubectl describe secret dex-grafana-client -n observability
```
3. Review Grafana OAuth configuration:
```bash
kubectl get grafana grafana -n observability -o yaml | grep -A 20 auth.generic_oauth
```
4. Check Grafana logs for OAuth errors:
```bash
kubectl logs -n observability -l app=grafana | grep -i oauth
```
### Metrics Not Appearing
**Problem**: Metrics not showing up in Grafana or VictoriaMetrics
**Solution**:
1. Check VMAgent scraping status:
```bash
kubectl get vmagent -n observability
kubectl logs -n observability -l app.kubernetes.io/name=vmagent
```
2. Verify that scrape configurations (`VMServiceScrape` and `VMPodScrape`) are created:
```bash
kubectl get vmservicescrape -n observability
kubectl get vmpodscrape -n observability
```
3. Check target endpoints:
```bash
# Access VMAgent UI (port-forward if needed)
kubectl port-forward -n observability svc/vmagent 8429:8429
open http://localhost:8429/targets
```
4. Verify VictoriaMetrics Single is accepting data:
```bash
kubectl logs -n observability -l app.kubernetes.io/name=vmsingle
```
### Alerts Not Sending
**Problem**: Alertmanager not sending email notifications
**Solution**:
1. Verify Alertmanager configuration:
```bash
kubectl get vmalertmanager -n observability
kubectl describe vmalertmanager vmalertmanager -n observability
```
2. Check email credentials secret:
```bash
kubectl get secret email-user-credentials -n observability
kubectl describe secret email-user-credentials -n observability
```
3. Review Alertmanager logs:
```bash
kubectl logs -n observability -l app.kubernetes.io/name=vmalertmanager
```
4. Inspect active alerts and silences in the Alertmanager UI:
```bash
# Access Alertmanager UI
kubectl port-forward -n observability svc/vmalertmanager 9093:9093
open http://localhost:9093
```
### High Storage Usage
**Problem**: VictoriaMetrics storage running out of space
**Solution**:
1. Check current storage usage:
```bash
kubectl exec -it -n observability vmsingle-0 -- df -h /storage
```
2. Reduce retention period in `values.yaml`:
```yaml
vmsingle:
  spec:
    retentionPeriod: "15d" # Reduce from 1 month
```
3. Increase PVC size (the storage class must allow volume expansion):
```bash
kubectl patch pvc vmstorage-vmsingle-0 -n observability \
-p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
```
4. Monitor storage metrics in Grafana for capacity planning
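
As a lightweight complement to dashboard-based capacity planning, the storage check from step 1 can be turned into a threshold test. This is a minimal sketch, assuming POSIX-format `df -P` output (one data line, usage in the fifth column); the function name `storage_over_threshold` is hypothetical, and the pod name and mount path are taken from the command in step 1.

```bash
#!/usr/bin/env bash
# Sketch: read `df -P <path>` output on stdin and succeed when usage
# is at or above the given percentage threshold.

storage_over_threshold() {
  local threshold="$1" pct
  # Line 2 of POSIX df output; column 5 is "Capacity", e.g. "83%"
  pct="$(awk 'NR==2 { gsub(/%/, "", $5); print $5 }')"
  [[ -n "$pct" ]] || return 2          # no data line found
  echo "usage: ${pct}%"
  (( pct >= threshold ))
}

# Usage:
#   kubectl exec -n observability vmsingle-0 -- df -P /storage \
#     | storage_over_threshold 80 && echo "consider resizing or reducing retention"
```

The function exits with status 0 when the threshold is reached, 1 when usage is below it, and 2 when no usable `df` output was provided.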
## Additional Resources
* [Link to external documentation]
* [Link to community resources]
* [Link to related components]
## Documentation Notes
[Instructions for team members filling in this documentation - remove this section once complete]
* [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
* [VictoriaMetrics Operator Documentation](https://docs.victoriametrics.com/operator/)
* [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
* [Grafana Operator Documentation](https://grafana.github.io/grafana-operator/docs/)
* [VictoriaLogs Documentation](https://docs.victoriametrics.com/victorialogs/)
* [Prometheus Querying Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/)
* [MetricsQL (VictoriaMetrics query language)](https://docs.victoriametrics.com/metricsql/)