added obs-client stack docs

This commit is contained in:
Manuel Ganter 2025-12-16 11:03:37 +01:00
parent eb1aaec0bc
commit babd8df7b5

---
title: "Observability Client"
linkTitle: "Observability Client"
weight: 60
description: >
  Core observability components for metrics collection, log aggregation, and monitoring
---
{{% alert title="Draft" color="warning" %}}
**Editorial Status**: This page is currently being developed.
* **Jira Ticket**: [TBD]
* **Assignee**: [Name or Team]
* **Status**: Draft
* **Last Updated**: YYYY-MM-DD
* **TODO**:
  * [ ] Add detailed component description
  * [ ] Include usage examples and code samples
  * [ ] Add architecture diagrams
  * [ ] Review and finalize content
{{% /alert %}}
## Overview
The Observability Client stack provides essential monitoring and observability infrastructure for Kubernetes environments. As part of the Edge Developer Platform, it deploys client-side components that collect, process, and forward metrics and logs to centralized observability systems.
The stack integrates three core components: Kubernetes Metrics Server for resource metrics, Vector for log collection and forwarding, and Victoria Metrics for comprehensive metrics monitoring and alerting.
## Key Features
* **Resource Metrics**: Real-time CPU and memory metrics via Kubernetes Metrics Server
* **Log Aggregation**: Unified log collection and forwarding with Vector
* **Metrics Monitoring**: Comprehensive metrics collection, storage, and alerting with Victoria Metrics
* **Prometheus Compatibility**: Full Prometheus protocol support for metrics scraping
* **Multi-Tenant Support**: Configurable tenant isolation for metrics and logs
* **Automated Alerting**: Pre-configured alert rules with Alertmanager integration
* **Grafana Integration**: Built-in dashboard provisioning and datasource configuration
## Repository
**Code**: [Observability Client Stack Templates](https://edp.buildth.ing/DevFW-CICD/stacks/src/branch/main/template/stacks/observability-client)
**Documentation**:
* [Kubernetes Metrics Server](https://github.com/kubernetes-sigs/metrics-server)
* [Vector Documentation](https://vector.dev/docs/)
* [Victoria Metrics Documentation](https://docs.victoriametrics.com/)
## Getting Started
### Prerequisites
* Kubernetes cluster with ArgoCD installed (provided by `core` stack)
* cert-manager for certificate management (provided by `otc` stack)
* Observability backend services for receiving metrics and logs
### Quick Start
The Observability Client stack is deployed as part of the EDP installation process:
1. **Trigger Deploy Pipeline**
- Go to [Infra Deploy Pipeline](https://edp.buildth.ing/DevFW/infra-deploy/actions?workflow=deploy.yaml)
- Click on Run workflow
- Enter a name in "Select environment directory to deploy". The name must be DNS-compatible.
- Execute workflow
2. **ArgoCD Synchronization**
ArgoCD automatically deploys:
- Metrics Server (Helm chart v3.12.2)
- Vector agent (Helm chart v0.43.0)
- Victoria Metrics k8s-stack (Helm chart v0.48.1)
- ServiceMonitor resources for Prometheus scraping
- Authentication secrets for remote write endpoints
### Verification
Verify the Observability Client deployment:
```bash
# Check ArgoCD application status
kubectl get application -n argocd | grep -E "metrics-server|vector|vm-client"
# Verify Metrics Server is running
kubectl get pods -n observability -l app.kubernetes.io/name=metrics-server
# Test metrics API
kubectl top nodes
kubectl top pods -A
# Verify Vector pods are running
kubectl get pods -n observability -l app.kubernetes.io/name=vector
# Check Victoria Metrics components
kubectl get pods -n observability -l app.kubernetes.io/name=victoria-metrics-k8s-stack
# Verify ServiceMonitor resources
kubectl get servicemonitor -n observability
```
## Architecture
### Component Architecture
The Observability Client stack consists of three integrated components:
**Metrics Server**:
- Collects resource metrics (CPU, memory) from kubelet
- Provides Metrics API for kubectl top and HPA
- Lightweight aggregator for cluster-wide resource usage
- Exposes ServiceMonitor for Prometheus scraping
**Vector Agent**:
- DaemonSet deployment for log collection across all nodes
- Processes and transforms Kubernetes logs
- Forwards logs to centralized Elasticsearch backend
- Injects cluster metadata and environment information
- Supports compression and bulk operations
**Victoria Metrics Stack**:
- VMAgent: Scrapes metrics from Kubernetes components and applications
- VMAlertmanager: Manages alert routing and notifications
- VMOperator: Manages VictoriaMetrics CRDs and lifecycle
- Integration with remote Victoria Metrics storage
- Supports multi-tenant metrics isolation
### Data Flow
```
Kubernetes Resources → Metrics Server → Metrics API
ServiceMonitor → VMAgent → Remote VictoriaMetrics
Application Logs → Vector Agent → Transform → Remote Elasticsearch
Prometheus Exporters → VMAgent → Remote VictoriaMetrics → VMAlertmanager
```
## Configuration
### Metrics Server Configuration
Configured in `stacks/observability-client/metrics-server/values.yaml`:
```yaml
metrics:
  enabled: true
serviceMonitor:
  enabled: true
```
**Key Settings**:
- Enables metrics collection endpoint
- Exposes ServiceMonitor for Prometheus-compatible scraping
- Deployed via Helm chart from `https://kubernetes-sigs.github.io/metrics-server/`
### Vector Configuration
Configured in `stacks/observability-client/vector/values.yaml`:
**Role**: Agent (DaemonSet deployment across nodes)
**Authentication**:
Credentials sourced from `simple-user-secret`:
- `VECTOR_USER`: Username for remote write authentication
- `VECTOR_PASSWORD`: Password for remote write authentication
**Data Sources**:
- `k8s`: Collects Kubernetes container logs
- `internal_metrics`: Gathers Vector internal metrics
**Log Processing**:
The `parser` transform:
- Parses JSON from log messages
- Injects cluster environment metadata
- Removes the original message field
**Output Sink**:
- Elasticsearch bulk API (v8)
- Basic authentication with environment variables
- Gzip compression enabled
- Custom headers: AccountID and ProjectID
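
The snippet below is a minimal sketch of how these settings might be expressed in the chart's `values.yaml` (`role`, `env`, and `customConfig` are standard keys of the Vector Helm chart). The Elasticsearch endpoint, the secret key names, the header values, and the VRL transform body are illustrative assumptions, not the exact configuration shipped with the stack.

```yaml
# Sketch only: values below are placeholders, not the stack's actual configuration.
role: Agent                     # DaemonSet deployment across nodes

env:                            # credentials sourced from simple-user-secret
  - name: VECTOR_USER
    valueFrom:
      secretKeyRef:
        name: simple-user-secret
        key: username           # assumed key name
  - name: VECTOR_PASSWORD
    valueFrom:
      secretKeyRef:
        name: simple-user-secret
        key: password           # assumed key name

customConfig:
  sources:
    k8s:
      type: kubernetes_logs     # container logs from all nodes
    internal_metrics:
      type: internal_metrics    # Vector's own metrics
  transforms:
    parser:
      type: remap
      inputs: [k8s]
      source: |
        # Parse JSON payloads, inject cluster metadata, drop the raw message
        parsed, err = parse_json(string!(.message))
        if err == null {
          . = merge!(., parsed)
          del(.message)
        }
        .cluster_environment = "example"   # illustrative metadata value
  sinks:
    elasticsearch_out:
      type: elasticsearch
      inputs: [parser]
      api_version: v8
      endpoints: ["https://elasticsearch.example.com"]   # illustrative endpoint
      compression: gzip
      auth:
        strategy: basic
        user: "${VECTOR_USER}"
        password: "${VECTOR_PASSWORD}"
      request:
        headers:
          AccountID: "example-account"     # illustrative header values
          ProjectID: "example-project"
```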
### Victoria Metrics Stack Configuration
Configured in `stacks/observability-client/vm-client-stack/values.yaml`:
**Operator Settings**:
- Enabled with admission webhooks
- Managed by cert-manager for ArgoCD compatibility
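
A short sketch of how these operator settings typically appear in the chart values follows; the key names reflect the upstream victoria-metrics-k8s-stack chart and may differ slightly from the stack's actual file.

```yaml
# Sketch: enable the operator with cert-manager-managed admission webhooks.
victoria-metrics-operator:
  enabled: true
  admissionWebhooks:
    enabled: true
    certManager:
      enabled: true   # let cert-manager issue the webhook certificates (ArgoCD-friendly)
```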
**VMAgent Configuration**:
- Basic authentication for remote write
- Credentials from `vm-remote-write-secret`
- Stream parsing enabled
- Drop original labels to reduce memory footprint
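
A minimal sketch of what this might look like in the values file is shown below; the remote write URL is a placeholder, and while the `username` key matches the `vm-remote-write-secret` referenced elsewhere on this page, the `password` key name is an assumption.

```yaml
# Sketch: VMAgent remote write with basic auth and memory-saving scrape flags.
vmagent:
  spec:
    remoteWrite:
      - url: "https://victoriametrics.example.com/api/v1/write"   # placeholder endpoint
        basicAuth:
          username:
            name: vm-remote-write-secret
            key: username
          password:
            name: vm-remote-write-secret
            key: password          # assumed key name
    extraArgs:
      promscrape.streamParse: "true"         # stream parsing enabled
      promscrape.dropOriginalLabels: "true"  # drop original labels to reduce memory
```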
**Monitoring Targets**:
- Node exporter for hardware metrics
- kube-state-metrics for Kubernetes object states
- Kubelet metrics (cadvisor)
- Kubernetes control plane components (API server, etcd, scheduler, controller manager)
- CoreDNS metrics
**Alertmanager Integration**:
- Slack notification templates
- Configurable routing rules
- TLS support for secure communication
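
Alertmanager routing follows the standard Alertmanager configuration format; the sketch below shows a plausible Slack receiver, with the webhook URL, channel, and routing labels as illustrative placeholders rather than the stack's actual values.

```yaml
# Sketch: route warning/critical alerts to a Slack channel.
alertmanager:
  config:
    route:
      receiver: "slack-notifications"
      group_by: ["alertgroup", "job"]
      routes:
        - receiver: "slack-notifications"
          matchers:
            - severity =~ "warning|critical"
    receivers:
      - name: "slack-notifications"
        slack_configs:
          - api_url: "https://hooks.slack.com/services/placeholder"   # placeholder webhook
            channel: "#alerts"
            send_resolved: true
```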
**Storage Options**:
- VMSingle: Single-node deployment
- VMCluster: Distributed deployment with replication
- Configurable retention period
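
For example, a single-node deployment with a custom retention window could be selected roughly as follows; the key names follow the upstream chart, and the retention value is illustrative.

```yaml
# Sketch: single-node storage with a 30-day retention period.
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "30d"   # illustrative retention window
vmcluster:
  enabled: false             # switch to the clustered topology when replication is needed
```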
## ArgoCD Application Configuration
**Metrics Server Application** (`template/stacks/observability-client/metrics-server.yaml`):
- Name: `metrics-server`
- Chart version: 3.12.2
- Automated sync with self-heal enabled
- Namespace: `observability`
**Vector Application** (`template/stacks/observability-client/vector.yaml`):
- Name: `vector`
- Chart version: 0.43.0
- Automated sync with self-heal enabled
- Namespace: `observability`
**Victoria Metrics Application** (`template/stacks/observability-client/vm-client-stack.yaml`):
- Name: `vm-client`
- Chart version: 0.48.1
- Automated sync with self-heal enabled
- Namespace: `observability`
- References manifests from instance repository
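
As an illustration, an ArgoCD Application for the Metrics Server matching the settings above might look roughly like this; the chart repository, version, and namespace come from the sections above, while the `project`, destination server, and sync options are assumptions.

```yaml
# Sketch: ArgoCD Application deploying the Metrics Server Helm chart.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metrics-server
  namespace: argocd
spec:
  project: default                    # assumed project
  source:
    repoURL: https://kubernetes-sigs.github.io/metrics-server/
    chart: metrics-server
    targetRevision: 3.12.2
  destination:
    server: https://kubernetes.default.svc
    namespace: observability
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      - CreateNamespace=true          # assumed option
```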
## Usage Examples
### Querying Resource Metrics
Access resource metrics collected by Metrics Server:
```bash
# View node resource usage
kubectl top nodes
# View pod resource usage across all namespaces
kubectl top pods -A
# View pod resource usage in specific namespace
kubectl top pods -n observability
# Sort pods by CPU usage
kubectl top pods -A --sort-by=cpu
# Sort pods by memory usage
kubectl top pods -A --sort-by=memory
```
### Using Metrics for Autoscaling
Create a Horizontal Pod Autoscaler based on resource metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
### Accessing Application Logs
Vector automatically collects logs from all containers. View logs in your centralized Elasticsearch/Kibana:
```bash
# Logs are automatically forwarded to Elasticsearch
# Access via Kibana dashboard or Elasticsearch API
# Example: Query logs via Elasticsearch API
curl -u $VECTOR_USER:$VECTOR_PASSWORD \
  -X GET "https://elasticsearch.example.com/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "match": {
        "kubernetes.namespace": "my-namespace"
      }
    }
  }'
```
### Querying Victoria Metrics
Query metrics collected by Victoria Metrics:
```bash
# Access Victoria Metrics query API
# Metrics are forwarded to remote Victoria Metrics instance
# Example PromQL queries:
# - Container CPU usage: container_cpu_usage_seconds_total
# - Pod memory usage: container_memory_usage_bytes
# - Node disk I/O: node_disk_io_time_seconds_total
# Query via Victoria Metrics API
curl -X POST https://victoriametrics.example.com/api/v1/query \
-d 'query=up' \
-d 'time=2025-12-16T00:00:00Z'
```
### Creating Custom ServiceMonitors
Expose application metrics for collection:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-metrics
  labels:
    app: myapp
spec:
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
  selector:
    app: myapp
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: observability
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```
## Integration Points
* **Core Stack**: Depends on ArgoCD for deployment orchestration
* **OTC Stack**: Requires cert-manager for certificate management
* **Observability Stack**: Forwards metrics and logs to centralized observability backend
* **All Application Stacks**: Collects metrics and logs from all platform applications
## Troubleshooting
### Metrics Server Not Responding
**Problem**: `kubectl top` commands fail or return no data
**Solution**:
1. Check Metrics Server pod status:
```bash
kubectl get pods -n observability -l app.kubernetes.io/name=metrics-server
kubectl logs -n observability -l app.kubernetes.io/name=metrics-server
```
2. Verify kubelet metrics endpoint:
```bash
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
```
3. Check ServiceMonitor configuration:
```bash
kubectl get servicemonitor -n observability -o yaml
```
### Vector Not Forwarding Logs
**Problem**: Logs are not appearing in Elasticsearch
**Solution**:
1. Check Vector agent status:
```bash
kubectl get pods -n observability -l app.kubernetes.io/name=vector
kubectl logs -n observability -l app.kubernetes.io/name=vector --tail=50
```
2. Verify authentication secret:
```bash
kubectl get secret simple-user-secret -n observability
kubectl get secret simple-user-secret -n observability -o jsonpath='{.data.username}' | base64 -d
```
3. Test Elasticsearch connectivity:
```bash
kubectl exec -it -n observability $(kubectl get pod -n observability -l app.kubernetes.io/name=vector -o jsonpath='{.items[0].metadata.name}') -- \
curl -u $VECTOR_USER:$VECTOR_PASSWORD https://elasticsearch.example.com/_cluster/health
```
4. Check Vector internal metrics:
```bash
kubectl port-forward -n observability svc/vector 9090:9090
curl http://localhost:9090/metrics
```
### Victoria Metrics Not Scraping
**Problem**: Metrics are not being collected or forwarded
**Solution**:
1. Check VMAgent status:
```bash
kubectl get pods -n observability -l app.kubernetes.io/name=vmagent
kubectl logs -n observability -l app.kubernetes.io/name=vmagent
```
2. Verify remote write secret:
```bash
kubectl get secret vm-remote-write-secret -n observability
kubectl get secret vm-remote-write-secret -n observability -o jsonpath='{.data.username}' | base64 -d
```
3. Check ServiceMonitor targets:
```bash
kubectl get servicemonitor -n observability
kubectl describe servicemonitor metrics-server -n observability
```
4. Verify operator is running:
```bash
kubectl get pods -n observability -l app.kubernetes.io/name=victoria-metrics-operator
kubectl logs -n observability -l app.kubernetes.io/name=victoria-metrics-operator
```
### High Memory Usage
**Problem**: Victoria Metrics or Vector consuming excessive memory
**Solution**:
1. For Victoria Metrics, verify `dropOriginalLabels` is enabled:
```bash
kubectl get vmagent -n observability -o yaml | grep dropOriginalLabels
```
2. Reduce scrape intervals for high-cardinality metrics:
```yaml
# Edit ServiceMonitor
spec:
  endpoints:
    - interval: 60s  # Increase from 30s
```
3. Filter unnecessary logs in Vector:
```yaml
# Add filter transform to Vector configuration
transforms:
  filter:
    type: filter
    condition: '.kubernetes.namespace != "kube-system"'
```
4. Check resource limits:
```bash
kubectl describe pod -n observability -l app.kubernetes.io/name=vmagent
kubectl describe pod -n observability -l app.kubernetes.io/name=vector
```
### Certificate Issues
**Problem**: TLS certificate errors in logs
**Solution**:
1. Verify cert-manager is running:
```bash
kubectl get pods -n cert-manager
```
2. Check certificate status:
```bash
kubectl get certificate -n observability
kubectl describe certificate -n observability
```
3. Review webhook configuration:
```bash
kubectl get validatingwebhookconfigurations | grep victoria-metrics
kubectl get mutatingwebhookconfigurations | grep victoria-metrics
```
4. Restart operator if needed:
```bash
kubectl rollout restart deployment victoria-metrics-operator -n observability
```
## Additional Resources
* [Kubernetes Metrics Server Documentation](https://github.com/kubernetes-sigs/metrics-server)
* [Vector Documentation](https://vector.dev/docs/)
* [Victoria Metrics Documentation](https://docs.victoriametrics.com/)
* [Victoria Metrics Operator](https://docs.victoriametrics.com/operator/)
* [Prometheus Operator API](https://prometheus-operator.dev/docs/operator/api/)
* [ArgoCD Documentation](https://argo-cd.readthedocs.io/)