From Zero to Working Alerts: A Complete Record of the Pitfalls in My Prometheus + Grafana Monitoring Setup
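
The first time I watched a server go down without anyone noticing, I realized that a monitoring system is not optional but mandatory. That incident interrupted the business for two hours, and the team had to troubleshoot through the night. That lesson is what made me decide to build a complete monitoring and alerting system. After many attempts and adjustments, I ended up with a stable monitoring solution built on Prometheus + Grafana. This post retraces the whole process from an empty directory to working alerts, including the pitfalls that kept me up at night and the fixes I eventually found.

1. Environment Preparation and Dockerized Deployment

I chose Docker Compose for deployment because it pins the version of every component and wires up the dependencies between them. My development environment is Ubuntu 20.04, but the setup below should work on any Linux distribution that runs Docker.

First, create the project directory structure:

```
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alert.rules
├── alertmanager/
│   └── config.yml
└── grafana/
    └── provisioning/
```

The key docker-compose.yml configuration is:

```yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus:v2.37.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules:/etc/prometheus/alert.rules
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-lifecycle   # enables the HTTP reload endpoint
  alertmanager:
    image: prom/alertmanager:v0.24.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml
    command:
      # The image defaults to /etc/alertmanager/alertmanager.yml,
      # so point it at the mounted file explicitly.
      - --config.file=/etc/alertmanager/config.yml
    depends_on:
      - prometheus
  node-exporter:
    image: prom/node-exporter:v1.3.1
    ports:
      - "9100:9100"
    pid: host
  grafana:
    image: grafana/grafana:9.0.2
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

The first pitfall I hit during deployment was permissions. Running docker-compose up directly left Prometheus unable to write its data. The Prometheus image runs as the unprivileged user nobody (UID 65534), so when the data directory is bind-mounted from the host (for example ./prometheus/data mounted at /prometheus, which the compose file above does not show), that directory has to be created on the host with matching ownership:

```shell
mkdir -p prometheus/data
sudo chown -R 65534:65534 prometheus/data
```

2. Prometheus Core Configuration in Detail

prometheus.yml is the central nervous system of the whole stack. My final version looks like this:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert.rules

# Without this block, firing alerts are never forwarded to Alertmanager.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  # Assumes a cadvisor service on the same network;
  # the compose file above does not define one.
  - job_name: docker
    metrics_path: /metrics
    static_configs:
      - targets: ["cadvisor:8080"]
```

Common configuration mistakes include:

- Indentation errors: YAML requires spaces, never tabs.
- Missing time units, for example writing scrape_interval: 15 instead of 15s.
- Malformed targets addresses: each entry must include the port.

To validate the configuration:

```shell
docker-compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
```

When a target shows as DOWN on the Targets page, work through this checklist:

- Network connectivity: docker-compose exec prometheus ping node-exporter
- Port exposure: run docker-compose ps and confirm the port mappings.
- Endpoint reachability: curl http://node-exporter:9100/metrics (the service name only resolves inside the Compose network; from the host, use http://localhost:9100/metrics).
- Time synchronization: the clocks of all containers must agree.

3. Grafana Dashboard Design and Optimization

After installing Grafana, the first step is to add Prometheus as a data source. Under Configuration > Data Sources, choose Prometheus. The key settings are:

- URL: http://prometheus:9090
- Access: Server (Default)
- Scrape interval: 15s

These are the dashboard templates I found most useful:

- Node Exporter Full (ID: 1860)
- Docker and system monitoring (ID: 893)
- Prometheus 2.0 Stats (ID: 3662)

When building custom dashboards, these PromQL queries are particularly handy:

```promql
# CPU usage (%)
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100

# Disk space used on the root filesystem (%)
100 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} * 100 / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"})
```

Visualization tips:

- Use the Stat panel for threshold-style metrics.
- Use the Graph panel (Time series in recent Grafana versions) for time-series data.
- Add Gauge panels for key metrics.
- Set a sensible Y-axis maximum so the curves are not flattened.

4. Alert Rules and Alertmanager Configuration

An example alert.rules file:

```yaml
groups:
  - name: host-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: MemoryRunningOut
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Memory running out on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"
```

An example Alertmanager config.yml for email notifications:

```yaml
route:
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: email-notifications

receivers:
  - name: email-notifications
    email_configs:
      - to: your-email@example.com
        from: alertmanager@yourdomain.com
        smarthost: smtp.yourdomain.com:587
        auth_username: smtp-user
        auth_password: smtp-password
        require_tls: true
```

Key points for a DingTalk robot: add a custom robot to the DingTalk group, copy its webhook address, then add a receiver like the following to Alertmanager. Note that Alertmanager posts its own JSON payload, which the DingTalk robot API does not accept as-is, so in practice a small adapter between the two is usually needed.

```yaml
  - name: dingtalk
    webhook_configs:
      - url: https://oapi.dingtalk.com/robot/send?access_token=your_token
        send_resolved: true
```

Common reasons an alert does not fire:

- Prometheus has not loaded the rules file. Reloading over HTTP only works when the --web.enable-lifecycle flag is set.
- The Alertmanager routing configuration is wrong. amtool is the tool to test it with.
- The alert's for duration is too long.
- The threshold in the expression is unrealistic.

5. Advanced Tips and Performance Tuning

After running the stack for a long time, these are the measures I settled on to improve stability.

Storage optimization:

```yaml
# Append the storage flags to the existing Prometheus command list
command:
  - --storage.tsdb.retention.time=30d
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.wal-compression
```

Query performance:

- Use recording rules to precompute frequently used queries.
- Avoid high-cardinality metrics in dashboards.
- Choose scrape intervals deliberately: 15s for key metrics, 1m for ordinary ones.

High availability:

```yaml
# Example of a second Prometheus instance
prometheus-replica:
  image: prom/prometheus:v2.37.0
  ports:
    - "9091:9090"
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --web.enable-lifecycle
```

Special considerations when monitoring Kubernetes:

- Use kube-state-metrics.
- Set a reasonable scrape interval (30s is a good starting point).
- Pay attention to RBAC permissions.
- Use ServiceMonitor resources (from the Prometheus Operator) to define custom scrape targets.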
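
Of the query-performance tips in this section, recording rules are the one that is hardest to picture without an example. The following is a minimal sketch, not the configuration I actually ran: the file name recording.rules, the group name, and the two record names are assumptions chosen for illustration, and the expressions simply reuse the CPU and memory queries from the alert rules.

```yaml
# Hypothetical file: prometheus/recording.rules
# It must also be mounted into the container and listed under
# rule_files in prometheus.yml, next to alert.rules.
groups:
  - name: host-recording-rules   # illustrative group name
    interval: 1m                 # evaluate once a minute
    rules:
      # Precompute per-instance CPU usage so dashboards and alerts
      # query one cheap series instead of re-running irate().
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Precompute per-instance memory usage the same way.
      - record: instance:node_memory_usage:percent
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
```

With rules like these loaded, a dashboard panel or alert could query instance:node_cpu_usage:percent directly, for example instance:node_cpu_usage:percent > 80, and promtool check config would validate the new file together with the main configuration once it is listed under rule_files.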