How do you configure scrape rules when using Prometheus to collect monitoring data from Tencent Kubernetes Engine (TKE)? The main things to get right are scraping the kubelet and cadvisor metrics. This article shows how to write a Prometheus scrape_config that collects monitoring data from a TKE cluster.
```yaml
- job_name: "tke-cadvisor"
  scheme: https
  metrics_path: /metrics/cadvisor  # container metrics from cadvisor
  tls_config:
    insecure_skip_verify: true  # TKE's kubelet uses a self-signed certificate; skip verification
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
      regex: eklet  # exclude super nodes
      action: drop
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
- job_name: "tke-kubelet"
  scheme: https
  metrics_path: /metrics  # the kubelet's own metrics
  tls_config:
    insecure_skip_verify: true
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
      regex: eklet
      action: drop
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
- job_name: "tke-probes"  # container health-check (probe) metrics
  scheme: https
  metrics_path: /metrics/probes
  tls_config:
    insecure_skip_verify: true
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
      regex: eklet
      action: drop
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
- job_name: eks  # super-node (serverless) Pod metrics
  honor_timestamps: true
  metrics_path: '/metrics'  # all metrics are exposed on this path
  params:
    # Usually you need the collect[] parameter to filter the ipvs-related
    # metrics, which can be voluminous and drive up Pod load.
    collect[]:
      - 'ipvs'
      # - 'cpu'
      # - 'meminfo'
      # - 'diskstats'
      # - 'filesystem'
      # - 'loadavg'
      # - 'netdev'
      # - 'filefd'
      # - 'pressure'
      # - 'vmstat'
  scheme: http
  kubernetes_sd_configs:
    - role: pod  # super-node Pod metrics are exposed on port 9100 of each Pod's own IP, hence Pod discovery
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_tke_cloud_tencent_com_pod_type]
      regex: eklet  # only scrape Pods on super nodes
      action: keep
    - source_labels: [__meta_kubernetes_pod_phase]
      regex: Running  # non-Running Pods have released their resources; no need to scrape them
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      separator: ;
      regex: (.*)
      target_label: __address__
      replacement: ${1}:9100  # metrics are exposed on the Pod's port 9100
      action: replace
    - source_labels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      target_label: pod  # write the Pod name into the "pod" label
      replacement: ${1}
      action: replace
  metric_relabel_configs:
    - source_labels: [__name__]
      separator: ;
      regex: (container_.*|pod_.*|kubelet_.*)
      replacement: $1
      action: keep
```

These days it is common to self-host Prometheus with the kube-prometheus-stack Helm chart, customizing values.yaml and installing it into the cluster. The chart also accepts native Prometheus scrape_config entries (not CRDs): put your custom scrape_config under the prometheus.prometheusSpec.additionalScrapeConfigs field. For example:
```yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: "tke-cadvisor"
        scheme: https
        metrics_path: /metrics/cadvisor
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
            regex: eklet
            action: drop
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: "tke-kubelet"
        scheme: https
        metrics_path: /metrics
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
            regex: eklet
            action: drop
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: "tke-probes"
        scheme: https
        metrics_path: /metrics/probes
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
            regex: eklet
            action: drop
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: eks
        honor_timestamps: true
        metrics_path: '/metrics'
        params:
          collect[]: ['ipvs']
          # - 'cpu'
          # - 'meminfo'
          # - 'diskstats'
          # - 'filesystem'
          # - 'loadavg'
          # - 'netdev'
          # - 'filefd'
          # - 'pressure'
          # - 'vmstat'
        scheme: http
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_tke_cloud_tencent_com_pod_type]
            regex: eklet
            action: keep
          - source_labels: [__meta_kubernetes_pod_phase]
            regex: Running
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            separator: ;
            regex: (.*)
            target_label: __address__
            replacement: ${1}:9100
            action: replace
          - source_labels: [__meta_kubernetes_pod_name]
            separator: ;
            regex: (.*)
            target_label: pod
            replacement: ${1}
            action: replace
        metric_relabel_configs:
          - source_labels: [__name__]
            separator: ;
            regex: (container_.*|pod_.*|kubelet_.*)
            replacement: $1
            action: keep
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
```

The super-node Pod metrics endpoint accepts the collect[] query parameter to filter out the metrics you don't need:
```bash
# Quote the URL so the shell does not interpret "&" and "[]"
curl "${IP}:9100/metrics?collect[]=ipvs&collect[]=vmstat"
```

Why such an odd parameter name? Because node_exporter uses it: the super-node Pod internally reuses node_exporter's logic. See node_exporter's documentation for the usage of its collect[] parameter.
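When scripting scrapes, the repeated collect[] parameters can be built with the standard library rather than concatenated by hand. A small sketch (the host and port here are placeholders):

```python
from urllib.parse import urlencode

# Repeated query parameters are encoded from a list of (key, value) pairs;
# "[" and "]" are percent-encoded as %5B and %5D.
params = [("collect[]", c) for c in ("ipvs", "vmstat")]
url = "http://10.0.0.8:9100/metrics?" + urlencode(params)
print(url)  # http://10.0.0.8:9100/metrics?collect%5B%5D=ipvs&collect%5B%5D=vmstat
```

Servers treat the percent-encoded form identically to the literal `collect[]=...` used in the curl example above.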
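To make the target-selection logic of the eks job concrete, here is a simplified Python simulation of its relabel_configs and metric_relabel_configs. This mimics, and is not, Prometheus's actual relabeling engine; the sample Pod labels are made up for illustration:

```python
import re

def relabel(labels):
    """Return the scrape labels for a discovered Pod, or None if it is dropped."""
    # keep: only Pods on super nodes (annotation tke.cloud.tencent.com/pod-type=eklet)
    if labels.get("__meta_kubernetes_pod_annotation_tke_cloud_tencent_com_pod_type") != "eklet":
        return None
    # keep: only Running Pods; the others have released their resources
    if labels.get("__meta_kubernetes_pod_phase") != "Running":
        return None
    out = dict(labels)
    # replace: scrape each Pod on its own IP at port 9100
    out["__address__"] = labels["__meta_kubernetes_pod_ip"] + ":9100"
    # replace: copy the Pod name into the "pod" label
    out["pod"] = labels["__meta_kubernetes_pod_name"]
    return out

def keep_metric(name):
    """metric_relabel_configs: keep only container_*, pod_* and kubelet_* series."""
    return re.fullmatch(r"container_.*|pod_.*|kubelet_.*", name) is not None

pod = {
    "__meta_kubernetes_pod_annotation_tke_cloud_tencent_com_pod_type": "eklet",
    "__meta_kubernetes_pod_phase": "Running",
    "__meta_kubernetes_pod_ip": "10.0.0.8",
    "__meta_kubernetes_pod_name": "nginx-0",
}
print(relabel(pod)["__address__"])  # 10.0.0.8:9100
print(keep_metric("node_cpu_seconds_total"))  # False
```

Note that Prometheus anchors relabeling regexes, which is why the `eklet` and `Running` matches behave like exact string comparisons here.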