K8S Study Notes Series 6: Monitoring Platform with Prometheus + Grafana

乡下的树, 2021-01-30

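I. Introduction to Prometheus

Prometheus is a monitoring system originally built at SoundCloud. It has been a community open-source project since 2012 and has a very active developer and user community. To underline that it is open source and independently maintained, Prometheus joined the CNCF in 2016 as the second hosted project after Kubernetes.
Website: https://prometheus.io/
Source code: https://github.com/prometheus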

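Features of Prometheus:

  • Multi-dimensional data model: time series identified by a metric name and key/value label pairs.
  • Built-in time series database (TSDB).
  • PromQL: a flexible query language that uses the multi-dimensional data to express complex queries.
  • Time series are collected by pulling over HTTP, typically from exporters.
  • Pushing metrics is also supported through the Pushgateway component.
  • Targets are found through service discovery or static configuration.
  • Multiple graphing modes and dashboard support.
  • Can be used as a data source for Grafana.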
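
The multi-dimensional data model from the first bullet can be illustrated with a few lines of Python. This is only a sketch of the idea that a metric name plus a set of label pairs identifies one time series; it is not how the TSDB stores data, and the `series_key` helper and the sample values are made up for the example.

```python
from collections import defaultdict

def series_key(metric_name, labels):
    """Build a hashable identity for a time series: name plus sorted label pairs."""
    return (metric_name, tuple(sorted(labels.items())))

# Hypothetical samples: (metric name, labels, timestamp, value)
samples = [
    ("node_cpu", {"cpu": "cpu0", "mode": "idle"}, 1000, 12.0),
    ("node_cpu", {"mode": "idle", "cpu": "cpu0"}, 1015, 13.5),  # same series, label order differs
    ("node_cpu", {"cpu": "cpu1", "mode": "idle"}, 1000, 11.0),  # different label value: a new series
]

store = defaultdict(list)
for name, labels, ts, value in samples:
    store[series_key(name, labels)].append((ts, value))

print(len(store))  # prints 2: three samples, but only two distinct series
```

Changing any label value creates a new series, which is why labels with very many distinct values make the number of series grow quickly.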

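Prometheus architecture:

image-1648048541927

  • Service discovery: the key piece. One mechanism discovers targets automatically from Kubernetes metadata and is called kubernetes_sd (Kubernetes service discovery). The other reads discovery rules that you write to a file and is called file_sd.
  • Retrieval: applies the discovery results and scrapes the metrics exposed by the exporters. The retrieval engine does all the collecting and hands the data to the TSDB.
  • TSDB: persists the data to disk (HDD/SSD).
  • HTTP server: Prometheus also ships an HTTP server. It (1) serves alerting by pushing alerts out, and (2) exposes a PromQL query interface used by the Prometheus web UI and by Grafana.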

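II. Setting Up Prometheus

1. Deploying the Exporters

Four exporters are normally enough: kube-state-metrics, node-exporter, cadvisor and blackbox-exporter.
1.1 Deploying kube-state-metrics
kube-state-metrics is a monitoring agent that collects state information about Kubernetes objects, for example how many nodes the cluster has, how many deployments exist, and how many revisions a deployment has gone through.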

1.1.1 Preparing the image

Official kube-state-metrics repository on quay.io: https://quay.io/repository/coreos/kube-state-metrics?tab=tags

[root@k8s-7-200 ~]# docker pull quay.io/coreos/kube-state-metrics:v1.5.0   # if quay.io is unreachable, use the mirror below instead
[root@k8s-7-200 ~]# docker pull quay.mirrors.ustc.edu.cn/coreos/kube-state-metrics:v1.5.0
[root@k8s-7-200 ~]# docker images |grep kube-state
quay.io/coreos/kube-state-metrics            v1.5.0                     91599517197a   3 years ago     31.8MB
[root@k8s-7-200 ~]# docker tag 91599517197a harbor.itdo.top/public/kube-state-metrics:v1.5.0
[root@k8s-7-200 ~]# docker push harbor.itdo.top/public/kube-state-metrics:v1.5.0

image-1648043731136

1.1.2 Preparing the resource manifests

[root@k8s-7-200 ~]# mkdir -p /data/k8s-yaml/kube-state-metrics  && cd /data/k8s-yaml/kube-state-metrics
[root@k8s-7-200 kube-state-metrics]# vi rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system

[root@k8s-7-200 kube-state-metrics]# vi dp.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  labels:
    grafanak8sapp: "true"
    app: kube-state-metrics
  name: kube-state-metrics
  namespace: kube-system
spec:
  selector:
    matchLabels:
      grafanak8sapp: "true"
      app: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        grafanak8sapp: "true"
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: harbor.itdo.top/public/kube-state-metrics:v1.5.0  # change to your own image registry
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        readinessProbe:   # readiness probe: traffic is routed to the container only after it reports ready
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
      serviceAccountName: kube-state-metrics

1.1.3 Applying the resource manifests

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/kube-state-metrics/rbac.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/kube-state-metrics/dp.yaml
deployment.extensions/kube-state-metrics created

1.1.4 Verifying the deployment

1 main.go:208] Starting metrics server: 0.0.0.0:8080    (metrics are served on port 8080)
1 main.go:182] Starting kube-state-metrics self metrics server: 0.0.0.0:8081   (kube-state-metrics' own telemetry is served on port 8081)

image-1648043961521

Check that the pod is running:
[root@k8s-7-21 ~]# kubectl get pods -o wide -n kube-system |grep kube-state-metrics
kube-state-metrics-5db9f97d69-pvtfd    1/1     Running   0          21s   172.7.21.6   k8s-7-21.host.top   <none>           <none>
[root@k8s-7-21 ~]# curl 172.7.21.6:8080/healthz
ok
A response of "ok" means the service is healthy.
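
The same check as the curl call can be scripted, for example inside a smoke test. The sketch below assumes only what the transcript shows: /healthz answers HTTP 200 with the body "ok". The `is_healthy` helper is a name chosen for this example and is not part of kube-state-metrics.

```python
import urllib.request

def is_healthy(url, timeout=2.0):
    """Return True if the endpoint answers HTTP 200 with the body "ok"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"ok"
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses.
        return False

# Example call (pod IP taken from the kubectl output above; adjust it for your cluster):
#   is_healthy("http://172.7.21.6:8080/healthz")
```

Returning False instead of raising keeps the helper easy to use in a retry loop while the pod is still starting.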

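1.2 Deploying node-exporter

node-exporter collects infrastructure metrics from the Kubernetes compute nodes, such as free memory, CPU and memory usage, and network I/O. It must run on every compute node.

1.2.1 Preparing the image

Image tags: https://hub.docker.com/r/prom/node-exporter/tags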

[root@k8s-7-200 ~]# docker pull prom/node-exporter:v0.15.0
[root@k8s-7-200 ~]# docker images |grep node-exporter
prom/node-exporter                                                   v0.15.0                         12d51ffa2b22        3 years ago         22.8MB
[root@k8s-7-200 ~]# docker image tag 12d51ffa2b22 harbor.itdo.top/public/node-exporter:v0.15.0
[root@k8s-7-200 ~]# docker image push harbor.itdo.top/public/node-exporter:v0.15.0

1.2.2 Preparing the resource manifests

[root@k8s-7-200 k8s-yaml]# mkdir -p /data/k8s-yaml/node-exporter && cd /data/k8s-yaml/node-exporter
[root@k8s-7-200 node-exporter]# vi ds.yaml
# node-exporter runs as a DaemonSet on every node and shares the host's network namespace.
# It reads host system information through the mounted host /proc and /sys directories.

kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    daemon: "node-exporter"
    grafanak8sapp: "true"
spec:
  selector:
    matchLabels:
      daemon: "node-exporter"
      grafanak8sapp: "true"
  template:
    metadata:
      name: node-exporter
      labels:
        daemon: "node-exporter"
        grafanak8sapp: "true"
    spec:
      volumes:
      - name: proc
        hostPath: 
          path: /proc
          type: ""
      - name: sys
        hostPath:
          path: /sys
          type: ""
      containers:
      - name: node-exporter
        image: harbor.itdo.top/public/node-exporter:v0.15.0
        args:
        - --path.procfs=/host_proc
        - --path.sysfs=/host_sys
        ports:
        - name: node-exporter
          hostPort: 9100
          containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: sys
          readOnly: true
          mountPath: /host_sys
        - name: proc
          readOnly: true
          mountPath: /host_proc
      hostNetwork: true

1.2.3 Applying the resource manifests

Run this on any one node:

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/node-exporter/ds.yaml
daemonset.extensions/node-exporter created

1.2.4 Verifying the deployment

image-1648044192412
Check port 9100:

[root@k8s-7-21 ~]# netstat -tulpn |grep 9100
tcp6       0      0 :::9100                 :::*                    LISTEN      27069/node_exporter 
[root@k8s-7-21 ~]# kubectl get pods -o wide -n kube-system |grep node-exporter
node-exporter-2lccz                    1/1     Running   0          2m13s   10.4.7.21    k8s-7-21.host.top   <none>           <none>
node-exporter-s6s8d                    1/1     Running   0          2m13s   10.4.7.22    k8s-7-22.host.top   <none>           <none>

Because the pods share the host's network namespace, they have no IP of their own; the pod IP is the host IP.
[root@k8s-7-21 ~]# curl localhost:9100
<html>
            <head><title>Node Exporter</title></head>
            <body>
            <h1>Node Exporter</h1>
            <p><a href="/metrics">Metrics</a></p>
            </body>
            </html>

image-1648044227887
Host metrics as seen from the container: curl http://10.4.7.21:9100/metrics
image-1648044245042

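1.3 Deploying cadvisor

So far nothing tells us how many resources the containers themselves consume. cadvisor is the key tool for monitoring resource usage inside containers, and it does so from outside the container. Before Kubernetes 1.9, cadvisor was integrated into the kubelet and started together with it; since 1.9 it has been split out.
Once deployed, cadvisor reads the host's cgroup data and queries the Docker engine to collect its metrics, which is why the manifest below mounts /sys, /var/run and the Docker data directory. Monitoring of the resources consumed by containers therefore depends on cadvisor, not on kube-state-metrics.
Official image: https://hub.docker.com/r/google/cadvisor

1.3.1 Preparing the image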

[root@k8s-7-200 ~]# docker pull google/cadvisor:v0.28.3
[root@k8s-7-200 node-exporter]# docker images |grep cadvisor
google/cadvisor                                                      v0.28.3                         75f88e3ec333        3 years ago         62.2MB
[root@k8s-7-200 node-exporter]# docker image tag 75f88e3ec333 harbor.itdo.top/public/cadvisor:v0.28.3
[root@k8s-7-200 node-exporter]# docker push harbor.itdo.top/public/cadvisor:v0.28.3

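1.3.2 Preparing the resource manifests

[root@k8s-7-200 k8s-yaml]# mkdir -p /data/k8s-yaml/cadvisor && cd /data/k8s-yaml/cadvisor
[root@k8s-7-200 cadvisor]# vim ds.yaml

# cadvisor runs as a DaemonSet on the nodes; the toleration below lets it also run on nodes that carry the master NoSchedule taint.
# Several host directories, such as the Docker data directory, are mounted into the container.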

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
 name: cadvisor
 namespace: kube-system
 labels:
   app: cadvisor
spec:
 selector:
   matchLabels:
     name: cadvisor
 template:
   metadata:
     labels:
       name: cadvisor
   spec:
     hostNetwork: true
     tolerations:
     - key: node-role.kubernetes.io/master
       effect: NoSchedule
     containers:
     - name: cadvisor
       image: harbor.itdo.top/public/cadvisor:v0.28.3   # change to your own image registry
       imagePullPolicy: IfNotPresent
       volumeMounts:
       - name: rootfs
         mountPath: /rootfs
         readOnly: true
       - name: var-run
         mountPath: /var/run
       - name: sys
         mountPath: /sys
         readOnly: true
       - name: docker
         mountPath: /var/lib/docker
         readOnly: true
       ports:
         - name: http
           containerPort: 4194
           protocol: TCP
       readinessProbe:
         tcpSocket:
           port: 4194
         initialDelaySeconds: 5
         periodSeconds: 10
       args:
         - --housekeeping_interval=10s
         - --port=4194
     terminationGracePeriodSeconds: 30
     volumes:
     - name: rootfs
       hostPath:
         path: /
     - name: var-run
       hostPath:
         path: /var/run
     - name: sys
       hostPath:
         path: /sys
     - name: docker
       hostPath:
         path: /data/docker

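Explanation:
effect defines how a taint repels pods:

  • NoSchedule: only affects scheduling; pods already running on the node are not affected.
  • NoExecute: affects scheduling and also running pods; pods that do not tolerate the taint are evicted from the node.
  • PreferNoSchedule: a soft constraint; the scheduler tries to avoid nodes with the taint but may still use them when no other node is available.

Create a cgroup symlink on all compute nodes.

Reason: with this version the container does not start unless the symlink exists.
Do this on both k8s-7-21 and k8s-7-22.

Remount /sys/fs/cgroup/ read-write:
[root@k8s-7-21 ~]# mount -o remount,rw /sys/fs/cgroup/   
   
Check the current layout: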
[root@k8s-7-21 ~]# ll /sys/fs/cgroup/
drwxr-xr-x. 7 root root  0 2月   1 15:51 blkio
lrwxrwxrwx. 1 root root 11 2月   1 15:45 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 2月   1 15:45 cpuacct -> cpu,cpuacct
drwxr-xr-x. 7 root root  0 2月   1 15:51 cpu,cpuacct
 
The container expects the name cpuacct,cpu, so add a symlink:
[root@k8s-7-21 ~]# ln -s /sys/fs/cgroup/cpu,cpuacct/ /sys/fs/cgroup/cpuacct,cpu
[root@k8s-7-21 ~]# ll /sys/fs/cgroup/ 
lrwxrwxrwx. 1 root root 11 2月   1 15:45 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 2月   1 15:45 cpuacct -> cpu,cpuacct
lrwxrwxrwx. 1 root root 27 2月   2 10:35 cpuacct,cpu -> /sys/fs/cgroup/cpu,cpuacct/
drwxr-xr-x. 7 root root  0 2月   1 15:51 cpu,cpuacct

1.3.3 Applying the resource manifests

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/cadvisor/ds.yaml 
daemonset.extensions/cadvisor created

1.3.4 Checking the status

image-1648044504137

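1.4 blackbox-exporter

blackbox-exporter probes whether business containers are alive. This mirrors a classic Zabbix use case: Tomcat listens on port 8080, the port is monitored, and an alert fires if it goes down unexpectedly. blackbox-exporter is the dedicated tool here for detecting that a container is down (planned restarts for releases and code updates do not count).

1.4.1 Preparing the image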

[root@k8s-7-200 cadvisor]# docker pull prom/blackbox-exporter:v0.15.1
[root@k8s-7-200 cadvisor]# docker images |grep blackbox-exporter
prom/blackbox-exporter                                               v0.15.1                         81b70b6158be        16 months ago       19.7MB
[root@k8s-7-200 cadvisor]# docker image tag 81b70b6158be harbor.itdo.top/public/blackbox-exporter:v0.15.1
[root@k8s-7-200 cadvisor]# docker image push harbor.itdo.top/public/blackbox-exporter:v0.15.1

1.4.2 Preparing the resource manifests

[root@k8s-7-200 cadvisor]# mkdir -p /data/k8s-yaml/blackbox-exporter && cd /data/k8s-yaml/blackbox-exporter

[root@k8s-7-200 blackbox-exporter]# vi cm.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: kube-system
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 2s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: [200,301,302]
          method: GET
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 2s

[root@k8s-7-200 blackbox-exporter]# vi dp.yaml

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: blackbox-exporter
  namespace: kube-system
  labels:
    app: blackbox-exporter
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
          defaultMode: 420
      containers:
      - name: blackbox-exporter
        image: harbor.itdo.top/public/blackbox-exporter:v0.15.1
        imagePullPolicy: IfNotPresent
        args:
        - --config.file=/etc/blackbox_exporter/blackbox.yml
        - --log.level=info
        - --web.listen-address=:9115
        ports:
        - name: blackbox-port
          containerPort: 9115
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 50Mi
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3

[root@k8s-7-200 blackbox-exporter]# vi svc.yaml
# targetPort is omitted, so it defaults to the value of port (9115), which matches the container's blackbox-port

kind: Service
apiVersion: v1
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  selector:
    app: blackbox-exporter
  ports:
    - name: blackbox-port
      protocol: TCP
      port: 9115

[root@k8s-7-200 blackbox-exporter]# vi ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  rules:
  - host: blackbox.itdo.top
    http:
      paths:
      - path: /
        backend:
          serviceName: blackbox-exporter
          servicePort: blackbox-port

Configure DNS resolution for blackbox.itdo.top:
[root@k8s-7-11 ~]# vi /var/named/itdo.top.zone

blackbox           A    10.4.7.10
[root@k8s-7-11 ~]# systemctl restart named
[root@k8s-7-21 ~]# dig -t A blackbox.itdo.top @192.168.0.2 +short
10.4.7.10

1.4.3 Applying the resource manifests

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/blackbox-exporter/cm.yaml
configmap/blackbox-exporter created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/blackbox-exporter/dp.yaml
deployment.extensions/blackbox-exporter created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/blackbox-exporter/svc.yaml
service/blackbox-exporter created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/blackbox-exporter/ingress.yaml
ingress.extensions/blackbox-exporter created

1.4.4 Checking the status

image-1648044814280
image-1648044826323

2. Deploying the Prometheus Server

2.1 Preparing the image

[root@k8s-7-200 ~]# docker pull prom/prometheus:v2.14.0
[root@k8s-7-200 ~]# docker image tag prom/prometheus:v2.14.0 harbor.itdo.top/infra/prometheus:v2.14.0
[root@k8s-7-200 ~]# docker image push harbor.itdo.top/infra/prometheus:v2.14.0

2.2 Preparing the resource manifests

[root@k8s-7-200 ~]# mkdir /data/k8s-yaml/prometheus && cd /data/k8s-yaml/prometheus

[root@k8s-7-200 prometheus]# vi rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: infra

[root@k8s-7-200 prometheus]# vi dp.yaml
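# In production Prometheus usually runs on a dedicated node with plenty of memory, and a taint keeps other pods from being scheduled there.
# --storage.tsdb.min-block-duration: how many minutes of the most recent TSDB data are kept in memory before a block is written; production setups buffer more.
# --storage.tsdb.retention: how long TSDB data is retained; production setups keep more.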

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
  labels:
    name: prometheus
  name: prometheus
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1   
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      nodeName: k8s-7-21.host.top
      containers:
      - name: prometheus
        image: harbor.itdo.top/infra/prometheus:v2.14.0
        imagePullPolicy: IfNotPresent
        command:
        - /bin/prometheus
        args:
        - --config.file=/data/etc/prometheus.yml
        - --storage.tsdb.path=/data/prom-db
        - --storage.tsdb.min-block-duration=5m
        - --storage.tsdb.retention=24h
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /data
          name: data
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1000m"
            memory: "2Gi"
      imagePullSecrets:
      - name: harbor
      securityContext: 
        runAsUser: 0
      serviceAccountName: prometheus
      volumes:
      - name: data
        nfs:
          server: k8s-7-200.itdo.top
          path: /data/nfs-volume/prometheus

[root@k8s-7-200 prometheus]# vi svc.yaml   # exposes port 9090

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: infra
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus

[root@k8s-7-200 prometheus]# vi ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  name: prometheus
  namespace: infra
spec:
  rules:
  - host: prometheus.itdo.top
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090

Configure DNS resolution:

[root@k8s-7-11 ~]# vi /var/named/itdo.top.zone
prometheus         A    10.4.7.10

[root@k8s-7-11 ~]# systemctl restart named
[root@k8s-7-11 ~]# dig -t A prometheus.itdo.top @10.4.7.11 +short
10.4.7.10

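2.3 Preparing the Prometheus configuration

Copy the certificates

[root@k8s-7-200 ~]# mkdir -p /data/nfs-volume/prometheus/{etc,prom-db}      # etc holds the configuration, prom-db holds the time series database
[root@k8s-7-200 prometheus]# cd /data/nfs-volume/prometheus/etc/  # copy the certificates here; Prometheus can discover Kubernetes metadata automatically because it talks to the apiserver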
[root@k8s-7-200 etc]# cp /opt/certs/ca.pem .
[root@k8s-7-200 etc]# cp -a /opt/certs/client.pem .
[root@k8s-7-200 etc]# cp -a /opt/certs/client-key.pem .
[root@k8s-7-200 etc]# ll
-rw-r--r--. 1 root root 1346 2月   5 16:12 ca.pem
-rw-------. 1 root root 1679 11月 30 12:22 client-key.pem 
-rw-r--r--. 1 root root 1363 11月 30 12:22 client.pem

Prepare the Prometheus configuration file:
[root@k8s-7-200 etc]# vim /data/nfs-volume/prometheus/etc/prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s
scrape_configs:
- job_name: 'etcd'   # only the etcd job needs its parameters changed; the other jobs use auto-discovery rules and can be used in production as they are
  tls_config:
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
  scheme: https
  static_configs:          # change to the IPs of your etcd members
  - targets:
    - '10.4.7.12:2379'
    - '10.4.7.21:2379'
    - '10.4.7.22:2379'
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:4194
- job_name: 'kubernetes-kube-state'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
  - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
    regex: .*true.*
    action: keep
  - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
    regex: 'node-exporter;(.*)'
    action: replace
    target_label: nodename
- job_name: 'blackbox_http_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: http
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port,  __meta_kubernetes_pod_annotation_blackbox_path]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+);(.+)
    replacement: $1:$2$3
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [tcp_connect]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: tcp
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'traefik'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: keep
    regex: traefik
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

2.4 Applying the resource manifests

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/prometheus/rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/prometheus/dp.yaml
deployment.extensions/prometheus created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/prometheus/svc.yaml
service/prometheus created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/prometheus/ingress.yaml
ingress.extensions/prometheus created

2.5 Checking the status

Opening http://prometheus.itdo.top redirects to http://prometheus.itdo.top/graph
image-1648045257667

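3. Configuration File Explained

Normally the Targets page shows the entries below. These targets come from the scrape rules already written in prometheus.yml.
image-1648045342147
They can also be viewed under Configuration. There is one group on the Targets page for every job_name in the configuration file.
image-1648045365704
Configuration of job_name: etcd: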

- job_name: etcd
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
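  metrics_path: /metrics  # path scraped on each etcd target; for etcd at 10.4.7.12:2379 the data comes from 10.4.7.12:2379/metrics
  scheme: https
  static_configs:   # static configuration; etcd is the only job with hard-coded targets, all other jobs use auto-discovery
  - targets:
    - 10.4.7.12:2379
    - 10.4.7.21:2379
    - 10.4.7.22:2379
  tls_config:   # etcd is scraped over TLS using the CA and the client certificate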
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
    insecure_skip_verify: false

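Endpoint is the access point the data is scraped from. Labels are the dimensions you can use to filter the related metrics.
image-1648045438005
Configuration of job_name: kubernetes-kube-state:
This job uses auto-discovery.
image-1648045462319
Service Discovery details:
image-1648045511073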

- job_name: kubernetes-kube-state
  honor_timestamps: true
  scrape_interval: 15s  # scrape interval
  scrape_timeout: 10s  # scrape timeout
  metrics_path: /metrics  # URL path that is requested
  scheme: http
  kubernetes_sd_configs:  # Kubernetes service discovery configuration; this job finds its targets through kubernetes_sd. Prometheus supports Kubernetes natively, so once this is declared the retrieval engine discovers Kubernetes metadata automatically
  - role: pod  # discovers every pod in the cluster
  relabel_configs:  # filter, so that not every pod ends up in this job

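The Endpoint values come from the __address__ label shown under Service Discovery.
With this configuration all pods are discovered first and then filtered. What remains is http://10.4.7.21:9100/metrics and http://10.4.7.22:9100/metrics (node_exporter) and http://172.7.22.5:8080/metrics (kube-state-metrics). How does the filtering work?
relabel_configs explained (keep):

relabel_configs:  # filter, so that not every pod ends up in this job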
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
    separator: ;
    regex: .*true.*
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_label_daemon, __meta_kubernetes_pod_node_name]
    separator: ;
    regex: node-exporter;(.*)
    target_label: nodename
    replacement: $1
    action: replace

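The action is what matters most, and keep and drop are the main ones, so look at those first.
source_labels: [__meta_kubernetes_pod_label_grafanak8sapp] takes the source value: the grafanak8sapp label of each pod. A pod whose label value matches the regex is kept; a pod without a matching label is dropped from the job.
separator: ;
regex: .*true.*  (if the value matches this regular expression, the keep action retains the target)
replacement: $1
action: keep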

[root@k8s-7-22 ~]# kubectl get pods -o wide -n kube-system |grep kube-state-metrics
kube-state-metrics-6bc667c8b9-z9dhd     1/1     Running   1          5d18h   172.7.22.5   k8s-7-22.host.com   <none>           <none>
[root@k8s-7-22 ~]# kubectl get pod kube-state-metrics-6bc667c8b9-z9dhd -o yaml -n kube-system |grep -A 5 labels
  labels:
    app: kube-state-metrics
    grafanak8sapp: "true"   # defined earlier in dp.yaml
    pod-template-hash: 6bc667c8b9
  name: kube-state-metrics-6bc667c8b9-z9dhd
  namespace: kube-system
[root@k8s-7-22 ~]#
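
The effect of the keep rule can be reproduced offline. The following is a simplified sketch and not Prometheus code: it assumes that the values of source_labels are joined with the separator, that a missing label counts as an empty string, and that the regex must match the whole joined value. The `keep` helper and the third pod name are hypothetical.

```python
import re

def keep(targets, source_labels, regex, separator=";"):
    """Mimic `action: keep`: retain only targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # A label that is missing on the target is treated as an empty string.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept

# Discovered pods, reduced to the meta labels this rule looks at.
discovered = [
    {"__meta_kubernetes_pod_name": "kube-state-metrics-6bc667c8b9-z9dhd",
     "__meta_kubernetes_pod_label_grafanak8sapp": "true"},
    {"__meta_kubernetes_pod_name": "node-exporter-2lccz",
     "__meta_kubernetes_pod_label_grafanak8sapp": "true"},
    {"__meta_kubernetes_pod_name": "some-other-pod"},  # hypothetical pod without the grafanak8sapp label
]

kept = keep(discovered, ["__meta_kubernetes_pod_label_grafanak8sapp"], r".*true.*")
for pod in kept:
    print(pod["__meta_kubernetes_pod_name"])
```

Only the two pods that carry grafanak8sapp: "true" survive, which matches the three endpoints (two node-exporter pods plus kube-state-metrics) shown for this job.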

image-1648045637264

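Querying with Prometheus

Enter node_cpu (a metric name) and click Execute. Each returned row is one time series, that is, one combination of label values.
image-1648045669926
Everything you search for is a metric, so the output is metrics too, for example node_cpu{nodename="k8s-7-21.host.top"}
image-1648045691737
Summary: all scraped data (metrics) is stored in Prometheus, and the label matchers in the expression you Execute filter those metrics.
For example, node_disk_io_time_ms is collected from kubernetes_pod_name="node-exporter-2lccz"
image-1648045718243
Under Targets, pick any endpoint in kubernetes-kube-state (3/3 up), such as http://10.4.7.21:9100/metrics; it shows that node-exporter does the collecting. Every scrape_interval (15s) Prometheus requests http://10.4.7.21:9100/metrics, much like running curl, to fetch the latest data.
image-1648045740939
Question: a query for node_disk_io_time_ms returns series with many labels, but at http://10.4.7.21:9100/metrics node_disk_io_time_ms only carries {device="dm-0"}. Where do the other labels come from? They are determined by relabel_configs.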
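
To make the answer concrete, the sketch below parses one line as it appears at /metrics and merges it with the labels that belong to the scrape target. It is a simplification under stated assumptions: the parser handles only plain lines without timestamps, the sample value 1234 is made up, the target label set is a subset put together from values in this article, and the `exported_` prefix follows the default honor_labels: false behaviour. `parse_sample` and `attach_target_labels` are names chosen for this example.

```python
import re

SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$")
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_sample(line):
    """Parse one simple exposition line (no timestamp) into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)

def attach_target_labels(sample_labels, target_labels):
    """Merge scraped labels with target labels; clashing scraped labels get an exported_ prefix."""
    merged = dict(target_labels)
    for key, val in sample_labels.items():
        merged["exported_" + key if key in target_labels else key] = val
    return merged

# Subset of the target labels that relabel_configs yields for one node-exporter pod.
target_labels = {
    "job": "kubernetes-kube-state",
    "instance": "10.4.7.21:9100",
    "kubernetes_namespace": "kube-system",
    "kubernetes_pod_name": "node-exporter-2lccz",
    "nodename": "k8s-7-21.host.top",
}

name, labels, value = parse_sample('node_disk_io_time_ms{device="dm-0"} 1234')
stored_labels = attach_target_labels(labels, target_labels)
print(name, stored_labels, value)
```

The exporter contributes only device="dm-0"; every other label on the stored series comes from the target.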

relabel_configs explained (replace):

image-1648045780664

- job_name: kubernetes-kube-state
  # ... omitted
  relabel_configs:
  # ... omitted
  - source_labels: [__meta_kubernetes_pod_name]  
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  # ... omitted
 
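This rule takes the pod name of each pod discovered from Kubernetes and attaches it to that pod's metrics as the label kubernetes_pod_name. For example, data in the kubernetes-kube-state job scraped from http://10.4.7.21:9100/metrics belongs to the pod node-exporter-c92vd, so every series scraped from that endpoint gets the label
kubernetes_pod_name="node-exporter-c92vd"
-- action: the relabeling action
  -- replace: the default; matches regex against the concatenated source_labels and sets target_label to replacement, which may reference the match groups
  -- keep: drops targets whose concatenated source_labels do not match regex
  -- drop: drops targets whose concatenated source_labels match regex
  -- hashmod: sets target_label to the modulus of a hash of the concatenated source_labels
  -- labelmap: matches regex against all label names, then copies the values of the matching labels to the label names given by replacement, using match group references ($1, $2)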
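
The replace action can be tried out in the same offline fashion. This is a minimal sketch rather than the real implementation: it assumes the regex is anchored at both ends and that $1 or ${1} in replacement refers to a match group. `relabel_replace` is a hypothetical helper, and the traefik pod name and address are invented for illustration.

```python
import re

def relabel_replace(labels, source_labels, regex, target_label, replacement="$1", separator=";"):
    """Mimic `action: replace` on one target's label set; returns a new dict."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the whole joined value has to match
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate Go-style $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result

# Rule from the text: copy the pod name into kubernetes_pod_name.
node_pod = {"__meta_kubernetes_pod_name": "node-exporter-c92vd"}
node_pod = relabel_replace(node_pod, ["__meta_kubernetes_pod_name"], r"(.*)", "kubernetes_pod_name")

# Address rule used by several jobs: swap the discovered port for the annotated one.
traefik_pod = {
    "__address__": "172.7.21.5:80",  # hypothetical discovered address
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "8080",
}
traefik_pod = relabel_replace(
    traefik_pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "__address__",
    "$1:$2",
)

print(node_pod["kubernetes_pod_name"])
print(traefik_pod["__address__"])
```

When the port annotation is absent the joined value ends in ";" and the regex does not match, so `__address__` keeps its discovered value.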

4. Using Prometheus

4.1 Integrating Traefik

image-1648045840762
Look at the configuration:

- job_name: traefik
# ... omitted
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]  # matches a pod annotation, so add the annotations to the traefik pod controller and restart the pods for monitoring to take effect
    separator: ;
    regex: traefik
    replacement: $1
    action: keep
# ... omitted
 
 
- job_name: traefik
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: traefik
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
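
The least obvious rule in this job is the one that rewrites __address__. A minimal Python sketch of what its regex does, assuming Prometheus joins the two source labels with the separator ";" and anchors the regex at both ends; rewrite_address is a hypothetical helper name, not part of Prometheus.

```python
import re

# Same pattern as in the traefik job: host, optional old port, ";", new port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Return host:port built from the discovered address and the port annotation.

    If the pattern does not match (for example, no port annotation),
    the address is left unchanged, as a replace action would do.
    """
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("172.7.22.5:80", "8080"))
```

Whatever port the pod was discovered on, the scrape address ends up using the port given in the prometheus_io_port annotation.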

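Add the annotations to Traefik's pod controller and restart the pods.
image-1648045888992
# Add the annotations under spec.template.metadata in Traefik's daemonset.yaml, then restart the pods
Tip: there is no need to align the lines by hand.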

"annotations": {
  "prometheus_io_scheme": "traefik",
  "prometheus_io_path": "/metrics",
  "prometheus_io_port": "8080"
}

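Annotations are similar to Labels and are also defined as key/value pairs.
Label: follows strict naming rules, defines metadata of a Kubernetes object, and is used by Label Selectors.
Annotation: arbitrary "additional" information defined by the user, intended for external tools to look up.
image-1648045952588
Delete the Traefik pods on both nodes:
image-1648088109318
If a pod gets stuck in Terminating, it can be force-deleted:

kubectl delete pods <pod-name> -n kube-system --force --grace-period=0

Once the pods are running again, check Prometheus: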

image-1648046028120
Inspect the corresponding metrics:
curl http://172.7.22.5:8080/metrics
image-1648046046979
image-1648046060407

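4.2. Integrating Blackbox monitoring

Blackbox monitoring checks whether a service is alive. First determine what kind of service it is, TCP or HTTP; its port is then probed at regular intervals to decide whether the service is up.

# Add these to the annotations of the target pod. The first block is a TCP probe, the second an HTTP probe; this Prometheus configuration defines no probes for other protocols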
"annotations": {
  "blackbox_port": "20880",
  "blackbox_scheme": "tcp"
}
  
"annotations": {
  "blackbox_port": "8080",
  "blackbox_scheme": "http",
  "blackbox_path": "/hello?name=health"
}

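Testing TCP:

Take dubbo-demo-server as the example. It speaks TCP, so edit its dp.yaml and restart the container:
image-1648046124099
Check the status.

The results show up under:
blackbox_http_pod_probe (0/0 up) show more   results of the HTTP probes
blackbox_tcp_pod_probe (1/1 up)   results of the TCP probes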

image-1648046151437
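Visit http://blackbox.itdo.top. The detection mechanism: blackbox connects to 172.7.22.16:20880 on Prometheus's behalf and from that decides whether the service is healthy.

image-1648046174020
Check the logs:
image-1648046203009
In Targets, the scrape URL is http://blackbox-exporter.kube-system:9115/probe, requested via GET with module="tcp_connect" and target="172.7.22.16:20880".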
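
To make the shape of that request concrete, here is a small Python sketch that assembles the probe URL from its parts. probe_url is a hypothetical helper; only the /probe path and the module and target parameters are taken from the text above.

```python
from urllib.parse import urlencode


def probe_url(exporter, module, target):
    """Build the URL used to ask blackbox-exporter to probe a target."""
    # safe=":" keeps the host:port target readable instead of encoding ":" as %3A.
    query = urlencode({"module": module, "target": target}, safe=":")
    return f"http://{exporter}/probe?{query}"


print(probe_url("blackbox-exporter.kube-system:9115", "tcp_connect", "172.7.22.16:20880"))
```

The resulting URL is the same one that is requested with curl from inside the cluster in the next step.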

blackbox-exporter is served on the cluster IP 192.168.20.237:9115:
[root@k8s-7-21 ~]# kubectl get svc -o wide -n kube-system |grep blackbox-exporter
blackbox-exporter         ClusterIP  192.168.20.237   <none>        9115/TCP                 7d20h   app=blackbox-exporter
 
Enter any container in the cluster and curl port 9115 of blackbox-exporter, passing module=tcp_connect&target=172.7.22.16:20880:
[root@k8s-7-21 ~]# kubectl exec -it nginx-ds-q5tmt /bin/bash
 
root@nginx-ds-q5tmt:/# curl 'http://blackbox-exporter.kube-system:9115/probe?module=tcp_connect&target=172.7.22.16:20880'
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.357904555
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.35803019
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 0
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0
root@nginx-ds-q5tmt:/# 
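
The probe response is plain text in the Prometheus exposition format, and the line that matters is probe_success (1 means the probe succeeded, 0 means it failed). A minimal Python sketch for reading it, assuming label-less samples without timestamps as in the output above; parse_metrics and probe_ok are hypothetical helper names.

```python
SAMPLE = """\
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.35803019
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0
"""


def parse_metrics(text):
    """Parse simple 'name value' sample lines, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def probe_ok(samples):
    """A probe counts as successful only when probe_success equals 1."""
    return samples.get("probe_success") == 1.0


print(probe_ok(parse_metrics(SAMPLE)))
```

Applied to the transcript above, this reports a failed probe, because that response contains probe_success 0.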

Testing HTTP:

Take dubbo-demo-consumer as the example. It speaks HTTP, so edit its dp.yaml and restart the container:
image-1648046267597
image-1648046284475

5. Deploying Grafana

5.1. Installing Grafana

[root@k8s-7-200 ~]# docker pull grafana/grafana:5.4.2
[root@k8s-7-200 ~]# docker image ls |grep grafana
grafana/grafana                                                      5.4.2                           6f18ddf9e552        2 years ago         243MB
[root@k8s-7-200 ~]# docker image tag 6f18ddf9e552 harbor.itdo.top/infra/grafana:v5.4.2
[root@k8s-7-200 ~]# docker image push harbor.itdo.top/infra/grafana:v5.4.2

5.2. Preparing the resource manifests

[root@k8s-7-200 ~]# mkdir /data/k8s-yaml/grafana && cd /data/k8s-yaml/grafana

[root@k8s-7-200 grafana]# vi rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
rules:
- apiGroups:
  - "*"
  resources:
  - namespaces
  - deployments
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana
subjects:
- kind: User
  name: k8s-node

[root@k8s-7-200 grafana]# vi dp.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: grafana
    name: grafana
  name: grafana
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      name: grafana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        name: grafana
    spec:
      containers:
      - name: grafana
        image: harbor.itdo.top/infra/grafana:v5.4.2  # change to your own registry address
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: data
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      volumes:
      - nfs:
          server: k8s-7-200.host.top
          path: /data/nfs-volume/grafana
        name: data

Create the grafana data directory:

[root@k8s-7-200 grafana]# mkdir /data/nfs-volume/grafana

[root@k8s-7-200 grafana]# vi svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: infra
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana

[root@k8s-7-200 grafana]# vi ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: infra
spec:
  rules:
  - host: grafana.itdo.top
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000

Configure DNS resolution:

[root@k8s-7-11 ~]# vi /var/named/itdo.top.zone
grafana            A    10.4.7.10

[root@k8s-7-11 ~]# systemctl restart named
[root@k8s-7-11 ~]# dig -t A grafana.itdo.top @10.4.7.11 +short
10.4.7.10

5.3. Applying the resource manifests

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/grafana/rbac.yaml
clusterrole.rbac.authorization.k8s.io/grafana created
clusterrolebinding.rbac.authorization.k8s.io/grafana created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/grafana/dp.yaml
deployment.extensions/grafana created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/grafana/svc.yaml
service/grafana created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/grafana/ingress.yaml
ingress.extensions/grafana created

5.4. Installing plugins

Visit http://grafana.itdo.top. The default username and password are admin / admin.
image-1648046590187
Configuration page (remember to save):
image-1648046608004

# Plugins to install
grafana-kubernetes-app
grafana-clock-panel     clock panel
grafana-piechart-panel  pie chart
briangann-gauge-panel   D3 Gauge
natel-discrete-panel    Discrete
# There are two ways to install them:
Method 1: enter the container and run the install commands one by one
[root@k8s-7-21 ~]# kubectl get pod -n infra -l name=grafana
NAME                       READY   STATUS    RESTARTS   AGE
grafana-6d45bd5f75-x8pxr   1/1     Running   0          79m
 
[root@k8s-7-21 ~]# kubectl exec -it grafana-6d45bd5f75-x8pxr -n infra /bin/bash
root@grafana-596d8dbcd5-l2466:/usr/share/grafana# grafana-cli plugins install grafana-kubernetes-app
root@grafana-596d8dbcd5-l2466:/usr/share/grafana# grafana-cli plugins install grafana-clock-panel
root@grafana-596d8dbcd5-l2466:/usr/share/grafana# grafana-cli plugins install grafana-piechart-panel
root@grafana-596d8dbcd5-l2466:/usr/share/grafana# grafana-cli plugins install briangann-gauge-panel
root@grafana-596d8dbcd5-l2466:/usr/share/grafana# grafana-cli plugins install natel-discrete-panel
 
The following message after installation indicates success:
√Installed $name successfully

image-1648046668382

Method 2: download the plugin zip files manually. Query https://grafana.com/api/plugins/repo/$plugin_name for the plugin version $version
#    download the zip file from https://grafana.com/api/plugins/$plugin_name/versions/$version/download
#    unpack the zip file into /data/nfs-volume/grafana/plugins
# restart the Grafana pod once the plugins are installed
[root@k8s-7-200 plugins]# cd /data/nfs-volume/grafana/plugins
[root@k8s-7-200 plugins]# wget -O grafana-kubernetes-app.zip https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download
[root@k8s-7-200 plugins]# wget -O grafana-clock-panel.zip https://grafana.com/api/plugins/grafana-clock-panel/versions/1.0.1/download
[root@k8s-7-200 plugins]# wget -O grafana-piechart-panel.zip https://grafana.com/api/plugins/grafana-piechart-panel/versions/1.0.1/download
[root@k8s-7-200 plugins]# wget -O briangann-gauge-panel.zip https://grafana.com/api/plugins/briangann-gauge-panel/versions/0.0.9/download
[root@k8s-7-200 plugins]# wget -O natel-discrete-panel.zip https://grafana.com/api/plugins/natel-discrete-panel/versions/0.1.1/download
[root@k8s-7-200 plugins]# ls *.zip | xargs -I {} unzip -q {}

With either method, Grafana must be restarted before the plugins take effect:

[root@k8s-7-21 ~]# kubectl delete pod grafana-6d45bd5f75-x8pxr -n infra
pod "grafana-6d45bd5f75-x8pxr" deleted
[root@k8s-7-21 ~]# kubectl get pod -o wide -n infra
NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE                NOMINATED NODE   READINESS GATES
grafana-6d45bd5f75-wlm95   1/1     Running   0          42s    172.7.21.10    k8s-7-21.host.top   <none>           <none>

Check that the plugins were added successfully:
image-1648046732228

5.5. Adding the data source

image-1648046759206
Select prometheus.
image-1648046779482
URL: http://prometheus.itdo.top
Access: Server
Whitelisted Cookies: leave empty
image-1648046804943

Paste the output of [root@k8s-7-200 ~]# cat /opt/certs/ca.pem into CA Cert
Paste the output of [root@k8s-7-200 ~]# cat /opt/certs/client.pem into Client Cert
Paste the output of [root@k8s-7-200 ~]# cat /opt/certs/client-key.pem into Client Key

Scrape interval: how often metrics are collected
image-1648046843088
image-1648046857483
Go to Plugins and click kubernetes.
image-1648046880098
After clicking Enable, click kubernetes in the left-hand menu, then Clusters, then create a new cluster as shown below.
image-1648046907600
image-1648046930329
image-1648046948164
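Name: anything you like.
image-1648046974797
URL: the apiserver address. kube-apiserver on 10.4.7.21 and 10.4.7.22 listens on port 6443 and is proxied through port 7443 on 10.4.7.10.
Fill in the corresponding CA and certificates.
image-1648047003768
Save.
image-1648047023448
Test (optional):
image-1648047065311
Click Save & Test.
An error is reported; ignore it here.
image-1648047084439
After clicking myk8s:
image-1648047106271
the dashboards are listed:
image-1648047124524
Open one of the dashboards to see the data: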
image-1648047142836
image-1648047152002
Click K8s Cluster in the top-left corner. The plugin generates four overview dashboards; for example, K8s Deployments shows how many Deployments exist, and K8s Container shows memory and CPU usage.
image-1648047174663
image-1648047184658
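Note: the K8s Container dashboard has a bug and shows no data by default. To fix it:
Click K8s Container in the screenshot above, click edit, and change
pod_name --> container_label_io_kubernetes_pod_name
image-1648047202419
Before:
image-1648047218842
After:
image-1648047238504
Because the bundled dashboards are buggy and their figures are inaccurate, delete all four of them and import replacements from the official site.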
image-1648047256301
image-1648047265544

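Importing new dashboards from the official site:

Method 1:

Look for a dashboard on the official site:
image-1648088988638
Click Dashboards (https://grafana.com/grafana/dashboards) and pick one of the most downloaded.
(The search only worked properly in Google Chrome.)
image-1648047324120
Find the dashboard with ID 15661:
image-1648047340797
In your own Grafana, click Import dashboard:
image-1648047363773
Enter the ID, then click Load:
image-1648047378796
Click Import and the dashboard is imported automatically: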
image-1648047393801

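Method 2:

When importing, click Upload .json File to load a local file.
Import the four dashboards that were just deleted:
image-1648047462586
image-1648047472715
Select the file, then click Import:
image-1648047488712
The dashboards after the import:
image-1648047504441
Note: everything these panels query is defined in advance in the prometheus.yml configuration file.
k8s Node
Shows both nodes, including how many nodes are running:
image-1648047540803
Disk, IO and network: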
image-1648047567164

k8s Cluster

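Shows a cluster overview: Cluster CPU Usage (CPU percentage) and Cluster Memory Usage (memory percentage).
image-1648047587049
Deployment Replicas: after importing, shows how many replicas there are.
image-1648047602266

Generic Dashboard

Monitors the host machines:
image-1648047635800

Etcd Dashboard

Import Etcd Dashboard.json first: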
image-1648047664416

Traefik Dashboard

Covers the services exposed through traefik-ingress.
Tip: the annotations below must be added.

"annotations": {
  "prometheus_io_scheme": "traefik",
  "prometheus_io_path": "/metrics",
  "prometheus_io_port": "8080"
}

image-1648047700906
For example, dashboard.itdo.top shows 177ms, meaning its latency is 177ms; anything under 3s is acceptable.
image-1648047729313

Blackbox Dashboard

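After importing, check the monitored targets.
Tip: two dashboard IDs are available, 9965 and 7587. Both work and differ only in style; pick one and import it.
Prometheus scrapes blackbox-exporter, and blackbox-exporter in turn probes dubbo-demo-server and the other services.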
image-1648047822186

JMX dashboard

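After importing, it monitors the JVM.
Tips:

  • The annotations must be added to the dp.yaml of the business container.
  • The business image is built with the monitoring client jmx_javaagent-0.3.1.jar, and the service is started with the argument java -jar -javaagent:/opt/prom/jmx_javaagent-0.3.1.jar=$(hostname -i):
  • This corresponds to the job_name: 'kubernetes-pods' rule in the prometheus.yml configuration file.
    Here we edit dp.yaml to add the annotations: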
"annotations": {
  "prometheus_io_scrape": "true",
  "prometheus_io_port": "12346",
  "prometheus_io_path": "/"
}
 
Note: 12346 is the JVM's remote port, on which the JMX agent listens.

image-1648047881704
image-1648047889210
image-1648047896666
image-1648047908998

Summary:

Monitoring Traefik:
"annotations": {
  "prometheus_io_scheme": "traefik",
  "prometheus_io_path": "/metrics",
  "prometheus_io_port": "8080"
}
 
Monitoring with Blackbox (TCP):
"annotations": {
  "blackbox_port": "20880",
  "blackbox_scheme": "tcp"
}
 
Monitoring with Blackbox (HTTP):
"annotations": {
  "blackbox_path": "/hello?name=health",
  "blackbox_port": "8080",
  "blackbox_scheme": "http"
}
 
Monitoring the JVM:
"annotations": {
  "prometheus_io_scrape": "true",
  "prometheus_io_port": "12346",
  "prometheus_io_path": "/"
}
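
One way to read this summary: each scrape job keeps only the pods that carry its annotation. The Python sketch below expresses that selection. It rests on assumptions: only the traefik job's keep rule is shown in this section, so the blackbox and kubernetes-pods conditions are inferred from the annotations and job names mentioned above, and matching_jobs is a hypothetical helper.

```python
def matching_jobs(annotations):
    """Return the Prometheus jobs expected to pick up a pod with these annotations."""
    jobs = []
    # traefik job: keep rule on prometheus_io_scheme == "traefik".
    if annotations.get("prometheus_io_scheme") == "traefik":
        jobs.append("traefik")
    # Blackbox jobs (assumed): selected by blackbox_scheme.
    scheme = annotations.get("blackbox_scheme")
    if scheme == "tcp":
        jobs.append("blackbox_tcp_pod_probe")
    elif scheme == "http":
        jobs.append("blackbox_http_pod_probe")
    # kubernetes-pods job (assumed): selected by prometheus_io_scrape == "true".
    if annotations.get("prometheus_io_scrape") == "true":
        jobs.append("kubernetes-pods")
    return jobs


print(matching_jobs({"blackbox_port": "20880", "blackbox_scheme": "tcp"}))
```

A pod may carry several annotation sets at once, for example a blackbox HTTP probe together with JVM scraping, in which case it shows up under more than one job.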

6. Deploying Alertmanager

6.1. Preparing the image

[root@k8s-7-200 ]# docker pull docker.io/prom/alertmanager:v0.19.0
[root@k8s-7-200 ]# docker images |grep alert
prom/alertmanager                                                    v0.19.0                         30594e96cbe8        17 months ago       53.2MB
[root@k8s-7-200 ]# docker image tag 30594e96cbe8 harbor.itdo.top/infra/alertmanager:v0.19.0
[root@k8s-7-200 ]# docker push harbor.itdo.top/infra/alertmanager:v0.19.0
 

Note: after the manifests are applied, the container of this newer version may fail to start
with the message: Back-off restarting failed container
image-1648048003208
Pod log: couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
image-1648048019021
Solution 1: downgrade to alertmanager:v0.14.0
Solution 2: add the startup argument --cluster.advertise-address=0.0.0.0:9093
image-1648048034335

6.2. Preparing the resource manifests

[root@k8s-7-200 ~]# mkdir /data/k8s-yaml/alertmanager && cd /data/k8s-yaml/alertmanager

[root@k8s-7-200 alertmanager]# vi cm.yaml   # note: strip the comments before applying

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yml: |-
    global:
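      # Time after which an alert is declared resolved if it is no longer reported
      resolve_timeout: 5m
      # Email delivery settings
      smtp_smarthost: 'smtp.itdo.top:465'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: '×××××××'
      smtp_require_tls: false
    # Root route that every incoming alert enters; it defines how alerts are dispatched
    route:
      # Labels used to regroup incoming alerts. For example, alerts carrying cluster=A and alertname=LatencyHigh are aggregated into a single group
      group_by: ['alertname', 'cluster']
      # After a new group is created, wait at least group_wait before sending the first notification, so that several alerts of the same group can be collected and sent together
      group_wait: 30s
      # After the first notification has been sent, wait group_interval before notifying about new alerts added to the group
      group_interval: 5m
      # If a notification has already been sent successfully, wait repeat_interval before sending it again
      repeat_interval: 5m
      # Default receiver: alerts not matched by any sub-route go to this receiver
      receiver: default
    receivers:
    - name: 'default'
      email_configs:
      - to: '[email protected]'  # recipient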
        send_resolved: true
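
The effect of group_by can be illustrated with a few lines of Python. This is only a sketch of the grouping idea, not Alertmanager's implementation; group_alerts and the sample alerts (including the instance values) are made up for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node-1"},
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node-2"},
    {"alertname": "ProbeFailed", "cluster": "A", "instance": "node-1"},
]
for key, members in group_alerts(alerts, ["alertname", "cluster"]).items():
    print(key, len(members))
```

The two LatencyHigh alerts share the same alertname and cluster, so they land in one group and would go out in a single notification; the ProbeFailed alert forms its own group.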

[root@k8s-7-200 alertmanager]# vi dp.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: harbor.itdo.top/infra/alertmanager:v0.19.0   # change to your own registry address
        args:
          - "--config.file=/etc/alertmanager/config.yml"
          - "--storage.path=/alertmanager"
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-cm
          mountPath: /etc/alertmanager
      volumes:
      - name: alertmanager-cm
        configMap:
          name: alertmanager-config
      imagePullSecrets:
      - name: harbor

[root@k8s-7-200 alertmanager]# vi svc.yaml

# Prometheus reaches Alertmanager through the service name, not through an ingress domain
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: infra
spec:
  selector: 
    app: alertmanager
  ports:
    - port: 80
      targetPort: 9093

6.3. Applying the resource manifests

[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/alertmanager/cm.yaml
configmap/alertmanager-config created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/alertmanager/dp.yaml
deployment.extensions/alertmanager created
[root@k8s-7-21 ~]# kubectl apply -f http://k8s-yaml.itdo.top/alertmanager/svc.yaml
service/alertmanager created

6.4. Connecting Alertmanager to Prometheus

Add the alerting rules (they work as-is; to go deeper, study PromQL):

[root@k8s-7-200 ~]# vi /data/nfs-volume/prometheus/etc/rules.yml # lives under the prometheus directory
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
  - alert: OutOfInodes
    expr: node_filesystem_files_free{fstype="overlay",mountpoint ="/"} / node_filesystem_files{fstype="overlay",mountpoint ="/"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of inodes (instance {{ $labels.instance }})"
      description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
  - alert: OutOfDiskSpace
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadLatency
    expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
  - alert: UnusualDiskWriteLatency
    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
- name: http_status
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Probe failed (instance {{ $labels.instance }})"
      description: "Probe failed (current value: {{ $value }})"
  - alert: StatusCode
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Status Code (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399 (current value: {{ $value }})"
  - alert: SslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days (current value: {{ $value }})"
  - alert: SslCertificateHasExpired
    expr: probe_ssl_earliest_cert_expiry - time()  <= 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "SSL certificate has expired (instance {{ $labels.instance }})"
      description: "SSL certificate has expired already (current value: {{ $value }})"
  - alert: BlackboxSlowPing
    expr: probe_icmp_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow ping (instance {{ $labels.instance }})"
      description: "Blackbox ping took more than 2s (current value: {{ $value }})"
  - alert: BlackboxSlowRequests
    expr: probe_http_duration_seconds > 2 
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow requests (instance {{ $labels.instance }})"
      description: "Blackbox request took more than 2s (current value: {{ $value }})"
  - alert: PodCpuUsagePercent
    expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"

Modify the Prometheus configuration so that it calls Alertmanager:

[root@k8s-7-200 ~]# vim /data/nfs-volume/prometheus/etc/prometheus.yml # append at the end to hook up the alerting rules
......
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager"]   # the service name of alertmanager
rule_files:
 - "/data/etc/rules.yml"

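Then reload Prometheus. In production, do not delete the Prometheus pod: it holds a lot of data and takes too long to start. Prometheus was pinned to k8s-7-21 earlier, so look up its PID there (4468 in this case). Prometheus supports reloading its configuration gracefully after the file changes: kill -SIGHUP <pid> sends the SIGHUP signal, which triggers the reload.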

[root@k8s-7-21 ~]# ps aux |grep prometheus
root       2576  0.3  0.9 169012 38264 ?        Ssl  08:34   0:20 traefik traefik --api --kubernetes --logLevel=INFO --insecureskipverify=true --kubernetes.endpoint=https://10.4.7.10:7443 --accesslog --accesslog.filepath=/var/log/traefik_access.log --traefiklog --traefiklog.filepath=/var/log/traefik.log --metrics.prometheus
root       4468 83.3 26.5 1842404 1024864 ?     Ssl  08:38  86:25 /bin/prometheus --config.file=/data/etc/prometheus.yml --storage.tsdb.path=/data/prom-db --storage.tsdb.min-block-duration=5m --storage.tsdb.retention=24h
root      35460  0.0  0.0 112832   976 pts/0    S+   10:22   0:00 grep --color=auto prometheus
[root@k8s-7-21 ~]# kill -SIGHUP 4468

Open Alerts to see the alerting rules:
image-1648048232949

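Testing the alert:

Stop dubbo-demo-service; dubbo-demo-consumer is then bound to fail, which triggers the alert.
image-1648048257838
Blackbox already reports the failure:
image-1648048271598
Check Alerts: yellow marks an alert that is pending. Once it turns red (firing), the email is sent.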
image-1648048288820
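
The step from yellow to red is governed by the for: field of the rule: an alert stays pending until its expression has been true for that long, and only then fires and is handed to Alertmanager. A minimal Python sketch of that state logic, using the 1m hold time of the ProbeFailed rule; alert_state is a hypothetical helper and time is counted in plain seconds.

```python
def alert_state(active_since, now, for_seconds):
    """Return the alert state given when the expression first became true.

    active_since is None while the expression is false.
    """
    if active_since is None:
        return "inactive"
    if now - active_since < for_seconds:
        return "pending"   # shown in yellow, no notification yet
    return "firing"        # shown in red, sent to Alertmanager


FOR_SECONDS = 60  # "for: 1m" from the ProbeFailed rule
for now in (100, 130, 160):
    print(now, alert_state(100, now, FOR_SECONDS))
```

With for: 1m, a probe that has been failing for 30 seconds is still pending; after a full minute the alert fires.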
Check the alert email:
image-1648048339992