Asked by: 小点点

On GKE, the dcgm-exporter pod fails to run unless it is allocated an nvidia.com/gpu resource


I am trying to query GPU usage metrics for GKE pods.

Here is what I did to test:

  1. Created a GKE cluster with two node pools: one with two CPU-only nodes, and one with a node that has an NVIDIA Tesla T4 GPU. All nodes run Container-Optimized OS.
  2. As described at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, I ran kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml.
  3. Ran kubectl create -f dcgm-exporter.yaml with the following manifest:
# dcgm-exporter.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        # resources:
        #   limits:
        #     nvidia.com/gpu: "1"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      tolerations:
        - effect: "NoExecute"
          operator: "Exists"
        - effect: "NoSchedule"
          operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---

kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400

The dcgm-exporter pod then fails to start with this log:
time="2020-11-21T04:27:21Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-11-21T04:27:21Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

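If I uncomment resources: limits: nvidia.com/gpu: "1", the pod runs successfully. However, I don't want this pod to occupy a GPU; it should only monitor them.

How can I run dcgm-exporter without allocating a GPU to it? I also tried Ubuntu nodes, but that failed as well.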


1 answer

Anonymous user

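It works with these two changes:

  1. Set privileged: true in the container's securityContext.
  2. Add a volume mount named "nvidia-install-dir-host" that maps the host directory /home/kubernetes/bin/nvidia to /usr/local/nvidia in the container.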
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          privileged: true
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "nvidia-install-dir-host"
          mountPath: "/usr/local/nvidia"
      tolerations:
        - effect: "NoExecute"
          operator: "Exists"
        - effect: "NoSchedule"
          operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "nvidia-install-dir-host"
        hostPath:
          path: "/home/kubernetes/bin/nvidia"
---

kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
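
One way to confirm that the exporter is really serving GPU metrics without holding an nvidia.com/gpu allocation is to scrape it once by hand. The following is only a sketch, not part of the original answer. It assumes the DaemonSet and Service above were created in the current namespace, that local port 9400 is free, that curl is installed, and that DCGM_FI_DEV_GPU_UTIL is among the counters this exporter image exposes. The names filter_gpu_util and RUN_LIVE are made up for this example.

```shell
# Hypothetical helper: keep only the GPU utilization samples from
# Prometheus text-format output read on stdin.
filter_gpu_util() {
  grep '^DCGM_FI_DEV_GPU_UTIL' || true
}

# Live check, skipped unless RUN_LIVE=1 (a made-up switch for this sketch).
if [ "${RUN_LIVE:-0}" = "1" ]; then
  # The exporter pod should be Running on every GPU node.
  kubectl get pods -l app.kubernetes.io/name=dcgm-exporter -o wide

  # Forward the Service port locally and scrape the metrics endpoint once.
  kubectl port-forward service/dcgm-exporter 9400:9400 &
  pf_pid=$!
  sleep 3  # give the port-forward a moment to start listening
  curl -s http://localhost:9400/metrics | filter_gpu_util
  kill "$pf_pid"
fi
```

Run it with RUN_LIVE=1 against the cluster. If the filter prints nothing, the pod log (kubectl logs on the dcgm-exporter pod) is the next place to look for the "Failed to initialize NVML" error from the question.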