我正在尝试查询GKE pod的GPU使用指标。
以下是我为测试所做的:
kubectl create-f dcgm-exporter. yaml
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
spec:
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
template:
metadata:
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
name: "dcgm-exporter"
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-accelerator
operator: Exists
containers:
- image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
# resources:
# limits:
# nvidia.com/gpu: "1"
env:
- name: "DCGM_EXPORTER_LISTEN"
value: ":9400"
- name: "DCGM_EXPORTER_KUBERNETES"
value: "true"
name: "dcgm-exporter"
ports:
- name: "metrics"
containerPort: 9400
securityContext:
runAsNonRoot: false
runAsUser: 0
capabilities:
add: ["SYS_ADMIN"]
volumeMounts:
- name: "pod-gpu-resources"
readOnly: true
mountPath: "/var/lib/kubelet/pod-resources"
tolerations:
- effect: "NoExecute"
operator: "Exists"
- effect: "NoSchedule"
operator: "Exists"
volumes:
- name: "pod-gpu-resources"
hostPath:
path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9400'
spec:
selector:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
ports:
- name: "metrics"
port: 9400
time="2020-11-21T04:27:21Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-11-21T04:27:21Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
通过取消注释资源:限制:nvidia.com/gpu:1
,它成功运行。然而,我不希望这个pod占用任何GPU,而只是观看它们。
我如何在不为其分配GPU的情况下运行dcgm导出器?我尝试了Ubuntu节点,但也失败了。
它与这些工作:
特权:true
设置为securityContext
。“nvidia-install-dir-host”
。apiVersion: apps/v1
kind: DaemonSet
metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
spec:
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
template:
metadata:
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
name: "dcgm-exporter"
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-accelerator
operator: Exists
containers:
- image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
env:
- name: "DCGM_EXPORTER_LISTEN"
value: ":9400"
- name: "DCGM_EXPORTER_KUBERNETES"
value: "true"
name: "dcgm-exporter"
ports:
- name: "metrics"
containerPort: 9400
securityContext:
privileged: true
volumeMounts:
- name: "pod-gpu-resources"
readOnly: true
mountPath: "/var/lib/kubelet/pod-resources"
- name: "nvidia-install-dir-host"
mountPath: "/usr/local/nvidia"
tolerations:
- effect: "NoExecute"
operator: "Exists"
- effect: "NoSchedule"
operator: "Exists"
volumes:
- name: "pod-gpu-resources"
hostPath:
path: "/var/lib/kubelet/pod-resources"
- name: "nvidia-install-dir-host"
hostPath:
path: "/home/kubernetes/bin/nvidia"
---
kind: Service
apiVersion: v1
metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9400'
spec:
selector:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.1"
ports:
- name: "metrics"
port: 9400