Using GPUs with Kubernetes on GKE and node auto-provisioning

Asked by: 小点点

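I'm trying to do something fairly simple: run GPU machines in a k8s cluster using auto-provisioning. When I deploy a Pod with a limits: nvidia.com/gpu specification, auto-provisioning correctly creates a node pool and scales up an appropriate node. However, the Pod stays Pending with the following message:

Warning  FailedScheduling  59s (x5 over 2m46s)  default-scheduler  0/10 nodes are available: 10 Insufficient nvidia.com/gpu.

GKE seems to add the taints and tolerations correctly. It just doesn't scale up.

I followed the instructions here: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

To reproduce:

Create a new cluster in a zone, with auto-provisioning that includes a GPU (I have replaced my own project name with MYPROJECT). With these changes made, the console produces the following command: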

gcloud beta container --project "MYPROJECT" clusters create "cluster-2" --zone "europe-west4-a" --no-enable-basic-auth --cluster-version "1.18.12-gke.1210" --release-channel "regular" --machine-type "e2-medium" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/MYPROJECT/global/networks/default" --subnetwork "projects/MYPROJECT/regions/europe-west4/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-autoprovisioning --min-cpu 1 --max-cpu 20 --min-memory 1 --max-memory 50 --max-accelerator type="nvidia-tesla-p100",count=1 --enable-autoprovisioning-autorepair --enable-autoprovisioning-autoupgrade --autoprovisioning-max-surge-upgrade 1 --autoprovisioning-max-unavailable-upgrade 0 --enable-vertical-pod-autoscaling --enable-shielded-nodes --node-locations "europe-west4-a"

Install the NVIDIA drivers by installing the DaemonSet:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Deploy a pod that requests a GPU:

my-gpu-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1

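kubectl apply -f my-gpu-pod.yaml

Help would be much appreciated, as I've already spent quite a bit of time on this :)

Edit: here are the descriptions of the Pod and of the Node (the auto-scaled node):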

Name:         my-gpu-pod
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  my-gpu-container:
    Image:      nvidia/cuda:11.0-runtime-ubuntu18.04
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      while true; do sleep 600; done;
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9rvjz (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-9rvjz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9rvjz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  11m                  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added):
  Warning  FailedScheduling   5m54s (x6 over 11m)  default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   54s (x7 over 5m37s)  default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.

Name:               gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-standard-1
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-accelerator=nvidia-tesla-p100
                    cloud.google.com/gke-boot-disk=pd-standard
                    cloud.google.com/gke-nodepool=nap-n1-standard-1-gpu1-18jc7z9w
                    cloud.google.com/gke-os-distribution=cos
                    cloud.google.com/machine-family=n1
                    failure-domain.beta.kubernetes.io/region=europe-west4
                    failure-domain.beta.kubernetes.io/zone=europe-west4-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=n1-standard-1
                    topology.gke.io/zone=europe-west4-a
                    topology.kubernetes.io/region=europe-west4
                    topology.kubernetes.io/zone=europe-west4-a
Annotations:        container.googleapis.com/instance_id: 7877226485154959129
                    csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/exor-arctic/zones/europe-west4-a/instances/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2"}
                    node.alpha.kubernetes.io/ttl: 0
                    node.gke.io/last-applied-node-labels:
                      cloud.google.com/gke-accelerator=nvidia-tesla-p100,cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-nodepool=nap-n1-standar...
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 22 Mar 2021 11:32:17 +0100
Taints:             nvidia.com/gpu=present:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
  AcquireTime:     <unset>
  RenewTime:       Mon, 22 Mar 2021 11:38:58 +0100
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  KernelDeadlock                False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   FilesystemIsNotReadOnly         Filesystem is not read-only
  CorruptDockerOverlay2         False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  FrequentUnregisterNetDevice   False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentUnregisterNetDevice   node is functioning properly
  FrequentKubeletRestart        False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentContainerdRestart     containerd is functioning properly
  NetworkUnavailable            False   Mon, 22 Mar 2021 11:32:18 +0100   Mon, 22 Mar 2021 11:32:18 +0100   RouteCreated                    NodeController create implicit route
  MemoryPressure                False   Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:17 +0100   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:17 +0100   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:17 +0100   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:19 +0100   KubeletReady                    kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:   10.164.0.16
  ExternalIP:   35.204.55.105
  InternalDNS:  gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
  Hostname:     gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
Capacity:
  attachable-volumes-gce-pd:  127
  cpu:                        1
  ephemeral-storage:          98868448Ki
  hugepages-2Mi:              0
  memory:                     3776196Ki
  pods:                       110
Allocatable:
  attachable-volumes-gce-pd:  127
  cpu:                        940m
  ephemeral-storage:          47093746742
  hugepages-2Mi:              0
  memory:                     2690756Ki
  pods:                       110
System Info:
  Machine ID:                 307671eefc01914a7bfacf17a48e087e
  System UUID:                307671ee-fc01-914a-7bfa-cf17a48e087e
  Boot ID:                    acd58f3b-1659-494c-b83d-427f834d23a6
  Kernel Version:             5.4.49+
  OS Image:                   Container-Optimized OS from Google
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.9
  Kubelet Version:            v1.18.12-gke.1210
  Kube-Proxy Version:         v1.18.12-gke.1210
PodCIDR:                      10.100.1.0/24
PodCIDRs:                     10.100.1.0/24
ProviderID:                   gce://exor-arctic/europe-west4-a/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                              ------------  ----------  ---------------  -------------  ---
  kube-system                 fluentbit-gke-k22gv                                               100m (10%)    0 (0%)      200Mi (7%)       500Mi (19%)    6m46s
  kube-system                 gke-metrics-agent-5fblx                                           3m (0%)       0 (0%)      50Mi (1%)        50Mi (1%)      6m47s
  kube-system                 kube-proxy-gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2    100m (10%)    0 (0%)      0 (0%)           0 (0%)         6m44s
  kube-system                 nvidia-driver-installer-vmw8r                                     150m (15%)    0 (0%)      0 (0%)           0 (0%)         6m45s
  kube-system                 nvidia-gpu-device-plugin-8vqsl                                    50m (5%)      50m (5%)    10Mi (0%)        10Mi (0%)      6m45s
  kube-system                 pdcsi-node-k9brg                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         6m47s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests    Limits
  --------                   --------    ------
  cpu                        403m (42%)  50m (5%)
  memory                     260Mi (9%)  560Mi (21%)
  ephemeral-storage          0 (0%)      0 (0%)
  hugepages-2Mi              0 (0%)      0 (0%)
  attachable-volumes-gce-pd  0           0
Events:
  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 6m47s                  kubelet          Starting kubelet.
  Normal   NodeAllocatableEnforced  6m47s                  kubelet          Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  6m46s (x4 over 6m47s)  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6m46s (x4 over 6m47s)  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     6m46s (x4 over 6m47s)  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientPID
  Normal   NodeReady                6m45s                  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeReady
  Normal   Starting                 6m44s                  kube-proxy       Starting kube-proxy.
  Warning  NodeSysctlChange         6m41s                  sysctl-monitor
  Warning  ContainerdStart          6m41s                  systemd-monitor  Starting containerd container runtime...
  Warning  DockerStart              6m41s (x2 over 6m41s)  systemd-monitor  Starting Docker Application Container Engine...
  Warning  KubeletStart             6m41s                  systemd-monitor  Started Kubernetes kubelet.

3 answers

Anonymous user

According to the Kubernetes documentation (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#nvidia-gpu-device-plugin-used-by-gce), we should use https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml instead.

So could you run the following?

kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
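
Whichever DaemonSet is installed, it may help to confirm that the GPU node actually advertises the resource. In the node description in the question, neither Capacity nor Allocatable lists nvidia.com/gpu, which is consistent with the scheduler's "Insufficient nvidia.com/gpu" message. The following is a minimal Python sketch, not an official tool: the function names gpu_counts and fetch_nodes are hypothetical, and it assumes kubectl is on the PATH and already pointed at the cluster.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(node_list: dict) -> dict:
    """Map each node name to the number of GPUs it reports as allocatable."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities arrive as strings. A missing key means no GPU has been
        # registered on that node (for example, the driver is not installed).
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes() -> dict:
    """Hypothetical helper: read the node list as JSON through kubectl."""
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(raw)
```

Calling gpu_counts(fetch_nodes()) against the cluster should report 1 for the auto-provisioned node once the driver installer and device plugin have finished; a value of 0 there suggests the Pod will stay Pending.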

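Anonymous user

A common error with GKE is project quotas limiting resources, which can prevent nodes from being auto-provisioned or scaled up because the resources cannot be allocated.

Perhaps your project's GPU quota (or the one specifically for nvidia-tesla-p100) is set to 0 or is lower than the number requested.

More information on how to check it, and on how to request more resources for your quota, is available at this link.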
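
As a rough illustration of the quota check, the sketch below reads the quota list that gcloud compute regions describe returns in JSON form and computes the remaining headroom for one metric. It is only a sketch under assumptions: the helper names are hypothetical, the metric name NVIDIA_P100_GPUS is assumed to be the regional quota that applies to nvidia-tesla-p100, and a project-wide GPU quota may apply as well.

```python
import json
import subprocess
from typing import Optional


def quota_headroom(region: dict, metric: str) -> Optional[float]:
    """Return limit minus usage for one quota metric, or None if absent."""
    for quota in region.get("quotas", []):
        if quota.get("metric") == metric:
            return float(quota["limit"]) - float(quota["usage"])
    return None


def describe_region(region_name: str) -> dict:
    """Hypothetical helper: fetch a region description through gcloud."""
    raw = subprocess.run(
        ["gcloud", "compute", "regions", "describe", region_name,
         "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(raw)
```

For the cluster in the question, quota_headroom(describe_region("europe-west4"), "NVIDIA_P100_GPUS") would need to be at least 1 for a single-GPU node to be created.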

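Also, I see you are using shared-core E2 instances, which are not compatible with accelerators. This shouldn't be a problem, because if GKE detects that a workload includes a GPU it should automatically change the machine type to N1, as shown in this link, but you could still try running the cluster with another machine type, such as N1.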

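Anonymous user

You may be running into a scopes issue.

When using node auto-provisioning with GPUs, the auto-provisioned node pools by default don't have sufficient scopes to run the installation DaemonSet. You need to manually change the default auto-provisioning scopes to enable it.

In this case, the scopes required according to the documentation at the time of writing are: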

[ "https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/compute"
]

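This issue is mentioned here: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#using_node_auto-provisioning_with_gpus

You probably just need to expand the scopes and retry. Doing it manually works because you have the necessary scopes.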
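
To make the comparison concrete, here is a small Python sketch that reports which of the four scopes listed above are missing from a given scope list. The function name missing_scopes is hypothetical. The example input is the --scopes value from the cluster creation command in the question, which configures the default node pool and is not necessarily what auto-provisioned pools receive, so treat it purely as an illustration.

```python
# The four scopes quoted in this answer.
REQUIRED_SCOPES = [
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/compute",
]

# Illustrative input: the --scopes list from the question's create command.
QUESTION_SCOPES = [
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/trace.append",
]


def missing_scopes(configured, required=REQUIRED_SCOPES):
    """Return the required scopes absent from `configured`, in order."""
    configured_set = set(configured)
    return [scope for scope in required if scope not in configured_set]


print(missing_scopes(QUESTION_SCOPES))
# prints ['https://www.googleapis.com/auth/compute']
```

With that scope list, only the compute scope is reported as missing, which matches the suggestion to expand the scopes and retry.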