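Kubernetes v1.36 Deep Dive: Dynamic Resource Allocation (DRA) Reaches GA, a Turning Point for GPU Scheduling

DRA Goes GA: A New Era for Kubernetes GPU Scheduling

Kubernetes v1.36 was released on April 22, 2026, with DRA (Dynamic Resource Allocation) generally available (GA). This fundamentally changes how K8s handles hardware accelerators such as GPUs and FPGAs.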


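1. Three Pain Points of the Legacy Device Plugin

# Legacy approach: GPUs can only be requested by count, with no way to describe requirements
resources:
  limits:
    nvidia.com/gpu: "2"  # cannot specify model, memory size, etc.

Limitations: allocation by count only | no GPU sharing or time-slicing | no cross-node topology awareness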


2. Core DRA Configuration

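DeviceClass - device type definition

# The GA API (resource.k8s.io/v1) selects devices with CEL expressions in a DeviceClass;
# the earlier ResourceClass/parametersRef objects are not part of it.
# Attribute and capacity names are published by the driver in ResourceSlices. The names
# below assume the NVIDIA DRA driver; confirm them with `kubectl get resourceslices -o yaml`.
# Selection: a full GPU (not a MIG slice), at least 40Gi of memory, compute capability 8.0 or newer.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-nvidia-a100
spec:
  selectors:
  - cel:
      expression: |-
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].type == "gpu" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0 &&
        device.attributes["gpu.nvidia.com"].cudaComputeCapability.compareTo(semver("8.0.0")) >= 0

ResourceClaim + Pod usage

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: my-gpu-claim
  namespace: ml-training
spec:
  # The device is allocated when the first consuming Pod is scheduled
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu-nvidia-a100
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer
  namespace: ml-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.3
    resources:
      claims:
      - name: gpu-resource          # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: gpu-resource
    resourceClaimName: my-gpu-claim # the ResourceClaim defined above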
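A pre-created ResourceClaim is shared by every Pod that names it. When each Pod should instead receive its own device, the resource.k8s.io/v1 API offers ResourceClaimTemplate, from which Kubernetes generates one claim per Pod. The following is a minimal sketch under that API; it assumes a device class named gpu-nvidia-a100 exists, and the template and Pod names are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template        # hypothetical name
  namespace: ml-training
spec:
  spec:                           # the claim spec that is stamped out per Pod
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu-nvidia-a100
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-from-template     # hypothetical name
  namespace: ml-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.3
    resources:
      claims:
      - name: gpu-resource
  resourceClaims:
  - name: gpu-resource
    # A dedicated ResourceClaim is created for this Pod and deleted with it
    resourceClaimTemplateName: gpu-claim-template
```

The template form tends to suit Deployments and Jobs, where the number of Pods is not known in advance.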

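3. GPU Time-Sharing (New in DRA)

# Driver-specific settings are attached to a claim as opaque parameters.
# The GpuConfig fields below follow the NVIDIA DRA driver and should be verified
# against the driver version in use; its time-slice interval is a named level
# (Default, Short, Medium, Long) rather than a duration such as "5ms".
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu-claim
spec:
  devices:
    requests:
    - name: shared-gpu
      exactly:
        deviceClassName: gpu.nvidia.com
    config:
    - requests: ["shared-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Short
# Pods that reference this same claim share one GPU, e.g. 4 Pods for an inference service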
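In DRA, Pods that reference the same ResourceClaim are given access to the same allocated device, which is how one time-sliced GPU ends up serving several inference Pods. A minimal sketch, assuming a claim named shared-gpu-claim with a time-slicing configuration already exists in the namespace; the Deployment name, labels, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference                 # hypothetical name
spec:
  replicas: 4                     # four Pods share the one allocated GPU
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: server
        image: example.com/inference-server:latest   # hypothetical image
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        # Every replica names the same claim, so all of them use the same device
        resourceClaimName: shared-gpu-claim
```

Because the claim is bound to a device on one node, all replicas that use it are scheduled onto that node.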

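4. Migrating from the Device Plugin to DRA

# Install the NVIDIA GPU Operator with DRA support
# (value names can differ between chart versions; check `helm show values nvidia/gpu-operator`)
helm upgrade --install nvidia-gpu-operator nvidia/gpu-operator \
    --set devicePlugin.enabled=false \
    --set dra.enabled=true \
    -n gpu-operator --create-namespace

# Verify
kubectl get deviceclasses
# The gpu.nvidia.com DeviceClass should be listed

# Test Pod scheduling
kubectl apply -f test-gpu-pod.yaml
kubectl describe pod test-gpu-pod | grep -A 5 "Events:"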
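The contents of test-gpu-pod.yaml are not shown here. One possible version is sketched below, assuming the resource.k8s.io/v1 API and a DeviceClass named gpu.nvidia.com installed by the driver. The template name and image are illustrative, and nvidia-smi is assumed to be injected into the container by the NVIDIA container runtime.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: test-gpu                  # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: ubuntu:22.04
    # List the GPUs visible inside the container, then exit
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: test-gpu
```

If allocation succeeds, `kubectl logs test-gpu-pod` should list the allocated GPU; if the Pod stays Pending, the Events section usually indicates whether no device matched the claim.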

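5. DRA Monitoring Metrics

# New Prometheus metrics
kube_resourceclaim_allocated{namespace,name,resource_class}
kube_resourceclaim_pending{namespace,name}
node_dra_resource_available{resource_class,attribute}

Alerting rule:

# kube_resourceclaim_pending carries per-claim labels (namespace, name),
# so sum it before comparing against the threshold
- alert: DRAResourceExhausted
  expr: sum(kube_resourceclaim_pending) > 10
  for: 5m
  annotations:
    summary: "DRA resources exhausted: {{ $value }} ResourceClaims waiting for allocation"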

DRA reaching GA is a milestone for managing AI workloads on Kubernetes, and an operations upgrade worth investing in during 2026.