What happened:
I created a PyTorchJob CR managed by Kueue and stopped the training by setting the corresponding Workload's .spec.active flag to false. However, when I then modified the PyTorchJob, the corresponding Workload got activated again.
What you expected to happen:
I expected the Workload related to the modified PyTorchJob to stay deactivated.
How to reproduce it (as minimally and precisely as possible):
Install the latest Training Operator: kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.3"
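Before proceeding, you can verify the operator came up (a quick check; that the standalone overlay installs a training-operator Deployment into the kubeflow namespace is an assumption about the default install):
kubectl -n kubeflow get deployment training-operator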
Set up sample Kueue resources:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "pods"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq
spec:
  clusterQueue: cluster-queue
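Apply the manifests above (a sketch assuming they are saved as kueue-resources.yaml; the filename is arbitrary). The ResourceFlavor and ClusterQueue are cluster-scoped, while the LocalQueue is created in the current namespace (default here):
kubectl apply -f kueue-resources.yaml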
Create a sample PyTorchJob:
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
name: "pytorch-sleep-job-small"
labels:
kueue.x-k8s.io/queue-name: lq
spec:
pytorchReplicaSpecs:
Worker:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: python:3.9-alpine
command:
- "python"
- "-c"
- "import time; print('Container started with small image, sleeping for 60s...'); time.sleep(60); print('Done sleeping.')"
Once the corresponding training Pod starts, stop the corresponding Workload:
for wl in $(kubectl get workload -n default -o jsonpath='{.items[*].metadata.name}'); do kubectl patch workload $wl -n default --type merge -p '{"spec":{"active":false}}'; done
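To confirm the deactivation took effect, the active flag can be listed per Workload (a sketch; the custom-columns names here are mine, not standard Kueue output):
kubectl get workload -n default -o custom-columns=NAME:.metadata.name,ACTIVE:.spec.active
At this point the PyTorchJob should be suspended and its Pod terminated.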
Scale the PyTorchJob:
kubectl patch pytorchjob pytorch-sleep-job-small --type merge -p '{"spec":{"pytorchReplicaSpecs":{"Worker":{"replicas":2}}}}'
The Workload gets activated again.
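Re-running the same check after the patch shows the flip (hedged; whether Kueue updates the original Workload in place or recreates it could be confirmed by comparing .metadata.uid before and after):
kubectl get workload -n default -o custom-columns=NAME:.metadata.name,ACTIVE:.spec.active
ACTIVE reads true again even though it was explicitly set to false before the scale-up.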
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): v1.32.2
- Kueue version (use git describe --tags --dirty --always): v0.13.3
- Cloud provider or hardware configuration:
- OS (e.g.: cat /etc/os-release): Fedora 42
- Kernel (e.g. uname -a):
- Install tools:
- Others: