
Modified resource activates deactivated Workload #6719

@sutaakar

Description

What happened:
I created a PyTorchJob CR managed by Kueue and stopped the training by setting the corresponding Workload's .spec.active flag to false. However, when I modified the PyTorchJob, the corresponding Workload got activated again.

What you expected to happen:
I expected the Workload related to the modified PyTorchJob to stay deactivated.

How to reproduce it (as minimally and precisely as possible):
Install the latest Training Operator: kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.3"
Set up sample Kueue resources:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "pods"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq
spec:
  clusterQueue: cluster-queue 
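
Assuming the manifest above is saved as kueue-resources.yaml (an arbitrary file name), it can be applied with:

kubectl apply -f kueue-resources.yaml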

Create sample PyTorchJob:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-sleep-job-small"
  labels:
    kueue.x-k8s.io/queue-name: lq
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: python:3.9-alpine
              command:
                - "python"
                - "-c"
                - "import time; print('Container started with small image, sleeping for 60s...'); time.sleep(60); print('Done sleeping.')"

Once the training Pod starts, deactivate the corresponding Workload:

for wl in $(kubectl get workload -n default -o jsonpath='{.items[*].metadata.name}'); do kubectl patch workload $wl -n default --type merge -p '{"spec":{"active":false}}'; done
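
One way to confirm the deactivation is to print each Workload's .spec.active field:

kubectl get workload -n default -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.active}{"\n"}{end}'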

Scale the PyTorchJob:

kubectl patch pytorchjob pytorch-sleep-job-small --type merge -p '{"spec":{"pytorchReplicaSpecs":{"Worker":{"replicas":2}}}}'

The Workload gets activated again.
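
Re-running the same check shows .spec.active back to true, even though the Workload was explicitly deactivated and never re-activated manually:

kubectl get workload -n default -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.active}{"\n"}{end}'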

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.32.2
  • Kueue version (use git describe --tags --dirty --always): v0.13.3
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release): Fedora 42
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
