How to Dynamically Adjust Resource Allocations for Suspended Kubernetes Jobs (v1.36 Beta)

Introduction

Kubernetes v1.36 adds a capability aimed at batch and machine learning workloads: you can modify container resource requests and limits in the pod template of a suspended Job. The feature is beta in v1.36 (it was introduced as alpha in v1.35). It lets queue controllers and administrators adjust CPU, memory, GPU, and other extended resource specifications while a Job is suspended, before it starts or resumes running. You can therefore adapt resource allocations without deleting and recreating the Job, which preserves its metadata and status.

In this step-by-step guide, you'll learn how to leverage this feature to dynamically adjust resources for suspended Jobs, ensuring efficient cluster utilization and smoother operation of resource‑intensive workloads.

What You Need

A Kubernetes cluster running v1.36 or later (or v1.35 with the alpha feature gate enabled).
kubectl configured to interact with your cluster.
A basic understanding of Kubernetes Jobs and Suspended Jobs.
Optional: A queue controller (like Kueue) to automate resource adjustments.

Step-by-Step Guide

Step 1: Verify the Feature is Enabled

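In Kubernetes v1.36 this feature is beta and enabled by default, so no manual action is required. To confirm that the API server is running v1.36 or later, check the Server Version line in the output of:

kubectl version

If you're on v1.35, the feature is alpha and off by default, so you must enable the feature gate on the API server (JobMutablePodTemplate; verify the exact gate name in the feature gates reference for your release).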
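The version prerequisite can also be checked in a script. The sketch below is illustrative only: `minor_at_least` is a hypothetical helper, it assumes a version string of the form `vMAJOR.MINOR...`, and the usage lines assume `jq` is installed.

```shell
# minor_at_least VERSION MIN_MINOR
# Succeeds when VERSION (e.g. "v1.36.2") has a minor version >= MIN_MINOR.
# Hypothetical helper for illustration; assumes a "vMAJOR.MINOR..." string.
minor_at_least() {
  version=${1#v}            # strip the leading "v"
  minor=${version#*.}       # drop "MAJOR."
  minor=${minor%%[!0-9]*}   # keep only the leading digits of the minor part
  [ -n "$minor" ] && [ "$minor" -ge "$2" ]
}

# Usage against a live cluster (assumes jq is available):
#   server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   minor_at_least "$server" 36 && echo "beta feature should be on by default"
```

A passing check only tells you the server version; an administrator can still have switched the gate off explicitly.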

Step 2: Create a Suspended Job

Define a Job manifest with spec.suspend: true. The Job is created in a suspended state, so the Job controller creates no Pods and you can modify its resources first. Below is an example of a machine learning training Job requesting 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training-suspended
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:latest
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

Save the manifest as job-suspended.yaml and apply it with kubectl apply -f job-suspended.yaml.
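Before changing anything, it can help to confirm that the Job really is suspended and has no active Pods. This is a minimal sketch; the function names `job_is_suspended` and `job_active_pods` are made up for this example, and it relies only on `kubectl get -o jsonpath`.

```shell
# job_is_suspended NAME
# Succeeds when the Job's spec.suspend field is currently true.
job_is_suspended() {
  [ "$(kubectl get job "$1" -o jsonpath='{.spec.suspend}')" = "true" ]
}

# job_active_pods NAME
# Prints the number of active Pods (status.active is omitted when it is zero).
job_active_pods() {
  active=$(kubectl get job "$1" -o jsonpath='{.status.active}')
  echo "${active:-0}"
}

# Usage:
#   job_is_suspended ml-training-suspended && job_active_pods ml-training-suspended
```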

Step 3: Modify Resource Requests/Limits While Suspended

Once the Job is created and in a suspended state, you can update its pod template's resources. Use kubectl edit or kubectl patch. For example, to reduce GPU count from 4 to 2 and adjust CPU/memory:

kubectl patch job ml-training-suspended --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/example-hardware-vendor.com~1gpu", "value": "2"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/example-hardware-vendor.com~1gpu", "value": "2"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "16Gi"}
]'

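Note: In a JSON Patch path, ~1 is the JSON Pointer (RFC 6901) escape for a slash, so example-hardware-vendor.com/gpu is written as example-hardware-vendor.com~1gpu. The API server checks that the new values are well-formed quantities (for example, non‑negative, with requests no larger than limits). It does not check them against cluster capacity, so confirm that yourself.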
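Hand-escaping resource names is easy to get wrong, so the escaping rule can be wrapped in a small helper. The following is a sketch under stated assumptions: `json_pointer_escape` and `resource_replace_op` are hypothetical names, and the generated path always targets the first container, as in the patch shown earlier.

```shell
# json_pointer_escape TOKEN
# Escapes one JSON Pointer reference token (RFC 6901): "~" -> "~0", "/" -> "~1".
# "~" must be handled first so the "~1" produced for "/" is not escaped again.
json_pointer_escape() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's|/|~1|g'
}

# resource_replace_op KIND RESOURCE VALUE
# Prints one JSON Patch "replace" operation for the first container.
# KIND is "requests" or "limits". Hypothetical helper for illustration.
resource_replace_op() {
  printf '{"op":"replace","path":"/spec/template/spec/containers/0/resources/%s/%s","value":"%s"}' \
    "$1" "$(json_pointer_escape "$2")" "$3"
}

# Usage: set the GPU request and limit to 2 in one patch.
#   gpu=example-hardware-vendor.com/gpu
#   kubectl patch job ml-training-suspended --type=json \
#     -p "[$(resource_replace_op requests "$gpu" 2),$(resource_replace_op limits "$gpu" 2)]"
```

The helpers only build text, so you can print the patch and inspect it before passing it to kubectl.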

Step 4: Resume the Job

After adjusting resources, unsuspend the Job by setting spec.suspend to false:

kubectl patch job ml-training-suspended -p '{"spec":{"suspend":false}}'

The Job will start creating Pods with the updated resource specifications. You can monitor progress with kubectl get pods -w.
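In a script, you may prefer to block until the Job controller has acted on the resume rather than watch Pods by hand. The sketch below assumes the Job carries the standard Suspended condition; `wait_until_resumed` is a hypothetical wrapper and the 60s default timeout is an arbitrary choice.

```shell
# wait_until_resumed NAME [TIMEOUT]
# Blocks until the Job's "Suspended" condition reports False, i.e. the Job
# controller has acted on spec.suspend=false. TIMEOUT defaults to 60s.
wait_until_resumed() {
  kubectl wait --for=condition=Suspended=false "job/$1" --timeout="${2:-60s}"
}

# Usage:
#   wait_until_resumed ml-training-suspended 120s
```

This only confirms that the Job has resumed. Whether its Pods are scheduled still depends on the cluster having the requested resources free.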

Step 5: Verify Resource Allocation

Check that the running Pods reflect the new resources, replacing the xxxxx suffix with the actual Pod name shown by kubectl get pods:

kubectl get pod ml-training-suspended-xxxxx -o jsonpath='{.spec.containers[0].resources}'

You should see the adjusted requests and limits. If a queue controller is managing the Job, it can also perform these updates automatically.
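To avoid looking up generated Pod names, you can select the Job's Pods by label instead. This is a sketch that assumes the Job controller has set the batch.kubernetes.io/job-name label on the Pods; `job_pod_resources` is a hypothetical helper name.

```shell
# job_pod_resources NAME
# Prints "<pod-name> <resources-json>" for every Pod owned by the Job,
# selected by the batch.kubernetes.io/job-name label the Job controller sets.
job_pod_resources() {
  kubectl get pods -l "batch.kubernetes.io/job-name=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[0].resources}{"\n"}{end}'
}

# Usage:
#   job_pod_resources ml-training-suspended
```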

Tips and Best Practices

Use with Queue Controllers: Tools like Kueue can automatically adjust resources based on cluster load and priorities. Integrate them to avoid manual patching.
Keep Jobs Suspended for Reconfiguration: Always suspend the Job before modifying resources. Attempting to change resources on an active Job will be rejected.
Mind the Limits: Ensure your requested resources do not exceed node capacity or resource quotas. The API validates the updated values only for basic format; actual enforcement happens at scheduling time.
Preserve Status and Metadata: Because you're modifying the existing Job rather than recreating it, you keep all annotations, labels, and status history – invaluable for auditing and debugging.
Test with Non‑Critical Workloads First: The feature is beta and enabled by default, but it is not yet GA and its behavior may still change. Test your modifications on development Jobs before applying them to production batch jobs.
Monitor Suspended Jobs: Use kubectl get jobs with the --watch flag to track state changes. A suspended Job has spec.suspend: true in its spec and a Suspended condition with status True in its status.
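The "suspend first" and "mind the limits" tips can be combined in a small guard. The following is a minimal sketch, not a hardened tool: `patch_if_suspended` is a hypothetical wrapper that refuses to patch a Job that is not suspended and asks the API server to validate the patch with a server-side dry run before applying it.

```shell
# patch_if_suspended NAME PATCH_JSON
# Applies a JSON Patch only when spec.suspend is true, after a server-side
# dry run has accepted it. Hypothetical wrapper for illustration.
patch_if_suspended() {
  suspended=$(kubectl get job "$1" -o jsonpath='{.spec.suspend}') || return 1
  if [ "$suspended" != "true" ]; then
    echo "job/$1 is not suspended; refusing to patch" >&2
    return 1
  fi
  # --dry-run=server runs API validation and admission without persisting.
  kubectl patch job "$1" --type=json -p "$2" --dry-run=server >/dev/null || return 1
  kubectl patch job "$1" --type=json -p "$2"
}

# Usage:
#   patch_if_suspended ml-training-suspended \
#     '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"4"}]'
```

The check and the patch are separate requests, so the guard is a convenience rather than a guarantee. The API server remains the authority on whether the update is allowed.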

This feature gives batch and ML workloads more flexibility: you can adapt a queued Job to changing cluster conditions without deleting and recreating it.
