Vertex AI Custom Job

The Vertex AI Custom Job job type submits a containerised training job to Google Vertex AI via POST projects/{project}/locations/{location}/customJobs. The template Custom Job, which seeds the compute spec, is identified by its resource name: Polysync stores projects/{project}/locations/{location}/customJobs/{jobId} in the Job's External Id.

This job type is supported on the Google Vertex AI platform.

Required job fields

External Id — the template Vertex AI Custom Job resource name.
Job Type — Vertex AI Custom Job (set automatically on import).

Compute & container spec (stored as job attributes from the template): Machine Type, Container Image URI, Replica Count, optional accelerator settings.

Job discovery

GET projects/{project}/locations/{location}/customJobs?pageSize=100

Results are paginated via pageToken. Each Custom Job's compute spec (machineType, imageUri, replicaCount, accelerators) is captured as job attributes for re-use.
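The discovery loop can be sketched as follows. This is a minimal illustration, not Polysync's implementation: `fetch_page` is a hypothetical callable that performs one GET for a given page token and returns the decoded JSON, and the keys of the returned attribute dictionary are chosen for the example. The `customJobs` and `nextPageToken` response fields follow the Vertex AI list response.

```python
from typing import Callable, Iterator, Optional


def list_custom_jobs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every Custom Job, requesting pages until no nextPageToken is returned."""
    token: Optional[str] = None
    while True:
        # One GET .../customJobs?pageSize=100 (plus pageToken after the first page).
        page = fetch_page(token)
        yield from page.get("customJobs", [])
        token = page.get("nextPageToken")
        if not token:
            break


def extract_compute_spec(job: dict) -> dict:
    """Pull the compute fields out of the first worker pool of a Custom Job."""
    pools = job.get("jobSpec", {}).get("workerPoolSpecs") or [{}]
    pool = pools[0]
    machine = pool.get("machineSpec", {})
    return {
        "machineType": machine.get("machineType"),
        "imageUri": pool.get("containerSpec", {}).get("imageUri"),
        # int64 fields arrive as JSON strings over REST; a missing value is
        # treated as 1 in this sketch (an assumption).
        "replicaCount": int(pool.get("replicaCount", 1)),
        "acceleratorType": machine.get("acceleratorType"),
        "acceleratorCount": machine.get("acceleratorCount"),
    }
```

Injecting `fetch_page` keeps authentication and HTTP details out of the loop, so the pagination logic can be exercised with canned pages.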

Parameter handling

Input and Input&Output parameters are sent as container environment variables in the worker pool's containerSpec:

{
  "displayName": "<polysync-name>_<timestamp>",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": <int>,
        "machineSpec": {
          "machineType": "<from-attribute>",
          "acceleratorType":  "<optional>",
          "acceleratorCount": <optional>
        },
        "containerSpec": {
          "imageUri": "<from-attribute>",
          "env": [
            { "name": "<param>", "value": "<typed-value-as-string>" }
          ]
        }
      }
    ]
  }
}

Direction	Sent as container env	Updated from response
`Input`	✅	❌
`Output`	❌	❌ (not supported)
`Input&Output`	✅	❌ (not supported)

Output parameters are not supported, and Input&Output parameters are never updated from the response. Persist training outputs to GCS or the Vertex AI Model Registry instead.
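A rough sketch of how such a request body could be assembled is shown below. The attribute keys ("Machine Type", "Container Image URI", "Replica Count") follow the names used earlier in this page; "Accelerator Type" and "Accelerator Count", the shape of each parameter dictionary, the epoch-seconds timestamp, and the use of `str()` for value conversion are all assumptions made for illustration.

```python
import time
from typing import Any, Mapping, Optional, Sequence

# Only these directions are forwarded to the container as env vars.
SENT_DIRECTIONS = {"Input", "Input&Output"}


def build_custom_job_body(
    name: str,
    attributes: Mapping[str, Any],
    params: Sequence[Mapping[str, Any]],
    timestamp: Optional[int] = None,
) -> dict:
    """Build a customJobs POST body from template attributes and job parameters."""
    machine_spec = {"machineType": attributes["Machine Type"]}
    # Accelerator settings are optional; the attribute names are hypothetical.
    if attributes.get("Accelerator Type"):
        machine_spec["acceleratorType"] = attributes["Accelerator Type"]
        machine_spec["acceleratorCount"] = int(attributes.get("Accelerator Count", 1))

    # Env var values are always strings, whatever the parameter's type.
    env = [
        {"name": p["name"], "value": str(p["value"])}
        for p in params
        if p["direction"] in SENT_DIRECTIONS
    ]

    stamp = int(time.time()) if timestamp is None else timestamp
    return {
        "displayName": f"{name}_{stamp}",
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "replicaCount": int(attributes.get("Replica Count", 1)),
                    "machineSpec": machine_spec,
                    "containerSpec": {
                        "imageUri": attributes["Container Image URI"],
                        "env": env,
                    },
                }
            ]
        },
    }
```

Output parameters are filtered out before the body is built, which matches the table above: nothing is sent for them and nothing is read back.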

Execution flow

Polysync POSTs the body above; the new job's full resource name becomes the Polysync RunId.

Status is polled via GET projects/{project}/locations/{location}/customJobs/{jobId} and mapped from the response's state field:

Vertex AI `state`	Polysync status
`JOB_STATE_SUCCEEDED`	Success
`JOB_STATE_FAILED` / `JOB_STATE_EXPIRED`	Failed
`JOB_STATE_RUNNING` / `JOB_STATE_QUEUED` / `JOB_STATE_PENDING` / `JOB_STATE_PAUSED`	Running
`JOB_STATE_CANCELLING`	Running
`JOB_STATE_CANCELLED`	Cancelled
(other)	Unknown
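The mapping above can be expressed as a lookup table with a fallback. The polling helper is only a sketch: `get_job` is a hypothetical callable that returns the decoded Custom Job JSON, and the interval, the poll limit, and the choice to treat Success, Failed and Cancelled as the terminal statuses are assumptions.

```python
import time
from typing import Callable

STATE_TO_STATUS = {
    "JOB_STATE_SUCCEEDED": "Success",
    "JOB_STATE_FAILED": "Failed",
    "JOB_STATE_EXPIRED": "Failed",
    "JOB_STATE_RUNNING": "Running",
    "JOB_STATE_QUEUED": "Running",
    "JOB_STATE_PENDING": "Running",
    "JOB_STATE_PAUSED": "Running",
    "JOB_STATE_CANCELLING": "Running",
    "JOB_STATE_CANCELLED": "Cancelled",
}
TERMINAL_STATUSES = {"Success", "Failed", "Cancelled"}


def decode_state(state) -> str:
    """Map a Vertex AI state string to a status; anything unlisted is Unknown."""
    return STATE_TO_STATUS.get(state, "Unknown")


def poll_until_done(
    get_job: Callable[[], dict],
    sleep: Callable[[float], None] = time.sleep,
    interval_s: float = 30.0,
    max_polls: int = 1000,
) -> str:
    """Poll until a terminal status is seen or max_polls is reached."""
    status = "Unknown"
    for _ in range(max_polls):
        status = decode_state(get_job().get("state"))
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    return status  # last observed status if the poll limit was hit
```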

Cancel is supported via POST {resourceName}:cancel.
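As an illustration, the cancel call can be prepared from the stored resource name alone. The sketch below only builds the request and does not send it. The regional host `https://{location}-aiplatform.googleapis.com/v1/` is the public Vertex AI REST endpoint form and is not stated on this page, and `access_token` is assumed to be an OAuth bearer token obtained elsewhere.

```python
import re
import urllib.request

_RESOURCE = re.compile(r"^projects/([^/]+)/locations/([^/]+)/customJobs/([^/]+)$")


def build_cancel_request(resource_name: str, access_token: str) -> urllib.request.Request:
    """Prepare POST {resourceName}:cancel against the job's regional endpoint."""
    match = _RESOURCE.match(resource_name)
    if match is None:
        raise ValueError(f"not a Custom Job resource name: {resource_name!r}")
    location = match.group(2)
    # Assumed endpoint form: the region is taken from the resource name itself.
    url = f"https://{location}-aiplatform.googleapis.com/v1/{resource_name}:cancel"
    return urllib.request.Request(
        url,
        data=b"{}",  # the cancel call takes an empty JSON body
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would issue the call. A cancelled job then typically moves through JOB_STATE_CANCELLING before reaching JOB_STATE_CANCELLED, so polling should continue after the request.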

Monitor URL

https://console.cloud.google.com/vertex-ai/locations/{location}/training/{jobId}?project={projectId}
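A small helper can derive this URL from the RunId (the job's full resource name). This is a sketch under one assumption: the project segment of a resource name returned by Vertex AI is usually the numeric project number, so the hypothetical `project_id` argument lets the caller supply the project ID and falls back to the segment when omitted.

```python
import re
from typing import Optional
from urllib.parse import quote

_RESOURCE = re.compile(r"^projects/([^/]+)/locations/([^/]+)/customJobs/([^/]+)$")


def monitor_url(resource_name: str, project_id: Optional[str] = None) -> str:
    """Build the console URL for a Custom Job from its full resource name."""
    match = _RESOURCE.match(resource_name)
    if match is None:
        raise ValueError(f"not a Custom Job resource name: {resource_name!r}")
    project, location, job_id = match.groups()
    return (
        "https://console.cloud.google.com/vertex-ai/locations/"
        f"{quote(location, safe='')}/training/{quote(job_id, safe='')}"
        f"?project={quote(project_id or project, safe='')}"
    )
```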

Best practices

Pre-build a single template Custom Job per training workload and let Polysync clone it per execution; this keeps the compute spec consistent across runs.
Read parameter values inside the container with os.environ and type-cast as needed (env vars are always strings).
For long-running training, set scheduling.timeout on the template to bound runtime.
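Inside the container, reading and type-casting parameters might look like the sketch below. The helper names and the set of strings accepted as true are assumptions; the exact string form Polysync uses for booleans is not specified here, so adjust `parse_bool` to match what actually arrives.

```python
import os
from typing import Any, Callable, Mapping, Optional

_MISSING = object()  # sentinel so that None can be used as a real default


def parse_bool(raw: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return raw.strip().lower() in {"1", "true", "yes"}


def read_param(
    name: str,
    cast: Callable[[str], Any] = str,
    default: Any = _MISSING,
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Read one parameter from the environment and convert it with `cast`."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        if default is _MISSING:
            raise KeyError(f"required parameter {name!r} is not set")
        return default
    return cast(raw)
```

For example, `read_param("EPOCHS", int)` and `read_param("USE_AMP", parse_bool, default=False)` would give a training script typed values without scattering string conversions through the code. The parameter names here are illustrative.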

Troubleshooting

PERMISSION_DENIED on submit: the runtime identity lacks the Vertex AI User role (roles/aiplatform.user), or the worker service account isn't permitted to pull the container image.
Job stays in JOB_STATE_QUEUED for a long time: accelerator quota in the region is likely exhausted; switch region or request more quota.

Documentation