Vertex AI Custom Job

The Vertex AI Custom Job job type submits a containerised training job to Google Vertex AI via POST projects/{project}/locations/{location}/customJobs. Each Polysync job points to a template Custom Job, which seeds the compute spec. The template is identified by its resource name, projects/{project}/locations/{location}/customJobs/{jobId}, which Polysync stores in the job's External Id.

This job type is supported on the Google Vertex AI platform.

Required job fields

  • External Id — the template Vertex AI Custom Job resource name.
  • Job Type — Vertex AI Custom Job (set automatically on import).

Compute & container spec (stored as job attributes from the template): Machine Type, Container Image URI, Replica Count, optional accelerator settings.

Job discovery

GET projects/{project}/locations/{location}/customJobs?pageSize=100

Results are paginated via pageToken. Each Custom Job's compute spec (machineType, imageUri, replicaCount, accelerators) is captured as job attributes for re-use.
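The paginated listing can be sketched as follows. This is a minimal illustration, not Polysync's actual implementation: `fetch_page` is a hypothetical callable standing in for an authenticated GET against the customJobs endpoint, and the keys returned by `extract_compute_spec` are illustrative attribute names. The `customJobs` and `nextPageToken` keys follow the shape of the Vertex AI list response.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_custom_jobs(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every Custom Job, following nextPageToken until it is absent."""
    token: Optional[str] = None
    while True:
        # fetch_page is hypothetical: it should issue the GET with pageSize=100
        # and, when token is not None, pageToken=<token>.
        page = fetch_page(token)
        yield from page.get("customJobs", [])
        token = page.get("nextPageToken")
        if not token:
            break


def extract_compute_spec(job: Dict[str, Any]) -> Dict[str, Any]:
    """Read the compute spec from the first worker pool of a Custom Job."""
    pool = job["jobSpec"]["workerPoolSpecs"][0]
    machine = pool.get("machineSpec", {})
    return {
        "machineType": machine.get("machineType"),
        "imageUri": pool.get("containerSpec", {}).get("imageUri"),
        # REST responses may encode 64-bit integers as strings, so cast.
        "replicaCount": int(pool.get("replicaCount", 1)),
        "acceleratorType": machine.get("acceleratorType"),
        "acceleratorCount": machine.get("acceleratorCount"),
    }
```

Passing the page fetcher in as a callable keeps the pagination logic independent of the HTTP client and credentials in use.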

Parameter handling

Input and Input&Output parameters are sent as container environment variables on the worker pool's containerSpec:

{
  "displayName": "<polysync-name>_<timestamp>",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": <int>,
        "machineSpec": {
          "machineType": "<from-attribute>",
          "acceleratorType":  "<optional>",
          "acceleratorCount": <optional>
        },
        "containerSpec": {
          "imageUri": "<from-attribute>",
          "env": [
            { "name": "<param>", "value": "<typed-value-as-string>" }
          ]
        }
      }
    ]
  }
}

Direction      Sent as container env   Updated from response
Input          Yes                     No
Output         No                      No (not supported)
Input&Output   Yes                     No (not supported)

Output parameters are not supported. Persist training outputs to GCS or the Vertex AI Model Registry.
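As a rough sketch of how such a request body could be assembled from the stored attributes and the job's parameters: the function names, the attribute keys, the timestamp format, and the lowercase rendering of booleans are all assumptions made for illustration, not documented Polysync behaviour.

```python
import time
from typing import Any, Dict, Mapping, Optional


def to_env_value(value: Any) -> str:
    """Render a typed parameter as a string (env values are always strings)."""
    if isinstance(value, bool):
        return "true" if value else "false"  # assumed convention
    return str(value)


def build_custom_job_body(
    name: str,
    attrs: Mapping[str, Any],
    params: Mapping[str, Any],
    timestamp: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a customJobs POST body from template attributes and parameters."""
    # Timestamp format is an assumption; the source only shows <timestamp>.
    ts = timestamp or time.strftime("%Y%m%d%H%M%S", time.gmtime())
    machine_spec: Dict[str, Any] = {"machineType": attrs["machineType"]}
    # Accelerator settings are optional and only included when present.
    if attrs.get("acceleratorType"):
        machine_spec["acceleratorType"] = attrs["acceleratorType"]
        machine_spec["acceleratorCount"] = int(attrs.get("acceleratorCount", 1))
    return {
        "displayName": f"{name}_{ts}",
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "replicaCount": int(attrs.get("replicaCount", 1)),
                    "machineSpec": machine_spec,
                    "containerSpec": {
                        "imageUri": attrs["imageUri"],
                        "env": [
                            {"name": key, "value": to_env_value(val)}
                            for key, val in params.items()
                        ],
                    },
                }
            ]
        },
    }
```

Only Input and Input&Output parameters would be passed in `params`; Output parameters have no place in the body.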

Execution flow

  1. Polysync POSTs the body above; the new job's full resource name becomes the Polysync RunId.

  2. Status is polled via GET projects/{project}/locations/{location}/customJobs/{jobId} and decoded from state:

    Vertex AI state          Polysync status
    JOB_STATE_SUCCEEDED      Success
    JOB_STATE_FAILED         Failed
    JOB_STATE_EXPIRED        Failed
    JOB_STATE_RUNNING        Running
    JOB_STATE_QUEUED         Running
    JOB_STATE_PENDING        Running
    JOB_STATE_PAUSED         Running
    JOB_STATE_CANCELLING     Running
    JOB_STATE_CANCELLED      Cancelled
    (other)                  Unknown

  3. Cancel is supported via POST {resourceName}:cancel.
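The state decoding and polling described in the steps above might look like the sketch below. `get_job` is a hypothetical callable that performs the GET and returns the parsed JSON; the polling interval, the poll limit, and the choice to keep polling while the status is Unknown are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict

# Mapping taken from the state table in this section.
STATE_TO_STATUS = {
    "JOB_STATE_SUCCEEDED": "Success",
    "JOB_STATE_FAILED": "Failed",
    "JOB_STATE_EXPIRED": "Failed",
    "JOB_STATE_RUNNING": "Running",
    "JOB_STATE_QUEUED": "Running",
    "JOB_STATE_PENDING": "Running",
    "JOB_STATE_PAUSED": "Running",
    "JOB_STATE_CANCELLING": "Running",
    "JOB_STATE_CANCELLED": "Cancelled",
}
TERMINAL_STATUSES = {"Success", "Failed", "Cancelled"}


def decode_status(state: str) -> str:
    """Translate a Vertex AI job state into a Polysync status."""
    return STATE_TO_STATUS.get(state, "Unknown")


def poll_until_done(
    get_job: Callable[[], Dict[str, Any]],
    interval_s: float = 30.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is reached or max_polls is exhausted."""
    status = "Unknown"
    for _ in range(max_polls):
        status = decode_status(get_job().get("state", ""))
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    # Not terminal after max_polls attempts: report the last observed status.
    return status
```

The `sleep` parameter is injectable so the loop can be exercised in tests without waiting.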

Monitor URL

https://console.cloud.google.com/vertex-ai/locations/{location}/training/{jobId}?project={projectId}
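Since the RunId is the job's full resource name, the monitor URL can in principle be derived from it. The helper below is a hypothetical sketch of that derivation. It assumes the project segment of the resource name is acceptable as the `project` query value, which may not hold if the API returns a project number where a project ID is expected.

```python
import re
from urllib.parse import quote

_RESOURCE_RE = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/customJobs/(?P<job_id>[^/]+)$"
)


def monitor_url(resource_name: str) -> str:
    """Build the console URL for a Custom Job resource name."""
    match = _RESOURCE_RE.match(resource_name)
    if match is None:
        raise ValueError(f"not a Custom Job resource name: {resource_name!r}")
    return (
        "https://console.cloud.google.com/vertex-ai/locations/"
        f"{match['location']}/training/{match['job_id']}"
        f"?project={quote(match['project'])}"
    )
```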

Best practices

  • Pre-build a single template Custom Job per training workload and let Polysync clone it per execution — this keeps the compute spec consistent across runs.
  • Read parameter values inside the container with os.environ and type-cast as needed (env vars are always strings).
  • For long-running training, set scheduling.timeout on the template to bound runtime.
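Reading and type-casting parameters inside the training container could look like the sketch below. The helper names and the parameter names (`EPOCHS`, `LEARNING_RATE`, `SHUFFLE`) are hypothetical, and the accepted boolean spellings are an assumption to adjust to whatever Polysync actually sends.

```python
import os
from typing import Any, Callable, Mapping

_MISSING = object()  # sentinel so that None can be used as a real default


def parse_bool(raw: str) -> bool:
    """Parse common boolean spellings; bool('false') would wrongly be True."""
    value = raw.strip().lower()
    if value in {"1", "true", "yes"}:
        return True
    if value in {"0", "false", "no"}:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def env_param(
    name: str,
    cast: Callable[[str], Any] = str,
    default: Any = _MISSING,
    environ: Mapping[str, str] = os.environ,
) -> Any:
    """Read one parameter from the environment and cast it from a string."""
    raw = environ.get(name)
    if raw is None:
        if default is _MISSING:
            raise KeyError(f"missing required parameter: {name}")
        return default
    return cast(raw)


if __name__ == "__main__":
    # Hypothetical parameter names, with defaults for local runs.
    epochs = env_param("EPOCHS", int, default=1)
    learning_rate = env_param("LEARNING_RATE", float, default=0.001)
    shuffle = env_param("SHUFFLE", parse_bool, default=True)
    print(epochs, learning_rate, shuffle)
```

Failing fast on a missing or malformed parameter at container start is usually cheaper than discovering it partway through training.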

Troubleshooting

  • PERMISSION_DENIED on submit — the runtime identity lacks the Vertex AI User role, or the worker service account isn't permitted to pull the container image.
  • Job stays JOB_STATE_QUEUED for a long time — the most likely cause is exhausted accelerator quota in the region; switch region or request a quota increase.