Training

Low-level primitives for spawning, polling, and cancelling training runs.

The training resource manages the lifecycle of a single run against a pre-built export. For the end-to-end “give me an ONNX model from this dataset” call, use train_pipeline instead.

from pictograph import Client
client = Client()

Pipelines

| pipeline_type | Output |
| --- | --- |
| yolox | Object detection (boxes) |
| detectron2 | Instance segmentation (polygons + masks) |
| sm_pytorch | Semantic segmentation |
| classification | Image classification |
| rfdetr_detection | Object detection (RF-DETR) |
| rfdetr_segmentation | Instance segmentation (RF-DETR) |

GPU tiers

| gpu_type | Approx. cost | Pick for |
| --- | --- | --- |
| a10g (default) | ~$0.30/hr | YOLOX, classification, RF-DETR-detection |
| a100 | ~$2/hr | Detectron2, large RF-DETR, big batches |
| h100 | ~$4/hr | Last resort: only when A100 OOMs |
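
The "Pick for" column can be turned into a small lookup when runs are launched from a script. The sketch below is illustrative only: `RECOMMENDED_GPU` and `pick_gpu` are hypothetical names, not part of the SDK, and pipelines the table does not mention fall back to the default `a10g`.

```python
# Starting GPU per pipeline, read off the "Pick for" column above.
# This mapping is an assumption for illustration, not an SDK constant.
RECOMMENDED_GPU = {
    "yolox": "a10g",
    "classification": "a10g",
    "rfdetr_detection": "a10g",
    "detectron2": "a100",
}


def pick_gpu(pipeline_type: str, large_batch: bool = False) -> str:
    """Return a starting gpu_type for a pipeline (hypothetical helper)."""
    if large_batch:
        # The table lists "big batches" under a100.
        return "a100"
    # Pipelines not listed in the table use the documented default.
    return RECOMMENDED_GPU.get(pipeline_type, "a10g")
```

The helper never returns `h100`: the table describes it as a last resort, so moving to it is left as a manual decision after an A100 run hits an out-of-memory error.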

create

Spawn a run against an existing export.

run = client.training.create(
    dataset_name="road-signs",
    export_name="road-signs-20260512-120000",
    pipeline_type="yolox",
    name="yolox-run-1",
    config={"epochs": 50},
    gpu_type="a10g",
    wait=True,
    poll_interval=5.0,
    timeout=7200.0,
)

| Arg | Type | Default | Notes |
| --- | --- | --- | --- |
| dataset_name | str | required | Source project |
| export_name | str | required | Pre-built export |
| pipeline_type | PipelineType | required | See table above |
| name | str \| None | auto | Defaults to <pipeline>-run-<ts> |
| config | dict | {} | epochs, batch_size, learning_rate, image_size |
| gpu_type | GpuType | "a10g" | |
| wait | bool | True | When False, returns immediately with status="queued" |
| poll_interval | float | 5.0 | Seconds between polls |
| timeout | float | 7200.0 | Max poll seconds (2 hours) |

Returns TrainingRun.
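
Because `config` is a plain dict, a misspelled key such as `epoch` is easy to pass by accident. A minimal client-side guard might look like the sketch below. It assumes the four keys listed in the table are the ones you intend to use; the SDK may accept others, and `KNOWN_CONFIG_KEYS` and `build_config` are hypothetical names rather than SDK API.

```python
# Keys named in the create() argument table. Treating this set as
# exhaustive is an assumption made for this sketch.
KNOWN_CONFIG_KEYS = frozenset(
    {"epochs", "batch_size", "learning_rate", "image_size"}
)


def build_config(**overrides) -> dict:
    """Return a config dict, rejecting keys not listed in the table."""
    unknown = sorted(set(overrides) - KNOWN_CONFIG_KEYS)
    if unknown:
        raise ValueError(f"unrecognised config keys: {unknown}")
    return dict(overrides)
```

The result can be passed straight through, for example `config=build_config(epochs=50, batch_size=16)`.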

list / iter

runs = client.training.list(limit=20, status="running")
for run in client.training.iter(page_size=50):
    print(run.id, run.status, run.progress)
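
Since `iter` yields runs one at a time, a quick status summary can be built without holding every page in memory. The sketch below relies only on the `status` attribute shown above; `count_by_status` is a hypothetical helper.

```python
from collections import Counter


def count_by_status(runs) -> Counter:
    """Tally runs by their status attribute (hypothetical helper)."""
    return Counter(run.status for run in runs)
```

For example, `count_by_status(client.training.iter(page_size=50))` would return a `Counter` keyed by status string.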

get

run = client.training.get("run-uuid")
print(run.status, run.progress, run.current_epoch, "/", run.total_epochs)

status is one of {"pending", "queued", "running", "completed", "failed", "cancelled"}.
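
When writing your own polling or dashboard code, it helps to separate states that can still change from states that cannot. The grouping below is inferred from the status names, not stated by the SDK, so treat it as an assumption; `is_terminal` is a hypothetical helper.

```python
# Assumed grouping of the six documented statuses.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})
ACTIVE_STATUSES = frozenset({"pending", "queued", "running"})


def is_terminal(status: str) -> bool:
    """Return True if a run in this status is assumed to be finished."""
    if status not in TERMINAL_STATUSES | ACTIVE_STATUSES:
        # Fail loudly on values outside the documented set.
        raise ValueError(f"unknown status: {status!r}")
    return status in TERMINAL_STATUSES
```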

cancel

client.training.cancel("run-uuid")  # stops the worker, refunds remaining minutes

wait_for_completion

If you created with wait=False, you can block later:

run = client.training.wait_for_completion("run-uuid", timeout=3600.0)
if run.status == "completed":
    model = client.models.get(run.model_id)
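
If you need custom behaviour between polls (logging, progress bars), a hand-rolled loop over `client.training.get` is one option. This is a sketch of the general polling pattern, not the SDK's implementation of `wait_for_completion`. It assumes `completed`, `failed`, and `cancelled` are the terminal states, and it raises Python's built-in `TimeoutError` rather than the SDK's `PollTimeoutError`. `poll_until_done` is a hypothetical name.

```python
import time


def poll_until_done(client, run_id, poll_interval=5.0, timeout=3600.0,
                    sleep=time.sleep):
    """Poll client.training.get(run_id) until the run stops changing."""
    deadline = time.monotonic() + timeout
    while True:
        run = client.training.get(run_id)
        # Assumed terminal states; see the status list above.
        if run.status in ("completed", "failed", "cancelled"):
            return run
        if time.monotonic() >= deadline:
            # Only the local wait is abandoned here; nothing is cancelled.
            raise TimeoutError(f"run {run_id} still {run.status!r}")
        sleep(poll_interval)
```

The `sleep` parameter exists so the loop can be exercised in tests without real delays.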

Minimum dataset size

Training requires at least 5 images matching the export’s status_filter so the worker can split them into train / val / test sets. With fewer images, the run fails with a validation error (422 ValidationError; see Errors).

ds = client.datasets.get("my-dataset")
assert ds.completed_image_count >= 5

Cost estimation

estimate = client.credits.estimate("training_a10g_per_minute", quantity=30)
if not estimate.sufficient:
    raise RuntimeError(f"Need {estimate.total_credits}, have {estimate.credits_remaining}")

Refunds for cancelled or under-budget runs appear automatically as positive ledger entries (training_refund_<gpu>).
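
For a rough sanity check before calling `credits.estimate`, the hourly figures from the GPU tiers table can be prorated by hand. The sketch below works in approximate dollars, not credits (the conversion between the two is not documented here), and both helper names are hypothetical. The operation-name pattern follows the `training_<gpu>_per_minute` form used above.

```python
# Approximate hourly prices copied from the GPU tiers table.
APPROX_USD_PER_HOUR = {"a10g": 0.30, "a100": 2.00, "h100": 4.00}


def approx_cost_usd(gpu_type: str, minutes: float) -> float:
    """Prorate the approximate hourly price over a number of minutes."""
    return APPROX_USD_PER_HOUR[gpu_type] * minutes / 60.0


def credit_operation(gpu_type: str) -> str:
    """Build the operation name passed to client.credits.estimate."""
    return f"training_{gpu_type}_per_minute"
```

Under these figures, 30 minutes on an `a10g` comes to roughly $0.15, and 90 minutes on an `a100` to roughly $3.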

Errors

| Status | Exception | Cause |
| --- | --- | --- |
| 402 | PaymentRequiredError | Insufficient credits |
| 404 | NotFoundError | Dataset or export missing |
| 422 | ValidationError | Pipeline / GPU invalid, dataset too small |
| 408 | PollTimeoutError | wait=True exceeded timeout (run keeps going) |

See also

  • train_pipeline — end-to-end workflow (recommended starting point)
  • Models — download trained ONNX weights
  • Credits: estimate("training_<gpu>_per_minute")