Training

Low-level primitives for spawning, polling, and cancelling training runs.

The training resource manages the lifecycle of a single run against a pre-built export. For the end-to-end “give me an ONNX model from this dataset” call, use train_pipeline instead.

from pictograph import Client
client = Client()

Pipelines

`pipeline_type`	Output
`yolox`	Object detection (boxes)
`detectron2`	Instance segmentation (polygons + masks)
`sm_pytorch`	Semantic segmentation
`classification`	Image classification
`rfdetr_detection`	Object detection (RF-DETR)
`rfdetr_segmentation`	Instance segmentation (RF-DETR)

GPU tiers

`gpu_type`	Approx. cost	Pick for
`a10g` (default)	~$0.30/hr	YOLOX, classification, RF-DETR-detection
`a100`	~$2/hr	Detectron2, large RF-DETR, big batches
`h100`	~$4/hr	Last resort: only when the A100 runs out of memory
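
The hourly figures above turn into a rough per-run cost as rate times hours. The snippet below is a minimal sketch of that arithmetic using the approximate rates from the table; `APPROX_HOURLY_USD` and `approx_run_cost` are illustrative names, not part of the SDK, and actual billing may differ.

```python
# Approximate hourly rates copied from the GPU tiers table (USD per hour).
APPROX_HOURLY_USD = {"a10g": 0.30, "a100": 2.00, "h100": 4.00}


def approx_run_cost(gpu_type: str, minutes: float) -> float:
    """Rough cost of a run: hourly rate times duration in hours."""
    if gpu_type not in APPROX_HOURLY_USD:
        raise ValueError(f"unknown gpu_type: {gpu_type!r}")
    return round(APPROX_HOURLY_USD[gpu_type] * minutes / 60.0, 2)


# Compare the same 90-minute run across the three tiers.
for gpu in ("a10g", "a100", "h100"):
    print(gpu, approx_run_cost(gpu, 90))
```

A run that finishes much faster on a larger GPU can still come out cheaper, so it is worth comparing expected durations as well as rates.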

create

Spawn a run against an existing export.

run = client.training.create(
    dataset_name="road-signs",
    export_name="road-signs-20260512-120000",
    pipeline_type="yolox",
    name="yolox-run-1",
    config={"epochs": 50},
    gpu_type="a10g",
    wait=True,
    poll_interval=5.0,
    timeout=7200.0,
)

Arg	Type	Default	Notes
`dataset_name`	`str`	required	Source project
`export_name`	`str`	required	Pre-built export
`pipeline_type`	`PipelineType`	required	See table above
`name`	`str | None`	auto	Defaults to `<pipeline>-run-<ts>`
`config`	`dict`	`{}`	`epochs`, `batch_size`, `learning_rate`, `image_size`
`gpu_type`	`GpuType`	`"a10g"`	See GPU tiers above
`wait`	`bool`	`True`	When `False`, returns immediately with `status="queued"`
`poll_interval`	`float`	`5.0`	Seconds between polls
`timeout`	`float`	`7200.0`	Maximum seconds to poll (2 hours by default); the run itself keeps going after a timeout

Returns TrainingRun.
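
With wait=True, the call blocks while the run is polled every poll_interval seconds, until it finishes or timeout elapses. The loop below is a minimal sketch of that behaviour, not the SDK's actual implementation. `poll_until_done`, `fetch_status`, and the choice of terminal statuses are assumptions made for illustration, and the built-in `TimeoutError` stands in for `PollTimeoutError`.

```python
import time
from typing import Callable

# Assumed terminal statuses: a run is not expected to change after reaching one.
TERMINAL = {"completed", "failed", "cancelled"}


def poll_until_done(
    fetch_status: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 7200.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status until it returns a terminal status or timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            # Only the polling stops here; the remote run is left untouched.
            raise TimeoutError(f"still {status!r} after {timeout} s")
        sleep(poll_interval)


# Demo with a scripted status sequence instead of a real API call.
statuses = iter(["queued", "running", "running", "completed"])
print(poll_until_done(lambda: next(statuses), poll_interval=0.0, timeout=1.0))
```

Against the real client, `fetch_status` could be something like `lambda: client.training.get(run.id).status`.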

list / iter

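# Fetch a single page of up to 20 runs that are currently running.
runs = client.training.list(limit=20, status="running")
# Iterate over every run, requesting 50 per page.
for run in client.training.iter(page_size=50):
    print(run.id, run.status, run.progress)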

get

run = client.training.get("run-uuid")
print(run.status, run.progress, run.current_epoch, "/", run.total_epochs)

status is one of {"pending", "queued", "running", "completed", "failed", "cancelled"}.
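
The status set above can be encoded as a type and used to summarise a batch of runs. This is a small sketch with stand-in objects: `RunStatus`, `VALID_STATUSES`, and `tally_by_status` are hypothetical helpers, not SDK names, and the fake runs only mimic the `status` attribute of a TrainingRun.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Literal, get_args

# The six statuses listed above, as a type a static checker can enforce.
RunStatus = Literal["pending", "queued", "running", "completed", "failed", "cancelled"]
VALID_STATUSES = set(get_args(RunStatus))


def tally_by_status(runs) -> Counter:
    """Count runs per status, rejecting any status outside the documented set."""
    counts = Counter()
    for run in runs:
        if run.status not in VALID_STATUSES:
            raise ValueError(f"unexpected status: {run.status!r}")
        counts[run.status] += 1
    return counts


# Stand-in objects; in practice the iterable could be client.training.iter().
fake_runs = [SimpleNamespace(status=s) for s in ("running", "completed", "running")]
print(tally_by_status(fake_runs))
```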

cancel

client.training.cancel("run-uuid")  # stops the worker, refunds remaining minutes

wait_for_completion

If you created with wait=False, you can block later:

run = client.training.wait_for_completion("run-uuid", timeout=3600.0)
if run.status == "completed":
    model = client.models.get(run.model_id)

Minimum dataset size

Training requires at least 5 images matching the export’s status_filter so the worker can split them into train / val / test. Below that, training fails with a ValidationError (422).

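# Assumes the export's status_filter selects completed images.
ds = client.datasets.get("my-dataset")
if ds.completed_image_count < 5:
    raise ValueError(f"Need at least 5 completed images, found {ds.completed_image_count}")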

Cost estimation

# Quote 30 minutes of A10G training time before starting the run.
estimate = client.credits.estimate("training_a10g_per_minute", quantity=30)
if not estimate.sufficient:
    raise RuntimeError(f"Need {estimate.total_credits}, have {estimate.credits_remaining}")

Refunds for cancelled or under-budget runs appear automatically as positive ledger entries (training_refund_<gpu>).
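
The estimate call above takes an operation key and a quantity in minutes. The helpers below sketch one way to derive both from a GPU tier and an expected run time. The key pattern for tiers other than a10g is an assumption extrapolated from `training_a10g_per_minute`, and `training_operation` and `minutes_to_quote` are hypothetical names, not SDK functions.

```python
import math


def training_operation(gpu_type: str) -> str:
    """Per-minute operation key; pattern assumed from the documented a10g key."""
    return f"training_{gpu_type}_per_minute"


def minutes_to_quote(expected_seconds: float) -> int:
    """Round an expected run time up to whole minutes so the quote is not short."""
    return max(1, math.ceil(expected_seconds / 60.0))


print(training_operation("a10g"), minutes_to_quote(1750))
```

The two values would then be passed as the operation and `quantity` arguments of `client.credits.estimate`.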

Errors

Status	Exception	Cause
402	`PaymentRequiredError`	Insufficient credits
404	`NotFoundError`	Dataset or export missing
422	`ValidationError`	Pipeline / GPU invalid, dataset too small
408	`PollTimeoutError`	`wait=True` exceeded `timeout` (run keeps going)
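
Each error in the table suggests a different follow-up. The sketch below maps them to a next step using stand-in exception classes that only share names with the table; the real classes come from the SDK, whose import path is not shown here, and `next_step` is a hypothetical helper.

```python
class PaymentRequiredError(Exception):
    """Stand-in for the SDK's 402 error."""


class NotFoundError(Exception):
    """Stand-in for the SDK's 404 error."""


class ValidationError(Exception):
    """Stand-in for the SDK's 422 error."""


class PollTimeoutError(Exception):
    """Stand-in for the SDK's polling timeout error."""


def next_step(exc: Exception) -> str:
    """Suggest a follow-up action for each documented training error."""
    if isinstance(exc, PaymentRequiredError):
        return "add credits, then retry"
    if isinstance(exc, NotFoundError):
        return "check dataset_name and export_name"
    if isinstance(exc, ValidationError):
        return "fix pipeline_type or gpu_type, or add images"
    if isinstance(exc, PollTimeoutError):
        # The run keeps going, so resume waiting instead of creating a new run.
        return "call wait_for_completion on the same run"
    return "re-raise"


print(next_step(PollTimeoutError("timed out after 7200 s")))
```

The timeout case is the one to treat differently: creating a second run after a poll timeout would start, and bill for, a duplicate.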