Training
Low-level primitives for spawning, polling, and cancelling training runs.
The training resource manages the lifecycle of a single run against a pre-built export. For the end-to-end “give me an ONNX model from this dataset” call, use train_pipeline instead.
from pictograph import Client
client = Client()
Pipelines
| pipeline_type | Output |
|---|---|
| yolox | Object detection (boxes) |
| detectron2 | Instance segmentation (polygons + masks) |
| sm_pytorch | Semantic segmentation |
| classification | Image classification |
| rfdetr_detection | Object detection (RF-DETR) |
| rfdetr_segmentation | Instance segmentation (RF-DETR) |
GPU tiers
| gpu_type | Approx. cost | Pick for |
|---|---|---|
| a10g (default) | ~$0.30/hr | YOLOX, classification, RF-DETR-detection |
| a100 | ~$2/hr | Detectron2, large RF-DETR, big batches |
| h100 | ~$4/hr | Last resort: only when A100 OOMs |
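The hourly rates above can be turned into a rough per-run figure. The following is a minimal sketch, assuming the approximate prices in the table; `approx_run_cost` is a hypothetical helper, not part of the SDK, and actual billing is in credits and may differ.

```python
# Approximate hourly rates copied from the GPU tiers table (USD, not credits).
APPROX_USD_PER_HOUR = {"a10g": 0.30, "a100": 2.00, "h100": 4.00}


def approx_run_cost(gpu_type: str, minutes: float) -> float:
    """Return a rough USD cost for `minutes` of training on `gpu_type`.

    Hypothetical helper for back-of-envelope planning only.
    """
    if gpu_type not in APPROX_USD_PER_HOUR:
        raise ValueError(f"unknown gpu_type: {gpu_type!r}")
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return APPROX_USD_PER_HOUR[gpu_type] * minutes / 60.0


# Compare the three tiers for a two-hour run.
for gpu in APPROX_USD_PER_HOUR:
    print(f"{gpu}: ~${approx_run_cost(gpu, 120):.2f} for a 2-hour run")
```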
create
Spawn a run against an existing export.
run = client.training.create(
    dataset_name="road-signs",
    export_name="road-signs-20260512-120000",
    pipeline_type="yolox",
    name="yolox-run-1",
    config={"epochs": 50},
    gpu_type="a10g",
    wait=True,
    poll_interval=5.0,
    timeout=7200.0,
)
| Arg | Type | Default | Notes |
|---|---|---|---|
| dataset_name | str | required | Source project |
| export_name | str | required | Pre-built export |
| pipeline_type | PipelineType | required | See table above |
| name | str \| None | auto | Defaults to <pipeline>-run-<ts> |
| config | dict | {} | epochs, batch_size, learning_rate, image_size |
| gpu_type | GpuType | "a10g" | |
| wait | bool | True | When False, returns immediately with status="queued" |
| poll_interval | float | 5.0 | Seconds between polls |
| timeout | float | 7200.0 | Max poll seconds (2 hours) |
Returns TrainingRun.
list / iter
runs = client.training.list(limit=20, status="running")
for run in client.training.iter(page_size=50):
    print(run.id, run.status, run.progress)
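Iterating over all runs makes it easy to build a quick status summary. This is a small sketch, assuming only that each run exposes a `status` attribute as shown above; `count_runs_by_status` is a hypothetical helper name.

```python
from collections import Counter


def count_runs_by_status(runs):
    """Tally runs by their `status` attribute.

    `runs` is any iterable of objects with a `status` attribute, for
    example `client.training.iter(page_size=50)`.
    """
    return Counter(run.status for run in runs)


# Usage (requires a configured client):
# summary = count_runs_by_status(client.training.iter(page_size=50))
# print(summary.most_common())
```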
get
run = client.training.get("run-uuid")
print(run.status, run.progress, run.current_epoch, "/", run.total_epochs)
status is one of {"pending", "queued", "running", "completed", "failed", "cancelled"}.
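When the built-in waiting helpers do not fit, a manual polling loop over `get` is one option. This sketch assumes that "completed", "failed", and "cancelled" are the terminal statuses (inferred from their names, not confirmed here); `poll_until_terminal` and its parameters are hypothetical.

```python
import time

# Assumed terminal statuses: a run in one of these is not expected to change.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})


def poll_until_terminal(get_run, run_id, poll_interval=5.0, timeout=7200.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call `get_run(run_id)` until the run reaches a terminal status.

    Hypothetical helper: pass `client.training.get` as `get_run`.
    Raises TimeoutError if `timeout` seconds elapse first.
    """
    deadline = clock() + timeout
    while True:
        run = get_run(run_id)
        if run.status in TERMINAL_STATUSES:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} still {run.status!r} after {timeout}s")
        sleep(poll_interval)


# Usage (requires a configured client):
# run = poll_until_terminal(client.training.get, "run-uuid", poll_interval=10.0)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays.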
cancel
client.training.cancel("run-uuid") # stops the worker, refunds remaining minutes
wait_for_completion
If you created with wait=False, you can block later:
run = client.training.wait_for_completion("run-uuid", timeout=3600.0)
if run.status == "completed":
    model = client.models.get(run.model_id)
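One use for wait=False followed by wait_for_completion is queuing several runs before blocking on any of them. The sketch below uses only the create and wait_for_completion calls shown on this page; `train_and_collect` is a hypothetical helper, and whether queued runs actually execute in parallel is an assumption about the service.

```python
def train_and_collect(client, dataset_name, export_name, pipeline_types, timeout=3600.0):
    """Queue one run per pipeline type, then wait for each in turn.

    Returns a dict mapping pipeline type to the finished TrainingRun.
    Hypothetical helper; `client` is a pictograph Client.
    """
    # Queue every run first so none of them waits on another to be created.
    queued = {
        pipeline: client.training.create(
            dataset_name=dataset_name,
            export_name=export_name,
            pipeline_type=pipeline,
            wait=False,
        )
        for pipeline in pipeline_types
    }
    # Then block on each run by id; `timeout` applies to each wait separately.
    return {
        pipeline: client.training.wait_for_completion(run.id, timeout=timeout)
        for pipeline, run in queued.items()
    }


# Usage (requires a configured client):
# results = train_and_collect(client, "road-signs", "road-signs-20260512-120000",
#                             ["yolox", "rfdetr_detection"])
```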
Minimum dataset size
Training requires at least 5 images matching the export’s status_filter so the worker can split into train / val / test. Below that, training fails with a validation error.
ds = client.datasets.get("my-dataset")
assert ds.completed_image_count >= 5
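A bare assert is skipped when Python runs with -O, so an explicit check may be preferable in scripts. This is a minimal sketch, assuming the export's status_filter selects completed images so that completed_image_count is the relevant count; `check_dataset_size` is a hypothetical helper.

```python
MIN_TRAINING_IMAGES = 5  # minimum stated in this section


def check_dataset_size(completed_image_count: int,
                       minimum: int = MIN_TRAINING_IMAGES) -> None:
    """Raise ValueError if the dataset is too small to train on."""
    if completed_image_count < minimum:
        raise ValueError(
            f"need at least {minimum} images, found {completed_image_count}"
        )


# Usage (requires a configured client):
# check_dataset_size(client.datasets.get("my-dataset").completed_image_count)
```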
Cost estimation
estimate = client.credits.estimate("training_a10g_per_minute", quantity=30)
if not estimate.sufficient:
    raise RuntimeError(f"Need {estimate.total_credits}, have {estimate.credits_remaining}")
Refunds for cancelled or under-budget runs appear automatically as positive ledger entries (training_refund_<gpu>).
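The action key and quantity passed to estimate can be derived rather than hard-coded. The sketch below follows the training_<gpu>_per_minute naming pattern used on this page and rounds an expected duration up to whole minutes; both helper names are hypothetical, and billing granularity is an assumption.

```python
import math


def training_action(gpu_type: str) -> str:
    """Build the credits action key using the training_<gpu>_per_minute pattern."""
    return f"training_{gpu_type}_per_minute"


def expected_minutes(duration_seconds: float) -> int:
    """Round an expected run duration in seconds up to whole minutes."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds / 60)


print(training_action("a10g"), expected_minutes(1800.0))

# Usage (requires a configured client):
# estimate = client.credits.estimate(training_action("a100"),
#                                    quantity=expected_minutes(5400.0))
```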
Errors
| Status | Exception | Cause |
|---|---|---|
| 402 | PaymentRequiredError | Insufficient credits |
| 404 | NotFoundError | Dataset or export missing |
| 422 | ValidationError | Pipeline / GPU invalid, dataset too small |
| 408 | PollTimeoutError | wait=True exceeded timeout (run keeps going) |
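The table above suggests a different follow-up for each exception. The sketch below maps exceptions to a suggested next step by class name, which avoids assuming where the SDK exports these classes; `next_step_for` and the suggested actions are illustrative, not part of the SDK.

```python
def next_step_for(exc: Exception) -> str:
    """Map an SDK exception to a suggested follow-up, keyed on class name."""
    actions = {
        "PaymentRequiredError": "top up credits, then retry",
        "NotFoundError": "check dataset_name and export_name",
        "ValidationError": "fix pipeline_type, gpu_type, or dataset size",
        "PollTimeoutError": "run is still going; call wait_for_completion again",
    }
    return actions.get(type(exc).__name__, "unexpected error; re-raise")


# Usage (requires a configured client):
# try:
#     run = client.training.create(...)
# except Exception as exc:
#     print(next_step_for(exc))
#     raise
```

In real code, catching the specific exception classes is usually clearer once their import path is known.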
See also
- train_pipeline: end-to-end workflow (recommended starting point)
- Models: download trained ONNX weights
- Credits: estimate("training_<gpu>_per_minute")