Upload from folder
Walk a local folder and upload every image to a dataset, parallel and idempotent.
upload_dataset_from_folder() walks a local directory of images, creates the destination dataset if needed, and uploads everything through a thread pool. Subdirectories become virtual folders on the dataset by default. Re-runs are idempotent — duplicate filenames are skipped, not failed.
from pictograph import Client
from pictograph.workflows import upload_dataset_from_folder
client = Client()
report = upload_dataset_from_folder(
    client,
    dataset_name="road-signs",
    folder="./road_signs",
)
print(f"{report.images_uploaded} uploaded, {report.images_skipped} skipped")
Signature
upload_dataset_from_folder(
    client: Client,
    dataset_name: str,
    folder: str | Path,
    *,
    organize_by_class: bool = True,
    parallel: bool = True,
    max_workers: int = 8,
    skip_existing: bool = True,
    create_if_missing: bool = True,
    progress: Callable[[int, int, str | None], None] | None = None,
) -> UploadReport
| Argument | Default | Purpose |
|---|---|---|
| dataset_name | required | Destination dataset |
| folder | required | Local directory (walked recursively) |
| organize_by_class | True | First-level subdirectories become virtual folders |
| parallel | True | Use a thread pool |
| max_workers | 8 | Pool size; higher values risk hitting the rate limit |
| skip_existing | True | Treat duplicate-filename conflicts as skips, not failures |
| create_if_missing | True | Create the dataset if it doesn’t exist (otherwise NotFoundError is raised) |
| progress | None | (completed, total, filename) callback fired after each file |
Folder layout convention
With organize_by_class=True (the default), the first-level subdirectory becomes the virtual folder:
./road_signs/
├── stop/          → /stop on the dataset
│   ├── 001.jpg
│   └── 002.jpg
├── yield/         → /yield
│   └── 003.jpg
└── 004.jpg        → / (root)
Nested subdirectories collapse — ./road_signs/stop/night/005.jpg still lands in /stop. Pass organize_by_class=False to put every file at the root.
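The mapping described above can be sketched as a small pure function. This is only an illustration of the documented convention, not the library's own code; `virtual_folder_for` is a hypothetical helper name introduced here.

```python
from pathlib import Path


def virtual_folder_for(file_path: Path, root: Path, organize_by_class: bool = True) -> str:
    """Return the virtual folder a file would land in under the documented convention."""
    parts = file_path.relative_to(root).parts
    # Files directly under the root, or any file when organize_by_class is off, go to "/".
    if not organize_by_class or len(parts) == 1:
        return "/"
    # Only the first-level subdirectory counts; deeper nesting collapses into it.
    return "/" + parts[0]
```

For example, `virtual_folder_for(Path("road_signs/stop/night/005.jpg"), Path("road_signs"))` gives `/stop`, matching the collapse rule.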
Supported extensions: .jpg, .jpeg, .png, .webp, .bmp, .tif, .tiff, .gif, .heic.
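To preview which files a run would pick up, a pre-flight check along these lines may help. It is a sketch under assumptions: `is_supported` and `find_images` are hypothetical helpers, and the case-insensitive comparison is a guess about how the library matches extensions.

```python
from pathlib import Path

# Extensions listed in the documentation above.
SUPPORTED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tif", ".tiff", ".gif", ".heic",
}


def is_supported(path: Path) -> bool:
    """True if the file extension is in the documented list."""
    # Assumption: extensions are compared case-insensitively.
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def find_images(folder: Path) -> list[Path]:
    """Recursively collect supported image files under a folder."""
    return sorted(p for p in folder.rglob("*") if p.is_file() and is_supported(p))
```

Comparing `len(find_images(Path("./road_signs")))` with `report.images_attempted` after a run is one way to spot files that were silently ignored.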
Idempotency
Re-running the same call on a dataset that already has matching filenames is safe: those uploads are counted in images_skipped rather than failing. To attempt the uploads anyway, set skip_existing=False; duplicate-filename conflicts are then recorded in report.failures instead of being counted as skips.
# First run — uploads everything.
report = upload_dataset_from_folder(client, "road-signs", "./road_signs")
assert report.images_uploaded == 100 and report.images_skipped == 0
# Second run — skips everything that's already there.
report = upload_dataset_from_folder(client, "road-signs", "./road_signs")
assert report.images_uploaded == 0 and report.images_skipped == 100
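The skip-versus-failure behavior can be modeled with a toy in-memory dataset. Everything here is a stand-in: `FakeDataset`, `DuplicateFilename`, and `upload_all` are hypothetical names, and the sketch does not claim to reflect how pictograph implements uploads internally.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class DuplicateFilename(Exception):
    """Hypothetical stand-in for the conflict reported on a repeated filename."""


class FakeDataset:
    """In-memory stand-in for a remote dataset; not part of pictograph."""

    def __init__(self) -> None:
        self._names: set[str] = set()
        self._lock = Lock()

    def upload(self, name: str) -> None:
        with self._lock:  # worker threads share the set of names
            if name in self._names:
                raise DuplicateFilename(name)
            self._names.add(name)


def upload_all(dataset: FakeDataset, names: list[str],
               skip_existing: bool = True, max_workers: int = 8) -> dict[str, int]:
    """Upload every name through a thread pool and tally the outcomes."""

    def attempt(name: str) -> str:
        try:
            dataset.upload(name)
            return "uploaded"
        except DuplicateFilename:
            # With skip_existing, a conflict is a skip; otherwise it is a failure.
            return "skipped" if skip_existing else "failed"

    tally = {"uploaded": 0, "skipped": 0, "failed": 0}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for outcome in pool.map(attempt, names):
            tally[outcome] += 1
    return tally
```

Running `upload_all` twice on the same `FakeDataset` yields all uploads the first time and all skips the second; passing `skip_existing=False` on a later run turns those skips into failures.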
Progress callback
def on_progress(done: int, total: int, filename: str | None) -> None:
    print(f"[{done}/{total}] {filename}")

upload_dataset_from_folder(
    client, "road-signs", "./road_signs", progress=on_progress,
)
The callback fires once per file, regardless of success or failure.
Inspecting the report
@dataclass
class UploadReport:
    dataset_name: str
    images_attempted: int
    images_uploaded: int
    images_skipped: int
    failures: list[UploadFailure]  # each carries .path and .reason

    @property
    def success(self) -> bool: ...
success is True only when there are zero failures and at least one file uploaded. An empty folder returns a report with success=False.
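One way to read the rule above is the following stand-in. `UploadReportSketch` and `UploadFailureSketch` are hypothetical classes that mirror the documented fields; the property body is an interpretation of the stated rule, not the library's source.

```python
from dataclasses import dataclass, field


@dataclass
class UploadFailureSketch:
    path: str
    reason: str


@dataclass
class UploadReportSketch:
    dataset_name: str
    images_attempted: int = 0
    images_uploaded: int = 0
    images_skipped: int = 0
    failures: list[UploadFailureSketch] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # Mirrors the stated rule: no failures and at least one upload.
        return not self.failures and self.images_uploaded > 0
```

If the rule is applied literally, a re-run in which every file is skipped would also report success as False, so checking images_skipped alongside success may be worthwhile for repeated runs.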
Errors
| Status | Exception | Cause |
|---|---|---|
| n/a | FileNotFoundError | folder doesn’t exist or isn’t a directory |
| 404 | NotFoundError | dataset_name missing and create_if_missing=False |
Per-file errors (network, validation, conflict) are recorded in report.failures, not raised.
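Because per-file errors are collected rather than raised, it can help to summarize them after a run. The helpers below are hypothetical and rely only on the .path and .reason fields described above; the `demo` object is a stand-in for a real report.

```python
from collections import Counter
from types import SimpleNamespace


def failure_summary(report) -> dict[str, int]:
    """Count failures by reason, using the .reason field of each failure."""
    return dict(Counter(f.reason for f in report.failures))


def failed_paths(report) -> list[str]:
    """List the paths of files that failed, for example to retry them later."""
    return [str(f.path) for f in report.failures]


# Stand-in report for illustration; a real UploadReport would be used the same way.
demo = SimpleNamespace(failures=[
    SimpleNamespace(path="stop/001.jpg", reason="timeout"),
    SimpleNamespace(path="stop/002.jpg", reason="invalid image"),
    SimpleNamespace(path="yield/003.jpg", reason="timeout"),
])
```

The reason strings in `demo` are invented for the example; the actual values depend on what the library records.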
See also
- Full pipeline: chains upload with annotate + train
- Images: the underlying client.images.upload() method