Dataset Augmentation
Dataset augmentation produces a new HuggingFace dataset by applying transforms to an existing one. This is a useful step between data collection and fine-tuning when you want to improve the robustness of your model without collecting additional demonstrations.
When to Use Augmentation
Augmentation helps in two common scenarios:
- Robustness to lighting and visual noise — applying brightness, contrast, color temperature, blur, and occlusion variations teaches the model to generalize beyond the exact conditions under which the data was recorded.
- Small dataset expansion — generating multiple augmented copies per episode increases the effective size of your dataset, which can stabilize training on small recordings.
Augmentation Modes
Two augmentation types are available:
| Type | Description |
|---|---|
| `deterministic` | Fast CPU transforms: lighting, noise, blur, occlusion |
| `generative` | Cosmos Transfer2.5 generative re-rendering (currently disabled; coming soon) |
Deterministic augmentation runs on a single GPU instance and completes in minutes for typical datasets. Generative augmentation is being integrated and will be available in a future release.
Available Transforms
Deterministic augmentation supports four transforms, applied per frame:
| Transform | Effect |
|---|---|
| `lighting` | Adjusts brightness, contrast, and color temperature within configurable ranges |
| `noise` | Adds Gaussian sensor noise to simulate lower-quality cameras |
| `blur` | Applies motion or focus blur |
| `occlusion` | Randomly masks regions of the frame to simulate partial occlusions |
You can combine transforms in two pipeline modes:
- `stacked`: applies all selected transforms to each augmented copy. Use this when you want every copy to contain a mix of variations.
- `independent`: creates one copy per transform. Use this when you want to control which transform produced which copy.
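The difference between the two pipeline modes can be sketched as a small planning function. This is an illustration only: `plan_copies` is a hypothetical helper, not part of the Qualia SDK. It assumes that `stacked` yields `copies` copies that each carry every selected transform, and that `independent` yields exactly one copy per transform regardless of `copies`.

```python
from typing import List


def plan_copies(transforms: List[str], mode: str, copies: int = 1) -> List[List[str]]:
    """Return, for each augmented copy, the transforms applied to it.

    Hypothetical helper for illustration; not part of the Qualia SDK.
    """
    if mode == "stacked":
        # Every copy receives all selected transforms.
        return [list(transforms) for _ in range(copies)]
    if mode == "independent":
        # One copy per transform, so each copy maps back to a single transform.
        return [[name] for name in transforms]
    raise ValueError(f"unknown pipeline mode: {mode!r}")


stacked = plan_copies(["lighting", "noise"], "stacked", copies=3)
independent = plan_copies(["lighting", "noise"], "independent")
print(stacked)      # three copies, each with both transforms
print(independent)  # two copies, one transform each
```

In the `independent` plan each copy lists a single transform, which is what makes it possible to attribute a change in model behavior to one kind of variation.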
Dataset Modes
The `dataset_mode` setting controls how augmented frames are added to the output dataset:
- `copy`: augmented copies are added alongside the original episodes. The original data is preserved and the dataset grows by `copies` × the original episode count.
- `inplace`: the original episodes are replaced with augmented versions. Use this when you only need the augmented data and want to keep the dataset size constant.
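To estimate how large the output dataset will be, the arithmetic for the two modes can be written out as follows. `output_episode_count` is a hypothetical helper, not an SDK function, and it assumes each requested copy reproduces every original episode once.

```python
def output_episode_count(original_episodes: int, copies: int, dataset_mode: str) -> int:
    """Estimate the number of episodes in the augmented dataset.

    Hypothetical helper for illustration; not part of the Qualia SDK.
    """
    if original_episodes < 0 or copies < 1:
        raise ValueError("original_episodes must be >= 0 and copies >= 1")
    if dataset_mode == "copy":
        # Originals are kept and the dataset grows by copies x originals.
        return original_episodes + copies * original_episodes
    if dataset_mode == "inplace":
        # Originals are replaced, so the size stays constant.
        return original_episodes
    raise ValueError(f"unknown dataset_mode: {dataset_mode!r}")


print(output_episode_count(10, 3, "copy"))     # 40
print(output_episode_count(10, 3, "inplace"))  # 10
```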
Creating an Augmentation Job
From the Dashboard
- Click New Job in the top-right of your project page.
- Select Dataset Augmentation.
- Choose the dataset you want to augment.
- Pick the transforms and tune the ranges if needed.
- Click Start Job.
The job progresses through `queuing` → `instance_booting` → `instance_setup` → `augmentation_running` → `dataset_uploading` → `completed`. Once finished, the new dataset is published to your linked HuggingFace account.
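If you track jobs programmatically, the status sequence can be encoded as an ordered list. The sketch below is an assumption-laden illustration: `JOB_STAGES`, `is_terminal`, and `stage_progress` are hypothetical names, not SDK exports, and the `failed` and `cancelled` statuses are taken from the SDK example on this page rather than from a documented status list.

```python
# Ordered happy-path stages, as listed in the dashboard walkthrough.
JOB_STAGES = [
    "queuing",
    "instance_booting",
    "instance_setup",
    "augmentation_running",
    "dataset_uploading",
    "completed",
]

# Statuses after which the job no longer changes (assumed set).
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def is_terminal(status: str) -> bool:
    """Return True once polling can stop."""
    return status in TERMINAL_STATUSES


def stage_progress(status: str) -> float:
    """Return how far along the happy path a status is, from 0.0 to 1.0."""
    if status not in JOB_STAGES:
        raise ValueError(f"not a happy-path stage: {status!r}")
    return JOB_STAGES.index(status) / (len(JOB_STAGES) - 1)


print(stage_progress("augmentation_running"))  # 0.6
print(is_terminal("instance_setup"))           # False
```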
From the SDK
See the SDK reference for the full Python API. A minimal example:
```python
import time

from qualia import Qualia

client = Qualia()

job = client.augmentation.create(
    project_id="...",
    dataset_id="qualiaadmin/spoon10",
    transforms="lighting,noise",
    copies=3,
)

# Poll until the job reaches a terminal state
while True:
    status = client.augmentation.get(job.job_id)
    if status.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(10)  # arbitrary pause between polls to avoid a tight request loop

if status.status == "completed":
    print(f"Augmented dataset: {status.augmented_dataset_id}")
```

Tuning the Lighting Ranges
The lighting transform takes three parameter ranges. Defaults are tuned for mild variation and work well in most cases, but you can widen them for more aggressive augmentation:
| Parameter | Default range | Allowed bounds |
|---|---|---|
| `brightness_range` | (-15, 15) % | min ∈ [-50, 0], max ∈ [0, 50] |
| `contrast_range` | (-10, 10) % | min ∈ [-50, 0], max ∈ [0, 50] |
| `color_temp_range` | (4000, 6500) K | min ∈ [2000, 6500], max ∈ [4000, 10000] |
For each augmented frame the platform samples a value uniformly from the range. Wider ranges produce more visual diversity but can also push the data further from your robot’s real operating conditions — keep the ranges realistic for your deployment environment.
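Before submitting custom ranges, it can help to check them against the allowed bounds locally and to see what uniform sampling produces. The sketch below copies the defaults and bounds from the table in this section. `validate_range` and `sample_lighting` are hypothetical helpers, not SDK functions, and the rule that a range's minimum may not exceed its maximum is an assumption.

```python
import random
from typing import Dict, Tuple

Range = Tuple[float, float]

# For each parameter: (allowed bounds for min, allowed bounds for max).
ALLOWED_BOUNDS: Dict[str, Tuple[Range, Range]] = {
    "brightness_range": ((-50, 0), (0, 50)),
    "contrast_range": ((-50, 0), (0, 50)),
    "color_temp_range": ((2000, 6500), (4000, 10000)),
}

DEFAULT_RANGES: Dict[str, Range] = {
    "brightness_range": (-15, 15),     # percent
    "contrast_range": (-10, 10),       # percent
    "color_temp_range": (4000, 6500),  # kelvin
}


def validate_range(name: str, value: Range) -> None:
    """Raise ValueError if a range falls outside the documented bounds."""
    (min_lo, min_hi), (max_lo, max_hi) = ALLOWED_BOUNDS[name]
    low, high = value
    if not min_lo <= low <= min_hi:
        raise ValueError(f"{name}: min {low} not in [{min_lo}, {min_hi}]")
    if not max_lo <= high <= max_hi:
        raise ValueError(f"{name}: max {high} not in [{max_lo}, {max_hi}]")
    if low > high:  # assumed rule; possible for color_temp_range
        raise ValueError(f"{name}: min {low} exceeds max {high}")


def sample_lighting(ranges: Dict[str, Range], rng: random.Random) -> Dict[str, float]:
    """Draw one value per parameter uniformly from its range."""
    for name, value in ranges.items():
        validate_range(name, value)
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)  # seeded so repeated runs draw the same values
params = sample_lighting(DEFAULT_RANGES, rng)
print(params)
```

Sampling a few hundred values this way and checking their spread gives a quick sense of whether a widened range still resembles your deployment lighting.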
Next Steps
- Fine-tune a model using your augmented dataset
- SDK reference for augmentation