[CVPR 2026] Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
2 Million Diverse, Accurate Synthetic Dense-Annotated Images (FC-1M + GC-1M) + 20M Synthetic Object Segments to Supercharge Grounding-DINO, Mask2Former, and Any Detectors / Segmentors / Grounding-VLMs
Weikai Huang1, Jieyu Zhang1,
Taoyang Jia1, Chenhao Zheng1, Ziqi Gao1,
Jae Sung Park1, Ranjay Krishna1,2
1 University of Washington
2 Allen Institute for AI
A scalable pipeline for composing high-quality synthetic object segments into richly annotated images for object detection, instance segmentation, and visual grounding.
Why SOC? A small amount of high-quality synthetic data can outperform orders of magnitude more real data:
- 🚀 Efficient & Scalable: Just 50K SOC images match the gains from 20M model-generated (GRIT) or 200K human-annotated (V3Det) images on LVIS detection. We compose 2 million diverse annotated images (FC-1M + GC-1M), along with 20 million FLUX-generated object segments spanning 47,000+ categories.
- 🎯 Accurate Annotations: Object-centric composition provides pixel-perfect masks, boxes, and referring expressions—no noisy pseudo-labels
- 🎨 Controllable Generation: Synthesize targeted data for specific scenarios (e.g., intra-class referring, rare categories, domain-specific applications)
- 🔄 Complementary to Real Data: Adding SOC to existing datasets (COCO, LVIS, V3Det, GRIT) yields consistent additive gains across all benchmarks
- 💰 Cost-Effective: Generate unlimited training data from 20M object segments without expensive human annotation
- 📈 100K SOC surpasses larger real-data baselines: +10.9 LVIS AP (OVD) and +8.4 gRefCOCO NAcc (VG), and it remains complementary when combined with GRIT/V3Det
We release the following datasets for research use:
| Dataset Name | # Images | # Categories | Description | Download |
|---|---|---|---|---|
| FC-1M | 1,000,000 | 1,600 | Frequent Categories | 🤗 HuggingFace |
| GC-1M | 1,000,000 | 47,000+ | General Categories | 🤗 HuggingFace |
| SFC-200K | 200,000 | 1,600 | Single-Category Frequent Categories — same-category objects with varied attributes | 🤗 HuggingFace |
| SGC-200K | 200,000 | 47,000+ | Single-Category General Categories — same-category objects with varied attributes | 🤗 HuggingFace |
Examples of dataset types:
- FC / GC (example images)
- SFC / SGC (example images)
All datasets include:
- ✅ High-resolution images with photorealistic relighting and blending
- ✅ Pixel-perfect segmentation masks
- ✅ Tight bounding boxes
- ✅ Category labels
- ✅ Diverse referring expressions (attribute-based, spatial-based, and mixed)
Note: Other dataset variants (e.g., SOC-LVIS, MixCOCO) contain segments from existing datasets and cannot be released. Please use the code in this repository to compose your own datasets from the released object segments.
We also release 20M synthetic object segments used to compose the above datasets:
| Segment Set | # Segments | # Categories | Prompts/Category | Segments/Prompt | Download |
|---|---|---|---|---|---|
| FC Object Segments | 10,000,000 | 1,600 | 200 | 3 | 🤗 SOC-FC-Object-Segments-10M |
| GC Object Segments | 10,000,000 | 47,000+ | 10 | 3 | 🤗 SOC-GC-Object-Segments-10M |
Browse all sets via the collection: 🤗 HuggingFace Collection
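To fetch a segment set locally before composing, here is a minimal sketch using huggingface_hub; the repo id and local path are placeholders, so substitute the actual ids from the collection above:
# sketch: download a released segment set (repo id is a placeholder)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-org>/SOC-FC-Object-Segments-10M",  # replace with the actual dataset repo id
    repo_type="dataset",
    local_dir="/data/fc_10m",
)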
Follow the steps below to set up the environment and use the repository:
# Clone the repository
git clone https://github.com/weikaih04/SOC
cd ./SOC
# Create and activate a conda environment:
conda create -n SOC python=3.10
conda activate SOC
# Install the required dependencies for composing images with synthetic object segments:
pip install -r requirements.txt
# If you want to perform relighting and blending:
conda create -n SOC-relight python=3.10
conda activate SOC-relight
pip install -r requirements_relight_and_blending.txt
# If you want to generate referring expressions:
conda create -n SOC-ref python=3.10
conda activate SOC-ref
pip install -r requirements_referring_expression_generation.txt

If you plan to relight images rather than paste object segments directly onto a background, use a random image as the background and set hasBackground to false in scripts/generate_with_batch.py.
You can download the BG-20K background images from this repo: https://github.com/JizhiziLi/GFM.git
We provide scripts to compose images with synthetic segments:
If you want to generate images for relighting and blending that only contain foreground object segments:
python scripts/generate_with_batch.py \
--num_processes 100 \
--total_images 100000 \
--filtering_setting filter_0 \
--image_save_path "/output/dataset_name/train" \
--mask_save_path "/output/dataset_name/panoptic_train" \
--annotation_path "/output/dataset_name/annotations" \
--json_save_path "/output/dataset_name/annotations/panoptic_train.json"- --num_processes: Number of parallel workers to generate images; set based on CPU cores.
- --total_images: Total images to generate.
- --filtering_setting: One of filter_0..filter_4 (filter_4 = strictest). Controls segment quality filters.
- --image_save_path: Output path for rendered RGBA images (PNG).
- --mask_save_path: Output path for color panoptic masks (PNG).
- --annotation_path: Output folder for per-image JSONs and category maps.
- --json_save_path: Final merged COCO-style panoptic JSON path.
Important: At the end of scripts/generate_with_batch.py, available_object_datasets must point to your local copies of released FC/GC object segments and their metadata JSON. For example, if you downloaded SOC-FC-Object-Segments-10M to /data/fc_10m with metadata fc_object_segments_metadata.json, set:
- dataset_path="/data/fc_10m"
- synthetic_annotation_path="/data/fc_10m/fc_object_segments_metadata.json"

Do the same for GC with gc_object_segments_metadata.json.
Notes
- We expect dataset_path to contain category/subcategory/ID.png structure as provided in our released object-segment datasets.
- The script writes per-image JSONs under annotation_path/separate_annotations and merges them into the final COCO-style panoptic JSON at json_save_path.
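As a quick sanity check of a downloaded segment set, here is a small illustrative snippet (not part of the repo) that counts PNG segments per top-level category, assuming the category/subcategory/ID.png layout described above:
# illustrative layout check for a local object-segment copy
from collections import Counter
from pathlib import Path

root = Path("/data/fc_10m")  # your local copy of the released object segments
counts = Counter(p.relative_to(root).parts[0] for p in root.rglob("*.png"))
print(f"{len(counts)} categories, {sum(counts.values())} segments")
print(counts.most_common(5))  # five largest categories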
Minimal example
# Symlink your datasets to the default paths expected by the script (optional)
ln -s /data/fc_10m /fc_10m
ln -s /data/gc_10m /gc_10m
# Generate a tiny sample dataset locally
python scripts/generate_with_batch.py \
--num_processes 4 \
--total_images 20 \
--filtering_setting filter_0 \
--image_save_path "./out/train" \
--mask_save_path "./out/panoptic_train" \
--annotation_path "./out/annotations" \
--json_save_path "./out/annotations/panoptic_train.json"If you want to generate images that directly paste objects onto backgrounds, uncomment the with bg process_image_worker function in scripts/generate_with_batch.py.
Relight and blend images using IC-Light with mask-area-weighted blending to enhance photorealism while preserving object details and colors:
python relighting_and_blending/inference.py \
--dataset_path "$DATASET_PATH" \
--output_data_path "$OUTPUT_DATA_PATH" \
--num_splits "$NUM_SPLITS" \
--split "$SPLIT" \
--index_json_path "" \
--illuminate_prompts_path "$ILLUMINATE_PROMPTS_PATH" \
--record_path "$RECORD_PATH"

The script currently supports both the local file system and Google Cloud Storage.
Notes
- Requires a CUDA GPU. Models load in half precision; 12GB+ VRAM recommended.
- Weights auto-download on first run:
- Stable Diffusion components from stablediffusionapi/realistic-vision-v51
- Background remover briaai/RMBG-1.4
- IC-Light offset iclight_sd15_fc.safetensors (downloaded to ./models if missing)
- Input expectations:
- dataset_path should point to the folder with RGBA foreground PNGs (e.g., ./out/train) named 0.png, 1.png, ...
- A matching color panoptic mask must exist at the same id under dataset_path with "train" replaced by "panoptic_train" (e.g., ./out/panoptic_train/0.png)
- illuminate_prompts_path must be a JSON file containing an array of prompt strings for relighting.
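Before relighting a large split, a small illustrative pre-flight check can confirm the inputs are in place (paths are examples from the composing output above; see the minimal example below for creating the prompt file):
# illustrative pre-flight check for relighting inputs
import json
from pathlib import Path

dataset_path = Path("./out/train")
prompts = json.loads(Path("./illumination_prompt.json").read_text())
assert isinstance(prompts, list) and all(isinstance(p, str) for p in prompts)

for img in sorted(dataset_path.glob("*.png")):
    mask = Path(str(img).replace("train", "panoptic_train"))  # per the naming rule above
    if not mask.exists():
        print(f"missing panoptic mask for {img.name}")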
Minimal example
# Create a tiny illumination prompt list
cat > ./illumination_prompt.json << 'JSON'
[
"golden hour lighting, soft shadows",
"overcast daylight, diffuse light",
"studio softbox lighting"
]
JSON
# Relight a small sample from the composed outputs
python relighting_and_blending/inference.py \
--dataset_path ./out/train \
--output_data_path ./out/relit \
--num_splits 1 \
--split 0 \
--illuminate_prompts_path ./illumination_prompt.json

For referring expression generation, we use an OpenAI-compatible endpoint (vLLM) and query a local model.
Step 1) Start an OpenAI-compatible server (port 8080)
# Example: start vLLM OpenAI server with the model used in our script
python -m vllm.entrypoints.openai.api_server \
--model Qwen/QwQ-32B-AWQ \
--host 0.0.0.0 \
--port 8080

Notes
- Our script currently assumes base_url=http://localhost:8080/v1.
- Ensure your GPU/driver supports the chosen model; adjust model name if needed.
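Before running the generator, you can confirm the endpoint is reachable with a one-off request via the openai Python client (a minimal sketch; per the note below, any non-empty API key is accepted):
# sketch: connectivity check for the local OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy_key")
resp = client.chat.completions.create(
    model="Qwen/QwQ-32B-AWQ",
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)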
Step 2) Run the generator
# INPUT_FILE is the merged COCO-style JSON from the composing stage
# OUTPUT_DIR will contain jsonl shards (one per job): job_0.jsonl, ...
export OPENAI_API_KEY=dummy_key # any non-empty string is accepted
python referring_expression_generation/inference.py \
1 \
0 \
./out/annotations/panoptic_train.json \
./out/refexp \
--api_key "$OPENAI_API_KEY" \
--num_workers 8

Outputs
- At least 9 expressions per image (balanced across attribute/spatial/reasoning, single/multi).
- Writes per-job jsonl files under OUTPUT_DIR.
- Supports local paths and GCS (gs://) for both inputs and outputs.
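To peek at the shards, here is a minimal sketch that prints the keys of the first few records per file (the exact record schema is defined by the generation script and is not assumed here):
# sketch: inspect the generated referring-expression shards
import json
from pathlib import Path

for shard in sorted(Path("./out/refexp").glob("job_*.jsonl")):
    with shard.open() as f:
        for i, line in enumerate(f):
            print(shard.name, sorted(json.loads(line).keys()))
            if i >= 2:
                break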
Model: MM-Grounding-DINO | Benchmarks: LVIS v1.0 Full Val, OdinW-35
With only 50K synthetic images, SOC delivers gains comparable to orders of magnitude more real data:
| Training Data | LVIS AP | APrare | Gain vs Baseline |
|---|---|---|---|
| Object365+GoldG (Baseline) | 20.1 | 10.1 | - |
| + GRIT (20M images) | 27.1 | 17.1 | +7.0 AP |
| + V3Det (200K images) | 30.6 | 24.6 | +10.5 AP |
| + SOC-50K | 29.8 | 23.5 | +9.7 AP |
SOC-50K nearly matches V3Det's gains with 4× fewer images and exceeds GRIT's gains with 400× fewer!
Continuous improvements as we scale from 50K → 100K → 400K:
| SOC Scale | LVIS AP | APrare | OdinW-35 mAP |
|---|---|---|---|
| 50K | 29.8 | 23.5 | 21.0 |
| 100K | 31.0 (+1.2) | 26.3 (+2.8) | 21.0 |
| 400K | 31.4 (+1.6) | 27.9 (+4.4) | 22.8 (+1.8) |
Adding SOC on top of large-scale real datasets yields additive gains:
| Training Data | LVIS AP | APrare | OdinW-35 mAP |
|---|---|---|---|
| Object365+GoldG+V3Det+GRIT | 31.9 | 23.6 | - |
| + SOC-100K | 33.2 (+1.3) | 29.8 (+6.2) | +2.8 |
SOC introduces novel vocabulary and contextual variations not captured by existing real datasets.
Model: MM-Grounding-DINO | Benchmarks: RefCOCO/+/g, gRefCOCO, DoD
Large-scale real datasets provide limited gains for referring expression tasks:
| Training Data | gRefCOCO P@1 | gRefCOCO NAcc | DoD FULL mAP |
|---|---|---|---|
| Object365+GoldG | - | 89.3 | - |
| + V3Det (200K) | +0.5 | +0.0 | - |
| + GRIT (20M) | - | - | +1.4 |
Why? V3Det lacks sentence-level supervision; GRIT uses noisy model-generated caption-box pairs.
SOC generates precise referring pairs from ground truth annotations without human labels:
| Training Data | gRefCOCO NAcc | DoD FULL mAP | Added SOC Data |
|---|---|---|---|
| Object365+GoldG | 89.3 | - | - (baseline) |
| + SOC-50K | 93.9 (+4.6) | +1.0 | 50K images |
| + SOC-100K | 97.7 (+8.4) | +3.8 | 100K images |
Expression Types (3-6 per type, balanced coverage):
- Attribute-based: "the red apple", "charcoal-grey cat"
- Spatial-based: "dog to the right of the bike"
- Mixed-type: "red object to the right of the child"
SOC's per-example gains far exceed those of GRIT (20M) and V3Det (200K)!
Model: APE (LVIS pre-trained) | Benchmark: LVIS v1.0 Val
Two-stage fine-tuning: (1) Train on 50K SOC-LVIS → (2) Continue on LVIS train split
| Training Protocol | AP | APrare | APcommon | APfrequent |
|---|---|---|---|---|
| LVIS only | 46.96 | 40.87 | - | - |
| SOC-50K → LVIS | 48.48 (+1.52) | 44.70 (+3.83) | - | (+0.31) |
Why the large rare-class gain? Synthetic data can be generated to cover underrepresented classes, mitigating LVIS's long-tail imbalance. Frequent classes already have ample real examples and benefit less.
Model: Mask2Former-ResNet-50 | Benchmark: COCO Instance Segmentation
Mixing real COCO segments with SOC synthetic segments (80 COCO categories):
| COCO Data Scale | COCO Only | COCO + SOC | Gain |
|---|---|---|---|
| 1% (~1K images) | - | - | +6.59 AP |
| 10% (~10K images) | - | - | ~+3 AP |
| 50% (~50K images) | - | - | ~+3 AP |
| 100% (Full) | - | - | ~+3 AP |
Key Insight: The boost is most dramatic at 1% COCO (+6.59 AP) and remains around +3 AP at the larger data scales. SOC is most effective when real data is scarce!
Model: MM-Grounding-DINO | Benchmark: Custom intra-class benchmark (COCO + OpenImages V7)
A challenging visual grounding task requiring fine-grained attribute discrimination among same-category instances.
Example: In an image with multiple cars of different colors and makes, locate "the charcoal-grey sedan" (not just "car").
Why it's hard: Models often shortcut by ignoring attributes and relying solely on category nouns.
- Average Gap: Average confidence margin between ground-truth box and highest-scoring same-category distractor
- Positive Gap Ratio: Percentage of images where ground-truth box receives highest confidence among same-category candidates
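For illustration, here is a sketch of how these two metrics could be computed from per-image scores; the input format (ground-truth confidence plus same-category distractor confidences per image) is our assumption, not the benchmark's actual code:
# illustrative computation of Average Gap and Positive Gap Ratio
def intra_class_metrics(per_image):
    # per_image: list of (gt_score, [same-category distractor scores])
    gaps = [gt - max(distractors) for gt, distractors in per_image]
    avg_gap = sum(gaps) / len(gaps)
    positive_gap_ratio = 100.0 * sum(g > 0 for g in gaps) / len(gaps)
    return avg_gap, positive_gap_ratio

# toy example with three images
print(intra_class_metrics([(0.82, [0.40, 0.31]), (0.55, [0.61]), (0.70, [0.22])]))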
| Training Data | Average Gap | Positive Gap Ratio |
|---|---|---|
| Object365+GoldG | 37.5 | ~80% |
| + GRIT (20M) | 34.6 (-2.9) | ~82% |
| + V3Det (200K) | 36.7 (-0.8) | ~83% |
| + GRIT + V3Det | 35.8 (-1.7) | ~85% |
| + SOC-SFC-50K + SOC-SGC-50K | 40.6 (+3.1) | 90% |
SOC-SFC/SGC: Synthetic images with multiple instances of the same category but varied attributes (e.g., cars with different colors and makes).
Key Insight: Large-scale auxiliary data (GRIT, V3Det) yields negligible or even negative impact. Only targeted synthetic data tailored to intra-class attribute variation significantly improves performance!
- Weikai Huang: weikaih@cs.washington.edu
- Jieyu Zhang: jieyuz2@cs.washington.edu
@misc{huang2025syntheticobjectcompositionsscalable,
title={Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding},
author={Weikai Huang and Jieyu Zhang and Taoyang Jia and Chenhao Zheng and Ziqi Gao and Jae Sung Park and Winson Han and Ranjay Krishna},
year={2025},
eprint={2510.09110},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.09110},
}

We thank the authors of FLUX-1, IC-Light, DIS, Qwen, and QwQ for their excellent open-source models that made this work possible.