embedumap builds a standalone index.html scatterplot from a CSV using Gemini embeddings, UMAP, and clustering.
uvx --from "git+https://github.com/sanand0/embedumap.git@main" embedumap ...

Dry run:
uv run embedumap samples/blog-text.csv \
--embedding-columns text \
--color-columns primary_category,year \
--filter-columns primary_category,year \
--timeline-column year \
--dry-run

Build a text map:
uv run embedumap samples/blog-text.csv \
--embedding-columns text \
--color-columns primary_category,year \
--filter-columns primary_category,year \
--timeline-column year

Build a text map with short LLM cluster names:
uv run embedumap samples/blog-text.csv \
--embedding-columns text \
--color-columns primary_category,year \
--filter-columns primary_category,year \
--timeline-column year \
--branding "My map" \
--opacity 0.7 \
--bar-chart-corner bottom-left \
--cluster-names

Build a text map without LLM-interpreted axis labels:
uv run embedumap samples/blog-text.csv \
--embedding-columns text \
--color-columns primary_category,year \
--filter-columns primary_category,year \
--timeline-column year \
--no-axis-labels

Build an image-first map:
uv run embedumap samples/calvin-images.csv \
--image-columns file \
--max-image-size 768 \
--timeline-column date \
--popup-style grid

Build trails from time-bucket centroids:
uv run embedumap samples/blog-text.csv \
--embedding-columns text \
--color-columns primary_category \
--filter-columns primary_category,year \
--timeline-column year \
--trails

Build trails for selected grouping columns with a custom bucket size:
uv run embedumap samples/blog-text.csv \
--embedding-columns text \
--color-columns primary_category \
--filter-columns primary_category,year \
--timeline-column year \
--trails primary_category \
--trail-period yearly

Build an audio-first map:
uv run embedumap /path/to/audio.csv \
--audio-columns clip \
--audio-metadata-columns title,speaker \
--filter-columns speaker \
--popup-style list

No-clone public smoke test:
uvx --from "git+https://github.com/sanand0/embedumap.git@main" embedumap https://raw.githubusercontent.com/sanand0/embedumap/main/samples/blog-text-300.csv --embedding-columns text --color-columns primary_category,year --filter-columns primary_category,year --timeline-column year

- Put GEMINI_API_KEY in .env or the environment.
- The generated HTML embeds data inline and uses direct image/audio references when media columns are provided.
- --branding controls the top-left page label, --opacity sets active point opacity, --inactive-opacity sets point opacity outside filters or timeline range, and --trail-opacity sets trail opacity.
- --bar-chart-corner moves the overlay bar chart between top-left, top-right, bottom-left, and bottom-right.
- Axis labels are interpreted by Gemini by default using --cluster-naming-model; use --no-axis-labels to keep UMAP 1 and UMAP 2.
- --max-image-size N resizes embedded image payloads to fit inside an N by N tile without changing aspect ratio. There is no default resize; use 768 to match the Gemini embedding models' tile size.
- Embeddings are cached by default in embedumap.duckdb next to the output HTML.
- --batch-size N controls how many rows are sent per embedding request. Increase it to speed up builds when your Gemini quota allows larger batches.
- --cluster-names adds a lightweight Gemini naming pass after deterministic clustering.
- --trails draws paths through time-bucket centroids. With no value it draws cluster trails; with values like --trails primary_category,cluster, it also draws those group trails.
- --trail-period overrides the automatic bucket size. Examples: 1min, 1h, 2h 15min, daily, weekly, 2Q, yearly.
- Trail playback uses a play button, cumulative toggle, logarithmic speed slider, and inactive opacity slider for dimmed nodes and trails. Trails start off in the UI; timeline range, trail visibility, playback mode, speed, and inactive opacity are shareable in the URL.
- The pipeline still stays intentionally small: no thumbnails, no sidecar JSON, no transcription pipeline.
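
To make "paths through time-bucket centroids" concrete, here is a minimal sketch in plain Python. It is not embedumap's implementation: the function name `bucket_centroids` and the sample coordinates are hypothetical, and it assumes a centroid is the mean of the 2-D coordinates of all points that fall in the same time bucket (here, a year).

```python
from collections import defaultdict


def bucket_centroids(points):
    """Return (bucket, x, y) centroids ordered by bucket.

    `points` is an iterable of (bucket, x, y) tuples, where `bucket` is a
    sortable time key such as a year. Hypothetical helper for illustration.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # bucket -> [sum_x, sum_y, count]
    for bucket, x, y in points:
        acc = sums[bucket]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    # Sorting by bucket gives the order in which a trail would connect centroids.
    return [(b, sx / n, sy / n) for b, (sx, sy, n) in sorted(sums.items())]


# Made-up 2-D coordinates for one group, bucketed by year.
points = [
    (2021, 0.0, 0.0),
    (2021, 2.0, 4.0),
    (2022, 1.0, 1.0),
    (2023, 3.0, 5.0),
    (2023, 5.0, 7.0),
]
trail = bucket_centroids(points)
print(trail)  # [(2021, 1.0, 2.0), (2022, 1.0, 1.0), (2023, 4.0, 6.0)]
```

A trail for a group such as primary_category would repeat this per group value, which is presumably why a coarser bucket size yields fewer, smoother trail segments.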
Embeddings are cached by default in embedumap.duckdb next to the generated HTML output. Each row cache key is a SHA-256 hash of stable JSON containing:
- the cache version,
- the source label,
- the row index,
- a content hash of the normalized text payload, audio metadata text, image signatures, and audio signatures,
- the embedding model,
- the embedding dimensions,
- and --max-image-size when the row includes images.
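
The key construction listed above can be sketched as follows. This is an illustration under assumptions, not the tool's actual code: the function name, the JSON field names, and the model name are hypothetical, and "stable JSON" is assumed to mean sorted keys with compact separators.

```python
import hashlib
import json


def row_cache_key(cache_version, source_label, row_index, content_hash,
                  model, dimensions, max_image_size=None):
    """Hash a stable JSON payload into a hex SHA-256 cache key.

    Field names are hypothetical; only the listed ingredients come from
    the description above.
    """
    payload = {
        "cache_version": cache_version,
        "source": source_label,
        "row_index": row_index,
        "content_hash": content_hash,
        "model": model,
        "dimensions": dimensions,
    }
    # Assumption: the resize setting is part of the key only for rows with images.
    if max_image_size is not None:
        payload["max_image_size"] = max_image_size
    # Sorted keys and fixed separators make the serialization deterministic.
    stable = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


content = hashlib.sha256(b"normalized row text").hexdigest()
key_text = row_cache_key(1, "blog-text.csv", 0, content, "example-embedding-model", 768)
key_image = row_cache_key(1, "blog-text.csv", 0, content, "example-embedding-model", 768,
                          max_image_size=768)
print(key_text != key_image)  # True: the resize setting changes the key
```

Because every ingredient feeds the hash, changing any one of them (a row's content, the model, the dimensions) produces a new key and therefore a fresh embedding, while visualization-only options never enter the payload.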
To regenerate an HTML file using existing embeddings, rerun the same command with the same output path, source rows, embedding inputs, model, dimensions, and image resize setting. Visualization-only changes such as colors, filters, timeline controls, trails, opacity, branding, and popup options reuse the cached embeddings. If you write the HTML to a different directory, move or copy the existing embedumap.duckdb beside the new output file before rerunning.
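
When the output moves to a different directory, the copy step can be scripted. The sketch below is a hypothetical helper (`copy_cache_beside` is not part of embedumap); it assumes the cache is the single file embedumap.duckdb sitting next to the previously generated HTML.

```python
import shutil
from pathlib import Path


def copy_cache_beside(old_html, new_html, cache_name="embedumap.duckdb"):
    """Copy the cache file from the old output directory to the new one.

    Returns the new cache path, or None when no cache exists to copy.
    """
    old_cache = Path(old_html).parent / cache_name
    new_cache = Path(new_html).parent / cache_name
    if not old_cache.exists():
        return None
    new_cache.parent.mkdir(parents=True, exist_ok=True)
    # Skip the copy when both outputs share a directory.
    if old_cache.resolve() != new_cache.resolve():
        shutil.copy2(old_cache, new_cache)
    return new_cache
```

For example, `copy_cache_beside("out/index.html", "site/maps/index.html")` would place a copy at site/maps/embedumap.duckdb before the rerun.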