Superlinked, Inc.

Data Infrastructure and Analytics

San Francisco, California · 5,829 followers

The data engineer’s solution to turning data into vector embeddings.

About us

The data engineer’s solution to turning data into vector embeddings. Building LLM demos is cool, turning 1B user clicks and millions of documents into vectors is cooler.

Website
https://superlinked.com/
Industry
Data Infrastructure and Analytics
Company size
11-50 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2021
Specialties
Personalization, Developer APIs, Cloud Infrastructure, Information Retrieval, and Vector Embedding Compute

Locations

  • Primary

    28 Geary Street

    Suite 650

    San Francisco, California 94108, US


Updates

  • Superlinked, Inc. reposted this

    Every re-ranking call costs you 150ms and $0.002. Your search handles 1M queries/day. Do the math. 💸 💸

    The Re-Ranking Tax:
    ▪️ Latency: 150ms per query (cross-encoder inference)
    ▪️ Cost: $0.002 per query (GPU compute for scoring 100 doc pairs)
    ▪️ Scale: 1M queries/day

    Daily cost: $2,000
    Monthly cost: $60,000
    Annual cost: $730,000 📈

    And that's just compute. Add:
    ▪️ Infrastructure maintenance (re-ranker deployment)
    ▪️ Engineering time (blending metadata post-search)
    ▪️ User drop-off (every 100ms of latency = 1% conversion loss)

    Why You're Paying This Tax:
    Initial retrieval is weak: text embeddings only, price/rating signals missing. Re-ranking tries to fix it post-hoc. But if relevant docs aren't in the top 100, the re-ranker never sees them.

    ✅ The Alternative: Encode signals at index time.
    ▪️ Text similarity + price optimization + rating maximization
    ▪️ Hard filters before search (eliminate irrelevant items)
    ▪️ Dynamic query-time weights (no re-embedding needed)

    Results:
    ▪️ Cost: $0 re-ranking (eliminated)
    ▪️ Latency: 50ms (4x faster)
    ▪️ Accuracy: higher (relevant results in initial retrieval)

    TCO: $730K/year → $0

    Re-ranking isn't a feature. It's a tax on bad retrieval architecture. Full cost breakdown + architecture comparison 👇
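
    For scale, the dollar figures above are just multiplication; a minimal sketch of the cost model, using only the numbers quoted in the post:

```python
# Back-of-the-envelope cost model for the figures quoted above.
# Assumes the post's numbers: $0.002 per re-ranked query, 1M queries/day.
COST_PER_QUERY_USD = 0.002    # GPU compute to cross-encode ~100 doc pairs
QUERIES_PER_DAY = 1_000_000

daily = COST_PER_QUERY_USD * QUERIES_PER_DAY   # $2,000
monthly = daily * 30                           # $60,000
annual = daily * 365                           # $730,000
print(f"daily ${daily:,.0f} | monthly ${monthly:,.0f} | annual ${annual:,.0f}")
```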

  • Superlinked, Inc. reposted this

    Flash Attention 4 is coming. Everyone's hyped. Almost no one can explain why it's actually faster.

    "It's optimized for Blackwell" is like saying "it's fast because it's new." Here's what's actually going on:

    1️⃣ They replaced GPU hardware with basic math (and it's faster)
    GPUs have Special Function Units for exponentials. Dedicated silicon. Sounds fast. Problem: there are far fewer SFUs than CUDA cores, so a heavy attention load means waiting in line. FA4's fix? A cubic polynomial approximation on CUDA cores. Three fused multiply-adds beat the dedicated hardware queue. 🤯

    2️⃣ They got lazy with rescaling (on purpose)
    Standard attention rescales values every time it sees a new maximum. Safe, but wasteful. FA4 only rescales when the change actually threatens numerical stability. 10x fewer rescaling ops, same correctness. Pure computational fat trimmed.

    3️⃣ They turned attention into an async pipeline
    FA3 had 2 stages. FA4 has 5 (load, MMA, softmax, correction, epilogue) running concurrently on specialized warps. Think event loop on steroids.

    The result? ~20% faster than FA3. The first attention kernel to break the petaflop barrier.

    FA4 isn't widely available yet (Blackwell only, forward pass only). But when it lands broadly, the "just throw more GPUs at it" crowd is in for a rude awakening. Architecture eats hardware for breakfast.

    Props to Tri Dao for pushing the boundaries again.
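
    To make point 1️⃣ concrete, here is an illustrative sketch of approximating exp(x) with a cubic polynomial evaluated as three fused multiply-adds (Horner form). The coefficients below are fit on the fly for illustration; they are not FA4's actual kernel or coefficients:

```python
# Illustrative sketch only: approximate exp(x) on a bounded range with a cubic
# polynomial evaluated in Horner form, i.e. three FMA-shaped steps on regular
# CUDA cores instead of a Special Function Unit call.
import numpy as np

xs = np.linspace(-1.0, 0.0, 1000)               # softmax exponents are <= 0 after max-subtraction
c3, c2, c1, c0 = np.polyfit(xs, np.exp(xs), 3)  # fit cubic coefficients offline (assumed, not FA4's)

def exp_approx(x):
    # Horner form: ((c3*x + c2)*x + c1)*x + c0 -> three fused multiply-adds
    return ((c3 * x + c2) * x + c1) * x + c0

err = np.max(np.abs(exp_approx(xs) - np.exp(xs)))
print(f"max abs error on [-1, 0]: {err:.2e}")
```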

  • Superlinked, Inc. reposted this

    Your GPU does 312 trillion operations per second. So why does embedding inference take milliseconds? 🤯

    Because you're not compute-bound. You're memory-bound.

    Here's what actually happens when you embed a sentence: Tokenization? Microseconds. Irrelevant. Then you run tokens through 12 Transformer layers. Each one reads weights from memory, computes, writes back, reads again. The A100 reads memory at 1.5 TB/s. Sounds fast, but compared to 312 TFLOPS of compute, it's a crawl. Your GPU spends most of its time waiting for data. Not doing math.

    This is why "it's fast because it's written in Rust" misses the point. 🤡 HTTP handling, JSON parsing, request routing: that's 5% of your latency, on the CPU. The other 95% is the GPU memory bottleneck.

    So what actually moves the needle?

    1️⃣ Flash Attention
    Standard attention writes every intermediate matrix to slow HBM, then reads it back. Over and over. Flash Attention tiles computation into blocks that fit in fast SRAM, the GPU's scratchpad cache. Same math, same result, zero expensive memory round-trips. 2-4x faster. Not because of language. Because it respects the memory hierarchy.

    2️⃣ Quantization
    Usually explained as "smaller numbers = faster math." The real win? Bandwidth. FP32 weights = 4 bytes per parameter. INT8 = 1 byte. Your 400MB model becomes 100MB. Less data moved = less waiting. Retrieval accuracy loss? Under 1% NDCG degradation. Basically free throughput.

    3️⃣ Token-based batching
    Batch by total token count, not request count. Pack the GPU like Tetris instead of wasting cycles on padding.

    The takeaway: performance is architecture first, syntax second. Rust helps with GC pauses and operational simplicity. But you could build a comparably fast system in Python with the same GPU kernel choices and batching strategy.

    Stop optimizing the 5%. Attack the 95%. At least, that's how I see it. What about you?

    Props to Filip Makraduli for the deep dive.
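
    A minimal sketch of point 3️⃣, token-based batching, under assumed token counts; this is the packing idea only, not any particular server's actual scheduler:

```python
# Minimal sketch of token-based batching: group requests by total token count
# rather than request count, so one long document doesn't force padding onto a
# batch of short queries. Request IDs and token counts here are hypothetical.

def batch_by_tokens(requests, max_tokens_per_batch=8192):
    """requests: list of (request_id, num_tokens). Returns a list of batches of IDs."""
    batches, current, current_tokens = [], [], 0
    for req_id, n_tokens in sorted(requests, key=lambda r: r[1]):  # similar lengths batch together -> less padding
        if current and current_tokens + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req_id)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

print(batch_by_tokens([("a", 7000), ("b", 300), ("c", 512), ("d", 4000)]))
# -> [['b', 'c', 'd'], ['a']]
```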

  • Superlinked, Inc. reposted this

    You're training custom embedding models for recommendations. There's a faster way. ⚡

    ❌ The Old Approach:
    Problem: LLMs can't handle structured data (prices, dates, categories, ratings). Generic embeddings underperform in production.
    Solution: Train custom embedding models.
    Cost:
    ▪️ Months of development
    ▪️ ML expertise required
    ▪️ Training data collection
    ▪️ Most projects stuck in POC

    ✅ The New Approach:
    Use Superlinked, Inc.'s vector compute framework. Creates custom embeddings WITHOUT training.
    How:
    ▪️ Combines structured + unstructured data
    ▪️ Pre-trained models + custom logic
    ▪️ Python notebook → production in weeks

    Example: Query "high rated quality products"

    Old (LLM embeds entire review text):
    ▪️ Top result 1: "Good Product" (5 stars) ✅
    ▪️ Top result 2: "Can't beat this deal!" (5 stars) ✅
    ▪️ Top result 3: "Cheap product Not good" (1 star) ❌
    LLM sees the "product" keyword, misses the rating signal.

    New (Superlinked's separate spaces):
    ▪️ Text similarity space: "quality products"
    ▪️ Rating maximizer space: prioritize high ratings
    ▪️ Top 3 results: ALL 5-star reviews ✅

    Result: Custom model performance without training overhead.

    Real case: e-commerce platform, 4x AOV uplift, launched in weeks using Redis + Superlinked. Full code walkthrough 👇
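
    A conceptual sketch of the separate-spaces idea in plain Python, not the Superlinked API itself; the review vectors, weights, and rating scaling below are made-up stand-ins:

```python
# Conceptual sketch: score each review in a text-similarity space and a
# rating-maximizer space, then blend the two with query-time weights.
# The 2-D "embeddings" are stand-ins, not real model outputs.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reviews = [
    {"title": "Good Product",           "rating": 5, "vec": np.array([0.9, 0.1])},
    {"title": "Can't beat this deal!",  "rating": 5, "vec": np.array([0.6, 0.4])},
    {"title": "Cheap product Not good", "rating": 1, "vec": np.array([0.8, 0.2])},
]
query_vec = np.array([1.0, 0.0])     # pretend embedding of "high rated quality products"

w_text, w_rating = 0.5, 0.5          # query-time weights, no re-embedding needed
for r in reviews:
    text_score = cosine(query_vec, r["vec"])   # text similarity space
    rating_score = (r["rating"] - 1) / 4       # rating maximizer space, scaled to [0, 1]
    r["score"] = w_text * text_score + w_rating * rating_score

for r in sorted(reviews, key=lambda r: r["score"], reverse=True):
    print(f'{r["score"]:.2f}  {r["title"]}  ({r["rating"]} stars)')
# The 1-star review drops to the bottom even though its text mentions "product".
```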

  • Superlinked, Inc. reposted this

    Teaching Your Recommender LLMs to Think Twice. Here's why single-shot reasoning breaks. 🔍

    System-1 vs System-2 Thinking:

    System 1️⃣ (Single-Shot LLM): Fast, automatic, error-prone.
    Example: User clicks 3 sci-fi novels → LLM concludes "user likes sci-fi"
    Ignores: User disliked 10 dystopian novels → contradictory evidence missed

    System 2️⃣ (Reflection Loop): Slow, deliberate, self-correcting.
    Actor generates preference → Reflector catches the flaw: "User consistently dislikes dystopian themes, contradicts your conclusion" → Actor refines

    Why Single-Shot Fails in RecSys:
    ❌ Black-box reasoning (can't audit the chain of thought)
    ❌ Small errors compound (hallucinated preference = bad features = bad recommendations)
    ❌ No quality control (output goes straight to the downstream model)

    ▶︎ R4ec Solution: Built-In Quality Control
    Reflector model trained to spot inconsistencies. Feedback loop forces the Actor to reconsider. Final output is "vetted" for quality before feeding the recommender.

    Production Impact:
    ✅ +2.2% revenue
    ✅ +1.6% CVR
    ✅ +4.1% long-tail lift (cold-start solved)

    This Actor-Reflector pattern works for:
    ▪️ Code generation (Reflector catches bugs)
    ▪️ Content validation (Reflector checks facts)
    ▪️ Chatbot responses (Reflector ensures grounding)

    Any task where LLM errors are costly needs a Reflector. Process matters more than model size. Two small specialized models > one large general model.
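
    A minimal sketch of the Actor-Reflector loop; `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompts are illustrative only:

```python
# Minimal sketch of an Actor-Reflector loop. `call_llm` is a hypothetical
# stand-in; the loop structure is the point, not the prompts.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client here")

def actor_reflector(user_history: str, max_rounds: int = 3) -> str:
    # Actor: first-pass preference inference (System 1).
    preference = call_llm(f"Infer this user's preferences:\n{user_history}")
    for _ in range(max_rounds):
        # Reflector: check the summary against the evidence (System 2).
        critique = call_llm(
            "Check this preference summary against the history. "
            "Reply 'OK' if consistent, otherwise explain the contradiction.\n"
            f"History:\n{user_history}\nSummary:\n{preference}"
        )
        if critique.strip().upper().startswith("OK"):
            break                      # Reflector found no inconsistency
        preference = call_llm(         # Actor refines using the Reflector's feedback
            f"Revise the summary to address this critique:\n{critique}\n"
            f"Previous summary:\n{preference}"
        )
    return preference                  # vetted output feeds the downstream recommender
```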

  • “System X is fast because it’s written in Rust.” Is this true 100% of the time?

    Most people assume embedding inference speed comes down to the code they write: Python versus Rust, frameworks, etc. In practice, almost none of that is decisive. What really affects embedding latency is memory. GPUs are extremely fast at calculations but comparatively slow at moving data. Generating an embedding is mostly about reading and writing large model weights and intermediate tensors rather than crunching numbers.

    That is why techniques like Flash Attention (used by the popular inference solution TEI) matter: they reorganise computation so more work stays in fast on-chip cache instead of repeatedly hitting slower GPU memory. Quantisation helps for the same reason: smaller weights mean less data to move.

    If you want faster embeddings, start thinking about memory, cache locality, and data movement to realise some actual gains. Or better yet, read Filip’s full deep-dive on the matter here: https://lnkd.in/eyTuH2cu
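
    A quick roofline-style sanity check, using the A100 figures quoted earlier in this feed (312 TFLOPS of compute, 1.5 TB/s of memory bandwidth), shows how much work a kernel must do per byte moved before compute, rather than memory, becomes the bottleneck:

```python
# Roofline-style check with the numbers quoted in this feed: at 312 TFLOPS of
# compute and 1.5 TB/s of memory bandwidth, a kernel needs roughly 200+
# floating-point operations per byte moved before it stops being memory-bound.

peak_flops = 312e12        # tensor-core throughput quoted above (FLOP/s)
mem_bandwidth = 1.5e12     # HBM bandwidth quoted above (bytes/s)

breakeven_intensity = peak_flops / mem_bandwidth   # FLOPs per byte
print(f"break-even arithmetic intensity: {breakeven_intensity:.0f} FLOPs/byte")
```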

  • Superlinked, Inc. reposted this

    No Amount of Re-Ranking will Save You From Bad Retrieval.

    The Common Workflow:
    1) Embed documents → vector search
    2) Retrieve 100 results
    3) Re-rank with cross-encoder
    4) Return top 10

    ⚠️ The Problem:
    ▪️ Re-ranking is expensive. Cross-encoders process every query-doc pair individually. At scale: 100 results × re-ranking = 200ms+ latency per query.
    ▪️ Re-ranking doesn't fix bad initial retrieval. If your first pass misses the right signals, re-ranking just reorders garbage.

    Example: "Affordable wireless headphones under $200 with high ratings"
    Initial vector search on text only:
    → retrieves based on "headphones" semantics
    → misses price and rating signals
    → gets $250 headphones (over budget) and 3-star products (low rating).
    The re-ranker tries to fix this post-hoc. But it only sees what was retrieved. Garbage in, garbage out.

    🔧 The Fix: Encode multiple signals at index time.
    📝 Text similarity space (semantics)
    💰 Price minimizer space (optimizes for lower)
    ⭐ Rating maximizer space (optimizes for higher)

    Apply hard filters before search:
    ▪️ price < $200 (eliminates expensive products)
    ▪️ category = electronics (narrows the search space)

    Use dynamic weights at query time:
    ▪️ Adjust text vs price vs rating importance without re-embedding

    Result: Relevant results surface in initial retrieval. No re-ranking needed. 4x faster, much lower compute costs. Full breakdown with code examples 👉 https://lnkd.in/gvcnR39V
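
    A sketch of the fix under stated assumptions (made-up catalogue, made-up per-space scores): hard filters run first, then query-time weights blend the spaces with no re-embedding:

```python
# Sketch of the fix described above. The catalogue and per-space scores are
# made up; the point is that hard filters run before scoring and the space
# weights can change at query time without re-embedding anything.

products = [
    {"name": "Headphones A", "price": 250, "rating": 4.6, "text_sim": 0.92, "category": "electronics"},
    {"name": "Headphones B", "price": 180, "rating": 4.4, "text_sim": 0.88, "category": "electronics"},
    {"name": "Headphones C", "price": 150, "rating": 3.1, "text_sim": 0.90, "category": "electronics"},
]

# 1) Hard filters before search: over-budget or off-category items never enter scoring.
candidates = [p for p in products if p["price"] < 200 and p["category"] == "electronics"]

# 2) Dynamic query-time weights across the three spaces.
w_text, w_price, w_rating = 0.5, 0.25, 0.25
for p in candidates:
    price_score = 1 - p["price"] / 200          # price minimizer: cheaper is better
    rating_score = p["rating"] / 5              # rating maximizer: higher is better
    p["score"] = w_text * p["text_sim"] + w_price * price_score + w_rating * rating_score

for p in sorted(candidates, key=lambda p: p["score"], reverse=True):
    print(f'{p["score"]:.2f}  {p["name"]}  ${p["price"]}  {p["rating"]} stars')
```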

  • Using open-source solutions to productionise your embeddings can get you a long way, but the efficiency problem that faces ML and AI Engineers still needs solving…

    * Some models can generate dense, sparse, and multi-vector embeddings in one pass, but today you usually need multiple API calls because these outputs are handled separately.
    * Running and testing multiple models in production is costly and complex, with limited support for serving many models efficiently when VRAM is constrained.
    * Differences in embeddings, pooling strategies, and model quirks require careful handling by users, and current systems lack flexible ways to support new model types without code changes.

    Filip Makraduli takes a deep dive into the existing open-source inference solutions, what they do well, and what they’re ultimately missing to make everyone’s jobs easier (and to get the most out of your GPUs). Check out the article here: https://lnkd.in/ei9GqsVF

  • Superlinked, Inc. reposted this

    Spotify, Google, Meta, Airbnb all use custom embeddings. Now you can too, without the $500K price tag. 🚀

    Pre-trained models break on real-world data.
    Query: "Senior ML roles in fintech paying $200K+"
    You need job descriptions, salary data, AND application behavior. OpenAI embeddings only handle text. The salary and sector filters? Lost.

    Big Tech knows this. They all trained custom embedding models. The cost:
    ▪️ $500K-$600K in compute
    ▪️ 3-9 months to build
    ▪️ Expertise + proprietary data
    ▪️ Next project? Start over from scratch

    Superlinked, Inc. open-sourced the alternative.

    🔧 Encoder-Stacking Framework
    Combine specialized encoders (text, image, number, time, category) into a single embedding space. No custom training needed.

    ⚡ Works With Any Vector DB
    MongoDB, Redis, Pinecone, Qdrant, pgvector: use your existing stack.

    🐍 Python Notebook to Cloud
    Prototype locally, deploy to production. Your own infrastructure or managed cloud.

    📊 4 Use Cases, 1 Solution
    Search, recommendations, RAG, analytics: the same framework handles all.

    Example: "Recent ACME unsigned documents"
    Gets encoded with:
    📄 Visual Page Encoder (document layout)
    🏷️ Category Encoder (ACME, document type)
    📅 Time Encoder (recent)
    Direct retrieval. No re-ranking.

    This is custom model performance with pre-trained model simplicity. OSS framework on GitHub. Managed cloud for production. Could this work for your use case? 💬
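
    A conceptual sketch of encoder stacking in plain Python, not the Superlinked implementation; the encoders, weights, and dimensions are assumptions for illustration:

```python
# Conceptual sketch of encoder stacking: each specialised encoder produces its
# own small vector, and the document embedding is their weighted concatenation,
# so one nearest-neighbour search covers text, category, and recency at once.
import time
import numpy as np

CATEGORIES = ["ACME", "OtherCorp"]          # assumed category vocabulary

def category_encoder(name: str) -> np.ndarray:
    return np.eye(len(CATEGORIES))[CATEGORIES.index(name)]       # one-hot

def time_encoder(created_at: float, half_life_days: float = 30) -> np.ndarray:
    age_days = (time.time() - created_at) / 86400
    return np.array([0.5 ** (age_days / half_life_days)])         # recency decay in (0, 1]

def stack(text_vec: np.ndarray, category: str, created_at: float,
          weights=(1.0, 0.5, 0.5)) -> np.ndarray:
    parts = [text_vec, category_encoder(category), time_encoder(created_at)]
    return np.concatenate([w * p / (np.linalg.norm(p) + 1e-9)     # normalise, then weight
                           for w, p in zip(weights, parts)])

doc_vec = stack(np.random.rand(8), "ACME", time.time() - 3 * 86400)
print(doc_vec.shape)   # (8 + 2 + 1,) -> one vector, ready for any vector DB
```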

