Superlinked, Inc.

Data Infrastructure and Analytics

San Francisco, California · 5,829 followers

The data engineer’s solution to turning data into vector embeddings.

About us

The data engineer’s solution to turning data into vector embeddings. Building LLM demos is cool, turning 1B user clicks and millions of documents into vectors is cooler.

Website
https://superlinked.com/
Industry
Data Infrastructure and Analytics
Company size
11-50 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2021
Specialties
Personalization, Developer APIs, Cloud Infrastructure, Information Retrieval, and Vector Embedding Compute

Locations

  • Primary

    28 Geary Street

    Suite 650

    San Francisco, California 94108, US


Updates

  • Superlinked, Inc. reposted this

    Every re-ranking call costs you 150ms and $0.002. Your search handles 1M queries/day. Do the math. 💸 💸

    The Re-Ranking Tax:
    ▪️ Latency: 150ms per query (cross-encoder inference)
    ▪️ Cost: $0.002 per query (GPU compute for scoring 100 doc pairs)
    ▪️ Scale: 1M queries/day

    Daily cost: $2,000
    Monthly cost: $60,000
    Annual cost: $730,000 📈

    And that's just compute. Add:
    ▪️ Infrastructure maintenance (re-ranker deployment)
    ▪️ Engineering time (blending metadata post-search)
    ▪️ User drop-off (every 100ms of latency = 1% conversion loss)

    Why You're Paying This Tax:
    Initial retrieval is weak: text embeddings only, price/rating signals missing. Re-ranking tries to fix it post-hoc. But if relevant docs aren't in the top 100, the re-ranker never sees them.

    ✅ The Alternative: Encode signals at index time.
    ▪️ Text similarity + price optimization + rating maximization
    ▪️ Hard filters before search (eliminate irrelevant items)
    ▪️ Dynamic query-time weights (no re-embedding needed)

    Results:
    ▪️ Cost: $0 re-ranking (eliminated)
    ▪️ Latency: 50ms (4x faster)
    ▪️ Accuracy: higher (relevant results in initial retrieval)

    TCO: $730K/year → $0

    Re-ranking isn't a feature. It's a tax on bad retrieval architecture. Full cost breakdown + architecture comparison 👇
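
    For scale, the dollar figures above are just multiplication; a minimal sketch of the cost model, using only the numbers quoted in the post:

```python
# Back-of-the-envelope cost model for the figures quoted above.
# Assumes the post's numbers: $0.002 per re-ranked query, 1M queries/day.
COST_PER_QUERY_USD = 0.002    # GPU compute to cross-encode ~100 doc pairs
QUERIES_PER_DAY = 1_000_000

daily = COST_PER_QUERY_USD * QUERIES_PER_DAY   # $2,000
monthly = daily * 30                           # $60,000
annual = daily * 365                           # $730,000
print(f"daily ${daily:,.0f} | monthly ${monthly:,.0f} | annual ${annual:,.0f}")
```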

  • Superlinked, Inc. reposted this

    Flash Attention 4 is coming. Everyone's hyped. Almost no one can explain why it's actually faster.

    "It's optimized for Blackwell" is like saying "it's fast because it's new." Here's what's actually going on:

    1️⃣ They replaced GPU hardware with basic math (and it's faster)
    GPUs have Special Function Units for exponentials. Dedicated silicon. Sounds fast. Problem: there are far fewer SFUs than CUDA cores, so a heavy attention load means waiting in line. FA4's fix? A cubic polynomial approximation on CUDA cores. Three fused multiply-adds beat the dedicated hardware queue. 🤯

    2️⃣ They got lazy with rescaling (on purpose)
    Standard attention rescales values every time it sees a new maximum. Safe, but wasteful. FA4 only rescales when the change actually threatens numerical stability. 10x fewer rescaling ops, same correctness. Pure computational fat trimmed.

    3️⃣ They turned attention into an async pipeline
    FA3 had 2 stages. FA4 has 5 (load, MMA, softmax, correction, epilogue) running concurrently on specialized warps. Think event loop on steroids.

    The result? ~20% faster than FA3. The first attention kernel to break the petaflop barrier.

    FA4 isn't widely available yet (Blackwell only, forward pass only). But when it lands broadly, the "just throw more GPUs at it" crowd is in for a rude awakening. Architecture eats hardware for breakfast.

    Props to Tri Dao for pushing the boundaries again.
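
    To make point 1️⃣ concrete, here is an illustrative sketch of approximating exp(x) with a cubic polynomial evaluated as three fused multiply-adds (Horner form). The coefficients below are fit on the fly for illustration; they are not FA4's actual kernel or coefficients:

```python
# Illustrative sketch only: approximate exp(x) on a bounded range with a cubic
# polynomial evaluated in Horner form, i.e. three FMA-shaped steps on regular
# CUDA cores instead of a Special Function Unit call.
import numpy as np

xs = np.linspace(-1.0, 0.0, 1000)               # softmax exponents are <= 0 after max-subtraction
c3, c2, c1, c0 = np.polyfit(xs, np.exp(xs), 3)  # fit cubic coefficients offline (assumed, not FA4's)

def exp_approx(x):
    # Horner form: ((c3*x + c2)*x + c1)*x + c0 -> three fused multiply-adds
    return ((c3 * x + c2) * x + c1) * x + c0

err = np.max(np.abs(exp_approx(xs) - np.exp(xs)))
print(f"max abs error on [-1, 0]: {err:.2e}")
```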

  • Superlinked, Inc. reposted this

    Your GPU does 312 trillion operations per second. So why does embedding inference take milliseconds? 🤯

    Because you're not compute-bound. You're memory-bound.

    Here's what actually happens when you embed a sentence: Tokenization? Microseconds. Irrelevant. Then you run tokens through 12 Transformer layers. Each one reads weights from memory, computes, writes back, reads again. The A100 reads memory at 1.5 TB/s. Sounds fast, but compared to 312 TFLOPS of compute, it's a crawl. Your GPU spends most of its time waiting for data. Not doing math.

    This is why "it's fast because it's written in Rust" misses the point. 🤡 HTTP handling, JSON parsing, request routing: that's 5% of your latency, on the CPU. The other 95% is the GPU memory bottleneck.

    So what actually moves the needle?

    1️⃣ Flash Attention
    Standard attention writes every intermediate matrix to slow HBM, then reads it back. Over and over. Flash Attention tiles computation into blocks that fit in fast SRAM, the GPU's scratchpad cache. Same math, same result, zero expensive memory round-trips. 2-4x faster. Not because of language. Because it respects the memory hierarchy.

    2️⃣ Quantization
    Usually explained as "smaller numbers = faster math." The real win? Bandwidth. FP32 weights = 4 bytes per parameter. INT8 = 1 byte. Your 400MB model becomes 100MB. Less data moved = less waiting. Retrieval accuracy loss? Under 1% NDCG degradation. Basically free throughput.

    3️⃣ Token-based batching
    Batch by total token count, not request count. Pack the GPU like Tetris instead of wasting cycles on padding.

    The takeaway: performance is architecture first, syntax second. Rust helps with GC pauses and operational simplicity. But you could build a comparably fast system in Python with the same GPU kernel choices and batching strategy.

    Stop optimizing the 5%. Attack the 95%. At least, that's how I see it. What about you?

    Props to Filip Makraduli for the deep dive.
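
    A minimal sketch of point 3️⃣, token-based batching, under assumed token counts; this is the packing idea only, not any particular server's actual scheduler:

```python
# Minimal sketch of token-based batching: group requests by total token count
# rather than request count, so one long document doesn't force padding onto a
# batch of short queries. Request IDs and token counts here are hypothetical.

def batch_by_tokens(requests, max_tokens_per_batch=8192):
    """requests: list of (request_id, num_tokens). Returns a list of batches of IDs."""
    batches, current, current_tokens = [], [], 0
    for req_id, n_tokens in sorted(requests, key=lambda r: r[1]):  # similar lengths batch together -> less padding
        if current and current_tokens + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req_id)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

print(batch_by_tokens([("a", 7000), ("b", 300), ("c", 512), ("d", 4000)]))
# -> [['b', 'c', 'd'], ['a']]
```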

  • Superlinked, Inc. reposted this

    You're training custom embedding models for recommendations. There's a faster way. ⚡

    ❌ The Old Approach:
    Problem: LLMs can't handle structured data (prices, dates, categories, ratings). Generic embeddings underperform in production.
    Solution: Train custom embedding models.
    Cost:
    ▪️ Months of development
    ▪️ ML expertise required
    ▪️ Training data collection
    ▪️ Most projects stuck in POC

    ✅ The New Approach:
    Use Superlinked, Inc.'s vector compute framework. Creates custom embeddings WITHOUT training.
    How:
    ▪️ Combines structured + unstructured data
    ▪️ Pre-trained models + custom logic
    ▪️ Python notebook → production in weeks

    Example: Query "high rated quality products"

    Old (LLM embeds entire review text):
    ▪️ Top result 1: "Good Product" (5 stars) ✅
    ▪️ Top result 2: "Can't beat this deal!" (5 stars) ✅
    ▪️ Top result 3: "Cheap product Not good" (1 star) ❌
    LLM sees the "product" keyword, misses the rating signal.

    New (Superlinked's separate spaces):
    ▪️ Text similarity space: "quality products"
    ▪️ Rating maximizer space: prioritize high ratings
    ▪️ Top 3 results: ALL 5-star reviews ✅

    Result: Custom model performance without training overhead.

    Real case: e-commerce platform, 4x AOV uplift, launched in weeks using Redis + Superlinked. Full code walkthrough 👇
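
    A conceptual sketch of the separate-spaces idea in plain Python, not the Superlinked API itself; the review vectors, weights, and rating scaling below are made-up stand-ins:

```python
# Conceptual sketch: score each review in a text-similarity space and a
# rating-maximizer space, then blend the two with query-time weights.
# The 2-D "embeddings" are stand-ins, not real model outputs.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reviews = [
    {"title": "Good Product",           "rating": 5, "vec": np.array([0.9, 0.1])},
    {"title": "Can't beat this deal!",  "rating": 5, "vec": np.array([0.6, 0.4])},
    {"title": "Cheap product Not good", "rating": 1, "vec": np.array([0.8, 0.2])},
]
query_vec = np.array([1.0, 0.0])     # pretend embedding of "high rated quality products"

w_text, w_rating = 0.5, 0.5          # query-time weights, no re-embedding needed
for r in reviews:
    text_score = cosine(query_vec, r["vec"])   # text similarity space
    rating_score = (r["rating"] - 1) / 4       # rating maximizer space, scaled to [0, 1]
    r["score"] = w_text * text_score + w_rating * rating_score

for r in sorted(reviews, key=lambda r: r["score"], reverse=True):
    print(f'{r["score"]:.2f}  {r["title"]}  ({r["rating"]} stars)')
# The 1-star review drops to the bottom even though its text mentions "product".
```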

  • Superlinked, Inc. reposted this

    Teaching Your Recommender LLMs to Think Twice. Here's why single-shot reasoning breaks. 🔍

    System-1 vs System-2 Thinking:

    System 1️⃣ (Single-Shot LLM): Fast, automatic, error-prone.
    Example: User clicks 3 sci-fi novels → LLM concludes "user likes sci-fi"
    Ignores: User disliked 10 dystopian novels → contradictory evidence missed

    System 2️⃣ (Reflection Loop): Slow, deliberate, self-correcting.
    Actor generates preference → Reflector catches the flaw: "User consistently dislikes dystopian themes, contradicts your conclusion" → Actor refines

    Why Single-Shot Fails in RecSys:
    ❌ Black-box reasoning (can't audit the chain of thought)
    ❌ Small errors compound (hallucinated preference = bad features = bad recommendations)
    ❌ No quality control (output goes straight to the downstream model)

    ▶︎ R4ec Solution: Built-In Quality Control
    Reflector model trained to spot inconsistencies. Feedback loop forces the Actor to reconsider. Final output is "vetted" for quality before feeding the recommender.

    Production Impact:
    ✅ +2.2% revenue
    ✅ +1.6% CVR
    ✅ +4.1% long-tail lift (cold-start solved)

    This Actor-Reflector pattern works for:
    ▪️ Code generation (Reflector catches bugs)
    ▪️ Content validation (Reflector checks facts)
    ▪️ Chatbot responses (Reflector ensures grounding)

    Any task where LLM errors are costly needs a Reflector. Process matters more than model size. Two small specialized models > one large general model.
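
    A minimal sketch of the Actor-Reflector loop; `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompts are illustrative only:

```python
# Minimal sketch of an Actor-Reflector loop. `call_llm` is a hypothetical
# stand-in; the loop structure is the point, not the prompts.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client here")

def actor_reflector(user_history: str, max_rounds: int = 3) -> str:
    # Actor: first-pass preference inference (System 1).
    preference = call_llm(f"Infer this user's preferences:\n{user_history}")
    for _ in range(max_rounds):
        # Reflector: check the summary against the evidence (System 2).
        critique = call_llm(
            "Check this preference summary against the history. "
            "Reply 'OK' if consistent, otherwise explain the contradiction.\n"
            f"History:\n{user_history}\nSummary:\n{preference}"
        )
        if critique.strip().upper().startswith("OK"):
            break                      # Reflector found no inconsistency
        preference = call_llm(         # Actor refines using the Reflector's feedback
            f"Revise the summary to address this critique:\n{critique}\n"
            f"Previous summary:\n{preference}"
        )
    return preference                  # vetted output feeds the downstream recommender
```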

  • “System X is fast because it’s written in Rust.” Is this true 100% of the time?

    Most people assume embedding inference speed comes down to the code they write: Python versus Rust, frameworks, etc. In practice, almost none of that is decisive. What really affects embedding latency is memory. GPUs are extremely fast at calculations but comparatively slow at moving data. Generating an embedding is mostly about reading and writing large model weights and intermediate tensors rather than crunching numbers.

    That is why techniques like Flash Attention (used by the popular inference solution TEI) matter: they reorganise computation so more work stays in fast on-chip cache instead of repeatedly hitting slower GPU memory. Quantisation helps for the same reason: smaller weights mean less data to move.

    If you want faster embeddings, start thinking about memory, cache locality, and data movement to realise some actual gains. Or better yet, read Filip’s full deep-dive on the matter here: https://lnkd.in/eyTuH2cu
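
    A quick roofline-style sanity check, using the A100 figures quoted earlier in this feed (312 TFLOPS of compute, 1.5 TB/s of memory bandwidth), shows how much work a kernel must do per byte moved before compute, rather than memory, becomes the bottleneck:

```python
# Roofline-style check with the numbers quoted in this feed: at 312 TFLOPS of
# compute and 1.5 TB/s of memory bandwidth, a kernel needs roughly 200+
# floating-point operations per byte moved before it stops being memory-bound.

peak_flops = 312e12        # tensor-core throughput quoted above (FLOP/s)
mem_bandwidth = 1.5e12     # HBM bandwidth quoted above (bytes/s)

breakeven_intensity = peak_flops / mem_bandwidth   # FLOPs per byte
print(f"break-even arithmetic intensity: {breakeven_intensity:.0f} FLOPs/byte")
```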

  • Superlinked, Inc. reposted this

    No Amount of Re-Ranking will Save You From Bad Retrieval.

    The Common Workflow:
    1) Embed documents → vector search
    2) Retrieve 100 results
    3) Re-rank with cross-encoder
    4) Return top 10

    ⚠️ The Problem:
    ▪️ Re-ranking is expensive. Cross-encoders process every query-doc pair individually. At scale: 100 results × re-ranking = 200ms+ latency per query.
    ▪️ Re-ranking doesn't fix bad initial retrieval. If your first pass misses the right signals, re-ranking just reorders garbage.

    Example: "Affordable wireless headphones under $200 with high ratings"
    Initial vector search on text only:
    → retrieves based on "headphones" semantics
    → misses price and rating signals
    → gets $250 headphones (over budget) and 3-star products (low rating).
    The re-ranker tries to fix this post-hoc. But it only sees what was retrieved. Garbage in, garbage out.

    🔧 The Fix: Encode multiple signals at index time.
    📝 Text similarity space (semantics)
    💰 Price minimizer space (optimizes for lower)
    ⭐ Rating maximizer space (optimizes for higher)

    Apply hard filters before search:
    ▪️ price < $200 (eliminates expensive products)
    ▪️ category = electronics (narrows the search space)

    Use dynamic weights at query time:
    ▪️ Adjust text vs price vs rating importance without re-embedding

    Result: Relevant results surface in initial retrieval. No re-ranking needed. 4x faster, much lower compute costs. Full breakdown with code examples 👉 https://lnkd.in/gvcnR39V
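
    A sketch of the fix under stated assumptions (made-up catalogue, made-up per-space scores): hard filters run first, then query-time weights blend the spaces with no re-embedding:

```python
# Sketch of the fix described above. The catalogue and per-space scores are
# made up; the point is that hard filters run before scoring and the space
# weights can change at query time without re-embedding anything.

products = [
    {"name": "Headphones A", "price": 250, "rating": 4.6, "text_sim": 0.92, "category": "electronics"},
    {"name": "Headphones B", "price": 180, "rating": 4.4, "text_sim": 0.88, "category": "electronics"},
    {"name": "Headphones C", "price": 150, "rating": 3.1, "text_sim": 0.90, "category": "electronics"},
]

# 1) Hard filters before search: over-budget or off-category items never enter scoring.
candidates = [p for p in products if p["price"] < 200 and p["category"] == "electronics"]

# 2) Dynamic query-time weights across the three spaces.
w_text, w_price, w_rating = 0.5, 0.25, 0.25
for p in candidates:
    price_score = 1 - p["price"] / 200          # price minimizer: cheaper is better
    rating_score = p["rating"] / 5              # rating maximizer: higher is better
    p["score"] = w_text * p["text_sim"] + w_price * price_score + w_rating * rating_score

for p in sorted(candidates, key=lambda p: p["score"], reverse=True):
    print(f'{p["score"]:.2f}  {p["name"]}  ${p["price"]}  {p["rating"]} stars')
```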

  • Using open-source solutions to productionise your embeddings can get you a long way, but the efficiency problem that faces ML and AI Engineers still needs solving…

    * Some models can generate dense, sparse, and multi-vector embeddings in one pass, but today you usually need multiple API calls because these outputs are handled separately.
    * Running and testing multiple models in production is costly and complex, with limited support for serving many models efficiently when VRAM is constrained.
    * Differences in embeddings, pooling strategies, and model quirks require careful handling by users, and current systems lack flexible ways to support new model types without code changes.

    Filip Makraduli takes a deep dive into the existing open-source inference solutions, what they do well, and what they’re ultimately missing to make everyone’s jobs easier (and to get the most out of your GPUs). Check out the article here: https://lnkd.in/ei9GqsVF

  • Superlinked, Inc. reposted this

    Spotify, Google, Meta, Airbnb all use custom embeddings. Now you can too, without the $500K price tag. 🚀

    Pre-trained models break on real-world data.
    Query: "Senior ML roles in fintech paying $200K+"
    You need job descriptions, salary data, AND application behavior. OpenAI embeddings only handle text. The salary and sector filters? Lost.

    Big Tech knows this. They all trained custom embedding models. The cost:
    ▪️ $500K-$600K in compute
    ▪️ 3-9 months to build
    ▪️ Expertise + proprietary data
    ▪️ Next project? Start over from scratch

    Superlinked, Inc. open-sourced the alternative.

    🔧 Encoder-Stacking Framework
    Combine specialized encoders (text, image, number, time, category) into a single embedding space. No custom training needed.

    ⚡ Works With Any Vector DB
    MongoDB, Redis, Pinecone, Qdrant, pgvector: use your existing stack.

    🐍 Python Notebook to Cloud
    Prototype locally, deploy to production. Your own infrastructure or managed cloud.

    📊 4 Use Cases, 1 Solution
    Search, recommendations, RAG, analytics: the same framework handles all.

    Example: "Recent ACME unsigned documents"
    Gets encoded with:
    📄 Visual Page Encoder (document layout)
    🏷️ Category Encoder (ACME, document type)
    📅 Time Encoder (recent)
    Direct retrieval. No re-ranking.

    This is custom model performance with pre-trained model simplicity. OSS framework on GitHub. Managed cloud for production. Could this work for your use case? 💬
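
    A conceptual sketch of encoder stacking in plain Python, not the Superlinked implementation; the encoders, weights, and dimensions are assumptions for illustration:

```python
# Conceptual sketch of encoder stacking: each specialised encoder produces its
# own small vector, and the document embedding is their weighted concatenation,
# so one nearest-neighbour search covers text, category, and recency at once.
import time
import numpy as np

CATEGORIES = ["ACME", "OtherCorp"]          # assumed category vocabulary

def category_encoder(name: str) -> np.ndarray:
    return np.eye(len(CATEGORIES))[CATEGORIES.index(name)]       # one-hot

def time_encoder(created_at: float, half_life_days: float = 30) -> np.ndarray:
    age_days = (time.time() - created_at) / 86400
    return np.array([0.5 ** (age_days / half_life_days)])         # recency decay in (0, 1]

def stack(text_vec: np.ndarray, category: str, created_at: float,
          weights=(1.0, 0.5, 0.5)) -> np.ndarray:
    parts = [text_vec, category_encoder(category), time_encoder(created_at)]
    return np.concatenate([w * p / (np.linalg.norm(p) + 1e-9)     # normalise, then weight
                           for w, p in zip(weights, parts)])

doc_vec = stack(np.random.rand(8), "ACME", time.time() - 3 * 86400)
print(doc_vec.shape)   # (8 + 2 + 1,) -> one vector, ready for any vector DB
```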

