Atlys Engineering - Medium

Fine-Tuning a 2B Vision-Language Model for Document AI: Four Lessons That Beat Hyperparameter…

Shubham Tiwari — Sun, 17 May 2026 07:42:17 GMT

Fine-Tuning a 2B Vision-Language Model for Document AI: Four Lessons That Beat Hyperparameter Tuning

Frontier APIs handle document understanding well out of the box, until you’re processing tens of thousands of multi-page financial documents per day. At that point, API costs run into thousands of dollars a month, you inherit rate limits and latency variance, and you can’t fix the specific failure modes that matter to your domain.

A purpose-built 2B vision-language model on a single GPU runs at a fraction of that cost with predictable latency, full data privacy, and the ability to address edge cases by adjusting your training data. We spent the last few weeks fine-tuning one to parse bank statements at production scale, and most of the actual lessons turned out to be in places the standard fine-tuning playbook skips.

The scope is document AI with structured outputs: tasks where you give the model an image and expect JSON with specific fields, such as bank statements, invoices, medical records, lab reports, KYC forms, and identity documents. For chat models or open-ended generation, some of what we say below might not apply.

Our fine-tuning playbook

Most blog posts on fine-tuning fixate on hyperparameters. In practice, fine-tuning a VLM for production is mostly about figuring out what actually moves the needle. We recently fine-tuned a 2B VLM to parse bank statements. Four things consistently helped us hit production-grade accuracy:

Schema design as a training lever
Vision-side LoRA targeting and serving
Serving configuration that interacts with model architecture
Input-side engineering that beats more training

The training script was the simplest part. Each of these had its own surprises.

1. Schema Is a Fine-Tuning Lever

The single most impactful decision wasn’t a hyperparameter. It was what we asked the model to output.

We removed one field from our transaction schema and eval loss dropped 60%. Inference throughput roughly doubled. Truncation errors went from a regular annoyance to almost zero. No change to model size, training data, or learning rate.

The model didn’t get smarter. The task got smaller.

The Setup

For our first three versions, the transaction schema looked like this:

{
  "date": "2024-01-15",
  "narration": "UPI/P2L329398532229/New Cron/UPI/TM Bank",
  "debit": 495.0,
  "credit": 0,
  "balance": 288209.95
}

Five fields per transaction. A page with 50 transactions produces around 1,800 to 2,000 output tokens. The 95th percentile of our training data sat at 2,562 output tokens.

For version 4, we removed narration:

{
  "date": "2024-01-15",
  "debit": 495.0,
  "credit": 0,
  "balance": 288209.95
}

Same training data, same model, same hyperparameters. Results across all four versions:

[Figure: results across all four versions. Source: author]

Why This Happens

When you fine-tune a causal language model, even a vision-language one, the loss is computed token by token over the output sequence. A sample with a 2,000-token output produces roughly 10x more loss signal than a sample with a 200-token output, plus 10x more gradient computation, activation memory, and wall-clock time per training step.

Most of those output tokens were spent on the wrong thing. Narration strings are essentially uncompressed OCR. The model has to predict each character of strings like UPI/P2L329398532229/New Cron. Any uncertainty compounds across the sequence.

The structured fields are where the actual document understanding lives. The model has to recognize the table layout, identify column boundaries, and parse Indian number formatting. The narration field is just transcription, so the model ends up spending most of its gradient signal memorizing strings that won’t appear again.

Output Token Cascade Through the Stack

[Figure: output token cascade through the stack. Source: author]

A Note on Eval Loss

There’s a measurement trap here: V3’s 0.064 eval loss is not “almost as good” as V4’s 0.024. The two versions are solving different tasks, so their loss numbers sit on different scales.

Eval loss measures next-token prediction on your training distribution. Change the schema and you change what the loss is measuring. The only metric that means the same thing across schema changes is field-level accuracy on a held-out validation set.

2. Vision LoRA Is Where Most Tutorials Stop Short

Most LoRA tutorials target q/k/v projections in the language layers. For a vision-language model, that leaves the vision encoder essentially frozen, which means the model can’t learn to look differently at your domain.

Module Discovery

The vision encoder has its own LoRA-targetable modules.

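import torch.nn as nn

# Assumes `model` is the loaded VLM and that its vision tower sits under a
# submodule whose name contains "visual"; adjust the substring for your model.
vision_modules = []
for name, module in model.named_modules():
    if "visual" in name and isinstance(module, nn.Linear):
        vision_modules.append(name)
print(f"Found {len(vision_modules)} vision-side Linear modules")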

They fall into two categories:

Attention projections: q_proj, k_proj, v_proj, o_proj for each vision block. The obvious ones, and where most tutorials stop.
MLP layers: gate_proj, up_proj, down_proj for each vision block. Where most of the capacity lives.

Our final target_modules list:

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

PEFT matches these names as suffixes across both the vision and language stacks, so every Linear layer whose name ends in one of them gets a LoRA adapter on both sides.

Verifying Adapters Are Actually Training

A useful sanity check after a few hundred training steps: compare adapter norms on the vision side vs the language side. If vision adapters are essentially zero while language adapters have grown, your gradient flow on the vision side is broken.
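One way to run that check, as a minimal sketch. It assumes PEFT-style parameter names containing `lora_B` (PEFT initializes that matrix to zero by default, so a non-zero norm means it has received updates) and a vision tower whose module path contains `visual`, as in the discovery snippet above. Both markers are assumptions about naming, so adjust them for your model. The function takes plain (name, values) pairs so it runs anywhere; with PyTorch you would feed it `(n, p.detach().flatten().tolist())` for each `n, p` in `model.named_parameters()`.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple


def adapter_norms(
    named_values: Iterable[Tuple[str, Sequence[float]]],
    vision_marker: str = "visual",   # assumed substring of vision-tower module names
    adapter_marker: str = "lora_B",  # lora_B starts at zero, so growth means training
) -> Dict[str, float]:
    """Return the combined L2 norm of adapter weights on each side."""
    squared = {"vision": 0.0, "language": 0.0}
    for name, values in named_values:
        if adapter_marker not in name:
            continue  # skip base weights and lora_A matrices
        side = "vision" if vision_marker in name else "language"
        squared[side] += sum(v * v for v in values)
    return {side: math.sqrt(total) for side, total in squared.items()}


# Toy parameters standing in for a real model's named_parameters()
params = [
    ("model.visual.blocks.0.attn.q_proj.lora_B.default.weight", [0.0, 0.0]),
    ("model.layers.0.self_attn.q_proj.lora_B.default.weight", [0.3, 0.4]),
    ("model.layers.0.self_attn.q_proj.weight", [9.0, 9.0]),  # base weight, ignored
]
norms = adapter_norms(params)
print(norms)
if norms["vision"] < 1e-6 < norms["language"]:
    print("vision adapters look untrained: check gradient flow")
```

In the toy data the vision adapter is still all zeros while the language adapter has moved, which is exactly the pattern that should trigger a closer look.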

The Serving Trap

vLLM supports dynamic LoRA loading: start the server with the base model, specify which adapter to apply at request time. Convenient for serving multiple adapters from one base model.

For VLMs, this has a problem we hit hard. When you load a LoRA adapter dynamically in vLLM, the vision-side adapters get silently dropped during loading in many configurations. No error, no warning. The model serves, the language adapter is applied, the vision adapter is just gone.

The symptom is subtle: your fine-tuned model behaves about as well as the base model on tasks that depend on visual adaptation. We noticed because our version 3 to version 4 improvement vanished when we deployed via dynamic loading.

The fix is to merge the LoRA adapter into the base model before serving:

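from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model with the same class you used for training. Many VLMs
# need a vision-language model class here instead of AutoModelForCausalLM.
base = AutoModelForCausalLM.from_pretrained("base-model-path")
# Attach the trained adapter, then fold its weights into the base layers
peft_model = PeftModel.from_pretrained(base, "lora-adapter-path")
merged = peft_model.merge_and_unload()
# Save a standalone checkpoint that serves without any adapter loading
merged.save_pretrained("merged-model-path")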

Then serve the merged model directly. You lose hot-swapping, but you get the vision adaptations you actually trained.

3. Serving Configuration Is Where Real Throughput Lives

Once your model is trained and merged, three vLLM parameters end up determining real production throughput. Defaults are reasonable for text-only models and wrong for VLMs.

Dtype Depends on Your GPU, Not Your Model

The standard advice is “bf16 for training, quantize for inference.” For consumer GPUs serving small VLMs, the picture is more constrained:

2B model on 24GB RTX 4090 consumer GPU: bf16 fits comfortably with vision LoRA enabled.
4B at the same resolution: doesn’t fit.

[Figure: dtype choices by hardware. Source: author]

There’s also a counterintuitive interaction with image tokens: they are consumed by the vision encoder, not generated, so quantizing the language model doesn’t speed up vision-side computation at all. For our workload, the vision encoder represents 30–40% of single-request latency. Even an aggressive language-model quantization that 2x’d LM speed would only cut total latency by roughly 30–35%, because only the remaining 60–70% gets halved.
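The arithmetic behind that ceiling is Amdahl's law. A minimal sketch, treating the vision encoder's share of latency as a fixed cost that language-model quantization does not touch (the function name is ours, for illustration only):

```python
def latency_saved(vision_share: float, lm_speedup: float) -> float:
    """Fraction of end-to-end latency removed when only the LM part gets faster."""
    lm_share = 1.0 - vision_share
    new_latency = vision_share + lm_share / lm_speedup
    return 1.0 - new_latency


for share in (0.30, 0.40):
    saved = latency_saved(share, lm_speedup=2.0)
    print(f"vision share {share:.0%}: total latency drops by {saved:.0%}")
```

The larger the vision encoder's share, the less any language-side speedup can buy, no matter how aggressive the quantization.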

max-num-seqs Is the Real Concurrency Cap

This sets the maximum number of requests vLLM processes simultaneously. The right number scales with KV cache headroom, not raw GPU memory.

For a VLM with around 12K image tokens per request and 3K output tokens, our 24GB GPU comfortably handles 32 concurrent sequences in bf16. Going to 48 caused OOM on busy days. Going to 16 cut throughput nearly in half.

To find the right number empirically: start at half your initial guess, ramp up, watch for both OOM and the throughput plateau. The plateau matters as much as the OOM ceiling.
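Before ramping empirically, a back-of-the-envelope estimate helps pick the starting guess. The sketch below uses the standard per-token KV cache size for attention with grouped KV heads (keys plus values, per layer). The layer count, KV head count, and head dimension are hypothetical placeholders, not our model's actual architecture; read the real values from your model config. The 15,000 tokens per sequence come from the 12K image plus 3K output figures above.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    tokens_per_seq: int,
    num_seqs: int,
    bytes_per_elem: int = 2,  # bf16 stores 2 bytes per element
) -> float:
    """KV cache size in GiB for num_seqs sequences of tokens_per_seq tokens each."""
    # Factor 2 covers keys and values, stored for every layer
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens_per_seq * num_seqs / 2**30


# Hypothetical architecture numbers, NOT taken from the model in this article
dims = dict(num_layers=28, num_kv_heads=2, head_dim=128)
for seqs in (16, 32, 48):
    gib = kv_cache_gib(tokens_per_seq=15_000, num_seqs=seqs, **dims)
    print(f"{seqs} concurrent sequences -> {gib:.1f} GiB of KV cache")
```

The estimate only covers the cache. Model weights and vision-encoder activations need their own headroom, which is why the empirical ramp is still the final word.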

max-model-len Splits Between Input and Output

This is the total context budget, shared between prompt, image tokens, and output tokens. For a 16K context VLM:

Image: ~12K tokens (depends on image resolution)
Prompt: ~500 tokens
Output: ~3,500 tokens remaining

If your output schema produces more than the remaining budget, you’ll see truncation errors that look like model failures but are actually serving config failures.
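A quick budget check makes this failure mode visible before deployment. This is a minimal sketch using the round numbers from the breakdown above; it treats "16K" as 16,000 tokens to match the ~3,500 figure, and the function names are illustrative.

```python
def output_budget(max_model_len: int, image_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the image and prompt occupy the context."""
    return max_model_len - image_tokens - prompt_tokens


def will_truncate(expected_output_tokens: int, budget: int) -> bool:
    """True when the expected output cannot fit in the remaining budget."""
    return expected_output_tokens > budget


budget = output_budget(max_model_len=16_000, image_tokens=12_000, prompt_tokens=500)
print(f"output budget: {budget} tokens")
for expected in (2_562, 4_000):
    print(expected, "truncates" if will_truncate(expected, budget) else "fits")
```

Running the check against the 95th-percentile output length of your training data (2,562 tokens for our five-field schema) tells you whether truncation will be a tail event or a routine one.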

Two ways to fix it:

Increase max-model-len. Works if you have memory headroom. KV cache scales with context length, so doubling max-model-len roughly doubles cache size per sequence. You’ll need to reduce max-num-seqs to compensate.
Cap your image resolution. Smaller images produce fewer image tokens, leaving more output budget. We cap our images at roughly 1.2 million pixels.
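The resolution cap can be sketched as a pure size calculation that runs before the actual resize. This is an illustration under our own assumptions (aspect ratio preserved, no upscaling, rounding down so the cap is never exceeded), not the exact preprocessing we run.

```python
import math
from typing import Tuple


def cap_resolution(width: int, height: int, max_pixels: int = 1_200_000) -> Tuple[int, int]:
    """Shrink (width, height) to fit max_pixels, keeping aspect ratio; never upscale."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    # Floor both sides so the product can never exceed the cap
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(cap_resolution(2480, 3508))  # an A4 page scanned at 300 DPI
print(cap_resolution(800, 1000))   # already under the cap, returned unchanged
```

The returned size can then be passed to whatever image library you use for the resize itself.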

KV Cache Is Pre-Allocated at Startup

vLLM reserves the gpu-memory-utilization fraction of your GPU at startup and pre-allocates the KV cache inside it. The cache doesn’t grow under load, which keeps memory use predictable, but it means you have to size it correctly upfront. Set it to 0.92 for dedicated serving, lower for multi-tenant boxes.
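Putting the settings from this section together, a launch command can be assembled as below. This is a sketch: the flags are vLLM's standard serve options, the model path is the merged checkpoint from earlier, and the 16,384 context length is our assumption for "16K". Tune every value for your own hardware.

```python
import shlex
from typing import List


def vllm_serve_args(
    model_path: str,
    dtype: str = "bfloat16",
    max_num_seqs: int = 32,
    max_model_len: int = 16_384,  # assumed value for a "16K" context
    gpu_memory_utilization: float = 0.92,
) -> List[str]:
    """Build the argv for a `vllm serve` launch from the settings discussed here."""
    return [
        "vllm", "serve", model_path,
        "--dtype", dtype,
        "--max-num-seqs", str(max_num_seqs),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


args = vllm_serve_args("merged-model-path")
print(shlex.join(args))
```

Keeping the values in one function makes it easy to version the serving config next to the model that depends on it.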

The Cost Structure of VLM Serving

The mental model from text-only serving is “tokens per second.” For VLMs, this metric obscures more than it reveals.

Image tokens are consumed during prefill. Heavy per token, but a fixed cost per request.
Output tokens are generated sequentially during decode. Cheaper per token, but scales linearly with output length.

A request with 12K image tokens and 500-token output spends most of its time on prefill. The same image with a 3,000-token output spends most of its time on decode. Adding concurrent requests with short outputs increases throughput substantially. Adding concurrent requests with long outputs increases throughput less.

4. Input-Side Engineering Beats Another Training Epoch

For our first three training iterations, our default response to “model isn’t quite good enough” was to train longer, train more, or train with more data. Each iteration took 90 minutes to two hours and produced incremental improvements we struggled to even measure correctly.

Three interventions that weren’t training itself ended up moving the needle more than any individual training run.

Measure Field-Level Accuracy, Not Eval Loss

Eval loss is approximately useless for document AI tasks. Three problems compound:

Loss across schema changes isn’t comparable (covered above).
Loss is dominated by noisy parts of your output. Free-form text dominates the gradient signal.
Loss doesn’t tell you which fields are wrong. Two models with identical 0.05 loss can have completely different field-level failure patterns.

Field-level accuracy on real-world documents survives all three problems. For each document, mark each business-critical field as correct or incorrect against the source image. Aggregate across a held-out set.
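The metric itself is only a few lines. A minimal sketch, assuming exact-match comparison per field and treating a missing field as wrong; the field names in the example are illustrative, and a real setup may want tolerance for numeric fields.

```python
from collections import defaultdict
from typing import Any, Dict, List


def field_accuracy(
    predictions: List[Dict[str, Any]],
    references: List[Dict[str, Any]],
    fields: List[str],
) -> Dict[str, float]:
    """Share of documents where each field exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    correct: Dict[str, int] = defaultdict(int)
    for pred, ref in zip(predictions, references):
        for field in fields:
            # A missing field counts as wrong, same as a wrong value
            if field in pred and pred[field] == ref.get(field):
                correct[field] += 1
    total = len(references)
    return {f: correct[f] / total for f in fields} if total else {}


refs = [
    {"opening_balance": 100.0, "joint_holder": True},
    {"opening_balance": 250.5, "joint_holder": False},
]
preds = [
    {"opening_balance": 100.0, "joint_holder": False},  # misses the second holder
    {"opening_balance": 250.5, "joint_holder": False},
]
scores = field_accuracy(preds, refs, ["opening_balance", "joint_holder"])
print(scores)
```

Because the result is keyed by field, a regression in one field shows up directly instead of being averaged away.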

Once we finalized this metric, conversations about quality became concrete: opening balance went from 92% to 96%, joint holder detection from 60% to 85%. Schema changes became easy to evaluate even when their eval losses were on different scales.

Oversample the Rare Cases

The first time we noticed the model failing on joint accounts, our instinct was to look at the training run. The actual problem was simpler.

Distribution of rare cases before and after oversampling:

[Figure: distribution of rare cases before and after oversampling. Source: author]

The model dutifully learned that “single holder” was the default and ignored the second name even when clearly visible. The fix was duplicating those samples until they made up roughly 10% of the dataset.

For any field or pattern that’s underrepresented in training data, the model will learn it as the exception, not the rule. This is true regardless of how good your model is or how long you train.

Two things to get right: find rare cases through field-level evaluation rather than guessing, and don’t oversample too aggressively. We targeted ~10% per rare case. Higher (say 20%) starts to overfit on the specific samples being duplicated.
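How many copies does it take to reach a target share? A small sketch of the calculation, with hypothetical dataset sizes (the counts below are made up for illustration, not our actual numbers):

```python
def copies_needed(rare_count: int, total_count: int, target_share: float = 0.10) -> int:
    """Smallest number of copies per rare sample that reaches target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    if not 0 < rare_count <= total_count:
        raise ValueError("rare_count must be positive and at most total_count")
    common = total_count - rare_count
    copies = 1
    # Duplicating rare samples grows both the numerator and the denominator
    while rare_count * copies / (common + rare_count * copies) < target_share:
        copies += 1
    return copies


# Hypothetical dataset: 25 joint-account samples out of 2,000
k = copies_needed(rare_count=25, total_count=2_000)
share = 25 * k / (2_000 - 25 + 25 * k)
print(f"{k} copies of each rare sample -> {share:.1%} of the dataset")
```

Note that the required factor is larger than the naive ratio of target share to current share, because every duplicate also enlarges the dataset.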

Treat Prompts as a Training-Time Variable

Most teams treat prompts as a runtime knob. For fine-tuning, the prompt is part of your training data. The model learns to attend to whatever your prompt emphasizes.

For version 2, our training prompt included a line that read:

Transactions: keys MUST be exactly date, narration, debit, credit, balance.

The capitalization was deliberate emphasis on the part that was failing in V1. It worked: V2 had perfectly consistent transaction schemas. But other things broke. Account holder names disappeared on documents where they’d been correctly extracted in V1. Statement dates regressed.

The MUST language effectively told the model “transactions are the important part, focus your attention there.” It learned exactly that, at the cost of header extraction.

V3 used the same training data with a rebalanced prompt: equal weight to header and transaction rules, no CAPS emphasis, full schema spelled out for both sections. Header extraction came back. Transaction consistency stayed. Total work: 30 minutes of prompt rewriting.

We could have spent another training run trying to fix this with hyperparameters. The actual fix didn’t need GPU time at all.

What This Looks Like in Production

After multiple training iterations and significant work on the serving stack, we have a system that:

Processes a 50-page document in under 25 seconds end-to-end
Achieves over 95% field-level accuracy on the structurally critical fields
Runs at a fraction of frontier API costs at our volumes
Operates entirely on our own infrastructure on a single consumer GPU

What We’re Building Next

Bank statements are one document type. The same approach applies broadly across identity documents and financial documents. Each one has its own schema design problem, its own serving constraints and its own evaluation surface.

We’re hiring engineers who think about engineering as a systems problem. Not which framework is best, but how the pieces fit together, where the actual bottlenecks are and what the cheapest intervention that moves them looks like.

Check here for openings.

Fine-Tuning a 2B Vision-Language Model for Document AI: Four Lessons That Beat Hyperparameter… was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Migrating 600M Records to ClickHouse Cloud

Abhishek Banerjee — Wed, 19 Nov 2025 10:56:26 GMT

The Technical Reality of Self-Hosted ClickHouse at Scale

For twelve months, I managed a self-hosted ClickHouse cluster handling 2,259 tables and 600+ million records. The architecture seemed solid on paper: 2 shards with 3 replicas, each ClickHouse node provisioned with 1TB storage, 25GB RAM, and 6 vCPUs. The coordination layer ran on 3 ZooKeeper instances with 40GB storage and 12GB RAM each.

The Pulse pipeline, our Go-based ingestion system, was writing 30 INSERT queries per second, translating to roughly 2.6 million rows daily. This should have been manageable for a well-architected ClickHouse deployment.

But scale changes everything.

Infrastructure Debt Compounds

The problems started subtle and escalated into production incidents:

Memory Management Issues: Analytical queries would unpredictably consume massive buffers, causing system-wide slowdowns. A query performing in milliseconds on a subset could take tens of seconds on the full dataset. Without predictable resource isolation, this meant production queries could choke the entire cluster.

ZooKeeper Corruption: Early in our deployment, a ZooKeeper outage cascaded into data corruption across all replicas. Recovery required manual intervention: backfilling data, validating consistency across shards, and verifying nothing was permanently lost. This was a multi-day incident that exposed the fragility of managing distributed state at this scale.

Scaling Requires Downtime: Adding compute capacity meant planned maintenance windows. During a 10x traffic spike from a sale event, we had no way to scale elastically. At 6 PM, we scheduled emergency maintenance to scale at midnight while stakeholders waited for business reports. The system that should have handled dynamic load gracefully instead required manual intervention during peak demand.

Single Point of Failure: As the sole data engineer responsible for this infrastructure, every incident meant context-switching from feature development to firefighting. The operational burden wasn’t just technical; it was consuming all available bandwidth.

Designing a Zero-Downtime Migration

A big-bang migration (dump everything to Cloud and cut over) carries massive risk: database locks, memory spikes during bulk inserts, index building that blocks production traffic, and no rollback path if issues appear halfway through.

Instead, I designed a three-phase approach spanning 30 days, optimized for safety and validation.

Phase 1: Idempotent Backfill Architecture

After provisioning the ClickHouse Cloud cluster, the critical challenge was backfilling 600M+ records without disrupting production or risking data inconsistency.

The key insight: make the backfill process idempotent. If the destination already holds data back to timestamp X, a run should move only data older than X, with no duplicates, no overwrites, and deterministic results regardless of how many times it executes.

Here’s the implementation:

INSERT INTO FUNCTION remoteSecure(
  '{host}',
  '{database}.{table_name}',
  '{user}',
  '{password}'
)
SELECT *
FROM {database}.{table_name}
WHERE timestamp < coalesce(
  (SELECT min(timestamp)
   FROM remoteSecure(
     '{host}',
     '{database}.{table_name}',
     '{user}',
     '{password}'
   )),
  toDateTime('2100-01-01 00:00:00')
)

The coalesce clause is the critical piece. It queries the destination cluster for the minimum timestamp already present. If data exists, it only backfills records older than that. If the destination is empty, the fallback date (2100-01-01) ensures all records are included.

This architecture meant:

Restartable backfills after interruptions
No manual tracking of progress state
Safe to run multiple times during validation
No risk of duplicate data
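The same cutoff pattern can be tried locally. The sketch below reproduces it with Python's built-in sqlite3 instead of ClickHouse, so the table names, columns, and two-step cutoff lookup are stand-ins for illustration rather than the migration script itself. It runs the backfill twice to show that the second run copies nothing.

```python
import sqlite3

FAR_FUTURE = "2100-01-01 00:00:00"  # same fallback idea as the coalesce above


def backfill(conn: sqlite3.Connection) -> int:
    """Copy rows older than the destination's earliest row; return rows copied."""
    (cutoff,) = conn.execute(
        "SELECT coalesce(min(ts), ?) FROM dest", (FAR_FUTURE,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO dest SELECT ts, payload FROM src WHERE ts < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (ts TEXT, payload TEXT)")
conn.execute("CREATE TABLE dest (ts TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO src VALUES (?, ?)",
    [
        ("2025-01-01 00:00:00", "a"),
        ("2025-01-02 00:00:00", "b"),
        ("2025-01-03 00:00:00", "c"),
    ],
)
# Pretend the destination already holds the newest row
conn.execute("INSERT INTO dest VALUES ('2025-01-03 00:00:00', 'c')")

first = backfill(conn)   # copies the two older rows
second = backfill(conn)  # finds nothing older than the new minimum
total = conn.execute("SELECT count(*) FROM dest").fetchone()[0]
print(first, second, total)
```

The destination ends with exactly one copy of each source row, however many times the backfill runs.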

The full backfill, 600 million records across 2,259 tables, finished in 3 hours. This included time for ClickHouse to build indexes and merge data parts in the background.

Phase 2: Dual-Write Validation Period

The most nerve-wracking phase: running both clusters simultaneously in production.

I modified the Pulse pipeline to write every event to both the self-hosted and Cloud clusters. Writes to Cloud were retried on failure, and the self-hosted cluster remained the source of truth during this transition period.
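The shape of that write path can be sketched as follows. Pulse itself is written in Go; this Python version is only an illustration of the ordering and retry behaviour, and the writer callables, retry count, and backoff are hypothetical.

```python
import time
from typing import Any, Callable, Dict, List

Writer = Callable[[Dict[str, Any]], None]


def dual_write(
    event: Dict[str, Any],
    write_primary: Writer,    # hypothetical: writes to the self-hosted cluster
    write_secondary: Writer,  # hypothetical: writes to the Cloud cluster
    retries: int = 3,
    backoff_s: float = 0.0,
) -> bool:
    """Write to the source of truth first, then retry the secondary on failure.

    Returns True when the secondary write eventually succeeded.
    """
    write_primary(event)  # a failure here propagates: the primary is never skipped
    for attempt in range(1, retries + 1):
        try:
            write_secondary(event)
            return True
        except Exception:
            if attempt < retries:
                time.sleep(backoff_s * attempt)
    return False


primary: List[Dict[str, Any]] = []
secondary: List[Dict[str, Any]] = []
calls = {"n": 0}


def flaky_cloud_write(event: Dict[str, Any]) -> None:
    """Simulated Cloud writer that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated Cloud hiccup")
    secondary.append(event)


ok = dual_write({"id": 1}, primary.append, flaky_cloud_write)
print(ok, len(primary), len(secondary), calls["n"])
```

A False return is the signal to record the event for later reconciliation, since the primary already has it.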

Why three weeks? Because trust is built through sustained validation, not spot checks.

I implemented comprehensive validation:

Data Integrity Checks:

Row count comparisons across all 2,259 tables
Aggregation result verification (sums, counts, distinct values)
Sampling and checksum validation for large tables
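Checks like these can be expressed as two small helpers. This is a sketch of the idea only: the table names and counts are made up, and the checksum is a generic order-independent hash over a row sample rather than any ClickHouse built-in.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def count_mismatches(source: Dict[str, int], dest: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Tables whose row counts differ (a missing table is reported as -1 rows)."""
    tables = sorted(set(source) | set(dest))
    return [
        (t, source.get(t, -1), dest.get(t, -1))
        for t in tables
        if source.get(t, -1) != dest.get(t, -1)
    ]


def rows_checksum(rows: Iterable[Tuple]) -> str:
    """Order-independent checksum of a row sample."""
    digests = sorted(hashlib.sha256(repr(r).encode()).hexdigest() for r in rows)
    return hashlib.sha256("".join(digests).encode()).hexdigest()


# Hypothetical per-table counts pulled from each cluster
source_counts = {"events": 1_000, "payments": 250, "sessions": 40}
dest_counts = {"events": 1_000, "payments": 249}
print(count_mismatches(source_counts, dest_counts))

sample_a = [(1, "paid"), (2, "refunded")]
sample_b = [(2, "refunded"), (1, "paid")]  # same rows, different order
print(rows_checksum(sample_a) == rows_checksum(sample_b))
```

Sorting the per-row digests before hashing means the two clusters can return rows in any order without producing a false mismatch.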

Performance Benchmarking:

Query latency comparisons on identical queries
Resource utilisation monitoring (CPU, memory, disk I/O)
Dashboard load time measurements

Schema Compatibility Validation:

Distributed tables automatically converted to MergeTree engine on Cloud
External table connections (Postgres integrations) functioned without modification
Materialized views rebuilt and maintained correct aggregations

Query Behavior Analysis: Not all queries improved immediately. Some required optimization for Cloud’s query planner. I identified queries that regressed, analyzed their execution plans, and adjusted them before cutover.

Business KPI monitoring was crucial: payments flowing correctly, conversion funnels matching, revenue reports aligning between clusters. This gave stakeholders confidence that the migration wouldn’t introduce business logic errors.

Phase 3: Staged Cutover

Day 20, Switch Reads: All production read queries were pointed to ClickHouse Cloud. This was the highest-risk moment: if Cloud data was inconsistent or queries behaved differently, users would see it immediately. But three weeks of validation meant I was confident.

Monitoring showed:

Query latency improved drastically
Error rates stayed at baseline
Dashboard load times felt instant to users
Business metrics continued flowing without anomalies

Day 30, Decommission: After 10 days of flawless operation on Cloud, I disabled writes to the self-hosted cluster and immediately decommissioned it. The three-week validation period had given me enough confidence to pull the trigger without an extended monitoring window.

The migration was complete with zero downtime and zero data loss.

Performance Improvements

The numbers told a clear story:

Traffic Attribution Query:

Self-hosted: 30 seconds
Cloud: 0.6 seconds
50x improvement

This query joins multiple large tables to attribute user conversions to traffic sources, a complex analytical workload that benefits from ClickHouse Cloud’s improved hardware and query optimisation.

Fulfilment Dashboard:

Self-hosted: 8.1 seconds
Cloud: 1.154 seconds
7x improvement

The most critical dashboard for business operations went from painfully slow to nearly instant, improving decision-making speed for stakeholders.

General Dashboard Experience: Users described self-hosted dashboards as “painful” with frequent complaints about load times. Post-migration, dashboards loaded “in the blink of an eye.”

Why the improvement? ClickHouse Cloud’s managed infrastructure includes:

Better hardware provisioning
Improved query planning and optimisation
Superior resource allocation and isolation
Automatic performance tuning based on workload patterns

Reliability Impact

Operational:

Before: Multiple incidents per quarter (ZooKeeper corruption, scaling-induced downtime, memory explosions, DDL query failures)
After: Zero incidents, 100% uptime, zero maintenance overhead, zero middle-of-night alerts

What Transferred Successfully

The migration preserved our entire infrastructure without schema changes:

All 2,259 tables in a single database
External table connections to Postgres (read-only analytics sources)
Distributed table coordination (automatically optimised on Cloud)
Analytical views and query patterns
Materialized views maintaining real-time aggregations
Schema caching optimisations (80% system load reduction)

The Pulse pipeline required only an endpoint configuration change, with no logic modifications.

Trade-offs and Considerations

Reduced Control: Self-hosted ClickHouse allowed granular tuning of replication factors, buffer sizes, memory limits, and merge tree parameters. ClickHouse Cloud abstracts these controls, managing optimization automatically.

Worth it? Absolutely. The trade is comparable to owning versus leasing infrastructure: you lose some control but gain expert management and reliability.

Initial concerns that didn’t materialize:

Data transfer costs remained minimal
Query performance regressions were rare and easily fixed
Schema compatibility was seamless
Learning curve for Cloud-specific features was gentle

Key Technical Lessons

Idempotent operations are non-negotiable for migrations. Being able to safely re-run backfill scripts without side effects eliminated stress and enabled confident validation.
Dual-write periods expose edge cases synthetic tests miss. Timezone handling differences, collation mismatches, and query planner quirks only appeared under real production load.
Phased migrations aren’t slower; they’re safer and ultimately faster. Three weeks of validation prevented months of debugging production incidents.
Performance testing must include business-critical queries. Don’t just benchmark; validate the queries that matter to your users and stakeholders.

Should You Migrate?

Consider this migration if:

You’re managing millions of rows or more with limited operational support
You’ve experienced scaling challenges requiring downtime
You want to focus engineering resources on features rather than infrastructure maintenance
You need predictable performance during traffic spikes

The migration profile:

Timeline: 30 days
Downtime: Zero
Data loss: Zero
Performance improvement: 7–50x (query-dependent)

Closing Perspective

This migration wasn’t about admitting defeat; it was about recognizing that infrastructure management isn’t core business value. The best infrastructure decision is one that lets you forget about infrastructure and focus on delivering features.

I migrated from being an ops engineer maintaining databases to a data engineer building analytics products. That shift in how I spend my time is the real ROI.

Key Metrics

2,259 tables, 600M+ rows
50x faster queries (30s → 0.6s)
7x faster dashboards (8.1s → 1.154s)
Zero incidents post-migration
Zero downtime migration

The best infrastructure is the one you don’t have to think about.

Migrating 600M Records to ClickHouse Cloud was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building a Database Connection Pool — DIY Edition ️

Gaurav Sharma — Mon, 19 May 2025 05:54:49 GMT

Building a Database Connection Pool — DIY Edition 🛠️

You’ve probably come across this nasty error from your database:

too many clients already error

Well, now that I have your attention, let’s unpack what this actually means — and more importantly, how to fix it (and even build a fix yourself!).

Understanding the Root Cause

The error message

“Too many clients already”

typically indicates that your application is attempting to open more concurrent connections than the database is configured to accept.

Most relational databases, including PostgreSQL, set a default limit on the number of active connections — often around 100. You can check this limit using:

SHOW MAX_CONNECTIONS;

In high-throughput systems where each incoming request initiates a new database connection, you can quickly exceed this limit. One quick fix might be increasing the allowed number of connections:

ALTER SYSTEM SET max_connections = ;

However, this approach has limitations. Every active connection consumes memory and system resources on the database server. Simply increasing the connection limit without understanding your workload and hardware capacity can lead to degraded performance or even downtime.

To handle this in a scalable and reliable way, a connection pooling strategy is essential. Before we get into building one, let’s simulate the problem to see it in action.

import pg from "pg";
import dotenv from "dotenv";

dotenv.config();

/**
 * Execute multiple database connections in parallel
 * @param {number} numConnections - Number of connections to create
 * @returns {Promise} Results from all connections
 */
export async function runInParallel(numConnections) {
  const dbPromises = [];

  async function executeQuery(id) {
    const client = new pg.Client();
    try {
      await client.connect();
      // Also select now() so the `now` column read below actually exists
      const res = await client.query("SELECT pg_sleep(2), now()");
      await client.end();
      return { id, success: true, time: res.rows?.[0]?.now };
    } catch (error) {
      console.error(`Connection ${id}: Error:`, error.message);
      await client.end().catch(() => {});
      return { id, success: false, error: error.message };
    }
  }

  for (let i = 0; i < numConnections; i++) {
    dbPromises.push(executeQuery(i));
  }

  return Promise.all(dbPromises);
}

// Example usage
export const runParallelExample = async () => {
  console.log("Running parallel connections example...");
  
  try {
    const results = await runInParallel(100);
    console.log("All database connections completed");
    console.log(`Successful connections: ${results.filter(r => r.success).length}`);
    console.log(`Failed connections: ${results.filter(r => !r.success).length}`);
    return results;
  } catch (err) {
    console.error("Error in parallel connections:", err);
    throw err;
  }
};

Output -

Enter: Connection Pools

This is where connection pooling becomes critical.

Creating a new database connection for every request doesn’t scale — it’s slow, resource-intensive, and can overwhelm your database under load.

Connection pools solve this by maintaining a fixed number of open connections that are reused across requests. Each request borrows a connection, runs its query, and returns it to the pool — reducing overhead and avoiding connection exhaustion.

While most libraries offer pooling out of the box, building one yourself is a great way to understand how it really works. Let’s see how to implement a simple version in JavaScript.

You can use your own preferred language. Implementation will not differ much.

We’ll use the pg package to talk to a Postgres DB.

Step 1: Create a Single Connection

import pg from "pg";

export const createConnection = () => {
  const client = new pg.Client();
  return client.connect().then(() => client);
};

Step 2: Build the Pool

export const createConnectionPool = async (poolCount) => {
  const connections = [];
  for (let i = 0; i < poolCount; i++) {
    connections.push(createConnection());
  }
  return Promise.all(connections);
};

Step 3: Pool Manager Class

import pg from "pg";

/**
 * Creates a single database connection
 * @returns {Promise} A connected database client
 */
export const createConnection = () => {
  const client = new pg.Client();
  return client.connect().then(() => client);
};

/**
 * Creates a pool of database connections
 * @param {number} poolCount - Number of connections to create
 * @returns {Promise} Array of connected database clients
 */
export const createConnectionPool = async (poolCount) => {
  const connections = [];

  for (let i = 0; i < poolCount; i++) {
    connections.push(createConnection());
  }

  return Promise.all(connections);
};

/**
 * Connection pool manager class
 */
export class ConnectionPool {
  /**
   * Create a new connection pool
   * @param {number} poolSize - Number of connections to maintain
   */
  constructor(poolSize = 10) {
    this.poolSize = poolSize;
    this.connections = [];
    this.availableConnections = [];
    this.mutex = new Set(); // Track connections in use
    this.waitQueue = []; // Queue for waiting query requests
    this.isInitialized = false;
  }

  /**
   * Initialize the connection pool
   * @returns {Promise}
   */
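  async initialize() {
    if (this.isInitialized) return;
    // Share one in-flight init so concurrent first queries don't each open a pool
    if (!this.initPromise) {
      this.initPromise = createConnectionPool(this.poolSize)
        .then((connections) => {
          this.connections = connections;
          this.availableConnections = [...connections];
          this.isInitialized = true;
          console.log(`Created a pool with ${connections.length} connections`);
        })
        .finally(() => {
          this.initPromise = null;
        });
    }
    return this.initPromise;
  }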

  /**
   * Execute a SQL query using a connection from the pool
   * @param {string} queryText - SQL query to execute
   * @param {Array} params - Query parameters
   * @param {any} queryId - Optional identifier for the query
   * @returns {Promise} Query result
   */
  async executeQuery(queryText, params = [], queryId = null) {
    if (!this.isInitialized) {
      await this.initialize();
    }

    // Wait for an available connection if none are free
    if (this.availableConnections.length === 0) {
      return new Promise((resolve) => {
        this.waitQueue.push(() => {
          this.executeQuery(queryText, params, queryId).then(resolve);
        });
      });
    }

    // Get a connection from the pool
    const connection = this.availableConnections.pop();
    this.mutex.add(connection); // Lock this connection

    try {
      // Execute query
      console.log(`Query ${queryId}: Executing with connection`);
      const result = await connection.query(queryText, params);
      return { 
        queryId, 
        success: true, 
        rows: result.rows,
        rowCount: result.rowCount
      };
    } catch (error) {
      console.error(`Query ${queryId}: Error:`, error.message);
      return { queryId, success: false, error: error.message };
    } finally {
      // Release the connection back to the pool
      this.mutex.delete(connection);
      this.availableConnections.push(connection);

      // If there are waiting queries, process the next one
      if (this.waitQueue.length > 0) {
        const nextQuery = this.waitQueue.shift();
        nextQuery();
      }
    }
  }

  /**
   * Close all connections in the pool
   * @returns {Promise}
   */
  async close() {
    // Wait for all connections to be available (not in use)
    while (this.mutex.size > 0) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
    
    // Close all connections
    await Promise.all(
      this.connections.map(client => client.end())
    );
    
    this.connections = [];
    this.availableConnections = [];
    this.isInitialized = false;
  }
}

This is a simple pool manager class. It does two things:

It initializes the pool with the requested number of connections.
It executes a given SQL query. If more queries arrive than there are free connections, the extra requests are placed in waitQueue. Whenever a query finishes, whether it succeeded or failed, the lock on its connection is released, the connection goes back into the pool, and the next waiting request (if any) is picked up for execution.
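
The queueing behaviour described above can be sketched in isolation. This is a simplified, synchronous model rather than the pool itself: SlotPool, acquire, and release are hypothetical names, and a counter stands in for real connections.

```javascript
// Toy model of the pool's hand-off logic: "slots" stand in for connections.
class SlotPool {
  constructor(size) {
    this.free = size; // number of idle slots
    this.waitQueue = []; // tasks waiting for a slot, oldest first
  }

  // Run `task` now if a slot is free, otherwise park it in the queue.
  acquire(task) {
    if (this.free === 0) {
      this.waitQueue.push(task);
      return false; // queued
    }
    this.free -= 1;
    task();
    return true; // started immediately
  }

  // Return a slot and wake the oldest waiter, if any.
  release() {
    this.free += 1;
    const next = this.waitQueue.shift();
    if (next) this.acquire(next);
  }
}

const started = [];
const slotPool = new SlotPool(2);
slotPool.acquire(() => started.push("q1"));
slotPool.acquire(() => started.push("q2"));
slotPool.acquire(() => started.push("q3")); // no free slot, so q3 waits
const startedBeforeRelease = started.length;
slotPool.release(); // one query finished, so q3 gets its turn
console.log(startedBeforeRelease, started);
```

The real class does the same thing with promises: the queued callback re-enters executeQuery once a connection has been pushed back.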

Example Usage

import dotenv from "dotenv";
import { ConnectionPool } from "../lib/connectionPool.js";

dotenv.config();


export async function runUsingPool(poolCount, queryCount) {
  const pool = new ConnectionPool(poolCount);
  await pool.initialize();

  try {
    // Execute multiple queries
    const queryPromises = [];
    for (let i = 0; i < queryCount; i++) {
      queryPromises.push(pool.executeQuery("SELECT pg_sleep(2)", [], i));
    }

    const results = await Promise.all(queryPromises);
    return results;
  } finally {
    // Ensure pool is closed even if there are errors
    await pool.close();
  }
}

// Example usage
export const runPoolExample = async () => {
  console.log("Running connection pool example...");
  
  try {
    // Run 100 queries with a pool of 20 connections
    const results = await runUsingPool(20, 100);
    console.log("All queries completed:", results.length);
    console.log(`Successful queries: ${results.filter(r => r.success).length}`);
    console.log(`Failed queries: ${results.filter(r => !r.success).length}`);
    return results;
  } catch (err) {
    console.error("Error executing queries:", err);
    throw err;
  }
};

Tada 🎉

Cool, right? 😎

We now have a basic, working connection pool that:

Initializes a fixed number of DB connections
Reuses them across queries
Queues requests when all connections are busy
Gracefully handles overloads

Optimizations & Ideas 💡

Elastic Pools: Dynamically scale up or down based on demand.
Timeouts: Add timeouts for queries stuck in queue.
Query Prioritization: Serve critical requests first.
Observability: Track pool usage, saturation, and wait times.
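
As one possible starting point for the query prioritization idea, the plain waitQueue array could be swapped for a small priority queue. The sketch below is an assumption, not part of the linked codebase: PriorityWaitQueue, enqueue, and dequeue are hypothetical names, and lower numbers mean higher priority.

```javascript
// Hypothetical replacement for the plain `waitQueue` array.
class PriorityWaitQueue {
  constructor() {
    this.items = [];
    this.counter = 0; // tie-breaker so equal priorities stay first-in, first-out
  }

  // Lower `priority` numbers are served first (0 = most critical).
  enqueue(task, priority = 10) {
    this.items.push({ task, priority, seq: this.counter++ });
    this.items.sort((a, b) => a.priority - b.priority || a.seq - b.seq);
  }

  // Remove and return the most urgent task, or undefined when empty.
  dequeue() {
    const entry = this.items.shift();
    return entry ? entry.task : undefined;
  }

  get length() {
    return this.items.length;
  }
}

const waiting = new PriorityWaitQueue();
waiting.enqueue("nightly-report", 10);
waiting.enqueue("checkout-payment", 0);
waiting.enqueue("search-suggestions", 10);

const served = [];
while (waiting.length > 0) served.push(waiting.dequeue());
console.log(served);
```

Sorting on every enqueue is fine for short queues; a binary heap would be the usual choice if the queue can grow large.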

Complete codebase here — https://github.com/grvsharma1810/connection-pool-diy

Wrapping Up

Hope this helped you understand the “why” behind connection pools — and how to build one yourself.

If you’re into backend engineering and love going deep into systems stuff, let’s connect:

📍 Twitter
🔗 LinkedIn

Building a Database Connection Pool — DIY Edition 🛠️ was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Slice-Based Zustand Store for Next.js 14 and TypeScript

Heinrich Winterbach — Thu, 30 Jan 2025 05:10:00 GMT

Ever feel like your React state is spiraling out of control? Or that you’re gluing together too many unlinked pieces in a big Next.js app? Zustand might just be the relaxing bubble bath your state logic needs.

In this article, we’ll explore how to implement a slice-based state management approach in a Next.js 14 application written in TypeScript, using Zustand and its powerful middlewares. You’ll see how to keep your store modular, add advanced features like devtools and persist, and incorporate easy immutability with Immer. The best part? Once you learn the basics of slice-based design, you can mix and match these pieces however you like.

1. Why Zustand in a Next.js 14 World?

Next.js 14 gives you a spiffy environment for writing both server and client components, but guess what? The client side still needs to orchestrate user-driven interactions, data, and logic. Redux or large frameworks might be overkill. Zustand brings a refreshing minimalism:

• No forced structure: You just write JavaScript (or TypeScript) objects for your state and methods.

• Slices: Rather than a monolithic store, you break things down by domain. Each slice is an isolated chunk that’s easy to understand and test.

• Optional middlewares: You can add devtools, persist, subscribeWithSelector, and immer based on your needs.

• TypeScript-friendly: If you love strong typing, Zustand’s got your back.

When building advanced features — like verifying passports, linking child travelers, or managing entire booking processes — this approach is a lifesaver. Rather than burying your head in a single monstrous store, you define small slices that do exactly what they need. Then you glue them together into a single store that the rest of your Next.js 14 app can consume.

2. Setting Up: Dependencies and Folder Structure

Let’s say you have a Next.js 14 project called my-app. Here’s roughly where our store code might live:

my-app/
├─ app/
│ ├─ (various client/server components)
├─ store/
│ ├─ slices/
│ │ ├─ application-travelers-slice.ts
│ │ ├─ application-slice.ts
│ │ ├─ insurance-slice.ts
│ │ ├─ activities-slice.ts
│ │ ├─ …
│ ├─ agency-store.ts
│ └─ …
├─ package.json
└─ …

We’ll define slices like application-travelers-slice.ts, application-slice.ts, insurance-slice.ts, etc. Each slice focuses on a distinct domain — passports, user identity, child linkages, or insurance coverage.

Install Zustand (the tools you need):

# Yarn
yarn add zustand immer
# or NPM
npm install zustand immer

This includes:

• Zustand: The state library itself

• zustand/middleware: Devtools, persist, subscribeWithSelector, etc.

• immer: Easy immutability

Once installed, you’re all set to code the slices.

3. What Exactly Is a “Slice” in Zustand?

A slice is simply a function that returns part of your state plus the actions for that part. For instance, if you have a “traveler” slice, it might look like:

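// store/slices/application-travelers-slice.ts
import type { StateCreator } from "zustand";

export type ApplicationTravelerSliceState = {
  travelers: Record<number, {
    passportNumber: string;
    name: string;
    // ...
  }>;
  // Value type assumed for illustration: traveler id -> error message
  travelerErrors: Record<number, string>;
};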

export type ApplicationTravelerSliceActions = {
  addTraveler: (travelerData: { name: string; passportNumber: string }) => void;
  removeTraveler: (id: number) => void;
};

export type ApplicationTravelerSlice =
  ApplicationTravelerSliceState & ApplicationTravelerSliceActions;

// The slice is a function returning that combined shape
export const createApplicationTravelerSlice: StateCreator<
  /* your store type */,
  /* any middlewares used */,
  [],
  ApplicationTravelerSlice
> = (set, get) => ({
  travelers: {},
  travelerErrors: {},

  addTraveler: (travelerData) => {
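    // Use max id + 1 so an id is never reused after a removal
    const ids = Object.keys(get().travelers).map(Number);
    const nextId = ids.length > 0 ? Math.max(...ids) + 1 : 1;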
    set((state) => {
      state.travelers[nextId] = travelerData;
    });
  },

  removeTraveler: (id) => {
    set((state) => {
      delete state.travelers[id];
    });
  },
});

This is the entire gist of a slice: define the shape of data it controls, define how you mutate that data, then unify. Because it’s pure TypeScript, you can scale it up with advanced logic — like validations or watchers — however you like.

4. Distilling Multiple Slices

Chances are, you’ll have multiple slices — like an “Application slice” for high-level data, an “Insurance slice” for coverage, a “Bulk Upload slice,” and so on. Each slice:

1. Exports its state type, its actions type, and the combined slice type.

2. Includes an init function that configures the slice.

This might feel like a lot of ceremony at first, but it’s super clean once you get going. If you only have one slice, you don’t need to slice at all; Zustand works either way. But for a more complicated Next.js 14 application (think multiple user flows or specialized data), you’ll want that separation.

5. Merging Slices in a Single Store

Enter agency-store.ts (you can name it whatever you like). This is where all the magic middlewares come into play, plus the final aggregator. Typically:

// store/agency-store.ts

import { create } from "zustand";
import { devtools, persist } from "zustand/middleware";
import { immer } from "zustand/middleware/immer";
import { subscribeWithSelector } from "zustand/middleware";

import { createApplicationTravelerSlice, ApplicationTravelerSlice } from "./slices/application-travelers-slice";
import { createApplicationSlice, ApplicationSlice } from "./slices/application-slice";
import { createInsuranceSlice, InsuranceSlice } from "./slices/insurance-slice";
// ...

// 1) Build an overall Store type
export type Store = ApplicationTravelerSlice & ApplicationSlice & InsuranceSlice /* ...more slices */;

// 2) Actually create the store
export const useAgencyStore = create<Store>()(
  devtools(
    persist(
      subscribeWithSelector(
        immer((...args) => ({
          ...createApplicationTravelerSlice(...args),
          ...createApplicationSlice(...args),
          ...createInsuranceSlice(...args),
          // ...
        }))
      ),
      {
        name: "agency-store",
        partialize: (state) => state, // or pick certain fields
      }
    ),
    { name: "AgencyDevtools" }
  )
);

Let’s break down the main players:

• create<Store>()(…): Yup, TypeScript nuance. The first call takes the store type as a generic argument and returns a function; the second call passes in our store logic.

• devtools(…): Wraps your store so that the Redux DevTools extension can track state changes.

• persist(…): Saves state to localStorage (by default) under “agency-store”. Next time the user visits, it rehydrates their previous session.

• subscribeWithSelector(…): Lets you call store.subscribe with a selector, so a listener (inside or outside React) fires only when that selected piece of state changes.

• immer(…): Provides a “mutable” syntax while preserving immutability under the hood.

Inside immer(…), we unify all slices by spreading them: …createApplicationTravelerSlice(…args). We do the same for any slice we want to incorporate.
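
To make the spreading step concrete, here is a minimal sketch in plain JavaScript. createToyStore is a toy stand-in written for this example, not Zustand's implementation, and both slices are hypothetical; the point is only that every slice receives the same set and get and that their return values are merged into one object.

```javascript
// Toy store: NOT Zustand's implementation, just enough to show slice merging.
function createToyStore(initializer) {
  let state;
  const get = () => state;
  // Shallow-merge a partial update (or the result of an updater function).
  const set = (partial) => {
    const next = typeof partial === "function" ? partial(state) : partial;
    state = { ...state, ...next };
  };
  state = initializer(set, get);
  return { getState: get, setState: set };
}

// Two hypothetical slices; each receives the same set/get as every other slice.
const createTravelerSlice = (set, get) => ({
  travelers: {},
  addTraveler: (traveler) => {
    const nextId = Object.keys(get().travelers).length + 1;
    set((s) => ({ travelers: { ...s.travelers, [nextId]: traveler } }));
  },
});

const createInsuranceSlice = (set) => ({
  insured: false,
  setInsured: (insured) => set({ insured }),
});

// The aggregator: spread every slice into a single state object.
const toyStore = createToyStore((...args) => ({
  ...createTravelerSlice(...args),
  ...createInsuranceSlice(...args),
}));

toyStore.getState().addTraveler({ name: "Alice", passportNumber: "AB123XYZ" });
toyStore.getState().setInsured(true);
console.log(toyStore.getState().travelers, toyStore.getState().insured);
```

Because all slices end up in one flat object, two slices that use the same key would silently overwrite each other, so distinct key names per slice are worth enforcing.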

6. Using the Store in Your Next.js 14 Components

Now, let’s put it to work in a client component:

"use client";

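import React from "react";
import { useShallow } from "zustand/react/shallow";
import { useAgencyStore } from "@/store/agency-store";

export default function TravelerList() {
  // useShallow keeps the selector's object result stable between renders
  const { travelers, addTraveler } = useAgencyStore(
    useShallow((state) => ({
      travelers: state.travelers,
      addTraveler: state.addTraveler,
    }))
  );

  const handleAdd = () => {
    addTraveler({ name: "Alice", passportNumber: "AB123XYZ" });
  };

  // Markup reconstructed; the element choices and button label are illustrative
  return (
    <div>
      <button onClick={handleAdd}>Add traveler</button>
      <ul>
        {Object.entries(travelers).map(([id, data]) => (
          <li key={id}>
            {id}: {data.name} - {data.passportNumber}
          </li>
        ))}
      </ul>
    </div>
  );
}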

Key points:

• Mark your component with “use client” at the top — Zustand stores are strictly client-based (unless you do advanced SSR).

• useAgencyStore: A single hook that merges all slices. We can destructure the traveler slice’s methods or data easily.

That’s it. If you also had an insurance slice, you’d do useAgencyStore((s) => s.insuranceTravelers) or s.setInsuranceTripDetails as needed.

7. DevTools and Persist: Real-World Observations

Once your code runs, open your browser’s DevTools with the Redux DevTools extension installed. You’ll see an entry named “AgencyDevtools” (or whatever name you specified). Each time you call an action like addTraveler, it logs a state change. You can time-travel, inspect states, etc.

Meanwhile, if you check localStorage, you’ll spot a key named “agency-store”. That’s your serialized data. If you refresh the page, the store rehydrates, so your travelers persist. If you only want to store some fields, partialize is your friend:

persist(
  subscribeWithSelector(immer(...)),
  {
    name: "agency-store",
    partialize: (state) => {
      return { travelers: state.travelers }; 
    },
  }
);

Now you persist only travelers, ignoring everything else.
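
A rough sketch of what partialize changes about the stored value follows. The in-memory storage object, the saveSnapshot helper, and the { state, version } envelope are assumptions made for this example; check the persist documentation for the exact format it writes.

```javascript
// In-memory stand-in for localStorage (assumption: same getItem/setItem shape).
const fakeStorage = new Map();
const storage = {
  setItem: (key, value) => fakeStorage.set(key, value),
  getItem: (key) => (fakeStorage.has(key) ? fakeStorage.get(key) : null),
};

// Hypothetical helper: write only the partialized state under `name`.
function saveSnapshot(name, state, partialize = (s) => s) {
  storage.setItem(name, JSON.stringify({ state: partialize(state), version: 0 }));
}

const fullState = {
  travelers: { 1: { name: "Alice", passportNumber: "AB123XYZ" } },
  travelerErrors: { 1: "Passport expired" },
  addTraveler: () => {}, // functions are dropped by JSON.stringify anyway
};

saveSnapshot("agency-store", fullState, (s) => ({ travelers: s.travelers }));
const saved = JSON.parse(storage.getItem("agency-store"));
console.log(Object.keys(saved.state)); // only the travelers key is stored
```

Anything left out by partialize simply starts from its initial value after a reload.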

8. Handling Immense Data or Nested Changes

By default, Zustand merges states at the top level. That means if you set a nested object, you might need to do a bit of manual merging. This is precisely where Immer is so helpful:

// Without Immer, you'd do something like:
// set((state) => ({ 
//   travelers: {
//     ...state.travelers,
//     [id]: {...state.travelers[id], ...updates}
//   }
// }));

// With Immer, you can do:
set((state) => {
  Object.assign(state.travelers[id], updates);
});

Internally, Immer ensures these updates remain immutable. It’s magical for slice logic that’s deeper than one or two levels.
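
To see what "remain immutable" means in practice, the hand-written version can be run on its own. updateTraveler is a hypothetical helper for this example, and the result has the same shape you would expect from an Immer-based update: a new object along the changed path, with untouched branches shared by reference.

```javascript
// Hypothetical helper: immutable update of one traveler, written by hand.
function updateTraveler(state, id, updates) {
  return {
    ...state,
    travelers: {
      ...state.travelers,
      [id]: { ...state.travelers[id], ...updates },
    },
  };
}

const before = {
  travelers: {
    1: { name: "Alice", passportNumber: "AB123XYZ" },
    2: { name: "Bob", passportNumber: "CD456UVW" },
  },
};
const after = updateTraveler(before, 1, { name: "Alice Smith" });

console.log(after.travelers[1].name); // the updated copy
console.log(before.travelers[1].name); // the original object is untouched
console.log(after.travelers[2] === before.travelers[2]); // unchanged branch is shared
```

That sharing is what lets selectors for traveler 2 keep returning the same reference, so components subscribed only to that traveler do not re-render.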

9. Avoiding Unnecessary Re-renders via useShallow

When you do:

const user = useAgencyStore((store) => store.user);

Your component re-renders if the object store.user changes reference, even if it’s the same data. Sometimes that’s a problem. Zustand offers useShallow:

import { useShallow } from "zustand/react/shallow";

const userInfo = useAgencyStore(
  useShallow((store) => ({ 
    name: store.user.name, 
    age: store.user.age 
  }))
);
// Re-renders only if name or age changes

You can keep your slices returning objects while skipping re-renders if the actual fields remain the same.
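
The idea behind a shallow comparison can be sketched in a few lines. shallowEqual below is a simplified version written for this example and only handles plain objects; Zustand's own shallow helper covers more cases, such as Maps and Sets.

```javascript
// Simplified shallow comparison (assumption: plain objects only).
function shallowEqual(a, b) {
  if (Object.is(a, b)) return true;
  if (typeof a !== "object" || a === null || typeof b !== "object" || b === null) {
    return false;
  }
  const keysA = Object.keys(a);
  if (keysA.length !== Object.keys(b).length) return false;
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && Object.is(a[key], b[key])
  );
}

const selectUserInfo = (store) => ({ name: store.user.name, age: store.user.age });

const storeV1 = { user: { name: "Alice", age: 30 }, travelers: {} };
const storeV2 = { ...storeV1, travelers: { 1: { name: "Bob" } } }; // user untouched
const storeV3 = { ...storeV2, user: { name: "Alice", age: 31 } }; // age changed

// A fresh object every call, so a reference check always reports a change.
const sameReference = selectUserInfo(storeV1) === selectUserInfo(storeV2);
// Field by field, nothing the component reads has changed.
const sameFields = shallowEqual(selectUserInfo(storeV1), selectUserInfo(storeV2));
// Here age really changed, so a re-render is warranted.
const sameAfterBirthday = shallowEqual(selectUserInfo(storeV2), selectUserInfo(storeV3));
console.log(sameReference, sameFields, sameAfterBirthday);
```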

10. Testing Slices

Zustand slices are simple to test — each slice is mostly a function. You can test them individually or test the combined store:

import { useAgencyStore } from "@/store/agency-store";

describe("Traveler Slice Tests", () => {
  it("adds a traveler properly", () => {
    const before = useAgencyStore.getState().travelers;
    expect(Object.keys(before)).toHaveLength(0);

    useAgencyStore.getState().addTraveler({ name: "Alice", passportNumber: "XYZ123" });

    const after = useAgencyStore.getState().travelers;
    expect(Object.keys(after)).toHaveLength(1);
  });
});

If you want to reset your store after each test, you can call your slice’s reset methods or re-initialize the store. It’s just JavaScript objects — no exotic mocking required.

11. Common Pitfalls

1. Server vs. Client:

• Don’t import your client store inside a server component. That will result in confusion (and possibly runtime errors).

2. Middleware Order:

• Typically, you want devtools as the outermost wrapper, then persist, then subscribeWithSelector, then immer.

3. LocalStorage Limitations:

• If you’re persisting large volumes, you could run out of space. Keep that in mind or narrow your data with partialize.

4. Performance:

• If you keep thousands of items in a single slice, be sure you’re selecting small pieces in your UI. Or rely on useShallow to avoid rerenders if the slice is stable.

12. Wrapping Up

That’s the gist of building a slice-based Zustand store in Next.js 14 with TypeScript. Zustand remains incredibly straightforward, letting you define slices of logic with near-zero overhead. You can sprinkle in advanced validations (like with Yup, Zod, or your favorite library), nest as deeply as you want with Immer, and watch your changes in real time with devtools. Meanwhile, you only store exactly what you want with persist.

Key Takeaways:

1. Slices keep your logic cohesive. Each domain or feature in your Next.js app can be one slice.

2. Aggregator (useAgencyStore) merges them, hooking in devtools, persist, etc.

3. TypeScript ensures safe, typed interactions.

4. useShallow helps you skip unnecessary re-renders if you’re returning objects from the store.

5. Testing is easy: Just import the store or slices and call the methods.

Next time your Next.js 14 app calls for a “done-for-you” client state solution, give this approach a spin. It’s a breeze to test, debug, and expand. And if your boss asks for a new slice — like, say, a “Coupon slice” or “Payment Gateway slice” — no problem. Just drop it in, merge it in agency-store.ts, and away you go. That’s the beauty of small, well-defined pieces over a single monstrous state file.

Now, go forth: build that traveler-management or e-commerce or quiz application. Relax in the knowledge that you’re not stuck wrangling boilerplate — Zustand has you covered. Happy slicing and coding!

A Slice-Based Zustand Store for Next.js 14 and TypeScript was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Automating passport detection and quality analysis using deep learning

Shubham Tiwari — Wed, 27 Nov 2024 05:36:14 GMT

[Image source](https://www.shutterstock.com/image-vector/vector-blank-open-passport-template-international-1060912469)

In the visa application process, accurate and reliable identification of passport images is a critical step. Given that Atlys specializes in providing visas, the ability to detect and validate passport images with high precision is essential for a smooth application process. Here are the key challenges that define this problem:

Detecting the presence of a valid passport in a given image is fundamental. Ensuring high detection accuracy reduces errors in the application process, minimizes manual intervention, and improves user trust.
Low-quality images with issues like blur, glare, or obstructing objects (e.g., fingers) are common problems in user-submitted photos. Such imperfections can lead to image rejection by embassies, as they compromise the document’s readability. Thus, the model must be able to assess these quality aspects and flag problematic images.
High-quality images are vital for the next step of optical character recognition (OCR), where passport details are extracted. Poor image quality can lead to errors in field extraction, impacting the overall application process.
The faster and more seamless the image verification process, the better the user experience. By providing real-time feedback on image quality, we can guide users to capture a valid image on the first attempt. This reduces the need for resubmissions, enhances user satisfaction, and increases conversion rates.

Product description and model selection

Our product is designed to handle two primary workflows: Live Capture and Upload. These workflows cater to different user interactions while ensuring that only high-quality images meeting all requirements proceed to the next stage of processing. We set the following key objectives for model training:

Both workflows require low latency for a seamless user experience.
The model should generalize well to passports from all countries, including countries for which no training data is available.
Since we use the same model to detect the passport and the quality-check classes such as glare and fingers, it should perform well across all classes.

We benchmarked various models against these objectives and selected YOLOv8 for its low latency and architectural efficiency. The model is designed with an anchor-free structure and optimized layers, which reduces computational overhead compared to older YOLO versions. This enables faster inference while maintaining high detection accuracy.

YOLOv8 offers multiple size variants (e.g., YOLOv8n, YOLOv8s, YOLOv8m) that can be chosen based on deployment requirements. For our use case, we used the smaller variants (YOLOv8n and YOLOv8s) to minimize latency without compromising too much on accuracy.

[Image source](https://blog.roboflow.com/what-is-yolov8/)

Annotation and model training

To avoid data and annotation challenges, we followed an active learning approach. It’s a semi-automated approach to annotation where a model predicts labels on new data, and those predictions are then reviewed and corrected by humans before the model is retrained. Here’s how it works and why it made our annotation job easier:

How the Process Works

Initial Model Training: We start with a small, labeled dataset and train a model.
Model-Assisted Annotation: As new, unlabeled data becomes available, we run it through the trained model to generate initial annotations or predictions.
Human Review: We manually review and correct the annotations generated by the model, ensuring high-quality labeled data.
Retraining: The corrected data is added back to the dataset, and the model is retrained on this expanded, more accurate dataset.
Iteration: This process is repeated as more data becomes available, progressively improving both the dataset quality and the model’s accuracy.

Image : by author

How active learning sped up model training

Reduced Annotation Effort: The model automatically generates initial labels, saving time compared to annotating everything manually from scratch. We only need to correct the errors, which is significantly faster.
Improved Data Quality: By iteratively refining annotations, we ensure that each iteration of the model trains on more accurate data, leading to better predictions over time.
Efficient Use of Resources: We make the most of limited initial labeled data and human effort, leveraging the model to handle repetitive tasks while focusing human effort where it’s most needed (on corrections).
Scalability: This approach scales well with growing datasets. As more data comes in, the model’s predictions become increasingly accurate, further reducing manual work.
Continuous Model Improvement: Each iteration improves the model’s understanding of the task, leading to better performance not only for generating annotations but also for the final deployment.

In addition to streamlining annotation, the active learning approach made it much easier to add new classes to our model’s training process. Since the model could accurately label existing classes, we only needed to annotate the new class on relevant samples, significantly cutting down the manual workload.

Optimization and deployment

Deploying deep learning models for both live capture and upload use cases requires careful optimization and integration to meet performance, latency and scalability requirements. TFLite is a lightweight framework that optimizes models for deployment on web and mobile devices. It ensures low latency and efficient use of resources, which is crucial for live capture use cases. Ultralytics’ YOLO framework makes it easy to benchmark a model and export it to TFLite.

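from ultralytics import YOLO

# Load a model: either an official pretrained checkpoint...
model = YOLO("yolo11n.pt")
# ...or your own fine-tuned weights (this second assignment replaces the first)
model = YOLO("path/to/best.pt")
# Export the model to TFLite with FP16 (half-precision) weights
model.export(format="tflite", half=True)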

For the full list of parameters that can be tuned when exporting a model to TFLite, see the Ultralytics Export documentation.

Post-processing for quality checks

In post-processing, we check whether fingers or glare fall on the passport region. Some countries are strict and do not allow fingers or glare anywhere on the passport, whereas others are flexible as long as the text and photo are not covered. Based on this, we defined two different ROIs to check. The two guide boxes give us a fine balance between visa photo requirements and a hassle-free user experience.

[Image source](https://www.shutterstock.com/image-vector/vector-blank-open-passport-template-international-1060912469)

The flow chart below details the post-processing steps:

Image — by author
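
The two-ROI check described in this section can be sketched as a simple box-overlap test. Everything specific here is an assumption made for illustration: the [x1, y1, x2, y2] box format, the label names, the coordinates, and the findViolations helper are not taken from the production code.

```javascript
// Boxes are [x1, y1, x2, y2] in pixels (assumed format).
function intersects(a, b) {
  return a[0] < b[2] && b[0] < a[2] && a[1] < b[3] && b[1] < a[3];
}

// Hypothetical check: collect finger/glare detections that touch
// any region of interest for the selected policy.
function findViolations(detections, rois) {
  return detections.filter(
    (d) =>
      (d.label === "finger" || d.label === "glare") &&
      rois.some((roi) => intersects(d.box, roi))
  );
}

const passportBox = [100, 100, 900, 600];
// Strict policy: the whole passport. Lenient policy: only photo and text areas.
const strictRois = [passportBox];
const lenientRois = [
  [120, 150, 380, 480], // photo area (made-up coordinates)
  [400, 150, 880, 580], // text area (made-up coordinates)
];

const detections = [
  { label: "finger", box: [850, 90, 950, 140] }, // clips the top-right corner only
  { label: "passport", box: passportBox },
];

const strictViolations = findViolations(detections, strictRois);
const lenientViolations = findViolations(detections, lenientRois);
console.log(strictViolations.length, lenientViolations.length);
```

In this example the finger overlaps the passport edge, so the strict policy flags it, while the lenient policy lets it pass because neither the photo nor the text area is touched.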

Also, check out the blog post below on how we integrated it into our production web app:

How to build a React App to interact with a ML model locally

This enabled us to provide a hassle-free, efficient and accurate solution for passport image detection and validation. Just navigate to our website to experience our seamless passport scanning process.

If you’re excited by projects like this, consider joining our team! We’re hiring!


Automating passport detection and quality analysis using deep learning was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building a Production-Grade Full-Text Search System with PostgreSQL: Lessons from Atlys

Hardik Gupta — Mon, 25 Nov 2024 19:27:13 GMT

At Atlys, we recently implemented a full-text search system for our activities platform using PostgreSQL’s text search capabilities. Before diving into the implementation, let’s understand some key PostgreSQL full-text search components.

Key PostgreSQL Full-Text Search Components

1. tsvector:
- A sorted list of normalized lexemes (words stripped to their base form)
- PostgreSQL Documentation: tsvector

SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- Result: 'brown':3 'fox':4 'jump':5 'quick':2

2. tsquery:
- Query tree for text search that specifies what to search for
- PostgreSQL Documentation: tsquery

SELECT to_tsquery('english', 'quick & fox');
-- Searches for documents containing both "quick" and "fox"

3. unaccent:
- Text search dictionary that removes accents (diacritics) from lexemes
- PostgreSQL Documentation: unaccent

CREATE EXTENSION IF NOT EXISTS unaccent;

Setting Up the Foundation

First, we needed to prepare our database for text search. We added a tsvector column and created necessary indexes:

ALTER TABLE product_metadata
ADD COLUMN search_vector tsvector;

-- Create GIN index for faster searches
CREATE INDEX product_metadata_search_idx
ON product_metadata USING GIN (search_vector);

Our trigger for automatic vector updates:

CREATE OR REPLACE FUNCTION product_metadata_search_vector_update() RETURNS trigger AS $$
BEGIN
  -- Combine and weight different fields
  NEW.search_vector :=
    setweight(to_tsvector('english', unaccent(coalesce(NEW.city_name, ''))), 'A') ||
    setweight(to_tsvector('english', unaccent(coalesce(NEW.product_name, ''))), 'B');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

The Search Implementation

Our search implementation uses to_tsquery with explicit query cleaning, because to_tsquery is likely to fail when the search argument contains special characters such as % or [.

websearch_to_tsquery handles these edge cases, but it does not support prefix matching (the :* suffix), so it only matches whole words rather than the partial-word matches we wanted.

We decided to go ahead with to_tsquery and built our own guardrails to keep the search argument clean:

import re
from typing import List, Tuple

async def search_products(
    self,
    query: str,
    limit: int = 10,
    offset: int = 0,
) -> Tuple[List[dict], List[dict]]:
    def _clean_query(q: str) -> str:
        # Replace tsquery operators and other special characters with spaces
        q = re.sub(r'[!&|():<>\'"\[\]{}+\-~*?\\%@#$^=;,]+', ' ', q)
        q = re.sub(r'\s+', ' ', q)
        q = q.strip()
        # AND the remaining terms together
        return ' & '.join(q.split())

    cleaned_query = _clean_query(query)

websearch_to_tsquery vs to_tsquery: Our Journey

PostgreSQL offers multiple text search functions:
- to_tsquery: Basic text search parser
- plainto_tsquery: Converts plain text to tsquery
- websearch_to_tsquery: Implements web-style search syntax
- PostgreSQL Text Search Functions

Initially, we used websearch_to_tsquery, but switched to to_tsquery because:
1. Better control over search term combinations
2. More predictable partial matching behavior
3. Consistent handling of special characters

Our search query:

WITH SearchResults AS (
  SELECT
    city_name,
    city_code,
    product_name,
    ts_rank_cd(search_vector, to_tsquery('english', :query || ':*')) AS rank
  FROM product_metadata
  WHERE search_vector @@ to_tsquery('english', :query || ':*')
  ORDER BY rank DESC
)

Handling Special Characters

Special characters can break tsquery syntax. Our solution:

def _clean_query(q: str) -> str:
    # Remove special characters that could break tsquery
    q = re.sub(r'[!&|():<>\'"\[\]{}+\-~*?\\%@#$^=;,]+', ' ', q)
    return ' & '.join(q.split())

# Input: "[Limited Time: 15% Off] City Sightseeing"
# Output: "Limited & Time & 15 & Off & City & Sightseeing"
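
For readers working in JavaScript, the same cleaning step can be sketched as below. This is an illustrative port, not our production code: buildPrefixTsquery is a hypothetical name, and it makes two changes that are assumptions on top of the Python version. It returns null for input that is empty after cleaning, so the caller can skip the query, and it appends the :* prefix marker itself instead of leaving that to the SQL.

```javascript
// Illustrative port of the cleaning step with an explicit empty-input guard.
function buildPrefixTsquery(raw) {
  const cleaned = raw
    .replace(/[!&|():<>'"\[\]{}+\-~*?\\%@#$^=;,]+/g, " ") // strip tsquery operators
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
  if (cleaned === "") return null; // nothing searchable left
  // AND the terms together; only the last term gets prefix matching
  return cleaned.split(" ").join(" & ") + ":*";
}

const promoQuery = buildPrefixTsquery("[Limited Time: 15% Off] City Sightseeing");
const emptyQuery = buildPrefixTsquery("  %% [] ");
console.log(promoQuery, emptyQuery);
```

Note that digits survive the cleaning, so the 15 from "15%" stays in the query as its own term.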

Performance Optimization

GIN Index:
- GIN (Generalized Inverted Index) is optimized for full-text search
- PostgreSQL GIN Index

CREATE INDEX product_metadata_search_idx ON product_metadata USING GIN (search_vector);

Ranking and Weighting

PostgreSQL provides several ranking functions:
- ts_rank: Basic text search ranking
- ts_rank_cd : Ranks based on cover density
- PostgreSQL Ranking Functions

We use weights to prioritize matches:

setweight(to_tsvector('english', city_name), 'A') ||   -- Weight: 1.0
setweight(to_tsvector('english', product_name), 'B')   -- Weight: 0.4

Key Learnings

1. Always clean user input before creating a tsquery
2. Use GIN indexes for performance
3. Consider weighting different fields based on importance
4. Handle edge cases explicitly

Sample requests and responses we got

Handling special characters helps you deal with cases where users enter this kind of input

Supporting multi word searches

Edge case of empty input

Default single word cases

PS: This is based on our internal data set of some 30 cities and the activities available in them.

Further Reading
- [PostgreSQL Full Text Search](https://www.postgresql.org/docs/current/textsearch.html)
- [Using Full Text Search in PostgreSQL](https://www.postgresql.org/docs/current/textsearch-intro.html)
- [PostgreSQL Text Search Configuration](https://www.postgresql.org/docs/current/textsearch-configuration.html)
- [GiST and GIN Index Types](https://www.postgresql.org/docs/current/textsearch-indexes.html)

This real-world implementation shows how PostgreSQL’s full-text search capabilities can be effectively used in production systems when properly configured and optimized.

[Note: This is an engineering blog post from Atlys’ engineering team. The code samples are from our actual implementation, showcasing real solutions to real problems we encountered.]

Building a Production-Grade Full-Text Search System with PostgreSQL: Lessons from Atlys was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Animate SVGs: An Introduction to SMIL

Anuj Kapoor — Mon, 25 Nov 2024 07:12:11 GMT

Go to France, go to Germany, go to Italy... but remember one thing: never forget your own soil (maati)!

Have you noticed the new SVG animation in the Atlys visa application flow? It’s not just a pretty graphic — it’s designed to enhance the user experience by turning data into a visually engaging animated graph.

Curious about how it works? In this blog, I’ll take you behind the scenes and break down the magic behind the center pentagon’s growth animation. Let’s dive in and bring your ideas to life with SVG animations!

Step 1: Creating the SVG

The foundation of this animation lies in SVG (Scalable Vector Graphics). Here’s how we start:

Define the viewBox to make scaling and positioning easier.
Use SVG elements like polygon and line to create the pentagon and its background.
Group elements using the <g> tag to share properties such as fill, strokeWidth, and strokeDasharray.

{/* Background pentagons and lines*/}

Here’s how the SVG will look after this step:

Step 2: Adding the Animated Polygon

Next, we layer the animated pentagon on top of the static background. This polygon will dynamically change its shape based on interactions.

Use a distinct fill and stroke style for visibility.
Place this polygon outside the <g> group for cleaner code and easier animation control.

 
    {/* Background for our svg animation */}
    
      
      
      
      
      
      
      
      
    

    {/* The above polygon that we will animate */}

Here’s what it looks like with the polygon added:

Step 3: Logic for Updating Pentagon Coordinates

To make the animation dynamic, we calculate the new pentagon coordinates using custom logic.

Define three sets of polygon coordinates: largestPolygon, mediumPolygon, and smallerPolygon.
Use a function to generate random points between the smaller and medium polygons.

// ./utils
function generateRandomPoints(medium, smaller) {
  return medium.map(([x1, y1], index) => {
    const [x2, y2] = smaller[index];
    const t = Math.random();
    const x = Math.round(x1 + t * (x2 - x1));
    const y = Math.round(y1 + t * (y2 - y1));
    return [x, y];
  });
}

const coordinatesToPoint = (coordinates) =>
  coordinates.map(([x, y]) => `${x},${y}`).join(" ");

export { generateRandomPoints, coordinatesToPoint };

// ./contants.js
const largestPolygon = [
  [50, 0],
  [2, 35],
  [21, 90],
  [79, 90],
  [98, 35],
];
const mediumPolygon = [
  [50, 12],
  [14, 38],
  [28, 81],
  [72, 81],
  [86, 38],
];
const smallerPolygon = [
  [50, 24],
  [25, 42],
  [35, 71],
  [65, 71],
  [75, 42],
];

export { largestPolygon, mediumPolygon, smallerPolygon };

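Because each vertex gets its own random interpolation factor, every update produces a slightly irregular pentagon that always stays between the smaller and medium reference shapes.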
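To see why the generated shape always stays between the two reference pentagons, here is a minimal TypeScript sketch of the same interpolation with the random source passed in as a parameter. The `interpolatePoints` name and the `rng` parameter are illustrative assumptions and are not part of the original code.

```typescript
type Point = [number, number];

// Same linear interpolation as generateRandomPoints, but the random source
// is injected (hypothetical `rng` parameter) so the result can be reproduced.
function interpolatePoints(
  outer: Point[],
  inner: Point[],
  rng: () => number = Math.random,
): Point[] {
  return outer.map(([x1, y1], index): Point => {
    const [x2, y2] = inner[index];
    const t = rng(); // 0 keeps the outer vertex, 1 moves it onto the inner one
    return [Math.round(x1 + t * (x2 - x1)), Math.round(y1 + t * (y2 - y1))];
  });
}

const medium: Point[] = [[50, 12], [14, 38], [28, 81], [72, 81], [86, 38]];
const smaller: Point[] = [[50, 24], [25, 42], [35, 71], [65, 71], [75, 42]];

// With t fixed at 0.5, every vertex lands halfway between the two pentagons.
const halfway = interpolatePoints(medium, smaller, () => 0.5);

// Same formatting as coordinatesToPoint: the string used for the points attribute.
const asPointsAttribute = halfway.map(([x, y]) => `${x},${y}`).join(" ");
```

With a fixed factor the output is deterministic, which makes the helper easy to unit test. With the default `Math.random`, each vertex moves independently, which is what gives the pentagon its irregular look.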

Step 4: Adding Interactivity

Now, let’s add a button to make the animation interactive! Clicking this button will trigger a change in the pentagon’s shape by updating its coordinates.

export default function App() {
  const [coordinates, setCoordinates] = useState(smallerPolygon);

  function handleUpdatePolygon() {
    const randomPolygonPoints = generateRandomPoints(
      mediumPolygon,
      smallerPolygon
    );
    setCoordinates(randomPolygonPoints);
  }

  return (
    
      
        
          {/* Background for our svg animation */}
          
            
            
            
            
            
            
            
            
          

          {/* The above polygon that we will animate */}
          <polygon
            points={coordinatesToPoint(coordinates)}
            fill="yellow"
            fillOpacity="0.2"
            stroke="red"
            strokeWidth="0.4"
            strokeDasharray="2"
          />
        
      

      
    

  );
}

Here’s what it looks like when interacting with the button:

Step 5: Animating the Pentagon

Finally, we bring everything to life with SVG animations using the <animate> tag. This allows the pentagon to transition smoothly between its current and new states.

Here’s how the animation looks with smooth transitions:

Key animation attributes include:

attributeName: Specifies the attribute to animate (e.g., points).
from and to: Define the animation's starting and ending coordinates.
dur: Specifies the duration of the animation.
begin: Triggers the animation (e.g., on button click).

import { useRef, useState } from "react";
import { largestPolygon, mediumPolygon, smallerPolygon } from "./contants";
import { generateRandomPoints, coordinatesToPoint } from "./utils";
import "./styles.css";


export default function App() {
  const [coordinates, setCoordinates] = useState(smallerPolygon);
  const prevCoordinates = useRef(smallerPolygon);
  const animateRef = useRef(null);

  function handleUpdatePolygon() {
    const randomPolygonPoints = generateRandomPoints(
      mediumPolygon,
      smallerPolygon
    );
    // store the previous coordinates in a ref (used as the animation's "from" value)
    prevCoordinates.current = coordinates;
    // store the new coordinates (used as the animation's "to" value)
    setCoordinates(randomPolygonPoints);
    // trigger the animation
    animateRef.current.beginElement();
  }

  return (
    
      
        
          {/* Background for our svg animation */}
          
            
            
            
            
            
            
            
            
          

          {/* The above polygon that we will animate */}
          <polygon
            points={coordinatesToPoint(coordinates)}
            fill="yellow"
            fillOpacity="0.2"
            stroke="red"
            strokeWidth="0.4"
            strokeDasharray="2"
          >
            <animate
              attributeName="points"
              from={coordinatesToPoint(prevCoordinates.current)}
              to={coordinatesToPoint(coordinates)}
              dur="0.3s"
              begin="indefinite"
              ref={animateRef}
            />
          </polygon>
        
      

      
    

  );
}

What Else Can We Do to Make It Even Smoother?

You guessed it — we can tweak the animation for a buttery-smooth transition by adding Bézier curves!

By using attributes like calcMode, keyTimes, and keySplines, we can fine-tune the motion. Here’s an example:

<animate
   /* previous attributes */
   calcMode="spline"
   keyTimes="0; 1"
   keySplines="0.25 0.1 0.25 1"
/>
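To get a feel for what a keySplines entry does, the sketch below evaluates the cubic Bézier easing curve in TypeScript. Each entry "x1 y1 x2 y2" gives the two control points of a curve from (0, 0) to (1, 1), with one entry per interval between keyTimes. The function names are illustrative, and the bisection solver is just one simple way to invert the curve, not how browsers necessarily implement it.

```typescript
// Evaluate one coordinate of a cubic Bézier curve whose end points are 0 and 1.
function bezierCoordinate(t: number, p1: number, p2: number): number {
  const u = 1 - t;
  return 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t;
}

// Map linear progress (0..1) to eased progress for the control points
// (x1, y1) and (x2, y2), i.e. one keySplines entry "x1 y1 x2 y2".
function easedProgress(
  progress: number,
  x1: number,
  y1: number,
  x2: number,
  y2: number,
): number {
  // x(t) is monotonic when x1 and x2 lie within [0, 1], so bisection finds
  // the curve parameter t whose x coordinate equals the linear progress.
  let low = 0;
  let high = 1;
  for (let i = 0; i < 40; i++) {
    const mid = (low + high) / 2;
    if (bezierCoordinate(mid, x1, x2) < progress) {
      low = mid;
    } else {
      high = mid;
    }
  }
  return bezierCoordinate((low + high) / 2, y1, y2);
}

// Eased progress halfway through the duration for keySplines="0.25 0.1 0.25 1".
const halfwayThrough = easedProgress(0.5, 0.25, 0.1, 0.25, 1);
```

For this spline, `halfwayThrough` comes out at roughly 0.8: the pentagon covers most of the distance in the first half of the animation and then settles gently, which is where the smoother feel comes from.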

Magical, right? With just a little more effort, you can customize the feel of your animation to match your product perfectly.

Step 6: Check Out the Full Working Code

Want to see the entire implementation in action? Check out the live, working code on CodeSandbox:

👉 View the CodeSandbox Here

Feel free to explore the live project, interact with it, and customize the animation to suit your needs. It’s a fun and creative way to bring your ideas to life!

The Result: A Dynamic Approval Meter

With these steps, you’ve created a dynamic and visually appealing pentagon animation. Each click updates the pentagon’s shape, enhancing user engagement with smooth, interactive visuals.

Pro Tip: Customize Your Animation

Want to take it a step further?

Experiment with gradient fills or multi-colored strokes.
Adjust animation durations and easing curves to match your product’s style.
Play with opacity and stroke effects for added flair.

SVG animations are incredibly versatile, so don’t hesitate to make them your own!

Your friendly engineer,
Anuj Kapoor

Passionate about solving tricky customer problems👨‍💻? Join us: https://careers.atlys.com/

How to Animate SVGs: An Introduction to SMIL was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Making visa documents collection a 10x experience with our AI Chatbot: Nanite

Vaibhav singhal — Wed, 20 Nov 2024 08:00:37 GMT

At Atlys, we continually strive to enhance the user experience. One of the most time-consuming parts of a visa application is identifying the right documents for each customer's circumstances, such as profession, marital status, travel history, and sponsorship. Previously, our team would manually reach out to customers, ask about these details, and determine the necessary documents. Customers would then share these documents in the app or by email.

However, this process had a few challenges:

1. Process was vulnerable to operational failures

There were knowledge gaps on the side of the representative handling the case
The embassy and government requirements keep changing
Sometimes there is a lack of response from the customer on call / emails

2. Process was non-uniform — A user’s experience would largely be governed by the representative assigned to the case

3. Process was slow

There was constant back and forth communication from initial assessment to gathering docs one by one to submitting docs.
Often this process would take several days to complete

4. Security Concerns

Users felt uncomfortable sharing documents via email or WhatsApp because of security concerns.
Bank statements and ITRs are highly sensitive, and customers were often concerned about the privacy of these documents.

5. Ease of uploading

Nanite lets users upload documents quickly: from the gallery, via an email link, or by taking a photo of the document on the spot.
This greatly reduces the number of steps a user needs to get documents to Atlys.

To convert this into a 10x experience, we developed an intelligent chatbot called Nanite designed to inquire, collect and process the customer information to determine the necessary visa documents. The chatbot is powered by a decision tree and is capable of handling additional queries through integration with large language models (LLMs). Here’s a breakdown of how it all works.

The Role of the Decision Tree

To ensure the accuracy of the overall process, we use a decision tree as the core of the chatbot; higher accuracy helps increase the visa approval rate. The decision tree provides a logic-based system that determines which question to ask next based on the user's responses and calculates which documents the user needs to share. It also lets us model complicated conditions in the chatbot flow, for example: salaried people based in Delhi who don't have a sponsor need to share their bank statement. To make the process more convenient, the chatbot reuses data that customers provided in the past and stores new data for future use.

The logic varies by country and by the type of visa it offers. For instance, if the user selects a Singapore tourist visa, the chatbot asks questions specific to that visa. The decision tree dictates which question to ask next based on previous responses, such as:

Sponsor-related data: Is someone sponsoring the travelers? If so, the sponsor's details
Profession: Are you a student, employed, or self-employed?
Marital Status: Are you married, single, or divorced?
Travel History: Have you traveled internationally before?
and more

Based on the answers, the decision tree continues to branch out. The aim is to narrow down the exact set of documents needed, such as financial statements, invitation letters, or proof of employment.
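As a rough illustration of the idea, a decision tree like this can be modeled as nested question nodes with document lists at the leaves. The sketch below is in TypeScript. Every node, key, question, and document rule in it is hypothetical and only mirrors the kinds of questions listed above; it is not Nanite's actual tree.

```typescript
type Answers = { [key: string]: string | undefined };

// A node either asks a question and branches on the answer, or is a leaf
// listing the documents to collect. All names and rules here are hypothetical.
type TreeNode =
  | { kind: "question"; key: string; prompt: string; branches: { [answer: string]: TreeNode | undefined } }
  | { kind: "documents"; documents: string[] };

type Step =
  | { done: false; key: string; prompt: string }
  | { done: true; documents: string[] };

const tree: TreeNode = {
  kind: "question",
  key: "hasSponsor",
  prompt: "Is someone sponsoring your trip?",
  branches: {
    yes: { kind: "documents", documents: ["Sponsor letter"] },
    no: {
      kind: "question",
      key: "profession",
      prompt: "Are you a student, employed, or self-employed?",
      branches: {
        employed: { kind: "documents", documents: ["Bank statement", "Proof of employment"] },
        student: { kind: "documents", documents: ["Financial statements"] },
      },
    },
  },
};

// Walk the tree with the answers collected so far (including answers saved
// from earlier applications) and return the next question or the final result.
function nextStep(node: TreeNode, answers: Answers): Step {
  let current: TreeNode = node;
  while (current.kind === "question") {
    const answer = answers[current.key];
    const next: TreeNode | undefined =
      answer === undefined ? undefined : current.branches[answer];
    if (next === undefined) {
      // No usable answer yet: this is the question to ask next.
      return { done: false, key: current.key, prompt: current.prompt };
    }
    current = next;
  }
  return { done: true, documents: current.documents };
}
```

Because `nextStep` only looks at the answers map, previously stored answers skip their questions automatically: calling it with `{ hasSponsor: "no" }` goes straight to the profession question.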

Handling Customer Queries Using LLMs

While the decision tree efficiently gathers the necessary data, users often have additional questions, especially when dealing with visa-related terminology or document retrieval. Questions like:

“How do I obtain a bank statement?”
“What is a sponsor letter?”
“I don’t have a GST certificate. Can I submit any other document for business proof?”

This is where large language models (LLMs) come into play. The chatbot is integrated with LLMs, which can understand natural language and respond intelligently to customer queries. These models are trained on vast amounts of data, making them capable of providing answers even for slightly ambiguous or complex questions.

For example, if a customer asks what a “No Objection Certificate” means, the chatbot accesses the LLM to provide a concise explanation, thus eliminating the need for customers to seek external help. This seamless interaction ensures that users not only receive instructions but also understand the visa process better, enhancing their experience with Atlys.

Handling Unknown Scenarios with Support Tickets

Despite the robustness of both the decision tree and LLM integration, there are always scenarios where the chatbot might not have the answer. For instance, a customer might ask about a highly specific or unusual situation that the LLM cannot handle effectively, or they might have a technical issue while uploading documents.

In such cases, the chatbot is designed to gracefully acknowledge its limitations and ask the customers if they want to talk to someone else from the team. If they do, it creates a support ticket which is routed to our operational team, who can then follow up directly with the customer to resolve the issue. This system ensures that no query is left unanswered and customers feel supported throughout their application process.

Stay tuned for part 2 of this article to know about more features we have implemented.

Passionate about solving tricky customer problems👨‍💻? Join us: https://careers.atlys.com/

Making visa documents collection a 10x experience with our AI Chatbot: Nanite was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Taming Database Connections: Our Journey to Better Connection Pool Management

Hardik Gupta — Tue, 19 Nov 2024 07:47:00 GMT

At Atlys, we recently tackled an interesting challenge in our insurance marketplace service where we were seeing an unusually high number of idle database connections. Here’s how we diagnosed the issue, fixed it, and what we learned along the way.

The Problem: Connection Pool Saturation

During a routine infrastructure review, we noticed our PostgreSQL connection pool was frequently reaching its limits, with many connections sitting idle. Upon investigation, we traced this back to our get_agency_insurance_details endpoint, which retrieves insurance policies for travel agencies.

The original implementation was opening a new database session for each policy belonging to an agent:

# Simplified version of the problematic code
async def get_agency_insurance_details(self, b2b_agent_uid: str):
    policies = []
    for policy_id in policy_ids:
        async with AsyncPostgresSessionManager().get_session() as session:
            policy = await self._get_policy_details(session, policy_id)
            policies.append(policy)
    return policies

This approach was creating multiple database connections for what could be accomplished with a single connection, leading to connection pool saturation and potential performance bottlenecks.

The Solution: Consolidated Database Access

We refactored the code to use a single database session for fetching all required data:

async def get_agency_insurance_details(self, b2b_agent_uid: str, limit: int = 10, offset: int = 0):
    async with AsyncPostgresSessionManager().get_session() as session:
        policies, total_count = await self._policy_repo.get_policies_by_agency(
            session, b2b_agent_uid, limit, offset
        )
        if not policies:
            return PaginatedPolicyResponse(
                policies=[],
                total=0,
                limit=limit,
                offset=offset
            )
        processed_policies = await asyncio.gather(
            *[self._process_policy(policy) for policy in policies]
        )
        return PaginatedPolicyResponse(
            policies=processed_policies,
            total=total_count,
            limit=limit,
            offset=offset
        )

Tackling the N+1 Query Problem

While fixing the connection pool issue, we also addressed another common database performance pitfall: the N+1 query problem. This occurs when one query fetches a list of N records and the code then issues an additional query per record to load its related data, for N+1 queries in total.

In our case, for each insurance policy, we needed:
- The insurance type details
- Associated travelers
- Payment information

Without proper optimization, this would result in multiple queries per policy. Here’s how we solved it using SQLAlchemy’s `selectinload`:

async def get_policies_by_agency(self, db_session: AsyncSession, b2b_agent_uid: str, limit: int = 10, offset: int = 0):
    query = (
        select(InsurancePolicy)
        .distinct()
        .join(InsuranceType)
        .join(InsuranceTraveler)
        .join(TravelerDetail)
        .filter(TravelerDetail.b2b_agent_uid == b2b_agent_uid)
        .options(
            selectinload(InsurancePolicy.insurance_type),
            selectinload(InsurancePolicy.insurance_travelers).selectinload(InsuranceTraveler.traveler_details),
            selectinload(InsurancePolicy.payments),
        )
        .order_by(InsurancePolicy.created_at.desc())
        .limit(limit)
        .offset(offset)
    )

The selectinload strategy eagerly loads related objects using a separate SELECT statement, which is then matched with the parent objects in Python. This approach:
- Reduces the number of database queries
- Maintains clean separation of concerns in the SQL statements
- Efficiently handles one-to-many relationships

Impact and Learnings

After deploying these changes, we saw significant improvements:
- ~95% reduction in idle database connections
- Improved response times for policy retrieval endpoints
- More stable connection pool utilization

Key Takeaways

1. Connection Management: Always analyze whether multiple database sessions can be consolidated into a single session.
2. Eager Loading: Use appropriate eager loading strategies (`selectinload`, `joinedload`, etc.) based on your data access patterns.
3. Monitor and Measure: Regular monitoring of database metrics helped us identify and fix these issues before they became critical problems.

Looking Forward

This optimization work has led us to establish new best practices for database access patterns in our codebase:
- Prefer single-session operations where possible
- Use appropriate eager loading strategies by default
- Regular review of database connection patterns in code reviews

Remember, while connection pooling is a powerful feature, it’s important to use it judiciously and always be mindful of how your application interacts with the database layer.

If you’re excited by projects like this, consider joining our team! We’re hiring!

Taming Database Connections: Our Journey to Better Connection Pool Management was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How We Built a Real-Time Feedback-Assisted Auto Face Capture in React

Gaurav Sharma — Mon, 11 Nov 2024 04:53:04 GMT

Face landmarks illustration

Capturing a valid photo that meets certain criteria can be tricky, especially when users need to ensure their faces are aligned correctly, lighting is appropriate, and no obstructions are present. Recently, I had the opportunity to work on an exciting auto face capture feature that assists users in capturing photos by guiding them in real-time.

This feature automatically captures a photo once all conditions are met, eliminating the need for manual intervention. It combines the MediaPipe Face Landmarker machine learning model with a secondary model that detects facial attributes the Face Landmarker cannot, both integrated into a React-based UI. In this post, we'll mainly look at how the MediaPipe Face Landmarker can process frames from a video stream and provide near-real-time results.

Final Output

https://medium.com/media/4235cbbd645097582d0859687d9f6bec/href

Overview

The auto face capture is designed to provide real-time feedback while the user is in front of the camera, ensuring their photo meets all criteria before it’s captured. Here’s a high-level overview of what this feature does:

Face Detection: Detects facial landmarks (eyes, nose, mouth, etc.) and facial attributes.
Validation: Checks for conditions such as proper lighting, face alignment, distance from the camera, and whether the face is covered.
Auto Capture: Once the face meets all the required conditions, the system automatically captures the frame after a short countdown.

Tools Used

MediaPipe Face Landmarker ML Model: This model identifies facial landmarks, providing the x, y, z coordinates of key points on the face. Ref — https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker
React: For rendering UI

Step-by-Step Implementation with React

Let’s break down how this is implemented

1. Creating Face Landmarker instance

First, install the @mediapipe/tasks-vision package from Google, which detects face landmarks. Then initialize the Face Landmarker instance, which also downloads the model binary, face_landmarker.task.

import { FaceLandmarker, FilesetResolver } from '@mediapipe/tasks-vision';

export const createFaceLandmarker = async () => {
  const filesetResolver = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm',
  );

  const faceLandmarker = await FaceLandmarker.createFromOptions(
    filesetResolver,
    {
      baseOptions: {
        modelAssetPath: `https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task`,
        delegate: 'CPU' // or 'GPU', check if GPU is available and set accordingly
      },
      outputFaceBlendshapes: true,
      runningMode: 'IMAGE',
      numFaces: 50,
    },
  );

  return faceLandmarker;
};

We will run this model in IMAGE mode.

2. Setting up the Video Stream

Next, we access the device camera and stream the video feed into an HTML video element. We do this with the browser's navigator.mediaDevices.getUserMedia API and React's useRef to manage the video element.

const videoRef = useRef(null);

useEffect(() => {
  if (navigator.mediaDevices.getUserMedia) {
    navigator.mediaDevices.getUserMedia({ video: true })
      .then((stream) => {
        if (videoRef.current) {
          videoRef.current.srcObject = stream;
        }
      })
      .catch((error) => {
        console.error("Camera access error: ", error);
      });
  }
}, []);

videoRef: A reference to the video element where the video stream is displayed. This is essential for accessing and controlling the video feed within the React component.
useEffect: This hook ensures that the camera access is requested and the stream is applied to the video element as soon as the component mounts.

3. Processing Video Frames Using Canvas and ML Models

Once the video stream is active, we need to process each frame to run it through the machine learning models. We use an HTML <canvas> element to capture each frame from the video.

const canvasRef = useRef(null);
const isModelRunningRef = useRef(false);
const [captureStatus, setCaptureStatus] = useState('');

const validateFrame = (
  faceLandmarkerResult: FaceLandmarkerResult | undefined,
  canvas: HTMLCanvasElement,
  modelResult?: unknown, // result of the secondary model, for the additional checks
) => {
  const { isTooBright, isTooDark } = isTooDarkOrTooBright(canvas);

  if (isTooDark) {
    return 'TOO_DARK';
  }

  if (isTooBright) {
    return 'TOO_BRIGHT';
  }

  if (isMultipleFaces(faceLandmarkerResult)) {
    return 'MULTIPLE_FACE';
  }

  //...all other checks can be added here.

  return 'GOOD_PHOTO';
};

const runModel = (canvas, faceLandmarker) => {
    // Only run the models on a new frame once the previous frame has been processed.
    // Note that this means some frames are skipped.
    if (isModelRunningRef.current) return;

    isModelRunningRef.current = true;

    // Process the frame using Face Landmarker
    const faceLandmarks = faceLandmarker.detect(canvas);

    // Process the frame using the internal ML model
    const modelResult = runTFLiteModel(canvas);

    // Validate the frame
    const status = validateFrame(faceLandmarks, canvas, modelResult);

    setCaptureStatus(status); // use this state to show feedback in the UI

    if (status === 'GOOD_PHOTO') {
        // All checks passed, start the capture countdown
        startCapture();
    } else {
        stopCapture();
    }

    isModelRunningRef.current = false;
};

const processFrame = (faceLandmarker) => {
  const canvas = canvasRef.current;
  const video = videoRef.current;

  if (canvas && video) {
    const context = canvas.getContext('2d')!;
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;

    // Draw the current video frame onto the canvas
    context.drawImage(video, 0, 0, canvas.width, canvas.height);
    
    runModel(canvas, faceLandmarker);

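    // Schedule the next frame. The call is wrapped so faceLandmarker is passed along;
    // requestAnimationFrame would otherwise pass a timestamp as the first argument.
    window.requestAnimationFrame(() => processFrame(faceLandmarker));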
  }
};

navigator.mediaDevices
    .getUserMedia(constraints)
    .then((stream) => {
        streamRef.current = stream;

        if (videoRef.current == null) return;

        videoRef.current.srcObject = stream;
        videoRef.current.play();
        
        // Start processing frames on `loadeddata` event on video element.
        videoRef.current.addEventListener('loadeddata', () =>
          processFrame(faceLandmarker),
        );
    })

canvasRef: A reference to the canvas element onto which each video frame is drawn.
processFrame: This function schedules itself with window.requestAnimationFrame, so it runs once before each browser repaint. One advantage is that most browsers pause these callbacks while the tab is inactive, so no frames are processed in the background.
startCapture(): Starts the countdown and handles whatever else is needed once the countdown begins.

4. Validating Each Frame

To ensure the captured photo meets all the necessary criteria, we validate each frame by running various checks. Here are the utility functions used for validation:

1. Lighting Validation (isTooDarkOrTooBright)

This function checks whether the frame is too dark or too bright, based on the average of the RGB values of all pixels.

const TOO_DARK_THRESHOLD = 60;
const TOO_BRIGHT_THRESHOLD = 200;

// This function will convert each color to gray scale and return average of all pixels, so final value will be between 0 (darkest) and 255 (brightest)
const getFrameBrightness = (canvas: HTMLCanvasElement) => {
  const ctx = canvas.getContext('2d');

  if (!ctx) return;

  let colorSum = 0;

  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const data = imageData.data;
  let r, g, b, avg;

  for (let x = 0, len = data.length; x < len; x += 4) {
    r = data[x];
    g = data[x + 1];
    b = data[x + 2];

    avg = Math.floor((r + g + b) / 3);
    colorSum += avg;
  }

  // value between 0 - 255
  const brightness = Math.floor(colorSum / (canvas.width * canvas.height));

  return brightness;
};

const isTooDarkOrTooBright = (canvas: HTMLCanvasElement) => {
  const brightness = getFrameBrightness(canvas);

  let isTooDark = false;
  let isTooBright = false;

  if (brightness == null) {
    return {
      isTooBright,
      isTooDark,
    };
  }

  if (brightness < TOO_DARK_THRESHOLD) {
    isTooDark = true;
  } else if (brightness > TOO_BRIGHT_THRESHOLD) {
    isTooBright = true;
  }

  return {
    isTooBright,
    isTooDark,
  };
};
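As a quick worked example of the brightness calculation, the sketch below applies the same averaging to a raw RGBA array instead of a canvas, so it can be checked outside the browser. The `averageBrightness` name is a hypothetical helper introduced for illustration; the pixel layout is the one returned by getImageData.

```typescript
// Average grayscale brightness (0 = darkest, 255 = brightest) of RGBA pixel
// data laid out as [r, g, b, a, r, g, b, a, ...], the layout of
// CanvasRenderingContext2D.getImageData(...).data.
function averageBrightness(data: ArrayLike<number>): number {
  const pixelCount = Math.floor(data.length / 4);
  if (pixelCount === 0) return 0;

  let sum = 0;
  for (let i = 0; i < pixelCount * 4; i += 4) {
    // Average the red, green and blue channels; the alpha channel is ignored.
    sum += Math.floor((data[i] + data[i + 1] + data[i + 2]) / 3);
  }
  return Math.floor(sum / pixelCount);
}

// Two pixels: one pure black and one pure white.
const sample = new Uint8ClampedArray([0, 0, 0, 255, 255, 255, 255, 255]);
const sampleBrightness = averageBrightness(sample);

// A single dark gray pixel.
const darkBrightness = averageBrightness(new Uint8ClampedArray([10, 10, 10, 255]));
```

The two-pixel sample averages to 127, which falls between the thresholds of 60 and 200 and would pass the lighting check. The dark gray pixel averages to 10 and would be flagged as too dark.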

2. Checking for Multiple Faces (isMultipleFaces)

The result returned by Face Landmarker can be passed to this utility, which returns true if it contains landmarks for more than one face.

export const isMultipleFaces = (
  faceLandmarkerResult,
) => {
  if (faceLandmarkerResult && faceLandmarkerResult.faceLandmarks.length > 1) {
    return true;
  }

  return false;
};

3. Face Cutoff Detection (isFaceCutOffScreen)

This function checks whether any of the face landmarks fall outside the boundaries of the image (canvas). Since the x and y coordinates in the Face Landmarker result are normalized, we convert them to pixel coordinates by multiplying by the frame width and height respectively.

import { NormalizedLandmark } from '@mediapipe/tasks-vision';

function isFaceCutOffScreen(
  faceLandmarks: NormalizedLandmark[],
  imgW: number,
  imgH: number,
): boolean {
  for (const landmark of faceLandmarks) {
    const x = Math.round(landmark.x * imgW);
    const y = Math.round(landmark.y * imgH);

    if (x <= 0 || x >= imgW || y <= 0 || y >= imgH) {
      return true;
    }
  }
  return false;
}

4. Face Distance Detection (isFaceTooClose, isFaceTooFar)

These functions determine whether the face is too close to or too far from the camera by measuring the pixel distance between the eyes.

import { NormalizedLandmark } from '@mediapipe/tasks-vision';

// Calculate Euclidean distance between two points
const getDistance = (point1: number[], point2: number[]): number => {
  const [x1, y1] = point1;
  const [x2, y2] = point2;
  return Math.sqrt(Math.pow(x2 - x1, 2) + Math.pow(y2 - y1, 2));
};

const FACE_TOO_CLOSE_THRESHOLD = 370;
const FACE_TOO_FAR_THRESHOLD = 300;

function isFaceTooFar(
  landmark: NormalizedLandmark[],
  imgW: number,
  imgH: number,
  threshold: number = FACE_TOO_FAR_THRESHOLD,
): boolean {
  const leftEye = [landmark[33].x * imgW, landmark[33].y * imgH];
  const rightEye = [landmark[263].x * imgW, landmark[263].y * imgH];

  // Calculate the distance between the eyes
  const eyeDistance = getDistance(leftEye, rightEye);
  return eyeDistance < threshold;
}

function isFaceTooClose(
  landmark: NormalizedLandmark[],
  imgW: number,
  imgH: number,
  threshold: number = FACE_TOO_CLOSE_THRESHOLD,
): boolean {
  const leftEye = [landmark[33].x * imgW, landmark[33].y * imgH];
  const rightEye = [landmark[263].x * imgW, landmark[263].y * imgH];

  // Calculate the distance between the eyes
  const eyeDistance = getDistance(leftEye, rightEye);
  return eyeDistance > threshold;
}
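To make the thresholds concrete, here is a small worked example in TypeScript. It assumes a 1280x720 frame and made-up eye positions, and the `pixelDistance` and `classifyDistance` helpers and their status strings are hypothetical names used only for this illustration.

```typescript
type Landmark = { x: number; y: number };

const FACE_TOO_CLOSE_THRESHOLD = 370;
const FACE_TOO_FAR_THRESHOLD = 300;

// Pixel distance between two normalized landmarks for a given frame size.
function pixelDistance(a: Landmark, b: Landmark, imgW: number, imgH: number): number {
  return Math.hypot((b.x - a.x) * imgW, (b.y - a.y) * imgH);
}

// Apply both thresholds to an eye distance measured in pixels.
function classifyDistance(eyeDistance: number): "TOO_FAR" | "TOO_CLOSE" | "OK" {
  if (eyeDistance < FACE_TOO_FAR_THRESHOLD) return "TOO_FAR";
  if (eyeDistance > FACE_TOO_CLOSE_THRESHOLD) return "TOO_CLOSE";
  return "OK";
}

// Hypothetical eye positions: a quarter of the frame width apart, same height.
const leftEye: Landmark = { x: 0.375, y: 0.5 };
const rightEye: Landmark = { x: 0.625, y: 0.5 };

// On an assumed 1280x720 frame: 0.25 * 1280 = 320 px.
const distanceHd = pixelDistance(leftEye, rightEye, 1280, 720);
const statusHd = classifyDistance(distanceHd);

// The same face on a 640x360 frame: 0.25 * 640 = 160 px.
const distanceSmall = pixelDistance(leftEye, rightEye, 640, 360);
const statusSmall = classifyDistance(distanceSmall);
```

The same normalized landmarks are accepted at one resolution and rejected as too far at another, so pixel thresholds like these are tied to the camera resolution they were tuned for.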

5. Is the Face Centered?

These functions check whether the face is positioned too far to the left, right, top, or bottom of the frame. Each one compares the leftmost, rightmost, topmost, or bottommost landmark against a pixel threshold, which should be adjusted to the frame size.

const FACE_TOO_RIGHT_THRESHOLD = 500;
const FACE_TOO_LEFT_THRESHOLD = 600;
const FACE_TOO_FAR_UP_THRESHOLD = 150;
const FACE_TOO_FAR_DOWN_THRESHOLD = 450;

function isFaceTooFarLeft(
  landmark: NormalizedLandmark[],
  imgWidth: number,
  thresholdRatio: number = FACE_TOO_LEFT_THRESHOLD,
): boolean {
  const leftmostX = Math.min(
    landmark[1].x * imgWidth,
    landmark[263].x * imgWidth,
  );
  return leftmostX > thresholdRatio;
}

function isFaceTooFarRight(
  landmark: NormalizedLandmark[],
  imgWidth: number,
  thresholdRatio: number = FACE_TOO_RIGHT_THRESHOLD,
): boolean {
  const rightmostX = Math.max(
    landmark[1].x * imgWidth,
    landmark[263].x * imgWidth,
  );
  return rightmostX < thresholdRatio;
}

function isFaceTooFarUp(
  landmark: NormalizedLandmark[],
  imgHeight: number,
  thresholdRatio: number = FACE_TOO_FAR_UP_THRESHOLD,
): boolean {
  const topmostY = landmark[10].y * imgHeight;
  return topmostY < thresholdRatio;
}

function isFaceTooFarDown(
  landmark: NormalizedLandmark[],
  imgHeight: number,
  thresholdRatio: number = FACE_TOO_FAR_DOWN_THRESHOLD,
): boolean {
  // Landmark 152 is the chin, the bottommost point of the face (10 is the top of the forehead)
  const bottommostY = landmark[152].y * imgHeight;
  return bottommostY > thresholdRatio;
}

6. Are Eyes Closed?

Fortunately, Face Landmarker also returns face blendshapes: a set of scores for facial attributes such as whether the eyes are closed or the user is looking left or right. We can use two of them, eyeBlinkLeft and eyeBlinkRight, to check whether the eyes are closed.

For more such attributes, please refer to this codepen — https://codepen.io/mediapipe-preview/pen/OJBVQJm

import { FaceLandmarkerResult } from '@mediapipe/tasks-vision';

const isEyesClosed = (faceLandmarkResult: FaceLandmarkerResult) => {
  const result = faceLandmarkResult?.faceBlendshapes?.[0]?.categories
    ?.filter(
      (category: any) =>
        category.categoryName === 'eyeBlinkLeft' ||
        category.categoryName === 'eyeBlinkRight',
    )
    ?.map((category: any) => category.score);

  if (!result) return false;

  return result[0] > 0.5 || result[1] > 0.5;
};

7. Detecting Head Orientation

To check whether the user is looking up, down, left, or right, we can calculate the yaw and pitch angles of the head. One way to get these angles uses the OpenCV library and involves some complex calculations on the landmark points. You can check it out here:

Head Pose Estimation with MediaPipe and OpenCV in Javascript

However, I did not want to add OpenCV as a dependency just for this use case, so I found an alternative that does a decent job. You can read more about it here:

A Simple and efficient Face direction detection in React

Here is how I implemented it:

const getAngleBetweenLines = (
  midpoint: NormalizedLandmark,
  point1: NormalizedLandmark,
  point2: NormalizedLandmark,
) => {
  const vector1 = { x: point1.x - midpoint.x, y: point1.y - midpoint.y };
  const vector2 = { x: point2.x - midpoint.x, y: point2.y - midpoint.y };

  // Calculate the dot product of the two vectors
  const dotProduct = vector1.x * vector2.x + vector1.y * vector2.y;

  // Calculate the magnitudes of the vectors
  const magnitude1 = Math.sqrt(vector1.x * vector1.x + vector1.y * vector1.y);
  const magnitude2 = Math.sqrt(vector2.x * vector2.x + vector2.y * vector2.y);

  // Calculate the cosine of the angle between the two vectors
  const cosineTheta = dotProduct / (magnitude1 * magnitude2);

  // Use the arccosine function to get the angle in radians
  const angleInRadians = Math.acos(cosineTheta);

  // Convert the angle to degrees
  const angleInDegrees = (angleInRadians * 180) / Math.PI;

  return angleInDegrees;
};

const calculateDirection = (
  faceLandmarkerResult: FaceLandmarkerResult,
) => {
  const landmarks = faceLandmarkerResult.faceLandmarks[0];

  // Nose tip (1), left side of nose (279) and right side of nose (49)
  if (!landmarks?.[1] || !landmarks?.[279] || !landmarks?.[49])
    return {
      isLookingDown: false,
      isLookingLeft: false,
      isLookingRight: false,
      isLookingUp: false,
    };

  const noseTip = { ...landmarks[1] };
  const leftNose = { ...landmarks[279] };
  const rightNose = { ...landmarks[49] };

  // Midsection of the nose: the midpoint between its left and right sides
  const midpoint: NormalizedLandmark = {
    x: (leftNose.x + rightNose.x) / 2,
    y: (leftNose.y + rightNose.y) / 2,
    z: (leftNose.z + rightNose.z) / 2,
    visibility: 0,
  };

  const perpendicularUp: NormalizedLandmark = {
    x: midpoint.x,
    y: midpoint.y - 50,
    z: midpoint.z,
    visibility: 0,
  };

  // CALC ANGLES
  const pitch = getAngleBetweenLines(midpoint, noseTip, perpendicularUp);
  const yaw = getAngleBetweenLines(midpoint, rightNose, noseTip);

  const isLookingUp = pitch < PITCH_UP_THRESHOLD;
  const isLookingDown = pitch > PITCH_DOWN_THRESHOLD;
  const isLookingLeft = yaw > YAW_LEFT_THRESHOLD;
  const isLookingRight = yaw < YAW_RIGHT_THRESHOLD;

  return { isLookingDown, isLookingLeft, isLookingRight, isLookingUp };
};
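One thing worth guarding against in the angle calculation is floating-point error: the cosine can land slightly outside the range [-1, 1], in which case Math.acos returns NaN, and two coinciding points give a zero-length vector. Below is a minimal sketch of a guarded variant, assuming plain 2D points. Point2D and angleAtVertex are hypothetical names, not part of MediaPipe.

```typescript
// Plain 2D point; a NormalizedLandmark also has x and y, so it fits this shape.
interface Point2D {
  x: number;
  y: number;
}

// Angle in degrees at `vertex` between the lines vertex->a and vertex->b.
// Returns null when either line has zero length, since the angle is undefined.
const angleAtVertex = (
  vertex: Point2D,
  a: Point2D,
  b: Point2D,
): number | null => {
  const v1 = { x: a.x - vertex.x, y: a.y - vertex.y };
  const v2 = { x: b.x - vertex.x, y: b.y - vertex.y };

  const magnitude1 = Math.hypot(v1.x, v1.y);
  const magnitude2 = Math.hypot(v2.x, v2.y);
  if (magnitude1 === 0 || magnitude2 === 0) return null;

  const cosine = (v1.x * v2.x + v1.y * v2.y) / (magnitude1 * magnitude2);

  // Clamp so that rounding errors never push the value outside acos' domain.
  const clamped = Math.min(1, Math.max(-1, cosine));

  return (Math.acos(clamped) * 180) / Math.PI;
};

// Perpendicular lines give an angle of about 90 degrees.
console.log(angleAtVertex({ x: 0, y: 0 }, { x: 1, y: 0 }, { x: 0, y: 1 }));

// Coinciding points give null instead of NaN.
console.log(angleAtVertex({ x: 0, y: 0 }, { x: 0, y: 0 }, { x: 0, y: 1 }));
```

Returning null lets the caller skip the frame rather than compare NaN against the thresholds, which would silently evaluate to false.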

5. Face Capture and Final Confirmation

Once all validations pass and the frame is deemed valid, a countdown starts, and the frame is captured automatically when it ends. The useCountdown hook can be implemented from scratch or taken from an external package. I used the usehooks-ts package because I did not want to reinvent the wheel, and it handles the nitty-gritty details of the hook’s implementation.

import { useEffect, useRef, useState } from 'react';
import { useCountdown } from 'usehooks-ts';

const isCapturingRef = useRef(false);
// The captured frame is stored as a Blob once the countdown finishes
const [photo, setPhoto] = useState<Blob | null>(null);
const [count, { startCountdown, stopCountdown, resetCountdown }] =
  useCountdown({
    countStart: 3,
    countStop: 1,
    intervalMs: 1000,
  });

const startCapture = () => {
  startCountdown();
};

const stopCapture = () => {
  stopCountdown();
  resetCountdown();
};

const onImageCapture = () => {
  if (canvasRef && canvasRef.current) {
    const context = canvasRef.current.getContext('2d');
    if (context) {
      // Convert the canvas to a blob and store photo in state
      canvasRef.current.toBlob((b) => setPhoto(b), 'image/jpeg', 0.9);
    }
  }
};

useEffect(() => {
  if (count === 1) {
    onImageCapture();
  }
}, [count]);

Finally, we have the captured photo stored in the React state variable photo, which can be consumed as needed. It can be shown to the user for confirmation and then sent to upstream services.

Useful trick

To get an image URL from a blob, you can simply call URL.createObjectURL(photo). It returns a string that can be passed to the src attribute of an img tag.
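Object URLs keep the underlying blob in memory until they are revoked, so it is worth pairing URL.createObjectURL with URL.revokeObjectURL. Here is a small sketch of that pairing. createPhotoPreview is a hypothetical helper name, and the sample blob stands in for the photo state.

```typescript
// Creates a preview URL for a captured photo, plus a function to release it.
const createPhotoPreview = (photo: Blob) => {
  const url = URL.createObjectURL(photo);
  return {
    url,
    // Call this once the preview is no longer displayed.
    release: () => URL.revokeObjectURL(url),
  };
};

// Stand-in blob for illustration; in the component this would be `photo`.
const samplePhoto = new Blob(['fake image bytes'], { type: 'image/jpeg' });

const preview = createPhotoPreview(samplePhoto);
console.log(preview.url.startsWith('blob:')); // true

// preview.url can now be passed to the src attribute of an img tag.
preview.release();
```

In a React component, a natural place to call release is the cleanup function of the useEffect that creates the preview.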

Fine-Tuning Thresholds

While the conditions mentioned above work well out of the box, they are highly customizable. You can adjust the thresholds for brightness, face distance and so on.
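One way to keep the thresholds easy to tune is to collect them in a single object and pass it into the checks. The sketch below does this for the head orientation check, using the same comparisons as calculateDirection. DirectionThresholds, DEFAULT_THRESHOLDS and classifyDirection are hypothetical names, and the numbers are placeholders rather than the values we use, so tune them against real captures.

```typescript
// All angles are in degrees, matching the output of getAngleBetweenLines.
interface DirectionThresholds {
  pitchUp: number;
  pitchDown: number;
  yawLeft: number;
  yawRight: number;
}

// Placeholder values for illustration only.
const DEFAULT_THRESHOLDS: DirectionThresholds = {
  pitchUp: 70,
  pitchDown: 110,
  yawLeft: 60,
  yawRight: 30,
};

// Applies the same comparisons as calculateDirection to precomputed angles.
const classifyDirection = (
  pitch: number,
  yaw: number,
  thresholds: DirectionThresholds = DEFAULT_THRESHOLDS,
) => ({
  isLookingUp: pitch < thresholds.pitchUp,
  isLookingDown: pitch > thresholds.pitchDown,
  isLookingLeft: yaw > thresholds.yawLeft,
  isLookingRight: yaw < thresholds.yawRight,
});

// With the placeholder defaults, these angles count as facing forward.
console.log(classifyDirection(90, 45));

// A stricter pitchUp threshold flags the same pose as looking up.
console.log(classifyDirection(90, 45, { ...DEFAULT_THRESHOLDS, pitchUp: 95 }));
```

Passing the thresholds as a parameter also makes it easy to try different values per device or camera without touching the detection code.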

Performance Optimization

Since the models run continuously, processing one frame after another, they can overwhelm the main thread, freezing the UI and degrading the user experience. To mitigate this, we can run the models asynchronously. For time-consuming operations like face detection, asynchronous execution helps keep the user interface responsive.

So I wrote a wrapper that converts a sync function into an async function, and used it to run Face Landmarker.

function asyncWrapper<T>(syncFunction: () => T): Promise<T> {
  return new Promise((resolve, reject) => {
    // Defer the call to a later task so pending UI work can run first
    setTimeout(() => {
      try {
        const result = syncFunction();
        resolve(result);
      } catch (error) {
        reject(error);
      }
    }, 0);
  });
}

const runModel = async () => {
  //...
  await asyncWrapper(() => faceLandmarker.detect(canvas));
  //....
};
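To see what the wrapper does in practice, here is a small self-contained sketch. Note that setTimeout defers the call to a later task instead of moving it off the main thread, so the code after the call runs before the wrapped function does. runDeferred is a hypothetical name for a wrapper of the same shape, and fakeDetect is a stand-in for faceLandmarker.detect.

```typescript
// Runs a sync function in a later task and exposes its result as a Promise.
function runDeferred<T>(syncFunction: () => T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    setTimeout(() => {
      try {
        resolve(syncFunction());
      } catch (error) {
        reject(error);
      }
    }, 0);
  });
}

// Records the order in which things happen.
const order: string[] = [];

// Stand-in for faceLandmarker.detect; the returned shape is made up.
const fakeDetect = () => {
  order.push('detect');
  return { faces: 1 };
};

const pending = runDeferred(fakeDetect);
order.push('after call');

pending.then((result) => {
  console.log(order); // ['after call', 'detect']
  console.log(result.faces); // 1
});
```

Because the detection is deferred, the browser gets a chance to handle pending work between frames, and any error thrown by the wrapped function surfaces as a rejected Promise that can be caught with try/catch around the await.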

I hope this is helpful to you! If you’re excited by projects like this, consider joining our team! We’re hiring!

How We Built a Real-Time Feedback-Assisted Auto Face Capture in React was originally published in Atlys Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.