Stories by BasicAI on Medium

3D Polygon and Polyline Annotation for LiDAR Perception Model Training

BasicAI — Fri, 22 May 2026 02:55:43 GMT

As autonomous driving and robotic systems mature, 3D LiDAR point clouds and sensor-fusion data have become key data types.

High-quality training data built from these unstructured inputs is the base for better 3D perception models.

Among the many annotation types, 3D polygon and polyline annotation often receive less attention than 3D cuboids (3D bounding boxes). Yet they are vital for scene-level perception, providing precise spatial ground truth for drivable area detection, path planning, and road-structure parsing.

In this post, we’ll explain how 3D polygon and polyline annotation works in point cloud and fusion data. We cover common tasks, real-world use cases, and a practical workflow using the BasicAI Data Annotation Platform.

What is 3D polygon annotation in computer vision?

In 3D perception projects, polygons and polylines describe two kinds of geometry.

A 3D polygon represents an irregular spatial boundary or surface. It’s built from a series of ordered vertices in 3D space. The vertices form a closed shape.

In autonomous driving, a 3D cuboid is used for an object-level box, such as a vehicle, pedestrian, or traffic sign. A 3D polygon is better for continuous areas, such as a lane region, road surface, or parking area.

Most polygon vertices sit near ground height. Together, they form a near-planar region and provide area-level structure.

This structure helps the model learn where a vehicle can safely drive. It also teaches the geometric limits behind behaviors like avoid, yield, and overtake.

What is 3D polyline annotation in computer vision?

A 3D polyline captures linear geometry and topology rather than enclosed area. It’s an ordered sequence of vertices, open at both ends, so it represents a path rather than a region.

With 3D polylines, teams can annotate lane markings, curbs, pipelines, boundary contours, forest-road centerlines, and similar objects in LiDAR point clouds. In autonomous driving, lane markings and lane centerlines are typical 3D polyline targets.

Pure vision-based lane detection struggles at night, in rain or snow, and under glare. LiDAR has lower spatial resolution than cameras, but many mechanical and solid-state LiDAR sensors include an intensity channel. Road marking paint, thermoplastic materials, and glass beads often have distinct reflectance. In intensity views, they can appear as visible line structures.

LiDAR-based 3D polyline annotation, combined with camera data, can reduce the weakness of a single sensor. It gives lane detection and lane geometry models a more stable supervision signal.

What tasks and applications benefit most from 3D polygon and polyline data?

Tasks supported by these annotations fall into three groups: region understanding, line and topology understanding, and cross-modal projection or auxiliary tasks.

The downstream models vary just as widely. They range from classic point cloud semantic segmentation networks and bird’s-eye view (BEV) perception and planning models to end-to-end multi-task networks and graph-based topology learning models.

For drivable area detection, a common approach is to project 3D point clouds into BEV maps or voxel grids. A 2D or 3D convolutional model then predicts occupancy and traversability for each grid cell. The 3D polygon can be rasterized into a binary grid label to supervise the model.
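
The rasterization step can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes the polygon is near-planar so its z values can be ignored, labels a cell by testing its center, and uses hypothetical names (rasterize_bev, x_range, y_range, cell_size) chosen for this example.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def point_in_polygon_xy(x: float, y: float, poly: Sequence[Point3D]) -> bool:
    """Even-odd ray casting using only the x/y coordinates of each vertex."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i][0], poly[i][1]
        x2, y2 = poly[(i + 1) % n][0], poly[(i + 1) % n][1]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize_bev(
    poly: Sequence[Point3D],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell_size: float,
) -> List[List[int]]:
    """Return a binary grid (rows along y, columns along x); 1 marks cells inside the polygon."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    n_cols = int(round((x_max - x_min) / cell_size))
    n_rows = int(round((y_max - y_min) / cell_size))
    grid = []
    for r in range(n_rows):
        cy = y_min + (r + 0.5) * cell_size  # y of the cell center
        row = []
        for c in range(n_cols):
            cx = x_min + (c + 0.5) * cell_size  # x of the cell center
            row.append(1 if point_in_polygon_xy(cx, cy, poly) else 0)
        grid.append(row)
    return grid


# A 2 m x 2 m drivable patch inside a 4 m x 4 m BEV window at 1 m resolution.
drivable = [(1.0, 1.0, 0.0), (3.0, 1.0, 0.0), (3.0, 3.0, 0.0), (1.0, 3.0, 0.0)]
label = rasterize_bev(drivable, (0.0, 4.0), (0.0, 4.0), 1.0)
```

Testing cell centers is the simplest labeling rule. Projects that care about partially covered cells may prefer an area-overlap threshold instead.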

For lane detection and lane topology learning, a 3D polyline can be parameterized as a sequence of control points or a spline. The model regresses the geometric parameters and connectivity, with the polyline annotation acting as ground truth.
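
One common way to turn an annotated polyline into a fixed-size regression target is to resample it at equal arc-length intervals. The sketch below assumes that approach; resample_polyline and n_samples are illustrative names, and spline fitting would be an alternative parameterization.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def resample_polyline(points: Sequence[Point3D], n_samples: int) -> List[Point3D]:
    """Return n_samples points evenly spaced by arc length along a 3D polyline."""
    if n_samples < 2 or len(points) < 2:
        raise ValueError("need at least 2 input points and 2 samples")
    seg_lengths = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    total = sum(seg_lengths)
    if total == 0:
        raise ValueError("polyline has zero length")

    samples = []
    seg = 0        # index of the segment currently being walked
    walked = 0.0   # arc length accumulated before the start of that segment
    for k in range(n_samples):
        target = total * k / (n_samples - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(seg_lengths) - 1 and walked + seg_lengths[seg] < target:
            walked += seg_lengths[seg]
            seg += 1
        length = seg_lengths[seg]
        t = 0.0 if length == 0 else (target - walked) / length
        t = min(max(t, 0.0), 1.0)
        a, b = points[seg], points[seg + 1]
        samples.append(tuple(a[i] + t * (b[i] - a[i]) for i in range(3)))
    return samples


# An L-shaped lane boundary, resampled to 5 control points.
lane = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0)]
control_points = resample_polyline(lane, 5)
```

Because the samples keep the annotated order, the start-to-end direction of the polyline is preserved in the target.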

Autonomous driving and robotics

In passenger cars and logistics robots, 3D polygons mark drivable area boundaries while 3D polylines trace curbs, lane markings, and route guides. Both are core inputs to path planning and lateral control. OpenLane-V1, for example, is a widely used dataset that annotates real-world 3D lane lines at scale.

This data is also useful for mobile robots, such as indoor AGVs, delivery robots, and campus inspection robots. In indoor environments, polygons can mark traversable areas such as corridors, rooms, and open spaces. Polylines can represent virtual lanes or navigation paths. This helps robots navigate repeated routes with higher precision.

Utilities and infrastructure management

Airborne and vehicle-mounted LiDAR can capture large city-scale point clouds. These scans cover roads, buildings, green belts, and public assets.

3D polygons are the natural fit for ground feature boundaries. Building footprints, rooftop outlines, parking lots, vegetation patches, water bodies, and land-use boundaries all qualify.

3D polylines describe linear infrastructure. Underground utilities, water and gas pipes, fiber, power lines, and street lighting cables are all linear assets best represented as polylines.

Precision agriculture and forestry

In precision agriculture, 3D polygons can divide farmland into detailed plots and support slope surveys. In forestry, ground robots use 3D polylines to extract trail edges and tree-row axes, producing reliable trajectories through irregular natural boundaries. This matters for automated forest maintenance, orchard logistics, and other off-road robotic workflows.

How to annotate 3D polygon and 3D polyline (using the BasicAI platform)

This section walks through drivable area polygon labeling and lane line polyline labeling using the BasicAI Data Annotation Platform, an enterprise-grade multimodal data annotation tool. We assume your team has already finished data collection and basic preparation.

Create the dataset and ontology

Open the BasicAI platform. In the left sidebar of the home page, open the “Dataset” tab.

Click “Create” in the upper-right corner. In the pop-up window, choose “LiDAR Fusion” as the data type. Enter a dataset name, such as urban_drivable_and_lane_mapping, then confirm.

Open the new dataset and switch to the “Ontology” tab to configure ontology assets.

Create a class named drivable_area. Set “Tool Type” to “Polygon”. Add attributes if needed, for example surface_type: Asphalt/Concrete/Dirt.

Create another class named lane_line and set “Tool Type” to “Polyline”. Bind attributes as your spec requires, such as line_style: Solid/Dashed/Double.

Annotate drivable area with 3D polygons

Switch back to the “Data” tab, select the LiDAR fusion data you want to annotate, and click “Annotate” to load the point cloud annotation editor.

The UI includes a 3D point cloud canvas and orthographic views (overhead, side, and rear). The left side contains tools. The right side shows ontology labels.

Pick the “3D Polygon” tool from the left toolbar (or press 3).

To cleanly isolate the ground in cluttered urban scans, two settings help.

Turn on smart ground segmentation. Open the control menu in the lower-left corner of the canvas. Under “Ground”, choose “Model”. The built-in ground detection model highlights terrain points in the point cloud view. Since drivable-area polygons should sit on the ground plane, this makes the road boundary easier to judge.
Filter by height range. In the upper-left corner of the point cloud canvas, find “Height Range”. Enter min and max height values. This hides irrelevant high objects, such as vehicles, building tops, streetlights, and tree crowns. The result is a cleaner ground-level view.
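
Conceptually, the height range setting is a pass-band on the z coordinate. The snippet below is only a sketch of that idea, assuming points are (x, y, z) tuples with z pointing up; it does not describe how the platform implements the filter.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def filter_by_height(points: Sequence[Point3D], z_min: float, z_max: float) -> List[Point3D]:
    """Keep only points whose z value lies inside [z_min, z_max]."""
    return [p for p in points if z_min <= p[2] <= z_max]


cloud = [(0.0, 0.0, -0.1), (1.0, 1.0, 0.3), (2.0, 2.0, 2.5)]  # road, curb, tree crown
near_ground = filter_by_height(cloud, -0.5, 0.5)
```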

After the view is ready, click the road boundary to place the first vertex. During drawing, press “I” to enable point snapping. This lets vertices snap to the nearest LiDAR return and reduces manual alignment work.
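
Point snapping amounts to a nearest-neighbor lookup. Here is a brute-force sketch under stated assumptions: the function name and the max_dist cutoff are hypothetical, and a real tool handling millions of points would use a spatial index such as a KD-tree rather than a linear scan.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def snap_to_nearest(click: Point3D, cloud: Sequence[Point3D], max_dist: float = 0.5) -> Point3D:
    """Return the closest cloud point, or the raw click if nothing is within max_dist."""
    if not cloud:
        return click
    best = min(cloud, key=lambda p: math.dist(p, click))
    return best if math.dist(best, click) <= max_dist else click


returns = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 0.0)]
vertex = snap_to_nearest((0.9, 0.1, 0.0), returns)
```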

Click clockwise or counterclockwise along the drivable boundary. When the shape is complete, press “Enter” or “Space”. The platform closes the polygon by linking the last vertex to the first and opens the attribute panel.

In the attribute panel, select the drivable_area ontology class and any required attributes. Confirm the annotation.

To insert a new control point, click directly on a polygon edge.

To refine the shape, drag polygon vertices in orthographic views. Use synchronized 2D camera images as an extra reference.

Annotate lane lines and trajectories with 3D polylines

Road markings reflect strongly in LiDAR returns, so they are often visible in the point cloud’s intensity view. But sparsity, occlusion, and diffuse reflection on rain-soaked surfaces break those lines into fragments. To label them well, you need to cross-check against the projected 2D camera frames.

Pick the “3D Polyline” tool (shortcut 4). We’re going to mark the left-turn lane line in the point cloud.

Left-click on the start of the marking to place the first vertex. Follow the curve, dropping control points as it bends. Click the endpoint and press Enter or Space to finish.

The generated 3D polyline carries a direction by default, from start point to end point. This direction provides useful flow topology for downstream path planning models.

A few useful shortcuts:

Split a polyline. Select any non-endpoint vertex and press Shift + F. The polyline splits into two independent polylines that inherit the original class, attributes, and group information.
Merge polylines. Hold Shift, select two polylines, and click “Merge” in the top-left. You can choose to join the end of one to the start of the other.
Create a centerline. Hold Shift, select two polylines, and press “.” to generate a centerline between them.
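
The geometry behind the split and centerline operations can be sketched as follows. This is an illustration rather than the platform's algorithm: it assumes both boundary polylines run in the same direction, pairs points at equal fractions of arc length, and uses hypothetical function names.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def split_polyline(points: Sequence[Point3D], index: int):
    """Split at an interior vertex; that vertex ends one part and starts the other."""
    if not 0 < index < len(points) - 1:
        raise ValueError("split vertex must not be an endpoint")
    return list(points[: index + 1]), list(points[index:])


def point_at_fraction(points: Sequence[Point3D], frac: float) -> Point3D:
    """Point located at the given fraction (0..1) of the polyline's arc length."""
    lengths = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    target = frac * sum(lengths)
    for i, length in enumerate(lengths):
        if target <= length or i == len(lengths) - 1:
            t = 0.0 if length == 0 else min(target / length, 1.0)
            a, b = points[i], points[i + 1]
            return tuple(a[k] + t * (b[k] - a[k]) for k in range(3))
        target -= length
    raise ValueError("polyline needs at least two points")


def centerline(left: Sequence[Point3D], right: Sequence[Point3D], n_samples: int = 5) -> List[Point3D]:
    """Average matching points on two boundaries that share the same direction."""
    if n_samples < 2:
        raise ValueError("need at least 2 samples")
    out = []
    for j in range(n_samples):
        frac = j / (n_samples - 1)
        p = point_at_fraction(left, frac)
        q = point_at_fraction(right, frac)
        out.append(tuple((p[k] + q[k]) / 2 for k in range(3)))
    return out


left_line = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
right_line = [(0.0, 4.0, 0.0), (10.0, 4.0, 0.0)]
center = centerline(left_line, right_line, n_samples=3)
```

If one of the two polylines was drawn in the opposite direction, it would need to be reversed first, otherwise the averaged points cross the lane.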

Finish the work and export

Once every polygon and polyline is in place, click “Save” in the top-right of the canvas to commit your work, then “Close” to exit the editor.

Back on the “Data” tab, select the annotated point cloud frames and click the cyan “Export” button. In the export dialog, pick your target 3D coordinate format and submit. When the job finishes, you’ll have standardized PCD files and a JSON topology bundle ready to drop into a training pipeline.

Practical tips for 3D polygon and polyline annotation

Control vertex density. Too many vertices slow down annotation and reduce consistency. Too few vertices fail to match curved curbs, tight bends, and complex corners.

Use ground segmentation before drawing polygons. For drivable areas, enable the 3D ground segmentation model first. Then use height range filtering to remove floating noise above the road surface. This speeds up boundary drawing and helps keep polygons on the ground plane.

Check across modalities. LiDAR becomes sparse at longer range, especially beyond 20 meters. Use clear lane details in 2D camera images, then map them back to the 3D point cloud. This improves long-range lane-line accuracy.

Choose tools built for the workload. LiDAR fusion scenes can be huge. A single complex frame may contain millions or tens of millions of points. Use an enterprise-level annotation platform designed for large point clouds and multi-camera fusion. This helps avoid browser lag, freezing, memory issues, crashes, and lost coordinates.

Partner with a specialized BPO. 3D polygon and polyline annotation have strict quality requirements. Production-grade autonomous driving projects may need millions of frames of ground truth. A trained 3D annotation service team, backed by strong QA and AI-assisted tools such as ground segmentation and tracking interpolation, can help deliver large volumes of accurate spatial data.

2D Polygon Annotation for Computer Vision Model Training

BasicAI — Sat, 09 May 2026 10:24:46 GMT

After convolutional neural networks became dominant in the 2010s, deep learning models grew more complex. The limits of the rectangular bounding box became hard to ignore.

Axis-aligned boxes struggle to represent objects with irregular shapes, complex contours, or transparent regions. Researchers looked for annotation methods that could capture finer spatial information. Polygon annotation emerged as a practical answer, striking a workable balance between labeling cost and model accuracy.

Today, polygon annotation is standard across many industries. It’s supported by mature annotation platforms and common data formats. Demand for polygon datasets keeps growing as more teams train segmentation models for real-world scenes.

In this post, we walk through the concept, applications, rules, formats, and practical workflow of 2D polygon annotation, so researchers and practitioners can fold it into their pipelines quickly.

What is 2D polygon annotation in computer vision?

2D polygon annotation is a vector-based data labeling method. A human annotator or an automated system places a set of vertices along the outer contour of an object. These vertices connect to form a closed shape. The polygon separates the pixels that belong to the object from the background and nearby objects.

By enclosing only the relevant visual area, polygon annotation gives the training set a higher signal-to-noise ratio. Models can then learn specific shape features rather than generalized spatial regions.

When a computer vision system processes these vector polygons, they are typically rasterized into binary pixel masks. Pixels inside the polygon take a value of 1 (or a class ID), and pixels outside take 0. This provides the pixel-level supervision needed to train segmentation models.
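
A scanline fill is one simple way to do this conversion. The sketch below assumes pixel centers sit at (col + 0.5, row + 0.5) and that the polygon has no holes; polygon_to_mask is a name chosen for this example, and libraries such as OpenCV or pycocotools provide optimized equivalents.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def polygon_to_mask(vertices: Sequence[Point2D], width: int, height: int, class_id: int = 1) -> List[List[int]]:
    """Rasterize a simple polygon into a height x width mask of 0 / class_id."""
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        y = row + 0.5  # sample each row at its pixel centers
        crossings = []
        for i in range(n):
            x1, y1 = vertices[i]
            x2, y2 = vertices[(i + 1) % n]
            if (y1 > y) != (y2 > y):  # edge straddles this scanline
                crossings.append(x1 + (y - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()
        # Consecutive crossing pairs bound the spans that lie inside the polygon.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            start = max(0, math.ceil(left - 0.5))
            end = min(width, math.ceil(right - 0.5))
            for col in range(start, end):
                mask[row][col] = class_id
    return mask


# A 3 x 2 pixel rectangle on a 5 x 4 image.
rect = [(1.0, 1.0), (4.0, 1.0), (4.0, 3.0), (1.0, 3.0)]
mask = polygon_to_mask(rect, width=5, height=4)
```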

For image segmentation, polygons are the standard representation for instance segmentation. Compared with pixel-level masks, polygons are faster and more flexible to produce. And compared with bounding boxes, they are far more precise and faithful to object geometry.

What CV tasks and applications benefit most from 2D polygon data?

Models built for instance segmentation benefit most from polygon annotation.

Mask R-CNN is a classic example. Polygon annotations are converted into raster masks, which provide pixel-level ground truth for the mask prediction branch.

Modern YOLO variants, including YOLOv8-seg, have added segmentation heads and likewise depend on polygon-to-mask conversion. U-Net and its derivatives, widely used for medical image segmentation, also rely on pixel-accurate masks.

These models are used in fields where boundary quality affects safety, efficiency, compliance, or cost.

Autonomous driving and robotics

Self-driving and robotic perception systems need a detailed vector representation of their surroundings to navigate safely.

Segmentation datasets in this space call for precise polygons to outline drivable areas, intricate lane markings, pedestrians, and the highly variable shapes of surrounding vehicles.

In industrial robotics, rectangular approximations rarely hold up for precise manipulation. Polygons give the arm accurate physical boundaries for gripping irregularly shaped objects.

Medical imaging analysis

Regulatory pathways such as U.S. FDA review and European CE marking require strict, mathematically verifiable validation of diagnostic AI tools. A bounding box is rarely enough for diagnostic tasks.

In this field, polygon annotation is commonly used to outline anatomical structures, organs, tissues, and pathologies such as tumors. Accurate lesion area, boundary, volume estimate, and growth rate can affect treatment planning.

Precision agriculture

Modern agriculture uses drone imagery and robotic equipment to improve crop yield and resource use.

Vision models trained on polygon data can detect individual fruits, vegetables, grains, and even weeds. By labeling the exact leaf boundaries of invasive species, smart spraying systems can target herbicide only onto the weed’s outline, cutting chemical runoff and operational cost.

Environmental monitoring and geospatial analysis

Environmental features such as coastlines, rivers, wetlands, and forest edges have irregular geometry.

Satellite, aerial, and drone imagery often rely on large-scale 2D polygon segmentation. These polygons support area estimation, land-use classification, and change detection over time.

For geospatial AI, boundary quality often determines measurement quality.

Sports analytics and broadcasting

Sports broadcasting now uses computer vision in many places. Newer use cases include real-time virtual advertising and automated offside-line placement.

To place localized digital ads on stadium walls without covering players who run in front of them, broadcast systems need foreground segmentation masks. These masks define the scene’s z-order, so graphics can appear behind moving athletes in the live video.

High-quality polygon datasets help train the segmentation models that make this possible.

What are the requirements and formats for 2D polygon annotation?

2D polygon annotation must follow clear geometry rules. If those rules are ignored, datasets can fail during parsing, rasterization, mask generation, or model training.

Closure

A valid 2D polygon must be a closed geometric ring. The last vertex in the coordinate array must connect back to the starting point (Xn, Yn → X1, Y1) to seal the shape.

Minimum points

A polygon needs at least three non-collinear points. Its internal area must be greater than zero.

No self-intersection

Polygon edges should not cross each other. Self-intersection creates ambiguity during rasterization because the system may not know which pixels are inside or outside the object.
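
The closure, minimum-point, and self-intersection rules above lend themselves to an automated pre-export check. The following is a minimal sketch with illustrative names. It treats a repeated first vertex as explicit closure, and it only detects edges that properly cross, so touching or collinear overlapping edges would need extra handling.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def signed_area(ring: Sequence[Point2D]) -> float:
    """Shoelace formula; the sign depends on the vertex order."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def _orient(a: Point2D, b: Point2D, c: Point2D) -> float:
    """Cross product of (b - a) and (c - a); its sign tells which side c lies on."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1: Point2D, p2: Point2D, p3: Point2D, p4: Point2D) -> bool:
    """True when segments p1-p2 and p3-p4 properly cross each other."""
    d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def validate_polygon(ring: Sequence[Point2D]) -> List[str]:
    """Return a list of rule violations; an empty list means these checks passed."""
    ring = list(ring)
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]  # drop the duplicated closing vertex
    if len(ring) < 3:
        return ["fewer than 3 vertices"]
    problems = []
    if signed_area(ring) == 0:
        problems.append("zero area")
    n = len(ring)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edges are neighbors through the closing vertex
            if _segments_cross(ring[i], ring[(i + 1) % n], ring[j], ring[(j + 1) % n]):
                problems.append("self-intersection")
                return problems
    return problems


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]
```

Running validate_polygon on the square returns an empty list, while the bowtie is flagged because its first and third edges cross.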

Vertex density

Curved or detailed regions need more vertices. Human contours, leaves, fabric, and pastries often need dense points around high-curvature areas. Straight edges should use fewer points. Extra vertices slow annotation and can add noise without improving the mask.
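
When a contour arrives with too many vertices, for example from a model-generated mask, a line-simplification pass can thin it out. The sketch below uses the Ramer-Douglas-Peucker algorithm, which is one common choice and not something the original workflow prescribes. The tolerance is in pixels, and the input is an open vertex chain.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def _point_segment_distance(p: Point2D, a: Point2D, b: Point2D) -> float:
    """Shortest distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points: Sequence[Point2D], tolerance: float) -> List[Point2D]:
    """Drop vertices that deviate from the simplified chain by at most `tolerance`."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the chord joining the two endpoints.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Keep that vertex and recurse on both halves.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


contour = [(0, 0), (1, 0.01), (2, 0), (3, 2), (4, 0)]
thinned = simplify(contour, tolerance=0.1)
```

Here the near-collinear vertex at (1, 0.01) is removed while the sharp corner at (3, 2) survives. For a closed polygon, one option is to split the ring at two well-separated vertices, simplify each chain, and rejoin them.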

Negative space

Objects with holes need parent-child topology. The outer ring defines the main object. One or more inner rings define empty regions. When rendered, the system applies rules such as the even-odd rule to decide which areas are object and which are background.
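
Under the even-odd rule, a point belongs to the object when a ray cast from it crosses the ring boundaries an odd number of times, counting outer and inner rings together. A minimal sketch, with illustrative function names:

```python
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def count_crossings(x: float, y: float, ring: Sequence[Point2D]) -> int:
    """Count how many ring edges a ray from (x, y) toward +x crosses."""
    count = 0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                count += 1
    return count


def inside_even_odd(x: float, y: float, rings: Sequence[Sequence[Point2D]]) -> bool:
    """Apply the even-odd rule across the outer ring and all inner rings."""
    total = sum(count_crossings(x, y, ring) for ring in rings)
    return total % 2 == 1


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6)]
bagel = [outer, hole]
```

With this bagel-like shape, a point in the dough such as (2, 2) is inside, while the center (5, 5) crosses both rings and is treated as background.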

Different formats

The COCO instance segmentation format is widely adopted in CV research. It represents each instance as a list of one or more polygons, with each polygon stored as a flat coordinate array in the form [X1,Y1,X2,Y2,…,Xn,Yn] that lists consecutive vertices.
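
To make the flat-array layout concrete, here is a small sketch that packs vertex lists into a COCO-style annotation record. The field names follow the public COCO format. The helper name is made up for this example, and the area is a shoelace approximation of the mask area that COCO tooling would normally compute from the rasterized mask.

```python
from typing import Dict, List, Sequence, Tuple

Point2D = Tuple[float, float]


def to_coco_annotation(
    polygons: Sequence[Sequence[Point2D]],
    annotation_id: int,
    image_id: int,
    category_id: int,
) -> Dict:
    """Build one COCO-style annotation dict from one or more polygon parts."""
    segmentation: List[List[float]] = []
    xs: List[float] = []
    ys: List[float] = []
    area = 0.0
    for ring in polygons:
        flat: List[float] = []
        for x, y in ring:
            flat.extend([float(x), float(y)])  # [x1, y1, x2, y2, ...]
            xs.append(float(x))
            ys.append(float(y))
        segmentation.append(flat)
        # Shoelace area of this part (assumes parts do not overlap).
        n = len(ring)
        twice_area = sum(
            ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
            for i in range(n)
        )
        area += abs(twice_area) / 2.0
    x_min, y_min = min(xs), min(ys)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": segmentation,
        "bbox": [x_min, y_min, max(xs) - x_min, max(ys) - y_min],  # [x, y, width, height]
        "area": area,
        "iscrowd": 0,
    }


triangle = [(0, 0), (4, 0), (0, 3)]
record = to_coco_annotation([triangle], annotation_id=1, image_id=10, category_id=3)
```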

Pascal VOC segmentation uses raster PNG masks rather than vector polygons. Pixel values correspond to class IDs or instance IDs. A palette or metadata file maps those values to class names.

How to perform 2D polygon annotation (using the BasicAI platform as an example)

In this section, we use the BasicAI Data Annotation Platform, a modern enterprise-grade tool, to walk through a complete 2D polygon workflow. For this demonstration, we assume that raw data collection, filtering, and preprocessing are already done.

Example scenario:

A smart retail company is building an AI self-checkout terminal for bakeries. A top-down camera captures trays of randomly placed bread. The computer vision model must run fast instance segmentation to identify, count, and bill each item. The images contain irregular bread shapes, overlapping objects, and hollow geometry such as bagels.

Create the dataset and ontology

Open the BasicAI platform. Go to the dataset management view from the left navigation bar. Create a new dataset and select “Image” as the data type. Give it a descriptive name (for example, Bakery_Checkout_V1_Production) and confirm.

Navigate to the new dataset workspace and switch to the Ontology tab. Create the object classes you plan to label. In this case, classes might include “Toast_slice” and similar items. For each class, set the annotation tool type to “Polygon.” Add any classification attributes you need for richer metadata, such as “Visibility” (full, partially occluded).

Manual annotation with the 2D polygon tool

Once the ontology is ready, open the annotation interface.

The image appears in the center canvas. The right sidebar shows the category list. The left toolbar shows the available annotation tools.

Select the polygon tool, or use the keyboard shortcut “2”.

Let’s start with the toast slice in the bottom right. Click one point on the bread edge, such as the upper-left contour, to place the first vertex. Continue clicking along the boundary in either clockwise or counterclockwise order.

Once the full perimeter is traced, press Space or Enter to auto-close the polygon.

A configuration panel appears so the annotator can assign a class label from the ontology. Select “Toast_slice” along with any configured attributes.

Tips:

Once a polygon is created, click on it and drag any vertex to fine-tune the shape. Hold Shift to drag the whole polygon to a new position.
Press A to constrain the next point to the horizontal axis, or Ctrl/Cmd + A to lock movement to the vertical axis.

Handling overlapping objects

Bakery tray scenes often include overlapping loaves. Take the overlapping baguette slices in the top-left of this example. Labeling overlapping objects usually introduces small gaps or accidental pixel overlaps. BasicAI solves this through local topology sharing.

First, draw a standard complete polygon around the top, fully visible baguette slice.

Next, annotate the partly occluded slice underneath it. Let the new polygon roughly overlap the existing slice along the shared boundary.

When placing the second polygon, activate shared-edge mode (Ctrl/Cmd + K). In this mode, you can trace the new polygon freely without worrying about overlap with the existing one. When the second polygon is closed, the platform adjusts the boundary. The two polygons share a clean common edge, with no gap and no overlap.

Handling hollow objects

Bagels and some doughnut-shaped bread items have inner holes. They need hollow polygons.

Trace one complete, continuous polygon along the outermost edge of the bagel. This creates Polygon 1, the main body.

Trace a smaller secondary polygon along the perimeter of the inner hole. This becomes Polygon 2.

Hold Shift and select both polygons. Run the Hollow command, available in the top-left toolbar or via the H shortcut.

The system instantly subtracts the smaller polygon’s geometry from the larger one, producing a hollow polygon.

Special case: clipping

Occlusion can create disconnected visible regions. For example, object A may be partly hidden by object B, while visible on both sides of B. In this case, annotate A and B first as complete overlapping polygons, as if both were fully visible.

Then hold Shift and select both polygons. Click the Crop or Clip button, or use the assigned keyboard shortcut. The platform shows two options: Crop 1 and Crop 2.

Crop 1 removes the overlapping area from B and keeps A fully visible.
Crop 2 removes the overlapping area from A and keeps B fully visible.

After clipping, the occluded region is removed. The visible parts remain as separate, non-overlapping polygons.

Finish annotation and export

When you are satisfied with the annotation quality, click “Save” to commit the labels to the dataset. Click “Close” to exit annotation mode.

To prepare the labeled data for model training, select the fully annotated images from the dataset and launch an export task. Specify the export format (COCO JSON, Pascal VOC XML, or a platform-specific format) along with any additional parameters.

Practical advice for 2D polygon annotation

Building a high-quality polygon dataset takes more than tool proficiency. Several operational factors matter as much. To close this post, here are a few recommendations based on our experience:

Manage vertex density. For most natural objects, 20–40 vertices usually strike a good balance between precision and speed, though this depends heavily on image resolution and object size.
Write clear guidelines. Edge cases must be defined before production starts. Should a shadow on the ground be included in the object? Should a glass polygon include the liquid inside it, or should that region be cut out? These rules must be explicit.
Choose an advanced annotation platform. Platforms like BasicAI offer a rich set of segmentation tools and integrate models such as SAM or other interactive segmentation algorithms. These generate an initial segmentation result that the annotator only needs to refine, which lifts throughput significantly.
Consider outsourcing your annotation work. Polygon labeling demands more geometric expertise, stronger tool skills, and higher cognitive load than basic bounding box annotation. Large internal labeling projects can consume expensive engineering time. Partnering with a specialized BPO data annotation service brings a clear strategic advantage. These teams come with rigorous QA frameworks, domain-specific expertise, and a scalable workforce, all of which help the final model reach its full potential.

3D Cuboid Annotation for Computer Vision Model Training

BasicAI — Fri, 24 Apr 2026 15:03:31 GMT

Early perception systems relied on images and 2D bounding boxes. A 2D box is a clean way to outline an object within a pixel grid. But the world is three-dimensional. If you care about size, depth, orientation, and spatial relations, you need labels that describe geometry in real space.

Modern 3D sensors introduced new data modalities. LiDAR measures distance by emitting laser pulses and timing their return, producing 3D point clouds. This data demands new data annotation methods.

3D cuboids (3D bounding boxes) emerged as a practical solution. They carry the key signals autonomous systems need for safety-critical decisions like collision avoidance, path planning, and control, while staying simple enough to label at scale.

Major benchmarks such as KITTI, nuScenes, and Waymo Open Dataset standardized on 3D cuboid annotations. Their conventions shaped much of today’s industry norms.

In this post, we cover the concept, applications, representation, and workflow of 3D cuboid annotation.

What is 3D cuboid annotation in computer vision?

A 3D cuboid (3D bounding box) is a rectangular box in 3D space that tightly encloses a target object.

Unlike a 2D bounding box, a cuboid captures:

Position in 3D Euclidean space,
Physical dimensions along three axes, and
Orientation relative to a defined coordinate frame.

This volumetric representation changes how ML models perceive and reason about the physical world.

3D cuboid annotation is especially important for LiDAR point cloud data. LiDAR is a primary sensing modality in many current autonomous driving systems. It generates sparse but highly accurate 3D point clouds: millions of 3D coordinates representing laser pulse returns from surfaces in the environment. In point clouds, a 3D cuboid defines which points belong to an object and which do not.
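
The membership test implied here can be written down directly. The sketch assumes a yaw-only cuboid given by its center, its (length, width, height), and a yaw angle in radians about the z-axis; the function name is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def points_in_cuboid(
    points: Sequence[Point3D],
    center: Point3D,
    size: Tuple[float, float, float],
    yaw: float,
) -> List[Point3D]:
    """Return the points that fall inside a yaw-rotated box."""
    cx, cy, cz = center
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    inside = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        # Rotate the offset by -yaw to express it in the box's own axes.
        local_x = cos_y * dx + sin_y * dy
        local_y = -sin_y * dx + cos_y * dy
        if abs(local_x) <= length / 2 and abs(local_y) <= width / 2 and abs(dz) <= height / 2:
            inside.append((x, y, z))
    return inside


cloud = [(1.5, 0.0, 0.0), (0.0, 1.5, 0.0), (0.0, 0.0, 3.0)]
# A 4 m x 2 m x 2 m box whose long side points along +x.
hits = points_in_cuboid(cloud, center=(0.0, 0.0, 0.0), size=(4.0, 2.0, 2.0), yaw=0.0)
```

Turning the same box by 90 degrees swaps which of the first two points is enclosed, which is why yaw errors in annotation directly change the set of points assigned to an object.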

Autonomous vehicles and many robots combine multiple sensing modalities (LiDAR, cameras, radar, and others) through sensor fusion. Each provides different signals, and each has its own coordinate frame. A 3D cuboid provides a common 3D reference that 2D camera observations can be projected onto and aligned with.

What CV tasks and applications benefit most from 3D cuboid annotations?

Core computer vision tasks

3D object detection is the primary and most fundamental task built on 3D cuboid data. Instead of predicting a 2D pixel region, the model predicts an object’s real-world 3D position, metric size, and rotation.

Trajectory prediction and 3D object tracking. Tracking requires consistent object identity across frames and an estimate of motion over time. Cuboids provide ground-truth geometry in real space. Datasets such as Argoverse are common benchmarks here. Trained systems can predict trajectories, match detections across frames using spatial proximity and motion consistency, and handle partial occlusion or new objects entering the scene.

Autonomous driving and ADAS

In autonomous driving, 3D object detection is the foundation for nearly all downstream 3D perception tasks, including motion prediction, trajectory planning, and collision avoidance.

Cuboids provide the physical footprint of nearby vehicles and other agents and their exact distance to the ego vehicle. That supports safe following distance and correct time-to-collision (TTC) calculations for emergency braking. By analyzing the heading angle (yaw) of vehicle or pedestrian cuboids, the system can also infer motion intent.
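
As a worked example of the TTC idea, consider the simplest constant-speed model: TTC is the bumper-to-bumper gap divided by the closing speed. The sketch assumes the lead vehicle's cuboid is aligned with the ego x-axis and that speeds are known from tracking. All names and numbers below are illustrative.

```python
import math


def longitudinal_gap(lead_center_x: float, lead_length: float, ego_front_offset: float) -> float:
    """Gap in meters from the ego front bumper to the rear face of the lead cuboid."""
    return lead_center_x - lead_length / 2 - ego_front_offset


def time_to_collision(gap_m: float, ego_speed_mps: float, lead_speed_mps: float) -> float:
    """Constant-speed TTC in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


# Lead cuboid centered 22 m ahead and 4 m long; ego front bumper 2 m ahead of the ego origin.
gap = longitudinal_gap(lead_center_x=22.0, lead_length=4.0, ego_front_offset=2.0)
ttc = time_to_collision(gap, ego_speed_mps=20.0, lead_speed_mps=14.0)
```

With an 18 m gap closing at 6 m/s, the TTC is 3 seconds. A cuboid length that is off by half a meter changes that figure noticeably, which is one reason tight boxes matter.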

Robot navigation and obstacle avoidance

Warehouse picking robots, AGVs, and humanoid platforms all need 3D perception for planning and interaction.

In warehouses, factories, and service spaces, it’s not enough to know where an object appears in an image. The robot needs its 3D position and orientation to plan a collision-free path and execute actions.

For manipulation, a robot arm needs accurate 3D localization and pose to grasp from shelves. For mobile robots, accurate 3D sizes and locations of obstacles, furniture, and people are required for safe and natural navigation.

Augmented reality (AR)

AR and VR depend on consistent alignment between virtual content and the physical world.

3D cuboid annotations map the precise physical dimensions and spatial positions of real-world items in a user’s environment. That supports correct occlusion (virtual content behind real furniture), stable placement on surfaces, and consistent depth behavior as the user moves.

How are 3D bounding boxes represented?

To make a 3D cuboid machine-readable, teams follow a standardized mathematical format. The industry-standard 3D bounding box is defined by 9 parameters. These fully specify position, size, and rotation.

The 9 parameters

Center coordinates: the x, y, z coordinates of the cuboid’s center in a relevant coordinate frame. In autonomous driving, this is often an ego-vehicle frame with origin on the ground under the vehicle center, with x pointing forward, y pointing left, and z pointing up.
Dimensions: the physical size of the cuboid: length, width, and height (l, w, h). These values describe the box’s extent along its three principal axes, in the same units as the coordinate frame (usually meters in autonomous driving).
Orientation / rotation: the cuboid’s rotation around the x, y, and z axes, commonly called roll (α), pitch (φ), and yaw (θ). These three angles fully define any 3D rotation.

Yaw (rotation around the vertical z-axis) is critical in autonomous driving. It directly determines a vehicle’s heading, whether a pedestrian faces toward or away from the ego vehicle, and which side of an object faces which direction.
Pitch (rotation around the lateral y-axis) describes forward or backward tilt, relevant for vehicles on ramps or slopes.
Roll (rotation around the longitudinal x-axis) describes lateral tilt, relevant for leaning motorcycles.

In many road-driving datasets (including the standard KITTI dataset), objects on relatively flat ground are often assumed to have near-zero pitch and roll. That reduces the parameter set from 9 to 7 for efficiency.
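
The 7-parameter form (x, y, z, l, w, h, yaw) is enough to recover all eight corners of the box. A minimal sketch, assuming the center is the geometric center of the cuboid and yaw is a counterclockwise rotation about the z-axis; individual datasets differ in axis conventions and in where they place the box origin, so check the dataset's documentation before reusing this.

```python
import math
from itertools import product
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def cuboid_corners(
    x: float, y: float, z: float,
    length: float, width: float, height: float,
    yaw: float,
) -> List[Point3D]:
    """Return the 8 corners of a yaw-rotated cuboid in the frame of its center."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        # Corner offset in the box's own axes (length along local x, width along local y).
        lx, ly, lz = sx * length, sy * width, sz * height
        corners.append((
            x + cos_y * lx - sin_y * ly,
            y + sin_y * lx + cos_y * ly,
            z + lz,
        ))
    return corners


# A 4 m x 2 m x 2 m car box centered 10 m ahead and 1 m above the ground, facing forward.
corners = cuboid_corners(10.0, 0.0, 1.0, 4.0, 2.0, 2.0, yaw=0.0)
```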

Different coordinate frames

Cuboid parameters depend on the coordinate system:

The sensor frame is defined relative to a single sensor. A LiDAR sensor defines its own 3D frame with the origin at the sensor center and axes aligned with its mounting orientation.
The camera frame places its origin at the camera’s focal point with axes aligned to the optical axis.
The ego-vehicle frame places its origin at the vehicle center with axes aligned to the vehicle heading.
The global/world frame references fixed earth coordinates, typically using a latitude-longitude-altitude system or a local Cartesian projection.

In sensor fusion pipelines, LiDAR points live in the LiDAR frame, while images live in the camera frame. To align them, you project between frames using calibration:

Apply extrinsics to transform world (or LiDAR) coordinates into the camera frame;
Apply intrinsics to project camera coordinates onto the 2D image plane.
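
The two steps above can be sketched with a plain pinhole model. The calibration values here are made up for illustration: the rotation maps a LiDAR frame (x forward, y left, z up) into a camera frame (x right, y down, z forward) with zero translation, and lens distortion is ignored.

```python
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def lidar_to_camera(point: Point3D, rotation: Sequence[Sequence[float]], translation: Point3D) -> Point3D:
    """Extrinsics: apply p_cam = R * p_lidar + t."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def camera_to_pixel(point_cam: Point3D, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Intrinsics: pinhole projection; returns None for points behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical calibration for a forward-facing camera mounted at the LiDAR origin.
R = [
    [0.0, -1.0, 0.0],   # camera x (right)   = -LiDAR y (left)
    [0.0, 0.0, -1.0],   # camera y (down)    = -LiDAR z (up)
    [1.0, 0.0, 0.0],    # camera z (forward) =  LiDAR x (forward)
]
t = (0.0, 0.0, 0.0)

# A cuboid corner 10 m ahead and 1 m to the right of the sensor.
corner_cam = lidar_to_camera((10.0, -1.0, 0.0), R, t)
pixel = camera_to_pixel(corner_cam, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
```

Projecting all eight corners of a cuboid this way and taking their pixel extent yields the linked 2D box in the image view, provided every corner is in front of the camera.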

How to perform 3D cuboid annotation on BasicAI Data Annotation Platform

With the necessary concepts covered, let’s turn to practice. Production-scale 3D cuboid labeling needs tools that can render million-point clouds and synchronized high-res images with low latency.

Below is a typical workflow using the BasicAI Data Annotation Platform as an example for LiDAR-camera fusion. We assume that raw data collection, multi-sensor calibration, data preprocessing, and dataset organization have already been completed.

Create the dataset and ontology

Before drawing any cuboids, the project manager must define the annotation schema and dataset parameters.

Dataset creation. From the BasicAI homepage, go to Datasets and create a new dataset. Choose LiDAR Fusion as the dataset type to indicate synchronized LiDAR point clouds and camera images. Use a clear dataset name (for example, “Urban Driving Scene LiDAR Fusion Dataset”), then confirm.

Ontology definition. Open the dataset’s Ontology tab and define classes, attributes, and tools. The ontology specifies which object types exist, which attributes each type has, and which annotation tool is used per type.

For this autonomous driving scene, create a main class such as “car” (passenger cars, SUVs, and similar). Set its tool type to Cuboid. Common attributes for vehicles include:

Occlusion: fully visible / partially occluded / heavily occluded.
Truncation: partly outside sensor range or camera view.
Motion state: parked / moving / stopped.

Manual 3D cuboid annotation

Open the annotation UI. In the Data tab, select a LiDAR fusion frame or a sequence, then click Annotate. The UI is typically split into a left tool panel, a central 3D+2D view, and a right ontology/attribute panel.

Select the cuboid tool. Choose the 3D cuboid tool (hotkey “1” or “F”).

Two-click cuboid creation. BasicAI uses an assisted cuboid creation flow. Click twice in the point cloud to set an initial extent: first click one outer corner of the target cluster, second click the diagonal corner. The system proposes an initial 3D cuboid. The direction from the first click to the second influences the initial yaw.

Multi-sensor projection and linking. Because the dataset is LiDAR fusion, the tool uses preloaded extrinsics and intrinsics to project the 3D cuboid onto the aligned camera image in real time. It also creates a 2D bounding box and a 2D cuboid in the image view. These instances share the same trackID and trackName, so downstream models can treat them as one physical object across modalities.

Refine the cuboid. Adjust position, size, and rotation in the 3D view. Resize by dragging faces, rotate with the rotation controls, and translate the whole box. The goal is a cuboid that contains every point belonging to the object while excluding nearby objects, infrastructure, and ground returns.

Set attributes. On the floating panel, pick the “car” class and fill in occlusion and truncation based on visual evidence.

Optional: Auto-annotation (pre-labeling). For efficiency, annotators can run a built-in pretrained model. On the BasicAI platform, this can be triggered via the “brain” button for one-click inference. The model proposes cuboids for vehicles and other classes across the scene. The human role shifts from drawing to verification and correction.

Save, exit, and export. After labeling all target objects in the frame (or across a 4D-BEV sequence), click Save, then close the UI. A manager can select completed items in the Data tab and run Export. Outputs are commonly produced in formats such as JSON or KITTI-style files.

Practical tips for 3D cuboid annotation

3D cuboid labeling is labor-heavy and easy to get wrong. Based on real project experience, here are some strategies to maintain high efficiency and quality.

Ensure tight fit. Avoid large empty margins, and do not cut off valid points. Rotate the 3D view often and inspect from multiple angles before finalizing.
Align the bottom face to the ground. Most road and warehouse objects rest on the ground plane. Many tools like BasicAI can run ground segmentation to separate ground and non-ground points. Another simple tactic is to restrict the visible height range, for example by showing only points from 0.5–5m above ground to reduce clutter.
Calibrate sensor alignment when using LiDAR fusion data. Visual mismatches between 3D point clouds and 2D camera images are common. BasicAI provides online calibration tools (the camera + gear button on the dataset page). Place a reference point on a clearly identifiable physical feature in the point cloud, then adjust its 2D projection in the corresponding camera image until the coordinate frames align.
Human-model coupling. Advanced platforms like BasicAI include pre-trained ML models that detect common classes and propose cuboids. For sequences, auto tracking features can propagate cuboids across frames and keep stable object IDs.
Consider partnering with professional annotation providers. 3D cuboid annotation demands significant cognitive load, substantial hardware resources, and specialized expertise. Leading AI teams frequently work with dedicated annotation companies that bring robust multi-tier QA frameworks and purpose-built infrastructure to deliver high-quality, consistent 3D annotations at industrial scale. This directly supports safer, more reliable AI models in production.

What is Ontology in Machine Learning Data Annotation?

BasicAI — Tue, 07 Apr 2026 04:41:24 GMT

Ontology started in ancient philosophy as the study of the nature of existence.

Computer scientists later adopted the term. In machine learning, an ontology is a structured framework that defines the entities, concepts, relationships, and constraints in a domain or problem space.

Applied to data annotation, ontology embeds rich semantic and hierarchical information into annotations. This helps keep labeling consistent across large teams and long-running projects. Downstream AI systems can also use the added structure to improve reasoning and decision-making.

BasicAI’s Xtreme1 was the first open-source multimodal training data platform to bring ontology into the core annotation workflow. The same system is also integrated into the BasicAI Data Annotation Platform, helping AI teams maintain high label consistency, reuse schemas across projects, and capture domain knowledge as an asset.

In this post, we’ll explain what an annotation ontology contains, why it matters, and how to build ontology on the BasicAI platform.

What elements make up an ontology in data annotation?

An effective ontology for machine learning data annotation consists of several connected elements that together form a domain knowledge model:

Class, Sub-Class, and Instance

In ontology-based data annotation, a class is a basic category of “things” in the domain. It defines a general concept that can apply to many entities. In an autonomous driving dataset, for example, “vehicle” can be a top-level class covering motor vehicles.

Classes usually sit in a hierarchy. A sub-class is a specialized version of a parent class. It inherits broad properties from the parent, and adds features that distinguish it from sibling classes. “Sedan” can be a sub-class of “vehicle”, and may be further split into “luxury sedan” or “mid-size sedan”.

An instance (often called an entity or object) is a specific, located occurrence of a class or sub-class in actual data.

This hierarchy matters for deep learning. It lets models learn visual features at different levels of abstraction. If low light or heavy occlusion prevents a model from deciding whether an object is a sedan or an SUV, the perception stack can still recognize it as a vehicle from more general shape cues and trigger braking.
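One way to picture this fallback is to sum the scores of fine-grained classes up the hierarchy and report the most specific class that is still confident. The sketch below is illustrative only: the class names, the scores, and the 0.8 threshold are assumptions, and real perception stacks may handle hierarchy differently.

```python
# Hypothetical class hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "vehicle": None,
    "sedan": "vehicle",
    "suv": "vehicle",
    "pedestrian": None,
}

def aggregate(leaf_scores):
    """Add each leaf score to the leaf itself and to all of its ancestors."""
    totals = {}
    for leaf, score in leaf_scores.items():
        node = leaf
        while node is not None:
            totals[node] = totals.get(node, 0.0) + score
            node = PARENT.get(node)
    return totals

def depth(label):
    """Number of ancestors above a class (0 for a top-level class)."""
    levels = 0
    while PARENT.get(label) is not None:
        label = PARENT[label]
        levels += 1
    return levels

def most_specific_label(leaf_scores, threshold=0.8):
    """Return the deepest class whose aggregated score reaches the threshold."""
    totals = aggregate(leaf_scores)
    confident = [name for name, score in totals.items() if score >= threshold]
    if not confident:
        return None
    return max(confident, key=lambda name: (depth(name), totals[name]))

# Low light: sedan vs. SUV is ambiguous, but "vehicle" is still confident.
ambiguous = most_specific_label({"sedan": 0.48, "suv": 0.46, "pedestrian": 0.06})
# Clear view: the fine-grained class itself passes the threshold.
clear = most_specific_label({"sedan": 0.90, "suv": 0.05, "pedestrian": 0.05})
```

Here the ambiguous case resolves to "vehicle" and the clear case to "sedan". Labels that follow a hierarchy are what make this kind of aggregation possible.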

Classification

Classes and sub-classes define and locate objects (entities) in the data. Classification, in the context of an annotation ontology, describes global properties of the broader context, the environment, or the data itself.

Classification ontologies often split into data-level and scene-level ones. Scene-level classification assigns semantic labels to describe the full environment or context in a frame, rather than any single local object within it. These labels provide key metadata so AI models can interpret objects under the conditions in which they appear.

In autonomous driving, classifications might include weather (sunny/rainy/snowy/foggy) and time context (day/night/dawn/dusk). Other scene labels can include road type, traffic density, or the presence of pedestrians and cyclists.

Attributes

A class says what something is. An attribute describes how it looks, what state it’s in, or what properties it has. Attributes capture nuance that a simple class label cannot. That nuance often affects real-world performance.

Two especially important attributes in computer vision and spatial annotation are:

Occlusion: how much an object is hidden by other objects or the environment.
Truncation: whether part of the object lies outside the sensor’s field of view. This tells the model the missing pixels are due to framing, not physical obstruction.

Other common attributes include color (e.g., red, blue, silver), motion state (e.g., stationary, moving), or physical condition (e.g., intact, damaged, severely deformed).

Relation

A relation defines semantic, logical, spatial, or causal connections between two or more instances. Relation annotation is widely used in NLP and knowledge extraction, and the same idea applies to multimodal data.

A standard format in text annotation is the entity–relation–entity triple. Entities are recognized concepts such as people, organizations, locations, or events. The relation encodes the meaningful link between them.

Constraints

Attributes describe properties. Constraints set validity bounds to ensure annotations conform to domain knowledge and physical reality.

For example, when annotating sedans with a 3D cuboid tool in LiDAR point clouds, predefined constraints might specify that no sedan instance may exceed 6.0 m long × 2.5 m wide × 2.0 m tall.

Raw LiDAR point clouds can be sparse. Annotators may accidentally extend a 3D cuboid (3D bounding box) into background noise or a nearby car. Ontology constraints act as guardrails, so the model learns correct spatial dimensions.
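As a rough illustration of such a guardrail, the sketch below checks a cuboid's dimensions against the sedan limits from the example above. The class and function names are hypothetical and do not reflect how any particular platform stores or enforces constraints.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SizeLimit:
    """Maximum allowed cuboid dimensions in meters (hypothetical structure)."""
    max_length: float
    max_width: float
    max_height: float

# Limits taken from the sedan example in the text.
SEDAN_LIMIT = SizeLimit(max_length=6.0, max_width=2.5, max_height=2.0)

def size_violations(length, width, height, limit):
    """Return a human-readable message for every dimension that exceeds its limit."""
    checks = [
        ("length", length, limit.max_length),
        ("width", width, limit.max_width),
        ("height", height, limit.max_height),
    ]
    return [
        f"{name} {value:.2f} m exceeds limit {maximum:.2f} m"
        for name, value, maximum in checks
        if value > maximum
    ]

ok = size_violations(4.6, 1.8, 1.5, SEDAN_LIMIT)
# A box that was stretched into a neighboring car is flagged on length.
stretched = size_violations(7.2, 1.9, 1.5, SEDAN_LIMIT)
```

Running a check like this at save time or in a batch QA pass catches oversized boxes before they reach training.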

Why does ontology matter for machine learning data annotation?

In computer science, ontology was first used to connect early web hypertext documents by giving precise definitions to underlying concepts.

Modern deep learning, particularly in computer vision and NLP, requires training datasets that can reach petabyte scale. The semantic framework that was built to organize the web maps well onto organizing training data, and in practice becomes necessary.

Ontology has moved from a pure knowledge-representation idea to a practical backbone for data-centric AI. Integrating formal ontology into ML data annotation has far-reaching implications.

The traditional approach based on class labels alone (“flat taxonomy”) breaks down at scale:

Different annotators interpret the same guideline differently, which introduces bias and inconsistency.
As applications grow more complex and datasets reach millions of images, teams spend heavily on relabeling, reconciling conflicting tags, and dealing with a combinatorial explosion in flat label sets.

Higher annotation quality

Ontology-based annotation standardizes work through a rigorous framework. It ensures an annotator in one office labels a heavily occluded vehicle the same way as an annotator on another continent. That reduces error rates, narrows data variance, and cuts expensive rework.

Scalability and reuse

Many datasets are labeled as one-off efforts for a single, narrow model task. The result is data silos that are hard to reuse. Ontology abstracts the problem definition away from one model’s immediate needs. A well-built ontology becomes a central, evolving knowledge base that you can reuse, extend, and apply to future datasets.

How to start annotating with ontology? (BasicAI platform example)

Ontology-based annotation requires robust software infrastructure. Here we use the BasicAI data annotation platform as an example to show how to build and manage ontologies. The platform provides an intuitive interface designed for computer vision and multimodal data, so project owners can create, deploy, and enforce complex ontologies.

The Ontology Center

Ontology Center is the central repository for ontology assets on the BasicAI platform. You can access it from the left-side Ontology tab on the homepage. It separates ontology creation from any single dataset (though you can also create ontologies inside a dataset).

Ontology Center allows teams to build an ontology once and apply it across multiple projects and datasets. This reduces setup cost when starting new labeling work.

Creating Classifications

Let’s start with classification ontology.

Open a dataset and switch to the Ontology tab.
Under the tab, go to the Classification section.
Click Create to define a new classification.
Fill in key fields:
Name: The classification identifier. For this example, we create a “time_of_day” classification.
Target On: Choose whether the classification applies at the scene level or data level. For scene-level classification, select Scene.
Options: Add valid values. For time of day, options might be Day, Night, Dawn, Dusk. Each scene should map to exactly one option, with no ambiguity.
Repeat the same flow to create more classifications, such as Weather with options like Sunny, Rainy, Snowy, Foggy, Overcast.

During labeling, you can switch the right-side panel to Classification annotation and see the ontologies you created.

Creating Classes

A class defines the object types that must be labeled in a scene (and learned by the model). Whether you create classes globally in Ontology Center or locally inside a dataset, the steps are the same.

Click Create to open the class configuration panel. It includes:

Name: The official machine-readable name (for example, vehicle_sedan). This is the exact string downstream training receives.
Alias: A human-friendly name shown in the UI when the machine-readable name is long or unclear. Annotators can toggle between name and alias with the “O” shortcut to stay oriented during fast labeling.
Number: An integer ID (for example, 1 for Pedestrian, 2 for Vehicle). If downstream pipelines use numeric IDs, this avoids extra parsing or mapping scripts.
Color: Display color for this class in the labeling view. Pick colors that are easy to tell apart so annotators can visually verify categories.
Tool Type: The annotation tool for this class, such as bounding box, polygon, mask, cuboid, or skeleton. If you choose skeleton, you must also define keypoint order and connections before saving. For other modalities, the platform provides tool types such as Clip for temporal segmentation in video or audio.
Tags: Domain tags or special handling markers. Useful for search and organization in large ontologies.
Size Limit: Size constraints, especially for 3D LiDAR annotation. You can set strict min/max bounds for width, height, area, and more.

Setting Attributes

Attributes add the state and nuance a class needs. In the class configuration window, click Manage Attributes to open the attribute panel.

You can add multiple attributes to a class. Each attribute has an input type that controls how annotators enter values:

Radio: Mutually exclusive choice. Annotators must pick exactly one option.
Example: occlusion > None / Partial (<40%) / Heavy (40–70%) / Extreme (>70%)
Multi-selection: Multiple overlapping states can be selected.
Example: damage types such as “windshield cracked” and “bumper dented” at the same time.
Dropdown: Like radio logic, but more compact. Useful for attributes with long lists.
Example: “vehicle brand” with dozens of manufacturers.
Text: Free-form input for transcription, unique strings, or formulas.
Example: license plate transcription.
Rank: A numeric severity or ordering.
Example: medical imaging grades for lesion severity, readability, or artifact level.
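The difference between input types is easiest to see as validation rules. The sketch below is a hypothetical checker for four of the types listed above (Rank is left out because its exact value format is not described here). The function name and data layout are assumptions, not the platform's schema or API.

```python
def validate_attribute(input_type, options, value):
    """Check one attribute value against its input type (illustrative rules only)."""
    if input_type in ("radio", "dropdown"):
        # Exactly one option must be chosen, and it must be a defined option.
        return isinstance(value, str) and value in options
    if input_type == "multi-selection":
        # Any subset of the defined options, without duplicates.
        return (
            isinstance(value, (list, tuple))
            and len(set(value)) == len(value)
            and set(value) <= set(options)
        )
    if input_type == "text":
        # Free-form input, but not empty or whitespace-only.
        return isinstance(value, str) and value.strip() != ""
    raise ValueError(f"unsupported input type: {input_type}")

OCCLUSION = ["None", "Partial (<40%)", "Heavy (40–70%)", "Extreme (>70%)"]
DAMAGE = ["windshield cracked", "bumper dented"]

radio_ok = validate_attribute("radio", OCCLUSION, "Partial (<40%)")
multi_ok = validate_attribute("multi-selection", DAMAGE, ["windshield cracked", "bumper dented"])
multi_bad = validate_attribute("multi-selection", DAMAGE, ["rust"])
```

Radio and dropdown share the same rule because they differ only in how the options are displayed.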

Creating Sub-Classes

After you define the class configuration, save it to create the ontology class. To add hierarchy, hover over an existing class entry in the Ontology tab and use the Sub-Class entry point to create a more specific class under it.

Ontology management

Once an ontology is created and refined through real labeling work, you need a clear process for versioning, sharing, and evolution.

In Ontology Center, use the three-dot menu in the top-right to import/export ontology JSON for version control, sharing, or archival.
In a dataset’s Ontology tab, the three-dot menu also provides Copy from Ontology Center, which enables project-level management while staying connected to the central library.

If an organization maintains a detailed city-driving ontology, any newly uploaded LiDAR dataset can be configured for annotation in seconds by importing that master ontology.

Conclusion

Modern ML systems have moved from isolated perception tasks to multimodal reasoning and interaction in dynamic environments. The methods used to annotate and manage training data must evolve with that shift.

Formal ontology in machine learning annotation is a practical change in how you enforce consistency and capture knowledge. Ontology-based labeling reduces disagreement between annotators and produces cleaner training signals, which helps models learn more reliably.

Ontology assets also compound in value across projects. Reuse speeds up new project setup and ensures new work benefits from what the organization has already learned.

Modern platforms such as Xtreme1 (open source) and the BasicAI enterprise platform make these ontology-driven workflows accessible in day-to-day annotation work.

Keypoint and Skeleton Annotation for Computer Vision Model Training

BasicAI — Fri, 06 Mar 2026 10:23:07 GMT

Early computer vision handled rigid objects like cars and cups well. Humans and animals are different. They bend, twist, and occlude their own limbs.

A bounding box tells an object detection system that a person exists somewhere in the frame. It cannot tell the system whether that person is running, falling, reaching, or throwing a punch.

Algorithm engineers need a mathematical way to represent joint motion. They need a structural constraint graph that knows an elbow connects a shoulder to a wrist, and that the joint only moves within a limited range.

This need gave rise to keypoint and skeleton methods, which push past what classic object detection can express.

The concept actually dates back to 1973. Fischler and Elschlager proposed Pictorial Structures, representing objects as collections of parts connected by flexible springs. That work set the conceptual foundation for today’s keypoint-and-skeleton representations.

Now, teams across industries rely on structured keypoint data to train models that capture geometric relationships in visual scenes.

In this article, we’ll explain what keypoint and skeleton annotation mean, where they are used, common dataset conventions, and a practical workflow you can follow.

What are keypoint and skeleton annotation in computer vision?

Keypoint annotation means marking specific feature points on objects in images or video frames. Each point corresponds to a meaningful location such as a body joint, a facial landmark, or a corner of an object.

These points are represented as 2D (or 3D) coordinates, typically stored as pixel locations (x,y). Unlike bounding boxes that only provide regions, keypoints capture precise positional information at specific anatomical or structural positions.

Skeleton annotation adds connectivity between keypoints. It defines which points are linked, forming a structured representation, such as shoulder–elbow–wrist.

This connected representation enables models to learn not just where individual points are, but also the spatial relationships and geometric constraints between them.

Keypoints and skeletons solve slightly different problems.

Keypoint annotation works best when you need to track specific features or joints without full shape information. Skeleton data becomes necessary when understanding relationships between points matters more than tracking precise boundaries. Motion analysis and gesture recognition are typical tasks.

What AI models and applications benefit most from keypoint and skeleton data?

Over the past decade, human pose estimation has been the primary application domain for keypoint and skeleton annotation. The task is to identify the spatial positions of body joints from images or video sequences.

Models trained on keypoint data detect and track body parts including shoulders, elbows, wrists, hips, knees, and ankles. This enables systems to infer a person’s movement or posture.

In fitness and rehabilitation, pose estimation models powered by keypoint data analyze exercise form in real time, providing corrective feedback to users during workouts.

Facial expression analysis is another major area. Facial landmarks help models detect emotion, recognize individuals, and support applications ranging from user authentication to accessibility features in human-computer interfaces.

Autonomous driving perception also benefits. When tracking pedestrians and cyclists, keypoints can help infer whether someone is walking, standing still, or preparing to cross. That added context supports behavior prediction and can improve safety decisions.

Keypoints are not limited to humans. Defining a fixed set of structural points on vehicles creates a shape prior that can improve generalization and robustness. This supports use cases such as intelligent traffic monitoring, trajectory prediction, and accident reconstruction.

More broadly, keypoints extend detection and tracking into manufacturing, robotics, and agriculture.

In quality control, keypoints localize specific part features for defect detection and dimensional verification. In robotics, keypoints help a robot understand object geometry and candidate grasp points. In livestock monitoring, pose estimation can track behaviors like standing, lying, or grazing, which can indicate health or feeding patterns.

Are there any notable keypoint and skeleton datasets and industry standards?

Annotation standards often follow established conventions so models and tooling can interoperate, and so results can be compared fairly.

The COCO dataset is a foundational benchmark for keypoint detection. COCO Keypoints 2017 provides 17 keypoints per person across more than 56,000 training images.

These 17 keypoints define a standard human skeleton:

nose; left/right eye; left/right ear; left/right shoulder; left/right elbow; left/right wrist; left/right hip; left/right knee; left/right ankle.

This convention is widely adopted in practice. Many computer vision frameworks, including Ultralytics YOLO26, use the 17-point COCO scheme as a default for human pose estimation.
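In COCO-format annotation files, each person's keypoints are stored as a flat list of 51 numbers: an (x, y, v) triple per keypoint in the order listed above, where v is 0 (not labeled), 1 (labeled but not visible), or 2 (labeled and visible). The sketch below unpacks such a list; the function name and the example coordinates are made up for illustration.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

VISIBILITY = {0: "not labeled", 1: "labeled but not visible", 2: "labeled and visible"}

def parse_coco_keypoints(flat):
    """Turn a flat [x1, y1, v1, x2, y2, v2, ...] list into a name -> (x, y, visibility) dict."""
    expected = 3 * len(COCO_KEYPOINTS)
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    parsed = {}
    for index, name in enumerate(COCO_KEYPOINTS):
        x, y, v = flat[3 * index: 3 * index + 3]
        parsed[name] = (x, y, VISIBILITY[v])
    return parsed

# Made-up annotation: only three keypoints are labeled.
example = [0] * 51
example[0:3] = [320, 110, 2]    # nose, visible
example[15:18] = [290, 180, 2]  # left_shoulder, visible
example[21:24] = [270, 240, 1]  # left_elbow, occluded but position estimated

person = parse_coco_keypoints(example)
num_labeled = sum(1 for _, _, vis in person.values() if vis != "not labeled")
```

Because the format carries no keypoint names, a consistent point order is the only thing that ties each triple to a body part.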

Hand keypoint datasets have evolved alongside the growing importance of hand pose estimation in gesture recognition and VR applications.

The Hand Keypoints Dataset contains 26,768 images with 21 hand keypoints annotated per hand. These 21 keypoints include the wrist position plus four joints for each of the five fingers, providing enough detail to recognize complex hand poses and gestures.

Facial landmark datasets typically use 68-point or 104-point annotation schemes, depending on required granularity. The 68-point standard covers key positions around the eyes, nose, mouth, and jawline. The expanded 104-point scheme adds landmarks on areas such as the ears, eyebrows, and facial contours for a finer description of facial geometry.

When designing annotation schemes, you need to decide point ordering, attribute definitions, and occlusion rules.

Ordering is critical for skeleton-style labels because it determines the logical structure and how connections are built. Attributes can be attached to keypoints to store extra information, such as visibility state (visible, occluded, or outside the frame) or body-part type.

Visibility attributes are particularly important. They allow annotators to mark occluded points while still preserving anatomically consistent estimated locations when your task requires them.

How to perform keypoint and skeleton annotation?

The workflow starts after data collection and preparation, once you have a set of images or extracted video frames ready to label.

Here we use BasicAI Data Annotation Platform* as an example. We have prepared a video guide to help you understand the process more intuitively.

In this guide, we demonstrate 5-point facial keypoint annotation and a simplified human skeleton annotation.

https://medium.com/media/534cd199f9cbea55c52168a0fbaf1e5a/href

The first step is creating a new dataset in the platform and uploading your data. Next, you must define the ontology, which specifies the annotation structure and the classes you will create.

For keypoint classes, ontology definition involves class name, numbering, attached attributes, and display style.

For skeleton annotation, you must build a skeleton template that defines point structure and connectivity. BasicAI Data Annotation Platform* allows you to upload a reference image and draw the skeleton template directly on it.

Then, select all data and enter the annotation interface. For keypoint-only annotation, select the keypoint tool and click each relevant location in the image, placing points at precise coordinates.

For skeleton annotation, the process is more structured. Annotators must place keypoints following the predefined template order so the system can maintain the intended structure.

If connections were defined in the template, the platform draws edges automatically as points are placed, creating a visible skeleton that guides the annotator toward anatomically reasonable configurations.

When points are occluded or invisible, follow your annotation guidelines. If you added visibility attributes during ontology setup, you can estimate possible keypoint positions based on visible neighboring points and anatomical constraints, then add the appropriate attributes.

After labeling, save your work through the interface and exit. The final step is creating an export task to prepare the annotated data for model training. The export process selects which annotations to include, specifies output format, and generates the dataset in your chosen standard (such as COCO JSON).

*Contact us here to customize your privately deployed annotation platform.

How to build high-quality keypoint and skeleton datasets efficiently?

Keypoint labeling is less forgiving than many other annotation tasks. Precision and consistency matter because small point errors propagate into downstream training and can degrade performance across tasks.

If you are preparing training data for pose estimation or similar models, here are several recommendations based on our experience.

These should be addressed in your data annotation guidelines:

Clearly define what each keypoint represents and how to position it in ambiguous cases. For a shoulder keypoint, guidelines should specify whether the point marks the shoulder joint center, the top of the shoulder, or the outermost point of the scapula. Small differences like this are a common source of annotator disagreement.
The human skeleton annotation task needs a strict left/right convention. The guide should state how “left” and “right” map to image coordinates across viewpoints, including frontal views and rotated poses.
Skeleton annotation guidelines should emphasize point ordering. Annotators should know whether they are expected to label the upper body before the lower body, or complete one side of the body before the other, and why that order exists.
Fully visible limbs are rare in real-world footage, so occlusion handling must be standardized. Define when annotators should estimate a point using anatomical constraints and when they should mark it as not visible. The best choice depends on the application. Pose optimization algorithms may need explicit visibility tags, while other systems work better with estimated positions.

Accuracy drives model quality. Efficiency gains mean nothing if they come at the cost of accuracy. For projects building in-house annotation teams, choose a professional annotation platform designed for this type of work.

BasicAI Data Annotation Platform is the tool of choice for many teams like yours. It offers:

Standalone keypoint annotation and structured skeleton annotation tools;
Multi-level attribute labels for building complex ontologies;
Role-based access control for team collaboration;
Batch automated quality checks that flag erroneous annotations;
Private deployment option for maximized data security; and
Export in standard formats like COCO JSON to ensure compatibility with common training frameworks.

For large projects, external vendors can improve throughput. In practice, providers with dedicated annotation teams often deliver higher accuracy than open crowdsourcing, and they tend to handle domain-specific cases better.

When selecting a vendor, verify their accuracy on similar projects, their experience with keypoint labeling in particular, and their ability to keep consistency across large datasets.

How Computer Vision AI Enables Scanless Checkout: Models, Data, and Annotations

BasicAI — Wed, 04 Feb 2026 16:45:05 GMT

A bakery can carry hundreds of items: croissants, bagels, cakes, and seasonal specials…

Cashiers are expected to remember names, prices, and codes for a catalog that changes all the time. Error rates stay high.

Barcodes seem like a solution. But many baked goods need to be bagged and labeled before checkout. That adds labor and time.

This is the problem that led to BakeryScan, a computer vision system from Japan. It identifies items placed on a checkout tray and totals the bill in about one second. As it processes more transactions and encounters new product variants, its accuracy keeps improving.

Vision-based scanless checkout is spreading fast in the retail industry. AI is removing friction from shopping.

In this blog post, we’ll discuss how smart checkout works, and what data and annotations are needed to train these systems.

What is vision-based scanless checkout, and what problems does it solve?

In a standard checkout flow, customers place items on a belt or counter. A cashier (or self-checkout) scans barcodes, then the customer pays. Since barcode scanning arrived in the 1970s, the flow has barely changed.

Wait time affects customer satisfaction and purchase intent. Longer waits mean fewer returning customers and more abandoned carts.

Frictionless checkout eliminates manual item scanning. That shortens checkout time and reduces labor. In practice, it is usually built with RFID or computer vision.

RFID uses small electronic tags attached to products for wireless identification and tracking. But you can’t stick these tags on baked goods, fresh produce, or other bulk items.

Computer vision does not require product modification. Packaged or unpackaged items can be identified by appearance alone, and the system can integrate with payment and inventory management systems. It can reduce transaction time from 3–5 minutes to as little as 5–30 seconds.

Vision-based scanless checkout takes several forms:

Vision checkout terminal: recognizes items placed in a basket or tray and calculates the total.
AI smart scale: weighs items and combines weight with visual recognition to price goods. This is useful for fresh produce.
Smart cart: tracks what is placed into the cart and shows a running total in real time.
“Just Walk Out” store: deploys extensive camera networks throughout the store, tracking every product interaction and charging customers automatically when they leave. No checkout counter is needed.

How do vision-based smart checkout systems recognize items?

A vision checkout system turns images into a price through a multi-stage neural pipeline. Early systems used hand-crafted color and edge features. Modern systems run primarily on convolutional neural networks and vision transformers.

During image processing, systems use LED arrays to control lighting strictly and eliminate shadow interference. The captured image then goes through steps such as white balance, resizing, and contrast normalization.

In the object localization stage, the system first separates products (foreground) from the tray/counter (background), then identifies each item. When items touch or overlap, instance segmentation predicts pixel-level masks for each object to enable accurate counting.

Feature extraction generates a high-dimensional vector encoding each object’s visual attributes. This process must balance capturing subtle differences while ignoring distractions like lighting variations.

Classification matches extracted vectors against a known product database through similarity search. Downstream logic then computes the total and updates the transaction.
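The matching step can be sketched as a nearest-neighbor lookup. This is a minimal illustration, assuming cosine similarity as the metric and a tiny in-memory catalog; the three-dimensional vectors and product names are made up, and real systems use far larger embeddings and an indexed search.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_product(query, catalog):
    """Return (sku, similarity) for the catalog entry closest to the query."""
    best_sku, best_score = None, -1.0
    for sku, embedding in catalog.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_sku, best_score = sku, score
    return best_sku, best_score

# Hypothetical reference embeddings, one per known product.
CATALOG = {
    "croissant": [0.9, 0.1, 0.0],
    "bagel": [0.1, 0.9, 0.1],
    "muffin": [0.0, 0.2, 0.9],
}

sku, score = match_product([0.8, 0.2, 0.1], CATALOG)
print(sku, round(score, 3))
```

A linear scan is fine for a small bakery catalog. At supermarket scale, teams typically swap it for an approximate nearest-neighbor index.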

In production, teams often run in one of two modes:

Authoritative prediction: This is the strategy adopted by “Just Walk Out” stores. The system decides the item identity and price without human verification. This demands very high accuracy and broad training coverage.
Assisted prediction: The system proposes a result and staff confirm it. Accuracy and coverage requirements are looser, and it is a practical way to handle long-tail items. For example, in a bakery with BakeryScan, bakery staff can override or correct the system’s suggestions if needed.

What object detection models work well for retail product recognition?

Retail baskets and shelves are crowded. Many products look alike. Several mainstream architectures suit different situations.

The YOLO family is a common baseline for real-time deployment. One forward pass predicts boxes and classes. Inference can be under 20ms in optimized setups.

Within the family, a YOLOv5 model can be as small as about 27 MB (depending on variant and configuration), which fits edge hardware. Retail-tuned versions may add attention modules to focus on discriminative regions.

Two-stage detectors like Faster R-CNN and Mask R-CNN prioritize accuracy. They first generate region proposals, then classify each one. This naturally reduces false positives and improves localization precision. They often perform well on small objects and large scale variation.

Mask R-CNN adds pixel-level masks. This is valuable when products touch. A cluster of pastries that looks stuck together can still be separated for correct counting.

Deep residual networks like ResNet typically serve as backbone feature extractors embedded in these architectures. Skip connections help optimization in deep nets and support multi-level features from edges and textures to higher-level shapes.

Vision transformers and vision-language models like CLIP represent the latest frontier. These models support strong transfer and, in some setups, zero-shot recognition from text descriptions. They can reduce cold-start pain when new products arrive with little or no labeled data. Compute cost is still a constraint in many checkout deployments.

In real systems, hybrid strategies are common. Teams may use YOLO for fast primary detection and Faster R-CNN to verify uncertain cases, or train specialized models for different categories to balance speed and accuracy.

What type of training data is needed for smart checkout systems?

Model performance is constrained by dataset quality and coverage. Data needs differ by product form factor.

Visual checkout terminals need to capture how items actually look on trays under varying lighting and angles. Studio photos alone are not enough.

Smart carts require in-store capture. Different aisles have different lighting. Items are often occluded by the cart frame or other products.

Just Walk Out stores require extensive overhead images from ceiling cameras. Products may be blocked by customer bodies or captured from extreme angles.

Produce and bakery add another layer. Packaged items can be visually consistent. Apples vary by size, color, and ripeness. Bread varies in crust, shape, and surface texture. Training data must reflect this natural variation.

Dataset scale is the hard part. A small grocery store may have 10,000 SKUs. A large supermarket can have 30,000–50,000. Ideally, each product needs training samples.

Amazon reportedly photographed hundreds of thousands of products for Just Walk Out work, building datasets with millions of images. BakeryScan focuses on baked goods, which allows high accuracy with a smaller, narrower dataset.

A dataset tuned to a target environment often beats a universal dataset that tries to cover everything.

The long tail is a persistent pain point. Roughly 80% of retail revenue often comes from 20% of products. The remaining items include local brands and seasonal specials. Each appears rarely, but together they account for meaningful volume.

A common approach is a tiered plan. High-volume products get hundreds to thousands of images. Medium-volume products get tens to hundreds. Long-tail products get only baseline data and run in assisted mode with staff verification.
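The tiered plan can be expressed as a simple lookup. The sketch below is illustrative only: the volume cutoffs are hypothetical, the image targets repeat the ranges given above, and treating only long-tail items as assisted mode is the one rule taken from the text.

```python
def plan_sku(monthly_units, high_cutoff=1000, medium_cutoff=100):
    """Assign a data-collection tier from sales volume.

    The cutoffs are placeholder assumptions; tune them to the store's
    own sales distribution.
    """
    if monthly_units >= high_cutoff:
        return {"tier": "high",
                "image_target": "hundreds to thousands",
                "assisted_mode": False}
    if monthly_units >= medium_cutoff:
        return {"tier": "medium",
                "image_target": "tens to hundreds",
                "assisted_mode": False}
    # Long-tail items: baseline data only, staff verify the prediction.
    return {"tier": "long_tail",
            "image_target": "baseline only",
            "assisted_mode": True}

# Hypothetical monthly sales per SKU.
sales = {"whole_milk_1l": 4200, "oat_bar": 310, "seasonal_jam": 6}
for sku, units in sales.items():
    print(sku, plan_sku(units))
```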

Transfer learning is also widely used, so features learned from high-volume products help low-volume ones with limited extra data.

What annotation types are required for training checkout vision systems?

Bounding box annotation remains the default. Annotators draw a rectangle around each object. This trains detectors like YOLO and Faster R-CNN and supports counting in crowded scenes.

High-quality boxes should be tight to the object’s outer pixels. Loose boxes shift IoU metrics and can teach the model the wrong spatial cues.
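To make the effect of a loose box concrete, here is a small IoU calculation. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) in pixels; the example coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

tight = (10, 10, 50, 50)  # box hugging the object
loose = (5, 5, 55, 55)    # same object with a 5-pixel margin on every side
print(iou(tight, loose))
```

A 5-pixel margin on a 40-pixel object already drops IoU to 0.64, which is why small labeling habits show up in evaluation numbers.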

Elongated or rotated objects are a special case. A pen or thin bottle placed diagonally forces an axis-aligned box to include a lot of background. That can confuse training. In these cases, rotated bounding boxes can be a better annotation type.
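The amount of background an axis-aligned box picks up can be estimated with basic geometry. The sketch below assumes the object is an ideal rectangle of width w and height h rotated by a given angle; the pen dimensions are hypothetical.

```python
import math

def background_fraction(w, h, angle_deg):
    """Share of the axis-aligned enclosing box that is NOT the object.

    The enclosing box of a w x h rectangle rotated by theta has sides
    w*|cos| + h*|sin| and w*|sin| + h*|cos|.
    """
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    box_w = w * c + h * s
    box_h = w * s + h * c
    return 1.0 - (w * h) / (box_w * box_h)

# A pen roughly 140 x 10 pixels, lying flat and then diagonally.
print(round(background_fraction(140, 10, 0), 3))
print(round(background_fraction(140, 10, 45), 3))
```

For the diagonal pen, most of the axis-aligned box is background, while a rotated box would stay tight around the object.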

Image segmentation is more precise. Each object gets a pixel mask instead of a rectangle. It helps with overlap and contact. Annotation effort is typically 3–5× greater than bounding boxes.

In practice, instance segmentation is used for high-volume products or categories that frequently overlap. Low-volume items or those that typically appear alone use only bounding boxes to control costs.

Semantic segmentation assigns each pixel a class (product, background, hand, tray, cart frame, etc.) but does not separate multiple instances of the same class. It is mainly used for scene context. It can improve robustness by helping the system ignore hands, equipment, and structural parts of the checkout setup.

Detected items must be linked to specific SKUs in the retail inventory database.

Packaged goods can be matched automatically through text recognition. But fresh produce or items without clear SKU markings require manual annotation.

This calls for a hierarchical label scheme (parent/child class or attribute-based labels). This work requires annotators familiar with retail inventory systems.

What public datasets exist for retail product recognition?

Researchers and companies have created several public datasets to support retail product recognition system development. But none fully covers real retail inventory breadth.

RPC: built for automatic checkout. Over 83,000 images across 200 product categories, with both studio-like and cluttered real-world scenes.
SKU-110K: designed for dense shelf detection. 11,762 images, with about 150 objects per image on average, and some images reaching 700.
Grocery Store Dataset: 5,125 natural images captured in real supermarkets using smartphones, covering 81 categories (fruits, vegetables, and packaged items like juice, milk, yogurt).
MVTec D2S: pixel-level labels for instance-aware semantic segmentation tasks. 21,000 images across 60 categories.

Compared to comprehensive general retail datasets, publicly available specialized category datasets remain limited. For models handling baked goods, produce, or specialty merchandise, companies must consider building their own datasets and maintaining them as proprietary assets.

How can training datasets for scanless checkout computer vision systems be built effectively?

BakeryScan recognizes baked goods at 98% accuracy. Amazon’s Just Walk Out processes millions of transactions. Mashgin identifies over 100,000 items in seconds. The foundation for all of them is data. The few seconds at the register are backed by millions of labeled images.

A few practical recommendations, based on experience, for building effective training datasets:

Decide what to cover first. Rather than trying to cover an entire supermarket inventory immediately, focus first on high-volume items. These account for about 80% of checkout volume. Validate the pipeline, then expand coverage.
Define capture specs from the deployment scene. Collect images that match the camera angle, lighting, and background the system will see. Combine in-store photography with controlled studio shooting, so you get both realism and coverage.
Run strict QA and close the loop with production errors. Track failure modes in the field. Add targeted data where the model breaks, and keep updating.

Dense retail annotation is cognitively expensive. Drawing 150 tight boxes on a single shelf image can take an hour. Small mistakes can teach the AI model to ignore valid objects, driving false negatives.

Many AI teams work with specialized data annotation teams for this reason. BasicAI is one example of a provider that handles complex retail data. It emphasizes trained annotators (rather than anonymous crowdsourcing) and structured QA workflows, and it supports multiple label types such as bounding boxes, image segmentation, keypoints, and rotated boxes.

For teams with in-house labeling, BasicAI also offers on-prem smart data annotation platforms with semi-automatic labeling. A model proposes initial labels, then humans verify and correct them. This can increase throughput while keeping label accuracy under control.

Annotation cost depends on label type and precision requirements. Bounding boxes are usually cheaper than semantic or instance segmentation. Large programs may get volume pricing.

When planning smart checkout system development, companies should budget annotation as a significant expense and plan carefully.

AI Vision in Smart Fridges: Models, Data, and Annotation

BasicAI — Thu, 29 Jan 2026 11:12:55 GMT

Why is the next widely adopted AI home appliance likely to be the refrigerator?

Home automation demand has grown over the last few years. Robot vacuums take a large share of the smart home market. Smart lighting, sensor-driven curtains, and integrated security systems are now common in many homes.

Food management is different. Most households still lack reliable information about what they have. People open the fridge several times a day, yet often cannot answer basic questions: what’s inside, what will expire soon, and what meals they can cook from what’s already there.

Samsung spotted this gap and built the Family Hub refrigerator line, which received a 2026 CES Innovation Awards nomination.

With built-in cameras, it can identify and log food in the fridge. Users can check inventory remotely, get expiration reminders, and receive recipe suggestions based on available ingredients.

A fridge can become a coordination point for the kitchen by planning meals, tracking inventory, and supporting grocery decisions. That, in turn, pulls other parts of the smart home into a tighter loop.

In this blog post, we’ll break down the computer vision foundations behind these features, with a focus on training data building, data annotation methods, and the practical limits of deploying vision models in refrigerated environments.

What is a Smart Fridge?

A traditional refrigerator keeps internal temperature in a safe range, so food stays fresh and safe. A smart fridge adds connectivity, sometimes an interactive display, and embedded AI features.

The key difference is a built-in camera system with computer vision that can automatically identify and track food across compartments.

When the door closes, or at scheduled intervals, the on-device or connected vision system captures and analyzes images. It identifies individual items, classifies them by category and type, estimates quantities, and records their relative positions.

This capability depends on deep learning models trained on large, diverse fridge image datasets with precise labels. Without high-quality training data that actually represents real fridge conditions, models will not reach the accuracy needed for reliable household use.

How do smart fridges identify and track food?

Food recognition starts with image capture. Built-in cameras trigger when the door closes. Some systems also capture on a schedule to track inventory changes over time. Images are then sent to an on-device processor or a cloud service.

Captured images go through preprocessing, such as normalization, color correction, and resizing. The goal is to reduce lighting variation and increase contrast in shadowed regions.
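As one illustration of color correction, here is gray-world white balance, a simple classical heuristic. The source does not say which method any product uses, so treat this as an assumed stand-in; pixels are plain (R, G, B) tuples to keep the example dependency-free.

```python
def gray_world_balance(pixels):
    """Scale each channel so the image's average color becomes neutral gray.

    pixels: list of (r, g, b) tuples with values in 0..255.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3
    # Per-channel gain; leave a channel untouched if it is entirely zero.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [
        tuple(min(255, round(p[c] * gains[c])) for c in range(3))
        for p in pixels
    ]

# Two pixels with a warm (reddish) cast, as under a warm-white LED strip.
warm = [(200, 150, 100), (100, 75, 50)]
print(gray_world_balance(warm))
```

Gray-world assumes the scene averages out to gray, which may not hold for a shelf full of one product, so production pipelines often calibrate against the known LED instead.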

Next comes object detection. The model finds food items in the image and outputs bounding boxes, along with location, size, and confidence scores. Then classification assigns each detected item to a specific class.

Depending on the system, labels may include food type, brand, and whether packaging is opened. More advanced systems may estimate freshness or infer expiration-related signals.

Finally, results update an inventory database. Users can view fridge contents in an app, receive expiration reminders, or get recipe suggestions.

Edge processing is fast and privacy-friendly but limited by compute and power. Cloud processing supports heavier models and richer features, but adds latency and privacy concerns. Most products use a hybrid compute architecture.

A common split is lightweight detection and classification on-device for real-time responsiveness, with uploads to the cloud for deeper analysis, model retraining, and advanced recommendation logic.

What computer vision models are used for food recognition?

Object detection is often the backbone, locating all food items in the image.

The YOLO family, especially YOLOv8, is widely used for food detection because it balances speed and accuracy well. Its single-stage design supports real-time inference.

Faster R-CNN (two-stage) can be more accurate in clutter and heavy occlusion, but its compute cost makes real-time deployment harder. EfficientDet is strong on multi-scale feature extraction. RetinaNet uses focal loss to address class imbalance. Mask R-CNN adds pixel-level instance masks, which helps when foods overlap.
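The focal loss idea mentioned for RetinaNet can be shown for a single binary prediction. This is a sketch of the standard formula, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), with the commonly cited defaults gamma = 2 and alpha = 0.25; it is not tied to any particular framework.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class.
    y: ground-truth label, 1 (object) or 0 (background).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0 - eps)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background example contributes far less than a hard positive.
easy_negative = focal_loss(0.05, 0)
hard_positive = focal_loss(0.10, 1)
print(easy_negative, hard_positive)
```

The (1 - p_t)^gamma factor shrinks the loss on well-classified examples, so the many easy background regions in a fridge image do not swamp the few hard food items.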

Classification models run after detection. They label the cropped regions with specific food categories and status labels.

MobileNetV3 is a common choice for edge deployment. Depthwise separable convolutions keep the model small (often under 6 MB) while still reaching strong accuracy.

ResNet-18 and EfficientNetV2 are larger but can offer a better accuracy–efficiency trade-off in many setups. Because power and cost are tight, manufacturers often compress models. Knowledge distillation is common, where a larger teacher model guides a smaller student model.
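Knowledge distillation is often implemented by matching temperature-softened class distributions. The sketch below assumes that common formulation (KL divergence between teacher and student softmax outputs, scaled by the squared temperature); the logits and the temperature value are made up, and real training would combine this term with the usual label loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2

# Hypothetical logits for three food classes.
teacher = [6.0, 2.0, 0.5]
student = [3.0, 2.5, 1.0]
print(distillation_loss(student, teacher))
```

A higher temperature exposes how the teacher ranks the wrong classes, which is the extra signal a small student model learns from.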

Among emerging approaches, Vision Transformers show stronger generalization to novel food categories through self-attention mechanisms. Few-shot learning enables recognition of rare foods from just a handful of examples.

What data do fridge food detection models need?

Fridge lighting varies a lot. LED strips differ in color temperature, placement, and brightness, which changes how food looks. Training images should cover warm white and cool white lighting, fully lit scenes, and partial shadow cases, so the model transfers across hardware designs.

The dataset also needs to reflect what people actually store: packaged foods and condiments, drinks in different containers, and leftovers on plates or in storage boxes.

Food appearance changes over time. Produce loses saturation. Leftovers can discolor or grow mold. If a system offers spoilage detection, the dataset must include multiple freshness stages. You can capture this by collecting data across time or by using image augmentation and synthesis to reflect appearance drift.

Real-world complexity challenges model training. Items occlude each other. Some are only partially visible. Transparent containers create visual interference. Storage habits also vary widely between households. Some people organize neatly, while others stuff items randomly. Training data should cover this range.

Many important items are small in pixel area. Detection models need enough resolution to localize these. Training images should be at least the deployment resolution, often 1080p or higher, and include multiple viewing angles to reflect different camera mounting positions.

What data annotation types are needed to train food recognition models?

A production dataset for smart fridge vision usually needs multiple label types. Each serves a specific purpose in the machine learning pipeline.

Bounding box annotation

Bounding boxes are fundamental for object detection. Annotators draw a rectangle around each food item and assign a class label. Box consistency is critical. Loose, inconsistent, or misaligned boxes add noise that degrades detector training.

Image segmentation

Segmentation assigns a class to each pixel (or separates object instances), producing precise boundaries. This matters in fridges where objects overlap.

Semantic segmentation labels pixels by class, which improves boundary accuracy in clutter. Instance segmentation also separates multiple instances of the same class. This is important for inventory counting.

Segmentation is more expensive to label, but modern tools like the BasicAI Data Annotation Platform can speed it up with semi-automatic workflows.

Classification and attribute labels

Classification assigns food type and status to detected items. Some systems use an ontology (a hierarchical taxonomy) that defines coarse categories first, then finer subcategories.

Attribute labels capture extra descriptors, such as:

opened / unopened packaging,
partially consumed state,
estimated remaining quantity,
visible quality status.

Metadata can also be valuable: camera position, fridge model, lighting condition, and shelf location. These help the system learn common spatial organization patterns.
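One way to keep box, class, attributes, and metadata together is a single record per item. The field names and allowed quality values below are hypothetical, chosen to mirror the attributes listed above rather than any specific platform's export format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

QUALITY_STATES = {"fresh", "aging", "spoiled"}  # hypothetical label set

@dataclass
class FridgeItemLabel:
    food_type: str
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max
    opened: Optional[bool] = None            # None when not applicable
    partially_consumed: bool = False
    remaining_fraction: Optional[float] = None
    quality: Optional[str] = None
    camera_position: str = "unknown"
    shelf: str = "unknown"

    def problems(self):
        """Return a list of basic consistency issues in this record."""
        issues = []
        x_min, y_min, x_max, y_max = self.box
        if x_max <= x_min or y_max <= y_min:
            issues.append("box has no area")
        if self.remaining_fraction is not None and not (
            0.0 <= self.remaining_fraction <= 1.0
        ):
            issues.append("remaining_fraction must be between 0 and 1")
        if self.quality is not None and self.quality not in QUALITY_STATES:
            issues.append("unknown quality value")
        return issues

milk = FridgeItemLabel("milk", (120, 40, 210, 300), opened=True,
                       remaining_fraction=0.5, quality="fresh",
                       camera_position="door", shelf="top")
print(milk.problems())
```

Running a check like this at export time catches malformed records before they reach training.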

What datasets are available for training food detection models?

Several public datasets are useful starting points:

The Food-101 dataset: one of the most widely used food recognition benchmarks. It includes 101,000 images across 101 classes. Note that images were not shot in household fridge environments. Each image shows only a single item with no occlusion.
Fridge Food Images: 2,377 labeled photos of common fridge items (apples, bananas, milk, eggs, vegetables, etc.) on fridge shelves. It captures real household lighting and occlusion patterns.
Refrigerator Contents: 1,162 files across 7 classes: banana, bread, egg, milk, potato, spinach, tomato.
RP2K (Retail Product 2K): 500,000+ product photos across 2,000 retail items on supermarket shelves. It can be useful for fridge vision because packaged items (yogurt cups, milk cartons, juice boxes) look similar in fridges and stores.

These datasets can help with early experiments, but they do not transfer directly to production smart-fridge systems. Real fridge datasets are rare.

Fridge environments have distinctive LED lighting, fixed viewpoints, and dense item packing with many non-food products. Public datasets also lack the depth of labeling needed for advanced features like freshness scoring or packaging-type classification. They also do not reflect the long-tail distribution of real household food.

As a result, companies building smart-fridge systems typically need proprietary datasets captured from real fridge images, with fridge-specific attributes labeled. This becomes a defensible advantage.

Building that dataset takes real engineering effort, but it produces performance and coverage that public data cannot.

How to build training datasets for smart fridge vision right?

Building production training datasets involves two main phases: acquiring raw data through image collection, and annotation through careful human labeling. Smart fridge systems usually need higher annotation quality than many other vision applications.

Data can come from prototype or production fridges in test facilities, or from beta user devices with explicit consent. Partnerships with food retailers or commercial kitchens offer another source.

For data annotation, outsourcing to professional annotation service providers can bring specialized expertise, scalable throughput, and established quality control.

Companies such as BasicAI offer managed workflows and a stated 99% accuracy guarantee, with pricing that varies by label type (bounding boxes, segmentation, classification).

Crowdsourcing platforms often fail to meet the quality bar. A managed workforce costs more, but the investment pays off because label errors propagate into the model and can be hard to diagnose later.

In-house teams can respond faster to requirement changes and can align labels tightly to product needs, but they require investment in tooling and processes.

Annotation tools can reduce workload significantly. Semi-automatic segmentation tools can refine boundaries using edge cues. Auto-label suggestions can propose boxes or classes for human review.

In practice, these tools can reduce labeling time by 20% to 50%. The BasicAI Data Annotation Platform with these features offers private deployment for teams that need tight control.

A production training dataset commonly takes two to six months to build. A small pilot (about 1,000 to 5,000 images) can be completed in one to three months. A system aiming for broad coverage often needs 50,000 to 500,000 images, with labeling continuing for months.

Ongoing dataset management and versioning are also critical. High-performing systems keep a living dataset and expand it as new edge cases appear in the field.

Computer Vision for Smart Vending Machines: How It Works and What Data You Need

BasicAI — Fri, 23 Jan 2026 10:15:20 GMT

Traditional vending machines run on simple mechanics. A customer presses a physical button tied to a fixed slot. A motor pushes the product forward into the pickup bay.

These machines offer limited selection. The shopping experience is constrained by the physical layout of the mechanical system.

Computer vision enables a grab-and-go flow. Customers browse shelves freely and pick items just like in a store. The system calculates charges in real time and completes the transaction automatically when the customer closes the door.

In this post, we’ll cover the computer vision methods behind this process, along with the data and annotation required to train these systems.

What is a smart vending machine?

A smart vending machine is an automated retail unit that operates without staff. Customers complete purchases on their own. Compared with classic vending machines, it combines IoT, edge AI vision, and mobile payment.

The smart cabinet is a typical example. Customers verify their identity first. The door unlocks. They open it, take products from the shelves (or pick something up then put it back), and close the door. The system then identifies what was taken and charges.

In practice, there are three main technical routes:

Shelf weight sensors detect product removal by measuring weight changes.
RFID tags attached to each item are scanned when the door closes.
Vision-based recognition uses cameras inside the cabinet combined with AI algorithms to infer what the consumer took.

Here, let’s focus on the vision-based approach. It avoids mechanical failures associated with weight sensors and the cost of damaged electronic tags.

How does vending machine vision detect which products were taken?

The industry typically uses two approaches: static vision and dynamic vision.

Static vision compares shelf images before and after the customer opens the door to identify which products are no longer visible. This often requires one camera per shelf for reliable coverage.

In real deployments, subtle lighting changes or product displacement can trigger false positives. Static methods also struggle to capture intent changes in-session, such as picking an item up and placing it back.
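At its simplest, the static comparison reduces to a multiset difference between two detection results. The sketch assumes a detector has already produced a list of SKU names for each shelf image; the product names are made up.

```python
from collections import Counter

def static_diff(before, after):
    """Items present before the session but missing afterwards.

    before / after: lists of SKU names detected on the shelf.
    Counter subtraction keeps only positive counts.
    """
    return dict(Counter(before) - Counter(after))

before = ["cola", "cola", "water", "chips"]
after = ["cola", "chips"]
print(static_diff(before, after))
```

Note what this cannot see: an item picked up and put back leaves both lists identical, and a missed detection in the second image looks exactly like a purchase.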

Dynamic vision does not compare static shelf states. It continuously detects and tracks customer interactions, especially hand motion, while the door is open.

Dynamic systems handle occlusion better. Even if a product region is blocked, hand tracking can continue and preserve the interaction timeline.

Most commercial smart vending machines combine static and dynamic vision to maximize robustness and accuracy. When dynamic vision detects a hand removing a product from the shelf, the system also analyzes images from before and after the door closes to confirm the product has actually left the shelf.

What computer vision models are used in smart vending machines?

Real-time inference and edge deployment are essential. The system must complete recognition and billing within seconds after the customer closes the door. Models run on edge devices inside the cabinet. Several model types are typically involved.

Object detection models form the foundation of the vision system. They locate products on shelves and generate bounding boxes with SKU labels. Common choices include Faster R-CNN (often with a ResNet50 backbone) for higher accuracy and the YOLO series (such as YOLOv8 and YOLOv9) for faster inference. YOLO models are better suited for edge deployment.

Hand detection + action recognition models track hand position and pose in real time. They identify interactions such as reaching, grasping, and returning items. More advanced systems use hand skeleton detection (often 21 key points). This turns object detection into active behavior understanding.

Instance segmentation models, like Mask R-CNN, provide pixel-level object masks rather than simple bounding boxes. They help when items overlap or have irregular shapes. Due to higher computational cost, segmentation is typically reserved for the verification stage.

Image classification models serve as confirmation after object detection. They determine exactly which product a detected item is. This is critical for distinguishing visually similar items, such as different flavors of the same beverage.

What data is needed to train a vending machine vision system?

A production-grade system needs a large, carefully collected image dataset. Operators usually treat these growing datasets as proprietary.

Industry best practice typically calls for around one hundred different images per SKU. A mid-sized vending machine dataset may require 10k to 30k training images, and real-world variation often pushes that higher.

Smart vending machines operate in diverse environments with varying lighting. Data should cover multiple camera angles and illumination conditions to reduce domain shift.

Reflective materials (such as metal cans and transparent bottles) and different packaging types need targeted capture, so the model learns stable features rather than one specific highlight pattern.

Training data must also include dense shelf scenes covering realistic situations like product occlusion, overlap, and tilting. Because product catalogs update, operators may need to add thousands of images monthly and retrain models regularly.

What types of annotation are required for retail product recognition?

Bounding Box Annotation

Annotators draw bounding boxes around each product and assign SKU labels. In many pipelines, boxes are expected to align closely with the visible edges.

Intersection over Union (IoU) is the primary quality metric. Production workflows typically enforce minimum IoU thresholds. Detailed guidelines are needed to handle edge cases such as product shadows and partially visible items.

Polygon Annotation and Segmentation

Polygon annotation uses connected points to outline irregular product shapes (such as bottles or unusually shaped snacks). Instance segmentation requires pixel-level masks that precisely identify each product.

Segmentation takes three to five times longer than bounding box annotation. Most commercial systems rely primarily on bounding boxes and use segmentation only for specific scenarios, such as distinguishing similar adjacent products.
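To see what a polygon captures that a box does not, compare the polygon's area with the area of its enclosing box. The sketch uses the shoelace formula; the bottle-like outline is a made-up example in pixel coordinates.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the shape
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# Rough bottle outline: narrow neck on top of a wider body.
bottle = [(4, 0), (6, 0), (6, 4), (10, 6), (10, 20), (0, 20), (0, 6), (4, 4)]
x_min, y_min, x_max, y_max = bounding_box(bottle)
box_area = (x_max - x_min) * (y_max - y_min)
print(polygon_area(bottle), box_area)
```

The gap between the two areas is background that a box-trained model sees as part of the product.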

Hand Keypoint Annotation

Hand detection systems require annotation of 21 anatomical keypoints (wrist, knuckles, fingertips, and others). Annotators need anatomical knowledge. Even minor errors can corrupt the entire pose representation. Large-scale datasets may use automated tools to generate initial annotations, but production datasets still require human review and correction.
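A basic automated check can catch incomplete or out-of-frame keypoint sets before human review. The keypoint names below are a hypothetical scheme (wrist plus four points per finger, giving 21); real projects should follow the naming in their own annotation guideline.

```python
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
# Hypothetical naming: 1 wrist + 5 fingers x (3 joints + tip) = 21 keypoints.
KEYPOINT_NAMES = ["wrist"] + [
    f"{finger}_{part}"
    for finger in FINGERS
    for part in ("joint1", "joint2", "joint3", "tip")
]

def validate_hand(keypoints, width, height):
    """Check one annotated hand.

    keypoints: dict mapping name -> (x, y, visible).
    Returns a list of human-readable problems (empty if none).
    """
    errors = []
    missing = [name for name in KEYPOINT_NAMES if name not in keypoints]
    if missing:
        errors.append(f"missing keypoints: {missing}")
    unknown = [name for name in keypoints if name not in KEYPOINT_NAMES]
    if unknown:
        errors.append(f"unknown keypoints: {unknown}")
    for name, (x, y, visible) in keypoints.items():
        if visible and not (0 <= x < width and 0 <= y < height):
            errors.append(f"{name} is marked visible but lies outside the image")
    return errors

hand = {name: (100.0, 120.0, True) for name in KEYPOINT_NAMES}
print(validate_hand(hand, width=640, height=480))
```

Checks like this only catch structural errors. Whether a point sits on the correct knuckle still needs a trained reviewer.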

Classification

Semantic labels can be at the individual SKU level or grouped into broader categories. Maintaining naming consistency across hundreds or thousands of SKUs is challenging. Inconsistencies lead to blurred model decision boundaries. Professional platforms use hierarchical classification systems (category → subcategory → specific product) and track annotation consistency.
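Naming drift is easy to detect mechanically. This sketch groups labels that differ only in case, spacing, or punctuation; the normalization rule and the example SKU names are assumptions, not a description of any platform's built-in check.

```python
import re
from collections import defaultdict

def normalize_label(name):
    """Lowercase and collapse every non-alphanumeric run to one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")

def find_inconsistent_labels(labels):
    """Group labels whose normalized forms collide.

    Returns {normalized_name: [variants]} for groups with 2+ spellings.
    """
    groups = defaultdict(set)
    for label in labels:
        groups[normalize_label(label)].add(label)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }

labels = ["Cola Zero 330ml", "cola-zero-330ml", "Water 500ml", "Water 500ml"]
print(find_inconsistent_labels(labels))
```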

What datasets are available for training retail product recognition models?

Several public datasets target retail product recognition tasks. They provide labeled benchmark data for researchers and developers to build and evaluate computer vision models.

SKU-110K contains 110,000 categories with dense shelf images. It is suitable for scenarios where products are tightly packed, touching, or partially overlapping.
GroZi-120 is a classic retail recognition dataset. Its main feature is paired data: standard white-background product images alongside real shelf photos.
HoloSelecta contains 295 real product images from vending machines in the Zurich area, covering 109 categories with labeled bounding boxes.
Take Goods from Shelves (TGFS) is a hierarchical large-scale object detection dataset with 38,000 images, divided into 24 fine-grained categories and 3 coarse-grained categories.
Toward New Retail collected over 30,000 images from unmanned retail containers. It includes 155,153 manually annotated instances, focusing on beverage recognition.

Public datasets help, but most commercial operators cannot reach required accuracy using only public training data.

Operator catalogs contain many SKUs that are missing or underrepresented in public sets. Deployment environments also introduce specific lighting, shelf materials, and background characteristics. When models are deployed in environments different from training conditions, accuracy drops.

In regulated settings such as healthcare vending, companies may need auditable, company-specific datasets, and business constraints may rule out reliance on generic public data.

How to build a training dataset for vending machine AI vision quickly?

Data Collection

Prioritize the real deployment environment. Capture products across multiple viewpoints, so the model generalizes to any shelf position and camera angle in the cabinet.

A Stanford study emphasizes that even for hand-centric tasks, collecting data from multiple camera positions and viewpoints significantly improves model generalization.

Data Annotation

A single operator may need thousands of newly annotated images monthly to keep up with changing product assortments. Many teams partner with external labeling services.

Crowdsourcing can be cost-effective for simple, objective tasks. But workers unfamiliar with a specific retail catalog may confuse similar products and mislabel SKUs. Anonymity can also make systematic quality issues harder to isolate.

Managed data labeling teams provide higher efficiency and comprehensive quality assurance. BasicAI is one company offering fully managed annotation services, including bounding box annotation, segmentation, and SKU-level labeling, with accuracy guarantees above 99%. For mid-sized datasets, expect professional annotation to take two to four weeks.

For organizations with sufficient in-house annotation staff, internal annotation using smart annotation tools and platforms offers greater control.

The BasicAI data annotation platform provides model-assisted annotation. Pretrained models generate initial annotations, which human annotators then review and correct rather than annotating from scratch.

The platform offers private deployment options for organizations with confidentiality requirements, so proprietary catalogs and images can stay on private infrastructure.

Continuous Improvement

The initial dataset and first training run mark only the beginning of production deployment for a smart vending machine. As new products arrive, environments shift, and monitoring surfaces failure cases, the system needs ongoing iteration.

Vision algorithm engineers may need to add more diverse lighting conditions to training data, increase images of problem products, or retrain models to adapt to detected environmental changes.

FAQs

How do vending machine vision systems handle visually similar products?

A common approach is to first run object detection to locate products, then apply fine-grained classification to analyze subtle features such as label colors and text patterns. Distinguishing similar products requires far more training images per variant than basic detection. For cases with confidence scores near decision boundaries, the system triggers secondary verification or human review.
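As a rough illustration of the last step, the sketch below flags a prediction for secondary verification when the gap between the top two class probabilities is small. The function name, the SKU names, and the 0.15 margin are hypothetical choices for this example, not values from any particular system.

```python
def route_prediction(class_probs, margin_threshold=0.15):
    """Return (top_label, needs_review) from a dict of class -> probability.

    A prediction is sent to review when the top two classes are closer
    than margin_threshold (a hypothetical value that needs tuning).
    """
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    needs_review = (top_p - second_p) < margin_threshold
    return top_label, needs_review


# Hypothetical classifier outputs for two look-alike SKUs.
clear = route_prediction({"cola_regular": 0.91, "cola_zero": 0.07, "water": 0.02})
ambiguous = route_prediction({"cola_regular": 0.52, "cola_zero": 0.46, "water": 0.02})
```

In this sketch the first prediction is accepted directly, while the second is routed to verification because the two cola variants score almost the same.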

What else can AI vision systems in vending machines do?

Beyond product recognition, AI vision systems can check brand purity by detecting competitor products on brand-exclusive shelves or verifying compliance with display agreements. They can also generate restocking recommendations based on inventory levels. Some advanced systems analyze customer behavior patterns, such as which products are picked up and returned, providing data to support assortment optimization.

How do smart vending machines handle products that customers pick up and then return?

Dynamic vision continuously tracks hand motion and can distinguish between taking and returning behaviors. The system monitors the complete path of a product from the moment it is picked up until it leaves the shelf area. If the item is returned to the original location (or placed elsewhere), the model detects the reverse action and updates the cart and inventory state. This is critical for preventing false charges.
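The cart-update logic can be sketched in a few lines. This assumes an upstream vision model already emits a stream of ("take" or "return", SKU) events; the event format and the function name are hypothetical.

```python
from collections import Counter


def apply_events(events):
    """Fold take/return events into a cart (SKU -> quantity).

    events: iterable of (action, sku) tuples, where action is "take" or
    "return". This event format is an assumption for illustration.
    """
    cart = Counter()
    for action, sku in events:
        if action == "take":
            cart[sku] += 1
        elif action == "return" and cart[sku] > 0:
            # Only undo a take that actually happened in this session.
            cart[sku] -= 1
    # Drop SKUs whose quantity went back to zero.
    return {sku: qty for sku, qty in cart.items() if qty > 0}


cart = apply_events([("take", "A"), ("take", "B"), ("return", "A")])
```

Here the customer is charged only for item B, since the take of item A was reversed.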

CES 2026 Innovation Awards: 15 Computer Vision Use Cases That Signal What’s Next

BasicAI — Fri, 09 Jan 2026 09:39:41 GMT

CES 2026 closed on January 9, 2026. Over three days in Las Vegas, more than 4,100 exhibitors showed what they’ve been building, from large public companies like Google, Amazon, Samsung, and Sony to promising early-stage startups.

At BasicAI, we follow this event closely. We were glad to see several of our customers showcasing their latest work, and we congratulate all the teams recognized in this year’s CES Innovation Awards.

In this post, we want to share 15 computer vision AI use cases from the 2026 award winners and honorees. These cases reflect the current state of commercial AI and hint at what’s coming next.

CES 2026 and the CES Innovation Awards

CES (Consumer Electronics Show) is one of the largest and most influential annual technology events in the world. Organized by the Consumer Technology Association (CTA), it takes place every January in Las Vegas.

Since its launch in 1967, CES has evolved from a showcase for televisions and radios into a global benchmark for emerging technologies spanning artificial intelligence, automotive, digital health, and beyond.

The CES Innovation Awards are CTA’s annual program recognizing outstanding design and engineering in consumer technology. The award carries significant weight in the industry and is widely regarded as a stamp of approval for the year’s best products.

2026 is widely seen as a breakout year for Physical AI, and CES reinforced that view. Many award-winning products run on local AI chips for real-time inference. Computer vision applications were especially prominent.

RAPA: Autonomous driving perception with multiple 4D imaging radars

2026 Best of Innovation in Artificial Intelligence
By Deep Fusion AI

4D imaging radar is gaining traction as a promising sensor option in autonomous driving perception stacks. It can approach LiDAR-like spatial resolution at a much lower cost, while naturally providing velocity and working in all weather. But radar point clouds are sparse, and multipath interference is a persistent limiter.

RAPA takes a software-defined approach to fusing multiple 4D radars. It learns radar-signal physics for adaptive filtering, then uses an attention-based deep learning model trained on its own dataset to deliver real-time, high-precision detection and tracking. It is designed to run efficiently on edge embedded platforms.

This work supports the engineering feasibility of radar-only perception. If 4D radar could carry primary perception on its own, it would significantly reduce the BOM cost of autonomous driving systems and open the door to scaled deployment in cost-sensitive and harsh-environment verticals such as unmanned surface vessels and robotics.

VIXallcam: All-weather vision enhancement for commercial vehicles

2026 Honoree in Smart Communities
By IntelliVIX Co.,Ltd

In commercial vehicle operations, degraded visibility in severe weather is a major contributor to crashes.

VIXallcam is an AI vision camera built for long-haul trucks, mountain routes, and special-purpose vehicles. It keeps delivering a clear view in dense fog, heavy rain, blizzards, tunnels, and even complete darkness where standard cameras fail.

The system detects pedestrians, vehicles, and road obstacles up to 200 meters ahead, buying reaction time. It adapts automatically to changing weather without manual tuning.

Logistics accidents are expensive. A few seconds of earlier risk exposure can translate into measurable reductions in incident rates. VIXallcam fills an ADAS gap in edge conditions, helping fleets keep night and bad-weather schedules without trading away safety.

Argus-D: Multimodal disaster warning on smart cameras

2026 Honoree in Artificial Intelligence, Products in Support of Human Security for All
By IIST Co., Ltd

As extreme climate events become more frequent, traditional security cameras that focus on post-incident evidence are no longer enough. Sending every video stream to the cloud for inference is also costly and fails under outages and congestion.

Argus-D embeds Physical AI and multimodal sensing into a standard surveillance-camera form factor. It detects wildfires, building collapses, and earthquakes in real time, with seamless integration into smart IoT infrastructure.

For fire detection, it reports accuracy above 99%. In earthquake scenarios, it uses P- and S-wave information to estimate direction and distance to the source, then triggers faster response loops through IoT coordination.

Argus-D is a concrete example of edge intelligence landing in public safety. By pushing multimodal perception and inference into the camera, it can fire alerts on millisecond timescales, creating a critical window for evacuation and emergency response.

Real-time drunk driving detection from driver behavior

2026 Honoree in Vehicle Tech & Advanced Mobility
By Smart Eye AB

Drunk driving is not always visible to external sensors, and traditional tests (breath, blood) don’t fit into everyday driving workflows.

Smart Eye added alcohol impairment detection to its commercial driver monitoring system. It is positioned as the first mass-production solution in the industry that infers alcohol impairment from real-time behavior analysis rather than physiological sampling.

The system analyzes subtle patterns in eye movement and eyelid dynamics to estimate whether the driver is under the influence. The timing also matches a regulatory shift, with programs such as Euro NCAP bringing impairment detection into evaluation criteria.

Drunk driving causes more than 12,000 deaths per year in the US. Embedding detection into passive driver monitoring changes what’s practical for commercial fleets and may also shape future insurance pricing models built around non-intrusive risk signals.

AA-2: An indoor delivery robot that can ride elevators

2026 Honoree in Robotics
By GoLe-Robotics

The growth of late-night delivery has created new risks for high-end apartments: driver fatigue, security exposure, elevator congestion, and privacy concerns for residents.

AA-2 is an autonomous delivery robot designed for gated residential communities. Through integration with the EV-1 elevator interface, it can call an elevator and ride it autonomously.

The robot uses flexible materials to absorb impact if it contacts residents or objects. After delivery, it can deflate for compact storage, and when it returns to the charging station it recharges both the battery and the airbag system.

Most delivery robots have focused on the outdoor “last mile.” AA-2 reflects product thinking around the “last 100 meters.” Indoor vertical and horizontal mobility, integration with building systems, and safe operation in private spaces force a different set of engineering trade-offs. Vision and perception here bias toward close-range obstacle avoidance, semantic scene understanding, and safe human-robot coexistence.

AEON: A collaborative humanoid robot for industrial sites

2026 Honoree in Robotics
By Hexagon

Aging workforces and structural labor shortages are pushing industry to revisit the practical value of humanoid robots.

AEON is designed to work alongside human workers, not replace them. Wheeled mobility improves efficiency and runtime, and multi-sensor fusion and spatial intelligence support navigation and manipulation. Its dexterous hands and skills cover machine tending, inspection, asset capture, and digital modeling, with support for teleoperation and assisted decision-making.

Notably, the robot’s appearance and behavior have been refined with psychological research to improve human acceptance.

Humanoids have drawn heavy capital in recent years, but many products remain in demo stage. AEON signals a shift from lab proof points toward industrial deployment. The “collaborator” positioning also avoids the social backlash implied by replacing workers, and that pragmatic product framing is worth studying.

Bedivere: Autonomous navigation robot for the visually impaired

2026 Honoree in Artificial Intelligence, Products in Support of Human Security for All
By AidALL Inc.

Millions of people with visual impairment face ongoing challenges in independent mobility. For this population, the visual function needed is not positioning but continuous, safe walking.

Bedivere is a portable autonomous navigation robot that runs environmental perception and path planning on-device. Local AI interprets obstacles and free space in real time, produces an actionable safe route, and works fully offline, avoiding GPS drift, signal loss, and privacy constraints.

Guide dog training takes up to two years and costs are high. Global supply falls far short of demand. Bedivere aims to approach guide-dog utility without training or long-term care. It is lightweight, quick to learn, and designed for complex indoor and outdoor environments.

Instead of chasing a general-purpose humanoid narrative, Bedivere focuses on a sharply defined user need. This makes productization and scale more achievable and makes it a meaningful embodied AI attempt in accessibility.

SafeZone: Vision AI for bus door safety

2026 Honoree in Vehicle Tech & Advanced Mobility
By oToBrite Electronics, Inc.

Bus door pinch injuries look rare at the individual level, but globally they drive real harm and litigation. Traditional pneumatic anti-pinch systems have blind spots and struggle with soft objects like clothing, backpacks, or limbs.

SafeZone uses a single-camera module plus an ECU. A deep learning model monitors the door area in real time and prevents closure while passengers are boarding or exiting. It covers a 30×200 cm detection region. The tight integration also makes it applicable to cranes, forklifts, garbage trucks, and other industrial equipment with pinch hazards.

SafeZone pushes computer vision into an overlooked safety niche. A cost-controlled, retrofit-friendly vision module can reduce incident rates and provide a general safety layer for vehicle intelligence upgrades, aligned with broader smart transportation infrastructure trends.

Multi-tron: AI waste sorting at collection points

2026 Honoree in Smart Communities
By Aetech

Conventional waste sorting facilities are hard to place inside cities due to noise, footprint, and infrastructure demands. Waste often travels long distances to centralized plants, adding carbon emissions and increasing pollution risk.

Multi-tron changes the layout by integrating pre-processing and AI sorting into a compact, modular unit that can be deployed in mixed-use buildings, public facilities, or even outdoor event sites.

The equipment line is 30% shorter than conventional solutions. Low noise and clean industrial design reduce the visual and psychological friction of installing such systems in urban environments, enabling earlier separation at the source for higher-purity recyclables.

By moving sorting to the point of waste generation, Multi-tron changes the economics of the recycling value chain. The distributed-infrastructure model is replicable, especially for communities and developing regions without strong centralized facilities.

Family Hub: An AI vision refrigerator

2026 Honoree in Smart Home
By Samsung Electronics America

Food recognition in a fridge fails on the long tail. Fresh ingredients, packaging variance, occlusions, changing lighting, and user placement habits break closed-category models quickly.

Samsung’s four-door refrigerator is the first of its kind to integrate a multimodal large language model. Its built-in AI Vision system can recognize an unlimited range of ingredients and keep a live food list, covering fresh items, packaged goods, and prepared foods. It then turns recognition into recipes, shopping lists, and energy management through recommendations and dialogue, using on-device AI and smart home integration to complete the household interaction loop.

This product marks the point where LLMs and vision AI move into major appliances. “Unlimited items” recognition depends on vision-language generalization. From a consumer electronics view, Samsung is redefining the refrigerator from passive storage into an active household management surface, with downstream implications for food retail and health data services.

Selto: A general automation agent built on visual UI understanding

2026 Honoree in Artificial Intelligence
By INFOFLA

Many government websites, legacy systems, and enterprise intranets lack stable APIs. Traditional RPA relies on scripts and fixed coordinates, and it breaks when the UI changes.

Selto treats the UI as an environment. A vision-language model perceives interface elements and performs clicks, typing, and decisions like a human, enabling end-to-end task execution. It also self-learns from task logs to reduce configuration cost, and it supports both cloud and on-prem deployments for security and compliance.

It shifts computer vision targets from the physical world to pixel-based digital interfaces, and it makes interaction the output. This is a high-value CV domain with low sensor noise but high distribution shift. It will push work on UI understanding, temporal planning, controllability, and failure recovery.

TlatFarm: Drone-driven autonomous smart agriculture

2026 Honoree in Construction & Industrial Tech
By Turbine Crew Inc.

Precision agriculture has proven potential, but in remote farms with weak infrastructure it is often constrained by power availability, connectivity, and operational complexity.

TlatFarm combines autonomous drone flight, wireless charging, multispectral crop monitoring, and edge AI analysis into a turnkey system. Drones run multiple missions per day, capturing RGB, NDVI, infrared, and multispectral imagery, while also performing spraying tasks. AI models process imagery and in-field sensor data in real time to predict pests, nutrient issues, and optimal irrigation and harvest timing, with accuracy up to 92%.

Agriculture is a domain where spatiotemporal data density sets the ceiling. Low-cost, high-frequency, consistent multimodal observation determines whether CV models can move from recognition to prediction. Closing the loop across capture, power, and analysis also helps AI move beyond demo plots into regions with limited infrastructure.

SHOSABI: 3D motion sensing with learning feedback

2026 Honoree in Sports & Fitness
By SHOSABI inc

Most fitness and rehab devices focus on strength or surface physiological metrics, while ignoring brain-body coordination, which sits closer to the foundation of human performance.

SHOSABI is a training tool built on a patented 3D motion sensing technology that captures over one million 3D data points per second. It objectively evaluates coordination, stability, and left-right balance, then generates personalized training plans. Its adaptive feedback engine switches between voice, visual guidance, or immersive 3D rotational instruction based on user state, optimizing the cognitive-motor learning process.

Behind SHOSABI is more than a decade of research from the University of Tokyo and Mitsubishi Chemical Group, plus over ten licensed patents. It effectively defines a new motion science subcategory.

Competitive sports and aging health are both recognizing the value of coordination training, but objective measurement has been limited. SHOSABI packages high-precision motion sensing and AI feedback in a consumer product form. It can serve professional athletes optimizing performance and ordinary people preventing musculoskeletal decline. Its data assets may also catalyze a new generation of AI applications based on movement intelligence.

TORAH VISION AI: 16-bit high-resolution chest X-ray decision support

2026 Honoree in Artificial Intelligence
By Torah Co., ltd

Chest X-ray remains one of the most common imaging exams, but early lesions can be subtle, low-contrast, and highly reader-dependent.

TORAH VISION AI uses high-definition Torah data and a ResNet50-based deep learning model to analyze chest X-rays automatically. It can identify 14 common thoracic pathologies as defined by the ChestX-ray14 dataset, including cardiomegaly.

The system outputs AI findings with corresponding clinical recommendations, integrates with the Medidata platform for a precision diagnostic workflow, and uses Biovia solutions to improve training data quality.

The core proposition here is treating resolution, dynamic range, annotation quality, and deployment platform as equally important system components. This preserves more grayscale detail, helping models capture subtle pathological signs. This is a technical direction worth watching in medical imaging AI.

Strutt ev¹: A personal mobility device with environmental perception

2026 Best of Innovation in Vehicle Tech & Advanced Mobility, 2026 Honoree in Accessibility & Longevity
By Strutt Pte. Ltd.

In mixed indoor-outdoor use, risk for personal mobility devices comes from more than speed. Tight spaces, obstacles, and social interaction with crowds matter just as much.

Strutt ev¹ brings LiDAR and AI algorithms, derived from autonomous driving, into a personal mobility device. Its Co-Pilot system continuously senses environmental complexity and adjusts trajectory in real time to smooth bumps and avoid collisions with walls, furniture, or pedestrians. Natural-language voice interaction reduces the need for menu navigation.

Strutt ev¹ pushes perception and control into a category that has long remained low intelligence, offering safer mobility options for those with limited mobility and the elderly. At an industry level, it is also a clear example of autonomy tech spillover. As LiDAR and perception costs keep falling, they can enter price-sensitive consumer products and expand the small mobile robot market.

What these 15 cases suggest

AI’s center of gravity is shifting from general-purpose foundation models flexing their capabilities toward vertical, specialized applications.

Across these award-winning cases, edge intelligence and offline availability appear to be becoming standard for many scenarios, especially in industrial and safety systems. The definition of perception is also expanding from vision alone to various forms of multi-modal fusion for more complete world understanding.

AI is moving into fragmented scenarios that traditional automation struggled to cover because the environment is non-standard and the logic is messy, such as elevator-integrated delivery robots, source point waste sorting, and system UI operation.

With these trends, training data is evolving from a pursuit of volume toward high-value long-tail data and synthetic or physics-based data. The most immediate impact is data labeling for extreme and rare conditions. In cases like VIXallcam, general datasets offer limited value. The competitive edge lies with those who possess scarce data like “trucks in a blizzard.”

Annotations are also moving from semantic labels to behavioral labels. A bounding box around a car is not enough for Smart Eye’s impairment detection or SHOSABI’s coordination assessment. The label needs to capture fine-grained temporal signals such as micro-patterns in eye dynamics or biomechanical features in 3D space.

That kind of data annotation often requires domain experts in medicine or physics, and it does not scale through low-cost crowdsourcing. Expert knowledge density becomes part of the dataset’s core value.

In the next phase of AI competition, the advantage is less about parameter count and more about who can collect, validate, and maintain high-quality data that encodes physical logic, covers extreme edge cases, and is confirmed by experts, while keeping cost under control.

If your team is preparing data for a specialized AI vision application, let’s talk about training data solutions. We provide expert-in-the-loop data annotation services and smart data labeling tools for many leading AI teams.

Data Annotation Strategies for Lightweight Computer Vision in Edge AI

BasicAI — Mon, 01 Dec 2025 05:18:35 GMT

Computer vision is moving from centralized, abundant cloud computing to noisy, constrained edge environments. This shift is not just about where models run. It restructures the relationship between model architecture, hardware accelerators, and the data that powers them.

The past decade of deep learning has been defined by massive foundation models trained on internet-scale datasets. The new frontier is different, featuring efficient, task‑specific lightweight models running on embedded devices, smart cameras, and autonomous robots.

In the cloud, model capacity can often swallow label noise and still generalize across huge taxonomies. At the edge, available memory may be measured in megabytes. There’s virtually no capacity budget to spare. Every parameter has to earn its place in the decision process, and every labeled data point must be selected with ruthless attention to its utility.

Algorithm engineers and AI team leaders may need to abandon classic large-scale data annotation approaches. Missed detections in safety‑critical systems or wasted compute on irrelevant pixels leave very little room for error.

Considering this trend, we want to share some data annotation strategies tailored to lightweight computer vision models, to help CV teams prepare training data that matches the realities of edge deployment.

What is Edge AI?

Before diving into annotation strategies, it helps to be precise about how Edge AI operates.

Edge AI runs AI inference directly on devices located near where data is generated, rather than relying on centralized cloud infrastructure. Edge computing and AI are fused so that machines can process data locally and make real-time decisions without constant back‑and‑forth to a remote server.

This architecture changes how data is prepared, how models are optimized, and how predictions are validated in production. Decoupling from cloud infrastructure has a direct impact on data labeling. Datasets must be comprehensive enough to handle edge cases and deployment variations without frequent retraining or runtime access to additional cloud data sources.

The most visible difference from cloud systems is compute. Edge devices operate under tight limits in processing power, memory, and storage. Heavyweight deep models are hard to run efficiently. At the same time, many edge applications sit in safety‑critical loops where latency is not just annoying but causes failures.

Why Edge Scenarios Need a Different Data Labeling Mindset

Given edge scenarios’ demands for low latency and real-time inference, lightweight architectures have become the default. These include MobileNetV3, SqueezeNet, EfficientNetV2, ResNet‑18, ShuffleNetV2, and similar families that trade capacity for speed and efficiency.

These models come with a cost. They are more sensitive to training data quality. With little spare capacity, they cannot easily absorb noisy or inconsistent labels. Data annotation quality and strategic data selection become central to overall system performance.

Hardware constraints deepen this effect. Power budgets limit what operations can run continuously. Data labeling that assumes pixel‑perfect segmentation when the deployed model can only run bounding box detection wastes both annotation and compute budgets.

Deployment environments also look very different from cloud scenarios. Edge models are often installed at fixed locations, like production lines, cashier stations, specific fields or facilities. Training data has to mirror the actual deployment scene closely.

Internet‑scale datasets, however large and diverse, rarely capture the exact lighting, viewpoints, seasonal patterns, and object appearance of a given site. This location specificity pushes edge teams away from collecting broad and diverse data for full coverage. Instead, they collect focused datasets from the real environment and annotate them deeply.

Common Tasks and Trade‑offs for Lightweight Models

Putting AI on edge devices changes how data should be labeled, structured, and prepared for training. Typical edge computer vision tasks include:

object detection (people, vehicles, defects, goods, equipment),
classification (pass/fail, on/off, state recognition),
lightweight segmentation (lane markings, ground vs non‑ground, drivable area),
keypoints/pose (human skeletons, machine buttons), and
OCR/readings (dashboards, digital codes, barcodes/QR codes).

The choice of annotation task is the foundational decision. It sets both the computational complexity of the inference engine and the data volume required. The high‑level goal of learning visual patterns doesn’t change, but the path to get there does when model capacity and deployment constraints are tight.

Efficiency First: Prefer Classification Over Detection

Efficiency serves as a guiding principle for on-device AI. In our experience, if a problem can be solved with a classification head, avoid using a detection head.

Image classification costs less than object detection in both annotation and computational terms. Detection requires regressing spatial coordinates (bounding boxes) and running post‑processing like NMS, which can consume resources and create latency bottlenecks on edge hardware.
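To make the post-processing cost concrete, here is a minimal pure-Python sketch of greedy non-maximum suppression (NMS). It assumes boxes are given as (x1, y1, x2, y2) tuples and uses an illustrative IoU threshold of 0.5; production detectors typically rely on optimized library implementations instead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The pairwise overlap checks grow with the number of candidate boxes, which is one reason this step can become a latency bottleneck on small devices. A classification head has no equivalent step.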

Classification works best when there is a single, fixed‑position primary object (for example, an industrial sensor always imaging the same part), or when a scene‑level decision is enough (such as “shopper present” or “defective product present”).

With classification, smaller models can reach practical accuracy, inference is extremely fast, and data annotation and QA overhead are minimal. That efficiency translates directly to practical advantages.

Detection becomes necessary when multiple objects appear simultaneously, when objects occupy small regions of the field of view, or when where something is determines the decision logic, such as distinguishing people in a “safe zone” from people in a “danger zone.”
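The zone example can be sketched as a point-in-polygon test on each detected box. This sketch assumes the zone is a polygon in image coordinates and uses the bottom-center of the box as a stand-in for where the person is standing; both the zone coordinates and that convention are illustrative assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zone_for_box(box, danger_zone):
    """Classify a person box (x1, y1, x2, y2) by its bottom-center point."""
    x1, y1, x2, y2 = box
    foot_x, foot_y = (x1 + x2) / 2, y2
    return "danger" if point_in_polygon(foot_x, foot_y, danger_zone) else "safe"


# Hypothetical danger zone covering the lower-left part of a 640x480 frame.
danger_zone = [(0, 300), (200, 300), (200, 480), (0, 480)]
first = zone_for_box((80, 200, 120, 400), danger_zone)
second = zone_for_box((300, 200, 340, 400), danger_zone)
```

A scene-level classifier cannot make this distinction, because the decision depends on where each person is, not only on whether a person is present.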

The Granularity Trade-off: Prefer Detection Over Segmentation

Semantic and instance segmentation provide the richest spatial detail by assigning a class to every pixel instead of approximating objects with boxes. But architectures like U‑Net or Mask R‑CNN require large decoders to reconstruct high‑resolution masks from feature embeddings, burning memory bandwidth and compute.

For lightweight models, bounding box detection should be the default. If an application needs to understand area estimates (for example, defect size as a proxy for severity), consider coarse polygons or rotated bounding boxes instead of full pixel‑level masks.
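A coarse polygon is often enough for an area estimate. The sketch below applies the shoelace formula to a four-vertex outline; the vertex coordinates are made up for illustration and are in pixels.

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula.

    vertices: list of (x, y) points in order around the boundary.
    """
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical coarse outline of a defect, in pixel coordinates.
defect = [(10, 10), (30, 12), (28, 25), (12, 22)]
defect_area_px = polygon_area(defect)
```

Four vertices give a usable severity proxy here without a per-pixel mask. Converting pixels to physical units would need the camera calibration, which is outside this sketch.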

Industrial defect inspection is a good example. While segmentation may be theoretically more precise, lightweight detectors such as YOLOv5/v8 can localize defects with enough accuracy and at a fraction of the inference time. The marginal benefit of tracking the exact jagged outline of a scratch rarely justifies a 10× compute increase.

If segmentation is truly unavoidable, use coarse masks and train at downsampled resolutions, such as 28×28 mask grids instead of full‑image outputs. This keeps label granularity aligned with what a small feature extractor can actually resolve.
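One way to produce such coarse targets is to pool a full-resolution binary mask into a fixed grid. The sketch below is a minimal pure-Python version, assuming the mask dimensions are exact multiples of the grid size and using a hypothetical 50% coverage threshold per cell.

```python
def downsample_mask(mask, grid=28, threshold=0.5):
    """Pool a binary mask (list of rows of 0/1) into a grid x grid mask.

    A coarse cell is set to 1 when at least `threshold` of its source
    pixels are foreground. Mask sizes must be multiples of `grid` here.
    """
    h, w = len(mask), len(mask[0])
    if h % grid or w % grid:
        raise ValueError("mask size must be a multiple of the grid size")
    bh, bw = h // grid, w // grid
    coarse = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            total = sum(
                mask[y][x]
                for y in range(gy * bh, (gy + 1) * bh)
                for x in range(gx * bw, (gx + 1) * bw)
            )
            row.append(1 if total / (bh * bw) >= threshold else 0)
        coarse.append(row)
    return coarse


# A 56x56 mask with a filled 20x20 square, pooled to 28x28.
full = [[1 if 10 <= y < 30 and 10 <= x < 30 else 0 for x in range(56)]
        for y in range(56)]
coarse = downsample_mask(full, grid=28)
```

Any boundary detail finer than one grid cell disappears in the pooled target, so there is little value in paying annotators to trace it.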

Keypoint Annotation

Pose estimation and keypoint labeling mark specific anatomical landmarks or points of interest, such as joints, facial landmarks, or industrial connection points. These points are often linked as skeletons.

Many tasks that initially look like pose estimation can be handled adequately with simpler detection or classification approaches, avoiding the regression overhead of precise keypoints.

When keypoints are required, embrace minimalism. Rather than annotating the standard 68 facial keypoints (overkill for driver fatigue monitoring), define a custom scheme with the minimum actionable set, perhaps 5 points: two eyes, the nose, and two mouth corners. This reduces regression head dimensionality, saves parameters, and cuts data annotation time.

Label Class Design: Contraction Over Expansion

One of the most common planning mistakes in edge AI projects is over‑designing class taxonomy. Lightweight models have a limited feature budget. Spreading that across too many fine‑grained classes blurs decision boundaries for all of them.

Every new class increases model size via the final layer and adds pressure to learn distinct representations under tight resource constraints. And every class must be adequately represented in training data to avoid skewed performance from class imbalance.
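The effect on the final layer is easy to estimate. The sketch below counts weights and biases for a single fully connected output layer, assuming a hypothetical 1280-dimensional feature vector; real architectures may structure the head differently.

```python
def classifier_head_params(feature_dim, num_classes):
    """Weights plus biases of one fully connected output layer."""
    return feature_dim * num_classes + num_classes


FEATURE_DIM = 1280  # hypothetical embedding width, chosen for illustration

small_head = classifier_head_params(FEATURE_DIM, 10)
large_head = classifier_head_params(FEATURE_DIM, 1000)
```

Under these assumptions, a 10-class head has 12,810 parameters and a 1,000-class head has 1,281,000. The count grows linearly with the number of classes, on top of the harder representation-learning problem described above.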

Forcing a small model to separate visually similar subclasses wastes capacity. Asking it to distinguish “sedan” from “coupe” may degrade the core “vehicle” vs “non‑vehicle” performance.

Merging classes is more valuable than splitting them. The number of output categories for an edge model should be tightly controlled, ideally in the tens at most. Many real deployments work well with just 2–10 classes.

When finer distinctions are truly needed, hierarchical label systems (ontologies) are a practical compromise. They keep the deployed model simple while preserving room for future expansion.

During labeling, data annotators choose the most specific node they can reliably distinguish. During model training, the system uses only merged high-level categories, but detailed annotation information remains available for analysis, auditing, and future use.

This dual‑level setup adds only modest overhead during labeling but brings significant long‑term flexibility and traceability.
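A minimal sketch of this dual-level setup follows. The ontology is a child-to-parent mapping, and each fine-grained annotation is rolled up to the nearest ancestor that the deployed model actually predicts. The class names and the two-class training set are hypothetical.

```python
# Hypothetical ontology: each label points to its parent (None = root).
PARENT = {
    "sedan": "car",
    "coupe": "car",
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": None,
    "pedestrian": "person",
    "person": None,
}

# Classes the deployed edge model is trained on (also hypothetical).
TRAIN_CLASSES = {"vehicle", "person"}


def to_training_class(label):
    """Walk up the ontology until a training class is reached."""
    node = label
    while node is not None:
        if node in TRAIN_CLASSES:
            return node
        node = PARENT.get(node)
    raise KeyError(f"no training class found for label {label!r}")


annotations = ["coupe", "truck", "pedestrian", "sedan"]
training_labels = [to_training_class(a) for a in annotations]
```

The stored annotations keep the fine-grained labels, so a later model with more capacity could change `TRAIN_CLASSES` without any relabeling.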

Avoid Overly Precise Annotation

Edge AI projects often over‑invest in annotation precision with diminishing returns. Annotators may spend substantial time chasing pixel‑perfect boxes, even when coarser labels might deliver equal or better edge performance.

To maintain frame rates, edge models typically operate on lower input resolutions, such as 320×320, 512×512, or 640×640. A distant object 10 pixels wide in a 4K frame might shrink to fewer than 2 pixels at 640×640. At that point, it is essentially indistinguishable from sensor noise or aliasing.

In such scenarios, annotation guidelines should define a minimum detectable object size. Annotators should be guided not to over‑optimize pixel alignment but to include slightly more background rather than risk cropping away parts of an object. The aim is tight, not surgical, containment of visible content.
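
The resolution arithmetic behind a minimum object size is easy to check. This sketch assumes the frame is resized uniformly by width (no letterboxing or cropping) and uses a hypothetical threshold of 8 pixels at model resolution; the right threshold depends on the model and task.

```python
UHD_4K_WIDTH = 3840  # horizontal pixels in a 4K UHD frame


def scaled_size_px(size_px: float, source_width: int, model_width: int) -> float:
    """Size of an object after the frame is resized to the model input width."""
    return size_px * model_width / source_width


def min_source_size_px(min_model_px: float, source_width: int, model_width: int) -> float:
    """Smallest source-frame size that still covers min_model_px at model resolution."""
    return min_model_px * source_width / model_width


# A 10-pixel object in a 4K frame at a 640-pixel model input.
print(round(scaled_size_px(10, UHD_4K_WIDTH, 640), 2))  # 1.67

# Hypothetical guideline: objects must span at least 8 pixels at model resolution.
print(min_source_size_px(8, UHD_4K_WIDTH, 640))  # 48.0
```

Under these assumptions, a guideline could tell annotators to skip, or mark as "ignore", objects narrower than about 48 pixels in the 4K source.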

Polygon annotation for segmentation has similar granularity issues. Here, effective precision is set by vertex density. For edge deployment, coarse, smooth polygons without jagged edges are preferable to high‑vertex boundary tracing. Lightweight architectures benefit more from clean, generalizable boundaries than from modeling every tiny irregularity.
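
One common way to reduce vertex density after the fact is Ramer‑Douglas‑Peucker simplification. The sketch below is a minimal pure‑Python version that treats the boundary as an open chain of 2D vertices; a closed polygon would first need to be split at two anchor vertices. The `tolerance` value is an assumption to be tuned per project, and this is not a description of how any specific tool simplifies polygons.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Drop vertices that deviate from the coarse shape by at most `tolerance`."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    distances = [_point_segment_distance(p, first, last) for p in points[1:-1]]
    max_dist = max(distances)
    if max_dist <= tolerance:
        return [first, last]
    # Keep the farthest vertex and simplify both halves recursively.
    split = distances.index(max_dist) + 1
    left = simplify(points[: split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right


# A slightly wobbly edge followed by a sharp corner.
boundary = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (3, 3)]
print(simplify(boundary, tolerance=0.5))  # [(0, 0), (3, 0), (3, 3)]
```

The wobble along the bottom edge is removed while the corner survives, which is the kind of clean boundary a small model can actually use.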

Temporal granularity is another critical dimension for video. Manually labeling every frame in a sequence is prohibitively expensive. Frame interpolation is a pragmatic alternative. Modern tools, such as the BasicAI Data Annotation Platform, can use tracking and motion estimation to propagate keyframe labels through intermediate frames, often cutting manual work by 80–95% while preserving temporal consistency.
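
The simplest form of keyframe propagation is linear interpolation between two labeled frames. The sketch below assumes boxes are `(x_min, y_min, x_max, y_max)` tuples and that motion between keyframes is roughly linear; tracking‑based tools do more than this, so it only illustrates the basic idea.

```python
def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate a box between two keyframes."""
    if frame_a >= frame_b:
        raise ValueError("frame_a must come before frame_b")
    if not frame_a <= frame <= frame_b:
        raise ValueError("frame must lie between the two keyframes")
    t = (frame - frame_a) / (frame_b - frame_a)
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))


# Two manually labeled keyframes, ten frames apart.
key_0 = (0, 0, 10, 10)
key_10 = (10, 20, 20, 30)

# Boxes for the frames in between are generated, then reviewed by a human.
print(interpolate_box(key_0, key_10, 0, 10, 5))  # (5.0, 10.0, 15.0, 20.0)
```

When motion is not linear, for example when an object turns or stops, adding a keyframe at the change point keeps the interpolated boxes accurate.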

Temporal sampling should match how often the deployment environment actually changes. Where scenes change fast or unpredictably, prioritize moments of transition, such as entrances and exits, state changes, or activity onset. Centering sampling on change events teaches the model to handle critical transitions, rather than overfitting to steady, unchanging states.

Training Set Design for Edge Deployment

Deployment‑First Data Collection

Training dataset design starts with data collection. Domain gap (the statistical mismatch between training and deployment data) is a leading cause of edge AI failure. Models trained on clean, high‑contrast internet imagery (COCO, ImageNet, etc.) often fall apart on noisy, low‑contrast industrial sensors.

Even if internet data is convenient and cheap, it should be used cautiously. Data should come, as much as possible, from real or faithfully simulated deployment environments.

For example, a manufacturing QC system should capture images directly from its production line, with real products, real lighting, real camera angles, and real backgrounds. Targeted collection of hard cases (glare, motion blur, partial occlusions, dirty lenses) should supplement the typical cases.

Temporal characteristics matter too. Tracking models trained on 30 FPS footage but deployed at 5 FPS will see much larger apparent motion between frames and may fail. Match training video frame rates to deployment as closely as possible.
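
The frame‑rate effect can be quantified with a quick sketch. It assumes a constant object speed in pixels per second (a hypothetical 300 px/s here) and approximates the target rate with an integer stride, which is exact only when the source rate is a multiple of the target rate.

```python
def displacement_per_frame(speed_px_per_s: float, fps: float) -> float:
    """Apparent motion between consecutive frames at a given frame rate."""
    return speed_px_per_s / fps


def subsample_indices(num_frames: int, source_fps: float, target_fps: float):
    """Frame indices that approximate target_fps by keeping every n-th frame."""
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    stride = round(source_fps / target_fps)
    return list(range(0, num_frames, stride))


# The same object moves six times farther per frame at 5 FPS than at 30 FPS.
print(displacement_per_frame(300, 30))  # 10.0
print(displacement_per_frame(300, 5))   # 60.0

# One second of 30 FPS footage reduced to roughly 5 FPS for training.
print(subsample_indices(30, 30, 5))  # [0, 6, 12, 18, 24]
```

Subsampling existing footage this way is a cheap approximation of the deployment frame rate when recording new data at that rate is not possible.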

Preparing the Training Set

Raw data and labels are just materials. Training set preparation is where they are adapted to the constraints of edge models. Three practices matter in particular.

First, target class balance. Aim for reasonably balanced representation across all classes, and avoid extreme skew wherever possible.

Second, explicitly address small and low‑contrast objects. Lightweight models struggle with small objects because they occupy only a few input pixels and are easily lost in downsampled feature maps. Give small and low‑contrast examples special treatment in sampling and augmentation strategies.

Third, separate extreme and abnormal conditions. Encode scene conditions using classification labels like lighting (day/night, backlit), weather, time of day, occlusions, reflections, etc. Instead of treating all images as equal, make this context explicit. It allows you to stratify training, monitor performance across conditions, and design curriculum or hard‑example mining strategies that target real‑world failure modes.
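
Two of these practices are easy to prototype. The sketch below computes inverse‑frequency class weights (the same heuristic as scikit‑learn's "balanced" class weighting) and breaks accuracy down by a scene‑condition tag. The record layout with `label`, `predicted`, and `conditions` fields is a hypothetical format chosen for illustration.

```python
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {cls: len(labels) / (len(counts) * n) for cls, n in counts.items()}


def accuracy_by_condition(records, condition):
    """Accuracy per value of one condition tag, such as lighting."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        value = rec["conditions"][condition]
        totals[value] += 1
        hits[value] += int(rec["predicted"] == rec["label"])
    return {value: hits[value] / totals[value] for value in totals}


# Rare classes receive proportionally larger weights.
print(balanced_class_weights(["ok"] * 90 + ["defect"] * 10))

# Hypothetical evaluation records tagged with scene conditions.
records = [
    {"label": "defect", "predicted": "defect", "conditions": {"lighting": "day"}},
    {"label": "ok", "predicted": "ok", "conditions": {"lighting": "day"}},
    {"label": "defect", "predicted": "ok", "conditions": {"lighting": "backlit"}},
    {"label": "ok", "predicted": "ok", "conditions": {"lighting": "backlit"}},
]
print(accuracy_by_condition(records, "lighting"))  # {'day': 1.0, 'backlit': 0.5}
```

A per‑condition breakdown like this makes it visible when an acceptable overall score hides a weak spot, such as backlit scenes in this toy example.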

Data Annotation Tips for Typical Edge Applications

Industrial Vision and Defect Detection

Industrial QC is one of the most mature edge AI domains. Typical characteristics include high‑speed conveyors, controlled lighting, and a heavily skewed class distribution (around 99.9% good parts, 0.1% defects).

Defect detection systems automatically flag quality issues from line images, supplementing or replacing manual visual inspection. The key data annotation question is what downstream systems actually need: exact defect location, or simply defect presence?

When defect information only guides human operators during visual product inspection, a rough bounding box indicating the approximate defect location suffices. If operators can pinpoint defects themselves through direct inspection, the model’s task is merely to flag products that need inspection. In this case, frame‑level classification labels or rough bounding boxes are a better fit than precise segmentation masks.

Smart Cameras for People and Vehicle Monitoring

Smart cameras in retail, parking, and surveillance are a broad edge AI category. These systems detect people and vehicles to enable use cases like customer counting, occupancy monitoring, or intrusion alerts.

Where downstream logic allows, labels should merge classes aggressively. A system that just needs person counts, without demographic attributes, should use a single “person” class, not separate classes by age or gender.

Consistent handling of crowded, overlapping scenes is essential. Retail spaces and transport hubs often have heavy occlusion and overlapping boxes. A common strategy is to label a “head” key point or small head box, which usually stays visible and provides a stable signal for counting even in dense crowds.

Modern security setups increasingly use PTZ cameras that auto‑track targets, changing field of view on the fly. Data labeling for such systems must reflect this dynamic framing and include examples of zoom, pan, and re‑acquisition patterns that will occur in deployment.

Choosing Data Labeling Tools and Workflows

Data annotation tools and workflow shape both efficiency and quality. For edge AI annotation, certain capabilities are especially important.

Ontology management is central when you apply class shrinking strategies through hierarchical labels. Tools should support multi-level class definitions, guide data annotators to select along the hierarchy, and record all levels, not just leaf nodes. This enables training on merged classes while keeping fine‑grained labels for later.

Video interpolation and tracking support is critical for any workload involving video. Tools that provide keyframe‑based interpolation, object tracking across frames, and consistent ID assignment can dramatically reduce effort versus frame‑by‑frame labeling. For edge projects, full manual per‑frame annotation is rarely sustainable.

Model‑assisted pre‑labeling allows a model to propose candidate labels for human review. Combined with active learning that prioritizes low‑confidence or novel samples, this lets the model handle easy, high‑confidence cases while human annotators focus on ambiguous ones, shrinking overall labeling volume.
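
A minimal version of this triage is a confidence threshold plus a sort. The prediction format and the `accept_threshold` of 0.9 below are assumptions for illustration; a real pipeline would calibrate the threshold on reviewed data and still spot‑check the accepted items.

```python
def triage(predictions, accept_threshold=0.9):
    """Split pre-labels into provisionally accepted items and a review queue."""
    accepted = [p for p in predictions if p["confidence"] >= accept_threshold]
    # Least confident items first, so reviewers see the hardest cases early.
    review = sorted(
        (p for p in predictions if p["confidence"] < accept_threshold),
        key=lambda p: p["confidence"],
    )
    return accepted, review


# Hypothetical model proposals for three images.
predictions = [
    {"id": "img_001", "label": "person", "confidence": 0.97},
    {"id": "img_002", "label": "vehicle", "confidence": 0.42},
    {"id": "img_003", "label": "person", "confidence": 0.75},
]

accepted, review = triage(predictions)
print([p["id"] for p in accepted])  # ['img_001']
print([p["id"] for p in review])    # ['img_002', 'img_003']
```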

Scalable collaboration features become important as edge projects grow. Role‑based access control, structured review workflows, and audit trails for all edits are necessary to maintain quality and security at scale, especially across internal and external labeling teams.

On‑premise deployment is often non‑negotiable. Many on-device AI projects involve proprietary manufacturing data, medical imagery, or private surveillance footage that cannot be pushed to public clouds. Labeling tools should offer self‑hosted or tightly controlled deployment options so data never leaves the organization’s security boundary.

Recommended: BasicAI Data Annotation Platform

Given these requirements, the BasicAI Data Annotation Platform is well aligned with edge AI workflows.

It supports model‑assisted pre‑labeling and interpolation‑based tracking, making sparse labeling strategies viable while still handling video continuity. It also supports complex, multi‑level Ontologies for compact class design, centrally managed and reusable across projects.

BasicAI’s scalable collaborative annotation system stands as a primary reason for its popularity. Teams can manage internal and external annotators, batch-assign tasks, view task progress and personnel performance in dashboards, and save significant time with customized automatic quality checks.

BasicAI also offers strong capabilities for sensor fusion (LiDAR, RGB images, and video), which fits advanced edge robotics and autonomous vehicle use cases. The platform is available as a private deployment, aligning with strict project and data security requirements.

Summary and Emerging Directions

Lightweight models deployed at the edge demand a different annotation mindset than large cloud‑scale models that dominate academic benchmarks.

The differences stem from constrained compute, hard real‑time requirements, the importance of local processing for privacy, and the central role of deployment‑specific training data.

Edge AI annotation should not chase maximum label granularity or rely on massive, generic datasets. It should prioritize consistency, keep class complexity low, and focus labeling effort on representative data from the actual deployment environment.

Synthetic data is becoming a powerful part of this toolkit. Generative models and digital twins can help cover rare but critical scenarios, such as factory fires, traffic accidents, and dangerous human behaviors, where collecting enough real samples is impractical or unsafe.

Combining edge AI with IoT and sensor networks also opens doors for distributed annotation and federated learning. Models can improve collaboratively across many edge devices while keeping data local. This reduces the need to centralize all training data and labels, and lets annotation happen closer to the deployment context, improving freshness and relevance.

Ultimately, effective edge annotation strategies reflect a deeper understanding of what truly drives model performance. It is not the raw amount of data or the finest possible labels, but how well annotation practice is aligned with model capacity, deployment characteristics, and the actual information needs of the application.

Done well, this principle‑driven approach turns constraints into an advantage, enabling teams to ship smaller, sharper models that perform reliably on real‑world edge devices.