Inspiration

After seeing the advent of Neural Radiance Fields (NeRF) and many improvements in the domain of 3D reconstruction, we hoped to apply these technologies to the field of forensic science. We were inspired by films like Blade Runner, where advanced technology assists in analyzing forensic evidence beyond simply presenting video feed to a user.

What it does

SNAP-Cam can consolidate large amounts of surveillance footage to generate a 3D reconstruction of an event, allowing an investigator to not only see events transpire, but also step into the situation and view it from all angles.

How we built it

Our implementation revolves around a recent paper, DUSt3R, which leverages pretrained vision transformers (ViTs) to perform multi-view stereo reconstruction. Through the pipeline presented in that paper, we input temporally-aligned frames from multiple surveillance camera views and retrieve a rough 3D model of the scene. By generating a reconstruction for each of several sequential batches of video frames, we created a continuous 3D representation of a physical space over time. This data is displayed in our frontend alongside the surveillance video feed, in a 3D-rendered space where a user can move around and explore the reconstruction.
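The batching step above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline code: the `batch_frames` helper and the camera-ID/frame-list layout are hypothetical, and in practice each returned batch would be handed to the DUSt3R model to produce one reconstruction per time window.

```python
from typing import Dict, List, Sequence


def batch_frames(
    camera_frames: Dict[str, Sequence[int]],
    window: int,
) -> List[Dict[str, List[int]]]:
    """Group temporally-aligned frames from several cameras into
    sequential batches. Each batch covers one time window and would be
    fed to the multi-view reconstruction model as a single scene.

    camera_frames maps a camera ID to its ordered frame indices
    (assumed already synchronized across cameras).
    """
    # Truncate to the shortest stream so every batch has every view.
    n = min(len(frames) for frames in camera_frames.values())
    batches = []
    for start in range(0, n, window):
        batch = {
            cam: list(frames[start:start + window])
            for cam, frames in camera_frames.items()
        }
        batches.append(batch)
    return batches


# Example: two synchronized 10-frame streams, windows of 4 frames
cams = {"cam_a": list(range(10)), "cam_b": list(range(10))}
batches = batch_frames(cams, window=4)
print(len(batches))          # 3 windows: [0-3], [4-7], [8-9]
print(batches[0]["cam_a"])   # [0, 1, 2, 3]
```

Running one reconstruction per window is what lets the sequence of static 3D models approximate a scene evolving over time.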

Challenges we ran into

Determining an adequate model architecture for this task was a difficult part of our development. Initially, we proposed using NeRF for this project, and we also experimented with tangentially related projects such as NVIDIA's instant-ngp and the full COLMAP reconstruction pipeline. These methods failed to produce quality results due to the sparsity of our surveillance views. After trying DUSt3R, we noted that the scene understanding its vision encoders gained from extensive pretraining helped the model cope with sparse input data like ours, making it the right fit for our project. We also faced compute constraints: we were using a free service and had only a limited number of attempts to generate meshes.

Accomplishments that we're proud of

Throughout this process we worked with complex code to generate NeRFs and to run large transformer models, which were difficult to integrate into a custom application. Our web application also provided a smooth transition from raw surveillance footage to a dynamically rendered 3D model that a user could thoroughly explore.

What we learned

We gained a deeper understanding of the nuanced capabilities of various 3D reconstruction models, identifying the advantages and disadvantages of each in different situations.

What's next for SNAP-Cam

We hope to improve the model's performance when rendering moving objects. This might involve interpolating certain frames to fit their surrounding context and produce a cleaner mesh, as well as gathering more precise user data to generate cleaner results. We also hope to expand our user interface, letting users annotate the model with labels or run simulations to better comprehend the generated mesh.
