SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data

Introduction

We evaluate the capabilities of current neural retrievers in understanding complex NL queries and semi-structured data. The queries involve diverse types of filtering conditions for structured objects, including exact and semantic matching, numerical and logical reasoning, or comprehensive understanding of multiple fields. The document structure can be dynamic, with potential missing fields and flexible structures (nested lists or dictionaries), making it challenging to query using fixed-schema database indexing. Current powerful LLM-based neural retrievers show promise in providing a unified solution to address the challenges present in this scenario.
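To make the "flexible structure" point concrete, here is a minimal sketch of how a nested object with optional fields could be turned into plain text before being embedded by a retriever. The sample object, the field names, and the `path: value` format are illustrative assumptions; this is not necessarily how SSRB serializes its data.

```python
import json
from typing import Any, Iterator, Tuple


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf_value) pairs from arbitrarily nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


def to_text(obj: dict) -> str:
    """Serialize an object as one 'path: value' line per leaf field."""
    return "\n".join(f"{path}: {json.dumps(value)}" for path, value in flatten(obj))


# Hypothetical object: no fixed schema is assumed, fields may be absent.
product = {
    "name": "Trail Shoe",
    "price": 89.5,
    "tags": ["outdoor", "waterproof"],
    "specs": {"weight_g": 310},
}
print(to_text(product))
```

Because the traversal follows whatever structure is present, an object that lacks `specs` or adds new nested fields is handled without any schema change.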

We present the Semi-Structured Retrieval Benchmark (SSRB), encompassing 6 domains with 99 different data schemas, totaling 14M data objects, along with 8,485 NL queries of varying difficulty levels. Given the scarcity of public data, we build SSRB with LLMs in a three-stage data synthesis workflow (figure below): (1) schema generation, creating multiple schema definitions for six manually defined domains; (2) data triples generation, synthesizing $<query, positive, negative>$ triples for each schema using different query characteristic configurations to ensure diversity and quality; and (3) testset annotation, employing powerful LLMs to judge the relevance of recalled candidates to queries as test labels.

Based on SSRB, we evaluate two main types of dense retrievers: 1) small-scale encoder-based models like InstructOR and BGE, and 2) LLM-based ones such as E5-mistral. We also include the BM25 lexical retriever for comparison. Our experiments reveal several key findings:

BM25 struggles with this task,
encoder-based models, benefiting from BERT-style backbones, perform better than BM25,
LLM-based retrievers achieve notably better performance, highlighting the importance of LLMs' powerful semantic understanding and reasoning capabilities in handling complex queries.

However, their absolute performance remains relatively low, indicating the necessity for developing more task-specific retrievers.
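One way to see why purely lexical matching can struggle with numerical conditions is the toy sketch below. It uses simple token overlap as a stand-in for a lexical scorer (it is not BM25), and the query and the two objects are hypothetical examples, not SSRB data.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, document: str) -> int:
    """Count distinct tokens shared by the query and the document."""
    return len(tokens(query) & tokens(document))


# Hypothetical query with a numerical filtering condition.
query = "laptops that cost less than 1000 dollars"
positive = "category: laptops\nprice_usd: 899"    # satisfies the condition
negative = "category: laptops\nprice_usd: 1000"   # does not satisfy it

print("positive:", overlap_score(query, positive))
print("negative:", overlap_score(query, negative))
```

In this example the non-matching object scores higher, because it literally contains the token `1000`, while the matching object shares only `laptops` with the query. Ranking the two correctly requires comparing numbers rather than matching strings.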

How to run

0 Clone this repo

git clone https://github.com/vec-ai/struct-ir.git
cd ./struct-ir

1 Download data and models

Download data (less than 10GB):

bash ./scripts/hfd.sh --dataset vec-ai/struct-ir
mv struct-ir data

Download models (modify the script to select models):

mkdir models
cd models
bash ../scripts/dl_models.sh

2 Run selected models

General environments:

pip install pandas torch transformers sentence_transformers pytrec_eval pyyaml

Warning

Some models require extra packages to be installed, e.g., gritlm.
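Before running an evaluation, it can help to confirm that the general packages are importable. The sketch below is an assumption-laden convenience check, not part of this repo. It only tests whether each module can be found, without importing it. Note that the `pyyaml` package is imported under the name `yaml`.

```python
from importlib.util import find_spec

# Import names for the packages in the pip command above.
# The pyyaml distribution is imported as "yaml".
REQUIRED_MODULES = [
    "pandas",
    "torch",
    "transformers",
    "sentence_transformers",
    "pytrec_eval",
    "yaml",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

Model-specific extras such as `gritlm` can be appended to the list when the corresponding model is selected.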

Supported model list:

bge
instructor
jina3
nomic2
drama
e5mistral
qwen (gte-qwen2-7b)
gritlm (requires gritlm)
nvembedv2 (requires transformers==4.42.4)

Run one model:

MODEL_DIR="./models" python evaluate.py --model_name drama --batch_size 32

All embeddings, results, and scores will be saved to results/models/MODEL_NAME by default.
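To evaluate several models in sequence, the single-model command can be wrapped in a small driver. The following is a sketch under the assumption that it is run from the repository root and that each model has already been downloaded. The helper names `build_command` and `run_all` are hypothetical; only `MODEL_DIR`, `--model_name`, and `--batch_size` come from the command above.

```python
import os
import subprocess
import sys

# A subset of the supported model names listed above.
MODELS = ["bge", "drama", "e5mistral"]


def build_command(model_name, batch_size=32):
    """Build the evaluate.py invocation for one model as an argument list."""
    return [
        sys.executable, "evaluate.py",
        "--model_name", model_name,
        "--batch_size", str(batch_size),
    ]


def run_all(models, model_dir="./models"):
    """Run evaluate.py once per model, stopping at the first failure."""
    env = dict(os.environ, MODEL_DIR=model_dir)
    for name in models:
        subprocess.run(build_command(name), env=env, check=True)


# Usage (from the repository root): run_all(MODELS)
```

Passing `check=True` makes a failed run raise an exception instead of silently continuing to the next model.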

Get results table:

python ./scripts/print_tables.py

This prints several rows of scores, with numbers separated by commas. You can paste them into Google Sheets to generate the table.
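As an alternative to a spreadsheet, comma-separated rows can be converted into a Markdown table with a few lines of Python. This is a sketch only: the exact columns printed by `print_tables.py` are not documented here, so the header names and the sample rows below are placeholders, not real results.

```python
import csv
import io


def rows_to_markdown(text, header):
    """Convert comma-separated rows into a Markdown table with the given header."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell.strip() for cell in row) + " |")
    return "\n".join(lines)


# Placeholder rows and header names, for illustration only.
sample = "model_a,12.3,45.6\nmodel_b,7.8,9.0\n"
print(rows_to_markdown(sample, ["model", "metric_1", "metric_2"]))
```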

Acknowledgments

TODO
