Inspiration

Our team consists of enthusiastic AI/ML hackers and researchers. Beyond an insatiable curiosity and a drive to build products that profoundly shape the future, we are united by a common frustration with coming up with new ideas. But the frustrating part isn't coming up with new ideas - it's figuring out whether your idea is new. We decided to focus specifically on the domain of ML research: as hype around AI/ML grows, there is an influx of low-quality information and of articles by authors who do not really understand machine learning (even though they may cite several ML papers).

What it does

To solve this issue, we built an advanced (re)search engine that is trained exclusively on academic articles. We use AI-powered prioritisation algorithms, search techniques, and graph algorithms to answer users' questions, backed by links to the relevant research articles.

tl;dr: We provide high-quality answers with good sources.

How we built it

We scrape academic articles from arxiv.org and semanticscholar.org, in addition to existing datasets of scholarly papers, to obtain a collection of 20,000+ research papers. After compiling this data, we use Jina embeddings to create a vector database. We also scrape semanticscholar.org for citations to create backlinks and maintain a graph database. When a user asks a research-related query, we use retrieval-augmented generation (RAG) and graph algorithms to find the most relevant research articles, then generate a response using the Anthropic Claude API. Our UI is built with Streamlit.
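As a minimal sketch, the retrieval step of the pipeline looks roughly like the following. This is illustrative only: a toy bag-of-words embedder stands in for the Jina embedding model, the two hard-coded papers stand in for the 20,000+ paper vector database, and in the real system the top-ranked abstracts would be passed as context to the Anthropic Claude API to generate the final answer.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy stand-in for an embedding model: a normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Dot product of unit vectors, i.e. cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, papers, k=2):
    """Rank papers by cosine similarity between query and abstract embeddings."""
    vocab = sorted({w for p in papers for w in p["abstract"].lower().split()}
                   | set(query.lower().split()))
    q = embed(query, vocab)
    ranked = sorted(papers,
                    key=lambda p: cosine(q, embed(p["abstract"], vocab)),
                    reverse=True)
    return ranked[:k]

# Two hypothetical entries standing in for the vector database.
papers = [
    {"title": "Attention Is All You Need",
     "abstract": "transformer attention mechanism for sequence transduction"},
    {"title": "Deep Residual Learning",
     "abstract": "residual networks for image recognition"},
]

top = retrieve("how does transformer attention work", papers, k=1)
print(top[0]["title"])  # → Attention Is All You Need
```

The citation backlinks from the graph database can then be used to re-rank or expand this candidate set before generation.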

Challenges we ran into

Like most machine learning projects, a large amount of time was spent obtaining and curating our data. We ran into several issues while scraping the data and preprocessing it into a useful format before creating embeddings. We also spent a lot of time experimenting with different parts of the pipeline. For instance, we tested a variety of embedding architectures, including BERT, RoBERTa, and SciBERT, before settling on Jina embeddings, which yielded the best similarity scores for our use case of research-related queries.

Accomplishments that we're proud of

We're proud of carefully architecting and refining the data acquisition and ML pipeline. The majority of our effort went into experimenting with and improving our engine until we had a product that we are genuinely excited to use in our daily lives.

What we learned

We learnt a lot about database design, the trade-offs between different AI models and APIs, and RAG.

What's next for I-Cite

Research in AI is growing at an unprecedented pace, and we hope to continually update our database of research papers and expand to new sources. We also hope to extend the engine beyond ML research so that researchers in other fields can use our product.
