Stories by IVAN ILIN on Medium

Through knowledge sharing to singularity, accelerated by LLMs

IVAN ILIN — Tue, 27 Feb 2024 13:05:33 GMT

LLMs are one of the pinnacles of human knowledge, with transformative potential comparable to the Internet. How did we come to that? Why do knowledge sharing and knowledge flow play a crucial role in our world’s acceleration, and why are LLMs the lapis philosophorum that could bring us to the singularity?

I’ll briefly outline some knowledge-sharing history milestones, the effects open-source had on knowledge accumulation, and the way it brought us to LLMs.
Then we’ll stop at the current point to reflect on the effects LLMs will have in tech, science, and society, touching on recent techno-optimism e/acc philosophy that ushers humanity into the bright singularity future.

We believe intelligence is in an upward spiral [1] — Marc Andreessen

Some history

The ability to share knowledge might be the key distinction that allowed humankind to evolve into such a complex civilization and to become the dominant species on our planet.

Since the beginning of time, the collective efforts to explore the unknown and find meaning in the unexplainable have driven our evolution, propelling humanity forward on a relentless quest for knowledge and understanding. Spurred by this inherent curiosity, our species’ development has been distinguished from others by the accumulation of wisdom passed down through generations.

Early humans, driven by the need to collaborate and share experiences, developed language to communicate, describe their surroundings, and transmit knowledge. This cognitive leap laid the foundation for the complex societies that would emerge over time and the foundation of science later on as the collective understanding of the world became a shared endeavor.

From 15,000 BC cave drawings to Wikipedia, all iterations of knowledge sharing tools were an invitation for collaboration. While documentation techniques have continued to get more and more sophisticated, the goal has stayed the same: to find meaning and share the findings. Teamwork is widely regarded as a key factor in the progress, evolution, and survival of humanity.

The obvious milestones in knowledge sharing are the invention of writing around 3400 B.C. in Sumer and the Gutenberg press almost five thousand years later. The first made it possible to capture knowledge, while the second enabled its distribution to the masses, laying the ground for the Scientific Revolution, which in turn led to the Industrial Revolution.

The quest for understanding the meaning, truth, and unraveling the mysteries of existence has compelled us to cultivate an environment where the exchange of ideas and the advancement of knowledge are paramount. During the 17th century, the majority of European countries established their Academies of Sciences, thus accelerating the exchange and validation of knowledge.

The time axis here has a logarithmic scale due to our world’s exponential acceleration

Internet era & open source

Long story short — a hundred years after the invention of the telephone and the radio we finally came up with the Internet. The first message was sent on October 29, 1969, from UCLA to the Stanford Research Institute — the universities were the first organizations involved in information sharing technology, though under the military supervision of (D)ARPA.

The Internet brought us to the globalization era with the vast availability of knowledge, and then people started sharing their solutions to popular problems, which sparked the open-source movement. Engineers shared the source code of their projects even before the Internet, but it certainly sped things up dramatically.

Apart from sharing code people started sharing opinions, ideas, and facts on the Internet — speaking of knowledge sharing it is impossible to pass by Wikipedia's launch in 2001. Looking ahead, it has served the ML community a lot while building different Natural Language Understanding tools and models as a high-quality curated corpus of information.

The open-source movement gained hold with the rise of the Internet, and it has since grown into a vibrant scene with many contributors and projects. Fast forward to 2008, and we see the launch of GitHub, providing developers with a platform to collaborate on their projects online. Since that time the open-source approach has become a solid ground for the tech scene’s exponential growth and allowed society to not only avoid a good chunk of duplicative efforts but to build a shared foundation for our current and future innovations. More than 99% of Fortune 500 companies use open-source code [2].

The whole machine learning industry has been growing since its early days on open source solutions like scikit-learn (2007) and then on deep learning frameworks: TensorFlow (2015) and PyTorch (2016). Later on, in the 2020s, people started sharing pre-trained model weights on Hugging Face and linking arXiv papers to their implementations on Papers with Code.

These were all the building blocks of the knowledge-sharing culture, community, and tech, particularly in machine learning, that paved the way for ChatGPT and other LLMs to appear. Without that, we would never have developed AI to its state in just a decade.

This sharing culture dramatically increased our progress pace by multiplying contributions, almost instantly sharing the current state-of-the-art tech with the world and allowing every engineer to start their journey right from the current pinnacle.

In fact, the whole open source culture is a collaborative learning environment and we are all working in the zone of proximal development according to Lev Vygotsky.

LLMs and knowledge — a symbiotic relationship

Transformers invention and training

Additionally, this knowledge sharing process left plenty of publicly available data — first in the form of texts, then in code, and lately as public models and datasets. This data abundance is crucial to the LLMs training process — LLMs thrive on large datasets, which expand and thrive thanks to the collective knowledge accumulated and shared on the Internet through open-source platforms — human-generated texts from scriptures to Instagram posts, Reddit (recent $60M/y training deal with Google), massive Github and StackOverflow codebase, etc.

Let’s dig a bit into the LLM training process. First, the GPT family models are decoder-only variants of the industry-revolutionizing Transformer architecture introduced in the seminal paper Attention Is All You Need. Many novel ideas were implemented at once, resulting in a complete replacement of the previous state-of-the-art variations of Recurrent Neural Networks by this new architecture. One of the pivotal ideas, besides self-attention, was the application of self-supervised learning to model pretraining. BERT-style encoder models, for example, had to solve two core tasks during pre-training: masked language modeling (predicting a randomly masked word given its context) and next sentence prediction (classifying whether two given sentences are consecutive or not). GPT-style decoders are instead pre-trained to predict the next token.

This approach eliminated the need to create a specific dataset to train models on by manually labeling texts, allowing researchers to increase the training data size dramatically without any human work involved and to use any human texts of reasonable quality to train Transformer models — first high quality curated corpora like Wikipedia and later the whole Internet.
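The self-supervised idea can be illustrated with a minimal sketch of how masked language modeling examples are produced from raw text alone. The whitespace tokenizer, the `[MASK]` string, the 15% masking rate and the function name are assumptions made for illustration, not the setup of any particular model.

```python
import random

MASK = "[MASK]"

def make_mlm_example(text, mask_prob=0.15, seed=0):
    """Turn raw text into (masked_tokens, labels) without any human labeling."""
    rng = random.Random(seed)
    tokens = text.split()  # whitespace tokens stand in for subword tokens
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model has to recover this word
        else:
            masked.append(tok)
            labels.append(None)  # no training signal on unmasked positions
    return masked, labels

masked, labels = make_mlm_example("the cat sat on the mat", mask_prob=0.5, seed=1)
```

The labels come from the text itself, which is why any reasonably clean corpus can serve as training data.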

The invention of Transformers led to a new cycle in Natural Language Understanding bringing us to the era of the first capable chat-bots with retrieval and generative parts, high-quality semantic search, and multimodal generative models.

LLMs are born

A few years later OpenAI applied reinforcement learning from human feedback (RLHF) to train InstructGPT and make it safer, more helpful, and more aligned: labelers provided demonstrations of the desired model behavior and ranked several outputs from the models, and this data was then used to fine-tune GPT-3, eventually giving birth to ChatGPT in 2022.

That’s how LLM reasoning abilities emerged — we may speculate that InstructGPT, having demonstrated a much better ability to follow human instructions and more coherent responses, learned this reasoning and logic from humans. And let’s step back a bit to remind ourselves that models have been learning in natural language, so our knowledge sharing mechanism worked for a machine learning model too.

Here lies a thin line between the reproduction of coherent sequences and the actual “understanding” of information encoded by those sequences. Understanding means the model’s capability not only to capture relations but also to demonstrate general logic and common knowledge [3].
GPT-4 reasoning capabilities are extensively tested in the OpenAI tech report, but if you’d like to briefly get the idea, here is my 8-month-old (yes, THAT old) post reflecting on LLMs’ reasoning potential.

So we got LLMs as the reasoning engine, being capable of executing logical operations, which is already a very promising thing. But while LLM weights contain human-level reasoning capabilities they may lack relevant information — that’s where RAG, standing for retrieval augmented generation, comes to the stage. RAG is the most popular architecture of LLM applications providing information injection into the LLM prompt, giving LLM the knowledge to reason upon.

LLMs are the new interface for information

This RAG approach brings immense possibilities for creating knowledge assistants, capable of answering complex logical questions and fetching any data needed from multiple sources, creating reports, preparing analytical notes, comparative analysis, writing long and short texts, or even working as your thinking assistant. We are building such an intelligent knowledge interface in iki.ai — an assistant and a second brain for professionals & teams, but there are many more solutions in the market, focusing on specific use cases — perplexity.ai is fighting with Google for the mainstream information discovery tool market, Arc browser is going a step further and wants not to just merge various information pieces from different sources with an LLM, but to build a whole new interface for web aimed at information aggregation and structuring (as far as I got it from their recent video).

This ability of LLMs to generate a coherent and logically correct text provided some information is what changes the whole knowledge sharing paradigm — it is now possible to merge and transform information according to a specific request — the operation that required a human expert before.

Those who are building truly disruptive LLM-powered products are reshaping the interfaces to information.

That is the paradigm shift in the knowledge creation process as we know it — before the dots were connected only in human brains. Scientists studied, adopted thinking patterns, and research methods, and consumed vast amounts of information to come up with an innovation — adding a new layer of ideas to the existing knowledge.

Now an LLM can do that for you. And obviously, it does not have cognitive load threshold or memory limit issues. A research agent, augmenting a scientist’s intelligence, now can have some tools like access to scientific databases and the Internet, a goal, and some human-in-the-loop guidance. That’s not a futuristic idea, that’s how research would be done this year — there already are products implementing early prototypes of research assistants.

This would boost knowledge creation aka tech and scientific progress, speeding up ideation loops and research cycles dramatically plus granting almost instant access to information, thus ushering in the age of singularity — arguably we are entering it now.

Most knowledge workers may delegate some part of their daily responsibilities to LLMs right now: I am speaking of analysts, lawyers, researchers, and experts in general. One of the obvious problems we are tackling in iki.ai is professional information overload: you may ask your knowledge assistant to distill key ideas or connect the dots from multiple contexts for you. The cognitive load, especially in the field of Machine Learning, has now reached unprecedented levels, and our brains are not made to withstand this constant influx of information, so having some kind of software to store your knowledge and query it in various ways becomes a necessity.

The next big cognitive frontier once solely attributed to humans is creativity, which could be interpreted as the ability to connect previously unrelated dots. Now these dots, or ideas, may be extracted from a context by an LLM and then connected in various ways until it clicks with the user, which gives birth to the first true second brains.

I’ve already mentioned agents — LLMs capable of using external tools to complete a task — that’s a whole next paradigm. Proactive assistants, software, automating whole pipelines, automated interactions with GUI to integrate agentic systems with the current generation of software, and who knows what comes next.

The coming few years will change many processes and tools we got used to in the previous decade and be sure that the best minds in the industry are now working on that huge business opportunity.

Societal effects

Such drastic technological changes cannot happen without affecting society, especially in the informational world we live in after the fourth industrial revolution.

One of the philosophies, outlining a fairly positive view on this major tech shift, is Effective Accelerationism, or e/acc, coined in 2022 and substantially developed by Marc Andreessen in his Techno Optimist Manifesto several months ago. The text is remarkable and I recommend reading the original, but let’s cite some core ideas:

We believe intelligence is the ultimate engine of progress. Intelligence makes everything better. Smart people and smart societies outperform less smart ones on virtually every metric we can measure. Intelligence is the birthright of humanity; we should expand it as fully and broadly as we possibly can.

Ray Kurzweil defines his Law of Accelerating Returns: Technological advances tend to feed on themselves, increasing the rate of further advance.

We believe in accelerationism — the conscious and deliberate propulsion of technological development — to ensure the fulfillment of the Law of Accelerating Returns. To ensure the techno-capital upward spiral continues forever.

Although a bit one-sided, the e/acc movement is popular in the California tech community and on Twitter: the new AI tech creates a lot of possibilities in the market for tech entrepreneurs, engineers, and investors, and then for a lot of data-related businesses. Capitalism is an economic system based on growth, so for me, e/acc looks more like a technocratic approach with some sparkling singularity touch.

AI and LLMs are the new tech revolution, and those with tech capabilities, capital, and some audience are positioned much better to handle this new opportunity — successful VCs & entrepreneurs will become richer, and the measure of merits from the new tech will be proportional to the tech ownership. The least qualified employees will be the first ones to be replaced, so I do not see a better world for everyone immediately, more likely a continuous series of layoffs and turbulence due to the disrupting tech adoption. The upside is that eventually, more tedious tasks will be gone along with the low-paid, less-qualified jobs, while more companies are created, more cases are solved, and more successful VC stories happen.

At a higher level of abstraction, knowledge can fast-track happiness. Adopting a motivation-driven learning approach enriches personal and professional journeys, aiding in discovering your ikigai — the essence of your existence. Your “second brain” can help you accelerate the iteration loop of ideation, creation, feedback collection, and refinement, allowing for a more steep and successful learning curve.

Despite all the positive things about knowledge sharing and tech advancements that humanity fostered over the last 300–400 years, there also are a few problems caused by this unprecedented acceleration we are experiencing: psychological overload, insane rivalry in the markets, and tech knowledge getting obsolete within a month now.

The current pace of advancements in AI technology resembles an explosion. Humanity operates as a cohesive entity, exchanging information, resources, and responsibilities. The rapidity of changes mirrors that of a shock wave and it remains uncertain how human society and individual psyches will adjust to this acceleration, or if adaptation is even feasible for the majority.

Another thing is that while the unprecedented acceleration levels the opportunities somewhat, it creates an extreme rush and rivalry. By no means can you stop learning in such a world. While some may accept this, others simply are not prepared for it.

As our psyche is not designed for such a pace of changes, they may cause a general feeling of insecurity and uncertainty. The speed of tech growth levered by VC money leaves little chance for the common people to grasp that new world, adopt it, and adapt. The psychological effects of living in this unstable and less predictable world are far from beneficial, but no one is going to slow down; remember the letter we had last year?

Conclusion

Knowledge sharing has been the key factor in humanity's progress.
LLMs are the current pinnacle of this progress, accelerating information sharing, knowledge creation, and the economic growth spiral. That’s because LLMs are the new generation of interfaces to information, providing its transformation according to user queries and tasks, thus unlocking instant knowledge sharing. Add agents, capable of completing complex tasks and pipelines, and you’ll get the augmented intelligence reality that we are facing.

The bright future of humanity in the e/acc paradigm is defined by this free knowledge flow and AI-enhanced intellectual work, accelerating tech progress and ushering us to the fabulous singularity.
LLM-powered knowledge assistants are a crucial part of this acceleration.

The world has entered a new phase, and we have to adapt faster than ever before. Better off with an assistant to help :)

Just curious — is that singularity yet?

Find me on LinkedIn or Twitter to challenge the opinions & ideas shared above!

The main references are collected in my knowledge base, there is a co-pilot to chat with this set of documents: https://app.iki.ai/playlist/393.

References

[1] https://a16z.com/the-techno-optimist-manifesto/

[2] https://a16z.com/open-source-from-community-to-commercialization/

[3] https://hackernoon.com/scratching-the-singularity-surface-the-past-present-and-mysterious-future-of-llms

Through knowledge sharing to singularity, accelerated by LLMs was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Advanced RAG Techniques: an Illustrated Overview

IVAN ILIN — Sun, 17 Dec 2023 12:02:51 GMT

Groningen, Martinitoren, where the article was composed in the peace of the Noorderplatsoen

A comprehensive study of the advanced retrieval augmented generation techniques and algorithms, systemising various approaches. The article comes with a collection of links in my knowledge base referencing various implementations and studies mentioned.

Since the goal of the post is to make an overview & explanation of available RAG algorithms and techniques, I won’t dive into implementation details in code, just referencing them and leaving it to the vast documentation & tutorials available.

Intro

If you are familiar with the RAG concept, please skip to the Advanced RAG part.

Retrieval Augmented Generation, aka RAG, provides LLMs with the information retrieved from some data source to ground its generated answer on. Basically RAG is Search + LLM prompting, where you ask the model to answer the query provided the information found with the search algorithm as a context. Both the query and the retrieved context are injected into the prompt that is sent to the LLM.

RAG is the most popular architecture of the LLM based systems in 2023. There are many products built almost solely on RAG, from Question Answering services combining web search engines with LLMs to hundreds of chat-with-your-data apps.

Even the vector search area got pumped by that hype, although embedding based search engines were made with faiss back in 2019. Vector database startups like chroma, weaviate.io and pinecone have been built upon existing open source search indices (mainly faiss and nmslib) and added an extra storage for the input texts plus some other tooling lately.

The two most prominent open source libraries for LLM-based pipelines & applications are LangChain and LlamaIndex, founded a month apart in October and November 2022, respectively, inspired by the ChatGPT launch and having gained massive adoption in 2023.

The purpose of this article is to systemise the key advanced RAG techniques with references to their implementations — mostly in the LlamaIndex — in order to facilitate other developers’ dive into the technology.

The problem is that most of the tutorials cherry-pick one or several techniques and explain in detail how to implement them rather than describing the full variety of the available tools.

Another thing is that both LlamaIndex and LangChain are amazing open source projects, developing at such a pace that their documentation is already thicker than a machine learning textbook in 2016.

Naive RAG

The starting point of the RAG pipeline in this article would be a corpus of text documents. We skip everything before that point, leaving it to the amazing open source data loaders connecting to any imaginable source from YouTube to Notion.

A scheme by the author, as well as all the schemes further in the text

The vanilla RAG case in brief looks the following way: you split your texts into chunks, then you embed these chunks into vectors with some Transformer Encoder model, you put all those vectors into an index and finally you create a prompt for an LLM that tells the model to answer the user’s query given the context we found on the search step.
At runtime we vectorise the user’s query with the same Encoder model and then execute a search of this query vector against the index, find the top-k results, retrieve the corresponding text chunks from our database and feed them into the LLM prompt as context.
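The index-time and query-time steps described above can be sketched in a few lines. This is a toy illustration only: the `embed` function is a bag-of-words counter standing in for a real Transformer Encoder, and all function names here are made up for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an Encoder model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    """Index time: embed every chunk once and keep the vectors next to the texts."""
    return [(embed(chunk), chunk) for chunk in chunks]

def retrieve(index, query, k=2):
    """Query time: embed the query with the same encoder and take the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

chunks = [
    "Faiss is a library for efficient similarity search.",
    "BM25 is a keyword-based ranking function.",
    "Groningen is a city in the Netherlands.",
]
index = build_index(chunks)
context = retrieve(index, "Which library does similarity search?", k=1)
```

In a real pipeline only `embed` and the index structure change; the flow of embed, search, fetch chunks stays the same.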

The prompt can look like that:

https://medium.com/media/30b7701044f1c52a209404dde05c8f9b/href
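The embedded prompt is not reproduced in this text, so here is a minimal sketch of what such a prompt builder could look like. The wording of the instructions and the `build_rag_prompt` name are assumptions, not the author's original prompt.

```python
def build_rag_prompt(query, context_chunks):
    """Inject the retrieved chunks and the user query into a single prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the user query using only the context below.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What is faiss?",
    ["Faiss is a library for efficient similarity search."],
)
```

The resulting string is what gets sent to whichever LLM you choose.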

Prompt engineering is the cheapest thing you can try to improve your RAG pipeline. Make sure you’ve checked a quite comprehensive OpenAI prompt engineering guide.

Obviously, despite OpenAI being the market leader as an LLM provider, there are a number of alternatives such as Claude from Anthropic, recent trendy smaller but very capable models like Mixtral from Mistral, Phi-2 from Microsoft and many open source options like Llama2, OpenLLaMA, Falcon, so you have a choice of the brain for your RAG pipeline.

Advanced RAG

Now we’ll dive into the overview of the advanced RAG techniques.
Here is a scheme depicting core steps and algorithms involved.
Some logic loops and complex multistep agentic behaviours are omitted to keep the scheme readable.

Some key components of an advanced RAG architecture. It’s more a choice of available instruments than a blueprint.

The green elements on the scheme are the core RAG techniques discussed further, the blue ones are texts. Not all the advanced RAG ideas are easily visualised on a single scheme, for example, various context enlarging approaches are omitted — we’ll dive into that on the way.

1. Chunking & vectorisation

First of all we want to create an index of vectors, representing our document contents and then in the runtime to search for the least cosine distance between all these vectors and the query vector which corresponds to the closest semantic meaning.

1.1 Chunking
Transformer models have a fixed input sequence length, and even if the input context window is large, the vector of a sentence or a few better represents their semantic meaning than a vector averaged over a few pages of text (depends on the model too, but true in general), so chunk your data: split the initial documents into chunks of some size without losing their meaning (splitting your text into sentences or paragraphs, not cutting a single sentence in two parts). There are various text splitter implementations capable of this task.
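As a rough illustration of sentence-preserving chunking, the sketch below greedily packs whole sentences into chunks under a word budget. The regex splitter and the `max_words` parameter are simplifying assumptions; production splitters count tokens and handle abbreviations, lists and so on.

```python
import re

def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_text(text, max_words=30):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)  # an over-long sentence becomes its own chunk
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "RAG combines search with an LLM. "
    "Chunks are embedded into vectors. "
    "The index returns the top results."
)
chunks = chunk_text(text, max_words=12)
```

No sentence is ever cut in two, so each chunk stays readable on its own.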

The size of the chunk is a parameter to think of. It depends on the embedding model you use and its capacity in tokens: standard transformer Encoder models like BERT-based Sentence Transformers take 512 tokens at most, while OpenAI ada-002 is capable of handling longer sequences like 8191 tokens. The compromise here is enough context for the LLM to reason upon vs a text embedding specific enough to efficiently execute search upon. Here you can find a study illustrating chunk size selection concerns. In LlamaIndex this is covered by the NodeParser class with some advanced options such as defining your own text splitter, metadata, nodes / chunks relations, etc.

1.2 Vectorisation
The next step is to choose a model to embed our chunks — there are quite some options, I go with the search optimised models like bge-large or E5 embeddings family — just check the MTEB leaderboard for the latest updates.

For an end2end implementation of the chunking & vectorisation step check an example of a full data ingestion pipeline in LlamaIndex.

2. Search index
2.1 Vector store index

In this scheme and everywhere further in the text I omit the Encoder block and send our query straight to the index for the scheme simplicity. The query always gets vectorised first of course. Same with the top k chunks: the index retrieves top k vectors, not chunks, but I replace them with chunks as fetching them is a trivial step.

The crucial part of the RAG pipeline is the search index, storing your vectorised content we got in the previous step. The most naive implementation uses a flat index — a brute force distance calculation between the query vector and all the chunks’ vectors.
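A flat index can be sketched as follows, assuming dense vectors are plain Python lists. The `FlatIndex` class and its methods are illustrative names and not the API of faiss or any other library.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

class FlatIndex:
    """Brute-force index: every query is compared against every stored vector."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, chunk_id, vector):
        self.ids.append(chunk_id)
        self.vectors.append(vector)

    def search(self, query, k=3):
        scored = [(cosine_distance(query, v), i) for v, i in zip(self.vectors, self.ids)]
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        return scored[:k]

index = FlatIndex()
index.add("chunk-a", [1.0, 0.0, 0.0])
index.add("chunk-b", [0.7, 0.7, 0.0])
index.add("chunk-c", [0.0, 0.0, 1.0])
hits = index.search([0.9, 0.1, 0.0], k=2)
```

The cost of each search grows linearly with the number of stored vectors, which is what approximate indices are designed to avoid.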

A proper search index, optimised for efficient retrieval at scales of 10000+ elements, is a vector index like faiss, nmslib or annoy, using some Approximate Nearest Neighbours implementation like clustering, trees or the HNSW algorithm.

There are also managed solutions like OpenSearch or ElasticSearch and vector databases, taking care of the data ingestion pipeline described in step 1 under the hood, like Pinecone, Weaviate or Chroma.

Depending on your index choice, data and search needs you can also store metadata along with vectors and then use metadata filters to search for information within some dates or sources for example.
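Metadata filtering can be pictured as a pre-filter applied before the vector search runs. The record layout (`source`, `date`) and the `filter_by_metadata` helper below are hypothetical; real vector stores expose their own filter syntax.

```python
from datetime import date

records = [
    {"text": "Q3 revenue grew.", "source": "report", "date": date(2023, 10, 5)},
    {"text": "Team offsite notes.", "source": "wiki", "date": date(2023, 6, 1)},
    {"text": "Q4 revenue forecast.", "source": "report", "date": date(2023, 12, 1)},
]

def filter_by_metadata(records, source=None, after=None):
    """Keep only records matching the constraints; vector search then runs on the survivors."""
    kept = []
    for record in records:
        if source is not None and record["source"] != source:
            continue
        if after is not None and record["date"] <= after:
            continue
        kept.append(record)
    return kept

candidates = filter_by_metadata(records, source="report", after=date(2023, 11, 1))
```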

LlamaIndex supports lots of vector store indices but there are also other simpler index implementations supported like list index, tree index, and keyword table index — we’ll talk about the latter in the Fusion retrieval part.

2.2 Hierarchical indices

In case you have many documents to retrieve from, you need to be able to efficiently search inside them, find relevant information and synthesise it in a single answer with references to the sources. An efficient way to do that in case of a large database is to create two indices — one composed of summaries and the other one composed of document chunks, and to search in two steps, first filtering out the relevant docs by summaries and then searching just inside this relevant group.
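The two-step search can be sketched with a toy relevance score. Here Jaccard word overlap stands in for vector similarity, and the `docs` structure and function names are assumptions for illustration.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    """Toy relevance score: Jaccard overlap of word sets (stands in for vector similarity)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

docs = {
    "faiss_guide": {
        "summary": "Guide to vector search with faiss indexes",
        "chunks": ["Faiss supports flat and HNSW indexes.", "Faiss runs on GPU."],
    },
    "cooking": {
        "summary": "Recipes for pasta and soup",
        "chunks": ["Boil pasta for ten minutes.", "Soup needs salt."],
    },
}

def hierarchical_search(query, docs, top_docs=1, top_chunks=1):
    # Step 1: rank documents by their summaries only.
    ranked = sorted(docs, key=lambda d: overlap(query, docs[d]["summary"]), reverse=True)
    selected = ranked[:top_docs]
    # Step 2: rank chunks, but only inside the selected documents.
    pool = [(d, c) for d in selected for c in docs[d]["chunks"]]
    pool.sort(key=lambda pair: overlap(query, pair[1]), reverse=True)
    return pool[:top_chunks]

result = hierarchical_search("which faiss indexes support vector search", docs)
```

Keeping the document id next to each chunk is what later allows the answer to reference its sources.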

2.3 Hypothetical Questions and HyDE

Another approach is to ask an LLM to generate a question for each chunk and embed these questions in vectors, at runtime performing query search against this index of question vectors (replacing chunks vectors with questions vectors in our index) and then after retrieval route to original text chunks and send them as the context for the LLM to get an answer.
This approach improves search quality due to a higher semantic similarity between query and hypothetical question compared to what we’d have for an actual chunk.

There is also the reversed logic approach called HyDE: you ask an LLM to generate a hypothetical response given the query and then use its vector along with the query vector to enhance search quality.
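A minimal HyDE sketch follows. The LLM call is replaced by a hard-coded stub (`fake_llm_answer`, a hypothetical name), the embedding is a toy word counter, and adding the two vectors is just one possible way to use the hypothetical answer along with the query.

```python
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (an assumption standing in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    return sum(a[t] * b[t] for t in a.keys() & b.keys())

def fake_llm_answer(query):
    """Hypothetical stub: a real system would call an LLM here."""
    return "HNSW is a graph based approximate nearest neighbour algorithm."

def hyde_search(query, chunks, k=1):
    hypothetical = fake_llm_answer(query)
    # Search with the query vector enriched by the hypothetical answer's vector.
    q = embed(query) + embed(hypothetical)
    return sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:k]

chunks = [
    "BM25 ranks documents by keyword frequency.",
    "The HNSW algorithm builds a layered graph for approximate nearest neighbour search.",
]
top = hyde_search("How does HNSW work?", chunks)
```

The hypothetical answer shares vocabulary with the relevant chunk that the short query alone does not contain, which is the effect HyDE relies on.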

2.4 Context enrichment

The concept here is to retrieve smaller chunks for better search quality, but add up surrounding context for LLM to reason upon.
There are two options — to expand context by sentences around the smaller retrieved chunk or to split documents recursively into a number of larger parent chunks, containing smaller child chunks.

2.4.1 Sentence Window Retrieval
In this scheme each sentence in a document is embedded separately which provides great accuracy of the query to context cosine distance search.
In order to better reason upon the found context after fetching the most relevant single sentence we extend the context window by k sentences before and after the retrieved sentence and then send this extended context to LLM.

The green part is the sentence embedding found during the index search, and the whole black + green paragraph is fed to the LLM to enlarge its context while reasoning upon the provided query
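The window expansion itself is a small slicing operation. In this sketch `hit_index` is assumed to be the position of the retrieved sentence within its document, and the function name is illustrative.

```python
def sentence_window(sentences, hit_index, k=1):
    """Return the matched sentence plus k neighbours on each side, clipped to the document."""
    start = max(0, hit_index - k)
    end = min(len(sentences), hit_index + k + 1)
    return " ".join(sentences[start:end])

sentences = [
    "Faiss was released by Facebook AI Research.",
    "It implements several index types.",
    "HNSW is one of them.",
    "It trades memory for speed.",
    "Annoy is an alternative library.",
]
context = sentence_window(sentences, hit_index=2, k=1)
```

Search quality comes from the single-sentence embedding, while the LLM receives the wider window.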

2.4.2 Auto-merging Retriever (aka Parent Document Retriever)

The idea here is pretty much similar to Sentence Window Retriever — to search for more granular pieces of information and then to extend the context window before feeding said context to an LLM for reasoning. Documents are split into smaller child chunks referring to larger parent chunks.

Documents are split into a hierarchy of chunks and then the smallest leaf chunks are sent to the index. At retrieval time we retrieve k leaf chunks, and if more than n of them refer to the same parent chunk, we replace them with this parent chunk and send it to the LLM for answer generation.

Fetch smaller chunks during retrieval first, then if more than n chunks in top k retrieved chunks are linked to the same parent node (larger chunk), we replace the context fed to the LLM by this parent node — works like auto merging a few retrieved chunks into a larger parent chunk, hence the method name. Just to note — search is performed just within the child nodes index. Check the LlamaIndex tutorial on Recursive Retriever + Node References for a deeper dive.
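The merging rule can be sketched as below, assuming each leaf chunk carries a reference to its parent. The dictionaries `parent_of` and `parent_text` and the threshold `n` are illustrative; LlamaIndex's own retriever has its own data model and parameters.

```python
from collections import Counter

def auto_merge(retrieved_leaves, parent_of, parent_text, n=2):
    """Replace leaves with their parent when more than n retrieved leaves share it."""
    counts = Counter(parent_of[leaf] for leaf in retrieved_leaves)
    merged, seen_parents = [], set()
    for leaf in retrieved_leaves:
        parent = parent_of[leaf]
        if counts[parent] > n:
            if parent not in seen_parents:  # emit each merged parent only once
                merged.append(parent_text[parent])
                seen_parents.add(parent)
        else:
            merged.append(leaf)
    return merged

parent_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
parent_text = {"A": "a1 a2 a3 a4", "B": "b1 b2"}
context = auto_merge(["a1", "b1", "a2", "a3"], parent_of, parent_text, n=2)
```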

2.5 Fusion retrieval or hybrid search

A relatively old idea is to take the best of both worlds, combining keyword-based old-school search (sparse retrieval algorithms like tf-idf or the search industry standard BM25) with modern semantic or vector search in one retrieval result.
The only trick here is to properly combine the retrieved results with different similarity scores — this problem is usually solved with the help of the Reciprocal Rank Fusion algorithm, reranking the retrieved results for the final output.
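Reciprocal Rank Fusion ignores the raw similarity scores and uses only ranks: each document receives the sum of 1 / (k + rank) over the lists it appears in. A minimal sketch, with k = 60 as the commonly used constant and made-up document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because only ranks are used, scores from BM25 and from cosine similarity never need to be normalised against each other.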

In LangChain this is implemented in the Ensemble Retriever class, combining a list of retrievers you define, for example a faiss vector index and a BM25 based retriever and using RRF for reranking.

In LlamaIndex this is done in a pretty similar fashion.

Hybrid or fusion search usually provides better retrieval results as two complementary search algorithms are combined, taking into account both semantic similarity and keyword matching between the query and the stored documents.

3. Reranking & filtering

So we got our retrieval results with any of the algorithms described above, now it is time to refine them through filtering, re-ranking or some transformation. In LlamaIndex there is a variety of available Postprocessors, filtering out results based on similarity score, keywords or metadata, or reranking them with other models like an LLM, a sentence-transformer cross-encoder or the Cohere reranking endpoint, or based on metadata like date recency: basically, all you could imagine.

This is the final step before feeding our retrieved context to LLM in order to get the resulting answer.
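
As a rough illustration of what such a postprocessing step does, here is a sketch of a similarity cutoff combined with a keyword filter. The function and its parameters are hypothetical and do not mirror the LlamaIndex Postprocessor API.

```python
def postprocess(results, similarity_cutoff=0.7, required_keywords=(), top_n=None):
    """Filter and re-sort retrieved chunks before they go into the prompt.

    results: list of (chunk_text, similarity_score) pairs.
    similarity_cutoff: chunks scoring below this value are dropped.
    required_keywords: every keyword must occur in a chunk (case-insensitive).
    top_n: optionally keep only the best n chunks.
    """
    kept = []
    for text, score in results:
        if score < similarity_cutoff:
            continue
        lowered = text.lower()
        if any(keyword.lower() not in lowered for keyword in required_keywords):
            continue
        kept.append((text, score))
    # Best-scoring chunks first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept if top_n is None else kept[:top_n]
```

A model-based reranker would replace the sort key with a score produced by a cross-encoder or an LLM, while the filtering logic stays the same.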

Now it is time to get to the more sophisticated RAG techniques like Query transformation and Routing, both involving LLMs and thus representing agentic behaviour — some complex logic involving LLM reasoning within our RAG pipeline.

4. Query transformations

Query transformations are a family of techniques using an LLM as a reasoning engine to modify user input in order to improve retrieval quality. There are different options to do that.

Query transformation principles illustrated

If the query is complex, the LLM can decompose it into several sub-queries. For example, if you ask:
— “What framework has more stars on Github, Langchain or LlamaIndex?”,
it is unlikely that we’ll find a direct comparison in some text in our corpus, so it makes sense to decompose this question into two sub-queries that presuppose simpler and more concrete information retrieval:
— “How many stars does Langchain have on Github?”
— “How many stars does Llamaindex have on Github?”
They would be executed in parallel, and then the retrieved context would be combined in a single prompt for the LLM to synthesize a final answer to the initial query. Both libraries implement this functionality: as a Multi Query Retriever in LangChain and as a Sub Question Query Engine in LlamaIndex.

Step-back prompting uses an LLM to generate a more general query; retrieving for it gives us a more general or high-level context that is useful for grounding the answer to our original query.
Retrieval for the original query is also performed, and both contexts are fed to the LLM at the final answer generation step.
Here is a LangChain implementation.
Query re-writing uses an LLM to reformulate the initial query in order to improve retrieval. Both LangChain and LlamaIndex have implementations, though a bit different; I find the LlamaIndex solution more powerful here.
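
The sub-query flow described above can be sketched in a few lines. This is only an outline under stated assumptions: decompose, retrieve and synthesize are hypothetical callables standing in for an LLM call, an index lookup and the final LLM call, and none of them is a LangChain or LlamaIndex API.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_with_sub_queries(query, decompose, retrieve, synthesize):
    """Decompose a query, retrieve for each part in parallel, then synthesize.

    decompose(query) -> list of sub-queries (an LLM call in practice).
    retrieve(sub_query) -> context string for one sub-query.
    synthesize(query, context) -> final answer (another LLM call in practice).
    """
    sub_queries = decompose(query)
    # Retrieval calls are independent, so they can run concurrently;
    # pool.map keeps the results in the order of the sub-queries.
    with ThreadPoolExecutor() as pool:
        contexts = list(pool.map(retrieve, sub_queries))
    combined_context = "\n\n".join(
        f"Sub-question: {sub_query}\nContext: {context}"
        for sub_query, context in zip(sub_queries, contexts)
    )
    return synthesize(query, combined_context)
```

Step-back prompting fits the same skeleton: decompose would return the original query plus its more general version.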

Reference citations

This one goes without a number as this is more an instrument than a retrieval improvement technique, although a very important one.
If we’ve used multiple sources to generate an answer, either due to the initial query complexity (we had to execute multiple subqueries and then combine the retrieved context in one answer) or because we found relevant context for a single query in various documents, the question arises whether we can accurately back-reference our sources.

There are a couple of ways to do that:

Insert this referencing task into our prompt and ask LLM to mention ids of the used sources.
Match the parts of the generated response to the original text chunks in our index. LlamaIndex offers an efficient fuzzy-matching-based solution for this case. In case you have not heard of fuzzy matching, this is an incredibly powerful string matching technique.
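
To give a feeling for the second option, here is a small sketch that matches answer sentences to source chunks with the standard library difflib module. It is a stand-in for illustration, not the LlamaIndex implementation; the function name and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher


def cite_sources(answer_sentences, chunks, min_ratio=0.6):
    """Map each answer sentence to the id of the most similar source chunk.

    answer_sentences: list of sentences from the generated answer.
    chunks: dict chunk_id -> chunk text from the index.
    min_ratio: similarity below this value means "no reliable source found".
    """
    citations = {}
    for sentence in answer_sentences:
        best_id, best_ratio = None, 0.0
        for chunk_id, chunk_text in chunks.items():
            # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
            ratio = SequenceMatcher(None, sentence.lower(), chunk_text.lower()).ratio()
            if ratio > best_ratio:
                best_id, best_ratio = chunk_id, ratio
        citations[sentence] = best_id if best_ratio >= min_ratio else None
    return citations
```

Comparing a sentence with a whole chunk only works when chunks are short; for longer chunks one would compare against sentence-sized windows of the chunk instead.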

5. Chat Engine

The next big thing about building a nice RAG system that can handle more than a single query is the chat logic, taking into account the dialogue context, same as in the classic chat bots of the pre-LLM era.
This is needed to support follow-up questions, anaphora, or arbitrary user commands relating to the previous dialogue context. It is solved by a query compression technique that takes the chat context into account along with the user query.

As always, there are several approaches to said context compression. A popular and relatively simple one is the ContextChatEngine, which first retrieves context relevant to the user’s query and then sends it to the LLM along with the chat history from the memory buffer, so that the LLM is aware of the previous context while generating the next answer.

A bit more sophisticated case is CondensePlusContextMode: in each interaction the chat history and the last message are condensed into a new query, then this query goes to the index, and the retrieved context is passed to the LLM along with the original user message to generate an answer.
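
The condense-then-retrieve turn can be outlined as follows. This is a sketch of the idea only: llm and retrieve are hypothetical callables, and the prompt wording is an assumption rather than what LlamaIndex actually sends.

```python
def condense_plus_context_turn(history, user_message, llm, retrieve):
    """Run one chat turn: condense, retrieve, answer, update the history.

    history: list of (role, text) tuples, mutated in place.
    llm(prompt) -> completion string; retrieve(query) -> context string.
    Returns the standalone query and the generated answer.
    """
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    # Step 1: resolve follow-ups and anaphora into a self-contained query.
    standalone_query = llm(
        "Rewrite the last user message as a standalone question.\n"
        f"Chat history:\n{transcript}\nLast message: {user_message}"
    )
    # Step 2: the condensed query, not the raw message, goes to the index.
    context = retrieve(standalone_query)
    # Step 3: answer the original message using the retrieved context.
    answer = llm(f"Context:\n{context}\n\nUser: {user_message}")
    history.append(("user", user_message))
    history.append(("assistant", answer))
    return standalone_query, answer
```

The simpler ContextChatEngine flow would skip step 1, retrieve with the raw user message and put the chat history into the answer prompt instead.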

It’s important to note that LlamaIndex also supports an OpenAI agents based Chat Engine, providing a more flexible chat mode, and LangChain also supports the OpenAI functions API.

An illustration of different Chat Engine types and principles

There are other Chat engine types like ReAct Agent, but let’s skip to Agents themselves in section 7.

6. Query Routing

Query routing is the step of LLM-powered decision making upon what to do next given the user query — the options usually are to summarise, to perform search against some data index or to try a number of different routes and then to synthesise their output in a single answer.

Query routers are also used to select an index or, more broadly, a data store to send the user query to. Either you have multiple sources of data, for example a classic vector store and a graph database or a relational DB, or you have a hierarchy of indices; for multi-document storage a pretty classic case would be an index of summaries and another index of document chunk vectors.

Defining the query router includes setting up the choices it can make.
The selection of a routing option is performed with an LLM call, returning its result in a predefined format, used to route the query to the given index or, if we are talking about agentic behaviour, to sub-chains or even other agents, as shown in the Multi-Document Agents scheme below.
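
A minimal sketch of such a router, assuming the predefined format is a small JSON object. The llm callable, the prompt and the fallback to the first option are all assumptions of this example, not the behaviour of the LlamaIndex or LangChain routers.

```python
import json


def route_query(query, choices, llm):
    """Ask an LLM to pick one of the described options and return its index.

    choices: list of option descriptions, e.g. one per index or query engine.
    llm(prompt) -> completion string, expected to look like {"choice": 1}.
    """
    options = "\n".join(f"{i}: {description}" for i, description in enumerate(choices))
    raw = llm(
        "Pick the best option for the query and reply as JSON "
        '{"choice": <index>}.\n'
        f"Options:\n{options}\nQuery: {query}"
    )
    try:
        index = int(json.loads(raw)["choice"])
    except (ValueError, KeyError, TypeError):
        index = 0  # malformed output: fall back to the first option
    if not 0 <= index < len(choices):
        index = 0  # out-of-range answer: same fallback
    return index
```

The returned index is then used to dispatch the query, for example engines[index].query(query) for a hypothetical list of query engines.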

Both LlamaIndex and LangChain have support for query routers.

7. Agents in RAG

Agents (supported both by LangChain and LlamaIndex) have been around almost since the first LLM API was released. The idea was to provide an LLM, capable of reasoning, with a set of tools and a task to be completed. The tools might include some deterministic functions like any code function or an external API, or even other agents; this LLM chaining idea is where LangChain got its name from.

Agents are a huge topic in themselves and it’s impossible to make a deep enough dive into it inside a RAG overview, so I’ll just continue with the agent-based multi-document retrieval case, making a short stop at the OpenAI Assistants station, as it’s a relatively new thing, presented at the recent OpenAI dev conference as GPTs, and working under the hood of the RAG system described below.

OpenAI Assistants basically have implemented a lot of tools needed around an LLM that we previously had in open source: a chat history, a knowledge storage, a document uploading interface and, maybe most important, a function calling API. The latter provides capabilities to convert natural language into API calls to external tools or database queries.

In LlamaIndex there is an OpenAIAgent class marrying this advanced logic with the ChatEngine and QueryEngine classes, providing knowledge-based and context-aware chatting along with the ability to make multiple OpenAI function calls in one conversation turn, which really brings smart agentic behaviour.

Let’s take a look at the Multi-Document Agents scheme — a pretty sophisticated setting, involving initialisation of an agent (OpenAIAgent) upon each document, capable of doc summarisation and the classic QA mechanics, and a top agent, responsible for queries routing to doc agents and for the final answer synthesis.

Each document agent has two tools — a vector store index and a summary index, and based on the routed query it decides which one to use.
And for the top agent, all document agents are, respectively, tools.

This scheme illustrates an advanced RAG architecture with a lot of routing decisions made by each involved agent. The benefit of such an approach is the ability to compare different solutions or entities described in different documents and their summaries, along with the classic single-doc summarisation and QA mechanics; this basically covers the most frequent chat-with-collection-of-docs use cases.

A scheme illustrating multi document agents, involving both query routing and agentic behavior patterns.

The drawback of such a complex scheme can be guessed from the picture: it’s a bit slow due to multiple back-and-forth iterations with the LLMs inside our agents. Keep in mind that an LLM call is always the longest operation in a RAG pipeline, since search is optimised for speed by design. So for a large multi-document storage I’d recommend thinking of some simplifications to this scheme to make it scalable.

8. Response synthesiser

This is the final step of any RAG pipeline — generate an answer based on all the context we carefully retrieved and on the initial user query.
The simplest approach would be just to concatenate and feed all the fetched context (above some relevance threshold) along with the query to an LLM at once.
But, as always, there are other more sophisticated options involving multiple LLM calls to refine retrieved context and generate a better answer.

The main approaches to response synthesis are:
1. iteratively refine the answer by sending retrieved context to LLM chunk by chunk
2. summarise the retrieved context to fit into the prompt
3. generate multiple answers based on different context chunks and then concatenate or summarise them.
For more details please check the Response synthesizer module docs.
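
The first option from the list above, iterative refinement, can be sketched like this. The llm callable and the prompt texts are assumptions made for the example; the real response synthesizer modules use their own prompt templates.

```python
def refine_answer(query, chunks, llm):
    """Build an answer chunk by chunk, one LLM call per retrieved chunk.

    chunks: retrieved context chunks, most relevant first.
    llm(prompt) -> completion string.
    """
    answer = ""
    for chunk in chunks:
        if not answer:
            # First chunk: ordinary question answering over a single context.
            prompt = f"Context:\n{chunk}\n\nQuestion: {query}\nAnswer:"
        else:
            # Later chunks: show the current answer and ask for an improvement.
            prompt = (
                f"Question: {query}\nExisting answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer using the new context if it helps."
            )
        answer = llm(prompt)
    return answer
```

The price of this approach is one sequential LLM call per chunk, which is exactly the latency concern raised in the agents section.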

Encoder and LLM fine-tuning

This approach involves fine-tuning one of the two DL models involved in our RAG pipeline: either the Transformer Encoder, responsible for embedding quality and thus context retrieval quality, or an LLM, responsible for the best usage of the provided context to answer the user query. Luckily, the latter is a good few-shot learner.

One big advantage nowadays is the availability of high-end LLMs like GPT-4 to generate high quality synthetic datasets.
But you should always be aware that taking an open-source model trained by professional research teams on carefully collected, cleaned and validated large datasets and quickly tuning it on a small synthetic dataset might narrow down the model’s capabilities in general.

Encoder fine-tuning

I’ve also been a bit skeptical about the Encoder finetuning approach, as the latest Transformer Encoders optimised for search are pretty efficient.
So I have tested the performance increase provided by finetuning of bge-large-en-v1.5 (top 4 of the MTEB leaderboard at the time of writing) in the LlamaIndex notebook setting, and it demonstrated a 2% retrieval quality increase. Nothing dramatic but it is nice to be aware of that option, especially if you have a narrow domain dataset you’re building RAG for.

Ranker fine-tuning

The other good old option is to have a cross-encoder for reranking your retrieved results if you don’t trust your base Encoder completely.
It works the following way — you pass the query and each of the top k retrieved text chunks to the cross-encoder, separated by a SEP token, and fine-tune it to output 1 for relevant chunks and 0 for non-relevant.
A good example of such a tuning process can be found here; the results say the pairwise score was improved by 4% by cross-encoder finetuning.

LLM fine-tuning

Recently OpenAI started providing LLM finetuning API and LlamaIndex has a tutorial on finetuning GPT-3.5-turbo in RAG setting to “distill” some of the GPT-4 knowledge. The idea here is to take a document, generate a number of questions with GPT-3.5-turbo, then use GPT-4 to generate answers to these questions based on the document contents (build a GPT4-powered RAG pipeline) and then to fine-tune GPT-3.5-turbo on that dataset of question-answer pairs. The ragas framework used for the RAG pipeline evaluation shows a 5% increase in the faithfulness metrics, meaning the fine-tuned GPT 3.5-turbo model made a better use of the provided context to generate its answer, than the original one.

A bit more sophisticated approach is demonstrated in the recent paper RA-DIT: Retrieval Augmented Dual Instruction Tuning by Meta AI Research, suggesting a technique to tune both the LLM and the Retriever (a Dual Encoder in the original paper) on triplets of query, context and answer. For the implementation details please refer to this guide.
This technique was used to fine-tune both OpenAI LLMs through the fine-tuning API and the Llama2 open-source model (in the original paper), resulting in a ~5% increase in knowledge-intensive task metrics (compared to Llama2 65B with RAG) and a couple percent increase in common sense reasoning tasks.

In case you know better approaches to LLM finetuning for RAG, please share your expertise in the comments section, especially if they are applied to the smaller open source LLMs.

Evaluation

There are several frameworks for RAG systems performance evaluation sharing the idea of having a few separate metrics like overall answer relevance, answer groundedness, faithfulness and retrieved context relevance.

Ragas, mentioned in the previous section, uses faithfulness and answer relevance as the generated answer quality metrics and classic context precision and recall for the retrieval part of the RAG scheme.

In a recently released great short course Building and Evaluating Advanced RAG by Andrew Ng, LlamaIndex and the evaluation framework TruLens, they suggest the RAG triad: retrieved context relevance to the query, groundedness (how much the LLM answer is supported by the provided context) and answer relevance to the query.

The key and the most controllable metric is the retrieved context relevance — basically parts 1–7 of the advanced RAG pipeline described above plus the Encoder and Ranker fine-tuning sections are meant to improve this metric, while part 8 and LLM fine-tuning are focusing on answer relevance and groundedness.

A good example of a pretty simple retriever evaluation pipeline could be found here and it was applied in the Encoder fine-tuning section.
A bit more advanced approach, taking into account not only the hit rate but also the Mean Reciprocal Rank, a common search engine metric, as well as generated answer metrics such as faithfulness and relevance, is demonstrated in the OpenAI cookbook.
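
Both retrieval metrics are easy to compute by hand. The sketch below assumes a labelled set where each query has a known set of relevant chunk ids; the function name and the data layout are illustrative and do not come from the referenced notebooks.

```python
def hit_rate_and_mrr(retrieved, relevant):
    """Compute hit rate and Mean Reciprocal Rank for a set of queries.

    retrieved: dict query_id -> ranked list of retrieved chunk ids (best first).
    relevant: dict query_id -> set of chunk ids considered correct.
    """
    hits, reciprocal_ranks = 0, []
    for query_id, ranked_ids in retrieved.items():
        # 1-based rank of the first relevant chunk, or None if there is none.
        rank = next(
            (position for position, chunk_id in enumerate(ranked_ids, start=1)
             if chunk_id in relevant[query_id]),
            None,
        )
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
    total = len(retrieved)
    return hits / total, sum(reciprocal_ranks) / total


retrieved = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h"]}
relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"z"}}
hit_rate, mrr = hit_rate_and_mrr(retrieved, relevant)
```

Here two of the three queries have a relevant chunk in their results (hit rate 2/3), and the reciprocal ranks 1, 1/3 and 0 average to an MRR of 4/9.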

LangChain has a pretty advanced evaluation framework LangSmith where custom evaluators may be implemented plus it monitors the traces running inside your RAG pipeline in order to make your system more transparent.

In case you are building with LlamaIndex, there is a rag_evaluator llama pack, providing a quick tool to evaluate your pipeline with a public dataset.

Conclusion

I tried to outline the core algorithmic approaches to RAG and to illustrate some of them in hopes this might spark some novel ideas to try in your RAG pipeline, or bring some system to the vast variety of techniques that have been invented this year. For me 2023 was the most exciting year in ML so far.

There are many other things to consider, like web search based RAG (RAGs by LlamaIndex, webLangChain, etc), taking a deeper dive into agentic architectures (and the recent OpenAI stake in this game) and some ideas on LLM long-term memory.

The main production challenge for RAG systems besides answer relevance and faithfulness is speed, especially if you are into the more flexible agent-based schemes, but that’s a thing for another post. This streaming feature ChatGPT and most other assistants use is not a random cyberpunk style, but merely a way to shorten the perceived answer generation time.
That is why I see a very bright future for the smaller LLMs and recent releases of Mixtral and Phi-2 are leading us in this direction.

Thank you very much for reading this long post!

The main references are collected in my knowledge base, there is a co-pilot to chat with this set of documents: https://app.iki.ai/playlist/236.

Find me on LinkedIn or Twitter

Advanced RAG Techniques: an Illustrated Overview was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building a news aggregator from scratch: news filtering, classification, grouping in threads and…

IVAN ILIN — Mon, 27 Jan 2020 13:19:32 GMT

Building a news aggregator from scratch: news filtering, classification, grouping in threads and ranking

Fake news headlines from https://www.designboom.com/design/the-fake-newsstand-tbwa-chiat-day-columbia-journalism-review-11-05-2018/

The idea behind this post is to show a reasonably simple approach one can implement in a couple of weeks to solve the real-world problem of creating a news aggregator like Google News or Yandex News, showing the top news threads out of millions of news items scraped all over the web.

Problem statement and restrictions

So this is another post about NLP where I shall describe a few algorithms for texts filtering, classification, grouping and ranking developed during the Telegram Data Clustering contest. The motivation behind this post is to demonstrate that you can build a decent texts processing system and run it on your laptop without even a GPU.

The contest included five tasks: detecting news languages, filtering news from other texts (such as encyclopedic articles, some random entertainment posts, blog posts, etc.), classifying news into one of seven categories (Society, Economy, Sports, Science, Technologies, Entertainment and Other), grouping news into threads and ranking these threads by importance. The full contest rules are available here.

Selection of the particular algorithms and instruments was largely dependent on the contest rules, which put some constraints on the implementation: each task had to be executed within 60 sec per 1000 articles on a Debian machine with 8 CPUs and 16 GB of RAM, no external services or APIs could be used, and the algorithm should not even assume there is an internet connection (to download some pretrained models, for example). Hence no SOTA Transformer models like BERT, ALBERT or GPT-2 could be involved.

The high-level solution architecture described in this article looks the following way:

Text preprocessing and vectorization
Texts classification with a custom Deep Neural Network (DNN) with an LSTM layer and an Attention layer
Grouping texts in threads with the nearest neighbors search algorithm controlled by the Levenshtein distance within each group.
Ranking news threads by importance

Note: all the code and ideas described in this post have been developed during the contest period (2 weeks), though some fine-tuning of the grouping algorithm as well as code styling have been performed afterwards during the New Year holidays.

Raw data parsing

Contest participants were provided with hundreds of thousands of publications saved as html files containing a news title, text, sometimes an image, a publication date, an author and a media source. I used the BeautifulSoup library as a handy html parser to extract all the needed data into a pandas dataframe.

Language detection

This part was pretty straightforward: I used a fast langdetect implementation as the language detector and fed it titles by default to speed up detection, since language detection on texts with a 10x larger average length is slower.

https://medium.com/media/b3d2c71943f7ae2ea8fd99f1de791944/href

This step took 12 sec per 1000 files, the detection accuracy is over 99%. The resulting data looked the following way:

https://medium.com/media/e1464b62c14b3d9bede74a20e87fb911/href

Text preprocessing logic

Text vectorisation logic is one of the core algorithm decisions contest participants had to come up with. We had a large and varied enough corpus of around 1M articles to use pretrained word embeddings instead of the basic TF-IDF approach. But first we had to perform the common tokenization and stemming procedures. I used the stopwords list and the Porter Stemmer from the nltk library.

https://medium.com/media/c82add3ec392d3cdbe863cf52129aeae/href

After this step each text was represented by a list of word tokens.

The next step was the replacement of each token in a list with a vector from one of the pretrained language models, GloVe or fastText.

The result of this operation was that each text was now represented by a list of semantically rich word vectors. I put a restriction on the maximum length of the list, 50 words; the headline was concatenated with the beginning of the article’s body.

In order to equalise the length of all the lists, a padding operation was performed. We’ve now got the feature tensor of our text corpus, where each row represents a sequence of pretrained word vectors of the chosen dimension.
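
The truncation and padding logic described above can be illustrated with a dependency-free sketch. It is not the contest code: the toy stopword list replaces the nltk one, stemming is left out, the embeddings argument stands for a loaded GloVe or fastText lookup table, and mapping unknown words to a zero vector is an assumption of this example.

```python
import re

# Toy stopword list for the example; the post uses the nltk stopwords list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters and drop stopwords."""
    return [tok for tok in re.findall(r"[a-z0-9]+", text.lower()) if tok not in STOPWORDS]


def vectorize(title, body, embeddings, dim, max_len=50):
    """Turn a headline plus body into exactly max_len word vectors.

    embeddings: dict token -> list of floats of length dim (pretrained vectors).
    Sequences longer than max_len are truncated, shorter ones are zero-padded.
    """
    tokens = tokenize(title + " " + body)[:max_len]
    zero = [0.0] * dim
    # Out-of-vocabulary tokens get a zero vector (an assumption of this sketch).
    vectors = [list(embeddings.get(token, zero)) for token in tokens]
    vectors += [list(zero) for _ in range(max_len - len(vectors))]
    return vectors
```

Stacking the outputs for all texts gives a tensor of shape (number of texts, max_len, dim), which is the input expected by a recurrent layer.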

https://medium.com/media/1d46c8a7124e2ca305e4eb018abdf63a/href

Deep neural network architecture

Since we had quite enough data for a neural network training I decided to use a deep neural network (DNN) classifier in order to tell news from not news and to define news topics.

Taking into account that it was an algorithm contest, an obvious choice for maximising the solution’s accuracy would be a SOTA NLP model architecture, namely some kind of large Transformer like BERT, but as I have mentioned before, these kinds of models would be too large and too slow to pass the restrictions imposed on hardware and on text processing speed. The other drawback is that training such a model would take a few days, leaving me no time for fine-tuning it.

I had to come up with a simpler architecture, so I implemented a lightweight neural net with an RNN (LSTM) layer taking the sequence of word embeddings representing the texts and an Attention layer on top of it. The output of the network should be the class probabilities (binary classification for the news filtering step and multiclass for news category detection), so the upper part of the network consisted of a set of fully connected layers.

News topic detection — multiclass classification

I shall be explaining the selection of particular DNN architecture using the multiclass classifier (detecting news categories) as an example because its objective is more challenging and its performance is independent of the threshold value used in the binary classifier to draw the margin between positive and negative classes.

The upper part of our neural net was made up of three Dense layers, the output layer had 7 units corresponding to the number of classes with the softmax activation function and categorical crossentropy as a loss function.

To train it I used the News_Category_Dataset and applied a mapping logic in order to fit the initial 31 news categories into one of 7 categories: Society, Economy, Sports, Technology, Entertainment, Science and Other to comply with the contest objectives.

https://medium.com/media/dbddc68d0a5d550b96d50e42ca88307a/href

I would like to share the logic behind the NN model’s architecture selection and hyperparameters fine-tuning.

The Dropout layers increase the generalisation ability of our model and prevent it from overfitting (a Dropout layer randomly zeroes out the given fraction of its input units at each update of the training phase). The model’s overfitting without the Dropout layers is clearly demonstrated by the learning curves: the accuracy on the training set grows up to 95% while the accuracy on the validation dataset barely grows during training.
A BatchNormalisation layer applied on top of the LSTM-Attention construction normalized the activations after the dropout at each batch keeping the activation mean close to 0 and the activation standard deviation close to 1. It tends to increase test accuracy on the larger batch sizes and to decrease it on the smaller ones.
Regularization applied to Dense layers penalizes the extreme values of layer parameters.
Batch size selection can also affect a model’s performance. The larger the size is, the faster your model trains and the more precisely the gradient vector is calculated on each step. This results in noise reduction, which makes the model more prone to converging to a local minimum, so the batch size selection is usually a tradeoff between speed, memory consumption and the model’s performance. Values between 32 and 256 are the common choice; in our case the model showed top accuracy with batch size 64. Increasing the batch size up to 512 or 1024 significantly decreases the model’s accuracy (by 2% and 4% respectively).

Model’s performance depending on architecture and hyperparameters

Attention layer explained

It is worth saying a few words about the attention mechanism used in our model. The theory behind the applied approach is described in the arXiv paper by Raffel et al. This is a simplified model of attention for feed-forward neural networks addressing the RNN information flow problem for long sequences. In particular, the attention layer provides an optimal transition to the fully connected layer, creating a context vector (an embedding for the sequence of input word vectors) as the weighted average of the hidden states of the input sequence, with the weights representing the importance of the elements of the sequence. With h_t denoting the hidden state at step t, the explicit notation looks the following way:

$$e_t = a(h_t), \quad \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \quad c = \sum_{t=1}^{T} \alpha_t h_t$$

where T is the length of the sequence and a is a learnable function, namely a single hidden layer feed-forward network with the tanh activation function, jointly trained with the global model.
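
For readers who prefer code to notation, here is a dependency-free sketch of the forward pass of this kind of attention: each hidden state gets a score from a small tanh network, the scores are turned into weights with a softmax, and the context vector is the weighted average of the hidden states. The parameter names w, b and v are assumptions of this sketch; in the actual model they are trainable weights of a custom Keras layer.

```python
import math


def attention_context(hidden_states, w, b, v):
    """Feed-forward attention over a sequence of hidden states.

    hidden_states: list of T vectors (lists of floats), e.g. LSTM outputs.
    w, b: weight matrix (list of rows) and bias of the hidden tanh layer.
    v: output weights mapping the hidden layer to a scalar score.
    Returns the attention weights and the context vector.
    """
    # Score every time step: e_t = v . tanh(W h_t + b).
    scores = []
    for h in hidden_states:
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(w, b)
        ]
        scores.append(sum(v_i * x for v_i, x in zip(v, hidden)))

    # Softmax over time steps (shifted by the max for numerical stability).
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    alphas = [value / total for value in exps]

    # Context vector: weighted average of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(alpha * h[d] for alpha, h in zip(alphas, hidden_states))
        for d in range(dim)
    ]
    return alphas, context
```

With all-zero parameters every step gets the same score, so the weights are uniform and the context vector is simply the mean of the hidden states; training moves the weights towards the more informative words.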

https://medium.com/media/d52c0fdd34305efe5950f2e949434ba9/href

The drawback of this model is that it does not take the order within an input sequence into account but for our task of news headlines vectorization this is not as significant as for some sequence-to-sequence problems such as phrase translation.

There are a number of publications describing more complicated attention mechanisms designed for sequence-to-sequence problems; a good start would be the Neural Machine Translation with Attention tutorial by Google Research. I would also recommend checking the Attention? Attention! post by Lilian Weng, with an overview of attention mechanisms and their evolution, and making sure you have read the original Bahdanau et al., 2015 paper. Among the more up-to-date LSTM + Attention papers of the post-Transformer era I would recommend reading the recent Single-headed attention RNN by Stephen Merity, showing that huge Transformers are not the only possible approach.

P.S. Do not forget to add a saving config in the Attention class as it will enable seamless loading of the saved model with the custom layer.

An alternative simpler transition to the Dense layers would be the Flatten layer used for tensor reshaping.

Other ideas & results

In order to increase the classifier quality one could take advantage of the other information provided like the publication source and the author and add them to the network as one-hot encoded features, but the contest rules explicitly said that the algorithms would be evaluated on different datasets with other specific source list.

To speed up the training phase I ran the code in a Colab notebook with a Tesla K80 GPU runtime.

Typical model’s learning curves

https://medium.com/media/ac8799a214b8c27d8e09d89b177ceb28/href

The classification step takes 13 sec per 1000 texts.

News / not news — binary classification

To filter only news from the given dataset of publications we had to implement a binary classifier. Actually this step was performed before the news category classification. The binary classifier had a pretty similar architecture, with the obvious difference in the output layer: the last layer had 1 unit with the sigmoid activation function and binary crossentropy as a loss function.

The data provided by contest organizers had over 90% of news publications in it, so I decided to filter 100% news by source (The Guardian, Bloomberg, CNN, etc.) and then to use English Wikipedia articles (the good ones, not promotional) to represent the NOT NEWS class.

Typical binary model’s learning curves

The trained binary classifier model outputs the probability of the object belonging to the positive (1) class. In order to interpret these probabilities we needed to impose a threshold which would tell news from not news. This selection was performed manually after a long, careful look at the marginal cases.

The classification step takes 14 sec per 1000 texts. The classification results can be observed in the previous table.

Grouping news in threads

The news grouping task was solved by building a Ball Tree on the text embedding vectors and then searching this tree with an adaptive search radius controlled by the normalized Levenshtein distance within each group of neighbors.

Text vectorization

In order to group news we should first introduce some kind of metric on the dataset. Since we used DNNs for text processing and classification, the most obvious way to get a text’s vector would be to take the embedding obtained by passing the text through the pretrained multiclass DNN without the last layer. I used the 128-unit dense layer as the output to create the text embedding vectors.

As I found later, this approach was not the optimal one: significantly better performance in news grouping was demonstrated when I switched to a simpler TF-IDF vectorization. This is quite explainable. In news category classification we needed to generalize the whole semantic meaning of the news regardless of particular politicians’ names, gadgets, celebrities, and even particular circumstances, so the sequence of pretrained word vectors fed to the DNN was a suitable choice for the task. In news grouping we are dealing with a different setting: each thread should describe a very particular event (something has happened to somebody), so the characters’ names, and even the verbs and adverbial modifiers, should be the same across the whole news thread. The classic TF-IDF (calculating the vector of n-gram frequencies in each text normalized by the n-gram frequencies in the whole corpus) is therefore the approach losing less valuable information. To fight the sparsity of the TF-IDF output matrix I applied an SVD decomposition, compressing each text’s vector to the selected dimension (1000 in our case).

https://medium.com/media/dde83ae5e777a232d1e946de3ba9424c/href

A fast nearest neighbor search

The next step is the actual grouping of texts. The basic unsupervised approach for this task would be clustering, but since we are unlikely to get a perfect clustering at the first attempt, our algorithm requires an iterative approach, and clustering the whole dataset is expensive. Besides that, we do not have a proper metric of grouping quality to run an automated hyperparameter tuning procedure, nor should we tune them by hand, since it is highly likely that we’ll need a different set of hyperparameters for different news datasets.

I decided that a more precise and flexible approach would be the construction of a fast search index on the text embedding vectors and then querying this index. An efficient approach for fast high-dimensional nearest-neighbor search is to build a binary space partitioning tree, namely a Ball Tree or a KD Tree (a k-dimensional binary search tree is a space-partitioning data structure for organizing points in a k-dimensional space), on the text embeddings without explicitly calculating all the distances within a dataset. These algorithms scale with O(N log(N)) complexity for tree construction. My particular choice was the scikit-learn Ball Tree implementation, as it has a lower query time, O(D log(N)), than a KD Tree in high dimensions, and we had to optimize query times since we intended to perform iterative searches with various search radii for each element in our dataset. For more details on the differences between the KDTree and BallTree data structures and the performance benchmarking, please refer to this great post by Jake VanderPlas, a scikit-learn contributor.

https://medium.com/media/55eadb9ac6a89d789866d08e9318961d/href

The grouping algorithm

OK, now we finally have an index of all papers and can group them into threads by the distance between different samples.

I did not have enough time to reflect on some more sophisticated approaches and just created a loop over all papers in each news category (we can take advantage of the previous algorithm step to partition our dataset by news categories), checking for their neighbors within radius r_start; r_start was selected empirically in such a way that it captured slightly more papers than the actual thread contained. Then I calculated an empirical functional variative_criterium_norm, the normalized Levenshtein distance between the texts within the group, and decreased the search radius r_curr iteratively until this functional became less than the empirically found constraint or the search radius hit the r_min constraint. If there were no news in the given area, I increased the search radius until we found some neighbors and then switched to the radius decreasing branch. The idea behind this iterative search was that similar news have partly similar headlines (some entities like the subject, the object and sometimes the verbs are invariant over the news thread). The other details of the developed algorithm are easier to see from the code snippet.

https://medium.com/media/46b81adbff9668e0109cefd8cf94dcbb/href
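For readers who cannot open the embedded snippet, here is a minimal, self-contained sketch of the grow-then-shrink radius search described above. It is not the contest code: the function names, the default values, the use of the mean pairwise distance as the group functional, and the brute-force neighbor query (standing in for the Ball Tree radius query) are all assumptions made for illustration.

```python
from math import dist


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def headline_variation(headlines):
    """Mean pairwise normalized Levenshtein distance (assumed group functional)."""
    pairs = [(a, b) for i, a in enumerate(headlines) for b in headlines[i + 1:]]
    if not pairs:
        return 0.0
    return sum(normalized_levenshtein(a, b) for a, b in pairs) / len(pairs)


def neighbors_within(vectors, center, radius):
    """Brute-force stand-in for a Ball Tree radius query around one paper."""
    return [i for i, v in enumerate(vectors) if dist(vectors[center], v) <= radius]


def find_thread(vectors, headlines, center, r_start=1.0, r_min=0.1,
                r_max=2.0, r_step=0.1, vc_max=0.3):
    """Return (indices of the thread around `center`, final search radius)."""
    r = r_start
    group = neighbors_within(vectors, center, r)
    # Growing branch: only the paper itself was found, so widen the search.
    while len(group) < 2 and r + r_step <= r_max:
        r += r_step
        group = neighbors_within(vectors, center, r)
    # Shrinking branch: headlines vary too much, so tighten the radius.
    while (headline_variation([headlines[i] for i in group]) > vc_max
           and r - r_step >= r_min):
        r -= r_step
        group = neighbors_within(vectors, center, r)
    return group, r


vectors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.5, 0.0)]
headlines = [
    "Fire spreads in Australia",
    "Fire spreads across Australia",
    "Fires spread in Australia",
    "Stock markets rally on tech earnings",
]
group, radius = find_thread(vectors, headlines, center=0, r_start=2.0)
print(group, round(radius, 2))
```

Starting from a deliberately large radius, the unrelated headline keeps the variation above the threshold until the radius shrinks enough to exclude it, which leaves the three wildfire headlines as one thread.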

There are 7 hyperparameters controlling the algorithm: r_min and r_max control the size of the query area; r_start and r_step control the query area dynamics; and vc_min, vc_max and delta_max control the value of the normalized Levenshtein distance within a group, which defines the variance of news headlines in a group. These hyperparameters should be tuned after you have chosen the vectorization parameters, such as the final text embedding size (n_components in SVD) and the range of n-grams used (I used 1-3).

It is not really obvious how to evaluate a “proper” news grouping, so I did not spend too much time playing with hyperparameters once I got a reasonable grouping result. My intention was to suggest a working approach to solving the problem within the given constraints. Clearly there is room for improvement, such as checking whether some of the groups can be merged and filtering out occasional noise. In fact, to get a reasonable number and density of groups, the grouping hyperparameters should be fine-tuned for each dataset, so there could be an outer loop implementing a kind of randomized search over them. One way to get an idea of the particular grouping we got and to estimate its quality is to check the histogram of the group size distribution.

Group size distribution histogram obtained on a test dataset
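As a rough illustration of this check, the sketch below counts how many groups have each size and renders the result as a text histogram. The function names and the toy `groups` list are hypothetical; any plotting library could be used instead of the text rendering.

```python
from collections import Counter


def group_size_histogram(groups):
    """Map each group size to the number of groups of that size."""
    counts = Counter(len(group) for group in groups)
    return dict(sorted(counts.items()))


def render_histogram(histogram):
    """Render one text bar per group size."""
    return "\n".join(
        f"{size:>3} | {'#' * count} ({count})"
        for size, count in histogram.items()
    )


# Toy grouping result: each inner list holds the paper indices of one thread.
groups = [[0, 1, 2], [3], [4, 5], [6], [7, 8]]
histogram = group_size_histogram(groups)
print(render_histogram(histogram))
```

A long tail of single-paper groups would suggest the radius or variation thresholds are too strict, while a few very large groups would suggest they are too loose.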

In fact, the described approach could be regarded as a relative of the DBSCAN clustering.

Execution time varies depending on the hyperparameters chosen for the dataset and the structure of the data. Typical values range from 8.5 sec / 1000 papers to 25 sec / 1000 papers, including the vectorization time, which is dominated by the expensive SVD operation.

Sorry for the long listing; here are the full results for the SOCIETY category, ranked by group size:

https://medium.com/media/a3162d70881279caa6044aa41e52777c/href

Threads ranking

There are lots of features to use when it comes to thread ranking. The problem is that this ranking may be quite subjective, depending on the region and the interests of a particular user. We may agree that there are global and local news, speculations or opinions on political issues and mere facts, but some new information about Trump’s impeachment hearings may outweigh global but distant tragedies like the forest fires in Australia. The importance of news is a perceived characteristic, not an objective one.

In view of the foregoing, I shall just describe an approach to this task. The most obvious features for thread ranking would be the number of publications in a thread, the ranking of the sources (in the Society category, for example, Bloomberg and the Financial Times are the most respected sources) and the semantic meaning of the thread, like ‘international politics’, ‘global economy’, ‘war’, ‘global accident’, ‘local accident/crime’ etc. This approach presupposes manually introducing and ranking these semantic categories and manually ranking the news sources. If we do not want to create features manually, another possible approach would be to manually rank a selection of threads, calculate their mean semantic vectors (averaging whichever vectorization we have used) and then use these vectors as predictors to train our ranking model (a kind of regression model) on the importance scores we have assigned to the threads.

In fact, a quite reasonable ranking can be obtained just by sorting the threads by size. Given this fact and the fact that I have not implemented the ranking part during the contest, I shall leave the enthusiastic ones free to try out any of the approaches discussed above.
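A minimal sketch of the size-based ranking, with a manual source weight used only as a tie-breaker, could look as follows. This was not implemented during the contest; the `SOURCE_WEIGHT` table, the default weight and the thread layout (a list of dicts with a `source` key) are assumptions for illustration.

```python
from statistics import mean

# Hypothetical manual source ranking; real weights would be assigned by hand.
SOURCE_WEIGHT = {"bloomberg.com": 1.0, "ft.com": 1.0}
DEFAULT_WEIGHT = 0.5  # assumed weight for sources missing from the table


def thread_score(thread):
    """Sort key: thread size first, mean source weight as a tie-breaker."""
    weights = [SOURCE_WEIGHT.get(paper["source"], DEFAULT_WEIGHT) for paper in thread]
    return (len(thread), mean(weights))


def rank_threads(threads):
    """Return the threads ordered from most to least important."""
    return sorted(threads, key=thread_score, reverse=True)


small_blog = [{"source": "blog.example"}, {"source": "blog.example"}]
small_press = [{"source": "ft.com"}, {"source": "bloomberg.com"}]
large_blog = [{"source": "blog.example"}] * 3
ranked = rank_threads([small_blog, small_press, large_blog])
print([len(thread) for thread in ranked])
```

Here the largest thread comes first, and between the two threads of equal size the one from the higher-weighted sources wins.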

One important issue is that the contest presupposed a static ranking: we are given all the news for the same date. In reality news is time-dependent and loses its novelty as time passes; this relevance decay could be described reasonably well with the exp(-t) function.
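To make the decay concrete, here is a small sketch that multiplies a thread's size by exp(-t / tau). The time constant `tau_hours` and the use of thread size as the base score are assumptions; the contest setting had no time dimension, so this is only an illustration of the idea.

```python
import math


def decayed_score(thread_size, age_hours, tau_hours=24.0):
    """Thread size discounted by exp(-t / tau), with t the thread age in hours."""
    return thread_size * math.exp(-age_hours / tau_hours)


# A large two-day-old thread versus a small fresh one.
old_big = decayed_score(10, age_hours=48)
fresh_small = decayed_score(3, age_hours=0)
print(old_big < fresh_small)
```

With a one-day time constant, a ten-paper thread that is two days old already ranks below a fresh three-paper thread.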

Thank you for reading this long post, I hope you have found some ideas to use in your NLP projects.

Building a news aggregator from scratch: news filtering, classification, grouping in threads and… was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Check out iki.ai, a new professional growth ecosystem

IVAN ILIN — Wed, 21 Feb 2018 12:50:05 GMT

The value of information decreases as its volume grows, and time is more precious than ever, so we have the ambition to provide the most efficient ways of spending your time on professional growth.

The future is here

Technological progress comes at an exponential rate. Humankind now stands at the threshold of the fourth industrial revolution, which will bring together the achievements of biotechnology, physics, robotics and IT. Artificial intelligence, the blockchain, quantum computing, biohacking and next-generation human-computer interfaces will cause disruptive changes in the world economy. These global processes are changing the labor market. Today technology trend cycles turn over faster than a classical academic course. When you were entering university there was little demand for AI scientists, AR designers or blockchain engineers. To remain a competitive professional, you have to live in the lifelong education paradigm. People will change industries several times during their lives, and globalisation is intensifying competition, so you had better get ready for this marathon.

Education has always defined a person’s life path and social class. Education has always been a very powerful social elevator. Today this elevator has opened its doors to everyone. The world’s top-level academic courses and the foremost technologies are accessible to everyone with access to the Internet, through educational platforms such as Coursera, edX and Udacity or technology resources such as GitHub and Stack Overflow. This amounts to 3 billion people around the globe. Even language barriers are being obliterated by recent advances in speech recognition and Natural Language Understanding.

There are only two main barriers left — the excessive amount of information available and lack of personal motivation and self-confidence to step on the right track.

Ikigai philosophy

But why is the education track the right one? Maybe there is too much competition: the higher you get, the higher the stakes are, with management consultants and investment bankers at the top working 16 hours a day and sleeping in their offices, or Silicon Valley engineers spending 70% of their waking hours looking at blue screens.

The answer is quite simple — because you have to find your ikigai — the reason for your being. This is a concept from Japanese philosophy (to be more exact, from Okinawa), meaning the reason to wake up every morning.

Ikigai is the intersection of the things that you love, the things you are good at, the things the world needs and the things you can be paid for. In order to find it, you need to try various things in life: traveling, playing sports, reading and starting relationships. But the main point is trying to create something, be that a painting, a computer program or a new business process. I would call this concept “modern” or “western” ikigai.

The iki.ai service was created to help people engage in this process of trial and error and to push the boundaries of their knowledge and skills in a more systematic and engaging way than people usually do. iki.ai is your personal career advisor, powered by machine learning and related artificial intelligence technologies.

Professional growth algorithm

The iki.ai service has two main use cases. The first is helping a young professional or a student to navigate the various career options, set a career goal and get all the tools needed to achieve it. This case also includes professionals willing to make a major pivot in their career, such as switching industries: it is much easier to make the decision with a clear understanding of the skills needed and the career options in the new field. The second scenario is for experienced professionals willing to excel: they get a carefully filtered content and news feed from their selected areas of professional expertise.

Let’s go a little deeper and see how exactly this works and what value is delivered.

All data analysis is based on your professional experience, provided by a LinkedIn login or a CV upload, along with a single-screen tag selection mechanic resembling the Apple Music experience.

Career graph for IT industry with path from intern to product VP (left). Personal development plan leading to portfolio manager position in Financial industry (right)

In the first scenario, the user explores the career graph: typical industry positions with a routing function based on real-world career statistics. Once one of the positions is set as the career goal, the route is broken down into a list of career steps with the key skills and responsibilities at each position. iki.ai will calculate your personal route from a junior developer to the head of a research group or to the managing partner of a hedge fund. Your personal development plan will include recommendations of the most relevant online courses and trainings, platforms and local communities. A personal news feed covering your topics of interest and the skills you need, powered by a custom recommender system, will accompany you on the chosen track, fetching carefully filtered fresh blog posts, news and publications in the selected professional areas on a daily basis.

The importance of motivation should not be underestimated, which is why we unite people on the same career track in a community. Within this group, members will see each other’s progress and feedback on content and will share achievements. We will track your progress and show you your peers’ successes. Your peers will be your team: share your insights and help each other to improve.

Social line showing your peers and their most recent achievements with Y coordinate representing each user’s progress (left). Achievement card (position change) shown in user’s feed (right).

The second scenario is designed for seasoned professionals looking more for insights than for motivation — these guys are provided with content feed directly without any career goal setting.

Our feed is smart: you can chat with it and specify your particular interest, goal, expertise level or the type of content fetched for you from the web. Our intention here is to save the user’s time by cutting out informational noise and secondary publications through careful information aggregation, filtering and ranking, using both ML algorithms and user ratings.

Featured technologies & innovation

AI is now the hot trend, with machine learning and data science becoming a routine approach. All major features of the iki.ai service are powered by machine learning algorithms. It starts with processing the user’s experience (vectorizing the text data) and classifying it to one of the points on the career graph, which is built by clustering a large amount of career data within each industry to form clusters of positions with similar job descriptions. It continues with machine-learning-ranked course recommendations and the smart content feed. The daily feed is generated via automated web data collection and processing, with an advanced deep learning recommender system developed in-house on top. A chatbot mechanic will be integrated for a more interactive feel and more precise recommendations. The technology stack and UX design ideas will be described in more detail in further publications.

Conclusion

Our motivation to build this product was quite simple: both founders missed an egocentric interactive service focused on the user’s professional growth, bringing together education, career goals and personal development. Ivan was a little puzzled trying to find his way among the variety of open possibilities after defending his Ph.D. in Applied Mathematics. Max was really busy working as CEO and Creative Director of the ONY agency, which he founded 17 years ago, with little time left for self-improvement. We both felt that the job of knowledge delivery could be optimized and automated, and that there is now little time left for scrolling the Facebook newsfeed or for checking dozens of good-quality blogs every couple of days. We also presumed that automatically formed new social groups of professionals with similar aspirations, pushing each other’s expertise, would make the painful self-improvement process more engaging and meaningful for each member.

Thanks to the past industrial revolutions, people got more free time and unlimited access to information. We have the ambition to help you use this information to find your ikigai and fulfill your potential, and to help people move at the modern world’s pace, keeping up with technology’s exponential acceleration.

A public beta will be released soon, please keep in touch to be among the first adopters. Check out our web page iki.ai and subscribe to updates on Facebook: fb.me/ikiservice

Stories by IVAN ILIN on Medium

Through knowledge sharing to singularity, accelerated by LLMs

Through Knowledge Sharing to Singularity, Accelerated By LLMs

Some history

Internet era & open source

LLMs and knowledge — a symbiotic relationship

Transformers invention and training

LLMs are born

LLMs are the new interface for information

Societal effects

Conclusion

Advanced RAG Techniques: an Illustrated Overview

A comprehensive study of the advanced retrieval augmented generation techniques and algorithms, systemising various approaches. The article comes with a collection of links in my knowledge base referencing various implementations and studies mentioned.

Intro

Naive RAG

Advanced RAG

1. Chunking & vectorisation

2. Search index

2.1 Vector store index

2.2 Hierarchical indices

2.3 Hypothetical Questions and HyDE

2.4 Context enrichment

2.5 Fusion retrieval or hybrid search

3. Reranking & filtering

4. Query transformations

Reference citations

5. Chat Engine

6. Query Routing

7. Agents in RAG

8. Response synthesiser

Encoder and LLM fine-tuning

Encoder fine-tuning

Ranker fine-tuning

LLM fine-tuning

Evaluation

Conclusion

Building a news aggregator from scratch: news filtering, classification, grouping in threads and…

Building a news aggregator from scratch: news filtering, classification, grouping in threads and ranking

Problem statement and restrictions

Raw data parsing

Language detection

Text preprocessing logic

Deep neural network architecture

News topic detection — multiclass classification

Attention layer explained

News / not news — binary classification

Grouping news in threads

Threads ranking

Check out iki.ai, a new professional growth ecosystem
