<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Vinayak Sengupta on Medium]]></title>
        <description><![CDATA[Stories by Vinayak Sengupta on Medium]]></description>
        <link>https://medium.com/@vinayak.sengupta?source=rss-315151b8e67d------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Gn7EcT-snVAY1JEdON3uCw.png</url>
            <title>Stories by Vinayak Sengupta on Medium</title>
            <link>https://medium.com/@vinayak.sengupta?source=rss-315151b8e67d------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 01:35:50 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@vinayak.sengupta/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[The Essential Guide to Effectively Summarizing Massive Documents, Part 1]]></title>
            <link>https://medium.com/data-science/demystifying-document-digestion-a-deep-dive-into-summarizing-massive-documents-part-1-53f2ed9a669d?source=rss-315151b8e67d------2</link>
            <guid isPermaLink="false">https://medium.com/p/53f2ed9a669d</guid>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[document-summarization]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[unsupervised-learning]]></category>
            <category><![CDATA[generative-ai-use-cases]]></category>
            <dc:creator><![CDATA[Vinayak Sengupta]]></dc:creator>
            <pubDate>Sat, 14 Sep 2024 00:10:35 GMT</pubDate>
            <atom:updated>2025-01-06T07:13:03.608Z</atom:updated>
            <content:encoded><![CDATA[<h4>Document summarization is important for GenAI use cases, but what if the documents are too BIG!? Read on to find out how I have solved it.</h4><figure><img alt="Document summarization — Image generated by the author with GPT-4o" src="https://cdn-images-1.medium.com/max/1024/1*XpFUc37Ol5aeqXiFB92f5A.png" /><figcaption>“Summarizing a lot of text”— Image generated with GPT-4o</figcaption></figure><p>Document summarization has become one of the most (if not the most) common problem statements to solve using modern Generative AI (GenAI) technology. Retrieval Augmented Generation (RAG) is a common yet effective solution architecture used to solve it (if you want a deeper dive into what RAG is, check out this <a href="https://medium.com/@vinayak.sengupta/exploring-the-core-of-augmented-intelligence-advancing-the-power-of-retrievers-in-rag-frameworks-3ef9fe273764"><strong>blog</strong></a>!). But what if the document itself is so large that it cannot be sent as a whole in a single API request? Or what if it produces so many chunks that we run into the infamous ‘Lost in the Middle’ context problem? In this article, I will discuss the challenges we face with such a problem statement and go through a step-by-step solution that I applied using the guidance offered by Greg Kamradt in his <a href="https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/5%20Levels%20Of%20Summarization%20-%20Novice%20To%20Expert.ipynb"><strong>GitHub repository</strong></a>.</p><h3>Some “<em>context</em>”</h3><p>RAG is a well-discussed and widely implemented solution for optimizing document summarization using GenAI technologies. However, like any new technology or solution, it is prone to edge-case challenges, especially in today’s enterprise environment. Two main concerns are contextual length coupled with per-prompt cost, and the previously mentioned ‘Lost in the Middle’ context problem. 
Let’s dive a bit deeper to understand these challenges.</p><blockquote><strong>Note</strong><em>: I will perform the exercises in Python, using the LangChain, Scikit-Learn, Numpy, and Matplotlib libraries for quick iterations.</em></blockquote><h3>Context window and Cost constraints</h3><p>Today, with automated workflows enabled by GenAI, analyzing big documents has become an industry expectation. People want to quickly find relevant information in medical reports or financial audits by prompting the LLM. But there is a caveat: enterprise documents are not like the documents or datasets we deal with in academics; they are considerably bigger, and the pertinent information can be present anywhere in them. Hence, methods like data cleaning/filtering are often not a viable option, since domain knowledge regarding these documents is not always given.</p><p>In addition to this, even the latest Large Language Models (LLMs) like GPT-4o by OpenAI, with context windows of 128K tokens, cannot simply consume these documents in one shot; even if they could, the quality of the response would not meet standards, especially for the cost it would incur. To showcase this, let’s take a real-world example: summarizing the Employee Handbook of GitLab, which can be downloaded <a href="https://kocielnik.gitlab.io/gitlab_handbook_takeaway/about-the-handbook.html"><strong>here</strong></a>. 
This document is available free of charge under the MIT license on their GitHub <a href="https://gitlab.com/kocielnik/gitlab_handbook_takeaway/-/blob/master/LICENSE">repository</a>.</p><p>1 — We start by loading the document and initializing our LLM; to keep this exercise relevant, I will make use of GPT-4o.</p><p><strong><em>Note</em></strong>: We have removed the first 30 pages of the document, as they are change logs that will not contribute insightful information and would only consume memory.</p><pre>from langchain_community.document_loaders import PyPDFLoader<br><br># Load PDFs<br>pdf_paths = [&quot;/content/gitlab_handbook.pdf&quot;]<br>documents = []<br><br>for path in pdf_paths:<br>    loader = PyPDFLoader(path)<br>    documents.extend(loader.load())<br><br>from langchain_openai import ChatOpenAI<br>llm = ChatOpenAI(model=&quot;gpt-4o&quot;)</pre><p>2 — Then, we divide the document into smaller chunks (this is for <em>embedding</em>; I will explain why in the later steps).</p><pre>from langchain.text_splitter import RecursiveCharacterTextSplitter<br><br># Initialize the text splitter<br>text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)<br><br># Split documents into chunks<br>splits = text_splitter.split_documents(documents)</pre><p>3 — Now, let’s calculate how many tokens make up this document. For this, we iterate through each document chunk and sum up the token counts.</p><pre>total_tokens = 0<br><br>for chunk in splits:<br>    text = chunk.page_content  # `page_content` is where the text is stored<br>    num_tokens = llm.get_num_tokens(text)  # Get the token count for each chunk<br>    total_tokens += num_tokens<br><br>print(f&quot;Total number of tokens in the book: {total_tokens}&quot;)<br><br># Total number of tokens in the book: 236592</pre><p>As we can see, the number of tokens is 236,592, while the context window limit for GPT-4o is 128,000. 
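</p><p>The context-window and cost checks discussed here come down to quick back-of-the-envelope arithmetic. A minimal sketch follows; the price per token is an assumption based on the GPT-4o input rate quoted in this article and may have changed since:</p>

```python
# Rough feasibility check for sending the whole handbook in one request.
# TOTAL_TOKENS is the count measured above; the price is an assumed rate
# of $0.005 per 1K input tokens (verify against current OpenAI pricing).
TOTAL_TOKENS = 236_592
CONTEXT_WINDOW = 128_000      # GPT-4o's context limit in tokens
PRICE_PER_1K_INPUT = 0.005    # USD per 1K input tokens (assumption)

fits_in_one_call = TOTAL_TOKENS <= CONTEXT_WINDOW
cost_per_request = (TOTAL_TOKENS / 1_000) * PRICE_PER_1K_INPUT

print(fits_in_one_call)            # False
print(round(cost_per_request, 2))  # 1.18
```

<p>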
This document cannot be sent in one go through the LLM’s API. In addition, considering this model&#39;s pricing of $0.00500 / 1K input tokens, a single request sent to OpenAI for this document would cost $1.18! This does not sound terrible on its own, but scale it to an enterprise setting with multiple users and daily interactions across many such large documents, especially in a startup scenario where many GenAI solutions are being born, and the costs add up quickly.</p><h3>Lost in the Middle</h3><p>Another challenge faced by LLMs is the <em>Lost in the Middle </em>context problem, as discussed in detail in this <a href="https://arxiv.org/abs/2307.03172"><strong>paper</strong></a>. Research, as well as my own experience with RAG systems handling multiple documents, shows that LLMs are not very robust when it comes to extracting information from long context inputs. Model performance degrades considerably when relevant information sits somewhere in the middle of the context, while performance improves when the required information is either at the beginning or the end of the provided context. Document re-ranking is a solution that has become a subject of progressively heavy discussion and research to tackle this specific issue. I will be exploring a few of these methods in another post. For now, let us get back to the solution we are exploring, which utilizes K-Means Clustering.</p><h3>What is K-Means Clustering?!</h3><p>Okay, I admit I sneaked a technical concept into the last section. Allow me to explain it (for those who may not be aware of the method, I got you).</p><h4>First the basics</h4><p>To understand K-means clustering, we should first know what clustering is. Consider this: we have a messy desk with pens, pencils, and notes all scattered together. To clean up, one would group similar items, like all pens in one group, pencils in another, and notes in another, creating essentially three separate groups (not promoting segregation). 
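</p><p>The desk-tidying analogy can be made concrete with a tiny, self-contained sketch: three obvious groups of 2-D points standing in for the pens, pencils, and notes (toy data only, not our document embeddings):</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight groups of 2-D points, standing in for pens, pencils, and notes.
desk = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # "pens"
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # "pencils"
    [9.0, 1.0], [9.2, 1.1], [8.9, 0.8],   # "notes"
])

# K-means with K=3 recovers the three groups (label numbers are arbitrary).
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(desk)
print(labels)
```

<p>Each group of three points ends up with the same cluster label, which is exactly what we will do later with chunks of document text in embedding space.</p><p>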
Clustering is the same process applied to a collection of data (in our case, the different chunks of document text): similar pieces of information are grouped together, creating a clear separation of concerns. This makes it easier for our RAG system to pick and choose information effectively and efficiently, instead of greedily going through all of it.</p><h4>K, Means?</h4><p>K-means is a specific method to perform clustering (there are other methods, but let’s not information dump). Let me explain how it works in 5 simple steps:</p><ol><li><strong>Picking the number of groups (K)</strong>: How many groups we want the data to be divided into</li><li><strong>Selecting group centers</strong>: Initially, a center value for each of the K groups is randomly selected</li><li><strong>Group assignment</strong>: Each data point is then assigned to a group based on how close it is to the previously chosen centers. Example: items closest to center 1 are assigned to group 1, items closest to center 2 are assigned to group 2…and so on till the Kth group.</li><li><strong>Adjusting the centers</strong>: After all the data points have been pigeonholed, we calculate the average of the positions of the items in each group, and these averages become the new centers to improve accuracy (because we had initially selected them at random).</li><li><strong>Rinse and repeat: </strong>With the new centers, the data point assignments are again updated for the K groups. 
This is done until the distance (mathematically, the <strong><em>Euclidean</em> <em>distance</em></strong>) between items within a group is minimal, and the distance to data points in other groups is maximal: ergo, optimal segregation.</li></ol><p>While this may be quite a simplified explanation, a more detailed and technical explanation (for my fellow nerds) of this algorithm can be found <a href="https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/">here</a>.</p><h3>Enough theory, let’s code.</h3><p>Now that we have discussed K-means clustering, the main protagonist in our journey to optimization, let us see how this robust algorithm can be used in practice to summarize our Handbook.</p><p>4 — Now that we have our chunks of document text, we will embed them into vectors.</p><pre>from langchain_openai import OpenAIEmbeddings<br><br>embeddings = OpenAIEmbeddings()<br><br># Embed the chunks<br>chunk_texts = [chunk.page_content for chunk in splits]  # Extract the text from each chunk<br>chunk_embeddings = embeddings.embed_documents(chunk_texts)</pre><h4>Maybe a little theory</h4><p>Alright, alright, so maybe there’s more to learn here — what’s embedding? Vectors?! And why?</p><h4>Embedding &amp; Vectors</h4><p>Think of how a computer does things — it sees everything as binary, ergo, the best language in which to teach or instruct it is numbers. Hence, an optimal way to have complex ML systems understand our data is to represent all that text as numbers, and the method by which we do this conversion is called <strong>Embedding</strong>. The list of numbers describing a text or word is known as a <strong>Vector</strong>.</p><p>Embeddings can differ depending on how we want to describe our data and the heuristics we choose. Let’s say we wanted to describe an apple; we would need to consider its color (red), its shape (roundness), and its size. Each of these could be encoded as a number, like the ‘redness’ could be an 8 on a scale of 1–10. 
The roundness could be 9, and the size could be 3 (inches in width). Hence, our vector representing the apple would be [8,9,3]. This very concept is applied with more complexity when describing the different qualities of documents, where we want the numbers to capture the topic, the semantic relationships, and so on. This results in vectors that are hundreds of numbers long, or more.</p><h4>But, Why?!</h4><p>Now, what improvements does this method provide? Firstly, as I mentioned before, it makes data interpretation easier for the LLMs, which provides better accuracy in inference from the models. Secondly, it also helps massively with memory optimization (space complexity, in technical terms) by reducing memory consumption when the data is converted into vectors. The space these vectors live in is known as a vector space. For example, a document with 1000 words can be reduced to a 768-dimensional vector representation, hence resulting in 768 numbers instead of 1000 words.</p><p>A little deeper math (for my dear nerds again): “1234” in word form (a string, in computer language) consumes roughly 50 bytes of memory, while 1234 in numeral form (a 64-bit integer, in computer language) consumes only 8 bytes! So, if you consider documents spanning megabytes, we are reducing memory management costs as well (yay, budget!).</p><h4>And we are back!</h4><p>5 — Using the Scikit-Learn Python library for easy implementation, we first select the number of clusters we want, in our case 15. We then run the algorithm to fit our embedded documents into 15 clusters. The parameter ‘random_state = 42’ fixes the random seed used to initialize the cluster centers, making the results reproducible across runs.</p><p>It is also important to note that we are converting our list of embeddings into a Numpy array (a mathematical representation of vectors for advanced operations in the Numpy library). 
This is because Scikit-learn requires Numpy arrays for K-means operations.</p><pre>from sklearn.cluster import KMeans<br>import numpy as np<br><br>num_clusters = 15<br># Convert the list of embeddings to a NumPy array<br>chunk_embeddings_array = np.array(chunk_embeddings)<br><br># Perform K-means clustering on the array<br>kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(chunk_embeddings_array)</pre><h3>Class dismissed…for now.</h3><p>I think this is a good place for a pit stop! We have covered a lot, both in code and theory. But no worries, I will be posting a second part covering how we make use of these clusters in generating rich summaries for large documents. There are going to be more interesting techniques to showcase, and of course, I will explain all the theory and understanding as best as I can!</p><p>So stay tuned! Also, I would love your feedback and any comments you may have regarding this article, as it helps me improve my content. As always, thank you so much for reading, and I hope it was worth the read!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JSurhfpP5kxYiVib" /><figcaption>Photo by <a href="https://unsplash.com/@priscilladupreez?utm_source=medium&amp;utm_medium=referral">Priscilla Du Preez 🇨🇦</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><hr><p><a href="https://medium.com/data-science/demystifying-document-digestion-a-deep-dive-into-summarizing-massive-documents-part-1-53f2ed9a669d">The Essential Guide to Effectively Summarizing Massive Documents, Part 1</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Advancing the Power of Retrievers in RAG Frameworks]]></title>
            <link>https://medium.com/swlh/exploring-the-core-of-augmented-intelligence-advancing-the-power-of-retrievers-in-rag-frameworks-3ef9fe273764?source=rss-315151b8e67d------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ef9fe273764</guid>
            <category><![CDATA[generative-ai-use-cases]]></category>
            <category><![CDATA[retrieval-augmented]]></category>
            <category><![CDATA[retrieval-generation]]></category>
            <category><![CDATA[information-retrieval]]></category>
            <category><![CDATA[langchain]]></category>
            <dc:creator><![CDATA[Vinayak Sengupta]]></dc:creator>
            <pubDate>Sat, 13 Jan 2024 14:43:36 GMT</pubDate>
            <atom:updated>2024-09-05T19:54:11.118Z</atom:updated>
            <content:encoded><![CDATA[<h4>In this article, I try to give credit where it’s due, spotlighting the unsung heroes that determine the effectiveness of the many document question-answering toolkits: the retrievers themselves.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nwJz-CvG6epJoZd9" /><figcaption>Photo by <a href="https://unsplash.com/@vonshnauzer?utm_source=medium&amp;utm_medium=referral">Egor Myznik</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>A small but needed Introduction</h3><p>Retrieval-Augmented Generation (RAG), like many Generative AI (GenAI) topics, is being exhaustively discussed. Information retrieval techniques learnt in undergraduate studies are now being combined with Large Language Models (LLMs). Yet, amidst these broader discussions, I feel the ‘R’ in RAG isn&#39;t as deeply discussed. As the crucial component that navigates through vast corpora of knowledge to retrieve relevant information and context for our foundation models, retrievers form the backbone of this modern and, dare I say, sophisticated architecture.</p><p>Hence, through rather detailed experimentation with the various modern retrieval methodologies I have employed, I have tried to document <strong>What</strong> these methods are, <strong>Why</strong> we would use them, and <strong>When</strong> they are most fruitful in their retrievals.</p><h3>Primary Schooling</h3><p>Before getting down and dirty with the different methods of retrieval, let’s familiarize ourselves with a few introductory concepts (for some of you, this may be a refresher)</p><ul><li><strong>Language Models (LLMs)</strong> :- Large Language Models, for example OpenAI’s GPT-4 and Google’s Gemini Pro, are trained on humongous corpora of text data and can generate human-like responses, answering queries and performing other language-related tasks like translation and even summarizing large amounts of 
text. They have become a game-changer in natural language processing applications.</li><li><strong>Information Retrieval</strong> :- The process of obtaining relevant information from a large quantity of data. In our architecture (which I will show shortly), it involves finding the parts of the documents that are most relevant to a user’s question.</li><li><strong>Retrieval-Augmented Generation (RAG)</strong> :- RAG is an architecture that sits in the Venn-diagram overlap between Information Retrieval (IR) and LLMs, to generate contextually relevant and insight-rich responses. The retriever first fetches the parts of the document relevant to the user query, and then the LLM uses this information to generate a response.</li><li><strong>Vector Database (VectorDB)<em> </em></strong>:- A unique breed of databases, vector databases store data in a high-dimensional space, allowing for unique search methods based on similarity and semantics (explained just below).</li><li><strong>Semantic Search</strong> :- Semantic search goes beyond the now-generic keyword-matching methodology by understanding the contextual meaning of a user prompt. It uses <a href="https://www.techtarget.com/searchenterpriseai/definition/natural-language-understanding-NLU"><strong>Natural Language Understanding</strong></a> (NLU) techniques to retrieve information that is conceptually relevant to the prompt, even if the exact words are not present in the documents.</li><li><strong>Contextual Information</strong> :- Context is imperative in NLU. In the architecture, context is added to the user query to provide additional information, which can help the LLM generate more accurate and relevant responses.</li><li><strong>Metadata<em> </em></strong>:- Metadata refers to any additional information regarding the primary data that can be used to guide the retrieval and generation process. 
In our architecture, metadata includes details like the user query itself, context (this can include any additional prompt engineering or data), and parameters (I will explain these in a bit) that might positively influence the retrieval and response generation process.</li></ul><h3>The RAG Architecture</h3><figure><img alt="The architecture is a flowchart representing the different retrievers and how they are initialized and what parameters they need to be configured" src="https://cdn-images-1.medium.com/max/1024/1*7NMXWfHnRy79m2CQgdx3YA.png" /><figcaption>RAG function flowchart</figcaption></figure><p>The above flowchart represents a function architecture I developed using classes from the <a href="https://python.langchain.com/docs/modules/data_connection/retrievers/"><strong>Langchain</strong></a><em> </em>package in Python. Let’s go step by step through what the architecture is doing, and then I will explain each concept in more theoretical detail to highlight their differences.</p><p>1 — The function extracts parameters such as `<em>retrieval_k</em>`, `<em>mmr_lambda_mult</em>`, `<em>retriever_search_type</em>`, and `<em>ensemble_weights</em>` from the user input. These parameters will be the configurations for the retrievers themselves.</p><p>2 — The value of `<em>retriever_search_type</em>` is read to decide which retriever methodology to configure and apply. There are several conditions based on the value:</p><ul><li>The function initializes a `<em>BM25Retriever</em>` using the provided `<em>documents</em>` and `<em>metadatas</em>`. 
This retriever will be used either on its own or as part of an ensemble.</li><li>If <em>retriever_search_type </em>is <strong><em>mmr</em></strong>, the function creates an MMR-based retriever from the `<em>vectordb</em>`, then creates a `<em>MultiQueryRetriever</em>` using the `<em>llm</em>`, and finally returns an `<em>EnsembleRetriever</em>` combining the `<em>bm25_retriever</em>` and the `<em>multi_query_retriever</em>`.</li><li>If <em>retriever_search_type</em> is <strong><em>no_mmr</em></strong>, the function initializes a similarity-based (non-MMR) retriever from the `<em>vectordb</em>`, then creates a `<em>MultiQueryRetriever</em>` using the `<em>llm</em>`, and returns an `<em>EnsembleRetriever</em>` as before.</li><li>If <em>retriever_search_type</em> is <strong><em>mmr_without_multi</em></strong>, an MMR-based retriever without the multi-query functionality is initialized, and the function returns an `EnsembleRetriever` with just the `bm25_retriever` and the MMR retriever.</li><li>If <em>retriever_search_type</em> is <strong><em>no_mmr_without_multi</em></strong>, the function creates a retriever without MMR and without multi-query functionality, and returns an `<em>EnsembleRetriever</em>` as before.</li><li>If <em>retriever_search_type</em> is <strong><em>retriever_mmr</em></strong>, the function returns the MMR-based retriever from the `<em>vectordb</em>` directly.</li></ul><p>3 — If none of the specific `<em>retriever_search_type</em>` conditions are met, the function defaults to returning the `<em>vectordb</em>` as a retriever without MMR.</p><h3>Secondary Schooling (The What)</h3><p>Now, I know I threw around a lot of statistical and technical jargon in those explanations, and (while some of you may know what they mean) allow me to explain them in detail in this section.</p><h3>BM25Retriever</h3><h4>What is it?</h4><p>BM25 is an IR methodology applying the <strong>B</strong><a 
href="https://machinelearningmastery.com/gentle-introduction-bag-words-model/"><strong>ag-Of-Words</strong></a> (BOW) model, which ranks a set of documents based on the exact presence of the user prompt’s words in each document, regardless of the inter-relationships between the words within a document, such as their proximity to each other or the order in which they appear. It’s an extension of the <a href="https://monkeylearn.com/blog/what-is-tf-idf/"><strong>Term Frequency-Inverse Document Frequency</strong></a> (TF-IDF) algorithm.</p><h4>What variables (parameters) make it work?</h4><ul><li>`<strong><em>documents</em></strong>`: The list of documents to be indexed and searched through.</li><li>`<strong><em>metadatas</em></strong>`: Additional information about the documents or the question itself that can be used for filtering during retrieval.</li></ul><h4>How does it work?</h4><p>In the function, a `<em>BM25Retriever</em>` is created from the provided `<em>documents</em>` and `<em>metadatas</em>`. It’s used either on its own or as part of an ensemble retriever.</p><h3>VectorStore (vectordb) as a retriever</h3><h4>What is it?</h4><p>‘<em>VectorStore</em>’ essentially represents the Vector Database we discussed earlier. In our architecture, the `<em>VectorStore</em>` is used to create a retriever to perform semantic searches.</p><h4>What variables (parameters) make it work?</h4><ul><li>`<strong><em>search_type</em></strong>`: This parameter defines the type of search algorithm for the retriever. For example, “mmr” indicates that the search should use Maximal Marginal Relevance (MMR) to balance relevance to the question with diversity of the different information chunks from the document. 
<em>There are a number of different algorithms that can be chosen from and I strongly urge you to explore them all!</em></li><li>`<strong><em>search_kwargs</em></strong>`: These are keyword arguments that specify main configurations for the retrievers and their search algorithms. A common argument includes — `<strong><em>k</em></strong>`: The number of top document chunks to retrieve.</li></ul><h4>How does it work?</h4><p>In our architecture, a retriever is created using the `<em>VectorStore</em>` to perform semantic searches. Depending on the `<em>retriever_search_type</em>`, the `<em>VectorStore</em>` may be configured to use MMR or not as its underlying search algorithm.</p><p>When the `<em>VectorStore</em>` is used as a retriever with MMR (which I will be explaining below), it will rank the retrieved search results to provide a set of document chunks that are not only contextually relevant to the user prompt but also diverse in its content. This is particularly useful when you want to avoid redundancy in the retrieval results and generate a response based on a broader range of information.</p><p>If MMR is not used, the `<em>VectorStore</em>` retriever will apply similarity search by first converting the user prompt into its respective vector representation and then focusing on matching the most relevant document chunks based on the similarity of their vector representations to the prompt vectors. This approach is simpler and prioritizes relevance over diversity.</p><h3>Maximal Marginal Relevance (MMR)</h3><h4>What is it?</h4><p>MMR is a rank-order technique to re-rank the retrieval results to balance relevance and diversity. 
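</p><p>The idea behind MMR can be captured in a few lines of plain Python. The function below is a minimal sketch over toy, precomputed similarity scores (not LangChain&#39;s internal implementation): each pick maximizes <em>lambda_mult * relevance - (1 - lambda_mult) * redundancy</em>, where redundancy is the highest similarity to anything already selected.</p>

```python
def mmr(query_sim, doc_sim, k=2, lambda_mult=0.5):
    """Toy Maximal Marginal Relevance re-ranking.

    query_sim[i]  : similarity of document i to the query.
    doc_sim[i][j] : similarity between documents i and j.
    """
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: closest similarity to anything already picked.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates; document 2 is different but relevant.
query_sim = [0.90, 0.85, 0.70]
doc_sim = [[1.00, 0.95, 0.20],
           [0.95, 1.00, 0.20],
           [0.20, 0.20, 1.00]]

print(mmr(query_sim, doc_sim, k=2, lambda_mult=0.5))  # [0, 2]: diverse pick
print(mmr(query_sim, doc_sim, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
```

<p>Notice how the near-duplicate document 1 is skipped at the balanced setting but wins at lambda_mult = 1.0, which is exactly the relevance-versus-diversity trade-off MMR controls.</p><p>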
It helps in reducing information redundancy by considering both the similarity of each document to the query and the dissimilarity of the documents to each other.</p><h4>What variables (parameters) make it work?</h4><ul><li>`<strong><em>k</em></strong>`: The number of top documents to retrieve.</li><li>`<strong><em>lambda_mult</em></strong>`: A parameter that sets the trade-off between relevance and diversity. A higher value favors relevance, while a lower value favors diversity.</li></ul><h4>How does it work?</h4><p>In the function, MMR is used as a `<em>search_type</em>` for the `<em>vectordb</em>` retriever. It’s also used to determine whether to include MMR in the ensemble retriever architecture.</p><h3>MultiQueryRetriever</h3><h4>What is it?</h4><p>The `MultiQueryRetriever` is a class from the LangChain library that enhances the retrieval process by generating multiple versions of the given user prompt; this is done using an LLM as the underlying foundation model. The approach aims to address the limitations of traditional distance-based vector database retrieval methods (like direct similarity search), which can be sensitive to subtle changes in prompt wording and may not always capture the semantics of the data correctly.</p><p>Distance-based vector database retrieval works by converting the prompt and documents into high-dimensional vectors using an <a href="https://www.cloudflare.com/learning/ai/what-are-embeddings/"><strong>Embedding Model</strong></a>, and then finding document vectors that are “close” to the prompt vectors in terms of some distance metric (e.g., cosine similarity).</p><p><em>Again, there are many other metrics to measure semantic distances, and I urge you to play around with them!</em></p><p>However, this method can sometimes fail to retrieve relevant document chunks if the embeddings do not align well with the semantics of the user query or the documents; hence, using the right embedding model is crucial, as different 
models produce vectors of different dimensionality as well.</p><p>The `MultiQueryRetriever` overcomes this by using an LLM to generate multiple variations of the original query. These variations are intended to capture different interpretations of the query, thus augmenting the context of the user question itself, which can lead to a more comprehensive set of retrieved documents.</p><h4>What variables (parameters) make it work?</h4><ul><li>`<strong><em>llm</em></strong>`: The language model used for generating the multiple queries. This model can be configured with parameters like `<em>temperature</em>` to control the diversity of the generated queries.</li><li>`<strong><em>retriever</em></strong>`: The underlying retriever used to perform the actual document retrieval for each generated query. This is typically a vector store-based retriever.</li><li>`<strong><em>query</em></strong>`: The original user input prompt that is to be expanded into multiple queries.</li></ul><h4>How does it work?</h4><ul><li><strong>Query Generation</strong>: The `MultiQueryRetriever` takes the original user query and uses the LLM to generate multiple versions of the same question.</li><li><strong>Retrieval</strong>: For each generated version, the retriever performs a search using the underlying vector database to retrieve a comprehensive set of relevant documents.</li><li><strong>Aggregation</strong>: The retriever then takes the mathematical union of all the unique documents retrieved across the different queries to form the comprehensive set of relevant document chunks.</li></ul><p>`<em>MultiQueryRetriever</em>` is created using a retriever (with or without MMR) and a language model. 
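</p><p>The aggregation step of the `MultiQueryRetriever` is worth seeing in miniature. The sketch below fakes both ingredients with plain Python (a hard-coded toy &quot;retriever&quot; and hand-written query variants standing in for the LLM&#39;s rewrites) just to show the union-of-unique-documents behavior:</p>

```python
# Toy stand-in for a retriever: maps a query string to ranked document ids.
# (Hypothetical data; a real MultiQueryRetriever would call a vector store.)
toy_index = {
    "refund policy": ["doc_refunds", "doc_terms"],
    "money back guarantee": ["doc_refunds", "doc_guarantee"],
    "how to return a product": ["doc_returns", "doc_refunds"],
}

def retrieve(query):
    return toy_index.get(query, [])

# Pretend the LLM generated these three rewrites of one user question.
query_variants = list(toy_index)

union = []
for q in query_variants:
    for doc in retrieve(q):
        if doc not in union:   # keep unique docs, preserving first-seen order
            union.append(doc)

print(union)
# ['doc_refunds', 'doc_terms', 'doc_guarantee', 'doc_returns']
```

<p>Duplicates across the variants collapse into a single entry, which is the “mathematical union” described above: a broader result set than any single query produces, without redundancy.</p><p>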
It is also used in the ensemble architecture when multi-query functionality is desired.</p><h3>EnsembleRetriever</h3><h4>What is it?</h4><p>The `<em>EnsembleRetriever</em>` is a retrieval architecture that combines the results of multiple retrievers to provide a more comprehensive and potentially more accurate set of search results.</p><p>An ensemble method in the context of information retrieval (IR) involves using multiple retrieval algorithms to obtain a set of candidate document chunks and then merging these results into a single ranked list. The `<em>EnsembleRetriever</em>` specifically uses the <a href="https://safjan.com/implementing-rank-fusion-in-python/"><strong>Reciprocal Rank Fusion</strong></a> (RRF) algorithm to re-rank the combined results.</p><p>Ok, I just dropped another big algorithm name, but what is it exactly?</p><p>RRF is a rank-aggregation algorithm that combines document chunks that have been retrieved and ranked by relevance by different retrieval systems. The key idea is to assign a score to each document based on its rank in each individual retrieval system’s results. The inverse of the rank (1/rank) is used, so that higher-ranked documents receive higher scores. The scores from all retrieval systems are then summed for each document, and the documents are re-ranked based on these combined scores to produce the final ranking.</p><h4>What variables (parameters) make it work?</h4><ul><li>`<strong><em>retrievers</em></strong>`: The list of retrievers to be combined. Each retriever has its own method for obtaining relevant documents, such as BM25 for keyword-based retrieval or a vector-based method for semantic retrieval.</li><li>`<strong><em>weights</em></strong>`: The weights assigned to each retriever’s results when combining them. 
These essentially determine how much importance each retriever’s results are given.</li></ul><h4>How does it work?</h4><p>To use the `EnsembleRetriever`, you provide a list of the retrievers you want to combine. The `EnsembleRetriever` will then call the `<em>get_relevant_documents()</em>` method of each retriever (a built-in method LangChain provides on these retriever classes) for a given user prompt, combine the results using RRF, and return the re-ranked list of document chunks.</p><h3>Going into the Why and When</h3><p>Now I hope that I have been able to give you at least a beginner’s understanding of the different retrievers and their use cases, but you might ask: why would I necessarily choose one over the other? Let’s dive into some of their key differences and when one would be preferred over another.</p><h3>BM25Retriever</h3><h4>Advantages</h4><ul><li><strong>Keyword matching</strong>: BM25 is hard to beat when it comes to finding regions of the documents that contain specific keywords, making it suitable for searches where users know exactly what terms they are looking for.</li><li><strong>Scalability</strong>: BM25 is computationally efficient and can handle large collections of documents in relatively low-resource environments.</li><li><strong>Simplicity</strong>: It is straightforward to implement and understand, as the underlying search algorithm is not very complicated, with well-established tuning parameters.</li></ul><h4>Disadvantages</h4><ul><li><strong>Lack of semantic understanding</strong>: BM25 does not capture the meaning behind the words in the context or user prompt, so it can miss relevant documents that do not contain the exact terms of the prompt.</li><li><strong>Sensitivity to term frequency</strong>: BM25 can be biased towards longer documents or those that repeat certain terms, which may not always align with relevance.</li></ul><h4>Preferred Use Cases</h4><p>BM25 is preferred when the search 
queries are expected to contain precise keywords that match the document content, and when computational efficiency is a higher priority.</p><h3>VectorStore Retriever</h3><h4>Advantages</h4><ul><li><strong>Semantic search</strong>: Vector-based retrieval is able to capture the meaning of the text (the semantic relationships between words), enabling it to find relevant document chunks based on conceptual and contextual similarity rather than exact keyword matches.</li><li><strong>Model capabilities</strong>: The embeddings used in vector-based retrieval benefit from advances in language models, which can understand nuanced language and context. Being mathematical representations of the data, the embeddings can be processed and evaluated more accurately.</li></ul><h4>Disadvantages</h4><ul><li><strong>Computational cost</strong>: Generating and storing the embeddings of the documents can be resource-intensive, especially for large document collections.</li><li><strong>Sensitivity to embedding quality</strong>: The effectiveness of the retrieval depends on the quality of the embeddings, which varies with the language model itself.</li></ul><h4>Preferred Use Cases</h4><p>Vector-based retrieval is preferred when there is a need to find relevant information without relying only on exact keyword matches.</p><h3>MultiQueryRetriever</h3><h4>Advantages</h4><ul><li><strong>Query expansion</strong>: By generating multiple versions of the user’s prompt, the MultiQueryRetriever can capture a wider range of relevant documents that might be missed by a single query.</li><li><strong>Robustness in query creation</strong>: It is less sensitive to the specific wording of the query, as it explores different ways of expressing the same intent. 
This is done by the underlying LLM.</li></ul><h4>Disadvantages</h4><ul><li><strong>Computational overhead</strong>: Generating multiple queries and retrieving documents for each can be computationally expensive.</li><li><strong>Potential for introducing noise</strong>: While a broader set of documents is retrieved, some of them could be less relevant, introducing non-relevant or inaccurate content into the results.</li></ul><h4>Preferred Use Cases</h4><p>MultiQueryRetriever is preferred when there is uncertainty in how users may phrase their queries, and when it is important to retrieve a diverse set of potentially relevant documents.</p><h3>EnsembleRetriever</h3><h4>Advantages</h4><ul><li><strong>Combining strengths</strong>: By bringing together the best of both <a href="https://ar5iv.labs.arxiv.org/html/2109.10739"><strong>sparse and dense retrievers</strong></a>, the EnsembleRetriever can provide a more comprehensive set of results that benefits from both keyword matching and semantic understanding.</li><li><strong>Improved performance</strong>: It often achieves better performance than any single retriever by mitigating their individual weaknesses.</li></ul><h4>Disadvantages</h4><ul><li><strong>Complexity</strong>: Managing and tuning multiple retrievers is more complex than using a single method.</li><li><strong>Resource intensity</strong>: Running multiple retrievers and then combining their results leads to higher computational and memory consumption.</li></ul><h4>Preferred Use Cases</h4><p>The EnsembleRetriever is preferred when the goal is to maximize retrieval performance and when there is a need to balance keyword precision against semantic recall. 
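</p><p><em>To make RRF concrete, here is a minimal, illustrative sketch of the 1/rank scheme described above. It is not the `EnsembleRetriever` source code (production implementations typically smooth the score as 1/(k + rank) with a constant k); the function name and toy document ids are made up for the example.</em></p>

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights=None):
    """Fuse several ranked lists of document ids into one list.

    A document scores weight * 1/rank in every list where it appears
    (ranks start at 1); the per-list scores are summed and the
    documents are re-sorted by the combined score."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranked_docs, w in zip(rankings, weights):
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] += w / rank
    return sorted(scores, key=scores.get, reverse=True)

# Toy output of a keyword retriever and a vector retriever.
bm25_ranked = ["doc_a", "doc_b", "doc_c"]
dense_ranked = ["doc_c", "doc_b", "doc_d"]
fused = reciprocal_rank_fusion([bm25_ranked, dense_ranked])
```

<p><em>Here `doc_c` wins the fused ranking: it scores 1/3 + 1/1, beating `doc_a`’s single 1/1.</em></p><p>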
It’s particularly useful when dealing with very heterogeneous data or when high variability in user questions is expected in terms of specificity or phrasing.</p><h3>Encapsulating our long discussion</h3><p>In summary, I think I can confidently say that retrievers are the linchpin that enables these modern LLM systems to sift through vast collections of documents, like a dispensing machine that understands what you might want to drink and outputs the drink that you actually wanted. From the more traditional <strong>BM25Retriever</strong>, to the <strong>VectorStore’s</strong> nuanced grasp of semantics, each retriever plays a distinct role in the performance of an effective RAG system. The <strong>MultiQueryRetriever</strong> broadens the search horizon by multi-faceting the user prompts, while the <strong>EnsembleRetriever</strong> efficiently balances different strategies to retrieve results that a single retriever could fail to find on its own.</p><p>From this discussion, it is evident that building an optimal retrieval system is not just a technical skill but requires a strategic understanding of information retrieval intentions. Carefully chosen parameters like ‘<em>retriever_search_type</em>’ and ‘<em>mmr_lambda_mult</em>’ are not merely switches that turn on a process, but tools to influence both the precision and the recall of the architecture.</p><h3>Wrapping up this conversation</h3><p>As GenAI progresses, we are already seeing innovative methods and optimizations in these kinds of architectures. Understanding the interplay between retrievers not only enhances current performance but also opens the door to more sophisticated and intelligent information retrieval.</p><p>If you did find this article informative or helpful, let me know! 
And if you found areas for improvement, please let me know as well; even I am an ever-learning developer in the generative AI landscape.</p><hr><p><a href="https://medium.com/swlh/exploring-the-core-of-augmented-intelligence-advancing-the-power-of-retrievers-in-rag-frameworks-3ef9fe273764">Advancing the Power of Retrievers in RAG Frameworks</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Customer Segmentation, Identifying the Profit Among the Loose Ends.]]></title>
            <link>https://medium.com/swlh/customer-segmentation-identifying-the-profit-among-the-loose-ends-6fe4d6279873?source=rss-315151b8e67d------2</link>
            <guid isPermaLink="false">https://medium.com/p/6fe4d6279873</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[customer-segmentation]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[udacity]]></category>
            <dc:creator><![CDATA[Vinayak Sengupta]]></dc:creator>
            <pubDate>Wed, 15 Jul 2020 04:50:20 GMT</pubDate>
            <atom:updated>2020-09-02T14:50:22.837Z</atom:updated>
<content:encoded><![CDATA[<p>A short descriptive narrative about a recent real-world data science project I completed for an online certification course on Udacity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/760/1*LqG9ESNbBC0NU6LltHd8rg.jpeg" /></figure><h3>What is this about?</h3><p>This blog post discusses the insights I gained from working on the data provided by <strong>Bertelsmann Arvato Analytics</strong>. The project is based on a real-world data science problem of customer segmentation.</p><p>The main aim of this project is to predict, from the general population, those individuals who are likely to become customers of the company based on their demographic attributes.</p><h3>How was this done?</h3><p>Two main approaches were taken in the identification of potential clients:</p><ul><li>An Unsupervised Learning approach, where we perform segmentation and form <strong>Clusters </strong>on the company’s current customer base as well as the general population.</li><li>A Supervised Learning approach, in which we use a <strong>machine-learning algorithm</strong> to predict whether or not each individual will respond to the campaign (target customers).</li></ul><p>For the evaluation of our model’s accuracy, we have utilised the AUC/ROC metric.</p><h3>Engineering the Data</h3><p>For the project, we were provided with 4 files containing the demographic data of both customers and the general population of Germany. The dataset contains a notorious amount of missing values, so we must clean it to be able to work with it.</p><blockquote><strong><em>The Dataset files</em></strong></blockquote><ul><li><strong>CUSTOMERS: </strong>Demographic data for customers of a mail-order company. 191,652 (rows) x 369 (columns)</li><li><strong>AZDIAS: </strong>Demographic data for the general population of Germany. 
891,211 (rows) x 366 (columns)</li><li><strong>MAILOUT_TRAIN: </strong>Demographic data for individuals who were targets of a marketing campaign. 42,982 (rows) x 367 (columns)</li><li><strong>MAILOUT_TEST</strong>: Demographic data for individuals who were targets of a marketing campaign. 42,833 persons (rows) x 366 (columns).</li></ul><p>Our main focus will be on the customer and azdias datasets.</p><blockquote><strong><em>Exploring the Datasets</em></strong></blockquote><p>To make the data ready as an input to the machine learning algorithm, we have to fill the NaN (missing) values. To understand how all the missing values can be filled, let&#39;s take a closer look at the columns with the largest number of missing values.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/438/1*NtXN6EmRaulfferBkpEEng.png" /><figcaption>Top 10 columns containing missing values (in %)</figcaption></figure><p>From the graph above, it can clearly be seen that a significant amount of data is missing in many columns, with 4 columns missing more than 90% of their values.</p><p>The graph below showcases the proportion of missing data in the columns:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*7Nv5xEleyVkn_SeIm3SKQw.png" /><figcaption>The percentage of missing data</figcaption></figure><p>To make the dataset workable for our analysis, we will have to clean and modify it heavily, so as not to lose too much important information while also getting rid of sources of inaccuracy.</p><p>Hence, we start by dropping all the rows with more than 16 missing values present in them. 
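</p><p><em>In pandas, this row-dropping step can be sketched as follows. The tiny frame and its column names are made up for illustration; on the real 366-column azdias data the threshold is 16.</em></p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographics data; column names are invented.
df = pd.DataFrame({
    "age": [45, np.nan, 33],
    "income_band": [3, np.nan, 1],
    "region": ["N", np.nan, "S"],
})

# Keep only the rows with at most `max_missing` missing values
# (the project uses max_missing=16 on the full-width dataset).
max_missing = 1
cleaned = df[df.isna().sum(axis=1) <= max_missing]
```

<p><em>Here only the middle row, which is missing all three values, gets dropped.</em></p><p>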
This will reduce the row count for our dataset from 891,211 to 733,227.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*RxYGkFYQBI5ennGYQi0iOQ.png" /><figcaption>Percentage of data kept</figcaption></figure><p>The following steps were taken to get rid of unnecessary columns and missing data:</p><ul><li>Removed all the columns that contain more than 65% missing data.</li><li>Dropped 3 additional columns that are present in the customer dataset but not in the general population data.</li><li>Dropped columns with too many distinct values.</li><li>Dropped columns that are highly correlated.</li><li>Filled missing values in certain columns with ‘-1’ based on our analysis.</li><li>Filled the remaining missing data with the mode.</li></ul><p>We then removed the outliers and normalised the data to a standardized range of values for better analytical calculations.</p><h3>Reduction of Dimensionality</h3><p>Having cleaned the data using the above-mentioned steps, the final shape of our <strong>general population </strong>data is 415,405 rows and 283 columns, and that of the <strong>customer data </strong>is 100,341 rows and 303 columns. Even after dropping multiple columns, we still have high-dimensional data. To reduce the dimensionality further, we can use <strong>Principal Component Analysis (PCA)</strong> to retain the important information whilst reducing the dimensions. To choose the number of components to keep, we can take the help of the following graph:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PuZvbb0XAEMaXk27THfBZQ.png" /><figcaption>Principal Component Analysis Graph</figcaption></figure><p>Our aim is to choose the number of components for which our PCA retains a high share of the variance in the data. 
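</p><p><em>In code, this component choice can be sketched with scikit-learn. The synthetic matrix and the 0.92 variance target below are illustrative assumptions, standing in for the scaled demographic data.</em></p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples of 10-dimensional data that
# really only has 3 underlying degrees of freedom plus tiny noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components explaining >= 92% of the variance.
n_components = int(np.searchsorted(cumulative, 0.92) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X)
```

<p><em>On the real data, the same cumulative-variance read-off is what the PCA graph encodes.</em></p><p>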
We can see from the graph above that the line seems to flatten at around 220 components.</p><p>So we reduce the number of features to 220 in both datasets, which gives an explained variance ratio of 0.92699 for <strong>azdias</strong> and 0.99735 for the <strong>customer</strong> dataset.</p><h3>Clustering</h3><p>To decide on the number of clusters, we utilise the <strong>Elbow Method</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/975/1*6pGnnlbNq6C5QXkDtEW5fA.png" /><figcaption>Elbow Graph for Clusters</figcaption></figure><p>Referring to the chart above, we can see that at about 16 clusters the average distance within the clusters more or less flattens. Hence, we will be using 16 clusters for the segmentation.</p><p>By analyzing the clustering data closely, we also found that cluster ‘5’ in particular has an over-representation of customers.</p><h3>Supervised Learning</h3><p>In the final part of our project, we train our Machine-Learning (ML) model on the <strong>MAILOUT_TRAIN </strong>dataset and then predict on <strong>MAILOUT_TEST</strong> whether or not an individual is a potential customer of the company.</p><p>To train our model, we first cleaned the file using the above methods and then split the data into training and validation (testing) sets.</p><p>We utilised the <strong>LGBM Regressor </strong>to train our model and used <strong>Grid Search</strong> to find the parameters for which our <strong>ROC</strong> score is highest. Predictions on the test data were then made using the trained model.</p><h3>Conclusion</h3><p>Over the course of the project, I learned a lot more than I expected to, since it involved tackling a real-world problem using real industry data. The most challenging part of this project was the data cleaning. 
Understanding the data well enough to remove the missing values and potential outliers is necessary, but it must not come at the cost of otherwise important information. For this, various steps and methods, such as those exemplified above, must be researched and kept in mind.</p><p>Like every implementation of a concept, even this project can be improved upon. A different model and further fine-tuning could achieve a better overall score. Other approaches to data engineering could be used to handle the missing and misleading data. These changes might improve the performance of our model.</p><p>Lastly, I would like to express my utmost gratitude to Arvato Analytics and Udacity for providing me with the opportunity to work on such a challenging problem, which has helped me sharpen my data science skills.</p><p>The code for all the methods implemented above can be found on my <a href="https://github.com/vinzlercodes/Customer-Segmentation">Github</a>.</p><hr><p><a href="https://medium.com/swlh/customer-segmentation-identifying-the-profit-among-the-loose-ends-6fe4d6279873">Customer Segmentation, Identifying the Profit Among the Loose Ends.</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Last 40 Years of Gaming Industry, Unlocked.]]></title>
            <link>https://medium.com/swlh/the-last-40-years-of-gaming-industry-unlocked-baf4699ad8ba?source=rss-315151b8e67d------2</link>
            <guid isPermaLink="false">https://medium.com/p/baf4699ad8ba</guid>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[video-games-industry]]></category>
            <category><![CDATA[kaggle]]></category>
            <dc:creator><![CDATA[Vinayak Sengupta]]></dc:creator>
            <pubDate>Sat, 11 Jul 2020 08:08:03 GMT</pubDate>
            <atom:updated>2020-09-02T14:49:53.154Z</atom:updated>
<content:encoded><![CDATA[<h4>The gaming industry has become enormously popular and lucrative.<br>This article delves into the elements that make this form of<br>entertainment industry tick.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a9VWuYoDOKDAADPJ" /></figure><p>The US gaming industry made $120 Billion (yes, with a B!) worth of revenue in 2018 alone. The gaming industry is a playground for pioneering technology innovation. It has introduced ideas like real-life motion replication in games (Nintendo’s Wii) and life-like graphics-enabled entertainment consoles (Sony’s PlayStation series). The innovations in the gaming industry have lent themselves to aiding high-end research in the technology industry; for example, NVIDIA GPUs commonly used for high-end PC gaming have become paramount for the research and training of Machine Learning models.</p><p>An industry that has driven so many of today’s tech innovations deserves a deeper understanding of what makes it go round, and an analysis of some of the attributes that affect its rise and fall.</p><h3>The Questions</h3><p>The following are the questions that I have answered in this analysis:</p><ul><li>Which <strong>Genre</strong> of gaming has been the most popular?</li><li>Which<strong> Platform</strong> has been the most popular to play games on?</li><li>How has the <strong>Sales Trend </strong>for games evolved over the past 40 years?</li><li>Which <strong>Publishers</strong> have the most <strong>Global Sales</strong>, and which<strong> Regions</strong> contribute the maximum of these sales?</li></ul><h3>The Data and its Preparation</h3><p>I started by searching for an informative dataset and decided upon <a href="https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings">this</a> dataset from Kaggle. 
The attributes used to perform the analysis are:</p><ul><li><strong>Name</strong>: The game’s name</li><li><strong>Platform</strong>: The platform of the game&#39;s release (PC, PS4, etc.)</li><li><strong>Year_of_Release</strong>: Year of the game’s release</li><li><strong>Genre</strong>: The genre of the game</li><li><strong>Publisher</strong>: Publisher of the game</li><li><strong>NA_Sales</strong>: Sales in North America (in millions)</li><li><strong>EU_Sales</strong>: Sales in Europe (in millions)</li><li><strong>JP_Sales</strong>: Sales in Japan (in millions)</li><li><strong>Other_Sales</strong>: Sales in other regions like India, S.E. Asia, etc. (in millions)</li><li><strong>Global_Sales</strong>: Total worldwide sales (in millions)</li></ul><p>Along with these attributes, the dataset also contained certain attributes that would have proven helpful, but unfortunately, with these columns missing up to 45–50% of their values, I had to drop them to avoid inaccuracies.</p><p>I continued to make the dataset more workable by checking for missing values in the columns. The Publisher column had 54 missing values, which I looked up and filled in. There were also some rows with missing values across multiple columns, which I had to drop as well. 
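</p><p><em>A minimal pandas sketch of this cleaning flow, on a toy frame whose rows, values and the looked-up publisher are invented for illustration:</em></p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle video-game sales data.
games = pd.DataFrame({
    "Name": ["Wii Sports", "GTA V", "Tetris", "Pong"],
    "Publisher": ["Nintendo", np.nan, "Nintendo", "Atari"],
    "Critic_Score": [76, 97, np.nan, np.nan],
    "Global_Sales": [82.7, 55.9, 30.3, np.nan],
})

# Drop columns where too large a share of the values is missing
# (the post uses a 45-50% cutoff on the real data).
games = games.loc[:, games.isna().mean() <= 0.45].copy()

# Fill the publishers that could be looked up by hand, then drop
# any rows still carrying missing values.
games["Publisher"] = games["Publisher"].fillna("Rockstar Games")
games = games.dropna()
```

<p><em>The half-empty score column goes first, then the remaining gaps decide which rows survive.</em></p><p>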
These few steps led to a much cleaner dataset with no missing values, ready for analysis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*KCn-P3fqQ7NBoyLhpS9xbA.png" /><figcaption>The Cleaned Dataset</figcaption></figure><h3><strong>The Answers</strong></h3><ol><li><em>Which </em><strong><em>Genre</em></strong><em> of gaming has been the most popular?</em></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/395/1*9yduSKZrdNh8C5Ew_BSbZQ.png" /><figcaption>Top 5 globally selling gaming genres</figcaption></figure><p>The above graph clearly shows which 5 genres of gaming have acquired the most popularity over the past 40 years, with “<strong>Action</strong>” reigning supreme at more than 3,000 million games sold, followed by “<strong>Sports</strong>”, “<strong>Miscellaneous</strong>” (which basically means a single game containing multiple mini-games, e.g. Wii Play with laser hockey, shooting, fishing, etc.; basically “<strong>Party Games</strong>”), “<strong>Role-Playing</strong>” and finally “<strong>Shooter</strong>”. The above analysis was performed using each genre’s <strong>Global Sales</strong>.</p><p>2. <em>Which</em><strong><em> Platform</em></strong><em> has been the most popular to play games on?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/395/1*KTnfeAC5krHeXxU_duOxaw.png" /><figcaption>Top 5 globally selling gaming platforms</figcaption></figure><p>Analysing the dataset further revealed that from 1984–2020, “<strong>Sony’s PlayStation 2 (PS2)</strong>” has proven to be the world’s best-selling gaming console. Released in 2000, it continues to beat all that came after it, including giants like the “<strong>Nintendo DS</strong>”, its own successor, “<strong>Sony’s PlayStation 3 (PS3)</strong>”, the “<strong>Nintendo Wii</strong>” and finally “<strong>Microsoft’s Xbox 360</strong>”.</p><p>3. 
<em>How has the </em><strong><em>Sales Trend </em></strong><em>for games evolved over the past 40 years?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/542/1*UlwSkiWr78EK3IVJU7DS_w.png" /><figcaption>The Global Sales Trend through the past 40 years</figcaption></figure><p>Studying the <strong>Sales Trends</strong> is important to the successful analysis of any industry. The above <strong>Time-Series Analysis</strong> showcases the <strong>Global Sales</strong> revenue trend for the past 40 years. Till <strong>1995</strong>, video game sales went up and down, probably because the industry was just starting out, spreading across nations and building a base. After 1995 there was a steady rise in sales with fewer variations in the trend. From <strong>2000</strong> the volumes jumped sharply and the graph grew almost vertically till <strong>2007–2008</strong>. Thereafter, there was a sharp and steady fall in sales and consequent revenues until <strong>2015</strong>, owing to the World Financial Crisis, lasting from 2012 to 2016. <strong>Post-2016</strong>, the trend has been growing, although slowly. <strong>2020</strong> was supposed to be a game-changing (no pun intended) year for the industry, with massive titles and new consoles announced. Unfortunately, the COVID-19 pandemic, continuing to this day, has greatly hampered sales and releases, and continues to affect the industry.</p><p>4. <em>Which </em><strong><em>Publishers</em></strong><em> have the most </em><strong><em>Global Sales</em></strong><em>, and which</em><strong><em> Regions</em></strong><em> contribute the maximum of these sales?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/472/1*urPgTiRCWDVR1VKFUWjwKA.png" /><figcaption>Top 5 Publishers with most Global Sales</figcaption></figure><p>The above bar graph showcases the top 5 companies with the most global sales when it comes to their video game titles and respective gaming consoles. 
America leads the way with “<strong>Electronic Arts (EA)</strong>”, with over 1,345 million copies sold. EA has been responsible for some of the biggest titles, like the “<strong>FIFA Series</strong>” and the “<strong>Need for Speed Series</strong>”, which rank among the best-selling game franchises of all time. Second place goes to “<strong>Activision</strong>”, maker of the sensational “<strong>Call of Duty Series</strong>”. A close third is “<strong>Namco Bandai Games</strong>”, producer of the world-famous “<strong>PAC-MAN</strong>”. The ranking is followed by “<strong>Ubisoft</strong>”, holder of the fan favourite “<strong>Assassin’s Creed Series</strong>”. Finally, there is the Japanese publisher “<strong>Konami Digital Entertainment</strong>”, maker of the retro sensation “<strong>Contra</strong>”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/1*8vHgslbC-dk3-pw5ucaLyg.png" /><figcaption>Regional Sales Division for the top 5 companies</figcaption></figure><p>Looking at these companies more closely, I decided to see which regions out of the ones in the dataset (North America, Europe, Japan, Other Regions) have been responsible for the majority of their sales, globally. Starting with “<strong>Activision</strong>”, the graph clearly shows that the majority of its video-game sales are in the “<strong>North America</strong>” region. “<strong>Electronic Arts</strong>” seems to get most of its profits from “<strong>North America</strong>” as well. As expected, “<strong>Konami Digital Entertainment</strong>”, long the home of Hideo Kojima, gains its maximum sales volume from “<strong>Japan</strong>”, with “<strong>North America</strong>” coming a close second. “<strong>Namco Bandai Games</strong>” also receives a majority of its sales from “<strong>Japan</strong>”, by a big margin as well. 
And lastly, “<strong>Ubisoft</strong>” too owes the majority of its sales to “<strong>North America</strong>”.</p><h3>Conclusion</h3><p>In conclusion, it can definitely be said that, like any other large tech sector, the gaming industry is not just consumer-driven but also tailors its various releases to cater to a large, heterogeneous audience. After the analysis, I can also firmly state that the future of video games will only grow and branch out, giving birth to more genres as well as fusions of genres that build more dynamic gaming experiences. There is also a gradual rise in the spread of video-game culture, which is no longer restricted to the USA or the west; eastern countries like Japan and India are also jumping on the Kart. A big example is how colossal E-sports tournaments have become all over the world, with humongous prize pools of millions of dollars, in games like “<strong>Counter-Strike Global Offensive</strong>” and “<strong>PUBG</strong>”.</p><h3>What&#39;s the next step?</h3><p>There are many ways one can conduct further analytics on the gaming industry: comparing different Sales Regions, or experimenting to find the various attributes affecting the sales of games. In fact, additional work can be done on the dataset itself regarding the missing values in the customer and critic scores for each game, for more in-depth analysis.</p><p>Food for thought:<em> What elements do you think determine the success of a gaming company?</em></p><p>Please do share your thoughts and suggestions. 
A link to my Github repository containing the code can be found <a href="https://github.com/vinzlercodes/Gaming-Industry-Analysis">here</a>.</p><hr><p><a href="https://medium.com/swlh/the-last-40-years-of-gaming-industry-unlocked-baf4699ad8ba">The Last 40 Years of Gaming Industry, Unlocked.</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>