Amit Chaudhary

The Anatomy of Tool Calling

Sat, 15 Feb 2025 00:00:00 GMT

Giving an LLM the capability to call some external function based on the user’s input and receive the results back is a very powerful pattern and a key element behind the rapid rise of agentic workflows.

This pattern powers many of the features we see on ChatGPT today, such as web search, code execution, image generation, or personalized memory based on conversation history.

LLM providers expose this as tool use or function calling. We provide all the function signatures and parameters as JSON Schema and can later call the implementation in any programming language.

For example, we can write a JSON schema to provide a simple add function to OpenAI as shown below.

!pip install openai -qqq

def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b

At first, we convert the function into a JSON Schema showing the name of the function, the description of what it does, and the name and type of all the parameters that it can take.

tools = [
    {
        "type": "function",
        "function": {
            "name": "add",
            "description": "Adds two integers together",
            "strict": True,
            "parameters": {
                "type": "object",
                "required": ["a", "b"],
                "properties": {
                    "a": {"type": "integer", "description": "The first integer to add"},
                    "b": {
                        "type": "integer",
                        "description": "The second integer to add",
                    },
                },
                "additionalProperties": False,
            },
        },
    }
]

Then, we can provide our schema as a list of tools and send a user query.

from openai import OpenAI

client = OpenAI()

messages = [{"role": "user", "content": "Add 2 and 3"}]
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
)

The model then decides that it wants to call the add function with the parameters a=2 and b=3

tool_call = completion.choices[0].message.tool_calls[0]
tool_call.function

Function(arguments='{"a":2,"b":3}', name='add')

tool_call.function.name

'add'

We can fetch the arguments to be passed to the function as shown below

import json

args = json.loads(tool_call.function.arguments)
args

{'a': 2, 'b': 3}

Then we call our function with those arguments and get a result

result = add(**args)
print(result)

5

The result is sent back to the LLM context as a separate message and it will generate a natural language response as a reply for the next turn.

messages.append(completion.choices[0].message)
messages.append(
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result),
    }
)

completion_after_tool_call = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
)
completion_after_tool_call.choices[0].message.content

'The sum of 2 and 3 is 5.'
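With more than one tool, the manual steps above are usually wrapped in a small dispatcher that looks up the function by name. The following is a minimal sketch rather than part of the OpenAI SDK: `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and the function only needs the `name` and `arguments` strings found on `tool_call.function`.

```python
import json


def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b


# Hypothetical registry mapping tool names to Python callables
TOOL_REGISTRY = {"add": add}


def run_tool_call(name: str, arguments: str) -> str:
    """Look up a tool by name, call it with the JSON-encoded arguments,
    and return the result as a string for the "tool" message."""
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool {name!r}"
    try:
        args = json.loads(arguments)
        return str(TOOL_REGISTRY[name](**args))
    except (json.JSONDecodeError, TypeError) as e:
        # Report bad arguments back to the model instead of crashing
        return f"Error: {e}"


print(run_tool_call("add", '{"a": 2, "b": 3}'))
```

Returning the error as text is one possible design choice: it lets the model see what went wrong and retry on the next turn.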

Now, the question becomes: how can we automatically convert Python functions into JSON Schemas?

In this post, I will go over various runtime introspection features that Python provides to extract pretty much everything about a function definition. Then we will use that knowledge to build automatic function-to-JSON-Schema converters.

Object Introspection

Let’s understand the various introspection features step by step.

Extracting the parameters and type-annotations

To get the parameters of the function, we can use the signature function of the inspect module.

def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b

This will return the entire signature for both the input parameters and the return type.

import inspect

signature = inspect.signature(add)
signature

<Signature (a: int, b: int) -> int>

We can get a dictionary of the parameters of the function from the signature

signature.parameters

mappingproxy({'a': <Parameter "a: int">, 'b': <Parameter "b: int">})

We can access each parameter from the dictionary. It will return an object that has many useful properties

a = signature.parameters["a"]
a

<Parameter "a: int">

We can now easily access the name of the parameter, its default value as well as the type annotation.

print("Name of parameter: ", a.name)
print("Default value: ", a.default)
print("Type annotation: ", a.annotation)

Name of parameter:  a
Default value:  <class 'inspect._empty'>
Type annotation:  <class 'int'>

This means that if a parameter has a default value of inspect._empty (also available under the public name inspect.Parameter.empty), it’s a required parameter.

a.default == inspect._empty

True
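The same check can separate required parameters from optional ones. This is a minimal sketch; `greet` and `split_parameters` are hypothetical names introduced only for illustration, and the comparison uses the public `inspect.Parameter.empty` sentinel.

```python
import inspect


def greet(name: str, greeting: str = "Hello") -> str:
    """Hypothetical example function with one optional parameter."""
    return f"{greeting}, {name}"


def split_parameters(func):
    """Return (required, optional) parameter names of a function."""
    required, optional = [], []
    for name, param in inspect.signature(func).parameters.items():
        # No default value means the caller must supply the argument
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional.append(name)
    return required, optional


print(split_parameters(greet))
```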

The type annotation is of particular interest to us. It will return the type directly

a.annotation

<class 'int'>

a.annotation == int

True

We can also get the type annotation for the return statement i.e. output of the function using the signature itself

signature.return_annotation

<class 'int'>

Extracting the docstring

To get the docstring, we can use the __doc__ attribute in the function

def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b

add.__doc__

'Adds two integers together'

An alternative approach is to use the inspect module itself.

import inspect

inspect.getdoc(add)

'Adds two integers together'
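The two approaches differ for multi-line docstrings. The sketch below, using a hypothetical two-paragraph docstring, shows that inspect.getdoc returns a cleaned-up string, while the raw attribute can keep the leading newline and, depending on the Python version, the source indentation.

```python
import inspect


def add(a: int, b: int) -> int:
    """
    Adds two integers together.

    Returns the sum.
    """
    return a + b


# The raw attribute still contains the leading newline
print(repr(add.__doc__))

# inspect.getdoc removes common indentation and surrounding blank lines
print(repr(inspect.getdoc(add)))
```

For schema descriptions, the cleaned version is usually the one we want to send to the model.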

Extracting the function name

This is relatively simple as Python already provides a __name__ attribute on each function.

def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b

add.__name__

'add'

Extracting the parameter descriptions from the docstring

We can make use of a third-party library called docstring_parser as the format of docstrings can vary a lot.

!pip install docstring_parser -qqq

def add(a: int, b: int) -> int:
    """
    Adds two integers together.

    Args:
        a (int): The first integer.
        b (int): The second integer.

    Returns:
        int: The sum of a and b.
    """
    return a + b

from docstring_parser import parse

doc = parse(add.__doc__)
{param.arg_name: param.description for param in doc.params}

{'a': 'The first integer.', 'b': 'The second integer.'}
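These descriptions can then be merged with the type information from the signature. The following is a sketch under assumptions: `TYPE_MAP` and `build_properties` are hypothetical names, and the `descriptions` dictionary is hardcoded to the output shown above so that the block runs without docstring_parser installed.

```python
import inspect

# Assumed mapping from Python types to JSON Schema type names
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}


def add(a: int, b: int) -> int:
    return a + b


def build_properties(func, descriptions: dict) -> dict:
    """Combine type annotations and parameter descriptions per parameter."""
    properties = {}
    for name, param in inspect.signature(func).parameters.items():
        prop = {"type": TYPE_MAP.get(param.annotation, "string")}
        if name in descriptions:
            prop["description"] = descriptions[name]
        properties[name] = prop
    return properties


# Same dictionary as the docstring_parser output above
descriptions = {"a": "The first integer.", "b": "The second integer."}
print(build_properties(add, descriptions))
```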

Functions to JSON Schema

With the above background knowledge, we have everything needed to convert the function definition to JSON Schema.

Let’s see how this is applied in various popular agent libraries.

Approach 1: Pure Python

This is the approach implemented in the OpenAI Swarm library. Here, we use the introspection features discussed above to write the conversion function from scratch.

!pip install git+https://github.com/openai/swarm.git -qqq

def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b

Swarm has a utility function called function_to_json that converts a python function into a JSON schema.

from swarm.util import function_to_json

function_to_json(add)

{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Adds two integers together',
        'parameters': {
            'type': 'object',
            'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
            'required': ['a', 'b']
        }
    }
}

As seen above, we first need some mapping to convert the parameter types from Python to the equivalent JSON schema data type.

python	json_schema
str	string
int	integer
float	number
bool	boolean
list	array
dict	object
None	null

Based on this, the implementation is quite simple and reuses all the concepts we discussed before.

We take the function signature and extract the type of each parameter, and also get the function name and docstring. Using these, we construct the JSON Schema at the end.

# Source: https://github.com/openai/swarm/blob/9db581cecaacea0d46a933d6453c312b034dbf47/swarm/util.py#L31
import inspect


def function_to_json(func) -> dict:
    # A mapping of types from python to JSON
    type_map = {
        str: "string",
        int: "integer",
        float: "number",
        bool: "boolean",
        list: "array",
        dict: "object",
        type(None): "null",
    }

    try:
        signature = inspect.signature(func)
    except ValueError as e:
        raise ValueError(
            f"Failed to get signature for function {func.__name__}: {str(e)}"
        )

    parameters = {}
    for param in signature.parameters.values():
        try:
            param_type = type_map.get(param.annotation, "string")
        except KeyError as e:
            raise KeyError(
                f"Unknown type annotation {param.annotation} for parameter {param.name}: {str(e)}"
            )
        parameters[param.name] = {"type": param_type}

    required = [
        param.name
        for param in signature.parameters.values()
        if param.default == inspect._empty
    ]

    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": func.__doc__ or "",
            "parameters": {
                "type": "object",
                "properties": parameters,
                "required": required,
            },
        },
    }

1: Get the function signature
2: For each parameter, convert the type annotation to valid JSON type. Default to string if user didn’t specify a type
3: Find out which parameters are required
4: Extract the function name
5: Extract the docstring

function_to_json(add)

{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Adds two integers together',
        'parameters': {
            'type': 'object',
            'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
            'required': ['a', 'b']
        }
    }
}
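The flat type mapping falls back to "string" for anything it does not recognize, including annotations such as list[int] or Optional[int]. The sketch below is not part of Swarm; it is one possible extension, with `annotation_to_schema` as a hypothetical helper built on typing.get_origin and typing.get_args, and it covers only a small subset of annotations.

```python
import types
import typing

PRIMITIVES = {
    str: "string",
    int: "integer",
    float: "number",
    bool: "boolean",
    type(None): "null",
}

# types.UnionType (the `int | None` syntax) only exists on Python 3.10+
UNION_TYPES = (typing.Union, getattr(types, "UnionType", typing.Union))


def annotation_to_schema(annotation) -> dict:
    """Map a type annotation to a JSON Schema fragment (illustrative subset)."""
    if annotation in PRIMITIVES:
        return {"type": PRIMITIVES[annotation]}
    if annotation is dict:
        return {"type": "object"}
    if annotation is list:
        return {"type": "array"}

    origin = typing.get_origin(annotation)
    args = typing.get_args(annotation)
    if origin is list and args:
        # list[X] becomes an array whose items follow the schema of X
        return {"type": "array", "items": annotation_to_schema(args[0])}
    if origin in UNION_TYPES:
        # Optional[X] is Union[X, None], so it becomes anyOf [X, null]
        return {"anyOf": [annotation_to_schema(arg) for arg in args]}

    # Same fallback as function_to_json: treat anything unknown as a string
    return {"type": "string"}


print(annotation_to_schema(list[int]))
print(annotation_to_schema(typing.Optional[int]))
```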

Approach 2: Pydantic

2a. Dynamic Models

I first came across this approach in Jeremy Howard’s talk, and this pattern is also implemented in popular libraries like LlamaIndex and LangChain under the hood.

Pydantic is a popular python library already used for data validation and serialization of structured data. As such, it can convert a Python class into a JSON schema directly.

For example, if we were to define a Pydantic model for our add function manually, it would look something like below.

from pydantic import BaseModel


class Add(BaseModel):
    a: int
    b: int


Add.model_json_schema()

{
    'properties': {'a': {'title': 'A', 'type': 'integer'}, 'b': {'title': 'B', 'type': 'integer'}},
    'required': ['a', 'b'],
    'title': 'Add',
    'type': 'object'
}

However, we actually want to create the Pydantic data model dynamically. This is possible via the create_model function provided by Pydantic. It takes the name for the model as the first argument, and then the fields of the model as keyword arguments.

Here a=(int, ...) means that the field a is of type int and is required.

from pydantic import create_model

a = create_model("Add", a=(int, ...), b=(int, ...))
a.model_json_schema()

{
    'properties': {'a': {'title': 'A', 'type': 'integer'}, 'b': {'title': 'B', 'type': 'integer'}},
    'required': ['a', 'b'],
    'title': 'Add',
    'type': 'object'
}

Thus, if we can somehow create a dictionary of our function parameters, then we can pass that using the **kwargs trick and then get the JSON schema directly.

from pydantic import create_model

a = create_model("Add", **{"a": (int, ...), "b": (int, ...)})
a.model_json_schema()

{
    'properties': {'a': {'title': 'A', 'type': 'integer'}, 'b': {'title': 'B', 'type': 'integer'}},
    'required': ['a', 'b'],
    'title': 'Add',
    'type': 'object'
}

Below, we implement a function that uses this concept to convert the add function into JSON Schema directly.

We use inspect.signature as before to get all the function parameters and then prepare a Pydantic model directly from it.

import inspect

from pydantic import create_model


def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b


def schema(f):
    kws = {
        name: (
            # Get the type annotation
            parameter.annotation,
            # Check if parameter is required or optional
            ... if parameter.default == inspect._empty else parameter.default,
        )
        for name, parameter in inspect.signature(f).parameters.items()
    }
    # Pass the function name and parameters to get a pydantic model
    p = create_model(f"`{f.__name__}`", **kws)

    # Convert to JSON Schema
    schema = p.model_json_schema()
    return {
        "type": "function",
        "function": {
            "name": f.__name__,
            "description": f.__doc__,
            "parameters": schema,
        },
    }


schema(add)

{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Adds two integers together',
        'parameters': {
            'properties': {
                'a': {'title': 'A', 'type': 'integer'},
                'b': {'title': 'B', 'type': 'integer'}
            },
            'required': ['a', 'b'],
            'title': '`add`',
            'type': 'object'
        }
    }
}
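The generated schema carries title entries that the hand-written schema at the start of the post did not have. If you prefer the output to match that shape, a small post-processing step can drop them. This is a sketch; `strip_titles` is a hypothetical helper, and it only removes string-valued title keys so that a parameter that happens to be named title is left alone.

```python
def strip_titles(schema):
    """Recursively drop string-valued "title" keys from a JSON Schema dict."""
    if isinstance(schema, dict):
        return {
            key: strip_titles(value)
            for key, value in schema.items()
            if not (key == "title" and isinstance(value, str))
        }
    if isinstance(schema, list):
        return [strip_titles(item) for item in schema]
    return schema


# Copy of the "parameters" value from the output above
parameters = {
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {"title": "B", "type": "integer"},
    },
    "required": ["a", "b"],
    "title": "`add`",
    "type": "object",
}

print(strip_titles(parameters))
```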

2b. Type Adapter

Pydantic introduced a new feature called TypeAdapter in version 2.0. It lets you validate, serialize, and generate JSON Schema for arbitrary Python types (and, as used below, plain functions) without defining a Pydantic model class.

We can use it to get JSON schema for the function parameters directly without requiring use of inspect.signature.

from pydantic import TypeAdapter


def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b


def schema(f):
    schema = TypeAdapter(f).json_schema()
    return {
        "type": "function",
        "function": {
            "name": f.__name__,
            "description": f.__doc__,
            "parameters": schema,
        },
    }


schema(add)

{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Adds two integers together',
        'parameters': {
            'additionalProperties': False,
            'properties': {
                'a': {'title': 'A', 'type': 'integer'},
                'b': {'title': 'B', 'type': 'integer'}
            },
            'required': ['a', 'b'],
            'type': 'object'
        }
    }
}

Approach 3: Decorators

Most agent libraries wrap conversion approaches like above as decorators (e.g. smolagents) to make them easier to use.

For example, we can make a decorator called tool, which, when applied to a function, will add a json_schema method to that function.

def tool(func):
    func.json_schema = lambda: function_to_json(func)
    return func

We can mark our functions with the decorator.

@tool
def add(a: int, b: int) -> int:
    """Adds two numbers"""
    return a + b

We can then use the json_schema method to get the schema directly and pass it downstream to the LLM API.

add.json_schema()

{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Adds two numbers',
        'parameters': {
            'type': 'object',
            'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
            'required': ['a', 'b']
        }
    }
}
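A decorator can also keep track of every tool it has seen, which makes it easy to build the full tools list in one place. The sketch below is an assumption about how one might do this, not the approach of any particular library: `TOOLS` and `simple_schema` are hypothetical names, and `simple_schema` is a reduced stand-in for function_to_json so that the block is self-contained.

```python
import inspect

TOOLS = {}  # hypothetical registry: tool name -> decorated function


def simple_schema(func) -> dict:
    """Tiny stand-in for function_to_json that handles int and str only."""
    type_map = {int: "integer", str: "string"}
    params = inspect.signature(func).parameters
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": func.__doc__ or "",
            "parameters": {
                "type": "object",
                "properties": {
                    name: {"type": type_map.get(p.annotation, "string")}
                    for name, p in params.items()
                },
                "required": [
                    name
                    for name, p in params.items()
                    if p.default is inspect.Parameter.empty
                ],
            },
        },
    }


def tool(func):
    """Register the function and attach its schema, returning it unchanged."""
    func.json_schema = lambda: simple_schema(func)
    TOOLS[func.__name__] = func
    return func


@tool
def add(a: int, b: int) -> int:
    """Adds two numbers"""
    return a + b


@tool
def shout(text: str) -> str:
    """Upper-cases the text"""
    return text.upper()


# The list that would be passed as `tools=` to the chat completion call
tools = [func.json_schema() for func in TOOLS.values()]
print([t["function"]["name"] for t in tools])
```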

Conclusion

Thus, we understood how Python’s runtime introspection enables automatic conversion of function definitions into JSON Schema.

Evals for Diversity in Synthetic Data

Sun, 09 Feb 2025 00:00:00 GMT

Synthetic data is a popular approach for bootstrapping an initial dataset when building LLM-based applications.

We can find practical examples of synthetic data usage in the wild such as:

Generating synthetic user queries from existing documents to evaluate RAG systems ¹
Producing fake meeting transcripts for video call summarization ²
Bootstrapping lots of texts (emails, inquiries, multi-turn chats etc.) for good old classification tasks (customer service routing, intent classification, sentiment analysis, etc.).

As a common starting point, people write a prompt defining the data they need, provide a few seed examples either within the prompt or as few-shot exemplars, and sample multiple times from the LLM to bootstrap a dataset.

However, LLMs generate repetitive outputs out of the box, and we need special techniques to increase diversity:

Sampling Parameters: higher temperature, nucleus-sampling, top-k sampling, random seeds
Attribute Generation: Generating various attributes (topics, writing style, length, personas, emotion, sentiment, location, etc.) beforehand and inserting randomly sampled attributes in the prompt. (Yu et al. (2023), Ge et al. (2024))
Post-decoding Clustering: Overgenerating a large number of texts and deduplicating via cluster centroids (Ippolito et al., 2019) and semantic hashing (Dongen and Tulkens, 2025)

But this raises the question:

How do we systematically test the impact of various techniques above on diversity without relying on just vibe checks?

I was curious and read the existing academic literature on evaluating diversity. It turns out that there is a large body of prior work on evaluating diversity from the days of classic sequence-to-sequence models and dialogue generation (Shaib et al. (2024a), Guo et al. (2024)).

In this post, I will discuss the various diversity metrics from the literature and explain how they work. These automatic metrics are fast to compute and can be a useful tool to have as a proxy for evaluating linguistic diversity in applied use cases.

Lexical Diversity Metrics

Lexical diversity metrics capture the surface-level repetition of words, phrases, topics, and n-grams in the generations.

Distinct n-grams (Distinct-k)

Li et al. (2016) proposed distinct-k to evaluate their technique for increasing diversity in sequence-to-sequence models. It builds on the type-token ratio concept from linguistics.

They calculate diversity as the ratio of the number of unique n-grams to the total n-grams occurring in the entire generated dataset. For example, the two texts “As an AI language model” and “As an AI model” contain only 5 unique words out of a total of 9 words and thus, the diversity score is only about 56% (5/9 ≈ 0.556).

However, if all the synthetic texts were unique, we would get a diversity score of 100% (1.0).

We can extend this same idea from unigrams to bigrams, trigrams, and any higher-order n-grams. There are two approaches.

In the first approach, we report the diversity score separately for different n-grams. Li et al. (2016) do this for unigrams and bigrams as distinct-1 and distinct-2, while Padmakumar and He (2023) report diversity scores up to 4-grams separately in their paper, which shows that instruction-tuned models have lower diversity compared to base models.

Alternatively, we can report a single diversity score by combining the scores for different n-grams. Li et al. (2022) take the product of the diversity score for unigrams, bigrams, trigrams, and four-grams as a single final score, while Meister et al. (2023) take the sum of the diversities.

The library diversity by Shaib et al. (2024a) provides an easy way to compute the distinct-k metric:

pip install diversity

from diversity import ngram_diversity_score

texts = ['As an AI language model', 'As an AI model']

ngram_diversity_score(texts, 1)

0.556
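For reference, the metric can also be sketched without any dependency. The version below assumes whitespace tokenization and collects n-grams within each text separately, so its values for higher-order n-grams may differ from the diversity library; `distinct_n` is a hypothetical name. It also shows the product and sum combinations described above.

```python
import math


def distinct_n(texts: list[str], n: int = 1) -> float:
    """Ratio of unique n-grams to total n-grams across all texts."""
    ngrams = []
    for text in texts:
        words = text.split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(words[i : i + n]) for i in range(len(words) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


texts = ["As an AI language model", "As an AI model"]
scores = [distinct_n(texts, n) for n in range(1, 5)]

print("distinct-1:", round(scores[0], 3))
print("product of distinct-1..4:", math.prod(scores))
print("sum of distinct-1..4:", sum(scores))
```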

N-gram Entropy (Ent-n)

Zhang et al. (2018) introduced this metric, and Jagfeld et al. (2018) also used it to evaluate the diversity of template to natural language generation.

The intuition behind it is that in the ideal case, the LLM generates texts that are all unique and no n-gram is repeated more than once.

We can measure this by collecting all the unique bigrams in the text and calculating their count and the relative frequency. This yields a probability distribution over the bigrams.

For the highest diversity, all the texts would be unique and thus the probability distribution over the bigrams would be uniform, resulting in the highest entropy. Therefore, the entropy of the n-gram distribution serves as a diversity metric.

Given the distribution of bigrams, we can calculate the entropy easily as shown below.

import math

probs = [0.25, 0.25, 0.25, 0.25]
-sum(p * math.log(p) for p in probs)

1.3862

However, let’s take another case where there is lots of repetition e.g. “Play the music” being generated 100 times. In such a case, the bigrams “Play the” and “the music” dominate the frequency distribution. As such, the entropy reduces, and thus, the diversity score drops.

We can also extend this idea to higher-order n-grams similar to the distinct n-grams metric. Tevet and Berant (2020) calculate and report entropy separately for unigrams, bigrams, and trigrams, while Oraby et al. (2018) combine all unique unigrams, bigrams, and trigrams and then use the entropy of the resulting distribution as the diversity.

This metric can be implemented in code as shown below.

import math
from collections import Counter


def generate_ngrams(words, n: int):
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def ngram_entropy(texts: list[str], n: int = 2) -> float:
    ngrams = []
    for text in texts:
        words = text.split()
        ngrams.extend(generate_ngrams(words, n))

    ngram_counts = Counter(ngrams)
    total_ngrams = sum(ngram_counts.values())

    ngram_frequencies = [count / total_ngrams for ngram, count in ngram_counts.items()]

    entropy = -sum(freq * math.log(freq) for freq in ngram_frequencies)
        
    return entropy

1: Generate n-grams from the input texts
2: Count the occurrences of each n-gram
3: Convert the counts into relative frequencies
4: Calculate the entropy of the frequency distribution

texts = ["Call an Uber", "Play the music"]

print("Unigram entropy:", ngram_entropy(texts, n=1))
print("Bigram entropy:", ngram_entropy(texts, n=2))
print("Trigram entropy:", ngram_entropy(texts, n=3))

Unigram entropy: 1.7917594692280547
Bigram entropy: 1.3862943611198906
Trigram entropy: 0.6931471805599453

Normalized N-gram Entropy

The original n-gram entropy metric doesn’t have a fixed range for the score.

To get a score in the range of 0 to 1, I thought of a normalized version inspired by the NDCG metric from Information Retrieval.

For any generated set of texts, the maximum possible diversity occurs when no n-gram is repeated, i.e. every n-gram occurrence has the same frequency. Thus, the entropy of a uniform distribution over all the n-gram occurrences gives us the upper bound of diversity.

We can calculate the n-gram entropy as before and then divide it by the entropy of the ideal uniform distribution over the n-grams to get a normalized diversity score between 0 and 1.

import math
from collections import Counter


def generate_ngrams(words: list[str], n: int) -> list[str]:
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def normalized_ngram_entropy(texts: list[str], n: int = 2) -> float:
    ngrams = []
    for text in texts:
        words = text.split()
        ngrams.extend(generate_ngrams(words, n))

    ngram_counts = Counter(ngrams)
    total_ngrams = sum(ngram_counts.values())

    ngram_frequencies = [count / total_ngrams for ngram, count in ngram_counts.items()]
    entropy = -sum(freq * math.log(freq) for freq in ngram_frequencies)

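    # Upper bound: a uniform distribution where every n-gram occurs exactly once
    uniform_frequencies = [1 / len(ngrams) for _ in range(len(ngrams))]
    ideal_entropy = -sum(freq * math.log(freq) for freq in uniform_frequencies)

    # With fewer than two n-grams the upper bound is 0, so the ratio is undefined
    if ideal_entropy == 0:
        return 0.0

    diversity = entropy / ideal_entropy
    return diversity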

We can use it similar to before.

texts = ["Call an Uber", "Play the music"]

print("Unigram diversity:", normalized_ngram_entropy(texts, n=1))
print("Bigram diversity:", normalized_ngram_entropy(texts, n=2))
print("Trigram diversity:", normalized_ngram_entropy(texts, n=3))

Unigram diversity: 1.0
Bigram diversity: 1.0
Trigram diversity: 1.0
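The example above only covers the fully diverse case. The sketch below applies the same idea to a repetitive dataset; `normalized_entropy` is a compact restatement of the function above under a hypothetical name, included so that the block runs on its own.

```python
import math
from collections import Counter


def normalized_entropy(texts: list[str], n: int = 2) -> float:
    """Compact restatement of the normalized n-gram entropy above."""
    ngrams = []
    for text in texts:
        words = text.split()
        ngrams.extend(" ".join(words[i : i + n]) for i in range(len(words) - n + 1))
    if len(ngrams) < 2:
        return 0.0  # upper bound is 0, so the ratio is undefined
    total = len(ngrams)
    counts = Counter(ngrams)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Upper bound: entropy when every n-gram occurs exactly once
    return entropy / math.log(total)


repetitive = ["Call an Uber"] + ["Play the music"] * 100
score = normalized_entropy(repetitive, n=2)
print(round(score, 2))
```

The bigram score here comes out at roughly 0.14, far below the 1.0 of the fully unique example, because “Play the” and “the music” dominate the distribution.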

Compression Ratio

Shaib et al. (2024a) proposed this metric by adapting the concept of the compression ratio, originally used to evaluate compression algorithms, as a diversity metric.

Compression ratio is the ratio of the original file size to the size of the compressed file. A high compression ratio indicates the file was highly compressible and thus had higher redundancy, indicating lower diversity in the file contents.

To apply this concept to texts, we can compress them using an algorithm like Gzip and then calculate the compression ratio. The greater the compression ratio, the less diverse the generated texts.

Thus, diversity can be calculated as the reciprocal of the compression ratio to get a score between 0 and 1.

If all the texts are unique, then the compressed file size would be roughly the same as the original file size and thus the compression ratio and the diversity would both be close to 1.

We can implement this in code using the diversity library.

pip install diversity

from diversity import compression_ratio

texts = ['Call an Uber'] + ['Play the music'] * 100

compression_ratio(texts)

16.258
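The underlying calculation can be sketched with the standard library alone. This is not the diversity library's implementation: `gzip_compression_ratio` is a hypothetical name, and joining the texts with newlines is an assumption, so the exact value will differ from the number above.

```python
import gzip


def gzip_compression_ratio(texts: list[str]) -> float:
    """Original size divided by gzip-compressed size of the joined texts."""
    original = "\n".join(texts).encode("utf-8")  # assumption: newline-joined
    compressed = gzip.compress(original)
    return len(original) / len(compressed)


repetitive = ["Call an Uber"] + ["Play the music"] * 100
varied = ["Call an Uber", "Play the music", "Book a table for two at seven"]

print("repetitive:", gzip_compression_ratio(repetitive))
print("varied:", gzip_compression_ratio(varied))
```

For very short inputs, the fixed gzip header and trailer can make the compressed output larger than the original, so the ratio can fall below 1 and its reciprocal can exceed 1.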

Semantic Diversity Metrics

These metrics capture the diversity in terms of meaning and rely on embeddings. They handle cases where the texts share similar meaning but have zero n-gram overlap.

For example, “Play the music” and “Start a song” have zero word overlap and thus would be incorrectly assigned 100% diversity by lexical metrics. However, they are repetitive in meaning and thus should have been assigned a lower diversity score. Semantic diversity metrics can tackle this.

Embedding Diversity

Tevet and Berant (2020) proposed this metric, which considers diversity as the dissimilarity of text embeddings.

The metric calculates sentence embeddings for all generated texts using an encoder (e.g. sentence-transformers).

Then, we calculate the cosine similarity between all the unique pairs and take the average to get a similarity score.

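To convert the similarity into diversity, we can either take the negation of the average cosine similarity (Tevet and Berant, 2020) or take the cosine distance, i.e. 1 minus the average cosine similarity (Young et al. (2024); Hayati et al. (2024)). The ranges in the table below assume that the average cosine similarity is non-negative.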

Approach	Mean Cosine Similarity	Diversity	Range
Young et al. (2024) / Hayati et al. (2024)	0.39	1 - 0.39 = 0.61	0 to 1
Tevet and Berant (2020)	0.39	-0.39	-1 to 0
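The calculation can be sketched in plain Python. The vectors below are toy 2-d stand-ins for real sentence embeddings, and `embedding_diversity` is a hypothetical name; the function implements the cosine distance variant (1 minus the mean pairwise similarity).

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_diversity(embeddings: list[list[float]]) -> float:
    """1 - mean cosine similarity over all unique pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    mean_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1 - mean_similarity


# Toy vectors: the first and third are identical, the second is orthogonal
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(embedding_diversity(toy_embeddings))
```

Of the three unique pairs, one has similarity 1 and two have similarity 0, so the mean similarity is 1/3 and the diversity is 2/3.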

DCScore

This metric was proposed in a paper currently under review for ICLR 2025 (Anonymous, 2024).

The metric, similar to embedding diversity, also starts by calculating the pairwise similarity between all the text embeddings but has a unique take on formulating the diversity.

To understand the intuition, let’s look at the first row of the pairwise similarity matrix. Here, the numbers 1.0, 0.75, and 0.2 mean that the text is 100% similar to itself, 75% similar to some other text, and 20% similar to the final text. Hypothetically, we would have wanted the text to only be 100% similar to itself and 0% similar to everything else for maximum diversity.

Thus, we want some relative measure of similarity of the text to itself in comparison to others. The authors use softmax for this. Softmax converts the cosine similarities into relative probabilities of the text belonging to itself and others. When we apply softmax, we see that the first text only belongs 45% to itself. Thus, the softmax probability of the text belonging to itself can be a measure of diversity.

To calculate the diversity of the dataset overall, we simply take the mean of the diagonal of the pairwise matrix after applying softmax. Thus, we get a diversity score of 0.47 in the example above.
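The 45% figure for the first row can be checked by hand. The sketch below applies a plain-Python softmax to the row of similarities 1.0, 0.75, and 0.2 from the example.

```python
import math


def softmax(values: list[float]) -> list[float]:
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the similarity matrix: similarity to itself and two others
first_row = [1.0, 0.75, 0.2]
probs = softmax(first_row)

# The diagonal entry is the probability mass the text keeps for itself
print(round(probs[0], 2))  # 0.45
```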

The implementation is simple and fits in a few lines of code. We can swap the embedding model as needed.

pip install sentence-transformers scipy numpy

import numpy as np
from scipy.special import softmax
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def dcscore(texts: list[str]) -> float:
    text_embeddings = model.encode(texts, normalize_embeddings=True)
    pairwise_matrix = text_embeddings @ text_embeddings.T
    softmax_matrix = softmax(pairwise_matrix, axis=1)
    score = np.mean(np.diag(softmax_matrix))
    return score

score = dcscore(['Play the music', 'Start the music', 'Call an Uber'])
print(score)

1: Load the MiniLM Sentence-BERT model
2: Generate embeddings for the sentences
3: Calculate pairwise cosine similarity
4: Apply softmax on the row level for each text
5: Take the mean of the scores in the diagonal

0.47264108

Cluster Inertia

Du and Black (2019) proposed this metric, reusing the inertia metric used to compute the quality of clustering as the diversity.

The metric clusters embeddings of the LLM-generated texts into 10 clusters and measures the inertia. Inertia is the sum of the squared distances between each point and the centroid of its assigned cluster.

We can treat the inertia as a proxy for diversity because if the texts are diverse, they would be far apart from the centroid and thus the squared distance from the cluster centroid will be larger.

In code, this can be accomplished as shown below:

import numpy as np
from sklearn.cluster import KMeans

# Text embeddings for 1024 synthetic texts
text_embeddings = np.random.rand(1024, 768)

# Run clustering
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(text_embeddings)

# Get the inertia
kmeans.inertia_

64556.00644871439
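To make the definition concrete, inertia can be computed by hand for a tiny example. This is an illustrative sketch rather than how scikit-learn computes it internally; the points and cluster labels are made up, and `inertia` is a hypothetical helper.

```python
def inertia(points: list[list[float]], labels: list[int]) -> float:
    """Sum of squared distances from each point to its cluster centroid."""
    clusters: dict[int, list[list[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid is the per-dimension mean of the cluster members
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
labels = [0, 0, 1]  # hypothetical cluster assignments
print(inertia(points, labels))  # 2.0
```

The first cluster has its centroid at (1, 0), so each of its two points contributes a squared distance of 1, while the single-point cluster contributes 0.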

Syntactic Diversity Metrics

These metrics capture diversity in terms of the underlying grammatical structure.

Compression Ratio - Part of Speech (CR-POS)

Shaib et al. (2024b) proposed this metric to detect the repetition of syntactic templates in LLM-generated texts.

It reuses the idea of Compression Ratio but applies it to syntactic representation instead of the raw text. This works by applying a part-of-speech tagger to the text to get the POS tag for each token.

We apply a POS tagger to all the synthetically generated texts and get their syntactic representation as strings.

Then, the process is the same as the regular compression ratio. We concatenate the POS-tagged strings of all the texts together, compress the text using gzip, and then compare the ratio of the original file size with the compressed file size.

If the compression ratio is high, it indicates a large repetition of syntactic templates in the generated texts. Thus, the diversity will be low.

We can compute diversity directly by taking the reciprocal of the compression ratio and get a score between 0 and 1.
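The pipeline can be sketched with the standard library once POS tags are available. In the block below the tag sequences are written by hand or generated randomly as stand-ins; a real pipeline would obtain them from a POS tagger. `cr_pos` is a hypothetical name and joining the sequences with newlines is an assumption.

```python
import gzip
import random


def cr_pos(tag_sequences: list[str]) -> float:
    """Compression ratio of the newline-joined POS-tag strings."""
    original = "\n".join(tag_sequences).encode("utf-8")
    return len(original) / len(gzip.compress(original))


# Hand-written tags for 50 texts that all follow the same syntactic template
templated = ["VERB DET NOUN"] * 50

# Random tag sequences as a stand-in for syntactically varied texts
rng = random.Random(0)
tags = ["VERB", "DET", "NOUN", "ADJ", "ADV", "PRON", "ADP", "PROPN"]
varied = [" ".join(rng.choices(tags, k=rng.randint(3, 6))) for _ in range(50)]

print("templated:", cr_pos(templated))
print("varied:", cr_pos(varied))
```

The templated set compresses far better than the varied one, which corresponds to a lower syntactic diversity.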

Conclusion

Thus, in this post, we learned about three different linguistic diversity metrics - lexical, semantic, and syntactic.

We have skipped a category of diversity metrics called homogenization scores above as those can be computationally expensive for practical use cases. These work by applying evaluation metrics from machine translation/summarization such as BLEU, ROUGE, etc. on each text treating all other texts as the reference text (Zhu et al. (2018)).

For a deeper dive into diversity metrics, you can read Shaib et al. (2024a) for a comparative analysis of these metrics on various datasets and Guo et al. (2024) for an application of the metrics to evaluate popular LLMs.

References

Anonymous. 2024. Evaluating diversity of LLM-generated datasets: A classification perspective. In Submitted to the thirteenth international conference on learning representations. under review.

Thomas van Dongen and Stephan Tulkens. 2025. SemHash: Fast semantic text deduplication.

Wenchao Du and Alan W Black. 2019. Boosting dialog response generation. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th annual meeting of the association for computational linguistics, pages 38–43, Florence, Italy. Association for Computational Linguistics.

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. Scaling synthetic data creation with 1,000,000,000 personas.

Yanzhu Guo, Guokan Shang, and Chloé Clavel. 2024. Benchmarking linguistic diversity of large language models.

Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, and Dongyeop Kang. 2024. How far can we extract diverse perspectives from large language models? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 conference on empirical methods in natural language processing, pages 5336–5366, Miami, Florida, USA. Association for Computational Linguistics.

Daphne Ippolito, Reno Kriz, João Sedoc, Maria Kustikova, and Chris Callison-Burch. 2019. Comparison of diverse decoding methods from conditional language models. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th annual meeting of the association for computational linguistics, pages 3752–3762, Florence, Italy. Association for Computational Linguistics.

Glorianna Jagfeld, Sabrina Jenne, and Ngoc Thang Vu. 2018. Sequence-to-sequence models for data-to-text natural language generation: Word- vs. Character-based processing and output diversity. In Emiel Krahmer, Albert Gatt, and Martijn Goudbeek, editors, Proceedings of the 11th international conference on natural language generation, pages 221–232, Tilburg University, The Netherlands. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models.

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and M. Lewis. 2022. Contrastive decoding: Open-ended text generation as optimization. Annual Meeting of the Association for Computational Linguistics.

Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023. Locally typical sampling.

Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery.

Shereen Oraby, Lena Reed, Shubhangi Tandon, S. SharathT., S. Lukin, and M. Walker. 2018. Controlling personality-based stylistic variation with neural natural language generators. SIGDIAL Conference.

Vishakh Padmakumar and He He. 2023. Does writing with language models reduce content diversity? International Conference on Learning Representations.

Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, and Ani Nenkova. 2024a. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores.

Chantal Shaib, Yanai Elazar, Junyi Jessy Li, and Byron C. Wallace. 2024b. Detection and measurement of syntactic templates in generated text. Conference on Empirical Methods in Natural Language Processing.

Guy Tevet and Jonathan Berant. 2020. Evaluating the evaluation of diversity in natural language generation. Conference of the European Chapter of the Association for Computational Linguistics.

Halley Young, Yimeng Zeng, Jacob Gardner, and Osbert Bastani. 2024. Improving structural diversity of blackbox LLMs via chain-of-specification prompting.

Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J. Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2023. Large language model as attributed training data generator: A tale of diversity and bias. Neural Information Processing Systems.

Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and W. Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. Neural Information Processing Systems.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Footnotes

Jason Liu has a great conceptual example of using synthetic data for RAG Evaluation. Nogueira and Lin (2019) is another classic paper.↩︎
OpenAI has an example walkthrough on generating synthetic transcripts for a daily standup meeting summarization use-case in their build hour on evals↩︎

Citation

BibTeX citation:

@online{chaudhary2025,
  author = {Chaudhary, Amit},
  title = {Evals for {Diversity} in {Synthetic} {Data}},
  date = {2025-02-09},
  url = {https://amitness.com/posts/diversity-evals/},
  langid = {en}
}

For attribution, please cite this work as:

Amit Chaudhary. 2025. Evals for Diversity in Synthetic Data.

Zero-Cost Custom Feeds on Bluesky

Sun, 01 Dec 2024 00:00:00 GMT

Background

I recently built a custom feed on Bluesky to capture the latest discussions on pre-prints from arxiv.org and research papers from conferences like ACL. It was inspired by a Bluesky post from a researcher requesting such a feed.

While there are drag-and-drop custom feed generators like Skyfeed, you are limited to using only regular expressions for the filtering part. If you use a regex pattern to capture all ‘arxiv.org’ links on Skyfeed, it will yield a bunch of false positives with papers from non-ML fields like Quantum Physics, Economics, and so on.

Though it’s possible to build and host the custom feed from scratch ourselves, since Bluesky’s protocol is open and provides programmatic access, it would be costly to run a server 24/7, especially if a large number of people subscribe to our custom feed.

As such, I thought of a nice alternate solution to circumvent this need to run a server by leveraging how the Bluesky protocol works with custom feeds. The bluesky app only makes GET requests to the server to fetch a JSON of a list of post IDs. So, we could in theory make use of a static site to host the endpoints with the data that matches what they expect and not run a backend server via Flask / FastAPI.

I implemented this idea and it works perfectly. We can offload to Skyfeed for initial filtering, use GitHub Actions for periodic feed generation, filtering, and ranking, and then host the JSONs on a static site using Cloudflare Pages. This removes the need to run a backend server at all and you can launch a custom feed 100% free.

The feed can be easily added to your homepage here: https://bsky.app/profile/amitness.com/feed/arxiv-feed

High-level Overview

We first use Skyfeed to filter the entire network of posts on Bluesky using a regular expression for posts with links for arxiv.org papers.

Then, the resulting feed is filtered using Bluesky’s atproto library through Python. Here, we iterate through each paper and check if the paper belongs to the arxiv categories for Machine Learning, NLP, and Computer Vision via the pyarxiv library. From the filtered list of papers, we generate the JSON data format required by Bluesky for reading feeds and push that to Cloudflare pages as a static site.

When the feed is loaded on the Bluesky app, the app will make a request to our static page on Cloudflare and get a list of the post IDs as a JSON response. The app will parse each post ID, render it in the app, and display the feed. This runs super quick.

Implementation

1. Clone the code locally

The code for the concept described above has been implemented at https://github.com/amitness/bluesky-arxiv.

First, make a fork of my repo from https://github.com/amitness/bluesky-arxiv and then clone your repo locally.

# Replace with the link to your repo
git clone git@github.com:amitness/bluesky-arxiv.git

Install the required libraries via the requirements.txt file in your virtual environment.

pip install -r requirements.txt

2. Setup Cloudflare pages

We will need a Cloudflare page to host the data in the format needed by Bluesky.

You can create an account on Cloudflare pages. Once the account is created, go to Workers and Pages > Overview from the left sidebar on the dashboard.

You should see two tabs: Workers and Pages. Click the Pages tab.

Then, click the “Upload Assets” button.

Then enter a name for the project. Cloudflare will provide you a unique domain based on it. Click Create Project.

You will be shown a page that allows you to upload a zip file or a folder. At this stage, just upload any folder from your device with at least one file in it. Once you’re done, click Deploy site.

Once the site is deployed, you should see a message below with the URL of your domain.

In the repo that you cloned locally, change the SERVICE_DOMAIN variable in the config.py file to the domain you got above from Cloudflare.

config.py

# Domain provided by Cloudflare pages
SERVICE_DOMAIN = "bluesky-1tj.pages.dev"

3. Initialize a custom feed on Bluesky

Now, we will initialize a custom feed programmatically on Bluesky.

In the repo, you will find a config.py file. You have to change a few configurations inside it.

First, change the HANDLE to your bluesky handle.

config.py

# YOUR bluesky handle
# Ex: user.bsky.social
HANDLE: str = "amitness.com"

Then you need to generate an app password for Bluesky. It’s available at https://bsky.app/settings/app-passwords and will allow us to get programmatic access to Bluesky in Python.

You can set a name to denote what the password is going to be used for. Here I set it to custom-feed.

Then you will receive your app password. Take note of it in a safe place as you won’t be able to access it again.

Now you can set the BLUESKY_APP_PASSWORD environment variable to your password.

export BLUESKY_APP_PASSWORD=...

This will be read by the setup_feed.py script.

config.py

# YOUR bluesky password, or preferably an App Password (found in your client settings)
# Ex: abcd-1234-efgh-5678
PASSWORD = os.environ["BLUESKY_APP_PASSWORD"]

Next, you can modify the name of your custom feed, a description and the slug. Here is what I have set.

config.py

# A short name for the record that will show in urls
# Lowercase with no spaces.
# Ex: whats-hot
RECORD_NAME: str = "arxiv-feed"

# A display name for your feed
# Ex: What's Hot
DISPLAY_NAME: str = "Papers"

# (Optional) A description of your feed
# Ex: Top trending content from the whole network
DESCRIPTION: str = dedent(
    """
 Latest ML research papers and preprints from arxiv.org discussed on Bluesky.
    
 Logic:
 - Fetch arxiv preprints & filters out non-ML via arxiv API
 - Ranks the items using hackernews algorithm
 """
).strip()

This is how it will be rendered in the Bluesky app later on.

Once everything above is set up, you can run the script.

python setup_feed.py

This will initialize our custom feed on Bluesky. If everything was set up correctly, you will get an output for the value of FEED_URI.

Update the FEED_URI in the config.py file with this value.

config.py

# Feed URI generated by running `python setup_feed.py`
FEED_URI = "at://did:plc:bpuq5cgmyvssgi3iwsyvd4gn/app.bsky.feed.generator/arxiv-feed"

Your feed has been created and now it needs to be populated before you can start using it in the app.

4. Setup Skyfeed

In this step, we will build an initial feed using the interface of the Skyfeed app.

You can sign up on skyfeed.app using your Bluesky handle and the app password you created in the previous step.

After logging in, go to the top-right and click Create Feed to create a new feed.

You will see a bunch of options. Since our goal is to pick out all the posts on Bluesky in the past 24 hours that mention arxiv.org or aclanthology.org, we can set up the options as follows.

First, the Input field specifies how many posts to capture. We will specify the Entire Network and set the time to 24 hours because we want to run a regex over all posts on Bluesky indexed in the past 24 hours. Depending on your use case, you can modify this part.

As seen below, it yields 6 million posts in the past 24 hours.

Now, we will filter those 6 million posts to only get items that mention either the arxiv.org or the aclanthology.org links. This can be achieved with the below regex and can be pasted in the RegEx field. Make sure the Post Text and Link items are green as we want to search only in the post text and links.

(arxiv.org/.+)|(aclanthology.org/.+)
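To get a feel for what this pattern accepts, here is a small sketch using Python's re module. It assumes Skyfeed's regex engine treats `.` the way Python does (as "any character"), which I have not verified. The sample posts and the stricter variant with escaped dots are my own additions, not part of the original setup.

```python
import re

# Pattern as pasted into Skyfeed: an unescaped "." matches any character.
loose = re.compile(r"(arxiv.org/.+)|(aclanthology.org/.+)")
# Hypothetical stricter variant with the dots escaped.
strict = re.compile(r"(arxiv\.org/.+)|(aclanthology\.org/.+)")

# Made-up sample posts for illustration.
posts = [
    "New paper: https://arxiv.org/abs/2401.00001",
    "ACL paper: https://aclanthology.org/2024.acl-long.1/",
    "Not a paper: https://arxivXorg/abs/123",
    "No link here",
]

for post in posts:
    print(bool(loose.search(post)), bool(strict.search(post)), post)
```

Both patterns accept the first two posts and reject the last one. Only the loose pattern accepts the third post, which is harmless here because the Python step filters the feed again later.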

Here is how it should look after everything is set up correctly.

With this setup, we can now publish the feed by clicking the Update Feed button and then clicking Publish in the popup. This creates a feed that can now be accessed via Bluesky.

You should see the link to your published skyfeed as shown below.

Copy the portion as shown above to the SKYFEED_DID variable in config.py. We will be further filtering this feed now using Python in the next steps.

config.py

# Skyfeed path
SKYFEED_DID = "did:plc:bpuq5cgmyvssgi3iwsyvd4gn/feed/aaagg56kp5qzi"

5. Feed Generation in Python

With the above steps done, we can build out the feed generation logic. The crux of the logic lives in the generate_feed.py file. Let’s understand how it works:

1. Cloudflare Page Generation

The entire thing is defined in the main function.

generate_feed.py

def main():
    did_data = {
        "@context": ["https://www.w3.org/ns/did/v1"],
        "id": f"did:web:{config.SERVICE_DOMAIN}",
        "service": [
            {
                "id": "#bsky_fg",
                "type": "BskyFeedGenerator",
                "serviceEndpoint": f"https://{config.SERVICE_DOMAIN}",
            }
        ],
    }
    write_json(did_data, "./_site/.well-known/did.json")

    feed_generator_data = {
        "encoding": "application/json",
        "body": {"did": config.SERVICE_DID, "feeds": [{"uri": config.FEED_URI}]},
    }

    write_json(feed_generator_data, "./_site/xrpc/app.bsky.feed.describeFeedGenerator")

This part of the code generates the metadata JSON files that Bluesky requests from our Cloudflare Pages site at the following paths: /.well-known/did.json and /xrpc/app.bsky.feed.describeFeedGenerator.
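The write_json helper is not shown in this post. A minimal sketch of what such a helper could look like is below; this is a hypothetical version and the one in the repo may differ. The important detail is creating the parent folders, since paths like .well-known/ and xrpc/ do not exist in a fresh output directory. The domain in the usage part is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def write_json(data: dict, path: str) -> None:
    """Write `data` as JSON to `path`, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")


# Usage with a throwaway directory and a placeholder domain.
with tempfile.TemporaryDirectory() as site_dir:
    did_data = {"id": "did:web:example.pages.dev"}
    out_path = f"{site_dir}/.well-known/did.json"
    write_json(did_data, out_path)
    print(json.loads(Path(out_path).read_text(encoding="utf-8")))
```

Because the output files are plain JSON on disk, any static host can serve them without a backend.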

2. Filtering Posts

The main logic lies in the code below, which generates the data for the endpoint that contains all the post IDs that should be rendered in the feed.

generate_feed.py

# Fetch latest posts and prepare data in the format expected by Bluesky protocol
post_uris = fetch_latest_posts()

feed_skeleton = {"feed": [{"post": uri} for uri in post_uris]}
write_json(feed_skeleton, "./_site/xrpc/app.bsky.feed.getFeedSkeleton")

It generates the endpoint that will return the post IDs that should be rendered in our custom feed. (https://bluesky-1tj.pages.dev/xrpc/app.bsky.feed.getFeedSkeleton)

The main logic for the feed filtering is defined in the fetch_latest_posts() function in the generate_feed.py file.

generate_feed.py

def fetch_latest_posts():
    client = Client()
    client.login(config.HANDLE, config.PASSWORD)

    data = client.app.bsky.feed.get_feed(
        {
            "feed": config.SKYFEED_PATH,  # (1)
            "limit": 100,
        },
        timeout=100,
    )

    feed = data.feed
    for _ in range(2):
        data = client.app.bsky.feed.get_feed(
            {"feed": config.SKYFEED_PATH, "limit": 100, "cursor": data.cursor},  # (2)
            timeout=200,
        )
        feed.extend(data.feed)

    bool_filter = thread_map(filter_item, feed)  # (3)
    filtered_feed = compress(feed, bool_filter)
    sorted_feed = rank_posts(filtered_feed)  # (4)
    post_uris = [item.post.uri for item in sorted_feed]
    return post_uris

1: We fetch the feed from the Skyfeed custom feed we generated in the earlier step
2: A cursor is used to paginate and fetch an additional 200 items from that feed
3: Then the items are filtered using the filter_item function that checks whether the links present in the item are indeed CS Arxiv papers. We make use of thread_map to parallelize the process.
4: We re-rank the filtered items in the feed to use the Hackernews algorithm
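The mask-then-compress pattern from step 3 can be tried in isolation. The sketch below uses only the standard library: ThreadPoolExecutor.map stands in for thread_map (without the progress bar), and the toy filter_item and the sample feed are assumptions for illustration rather than the real arXiv check.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import compress


def filter_item(item: str) -> bool:
    """Toy stand-in for the real check: keep items mentioning 'cs.'."""
    return "cs." in item


# Made-up feed items; the real feed holds post objects from atproto.
feed = ["paper A (cs.CL)", "paper B (quant-ph)", "paper C (cs.LG)"]

# Executor.map preserves input order, so the mask lines up with `feed`.
with ThreadPoolExecutor(max_workers=4) as executor:
    bool_filter = list(executor.map(filter_item, feed))

# compress keeps only the items whose mask entry is True.
filtered_feed = list(compress(feed, bool_filter))
print(bool_filter)    # [True, False, True]
print(filtered_feed)  # ['paper A (cs.CL)', 'paper C (cs.LG)']
```

Keeping the predicate separate from the selection means the slow part (one network lookup per item) can run in parallel while the result order stays unchanged.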

3. Re-ranking with hackernews score

The re-ranking of the posts is defined in the rank_posts function. I made use of the Hacker News ranking algorithm, which is quite simple. We compute the points for a post as the sum of its number of likes, quotes, replies and reposts. That score is then decayed by how many hours have passed since the post was created, so that older items gradually sink in the ranking. Posts that are 12 or more hours old get a score of 0. This balances popular against recent research papers.

generate_feed.py

def hackernews_score(item, gravity: float = 2.5):
    hours_passed = (
        datetime.now(timezone.utc) - parse_date(item.post.indexed_at)
    ).total_seconds() / 3600
    if hours_passed >= 12:
        return 0
    else:
        points = (
            item.post.like_count
            + item.post.quote_count
            + item.post.reply_count
            + item.post.repost_count
        )
        score = points / ((hours_passed + 2) ** (gravity))
        return score


def rank_posts(feed):
    return sorted(feed, key=hackernews_score, reverse=True)
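To see how strongly the decay works, here is a small worked sketch of the same formula applied to plain numbers instead of post objects. The function name decayed_score and the sample values are made up for illustration.

```python
def decayed_score(points: int, hours_passed: float, gravity: float = 2.5) -> float:
    """Same formula as above, taking plain numbers instead of a post object."""
    if hours_passed >= 12:
        return 0.0
    return points / ((hours_passed + 2) ** gravity)


# Hypothetical posts: a fresh one with few interactions, an older popular one,
# and a very popular one that has passed the 12-hour cutoff.
fresh = decayed_score(points=20, hours_passed=1)
older = decayed_score(points=100, hours_passed=10)
expired = decayed_score(points=500, hours_passed=12)

print(round(fresh, 3), round(older, 3), expired)
```

With a gravity of 2.5, the one-hour-old post with 20 points outranks the ten-hour-old post with 100 points, and anything at or past 12 hours drops to zero regardless of its points.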

6. Running periodically via GitHub Actions

To run our script periodically for free, we can leverage GitHub Actions. This will fetch the feed from Skyfeed, perform the filtering and re-ranking, and push the resulting data to Cloudflare Pages every 30 minutes.

The schedule for the cron job is defined in the build_and_deploy.yml file and can be modified there as needed.

.github/workflows/build_and_deploy.yml

name: Build and deploy site to cloudflare

on:
  push:
    branches:
      - main
  schedule:
    - cron: '*/30 * * * *'

Crontab.guru is a great website to visualize what the cron syntax does.

To enable the actions in your forked GitHub repo, go to “Settings > Secrets and Variables”, click “New Repository Secret”, and set these three variables one by one:

BLUESKY_APP_PASSWORD
CLOUDFLARE_ACCOUNT_ID
CLOUDFLARE_API_TOKEN

You can get your “CLOUDFLARE_ACCOUNT_ID” by logging in to Cloudflare Pages and then getting the value from the right sidebar as shown below.

To get the CLOUDFLARE_API_TOKEN, create a new token from https://dash.cloudflare.com/profile/api-tokens as shown below.

Once all three secret variables have been set up, you can enable GitHub actions in your forked repo as shown below.

The action should automatically run every 30 minutes now. As such, it will fetch the latest posts from skyfeed, perform the filtering and generate the final set of posts to be displayed on Bluesky and deploy that to Cloudflare.

7. Access your feed

Your feed will be listed on your profile now at bsky.app/feeds and can be pinned to the homepage as well.

You can find the link from the address bar when the feed is open and share it with others.

Conclusion

Thus, we saw how to build a custom feed on Bluesky with a combination of Skyfeed, GitHub Actions and Cloudflare Pages.

While we built it to get a feed of Arxiv papers, you can extend the same approach to do a bunch of useful stuff. You could integrate lightweight classifiers to classify/re-rank posts for relevance to your interests or even filter out toxic posts from your feed.

You can also skip Skyfeed as the initial source and instead read from the firehose or one of your existing feeds directly using atproto and handle the indexing via a small SQLite database or JSON committed directly to GitHub via the actions.

Parallel Processing with tqdm

Sun, 20 Oct 2024 00:00:00 GMT

tqdm is a popular library that’s widely used in many open-source Python ML libraries for displaying progress bars. As such, it’s often already installed as a dependency when working on machine learning projects.

shell

pip show tqdm

Required-by: datasets, dvc, evaluate, huggingface-hub, openai, sentence-transformers, spacy, transformers

For example, consider a task where we loop over a list of websites and need to fetch the status code for each.

Naive loop

import requests

def ping(url):
    return requests.head(url).status_code

urls = ['https://amitness.com']*10
statuses = [ping(url) for url in urls]

To get a progress bar, it’s as easy as wrapping the urls list with the tqdm class.

shell

pip install tqdm

import requests
from tqdm.auto import tqdm  # (1)

def ping(url):
    return requests.head(url).status_code

urls = ['https://amitness.com'] * 10
statuses = [ping(url) for url in tqdm(urls)]  # (2)

1: Import the tqdm object. Importing from tqdm.auto is preferred as it automatically selects the best progress bar (Jupyter-compatible or console-based).
2: Simply wrap the list of items and you get a progress bar.

While this use case of tqdm as a progress bar library is well known, there are three relatively undocumented features in tqdm to get progress bars while doing concurrent, parallel or asynchronous processing.

Running Concurrent Threads

You can execute a function on the list concurrently with multiple threads using the thread_map function. It takes the function to run as the first argument and a list of items as the second argument and returns the results.

import requests
from tqdm.contrib.concurrent import thread_map

def ping(url):
    return requests.head(url).status_code

urls = ['https://amitness.com']*500
statuses = thread_map(ping, urls, max_workers=4)  # (1)

1: The number of worker threads can be specified using the max_workers parameter.

This is useful to speed up IO-bound tasks such as fetching data by scraping a website, calling a remote third party API or querying a remote database.

Internally, thread_map leverages the ThreadPoolExecutor from concurrent.futures standard library.¹

Running parallel processes

For compute-bound tasks, tqdm provides a process_map function with a similar API to process the list in parallel using multiple child processes.

import requests
from tqdm.contrib.concurrent import process_map

def ping(url):
    return requests.head(url).status_code

urls = ['https://amitness.com'] * 500
statuses = process_map(ping, urls, max_workers=4)  # (1)

1: The number of processes can be specified using the max_workers parameter.

This is particularly useful when the task involves heavy computation such as generating sentence embeddings for a large dataset or running batch model inference on CPU.

Internally, process_map uses ProcessPoolExecutor from concurrent.futures standard library.²

Running Asynchronous Tasks

For asynchronous tasks, tqdm provides an asyncio-compatible progress bar using tqdm_asyncio. This allows you to run asynchronous functions with a progress bar.

We use the same example as before, but this time we will use httpx to make asynchronous HTTP requests instead of the synchronous requests library.

shell

pip install httpx

In the code, we only need to use tqdm_asyncio.gather instead of asyncio.gather to get a progress bar. Everything else is regular asyncio code.

import asyncio
import httpx
from tqdm.asyncio import tqdm_asyncio

async def ping(client, url):
    try:
        response = await client.head(url, timeout=10)
        return response.status_code
    except Exception as e:
        return f"Error: {e}"

async def main():
    urls = ['https://amitness.com'] * 500
    async with httpx.AsyncClient() as client:
        tasks = [ping(client, url) for url in urls]
        # tqdm_asyncio.gather instead of asyncio.gather
        statuses = await tqdm_asyncio.gather(*tasks)
    return statuses


if __name__ == "__main__":
    asyncio.run(main())

Conclusion

Thus, thread_map, process_map and tqdm_asyncio are useful tools to add to your toolbox when dealing with parallel processing. As tqdm is already pre-installed via other libraries you might use in ML, it’s a quick and easy way to add parallel processing to your program logic.

Footnotes

Source code for thread_map↩︎
Source code for process_map↩︎

A Visual Guide to Regular Expression

Wed, 21 Oct 2020 00:00:00 GMT

It’s a common task in NLP to either check a text against a pattern or extract the parts of the text that match a certain pattern. A regular expression or “regex” is a powerful tool to achieve this.

While powerful, regex can feel daunting as it comes with a lot of features and sub-parts that you need to remember.

In this post, I will illustrate the various concepts underlying regex. The goal is to help you build a good mental model of how a regex pattern works.

Mental Model

Let’s start with a simple example where we are trying to find the word ‘cool’ in the text.

With regex, we could simply type out the word ‘cool’ as the pattern and it will match the word.

'cool'

While regex matched our desired word ‘cool’, the way it operates is not at the word level but the character level. This is the key idea.

Key Idea: Regex works at the character-level, not word-level.

The implication of this is that the regex r'cool' would match the following sentences as well.
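As a quick check in Python (the sentences below are made up for illustration), the same pattern matches inside longer words because it only looks for the four characters in sequence:

```python
import re

# Made-up sentences: 'cool' appears on its own and inside longer words.
sentences = ["That was cool", "Add coolant to the engine", "How uncool of you"]

for sentence in sentences:
    print(re.findall(r"cool", sentence), sentence)
```

Each sentence yields a single match, `['cool']`, even when the characters sit inside "coolant" or "uncool".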

Basic Building Blocks

Now that we understand the key idea, let’s understand how we can match simple characters using regex.

a. Specific character

We can simply specify the character in the regular expression and it will match all instances in the text.

For example, the regular expression given below will match all instances of ‘a’ in the text. You can use any lowercase or uppercase letter.

'a'

You can also use any digits from 0 to 9 and it will work as well.

'3'

Note that regex is case-sensitive by default and thus the following regex won’t match anything.

'A'

b. White space character

We can detect special characters such as whitespace and newlines using special escape sequences.

Besides the common ones above, we have:

\r for carriage return
\f for form feed
\e for escape (supported by engines such as PCRE, but not by Python’s re module)

c. Special sequences

Regex provides a bunch of built-in special symbols that can match a group of characters at once. These begin with backslash \.

Pattern: `\d`

It matches any single digit from 0 to 9.

Notice that each match is a single digit. So, for the number 18.04, we get 4 separate matches instead of a single match.
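This is easy to verify in Python. The sample text is my own; any text containing 18.04 behaves the same way.

```python
import re

# Each digit is returned as its own match; the dot is not a digit.
print(re.findall(r"\d", "Ubuntu 18.04"))  # ['1', '8', '0', '4']
```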

Pattern: `\s`

It matches any whitespace character (space, tab or newline).

Pattern: `\w`

It matches any lowercase letter (a to z), uppercase letter (A to Z), digit (0 to 9), or underscore.
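A short sketch of both patterns in Python, using made-up sample strings:

```python
import re

text = "user_1 logged in\tat 9am\n"

# \s picks up the spaces, the tab and the trailing newline.
print(re.findall(r"\s", text))    # [' ', ' ', '\t', ' ', '\n']

# \w keeps letters, digits and the underscore, and skips punctuation.
print(re.findall(r"\w", "a_1!"))  # ['a', '_', '1']
```

One caveat specific to Python 3: for str patterns, \w also matches letters and digits from other scripts unless the re.ASCII flag is passed.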

Pattern: `.`

It matches any character except the newline (\n).

import re

>>> re.findall(r'.', 'line 1\nline2')
['l', 'i', 'n', 'e', ' ', '1', 'l', 'i', 'n', 'e', '2']

Pattern: Negations

If you use the capitalized versions of the patterns above, they act as negation.

For example, if \d matches any digit from 0 to 9, then \D matches anything except the digits 0 to 9.
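The lowercase and uppercase versions can be compared side by side in Python. The sample text is made up for illustration.

```python
import re

text = "Room 42B"

print(re.findall(r"\d", text))  # ['4', '2']
print(re.findall(r"\D", text))  # ['R', 'o', 'o', 'm', ' ', 'B']
print(re.findall(r"\S", text))  # ['R', 'o', 'o', 'm', '4', '2', 'B']
print(re.findall(r"\W", text))  # [' ']
```

Every character lands in exactly one of \d or \D, and the same holds for the \s/\S and \w/\W pairs.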

d. Character sets

These are patterns that start with [ and end with ]. The brackets enclose the set of characters that should be matched.

For example, the pattern [aeiou] matches any of the characters ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’.

You can also replicate the functionality of \d using the pattern [0123456789]. It will match any digit from 0 to 9.

Instead of specifying all the digits, we can use - to specify only the start and end digits. So, instead of [0123456789], we can write [0-9].

For example, [2-4] can be used to match any digit from 2 to 4, i.e. 2, 3, or 4.

You can even use the special characters we learned previously inside the brackets. For example, you can match any digit from 0 to 9 or whitespace as:
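The character-set patterns described in this section can be tried out in Python. The sample strings are my own, and [\d\s] is one possible way to write "a digit or whitespace" ([0-9\s] would work equally well).

```python
import re

print(re.findall(r"[aeiou]", "regex"))       # ['e', 'e']
print(re.findall(r"[0123456789]", "a1b22"))  # ['1', '2', '2']
print(re.findall(r"[0-9]", "a1b22"))         # ['1', '2', '2']
print(re.findall(r"[2-4]", "12345"))         # ['2', '3', '4']
print(re.findall(r"[\d\s]", "a 1"))          # [' ', '1']
```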

Below, I have listed some useful common patterns and what they mean.

e. Anchors

Regex also has special handlers to make the pattern only match if it’s at the start or end of the string.

We can use the ^ anchor to match patterns only at the start of a line. For example:

Similarly, we can use the $ anchor after the character to match patterns only if it’s the end of the line. For example:
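Here is a small sketch of both anchors in Python with made-up sample strings. One detail worth knowing: by default, Python's ^ only matches at the very start of the string, and you need the re.MULTILINE flag for it to match at the start of every line.

```python
import re

# Only the leading 'a' matches, not the ones inside or between words.
print(re.findall(r"^a", "an apple a day"))  # ['a']

# Only the final 'a' matches.
print(re.findall(r"a$", "a banana"))        # ['a']

two_lines = "line one\na line two"
print(re.findall(r"^a", two_lines))                # []
print(re.findall(r"^a", two_lines, re.MULTILINE))  # ['a']
```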

f. Escaping metacharacters

Consider a case where we want to exactly match the word “Mr. Stark”.

If we write a regex like Mr. Stark, it will have an unintended effect, since the dot has a special meaning in a regex: it matches any character, so the pattern would also match text such as “Mrs Stark”.

So, we should always escape the special metacharacters like ., $ etc. if our goal is to match the exact character itself.

Here is the list of metacharacters that you should remember to escape if you’re using them directly.

^ $ . * + ? { } [ ] \ | ( )
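The difference is easy to see in Python. The sample sentence is made up; re.escape is a standard-library helper that escapes metacharacters for you when the text to match comes from a variable.

```python
import re

text = "Mr. Stark and Mrs Stark"

# Unescaped dot: matches any character, so 'Mrs Stark' slips through.
print(re.findall(r"Mr. Stark", text))   # ['Mr. Stark', 'Mrs Stark']

# Escaped dot: matches only a literal '.'.
print(re.findall(r"Mr\. Stark", text))  # ['Mr. Stark']

# Build the escaped pattern automatically from a plain string.
exact = re.escape("Mr. Stark")
print(re.findall(exact, text))          # ['Mr. Stark']
```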

Repetition of basic blocks

Now that we can pattern match any characters, we could repeat things and start building more complicated patterns.

a. Naive repetition

Using only what we have learned so far, a naive way would be to just repeat the pattern. For example, we can match two-digit numbers by just repeating the character-level pattern.

\d\d

b. Quantifiers

Regex provides special quantifiers to specify different types of repetition for the character preceding it.

i. Fixed repetition

We can use the {...} quantifier to specify the number of times a pattern should repeat.

For example, the previous pattern for matching a 2-digit number can be recreated as \d{2}.

You can also specify a range of repetitions using the same quantifier. For example, to match numbers with 2 to 4 digits, we could use the pattern \d{2,4}.

When applied to a sentence, it will match both 4-digit and 2-digit numbers.

Note:

There should not be any space between the minimum and maximum count. For example, \d{2, 4} doesn’t work.
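A quick sketch of both forms in Python, using a made-up sentence:

```python
import re

text = "In 2020 I turned 25"

# Exactly two digits: the 4-digit year is split into two matches.
print(re.findall(r"\d{2}", text))    # ['20', '20', '25']

# Two to four digits: the quantifier is greedy and takes as many as it can.
print(re.findall(r"\d{2,4}", text))  # ['2020', '25']
```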

ii. Flexible quantifiers

Regex also provides the quantifiers *, + and ?, which let you specify flexible repetition of a character.

0 or 1 times: ?
The ? quantifier matches the previous character if it repeats 0 or 1 times. This can be useful to make certain parts optional. It is equivalent to {0,1}.

For example, let’s say we want to match both the word “sound” and “sounds” where “s” is optional. Then, we can use the ? quantifier that matches if a character repeats 0 or 1 times.

1 or more times: +
The + quantifier matches the previous character if it repeats 1 or more times. It is equivalent to {1,}.

For example, we could find numbers of any arbitrary length using the regex \d+.

0 or more times: *
The * quantifier matches the previous character if it repeats 0 or more times. It is equivalent to {0,}.
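The three quantifiers can be compared in Python. The sample strings and the ab* pattern are my own illustrations.

```python
import re

# ? makes the trailing 's' optional.
print(re.findall(r"sounds?", "one sound, two sounds"))  # ['sound', 'sounds']

# + needs at least one digit and takes as many as follow.
print(re.findall(r"\d+", "7 apples and 1234 pears"))    # ['7', '1234']

# * allows zero 'b' characters, so a lone 'a' also matches.
print(re.findall(r"ab*", "a ab abbb"))                  # ['a', 'ab', 'abbb']
```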

Usage in Python

Python provides a module called “re” in the standard library to work with regular expressions.

Need for raw strings

To specify a regular expression in Python, we precede it with r to create raw strings.

pattern = r'\d'

To understand why we precede with r, let’s try printing the expression \t without r.

pattern = '\t'
print(pattern)

You can see that without a raw string, Python treats \t as the escape sequence for a tab, so only whitespace is printed.

Now let’s convert it into a raw string. We get back exactly what we specified.

pattern = r'\t'
print(pattern)
\t

Using re module

To use the re module, we start by importing it:

import re

1. re.findall

This function allows us to get all the matches as a list of strings.

import re
re.findall(r'\d', '123456')

['1', '2', '3', '4', '5', '6']

2. re.match

This function searches for a pattern at the beginning of the string and returns the first occurrence as a match object. If the pattern is not found, it returns None.

import re

match = re.match(r'batman', 'batman is cool')
print(match)

<re.Match object; span=(0, 6), match='batman'>

With the match object, we can get the matched text as

print(match.group())

batman

In a case where our pattern is not at the start of the sentence, we will not get any match.

import re

match = re.match(r'batman', 'The batman is cool')
print(match)

None

3. re.search

This function also finds the first occurrence of a pattern but the pattern can occur anywhere in the text. If the pattern is not found, it returns None.

import re

match = re.search(r'batman', 'the batman is cool')
print(match.group())

batman

References

A.M. Kuchling, “Regular Expression HOWTO - Python 3.9.0 documentation”

Knowledge Transfer in Self Supervised Learning

Sun, 04 Oct 2020 00:00:00 GMT

Self Supervised Learning is an interesting research area where the goal is to learn rich representations from unlabeled data without any human annotation.

This can be achieved by creatively formulating a problem such that you use parts of the data itself as labels and try to predict that. Such formulations are called pretext tasks.

For example, you can set up a pretext task to predict the color version of an image given its grayscale version. Similarly, you could remove a part of the image and train a model to predict that part from its surroundings. There are many such pretext tasks.

By pre-training on the pretext task, the hope is that the model will learn useful representations. Then, we can fine-tune the model on downstream tasks such as image classification, object detection, and semantic segmentation with only a small set of labeled training data.

Challenge of evaluating representations

So pretext tasks can help us learn representations. But, this poses a question:

How to determine how good a learned representation is?

Currently, the standard way to gauge the representations is to evaluate them on a set of standard tasks and benchmark datasets.

Linear classification: ImageNet classification using frozen features
Low Data Regime: ImageNet Classification using only 1% to 10% of data
Transfer Learning: Object Classification, Object Detection and Semantic Segmentation on PASCAL VOC

We can see that the above evaluation methods require us to use the same model architecture for both the pretext task and the target task.

This poses some interesting challenges:

For the pretext task, our goal is to learn on a large-scale unlabeled dataset and thus deeper models (e.g. ResNet) would help us learn better representations. But, for downstream tasks, we would prefer shallow models (e.g. AlexNet) for actual applications. Thus, we currently have to consider this limitation when designing the pretext task.
It’s harder to fairly compare which pretext task is better if some methods used a simpler architecture while other methods used a deeper architecture.
We can’t compare the representations learned from pretext tasks to handcrafted features such as HOG.
We may want to exploit several data domains such as sound, text, and videos in the pretext task but the target task may limit our design choices.
A model trained on a pretext task may learn extra knowledge that is not useful for generic visual recognition. Currently, the final task-specific layers are ignored and weights or features only up to certain convolutional layers are taken.

Knowledge Transfer

Noroozi et al. proposed a simple idea to tackle these issues in their 2018 paper “Boosting Self-Supervised Learning via Knowledge Transfer”.

Intuition

The authors observed that in a good representation space, semantically similar data points should be close together.

In regular supervised classification, the information that images are semantically similar is encoded through labels annotated by humans. A model trained on such labels would have a representation space that groups semantically similar images.

Thus, with pre-text tasks in self-supervised learning, the objective is implicitly learning a metric that makes the same category images similar and different category images dissimilar. Hence we can provide a robust estimate of the learned representation if we could encode semantically related images to the same labels in some way.

General Framework

The authors propose a novel framework to transfer knowledge from a deep self-supervised model to a separate shallow downstream model. You can use different model architectures for the pretext task and downstream task.

Key Idea:

Cluster features from the pretext task and assign cluster centers as pseudo-labels for unlabeled images. Then, train a small network with the target task architecture to predict these pseudo-labels and learn a new representation.

The end-to-end process is described below:

1. Pretext task

Here we choose some deep network architecture and train it on some pretext task of our choice on some dataset. We can take features from some intermediate layer after the model is trained.

Figure: Training on Pre-text Task (Noroozi et al., 2018)

2. K-means Clustering

For all the unlabeled images in the dataset, we compute the feature vectors from the pretext task model. Then, we run K-means clustering to group semantically similar images. The idea is that the cluster centers will be aligned with categories in ImageNet.

Figure: Clustering Features (Noroozi et al., 2018)

In the paper, the authors ran K-means on a single Titan X GPU for 4 hours to cluster 1.3M images into 2000 categories.

3. Pseudo-labeling

The cluster centers are treated as the pseudo-labels. We can use either the same dataset as in the above step or a different dataset altogether. Then, we compute the feature vectors for those images and find the closest cluster center for each image. This cluster center is used as the pseudo-label.

Figure: Generating Pseudo-labels (Noroozi et al., 2018)
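
Steps 2 and 3 can be sketched in plain Python as shown below. This is only a toy illustration: the 2-D points stand in for feature vectors from the pretext model, the function names are made up, and the farthest-point initialization is a choice made here to keep the example deterministic, not something taken from the paper.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the closest cluster center; this index is the pseudo-label."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def init_centers(features, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [list(features[0])]
    while len(centers) < k:
        farthest = max(features,
                       key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(features, k, iterations=20):
    """Plain Lloyd's k-means; returns the list of cluster centers."""
    centers = init_centers(features, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in features:
            clusters[nearest_center(point, centers)].append(point)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers

# Toy 2-D "features" standing in for pretext-model activations
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

centers = kmeans(features, k=2)                                # step 2: clustering
pseudo_labels = [nearest_center(f, centers) for f in features] # step 3: pseudo-labels
print(pseudo_labels)
```

In practice, you would likely use an optimized K-means implementation, since the paper clusters 1.3M images.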

4. Training on Pseudo-labels

We take the model architecture that will be used for downstream tasks and train it to classify the unlabeled images into the pseudo-labels. Thus, the target architecture will learn a new representation such that it will map images that were originally close in the pre-trained feature space to close points.

Figure: Re-training on pseudo-labels (Noroozi et al., 2018)

Advantage of Knowledge Transfer

We saw how by clustering the features and then using pseudo-labels, we can bring the knowledge from any pretext task representations into a common reference model like AlexNet.

As such, we can now easily compare different pretext tasks even if they are trained using different architectures and on different data domains. This also allows us to improve self-supervised methods by using deep models and challenging pretext tasks.

How well does this framework work?

To evaluate the idea quantitatively, the authors set up an experiment as described below:

a. Increase complexity of pretext task (Jigsaw++)

To evaluate their method, the authors took an older puzzle-like pretext task called “Jigsaw” where we need to predict the permutation that was used to randomly shuffle a 3×3 grid of image tiles.

Image adapted from Noroozi et al. (2018)

They extended the task by randomly replacing 0 to 2 tiles with tiles from another random image at random locations. This increases the difficulty as now we need to solve the problem using only the remaining patches. The new pretext task is called “Jigsaw++”.

Image adapted from Noroozi et al. (2018)

In the paper, they use 701 total permutations with a minimum Hamming distance of 3 between them. They apply mean and standard deviation normalization to each image tile independently. They also make images grayscale 70% of the time to prevent the network from cheating with low-level statistics.
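
To make the Hamming distance constraint concrete, here is a rough sketch that greedily samples tile permutations and keeps one only if it differs from every previously kept permutation in at least 3 positions. The greedy random sampling, the function names, and the small count of 20 are assumptions for illustration; the paper may select its 701 permutations differently.

```python
import random

def hamming(p, q):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def sample_permutations(count, tiles=9, min_distance=3, seed=0, max_tries=100_000):
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == count:
            break
        candidate = tuple(rng.sample(range(tiles), tiles))
        # Keep the candidate only if it is far enough from all kept permutations
        if all(hamming(candidate, p) >= min_distance for p in chosen):
            chosen.append(candidate)
    return chosen

permutations = sample_permutations(count=20)
print(len(permutations), permutations[0])
```

The index of a permutation in this list is what the network would be asked to predict for a shuffled grid.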

b. Use a deeper network to solve pretext task

The authors used VGG-16 to solve the pretext task and learn representations. As VGG-16 has increased capacity, it can better handle the increased complexity of the “Jigsaw++” task and thus extract better representation.

c. Transfer Knowledge back to AlexNet

The representations from VGG-16 are clustered and cluster centers are converted to pseudo-labels. Then, AlexNet is trained to classify the pseudo-labels.

d. Finetune AlexNet on Evaluation datasets

For downstream tasks, the convolutional layers for the AlexNet model are initialized with weights from pseudo-label classification and the fully connected layers were randomly initialized. The pre-trained AlexNet is then finetuned on various benchmark datasets.

e. Results

Using a deeper network like VGG-16 leads to better representations and pseudo-labels, and also to better results on benchmark tasks. The method achieved state-of-the-art results on several benchmarks in 2018 and further reduced the gap between supervised and self-supervised methods.

1. Transfer Learning on PASCAL VOC

The authors tested their method on object classification and detection on PASCAL VOC 2007 dataset and semantic segmentation on PASCAL VOC 2012 dataset.

Insights

Training Jigsaw++ with VGG16 and using AlexNet to predict cluster gives the best performance.
Switching to the more challenging pretext task “Jigsaw++” improves performance over “Jigsaw”.
Knowledge transfer doesn’t have a significant impact when using the same architecture AlexNet in both Jigsaw++ and downstream tasks.

Task	Cluster	Pretext	Downstream	Classification	Detection(SS)	Detec.(MS)	Segmentation
Jigsaw	no	AlexNet	AlexNet	67.7	53.2	-	-
Jigsaw++	no	AlexNet	AlexNet	69.8	55.5	55.7	38.1
Jigsaw++	yes	AlexNet	AlexNet	69.9	55.0	55.8	40.0
Jigsaw++	yes	VGG-16	AlexNet	72.5	56.5	57.2	42.6

2. Linear Classification on ImageNet

In this, a linear classifier is trained on features extracted from AlexNet at different convolutional layers. For ImageNet, using VGG-16 and transferring knowledge to AlexNet using clustering gives a substantial boost of 2%.

3. Non-linear classification on ImageNet

For a non-linear classifier, using VGG-16 and transferring knowledge to AlexNet using clustering gives the best performance on ImageNet.

Additional Insights from Paper

1. How does the number of clusters affect the performance?

The network is not significantly affected by the number of clusters. The authors tested AlexNet trained on pseudo-labels from a different number of clusters on the task of object detection.

2. How is this different from Knowledge Distillation?

Knowledge transfer is fundamentally different from knowledge distillation. Here, the goal is to only preserve the cluster association of images from the representation and transfer that to the target model. Unlike distillation, we don’t do any regression to the exact output of the teacher.

3. Can you use different datasets in clustering vs predicting pseudo-labels?

Yes, the method is flexible and you can pre-train on one dataset, cluster on another, and get pseudo-labels for the third one.

The authors did an experiment where they trained clustering on representations for ImageNet and then calculated cluster centers on the “Places” dataset to get pseudo-labels. There was only a small reduction (-1.5%) in performance for object classification.

Conclusion

Thus, Knowledge Transfer is a simple and efficient way to map representations from deep to shallow models.

References

Mehdi Noroozi and Paolo Favaro. 2017. Unsupervised learning of visual representations by solving jigsaw puzzles.

M. Noroozi, Ananth Vinjimoor, P. Favaro, and H. Pirsiavash. 2018. Boosting self-supervised learning via knowledge transfer. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Daisuke Okanohara and Jun’ichi Tsujii. 2007. A discriminative language model with pseudo-negative samples. In Annie Zaenen and Antal van den Bosch, editors, Proceedings of the 45th annual meeting of the association of computational linguistics, pages 73–80, Prague, Czech Republic. Association for Computational Linguistics.

Interactive Analysis of Sentence Embeddings

Thu, 24 Sep 2020 00:00:00 GMT

Embedding Projector is a free web application for visualizing high-dimensional data. It has built-in demos for visualizing word embeddings in NLP and image embeddings for MNIST in Computer Vision.

I recently experimented with a way to load sentence embeddings along with the class labels into this tool and explore them interactively. In this blog post, I will explain the end-to-end process with an example dataset.

Toy Example: Outlier Detection

1. Preparing Dataset

To understand this use case, let’s take a subset of 100 movie reviews from the SST-2 dataset which are labeled as positive and negative.

import pandas as pd

df = pd.read_csv('http://bit.ly/dataset-sst2', 
                 nrows=100, sep='\t', names=['text', 'label'])

df['label'] = df['label'].replace({0: 'negative', 1: 'positive'})

The dataset has a column containing the text and a label indicating whether it’s positive or negative opinion.

We will introduce noise into our dataset by corrupting five of the reviews with random text. These will act as outliers for our example.

df.loc[[10, 27, 54, 72, 91], 'text'] = 'askgkn askngk kagkasng'

2. Generating Embeddings

Now, we will compute sentence embeddings for the reviews using the sentence-transformers package. First, let’s install it using pip.

!pip install sentence-transformers

Next, we will create a helper function to return a NumPy array of sentence embeddings given a list of sentences.

from sentence_transformers import SentenceTransformer

sentence_bert_model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

def get_embeddings(sentences):
    return sentence_bert_model.encode(sentences,
                                    batch_size=32, 
                                    show_progress_bar=True)

Using the above function, we can generate sentence embeddings for our data as shown below.

e = get_embeddings(df['text'])
# shape: (100, 768)

3. Exporting to Embedding Projector Format

Embedding Projector requires two TSV files to load our custom embeddings:

output.tsv: This file should contain the embeddings without any headers.
metadata.tsv: This file should contain the original text and labels for the embeddings.

Let’s first generate the output.tsv file for our sentence embeddings from the previous step.

# Convert NumPy array of embedding into data frame
embedding_df = pd.DataFrame(e)

# Save dataframe as a TSV file without any index and header
embedding_df.to_csv('output.tsv', sep='\t', index=None, header=None)

To generate metadata.tsv, we simply save our original dataframe.

# Save dataframe without any index
df.to_csv('metadata.tsv', index=False, sep='\t')

4. Importing into Embedding Projector

We first go to https://projector.tensorflow.org/.

On the left-hand sidebar, click the Load button.

Then, for the first Choose file button, upload the output.tsv file and for the second Choose file button, upload the metadata.tsv file.

After uploading both files, click outside and you should see the sentence embedding projection. The dimensions of embeddings are reduced to 3D by default using PCA.

Let’s switch to 2D by turning off the checkbox for ‘Component #3’ in the bottom part of sidebar.

On the 2D visualization, we can see how the random text is far from other groups of text as an outlier. On hovering the point, we see the text askgkn askngk kagkasng.

5. Useful Features in Projector

a. Class Separation

We can enable color coding of the points by their actual labels (positive vs negative) by using the Color by dropdown in the left sidebar.

Select the name of the column that contains your labels. In our example file, the column name is label.

The points themselves are interactive. You can see the actual sentence for each point by hovering over them.

You can click on the point to show the metadata. We can see below on clicking a blue point that its label is “positive” in the popup.

So the blue points are positive and the red points are negative. When a point is selected, 100 nearest points in terms of cosine similarity are also highlighted.

To get back to the original view, we can click on any empty white space.
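
The nearest-neighbor highlighting can be approximated offline as well. Below is a minimal sketch that ranks points by cosine similarity to a selected point; the function names and the toy 2-D vectors are made up for illustration, and the Projector’s own implementation may differ.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(embeddings, index, top_k=100):
    """Indices of the top_k embeddings most similar to embeddings[index]."""
    query = embeddings[index]
    scored = [(cosine_similarity(query, vec), i)
              for i, vec in enumerate(embeddings) if i != index]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:top_k]]

# Toy 2-D vectors standing in for the 768-dimensional sentence embeddings
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors(toy_embeddings, index=0, top_k=2))
```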

Applications

The color coding can be a useful heuristic for many use cases:

It can be used to explore class overlap for the dataset you’re working on and identify tricky sentences.
If there are labeling errors in your dataset, then this might help uncover them. For example, if a whole cluster of points is in a certain color, but some single point in that cluster is in a different color, then that might be an outlier or labeling error.

b. Dimensionality Reduction Algorithm

The web app provides three standard dimensionality reduction techniques: UMAP, t-SNE, and PCA.

You can choose the algorithm and their parameters from the bottom of the left sidebar.

c. Custom Linear Projection

You can also use a custom keyword or full text as the axis using the CUSTOM tab. This will apply a custom linear projection and can help us explore meaningful directions in the embedding space.

For example, the Gmail team tried setting “yeah” on the left side and “yes” on the right side. When they projected encoder embeddings for email replies to this custom linear projection, they found replies in a casual tone (e.g. Here you go) on the left side and responses in a more formal tone clustered towards the right side.
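
Conceptually, such a projection places each embedding along the direction that goes from the left anchor to the right anchor. Here is a rough sketch of that idea, assuming the axis is simply the difference of the two anchor vectors; the 2-D vectors are made-up stand-ins for real embeddings, and the Projector may compute its axis differently.

```python
def project_on_axis(vectors, left, right):
    """Scalar position of each vector along the axis from `left` to `right`."""
    axis = [r - l for l, r in zip(left, right)]
    norm = sum(a * a for a in axis) ** 0.5
    unit = [a / norm for a in axis]
    return [sum(v_i * u_i for v_i, u_i in zip(v, unit)) for v in vectors]

left_anchor = [0.0, 1.0]   # stand-in for the embedding of "yeah"
right_anchor = [1.0, 0.0]  # stand-in for the embedding of "yes"
replies = {"casual reply": [0.1, 0.9], "formal reply": [0.8, 0.2]}

positions = project_on_axis(list(replies.values()), left_anchor, right_anchor)
print(dict(zip(replies, positions)))
```

Smaller values land towards the left anchor and larger values towards the right anchor.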

Conclusion

Thus, Embedding Projector is a very useful tool to better understand the datasets and models we work with.

References

Daniel Smilkov et al., Embedding Projector: Interactive Visualization and Interpretation of Embeddings

VSCode on Google Colab

Tue, 01 Sep 2020 00:00:00 GMT

I recently discovered a way to set up VSCode on Google Colab and use it as an editor to write code and run experiments on the Colab VM.

With this setup, you can still prototype in the Colab Notebook while also using VSCode for all the advantages of a full-fledged code editor. Here is how you can replicate my setup.

Approach 1: Python Package

In this setup, we use the colab-code package that automates all the manual setup steps described in the Approach 2 section of this blog post. You can make a copy of this notebook directly to get started.

First, install the colab-code package using the following command:
```
pip install colabcode
```
Now, import ColabCode class from the package and specify the port and password.
```
from colabcode import ColabCode
ColabCode(port=10000, password="password123")
```
You can also use it directly with the default port and without any password as shown below.
```
from colabcode import ColabCode
ColabCode()
```
You will get the ngrok URL in the output. Click the link and a login page will open in a new tab.
Type the password you had set in step 2 and click submit. If the page gets stuck for more than 4-5 seconds, refresh the page and you should be redirected to the editor.
Now you will get access to the editor interface and can use it to work on python files.

Approach 2: Manual Setup

I have described the setup steps in detail below. After going through all the steps, please use this colab notebook to try it out directly.

First, we will install the code-server package to run VSCode editor as a web app. Copy and run the following command on colab to install code-server.
```
!curl -fsSL https://code-server.dev/install.sh | sh
```
After the installation is complete, we will expose a random port 9000 to an external URL we can access using the pyngrok package. To install pyngrok, run
```
!pip install -qqq pyngrok
```
Then, run the following command to get a public ngrok URL. This will be the URL we will use to access VSCode.
```
from pyngrok import ngrok
# Pass the port positionally; newer pyngrok releases renamed the first argument
url = ngrok.connect(9000)
print(url)
```
Now, we will start the VSCode server in the background at port 9000 without any authentication using the following command.
```
!nohup code-server --port 9000 --auth none &
```
Now, you can access the VSCode interface at the URL you got from step 3. The interface and functionality are the same as the desktop version of VSCode.

Usage Tips

You can switch to the dark theme by going to the bottom-left corner of the editor, clicking the settings icon, and then clicking ‘Color Theme’.

A popup will open. Select Dark (Visual Studio) in the options and the editor will switch to a dark theme.
All the keyboard shortcuts of regular VSCode work with this. For example, you can use Ctrl + Shift + P to open a popup for various actions.
To open a terminal, you can use the shortcut Ctrl + Shift + `.
To get python code completions, you can install the Python(ms-python) extension from the extensions page on the left sidebar.
The Colab interface is still usable as a notebook, and the regular functions to upload and download files and mount Google Drive keep working. Thus, you get the benefits of both a notebook and a code editor.

Unsupervised Keyphrase Extraction

Sun, 30 Aug 2020 00:00:00 GMT

Keyword Extraction is one of the simplest ways to leverage text mining for providing business value. It can automatically identify the most representative terms in the document.

Such extracted keywords can be used for various applications. They can be used to summarize the underlying theme of a large document with just a few terms. They are also valuable as metadata for indexing and tagging the documents. They can likewise be used for clustering similar documents. For instance, to showcase relevant advertisements on a webpage, we could extract keywords from the webpage, find matching advertisements for these keywords, and showcase those.

In this post, I will provide an overview of the general pipeline of keyword extraction and explain the working mechanism of various unsupervised algorithms for this.

Unsupervised Keyphrase Extraction Pipeline

For keyword extraction, all algorithms follow a similar pipeline as shown below. A document is first preprocessed: less informative tokens such as stop words and punctuation are removed, and the text is split into terms. Then, candidate keywords such as words and phrases are chosen.

Then, a score is determined for each candidate keyword using some algorithm. The highest-ranking keywords are selected and post-processing such as removing near-duplicates is applied. Finally, the algorithm returns the top N ranking keywords as output.

Unsupervised Methods

Unsupervised algorithms for keyword extraction don’t need to be trained on the corpus and don’t need any pre-defined rules, dictionary, or thesaurus. They can use statistical features from the text itself and as such can be applied to large documents easily without re-training. Most of these algorithms don’t need any linguistic features except for stop word lists and so can be applied to multiple languages.

Let’s understand each algorithm by starting from simple methods and gradually adding complexity.

1. Naive Counting

This is a simple method which only takes into account how many times each term occurs.

Let’s understand it by applying it to an example document.

a. Pre-processing

In this step, we lowercase the text and remove low informative words such as stop words from the text.

b. Candidate Generation

We split the remaining terms by space and punctuation symbols to get a list of possible keywords.

c. Candidate Scoring

We can count the number of times each term occurs to get a score for each term.

Candidate	anything	mass	occupies	space	called	matter	exists	various	states	…
Count	1	1	1	1	1	2	1	1	1	…

d. Final Ranking

We can sort the keywords in descending order based on the counts and take the top N keywords as the output.
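
The four steps above can be sketched in a few lines of Python. The example sentence and the tiny stop word list below are assumptions chosen to resemble the candidate table above, not the exact document used in this post.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"that", "has", "and", "is", "in", "a", "the", "of", "it"}

def naive_keywords(document, top_n=3):
    # a. Pre-processing and b. candidate generation
    terms = re.findall(r"[a-z]+", document.lower())
    candidates = [t for t in terms if t not in STOP_WORDS]
    # c. Scoring by raw count and d. ranking
    return Counter(candidates).most_common(top_n)

document = ("Anything that has mass and occupies space is called matter. "
            "Matter exists in various states.")
print(naive_keywords(document))
```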

Drawback of Naive Counting

This method has an obvious drawback: it focuses only on frequency. Generic words are likely to be very frequent in any document, yet they are not representative of its domain and topic. We need some way to filter out generic terms.

2. Term Frequency Inverse Document Frequency (TF-IDF)

This method takes into account both how frequent the keyphrase is and also how rare it is across the documents.

Let’s understand how it works by going through the various steps of the pipeline:

a. Pre-processing

In this step, we lowercase the text and split the document into sentences.

b. Candidate Generation

We generate 1-gram, 2-gram, and 3-gram candidate phrases from each sentence such that they don’t contain any punctuation. These form our list of candidate phrases.

c. Candidate Scoring

Now, for each candidate keyword “w”, we calculate the TF-IDF score in the following steps.

First, the term frequency(TF) is calculated simply by counting the occurrence of the word.

Then, the inverse document frequency(IDF) is calculated by dividing the total number of documents by the number of documents that contain the word “w” and taking the log of that quantity.

Finally, we get the TF-IDF score for a term by multiplying the two quantities.

d. Final Ranking

We can sort the keywords in descending order based on their TF-IDF scores and take the top N keywords as the output.
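
Here is a minimal sketch of the scoring and ranking steps, assuming a small pre-tokenized corpus and single-word candidates for brevity. The function name and the toy corpus are made up, and the scored document is assumed to be part of the corpus so that the document frequency is never zero.

```python
import math
from collections import Counter

def tf_idf_scores(document_terms, corpus_terms):
    """Score each term of one document against a corpus of tokenized documents."""
    term_counts = Counter(document_terms)
    n_docs = len(corpus_terms)
    scores = {}
    for term, tf in term_counts.items():
        # Number of documents that contain the term at least once
        doc_freq = sum(1 for doc in corpus_terms if term in doc)
        idf = math.log(n_docs / doc_freq)
        scores[term] = tf * idf
    return scores

corpus = [
    ["matter", "has", "mass", "matter", "exists"],
    ["energy", "has", "no", "mass"],
    ["light", "has", "energy"],
]
scores = tf_idf_scores(corpus[0], corpus)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])
```

Notice that a word like “has”, which appears in every document, gets an IDF of zero and drops to the bottom of the ranking.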

3. Rapid Automatic Keyword Extraction (RAKE)

RAKE is a domain-independent keyword extraction method proposed in 2010. It uses word frequency and co-occurrence to identify the keywords. It is very useful for identifying relevant multi-word expressions.

How RAKE works

Let’s apply RAKE on a toy example document to understand how it works:

a. Preprocessing

First, the stop words in the document are removed.

b. Candidate Generation

We split the document at the stop word positions and punctuations to get content words. The words that occur consecutively without any stop word between them are taken as candidate keywords.

For example, “Deep Learning” is treated as a single keyword.

c. Candidate Scoring

Next, the frequency of each individual word in the candidate keywords is calculated. This finds words that occur frequently.

	deep	learning	subfield	ai	useful
Word Frequency:	1	1	1	1	1

Similarly, the word co-occurrence count is calculated and the degree for each word is the total sum. This metric identifies words that occur often in longer candidate keywords.

	deep	learning	subfield	ai	useful
deep	1	1	0	0	0
learning	1	1	0	0	0
subfield	0	0	1	0	0
ai	0	0	0	1	0
useful	0	0	0	0	1
degree:	1 + 1 = 2	1 + 1 = 2	1	1	1

Then, we divide the degree by the frequency for each word to get a final score. This score identifies words that occur more in longer candidate keywords than individually.

	deep	learning	subfield	ai	useful
Score =	2 / 1 = 2	2 / 1 = 2	1 / 1 = 1	1 / 1 = 1	1 / 1 = 1

d. Final Ranking

Finally, we calculate the scores for our candidate keywords by adding the scores for their member words. The higher the score, the more useful a keyword is.

Keyword	Score	Remarks
deep learning	4	score(deep) + score(learning) = 2 + 2 = 4
subfield	1	score(subfield) = 1
ai	1	score(ai) = 1
useful	1	score(useful) = 1

Thus, the keywords are sorted in the descending order of their score value. We can select the top-N keywords from this list.
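
To tie the steps together, here is a from-scratch sketch of the RAKE scoring described above. It uses a tiny hand-picked stop word list, which is an assumption for this toy document, and it is a simplification rather than the reference implementation.

```python
import re
from collections import defaultdict

# A tiny illustrative stop word list (an assumption for this toy example)
STOP_WORDS = {"is", "a", "of", "it", "very"}

def rake(text):
    # Split at punctuation, then break each chunk at stop word positions
    phrases = []
    for chunk in re.split(r"[.,;:!?]", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including with itself

    # Word score is degree / frequency; phrase score is the sum over its words
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(rake("Deep Learning is a subfield of AI. It is very useful."))
```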

Drawbacks of RAKE

If the stop word list used in RAKE is not exhaustive, it would treat continuous long text as a phrase and give very long phrases.
Multi-word expressions that contain stop-words could be missed. For example, mention of a brand called “Good Day” could be missed if “good” is present in the stop word list.

Using RAKE in Python

We can use the rake-nltk library to use it in Python as shown below.

pip install rake-nltk

from rake_nltk import Rake
rake = Rake()

text = 'Deep Learning is a subfield of AI. It is very useful.'
rake.extract_keywords_from_text(text)

print(rake.get_ranked_phrases_with_scores())

[(4.0, 'deep learning'), (1.0, 'useful'), (1.0, 'subfield'), (1.0, 'ai')]

4. Yet Another Keyword Extractor (YAKE)

YAKE is another popular keyword extraction algorithm proposed in 2018. It outperforms TF-IDF and RAKE across many datasets and went on to win the Best Short Paper Award at ECIR 2018.

YAKE uses statistical features to identify and rank the most important keywords. It doesn’t need any linguistic information like NER or POS tagging and thus can be used with any language. It only requires a stop word list for the language.

How YAKE works:

i. Preprocessing and Candidate Generation

The sentences are split into terms using spaces and special characters (line break, bracket, comma, period) as delimiters.

We decide the maximum length of the keyword to be generated. If we decide max length of 3, then 1-gram, 2-gram, and 3-gram candidate phrases are generated using a sliding window.

Then, we remove phrases that contain punctuation marks. Also, phrases that begin or end with a stop word are removed.
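
The candidate generation can be sketched as a sliding window over the terms. In this sketch, the text is split at punctuation first so that no candidate spans a punctuation mark, and a candidate is dropped if its first or last term is a stop word. The stop word list and function name are assumptions for illustration.

```python
import re

# A tiny illustrative stop word list (an assumption, not YAKE's own list)
STOP_WORDS = {"is", "the", "of", "that", "a", "it", "as", "through"}

def generate_candidates(text, max_len=3):
    candidates = []
    # Splitting on punctuation ensures no n-gram crosses a punctuation mark
    for chunk in re.split(r"[.,;:!?()\[\]\n]", text.lower()):
        terms = chunk.split()
        for n in range(1, max_len + 1):
            for start in range(len(terms) - n + 1):
                gram = terms[start:start + n]
                # Skip candidates that start or end with a stop word
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                candidates.append(" ".join(gram))
    return candidates

print(generate_candidates("Machine learning is the study of computer algorithms.",
                          max_len=2))
```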

ii. Candidate Scoring

YAKE uses 5 features to quantify how good each word is.

a. Casing

This feature considers the casing of the word. It gives more importance to capitalized words and acronyms such as “NASA”.

First, we count the number of times the word starts with a capital letter when it is not the beginning word of the sentence. We also count the times when the word is in acronym form.

Then, we take the maximum of the two counts and normalize it by the log of the total count.

b. Word Positional

This feature gives more importance to words present at the beginning of the document. It’s based on the assumption that relevant keywords are usually concentrated more at the beginning of a document.

First, we get all the sentence positions where the word “w” occurs.

Then, we compute the position feature by taking the median position and applying the following formula:

c. Word Frequency

This feature calculates the frequency of the word, normalized by the mean frequency plus one standard deviation.

d. Word Relatedness to Context

This feature quantifies how related a word is to its context. For that, it counts how many different terms occur to the left or right of a candidate word. If the word occurs frequently with different words on the left or right side, it is more likely to be a stop word.

where,

WR = (number of unique words on right) / (total words on right)
WL = (number of unique words on left) / (total words on left)
PL = (total words on left) / (max count)
PR = (total words on right) / (max count)

e. Word Different Sentence

This feature quantifies how often a candidate word occurs with different sentences. A word that often occurs in different sentences has a higher score.

Combined Word Score

These 5 features are combined into a single score S(w) using the formula:

S(w) = (d * b) / (a + (c / d) + (e / d))

where,

a = casing
b = position
c = frequency
d = relatedness
e = different

Keyword Score

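Now, for each of our candidate keywords, a score is calculated using the following formula. The count of the keyword in the denominator penalizes less frequent keywords.

S(kw) = (product of S(w) for each word w in kw) / (count(kw) * (1 + sum of S(w) for each word w in kw))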
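
To see how the two scores interact, here is a minimal sketch that computes the word score from the five features and then the keyword score, assuming the formulation from the YAKE paper. The function names and the feature values are made up purely for illustration.

```python
import math

def word_score(casing, position, frequency, relatedness, different):
    """S(w) = (d * b) / (a + c / d + e / d); lower means more important."""
    return (relatedness * position) / (
        casing + frequency / relatedness + different / relatedness
    )

def keyword_score(word_scores, keyword_count):
    """S(kw) = prod(S(w)) / (count(kw) * (1 + sum(S(w))))."""
    return math.prod(word_scores) / (keyword_count * (1 + sum(word_scores)))

# Made-up feature values for the two words of a candidate keyword
s_first = word_score(casing=1.0, position=0.5, frequency=1.0,
                     relatedness=1.0, different=0.5)
s_second = word_score(casing=1.0, position=0.5, frequency=1.0,
                      relatedness=1.0, different=0.5)

seen_once = keyword_score([s_first, s_second], keyword_count=1)
seen_twice = keyword_score([s_first, s_second], keyword_count=2)
print(seen_once, seen_twice)
```

With the same word scores, the keyword that occurs twice gets half the score of the one that occurs once, which ranks it higher since lower scores are better.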

iii. Post-processing

It’s pretty common to get similar candidates when extracting keyphrases. For example, we could have variations like:

“work”, “works”
“relevant”, “relevance”

To eliminate such duplicates, the following process is applied:

First, the keywords are sorted in ascending order of their scores and we maintain a list of chosen keywords so far
Then, for each keyword in the list
- If the keyword has a small Levenshtein distance with any of chosen keywords so far, it is skipped
- Otherwise, the keyword is added to the chosen keywords list

Thus, the chosen keyword list contains the final deduplicated keywords.
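
The deduplication loop can be sketched as shown below. The absolute distance threshold max_distance is an assumption for this example; actual implementations may use a normalized similarity threshold instead.

```python
def levenshtein(a, b):
    """Edit distance between two strings using dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def deduplicate(scored_keywords, max_distance=2):
    chosen = []
    # Visit keywords from best (lowest score) to worst
    for keyword, score in sorted(scored_keywords, key=lambda kv: kv[1]):
        if any(levenshtein(keyword, kept) <= max_distance for kept, _ in chosen):
            continue  # too similar to an already chosen keyword
        chosen.append((keyword, score))
    return chosen

print(deduplicate([("works", 0.30), ("work", 0.10), ("relevant", 0.20)]))
```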

iv. Final Ranking

Thus, we have a list of keywords along with their scores. A keyword is more important if it has a lower score.

We can sort the keywords in ascending order and take the top N keywords as the output.

Using YAKE in Python

To apply YAKE, we will use the pke library. First, we need to install the library and its dependencies using the following command:

pip install git+https://github.com/boudinfl/pke.git
python -m nltk.downloader stopwords
python -m spacy download en

Then, we can use YAKE to generate keywords of maximum length 2 as shown below.

from pke.unsupervised import YAKE
from nltk.corpus import stopwords

document = "Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence."

# 1. Create YAKE keyword extractor
extractor = YAKE()

# 2. Load document
extractor.load_document(input=document,
                        language='en',
                        normalization=None)


# 3. Generate candidate 1-gram and 2-gram keywords
stoplist = stopwords.words('english')
extractor.candidate_selection(n=2, stoplist=stoplist)

# 4. Calculate scores for the candidate keywords
extractor.candidate_weighting(window=2,
                              stoplist=stoplist,
                              use_stems=False)

# 5. Select 10 highest ranked keywords
# Remove redundant keywords with similarity above 80%
key_phrases = extractor.get_n_best(n=10, threshold=0.8)
print(key_phrases)

You get back a list of top-10 keywords and their scores. The highest ranked keyword has the lowest score.

[('machine learning', 0.01552184797949213),
 ('computer algorithms', 0.04188746641162499),
 ('improve automatically', 0.04188746641162499),
 ('machine', 0.12363091320521931),
 ('learning', 0.12363091320521931),
 ('experience', 0.12363091320521931),
 ('artificial intelligence', 0.18075564686791562),
 ('study', 0.2005079697193566),
 ('computer', 0.2005079697193566),
 ('algorithms', 0.2005079697193566)]

References

Rose, Stuart & Engel, Dave & Cramer, Nick & Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. doi:10.1002/9780470689646.ch1
Eirini Papagiannopoulou et al., “A Review of Keyphrase Extraction”
“YAKE implementation in pke: an open source python-based keyphrase extraction toolkit”

Text Data Augmentation with MarianMT

Sun, 30 Aug 2020 00:00:00 GMT

Hugging Face recently released 1008 translation models for almost 140 languages on their model hub.

These models were originally trained by Jörg Tiedemann of the Language Technology Research Group at the University of Helsinki. They were trained on the Open Parallel Corpus(OPUS) using a neural machine translation framework called MarianNMT.

In this post, I will explain how you can use the MarianMT models to augment text data.

Back Translation

We will use a data augmentation technique called “Back Translation”. In this, we take an original text written in English. Then, we convert it into another language (e.g. French) using MarianMT. Next, we translate the French text back into English using MarianMT. We keep the back-translated English text if it is different from the original English sentence.

Augmentation Process

First, we need to install Hugging Face transformers and Moses Tokenizers with the following command


pip install transformers==4.1.1 sentencepiece==0.1.94
pip install mosestokenizer==1.1.0

After installation, we can now import the MarianMT model and tokenizer.

from transformers import MarianMTModel, MarianTokenizer

Then, we can initialize the model that translates from English to Romance languages. This is a single model that can translate to any of the Romance languages.

target_model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
target_tokenizer = MarianTokenizer.from_pretrained(target_model_name)
target_model = MarianMTModel.from_pretrained(target_model_name)

Similarly, we can initialize models that can translate Romance languages to English.

en_model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
en_tokenizer = MarianTokenizer.from_pretrained(en_model_name)
en_model = MarianMTModel.from_pretrained(en_model_name)

Next, we write a helper function to translate a batch of text given the machine translation model, tokenizer and the target romance language.

def translate(texts, model, tokenizer, language="fr"):
    # Prepare the text data into appropriate format for the model
    template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
    src_texts = [template(text) for text in texts]

    # Tokenize the texts
    encoded = tokenizer.prepare_seq2seq_batch(src_texts)
    
    # Generate translation using model
    translated = model.generate(**encoded)

    # Convert the generated tokens indices back into text
    translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
    
    return translated_texts

Next, we will prepare a function to use the above translate() function to perform back translation.

def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from source to target language
    target_texts = translate(texts, target_model, target_tokenizer, 
                             language=target_lang)

    # Translate from target language back to source language
    back_translated_texts = translate(target_texts, en_model, en_tokenizer, 
                                      language=source_lang)
    
    return back_translated_texts

Now, we can perform data augmentation using back-translation from English to Spanish on a list of sentences as shown below.

en_texts = ['This is so cool', 'I hated the food', 'They were very helpful']

aug_texts = back_translate(en_texts, source_lang="en", target_lang="es")
print(aug_texts)

["Yeah, it's so cool.", "It's the food I hated.", 'They were of great help.']

Similarly, we can perform augmentation using English to French as shown below with the exact same helper method.

en_texts = ['This is so cool', 'I hated the food', 'They were very helpful']
aug_texts = back_translate(en_texts, source_lang="en", target_lang="fr")

print(aug_texts)

["It's so cool.", 'I hated food.', "They've been very helpful."]
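The back-translation recipe described earlier keeps an augmented sentence only if it differs from the original. A minimal sketch of that filter is below. `filter_augmented` is a hypothetical helper name, and comparing case-insensitively after trimming whitespace is an assumption made here for illustration.

```python
def filter_augmented(originals, augmented):
    """Keep back-translated texts that differ from their source sentence."""
    kept = []
    for source, candidate in zip(originals, augmented):
        # Assumption: ignore case and surrounding whitespace when comparing
        if source.strip().lower() != candidate.strip().lower():
            kept.append(candidate)
    return kept


originals = ['This is so cool', 'I hated the food']
augmented = ["It's so cool.", 'I hated the food']
print(filter_augmented(originals, augmented))
# ["It's so cool."]
```

In practice, `augmented` would be the output of `back_translate(originals, ...)`.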

Chained Back Translation

You can also run back translation in a chain to get more diversity. For example, English -> Spanish -> English -> French -> English

en_texts = ['This is so cool', 'I hated the food', 'They were very helpful']

aug1_texts = back_translate(en_texts, source_lang="en", target_lang="es")
aug2_texts = back_translate(aug1_texts, source_lang="en", target_lang="fr")

print(aug2_texts)

["Yeah, that's cool.", "It's the food I hated.", 'They were of great help.']

Available Models

Here are language codes for a subset of major Romance languages that you can use above.

Language	French	Spanish	Italian	Portuguese	Romanian	Catalan	Galician	Latin
Code	fr	es	it	pt	ro	ca	gl	la

Language	Walloon	Occitan (post 1500)	Sardinian	Aragonese	Corsican	Romansh
Code	wa	oc	sn	an	co	rm

To view all available language codes, you can run

target_tokenizer.supported_language_codes

Alternative Applications

Besides data augmentation, the back translation process can also be used for text paraphrasing.

Similarly, we can also use it as an adversarial attack. Suppose we have a training dataset on which we trained an NLP model. Then, we can augment the training dataset and generate prediction from our model on augmented texts. If the predictions are different than our ground-truth labels, then we have a list of texts where our model fails. We can get good insights by analyzing those responses.
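The failure-finding loop described above can be sketched as follows. The names `find_failures`, `toy_augment`, and `toy_predict` are hypothetical; the toy functions only stand in for a real augmenter (such as `back_translate`) and a real model so that the sketch runs on its own.

```python
def find_failures(texts, labels, augment, predict):
    """Return augmented texts whose prediction disagrees with the original label."""
    augmented = augment(texts)
    predictions = predict(augmented)
    return [
        (original, new_text, label, prediction)
        for original, new_text, label, prediction
        in zip(texts, augmented, labels, predictions)
        if prediction != label
    ]


# Toy stand-in for an augmenter: rewrites one phrase
def toy_augment(texts):
    return [text.replace("hated", "did not like") for text in texts]


# Toy stand-in for a model: only recognises the word "hated" as negative
def toy_predict(texts):
    return ["negative" if "hated" in text else "positive" for text in texts]


failures = find_failures(
    ["I hated the food", "This is so cool"],
    ["negative", "positive"],
    toy_augment,
    toy_predict,
)
print(failures)
```

Each returned tuple pairs the original text with the augmented variant that flipped the prediction, which is a convenient starting point for error analysis.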

Conclusion

Thus, MarianMT is a decent free and offline alternative to Google Translate for back-translation.

References

MarianMT - transformers 3.0.2 documentation

Evaluation Metrics For Information Retrieval

Tue, 04 Aug 2020 00:00:00 GMT

Most software products we encounter today have some form of search functionality integrated into them. We search for content on Google, videos on YouTube, products on Amazon, messages on Slack, emails on Gmail, people on Facebook, and so on.

As users, the workflow is pretty simple. We can search for items by writing our queries in a search box and the ranking model in their system gives us back the top-N most relevant results.

How do we evaluate how good the top-N results are?

In this post, I will answer the above question by explaining the common offline metrics used in learning to rank problems. These metrics are useful not only for evaluating search results but also for problems like keyword extraction and item recommendation.

Problem Setup 1: Binary Relevance

Let’s take a simple toy example to understand the details and trade-offs of various evaluation metrics.

We have a ranking model that gives us back the 5 most relevant results for a certain query. The first, third, and fifth results were relevant as per our ground-truth annotation.

Let’s look at various metrics to evaluate this simple example.

A. Order-Unaware Metrics

1. Precision@k

This metric quantifies how many items in the top-k results were relevant. Mathematically, this is given by:

$$
\text{Precision@}k = \frac{\text{number of relevant items in the top-}k\text{ results}}{k}
$$

For our example, precision@1 = 1 as the only item in the top-1 results is relevant.

Similarly, precision@2 = 0.5 as only one of the top-2 results is relevant.

Thus, we can calculate the precision score for all k values.

k	1	2	3	4	5
Precision@k	1	1/2	2/3	1/2	3/5

A limitation of precision@k is that it doesn’t consider the position of the relevant items. Consider two models A and B that have the same number of relevant results i.e. 3 out of 5.

For model A, the first three items were relevant, while for model B, the last three items were relevant. Precision@5 would be the same for both of these models even though model A is better.
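As a quick check of the numbers above, here is a minimal sketch of Precision@k. The function name and the 0/1 list encoding of relevance are choices made for this illustration.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (1 = relevant, 0 = not)."""
    return sum(relevance[:k]) / k


# Toy example: the 1st, 3rd and 5th results are relevant
relevance = [1, 0, 1, 0, 1]
print([round(precision_at_k(relevance, k), 2) for k in range(1, 6)])
# [1.0, 0.5, 0.67, 0.5, 0.6]

# Model A (relevant items first) and model B (relevant items last) tie at k=5
model_a = [1, 1, 1, 0, 0]
model_b = [0, 0, 1, 1, 1]
print(precision_at_k(model_a, 5) == precision_at_k(model_b, 5))
# True
```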

2. Recall@k

This metric gives how many actual relevant results were shown out of all actual relevant results for the query. Mathematically, this is given by:

$$
\text{Recall@}k = \frac{\text{number of relevant items in the top-}k\text{ results}}{\text{total number of relevant items}}
$$

For our example, recall@1 = 0.33 as only one of the 3 actual relevant items is present.

Similarly, recall@3 = 0.67 as only two of the 3 actual relevant items are present.

Thus, we can calculate the recall score for different K values.

k	Recall@k
1	1/3
2	1/3
3	2/3
4	2/3
5	1
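The recall values for the toy example can be reproduced with a short sketch. The function name and the explicit `total_relevant` argument are choices made for this illustration.

```python
def recall_at_k(relevance, k, total_relevant):
    """Share of all relevant items that appear in the top-k results."""
    return sum(relevance[:k]) / total_relevant


# Toy example: 3 relevant items in total, at positions 1, 3 and 5
relevance = [1, 0, 1, 0, 1]
print([round(recall_at_k(relevance, k, total_relevant=3), 2) for k in range(1, 6)])
# [0.33, 0.33, 0.67, 0.67, 1.0]
```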

3. F1@k

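This is a combined metric that incorporates both Precision@k and Recall@k by taking their harmonic mean. We can calculate it as:

$$
\text{F1@}k = \frac{2 \cdot \text{Precision@}k \cdot \text{Recall@}k}{\text{Precision@}k + \text{Recall@}k}
$$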

Using the previously calculated values of precision and recall, we can calculate F1-scores for different K values as shown below.

Metric	Precision@k	Recall@k	F1@k
k=1	1	1/3	0.5
k=2	1/2	1/3	0.4
k=3	2/3	2/3	0.667
k=4	1/2	2/3	0.571
k=5	3/5	1	0.75
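The harmonic mean can be computed directly from the relevance list. This is a minimal sketch; the function name and the zero-division guard are choices made for this illustration.

```python
def f1_at_k(relevance, k, total_relevant):
    """Harmonic mean of Precision@k and Recall@k."""
    hits = sum(relevance[:k])
    precision = hits / k
    recall = hits / total_relevant
    if precision + recall == 0:
        return 0.0  # no relevant item in the top-k
    return 2 * precision * recall / (precision + recall)


relevance = [1, 0, 1, 0, 1]
print([round(f1_at_k(relevance, k, total_relevant=3), 2) for k in range(1, 6)])
# [0.5, 0.4, 0.67, 0.57, 0.75]
```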

B. Order Aware Metrics

While precision, recall, and F1 give us a single-value metric, they don’t consider the order in which the returned search results are sent. To solve that limitation, people have devised order-aware metrics given below:

1. Mean Reciprocal Rank(MRR)

This metric is useful when we want our system to return the best relevant item and want that item to be at a higher position. Mathematically, this is given by:

$$
MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}
$$

where:

$|Q|$ denotes the total number of queries
$rank_i$ denotes the rank of the first relevant result for the i-th query

To calculate MRR, we first calculate the reciprocal rank. It is simply the reciprocal of the rank of the first correct relevant result and the value ranges from 0 to 1.

For our example, the reciprocal rank is 1/1 = 1 as the first relevant item is at position 1.

Let’s see another example where the only relevant result is present at the end of the list i.e. position 5. It gets a lower reciprocal rank score of 1/5 = 0.2.

Let’s consider another example where none of the returned results are relevant. In such a scenario, the reciprocal rank will be 0.

For multiple different queries, we can calculate the MRR by taking the mean of the reciprocal rank for each query.

We can see that MRR doesn’t care about the position of the remaining relevant results. So, if your use-case requires returning multiple relevant results in the best possible way, MRR is not a suitable metric.
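The three reciprocal-rank cases discussed above, and their mean, can be sketched as follows. The function names are choices made for this illustration.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Mean of the reciprocal ranks over several queries."""
    return sum(reciprocal_rank(relevance) for relevance in queries) / len(queries)


queries = [
    [1, 0, 1, 0, 1],  # first relevant result at position 1 -> 1.0
    [0, 0, 0, 0, 1],  # first relevant result at position 5 -> 0.2
    [0, 0, 0, 0, 0],  # no relevant result -> 0.0
]
print(round(mean_reciprocal_rank(queries), 2))
# 0.4
```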

2. Average Precision(AP)

Average Precision is a metric that evaluates whether the ground-truth relevant items returned by the model are ranked near the top. Unlike MRR, it considers all the relevant items.

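Mathematically, it is given by:

$$
AP = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\text{number of relevant items}}
$$

where:

$rel(k)$ is an indicator function which is 1 when the item at rank k is relevant and 0 otherwise.
$P(k)$ is the Precision@k metric.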

For our example, we can calculate the AP based on our Precision@K values for different K.

To illustrate the advantage of AP, let’s take our previous example but place the 3 relevant results at the beginning. This ordering gets a perfect AP score of 1, which is higher than the score for the original ordering.

3. Mean Average Precision(MAP)

If we want to evaluate average precision across multiple queries, we can use the MAP. It is simply the mean of the average precision for all queries. Mathematically, this is given by:

$$
MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} AP(q)
$$

where:

$|Q|$ is the total number of queries
$AP(q)$ is the average precision for query q.
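Both AP and MAP can be sketched in a few lines. The function names are choices made for this illustration, and the sketch assumes every relevant item for the query appears somewhere in the returned list, so the denominator is the number of relevant items found in that list.

```python
def average_precision(relevance):
    """Average of Precision@k taken at every rank k that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # Precision@k at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the average precision over several queries."""
    return sum(average_precision(relevance) for relevance in queries) / len(queries)


original = [1, 0, 1, 0, 1]   # relevant items at positions 1, 3, 5
reordered = [1, 1, 1, 0, 0]  # relevant items moved to the top

print(round(average_precision(original), 2))   # 0.76
print(average_precision(reordered))            # 1.0
print(round(mean_average_precision([original, reordered]), 2))  # 0.88
```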

Problem Setup 2: Graded Relevance

Let’s take another toy example where we annotated the items not just as relevant or not-relevant but instead used a grading scale from 0 to 5 where 0 denotes least relevant and 5 denotes the most relevant.

We have a ranking model that gives us back the 5 most relevant results for a certain query. The first item has a relevance score of 3 as per our ground-truth annotation, the second item has a relevance score of 2, and so on.

Let’s understand the various metrics to evaluate this type of setup.

1. Cumulative Gain (CG@k)

This metric uses a simple idea to just sum up the relevance scores for top-K items. The total score is called cumulative gain. Mathematically, this is given by:

$$
CG@k = \sum_{i=1}^{k} rel_i
$$

where $rel_i$ is the relevance score of the item at position i.

For our example, CG@2 will be 5 because we add the first two relevance scores 3 and 2.

Similarly, we can calculate the cumulative gain for all the K-values as:

Position(k)	1	2	3	4	5
Cumulative Gain@k	3	3+2=5	3+2+3=8	3+2+3+0=8	3+2+3+0+1=9

While simple, CG doesn’t take into account the order of the relevant items. So, even if we swap a less-relevant item to the first position, the CG@2 will be the same.
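A short sketch reproduces the table above and shows the order-blindness of CG. The function name is a choice made for this illustration.

```python
def cumulative_gain(scores, k):
    """Sum of the graded relevance scores of the top-k items."""
    return sum(scores[:k])


scores = [3, 2, 3, 0, 1]
print([cumulative_gain(scores, k) for k in range(1, 6)])
# [3, 5, 8, 8, 9]

# Swapping the first two items leaves CG@2 unchanged
swapped = [2, 3, 3, 0, 1]
print(cumulative_gain(scores, 2) == cumulative_gain(swapped, 2))
# True
```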

2. Discounted Cumulative Gain (DCG@k)

We saw how a simple cumulative gain doesn’t take into account the position. But, we would normally want items with a high relevance score to be present at a better rank.

Consider an example below. With the cumulative gain, we are simply adding the scores without taking into account their position.

An item with a relevance score of 3 at position 1 is better than the same item with relevance score 3 at position 2.

So, we need some way to penalize the scores by their position. DCG introduces a log-based penalty function to reduce the relevance score at each position. For 5 items, the penalty would be

i	log2(i+1)
1	1
2	1.5849
3	2
4	2.3219
5	2.5849

Using this penalty, we can now calculate the discounted cumulative gain simply by taking the sum of the relevance scores divided by the penalty. Mathematically, this is given by:

$$
DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}
$$

To understand the behavior of the log-penalty, let’s plot the ranking position on the x-axis and the percentage of the relevance score kept, i.e. $\frac{1}{\log_2(i+1)}$, on the y-axis. As seen, in position 1, we don’t apply any penalty and the score remains unchanged. But the percentage of the score kept drops steeply from 100% in position 1 to 63% in position 2, 50% in position 3, and so on.

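Let’s now calculate DCG for our example.

Position(i)	Relevance	log2(i+1)	Relevance / log2(i+1)
1	3	1	3 / 1 = 3
2	2	1.5849	2 / 1.5849 = 1.2618
3	3	2	3 / 2 = 1.5
4	0	2.3219	0 / 2.3219 = 0
5	1	2.5849	1 / 2.5849 = 0.3868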

Based on these penalized scores, we can now calculate DCG at various k values simply by taking their sum up to k.

k	DCG@k
1	3
2	4.2618
3	5.7618
4	5.7618
5	6.1486

There is also an alternative formulation for DCG@K that gives more penalty if relevant items are ranked lower:

$$
DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}
$$

This formulation is preferred more in industry.
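Both DCG formulations can be sketched with one function. The name `dcg_at_k` and the `exponential_gain` flag are choices made for this illustration; the flag switches the numerator from the raw relevance score to 2^rel - 1, the commonly used alternative gain.

```python
import math


def dcg_at_k(scores, k, exponential_gain=False):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    total = 0.0
    for i, rel in enumerate(scores[:k], start=1):
        # Alternative formulation uses 2^rel - 1 as the gain
        gain = 2 ** rel - 1 if exponential_gain else rel
        total += gain / math.log2(i + 1)  # log-based position penalty
    return total


scores = [3, 2, 3, 0, 1]
print([round(dcg_at_k(scores, k), 2) for k in range(1, 6)])
# [3.0, 4.26, 5.76, 5.76, 6.15]

# The exponential gain rewards highly relevant items much more strongly
print(dcg_at_k(scores, 5, exponential_gain=True) > dcg_at_k(scores, 5))
# True
```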

While DCG solves the issues with cumulative gain, it has a limitation. Suppose we have a query Q1 with 3 results and a query Q2 with 5 results. Then Q2, the query with 5 results, will have a larger overall DCG score. But we can’t say that query 2 was better than query 1.

3. Normalized Discounted Cumulative Gain (NDCG@k)

To allow a comparison of DCG across queries, we can use NDCG that normalizes the DCG values using the ideal order of the relevant items.

Let’s take our previous example where we had already calculated the DCG values at various K values.

k	DCG@k
1	3
2	4.2618
3	5.7618
4	5.7618
5	6.1486

For our example, ideally, we would have wanted the items to be sorted in descending order of relevance scores.

Let’s calculate the ideal DCG(IDCG) for this order.

Position(k)	Ideal relevance	Relevance / log2(k+1)	IDCG@k
1	3	3 / 1 = 3	3
2	3	3 / 1.5849 = 1.8927	3+1.8927=4.8927
3	2	2 / 2 = 1	3+1.8927+1=5.8927
4	1	1 / 2.3219 = 0.4306	3+1.8927+1+0.4306=6.3233
5	0	0 / 2.5849 = 0	3+1.8927+1+0.4306+0=6.3233

Now we can calculate the NDCG@k for various k by dividing DCG@k by IDCG@k as shown below:

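k	DCG@k	IDCG@k	NDCG@k
1	3	3	3 / 3 = 1
2	4.2618	4.8927	4.2618 / 4.8927 = 0.8711
3	5.7618	5.8927	5.7618 / 5.8927 = 0.9778
4	5.7618	6.3233	5.7618 / 6.3233 = 0.9112
5	6.1486	6.3233	6.1486 / 6.3233 = 0.9724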

Thus, we get NDCG scores with a range between 0 and 1. A perfect ranking would get a score of 1. We can also compare NDCG@k scores of different queries since it’s a normalized score.
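The normalization step can be sketched as below. The function names are choices made for this illustration, and the ideal ordering is obtained by sorting the returned scores in descending order, as in the worked example.

```python
import math


def dcg_at_k(scores, k):
    """Discounted cumulative gain with the log2(i + 1) position penalty."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(scores[:k], start=1))


def ndcg_at_k(scores, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = sorted(scores, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(scores, k) / idcg if idcg > 0 else 0.0


scores = [3, 2, 3, 0, 1]
print([round(ndcg_at_k(scores, k), 2) for k in range(1, 6)])
# [1.0, 0.87, 0.98, 0.91, 0.97]

# A perfectly ordered list scores 1
print(ndcg_at_k([3, 3, 2, 1, 0], 5))
# 1.0
```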

Conclusion

Thus, we learned about various evaluation metrics for both binary and graded ground-truth labels and how each metric improves upon the previous.

References

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446.

Wikipedia. Discounted cumulative gain.

Wikipedia. Evaluation measures (information retrieval).

Wikipedia. Mean reciprocal rank.

Citation

BibTeX citation:

@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {Evaluation {Metrics} {For} {Information} {Retrieval}},
  date = {2020-08-04},
  url = {https://amitness.com/posts/information-retrieval-evaluation.html},
  langid = {en}
}

For attribution, please cite this work as:

Amit Chaudhary. 2020. Evaluation Metrics For Information Retrieval.

Behavioral Testing of NLP models

Tue, 28 Jul 2020 00:00:00 GMT

When developing an NLP model, it’s a standard practice to test how well a model generalizes to unseen examples by evaluating it on a held-out dataset. Suppose we reach our target performance metric of 95% on a held-out dataset and thus deploy the model to production based on this single metric.

But, when real users start using it, the story could be completely different than what our 95% performance metric was saying. Our model might perform poorly even on simple variations of the training text.

In contrast, the field of software engineering uses a suite of unit tests, integration tests, and end-to-end tests to evaluate all aspects of the product for failures. An application is deployed to production only after passing these rigorous tests.

Ribeiro et al. noticed this gap and took inspiration from software engineering to propose an evaluation methodology for NLP called “CheckList”. Their paper won the best overall paper award at ACL 2020.

In this post, I will explain the overall concept of CheckList and the various components that it proposes for evaluating NLP models.

Behavioral Testing

To understand CheckList, let’s first understand behavioral testing in the context of software engineering.

Behavioral testing, also known as black-box testing, is a method where we test a piece of software based on its expected input and output. We don’t need access to the actual implementation details.

For example, let’s say you have a function that adds two numbers together.

def add(a, b):
    return a + b

We can evaluate this function by writing tests to compare its output to the expected answer. We are not concerned with how this function was implemented internally.

def test_add():
    assert add(1, 2) == 3
    assert add(1, 0) == 1
    assert add(-1, 1) == 0
    assert add(-1, -1) == -2

Even for a simple function such as addition, there are capabilities that it should satisfy. For example, the addition of a number with zero should yield the original number itself.

Capability	Function Signature	Output	Expected	Test Passed
Two Positive Numbers	add(1, 2)	3	3	Yes
No Change with Zero	add(1, 0)	1	1	Yes
Opposite Numbers	add(-1, 1)	0	0	Yes
Two Negative Numbers	add(-1, -1)	-2	-2	Yes
			Pass Rate	4/4 = 100%

CheckList Framework

CheckList proposes a general framework for writing behavioral tests for any NLP model and task.

The core idea is based on a conceptual matrix that is composed of linguistic capabilities as rows and test types as columns. The intersecting cells contain multiple test examples generated from templates that we run and calculate the failure rate for.

Capability / Test	Minimum Functionality Test(MFT)	Invariance Test(INV)	Directional Expectation Test(DIR)
VOCABULARY	15.0%	16.2%	34.6%
NER	0.0%	20.8%	-
NEGATION	76.4%	-	-
…

By calculating the failure rates for various test types and capabilities, we can know exactly where our model is weak.

Let’s understand each part of this conceptual matrix in detail now.

1. Test Types

These are the columns in the previous matrix. There are 3 types of tests proposed in the CheckList framework:

a. Minimum Functionality Test(MFT)

This test is similar to unit tests in software engineering. We build a collection of (text, expected label) pairs from scratch and test the model on this collection.

For example, we are testing the negation capability of the model using an MFT test below.

Template: I {NEGATION} {POS_VERB} the {THING}

The goal of this test is to make sure the model is not taking any shortcuts and possesses linguistic capabilities.

b. Invariance Test(INV)

In this test, we perturb our existing training examples in a way that the label should not change. Then, the model is tested on the perturbed examples and passes the test only if its prediction remains the same (i.e. invariant).

For example, changing the location from Chicago to Dallas should not change the original sentiment of a text.

We can use different perturbation functions to test different capabilities. The paper mentions two examples:

Capability	Perturbation	Invariance
NER	Change location name in text	Should not change sentiment
Robustness	Add typos to the text	Should not change prediction
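An invariance test boils down to comparing predictions before and after a perturbation. The sketch below uses plain Python rather than the CheckList tool; `run_invariance_test`, `swap_location`, and `toy_predict` are hypothetical names, and the toy model is deliberately flawed so that one test fails.

```python
def run_invariance_test(texts, perturb, predict):
    """Return the failure rate: share of texts whose prediction changes."""
    failures = 0
    for text in texts:
        if predict(text) != predict(perturb(text)):
            failures += 1
    return failures / len(texts)


def swap_location(text):
    """Label-preserving perturbation: replace one city name with another."""
    return text.replace("Chicago", "Dallas")


def toy_predict(text):
    """Toy stand-in for a sentiment model that wrongly dislikes 'Dallas'."""
    return "negative" if "Dallas" in text else "positive"


texts = ["We had a safe travel to Chicago", "The crew was great"]
print(run_invariance_test(texts, swap_location, toy_predict))
# 0.5
```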

c. Directional Expectation Test(DIR)

This test is similar to the invariance test but here we expect the model prediction to change after perturbation.

For example, if we add a text “You are lame” to the end of a text, the expectation is that sentiment of the original text will not move towards a positive direction.

We can also write tests where we expect the target label to change. For example, consider the QQP task where we need to detect if two questions are duplicates or not.

If we have a pair of duplicate questions and we change the location in one of the questions, then we expect the model to predict that they are not duplicates.

Capability	Question 1	Question 2	Expected	Predicted
NER	How many people are there in England?	What is the population of England?	Duplicate	Duplicate
NER	How many people are there in England?	What is the population of Turkey?	Not Duplicate	Duplicate
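The "You are lame" expectation described above can be sketched as a check that the positive score does not rise after the perturbation. Everything here is hypothetical: the function names, the toy scoring model, and the `tolerance` of 0.1, which is an assumed slack value so that tiny score fluctuations do not count as failures.

```python
def run_directional_test(texts, perturb, positive_score, tolerance=0.1):
    """Failure rate: share of texts whose positive score rises after perturbation."""
    failures = 0
    for text in texts:
        before = positive_score(text)
        after = positive_score(perturb(text))
        if after > before + tolerance:  # moved in the wrong direction
            failures += 1
    return failures / len(texts)


def add_negative_phrase(text):
    """Perturbation that should never make the sentiment more positive."""
    return text + " You are lame"


def toy_positive_score(text):
    """Toy stand-in for a model that is confused by second-person phrases."""
    return 0.5 if "You" in text else 0.2


texts = ["your service sucks.", "You did a poor job."]
print(run_directional_test(texts, add_negative_phrase, toy_positive_score))
# 0.5
```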

2. Linguistic Capabilities

These are the rows in the CheckList matrix. Each row contains a specific linguistic capability that applies to most NLP tasks.

Let’s understand examples of capabilities given in the original paper. The authors provide a lot of examples to help us build a mental model of how to test new capabilities relevant to our task and domain.

a. Vocabulary and POS

We want to ensure the model has enough vocabulary knowledge and can differentiate words with a different part of speech and how it impacts the task at hand.

For example, the paper shows the 3 test types for a sentiment analysis task.

Test Type	Example	Expected	Remarks
MFT	The company is Australian	neutral	neutral adjective and nouns
MFT	That cabin crew is extraordinary	positive	sentiment-laden adjectives
INV	~~the~~ ⮕ our nightmare continues	no change	Replace neutral words with other neutral words
DIR	AA45… JFK to LAS. You are brilliant	move towards +ve	Add positive phrase to end
DIR	your service sucks. You are lame	move towards -ve	Add negative phrase to end

This can also be applied for the QQP task as shown below.

Test Type	Question 1	Question 2	Expected	Remarks
MFT	Is John a teacher?	Is John an accredited teacher?	Not Duplicate	Modifiers change question intent

b. Named Entity Recognition(NER)

It tests the capability of the model to understand named entities and whether it is important for the current task or not.

We have examples of NER capability tests for sentiment analysis given below.

Test Type	Example	Expected	Remarks
INV	We had a safe travel to ~~Chicago~~ ⮕ Dallas	no change	Switching locations should not change predictions
INV	~~Benjamin~~ ⮕ Anna was your savior	no change	Switching person names should not change predictions

We can also apply this to the QQP task.

Test Type	Question 1	Question 2	Expected	Remarks
INV	Why isn’t Hillary Clinton ⮕ Nicole Perez in jail?	Is Hillary Clinton ⮕ Nicole Perez going to go to jail?	Duplicate	Changing name in both question
DIR	Why isn’t Hillary Clinton in jail?	Is Hillary Clinton ⮕ Nicole Perez going to go to jail?	Not Duplicate	Changing name in only one question
DIR	Why’s Hillary Clinton running?	Is Hillary Clinton going to go to jail?	Not Duplicate	Keep first word and entities, replace everything else with ROBERTA

c. Temporal

Here we want to test if the model understands the order of events in the text.

Below are examples of tests we can devise to evaluate this capability for a sentiment model.

Test Type	Example	Expected	Remarks
MFT	I used to hate this airline, although now I like it	positive	sentiment change over time, the present should prevail
MFT	In the past I thought this airline was perfect, now I think it is creepy	negative	sentiment change over time, the present should prevail

Similarly, we can devise temporal capability tests for QQP data as well.

Test Type	Question 1	Question 2	Expected	Remarks
MFT	Is Jordan Perry an advisor?	Did Jordan Perry use to be an advisor?	Not duplicate	is != used to be
MFT	Is it unhealthy to eat after 10pm?	Is it unhealthy to eat before 10pm?	Not duplicate	before != after
MFT	What was Danielle Bennett’s life before becoming an agent?	What was Danielle Bennett’s life after becoming an agent?	Not duplicate	before becoming != after becoming

d. Negation

This ensures the model understands negation and its impact on the output.

Below are examples of tests we can devise to evaluate negation capabilities for a sentiment model.

Test Type	Example	Expected	Remarks
MFT	The aircraft is not bad	positive/neutral	negated negative
MFT	This aircraft is not private	neutral	negated neutral
MFT	I thought the plane would be awful, but it wasn’t	positive/neutral	negation of negative at end
MFT	I wouldn’t say, given it’s a Tuesday, that this pilot was great	negative	negated positive with neutral content in middle

Similarly, we can devise negation capability tests for QQP data as well.

Test Type	Question 1	Question 2	Expected	Remarks
MFT	How can I become a positive person?	How can I become a person who is not positive?	Not duplicate	simple negation
MFT	How can I become a positive person?	How can I become a person who is not negative?	Duplicate	negation of antonym

e. Semantic Role Labeling(SRL)

This ensures the model understands the agent and the object in the text.

Below are examples of tests we can devise to evaluate SRL capabilities for a sentiment model.

Test Type	Example	Expected	Remarks
MFT	Some people hate him, but I think the pilot was fantastic	positive	Author sentiment more important than others
MFT	Do I think the pilot was fantastic? Yes.	positive	parsing sentiment in (question, “yes”) form
MFT	Do I think the pilot was fantastic? No.	negative	parsing sentiment in (question, “no”) form

Similarly, we can devise SRL capability tests for QQP data as well.

Test Type	Question 1	Question 2	Expected	Remarks
MFT	Are tigers heavier than insects?	What is heavier, insects or tigers?	Duplicate	Comparison
MFT	Is Anna related to Benjamin?	Is Benjamin related to Anna?	Duplicate	Symmetric relation
MFT	Is Anna hurting Benjamin?	Is Benjamin hurting Anna?	Not Duplicate	Asymmetric relation
MFT	Does Anna love Benjamin?	Is Benjamin loved by Anna?	Duplicate	Active / passive swap, same semantics
MFT	Does Anna support Benjamin?	Is Anna supported by Benjamin?	Not Duplicate	Active / passive swap, different semantics

f. Robustness

This ensures that the model can handle small variations or perturbations to the input text such as typos and irrelevant changes.

Below are examples of tests we can devise to evaluate robustness capabilities for a sentiment model.

Test Type	Example	Expected	Remarks
INV	@JetBlue no thanks @pi9QDK	no change	Add randomly generated URLs and handles to tweets
INV	@SouthwestAir no thanks -> thakns	no change	Swap one character with its neighbor (typo)

Similarly, we can devise robustness capability tests for QQP data as well.

Test Type	Question 1	Question 2	Expected	Remarks
INV	Why am I ~~getting~~ ⮕ gettnig lazy?	Why are we so lazy?	Duplicate	Swap one character with neighbor
DIR	Can I gain weight from not eating enough?	~~Can I~~ ⮕ Do you think I can gain weight from not eating enough?	Duplicate	Paraphrasing

g. Taxonomy

This ensures that the model has an understanding of synonyms and antonyms and how they affect the task at hand.

Below are examples of tests we can devise to evaluate taxonomy capabilities for the QQP task.

Test Type	Question 1	Question 2	Expected	Remarks
MFT	How can I become more vocal?	How can I become more outspoken?	Duplicate	Synonyms in simple template
MFT	How can I become more optimistic?	How can I become less pessimistic?	Duplicate	More X = Less antonym(X)
INV	Is it necessary to follow a religion?	Is it necessary to follow an ~~organized~~ ⮕ organised religion?	Duplicate	Replace words with synonyms in real pairs

h. Coreference Resolution

This ensures that the model has an understanding of pronouns and what nouns they refer to.

Below are examples of tests we can devise to evaluate coreference capabilities for the QQP task.

Test Type	Question 1	Question 2	Expected	Remarks
MFT	If Anna and Benjamin were alone, do you think he would reject her?	If Anna and Benjamin were alone, do you think she would reject him?	Not Duplicate	Simple coreference: he != she
MFT	If Benjamin and Anna were married, do you think Anna’s family would be happy?	If Benjamin and Anna were married, do you think his family would be happy?	Not Duplicate	Simple resolved coreference, his and her

i. Logic

This ensures that the model can handle symmetry, consistency, and conjunctions.

For example, in the QQP task, the order of the question shouldn’t matter. If question 1 is a duplicate of question 2, then question 2 will also be a duplicate of question 1 by symmetry.

Test Type	Question 1	Question 2	Expected	Remarks
INV	Can I gain weight from not eating enough?	Do you think I can gain weight from not eating enough?	Duplicate	Original Order
	Do you think I can gain weight from not eating enough?	Can I gain weight from not eating enough?	Duplicate	Reversed Order

j. Fairness

This tests if the model reflects any form of bias towards a demographic from the training data.

Below are examples of tests we can devise to evaluate the fairness of a sentiment model. The model prediction failures are for the BERT model as shown in the paper.

Test Type	Example	Predicted	Expected	Remarks
MFT	I am a black woman	Negative	Neutral	Racial Bias
MFT	I am a asian woman	Positive	Neutral	Racial Bias
MFT	I am a straight woman	Positive	Neutral	Sexual Orientation Bias
MFT	I am a gay woman	Negative	Neutral	Sexual Orientation Bias
MFT	I am a lesbian woman	Negative	Neutral	Sexual Orientation Bias

3. Test Generation

The paper’s authors have open-sourced a software tool that can generate test cases at scale based on the ideas above.

The tool provides three approaches to write test cases:

Approach	Idea	Advantage	Disadvantage
Scratch	Write tests manually	High Quality	Low Coverage, Expensive, Time-consuming
Perturbation Function	Apply perturbation to texts	Lots of Automated Tests	Low Quality
Template	Use templates and generate many variations	Balance of Quality and Quantity	Need to brainstorm Templates

To generate templates, you can either brainstorm them from scratch or generalize patterns from your existing data.

a. Manually Generated Templates

For example, if we had a text such as “I didn’t love the food” in our training data, we can generalize it as:

Original Text	Generalized Template
I didn’t love the food	I {NEGATION} {POS_VERB} the {THING}

Now, you can brainstorm possible fillers for the various template parts.

{NEGATION}	{POS_VERB}	{THING}
didn’t, can’t say I, …	love, like, …	food, flight, services, …

By taking the Cartesian product of all these possibilities, we can generate a lot of test cases.

{NEGATION}	{POS_VERB}	{THING}	Variation	Expected Label
didn’t	love	food	I didn’t love the food	Negative
didn’t	like	food	I didn’t like the food	Negative
didn’t	love	flight	I didn’t love the flight	Negative
didn’t	love	services	I didn’t love the services	Negative
		…
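The expansion in the table above can be sketched with `itertools.product`. This is plain Python for illustration, not the CheckList tool's own API, and the filler lists only contain the examples named in the text.

```python
from itertools import product

template = "I {NEGATION} {POS_VERB} the {THING}"
fillers = {
    "NEGATION": ["didn't", "can't say I"],
    "POS_VERB": ["love", "like"],
    "THING": ["food", "flight", "services"],
}

# Cartesian product of all filler lists, one test case per combination
keys = list(fillers)
test_cases = [
    (template.format(**dict(zip(keys, combo))), "Negative")
    for combo in product(*(fillers[key] for key in keys))
]

print(len(test_cases))  # 12
print(test_cases[0])    # ("I didn't love the food", 'Negative')
```

Every generated sentence shares the expected label "Negative" because each one negates a positive verb.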

b. Masked Language Model Template

Instead of manually specifying fill-ins for the template, we can also use MLM models like RoBERTa and use masking to generate variants.

For example, here we are using RoBERTa to suggest words for the mask and then we manually filter them into positive/negative/neutral.

Template	RoBERTa Prediction	Manual Filtering
I really {mask} the flight	enjoyed	positive
	liked	positive
	loved	positive
	regret	negative
	…

These fill-ins can be reused across multiple tests. The paper also suggests using WordNet to select only context-appropriate synonyms from RoBERTa.

c. Built-in Fill-ins

CheckList also provides out-of-box support for lexicons such as:

NER: common first/last names, cities and countries
Protected Group Adjectives: Nationalities, Religions, Gender, Sexuality

d. Built-in Perturbations

CheckList also provides perturbation functions such as character swaps, contractions, name and location changes, and neutral word replacement.
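To make the idea of a perturbation function concrete, here is a minimal sketch of the character-swap typo. It is a hand-written illustration, not the implementation shipped with CheckList, and the name `swap_adjacent_chars` and its optional `rng` parameter are hypothetical.

```python
import random


def swap_adjacent_chars(text, rng=None):
    """Introduce a typo by swapping one random character with its neighbor."""
    rng = rng or random.Random()
    if len(text) < 2:
        return text  # nothing to swap
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# Passing a seeded generator makes the perturbation reproducible
print(swap_adjacent_chars("no thanks", random.Random(0)))
```

A function like this can be plugged into an invariance test, since a single typo should not change the predicted label.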

Conclusion

Thus, CheckList provides a general framework to perform a comprehensive and fine-grained evaluation of NLP models. This can help us better understand the state of NLP models beyond the leaderboard.

References

Marco Tulio Ribeiro et al., “Beyond Accuracy: Behavioral Testing of NLP models with CheckList”

Semi-Supervised Learning in Computer Vision

Sun, 12 Jul 2020 00:00:00 GMT

Semi-supervised learning methods for Computer Vision have been advancing quickly in the past few years. Current state-of-the-art methods are simplifying prior work in terms of architecture and loss function or introducing hybrid methods by blending different formulations.

In this post, I will illustrate the key ideas of these recent methods for semi-supervised learning through diagrams.

1. Self-Training

In this semi-supervised formulation, a model is trained on labeled data and used to predict pseudo-labels for the unlabeled data. The model is then trained on both ground truth labels and pseudo-labels simultaneously.

a. Pseudo-label

Lee (2013) proposed a very simple and efficient formulation called “Pseudo-label” in 2013.

The idea is to train a model simultaneously on a batch of both labeled and unlabeled images. The model is trained on labeled images in the usual supervised manner with a cross-entropy loss. The same model is used to get predictions for a batch of unlabeled images and the maximum confidence class is used as the pseudo-label. Then, cross-entropy loss is calculated by comparing model predictions and the pseudo-label for the unlabeled images.

The total loss is a weighted sum of the labeled and unlabeled loss terms.

To make sure the model has learned enough from the labeled data, the weight on the unlabeled loss is set to 0 during the initial 100 training steps. It is then gradually increased until training step 600 and kept constant afterward.
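The schedule described above can be sketched as a small function. This is only an illustration under stated assumptions: the ramp between steps 100 and 600 is taken to be linear, and the final weight of 3.0 is an assumed value rather than something stated in this post. The names unlabeled_weight and total_loss are hypothetical.

```python
def unlabeled_weight(step, ramp_start=100, ramp_end=600, final_weight=3.0):
    """Weight on the unlabeled loss at a given training step.

    Assumes a linear ramp and a final weight of 3.0 (both are assumptions).
    """
    if step < ramp_start:
        return 0.0  # rely on labeled data only at the beginning
    if step < ramp_end:
        return final_weight * (step - ramp_start) / (ramp_end - ramp_start)
    return final_weight  # constant after the ramp

def total_loss(labeled_loss, unlabeled_loss, step):
    """Weighted sum of the labeled and unlabeled loss terms."""
    return labeled_loss + unlabeled_weight(step) * unlabeled_loss

for step in (50, 350, 600):
    print(step, unlabeled_weight(step))
```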

b. Noisy Student

Xie et al. (2019b) proposed a semi-supervised method inspired by Knowledge Distillation called “Noisy Student” in 2019.

The key idea is to train two separate models called “Teacher” and “Student”. The teacher model is first trained on the labeled images and then it is used to infer the pseudo-labels for the unlabeled images. These pseudo-labels can either be soft-label or converted to hard-label by taking the most confident class. Then, the labeled and unlabeled images are combined together and a student model is trained on this combined data. The images are augmented using RandAugment as a form of input noise. Also, model noise such as Dropout and Stochastic Depth are incorporated in the student model architecture.

Once a student model is trained, it becomes the new teacher and this process is repeated for three iterations.

2. Consistency Regularization

This paradigm uses the idea that model predictions on an unlabeled image should remain the same even after adding noise. We could use input noise such as Image Augmentation and Gaussian noise. Noise can also be incorporated in the architecture itself using Dropout.

a. π-model

This model was proposed by Laine and Aila (2017) in a conference paper at ICLR 2017.

The key idea is to create two random augmentations of an image for both labeled and unlabeled data. Then, a model with dropout is used to predict the label of both these images. The square difference of these two predictions is used as a consistency loss. For labeled images, we also calculate the cross-entropy loss. The total loss is a weighted sum of these two loss terms. A weight w(t) is applied to decide how much the consistency loss contributes in the overall loss.
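A rough sketch of this loss, assuming the two predictions are plain lists of class probabilities and that the squared difference is averaged over classes (the averaging is an assumption). The function names are hypothetical; for an unlabeled image, the cross-entropy term would simply be passed as 0.

```python
def squared_difference(pred_a, pred_b):
    """Mean squared difference between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)

def pi_model_loss(cross_entropy, pred_a, pred_b, w_t):
    """Cross-entropy plus the consistency loss weighted by w(t)."""
    return cross_entropy + w_t * squared_difference(pred_a, pred_b)

# Two predictions for differently augmented views of the same image
print(squared_difference([0.7, 0.3], [0.5, 0.5]))
```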

b. Temporal Ensembling

This method was also proposed by Laine and Aila (2017) in the same paper as the π-model. It modifies the π-model by leveraging the Exponential Moving Average (EMA) of predictions.

The key idea is to use the exponential moving average of past predictions as one view. To get another view, we augment the image as usual and a model with dropout is used to predict the label. The square difference of current prediction and EMA prediction is used as a consistency loss. For labeled images, we also calculate the cross-entropy loss. The final loss is a weighted sum of these two loss terms. A weight w(t) is applied to decide how much the consistency loss contributes in the overall loss.
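The EMA update of the stored predictions can be sketched as below. This is a simplified illustration: the momentum value of 0.6 is an assumption, the function name is hypothetical, and any bias correction of the running average is left out.

```python
def update_ema_prediction(ema_pred, current_pred, momentum=0.6):
    """Blend the running average of past predictions with the current prediction."""
    return [
        momentum * old + (1.0 - momentum) * new
        for old, new in zip(ema_pred, current_pred)
    ]

# Running average for one image across two epochs
ema = [0.0, 0.0]
ema = update_ema_prediction(ema, [0.9, 0.1])
ema = update_ema_prediction(ema, [0.7, 0.3])
print(ema)
```

The resulting vector is then used as the second view when computing the consistency loss for that image.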

c. Mean Teacher

This method was proposed by Tarvainen and Valpola (2017). The general approach is similar to Temporal Ensembling but it uses Exponential Moving Average(EMA) of the model parameters instead of predictions.

The key idea is to have two models called “Student” and “Teacher”. The student model is a regular model with dropout. The teacher model has the same architecture as the student model, but its weights are set using an exponential moving average of the weights of the student model. For a labeled or unlabeled image, we create two randomly augmented versions of the image. Then, the student model is used to predict the label distribution for the first image, and the teacher model is used to predict the label distribution for the second augmented image. The squared difference of these two predictions is used as a consistency loss. For labeled images, we also calculate the cross-entropy loss. The final loss is a weighted sum of these two loss terms. A weight w(t) is applied to decide how much the consistency loss contributes to the overall loss.
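The teacher update can be sketched as follows, with model weights represented as a plain dictionary of floats for simplicity. The decay value of 0.99 and the name update_teacher are assumptions made for this illustration; a real implementation would apply the same rule to every parameter tensor.

```python
def update_teacher(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight toward the matching student weight using an EMA."""
    return {
        name: decay * teacher_weights[name] + (1.0 - decay) * student_weights[name]
        for name in teacher_weights
    }

teacher = {"w": 1.0, "b": 0.0}
student = {"w": 3.0, "b": 2.0}
print(update_teacher(teacher, student, decay=0.5))
```

Only the student is updated by gradient descent; the teacher changes only through this averaging step.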

d. Virtual Adversarial Training

This method was proposed by Miyato et al. (2019). It uses the concept of adversarial attack for consistency regularization.

The key idea is to generate an adversarial transformation of an image that will change the model prediction. To do so, first, an image is taken and an adversarial variant of it is created such that the KL-divergence between the model output for the original image and the adversarial image is maximized.

Then we proceed as in the previous methods. We take a labeled/unlabeled image as the first view and its adversarial example generated in the previous step as the second view. Then, the same model is used to predict label distributions for both images. The KL-divergence of these two predictions is used as a consistency loss. For labeled images, we also calculate the cross-entropy loss. The final loss is a weighted sum of these two loss terms. A weight is applied to decide how much the consistency loss contributes to the overall loss.
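For reference, the KL-divergence between two predicted label distributions can be computed as in this small sketch. The function name and the small eps constant used to avoid taking the logarithm of zero are assumptions for illustration; generating the adversarial image itself is not shown.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions given as lists."""
    return sum(
        p_i * math.log((p_i + eps) / (q_i + eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )

original_pred = [0.9, 0.1]     # prediction on the original image
adversarial_pred = [0.5, 0.5]  # prediction on its adversarial variant
print(kl_divergence(original_pred, adversarial_pred))
```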

e. Unsupervised Data Augmentation

This method was proposed by Xie et al. (2019a) and works for both images and text. Here, we will understand the method in the context of images.

The key idea is to create an augmented version of an unlabeled image using AutoAugment. Then, the same model is used to predict the labels of both the original and the augmented image. The KL-divergence of these two predictions is used as a consistency loss. For labeled images, we only calculate the cross-entropy loss and don’t calculate any consistency loss. The final loss is a weighted sum of these two loss terms. A weight w(t) is applied to decide how much the consistency loss contributes to the overall loss.

3. Hybrid Methods

This paradigm combines ideas from previous work such as self-training and consistency regularization along with additional components for performance improvement.

a. MixMatch

This holistic method was proposed by Berthelot et al. (2019).

To understand this method, let’s take a walk through each of the steps.

For the labeled image, we create an augmentation of it. For the unlabeled image, we create K augmentations and get the model predictions on all K-images. Then, the predictions are averaged and temperature scaling is applied to get a final pseudo-label. This pseudo-label will be used for all the K-augmentations.
The batches of augmented labeled and unlabeled images are combined and the whole group is shuffled. Then, the first N images of this group are taken as one mixing group, and the remaining M images are taken as a second mixing group.
Now, Mixup is applied between the augmented labeled batch and the first mixing group. Similarly, Mixup is applied between the M augmented unlabeled images and the second mixing group. Thus, we get the final labeled and unlabeled groups.
Now, for the labeled group, we take model predictions and compute the cross-entropy loss with the ground truth mixup labels. Similarly, for the unlabeled group, we compute model predictions and compute the mean squared error (MSE) loss with the mixup pseudo-labels. A weighted sum of these two terms is taken, with a weight applied to the MSE loss.
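Two building blocks of the steps above, label guessing and Mixup, can be sketched in plain Python. This is an illustration under assumptions: images and labels are plain lists of floats, the temperature of 0.5 and the Beta parameter of 0.75 are assumed values, and the mixing coefficient is clamped to at least 0.5 so that each mixed example stays closer to its first input. All function names are hypothetical.

```python
import random

def average_predictions(predictions):
    """Average K probability vectors element-wise."""
    return [sum(column) / len(predictions) for column in zip(*predictions)]

def sharpen(probs, temperature=0.5):
    """Temperature scaling: a lower temperature gives a more peaked distribution."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """Convex combination of two examples and their label vectors."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # keep the result closer to (x1, y1)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y

# Pseudo-label from K=2 augmentations of one unlabeled image
guess = sharpen(average_predictions([[0.8, 0.2], [0.4, 0.6]]))
print(guess)
```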

b. FixMatch

This method was proposed by Sohn et al. (2020) and combines pseudo-labeling and consistency regularization while vastly simplifying the overall method. It got state of the art results on a wide range of benchmarks.

We train a supervised model on our labeled images with cross-entropy loss. For each unlabeled image, a weak augmentation and a strong augmentation are applied to get two images. The weakly augmented image is passed to our model to get a prediction over classes. The probability of the most confident class is compared to a threshold. If it is above the threshold, we take that class as the pseudo-label. Then, the strongly augmented image is passed through our model to get a prediction over classes. This prediction is compared to the pseudo-label using cross-entropy loss. Both losses are combined and the model is optimized.
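The unlabeled part of this procedure can be sketched as below, assuming the model outputs are lists of class probabilities. The threshold of 0.95 is an assumed value and the function name is hypothetical.

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy on the strong view against the pseudo-label from the weak view.

    Returns 0.0 when the weak prediction is not confident enough.
    """
    confidence = max(weak_probs)
    if confidence < threshold:
        return 0.0  # the image is ignored for this step
    pseudo_label = weak_probs.index(confidence)
    return -math.log(strong_probs[pseudo_label])

print(fixmatch_unlabeled_loss([0.97, 0.03], [0.5, 0.5]))  # confident: contributes
print(fixmatch_unlabeled_loss([0.60, 0.40], [0.5, 0.5]))  # not confident: skipped
```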

If you want to learn more about FixMatch, I have an article that goes over it in depth.

Comparison of Methods

Here is a high-level summary of the differences between all the above-mentioned methods.

Method Name	Year	Unlabeled Loss	Augmentation
Pseudo-label	2013	Cross-Entropy	Random
π-model	2016	MSE	Random
Temporal Ensembling	2016	MSE	Random
Mean Teacher	2017	MSE	Random
Virtual Adversarial Training(VAT)	2017	KL-divergence	Adversarial transformation
Unsupervised Data Augmentation(UDA)	2019	KL-divergence	AutoAugment
MixMatch	2019	MSE	Random
Noisy Student	2019	Cross-Entropy	RandAugment
FixMatch	2020	Cross-Entropy	CTAugment / RandAugment

Common Evaluation Datasets

To evaluate the performance of these semi-supervised methods, the following datasets are commonly used. The authors simulate a low-data regime by using only a small portion(e.g. 40/250/4000/10000 examples) of the whole dataset as labeled and treating the remaining as the unlabeled set.
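A simple way to simulate such a low-data regime is sketched below. It assumes the dataset is a list of (image, label) pairs and uses a hypothetical helper split_labeled_unlabeled; note that this sketch samples uniformly at random, whereas benchmark setups often sample an equal number of labeled examples per class.

```python
import random

def split_labeled_unlabeled(examples, num_labeled, seed=0):
    """Keep labels for num_labeled random examples and drop them for the rest."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    labeled = [examples[i] for i in indices[:num_labeled]]
    # For the unlabeled set, keep only the image and discard the label
    unlabeled = [examples[i][0] for i in indices[num_labeled:]]
    return labeled, unlabeled

# Toy dataset of 1000 (image, label) pairs with 10 classes
dataset = [(f"img_{i}", i % 10) for i in range(1000)]
labeled, unlabeled = split_labeled_unlabeled(dataset, num_labeled=40)
print(len(labeled), len(unlabeled))
```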

Dataset	Classes	Image Size	Train	Validation	Unlabeled
CIFAR-10	10	32*32	50,000	10,000	-
CIFAR-100	100	32*32	50,000	10,000	-
STL-10	10	96*96	5,000	8,000	100,000
SVHN	10	32*32	73,257	26,032	531,131
ILSVRC-2012	1000	Varies	1.2 million	150,000	150,000

Conclusion

Thus, we got an overview of how semi-supervised methods for Computer Vision have progressed over the years. This is a really important line of research that can have a direct impact on the industry.

References

David Berthelot, Nicholas Carlini, I. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019. MixMatch: A holistic approach to semi-supervised learning. Neural Information Processing Systems.

Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In 5th international conference on learning representations, ICLR 2017, toulon, france, april 24-26, 2017, conference track proceedings. OpenReview.net.

Dong-Hyun Lee. 2013. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks. In

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2019. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 41(8):1979–1993.

Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, E. D. Cubuk, Alexey Kurakin, Han Zhang, and Colin Raffel. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Neural Information Processing Systems.

Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Neural Information Processing Systems.

Qizhe Xie, Zihang Dai, E. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019a. Unsupervised data augmentation for consistency training. Neural Information Processing Systems.

Qizhe Xie, E. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019b. Self-training with noisy student improves ImageNet classification. Computer Vision and Pattern Recognition.

Citation

BibTeX citation:

@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {Semi-Supervised {Learning} in {Computer} {Vision}},
  date = {2020-07-12},
  url = {https://amitness.com/posts/semi-supervised-learning.html},
  langid = {en}
}

For attribution, please cite this work as:

Amit Chaudhary. 2020. Semi-Supervised Learning in Computer Vision.

FastAPI for Flask Users

Mon, 29 Jun 2020 00:00:00 GMT

While Flask has become the de-facto choice for API development in Machine Learning projects, there is a new framework called FastAPI that has been getting a lot of community traction.

I recently decided to give FastAPI a spin by porting a production Flask project. It was very easy to pick up FastAPI coming from Flask and I was able to get things up and running in just a few hours.

The added benefit of automatic data validation, documentation generation and baked-in best-practices such as pydantic schemas and python typing makes this a strong choice for future projects.

In this post, I will introduce FastAPI by contrasting the implementation of various common use-cases in both Flask and FastAPI.

Version Info:

At the time of this writing, the Flask version is 1.1.2 and the FastAPI version is 0.58.1

Installation

Both Flask and FastAPI are available on PyPI. For conda, you need to use the conda-forge channel to install FastAPI while it’s available in the default channel for Flask.

Flask:

shell

pip install flask
conda install flask

FastAPI:

shell

pip install fastapi uvicorn
conda install fastapi uvicorn -c conda-forge

Running “Hello World”

Flask:

app.py

from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    return {'hello': 'world'}

if __name__ == '__main__':
    app.run()

Now you can run the development server using the below command. It runs on port 5000 by default.

shell

python app.py

FastAPI

app.py

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get('/')
def home():
    return {'hello': 'world'}

if __name__ == '__main__':
    uvicorn.run(app)

FastAPI defers serving to a production-ready server called uvicorn. We can run it in development mode with a default port of 8000.

shell

python app.py

Production server

Flask:

app.py

from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    return {'hello': 'world'}

if __name__ == '__main__':
    app.run()

For a production server, gunicorn is a common choice in Flask.

shell

gunicorn app:app

FastAPI

app.py

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get('/')
def home():
    return {'hello': 'world'}

if __name__ == '__main__':
    uvicorn.run(app)

FastAPI defers serving to a production-ready server called uvicorn. We can start the server as:

shell

uvicorn app:app

You can also start it in hot-reload mode by running

shell

uvicorn app:app --reload

Furthermore, you can change the port as well.

shell

uvicorn app:app --port 5000

The number of workers can be controlled as well.

shell

uvicorn app:app --workers 2

You can use gunicorn to manage uvicorn as well using the following command. All regular gunicorn flags such as number of workers(-w) work.

shell

gunicorn -k uvicorn.workers.UvicornWorker app:app

HTTP Methods

Flask:

@app.route('/', methods=['POST'])
def example():
    ...

FastAPI:

@app.post('/')
def example():
    ...

You have individual decorator methods for each HTTP method.

@app.get('/')
@app.put('/')
@app.patch('/')
@app.delete('/')

URL Variables

We want to get the user id from the URL e.g. /users/1 and then return the user id to the user.

Flask:

@app.route('/users/<int:user_id>')
def get_user_details(user_id):
    return {'user_id': user_id}

FastAPI:

In FastAPI, we make use of type hints in Python to specify all the data types. For example, here we specify that user_id should be an integer. The variable in the URL path is also specified similar to f-strings.

@app.get('/users/{user_id}')
def get_user_details(user_id: int):
    return {'user_id': user_id}

Query Strings

We want to allow the user to specify a search term by using a query string ?q=abc in the URL.

Flask:

from flask import request

@app.route('/search')
def search():
    query = request.args.get('q')
    return {'query': query}

FastAPI:

@app.get('/search')
def search(q: str):
    return {'query': q}
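From the client side, the query string for either implementation could be built with the standard library as in this sketch. The helper name build_search_url is hypothetical, and the base URL assumes the FastAPI development server on its default port of 8000.

```python
from urllib.parse import urlencode

def build_search_url(base_url, query):
    """Build the /search URL with the search term passed as the q parameter."""
    return f"{base_url}/search?{urlencode({'q': query})}"

print(build_search_url("http://127.0.0.1:8000", "abc"))
```

Using urlencode takes care of escaping spaces and special characters in the search term.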

JSON POST Request

Let’s take a toy example where we want to send a JSON POST request with a text key and get back a lowercased version.

# Request
{"text": "HELLO"}

# Response
{"text": "hello"}

Flask:

from flask import request

@app.route('/lowercase', methods=['POST'])
def lower_case():
    text = request.json.get('text')
    return {'text': text.lower()}

FastAPI:
If you simply replicate the functionality from Flask, you can do it as follows in FastAPI.

from typing import Dict

@app.post('/lowercase')
def lower_case(json_data: Dict):
    text = json_data.get('text')
    return {'text': text.lower()}

But, this is where FastAPI introduces a new concept of creating Pydantic schema that maps to the JSON data being received. We can refactor the above example using pydantic as:

from pydantic import BaseModel

class Sentence(BaseModel):
    text: str

@app.post('/lowercase')
def lower_case(sentence: Sentence):
    return {'text': sentence.text.lower()}

As seen, instead of getting a dictionary, the JSON data is converted into an object of the schema Sentence. As such, we can access the data using data attributes such as sentence.text. This also provides automatic validation of data types. If the user tries to send any data other than a string, they will be given an auto-generated validation error.
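To try the endpoint from a client, the JSON request can be prepared with the standard library as sketched here. The helper name build_lowercase_request is hypothetical and the base URL assumes a local development server on port 8000; the request is only constructed here, and passing it to urllib.request.urlopen would actually send it.

```python
import json
from urllib.request import Request

def build_lowercase_request(base_url, text):
    """Prepare a JSON POST request for the /lowercase endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        f"{base_url}/lowercase",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_lowercase_request("http://127.0.0.1:8000", "HELLO")
print(request.get_method(), request.full_url, request.data)
```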

Example Invalid Request

{"text": null}

Automatic Response

{
    "detail": [
        {
            "loc": [
                "body",
                "text"
            ],
            "msg": "none is not an allowed value",
            "type": "type_error.none.not_allowed"
        }
    ]
}
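A client can turn this error payload into readable messages with a few lines of Python. This sketch assumes the response has the shape shown above, with a detail list whose items carry loc and msg keys; the function name is hypothetical.

```python
import json

def summarize_validation_errors(response_body):
    """Convert a validation error response into 'location: message' strings."""
    payload = json.loads(response_body)
    return [
        f"{'.'.join(str(part) for part in error['loc'])}: {error['msg']}"
        for error in payload["detail"]
    ]

body = """
{
    "detail": [
        {
            "loc": ["body", "text"],
            "msg": "none is not an allowed value",
            "type": "type_error.none.not_allowed"
        }
    ]
}
"""
print(summarize_validation_errors(body))
```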

File Upload

Let’s create an API to return the uploaded file name. The key used when uploading the file will be file.

Flask
Flask allows accessing the uploaded file via the request object.

app.py


from flask import Flask, request
app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload_file():
    file = request.files.get('file')
    return {'name': file.filename}

FastAPI:
FastAPI uses a function parameter to specify the file key.

app.py

from fastapi import FastAPI, UploadFile, File

app = FastAPI()

@app.post('/upload')
def upload_file(file: UploadFile = File(...)):
    return {'name': file.filename}

Form Submission

We want to access a text form field that’s defined as shown below and echo the value.

<input name='city' type='text'>

Flask
Flask allows accessing the form fields via the request object.

app.py


from flask import Flask, request
app = Flask(__name__)

@app.route('/submit', methods=['POST'])
def echo():
    city = request.form.get('city')
    return {'city': city}

FastAPI:
We use a function parameter to define the key and data type for the form field.

app.py

from fastapi import FastAPI, Form
app = FastAPI()

@app.post('/submit')
def echo(city: str = Form(...)):
    return {'city': city}

We can also make the form field optional as shown below

from typing import Optional

@app.post('/submit')
def echo(city: Optional[str] = Form(None)):
    return {'city': city}

Similarly, we can set a default value for the form field as shown below.

@app.post('/submit')
def echo(city: Optional[str] = Form('Paris')):
    return {'city': city}

Cookies

We want to access a cookie called name from the request.

Flask
Flask allows accessing the cookies via the request object.

app.py


from flask import Flask, request
app = Flask(__name__)

@app.route('/profile')
def profile():
    name = request.cookies.get('name')
    return {'name': name}

FastAPI:
We use a function parameter to define the key for the cookie.

app.py

from fastapi import FastAPI, Cookie
app = FastAPI()

@app.get('/profile')
def profile(name = Cookie(None)):
    return {'name': name}

Modular Views

We want to decompose the views from a single app.py into separate files.

- app.py
- views
  - user.py

Flask:
In Flask, we use a concept called blueprints to manage this. We would first create a blueprint for the user view as:

views/user.py

from flask import Blueprint
user_blueprint = Blueprint('user', __name__)

@user_blueprint.route('/users')
def list_users():
    return {'users': ['a', 'b', 'c']}

Then, this view is registered in the main app.py file.

app.py

from flask import Flask
from views.user import user_blueprint

app = Flask(__name__)
app.register_blueprint(user_blueprint)

FastAPI:
In FastAPI, the equivalent of a blueprint is called a router. First, we create a user router as:

routers/user.py

from fastapi import APIRouter
router = APIRouter()

@router.get('/users')
def list_users():
    return {'users': ['a', 'b', 'c']}

Then, we attach this router to the main app object as:

app.py

from fastapi import FastAPI
from routers import user

app = FastAPI()
app.include_router(user.router)

Data Validation

Flask
Flask doesn’t provide any input data validation feature out-of-the-box. It’s common practice to either write custom validation logic or use libraries such as marshmallow or pydantic.

FastAPI:

FastAPI wraps pydantic into its framework and allows data validation by simply using a combination of pydantic schema and python type hints.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class User(BaseModel):
    name: str
    age: int

@app.post('/users')
def save_user(user: User):
    return {'name': user.name,
            'age': user.age}

This code will perform automatic validation to ensure name is a string and age is an integer. If any other data type is sent, it auto-generates a validation error with a relevant message.

Here are some examples of pydantic schema for common use-cases.

Example 1: Key-value pairs

{
  "name": "Isaac",
  "age": 60
}

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

Example 2: Collection of things

{
  "series": ["GOT", "Dark", "Mr. Robot"]
}

from pydantic import BaseModel
from typing import List

class Metadata(BaseModel):
    series: List[str]

Example 3: Nested Objects

{
  "users": [
    {
      "name": "xyz",
      "age": 25
    },
    {
      "name": "abc",
      "age": 30
    }
  ],
  "group": "Group A"
}

from pydantic import BaseModel
from typing import List

class User(BaseModel):
    name: str
    age: int

class UserGroup(BaseModel):
    users: List[User]
    group: str

You can learn more about Python Type hints from here.

Automatic Documentation

Flask
Flask doesn’t provide any built-in feature for documentation generation. There are extensions such as flask-swagger or flask-restful to fill that gap but the workflow is comparatively complex.

FastAPI:
FastAPI automatically generates an interactive swagger documentation endpoint at /docs and a reference documentation at /redoc.

For example, say we had a simple view given below that echoes what the user searched for.

app.py

from fastapi import FastAPI

app = FastAPI()

@app.get('/search')
def search(q: str):
    return {'query': q}

Swagger Documentation

If you run the server and go to the endpoint http://127.0.0.1:8000/docs, you will get auto-generated swagger documentation.

You can interactively try out the API from the browser itself.

ReDoc Documentation

In addition to swagger, if you go to the endpoint http://127.0.0.1:8000/redoc, you will get auto-generated reference documentation. There is information on parameters, request format, response format and status codes.

Cross-Origin Resource Sharing(CORS)

Flask
Flask doesn’t provide CORS support out of the box. We need to use an extension such as flask-cors to configure CORS as shown below.

app.py


from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

FastAPI:
FastAPI provides a built-in middleware to handle CORS. We show an example of CORS below where we are allowing any origin to access our APIs.

app.py

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Conclusion

Thus, FastAPI is an excellent alternative to Flask for building robust APIs with best-practices baked in. You can refer to the documentation to learn more.

Google Colab Tips for Power Users

Fri, 26 Jun 2020 00:00:00 GMT

Colab is one of the best products to come from Google. It has made GPUs freely accessible to learners and practitioners like me who otherwise wouldn’t be able to afford a high-end GPU.

While the interface is very easy to use, there are many lesser-known and undocumented features in colab. In this post, I will share those features that I’ve discovered from basic usage and their official talks.

1. Scratchpad Notebook

It’s a pretty common scenario that we have a bunch of cluttered untitled notebooks created when we try out temporary stuff on colab.

Clutter of Untitled Notebooks in Colab

To solve this, you can bookmark the link given below. It will open a special scratch notebook and any changes you make to that notebook are not saved to your main account.

https://colab.research.google.com/notebooks/empty.ipynb

2. Timing Execution of Cell

It’s pretty common that we manually calculate the difference between start and end times of a piece of code to gauge the time taken.

Colab provides an inbuilt feature to do this. After a cell is executed, just hover over the cell run icon and you will get an estimate of the execution time taken.
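For comparison, the manual approach mentioned above usually looks something like this sketch, where timed is a hypothetical helper that wraps any function call.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = timed(sum, range(1_000_000))
print(f"Result: {result}, took {elapsed:.4f} seconds")
```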

3. Run part of a cell

You can also run only a part of the cell by selecting it and pressing the Runtime > Run Selection button or using the keyboard shortcut Ctrl + Shift + Enter.

4. Jupyter Notebook Keyboard Shortcuts

If you are familiar with keyboard shortcuts from Jupyter Notebook, they don’t work directly in Colab. But I found a mental model to map between them.

Just add Ctrl + M before whatever keyboard shortcut you were using in Jupyter. This rule of thumb works for the majority of common use-cases.

Action	Jupyter Notebook	Google Colab
Add a cell above	A	Ctrl + M + A
Add a cell below	B	Ctrl + M + B
See all keyboard shortcuts	H	Ctrl + M + H
Change cell to code	Y	Ctrl + M + Y
Change cell to markdown	M	Ctrl + M + M
Interrupt the kernel	II	Ctrl + M + I
Delete a cell	DD	Ctrl + M + D
Checkpoint notebook	Ctrl + S	Ctrl + M + S

Below are some notable exceptions to this rule for which either the shortcut is changed completely or kept the same.

Action	Jupyter Notebook	Google Colab
Restart runtime	00	Ctrl + M + .
Run cell	Ctrl + Enter	Ctrl + Enter
Run cell and add new cell below	Alt + Enter	Alt + Enter
Run cell and go to the next cell below	Shift + Enter	Shift + Enter
Comment current line	Ctrl + /	Ctrl + /

5. Jump to Class Definition

Similar to an IDE, you can go to a class definition by pressing Ctrl and then clicking a class name. For example, here we view the class definition of the Dense layer in Keras by pressing Ctrl and then clicking the Dense class name.

6. Open Notebooks from GitHub

The Google Colab team provides an official chrome extension to open notebooks on GitHub directly on colab. You can install it from here.

After installation, click the colab icon on any GitHub notebook to open it directly.

Alternatively, you can also manually open any GitHub notebook by replacing github.com with colab.research.google.com/github.

https://github.com/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb

https://colab.research.google.com/github/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb

An even easier way is to replace github.com with githubtocolab.com. It will redirect you to a colab notebook.

https://github.com/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb

https://githubtocolab.com/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb
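If you need to convert many links, the first replacement rule can be written as a small helper. The function name github_to_colab is hypothetical; it simply swaps the github.com prefix for colab.research.google.com/github.

```python
def github_to_colab(url):
    """Convert a GitHub notebook URL into its Colab equivalent."""
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError("Expected a URL starting with https://github.com/")
    return "https://colab.research.google.com/github/" + url[len(prefix):]

print(github_to_colab(
    "https://github.com/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb"
))
```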

7. Run Flask apps from Colab

With a library called flask-ngrok, you can easily expose a Flask web app running on colab to demo prototypes. First, you need to install flask and flask-ngrok.

!pip install flask-ngrok flask==0.12.2

Then, you just need to pass your flask app object to the run_with_ngrok function and it will expose an ngrok endpoint when the server is started.

from flask import Flask
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)

@app.route('/')
def hello():
    return 'Hello World!'

if __name__ == '__main__':
    app.run()

You can try this out from the package author’s official example on Colab.

8. Switch between Tensorflow versions

You can easily switch between Tensorflow 1 and Tensorflow 2 using this magic flag.
To switch to Tensorflow 1.15.2, use this command:

%tensorflow_version 1.x

To switch to Tensorflow 2.2, run this command:

%tensorflow_version 2.x

You will need to restart the runtime for the effect to take place. Colab recommends using the pre-installed Tensorflow version instead of installing it from pip for performance reasons.

9. Tensorboard Integration

Colab also provides a magic command to use Tensorboard directly from the notebook. You just need to set the logs directory location using the --logdir flag. You can learn to use it from the official notebook.

%load_ext tensorboard
%tensorboard --logdir logs

10. Gauge resource limits

Colab provides the following specs for their free and pro versions. Based on your use case, you can switch to the pro version at $10/month if you need a better runtime, GPU, and memory.

Version	GPU	GPU Ram	RAM	Storage	CPU Cores	Idle Timeout	Maximum Runtime
Free	Tesla K80	11.44GB	13.7GB	37GB	2	90 min	12 hrs
Pro	Tesla P100	16GB	27.4GB	37GB	4	90 min	24 hrs

You can view the GPU you have been assigned by running the following command

shell

!nvidia-smi

For information on the CPU, you can run this command

shell

!cat /proc/cpuinfo

Similarly, you can view the RAM capacity by running

import psutil
ram_gb = psutil.virtual_memory().total / 1e9
print(ram_gb)
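The disk capacity can be checked in a similar way using only the standard library. This is a small sketch; the helper name disk_capacity_gb is hypothetical and it assumes the root path / is the disk you care about.

```python
import shutil

def disk_capacity_gb(path="/"):
    """Return the total and free disk space in GB for the given path."""
    usage = shutil.disk_usage(path)
    return usage.total / 1e9, usage.free / 1e9

total_gb, free_gb = disk_capacity_gb()
print(f"Total: {total_gb:.1f} GB, Free: {free_gb:.1f} GB")
```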

11. Use interactive shell

There is no built-in interactive terminal in Colab. But you can use the bash command to try out shell commands interactively. Just run this command and you will get an interactive input.

shell

!bash

Now, you can run any shell command in the given input box.

To quit from the shell, just type exit in the input box.

12. Current memory and storage usage

Colab provides an indicator of RAM and disk usage. If you hover over the indicator, you will get a popup with the current usage and the total capacity.

13. “Open in Colab” Badge

You can add a ‘Open in Colab’ badge to your README.md or jupyter notebooks using the following markdown code.

In the markdown code, we’re loading an SVG image and then linking it to a colab notebook.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)
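If you maintain several notebooks, the badge markdown can be generated with a small helper like the sketch below. The function name colab_badge is hypothetical; the badge image URL is the one used in the markdown above.

```python
BADGE_IMAGE = "https://colab.research.google.com/assets/colab-badge.svg"

def colab_badge(notebook_url):
    """Return markdown for an 'Open In Colab' badge linking to notebook_url."""
    return f"[![Open In Colab]({BADGE_IMAGE})]({notebook_url})"

print(colab_badge("https://colab.research.google.com/notebooks/basic_features_overview.ipynb"))
```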

14. Interactive Tables for Pandas

Colab provides a notebook extension to add interactive sorting and filtering capabilities to pandas dataframes. To use it, run the following code.

%load_ext google.colab.data_table

You can see the regular pandas dataframe and the interactive dataframe after loading the extension below.

Regular pandas dataframe output

Interactive pandas dataframe output

15. Setup Conda environment

If you use miniconda as your python environment manager, you can set it up on colab by running these commands at the top of your notebook.

shell

# Download Miniconda installation script
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Make it executable
!chmod +x Miniconda3-latest-Linux-x86_64.sh

# Start installation in silent mode
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local

# Make conda packages available in current environment
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

After the cell is executed, you can use conda to install packages as usual.

shell

!conda install -y flask

Alternatively, you can use condacolab package to install it easily.

shell

pip install condacolab

Then, run these python commands to install miniconda.

import condacolab
condacolab.install_miniconda()

16. Manage Colab Notebooks from Command Line

You can use a library called colab-cli to easily create and sync colab notebooks with your local notebooks.

17. Run background tasks

There are use-cases when we need to start some web server or background tasks before we can execute our regular program.

To run background tasks, use the nohup command followed by your regular shell command and add & to the end to run it in the background. This makes sure that you can run cells afterward in the notebook without your background task blocking it.

shell

!nohup bash ping.sh &

18. Notify on Training Completion

If you’re running a long task such as training a model, you can set up Colab to send a desktop notification once it’s completed.

To enable that, go to Tools ⮕ Settings ⮕ Site and enable the Show desktop notifications checkbox.

You will get a popup to enable browser notification. Just accept it and colab will notify you on task completion even if you are on another tab, window or application.

19. Run javascript code

You can run javascript code by using the %%javascript magic command.

20. Run VSCode on Colab

You can run a full-fledged VSCode editor on Colab by following the method I have explained in another article.

21. Custom snippets

You can save your own collections of useful snippets and access them easily in any colab notebook.

Create a Colab notebook called snippets.ipynb. To add each of your snippets, create a markdown cell and add the name of the snippet as a header. Below the markdown cell, add a code cell with the snippet code.
Copy the link of this notebook from the browser tab.
Click Tools > Settings in your menu bar to open the preferences of Colab.
Paste the link into the Custom snippet notebook URL textbox and click save.

Now, the snippets are available in any colab notebook you use. Just click the <> icon on sidebar, search for your snippet name and click Insert. The code will be inserted into a new cell.

22. Run JupyterLab on Google Colab

You can start a JupyterLab instance on colab by running the following commands in a cell.

!pip install jupyterlab pyngrok -q

# Run jupyterlab in the background
!nohup jupyter lab --ip=0.0.0.0 &

# Get ngrok URL mapped to port 8888
from pyngrok import ngrok
print(ngrok.connect(8888))

Once executed, click the printed ngrok URL to access the JupyterLab interface.

23. Run R programs in Google Colab

You can use R programming language in Google Colab by going to https://colab.research.google.com/notebook#create=true&language=r. It will open a new notebook with R set as the kernel instead of Python.

References

Timothy Novikoff, “Making the most of Colab (TF Dev Summit ’20)”
Gal Oshri, “What’s new in TensorBoard (TF Dev Summit ’19)”

A Visual Guide to FastText Word Embeddings

Sun, 21 Jun 2020 00:00:00 GMT

Word Embeddings are one of the most interesting aspects of the Natural Language Processing field. When I first came across them, it was intriguing to see a simple recipe of unsupervised training on a bunch of text yield representations that show signs of syntactic and semantic understanding.

In this post, we will explore a word embedding algorithm called “FastText” that was introduced by Bojanowski et al. (2017) and understand how it enhances the Word2Vec algorithm from 2013.

Intuition on Word Representations

Suppose we have the following words and we want to represent them as vectors so that they can be used in Machine Learning models.

Ronaldo, Messi, Dicaprio

A simple idea could be to perform a one-hot encoding of the words, where each word gets a unique position.

	isRonaldo	isMessi	isDicaprio
Ronaldo	1	0	0
Messi	0	1	0
Dicaprio	0	0	1

We can see that this sparse representation doesn’t capture any relationship between the words and every word is isolated from each other.

Maybe we could do something better. We know Ronaldo and Messi are footballers while Dicaprio is an actor. Let’s use our world knowledge and create manual features to represent the words better.

	isFootballer	isActor
Ronaldo	1	0
Messi	1	0
Dicaprio	0	1

This is better than the previous one-hot-encoding because related items are closer in space.
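To make "closer in space" concrete, here is a small sketch that measures Euclidean distance between the word vectors from the two tables above. The helper name is made up for illustration.

```python
import math

one_hot = {
    "Ronaldo": [1, 0, 0],
    "Messi": [0, 1, 0],
    "Dicaprio": [0, 0, 1],
}
manual_features = {  # columns: isFootballer, isActor
    "Ronaldo": [1, 0],
    "Messi": [1, 0],
    "Dicaprio": [0, 1],
}

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

for name, table in [("one-hot", one_hot), ("manual", manual_features)]:
    footballers = euclidean(table["Ronaldo"], table["Messi"])
    mixed = euclidean(table["Ronaldo"], table["Dicaprio"])
    print(name, round(footballers, 3), round(mixed, 3))
```

With one-hot vectors every pair is equally far apart, while with the manual features the two footballers coincide and the actor stays at a distance.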

We could keep on adding even more aspects as dimensions to get a more nuanced representation.

	isFootballer	isActor	Popularity	Gender	Height	Weight	…
Ronaldo	1	0	…	…	…	…	…
Messi	1	0	…	…	…	…	…
Dicaprio	0	1	…	…	…	…	…

But manually doing this for every possible word is not scalable. If we designed features based on our world knowledge of the relationship between words, can we replicate the same with a neural network?

Can we have neural networks comb through a large corpus of text and generate word representations automatically?

This is the intention behind the research in word-embedding algorithms.

Recapping Word2Vec

In 2013, Mikolov et al. (2013) introduced an efficient method to learn vector representations of words from large amounts of unstructured text data. The paper was an execution of this idea from Distributional Semantics.

You shall know a word by the company it keeps - J.R. Firth 1957

Since similar words appear in a similar context, Mikolov et al. used this insight to formulate two tasks for representation learning.

The first was called “Continuous Bag of Words” where we need to predict the center word given the neighbor words.

The second task was called “Skip-gram” where we need to predict the neighbor words given a center word.

Representations learned had interesting properties such as this popular example where arithmetic operations on word vectors seemed to retain meaning.

Limitations of Word2Vec

While Word2Vec was a game-changer for NLP, we will see how there was still some room for improvement:

Out of Vocabulary(OOV) Words

In Word2Vec, an embedding is created for each word. As such, it can’t handle any words it has not encountered during its training.

For example, words such as “tensor” and “flow” are present in the vocabulary of Word2Vec. But if you try to get embedding for the compound word “tensorflow”, you will get an out of vocabulary error.

Morphology

For words with the same radicals such as “eat” and “eaten”, Word2Vec doesn’t do any parameter sharing. Each word is learned uniquely based on the context it appears in. Thus, there is scope for utilizing the internal structure of the word to make the process more efficient.

FastText

To solve the above challenges, Bojanowski et al. (2017) proposed a new embedding method called FastText. Their key insight was to use the internal structure of a word to improve vector representations obtained from the skip-gram method.

The modification to the skip-gram method is applied as follows:

1. Sub-word generation

For a word, we generate character n-grams of length 3 to 6 present in it.

We take a word and add angular brackets to denote the beginning and end of the word.

Then, we generate character n-grams of length n. For example, for the word “eating”, character n-grams of length 3 can be generated by sliding a window of 3 characters from the start of the angular bracket till the ending angular bracket is reached. Here, we shift the window one step each time.

Thus, we get a list of character n-grams for a word.
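The sliding-window procedure can be sketched in a few lines of Python. This is an illustrative helper rather than the code from the official fastText implementation, and the function name is an assumption.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in angular brackets."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n characters one step at a time.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("eating", n_min=3, n_max=3))
```

Calling it with the default lengths of 3 to 6 returns all the n-grams of the word in a single list.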

Examples of different length character n-grams are given below:

Word	Length(n)	Character n-grams
eating	3	<ea, eat, ati, tin, ing, ng>
eating	4	<eat, eati, atin, ting, ing>
eating	5	<eati, eatin, ating, ting>
eating	6	<eatin, eating, ating>

Since there can be a huge number of unique n-grams, we apply hashing to bound the memory requirements. Instead of learning an embedding for each unique n-gram, we learn a total of B embeddings where B denotes the bucket size. The paper used a bucket size of 2 million.

Each character n-gram is hashed to an integer between 1 and B. Though this could result in collisions, it helps control the vocabulary size. The paper uses the FNV-1a variant of the Fowler-Noll-Vo hashing function to hash character sequences to integer values.
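Here is a minimal sketch of the hashing trick, assuming the standard 32-bit FNV-1a constants and a simple modulo to pick the bucket. The function names are illustrative, the bucket index here is 0-based (0 to B-1) as is usual in code, and the official implementation may differ in details such as how non-ASCII bytes are handled.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of a string."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                     # FNV-1a: XOR first ...
        h = (h * FNV_PRIME) % 2**32   # ... then multiply, keeping 32 bits
    return h

def bucket_index(ngram, num_buckets=2_000_000):
    """Map an n-gram to one of num_buckets shared embedding rows."""
    return fnv1a_32(ngram) % num_buckets

for ngram in ["<ea", "eat", "ing", "ng>"]:
    print(ngram, bucket_index(ngram))
```

Two different n-grams can land in the same bucket and then share an embedding, which is the collision trade-off mentioned above.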

2. Skip-gram with negative sampling

To understand the pre-training, let’s take a simple toy example. We have a sentence with a center word “eating” and need to predict the context words “am” and “food”.

First, the embedding for the center word is calculated by taking the sum of the vectors for its character n-grams and the whole word itself.
For the actual context words, we directly take their word vectors from the embedding table without adding the character n-grams.
Now, we collect negative samples randomly with probability proportional to the square root of the unigram frequency. For one actual context word, 5 random negative words are sampled.
We take the dot product between the center word and each actual context word and apply the sigmoid function to get a match score between 0 and 1.
Based on the loss, we update the embedding vectors with an SGD optimizer to bring the actual context words closer to the center word and push the negative samples farther away.
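The steps above can be sketched in plain Python. This is a toy illustration under some assumptions: the vectors are small hand-written lists, the loss is the usual negative-sampling objective (log-sigmoid on the true pair plus log-sigmoid of the negated score for each negative), and the helper names are made up for this example. A real implementation would also avoid sampling the true context word as a negative and would apply the gradient updates.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center_vector(word_vec, ngram_vecs):
    """Sum the whole-word vector and its character n-gram vectors."""
    total = list(word_vec)
    for vec in ngram_vecs:
        total = [t + x for t, x in zip(total, vec)]
    return total

def sample_negatives(unigram_counts, k, rng):
    """Draw k words with probability proportional to sqrt(count)."""
    words = list(unigram_counts)
    weights = [math.sqrt(unigram_counts[w]) for w in words]
    return rng.choices(words, weights=weights, k=k)

def negative_sampling_loss(center, positive, negatives):
    """Low when the true pair scores high and the negatives score low."""
    loss = -math.log(sigmoid(dot(center, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss

# Toy usage with 2-dimensional vectors and hypothetical counts.
rng = random.Random(0)
context_table = {"am": [0.5, 0.1], "food": [0.4, 0.3], "paris": [-0.6, 0.2], "earth": [-0.1, -0.7]}
counts = {"am": 100, "food": 25, "paris": 9, "earth": 4}
center = center_vector([0.2, 0.1], [[0.1, 0.0], [0.3, 0.2]])
negative_words = sample_negatives(counts, k=5, rng=rng)
loss = negative_sampling_loss(center, context_table["food"], [context_table[w] for w in negative_words])
print(negative_words, loss)
```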

Insights from the Paper

FastText improves performance on syntactic word analogy tasks significantly for morphologically rich languages like Czech and German.

	word2vec-skipgram	word2vec-cbow	fasttext
Czech	52.8	55.0	77.8
German	44.5	45.0	56.4
English	70.1	69.9	74.9
Italian	51.5	51.8	62.7

FastText has degraded performance on semantic analogy tasks compared to Word2Vec.

	word2vec-skipgram	word2vec-cbow	fasttext
Czech	25.7	27.6	27.5
German	66.5	66.8	62.3
English	78.5	78.2	77.8
Italian	52.3	54.7	52.3

Using sub-word information with character n-grams gives better performance than the CBOW and skip-gram baselines on the word-similarity task. Representing out-of-vocabulary words by summing their sub-words performs better than assigning them null vectors.

		skipgram	cbow	fasttext(null OOV)	fasttext(char-ngrams for OOV)
Arabic	WS353	51	52	54	55
	GUR350	61	62	64	70
German	GUR65	78	78	81	81
	ZG222	35	38	41	44
English	RW	43	43	46	47
	WS353	72	73	71	71
Spanish	WS353	57	58	58	59
French	RG65	70	69	75	75
Romanian	WS353	48	52	51	54
Russian	HJ	69	60	60	66

FastText is 1.5 times slower to train than regular skipgram due to added overhead of n-grams.

Implementation

To train your own embeddings, you can either use the official CLI tool or use the fasttext implementation available in gensim.

Pre-trained word vectors trained on Common Crawl and Wikipedia for 157 languages are available here and variants of English word vectors are available here.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Citation

BibTeX citation:

@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {A {Visual} {Guide} to {FastText} {Word} {Embeddings}},
  date = {2020-06-21},
  url = {https://amitness.com/posts/fasttext-embeddings.html},
  langid = {en}
}

For attribution, please cite this work as:

Amit Chaudhary. 2020. A Visual Guide to FastText Word Embeddings.

Universal Sentence Encoder Visually Explained

Mon, 15 Jun 2020 00:00:00 GMT

With transformer models such as BERT and friends taking the NLP research community by storm, it might be tempting to just throw the latest and greatest model at a problem and declare it done. However, in industry, we have compute and memory limitations to consider and might not even have a dedicated GPU for inference.

Thus, it’s useful to keep simple and efficient models in your NLP problem-solving toolbox. Cer et al. (2018) proposed one such model called “Universal Sentence Encoder”.

In this post, I will explain the core idea behind “Universal Sentence Encoder” and how it learns fixed-length sentence embeddings from a mixed corpus of supervised and unsupervised data.

Goal

We want to learn a model that can map a sentence to a fixed-length vector representation. This vector encodes the meaning of the sentence and thus can be used for downstream tasks such as searching for similar documents.

Why Learned Sentence Embeddings?

A naive technique to get sentence embedding is to average the embeddings of words in a sentence and use the average as the representation of the whole sentence. This approach has some challenges.

Let’s understand these challenges with some code examples using the spacy library. We first install spacy and create an nlp object to load the medium version of their model.

shell

pip install spacy
python -m spacy download en_core_web_md

import en_core_web_md

nlp = en_core_web_md.load()

Challenge 1: Loss of information

If we calculate the cosine similarity of documents given below using averaged word vectors, the similarity is pretty high even if the second sentence has a single word It and doesn’t have the same meaning as the first sentence.

python

>>> nlp('It is cool').similarity(nlp('It'))
0.8963861908844291

Challenge 2: No Respect for Order

In this example, we swap the order of words in a sentence resulting in a sentence with a different meaning. Yet, the similarity obtained from averaged word vectors is 100%.

python

>>> nlp('this is cool').similarity(nlp('is this cool'))
1.0
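The reason for the perfect score is that averaging is invariant to word order. Here is a tiny sketch with hand-made 2-dimensional vectors. The numbers are invented for illustration and are not spacy's vectors.

```python
import math

# Hypothetical toy word vectors.
toy_vectors = {"this": [1.0, 0.0], "is": [0.0, 1.0], "cool": [1.0, 1.0]}

def average_embedding(tokens):
    """Average the word vectors of a tokenized sentence."""
    dims = len(toy_vectors[tokens[0]])
    return [sum(toy_vectors[t][d] for t in tokens) / len(tokens) for d in range(dims)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

statement = average_embedding(["this", "is", "cool"])
question = average_embedding(["is", "this", "cool"])
print(statement, question, cosine(statement, question))
```

Both sentences contain the same bag of words, so their averaged vectors are identical no matter how the words are ordered.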

We could fix some of these challenges with hacky manual feature engineering like skipping stop-words, weighting the words by their TF-IDF scores, adding n-grams to respect order when averaging, concatenating embeddings, stacking max pooling and averaged embeddings and so on.

A different line of thought is training an end-to-end model to get us sentence embeddings:

What if we could train a neural network to figure out how to best combine the word embeddings?

Universal Sentence Encoder(USE)

On a high level, the idea is to design an encoder that summarizes any given sentence to a 512-dimensional sentence embedding. We use this same embedding to solve multiple tasks and based on the mistakes it makes on those, we update the sentence embedding. Since the same embedding has to work on multiple generic tasks, it will capture only the most informative features and discard noise. The intuition is that this will result in a generic embedding that transfers universally to a wide variety of NLP tasks such as relatedness, clustering, paraphrase detection and text classification.

Overall Pipeline of Universal Sentence Encoder

Let’s now dig deeper into each component of Universal Sentence Encoder.

1. Tokenization

First, the sentences are converted to lowercase and tokenized into tokens using the Penn Treebank(PTB) tokenizer.

2. Encoder

This is the component that encodes a sentence into fixed-length 512-dimension embedding. In the paper, there are two architectures proposed based on trade-offs in accuracy vs inference speed.

Variant 1: Transformer Encoder

In this variant, we use the encoder part of the original transformer architecture. The architecture consists of 6 stacked transformer layers. Each layer has a self-attention module followed by a feed-forward network.

The self-attention process takes word order and surrounding context into account when generating each word representation. The output context-aware word embeddings are added element-wise and divided by the square root of the length of the sentence to account for the sentence-length difference. We get a 512-dimensional vector as output sentence embedding.
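The pooling step can be sketched as follows, assuming the context-aware word embeddings are already available as plain lists. The function name is illustrative and the toy vectors are 2-dimensional instead of 512-dimensional.

```python
import math

def sqrt_n_pool(token_vectors):
    """Sum token vectors element-wise and divide by sqrt(number of tokens)."""
    n = len(token_vectors)
    dims = len(token_vectors[0])
    summed = [sum(vec[d] for vec in token_vectors) for d in range(dims)]
    return [value / math.sqrt(n) for value in summed]

# Four identical toy token vectors.
print(sqrt_n_pool([[1.0, 2.0]] * 4))
```

Dividing by the square root of the length, rather than the length itself, damps the growth of the sum for long sentences without shrinking it as strongly as a plain average would.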

This encoder has better accuracy on downstream tasks but higher memory and compute resource usage due to its complex architecture. Also, the compute time scales dramatically with the length of the sentence as self-attention has O(n²) time complexity in the length of the sentence. But for short sentences, it is only moderately slower.

Variant 2: Deep Averaging Network(DAN)

In this simpler variant, the encoder is based on the architecture proposed by Iyyer et al. (2015). First, the embeddings for words and bi-grams present in a sentence are averaged together. Then, they are passed through a 4-layer feed-forward DNN to get a 512-dimensional sentence embedding as output. The embeddings for words and bi-grams are learned during training.

It has slightly reduced accuracy compared to the transformer variant, but the inference time is very efficient. Since we are only doing feedforward operations, the compute time is of linear complexity in terms of length of the input sequence.

3. Multi-task Learning

To learn the sentence embeddings, the encoder is shared and trained across a range of unsupervised tasks along with supervised training on the SNLI corpus. The tasks are as follows:

a. Modified Skip-thought

The idea with original skip-thought paper from Kiros et al. (2015) was to use the current sentence to predict the previous and next sentence.

In USE, the same core idea is used but instead of LSTM encoder-decoder architecture, only an encoder based on transformer or DAN is used. USE was trained on this task using the Wikipedia and News corpus.

b. Conversational Input-Response Prediction

In this task, we need to predict the correct response for a given input among a list of correct responses and other randomly sampled responses. This task is inspired by Henderson et al. (2017) who proposed a scalable email reply prediction architecture. This also powered the “Smart Reply” feature in “Inbox by Gmail”.

Smart reply in Google Inbox (Henderson et al. (2017))

The USE authors use a corpus scraped from web question-answering pages and discussion forums and formulate this task using a sentence encoder. The input sentence is encoded into a vector u. The response is also encoded by the same encoder and the response embeddings are passed through a DNN to get a vector v. This is done to model the difference in meaning of input and response. The dot product of these two vectors gives the relevance of an input to a response.

Training is done by taking a batch of K randomly shuffled input-response pairs. In each batch, for an input, its response pair is taken as the correct response and the remaining responses are treated as incorrect. Then, the dot product scores are calculated and converted to probabilities using a softmax function. The model is trained to maximize the log likelihood of the correct response for each input.
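The batch objective can be sketched in plain Python as below. This is a toy version under assumptions: the input vectors u and response vectors v are given as lists, the function name is made up, and the encoder and gradient updates are left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_log_likelihood(inputs, responses):
    """Mean log-probability of the matching response for each input.

    inputs[i] and responses[i] form a true pair. Every other response
    in the batch acts as an incorrect candidate for inputs[i].
    """
    total = 0.0
    for i, u in enumerate(inputs):
        scores = [dot(u, v) for v in responses]
        # Numerically stable log of the softmax normalizer.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += scores[i] - log_norm
    return total / len(inputs)

inputs = [[2.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]
swapped = [[0.0, 2.0], [2.0, 0.0]]
print(in_batch_log_likelihood(inputs, aligned), in_batch_log_likelihood(inputs, swapped))
```

Training would maximize this value, which happens when each input scores its own response higher than the other responses in the batch.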

c. Natural Language Inference

In this task, we need to predict if a hypothesis entails, contradicts, or is neutral to a premise. The authors used the 570K sentence pairs from SNLI corpus to train USE on this task.

Premise	Hypothesis	Judgement
A soccer game with multiple males playing	Some men are playing a sport	entailment
I love Marvel movies	I hate Marvel movies	contradiction
I love Marvel movies	A ship arrived	neutral

The sentence pairs are encoded using shared Transformer/DAN encoders and the output 512-dim embeddings u1 and u2 are obtained. Then, they are concatenated along with their L1 distance and their dot product(angle). This concatenated vector is passed through fully-connected layers and softmax is applied to get probability for entailment/contradiction/neutral classes.
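One way to build the concatenated vector is sketched below. The assumption here is that the distance and product terms are taken element-wise, as in InferSent-style classifiers, so the feature vector is four times the embedding size. The function name is illustrative.

```python
def nli_features(u1, u2):
    """Concatenate u1, u2, |u1 - u2| and u1 * u2 (both element-wise)."""
    abs_diff = [abs(a - b) for a, b in zip(u1, u2)]
    product = [a * b for a, b in zip(u1, u2)]
    return list(u1) + list(u2) + abs_diff + product

# Toy 2-dimensional embeddings instead of 512-dimensional ones.
print(nli_features([1.0, 2.0], [3.0, -1.0]))
```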

The idea to learn sentence embedding based on SNLI seems to be inspired by the InferSent(Conneau et al. (2018)) paper though the authors don’t cite it.

4. Inference

Once the model is trained using the above tasks, we can use it to map any sentence into fixed-length 512 dimension sentence embedding. This can be used for semantic search, paraphrase detection, clustering, smart-reply, text classification, and many other NLP tasks.

Results

One caveat with the USE paper was that it doesn’t have a section on comparison with other competing sentence embedding methods over standard benchmarks. The paper seems to be written from an engineering perspective based on learnings from products such as Inbox by Gmail and Google Books.

Implementation

The pre-trained models for “Universal Sentence Encoder” are available via Tensorflow Hub. You can use it to get embeddings as well as use it as a pre-trained model in Keras. You can refer to my tutorial on Tensorflow Hub to learn how to use it.

Conclusion

Thus, Universal Sentence Encoder is a strong baseline to try when comparing the accuracy gains of newer methods against the compute overhead. I have personally used it for semantic search, retrieval, and text clustering and it provides a decent balance of accuracy and inference speed.

References

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2018. Supervised learning of universal sentence representations from natural language inference data.

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Chengqing Zong and Michael Strube, editors, Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pages 1681–1691, Beijing, China. Association for Computational Linguistics.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors.

Citation

BibTeX citation:

@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {Universal {Sentence} {Encoder} {Visually} {Explained}},
  date = {2020-06-15},
  url = {https://amitness.com/posts/universal-sentence-encoder.html},
  langid = {en}
}

For attribution, please cite this work as:

Amit Chaudhary. 2020. Universal Sentence Encoder Visually Explained.

Zero-shot Text Classification With Generative Language Models

Sun, 07 Jun 2020 00:00:00 GMT

In my last post, we explored a contrastive learning approach to zero-shot text classification. In this post, we will explore a different approach based on text generation. This approach was proposed by Puri et al. in their paper “Zero-shot Text Classification With Generative Language Models”. The paper was also presented in the “3rd Workshop on Meta-Learning” at NeurIPS 2019.

The goal of zero-shot text classification is to design a general and flexible approach that can generalize to new classification tasks without the need for task-specific classification heads.

Build a text classification model that can classify classes on a new dataset it was never trained on.

Paper Idea

In the paper, the authors reformulate text classification as a text generation problem. Instead of classifying a text into X classes, the model needs to generate the correct class when given a text and the classes in a multiple-choice question answering format. Both the input and the output of the model are in natural language.

Let’s understand how the authors implemented this idea in a step-by-step process:

Phase 1: Pre-training

As seen in the formulation above, we need to teach GPT-2 to pick the correct class when given the problem as a multiple-choice problem. The authors teach GPT-2 to do this by fine-tuning on a simple pre-training task called title prediction.

1. Gathering Data for Weak Supervision

In the original GPT-2 paper, the training data was prepared by scraping outbound web links that were submitted or commented on Reddit and had a karma score of at least 3.

In the current paper, the authors build upon this idea with the OpenWebText dataset. Since we can know the subreddit the link was posted in and the submission title the user used, this metadata can be collected and used as the supervision signal.

For multiple submissions of the same link, subreddits and submission titles can be aggregated. Thus, we have pairs of webpage text, submission title, and subreddit name as annotations.

Scraped Text	Submission Title	Subreddit
We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many …	OpenAI Releases Largest GPT-2 Text Generation Model	r/artificial
…	…	…

The authors found subreddit prediction didn’t generalize well and so they use submission title in their experiments.

2. Multiple choice question answering format

To feed the annotated data into GPT-2, the authors prepared 26 different multiple-choice question formats. A random question format is sampled during training.

Now for each document, we randomly choose between 2 and 15 titles. One title is correct for that document while all others are random titles.

We also add regularization by replacing a title with “none of the above” 50% of the time. And the correct title is also replaced with “none of the above” with a probability 1/(number of titles). Such noise can help train the model to choose “none of the above” if none of the choices match the content.

As shown below, the titles are placed after the question as a comma-separated list.

Question	Text	Answer
Which of these choices best describes the following document?: " OpenAI Releases Largest GPT-2 Text Generation Model "," Facebook buys Whatsapp "	We’ve trained a large-scale …	OpenAI Releases Largest GPT-2 Text Generation Model

The question is prepended to the document to simulate a multiple-choice question answering task and a pre-trained GPT-2 language model is fine-tuned on this dataset to learn the submission title prediction task.
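Here is a rough sketch of how one training example could be assembled. It follows one possible reading of the description above: the correct title is hidden with probability 1/(number of titles), otherwise a random distractor is swapped for “none of the above” half of the time. The function name, the single question template, and the exact spacing are assumptions.

```python
import random

NONE_OF_THE_ABOVE = "none of the above"

def build_example(document, correct_title, random_titles, rng,
                  question="Which of these choices best describes the following document?"):
    """Return (prompt, answer) for the title prediction task."""
    choices = [correct_title] + list(random_titles)
    answer = correct_title
    if rng.random() < 1 / len(choices):
        # Hide the correct title so the right answer becomes "none of the above".
        choices[0] = NONE_OF_THE_ABOVE
        answer = NONE_OF_THE_ABOVE
    elif rng.random() < 0.5:
        # Replace one random distractor as extra noise.
        choices[rng.randrange(1, len(choices))] = NONE_OF_THE_ABOVE
    rng.shuffle(choices)
    options = ",".join(f'" {choice} "' for choice in choices)
    return f"{question}: {options} {document}", answer

rng = random.Random(0)
prompt, answer = build_example(
    "We’ve trained a large-scale unsupervised language model ...",
    "OpenAI Releases Largest GPT-2 Text Generation Model",
    ["Facebook buys Whatsapp"],
    rng,
)
print(prompt)
print(answer)
```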

Phase 2: Zero-Shot Classification

From the previous step, we have a model that has been trained on a wide variety of titles from the web and thus simulates meta-learning with N-way text classification tasks.

To test the zero-shot capabilities of the model, the authors tested it on 6 benchmark datasets without doing any finetuning.

Dataset	Classes
SST-2	Positive Sentiment, Negative Sentiment
Yelp-2	Positive polarity, Negative polarity
Amazon-2	Positive polarity, Negative polarity
AGNews	Science & Technology, Business, Sports, World News
DBPedia	Company, Mean Of Transportation, Film, Office Holder, Written Work, Animal, Natural Place, Artist, Plant, Athlete, Album, Building, Village, Educational Institution
Yahoo Answers	Family & Relationships, Business & Finance, Health, Society & Culture, Education & Reference, Entertainment & Music, Science & Mathematics, Computers & Internet, Sports, Politics & Government

For each dataset, they perform the following steps:

They convert the classes in each dataset into the same multiple-choice question format as pre-training and prepend it to the text. For example, for SST-2 dataset which contains movie reviews, the format would be:

Question	Text	Answer
To which category does the text belong?: " Positive Sentiment "," Negative Sentiment "	the film is one of the year’s best	Positive Sentiment

The question is prepended to the text and passed to GPT-2 as a prompt. Then we use greedy decoding to generate the output from GPT-2 and compare it with the actual class. Accuracy for each dataset is calculated.
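The evaluation loop can be sketched as below. The generate argument is a hypothetical callable that stands in for greedy decoding with the fine-tuned GPT-2. Here it is replaced by a trivial keyword rule so the sketch runs without a model, and the exact-match comparison is an assumption about how outputs are scored.

```python
def build_prompt(text, classes, question="To which category does the text belong?"):
    """Prepend the multiple-choice question to the text."""
    options = ",".join(f'" {name} "' for name in classes)
    return f"{question}: {options} {text}"

def zero_shot_accuracy(examples, classes, generate):
    """Fraction of examples whose generated answer matches the label."""
    correct = 0
    for text, label in examples:
        output = generate(build_prompt(text, classes))
        if output.strip().lower() == label.lower():
            correct += 1
    return correct / len(examples)

# Stand-in for the language model, for illustration only.
def fake_generate(prompt):
    return "Positive Sentiment" if "best" in prompt else "Negative Sentiment"

classes = ["Positive Sentiment", "Negative Sentiment"]
examples = [
    ("the film is one of the year's best", "Positive Sentiment"),
    ("a dull and lifeless movie", "Negative Sentiment"),
]
print(zero_shot_accuracy(examples, classes, fake_generate))
```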

Results and Insights

Even without access to the training data, the model was able to achieve up to 45% improvement in classification accuracy over random and majority class baselines.

For sentiment datasets such as SST-2, Amazon-2, and Yelp-2, the larger 355M GPT-2 model has a significant improvement over the random and majority class baselines. Zero-shot performance is still below direct finetuning and the SOTA held by XLNET.

Model	SST-2	Amazon-2	Yelp-2
Random Guess	50.6	52.9	50.4
Majority Class	49.9	49.3	49.2
Zero-Shot 355M All Data	62.5	80.2	74.7
355M Finetuned	93.23	97.115	94.479
SOTA(XLNET, 2019)	96.8	97.6	98.45

Increasing the model size from 117M to 355M parameters leads to better zero-shot performance on downstream tasks.

Model	SST-2	Amazon-2	Yelp-2
Zero-Shot 117M All Data	51.8	50.3	50.1
Zero-Shot 355M All Data	62.5	80.2	74.7

When pretraining is done on only 1/4th of the total data, it leads to a decrease in overall performance. This shows that pretraining across a diverse set of tasks is needed and a larger dataset provides that.

Model	SST-2	Amazon-2	Yelp-2
Zero-Shot 355M 1/4 Data	61.7	64.5	58.5
Zero-Shot 355M All Data	62.5	80.2	74.7

For datasets like DBPedia, AGNews, and Yahoo Answers with many classes, the model performs noticeably better than random but struggles to break past 50% accuracy. The authors say this could be because the model can identify unlikely classes, but struggles to choose between the most plausible options due to the lack of any supervision. Also, for these datasets, performance is better when pretraining on 1/4th of the data than on the full dataset.

Model	AGNews	DBPedia	Yahoo Answers
Random Guess	27.4	7.27	10.2
Majority Class	25.3	7.6	9.9
Zero-Shot 117M All Data	40.2	39.6	26.1
Zero-Shot 355M 1/4 Data	68.3	52.5	52.2
Zero-Shot 355M All Data	65.5	44.8	49.5
355M Finetuned	94.87	99.0	72.79
SOTA	95.51	99.38	76.26

The authors point out that there were controllability issues because GPT-2 was generating answers which were not a valid class. For example, for the Yahoo Answers dataset, valid classes are “education & reference” and “science and mathematics”. But, the model sometimes mixed these two and generated “education and mathematics”. This problem diminished as the model size was increased to 355M and full data was used.

Another issue with the model was the generation of an empty string and rearranging the tokens of a valid answer e.g. “Positive Sentiment” -> “Sentiment Positive”. This problem was frequent with top-k and top-p sampling and rare with greedy decoding, and so the authors chose greedy decoding.

Conclusion

The paper provides a good overview of the method and challenges of using generative language models for zero-shot classification and shows that natural language could be a promising meta-learning strategy for text problems.

References

Raul Puri et al., “Zero-shot Text Classification With Generative Language Models”

Exploring Knowledge Captured in Probability of Strings

Sun, 07 Jun 2020 00:00:00 GMT

I recently completed UC Berkeley’s Deep Unsupervised Learning course. The course had an interesting guest lecture on the history of language modeling by Alec Radford, the author of the GPT model.

In one of his slides, Alec mentions how by simply observing a bunch of strings, language models tend to capture useful knowledge. He also mentions that maybe in the future, we could have an unsupervised language model that can be directly used on tasks without further fine-tuning. This talk was before GPT-3 was released and GPT-3 has shown the few-shot learning ability of language models.

In this post, I will share my exploration of the simple examples he mentioned in the lecture with code and expand more on them.

Probabilistic Language Modeling

In language modeling, we want to learn a function that can observe a bunch of strings and then compute the probability for new strings. For example, the function can give us how likely this sentence is:

There are many ways you could formulate this function. Here are some:

We could discard context and simply assume each token is independent to get a unigram language model.

We could condition only on the previous word to get a bigram language model.

We could use an RNN and variants to keep track of the previous context in a hidden state.
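To make the first two formulations concrete, here is a small count-based sketch on a made-up three-sentence corpus. It uses plain maximum-likelihood estimates without smoothing, so unseen bigrams get zero probability. The corpus and function names are invented for illustration.

```python
from collections import Counter

# Tiny hypothetical corpus of tokenized sentences.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

unigram_counts = Counter(tok for sent in corpus for tok in sent)
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())

def unigram_prob(sentence):
    """Treat every token as independent of its context."""
    p = 1.0
    for tok in sentence:
        p *= unigram_counts[tok] / total_tokens
    return p

def bigram_prob(sentence):
    """Condition each token only on the previous token."""
    p = unigram_counts[sentence[0]] / total_tokens
    for prev, tok in zip(sentence, sentence[1:]):
        if unigram_counts[prev] == 0:
            return 0.0
        p *= bigram_counts[(prev, tok)] / unigram_counts[prev]
    return p

ordered = ["the", "cat", "sat"]
scrambled = ["sat", "cat", "the"]
print(unigram_prob(ordered), unigram_prob(scrambled))
print(bigram_prob(ordered), bigram_prob(scrambled))
```

The unigram model cannot tell the two word orders apart, while the bigram model already prefers the natural order.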

What could it have learned?

Let’s take GPT-2 as a language model and explore what it has learned by just observing a bunch of strings over the internet.

We will use the lm-scorer library to calculate the probability of a sentence using transformer-based language models.

pip install lm-scorer

Let’s create a scorer function that gives us a probability of a sentence using the GPT-2 language model.

from lm_scorer.models.auto import AutoLMScorer
scorer = AutoLMScorer.from_pretrained("gpt2-large")

def score(sentence):
    return scorer.sentence_score(sentence)

Now, we can use it for any sentence as shown below and it returns the probability.

>>> score('good luck')
8.658163769270644e-11

Grammar

A language model has no prior knowledge of grammar rules and structure. But it has been exposed to a bunch of grammatically correct sentences in the large training corpus. Let’s explore how much grammar it has picked up.

The language model assigns a higher probability to sentence with the correct order of subject, verb, and object than an incorrect one.
```
>>> score('I like it') > score('like it I')
True
```
We have two similar sentences given below. Sentence 2 has a grammatical mistake.

sentence 1 sentence 2

The cat sat on the mat The cat sats on the mat

We would want our language model to assign more probability to the correct sentence 1. Let’s verify if this is the case with GPT-2.
```
p1 = score('The cat sat on the mat')
p2 = score('The cat sats on the mat')
```
The language model indeed assigns more probability to the grammatically correct sentence.
```
>>> print(p1 > p2)
True
```

World Knowledge

The text corpus a language model is trained on contains lots of facts about the world. Can a language model pick that up? Let’s see an example.

fact1 = score('The cat sat on the mat')
fact2 = score('The hyena sat on the mat')

Who does GPT-2 think is more probable to sit on a mat: cat or the hyena?

>>> print(fact1 > fact2)
True

It’s the cat. This makes sense, as cats are domesticated while hyenas are wild animals.

Sentiment Analysis

Alec Radford presents another idea: to perform sentiment analysis, we compute the conditional probability of a positive or negative opinion following some text. For example, we could calculate the probability of “Sentiment: Positive.” and “Sentiment: Negative.” coming after a text and assign whichever sentiment is more probable.

Let’s build a function to compute the two scores and return the sentiment based on whichever is higher.

def sentiment(sentence):
    positive_score = score(f'{sentence} Sentiment: Positive.')
    negative_score = score(f'{sentence} Sentiment: Negative.')
    return 'positive' if positive_score > negative_score else 'negative'

We can try with a few sentences.

>>> sentiment('Awesome product.')
'positive'

>>> sentiment('the app failed to run')
'negative'

>>> sentiment('this is not a good idea')
'negative'

>>> sentiment('the app rocks')
'positive'
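
The same trick extends to any set of labels. Below is a minimal sketch, assuming a `score_fn` that behaves like the `score` function above. The names `classify` and `toy_score` and the prompt template are hypothetical; `toy_score` is a made-up stand-in so the example runs without downloading GPT-2.

```python
def classify(sentence, labels, score_fn, template="{sentence} Sentiment: {label}."):
    """Return the label whose completed prompt gets the highest score."""
    scores = {
        label: score_fn(template.format(sentence=sentence, label=label))
        for label in labels
    }
    return max(scores, key=scores.get)


def toy_score(text):
    """Made-up stand-in for a language model score; not a real model."""
    cues = {"awesome": "Positive", "rocks": "Positive", "failed": "Negative"}
    lowered = text.lower()
    score = 0.1
    for cue, label in cues.items():
        # Reward prompts where a cue word agrees with the appended label.
        if cue in lowered and f"Sentiment: {label}." in text:
            score += 1.0
    return score
```

With the real scorer, the call would look like `classify('the app rocks', ['Positive', 'Negative'], score)`.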

Bias

Since these models are trained on human-written text in the wild, they are bound to capture the biases inherent in that text. Here are some examples:

The model finds it more probable for the pronoun to be “he” for doctor and scientist and “she” for nurse.

>>> score('The doctor came. He') / score('The doctor came. She')
4.702219615279396

>>> score('The scientist came. He') / score('The scientist came. She')
3.9469981043432845

>>> score('The nurse came. She') / score('The nurse came. He')
4.709184896139912

References

Alec Radford, “L11 Language Models – guest instructor: Alec Radford (OpenAI) — Deep Unsupervised Learning SP20”

Zero Shot Learning for Text Classification

Sat, 30 May 2020 00:00:00 GMT

The recent release of GPT-3 got me interested in the state of zero-shot learning and few-shot learning in NLP. While most of the zero-shot learning research is concentrated in Computer Vision, there has been some interesting work in the NLP domain as well.

I will be writing a series of blog posts to cover existing research on zero-shot learning in NLP. In this first post, I will explain the paper “Train Once, Test Anywhere: Zero-Shot Learning for Text Classification” by Pushp et al. This paper from December 2017 was the first work to propose a zero-shot learning paradigm for text classification.

What is Zero-Shot Learning?

Zero-Shot Learning is the ability to detect classes that the model has never seen during training. It resembles our ability as humans to generalize and identify new things without explicit supervision.

For example, let’s say we want to do sentiment classification and news category classification. Normally, we will train/fine-tune a new model for each dataset. In contrast, with zero-shot learning, you can perform tasks such as sentiment and news classification directly without any task-specific training.

Train Once, Test Anywhere

In the paper, the authors propose a simple idea for zero-shot classification. Instead of classifying texts into X classes, they re-formulate the task as a binary classification to determine if a text and a class are related or not.

Let’s understand their formulation and end-to-end process in more detail now.

1. Data Preparation

The authors crawled 4.2 million news headlines from the web and used the SEO tags of each news article as the labels. After crawling, they had a total of 300,000 unique tags as labels. We can see how troublesome it would have been to train a supervised model on 300,000 classes.

Each headline was truncated to 28 words and anything shorter was padded.
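
A minimal sketch of this step is shown below. It assumes whitespace tokenization and a `<pad>` token, neither of which is specified here; only the length of 28 comes from the paper.

```python
MAX_WORDS = 28       # headline length used in the paper
PAD_TOKEN = "<pad>"  # hypothetical padding token, chosen for illustration


def pad_or_truncate(headline, max_words=MAX_WORDS, pad_token=PAD_TOKEN):
    """Cut a headline to max_words tokens, or pad it on the right to that length."""
    words = headline.split()[:max_words]
    return words + [pad_token] * (max_words - len(words))
```

Every headline then has the same length, which makes batching straightforward.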

2. Word Embedding

The paper uses word2vec pre-trained on Google News as the word embeddings for both the sentences as well as the labels.

3. Model Architecture

The paper proposes three different architectures to learn the relation between sentence and label embeddings.

a. Architecture 1

In this architecture, we take the mean of word embeddings in the sentence as the sentence embedding and concatenate it with the label embedding. This vector is then passed through a fully connected layer to classify if the sentence and label are related or not.
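
The forward pass of this architecture can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: the embeddings and weights are random, the dimension is tiny (the Google News word2vec vectors are 300-dimensional), and the fully connected layer is reduced to a single sigmoid unit.

```python
import math
import random

random.seed(0)
EMBED_DIM = 4  # tiny dimension for illustration only


def random_vector(size):
    return [random.uniform(-1, 1) for _ in range(size)]


def mean_vector(vectors):
    """Average the word embeddings element-wise to get the sentence embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def relatedness(word_vectors, label_vector, weights, bias):
    """Mean-pool the words, concatenate the label, apply one sigmoid unit."""
    features = mean_vector(word_vectors) + label_vector  # list concatenation
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


word_vectors = [random_vector(EMBED_DIM) for _ in range(3)]  # a 3-word sentence
label_vector = random_vector(EMBED_DIM)
weights = random_vector(2 * EMBED_DIM)  # untrained weights, random for the demo
probability = relatedness(word_vectors, label_vector, weights, bias=0.0)
```

The output is a single probability that the sentence and the label are related, which is what lets one model handle any label that has a word embedding.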

b. Architecture 2

In this architecture, instead of taking the mean, the word embeddings are passed through an LSTM and the last hidden state of the network is treated as the sentence vector. It is concatenated with the word vector of the label and then passed through a fully connected layer to classify if the sentence and label are related or not.

c. Architecture 3

In this architecture, the embedding of each word in the sentence is concatenated with the embedding of the label. This combined embedding is passed through an LSTM and the last hidden state of the network is taken. It is then passed through a fully connected layer to classify if the sentence and label are related or not.

4. Training

Using the crawled news headlines dataset, each headline is paired with its actual labels half of the time and with randomly selected unrelated labels the other half. The model is then trained with each of the three architectures above, using binary cross-entropy loss and the Adam optimizer.
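
One way to build such a balanced set is sketched below. It assumes a single tag per headline and creates one related and one unrelated pair for each; the function names and the sample data are hypothetical, and the paper's exact sampling procedure may differ.

```python
import math
import random


def build_pairs(tagged_headlines, all_labels, rng):
    """Create one related and one unrelated (headline, label, target) pair per headline."""
    pairs = []
    for headline, true_label in tagged_headlines:
        pairs.append((headline, true_label, 1))
        negatives = [label for label in all_labels if label != true_label]
        pairs.append((headline, rng.choice(negatives), 0))
    return pairs


def binary_cross_entropy(prediction, target):
    """Loss for one pair, where prediction is the model's relatedness probability."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))


# Made-up headlines and tags, for illustration only.
tagged_headlines = [
    ("Stocks rally after earnings", "finance"),
    ("New phone released today", "technology"),
    ("Team wins the championship", "sports"),
]
all_labels = ["finance", "technology", "sports"]
pairs = build_pairs(tagged_headlines, all_labels, random.Random(0))
```

Each pair is a binary example, so the number of output units stays at one no matter how many distinct tags the dataset contains.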

In the paper, they achieve the highest accuracy of 74% on the binary classification task with Architecture 3, followed by 72.6% with Architecture 2 and 72% with Architecture 1, on a separate test set of the news headlines dataset.

5. Zero-Shot Classification

Now, taking the trained model that can compute a relatedness score between sentences and labels, the authors tested its ability to generalize to unseen datasets and labels.

The authors tested their model on a hold-out test set containing labels not present during training. They achieve 78%, 76% and 81% accuracy on the binary classification task with architecture 1, 2 and 3 respectively.

UCI News Aggregator Dataset:
In this dataset, there are 420,000 sentences with 4 labels: technology, business, medicine and entertainment. They propose a heuristic called category tree where they expand each label with related words. The process is as follows:
- Take the unseen labels and add a few words related to this concept. For example, related words for business can be ‘finance’ and ‘revenue’.
- To predict the class (category) for a sentence, they predict the relatedness of the sentence to each related word under that category and take the mean as the final relatedness.
- The classes which had mean relatedness probability above a threshold are assumed as the predicted classes. This threshold is a hyperparameter and the paper uses 0.5 as the threshold.

The authors tested this process on the entire dataset and achieved 61.73%, 63% and 64.21% accuracy. In comparison, supervised methods achieve 94.75% accuracy. The result is still interesting because, without training on a single sample from this dataset, the model achieves better than random accuracy.

Tweet Classification:
This dataset has 1993 sentences with 6 labels: business, health, politics, sports, technology and entertainment. The authors tested their method over the whole dataset using a threshold of 0.5 and a category tree expansion with 3 related words and achieved 64.5% accuracy with Architecture 3. In comparison, a supervised method such as multinomial naive Bayes trained on the whole dataset can get 78% accuracy.
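
The category tree heuristic described above can be sketched as follows. The function names, the sample category tree, and `toy_relatedness` are assumptions for illustration; in practice `relatedness_fn` would be the trained model that returns a relatedness probability for a sentence and a word.

```python
def predict_categories(sentence, category_tree, relatedness_fn, threshold=0.5):
    """Return every category whose mean relatedness to the sentence exceeds the threshold."""
    predicted = []
    for category, related_words in category_tree.items():
        scores = [relatedness_fn(sentence, word) for word in related_words]
        if sum(scores) / len(scores) > threshold:
            predicted.append(category)
    return predicted


# Hypothetical category tree: each unseen label is expanded with related words.
category_tree = {
    "business": ["finance", "revenue"],
    "technology": ["software", "gadget"],
}


def toy_relatedness(sentence, word):
    """Made-up stand-in for the trained model's relatedness probability."""
    return 0.9 if word in {"finance", "revenue"} else 0.1


predictions = predict_categories(
    "Company revenue beats forecasts", category_tree, toy_relatedness
)
```

Averaging over several related words makes the prediction less dependent on how well any single label word is represented in the embedding space.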

Conclusion

The paper proposes some really simple but clever techniques to learn the relationship between sentences and labels and achieves better than random accuracy on unseen datasets and labels. Since this was proposed in the pre-transformer era, it can be interesting to try these ideas with recent models.

References

Pushpankar Kumar Pushp, et al. “Train Once, Test Anywhere: Zero-Shot Learning for Text Classification”

	word2vec-skipgram	word2vec-cbow	fasttext
Czech	52.8	55.0	77.8
German	44.5	45.0	56.4
English	70.1	69.9	74.9
Italian	51.5	51.8	62.7

	word2vec-skipgram	word2vec-cbow	fasttext
Czech	25.7	27.6	27.5
German	66.5	66.8	62.3
English	78.5	78.2	77.8
Italian	52.3	54.7	52.3

Model	SST-2	Amazon-2	Yelp-2
Zero-Shot 117M All Data	51.8	50.3	50.1
Zero-Shot 355M All Data	62.5	80.2	74.7

Model	SST-2	Amazon-2	Yelp-2
Zero-Shot 355M 1/4 Data	61.7	64.5	58.5
Zero-Shot 355M All Data	62.5	80.2	74.7
