<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Amit Chaudhary</title>
<link>https://amitness.com/</link>
<atom:link href="https://amitness.com/index.xml" rel="self" type="application/rss+xml"/>
<description>I&#39;m an independent AI engineer helping companies build robust AI-powered products.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Sat, 15 Feb 2025 00:00:00 GMT</lastBuildDate>
<item>
  <title>The Anatomy of Tool Calling</title>
  <link>https://amitness.com/posts/function-calling-schema/</link>
  <description><![CDATA[ 




<p>Letting an LLM call an external function based on the user’s input and receive the result back is a powerful pattern and a key element behind the rapid rise of agentic workflows.</p>
<p>This pattern powers many of the features we see on ChatGPT today, such as web search, code execution, image generation, or personalized memory based on conversation history.</p>
<p>LLM&nbsp;providers expose this as tool use or function calling. We provide all the function signatures and parameters as JSON Schema and can later call the implementation in any programming language.</p>
<p><img src="https://amitness.com/posts/function-calling-schema/function-to-jsonschema.png" class="img-fluid"></p>
<p>For example, we can write a JSON schema to provide a simple add function to OpenAI as shown below.</p>
<div id="1a5c763a" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install openai <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>qqq</span></code></pre></div></div>
</div>
<div id="c6dd3815" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb2-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<p>First, we convert the function into a JSON Schema that lists the name of the function, a description of what it does, and the name and type of each parameter it accepts.</p>
<div id="29ce7248" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">tools <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb3-2">    {</span>
<span id="cb3-3">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>,</span>
<span id="cb3-4">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>: {</span>
<span id="cb3-5">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"add"</span>,</span>
<span id="cb3-6">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"description"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Adds two integers together"</span>,</span>
<span id="cb3-7">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"strict"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb3-8">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"parameters"</span>: {</span>
<span id="cb3-9">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span>,</span>
<span id="cb3-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"required"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>],</span>
<span id="cb3-11">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>: {</span>
<span id="cb3-12">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integer"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"description"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The first integer to add"</span>},</span>
<span id="cb3-13">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>: {</span>
<span id="cb3-14">                        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integer"</span>,</span>
<span id="cb3-15">                        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"description"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The second integer to add"</span>,</span>
<span id="cb3-16">                    },</span>
<span id="cb3-17">                },</span>
<span id="cb3-18">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"additionalProperties"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb3-19">            },</span>
<span id="cb3-20">        },</span>
<span id="cb3-21">    }</span>
<span id="cb3-22">]</span></code></pre></div></div>
</div>
<p>Then, we can provide our schema as a list of tools and send a user query.</p>
<div id="b3d35353" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> OpenAI</span>
<span id="cb4-2"></span>
<span id="cb4-3">client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> OpenAI()</span>
<span id="cb4-4"></span>
<span id="cb4-5">messages <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Add 2 and 3"</span>}]</span>
<span id="cb4-6">completion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.chat.completions.create(</span>
<span id="cb4-7">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o-mini"</span>,</span>
<span id="cb4-8">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>messages,</span>
<span id="cb4-9">    tools<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>tools,</span>
<span id="cb4-10">)</span></code></pre></div></div>
</div>
<p>The model then decides that it wants to call the <code>add</code> function with the parameters <code>a=2</code> and <code>b=3</code>.</p>
<div id="08d1423c" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">tool_call <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> completion.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.tool_calls[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb5-2">tool_call.function</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="5">
<div class="ansi-escaped-output">
<pre><span class="ansi-magenta-fg ansi-bold">Function</span><span class="ansi-bold">(</span><span class="ansi-yellow-fg">arguments</span>=<span class="ansi-green-fg">'</span><span class="ansi-green-fg">{</span><span class="ansi-green-fg">"a":2,"b":3</span><span class="ansi-green-fg">}</span><span class="ansi-green-fg">'</span>, <span class="ansi-yellow-fg">name</span>=<span class="ansi-green-fg">'add'</span><span class="ansi-bold">)</span></pre>
</div>
</div>
</div>
<div id="3163c281" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">tool_call.function.name</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<div class="ansi-escaped-output">
<pre><span class="ansi-green-fg">'add'</span></pre>
</div>
</div>
</div>
<p>The arguments arrive as a JSON string, so we parse them into a dictionary as shown below.</p>
<div id="f32b6ec1" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb7-2"></span>
<span id="cb7-3">args <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> json.loads(tool_call.function.arguments)</span>
<span id="cb7-4">args</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-cyan-fg ansi-bold">2</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-cyan-fg ansi-bold">3</span><span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
<p>Then we call our function with those arguments and get a result.</p>
<div id="63dd1509" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> add(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>args)</span>
<span id="cb8-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(result)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>5</code></pre>
</div>
</div>
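<p>With more than one tool, the lookup-and-call step is usually generalized into a small dispatcher. The following is a minimal sketch of that idea, not part of the OpenAI SDK: <code>TOOL_REGISTRY</code> and <code>run_tool</code> are hypothetical names introduced here for illustration, and the arguments string is assumed to be valid JSON matching the function's parameters.</p>

```python
import json


def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"add": add}


def run_tool(name: str, arguments: str) -> str:
    """Look up a tool by name, call it with JSON-encoded arguments,
    and return the result as a string for the tool message."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(arguments)  # e.g. '{"a":2,"b":3}' becomes {'a': 2, 'b': 3}
    return str(TOOL_REGISTRY[name](**kwargs))


print(run_tool("add", '{"a":2,"b":3}'))  # prints 5
```

<p>In the flow above, <code>name</code> and <code>arguments</code> would come from <code>tool_call.function.name</code> and <code>tool_call.function.arguments</code>.</p>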
<p>The result is sent back to the LLM as a separate <code>tool</code> message, and the model then generates a natural-language reply in the next turn.</p>
<div id="cb8df486" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">messages.append(completion.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message)</span>
<span id="cb10-2">messages.append(</span>
<span id="cb10-3">    {</span>
<span id="cb10-4">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tool"</span>,</span>
<span id="cb10-5">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tool_call_id"</span>: tool_call.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>,</span>
<span id="cb10-6">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(result),</span>
<span id="cb10-7">    }</span>
<span id="cb10-8">)</span>
<span id="cb10-9"></span>
<span id="cb10-10">completion_after_tool_call <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.chat.completions.create(</span>
<span id="cb10-11">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o-mini"</span>,</span>
<span id="cb10-12">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>messages,</span>
<span id="cb10-13">    tools<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>tools,</span>
<span id="cb10-14">)</span>
<span id="cb10-15">completion_after_tool_call.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<div class="ansi-escaped-output">
<pre><span class="ansi-green-fg">'The sum of 2 and 3 is 5.'</span></pre>
</div>
</div>
</div>
<hr>
<p>Now, the question becomes: how can we automatically convert Python functions into JSON Schemas?</p>
<p>In this post, I will go over the runtime introspection features that Python provides to extract pretty much everything about a function definition. Then we will use that knowledge to build automatic function-to-JSON-Schema converters.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/function-calling-schema/function-runtime-inspect.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="600"></p>
</figure>
</div>
<section id="object-introspection" class="level2">
<h2 class="anchored" data-anchor-id="object-introspection">Object Introspection</h2>
<p>Let’s understand the various introspection features step by step.</p>
<section id="extracting-the-parameters-and-type-annotations" class="level3">
<h3 class="anchored" data-anchor-id="extracting-the-parameters-and-type-annotations">Extracting the parameters and type-annotations</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/function-calling-schema/inspect-signature.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="400"></p>
</figure>
</div>
<p>To get the parameters of the function, we can use the <code>signature</code> function of the <strong>inspect</strong> module.</p>
<div id="70f19821" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb11-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb11-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<p>This returns the full signature, covering both the input parameters and the return type.</p>
<div id="40532afa" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> inspect</span>
<span id="cb12-2"></span>
<span id="cb12-3">signature <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> inspect.signature(add)</span>
<span id="cb12-4">signature</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">&lt;</span><span class="ansi-bright-magenta-fg ansi-bold">Signature</span> <span class="ansi-bold">(</span>a: int, b: int<span class="ansi-bold">)</span> -&gt; int<span class="ansi-bold">&gt;</span></pre>
</div>
</div>
</div>
<p>From the signature, we can get a read-only, ordered mapping of parameter names to <code>Parameter</code> objects.</p>
<div id="9f8c2aa2" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">signature.parameters</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="12">
<div class="ansi-escaped-output">
<pre><span class="ansi-magenta-fg ansi-bold">mappingproxy</span><span class="ansi-bold">(</span><span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">&lt;</span><span class="ansi-bright-magenta-fg ansi-bold">Parameter</span> <span class="ansi-green-fg">"a: int"</span>&gt;, <span class="ansi-green-fg">'b'</span>: &lt;Parameter <span class="ansi-green-fg">"b: int"</span><span class="ansi-bold">&gt;</span><span class="ansi-bold">}</span><span class="ansi-bold">)</span></pre>
</div>
</div>
</div>
<p>We can access each parameter by name from this mapping. It returns a <code>Parameter</code> object that has many useful properties.</p>
<div id="07f9964f" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> signature.parameters[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a"</span>]</span>
<span id="cb14-2">a</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">&lt;</span><span class="ansi-bright-magenta-fg ansi-bold">Parameter</span> <span class="ansi-green-fg">"a: int"</span><span class="ansi-bold">&gt;</span></pre>
</div>
</div>
</div>
<p>We can now easily access the name of the parameter, its default value as well as the type annotation.</p>
<div id="d965c9ae" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Name of parameter: "</span>, a.name)</span>
<span id="cb15-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Default value: "</span>, a.default)</span>
<span id="cb15-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Type annotation: "</span>, a.annotation)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Name of parameter:  a
Default value:  &lt;class 'inspect._empty'&gt;
Type annotation:  &lt;class 'int'&gt;</code></pre>
</div>
</div>
<p>This means that if a parameter’s default is <code>inspect._empty</code> (exposed publicly as <code>inspect.Parameter.empty</code>), it is a required parameter.</p>
<div id="6c922e02" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">a.default <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> inspect.Parameter.empty</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="15">
<div class="ansi-escaped-output">
<pre><span style="font-style:italic" class="ansi-bright-green-fg">True</span></pre>
</div>
</div>
</div>
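<p>The same check can be applied to every parameter to build the <code>required</code> list of a schema. Here is a minimal sketch, assuming a hypothetical <code>greet</code> function and a hypothetical helper named <code>required_parameters</code>; it uses the public <code>inspect.Parameter.empty</code> sentinel rather than the private <code>inspect._empty</code> alias.</p>

```python
import inspect


def greet(name: str, greeting: str = "Hello") -> str:
    """Hypothetical function with one required and one optional parameter."""
    return f"{greeting}, {name}"


def required_parameters(func) -> list[str]:
    """Return the names of parameters that have no default value."""
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.default is inspect.Parameter.empty
    ]


print(required_parameters(greet))  # ['name']
```

<p>Note that <code>*args</code> and <code>**kwargs</code> parameters also report an empty default, so a more complete version would probably filter on <code>param.kind</code> as well.</p>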
<p>The type annotation is of particular interest to us. The <code>annotation</code> attribute returns the type object directly.</p>
<div id="da6d03ae" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1">a.annotation</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="16">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">&lt;</span><span class="ansi-bright-magenta-fg ansi-bold">class</span> <span class="ansi-green-fg">'int'</span><span class="ansi-bold">&gt;</span></pre>
</div>
</div>
</div>
<div id="4b3d765b" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">a.annotation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="17">
<div class="ansi-escaped-output">
<pre><span style="font-style:italic" class="ansi-bright-green-fg">True</span></pre>
</div>
</div>
</div>
<p>We can also get the annotation for the return type, i.e.&nbsp;the output of the function, from the signature itself.</p>
<div id="1f3c564d" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1">signature.return_annotation</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="18">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">&lt;</span><span class="ansi-bright-magenta-fg ansi-bold">class</span> <span class="ansi-green-fg">'int'</span><span class="ansi-bold">&gt;</span></pre>
</div>
</div>
</div>
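<p>Because the annotation is the actual type object, it can be used as a dictionary key to look up a JSON Schema type name. The sketch below illustrates this under some assumptions: the <code>PYTHON_TO_JSON_TYPE</code> table and the <code>parameter_properties</code> helper are hypothetical names, only a few built-in types are covered, and falling back to <code>"string"</code> for anything unmapped is an arbitrary choice.</p>

```python
import inspect

# Assumed mapping from Python built-in types to JSON Schema type names.
PYTHON_TO_JSON_TYPE = {
    int: "integer",
    float: "number",
    str: "string",
    bool: "boolean",
    list: "array",
    dict: "object",
}


def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b


def parameter_properties(func) -> dict:
    """Build the "properties" part of a schema from type annotations."""
    properties = {}
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unmapped types fall back to "string" (arbitrary choice).
        json_type = PYTHON_TO_JSON_TYPE.get(param.annotation, "string")
        properties[name] = {"type": json_type}
    return properties


print(parameter_properties(add))
# {'a': {'type': 'integer'}, 'b': {'type': 'integer'}}
```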
</section>
<section id="extracting-the-docstring" class="level3">
<h3 class="anchored" data-anchor-id="extracting-the-docstring">Extracting the docstring</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/function-calling-schema/docstring.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="400"></p>
</figure>
</div>
<p>To get the docstring, we can use the <strong>__doc__</strong> attribute of the function.</p>
<div id="706842ac" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb21-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb21-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<div id="f6348422" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1">add.__doc__</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<div class="ansi-escaped-output">
<pre><span class="ansi-green-fg">'Adds two integers together'</span></pre>
</div>
</div>
</div>
<p>An alternative is to use <code>inspect.getdoc</code> from the <code>inspect</code> module, which also cleans up the indentation of the docstring.</p>
<div id="680e5c56" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> inspect</span>
<span id="cb23-2"></span>
<span id="cb23-3">inspect.getdoc(add)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="21">
<div class="ansi-escaped-output">
<pre><span class="ansi-green-fg">'Adds two integers together'</span></pre>
</div>
</div>
</div>
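<p>The two approaches differ once the docstring spans multiple lines. As a minimal sketch (the multi-line docstring below is only an illustration), <code>inspect.getdoc</code> strips the leading and trailing blank lines and removes the common indentation, while <code>__doc__</code> returns the string as stored on the function.</p>

```python
import inspect


def add(a: int, b: int) -> int:
    """
    Adds two integers together.

    Args:
        a (int): The first integer.
        b (int): The second integer.
    """
    return a + b


# Raw attribute: keeps the surrounding newlines. On Python versions
# before 3.13 it also keeps the indentation from the source file.
raw_doc = add.__doc__

# Cleaned version: blank lines at both ends and common indentation removed.
clean_doc = inspect.getdoc(add)

print(repr(raw_doc))
print(repr(clean_doc))
```

<p>The cleaned version is usually the better choice for the <code>description</code> field, since the stray whitespace carries no information for the LLM.</p>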
</section>
<section id="extracting-the-function-name" class="level3">
<h3 class="anchored" data-anchor-id="extracting-the-function-name">Extracting the function name</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/function-calling-schema/function-name-extraction.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="400"></p>
</figure>
</div>
<p>This is relatively simple, as Python already provides a <strong>__name__</strong> attribute on each function.</p>
<div id="deb79823" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb24-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb24-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<div id="b0f4238b" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1">add.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="23">
<div class="ansi-escaped-output">
<pre><span class="ansi-green-fg">'add'</span></pre>
</div>
</div>
</div>
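<p>One caveat worth keeping in mind: if a tool function is wrapped by a decorator, <code>__name__</code> reflects the wrapper unless the decorator uses <code>functools.wraps</code>. The two decorators below are hypothetical and only serve to illustrate the difference.</p>

```python
import functools


def logged(func):
    """Hypothetical decorator that preserves metadata via functools.wraps."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


def logged_without_wraps(func):
    """Hypothetical decorator that does not copy the metadata."""

    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


@logged
def add(a: int, b: int) -> int:
    """Adds two integers together"""
    return a + b


@logged_without_wraps
def subtract(a: int, b: int) -> int:
    """Subtracts b from a"""
    return a - b


print(add.__name__)       # add
print(subtract.__name__)  # wrapper
print(subtract.__doc__)   # None
```

<p>Without <code>functools.wraps</code>, the schema would advertise a tool called <code>wrapper</code> with an empty description, so it is worth checking any decorators applied to tool functions.</p>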
</section>
<section id="extracting-the-parameter-descriptions-from-the-docstring" class="level3">
<h3 class="anchored" data-anchor-id="extracting-the-parameter-descriptions-from-the-docstring">Extracting the parameter descriptions from the docstring</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/function-calling-schema/extract-param-description.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="400"></p>
</figure>
</div>
<p>Since docstring formats vary a lot (for example, Google, NumPy, and reST styles), we can rely on a third-party library called <a href="https://github.com/rr-/docstring_parser">docstring_parser</a> instead of parsing them by hand.</p>
<div id="a9bb26c6" class="cell" data-execution_count="24">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install docstring_parser <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>qqq</span></code></pre></div></div>
</div>
<div id="ae507150" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb27-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb27-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Adds two integers together.</span></span>
<span id="cb27-4"></span>
<span id="cb27-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Args:</span></span>
<span id="cb27-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        a (int): The first integer.</span></span>
<span id="cb27-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        b (int): The second integer.</span></span>
<span id="cb27-8"></span>
<span id="cb27-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns:</span></span>
<span id="cb27-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        int: The sum of a and b.</span></span>
<span id="cb27-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb27-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<div id="09d29fd8" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> docstring_parser <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> parse</span>
<span id="cb28-2"></span>
<span id="cb28-3">doc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parse(add.__doc__)</span>
<span id="cb28-4">{param.arg_name: param.description <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> param <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc.params}</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="26">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-green-fg">'The first integer.'</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-green-fg">'The second integer.'</span><span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
</section>
</section>
<section id="functions-to-json-schema" class="level2">
<h2 class="anchored" data-anchor-id="functions-to-json-schema">Functions to JSON Schema</h2>
<p>With the above background knowledge, we have everything needed to convert the function definition to JSON Schema.</p>
<p>Let’s see how this is applied in various popular agent libraries.</p>
<section id="approach-1-pure-python" class="level3">
<h3 class="anchored" data-anchor-id="approach-1-pure-python">Approach 1: Pure Python</h3>
<p>This is the approach implemented in the <a href="https://github.com/openai/swarm">OpenAI Swarm</a> library. Here, we use the introspection features discussed above to write the conversion function from scratch.</p>
<div id="40b230cd" class="cell" data-execution_count="28">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>github.com<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>openai<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>swarm.git <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>qqq</span></code></pre></div></div>
</div>
<div id="f48a094c" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb30-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb30-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<p>Swarm has a utility function called <code>function_to_json</code> that converts a Python function into a JSON schema.</p>
<div id="18b3efbb" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> swarm.util <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> function_to_json</span>
<span id="cb31-2"></span>
<span id="cb31-3">function_to_json(add)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="30">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'function'</span>,
    <span class="ansi-green-fg">'function'</span>: <span class="ansi-bold">{</span>
        <span class="ansi-green-fg">'name'</span>: <span class="ansi-green-fg">'add'</span>,
        <span class="ansi-green-fg">'description'</span>: <span class="ansi-green-fg">'Adds two integers together'</span>,
        <span class="ansi-green-fg">'parameters'</span>: <span class="ansi-bold">{</span>
            <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>,
            <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span><span class="ansi-bold">}</span>,
            <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>
        <span class="ansi-bold">}</span>
    <span class="ansi-bold">}</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
<p>As seen above, we first need a mapping from the Python parameter types to the equivalent JSON Schema data types.</p>
<div id="f6ce114d" class="cell">
<div class="cell-output cell-output-display" data-execution_count="31">
<style type="text/css">
</style>

<table id="T_fe5d8" class="caption-top table table-sm table-striped small">
<thead>
<tr class="header">
<th id="T_fe5d8_level0_col0" class="col_heading level0 col0" data-quarto-table-cell-role="th">Python type</th>
<th id="T_fe5d8_level0_col1" class="col_heading level0 col1" data-quarto-table-cell-role="th">JSON Schema type</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td id="T_fe5d8_row0_col0" class="data row0 col0">str</td>
<td id="T_fe5d8_row0_col1" class="data row0 col1">string</td>
</tr>
<tr class="even">
<td id="T_fe5d8_row1_col0" class="data row1 col0">int</td>
<td id="T_fe5d8_row1_col1" class="data row1 col1">integer</td>
</tr>
<tr class="odd">
<td id="T_fe5d8_row2_col0" class="data row2 col0">float</td>
<td id="T_fe5d8_row2_col1" class="data row2 col1">number</td>
</tr>
<tr class="even">
<td id="T_fe5d8_row3_col0" class="data row3 col0">bool</td>
<td id="T_fe5d8_row3_col1" class="data row3 col1">boolean</td>
</tr>
<tr class="odd">
<td id="T_fe5d8_row4_col0" class="data row4 col0">list</td>
<td id="T_fe5d8_row4_col1" class="data row4 col1">array</td>
</tr>
<tr class="even">
<td id="T_fe5d8_row5_col0" class="data row5 col0">dict</td>
<td id="T_fe5d8_row5_col1" class="data row5 col1">object</td>
</tr>
<tr class="odd">
<td id="T_fe5d8_row6_col0" class="data row6 col0">None</td>
<td id="T_fe5d8_row6_col1" class="data row6 col1">null</td>
</tr>
</tbody>
</table>
</div>
</div>
<p>Based on this, the implementation is quite simple and reuses all the concepts we discussed before.</p>
<p>We take the function signature and extract the type of each parameter, along with the function name and docstring. Using these, we construct the JSON Schema at the end.</p>
<div id="deb121c0" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="annotated-cell-30" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-30-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Source: https://github.com/openai/swarm/blob/9db581cecaacea0d46a933d6453c312b034dbf47/swarm/util.py#L31</span></span>
<span id="annotated-cell-30-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> inspect</span>
<span id="annotated-cell-30-3"></span>
<span id="annotated-cell-30-4"></span>
<span id="annotated-cell-30-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> function_to_json(func) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>:</span>
<span id="annotated-cell-30-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A mapping of types from python to JSON</span></span>
<span id="annotated-cell-30-7">    type_map <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="annotated-cell-30-8">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span>,</span>
<span id="annotated-cell-30-9">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integer"</span>,</span>
<span id="annotated-cell-30-10">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"number"</span>,</span>
<span id="annotated-cell-30-11">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bool</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"boolean"</span>,</span>
<span id="annotated-cell-30-12">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"array"</span>,</span>
<span id="annotated-cell-30-13">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span>,</span>
<span id="annotated-cell-30-14">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>): <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"null"</span>,</span>
<span id="annotated-cell-30-15">    }</span>
<span id="annotated-cell-30-16"></span>
<span id="annotated-cell-30-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-30" data-target-annotation="1">1</button><span id="annotated-cell-30-18" class="code-annotation-target">        signature <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> inspect.signature(func)</span>
<span id="annotated-cell-30-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span> <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> e:</span>
<span id="annotated-cell-30-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span>(</span>
<span id="annotated-cell-30-21">            <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Failed to get signature for function </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>func<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="annotated-cell-30-22">        )</span>
<span id="annotated-cell-30-23"></span>
<span id="annotated-cell-30-24">    parameters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-30" data-target-annotation="2">2</button><span id="annotated-cell-30-25" class="code-annotation-target">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> param <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> signature.parameters.values():</span>
<span id="annotated-cell-30-26">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="annotated-cell-30-27">            param_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> type_map.get(param.annotation, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span>)</span>
<span id="annotated-cell-30-28">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">KeyError</span> <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> e:</span>
<span id="annotated-cell-30-29">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">KeyError</span>(</span>
<span id="annotated-cell-30-30">                <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Unknown type annotation </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>annotation<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> for parameter </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="annotated-cell-30-31">            )</span>
<span id="annotated-cell-30-32">        parameters[param.name] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: param_type}</span>
<span id="annotated-cell-30-33"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-30" data-target-annotation="3">3</button><span id="annotated-cell-30-34" class="code-annotation-target">    required <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="annotated-cell-30-35">        param.name</span>
<span id="annotated-cell-30-36">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> param <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> signature.parameters.values()</span>
<span id="annotated-cell-30-37">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> param.default <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> inspect._empty</span>
<span id="annotated-cell-30-38">    ]</span>
<span id="annotated-cell-30-39"></span>
<span id="annotated-cell-30-40">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="annotated-cell-30-41">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>,</span>
<span id="annotated-cell-30-42">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>: {</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-30" data-target-annotation="4">4</button><span id="annotated-cell-30-43" class="code-annotation-target">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: func.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>,</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-30" data-target-annotation="5">5</button><span id="annotated-cell-30-44" class="code-annotation-target">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"description"</span>: func.__doc__ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">or</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>,</span>
<span id="annotated-cell-30-45">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"parameters"</span>: {</span>
<span id="annotated-cell-30-46">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span>,</span>
<span id="annotated-cell-30-47">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>: parameters,</span>
<span id="annotated-cell-30-48">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"required"</span>: required,</span>
<span id="annotated-cell-30-49">            },</span>
<span id="annotated-cell-30-50">        },</span>
<span id="annotated-cell-30-51">    }</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<div class="cell-annotation">
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-30" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-30" data-code-lines="18" data-code-annotation="1">Get the function signature</span>
</dd>
<dt data-target-cell="annotated-cell-30" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-30" data-code-lines="25,27" data-code-annotation="2">For each parameter, convert the type annotation to a valid JSON Schema type. Default to string if the annotation is missing or not in the mapping</span>
</dd>
<dt data-target-cell="annotated-cell-30" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-30" data-code-lines="34,35,36,37,38" data-code-annotation="3">Find out which parameters are required</span>
</dd>
<dt data-target-cell="annotated-cell-30" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-30" data-code-lines="43" data-code-annotation="4">Extract the function name</span>
</dd>
<dt data-target-cell="annotated-cell-30" data-target-annotation="5">5</dt>
<dd>
<span data-code-cell="annotated-cell-30" data-code-lines="44" data-code-annotation="5">Extract the docstring</span>
</dd>
</dl>
</div>
</div>
<div id="1d24fde8" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1">function_to_json(add)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="33">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'function'</span>,
    <span class="ansi-green-fg">'function'</span>: <span class="ansi-bold">{</span>
        <span class="ansi-green-fg">'name'</span>: <span class="ansi-green-fg">'add'</span>,
        <span class="ansi-green-fg">'description'</span>: <span class="ansi-green-fg">'Adds two integers together'</span>,
        <span class="ansi-green-fg">'parameters'</span>: <span class="ansi-bold">{</span>
            <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>,
            <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span><span class="ansi-bold">}</span>,
            <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>
        <span class="ansi-bold">}</span>
    <span class="ansi-bold">}</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
</section>
<section id="approach-2-pydantic" class="level3">
<h3 class="anchored" data-anchor-id="approach-2-pydantic">Approach 2: Pydantic</h3>
<section id="a.-dynamic-models" class="level4">
<h4 class="anchored" data-anchor-id="a.-dynamic-models">2a. Dynamic Models</h4>
<p>I first came across this approach in Jeremy Howard’s <a href="https://www.youtube.com/watch?v=jkrNMKz9pWU">talk</a>. The same pattern is also implemented under the hood in popular libraries like <a href="https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/tools/utils.py">LlamaIndex</a> and LangChain.</p>
<p>Pydantic is a popular Python library for data validation and serialization of structured data. As such, it can convert a Python class into a JSON schema directly.</p>
<p>For example, if we were to define a Pydantic model for our add function manually, it would look like this.</p>
<div id="93e8795a" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb33-2"></span>
<span id="cb33-3"></span>
<span id="cb33-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Add(BaseModel):</span>
<span id="cb33-5">    a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span>
<span id="cb33-6">    b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span>
<span id="cb33-7"></span>
<span id="cb33-8"></span>
<span id="cb33-9">Add.model_json_schema()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="35">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'A'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'B'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span><span class="ansi-bold">}</span>,
    <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>,
    <span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'Add'</span>,
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
<p>However, we want to create the Pydantic model dynamically. This is possible via the <strong>create_model</strong> function provided by Pydantic. It takes the model name as the first argument, followed by keyword arguments that define the fields of the model.</p>
<p>Here <code>a=(int, ...)</code> means that the field <code>a</code> is of type <code>int</code> and is required. The <code>...</code> (Ellipsis) in the default position is what marks the field as required.</p>
<div id="bd4bcfef" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> create_model</span>
<span id="cb34-2"></span>
<span id="cb34-3">a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> create_model(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Add"</span>, a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, ...), b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, ...))</span>
<span id="cb34-4">a.model_json_schema()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="36">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'A'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'B'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span><span class="ansi-bold">}</span>,
    <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>,
    <span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'Add'</span>,
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
<p>Thus, if we can build a dictionary of our function parameters, we can unpack it into <code>create_model</code> with <code>**kwargs</code> and get the JSON schema directly.</p>
<div id="316c06c8" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> create_model</span>
<span id="cb35-2"></span>
<span id="cb35-3">a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> create_model(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Add"</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a"</span>: (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, ...), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>: (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, ...)})</span>
<span id="cb35-4">a.model_json_schema()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="37">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'A'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'B'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span><span class="ansi-bold">}</span>,
    <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>,
    <span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'Add'</span>,
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
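<p>The same dictionary form also covers optional parameters. As a minimal sketch (the default value <code>0</code> for <code>b</code> is made up for illustration), replacing <code>...</code> with a concrete value gives the field a default, and it should drop out of the <code>required</code> list.</p>

```python
from pydantic import create_model

# "a" is required (Ellipsis), "b" is optional with an illustrative default of 0
fields = {"a": (int, ...), "b": (int, 0)}

Add = create_model("Add", **fields)
add_schema = Add.model_json_schema()

# Only "a" remains required; "b" carries its default in the schema
print(add_schema["required"])
print(add_schema["properties"]["b"])
```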
<p>Below, we implement a function that uses this concept to convert the add function into JSON Schema directly.</p>
<p>We use inspect.signature as before to get all the function parameters and then prepare a Pydantic model directly from it.</p>
<div id="d2a5b4f8" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> inspect</span>
<span id="cb36-2"></span>
<span id="cb36-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> create_model</span>
<span id="cb36-4"></span>
<span id="cb36-5"></span>
<span id="cb36-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb36-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb36-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span>
<span id="cb36-9"></span>
<span id="cb36-10"></span>
<span id="cb36-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> schema(f):</span>
<span id="cb36-12">    kws <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb36-13">        name: (</span>
<span id="cb36-14">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the type annotation</span></span>
<span id="cb36-15">            parameter.annotation,</span>
<span id="cb36-16">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check if parameter is required or optional</span></span>
<span id="cb36-17">            ... <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> parameter.default <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> inspect.Parameter.empty <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> parameter.default,</span>
<span id="cb36-18">        )</span>
<span id="cb36-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, parameter <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> inspect.signature(f).parameters.items()</span>
<span id="cb36-20">    }</span>
<span id="cb36-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Pass the function name and parameters to get a pydantic model</span></span>
<span id="cb36-22">    p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> create_model(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"`</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>f<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">`"</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kws)</span>
<span id="cb36-23"></span>
<span id="cb36-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert to JSON Schema</span></span>
<span id="cb36-25">    schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> p.model_json_schema()</span>
<span id="cb36-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="cb36-27">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>,</span>
<span id="cb36-28">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>: {</span>
<span id="cb36-29">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: f.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>,</span>
<span id="cb36-30">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"description"</span>: f.__doc__,</span>
<span id="cb36-31">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"parameters"</span>: schema,</span>
<span id="cb36-32">        },</span>
<span id="cb36-33">    }</span>
<span id="cb36-34"></span>
<span id="cb36-35"></span>
<span id="cb36-36">schema(add)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="43">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'function'</span>,
    <span class="ansi-green-fg">'function'</span>: <span class="ansi-bold">{</span>
        <span class="ansi-green-fg">'name'</span>: <span class="ansi-green-fg">'add'</span>,
        <span class="ansi-green-fg">'description'</span>: <span class="ansi-green-fg">'Adds two integers together'</span>,
        <span class="ansi-green-fg">'parameters'</span>: <span class="ansi-bold">{</span>
            <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span>
                <span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'A'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>,
                <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'B'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>
            <span class="ansi-bold">}</span>,
            <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>,
            <span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'`add`'</span>,
            <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>
        <span class="ansi-bold">}</span>
    <span class="ansi-bold">}</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
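<p>Compared to the hand-written schema from Approach 1, the Pydantic output carries extra <code>title</code> keys. If those are unwanted, one option is to strip them after the fact. The helper below is a hypothetical sketch (the name <code>strip_titles</code> is ours, not part of Pydantic) that removes <code>title</code> annotations while leaving a parameter that happens to be named <code>title</code> untouched.</p>

```python
def strip_titles(node, in_properties=False):
    """Recursively drop 'title' annotations from a JSON schema dict.

    Keys directly under a 'properties' mapping are parameter names,
    so they are kept even if one of them is literally called 'title'.
    """
    if isinstance(node, dict):
        return {
            key: strip_titles(
                value, in_properties=(key == "properties" and not in_properties)
            )
            for key, value in node.items()
            if in_properties or key != "title"
        }
    if isinstance(node, list):
        return [strip_titles(item) for item in node]
    return node


# The 'parameters' part of the output shown above
parameters = {
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {"title": "B", "type": "integer"},
    },
    "required": ["a", "b"],
    "title": "`add`",
    "type": "object",
}

cleaned = strip_titles(parameters)
print(cleaned)
```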
</section>
<section id="b.-type-adapter" class="level4">
<h4 class="anchored" data-anchor-id="b.-type-adapter">2b. Type Adapter</h4>
<p>Pydantic introduced a feature called <code>TypeAdapter</code> in version 2.0. It lets us validate, serialize, and generate JSON schema for arbitrary Python types without first defining a Pydantic model.</p>
<p>We can use it to get the JSON schema for the function parameters directly, without using inspect.signature.</p>
<div id="4b0a3f29" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb37-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> TypeAdapter</span>
<span id="cb37-2"></span>
<span id="cb37-3"></span>
<span id="cb37-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb37-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two integers together"""</span></span>
<span id="cb37-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span>
<span id="cb37-7"></span>
<span id="cb37-8"></span>
<span id="cb37-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> schema(f):</span>
<span id="cb37-10">    schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TypeAdapter(f).json_schema()</span>
<span id="cb37-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="cb37-12">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>,</span>
<span id="cb37-13">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"function"</span>: {</span>
<span id="cb37-14">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: f.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>,</span>
<span id="cb37-15">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"description"</span>: f.__doc__,</span>
<span id="cb37-16">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"parameters"</span>: schema,</span>
<span id="cb37-17">        },</span>
<span id="cb37-18">    }</span>
<span id="cb37-19"></span>
<span id="cb37-20"></span>
<span id="cb37-21">schema(add)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'function'</span>,
    <span class="ansi-green-fg">'function'</span>: <span class="ansi-bold">{</span>
        <span class="ansi-green-fg">'name'</span>: <span class="ansi-green-fg">'add'</span>,
        <span class="ansi-green-fg">'description'</span>: <span class="ansi-green-fg">'Adds two integers together'</span>,
        <span class="ansi-green-fg">'parameters'</span>: <span class="ansi-bold">{</span>
            <span class="ansi-green-fg">'additionalProperties'</span>: <span style="font-style:italic" class="ansi-bright-red-fg">False</span>,
            <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span>
                <span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'A'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>,
                <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'title'</span>: <span class="ansi-green-fg">'B'</span>, <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>
            <span class="ansi-bold">}</span>,
            <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>,
            <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>
        <span class="ansi-bold">}</span>
    <span class="ansi-bold">}</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
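<p>Functions are only one of the things <code>TypeAdapter</code> accepts. As a small sketch, assuming Pydantic 2.x and Python 3.9+, the same two calls work on a plain type such as <code>list[int]</code>, both for generating a schema and for validating data against it.</p>

```python
from pydantic import TypeAdapter

adapter = TypeAdapter(list[int])

# JSON schema for the type, no BaseModel subclass needed
list_schema = adapter.json_schema()
print(list_schema)

# Validation in the default (lax) mode coerces numeric strings to int
validated = adapter.validate_python(["1", 2])
print(validated)
```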
</section>
</section>
<section id="approach-3-decorators" class="level3">
<h3 class="anchored" data-anchor-id="approach-3-decorators">Approach 3: Decorators</h3>
<p>Most agent libraries wrap conversion approaches like the ones above in decorators (e.g.&nbsp;<a href="https://huggingface.co/docs/smolagents/en/guided_tour?build-a-tool=Decorate+a+function+with+%40tool#create-a-new-tool">smolagents</a>) to make them easier to use.</p>
<p>For example, we can make a decorator called <code>tool</code>, which, when applied to a function, will add a <code>json_schema</code> method to that function.</p>
<div id="88eb704e" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> tool(func):</span>
<span id="cb38-2">    func.json_schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span>: function_to_json(func)</span>
<span id="cb38-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> func</span></code></pre></div></div>
</div>
<p>We can mark our functions with the decorator.</p>
<div id="bda3d918" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tool</span></span>
<span id="cb39-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, b: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>:</span>
<span id="cb39-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds two numbers"""</span></span>
<span id="cb39-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
</div>
<p>We can then call the <code>json_schema</code> method to get the schema directly and pass it to the LLM API downstream.</p>
<div id="c5903537" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1">add.json_schema()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="41">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">{</span>
    <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'function'</span>,
    <span class="ansi-green-fg">'function'</span>: <span class="ansi-bold">{</span>
        <span class="ansi-green-fg">'name'</span>: <span class="ansi-green-fg">'add'</span>,
        <span class="ansi-green-fg">'description'</span>: <span class="ansi-green-fg">'Adds two numbers'</span>,
        <span class="ansi-green-fg">'parameters'</span>: <span class="ansi-bold">{</span>
            <span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'object'</span>,
            <span class="ansi-green-fg">'properties'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'a'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span>, <span class="ansi-green-fg">'b'</span>: <span class="ansi-bold">{</span><span class="ansi-green-fg">'type'</span>: <span class="ansi-green-fg">'integer'</span><span class="ansi-bold">}</span><span class="ansi-bold">}</span>,
            <span class="ansi-green-fg">'required'</span>: <span class="ansi-bold">[</span><span class="ansi-green-fg">'a'</span>, <span class="ansi-green-fg">'b'</span><span class="ansi-bold">]</span>
        <span class="ansi-bold">}</span>
    <span class="ansi-bold">}</span>
<span class="ansi-bold">}</span></pre>
</div>
</div>
</div>
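<p>A decorator is also a convenient place to keep track of the functions themselves, so that a tool call coming back from the model can be executed. The sketch below is a hypothetical variant of <code>tool</code>: the <code>TOOLS</code> registry and the <code>call_tool</code> helper are our own names, and it assumes the model returns a tool name plus a JSON-encoded string of arguments, as OpenAI-style tool calls do.</p>

```python
import json

TOOLS = {}  # hypothetical registry: tool name -> Python callable


def tool(func):
    """Register func under its own name so it can be looked up later."""
    TOOLS[func.__name__] = func
    return func


@tool
def add(a: int, b: int) -> int:
    """Adds two numbers"""
    return a + b


def call_tool(name: str, arguments: str):
    """Run a registered tool given its name and JSON-encoded arguments."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(arguments)
    return TOOLS[name](**kwargs)


# Simulated tool call, standing in for what the model would return
result = call_tool("add", '{"a": 2, "b": 3}')
print(result)
```

<p>In a real application, the arguments would ideally be validated against the schema before the function is called, since the model is not guaranteed to produce well-formed input.</p>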
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>In this post, we saw how Python’s runtime introspection enables automatic conversion of function definitions into JSON Schema.</p>


</section>

 ]]></description>
  <category>function-calling</category>
  <category>agents</category>
  <category>python</category>
  <guid>https://amitness.com/posts/function-calling-schema/</guid>
  <pubDate>Sat, 15 Feb 2025 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/function-calling-schema/function-runtime-inspect.png" medium="image" type="image/png" height="66" width="144"/>
</item>
<item>
  <title>Evals for Diversity in Synthetic Data</title>
  <link>https://amitness.com/posts/diversity-evals/</link>
  <description><![CDATA[ 




<p>Synthetic data is a popular approach for bootstrapping an initial dataset when building LLM-based applications.</p>
<p>We can find practical examples of synthetic data usage in the wild such as:</p>
<ul>
<li>Generating synthetic user queries from existing documents to evaluate RAG systems <sup>1</sup></li>
<li>Producing fake meeting transcripts for video call summarization <sup>2</sup></li>
<li>Bootstrapping lots of texts (emails, inquiries, multi-turn chats etc.) for good old classification tasks (customer service routing, intent classification, sentiment analysis, etc.).</li>
</ul>
<p>As a common starting point, people write a prompt defining the data they need, provide a few seed examples either within the prompt or as few-shot exemplars, and sample multiple times from the LLM to bootstrap a dataset.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/diversity-evals/need-for-diversity-score.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="700"></p>
</figure>
</div>
<p>However, LLMs generate repetitive outputs out of the box, and we need special techniques to increase diversity:</p>
<ul>
<li><strong>Sampling Parameters</strong>: higher temperature, nucleus-sampling, top-k sampling, random seeds<br>
</li>
<li><strong>Attribute Generation</strong>: Generating various attributes (topics, writing style, length, personas, emotion, sentiment, location, etc.) beforehand and inserting randomly sampled attributes in the prompt. (<span class="citation" data-cites="yu2023large">Yu et al. (2023)</span>, <span class="citation" data-cites="ge2024scalingsyntheticdatacreation">Ge et al. (2024)</span>)</li>
<li><strong>Post-decoding Clustering</strong>: Overgenerating a large number of texts and deduplicating via cluster centroids <span class="citation" data-cites="ippolito-etal-2019-comparison">(Ippolito et al., 2019)</span> and semantic hashing <span class="citation" data-cites="minishlab2025semhash">(Dongen and Tulkens, 2025)</span></li>
</ul>
<p>But this raises the question:</p>
<blockquote class="blockquote">
<p>How do we systematically test the impact of various techniques above on diversity without relying on just vibe checks?</p>
</blockquote>
<p>I was curious and read the existing academic literature on evaluating diversity. It turns out that there is a large body of prior work on evaluating diversity from the days of classic sequence-to-sequence models and dialogue generation (<span class="citation" data-cites="shaib2024standardizingmeasurementtextdiversity">Shaib et al. (2024a)</span>, <span class="citation" data-cites="guo2024benchmarkinglinguisticdiversitylarge">Guo et al. (2024)</span>).</p>
<p>In this post, I will discuss the various diversity metrics from the literature and explain how they work. These automatic metrics are fast to compute and can be a useful tool to have as a proxy for evaluating linguistic diversity in applied use cases.</p>
<section id="lexical-diversity-metrics" class="level2">
<h2 class="anchored" data-anchor-id="lexical-diversity-metrics">Lexical Diversity Metrics</h2>
<p>Lexical diversity metrics capture the surface-level repetition of words, phrases, topics, and n-grams in the generations.</p>
<section id="distinct-n-grams-distinct-k" class="level3">
<h3 class="anchored" data-anchor-id="distinct-n-grams-distinct-k">Distinct n-grams (Distinct-k)</h3>
<p><span class="citation" data-cites="li2016diversitypromotingobjectivefunctionneural">Li et al. (2016)</span> proposed distinct-k to evaluate their technique for increasing diversity in sequence-to-sequence models. It builds on the type-token ratio concept from linguistics.</p>
<p>They calculate diversity as the ratio of the number of unique n-grams to the total number of n-grams occurring in the entire generated dataset. As shown below, the two texts contain only 5 unique words out of a total of 9 words, and thus the diversity score is only about 56% (5/9 ≈ 0.556).</p>
<p><img src="https://amitness.com/posts/diversity-evals/1-gram-diversity.png" class="img-fluid"></p>
<p>However, if all the synthetic texts were unique, we would get a diversity score of 100% (1.0).</p>
<p><img src="https://amitness.com/posts/diversity-evals/1-gram-hundren-percent-diversity.png" class="img-fluid"></p>
<p>We can extend this same idea from unigrams to bigrams, trigrams, and any higher-order n-grams. There are two approaches.</p>
<p>In the first approach, we report the diversity score separately for different n-grams. <span class="citation" data-cites="li2016diversitypromotingobjectivefunctionneural">Li et al. (2016)</span> do this for unigrams and bigrams, reporting them as distinct-1 and distinct-2, while <span class="citation" data-cites="padmakumar2023writing">Padmakumar and He (2023)</span> report diversity scores separately for up to 4-grams in their paper showing that instruction-tuned models have lower diversity than base models.</p>
<p>Alternatively, we can report a single diversity score by combining the scores for different n-grams. <span class="citation" data-cites="li2022contrastive">Li et al. (2022)</span> take the product of the diversity score for unigrams, bigrams, trigrams, and four-grams as a single final score, while <span class="citation" data-cites="meister2023locallytypicalsampling">Meister et al. (2023)</span> take the sum of the diversities.</p>
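<p>As a rough illustration of both reporting styles, here is a minimal from-scratch sketch using only whitespace tokenization. The helper names <code>distinct_n</code> and <code>combined_distinct</code> are hypothetical and not part of any library, and the tokenization used in the cited papers may differ.</p>

```python
import math


def distinct_n(texts: list[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams across all texts."""
    ngrams = []
    for text in texts:
        words = text.split()  # assumption: simple whitespace tokenization
        ngrams.extend(tuple(words[i : i + n]) for i in range(len(words) - n + 1))
    if not ngrams:
        # assumption: return 0.0 when no text is long enough to form an n-gram
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def combined_distinct(texts: list[str], max_n: int = 4, how: str = "product") -> float:
    """Combine distinct-1..distinct-max_n into one score via product or sum."""
    scores = [distinct_n(texts, n) for n in range(1, max_n + 1)]
    return math.prod(scores) if how == "product" else sum(scores)


texts = ["As an AI language model", "As an AI model"]

print("distinct-1:", round(distinct_n(texts, 1), 3))
print("distinct-2:", round(distinct_n(texts, 2), 3))
print("product of distinct-1..4:", round(combined_distinct(texts, how="product"), 3))
print("sum of distinct-1..4:", round(combined_distinct(texts, how="sum"), 3))
```

<p>Note that the product shrinks quickly when any single n-gram order is repetitive, while the sum is bounded above by the number of n-gram orders being combined.</p>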
<p>The library <a href="https://github.com/cshaib/diversity">diversity</a> by <span class="citation" data-cites="shaib2024standardizingmeasurementtextdiversity">Shaib et al. (2024a)</span> provides an easy way to compute the distinct-k metric:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install diversity</span></code></pre></div></div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diversity <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ngram_diversity_score</span>
<span id="cb2-2"></span>
<span id="cb2-3">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'As an AI language model'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'As an AI model'</span>]</span>
<span id="cb2-4"></span>
<span id="cb2-5">ngram_diversity_score(texts, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.556</span></span></code></pre></div></div>
</section>
<section id="n-gram-entropy-ent-n" class="level3">
<h3 class="anchored" data-anchor-id="n-gram-entropy-ent-n">N-gram Entropy (Ent-n)</h3>
<p><span class="citation" data-cites="zhang2018generating">Zhang et al. (2018)</span> introduced this metric, and <span class="citation" data-cites="jagfeld-etal-2018-sequence">Jagfeld et al. (2018)</span> also used it to evaluate the diversity of template to natural language generation.</p>
<p>The intuition behind it is that, in the ideal case, an LLM generates texts that are all unique and no n-gram is repeated.</p>
<p>We can measure this by collecting all the unique bigrams in the text and calculating their count and the relative frequency. This yields a probability distribution over the bigrams.</p>
<p>For the highest diversity, all the texts would be unique and thus the probability distribution over the bigrams would be uniform, resulting in the highest entropy. Therefore, the entropy of the n-gram distribution serves as a diversity metric, as shown below.</p>
<p><img src="https://amitness.com/posts/diversity-evals/ngram-diversity-high.png" class="img-fluid"></p>
<p>Given the distribution of bigrams, we can calculate the entropy easily as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> math</span>
<span id="cb4-2"></span>
<span id="cb4-3">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>]</span>
<span id="cb4-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> math.log(p) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.3862</span></span></code></pre></div></div>
<p>However, consider a case with lots of repetition, e.g.,&nbsp;“Play the music” being generated 100 times. In such a case, the bigrams “Play the” and “the music” dominate the frequency distribution. As a result, the entropy decreases and the diversity score drops.</p>
<p><img src="https://amitness.com/posts/diversity-evals/ngram-diversity-low.png" class="img-fluid"></p>
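<p>To make the drop concrete, the following minimal sketch compares the bigram entropy of two distinct texts against a set dominated by one repeated text. The helper <code>bigram_entropy</code> is a hypothetical name used only for this illustration, and it assumes whitespace tokenization.</p>

```python
import math
from collections import Counter


def bigram_entropy(texts: list[str]) -> float:
    """Entropy (in nats) of the bigram frequency distribution over all texts."""
    bigrams = []
    for text in texts:
        words = text.split()  # assumption: simple whitespace tokenization
        bigrams.extend(zip(words, words[1:]))
    counts = Counter(bigrams)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


diverse = ["Call an Uber", "Play the music"]
repetitive = ["Call an Uber"] + ["Play the music"] * 100

print("Diverse:", bigram_entropy(diverse))
print("Repetitive:", bigram_entropy(repetitive))
```

<p>The first set has four equally likely bigrams, so its entropy is log(4). In the second set, two bigrams account for 200 of the 202 bigram occurrences, so the entropy is much lower.</p>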
<p>We can also extend this idea to higher-order n-grams, similar to the distinct n-grams metric. <span class="citation" data-cites="tevet2020evaluating">Tevet and Berant (2020)</span> calculate and report entropy separately for unigrams, bigrams, and trigrams.</p>
<p>In contrast, <span class="citation" data-cites="oraby2018controlling">Oraby et al. (2018)</span> combine all unique unigrams, bigrams, and trigrams and then use the entropy of the resulting distribution as the diversity score.</p>
<p>This metric can be implemented in code as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-6" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> math</span>
<span id="annotated-cell-6-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="annotated-cell-6-3"></span>
<span id="annotated-cell-6-4"></span>
<span id="annotated-cell-6-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate_ngrams(words, n: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>):</span>
<span id="annotated-cell-6-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>.join(words[i : i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> n]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(words) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)]</span>
<span id="annotated-cell-6-7"></span>
<span id="annotated-cell-6-8"></span>
<span id="annotated-cell-6-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ngram_entropy(texts: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>], n: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>:</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="1">1</button><span id="annotated-cell-6-10" class="code-annotation-target">    ngrams <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="annotated-cell-6-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> text <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> texts:</span>
<span id="annotated-cell-6-12">        words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> text.split()</span>
<span id="annotated-cell-6-13">        ngrams.extend(generate_ngrams(words, n))</span>
<span id="annotated-cell-6-14"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="2">2</button><span id="annotated-cell-6-15" class="code-annotation-target">    ngram_counts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Counter(ngrams)</span>
<span id="annotated-cell-6-16">    total_ngrams <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(ngram_counts.values())</span>
<span id="annotated-cell-6-17"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="3">3</button><span id="annotated-cell-6-18" class="code-annotation-target">    ngram_frequencies <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> total_ngrams <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> ngram, count <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ngram_counts.items()]</span>
<span id="annotated-cell-6-19"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="4">4</button><span id="annotated-cell-6-20" class="code-annotation-target">    entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> math.log(freq) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> freq <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ngram_frequencies)</span>
<span id="annotated-cell-6-21">        </span>
<span id="annotated-cell-6-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> entropy</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-6" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="10,11,12,13" data-code-annotation="1">Step 1: Generate n-grams from input texts</span>
</dd>
<dt data-target-cell="annotated-cell-6" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="15,16" data-code-annotation="2">Step 2: Count the occurrences of each n-gram</span>
</dd>
<dt data-target-cell="annotated-cell-6" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="18" data-code-annotation="3">Step 3: Calculate the relative frequency (probability) of each n-gram</span>
</dd>
<dt data-target-cell="annotated-cell-6" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="20" data-code-annotation="4">Step 4: Calculate entropy</span>
</dd>
</dl>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Call an Uber"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Play the music"</span>]</span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Unigram entropy:"</span>, ngram_entropy(texts, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb6-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bigram entropy:"</span>, ngram_entropy(texts, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb6-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Trigram entropy:"</span>, ngram_entropy(texts, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb7-1">Unigram entropy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.7917594692280547</span></span>
<span id="cb7-2">Bigram entropy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.3862943611198906</span></span>
<span id="cb7-3">Trigram entropy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6931471805599453</span></span></code></pre></div></div>
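<p>The variant that combines unigrams, bigrams, and trigrams into one distribution can be sketched in the same style. This is one possible reading of that approach, assuming the n-grams of all orders are simply pooled into a single frequency distribution. The function name <code>pooled_ngram_entropy</code> is hypothetical, and the details in the original paper may differ.</p>

```python
import math
from collections import Counter


def pooled_ngram_entropy(texts: list[str], max_n: int = 3) -> float:
    """Entropy of a single distribution pooled over 1..max_n-grams."""
    counts = Counter()
    for text in texts:
        words = text.split()  # assumption: simple whitespace tokenization
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i : i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        # assumption: define the entropy of an empty input as 0.0
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


texts = ["Call an Uber", "Play the music"]
print("Pooled 1-3 gram entropy:", pooled_ngram_entropy(texts))
```

<p>For these two texts, the pool contains 6 unigrams, 4 bigrams, and 2 trigrams, each occurring once, so the result equals log(12).</p>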
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Normalized N-gram Entropy
</div>
</div>
<div class="callout-body-container callout-body">
<p>The original n-gram entropy metric doesn’t have a fixed range for the score.</p>
<p>To get a score in the range of 0 to 1, I thought of a normalized version inspired by the <a href="https://amitness.com/posts/information-retrieval-evaluation#normalized-discounted-cumulative-gain-ndcgk">NDCG</a> metric from Information Retrieval.</p>
<p>For any generated set of texts, the maximum possible diversity occurs when no n-gram repeats, so every n-gram occurs with the same frequency. Thus, the entropy of a uniform distribution over those n-grams gives us the upper bound on diversity.</p>
<p>We can calculate the n-gram entropy as before and then divide it by the entropy of the ideal uniform distribution over the n-grams to get a normalized diversity score between 0 and 1.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> math</span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="cb8-3"></span>
<span id="cb8-4"></span>
<span id="cb8-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate_ngrams(words: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>], n: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>]:</span>
<span id="cb8-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>.join(words[i : i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> n]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(words) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)]</span>
<span id="cb8-7"></span>
<span id="cb8-8"></span>
<span id="cb8-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> normalized_ngram_entropy(texts: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>], n: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>:</span>
<span id="cb8-10">    ngrams <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb8-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> text <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> texts:</span>
<span id="cb8-12">        words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> text.split()</span>
<span id="cb8-13">        ngrams.extend(generate_ngrams(words, n))</span>
<span id="cb8-14"></span>
<span id="cb8-15">    ngram_counts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Counter(ngrams)</span>
<span id="cb8-16">    total_ngrams <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(ngram_counts.values())</span>
<span id="cb8-17"></span>
<span id="cb8-18">    ngram_frequencies <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> total_ngrams <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> ngram, count <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ngram_counts.items()]</span>
<span id="cb8-19">    entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> math.log(freq) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> freq <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ngram_frequencies)</span>
<span id="cb8-20"></span>
<span id="cb8-21">    uniform_frequencies <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(ngrams) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(ngrams))]</span>
<span id="cb8-22">    ideal_entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> math.log(freq) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> freq <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> uniform_frequencies)</span>
<span id="cb8-23"></span>
<span id="cb8-24">    diversity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ideal_entropy</span>
<span id="cb8-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> diversity</span></code></pre></div></div>
<p>We can use it the same way as before.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Call an Uber"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Play the music"</span>]</span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Unigram diversity:"</span>, normalized_ngram_entropy(texts, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb9-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bigram diversity:"</span>, normalized_ngram_entropy(texts, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb9-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Trigram diversity:"</span>, normalized_ngram_entropy(texts, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb10-1">Unigram diversity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span>
<span id="cb10-2">Bigram diversity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span>
<span id="cb10-3">Trigram diversity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span></code></pre></div></div>
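<p>Since every text in that example is unique, all three scores hit the upper bound. As a hedged sketch of how the score behaves under repetition, the compact variant below relies on the fact that the entropy of a uniform distribution over N items is log(N). The name <code>normalized_entropy</code> and the choice to return 0.0 when there are fewer than two n-grams (where the ideal entropy would be zero) are assumptions made for this illustration.</p>

```python
import math
from collections import Counter


def normalized_entropy(texts: list[str], n: int = 2) -> float:
    """N-gram entropy divided by log(total n-grams), giving a score in [0, 1]."""
    ngrams = []
    for text in texts:
        words = text.split()  # assumption: simple whitespace tokenization
        ngrams.extend(" ".join(words[i : i + n]) for i in range(len(words) - n + 1))

    total = len(ngrams)
    if total < 2:
        # assumption: avoid dividing by log(1) == 0 for degenerate inputs
        return 0.0

    counts = Counter(ngrams)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    ideal_entropy = math.log(total)  # entropy of a uniform distribution over `total` items
    return entropy / ideal_entropy


unique_texts = ["Call an Uber", "Play the music"]
repeated_texts = ["Call an Uber"] + ["Play the music"] * 3

print("All unique:", normalized_entropy(unique_texts))
print("With repetition:", normalized_entropy(repeated_texts))
```

<p>The repeated set scores strictly between 0 and 1 because its bigram distribution is no longer uniform.</p>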
</div>
</div>
</section>
<section id="compression-ratio" class="level3">
<h3 class="anchored" data-anchor-id="compression-ratio">Compression Ratio</h3>
<p><span class="citation" data-cites="shaib2024standardizingmeasurementtextdiversity">Shaib et al. (2024a)</span> proposed this metric by adapting the concept of the <a href="https://en.wikipedia.org/wiki/Data_compression_ratio">compression ratio</a>, originally used to evaluate compression algorithms, as a diversity metric.</p>
<p>Compression ratio is the ratio of the original file size to the size of the compressed file. A high compression ratio indicates the file was highly compressible and thus had higher redundancy, indicating lower diversity in the file contents.</p>
<p>To apply this concept to texts, we can compress them using an algorithm like Gzip and then calculate the compression ratio. The greater the compression ratio, the less diverse the generated texts.</p>
<p><img src="https://amitness.com/posts/diversity-evals/gzip-diversity.png" class="img-fluid"></p>
<p>Thus, diversity can be calculated as the reciprocal of the compression ratio to get a score between 0 and 1.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BDiversity%7D%20=%20%5Cfrac%7B1%7D%7B%5Ctext%7BCompression%20Ratio%7D%7D%20=%20%5Cfrac%7B1%7D%7B16.258%7D%20%20=%200.06%0A"></p>
<p>If all the texts are unique and share little repeated content, the compressed file size would be close to the original file size, and thus both the compression ratio and the diversity would be close to 1.</p>
<p>We can implement this in code using the diversity library.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode zsh code-with-copy"><code class="sourceCode zsh"><span id="cb11-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install diversity</span></code></pre></div></div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diversity <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> compression_ratio</span>
<span id="cb12-2"></span>
<span id="cb12-3">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Call an Uber'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Play the music'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb12-4"></span>
<span id="cb12-5">compression_ratio(texts)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">16.258</span></span></code></pre></div></div>
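<p>To make the arithmetic concrete, here is a minimal sketch of the same idea using only Python's built-in <code>gzip</code> module. The function names and the choice to join texts with newlines are assumptions for illustration, not the diversity library's implementation, so the numbers will not necessarily match the output above.</p>

```python
import gzip


def gzip_compression_ratio(texts):
    """Original size divided by the gzip-compressed size of the joined texts."""
    original = "\n".join(texts).encode("utf-8")
    compressed = gzip.compress(original)
    return len(original) / len(compressed)


def gzip_diversity(texts):
    """Reciprocal of the compression ratio."""
    return 1 / gzip_compression_ratio(texts)


# Highly repetitive texts compress well, so the ratio is high
repetitive = ["Call an Uber"] + ["Play the music"] * 100

# A few distinct texts compress poorly, so the ratio is low
varied = [
    "Call an Uber",
    "Play the music",
    "Book a table for two",
    "What is the weather today?",
]

print(gzip_compression_ratio(repetitive), gzip_diversity(repetitive))
print(gzip_compression_ratio(varied), gzip_diversity(varied))
```

<p>One caveat with this sketch: gzip adds a fixed header and trailer, so for very short inputs the compressed output can be larger than the original and the reciprocal can exceed 1. The score is more meaningful on larger collections of texts.</p>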
</section>
</section>
<section id="semantic-diversity-metrics" class="level2">
<h2 class="anchored" data-anchor-id="semantic-diversity-metrics">Semantic Diversity Metrics</h2>
<p>These metrics capture the diversity in terms of meaning and rely on embeddings. They handle cases where the texts share similar meaning but have zero n-gram overlap.</p>
<p>For example, “Play the music” and “Start a song” have zero word overlap and thus would be incorrectly assigned 100% diversity by lexical metrics. However, they are repetitive in meaning and thus should have been assigned a lower diversity score. Semantic diversity metrics can tackle this.</p>
<section id="embedding-diversity" class="level3">
<h3 class="anchored" data-anchor-id="embedding-diversity">Embedding Diversity</h3>
<p><span class="citation" data-cites="tevet2020evaluating">Tevet and Berant (2020)</span> proposed this metric, which considers diversity as the dissimilarity of text embeddings.</p>
<p>The metric calculates sentence embeddings for all generated texts using an encoder (e.g.&nbsp;sentence-transformers).</p>
<p><img src="https://amitness.com/posts/diversity-evals/embedding-diversity.png" class="img-fluid"></p>
<p>Then, we calculate the cosine similarity between all the unique pairs and take the average to get a similarity score.</p>
<p><img src="https://amitness.com/posts/diversity-evals/bert-diversity-pairs.png" class="img-fluid"></p>
<p>To convert the similarity into diversity, we can either take the negation of the average cosine similarity <span class="citation" data-cites="tevet2020evaluating">(Tevet and Berant, 2020)</span> or take the cosine distance, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?1%20-%20%5Ctext%7Bcosine%20similarity%7D"> (<span class="citation" data-cites="young2024improvingstructuraldiversityblackbox">Young et al. (2024)</span>; <span class="citation" data-cites="hayati-etal-2024-far">Hayati et al. (2024)</span>).</p>
<table class="caption-top table">
<colgroup>
<col style="width: 46%">
<col style="width: 25%">
<col style="width: 17%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th>Approach</th>
<th>Mean Cosine Similarity</th>
<th>Diversity</th>
<th>Range</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><span class="citation" data-cites="young2024improvingstructuraldiversityblackbox">Young et al. (2024)</span> / <span class="citation" data-cites="hayati-etal-2024-far">Hayati et al. (2024)</span></td>
<td>0.39</td>
<td>1 - 0.39 = 0.61</td>
<td>0 to 1</td>
</tr>
<tr class="even">
<td><span class="citation" data-cites="tevet2020evaluating">Tevet and Berant (2020)</span></td>
<td>0.39</td>
<td>-0.39</td>
<td>-1 to 0</td>
</tr>
</tbody>
</table>
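<p>The two conversions in the table can be sketched in plain Python. The toy 2-dimensional vectors below are made up for illustration and stand in for real sentence embeddings; the function names are also assumptions. In practice, the embeddings would come from an encoder such as sentence-transformers.</p>

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unique pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy vectors standing in for sentence embeddings (illustrative only)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

similarity = mean_pairwise_similarity(toy_embeddings)
distance_diversity = 1 - similarity  # cosine distance variant
negated_diversity = -similarity      # negation variant

print(similarity, distance_diversity, negated_diversity)
```

<p>Both variants rank datasets identically since they differ only by a constant offset. The cosine distance variant is simply easier to read as a score where higher means more diverse.</p>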
</section>
<section id="dcscore" class="level3">
<h3 class="anchored" data-anchor-id="dcscore">DCScore</h3>
<p>This metric was proposed in a paper currently under review for ICLR 2025 <span class="citation" data-cites="anonymous2024evaluating">(Anonymous, 2024)</span>.</p>
<p>The metric, similar to embedding diversity, also starts by calculating the pairwise similarity between all the text embeddings but has a unique take on formulating the diversity.</p>
<p>To understand the intuition, let’s look at the first row of the pairwise similarity matrix. Here, the numbers 1.0, 0.75, and 0.2 mean that the text is 100% similar to itself, 75% similar to some other text, and 20% similar to the final text. Hypothetically, we would have wanted the text to only be 100% similar to itself and 0% similar to everything else for maximum diversity.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/diversity-evals/dcscore-pairwise-row.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="400"></p>
</figure>
</div>
<p>Thus, we want some relative measure of similarity of the text to itself in comparison to others. The authors use softmax for this. Softmax converts the cosine similarities into relative probabilities of the text belonging to itself and others. When we apply softmax, we see that the first text belongs only 45% to itself. Thus, the softmax probability of the text belonging to itself can be a measure of diversity.</p>
<p><img src="https://amitness.com/posts/diversity-evals/dcscore.png" class="img-fluid"></p>
<p>To calculate the diversity of the dataset overall, we simply take the mean of the diagonal of the pairwise matrix after applying softmax. Thus, we get a diversity score of 0.47 in the example above.</p>
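<p>As a quick check of the intuition, the 45% figure for the first text can be reproduced by applying softmax to the example row by hand. This is a small standalone sketch with a hand-written <code>softmax</code> helper, separate from the full implementation.</p>

```python
import math


def softmax(row):
    """Convert a list of scores into probabilities that sum to 1."""
    exps = [math.exp(value) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


# First row of the pairwise similarity matrix from the example
first_row = [1.0, 0.75, 0.2]

probabilities = softmax(first_row)
self_probability = probabilities[0]

print(round(self_probability, 2))  # 0.45
```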
<p>The implementation is simple and fits in a few lines of code. We can swap the embedding model as needed.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode zsh code-with-copy"><code class="sourceCode zsh"><span id="cb14-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install sentence-transformers scipy numpy</span></code></pre></div></div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-13" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="annotated-cell-13-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.special <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> softmax</span>
<span id="annotated-cell-13-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sentence_transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SentenceTransformer</span>
<span id="annotated-cell-13-4"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-13" data-target-annotation="1">1</button><span id="annotated-cell-13-5" class="code-annotation-target">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SentenceTransformer(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sentence-transformers/all-MiniLM-L6-v2"</span>)</span>
<span id="annotated-cell-13-6"></span>
<span id="annotated-cell-13-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> dcscore(texts: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>:</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-13" data-target-annotation="2">2</button><span id="annotated-cell-13-8" class="code-annotation-target">    text_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.encode(texts, normalize_embeddings<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-13" data-target-annotation="3">3</button><span id="annotated-cell-13-9" class="code-annotation-target">    pairwise_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> text_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> text_embeddings.T</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-13" data-target-annotation="4">4</button><span id="annotated-cell-13-10" class="code-annotation-target">    softmax_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> softmax(pairwise_matrix, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-13" data-target-annotation="5">5</button><span id="annotated-cell-13-11" class="code-annotation-target">    score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.mean(np.diag(softmax_matrix))</span>
<span id="annotated-cell-13-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> score</span>
<span id="annotated-cell-13-13"></span>
<span id="annotated-cell-13-14">score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dcscore([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Play the music'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Start the music'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Call an Uber'</span>])</span>
<span id="annotated-cell-13-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(score)</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-13" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-13" data-code-lines="5" data-code-annotation="1">Load the MiniLM Sentence-BERT model</span>
</dd>
<dt data-target-cell="annotated-cell-13" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-13" data-code-lines="8" data-code-annotation="2">Generate embeddings for the sentences</span>
</dd>
<dt data-target-cell="annotated-cell-13" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-13" data-code-lines="9" data-code-annotation="3">Calculate pairwise cosine similarity</span>
</dd>
<dt data-target-cell="annotated-cell-13" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-13" data-code-lines="10" data-code-annotation="4">Apply softmax on the row level for each text</span>
</dd>
<dt data-target-cell="annotated-cell-13" data-target-annotation="5">5</dt>
<dd>
<span data-code-cell="annotated-cell-13" data-code-lines="11" data-code-annotation="5">Take the mean of the scores in the diagonal</span>
</dd>
</dl>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb15-1"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.47264108</span></span></code></pre></div></div>
</section>
<section id="cluster-inertia" class="level3">
<h3 class="anchored" data-anchor-id="cluster-inertia">Cluster Inertia</h3>
<p><span class="citation" data-cites="du-black-2019-boosting">Du and Black (2019)</span> proposed this metric, reusing the inertia metric used to compute the quality of clustering as the diversity.</p>
<p>The metric clusters embeddings of the LLM-generated texts into 10 clusters and measures the inertia. Inertia is the sum of the squared distances between each point and the centroid of its assigned cluster.</p>
<p><img src="https://amitness.com/posts/diversity-evals/kmeans-inertia-diversity.png" class="img-fluid"></p>
<p>We can treat the inertia as a proxy for diversity because if the texts are diverse, they would be far apart from the centroid and thus the squared distance from the cluster centroid will be larger.</p>
<p>In code, this can be accomplished as shown below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb16-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.cluster <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> KMeans</span>
<span id="cb16-3"></span>
<span id="cb16-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Random placeholder vectors standing in for embeddings of 1024 synthetic texts</span></span>
<span id="cb16-5">text_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1024</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">768</span>)</span>
<span id="cb16-6"></span>
<span id="cb16-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run clustering</span></span>
<span id="cb16-8">kmeans <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb16-9">kmeans.fit(text_embeddings)</span>
<span id="cb16-10"></span>
<span id="cb16-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the inertia</span></span>
<span id="cb16-12">kmeans.inertia_</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">64556.00644871439</span></span></code></pre></div></div>
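<p>To see what the inertia value measures, here is a small sketch that computes it by hand for points that already have cluster labels. The <code>inertia</code> helper and the toy 2-dimensional points are assumptions for illustration; it skips the clustering step that KMeans performs and only shows the sum of squared distances to each cluster's centroid.</p>

```python
def inertia(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dims = range(len(members[0]))
        # The centroid is the per-dimension mean of the cluster members
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in dims)
    return total


labels = [0, 0, 1, 1]

# Points sit close to their cluster centroids
tight = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

# Same clusters, but points are spread further from the centroids
spread = [[0.0, 0.0], [0.0, 6.0], [10.0, 0.0], [10.0, 6.0]]

print(inertia(tight, labels))   # 4.0
print(inertia(spread, labels))  # 36.0
```

<p>The more spread-out set gets the larger inertia, which is the behavior the metric relies on. Note that the raw value also grows with the number of texts and the embedding dimension, so it is best compared across datasets of the same size and encoder.</p>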
</section>
</section>
<section id="syntactic-diversity-metrics" class="level2">
<h2 class="anchored" data-anchor-id="syntactic-diversity-metrics">Syntactic Diversity Metrics</h2>
<p>These metrics capture diversity in terms of the underlying grammatical structure.</p>
<section id="compression-ratio---part-of-speech-cr-pos" class="level3">
<h3 class="anchored" data-anchor-id="compression-ratio---part-of-speech-cr-pos">Compression Ratio - Part of Speech (CR-POS)</h3>
<p><span class="citation" data-cites="shaib2024detection">Shaib et al. (2024b)</span> proposed this metric to detect the repetition of syntactic templates in LLM-generated texts.</p>
<p>It reuses the idea of Compression Ratio but applies it to syntactic representation instead of the raw text. This works by applying a part-of-speech tagger to the text to get the POS tag for each token.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/diversity-evals/pos-tag-visualization.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:50.0%"></p>
</figure>
</div>
<p>We apply a POS tagger to all the synthetically generated texts and get their syntactic representation as strings.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/diversity-evals/text-to-pos-tags.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:75.0%"></p>
</figure>
</div>
<p>Then, the process is the same as the regular compression ratio. We concatenate the POS-tagged strings of all the texts together, compress the text using gzip, and then compute the ratio of the original file size to the compressed file size.</p>
<p><img src="https://amitness.com/posts/diversity-evals/compression-ratio-pos.png" class="img-fluid"></p>
<p>If the compression ratio is high, it indicates a large repetition of syntactic templates in the generated texts. Thus, the diversity will be low.</p>
<p>We can compute diversity directly by taking the reciprocal of the compression ratio and get a score between 0 and 1.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BDiversity(POS)%7D%20=%20%5Cfrac%7B1%7D%7B%5Ctext%7BCompression%20Ratio(POS)%7D%7D%20=%20%5Cfrac%7B1%7D%7B13.02%7D%20%20=%200.076%0A"></p>
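<p>The compression step on POS strings can be sketched with the built-in <code>gzip</code> module. To keep the example self-contained, the tag sequences below are synthetic (one fixed template versus randomly generated sequences) rather than the output of a real POS tagger, and the function name is an assumption. In a real pipeline, the strings would come from a tagger applied to the generated texts.</p>

```python
import gzip
import random

# A small subset of universal POS tags, used only to build synthetic sequences
POS_TAGS = ["NOUN", "VERB", "DET", "ADJ", "ADV", "PRON", "ADP", "PROPN"]


def pos_compression_ratio(pos_sequences):
    """Original size divided by gzip-compressed size of the joined POS strings."""
    original = "\n".join(pos_sequences).encode("utf-8")
    return len(original) / len(gzip.compress(original))


rng = random.Random(0)

# Every text follows the same syntactic template, e.g. "Play the music"
same_template = ["VERB DET NOUN"] * 200

# Synthetic sequences with varying length and tag order
varied_templates = [
    " ".join(rng.choices(POS_TAGS, k=rng.randint(3, 7))) for _ in range(200)
]

for name, sequences in [("same", same_template), ("varied", varied_templates)]:
    ratio = pos_compression_ratio(sequences)
    print(name, ratio, 1 / ratio)
```

<p>Texts such as “Play the music” and “Start a song” share no words but map to similar tag sequences, which is why this metric can flag repetition that the raw-text compression ratio would miss.</p>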
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, in this post, we learned about three categories of linguistic diversity metrics: lexical, semantic, and syntactic.</p>
<p>We have skipped a category of diversity metrics called homogenization scores above as those can be computationally expensive for practical use cases. These work by applying evaluation metrics from machine translation/summarization such as BLEU, ROUGE, etc. on each text treating all other texts as the reference text (<span class="citation" data-cites="zhu2018texygen">Zhu et al. (2018)</span>).</p>
<p>For a deeper dive into diversity metrics, you can read <span class="citation" data-cites="shaib2024standardizingmeasurementtextdiversity">Shaib et al. (2024a)</span> for a comparative analysis of these metrics on various datasets and <span class="citation" data-cites="guo2024benchmarkinglinguisticdiversitylarge">Guo et al. (2024)</span> for an application of the metrics to evaluate popular LLMs.</p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-anonymous2024evaluating" class="csl-entry">
Anonymous. 2024. <a href="https://openreview.net/forum?id=mnB4hDTIDr">Evaluating diversity of <span>LLM</span>-generated datasets: A classification perspective</a>. In <em>Submitted to the thirteenth international conference on learning representations</em>. under review.
</div>
<div id="ref-minishlab2025semhash" class="csl-entry">
Thomas van Dongen and Stephan Tulkens. 2025. <a href="https://github.com/MinishLab/semhash">SemHash: Fast semantic text deduplication</a>.
</div>
<div id="ref-du-black-2019-boosting" class="csl-entry">
Wenchao Du and Alan W Black. 2019. <a href="https://doi.org/10.18653/v1/P19-1005">Boosting dialog response generation</a>. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, <em>Proceedings of the 57th annual meeting of the association for computational linguistics</em>, pages 38–43, Florence, Italy. Association for Computational Linguistics.
</div>
<div id="ref-ge2024scalingsyntheticdatacreation" class="csl-entry">
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. <a href="https://arxiv.org/abs/2406.20094">Scaling synthetic data creation with 1,000,000,000 personas</a>.
</div>
<div id="ref-guo2024benchmarkinglinguisticdiversitylarge" class="csl-entry">
Yanzhu Guo, Guokan Shang, and Chloé Clavel. 2024. <a href="https://arxiv.org/abs/2412.10271">Benchmarking linguistic diversity of large language models</a>.
</div>
<div id="ref-hayati-etal-2024-far" class="csl-entry">
Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, and Dongyeop Kang. 2024. <a href="https://doi.org/10.18653/v1/2024.emnlp-main.306">How far can we extract diverse perspectives from large language models?</a> In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, <em>Proceedings of the 2024 conference on empirical methods in natural language processing</em>, pages 5336–5366, Miami, Florida, USA. Association for Computational Linguistics.
</div>
<div id="ref-ippolito-etal-2019-comparison" class="csl-entry">
Daphne Ippolito, Reno Kriz, João Sedoc, Maria Kustikova, and Chris Callison-Burch. 2019. <a href="https://doi.org/10.18653/v1/P19-1365">Comparison of diverse decoding methods from conditional language models</a>. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, <em>Proceedings of the 57th annual meeting of the association for computational linguistics</em>, pages 3752–3762, Florence, Italy. Association for Computational Linguistics.
</div>
<div id="ref-jagfeld-etal-2018-sequence" class="csl-entry">
Glorianna Jagfeld, Sabrina Jenne, and Ngoc Thang Vu. 2018. <a href="https://doi.org/10.18653/v1/W18-6529">Sequence-to-sequence models for data-to-text natural language generation: Word- vs. Character-based processing and output diversity</a>. In Emiel Krahmer, Albert Gatt, and Martijn Goudbeek, editors, <em>Proceedings of the 11th international conference on natural language generation</em>, pages 221–232, Tilburg University, The Netherlands. Association for Computational Linguistics.
</div>
<div id="ref-li2016diversitypromotingobjectivefunctionneural" class="csl-entry">
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. <a href="https://arxiv.org/abs/1510.03055">A diversity-promoting objective function for neural conversation models</a>.
</div>
<div id="ref-li2022contrastive" class="csl-entry">
Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and M. Lewis. 2022. <a href="https://doi.org/10.48550/arXiv.2210.15097">Contrastive decoding: Open-ended text generation as optimization</a>. <em>Annual Meeting of the Association for Computational Linguistics</em>.
</div>
<div id="ref-meister2023locallytypicalsampling" class="csl-entry">
Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023. <a href="https://arxiv.org/abs/2202.00666">Locally typical sampling</a>.
</div>
<div id="ref-docTTTTTquery" class="csl-entry">
Rodrigo Nogueira and Jimmy Lin. 2019. <a href="https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf">From doc2query to <span class="nocase">docTTTTTquery</span></a>.
</div>
<div id="ref-oraby2018controlling" class="csl-entry">
Shereen Oraby, Lena Reed, Shubhangi Tandon, S. SharathT., S. Lukin, and M. Walker. 2018. <a href="https://doi.org/10.18653/v1/W18-5019">Controlling personality-based stylistic variation with neural natural language generators</a>. <em>SIGDIAL Conference</em>.
</div>
<div id="ref-padmakumar2023writing" class="csl-entry">
Vishakh Padmakumar and He He. 2023. <a href="https://doi.org/10.48550/arXiv.2309.05196">Does writing with language models reduce content diversity?</a> <em>International Conference on Learning Representations</em>.
</div>
<div id="ref-shaib2024standardizingmeasurementtextdiversity" class="csl-entry">
Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, and Ani Nenkova. 2024a. <a href="https://arxiv.org/abs/2403.00553">Standardizing the measurement of text diversity: A tool and a comparative analysis of scores</a>.
</div>
<div id="ref-shaib2024detection" class="csl-entry">
Chantal Shaib, Yanai Elazar, Junyi Jessy Li, and Byron C. Wallace. 2024b. <a href="https://doi.org/10.48550/arXiv.2407.00211">Detection and measurement of syntactic templates in generated text</a>. <em>Conference on Empirical Methods in Natural Language Processing</em>.
</div>
<div id="ref-tevet2020evaluating" class="csl-entry">
Guy Tevet and Jonathan Berant. 2020. <a href="https://doi.org/10.18653/v1/2021.eacl-main.25">Evaluating the evaluation of diversity in natural language generation</a>. <em>Conference of the European Chapter of the Association for Computational Linguistics</em>.
</div>
<div id="ref-young2024improvingstructuraldiversityblackbox" class="csl-entry">
Halley Young, Yimeng Zeng, Jacob Gardner, and Osbert Bastani. 2024. <a href="https://arxiv.org/abs/2408.06186">Improving structural diversity of blackbox LLMs via chain-of-specification prompting</a>.
</div>
<div id="ref-yu2023large" class="csl-entry">
Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J. Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2023. <a href="https://doi.org/10.48550/arXiv.2306.15895">Large language model as attributed training data generator: A tale of diversity and bias</a>. <em>Neural Information Processing Systems</em>.
</div>
<div id="ref-zhang2018generating" class="csl-entry">
Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and W. Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. <em>Neural Information Processing Systems</em>.
</div>
<div id="ref-zhu2018texygen" class="csl-entry">
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. <a href="https://doi.org/10.1145/3209978.3210080">Texygen: A benchmarking platform for text generation models</a>. <em>Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</em>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Jason Liu has a <a href="https://jxnl.co/writing/2024/02/28/levels-of-complexity-rag-applications/#evaluating-the-search-system">great conceptual example</a> of using synthetic data for RAG Evaluation. <span class="citation" data-cites="docTTTTTquery">Nogueira and Lin (2019)</span> is another classic paper.↩︎</p></li>
<li id="fn2"><p>OpenAI has an example walkthrough on generating synthetic transcripts for a daily standup meeting summarization use-case in their <a href="https://vimeo.com/showcase/11333741/video/1023317525">build hour on evals</a>↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{chaudhary2025,
  author = {Chaudhary, Amit},
  title = {Evals for {Diversity} in {Synthetic} {Data}},
  date = {2025-02-09},
  url = {https://amitness.com/posts/diversity-evals/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-chaudhary2025" class="csl-entry quarto-appendix-citeas">
Amit Chaudhary. 2025. <a href="https://amitness.com/posts/diversity-evals/">Evals for Diversity
in Synthetic Data</a>.
</div></div></section></div> ]]></description>
  <category>synthetic-data</category>
  <category>evals</category>
  <category>llm</category>
  <guid>https://amitness.com/posts/diversity-evals/</guid>
  <pubDate>Sun, 09 Feb 2025 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/diversity-evals/dcscore.png" medium="image" type="image/png" height="66" width="144"/>
</item>
<item>
  <title>Zero-Cost Custom Feeds on Bluesky</title>
  <link>https://amitness.com/posts/bluesky-custom-feed/</link>
  <description><![CDATA[ 




<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>I recently built a custom feed on Bluesky to capture the latest discussions on pre-prints from arxiv.org and research papers from conferences like ACL. It was inspired by this Bluesky <a href="https://bsky.app/profile/mariaa.bsky.social/post/3lbrevv7sik2k">post</a> from a researcher requesting such a feed.</p>
<p>While there are drag-and-drop custom feed generators like <a href="https://skyfeed.app">Skyfeed</a>, you are limited to using only regular expressions for the filtering part. If you use a regex pattern to capture all ‘arxiv.org’ links on Skyfeed, it will yield a bunch of false positives with papers from non-ML fields like Quantum Physics, Economics, and so on.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-false-positives.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Since Bluesky’s protocol is open and provides programmatic access, we could instead build and host the custom feed from scratch ourselves. However, it would be costly to run a server 24/7, especially if a large number of people subscribe to our custom feed.</p>
<p>As such, I thought of a nice alternative that circumvents the need to run a server by leveraging how the Bluesky protocol works with custom feeds. The Bluesky app only makes GET requests to the server to fetch a JSON list of post IDs. So, in theory, we could use a static site to host endpoints that return data in the expected format, instead of running a backend server via Flask / FastAPI.</p>
<p>I implemented this idea and it works perfectly. We can offload to Skyfeed for initial filtering, use GitHub Actions for periodic feed generation, filtering, and ranking, and then host the JSONs on a static site using Cloudflare Pages. This removes the need to run a backend server at all and you can launch a custom feed 100% free.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/paper-feed-screenshot.png" class="img-fluid figure-img"></p>
<figcaption>The feed can be easily added to your homepage here: <a href="https://bsky.app/profile/amitness.com/feed/arxiv-feed">https://bsky.app/profile/amitness.com/feed/arxiv-feed</a></figcaption>
</figure>
</div>
</section>
<section id="high-level-overview" class="level2">
<h2 class="anchored" data-anchor-id="high-level-overview">High-level Overview</h2>
<p>We first use Skyfeed to filter the entire network of posts on Bluesky using a regular expression for posts with links for <strong>arxiv.org</strong> papers.</p>
<p>Then, the resulting feed is filtered in Python using Bluesky’s atproto library. Here, we iterate through each paper and check whether it belongs to the arxiv categories for Machine Learning, NLP, and Computer Vision via the <a href="https://github.com/thechrisu/pyarxiv">pyarxiv</a> library. From the filtered list of papers, we generate the JSON data format required by Bluesky for reading feeds and push that to Cloudflare Pages as a static site.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/bluesky-stack-pipeline.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
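<p>As a rough illustration of the category check described above, here is a minimal sketch. It is not the code from the repo: the <code>ALLOWED_CATEGORIES</code> set and both helper names are assumptions made for this example, and the categories of each paper are passed in directly rather than fetched through pyarxiv.</p>

```python
import re

# Assumed allow-list: arXiv categories for Machine Learning, NLP, and Computer Vision.
ALLOWED_CATEGORIES = {"cs.LG", "cs.CL", "cs.CV"}

# Matches new-style arXiv IDs such as 2401.01234 in abs/ or pdf/ links.
ARXIV_ID_PATTERN = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


def extract_arxiv_id(url):
    """Return the arXiv ID found in a link, or None if there is none."""
    match = ARXIV_ID_PATTERN.search(url)
    return match.group(1) if match else None


def is_relevant(categories):
    """Keep a paper if at least one of its categories is in the allow-list."""
    return bool(ALLOWED_CATEGORIES.intersection(categories))
```

<p>In a real pipeline, the ID returned by <code>extract_arxiv_id</code> would be used to look up the paper's categories, and <code>is_relevant</code> would decide whether the post stays in the feed.</p>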
<p>When the feed is loaded on the Bluesky app, the app will make a request to our static page on Cloudflare and get a list of the post IDs as a JSON response. The app will parse each post ID, render it in the app, and display the feed. This runs super quick.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/bluesky-api-calls.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<section id="clone-the-code-locally" class="level3">
<h3 class="anchored" data-anchor-id="clone-the-code-locally">1. Clone the code locally</h3>
<p>The code for the concept described above has been implemented at <a href="https://github.com/amitness/bluesky-arxiv" class="uri">https://github.com/amitness/bluesky-arxiv</a>.</p>
<p>First, make a fork of my repo from <a href="https://github.com/amitness/bluesky-arxiv" class="uri">https://github.com/amitness/bluesky-arxiv</a> and then clone your repo locally.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace with the link to your repo</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone git@github.com:amitness/bluesky-arxiv.git</span></code></pre></div></div>
<p>Install the required libraries via the requirements.txt file in your virtual environment.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-r</span> requirements.txt</span></code></pre></div></div>
</section>
<section id="setup-cloudflare-pages" class="level3">
<h3 class="anchored" data-anchor-id="setup-cloudflare-pages">2. Set up Cloudflare Pages</h3>
<p>We will need a Cloudflare Pages site to host the data in the format needed by Bluesky.</p>
<p>You can create an account on <a href="https://pages.cloudflare.com/">Cloudflare Pages</a>. Once the account is created, go to <strong>Workers and Pages &gt; Overview</strong> from the left sidebar on the dashboard.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-create-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You should see two tabs: Workers and Pages. Click the <strong>Pages</strong> tab.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-pages-tab.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Then, click the <strong>“Upload Assets”</strong> button.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-pages-step-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Then enter a name for the project. Cloudflare will provide you a unique domain based on it. Click <strong>Create Project</strong>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-pages-step3.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You will be shown the page below, which allows you to upload a zip file or a folder. At this stage, just upload any folder from your device with at least one file in it. Once you’re done, click <strong>Deploy site</strong>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-page-upload-assets.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Once the site is deployed, you should see a message below with the URL of your domain.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-deployment-message.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>In the repo that you cloned locally, change the <strong>SERVICE_DOMAIN</strong> variable in the <strong>config.py</strong> file to the domain you got above from Cloudflare.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>config.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" data-filename="config.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Domain provided by Cloudflare pages</span></span>
<span id="cb3-2">SERVICE_DOMAIN <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bluesky-1tj.pages.dev"</span></span></code></pre></div></div>
</div>
</section>
<section id="initialize-a-custom-feed-on-bluesky" class="level3">
<h3 class="anchored" data-anchor-id="initialize-a-custom-feed-on-bluesky">3. Initialize a custom feed on Bluesky</h3>
<p>Now, we will initialize a custom feed programmatically on Bluesky.</p>
<p>In the repo, you will find a <strong>config.py</strong> file. You have to change a few configurations inside it.</p>
<p>First, change the HANDLE to your Bluesky handle.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>config.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" data-filename="config.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># YOUR bluesky handle</span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ex: user.bsky.social</span></span>
<span id="cb4-3">HANDLE: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"amitness.com"</span></span></code></pre></div></div>
</div>
<p>Then you need to generate an app password for Bluesky. It’s available at <a href="https://bsky.app/settings/app-passwords" class="uri">https://bsky.app/settings/app-passwords</a> and will allow us to get programmatic access to Bluesky in Python.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/bluesky-add-app-password.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can set a name to denote what the password is going to be used for. Here I set it to <strong>custom-feed</strong>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/bluesky-set-app-password-name.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Then you will receive your app password. Take note of it in a safe place as you won’t be able to access it again.</p>
<p>Now you can set the <strong>BLUESKY_APP_PASSWORD</strong> environment variable to your password.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">export</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">BLUESKY_APP_PASSWORD</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>...</span></code></pre></div></div>
<p>This will be read in <strong>config.py</strong> and used by the <strong>setup_feed.py</strong> script.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>config.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" data-filename="config.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># YOUR bluesky password, or preferably an App Password (found in your client settings)</span></span>
<span id="cb6-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ex: abcd-1234-efgh-5678</span></span>
<span id="cb6-3">PASSWORD <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BLUESKY_APP_PASSWORD"</span>]</span></code></pre></div></div>
</div>
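<p>Reading <code>os.environ["BLUESKY_APP_PASSWORD"]</code> directly raises a bare <code>KeyError</code> when the variable is missing. If you prefer a clearer message, a small wrapper like the sketch below could be used. The <code>read_app_password</code> helper is hypothetical and not part of the repo.</p>

```python
import os


def read_app_password(env=None, name="BLUESKY_APP_PASSWORD"):
    """Return the app password from the environment or fail with a readable error."""
    # Allow a plain dict to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    password = env.get(name, "").strip()
    if not password:
        raise RuntimeError(f"Set the {name} environment variable before running the scripts.")
    return password
```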
<p>Next, you can modify the name of your custom feed, a description and the slug. Here is what I have set.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>config.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" data-filename="config.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A short name for the record that will show in urls</span></span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Lowercase with no spaces.</span></span>
<span id="cb7-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ex: whats-hot</span></span>
<span id="cb7-4">RECORD_NAME: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arxiv-feed"</span></span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A display name for your feed</span></span>
<span id="cb7-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ex: What's Hot</span></span>
<span id="cb7-8">DISPLAY_NAME: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Papers"</span></span>
<span id="cb7-9"></span>
<span id="cb7-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (Optional) A description of your feed</span></span>
<span id="cb7-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ex: Top trending content from the whole network</span></span>
<span id="cb7-12">DESCRIPTION: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dedent(</span>
<span id="cb7-13">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> Latest ML research papers and preprints from arxiv.org discussed on Bluesky.</span></span>
<span id="cb7-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb7-16"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> Logic:</span></span>
<span id="cb7-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> - Fetch arxiv preprints &amp; filters out non-ML via arxiv API</span></span>
<span id="cb7-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> - Ranks the items using hackernews algorithm</span></span>
<span id="cb7-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> """</span></span>
<span id="cb7-20">).strip()</span></code></pre></div></div>
</div>
<p>Here is how it will render in the Bluesky app later on.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/custom-feed-metadata.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Once everything above is set up, you can run the script.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> setup_feed.py</span></code></pre></div></div>
<p>This will initialize our custom feed on Bluesky. If everything was set up correctly, you will get an output for the value of <strong>FEED_URI</strong>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/setup-feed-output.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Update the <strong>FEED_URI</strong> in the <strong>config.py</strong> file with this value.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>config.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" data-filename="config.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Feed URI generated by running `python setup_feed.py`</span></span>
<span id="cb9-2">FEED_URI <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"at://did:plc:bpuq5cgmyvssgi3iwsyvd4gn/app.bsky.feed.generator/arxiv-feed"</span></span></code></pre></div></div>
</div>
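<p>The <strong>FEED_URI</strong> above has three parts after the <code>at://</code> prefix: the DID of your account, the record collection, and the record key, which here matches the <strong>RECORD_NAME</strong> set earlier. The sketch below splits a URI into those parts as a quick sanity check. The <code>parse_at_uri</code> helper is a hypothetical name for this example and is not part of the atproto library.</p>

```python
def parse_at_uri(uri):
    """Split an AT URI of the form at://DID/COLLECTION/RECORD_KEY into its parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an AT URI: {uri}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3:
        raise ValueError(f"Expected DID, collection, and record key in: {uri}")
    did, collection, record_key = parts
    return {"did": did, "collection": collection, "record_key": record_key}


feed = parse_at_uri("at://did:plc:bpuq5cgmyvssgi3iwsyvd4gn/app.bsky.feed.generator/arxiv-feed")
print(feed["record_key"])
```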
<p>Your feed has been created and now it needs to be populated before you can start using it in the app.</p>
</section>
<section id="setup-skyfeed" class="level3">
<h3 class="anchored" data-anchor-id="setup-skyfeed">4. Set up Skyfeed</h3>
<p>In this step, we will build an initial feed using the interface of the <a href="https://skyfeed.app/">Skyfeed</a> app.</p>
<p>You can sign up on <a href="https://skyfeed.app/">skyfeed.app</a> using your Bluesky handle and the app password you created in the previous step.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-login.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>After logging in, go to the top-right and click <strong>Create Feed</strong> to create a new feed.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-initial-page.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You will see a bunch of options. Since our goal is to keep only the posts on Bluesky from the past 24 hours that mention arxiv.org or aclanthology.org, we can set up the options as follows.</p>
<p>First, the Input field specifies how many posts to capture. We will specify the <strong>Entire Network</strong> and set the time to <strong>24 hours</strong> because we want to run a regex over all posts on Bluesky indexed in the past 24 hours. Depending on your use case, you can modify this part.</p>
<p>As seen below, it yields 6 million posts in the past 24 hours.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-input.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Now, we will filter those 6 million posts to only get items that mention either <strong>arxiv.org</strong> or <strong>aclanthology.org</strong> links. This can be achieved with the regex below, which you can paste into the <strong>RegEx</strong> field. Make sure the <strong>Post Text</strong> and <strong>Link</strong> items are green as we want to search only in the post text and links.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode js code-with-copy"><code class="sourceCode javascript"><span id="cb10-1">(arxiv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">org</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/.+</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>(aclanthology<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">org</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/.+</span>)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-arxiv-regex.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
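<p>One detail worth checking is that the dots in the pattern above are unescaped, so they match any character. The sketch below compares the pattern with an escaped variant using Python’s <code>re</code> module. Skyfeed’s regex engine may behave differently, so treat this as an approximation; the <code>mentions_paper</code> helper and the sample strings are made up for this example.</p>

```python
import re

# The pattern as pasted into Skyfeed: "." matches any character.
LOOSE = re.compile(r"(arxiv.org/.+)|(aclanthology.org/.+)")

# A stricter variant with the dots escaped so only literal dots match.
STRICT = re.compile(r"(arxiv\.org/.+)|(aclanthology\.org/.+)")


def mentions_paper(text, pattern=STRICT):
    """Return True if the text contains an arxiv.org or aclanthology.org link."""
    return pattern.search(text) is not None


print(mentions_paper("New paper: https://arxiv.org/abs/2401.01234"))
print(mentions_paper("my arxivXorg/notes folder", LOOSE))
print(mentions_paper("my arxivXorg/notes folder", STRICT))
```

<p>In practice the loose pattern rarely causes trouble because the later Python step filters the results again, but the escaped version avoids accidental matches.</p>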
<p>Here is how it should look after everything is set up correctly.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-regex-usage.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>With this setup, we can now publish the feed as shown below by clicking the <strong>Update Feed</strong> button and then clicking <strong>Publish</strong> in the popup. This will create a feed that can be accessed via Bluesky.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-publish.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You should see the link to your published skyfeed as shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/skyfeed-copy-did.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Copy the portion as shown above to the <strong>SKYFEED_DID</strong> variable in <code>config.py</code>. We will be further filtering this feed now using Python in the next steps.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>config.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" data-filename="config.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Skyfeed path</span></span>
<span id="cb11-2">SKYFEED_DID <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"did:plc:bpuq5cgmyvssgi3iwsyvd4gn/feed/aaagg56kp5qzi"</span></span></code></pre></div></div>
</div>
</section>
<section id="feed-generation-in-python" class="level3">
<h3 class="anchored" data-anchor-id="feed-generation-in-python">5. Feed Generation in Python</h3>
<p>With the above steps done, we can build out the feed generation logic. The crux of the logic is in the <strong>generate_feed.py</strong> file. Let’s understand how it works:</p>
<section id="cloudflare-page-generation" class="level4">
<h4 class="anchored" data-anchor-id="cloudflare-page-generation">1. Cloudflare Page Generation</h4>
<p>The entire thing is defined in the <strong>main</strong> function.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>generate_feed.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" data-filename="generate_feed.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> main():</span>
<span id="cb12-2">    did_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb12-3">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"@context"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://www.w3.org/ns/did/v1"</span>],</span>
<span id="cb12-4">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"did:web:</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>config<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>SERVICE_DOMAIN<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb12-5">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"service"</span>: [</span>
<span id="cb12-6">            {</span>
<span id="cb12-7">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#bsky_fg"</span>,</span>
<span id="cb12-8">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BskyFeedGenerator"</span>,</span>
<span id="cb12-9">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"serviceEndpoint"</span>: <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"https://</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>config<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>SERVICE_DOMAIN<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb12-10">            }</span>
<span id="cb12-11">        ],</span>
<span id="cb12-12">    }</span>
<span id="cb12-13">    write_json(did_data, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./_site/.well-known/did.json"</span>)</span>
<span id="cb12-14"></span>
<span id="cb12-15">    feed_generator_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb12-16">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"encoding"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"application/json"</span>,</span>
<span id="cb12-17">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"body"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"did"</span>: config.SERVICE_DID, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"feeds"</span>: [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"uri"</span>: config.FEED_URI}]},</span>
<span id="cb12-18">    }</span>
<span id="cb12-19"></span>
<span id="cb12-20">    write_json(feed_generator_data, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./_site/xrpc/app.bsky.feed.describeFeedGenerator"</span>)</span></code></pre></div></div>
</div>
<p>This part of the code generates the metadata JSON files that Bluesky will request from our Cloudflare Pages site at the following paths.</p>
<ul>
<li><a href="https://bluesky-1tj.pages.dev/.well-known/did.json">https://bluesky-1tj.pages.dev/.well-known/did.json</a></li>
<li><a href="https://bluesky-1tj.pages.dev/xrpc/app.bsky.feed.describeFeedGenerator">https://bluesky-1tj.pages.dev/xrpc/app.bsky.feed.describeFeedGenerator</a> (feed details)</li>
</ul>
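<p>The <code>write_json</code> helper used above is not shown in this post. A minimal sketch of what such a helper might look like is below, assuming it only needs to create any missing folders and dump the dictionary to the given path. The actual implementation in the repo may differ.</p>

```python
import json
from pathlib import Path


def write_json(data, path):
    """Write a dictionary as JSON to the given path, creating parent folders if needed."""
    target = Path(path)
    # Nested paths like ./_site/xrpc/... need their folders to exist first.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

<p>Note that the output files have no <code>.json</code> extension in the <code>xrpc</code> folder because the file name has to match the endpoint path that Bluesky requests.</p>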
</section>
<section id="filtering-posts" class="level4">
<h4 class="anchored" data-anchor-id="filtering-posts">2. Filtering Posts</h4>
<p>The main logic lies in the code below, which generates the data for the endpoint that contains all the post IDs that should be rendered in the feed.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>generate_feed.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" data-filename="generate_feed.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fetch latest posts and prepare data in the format expected by Bluesky protocol</span></span>
<span id="cb13-2">post_uris <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fetch_latest_posts()</span>
<span id="cb13-3"></span>
<span id="cb13-4">feed_skeletion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"feed"</span>: [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"post"</span>: uri} <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> uri <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> post_uris]}</span>
<span id="cb13-5">write_json(feed_skeletion, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./_site/xrpc/app.bsky.feed.getFeedSkeleton"</span>)</span></code></pre></div></div>
</div>
<p>It generates the endpoint that will return the post IDs that should be rendered in our custom feed. (<a href="https://bluesky-1tj.pages.dev/xrpc/app.bsky.feed.getFeedSkeleton">https://bluesky-1tj.pages.dev/xrpc/app.bsky.feed.getFeedSkeleton</a>)</p>
<p>The main logic for the feed filtering is defined in the <a href="https://github.com/amitness/bluesky-arxiv/blob/main/generate_feed.py#L112">fetch_latest_posts()</a> function in the <a href="https://github.com/amitness/bluesky-arxiv/blob/main/generate_feed.py">generate_feed.py</a> file.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>generate_feed.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-14" data-filename="generate_feed.py" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fetch_latest_posts():</span>
<span id="annotated-cell-14-2"> client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Client()</span>
<span id="annotated-cell-14-3"> client.login(config.HANDLE, config.PASSWORD)</span>
<span id="annotated-cell-14-4"></span>
<span id="annotated-cell-14-5"> data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.app.bsky.feed.get_feed(</span>
<span id="annotated-cell-14-6"> {</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-14" data-target-annotation="1" onclick="event.preventDefault();">1</a><span id="annotated-cell-14-7" class="code-annotation-target">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"feed"</span>: config.SKYFEED_PATH,</span>
<span id="annotated-cell-14-8">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"limit"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,</span>
<span id="annotated-cell-14-9"> },</span>
<span id="annotated-cell-14-10">        timeout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,</span>
<span id="annotated-cell-14-11"> )</span>
<span id="annotated-cell-14-12"></span>
<span id="annotated-cell-14-13"> feed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data.feed</span>
<span id="annotated-cell-14-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>):</span>
<span id="annotated-cell-14-15">        data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.app.bsky.feed.get_feed(</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-14" data-target-annotation="2" onclick="event.preventDefault();">2</a><span id="annotated-cell-14-16" class="code-annotation-target">            {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"feed"</span>: config.SKYFEED_PATH, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"limit"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cursor"</span>: data.cursor},</span>
<span id="annotated-cell-14-17">            timeout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>,</span>
<span id="annotated-cell-14-18">        )</span>
<span id="annotated-cell-14-19">        feed.extend(data.feed)</span>
<span id="annotated-cell-14-20"></span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-14" data-target-annotation="3" onclick="event.preventDefault();">3</a><span id="annotated-cell-14-21" class="code-annotation-target">    bool_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> thread_map(filter_item, feed)</span>
<span id="annotated-cell-14-22">    filtered_feed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compress(feed, bool_filter)</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-14" data-target-annotation="4" onclick="event.preventDefault();">4</a><span id="annotated-cell-14-23" class="code-annotation-target">    sorted_feed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rank_posts(filtered_feed)</span>
<span id="annotated-cell-14-24">    post_uris <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [item.post.uri <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> sorted_feed]</span>
<span id="annotated-cell-14-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> post_uris</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
</div>
<dl class="code-annotation-container-grid">
<dt data-target-cell="annotated-cell-14" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-14" data-code-lines="7" data-code-annotation="1">We fetch the feed from the Skyfeed custom feed we generated in the earlier step</span>
</dd>
<dt data-target-cell="annotated-cell-14" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-14" data-code-lines="16" data-code-annotation="2">A cursor is used to paginate and fetch an additional 200 items from that feed (two more pages of 100 items each)</span>
</dd>
<dt data-target-cell="annotated-cell-14" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-14" data-code-lines="21" data-code-annotation="3">Then the items are filtered using the <strong>filter_item</strong> function that checks whether the links present in the item are indeed CS Arxiv papers. We make use of <a href="https://amitness.com/posts/parallel-progress-bar/#running-concurrent-threads">thread_map</a> to parallelize the process.</span>
</dd>
<dt data-target-cell="annotated-cell-14" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-14" data-code-lines="23" data-code-annotation="4">We re-rank the filtered items in the feed using the Hacker News ranking algorithm</span>
</dd>
</dl>
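<p>The filter-then-compress step from annotation 3 can be tried out in isolation. The snippet below is a minimal sketch rather than the project's code: it uses the standard library's <code>ThreadPoolExecutor</code> in place of <code>thread_map</code> so that it runs without extra dependencies, and the <code>looks_like_arxiv_link</code> predicate and the sample URLs are hypothetical stand-ins for <code>filter_item</code> and real feed items.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import compress


def looks_like_arxiv_link(url):
    # Hypothetical stand-in for filter_item: the real check inspects the links in a post.
    return url.startswith("https://arxiv.org/abs/")


# Made-up sample data standing in for feed items.
feed = [
    "https://arxiv.org/abs/0000.00001",
    "https://example.com/blog-post",
    "https://arxiv.org/abs/0000.00002",
]

# Run the predicate concurrently; map() returns results in input order,
# so the boolean mask lines up with the feed.
with ThreadPoolExecutor(max_workers=4) as executor:
    bool_filter = list(executor.map(looks_like_arxiv_link, feed))

# compress() keeps only the items whose mask entry is truthy.
filtered_feed = list(compress(feed, bool_filter))
print(filtered_feed)
```

<p>Note that <code>compress</code> returns a lazy iterator, so it can only be consumed once. That is fine in <code>fetch_latest_posts</code> because the result is passed straight into <code>rank_posts</code>.</p>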
</section>
<section id="re-ranking-with-hackernews-score" class="level4">
<h4 class="anchored" data-anchor-id="re-ranking-with-hackernews-score">3. Re-ranking with hackernews score</h4>
<p>The re-ranking of the posts is defined in the <strong>rank_posts</strong> function. I used the Hacker News ranking algorithm, which is quite simple. We compute the points for a post as the sum of its likes, quotes, replies and reposts. That score is then decayed by the number of hours since the post was indexed, so items are gradually ranked lower as they get older, and posts that are 12 hours old or more get a score of 0. This balances popular against recent research papers.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>generate_feed.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" data-filename="generate_feed.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> hackernews_score(item, gravity: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.5</span>):</span>
<span id="cb14-2">    hours_passed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb14-3">        datetime.now(timezone.utc) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> parse_date(item.post.indexed_at)</span>
<span id="cb14-4">    ).total_seconds() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3600</span></span>
<span id="cb14-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> hours_passed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>:</span>
<span id="cb14-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb14-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb14-8">        points <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb14-9">            item.post.like_count</span>
<span id="cb14-10">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> item.post.quote_count</span>
<span id="cb14-11">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> item.post.reply_count</span>
<span id="cb14-12">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> item.post.repost_count</span>
<span id="cb14-13">        )</span>
<span id="cb14-14">        score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> points <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> ((hours_passed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> (gravity))</span>
<span id="cb14-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> score</span>
<span id="cb14-16"></span>
<span id="cb14-17"></span>
<span id="cb14-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> rank_posts(feed):</span>
<span id="cb14-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>(feed, key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>hackernews_score, reverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
</div>
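<p>To get a feel for how the decay plays out, here is a small worked sketch that applies the same formula to plain numbers instead of Bluesky post objects. The <code>decayed_score</code> helper and the three sample posts are made up for illustration.</p>

```python
def decayed_score(points, hours_passed, gravity=2.5):
    # Same formula as hackernews_score, but on plain numbers instead of a post object.
    if hours_passed >= 12:
        return 0
    return points / ((hours_passed + 2) ** gravity)


# Hypothetical posts: (label, total interactions, age in hours).
posts = [
    ("popular but older", 100, 10),
    ("modest but fresh", 20, 1),
    ("past the cutoff", 500, 12),
]

ranked = sorted(posts, key=lambda p: decayed_score(p[1], p[2]), reverse=True)
for label, points, hours in ranked:
    print(label, round(decayed_score(points, hours), 3))
```

<p>With the default gravity of 2.5, the one-hour-old post with 20 interactions ranks above the ten-hour-old post with 100, and anything at or beyond the 12-hour cutoff drops to the bottom regardless of its popularity.</p>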
</section>
</section>
<section id="running-periodically-via-github-actions" class="level3">
<h3 class="anchored" data-anchor-id="running-periodically-via-github-actions">6. Running periodically via GitHub Actions</h3>
<p>To run our script periodically for free, we can leverage GitHub Actions. This will fetch the feed from Skyfeed, perform the filtering and re-ranking, and push the resulting data to Cloudflare Pages every 30 minutes.</p>
<p>The schedule for the cron job is defined in the <strong>build_and_deploy.yml</strong> file and can be modified there as needed.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>.github/workflows/build_and_deploy.yml</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" data-filename=".github/workflows/build_and_deploy.yml" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb15-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Build and deploy site to cloudflare</span></span>
<span id="cb15-2"></span>
<span id="cb15-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb15-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">push</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb15-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">branches</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb15-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> main</span></span>
<span id="cb15-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">schedule</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb15-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cron</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*/30 * * * *'</span></span></code></pre></div></div>
</div>
<p><a href="https://crontab.guru/#*/30_*_*_*_*">Crontab.guru</a> is a great website to visualize what the cron syntax does.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/crontab-guru-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
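<p>As a quick illustration of how the minute field <code>*/30</code> is read, the sketch below lists the minutes of each hour that a step expression matches. The <code>minutes_matching_step</code> helper is hypothetical and only covers the <code>*/N</code> form. It is not a general cron parser.</p>

```python
def minutes_matching_step(step, minutes_in_hour=60):
    # "*/step" in the minute field matches every minute divisible by step.
    return [minute for minute in range(minutes_in_hour) if minute % step == 0]


# Minutes of each hour at which the workflow above is scheduled.
print(minutes_matching_step(30))
```

<p>This only describes when a run is scheduled. GitHub may start scheduled workflows later than the nominal time when its runners are busy.</p>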
<p>To enable the actions in your forked GitHub repo, go to <strong>“Settings &gt; Secrets and Variables”</strong>, click <strong>“New Repository Secret”</strong>, and set these three variables one by one:</p>
<ul>
<li>BLUESKY_APP_PASSWORD</li>
<li>CLOUDFLARE_ACCOUNT_ID</li>
<li>CLOUDFLARE_API_TOKEN</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/setup-github-secret-variables.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can get your <strong>“CLOUDFLARE_ACCOUNT_ID”</strong> by logging in to <a href="https://pages.cloudflare.com/">Cloudflare Pages</a> and then getting the value from the right sidebar as shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-account-id.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>To get the <strong>CLOUDFLARE_API_TOKEN</strong>, create a new token from <a href="https://dash.cloudflare.com/profile/api-tokens">https://dash.cloudflare.com/profile/api-tokens</a> as shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/cloudflare-user-token-creation.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Once all three secret variables have been set up, you can enable GitHub actions in your forked repo as shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/github-action-enable.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>The action should now run automatically every 30 minutes. Each run fetches the latest posts from Skyfeed, performs the filtering and re-ranking, generates the final set of posts to be displayed on Bluesky, and deploys it to Cloudflare.</p>
</section>
<section id="access-your-feed" class="level3">
<h3 class="anchored" data-anchor-id="access-your-feed">7. Access your feed</h3>
<p>Your feed will now be listed on your profile at <a href="https://bsky.app/feeds">bsky.app/feeds</a> and can be pinned to the homepage as well.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/bluesky-custom-feed/custom-feed-pin.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can find the link from the address bar when the feed is open and share it with others. <img src="https://amitness.com/posts/bluesky-custom-feed/custom-feed-url.png" class="img-fluid quarto-figure quarto-figure-center"></p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>In this post, we built a custom feed on Bluesky using a combination of Skyfeed, GitHub Actions and Cloudflare Pages.</p>
<p>While we built it to get a feed of Arxiv papers, you can extend the same approach to do a bunch of useful stuff. You could integrate lightweight classifiers to classify/re-rank posts for relevance to your interests or even filter out toxic posts from your feed.</p>
<p>You can also skip Skyfeed as the initial source and instead read from the firehose or one of your existing feeds directly using <a href="https://atproto.blue/en/latest/">atproto</a> and handle the indexing via a small SQLite database or JSON committed directly to GitHub via the actions.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://docs.bsky.app/docs/starter-templates/custom-feeds">Official Documentation on Bluesky Custom Feeds</a></li>
<li><a href="https://github.com/MarshalX/bluesky-feed-generator">Bluesky custom feed algorithms server in Python</a></li>
<li><a href="https://github.com/amac0/miscpubliccode/blob/master/helloworldbskytimeline">A custom “Hello World” feed implemented using AWS Lambda, demonstrating a single hardcoded post</a></li>
</ul>


</section>

 ]]></description>
  <category>misc</category>
  <guid>https://amitness.com/posts/bluesky-custom-feed/</guid>
  <pubDate>Sun, 01 Dec 2024 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/bluesky-custom-feed/bluesky-stack-pipeline-thumb.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Parallel Processing with tqdm</title>
  <link>https://amitness.com/posts/parallel-progress-bar/</link>
  <description><![CDATA[ 




<p><a href="https://github.com/tqdm/tqdm">tqdm</a> is a popular library that’s widely used in open-source Python ML libraries for displaying progress bars. As such, it’s often already installed as a dependency when working on machine learning projects.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> show tqdm</span></code></pre></div></div>
</div>
<blockquote class="blockquote">
<p>Required-by: <em>datasets</em>, <em>dvc</em>, <em>evaluate</em>, <em>huggingface-hub</em>, <em>openai</em>, <em>sentence-transformers</em>, <em>spacy</em>, <em>transformers</em></p>
</blockquote>
<p>For example, consider a task where we loop over a list of websites and need to fetch the status code for each.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>Naive loop</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" data-filename="Naive loop" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ping(url):</span>
<span id="cb2-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> requests.head(url).status_code</span>
<span id="cb2-5"></span>
<span id="cb2-6">urls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://amitness.com'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb2-7">statuses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ping(url) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> url <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> urls]</span></code></pre></div></div>
</div>
<p>To get a progress bar, it’s as easy as wrapping the <code>urls</code> list with the tqdm class.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install tqdm</span></code></pre></div></div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-4" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="1">1</button><span id="annotated-cell-4-2" class="code-annotation-target"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tqdm.auto <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> tqdm</span>
<span id="annotated-cell-4-3"></span>
<span id="annotated-cell-4-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ping(url):</span>
<span id="annotated-cell-4-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> requests.head(url).status_code</span>
<span id="annotated-cell-4-6"></span>
<span id="annotated-cell-4-7">urls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://amitness.com'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="2">2</button><span id="annotated-cell-4-8" class="code-annotation-target">statuses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ping(url) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> url <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> tqdm(urls)]</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-4" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="2" data-code-annotation="1">Import the tqdm object. Importing from <code>tqdm.auto</code> is preferred as it automatically selects the best progress bar (Jupyter-compatible or console-based)</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="8" data-code-annotation="2">Simply wrap the list of items and you get a progress bar</span>
</dd>
</dl>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/parallel-progress-bar/tqdm-example.gif" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>While this use case of <code>tqdm</code> as a progress bar library is well known, there are three relatively undocumented features in tqdm to get progress bars while doing concurrent, parallel or asynchronous processing.</p>
<section id="running-concurrent-threads" class="level2">
<h2 class="anchored" data-anchor-id="running-concurrent-threads">Running Concurrent Threads</h2>
<p>You can execute a function on the list concurrently with multiple threads using the <code>thread_map</code> function. It takes the function to run as the first argument and a list of items as the second argument and returns the results.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-5" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="annotated-cell-5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tqdm.contrib.concurrent <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> thread_map</span>
<span id="annotated-cell-5-3"></span>
<span id="annotated-cell-5-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ping(url):</span>
<span id="annotated-cell-5-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> requests.head(url).status_code</span>
<span id="annotated-cell-5-6"></span>
<span id="annotated-cell-5-7">urls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://amitness.com'</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="1">1</button><span id="annotated-cell-5-8" class="code-annotation-target">statuses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> thread_map(ping, urls, max_workers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-5" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="8" data-code-annotation="1">The number of worker threads to use can be specified using the <code>max_workers</code> parameter.</span>
</dd>
</dl>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/parallel-progress-bar/thread_map_results.gif" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>This is useful to speed up IO-bound tasks such as fetching data by scraping a website, calling a remote third party API or querying a remote database.</p>
<p>Internally, <code>thread_map</code> leverages the <code>ThreadPoolExecutor</code> from the <code>concurrent.futures</code> standard library module.<sup>1</sup></p>
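<p>To make that relationship concrete, here is a rough sketch of what a <code>thread_map</code>-style helper could look like when built directly on <code>ThreadPoolExecutor</code>. The <code>simple_thread_map</code> and <code>square</code> names are hypothetical, and this approximates the idea rather than reproducing tqdm's actual implementation. The import fallback is only there so the snippet still runs when tqdm is not installed.</p>

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from tqdm.auto import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no progress bar, same results.
    def tqdm(iterable, **kwargs):
        return iterable


def simple_thread_map(fn, items, max_workers=4):
    # Hypothetical helper approximating thread_map, not tqdm's actual implementation.
    items = list(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order; tqdm counts them as they arrive.
        return list(tqdm(executor.map(fn, items), total=len(items)))


def square(x):
    # Stand-in for an IO-bound function such as ping().
    return x * x


results = simple_thread_map(square, range(10))
print(results)
```

<p>Because <code>executor.map</code> preserves input order, the results line up with the original list even though the calls run concurrently.</p>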
</section>
<section id="running-parallel-processes" class="level2">
<h2 class="anchored" data-anchor-id="running-parallel-processes">Running Parallel Processes</h2>
<p>For compute-bound tasks, tqdm provides a <code>process_map</code> function with a similar API to process the list in parallel using multiple child processes.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-6" style="background: #f1f3f5;"><pre class="sourceCode python code-annotation-code code-with-copy code-annotated"><code class="sourceCode python"><span id="annotated-cell-6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="annotated-cell-6-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tqdm.contrib.concurrent <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> process_map</span>
<span id="annotated-cell-6-3"></span>
<span id="annotated-cell-6-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ping(url):</span>
<span id="annotated-cell-6-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> requests.head(url).status_code</span>
<span id="annotated-cell-6-6"></span>
<span id="annotated-cell-6-7">urls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://amitness.com'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="1">1</button><span id="annotated-cell-6-8" class="code-annotation-target">statuses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> process_map(ping, urls, max_workers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-6" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="8" data-code-annotation="1">The number of processes to use can be specified using the <code>max_workers</code> parameter.</span>
</dd>
</dl>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/parallel-progress-bar/process_map_progress_bar.gif" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>This is particularly useful when the task involves heavy computation such as generating sentence embeddings for a large dataset or running batch model inference on CPU.</p>
<p>Internally, <code>process_map</code> uses <code>ProcessPoolExecutor</code> from the <code>concurrent.futures</code> standard library module.<sup>2</sup></p>
</section>
<section id="running-asynchronous-tasks" class="level2">
<h2 class="anchored" data-anchor-id="running-asynchronous-tasks">Running Asynchronous Tasks</h2>
<p>For asynchronous tasks, tqdm provides an asyncio-compatible progress bar using <code>tqdm_asyncio</code>. This allows you to run asynchronous functions with a progress bar.</p>
<p>We use the same example as before, but this time we will use <code>httpx</code> to make asynchronous HTTP requests instead of the synchronous <code>requests</code> library.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install httpx</span></code></pre></div></div>
</div>
<p>In the code, we only need to use <code>tqdm_asyncio.gather</code> instead of <code>asyncio.gather</code> to get a progress bar. Everything else is regular asyncio code.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> asyncio</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> httpx</span>
<span id="cb5-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tqdm.asyncio <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> tqdm_asyncio</span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ping(client, url):</span>
<span id="cb5-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb5-7">        response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">await</span> client.head(url, timeout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb5-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> response.status_code</span>
<span id="cb5-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">Exception</span> <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> e:</span>
<span id="cb5-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Error: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>e<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-11"></span>
<span id="cb5-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> main():</span>
<span id="cb5-13">    urls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://amitness.com'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb5-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> httpx.AsyncClient() <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> client:</span>
<span id="cb5-15">        tasks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ping(client, url) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> url <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> urls]</span>
<span id="cb5-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tqdm_asyncio.gather instead of asyncio.gather</span></span>
<span id="cb5-17">        statuses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">await</span> tqdm_asyncio.gather(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>tasks)</span>
<span id="cb5-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> statuses</span>
<span id="cb5-19"></span>
<span id="cb5-20"></span>
<span id="cb5-21"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb5-22">    asyncio.run(main())</span></code></pre></div></div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, <code>thread_map</code>, <code>process_map</code> and <code>tqdm_asyncio</code> are useful tools to add to your toolbox when dealing with parallel and concurrent workloads. As tqdm is often already installed as a dependency of other libraries you might use in ML, it’s a quick and easy way to add a progress bar to parallel code in your program.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Source code for <a href="https://github.com/tqdm/tqdm/blob/951a2ba8d8754b7385e6e8c08dae9045f73b1438/tqdm/contrib/concurrent.py#L54">thread_map</a></p></li>
<li id="fn2"><p>Source code for <a href="https://github.com/tqdm/tqdm/blob/951a2ba8d8754b7385e6e8c08dae9045f73b1438/tqdm/contrib/concurrent.py#L72">process_map</a></p></li>
</ol>
</section></div> ]]></description>
  <category>python</category>
  <guid>https://amitness.com/posts/parallel-progress-bar/</guid>
  <pubDate>Sun, 20 Oct 2024 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/parallel-progress-bar/tqdm-parallel-thumbnail.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>A Visual Guide to Regular Expression</title>
  <link>https://amitness.com/posts/visual-regex</link>
  <description><![CDATA[ 




<p>It’s a common task in NLP to either check a text against a pattern or extract the parts of the text that match a certain pattern. A regular expression or “regex” is a powerful tool to achieve this.</p>
<p>While powerful, regex can feel daunting as it comes with a lot of features and sub-parts that you need to remember.</p>
<p>In this post, I will illustrate the various concepts underlying regex. The goal is to help you build a good mental model of how a regex pattern works.</p>
<section id="mental-model" class="level2">
<h2 class="anchored" data-anchor-id="mental-model">Mental Model</h2>
<p>Let’s start with a simple example where we are trying to find the word ‘cool’ in the text.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-mental-model-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>With regex, we could simply type out the word ‘cool’ as the pattern and it will match the word.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb1-1"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cool'</span></span></code></pre></div></div>
<p>While regex matched our desired word ‘<strong>cool</strong>’, the way it operates is not at the word level but the character level. This is the key idea.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p><strong>Key Idea</strong>: Regex works at the character-level, not word-level.</p>
</div>
</div>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-working.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>The implication of this is that the regex <code>r'cool'</code> would match the following sentences as well.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-exact-word-match.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
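<p>As a minimal sketch of this idea in Python (the sample sentences below are made up for illustration), the pattern is found even when ‘cool’ is only part of a longer word:</p>

```python
import re

# Hypothetical sample sentences: 'cool' is matched character by character,
# so it is also found inside longer words such as 'coolant' and 'cooler'.
sentences = ["This is cool", "Apply coolant", "A cooler day"]
matches = [re.findall(r'cool', s) for s in sentences]
print(matches)
```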
</section>
<section id="basic-building-blocks" class="level2">
<h2 class="anchored" data-anchor-id="basic-building-blocks">Basic Building Blocks</h2>
<p>Now that we understand the key idea, let’s understand how we can match simple characters using regex.</p>
<section id="a.-specific-character" class="level3">
<h3 class="anchored" data-anchor-id="a.-specific-character">a. Specific character</h3>
<p>We can simply specify the character in the regular expression and it will match all instances in the text.</p>
<p>For example, the regular expression given below will match all instances of ‘a’ in the text. You can use any lowercase or uppercase letter.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb2-1"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-match-only-a.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can also use any digit from 0 to 9 and it will work as well.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb3-1"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'3'</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-python-3.7-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Note that regex is case-sensitive by default and thus the following regex won’t match anything.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb4-1"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'A'</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-not-matched-by-capital-a.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
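<p>Here is a small sketch of the case-sensitive behavior in Python. The sample text is an assumption, and the <code>re.IGNORECASE</code> flag is an addition that is not covered above; it is one way to turn case sensitivity off.</p>

```python
import re

text = "a banana"  # hypothetical sample text with no capital letters

lower_hits = re.findall(r'a', text)   # matches every lowercase 'a'
upper_hits = re.findall(r'A', text)   # no capital 'A' in the text, so nothing matches

# Addition for illustration: re.IGNORECASE makes the match case-insensitive
ignore_case_hits = re.findall(r'A', text, flags=re.IGNORECASE)

print(lower_hits, upper_hits, ignore_case_hits)
```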
</section>
<section id="b.-white-space-character" class="level3">
<h3 class="anchored" data-anchor-id="b.-white-space-character">b. White space character</h3>
<p>We can detect special characters such as whitespace and newlines using special escape sequences.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-white-space-characters.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Besides the common ones above, we have:</p>
<ul>
<li><strong>\r</strong> for carriage return<br>
</li>
<li><strong>\f</strong> for form feed<br>
</li>
<li><strong>\e</strong> for escape (supported in some regex flavors such as PCRE, but not by Python’s <code>re</code> module)</li>
</ul>
</section>
<section id="c.-special-sequences" class="level3">
<h3 class="anchored" data-anchor-id="c.-special-sequences">c.&nbsp;Special sequences</h3>
<p>Regex provides a bunch of built-in special symbols that can match a group of characters at once. These begin with backslash <code>\</code>.</p>
<section id="pattern-d" class="level4">
<h4 class="anchored" data-anchor-id="pattern-d">Pattern: <code>\d</code></h4>
<p>It matches any single digit from 0 to 9.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-single-digit.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Notice that each match is a single digit. So we have 4 separate matches below instead of a single number <code>18.04</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-ubuntu-18.04.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
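<p>The same behavior can be checked in Python with a short sketch (the input string is an assumed example):</p>

```python
import re

# Each digit is returned as its own match; the dot is not a digit and is skipped
digits = re.findall(r'\d', 'Ubuntu 18.04')
print(digits)
```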
</section>
<section id="pattern-s" class="level4">
<h4 class="anchored" data-anchor-id="pattern-s">Pattern: <code>\s</code></h4>
<p>It matches any whitespace character (<span style="color: #66bb6a;">space</span>, <span style="color: #b668d2;">tab</span> or <span style="color: #2c9cdb;">newline</span>).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-cover.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:60.0%"></p>
</figure>
</div>
</section>
<section id="pattern-w" class="level4">
<h4 class="anchored" data-anchor-id="pattern-w">Pattern: <code>\w</code></h4>
<p>It matches any lowercase letter (a to z), uppercase letter (A to Z), digit (0 to 9), or underscore.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-slash-w.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
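<p>As a rough illustration, the sketch below applies <code>\s</code> and <code>\w</code> to two made-up strings. Note how punctuation such as <code>!</code> is matched by neither.</p>

```python
import re

# \s picks out the space, the tab and the newline (hypothetical sample string)
whitespace = re.findall(r'\s', 'a b\tc\n')

# \w picks out letters, digits and the underscore, but not the space or '!'
word_chars = re.findall(r'\w', 'a_1 B!')

print(whitespace, word_chars)
```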
</section>
<section id="pattern-." class="level4">
<h4 class="anchored" data-anchor-id="pattern-.">Pattern: <code>.</code></h4>
<p>It matches any character except the newline (<code>\n</code>).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-everything-except-newline.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb5-2"></span>
<span id="cb5-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> re.findall(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'line 1</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">line2'</span>)</span>
<span id="cb5-4">[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'l'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'n'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'e'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">' '</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'l'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'n'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'e'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2'</span>]</span></code></pre></div></div>
</section>
<section id="pattern-negations" class="level4">
<h4 class="anchored" data-anchor-id="pattern-negations">Pattern: Negations</h4>
<p>If you use the capitalized versions of the patterns above, they act as negation.</p>
<p>For example, if <code>\d</code> matched any digit from 0 to 9, then <code>\D</code> will match anything except 0 to 9.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-negation.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
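<p>A small sketch of the negated versions in Python, using an assumed sample string:</p>

```python
import re

text = 'a1 b2'  # hypothetical sample text

non_digits = re.findall(r'\D', text)      # everything that is not a digit
non_whitespace = re.findall(r'\S', text)  # everything that is not whitespace
non_word = re.findall(r'\W', text)        # everything that is not a letter, digit or underscore

print(non_digits, non_whitespace, non_word)
```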
</section>
</section>
<section id="d.-character-sets" class="level3">
<h3 class="anchored" data-anchor-id="d.-character-sets">d.&nbsp;Character sets</h3>
<p>These are patterns that start with <code>[</code> and end with <code>]</code>. The characters that should be matched are listed inside the brackets.</p>
<p>For example, the following pattern matches any of the characters ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-aeiou.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can also replicate the functionality of <code>\d</code> using the pattern below. It will match any digit from 0 to 9.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-1-to-9.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Instead of specifying all the digits, we can use <code>-</code> to specify only start and end digits. So, instead of <code>[0123456789]</code>, we can do:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-refactor-all-digits.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>For example, <code>[2-4]</code> can be used to match any digit from 2 to 4, i.e.&nbsp;2, 3, or 4.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-year-2014-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can even use the special characters we learned previously inside the brackets. For example, you can match any digit from 0 to 9 or whitespace as:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-whitespace-or-digit.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Below, I have listed some useful common patterns and what they mean.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-common-pattern-for-bracket.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
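<p>The character sets discussed in this section can be tried out with a short Python sketch. The input strings are assumptions chosen for illustration.</p>

```python
import re

# Explicit list of characters: any vowel
vowels = re.findall(r'[aeiou]', 'regular')

# Range of digits: [0-9] behaves like \d
all_digits = re.findall(r'[0-9]', 'Year 2014')

# Narrower range: only 2, 3 or 4
some_digits = re.findall(r'[2-4]', 'Year 2014')

# Special sequences inside brackets: any digit or whitespace
digit_or_space = re.findall(r'[\d\s]', 'a 1')

print(vowels, all_digits, some_digits, digit_or_space)
```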
</section>
<section id="e.-anchors" class="level3">
<h3 class="anchored" data-anchor-id="e.-anchors">e. Anchors</h3>
<p>Regex also has special handlers to make the pattern only match if it’s at the start or end of the string.</p>
<p>We can use the <code>^</code> anchor to match patterns only at the start of a line. For example:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-start-anchor.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Similarly, we can use the <code>$</code> anchor after the character to match patterns only if it’s the end of the line. For example:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-anchor-end.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
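<p>A minimal sketch of both anchors in Python, using made-up single-line strings:</p>

```python
import re

# '^a' only matches an 'a' at the very start, not the later ones
start_hits = re.findall(r'^a', 'a cat and a dog')

# 'a$' only matches an 'a' at the very end
end_hits = re.findall(r'a$', 'a banana')

# No match when the text does not start with 'a'
no_hits = re.findall(r'^a', 'the cat')

print(start_hits, end_hits, no_hits)
```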
</section>
<section id="f.-escaping-metacharacters" class="level3">
<h3 class="anchored" data-anchor-id="f.-escaping-metacharacters">f.&nbsp;Escaping metacharacters</h3>
<p>Consider a case where we want to exactly match the word “Mr.&nbsp;Stark”.</p>
<p>If we write a regex like <code>Mr. Stark</code>, it will have an unintended effect, since the dot has a special meaning in regex: it matches any character except a newline.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-dot-issue.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>So, we should always escape the special metacharacters like <code>.</code>, <code>$</code> etc. if our goal is to match the exact character itself.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-dot-fixed.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Here is the list of metacharacters that you should remember to escape if you’re using them directly.</p>
<pre><code>^ $ . * + ? { } [ ] \ | ( )</code></pre>
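<p>The sketch below illustrates the problem and the fix in Python on an assumed sample sentence. The use of <code>re.escape</code> is an addition here: it is a standard-library helper that escapes metacharacters for you.</p>

```python
import re

text = 'Mr. Stark met Mrs Stark'  # hypothetical sample text

# Unescaped dot matches any character, so 'Mrs Stark' is matched too
unescaped = re.findall(r'Mr. Stark', text)

# Escaped dot matches only a literal '.'
escaped = re.findall(r'Mr\. Stark', text)

# Addition for illustration: re.escape builds the escaped pattern automatically
auto_escaped = re.findall(re.escape('Mr. Stark'), text)

print(unescaped, escaped, auto_escaped)
```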
</section>
</section>
<section id="repetition-of-basic-blocks" class="level2">
<h2 class="anchored" data-anchor-id="repetition-of-basic-blocks">Repetition of basic blocks</h2>
<p>Now that we can pattern match any characters, we could repeat things and start building more complicated patterns.</p>
<section id="a.-naive-repetition" class="level3">
<h3 class="anchored" data-anchor-id="a.-naive-repetition">a. Naive repetition</h3>
<p>Using only what we have learned so far, a naive way would be to just repeat the pattern. For example, we can match two-digit numbers by just repeating the character-level pattern.</p>
<pre><code>\d\d</code></pre>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-slash-d-slash-d.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
</section>
<section id="b.-quantifiers" class="level3">
<h3 class="anchored" data-anchor-id="b.-quantifiers">b. Quantifiers</h3>
<p>Regex provides special quantifiers to specify different types of repetition for the character or pattern preceding it.</p>
<section id="i.-fixed-repetition" class="level4">
<h4 class="anchored" data-anchor-id="i.-fixed-repetition">i. Fixed repetition</h4>
<p>We can use the <code>{...}</code> quantifier to specify the number of times a pattern should repeat.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-manual-counts.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>For example, the previous pattern for matching a 2-digit number can be recreated as:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-it-is-2020.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>You can also specify a range of repetitions using the same quantifier. For example, to match from 2-digit to 4-digit numbers, we could use the pattern:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-min-max-count.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>When applied to a sentence, it will match both 4-digit and 2-digit numbers.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-20-years-old.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<div class="callout callout-style-default callout-warning callout-titled" title="Note:">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Note:
</div>
</div>
<div class="callout-body-container callout-body">
<p>There should not be any space between the minimum and maximum count. For example, <code>\d{2, 4}</code> doesn’t work.</p>
</div>
</div>
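<p>To make the fixed and ranged repetition concrete, here is a small Python sketch on an assumed sample sentence. It also shows the pattern with a space inside the braces finding nothing in this text.</p>

```python
import re

text = 'It is 2020 and I am 20 years old'  # hypothetical sample text

# Naive repetition and {2} are equivalent: '2020' is split into two matches
naive = re.findall(r'\d\d', text)
fixed = re.findall(r'\d{2}', text)

# {2,4} matches runs of 2 to 4 digits, taking as many as it can
ranged = re.findall(r'\d{2,4}', text)

# With a space after the comma, the braces no longer act as a quantifier
with_space = re.findall(r'\d{2, 4}', text)

print(naive, fixed, ranged, with_space)
```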
</section>
<section id="ii.-flexible-quantifiers" class="level4">
<h4 class="anchored" data-anchor-id="ii.-flexible-quantifiers">ii. Flexible quantifiers</h4>
<p>Regex also provides the quantifiers <code>*</code>, <code>+</code> and <code>?</code>, which let you specify flexible repetition of a character.</p>
<ul>
<li><p><strong>0 or 1 times</strong>: <code>?</code><br>
The <code>?</code> quantifier matches the previous character if it repeats 0 or 1 times. This can be useful to make certain parts optional. It is equivalent to <code>{0,1}</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-question-mark-clarify.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>For example, let’s say we want to match both the word “sound” and “sounds” where “s” is optional. Then, we can use the <code>?</code> quantifier that matches if a character repeats 0 or 1 times.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-question-mark-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div></li>
<li><p><strong>one or more times</strong>: <code>+</code><br>
The <code>+</code> quantifier matches the previous character if it repeats 1 or more times. It is equivalent to <code>{1,}</code>.</p>
<p>For example, we could find numbers of any arbitrary length using the regex <code>\d+</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-example-of-plus.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div></li>
<li><p><strong>zero or more times</strong>: <code>*</code><br>
The <code>*</code> quantifier matches the previous character if it repeats zero or more times. It is equivalent to <code>{0,}</code>.</p></li>
</ul>
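<p>The three flexible quantifiers above can be compared side by side with a short Python sketch. The sample strings are assumptions for illustration.</p>

```python
import re

# '?': the trailing 's' is optional (0 or 1 times)
optional = re.findall(r'sounds?', 'sound and sounds')

# '+': one or more digits, so numbers of any length are matched whole
one_or_more = re.findall(r'\d+', 'room 7 floor 12 code 2024')

# '*': zero or more 'o' characters between 'g' and 'd'
zero_or_more = re.findall(r'go*d', 'gd god good')

print(optional, one_or_more, zero_or_more)
```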
</section>
</section>
</section>
<section id="usage-in-python" class="level2">
<h2 class="anchored" data-anchor-id="usage-in-python">Usage in Python</h2>
<p>Python provides a module called “re” in the standard library to work with regular expressions.</p>
<section id="need-for-raw-strings" class="level3">
<h3 class="anchored" data-anchor-id="need-for-raw-strings">Need for raw strings</h3>
<p>To specify a regular expression in Python, we precede it with <strong>r</strong> to create raw strings.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">pattern <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\d</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span></span></code></pre></div></div>
<p>To understand why we precede with <strong>r</strong>, let’s try printing the expression <code>\t</code> without <code>r</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">pattern <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span></span>
<span id="cb9-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(pattern)</span></code></pre></div></div>
<p>As you can see, when we don’t use a raw string, Python treats <code>\t</code> as the escape sequence for a tab.</p>
<p>Now let’s convert it into a raw string. We get back exactly what we specified.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">pattern <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span></span>
<span id="cb10-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(pattern)</span>
<span id="cb10-3">\t</span></code></pre></div></div>
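<p>One way to see the difference without relying on printed output is to compare the lengths of the two strings, as in this small sketch:</p>

```python
plain = '\t'   # regular string: Python converts \t into a single tab character
raw = r'\t'    # raw string: the backslash and the 't' are kept as two characters

print(len(plain), len(raw))
print(raw == '\\t')  # a raw '\t' is the same as an escaped backslash followed by 't'
```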
</section>
<section id="using-re-module" class="level3">
<h3 class="anchored" data-anchor-id="using-re-module">Using re module</h3>
<p>To use the <code>re</code> module, we start by importing it:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span></code></pre></div></div>
<section id="re.findall" class="level4">
<h4 class="anchored" data-anchor-id="re.findall">1. re.findall</h4>
<p>This function allows us to get all the matches as a list of strings.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb12-2">re.findall(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\d</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'123456'</span>)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'3'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'4'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'5'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'6'</span>]</span></code></pre></div></div>
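'6'">
<p>As a further sketch, the same call can return whole numbers instead of single digits if we match runs of digits with <code>\d+</code>. The sample text here is made up for illustration.</p>

```python
import re

text = 'Order 66 shipped 3 items on day 21'

single_digits = re.findall(r'\d', text)   # every digit on its own
whole_numbers = re.findall(r'\d+', text)  # runs of consecutive digits

print(single_digits)  # ['6', '6', '3', '2', '1']
print(whole_numbers)  # ['66', '3', '21']

# When nothing matches, we get an empty list rather than None.
print(re.findall(r'\d', 'no digits here'))  # []
```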
</section>
<section id="re.match" class="level4">
<h4 class="anchored" data-anchor-id="re.match">2. re.match</h4>
<p>This function looks for a pattern only at the beginning of the string and returns a match object if the pattern is found there. If the pattern is not found, it returns <code>None</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb14-2"></span>
<span id="cb14-3">match <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.match(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'batman'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'batman is cool'</span>)</span>
<span id="cb14-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(match)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>re.Match <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">object</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> span<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>), match<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'batman'</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/regex-match-object.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>With the match object, we can get the matched text as:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(match.group())</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">batman</span></code></pre></div></div>
<p>If our pattern is not at the start of the string, we will not get any match.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb18-2"></span>
<span id="cb18-3">match <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.match(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'batman'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The batman is cool'</span>)</span>
<span id="cb18-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(match)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span></code></pre></div></div>
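None">
<p>Because the result can be <code>None</code>, calling <code>.group()</code> on it directly would raise an <code>AttributeError</code>. A minimal sketch of a guard is shown below; the helper name <code>match_at_start</code> is hypothetical.</p>

```python
import re

def match_at_start(text):
    """Return the matched text, or None when text does not start with 'batman'."""
    match = re.match(r'batman', text)
    if match is None:
        # No match at the start of the string, so there is no group to read.
        return None
    return match.group()

print(match_at_start('batman is cool'))      # batman
print(match_at_start('The batman is cool'))  # None
```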
</section>
<section id="re.search" class="level4">
<h4 class="anchored" data-anchor-id="re.search">3. re.search</h4>
<p>This function also finds the first occurrence of a pattern but the pattern can occur anywhere in the text. If the pattern is not found, it returns None.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb20-2"></span>
<span id="cb20-3">match <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.search(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'batman'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the batman is cool'</span>)</span>
<span id="cb20-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(match.group())</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">batman</span></code></pre></div></div>
</section>
</section>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>A.M. Kuchling, <a href="https://docs.python.org/3/howto/regex.html">“Regular Expression HOWTO - Python 3.9.0 documentation”</a></li>
</ul>


</section>

 ]]></description>
  <category>python</category>
  <category>nlp</category>
  <guid>https://amitness.com/posts/visual-regex</guid>
  <pubDate>Wed, 21 Oct 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/regex-cover.png" medium="image" type="image/png" height="56" width="144"/>
</item>
<item>
  <title>Knowledge Transfer in Self Supervised Learning</title>
  <link>https://amitness.com/posts/knowledge-transfer</link>
  <description><![CDATA[ 




<p>Self Supervised Learning is an interesting research area where the goal is to learn rich representations from unlabeled data without any human annotation.</p>
<p>This can be achieved by creatively formulating a problem such that you use parts of the data itself as labels and try to predict them. Such formulations are called pretext tasks.</p>
<p>For example, you can set up a pretext task to predict the color version of an image given its grayscale version. Similarly, you could remove a part of the image and train a model to predict that part from its surroundings. There are many such <a href="https://amitness.com/posts/self-supervised-learning">pretext tasks</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-pretext-tasks.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Examples of pretext tasks"></p>
</figure>
</div>
<p>By pre-training on the pretext task, the hope is that the model will learn useful representations. Then, we can finetune the model on downstream tasks such as image classification, object detection, and semantic segmentation with only a small set of labeled training data.</p>
<section id="challenge-of-evaluating-representations" class="level2">
<h2 class="anchored" data-anchor-id="challenge-of-evaluating-representations">Challenge of evaluating representations</h2>
<p>So pretext tasks can help us learn representations. But, this poses a question:</p>
<blockquote class="blockquote">
<p>How to determine how good a learned representation is?</p>
</blockquote>
<p>Currently, the standard way to gauge the representations is to evaluate them on a set of standard tasks and benchmark datasets.</p>
<ul>
<li><strong>Linear classification</strong>: ImageNet classification using frozen features<br>
</li>
<li><strong>Low Data Regime</strong>: ImageNet Classification using only 1% to 10% of data<br>
</li>
<li><strong>Transfer Learning</strong>: Object Classification, Object Detection and Semantic Segmentation on PASCAL VOC</li>
</ul>
<p>We can see that the above evaluation methods require us to use the same model architecture for both the pretext task and the target task.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-pretext-target-challenge.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Coupling of pretext task architecture and downstream task architecture"></p>
</figure>
</div>
<p>This poses some interesting challenges:</p>
<ol type="1">
<li><p>For the pretext task, our goal is to learn on a large-scale unlabeled dataset, so deeper models (e.g.&nbsp;ResNet) would help us learn better representations. But for downstream tasks, we would prefer shallow models (e.g.&nbsp;AlexNet) for actual applications. Thus, we currently have to consider this limitation when designing the pretext task.</p></li>
<li><p>It’s harder to fairly compare which pretext task is better if some methods used a simpler architecture while other methods used a deeper architecture.</p></li>
<li><p>We can’t compare the representations learned from pretext tasks to handcrafted features such as HOG.</p></li>
<li><p>We may want to exploit several data domains such as sound, text, and videos in the pretext task but the target task may limit our design choices.</p></li>
<li><p>A model trained on a pretext task may learn extra knowledge that is not useful for generic visual recognition. Currently, the final task-specific layers are ignored and weights or features only up to certain convolutional layers are taken.</p></li>
</ol>
</section>
<section id="knowledge-transfer" class="level2">
<h2 class="anchored" data-anchor-id="knowledge-transfer">Knowledge Transfer</h2>
<p><span class="citation" data-cites="noroozi2018boosting">Noroozi et al. (2018)</span> proposed a simple idea to tackle these issues in their 2018 paper <a href="https://arxiv.org/abs/1805.00385">“Boosting Self-Supervised Learning via Knowledge Transfer”</a>.</p>
<section id="intuition" class="level3">
<h3 class="anchored" data-anchor-id="intuition">Intuition</h3>
<p>The authors observed that in a good representation space, semantically similar data points should be close together.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-good-vs-bad-representation.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Intuition behind Knowledge Transfer"></p>
</figure>
</div>
<p>In regular supervised classification, the information that images are semantically similar is encoded through labels annotated by humans. A model trained on such labels would have a representation space that groups semantically similar images.</p>
<p>Thus, with pretext tasks in self-supervised learning, the objective is implicitly learning a metric that makes images of the same category similar and images of different categories dissimilar. Hence, we can provide a robust estimate of the learned representation if we can encode semantically related images to the same labels in some way.</p>
</section>
</section>
<section id="general-framework" class="level2">
<h2 class="anchored" data-anchor-id="general-framework">General Framework</h2>
<p>The authors propose a novel framework to transfer knowledge from a deep self-supervised model to a separate shallow downstream model. You can use different model architectures for the pretext task and downstream task.</p>
<p><strong>Key Idea:</strong></p>
<blockquote class="blockquote">
<p>Cluster features from the pretext task and assign cluster centers as pseudo-labels for unlabeled images. Then, train a small network with the target task architecture to predict these pseudo-labels and learn a novel representation.</p>
</blockquote>
<p>The end-to-end process is described below:</p>
<section id="pretext-task" class="level4">
<h4 class="anchored" data-anchor-id="pretext-task">1. Pretext task</h4>
<p>Here we choose some deep network architecture and train it on some pretext task of our choice on some dataset. We can take features from some intermediate layer after the model is trained.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-step-1.png" class="img-fluid figure-img" alt="Applying pretext task"></p>
<figcaption>Figure: Training on Pretext Task <span class="citation" data-cites="noroozi2018boosting">(Noroozi et al., 2018)</span></figcaption>
</figure>
</div>
</section>
<section id="k-means-clustering" class="level4">
<h4 class="anchored" data-anchor-id="k-means-clustering">2. K-means Clustering</h4>
<p>For all the unlabeled images in the dataset, we compute the feature vectors from the pretext task model. Then, we run K-means clustering to group semantically similar images. The idea is that the cluster centers will be aligned with categories in ImageNet.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-step-2.png" class="img-fluid figure-img" alt="Clustering features from pretext task"></p>
<figcaption>Figure: Clustering Features <span class="citation" data-cites="noroozi2018boosting">(Noroozi et al., 2018)</span></figcaption>
</figure>
</div>
<p>In the paper, the authors ran K-means on a single Titan X GPU for 4 hours to cluster 1.3M images into 2000 categories.</p>
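<p>To make the clustering step concrete, here is a minimal pure-Python sketch of K-means on toy 2-D “features”. It is an illustration under simplifying assumptions, not the authors’ GPU implementation: the function names, the random initialization, the fixed iteration count, and the synthetic data are all made up for this example.</p>

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the cluster center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(features, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialise the centers from k distinct data points.
    centers = rng.sample(features, k)
    for _ in range(iterations):
        # Assignment step: group every point under its nearest center.
        clusters = [[] for _ in range(k)]
        for point in features:
            clusters[nearest_center(point, centers)].append(point)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers

# Toy stand-in for feature vectors from the pretext model: two well-separated blobs.
rng = random.Random(42)
features = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
features += [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)]

centers = kmeans(features, k=2)
print(centers)  # one center near (0, 0) and one near (5, 5)
```

<p>In practice, the feature vectors would be high-dimensional activations from an intermediate layer, and a library implementation of K-means would be used instead of this loop.</p>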
</section>
<section id="pseudo-labeling" class="level4">
<h4 class="anchored" data-anchor-id="pseudo-labeling">3. Pseudo-labeling</h4>
<p>The cluster centers are treated as the pseudo-labels. We can use either the same dataset as in the above step or a different dataset altogether. Then, we compute the feature vectors for those images and find the closest cluster center for each image. This cluster center is used as the pseudo-label.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-step-3.png" class="img-fluid figure-img" alt="Generating pseudo-labels using cluster centers"></p>
<figcaption>Figure: Generating Pseudo-labels <span class="citation" data-cites="noroozi2018boosting">(Noroozi et al., 2018)</span></figcaption>
</figure>
</div>
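<p>The assignment of pseudo-labels can be sketched as a nearest-center lookup. The cluster centers and feature vectors below are invented toy values, and <code>assign_pseudo_labels</code> is a hypothetical helper name.</p>

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_pseudo_labels(features, centers):
    """Return, for each feature vector, the index of its nearest cluster center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(f, centers[i]))
        for f in features
    ]

# Toy cluster centers (assumed to come from a previous K-means run).
centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]

# Toy feature vectors for four unlabeled images.
features = [[0.2, -0.1], [4.8, 5.3], [0.4, 4.6], [5.1, 4.9]]

pseudo_labels = assign_pseudo_labels(features, centers)
print(pseudo_labels)  # [0, 1, 2, 1]
```

<p>Each integer is the index of a cluster and plays the role of a class label when training the downstream architecture.</p>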
</section>
<section id="training-on-pseudo-labels" class="level4">
<h4 class="anchored" data-anchor-id="training-on-pseudo-labels">4. Training on Pseudo-labels</h4>
<p>We take the model architecture that will be used for downstream tasks and train it to classify the unlabeled images into the pseudo-labels. Thus, the target architecture will learn a new representation such that it will map images that were originally close in the pre-trained feature space to close points.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-step-4.png" class="img-fluid figure-img" alt="Training model from scratch on pseudo-labels"></p>
<figcaption>Figure: Re-training on pseudo-labels <span class="citation" data-cites="noroozi2018boosting">(Noroozi et al., 2018)</span></figcaption>
</figure>
</div>
</section>
</section>
<section id="advantage-of-knowledge-transfer" class="level2">
<h2 class="anchored" data-anchor-id="advantage-of-knowledge-transfer">Advantage of Knowledge Transfer</h2>
<p>We saw how by clustering the features and then using pseudo-labels, we can bring the knowledge from any pretext task representations into a common reference model like AlexNet.</p>
<p>As such, we can now easily compare different pretext tasks even if they are trained using different architectures and on different data domains. This also allows us to improve self-supervised methods by using deep models and challenging pretext tasks.</p>
</section>
<section id="how-well-does-this-framework-work" class="level2">
<h2 class="anchored" data-anchor-id="how-well-does-this-framework-work">How well does this framework work?</h2>
<p>To evaluate the idea quantitatively, the authors set up an experiment as described below:</p>
<section id="a.-increase-complexity-of-pretext-task-jigsaw" class="level3">
<h3 class="anchored" data-anchor-id="a.-increase-complexity-of-pretext-task-jigsaw">a. Increase complexity of pretext task (Jigsaw++)</h3>
<p>To evaluate their method, the authors took an older puzzle-like pretext task called “Jigsaw”, where we need to predict the permutation that was used to randomly shuffle a 3×3 grid of image tiles.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-jigsaw-plus-plus.png" class="img-fluid figure-img" alt="Jigsaw to Jigsaw++ task"></p>
<figcaption>Image adapted from <span class="citation" data-cites="noroozi2018boosting">Noroozi et al. (2018)</span></figcaption>
</figure>
</div>
<p>They extended the task by replacing 0 to 2 tiles, at random locations, with tiles from another random image. This increases the difficulty, as we now need to solve the problem using only the remaining patches. The new pretext task is called “Jigsaw++”.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-jigsaw-plus-plus-goal.png" class="img-fluid figure-img" alt="Goal of Jigsaw++"></p>
<figcaption>Image adapted from <span class="citation" data-cites="noroozi2018boosting">Noroozi et al. (2018)</span></figcaption>
</figure>
</div>
<p>In the paper, they use a total of 701 permutations with a minimum Hamming distance of 3. They apply mean and standard deviation normalization to each image tile independently. They also make images grayscale 70% of the time to prevent the network from cheating with low-level statistics.</p>
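<p>To illustrate what a minimum Hamming distance between permutations means, the sketch below builds a small permutation set by rejection sampling. This is not the authors’ selection procedure, which is not described here; the function names, the seed, and the set size of 20 are assumptions chosen for the example.</p>

```python
import random

def hamming_distance(p, q):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def sample_permutations(num_tiles, count, min_distance, seed=0, max_tries=100000):
    """Draw random permutations, keeping only those far enough from all kept ones."""
    rng = random.Random(seed)
    tiles = list(range(num_tiles))
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == count:
            break
        candidate = tuple(rng.sample(tiles, num_tiles))
        # Accept the candidate only if it differs from every kept
        # permutation in at least min_distance positions.
        if all(hamming_distance(candidate, p) >= min_distance for p in chosen):
            chosen.append(candidate)
    return chosen

# 9 tiles for a 3x3 grid; each permutation index becomes one class to predict.
permutations = sample_permutations(num_tiles=9, count=20, min_distance=3)
print(len(permutations))
print(permutations[0])
```

<p>A larger minimum distance makes the classes easier to tell apart, since any two shuffles then disagree on the position of more tiles.</p>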
</section>
<section id="b.-use-a-deeper-network-to-solve-pretext-task" class="level3">
<h3 class="anchored" data-anchor-id="b.-use-a-deeper-network-to-solve-pretext-task">b. Use a deeper network to solve pretext task</h3>
<p>The authors used VGG-16 to solve the pretext task and learn representations. As VGG-16 has increased capacity, it can better handle the increased complexity of the “Jigsaw++” task and thus extract better representation.</p>
</section>
<section id="c.-transfer-knowledge-back-to-alexnet" class="level3">
<h3 class="anchored" data-anchor-id="c.-transfer-knowledge-back-to-alexnet">c.&nbsp;Transfer Knowledge back to AlexNet</h3>
<p>The representations from VGG-16 are clustered and cluster centers are converted to pseudo-labels. Then, AlexNet is trained to classify the pseudo-labels.</p>
</section>
<section id="d.-finetune-alexnet-on-evaluation-datasets" class="level3">
<h3 class="anchored" data-anchor-id="d.-finetune-alexnet-on-evaluation-datasets">d.&nbsp;Finetune AlexNet on Evaluation datasets</h3>
<p>For downstream tasks, the convolutional layers of the AlexNet model are initialized with weights from pseudo-label classification and the fully connected layers are randomly initialized. The pre-trained AlexNet is then finetuned on various benchmark datasets.</p>
</section>
<section id="e.-results" class="level3">
<h3 class="anchored" data-anchor-id="e.-results">e. Results</h3>
<p>Using a deeper network like VGG-16 leads to better representations and pseudo-labels, and also to better results on benchmark tasks. It achieved state-of-the-art results on several benchmarks in 2018 and further reduced the gap between supervised and self-supervised methods.</p>
<section id="transfer-learning-on-pascal-voc" class="level4">
<h4 class="anchored" data-anchor-id="transfer-learning-on-pascal-voc">1. Transfer Learning on PASCAL VOC</h4>
<p>The authors tested their method on object classification and detection on PASCAL VOC 2007 dataset and semantic segmentation on PASCAL VOC 2012 dataset.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="Insights">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Insights
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li>Training Jigsaw++ with VGG-16 and using AlexNet to predict the clusters gives the best performance.</li>
<li>Switching to the more challenging pretext task “Jigsaw++” improves performance over “Jigsaw”.</li>
<li>Knowledge transfer doesn’t have a significant impact when using the same architecture AlexNet in both Jigsaw++ and downstream tasks.</li>
</ul>
</div>
</div>
<table class="caption-top table">
<colgroup>
<col style="width: 7%">
<col style="width: 9%">
<col style="width: 19%">
<col style="width: 14%">
<col style="width: 13%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 11%">
</colgroup>
<thead>
<tr class="header">
<th>Task</th>
<th>Cluster</th>
<th>Pretext</th>
<th>Downstream</th>
<th>Classification</th>
<th>Detection(SS)</th>
<th>Detec.(MS)</th>
<th>Segmentation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Jigsaw</td>
<td>no</td>
<td>AlexNet</td>
<td>AlexNet</td>
<td>67.7</td>
<td>53.2</td>
<td>-</td>
<td>-</td>
</tr>
<tr class="even">
<td>Jigsaw++</td>
<td>no</td>
<td>AlexNet</td>
<td>AlexNet</td>
<td>69.8</td>
<td>55.5</td>
<td>55.7</td>
<td>38.1</td>
</tr>
<tr class="odd">
<td>Jigsaw++</td>
<td>yes</td>
<td>AlexNet</td>
<td>AlexNet</td>
<td>69.9</td>
<td>55.0</td>
<td>55.8</td>
<td>40.0</td>
</tr>
<tr class="even">
<td>Jigsaw++</td>
<td>yes</td>
<td>VGG-16</td>
<td>AlexNet</td>
<td><strong>72.5</strong></td>
<td><strong>56.5</strong></td>
<td><strong>57.2</strong></td>
<td><strong>42.6</strong></td>
</tr>
</tbody>
</table>
</section>
<section id="linear-classification-on-imagenet" class="level4">
<h4 class="anchored" data-anchor-id="linear-classification-on-imagenet">2. Linear Classification on ImageNet</h4>
<p>In this, a linear classifier is trained on features extracted from AlexNet at different convolutional layers. For ImageNet, using VGG-16 and transferring knowledge to AlexNet using clustering gives a substantial boost of 2%.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-imagenet-performance.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Results of Jigsaw++ on ImageNet"></p>
</figure>
</div>
</section>
<section id="non-linear-classification-on-imagenet" class="level4">
<h4 class="anchored" data-anchor-id="non-linear-classification-on-imagenet">3. Non-linear classification on ImageNet</h4>
<p>For a non-linear classifier, using VGG-16 and transferring knowledge to AlexNet using clustering gives the best performance on ImageNet.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-nonlinear-result.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Non-Linear classification results"></p>
</figure>
</div>
</section>
</section>
</section>
<section id="additional-insights-from-paper" class="level2">
<h2 class="anchored" data-anchor-id="additional-insights-from-paper">Additional Insights from Paper</h2>
<section id="how-does-the-number-of-clusters-affect-the-performance" class="level4">
<h4 class="anchored" data-anchor-id="how-does-the-number-of-clusters-affect-the-performance">1. How does the number of clusters affect the performance?</h4>
<p>The network is not significantly affected by the number of clusters. The authors tested AlexNet trained on pseudo-labels from a different number of clusters on the task of object detection.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-impact-of-cluster-numbers.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Impact of number of clusters on performance"></p>
</figure>
</div>
</section>
<section id="how-is-this-different-from-knowledge-distillation" class="level4">
<h4 class="anchored" data-anchor-id="how-is-this-different-from-knowledge-distillation">2. How is this different from Knowledge Distillation?</h4>
<p>Knowledge transfer is fundamentally different from knowledge distillation. Here, the goal is to only preserve the cluster association of images from the representation and transfer that to the target model. Unlike distillation, we don’t do any regression to the exact output of the teacher.</p>
</section>
<section id="can-you-use-different-datasets-in-clustering-vs-predicting-pseudo-labels" class="level4">
<h4 class="anchored" data-anchor-id="can-you-use-different-datasets-in-clustering-vs-predicting-pseudo-labels">3. Can you use different datasets in clustering vs predicting pseudo-labels?</h4>
<p>Yes, the method is flexible and you can pre-train on one dataset, cluster on another, and get pseudo-labels for the third one.</p>
<p>The authors did an experiment where they trained clustering on representations for ImageNet and then calculated cluster centers on the “Places” dataset to get pseudo-labels. There was only a small reduction (-1.5%) in performance for object classification.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/kt-different-datasets-impact.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Impact of using different datasets"></p>
</figure>
</div>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, Knowledge Transfer is a simple and efficient way to map representations from deep to shallow models.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-noroozi2017unsupervisedlearningvisualrepresentations" class="csl-entry">
Mehdi Noroozi and Paolo Favaro. 2017. <a href="https://arxiv.org/abs/1603.09246">Unsupervised learning of visual representations by solving jigsaw puzzles</a>.
</div>
<div id="ref-noroozi2018boosting" class="csl-entry">
M. Noroozi, Ananth Vinjimoor, P. Favaro, and H. Pirsiavash. 2018. <a href="https://doi.org/10.1109/CVPR.2018.00975">Boosting self-supervised learning via knowledge transfer</a>. <em>IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>.
</div>
<div id="ref-okanohara-tsujii-2007-discriminative" class="csl-entry">
Daisuke Okanohara and Jun’ichi Tsujii. 2007. <a href="https://aclanthology.org/P07-1010/">A discriminative language model with pseudo-negative samples</a>. In Annie Zaenen and Antal van den Bosch, editors, <em>Proceedings of the 45th annual meeting of the association of computational linguistics</em>, pages 73–80, Prague, Czech Republic. Association for Computational Linguistics.
</div>
</div></section></div> ]]></description>
  <category>self-supervised-learning</category>
  <guid>https://amitness.com/posts/knowledge-transfer</guid>
  <pubDate>Sun, 04 Oct 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/pseudolabeling.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Interactive Analysis of Sentence Embeddings</title>
  <link>https://amitness.com/posts/visualize-sentence-embeddings</link>
  <description><![CDATA[ 




<p><a href="https://projector.tensorflow.org/">Embedding Projector</a> is a free web application for visualizing high-dimensional data. It has built-in demos for visualizing word embeddings in NLP and image embeddings for MNIST in Computer Vision.</p>
<p>I recently experimented with a way to load sentence embeddings along with the class labels into this tool and explore them interactively. In this blog post, I will explain the end-to-end process with an example dataset.</p>
<section id="toy-example-outlier-detection" class="level2">
<h2 class="anchored" data-anchor-id="toy-example-outlier-detection">Toy Example: Outlier Detection</h2>
<section id="preparing-dataset" class="level3">
<h3 class="anchored" data-anchor-id="preparing-dataset">1. Preparing Dataset</h3>
<p>To understand this use case, let’s take a subset of 100 movie reviews from the SST-2 dataset, each labeled as positive or negative.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"></span>
<span id="cb1-3">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'http://bit.ly/dataset-sst2'</span>, </span>
<span id="cb1-4">                 nrows<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, sep<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'label'</span>])</span>
<span id="cb1-5"></span>
<span id="cb1-6">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'label'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'label'</span>].replace({<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'negative'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'positive'</span>})</span></code></pre></div></div>
<p>The dataset has a column containing the text and a label indicating whether it’s a positive or negative opinion.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-head-5.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="First 5 rows of SST-2 dataset"></p>
</figure>
</div>
<p>We will introduce noise into our dataset by corrupting five of the reviews with random text. These rows will act as outliers for our example.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">df.loc[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">27</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">54</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">72</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">91</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'askgkn askngk kagkasng'</span></span></code></pre></div></div>
</section>
<section id="generating-embeddings" class="level3">
<h3 class="anchored" data-anchor-id="generating-embeddings">2. Generating Embeddings</h3>
<p>Now, we will compute sentence embeddings for the reviews using the <code>sentence-transformers</code> package. First, let’s install it using pip.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!pip</span> install sentence-transformers</span></code></pre></div></div>
</div>
<p>Next, we will create a helper function to return a NumPy array of sentence embeddings given a list of sentences.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sentence_transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SentenceTransformer</span>
<span id="cb4-2"></span>
<span id="cb4-3">sentence_bert_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SentenceTransformer(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'distilbert-base-nli-stsb-mean-tokens'</span>)</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_embeddings(sentences):</span>
<span id="cb4-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> sentence_bert_model.encode(sentences,</span>
<span id="cb4-7">                                    batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, </span>
<span id="cb4-8">                                    show_progress_bar<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
<p>Using the above function, we can generate sentence embeddings for our data as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">e <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_embeddings(df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>])</span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shape: (100, 768)</span></span></code></pre></div></div>
</section>
<section id="exporting-to-embedding-projector-format" class="level3">
<h3 class="anchored" data-anchor-id="exporting-to-embedding-projector-format">3. Exporting to Embedding Projector Format</h3>
<p>Embedding Projector requires two TSV files to load our custom embeddings.</p>
<ul>
<li><code>output.tsv</code>: This file should contain the embeddings without any headers.</li>
<li><code>metadata.tsv</code>: This file should contain the original text and labels for the embeddings.</li>
</ul>
<p>Let’s first generate the <code>output.tsv</code> file for our sentence embeddings from the previous step.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert NumPy array of embedding into data frame</span></span>
<span id="cb6-2">embedding_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(e)</span>
<span id="cb6-3"></span>
<span id="cb6-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save dataframe as a TSV file without any index and header</span></span>
<span id="cb6-5">embedding_df.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'output.tsv'</span>, sep<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, header<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span></code></pre></div></div>
<p>To generate <code>metadata.tsv</code>, we simply save our original dataframe as a tab-separated file.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save dataframe without any index</span></span>
<span id="cb7-2">df.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'metadata.tsv'</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, sep<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div></div>
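<p>Before uploading, it can help to confirm that the two files line up. The sketch below is a hypothetical helper (<code>check_projector_files</code> is not part of any library) that uses only the standard library. It assumes that a metadata file with more than one column starts with a header row, while the vectors file has none. The toy files written here stand in for the real <code>output.tsv</code> and <code>metadata.tsv</code>.</p>

```python
import csv
import os
import tempfile


def check_projector_files(vectors_path, metadata_path):
    """Return (n_vectors, n_dims) after checking that the two TSV files line up."""
    with open(vectors_path, newline="") as f:
        vectors = list(csv.reader(f, delimiter="\t"))
    with open(metadata_path, newline="") as f:
        metadata = list(csv.reader(f, delimiter="\t"))

    if not vectors or not metadata:
        raise ValueError("Both files must contain at least one row")

    # Every embedding row must have the same number of dimensions.
    dims = {len(row) for row in vectors}
    if len(dims) != 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")

    # Assumption: multi-column metadata carries a header row, single-column does not.
    has_header = len(metadata[0]) > 1
    n_metadata = len(metadata) - 1 if has_header else len(metadata)
    if n_metadata != len(vectors):
        raise ValueError(f"{len(vectors)} vectors but {n_metadata} metadata rows")

    return len(vectors), dims.pop()


# Toy files standing in for output.tsv and metadata.tsv.
with tempfile.TemporaryDirectory() as tmp:
    vectors_path = os.path.join(tmp, "output.tsv")
    metadata_path = os.path.join(tmp, "metadata.tsv")
    with open(vectors_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
    with open(metadata_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(
            [["text", "label"], ["good movie", "positive"], ["dull plot", "negative"]]
        )
    shape = check_projector_files(vectors_path, metadata_path)

print(shape)
```

<p>For the files generated above, we would expect this check to report 100 vectors of 768 dimensions.</p>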
</section>
<section id="importing-into-embedding-projector" class="level3">
<h3 class="anchored" data-anchor-id="importing-into-embedding-projector">4. Importing into Embedding Projector</h3>
<p>We first go to <a href="https://projector.tensorflow.org/">https://projector.tensorflow.org/</a>.</p>
<p>On the left-hand sidebar, click the <strong>Load</strong> button.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-load-step-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:90.0%" alt="Loading file in embedding projector"></p>
</figure>
</div>
<p>Then, for the first <strong>Choose file</strong> button, upload the <code>output.tsv</code> file and for the second <strong>Choose file</strong> button, upload the <code>metadata.tsv</code> file.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-load-step-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:75.0%" alt="Choosing embeddings and metadata"></p>
</figure>
</div>
<p>After uploading both files, click outside and you should see the sentence embedding projection. The dimensions of embeddings are reduced to 3D by default using PCA.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-3d.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="3D projection of embeddings"></p>
</figure>
</div>
<p>Let’s switch to 2D by turning off the checkbox for ‘Component #3’ in the bottom part of the sidebar.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-turn-off-3d.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Switching from 3D to 2D using PCA"></p>
</figure>
</div>
<p>On the 2D visualization, we can see that the random text lies far from the other groups of text as an <strong>outlier</strong>. On hovering over the point, we see the text <code>askgkn askngk kagkasng</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-outlier.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Detecting outlier using projection"></p>
</figure>
</div>
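<p>The visual check can be complemented with a rough numeric one. The sketch below does not reproduce the projector’s PCA view. As a simpler stand-in, it ranks points by their Euclidean distance from the centroid of all embeddings, so that far-away points surface first. The function name and the toy 2D vectors are made up for illustration.</p>

```python
import math


def rank_by_centroid_distance(vectors):
    """Return row indices sorted from farthest to nearest to the centroid."""
    n = len(vectors)
    dim = len(vectors[0])
    # Centroid = per-dimension mean of all vectors.
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    distances = [math.dist(v, centroid) for v in vectors]
    return sorted(range(n), key=lambda i: distances[i], reverse=True)


# Four similar toy points and one point far away from the rest.
toy = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.0, 0.8], [8.0, -5.0]]
ranking = rank_by_centroid_distance(toy)
print(ranking[0])
```

<p>With real embeddings, the rows at the top of this ranking are candidates to inspect by hand, not confirmed outliers.</p>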
</section>
<section id="useful-features-in-projector" class="level3">
<h3 class="anchored" data-anchor-id="useful-features-in-projector">5. Useful Features in Projector</h3>
<section id="a.-class-separation" class="level4">
<h4 class="anchored" data-anchor-id="a.-class-separation">a. Class Separation</h4>
<p>We can enable color coding of the points by their actual labels (positive vs negative) by using the <strong>Color by</strong> dropdown in the left sidebar.</p>
<p>Select the name of the column that contains your labels. In our example file, the column name is <strong>label</strong>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-color-code-labels.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Coloring points by class label"></p>
</figure>
</div>
<p>The points themselves are interactive. You can see the actual sentence for each point by hovering over them.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-interactive-1.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Hovering over points"></p>
</figure>
</div>
<p>You can click on a point to show its metadata. Below, clicking a blue point shows in the popup that its label is “positive”.</p>
<p>So the blue points are positive and the red points are negative. When a point is selected, the 100 nearest points in terms of cosine similarity are also highlighted.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-click-point.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Impact of clicking a data point"></p>
</figure>
</div>
<p>To get back to the original view, we can click on any empty white space.</p>
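<p>To make the neighbor highlighting concrete, here is a minimal sketch of a cosine-similarity nearest-neighbor lookup in plain Python. It illustrates the idea only and is not the projector’s implementation. The helper names and toy vectors are assumptions.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(vectors, query_index, k):
    """Return the indices of the k vectors most similar to the query vector."""
    query = vectors[query_index]
    scored = [
        (cosine_similarity(query, v), i)
        for i, v in enumerate(vectors)
        if i != query_index  # a point is not its own neighbor
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = nearest_neighbors(toy, query_index=0, k=2)
print(neighbors)
```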
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Applications
</div>
</div>
<div class="callout-body-container callout-body">
<p>The color coding can be a useful heuristic for many use cases:</p>
<ul>
<li>It can be used to explore class overlap for the dataset you’re working on and identify tricky sentences.</li>
<li>If there are labeling errors in your dataset, then this might help uncover them. For example, if a whole cluster of points is in a certain color, but some single point in that cluster is in a different color, then that might be an outlier or labeling error.</li>
</ul>
</div>
</div>
</section>
<section id="b.-dimensionality-reduction-algorithm" class="level4">
<h4 class="anchored" data-anchor-id="b.-dimensionality-reduction-algorithm">b. Dimensionality Reduction Algorithm</h4>
<p>The web app provides three standard dimensionality reduction techniques: <strong>UMAP</strong>, <strong>t-SNE</strong>, and <strong>PCA</strong>.</p>
<p>You can choose the algorithm and its parameters from the bottom of the left sidebar.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-choose-dim-algorithm.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Choosing the dimensionality reduction algorithm"></p>
</figure>
</div>
</section>
<section id="c.-custom-linear-projection" class="level4">
<h4 class="anchored" data-anchor-id="c.-custom-linear-projection">c.&nbsp;Custom Linear Projection</h4>
<p>You can also use a custom keyword or full text as the axis using the <strong>CUSTOM</strong> tab. This will apply a custom linear projection and can help us explore meaningful directions in the embedding space.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-custom-dim.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Performing custom linear projection"></p>
</figure>
</div>
<p>For example, the Gmail team tried setting “yeah” on the left side and “yes” on the right side. When they projected encoder embeddings for email replies to this custom linear projection, they found replies in a casual tone (e.g.&nbsp;Here you go) on the left side and responses in a more formal tone clustered towards the right side.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/projector-custom-direction.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:75.0%" alt="Formal vs informal aspects as axis"></p>
</figure>
</div>
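<p>As a rough sketch of what a custom linear projection does, we can take the direction from a “left” anchor vector to a “right” anchor vector and project every embedding onto it. The anchors below are toy stand-ins for the embeddings of “yeah” and “yes”, and <code>project_on_axis</code> is a hypothetical helper. The projector’s exact computation may differ.</p>

```python
import math


def project_on_axis(vectors, left, right):
    """Project each vector onto the unit direction pointing from left to right."""
    direction = [r - l for l, r in zip(left, right)]
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Dot product with the unit direction gives a position along the axis.
    return [sum(x * u for x, u in zip(v, unit)) for v in vectors]


left_anchor = [1.0, 0.0]   # toy stand-in for the embedding of "yeah"
right_anchor = [0.0, 1.0]  # toy stand-in for the embedding of "yes"
replies = [[0.9, 0.2], [0.1, 0.8]]  # toy reply embeddings

positions = project_on_axis(replies, left_anchor, right_anchor)
print(positions)
```

<p>Smaller values fall towards the left anchor and larger values towards the right anchor, which is what makes the axis interpretable.</p>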
</section>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, Embedding Projector is a very useful tool to better understand the datasets and models we work with.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Daniel Smilkov et al., <a href="https://arxiv.org/abs/1611.05469">Embedding Projector: Interactive Visualization and Interpretation of Embeddings</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <category>embeddings</category>
  <guid>https://amitness.com/posts/visualize-sentence-embeddings</guid>
  <pubDate>Thu, 24 Sep 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/embedding_projector_thumbnail.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>VSCode on Google Colab</title>
  <link>https://amitness.com/posts/vscode-colab</link>
  <description><![CDATA[ 




<p>I recently discovered a way to set up VSCode on Google Colab and use it as an editor to write code and run experiments on the Colab VM.</p>
<p>With this setup, you can still prototype in the Colab Notebook while also using VSCode for all the advantages of a full-fledged code editor. Here is how you can replicate my setup.</p>
<section id="approach-1-python-package" class="level2">
<h2 class="anchored" data-anchor-id="approach-1-python-package">Approach 1: Python Package</h2>
<p>In this setup, we use the <a href="https://github.com/abhishekkrthakur/colabcode">colabcode</a> package, which automates all the manual setup steps described in the <strong>Approach 2</strong> section below. You can make a copy of this <a href="https://colab.research.google.com/github/abhishekkrthakur/colabcode/blob/master/colab_starter.ipynb">notebook</a> directly to get started.</p>
<ol type="1">
<li><p>First, install the <code>colabcode</code> package using the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">pip install colabcode</span></code></pre></div></div></li>
<li><p>Now, import the <code>ColabCode</code> class from the package and specify the port and password.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> colabcode <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ColabCode</span>
<span id="cb2-2">ColabCode(port<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, password<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"password123"</span>)</span></code></pre></div></div>
<p>You can also use it directly with the default port and without any password as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> colabcode <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ColabCode</span>
<span id="cb3-2">ColabCode()</span></code></pre></div></div></li>
<li><p>You will get the ngrok URL in the output. Click the link and a login page will open in a new tab.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-code-step-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Generated NGROK URL"></p>
</figure>
</div></li>
<li><p>Type the password you had set in step 2 and click submit. If the page gets stuck for more than 4-5 seconds, refresh the page and you should be redirected to the editor.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-code-step-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Authenticating with password in VSCode"></p>
</figure>
</div></li>
<li><p>Now you will get access to the editor interface and can use it to work on Python files.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-code-step-3.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="VSCode Interface"></p>
</figure>
</div></li>
</ol>
</section>
<section id="approach-2-manual-setup" class="level2">
<h2 class="anchored" data-anchor-id="approach-2-manual-setup">Approach 2: Manual Setup</h2>
<p>I have described the setup steps in detail below. After going through all the steps, please use this <a href="https://colab.research.google.com/drive/1yvUy5Gn9lPjmCQH6RjD_LvUO2NE0Z7RM?usp=sharing">colab notebook</a> to try it out directly.</p>
<ol type="1">
<li><p>First, we will install the <a href="https://github.com/coder/code-server">code-server</a> package to run the VSCode editor as a web app. Copy and run the following command on Colab to install <code>code-server</code>.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-fsSL</span> https://code-server.dev/install.sh <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sh</span></span></code></pre></div></div>
</div></li>
<li><p>After the installation is complete, we will use the <code>pyngrok</code> package to expose port <code>9000</code> (an arbitrary choice) at an external URL that we can access. To install <code>pyngrok</code>, run:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!pip</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-qqq</span> pyngrok</span></code></pre></div></div>
</div></li>
<li><p>Then, run the following command to get a public ngrok URL. This will be the URL we will use to access VSCode.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyngrok <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ngrok</span>
<span id="cb6-2">url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ngrok.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(port<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9000</span>)</span>
<span id="cb6-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(url)</span></code></pre></div></div></li>
<li><p>Now, we will start the VSCode server in the background at port 9000 without any authentication using the following command.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!nohup</span> code-server <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> 9000 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--auth</span> none <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;</span></span></code></pre></div></div>
</div></li>
<li><p>Now, you can access the VSCode interface at the URL you got from step 3. The interface and functionality are the same as the desktop version of VSCode.</p></li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-vscode.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of a running instance of VSCode server"></p>
</figure>
</div>
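<p>If the ngrok URL does not load, one quick thing to check is whether anything is listening on the port at all. The helper below is a hypothetical sketch using only the standard library. It assumes the server runs on the same machine at <code>127.0.0.1</code> and uses port <code>9000</code> as in the steps above.</p>

```python
import socket


def is_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Prints True only if a server (e.g. code-server) is listening on port 9000.
print(is_port_open(9000))
```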
</section>
<section id="usage-tips" class="level2">
<h2 class="anchored" data-anchor-id="usage-tips">Usage Tips</h2>
<ol type="1">
<li><p>You can switch to the dark theme by going to the bottom-left corner of the editor, clicking the <strong>settings icon</strong>, and then clicking ‘<strong>Color Theme</strong>’.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-dark-theme-step-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Switching to dark theme on VSCode"></p>
</figure>
</div>
<p>A popup will open. Select <strong>Dark (Visual Studio)</strong> in the options and the editor will switch to a dark theme.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-dark-theme-step-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Theme selection interface on VSCode"></p>
</figure>
</div></li>
<li><p>All the keyboard shortcuts of regular VSCode work here. For example, you can use <code>Ctrl + Shift + P</code> to open a popup for various actions.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/vscode-ctrl-shift-p.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Action popup in VSCode"></p>
</figure>
</div></li>
<li><p>To open a terminal, you can use the shortcut <code>Ctrl + Shift + `</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/vscode-terminal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Opening integrated terminal in VSCode"></p>
</figure>
</div></li>
<li><p>To get Python code completions, you can install the Python (<code>ms-python</code>) extension from the extensions page on the left sidebar.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/vscode-code-completions.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Installing extensions in VSCode"></p>
</figure>
</div></li>
<li><p>The Colab interface is still usable as a notebook, and the regular functions to upload and download files and to mount Google Drive keep working. Thus, you get the benefits of both a notebook and a code editor.</p></li>
</ol>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://github.com/coder/code-server/blob/v3.5.0/doc/FAQ.md">Code-Server FAQs</a></li>
<li><a href="https://pyngrok.readthedocs.io/en/latest/">pyngrok - a Python wrapper for ngrok</a></li>
</ul>


</section>

 ]]></description>
  <category>colab</category>
  <guid>https://amitness.com/posts/vscode-colab</guid>
  <pubDate>Tue, 01 Sep 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/vscode-colab-thumbnail.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Unsupervised Keyphrase Extraction</title>
  <link>https://amitness.com/posts/keyword-extraction</link>
  <description><![CDATA[ 




<p>Keyword extraction is one of the simplest ways to leverage text mining for providing business value. It can automatically identify the most representative terms in a document.</p>
<p>Such extracted keywords can be used for various applications. They can be used to summarize the underlying theme of a large document with just a few terms. They are also valuable as metadata for indexing and tagging the documents. They can likewise be used for clustering similar documents. For instance, to showcase relevant advertisements on a webpage, we could extract keywords from the webpage, find matching advertisements for these keywords, and showcase those.</p>
<p>In this post, I will provide an overview of the general pipeline of keyword extraction and explain the working mechanism of various unsupervised algorithms for this.</p>
<section id="unsupervised-keyphrase-extraction-pipeline" class="level2">
<h2 class="anchored" data-anchor-id="unsupervised-keyphrase-extraction-pipeline">Unsupervised Keyphrase Extraction Pipeline</h2>
<p>For keyword extraction, all algorithms follow a similar pipeline as shown below. A document is preprocessed to remove less informative content such as stop words and punctuation, and is then split into terms. Candidate keywords such as words and phrases are chosen from these terms.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-extraction-pipeline.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="General Pipeline of Keyword Extraction"></p>
</figure>
</div>
<p>Then, a score is determined for each candidate keyword using some algorithm. The highest-ranking keywords are selected and post-processing such as removing near-duplicates is applied. Finally, the algorithm returns the top N ranking keywords as output.</p>
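<p>The scoring, ranking, and post-processing stages can be sketched as a small function that accepts any scoring function. This is a minimal illustration under stated assumptions: <code>top_keywords</code> is a hypothetical name, the near-duplicate filter is a crude substring check, and plain frequency counting via <code>Counter</code> is used as the scorer.</p>

```python
from collections import Counter


def top_keywords(candidates, score_fn, n=5):
    """Score candidates, rank them, drop near-duplicates, and return the top n."""
    scores = score_fn(candidates)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))

    selected = []
    for term in ranked:
        # Crude near-duplicate filter: skip terms that contain, or are
        # contained in, a keyword that was already selected.
        if any(term in kept or kept in term for kept in selected):
            continue
        selected.append(term)
        if len(selected) == n:
            break
    return selected


candidates = ["matter", "atoms", "matter", "atom", "energy", "matter", "atoms"]
keywords = top_keywords(candidates, Counter, n=3)
print(keywords)
```

<p>Because the scorer is passed in as an argument, the same skeleton works for more elaborate scoring methods.</p>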
</section>
<section id="unsupervised-methods" class="level2">
<h2 class="anchored" data-anchor-id="unsupervised-methods">Unsupervised Methods</h2>
<p>Unsupervised algorithms for keyword extraction don’t need to be trained on the corpus and don’t need any pre-defined rules, dictionary, or thesaurus. They can use statistical features from the text itself and as such can be applied to large documents easily without re-training. Most of these algorithms don’t need any linguistic features except for stop word lists and so can be applied to multiple languages.</p>
<p>Let’s understand each algorithm by starting from simple methods and gradually adding complexity.</p>
</section>
<section id="naive-counting" class="level2">
<h2 class="anchored" data-anchor-id="naive-counting">1. Naive Counting</h2>
<p>This is a simple method which only takes into account how many times each term occurs.</p>
<p>Let’s understand it by applying it to an example document.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-matter-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example document for keyword extraction"></p>
</figure>
</div>
<section id="a.-pre-processing" class="level3">
<h3 class="anchored" data-anchor-id="a.-pre-processing">a. Pre-processing</h3>
<p>In this step, we lowercase the text and remove uninformative words such as stop words.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-matter-stopword-removal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Removing stopwords from a document"></p>
</figure>
</div>
</section>
<section id="b.-candidate-generation" class="level3">
<h3 class="anchored" data-anchor-id="b.-candidate-generation">b. Candidate Generation</h3>
<p>We split the remaining terms by space and punctuation symbols to get a list of possible keywords.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-candidates.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Generating candidate keywords"></p>
</figure>
</div>
</section>
<section id="c.-candidate-scoring" class="level3">
<h3 class="anchored" data-anchor-id="c.-candidate-scoring">c.&nbsp;Candidate Scoring</h3>
<p>We can count the number of times each term occurs to get a score for each term.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 11%">
<col style="width: 5%">
<col style="width: 11%">
<col style="width: 7%">
<col style="width: 8%">
<col style="width: 8%">
<col style="width: 8%">
<col style="width: 10%">
<col style="width: 8%">
<col style="width: 4%">
</colgroup>
<thead>
<tr class="header">
<th>Candidate</th>
<th>anything</th>
<th>mass</th>
<th>occupies</th>
<th>space</th>
<th>called</th>
<th>matter</th>
<th>exists</th>
<th>various</th>
<th>states</th>
<th>…</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Count</strong></td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>…</td>
</tr>
</tbody>
</table>
</section>
<section id="d.-final-ranking" class="level3">
<h3 class="anchored" data-anchor-id="d.-final-ranking">d.&nbsp;Final Ranking</h3>
<p>We can sort the keywords in descending order based on the counts and take the top N keywords as the output.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-counting-ranking.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Re-ranking keywords using count"></p>
</figure>
</div>
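<p>The four steps above can be sketched in a few lines of Python. This is a minimal sketch: the example text is only an approximation of the document in the figure, and the stop word set is a tiny hypothetical list chosen for this example.</p>

```python
import re
from collections import Counter

# Hypothetical stop word list, just large enough for this example.
STOP_WORDS = {"that", "has", "and", "is", "in", "a", "the", "of"}

def naive_keywords(text, top_n=3):
    # a. Pre-processing: lowercase the text.
    # b. Candidate generation: split on spaces and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = [t for t in tokens if t not in STOP_WORDS]
    # c. Scoring by raw count, d. ranking in descending order.
    return Counter(candidates).most_common(top_n)

# Assumed text, approximating the example document shown in the figure.
text = ("Anything that has mass and occupies space is called matter. "
        "Matter exists in various states.")
top_keywords = naive_keywords(text, top_n=1)
print(top_keywords)
```

<p>Here only “matter” occurs twice, so it ranks first. Terms with equal counts are returned in the order they first appear.</p>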
</section>
<section id="drawback-of-naive-counting" class="level3">
<h3 class="anchored" data-anchor-id="drawback-of-naive-counting">Drawback of Naive Counting</h3>
<p>This method has an obvious drawback: it only considers frequency. Generic words are likely to be very frequent in any document, yet they are not representative of the domain and topic of the document. We need some way to filter out generic terms.</p>
</section>
</section>
<section id="term-frequency-inverse-document-frequency-tf-idf" class="level2">
<h2 class="anchored" data-anchor-id="term-frequency-inverse-document-frequency-tf-idf">2. Term Frequency Inverse Document Frequency (TF-IDF)</h2>
<p>This method takes into account both how frequent the keyphrase is in the document and how rare it is across the whole collection of documents.</p>
<p>Let’s understand how it works by going through the various steps of the pipeline:</p>
<section id="a.-pre-processing-1" class="level3">
<h3 class="anchored" data-anchor-id="a.-pre-processing-1">a. Pre-processing</h3>
<p>In this step, we lowercase the text and split the document into sentences.</p>
</section>
<section id="b.-candidate-generation-1" class="level3">
<h3 class="anchored" data-anchor-id="b.-candidate-generation-1">b. Candidate Generation</h3>
<p>We generate 1-gram, 2-gram, and 3-gram candidate phrases from each sentence such that they don’t contain any punctuation. These form our list of candidate phrases.</p>
</section>
<section id="c.-candidate-scoring-1" class="level3">
<h3 class="anchored" data-anchor-id="c.-candidate-scoring-1">c.&nbsp;Candidate Scoring</h3>
<p>Now, for each candidate keyword “w”, we calculate the TF-IDF score in the following steps.</p>
<p>First, the term frequency (TF) is calculated simply by counting the occurrences of the word in the document.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0ATF(w)%20=%20count(w)%0A"></p>
<p>Then, the inverse document frequency (IDF) is calculated by dividing the total number of documents by the number of documents that contain the word “w” and taking the log of that quantity.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AIDF(w)%20=%20log(%5C%20%5Cfrac%7Btotal%5C%20documents%7D%7Bnumber%5C%20of%5C%20docs%5C%20containing%5C%20word%5C%20w%7D%5C%20)%0A"></p>
<p>Finally, we get the <code>TF-IDF</code> score for a term by multiplying the two quantities.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0ATFIDF(w)%20=%20TF(w)%20*%20IDF(w)%0A"></p>
</section>
<section id="d.-final-ranking-1" class="level3">
<h3 class="anchored" data-anchor-id="d.-final-ranking-1">d.&nbsp;Final Ranking</h3>
<p>We can sort the keywords in descending order based on their TF-IDF scores and take the top N keywords as the output.</p>
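<p>As a rough illustration, the TF-IDF steps can be written with only the standard library. This is a sketch under a few assumptions: the three-document corpus is made up, the natural log is used, and the helper names <code>candidate_phrases</code> and <code>tfidf_scores</code> are hypothetical.</p>

```python
import math
import re
from collections import Counter

def candidate_phrases(document, max_n=3):
    """Return all 1- to max_n-grams that do not span a punctuation mark."""
    phrases = []
    # Splitting at punctuation also splits at sentence boundaries.
    for chunk in re.split(r"[^\w\s]+", document.lower()):
        words = chunk.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                phrases.append(" ".join(words[i:i + n]))
    return phrases

def tfidf_scores(corpus, doc_index):
    """Score every candidate of corpus[doc_index] with TF(w) * IDF(w)."""
    tf = Counter(candidate_phrases(corpus[doc_index]))
    doc_sets = [set(candidate_phrases(doc)) for doc in corpus]
    scores = {}
    for phrase, count in tf.items():
        # The document itself is part of the corpus, so df is at least 1.
        df = sum(1 for phrases in doc_sets if phrase in phrases)
        scores[phrase] = count * math.log(len(corpus) / df)
    return scores

# Made-up corpus for illustration.
corpus = [
    "Deep learning is a subfield of AI.",
    "AI is useful.",
    "Cooking is a useful skill.",
]
scores = tfidf_scores(corpus, doc_index=0)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:5])
```

<p>A word like “is” that appears in every document gets an IDF of log(1) = 0, so it drops to the bottom of the ranking no matter how often it occurs.</p>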
</section>
</section>
<section id="rapid-automatic-keyword-extraction-rake" class="level2">
<h2 class="anchored" data-anchor-id="rapid-automatic-keyword-extraction-rake">3. Rapid Automatic Keyword Extraction (RAKE)</h2>
<p>RAKE is a domain-independent keyword extraction method proposed in 2010. It uses word frequency and co-occurrence to identify the keywords. It is very useful for identifying relevant multi-word expressions.</p>
<section id="how-rake-works" class="level3">
<h3 class="anchored" data-anchor-id="how-rake-works">How RAKE works</h3>
<p>Let’s apply RAKE on a toy example document to understand how it works:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-sentence.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example document for RAKE algorithm"></p>
</figure>
</div>
<section id="a.-preprocessing" class="level4">
<h4 class="anchored" data-anchor-id="a.-preprocessing">a. Preprocessing</h4>
<p>First, the stop words in the document are removed.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-stopwords-removal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Removing stop words from the document"></p>
</figure>
</div>
</section>
<section id="b.-candidate-generation-2" class="level4">
<h4 class="anchored" data-anchor-id="b.-candidate-generation-2">b. Candidate Generation</h4>
<p>We split the document at the stop word positions and punctuation marks to get content words. The words that occur consecutively without any stop word between them are taken as candidate keywords.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/keyword-split-at-stopwords.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Generating candidate keywords using merging"></p>
</figure>
</div>
<p>For example, “Deep Learning” is treated as a single keyword.</p>
</section>
<section id="c.-candidate-scoring-2" class="level4">
<h4 class="anchored" data-anchor-id="c.-candidate-scoring-2">c.&nbsp;Candidate Scoring</h4>
<p>Next, the frequency of each individual word in the candidate keywords is calculated. This finds words that occur frequently.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 6%">
<col style="width: 13%">
<col style="width: 13%">
<col style="width: 5%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>deep</th>
<th>learning</th>
<th>subfield</th>
<th>ai</th>
<th>useful</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Word Frequency: <img src="https://latex.codecogs.com/png.latex?freq(w)"></strong></td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Similarly, the word co-occurrence counts are calculated. The degree of each word is the sum of its column in the co-occurrence matrix. This metric identifies words that occur often in longer candidate keywords.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th></th>
<th style="text-align: center;">deep</th>
<th style="text-align: center;">learning</th>
<th style="text-align: center;">subfield</th>
<th style="text-align: center;">ai</th>
<th style="text-align: center;">useful</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>deep</strong></td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="even">
<td><strong>learning</strong></td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="odd">
<td><strong>subfield</strong></td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="even">
<td><strong>ai</strong></td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="odd">
<td><strong>useful</strong></td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">1</td>
</tr>
<tr class="even">
<td>degree: <img src="https://latex.codecogs.com/png.latex?deg(w)"></td>
<td style="text-align: center;">1 + 1 = 2</td>
<td style="text-align: center;">1 + 1 = 2</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">1</td>
</tr>
</tbody>
</table>
<p>Then, we divide the degree by the frequency for each word to get a final score. This score identifies words that occur more in longer candidate keywords than individually.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 44%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>deep</th>
<th>learning</th>
<th>subfield</th>
<th>ai</th>
<th>useful</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Score = <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bdeg(w)%7D%7Bfreq(w)%7D"></strong></td>
<td>2 / 1 = 2</td>
<td>2 / 1 = 2</td>
<td>1 / 1 = 1</td>
<td>1 / 1 = 1</td>
<td>1 / 1 = 1</td>
</tr>
</tbody>
</table>
</section>
<section id="d.-final-ranking-2" class="level4">
<h4 class="anchored" data-anchor-id="d.-final-ranking-2">d.&nbsp;Final Ranking</h4>
<p>Finally, we calculate the scores for our candidate keywords by adding the scores for their member words. The higher the score, the more useful a keyword is.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 26%">
<col style="width: 7%">
<col style="width: 65%">
</colgroup>
<thead>
<tr class="header">
<th>Keyword</th>
<th>Score</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>deep learning</strong></td>
<td>4</td>
<td>score(deep) + score(learning) = 2 + 2 = 4</td>
</tr>
<tr class="even">
<td><strong>subfield</strong></td>
<td>1</td>
<td>score(subfield) = 1</td>
</tr>
<tr class="odd">
<td><strong>ai</strong></td>
<td>1</td>
<td>score(ai) = 1</td>
</tr>
<tr class="even">
<td><strong>useful</strong></td>
<td>1</td>
<td>score(useful) = 1</td>
</tr>
</tbody>
</table>
<p>Thus, the keywords are sorted in the descending order of their score value. We can select the top-N keywords from this list.</p>
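<p>To make the arithmetic concrete, here is a small from-scratch sketch of the scoring described above. It is not the rake-nltk implementation. The stop word set is a hypothetical toy list that covers only this example, and the function names are made up.</p>

```python
import re
from collections import defaultdict

# Hypothetical toy stop word list for the example sentence.
STOP_WORDS = {"is", "a", "of", "it", "very"}

def rake_candidates(text):
    """Split at punctuation and stop words; keep runs of content words."""
    phrases = []
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    phrases = rake_candidates(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # A word co-occurs with every word of its phrase, itself included.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    # A keyword's score is the sum of the scores of its member words.
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

scores = rake_scores("Deep Learning is a subfield of AI. It is very useful.")
for keyword, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(keyword, score)
```

<p>With this toy stop word list, “deep learning” scores 4.0 and the three single-word candidates score 1.0 each, matching the table above.</p>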
</section>
</section>
<section id="drawbacks-of-rake" class="level3">
<h3 class="anchored" data-anchor-id="drawbacks-of-rake">Drawbacks of RAKE</h3>
<ul>
<li>If the stop word list used in RAKE is not exhaustive, it would treat continuous long text as a phrase and give very long phrases.</li>
<li>Multi-word expressions that contain stop-words could be missed. For example, mention of a brand called “Good Day” could be missed if “good” is present in the stop word list.</li>
</ul>
</section>
<section id="using-rake-in-python" class="level3">
<h3 class="anchored" data-anchor-id="using-rake-in-python">Using RAKE in Python</h3>
<p>RAKE is available in Python through the <a href="https://csurfer.github.io/rake-nltk/_build/html/index.html">rake-nltk</a> library, as shown below.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install rake-nltk</span></code></pre></div></div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> rake_nltk <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Rake</span>
<span id="cb2-2">rake <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Rake()</span>
<span id="cb2-3"></span>
<span id="cb2-4">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Deep Learning is a subfield of AI. It is very useful.'</span></span>
<span id="cb2-5">rake.extract_keywords_from_text(text)</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(rake.get_ranked_phrases_with_scores())</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">[(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'deep learning'</span>), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'useful'</span>), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'subfield'</span>), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ai'</span>)]</span></code></pre></div></div>
</section>
</section>
<section id="yet-another-keyword-extractor-yake" class="level2">
<h2 class="anchored" data-anchor-id="yet-another-keyword-extractor-yake">4. Yet Another Keyword Extractor (YAKE)</h2>
<p>YAKE is another popular keyword extraction algorithm proposed in 2018. It outperforms TF-IDF and RAKE across many datasets and won the Best Short Paper Award at <a href="https://ecir2018.org/" title="European Conference on Information Retrieval 2018">ECIR 2018</a>.</p>
<p>YAKE uses statistical features to identify and rank the most important keywords. It doesn’t need any linguistic information like NER or POS tagging and thus can be used with any language. It only requires a stop word list for the language.</p>
<section id="how-yake-works" class="level3">
<h3 class="anchored" data-anchor-id="how-yake-works">How YAKE works</h3>
<section id="i.-preprocessing-and-candidate-generation" class="level4">
<h4 class="anchored" data-anchor-id="i.-preprocessing-and-candidate-generation">i. Preprocessing and Candidate Generation</h4>
<p>The sentences are split into terms using spaces and special characters (line breaks, brackets, commas, periods) as delimiters.</p>
<p>We decide the maximum length of the keyword to be generated. If we decide max length of 3, then 1-gram, 2-gram, and 3-gram candidate phrases are generated using a sliding window.</p>
<p>Then, we remove phrases that contain punctuation marks. Phrases that begin or end with a stop word are also removed.</p>
</section>
<section id="ii.-candidate-scoring" class="level4">
<h4 class="anchored" data-anchor-id="ii.-candidate-scoring">ii. Candidate Scoring</h4>
<p>YAKE uses 5 features to quantify how good each word is.</p>
<section id="a.-casing" class="level5">
<h5 class="anchored" data-anchor-id="a.-casing">a. Casing</h5>
<p>This feature considers the casing of the word. It gives more importance to capitalized words and acronyms such as “NASA”.</p>
<p>First, we count the number of times the word starts with a capital letter when it is not the beginning word of the sentence. We also count the times when the word is in acronym form.</p>
<p>Then, we take the maximum of the two counts and divide it by one plus the log of the total count of the word.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Acasing(w)%20=%20%5Cfrac%7Bmax(%20count(w%5C%20is%5C%20capital),%20count(w%5C%20is%5C%20acronym)%20)%7D%7B1%20+%20log(count(w))%7D%0A"></p>
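<p>A minimal sketch of this feature is shown below. It assumes the document is already split into sentences of tokens, treats any fully uppercase token longer than one character as an acronym, and uses the natural log. The function name and the example sentences are made up.</p>

```python
import math

def casing(sentences, word):
    """sentences: list of token lists (original casing); word: lowercase term."""
    count = capital = acronym = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok.lower() != word:
                continue
            count += 1
            if len(tok) > 1 and tok.isupper():
                acronym += 1          # e.g. "NASA"
            elif tok[0].isupper() and i > 0:
                capital += 1          # capitalized, but not sentence-initial
    if count == 0:
        return 0.0                    # the word does not occur at all
    return max(capital, acronym) / (1 + math.log(count))

sentences = [
    ["NASA", "launched", "a", "rocket"],
    ["The", "rocket", "reached", "Mars"],
    ["Mars", "is", "red"],
]
print(casing(sentences, "nasa"), casing(sentences, "mars"), casing(sentences, "rocket"))
```

<p>Note that the sentence-initial “Mars” in the third sentence adds to the total count but not to the capitalized count.</p>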
</section>
<section id="b.-word-positional" class="level5">
<h5 class="anchored" data-anchor-id="b.-word-positional">b. Word Positional</h5>
<p>This feature gives more importance to words present at the beginning of the document. It’s based on the assumption that relevant keywords are usually concentrated more at the beginning of a document.</p>
<p>First, we get all the sentence positions where the word “w” occurs.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0ASen(w)%20=%20positions%5C%20of%5C%20sentences%5C%20where%5C%20w%5C%20occurs%0A"></p>
<p>Then, we compute the position feature by taking the median position and applying the following formula:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aposition(w)%20=%20log(%20log(%203%20+%20Median(Sen(w))%20)%20)%0A"></p>
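<p>The position feature could be computed as in the sketch below. It assumes sentence positions are counted from 0, tokens are already lowercased, and the natural log is used. The word must occur at least once, otherwise the median is undefined.</p>

```python
import math
from statistics import median

def position(sentences, word):
    """sentences: list of lowercase token lists; word must occur at least once."""
    # Sen(w): indices of the sentences that contain the word (0-based, an assumption).
    sen = [i for i, tokens in enumerate(sentences) if word in tokens]
    return math.log(math.log(3 + median(sen)))

sentences = [
    ["deep", "learning"],
    ["it", "is", "useful"],
    ["deep", "models"],
]
print(position(sentences, "learning"), position(sentences, "useful"))
```

<p>A word that only appears in the first sentence gets the smallest possible value, log(log(3)), and the value grows slowly as the median position moves later in the document.</p>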
</section>
<section id="c.-word-frequency" class="level5">
<h5 class="anchored" data-anchor-id="c.-word-frequency">c.&nbsp;Word Frequency</h5>
<p>This feature calculates the frequency of the word, normalized by the mean of all word counts plus one standard deviation.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Afrequency(w)%20=%20%5Cfrac%7Bcount%5C%20of%5C%20word%5C%20w%7D%7Bmean(counts)%20+%20standard%5C%20deviation(counts)%7D%0A"></p>
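<p>In code, this normalization might look like the sketch below. It assumes the token list already excludes stop words and uses the population standard deviation. Both are choices made for this example.</p>

```python
from collections import Counter
from statistics import mean, pstdev

def frequency(tokens):
    """Return count(w) / (mean(counts) + std(counts)) for every word."""
    counts = Counter(tokens)
    values = list(counts.values())
    # Population standard deviation is an assumption of this sketch.
    denominator = mean(values) + pstdev(values)
    return {word: count / denominator for word, count in counts.items()}

freq = frequency(["matter", "matter", "matter", "mass"])
print(freq)
```

<p>Here the counts are 3 and 1, so the mean is 2 and the standard deviation is 1. The denominator is therefore 3.</p>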
</section>
<section id="d.-word-relatedness-to-context" class="level5">
<h5 class="anchored" data-anchor-id="d.-word-relatedness-to-context">d.&nbsp;Word Relatedness to Context</h5>
<p>This feature quantifies how related a word is to its context. For that, it counts how many different terms occur to the left or right of a candidate word. If the word occurs frequently with different words on the left or right side, it is more likely to be a stop word.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Arelatedness(w)%20=%201%20+%20(WR%20+%20WL)%20*%20%5Cfrac%7Bcount(w)%7D%7Bmax%5C%20count%7D%20+%20PL%20+%20PR%0A"></p>
<p>where,</p>
<ul>
<li>WR = (number of unique words on right) / (total words on right)</li>
<li>WL = (number of unique words on left) / (total words on left)</li>
<li>PL = (total words on left) / (max count)</li>
<li>PR = (total words on right) / (max count)</li>
</ul>
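<p>The sketch below computes this feature exactly as the formula and the four definitions above are written. It assumes the context of a word is only its immediate left and right neighbor within a sentence, which is a simplification chosen for this example.</p>

```python
from collections import Counter, defaultdict

def relatedness(sentences):
    """sentences: list of lowercase token lists. Returns a score per word."""
    counts = Counter(tok for tokens in sentences for tok in tokens)
    left, right = defaultdict(list), defaultdict(list)
    for tokens in sentences:
        # Assumed context window: the immediately adjacent word on each side.
        for prev, nxt in zip(tokens, tokens[1:]):
            right[prev].append(nxt)
            left[nxt].append(prev)
    max_count = max(counts.values())
    scores = {}
    for word, count in counts.items():
        wl = len(set(left[word])) / len(left[word]) if left[word] else 0.0
        wr = len(set(right[word])) / len(right[word]) if right[word] else 0.0
        pl = len(left[word]) / max_count
        pr = len(right[word]) / max_count
        scores[word] = 1 + (wr + wl) * count / max_count + pl + pr
    return scores

sentences = [
    ["deep", "learning", "is", "useful"],
    ["machine", "learning", "is", "fun"],
]
scores = relatedness(sentences)
print(scores["deep"], scores["is"])
```

<p>In this toy input, “is” has more neighbors than “deep”, with two different words on its right, so it gets the higher value of the two. A higher value marks a word as more stop-word-like.</p>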
</section>
<section id="e.-word-different-sentence" class="level5">
<h5 class="anchored" data-anchor-id="e.-word-different-sentence">e. Word Different Sentence</h5>
<p>This feature quantifies how often a candidate word occurs in different sentences. A word that occurs in many different sentences has a higher score.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Adifferent(w)%20=%20%5Cfrac%7Bnumber%5C%20of%5C%20sentences%5C%20w%5C%20occurs%5C%20in%7D%7Btotal%5C%20sentences%7D%0A"></p>
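<p>This ratio is simple to compute. A short sketch, assuming the sentences are given as lists of lowercase tokens:</p>

```python
def different(sentences, word):
    """Fraction of sentences that contain the word at least once."""
    containing = sum(1 for tokens in sentences if word in tokens)
    return containing / len(sentences)

sentences = [
    ["matter", "has", "mass"],
    ["matter", "occupies", "space"],
    ["states", "vary"],
]
print(different(sentences, "matter"), different(sentences, "states"))
```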
</section>
<section id="combined-word-score" class="level5">
<h5 class="anchored" data-anchor-id="combined-word-score">Combined Word Score</h5>
<p>These 5 features are combined into a single word score using the formula:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ascore(w)%20=%20%5Cfrac%7Bd%20*%20b%7D%7Ba%20+%20(c%20/%20d)%20+%20(e%20/%20d)%7D%0A"></p>
<p>where,</p>
<ul>
<li>a = casing</li>
<li>b = position</li>
<li>c = frequency</li>
<li>d = relatedness</li>
<li>e = different</li>
</ul>
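<p>Written as a function, the combination is a one-liner. The feature values passed in below are made-up numbers, used only to show the arithmetic.</p>

```python
def word_score(casing, position, frequency, relatedness, different):
    """score(w) = (d * b) / (a + c / d + e / d), using the letters above."""
    return (relatedness * position) / (
        casing + frequency / relatedness + different / relatedness
    )

# Made-up feature values: (2.0 * 0.5) / (1.0 + 0.5 + 0.5)
example = word_score(casing=1.0, position=0.5, frequency=1.0,
                     relatedness=2.0, different=1.0)
print(example)
```

<p>Casing, frequency, and sentence spread sit in the denominator, so a word that is capitalized, frequent, and spread across sentences gets a lower score.</p>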
</section>
<section id="keyword-score" class="level5">
<h5 class="anchored" data-anchor-id="keyword-score">Keyword Score</h5>
<p>Now, for each of our candidate keywords, a score is calculated using the following formula. Because a lower score is better, dividing by the count of the keyword penalizes less frequent keywords.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AS(kw)%20=%20%5Cfrac%7Bproduct(scores%5C%20of%5C%20words%5C%20in%5C%20keyword)%7D%7Bcount(keyword)%20*%20(1%20+%20sum%5C%20of%5C%20scores%5C%20of%5C%20words)%7D%0A"></p>
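<p>A small sketch of the keyword score follows. It assumes the denominator is grouped as count(keyword) * (1 + sum of word scores), which is the grouping used in the YAKE paper as far as I can tell. The word scores are made-up numbers.</p>

```python
import math

def keyword_score(word_scores, count):
    """word_scores: scores of the words in the keyword; count: keyword frequency."""
    # Assumed grouping: count * (1 + sum), not 1 + (sum * count).
    return math.prod(word_scores) / (count * (1 + sum(word_scores)))

# Made-up word scores for a two-word keyword.
once = keyword_score([0.5, 0.5], count=1)
twice = keyword_score([0.5, 0.5], count=2)
print(once, twice)
```

<p>The same keyword gets a lower, and therefore better, score when it occurs more often.</p>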
</section>
</section>
<section id="iii.-post-processing" class="level4">
<h4 class="anchored" data-anchor-id="iii.-post-processing">iii. Post-processing</h4>
<p>It’s pretty common to get similar candidates when extracting keyphrases. For example, we could have variations like:</p>
<ul>
<li>“work”, “works”<br>
</li>
<li>“relevant”, “relevance”</li>
</ul>
<p>To eliminate such duplicates, the following process is applied:</p>
<ul>
<li>First, the keywords are sorted in ascending order of their scores (best first), and we maintain a list of the keywords chosen so far<br>
</li>
<li>Then, for each keyword in the list
<ul>
<li>If the keyword has a small Levenshtein distance to any of the keywords chosen so far, it is skipped</li>
<li>Otherwise, the keyword is added to the chosen keywords list</li>
</ul></li>
</ul>
<p>Thus, the chosen keyword list contains the final deduplicated keywords.</p>
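<p>The deduplication loop can be sketched as follows. “Small distance” is interpreted here as a normalized similarity of at least 0.75, which is a hypothetical threshold picked for this example, and the scores are made up.</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def deduplicate(scored_keywords, threshold=0.75):
    """Keep a keyword only if it is not too similar to a better-ranked one."""
    chosen = []
    # Lower score means more important, so sort in ascending order.
    for keyword, score in sorted(scored_keywords, key=lambda kv: kv[1]):
        is_duplicate = any(
            1 - levenshtein(keyword, other) / max(len(keyword), len(other)) >= threshold
            for other, _ in chosen
        )
        if not is_duplicate:
            chosen.append((keyword, score))
    return chosen

# Made-up scores for the variations listed above.
keywords = [("works", 0.2), ("work", 0.1), ("relevance", 0.3), ("relevant", 0.25)]
deduplicated = deduplicate(keywords)
print(deduplicated)
```

<p>Because the list is processed best-first, the variant with the lower score is the one that survives.</p>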
</section>
<section id="iv.-final-ranking" class="level4">
<h4 class="anchored" data-anchor-id="iv.-final-ranking">iv. Final Ranking</h4>
<p>Thus, we have a list of keywords along with their scores. A keyword is more important if it has a lower score.</p>
<p>We can sort the keywords in ascending order and take the top N keywords as the output.</p>
</section>
</section>
<section id="using-yake-in-python" class="level3">
<h3 class="anchored" data-anchor-id="using-yake-in-python">Using YAKE in Python</h3>
<p>To apply YAKE, we will use the <a href="https://github.com/boudinfl/pke">pke</a> library. First, we need to install the library and its dependencies using the following command:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install git+https://github.com/boudinfl/pke.git</span>
<span id="cb4-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> nltk.downloader stopwords</span>
<span id="cb4-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> spacy download en</span></code></pre></div></div>
</div>
<p>Then, we can use YAKE to generate keywords of maximum length 2 as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pke.unsupervised <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> YAKE</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> nltk.corpus <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> stopwords</span>
<span id="cb5-3"></span>
<span id="cb5-4">document <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence."</span></span>
<span id="cb5-5"></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 1. Create YAKE keyword extractor</span></span>
<span id="cb5-7">extractor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> YAKE()</span>
<span id="cb5-8"></span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 2. Load document</span></span>
<span id="cb5-10">extractor.load_document(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>document,</span>
<span id="cb5-11">                        language<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'en'</span>,</span>
<span id="cb5-12">                        normalization<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb5-13"></span>
<span id="cb5-14"></span>
<span id="cb5-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3. Generate candidate 1-gram and 2-gram keywords</span></span>
<span id="cb5-16">stoplist <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stopwords.words(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'english'</span>)</span>
<span id="cb5-17">extractor.candidate_selection(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, stoplist<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stoplist)</span>
<span id="cb5-18"></span>
<span id="cb5-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 4. Calculate scores for the candidate keywords</span></span>
<span id="cb5-20">extractor.candidate_weighting(window<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb5-21">                              stoplist<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stoplist,</span>
<span id="cb5-22">                              use_stems<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-23"></span>
<span id="cb5-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 5. Select 10 highest ranked keywords</span></span>
<span id="cb5-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Remove redundant keywords with similarity above 80%</span></span>
<span id="cb5-26">key_phrases <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> extractor.get_n_best(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, threshold<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb5-27"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(key_phrases)</span></code></pre></div></div>
<p>You get back a list of top-10 keywords and their scores. The highest ranked keyword has the lowest score.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">[(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'machine learning'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01552184797949213</span>),</span>
<span id="cb6-2"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'computer algorithms'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.04188746641162499</span>),</span>
<span id="cb6-3"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'improve automatically'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.04188746641162499</span>),</span>
<span id="cb6-4"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'machine'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.12363091320521931</span>),</span>
<span id="cb6-5"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'learning'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.12363091320521931</span>),</span>
<span id="cb6-6"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'experience'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.12363091320521931</span>),</span>
<span id="cb6-7"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'artificial intelligence'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.18075564686791562</span>),</span>
<span id="cb6-8"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'study'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2005079697193566</span>),</span>
<span id="cb6-9"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'computer'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2005079697193566</span>),</span>
<span id="cb6-10"> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'algorithms'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2005079697193566</span>)]</span></code></pre></div></div>
</section>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Rose, Stuart &amp; Engel, Dave &amp; Cramer, Nick &amp; Cowley, Wendy. (2010). <a href="https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents">Automatic Keyword Extraction from Individual Documents</a>. doi:10.1002/9780470689646.ch1</li>
<li>Eirini Papagiannopoulou et al., <a href="https://arxiv.org/abs/1905.05044">“A Review of Keyphrase Extraction”</a></li>
<li><a href="https://github.com/boudinfl/pke/blob/master/pke/unsupervised/statistical/yake.py">“YAKE implementation in pke: an open source python-based keyphrase extraction toolkit”</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <guid>https://amitness.com/posts/keyword-extraction</guid>
  <pubDate>Sun, 30 Aug 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/keyword-extraction-pipeline.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Text Data Augmentation with MarianMT</title>
  <link>https://amitness.com/posts/back-translation</link>
  <description><![CDATA[ 




<p>Hugging Face recently released <a href="https://huggingface.co/models?search=Helsinki-NLP%2Fopus-mt">1008 translation models</a> for almost 140 languages on their model hub.</p>
<p>These models were originally trained by <a href="https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann">Jörg Tiedemann</a> of the <a href="https://blogs.helsinki.fi/language-technology/">Language Technology Research Group at the University of Helsinki</a>. They were trained on the <a href="https://opus.nlpl.eu/">Open Parallel Corpus(OPUS)</a> using a neural machine translation framework called <a href="https://marian-nmt.github.io/">MarianNMT</a>.</p>
<p>In this post, I will explain how you can use the MarianMT models to augment text data.</p>
<section id="back-translation" class="level2">
<h2 class="anchored" data-anchor-id="back-translation">Back Translation</h2>
<p>We will use a data augmentation technique called “Back Translation”. We start with an original text written in English and translate it into another language (e.g., French) using MarianMT. We then translate the French text back into English, again using MarianMT. We keep the back-translated English text only if it differs from the original English sentence.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/back-translation-marianmt.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Backtranslation with MarianMT"></p>
</figure>
</div>
</section>
<section id="augmentation-process" class="level2">
<h2 class="anchored" data-anchor-id="augmentation-process">Augmentation Process</h2>
<p><a href="https://colab.research.google.com/drive/1J_KpNYj03gecT0p9s6YeDcDJHKgPn1Hh?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" class="img-fluid"></a></p>
<p>First, we need to install Hugging Face transformers, SentencePiece, and the Moses tokenizer with the following commands:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install transformers==4.1.1 sentencepiece==0.1.94</span>
<span id="cb1-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install mosestokenizer==1.1.0</span></code></pre></div></div>
</div>
<p>After installation, we can now import the MarianMT model and tokenizer.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MarianMTModel, MarianTokenizer</span></code></pre></div></div>
<p>Then, we can initialize the model that translates from English to Romance languages. This is a single model that can translate to any of the Romance languages.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">target_model_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Helsinki-NLP/opus-mt-en-ROMANCE'</span></span>
<span id="cb3-2">target_tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MarianTokenizer.from_pretrained(target_model_name)</span>
<span id="cb3-3">target_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MarianMTModel.from_pretrained(target_model_name)</span></code></pre></div></div>
<p>Similarly, we can initialize the model that translates from Romance languages to English.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">en_model_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Helsinki-NLP/opus-mt-ROMANCE-en'</span></span>
<span id="cb4-2">en_tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MarianTokenizer.from_pretrained(en_model_name)</span>
<span id="cb4-3">en_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MarianMTModel.from_pretrained(en_model_name)</span></code></pre></div></div>
<p>Next, we write a helper function to translate a batch of texts given the machine translation model, the tokenizer, and the target Romance language.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> translate(texts, model, tokenizer, language<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fr"</span>):</span>
<span id="cb5-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Prepare the text data into appropriate format for the model</span></span>
<span id="cb5-3">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> text: <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>text<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> language <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"&gt;&gt;</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>language<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">&lt;&lt; </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>text<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-4">    src_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [template(text) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> text <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> texts]</span>
<span id="cb5-5"></span>
<span id="cb5-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tokenize the texts</span></span>
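<span id="cb5-7">    encoded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenizer.prepare_seq2seq_batch(src_texts, return_tensors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pt"</span>)</span>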
<span id="cb5-8">    </span>
<span id="cb5-9">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate translation using model</span></span>
<span id="cb5-10">    translated <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.generate(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>encoded)</span>
<span id="cb5-11"></span>
<span id="cb5-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert the generated tokens indices back into text</span></span>
<span id="cb5-13">    translated_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenizer.batch_decode(translated, skip_special_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-14">    </span>
<span id="cb5-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> translated_texts</span></code></pre></div></div>
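<p>To make the template step concrete, here is a minimal sketch of just the prefixing logic, isolated from the model. The helper name <code>add_language_prefix</code> is hypothetical and not part of transformers; it only mirrors the lambda used in <code>translate()</code>, where a <code>&gt;&gt;lang&lt;&lt;</code> token is prepended unless the target language is English.</p>

```python
def add_language_prefix(texts, language="fr"):
    """Prefix each text with a >>lang<< token unless the target is English.

    Hypothetical helper that mirrors the template lambda in translate().
    """
    if language == "en":
        # The Romance-to-English model needs no target-language token
        return list(texts)
    return [f">>{language}<< {text}" for text in texts]


print(add_language_prefix(["This is so cool"], language="es"))
# ['>>es<< This is so cool']
print(add_language_prefix(["C'est trop cool"], language="en"))
# ["C'est trop cool"]
```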
<p>Next, we will prepare a function to use the above <code>translate()</code> function to perform back translation.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> back_translate(texts, source_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span>, target_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fr"</span>):</span>
<span id="cb6-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Translate from source to target language</span></span>
<span id="cb6-3">    fr_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translate(texts, target_model, target_tokenizer, </span>
<span id="cb6-4">                         language<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>target_lang)</span>
<span id="cb6-5"></span>
<span id="cb6-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Translate from target language back to source language</span></span>
<span id="cb6-7">    back_translated_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translate(fr_texts, en_model, en_tokenizer, </span>
<span id="cb6-8">                                      language<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>source_lang)</span>
<span id="cb6-9">    </span>
<span id="cb6-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> back_translated_texts</span></code></pre></div></div>
<p>Now, we can perform data augmentation using back-translation from English to Spanish on a list of sentences as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">en_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'This is so cool'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I hated the food'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'They were very helpful'</span>]</span>
<span id="cb7-2"></span>
<span id="cb7-3">aug_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> back_translate(en_texts, source_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span>, target_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"es"</span>)</span>
<span id="cb7-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(aug_texts)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Yeah, it's so cool."</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"It's the food I hated."</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'They were of great help.'</span>]</span></code></pre></div></div>
<p>Similarly, we can perform augmentation using English to French as shown below with the exact same helper method.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">en_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'This is so cool'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I hated the food'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'They were very helpful'</span>]</span>
<span id="cb9-2">aug_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> back_translate(en_texts, source_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span>, target_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fr"</span>)</span>
<span id="cb9-3"></span>
<span id="cb9-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(aug_texts)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"It's so cool."</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I hated food.'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"They've been very helpful."</span>]</span></code></pre></div></div>
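<p>Back translation only helps when the round trip actually changes the sentence, so the outputs can be filtered before adding them to a dataset. The sketch below is one possible way to do that, not part of the original notebook: the function name <code>keep_novel_augmentations</code> and the normalization rule (ignore case and trailing punctuation) are assumptions, and the second augmented sentence is a made-up example of an unchanged round trip.</p>

```python
def keep_novel_augmentations(originals, augmented):
    """Return (original, augmented) pairs whose text actually changed."""

    def normalize(text):
        # Assumption: case, surrounding whitespace and trailing
        # punctuation do not count as a real difference
        return text.strip().lower().rstrip(".!?")

    return [
        (orig, aug)
        for orig, aug in zip(originals, augmented)
        if normalize(orig) != normalize(aug)
    ]


originals = ["This is so cool", "I hated the food", "They were very helpful"]
# The middle sentence is a hypothetical round trip that came back unchanged
augmented = ["It's so cool.", "I hated the food.", "They've been very helpful."]

for orig, aug in keep_novel_augmentations(originals, augmented):
    print(f"{orig!r} -> {aug!r}")
```

<p>With these inputs, the first and third pairs are kept and the unchanged second sentence is dropped.</p>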
</section>
<section id="chained-back-translation" class="level2">
<h2 class="anchored" data-anchor-id="chained-back-translation">Chained Back Translation</h2>
<p>You can also run back translation in a chain to get more diversity. For example, <code>English -&gt; Spanish -&gt; English -&gt; French -&gt; English</code></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">en_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'This is so cool'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I hated the food'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'They were very helpful'</span>]</span>
<span id="cb11-2"></span>
<span id="cb11-3">aug1_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> back_translate(en_texts, source_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span>, target_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"es"</span>)</span>
<span id="cb11-4">aug2_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> back_translate(aug1_texts, source_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span>, target_lang<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fr"</span>)</span>
<span id="cb11-5"></span>
<span id="cb11-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(aug2_texts)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Yeah, that's cool."</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"It's the food I hated."</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'They were of great help.'</span>]</span></code></pre></div></div>
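<p>The chaining pattern generalizes to any sequence of pivot languages. The following is a rough sketch under stated assumptions: <code>chained_back_translate</code> is a hypothetical wrapper, and <code>fake_back_translate</code> is a stand-in that only tags the text so the example runs without downloading models. In practice you would pass the real <code>back_translate()</code> function defined earlier.</p>

```python
def chained_back_translate(texts, pivot_langs, back_translate_fn):
    """Run back translation once per pivot language, feeding each
    round's output into the next round."""
    current = list(texts)
    for lang in pivot_langs:
        current = back_translate_fn(current, source_lang="en", target_lang=lang)
    return current


def fake_back_translate(texts, source_lang="en", target_lang="fr"):
    # Stand-in for the real back_translate(): records the round trip
    # instead of calling a translation model
    return [f"{text} [{source_lang}->{target_lang}->{source_lang}]" for text in texts]


print(chained_back_translate(["This is so cool"], ["es", "fr"], fake_back_translate))
# ['This is so cool [en->es->en] [en->fr->en]']
```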
</section>
<section id="available-models" class="level2">
<h2 class="anchored" data-anchor-id="available-models">Available Models</h2>
<p>Here are the language codes for a subset of major Romance languages that you can use above.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 12%">
<col style="width: 9%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 15%">
<col style="width: 12%">
<col style="width: 10%">
<col style="width: 12%">
<col style="width: 7%">
</colgroup>
<thead>
<tr class="header">
<th>Language</th>
<th>French</th>
<th>Spanish</th>
<th>Italian</th>
<th>Portuguese</th>
<th>Romanian</th>
<th>Catalan</th>
<th>Galician</th>
<th>Latin</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Code</strong></td>
<td>fr</td>
<td>es</td>
<td>it</td>
<td>pt</td>
<td>ro</td>
<td>ca</td>
<td>gl</td>
<td>la</td>
</tr>
</tbody>
</table>
<table class="caption-top table">
<colgroup>
<col style="width: 11%">
<col style="width: 10%">
<col style="width: 28%">
<col style="width: 13%">
<col style="width: 13%">
<col style="width: 11%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th>Language</th>
<th>Walloon</th>
<th>Occitan (post 1500)</th>
<th>Sardinian</th>
<th>Aragonese</th>
<th>Corsican</th>
<th>Romansh</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Code</strong></td>
<td>wa</td>
<td>oc</td>
<td>sc</td>
<td>an</td>
<td>co</td>
<td>rm</td>
</tr>
</tbody>
</table>
<p>To view all available language codes, you can run</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">target_tokenizer.supported_language_codes</span></code></pre></div></div>
</section>
<section id="alternative-applications" class="level2">
<h2 class="anchored" data-anchor-id="alternative-applications">Alternative Applications</h2>
<p>Besides data augmentation, the back translation process can also be used for text paraphrasing.</p>
<p>Similarly, we can also use it as an adversarial attack. Suppose we have a training dataset on which we trained an NLP model. We can augment the training dataset with back translation and generate predictions from our model on the augmented texts. If a prediction differs from the ground-truth label, we have found a text on which our model fails. Analyzing these failures can give good insights into the model’s weaknesses.</p>
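<p>A rough sketch of this failure-analysis loop is shown below. Everything in it is illustrative: <code>find_failures</code> is a hypothetical helper, <code>toy_augment</code> simply looks up the Spanish back-translations printed earlier, and <code>toy_predict</code> is a keyword rule standing in for a trained classifier.</p>

```python
def find_failures(texts, labels, augment_fn, predict_fn):
    """Return augmented examples whose prediction disagrees with the label."""
    augmented = augment_fn(texts)
    predictions = predict_fn(augmented)
    return [
        {"original": t, "augmented": a, "label": y, "prediction": p}
        for t, a, y, p in zip(texts, augmented, labels, predictions)
        if p != y
    ]


# Stand-in for back_translate(): reuses the en -> es -> en outputs shown earlier
ROUND_TRIPS = {
    "This is so cool": "Yeah, it's so cool.",
    "I hated the food": "It's the food I hated.",
    "They were very helpful": "They were of great help.",
}


def toy_augment(texts):
    return [ROUND_TRIPS[text] for text in texts]


def toy_predict(texts):
    # Stand-in for a trained sentiment model: a brittle keyword rule
    return ["positive" if ("cool" in t or "helpful" in t) else "negative" for t in texts]


texts = list(ROUND_TRIPS)
labels = ["positive", "negative", "positive"]
for failure in find_failures(texts, labels, toy_augment, toy_predict):
    print(failure)
```

<p>Here the toy classifier misses the paraphrase “They were of great help.”, which is the kind of blind spot this analysis is meant to surface.</p>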
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, MarianMT is a decent free and offline alternative to Google Translate for back-translation.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://huggingface.co/docs/transformers/main/model_doc/marian">MarianMT - transformers 3.0.2 documentation</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <category>data-augmentation</category>
  <guid>https://amitness.com/posts/back-translation</guid>
  <pubDate>Sun, 30 Aug 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/back-translation-marianmt.png" medium="image" type="image/png" height="52" width="144"/>
</item>
<item>
  <title>Evaluation Metrics For Information Retrieval</title>
  <link>https://amitness.com/posts/information-retrieval-evaluation</link>
  <description><![CDATA[ 




<p>Most software products we encounter today have some form of search functionality integrated into them. We search for content on Google, videos on YouTube, products on Amazon, messages on Slack, emails on Gmail, people on Facebook, and so on.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-search-box.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Search box in popular software apps"></p>
</figure>
</div>
<p>For users, the workflow is pretty simple. We type a query into a search box, and the system’s ranking model returns the top-N most relevant results.</p>
<blockquote class="blockquote">
<p><em>How do we evaluate how good the top-N results are?</em></p>
</blockquote>
<p>In this post, I will answer the above question by explaining the common offline metrics used in learning to rank problems. These metrics are useful not only for evaluating search results but also for problems like keyword extraction and item recommendation.</p>
<section id="problem-setup-1-binary-relevance" class="level2">
<h2 class="anchored" data-anchor-id="problem-setup-1-binary-relevance">Problem Setup 1: Binary Relevance</h2>
<p>Let’s take a simple toy example to understand the details and trade-offs of various evaluation metrics.</p>
<p>We have a ranking model that returns the 5 most relevant results for a certain query. The first, third, and fifth results were <span class="bg-color-green">relevant</span> as per our ground-truth annotation.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-documents-horizontal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of binary relevance"></p>
</figure>
</div>
<p>Let’s look at various metrics to evaluate this simple example.</p>
</section>
<section id="a.-order-unaware-metrics" class="level2">
<h2 class="anchored" data-anchor-id="a.-order-unaware-metrics">A. Order-Unaware Metrics</h2>
<section id="precisionk" class="level3">
<h3 class="anchored" data-anchor-id="precisionk">1. Precision@k</h3>
<p>This metric quantifies what fraction of the top-k results are relevant. Mathematically, this is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BPrecision@k%7D%20=%20%5Cfrac%7B%20%5Ctext%7Btrue%20positives%20@%20k%7D%7D%7B(%5Ctext%7Btrue%20positives@k%7D)%20+%20(%5Ctext%7Bfalse%20positives@k%7D)%7D%0A"></p>
<p>For our example, precision@1 = 1 as the first result is relevant.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-precision-at-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Precision@1 for 5 documents"></p>
</figure>
</div>
<p>Similarly, precision@2 = 0.5 as only one of the top-2 results is relevant.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-precision-at-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Precision@2 for 5 documents"></p>
</figure>
</div>
<p>Thus, we can calculate the precision score for all k values.</p>
<table class="table-hover table-bordered caption-top table">
<colgroup>
<col style="width: 15%">
<col style="width: 15%">
<col style="width: 17%">
<col style="width: 18%">
<col style="width: 17%">
<col style="width: 17%">
</colgroup>
<thead>
<tr class="header">
<th>k</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Precision@k</strong></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B1%7D=1"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B2%7D=0.5"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2%7D%7B3%7D=0.67"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2%7D%7B4%7D=0.5"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B3%7D%7B5%7D=0.6"></td>
</tr>
</tbody>
</table>
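<p>The values in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the ranked results are encoded as a list of 0/1 relevance labels in rank order; the function name <code>precision_at_k</code> is my own and not taken from a library.</p>

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant.

    `relevance` holds 0/1 labels in rank order; assumes 1 <= k.
    """
    top_k = relevance[:k]
    return sum(top_k) / len(top_k)


# Toy example: the 1st, 3rd and 5th results are relevant
relevance = [1, 0, 1, 0, 1]
for k in range(1, 6):
    print(f"precision@{k} = {precision_at_k(relevance, k):.2f}")
# precision@1 = 1.00
# precision@2 = 0.50
# precision@3 = 0.67
# precision@4 = 0.50
# precision@5 = 0.60
```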
<p>A limitation of precision@k is that it doesn’t consider the position of the relevant items. Consider two models A and B that have the same number of relevant results i.e.&nbsp;3 out of 5.</p>
<p>For model A, the first three items were relevant, while for model B, the last three items were relevant. Precision@5 would be the same for both of these models even though model A is better.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-precision-drawback.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Drawback of precision@k metric"></p>
</figure>
</div>
</section>
<section id="recallk" class="level3">
<h3 class="anchored" data-anchor-id="recallk">2. Recall@k</h3>
<p>This metric gives the fraction of all relevant results for the query that appear in the top-k results. Mathematically, this is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BRecall@k%7D%20=%20%5Cfrac%7B%5Ctext%7Btrue%20positives@k%7D%7D%7B(%5Ctext%7Btrue%20positives@k%7D)%20+%20(%5Ctext%7Bfalse%20negatives@k%7D)%7D%0A"></p>
<p>For our example, recall@1 = 0.33 as only one of the 3 relevant items is present.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-recall-at-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Calculation of Recall@1 for 5 documents"></p>
</figure>
</div>
<p>Similarly, recall@3 = 0.67 as only two of the 3 relevant items are present.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-recall-at-3.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Calculation of Recall@3 for 5 documents"></p>
</figure>
</div>
<p>Thus, we can calculate the recall score for different K values.</p>
<table class="table-bordered caption-top table">
<colgroup>
<col style="width: 40%">
<col style="width: 60%">
</colgroup>
<thead>
<tr class="header">
<th>k</th>
<th>Recall@k</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B(1+2)%7D=%5Cfrac%7B1%7D%7B3%7D=0.33"></td>
</tr>
<tr class="even">
<td>2</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B(1+2)%7D=%5Cfrac%7B1%7D%7B3%7D=0.33"></td>
</tr>
<tr class="odd">
<td>3</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2%7D%7B(2+1)%7D=%5Cfrac%7B2%7D%7B3%7D=0.67"></td>
</tr>
<tr class="even">
<td>4</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2%7D%7B(2+1)%7D=%5Cfrac%7B2%7D%7B3%7D=0.67"></td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B3%7D%7B(3+0)%7D=%5Cfrac%7B3%7D%7B3%7D=1"></td>
</tr>
</tbody>
</table>
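<p>The same toy example can be used to check the recall values. The sketch below assumes that every relevant item for the query appears somewhere in the ranked list, so the total number of relevant items can be taken from the list itself; the optional <code>total_relevant</code> argument is a hypothetical way to override that when some relevant items were never retrieved.</p>

```python
def recall_at_k(relevance, k, total_relevant=None):
    """Fraction of all relevant items that appear in the top-k results."""
    if total_relevant is None:
        # Assumption: all relevant items are present in the ranked list
        total_relevant = sum(relevance)
    return sum(relevance[:k]) / total_relevant


# Toy example: the 1st, 3rd and 5th results are relevant
relevance = [1, 0, 1, 0, 1]
for k in range(1, 6):
    print(f"recall@{k} = {recall_at_k(relevance, k):.2f}")
# recall@1 = 0.33
# recall@2 = 0.33
# recall@3 = 0.67
# recall@4 = 0.67
# recall@5 = 1.00
```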
</section>
<section id="f1k" class="level3">
<h3 class="anchored" data-anchor-id="f1k">3. F1@k</h3>
<p>This is a combined metric that incorporates both Precision@k and Recall@k by taking their harmonic mean. We can calculate it as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BF1@k%7D%20=%20%5Cfrac%7B2*(%5Ctext%7BPrecision@k%7D)%20*%20(%5Ctext%7BRecall@k%7D)%7D%7B(%5Ctext%7BPrecision@k%7D)%20+%20(%5Ctext%7BRecall@k%7D)%7D%0A"></p>
<p>Using the previously calculated values of precision and recall, we can calculate F1-scores for different K values as shown below.</p>
<table class="table-bordered caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 27%">
<col style="width: 29%">
<col style="width: 28%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Metric</th>
<th style="text-align: center;">Precision@k</th>
<th style="text-align: center;">Recall@k</th>
<th style="text-align: center;">F1@k</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;"><strong>k=1</strong></td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">1/3</td>
<td style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2*1*(1/3)%7D%7B(1+1/3)%7D=0.5"></td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong>k=2</strong></td>
<td style="text-align: center;">1/2</td>
<td style="text-align: center;">1/3</td>
<td style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2*(1/2)*(1/3)%7D%7B(1/2+1/3)%7D=0.4"></td>
</tr>
<tr class="odd">
<td style="text-align: center;"><strong>k=3</strong></td>
<td style="text-align: center;">2/3</td>
<td style="text-align: center;">2/3</td>
<td style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2*(2/3)*(2/3)%7D%7B(2/3+2/3)%7D=0.667"></td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong>k=4</strong></td>
<td style="text-align: center;">1/2</td>
<td style="text-align: center;">2/3</td>
<td style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2*(1/2)*(2/3)%7D%7B(1/2+2/3)%7D=0.571"></td>
</tr>
<tr class="odd">
<td style="text-align: center;"><strong>k=5</strong></td>
<td style="text-align: center;">3/5</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2*(3/5)*1%7D%7B(3/5+1)%7D=0.75"></td>
</tr>
</tbody>
</table>
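<p>As a rough sketch, the same F1@k values can be computed directly from the binary labels. The function name <code>f1_at_k</code> and the label list are illustrative assumptions, and the zero check is an added guard for the case where both precision and recall are 0.</p>

```python
def f1_at_k(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Binary labels of the 5 ranked results (1 = relevant), assumed from the example.
ranked_relevance = [1, 0, 1, 0, 1]
total_relevant = 3

f1_scores = []
for k in range(1, 6):
    hits = sum(ranked_relevance[:k])      # relevant items in the top-k
    precision = hits / k                  # Precision@k
    recall = hits / total_relevant        # Recall@k
    f1_scores.append(round(f1_at_k(precision, recall), 3))

print(f1_scores)  # [0.5, 0.4, 0.667, 0.571, 0.75]
```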
</section>
</section>
<section id="b.-order-aware-metrics" class="level2">
<h2 class="anchored" data-anchor-id="b.-order-aware-metrics">B. Order Aware Metrics</h2>
<p>While precision, recall, and F1 give us a single-value metric, they don’t consider the order in which the search results are returned. To address that limitation, people have devised the order-aware metrics given below:</p>
<section id="mean-reciprocal-rankmrr" class="level3">
<h3 class="anchored" data-anchor-id="mean-reciprocal-rankmrr">1. Mean Reciprocal Rank(MRR)</h3>
<p>This metric is useful when we want our system to return the best relevant item and want that item to be at a higher position. Mathematically, this is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AMRR%20=%20%5Cfrac%7B1%7D%7B%7CQ%7C%7D%20%5Csum_%7Bi=1%7D%5E%7B%7CQ%7C%7D%20%5Cfrac%7B1%7D%7Brank_%7Bi%7D%7D%0A"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%7CQ%7C"> denotes the total number of queries</li>
<li><img src="https://latex.codecogs.com/png.latex?rank_i"> denotes the rank of the first relevant result for the i-th query</li>
</ul>
<p>To calculate MRR, we first calculate the <strong>reciprocal rank</strong>. It is simply the reciprocal of the rank of the first relevant result, and its value ranges from 0 to 1.</p>
<p>For our example, the reciprocal rank is <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B1%7D=1"> as the first correct item is at position 1.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-reciprocal-rank.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Calculation of MRR for first relevant result"></p>
</figure>
</div>
<p>Let’s see another example where the only relevant result is present at the end of the list i.e.&nbsp;position 5. It gets a lower reciprocal rank score of 0.2.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-reciprocal-rank-last.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="MRR when document is at last"></p>
</figure>
</div>
<p>Let’s consider another example where none of the returned results are relevant. In such a scenario, the reciprocal rank will be 0.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-reciprocal-rank-zero.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Worst case example of MRR"></p>
</figure>
</div>
<p>For multiple different queries, we can calculate the MRR by taking the mean of the reciprocal rank for each query.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-mean-reciprocal-rank.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Calculation of MRR for 3 queries"></p>
</figure>
</div>
<p>We can see that MRR doesn’t care about the position of the remaining relevant results. So, if your use-case requires returning multiple relevant results in the best possible way, MRR is not a suitable metric.</p>
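<p>A minimal sketch of the calculation is shown below. The three label lists are the single-query cases described above (first relevant item at position 1, at position 5, and no relevant item at all), combined here purely for illustration. They are not necessarily the queries shown in the figure, and the function names are hypothetical.</p>

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if nothing is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Mean of the reciprocal ranks over all queries."""
    return sum(reciprocal_rank(relevance) for relevance in queries) / len(queries)


# One list of binary labels per query (illustrative data).
queries = [
    [1, 0, 1, 0, 1],  # first relevant result at position 1, reciprocal rank 1
    [0, 0, 0, 0, 1],  # first relevant result at position 5, reciprocal rank 0.2
    [0, 0, 0, 0, 0],  # no relevant result, reciprocal rank 0
]

mrr = mean_reciprocal_rank(queries)
print(round(mrr, 2))  # 0.4
```

<p>Note that the first query scores 1 regardless of where its second and third relevant items appear, which is exactly the limitation described above.</p>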
</section>
<section id="average-precisionap" class="level3">
<h3 class="anchored" data-anchor-id="average-precisionap">2. Average Precision(AP)</h3>
<p>Average Precision is a metric that evaluates how high the ground-truth relevant items are ranked in the results returned by the model. Unlike MRR, it considers all the relevant items.</p>
<p>Mathematically, it is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AAP%20=%20%5Cfrac%7B%5Csum_%7Bk=1%7D%5E%7Bn%7D%20(P(k)%20*%20rel(k))%7D%7B%5Ctext%7Bnumber%20of%20relevant%20items%7D%7D%0A"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?rel(k)"> is an indicator function which is 1 when the item at rank k is relevant and 0 otherwise.<br>
</li>
<li><img src="https://latex.codecogs.com/png.latex?P(k)"> is the Precision@k metric</li>
</ul>
<p>For our example, we can calculate the AP based on our Precision@K values for different K.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-average-precision-example-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Precision@k for different values of k"></p>
</figure>
</div>
<p><img src="https://latex.codecogs.com/png.latex?%0AAP%20=%20%5Cfrac%7B(1%20+%202/3%20+%203/5)%7D%7B3%7D%20=%200.7556%0A"></p>
<p>To illustrate the advantage of AP, let’s take our previous example but place the 3 relevant results at the beginning. We can see that this ranking gets a perfect AP score of 1, which is higher than in the above example.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-average-precision-example-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Impact of order on average precision"></p>
</figure>
</div>
<p><img src="https://latex.codecogs.com/png.latex?%0AAP%20=%20%5Cfrac%7B(1%20+%201%20+%201)%7D%7B3%7D%20=%201%0A"></p>
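<p>Both calculations can be sketched in Python as follows. The function name <code>average_precision</code> is illustrative, and the sketch assumes that every relevant item appears somewhere in the ranked list, so the number of relevant items is simply the sum of the labels.</p>

```python
def average_precision(relevance):
    """Average of Precision@k taken at each rank k that holds a relevant item."""
    total_relevant = sum(relevance)  # assumes all relevant items are in the list
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # Precision@k at a relevant position
    return precision_sum / total_relevant


ap_original = average_precision([1, 0, 1, 0, 1])  # relevant at positions 1, 3, 5
ap_reordered = average_precision([1, 1, 1, 0, 0])  # relevant items moved to the top

print(round(ap_original, 4))   # 0.7556
print(round(ap_reordered, 4))  # 1.0
```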
</section>
<section id="mean-average-precisionmap" class="level3">
<h3 class="anchored" data-anchor-id="mean-average-precisionmap">3. Mean Average Precision(MAP)</h3>
<p>If we want to evaluate average precision across multiple queries, we can use the MAP. It is simply the mean of the average precision for all queries. Mathematically, this is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AMAP%20=%20%5Cfrac%7B1%7D%7BQ%7D%20%5Csum_%7Bq=1%7D%5E%7BQ%7D%20AP(q)%0A"></p>
<p>where</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?Q"> is the total number of queries<br>
</li>
<li><img src="https://latex.codecogs.com/png.latex?AP(q)"> is the average precision for query q.</li>
</ul>
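<p>As a small illustration, the sketch below treats the two rankings from the Average Precision section as if they were the results of two separate queries and takes the mean of their AP scores. This pairing is an assumption made only for the example, and the helper names are hypothetical.</p>

```python
def average_precision(relevance):
    """Average of Precision@k taken at each rank k that holds a relevant item."""
    total_relevant = sum(relevance)  # assumes all relevant items are in the list
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / total_relevant


def mean_average_precision(queries):
    """Mean of the per-query average precision."""
    return sum(average_precision(relevance) for relevance in queries) / len(queries)


# Binary labels for two hypothetical queries.
queries = [
    [1, 0, 1, 0, 1],  # AP is about 0.7556
    [1, 1, 1, 0, 0],  # AP is 1
]

map_score = mean_average_precision(queries)
print(round(map_score, 4))  # 0.8778
```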
</section>
</section>
<section id="problem-setup-2-graded-relevance" class="level2">
<h2 class="anchored" data-anchor-id="problem-setup-2-graded-relevance">Problem Setup 2: Graded Relevance</h2>
<p>Let’s take another toy example where we annotated the items not just as relevant or not-relevant but instead used a grading scale from 0 to 5 where 0 denotes least relevant and 5 denotes the most relevant.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-graded-scale.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of graded relevance score"></p>
</figure>
</div>
<p>We have a ranking model that gives us back the 5 most relevant results for a certain query. The first item has a relevance score of 3 as per our ground-truth annotation, the second item has a relevance score of 2 and so on.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-graded-relevance.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Actual example of graded relevance"></p>
</figure>
</div>
<p>Let’s understand the various metrics to evaluate this type of setup.</p>
<section id="cumulative-gain-cgk" class="level3">
<h3 class="anchored" data-anchor-id="cumulative-gain-cgk">1. Cumulative Gain (CG@k)</h3>
<p>This metric simply sums up the relevance scores of the top-K items. The total score is called cumulative gain. Mathematically, this is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0ACG@k%20=%20%5Csum_%7Bi=1%7D%5E%7Bk%7D%20rel_%7Bi%7D%0A"></p>
<p>For our example, CG@2 will be 5 because we add the first two relevance scores 3 and 2.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-cumulative-gain-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Calculation of cumulative gain for 5 documents"></p>
</figure>
</div>
<p>Similarly, we can calculate the cumulative gain for all the K-values as:</p>
<table class="table-bordered caption-top table">
<colgroup>
<col style="width: 36%">
<col style="width: 8%">
<col style="width: 8%">
<col style="width: 12%">
<col style="width: 15%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Position(k)</th>
<th style="text-align: center;">1</th>
<th style="text-align: center;">2</th>
<th style="text-align: center;">3</th>
<th style="text-align: center;">4</th>
<th style="text-align: center;">5</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;"><strong>Cumulative Gain@k</strong></td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">3+2=5</td>
<td style="text-align: center;">3+2+3=8</td>
<td style="text-align: center;">3+2+3+0=8</td>
<td style="text-align: center;">3+2+3+0+1=9</td>
</tr>
</tbody>
</table>
<p>While simple, CG doesn’t take into account the order of the relevant items. So, even if we swap a less-relevant item to the first position, the CG@2 will be the same.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-cumulative-gain-drawback.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Drawback of cumulative gain due to ordering"></p>
</figure>
</div>
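<p>The sketch below reproduces the cumulative gain table and the drawback. The relevance list <code>[3, 2, 3, 0, 1]</code> comes from the running example. The swapped list, where the first two items trade places, is an assumed illustration of the drawback and may differ from the exact ordering in the figure.</p>

```python
def cumulative_gain_at_k(relevance, k):
    """Sum of the graded relevance scores of the top-k results."""
    return sum(relevance[:k])


graded_relevance = [3, 2, 3, 0, 1]
cg_scores = [cumulative_gain_at_k(graded_relevance, k) for k in range(1, 6)]
print(cg_scores)  # [3, 5, 8, 8, 9]

# Put the less relevant item first: the order changes but CG@2 does not.
swapped_relevance = [2, 3, 3, 0, 1]
print(cumulative_gain_at_k(graded_relevance, 2))   # 5
print(cumulative_gain_at_k(swapped_relevance, 2))  # 5
```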
</section>
<section id="discounted-cumulative-gain-dcgk" class="level3">
<h3 class="anchored" data-anchor-id="discounted-cumulative-gain-dcgk">2. Discounted Cumulative Gain (DCG@k)</h3>
<p>We saw how a simple cumulative gain doesn’t take into account the position. But, we would normally want items with a high relevance score to be present at a better rank.</p>
<p>Consider an example below. With the cumulative gain, we are simply adding the scores without taking into account their position.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-need-for-dcg.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Need for discounted cumulative gain"></p>
</figure>
</div>
<blockquote class="blockquote">
<p>An item with a relevance score of 3 at position 1 is better than the same item with relevance score 3 at position 2.</p>
</blockquote>
<p>So, we need some way to penalize the scores by their position. DCG introduces a log-based penalty function to reduce the relevance score at each position. For 5 items, the penalty would be</p>
<table class="table-bordered caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 81%">
</colgroup>
<thead>
<tr class="header">
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bi%7D"></th>
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Blog_%7B2%7D(i+1)%7D"></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(1+1)%20=%20log_%7B2%7D(2)%20=%201"></td>
</tr>
<tr class="even">
<td>2</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(2+1)%20=%20log_%7B2%7D(3)%20=%201.584"></td>
</tr>
<tr class="odd">
<td>3</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(3+1)%20=%20log_%7B2%7D(4)%20=%202"></td>
</tr>
<tr class="even">
<td>4</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(4+1)%20=%20log_%7B2%7D(5)%20=%202.321"></td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(5+1)%20=%20log_%7B2%7D(6)%20=%202.584"></td>
</tr>
</tbody>
</table>
<p>Using this penalty, we can now calculate the discounted cumulative gain simply by taking the sum of the <span class="bg-color-green">relevance score</span> <span class="bg-color-red">normalized by the penalty</span>. Mathematically, this is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0ADCG@k%20=%20%5Csum_%7Bi=1%7D%5E%7Bk%7D%20%5Cfrac%7B%20%5Ccolor%7B#81c784%7D%7Brel_%7Bi%7D%7D%20%7D%7B%20%5Ccolor%7B#e57373%7D%7Blog_%7B2%7D(i%20+%201)%7D%20%7D%0A"></p>
<p>To understand the behavior of the log-penalty, let’s plot ranking position in x-axis and the percentage of relevance score i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Blog_%7B2%7D(i+1)%7D%20*%20100"> in the y-axis. As seen, in position 1, we don’t apply any penalty and the score remains unchanged. But the percentage of the score kept drops steeply at first, from 100% in position 1 to 63% in position 2 and 50% in position 3, and then declines more slowly as the logarithm grows.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-penalty-plot.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Penalty on score based on position"></p>
</figure>
</div>
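<p>The percentages in the plot can be checked with a short calculation. This is only a sketch of the discount factor for the first five positions, and the variable name <code>kept_percentage</code> is illustrative.</p>

```python
import math

# Percentage of the relevance score kept at positions 1 to 5: 100 / log2(i + 1).
kept_percentage = [round(100 / math.log2(i + 1), 1) for i in range(1, 6)]
print(kept_percentage)  # [100.0, 63.1, 50.0, 43.1, 38.7]
```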
<p>Let’s now calculate DCG for our example.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-graded-relevance.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example calculation for DCG"></p>
</figure>
</div>
<table class="table-bordered caption-top table">
<colgroup>
<col style="width: 11%">
<col style="width: 18%">
<col style="width: 43%">
<col style="width: 27%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BPosition(i)%7D"></th>
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BRelevance(rel_%7Bi%7D)%7D"></th>
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Blog_%7B2%7D(i+1)%7D"></th>
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B%5Cfrac%7Brel_%7Bi%7D%7D%7Blog_%7B2%7D(i+1)%7D%7D"></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">1</td>
<td style="text-align: center;">3</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(1+1)%20=%20log_%7B2%7D(2)%20=%201"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B3%7D%7B1%7D%20=%203"></td>
</tr>
<tr class="even">
<td style="text-align: center;">2</td>
<td style="text-align: center;">2</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(2+1)%20=%20log_%7B2%7D(3)%20=%201.5849"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B2%7D%7B1.5849%7D%20=%201.2618"></td>
</tr>
<tr class="odd">
<td style="text-align: center;">3</td>
<td style="text-align: center;">3</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(3+1)%20=%20log_%7B2%7D(4)%20=%202"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B3%7D%7B2%7D%20=%201.5"></td>
</tr>
<tr class="even">
<td style="text-align: center;">4</td>
<td style="text-align: center;">0</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(4+1)%20=%20log_%7B2%7D(5)%20=%202.3219"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B0%7D%7B2.3219%7D%20=%200"></td>
</tr>
<tr class="odd">
<td style="text-align: center;">5</td>
<td style="text-align: center;">1</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(5+1)%20=%20log_%7B2%7D(6)%20=%202.5849"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B2.5849%7D%20=%200.3868"></td>
</tr>
</tbody>
</table>
<p>Based on these penalized scores, we can now calculate DCG at various k values simply by taking their sum up to k.</p>
<table class="table-bordered caption-top table">
<thead>
<tr class="header">
<th>k</th>
<th>DCG@k</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>DCG@1</td>
<td><img src="https://latex.codecogs.com/png.latex?3"></td>
</tr>
<tr class="even">
<td>DCG@2</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618=4.2618"></td>
</tr>
<tr class="odd">
<td>DCG@3</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618+1.5=5.7618"></td>
</tr>
<tr class="even">
<td>DCG@4</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618+1.5+0=5.7618"></td>
</tr>
<tr class="odd">
<td>DCG@5</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618+1.5+0+0.3868%20=%206.1486"></td>
</tr>
</tbody>
</table>
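<p>A minimal sketch of this calculation in Python is given below, with <code>dcg_at_k</code> as an illustrative function name. Because the code keeps full floating-point precision instead of rounding each intermediate term, the results can differ from a hand calculation in the fourth decimal place, so they are printed to three decimals here.</p>

```python
import math


def dcg_at_k(relevance, k):
    """Sum of the top-k relevance scores, each divided by log2(position + 1)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevance[:k], start=1)
    )


graded_relevance = [3, 2, 3, 0, 1]
dcg_scores = [dcg_at_k(graded_relevance, k) for k in range(1, 6)]

for k, score in enumerate(dcg_scores, start=1):
    print(f"DCG@{k} = {score:.3f}")
# DCG@1 = 3.000
# DCG@2 = 4.262
# DCG@3 = 5.762
# DCG@4 = 5.762
# DCG@5 = 6.149
```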
<p>There is also an alternative formulation for DCG@K that places a stronger emphasis on highly relevant items by using an exponential gain, so ranking such items lower costs more. This formulation is commonly used in industry.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0ADCG@k%20=%20%5Csum_%7Bi=1%7D%5E%7Bk%7D%20%5Cfrac%7B%20%5Ccolor%7B#81c784%7D%7B2%5E%7Brel_%7Bi%7D%7D%20-%201%7D%20%7D%7B%20%5Ccolor%7B#e57373%7D%7Blog_%7B2%7D(i%20+%201)%7D%20%7D%0A"></p>
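<p>To get a feel for the difference, the sketch below evaluates both formulations on the running example. The function names are hypothetical. With the exponential gain, an item with relevance 3 contributes a gain of 7 while an item with relevance 2 contributes only 3, so highly relevant items dominate the total.</p>

```python
import math


def dcg_linear(relevance, k):
    """DCG with the relevance score itself as the gain."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevance[:k], start=1)
    )


def dcg_exponential(relevance, k):
    """DCG with the gain 2^rel - 1, which amplifies highly relevant items."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 1)
        for position, rel in enumerate(relevance[:k], start=1)
    )


graded_relevance = [3, 2, 3, 0, 1]
linear_score = dcg_linear(graded_relevance, 5)
exponential_score = dcg_exponential(graded_relevance, 5)

print(f"{linear_score:.3f}")       # 6.149
print(f"{exponential_score:.3f}")  # 12.780
```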
<p>While DCG solves the issues with cumulative gain, it has a limitation. Suppose we have a query Q1 with 3 results and a query Q2 with 5 results. Then Q2, the query with 5 results, will tend to have a larger overall DCG score simply because more terms are summed. But we can’t say that query 2 was better than query 1.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-dcg-drawback.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Drawback of discounted cumulative gain"></p>
</figure>
</div>
</section>
<section id="normalized-discounted-cumulative-gain-ndcgk" class="level3">
<h3 class="anchored" data-anchor-id="normalized-discounted-cumulative-gain-ndcgk">3. Normalized Discounted Cumulative Gain (NDCG@k)</h3>
<p>To allow a comparison of DCG across queries, we can use NDCG that normalizes the DCG values using the ideal order of the relevant items.</p>
<p>Let’s take our previous example where we had already calculated the DCG values at various K values.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-graded-relevance.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example problem for NDCG"></p>
</figure>
</div>
<table class="table-bordered caption-top table">
<thead>
<tr class="header">
<th>k</th>
<th>DCG@k</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>DCG@1</td>
<td><img src="https://latex.codecogs.com/png.latex?3"></td>
</tr>
<tr class="even">
<td>DCG@2</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618=4.2618"></td>
</tr>
<tr class="odd">
<td>DCG@3</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618+1.5=5.7618"></td>
</tr>
<tr class="even">
<td>DCG@4</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618+1.5+0=5.7618"></td>
</tr>
<tr class="odd">
<td>DCG@5</td>
<td><img src="https://latex.codecogs.com/png.latex?3+1.2618+1.5+0+0.3868%20=%206.1486"></td>
</tr>
</tbody>
</table>
<p>For our example, ideally, we would have wanted the items to be sorted in descending order of relevance scores.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ltr-ndcg-ideal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Ideal order of search results"></p>
</figure>
</div>
<p>Let’s calculate the ideal DCG(IDCG) for this order.</p>
<table class="table-bordered caption-top table">
<colgroup>
<col style="width: 11%">
<col style="width: 18%">
<col style="width: 19%">
<col style="width: 27%">
<col style="width: 23%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BPosition(i)%7D"></th>
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BRelevance(rel_%7Bi%7D)%7D"></th>
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Blog_%7B2%7D(i+1)%7D"></th>
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B%5Cfrac%7Brel_%7Bi%7D%7D%7Blog_%7B2%7D(i+1)%7D%7D"></th>
<th>IDCG@k</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">1</td>
<td style="text-align: center;">3</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(2)%20=%201"></td>
<td>3 / 1 = 3</td>
<td>3</td>
</tr>
<tr class="even">
<td style="text-align: center;">2</td>
<td style="text-align: center;">3</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(3)%20=%201.5849"></td>
<td>3 / 1.5849 = 1.8927</td>
<td>3+1.8927=4.8927</td>
</tr>
<tr class="odd">
<td style="text-align: center;">3</td>
<td style="text-align: center;">2</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(4)%20=%202"></td>
<td>2 / 2 = 1</td>
<td>3+1.8927+1=5.8927</td>
</tr>
<tr class="even">
<td style="text-align: center;">4</td>
<td style="text-align: center;">1</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(5)%20=%202.3219"></td>
<td>1 / 2.3219 = 0.4306</td>
<td>3+1.8927+1+0.4306=6.3233</td>
</tr>
<tr class="odd">
<td style="text-align: center;">5</td>
<td style="text-align: center;">0</td>
<td><img src="https://latex.codecogs.com/png.latex?log_%7B2%7D(6)%20=%202.5849"></td>
<td>0 / 2.5849 = 0</td>
<td>3+1.8927+1+0.4306+0=6.3233</td>
</tr>
</tbody>
</table>
<p>Now we can calculate the NDCG@k for various k by dividing DCG@k by IDCG@k as shown below:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BNDCG@k%7D%20=%20%5Cfrac%7B%5Ctext%7BDCG@k%7D%7D%7B%5Ctext%7BIDCG@k%7D%7D%0A"></p>
<table class="table-bordered caption-top table">
<thead>
<tr class="header">
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?k"></th>
<th style="text-align: center;">DCG@k</th>
<th>IDCG@k</th>
<th>NDCG@k</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">1</td>
<td style="text-align: center;">3</td>
<td>3</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B3%7D%7B3%7D%20=%201"></td>
</tr>
<tr class="even">
<td style="text-align: center;">2</td>
<td style="text-align: center;">4.2618</td>
<td>4.8927</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B4.2618%7D%7B4.8927%7D%20=%200.8710"></td>
</tr>
<tr class="odd">
<td style="text-align: center;">3</td>
<td style="text-align: center;">5.7618</td>
<td>5.8927</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B5.7618%7D%7B5.8927%7D%20=%200.9777"></td>
</tr>
<tr class="even">
<td style="text-align: center;">4</td>
<td style="text-align: center;">5.7618</td>
<td>6.3233</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B5.7618%7D%7B6.3233%7D%20=%200.9112"></td>
</tr>
<tr class="odd">
<td style="text-align: center;">5</td>
<td style="text-align: center;">6.1486</td>
<td>6.3233</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B6.1486%7D%7B6.3233%7D%20=%200.9723"></td>
</tr>
</tbody>
</table>
<p>Thus, we get NDCG scores with a range between 0 and 1. A perfect ranking would get a score of 1. We can also compare NDCG@k scores of different queries since it’s a normalized score.</p>
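<p>Putting the pieces together, NDCG@k can be sketched as below. As in the walkthrough above, the ideal ordering is obtained by sorting the relevance scores of the returned results in descending order. This is an assumption of the sketch, since some definitions build the ideal ranking from all judged items for the query. The function names are illustrative.</p>

```python
import math


def dcg_at_k(relevance, k):
    """Sum of the top-k relevance scores, each divided by log2(position + 1)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevance[:k], start=1)
    )


def ndcg_at_k(relevance, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal_order = sorted(relevance, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevance, k) / ideal_dcg


graded_relevance = [3, 2, 3, 0, 1]
ndcg_scores = [round(ndcg_at_k(graded_relevance, k), 2) for k in range(1, 6)]
print(ndcg_scores)  # [1.0, 0.87, 0.98, 0.91, 0.97]
```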
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, we learned about various evaluation metrics for both binary and graded ground-truth labels and how each metric improves upon the previous.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-10.1145/582415.582418" class="csl-entry">
Kalervo Järvelin and Jaana Kekäläinen. 2002. <a href="https://doi.org/10.1145/582415.582418">Cumulated gain-based evaluation of IR techniques</a>. <em>ACM Trans. Inf. Syst.</em>, 20(4):422–446.
</div>
<div id="ref-wikipediadiscountedcumulativegain" class="csl-entry">
Wikipedia. <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Discounted cumulative gain</a>.
</div>
<div id="ref-wikipediaevaluationmeasuresinformationretrieval" class="csl-entry">
Wikipedia. <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)">Evaluation measures (information retrieval)</a>.
</div>
<div id="ref-wikipediameanreciprocalrank" class="csl-entry">
Wikipedia. <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean reciprocal rank</a>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {Evaluation {Metrics} {For} {Information} {Retrieval}},
  date = {2020-08-04},
  url = {https://amitness.com/posts/information-retrieval-evaluation.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-chaudhary2020" class="csl-entry quarto-appendix-citeas">
Amit Chaudhary. 2020. <a href="https://amitness.com/posts/information-retrieval-evaluation.html">Evaluation
Metrics For Information Retrieval</a>.
</div></div></section></div> ]]></description>
  <category>information-retrieval</category>
  <category>evals</category>
  <category>rag</category>
  <guid>https://amitness.com/posts/information-retrieval-evaluation</guid>
  <pubDate>Tue, 04 Aug 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/ltr-cover.png" medium="image" type="image/png" height="73" width="144"/>
</item>
<item>
  <title>Behavioral Testing of NLP models</title>
  <link>https://amitness.com/posts/behavioral-testing-nlp</link>
  <description><![CDATA[ 




<p>When developing an NLP model, it’s a standard practice to test how well a model generalizes to unseen examples by evaluating it on a held-out dataset. Suppose we reach our target performance metric of 95% on a held-out dataset and thus deploy the model to production based on this single metric.</p>
<p>But, when real users start using it, the story could be completely different from what our 95% performance metric suggested. Our model might perform poorly even on simple variations of the training text.</p>
<p>In contrast, the field of software engineering uses a suite of unit tests, integration tests, and end-to-end tests to evaluate all aspects of the product for failures. An application is deployed to production only after passing these rigorous tests.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/checklist-software-testing.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Different types of tests in software engineering"></p>
</figure>
</div>
<p><a href="https://arxiv.org/abs/2005.04118">Ribeiro et al.</a> noticed this gap and took inspiration from software engineering to propose an evaluation methodology for NLP called <strong>“CheckList”</strong>. Their paper won the best overall paper award at ACL 2020.</p>
<p>In this post, I will explain the overall concept of CheckList and the various components that it proposes for evaluating NLP models.</p>
<section id="behavioral-testing" class="level2">
<h2 class="anchored" data-anchor-id="behavioral-testing">Behavioral Testing</h2>
<p>To understand CheckList, let’s first understand behavioral testing in the context of software engineering.</p>
<p>Behavioral testing, also known as black-box testing, is a method where we test a piece of software based on its expected input and output. We don’t need access to the actual implementation details.</p>
<p>For example, let’s say you have a function that adds two numbers together.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add(a, b):</span>
<span id="cb1-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b</span></code></pre></div></div>
<p>We can evaluate this function by writing tests to compare its output to the expected answer. We are not concerned with how this function was implemented internally.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_add():</span>
<span id="cb2-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> add(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> add(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb2-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> add(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span></code></pre></div></div>
<p>Even for a simple function such as addition, there are capabilities that it should satisfy. For example, the addition of a number with zero should yield the original number itself.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 30%">
<col style="width: 20%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 30%">
</colgroup>
<thead>
<tr class="header">
<th>Capability</th>
<th>Function Signature</th>
<th>Output</th>
<th>Expected</th>
<th>Test Passed</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Two Positive Numbers</strong></td>
<td>add(1, 2)</td>
<td>3</td>
<td>3</td>
<td><span style="color:#4caf50; font-weight: bold;">Yes</span></td>
</tr>
<tr class="even">
<td><strong>No Change with Zero</strong></td>
<td>add(1, 0)</td>
<td>1</td>
<td>1</td>
<td><span style="color:#4caf50; font-weight: bold;">Yes</span></td>
</tr>
<tr class="odd">
<td><strong>Opposite Numbers</strong></td>
<td>add(-1, 1)</td>
<td>0</td>
<td>0</td>
<td><span style="color:#4caf50; font-weight: bold;">Yes</span></td>
</tr>
<tr class="even">
<td><strong>Two Negative Numbers</strong></td>
<td>add(-1, -1)</td>
<td>-2</td>
<td>-2</td>
<td><span style="color:#4caf50; font-weight: bold;">Yes</span></td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td></td>
<td><strong>Pass Rate</strong></td>
<td><span style="color:#4caf50; font-weight: bold;">4</span>/<span style="font-weight: bold;">4</span> = <span style="color:#4caf50; font-weight: bold;">100%</span></td>
</tr>
</tbody>
</table>
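<p>The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the <code>add</code> implementation, the <code>CASES</code> list and the <code>pass_rate</code> helper are hypothetical names introduced here for illustration.</p>

```python
# Hypothetical stand-in for the function under test.
def add(a, b):
    return a + b

# (capability, arguments, expected output), mirroring the table above.
CASES = [
    ("Two Positive Numbers", (1, 2), 3),
    ("No Change with Zero", (1, 0), 1),
    ("Opposite Numbers", (-1, 1), 0),
    ("Two Negative Numbers", (-1, -1), -2),
]

def pass_rate(func, cases):
    """Return the fraction of cases where func(*args) equals the expected value."""
    passed = sum(func(*args) == expected for _, args, expected in cases)
    return passed / len(cases)

rate = pass_rate(add, CASES)
print(f"Pass rate: {rate:.0%}")
```

<p>Because the helper only looks at inputs and outputs, any other implementation of <code>add</code> can be swapped in without touching the test cases.</p>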
</section>
<section id="checklist-framework" class="level2">
<h2 class="anchored" data-anchor-id="checklist-framework">CheckList Framework</h2>
<p>CheckList proposes a general framework for writing behavioral tests for any NLP model and task.</p>
<p>The core idea is based on a conceptual matrix that is composed of <span style="background-color: #e0f2f1;">linguistic capabilities</span> as rows and <span style="background-color: #efebe9;">test types</span> as columns. The intersecting cells contain multiple test examples generated from templates that we run and calculate the <span style="background-color: #ffebee;">failure rate</span> for.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 36%">
<col style="width: 21%">
<col style="width: 19%">
<col style="width: 22%">
</colgroup>
<thead>
<tr class="header">
<th><span style="text-decoration: underline; text-decoration-color: #4e91a5; font-weight: bold;">Capability</span> / <span style="text-decoration: underline; text-decoration-color: #a1887f;font-weight: bold;">Test</span></th>
<th><span style="text-decoration: underline; text-decoration-color: #a1887f;font-weight: bold;">Minimum Functionality Test(MFT)</span></th>
<th><span style="text-decoration: underline; text-decoration-color: #a1887f;font-weight: bold;">Invariance Test(INV)</span></th>
<th><span style="text-decoration: underline; text-decoration-color: #a1887f;font-weight: bold;">Directional Expectation Test(DIR)</span></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><span style="text-decoration: underline; text-decoration-color: #4e91a5; font-weight: bold;">VOCABULARY</span></td>
<td><span style="color: #e57373; font-weight: bold;">15.0%</span></td>
<td><span style="color: #e57373; font-weight: bold;">16.2%</span></td>
<td><span style="color: #e57373; font-weight: bold;">34.6%</span></td>
</tr>
<tr class="even">
<td><span style="text-decoration: underline; text-decoration-color: #4e91a5; font-weight: bold;">NER</span></td>
<td><span style="color: #e57373; font-weight: bold;">0.0%</span></td>
<td><span style="color: #e57373; font-weight: bold;">20.8%</span></td>
<td>-</td>
</tr>
<tr class="odd">
<td><span style="text-decoration: underline; text-decoration-color: #4e91a5; font-weight: bold;">NEGATION</span></td>
<td><span style="color: #e57373; font-weight: bold;">76.4%</span></td>
<td>-</td>
<td>-</td>
</tr>
<tr class="even">
<td><span style="color: #4e91a5; font-weight: bold;">…</span></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<p>By calculating the failure rates for various test types and capabilities, we can know exactly where our model is weak.</p>
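<p>As a rough sketch of how such a matrix could be computed, assume each test outcome is recorded as a (capability, test type, passed) tuple. The <code>results</code> data and the <code>failure_rates</code> helper below are made up for illustration and do not come from the paper.</p>

```python
from collections import defaultdict

# Hypothetical test outcomes: (capability, test type, passed?).
results = [
    ("Vocabulary", "MFT", True),
    ("Vocabulary", "MFT", False),
    ("Vocabulary", "INV", True),
    ("Negation", "MFT", False),
    ("Negation", "MFT", False),
    ("Negation", "MFT", True),
    ("Negation", "MFT", False),
]

def failure_rates(results):
    """Map each (capability, test type) cell to its failure rate in percent."""
    counts = defaultdict(lambda: [0, 0])  # cell: [failures, total]
    for capability, test_type, passed in results:
        cell = counts[(capability, test_type)]
        cell[0] += not passed
        cell[1] += 1
    return {cell: 100 * failed / total for cell, (failed, total) in counts.items()}

matrix = failure_rates(results)
for (capability, test_type), rate in sorted(matrix.items()):
    print(f"{capability:12} {test_type}: {rate:.1f}%")
```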
<p>Let’s understand each part of this conceptual matrix in detail now.</p>
<section id="test-types" class="level3">
<h3 class="anchored" data-anchor-id="test-types">1. Test Types</h3>
<p>These are the columns in the previous matrix. There are 3 types of tests proposed in the CheckList framework:</p>
<section id="a.-minimum-functionality-testmft" class="level4">
<h4 class="anchored" data-anchor-id="a.-minimum-functionality-testmft">a. Minimum Functionality Test(MFT)</h4>
<p>This test is similar to unit tests in software engineering. We build a collection of (text, expected label) pairs from scratch and test the model on this collection.</p>
<p>For example, we are testing the negation capability of the model using an MFT test below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/checklist-mft.png" class="img-fluid figure-img"></p>
<figcaption>Template: I <span style="color: #E57373;">{NEGATION}</span> <span style="color: #81C784;">{POS_VERB}</span> the <span style="color: #90A4AE;">{THING}</span></figcaption>
</figure>
</div>
<p>The goal of this test is to make sure the model is not taking any shortcuts and possesses linguistic capabilities.</p>
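<p>A minimal sketch of an MFT is shown below. The <code>keyword_model</code> is a deliberately naive classifier invented for this example: it takes the shortcut of looking only for positive verbs, so the negated sentences expose it.</p>

```python
POSITIVE_VERBS = {"love", "like"}

def keyword_model(text):
    """Toy classifier (an assumption): positive if any positive verb appears."""
    words = text.lower().split()
    return "positive" if any(w in POSITIVE_VERBS for w in words) else "negative"

def mft_failure_rate(model, examples):
    """Return the fraction of (text, expected label) pairs the model gets wrong."""
    failures = sum(model(text) != expected for text, expected in examples)
    return failures / len(examples)

# Hand-written (text, expected label) pairs, including negated positives.
examples = [
    ("I love the food", "positive"),
    ("I hate the flight", "negative"),
    ("I didn't love the food", "negative"),
    ("I can't say I like the flight", "negative"),
]

rate = mft_failure_rate(keyword_model, examples)
print(f"MFT failure rate: {rate:.0%}")
```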
</section>
<section id="b.-invariance-testinv" class="level4">
<h4 class="anchored" data-anchor-id="b.-invariance-testinv">b. Invariance Test(INV)</h4>
<p>In this test, we perturb our existing training examples in a way that should not change the label. The model is then tested on the perturbed examples and passes only if its prediction remains the same (i.e., invariant).</p>
<p>For example, changing the location from Chicago to Dallas should not change the original sentiment of a text.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/checklist-INV.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of invariance test"></p>
</figure>
</div>
<p>We can use different perturbation functions to test different capabilities. The paper mentions two examples:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 15%">
<col style="width: 42%">
<col style="width: 42%">
</colgroup>
<thead>
<tr class="header">
<th>Capability</th>
<th>Perturbation</th>
<th>Invariance</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>NER</td>
<td>Change location name in text</td>
<td>Should not change sentiment</td>
</tr>
<tr class="even">
<td>Robustness</td>
<td>Add typos to the text</td>
<td>Should not change prediction</td>
</tr>
</tbody>
</table>
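<p>The first row of this table could be sketched as follows. The <code>biased_model</code> is a toy model invented here with a spurious dependence on the location name, and <code>swap_location</code> and <code>run_inv</code> are hypothetical helper names.</p>

```python
def swap_location(text, old, new):
    """Perturbation: replace one location name with another."""
    return text.replace(old, new)

def biased_model(text):
    """Toy sentiment model (an assumption) that wrongly depends on the location."""
    if "safe" in text and "Dallas" not in text:
        return "positive"
    return "neutral"

def run_inv(model, texts, perturb):
    """Return the texts whose prediction changes after the perturbation."""
    return [t for t in texts if model(t) != model(perturb(t))]

texts = ["We had a safe travel to Chicago", "My bag is still in Chicago"]
inv_failures = run_inv(
    biased_model, texts, lambda t: swap_location(t, "Chicago", "Dallas")
)
print(f"INV failure rate: {len(inv_failures) / len(texts):.0%}")
```

<p>Note that an invariance test needs no labels: it only compares the prediction before and after the perturbation.</p>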
</section>
<section id="c.-directional-expectation-testdir" class="level4">
<h4 class="anchored" data-anchor-id="c.-directional-expectation-testdir">c.&nbsp;Directional Expectation Test(DIR)</h4>
<p>This test is similar to the invariance test, but here we expect the model’s prediction to change in a known direction after the perturbation.</p>
<p>For example, if we append “You are lame” to a text, we expect that its sentiment will not move in a positive direction.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/checklist-DIR.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of directional expectation test"></p>
</figure>
</div>
<p>We can also write tests where we expect the target label to change. For example, consider the QQP task where we need to detect if two questions are duplicates or not.</p>
<p>If we have a pair of duplicate questions and we change the location in one of the questions, then we expect the model to predict that they are not duplicates.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 3%">
<col style="width: 34%">
<col style="width: 32%">
<col style="width: 4%">
<col style="width: 24%">
</colgroup>
<thead>
<tr class="header">
<th>Capability</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Predicted</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>NER</td>
<td>How many people are there in <span style="color: #4e91a5;font-weight: bold">England</span>?</td>
<td>What is the population of <span style="color: #4e91a5;font-weight: bold">England</span>?</td>
<td>Duplicate</td>
<td>Duplicate</td>
</tr>
<tr class="even">
<td>NER</td>
<td>How many people are there in <span style="color: #4e91a5;font-weight: bold">England</span>?</td>
<td>What is the population of <span style="color: #a1887f;font-weight: bold">Turkey</span>?</td>
<td>Not Duplicate</td>
<td><span style="color: #E57373; font-weight: bold;">Duplicate</span></td>
</tr>
</tbody>
</table>
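<p>For the sentiment case, a directional test can be sketched by comparing scores before and after appending a negative phrase. The word lists, the <code>toy_score</code> function and the <code>tolerance</code> value are assumptions made for this example.</p>

```python
NEGATIVE_WORDS = {"sucks", "lame", "awful"}
POSITIVE_WORDS = {"brilliant", "great", "love"}

def toy_score(text):
    """Toy sentiment score (an assumption): positive minus negative word count."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives

def run_dir(score, texts, suffix, tolerance=0.1):
    """Return texts whose score rises by more than tolerance after adding the suffix."""
    return [t for t in texts if score(f"{t} {suffix}") - score(t) > tolerance]

texts = ["Your service sucks.", "AA45... JFK to LAS."]
dir_failures = run_dir(toy_score, texts, "You are lame.")
print(f"DIR failure rate: {len(dir_failures) / len(texts):.0%}")
```

<p>The small tolerance keeps the test from failing on negligible score fluctuations. It is a design choice of this sketch.</p>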
</section>
</section>
<section id="linguistic-capabilities" class="level3">
<h3 class="anchored" data-anchor-id="linguistic-capabilities">2. Linguistic Capabilities</h3>
<p>These are the rows in the CheckList matrix. Each row contains a specific linguistic capability that applies to most NLP tasks.</p>
<p>Let’s understand examples of capabilities given in the original paper. The authors provide a lot of examples to help us build a mental model of how to test new capabilities relevant to our task and domain.</p>
<section id="a.-vocabulary-and-pos" class="level4">
<h4 class="anchored" data-anchor-id="a.-vocabulary-and-pos">a. Vocabulary and POS</h4>
<p>We want to ensure the model has enough vocabulary knowledge, can distinguish words with different parts of speech, and understands how they affect the task at hand.</p>
<p>For example, the paper shows the 3 test types for a sentiment analysis task.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 4%">
<col style="width: 61%">
<col style="width: 8%">
<col style="width: 24%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>The company is Australian</td>
<td>neutral</td>
<td>neutral adjectives and nouns</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>That cabin crew is <span style="background-color: #e8f5e9;">extraordinary</span></td>
<td>positive</td>
<td>sentiment-laden adjectives</td>
</tr>
<tr class="odd">
<td>INV</td>
<td><span class="bg-color-red"><del>the</del></span> ⮕ <span style="background-color: #e8f5e9;">our</span> nightmare continues</td>
<td>no change</td>
<td>Replace neutral words with other neutral words</td>
</tr>
<tr class="even">
<td>DIR</td>
<td>AA45… JFK to LAS. <span class="bg-color-green">You are brilliant</span></td>
<td>move towards +ve</td>
<td>Add positive phrase to end</td>
</tr>
<tr class="odd">
<td>DIR</td>
<td>your service sucks. <span class="bg-color-red">You are lame</span></td>
<td>move towards -ve</td>
<td>Add negative phrase to end</td>
</tr>
</tbody>
</table>
<p>This can also be applied for the QQP task as shown below.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 6%">
<col style="width: 13%">
<col style="width: 47%">
<col style="width: 9%">
<col style="width: 23%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>Is John a teacher?</td>
<td>Is John <span class="bg-color-red">an accredited</span> teacher?</td>
<td>Not Duplicate</td>
<td>Modifiers change question intent</td>
</tr>
</tbody>
</table>
</section>
<section id="b.-named-entity-recognitionner" class="level4">
<h4 class="anchored" data-anchor-id="b.-named-entity-recognitionner">b. Named Entity Recognition(NER)</h4>
<p>This tests whether the model understands named entities and whether they matter for the current task.</p>
<p>We have examples of NER capability tests for sentiment analysis given below.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 4%">
<col style="width: 61%">
<col style="width: 4%">
<col style="width: 28%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>INV</td>
<td>We had a safe travel to <span class="bg-color-red"><del>Chicago</del></span> ⮕ <span class="bg-color-green">Dallas</span></td>
<td>no change</td>
<td>Switching locations should not change predictions</td>
</tr>
<tr class="even">
<td>INV</td>
<td><span class="bg-color-red"><del>Benjamin</del></span> ⮕ <span class="bg-color-green">Anna</span> was your savior</td>
<td>no change</td>
<td>Switching person names should not change predictions</td>
</tr>
</tbody>
</table>
<p>We can also apply this to the QQP task.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 2%">
<col style="width: 35%">
<col style="width: 37%">
<col style="width: 3%">
<col style="width: 19%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>INV</td>
<td>Why isn’t <span class="bg-color-red">Hillary Clinton</span> ⮕ <span class="bg-color-green">Nicole Perez</span> in jail?</td>
<td>Is <span class="bg-color-red">Hillary Clinton</span> ⮕ <span class="bg-color-green">Nicole Perez</span> going to go to jail?</td>
<td>Duplicate</td>
<td>Changing name in both questions</td>
</tr>
<tr class="even">
<td>DIR</td>
<td>Why isn’t Hillary Clinton in jail?</td>
<td>Is <span class="bg-color-red">Hillary Clinton</span> ⮕ <span class="bg-color-green">Nicole Perez</span> going to go to jail?</td>
<td>Not Duplicate</td>
<td>Changing name in only one question</td>
</tr>
<tr class="odd">
<td>DIR</td>
<td>Why<span class="bg-color-green">’s</span> Hillary Clinton <span class="bg-color-green">running</span>?</td>
<td>Is Hillary Clinton going to go to jail?</td>
<td>Not Duplicate</td>
<td>Keep first word and entities, replace everything else using RoBERTa</td>
</tr>
</tbody>
</table>
</section>
<section id="c.-temporal" class="level4">
<h4 class="anchored" data-anchor-id="c.-temporal">c.&nbsp;Temporal</h4>
<p>Here we want to test if the model understands the order of events in the text.</p>
<p>Below are examples of tests we can devise to evaluate this capability for a sentiment model.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 5%">
<col style="width: 52%">
<col style="width: 5%">
<col style="width: 35%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td><strong>I used to</strong> hate this airline, <strong>although now</strong> I like it</td>
<td>positive</td>
<td>sentiment change over time, the present should prevail</td>
</tr>
<tr class="even">
<td>MFT</td>
<td><strong>In the past I thought</strong> this airline was perfect, <strong>now I think</strong> it is creepy</td>
<td>negative</td>
<td>sentiment change over time, the present should prevail</td>
</tr>
</tbody>
</table>
<p>Similarly, we can devise temporal capability tests for QQP data as well.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 5%">
<col style="width: 34%">
<col style="width: 34%">
<col style="width: 7%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td><strong>Is</strong> Jordan Perry an advisor?</td>
<td><strong>Did</strong> Jordan Perry <strong>use to be</strong> an advisor?</td>
<td>Not duplicate</td>
<td>is != used to be</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>Is it unhealthy to eat <strong>after</strong> 10pm?</td>
<td>Is it unhealthy to eat <strong>before</strong> 10pm?</td>
<td>Not duplicate</td>
<td>before != after</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td>What was Danielle Bennett’s life <strong>before becoming</strong> an agent?</td>
<td>What was Danielle Bennett’s life <strong>after becoming</strong> an agent?</td>
<td>Not duplicate</td>
<td>before becoming != after becoming</td>
</tr>
</tbody>
</table>
</section>
<section id="d.-negation" class="level4">
<h4 class="anchored" data-anchor-id="d.-negation">d.&nbsp;Negation</h4>
<p>This ensures the model understands negation and its impact on the output.</p>
<p>Below are examples of tests we can devise to evaluate negation capabilities for a sentiment model.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 4%">
<col style="width: 66%">
<col style="width: 7%">
<col style="width: 22%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>The aircraft is <strong>not</strong> <span class="bg-color-red">bad</span></td>
<td>positive/neutral</td>
<td>negated negative</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>This aircraft is <strong>not</strong> <span class="bg-color-yellow">private</span></td>
<td>neutral</td>
<td>negated neutral</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td><span class="bg-color-red">I thought the plane would be awful</span>, <strong>but it wasn’t</strong></td>
<td>positive/neutral</td>
<td>negation of negative at end</td>
</tr>
<tr class="even">
<td>MFT</td>
<td><strong>I wouldn’t say</strong>, <span class="bg-color-yellow">given it’s a Tuesday</span>, <span class="bg-color-green">that this pilot was great</span></td>
<td>negative</td>
<td>negated positive with neutral content in middle</td>
</tr>
</tbody>
</table>
<p>Similarly, we can devise negation capability tests for QQP data as well.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 4%">
<col style="width: 17%">
<col style="width: 61%">
<col style="width: 6%">
<col style="width: 9%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>How can I become a positive person?</td>
<td>How can I become a person <span class="bg-color-red"><strong>who is not</strong></span> <span class="bg-color-green">positive</span>?</td>
<td>Not duplicate</td>
<td>simple negation</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>How can I become a positive person?</td>
<td>How can I become a person <span class="bg-color-red"><strong>who is not</strong></span> <span class="bg-color-red">negative</span>?</td>
<td>Duplicate</td>
<td>negation of antonym</td>
</tr>
</tbody>
</table>
</section>
<section id="e.-semantic-role-labelingsrl" class="level4">
<h4 class="anchored" data-anchor-id="e.-semantic-role-labelingsrl">e. Semantic Role Labeling(SRL)</h4>
<p>This ensures the model understands the agent and the object in the text.</p>
<p>Below are examples of tests we can devise to evaluate SRL capabilities for a sentiment model.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 5%">
<col style="width: 65%">
<col style="width: 4%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td><strong>Some people</strong> hate him, but <strong>I think</strong> <span class="bg-color-green">the pilot was fantastic</span></td>
<td>positive</td>
<td>Author sentiment more important than others</td>
</tr>
<tr class="even">
<td>MFT</td>
<td><span class="bg-color-green">Do I think the pilot was fantastic?</span> <span class="bg-color-green">Yes.</span></td>
<td>positive</td>
<td>parsing sentiment in (question, “yes”) form</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td><span class="bg-color-green">Do I think the pilot was fantastic?</span> <span class="bg-color-red">No.</span></td>
<td>negative</td>
<td>parsing sentiment in (question, “no”) form</td>
</tr>
</tbody>
</table>
<p>Similarly, we can devise SRL capability tests for QQP data as well.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 6%">
<col style="width: 27%">
<col style="width: 29%">
<col style="width: 8%">
<col style="width: 28%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>Are <strong>tigers</strong> heavier than <strong>insects</strong>?</td>
<td>What is heavier, <strong>insects</strong> or <strong>tigers</strong>?</td>
<td>Duplicate</td>
<td>Comparison</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>Is <strong>Anna</strong> related to <strong>Benjamin</strong>?</td>
<td>Is <strong>Benjamin</strong> related to <strong>Anna</strong>?</td>
<td>Duplicate</td>
<td>Symmetric relation</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td>Is <strong>Anna</strong> hurting <strong>Benjamin</strong>?</td>
<td>Is <strong>Benjamin</strong> hurting <strong>Anna</strong>?</td>
<td>Not Duplicate</td>
<td>Asymmetric relation</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>Does <strong>Anna</strong> love <strong>Benjamin</strong>?</td>
<td>Is <strong>Benjamin</strong> loved by <strong>Anna</strong>?</td>
<td>Duplicate</td>
<td>Active / passive swap, same semantics</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td>Does <strong>Anna</strong> support <strong>Benjamin</strong>?</td>
<td>Is <strong>Anna</strong> supported by <strong>Benjamin</strong>?</td>
<td>Not Duplicate</td>
<td>Active / passive swap, different semantics</td>
</tr>
</tbody>
</table>
</section>
<section id="f.-robustness" class="level4">
<h4 class="anchored" data-anchor-id="f.-robustness">f.&nbsp;Robustness</h4>
<p>This ensures that the model can handle small variations or perturbations to the input text such as typos and irrelevant changes.</p>
<p>Below are examples of tests we can devise to evaluate robustness capabilities for a sentiment model.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 5%">
<col style="width: 60%">
<col style="width: 5%">
<col style="width: 28%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>INV</td>
<td><span class="citation" data-cites="JetBlue">@JetBlue</span> no thanks <span class="bg-color-green"><span class="citation" data-cites="pi9QDK">@pi9QDK</span></span></td>
<td>no change</td>
<td>Add randomly generated URLs and handles to tweets</td>
</tr>
<tr class="even">
<td>INV</td>
<td><span class="citation" data-cites="SouthwestAir">@SouthwestAir</span> no <span class="bg-color-red"><del>thanks</del></span> ⮕ <span class="bg-color-green">thakns</span></td>
<td>no change</td>
<td>Swap one character with its neighbor (typo)</td>
</tr>
</tbody>
</table>
<p>Similarly, we can devise robustness capability tests for QQP data as well.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 3%">
<col style="width: 36%">
<col style="width: 46%">
<col style="width: 3%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>INV</td>
<td>Why am I <span class="bg-color-red"><del>getting</del></span> ⮕ <span class="bg-color-green">gettnig</span> lazy?</td>
<td>Why are we so lazy?</td>
<td>Duplicate</td>
<td>Swap one character with neighbor</td>
</tr>
<tr class="even">
<td>DIR</td>
<td>Can I gain weight from not eating enough?</td>
<td><span class="bg-color-red"><del>Can I</del></span> ⮕ <span class="bg-color-green">Do you think I can</span> gain weight from not eating enough?</td>
<td>Duplicate</td>
<td>Paraphrasing</td>
</tr>
</tbody>
</table>
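<p>The typo perturbation used in these robustness tests is easy to sketch in plain Python. The <code>add_typo</code> helper below is a hypothetical illustration, not the CheckList tool’s own implementation. The perturbed text can then be fed to the model and its prediction compared with the one for the original text.</p>

```python
import random

def add_typo(text, rng):
    """Swap one randomly chosen character with its right-hand neighbor."""
    if len(text) in (0, 1):
        return text  # nothing to swap
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(0)  # seeded so the perturbation is reproducible
original = "Why am I getting lazy?"
perturbed = add_typo(original, rng)
print(perturbed)
```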
</section>
<section id="g.-taxonomy" class="level4">
<h4 class="anchored" data-anchor-id="g.-taxonomy">g. Taxonomy</h4>
<p>This ensures that the model has an understanding of synonyms and antonyms and how they affect the task at hand.</p>
<p>Below are examples of tests we can devise to evaluate taxonomy capabilities for the QQP task.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 3%">
<col style="width: 35%">
<col style="width: 44%">
<col style="width: 3%">
<col style="width: 13%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td><strong>How can I become more</strong> <span class="bg-color-green">vocal</span>?</td>
<td><strong>How can I become more</strong> <span class="bg-color-green">outspoken</span>?</td>
<td>Duplicate</td>
<td>Synonyms in simple template</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>How can I become <span class="bg-color-green">more</span> <span class="bg-color-green">optimistic</span>?</td>
<td>How can I become <span class="bg-color-green">less</span> <span class="bg-color-red">pessimistic</span>?</td>
<td>Duplicate</td>
<td>More X = Less antonym(X)</td>
</tr>
<tr class="odd">
<td>INV</td>
<td>Is it necessary to follow a religion?</td>
<td>Is it necessary to follow an <span class="bg-color-red"><del>organized</del></span> ⮕ <span class="bg-color-green">organised</span> religion?</td>
<td>Duplicate</td>
<td>Replace words with synonyms in real pairs</td>
</tr>
</tbody>
</table>
</section>
<section id="h.-coreference-resolution" class="level4">
<h4 class="anchored" data-anchor-id="h.-coreference-resolution">h. Coreference Resolution</h4>
<p>This ensures that the model has an understanding of pronouns and what nouns they refer to.</p>
<p>Below are examples of tests we can devise to evaluate coreference capabilities for the QQP task.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 4%">
<col style="width: 36%">
<col style="width: 35%">
<col style="width: 5%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>If Anna and Benjamin were alone, do you think <strong>he</strong> would reject <strong>her</strong>?</td>
<td>If Anna and Benjamin were alone, do you think <strong>she</strong> would reject <strong>him</strong>?</td>
<td>Not Duplicate</td>
<td>Simple coreference: he != she</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>If Benjamin and Anna were married, do you think <strong>Anna’s family</strong> would be happy?</td>
<td>If Benjamin and Anna were married, do you think <strong>his family</strong> would be happy?</td>
<td>Not Duplicate</td>
<td>Simple resolved coreference, his and her</td>
</tr>
</tbody>
</table>
</section>
<section id="i.-logic" class="level4">
<h4 class="anchored" data-anchor-id="i.-logic">i. Logic</h4>
<p>This ensures that the model can handle symmetry, consistency, and conjunctions.</p>
<p>For example, in the QQP task, the order of the questions shouldn’t matter. If question 1 is a duplicate of question 2, then question 2 will also be a duplicate of question 1 by symmetry.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 6%">
<col style="width: 38%">
<col style="width: 38%">
<col style="width: 6%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Question 1</th>
<th>Question 2</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>INV</strong></td>
<td>Can I gain weight from not eating enough?</td>
<td>Do you think I can gain weight from not eating enough?</td>
<td>Duplicate</td>
<td>Original Order</td>
</tr>
<tr class="even">
<td></td>
<td>Do you think I can gain weight from not eating enough?</td>
<td>Can I gain weight from not eating enough?</td>
<td>Duplicate</td>
<td>Reversed Order</td>
</tr>
</tbody>
</table>
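<p>A symmetry check like this can be sketched by calling the model twice with the arguments swapped. Both toy duplicate detectors below are invented for illustration: the first one depends on argument order and therefore fails the test.</p>

```python
def order_sensitive_model(q1, q2):
    """Toy duplicate detector (an assumption): are all words of q1 found in q2?"""
    return set(q1.lower().split()).issubset(q2.lower().split())

def symmetric_model(q1, q2):
    """Toy duplicate detector (an assumption): do both questions use the same words?"""
    return set(q1.lower().split()) == set(q2.lower().split())

def symmetry_failures(model, pairs):
    """Return the pairs whose prediction changes when the questions are swapped."""
    return [(q1, q2) for q1, q2 in pairs if model(q1, q2) != model(q2, q1)]

pairs = [
    (
        "Can I gain weight from not eating enough?",
        "Do you think I can gain weight from not eating enough?",
    )
]
print(len(symmetry_failures(order_sensitive_model, pairs)), "symmetry failure(s)")
```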
</section>
<section id="j.-fairness" class="level4">
<h4 class="anchored" data-anchor-id="j.-fairness">j. Fairness</h4>
<p>This tests whether the model has picked up any form of bias towards a demographic group from the training data.</p>
<p>Below are examples of tests we can devise to evaluate the fairness of a sentiment model. The model prediction failures are for the BERT model as shown in the paper.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 9%">
<col style="width: 26%">
<col style="width: 44%">
<col style="width: 8%">
<col style="width: 11%">
</colgroup>
<thead>
<tr class="header">
<th>Test Type</th>
<th>Example</th>
<th>Predicted</th>
<th>Expected</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MFT</td>
<td>I am a <strong>black</strong> woman</td>
<td><span class="bg-color-red">Negative</span></td>
<td>Neutral</td>
<td>Racial Bias</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>I am an <strong>Asian</strong> woman</td>
<td><span class="bg-color-red">Positive</span></td>
<td>Neutral</td>
<td>Racial Bias</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td>I am a <strong>straight</strong> woman</td>
<td><span class="bg-color-red">Positive</span></td>
<td>Neutral</td>
<td>Sexual Orientation Bias</td>
</tr>
<tr class="even">
<td>MFT</td>
<td>I am a <strong>gay</strong> woman</td>
<td><span class="bg-color-red">Negative</span></td>
<td>Neutral</td>
<td>Sexual Orientation Bias</td>
</tr>
<tr class="odd">
<td>MFT</td>
<td>I am a <strong>lesbian</strong> woman</td>
<td><span class="bg-color-red">Negative</span></td>
<td>Neutral</td>
<td>Sexual Orientation Bias</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="test-generation" class="level3">
<h3 class="anchored" data-anchor-id="test-generation">3. Test Generation</h3>
<p>The paper’s authors have open-sourced a <a href="https://github.com/marcotcr/checklist">software tool</a> that can generate test cases at scale based on the ideas above.</p>
<p>The tool provides three approaches to write test cases:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 15%">
<col style="width: 31%">
<col style="width: 23%">
<col style="width: 29%">
</colgroup>
<thead>
<tr class="header">
<th>Approach</th>
<th>Idea</th>
<th>Advantage</th>
<th>Disadvantage</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Scratch</td>
<td>Write tests manually</td>
<td>High Quality</td>
<td>Low Coverage, Expensive, Time-consuming</td>
</tr>
<tr class="even">
<td>Perturbation Function</td>
<td>Apply perturbation to texts</td>
<td>Lots of Automated Tests</td>
<td>Low Quality</td>
</tr>
<tr class="odd">
<td>Template</td>
<td>Use templates and generate many variations</td>
<td>Balance of Quality and Quantity</td>
<td>Need to brainstorm Templates</td>
</tr>
</tbody>
</table>
<p>To generate templates, you can either brainstorm them from scratch or generalize patterns from your existing data.</p>
<section id="a.-manually-generated-templates" class="level4">
<h4 class="anchored" data-anchor-id="a.-manually-generated-templates">a. Manually Generated Templates</h4>
<p>For example, if we have a text such as “<em>I didn’t love the food</em>” in our training data, we can generalize it as:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Original Text</th>
<th>Generalized Template</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>I didn’t love the food</td>
<td>I {NEGATION} {POS_VERB} the {THING}</td>
</tr>
</tbody>
</table>
<p>Now, you can brainstorm possible fillers for the various template parts.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 36%">
<col style="width: 22%">
<col style="width: 40%">
</colgroup>
<thead>
<tr class="header">
<th>{NEGATION}</th>
<th>{POS_VERB}</th>
<th>{THING}</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>didn’t, can’t say I, …</td>
<td>love, like, …</td>
<td>food, flight, services, …</td>
</tr>
</tbody>
</table>
<p>By taking the Cartesian product of all these possibilities, we can generate a lot of test cases.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 13%">
<col style="width: 11%">
<col style="width: 41%">
<col style="width: 19%">
</colgroup>
<thead>
<tr class="header">
<th>{NEGATION}</th>
<th>{POS_VERB}</th>
<th>{THING}</th>
<th>Variation</th>
<th>Expected Label</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>didn’t</td>
<td>love</td>
<td>food</td>
<td>I didn’t <strong>love</strong> the food</td>
<td>Negative</td>
</tr>
<tr class="even">
<td>didn’t</td>
<td>like</td>
<td>food</td>
<td>I didn’t <strong>like</strong> the food</td>
<td>Negative</td>
</tr>
<tr class="odd">
<td>didn’t</td>
<td>love</td>
<td>flight</td>
<td>I didn’t love the <strong>flight</strong></td>
<td>Negative</td>
</tr>
<tr class="even">
<td>didn’t</td>
<td>love</td>
<td>services</td>
<td>I didn’t love the <strong>services</strong></td>
<td>Negative</td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td>…</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
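<p>The expansion above can be sketched in a few lines of plain Python. This is only an illustration of the Cartesian product idea and does not use the CheckList library's own API; the names <code>expand</code>, <code>fillers</code> and <code>test_cases</code> are hypothetical, and the fillers are the ones from the table.</p>

```python
from itertools import product

# Template and fillers copied from the tables above.
template = "I {NEGATION} {POS_VERB} the {THING}"
fillers = {
    "NEGATION": ["didn't", "can't say I"],
    "POS_VERB": ["love", "like"],
    "THING": ["food", "flight", "services"],
}


def expand(template, fillers):
    """Return every sentence obtained by filling all template slots."""
    keys = list(fillers)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(fillers[key] for key in keys))
    ]


# Every variation negates a positive verb, so the expected label is Negative.
test_cases = [(text, "Negative") for text in expand(template, fillers)]
```

<p>With 2 negations, 2 verbs and 3 things, this yields 12 test cases from a single template.</p>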
</section>
<section id="b.-masked-language-model-template" class="level4">
<h4 class="anchored" data-anchor-id="b.-masked-language-model-template">b. Masked Language Model Template</h4>
<p>Instead of manually specifying fill-ins for the template, we can also use masked language models (MLMs) like ROBERTA and use masking to generate variants.</p>
<p>For example, here we use ROBERTA to suggest words for the mask and then manually filter them into positive/negative/neutral.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 46%">
<col style="width: 28%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Template</th>
<th>ROBERTA Prediction</th>
<th>Manual Filtering</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>I really <strong>{mask}</strong> the flight</td>
<td>enjoyed</td>
<td>positive</td>
</tr>
<tr class="even">
<td></td>
<td>liked</td>
<td>positive</td>
</tr>
<tr class="odd">
<td></td>
<td>loved</td>
<td>positive</td>
</tr>
<tr class="even">
<td></td>
<td>regret</td>
<td>negative</td>
</tr>
<tr class="odd">
<td></td>
<td>…</td>
<td></td>
</tr>
</tbody>
</table>
<p>These fill-ins can be reused across multiple tests. The paper also suggests using WordNet to select only context-appropriate synonyms from ROBERTA.</p>
</section>
<section id="c.-built-in-fill-ins" class="level4">
<h4 class="anchored" data-anchor-id="c.-built-in-fill-ins">c.&nbsp;Built-in Fill-ins</h4>
<p>CheckList also provides out-of-box support for lexicons such as:</p>
<ul>
<li><strong>NER</strong>: common first/last names, cities and countries</li>
<li><strong>Protected Group Adjectives</strong>: Nationalities, Religions, Gender, Sexuality</li>
</ul>
</section>
<section id="d.-built-in-perturbations" class="level4">
<h4 class="anchored" data-anchor-id="d.-built-in-perturbations">d.&nbsp;Built-in Perturbations</h4>
<p>CheckList also provides perturbation functions such as character swaps, contractions, name and location changes, and neutral word replacement.</p>
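<p>To make the idea of a perturbation function concrete, here is a minimal sketch of a character swap written in plain Python. It is not CheckList's implementation; the function name <code>swap_adjacent_chars</code> and the seeded random generator are assumptions made for illustration.</p>

```python
import random


def swap_adjacent_chars(text, rng):
    """Return a copy of text with one random pair of adjacent characters swapped."""
    if len(text) in (0, 1):
        return text  # nothing to swap
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(0)  # seeded so the perturbation is reproducible
original = "I really enjoyed the flight"
perturbed = swap_adjacent_chars(original, rng)
```

<p>A typo like this should not change the sentiment, so the model's prediction for <code>perturbed</code> is expected to match its prediction for <code>original</code>.</p>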
</section>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, CheckList provides a general framework to perform a comprehensive and fine-grained evaluation of NLP models. This can help us better understand the state of NLP models beyond the leaderboard.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Marco Tulio Ribeiro et al., <a href="https://arxiv.org/abs/2005.04118">“Beyond Accuracy: Behavioral Testing of NLP models with CheckList”</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <category>evals</category>
  <guid>https://amitness.com/posts/behavioral-testing-nlp</guid>
  <pubDate>Tue, 28 Jul 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/checklist-cover.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>Semi-Supervised Learning in Computer Vision</title>
  <link>https://amitness.com/posts/semi-supervised-learning</link>
  <description><![CDATA[ 




<p>Semi-supervised learning methods for Computer Vision have been advancing quickly in the past few years. Current state-of-the-art methods are simplifying prior work in terms of architecture and loss function or introducing hybrid methods by blending different formulations.</p>
<p>In this post, I will illustrate the key ideas of these recent methods for semi-supervised learning through diagrams.</p>
<section id="self-training" class="level2">
<h2 class="anchored" data-anchor-id="self-training">1. Self-Training</h2>
<p>In this semi-supervised formulation, a model is trained on labeled data and used to predict pseudo-labels for the unlabeled data. The model is then trained on both ground truth labels and pseudo-labels simultaneously.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-self-training.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Idea of Self-Training"></p>
</figure>
</div>
<section id="a.-pseudo-label" class="level3">
<h3 class="anchored" data-anchor-id="a.-pseudo-label">a. Pseudo-label</h3>
<p><span class="citation" data-cites="Lee2013PseudoLabelT">Lee (2013)</span> proposed a very simple and efficient formulation called “Pseudo-label”.</p>
<p>The idea is to train a <span style="background-color: #e8f5e9;">model</span> simultaneously on a batch of both labeled and unlabeled images. The <span style="background-color: #e8f5e9;">model</span> is trained on labeled images in the usual supervised manner with a cross-entropy loss. The same model is used to get predictions for a batch of unlabeled images and the <span style="background-color: #fce4ec;">maximum confidence class</span> is used as the <span style="background-color: #f3e5f5;">pseudo-label</span>. Then, cross-entropy loss is calculated by comparing <span style="background-color: #e8f5e9;">model</span> predictions and the pseudo-label for the unlabeled images.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-pseudo-label.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Pseudo-Label for Semi-supervised Learning"></p>
</figure>
</div>
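<p>The unlabeled part of this loss can be sketched as follows. This is a framework-free illustration that operates on lists of class probabilities; the helper names are hypothetical and a real implementation would use a deep learning library.</p>

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy of a probability vector against a hard label."""
    return -math.log(probs[target_index])


def pseudo_label_loss(unlabeled_probs):
    """Mean cross-entropy between predictions and their own argmax pseudo-labels."""
    losses = []
    for probs in unlabeled_probs:
        # The maximum confidence class becomes the pseudo-label.
        pseudo_label = max(range(len(probs)), key=probs.__getitem__)
        losses.append(cross_entropy(probs, pseudo_label))
    return sum(losses) / len(losses)
```

<p>Minimizing this term pushes the model towards confident predictions on the unlabeled images.</p>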
<p>The total loss is a weighted sum of the labeled and unlabeled loss terms.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%20=%20L_%7Blabeled%7D%20+%20%5Calpha_%7Bt%7D%20*%20L_%7Bunlabeled%7D%0A"></p>
<p>To make sure the model has learned enough from the labeled data, the <img src="https://latex.codecogs.com/png.latex?%5Calpha_t"> term is set to 0 during the initial 100 training steps. It is then gradually increased up to 600 training steps and then kept constant.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-pseudolabel-alpha-increase.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Impact of alpha on semi-supervised loss"></p>
</figure>
</div>
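<p>A schedule of this shape could be written as below. The step boundaries 100 and 600 come from the text above; the linear ramp and the final value <code>alpha_final</code> are assumptions, since the post only says the weight is “gradually increased”.</p>

```python
def alpha_schedule(step, t1=100, t2=600, alpha_final=1.0):
    """Weight of the unlabeled loss: 0 until t1, then a ramp up to alpha_final at t2.

    alpha_final is a placeholder value and the ramp is assumed to be linear.
    """
    progress = (step - t1) / (t2 - t1)
    progress = min(max(progress, 0.0), 1.0)  # clamp to the range [0, 1]
    return alpha_final * progress


def total_loss(labeled_loss, unlabeled_loss, step):
    """L = L_labeled + alpha_t * L_unlabeled."""
    return labeled_loss + alpha_schedule(step) * unlabeled_loss
```

<p>Early in training the total loss is purely supervised, and the pseudo-label term is phased in only once the model's predictions are more trustworthy.</p>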
</section>
<section id="b.-noisy-student" class="level3">
<h3 class="anchored" data-anchor-id="b.-noisy-student">b. Noisy Student</h3>
<p><span class="citation" data-cites="xie2019selftraining">Xie et al. (2019b)</span> proposed a semi-supervised method inspired by Knowledge Distillation called “Noisy Student” in 2019.</p>
<p>The key idea is to train two separate models called <span style="background-color: #e8f5e9;">“Teacher”</span> and <span style="background: #fff3e0;">“Student”</span>. The <span style="background-color: #e8f5e9;">teacher model</span> is first trained on the labeled images and then used to infer the pseudo-labels for the unlabeled images. These pseudo-labels can either be kept as soft labels or converted to hard labels by <span style="background-color: #fce4ec;">taking the most confident class</span>. Then, the labeled and unlabeled images are combined and a <span style="background-color: #fff3e0;">student model</span> is trained on this combined data. The images are augmented using RandAugment as a form of input noise. Also, model noise such as Dropout and Stochastic Depth is incorporated in the student model architecture.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-noisy-student.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Noisy Student"></p>
</figure>
</div>
<p>Once a <span style="background-color: #fff3e0;">student model</span> is trained, it becomes the new <span style="background-color: #e8f5e9;">teacher</span> and this process is repeated for three iterations.</p>
</section>
</section>
<section id="consistency-regularization" class="level2">
<h2 class="anchored" data-anchor-id="consistency-regularization">2. Consistency Regularization</h2>
<p>This paradigm uses the idea that <span style="background-color: #e8f5e9;">model</span> predictions on an unlabeled image should remain the same even after adding noise. We could use input noise such as Image Augmentation and Gaussian noise. Noise can also be incorporated in the architecture itself using Dropout.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fixmatch-unlabeled-augment-concept.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Consistency Regularization Concept"></p>
</figure>
</div>
<section id="a.-π-model" class="level3">
<h3 class="anchored" data-anchor-id="a.-π-model">a. π-model</h3>
<p>This model was proposed by <span class="citation" data-cites="DBLP:conf/iclr/LaineA17">Laine and Aila (2017)</span> in a conference paper at ICLR 2017.</p>
<p>The key idea is to create two random augmentations of an image for both labeled and unlabeled data. Then, a <span style="background-color: #e8f5e9;">model with dropout</span> is used to predict the label of both these images. The <span style="background-color: #ede7f6;">squared difference</span> of these two <span style="background-color: #e3f2fd;">predictions</span> is used as a <span style="background-color: #ede7f6;">consistency loss</span>. For labeled images, we also calculate the <span style="background-color: #e0f2f1;">cross-entropy loss</span>. The total loss is a weighted sum of these two loss terms. A weight <span style="background-color: #eeeeee;">w(t)</span> is applied to decide how much the consistency loss contributes to the overall loss.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-pi-model.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="PI Model"></p>
</figure>
</div>
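<p>As a rough sketch, the loss for a single image could look like this. The predictions are plain lists of class probabilities that are assumed to come from two stochastic forward passes; the function names are hypothetical.</p>

```python
import math


def squared_difference(p, q):
    """Mean squared difference between two class-probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def pi_model_loss(pred_a, pred_b, w_t, label=None):
    """Weighted consistency term, plus cross-entropy when a label is available."""
    loss = w_t * squared_difference(pred_a, pred_b)
    if label is not None:
        loss += -math.log(pred_a[label])
    return loss
```

<p>Unlabeled images only contribute the consistency term, which is why <code>label</code> defaults to <code>None</code>.</p>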
</section>
<section id="b.-temporal-ensembling" class="level3">
<h3 class="anchored" data-anchor-id="b.-temporal-ensembling">b. Temporal Ensembling</h3>
<p>This method was also proposed by <span class="citation" data-cites="DBLP:conf/iclr/LaineA17">Laine and Aila (2017)</span> in the same paper as the π-model. It modifies the π-model by leveraging the <span style="background-color: #fff3e0;">Exponential Moving Average (EMA)</span> of predictions.</p>
<p>The key idea is to use the <span style="background-color: #fff3e0;">exponential moving average</span> of past predictions as one view. To get another view, we augment the image as usual and a <span style="background-color: #e8f5e9;">model with dropout</span> is used to predict the label. The <span style="background-color: #ede7f6;">squared difference</span> of the <span style="background-color: #e3f2fd;">current prediction</span> and the <span style="background-color: #fff3e0;">EMA prediction</span> is used as a <span style="background-color: #ede7f6;">consistency loss</span>. For labeled images, we also calculate the <span style="background-color: #e0f2f1;">cross-entropy loss</span>. The final loss is a weighted sum of these two loss terms. A weight <span style="background-color: #eeeeee;">w(t)</span> is applied to decide how much the consistency loss contributes to the overall loss.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-temporal-ensembling.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Temporal Ensembling"></p>
</figure>
</div>
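<p>The bookkeeping for the EMA of predictions might be sketched like this. The dictionary keyed by image id and the <code>momentum</code> value are assumptions for illustration, not values taken from this post.</p>

```python
def update_ema_prediction(ema_store, image_id, current_pred, momentum=0.6):
    """Blend the current prediction into the running average kept for one image.

    ema_store maps an image id to its EMA of past predictions.
    momentum is a placeholder; higher values give more weight to the past.
    """
    previous = ema_store.get(image_id, [0.0] * len(current_pred))
    ema_store[image_id] = [
        momentum * old + (1.0 - momentum) * new
        for old, new in zip(previous, current_pred)
    ]
    return ema_store[image_id]


ema_store = {}
first = update_ema_prediction(ema_store, "img-1", [1.0, 0.0], momentum=0.5)
second = update_ema_prediction(ema_store, "img-1", [1.0, 0.0], momentum=0.5)
```

<p>Because the stored average is reused as the second view, only one forward pass per image is needed in each training step.</p>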
</section>
<section id="c.-mean-teacher" class="level3">
<h3 class="anchored" data-anchor-id="c.-mean-teacher">c.&nbsp;Mean Teacher</h3>
<p>This method was proposed by <span class="citation" data-cites="tarvainen2017mean">Tarvainen and Valpola (2017)</span>. The general approach is similar to Temporal Ensembling but it uses the Exponential Moving Average (EMA) of the model parameters instead of predictions.</p>
<p>The key idea is to have two models called <span style="background-color: #e8f5e9;">“Student”</span> and <span style="background-color: #ffebee;">“Teacher”</span>. The <span style="background-color: #e8f5e9;">student</span> model is a regular model with dropout. The <span style="background-color: #ffebee;">teacher</span> model has the same architecture as the <span style="background-color: #e8f5e9;">student</span> model but its weights are set using an <span style="background-color: #ffebee;">exponential moving average</span> of the weights of the <span style="background-color: #e8f5e9;">student</span> model. For a labeled or unlabeled image, we create two random augmented versions of the image. Then, the <span style="background-color: #e8f5e9;">student</span> model is used to predict the <span style="background-color: #e3f2fd;">label distribution</span> for the first image. The <span style="background-color: #ffebee;">teacher</span> model is used to predict the <span style="background-color: #e3f2fd;">label distribution</span> for the second augmented image. The <span style="background-color: #ede7f6;">squared difference</span> of these two <span style="background-color: #e3f2fd;">predictions</span> is used as a <span style="background-color: #ede7f6;">consistency loss</span>. For labeled images, we also calculate the <span style="background-color: #e0f2f1;">cross-entropy loss</span>. The final loss is a weighted sum of these two loss terms. A weight <span style="background-color: #eeeeee;">w(t)</span> is applied to decide how much the consistency loss contributes to the overall loss.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-mean-teacher.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Mean Teacher"></p>
</figure>
</div>
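<p>The teacher update can be sketched as follows, treating the weights as a dictionary of named scalars for simplicity. The <code>decay</code> value is a placeholder and the function name is hypothetical.</p>

```python
def update_teacher(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step towards the matching student weight."""
    return {
        name: decay * teacher_weights[name] + (1.0 - decay) * student_weights[name]
        for name in teacher_weights
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
teacher = update_teacher(teacher, student, decay=0.5)
```

<p>Only the student is trained with gradient descent; the teacher is updated with this averaging step after each student update.</p>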
</section>
<section id="d.-virtual-adversarial-training" class="level3">
<h3 class="anchored" data-anchor-id="d.-virtual-adversarial-training">d.&nbsp;Virtual Adversarial Training</h3>
<p>This method was proposed by <span class="citation" data-cites="DBLP:journals/pami/MiyatoMKI19">Miyato et al. (2019)</span>. It uses the concept of adversarial attack for consistency regularization.</p>
<p>The key idea is to generate an adversarial transformation of an image that will change the model prediction. To do so, first, an image is taken and an adversarial variant of it is created such that the KL-divergence between the model output for the original image and the adversarial image is maximized.</p>
<p>Then, we proceed as in the previous methods. We take a labeled/unlabeled image as the first view and its adversarial example generated in the previous step as the second view. Then, the same <span style="background-color: #e8f5e9;">model</span> is used to predict <span style="background-color: #e3f2fd;">label distributions</span> for both images. The <span style="background-color: #ede7f6;">KL-divergence</span> of these two <span style="background-color: #e3f2fd;">predictions</span> is used as a <span style="background-color: #ede7f6;">consistency loss</span>. For labeled images, we also calculate the <span style="background-color: #e0f2f1;">cross-entropy loss</span>. The final loss is a weighted sum of these two loss terms. A weight <span style="background-color: #eeeeee;"><img src="https://latex.codecogs.com/png.latex?%5Calpha"></span> is applied to decide how much the consistency loss contributes to the overall loss.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-virtual-adversarial-training.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Virtual Adversarial Training"></p>
</figure>
</div>
</section>
<section id="e.-unsupervised-data-augmentation" class="level3">
<h3 class="anchored" data-anchor-id="e.-unsupervised-data-augmentation">e. Unsupervised Data Augmentation</h3>
<p>This method was proposed by <span class="citation" data-cites="xie2019unsupervised">Xie et al. (2019a)</span> and works for both images and text. Here, we will understand the method in the context of images.</p>
<p>The key idea is to create an augmented version of an unlabeled image using AutoAugment. Then, the same <span style="background-color: #e8f5e9;">model</span> is used to predict the label of both the original and the augmented image. The <span style="background-color: #ede7f6;">KL-divergence</span> of these two <span style="background-color: #e3f2fd;">predictions</span> is used as a <span style="background-color: #ede7f6;">consistency loss</span>. For labeled images, we only calculate the <span style="background-color: #e0f2f1;">cross-entropy loss</span> and don’t calculate any <span style="background-color: #ede7f6;">consistency loss</span>. The final loss is a weighted sum of these two loss terms. A weight <span style="background-color: #eeeeee;">w(t)</span> is applied to decide how much the consistency loss contributes to the overall loss.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-unsupervised-data-augmentation.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Unsupervised Data Augmentation"></p>
</figure>
</div>
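<p>The KL-divergence consistency term can be sketched as below. The direction of the divergence (prediction on the original image first, prediction on the augmented image second) and the small <code>eps</code> for numerical safety are assumptions of this sketch.</p>

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two class-probability vectors of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def unlabeled_consistency_loss(pred_original, pred_augmented, w_t):
    """Weighted KL-divergence between the two predictions for an unlabeled image."""
    return w_t * kl_divergence(pred_original, pred_augmented)
```

<p>The loss is zero when both predictions agree and grows as the prediction on the augmented image drifts away from the prediction on the original.</p>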
</section>
</section>
<section id="hybrid-methods" class="level2">
<h2 class="anchored" data-anchor-id="hybrid-methods">3. Hybrid Methods</h2>
<p>This paradigm combines ideas from previous work such as self-training and consistency regularization along with additional components for performance improvement.</p>
<section id="a.-mixmatch" class="level3">
<h3 class="anchored" data-anchor-id="a.-mixmatch">a. MixMatch</h3>
<p>This holistic method was proposed by <span class="citation" data-cites="berthelot2019mixmatch">Berthelot et al. (2019)</span>.</p>
<p>To understand this method, let’s take a walk through each of the steps.</p>
<ol type="i">
<li><p>For the labeled image, we create an augmentation of it. For the unlabeled image, we create K augmentations and get the model <span style="background-color: #ffebee;">predictions</span> on all K-images. Then, the <span style="background-color: #ffebee;">predictions</span> are <span style="background-color: #e0f7fa;">averaged</span> and <span style="background-color: #e3f2fd;">temperature scaling</span> is applied to get a final pseudo-label. This pseudo-label will be used for all the K-augmentations.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-mixmatch-part-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Preparing Pseudo-label in MixMatch"></p>
</figure>
</div></li>
<li><p>The batches of augmented labeled and unlabeled images are combined and the whole group is shuffled. Then, the first N images of this group are taken as <img src="https://latex.codecogs.com/png.latex?W_L">, and the remaining M images are taken as <img src="https://latex.codecogs.com/png.latex?W_U">.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-mixmatch-part-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Shuffling labeled and unlabeled images"></p>
</figure>
</div></li>
<li><p>Now, Mixup is applied between the augmented labeled batch and group <img src="https://latex.codecogs.com/png.latex?W_L">. Similarly, mixup is applied between the M augmented unlabeled group and the <img src="https://latex.codecogs.com/png.latex?W_U"> group. Thus, we get the final labeled and unlabeled group.<br>
<img src="https://amitness.com/posts/images/ssl-mixmatch-part-3.png" class="img-fluid quarto-figure quarto-figure-center" alt="Applying Mixup trick in MixMatch"></p></li>
<li><p>Now, for the labeled group, we take model predictions and compute <span style="background-color: #e0f2f1;">cross-entropy loss</span> with the ground truth mixup labels. Similarly, for the unlabeled group, we take model predictions and compute <span style="background-color: #ede7f6;">mean squared error (MSE) loss</span> with the mixup pseudo labels. A weighted sum of these two terms is taken, with <img src="https://latex.codecogs.com/png.latex?%5Clambda"> weighting the MSE loss.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/ssl-mixmatch-part-4.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="MixMatch overall pipeline"></p>
</figure>
</div></li>
</ol>
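<p>The pseudo-label preparation in step (i) and the Mixup interpolation in step (iii) can be sketched with plain lists. The temperature value and the mixing coefficient <code>lam</code> are placeholders, and the helper names are hypothetical.</p>

```python
def average(predictions):
    """Average K probability vectors element-wise."""
    k = len(predictions)
    return [sum(column) / k for column in zip(*predictions)]


def sharpen(probs, temperature=0.5):
    """Temperature scaling: raise to 1/T and renormalize (T below 1 sharpens)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def mixup(x1, x2, lam):
    """Convex combination of two inputs (or two label vectors)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


# Predictions for K = 2 augmentations of one unlabeled image.
avg = average([[0.6, 0.4], [0.8, 0.2]])
pseudo_label = sharpen(avg)
mixed_label = mixup([1.0, 0.0], [0.0, 1.0], lam=0.75)
```

<p>Sharpening lowers the entropy of the averaged prediction, while Mixup is applied to images and labels with the same coefficient so that the mixed label still describes the mixed image.</p>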
</section>
<section id="b.-fixmatch" class="level3">
<h3 class="anchored" data-anchor-id="b.-fixmatch">b. FixMatch</h3>
<p>This method was proposed by <span class="citation" data-cites="sohn2020fixmatch">Sohn et al. (2020)</span> and combines pseudo-labeling and consistency regularization while vastly simplifying the overall method. It achieved state-of-the-art results on a wide range of benchmarks.</p>
<p>As shown in the figure below, we train a supervised model on our labeled images with cross-entropy loss. For each unlabeled image, a <span style="background-color:#efdcd5">weak augmentation</span> and a <span style="background-color: #e8f5e9">strong augmentation</span> are applied to get two images. The <span style="background-color:#efdcd5;">weakly augmented image</span> is passed to our model and we get a prediction over classes. The probability for the most confident class is compared to a <span style="background-color: #fce4ec">threshold</span>. If it is above the <span style="background-color: #fce4ec;">threshold</span>, then we take that class as the ground truth label, i.e.&nbsp;the <span style="background-color: #f3e5f5;">pseudo-label</span>. Then, the <span style="background-color: #e8f5e9">strongly augmented</span> image is passed through our model to get a prediction over classes. This <span style="background-color: #e1f5fe;">prediction</span> is compared to the <span style="background-color: #f3e5f5;">pseudo-label</span> using cross-entropy loss. Both the losses are combined and the model is optimized.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fixmatch-pipeline.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Overall Architecture of FixMatch"></p>
</figure>
</div>
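<p>The unlabeled branch of this pipeline can be sketched for a single image as follows. The predictions are plain lists of class probabilities, and the default <code>threshold</code> is a placeholder rather than a value taken from this post.</p>

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy on the strong view, kept only for confident weak predictions."""
    confidence = max(weak_probs)
    if threshold > confidence:
        return 0.0  # not confident enough: the image is ignored in this step
    pseudo_label = weak_probs.index(confidence)
    return -math.log(strong_probs[pseudo_label])
```

<p>Early in training, few images pass the threshold, so the unlabeled loss is mostly zero and grows in importance as the model becomes more confident.</p>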
<p>If you want to learn more about FixMatch, I have an <a href="https://amitness.com/posts/fixmatch">article</a> that goes over it in depth.</p>
</section>
</section>
<section id="comparison-of-methods" class="level2">
<h2 class="anchored" data-anchor-id="comparison-of-methods">Comparison of Methods</h2>
<p>Here is a high-level summary of the differences between all the above-mentioned methods.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 44%">
<col style="width: 5%">
<col style="width: 17%">
<col style="width: 32%">
</colgroup>
<thead>
<tr class="header">
<th>Method Name</th>
<th>Year</th>
<th>Unlabeled Loss</th>
<th>Augmentation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Pseudo-label</td>
<td>2013</td>
<td>Cross-Entropy</td>
<td>Random</td>
</tr>
<tr class="even">
<td>π-model</td>
<td>2016</td>
<td>MSE</td>
<td>Random</td>
</tr>
<tr class="odd">
<td>Temporal Ensembling</td>
<td>2016</td>
<td>MSE</td>
<td>Random</td>
</tr>
<tr class="even">
<td>Mean Teacher</td>
<td>2017</td>
<td>MSE</td>
<td>Random</td>
</tr>
<tr class="odd">
<td>Virtual Adversarial Training(VAT)</td>
<td>2017</td>
<td>KL-divergence</td>
<td>Adversarial transformation</td>
</tr>
<tr class="even">
<td>Unsupervised Data Augmentation(UDA)</td>
<td>2019</td>
<td>KL-divergence</td>
<td>AutoAugment</td>
</tr>
<tr class="odd">
<td>MixMatch</td>
<td>2019</td>
<td>MSE</td>
<td>Random</td>
</tr>
<tr class="even">
<td>Noisy Student</td>
<td>2019</td>
<td>Cross-Entropy</td>
<td>RandAugment</td>
</tr>
<tr class="odd">
<td>FixMatch</td>
<td>2020</td>
<td>Cross-Entropy</td>
<td>CTAugment / RandAugment</td>
</tr>
</tbody>
</table>
</section>
<section id="common-evaluation-datasets" class="level2">
<h2 class="anchored" data-anchor-id="common-evaluation-datasets">Common Evaluation Datasets</h2>
<p>To evaluate the performance of these semi-supervised methods, the following datasets are commonly used. The authors simulate a low-data regime by using only a small portion (e.g.&nbsp;40/250/4000/10000 examples) of the whole dataset as labeled and treating the rest as the unlabeled set.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Dataset</th>
<th>Classes</th>
<th>Image Size</th>
<th>Train</th>
<th>Validation</th>
<th>Unlabeled</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10</a></td>
<td>10</td>
<td>32*32</td>
<td>50,000</td>
<td>10,000</td>
<td>-</td>
</tr>
<tr class="even">
<td><a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-100</a></td>
<td>100</td>
<td>32*32</td>
<td>50,000</td>
<td>10,000</td>
<td>-</td>
</tr>
<tr class="odd">
<td><a href="http://ai.stanford.edu/~acoates/stl10/">STL-10</a></td>
<td>10</td>
<td>96*96</td>
<td>5,000</td>
<td>8,000</td>
<td>100,000</td>
</tr>
<tr class="even">
<td><a href="http://ufldl.stanford.edu/housenumbers/">SVHN</a></td>
<td>10</td>
<td>32*32</td>
<td>73,257</td>
<td>26,032</td>
<td>531,131</td>
</tr>
<tr class="odd">
<td><a href="https://www.tensorflow.org/datasets/catalog/imagenet2012">ILSVRC-2012</a></td>
<td>1000</td>
<td>vary</td>
<td>1.2 million</td>
<td>150,000</td>
<td>150,000</td>
</tr>
</tbody>
</table>
<!-- Part 2: Classic methods
- S4L
- Ladder Network
- Bad GAN
- Interpolation Consistency Training(ICT) for SSL
- RealMix
- Stochastic Weight Averaging (SWA)
- EnAET
- Dual Student
- CC-GAN²
- Semi-supervised self-training of object detection models.  [pseudolabeling]
-->
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, we got an overview of how semi-supervised methods for Computer Vision have progressed over the years. This is a really important line of research that can have a direct impact on the industry.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-berthelot2019mixmatch" class="csl-entry">
David Berthelot, Nicholas Carlini, I. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019. MixMatch: A holistic approach to semi-supervised learning. <em>Neural Information Processing Systems</em>.
</div>
<div id="ref-DBLP:conf/iclr/LaineA17" class="csl-entry">
Samuli Laine and Timo Aila. 2017. <a href="https://openreview.net/forum?id=BJ6oOfqge">Temporal ensembling for semi-supervised learning</a>. In <em>5th international conference on learning representations, <span>ICLR</span> 2017, toulon, france, april 24-26, 2017, conference track proceedings</em>. OpenReview.net.
</div>
<div id="ref-Lee2013PseudoLabelT" class="csl-entry">
Dong-Hyun Lee. 2013. <a href="https://api.semanticscholar.org/CorpusID:18507866">Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks</a>. In
</div>
<div id="ref-DBLP:journals/pami/MiyatoMKI19" class="csl-entry">
Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2019. <a href="https://doi.org/10.1109/TPAMI.2018.2858821">Virtual adversarial training: <span>A</span> regularization method for supervised and semi-supervised learning</a>. <em><span>IEEE</span> Trans. Pattern Anal. Mach. Intell.</em>, 41(8):1979–1993.
</div>
<div id="ref-sohn2020fixmatch" class="csl-entry">
Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, E. D. Cubuk, Alexey Kurakin, Han Zhang, and Colin Raffel. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. <em>Neural Information Processing Systems</em>.
</div>
<div id="ref-tarvainen2017mean" class="csl-entry">
Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. <em>Neural Information Processing Systems</em>.
</div>
<div id="ref-xie2019unsupervised" class="csl-entry">
Qizhe Xie, Zihang Dai, E. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019a. Unsupervised data augmentation for consistency training. <em>Neural Information Processing Systems</em>.
</div>
<div id="ref-xie2019selftraining" class="csl-entry">
Qizhe Xie, E. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019b. <a href="https://doi.org/10.1109/cvpr42600.2020.01070">Self-training with noisy student improves ImageNet classification</a>. <em>Computer Vision and Pattern Recognition</em>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {Semi-Supervised {Learning} in {Computer} {Vision}},
  date = {2020-07-12},
  url = {https://amitness.com/posts/semi-supervised-learning.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-chaudhary2020" class="csl-entry quarto-appendix-citeas">
Amit Chaudhary. 2020. <a href="https://amitness.com/posts/semi-supervised-learning.html">Semi-Supervised
Learning in Computer Vision</a>.
</div></div></section></div> ]]></description>
  <category>semi-supervised-learning</category>
  <guid>https://amitness.com/posts/semi-supervised-learning</guid>
  <pubDate>Sun, 12 Jul 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/ssl-pseudo-label.png" medium="image" type="image/png" height="83" width="144"/>
</item>
<item>
  <title>FastAPI for Flask Users</title>
  <link>https://amitness.com/posts/fastapi-vs-flask</link>
  <description><![CDATA[ 




<p>While Flask has become the de facto choice for API development in Machine Learning projects, there is a new framework called FastAPI that has been getting a lot of community traction.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/flask-to-fastapi.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Flask and FastAPI Logo"></p>
</figure>
</div>
<p>I recently decided to give FastAPI a spin by porting a production Flask project. Coming from Flask, FastAPI was very easy to pick up, and I was able to get things up and running in just a few hours.</p>
<p>The added benefits of automatic data validation, documentation generation, and baked-in best practices such as Pydantic schemas and Python type hints make it a strong choice for future projects.</p>
<p>In this post, I will introduce FastAPI by contrasting the implementation of various common use-cases in both Flask and FastAPI.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="Version Info:">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Version Info:
</div>
</div>
<div class="callout-body-container callout-body">
<p>At the time of this writing, the Flask version is 1.1.2 and the FastAPI version is 0.58.1.</p>
</div>
</div>
<section id="installation" class="level2">
<h2 class="anchored" data-anchor-id="installation">Installation</h2>
<p>Both Flask and FastAPI are available on PyPI. With conda, Flask is available in the default channel, while FastAPI has to be installed from the <code>conda-forge</code> channel.</p>
<p><strong>Flask:</strong></p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install flask</span>
<span id="cb1-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">conda</span> install flask</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong></p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install fastapi uvicorn</span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">conda</span> install fastapi uvicorn <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-c</span> conda-forge</span></code></pre></div></div>
</div>
</section>
<section id="running-hello-world" class="level2">
<h2 class="anchored" data-anchor-id="running-hello-world">Running “Hello World”</h2>
<p><strong>Flask:</strong></p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask</span>
<span id="cb3-2"></span>
<span id="cb3-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb3-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> home():</span>
<span id="cb3-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hello'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'world'</span>}</span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'__main__'</span>:</span>
<span id="cb3-10">    app.run()</span></code></pre></div></div>
</div>
<p>Now you can run the development server using the command below. It runs on port 5000 by default.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> app.py</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong></p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> uvicorn</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI</span>
<span id="cb5-3"></span>
<span id="cb5-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb5-5"></span>
<span id="cb5-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb5-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> home():</span>
<span id="cb5-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hello'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'world'</span>}</span>
<span id="cb5-9"></span>
<span id="cb5-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'__main__'</span>:</span>
<span id="cb5-11">    uvicorn.run(app)</span></code></pre></div></div>
</div>
<p>FastAPI defers serving to a production-ready server called <code>uvicorn</code>. We can run it in development mode with a default port of 8000.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> app.py</span></code></pre></div></div>
</div>
</section>
<section id="production-server" class="level2">
<h2 class="anchored" data-anchor-id="production-server">Production server</h2>
<p><strong>Flask:</strong></p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask</span>
<span id="cb7-2"></span>
<span id="cb7-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb7-4"></span>
<span id="cb7-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb7-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> home():</span>
<span id="cb7-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hello'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'world'</span>}</span>
<span id="cb7-8"></span>
<span id="cb7-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'__main__'</span>:</span>
<span id="cb7-10">    app.run()</span></code></pre></div></div>
</div>
<p>For a production server, <code>gunicorn</code> is a common choice in Flask.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gunicorn</span> app:app</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong></p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> uvicorn</span>
<span id="cb9-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI</span>
<span id="cb9-3"></span>
<span id="cb9-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb9-5"></span>
<span id="cb9-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb9-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> home():</span>
<span id="cb9-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hello'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'world'</span>}</span>
<span id="cb9-9"></span>
<span id="cb9-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'__main__'</span>:</span>
<span id="cb9-11">    uvicorn.run(app)</span></code></pre></div></div>
</div>
<p>FastAPI defers serving to a production-ready server called <a href="https://www.uvicorn.org/settings/">uvicorn</a>. We can start the server as:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb10-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">uvicorn</span> app:app</span></code></pre></div></div>
</div>
<p>You can also start it in hot-reload mode by running:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb11-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">uvicorn</span> app:app <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--reload</span></span></code></pre></div></div>
</div>
<p>You can also change the port.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb12-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">uvicorn</span> app:app <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> 5000</span></code></pre></div></div>
</div>
<p>The number of workers can be controlled as well.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb13-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">uvicorn</span> app:app <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--workers</span> 2</span></code></pre></div></div>
</div>
<p>You can also use <code>gunicorn</code> to manage uvicorn with the following command. All regular gunicorn flags, such as the number of workers (<code>-w</code>), work.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb14-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gunicorn</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-k</span> uvicorn.workers.UvicornWorker app:app</span></code></pre></div></div>
</div>
</section>
<section id="http-methods" class="level2">
<h2 class="anchored" data-anchor-id="http-methods">HTTP Methods</h2>
<p><strong>Flask:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>, methods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'POST'</span>])</span>
<span id="cb15-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> example():</span>
<span id="cb15-3">    ...</span></code></pre></div></div>
<p><strong>FastAPI:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb16-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> example():</span>
<span id="cb16-3">    ...</span></code></pre></div></div>
<p>FastAPI provides a separate decorator for each HTTP method.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb17-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.put</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb17-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.patch</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb17-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.delete</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span></code></pre></div></div>
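<p>To see why per-method decorators are convenient, here is a minimal standard-library sketch of decorator-based route registration. The <code>MiniRouter</code> class and its methods are hypothetical names made up for illustration. This is not how FastAPI is actually implemented, only the general idea of mapping a method and path to a handler.</p>

```python
# Hypothetical mini-router: illustrates per-method decorators,
# not FastAPI's real internals.
class MiniRouter:
    def __init__(self):
        # Maps (HTTP method, path) to the handler function.
        self.routes = {}

    def _register(self, method, path):
        def decorator(func):
            self.routes[(method, path)] = func
            return func  # return the function unchanged so it stays callable
        return decorator

    def get(self, path):
        return self._register('GET', path)

    def post(self, path):
        return self._register('POST', path)

    def dispatch(self, method, path):
        # Look up the handler registered for this method and path.
        handler = self.routes.get((method, path))
        if handler is None:
            return {'detail': 'Not Found'}
        return handler()


router = MiniRouter()

@router.get('/')
def read_root():
    return {'hello': 'world'}

@router.post('/')
def create_root():
    return {'created': True}

print(router.dispatch('GET', '/'))
print(router.dispatch('POST', '/'))
```

<p>The same path can carry different handlers for different methods, which is what the Flask <code>methods=[...]</code> argument expresses in a single decorator.</p>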
</section>
<section id="url-variables" class="level2">
<h2 class="anchored" data-anchor-id="url-variables">URL Variables</h2>
<p>Suppose we want to read the user id from the URL, e.g.&nbsp;<code>/users/1</code>, and return it in the response.</p>
<p><strong>Flask:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/users/&lt;int:user_id&gt;'</span>)</span>
<span id="cb18-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_user_details(user_id):</span>
<span id="cb18-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'user_id'</span>: user_id}</span></code></pre></div></div>
<p><strong>FastAPI:</strong></p>
<p>In FastAPI, we use Python type hints to specify all the data types. For example, here we specify that <code>user_id</code> should be an integer. The variable in the URL path is written in curly braces, similar to f-strings.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/users/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{user_id}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb19-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_user_details(user_id: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>):</span>
<span id="cb19-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'user_id'</span>: user_id}</span></code></pre></div></div>
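<p>One way to picture what the type hint does is the rough standard-library sketch below. It pulls the raw string out of the path and converts it using the handler's annotation. The helpers <code>match_path</code> and <code>call_with_converted_params</code> are hypothetical names for illustration and are not part of FastAPI, which does considerably more (validation errors, documentation, and so on).</p>

```python
import inspect

def match_path(template, path):
    """Return the raw string for each {name} segment, or None if no match."""
    template_parts = template.strip('/').split('/')
    path_parts = path.strip('/').split('/')
    if len(template_parts) != len(path_parts):
        return None
    params = {}
    for expected, actual in zip(template_parts, path_parts):
        if expected.startswith('{') and expected.endswith('}'):
            params[expected[1:-1]] = actual  # strip the curly braces
        elif expected != actual:
            return None
    return params

def call_with_converted_params(func, raw_params):
    """Convert each raw string with the handler's type hint, then call it."""
    signature = inspect.signature(func)
    converted = {}
    for name, value in raw_params.items():
        annotation = signature.parameters[name].annotation
        # Parameters without a type hint are passed through as strings.
        if annotation is inspect.Parameter.empty:
            converted[name] = value
        else:
            converted[name] = annotation(value)
    return func(**converted)

def get_user_details(user_id: int):
    return {'user_id': user_id}

raw = match_path('/users/{user_id}', '/users/1')
print(raw)                                                # {'user_id': '1'}
print(call_with_converted_params(get_user_details, raw))  # {'user_id': 1}
```

<p>In this sketch a value such as <code>/users/abc</code> would raise a plain <code>ValueError</code>, whereas FastAPI responds with a validation error instead.</p>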
</section>
<section id="query-strings" class="level2">
<h2 class="anchored" data-anchor-id="query-strings">Query Strings</h2>
<p>We want to allow the user to specify a search term by using a query string <code>?q=abc</code> in the URL.</p>
<p><strong>Flask:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> request</span>
<span id="cb20-2"></span>
<span id="cb20-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/search'</span>)</span>
<span id="cb20-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> search():</span>
<span id="cb20-5">    query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> request.args.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'q'</span>)</span>
<span id="cb20-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'query'</span>: query}</span></code></pre></div></div>
<p><strong>FastAPI:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/search'</span>)</span>
<span id="cb21-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> search(q: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb21-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'query'</span>: q}</span></code></pre></div></div>
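<p>For reference, the query string itself is plain text that gets parsed into key-value pairs. The sketch below shows that step using only the standard library. <code>get_query_param</code> is a hypothetical helper, not an API from either framework.</p>

```python
from urllib.parse import urlsplit, parse_qs

def get_query_param(url, name, default=None):
    """Return the first value of a query-string parameter, or a default."""
    query = urlsplit(url).query          # 'q=abc' for '/search?q=abc'
    values = parse_qs(query).get(name)   # parse_qs maps each key to a list
    return values[0] if values else default

print(get_query_param('/search?q=abc', 'q'))   # abc
print(get_query_param('/search', 'q'))         # None
```

<p>Note one difference between the two snippets above. In the Flask version, <code>request.args.get('q')</code> returns <code>None</code> when the parameter is missing. In the FastAPI version, <code>q: str</code> has no default, so the parameter is required and a request without it is rejected with a validation error. Giving the parameter a default value, for example <code>q: Optional[str] = None</code>, makes it optional.</p>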
</section>
<section id="json-post-request" class="level2">
<h2 class="anchored" data-anchor-id="json-post-request">JSON POST Request</h2>
<p>Let’s take a toy example where we want to send a JSON POST request with a <code>text</code> key and get back a lowercased version.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb22-1"><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">#</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">Request</span></span>
<span id="cb22-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"text"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HELLO"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb22-3"></span>
<span id="cb22-4"><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">#</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">Response</span></span>
<span id="cb22-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"text"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hello"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p><strong>Flask:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> request</span>
<span id="cb23-2"></span>
<span id="cb23-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/lowercase'</span>, methods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'POST'</span>])</span>
<span id="cb23-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> lower_case():</span>
<span id="cb23-5">    text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> request.json.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>)</span>
<span id="cb23-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>: text.lower()}</span></code></pre></div></div>
<p><strong>FastAPI:</strong><br>
If you simply replicate the functionality from Flask, you can do it as follows in FastAPI.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Dict</span>
<span id="cb24-2"></span>
<span id="cb24-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/lowercase'</span>)</span>
<span id="cb24-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> lower_case(json_data: Dict):</span>
<span id="cb24-5">    text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> json_data.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>)</span>
<span id="cb24-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>: text.lower()}</span></code></pre></div></div>
<p>However, this is where FastAPI introduces a new concept: a Pydantic schema that maps to the JSON data being received. We can refactor the above example using Pydantic as:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb25-2"></span>
<span id="cb25-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Sentence(BaseModel):</span>
<span id="cb25-4">    text: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span>
<span id="cb25-5"></span>
<span id="cb25-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/lowercase'</span>)</span>
<span id="cb25-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> lower_case(sentence: Sentence):</span>
<span id="cb25-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>: sentence.text.lower()}</span></code></pre></div></div>
<p>As seen, instead of getting a dictionary, the JSON data is converted into an object of the schema <code>Sentence</code>. As such, we can access the data through attributes such as <code>sentence.text</code>. This also provides automatic validation of data types. If the user sends data that cannot be validated as a string, such as <code>null</code>, they will receive an auto-generated validation error.</p>
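<p>To make the validation step more concrete, here is a rough standard-library sketch of the idea, using a dataclass in place of a Pydantic model. The <code>validate</code> function and the shape of its error dictionaries are assumptions made for illustration. Real Pydantic validation is much richer and, depending on the version, may also coerce some values rather than reject them.</p>

```python
from dataclasses import dataclass, fields

@dataclass
class Sentence:
    text: str

def validate(schema, payload):
    """Return (instance, errors) for a dict payload.

    Hypothetical helper: the error format is illustrative only.
    """
    errors = []
    for field in fields(schema):
        value = payload.get(field.name)
        # Strict check: the value must already be an instance of the hinted type.
        if not isinstance(value, field.type):
            errors.append({
                'loc': ['body', field.name],
                'msg': f'expected {field.type.__name__}',
            })
    if errors:
        return None, errors
    # Only pass the declared fields so extra keys are ignored.
    known = {field.name: payload[field.name] for field in fields(schema)}
    return schema(**known), []

sentence, errors = validate(Sentence, {'text': 'HELLO'})
print(sentence.text.lower())   # hello

invalid, errors = validate(Sentence, {'text': None})
print(errors)
```

<p>The handler only ever sees a fully validated object, so it can use <code>sentence.text</code> without defensive checks.</p>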
<p><strong>Example Invalid Request</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb26-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"text"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">null</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p><strong>Automatic Response</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb27-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb27-2">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"detail"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb27-3">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb27-4">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"loc"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb27-5">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"body"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-6">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span></span>
<span id="cb27-7">            <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-8">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"msg"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none is not an allowed value"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-9">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type_error.none.not_allowed"</span></span>
<span id="cb27-10">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb27-11">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb27-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
</section>
<section id="file-upload" class="level2">
<h2 class="anchored" data-anchor-id="file-upload">File Upload</h2>
<p>Let’s create an API to return the uploaded file name. The key used when uploading the file will be <code>file</code>.</p>
<p><strong>Flask</strong><br>
Flask allows accessing the uploaded file via the request object.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb28" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"></span>
<span id="cb28-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask, request</span>
<span id="cb28-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb28-4"></span>
<span id="cb28-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/upload'</span>, methods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'POST'</span>])</span>
<span id="cb28-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> upload_file():</span>
<span id="cb28-7">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> request.files.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'file'</span>)</span>
<span id="cb28-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>.filename}</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong><br>
FastAPI uses a function parameter to specify the file key.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb29" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI, UploadFile, File</span>
<span id="cb29-2"></span>
<span id="cb29-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb29-4"></span>
<span id="cb29-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/upload'</span>)</span>
<span id="cb29-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> upload_file(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>: UploadFile <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> File(...)):</span>
<span id="cb29-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>.filename}</span></code></pre></div></div>
</div>
</section>
<section id="form-submission" class="level2">
<h2 class="anchored" data-anchor-id="form-submission">Form Submission</h2>
<p>We want to access a text form field that’s defined as shown below and echo the value.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode html code-with-copy"><code class="sourceCode html"><span id="cb30-1"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">input</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;"> name</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'city'</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;"> type</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p><strong>Flask</strong><br>
Flask allows accessing the form fields via the request object.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb31" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"></span>
<span id="cb31-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask, request</span>
<span id="cb31-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb31-4"></span>
<span id="cb31-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/submit'</span>, methods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'POST'</span>])</span>
<span id="cb31-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> echo():</span>
<span id="cb31-7">    city <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> request.form.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'city'</span>)</span>
<span id="cb31-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'city'</span>: city}</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong><br>
We use a function parameter to define the key and data type for the form field.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb32" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI, Form</span>
<span id="cb32-2">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb32-3"></span>
<span id="cb32-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/submit'</span>)</span>
<span id="cb32-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> echo(city: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Form(...)):</span>
<span id="cb32-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'city'</span>: city}</span></code></pre></div></div>
</div>
<p>We can also make the form field optional as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Optional</span>
<span id="cb33-2"></span>
<span id="cb33-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/submit'</span>)</span>
<span id="cb33-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> echo(city: Optional[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Form(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)):</span>
<span id="cb33-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'city'</span>: city}</span></code></pre></div></div>
<p>Similarly, we can set a default value for the form field as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/submit'</span>)</span>
<span id="cb34-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> echo(city: Optional[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Form(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Paris'</span>)):</span>
<span id="cb34-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'city'</span>: city}</span></code></pre></div></div>
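<p>To make the required, optional, and default-value cases concrete, here is a minimal sketch in plain Python. It assumes the form is sent with the browser's default <code>application/x-www-form-urlencoded</code> encoding. <code>read_city</code> is a hypothetical helper for illustration and not part of Flask or FastAPI.</p>

```python
from urllib.parse import parse_qs


def read_city(body, default=None):
    """Decode a urlencoded form body and return the 'city' field."""
    fields = parse_qs(body)  # maps each key to a list of values
    values = fields.get('city')
    # Fall back to the default when the field is missing or blank.
    return values[0] if values else default


print(read_city('city=Kathmandu'))
print(read_city('', default='Paris'))
```

<p>With <code>default=None</code> the helper behaves like the optional field, and with <code>default='Paris'</code> it behaves like the field with a default value.</p>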
</section>
<section id="cookies" class="level2">
<h2 class="anchored" data-anchor-id="cookies">Cookies</h2>
<p>We want to access a cookie called <code>name</code> from the request.</p>
<p><strong>Flask</strong><br>
Flask allows accessing the cookies via the request object.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb35" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"></span>
<span id="cb35-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask, request</span>
<span id="cb35-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb35-4"></span>
<span id="cb35-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/profile'</span>)</span>
<span id="cb35-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> profile():</span>
<span id="cb35-7">    name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> request.cookies.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>)</span>
<span id="cb35-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: name}</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong><br>
We use a function parameter to define the key for the cookie.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb36" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI, Cookie</span>
<span id="cb36-2">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb36-3"></span>
<span id="cb36-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/profile'</span>)</span>
<span id="cb36-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> profile(name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Cookie(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)):</span>
<span id="cb36-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: name}</span></code></pre></div></div>
</div>
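<p>Both frameworks ultimately read this value from the request's <code>Cookie</code> header. As a rough sketch of that step, the standard library's <code>http.cookies.SimpleCookie</code> can parse a raw header. <code>read_cookie</code> and the sample header are hypothetical and only serve as an illustration. This is not how either framework is implemented internally.</p>

```python
from http.cookies import SimpleCookie


def read_cookie(header, key):
    """Return the value of one cookie from a raw Cookie header, or None."""
    jar = SimpleCookie()
    jar.load(header)  # parses 'k1=v1; k2=v2' pairs
    morsel = jar.get(key)
    return morsel.value if morsel is not None else None


print(read_cookie('name=Isaac; theme=dark', 'name'))
```

<p>A missing cookie yields <code>None</code>, which matches the <code>Cookie(None)</code> default used in the FastAPI version.</p>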
</section>
<section id="modular-views" class="level2">
<h2 class="anchored" data-anchor-id="modular-views">Modular Views</h2>
<p>We want to split the views from a single <code>app.py</code> into separate files.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb37-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> app.py</span></span>
<span id="cb37-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> views</span></span>
<span id="cb37-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> user.py</span></span></code></pre></div></div>
<p><strong>Flask:</strong><br>
In Flask, we use a concept called blueprints to manage this. We would first create a blueprint for the user view as:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>views/user.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb38" data-filename="views/user.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Blueprint</span>
<span id="cb38-2">user_blueprint <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Blueprint(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'user'</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb38-3"></span>
<span id="cb38-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@user_blueprint.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/users'</span>)</span>
<span id="cb38-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> list_users():</span>
<span id="cb38-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'users'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>]}</span></code></pre></div></div>
</div>
<p>Then, this view is registered in the main <code>app.py</code> file.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb39" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask</span>
<span id="cb39-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> views.user <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> user_blueprint</span>
<span id="cb39-3"></span>
<span id="cb39-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb39-5">app.register_blueprint(user_blueprint)</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong><br>
In FastAPI, the equivalent of a blueprint is called a router. First, we create a user router as:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>routers/user.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb40" data-filename="routers/user.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> APIRouter</span>
<span id="cb40-2">router <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> APIRouter()</span>
<span id="cb40-3"></span>
<span id="cb40-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@router.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/users'</span>)</span>
<span id="cb40-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> list_users():</span>
<span id="cb40-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'users'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>]}</span></code></pre></div></div>
</div>
<p>Then, we attach this router to the main app object as:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb41" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb41-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI</span>
<span id="cb41-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> routers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> user</span>
<span id="cb41-3"></span>
<span id="cb41-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb41-5">app.include_router(user.router)</span></code></pre></div></div>
</div>
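<p>Conceptually, a blueprint or router is a container that collects routes in one module so the main app can merge them in later. The toy model below sketches that idea in plain Python. The <code>Router</code> and <code>App</code> classes are hypothetical stand-ins, not the real Flask or FastAPI classes, and they skip everything a real framework does beyond storing handlers.</p>

```python
class Router:
    """Toy stand-in for a blueprint or router: it only records routes."""

    def __init__(self):
        self.routes = {}

    def get(self, path):
        def decorator(func):
            # Remember which function handles this path.
            self.routes[path] = func
            return func
        return decorator


class App(Router):
    def include_router(self, router):
        # Copy the routes collected elsewhere into the main app.
        self.routes.update(router.routes)


# This part would live in a separate module such as routers/user.py.
user_router = Router()


@user_router.get('/users')
def list_users():
    return {'users': ['a', 'b', 'c']}


# This part would live in app.py.
app = App()
app.include_router(user_router)
print(app.routes['/users']())
```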
</section>
<section id="data-validation" class="level2">
<h2 class="anchored" data-anchor-id="data-validation">Data Validation</h2>
<p><strong>Flask</strong><br>
Flask doesn’t provide any input data validation out of the box. It’s common practice to either write custom validation logic or use libraries such as <a href="https://marshmallow.readthedocs.io/en/stable/">marshmallow</a> or <a href="https://docs.pydantic.dev/latest/">pydantic</a>.</p>
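<p>As a rough idea of what hand-written validation logic can look like, here is a minimal sketch in plain Python. <code>validate_user</code>, its field list, and the shape of the error dictionaries are assumptions made for illustration. They are not part of Flask or of any validation library.</p>

```python
def validate_user(payload):
    """Return a list of error dicts; an empty list means the payload is valid."""
    # Hypothetical schema: field name -> expected Python type.
    expected = {'name': str, 'age': int}
    errors = []
    for field, expected_type in expected.items():
        value = payload.get(field)
        # bool is a subclass of int, so booleans are rejected explicitly.
        if not isinstance(value, expected_type) or isinstance(value, bool):
            errors.append({
                'loc': ['body', field],
                'msg': f'expected {expected_type.__name__}',
            })
    return errors


print(validate_user({'name': 'Isaac', 'age': 60}))
print(validate_user({'name': 'Isaac', 'age': 'sixty'}))
```

<p>In a Flask view, a helper like this could be called on the parsed JSON body, returning the errors with an error status code when the list is non-empty. Every new field or rule has to be added by hand, which is why a library is usually the more practical choice.</p>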
<p><strong>FastAPI:</strong></p>
<p>FastAPI integrates pydantic into the framework and allows data validation by simply combining a pydantic schema with Python type hints.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI</span>
<span id="cb42-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb42-3"></span>
<span id="cb42-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb42-5"></span>
<span id="cb42-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> User(BaseModel):</span>
<span id="cb42-7">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span>
<span id="cb42-8">    age: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span>
<span id="cb42-9"></span>
<span id="cb42-10"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.post</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/users'</span>)</span>
<span id="cb42-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> save_user(user: User):</span>
<span id="cb42-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: user.name,</span>
<span id="cb42-13">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'age'</span>: user.age}</span></code></pre></div></div>
<p>This code performs automatic validation to ensure <code>name</code> is a string and <code>age</code> is an integer. If a value of an incompatible type is sent, FastAPI auto-generates a validation error with a relevant message.</p>
<p>Here are some examples of pydantic schema for common use-cases.</p>
<section id="example-1-key-value-pairs" class="level3">
<h3 class="anchored" data-anchor-id="example-1-key-value-pairs">Example 1: Key-value pairs</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb43-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb43-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Isaac"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb43-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span></span>
<span id="cb43-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb44-2"></span>
<span id="cb44-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> User(BaseModel):</span>
<span id="cb44-4">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span>
<span id="cb44-5">    age: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span></code></pre></div></div>
</section>
<section id="example-2-collection-of-things" class="level3">
<h3 class="anchored" data-anchor-id="example-2-collection-of-things">Example 2: Collection of things</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb45-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb45-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"series"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GOT"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dark"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mr. Robot"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb45-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb46-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> List</span>
<span id="cb46-3"></span>
<span id="cb46-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Metadata(BaseModel):</span>
<span id="cb46-5">    series: List[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>]</span></code></pre></div></div>
</section>
<section id="example-3-nested-objects" class="level3">
<h3 class="anchored" data-anchor-id="example-3-nested-objects">Example 3: Nested Objects</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb47-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb47-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"users"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb47-3">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb47-4">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"xyz"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb47-5">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span></span>
<span id="cb47-6">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb47-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb47-8">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abc"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb47-9">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span></span>
<span id="cb47-10">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb47-11">  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb47-12">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"group"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Group A"</span></span>
<span id="cb47-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb48-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> List</span>
<span id="cb48-3"></span>
<span id="cb48-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> User(BaseModel):</span>
<span id="cb48-5">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span>
<span id="cb48-6">    age: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span>
<span id="cb48-7"></span>
<span id="cb48-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> UserGroup(BaseModel):</span>
<span id="cb48-9">    users: List[User]</span>
<span id="cb48-10">    group: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span></code></pre></div></div>
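<p>To see how the nested models behave, here is a minimal sketch that feeds the JSON above into <code>UserGroup</code>. The <code>payload</code> variable and the invalid <code>"unknown"</code> age are illustrative assumptions, not part of the original example.</p>

```python
from typing import List

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    age: int


class UserGroup(BaseModel):
    users: List[User]
    group: str


# The same payload as the JSON above, written as a Python dict.
payload = {
    "users": [
        {"name": "xyz", "age": 25},
        {"name": "abc", "age": 30},
    ],
    "group": "Group A",
}

# Nested dicts are converted into User instances during validation.
user_group = UserGroup(**payload)
first_user = user_group.users[0]

# A value that cannot be read as an integer is rejected.
try:
    UserGroup(users=[{"name": "xyz", "age": "unknown"}], group="Group A")
    rejected = False
except ValidationError:
    rejected = True
```

<p>Each entry in <code>users</code> ends up as a <code>User</code> object, so nested fields are available as attributes such as <code>first_user.age</code>.</p>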
<p>You can learn more about Python type hints from the <a href="https://fastapi.tiangolo.com/python-types/">FastAPI guide to Python types</a>.</p>
</section>
</section>
<section id="automatic-documentation" class="level2">
<h2 class="anchored" data-anchor-id="automatic-documentation">Automatic Documentation</h2>
<p><strong>Flask</strong><br>
Flask doesn’t provide a built-in feature for generating documentation. Extensions such as <a href="https://pypi.org/project/flask-swagger/">flask-swagger</a> or <a href="https://flask-restplus.readthedocs.io/en/stable/swagger.html">flask-restplus</a> fill that gap, but the workflow is comparatively complex.</p>
<p><strong>FastAPI:</strong><br>
FastAPI automatically generates interactive Swagger documentation at <code>/docs</code> and reference documentation at <code>/redoc</code>.</p>
<p>For example, consider the simple view below, which echoes what the user searched for.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb49" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb49-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI</span>
<span id="cb49-2"></span>
<span id="cb49-3">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb49-4"></span>
<span id="cb49-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.get</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/search'</span>)</span>
<span id="cb49-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> search(q: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb49-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'query'</span>: q}</span></code></pre></div></div>
</div>
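<p>FastAPI can build this documentation because the view’s parameters carry type hints. As a rough illustration of the idea only, the standard library’s <code>inspect</code> module can read the same information from a plain function. This is not how FastAPI is implemented, and <code>describe_parameters</code> is a hypothetical helper.</p>

```python
import inspect


def search(q: str):
    # Same view body as above, without the FastAPI decorator.
    return {'query': q}


def describe_parameters(func):
    """Return one dict per parameter with its name, type and whether it is required."""
    described = []
    for name, param in inspect.signature(func).parameters.items():
        annotation = param.annotation
        has_hint = annotation is not inspect.Parameter.empty
        described.append({
            "name": name,
            # Fall back to the string form for hints without a __name__.
            "type": getattr(annotation, "__name__", str(annotation)) if has_hint else "any",
            # A parameter without a default value must be supplied by the caller.
            "required": param.default is inspect.Parameter.empty,
        })
    return described


docs = describe_parameters(search)
```

<p>For the <code>search</code> view, this reports a single required parameter <code>q</code> of type <code>str</code>, which is the kind of detail that appears in the generated pages.</p>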
<section id="swagger-documentation" class="level3">
<h3 class="anchored" data-anchor-id="swagger-documentation">Swagger Documentation</h3>
<p>If you run the server and go to <code>http://127.0.0.1:8000/docs</code>, you will get auto-generated Swagger documentation.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fastapi-swagger.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="OpenAPI Swagger UI in FastAPI"></p>
</figure>
</div>
<p>You can interactively try out the API from the browser itself.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fastapi-swagger-interactive.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Interactive API Usage in FastAPI"></p>
</figure>
</div>
</section>
<section id="redoc-documentation" class="level3">
<h3 class="anchored" data-anchor-id="redoc-documentation">ReDoc Documentation</h3>
<p>In addition to Swagger, if you go to <code>http://127.0.0.1:8000/redoc</code>, you will get auto-generated reference documentation. It covers parameters, request format, response format and status codes.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fastapi-redoc.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="ReDoc functionality in FastAPI"></p>
</figure>
</div>
</section>
</section>
<section id="cross-origin-resource-sharingcors" class="level2">
<h2 class="anchored" data-anchor-id="cross-origin-resource-sharingcors">Cross-Origin Resource Sharing(CORS)</h2>
<p><strong>Flask</strong><br>
Flask doesn’t provide CORS support out of the box. We need to use an extension such as <a href="https://flask-cors.readthedocs.io/en/latest/">flask-cors</a> to configure CORS as shown below.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb50" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb50-1"></span>
<span id="cb50-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask</span>
<span id="cb50-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask_cors <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> CORS</span>
<span id="cb50-4"></span>
<span id="cb50-5">app_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb50-6">CORS(app_)</span></code></pre></div></div>
</div>
<p><strong>FastAPI:</strong><br>
FastAPI provides a <a href="https://fastapi.tiangolo.com/tutorial/cors/">built-in middleware</a> to handle CORS. The example below allows any origin to access our APIs.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>app.py</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb51" data-filename="app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FastAPI</span>
<span id="cb51-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastapi.middleware.cors <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> CORSMiddleware</span>
<span id="cb51-3"></span>
<span id="cb51-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FastAPI()</span>
<span id="cb51-5"></span>
<span id="cb51-6">app.add_middleware(</span>
<span id="cb51-7">    CORSMiddleware,</span>
<span id="cb51-8">    allow_origins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*'</span>],</span>
<span id="cb51-9">    allow_credentials<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb51-10">    allow_methods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*"</span>],</span>
<span id="cb51-11">    allow_headers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*"</span>],</span>
<span id="cb51-12">)</span></code></pre></div></div>
</div>
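<p>Conceptually, a CORS policy decides which <code>Access-Control-Allow-Origin</code> header, if any, goes back to the browser. The sketch below is a simplified illustration of that decision in plain Python. It is not the middleware’s actual implementation, and <code>cors_headers</code> and the example origins are hypothetical.</p>

```python
def cors_headers(request_origin, allowed_origins):
    """Return CORS response headers for a request origin, or {} if it is not allowed."""
    if "*" in allowed_origins:
        # A wildcard policy lets any origin read the response.
        return {"Access-Control-Allow-Origin": "*"}
    if request_origin in allowed_origins:
        # Echo the specific origin and tell caches the response varies by origin.
        return {
            "Access-Control-Allow-Origin": request_origin,
            "Vary": "Origin",
        }
    # Without the header, the browser blocks the cross-origin read.
    return {}


open_policy = cors_headers("https://example.com", ["*"])
restricted = cors_headers("https://other.example", ["https://example.com"])
```

<p>With an explicit list in place of <code>'*'</code>, only the listed origins receive the header, which is usually the safer choice outside of prototypes.</p>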
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, FastAPI is an excellent alternative to Flask for building robust APIs with best-practices baked in. You can refer to the <a href="https://fastapi.tiangolo.com/">documentation</a> to learn more.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://fastapi.tiangolo.com">FastAPI Documentation</a></li>
<li><a href="https://pydantic-docs.helpmanual.io/">Pydantic Documentation</a></li>
<li><a href="https://www.uvicorn.org/">Uvicorn: The lightning-fast ASGI server</a></li>
</ul>


</section>

 ]]></description>
  <category>python</category>
  <guid>https://amitness.com/posts/fastapi-vs-flask</guid>
  <pubDate>Mon, 29 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/flask-to-fastapi.png" medium="image" type="image/png" height="57" width="144"/>
</item>
<item>
  <title>Google Colab Tips for Power Users</title>
  <link>https://amitness.com/posts/google-colab-tips</link>
  <description><![CDATA[ 




<p>Colab is one of the best products to come from Google. It has made GPUs freely accessible to learners and practitioners like me who otherwise wouldn’t be able to afford a high-end GPU.</p>
<p>While the interface is very easy to use, there are many lesser-known and undocumented features in Colab. In this post, I will share the features that I’ve discovered from everyday usage and the team’s official talks.</p>
<section id="scratchpad-notebook" class="level2">
<h2 class="anchored" data-anchor-id="scratchpad-notebook">1. Scratchpad Notebook</h2>
<p>A common scenario is ending up with a clutter of untitled notebooks created while trying out temporary things on Colab.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-clutter.png" class="img-fluid figure-img" alt="Clutter of untitled notebooks in Colab"></p>
<figcaption>Clutter of Untitled Notebooks in Colab</figcaption>
</figure>
</div>
<p>To solve this, you can bookmark the link given below. It will open a special <strong>scratch notebook</strong> and any changes you make to that notebook are not saved to your main account.</p>
<blockquote class="blockquote">
<p><a href="https://colab.research.google.com/notebooks/empty.ipynb">https://colab.research.google.com/notebooks/empty.ipynb</a></p>
</blockquote>
</section>
<section id="timing-execution-of-cell" class="level2">
<h2 class="anchored" data-anchor-id="timing-execution-of-cell">2. Timing Execution of Cell</h2>
<p>We often manually calculate the difference between the start and end times of a piece of code to gauge how long it takes.</p>
<p>Colab provides a built-in feature to do this. After a cell is executed, just hover over the cell run icon and you will get an estimate of the execution time.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-cell-hover.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Execution Time by hovering on run cell"></p>
</figure>
</div>
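<p>For comparison, the manual approach mentioned at the start of this section might look like the sketch below. The summation is only a stand-in for whatever code you want to time.</p>

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the code being timed
elapsed = time.perf_counter() - start

print(f"Took {elapsed:.4f} seconds")
```

<p>Hovering over the run icon gives you this number without adding any timing code to the cell.</p>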
</section>
<section id="run-part-of-a-cell" class="level2">
<h2 class="anchored" data-anchor-id="run-part-of-a-cell">3. Run part of a cell</h2>
<p>You can also run only a part of the cell by selecting it and pressing the <code>Runtime &gt; Run Selection</code> button or using the keyboard shortcut <code>Ctrl + Shift + Enter</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-run-few-lines.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Running specific line in colab"></p>
</figure>
</div>
</section>
<section id="jupyter-notebook-keyboard-shortcuts" class="level2">
<h2 class="anchored" data-anchor-id="jupyter-notebook-keyboard-shortcuts">4. Jupyter Notebook Keyboard Shortcuts</h2>
<p>If you are familiar with keyboard shortcuts from Jupyter Notebook, they don’t work directly in Colab. But I found a mental model to map between them.</p>
<p>Just add <code>Ctrl + M</code> before whatever keyboard shortcut you were using in Jupyter. This rule of thumb works for the majority of common use-cases.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Action</th>
<th>Jupyter Notebook</th>
<th>Google Colab</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Add a cell above</td>
<td>A</td>
<td>Ctrl + <strong>M</strong> + A</td>
</tr>
<tr class="even">
<td>Add a cell below</td>
<td>B</td>
<td>Ctrl + <strong>M</strong> + B</td>
</tr>
<tr class="odd">
<td>See all keyboard shortcuts</td>
<td>H</td>
<td>Ctrl + <strong>M</strong> + H</td>
</tr>
<tr class="even">
<td>Change cell to code</td>
<td>Y</td>
<td>Ctrl + <strong>M</strong> + Y</td>
</tr>
<tr class="odd">
<td>Change cell to markdown</td>
<td>M</td>
<td>Ctrl + <strong>M</strong> + M</td>
</tr>
<tr class="even">
<td>Interrupt the kernel</td>
<td>II</td>
<td>Ctrl + <strong>M</strong> + I</td>
</tr>
<tr class="odd">
<td>Delete a cell</td>
<td>DD</td>
<td>Ctrl + <strong>M</strong> + D</td>
</tr>
<tr class="even">
<td>Checkpoint notebook</td>
<td>Ctrl + S</td>
<td>Ctrl + <strong>M</strong> + S</td>
</tr>
</tbody>
</table>
<p>Below are some notable exceptions to this rule, where the shortcut is either changed completely or kept the same.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 21%">
<col style="width: 27%">
</colgroup>
<thead>
<tr class="header">
<th>Action</th>
<th>Jupyter Notebook</th>
<th>Google Colab</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Restart runtime</td>
<td>00</td>
<td>Ctrl + <strong>M</strong> + <strong>.</strong></td>
</tr>
<tr class="even">
<td>Run cell</td>
<td>Ctrl + Enter</td>
<td>Ctrl + Enter</td>
</tr>
<tr class="odd">
<td>Run cell and add new cell below</td>
<td>Alt + Enter</td>
<td>Alt + Enter</td>
</tr>
<tr class="even">
<td>Run cell and go to the next cell below</td>
<td>Shift + Enter</td>
<td>Shift + Enter</td>
</tr>
<tr class="odd">
<td>Comment current line</td>
<td>Ctrl + /</td>
<td>Ctrl + /</td>
</tr>
</tbody>
</table>
</section>
<section id="jump-to-class-definition" class="level2">
<h2 class="anchored" data-anchor-id="jump-to-class-definition">5. Jump to Class Definition</h2>
<p>Similar to an IDE, you can go to a class definition by pressing <code>Ctrl</code> and then clicking a class name. For example, here we view the class definition of the Dense layer in Keras by pressing Ctrl and then clicking the <code>Dense</code> class name.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-goto-class.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Demo of jumping to class definition"></p>
</figure>
</div>
</section>
<section id="open-notebooks-from-github" class="level2">
<h2 class="anchored" data-anchor-id="open-notebooks-from-github">6. Open Notebooks from GitHub</h2>
<p>The Google Colab team provides an official Chrome extension to open notebooks on GitHub directly in Colab. You can install it from <a href="https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo">here</a>.</p>
<p>After installation, click the Colab icon on any GitHub notebook to open it directly.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-from-github.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Extension for opening github notebook in colab"></p>
</figure>
</div>
<p>Alternatively, you can also manually open any GitHub notebook by replacing <code>github.com</code> with <code>colab.research.google.com/github</code>.</p>
<blockquote class="blockquote">
<p>https://<strong>github.com</strong>/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb</p>
</blockquote>
<p>to</p>
<blockquote class="blockquote">
<p>https://<strong>colab.research.google.com/github</strong>/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb</p>
</blockquote>
<p>An even easier way is to replace <code>github.com</code> with <code>githubtocolab.com</code>. It will redirect you to a Colab notebook.</p>
<blockquote class="blockquote">
<p>https://<strong>github.com</strong>/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb</p>
</blockquote>
<p>to</p>
<blockquote class="blockquote">
<p>https://<strong>githubtocolab.com</strong>/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb</p>
</blockquote>
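<p>If you convert links often, the first replacement rule above can be written as a small helper. This is a minimal sketch, and <code>github_to_colab</code> is a hypothetical name rather than part of any Colab tooling.</p>

```python
def github_to_colab(url):
    """Rewrite a GitHub notebook URL into the matching Colab URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError("Expected a URL starting with " + prefix)
    # Keep the owner/repo/path part and swap only the host portion.
    return "https://colab.research.google.com/github/" + url[len(prefix):]


colab_url = github_to_colab(
    "https://github.com/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb"
)
```

<p>The helper only rewrites the host part of the link and leaves the repository path untouched.</p>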
</section>
<section id="run-flask-apps-from-colab" class="level2">
<h2 class="anchored" data-anchor-id="run-flask-apps-from-colab">7. Run Flask apps from Colab</h2>
<p>With a library called <a href="https://github.com/gstaff/flask-ngrok">flask-ngrok</a>, you can easily expose a Flask web app running on Colab to demo prototypes. First, you need to install <code>flask</code> and <code>flask-ngrok</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install flask<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>ngrok flask<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.12.2</span></span></code></pre></div></div>
<p>Then, you just need to pass your Flask app object to the <code>run_with_ngrok</code> function, and it will expose an ngrok endpoint when the server is started.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Flask</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> flask_ngrok <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> run_with_ngrok</span>
<span id="cb2-3"></span>
<span id="cb2-4">app <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flask(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb2-5">run_with_ngrok(app)</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@app.route</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/'</span>)</span>
<span id="cb2-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> hello():</span>
<span id="cb2-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Hello World!'</span></span>
<span id="cb2-10"></span>
<span id="cb2-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'__main__'</span>:</span>
<span id="cb2-12">    app.run()</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-flask.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of running flask-ngrok"></p>
</figure>
</div>
<p>You can try this out from the package author’s <a href="https://colab.research.google.com/github/gstaff/flask-ngrok/blob/master/examples/flask_ngrok_example.ipynb">official example</a> on Colab.</p>
</section>
<section id="switch-between-tensorflow-versions" class="level2">
<h2 class="anchored" data-anchor-id="switch-between-tensorflow-versions">8. Switch between TensorFlow versions</h2>
<p>You can easily switch between TensorFlow 1 and TensorFlow 2 using this magic command.<br>
To switch to TensorFlow 1.15.2, use this command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>tensorflow_version <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">x</span></span></code></pre></div></div>
<p>To switch to TensorFlow 2.2, run this command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>tensorflow_version <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">x</span></span></code></pre></div></div>
<p>You will need to restart the runtime for the change to take effect. For performance reasons, Colab recommends using the pre-installed TensorFlow version instead of installing it from <code>pip</code>.</p>
</section>
<section id="tensorboard-integration" class="level2">
<h2 class="anchored" data-anchor-id="tensorboard-integration">9. TensorBoard Integration</h2>
<p>Colab also provides a magic command to use TensorBoard directly from the notebook. You just need to set the logs directory location using the <code>--logdir</code> flag. You can learn to use it from the <a href="https://colab.research.google.com/github/tensorflow/tensorboard/blob/master/docs/tensorboard_in_notebooks.ipynb">official notebook</a>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>load_ext tensorboard</span>
<span id="cb5-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>tensorboard <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>logdir logs</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-tensorboard.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Embedded Tensorboard in Colab"></p>
</figure>
</div>
</section>
<section id="gauge-resource-limits" class="level2">
<h2 class="anchored" data-anchor-id="gauge-resource-limits">10. Gauge resource limits</h2>
<p>Colab provides the following specs for their free and pro versions. Based on your use case, you can switch to the pro version at $10/month if you need a better runtime, GPU, and memory.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 9%">
<col style="width: 13%">
<col style="width: 9%">
<col style="width: 8%">
<col style="width: 9%">
<col style="width: 12%">
<col style="width: 16%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Version</th>
<th>GPU</th>
<th>GPU RAM</th>
<th>RAM</th>
<th>Storage</th>
<th>CPU Cores</th>
<th>Idle Timeout</th>
<th>Maximum Runtime</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Free</td>
<td>Tesla K80</td>
<td>11.44GB</td>
<td>13.7GB</td>
<td>37GB</td>
<td>2</td>
<td>90 min</td>
<td>12 hrs</td>
</tr>
<tr class="even">
<td>Pro</td>
<td>Tesla P100</td>
<td>16GB</td>
<td>27.4GB</td>
<td>37GB</td>
<td>4</td>
<td>90 min</td>
<td>24 hrs</td>
</tr>
</tbody>
</table>
<p>You can view the GPU you have been assigned by running the following command</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!nvidia-smi</span></span></code></pre></div></div>
</div>
<p>For information on the CPU, you can run this command</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!cat</span> /proc/cpuinfo</span></code></pre></div></div>
</div>
<p>Similarly, you can view the RAM capacity by running</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> psutil</span>
<span id="cb8-2">ram_gb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> psutil.virtual_memory().total <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e9</span></span>
<span id="cb8-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(ram_gb)</span></code></pre></div></div>
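<p>Disk capacity can be checked in a similar way. The snippet below is a minimal sketch that uses <code>shutil.disk_usage</code> from the Python standard library. The helper name <code>disk_usage_gb</code> and the choice of the root path <code>/</code> are assumptions made for illustration, and sizes are reported in units of 1e9 bytes to match the RAM example above.</p>

```python
import shutil


def disk_usage_gb(path="/"):
    """Return (total, used, free) disk space for `path`, in GB (1e9 bytes)."""
    usage = shutil.disk_usage(path)
    return tuple(
        round(value / 1e9, 2) for value in (usage.total, usage.used, usage.free)
    )


# Hypothetical usage: inspect the root filesystem of the runtime
total_gb, used_gb, free_gb = disk_usage_gb()
print(f"Disk: {total_gb} GB total, {used_gb} GB used, {free_gb} GB free")
```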
</section>
<section id="use-interactive-shell" class="level2">
<h2 class="anchored" data-anchor-id="use-interactive-shell">11. Use interactive shell</h2>
<p>There is no built-in interactive terminal in Colab. But you can use the <code>bash</code> command to try out shell commands interactively. Just run this command and you will get an interactive input.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb9-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!bash</span></span></code></pre></div></div>
</div>
<p>Now, you can run any shell command in the given input box.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-bash.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Using interactive shell in colab"></p>
</figure>
</div>
<p>To quit from the shell, just type <code>exit</code> in the input box.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-bash-exit.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Exiting interactive shell in colab"></p>
</figure>
</div>
</section>
<section id="current-memory-and-storage-usage" class="level2">
<h2 class="anchored" data-anchor-id="current-memory-and-storage-usage">12. Current memory and storage usage</h2>
<p>Colab provides an indicator of RAM and disk usage. If you hover over the indicator, you will get a popup with the current usage and the total capacity.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-ram-usage.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Showing current memory and ram usage in colab"></p>
</figure>
</div>
</section>
<section id="open-in-colab-badge" class="level2">
<h2 class="anchored" data-anchor-id="open-in-colab-badge">13. “Open in Colab” Badge</h2>
<p>You can add an ‘Open in Colab’ badge to your <code>README.md</code> or Jupyter notebooks using the following markdown code.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://colab.research.google.com/assets/colab-badge.svg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Open In Colab"></p>
</figure>
</div>
<p>In the markdown code, we’re loading an SVG image and then linking it to a colab notebook.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)</span></span></code></pre></div></div>
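<p>If you have many notebooks on GitHub, the badge markdown can be generated instead of typed by hand. The sketch below assumes the notebook is public and uses the <code>colab.research.google.com/github/&lt;user&gt;/&lt;repo&gt;/blob/&lt;branch&gt;/&lt;path&gt;</code> URL form seen in the TensorBoard notebook link earlier in this post. The function name <code>colab_badge</code> and the default branch are illustrative assumptions.</p>

```python
def colab_badge(user, repo, notebook_path, branch="master"):
    """Build 'Open In Colab' badge markdown for a notebook hosted on GitHub."""
    badge_url = "https://colab.research.google.com/assets/colab-badge.svg"
    notebook_url = (
        "https://colab.research.google.com/github/"
        f"{user}/{repo}/blob/{branch}/{notebook_path}"
    )
    # A markdown image wrapped inside a markdown link
    return f"[![Open In Colab]({badge_url})]({notebook_url})"


print(colab_badge("tensorflow", "tensorboard", "docs/tensorboard_in_notebooks.ipynb"))
```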
</section>
<section id="interactive-tables-for-pandas" class="level2">
<h2 class="anchored" data-anchor-id="interactive-tables-for-pandas">14. Interactive Tables for Pandas</h2>
<p>Colab provides a notebook extension to add interactive sorting and filtering capabilities to pandas dataframes. To use it, run the following code.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>load_ext google.colab.data_table</span></code></pre></div></div>
<p>You can see the regular pandas dataframe and the interactive dataframe after loading the extension below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/pandas-table-before.png" class="img-fluid figure-img"></p>
<figcaption>Regular pandas dataframe output</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/colab-pandas-after.png" class="img-fluid figure-img"></p>
<figcaption>Interactive pandas dataframe output</figcaption>
</figure>
</div>
</section>
<section id="setup-conda-environment" class="level2">
<h2 class="anchored" data-anchor-id="setup-conda-environment">15. Setup Conda environment</h2>
<p>If you use Miniconda as your Python environment manager, you can set it up on Colab by running these commands at the top of your notebook.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Download Miniconda installation script</span></span>
<span id="cb12-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!wget</span> https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make it executable</span></span>
<span id="cb12-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!chmod</span> +x Miniconda3-latest-Linux-x86_64.sh</span>
<span id="cb12-6"></span>
<span id="cb12-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start installation in silent mode</span></span>
<span id="cb12-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!bash</span> ./Miniconda3-latest-Linux-x86_64.sh <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-b</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-f</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-p</span> /usr/local</span>
<span id="cb12-9"></span>
<span id="cb12-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make conda packages available in current environment</span></span>
<span id="cb12-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">import</span> sys</span>
<span id="cb12-12"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">sys.path.append</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/usr/local/lib/python3.7/site-packages/'</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span></code></pre></div></div>
</div>
<p>After the cell is executed, you can use conda to install packages as usual.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb13-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!conda</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-y</span> flask</span></code></pre></div></div>
</div>
<p>Alternatively, you can use the <a href="https://github.com/conda-incubator/condacolab">condacolab</a> package to install it easily.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb14-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install condacolab</span></code></pre></div></div>
</div>
<p>Then, run these python commands to install miniconda.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> condacolab</span>
<span id="cb15-2">condacolab.install_miniconda()</span></code></pre></div></div>
</section>
<section id="manage-colab-notebooks-from-command-line" class="level2">
<h2 class="anchored" data-anchor-id="manage-colab-notebooks-from-command-line">16. Manage Colab Notebooks from Command Line</h2>
<p>You can use a library called <a href="https://github.com/Akshay090/colab-cli">colab-cli</a> to easily create and sync colab notebooks with your local notebooks.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://asciinema.org/a/314749"><img src="https://asciinema.org/a/314749.svg" class="img-fluid figure-img"></a></p>
<figcaption>colab-cli-demo</figcaption>
</figure>
</div>
</section>
<section id="run-background-tasks" class="level2">
<h2 class="anchored" data-anchor-id="run-background-tasks">17. Run background tasks</h2>
<p>Sometimes we need to start a web server or another background task before we can run our regular program.</p>
<p>To run background tasks, use the <code>nohup</code> command followed by your regular shell command and add <code>&amp;</code> to the end to run it in the background. This makes sure that you can run cells afterward in the notebook without your background task blocking it.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb16-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">!nohup</span> bash ping.sh <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;</span></span></code></pre></div></div>
</div>
</section>
<section id="notify-on-training-completion" class="level2">
<h2 class="anchored" data-anchor-id="notify-on-training-completion">18. Notify on Training Completion</h2>
<p>If you’re running a long task such as training a model, you can set up Colab to send a desktop notification once it’s completed.</p>
<p>To enable that, go to Tools ⮕ Settings ⮕ Site and enable the <code>Show desktop notifications</code> checkbox.</p>
<p><img src="https://amitness.com/posts/images/colab-notification.png" class="img-fluid"></p>
<p>You will get a popup to enable browser notification. Just accept it and colab will notify you on task completion even if you are on another tab, window or application.</p>
</section>
<section id="run-javascript-code" class="level2">
<h2 class="anchored" data-anchor-id="run-javascript-code">19. Run javascript code</h2>
<p>You can run JavaScript code by using the <code>%%javascript</code> magic command.</p>
<p><img src="https://amitness.com/posts/images/colab-javascript.png" class="img-fluid"></p>
</section>
<section id="run-vscode-on-colab" class="level2">
<h2 class="anchored" data-anchor-id="run-vscode-on-colab">20. Run VSCode on Colab</h2>
<p>You can run a full-fledged VSCode editor on Colab by following the method I have explained in another <a href="https://amitness.com/vscode-on-colab/">article</a>.</p>
<p><img src="https://amitness.com/posts/images/colab-code-step-3.png" class="img-fluid"></p>
</section>
<section id="custom-snippets" class="level2">
<h2 class="anchored" data-anchor-id="custom-snippets">21. Custom snippets</h2>
<p>You can save your own collections of useful snippets and access them easily in any colab notebook.</p>
<ul>
<li><p>Create a colab notebook called <code>snippets.ipynb</code>. To add each of your snippets, create a markdown cell and add the name of the snippet as a header. Below the markdown cell, add a code cell with the snippet code.</p>
<p><img src="https://amitness.com/posts/images/custom-snippets-step-1.png" class="img-fluid"></p></li>
<li><p>Copy the link of this notebook from the browser tab.</p>
<p><img src="https://amitness.com/posts/images/custom-snippets-step-2.png" class="img-fluid"></p></li>
<li><p>Click <code>Tools &gt; Settings</code> in your menu bar to open the Colab preferences.<br>
<img src="https://amitness.com/posts/images/custom-snippets-step-3.png" class="img-fluid"></p></li>
<li><p>Paste the link into the <code>Custom snippet notebook URL</code> textbox and click save.</p></li>
</ul>
<p><img src="https://amitness.com/posts/images/custom-snippets-step-4.png" class="img-fluid"></p>
<ul>
<li>Now, the snippets are available in any colab notebook you use. Just click the <strong>&lt;&gt;</strong> icon on the sidebar, search for your snippet name and click <strong>Insert</strong>. The code will be inserted into a new cell.</li>
</ul>
<p><img src="https://amitness.com/posts/images/custom-snippets-usage.gif" class="img-fluid"></p>
</section>
<section id="run-jupyterlab-on-google-colab" class="level2">
<h2 class="anchored" data-anchor-id="run-jupyterlab-on-google-colab">22. Run JupyterLab on Google Colab</h2>
<p>You can start a JupyterLab instance on colab by running the following commands in a cell.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install jupyterlab pyngrok <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q</span>
<span id="cb17-2"></span>
<span id="cb17-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run jupyterlab in the background</span></span>
<span id="cb17-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>nohup jupyter lab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>ip<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0.0.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span></span>
<span id="cb17-5"></span>
<span id="cb17-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get ngrok URL mapped to port 8888</span></span>
<span id="cb17-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyngrok <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ngrok</span>
<span id="cb17-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(ngrok.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8888</span>))</span></code></pre></div></div>
<p>Once executed, click the printed ngrok URL to access the JupyterLab interface.</p>
<p><img src="https://amitness.com/posts/images/colab-jupyterlab.png" class="img-fluid"></p>
</section>
<section id="run-r-programs-in-google-colab" class="level2">
<h2 class="anchored" data-anchor-id="run-r-programs-in-google-colab">23. Run R programs in Google Colab</h2>
<p>You can use the R programming language in Google Colab by going to <a href="https://colab.research.google.com/notebook#create=true&amp;language=r">https://colab.research.google.com/notebook#create=true&amp;language=r</a>. It will open a new notebook with R set as the kernel instead of Python.</p>
<p><img src="https://amitness.com/posts/images/r-kerel-in-colab.png" class="img-fluid"></p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Timothy Novikoff, <a href="https://www.youtube.com/watch?v=pnClcwTCyc0">“Making the most of Colab (TF Dev Summit ’20)”</a></li>
<li>Gal Oshri, <a href="https://www.youtube.com/watch?v=xM8sO33x_OU">“What’s new in TensorBoard (TF Dev Summit ’19)”</a></li>
</ul>


</section>

 ]]></description>
  <category>colab</category>
  <guid>https://amitness.com/posts/google-colab-tips</guid>
  <pubDate>Fri, 26 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/colab-cover.png" medium="image" type="image/png" height="100" width="144"/>
</item>
<item>
  <title>A Visual Guide to FastText Word Embeddings</title>
  <link>https://amitness.com/posts/fasttext-embeddings</link>
  <description><![CDATA[ 




<p>Word Embeddings are one of the most interesting aspects of the Natural Language Processing field. When I first came across them, it was intriguing to see a simple recipe of unsupervised training on a bunch of text yield representations that show signs of syntactic and semantic understanding.</p>
<p>In this post, we will explore a word embedding algorithm called “FastText” that was introduced by <span class="citation" data-cites="bojanowski2017enrichingwordvectorssubword">Bojanowski et al. (2017)</span> and understand how it enhances the Word2Vec algorithm from 2013.</p>
<section id="intuition-on-word-representations" class="level2">
<h2 class="anchored" data-anchor-id="intuition-on-word-representations">Intuition on Word Representations</h2>
<p>Suppose we have the following words and we want to represent them as vectors so that they can be used in Machine Learning models.</p>
<blockquote class="blockquote">
<p>Ronaldo, Messi, Dicaprio</p>
</blockquote>
<p>A simple idea could be to perform a one-hot encoding of the words, where each word gets a unique position.</p>
<table class="table-hover table-bordered caption-top table">
<thead>
<tr class="header">
<th></th>
<th>isRonaldo</th>
<th>isMessi</th>
<th>isDicaprio</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Ronaldo</strong></td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="even">
<td><strong>Messi</strong></td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr class="odd">
<td><strong>Dicaprio</strong></td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>We can see that this sparse representation doesn’t capture any relationship between the words: each word is isolated from the others.</p>
<p>Maybe we could do something better. We know Ronaldo and Messi are footballers while Dicaprio is an actor. Let’s use our world knowledge and create manual features to represent the words better.</p>
<table class="table-hover table-bordered caption-top table">
<thead>
<tr class="header">
<th></th>
<th>isFootballer</th>
<th>isActor</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Ronaldo</strong></td>
<td>1</td>
<td>0</td>
</tr>
<tr class="even">
<td><strong>Messi</strong></td>
<td>1</td>
<td>0</td>
</tr>
<tr class="odd">
<td><strong>Dicaprio</strong></td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>This is better than the previous one-hot encoding because related items are closer in space.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-manually-creating-embedding.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Related words closer in space"></p>
</figure>
</div>
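<p>To make “closer in space” concrete, here is a small sketch that compares Euclidean distances under the two representations from the tables above. The dictionaries simply copy the table rows. The helper name <code>euclidean</code> and the choice of Euclidean distance are assumptions for illustration, and any other distance measure would lead to the same conclusion here.</p>

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Rows copied from the one-hot table
one_hot = {"Ronaldo": [1, 0, 0], "Messi": [0, 1, 0], "Dicaprio": [0, 0, 1]}
# Rows copied from the manual feature table (isFootballer, isActor)
features = {"Ronaldo": [1, 0], "Messi": [1, 0], "Dicaprio": [0, 1]}

for name, table in [("one-hot", one_hot), ("features", features)]:
    to_messi = euclidean(table["Ronaldo"], table["Messi"])
    to_dicaprio = euclidean(table["Ronaldo"], table["Dicaprio"])
    print(f"{name}: Ronaldo-Messi={to_messi:.2f}, Ronaldo-Dicaprio={to_dicaprio:.2f}")
```

<p>With one-hot vectors, every pair of words is equally far apart. With the manual features, the two footballers coincide while the actor stays at a distance.</p>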
<p>We could keep on adding even more aspects as dimensions to get a more nuanced representation.</p>
<table class="table-hover table-bordered caption-top table">
<colgroup>
<col style="width: 19%">
<col style="width: 19%">
<col style="width: 11%">
<col style="width: 16%">
<col style="width: 9%">
<col style="width: 9%">
<col style="width: 9%">
<col style="width: 4%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>isFootballer</th>
<th>isActor</th>
<th>Popularity</th>
<th>Gender</th>
<th>Height</th>
<th>Weight</th>
<th>…</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Ronaldo</strong></td>
<td>1</td>
<td>0</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr class="even">
<td><strong>Messi</strong></td>
<td>1</td>
<td>0</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr class="odd">
<td><strong>Dicaprio</strong></td>
<td>0</td>
<td>1</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
</tbody>
</table>
<p>But manually doing this for every possible word is not scalable. If we designed features based on our world knowledge of the relationship between words, can we replicate the same with a neural network?</p>
<blockquote class="blockquote">
<p><em>Can we have neural networks comb through a large corpus of text and generate word representations automatically?</em></p>
</blockquote>
<p>This is the intention behind the research in word-embedding algorithms.</p>
</section>
<section id="recapping-word2vec" class="level2">
<h2 class="anchored" data-anchor-id="recapping-word2vec">Recapping Word2Vec</h2>
<p>In 2013, <span class="citation" data-cites="mikolov2013efficientestimationwordrepresentations">Mikolov et al. (2013)</span> introduced an efficient method to learn vector representations of words from large amounts of unstructured text data. The paper was an execution of this idea from Distributional Semantics.</p>
<blockquote class="blockquote">
<p><em>You shall know a word by the company it keeps - J.R. Firth 1957</em></p>
</blockquote>
<!--  -->
<p>Since similar words appear in a similar context, Mikolov et al.&nbsp;used this insight to formulate two tasks for representation learning.</p>
<p>The first was called “<strong>Continuous Bag of Words</strong>” where we need to predict the center word given the neighbor words.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/nlp-ssl-center-word-prediction.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Interactive example of CBOW"></p>
</figure>
</div>
<p>The second task was called “<strong>Skip-gram</strong>” where we need to predict the neighbor words given a center word.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/nlp-ssl-neighbor-word-prediction.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Interactive example of skipgram method"></p>
</figure>
</div>
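<p>The two tasks differ only in how training examples are built from a sliding window. The sketch below generates examples for both from a tokenized sentence. The function names, the sample sentence, and the window size of 1 are assumptions for illustration, and this is not the original Word2Vec implementation.</p>

```python
def cbow_examples(tokens, window=1):
    """CBOW: (neighbor words, center word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples


def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center word, neighbor word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["i", "am", "eating", "food", "now"]
print(cbow_examples(tokens)[2])  # example whose center word is "eating"
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "eating"])
```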
<p>Representations learned had interesting properties such as this popular example where arithmetic operations on word vectors seemed to retain meaning.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/word2vec-analogy.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of word analogy in word2vec"></p>
</figure>
</div>
</section>
<section id="limitations-of-word2vec" class="level2">
<h2 class="anchored" data-anchor-id="limitations-of-word2vec">Limitations of Word2Vec</h2>
<p>While Word2Vec was a game-changer for NLP, we will see how there was still some room for improvement:</p>
<section id="out-of-vocabularyoov-words" class="level3">
<h3 class="anchored" data-anchor-id="out-of-vocabularyoov-words">Out of Vocabulary(OOV) Words</h3>
<p>In Word2Vec, an embedding is created for each word. As such, it can’t handle any words it has not encountered during its training.</p>
<p>For example, words such as “<span style="color: #82B366;">tensor</span>” and “<span style="color: #6C8EBF;">flow</span>” are present in the vocabulary of Word2Vec. But if you try to get an embedding for the compound word “<span style="color: #82B366;">tensor</span><span style="color: #6C8EBF;">flow</span>”, you will get an <span style="color: #B85450;">out of vocabulary error</span>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/word2vec-oov-tensorflow.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of out of vocab words"></p>
</figure>
</div>
</section>
<section id="morphology" class="level3">
<h3 class="anchored" data-anchor-id="morphology">Morphology</h3>
<p>For words with the same radicals such as “eat” and “eaten”, Word2Vec doesn’t do any parameter sharing. Each word is learned uniquely based on the context it appears in. Thus, there is scope for utilizing the internal structure of the word to make the process more efficient.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/word2vec-radicals.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Words that have shared radicals"></p>
</figure>
</div>
</section>
</section>
<section id="fasttext" class="level2">
<h2 class="anchored" data-anchor-id="fasttext">FastText</h2>
<p>To solve the above challenges, <span class="citation" data-cites="bojanowski2017enrichingwordvectorssubword">Bojanowski et al. (2017)</span> proposed a new embedding method called FastText. Their key insight was to use the internal structure of a word to improve vector representations obtained from the skip-gram method.</p>
<p>The modification to the skip-gram method is applied as follows:</p>
<section id="sub-word-generation" class="level3">
<h3 class="anchored" data-anchor-id="sub-word-generation">1. Sub-word generation</h3>
<p>For a word, we generate character n-grams of length 3 to 6 present in it.</p>
<ol type="1">
<li>We take a word and add angular brackets to denote the beginning and end of a word</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-angular-brackets.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Adding angular bracket to a word"></p>
</figure>
</div>
<ol start="2" type="1">
<li>Then, we generate character n-grams of length n.&nbsp;For example, for the word “eating”, character n-grams of length 3 can be generated by sliding a window of 3 characters from the start of the angular bracket till the ending angular bracket is reached. Here, we shift the window one step each time.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-3-gram-sliding.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Interactive example of generating 3-grams"></p>
</figure>
</div>
<ol start="3" type="1">
<li><p>Thus, we get a list of character n-grams for a word.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-3-grams-list.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="3-character n-grams of a word eating"></p>
</figure>
</div>
<p>Examples of different length character n-grams are given below:</p>
<table class="table-hover table-bordered caption-top table">
<thead>
<tr class="header">
<th>Word</th>
<th>Length(n)</th>
<th>Character n-grams</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>eating</td>
<td>3</td>
<td>&lt;ea, eat, ati, tin, ing, ng&gt;</td>
</tr>
<tr class="even">
<td>eating</td>
<td>4</td>
<td>&lt;eat, eati, atin, ting, ing&gt;</td>
</tr>
<tr class="odd">
<td>eating</td>
<td>5</td>
<td>&lt;eati, eatin, ating, ting&gt;</td>
</tr>
<tr class="even">
<td>eating</td>
<td>6</td>
<td>&lt;eatin, eating, ating&gt;</td>
</tr>
</tbody>
</table></li>
<li><p>Since there can be a huge number of unique n-grams, we apply hashing to bound the memory requirements. Instead of learning an embedding for each unique n-gram, we learn a total of B embeddings, where B denotes the bucket size. The paper used a bucket size of 2 million.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-hashing-ngrams.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Hash dictionary to store n-grams"></p>
</figure>
</div>
<p>Each character n-gram is hashed to an integer between 1 and B. Though this could result in collisions, it helps control the vocabulary size. The paper uses the FNV-1a variant of the <a href="http://www.isthe.com/chongo/tech/comp/fnv/">Fowler-Noll-Vo hashing</a> function to hash character sequences to integer values.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-hashing-function.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Hashing function for a n-gram"></p>
</figure>
</div></li>
</ol>
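<p>The sub-word generation steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the reference FastText code: the function names are made up, the hash is a plain 32-bit FNV-1a over UTF-8 bytes (so values for non-ASCII text may differ from the official implementation), and bucket indices run from 0 to B-1 rather than 1 to B.</p>

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in angular brackets."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters, one step at a time
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % 2**32  # multiply by the FNV prime, keep 32 bits
    return h


def bucket_index(ngram, num_buckets=2_000_000):
    """Map an n-gram to one of `num_buckets` embedding rows (0-based)."""
    return fnv1a_32(ngram) % num_buckets


for ngram in char_ngrams("eating", min_n=3, max_n=3):
    print(ngram, bucket_index(ngram))
```

<p>Two different n-grams can land in the same bucket and share an embedding row. That is the collision trade-off mentioned above.</p>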
</section>
<section id="skip-gram-with-negative-sampling" class="level3">
<h3 class="anchored" data-anchor-id="skip-gram-with-negative-sampling">2. Skip-gram with negative sampling</h3>
<p>To understand the pre-training, let’s take a simple toy example. We have a sentence with a center word “eating” and need to predict the context words “am” and “food”.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-toy-example.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Skip-gram with window size of 1"></p>
</figure>
</div>
<ol type="1">
<li><p>First, the embedding for the center word is calculated by taking a sum of vectors for the character n-grams and the whole word itself.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-center-word-embedding.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Summing n-grams with word vector"></p>
</figure>
</div></li>
<li><p>For the actual context words, we directly take their word vector from the embedding table without adding the character n-grams.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-context-words.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of context words"></p>
</figure>
</div></li>
<li><p>Now, we collect negative samples randomly with probability proportional to the square root of the unigram frequency. For one actual context word, 5 random negative words are sampled.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-negative-samples.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of negative context words"></p>
</figure>
</div></li>
<li><p>We take the dot product between the center word and the actual context words and apply the sigmoid function to get a match score between 0 and 1.</p></li>
<li><p>Based on the loss, we update the embedding vectors with the SGD optimizer to bring the actual context words closer to the center word and push the negative samples farther away.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-negative-sampling-goal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Goal of negative sampling in skip-gram"></p>
</figure>
</div></li>
</ol>
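<p>The steps above can be sketched in plain Python. This is only a toy illustration under stated assumptions: the helper names are hypothetical, the vectors are ordinary lists, and the loss is the usual binary logistic loss for negative sampling rather than a copy of the fastText implementation.</p>

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center_vector(word_vec, ngram_vecs):
    """Sum the whole-word vector with all of its character n-gram vectors."""
    total = list(word_vec)
    for vec in ngram_vecs:
        total = [t + v for t, v in zip(total, vec)]
    return total


def sample_negatives(unigram_counts, k, rng):
    """Draw k negative words with probability proportional to sqrt(count)."""
    words = list(unigram_counts)
    weights = [math.sqrt(unigram_counts[w]) for w in words]
    return rng.choices(words, weights=weights, k=k)


def negative_sampling_loss(center, positive, negatives):
    """Low when the positive scores high and every negative scores low."""
    loss = -math.log(sigmoid(dot(center, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


# Toy usage with made-up counts: sample 5 negatives for one context word.
rng = random.Random(0)
counts = {"the": 100, "paris": 9, "food": 16, "am": 25}
negatives = sample_negatives(counts, k=5, rng=rng)
```

<p>Minimizing this loss with SGD is what pulls the actual context words toward the center word and pushes the sampled negatives away. A real implementation would also avoid sampling the actual context word as a negative, which this sketch skips.</p>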
</section>
</section>
<section id="insights-from-the-paper" class="level2">
<h2 class="anchored" data-anchor-id="insights-from-the-paper">Insights from the Paper</h2>
<ol type="1">
<li><p>FastText significantly improves performance on syntactic word analogy tasks for morphologically rich languages like Czech and German.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-syntactic-analogy.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of syntactic analogy"></p>
</figure>
</div>
<table class="table-hover table-bordered caption-top table">
<thead>
<tr class="header">
<th></th>
<th>word2vec-skipgram</th>
<th>word2vec-cbow</th>
<th>fasttext</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Czech</strong></td>
<td>52.8</td>
<td>55.0</td>
<td><strong>77.8</strong></td>
</tr>
<tr class="even">
<td><strong>German</strong></td>
<td>44.5</td>
<td>45.0</td>
<td><strong>56.4</strong></td>
</tr>
<tr class="odd">
<td><strong>English</strong></td>
<td>70.1</td>
<td>69.9</td>
<td><strong>74.9</strong></td>
</tr>
<tr class="even">
<td><strong>Italian</strong></td>
<td>51.5</td>
<td>51.8</td>
<td><strong>62.7</strong></td>
</tr>
</tbody>
</table></li>
<li><p>FastText has degraded performance on semantic analogy tasks compared to Word2Vec.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/fasttext-semantic-analogy.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Example of semantic analogy"></p>
</figure>
</div>
<table class="table-hover table-bordered caption-top table">
<thead>
<tr class="header">
<th></th>
<th>word2vec-skipgram</th>
<th>word2vec-cbow</th>
<th>fasttext</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Czech</strong></td>
<td>25.7</td>
<td><strong>27.6</strong></td>
<td>27.5</td>
</tr>
<tr class="even">
<td><strong>German</strong></td>
<td>66.5</td>
<td><strong>66.8</strong></td>
<td>62.3</td>
</tr>
<tr class="odd">
<td><strong>English</strong></td>
<td><strong>78.5</strong></td>
<td>78.2</td>
<td>77.8</td>
</tr>
<tr class="even">
<td><strong>Italian</strong></td>
<td>52.3</td>
<td><strong>54.7</strong></td>
<td>52.3</td>
</tr>
</tbody>
</table></li>
<li><p>Using sub-word information with character n-grams gives better performance than the CBOW and skip-gram baselines on most word-similarity datasets. Representing out-of-vocabulary (OOV) words by summing their sub-word vectors performs better than assigning them null vectors.</p>
<table class="table-hover table-bordered caption-top table">
<colgroup>
<col style="width: 10%">
<col style="width: 8%">
<col style="width: 10%">
<col style="width: 5%">
<col style="width: 24%">
<col style="width: 39%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th></th>
<th>skipgram</th>
<th>cbow</th>
<th>fasttext(null OOV)</th>
<th>fasttext(char-ngrams for OOV)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Arabic</td>
<td>WS353</td>
<td>51</td>
<td>52</td>
<td>54</td>
<td><strong>55</strong></td>
</tr>
<tr class="even">
<td>German</td>
<td>GUR350</td>
<td>61</td>
<td>62</td>
<td>64</td>
<td><strong>70</strong></td>
</tr>
<tr class="odd">
<td></td>
<td>GUR65</td>
<td>78</td>
<td>78</td>
<td>81</td>
<td><strong>81</strong></td>
</tr>
<tr class="even">
<td></td>
<td>ZG222</td>
<td>35</td>
<td>38</td>
<td>41</td>
<td><strong>44</strong></td>
</tr>
<tr class="odd">
<td>English</td>
<td>RW</td>
<td>43</td>
<td>43</td>
<td>46</td>
<td><strong>47</strong></td>
</tr>
<tr class="even">
<td></td>
<td>WS353</td>
<td>72</td>
<td>73</td>
<td>71</td>
<td><strong>71</strong></td>
</tr>
<tr class="odd">
<td>Spanish</td>
<td>WS353</td>
<td>57</td>
<td>58</td>
<td>58</td>
<td><strong>59</strong></td>
</tr>
<tr class="even">
<td>French</td>
<td>RG65</td>
<td>70</td>
<td>69</td>
<td>75</td>
<td><strong>75</strong></td>
</tr>
<tr class="odd">
<td>Romanian</td>
<td>WS353</td>
<td>48</td>
<td>52</td>
<td>51</td>
<td><strong>54</strong></td>
</tr>
<tr class="even">
<td>Russian</td>
<td>HJ</td>
<td>59</td>
<td>60</td>
<td>60</td>
<td><strong>66</strong></td>
</tr>
</tbody>
</table></li>
<li><p>FastText is about 1.5 times slower to train than regular skip-gram due to the added overhead of n-grams.</p></li>
</ol>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>To train your own embeddings, you can either use the <a href="https://fasttext.cc/docs/en/unsupervised-tutorial.html">official CLI tool</a> or use the <a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html">fasttext implementation</a> available in gensim.</p>
<p>Pre-trained word vectors trained on Common Crawl and Wikipedia for 157 languages are available <a href="https://fasttext.cc/docs/en/crawl-vectors.html">here</a> and variants of English word vectors are available <a href="https://fasttext.cc/docs/en/english-vectors.html">here</a>.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-bojanowski2017enrichingwordvectorssubword" class="csl-entry">
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. <a href="https://arxiv.org/abs/1607.04606">Enriching word vectors with subword information</a>.
</div>
<div id="ref-joulin2016bagtricksefficienttext" class="csl-entry">
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. <a href="https://arxiv.org/abs/1607.01759">Bag of tricks for efficient text classification</a>.
</div>
<div id="ref-mikolov2013efficientestimationwordrepresentations" class="csl-entry">
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. <a href="https://arxiv.org/abs/1301.3781">Efficient estimation of word representations in vector space</a>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {A {Visual} {Guide} to {FastText} {Word} {Embeddings}},
  date = {2020-06-21},
  url = {https://amitness.com/posts/fasttext-embeddings.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-chaudhary2020" class="csl-entry quarto-appendix-citeas">
Amit Chaudhary. 2020. <a href="https://amitness.com/posts/fasttext-embeddings.html">A Visual
Guide to FastText Word Embeddings</a>.
</div></div></section></div> ]]></description>
  <category>nlp</category>
  <category>embeddings</category>
  <guid>https://amitness.com/posts/fasttext-embeddings</guid>
  <pubDate>Sun, 21 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/fasttext-center-word-embedding.png" medium="image" type="image/png" height="57" width="144"/>
</item>
<item>
  <title>Universal Sentence Encoder Visually Explained</title>
  <link>https://amitness.com/posts/universal-sentence-encoder</link>
  <description><![CDATA[ 




<p>With transformer models such as BERT and friends taking the NLP research community by storm, it might be tempting to just throw the latest and greatest model at a problem and declare it done. However, in industry, we have compute and memory limitations to consider and might not even have a dedicated GPU for inference.</p>
<p>Thus, it’s useful to keep simple and efficient models in your NLP problem-solving toolbox. <span class="citation" data-cites="cer2018universalsentenceencoder">Cer et al. (2018)</span> proposed one such model called “Universal Sentence Encoder”.</p>
<p>In this post, I will explain the core idea behind “Universal Sentence Encoder” and how it learns fixed-length sentence embeddings from a mixed corpus of supervised and unsupervised data.</p>
<section id="goal" class="level2">
<h2 class="anchored" data-anchor-id="goal">Goal</h2>
<p>We want to learn a model that can map a sentence to a <span style="color: #519657;">fixed-length vector representation</span>. This vector encodes the meaning of the sentence and thus can be used for downstream tasks such as searching for similar documents.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-goal.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Goal of Sentence Encoder"></p>
</figure>
</div>
</section>
<section id="why-learned-sentence-embeddings" class="level2">
<h2 class="anchored" data-anchor-id="why-learned-sentence-embeddings">Why Learned Sentence Embeddings?</h2>
<p>A naive technique to get a sentence embedding is to average the embeddings of the words in a sentence and use the average as the representation of the whole sentence. This approach has some challenges.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-word-embedding-average.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Averaging Word Vectors to Get Sentence Embedding"></p>
</figure>
</div>
<p>Let’s understand these challenges with some code examples using the spacy library. We first install spacy and create an <code>nlp</code> object to load the medium version of their model.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>shell</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="shell" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install spacy</span>
<span id="cb1-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> spacy download en_core_web_md</span></code></pre></div></div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> en_core_web_md</span>
<span id="cb2-2"></span>
<span id="cb2-3">nlp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> en_core_web_md.load()</span></code></pre></div></div>
<section id="challenge-1-loss-of-information" class="level3">
<h3 class="anchored" data-anchor-id="challenge-1-loss-of-information">Challenge 1: Loss of information</h3>
<p>If we calculate the cosine similarity of the documents given below using averaged word vectors, the similarity is pretty high even though the second document contains only the single word <code>It</code> and doesn’t have the same meaning as the first sentence.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>python</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" data-filename="python" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> nlp(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'It is cool'</span>).similarity(nlp(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'It'</span>))</span>
<span id="cb3-2"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8963861908844291</span></span></code></pre></div></div>
</div>
</section>
<section id="challenge-2-no-respect-for-order" class="level3">
<h3 class="anchored" data-anchor-id="challenge-2-no-respect-for-order">Challenge 2: No Respect for Order</h3>
<p>In this example, we swap the order of words in a sentence resulting in a sentence with a different meaning. Yet, the similarity obtained from averaged word vectors is 100%.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>python</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" data-filename="python" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> nlp(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'this is cool'</span>).similarity(nlp(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'is this cool'</span>))</span>
<span id="cb4-2"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span></code></pre></div></div>
</div>
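<p>To see why the score is exactly 1.0, here is a minimal sketch of averaging with plain Python. The 3-dimensional word vectors are made up for illustration and are not the spacy vectors. Since addition doesn’t care about order, both sentences end up with the same averaged vector.</p>

```python
import math

# Hypothetical toy word vectors, invented for this example.
TOY_VECTORS = {
    "this": [0.2, 0.7, 0.1],
    "is": [0.9, 0.1, 0.3],
    "cool": [0.4, 0.4, 0.8],
}


def average_embedding(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    vectors = [TOY_VECTORS[token] for token in sentence.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same words, different order: the averages match up to rounding.
similarity = cosine_similarity(
    average_embedding("this is cool"),
    average_embedding("is this cool"),
)
```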
<p>We could fix some of these challenges with hacky manual feature engineering like skipping stop-words, weighting the words by their TF-IDF scores, adding n-grams to respect order when averaging, concatenating embeddings, stacking max pooling and averaged embeddings and so on.</p>
<p>A different line of thought is training an end-to-end model to get us sentence embeddings:</p>
<blockquote class="blockquote">
<p><em>What if we could train a neural network to figure out how to best combine the word embeddings?</em></p>
</blockquote>
</section>
</section>
<section id="universal-sentence-encoderuse" class="level2">
<h2 class="anchored" data-anchor-id="universal-sentence-encoderuse">Universal Sentence Encoder(USE)</h2>
<p>On a high level, the idea is to design an <span style="color: #43A047;">encoder</span> that summarizes any given sentence to a <span style="color: #43A047;">512-dimensional</span> sentence embedding. We use this same embedding to solve <span style="color: #4E91A5;">multiple tasks</span> and based on the <span style="color: #E57373;">mistakes</span> it makes on those, we update the sentence embedding. Since the same embedding has to work on multiple generic tasks, it will capture only the most informative features and discard noise. The intuition is that this will result in a generic embedding that transfers universally to a wide variety of NLP tasks such as relatedness, clustering, paraphrase detection and text classification.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-overall-pipeline.png" class="img-fluid figure-img"></p>
<figcaption>Overall Pipeline of Universal Sentence Encoder</figcaption>
</figure>
</div>
<p>Let’s now dig deeper into each component of Universal Sentence Encoder.</p>
</section>
<section id="tokenization" class="level2">
<h2 class="anchored" data-anchor-id="tokenization">1. Tokenization</h2>
<p>First, the sentences are converted to lowercase and split into tokens using the Penn Treebank (PTB) tokenizer.</p>
</section>
<section id="encoder" class="level2">
<h2 class="anchored" data-anchor-id="encoder">2. Encoder</h2>
<p>This is the component that encodes a sentence into a fixed-length 512-dimensional embedding. In the paper, there are two architectures proposed based on trade-offs in accuracy vs inference speed.</p>
<section id="variant-1-transformer-encoder" class="level3">
<h3 class="anchored" data-anchor-id="variant-1-transformer-encoder">Variant 1: Transformer Encoder</h3>
<p>In this variant, we use the encoder part of the original transformer architecture. The architecture consists of 6 stacked transformer layers. Each layer has a self-attention module followed by a feed-forward network.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-transformer-one-layer.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Encoder Layer in Transformer"></p>
</figure>
</div>
<p>The self-attention process takes word order and surrounding context into account when generating each word representation. The output context-aware word embeddings are added element-wise and divided by the square root of the length of the sentence to account for the sentence-length difference. We get a 512-dimensional vector as output sentence embedding.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-transformer-variant.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Transformer Variant of Universal Sentence Encoder"></p>
</figure>
</div>
<p>This encoder has better accuracy on downstream tasks but higher memory and compute usage due to its complex architecture. Also, the compute time grows quickly with sentence length, as self-attention has <img src="https://latex.codecogs.com/png.latex?O(n%5E%7B2%7D)"> time complexity in the length of the sentence. But for short sentences, it is only moderately slower.</p>
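<p>The pooling step described above, where the context-aware word embeddings are summed and divided by the square root of the sentence length, can be sketched as follows. The function name is hypothetical and the token embeddings are plain lists standing in for the transformer outputs.</p>

```python
import math


def sqrt_length_pool(token_embeddings):
    """Sum token embeddings element-wise and scale by 1 / sqrt(num_tokens)."""
    length = len(token_embeddings)
    summed = [sum(dim) for dim in zip(*token_embeddings)]
    return [value / math.sqrt(length) for value in summed]


# Four toy 2-dimensional token embeddings pooled into one sentence vector.
sentence_vector = sqrt_length_pool(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
)
```

<p>Dividing by the square root of the length, instead of the length itself, sits between a plain sum and a plain average, so long sentences don’t end up with much larger vectors than short ones.</p>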
</section>
<section id="variant-2-deep-averaging-networkdan" class="level3">
<h3 class="anchored" data-anchor-id="variant-2-deep-averaging-networkdan">Variant 2: Deep Averaging Network(DAN)</h3>
<p>In this simpler variant, the encoder is based on the architecture proposed by <span class="citation" data-cites="iyyer-etal-2015-deep">Iyyer et al. (2015)</span>. First, the embeddings for the words and bi-grams present in a sentence are averaged together. Then, they are passed through a 4-layer feed-forward DNN to get a 512-dimensional sentence embedding as output. The embeddings for words and bi-grams are learned during training.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-deep-averaging-network-variant.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Deep Averaging Network Architecture"></p>
</figure>
</div>
<p>It has slightly reduced accuracy compared to the transformer variant, but the inference time is very efficient. Since we are only doing feedforward operations, the compute time is of linear complexity in terms of length of the input sequence.</p>
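<p>Here is a rough sketch of the DAN forward pass in plain Python. Everything concrete in it is an assumption made for illustration: the tiny dimensions, the random weights, the ReLU activation on the hidden layers and the helper names. Only the overall recipe of averaging word and bi-gram embeddings and passing them through 4 feed-forward layers comes from the description above.</p>

```python
import random


def average(vectors):
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """One fully-connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Random weights standing in for parameters learned during training."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dan_encode(tokens, embeddings, layers):
    """Average word and bi-gram embeddings, then apply the feed-forward stack."""
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    hidden = average([embeddings[unit] for unit in tokens + bigrams])
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        hidden = dense(hidden, weights, biases, identity if is_last else relu)
    return hidden


rng = random.Random(0)
tokens = ["this", "is", "cool"]
units = tokens + ["this is", "is cool"]
# Toy sizes: 8-dim inputs and a 16-dim output instead of 512.
embeddings = {unit: [rng.uniform(-1, 1) for _ in range(8)] for unit in units}
layers = [init_layer(rng, 8, 8) for _ in range(3)] + [init_layer(rng, 8, 16)]
sentence_embedding = dan_encode(tokens, embeddings, layers)
```

<p>The cost of the averaging step grows linearly with the number of tokens and the feed-forward stack has a fixed cost, which is where the linear compute time comes from.</p>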
</section>
</section>
<section id="multi-task-learning" class="level2">
<h2 class="anchored" data-anchor-id="multi-task-learning">3. Multi-task Learning</h2>
<p>To learn the sentence embeddings, the encoder is shared and trained across a range of unsupervised tasks along with supervised training on the SNLI corpus. The tasks are as follows:</p>
<section id="a.-modified-skip-thought" class="level3">
<h3 class="anchored" data-anchor-id="a.-modified-skip-thought">a. Modified Skip-thought</h3>
<p>The idea of the original skip-thought paper from <span class="citation" data-cites="kiros2015skipthoughtvectors">Kiros et al. (2015)</span> was to use the current sentence to predict the previous and next sentence.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/nlp-ssl-neighbor-sentence.gif" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Interactive example of skip-thought method"></p>
</figure>
</div>
<p>In USE, the same core idea is used, but instead of an LSTM encoder-decoder architecture, only an encoder based on the transformer or DAN is used. USE was trained on this task using the Wikipedia and News corpus.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-skipthought-task.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Skipthought formulation in Universal Sentence Encoder"></p>
</figure>
</div>
</section>
<section id="b.-conversational-input-response-prediction" class="level3">
<h3 class="anchored" data-anchor-id="b.-conversational-input-response-prediction">b. Conversational Input-Response Prediction</h3>
<p>In this task, we need to predict the correct response for a given input from a list containing the correct response and other randomly sampled responses. This task is inspired by <span class="citation" data-cites="henderson2017efficientnaturallanguageresponse">Henderson et al. (2017)</span> who proposed a scalable email reply prediction architecture. This also powered the “Smart Reply” feature in “Inbox by Gmail”.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-smart-reply-example.png" class="img-fluid figure-img"></p>
<figcaption>Smart reply in Google Inbox (<span class="citation" data-cites="henderson2017efficientnaturallanguageresponse">Henderson et al. (2017)</span>)</figcaption>
</figure>
</div>
<p>The USE authors use a corpus scraped from web question-answering pages and discussion forums and formulate this task using a sentence encoder. The input sentence is encoded into a vector u. The response is also encoded by the same encoder and the response embeddings are passed through a DNN to get vector v. This is done to model the difference in meaning between input and response. The dot product of these two vectors gives the relevance of an input to a response.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-input-response-prediction.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Response Prediction in Universal Sentence Encoder"></p>
</figure>
</div>
<p>Training is done by taking a batch of K randomly shuffled input-response pairs. In each batch, for an input, its paired response is taken as the correct response and the remaining responses are treated as incorrect. Then, the dot product scores are calculated and converted to probabilities using a softmax function. The model is trained to maximize the log likelihood of the correct response for each input.</p>
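<p>The batch training objective described above can be sketched in plain Python. The function name is hypothetical, and the vectors u and v are assumed to be already computed by the encoder and the response DNN.</p>

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(inputs, responses):
    """Average negative log-likelihood of the paired response for each input.

    inputs[i] and responses[i] form the correct pair; every other response
    in the batch acts as an incorrect candidate for inputs[i].
    """
    total = 0.0
    for i, u in enumerate(inputs):
        scores = [dot(u, v) for v in responses]
        # Log of the softmax denominator, computed stably.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[i]
    return total / len(inputs)


# Toy batch with K = 2: aligned pairs give a much lower loss than swapped ones.
u_batch = [[4.0, 0.0], [0.0, 4.0]]
aligned_loss = in_batch_softmax_loss(u_batch, [[4.0, 0.0], [0.0, 4.0]])
swapped_loss = in_batch_softmax_loss(u_batch, [[0.0, 4.0], [4.0, 0.0]])
```

<p>Reusing the other responses in the batch as negatives means no separate negative sampling step is needed.</p>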
</section>
<section id="c.-natural-language-inference" class="level3">
<h3 class="anchored" data-anchor-id="c.-natural-language-inference">c.&nbsp;Natural Language Inference</h3>
<p>In this task, we need to predict if a hypothesis entails, contradicts, or is neutral to a premise. The authors used the 570K sentence pairs from the <a href="https://nlp.stanford.edu/projects/snli/">SNLI</a> corpus to train USE on this task.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 34%">
<col style="width: 15%">
</colgroup>
<thead>
<tr class="header">
<th>Premise</th>
<th>Hypothesis</th>
<th>Judgement</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>A soccer game with multiple males playing</td>
<td>Some men are playing a sport</td>
<td>entailment</td>
</tr>
<tr class="even">
<td>I love Marvel movies</td>
<td>I hate Marvel movies</td>
<td>contradiction</td>
</tr>
<tr class="odd">
<td>I love Marvel movies</td>
<td>A ship arrived</td>
<td>neutral</td>
</tr>
</tbody>
</table>
<p>The sentence pairs are encoded using shared Transformer/DAN encoders and the output 512-dim embeddings u1 and u2 are obtained. Then, they are concatenated along with their L1 distance and their dot product (angle). This concatenated vector is passed through fully-connected layers and softmax is applied to get probabilities for the entailment/contradiction/neutral classes.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/use-snli-task.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="SNLI Architecture"></p>
</figure>
</div>
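<p>Here is a small sketch of how the pair features could be built. It assumes that the distance and product terms are computed element-wise, as in InferSent, which is an interpretation and not something spelled out above. The function names are hypothetical.</p>

```python
import math


def nli_features(u1, u2):
    """Concatenate u1, u2, |u1 - u2| and u1 * u2 (both element-wise)."""
    abs_diff = [abs(a - b) for a, b in zip(u1, u2)]
    product = [a * b for a, b in zip(u1, u2)]
    return u1 + u2 + abs_diff + product


def softmax(logits):
    """Turn the 3 class scores into probabilities that sum to 1."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-dimensional embeddings instead of 512-dimensional ones.
features = nli_features([1.0, 2.0], [3.0, -1.0])
# Stand-in logits for entailment / contradiction / neutral.
probabilities = softmax([1.0, 2.0, 3.0])
```

<p>With 512-dimensional embeddings, this feature vector would have 2048 dimensions before it goes into the fully-connected layers.</p>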
<p>The idea to learn sentence embedding based on SNLI seems to be inspired by the InferSent(<span class="citation" data-cites="conneau2018supervisedlearninguniversalsentence">Conneau et al. (2018)</span>) paper though the authors don’t cite it.</p>
</section>
</section>
<section id="inference" class="level2">
<h2 class="anchored" data-anchor-id="inference">4. Inference</h2>
<p>Once the model is trained using the above tasks, we can use it to map any sentence into a fixed-length 512-dimensional sentence embedding. This can be used for semantic search, paraphrase detection, clustering, smart-reply, text classification, and many other NLP tasks.</p>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>One caveat with the USE paper is that it doesn’t have a section comparing it with other competing sentence embedding methods on standard benchmarks. The paper seems to be written from an engineering perspective based on learnings from products such as Inbox by Gmail and Google Books.</p>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>The pre-trained models for “Universal Sentence Encoder” are available via Tensorflow Hub. You can use them to get embeddings or as pre-trained layers in Keras. You can refer to my <a href="https://amitness.com/posts/tensorflow-hub-for-transfer-learning">tutorial on Tensorflow Hub</a> to learn how to use them.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Thus, Universal Sentence Encoder is a strong baseline to try when comparing the accuracy gains of newer methods against the compute overhead. I have personally used it for semantic search, retrieval, and text clustering and it provides a decent balance of accuracy and inference speed.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-cer2018universalsentenceencoder" class="csl-entry">
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. <a href="https://arxiv.org/abs/1803.11175">Universal sentence encoder</a>.
</div>
<div id="ref-conneau2018supervisedlearninguniversalsentence" class="csl-entry">
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2018. <a href="https://arxiv.org/abs/1705.02364">Supervised learning of universal sentence representations from natural language inference data</a>.
</div>
<div id="ref-henderson2017efficientnaturallanguageresponse" class="csl-entry">
Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. <a href="https://arxiv.org/abs/1705.00652">Efficient natural language response suggestion for smart reply</a>.
</div>
<div id="ref-iyyer-etal-2015-deep" class="csl-entry">
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. <a href="https://doi.org/10.3115/v1/P15-1162">Deep unordered composition rivals syntactic methods for text classification</a>. In Chengqing Zong and Michael Strube, editors, <em>Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers)</em>, pages 1681–1691, Beijing, China. Association for Computational Linguistics.
</div>
<div id="ref-kiros2015skipthoughtvectors" class="csl-entry">
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. <a href="https://arxiv.org/abs/1506.06726">Skip-thought vectors</a>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{chaudhary2020,
  author = {Chaudhary, Amit},
  title = {Universal {Sentence} {Encoder} {Visually} {Explained}},
  date = {2020-06-15},
  url = {https://amitness.com/posts/universal-sentence-encoder.html},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-chaudhary2020" class="csl-entry quarto-appendix-citeas">
Amit Chaudhary. 2020. <a href="https://amitness.com/posts/universal-sentence-encoder.html">Universal
Sentence Encoder Visually Explained</a>.
</div></div></section></div> ]]></description>
  <category>nlp</category>
  <category>embeddings</category>
  <guid>https://amitness.com/posts/universal-sentence-encoder</guid>
  <pubDate>Mon, 15 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/use-overall-pipeline.png" medium="image" type="image/png" height="42" width="144"/>
</item>
<item>
  <title>Zero-shot Text Classification With Generative Language Models</title>
  <link>https://amitness.com/posts/zero-shot-classification-via-generation</link>
  <description><![CDATA[ 




<p>In my <a href="https://amitness.com/posts/zero-shot-text-classification">last post</a>, we explored a contrastive learning approach to zero-shot text classification. In this post, we will explore a different approach based on text generation. This approach was proposed by Puri et al.&nbsp;in their paper <a href="https://arxiv.org/abs/1912.10165">“Zero-shot Text Classification With Generative Language Models”</a>. The paper was also presented at the “3rd Workshop on Meta-Learning” at NeurIPS 2019.</p>
<p>The goal of zero-shot text classification is to design a general and flexible approach that can generalize to new classification tasks without the need for task-specific classification heads.</p>
<blockquote class="blockquote">
<p>Build a text classification model that can classify text into the classes of a new dataset it was never trained on.</p>
</blockquote>
<section id="paper-idea" class="level2">
<h2 class="anchored" data-anchor-id="paper-idea">Paper Idea</h2>
<p>In the paper, the authors reformulate text classification as a text generation problem. Instead of classifying a text into X classes, the model needs to generate the correct class when given a text and the classes in a multiple-choice question answering format. Both the input and the output of the model are in natural language.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zsl-generation-idea.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="High-level idea of zero-shot classification"></p>
</figure>
</div>
<p>Let’s understand how the authors implemented this idea in a step-by-step process:</p>
</section>
<section id="phase-1-pre-training" class="level2">
<h2 class="anchored" data-anchor-id="phase-1-pre-training">Phase 1: Pre-training</h2>
<p>As seen in the formulation above, we need to teach GPT-2 to pick the correct class when given the problem as a multiple-choice problem. The authors teach GPT-2 to do this by fine-tuning on a simple pre-training task called title prediction.</p>
<section id="gathering-data-for-weak-supervision" class="level3">
<h3 class="anchored" data-anchor-id="gathering-data-for-weak-supervision">1. Gathering Data for Weak Supervision</h3>
<p>In the original GPT-2 paper, the training data was prepared by scraping outbound web links that were shared in Reddit submissions or comments and had at least 3 karma.</p>
<p>In the current paper, the authors build upon this idea with the <a href="https://github.com/jcpeterson/openwebtext">OpenWebText</a> dataset. Since we can know the subreddit the link was posted in and the submission title the user used, this metadata can be collected and used as the supervision signal.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zsl-openwebtext.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Fetching submission title and subreddit"></p>
</figure>
</div>
<p>For multiple submissions of the same link, subreddits and submission titles can be aggregated. Thus, we have triples of webpage text, submission title, and subreddit name, with the latter two serving as annotations.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 70%">
<col style="width: 23%">
<col style="width: 5%">
</colgroup>
<thead>
<tr class="header">
<th>Scraped Text</th>
<th>Submission Title</th>
<th>Subreddit</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many …</td>
<td>OpenAI Releases Largest GPT-2 Text Generation Model</td>
<td>r/artificial</td>
</tr>
<tr class="even">
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
</tbody>
</table>
<p>The authors found that subreddit prediction didn’t generalize well, so they use the submission title in their experiments.</p>
</section>
<section id="multiple-choice-question-answering-format" class="level3">
<h3 class="anchored" data-anchor-id="multiple-choice-question-answering-format">2. Multiple choice question answering format</h3>
<p>To feed the annotated data into GPT-2, the authors prepared 26 different multiple-choice question formats. A random question format is sampled during training.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zsl-26-questions.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Multiple choice question answering template"></p>
</figure>
</div>
<p>Now for each document, we randomly choose between 2 and 15 titles. One title is correct for that document while all others are random titles.</p>
<p>We also add regularization by replacing one of the titles with “none of the above” 50% of the time. The correct title is itself replaced with “none of the above” with probability 1/(number of titles). Such noise can help train the model to choose “none of the above” if none of the choices match the content.</p>
<p>As shown below, the <span style="color: #5f4339; font-weight: bold;">titles</span> are placed after the <span style="color: #49AD4D;font-weight: bold;">question</span> as a comma-separated list.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 75%">
<col style="width: 9%">
<col style="width: 15%">
</colgroup>
<thead>
<tr class="header">
<th>Question</th>
<th>Text</th>
<th>Answer</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><span style="color: #087f23;">Which of these choices best describes the following document?:</span> “ <span style="color: #5f4339;">OpenAI Releases Largest GPT-2 Text Generation Model</span> ”, “ <span style="color: #5f4339;">Facebook buys Whatsapp</span> ”</td>
<td>We’ve trained a large-scale …</td>
<td>OpenAI Releases Largest GPT-2 Text Generation Model</td>
</tr>
</tbody>
</table>
<p>The question is prepended to the document to simulate a multiple-choice question answering task and a pre-trained GPT-2 language model is fine-tuned on this dataset to learn the submission title prediction task.</p>
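<p>To make the sampling procedure concrete, here is a minimal sketch of how one training example could be assembled. It is not the authors’ code: the function name <code>build_example</code>, the single question template, and the reading that the “none of the above” swap picks one title uniformly at random (so the correct title is hit with probability 1/(number of titles)) are assumptions made for illustration.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import random

NONE_OF_THE_ABOVE = "none of the above"

def build_example(document, correct_title, title_pool, rng):
    # Show between 2 and 15 titles in total: one correct, the rest random.
    num_titles = rng.randint(2, 15)
    distractor_pool = [t for t in title_pool if t != correct_title]
    choices = rng.sample(distractor_pool, num_titles - 1) + [correct_title]
    rng.shuffle(choices)

    answer = correct_title
    # Half of the time, swap one uniformly chosen title for "none of the above".
    # If the swapped title happens to be the correct one, the answer changes too.
    if rng.choice([True, False]):
        slot = rng.randrange(num_titles)
        if choices[slot] == correct_title:
            answer = NONE_OF_THE_ABOVE
        choices[slot] = NONE_OF_THE_ABOVE

    # Assumed template: question, comma-separated quoted titles, then the text.
    question = "Which of these choices best describes the following document?: "
    question += ", ".join(f'" {choice} "' for choice in choices)
    return {"prompt": f"{question} {document}", "answer": answer}</code></pre></div></div>
<p>Here <code>title_pool</code> is a hypothetical list of submission titles from other documents (it needs at least 14 of them), and <code>rng</code> is a <code>random.Random</code> instance so that the sampling is reproducible.</p>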
</section>
</section>
<section id="phase-2-zero-shot-classification" class="level2">
<h2 class="anchored" data-anchor-id="phase-2-zero-shot-classification">Phase 2: Zero-Shot Classification</h2>
<p>From the previous step, we have a model that has been trained on a wide variety of titles from the web and thus simulates meta-learning with N-way text classification tasks.</p>
<p>To test the zero-shot capabilities of the model, the authors tested it on 6 benchmark datasets without doing any finetuning.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 6%">
<col style="width: 93%">
</colgroup>
<thead>
<tr class="header">
<th>Dataset</th>
<th>Classes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>SST-2</td>
<td>Positive Sentiment, Negative Sentiment</td>
</tr>
<tr class="even">
<td>Yelp-2</td>
<td>Positive polarity, Negative polarity</td>
</tr>
<tr class="odd">
<td>Amazon-2</td>
<td>Positive polarity, Negative polarity</td>
</tr>
<tr class="even">
<td>AGNews</td>
<td>Science &amp; Technology, Business, Sports, World News</td>
</tr>
<tr class="odd">
<td>DBPedia</td>
<td>Company, Mean Of Transportation, Film, Office Holder, Written Work, Animal, Natural Place, Artist, Plant, Athlete, Album, Building, Village, Educational Institution</td>
</tr>
<tr class="even">
<td>Yahoo Answers</td>
<td>Family &amp; Relationships, Business &amp; Finance, Health, Society &amp; Culture, Education &amp; Reference, Entertainment &amp; Music, Science &amp; Mathematics, Computers &amp; Internet, Sports, Politics &amp; Government</td>
</tr>
</tbody>
</table>
<p>For each dataset, they perform the following steps:</p>
<ul>
<li><p>They convert the classes in each dataset into the same multiple-choice question format as pre-training and prepend it to the text. For example, for the SST-2 dataset, which contains movie reviews, the format would be:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 62%">
<col style="width: 24%">
<col style="width: 13%">
</colgroup>
<thead>
<tr class="header">
<th>Question</th>
<th>Text</th>
<th>Answer</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>To which category does the text belong?: “ Positive Sentiment ”, “ Negative Sentiment ”</td>
<td>the film is one of the year’s best</td>
<td>Positive Sentiment</td>
</tr>
</tbody>
</table></li>
<li><p>The question is prepended to the text and passed to GPT-2 as a prompt. Then we use greedy decoding to generate the output from GPT-2 and compare it with the actual class. Accuracy is then calculated for each dataset.</p></li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zsl-generation-downstream-usage.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Using GPT-2 to predict sentiment"></p>
</figure>
</div>
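<p>The evaluation loop described above can be sketched as follows. The <code>generate</code> argument is a hypothetical callable standing in for greedy decoding from the fine-tuned GPT-2 model, and the exact spacing of the prompt is an assumption rather than something taken from the paper.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def build_prompt(text, classes, question="To which category does the text belong?:"):
    # Classes are listed as quoted, comma-separated choices, then the text follows.
    options = ", ".join(f'" {name} "' for name in classes)
    return f"{question} {options} {text}"

def zero_shot_accuracy(examples, classes, generate):
    # examples: list of (text, gold_label) pairs
    # generate: hypothetical function mapping a prompt string to the generated answer
    correct = 0
    for text, label in examples:
        prediction = generate(build_prompt(text, classes)).strip()
        # Exact string match: anything that is not the gold class counts as wrong.
        correct += int(prediction == label)
    return correct / len(examples)</code></pre></div></div>
<p>Because the comparison is an exact string match, any generated answer that is not the gold class, including one that is not a valid class at all, is counted as an error.</p>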
</section>
<section id="results-and-insights" class="level2">
<h2 class="anchored" data-anchor-id="results-and-insights">Results and Insights</h2>
<p>Even without access to the training data, the model was able to achieve up to 45% improvement in classification accuracy over random and majority class baselines.</p>
<ul>
<li><p>For sentiment datasets such as SST-2, Amazon-2, and Yelp-2, the larger 355M GPT-2 model shows a significant improvement over the random and majority class baselines. Zero-shot performance is still below direct finetuning and the SOTA held by XLNet.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 76%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 7%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>SST-2</th>
<th>Amazon-2</th>
<th>Yelp-2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Random Guess</td>
<td>50.6</td>
<td>52.9</td>
<td>50.4</td>
</tr>
<tr class="even">
<td>Majority Class</td>
<td>49.9</td>
<td>49.3</td>
<td>49.2</td>
</tr>
<tr class="odd">
<td><span style="color: #49AD4D; font-weight: bold;">Zero-Shot 355M All Data</span></td>
<td><strong>62.5</strong></td>
<td><strong>80.2</strong></td>
<td><strong>74.7</strong></td>
</tr>
<tr class="even">
<td>355M Finetuned</td>
<td>93.23</td>
<td>97.115</td>
<td>94.479</td>
</tr>
<tr class="odd">
<td>SOTA (XLNet, 2019)</td>
<td>96.8</td>
<td>97.6</td>
<td>98.45</td>
</tr>
</tbody>
</table></li>
<li><p>Increasing the model size from 117M to 355M parameters leads to better zero-shot performance on downstream tasks.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 72%">
<col style="width: 9%">
<col style="width: 9%">
<col style="width: 9%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>SST-2</th>
<th>Amazon-2</th>
<th>Yelp-2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Zero-Shot 117M All Data</td>
<td>51.8</td>
<td>50.3</td>
<td>50.1</td>
</tr>
<tr class="even">
<td><span style="font-weight: bold;">Zero-Shot 355M All Data</span></td>
<td><strong>62.5</strong></td>
<td><strong>80.2</strong></td>
<td><strong>74.7</strong></td>
</tr>
</tbody>
</table></li>
<li><p>When pretraining is done on only 1/4th of the total data, overall performance decreases. This shows that pretraining across a diverse set of tasks is needed and a larger dataset provides that.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 72%">
<col style="width: 9%">
<col style="width: 9%">
<col style="width: 9%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>SST-2</th>
<th>Amazon-2</th>
<th>Yelp-2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Zero-Shot 355M 1/4 Data</td>
<td>61.7</td>
<td>64.5</td>
<td>58.5</td>
</tr>
<tr class="even">
<td><span style="font-weight: bold;">Zero-Shot 355M All Data</span></td>
<td><strong>62.5</strong></td>
<td><strong>80.2</strong></td>
<td><strong>74.7</strong></td>
</tr>
</tbody>
</table></li>
<li><p>For datasets with many classes like DBPedia, AGNews, and Yahoo Answers, the model performs noticeably better than random, but on DBPedia and Yahoo Answers it struggles to get much past 50% accuracy. The authors say this could be because the model can identify unlikely classes but struggles to choose between the most plausible options due to the lack of any supervision. Also, on these datasets the model pretrained on 1/4 of the data performs better than the one pretrained on the full dataset.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>AGNews</th>
<th>DBPedia</th>
<th>Yahoo Answers</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Random Guess</td>
<td>27.4</td>
<td>7.27</td>
<td>10.2</td>
</tr>
<tr class="even">
<td>Majority Class</td>
<td>25.3</td>
<td>7.6</td>
<td>9.9</td>
</tr>
<tr class="odd">
<td>Zero-Shot 117M All Data</td>
<td>40.2</td>
<td>39.6</td>
<td>26.1</td>
</tr>
<tr class="even">
<td>Zero-Shot 355M 1/4 Data</td>
<td><strong>68.3</strong></td>
<td><strong>52.5</strong></td>
<td><strong>52.2</strong></td>
</tr>
<tr class="odd">
<td>Zero-Shot 355M All Data</td>
<td>65.5</td>
<td>44.8</td>
<td>49.5</td>
</tr>
<tr class="even">
<td>355M Finetuned</td>
<td>94.87</td>
<td>99.0</td>
<td>72.79</td>
</tr>
<tr class="odd">
<td>SOTA</td>
<td>95.51</td>
<td>99.38</td>
<td>76.26</td>
</tr>
</tbody>
</table></li>
<li><p>The authors point out that there were controllability issues because GPT-2 sometimes generated answers that were not a valid class. For example, for the Yahoo Answers dataset, valid classes include “Education &amp; Reference” and “Science &amp; Mathematics”. But the model sometimes mixed these two and generated “education and mathematics”. This problem diminished as the model size was increased to 355M and the full data was used.</p></li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zsl-generation-controllability-issue.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Mixing of classes during generation"></p>
</figure>
</div>
<ul>
<li>Other issues were the generation of empty strings and the rearranging of the tokens of a valid answer, e.g.&nbsp;“Positive Sentiment” -&gt; “Sentiment Positive”. These problems were frequent with top-k and top-p sampling and rare with greedy decoding, so the authors chose greedy decoding.</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zsl-generation-challenges.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Challenges of using text generation"></p>
</figure>
</div>
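<p>When reproducing this kind of experiment, it can help to bucket the generated answers by failure type. The helper below is a hypothetical sketch (the name <code>diagnose</code> and the four categories are not from the paper) that separates valid classes from empty outputs, reordered tokens, and everything else, such as mixed classes.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def diagnose(output, classes):
    text = output.strip()
    if not text:
        return "empty"
    if text in classes:
        return "valid"
    # Same tokens as a valid class but in a different order.
    tokens = sorted(text.lower().split())
    for name in classes:
        if tokens == sorted(name.lower().split()):
            return "reordered"
    # Anything else, e.g. a mix of two classes.
    return "invalid"</code></pre></div></div>
<p>For example, with the SST-2 classes, “Sentiment Positive” is reported as <code>reordered</code> while an empty generation is reported as <code>empty</code>.</p>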
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The paper provides a good overview of the method and challenges of using generative language models for zero-shot classification and shows that natural language could be a promising meta-learning strategy for text problems.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Raul Puri et al., <a href="https://arxiv.org/abs/1912.10165">“Zero-shot Text Classification With Generative Language Models”</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <category>zero-shot-learning</category>
  <category>llm</category>
  <guid>https://amitness.com/posts/zero-shot-classification-via-generation</guid>
  <pubDate>Sun, 07 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/zsl-generation-idea.png" medium="image" type="image/png" height="44" width="144"/>
</item>
<item>
  <title>Exploring Knowledge Captured in Probability of Strings</title>
  <link>https://amitness.com/posts/knowledge-in-language-model</link>
  <description><![CDATA[ 




<p>I recently completed UC Berkeley’s <a href="https://www.youtube.com/playlist?list=PLwRJQ4m4UJjPiJP3691u-qWwPGVKzSlNP">Deep Unsupervised Learning</a> course. The course had an interesting <a href="https://www.youtube.com/watch?v=BnpB3GrpsfM">guest lecture</a> on the history of language modeling by Alec Radford, the author of the GPT model.</p>
<p>In one of his slides, Alec mentions how by simply observing a bunch of strings, language models tend to capture useful knowledge. He also mentions that maybe in the future, we could have an unsupervised language model that can be directly used on tasks without further fine-tuning. This talk was given before GPT-3 was released, and GPT-3 has since demonstrated the few-shot learning ability of language models.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/alex-slides.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Screenshot of slides from Alec Radford"></p>
</figure>
</div>
<p>In this post, I will share my exploration of the simple examples he mentioned in the lecture with code and expand more on them.</p>
<section id="probabilistic-language-modeling" class="level2">
<h2 class="anchored" data-anchor-id="probabilistic-language-modeling">Probabilistic Language Modeling</h2>
<p>In language modeling, we want to learn a function that can observe a bunch of strings and then compute the probability for new strings. For example, the function can give us how likely this sentence is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(good%5C%20luck)%0A"></p>
<p>There are many ways you could formulate this function. Here are some:</p>
<ul>
<li>We could discard context and simply assume each token is independent to get a unigram language model.</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(good%5C%20luck)%20%20%20=%20p(good)%20*%20p(luck)%0A"></p>
<ul>
<li>We could condition only on the previous word to get a bigram language model.</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(good%5C%20luck)%20%20%20=%20p(good)%20*%20p(luck%20%7C%20good)%0A"></p>
<ul>
<li><p>We could use an RNN and variants to keep track of the previous context in a hidden state.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20p(good%5C%20luck%5C%20man)%20%20%20=%20p(good)%20*%20p(luck%20%7C%20good,%20hidden%5C%20state)%20*%20p(man%20%7C%20luck,%20hidden%5C%20state)%0A%20%20"></p></li>
</ul>
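<p>As a toy illustration of the first two formulations, here is a count-based unigram and bigram model. The four-sentence corpus is made up for this example, and no smoothing is applied, so unseen word pairs get a probability of zero.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from collections import Counter

# Made-up corpus of tokenized sentences.
corpus = [["good", "luck"], ["good", "morning"], ["bad", "luck"], ["good", "luck"]]

unigrams = Counter(token for sentence in corpus for token in sentence)
bigrams = Counter(pair for sentence in corpus for pair in zip(sentence, sentence[1:]))
total = sum(unigrams.values())

def p_unigram(sentence):
    # p(w1 w2 ...) = p(w1) * p(w2) * ...
    prob = 1.0
    for token in sentence:
        prob *= unigrams[token] / total
    return prob

def p_bigram(sentence):
    # p(w1 w2 ...) = p(w1) * p(w2 | w1) * ...
    prob = unigrams[sentence[0]] / total
    for previous, token in zip(sentence, sentence[1:]):
        prob *= bigrams[(previous, token)] / unigrams[previous]
    return prob</code></pre></div></div>
<p>On this corpus, <code>p_unigram(["good", "luck"])</code> is 3/8 * 3/8 = 9/64, while <code>p_bigram(["good", "luck"])</code> is 3/8 * 2/3 = 0.25, because “luck” follows “good” in two of the three places where “good” appears.</p>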
</section>
<section id="what-could-it-have-learned" class="level2">
<h2 class="anchored" data-anchor-id="what-could-it-have-learned">What could it have learned?</h2>
<p>Let’s take GPT-2 as a language model and explore what it has learned by just observing a bunch of strings over the internet.</p>
<p>We will use the <a href="https://github.com/simonepri/lm-scorer">lm-scorer</a> library to calculate the probability of a sentence using transformer-based language models.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">pip install lm<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>scorer</span></code></pre></div></div>
<p>Let’s create a scorer function that gives us a probability of a sentence using the GPT-2 language model.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> lm_scorer.models.auto <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> AutoLMScorer</span>
<span id="cb2-2">scorer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> AutoLMScorer.from_pretrained(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt2-large"</span>)</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> score(sentence):</span>
<span id="cb2-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> scorer.sentence_score(sentence)</span></code></pre></div></div>
<p>Now, we can use it for any sentence as shown below and it returns the probability.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'good luck'</span>)</span>
<span id="cb3-2"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.658163769270644e-11</span></span></code></pre></div></div>
<section id="grammar" class="level3">
<h3 class="anchored" data-anchor-id="grammar">Grammar</h3>
<p>A language model has no prior knowledge of grammar rules and structure. But it has been exposed to a bunch of grammatically correct sentences in the large training corpus. Let’s explore how much grammar it has picked up.</p>
<ul>
<li><p>The language model assigns a higher probability to a sentence with the correct order of subject, verb, and object than to one with an incorrect order.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I like it'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'like it I'</span>)</span>
<span id="cb4-2"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div></div></li>
<li><p>We have two similar sentences given below. Sentence 2 has a grammatical mistake.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 26%">
<col style="width: 73%">
</colgroup>
<thead>
<tr class="header">
<th>sentence 1</th>
<th>sentence 2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>The cat sat on the mat</td>
<td>The cat <span style="color: #d32f2f;">sats</span> on the mat</td>
</tr>
</tbody>
</table>
<p>We would want our language model to assign more probability to the correct sentence 1. Let’s verify if this is the case with GPT-2.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20p(sentence%201)%20%3E%20p(sentence%202)%0A%20%20"></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">p1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The cat sat on the mat'</span>)</span>
<span id="cb5-2">p2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The cat sats on the mat'</span>)</span></code></pre></div></div>
<p>The language model indeed assigns more probability to the grammatically correct sentence.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(p1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> p2)</span>
<span id="cb6-2"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div></div></li>
</ul>
</section>
<section id="world-knowledge" class="level3">
<h3 class="anchored" data-anchor-id="world-knowledge">World Knowledge</h3>
<p>The text corpus a language model is trained on contains lots of facts about the world. Can a language model pick that up? Let’s see an example.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">fact1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The cat sat on the mat'</span>)</span>
<span id="cb7-2">fact2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The hyena sat on the mat'</span>)</span></code></pre></div></div>
<p>Who does GPT-2 think is more likely to sit on a mat: the cat or the hyena?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(fact1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> fact2)</span>
<span id="cb8-2"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div></div>
<p>It’s the cat. This makes sense as cats are domesticated and hyenas are wild animals.</p>
</section>
<section id="sentiment-analysis" class="level3">
<h3 class="anchored" data-anchor-id="sentiment-analysis">Sentiment Analysis</h3>
<p>Alec presents another idea where we find the conditional probability of a positive/negative opinion following some text to perform sentiment analysis. For example, we could calculate the probability of “Sentiment: Positive.” and “Sentiment: Negative.” coming after a text and assign whichever sentiment is more probable.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(Sentiment:%5C%20Positive.%5C%20%7C%5C%20sentence)%5C%5C%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(Sentiment:%5C%20Negative.%5C%20%7C%5C%20sentence)%0A"></p>
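<p>Let’s build a function to compute the two scores and return the sentiment based on whichever is higher. The function scores the full string (sentence plus label) rather than the conditional probability directly. Since both strings share the same sentence as a prefix, comparing these joint probabilities gives the same ranking as comparing the conditional probabilities.</p>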
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> sentiment(sentence):</span>
<span id="cb9-2">    positive_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> score(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>sentence<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> Sentiment: Positive.'</span>)</span>
<span id="cb9-3">    negative_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> score(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>sentence<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> Sentiment: Negative.'</span>)</span>
<span id="cb9-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'positive'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> positive_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> negative_score <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'negative'</span></span></code></pre></div></div>
<p>We can try with a few sentences.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> sentiment(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Awesome product.'</span>)</span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'positive'</span></span>
<span id="cb10-3"></span>
<span id="cb10-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> sentiment(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the app failed to run'</span>)</span>
<span id="cb10-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'negative'</span></span>
<span id="cb10-6"></span>
<span id="cb10-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> sentiment(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'this is not a good idea'</span>)</span>
<span id="cb10-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'negative'</span></span>
<span id="cb10-9"></span>
<span id="cb10-10"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> sentiment(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the app rocks'</span>)</span>
<span id="cb10-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'positive'</span></span></code></pre></div></div>
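<p>The same idea extends beyond two hard-coded labels. Below is a small sketch that picks the most probable label from any list. The name <code>classify</code> and the <code>template</code> argument are hypothetical, and <code>score</code> is passed in as a parameter so that any sentence-scoring function, such as the one defined earlier, can be plugged in.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def classify(sentence, labels, score, template="{sentence} Sentiment: {label}."):
    # Score the sentence followed by each candidate label.
    scored = {
        label: score(template.format(sentence=sentence, label=label))
        for label in labels
    }
    # Return the label whose full string the language model finds most probable.
    return max(scored, key=scored.get)</code></pre></div></div>
<p>For instance, <code>classify('the app rocks', ['Positive', 'Negative'], score)</code> compares the same two strings as the <code>sentiment</code> function above.</p>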
</section>
<section id="bias" class="level3">
<h3 class="anchored" data-anchor-id="bias">Bias</h3>
<p>Since these models are trained on human-written text in the wild, they are bound to capture the inherent bias in this text. Here are some examples:</p>
<ul>
<li><p>The model finds it more probable for the pronoun to be “he” for a doctor or scientist and “she” for a nurse.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The doctor came. He'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The doctor came. She'</span>)</span>
<span id="cb11-2"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.702219615279396</span></span>
<span id="cb11-3"></span>
<span id="cb11-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The scientist came. He'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The scientist came. She'</span>)</span>
<span id="cb11-5"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.9469981043432845</span></span>
<span id="cb11-6"></span>
<span id="cb11-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The nurse came. She'</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> score(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'The nurse came. He'</span>)</span>
<span id="cb11-8"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.709184896139912</span></span></code></pre></div></div></li>
</ul>
<!--
## Draft
- p("4" | "2+2=") be 1?

speech recognition:
- prune space of possible transcription from the acoustic model
famous example: "wreck a nice beach" vs "recognize speech"
context: "recognize speech" > "wreck a nice beach"

machine translation:
re-rank possible translations?
en - fr: proposal -> language model -> how likely is it?
-->
</section>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Alec Radford, <a href="https://www.youtube.com/watch?v=BnpB3GrpsfM">“L11 Language Models – guest instructor: Alec Radford (OpenAI) — Deep Unsupervised Learning SP20”</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <category>llm</category>
  <guid>https://amitness.com/posts/knowledge-in-language-model</guid>
  <pubDate>Sun, 07 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/alex-slides.png" medium="image" type="image/png" height="92" width="144"/>
</item>
<item>
  <title>Zero Shot Learning for Text Classification</title>
  <link>https://amitness.com/posts/zero-shot-text-classification</link>
  <description><![CDATA[ 




<p>The recent release of GPT-3 got me interested in the state of zero-shot learning and few-shot learning in NLP. While most of the zero-shot learning research is concentrated in Computer Vision, there has been some interesting work in the NLP domain as well.</p>
<p>I will be writing a series of blog posts to cover existing research on zero-shot learning in NLP. In this first post, I will explain the paper <a href="https://arxiv.org/abs/1712.05972">“Train Once, Test Anywhere: Zero-Shot Learning for Text Classification”</a> by Pushp et al.&nbsp;This paper from December 2017 was the first work to propose a zero-shot learning paradigm for text classification.</p>
<section id="what-is-zero-shot-learning" class="level2">
<h2 class="anchored" data-anchor-id="what-is-zero-shot-learning">What is Zero-Shot Learning?</h2>
<p>Zero-Shot Learning is the ability to detect classes that the model has never seen during training. It resembles our ability as humans to generalize and identify new things without explicit supervision.</p>
<p>For example, let’s say we want to do <span style="color: #546E7A; font-weight: bold;">sentiment classification</span> and <span style="color: #795548; font-weight: bold;">news category</span> classification. Normally, we would train or fine-tune a new model for each dataset. In contrast, with zero-shot learning, we can perform tasks such as sentiment and news classification directly, without any task-specific training.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-vs-transfer.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Zero Shot Learning vs Transfer Learning"></p>
</figure>
</div>
</section>
<section id="train-once-test-anywhere" class="level2">
<h2 class="anchored" data-anchor-id="train-once-test-anywhere">Train Once, Test Anywhere</h2>
<p>In the paper, the authors propose a simple idea for zero-shot classification. Instead of classifying texts into X classes, they re-formulate the task as a binary classification to determine if a text and a class are related or not.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-paper-idea.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="High level idea of Train Once, Test Anywhere"></p>
</figure>
</div>
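<p>The reformulation can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the paper: the helper name <code>to_binary_pairs</code>, the headline and the label set are all made up for the example.</p>

```python
def to_binary_pairs(text, true_label, candidate_labels):
    """Expand one multi-class example into one binary example per label."""
    return [
        (text, label, int(label == true_label))
        for label in candidate_labels
    ]


# Hypothetical headline and label set, for illustration only.
labels = ["sports", "politics", "technology"]
pairs = to_binary_pairs("New phone unveiled at launch event", "technology", labels)
for row in pairs:
    print(row)
```

<p>Each row is a (text, label, related?) triple, so the model never needs a fixed output layer with one unit per class.</p>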
<p>Let’s understand their formulation and end-to-end process in more detail now.</p>
</section>
<section id="data-preparation" class="level2">
<h2 class="anchored" data-anchor-id="data-preparation">1. Data Preparation</h2>
<p>The authors crawled 4.2 million <span style="color: #7E57C2; font-weight: bold;">news headlines</span> from the web and used the <span style="color: #795548; font-weight: bold;">SEO tags</span> of each news article as the <span style="color: #795548; font-weight: bold;">labels</span>. After crawling, they had a total of <span style="color: #795548; font-weight: bold;">300,000 unique tags</span> as labels. We can see how troublesome it would be to train a supervised model on <span style="color: #795548; font-weight: bold;">300,000 classes</span>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-data-crawling.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Crawling headline and SEO metatags from news"></p>
</figure>
</div>
<p>Each <span style="color: #7E57C2; font-weight: bold;">headline</span> was truncated to 28 words and anything shorter was padded.</p>
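<p>A minimal sketch of this fixed-length step is shown below. The post only gives the length of 28 words, so the whitespace tokenization, the <code>&lt;pad&gt;</code> token and padding on the right are assumptions made for the example.</p>

```python
MAX_LEN = 28     # headline length used in the paper
PAD = "<pad>"    # hypothetical padding token


def pad_or_truncate(headline, max_len=MAX_LEN, pad=PAD):
    """Return exactly `max_len` tokens: cut long headlines, pad short ones."""
    tokens = headline.split()[:max_len]              # whitespace tokenization (assumed)
    return tokens + [pad] * (max_len - len(tokens))  # right-padding (assumed)


short = pad_or_truncate("Markets rally after rate cut")
long = pad_or_truncate(" ".join(f"w{i}" for i in range(40)))
print(len(short), len(long))  # 28 28
```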
</section>
<section id="word-embedding" class="level2">
<h2 class="anchored" data-anchor-id="word-embedding">2. Word Embedding</h2>
<p>The paper uses word2vec pre-trained on Google News as the word embeddings for both the sentences and the labels.</p>
</section>
<section id="model-architecture" class="level2">
<h2 class="anchored" data-anchor-id="model-architecture">3. Model Architecture</h2>
<p>The paper proposes three different architectures to learn the relation between sentence and label embeddings.</p>
</section>
<section id="a.-architecture-1" class="level2">
<h2 class="anchored" data-anchor-id="a.-architecture-1">a. Architecture 1</h2>
<p>In this architecture, we take the mean of word embeddings in the sentence as the sentence embedding and concatenate it with the <span style="color: #4396f3;">label embedding</span>. This vector is then passed through a <span style="color: #36a4ab;">fully connected layer</span> to classify if the sentence and label are related or not.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-architecture-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Architecture 1 of Zero-shot Text Classification"></p>
</figure>
</div>
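<p>To make the data flow concrete, here is a rough pure-Python sketch of the forward pass. It is not the paper’s implementation: a single linear unit with a sigmoid stands in for the fully connected layer, and the 2-dimensional embeddings, weights and function names are toy values chosen for the example.</p>

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def relatedness(word_vectors, label_vector, weights, bias):
    """Score a (sentence, label) pair in the style of Architecture 1."""
    sentence_vector = mean_vector(word_vectors)
    features = sentence_vector + list(label_vector)   # concatenation
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))              # sigmoid


# Toy 2-d embeddings and untrained (all-zero) weights, for illustration only.
words = [[1.0, 0.0], [0.0, 1.0]]
label = [0.5, 0.5]
prob = relatedness(words, label, weights=[0.0] * 4, bias=0.0)
print(prob)  # 0.5, since the logit is 0 with zero weights
```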
</section>
<section id="b.-architecture-2" class="level2">
<h2 class="anchored" data-anchor-id="b.-architecture-2">b. Architecture 2</h2>
<p>In this architecture, instead of taking the mean, the word embeddings are passed through an LSTM and the <span style="color: #554f92;">last hidden state</span> of the network is treated as the sentence vector. It is concatenated with the <span style="color: #4396f3;">word vector of the label</span> and then passed through a <span style="color: #36a4ab;">fully connected layer</span> to classify if the sentence and label are related or not.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-architecture-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Architecture 2 of Zero-shot Text Classification"></p>
</figure>
</div>
</section>
<section id="c.-architecture-3" class="level2">
<h2 class="anchored" data-anchor-id="c.-architecture-3">c.&nbsp;Architecture 3</h2>
<p>In this architecture, the embedding of each word in the sentence is concatenated with the <span style="color: #4396f3;">embedding of the label</span>. This combined embedding is passed through an LSTM and the <span style="color: #554f92;">last hidden state</span> of the network is taken. It is then passed through a <span style="color: #36a4ab;">fully connected layer</span> to classify if the sentence and label are related or not.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-architecture-3.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Architecture 3 of Zero-shot Text Classification"></p>
</figure>
</div>
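<p>The main difference from Architecture 2 is where the label enters the network. The sketch below shows only how the per-word inputs could be built before they reach the LSTM. The LSTM itself is left out, and the function name and toy 2-dimensional vectors are assumptions for illustration.</p>

```python
def attach_label(word_vectors, label_vector):
    """Append the label embedding to every word embedding in the sentence."""
    return [list(word) + list(label_vector) for word in word_vectors]


# Toy 2-d embeddings for a three-word sentence, for illustration only.
words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
label = [0.9, 0.8]
lstm_inputs = attach_label(words, label)
print(lstm_inputs[0])  # [0.1, 0.2, 0.9, 0.8]
```

<p>Every time step now sees the label, so the LSTM can read the sentence with the candidate label in mind instead of meeting it only after the sentence has been encoded.</p>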
</section>
<section id="training" class="level2">
<h2 class="anchored" data-anchor-id="training">4. Training</h2>
<p>Using the crawled news headlines dataset, training pairs are built so that half of them pair a headline with one of its actual labels and the other half pair it with a randomly selected unrelated label. The model is then trained with each of the three architectures above, using a binary cross-entropy loss and the Adam optimizer.</p>
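<p>One way to picture the pair construction and the loss is the sketch below. The exact sampling procedure is not described in this post, so drawing one random unrelated tag per real tag is an assumption, and the tags, headline and function names are hypothetical.</p>

```python
import math
import random


def build_pairs(headline, true_tags, all_tags, rng):
    """Pair a headline with its real tags (1) and as many random other tags (0)."""
    positives = [(headline, tag, 1) for tag in true_tags]
    unrelated = [tag for tag in all_tags if tag not in true_tags]
    negatives = [(headline, tag, 0) for tag in rng.sample(unrelated, len(true_tags))]
    return positives + negatives


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for a single pair; `eps` keeps log() away from zero."""
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# Hypothetical tags and headline, for illustration only.
all_tags = ["economy", "football", "election", "smartphone", "weather", "film"]
pairs = build_pairs("Central bank cuts rates", ["economy"], all_tags, random.Random(0))
print(pairs)
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(0, 0.9))
```

<p>A confident correct prediction (label 1, probability 0.9) gets a small loss, while the same probability on an unrelated pair (label 0) is penalized heavily.</p>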
<p>In the paper, Architecture 3 achieves the highest accuracy on the binary classification task at 74%, followed by Architecture 2 at 72.6% and Architecture 1 at 72%, all measured on a separate test split of the news headlines dataset.</p>
</section>
<section id="zero-shot-classification" class="level2">
<h2 class="anchored" data-anchor-id="zero-shot-classification">5. Zero-Shot Classification</h2>
<p>Now, taking the trained model that can compute a relatedness score between a sentence and a label, the authors tested its ability to generalize to unseen datasets and labels.</p>
<ul>
<li><p>The authors tested their model on a hold-out test set containing labels not present during training. They achieved 78%, 76% and 81% accuracy on the binary classification task with Architectures 1, 2 and 3 respectively.<br>
</p></li>
<li><p><strong>UCI News Aggregator Dataset:</strong><br>
In this dataset, there are 420,000 sentences with 4 labels: technology, business, medicine and entertainment. The authors propose a heuristic called a category tree, in which each label is expanded with related words. The process is as follows:</p>
<ul>
<li>Take each unseen label and add a few words related to that concept. For example, related words for business can be ‘finance’ and ‘revenue’.</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-category-tree.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Category Tree of News Aggregator Dataset"></p>
</figure>
</div>
<ul>
<li>To score a category for a sentence, they predict the relatedness of the sentence to each related word under that category and take the mean as the final relatedness.<br>
</li>
<li>The classes whose mean relatedness probability is above a threshold are taken as the predicted classes. This threshold is a hyperparameter, and the paper uses 0.5.</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://amitness.com/posts/images/zero-shot-threshold.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Threshold to assume label and text are matched"></p>
</figure>
</div>
<p>The authors tested this process on the entire dataset and achieved 61.73%, 63% and 64.21% accuracy. In comparison, supervised methods achieve 94.75% accuracy. The result is still interesting: without training on a single sample from this dataset, the model achieves better-than-random accuracy.</p></li>
<li><p><strong>Tweet Classification:</strong><br>
This dataset has 1993 sentences with 6 labels: business, health, politics, sports, technology and entertainment. The authors tested their method over the whole dataset using a threshold of 0.5 and a category tree expansion with 3 related words, and achieved 64.5% accuracy with Architecture 3. In comparison, a supervised method such as multinomial Naive Bayes trained on the whole dataset reaches 78% accuracy.</p></li>
</ul>
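<p>The category tree procedure described above can be sketched as follows. This is a hedged illustration rather than the authors’ code: <code>toy_relatedness</code> is a stand-in for the trained model with made-up scores, the tree contents other than ‘finance’ and ‘revenue’ are invented, and including the label itself among its related words is an assumption.</p>

```python
def predict_categories(sentence, category_tree, relatedness, threshold=0.5):
    """Return every category whose mean relatedness exceeds the threshold."""
    predicted = []
    for category, related_words in category_tree.items():
        scores = [relatedness(sentence, word) for word in related_words]
        if sum(scores) / len(scores) > threshold:
            predicted.append(category)
    return predicted


# Hypothetical category tree; only 'finance' and 'revenue' come from the post.
category_tree = {
    "business": ["business", "finance", "revenue"],
    "technology": ["technology", "software", "gadget"],
}

# Stand-in for the trained model: fixed, made-up relatedness scores.
toy_scores = {
    "business": 0.9, "finance": 0.8, "revenue": 0.7,
    "technology": 0.1, "software": 0.2, "gadget": 0.3,
}


def toy_relatedness(sentence, word):
    return toy_scores[word]


result = predict_categories("Quarterly profits beat forecasts", category_tree, toy_relatedness)
print(result)  # ['business']
```

<p>Because the threshold is applied per category, a sentence can end up with several predicted categories or with none at all, which is one reason the threshold is treated as a hyperparameter.</p>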
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The paper proposes some really simple but clever techniques to learn the relationship between sentences and labels, and achieves better-than-random accuracy on unseen datasets and labels. Since this was proposed in the pre-transformer era, it would be interesting to try these ideas with recent models.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Pushpankar Kumar Pushp, et al.&nbsp;<a href="https://arxiv.org/abs/1712.05972">“Train Once, Test Anywhere: Zero-Shot Learning for Text Classification”</a></li>
</ul>


</section>

 ]]></description>
  <category>nlp</category>
  <category>zero-shot-learning</category>
  <guid>https://amitness.com/posts/zero-shot-text-classification</guid>
  <pubDate>Sat, 30 May 2020 00:00:00 GMT</pubDate>
  <media:content url="https://amitness.com/posts/images/zero-shot-paper-idea.png" medium="image" type="image/png" height="48" width="144"/>
</item>
</channel>
</rss>
