<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Max Halford</title><link>https://maxhalford.github.io/</link><description>Recent content on Max Halford</description><generator>Hugo</generator><language>en-us</language><managingEditor>maxhalford25@gmail.com (Max Halford)</managingEditor><webMaster>maxhalford25@gmail.com (Max Halford)</webMaster><lastBuildDate>Wed, 08 Apr 2026 11:08:45 +0200</lastBuildDate><atom:link href="https://maxhalford.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>Lower your warehouse costs via DuckDB transpilation</title><link>https://maxhalford.github.io/blog/warehouse-cost-reduction-quack-mode/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/warehouse-cost-reduction-quack-mode/</guid><description>&lt;p&gt;Many people &lt;a href="https://survey.stackoverflow.co/2025/technology#2-databases"&gt;seem to admire&lt;/a&gt; DuckDB. But most of us are stuck with our traditional warehouses, because they&amp;rsquo;re entrenched in our data stacks and IT landscape. This is with good reason: BigQuery, Snowflake, ClickHouse and co. are great software. But they&amp;rsquo;re not cheap, and keeping a warehouse&amp;rsquo;s monthly bill under control is non-trivial.&lt;/p&gt;
&lt;p&gt;What if you could get the best of both worlds? Tables could be kept in your warehouse of choice, but computed with DuckDB. There&amp;rsquo;s been some discussion on multi-engine warehouses &amp;ndash; see &lt;a href="https://juhache.substack.com/p/multi-engine-stacks-deserve-to-be"&gt;this article&lt;/a&gt; by Julien Hurault and Sung Won Chung. But their proposal is to assign each table to one engine. Multi-engine stacks can be useful, and SQLMesh provides &lt;a href="https://sqlmesh.readthedocs.io/en/latest/guides/multi_engine"&gt;support&lt;/a&gt; for it. But to me it sounds too sophisticated, and I&amp;rsquo;m not convinced it&amp;rsquo;s what most practitioners want/need.&lt;/p&gt;</description></item><item><title>Text classification with Python 3.14's zstd module</title><link>https://maxhalford.github.io/blog/text-classification-zstd/</link><pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/text-classification-zstd/</guid><description>&lt;p&gt;Python 3.14 &lt;a href="https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-zstandard"&gt;introduced&lt;/a&gt; the &lt;a href="https://docs.python.org/3/library/compression.zstd.html"&gt;&lt;code&gt;compression.zstd&lt;/code&gt;&lt;/a&gt; module. It is a standard library implementation of Facebook&amp;rsquo;s &lt;a href="https://en.wikipedia.org/wiki/Zstd"&gt;Zstandard (Zstd)&lt;/a&gt; compression algorithm. The algorithm was developed a decade ago by Yann Collet, who keeps a &lt;a href="https://fastcompression.blogspot.com/"&gt;blog&lt;/a&gt; devoted to compression algorithms.&lt;/p&gt;
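A minimal sketch of the trick this post builds on: pick the label whose training corpus compresses best together with the new text. zlib stands in here so the snippet runs on any Python version; on 3.14, the compress function from compression.zstd slots in the same way.

```python
import zlib

def compressed_size(text):
    """Number of bytes the text occupies once compressed."""
    return len(zlib.compress(text.encode()))

def classify(text, corpora):
    # The label whose corpus grows the least, in compressed bytes, when the
    # new text is appended wins: shared vocabulary compresses away.
    def extra_bytes(label):
        base = corpora[label]
        return compressed_size(base + " " + text) - compressed_size(base)
    return min(corpora, key=extra_bytes)

corpora = {
    "animals": "cat dog horse cow sheep goat pig hen duck",
    "colours": "red blue green yellow purple orange pink",
}
print(classify("cat dog horse", corpora))
```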
&lt;p&gt;I am not a compression expert, but Zstd caught my eye because it supports incremental compression. You can feed it data to compress in chunks, and it will maintain an internal state. It&amp;rsquo;s particularly well &lt;a href="https://facebook.github.io/zstd/"&gt;suited&lt;/a&gt; for compressing small data. It&amp;rsquo;s perfect for the trick of classifying text via compression, which I described in &lt;a href="https://maxhalford.github.io/blog/text-classification-by-compression/"&gt;a previous blog post&lt;/a&gt; 5 years ago.&lt;/p&gt;</description></item><item><title>Solving Détrak with brute force</title><link>https://maxhalford.github.io/blog/detrak-solver/</link><pubDate>Mon, 02 Feb 2026 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/detrak-solver/</guid><description>&lt;p&gt;&lt;a href="https://cdn.1j1ju.com/medias/6a/e0/d8-detrak-regle.pdf"&gt;Détrak&lt;/a&gt; is a simple board game. There&amp;rsquo;s a 5x5 grid, and players need to place 12 domino-like pieces to cover the grid completely. The domino symbols are determined by rolling two dice. The dice are six-sided, and their rolls are shared by all players. Points are scored based on adjacencies of matching symbols.&lt;/p&gt;
&lt;p&gt;
 &lt;img src="https://github.com/MaxHalford/detrak/raw/main/46.jpg" width="60%"&gt;
&lt;/p&gt;
&lt;p&gt;The goal is to find the optimal placement of the pieces to maximize the score based on the rolled symbols. What makes it competitive is that everyone has to work with the same rolls. This involves logical thinking and spatial reasoning. But you also have to juggle luck and risk, because you can&amp;rsquo;t predict the dice rolls.&lt;/p&gt;</description></item><item><title>Nostalgia for a time I didn’t experience</title><link>https://maxhalford.github.io/blog/anemoia/</link><pubDate>Sat, 17 Jan 2026 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/anemoia/</guid><description>&lt;p&gt;Our everyday vocabulary fails to capture subtle emotions. Some languages have developed words to fill these gaps. An example that became popular is &lt;em&gt;Schadenfreude&lt;/em&gt;, which means taking pleasure in someone else’s misfortune. It has an opposite in German, &lt;em&gt;Fremdscham&lt;/em&gt;, which is the vicarious embarrassment you feel when witnessing someone else’s humiliation. Its Finnish equivalent &lt;em&gt;myötähäpeä&lt;/em&gt; is slightly better known.&lt;/p&gt;
&lt;p&gt;I like these words that capture complex feelings. When I was in high school, Wittgenstein made a lasting impression on my very naive mind. He argued that language can only picture facts about the world, and that ethics, aesthetics, and the mystical lie beyond what language can express. I think this is fascinatingly true. I&amp;rsquo;m convinced that a share of our daily sorrows comes from our inability to put them in words. It reassures me when I stumble on a word that expresses a feeling I can&amp;rsquo;t explain.&lt;/p&gt;</description></item><item><title>Row level lineage at Carbonfact</title><link>https://maxhalford.github.io/blog/row-level-lineage/</link><pubDate>Fri, 09 Jan 2026 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/row-level-lineage/</guid><description/></item><item><title>No pain no startup</title><link>https://maxhalford.github.io/blog/startup-pain-management/</link><pubDate>Mon, 27 Oct 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/startup-pain-management/</guid><description>&lt;p&gt;I&amp;rsquo;ve been working at Carbonfact for close to 4 years. Two other people and I were the first hires. I got to build a large share of the initial systems. I&amp;rsquo;ve been involved in many business decisions, but my contributions have mostly been technical.&lt;/p&gt;
&lt;p&gt;I had to decide on a lot of things at first: what data warehouse to use, the internal data models, how to organize our Python code, our definition of self-serve analytics, and even seemingly mundane things like Pandas vs. Polars. There was a lot to do and it was fun. There was no pushback because there was nobody else to push back. I could just build what I thought was best. I even pushed to the &lt;code&gt;main&lt;/code&gt; code branch, because nobody else had to review my code. There was always something to build as the startup garnered traction. It felt painless, in a naive sort of way.&lt;/p&gt;</description></item><item><title>Scraping Google Calendar events</title><link>https://maxhalford.github.io/blog/google-calendar-scraping/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/google-calendar-scraping/</guid><description>&lt;p&gt;At my day job we deal with enterprise customers. They pay us a subscription fee, and in return we help them in various ways to reduce their carbon footprint. To keep the boat afloat, we need to make some money. We shouldn&amp;rsquo;t spend more money than we make. So we need to keep track of our revenue and costs. Our gross margin is &lt;code&gt;(revenue - cost) / revenue&lt;/code&gt;, where the cost is mostly the salaries of our employees.&lt;/p&gt;</description></item><item><title>Warmshowers sparks joy</title><link>https://maxhalford.github.io/blog/warmshowers-sparks-joy/</link><pubDate>Sun, 24 Aug 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/warmshowers-sparks-joy/</guid><description>&lt;p&gt;I have a friend with whom I like to go travelling on a bicycle. We usually go for a couple of weeks, and we&amp;rsquo;ve been doing it regularly over the past five years. We&amp;rsquo;ve cycled in France and England so far. 
Both countries have many affordable campsites, and there&amp;rsquo;s always a decent hotel/Airbnb not too far for when it&amp;rsquo;s raining.&lt;/p&gt;
&lt;div align="center" &gt;
&lt;figure style="width: 100%; margin: 0;"&gt;
 &lt;img src="https://maxhalford.github.io/img/blog/warmshowers-sparks-joy/bike.jpg"&gt;
&lt;/figure&gt;
&lt;/div&gt;
&lt;p&gt;But this year we travelled through Switzerland, which is prohibitively expensive, and somewhat lacking in terms of places to camp. We were aware of &lt;a href="https://www.warmshowers.org/"&gt;Warmshowers&lt;/a&gt; and decided to give it a try. It was fantastic, and opened our eyes to what travelling and &lt;a href="https://talk.bradwoods.io/blog/connect/"&gt;connecting&lt;/a&gt; with people can be like. I don&amp;rsquo;t think we&amp;rsquo;ll ever go back to travelling the way we used to.&lt;/p&gt;</description></item><item><title>Do LLMs identify fonts?</title><link>https://maxhalford.github.io/blog/llm-font-identification/</link><pubDate>Wed, 30 Jul 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/llm-font-identification/</guid><description>&lt;p&gt;&lt;em&gt;Spoiler: &lt;a href="https://maxhalford.github.io/llm-font-recognition/"&gt;not really&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.dafont.com/fr/"&gt;dafont.com&lt;/a&gt; is a wonderful website that contains a large collection of fonts. It&amp;rsquo;s more comprehensive and esoteric than Google Fonts. One of its features is a forum where users can ask for help identifying fonts &amp;ndash; check out &lt;a href="https://www.dafont.com/forum/read/522670/font-identification"&gt;this poor fellow&lt;/a&gt; who&amp;rsquo;s been waiting for over two years and bumped his thread. I thought it would be interesting to see if an LLM could do this task, so I scraped the forum and set up a benchmark.&lt;/p&gt;</description></item><item><title>Thoughts on DuckLake</title><link>https://maxhalford.github.io/blog/ducklake-thoughts/</link><pubDate>Mon, 09 Jun 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/ducklake-thoughts/</guid><description>&lt;p&gt;&lt;a href="https://ducklake.select/"&gt;DuckLake&lt;/a&gt; is the new data lake/warehouse from the makers of DuckDB. I really like the direction they&amp;rsquo;re taking. I&amp;rsquo;m hopeful it has the potential to streamline the data engineering workflow for many people, vastly reducing costs along the way.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a bit of a nut and don&amp;rsquo;t use SQLMesh or dbt. Instead, I built &lt;a href="https://github.com/carbonfact/lea"&gt;lea&lt;/a&gt; a few years ago, and we still use it at &lt;a href="https://carbonfact.org"&gt;Carbonfact&lt;/a&gt;. I would probably pick SQLMesh if I had to start over, but lea allows me to explore new ideas, so I&amp;rsquo;m sticking to it for now.&lt;/p&gt;</description></item><item><title>The total derivative of a metric tree</title><link>https://maxhalford.github.io/blog/metric-tree-total-derivative/</link><pubDate>Tue, 06 May 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/metric-tree-total-derivative/</guid><description>&lt;p&gt;&lt;em&gt;A metric tree is a visual way to organize a complex metric. Count gives a good introduction &lt;a href="https://count.co/blog/intro-to-metric-trees"&gt;here&lt;/a&gt;. &lt;a href="https://www.linkedin.com/in/abhi-sivasailam/"&gt;Abhi Sivasailam&lt;/a&gt; gave a popular &lt;a href="https://www.youtube.com/watch?v=Dbr8jmtfZ7Q&amp;amp;ab_channel=DataCouncil"&gt;talk&lt;/a&gt; at Data Council 2023 if watching videos is your thing. &lt;a href="https://www.linkedin.com/in/ergestx/"&gt;Ergest Xheblati&lt;/a&gt; is someone to follow if you want to go deeper. There&amp;rsquo;s also a &lt;a href="https://www.lightdash.com/blogpost/metric-trees-how-top-data-teams-impact-growth"&gt;recent article&lt;/a&gt; from Lightdash. Finally, there&amp;rsquo;s &lt;a href="https://timodechau.com/metric-trees-for-digital-analysts/"&gt;this article&lt;/a&gt; by Timo Dechau, but it&amp;rsquo;s behind a paywall. 
The concept has a &lt;a href="https://en.wikipedia.org/wiki/Metric_tree"&gt;homonym&lt;/a&gt;, so beware when you browse for it.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Minimizing the runtime of a SQL DAG</title><link>https://maxhalford.github.io/blog/minimizing-sql-dag-runtime/</link><pubDate>Sat, 08 Feb 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/minimizing-sql-dag-runtime/</guid><description>&lt;p&gt;I recently looked into reducing the runtime of &lt;a href="https://www.carbonfact.com/"&gt;Carbonfact&lt;/a&gt;&amp;rsquo;s SQL DAG. Our DAG is made up of roughly 160 SQL queries. It takes about 10 minutes to run with BigQuery, using on-demand pricing. It&amp;rsquo;s decent. However, the results of our DAG feed customer dashboards, and we have the (bad) habit of refreshing the DAG several times a day. Reducing the runtime by a few minutes can be a nice quality-of-life improvement.&lt;/p&gt;</description></item><item><title>Hard data integration problems at Carbonfact</title><link>https://maxhalford.github.io/blog/hard-data-integration-problems-at-carbonfact/</link><pubDate>Thu, 02 Jan 2025 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/hard-data-integration-problems-at-carbonfact/</guid><description>&lt;p&gt;Carbonfact&amp;rsquo;s customers are clothing brands and factories. Our mission is to measure (and ultimately reduce) the carbon footprint of their products. We need primary data to do this: purchase orders, bills of materials, energy consumption data, etc.&lt;/p&gt;
&lt;p&gt;Each customer has a unique IT setup, which makes it challenging to scale to hundreds/thousands of customers. Our success as a business depends on our ability to not reinvent the wheel for each customer.&lt;/p&gt;</description></item><item><title>Introducing icanexplain @ PyData Paris 2024</title><link>https://maxhalford.github.io/blog/icanexplain-pydata/</link><pubDate>Thu, 26 Sep 2024 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/icanexplain-pydata/</guid><description/></item><item><title>@daily_cache implementation in Python</title><link>https://maxhalford.github.io/blog/python-daily-cache/</link><pubDate>Tue, 27 Aug 2024 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/python-daily-cache/</guid><description>&lt;p&gt;I spend a lot of time at Carbonfact working on datasets shared by our customers. We typically set things up so that our customers can export data automatically. They usually deposit files into a GCP bucket once a day, using a script. We then have an ETL script for each customer that runs afterwards to fetch their latest data and process it.&lt;/p&gt;
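The decorator named in the title can be sketched along these lines. This is a hypothetical implementation, not the one from the post: it simply threads today's date through an lru_cache key so that entries expire at midnight.

```python
import datetime
import functools

def daily_cache(func):
    """Cache func's results until the end of the current day."""
    @functools.lru_cache(maxsize=None)
    def cached(day, *args):
        # 'day' is only there to be part of the cache key.
        return func(*args)

    @functools.wraps(func)
    def wrapper(*args):
        return cached(datetime.date.today(), *args)

    return wrapper
```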
&lt;p&gt;During development, I load customer data to my laptop and work on it. The datasets can be quite heavy, and it takes time to fetch them, so I cache them to save some time. Python has &lt;a href="https://docs.python.org/3/library/functools.html"&gt;something&lt;/a&gt; for this in its standard library:&lt;/p&gt;</description></item><item><title>LCA software: exit the matrix</title><link>https://maxhalford.github.io/blog/lca-exit-the-matrix/</link><pubDate>Sun, 09 Jun 2024 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/lca-exit-the-matrix/</guid><description>&lt;p&gt;Measuring the environmental impact of a product is done using &lt;a href="https://en.wikipedia.org/wiki/Life-cycle_assessment"&gt;life cycle assessment&lt;/a&gt; (LCA). This is a methodology that breaks down a product&amp;rsquo;s life cycle into stages (&lt;a href="https://en.wikipedia.org/wiki/Life-cycle_assessment#Life_cycle_inventory_(LCI)"&gt;LCI&lt;/a&gt;), and measures the impact of each stage on the environment (&lt;a href="https://en.wikipedia.org/wiki/Life-cycle_assessment#Life_cycle_impact_assessment_(LCIA)"&gt;LCIA&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;There are a few pieces of LCA software to choose from. The leading ones are &lt;a href="https://simapro.com/"&gt;SimaPro&lt;/a&gt;, &lt;a href="https://sphera.com/life-cycle-assessment-lca-software/"&gt;GaBi&lt;/a&gt;, &lt;a href="https://www.openlca.org/"&gt;openLCA&lt;/a&gt;, and &lt;a href="https://www.ifu.com/umberto/"&gt;Umberto&lt;/a&gt;. These are all proprietary software, and they&amp;rsquo;re expensive. But there&amp;rsquo;s a free and open source alternative: &lt;a href="https://docs.brightway.dev/en/latest/"&gt;Brightway&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Cutting up shoes to measure their footprint</title><link>https://maxhalford.github.io/blog/cutting-up-shoes/</link><pubDate>Fri, 17 May 2024 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/cutting-up-shoes/</guid><description>&lt;p&gt;Our mission at &lt;a href="https://www.carbonfact.com/"&gt;Carbonfact&lt;/a&gt; is to measure the environmental impact of clothes. This involves a lot of steps. The main one is to determine what materials a product is made of, along with each material&amp;rsquo;s mass. This is straightforward for most clothes like jumpers and pants. These are typically made of a single fabric, such as cotton or polyester. The mass of each material is roughly the same as the product&amp;rsquo;s mass.&lt;/p&gt;</description></item><item><title>A training set for bike sharing forecasting</title><link>https://maxhalford.github.io/blog/bike-sharing-forecasting-training-set/</link><pubDate>Thu, 04 Apr 2024 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/bike-sharing-forecasting-training-set/</guid><description>&lt;style&gt;
table {
 font-family: monospace; /* Apply monospace font */
}

table td, table th {
 white-space: nowrap; /* Prevent text from wrapping */
}
&lt;/style&gt;
&lt;p&gt;Last night I went to a &lt;a href="https://www.meetup.com/fr-FR/tlse-data-science/"&gt;Toulouse Data Science&lt;/a&gt; meetup. The talks were about generative AI and information retrieval, which aren&amp;rsquo;t topics I&amp;rsquo;m knowledgeable about. However, one of the speakers was a &lt;a href="https://github.com/raphaelsty"&gt;friend&lt;/a&gt; of mine, so I went to support him. Toulouse is my hometown, so I bumped into a few people I knew. It was a nice evening.&lt;/p&gt;</description></item><item><title>Fast Poetry and pre-commit with GitHub Actions</title><link>https://maxhalford.github.io/blog/fast-poetry-pre-commit-github-actions/</link><pubDate>Tue, 27 Feb 2024 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/fast-poetry-pre-commit-github-actions/</guid><description>&lt;p&gt;This is a short post to share a GitHub Actions pattern I use to setup &lt;a href="https://python-poetry.org/"&gt;Poetry&lt;/a&gt; and &lt;a href="https://pre-commit.com/"&gt;pre-commit&lt;/a&gt;. These two tools cover most of my Python development needs. I use Poetry to manage dependencies and pre-commit to run code checks and formatting. The setup is fast because it caches the virtual environment and the &lt;code&gt;.local&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;I like to use &lt;a href="https://docs.github.com/en/actions/creating-actions/about-custom-actions"&gt;custom actions&lt;/a&gt; for this type of stuff. These are base actions that can be re-used in multiple workflows. I have a custom action to install the Python environment. Here&amp;rsquo;s the action file:&lt;/p&gt;</description></item><item><title>Decomposing funnel metrics</title><link>https://maxhalford.github.io/blog/funnel-decomposition/</link><pubDate>Thu, 14 Dec 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/funnel-decomposition/</guid><description>&lt;h2 id="funnel-metrics-as-products"&gt;Funnel metrics as products&lt;/h2&gt;
&lt;p&gt;I talked about metric decomposition in a &lt;a href="https://maxhalford.github.io/blog/kpi-evolution-decomposition"&gt;previous article&lt;/a&gt;, and how it can be used to explain why metrics change values over time. That article explained how to decompose a sum, as well as a ratio. In this article, I&amp;rsquo;ll explain how to decompose a product.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;revenue = impressions * click_rate * conversion_rate * spend
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The decomposition in this article isn&amp;rsquo;t limited to funnels. It can be applied to any metric that is expressed as a product of factors. For instance, at Carbonfact, we decompose the carbon footprint of a clothing line like so:&lt;/p&gt;</description></item><item><title>Efficient ELT refreshes</title><link>https://maxhalford.github.io/blog/efficient-data-transformation/</link><pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/efficient-data-transformation/</guid><description>&lt;p&gt;A tenet of the modern data stack is the use of ELT (Extract, Load, Transform) over ETL (Extract, Transform, Load). In a nutshell, this means that most of the data transformation is done in the data warehouse. This has become the &lt;em&gt;de facto&lt;/em&gt; standard for modern data teams, and is epitomized by &lt;a href="https://www.getdbt.com/"&gt;dbt&lt;/a&gt; and its ecosystem. It&amp;rsquo;s a great time to be a data engineer!&lt;/p&gt;
&lt;p&gt;We at &lt;a href="https://www.carbonfact.com/"&gt;Carbonfact&lt;/a&gt; fully embrace the ELT paradigm. In fact, our whole platform is powered by BigQuery, which acts as our single source of truth. We have a main BigQuery dataset where we materialize several SQL views that power what our customers see.&lt;/p&gt;</description></item><item><title>Online machine learning on the road @ IDE+A, TH Köln</title><link>https://maxhalford.github.io/blog/online-machine-learning-on-the-road/</link><pubDate>Thu, 26 Oct 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-machine-learning-on-the-road/</guid><description/></item><item><title>Sh*t flows downhill, but not at Carbonfact</title><link>https://maxhalford.github.io/blog/shit-flows-downhill-but-not-at-carbonfact/</link><pubDate>Mon, 16 Oct 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/shit-flows-downhill-but-not-at-carbonfact/</guid><description>&lt;p&gt;I&amp;rsquo;m writing this after watching the talk &lt;a href="https://josephreis.com/"&gt;Joe Reis&lt;/a&gt; gave at &lt;a href="https://bigdataldn.com/"&gt;Big Data LDN&lt;/a&gt;. It&amp;rsquo;s called &lt;a href="https://www.youtube.com/watch?app=desktop&amp;amp;v=OCClTPOEe5s&amp;amp;ref=blef.fr"&gt;Data Modeling is Dead! Long Live Data Modeling!&lt;/a&gt; It&amp;rsquo;s an easy-to-watch short talk that calls out a few modern issues in the data world.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d like to bounce off one of Joe&amp;rsquo;s slides:&lt;/p&gt;
&lt;div align="center" &gt;
&lt;figure style="width: 80%; margin: 0;"&gt;
 &lt;img src="https://maxhalford.github.io/img/blog/shit-flows-downhill-but-not-at-carbonfact/shit-flows-downhill.png"&gt;
&lt;/figure&gt;
&lt;/div&gt;
&lt;p&gt;I agree with Joe that many issues stem from the lack of alignment between engineering and data teams. A fundamental aspect of the Modern Data Stack is to replicate/copy production data into an analytics warehouse. For instance, copying the production PostgreSQL database into BigQuery.&lt;/p&gt;</description></item><item><title>Answering "Why did the KPI change?" using decomposition</title><link>https://maxhalford.github.io/blog/kpi-evolution-decomposition/</link><pubDate>Wed, 09 Aug 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/kpi-evolution-decomposition/</guid><description>&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &amp;ndash; &lt;em&gt;I published a notebook &lt;a href="https://gist.github.com/MaxHalford/9fba0c2d6800d0f0643902bf57b99780"&gt;here&lt;/a&gt; that deals with the case where dimension values may (dis)appear from one period of time to the next. The notebook decomposes a ratio, but the logic is also valid for decomposing a sum.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Edit 2&lt;/strong&gt; &amp;ndash; &lt;em&gt;I&amp;rsquo;ve stumbled on &lt;a href="https://medium.com/@shaozhifei/metric-decomposition-formula-to-understand-metric-trend-e693b7a4c8cf"&gt;this article&lt;/a&gt; by Shao Zhifei which provides a good derivation of the ratio decomposition formula. I contacted Shao Zhifei on LinkedIn, and he told me they heavily use these formulas at &lt;a href="https://www.grab.com/"&gt;Grab&lt;/a&gt;. He also pointed out a typo in the ratio decomposition formula which I have now fixed.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Measuring the carbon footprint of pizzas</title><link>https://maxhalford.github.io/blog/carbon-footprint-pizzas/</link><pubDate>Sun, 25 Jun 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/carbon-footprint-pizzas/</guid><description>&lt;p&gt;Making environmentally friendly decisions can only be done with the right information. At Carbonfact, we&amp;rsquo;ve realized a big challenge is the lack of information about industrial processes. We tackle that slowly but surely by gathering data from various sources, and making it available to our customers.&lt;/p&gt;
&lt;p&gt;Regarding food, the French government has a great initiative called &lt;a href="https://agribalyse.ademe.fr/"&gt;Agribalyse&lt;/a&gt;. It&amp;rsquo;s a free database of environmental footprints for various food products. It includes raw ingredients straight from the farm, as well as ready-to-eat dishes from the supermarket. It&amp;rsquo;s a valuable resource, as it allows anyone to do their own research and make informed decisions.&lt;/p&gt;</description></item><item><title>Graph components with DuckDB</title><link>https://maxhalford.github.io/blog/graph-components-duckdb/</link><pubDate>Sat, 03 Jun 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/graph-components-duckdb/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Graph problems are quite common. However, it&amp;rsquo;s rare to have access to a database offering graph semantics. There are graph databases, such as &lt;a href="https://neo4j.com/"&gt;Neo4j&lt;/a&gt; and &lt;a href="https://spark.apache.org/docs/latest/graphx-programming-guide.html"&gt;GraphX&lt;/a&gt;, but it&amp;rsquo;s difficult to justify setting one of those up. One could simply use &lt;a href="https://networkx.org/"&gt;networkx&lt;/a&gt; in Python. But that only works if the graph fits in memory.&lt;/p&gt;
&lt;p&gt;From a practical angle, the fact is that people are querying data warehouses in SQL. There are many good reasons to write graph algorithms in SQL. And anyway, one may argue that graphs are a special case of the &lt;a href="https://en.wikipedia.org/wiki/Relational_model"&gt;relational model&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>For analytics, don't use dynamic JSON keys</title><link>https://maxhalford.github.io/blog/no-dynamic-keys-in-json/</link><pubDate>Thu, 11 May 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/no-dynamic-keys-in-json/</guid><description>&lt;p&gt;I love the JSON format. It&amp;rsquo;s the kind of love that grows on you with time. Like others, I&amp;rsquo;ve been using JSON everywhere for so many years, to the point where I just take it for granted.&lt;/p&gt;
&lt;p&gt;I suppose the main thing I like about JSON is its flexibility. You can structure your JSON without too much care. There will always be a way to consume and manipulate it. But I have discovered a bit of an anti-pattern, which I believe is worth raising awareness about, especially when you&amp;rsquo;re doing analytics.&lt;/p&gt;</description></item><item><title>Metric correctness doesn't matter, consistency does</title><link>https://maxhalford.github.io/blog/consistent-metrics/</link><pubDate>Fri, 28 Apr 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/consistent-metrics/</guid><description>&lt;p&gt;&lt;a href="https://www.un.org/en/dayof8billion"&gt;According to&lt;/a&gt; the United Nations, the 15th of November &lt;a href="https://www.bbc.co.uk/newsround/63632981"&gt;was the day&lt;/a&gt; we crossed 8 billion humans on the planet. How can they be so sure of that? Surely there has to be some margin of error, meaning it could have happened on the 14th or 16th. Then again, does it matter?&lt;/p&gt;
&lt;p&gt;I would argue almost all metrics we look at are incorrect. For instance, I work at a company whose goal is to measure the carbon footprint of clothing items. I can tell you firsthand our measurements are chock-full of assumptions. In the sustainability world, it&amp;rsquo;s not surprising to get reports like this one:&lt;/p&gt;</description></item><item><title>Online gradient descent written in SQL</title><link>https://maxhalford.github.io/blog/ogd-in-sql/</link><pubDate>Tue, 07 Mar 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/ogd-in-sql/</guid><description>&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &amp;ndash; &lt;em&gt;this post &lt;a href="https://news.ycombinator.com/item?id=35054786"&gt;generated&lt;/a&gt; a few insightful comments on Hacker News. I&amp;rsquo;ve also put the code in a &lt;a href="https://gist.github.com/MaxHalford/823c4e7f9216607dc853724ec74ec692"&gt;notebook&lt;/a&gt; for ease of use.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Modern MLOps is complex because it involves too many components. You need a message bus, a stream processing engine, an API, a model store, a feature store, a monitoring service, etc. Sadly, containerisation software and the unbundling trend have encouraged an appetite for complexity. I believe MLOps shouldn&amp;rsquo;t be this complex. For instance, MLOps can be made simpler by &lt;a href="https://www.ethanrosenthal.com/2022/05/10/database-bundling/"&gt;bundling the logic into your database&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Using SymPy in Python doctests</title><link>https://maxhalford.github.io/blog/sympy-doctests/</link><pubDate>Wed, 15 Feb 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/sympy-doctests/</guid><description>&lt;p&gt;A program which compiles and runs without errors isn&amp;rsquo;t necessarily correct. I find this to be especially true for statistical software, both as a developer and as a user. Small but nasty bugs creep up on me every week. I keep sane in the membrane by writing many unit tests 🐛🔨&lt;/p&gt;
&lt;p&gt;I make heavy use of &lt;a href="https://docs.python.org/3/library/doctest.html"&gt;doctests&lt;/a&gt;. These are unit tests which you write as Python &lt;a href="https://realpython.com/documenting-python-code/#documenting-your-python-code-base-using-docstrings"&gt;docstrings&lt;/a&gt;. They&amp;rsquo;re really handy because they kill two birds with one stone: the unit tests you write for a function also act as documentation.&lt;/p&gt;</description></item><item><title>Online active learning in 80 lines of Python</title><link>https://maxhalford.github.io/blog/online-active-learning-river-databutton/</link><pubDate>Sun, 22 Jan 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-active-learning-river-databutton/</guid><description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)"&gt;Active learning&lt;/a&gt; is a way to get humans to label data efficiently. A good active learning strategy minimizes the number of necessary labels, while maximizing a model&amp;rsquo;s performance. This usually works by focusing on samples where the model is unsure of its prediction.&lt;/p&gt;
&lt;p&gt;In a batch setting, the model is periodically retrained to learn from the freshly labeled samples. However, the training time is usually too prohibitive for this to happen each time a new label is provided. This isn&amp;rsquo;t the case with online models, because they are able to learn one sample at a time. Active and online learning naturally fit together.&lt;/p&gt;</description></item><item><title>Are Airbnb guests less energy efficient than their host?</title><link>https://maxhalford.github.io/blog/airbnb-energy-usage/</link><pubDate>Tue, 17 Jan 2023 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/airbnb-energy-usage/</guid><description>&lt;h2 id="tldr"&gt;TLDR&lt;/h2&gt;
&lt;p&gt;I compared the energy consumption of Airbnb guests versus their host, in the same apartment, during 2022. It appears that guests do in fact consume more energy than hosts. The data I used is available to any Airbnb host. I also open-sourced all the code I wrote for this analysis.&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;European energy prices have soared in 2022. It&amp;rsquo;s got to the point where some Airbnb hosts have become reluctant to rent, believing their guests are too wasteful and cost too much. You can see this by scrolling through Airbnb groups on Facebook.&lt;/p&gt;</description></item><item><title>The future of River</title><link>https://maxhalford.github.io/blog/future-of-river/</link><pubDate>Tue, 13 Dec 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/future-of-river/</guid><description>&lt;div align="center"&gt;
&lt;figure &gt;
 &lt;img src="https://maxhalford.github.io/img/blog/future-of-river/tweet.png" style="box-shadow: none;"&gt;
 &lt;figcaption&gt;&lt;a href="https://twitter.com/josh_wills/status/1585328751646109696"&gt;Source&lt;/a&gt;&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/div&gt;
&lt;p&gt;When I see tweets like this one, I&amp;rsquo;m both happy, because people are aware of &lt;a href="https://riverml.xyz/"&gt;River&lt;/a&gt;, and irked, because it&amp;rsquo;s really difficult to make production-grade open source software.&lt;/p&gt;
&lt;p&gt;We had a developer meeting a week ago. We planned &lt;a href="https://github.com/orgs/online-ml/projects/3?query=is%3Aopen+sort%3Aupdated-desc"&gt;what we will work on&lt;/a&gt; during the first half of 2023. I thought it would be worthwhile to give a high-level view of how we envision River&amp;rsquo;s future. If not to be comprehensive, at least to reassure potential users that River is alive and kicking 🤺&lt;/p&gt;</description></item><item><title>Parsing garment descriptions with GPT-3</title><link>https://maxhalford.github.io/blog/garment-parsing-gpt3/</link><pubDate>Sun, 20 Nov 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/garment-parsing-gpt3/</guid><description>&lt;h2 id="the-task"&gt;The task&lt;/h2&gt;
&lt;p&gt;You&amp;rsquo;ll have heard of GPT-3 if you haven&amp;rsquo;t been hiding under a rock. I&amp;rsquo;ve recently been impressed by Nat Friedman &lt;a href="https://twitter.com/natfriedman/status/1575631194032549888"&gt;teaching&lt;/a&gt; GPT-3 to use a browser, and SeekWell &lt;a href="https://blog.seekwell.io/gpt3"&gt;generating&lt;/a&gt; SQL queries from free text. I think the most exciting use cases are yet to come. But GPT-3 has a good chance of changing the way we approach mundane tasks at work.&lt;/p&gt;
&lt;p&gt;I wrote an &lt;a href="https://maxhalford.github.io/blog/carbonfact-nlp-open-problem"&gt;article&lt;/a&gt; a couple of months ago about a boring task I have to do at work. I got a few interesting suggestions by email. Raphaël suggested a &lt;a href="https://huggingface.co/tasks/question-answering"&gt;question-answering&lt;/a&gt; model &lt;a href="https://raphaelsty.github.io/blog/qa/"&gt;here&lt;/a&gt;. I&amp;rsquo;ve slowly become convinced that GPT-3 might be the perfect tool for the job. It&amp;rsquo;s cold and miserable where I am, so I thought it would be opportune to take GPT-3 for a spin 🧙&lt;/p&gt;</description></item><item><title>Dynamic on-screen TV keyboards</title><link>https://maxhalford.github.io/blog/dynamic-on-screen-keyboards/</link><pubDate>Sun, 25 Sep 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/dynamic-on-screen-keyboards/</guid><description>&lt;p&gt;&lt;em&gt;This article has some interactive keyboards, therefore I recommend reading it from your computer rather than your phone.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="on-screen-tv-keyboards"&gt;On-screen TV keyboards&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve recently been spending time at my brother&amp;rsquo;s place. We usually eat in front of the TV. I&amp;rsquo;ve thus found myself typing stuff on the Netflix/Amazon/Plex TV apps. The typing happens through a remote control, which is slower than typing with one&amp;rsquo;s fingers. However, the TV apps usually suggest the correct show/movie after five or six keystrokes, so it&amp;rsquo;s not that bad.&lt;/p&gt;</description></item><item><title>NLP at Carbonfact: how would you do it?</title><link>https://maxhalford.github.io/blog/carbonfact-nlp-open-problem/</link><pubDate>Tue, 06 Sep 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/carbonfact-nlp-open-problem/</guid><description>&lt;h2 id="the-task"&gt;The task&lt;/h2&gt;
&lt;p&gt;I work at a company called &lt;a href="https://www.carbonfact.com/"&gt;Carbonfact&lt;/a&gt;. Our core value proposal is computing the &lt;a href="https://en.wikipedia.org/wiki/Carbon_footprint"&gt;carbon footprint&lt;/a&gt; of clothing items, expressed in &lt;a href="https://en.wikipedia.org/wiki/Carbon_Dioxide_Equivalent"&gt;carbon dioxide equivalent&lt;/a&gt; &amp;ndash; $kgCO_2e$ in short. For instance, we started by measuring the footprint of shoes &amp;ndash; no pun intended. We do these measurements with &lt;a href="https://en.wikipedia.org/wiki/Life-cycle_assessment"&gt;life cycle analysis (LCA)&lt;/a&gt; software we built ourselves. We use these analyses to fuel higher-level tasks for our clients, such as &lt;a href="https://en.wikipedia.org/wiki/Carbon_accounting"&gt;carbon accounting&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Sustainable_procurement"&gt;sustainable procurement&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A life cycle analysis is essentially a recipe, the output of which is a carbon footprint assessment. Like any recipe, an LCA necessitates ingredients. In a &lt;a href="https://en.wikipedia.org/wiki/Life-cycle_assessment#Cradle-to-gate"&gt;cradle-to-gate&lt;/a&gt; scenario, this includes everything that is needed to make the product: the materials, the mass, the manufacturing methods, the transport between factories, etc. In our experience, the biggest impact on a product&amp;rsquo;s footprint comes from the materials it is made of.&lt;/p&gt;</description></item><item><title>Matrix inverse mini-batch updates</title><link>https://maxhalford.github.io/blog/matrix-inverse-mini-batch/</link><pubDate>Wed, 24 Aug 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/matrix-inverse-mini-batch/</guid><description>&lt;p&gt;The inverse covariance matrix, also called &lt;a href="https://en.wikipedia.org/wiki/Precision_matrix"&gt;precision matrix&lt;/a&gt;, is useful in many places across the field of statistics. For instance, in machine learning, it is used for &lt;a href="https://maxhalford.github.io/blog/bayesian-linear-regression"&gt;Bayesian regression&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/mixture.html#gmm"&gt;mixture modelling&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s interesting is that any batch model which uses a precision matrix can be turned into an online model. That is, provided the precision matrix can be estimated in a streaming fashion. For instance, scikit-learn&amp;rsquo;s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html#sklearn.covariance.EllipticEnvelope"&gt;elliptic envelope&lt;/a&gt; method could have an online variant with a &lt;code&gt;partial_fit&lt;/code&gt; method.&lt;/p&gt;</description></item><item><title>A rant against dbt ref</title><link>https://maxhalford.github.io/blog/dbt-ref-rant/</link><pubDate>Tue, 28 Jun 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/dbt-ref-rant/</guid><description>&lt;h2 id="disclaimer"&gt;Disclaimer&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Let me be absolutely clear: I think dbt is a great tool. Although this post is a rant, the goal is to be constructive and suggest an improvement.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="dbt-in-a-nutshell"&gt;dbt in a nutshell&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.getdbt.com/"&gt;dbt&lt;/a&gt; is a workflow orchestrator for SQL. In other words, it&amp;rsquo;s a fancy &lt;a href="https://en.wikipedia.org/wiki/Make_(software)"&gt;Make&lt;/a&gt; for data analytics. What makes dbt special is that it is the first workflow orchestrator that is dedicated to the SQL language. It said out loud what many data teams were thinking: you can get a lot done with SQL.&lt;/p&gt;</description></item><item><title>First IRL meetup with the River developers</title><link>https://maxhalford.github.io/blog/first-river-meetup/</link><pubDate>Thu, 09 Jun 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/first-river-meetup/</guid><description>&lt;p&gt;&lt;a href="https://github.com/online-ml/river/"&gt;River&lt;/a&gt; is a Python software for doing online machine learning. It&amp;rsquo;s the result of a merger in early 2020 between &lt;a href="https://github.com/online-ml/river"&gt;creme&lt;/a&gt; and &lt;a href="https://github.com/scikit-multiflow/scikit-multiflow"&gt;scikit-multiflow&lt;/a&gt;. &lt;a href="https://smastelini.github.io/"&gt;Saulo Mastelini&lt;/a&gt;, &lt;a href="https://jacobmontiel.github.io/"&gt;Jacob Montiel&lt;/a&gt;, and myself are the three core developers. But there are many more people who contribute here and there!&lt;/p&gt;
&lt;p&gt;This week Saulo Mastelini and I got to meet in person. This is worth mentioning because Saulo is originally from Brazil, whereas I&amp;rsquo;m based in Europe. We connected and I&amp;rsquo;m glad to think of him as a good friend from now on. Of course we were not alone: some friends of mine from university also joined the fun. These are people who initially contributed to creme, back in what we already call the old days! Each one of them has their own areas of expertise, and contributed to various parts of the codebase.&lt;/p&gt;</description></item><item><title>Online machine learning with River @ GAIA</title><link>https://maxhalford.github.io/blog/online-machine-learning-with-river/</link><pubDate>Thu, 07 Apr 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-machine-learning-with-river/</guid><description/></item><item><title>Fuzzy regex matching in Python</title><link>https://maxhalford.github.io/blog/fuzzy-regex-matching-in-python/</link><pubDate>Mon, 04 Apr 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/fuzzy-regex-matching-in-python/</guid><description>&lt;h2 id="fuzzy-string-matching-in-a-nutshell"&gt;Fuzzy string matching in a nutshell&lt;/h2&gt;
&lt;p&gt;Say we&amp;rsquo;re looking for a pattern in a blob of text. If you know the text has no typos, then determining whether it contains a pattern is trivial. In Python you can use the &lt;code&gt;in&lt;/code&gt; operator. You can also write a regex pattern with the &lt;code&gt;re&lt;/code&gt; module from the standard library. But what if the text contains typos? For instance, this might be the case with user inputs on a website, or with OCR outputs. This is a much harder problem.&lt;/p&gt;</description></item><item><title>OCR spelling correction is hard</title><link>https://maxhalford.github.io/blog/ocr-spelling-correction-is-hard/</link><pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/ocr-spelling-correction-is-hard/</guid><description>&lt;p&gt;I recently saw &lt;a href="https://news.ycombinator.com/item?id=30576435"&gt;SymSpell&lt;/a&gt; pop up on Hacker News. It claims to be a million times faster than &lt;a href="https://norvig.com/spell-correct.html"&gt;Peter Norvig&amp;rsquo;s spelling corrector&lt;/a&gt;. I think it&amp;rsquo;s great that there&amp;rsquo;s a fast open source solution for spelling correction. But in my experience, the most challenging aspect of spelling correction is not necessarily speed.&lt;/p&gt;
&lt;p&gt;When I &lt;a href="https://maxhalford.github.io/blog/one-year-at-alan"&gt;worked at Alan&lt;/a&gt;, I mostly wrote logic to extract structured information from medical documents. After some months working on the topic, I have to admit I hadn&amp;rsquo;t cracked the problem. The goal was to process &amp;gt;80% of documents with no human interaction, but when I left we had only reached 35%. However, I developed a good understanding of what made this task so difficult.&lt;/p&gt;</description></item><item><title>Comic book panel segmentation</title><link>https://maxhalford.github.io/blog/comic-book-panel-segmentation/</link><pubDate>Sat, 05 Mar 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/comic-book-panel-segmentation/</guid><description>&lt;p&gt;&lt;strong&gt;Edit (2023-05-26)&lt;/strong&gt; &amp;ndash; &lt;em&gt;I&amp;rsquo;ve learnt about the &lt;a href="https://github.com/njean42/kumiko"&gt;Kumiko project&lt;/a&gt;, which is exactly devoted to slicing comic book panels. There&amp;rsquo;s even a live &lt;a href="https://kumiko.njean.me/demo"&gt;tool&lt;/a&gt;. I discovered it thanks to being pinged on &lt;a href="https://github.com/njean42/kumiko/issues/12"&gt;this&lt;/a&gt; issue.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve recently been reading some comic books I used to devour as a kid. Especially those from the golden era of francophone comics: Thorgal, Lanfeust, XIII, Tintin, Largo Winch, Blacksad, Aldebaran, etc.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not easy to get my hands on many of them. Luckily enough I found a website called &lt;a href="https://readcomiconline.li/"&gt;ReadComicOnline&lt;/a&gt; which is delightfully profuse. It gives access to comics for free under the murky &amp;ldquo;fair use&amp;rdquo; copyright doctrine. I&amp;rsquo;m very doubtful about the legality of the website, but I&amp;rsquo;m still using it for lack of a better option.&lt;/p&gt;</description></item><item><title>Online machine learning in practice @ PyData PDX</title><link>https://maxhalford.github.io/blog/online-machine-learning-in-practice-pydata-pdx/</link><pubDate>Wed, 09 Feb 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-machine-learning-in-practice-pydata-pdx/</guid><description/></item><item><title>The online machine learning predict/fit switcheroo</title><link>https://maxhalford.github.io/blog/predict-fit-switcheroo/</link><pubDate>Thu, 06 Jan 2022 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/predict-fit-switcheroo/</guid><description>&lt;h2 id="why-im-writing-this"&gt;Why I&amp;rsquo;m writing this&lt;/h2&gt;
&lt;p&gt;Fact: designing open source software is hard. It&amp;rsquo;s difficult to make design decisions which don&amp;rsquo;t make any compromises. I like to fall back on Dieter Rams&amp;rsquo; &lt;a href="https://ifworlddesignguide.com/design-specials/dieter-rams-10-principles-for-good-design"&gt;10 principles for good design&lt;/a&gt;. I feel like they apply rather well to software design. Especially when said software is open source, due to the many users and the plethora of use cases.&lt;/p&gt;
&lt;p&gt;I had to make a significant design decision for &lt;a href="https://github.com/online-ml/river/"&gt;River&lt;/a&gt;. It boils down to the fact that making a prediction with a model pipeline is a stateful operation, whereas users understandably expect it to be pure with no side-effects. This regularly comes up on the issue tracker, as you can see &lt;a href="https://github.com/online-ml/river/issues/130"&gt;here&lt;/a&gt;, &lt;a href="https://github.com/online-ml/river/issues/359"&gt;here&lt;/a&gt;, and &lt;a href="https://github.com/online-ml/river/issues/499"&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Weighted sampling without replacement in pure Python</title><link>https://maxhalford.github.io/blog/weighted-sampling-without-replacement/</link><pubDate>Fri, 24 Dec 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/weighted-sampling-without-replacement/</guid><description>&lt;p&gt;I&amp;rsquo;m working on a problem where I need to sample &lt;code&gt;k&lt;/code&gt; items from a list without replacement. The sampling has to be weighted. In Python, &lt;code&gt;numpy&lt;/code&gt; has &lt;code&gt;random.choice&lt;/code&gt; method which allows doing this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;population&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I&amp;rsquo;m always wary of using &lt;code&gt;numpy&lt;/code&gt; without thinking because I know it incurs some overhead. This overhead is usually meaningful when small amounts of data are involved. In such a case, a pure Python implementation may be faster.&lt;/p&gt;</description></item><item><title>Online machine learning in practice @ Applied AI</title><link>https://maxhalford.github.io/blog/real-time-ml-next-frontier-applied-ai/</link><pubDate>Fri, 17 Dec 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/real-time-ml-next-frontier-applied-ai/</guid><description/></item><item><title>Online machine learning in practice @ LVMH</title><link>https://maxhalford.github.io/blog/real-time-ml-next-frontier-lvmh/</link><pubDate>Fri, 10 Dec 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/real-time-ml-next-frontier-lvmh/</guid><description/></item><item><title>Web scraping, upside down</title><link>https://maxhalford.github.io/blog/declarative-web-scraping/</link><pubDate>Thu, 11 Nov 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/declarative-web-scraping/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Web scraping is the art of extracting information from web pages. A web page is essentially an amalgamation of HTML tags. Usually, we&amp;rsquo;re looking for a particular piece of information on a given web page. This may be done by fetching the HTML content of the page in question, and then running some HTML parsing logic. It&amp;rsquo;s quite straightforward.&lt;/p&gt;
&lt;p&gt;There are many tools in the wild to perform web scraping. For instance, in Python, you may use &lt;a href="https://docs.python-requests.org/en/latest/"&gt;requests&lt;/a&gt; in combination with &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"&gt;Beautiful Soup&lt;/a&gt;. You can also automate some of the more mundane aspects of scraping by using &lt;a href="https://scrapy.org/"&gt;Scrapy&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>One year at Alan</title><link>https://maxhalford.github.io/blog/one-year-at-alan/</link><pubDate>Tue, 26 Oct 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/one-year-at-alan/</guid><description>&lt;h2 id="context"&gt;Context&lt;/h2&gt;
&lt;p&gt;Today marks one year since I started working at &lt;a href="https://alan.com/"&gt;Alan&lt;/a&gt;. It&amp;rsquo;s my first real job, and certainly the place where I&amp;rsquo;ve grown the most professionally. I&amp;rsquo;m writing this post to summarise what I did and what I learnt at Alan.&lt;/p&gt;
&lt;p&gt;Alan is a special company. It has a unique culture that is starting to become famous in France. I won&amp;rsquo;t expand on the way things work at Alan, and will simply focus on the way I experienced it. Let me just say this: it works. The pace at which stuff gets shipped is insane. And yet, it&amp;rsquo;s a healthy environment to be working in. Alaners are some of the kindest and wisest human beings I&amp;rsquo;ve had the chance to meet.&lt;/p&gt;</description></item><item><title>Manipulating ephemeral data with git</title><link>https://maxhalford.github.io/blog/manipulating-ephemeral-data-with-git/</link><pubDate>Thu, 07 Oct 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/manipulating-ephemeral-data-with-git/</guid><description/></item><item><title>Dashboards and GROUPING SETS</title><link>https://maxhalford.github.io/blog/grouping-sets/</link><pubDate>Fri, 10 Sep 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/grouping-sets/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;At &lt;a href="https://alan.com/"&gt;Alan&lt;/a&gt;, we do almost all our data analysis in SQL. Our data warehouse used to be &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt;, and have since switched to &lt;a href="https://www.snowflake.com/"&gt;Snowflake&lt;/a&gt; for performance reasons. We load data into our warehouse with &lt;a href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt;. This includes dumps of our production database, third-party data, and health data from other actors in the health ecosystem. This is raw data. We transform this into prepared data via an in-house tool that resembles &lt;a href="https://www.getdbt.com/"&gt;dbt&lt;/a&gt;. You can read more about it &lt;a href="https://medium.com/alan/how-we-solve-the-problem-of-sharing-actionable-data-with-the-team-7e4afeff3cac"&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Homoglyphs: different characters that look identical</title><link>https://maxhalford.github.io/blog/homoglyphs/</link><pubDate>Thu, 19 Aug 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/homoglyphs/</guid><description>&lt;h2 id="a-wild-homoglyph-appears"&gt;A wild homoglyph appears&lt;/h2&gt;
&lt;p&gt;For instance, can you tell if there&amp;rsquo;s a difference between &lt;code&gt;H&lt;/code&gt; and &lt;code&gt;Η&lt;/code&gt;? How about &lt;code&gt;N&lt;/code&gt; and &lt;code&gt;Ν&lt;/code&gt;? These characters may seem identical, but they are actually different. You can try this out for yourself in Python:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;H&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Η&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;N&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Ν&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Indeed, these all represent different Unicode characters:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;H&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Η&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;919&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;N&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Ν&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;925&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;Η&lt;/code&gt; in fact represents the capital &lt;a href="https://en.wikipedia.org/wiki/Eta"&gt;Eta&lt;/a&gt; letter, while &lt;code&gt;Ν&lt;/code&gt; is a capital &lt;a href="https://en.wikipedia.org/wiki/Nu_(letter)"&gt;Nu&lt;/a&gt;. In fact, entering &lt;code&gt;H&lt;/code&gt; or &lt;code&gt;Η&lt;/code&gt; in Google will produce different results. The same goes for &lt;code&gt;N&lt;/code&gt; and &lt;code&gt;Ν&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Automated document processing at Alan</title><link>https://maxhalford.github.io/blog/medium-document-processing/</link><pubDate>Thu, 10 Jun 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/medium-document-processing/</guid><description/></item><item><title>Text classification by data compression</title><link>https://maxhalford.github.io/blog/text-classification-by-compression/</link><pubDate>Tue, 08 Jun 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/text-classification-by-compression/</guid><description>&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &amp;ndash; &lt;em&gt;I posted this &lt;a href="https://news.ycombinator.com/item?id=27440093"&gt;on Hackernews&lt;/a&gt; and got some valuable feedback. Many brought up the fact that you should be able to reuse the internal state of the compressor instead of recompressing the training data each time a prediction is made. There&amp;rsquo;s also some insightful references to data compression theory and its ties to statistical learning&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Edit (2025-06-29)&lt;/strong&gt; &amp;ndash; &lt;em&gt;Python 3.14 introduces &lt;a href="https://docs.python.org/3/library/compression.zstd.html#compression.zstd.train_dict"&gt;&lt;code&gt;compression.zstd&lt;/code&gt;&lt;/a&gt;, which implements Facebook&amp;rsquo;s Zstandard compression algorithm, as discussed in the comments section below.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Reducing the memory footprint of a scikit-learn text classifier</title><link>https://maxhalford.github.io/blog/sklearn-text-classifier-memory-footprint-reduction/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/sklearn-text-classifier-memory-footprint-reduction/</guid><description>&lt;h2 id="context"&gt;Context&lt;/h2&gt;
&lt;p&gt;This week at Alan I&amp;rsquo;ve been working on parsing &lt;a href="https://www.wikiwand.com/fr/Ordonnance_(m%C3%A9decine)"&gt;French medical prescriptions&lt;/a&gt;. There are three types of prescriptions: lenses, glasses, and pharmaceutical prescriptions. Different information needs to be extracted depending on the prescription type. Therefore, the first step is to classify the prescription. The prescriptions we receive are pictures taken by users with their phone. We run each image through an OCR to obtain a text transcription of the image. We can thus use the text transcription to classify the prescription.&lt;/p&gt;</description></item><item><title>An overview of dataset time travel</title><link>https://maxhalford.github.io/blog/dataset-time-travel/</link><pubDate>Wed, 07 Apr 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/dataset-time-travel/</guid><description>&lt;h2 id="tldr"&gt;TLDR&lt;/h2&gt;
&lt;p&gt;You&amp;rsquo;re a data scientist. The engineers in your company overwrite data in the production database. You want to access overwritten data to train your models. How?&lt;/p&gt;
&lt;h2 id="i-thought-time-travel-only-existed-in-the-movies"&gt;I thought time travel only existed in the movies&lt;/h2&gt;
&lt;p&gt;You&amp;rsquo;re probably right, except maybe for &lt;a href="https://en.wikipedia.org/wiki/Time_travel_claims_and_urban_legends#/Present-day_hipster_at_1941_bridge_opening"&gt;this guy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I want to discuss a concept that&amp;rsquo;s been on my mind for a while now. I like to call it &amp;ldquo;dataset time travel&amp;rdquo; because it has a nice ring to it. But the association of &amp;ldquo;time travel&amp;rdquo; and &amp;ldquo;data&amp;rdquo; has already been used elsewhere. It&amp;rsquo;s not something I&amp;rsquo;m pulling out from thin air. Essentially, what I want to discuss is the ability to view a dataset at any given point in the past. Having this ability is powerful, as it allows answering important business questions. As an example, let&amp;rsquo;s say we have a database table called &lt;code&gt;users&lt;/code&gt;. We might ask the following question:&lt;/p&gt;</description></item><item><title>The challenges of online machine learning in production @ Itaú Unibanco</title><link>https://maxhalford.github.io/blog/challenges-of-online-machine-learning-in-production/</link><pubDate>Fri, 26 Feb 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/challenges-of-online-machine-learning-in-production/</guid><description/></item><item><title>Quelle est l’empreinte écologique du Big Data? @ Toulouse Tech</title><link>https://maxhalford.github.io/blog/empreinte-ecologie-du-big-data/</link><pubDate>Fri, 22 Jan 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/empreinte-ecologie-du-big-data/</guid><description/></item><item><title>Organising a Kaggle InClass competition with a fairness metric</title><link>https://maxhalford.github.io/blog/fairness-competition/</link><pubDate>Thu, 21 Jan 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/fairness-competition/</guid><description>&lt;h2 id="some-context"&gt;Some context&lt;/h2&gt;
&lt;p&gt;I co-organised a data science competition during the second half of 2020. This was in fact the 5th edition of the &amp;ldquo;Défi IA&amp;rdquo;, which is a recurring event that happens on a yearly basis. It is essentially a supervised machine learning competition for students from French speaking universities and engineering schools. This year was the first time that Kaggle was used to host the competition. Before that we used a custom platform that I wrote during my student years. You can read more about this &lt;a href="https://maxhalford.github.io/blog/openbikes-challenge"&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Converting Amazon Textract tables to pandas DataFrames</title><link>https://maxhalford.github.io/blog/textract-table-to-pandas/</link><pubDate>Thu, 14 Jan 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/textract-table-to-pandas/</guid><description>&lt;p&gt;I&amp;rsquo;m currently doing a lot of document processing at work. One of my tasks is to extract tables from PDF files. I evaluated &lt;a href="https://aws.amazon.com/textract/?nc1=h_ls"&gt;Amazon Textract&lt;/a&gt;&amp;rsquo;s &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html"&gt;table extraction&lt;/a&gt; capability as part of this task. It&amp;rsquo;s very well documented, as is the rest of Textract. I was slightly disappointed by &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/examples-blocks.html"&gt;the examples&lt;/a&gt;, but nothing serious.&lt;/p&gt;
&lt;p&gt;I wanted to write this short blog post to share a piece of code I use to convert tables extracted through Amazon Textract to &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"&gt;&lt;code&gt;pandas.DataFrame&lt;/code&gt;&lt;/a&gt;s. I&amp;rsquo;ll be using the following anonymised image as an example:&lt;/p&gt;</description></item><item><title>What my PhD was about</title><link>https://maxhalford.github.io/blog/phd-about/</link><pubDate>Wed, 06 Jan 2021 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/phd-about/</guid><description>&lt;p&gt;I defended my PhD thesis on the 12th of October 2020, exactly 3 years and 11 days after having started it. The title of my PhD is &lt;em&gt;Machine learning for query selectivity estimation in relational databases&lt;/em&gt;. I thought it would be worthwhile to summarise what I did. Not sure anyone will read this, but at least I&amp;rsquo;ll be able to remember what I did when I grow old and senile.&lt;/p&gt;</description></item><item><title>Computing cross-correlations in SQL</title><link>https://maxhalford.github.io/blog/sql-cross-correlations/</link><pubDate>Tue, 17 Nov 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/sql-cross-correlations/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m currently working on a problem at work where I have to measure the impact of a &lt;span style="color: SlateBlue;"&gt;growth initiative&lt;/span&gt; on a &lt;span style="color: MediumSeaGreen;"&gt;performance metric&lt;/span&gt;. Hypothetically, this might answer the following kind of question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve spent &lt;span style="color: SlateBlue;"&gt;X amount of money&lt;/span&gt;, what is the impact on the &lt;span style="color: MediumSeaGreen;"&gt;number of visitors on my website&lt;/span&gt;?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Of course, there are many measures that can be taken to answer such a question. I decided to measure the correlation between the &lt;span style="color: SlateBlue;"&gt;initiative&lt;/span&gt; and the &lt;span style="color: MediumSeaGreen;"&gt;metric&lt;/span&gt;, with the latter being shifted forward in time. This measure is called the &lt;a href="https://en.wikipedia.org/wiki/Cross-correlation"&gt;cross-correlation&lt;/a&gt;. It&amp;rsquo;s different from &lt;a href="https://en.wikipedia.org/wiki/Autocorrelation"&gt;serial correlation&lt;/a&gt;, which is the correlation of a series with a shifted version of itself.&lt;/p&gt;</description></item><item><title>Unsupervised text classification with word embeddings</title><link>https://maxhalford.github.io/blog/unsupervised-text-classification/</link><pubDate>Sat, 03 Oct 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/unsupervised-text-classification/</guid><description>&lt;div align="center" &gt;
 &lt;img height="300px" src="https://maxhalford.github.io/img/blog/document-classification/morpheus.jpg" alt="morpheus"&gt;
 &lt;br&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &amp;ndash; &lt;em&gt;since writing this article, I have discovered that the method I describe is a form of &lt;a href="https://en.wikipedia.org/wiki/Zero-shot_learning"&gt;zero-shot learning&lt;/a&gt;. So I guess you could say that this article is a tutorial on zero-shot learning for NLP.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &amp;ndash; &lt;em&gt;I stumbled on a &lt;a href="https://www.aclweb.org/anthology/P19-1036/"&gt;paper&lt;/a&gt; entitled &amp;ldquo;Towards Unsupervised Text Classification Leveraging Experts and Word Embeddings&amp;rdquo; which proposes something very similar. The paper is rather well written, so you might want to check it out. Note that they call the &lt;code&gt;tech -&amp;gt; technology&lt;/code&gt; trick &amp;ldquo;label enrichment&amp;rdquo;.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Focal loss implementation for LightGBM</title><link>https://maxhalford.github.io/blog/lightgbm-focal-loss/</link><pubDate>Sun, 20 Sep 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/lightgbm-focal-loss/</guid><description>&lt;p&gt;&lt;strong&gt;Edit (2021-01-26)&lt;/strong&gt; &amp;ndash; &lt;em&gt;I initially wrote this blog post using version 2.3.1 of LightGBM. I&amp;rsquo;ve now updated it to use version 3.1.1. There are a couple of subtle but important differences between version 2.x.y and 3.x.y. If you&amp;rsquo;re using version 2.x.y, then I strongly recommend you to upgrade to version 3.x.y.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re reading this blog post, then you&amp;rsquo;re likely to be aware of &lt;a href="https://github.com/microsoft/LightGBM"&gt;LightGBM&lt;/a&gt;. It is a best-of-breed &lt;a href="https://explained.ai/gradient-boosting/"&gt;gradient boosting&lt;/a&gt; library. As of 2020, it&amp;rsquo;s still the go-to machine learning model for tabular data. It&amp;rsquo;s also ubiquitous in competitive machine learning.&lt;/p&gt;</description></item><item><title>A few intermediate pandas tricks</title><link>https://maxhalford.github.io/blog/pandas-tricks/</link><pubDate>Mon, 17 Aug 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/pandas-tricks/</guid><description>&lt;p&gt;I want to use this post to share some &lt;code&gt;pandas&lt;/code&gt; snippets that I find useful. I use them from time to time, in particular when I&amp;rsquo;m doing &lt;a href="https://www.kaggle.com/search?q=time+series+in%3Acompetitions"&gt;time series competitions&lt;/a&gt; on platforms such as Kaggle. Like any data scientist, I perform similar data processing steps on different datasets. Usually, I put repetitive patterns in &lt;a href="https://github.com/MaxHalford/xam"&gt;&lt;code&gt;xam&lt;/code&gt;&lt;/a&gt;, which is my personal data science toolbox. 
However, I think that the following snippets are too small and too specific to be added to a library.&lt;/p&gt;</description></item><item><title>A brief introduction to online machine learning @ Hong Kong Machine Learning Meetup</title><link>https://maxhalford.github.io/blog/brief-introduction-to-online-machine-learning/</link><pubDate>Wed, 10 Jun 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/brief-introduction-to-online-machine-learning/</guid><description/></item><item><title>The correct way to evaluate online machine learning models</title><link>https://maxhalford.github.io/blog/online-learning-evaluation/</link><pubDate>Sun, 07 Jun 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-learning-evaluation/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Most supervised machine learning algorithms work in the batch setting, whereby they are fitted on a training set offline, and are used to predict the outcomes of new samples. The only way for batch machine learning algorithms to learn from new samples is to train them from scratch with both the old samples and the new ones. Meanwhile, some learning algorithms are online, and can predict as well as update themselves when new samples are available. This encompasses any model trained with &lt;a href="https://leon.bottou.org/publications/pdf/compstat-2010.pdf"&gt;stochastic gradient descent&lt;/a&gt; &amp;ndash; which includes deep neural networks, &lt;a href="https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf"&gt;factorisation machines&lt;/a&gt;, and &lt;a href="https://www.cs.huji.ac.il/~shais/papers/ShalevSiSrCo10.pdf"&gt;SVMs&lt;/a&gt; &amp;ndash; as well as &lt;a href="https://homes.cs.washington.edu/~pedrod/papers/kdd00.pdf"&gt;decision trees&lt;/a&gt;, &lt;a href="https://ai.stanford.edu/~ang/papers/icml04-onlinemetric.pdf"&gt;metric learning&lt;/a&gt;, and &lt;a href="https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf"&gt;naïve Bayes&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Online machine learning with decision trees @ Toulouse AOC workgroup</title><link>https://maxhalford.github.io/blog/online-machine-learning-with-decision-trees/</link><pubDate>Thu, 07 May 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-machine-learning-with-decision-trees/</guid><description/></item><item><title>Server-sent events in Flask without extra dependencies</title><link>https://maxhalford.github.io/blog/flask-sse-no-deps/</link><pubDate>Mon, 04 May 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/flask-sse-no-deps/</guid><description>&lt;p&gt;&lt;a 
href="https://en.wikipedia.org/wiki/Server-sent_events"&gt;Server-sent events (SSE)&lt;/a&gt; is a mechanism for sending updates from a server to a client. The fundamental difference with &lt;a href="https://en.wikipedia.org/wiki/WebSocket"&gt;WebSockets&lt;/a&gt; is that the communication only goes in one direction. In other words, the client cannot send information to the server. For many use cases this is all you might need. Indeed, if you just want to receive notifications/updates/messages, then using a WebSocket is overkill. Once you&amp;rsquo;ve implemented the SSE functionality on your server, then all you need on a JavaScript client is an &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/EventSource"&gt;&lt;code&gt;EventSource&lt;/code&gt;&lt;/a&gt;. Trust me, it&amp;rsquo;s very straightforward.&lt;/p&gt;</description></item><item><title>I got plagiarized and Google didn't help</title><link>https://maxhalford.github.io/blog/plagiarism-google-didnt-help/</link><pubDate>Fri, 17 Apr 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/plagiarism-google-didnt-help/</guid><description>&lt;p&gt;One of my most popular articles is &lt;a href="https://maxhalford.github.io/blog/target-encoding"&gt;the one on target encoding&lt;/a&gt;. It gets a fair amount of mentions on Kaggle discussions and I see it pop up from time to time in other contexts. It also brings me around 2500 unique monthly viewers. That&amp;rsquo;s quite a chunk of people for an unambitious blogger like me. Up to a few months ago, my article was on the first page of Google when you typed in searches such as &amp;ldquo;&lt;em&gt;target encoding python&lt;/em&gt;&amp;rdquo; and &amp;ldquo;&lt;em&gt;bayesian target encoding&lt;/em&gt;&amp;rdquo;. 
This was purely organic and it felt nice to have a relevant article, even though that&amp;rsquo;s not the main reason why I blog.&lt;/p&gt;</description></item><item><title>Our solution to the IDAO 2020 qualifiers</title><link>https://maxhalford.github.io/blog/idao-2020-qualifiers-solution/</link><pubDate>Sun, 12 Apr 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/idao-2020-qualifiers-solution/</guid><description/></item><item><title>Speeding up scikit-learn for single predictions</title><link>https://maxhalford.github.io/blog/speeding-up-sklearn-single-predictions/</link><pubDate>Tue, 31 Mar 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/speeding-up-sklearn-single-predictions/</guid><description>&lt;p&gt;It is now common practice to train machine learning models offline before putting them behind an API endpoint to serve predictions. Specifically, we want an API route which can make a prediction for a single row/instance/sample/data point/individual (&lt;a href="https://www.youtube.com/watch?v=1prhCWO_518"&gt;call it what you want&lt;/a&gt;). Nowadays, we have great tools to do this that take care of the nitty-gritty details, such as &lt;a href="https://github.com/cortexlabs/cortex"&gt;Cortex&lt;/a&gt;, &lt;a href="https://www.mlflow.org/docs/latest/models.html"&gt;MLFlow&lt;/a&gt;, &lt;a href="https://www.kubeflow.org/docs/components/serving/"&gt;Kubeflow&lt;/a&gt;, and &lt;a href="https://github.com/ucbrise/clipper"&gt;Clipper&lt;/a&gt;. There are also paid services that hold your hand a bit more, such as &lt;a href="https://www.datarobot.com/"&gt;DataRobot&lt;/a&gt;, &lt;a href="https://www.h2o.ai/"&gt;H2O&lt;/a&gt;, and &lt;a href="https://www.cubonacci.com/"&gt;Cubonacci&lt;/a&gt;. 
One could argue that deploying machine learning models has never been easier.&lt;/p&gt;</description></item><item><title>Machine learning for streaming data with creme</title><link>https://maxhalford.github.io/blog/medium-creme/</link><pubDate>Thu, 26 Mar 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/medium-creme/</guid><description/></item><item><title>Global explanation of machine learning with sensitivity analysis @ MASCOT-NUM</title><link>https://maxhalford.github.io/blog/global-explanation-of-ml-with-sensitivity-analysis/</link><pubDate>Tue, 10 Mar 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/global-explanation-of-ml-with-sensitivity-analysis/</guid><description/></item><item><title>Bayesian linear regression for practitioners</title><link>https://maxhalford.github.io/blog/bayesian-linear-regression/</link><pubDate>Wed, 26 Feb 2020 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/bayesian-linear-regression/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Suppose you have an infinite stream of feature vectors $x_i$ and targets $y_i$. In this case, $i$ denotes the order in which the data arrives. If you&amp;rsquo;re doing supervised learning, then your goal is to estimate $y_i$ &lt;em&gt;before&lt;/em&gt; it is revealed to you. In order to do so, you have a model which is composed of parameters denoted $\theta_i$. For instance, $\theta_i$ represents the feature weights when using linear regression. After a while, $y_i$ will be revealed, which will allow you to update $\theta_i$ and thus obtain $\theta_{i+1}$. To perform the update, you may apply whichever learning rule you wish &amp;ndash; for instance most people use &lt;a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#/Extensions_and_variants"&gt;some flavor of stochastic gradient descent&lt;/a&gt;. The process I just described is called &lt;a href="https://en.wikipedia.org/wiki/Online_machine_learning"&gt;online supervised machine learning&lt;/a&gt;. The difference between online machine learning and the more traditional batch machine learning is that an online model is dynamic and learns on the fly. Online learning solves a lot of pain points in real-world environments, mostly because it doesn&amp;rsquo;t require retraining models from scratch every time new data arrives.&lt;/p&gt;</description></item><item><title>Under-sampling a dataset with desired ratios</title><link>https://maxhalford.github.io/blog/undersampling-ratios/</link><pubDate>Tue, 17 Dec 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/undersampling-ratios/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve just spent a few hours looking at under-sampling and how it can help a classifier learn from an imbalanced dataset. The idea is quite simple: randomly sample the majority class and leave the minority class untouched. There are more sophisticated ways to do this &amp;ndash; for instance by creating synthetic observations from the minority class &lt;em&gt;à la&lt;/em&gt; &lt;a href="http://rikunert.com/SMOTE_explained"&gt;SMOTE&lt;/a&gt; &amp;ndash; but I won&amp;rsquo;t be discussing that here.&lt;/p&gt;
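&lt;p&gt;&lt;em&gt;As a minimal sketch of the idea with pandas (the column name is hypothetical), random under-sampling boils down to:&lt;/em&gt;&lt;/p&gt;

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 negatives, 10 positives
df = pd.DataFrame({"label": [0] * 90 + [1] * 10})

# Leave the minority class untouched and sample the majority
# class down to the same size
minority = df[df["label"] == 1]
majority = df[df["label"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority])
print(balanced["label"].value_counts())
```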
&lt;p&gt;I checked out the &lt;a href="https://imbalanced-learn.readthedocs.io/en/stable/index.html"&gt;&lt;code&gt;imblearn&lt;/code&gt;&lt;/a&gt; library and noticed they have an implementation of random under-sampling aptly named &lt;a href="https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html#imblearn.under_sampling.RandomUnderSampler"&gt;&lt;code&gt;RandomUnderSampler&lt;/code&gt;&lt;/a&gt;. It contains a &lt;code&gt;sampling_strategy&lt;/code&gt; parameter which gives some control over the sampling. By default the observations are resampled so that each class is equally represented:&lt;/p&gt;</description></item><item><title>The benefits of online machine learning @ Quantmetry</title><link>https://maxhalford.github.io/blog/the-benefits-of-online-learning-quantmetry/</link><pubDate>Tue, 29 Oct 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/the-benefits-of-online-learning-quantmetry/</guid><description/></item><item><title>The benefits of online machine learning @ Element AI</title><link>https://maxhalford.github.io/blog/the-benefits-of-online-learning-element-ai/</link><pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/the-benefits-of-online-learning-element-ai/</guid><description/></item><item><title>Finding fuzzy duplicates with pandas</title><link>https://maxhalford.github.io/blog/transitive-duplicates/</link><pubDate>Mon, 16 Sep 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/transitive-duplicates/</guid><description>&lt;p&gt;Duplicate detection is the task of finding two or more instances in a dataset that are in fact identical. As an example, take the following toy dataset:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th style="text-align: center"&gt;&lt;/th&gt;
 &lt;th style="text-align: center"&gt;&lt;strong&gt;First name&lt;/strong&gt;&lt;/th&gt;
 &lt;th style="text-align: center"&gt;&lt;strong&gt;Last name&lt;/strong&gt;&lt;/th&gt;
 &lt;th style="text-align: center"&gt;&lt;strong&gt;Email&lt;/strong&gt;&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td style="text-align: center"&gt;0&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Erlich&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Bachman&lt;/td&gt;
 &lt;td style="text-align: center"&gt;&lt;a href="mailto:eb@piedpiper.com"&gt;eb@piedpiper.com&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td style="text-align: center"&gt;1&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Erlich&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Bachmann&lt;/td&gt;
 &lt;td style="text-align: center"&gt;&lt;a href="mailto:eb@piedpiper.com"&gt;eb@piedpiper.com&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td style="text-align: center"&gt;2&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Erlik&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Bachman&lt;/td&gt;
 &lt;td style="text-align: center"&gt;&lt;a href="mailto:eb@piedpiper.co"&gt;eb@piedpiper.co&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td style="text-align: center"&gt;3&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Erlich&lt;/td&gt;
 &lt;td style="text-align: center"&gt;Bachmann&lt;/td&gt;
 &lt;td style="text-align: center"&gt;&lt;a href="mailto:eb@piedpiper.com"&gt;eb@piedpiper.com&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each of these instances (rows, if you prefer) corresponds to the same &amp;ldquo;thing&amp;rdquo; &amp;ndash; note that I&amp;rsquo;m not using the word &amp;ldquo;entity&amp;rdquo; because &lt;a href="https://en.wikipedia.org/wiki/Record_linkage#/Entity_resolution"&gt;entity resolution&lt;/a&gt; is a different, and yet related, concept. In my experience there are two main reasons why data duplication may occur:&lt;/p&gt;</description></item><item><title>A smooth approach to putting machine learning into production</title><link>https://maxhalford.github.io/blog/machine-learning-production/</link><pubDate>Sat, 13 Jul 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/machine-learning-production/</guid><description>&lt;p&gt;Putting machine learning into production is hard. Usually I&amp;rsquo;m doubtful of such statements, but in this case I&amp;rsquo;ve never met anyone for whom everything has gone smoothly. Most data scientists might agree that there is a huge gap between their local environment and a live environment. In fact, &amp;ldquo;productionalizing&amp;rdquo; machine learning is such a complex topic that entire companies have risen to address the issue. I&amp;rsquo;m not just talking about running a gigantic grid search and finding the best model, I&amp;rsquo;m talking about putting a machine learning model live so that it actually has a positive impact on your business/project. Off the top of my head: &lt;a href="https://www.cubonacci.com/"&gt;Cubonacci&lt;/a&gt;, &lt;a href="https://www.h2o.ai/"&gt;H2O&lt;/a&gt;, &lt;a href="https://cloud.google.com/automl/"&gt;Google AutoML&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html"&gt;Amazon Sagemaker&lt;/a&gt;, and &lt;a href="https://www.datarobot.com/"&gt;DataRobot&lt;/a&gt;. 
In other words people are making money off businesses because data scientists and engineers are having a hard time putting their models into production. In my opinion if a data scientist can&amp;rsquo;t put her model into production herself then something is wrong. Life should be simpler.&lt;/p&gt;</description></item><item><title>The benefits of online machine learning @ Airbus Bizlab</title><link>https://maxhalford.github.io/blog/the-benefits-of-online-learning-airbus-bizlab/</link><pubDate>Fri, 28 Jun 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/the-benefits-of-online-learning-airbus-bizlab/</guid><description/></item><item><title>Machine learning incrémental: des concepts à la pratique @ Toulouse Data Science Meetup</title><link>https://maxhalford.github.io/blog/machine-learning-incremental-tds/</link><pubDate>Tue, 28 May 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/machine-learning-incremental-tds/</guid><description/></item><item><title>Skyline queries in Python</title><link>https://maxhalford.github.io/blog/skyline-queries/</link><pubDate>Tue, 21 May 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/skyline-queries/</guid><description>&lt;p&gt;Imagine that you&amp;rsquo;re looking to buy a home. If you have an analytical mind then you might want to tackle this with a quantitative approach. Let&amp;rsquo;s suppose that you have a list of potential homes, and each home has some attributes that can help you compare them. As an example, we&amp;rsquo;ll consider three attributes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;price&lt;/code&gt; of the house, which you want to minimize&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;size&lt;/code&gt; of the house, which you want to maximize&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;city&lt;/code&gt; where the house is located, which you don&amp;rsquo;t really care about&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some houses will be objectively better than others because they will be cheaper and bigger. However, for some pairs of houses the comparison will not be as clear. It might be that house A is more expensive than house B but is also larger. In data analysis this set of best houses which are incomparable with each other is called a &lt;a href="https://en.wikipedia.org/wiki/Skyline_operator"&gt;skyline&lt;/a&gt;. As they say, a picture is worth a thousand words, so let&amp;rsquo;s draw one.&lt;/p&gt;</description></item><item><title>Online machine learning with creme @ PyData Amsterdam</title><link>https://maxhalford.github.io/blog/online-machine-learning-with-creme-pydata/</link><pubDate>Sat, 11 May 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/online-machine-learning-with-creme-pydata/</guid><description/></item><item><title>SQL subquery enumeration</title><link>https://maxhalford.github.io/blog/sql-subquery-enumeration/</link><pubDate>Mon, 06 May 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/sql-subquery-enumeration/</guid><description>&lt;p&gt;I recently stumbled on a rather fun problem during my PhD. I wanted to generate all possible subqueries from a given SQL query. In this case an example is easily worth a thousand words. Take the following SQL query:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sql" data-lang="sql"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;purchases&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;shops&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shop_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nationality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Swedish&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hair&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Blond&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Stockholm&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here are all the possible subqueries that can be generated from the above query.&lt;/p&gt;</description></item><item><title>An approach based on Bayesian networks for query selectivity estimation @ DASFAA</title><link>https://maxhalford.github.io/blog/an-approach-based-on-bayesian-networks-for-query-selectivity-estimation-slides/</link><pubDate>Mon, 22 Apr 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/an-approach-based-on-bayesian-networks-for-query-selectivity-estimation-slides/</guid><description/></item><item><title>Morellet crosses with JavaScript</title><link>https://maxhalford.github.io/blog/morellet/</link><pubDate>Sun, 03 Feb 2019 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/morellet/</guid><description>&lt;p&gt;These days I&amp;rsquo;m working on a deep learning project. I hate it but I promised myself to give it a real try. My scripts are taking a long time so I decided to do some procedural art while I waited. This time I&amp;rsquo;m going to reproduce the following crosses made by &lt;a href="https://en.wikipedia.org/wiki/Fran%C3%A7ois_Morellet"&gt;François Morellet&lt;/a&gt;. I saw them the last time I went to the Musée Pompidou with some friends from university. I don&amp;rsquo;t have a smartphone anymore so one of my friends was kind enough to take a few pictures for me, including this one. 
The painting is called &lt;a href="https://www.centrepompidou.fr/cpv/resource/cxx585o/ryjG5EL"&gt;&lt;em&gt;Violet, bleu, vert, jaune, orange, rouge&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Streaming groupbys in pandas for big datasets</title><link>https://maxhalford.github.io/blog/pandas-streaming-groupby/</link><pubDate>Wed, 05 Dec 2018 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/pandas-streaming-groupby/</guid><description>&lt;p&gt;If you&amp;rsquo;ve done a bit of Kaggling, then you&amp;rsquo;ve probably been typing a fair share of &lt;code&gt;df.groupby(some_col)&lt;/code&gt;. That is, if you&amp;rsquo;re using Python. If you&amp;rsquo;re handling tabular data, then a lot of your features will revolve around computing &lt;em&gt;aggregate statistics&lt;/em&gt;. This is very true for the ongoing &lt;a href="https://www.kaggle.com/c/PLAsTiCC-2018"&gt;PLAsTiCC Astronomical Classification challenge&lt;/a&gt;. The goal of the competition is to classify objects in the sky into one of 14 groups. The bulk of the available data is a set of so-called &lt;em&gt;light curves&lt;/em&gt;. A light curve is a sequence of brightness measurements over time. Each light curve is filtered at different passbands. The idea is that there is one light curve per passband and per object and that the shape of each light curve should tell us what kind of object we&amp;rsquo;re looking at. Yada yada.&lt;/p&gt;</description></item><item><title>Target encoding done the right way</title><link>https://maxhalford.github.io/blog/target-encoding/</link><pubDate>Sat, 13 Oct 2018 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/target-encoding/</guid><description>&lt;p&gt;When you&amp;rsquo;re doing supervised learning, you often have to deal with categorical variables. 
That is, variables which don&amp;rsquo;t have a natural numerical representation. The problem is that most machine learning algorithms require the input data to be numerical. At some point or another a data science pipeline will require converting categorical variables to numerical variables.&lt;/p&gt;
&lt;p&gt;There are many ways to do so:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html"&gt;Label encoding&lt;/a&gt; where you choose an arbitrary number for each category&lt;/li&gt;
&lt;li&gt;&lt;a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html"&gt;One-hot encoding&lt;/a&gt; where you create one binary column per category&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tensorflow.org/tutorials/representation/word2vec"&gt;Vector representation&lt;/a&gt; a.k.a. word2vec where you find a low dimensional subspace that fits your data&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Microsoft/LightGBM/blob/master/docs/Advanced-Topics.rst#categorical-feature-support"&gt;Optimal binning&lt;/a&gt; where you rely on tree-learners such as LightGBM or CatBoost&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.saedsayad.com/encoding.htm"&gt;Target encoding&lt;/a&gt; where you average the target value by category&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each and every one of these methods has its own pros and cons. The best approach typically depends on your data and your requirements. If a variable has a lot of categories, then a one-hot encoding scheme will produce many columns, which can cause memory issues. In my experience, relying on LightGBM/CatBoost is the best out-of-the-box method. Label encoding is useless and you should never use it. However if your categorical variable happens to be ordinal then you can and should represent it with increasing numbers (for example &amp;ldquo;cold&amp;rdquo; becomes 0, &amp;ldquo;mild&amp;rdquo; becomes 1, and &amp;ldquo;hot&amp;rdquo; becomes 2). &lt;a href="https://en.wikipedia.org/wiki/Word2vec"&gt;Word2vec&lt;/a&gt; and other such methods are cool and good but they require some fine-tuning and don&amp;rsquo;t always work out of the box.&lt;/p&gt;</description></item><item><title>Stella triangles with JavaScript</title><link>https://maxhalford.github.io/blog/stella-triangles/</link><pubDate>Thu, 26 Apr 2018 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/stella-triangles/</guid><description>&lt;p&gt;Around the same time last year I visited the &lt;a href="https://www.sfmoma.org/"&gt;San Francisco Museum of Modern Art&lt;/a&gt;. &lt;a href="https://en.wikipedia.org/wiki/Frank_Stella"&gt;Frank Stella&lt;/a&gt;&amp;rsquo;s compositions really caught my eye. When I saw them I started thinking about how I could write a computer program to imitate his work. In this post I&amp;rsquo;m going to attempt to reproduce his so-called &lt;em&gt;V Series&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://maxhalford.github.io/img/blog/stella-triangles/1.jpg" alt="1"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://maxhalford.github.io/img/blog/stella-triangles/2.jpg" alt="2"&gt;&lt;/p&gt;
&lt;p&gt;Nice and simple right? Indeed in a lot of his work Frank Stella uses straight lines without much randomness. There are quite a few prints in the V Series. However in each one of them the common denominator is a single triangle. If we have a routine for drawing one triangle then we can use it to make compositions. As always let&amp;rsquo;s start by creating a canvas.&lt;/p&gt;</description></item><item><title>Unknown pleasures with JavaScript</title><link>https://maxhalford.github.io/blog/unknown-pleasures/</link><pubDate>Mon, 24 Jul 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/unknown-pleasures/</guid><description>&lt;p&gt;No this blog post is not about how nice JavaScript can be, instead it&amp;rsquo;s just another one of my attempts at reproducing modern art with &lt;a href="https://en.wikipedia.org/wiki/Procedural_generation"&gt;procedural generation&lt;/a&gt; and the &lt;a href="https://www.w3schools.com/html/html5_canvas.asp"&gt;HTML5 &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; element&lt;/a&gt;. This time I randomly generated images resembling the cover of the album by Joy Division called &amp;ldquo;Unknown Pleasures&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://maxhalford.github.io/img/blog/unknown-pleasures/album.png" alt="album"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Unknown_Pleasures#/Artwork_and_packaging"&gt;According to Wikipedia&lt;/a&gt;, this somewhat iconic album cover is based on radio waves. I saw a poster of it in a bar not long ago and decided to reproduce the next time I had some time to kill.&lt;/p&gt;</description></item><item><title>Subsampling a training set to match a test set - Part 1</title><link>https://maxhalford.github.io/blog/subsampling-1/</link><pubDate>Mon, 19 Jun 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/subsampling-1/</guid><description>&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &amp;ndash; &lt;em&gt;it&amp;rsquo;s 2022 and I still haven&amp;rsquo;t written a part 2. That&amp;rsquo;s because I believe this problem is easily solved with &lt;a href="https://www.kaggle.com/carlmcbrideellis/what-is-adversarial-validation"&gt;adversarial validation&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Some friends and I recently qualified for the final of the 2017 edition of the &lt;a href="http://www.datasciencegame.com"&gt;Data Science Game&lt;/a&gt; competition. The first part was a Kaggle competition with data provided by Deezer. The problem was a binary classification task where one had to predict if a user was going to listen to a song that was proposed to him. Like many teams we extracted clever features and trained an XGBoost classifier, classic. However, the one special thing we did was to subsample our training set so that it was more representative of the test set.&lt;/p&gt;</description></item><item><title>Docker for data science @ HelloFresh Berlin</title><link>https://maxhalford.github.io/blog/docker-for-data-science/</link><pubDate>Thu, 01 Jun 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/docker-for-data-science/</guid><description/></item><item><title>Halftoning with Go - Part 2</title><link>https://maxhalford.github.io/blog/halftoning-2/</link><pubDate>Mon, 20 Mar 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/halftoning-2/</guid><description>&lt;p&gt;The next stop on my travel through the world of halftoning will be the implementation of &lt;em&gt;Weighted Voronoi Stippling&lt;/em&gt; as described in &lt;a href="https://cs.nyu.edu/~ajsecord/"&gt;Adrian Secord&lt;/a&gt;&amp;rsquo;s 2002 &lt;a href="http://www.mrl.nyu.edu/~ajsecord/npar2002/npar2002_ajsecord_preprint.pdf"&gt;paper&lt;/a&gt;. This method is more involved than the ones I detailed in my &lt;a href="https://maxhalford.github.io/blog/halftoning-1"&gt;previous blog post&lt;/a&gt;, however the results are quite interesting. Again, I did the implementation in Go.&lt;/p&gt;
&lt;div align="center" &gt;
&lt;figure style="width: 80%;"&gt;
 &lt;img src="https://maxhalford.github.io/img/blog/halftoning-2/coliseum.jpg" alt="colosseum"&gt;
 &lt;img src="https://maxhalford.github.io/img/blog/halftoning-2/coliseum_stippled.jpg" alt="colosseum_stippled"&gt;
 &lt;figcaption&gt;Notice the black dot in the middle of the white square?&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/div&gt;
&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;I found a fair amount of resources about the method, most of them being implementations of Adrian Secord&amp;rsquo;s paper. However, not many of these resources went into the nitty-gritty details which are not obvious for beginners in image processing. Before delving into the code, I want to go through some concepts that may seem obvious to some readers but that I judge worthy of detailing.&lt;/p&gt;</description></item><item><title>Grid paintings à la Mondrian with JavaScript</title><link>https://maxhalford.github.io/blog/mondrian/</link><pubDate>Sat, 04 Mar 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/mondrian/</guid><description>&lt;p&gt;I was at a laundrette today and had just finished my book so I had some time to kill. Naturally I devised an algorithm for generating drawings that would resemble the &lt;a href="https://www.google.co.uk/search?q=piet+mondrian+grid+painting"&gt;grid-like paintings&lt;/a&gt; that &lt;a href="https://en.wikipedia.org/wiki/Piet_Mondrian"&gt;Piet Mondrian&lt;/a&gt; made famous. With the benefit of hindsight I guess I could indulge in saner activities while waiting for my laundry to dry!&lt;/p&gt;
&lt;p&gt;I went through different ideas but in the end I settled on a recursive approach. My idea is to divide a rectangle into two smaller ones and then to do the same with each sub-rectangle. Every time a rectangle is generated, it is filled with a random color; like Mondrian I use yellow, red, blue, black and white.&lt;/p&gt;</description></item><item><title>A short introduction and conclusion to the OpenBikes 2016 Challenge</title><link>https://maxhalford.github.io/blog/openbikes-challenge/</link><pubDate>Thu, 26 Jan 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/openbikes-challenge/</guid><description>&lt;p&gt;During my undergraduate internship in 2015 I started a side project called OpenBikes. The idea was to visualize and analyze bike sharing over multiple cities. &lt;a href="http://axelbellec.fr/"&gt;Axel Bellec&lt;/a&gt; joined me and in 2016 we &lt;a href="http://www.opendatafrance.net/2016/02/05/le-prix-open-data-toulouse-metropole-remis-a-openbikes"&gt;won a national open data competition&lt;/a&gt;. Since then we haven&amp;rsquo;t pursued anything major; instead we use OpenBikes to try out technologies and to apply concepts we learn at university and online.&lt;/p&gt;
&lt;p&gt;Before the 2016 summer holidays one of my professors, &lt;a href="https://www.math.univ-toulouse.fr/~agarivie/"&gt;Aurélien Garivier&lt;/a&gt; mentioned that he was considering using our data for a Kaggle-like competition between some statistics curriculums in France. Near the end of the summer, I sat down with a group of professors and we decided upon a format for the so-called &amp;ldquo;Challenge&amp;rdquo;. The general idea was to provide student teams with historical data on multiple bike stations and ask them to do some forecasting which we would then score based on a secret truth. The whole thing lasted from the 5th of October 2016 till the 26th of January 2017 when the best team was crowned.&lt;/p&gt;</description></item><item><title>Challenge Big Data @ TSE</title><link>https://maxhalford.github.io/blog/challenge-big-data/</link><pubDate>Mon, 09 Jan 2017 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/challenge-big-data/</guid><description/></item><item><title>Halftoning with Go - Part 1</title><link>https://maxhalford.github.io/blog/halftoning-1/</link><pubDate>Sun, 27 Nov 2016 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/halftoning-1/</guid><description>&lt;p&gt;Recently I stumbled upon &lt;a href="http://www.cgl.uwaterloo.ca/csk/projects/tsp/"&gt;this webpage&lt;/a&gt; which shows how to use a TSP solver as a &lt;a href="https://en.wikipedia.org/wiki/Halftone"&gt;&lt;em&gt;halftoning&lt;/em&gt;&lt;/a&gt; technique. I began to read about related concepts like &lt;a href="https://en.wikipedia.org/wiki/Dither"&gt;&lt;em&gt;dithering&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Stippling"&gt;&lt;em&gt;stippling&lt;/em&gt;&lt;/a&gt;. I don&amp;rsquo;t have any background in photography but I can appreciate the visual appeal of these techniques. 
As I understand it these techniques were first invented to save ink for printing. However nowadays printing has become cheaper and the modern use of these techniques is mostly aesthetic, at least for images.&lt;/p&gt;</description></item><item><title>Predire la disponibilité des Velib' @ Toulouse Data Science Meetup</title><link>https://maxhalford.github.io/blog/forecasting-bicycle-sharing-usage/</link><pubDate>Wed, 30 Mar 2016 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/forecasting-bicycle-sharing-usage/</guid><description/></item><item><title>Recursive polygons with JavaScript</title><link>https://maxhalford.github.io/blog/recursive-polygons/</link><pubDate>Fri, 25 Mar 2016 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/recursive-polygons/</guid><description>&lt;p&gt;I like modern art, I enjoy looking at the stuff that was made at the beginning of the 20th century and thinking how it is still shaping today&amp;rsquo;s style. I&amp;rsquo;m not an expert, it&amp;rsquo;s just a hobby of mine. I especially like the &lt;a href="https://www.centrepompidou.fr/"&gt;Centre Pompidou&lt;/a&gt; in Paris, it&amp;rsquo;s got loads of fascinating stuff. While I was going through the galleries it struck me that some of the paintings were very geometrical. In fact they were so geometrical that a machine could have produced them! I&amp;rsquo;m not talking about artificial intelligence but rather a set of rules that could be given to a programming language. Through a series of blog posts I would like to try to emulate some works with my computer. I realize it&amp;rsquo;s a waste of time but it&amp;rsquo;s a good opportunity for me to learn some more JavaScript and refresh my geometry. 
I also want to insist on making these drawings random, not deterministic.&lt;/p&gt;</description></item><item><title>The Naïve Bayes classifier</title><link>https://maxhalford.github.io/blog/naive-bayes/</link><pubDate>Thu, 10 Sep 2015 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/naive-bayes/</guid><description>&lt;p&gt;The objective of a classifier is to decide to which &lt;em&gt;class&lt;/em&gt; (also called &lt;em&gt;label&lt;/em&gt;) to assign an observation based on observed data. In &lt;em&gt;supervised learning&lt;/em&gt;, this is done by taking into account previous classifications. In other words if we &lt;em&gt;know&lt;/em&gt; that certain observations are classified in a certain way, the goal is to determine the class of a new observation. The first group of observations on which the classifier is built is called the &lt;em&gt;training set&lt;/em&gt;.&lt;/p&gt;</description></item><item><title>An introduction to genetic algorithms</title><link>https://maxhalford.github.io/blog/genetic-algorithms-introduction/</link><pubDate>Sun, 02 Aug 2015 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/genetic-algorithms-introduction/</guid><description>&lt;p&gt;The goal of genetic algorithms (GAs) is to solve problems whose solutions are not easily found (i.e. NP problems, nonlinear optimization, etc.). For example, finding the shortest path from A to B in a directed graph is easily done with &lt;em&gt;Dijkstra&amp;rsquo;s algorithm&lt;/em&gt;; it can be solved in polynomial time. However the time to find the shortest route that visits all points on an undirected graph, also known as the &lt;a href="http://www.wikiwand.com/en/Travelling_salesman_problem"&gt;Travelling Salesman Problem&lt;/a&gt; (TSP), increases exponentially as the number of points increases. 
More generally, GAs are useful for problems where an analytical approach is complicated or even impossible. By giving up on perfection they manage to find a good approximation of the optimal solution.&lt;/p&gt;</description></item><item><title>Setting up a droplet to host a Flask app</title><link>https://maxhalford.github.io/blog/flask-droplet/</link><pubDate>Tue, 14 Jul 2015 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/flask-droplet/</guid><description>&lt;p&gt;After having worked for some weeks on the &lt;a href="http://openbikes.co"&gt;OpenBikes website&lt;/a&gt;, it was time to put it online. &lt;a href="https://www.digitalocean.com/"&gt;Digital Ocean&lt;/a&gt; seemed to provide a good service and so I decided to give it a spin. Their documentation is quite good but it doesn&amp;rsquo;t cover exactly everything for setting up Flask. In this post I simply want to record every single step I took.&lt;/p&gt;
&lt;p&gt;OpenBikes is a project with a Flask backend and a few Upstart jobs. It lives at the &lt;a href="http://openbikes.co"&gt;openbikes.co&lt;/a&gt; domain name. In this blog post I will list every step it takes to make it happen on Ubuntu 14.04 with Apache (it&amp;rsquo;s robust and easy to set up). I didn&amp;rsquo;t always say when to use &lt;code&gt;sudo&lt;/code&gt; before the commands to avoid clutter; however you can safely use it everywhere.&lt;/p&gt;</description></item><item><title>Visualizing bike stations live data</title><link>https://maxhalford.github.io/blog/bike-stations/</link><pubDate>Wed, 03 Jun 2015 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/blog/bike-stations/</guid><description>&lt;p&gt;Recently some friends and I decided to launch &lt;a href="http://openbikes.co/"&gt;openbikes.co&lt;/a&gt;, a website for visualizing (and later on analyzing) urban bike traffic. We have a lot of ideas that we will progressively implement. Anyway, the point is that all of it started with me fiddling about with the &lt;em&gt;JCDecaux&lt;/em&gt; API and the &lt;em&gt;leaflet.js&lt;/em&gt; library and I would like to share it with you. Shall we?&lt;/p&gt;
&lt;h2 id="presentation"&gt;Presentation&lt;/h2&gt;
&lt;p&gt;In this post I want to show you the tools and the code to get a fully functional website for visualizing live data. In this particular case we will display bike stations in Toulouse, however I will keep the scripts as general as possible so they are easily modifiable for different data. Before starting here is a glimpse of the end result:&lt;/p&gt;</description></item><item><title/><link>https://maxhalford.github.io/slides/creme-pydata/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/slides/creme-pydata/</guid><description>&lt;!DOCTYPE html&gt;
&lt;html&gt;
 &lt;head&gt;
 &lt;title&gt;Incremental machine learning&lt;/title&gt;
 &lt;meta charset="utf-8"&gt;
 &lt;style&gt;
 @import url(https://fonts.googleapis.com/css?family=Open+Sans:700);
 @import url(https://fonts.googleapis.com/css?family=Fira+Mono:400,700,400italic);

 body, h1, h2, h3 { font-family: 'Open Sans'; }
 h1, h2, h3 { text-align: center; }

 .bullets {
 display: flex;
 flex-direction: row;
 justify-content: center;
 }

 .bigbullets {
 display: flex;
 flex-direction: row;
 justify-content: center;
 font-size: 35px;
 }

 .remark-slide-content {
 font-size: 25px;
 color: #1f282d;
 }

 .remark-code, .remark-inline-code { font-family: 'Fira Mono'; }
 .remark-inline-code { background: #f0f0f0; padding: 0px 4px; }

 .left-column { width: 50%; float: left; }
 .right-column { width: 50%; float: right; }
 .white {
 color: #FFFAFA;
 }

 .title-slide .remark-slide-number {
 display: none;
 }

 blockquote {
 background: #f9f9f9;
 border-left: 10px solid #ccc;
 margin: 1.5em 10px;
 padding: 0.5em 10px;
 quotes: "\201C""\201D""\2018""\2019";
 }
 blockquote:before {
 color: #ccc;
 content: open-quote;
 font-size: 4em;
 line-height: 0.1em;
 margin-right: 0.25em;
 vertical-align: -0.4em;
 }
 blockquote p {
 display: inline;
 }

 a { color: hotpink; text-decoration: none; }
 li { margin: 10px 0; }

 &lt;/style&gt;
 &lt;/head&gt;
 &lt;body&gt;
 &lt;textarea id="source"&gt;

class: center, middle

## Online machine learning with creme

### Max Halford

#### 11th of May 2019, Amsterdam

&lt;div style="display: flex; flex-direction: row; justify-content: center;"&gt;

 &lt;div align="center"&gt;
 &lt;img height="100px" src="https://maxhalford.github.io/img/slides/creme/logo_pydata.png" /&gt;
 &lt;/div&gt;

 &lt;div align="center"&gt;
 &lt;img height="125px" src="https://docs.google.com/drawings/d/e/2PACX-1vSl80T4MnWRsPX3KvlB2kn6zVdHdUleG_w2zBiLS7RxLGAHxiSYTnw3LZtXh__YMv6KcIOYOvkSt9PB/pub?w=841&amp;h=350" /&gt;
 &lt;/div&gt;

&lt;/div&gt;

???

Hello!

---

### Outline

.bullets[
1. You're not doing ML the correct way
2. Cool kids do online learning
3. Introducing `creme`, the fresh kid on the block
4. Forecasting League of Legends match durations
5. Future work
]

&lt;div align="center"&gt;
 &lt;iframe src="https://giphy.com/embed/ZoAa7lsmym6UE" width="480" height="223" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

???

My goal is to convince you that the way most people do machine learning isn't the way to go in production scenarios. Online learning is more suitable and nullifies many issues. I'll introduce a new library some friends and I have been working on, and I'll show you how we're using it for a pet project: trying to forecast the duration of League of Legends matches.

---

### Batch learning

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
&gt;&gt;&gt; model.fit(X_train, y_train)
&gt;&gt;&gt; dump(model, 'model.json')
&gt;&gt;&gt; model = load('model.json')
&gt;&gt;&gt; y_pred = model.predict(X_test)
```

???

The most common form of machine learning is called batch learning. It's what you do when you use scikit-learn and do Kaggle competitions. I won't dwell on this too long as I'm pretty sure you're all acquainted with it. In short: load, fit, predict.

---

class: center, middle

### Batch machine learning in production

&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/batch_production.svg" /&gt;
&lt;/div&gt;

???

There isn't a clear pattern as to how to put a batch learning system into production. Typically this involves training a model periodically, then serializing it and saving it somewhere. API calls can then be issued to make predictions. This is such a difficult thing to get right that startups and tech giants are building products around it.

---

class: center, middle

### Batch machine learning in production is hard

&lt;div align="center"&gt;
 &lt;img height="300px" src="https://maxhalford.github.io/img/slides/creme/boromir.jpg" /&gt;
&lt;/div&gt;

???

Works fine for Kaggle cases and when you have a clear split between train and test. But things are not so smooth in production.

---

background-color: #2ac380
class: center, middle, white

# Models have to be retrained from scratch with new data

---

background-color: #dbaaa8
class: center, middle, white

# Ever increasing computational requirements

---

background-color: #1f282d
class: center, middle, white

# Models are static and rot faster than bananas 🍌

---

background-color: #e69138
class: center, middle, white

# Features that you develop locally might not be available in real-time

???

This is more subtle. Sometimes it isn't possible to recreate features for past instances. If you don't store information for an instance at every particular point in time, then it isn't possible to reproduce the true state during your feature extraction process. In other words you need a solid data engineering team that cares about time.

---

class: center, middle

&lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/edward-bear-bump-bump.png" /&gt;

&gt; It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending too much time banging our heads to be able to think about a better way of doing things.

---

background-color: #607bd4
class: middle, white

## Online learning

.bigbullets[
- Data comes from a stream
- Models learn 1 observation at a time
- Observations do not have to be stored
- Features and labels are dynamic
]
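
The "one observation at a time" idea fits in a few lines. Here's a minimal running-mean sketch (plain Python, not creme code) to make the update rule concrete:

```python
class RunningMean:
    """Maintains a mean one observation at a time.

    Only two numbers are stored; past observations are discarded.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean update: shift the mean towards x by 1/n
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self

mean = RunningMean()
for x in [5.0, 10.0, 15.0]:
    mean.update(x)

print(mean.mean)  # 10.0
```

Every online model follows this shape: an internal state plus an update rule, with no stored dataset.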

---

background-color: #FF7F50
class: middle, white

## Different names, same thing

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Out-of-core learning
]

---

background-color: #008080
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, &lt;span style="text-decoration:underline"&gt;anything event based&lt;/span&gt;
]

---

class: center, middle

### Online learning in a nutshell

&lt;iframe src="https://giphy.com/embed/HsXfypimWpPcQ" width="480" height="350" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;

---

background-color: #e66868ff
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university
- (Bad) habits
- Hype
- Kaggle
- Library availability
- Higher accuracy
]

---

class: middle

&lt;div align="center"&gt;
 &lt;img height="220" src="https://docs.google.com/drawings/d/e/2PACX-1vSl80T4MnWRsPX3KvlB2kn6zVdHdUleG_w2zBiLS7RxLGAHxiSYTnw3LZtXh__YMv6KcIOYOvkSt9PB/pub?w=841&amp;h=350" alt="creme_logo"/&gt;
&lt;/div&gt;

.bullets[
- Python library for doing online learning
- API heavily inspired by `sklearn` and easy to pick up
- Focus on feature extraction, not just learning
- First commit in January 2019
- Version `0.1.0` released earlier this week
]

---

### Observations

Representing an observation with a `dict` is natural:

```python
x = {
 'date': dt.datetime(2019, 4, 22),
 'price': 42.95,
 'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Python's standard library plays nicely with `dict`s

---

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {
    'height': 29.7,
    'width': 21
}
```

---

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
 print(x, y)
```

- `X_y` is a **generator** and consumes a tiny amount of memory
- The point is that we want to **handle data points one by one**
- Source depends on your use case (a Kafka consumer, a CSV file, HTTP requests, etc.)

---

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
 model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
* y_pred_before = model.predict_one(x)
 model.fit_one(x, y)
* y_pred_after = model.predict_one(x)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

### Monitoring performance

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

* metric = metrics.Accuracy()

for x, y in X_y:
 y_pred = model.predict_one(x)
 model.fit_one(x, y)
* metric.update(y, y_pred)
 print(metric)
```

Validation score is available for free! No need to do any cross-validation. You can also use `online_score` from the `model_selection` module.

---

### Composing estimators is easy

```python
from creme import linear_model
from creme import metrics
from creme import preprocessing
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

scale = preprocessing.StandardScaler()
lin_reg = linear_model.LogisticRegression()

* model = scale | lin_reg # Pipeline shorthand

metric = metrics.Accuracy()

for x, y in X_y:
 y_pred = model.predict_one(x)
 model.fit_one(x, y)
 metric.update(y, y_pred)
 print(metric)
```

---

### `creme`'s current modules

.left-column[
- `cluster`
- `compat`
- `compose`
- `datasets`
- `dummy`
- `ensemble`
- `feature_extraction`
- `feature_selection`
- `impute`
- `linear_model`
]
.right-column[
- `model_selection`
- `multiclass`
- `naive_bayes`
- `optim`
- `preprocessing`
- `reco`
- `stats`
- `stream`
- `tree`
- `utils`
]

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
&gt;&gt;&gt; mean = creme.stats.Mean()

&gt;&gt;&gt; mean.update(5)
&gt;&gt;&gt; mean.get()
5

&gt;&gt;&gt; mean.update(10)
&gt;&gt;&gt; mean.get()
7.5
```

---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$ (running sum of squares)
4. $\sigma^2\_{i+1} = \frac{s\_{i+1}}{n}$

```python
&gt;&gt;&gt; variance = creme.stats.Variance()

&gt;&gt;&gt; X = [2, 3, 4, 5]
&gt;&gt;&gt; for x in X:
...     variance.update(x)
&gt;&gt;&gt; variance.get()
1.25

&gt;&gt;&gt; numpy.var(X)
1.25

```

???

This is called Welford's algorithm, it can be extended to skew and kurtosis

---

### Standard scaling

With the mean and the variance, we can scale incoming data so that it has zero mean and unit variance.

```python
&gt;&gt;&gt; scaler = creme.preprocessing.StandardScaler()

&gt;&gt;&gt; for x in [2, 3, 4, 5]:
...     features = {'x': x}
...     scaler.fit_one(features)
...     print(x, 'becomes', scaler.transform_one(features)['x'])
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738

```

&lt;div align="center"&gt;
 &lt;iframe src="https://giphy.com/embed/r1HGFou3mUwMw" width="480" height="180" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

---

### Linear regression (1)

Model is $y_t = \langle w_t, x_t \rangle + b_t$. The weights $w_t$ can be learnt with any online gradient descent algorithm, for example:

- Stochastic gradient descent (SGD)
- Adam
- RMSProp
- Follow the Regularized Leader (FTRL)

```python
from creme import linear_model
from creme import optim

lin_reg = linear_model.LinearRegression(
 optimizer=optim.Adam(lr=0.01)
)
```

---

### Linear regression (2)

The intercept term $b_t$ is difficult to learn

Some people (Léon Bottou, scikit-learn) suggest using a lower learning rate for the intercept than for the weights (a heuristic, but it works okay)

`creme` lets you use any running statistic from the `creme.stats` module as the intercept, which is a powerful trick

```python
from creme import linear_model
from creme import optim
from creme import stats

lin_reg = linear_model.LinearRegression(
 optimizer=optim.Adam(lr=0.01),
 intercept=stats.RollingMean(42)
)
```

---

### Online aggregations

```python
&gt;&gt;&gt; import creme

&gt;&gt;&gt; X = [
...     {'place': 'Taco Bell', 'revenue': 42},
...     {'place': 'Burger King', 'revenue': 16},
...     {'place': 'Burger King', 'revenue': 24},
...     {'place': 'Taco Bell', 'revenue': 58}
... ]

&gt;&gt;&gt; agg = creme.feature_extraction.Agg(
...     on='revenue',
...     by='place',
...     how=creme.stats.Mean()
... )

&gt;&gt;&gt; for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'revenue_mean_by_place': 42.0}
{'revenue_mean_by_place': 16.0}
{'revenue_mean_by_place': 20.0}
{'revenue_mean_by_place': 50.0}
```

---

### Bagging (1)

Bagging is a popular and simple ensemble algorithm:

1. Pick $m$ base models (usually identical copies)
2. Train each model on a bootstrap sample, i.e. a sample drawn with replacement
3. Average the models' predictions on the test set

Each observation is sampled $K$ times where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times (\frac{1}{n})^k \times (1 - \frac{1}{n})^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1):

$$P(K=k) \sim \frac{e^{-1}}{k!} $$
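
As a quick sanity check, here is a short standalone snippet (my own illustration, not part of `creme`) comparing the Binomial(n, 1/n) probabilities with their Poisson(1) approximation for a large $n$:

```python
import math

def binom_pmf(k, n):
    """P(K = k) when K ~ Binomial(n, 1/n), i.e. number of times an observation appears in a bootstrap of size n."""
    return math.comb(n, k) * (1 / n) ** k * (1 - 1 / n) ** (n - k)

def poisson1_pmf(k):
    """P(K = k) when K ~ Poisson(1)."""
    return math.exp(-1) / math.factorial(k)

# With n = 10,000 the two distributions are already almost indistinguishable
for k in range(4):
    print(k, round(binom_pmf(k, 10_000), 4), round(poisson1_pmf(k), 4))
```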

---

### Bagging (2)

`ensemble.BaggingClassifier` is very simple:

```python
def fit_one(self, x, y):

    for estimator in self.estimators:
        for _ in range(self.rng.poisson(1)):
            estimator.fit_one(x, y)

    return self


def predict_proba_one(self, x):
    y_pred = statistics.mean(
        estimator.predict_proba_one(x)[True]
        for estimator in self.estimators
    )
    return {
        True: y_pred,
        False: 1 - y_pred
    }
```

---

### League of Legends match duration forecasting (1)

&lt;div align="center"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/lol_home.png" /&gt;
&lt;/div&gt;

---

### League of Legends match duration forecasting (2)

&lt;div align="center"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/lol_matches.png" /&gt;
&lt;/div&gt;

---

### Architecture

&lt;div align="center"&gt;
 &lt;img height="450px" src="https://maxhalford.github.io/img/slides/creme/lol_architecture.svg" /&gt;
&lt;/div&gt;

---

class: middle

### Django model

```python
from django.db import models
from picklefield.fields import PickledObjectField


class CremeModel(models.Model):
    name = models.TextField(unique=True)
    pipeline = PickledObjectField()

    class Meta:
        db_table = 't_models'
        verbose_name_plural = 'models'

    def fit_one(self, x, y):
        self.pipeline.fit_one(x, y)
        return self

    def predict_one(self, x):
        return self.pipeline.predict_one(x)
```

---

class: middle

### Code for predicting

```python
match = fetch_match(match_id)

model = models.CremeModel.objects.get(name='My awesome model')

predicted_duration = model.predict_one(match.raw_info)

match.predicted_ended_at = match.started_at + predicted_duration
match.predicted_by = model
match.save()
```

---

class: middle

### Code for training

When the match ends, it has a `true_duration` property.

```python
model = models.CremeModel.objects.get(id=match.predicted_by.id)

model.fit_one(
 x=match.raw_info,
 y=match.true_duration.seconds
)

model.save()
```

---

class: middle

### Code for calculating performance

Just some Django black magic.

```python
duration = ExpressionWrapper(
    Func(F('predicted_ended_at') - F('ended_at'), function='ABS'),
    output_field=fields.DurationField()
)

agg = models.Match.objects.exclude(ended_at__isnull=True)\
    .annotate(duration=duration)\
    .aggregate(Avg('duration'))

avg_error = agg['duration__avg']
```

---

### Benefits of online learning

.bullets[
- No need to schedule model training
- Easy to debug and to monitor
- You're not "far" from production
- Way more fun than batch learning
]

&lt;div align="center" style="margin-top: 50px;"&gt;
 &lt;iframe src="https://giphy.com/embed/26tPplGWjN0xLybiU" width="480" height="180" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

---

### Future work

.left-column[
.bullets[
- Decision trees (nearly there!)
- Gradient boosting
- Bayesian linear models
- More feature extraction
- More models
- More benchmarks
- Many issues [on GitHub](https://github.com/creme-ml/creme/issues)
]
]
.right-column[

&lt;div align="center" style="margin-top: 80px;"&gt;
 &lt;iframe src="https://giphy.com/embed/SuEFqeWxlLcvm" width="240" height="240" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

]

---

### If you want to contribute

- [creme-ml.github.io](https://creme-ml.github.io/)
- [github.com/creme-ml](https://github.com/creme-ml/)
- You can shoot emails to [maxhalford25@gmail.com](mailto:maxhalford25@gmail.com)
- Get in touch if you want to try `creme` and would like some advice
- Spread the word!

&lt;div align="center"&gt;
 &lt;img height="230px" src="https://maxhalford.github.io/img/slides/creme/we_need_you.jpg" /&gt;
&lt;/div&gt;

---

class: center, middle

# Thanks for listening!

.left-column[
&lt;div align="center" style="margin-top: 50px;"&gt;
 &lt;iframe src="https://giphy.com/embed/7zusy37fwKjfy" width="480px" height="343px" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
]

.right-column[
&lt;div align="center" style="margin-top: 50px;"&gt;
 &lt;img height="343px" src="https://maxhalford.github.io/img/slides/creme/qr_code.svg" /&gt;
&lt;/div&gt;
]

 &lt;/textarea&gt;
 &lt;script src="https://remarkjs.com/downloads/remark-latest.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://gnab.github.io/remark/downloads/remark-latest.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/contrib/auto-render.min.js"&gt;&lt;/script&gt;
 &lt;link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css"&gt;
 &lt;script type="text/javascript"&gt;
 var options = {};
 var renderMath = function() {
 renderMathInElement(document.body, {delimiters: [ // mind the order of delimiters(!?)
 {left: "$$", right: "$$", display: true},
 {left: "$", right: "$", display: false},
 {left: "\\[", right: "\\]", display: true},
 {left: "\\(", right: "\\)", display: false},
 ]});
 }
 var slideshow = remark.create(
 {
 slideNumberFormat: function (current, total) {
 if (current === 1) { return "" }
 return current;
 },
 highlightStyle: 'github',
 highlightLines: true,
 ratio: '16:9'
 },
 renderMath
 );
 &lt;/script&gt;
 &lt;/body&gt;
&lt;/html&gt;</description></item><item><title/><link>https://maxhalford.github.io/slides/creme-tds/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/slides/creme-tds/</guid><description>&lt;!DOCTYPE html&gt;
&lt;html&gt;
 &lt;head&gt;
 &lt;title&gt;Machine learning incrémental: des concepts à la pratique&lt;/title&gt;
 &lt;meta charset="utf-8"&gt;
 &lt;style&gt;
 @import url(https://fonts.googleapis.com/css?family=Open+Sans:700);
 @import url(https://fonts.googleapis.com/css?family=Fira+Mono:400,700,400italic);

 body, h1, h2, h3 { font-family: 'Open Sans'; }
 h1, h2, h3 { text-align: center; }

 .bullets {
 display: flex;
 flex-direction: row;
 justify-content: center;
 }

 .bigbullets {
 display: flex;
 flex-direction: row;
 justify-content: center;
 font-size: 35px;
 }

 .remark-slide-content {
 font-size: 25px;
 color: #1f282d;
 }

 .remark-code, .remark-inline-code { font-family: 'Fira Mono'; }
 .remark-inline-code { background: #f0f0f0; padding: 0px 4px; }

 .left-column { width: 50%; float: left; }
 .right-column { width: 50%; float: right; }
 .white {
 color: #FFFAFA;
 }

 .title-slide .remark-slide-number {
 display: none;
 }

 blockquote {
 background: #f9f9f9;
 border-left: 10px solid #ccc;
 margin: 1.5em 10px;
 padding: 0.5em 10px;
 quotes: "\201C""\201D""\2018""\2019";
 }
 blockquote:before {
 color: #ccc;
 content: open-quote;
 font-size: 4em;
 line-height: 0.1em;
 margin-right: 0.25em;
 vertical-align: -0.4em;
 }
 blockquote p {
 display: inline;
 }

 a { color: hotpink; text-decoration: none; }
 li { margin: 10px 0; }

 &lt;/style&gt;
 &lt;/head&gt;
 &lt;body&gt;
 &lt;textarea id="source"&gt;

class: center, middle

## Machine learning incrémental: des concepts à la pratique

### Max Halford

#### 28 mai 2019

#### Toulouse Data Science Meetup

&lt;div style="display: flex; flex-direction: row; justify-content: center;"&gt;

 &lt;div align="center"&gt;
 &lt;img height="100px" src="https://maxhalford.github.io/img/slides/creme/logo_tds.png" /&gt;
 &lt;/div&gt;

 &lt;div align="center"&gt;
 &lt;img height="125px" src="https://docs.google.com/drawings/d/e/2PACX-1vSl80T4MnWRsPX3KvlB2kn6zVdHdUleG_w2zBiLS7RxLGAHxiSYTnw3LZtXh__YMv6KcIOYOvkSt9PB/pub?w=841&amp;h=350" /&gt;
 &lt;/div&gt;

&lt;/div&gt;

???

Hello!

---

### Outline

.bullets[
1. You're doing machine learning the wrong way 😱
2. Cool kids do online learning 😎
3. Introducing `creme`, a Python lib for online learning 🐍
4. Bike stations forecasting demo 🚲 🔮
]

&lt;div align="center"&gt;
 &lt;iframe src="https://giphy.com/embed/ZoAa7lsmym6UE" width="480" height="223" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

---

class: center, middle

### PyData Amsterdam 2019 🐍 🇳🇱 🧀

&lt;div align="center"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/max_pydata.jpg" /&gt;
&lt;/div&gt;

---

### Batch learning

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
&gt;&gt;&gt; model.fit(X_train, y_train)
&gt;&gt;&gt; dump(model, 'model.json')
&gt;&gt;&gt; model = load('model.json')
&gt;&gt;&gt; y_pred = model.predict(X_test)
```
---

class: center, middle

### Batch machine learning in production

&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/batch_production.svg" /&gt;
&lt;/div&gt;

---

background-color: #2ac380
class: center, middle, white

# Models have to be retrained from scratch with new data ➰➰➰

---

background-color: #663399
class: center, middle, white

# Models need increasing amounts of power 🔌

---

background-color: #1f282d
class: center, middle, white

# Models are static and "rot" faster than bananas 🍌

---

background-color: #e69138
class: center, middle, white

# Models that work locally don't always work in production 😭

---

class: center, middle

&lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/edward-bear-bump-bump.png" /&gt;

&gt; It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending too much time banging our heads to be able to think about a better way of doing things.

---

background-color: #607bd4
class: middle, white

## Online learning

.bigbullets[
- Data comes from a stream
- Models learn 1 observation at a time
- Features and labels are dynamic
]

---

## Everything changes 💥

&lt;div style="display: flex; flex-direction: row;"&gt;
&lt;iframe src="https://giphy.com/embed/1xdHOjP8eRr6tpuYJG" width="480" height="480" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;div style="display: flex; flex-direction: column;"&gt;
&lt;iframe src="https://giphy.com/embed/H1BE3dRF6z5FtR7bGo" width="480" height="225" frameBorder="0" class="giphy-embed" allowFullScreen style="margin-bottom: 30px"&gt;&lt;/iframe&gt;
&lt;iframe src="https://giphy.com/embed/l1J9wSXhprtRF1HYA" width="480" height="225" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;/div&gt;

---

background-color: #008080
class: middle, white

## Different names, same thing 🤷

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Continuous learning
- Out-of-core learning
]

---

background-color: #FF7F50
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, &lt;span style="text-decoration:underline"&gt;anything event based&lt;/span&gt;
]

---

class: center, middle

### Online learning in a nutshell

&lt;iframe src="https://giphy.com/embed/HsXfypimWpPcQ" width="480" height="350" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;

---

background-color: #e66868ff
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university 🎓
- (Bad) habits
- Hype
- Kaggle 🎯
- Library availability
]

---

class: center, middle

# Questions?

---

class: middle

&lt;div align="center"&gt;
 &lt;img height="250" src="https://docs.google.com/drawings/d/e/2PACX-1vSl80T4MnWRsPX3KvlB2kn6zVdHdUleG_w2zBiLS7RxLGAHxiSYTnw3LZtXh__YMv6KcIOYOvkSt9PB/pub?w=841&amp;h=350" alt="creme_logo"/&gt;
&lt;/div&gt;

.bullets[
- Online machine learning library for Python 🐍
- Easy to pick up API inspired by `sklearn`
- Written with production scenarios in mind
- First commit in January 2019
- Version `0.2.0` released yesterday
]

---

#### scikit-learn

```python
from sklearn import datasets
from sklearn import linear_model

X, y = datasets.load_boston(return_X_y=True)
model = linear_model.LinearRegression()

model.fit(X, y)
```

#### creme

```python
from creme import linear_model
from creme import stream
from sklearn import datasets

X_y = stream.iter_sklearn_dataset(datasets.load_boston)
model = linear_model.LinearRegression()

for x, y in X_y:
 model.fit_one(x, y)
```

---

class: middle

### Features

Representing a set of features using a `dict` is natural:

```python
x = {
 'date': dt.datetime(2019, 4, 22),
 'price': 42.95,
 'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Python's standard library plays nicely with `dict`s

---

class: middle

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {
    'height': 29.7,
    'width': 21
}
```

---

class: middle

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
 print(x, y)
```

- `X_y` is a **generator** and consumes a tiny amount of memory
- The point is that we only need one data point at a time
- Source depends on your use case (CSV file, Kafka consumer, HTTP requests)

---

class: middle

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
* model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

class: middle

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
* y_pred = model.predict_one(x)
 model.fit_one(x, y)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

class: middle

### Progressive validation 💯

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

metric = metrics.Accuracy()

for x, y in X_y:
 y_pred = model.predict_one(x)
 model.fit_one(x, y)
* metric.update(y, y_pred)
 print(metric)
```

Validation score is available for free! No need for cross-validation. You can also use `online_score` from the `model_selection` module.

---

class: middle

### Composing estimators is easy

```python
from creme import compose
from creme import linear_model
from creme import preprocessing

scale = preprocessing.StandardScaler()
lin_reg = linear_model.LogisticRegression()

# You can do this...
model = compose.Pipeline([
    ('scale', scale),
    ('lin_reg', lin_reg)
])

# Or this...
model = scale | lin_reg
```

---

class: center, middle

# Questions?

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
&gt;&gt;&gt; mean = creme.stats.Mean()

&gt;&gt;&gt; mean.update(5)
&gt;&gt;&gt; mean.get()
5

&gt;&gt;&gt; mean.update(10)
&gt;&gt;&gt; mean.get()
7.5
```

---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$
4. $\sigma^2\_{i+1} = \frac{s\_{i+1}}{n}$

```python
&gt;&gt;&gt; variance = creme.stats.Variance()

&gt;&gt;&gt; X = [2, 3, 4, 5]
&gt;&gt;&gt; for x in X:
...     variance.update(x)
&gt;&gt;&gt; variance.get()
1.25

&gt;&gt;&gt; numpy.var(X)
1.25

```

???

This is called Welford's algorithm, it can be extended to skew and kurtosis

---

### Standard scaling

Using the mean and the variance, we can rescale incoming data.

```python
&gt;&gt;&gt; scaler = creme.preprocessing.StandardScaler()

&gt;&gt;&gt; for x in [2, 3, 4, 5]:
...     features = {'x': x}
...     scaler.fit_one(features)
...     new_x = scaler.transform_one(features)['x']
...     print(f'{x} becomes {new_x}')
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738

```

&lt;div align="center"&gt;
 &lt;iframe src="https://giphy.com/embed/nlWGe7Q64zwQ0" width="480" height="180" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

---

### Linear regression (1)

Model is $y_t = \langle w_t, x_t \rangle + b_t$. The weights $w_t$ can be learnt with any online gradient descent algorithm, for example:

- Stochastic gradient descent (SGD)
- Adam
- RMSProp
- Follow the Regularized Leader (FTRL)

```python
from creme import linear_model
from creme import optim

lin_reg = linear_model.LinearRegression(
 optimizer=optim.Adam(lr=0.01)
)
```

---

### Linear regression (2)

Some people (Léon Bottou, scikit-learn) suggest using a lower learning rate for the intercept than for the weights (a heuristic, but it works okay)

`creme` lets you use any running statistic from the `creme.stats` module as the intercept, which is a powerful trick

```python
from creme import linear_model
from creme import optim
from creme import stats

lin_reg = linear_model.LinearRegression(
 optimizer=optim.Adam(lr=0.01),
 intercept=stats.RollingMean(42)
)
```

---

### Online aggregations

```python
&gt;&gt;&gt; import creme

&gt;&gt;&gt; X = [
...     {'meal': '🍕', 'sales': 42},
...     {'meal': '🍔', 'sales': 16},
...     {'meal': '🍔', 'sales': 24},
...     {'meal': '🍕', 'sales': 58}
... ]

&gt;&gt;&gt; agg = creme.feature_extraction.Agg(
...     on='sales',
...     by='meal',
...     how=creme.stats.Mean()
... )

&gt;&gt;&gt; for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'sales_mean_by_meal': 42.0}
{'sales_mean_by_meal': 16.0}
{'sales_mean_by_meal': 20.0}
{'sales_mean_by_meal': 50.0}
```

---

### Bagging (1)

Each observation is sampled $K$ times where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times (\frac{1}{n})^k \times (1 - \frac{1}{n})^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1):

$$P(K=k) \sim \frac{e^{-1}}{k!} $$

This leads to a simple and efficient online algorithm.

---

### Bagging (2)

`ensemble.BaggingClassifier` is very simple:

```python
def fit_one(self, x, y):

    for estimator in self.estimators:
        for _ in range(self.rng.poisson(1)):
            estimator.fit_one(x, y)

    return self


def predict_proba_one(self, x):
    y_pred = statistics.mean(
        estimator.predict_proba_one(x)[True]
        for estimator in self.estimators
    )
    return {
        True: y_pred,
        False: 1 - y_pred
    }
```

---

### Decision trees 🌳

- A version of Hoeffding trees is being implemented
- Basic idea:
 - Start with a leaf 🍃
 - Find the leaf where an observation belongs 🔎
 - Update the leaf's sufficient statistics 📊
 - Measure information gain every so often 🔬
 - Split when the information gain is good enough 🍂
- Mondrian trees 👨‍🎨 are another possibility but they only work for continuous attributes

---

class: center, middle

# Questions?

---

### `creme`'s current modules

&lt;div style="display: flex; justify-content: space-around;"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;pre&gt;cluster&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;compat&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;compose&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;datasets&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;dummy&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;ensemble&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;feature_extraction&lt;/pre&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;pre&gt;feature_selection&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;impute&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;linear_model&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;model_selection&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;multiclass&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;naive_bayes&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;optim&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;plot&lt;/pre&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;pre&gt;preprocessing&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;proba&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;reco&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;stats&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;stream&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;tree&lt;/pre&gt;&lt;/li&gt;
 &lt;li&gt;&lt;pre&gt;utils&lt;/pre&gt;&lt;/li&gt;
 &lt;/ul&gt;

&lt;/div&gt;

---

### Cool stuff in `creme` we skipped 😢

.bullets[
- Clustering
- Factorization machines
- Feature selection
- Passive-aggressive models
- Recommender systems
- Histograms
- Skyline queries
- Fourier transforms
- Imputation
- Naive Bayes
]

---

### Alternative frameworks

&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/others.svg" /&gt;
&lt;/div&gt;

---

### Benefits of online learning

.bullets[
- No need to schedule model training
- Easy to monitor
- You're very close to production
- Way more fun than batch learning
]

&lt;div align="center" style="margin-top: 20px;"&gt;
 &lt;iframe src="https://giphy.com/embed/IKyw3IP1keKk0" width="480" height="220" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

---

### Current work

.bullets[
- Decision trees (nearly there)
- Gradient boosting (easy)
- Bayesian linear models (part of my PhD)
- Latent Dirichlet Allocation (ask Raphael)
- Many issues [on GitHub](https://github.com/creme-ml/creme/issues)
]

&lt;div align="center"&gt;
 &lt;iframe src="https://giphy.com/embed/JltOMwYmi0VrO" width="480" height="220" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;

---

### What next?

- [creme-ml.github.io](https://creme-ml.github.io/)
- [github.com/creme-ml](https://github.com/creme-ml/)
- You can send emails to [maxhalford25@gmail.com](mailto:maxhalford25@gmail.com)
- Get in touch if you want help and/or advice
- Starring us on GitHub helps a lot 🌟

&lt;div align="center"&gt;
 &lt;img height="230px" src="https://maxhalford.github.io/img/slides/creme/we_need_you_fr.jpg" /&gt;
&lt;/div&gt;

---

class: center, middle

# Thanks for listening!

.left-column[
&lt;div align="center" style="margin-top: 50px;"&gt;
 &lt;iframe src="https://giphy.com/embed/DUrdT2xEmJWbS" width="400px" height="400px" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
]

.right-column[
&lt;div align="center" style="margin-top: 50px;"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/qr_code.svg" /&gt;
&lt;/div&gt;
]

 &lt;/textarea&gt;
 &lt;script src="https://remarkjs.com/downloads/remark-latest.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://gnab.github.io/remark/downloads/remark-latest.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/contrib/auto-render.min.js"&gt;&lt;/script&gt;
 &lt;link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css"&gt;
 &lt;script type="text/javascript"&gt;
 var options = {};
 var renderMath = function() {
 renderMathInElement(document.body, {delimiters: [ // mind the order of delimiters(!?)
 {left: "$$", right: "$$", display: true},
 {left: "$", right: "$", display: false},
 {left: "\\[", right: "\\]", display: true},
 {left: "\\(", right: "\\)", display: false},
 ]});
 }
 var slideshow = remark.create(
 {
 slideNumberFormat: function (current, total) {
 if (current === 1) { return "" }
 return current;
 },
 highlightStyle: 'github',
 highlightLines: true,
 ratio: '16:9'
 },
 renderMath
 );
 &lt;/script&gt;
 &lt;/body&gt;
&lt;/html&gt;</description></item><item><title/><link>https://maxhalford.github.io/slides/the-benefits-of-online-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/slides/the-benefits-of-online-learning/</guid><description>&lt;!DOCTYPE html&gt;
&lt;html&gt;
 &lt;head&gt;
 &lt;title&gt;The Benefits of Online Learning&lt;/title&gt;
 &lt;meta charset="utf-8"&gt;
 &lt;style&gt;
 @import url(https://fonts.googleapis.com/css?family=Open+Sans:700);
 @import url(https://fonts.googleapis.com/css?family=Fira+Mono:400,700,400italic);

 body, h1, h2, h3 { font-family: 'Open Sans'; }
 h1, h2, h3 { text-align: center; }

 .bullets {
 display: flex;
 flex-direction: row;
 justify-content: center;
 }

 .bigbullets {
 display: flex;
 flex-direction: row;
 justify-content: center;
 font-size: 35px;
 }

 .remark-slide-content {
 font-size: 25px;
 color: #1f282d;
 }

 .remark-code, .remark-inline-code { font-family: 'Fira Mono'; }
 .remark-inline-code { background: #f0f0f0; padding: 0px 4px; }

 .left-column { width: 50%; float: left; }
 .right-column { width: 50%; float: right; }
 .white {
 color: #FFFAFA;
 }

 .title-slide .remark-slide-number {
 display: none;
 }

 blockquote {
 background: #f9f9f9;
 border-left: 10px solid #ccc;
 margin: 1.5em 10px;
 padding: 0.5em 10px;
 quotes: "\201C""\201D""\2018""\2019";
 }
 blockquote:before {
 color: #ccc;
 content: open-quote;
 font-size: 4em;
 line-height: 0.1em;
 margin-right: 0.25em;
 vertical-align: -0.4em;
 }
 blockquote p {
 display: inline;
 }

 a { color: hotpink; text-decoration: none; }
 li { margin: 10px 0; }

 .green { color: green; }
 .red { color: red; }

 .pure-table {
 font-size: 17px;
 border-style: hidden !important;
 }

 .pure-table td {
 border-left: 0px !important;
 }

 .pure-table th {
 border-left: 0px !important;
 }

 &lt;/style&gt;
 &lt;/head&gt;
 &lt;body&gt;
 &lt;textarea id="source"&gt;

class: center, middle

## The Benefits of Online Learning

### (and other shenanigans)

#### Max Halford

&lt;div style="display: flex; flex-direction: row; justify-content: center;"&gt;

 &lt;div align="center"&gt;
 &lt;img height="180px" src="https://maxhalford.github.io/img/slides/creme/creme.svg" /&gt;
 &lt;/div&gt;

&lt;/div&gt;

???

Hello!

---

class: middle

## Outline

&lt;div align="center"&gt;
 &lt;img height="300px" src="https://maxhalford.github.io/img/slides/creme/knowledge.jpg" /&gt;
&lt;/div&gt;

---

### A bit about me

.left-column[
&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/moneyball.jpg" /&gt;
&lt;/div&gt;
]

.right-column[
- 3rd year PhD student in Toulouse
- PhD on Bayesian networks applied to database cost models
- Topics of interest:
 - Online machine learning
 - Systems for machine learning
 - Machine learning for systems
 - Competitive machine learning
 - Fair and explainable learning
- Into open source (mostly Python)
- Kaggle Master (rank 247)
]

---

### Batch learning in a nutshell

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
&gt;&gt;&gt; model.fit(X_train, y_train)
&gt;&gt;&gt; dump(model, 'model.json')
&gt;&gt;&gt; model = load('model.json')
&gt;&gt;&gt; y_pred = model.predict(X_test)
```

---

## Lambda architecture

&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/lambda_architecture.svg" /&gt;
&lt;/div&gt;

---

background-color: #e66868ff
class: middle, white

## Batch learning in production has issues

.bigbullets[

1. Models are retrained from scratch with new data 🕒
2. Models require powerful machines 💰
3. Models are static and "rot" faster than bananas 🍌
4. Models that work in development don't always work in production 🤷
]

???

- [Continuum: a platform for cost-aware low-latency continual learning](https://blog.acolyer.org/2018/11/21/continuum-a-platform-for-cost-aware-low-latency-continual-learning/)
- [Applied machine learning at Facebook: a datacenter infrastructure perspective](https://blog.acolyer.org/2018/12/17/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/)

As we looked at last month with Continuum, the latency of incorporating the latest data into the models is also really important. There’s a nice section of this paper where the authors study the impact of losing the ability to train models for a period of time and having to serve requests from stale models.

The Community Integrity team, for example, rely on frequently trained models to keep up with the ever-changing ways adversaries try to bypass Facebook’s protections and show objectionable content to users. Here training iterations take on the order of days. Even more dependent on the incorporation of recent data into models is the news feed ranking: “Stale News Feed models have a measurable impact on quality.” And at the very core of the business, the Ads Ranking models: “we learned that the impact of leveraging a stale ML model is measured in hours. In other words, using a one-day-old model is measurably worse than using a one-hour old model.”

One of the conclusions in this section of the paper is that disaster recovery / high availability for training workloads is of key importance. (Another place to practice your chaos engineering ;) )

---

## Banana rotting time

&lt;div align="center"&gt;
 &lt;img height="440px" src="https://maxhalford.github.io/img/slides/creme/banana_rotting_time.png" /&gt;
&lt;/div&gt;

---

### "If everyone's doing it, it's got to be the best way, right?"

&lt;div align="center"&gt;
 &lt;img height="480px" src="https://maxhalford.github.io/img/slides/creme/everyone_is_doing_it.png" /&gt;
&lt;/div&gt;

---

## Kappa architecture

&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/kappa_architecture.svg" /&gt;
&lt;/div&gt;

???

This looks great, but our favorite models, such as LightGBM, can't be updated incrementally.

---

background-color: #b5ddd1
class: middle

## Online machine learning

.bigbullets[
- Subdiscipline of machine learning
- Data is a stream, potentially infinite
- Models learn from one observation at a time
- Features and labels can be dynamic
- Can be supervised or unsupervised
]

---

background-color: #2ac380
class: middle, white

## Different names, same thing 🤔

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Continuous learning
- Out-of-core learning
]

---

background-color: #607bd4
class: middle, white

## Benefits of online machine learning

.bigbullets[
1. Models don't have to be retrained
2. Nothing is too big
3. Online models (usually) adapt to drift
4. Model development is closer to reality
5. Training can be monitored in real-time
]

---

background-color: #e69138
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, &lt;span style="text-decoration:underline"&gt;anything event based&lt;/span&gt;
]

---

background-color: #FF7F50
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university 🎓
- (Bad) habits
- Hype
- Kaggle 🎯
- Library availability
]

---

class: center, middle

&lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/edward-bear-bump-bump.png" /&gt;

&gt; It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending too much time banging our heads to be able to think about a better way of doing things.

---

class: middle

&lt;div align="center"&gt;
 &lt;img height="250" src="https://maxhalford.github.io/img/slides/creme/creme.svg" alt="creme_logo"/&gt;
&lt;/div&gt;

.bullets[
- Online machine learning library for Python 🐍
- Easy to pick up API inspired by scikit-learn
- Written with production scenarios in mind
- First commit in January 2019
- Version `0.4.3` released this week (with wheels!)
]

---

class: center, middle

### PyData Amsterdam, May 2019 🐍 🇳🇱 🧀

&lt;div align="center"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/max_pydata.jpg" /&gt;
&lt;/div&gt;

---

class: middle

### Features

Representing a set of features using a `dict` is natural:

```python
x = {
 'date': dt.datetime(2019, 4, 22),
 'price': 42.95,
 'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Web request payloads are dictionaries

---

class: middle

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {
 'height': 29.7,
 'width': 21
}
```

---

class: middle

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
    print(x, y)
```

- `X_y` is a **generator**, it doesn't hold data in memory
- Source depends on your use case (CSV file, Kafka consumer, HTTP requests)

---

class: middle

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
* model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

class: middle

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
* y_pred = model.predict_one(x)
 model.fit_one(x, y)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

class: middle

### Progressive validation 💯

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

metric = metrics.Accuracy()

for x, y in X_y:
 y_pred = model.predict_one(x)
 model.fit_one(x, y)
* metric.update(y, y_pred)
 print(metric)
```

Validation score is available for free! No need for cross-validation. You can also use `online_score` from the `model_selection` module.

---

class: middle

### Composing estimators is easy

.left-column[
```python
from creme import *

counts = feature_extraction.CountVectorizer()
tfidf = feature_extraction.TFIDFVectorizer()

scale = preprocessing.StandardScaler()

log_reg = linear_model.LogisticRegression()

model = (counts + tfidf) | scale | log_reg

model.draw()
```
]

.right-column[
&lt;div align="center"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/pipeline.svg" /&gt;
&lt;/div&gt;
]

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
&gt;&gt;&gt; mean = creme.stats.Mean()

&gt;&gt;&gt; mean.update(5)
&gt;&gt;&gt; mean.get()
5

&gt;&gt;&gt; mean.update(10)
&gt;&gt;&gt; mean.get()
7.5
```

---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$
4. $\sigma\_{i+1} = \frac{s\_{i+1}}{n}$

```python
&gt;&gt;&gt; variance = creme.stats.Variance()

&gt;&gt;&gt; X = [2, 3, 4, 5]
&gt;&gt;&gt; for x in X:
...     variance.update(x)
&gt;&gt;&gt; variance.get()
1.25

&gt;&gt;&gt; numpy.var(X)
1.25

```
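
The four steps above are Welford's algorithm written out. Here is a from-scratch sketch, independent of creme, to make the recursion explicit:

```python
class RunningVariance:
    """Welford's algorithm: numerically stable online mean and variance."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0  # running mean
        self.s = 0.0   # running sum of squared deviations

    def update(self, x):
        self.n += 1
        old_mu = self.mu
        self.mu += (x - old_mu) / self.n
        self.s += (x - old_mu) * (x - self.mu)

    def get(self):
        return self.s / self.n

rv = RunningVariance()
for x in [2, 3, 4, 5]:
    rv.update(x)
```

Running it on `[2, 3, 4, 5]` gives the same `1.25` as above.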

???

This is called Welford's algorithm, it can be extended to skew and kurtosis

---

class: middle

### Standard scaling

Using the mean and the variance, we can rescale incoming data.

```python
&gt;&gt;&gt; scaler = creme.preprocessing.StandardScaler()

&gt;&gt;&gt; for x in [2, 3, 4, 5]:
...     features = {'x': x}
...     scaler.fit_one(features)
...     new_x = scaler.transform_one(features)['x']
...     print(f'{x} becomes {new_x}')
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738

```

In practice, works better than normalized gradient descent 😲

---

class: middle

### Linear models

Model is:

$$\hat{y}_t = f(w_t, x_t) + b_t$$

Update weights with gradients:

$$w_{t+1} = u(w_t, x_t, \partial L(y_t, \hat{y}_t))$$

Many models can be derived, for example:

- Use Hinge loss for SVM
- Add L1/L2 regularisation for LASSO/ridge
- Add interactions for factorization machines
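
A toy instance of the update rule, using the squared loss and plain SGD on dict features (a stand-alone sketch, not creme's implementation):

```python
import collections

class ToyLinearRegression:
    """Linear model trained online with SGD and the squared loss."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = collections.defaultdict(float)
        self.bias = 0.0

    def predict_one(self, x):
        return sum(self.weights[k] * v for k, v in x.items()) + self.bias

    def fit_one(self, x, y):
        # gradient of the squared loss with respect to the prediction
        g = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] -= self.lr * g * v
        self.bias -= self.lr * g

model = ToyLinearRegression()
for _ in range(100):
    model.fit_one({'x': 1.0}, 3.0)  # the model converges towards predicting 3
```

Swapping the loss (e.g. the hinge loss) or adding a regularisation term to `fit_one` yields the variants listed above.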

---

class: middle

### Many online optimizers to choose from

.bullets[
- Stochastic gradient descent (SGD)
- Passive-Aggressive (PA)
- ADAM
- RMSProp
- Follow the Regularized Leader (FTRL)
- Approximate Large Margin Algorithm (ALMA)
]

&lt;div align="center"&gt;
Many variants of each, as you know
&lt;/div&gt;

---

class: middle

### Bayesian linear models

We want the posterior predictive distribution of the target:

$$\color{forestgreen} p(y\_t | x\_t) \color{black} \propto \color{crimson} p(y_t | w_t, x_t) \color{royalblue} p(w_t)$$

We first need to compute the posterior distribution of the weights:

$$\color{blueviolet} p(w\_{t} | w\_{t-1}, x\_t, y\_t) \color{black} \propto \color{crimson} p(y\_t | w\_{t-1}, x\_t) \color{royalblue} p(w\_{t-1})$$

This is old-school Bayesian learning; it is different from, and predates, the Monte-Carlo mumbo-jumbo.

---

class: middle

### Online belief updating

Before any data comes in, the model parameters follow the initial distribution we picked, which is $\color{royalblue} p(w\_0)$. Next, once the first observation $(x\_0, y\_0)$ arrives, we can obtain the distribution of $w\_1$:

$$\color{blueviolet} p(w\_1 | w\_0, x\_0, y\_0) \color{black} \propto \color{crimson} p(y\_0 | w\_0, x\_0) \color{royalblue} p(w\_0)$$

Once the second observation $(x\_1, y\_1)$ is available, the distribution of the model parameters is obtained in the same way:

$$\color{blueviolet} p(w\_2 | w\_1, x\_1, y\_1) \color{black} \propto \color{crimson} p(y\_1 | w\_1, x\_1) \color{royalblue} p(w\_1) \propto \color{crimson} p(y\_1 | w\_1, x\_1) \color{blueviolet} p(w\_1 | w\_0, x\_0, y\_0)$$

The $\propto$ symbol means there is an analytical formula that can be derived.
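
To make the recursion concrete, here is a toy conjugate case — a single Gaussian weight with known observation noise (the numbers are illustrative, not from the slides):

```python
# model: y = w + noise, with a Gaussian prior on the single weight w
m, v = 0.0, 1.0    # prior mean and variance of w
noise_var = 0.25   # assumed known observation noise variance

for y in [1.2, 0.8, 1.1, 0.9]:
    # conjugate update: the posterior after each observation is the next prior
    precision = 1.0 / v + 1.0 / noise_var
    m = (m / v + y / noise_var) / precision
    v = 1.0 / precision
```

Each observation shrinks the posterior variance `v` and pulls the mean `m` towards the data, which is exactly the belief-updating scheme written above.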

---

class: middle

### Nearest neighbors

.bullets[
- Three parameters:
 1. The distance function
 2. The number of neighbors
 3. The window size
- .green[Naturally adapts to drift]
- .red[Lazy]
]
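
A toy sliding-window nearest-neighbour classifier showing those three parameters (an illustrative sketch, not creme's API):

```python
import collections
import math

class WindowKNN:
    def __init__(self, k=3, window_size=50):
        self.k = k
        self.window = collections.deque(maxlen=window_size)  # the window size

    def fit_one(self, x, y):
        # learning is lazy: we just store the observation; old ones fall out
        # of the window, which is what makes the model adapt to drift
        self.window.append((x, y))

    def predict_one(self, x):
        # the distance function: Euclidean distance between feature tuples
        nearest = sorted(self.window, key=lambda pair: math.dist(pair[0], x))
        votes = collections.Counter(y for _, y in nearest[:self.k])
        return votes.most_common(1)[0][0]

knn = WindowKNN(k=1, window_size=10)
knn.fit_one((0.0, 0.0), 'red')
knn.fit_one((5.0, 5.0), 'green')
```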

---

class: middle

### Decision trees 🌳

- A version of Hoeffding trees is being implemented
- Basic idea:
 - Start with a leaf 🍃
 - Find the leaf where an observation belongs 🔎
 - Update the leaf's "sufficient statistics" 📊
 - Measure information gain every so often 🔬
 - Split when the information gain is good enough 🍂
- Mondrian trees 👨‍🎨 are another possibility but they only work for continuous attributes

---

class: middle

### Decision trees 🌳

Quality criterion of split $x &lt; t$ can be evaluated with:

$$P(y \mid x &lt; t) = \frac{P(x &lt; t \mid y) \times P(y)}{P(x &lt; t)}$$

and:

$$P(y \mid x \geq t) = \frac{(1 - P(x &lt; t \mid y)) \times P(y)}{1 - P(x &lt; t)}$$

- For classification, $P(x &lt; t \mid y)$ is a set of online CDFs and $P(y)$ is a PMF.
- For regression, $P(x &lt; t \mid y)$ is a 2D CDF and $P(y)$ is a PDF
- All these distributions can be updated online
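
A numeric toy instance of the formula, with made-up class counts of the kind that would be maintained online:

```python
# counts gathered from the stream for one candidate split on feature x
class_counts = {'a': 40, 'b': 60}  # estimates P(y)
below_counts = {'a': 30, 'b': 15}  # observations of each class below the threshold t

total = sum(class_counts.values())
p_y = {y: c / total for y, c in class_counts.items()}
p_below_given_y = {y: below_counts[y] / class_counts[y] for y in class_counts}
p_below = sum(p_below_given_y[y] * p_y[y] for y in class_counts)

# Bayes' rule: class distribution in the left branch of the candidate split
p_y_given_below = {y: p_below_given_y[y] * p_y[y] / p_below for y in class_counts}
```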

---

class: middle

### Bagging

Each observation is sampled $K$ times where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times (\frac{1}{n})^k \times (1 - \frac{1}{n})^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1):

$$P(K=k) \approx \frac{e^{-1}}{k!}$$

This leads to a simple and efficient online algorithm:

```python
for model in models:
    for _ in range(numpy.random.poisson(lam=1)):
        model.fit_one(x, y)
```
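
A self-contained toy version of the loop, with a running mean as base model and the Poisson(1) draw truncated at 6 for simplicity:

```python
import math
import random

random.seed(42)

# Poisson(1) probabilities for k = 0..6 (the truncated tail is negligible)
POISSON1_WEIGHTS = [math.exp(-1) / math.factorial(k) for k in range(7)]

def sample_k():
    return random.choices(range(7), weights=POISSON1_WEIGHTS)[0]

class OnlineMean:
    def __init__(self):
        self.n, self.value = 0, 0.0

    def fit_one(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n

# each of the 10 models sees each observation K ~ Poisson(1) times
models = [OnlineMean() for _ in range(10)]
for x in [5.0, 10.0, 3.0, 8.0, 6.0]:
    for model in models:
        for _ in range(sample_k()):
            model.fit_one(x)

bagged_estimate = sum(m.value for m in models) / len(models)
```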

---

class: middle

### (S)(N)(AR)(I)(MA)(X)

The ARMA model is defined as:

$$\hat{y}\_t = \sum\_{i=1}^p \alpha\_i y\_{t-i} + \sum\_{i=1}^q \beta\_i (y\_{t-i} - \hat{y}\_{t-i}) $$

Classically, Kalman filters are used to find the weights $\alpha\_i$ and $\beta\_i$. But $y\_{t-i}$ and $\hat{y}\_{t-i}$ can also be [seen as features in an online setting](https://dl.acm.org/citation.cfm?id=3016160):

- Seasonality can be handled online
- Any online learning model can be used
- Detrending by differencing can be done online
- Heteroscedasticity can be handled online
- Exogenous variables can be added
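
A sketch of the feature-extraction step — turning the raw stream into lag features that any online regressor can consume (the moving-average residual terms are omitted for brevity):

```python
import collections

p = 2  # number of autoregressive lags used as features
history = collections.deque(maxlen=p)

pairs = []  # (features, target) pairs, ready for any online model's fit_one
for y in [3.0, 4.0, 5.0, 6.0, 7.0]:
    if len(history) == p:
        # 'y-1' is the previous value, 'y-2' the one before, and so on
        x = {f'y-{i + 1}': history[-(i + 1)] for i in range(p)}
        pairs.append((x, y))
    history.append(y)
```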

---

class: middle

### Online aggregated features

```python
&gt;&gt;&gt; import creme

&gt;&gt;&gt; X = [
...     {'meal': 'tika masala', 'sales': 42},
...     {'meal': 'kale salad', 'sales': 16},
...     {'meal': 'kale salad', 'sales': 24},
...     {'meal': 'tika masala', 'sales': 58}
... ]

&gt;&gt;&gt; agg = creme.feature_extraction.Agg(
...     on='sales',
...     by='meal',
...     how=creme.stats.Mean()
... )

&gt;&gt;&gt; for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'sales_mean_by_meal': 42.0}
{'sales_mean_by_meal': 16.0}
{'sales_mean_by_meal': 20.0}
{'sales_mean_by_meal': 50.0}
```

---

background-color: #008080
class: middle, white

## There is much more

.bullets[
- Half-space trees for anomaly detection
- $k$-means clustering
- Latent Dirichlet allocation (LDA)
- Expert learning
- Stacking
- Recommendation systems
- See [creme-ml.github.io/api](https://creme-ml.github.io/api.html)
]

---

### Alternative frameworks

&lt;div align="center"&gt;
 &lt;img height="500px" src="https://maxhalford.github.io/img/slides/creme/others.svg" /&gt;
&lt;/div&gt;

---

class: center, middle

### Binary classification benchmark with default parameters

&lt;div align="center"&gt;
.pure-table.pure-table-striped[
| Library | Method | Accuracy | Average fit time | Average predict time |
|---------|------------|----------|------------------|----------------------|
| creme | LogisticRegression | 0.61810 | .green[26μs] | .green[10μs] |
| creme | PAClassifier | 0.55009 | 35μs | 22μs |
| creme | DecisionTreeClassifier | 0.64663 | 356μs | 15μs |
| creme | RandomForestClassifier | .green[0.65915] | 3ms, 972μs | 208μs |
| Keras on TF (CPU) | Dense | 0.61840 | 463μs | 534μs |
| PyTorch (CPU) | Linear | 0.61840 | 926μs | 621μs |
| scikit-garden | MondrianTreeClassifier | .red[0.53875] | 864μs | 208μs |
| scikit-garden | MondrianForestClassifier | 0.60061 | .red[9ms, 773μs] | .red[1ms, 233μs] |
| scikit-learn | SGDClassifier | 0.56161 | 420μs | 116μs |
| scikit-learn | PassiveAggressiveClassifier | 0.55009 | 398μs | 114μs |
]
&lt;/div&gt;

---

class: center, middle

### Linear regression benchmark

&lt;div align="center"&gt;
.pure-table.pure-table-striped[
| Library | Method | MSE | Average fit time | Average predict time |
|---------|------------|----------|------------------|----------------------|
| creme | LinearRegression | 23.035085 | 18μs | 4μs |
| Keras on TF (CPU) | Dense | 23.035086 | 1ms, 208μs | 722μs |
| PyTorch (CPU) | Linear | 23.035086 | 577μs | 187μs |
| scikit-learn | SGDRegressor | 25.295369 | 305μs | 108μs |
]
&lt;/div&gt;

---

background-color: #1f282d
class: middle, white

## Current work (1)

.bullets[
- Boosting, many methods but no clear winner:
 - [Online Bagging and Boosting (Oza-Russell, 2005)](https://ti.arc.nasa.gov/m/profile/oza/files/ozru01a.pdf)
 - [Online Gradient Boosting (Beygelzimer, 2015)](https://arxiv.org/pdf/1506.04820.pdf)
 - [Optimal and Adaptive Algorithms for Online Boosting (Beygelzimer, 2015)](http://proceedings.mlr.press/v37/beygelzimer15.pdf)
- Mixture models through expectation-maximization:
 - [Recursive Parameter Estimation Using Incomplete Data (Titterington, 1982)](https://apps.dtic.mil/dtic/tr/fulltext/u2/a116190.pdf)
 - [A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants (Neal-Hinton, 1998)](http://www.cs.toronto.edu/~fritz/absps/emk.pdf)
 - [Online EM Algorithm for Latent Data Models (Cappé-Moulines 2009)](https://hal.archives-ouvertes.fr/hal-00201327/document)
]

---

background-color: #1f282d
class: middle, white

## Current work (2)

.bullets[
- Field-aware factorization machines (FFM):
 - [Factorization Machines (Rendle, 2010)](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf)
 - [Field-aware Factorization Machines for CTR Prediction (Juan et al., 2016)](https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf)
 - [Field-aware Factorization Machines in a Real-world Online Advertising System (Juan-Lefortier-Chappelle, 2017)](https://arxiv.org/pdf/1701.04099.pdf)
- Metric learning:
 - [Online and Batch Learning of Pseudo-Metrics (Shwartz-Singer-Ng, 2004)](https://ai.stanford.edu/~ang/papers/icml04-onlinemetric.pdf)
 - [Information-Theoretic Metric Learning (Davis et al., 2007)](http://www.cs.utexas.edu/users/pjain/pubs/metriclearning_icml.pdf)
 - [Online Metric Learning and Fast Similarity Search (Jain et al., 2009)](http://people.bu.edu/bkulis/pubs/nips_online.pdf)
]

---

background-color: #85144b
class: middle, white

## You can help

.bigbullets[
- Use it and tell us about it
- Share it with others
- Take on issues on GitHub
- Become a core contributor
]

---

class: center, middle

# Thanks for listening!

.left-column[
&lt;div align="center"&gt;
 &lt;img height="440px" src="https://maxhalford.github.io/img/slides/creme/yoda.jpg" /&gt;
&lt;/div&gt;
]

.right-column[
&lt;div align="center" style="margin-top: 50px;"&gt;
 &lt;img height="400px" src="https://maxhalford.github.io/img/slides/creme/qr_code.svg" /&gt;
&lt;/div&gt;
]

 &lt;/textarea&gt;
 &lt;script src="https://remarkjs.com/downloads/remark-latest.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://gnab.github.io/remark/downloads/remark-latest.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"&gt;&lt;/script&gt;
 &lt;script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/contrib/auto-render.min.js"&gt;&lt;/script&gt;
 &lt;link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/pure-min.css" integrity="sha384-oAOxQR6DkCoMliIh8yFnu25d7Eq/PHS21PClpwjOTeU2jRSq11vu66rf90/cZr47" crossorigin="anonymous"&gt;
 &lt;link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css"&gt;
 &lt;script type="text/javascript"&gt;
 var options = {};
 var renderMath = function() {
 renderMathInElement(document.body, {delimiters: [ // mind the order of delimiters(!?)
 {left: "$$", right: "$$", display: true},
 {left: "$", right: "$", display: false},
 {left: "\\[", right: "\\]", display: true},
 {left: "\\(", right: "\\)", display: false},
 ]});
 }
 var slideshow = remark.create(
 {
 slideNumberFormat: function (current, total) {
 if (current === 1) { return "" }
 return current;
 },
 highlightStyle: 'github',
 highlightLines: true,
 ratio: '16:9'
 },
 renderMath
 );
 &lt;/script&gt;
 &lt;/body&gt;
&lt;/html&gt;</description></item><item><title>An introduction to symbolic regression</title><link>https://maxhalford.github.io/slides/symbolic-regression/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/slides/symbolic-regression/</guid><description>&lt;div align="center"&gt;
&lt;h1&gt;An introduction to symbolic regression&lt;/h1&gt;
&lt;h3&gt;Max Halford - PhD student IRIT/IMT&lt;/h3&gt;
&lt;h4&gt;Toulouse Data Science Meetup - December 2017&lt;/h4&gt;
&lt;/div&gt;

.center[
.left-column[![tds_logo](/assets/img/presentations/tds_logo.jpeg)]
.right-column[![xgp_logo](/assets/img/presentations/xgp_logo.png)]
]

---

layout: true

# Symbolic regression

---

## Quick overview

- The goal is to evolve "programs" with selection, mutation, and crossover
- Selection keeps programs that perform well
- Mutation changes a piece of the program
- Crossover combines two programs

---

&lt;img src="https://maxhalford.github.io/assets/img/presentations/evolutionary_algorithms.png" width="120%"&gt;

---

## Example programs

&lt;img src="https://maxhalford.github.io/assets/img/presentations/example_programs.png" width="100%"&gt;

---

## Kaggle Titanic top 1% 🚢

Two years ago [scirpus](https://www.kaggle.com/scirpus) posted a [solution](https://www.kaggle.com/scirpus/genetic-programming-lb-0-88) to the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic)

```python
y_raw = ((np.minimum(((((0.058823499828577 + X['Sex']) - np.cos((X['Pclass'] / 2.0))) * 2.0)), ((0.885868))) * 2.0) +np.maximum(((X['SibSp'] - 2.409090042114258)), (-(np.minimum((X['Sex']), (np.sin(X['Parch']))) * X['Pclass']))) +(0.138462007045746 * ((np.minimum((X['Sex']), (((X['Parch'] / 2.0) / 2.0))) * X['Age']) - X['Cabin'])) +np.minimum(((np.sin((X['Parch'] * ((X['Fare'] - 0.720430016517639) * 2.0))) * 2.0)), ((X['SibSp'] / 2.0))) +np.maximum((np.minimum((-np.cos(X['Embarked'])), (0.138462007045746))), (np.sin(((X['Cabin'] - X['Fare']) * 2.0)))) +-np.minimum(((((X['Age'] * X['Parch']) * X['Embarked']) + X['Parch'])), (np.sin(X['Pclass']))) +np.minimum((X['Sex']), ((np.sin(-(X['Fare'] * np.cos((X['Fare'] * 1.630429983139038)))) / 2.0))) +np.minimum(((0.230145)), (np.sin(np.minimum((((67.0 / 2.0) * np.sin(X['Fare']))), (0.31830988618379069))))) +np.sin((np.sin(X['Cabin']) * (np.sin((12.6275)) * np.maximum((X['Age']), (X['Fare']))))) +np.sin(((np.minimum((X['Fare']), ((X['Cabin'] * X['Embarked']))) / 2.0) * -X['Fare'])) +np.minimum((((2.675679922103882 * X['SibSp']) * np.sin(((96) * np.sin(X['Cabin']))))), (X['Parch'])) +np.sin(np.sin((np.maximum((np.minimum((X['Age']), (X['Cabin']))), ((X['Fare'] * 0.31830988618379069))) * X['Cabin']))) +np.maximum((np.sin(((12.4148) * (X['Age'] / 2.0)))), (np.sin((-3.0 * X['Cabin'])))) +(np.minimum((np.sin((((np.sin(((X['Fare'] * 2.0) * 2.0)) * 2.0) * 2.0) * 2.0))), (X['SibSp'])) / 2.0) +((X['Sex'] - X['SibSp']) * (np.cos(((X['Embarked'] - 0.730768978595734) + X['Age'])) / 2.0)) +((np.sin(X['Cabin']) / 2.0) - (np.cos(np.minimum((X['Age']), (X['Embarked']))) * np.sin(X['Embarked']))) +np.minimum((0.31830988618379069), ((X['Sex'] * (2.212120056152344 * (0.720430016517639 - np.sin((X['Age'] * 2.0))))))) +(np.minimum((np.cos(X['Fare'])), (np.maximum((np.sin(X['Age'])), (X['Parch'])))) * np.cos((X['Fare'] / 2.0))) +np.sin((X['Parch'] * np.minimum(((X['Age'] - 1.5707963267948966)), ((np.cos((X['Pclass'] * 2.0)) / 2.0))))) 
+(X['Parch'] * (np.sin(((X['Fare'] * (0.623655974864960 * X['Age'])) * 2.0)) / 2.0)) +(0.31830988618379069 * np.cos(np.maximum(((0.602940976619720 * X['Fare'])), ((np.sin(0.720430016517639) * X['Age']))))) +(np.minimum(((X['SibSp'] / 2.0)), (np.sin(((X['Pclass'] - X['Fare']) * X['SibSp'])))) * X['SibSp']) +np.tanh((X['Sex'] * np.sin((5.199999809265137 * np.sin((X['Cabin'] * np.cos(X['Fare']))))))) +(np.minimum((X['Parch']), (X['Sex'])) * np.cos(np.maximum(((np.cos(X['Parch']) + X['Age'])), (3.1415926535897931)))) +(np.minimum((np.tanh(((X['Cabin'] / 2.0) + X['Parch']))), ((X['Sex'] + np.cos(X['Age'])))) / 2.0) +(np.sin((np.sin(X['Sex']) * (np.sin((X['Age'] * X['Pclass'])) * X['Pclass']))) / 2.0) +(X['Sex'] * (np.cos(((X['Sex'] + X['Fare']) * ((8.48635) * (63)))) / 2.0)) +np.minimum((X['Sex']), ((np.cos((X['Age'] * np.tanh(np.sin(np.cos(X['Fare']))))) / 2.0))) +(np.tanh(np.tanh(-np.cos((np.maximum((np.cos(X['Fare'])), (0.094339601695538)) * X['Age'])))) / 2.0) +(np.tanh(np.cos((np.cos(X['Age']) + (X['Age'] + np.minimum((X['Fare']), (X['Age'])))))) / 2.0) +(np.tanh(np.cos((X['Age'] * ((-2.0 + np.sin(X['SibSp'])) + X['Fare'])))) / 2.0) +(np.minimum((((281) - X['Fare'])), (np.sin((np.maximum(((176)), (X['Fare'])) * X['SibSp'])))) * 2.0) +np.sin(((np.maximum((X['Embarked']), (X['Age'])) * 2.0) * (((785) * 3.1415926535897931) * X['Age']))) +np.minimum((X['Sex']), (np.sin(-(np.minimum(((X['Cabin'] / 2.0)), (X['SibSp'])) * (X['Fare'] / 2.0))))) +np.sin(np.sin((X['Cabin'] * (X['Embarked'] + (np.tanh(-X['Age']) + X['Fare']))))) +(np.cos(np.cos(X['Fare'])) * (np.sin((X['Embarked'] - ((734) * X['Fare']))) / 2.0)) +((np.minimum((X['SibSp']), (np.cos(X['Fare']))) * np.cos(X['SibSp'])) * np.sin((X['Age'] / 2.0))) +(np.sin((np.sin((X['SibSp'] * np.cos((X['Fare'] * 2.0)))) + (X['Cabin'] * 2.0))) / 2.0) +(((X['Sex'] * X['SibSp']) * np.sin(np.sin(-(X['Fare'] * X['Cabin'])))) * 2.0) +(np.sin((X['SibSp'] * ((((5.428569793701172 + 67.0) * 2.0) / 2.0) * X['Age']))) / 2.0) +(X['Pclass'] * 
(np.sin(((X['Embarked'] * X['Cabin']) * (X['Age'] - (1.07241)))) / 2.0)) +(np.cos(((((-X['SibSp'] + X['Age']) + X['Parch']) * X['Embarked']) / 2.0)) / 2.0) +(0.31830988618379069 * np.sin(((X['Age'] * ((X['Embarked'] * np.sin(X['Fare'])) * 2.0)) * 2.0))) +((np.minimum(((X['Age'] * 0.058823499828577)), (X['Sex'])) - 0.63661977236758138) * np.tanh(np.sin(X['Pclass']))) +-np.minimum(((np.cos(((727) * ((X['Fare'] + X['Parch']) * 2.0))) / 2.0)), (X['Fare'])) +(np.minimum((np.cos(X['Fare'])), (X['SibSp'])) * np.minimum((np.sin(X['Parch'])), (np.cos((X['Embarked'] * 2.0))))) +(np.minimum((((X['Fare'] / 2.0) - 2.675679922103882)), (0.138462007045746)) * np.sin((1.5707963267948966 * X['Age']))) +np.minimum(((0.0821533)), (((np.sin(X['Fare']) + X['Embarked']) - np.cos((X['Age'] * (9.89287)))))))
y_pred = 1 / (1 + np.exp(-y_raw))
```

0.88516 (top 1% as of today) on the public leaderboard 😵

---

## Types of nodes

- Constants
- Variables
- Functions

Huge search space, because the shape of the model itself has to be searched over 🙊

---

## Algorithm (high-level) ➰

```python
programs = make_random_programs()

for i in range(generations):
    evaluate(programs)
    new_programs = select(programs)
    crossover(new_programs)
    mutate(new_programs)
    programs = new_programs
```
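
Fleshed out into a runnable toy for a one-variable regression target — mutation and truncation selection only, crossover omitted for brevity:

```python
import random

random.seed(0)

OPS = {'add': lambda a, b: a + b, 'sub': lambda a, b: a - b, 'mul': lambda a, b: a * b}

def random_tree(depth):
    # a program is a leaf ('x' or a constant) or an operator over two subtrees
    if depth == 0:
        return random.choice(['x', round(random.uniform(-2, 2), 2)])
    op = random.choice(list(OPS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def run(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](run(left, x), run(right, x))
    return tree  # constant leaf

def mse(tree, xs, ys):
    return sum((run(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mutate(tree):
    # either descend into a branch or replace the subtree with a fresh one
    if isinstance(tree, tuple) and random.choice([True, False]):
        op, left, right = tree
        return (op, mutate(left), right)
    return random_tree(depth=2)

xs = [i / 10 for i in range(-20, 21)]
ys = [x * x + x for x in xs]  # the program we hope to rediscover

population = [random_tree(depth=2) for _ in range(50)]
initial_error = min(mse(t, xs, ys) for t in population)

for generation in range(30):
    population.sort(key=lambda t: mse(t, xs, ys))  # evaluate
    survivors = population[:10]                    # select (with elitism)
    population = survivors + [mutate(random.choice(survivors)) for _ in range(40)]

best = min(population, key=lambda t: mse(t, xs, ys))
best_error = mse(best, xs, ys)
```

Because the survivors are carried over unchanged, the best program's error can only decrease from one generation to the next.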

---

## Mutation

---

## Crossover 💏

---

## Evaluation 💯

---

## Selection 

---

## Initialization

---

## Pros 😺

- Flexible model
- Built-in feature selection
- Can optimise non-differentiable metrics
- Useful for stacking
- Makes you look cool 😎

---

## Cons 😿

- Black box model
- No mathematical foundation
- Requires a lot of CPU power
- Non-deterministic and volatile

---

layout: true

.center[![xgp_logo](/assets/img/presentations/xgp_logo.png)]

---

## Enter XGP 🎉

- Written in [Go](https://golang.org/) 
- Optimization done with [gago](https://github.com/MaxHalford/gago)
- SIMD operations thanks to [gonum](https://www.gonum.org/) ⚡
- Readable code 📖
- Very customizable (with sensible defaults!)
- Command line interface (CLI) 💻
- Python bindings (scikit-learn API) 🐍

---

## Command line interface (CLI) 💻

```sh
$ xgp fit train.csv
```

```sh
$ xgp predict test.csv
```

---

## Python package 🐍

### Regression 📈

```python
model = xgp.XGPRegressor()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
```

---

## Python package 🐍

### Feature extraction 🔬

```py
model = xgp.XGPTransformer()

model.fit(X_train, y_train)

train_gp_features = model.transform(X_train)
test_gp_features = model.transform(X_test)
```

---

## Future work 🔮

- Binary and multi-class classification
- Boosting (promising)
- Caching to handle large datasets
- More bindings (R, Ruby, ...)
- Extensive testing
- Documentation and examples

---

layout: false
class: center, middle

# Thanks! ✌
[github.com/MaxHalford](https://github.com/MaxHalford/)</description></item><item><title>Bio</title><link>https://maxhalford.github.io/bio/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/bio/</guid><description>&lt;p&gt;Hello ✌️&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m Head of Data at &lt;a href="https://www.carbonfact.com/"&gt;Carbonfact&lt;/a&gt;, where we measure the carbon footprint of clothing items 🍃. Before that I worked for &lt;a href="https://alan.com/"&gt;Alan&lt;/a&gt;, a health insurance company. My &lt;a href="https://maxhalford.github.io/blog/phd-about"&gt;PhD topic&lt;/a&gt; was about applying machine learning &amp;ndash; &lt;a href="https://en.wikipedia.org/wiki/Bayesian_network"&gt;Bayesian networks&lt;/a&gt; in particular 🕸️ &amp;ndash; to &lt;a href="https://en.wikipedia.org/wiki/Query_optimization"&gt;query optimisation&lt;/a&gt; in relational databases 🤖. My current areas of interest revolve around &lt;a href="https://github.com/online-ml/awesome-online-machine-learning"&gt;online machine learning&lt;/a&gt; 🍥, &lt;a href="https://en.wikipedia.org/wiki/Document_processing"&gt;document processing&lt;/a&gt; 🔬, as well as tooling and good practices for data analytics 📊 and engineering 📦&lt;/p&gt;</description></item><item><title>Links</title><link>https://maxhalford.github.io/links/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><author>maxhalford25@gmail.com (Max Halford)</author><guid>https://maxhalford.github.io/links/</guid><description>&lt;h2 id="smart-people"&gt;Smart people&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://timsalimans.com/"&gt;Tim Salimans on Data Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.randalolson.com/blog/"&gt;Randal Olson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sametmax.com/"&gt;Sam &amp;amp; Max&lt;/a&gt; &amp;ndash; French and NSFW!&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sebastianraschka.com/blog/index.html"&gt;Sebastian Raschka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.google.com/site/unclebobconsultingllc/"&gt;Clean Coder&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jakevdp.github.io/"&gt;Pythonic Perambulations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://erikbern.com/"&gt;Erik Bernhardsson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.otoro.net/"&gt;otoro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.christianperone.com/"&gt;Terra Incognita&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/blog/"&gt;Real Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nerds.airbnb.com/"&gt;Airbnb Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.kaggle.com/"&gt;No Free Hunch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.unofficialgoogledatascience.com/"&gt;The Unofficial Google Data Science Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://willwolf.io/"&gt;will wolf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.echen.me/"&gt;Edwin Chen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://use-the-index-luke.com/"&gt;Use the index, Luke!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://unwttng.com/"&gt;Jack Preston&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiseodd.github.io/"&gt;Agustinus Kristiadi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://datagenetics.com/blog.html"&gt;DataGenetics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://katbailey.github.io/"&gt;Katherine Bailey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.netflix.com/"&gt;Netflix Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.inference.vc/"&gt;inFERENce&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://robjhyndman.com/hyndsight/"&gt;Hyndsight&lt;/a&gt; &amp;ndash; Rob Hyndman is a time series specialist.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twiecki.io/"&gt;While My MCMC Gently Samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ines.io/"&gt;Ines Montani&lt;/a&gt; &amp;ndash; by one of the founders of &lt;a href="https://spacy.io/"&gt;spaCy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://smerity.com/articles/articles.html"&gt;Stephen Smerity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://norvig.com/"&gt;Peter Norvig&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/developerworks/community/blogs/jfp/?lang=en"&gt;IT Best Kept Secret Is Optimization&lt;/a&gt; &amp;ndash; By Jean-Francois Puget, aka &lt;a href="https://www.kaggle.com/cpmpml"&gt;CPMP&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explained.ai/"&gt;explained.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://betterexplained.com/"&gt;Better Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://geneticargonaut.blogspot.com/"&gt;Genetic Argonaut&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas-dev.github.io/pandas-blog/"&gt;pandas blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/"&gt;Towards Data Science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.allendowney.com/blog/"&gt;Probably Overthinking It&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simplystatistics.org/"&gt;Simply Statistics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://practicallypredictable.com/"&gt;Practically Predictable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://koaning.io/"&gt;koaning&lt;/a&gt; &amp;ndash; by Vincent Warmerdam, who made &lt;a href="https://calmcode.io/"&gt;calmcode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogarithms.github.io/"&gt;blogarithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://possiblywrong.wordpress.com/"&gt;Possibly Wrong&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://fastml.com/"&gt;FastML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://parameterfree.com/"&gt;Parameter-free Learning and Optimization Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://toddwschneider.com/"&gt;Todd W. Schneider&lt;/a&gt; &amp;ndash; This guy is really good at exploratory data analysis.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://espadrine.github.io/blog/"&gt;Yann Thaddée&lt;/a&gt; &amp;ndash; Not directly related to data science but interesting nonetheless.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.solipsys.co.uk/new/ColinsBlog.html"&gt;Colins Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fabiensanglard.net"&gt;Fabien Sanglard&lt;/a&gt; &amp;ndash; nothing to do with data science, but such good taste!&lt;/li&gt;
&lt;li&gt;&lt;a href="https://glowingpython.blogspot.com/"&gt;The Glowing Python&lt;/a&gt; &amp;ndash; By the creator of &lt;a href="https://github.com/JustGlowing/minisom"&gt;MiniSom&lt;/a&gt;, which is worth checking out too.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://notmatthancock.github.io/notes/"&gt;Matt Hancock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://francisbach.com/"&gt;Francis Bach&lt;/a&gt; &amp;ndash; Someone with an h-index of 80+ who takes the time to blog is worth reading.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gwern.net/index"&gt;Gwern Branwen&lt;/a&gt; &amp;ndash; Cool in a weird way.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://djalil.chafai.net/blog/"&gt;Libres pensées d&amp;rsquo;un mathématicien ordinaire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.countbayesie.com/"&gt;Count Bayesie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://khakieconomics.github.io/"&gt;Jim Savage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nhigham.com/"&gt;Nick Higham&lt;/a&gt; &amp;ndash; A lot of well explained algebra.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://calmcode.io/"&gt;Calmcode&lt;/a&gt; &amp;ndash; Not a blog per se, but a nice collection of short to the point tutorials about various tools.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chris-said.io/"&gt;Chris Said&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.evanmiller.org/index.html"&gt;Evan Miller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://evjang.com/"&gt;Eric Jang&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aakinshin.net/"&gt;Andrey Akinshin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.singlelunch.com/blog/"&gt;Single Lunch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://freakonometrics.hypotheses.org/"&gt;Freakonometrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.martindaniel.co/"&gt;Martin Daniel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chriskiehl.com/"&gt;Chris Kiehl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ithaka.im/"&gt;ithaka.im&lt;/a&gt; &amp;ndash; A guy I met who travelled for 6 years with his wife on a bike, very inspiring.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://muthu.co/"&gt;Muthukrishnan&lt;/a&gt; &amp;ndash; Has written some neat document processing stuff.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bottosson.github.io/"&gt;Björn Ottosson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gdmarmerola.github.io/"&gt;Guilherme Duarte Marmerola&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://calpaterson.com/"&gt;Cal Paterson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clrcrl.com/"&gt;Claire Carroll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://lukemetz.com/blog/"&gt;Luke Metz&lt;/a&gt; &amp;ndash; Luke is working on the niche topic of meta-learning at Google. He also happens to a very kind person.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://practicalrecs.com/"&gt;Practical Recommendations&lt;/a&gt; &amp;ndash; A blog about recommender systems.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.robinlinacre.com/"&gt;Robin Linacre&lt;/a&gt; &amp;ndash; Some good stuff related to record linkage.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nlathia.github.io/"&gt;Neal Lathia&lt;/a&gt; &amp;ndash; Machine learning in production stuff.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.johndcook.com/blog/"&gt;John D. Cook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bxroberts.org/"&gt;Brandon Roberts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.allendowney.com/blog/"&gt;Allen Downey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.blef.fr/"&gt;Christophe Blefari&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://srome.github.io/"&gt;Scott Rome&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eugeneyan.com/writing/"&gt;Eugene Yan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ljvmiranda921.github.io/notebook/"&gt;Lj Miranda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://death.andgravity.com/"&gt;death and gravity&lt;/a&gt; &amp;ndash; Great advanced Python resource.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shapeofdata.wordpress.com/"&gt;The Shape of Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://idea-instructions.com/"&gt;IDEA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.shadedrelief.com/"&gt;Shaded relief&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lamport.azurewebsites.net/pubs/pubs.html"&gt;Leslie Lamport&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ntguardian.wordpress.com/blog/"&gt;Curtis Miller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.naftaliharris.com/"&gt;Naftali Harris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.lbreyer.com/welcome.html"&gt;Laird Breyer&lt;/a&gt; &amp;ndash; wrote some cool software for text classification called dbacl, and markovpr which is a PageRank implementation.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vickiboykis.com/"&gt;Vicky Boykis&lt;/a&gt; &amp;ndash; the OG behind &lt;a href="https://normconf.com/"&gt;Normconf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.djnavarro.net/"&gt;Danielle Navarro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.redblobgames.com/"&gt;Amit Patel&lt;/a&gt; &amp;ndash; visual explanations of algorithms used in games.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nogilnick.com/"&gt;nogilnick&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nathanielbullard.com/"&gt;Nat Bullard&lt;/a&gt; &amp;ndash; known for making &lt;a href="https://www.nathanielbullard.com/presentations"&gt;annual presentations&lt;/a&gt; on the state of decarbonization.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://minimaxir.com/"&gt;Max Woolf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yuinchien.com/"&gt;Yuin Chien&lt;/a&gt; &amp;ndash; does design at Google.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://interconnected.org/home/"&gt;Matt Webb&lt;/a&gt; &amp;ndash; this madlad has been blogging since 2000.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://noperator.dev/"&gt;Caleb Gross&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geoffreylitt.com/"&gt;Geoffrey Litt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://claytonwramsey.com/"&gt;Clayton Ramsey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rusty.today/"&gt;Rusty Conover&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://austinhenley.com/blog.html"&gt;Austin Z. Henley&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://notes.crmarsh.com/"&gt;Charlie Marsh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.cmu.edu/~pavlo/"&gt;Andy Pavlo&lt;/a&gt; &amp;ndash; &lt;a href="https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-retrospective.html?#fileformats"&gt;reviews&lt;/a&gt; database landscape every year.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ccs.neu.edu/home/fell/"&gt;Harriet Fell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://campedersen.com/"&gt;Cam Pedersen&lt;/a&gt; &amp;ndash; has a pretty Berkeley Mono website&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="machine-learning"&gt;Machine learning&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf"&gt;The Elements of Statistical Learning - Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf"&gt;Machine Learning - Tom Mitchell&lt;/a&gt; &amp;ndash; I think this wonderful textbook is under-appreciated.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://aima.cs.berkeley.edu/"&gt;Artificial Intelligence: A Modern Approach - Russel &amp;amp; Norvig&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlcourse.ai/"&gt;mlcourse.ai&lt;/a&gt; &amp;ndash; Of all the introductions to machine learning, I think this is the one that strikes the best balance between theory and practice.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stanford.edu/~shervine/teaching/cs-229.html"&gt;Machine learning cheat sheets - Shervine Amidi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nbviewer.jupyter.org/github/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/table_of_contents.ipynb"&gt;Kalman and Bayesian Filters in Python - Roger Labbe&lt;/a&gt; &amp;ndash; Kalman filters are notoriously hard to grok, this tutorial nicely builds up the steps to understanding them.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cs231n.github.io/convolutional-networks/"&gt;CS231n Convolutional Neural Networks for Visual Recognition - Stanford&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.math.u-bordeaux.fr/~mbergman/PDF/These/annexeC.pdf"&gt;Algorithmes d’optimisation non-linéaire sans contrainte (French) - Michel Bergmann&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.stanford.edu/~koller/Papers/Koller+al:SRL07.pdf"&gt;Graphical Models in a Nutshell - Koller et al.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/machine-learning/guides/rules-of-ml/"&gt;Rules of Machine Learning: Best Practices for ML Engineering - Martin Zinkevich&lt;/a&gt; &amp;ndash; You should read this once a year.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf"&gt;A Few Useful Things to Know about Machine Learning - Pedro Domingos&lt;/a&gt; &amp;ndash; This short paper summarizes basic truths in machine learning.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://norvig.com/spell-correct.html"&gt;How to Write a Spelling Corrector - Peter Norvig&lt;/a&gt; &amp;ndash; Magic in 36 lines of code.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://twiecki.github.io/blog/2015/11/10/mcmc-sampling/"&gt;MCMC sampling for dummies - Thomas Wiecki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d"&gt;Your Easy Guide to Latent Dirichlet Allocation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/"&gt;An Intuitive Explanation of Convolutional Neural Networks - Ujjwal Karn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ruder.io/optimizing-gradient-descent/"&gt;An overview of gradient descent optimization algorithms - Sebastian Ruder&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explained.ai/gradient-boosting/index.html"&gt;How to explain gradient boosting - Terence Parr and Jeremy Howard&lt;/a&gt; &amp;ndash; A very good introduction to vanilla gradient boosting with step by step examples.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://brage.bibsys.no/xmlui/bitstream/handle/11250/2433761/16128_FULLTEXT.pdf"&gt;Why Does XGBoost Win &amp;ldquo;Every&amp;rdquo; Machine Learning Competition? - Didrik Nielsen&lt;/a&gt; &amp;ndash; This Master&amp;rsquo;s thesis goes into some of the details of XGBoost without being too bloated.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fermatslibrary.com/s/statistical-tests-p-values-confidence-intervals-and-power-a-guide-to-misinterpretations"&gt;Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.archive.org/web/20100613220918/http://astro.temple.edu/~powersmr/vol7no3.pdf"&gt;The Cramér-Rao Lower Bound on Variance: Adam and Eve’s &amp;ldquo;Uncertainty Principle&amp;rdquo; - Michael Powers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nbviewer.jupyter.org/url/norvig.com/ipython/Probability.ipynb"&gt;A Concrete Introduction to Probability (using Python) - Peter Norvig&lt;/a&gt; &amp;ndash; Extremely elegant Python coding.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://louisabraham.github.io/notebooks/hungarian_trick.html"&gt;The Hungarian Maximum Likelihood Trick - Louis Abraham&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://courses.engr.illinois.edu/cs598ps/fa2018/"&gt;Machine Learning for Signal Processing - University of Illinois&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yugeten.github.io/posts/2019/09/GP/"&gt;Gaussian Process, not quite for dummies - Yuge Shi&lt;/a&gt; &amp;ndash; Gaussian processes are quite difficult to understand (at least, for me) but Yuge gives some great visual intuitions.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1411.5018.pdf"&gt;Frequentism and Bayesianism: A Python-driven Primer - Jake VanderPlas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1601.00670.pdf"&gt;Variational Inference: A Review for Statisticians - David Blei and his flock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://tullo.ch/articles/decision-tree-evaluation/"&gt;The Performance of Decision Tree Evaluation Strategies - Andrew Tulloch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1902.07153.pdf"&gt;Simplifying Graph Convolutional Networks - Felix Wu et al.&lt;/a&gt; &amp;ndash; A nice example of putting the horse before the cart.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/"&gt;MIT 6.867 machine learning course notes - Tommi Jaakola&lt;/a&gt; &amp;ndash; For people who enjoy concise mathematical notation.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://karpathy.github.io/2019/04/25/recipe/"&gt;A Recipe for Training Neural Networks - Andrej Karpathy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://incompleteideas.net/IncIdeas/BitterLesson.html"&gt;The Bitter Lesson - Richard Sutton&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://unboxresearch.com/articles/lsh_post1.html"&gt;Introduction to Locality-Sensitive Hashing - Tyler Neylon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://peterbloem.nl/blog/transformers"&gt;Transformers from scratch - Peter Bloem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.confetti.ai/assets/ml-primer/ml_primer.pdf"&gt;A Machine Learning Primer - Mihail Eric&lt;/a&gt; &amp;ndash; A good read for beginners in machine learning algorithms.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.unofficialgoogledatascience.com/2017/07/fitting-bayesian-structural-time-series.html"&gt;Fitting Bayesian structural time series with the bsts R package - Steven L. Scott&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bergvca.github.io/2017/10/14/super-fast-string-matching.html"&gt;Super Fast String Matching in Python - Chris van den Berg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/dev/auto_examples/linear_model/plot_poisson_regression_non_normal_loss.html#sphx-glr-auto-examples-linear-model-plot-poisson-regression-non-normal-loss-py"&gt;Poisson regression and non-normal loss - scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@u39kun/i-find-machine-learning-competitions-exciting-and-addicting-438fe95b33f5"&gt;Perfect lung cancer detections in a $1 million ML competition with an ingenious hack - Yusaku Sako&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explained.ai/rf-importance/index.html"&gt;Beware Default Random Forest Importances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf"&gt;From RankNet to LambdaRank to LambdaMART: An Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/"&gt;Word2Vec Tutorial - The Skip-Gram Model - Chris McCormick&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://matheusfacure.github.io/python-causality-handbook/landing-page.html"&gt;Causal Inference for The Brave and True&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html"&gt;Data Distribution Shifts and Monitoring - Chip Huyen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sebastianraschka.com/blog/2022/confidence-intervals-for-ml.html"&gt;Creating Confidence Intervals for Machine Learning Classifiers - Sebastian Raschka&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlvu.github.io/"&gt;Machine Learning @ VU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dmicz.github.io/machine-learning/svd-image-compression/"&gt;SVD Image Compression, Explained - Dennis Miczek&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://snats.xyz/pages/articles/classifying_a_bunch_of_pdfs.html"&gt;Classifying all of the pdfs on the internet - Santiago Pedroza&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.griffens.net/blog/weird-kaggle-books-reflections/"&gt;Weird Kaggle, the superiority of books, and other reflections - Nick Griffiths&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://peterbloem.nl/blog/transformers"&gt;Transformers from scratch - Peter Bloem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explained.ai/rnn/index.html"&gt;Explaining RNNs without neural networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://course.fast.ai/"&gt;Practical Deep Learning for Coders&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/"&gt;What Is ChatGPT Doing&amp;hellip; and Why Does It Work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gael-varoquaux.info/science/carte-toward-table-foundation-models.html"&gt;CARTE: toward table foundation models - Gaël Varoquaux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/blog/everything-you-always-wanted-to-know-about-extreme-classification-but-were-afraid-to-ask/"&gt;Everything you always wanted to know about extreme classification - Microsoft Research&lt;/a&gt; &amp;ndash; I love the idea that recsys can be framed as an extreme classification problem.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://daniel.lawrence.lu/blog/y2025m09d21/"&gt;Line scan camera image processing - Daniel Lawrence Lu&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="data-science"&gt;Data science&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://paulbutler.org/2012/make-for-data-scientists/"&gt;Make for data scientists - Paul Butler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html"&gt;Tidy Data - Hadley Wickham&lt;/a&gt; &amp;ndash; You need to be aware of this framework if you want to be serious about analysing tabular data.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.getdbt.com/modeling-marketing-attribution/"&gt;Modeling marketing attribution - Claire Carroll&lt;/a&gt; &amp;ndash; I worked on this problem for a short time at Alan. I definitely would have done a better job if I had read this article first.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.keithschwarz.com/darts-dice-coins/"&gt;Darts, Dice, and Coins: Sampling from a Discrete Distribution - Keith Schwarz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mzucker.github.io/2016/10/11/unprojecting-text-with-ellipses.html"&gt;Unprojecting text with ellipses - Matt Zucker&lt;/a&gt; &amp;ndash; See also &lt;a href="https://mzucker.github.io/2016/08/15/page-dewarping.html"&gt;this article on page dewarping&lt;/a&gt; by the same author.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://dbacl.sourceforge.net/tutorial.html"&gt;Language models, classification and dbacl - Laird A. Breyer&lt;/a&gt; &amp;ndash; Machine learning on text with a UNIX philosophy.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.chriskamphuis.com/2019/03/06/teaching-an-old-dog-a-new-trick.html"&gt;Teaching An Old Dog A New Trick - Chris Kamphuis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ethanrosenthal.com/2020/08/25/optimal-peanut-butter-and-banana-sandwiches/"&gt;Optimal Peanut Butter and Banana Sandwiches - Ethan Rosenthal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007"&gt;The Data Science Hierarchy of Needs - Monica Rogati&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jesperjuul.net/ludologist/2010/06/08/tuesday-changes-everything-a-mathematical-puzzle/"&gt;Tuesday Changes Everything - Jesper Juul&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html"&gt;Doing Named Entity Recognition? Don&amp;rsquo;t optimize for F1 - Christopher Manning&lt;/a&gt; &amp;ndash; A rather niche topic, but well explained.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://archive.ph/A7298#selection-63.0-63.72"&gt;Lessons learned building an ML trading system that turned \$5k into \$200k&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lindeloev.github.io/tests-as-linear/"&gt;Common statistical tests are linear models (or: how to teach stats) - Jonas Kristoffer Lindeløv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://win-vector.com/2024/12/19/kelly-cant-fail/"&gt;Kelly Can&amp;rsquo;t Fail - John Mount&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="analytics-engineering"&gt;Analytics engineering&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://workingbackwards.com/concepts/input-metrics/"&gt;Input metrics &amp;amp; weekly business review - Working backwards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dataduel.co/if-spreadsheets-are-eternal-are-bi-tools-transitory/"&gt;If spreadsheets are eternal, are BI tools transitory?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/"&gt;Dimensional Modeling Techniques&lt;/a&gt; &amp;ndash; old but gold.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=itR-eFJ3voY"&gt;Motif&lt;/a&gt; &amp;ndash; they had cool ideas about funnel analytics before they sadly shut down.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pair-code.github.io/facets/"&gt;Facets&lt;/a&gt; &amp;ndash; it&amp;rsquo;s too bad this project is not actively maintained anymore, it had a lot of potential.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crossfilter.github.io/crossfilter/"&gt;Crossfilter&lt;/a&gt; &amp;ndash; I like these tools that let you slice and dice data in the browser.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="educational-material"&gt;Educational material&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tutorial.math.lamar.edu/"&gt;Paul&amp;rsquo;s Online Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mathopolis.com/questions/quizzes.php"&gt;Mathematics Quizzes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://andrewkchan.dev/posts/fire.html"&gt;Simulating Fluids, Fire, and Smoke in Real-Time - Andrew Chan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ciechanow.ski/gps/"&gt;GPS - Bartosz Ciechanowski&lt;/a&gt; &amp;ndash; &lt;a href="https://ciechanow.ski/archives/"&gt;all his articles&lt;/a&gt; are great.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grid.space/stem/"&gt;Grid.Space&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="data-engineering"&gt;Data engineering&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://future.com/emerging-architectures-modern-data-infrastructure/"&gt;Emerging Architectures for Modern Data Infrastructure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://technically.dev/posts/what-your-data-team-is-using"&gt;What your data team is using: the analytics stack - Technically&lt;/a&gt; &amp;ndash; Another solid article to understand what an analytics stack looks like in 2021.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1606.03966.pdf"&gt;Multiworld Testing Decision Service: A System for Experimentation, Learning, And Decision-Making&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf"&gt;Machine Learning: The High-Interest Credit Card of Technical Debt - Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/articles/cd4ml.html"&gt;Continuous Delivery for Machine Learning - Martin Fowler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf"&gt;Hidden Technical Debt in Machine Learning Systems - Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying"&gt;The Log: What every software engineer should know about real-time data&amp;rsquo;s unifying abstraction - Jay Kreps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html"&gt;Command-line Tools can be 235x Faster than your Hadoop Cluster - Adam Drake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;Git scraping, the five minute lightning talk - Simon Willison&lt;/a&gt; &amp;ndash; I wish I had thought about this first!&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gentlydownthe.stream/"&gt;Gently down the stream - Mitch Seymour&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/"&gt;Turning the database inside-out with Apache Samza&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pages.cs.wisc.edu/~yxy/cs839-s20/papers/snowflake.pdf"&gt;The Snowflake Elastic Data Warehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://timelydataflow.github.io/differential-dataflow/"&gt;Differential Dataflow&lt;/a&gt; &amp;ndash; also see the &lt;a href="https://dl.acm.org/doi/pdf/10.1145/2517349.2522738"&gt;Naiad&lt;/a&gt; paper&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lamport.azurewebsites.net/pubs/time-clocks.pdf"&gt;Time, Clocks, and the Ordering of Events in a Distributed System - Leslie Lamport&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://howqueryengineswork.com/00-introduction.html"&gt;How Query Engines Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modal.com/blog/analytics-stack"&gt;Building a cost-effective analytics stack with Modal, dlt, and dbt&lt;/a&gt; &amp;ndash; prime example of what a modern analytics stack looks like in late 2024.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://locallyoptimistic.com/post/data-warehouse-sla-p1/#:~:text=Yes%2C%20if%20you%20want%20to,and%20it%20will%20be%20accurate."&gt;Should Your Data Warehouse Have an SLA?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/"&gt;Dimensional Modeling Techniques - Kimball Group&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gendignoux.com/blog/2025/03/03/rust-interning-2000x.html"&gt;The power of interning: making a time series database 2000x smaller in Rust - Guillaume Endignoux&lt;/a&gt; &amp;ndash; this guy takes the git scraping pattern really far, I like his taste.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a"&gt;Functional Data Engineering — a modern paradigm for batch data processing&lt;/a&gt; &amp;ndash; I strongly believe in this approach.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://maxhalford.github.io/files/misc/BigQuery%20Best%20Practices.pdf"&gt;BigQuery Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rmoff.net/2026/02/19/ten-years-late-to-the-dbt-party-duckdb-edition/"&gt;Ten years late to the dbt party (DuckDB edition)&lt;/a&gt; &amp;ndash; Robin Moffatt graces us with a fresh yet deep dbt primer&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="inspiring-data-analysis"&gt;Inspiring data analysis&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ethanrosenthal.com/2022/04/15/bayesian-rock-climbing/"&gt;Bayesian Rock Climbing Rankings - Ethan Rosenthal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jakevdp.github.io/blog/2014/06/10/is-seattle-really-seeing-an-uptick-in-cycling/"&gt;Is Seattle Really Seeing an Uptick In Cycling? - Jake VanderPlas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.martindaniel.co/roof/index.html"&gt;How we changed our roof and cut 1.5 tons of CO2e - Martin Daniel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/norvig/pytudes/blob/main/ipynb/WWW.ipynb"&gt;WWW: Who Will Win? - Peter Norvig&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mkorostoff.github.io/1-pixel-wealth/"&gt;Wealth shown to scale - Matt Korostoff&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pudding.cool/2017/05/song-repetition/"&gt;Are Pop Lyrics Getting More Repetitive? - Colin Morris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dagster.io/blog/fake-stars"&gt;Tracking the Fake GitHub Star Black Market - Fraser Marlow, Yuhan Luo, Alana Glassco&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pudding.cool/2022/12/yard-sale"&gt;Why the super rich are inevitable - The Pudding&lt;/a&gt; &amp;ndash; Really cool dataviz.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter5_LossFunctions/Ch5_LossFunctions_PyMC3.ipynb#Example:-Kaggle-contest-on-Observing-Dark-World"&gt;Kaggle contest on Observing Dark World - Cam Davidson-Pilon&lt;/a&gt; &amp;ndash; If you&amp;rsquo;re doubtful about the power of Bayesian machine learning, then read this and get mindblown.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.looria.com/reddit"&gt;looria.com/reddit&lt;/a&gt; &amp;ndash; This is a website that aggregates informal product reviews found on Reddit. There&amp;rsquo;s a bunch of cool NLP stuff going on behind the scenes. For instance here&amp;rsquo;s recommendations for &lt;a href="https://looria.com/reddit/cycling/products"&gt;cycling&lt;/a&gt; and &lt;a href="https://looria.com/reddit/campinggear/products"&gt;camping gear&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nomadlist.com/digital-nomad-statistics#unattractive"&gt;Who is the average nomad?&lt;/a&gt; &amp;ndash; feeds from NomadList live data.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://everynoise.com/engenremap.html"&gt;Every Noise at Once&lt;/a&gt; &amp;ndash; uses PCA to map music genres.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ethanzuckerman.com/2023/12/22/how-big-is-youtube/"&gt;How Big is YouTube? - Ethan Zuckerman&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://uwdata.github.io/mosaic-framework-example/nyc-taxi-rides"&gt;NYC Taxi Rides viz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mayerowitz.io/blog/mario-meets-pareto"&gt;Mario meets Pareto - Antoine Mayerowitz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.washingtonpost.com/climate-environment/interactive/2024/how-accurate-is-the-weather-forecast/"&gt;We mapped weather forecast accuracy across the U.S. Look up your city&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://storymaps.arcgis.com/stories/41d4bd6029044afbb1b9ad805a4731d8"&gt;Resurfacing the past&lt;/a&gt; - a madlad decides to pinpoint all the ships that sank during WWII.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jmspae.se/write-ups/kebabs-train-stations/"&gt;The closer to the train station, the worse the kebab - James Pae&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://annas-archive.org/blog/all-isbns-winners.html"&gt;Winners of the $10,000 ISBN visualization bounty - Anna&amp;rsquo;s Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.engora.com/2025/07/vanishing-home-field-advantage-in.html?m=1"&gt;Vanishing home field advantage in English football&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jameshard.ing/pilot"&gt;I am an Airbus A350 Pilot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://depuis1958.fr/"&gt;Depuis 1958&lt;/a&gt; &amp;ndash; &lt;a href="https://fivethirtyeight.com/methodology/how-our-pollster-ratings-work/"&gt;538&lt;/a&gt; for French presidential elections&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="sustainability"&gt;Sustainability&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.donellameadows.org/wp-content/userfiles/Limits-to-Growth-digital-scan-version.pdf"&gt;The Limits to Growth - Donella Meadows&lt;/a&gt; &amp;ndash; it&amp;rsquo;s not very often that a paper is so accurate in its predictions.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gstatic.com/gumdrop/sustainability/google-2024-consumer-hardware-carbon-reduction.pdf"&gt;Consumer Hardware Carbon Reduction Guide - Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://meaningfulsustainabilityjobs.blog/2024/09/29/the-lca-paradox/"&gt;The LCA paradox - Frida Røyne&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://video.ethz.ch/events/lca/2023/autumn/83rd/9094196a-4300-4494-87d3-c5e872ad8e62.html"&gt;Scope 3 Data in LCA of organisations Betw^een Simplification, Overwhelming &amp;amp; Greenwashing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://climatetrace.org/explore"&gt;Climate TRACE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jancovici.com/en/video/40-min-can-the-economy-become-fossil-free/"&gt;Can the economy become fossil free? - Jean-Marc Jancovici&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orionmagazine.org/article/forget-shorter-showers/"&gt;Forget Shorter Showers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ineteconomics.org/perspectives/blog/growth-with-decarbonization-is-not-an-oxymoron"&gt;Conditional Optimism: Economic Perspectives on Deep Decarbonization - Michael Grubb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tmrow.com/climatechange/"&gt;Climate Change: a practical guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://maxhalford.github.io/files/misc/The%20Computational%20Structure%20of%20Life%20Cycle%20Assessment.pdf"&gt;The computational structure of life cycle assessment - Reinout Heijungs &amp;amp; Sangwon Suh&lt;/a&gt; &amp;ndash; good introduction to LCA algorithms for technical people.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getecodex.com/"&gt;Ecodex&lt;/a&gt; &amp;ndash; homogenous database of emission factors&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="data-sources"&gt;Data sources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://apirank.dev/"&gt;API Rank&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://inspectelement.org/apis.html"&gt;Finding Undocumented APIs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/bigquery?p=bigquery-public-data&amp;amp;d=new_york&amp;amp;t=citibike_trips&amp;amp;page=table&amp;amp;project=orc-orc&amp;amp;ws=!1m4!1m3!3m2!1sbigquery-public-data!2samerica_health_rankings"&gt;bigquery-public-data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/bigquery?p=bigquery-public-data&amp;amp;d=new_york&amp;amp;t=citibike_trips&amp;amp;page=table&amp;amp;project=orc-orc&amp;amp;ws=!1m4!1m3!3m2!1sfh-bigquery!2sbbqtoronto"&gt;fh-bigquery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://query.wikidata.org/#%23Map%20of%20hospitals%0A%23added%202017-08%0A%23defaultView%3AMap%0ASELECT%20DISTINCT%20%2a%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%2Fwdt%3AP279%2a%20wd%3AQ16917%3B%0A%20%20%20%20%20%20%20%20wdt%3AP625%20%3Fgeo%20.%0A%7D"&gt;Wikidata Query Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nyc.gov/html/dot/html/about/datafeeds.shtml#vision"&gt;New-York City transport data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.securityevaluators.com/reverse-engineering-bumbles-api-a2a0d39b3a87"&gt;Reverse Engineering Bumble’s API&lt;/a&gt; &amp;ndash; a fun/scary API reverse engineering example that worked in 2020&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ccxt/ccxt"&gt;ccxt&lt;/a&gt; &amp;ndash; access cryptocurrency exchanges&amp;rsquo; APIs&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ourworldindata.org/"&gt;Our World in Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://new.mta.info/article/beyond-route-introducing-granular-mta-bus-speed-data"&gt;Beyond the route: Introducing granular MTA bus speed data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://csvbase.com/"&gt;csvbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cia-factbook-archive.fly.dev/"&gt;CIA World Factbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nsppolls/nsppolls"&gt;nsppolls&lt;/a&gt; &amp;ndash; French election polls&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8421DX"&gt;Replication Data for: Election Polling Errors across Time and Space&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/fivethirtyeight/data/tree/master"&gt;fivethirtyeight/data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="tables-is-all-you-need"&gt;Tables is all you need&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://x.com/ash_uxi/status/1775606725795610725?s=46"&gt;I&amp;rsquo;ve spent years perfecting table design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/brotzky_/status/1775553206103474354"&gt;Unlimited columns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://posit-dev.github.io/great-tables/blog/design-philosophy/"&gt;The Design Philosophy of Great Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datawrapper.de/tables"&gt;Create responsive tables with Datawrapper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="food-for-thought"&gt;Food for thought&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://neilkakkar.com/sapiens.html"&gt;If Sapiens were a blog post - Neil Kakkar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://patrickcollison.com/fast"&gt;Fast - Patrick Collison&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mcfunley.com/choose-boring-technology"&gt;Choose Boring Technology - Dan McKinley&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sriramk.com/memos"&gt;Memos - Sriram Krishnan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lynalden.com/what-is-money/"&gt;What is Money, Anyway? - Lyn Alden&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.charliechaplin.com/en/articles/29-the-final-speech-from-the-great-dictator-"&gt;The Final Speech from The Great Dictator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theguardian.com/news/2024/jan/16/the-tyranny-of-the-algorithm-why-every-coffee-shop-looks-the-same"&gt;The tyranny of the algorithm: why every coffee shop looks the same&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lithub.com/against-disruption-on-the-bulletpointization-of-books/"&gt;Against Disruption: On the Bulletpointization of Books&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://catb.org/jargon/html/H/hacker.html"&gt;hacker definition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ssp.sh/blog/finding-flow/"&gt;Finding Flow: Escaping Digital Distractions Through Deep Work and Slow Living - Simon Späti&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://earlyretirementnow.com/safe-withdrawal-rate-series/"&gt;The Safe Withdrawal Rate Series&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://calteches.library.caltech.edu/51/2/CargoCult.htm"&gt;Cargo Cult Science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/"&gt;Falling Into The Pit of Success&lt;/a&gt; &amp;ndash; also known as &lt;a href="https://en.wikipedia.org/wiki/Poka-yoke"&gt;poka-yoke&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://quarter--mile.com/You-Could-Just-Choose-Optimism"&gt;You Could Just Choose Optimism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tombrady.com/posts/your-actions-reflect-your-priorities"&gt;Your actions reflect your priorities&lt;/a&gt; &amp;ndash; Tom Brady&amp;rsquo;s take on it being about the process, not the outcome.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://andrewkelley.me/post/renting-is-for-suckers.html"&gt;Renting is for Suckers&lt;/a&gt; &amp;ndash; good arguments as to why you shouldn&amp;rsquo;t default to using cloud services.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.joanwestenberg.com/p/i-deleted-my-second-brain"&gt;I Deleted My Second Brain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=X2wLP0izeJE"&gt;Ira Glass on Storytelling&lt;/a&gt; &amp;ndash; good taste and being critical of one&amp;rsquo;s own work is the key to becoming a better creator.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lizandmollie.com/burnout-profile"&gt;Burnout profile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.highagency.com/"&gt;High Agency In 30 Minutes - George Mack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=36GT2zI8lVA"&gt;Richard Feynman on the question &lt;em&gt;Why?&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mitxela.com/rants/thinflation"&gt;Thinflation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="sql"&gt;SQL&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://quip.com/2gwZArKuWk7W"&gt;The Best Medium-Hard Data Analyst SQL Interview Questions&lt;/a&gt; &amp;ndash; There are some great interactive SQL tutorials out there, such as &lt;a href="https://sqlbolt.com/"&gt;SQLBolt&lt;/a&gt; and &lt;a href="https://selectstarsql.com/"&gt;Select Star SQL&lt;/a&gt;, but this one takes the cake due to its complexity. &lt;a href="https://count.co/canvas/pB7iGb4yyi2"&gt;The Ultimate SQL guide&lt;/a&gt; is a comprehensive guide made with Count.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ian.sh/tsa"&gt;Bypassing airport security via SQL injection&lt;/a&gt; &amp;ndash; A &lt;del&gt;fun&lt;/del&gt; dangerous example of what can happen when you don&amp;rsquo;t sanitize your inputs.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sqlook.com/"&gt;SQLook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sqlnoir.com/"&gt;SQL Noir&lt;/a&gt; &amp;ndash; this hits the sweet spot between two things I enjoy. Try listening &lt;a href="https://www.youtube.com/watch?v=1vxl0Yd7TOY&amp;amp;ab_channel=BOHREN%26DERCLUBOFGORE"&gt;Bohren &amp;amp; Der Club Of Gore&lt;/a&gt; while doing the exercises.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="programming"&gt;Programming&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://danuker.go.ro/the-grand-unified-theory-of-software-architecture.html"&gt;The Grand Unified Theory of Software Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kalzumeus.com/2011/10/28/dont-call-yourself-a-programmer/"&gt;Don&amp;rsquo;t Call Yourself A Programmer, And Other Career Advice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://users.ece.utexas.edu/~adnan/pike.html"&gt;Rules of Programming - Rob Pike&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/naver/lispe/wiki/6.16-Why-Lisp"&gt;Why Lisp?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="writing"&gt;Writing&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cs.columbia.edu/~hgs/etc/writing-bugs.html"&gt;Common Bugs in Writing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.signalsblog.ca/right-turn-cormac-mccarthy-and-me-on-how-to-write-a-good-science-paper/"&gt;Novelist Cormac McCarthy’s tips on how to write a great science paper - Savage and Yeh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fermatslibrary.com/p/e2e6484d#email-newsletter"&gt;How to Build an Economic Model in Your Spare Time - Hal R. Varian&lt;/a&gt; &amp;ndash; The academic wisdom in this article goes beyond the world of economics.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://beancount.github.io/docs/the_double_entry_counting_method.html#double-entry-bookkeeping"&gt;The Double-Entry Counting Method&lt;/a&gt; - Great example of documenting a technical concept.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gael-varoquaux.info/programming/technical-discussions-are-hard-a-few-tips.html"&gt;Technical discussions are hard; a few tips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themarginalian.org/2023/09/20/octavia-butler-advice-on-writing/"&gt;Octavia Butler’s Advice on Writing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lithub.com/writing-advice-and-literary-wisdom-from-the-great-e-b-white/"&gt;Writing Advice and Literary Wisdom from the Great E.B. White&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="web-development"&gt;Web development&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://anthonyhobday.com/sideprojects/saferules/"&gt;Visual design rules you can safely follow every time - Anthony Hobday&lt;/a&gt; &amp;ndash; Good follow-up to &lt;a href="https://jgthms.com/web-design-in-4-minutes/"&gt;Web Design in 4 minutes&lt;/a&gt; by Jeremy Thomas.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://practicaltypography.com/typography-in-ten-minutes.html"&gt;Typography in ten minutes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alpinejs.dev/"&gt;alpine.js&lt;/a&gt; &amp;ndash; I usually go to Vue.js for web dev, but my brother made me realize alpine.js is a great alternative for small projects.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hot.page/"&gt;Hot Page&lt;/a&gt; &amp;ndash; looks like a good idea to create a landing page.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://uchu.style/"&gt;uchū&lt;/a&gt; &amp;ndash; decent default color palettes.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.val.town/"&gt;Val Town&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mmm.page/"&gt;mmm.page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alto.so/"&gt;Alto&lt;/a&gt; &amp;ndash; turn Apple Notes into a website.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pontus.granstrom.me/scrappy/"&gt;Scrappy&lt;/a&gt; &amp;ndash; loads of links and good ideas in this one.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.hyperclay.com/docs/example-apps/"&gt;Hyperclay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.gingerbeardman.com/2025/10/11/how-to-tame-a-user-interface-using-a-spreadsheet/"&gt;How to tame a user interface using a spreadsheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="building-a-product"&gt;Building a product&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://vimeo.com/842437838"&gt;Beautiful Polished Rocks - Steve Jobs&lt;/a&gt; &amp;ndash; the best metaphor for product design I&amp;rsquo;ve ever heard.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/chitchcock/1281611"&gt;Stevey&amp;rsquo;s Google Platforms Rant&lt;/a&gt; &amp;ndash; insights about product design at GAFAs.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/StartupArchive_/status/1736006653722362216?s=20"&gt;Jeff Bezos on the disagree and commit principle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://handbook.duolingo.com/"&gt;The Duolinguo Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lennysnewsletter.com/p/product-sense"&gt;How to develop product sense&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.thebrowser.company/values/"&gt;The Browser Company&amp;rsquo;s company values&lt;/a&gt; &amp;ndash; they made the Arc browser, which rekindled the browser wars.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=hJbwyN4ZoCg&amp;amp;t=174s"&gt;Secrets to Optimal Client Service&lt;/a&gt; &amp;ndash; actionable advice from a Goldman Sachs bigwig.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="llms"&gt;LLMs&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://interconnected.org/home/2023/03/16/singularity"&gt;The surprising ease and effectiveness of AI in a loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geoffreylitt.com/2025/07/27/enough-ai-copilots-we-need-ai-huds"&gt;Enough AI copilots! We need AI HUDs&lt;/a&gt; &amp;ndash; funny how someone had the right idea 33 years ago.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ampcode.com/how-to-build-an-agent"&gt;How to Build an Agent - Thorsten Ball&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="i-dont-have-a-clue-but-it-looks-cool"&gt;I don&amp;rsquo;t have a clue but it looks cool&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.gregegan.net/SCIENCE/Superpermutations/Superpermutations.html"&gt;Superpermutations - Greg Egan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://slehar.wordpress.com/2014/03/18/clifford-algebra-a-visual-introduction/"&gt;Clifford Algebra: A visual introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explainextended.com/2023/12/31/happy-new-year-15/"&gt;GPT in 500 lines of SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2501.00536"&gt;Phase behavior of Cacio e Pepe sauce&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spidermonkey.dev/blog/2025/10/28/iongraph-web.html"&gt;Who needs Graphviz when you can build it yourself?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wichm.home.xs4all.nl/filmsize.html"&gt;More than one hundred years of film sizes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://timvieira.github.io/table-theorem/"&gt;The Wobbly Table Theorem&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="eye-candy"&gt;Eye candy&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tylerxhobbs.com/"&gt;Tyler Hobbs&lt;/a&gt; &amp;ndash; The god of generative arts.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://imgur.com/gallery/AZvIf"&gt;Some Jean Giraud stuff&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mauromartins.com/"&gt;Mauro Martins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://artof01.com/vrellis/works/knit.html"&gt;A new way to knit by Petros Vrellis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.artnome.com/news/2018/8/8/generative-art-finds-its-prodigy"&gt;A fascinating article about Manolo Gamboa Naon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ukiyo-e.org/"&gt;Some Ukiyo-e&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://turtletoy.net/"&gt;Turtletoy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dwitter.net/"&gt;Dwitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://generated.space/"&gt;generated.space&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://essenmitsosse.de/pixel/"&gt;Pixel art by Marcus Blättermann&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.the42.ie/bbc-nick-barnes-football-notes-2111888-May2015/"&gt;Nick Barnes&amp;rsquo; football bible&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.simonstalenhag.se/"&gt;Simon Stålenhag&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.iamag.co/the-art-of-syd-mead/"&gt;Syd Mead&lt;/a&gt; (who worked on Blade Runner)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.michaelfogleman.com/"&gt;Michael Fogleman&amp;rsquo;s blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://imgur.com/user/imadreamwalker/posts"&gt;World of Warcraft art by Dreamwalker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.artcena.fr/artcena-tv/hors-sol-de-akoreacro"&gt;&lt;em&gt;Hors-sol&lt;/em&gt; de AKOREACRO&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericaofanderson.com/"&gt;Erica Anderson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jacksharp.co.uk/"&gt;Jack Sharp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://archillect.com/"&gt;Archillect&lt;/a&gt; &amp;ndash; An AI that curates cool pictures, how awesome is that?&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aem1k.com/"&gt;Martin Kleppe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zoomquilt.org/"&gt;Zoomquilt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lossfunctions.tumblr.com/"&gt;lossfunctions.tumblr.com&lt;/a&gt; &amp;ndash; Yes, that&amp;rsquo;s a thing.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://charlesbroskoski.com/_/view.php?id=shirts-of-peter-norvig"&gt;Shirts of Peter Norvig&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.creamelectricart.com/united-airlines/"&gt;United Airlines ads by Cream Electric Art&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://miniature-calendar.com/"&gt;Miniature Calendar by Tatsuya Tanaka&lt;/a&gt; &amp;ndash; Broccolis that look like trees, staples that look like workout benches&amp;hellip; I love it!&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sandspiel.club/"&gt;sandspiel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.jorgejacinto.com/"&gt;Jorge Jacinto&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mxgmn/WaveFunctionCollapse"&gt;WaveFunctionCollapse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://owenpomery.com/cabin"&gt;Owen D. Pomery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://singularityhub.com/2012/10/15/19th-century-french-artists-predicted-the-world-of-the-future-in-this-series-of-postcards/"&gt;19th Century French Artists Predicted The World Of The Future In This Series Of Postcards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tomcritchlow.com/2023/04/03/blog-maps/"&gt;Blog maps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/deck_two/"&gt;Decktwo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eycndy.com/"&gt;eycndy.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.fmwconcepts.com/imagemagick/index.php"&gt;Fred&amp;rsquo;s ImageMagick Scripts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://surma.dev/things/ditherpunk/"&gt;Ditherpunk - Surma&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://math.stackexchange.com/questions/733754/visually-stunning-math-concepts-which-are-easy-to-explain"&gt;Visually stunning math concepts which are easy to explain - StackExchange&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theguardian.com/artanddesign/gallery/2022/nov/22/cars-bars-and-burger-joints-william-egglestons-iconic-america-in-pictures"&gt;Cars, bars and burger joints: William Eggleston’s iconic America – in pictures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spectrolite.app/"&gt;Spectrolite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ramen.haus/"&gt;RamenHaus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sportsnetusa.net/"&gt;SportsNetUSA.net&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://readcomiconline.li/"&gt;readcomiconline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mubi.com/en/fr/showing"&gt;MUBI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/lavidaenvinetas"&gt;La vida en viñetas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://leeoniya.github.io/uPlot/demos/time-periods.html"&gt;Plotting 3 years of hourly data in 150ms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://damoonrashidi.me/articles/flow-field-methods"&gt;What I&amp;rsquo;ve learned about flow fields so far&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dear-data.com/theproject"&gt;Dear Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.beautifulpublicdata.com/faa-aviation-maps/"&gt;FAA Aviation Maps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://floor796.com"&gt;Floor796&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artsandculture.google.com/entity/john-martin/m0383lx?hl=en"&gt;John Martin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marimekko.com/eu_en/maripedia/patterns"&gt;marimekko.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theparisreview.org/blog/2017/02/15/rhythmical-lines/"&gt;Wacław Szpakowski&amp;rsquo;s rhythmical lines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vimeo.com/1023120442"&gt;10,946: a Year-Long Post-It Note Animation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://suberic.net/~dmm/projects/mystical/README.html"&gt;Mystical&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themarginalian.org/2022/01/27/virginia-frances-sterrett-old-french-fairy-tales/"&gt;Teenage Artist Virginia Frances Sterrett&amp;rsquo;s Hauntingly Beautiful Century-Old Dreamscapes for French Fairy Tales&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.decryption.net.au/posts/macpaint.html"&gt;MacPaint Art From The Mid-80s Still Looks Great Today&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.beautifulpublicdata.com/"&gt;Beautiful Public Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.guybuffet.com/gallery-image/Limited-Edition-Prints/G0000gRfwWbFR9iw/I0000XT5fhOfxSBM"&gt;Making of a Great Martini&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://choishine.com/Giants.html"&gt;The Land of Giants™&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="pretty-websites"&gt;Pretty websites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://motherduck.com/"&gt;MotherDuck: Data Infrastructure and Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.operationalanalytics.club/"&gt;Welcome to the Operational Analytics Club 👋&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.snaplet.dev/"&gt;Snaplet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thenounproject.com/"&gt;Noun Project: Free Icons &amp;amp; Stock Photos for Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://studio.benthos.dev/"&gt;Benthos Studio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="einsteigenbitte.eu"&gt;Claire Glanois&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bannerbear.com/"&gt;API for Automated Image and Video Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.spencerchang.me/"&gt;𝚜𝚙𝚎𝚗𝚌𝚎𝚛𝚌𝚑𝚊𝚗𝚐.𝚖𝚎 𝚒𝚜 𝚠𝚊𝚗𝚍𝚎𝚛𝚒𝚗𝚐&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://digitalgarden.hypha.coop/maintenance"&gt;Maintenance 🌱 Digital Garden&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://equals.com/"&gt;Equals | The fastest way for startups to do any analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://maki.vc/"&gt;Maki.vc | European Venture Capital Firm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://harlequin.sh/"&gt;Harlequin: The DuckDB IDE for Your Terminal.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rsms.me/inter/"&gt;Inter font family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neatnik.net/"&gt;Neatnik&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thecreativeindependent.com/"&gt;The Creative Independent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.bay12games.com/dwarves/"&gt;Bay 12 Games: Dwarf Fortress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.browserbear.com/"&gt;Browserbear&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://localghost.dev/blog/this-page-is-under-construction/"&gt;This page is under construction&lt;/a&gt; &amp;ndash; the list at the end is great&lt;/li&gt;
&lt;li&gt;&lt;a href="https://usgraphics.com/products/berkeley-mono"&gt;Berkeley Mono™&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gatyou.studio/"&gt;gatyou.studio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yeet.cx/compose"&gt;yeet.cx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.helenazhang.com/"&gt;Helena Zhang&lt;/a&gt; &amp;ndash; made &lt;a href="https://departuremono.com/"&gt;Departure Mono&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hey.milo.gg/"&gt;hey.milo.gg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mcmaster.com/"&gt;McMaster-Carr&lt;/a&gt; &amp;ndash; every online store should feel more like this.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://delphi.tools/"&gt;delphi.tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marianopascual.me/"&gt;Mariano Pascual&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://loack.me/"&gt;loackme&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I like these retrocool websites:&lt;/p&gt;</description></item></channel></rss>