<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Dashdoc - Medium]]></title>
        <description><![CDATA[Behind the scenes at Dashdoc — building tools for the transportation industry - Medium]]></description>
        <link>https://medium.com/dashdoc?source=rss----4cbc68d7a693---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Dashdoc - Medium</title>
            <link>https://medium.com/dashdoc?source=rss----4cbc68d7a693---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 23 Jun 2026 07:44:01 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/dashdoc" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How we hire engineers at Dashdoc]]></title>
            <link>https://medium.com/dashdoc/how-we-hire-engineers-at-dashdoc-9c0b5864971b?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/9c0b5864971b</guid>
            <category><![CDATA[dashdoc]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[hiring]]></category>
            <dc:creator><![CDATA[Axel Haddad]]></dc:creator>
            <pubDate>Fri, 29 May 2026 12:39:34 GMT</pubDate>
            <atom:updated>2026-05-29T12:39:35.464Z</atom:updated>
            <content:encoded><![CDATA[<p><em>A straight answer to ‘what happens if I apply?’, the rounds in order, and what we look for in adaptable, product-oriented engineers.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H2JgkhPCzFaa4V63Vi6rDQ.jpeg" /></figure><p>Hiring is one of the ways Dashdoc becomes itself. We look for a <strong>mutual fit</strong>: people who share our values and way of working, and who will also leave their own mark on the team.</p><p>Dashdoc is the transportation management product used by thousands of carriers and shippers in Europe. Our engineering team is <strong>40+ people</strong>. This post is simply how we hire today: what you’ll go through, in order, and what we care about.</p><p><strong>Heads-up:</strong> We tweak this process over time. What you read here is our current version, not carved in stone.</p><p><strong>Order and pace:</strong> The sequence below is the usual one, but it can shift a bit depending on calendars (yours and ours). We don’t have a single fixed “standard duration”: it depends on how fast we can book slots and how deep we need to go. We aim for <strong>steady progress</strong> once we’ve started: concrete dates, keeping back-and-forth over email to a minimum, and a clear “where we are” when you ask.</p><p><strong>Selection process:</strong> So far, every CV is read by a human, no automated filtering. We can’t commit to giving personal feedback to every applicant, but every application is genuinely read. When we don’t move forward with someone, it’s usually less a matter of skill level than of fit for the specific role we’re hiring on.</p><h3><strong>The rounds</strong></h3><p>Every step is a <strong>two-way conversation</strong>: we try to be clear about what that slot is for, you can ask us anything, and nothing here is meant to trip you up. 
<strong>We leave time in each session for your questions</strong> on the role, the stack, how we work, or what comes next.</p><h4><strong>Why these rounds</strong></h4><p>We want our interviews to <strong>reflect the actual work</strong>, not reward people for being good at interviews. Day to day, our engineers <strong>read unfamiliar code</strong>, <strong>debug real systems</strong>, <strong>shrink big problems into shippable slices</strong>, and <strong>talk about trade-offs</strong> with people who aren’t engineers. The mix below is what we’ve found maps best to that, including the parts that are tiring for everyone, because hiring without enough depth tends to hurt both sides later.</p><h4><strong>1. Screening call, ~30 min, remote 💬</strong></h4><p>We use the same baseline questions for everyone so it stays fair, and so we don’t forget the basics.</p><p>We want to know your path, what you’re looking for, and what drives you. <strong>You can ask us anything</strong>, we’ll be straight about what’s great here and what’s hard.</p><p>This is usually conducted by an engineering manager, and it’s a good opportunity to ask us anything you want to know about the role, the company, the team, the stack, etc. You’ll be asked about your salary expectations, your availability, your notice period, etc.</p><p><strong>What helps on your side:</strong> you don’t need to “study.” A rough idea of what you’ve done recently, what you want next, and <strong>why Dashdoc</strong> (even if that reason is still fuzzy) would work for you. If you’re unsure about seniority, team, or scope, don’t be afraid to say it!</p><h4><strong>2. Technical test, ~45 min to 1 h, remote 🐛</strong></h4><p>No whiteboard. No LeetCode.</p><p>You get a <strong>full-stack debugging</strong> task on a small, messy legacy codebase. 
The goal is for it to be closer to “joining a team” than to a textbook exercise (our codebase is tidier than the exercise, for what it’s worth 😅).</p><p>We care how you <strong>find your way</strong> in unfamiliar code: how you read it, what you ask, what you’d change and why. We’re not testing whether you’ve memorized Django or React. We are used to interviewing people with no experience with our stack, and we’ll help with the details if you need it. What interests us is how you reason about the way a typical SaaS (frontend + API + how they talk) fits together.</p><p>The bug itself isn’t huge; what we care about is how you <strong>navigate unknown code</strong> calmly and methodically.</p><p><strong>Format:</strong> it’s a <strong>live</strong> session with engineers from the team. You share your screen, explore the repo, and talk us through what you’re seeing. We’re not grading your typing speed; we’re watching <strong>how you structure the search</strong> (where you look first, what you verify, how you narrow it down). If you’re stuck, say so! We’d rather see how you unblock with help than watch anyone pretend.</p><p><strong>What helps:</strong> a quiet slot and a good internet connection. Let’s be honest: the bug itself is usually not the hardest part, yet this is the most selective step, because it shows how clearly someone can build a mental model of unfamiliar code and follow the execution path under time pressure. My advice: don’t rush, make no assumptions about the code, and just follow the execution path and how the frontend and backend interact.</p><h4><strong>3. Product &amp; tech shaping, ~1.5 h, remote or on-site 🎯</strong></h4><p>This part is a bit specific to us, and we like it.</p><p>We imagine you’re shipping an MVP for a product idea. Together we <strong>shape scope</strong>: what is the problem we’re really trying to solve, who is it for, what ships first, what waits, how you’d structure it, what you’d bet on technically.</p><p>There’s <strong>no model answer</strong>. 
We want to see how you think with us: product sense, trade-offs, and how you explain choices. At Dashdoc, engineers <strong>own features end to end</strong> (problem → ship); this session is a compressed version of that kind of discussion.</p><p><strong>Format:</strong> we bring a <strong>short written brief</strong>; the rest is discussion. Whiteboard, notes, bullet lists, whatever helps you think. It’s normal to <strong>challenge the premise</strong>, to ask “who is this for?”, or to change your mind once we add a constraint. We’re not scoring a slide deck; we’re watching how you <strong>collaborate under uncertainty</strong>.</p><h4><strong>4. The “Who” interview, ~45 min to 1 h, remote or on-site 📖</strong></h4><p>Loosely based on the <a href="https://whothebook.com/"><em>Who </em>approach</a>: we <strong>walk through your story</strong>. What you built, what you’re proud of, what you’d do differently, how you grew.</p><p>We’re listening for <strong>how you work</strong>: ownership, honesty about mistakes, curiosity.</p><p><strong>What to expect: </strong>we’ll spend time on <strong>a few chapters</strong> of your career, not every bullet on your CV. Come with <strong>2–3 examples</strong> you’re happy to go deep on (a messy project, a conflict you handled, a decision you’d revise). Specific beats abstract every time.</p><h4><strong>5. Reference calls 🤝</strong></h4><p>We’ll ask for a few people you’ve worked with (peers, managers, etc.) and have short chats with them. It rounds out what interviews can’t always show: <strong>strengths</strong>, <strong>how you collaborate</strong>, and where you’re still growing.</p><p>We usually ask for references <strong>when we’re seriously considering an offer</strong>, not as a generic paperwork step at the start. We’ll tell you what we’re trying to learn so you can pick people who’ve actually seen you work.</p><h4><strong>6. 
Meet a founder, ~30 min, remote ☕</strong></h4><p>Informal chat with our CEO or CTO; where the company is headed on our side, what pulls you in on yours. A last mutual check-in before we hopefully work together.</p><p><strong>Why it’s there:</strong> at our size, founders are still close to strategy and culture. This isn’t a veto round on trivia; it’s a chance for <strong>both sides</strong> to sanity-check “does this still feel right?”</p><h4><strong>Things we want you to know 💛</strong></h4><p><strong>You can stop anytime, so can we.</strong></p><p>If Dashdoc isn’t for you, say so; we’ll take it well. If we don’t continue, we’ll tell you <strong>as soon as we can</strong>.</p><p><strong>We try not to drag things out.</strong></p><p>Long ghosting helps nobody. We want enough time to decide properly, without losing momentum.</p><p><strong>We hire the way we’d want to be hired.</strong></p><p>Interviewers know the brief: be <strong>explicit</strong> about what each round is for, give you room to think out loud, and answer your questions honestly. We’re not playing roles or setting ambushes, we’re checking whether working together would make sense on both sides.</p><p>That lines up with how we talk about our values day to day. <strong>Care</strong> means feedback you can use and a <strong>no</strong> that comes with clarity, not silence. <strong>Ambition </strong>and <strong>Passion</strong> show up in serious, engaged conversations, not in pressure tricks. <strong>Speed</strong> is about <strong>respecting your time</strong>: keeping the process moving and not leaving you guessing for weeks.</p><p><strong>You don’t need a logistics background.</strong> Some of us knew freight before joining; most didn’t. You just need to be curious about the domain!</p><h4><strong>We’re still improving this</strong></h4><p>Like any product, our hiring process gets <strong>feedback and iterations</strong>. 
We ask candidates and interviewers what felt fair, clear, and respectful.</p><p>If something was confusing or unnecessarily heavy, we want to know! It helps us fix the process.</p><p>Been through it? We’d love your honest take. Thinking of applying? <strong>Hope this helps</strong> you know what to expect.</p><p>If helping reshape how transport runs sounds like your kind of problem, <a href="https://dashdoc.welcomekit.co/">we’d love to hear from you</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9c0b5864971b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/how-we-hire-engineers-at-dashdoc-9c0b5864971b">How we hire engineers at Dashdoc</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Build Software at Dashdoc]]></title>
            <link>https://medium.com/dashdoc/how-we-build-software-at-dashdoc-78d2685cc40a?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/78d2685cc40a</guid>
            <category><![CDATA[software-architecture]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[Axel Haddad]]></dc:creator>
            <pubDate>Fri, 29 Aug 2025 09:50:57 GMT</pubDate>
            <atom:updated>2026-05-29T12:47:03.826Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Balancing speed, clarity, and care in software development</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3b9uZw1aXnsyGhHXo1NoIA.jpeg" /></figure><p>When we started Dashdoc more than eight years ago, our goal was simple: make transport management easier and more transparent. What began as a small product has grown into a platform used by thousands of carriers and shippers across Europe (and soon the US!). Along the way, our engineering team grew to more than 30 developers, and with that growth came lessons, false starts, and a clearer sense of how we want to work.</p><p>Our values: <strong>Care, Ambition, Passion, and Speed</strong>, shape both our product and our engineering culture. They guide how we ship, how we collaborate, and how we grow. Here’s a look at what that means in practice.</p><h3>Shipping with purpose, not just speed 🚀</h3><p>In the early days, speed often meant rushing features out the door. We moved quickly, but we sometimes found ourselves rewriting the same code weeks later. Those moments taught us an important lesson: <strong><em>speed only matters if it’s sustainable</em></strong>.</p><p>Now, we focus on shipping with purpose. That means releasing in small, frequent increments instead of large batches, keeping risk low and feedback loops short. It means choosing pragmatic solutions that solve today’s problems <strong><em>without creating tomorrow’s bottlenecks</em></strong>. It also means taking technical debt seriously; not as a distraction, but as essential maintenance for long-term velocity.</p><p>Automation is what makes this sustainable. Strong test coverage, type checking, and continuous integration give us confidence. The safety net allows us to move quickly without second-guessing every change.</p><h3>Engineers who think product-first 🧑‍💻</h3><p>At Dashdoc, engineers don’t just write code; <strong><em>they own features end to end</em></strong>. 
We believe the best outcomes happen when developers are close to the “why,” not just the “how.”</p><p>That translates into full-stack ownership. While some of us lean backend or frontend, everyone is encouraged to work across layers — from infrastructure to UI — if that’s what it takes to deliver. Just as importantly, we invest time in <strong><em>understanding the business domain</em></strong>. Transport logistics is complex, and the best technical choices often come from understanding a dispatcher’s day-to-day reality as well as the intricacies of a codebase.</p><h3>Developer experience is team speed ⚡</h3><p>We think of developer experience (DX) as <strong><em>the engine behind both speed and quality</em></strong>. A clunky workflow slows everything down; a smooth one makes shipping fast, safe, and enjoyable.</p><p>This shows up in our codebase, which we want to feel both efficient and pleasant to work in. Clarity matters more than cleverness. Conventions are well defined so no one wastes time guessing patterns.</p><p>That’s why we invest in DX as a first-class priority:</p><ul><li>Clarity over cleverness: code should be easy to read and reason about, even for junior developers. That’s what makes it easy to improve and easy to debug.</li><li>Clear conventions: well-defined patterns and practices keep focus on solving problems, not guessing how to write code.</li><li>Automation everywhere: repetitive tasks belong to scripts and pipelines, not humans.</li></ul><p>A good developer experience pays back in more than happiness. It means fewer bugs, faster delivery, and <strong><em>a team that has the energy to keep improving</em></strong>.</p><h3>Care in collaboration 🤝</h3><p>One of our core values is “Care,” and it shows in how we collaborate. We review code with rigor and empathy. We share knowledge openly, so no expertise stays siloed. 
We balance ambition with pragmatism, aiming for excellence without burning people out.</p><p><strong><em>Pair programming</em></strong> has become central to this. It accelerates learning, improves quality, and keeps remote teammates connected. Some of our best solutions came out of two people sharing a screen, exploring a problem together, and finding a better answer than either would have alone.</p><h3>Growing with ambition 🚀</h3><p>We’re still small compared to the giants of our industry, but our ambition is large: to change how transport is managed across Europe and beyond. To do that, we need an engineering team that grows in skill, scope, and impact.</p><p><strong><em>Growth at Dashdoc isn’t limited to becoming a manager</em></strong>. Some of our strongest contributors are individual engineers who have taken ownership of complex domains, driven architectural changes, or mentored teammates. We want everyone to see their personal development aligned with the company’s growth — whether that path leads to leadership or deep technical expertise.</p><h3>Takeaway ✨</h3><p>Our engineering culture is about <strong>moving fast with purpose, building with a product mindset, <em>and investing in developer experience as a multiplier of both speed and happiness</em></strong>. We collaborate with care, and we grow with ambition.</p><p>The journey hasn’t always been straightforward, but each lesson has made our team stronger. 
As we continue to expand, we’re excited about what comes next — and about shaping the future of transport logistics together.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=78d2685cc40a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/how-we-build-software-at-dashdoc-78d2685cc40a">How We Build Software at Dashdoc</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hunting N+1 errors in our Django code: the implicit queries case.]]></title>
            <link>https://medium.com/dashdoc/hunting-n-1-errors-in-our-django-code-the-implicit-queries-case-711683c98cc4?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/711683c98cc4</guid>
            <category><![CDATA[query-optimization]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[django]]></category>
            <dc:creator><![CDATA[Daniel Barbeau]]></dc:creator>
            <pubDate>Tue, 09 Apr 2024 07:55:45 GMT</pubDate>
            <atom:updated>2024-04-09T07:55:45.704Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y6Xh_JrX6oJREeIIcpfrxA.png" /></figure><p>This story is co-authored with <a href="https://medium.com/u/387fd6699274">Jules Ricou</a>, who added extended diagnostics to the tool.</p><h3>Introduction</h3><p>Like any company providing online services, Dashdoc strives to deliver performance to its beloved customers. If performance drops, everybody suffers, from our customers to our customer success team, our CEO and, of course, us techies. We are really a bunch of empathic Care Bears at Dashdoc ☺️.</p><p>Jokes aside, technically, we look into several aspects of performance: pure DB optimisation (<a href="https://use-the-index-luke.com/">Use the Index Luke</a>), React UI performance, Django performance, and overall integration sanity (for example, allowing the UI to do bulk operations instead of tens or hundreds of similar calls).</p><p>We <a href="https://medium.com/dashdoc/filter-django-querysets-in-memory-b9cea7cf0959">already talked</a> about avoiding one aspect of N+1 in Django.</p><p><a href="https://medium.com/dashdoc/filter-django-querysets-in-memory-b9cea7cf0959">Filter Django QuerySets In Memory</a></p><p>In that previous article, we described two variations of this issue in our code:</p><ul><li>The typical instance.related.attribute version: if instance.related was not prefetched it would issue a new DB call.</li><li>The less obvious instance.related_set.filter(column=value) version.</li></ul><p>It gave an overview of how we tackled the second point. In this article, we explain how we tackled the first one: an incomplete prefetch.</p><h3>The problem</h3><p>During 2022 we became obsessed with performance issues. 
We were aware that our Django code issued unwanted implicit queries in various places, and although we kept adding unit tests with assertNumQueries, we knew our test data did not always create the situations in which those assertions would catch extra queries.</p><p>Indeed, the setup needed to catch N+1s reliably in tests is often tedious. Of course, we create several first- or second-level entities (the ones we want to test), but the devil is in the details: we might forget to create enough remotely connected entities to expose redundant SQL queries. Those are the entities you access in your business logic while completely forgetting that they need to be prefetched.</p><h3>The intuition</h3><p>Eventually, we figured out a way to gain some insight into those N+1s, specifically the case where an instance.relation wasn’t loaded. In Django’s ORM, there is one point where we can sit and watch the accesses to related data: FieldCacheMixin.get_cached_value.</p><blockquote>🐒 Careful, monkey-patching ahead! Monkey-patching is an often frowned-upon technique for dynamically reconfiguring code. Among the reasons that make reviewers frown are that it can introduce hard-to-discover behaviours and that it often patches internal APIs. 
However, sometimes the benefits outweigh the drawbacks so much!</blockquote><p>We monkey-patched FieldCacheMixin.get_cached_value with our own version, which logged cache misses, and wrote a script to collect those logs and identify which QuerySet was missing a prefetch_related or a select_related.</p><p>This is the original monkey-patch.</p><pre># Both names live in the same Django module; this is an internal API,<br># so the import path may change between Django versions.<br>from django.db.models.fields.mixins import NOT_PROVIDED, FieldCacheMixin<br><br><br>def install_missed_prefetched_detector():<br>    def get_cached_value(self, instance, default=NOT_PROVIDED):<br>        cache_name = self.get_cache_name()<br>        try:<br>            return instance._state.fields_cache[cache_name]<br>        except KeyError:<br>            if default is NOT_PROVIDED:<br>                # _REPORTER just does whatever you want to do:<br>                # system log, sentry, raise, spam slack, keep quiet.<br>                _REPORTER(cache_name, instance)<br>                raise<br>        return default<br><br>    FieldCacheMixin.get_cached_value = get_cached_value</pre><p>We continued in this tedious, hand-crafted way for a few months, letting the logger log and manually running a script to analyze our logs every now and then to find the next code paths to target.</p><h3>The solution</h3><p>Finally, during a <a href="https://medium.com/dashdoc/how-we-organize-our-work-shape-up-774d73eba4dd">cool-down</a>, we used the same monkey-patching approach and built a decorator for our unit tests: <strong>@break_on_missed_prefetches</strong>.</p><p><a href="https://medium.com/dashdoc/how-we-organize-our-work-shape-up-774d73eba4dd">How we organize our work — Shape Up</a></p><p>Its role was to make the test fail as soon as FieldCacheMixin.get_cached_value missed its cache, indicating an N+1.</p><p>We also already had a helper test class that centralizes all the API calls we use in our tests, like:</p><pre>class TestClientMixin:<br>    def api_create_transport(self, *args, **kwargs):<br>        self.client.post(<br>            # ...<br>        )<br><br>    def api_assign_trucker(self, *args, **kwargs):<br>        self.client.patch(<br>            # ...<br>        )</pre><p>And all our tests use these methods. All we needed was to decorate those methods to start guarding our code against missed prefetches!</p><pre>class TestClientMixin:<br>    @break_on_missed_prefetches<br>    def api_create_transport(self, *args, **kwargs):<br>        self.client.post(<br>            # ...<br>        )<br><br>    @break_on_missed_prefetches<br>    def api_assign_trucker(self, *args, **kwargs):<br>        self.client.patch(<br>            # ...<br>        )</pre><p>Bingo! As soon as a missed prefetch happened, we got an exception and the test failed. No need to set up complex data situations and count the number of queries to detect the issue! The expected related object is not there? Then it is an error.</p><p>We later enforced decorating these methods with a <a href="https://semgrep.dev/">Semgrep</a> rule. By using it every day, we iterated on the tool quite extensively.</p><ul><li>Firstly, we improved this tool to also catch deferred fields being fetched: .only(...) and .defer(...) can be used to avoid fetching all fields of a model, and in our case accessing those fields is a sign of a bug.</li><li>Secondly, we added the call stack of the location where the query was executed. To that end, we monkey-patched QuerySet._fetch_all to attach the call stack to the models that were fetched. Since this is only useful for developers (and quite expensive), it is disabled by default. It is often useful because, if you miss a prefetch, you most likely want to add select_related / prefetch_related to the query.</li><li>Finally, with a deep hierarchy of prefetch/select_related it can be hard to find the related object with only the message “implicit query accessing &lt;field&gt; on ModelClass”. To help with that, we store the parent of child entities from within FieldCacheMixin.get_cached_value. 
With this information we can go as far as constructing a likely fix for the query: “Hint: add .select_related('grand_parent__parent__field') to the query”.</li></ul><h3>Conclusion</h3><p>Nowadays, the decorator is called @break_on_implict_queries and reports a lot more information to make it even easier for us to pinpoint incomplete querysets. We go as far as reporting the faulty queryset and where it is defined, together with where the offending N+1 access happened. We also cross the boundary of Celery tasks to be able to debug them too.</p><p>Besides the decorator used in tests, we can turn on the mechanism in our development environment to catch mistakes early.</p><p>This tool has proved to be a very valuable addition to our performance tuning toolkit. Its integration into our unit tests makes sure we don’t let missed prefetches go live. Although it doesn’t completely avoid false positives (for example, situations where doing the correct prefetch explicitly or letting Django run the query lazily makes no performance difference), we use it as a proactive tool to avoid heavy production performance incidents.</p><p>More recently, this tool caught an issue while migrating to Django 5. An extra query appeared in complex situations. 
The investigation to pinpoint what is wrong is still ongoing, though.</p><p>But we were not done yet: performance is an everlasting battle, and the following year, 2023, was crucial for us in that regard.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=711683c98cc4" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/hunting-n-1-errors-in-our-django-code-the-implicit-queries-case-711683c98cc4">Hunting N+1 errors in our Django code: the implicit queries case.</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Filter Django QuerySets In Memory]]></title>
            <link>https://medium.com/dashdoc/filter-django-querysets-in-memory-b9cea7cf0959?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/b9cea7cf0959</guid>
            <category><![CDATA[queryset]]></category>
            <category><![CDATA[django]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Daniel Barbeau]]></dc:creator>
            <pubDate>Tue, 11 Apr 2023 11:44:49 GMT</pubDate>
            <atom:updated>2023-04-11T11:44:49.213Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KTQi-1C99hW6-4rsm2pwuw.png" /><figcaption>Filtering Querysets without hitting the database</figcaption></figure><p>Django’s ORM is great for working with business entities. It also provides tools for efficient queries: QuerySet.select_related (for to-one relations) and QuerySet.prefetch_related (for to-many relations).</p><p>But even so, it is easy to write code that hits the database too many times. This happened to us on our spreadsheet export system. This system was vital for our customers as it was, back then, the only way for them to move data from Dashdoc to their invoicing software, as we didn’t yet have the third-party integrations we have today.</p><p>The system would be used to export thousands of transports at once, and we would quickly run into timeouts. We did have thorough eager loading at the beginning of the export procedure.</p><p>After some investigation we found two variations of the N+1 problem:</p><ul><li>The typical instance.related.attribute version: if instance.related was not prefetched it would issue a new DB call.</li><li>The less obvious instance.related_set.filter(column=value) version.</li></ul><p>Both required fixing. We tackled both and developed some tools to guard us against them. In this post, we will talk about the second problem.</p><h3>The problem</h3><p>The faulty code was in extractors. These extractors are functions that receive a Transport and extract data from it or from its related instances. The following pattern appeared very often:</p><pre>some_data = transport.related_set.filter(attribute__operator=value)</pre><p>This came from the misconception that a .filter() can work on the already fetched data. And it <em>could</em> indeed be very handy: once you are used to Django’s ORM filtering syntax, it is natural to want to use it everywhere.</p><p>Except this hits the DB, not the <em>prefetched</em> data. 
Now repeat this for tens of related instances for thousands of transports and you get the idea.</p><h3>The (odd but satisfying) solution</h3><p>When a QuerySet is evaluated, it holds the data in a cache. And if the QuerySet uses select_related and prefetch_related, then the instances in the cache will themselves have caches for related to-ones and to-manies.</p><p>Since we already had correct eager loading at the beginning of our export procedure, all the data was already in memory, but was being ignored because of the subsequent .filter() calls.</p><p>What we came up with is a module called in_memory_filtering. It very simply implements Django’s ORM filtering syntax to work over the QuerySet. So the previous example becomes:</p><pre>from in_memory_filtering import filter_set</pre><pre>some_data = filter_set(transport.related_set.all(), attribute__operator=value)</pre><p>Sounds dumb? It is! The implementation is as straightforward as it can be. No caching strategies, no indexes. We just iterate through the QuerySet and convert the filter expression(s) into lookups on the instances of the QuerySet as many times as requested.</p><p>Indeed, once the QuerySet is fetched, one can iterate over it without hitting the DB again. And the magic is that it works in related sets too: instance.related_set.all() will return the cached data!</p><p>This very simple implementation serves as the base to express the equivalents for:</p><ul><li>QuerySet.first()</li><li>QuerySet.values_list()</li><li>QuerySet.values()</li><li>QuerySet.get()</li></ul><p>The only catch is that, by design, it doesn’t prevent accidental DB accesses: we prefer a few accidental extra DB hits to a user-facing exception or missing data.</p><h3>It looks silly and over-engineered</h3><p>Yes, this is nothing that can’t be done in other ways. And even more efficiently too! 
Actually, compared to a list comprehension, there is more machinery involved in evaluating the filtering expressions, so this adds overhead.</p><p>But it is enough for us to have decently fast and reliable exports. What this module actually provides is not a technical breakthrough. Instead, it tries to sit somewhere between raw efficiency and programmer friendliness.</p><p>It aims to be easy for a Django developer to adopt. The expressions used in faulty .filter() queries can be easily converted since the syntax is the same (we added a negation of the operator: __not__eq). It supports JSONField. Also, when an expression uses attributes that do not exist on the filtered instances, we provide error messages similar to Django’s Cannot resolve keyword ....</p><p>It is also a good place to put sanity checks on the querysets being filtered.</p><h3>Conclusion</h3><p>The in_memory_filtering module has been in use for more than two years now and was the first bit of code we rolled out to fight performance bottlenecks. But it only works if the eager loading is done properly. So, since then, we have added tools to guard against missed prefetches, and it is now one tool in a growing toolkit to deliver better performance to Dashdoc users.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b9cea7cf0959" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/filter-django-querysets-in-memory-b9cea7cf0959">Filter Django QuerySets In Memory</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tailored API documentation with OpenAPI]]></title>
            <link>https://medium.com/dashdoc/tailored-api-documentation-with-openapi-93775c8fe712?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/93775c8fe712</guid>
            <category><![CDATA[api]]></category>
            <category><![CDATA[documentation]]></category>
            <category><![CDATA[swagger]]></category>
            <category><![CDATA[open-api]]></category>
            <dc:creator><![CDATA[Thomas Loiret]]></dc:creator>
            <pubDate>Wed, 27 Jul 2022 14:13:44 GMT</pubDate>
            <atom:updated>2022-07-27T14:13:44.406Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>OpenAPI is nowadays the standard way to specify your API. Among the benefits of using it, you can generate nice documentation from your specification. In this article, we will go from writing the specification to serving a web page for your API documentation.</p><p>It is important to note that there are solutions to auto-generate your OpenAPI specification from your code, and we actually started with that. For example, you can use drf-yasg for Django REST framework. However, this kind of library has its limitations when it tries to automatically discover endpoint specifications, especially if the implementation is a bit complex. Because the generated documentation no longer reflected reality, we decided to switch to the solution presented here.</p><figure><img alt="Screenshot of the API documentation when using redoc with OpenAPI." src="https://cdn-images-1.medium.com/max/1024/1*4kPZ-YXHoWbOJyliM4qt5w.png" /><figcaption><em>A glimpse of our API documentation using the method presented here.</em></figcaption></figure><h3>Describe your API using YAML</h3><p>The specification can be written in either YAML or JSON. 
We suggest using YAML because it is easier to read and maintain, especially once the specification is split across several files.</p><p>Below is a simplified version of our API documentation entry file.</p><pre>openapi: 3.0.1<br>info:<br>  version: &quot;4.0&quot;<br>  title: Dashdoc public API<br>servers:<br>  - url: https://api.dashdoc.eu/api/v4</pre><pre>paths:<br>  /addresses/:<br>    $ref: &quot;./paths/addresses/addresses.yaml&quot;<br>  /addresses/{id}/:<br>    $ref: &quot;./paths/addresses/addresses-id.yaml&quot;</pre><p>The $ref keyword is what allows us to split the specification into smaller files and avoid duplication. It works in JSON as well, but YAML keeps the split files easier to read and maintain.</p><p>We split our files into 2 main directories: one where we describe the endpoints (the ./paths/ visible above) and another where we describe the objects manipulated by those endpoints. It helps us keep things tidy.</p><h3>Convert YAML to JSON definition</h3><p>Having a YAML definition is nice, but in most cases we need the JSON version.</p><p>This can be easily done with the help of OpenAPI Generator. This tool is often used to generate an API client (SDK) for the provided API definition. But it can also be used to convert the YAML definition into a JSON file.</p><p>OpenAPI Generator is written in Java. There are several ways to install it (see <a href="https://openapi-generator.tech/docs/installation/">documentation</a>). We chose the Docker image to avoid handling the Java dependency and keep our local environment clean.</p><pre>docker run \<br>    --rm \<br>    -v $BASEDIR:/docs \<br>    openapitools/openapi-generator-cli generate \<br>        -i /docs/v4.yaml \<br>        -g openapi \<br>        -o /docs/tmp/v4</pre><p>Explanation of the docker parameters:</p><ul><li>--rm: Automatically remove the container when it exits</li><li>-v: Bind mount a volume. 
Associates a local directory with a directory inside the container.</li></ul><p>Explanation of the openapi-generator-cli generate parameters:</p><ul><li>-i: Input file, here the main YAML file.</li><li>-g: Generator name, always openapi for the documentation.</li><li>-o: Output directory. The JSON definition will be saved here. It should be inside the mounted volume if you want to retrieve it in your local directory.</li></ul><p>Finally, $BASEDIR is the local path in your environment where you store your documentation.</p><h3>Serve your API documentation using redoc</h3><p>As mentioned above, the JSON definition is often consumed by other tools. One of them is <a href="https://github.com/Redocly/redoc">redoc</a>, which generates interactive API documentation from OpenAPI definitions.</p><p>As you can read in their <a href="https://github.com/Redocly/redoc#tldr-final-code-example">documentation</a>, you just need to create an HTML page and use a &lt;redoc&gt; tag with the URL to your JSON definition file in it!</p><pre>&lt;!DOCTYPE html&gt;<br>&lt;html&gt;<br>  &lt;head&gt;<br>    &lt;title&gt;Redoc&lt;/title&gt;<br>    &lt;!-- needed for adaptive design --&gt;<br>    &lt;meta charset=&quot;utf-8&quot;/&gt;<br>    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1&quot;&gt;<br>    &lt;link href=&quot;https://fonts.googleapis.com/css?family=Montserrat:300,400,700|Roboto:300,400,700&quot; rel=&quot;stylesheet&quot;&gt;</pre><pre>    &lt;!--<br>    Redoc doesn&#39;t change outer page styles<br>    --&gt;<br>    &lt;style&gt;<br>      body {<br>        margin: 0;<br>        padding: 0;<br>      }<br>    &lt;/style&gt;<br>  &lt;/head&gt;<br>  &lt;body&gt;<br>    &lt;redoc spec-url=&#39;http://petstore.swagger.io/v2/swagger.json&#39;&gt;&lt;/redoc&gt;<br>    &lt;script src=&quot;https://cdn.jsdelivr.net/npm/redoc@latest/bundles/redoc.standalone.js&quot;&gt; &lt;/script&gt;<br>  
&lt;/body&gt;<br>&lt;/html&gt;</pre><p>Check out our live documentation to get an idea of what redoc produces: <a href="https://www.dashdoc.eu/api/v4/docs/">https://www.dashdoc.eu/api/v4/docs/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=93775c8fe712" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/tailored-api-documentation-with-openapi-93775c8fe712">Tailored API documentation with OpenAPI</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How did tech and marketing work together to ship a website in 3 days?]]></title>
            <link>https://medium.com/dashdoc/how-did-tech-and-marketing-work-together-to-ship-a-website-in-3-days-81f3a697a2cd?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/81f3a697a2cd</guid>
            <category><![CDATA[webflow]]></category>
            <category><![CDATA[growth-mindset]]></category>
            <category><![CDATA[marketing]]></category>
            <category><![CDATA[tech]]></category>
            <category><![CDATA[growth]]></category>
            <dc:creator><![CDATA[Emilien Bidet]]></dc:creator>
            <pubDate>Tue, 26 Apr 2022 15:18:20 GMT</pubDate>
            <atom:updated>2022-04-26T15:18:20.109Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><figure><img alt="Google Trends: evolution of searches for diesel indexing" src="https://cdn-images-1.medium.com/max/1024/1*BVZDlpXtLnHJbPzSBPu0Ug.png" /><figcaption>Google Trends: evolution of searches for diesel indexing</figcaption></figure><p>In France, oil is getting more and more expensive, and we realised that the biggest current pain for our prospects and customers was having to increase the invoices they send to their own customers in order to keep their margin.</p><p>Our objective was to help them quickly with a tool and make them aware of Dashdoc.</p><p>Only <strong>3 days</strong> passed between the creation of the Slack channel about the topic and the launch of the website.</p><p>After the product team spent the first day digging into the need and building strong documentation that summed up what we had found about the topic, we started designing an MVP.</p><h3>Design and implement the MVP</h3><figure><img alt="Prototypes of the landing page on Figma" src="https://cdn-images-1.medium.com/max/1024/1*YIr8IH3HdNS9zc8d6j2UHA.png" /><figcaption>Prototypes of the landing page on Figma</figcaption></figure><p>I started by prototyping the website on Figma (not pixel-perfect at all) just to illustrate what we were talking about. Two ideas emerged from this first step:</p><ul><li>On the left, a complex online tool</li><li>On the right, a simple download page</li></ul><p>We wanted to ship quickly and stay simple, so we chose to implement the second option. 
To implement this I used <a href="https://webflow.com/">Webflow</a> because I had heard that I could create a website in minutes, and it turned out to be true: it’s powerful.</p><p>Everyone was working together in symbiosis:</p><ul><li>our finance manager Hélène was creating the Excel tool,</li><li>the marketing team was thinking about the content (title, CTA text, Excel file) and the communication to promote the website (blog article, newsletter, LinkedIn post, webinar)</li><li>and I was creating the first iteration of the website on Webflow.</li></ul><h3>Time to ship!</h3><figure><img alt="Screenshot of the result of indexation-gazole.fr" src="https://cdn-images-1.medium.com/max/1024/1*3KjmtrnlRK8g2D1P09U_5w.png" /><figcaption>Screenshot of the result of <a href="http://indexation-gazole.fr">indexation-gazole.fr</a></figcaption></figure><p>On the second day, we already had a functional website.</p><p>It was a simple download page with a form linked to our CRM. Submitting the form triggers a workflow in our CRM that sends the Excel file by email.</p><p>On the final day, we connected the domain <a href="http://indexation-gazole.fr">indexation-gazole.fr</a>, which we had bought for the occasion, to the Webflow website within minutes, and we launched communication on all our social media.</p><h3>Conclusion</h3><p>This small campaign resulted in <strong>300 new leads</strong> in a month for 3 days of design and development.</p><p>We will now work on SEO to start generating organic traffic, adding more useful content to become the reference website on the topic.</p><p>Three key factors made this possible in such a short time:</p><ul><li>Using only <strong>low-code tools</strong> to iterate as fast as we can</li><li>Strong internal communication</li><li>Keeping the user’s pain in mind</li></ul><p>At Dashdoc we move fast and adapt quickly to the needs of our customers. If you’re looking for challenges, come <a 
href="https://www.welcometothejungle.com/fr/companies/dashdoc/tech-1">reach us</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*j65ZwV20hP4hyN1K" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=81f3a697a2cd" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/how-did-tech-and-marketing-work-together-to-ship-a-website-in-3-days-81f3a697a2cd">How did tech and marketing work together to ship a website in 3 days?</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How we organize our work — Shape Up]]></title>
            <link>https://medium.com/dashdoc/how-we-organize-our-work-shape-up-774d73eba4dd?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/774d73eba4dd</guid>
            <category><![CDATA[shape-up]]></category>
            <category><![CDATA[tech]]></category>
            <category><![CDATA[organisation]]></category>
            <dc:creator><![CDATA[Corentin Smith]]></dc:creator>
            <pubDate>Wed, 23 Mar 2022 10:54:39 GMT</pubDate>
            <atom:updated>2022-03-23T10:55:07.312Z</atom:updated>
            <content:encoded><![CDATA[<h3>How we organize our work — Shape Up</h3><h3>The setup</h3><p>2 years ago, we had 3 fresh new engineers in our tech team (5 in total), and at last, a real product manager who helped us square things up. We started planning sprints again, feeling good about the new process.</p><p>But things started feeling off. As a founder I was used to having a lot of freedom in how I planned my work, and I was used to owning the whole development flow, from design to implementation and deployment.</p><p>Now we were spending a lot of time in retrospective meetings, in planning sessions, splitting hairs trying to estimate points for user stories. It was a lot of work for our product manager to write these detailed specs and user stories for everyone, and it was not even pleasant for the tech team.</p><p>I felt that this way of working was mostly holding back the best people in the team and helping mostly the more junior people.</p><p>We were processing ticket after ticket and hoping that the end of the sprint would bring the whole epic together. What happened is that we ended up in integration hell and the parts developed in isolation did not fit together so well. It felt more like a mini-waterfall than an agile process.</p><p>There was no ownership in the tech team; every person was responsible for a small part of a feature, and then we tried to glue it all together.</p><p>With the team growing, things were only going to get worse if we continued on the same path. So I started looking around to see if there was another way of doing things, and I found <a href="https://basecamp.com/shapeup/webbook">Shape Up</a>.</p><h3>Shape Up</h3><p>The tagline for the Shape Up book is “Stop Running in Circles and Ship Work that Matters”.</p><p>The basic idea is the following:</p><ul><li>New features are described in “pitches” where a pitch is a short document outlining the problem we are solving, why it matters, and what we intend to build to solve it. 
The specification of what we’re building is deliberately light on details; it’s more precise than an idea but less detailed than a full mockup.</li><li>The development cycle is 8 weeks long, split into 2 parts: 6 weeks working on pitches, and 2 weeks of “cool-down” where the tech team can spend time working on technical improvements. The idea is that 6 weeks is long enough to tackle important work but short enough that you can feel the deadline. (<a href="https://www.intercom.com/blog/6-week-cycle-for-product-teams/">Intercom also has a good article on why 6 weeks is a good cycle time</a>)</li><li>One or two devs work on a single pitch (“big batch”) or a few pitches (“small batch”) together during the cycle.</li><li>A small team of core people writes the pitches (mainly the product manager in our case).</li><li>Before the beginning of every cycle, we have a meeting called the “Betting table” where the co-founders and heads of all departments choose together which pitches will be selected for the upcoming cycle.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_qL5R5qXsp_5oJlMox6DiQ.jpeg" /><figcaption>Dhinesh and Arthur hard at work on their pitch</figcaption></figure><h3>Starting out</h3><p>Once a few of us in the team had read a few articles and the Shape Up book, we were convinced that we could try it out for real. We shared some resources with the whole team and took time to explain what we were going to try and why. We wrote enough pitches to have a first full cycle, and we held the first betting table.</p><p>The first thing we gained from holding the betting table meeting is that it aligned the whole company on what we were building in the product. No more trying to add a “small feature” at the last minute to sign a customer.</p><h3>What went well</h3><p>Things went pretty smoothly from the start, and everyone felt a lot more productive. 
The product team had a lot more time to spend on discovery, and the engineers could concentrate on one subject at a time.</p><p>There is a lot more ownership in the tech team. The goal is to solve the user’s problem during the cycle, and the pitch can and should be challenged to make sure we reach this goal.</p><p>During the past 2 years, the product + engineering team grew to about 25 people split into 4 teams, and the method scaled nicely.</p><h3>Adapting the method to our needs</h3><p>Unlike Basecamp, we don’t serve only small businesses on the same pricing, and we have a high-touch sales cycle. In many ways the Basecamp model is closer to B2C; they can choose to ignore some of their customers much more easily. Our customers range from independent truckers to enterprise companies with thousands of users, so we had to adapt the method to our needs.</p><ul><li>The main thing we didn’t keep from the book was the idea of only fixing bugs during the cool-down. We have a product that is critical for our users and most of the time we can’t wait a few weeks to fix a bug. At first, we had the whole team dedicated to bug fixing every Tuesday and Thursday morning. We found that this was not great in terms of focus (one of the main goals of Shape Up), so we’re now experimenting with a “Delight team”. 
The Delight team’s goal is to fix bugs and implement small features that can be blocking the onboarding of new customers, and generally help unblock the ops team any way they can.</li><li>At first we tried not to commit to a roadmap, then we added an “orientation meeting” in the middle of the cycle to give visibility in advance to the rest of the team, and finally, as we pivoted our product to a full transport management platform, we returned to a roadmap, as the objectives for the next ~6 months are very clear.</li><li>We create one Slack channel per pitch, where the engineering, product, and customer success teams can talk about ongoing work on the feature.</li><li>We do not use hill charts to communicate progress. More generally, each team can choose to organize their work how they see fit. Some hold daily standups, some only have a weekly meeting, some use Notion to track their work, some use Linear…</li></ul><h3>Still figuring it out</h3><p>After each cycle we hold a retrospective and change our workflow to better fit our needs. As the team grows, new challenges arise and we have to keep updating the way we work.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=774d73eba4dd" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/how-we-organize-our-work-shape-up-774d73eba4dd">How we organize our work — Shape Up</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Extracting structured data from scanned documents]]></title>
            <link>https://medium.com/dashdoc/extracting-structured-data-from-scanned-documents-57238ba98d49?source=rss----4cbc68d7a693---4</link>
            <guid isPermaLink="false">https://medium.com/p/57238ba98d49</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[ocr]]></category>
            <category><![CDATA[keras]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Cecile Prat]]></dc:creator>
            <pubDate>Thu, 10 Mar 2022 14:53:23 GMT</pubDate>
            <atom:updated>2022-03-10T14:53:23.676Z</atom:updated>
            <content:encoded><![CDATA[<p>At Dashdoc, our goal is to digitize the transportation sector. But our customers still have to deal with a lot of paper documents. Extracting information from these documents could help us reduce manual processing.</p><p>The aim here was to get the weight of the load of a transport from a ticket picture taken by a trucker.</p><p>These weight notes come in many different layouts. Often several weights are present on a ticket, they are not all in the same orientation, and we have to deal with handwriting, blurred pictures, shadows…</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h2wAs2mZq18y2psLEZ8nYg.png" /></figure><p>The general approach was to proceed step by step, splitting the problem into successive subtasks:</p><ol><li>Straighten the document</li><li>Extract the text from the image</li><li>Find the requested weight parts in the text</li><li>Rebuild the weight</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZyTKr_klvNo1H4ixsB2Q3A.png" /></figure><h3>Straighten the document</h3><p>Thanks to a tool we use on the mobile side, most of the pictures are reframed around the document we want. But this can lead to documents rotated by 90°.</p><p>We chose to straighten these images as an automatic preprocessing step. The idea was to make the task easier for the following steps.</p><p>To do so, we built an image classification model. We used transfer learning, taking a MobileNet model pretrained on ImageNet, on top of which we added some layers to fit our 4-orientation (0°, 90°, 180°, 270°) classification problem. We trained these new layers on 1000 training tickets coming from our database, which we annotated with a home-made annotation tool based on OpenCV. At the end of the training, we chose the model that gave the best result on another 1000 validation tickets. 
This was done using Keras.</p><pre>import tensorflow as tf</pre><pre>idg = tf.keras.preprocessing.image.ImageDataGenerator()<br>training_data_generator = idg.flow_from_dataframe(<br>    training_data, directory=output_path,<br>    x_col=&quot;filename&quot;, y_col=&quot;orientation&quot;,<br>    target_size=(224, 224),<br>)<br>validation_data_generator = idg.flow_from_dataframe(<br>    validation_data, directory=output_path,<br>    x_col=&quot;filename&quot;, y_col=&quot;orientation&quot;,<br>    target_size=(224, 224),<br>)</pre><pre>inputs = tf.keras.Input(shape=(224, 224, 3))<br>scale_layer = tf.keras.layers.Rescaling(scale=1 / 127.5, offset=-1)<br>x = scale_layer(inputs)<br>base_model = tf.keras.applications.MobileNet(<br>    weights=&quot;imagenet&quot;,<br>    input_shape=(224, 224, 3),<br>    include_top=False,<br>)<br>base_model.trainable = False<br>x = base_model(x, training=False)<br>x = tf.keras.layers.GlobalAveragePooling2D()(x)<br>outputs = tf.keras.layers.Dense(4, activation=&#39;softmax&#39;)(x)<br>model = tf.keras.Model(inputs, outputs)</pre><pre>model.compile(<br>    optimizer=tf.keras.optimizers.Adam(),<br>    loss=&#39;categorical_crossentropy&#39;,<br>    metrics=[&#39;categorical_accuracy&#39;],<br>)<br>checkpoint = tf.keras.callbacks.ModelCheckpoint(<br>    save_dir + &quot;mobilenet.h5&quot;,<br>    monitor=&#39;val_categorical_accuracy&#39;, verbose=1,<br>    save_best_only=True, mode=&#39;max&#39;,<br>)<br>callbacks = [checkpoint]<br>model.fit(<br>    training_data_generator,<br>    epochs=20,<br>    validation_data=validation_data_generator,<br>    callbacks=callbacks,<br>)</pre><p>Thanks to this, we managed to reduce the share of wrongly oriented documents from 13.5% to 2.5% on our 1000 test tickets.</p><h3>Extract the text</h3><p>The next step was to extract the text from the image. This is the OCR (Optical Character Recognition) part. For this, we used the docTR package from Mindee. 
DocTR divides the task into two parts: first text detection (isolating word regions), then text recognition (identifying characters). We use the default pretrained detection and recognition models provided by docTR.</p><pre>from doctr.models import ocr_predictor</pre><pre>ocr_model = ocr_predictor(pretrained=True)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/306/0*ubd1K8JuUG_QY_Lg.png" /></figure><h3>Find the requested weight</h3><p>The next step was to find the right weight among all the extracted text. We did this using token classification, which consists of assigning labels to individual tokens in a sentence. Tokens can be words, letters or subwords. Here we try to classify tokens into two classes:</p><ul><li>O, Outside of a named entity</li><li>W, net Weight entity</li></ul><p>For this task we used a LayoutLM model, which is a transformer model that takes into account both the text and the layout of the document. This was done using the Transformers package from Hugging Face.</p><p>The LayoutLM model expects tokens and their box coordinates as input. So the first part of the token classification step is to build those tokens and the corresponding boxes. In our case, tokens will correspond to subwords.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oYmH3hycCO8PbT4OMcfaFA.png" /></figure><p>DocTR already gives us words and their bounding boxes.</p><p>First, LayoutLM expects the coordinates to be on a 0–1000 scale. Since docTR gives the boxes as relative coordinates, you just have to multiply the OCR result coordinates by 1000.</p><p>Then we have to split these words into tokens that match the ones the model has been pretrained on. 
This can be done using the tokenizer that Transformers provides for the model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PEHFerFYMy7GAn5ch3JZoQ.png" /></figure><pre>import numpy as np<br>from transformers import (<br>    LayoutLMConfig,<br>    LayoutLMTokenizer,<br>)<br>import tensorflow as tf</pre><pre>tokenizer = LayoutLMTokenizer.from_pretrained(&quot;microsoft/layoutlm-base-uncased&quot;)</pre><pre>train_features = tokenizer(<br>    list(train_images_words_data_df[&quot;sentence&quot;]),<br>    padding=&quot;max_length&quot;,<br>    truncation=True,<br>    return_tensors=&quot;tf&quot;,<br>).data</pre><p>Extending the bounding boxes to these tokens must be done manually. This extension must take into account the special tokens added by the tokenizer, such as the start and end of document tokens, or the padding tokens (added so that all our documents have the same size and can be processed in batches). We also have to take into account the possible truncation of the document (necessary to fit the model’s maximum input size). For training, labels must also be extended. 
We set the labels of all special tokens to -100 (the index that is ignored by the loss function) and the labels of all other tokens to the label of the word they come from.</p><pre>config = LayoutLMConfig.from_pretrained(&quot;microsoft/layoutlm-base-uncased&quot;)</pre><pre># Build boxes and labels for tokens<br>max_tokens = config.max_position_embeddings</pre><pre>images_token_boxes_list = []<br>images_labels_list = []<br>for _, row in images_words_data_df.iterrows():<br>    words = row[&quot;value&quot;]<br>    normalized_word_boxes = np.transpose(<br>        [row[&quot;x0_scaled&quot;], row[&quot;y0_scaled&quot;], row[&quot;x1_scaled&quot;], row[&quot;y1_scaled&quot;]]<br>    ).tolist()<br>    labels = row[&quot;label&quot;]<br>    token_boxes = []<br>    token_labels = []<br>    # Words tokens<br>    for word, box, label in zip(words, normalized_word_boxes, labels):<br>        word_tokens = tokenizer.tokenize(word)<br>        token_boxes.extend([box] * len(word_tokens))<br>        token_labels.extend(<br>            [label] * len(word_tokens)<br>        )<br>    # Truncation<br>    special_tokens_count = 2<br>    if len(token_boxes) &gt; max_tokens - special_tokens_count:<br>        token_boxes = token_boxes[:(max_tokens - special_tokens_count)]<br>        token_labels = token_labels[:(max_tokens - special_tokens_count)]<br>    # Add bounding boxes of special tokens ([CLS] and [SEP])<br>    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]<br>    token_labels = [-100] + token_labels + [-100]<br>    # Padding<br>    padding_length = max_tokens - len(token_boxes)<br>    token_boxes += [[0, 0, 0, 0]] * padding_length<br>    token_labels += [-100] * padding_length</pre><pre>    # Add image result<br>    images_token_boxes_list.append(token_boxes)<br>    images_labels_list.append(token_labels)</pre><pre>train_features[&quot;bbox&quot;] = tf.convert_to_tensor(images_token_boxes_list)<br>train_features[&quot;labels&quot;] = 
tf.convert_to_tensor(images_labels_list)</pre><p>Regarding the model, we fine-tuned the Transformers LayoutLM pretrained model on 135 tickets. As with the orientation model, these notes were annotated using a home-made annotation tool, also based on OpenCV.</p><pre>import tensorflow as tf<br>from transformers import (<br>    TFLayoutLMForTokenClassification,<br>)</pre><pre>BATCH_SIZE = 2<br>EPOCH_NUMBER = 5</pre><pre>train_tf_dataset = tf.data.Dataset.from_tensor_slices(train_features)<br>train_tf_dataset = train_tf_dataset.shuffle(len(train_tf_dataset)).batch(BATCH_SIZE)</pre><pre>model = TFLayoutLMForTokenClassification.from_pretrained(<br>    &quot;microsoft/layoutlm-base-uncased&quot;, num_labels=2<br>)<br>model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))<br>model.fit(train_tf_dataset, epochs=EPOCH_NUMBER)</pre><h3>Rebuild the weight</h3><p>After the token classification task, we have some tokens identified as weight. It is worth noting that along the different steps, the weight can be split into several words (by the OCR step) and into several tokens (by the tokenization for token classification). In an ideal world, at this step we would just have to concatenate all the tokens identified as weight to rebuild our weight. But in reality, this part can be messy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/898/1*fJxhou8HZG-Wi1vifCxAKg.png" /></figure><p>For instance, tokens that are part of the requested weight could have been predicted as not weight (false negatives), and conversely some tokens that are not part of the wanted weight can be predicted as weight (false positives). For the first issue, we tried to smooth the prediction by using the boxes of the tokens identified as weight: we consider that all the tokens inside a box are part of the weight if one of them is predicted as weight. 
And so, all the tokens of a box containing a weight token are merged together to form a partial candidate string.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eCNMykJl1SVbxTtPZr60sA.png" /></figure><p>Then, to handle weights split into several words (this can happen, for instance, when there is a space after the thousands), we use the order of the boxes: if two consecutive boxes are considered as weight, we merge them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U8TCMGZwCuuuWeV1L9AU_Q.png" /></figure><p>After this, we have several potential weight strings that can contain any character. For each of them, we use a simple regular expression to extract the digits and apply some basic business rules to compute a number, check whether it is a plausible weight, and determine the unit. We finally select one of the plausible weights.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3i3vE8BIIlGJIGfX6VAnzA.png" /></figure><h3>Results</h3><p>Thanks to this approach, we achieved an accuracy of 74% on a test set of 1000 weight notes, excluding the 3% of tickets where the weight is not visible at all (the weight is not in the picture or is extremely blurry).</p><p>This result can already be useful for some Dashdoc features, such as automatically validating the weight entered by the trucker before invoicing, which saves our users from opening the document to check it manually before making the invoice.</p><p>Here we built the skeleton of our solution, knowing each step can be improved. We could try other models for the orientation classification. We used the raw docTR models, but we could train them on our data to see if detection or recognition can be improved. For the weight identification part, we could train the model on more data or see whether some business rules could be useful. 
And the final heuristic, which is really basic for now, could also be improved.</p><p>Extracting information from documents has become a common problem and the applications are huge. Here we chose a specific use case to begin with, but it could be extended to a lot of other subjects at Dashdoc. Technically, it spans several very interesting domains. Dividing the problem into different parts highlights this and makes it easier to iterate. The results are promising. Stay tuned for the next step: integrating the weight extraction features inside the product!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=57238ba98d49" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dashdoc/extracting-structured-data-from-scanned-documents-57238ba98d49">Extracting structured data from scanned documents</a> was originally published in <a href="https://medium.com/dashdoc">Dashdoc</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>