<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Waleed Abdulla on Medium]]></title>
        <description><![CDATA[Stories by Waleed Abdulla on Medium]]></description>
        <link>https://medium.com/@waleedka?source=rss-1a69ae209bc4------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*RDTSqB0ocRKRgYEA.JPG</url>
            <title>Stories by Waleed Abdulla on Medium</title>
            <link>https://medium.com/@waleedka?source=rss-1a69ae209bc4------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 13 Apr 2026 10:31:37 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@waleedka/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow]]></title>
            <link>https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46?source=rss-1a69ae209bc4------2</link>
            <guid isPermaLink="false">https://medium.com/p/7c761e238b46</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[instance-segmentation]]></category>
            <category><![CDATA[object-detection]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[mask-rcnn]]></category>
            <dc:creator><![CDATA[Waleed Abdulla]]></dc:creator>
            <pubDate>Tue, 20 Mar 2018 00:23:01 GMT</pubDate>
            <atom:updated>2018-12-10T01:53:20.068Z</atom:updated>
            <content:encoded><![CDATA[<h4>Explained by building a color splash filter</h4><p>Back in November, we open-sourced our <a href="https://github.com/matterport/Mask_RCNN">implementation of Mask R-CNN</a>, and since then it’s been forked 1400 times, used in a lot of projects, and improved upon by many generous contributors. We received a lot of questions as well, so in this post I’ll explain how the model works and show how to use it in a real application.</p><p>I’ll cover two things: first, an overview of Mask R-CNN; and second, how to train a model from scratch and use it to build a smart color splash filter.</p><blockquote><strong>Code Tip:</strong><br>We’re sharing the code <a href="https://github.com/matterport/Mask_RCNN/tree/master/samples/balloon">here</a>, including the dataset I built and the trained model. Follow along!</blockquote><h3>What is Instance Segmentation?</h3><p>Instance segmentation is the task of identifying object outlines at the pixel level. Of the related computer vision tasks, it’s one of the hardest. Consider the following tasks:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/996/1*-zw_Mh1e-8YncnokbAFWxg.png" /></figure><ul><li><strong>Classification: </strong>There is a balloon in this image.</li><li><strong>Semantic Segmentation:</strong> These are all the balloon pixels.</li><li><strong>Object Detection: </strong>There are 7 balloons in this image at these locations. We’re starting to account for objects that overlap.</li><li><strong>Instance Segmentation</strong>: There are 7 balloons at these locations, and these are the pixels that belong to each one.</li></ul><h3>Mask R-CNN</h3><p>Mask R-CNN (region-based convolutional neural network) is a two-stage framework: the first stage scans the image and generates <em>proposals</em> (areas likely to contain an object). 
The second stage classifies the proposals and generates bounding boxes and masks.</p><p>It was introduced last year in the <a href="https://arxiv.org/abs/1703.06870">Mask R-CNN paper</a> as an extension of its predecessor, <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN</a>, by the same authors. Faster R-CNN is a popular framework for object detection, and Mask R-CNN extends it with instance segmentation, among other things.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IWWOPIYLqqF9i_gXPmBk3g.png" /><figcaption>Mask R-CNN framework. Source: <a href="https://arxiv.org/abs/1703.06870">https://arxiv.org/abs/1703.06870</a></figcaption></figure><p>At a high level, Mask R-CNN consists of these modules:</p><h3>1. Backbone</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/309/1*IDjLXsSw5QMFWDudayIBfw.png" /><figcaption>Simplified illustration of the backbone network</figcaption></figure><p>This is a standard convolutional neural network (typically ResNet50 or ResNet101) that serves as a feature extractor. The early layers detect low level features (edges and corners), and later layers successively detect higher level features (car, person, sky).</p><p>As the image passes through the backbone network, it is converted from 1024x1024x3 (RGB) to a feature map of shape 32x32x2048. This feature map becomes the input for the following stages.</p><blockquote><strong>Code Tip:</strong><br>The backbone is built in the function <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L171">resnet_graph()</a>. The code supports ResNet50 and ResNet101.</blockquote><h4>Feature Pyramid Network</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/452/1*1sCveJrqfthOQsGGZRs2tQ.png" /><figcaption>Source: Feature Pyramid Networks paper</figcaption></figure><p>While the backbone described above works great, it can be improved upon. 
The <a href="https://arxiv.org/abs/1612.03144">Feature Pyramid Network (FPN)</a> was introduced by the same authors of Mask R-CNN as an extension that can better represent objects at multiple scales.</p><p>FPN improves the standard feature extraction pyramid by adding a second pyramid that takes the high level features from the first pyramid and passes them down to lower layers. By doing so, it allows features at every level to have access to both lower and higher level features.</p><p>Our implementation of Mask R-CNN uses a ResNet101 + FPN backbone.</p><blockquote><strong>Code Tip:</strong><br> The FPN is created in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L1840">MaskRCNN.build()</a>, in the section right after the ResNet is built. <br>FPN introduces additional complexity: rather than a single backbone feature map as in the standard backbone (i.e. the top layer of the first pyramid), in FPN there is a feature map at each level of the second pyramid. We pick which to use dynamically depending on the size of the object. I’ll continue to refer to the <strong>backbone feature map</strong> as if it’s one feature map, but keep in mind that when using FPN, we’re actually picking one out of several at runtime.</blockquote><h3>2. Region Proposal Network (RPN)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/593/1*ESpJx0XLvyBa86TNo2BfLQ.png" /><figcaption>Simplified illustration showing 49 anchor boxes</figcaption></figure><p>The RPN is a lightweight neural network that scans the image in a sliding-window fashion and finds areas that contain objects.</p><p>The regions that the RPN scans over are called <em>anchors</em>: boxes distributed over the image area, as shown on the left. This is a simplified view, though. In practice, there are about 200K anchors of different sizes and aspect ratios, and they overlap to cover as much of the image as possible.</p><p>How fast can the RPN scan that many anchors? Pretty fast, actually. 
The sliding window is handled by the convolutional nature of the RPN, which allows it to scan all regions in parallel (on a GPU). Further, the RPN doesn’t scan over the image directly (even though we draw the anchors on the image for illustration). Instead, the RPN scans over the backbone feature map. This allows the RPN to reuse the extracted features efficiently and avoid duplicate calculations. With these optimizations, the RPN runs in about 10 ms according to the <a href="https://arxiv.org/abs/1506.01497">Faster RCNN paper</a> that introduced it. In Mask RCNN we typically use larger images and more anchors, so it might take a bit longer.</p><blockquote><strong>Code Tip:<br></strong>The RPN is created in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L831">rpn_graph()</a>. Anchor scales and aspect ratios are controlled by RPN_ANCHOR_SCALES and RPN_ANCHOR_RATIOS in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/config.py">config.py</a>.</blockquote><p>The RPN generates two outputs for each anchor:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*EMNE8bxOT4RI3HMjIqjCwQ.png" /><figcaption>3 anchor boxes (dotted) and the shift/scale applied to them to fit the object precisely (solid). Several anchors can map to the same object.</figcaption></figure><ol><li><strong>Anchor Class:</strong> One of two classes: foreground or background. The FG class implies that there is likely an object in that box.</li><li><strong>Bounding Box Refinement:</strong> A foreground anchor (also called positive anchor) might not be centered perfectly over the object. So the RPN estimates a delta (% change in x, y, width, height) to refine the anchor box to fit the object better.</li></ol><p>Using the RPN predictions, we pick the top anchors that are likely to contain objects and refine their location and size. 
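To make the refinement step concrete, here is a minimal NumPy sketch of applying one predicted delta to one anchor box. The repo implements this in TensorFlow over whole batches of boxes; the function name and the numbers below are made up for illustration.

```python
import numpy as np

def apply_box_delta(box, delta):
    """Refine an anchor box [y1, x1, y2, x2] with an RPN delta
    [dy, dx, log(dh), log(dw)]. Shifts are relative to box size."""
    y1, x1, y2, x2 = box
    h, w = y2 - y1, x2 - x1
    cy, cx = y1 + 0.5 * h, x1 + 0.5 * w
    # Shift the center, then scale the height and width
    cy += delta[0] * h
    cx += delta[1] * w
    h *= np.exp(delta[2])
    w *= np.exp(delta[3])
    return np.array([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w])

# A 10x10 anchor shifted right by 10% of its width, size unchanged
refined = apply_box_delta(np.array([0.0, 0.0, 10.0, 10.0]),
                          np.array([0.0, 0.1, 0.0, 0.0]))
```

Using log-space deltas for height and width keeps the predicted scale factor positive and makes the regression target better behaved, which is the convention the Faster R-CNN paper established.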
If several anchors overlap too much, we keep the one with the highest foreground score and discard the rest (a step referred to as non-max suppression). After that, we have the final <em>proposals</em> (regions of interest) that we pass to the next stage.</p><blockquote><strong>Code Tip:</strong><br>The <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L255">ProposalLayer</a> is a custom Keras layer that reads the output of the RPN, picks top anchors, and applies bounding box refinement.</blockquote><h3>3. ROI Classifier &amp; Bounding Box Regressor</h3><p>This stage runs on the regions of interest (ROIs) proposed by the RPN, and just like the RPN, it generates two outputs for each ROI:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*xQYuM_9mu5kt8nNN8Ms2TQ.png" /><figcaption>Illustration of stage 2. Source: Fast R-CNN (<a href="https://arxiv.org/abs/1504.08083">https://arxiv.org/abs/1504.08083</a>)</figcaption></figure><ol><li><strong>Class:</strong> The class of the object in the ROI. Unlike the RPN, which has two classes (FG/BG), this network is deeper and has the capacity to classify regions into specific classes (person, car, chair, etc.). It can also generate a <em>background</em> class, which causes the ROI to be discarded.</li><li><strong>Bounding Box Refinement:</strong> Very similar to the refinement done in the RPN; its purpose is to further refine the location and size of the bounding box to encapsulate the object.</li></ol><blockquote><strong>Code Tip:</strong><br>The classifier and bounding box regressor are created in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L901">fpn_classifier_graph()</a>.</blockquote><h4>ROI Pooling</h4><p>There is a bit of a problem to solve before we continue. Classifiers don’t handle variable input sizes very well; they typically require a fixed input size. 
But, due to the bounding box refinement step in the RPN, the ROI boxes can have different sizes. That’s where ROI pooling comes into play.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/645/1*bsT00ickNk7vaRJNrTvKPQ.png" /><figcaption>The feature map here is from a low-level layer, for illustration, to make it easier to understand.</figcaption></figure><p>ROI pooling refers to cropping a part of a feature map and resizing it to a fixed size. It’s similar in principle to cropping part of an image and then resizing it (but there are differences in implementation details).</p><p>The authors of Mask R-CNN suggest a method they named ROIAlign, in which they sample the feature map at different points and apply bilinear interpolation. In our implementation, we used TensorFlow’s <a href="https://www.tensorflow.org/api_docs/python/tf/image/crop_and_resize">crop_and_resize</a> function for simplicity and because it’s close enough for most purposes.</p><blockquote><strong>Code Tip:</strong><br>ROI pooling is implemented in the class <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L344">PyramidROIAlign</a>.</blockquote><h3>4. Segmentation Masks</h3><p>If you stop at the end of the last section, you have a <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN</a> framework for object detection. The mask network is the addition that the Mask R-CNN paper introduced.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/455/1*l55WzUq1ZD2b5EGwW05LDA.png" /></figure><p>The mask branch is a convolutional network that takes the positive regions selected by the ROI classifier and generates masks for them. The generated masks are low resolution: 28x28 pixels. But they are <em>soft</em> masks, represented by float numbers, so they hold more detail than binary masks. The small mask size helps keep the mask branch light. 
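As a rough illustration of what a soft mask buys you, here is a sketch that upscales a small soft mask (a stand-in for the 28x28 branch output) and thresholds it into a binary mask. The nearest-neighbor resize and the 0.5 threshold are assumptions for this sketch, not necessarily the repo's exact choices.

```python
import numpy as np

# Hypothetical 4x4 "soft" mask standing in for the 28x28 branch output
soft = np.array([[0.1, 0.2, 0.1, 0.0],
                 [0.2, 0.9, 0.8, 0.1],
                 [0.1, 0.8, 0.7, 0.2],
                 [0.0, 0.1, 0.2, 0.1]])

def mask_to_box_size(soft_mask, box_h, box_w, threshold=0.5):
    """Upscale a soft mask to the ROI box size (nearest neighbor)
    and threshold it into a binary mask."""
    h, w = soft_mask.shape
    rows = np.arange(box_h) * h // box_h  # map output rows to input rows
    cols = np.arange(box_w) * w // box_w
    resized = soft_mask[rows][:, cols]
    return resized >= threshold

binary = mask_to_box_size(soft, 8, 8)
```

Because the float values survive until after the resize, the boundary of the final mask can fall between the 28x28 grid cells, which is exactly the detail a pre-thresholded binary mask would lose.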
During training, we scale down the ground-truth masks to 28x28 to compute the loss, and during inference we scale up the predicted masks to the size of the ROI bounding box, which gives us the final masks, one per object.</p><blockquote><strong>Code Tip:</strong><br>The mask branch is in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L957">build_fpn_mask_graph()</a>.</blockquote><h3>Let’s Build a Color Splash Filter</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/460/1*lAP6vX1tLQaxFn6XGEQ32g.gif" /><figcaption>Sample generated by this project</figcaption></figure><p>Unlike most image editing apps that include this filter, our filter will be a bit smarter: it finds the objects automatically, which becomes even more useful if you want to apply it to videos rather than a single image.</p><h3>Training Dataset</h3><p>Typically, I’d start by searching for public datasets that contain the objects I need. But in this case, I wanted to document the full cycle and show how to build a dataset from scratch.</p><p>I searched for balloon images on Flickr, limiting the license type to “Commercial use &amp; mods allowed”. This returned more than enough images for my needs. I picked a total of 75 images and divided them into a training set and a validation set. Finding images is easy. Annotating them is the hard part.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/871/1*Q4tCdhwrklvJLM9zn5aDhg.png" /></figure><p>Wait! Don’t we need, like, a million images to train a deep learning model? Sometimes you do, but often you don’t. I’m relying on two main points to reduce my training requirements significantly:</p><p>First, <em>transfer learning</em>, which simply means that, instead of training a model from scratch, I start with a weights file that’s been trained on the COCO dataset (we provide that in the GitHub repo). 
Although the COCO dataset does <strong>not</strong> contain a balloon class, it contains a lot of other images (~120K), so the trained weights have already learned a lot of the features common in natural images, which really helps. Second, given the simple use case here, I’m not demanding high accuracy from this model, so the tiny dataset should suffice.</p><p>There are a lot of tools to annotate images. I ended up using <a href="http://www.robots.ox.ac.uk/~vgg/software/via/">VIA (VGG Image Annotator)</a> because of its simplicity. It’s a single HTML file that you download and open in a browser. Annotating the first few images was very slow, but once I got used to the user interface, I was annotating at around an object a minute.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/948/1*6SICkQA-YCLp88A7GFM4Ag.png" /><figcaption>UI of the VGG Image Annotator tool</figcaption></figure><p>If you don’t like the VIA tool, here is a list of the other tools I tested:</p><ul><li><a href="http://labelme2.csail.mit.edu/">LabelMe</a>: One of the best-known tools. The UI was a bit too slow, though, especially when zooming in on large images.</li><li><a href="https://rectlabel.com/">RectLabel</a>: Simple and easy to work with. Mac only.</li><li><a href="https://www.labelbox.io/">LabelBox</a>: Pretty good for larger labeling projects and has options for different types of labeling tasks.</li><li><a href="http://www.robots.ox.ac.uk/~vgg/software/via/">VGG Image Annotator (VIA)</a>: Fast, light, and really well designed. This is the one I ended up using.</li><li><a href="https://github.com/tylin/coco-ui">COCO UI</a>: The tool used to annotate the COCO dataset.</li></ul><h3>Loading the Dataset</h3><p>There isn’t a universally accepted format to store segmentation masks. Some datasets save them as PNG images, others store them as polygon points, and so on. 
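For instance, polygon-point annotations can be rasterized into a bitmap mask. In practice a library routine such as skimage.draw.polygon does this for you; the pure-NumPy sketch below just shows the even-odd ray-casting idea behind it, using a made-up square polygon.

```python
import numpy as np

def polygon_to_mask(xs, ys, height, width):
    """Rasterize a polygon (x and y vertex lists) into a boolean
    mask using even-odd ray casting: a pixel is inside if a
    horizontal ray from it crosses the polygon's edges an odd
    number of times."""
    mask = np.zeros((height, width), dtype=bool)
    n = len(xs)
    for row in range(height):
        for col in range(width):
            inside = False
            for i in range(n):
                x1, y1 = xs[i], ys[i]
                x2, y2 = xs[(i + 1) % n], ys[(i + 1) % n]
                # Does the ray going right from (col, row) cross edge i?
                if (y1 > row) != (y2 > row):
                    x_cross = x1 + (row - y1) * (x2 - x1) / (y2 - y1)
                    if col < x_cross:
                        inside = not inside
            mask[row, col] = inside
    return mask

# A 10x10 mask from a square polygon with corners at (2,2) and (7,7)
mask = polygon_to_mask([2, 7, 7, 2], [2, 2, 7, 7], 10, 10)
```

The nested loops are for clarity only; a vectorized or scanline implementation (which is what image libraries use) is far faster on real image sizes.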
To handle all these cases, our implementation provides a Dataset class that you inherit from and then override a few functions to read your data in whichever format it happens to be.</p><p>The VIA tool saves the annotations in a JSON file, and each mask is a set of polygon points. I didn’t find documentation for the format, but it’s pretty easy to figure out by looking at the generated JSON. I included comments in the code to explain how the parsing is done.</p><blockquote><strong>Code Tip:</strong><br>An easy way to write code for a new dataset is to copy <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/coco.py">coco.py</a> and modify it to your needs, which is what I did. I saved the new file as <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py">balloon.py</a>.</blockquote><p>My BalloonDataset class looks like this:</p><pre>class <strong>BalloonDataset</strong>(utils.Dataset):</pre><pre>    def <strong>load_balloons</strong>(self, dataset_dir, subset):<br>        ...</pre><pre>    def <strong>load_mask</strong>(self, image_id):<br>        ...</pre><pre>    def <strong>image_reference</strong>(self, image_id):<br>        ...</pre><p>load_balloons reads the JSON file, extracts the annotations, and iteratively calls the internal add_class and add_image functions to build the dataset.</p><p>load_mask generates bitmap masks for every object in the image by drawing the polygons.</p><p>image_reference simply returns a string that identifies the image for debugging purposes. Here it returns the path of the image file.</p><p>You might have noticed that my class doesn’t contain functions to load images or return bounding boxes. The default load_image function in the base Dataset class handles loading images, and bounding boxes are generated dynamically from the masks.</p><blockquote><strong>Code Tip:</strong><br>Your dataset might not be in JSON. 
My BalloonDataset class reads JSON because that’s what the VIA tool generates. There’s no need to convert your dataset to a format similar to COCO or VIA. Instead, write your own Dataset class to load whichever format your dataset comes in. See the <a href="https://github.com/matterport/Mask_RCNN/tree/master/samples">samples</a> and notice how each uses its own Dataset class.</blockquote><h4>Verify the Dataset</h4><p>To verify that my new code is implemented correctly, I added this <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/inspect_balloon_data.ipynb">Jupyter notebook</a>. It loads the dataset, visualizes masks and bounding boxes, and visualizes the anchors to verify that my anchor sizes are a good fit for my object sizes. Here is an example of what you should expect to see:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/831/1*OKE6wyZFfh2f_aZ3rd9BRw.png" /><figcaption>Sample from inspect_balloon_data notebook</figcaption></figure><blockquote><strong>Code Tip:</strong><br>To create this notebook I copied <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/inspect_data.ipynb">inspect_data.ipynb</a>, which we wrote for the COCO dataset, and modified one block of code at the top to load the balloon dataset instead.</blockquote><h3>Configurations</h3><p>The configurations for this project are similar to the base configuration used to train the COCO dataset, so I just needed to override three values. As I did with the Dataset class, I inherit from the base Config class and add my overrides:</p><pre>class BalloonConfig(Config):</pre><pre>    # Give the configuration a recognizable name<br>    NAME = &quot;balloons&quot;</pre><pre>    # Number of classes (including background)<br>    NUM_CLASSES = 1 + 1  # Background + balloon</pre><pre>    # Number of training steps per epoch<br>    STEPS_PER_EPOCH = 100</pre><p>The base configuration uses input images of size 1024x1024 px for best accuracy. 
I kept it that way. My images are a bit smaller, but the model resizes them automatically.</p><blockquote><strong>Code Tip:</strong><br>The base Config class is in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/config.py">config.py</a>. And BalloonConfig is in <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py#L61">balloon.py</a>.</blockquote><h3>Training</h3><p>Mask R-CNN is a fairly large model, especially since our implementation uses ResNet101 and FPN, so you need a modern GPU with 12GB of memory. It might work on less, but I haven’t tried. I used <a href="https://aws.amazon.com/ec2/instance-types/p2/">Amazon’s P2 instances</a> to train this model, and given the small dataset, training takes less than an hour.</p><p>Start the training with this command, running from the balloon directory. Here, we’re specifying that training should start from the pre-trained COCO weights. The code will download the weights from our repository automatically:</p><pre>python3 balloon.py train --dataset=/path/to/dataset <strong>--model=coco</strong></pre><p>And to resume training if it stopped:</p><pre>python3 balloon.py train --dataset=/path/to/dataset <strong>--model=last</strong></pre><blockquote><strong>Code Tip:<br></strong>In addition to balloon.py, the repository has three more examples: <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/shapes/train_shapes.ipynb">train_shapes.ipynb</a> which trains a toy model to detect geometric shapes, <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/coco.py">coco.py</a> which trains on the COCO dataset, and <a href="https://github.com/matterport/Mask_RCNN/tree/master/samples/nucleus">nucleus</a> which segments nuclei in microscopy images.</blockquote><h3>Inspecting the Results</h3><p>The <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/inspect_balloon_model.ipynb">inspect_balloon_model</a> notebook shows the 
results generated by the trained model. Check the notebook for more visualizations and a step-by-step walkthrough of the detection pipeline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/812/1*BvqnziHW514YyO20UNtS3g.png" /></figure><blockquote><strong>Code Tip:</strong><br>This notebook is a simplified version of <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/inspect_model.ipynb">inspect_model.ipynb</a>, which includes visualizations and debugging code for the COCO dataset.</blockquote><h3>Color Splash</h3><p>Finally, now that we have object masks, let’s use them to apply the color splash effect. The method is really simple: create a grayscale version of the image, and then, in areas marked by the object mask, copy back the color pixels from the original image. Here is an example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/942/1*iPAtWFnShPhX5atbY3V0pQ.png" /></figure><blockquote><strong>Code Tip:</strong><br>The code that applies the effect is in the <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py#L201">color_splash()</a> function. 
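The effect itself fits in a few lines. This pure-NumPy sketch mirrors the grayscale-then-copy-back method described above; the luminance weights are a common convention for grayscale conversion, not necessarily what the repo uses.

```python
import numpy as np

def color_splash(image, mask):
    """Keep color where mask is True, grayscale elsewhere.
    image: [H, W, 3] uint8 RGB array; mask: [H, W] boolean array."""
    # Luminance-weighted grayscale, replicated to 3 channels
    gray = (image @ np.array([0.299, 0.587, 0.114]))[..., None]
    gray = np.repeat(gray, 3, axis=2).astype(image.dtype)
    # Copy the original color pixels back inside the mask
    return np.where(mask[..., None], image, gray)

image = np.zeros((4, 4, 3), dtype=np.uint8)
image[..., 0] = 200  # a solid red image
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # pretend this is a detected balloon
splash = color_splash(image, mask)
```

Because the mask is boolean, np.where does all the per-pixel selection in one vectorized step, so the same function works unchanged on video frames.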
And <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py#L221">detect_and_color_splash()</a> handles the whole process: loading the image, running instance segmentation, and applying the color splash filter.</blockquote><h3>FAQ</h3><ul><li><strong>Q:</strong> I want to dive deeper and understand the details. What should I read?<br><strong>A:</strong> Read these papers in this order: <a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=AF8817DD0F70B32AA08B2ECBBA8099FA?doi=10.1.1.715.2453&amp;rep=rep1&amp;type=pdf">RCNN (pdf)</a>, <a href="https://arxiv.org/abs/1504.08083">Fast RCNN</a>, <a href="https://arxiv.org/abs/1506.01497">Faster RCNN</a>, <a href="https://arxiv.org/abs/1612.03144">FPN</a>, <a href="https://arxiv.org/abs/1703.06870">Mask RCNN</a>.</li><li><strong>Q:</strong> Where can I ask more questions?<br><strong>A:</strong> The <a href="https://github.com/matterport/Mask_RCNN/issues">Issues page on GitHub</a> is active; you can use it for questions as well as to report issues. Remember to search closed issues as well in case your question has been answered already.</li><li><strong>Q:</strong> Can I contribute to this project?<br><strong>A:</strong> That would be great. Pull Requests are always welcome.</li><li><strong>Q:</strong> Can I join your team and work on fun projects like this one?<br><strong>A:</strong> Yes, we’re hiring for deep learning and computer vision. 
<a href="https://matterport.com/careers/">Apply here</a>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/937/1*w_ownWZZ38QhiVjVU757DA.png" /></figure><hr><p><a href="https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46">Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow</a> was originally published in <a href="https://engineering.matterport.com">Matterport Engineering Techblog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Traffic Sign Recognition with TensorFlow]]></title>
            <link>https://medium.com/@waleedka/traffic-sign-recognition-with-tensorflow-629dffc391a6?source=rss-1a69ae209bc4------2</link>
            <guid isPermaLink="false">https://medium.com/p/629dffc391a6</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[self-driving-cars]]></category>
            <category><![CDATA[tensorflow]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <dc:creator><![CDATA[Waleed Abdulla]]></dc:creator>
            <pubDate>Sat, 17 Dec 2016 03:29:27 GMT</pubDate>
            <atom:updated>2016-12-17T03:29:27.831Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Yes officer, I saw the speed limit sign. I just didn’t see you.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*e0UlsRVfTM2xw_uVWTsPVg.png" /></figure><p>This is <strong>part 1</strong> of a series about building a deep learning model to recognize traffic signs. It’s intended to be a learning experience, for myself and for anyone else who’d like to follow along. There are a lot of resources that cover the theory and math of neural networks, so I’ll focus on the practical aspects instead. I’ll describe my own experience building this model and share the <a href="https://github.com/waleedka/traffic-signs-tensorflow">source code</a> and relevant materials. This is suitable for those who already know Python and the basics of machine learning, but want hands-on experience building a real application.</p><p>In this part, I’ll talk about image classification and I’ll keep the model as simple as possible. In later parts, I’ll cover convolutional networks, data augmentation, and object detection.</p><h3>Setup</h3><p>The source code is available in this <a href="https://github.com/waleedka/traffic-signs-tensorflow/blob/master/notebook1.ipynb">Jupyter notebook</a>. I’m using Python 3.5 and TensorFlow 0.12. If you prefer to run the code in Docker, you can use my <a href="https://hub.docker.com/r/waleedka/modern-deep-learning/">Docker image that contains many popular deep learning tools</a>. Run it with this command:</p><pre>docker run -it -p 8888:8888 -p 6006:6006 -v ~/traffic:/traffic waleedka/modern-deep-learning</pre><p>Note that my project directory is in <strong>~/traffic</strong> and I’m mapping it to the <strong>/traffic</strong> directory in the Docker container. 
Modify this if you’re using a different directory.</p><h3>Finding Training Data</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/423/1*HBllwUCObQcZ1OfuuHwVfA.jpeg" /></figure><p>My first challenge was finding a good training dataset. Traffic sign recognition is a well studied problem, so I figured I’d find something online.</p><p>I started by googling “traffic sign dataset” and found several options. I picked the <a href="http://btsd.ethz.ch/shareddata/">Belgian Traffic Sign Dataset</a> because it was big enough to train on, and yet small enough to be easy to work with.</p><p>You can download the dataset from <a href="http://btsd.ethz.ch/shareddata/">http://btsd.ethz.ch/shareddata/</a>. There are a lot of datasets on that page, but you only need the two files listed under <strong>BelgiumTS for Classification (cropped images)</strong>:</p><ul><li>BelgiumTSC_Training (171.3MBytes)</li><li>BelgiumTSC_Testing (76.5MBytes)</li></ul><p>After expanding the files, this is my directory structure. Try to match it so you can run the code without having to change the paths:</p><pre>/traffic/datasets/BelgiumTS/Training/<br>/traffic/datasets/BelgiumTS/Testing/</pre><p>Each of the two directories contains 62 subdirectories, named sequentially from <strong>00000</strong> to <strong>00061</strong>. The directory names represent the labels, and the images inside each directory are samples of each label.</p><h3>Exploring the Dataset</h3><p>Or, if you prefer to sound more formal: do Exploratory Data Analysis. It’s tempting to skip this part, but I’ve found that the code I write to examine the data ends up being used a lot throughout the project. I usually do this in Jupyter notebooks and share them with the team. Knowing your data well from the start saves you a lot of time later.</p><p>The images in this dataset are in an old .ppm format. So old, in fact, that most tools don’t support it. 
That meant I couldn’t casually browse the folders to take a look at the images. Luckily, the <a href="http://scikit-image.org/">Scikit Image library</a> recognizes this format. This code will load the data and return two lists: images and labels.</p><pre>def load_data(data_dir):<br>    # Get all subdirectories of data_dir. Each represents a label.<br>    directories = [d for d in os.listdir(data_dir) <br>                   if os.path.isdir(os.path.join(data_dir, d))]</pre><pre>    # Loop through the label directories and collect the data in<br>    # two lists, labels and images.<br>    labels = []<br>    images = []<br>    for d in directories:<br>        label_dir = os.path.join(data_dir, d)<br>        file_names = [os.path.join(label_dir, f) <br>                      for f in os.listdir(label_dir) <br>                      if f.endswith(&quot;.ppm&quot;)]<br>        for f in file_names:<br>            images.append(skimage.data.imread(f))<br>            labels.append(int(d))<br>    return images, labels</pre><pre><br>images, labels = load_data(train_data_dir)</pre><p>This is a small dataset so I’m loading everything into RAM to keep it simple. For larger datasets, you’d want to load the data in batches.</p><p>After loading the images into Numpy arrays, I display a sample image of each label. <a href="https://github.com/waleedka/traffic-signs-tensorflow/blob/master/notebook1.ipynb">See code in the notebook</a>. This is our dataset:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/881/1*JqMjJ2u-9Blnzh0dEYbiRw.png" /><figcaption>The training set consists of 62 classes. The numbers in parentheses are the count of images in each class.</figcaption></figure><p>Looks like a good training set. The image quality is great, and there are a variety of angles and lighting conditions. 
More importantly, the traffic signs occupy most of the area of each image, which allows me to focus on object classification and not have to worry about finding the location of the traffic sign in the image (object detection). I’ll get to object detection in a future post.</p><p>The first thing I noticed from the samples above is that the images are square-ish, but have different aspect ratios. My neural network will take a fixed-size input, so I have some preprocessing to do. I’ll get to that soon, but first let’s pick one label and see more of its images. Here is an example of label 32:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/1*fEA7NL-2pumD3ygVeah-Vw.png" /><figcaption>Several sample images of label 32</figcaption></figure><p>It looks like the dataset considers all speed limit signs to be of the same class, regardless of the numbers on them. That’s fine, as long as we know about it beforehand and know what to expect. That’s why understanding your dataset is so important and can save you a lot of pain and confusion later.</p><p>I’ll leave exploring the other labels to you. Labels 26 and 27 are interesting to check. They also have numbers in red circles, so the model will have to get really good to differentiate between them.</p><h3><strong>Handling Images of Different Sizes</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/431/1*8H4avgWf_AGJ6yQyqfNnhQ.png" /><figcaption>Resizing images to a similar size and aspect ratio</figcaption></figure><p>Most image classification networks expect images of a fixed size, and our first model will as well. So we need to resize all the images to the same size.</p><p>But since the images have different aspect ratios, some of them will be stretched vertically or horizontally. Is that a problem? I don’t think it is in this case, because the differences in aspect ratios are not that large.
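As an aside, the resize itself is a library one-liner in practice (skimage’s transform.resize is the natural fit alongside the rest of this code), but the idea is easy to sketch in plain numpy. This nearest-neighbor version is purely illustrative, not the code from the notebook:

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    # Nearest-neighbor resize: each output pixel copies its closest source pixel.
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]

# A fake 141x142 RGB image, one of the sizes that appears in this dataset.
img = np.zeros((141, 142, 3), dtype=np.uint8)
small = resize_nearest(img, 32, 32)
print(small.shape)  # (32, 32, 3)
```

Note that this stretches rather than crops, which is exactly the trade-off discussed above.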
My own criterion is that if a person can recognize the images when they’re stretched, then the model should be able to as well.</p><p>What are the sizes of the images anyway? Let’s print a few examples:</p><pre>for image in images[:5]:<br>    print(&quot;shape: {0}, min: {1}, max: {2}&quot;.format(<br>          image.shape, image.min(), image.max()))</pre><pre><strong>Output:</strong><br>shape: (141, 142, 3), min: 0, max: 255<br>shape: (120, 123, 3), min: 0, max: 255<br>shape: (105, 107, 3), min: 0, max: 255<br>shape: (94, 105, 3), min: 7, max: 255<br>shape: (128, 139, 3), min: 0, max: 255</pre><p>The sizes seem to hover around 128x128. I could use that size to preserve as much information as possible, but in early development I prefer a smaller size because it leads to faster training, which allows me to iterate faster. I experimented with 16x16 and 20x20, but they were too small. I ended up picking 32x32, which is easy to recognize (see below) and reduces the size of the model and training data by a factor of 16 compared to 128x128.</p><p>I’m also in the habit of printing the <em>min()</em> and <em>max()</em> values often. It’s a simple way to verify the range of the data and catch bugs early. This tells me that the image colors are in the standard range of 0–255.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/870/1*kgqOQPfvQNn1ucKr9x4GIw.png" /><figcaption>Images resized to 32x32</figcaption></figure><h3>Minimum Viable Model</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/515/1*27--e0LEOkOgtLETOWcQDw.png" /></figure><p>We’re getting to the interesting part! Continuing the theme of keeping it simple, I started with the simplest possible model: a one-layer network that consists of one neuron per label.</p><p>This network has 62 neurons, and each neuron takes the RGB values of all pixels as input. Effectively, each neuron receives 32*32*3=3072 inputs.
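To make those numbers concrete, the whole one-layer model amounts to a single matrix multiply. Here is a numpy sketch using this model’s shapes; the random values are just placeholders for the weights that training will learn:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 3072))           # a batch of 4 flattened 32x32x3 images
W = rng.random((3072, 62)) * 0.01   # one weight column per label
b = np.zeros(62)                    # one bias per label
y = x @ W + b                       # one score (logit) per label, per image
print(y.shape)  # (4, 62)
```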
This is a <em>fully-connected layer</em> because every neuron connects to every input value. You’re probably familiar with its equation:</p><pre>y = xW + b</pre><p>I start with a simple model because it’s easy to explain, easy to debug, and fast to train. Once this works end to end, expanding on it is much easier than building something complex from the start.</p><h3>Building the TensorFlow Graph</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/367/1*4VT1QyeSy1rRkl-_1nBvNA.png" /><figcaption>Visualization of a part of a TensorFlow graph</figcaption></figure><p>TensorFlow encapsulates the architecture of a neural network in an execution graph. The graph consists of operations (Ops for short) such as Add, Multiply, and Reshape. These ops perform actions on data in tensors (multidimensional arrays).</p><p>I’ll go through the code to build the graph step by step below, but here is the full code if you prefer to scan it first:</p><p><a href="https://medium.com/media/bb6c383f00ae2bf2980507b69e6c0776/href">https://medium.com/media/bb6c383f00ae2bf2980507b69e6c0776/href</a></p><p>First, I create the Graph object. TensorFlow has a default global graph, but I don’t recommend using it. Global variables are bad in general because they make it too easy to introduce bugs. I prefer to create the graph explicitly.</p><pre>graph = tf.Graph()</pre><p>Then I define <strong>Placeholders</strong> for the images and labels. Placeholders are TensorFlow’s way of receiving input from the main program. Notice that I create the placeholders (and all other ops) inside the <strong>with graph.as_default()</strong> block.
This is so they become part of my graph object rather than the global graph.</p><pre><strong>with</strong> graph.as_default():<br>    images_ph = tf.placeholder(tf.float32, [<strong>None</strong>, 32, 32, 3])<br>    labels_ph = tf.placeholder(tf.int32, [<strong>None</strong>])</pre><p>The shape of the <strong>images_ph</strong> placeholder is <strong>[None, 32, 32, 3]</strong>. It stands for <strong>[batch size, height, width, channels]</strong> (often shortened to NHWC). The <strong>None</strong> for batch size means that the batch size is flexible, so we can feed different batch sizes to the model without having to change the code. Pay attention to the order of your inputs, because some models and frameworks use a different arrangement, such as NCHW.</p><p>Next, I define the fully-connected layer. Rather than implementing the raw equation, <strong>y = xW + b</strong>, I use a handy function that does that in one line and also applies the activation function. It expects the input as a one-dimensional vector, though, so I flatten the images first.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/227/1*AeCSRciqQwMLZEBjP9az5w.png" /><figcaption>The ReLU function</figcaption></figure><p>I’m using the ReLU activation function here:</p><pre>f(x) = max(0, x)</pre><p>It simply converts all negative values to zeros. It’s been shown to work well in classification tasks and to train faster than sigmoid or tanh. For more background, check <a href="http://cs231n.github.io/neural-networks-1/">here</a> and <a href="https://www.quora.com/What-are-the-benefits-of-using-rectified-linear-units-vs-the-typical-sigmoid-activation-function">here</a>.</p><pre><em># Flatten input from: [None, height, width, channels]</em><br><em># To: [None, height * width * channels] == [None, 3072]</em><br>images_flat = tf.contrib.layers.flatten(images_ph)</pre><pre><em># Fully connected layer.
</em><br><em># Generates logits of size [None, 62]</em><br>logits = tf.contrib.layers.fully_connected(images_flat, 62,<br>    tf.nn.relu)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/396/1*cH8w5ZLPKzn1XSJo3IWepw.png" /><figcaption>Bar chart visualization of a logits vector</figcaption></figure><p>The output of the fully-connected layer is a <strong>logits</strong> vector of length 62 (technically, it’s <strong>[None, 62]</strong> because we’re dealing with a batch of logits vectors).</p><p>A row in the logits tensor might look like this: [0.3, 0, 0, 1.2, 2.1, .01, 0.4, …, 0, 0]. The higher the value, the more likely that the image represents that label. Logits are not probabilities, though: they can have any value, and they don’t add up to 1. The absolute values of the logits are not important, only their values relative to each other. It’s easy to convert logits to probabilities using the <strong>softmax</strong> function if needed (it’s not needed here).</p><p>In this application, we just need the index of the largest value, which corresponds to the id of the label. The <a href="https://www.tensorflow.org/api_docs/python/math_ops/sequence_comparison_and_indexing#argmax"><strong>argmax</strong></a> op does that.</p><pre># Convert logits to label indexes.<br># Shape [None], which is a 1D vector of length == batch_size.<br>predicted_labels = tf.argmax(logits, 1)</pre><p>The <strong>argmax</strong> output will be integers in the range 0 to 61.</p><h3>Loss Function and Gradient Descent</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/450/1*BLrDfa8_UqunbUeeliAj3Q.png" /><figcaption>Credit: Wikipedia</figcaption></figure><p>Choosing the right loss function is an area of research in and of itself, which I won’t delve into here other than to say that <strong>cross-entropy</strong> is the most common function for classification tasks.
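To build some intuition before the TensorFlow code, here is what softmax and cross-entropy do in plain numpy. This is an illustrative sketch, not TensorFlow’s implementation:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, true_label):
    # Negative log of the probability assigned to the correct label.
    return -np.log(softmax(logits)[true_label])

probs = softmax(np.array([0.3, 1.2, 2.1]))          # probabilities that sum to 1
low = cross_entropy(np.array([5.0, 0.0, 0.0]), 0)   # confident and right: small loss
high = cross_entropy(np.array([0.0, 5.0, 0.0]), 0)  # confident and wrong: large loss
```

Notice that the loss is small when the model is confidently right and large when it’s confidently wrong, which is exactly what training needs to push against.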
If you’re not familiar with it, there are really good explanations <a href="https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/">here</a> and <a href="http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function">here</a>.</p><p>Cross-entropy is a measure of the difference between two vectors of probabilities. So we need to convert the labels and the logits to probability vectors. The function <strong>sparse_softmax_cross_entropy_with_logits()</strong> simplifies that. It takes the generated logits and the ground-truth labels and does three things: converts the label indexes of shape <strong>[None]</strong> to one-hot vectors of shape <strong>[None, 62]</strong>, runs <strong>softmax</strong> to convert the prediction logits to probabilities, and finally calculates the cross-entropy between the two. This generates a loss vector of shape <strong>[None]</strong> (1D of length = batch size), which we pass through <strong>reduce_mean()</strong> to get a single number that represents the loss value.</p><pre>loss = tf.reduce_mean(<br>        tf.nn.sparse_softmax_cross_entropy_with_logits(<br>            logits, labels_ph))</pre><p>Choosing the optimization algorithm is another decision to make. I usually use the Adam optimizer because it’s been shown to converge faster than simple gradient descent.
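For reference, “simple gradient descent” is just repeatedly stepping against the gradient. Here is a toy example minimizing (w - 3)**2; it has nothing to do with the traffic-sign model and exists purely to show the update rule:

```python
# Plain gradient descent on f(w) = (w - 3)**2, whose minimum is at w = 3.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)            # derivative of f with respect to w
    w = w - learning_rate * grad  # step against the gradient
print(round(w, 4))  # 3.0
```

Adam follows the same pattern but adapts the step size per parameter using running estimates of the gradient’s moments, which is why it often converges faster.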
This post does a great job of <a href="http://sebastianruder.com/optimizing-gradient-descent/index.html">comparing different gradient descent optimizers</a>.</p><pre>train = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)</pre><p>The last node in the graph is the initialization op, which sets each variable to its initial value (zeros, random values, or whatever initializer it was defined with).</p><pre>init = tf.initialize_all_variables()</pre><p>Notice that the code above doesn’t execute any of the ops yet. It’s just building the graph and describing its inputs. The variables we defined above, such as <strong>init</strong>, <strong>loss</strong>, and <strong>predicted_labels</strong>, don’t contain numerical values. They are references to ops that we’ll execute next.</p><h3>Training Loop</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/311/1*42vputdkk7qHc7-Oq_gGRw.png" /></figure><p>This is where we iteratively train the model to minimize the loss function. Before we start training, though, we need to create a <strong>Session</strong> object.</p><p>I mentioned the <strong>Graph</strong> object earlier and how it holds all the Ops of the model. The <strong>Session</strong>, on the other hand, holds the values of the variables. If the graph holds the equation <strong>y=xW+b</strong>, then the session holds the actual values of <strong>W</strong> and <strong>b</strong>.</p><pre>session = tf.Session(graph=graph)</pre><p>Usually the first thing to run after starting a session is the initialization op, <strong>init</strong>, to initialize the variables.</p><pre>session.run(init)</pre><p>Then we start the training loop and run the <strong>train</strong> op repeatedly.
While not necessary, it’s useful to run the <strong>loss</strong> op as well to print its values and monitor the progress of the training.</p><pre>for i in range(201):<br>    _, loss_value = session.run(<br>        [train, loss],<br>        feed_dict={images_ph: images_a, labels_ph: labels_a})</pre><pre>    if i % 10 == 0:<br>        print(&quot;Loss: &quot;, loss_value)</pre><p>In case you’re wondering, I set the loop to 201 so that the <strong>i % 10</strong> condition is satisfied in the last round and prints the last loss value. The output should look something like this:</p><pre>Loss:  4.2588<br>Loss:  2.88972<br>Loss:  2.42234<br>Loss:  2.20074<br>Loss:  2.06985<br>Loss:  1.98126<br>Loss:  1.91674<br>Loss:  1.86652<br>Loss:  1.82595<br>...</pre><h3>Using the Model</h3><p>Now we have a trained model in memory in the <strong>Session</strong> object. To use it, we call <strong>session.run()</strong> just like in the training code. The <strong>predicted_labels</strong> op returns the output of the <strong>argmax()</strong> function, so that’s what we need to run. Here I classify 10 random images and print both the predictions and the ground-truth labels for comparison.</p><pre># Pick 10 random images<br>sample_indexes = random.sample(range(len(images32)), 10)<br>sample_images = [images32[i] for i in sample_indexes]<br>sample_labels = [labels[i] for i in sample_indexes]</pre><pre># Run the &quot;predicted_labels&quot; op.<br>predicted = session.run(predicted_labels,<br>                        {images_ph: sample_images})<br>print(sample_labels)<br>print(predicted)</pre><pre><strong>Output:<br></strong>[15, 22, 61, 44, 32, 22, 57, 38, 56, 38]<br>[14  22  61  44  32  22  56  38  56  38]</pre><p>In the notebook, I include a function to visualize the results as well.
It generates something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/750/1*JAzIY34jJUrhbfagvr-1LA.png" /></figure><p>The visualization shows that the model is working, but doesn’t quantify how accurate it is. And you might’ve noticed that it’s classifying the training images, so we don’t know yet if the model generalizes to images that it hasn’t seen before. Next, we calculate a better evaluation metric.</p><h3>Evaluation</h3><p>To properly measure how the model generalizes to data it hasn’t seen, I do the evaluation on test data that I didn’t use in training. The BelgiumTS dataset makes this easy by providing two separate sets, one for training and one for testing.</p><p>In the notebook I load the test set, resize the images to 32x32, and then calculate the accuracy. This is the relevant part of the code that calculates the accuracy:</p><pre># Run predictions against the full test set.<br>predicted = session.run(predicted_labels,<br>                        feed_dict={images_ph: test_images32})<br># Calculate how many matches we got.<br>match_count = sum([int(y == y_)<br>                   for y, y_ in zip(test_labels, predicted)])<br>accuracy = match_count / len(test_labels)<br>print(&quot;Accuracy: {:.3f}&quot;.format(accuracy))</pre><p>The accuracy I get in each run ranges between <strong>0.40</strong> and <strong>0.70</strong>, depending on which minimum the optimization lands in. This is expected when running a simple model like this one. In a future post I’ll talk about ways to improve the consistency of the results.</p><h3>Closing the Session</h3><p>Congratulations! We have a working simple neural network. Given how simple this neural network is, training takes just a minute on my laptop, so I didn’t bother saving the trained model. In the next part, I’ll add code to save and load trained models and expand to use multiple layers, convolutional networks, and data augmentation.
Stay tuned!</p><pre># Close the session. This will destroy the trained model.<br>session.close()</pre>]]></content:encoded>
        </item>
    </channel>
</rss>