<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Sebastien Tremblay on Medium]]></title>
        <description><![CDATA[Stories by Sebastien Tremblay on Medium]]></description>
        <link>https://medium.com/@underchemist?source=rss-5068b77c3dbd------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*a95RofR7Xm_BFIz17-r_wg.jpeg</url>
            <title>Stories by Sebastien Tremblay on Medium</title>
            <link>https://medium.com/@underchemist?source=rss-5068b77c3dbd------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 16 Apr 2026 16:03:05 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@underchemist/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Splitting GeoJSON files using geojsplit]]></title>
            <link>https://medium.com/@underchemist/splitting-geojson-files-using-geojsplit-8ff72ec68c67?source=rss-5068b77c3dbd------2</link>
            <guid isPermaLink="false">https://medium.com/p/8ff72ec68c67</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[gis]]></category>
            <category><![CDATA[nodejs]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Sebastien Tremblay]]></dc:creator>
            <pubDate>Tue, 10 Sep 2019 03:36:56 GMT</pubDate>
            <atom:updated>2019-09-10T03:38:05.112Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Snapshot of buildings over Montreal, QC. Credit to https://blogs.bing.com/maps/2019-03/microsoft-releases-12-million-canadian" src="https://cdn-images-1.medium.com/max/1024/1*U87h9zUhaLBLZkwppwwSng.png" /></figure><p>I recently found Microsoft’s publicly available, AI-generated building footprint <a href="https://github.com/microsoft/USBuildingFootprints">dataset</a> for all 50 states in the USA (and now ~12 million buildings in Canada as well🍁). This is a huge trove of data, which got me really excited! Unfortunately, processing very large JSON/GeoJSON documents tends to be very memory intensive, as GDAL’s ogr2ogr will naively attempt to read the entire file into memory. In my case, I needed to reproject the California building footprints to another spatial reference system (EPSG:4326 ➡️ EPSG:32610). Because of the memory constraints, my local workstation quickly became overwhelmed.</p><h3>Introducing geojsplit</h3><p>In my efforts to reduce my workstation workload, I found <a href="https://github.com/woodb/geojsplit">https://github.com/woodb/geojsplit</a>, a Node package that splits GeoJSON files into smaller files, similar to GNU split. Because of my development environment at work, it is more convenient for me to use Linux or Python command line tools. Seeing that a Python implementation did not exist, I spent my weekend creating one!</p><p><a href="https://github.com/underchemist/geojsplit">Geojsplit</a> is inspired by the original Node package, though it is not a one-to-one re-implementation. 
To get started,</p><pre>$ pip install geojsplit<br>$ geojsplit -h<br>usage: geojsplit [-h] [-l GEOMETRY_COUNT] [-a SUFFIX_LENGTH] [-o OUTPUT]<br>                 [-n LIMIT] [-v] [-d] [--version]<br>                 geojson<br><br>Split a geojson file into many geojson files.<br><br>positional arguments:<br>  geojson               filename of geojson file to split<br><br>optional arguments:<br>  -h, --help            show this help message and exit<br>  -l GEOMETRY_COUNT, --geometry-count GEOMETRY_COUNT<br>                        the number of features to be distributed to each file.<br>  -a SUFFIX_LENGTH, --suffix-length SUFFIX_LENGTH<br>                        number of characters in the suffix length for split<br>                        geojsons<br>  -o OUTPUT, --output OUTPUT<br>                        output directory to save split geojsons<br>  -n LIMIT, --limit LIMIT<br>                        limit number of split geojson file to at most LIMIT,<br>                        with GEOMETRY_COUNT number of features.<br>  -v, --verbose         increase output verbosity<br>  -d, --dry-run         see output without actually writing to file<br>  --version             show geojsplit version number</pre><p>To split a GeoJSON file,</p><pre>$ geojsplit ~/data/large.geojson</pre><p>will produce large_xaaaa.geojson, large_xaaab.geojson, ... with each file containing 100 features by default in a valid feature collection. To change the number of features per file,</p><pre>$ geojsplit --geometry-count 10000 ~/data/large.geojson</pre><p>will produce GeoJSON files with 10000 features per feature collection. If you instead want to process only the first n batches of features from a file,</p><pre>$ geojsplit -n 10 ~/data/large.geojson</pre><p>will output at <em>most</em> 10 split files, or fewer if there are fewer than 10 x 100 features in the GeoJSON. 
Finally, if you truly have a monster of a GeoJSON file, where the 26^4 = 456,976 unique four-letter suffixes are not enough to guarantee a unique filename,</p><pre>$ geojsplit --suffix-length 10 ~/data/large.geojson</pre><p>will produce filenames like large_xaaaaaaaaaa.geojson, large_xaaaaaaaaab.geojson, ...</p><h3>Use in code</h3><p>Geojsplit is not only a command line tool, however. Its (admittedly simple) backend is contained entirely in a GeoJSONBatchStreamer class, which can be easily integrated into your code for more sophisticated manipulation. The heavy lifting is done by the excellent <a href="https://github.com/ICRAR/ijson">ijson</a> library, which allows us to iterate over parsed JSON elements. Because we are dealing with fairly uniform JSON documents (typically a feature collection with type, features, and properties attributes), we can iterate over just the features array without having to read the entire file into memory.</p><pre>from geojsplit import geojsplit</pre><pre>geojson = geojsplit.GeoJSONBatchStreamer(&quot;/home/user/large.geojson&quot;)<br>for features in geojson.stream():<br>    for feature in features:<br>        do_something(feature)<br>        ...</pre><p>Like the --geometry-count flag in the command line tool, you can specify how many features should be included in a batch with the batch keyword.</p><p>That’s pretty much it for now. Geojsplit is a solution to a problem I think many people doing geospatial data analysis can run into. 
Please check it out if you think it might be useful, and feel free to contribute, create issues, or ask questions!</p><ul><li>GitHub: <a href="https://github.com/underchemist/geojsplit">https://github.com/underchemist/geojsplit</a></li><li>Docs: <a href="https://geojsplit.readthedocs.io/en/latest">https://geojsplit.readthedocs.io/en/latest</a></li><li>Twitter: <a href="https://twitter.com/underchemist">https://twitter.com/underchemist</a></li></ul><p>👋</p>]]></content:encoded>
        </item>
    </channel>
</rss>