Splitting GeoJSON files using geojsplit

Sebastien Tremblay — Tue, 10 Sep 2019 03:36:56 GMT

I recently found Microsoft’s publicly available AI-generated building footprint dataset for all 50 states in the USA (and now ~12 million buildings in Canada as well🍁). This is a huge trove of data, which got me really excited! Unfortunately, processing very large JSON/GeoJSON documents tends to be very memory intensive, as GDAL’s ogr2ogr will naively attempt to read the entire file into memory. In my case I needed to reproject the California building footprints to another spatial reference system (EPSG:4326 ➡️ EPSG:32610). Because of the memory constraints, my local workstation quickly became overwhelmed.

Introducing geojsplit

In my efforts to reduce my workstation's workload, I found https://github.com/woodb/geojsplit, a Node package which splits GeoJSON files into smaller files, similar to GNU split. Because of my development environment at work, it is more convenient for me to use Linux or Python command line tools. Seeing that a Python implementation did not exist, I spent my weekend creating one!

Geojsplit is inspired by the original node package, though it is not a one-to-one re-implementation. To get started,

$ pip install geojsplit
$ geojsplit -h
usage: geojsplit [-h] [-l GEOMETRY_COUNT] [-a SUFFIX_LENGTH] [-o OUTPUT]
                 [-n LIMIT] [-v] [-d] [--version]
                 geojson

Split a geojson file into many geojson files.

positional arguments:
  geojson               filename of geojson file to split

optional arguments:
  -h, --help            show this help message and exit
  -l GEOMETRY_COUNT, --geometry-count GEOMETRY_COUNT
                        the number of features to be distributed to each file.
  -a SUFFIX_LENGTH, --suffix-length SUFFIX_LENGTH
                        number of characters in the suffix length for split
                        geojsons
  -o OUTPUT, --output OUTPUT
                        output directory to save split geojsons
  -n LIMIT, --limit LIMIT
                        limit number of split geojson file to at most LIMIT,
                        with GEOMETRY_COUNT number of features.
  -v, --verbose         increase output verbosity
  -d, --dry-run         see output without actually writing to file
  --version             show geojsplit version number

To split a GeoJSON file,

$ geojsplit ~/data/large.geojson

will produce large_xaaaa.geojson, large_xaaab.geojson, ... with each file containing by default 100 features in a valid feature collection. To change the number of features per file,

$ geojsplit --geometry-count 10000 ~/data/large.geojson

will produce GeoJSON files with 10000 features per feature collection. If instead you want to process only the first n batches of features from a file,

$ geojsplit -n 10 ~/data/large.geojson

will output at most 10 split files, or fewer if there are fewer than 10 x 100 features in the GeoJSON. Finally, if you have a truly monstrous GeoJSON file, where the 26^4 = 456,976 possible suffixes are not enough to guarantee a unique filename,

$ geojsplit --suffix-length 10 ~/data/large.geojson

will produce filenames like large_xaaaaaaaaaa.geojson, large_xaaaaaaaaab.geojson, ...
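
To make the suffix arithmetic concrete, here is a small sketch of how split-style names of this shape can be generated. It is only an illustration and not geojsplit's own code: the helper names split_suffixes and split_filenames are made up for this example, and the "_x" naming pattern is simply copied from the filenames shown above.

```python
from itertools import islice, product
from string import ascii_lowercase


def split_suffixes(length=4):
    """Yield every lowercase suffix of the given length: 'aaaa', 'aaab', ..."""
    for letters in product(ascii_lowercase, repeat=length):
        yield "".join(letters)


def split_filenames(stem, length=4, extension=".geojson"):
    """Yield names shaped like the ones above (hypothetical helper)."""
    for suffix in split_suffixes(length):
        yield f"{stem}_x{suffix}{extension}"


first_three = list(islice(split_filenames("large"), 3))
capacity = len(ascii_lowercase) ** 4  # number of distinct 4-letter suffixes

print(first_three)  # ['large_xaaaa.geojson', 'large_xaaab.geojson', 'large_xaaac.geojson']
print(capacity)     # 456976
```

Each extra suffix character multiplies the number of available filenames by 26, so a suffix length of 10 leaves far more room than any realistic file would need.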

Use in code

Geojsplit is not only a command line tool, however. Its (admittedly simple) backend is contained entirely in a GeoJSONBatchStreamer class, which can be easily integrated into your code for more sophisticated manipulation. The heavy lifting is done by the excellent ijson library, which allows us to iterate over parsed JSON elements. Because we are dealing with fairly uniform JSON documents (typically a feature collection with type, features, and properties attributes), we can iterate over just the features array without having to read the entire file into memory.

from geojsplit import geojsplit

geojson = geojsplit.GeoJSONBatchStreamer("/home/user/large.geojson")
for features in geojson.stream():
    for feature in features:
        do_something(feature)
        ...

Like the --geometry-count flag in the command line tool, you can specify how many features should be included in each batch with the batch keyword, e.g. geojson.stream(batch=1000).
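
For a feel of how this kind of streaming and batching can work, here is a minimal sketch. It is not geojsplit's actual implementation, and the function names batched and stream_feature_batches are hypothetical. It assumes the features live in a top-level "features" array, and it only needs ijson when a file is actually opened.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def stream_feature_batches(path, size=100):
    """Lazily yield batches of features from a GeoJSON feature collection."""
    import ijson  # third-party dependency: pip install ijson

    with open(path, "rb") as fp:
        # "features.item" selects each element of the top-level "features"
        # array, so only one feature at a time is parsed into Python objects.
        yield from batched(ijson.items(fp, "features.item"), size)


# The batching logic can be tried on any iterable, no GeoJSON file required.
sizes = [len(batch) for batch in batched(range(250), 100)]
print(sizes)  # [100, 100, 50]
```

Because both functions are generators, at most one batch of features is held in memory at a time, which is the whole point when the input file is many gigabytes.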

That’s pretty much it for now. Geojsplit is a solution to a problem that I think many people doing geospatial data analysis run into. Please check it out if you think it might be useful, and feel free to contribute, create issues, or ask questions!

GitHub: https://github.com/underchemist/geojsplit
Docs: https://geojsplit.readthedocs.io/en/latest
Twitter: https://twitter.com/underchemist

👋
