9

I am looking to implement a streaming json parser for a very, very large JSON file (~1TB) that I'm unable to load into memory. One option is to use something like https://github.com/stedolan/jq to convert the file into json-newline-delimited, but there are various other things I need to do to each json object that make this approach not ideal.

Given a very large json object, how would I be able to parse it object-by-object, similar to this approach in xml: https://www.ibm.com/developerworks/library/x-hiperfparse/index.html.

For example, in pseudocode:

import json

with open('file.json', 'r') as f:
    json_str = ''
    for line in f:  # what if there are no newlines in the json obj?
        json_str += line
        try:
            # the "is_valid" check: loads only succeeds once json_str is a complete object
            obj = json.loads(json_str)
        except ValueError:
            continue
        do_something(obj)
        json_str = ''
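
For reference, here is a rough, untested sketch of the kind of thing I have in mind for the case where there are no newlines at all, using only the standard library's json.JSONDecoder.raw_decode. It assumes the top-level value is a JSON array of objects; the function name iter_objects and the chunk size are just placeholders:

import json

def iter_objects(path, chunk_size=1024 * 1024):
    """Yield top-level objects from a huge JSON array without loading the whole file."""
    decoder = json.JSONDecoder()
    buf = ''
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                # Skip whitespace and the array punctuation ([ , ]) before the next object.
                i = 0
                while i < len(buf) and buf[i] in ' \t\r\n[,]':
                    i += 1
                buf = buf[i:]
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    break  # object is still incomplete -- read another chunk
                yield obj
                buf = buf[end:]

# usage:
# for obj in iter_objects('file.json'):
#     do_something(obj)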

Additionally, I did not find jq -c to be particularly fast (ignoring memory considerations). For example, a plain json.loads on the whole file was about as fast as (in fact slightly faster than) using jq -c. I tried using ujson as well, but kept getting a corruption error which I believe was related to the file size.

# file size is 2.2GB
>>> import json,time
>>> t0=time.time();_=json.loads(open('20190201_itunes.txt').read());print (time.time()-t0)
65.6147990227

$ time cat 20190206_itunes.txt|jq -c '.[]' > new.json
real    1m35.538s
user    1m25.109s
sys 0m15.205s

Finally, here is an example 100KB json input which can be used for testing: https://hastebin.com/ecahufonet.json

  • As far as I'm aware, there is no good story in Python here. The ideal approach would be to change the process which generates the 1TB json blob to use a format more convenient for streaming, such as jsonlines. Commented Feb 6, 2019 at 18:25
  • @wim these are user/client-generated files, so I have no control over them. Commented Feb 6, 2019 at 18:25
  • Why can't you try pd.read_json() with the chunksize option? Commented Feb 6, 2019 at 18:28
  • Can you use something like jq and preprocess the document so it's valid JSONLines? Commented Feb 6, 2019 at 18:33
  • @Nusrath could you please clarify how you'd use the chunksize option approach? Is it able to work with the above input? Commented Feb 7, 2019 at 19:56

2 Answers

-1

Consider converting this json into a filesystem tree (folders & files), so that every json object is converted to a folder that contains files:

  • name.txt - contains the name of the property in the parent folder (json object); the value of that property is the current folder (json object)
  • properties_000000001.txt
  • properties_000000002.txt

    ....

Every properties_X.txt file contains at most N (a fixed limit) lines, each of the form property_name: property_value, e.g.:

  • "number_property": 100
  • "boolean_property": true
  • "object_property": folder(folder_0000001)
  • "array_property": folder(folder_000002)

where folder_0000001 and folder_000002 are names of local subfolders

Every array is likewise converted to a folder with files (a rough code sketch of this layout follows the lists below):

  • name.txt
  • elements_0000000001.txt
  • elements_0000000002.txt

    ....
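
Here is a rough, untested sketch of how such a tree could be written out for an already-parsed value; the helper names (json_to_tree, _write_numbered) and the per-file line limit are illustrative only, and a real 1TB conversion would of course have to build the tree incrementally from a streaming parser rather than from a fully loaded object:

import itertools
import json
import os

_counter = itertools.count(1)

def _write_numbered(folder, prefix, lines, per_file=1000):
    # Split the lines across numbered files, at most per_file lines per file.
    chunks = [lines[i:i + per_file] for i in range(0, len(lines), per_file)] or [[]]
    for n, chunk in enumerate(chunks, start=1):
        path = os.path.join(folder, '%s_%09d.txt' % (prefix, n))
        with open(path, 'w') as f:
            f.write('\n'.join(chunk))

def json_to_tree(value, folder, name=None):
    os.makedirs(folder, exist_ok=True)
    if name is not None:
        # name.txt records the property name (or array index) in the parent
        with open(os.path.join(folder, 'name.txt'), 'w') as f:
            f.write(str(name))
    if isinstance(value, dict):
        items, prefix = value.items(), 'properties'
    elif isinstance(value, list):
        items, prefix = enumerate(value), 'elements'
    else:
        return  # scalars are written as lines in the parent's files
    lines = []
    for key, val in items:
        if isinstance(val, (dict, list)):
            # nested objects/arrays become subfolders referenced by folder(...)
            sub = 'folder_%07d' % next(_counter)
            json_to_tree(val, os.path.join(folder, sub), key)
            lines.append('%s: folder(%s)' % (json.dumps(key), sub))
        else:
            lines.append('%s: %s' % (json.dumps(key), json.dumps(val)))
    _write_numbered(folder, prefix, lines)

# usage (on a small file):
# json_to_tree(json.load(open('small.json')), 'out_tree')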


2 Comments

Just so I'm clear, you're suggesting creating potentially billions and billions of files/folders just to store the json? That sounds a bit impractical... and it might even take longer just to clean up the files once things are done. Or am I missing something?
You said it's quite a large json (~1TB). You also mentioned converting the file into json-newline-delimited, but couldn't the entire json end up as a single line after such a conversion? I proposed storing it as folders & files so you can keep a small view of the entire json in your program while keeping the rest on disk. It's also not clear whether the depth of the json (depth of nestings {{{}}}) is bounded, as well as what you plan to do with the parsed json or during parsing/navigating through it.
-2

If the file contains one large JSON object (either array or map), then per the JSON spec, you must read the entire object before you can access its components.

If, for instance, the file is an array of objects [ {...}, {...} ], then newline-delimited JSON is far more efficient, since you only have to keep one object in memory at a time and the parser only has to read one line before it can begin processing.

If you need to keep track of some of the objects for later use during parsing, I suggest creating a dict to hold those specific records of running values as you iterate the file.

Say you have newline-delimited JSON like

{"timestamp": 1549480267882, "sensor_val": 1.6103881016325283}
{"timestamp": 1549480267883, "sensor_val": 9.281329310309406}
{"timestamp": 1549480267883, "sensor_val": 9.357327083443344}
{"timestamp": 1549480267883, "sensor_val": 6.297722749124474}
{"timestamp": 1549480267883, "sensor_val": 3.566667175421604}
{"timestamp": 1549480267883, "sensor_val": 3.4251473635178655}
{"timestamp": 1549480267884, "sensor_val": 7.487766674770563}
{"timestamp": 1549480267884, "sensor_val": 8.701853236245032}
{"timestamp": 1549480267884, "sensor_val": 1.4070662393018396}
{"timestamp": 1549480267884, "sensor_val": 3.6524325449499995}
{"timestamp": 1549480455646, "sensor_val": 6.244199614422415}
{"timestamp": 1549480455646, "sensor_val": 5.126780276231609}
{"timestamp": 1549480455646, "sensor_val": 9.413894020722314}
{"timestamp": 1549480455646, "sensor_val": 7.091154829208067}
{"timestamp": 1549480455647, "sensor_val": 8.806417239029447}
{"timestamp": 1549480455647, "sensor_val": 0.9789474417767674}
{"timestamp": 1549480455647, "sensor_val": 1.6466189633300243}

You can process this with

import json
from collections import deque

# RingBuffer from https://www.daniweb.com/programming/software-development/threads/42429/limit-size-of-a-list
class RingBuffer(deque):
    def __init__(self, size):
        deque.__init__(self)
        self.size = size

    def full_append(self, item):
        deque.append(self, item)
        # full, pop the oldest item, left most item
        self.popleft()

    def append(self, item):
        deque.append(self, item)
        # max size reached, append becomes full_append
        if len(self) == self.size:
            self.append = self.full_append

    def get(self):
        """returns a list of size items (newest items)"""
        return list(self)


def proc_data():
    # Declare some state management in memory to keep track of whatever you want
    # as you iterate through the objects
    metrics = {
        'latest_timestamp': 0,
        'last_3_samples': RingBuffer(3)
    }

    with open('test.json', 'r') as infile:        
        for line in infile:
            # Load each line
            line = json.loads(line)
            # Do stuff with your running metrics
            metrics['last_3_samples'].append(line['sensor_val'])
            if line['timestamp'] > metrics['latest_timestamp']:
                metrics['latest_timestamp'] = line['timestamp']

    return metrics

print(proc_data())

11 Comments

@Elif -- could you show an example of using the last method, with some test json?
@Elif -- thanks for the update, but this already assumes newline-delimited json. Suppose it is a json object with zero newlines in it; that is the scenario I have. So this line would not work: for line in infile: line = json.loads(line)
You mentioned "One option is to use something like github.com/stedolan/jq to convert the file into json-newline-delimited, but there are various other things I need to do to each json object, that makes this approach not ideal." My proposed pattern using metrics presumes that you have already converted to newline delimited json using cat a.json | jq -c '.[]' . If the processing pattern works for your use case then you should use newline delimited JSON going forward. I apologize for the confusion.
Where does the JSON spec say that you must read the entire object before you can access its components? (@EilifMikkelsen, thanks for your answer, but I wonder whether that statement is correct. Is all JSON streaming processing not JSON spec conform?)
So if you are scanning for sibling objects of b, you have to parse the entire file, but if you are only interested in items inside b's array, it's totally valid for a to be an incomplete object (and b an incomplete array, which you are streaming; that's the purpose). With that in mind, streaming json is totally valid as long as you accept that objects may be incomplete.