Python pandas: fast way to flatten JSON into rows by a surrogate key

Question

My knowledge of packages such as pandas is fairly shallow, and I've been looking for a way to flatten data into rows. Given a list of dicts like this, with a surrogate key called entry_id:

data = [
    {
        "id": 1,
        "entry_id": 123,
        "type": "ticker",
        "value": "IBM"
    },
    {
        "id": 2,
        "entry_id": 123,
        "type": "company_name",
        "value": "International Business Machines"
    },
    {
        "id": 3,
        "entry_id": 123,
        "type": "cusip",
        "value": "01234567"
    },
    {
        "id": 4,
        "entry_id": 321,
        "type": "ticker",
        "value": "AAPL"
    },
    {
        "id": 5,
        "entry_id": 321,
        "type": "permno",
        "value": "123456"
    },
    {
        "id": 6,
        "entry_id": 321,
        "type": "company_name",
        "value": "Apple, Inc."
    },
    {
        "id": 7,
        "entry_id": 321,
        "type": "formation_date",
        "value": "1976-04-01"
    }
]

I would like to flatten the data into rows grouped by the surrogate key entry_id to look like this (empty strings or None values, doesn't matter):

[
    {"entry_id": 123, "ticker": "IBM", "permno": "", "company_name": "International Business Machines", "cusip": "01234567", "formation_date": ""},
    {"entry_id": 321, "ticker": "AAPL", "permno": "123456", "company_name": "Apple, Inc.", "cusip": "", "formation_date": "1976-04-01"}
]

I've tried using DataFrame's groupby and json_normalize, but haven't been able to get the right level of sorcery for the desired result. I could walk the data in pure Python, but I'm certain that would not be a fast solution. I'm not sure how to specify that type is the column, value is the value, and entry_id is the aggregation key. I'm open to packages other than pandas as well.

Shubham Sharma · Accepted Answer · 2021-05-27 13:57:53Z

We can create a DataFrame from the given list of records, pivot it so that entry_id becomes the index, type supplies the columns, and value supplies the cells, fill the resulting NaN values with empty strings, and then convert the pivoted frame to a list of dictionaries:

import pandas as pd

df = pd.DataFrame(data)
df.pivot(index='entry_id', columns='type', values='value').fillna('').reset_index().to_dict('records')

[{'entry_id': 123,
  'company_name': 'International Business Machines',
  'cusip': '01234567',
  'formation_date': '',
  'permno': '',
  'ticker': 'IBM'},
 {'entry_id': 321,
  'company_name': 'Apple, Inc.',
  'cusip': '',
  'formation_date': '1976-04-01',
  'permno': '123456',
  'ticker': 'AAPL'}]
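
One caveat: DataFrame.pivot raises a ValueError if the same (entry_id, type) pair appears more than once. If that can happen in your data, a rough sketch using pivot_table with aggfunc="first" keeps the first value seen for each pair. The helper name flatten_records, the shortened sample records, and the duplicate row are assumptions made for illustration; they are not part of the question's data.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data.
records = [
    {"id": 1, "entry_id": 123, "type": "ticker", "value": "IBM"},
    {"id": 2, "entry_id": 123, "type": "cusip", "value": "01234567"},
    {"id": 3, "entry_id": 321, "type": "ticker", "value": "AAPL"},
    # Hypothetical duplicate (entry_id, type) pair; DataFrame.pivot would raise here.
    {"id": 4, "entry_id": 321, "type": "ticker", "value": "AAPL.O"},
]


def flatten_records(records, fill=""):
    """Pivot long (entry_id, type, value) records into one dict per entry_id."""
    df = pd.DataFrame(records)
    # aggfunc="first" resolves duplicate pairs by keeping the first value seen.
    wide = df.pivot_table(
        index="entry_id", columns="type", values="value", aggfunc="first"
    )
    # Missing (entry_id, type) combinations come out as NaN; replace them.
    return wide.fillna(fill).reset_index().to_dict("records")


rows = flatten_records(records)
print(rows)
```

Whether keeping the first value is the right policy depends on the data; aggfunc="last" or a custom function may fit better if later rows should win.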
