32

Suppose I have a multi-level dictionary like this

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}

I'd like to access it like this

test = get_entry(mydict, 'first.second.third.fourth')

What I have so far is

def get_entry(dict, keyspec):
    keys = keyspec.split('.')

    result = dict[keys[0]]
    for key in keys[1:]:
       result = dict[key]

    return result

Are there more efficient ways to do it? According to %timeit the runtime of the function is 1.26us, while accessing the dictionary the standard way like this

foo = mydict['first']['second']['third']['fourth']

takes 541ns. I'm looking for ways to trim it to 800ns range if possible.

Thanks

5
  • 1
    Are all your intermediary dictionaries of length one? If they are, you can use a tuple key fairly efficiently. Commented Apr 20, 2018 at 16:29
  • 1
    this throws KeyError: 'second' for me Commented Apr 20, 2018 at 16:31
  • @theausome - that answer "...doesn't seem to work on nested dicts." Commented Apr 20, 2018 at 16:37
  • You have to make a few trade-offs if you want to boost performance. What is more likely to change more often - the dictionary you're traversing or the dot notation string you use to traverse? If both are frequently changing and of the same importance you're not going to get much faster than presented in @tdelaney solution. Commented Apr 25, 2018 at 15:46
  • Relevant: stackoverflow.com/questions/14692690/… Commented Apr 27, 2018 at 12:03

7 Answers 7

13

I got a 20% performance boost by tightening up the code a bit but a whopping 400% increase by using a cache for split strings. That only makes a difference if you use the same spec multiple times. Here are sample implementations and a profile script to test.

test.py

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}

# original
def get_entry(dict, keyspec):
    keys = keyspec.split('.')

    result = dict[keys[0]]
    for key in keys[1:]:
       result = result[key]

    return result

# tighten up code
def get_entry_2(mydict, keyspec):
    for key in keyspec.split('.'):
        mydict = mydict[key]
    return mydict

# use a cache
cache = {}
def get_entry_3(mydict, keyspec):
    global cache
    try:
        spec = cache[keyspec]
    except KeyError:
        spec = tuple(keyspec.split('.'))
        cache[keyspec] = spec

    for key in spec:
        mydict = mydict[key]
    return mydict

if __name__ == "__main__":
    test = get_entry(mydict, 'first.second.third.fourth')
    print(test)

profile.py

from timeit import timeit
print("original get_entry")
print(timeit("get_entry(mydict, 'first.second.third.fourth')",
    setup="from test import get_entry, mydict"))

print("get_entry_2 with tighter code")
print(timeit("get_entry_2(mydict, 'first.second.third.fourth')",
    setup="from test import get_entry_2, mydict"))

print("get_entry_3 with cache of split spec")
print(timeit("get_entry_3(mydict, 'first.second.third.fourth')",
    setup="from test import get_entry_3, mydict"))

print("just splitting a spec")
print(timeit("x.split('.')", setup="x='first.second.third.fourth'"))

The timing on my machine is

original get_entry
4.148535753000033
get_entry_2 with tighter code
3.2986323120003362
get_entry_3 with cache of split spec
1.3073233439990872
just splitting a spec
1.0949148639992927

Notice that splitting the spec is a comparatively expensive operation for this function. That's why caching helps.

Sign up to request clarification or add additional context in comments.

10 Comments

Looks like you're the only one who paid attention to performance.
@kabanus I don't get what you mean. You can get nano-second level performance with my solution as long as you pre-process your data once. Whether that can be done or not is on OP, not me.
@COLDSPEED I think the choice between yours and mine is whether lots of queries are done on one dataset or a few queries are done on many datasets.
Yes, there are tradeoffs :)
@cᴏʟᴅsᴘᴇᴇᴅ yes :) I was biased against you because it seems like cheating, but looking back, I guess I was just jealous.
|
12

There's really only one solution. Rebuild your dictionary. But do it just once.

def recursive_flatten(mydict):
    d = {}
    for k, v in mydict.items():
        if isinstance(v, dict):
            for k2, v2 in recursive_flatten(v).items():
                d[k + '.' + k2] = v2 
        else:
            d[k] = v
    return d

In [786]: new_dict = recursive_flatten(mydict); new_dict
Out[786]: {'first.second.third.fourth': 'the end'}

(Some more tests)

In [788]: recursive_flatten({'x' : {'y' : 1, 'z' : 2}, 'y' : {'a' : 5}, 'z' : 2})
Out[788]: {'x.y': 1, 'x.z': 2, 'y.a': 5, 'z': 2}

In [789]: recursive_flatten({'x' : 1, 'y' : {'x' : 234}})
Out[789]: {'x': 1, 'y.x': 234}

Every access becomes constant time from here on.

Now, just access your value using new_dict['first.second.third.fourth']. Should work for any arbitrarily nested dictionary that does not contain a self-reference.

Note that every solution has its fair share of tradeoffs, this is no exception. Unless you're firing millions of queries at your data such that preprocessing is an acceptable overhead, then this is it. With the other solutions, you are only sidestepping the problem instead of addressing it - which is dealing with the dictionary's structure. OTOH, if you're going to do this once on many such similar data structures, it make no sense to preprocess just for a single query, in which case you may prefer one of the other solutions.

8 Comments

Just a note that this seems to only allow access to the final level of nesting, you wouldn't for example be able to access new_dict['first.second']
@chrisz If needed, that can be fixed by caching res = recursive_flatten(v), updating d with d.update(res), and then iterating over res in a similar manner.
Using a dict directly really is the only fast solution.
Though in terms of space, your (extended in comments) solution would not scale nicely (read linearly).
I believe this could be a good dupe target, but since you placed the bounty, I thought to ask? stackoverflow.com/questions/14692690/…
|
11

I updated the answer from How to use a dot "." to access members of dictionary? to use an initial conversion which will then work for nested dictionaries:

You can use the following class to allow dot-indexing of dictionaries:

class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

However, this only supports nesting if all nested dictionaries are also of type dotdict. That's where the following helper function comes in:

def dct_to_dotdct(d):
    if isinstance(d, dict):
        d = dotdict({k: dct_to_dotdct(v) for k, v in d.items()})
    return d

This function has to be run once on your nested dictionary, and the result can then be indexed using dot-indexing.

Here are some examples:

In [13]: mydict
Out[13]: {'first': {'second': {'third': {'fourth': 'the end'}}}}

In [14]: mydict = dct_to_dotdct(mydict)

In [15]: mydict.first.second
Out[15]: {'third': {'fourth': 'the end'}}

In [16]: mydict.first.second.third.fourth
Out[16]: 'the end'

A note about performance: this answer is slow compared to standard dictionary access, I just wanted to present an option that actually used "dot access" to a dictionary.

Comments

7

Here is a solution similar to chrisz's, but you do not have to anything to your dict a-prior. :

class dictDotter(dict):
    def __getattr__(self,key):
        val = self[key]
        return val if type(val) != dict else dictDotter(val)

and just x=dictDotter(originalDict) will let you have arbitrary dot getting (`x.first.second...). I'll note this is twice as slow as chrisz solution, and his is 9 times as slow as yours (on my machine, approximately).

So, if you insist on making this work @tdelaney seems to have provided the only real performance improvement.

Another option that does better than what you have (in terms of run time):

class dictObjecter:
    def __init__(self,adict):
        for k,v in adict.items():
            self.__dict__[k] = v
            if type(v) == dict: self.__dict__[k] = dictObjecter(v)

which will make an object out of your dict, so dot notation is usual. This will improve run time to 3 times what you have, so not bad, but at the cost of going over your dict, and replacing it with something else.

Here is the total testing code:

from timeit import timeit

class dictObjecter:
    def __init__(self,adict):
        for k,v in adict.items():
            self.__dict__[k] = v
            if type(v) == dict: self.__dict__[k] = dictObjecter(v)

class dictDotter(dict):
    def __getattr__(self,key):
        val = self[key]
        return val if type(val) != dict else dictDotter(val)

def get_entry(dict, keyspec):
    keys = keyspec.split('.')

    result = dict[keys[0]]
    for key in keys[1:]:
        result = result[key]

    return result

class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

def dct_to_dotdct(d):
    if isinstance(d, dict):
        d = dotdict({k: dct_to_dotdct(v) for k, v in d.items()})
    return d

x = {'a':{'b':{'c':{'d':1}}}}
y = dictDotter(x)
z = dct_to_dotdct(x)
w = dictObjecter(x)
print('{:15} : {}'.format('dict dotter',timeit('y.a.b.c.d',globals=locals(),number=1000)))
print('{:15} : {}'.format('dot dict',timeit('z.a.b.c.d',globals=locals(),number=1000)))
print('{:15} : {}'.format('dict objecter',timeit('w.a.b.c.d',globals=locals(),number=1000)))
print('{:15} : {}'.format('original',timeit("get_entry(x,'a.b.c.d')",globals=locals(),number=1000)))
print('{:15} : {:.20f}'.format('best ref',timeit("x['a']['b']['c']['d']",globals=locals(),number=1000)))

I provided the last regular lookup as a best reference.The results on a Windows Ubuntu subsystem:

dict dotter     : 0.0035500000003594323
dot dict        : 0.0017939999997906853
dict objecter   : 0.00021699999979318818
original        : 0.0006629999998040148
best ref        : 0.00007999999979801942

so the is objectified dict is 3 times as slow as a regular dictionary lookup - so if speed is important, why would you want this?

3 Comments

No answer here has actually paid attention to performance, including the answer you claimed had. None of these solutions are any good if there are to be millions of accesses - it all adds up.
@cᴏʟᴅsᴘᴇᴇᴅ Hey, at least give me the "nice effort" consideration. I was trying thing that actually need a .a.b.c.d to access deeper into the maze.
Okay, you get a "nice effort" consideration from me (+1). I do like your answer, it, like all the other answers, certainly has its merits over mine.
2
+50

I had the same need, so I created the Prodict.

For your case, you can do it in one line:

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}
dotdict = Prodict.from_dict(mydict)
print(dotdict.first.second.third.fourth) # "the end"

After that, use dotdict just like a dict, because it is a subclass of dict:

dotdict.first == dotdict['first'] # True

You can also add more keys dynamically with dot notation:

dotdict.new_key = 'hooray'
print(dotdict.new_key) # "hooray"

It works even if the new keys are nested dictionaries:

dotdict.it = {'just': 'works'}
print(dotdict.it.just)  # "works"

Lastly, if you define your keys beforehand, you get auto completion and auto type conversion:

class User(Prodict):
    user_id: int
    name: str

user = User(user_id="1", "name":"Ramazan")
type(user.user_id) # <class 'int'>
# IDE will be able to auto complete 'user_id' and 'name' properties

UPDATE:

This is the test result for the same code written by @kabanus:

x = {'a': {'b': {'c': {'d': 1}}}}
y = dictDotter(x)
z = dct_to_dotdct(x)
w = dictObjecter(x)
p = Prodict.from_dict(x)

print('{:15} : {}'.format('dict dotter', timeit('y.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('prodict', timeit('p.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('dot dict', timeit('z.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('dict objecter', timeit('w.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('original', timeit("get_entry(x,'a.b.c.d')", globals=locals(), number=10000)))
print('{:15} : {:.20f}'.format('prodict getitem', timeit("p['a']['b']['c']['d']", globals=locals(), number=10000)))
print('{:15} : {:.20f}'.format('best ref', timeit("x['a']['b']['c']['d']", globals=locals(), number=10000)))

And results:

dict dotter     : 0.04535976458466595
prodict         : 0.02860781018446784
dot dict        : 0.019078164088831673
dict objecter   : 0.0017378700050722368
original        : 0.006594238310349346
prodict getitem : 0.00510931794975705289
best ref        : 0.00121740293554022105

As you can see, its performance is between "dict dotter" and "dot dict". Any performance enhancement suggestion will be appreciated.

Comments

1

The code should be less iterative and more dynamic!!

data

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}

Function

def get_entry(dict, keyspec):
    for keys in keyspec.split('.'):
        dict = dict[keys]
    return dict

call the function

res = get_entry(mydict, 'first.second.third.fourth')

this will take less time to execute even it's a dynamic code execution!!

5 Comments

I fail to see how this is remotely different from OP's solution which they didn't want.
As you see there's no use of extra variables to store values that leads it to save time to execute and the time difference is in micro seconds so this will be effective when this code will execute a million times by another code. Moreover you can use first, first.second , first.second.third as an arg without changing single line of code.
The extra variable makes near 0 difference whatsoever, I would certainly hope for larger performance gains than this on a million records.
@cᴏʟᴅsᴘᴇᴇᴅ Can you tell me how much time this code will takes if you measure it really!! Because I'm dmm sure that it's very big difference of time when this code will execute with extra variable and without extra variable.
Not nearly as much as the other answers, we'll go with that.
1

You can use reduce (functools.reduce in python3):

import operator
def get_entry(dct, keyspec):
    return reduce(operator.getitem, keyspec.split('.'), dct)

It is more nicely looking but with a little less perfomance.

Your version timeit:

>>> timeit("get_entry_original(mydict, 'first.second.third.fourth')",
           "from __main__ import get_entry_original, mydict", number=1000000)
0.5646841526031494

with reduce:

>>> timeit("get_entry(mydict, 'first.second.third.fourth')",
           "from __main__ import get_entry, mydict")
0.6140949726104736

As tdelaney notice - split consume almost as much cpu power as getting key in dict:

def split_keys(keyspec):
    keys = keyspec.split('.')

timeit("split_keys('first.second.third.fourth')",
       "from __main__ import split_keys")
0.28857898712158203

Just move string splitting away from get_entry function:

def get_entry(dct, keyspec_list):
    return reduce(operator.getitem, keyspec_list, dct)


timeit("get_entry(mydict, ['first', 'second', 'third', 'fourth'])",
       "from __main__ import get_entry, mydict")
0.37825703620910645

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.