
I have a df with this structure:

id  a1_l1   a2_l1   a3_l1   a1_l2   a2_l2   a3_l2
1   1       5       3       1       2       3
2   1       5       3       1       2       3
3   2       5       3       5       5       3
4   5       5       3       5       5       3
5   5       5       2           
6   5       5       2           
7   5       5       2           
8   2       5       2           
9   3       5       1           
10  3       5       1   

I want to summarize in a table such that I get:

    l1  l2
a1  0.4 0.5
a2  1   0.5
a3  0   0

What I'm doing is counting how many times 5 was present, divided by the number of valid (non-blank) responses in each column, so that for example:

a1, l1 is equal to 0.4 as I have 4 values of 5 divided by 10 valid responses, and a1, l2 equals 0.5 as I have 2 values of 5 divided by 4 valid responses in that column.
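
For reproducibility, the frame can be built like this (a minimal sketch, treating the blank cells as NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':    range(1, 11),
    'a1_l1': [1, 1, 2, 5, 5, 5, 5, 2, 3, 3],
    'a2_l1': [5] * 10,
    'a3_l1': [3, 3, 3, 3, 2, 2, 2, 2, 1, 1],
    'a1_l2': [1, 1, 5, 5] + [np.nan] * 6,
    'a2_l2': [2, 2, 5, 5] + [np.nan] * 6,
    'a3_l2': [3, 3, 3, 3] + [np.nan] * 6,
})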

Thanks!

4 Answers


You can reshape to a dataframe with a MultiIndex in the columns, then divide the count of values equal to 5 by the count of non-NA values, and finally unstack:

df2 = df.set_index('id')
df2.columns = df2.columns.str.split('_', expand=True)       # flat labels -> (a, l) MultiIndex
df2 = (df2.eq(5).sum() / df2.notna().sum()).unstack()       # count of 5s / valid responses, then unstack the l level

output:

     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0
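
As a quick illustration, the str.split('_', expand=True) step is what turns the flat column labels into a two-level MultiIndex (a minimal sketch):

import pandas as pd

cols = pd.Index(['a1_l1', 'a2_l1', 'a1_l2'])
print(cols.str.split('_', expand=True))   # -> a MultiIndex of (a, l) pairs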

1 Comment

made an edit ... some of the string functions offer the expand argument, which turns the result into a MultiIndex when applied to a pandas Index

Try with pd.wide_to_long

s = pd.wide_to_long(df, ['a1', 'a2', 'a3'], i='id', j='level', sep='_', suffix='\\w+')
out = s.eq(5).groupby(level=1).sum()             # count of 5s per level, per column
out = out.T.div(s.groupby(level=1).count().T)    # divide by non-NA counts so blank cells are excluded
out
level   l1   l2
a1     0.4  0.5
a2     1.0  0.5
a3     0.0  0.0
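
For context, pd.wide_to_long stacks the _l1/_l2 suffix into an index level here, so s has an (id, level) MultiIndex and one column per stub name; a quick way to check (a minimal sketch):

print(list(s.index.names))     # ['id', 'level']
print(s.columns.tolist())      # the stub names become the columns: ['a1', 'a2', 'a3']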

1 Comment

Nice use of wide_to_long ;) +1

All the answers look equally complex, so I decided to run some benchmarks to help the OP make a decision.

I benchmarked each of the three solutions four times in succession, using IPython's %timeit magic, on both the OP's original dataframe (10 rows, 6 value columns) and a randomly-generated 100x10000 (100-row, 10,000-column) dataframe:

Solution    Avg. time, OP's dataframe    Avg. time, 100x10000 dataframe
mozway      2.1 ms                       42.88 ms (0.043 seconds)
Asish M.    4.5 ms                       2,200 ms (2.2 seconds)
BENY        13.3 ms                      1,772 ms (1.77 seconds)

mozway's is by far the fastest.

Congrats, @mozway!
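
For anyone who wants to reproduce this, the setup is roughly the following (a minimal sketch; the wrapper function and its name are mine, the original benchmark invocation isn't shown):

import pandas as pd

def solution_mozway(df):
    # mozway's approach, wrapped so it can be timed repeatedly
    df2 = df.set_index('id')
    df2.columns = df2.columns.str.split('_', expand=True)
    return (df2.eq(5).sum() / df2.notna().sum()).unstack()

# In IPython / Jupyter:
# %timeit solution_mozway(df)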


Benchmark details for OP's dataset

# mozway
2.15 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Asish M.
4.51 ms ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.49 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.48 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.51 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# BENY
13.5 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Benchmark details for 100x10000 dataset

# BENY
1.77 s ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.77 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.76 s ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.79 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Asish M.
2.21 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.2 s ± 9.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.19 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.2 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# mozway
42.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43.9 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
42.2 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Code used to generate 100x10000 dataset

import random
import numpy as np
import pandas as pd

nums = [1, 2, 3, 4, 5, np.nan]  # candidate values; the original definition isn't shown, this is an assumption

dct = {}
sz = 100
for x in range(sz):
    for y in range(sz):
        k = f'a{x}_l{y}'                                    # 100 * 100 = 10,000 columns
        dct[k] = [random.choice(nums) for _ in range(sz)]   # 100 random values per column
        dct[k].sort(key=lambda v: np.isnan(v))              # put NaNs at the end
dct['id'] = list(range(sz))
df = pd.DataFrame(dct)                                      # shape: (100, 10001) including 'id'

3 Comments

And how does it look when the data size is increased?
Check now @sammywemmy ;)
nice reminder that the fewer rows there are to deal with, the better. thanks for the benchmarks

You can also drop the id column with drop('id', axis=1) instead of using set_index('id'). The steps:

  1. calculate the percentage for each column using (x == 5).sum() / x.notna().sum()
  2. reset the index to get a column we can work with
  3. generate the 'a' and 'l' columns by splitting the index
  4. pivot
tmp = df.set_index('id').apply(lambda x: x.eq(5).sum() / x.notna().sum()).reset_index()  # fraction of 5s among valid responses, per column
tmp[['a', 'l']] = tmp['index'].str.split('_', expand=True)                               # split 'a1_l1' into 'a1' and 'l1'


>>>
   index    0   a   l
0  a1_l1  0.4  a1  l1
1  a2_l1  1.0  a2  l1
2  a3_l1  0.0  a3  l1
3  a1_l2  0.5  a1  l2
4  a2_l2  0.5  a2  l2
5  a3_l2  0.0  a3  l2

(tmp.drop('index', axis=1)
    .pivot(index='a', columns='l')   # rows: a1/a2/a3, columns: l1/l2
    .droplevel(0, axis=1)            # drop the leftover '0' column level
    .rename_axis(None)               # remove the 'a' index name
    .rename_axis(None, axis=1))      # remove the 'l' columns name

>>>
     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0

