
I have a df with this structure:

id  a1_l1   a2_l1   a3_l1   a1_l2   a2_l2   a3_l2
1   1       5       3       1       2       3
2   1       5       3       1       2       3
3   2       5       3       5       5       3
4   5       5       3       5       5       3
5   5       5       2           
6   5       5       2           
7   5       5       2           
8   2       5       2           
9   3       5       1           
10  3       5       1   

I want to summarize in a table such that I get:

    l1  l2
a1  0.4 0.5
a2  1   0.5
a3  0   0

What I'm doing is counting how many times 5 was present, divided by the number of valid (non-blank) responses in each column, so that for example:

a1, l1 is equal to 0.4 as I have 4 values of 5 divided by 10 valid responses, and a1, l2 equals 0.5 as I have 2 values of 5 divided by 4 valid responses in that column.
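
For reproducibility, the frame can be built like this (a minimal sketch, treating the blank cells as NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':    range(1, 11),
    'a1_l1': [1, 1, 2, 5, 5, 5, 5, 2, 3, 3],
    'a2_l1': [5] * 10,
    'a3_l1': [3, 3, 3, 3, 2, 2, 2, 2, 1, 1],
    'a1_l2': [1, 1, 5, 5] + [np.nan] * 6,
    'a2_l2': [2, 2, 5, 5] + [np.nan] * 6,
    'a3_l2': [3, 3, 3, 3] + [np.nan] * 6,
})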

Thanks!

4 Answers


You can reshape to a dataframe with a MultiIndex in the columns, then divide the count of values equal to 5 by the count of non-NA values, and finally unstack:

df2 = df.set_index('id')
df2.columns = df2.columns.str.split('_', expand=True)       # flat labels -> (a, l) MultiIndex
df2 = (df2.eq(5).sum() / df2.notna().sum()).unstack()       # count of 5s / valid responses, then unstack the l level

output:

     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0
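
As a quick illustration, the str.split('_', expand=True) step is what turns the flat column labels into a two-level MultiIndex (a minimal sketch):

import pandas as pd

cols = pd.Index(['a1_l1', 'a2_l1', 'a1_l2'])
print(cols.str.split('_', expand=True))   # -> a MultiIndex of (a, l) pairs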

1 Comment

made an edit ... some of the string functions offer the expand argument, which turns the result into a MultiIndex when applied to a pandas Index

Try with pd.wide_to_long

s = pd.wide_to_long(df, ['a1', 'a2', 'a3'], i='id', j='level', sep='_', suffix='\\w+')
out = s.eq(5).groupby(level=1).sum()             # count of 5s per level, per column
out = out.T.div(s.groupby(level=1).count().T)    # divide by non-NA counts so blank cells are excluded
out
level   l1   l2
a1     0.4  0.5
a2     1.0  0.5
a3     0.0  0.0
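
For context, pd.wide_to_long stacks the _l1/_l2 suffix into an index level here, so s has an (id, level) MultiIndex and one column per stub name; a quick way to check (a minimal sketch):

print(list(s.index.names))     # ['id', 'level']
print(s.columns.tolist())      # the stub names become the columns: ['a1', 'a2', 'a3']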

1 Comment

Nice use of wide_to_long ;) +1

All the answers look equally complex, so I decided to run some benchmarks to help the OP make a decision.

I benchmarked each of the three solutions four times in succession, using IPython's %timeit magic, on both the OP's original dataframe (10 rows, 6 value columns) and a randomly-generated 100x10000 (100-row, 10,000-column) dataframe:

Solution    Avg. time, OP's dataframe    Avg. time, 100x10000 dataframe
mozway      2.1 ms                       42.88 ms (0.043 seconds)
Asish M.    4.5 ms                       2,200 ms (2.2 seconds)
BENY        13.3 ms                      1,772 ms (1.77 seconds)

mozway's is by far the fastest.

Congrats, @mozway!
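
For anyone who wants to reproduce this, the setup is roughly the following (a minimal sketch; the wrapper function and its name are mine, the original benchmark invocation isn't shown):

import pandas as pd

def solution_mozway(df):
    # mozway's approach, wrapped so it can be timed repeatedly
    df2 = df.set_index('id')
    df2.columns = df2.columns.str.split('_', expand=True)
    return (df2.eq(5).sum() / df2.notna().sum()).unstack()

# In IPython / Jupyter:
# %timeit solution_mozway(df)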


Benchmark details for OP's dataset

# mozway
2.15 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Asish M.
4.51 ms ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.49 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.48 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.51 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# BENY
13.5 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Benchmark details for 100x10000 dataset

# BENY
1.77 s ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.77 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.76 s ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.79 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Asish M.
2.21 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.2 s ± 9.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.19 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.2 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# mozway
42.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43.9 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
42.2 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Code used to generate 100x10000 dataset

import random
import numpy as np
import pandas as pd

nums = [1, 2, 3, 4, 5, np.nan]  # candidate values; the original definition isn't shown, this is an assumption

dct = {}
sz = 100
for x in range(sz):
    for y in range(sz):
        k = f'a{x}_l{y}'                                    # 100 * 100 = 10,000 columns
        dct[k] = [random.choice(nums) for _ in range(sz)]   # 100 random values per column
        dct[k].sort(key=lambda v: np.isnan(v))              # put NaNs at the end
dct['id'] = list(range(sz))
df = pd.DataFrame(dct)                                      # shape: (100, 10001) including 'id'

3 Comments

And how does it look when the data size is increased?
Check now @sammywemmy ;)
nice reminder that the fewer rows there are to deal with, the better. thanks for the benchmarks

You can also drop the id column with drop('id', axis=1) instead of using set_index('id'). The steps:

  1. calculate the percentage for each column using (x == 5).sum() / x.notna().sum()
  2. reset the index to get a column we can work with
  3. generate the 'a' and 'l' columns by splitting the index
  4. pivot
tmp = df.set_index('id').apply(lambda x: x.eq(5).sum() / x.notna().sum()).reset_index()  # fraction of 5s among valid responses, per column
tmp[['a', 'l']] = tmp['index'].str.split('_', expand=True)                               # split 'a1_l1' into 'a1' and 'l1'


>>>
   index    0   a   l
0  a1_l1  0.4  a1  l1
1  a2_l1  1.0  a2  l1
2  a3_l1  0.0  a3  l1
3  a1_l2  0.5  a1  l2
4  a2_l2  0.5  a2  l2
5  a3_l2  0.0  a3  l2

(tmp.drop('index', axis=1)
    .pivot(index='a', columns='l')   # rows: a1/a2/a3, columns: l1/l2
    .droplevel(0, axis=1)            # drop the leftover '0' column level
    .rename_axis(None)               # remove the 'a' index name
    .rename_axis(None, axis=1))      # remove the 'l' columns name

>>>
     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0

