
I have a dataframe that looks like this:

D   X   Y   Z
A  22  16  23
A  21  16  22
A  20  17  21
B  33  50  11
B  34  53  12
B  34  55  13
C  44  34  11
C  45  33  11
C  45  33  10
D  55  35  60
D  57  34  61
E  66  36  13
E  67  38  14 
E  67  37  13

I want to get the minimum and maximum values of every column, grouped by the categorical variable D, so the output dataframe should look something like this:

D Xmin Xmax Ymin Ymax Zmin Zmax
A 20    22   16   17   21   23 
B 33    34   50   55   11   13    
C 44    45   33   34   10   11  
D 55    57   34   35   60   61
E 66    67   36   38   13   14

I have tried this, but no luck:

min_max_df = dfObj.groupby('D').agg({'X': [dfObj.min(axis=0), dfObj.max(axis=0)]})

4 Answers

df = df.groupby('D').agg(['min', 'max'])

Output:

>>> df
    X       Y       Z    
  min max min max min max
D                        
A  20  22  16  17  21  23
B  33  34  50  55  11  13
C  44  45  33  34  10  11
D  55  57  34  35  60  61
E  66  67  36  38  13  14

>>> df['X']['min']
D
A    20
B    33
C    44
D    55
E    66
Name: min, dtype: int64
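The chained lookup above can also be written as a single tuple-based selection on the MultiIndex columns, which avoids creating an intermediate DataFrame. A small sketch using a cut-down version of the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'D': ['A', 'A', 'B', 'B'],
    'X': [22, 21, 33, 34],
})

out = df.groupby('D').agg(['min', 'max'])

# Tuple-based selection on the MultiIndex columns;
# equivalent to out['X']['min'] but done in one indexing step.
mins = out[('X', 'min')]
print(mins)
```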

You can flatten the columns as well:

df.columns = df.columns.map(''.join)
df = df.rename_axis(None)  # rename_axis returns a new object, so reassign

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
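If you prefer a separator between the column name and the statistic, the same `map` trick works with `'_'.join` instead (a sketch on a cut-down version of the data):

```python
import pandas as pd

df = pd.DataFrame({
    'D': ['A', 'A', 'B'],
    'X': [22, 21, 33],
    'Y': [16, 16, 50],
})

out = df.groupby('D').agg(['min', 'max'])

# Join each (column, stat) pair with an underscore: X_min, X_max, ...
out.columns = out.columns.map('_'.join)
print(out.columns.tolist())  # ['X_min', 'X_max', 'Y_min', 'Y_max']
```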


from itertools import product
aggs = {f"{col}{fn}": (col, fn) for col, fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
df.groupby('D').agg(**aggs)
>>>
   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
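For reference, the dict comprehension above just builds the keyword arguments for pandas named aggregation; writing them out by hand makes the column/function pairing explicit. An equivalent sketch for a single column:

```python
import pandas as pd

df = pd.DataFrame({
    'D': ['A', 'A', 'B'],
    'X': [22, 21, 33],
})

# What the comprehension expands to for column X:
# each keyword is the output column name, mapped to a (column, function) pair.
out = df.groupby('D').agg(
    Xmin=('X', 'min'),
    Xmax=('X', 'max'),
)
print(out)
```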

3 Comments

This is the fastest, per HarryPlotter's benchmarks.
@user17242583: Not per loop, though. (Although it is not clear why the two kinds of numbers would be inconsistent - due to some fixed overhead?)
Yeah actually, for an individual run it looks like mine (and Harry's) was faster.

I believe this is a nice way of doing it, and in a single line of code: join performs the operation by index, and the lsuffix and rsuffix arguments differentiate the min and max columns.

output = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

Outputs:

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
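Note that the one-liner calls groupby twice. A minor variation (sketched below on a cut-down version of the data) groups once and reuses the grouped object for both aggregations:

```python
import pandas as pd

df = pd.DataFrame({
    'D': ['A', 'A', 'B'],
    'X': [22, 21, 33],
})

g = df.groupby('D')  # group once, reuse for both aggregations
out = g.min().join(g.max(), lsuffix='min', rsuffix='max')
print(out)
```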


df = df.groupby('D').agg(['min', 'max'])
df.columns = df.columns.map("".join)  # flatten multi-level columns

Output

>>> df

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D                                    
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14

Speed Test

I did some simple tests comparing Celius's, Ashish's, and user17242583's/my solutions.

Setup

import numpy as np 
import pandas as pd 
from itertools import product

n = 1_000_000
rng = np.random.default_rng() 

df = pd.DataFrame({
    'D': rng.choice(list("ABCDEFGH"), size=n),
    'X': rng.integers(100, size=n),
    'Y': rng.integers(100, size=n),
    'Z': rng.integers(100, size=n),
})

>>> df.shape
(1000000, 4)

Results

Here are the results.

# Ashish's 
>>> %%timeit -n 50
... aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
... df1 = df.groupby('D').agg(**aggs) 

116 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# user17242583's / mine 
>>> %%timeit -n 50
... df1 = df.groupby('D').agg(['min', 'max'])
... df1.columns = df1.columns.map("".join)  # flatten multi-level columns

120 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

#Celius's
>>> %%timeit -n 50
... df1 = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

178 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

Comparing the first two with the last one, we can conclude that DataFrameGroupBy.agg is a little faster than the double groupby plus join for large DataFrames.

7 Comments

@user17242583 Yes, but only for >~1M rows. For ~100k rows, the timings are practically identical ;) Lower than that, Celius's answer seems faster!
I would not have expected my answer to be the fastest in any scenario (it is a join operation with two groupings inside it). Still, for <100k rows, speed is usually deprioritized in favor of readability and applicability for a newer user. As long as bad habits are not taught (like using iterrows when you can have a vectorized solution), I try to aim for a one-liner that does the trick.
@CeliusStingher yeah I understood, and I totally agree with you! I was just curious how big was the difference between using join and agg.
I think the answer is good enough for general knowledge. I am curious about something, though: while specifying all the details you mention might help create a more objective measurement, there is probably no way we can match OP's specs, and most users have different specs too. Therefore all our measurements will differ from OP's, but the most efficient solution vs. the least efficient one will probably hold across most specs.
@CeliusStingher Thank you, I completely agree. During many years of reading SO posts, I never saw someone providing those details in a simple performance test. The average user isn’t interested in them. The setup is precisely given so everyone (especially OP) can easily compare the timings in their own machine if they want. It shouldn’t be seen as a rigorous benchmark, as many more tests would be needed. The point was just to show that for large enough data, using agg instead of join is faster due to their implementation, which is something we're all aware of.