Using pandas nullable integer dtype in np.where condition

Question

I have the DataFrame below, which has some missing values.

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])

Since df['Z'] is supposed to be an integer column, I changed its data type to pandas' new experimental nullable integer type, as below.

df['Z'] = df['Z'].astype(pd.Int32Dtype())
df

    X   Y   Z
0   A   1   <NA>
1   B   2   5

Now I am trying to use np.where to replace the non-null values in the column df['Z'] with a fixed integer value (say 1), using the code below.

np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'] > 0, 1, 0))

But I get the following error, and I am unable to understand why as I am already checking for the rows with null values in the first condition.

TypeError: boolean value of NA is ambiguous

I think np.where expects an array of booleans only, but df['Z'] > 0 returns missing values as <NA>.
– user17242583, Commented Nov 24, 2021 at 19:29
Yeah, and df['Z'] > 0 (where df is the original df, before converting it to the new Int32 type) returns False for NaN.
– user17242583, Commented Nov 24, 2021 at 19:30

user17242583 · Accepted Answer · 2021-11-24 19:35:52Z

2

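np.where expects an array of booleans. With a NumPy-backed dtype (the original column is float64, because it holds NaN), using > on the Series returns False for NaN. With the nullable Int32 dtype (note the capital I), > propagates missing values as <NA> instead of coercing them to False, and np.where raises the error when it tries to treat <NA> as a boolean.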

One solution would be to use df['Z'].gt(0).fillna(False) instead of df['Z'] > 0. (They're the same comparison; the first one just changes <NA> to False afterwards):

np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'].gt(0).fillna(False), 1, 0))
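
A minimal, self-contained sketch of the fix above, assuming the frame from the question with Z converted to Int32. The names positive and positive_filled are illustrative and not from the original post. Note that the inner np.where is an ordinary function call, so it is evaluated for every row before the outer np.where runs; checking for NA in the outer call alone therefore does not protect the inner comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
df['Z'] = df['Z'].astype(pd.Int32Dtype())

# Comparing a nullable Int32 column gives a nullable boolean Series
# that still contains <NA> for the missing row.
positive = df['Z'].gt(0)

# Replace <NA> with False so np.where only sees True/False values.
positive_filled = positive.fillna(False)

# The inner np.where runs on all rows first; the outer one then
# puts pd.NA back wherever Z was missing.
result = np.where(pd.isna(df['Z']), pd.NA,
                  np.where(positive_filled, 1, 0))
```

Here result is an object array, because it mixes pd.NA with integers.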

answered Nov 24, 2021 at 19:35

user17242583


1 Comment

asanoop24 Over a year ago

Quick question: shouldn't the second np.where condition consider only the rows that come out as False in the first condition, where I am checking for NA? Otherwise, what is the point of a nested condition?

Corralien · Accepted Answer · 2021-11-24 19:38:36Z

1

As suggested by @user17242583, np.where needs an array of boolean values only, but your comparison returns a tri-state array: True, False and <NA>.

>>> df['Z'] > 0
0    <NA>
1    True
Name: Z, dtype: boolean

In this case, np.where can't decide if the returned value should be interpreted as True or False.
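
As a rough illustration of where the error message comes from: np.where has to reduce each element of the condition to a plain True or False, and pd.NA refuses that conversion. The snippet below only demonstrates this on the scalar itself; the name message is illustrative.

```python
import pandas as pd

# pd.NA has no truth value, so forcing it into a bool raises TypeError.
try:
    bool(pd.NA)
except TypeError as exc:
    message = str(exc)

print(message)
```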

Just cast your column on the fly:

>>> np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'].astype(float) > 0, 1, 0))

array([<NA>, 1], dtype=object)
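
To see why the cast works, here is a small sketch assuming the same frame as in the question (the names z_float and mask are illustrative): astype(float) turns <NA> into NaN, and comparing NaN with > yields a plain False, so the condition becomes an ordinary NumPy boolean array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
df['Z'] = df['Z'].astype(pd.Int32Dtype())

# Casting the nullable column to float converts <NA> to NaN.
z_float = df['Z'].astype(float)

# NaN > 0 evaluates to False, so no <NA> reaches np.where.
mask = z_float > 0

result = np.where(pd.isna(df['Z']), pd.NA, np.where(mask, 1, 0))
```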

edited Nov 24, 2021 at 19:38

answered Nov 24, 2021 at 19:32

Corralien


sammywemmy · Accepted Answer · 2021-12-18 11:35:37Z

0

One option that could be helpful here is the case_when function from pyjanitor, which helps with nested expressions and also works with pandas extension array types:

# pip install pyjanitor
import pandas as pd
import janitor

df.case_when(
      df.Z.isna(), df.Z, # condition, result
      df.Z.gt(0), 1,
      0, # default value if False
      column_name='Z')

   X  Y     Z
0  A  1  <NA>
1  B  2     1

answered Dec 18, 2021 at 11:35

sammywemmy
