4

I have a DataFrame below which has some missing values.

import numpy as np
import pandas as pd
df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])

Since df['Z'] is supposed to be an integer column, I changed its data type to pandas' new experimental nullable integer type, as below.

ydf = df.copy()  # work on a copy so the original df keeps its NumPy dtype
ydf['Z'] = ydf['Z'].astype(pd.Int32Dtype())
ydf

    X   Y   Z
0   A   1   <NA>
1   B   2   5

Now I am trying to use a simple np.where call to replace the non-null values in the column ydf['Z'] with a fixed integer value (say 1), using the code below.

np.where(pd.isna(ydf['Z']), pd.NA, np.where(ydf['Z'] > 0, 1, 0))

But I get the following error, and I am unable to understand why as I am already checking for the rows with null values in the first condition.

TypeError: boolean value of NA is ambiguous
5
  • np.where(ydf['Z'] > 0, 1, 0) is throwing the error. Commented Nov 24, 2021 at 19:27
  • Yes I know that but why? Commented Nov 24, 2021 at 19:28
  • I think np.where expected an array of booleans only, but ydf['Z'] > 0 returns nans like <NA> Commented Nov 24, 2021 at 19:29
  • Yeah, and df['Z'] > 0 (where df is the original df, before converting it to the new Int32 type) returns False for nan. Commented Nov 24, 2021 at 19:30
  • Understood. That makes sense. Thanks. Commented Nov 24, 2021 at 19:38
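
The difference described in these comments can be checked directly. The following is only a small sketch: it rebuilds the DataFrame from the question, and the names plain and nullable are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])

# NumPy-backed column: comparing NaN with 0 simply yields False.
plain = df['Z'] > 0

# Nullable Int32 column: the missing value propagates through the comparison.
nullable = df['Z'].astype('Int32') > 0

print(plain.dtype, plain.tolist())
print(nullable.dtype, nullable.tolist())
```

The first result is an ordinary bool Series, while the second has the nullable boolean dtype and still carries <NA> in the first row.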

3 Answers

2

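np.where expects an array of booleans. With a plain NumPy dtype (float64 for the original column, since it contains a NaN), using > on the Series returns False for NaNs. With the Int32 dtype (note the capital I), > propagates the missing value as <NA> instead of coercing it to False, hence the error.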

One solution would be to use ydf['Z'].gt(0).fillna(False) instead of ydf['Z'] > 0. (They're the same comparison; the first one just changes NA to False afterwards):

np.where(pd.isna(ydf['Z']), pd.NA, np.where(ydf['Z'].gt(0).fillna(False), 1, 0))
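
For anyone who wants to try this without the full DataFrame, here is a minimal sketch of the same idea on a standalone Series. The names s, mask, inner and result are hypothetical, and the explicit to_numpy calls are an extra precaution rather than something the line above requires.

```python
import numpy as np
import pandas as pd

s = pd.Series([None, 5], dtype="Int32", name="Z")  # stand-in for ydf['Z']

# Fill the NA produced by the comparison, then hand NumPy a plain bool array.
mask = s.gt(0).fillna(False).to_numpy(dtype=bool)

inner = np.where(mask, 1, 0)                          # no NA left in the condition
result = np.where(s.isna().to_numpy(), pd.NA, inner)  # put the NA back afterwards

print(result)
```

The result is an object array, because pd.NA and integers have no common NumPy dtype.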

1 Comment

Quick question: shouldn't the second np.where condition consider only the rows that are rejected as False by the first condition, where I am checking for NA? Otherwise, what is the point of a nested condition?
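
A likely explanation, offered as a sketch rather than an authoritative answer: the inner np.where is just a function argument, and Python evaluates every argument in full before the outer np.where runs. The nesting therefore does not skip any rows, so the inner call still meets the <NA> row. The Series s below is a hypothetical stand-in for ydf['Z'].

```python
import numpy as np
import pandas as pd

s = pd.Series([None, 5], dtype="Int32")  # hypothetical stand-in for ydf['Z']

raised = False
try:
    # This call is an ordinary function argument, so it is evaluated in full,
    # for every row, before any outer np.where could look at its own condition.
    inner = np.where(s > 0, 1, 0)
except TypeError as exc:
    raised = True
    print(exc)
```

In other words, np.where selects between two arrays that have already been computed; it does not short-circuit the way an if/else would.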
1

As suggested by @user17242583, np.where needs an array of boolean values only, but your comparison returns a tri-state array: True, False and <NA>.

>>> df['Z'] > 0
0    <NA>
1    True
Name: Z, dtype: boolean

In this case, np.where can't decide if the returned value should be interpreted as True or False.

Just cast your column on the fly:

>>> np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'].astype(float) > 0, 1, 0))

array([<NA>, 1], dtype=object)
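
If the goal is to keep the nullable dtype rather than end up with an object array, one possible alternative (not part of the answer above, and only a sketch) is to skip np.where and cast the nullable boolean result back to Int32. The Series s and the extra -2 row are made up for illustration.

```python
import pandas as pd

s = pd.Series([None, 5, -2], dtype="Int32")  # hypothetical stand-in for df['Z']

# The comparison yields a nullable boolean Series; casting it to Int32 maps
# True to 1 and False to 0, and leaves <NA> where the input was missing.
flags = (s > 0).astype("Int32")

print(flags)
```

Whether this fits depends on the use case: it only covers the 1/0 mapping used here, not arbitrary replacement values.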


0

One option that could be helpful here is the case_when function from pyjanitor, which helps with nested expressions and also works with pandas extension array types:

# pip install pyjanitor
import pandas as pd
import janitor

df.case_when(
      df.Z.isna(), df.Z, # condition, result
      df.Z.gt(0), 1,
      0, # default value if False
      column_name='Z')

   X  Y     Z
0  A  1  <NA>
1  B  2     1
