How to select all non-NaN columns and non-NaN last column using pandas?

Question

Forgive me if the title a little bit confusing.

Assuming I have test.h5. Below is the result of reading this file using df.read_hdf('test.h5', 'testdata')

     0     1     2     3     4     5    6
0   123   444   111   321   NaN   NaN  NaN
1   12    234   113   67    21    32   900
3   212   112   543   321   45    NaN  NaN

I want to select the last non-Nan column. My expected result is like this

0   321
1   900
2   45

Also I want to select all column except the last non-NaN column. My expected result perhaps is like this. It might can be in numpy array but I have not foud any solution yet.

      0     1     2     3     4     5    6
0    123   444   111   
1    12    234   113   67    21    32  
3    212   112   543   321

I searched online and found df.iloc[:, :-1] for reading all column but the last one and df.iloc[:, -1] for reading the last column.

My current result using these 2 command is like this: 1. for reading all column except the last one

       0     1     2     3     4     5    
0     123   444   111   321   NaN   NaN  
1     12    234   113   67    21    32   
3     212   112   543   321   45    NaN

2.for reading the last column

0   NaN
1   900
2   Nan

My question is, is there any command or query used in pandas to address these condition?

Thank you for any help and suggestion.

Bharath M Shetty · Accepted Answer · 2017-10-02 05:00:27Z

6

You can use sorted to satisfy your condition i.e

ndf = df.apply(lambda x : sorted(x,key=pd.notnull),1)

This will give

     0      1      2      3      4      5      6
0   NaN    NaN    NaN  123.0  444.0  111.0  321.0
1  12.0  234.0  113.0   67.0   21.0   32.0  900.0
3   NaN    NaN  212.0  112.0  543.0  321.0   45.0

Now you can select the last column i.e

ndf.iloc[:,-1]

0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64

ndf.iloc[:,:-1].apply(lambda x : sorted(x,key=pd.isnull),1)

      0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN

edited Oct 2, 2017 at 5:00

answered Oct 2, 2017 at 4:38

Bharath M Shetty

30.7k6 gold badges66 silver badges112 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Divakar · Accepted Answer · 2017-10-02 05:13:54Z

6

Part #2

Here's a vectorized way with some masking to do the second task of selecting all columns except the last non-NaN column -

idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)
df_out = df.mask(idx[:,None] <= np.arange(df.shape[1]))

Here's a sample run on a modified/generic version of the sample dataframe with two islands of NaNs in the third row and the second row having NaN islands at the start -

In [181]: df
Out[181]: 
     0      1      2    3     4     5      6
0  123  444.0  111.0  321   NaN   NaN    NaN
1   12    NaN    NaN   67  21.0  32.0  900.0
3  212    NaN    NaN  321  45.0   NaN    NaN

In [182]: idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)

In [183]: df.mask(idx[:,None] <= np.arange(df.shape[1]))
Out[183]: 
     0      1      2      3     4     5   6
0  123  444.0  111.0    NaN   NaN   NaN NaN
1   12    NaN    NaN   67.0  21.0  32.0 NaN
3  212    NaN    NaN  321.0   NaN   NaN NaN

Part #1

Back to solving the first case, simply use NumPy's advanced-indexing -

In [192]: df.values[np.arange(len(idx)), idx]
Out[192]: array([ 321.,  900.,   45.])

edited Oct 2, 2017 at 5:13

answered Oct 2, 2017 at 5:02

Divakar

222k19 gold badges274 silver badges374 bronze badges

1 Comment

Zero Over a year ago

You could use notnull inplace of isnull negation

piRSquared · Accepted Answer · 2017-10-02 06:05:25Z

5

Option 1

df.stack().groupby(level=0).last()

0    321.0
1    900.0
3     45.0
dtype: float64

Option 2
Using apply with pd.Series.last_valid_index

# Thanks to Bharath shetty for the suggestion
df.apply(lambda x : x[x.last_valid_index()], 1)
# Old Answer
# df.apply(pd.Series.last_valid_index, 1).pipe(lambda x: df.lookup(x.index, x))

array([ 321.,  900.,   45.])

Option 3
Getting creative with np.where and a dictionary comprehension

pd.Series({df.index[i]: df.iat[i, j] for i, j in zip(*np.where(df.notnull()))})

0    321.0
1    900.0
3     45.0
dtype: float64

Option 4
pd.DataFrame.ffill

df.ffill(1).iloc[:, -1]

0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64

Solving last trick

df.stack().groupby(level=0, group_keys=False).apply(lambda x: x.head(-1)).unstack()

       0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN

edited Oct 2, 2017 at 6:05

answered Oct 2, 2017 at 5:44

piRSquared

296k68 gold badges512 silver badges657 bronze badges

5 Comments

Bharath M Shetty Over a year ago

For option 2 simply df.apply(lambda x : x[x.last_valid_index()],1)

piRSquared Over a year ago

Why did you remove it?

Bharath M Shetty Over a year ago

Because I thought it wouldn't answer second part.

piRSquared Over a year ago

True, but still interesting. Anyway, happy to delete it if you have a change of heart.

Bharath M Shetty Over a year ago

No sir I like to see the option list from you. Let it be there

jezrael · Accepted Answer · 2017-10-02 05:42:00Z

4

Use notnull + iloc + idxmax for columns names of last non NaNs values first and last lookup:

a = df.notnull().iloc[:,::-1].idxmax(1)
print (a)
0    3
1    6
3    4
dtype: object

print (pd.Series(df.lookup(df.index, a)))
0    321.0
1    900.0
2     45.0
dtype: float64

And then replace this values to NaNs:

arr = df.values
arr[np.arange(len(df.index)),a] = np.nan
print (pd.DataFrame(arr, index=df.index, columns=df.columns))
       0      1      2      3     4     5   6
0  123.0  444.0  111.0    NaN   NaN   NaN NaN
1   12.0  234.0  113.0   67.0  21.0  32.0 NaN
3  212.0  112.0  543.0  321.0   NaN   NaN NaN

edited Oct 2, 2017 at 5:42

answered Oct 2, 2017 at 5:03

jezrael

869k103 gold badges1.4k silver badges1.3k bronze badges

Comments

Fang · Accepted Answer · 2017-10-02 07:02:13Z

To those who are looking for answer for this specific issue, for me I ended up using the answer given by Bharath shetty. To make it more easy to access later, I modified the answer given and below are my code:

#assuming you have some csv file with different length of row/column
#and you want to create h5 file from those csv files
data_one = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_one/*.csv")]
data_two = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_two/*.csv")] 

df1 = pd.DataFrame(data_one)
df2 = pd.DataFrame(data_two)

combine = df1.append(df2, ignore_index=True)
combine_sort = combine.apply(lambda x : sorted(x, key=pd.notnull), 1)
combine.to_hdf('test.h5', 'testdata')

For reading

dataframe = pd.read_hdf('test.h5', 'testdata')
dataset = dataframe.values

q1 = dataset[:, :-1] # return all column except the last column
q2 = dataset[:, -1] # return the last column

Collectives™ on Stack Overflow

How to select all non-NaN columns and non-NaN last column using pandas?

5 Answers 5

Comments

1 Comment

5 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

1 Comment

5 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest