7

Forgive me if the title a little bit confusing.

Assuming I have test.h5. Below is the result of reading this file using df.read_hdf('test.h5', 'testdata')

     0     1     2     3     4     5    6
0   123   444   111   321   NaN   NaN  NaN
1   12    234   113   67    21    32   900
3   212   112   543   321   45    NaN  NaN

I want to select the last non-Nan column. My expected result is like this

0   321
1   900
2   45

Also I want to select all column except the last non-NaN column. My expected result perhaps is like this. It might can be in numpy array but I have not foud any solution yet.

      0     1     2     3     4     5    6
0    123   444   111   
1    12    234   113   67    21    32  
3    212   112   543   321  

I searched online and found df.iloc[:, :-1] for reading all column but the last one and df.iloc[:, -1] for reading the last column.

My current result using these 2 command is like this: 1. for reading all column except the last one

       0     1     2     3     4     5    
0     123   444   111   321   NaN   NaN  
1     12    234   113   67    21    32   
3     212   112   543   321   45    NaN  

2.for reading the last column

0   NaN
1   900
2   Nan

My question is, is there any command or query used in pandas to address these condition?

Thank you for any help and suggestion.

5 Answers 5

6

You can use sorted to satisfy your condition i.e

ndf = df.apply(lambda x : sorted(x,key=pd.notnull),1)

This will give

     0      1      2      3      4      5      6
0   NaN    NaN    NaN  123.0  444.0  111.0  321.0
1  12.0  234.0  113.0   67.0   21.0   32.0  900.0
3   NaN    NaN  212.0  112.0  543.0  321.0   45.0

Now you can select the last column i.e

ndf.iloc[:,-1]
0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64
ndf.iloc[:,:-1].apply(lambda x : sorted(x,key=pd.isnull),1)
      0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN
Sign up to request clarification or add additional context in comments.

Comments

6

Part #2

Here's a vectorized way with some masking to do the second task of selecting all columns except the last non-NaN column -

idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)
df_out = df.mask(idx[:,None] <= np.arange(df.shape[1]))

Here's a sample run on a modified/generic version of the sample dataframe with two islands of NaNs in the third row and the second row having NaN islands at the start -

In [181]: df
Out[181]: 
     0      1      2    3     4     5      6
0  123  444.0  111.0  321   NaN   NaN    NaN
1   12    NaN    NaN   67  21.0  32.0  900.0
3  212    NaN    NaN  321  45.0   NaN    NaN

In [182]: idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)

In [183]: df.mask(idx[:,None] <= np.arange(df.shape[1]))
Out[183]: 
     0      1      2      3     4     5   6
0  123  444.0  111.0    NaN   NaN   NaN NaN
1   12    NaN    NaN   67.0  21.0  32.0 NaN
3  212    NaN    NaN  321.0   NaN   NaN NaN

Part #1

Back to solving the first case, simply use NumPy's advanced-indexing -

In [192]: df.values[np.arange(len(idx)), idx]
Out[192]: array([ 321.,  900.,   45.])

1 Comment

You could use notnull inplace of isnull negation
5

Option 1

df.stack().groupby(level=0).last()

0    321.0
1    900.0
3     45.0
dtype: float64

Option 2
Using apply with pd.Series.last_valid_index

# Thanks to Bharath shetty for the suggestion
df.apply(lambda x : x[x.last_valid_index()], 1)
# Old Answer
# df.apply(pd.Series.last_valid_index, 1).pipe(lambda x: df.lookup(x.index, x))

array([ 321.,  900.,   45.])

Option 3
Getting creative with np.where and a dictionary comprehension

pd.Series({df.index[i]: df.iat[i, j] for i, j in zip(*np.where(df.notnull()))})

0    321.0
1    900.0
3     45.0
dtype: float64

Option 4
pd.DataFrame.ffill

df.ffill(1).iloc[:, -1]

0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64

Solving last trick

df.stack().groupby(level=0, group_keys=False).apply(lambda x: x.head(-1)).unstack()

       0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN

5 Comments

For option 2 simply df.apply(lambda x : x[x.last_valid_index()],1)
Why did you remove it?
Because I thought it wouldn't answer second part.
True, but still interesting. Anyway, happy to delete it if you have a change of heart.
No sir I like to see the option list from you. Let it be there
4

Use notnull + iloc + idxmax for columns names of last non NaNs values first and last lookup:

a = df.notnull().iloc[:,::-1].idxmax(1)
print (a)
0    3
1    6
3    4
dtype: object

print (pd.Series(df.lookup(df.index, a)))
0    321.0
1    900.0
2     45.0
dtype: float64

And then replace this values to NaNs:

arr = df.values
arr[np.arange(len(df.index)),a] = np.nan
print (pd.DataFrame(arr, index=df.index, columns=df.columns))
       0      1      2      3     4     5   6
0  123.0  444.0  111.0    NaN   NaN   NaN NaN
1   12.0  234.0  113.0   67.0  21.0  32.0 NaN
3  212.0  112.0  543.0  321.0   NaN   NaN NaN

Comments

0

To those who are looking for answer for this specific issue, for me I ended up using the answer given by Bharath shetty. To make it more easy to access later, I modified the answer given and below are my code:

#assuming you have some csv file with different length of row/column
#and you want to create h5 file from those csv files
data_one = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_one/*.csv")]
data_two = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_two/*.csv")] 

df1 = pd.DataFrame(data_one)
df2 = pd.DataFrame(data_two)

combine = df1.append(df2, ignore_index=True)
combine_sort = combine.apply(lambda x : sorted(x, key=pd.notnull), 1)
combine.to_hdf('test.h5', 'testdata')

For reading

dataframe = pd.read_hdf('test.h5', 'testdata')
dataset = dataframe.values

q1 = dataset[:, :-1] # return all column except the last column
q2 = dataset[:, -1] # return the last column

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.