
An easier way has been added at the end of the question.

What I have

I have a user-user correlation matrix called corr_of_user, like the one below:

userId       316       320       359       370       910
userId                                                  
316     1.000000  0.202133  0.208618  0.176050  0.174035
320     0.202133  1.000000  0.242837  0.019035  0.031737
359     0.208618  0.242837  1.000000  0.357620  0.175914
370     0.176050  0.019035  0.357620  1.000000  0.317371
910     0.174035  0.031737  0.175914  0.317371  1.000000
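For anyone who wants to reproduce this, a minimal sketch that rebuilds the example matrix (values copied from the display above):

```python
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))
print(corr_of_user)
```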

What I want

For every user, I just want to keep the 2 other users that are most similar to them (the 2 highest correlation values per row, after excluding the diagonal). Like so:

Out[40]: 
userId          316       320       359       370       910
corr_user                                                  
316             NaN  0.202133  0.208618       NaN       NaN
320        0.202133       NaN  0.242837       NaN       NaN
359             NaN  0.242837       NaN  0.357620       NaN
370             NaN       NaN  0.357620       NaN  0.317371
910             NaN       NaN  0.175914  0.317371       NaN

I know how to achieve it, but the way I came up with is too complicated. Could anyone provide a better idea?

What I have tried

I first melt the matrix:

melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

melted_corr.head()
Out[23]: 
   userId corr_user     value
0     316       316  1.000000
1     320       316  0.202133
2     359       316  0.208618
3     370       316  0.176050
4     910       316  0.174035

then filter it group by group:

get_second_third = lambda x: x.sort_values(ascending=False).iloc[1:3]

filtered = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_second_third)

filtered
Out[39]: 
corr_user  userId
316        359       0.208618
           320       0.202133
320        359       0.242837
           316       0.202133
359        370       0.357620
           320       0.242837
370        359       0.357620
           910       0.317371
910        370       0.317371
           359       0.175914

and finally reshape it:

filtered.reset_index().pivot_table("value", "corr_user", "userId")
Out[40]: 
userId          316       320       359       370       910
corr_user                                                  
316             NaN  0.202133  0.208618       NaN       NaN
320        0.202133       NaN  0.242837       NaN       NaN
359             NaN  0.242837       NaN  0.357620       NaN
370             NaN       NaN  0.357620       NaN  0.317371
910             NaN       NaN  0.175914  0.317371       NaN
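Incidentally, the melt/pivot roundtrip can be avoided by masking the diagonal and ranking each row directly. A minimal sketch of that alternative (not the code above):

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))

# Hide the diagonal, rank each row descending, keep the 2 best per row
masked = corr_of_user.mask(np.eye(len(corr_of_user), dtype=bool))
result = masked.where(masked.rank(axis=1, ascending=False, method="first") <= 2)
print(result)
```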

Updated:

I came up with an easier way to do this after seeing @John Zwinck's answer.

Let's say there is a new matrix df with some duplicated values and NaNs:

userId  316       320       359       370       910
userId                                             
316     1.0  0.500000  0.500000  0.500000       NaN
320     0.5  1.000000  0.242837  0.019035  0.031737
359     0.5  0.242837  1.000000  0.357620  0.175914
370     0.5  0.019035  0.357620  1.000000  0.317371
910     NaN  0.031737  0.175914  0.317371  1.000000

First I get the rank of each value within its row.

rank = df.rank(1, ascending=False, method="first")

Then I use df.isin() to get the mask that I want.

mask = rank.isin(list(range(2,4)))

Finally

df.where(mask)

Then I get what I want.

userId  316  320       359       370  910
userId                                   
316     NaN  0.5  0.500000       NaN  NaN
320     0.5  NaN  0.242837       NaN  NaN
359     0.5  NaN       NaN  0.357620  NaN
370     0.5  NaN  0.357620       NaN  NaN
910     NaN  NaN  0.175914  0.317371  NaN
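The whole update as a runnable sketch, with the duplicate/NaN matrix above rebuilt by hand:

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.0,    0.500000, 0.500000, 0.500000, np.nan],
    [0.5,    1.000000, 0.242837, 0.019035, 0.031737],
    [0.5,    0.242837, 1.000000, 0.357620, 0.175914],
    [0.5,    0.019035, 0.357620, 1.000000, 0.317371],
    [np.nan, 0.031737, 0.175914, 0.317371, 1.000000],
]
df = pd.DataFrame(values,
                  index=pd.Index(users, name="userId"),
                  columns=pd.Index(users, name="userId"))

# method="first" breaks ties by position, so duplicates get distinct
# ranks and NaNs stay unranked
rank = df.rank(1, ascending=False, method="first")

# rank 1 is the diagonal 1.0; ranks 2 and 3 are the two best matches
mask = rank.isin(list(range(2, 4)))
result = df.where(mask)
print(result)
```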
  • So for each row, you want only the second and third highest value? (What if one of those values appears more than once?) Commented Nov 22, 2017 at 12:41
  • @timgeb Only the value that appears first is kept, like row.sort_values(ascending=False)[1:3] for each row in df. Commented Nov 22, 2017 at 12:45
  • @timgeb I think this is highly unlikely (an edge case of floating point equality). Commented Nov 22, 2017 at 12:54
  • @Ev. Kounis I know, but I would not ship my code with the comment "highly unlikely it does not work as intended" :) Commented Nov 22, 2017 at 13:24
  • @timgeb Since OP does not specify a tie-break policy, whichever one is used by numpy should be acceptable. Commented Nov 22, 2017 at 13:27

3 Answers


First, use np.argsort() to find which locations have the highest values:

sort = np.argsort(df)

This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the desired indices within each row:

        316  320  359  370  910
userId                         
316       4    3    1    2    0
320       3    4    0    2    1
359       4    0    1    3    2
370       1    0    4    2    3
910       1    0    2    3    4

Next, construct a boolean mask, set to true in the above locations:

mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort.iloc[:,-2]] = True
mask[rows, sort.iloc[:,-3]] = True

Now you have the mask you need:

array([[False,  True,  True, False, False],
       [ True, False,  True, False, False],
       [False,  True, False,  True, False],
       [False, False,  True, False,  True],
       [False, False,  True,  True, False]], dtype=bool)

Finally, df.where(mask):

             316       320       359       370       910
userId                                                  
316          NaN  0.202133  0.208618       NaN       NaN
320     0.202133       NaN  0.242837       NaN       NaN
359          NaN  0.242837       NaN  0.357620       NaN
370          NaN       NaN  0.357620       NaN  0.317371
910          NaN       NaN  0.175914  0.317371       NaN
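Putting the answer together as one runnable sketch; note it argsorts the underlying NumPy array rather than the DataFrame, so it does not depend on how np.argsort(df) is wrapped by a particular pandas version:

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
df = pd.DataFrame(values,
                  index=pd.Index(users, name="userId"),
                  columns=pd.Index(users, name="userId"))

sort = np.argsort(df.to_numpy())   # column positions, ascending per row
mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort[:, -2]] = True     # 2nd highest (the highest is the diagonal 1.0)
mask[rows, sort[:, -3]] = True     # 3rd highest
result = df.where(mask)
print(result)
```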

4 Comments

Sorry, I just found this can fail when a value appears many times, for example when the first row's values are all 1.
@Dawei: What do you mean by "fail" here? It actually works fine in the case you describe, with no errors. Perhaps you mean the tie-breaking policy does not match the one you had in mind? If so, what did you expect it to do in the presence of duplicates? Your example input data had no duplicates.
My fault, I forgot to add duplicates and NaNs to my input data. I just added an easier way to achieve it in my question. Please have a look.
@Dawei: The solution you added to your question is a good one. I'd replace list(range(2,4)) with np.arange(2, 4).

This should work:

melted_corr['group_rank'] = melted_corr.groupby('userId')['value'].rank(ascending=False)

then keep ranks 2 and 3 per user (rank 1 is the user's own 1.0 on the diagonal):

melted_corr[melted_corr.group_rank.between(2, 3)]
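A runnable sketch of this approach; it uses method="first" so duplicate correlations still get distinct ranks, and keeps ranks 2 and 3 because rank 1 is the user's self-correlation:

```python
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))
melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

melted_corr["group_rank"] = (melted_corr.groupby("userId")["value"]
                             .rank(ascending=False, method="first"))
top2 = melted_corr[melted_corr.group_rank.between(2, 3)]
print(top2)
```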

Comments


Here is my numpy-esque solution:

top_k = 3
top_corr = corr_of_user.copy()
top_ndarray = top_corr.values
np.fill_diagonal(top_ndarray, np.NaN)
rows = np.arange(top_corr.shape[0])[:, np.newaxis]
columns = top_ndarray.argsort()[:, :-top_k]
top_ndarray[rows, columns] = np.NaN
top_corr

And we get

userId       316       320       359       370       910
userId
316          NaN  0.202133  0.208618       NaN       NaN
320     0.202133       NaN  0.242837       NaN       NaN
359          NaN  0.242837       NaN  0.357620       NaN
370          NaN       NaN  0.357620       NaN  0.317371
910          NaN       NaN  0.175914  0.317371       NaN

You can replace top_corr = corr_of_user.copy() with top_corr = corr_of_user if you don't want a copy but rather an in-place solution.

The idea is pretty much the same as John Zwinck's: get the indices of the fields to clear and use them to index into the array, blanking out the values we don't need. A slight advantage of my solution is that the number of results we keep is a parameter rather than hardcoded (top_k counts the NaN diagonal, so top_k = 3 keeps the 2 best matches per row). It also works when corr_of_user has duplicate values, e.g. all 1s.
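The same sketch made self-contained, working on an explicit NumPy copy so it does not rely on .values returning a writable view (under pandas Copy-on-Write it may not):

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))

top_k = 3  # 2 wanted neighbours + 1 slot taken by the NaN diagonal
arr = np.array(corr_of_user, dtype=float)   # private, writable copy
np.fill_diagonal(arr, np.nan)               # NaNs argsort to the end of each row
rows = np.arange(arr.shape[0])[:, np.newaxis]
columns = arr.argsort()[:, :-top_k]         # everything outside the last top_k slots
arr[rows, columns] = np.nan
top_corr = pd.DataFrame(arr, index=corr_of_user.index, columns=corr_of_user.columns)
print(top_corr)
```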

3 Comments

Check out np.fill_diagonal also
@BradSolomon: I removed the ravel from my answer, since no readers completed the exercise. :)
@BradSolomon I added an easier way to my problem description. Please have a look.
