
An easier way has been added at the end of the question.

What I have

I have a user-user correlation matrix called corr_of_user, like the one below:

userId       316       320       359       370       910
userId                                                  
316     1.000000  0.202133  0.208618  0.176050  0.174035
320     0.202133  1.000000  0.242837  0.019035  0.031737
359     0.208618  0.242837  1.000000  0.357620  0.175914
370     0.176050  0.019035  0.357620  1.000000  0.317371
910     0.174035  0.031737  0.175914  0.317371  1.000000
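For anyone who wants to reproduce this, a minimal sketch that rebuilds the example matrix (values copied from the display above):

```python
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))
print(corr_of_user)
```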

What I want

For every user, I just want to keep the 2 other users that are most similar to them (the 2 highest correlation values per row, after excluding the diagonal). Like so:

Out[40]: 
userId          316       320       359       370       910
corr_user                                                  
316             NaN  0.202133  0.208618       NaN       NaN
320        0.202133       NaN  0.242837       NaN       NaN
359             NaN  0.242837       NaN  0.357620       NaN
370             NaN       NaN  0.357620       NaN  0.317371
910             NaN       NaN  0.175914  0.317371       NaN

I know how to achieve it, but the way I came up with is too complicated. Could anyone provide a better idea?

What I have tried

I first melt the matrix:

melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

melted_corr.head()
Out[23]: 
   userId corr_user     value
0     316       316  1.000000
1     320       316  0.202133
2     359       316  0.208618
3     370       316  0.176050
4     910       316  0.174035

then filter it group by group:

get_second_third = lambda x: x.sort_values(ascending=False).iloc[1:3]

filtered = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_second_third)

filtered
Out[39]: 
corr_user  userId
316        359       0.208618
           320       0.202133
320        359       0.242837
           316       0.202133
359        370       0.357620
           320       0.242837
370        359       0.357620
           910       0.317371
910        370       0.317371
           359       0.175914

and finally reshape it:

filtered.reset_index().pivot_table("value", "corr_user", "userId")
Out[40]: 
userId          316       320       359       370       910
corr_user                                                  
316             NaN  0.202133  0.208618       NaN       NaN
320        0.202133       NaN  0.242837       NaN       NaN
359             NaN  0.242837       NaN  0.357620       NaN
370             NaN       NaN  0.357620       NaN  0.317371
910             NaN       NaN  0.175914  0.317371       NaN
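Incidentally, the melt/pivot roundtrip can be avoided by masking the diagonal and ranking each row directly. A minimal sketch of that alternative (not the code above):

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))

# Hide the diagonal, rank each row descending, keep the 2 best per row
masked = corr_of_user.mask(np.eye(len(corr_of_user), dtype=bool))
result = masked.where(masked.rank(axis=1, ascending=False, method="first") <= 2)
print(result)
```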

Updated:

I came up with an easier way to do this after seeing @John Zwinck's answer.

Let's say there is a new matrix df with some duplicated values and NaNs:

userId  316       320       359       370       910
userId                                             
316     1.0  0.500000  0.500000  0.500000       NaN
320     0.5  1.000000  0.242837  0.019035  0.031737
359     0.5  0.242837  1.000000  0.357620  0.175914
370     0.5  0.019035  0.357620  1.000000  0.317371
910     NaN  0.031737  0.175914  0.317371  1.000000

First I get the rank of each value within its row.

rank = df.rank(1, ascending=False, method="first")

Then I use df.isin() to get the mask that I want.

mask = rank.isin(list(range(2,4)))

Finally

df.where(mask)

Then I get what I want.

userId  316  320       359       370  910
userId                                   
316     NaN  0.5  0.500000       NaN  NaN
320     0.5  NaN  0.242837       NaN  NaN
359     0.5  NaN       NaN  0.357620  NaN
370     0.5  NaN  0.357620       NaN  NaN
910     NaN  NaN  0.175914  0.317371  NaN
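The whole update as a runnable sketch, with the duplicate/NaN matrix above rebuilt by hand:

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.0,    0.500000, 0.500000, 0.500000, np.nan],
    [0.5,    1.000000, 0.242837, 0.019035, 0.031737],
    [0.5,    0.242837, 1.000000, 0.357620, 0.175914],
    [0.5,    0.019035, 0.357620, 1.000000, 0.317371],
    [np.nan, 0.031737, 0.175914, 0.317371, 1.000000],
]
df = pd.DataFrame(values,
                  index=pd.Index(users, name="userId"),
                  columns=pd.Index(users, name="userId"))

# method="first" breaks ties by position, so duplicates get distinct
# ranks and NaNs stay unranked
rank = df.rank(1, ascending=False, method="first")

# rank 1 is the diagonal 1.0; ranks 2 and 3 are the two best matches
mask = rank.isin(list(range(2, 4)))
result = df.where(mask)
print(result)
```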
  • So for each row, you want only the second and third highest value? (What if one of those values appears more than once?) Commented Nov 22, 2017 at 12:41
  • @timgeb Only the value that appears first is kept, like row.sort_values(ascending=False)[1:3] for each row in df. Commented Nov 22, 2017 at 12:45
  • @timgeb I think this is highly unlikely (an edge case of floating point equality). Commented Nov 22, 2017 at 12:54
  • @Ev. Kounis I know, but I would not ship my code with the comment "highly unlikely it does not work as intended" :) Commented Nov 22, 2017 at 13:24
  • @timgeb Since OP does not specify a tie-break policy, whichever one is used by numpy should be acceptable. Commented Nov 22, 2017 at 13:27

3 Answers


First, use np.argsort() to find which locations have the highest values:

sort = np.argsort(df)

This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the desired indices within each row:

        316  320  359  370  910
userId                         
316       4    3    1    2    0
320       3    4    0    2    1
359       4    0    1    3    2
370       1    0    4    2    3
910       1    0    2    3    4

Next, construct a boolean mask, set to true in the above locations:

mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort.iloc[:,-2]] = True
mask[rows, sort.iloc[:,-3]] = True

Now you have the mask you need:

array([[False,  True,  True, False, False],
       [ True, False,  True, False, False],
       [False,  True, False,  True, False],
       [False, False,  True, False,  True],
       [False, False,  True,  True, False]], dtype=bool)

Finally, df.where(mask):

             316       320       359       370       910
userId                                                  
316          NaN  0.202133  0.208618       NaN       NaN
320     0.202133       NaN  0.242837       NaN       NaN
359          NaN  0.242837       NaN  0.357620       NaN
370          NaN       NaN  0.357620       NaN  0.317371
910          NaN       NaN  0.175914  0.317371       NaN
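Putting the answer together as one runnable sketch; note it argsorts the underlying NumPy array rather than the DataFrame, so it does not depend on how np.argsort(df) is wrapped by a particular pandas version:

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
df = pd.DataFrame(values,
                  index=pd.Index(users, name="userId"),
                  columns=pd.Index(users, name="userId"))

sort = np.argsort(df.to_numpy())   # column positions, ascending per row
mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort[:, -2]] = True     # 2nd highest (the highest is the diagonal 1.0)
mask[rows, sort[:, -3]] = True     # 3rd highest
result = df.where(mask)
print(result)
```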

4 Comments

Sorry, I just found this can fail when a value appears many times, for example when the first row's values are all 1.
@Dawei: What do you mean by "fail" here? It actually works fine in the case you describe, with no errors. Perhaps you mean the tie-breaking policy does not match the one you had in mind? If so, what did you expect it to do in the presence of duplicates? Your example input data had no duplicates.
My fault, I forgot to add duplicates and NaNs to my input data. I just added an easier way to achieve it in my question. Please have a look.
@Dawei: The solution you added to your question is a good one. I'd replace list(range(2,4)) with np.arange(2, 4).

This should work:

melted_corr['group_rank'] = melted_corr.groupby('userId')['value'].rank(ascending=False)

then keep ranks 2 and 3 per user (rank 1 is the user's own 1.0 on the diagonal):

melted_corr[melted_corr.group_rank.between(2, 3)]
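A runnable sketch of this approach; it uses method="first" so duplicate correlations still get distinct ranks, and keeps ranks 2 and 3 because rank 1 is the user's self-correlation:

```python
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))
melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

melted_corr["group_rank"] = (melted_corr.groupby("userId")["value"]
                             .rank(ascending=False, method="first"))
top2 = melted_corr[melted_corr.group_rank.between(2, 3)]
print(top2)
```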

Comments


Here is my numpy-esque solution:

top_k = 3
top_corr = corr_of_user.copy()
top_ndarray = top_corr.values
np.fill_diagonal(top_ndarray, np.NaN)
rows = np.arange(top_corr.shape[0])[:, np.newaxis]
columns = top_ndarray.argsort()[:, :-top_k]
top_ndarray[rows, columns] = np.NaN
top_corr

And we get

userId       316       320       359       370       910
userId
316          NaN  0.202133  0.208618       NaN       NaN
320     0.202133       NaN  0.242837       NaN       NaN
359          NaN  0.242837       NaN  0.357620       NaN
370          NaN       NaN  0.357620       NaN  0.317371
910          NaN       NaN  0.175914  0.317371       NaN

You can replace top_corr = corr_of_user.copy() with top_corr = corr_of_user if you don't want a copy but rather an in-place solution.

The idea is pretty much the same as John Zwinck's: get the indices of the fields to clear and use them to index into the array, blanking out the values we don't need. A slight advantage of my solution is that the number of results we keep is a parameter rather than hardcoded (top_k counts the NaN diagonal, so top_k = 3 keeps the 2 best matches per row). It also works when corr_of_user has duplicate values, e.g. all 1s.
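The same sketch made self-contained, working on an explicit NumPy copy so it does not rely on .values returning a writable view (under pandas Copy-on-Write it may not):

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
values = [
    [1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
    [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
    [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
    [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
    [0.174035, 0.031737, 0.175914, 0.317371, 1.000000],
]
corr_of_user = pd.DataFrame(values,
                            index=pd.Index(users, name="userId"),
                            columns=pd.Index(users, name="userId"))

top_k = 3  # 2 wanted neighbours + 1 slot taken by the NaN diagonal
arr = np.array(corr_of_user, dtype=float)   # private, writable copy
np.fill_diagonal(arr, np.nan)               # NaNs argsort to the end of each row
rows = np.arange(arr.shape[0])[:, np.newaxis]
columns = arr.argsort()[:, :-top_k]         # everything outside the last top_k slots
arr[rows, columns] = np.nan
top_corr = pd.DataFrame(arr, index=corr_of_user.index, columns=corr_of_user.columns)
print(top_corr)
```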

3 Comments

Check out np.fill_diagonal also
@BradSolomon: I removed the ravel from my answer, since no readers completed the exercise. :)
@BradSolomon I added an easier way to my problem description. Please have a look.
