A easier way has updated in the end of the question.
What I have
I have a user-user correlation matrix called matrixcorr_of_user like the one below:
userId 316 320 359 370 910
userId
316 1.000000 0.202133 0.208618 0.176050 0.174035
320 0.202133 1.000000 0.242837 0.019035 0.031737
359 0.208618 0.242837 1.000000 0.357620 0.175914
370 0.176050 0.019035 0.357620 1.000000 0.317371
910 0.174035 0.031737 0.175914 0.317371 1.000000
What I want
For every user, I just want to keep the 2 other users that are the most similar to him (the highest correlation values per row after excluding the elements of the diagonal). Like so:
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
I know how to achieve it, but the way I came up with is too complicated. Could anyone provide a better idea?
What I have tried
I first melt the matrix:
melted_corr = corr_of_user.reset_index().melt(id_vars ="userId",var_name="corr_user")
melted_corr.head()
Out[23]:
userId corr_user value
0 316 316 1.000000
1 320 316 0.202133
2 359 316 0.208618
3 370 316 0.176050
4 910 316 0.174035
filter it row by row:
get_secend_third = lambda x : x.sort_values(ascending =False).iloc[1:3]
filted= melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
filted
Out[39]:
corr_user userId
316 359 0.208618
320 0.202133
320 359 0.242837
316 0.202133
359 370 0.357620
320 0.242837
370 359 0.357620
910 0.317371
910 370 0.317371
359 0.175914
and finally reshape it:
filted.reset_index().pivot_table("value","corr_user","userId")
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
Updated:
I came up with a easier way to do this after saw the answer of @John Zwinck
Let's say there is a new matrix df with some dupicated value and NaN
userId 316 320 359 370 910
userId
316 1.0 0.500000 0.500000 0.500000 NaN
320 0.5 1.000000 0.242837 0.019035 0.031737
359 0.5 0.242837 1.000000 0.357620 0.175914
370 0.5 0.019035 0.357620 1.000000 0.317371
910 NaN 0.031737 0.175914 0.317371 1.000000
At first I get the rank of each row.
rank = df.rank(1, ascending=False, method="first")
Then I use df.isin() to get the mask that I want.
mask = rank.isin(list(range(2,4)))
Finally
df.where(mask)
Then I get want I want.
userId 316 320 359 370 910
userId
316 NaN 0.5 0.500000 NaN NaN
320 0.5 NaN 0.242837 NaN NaN
359 0.5 NaN NaN 0.357620 NaN
370 0.5 NaN 0.357620 NaN NaN
910 NaN NaN 0.175914 0.317371 NaN
row.sort_values(ascending = False)[1:3] for row in df.numpyshould be acceptable.