
This is an update to the structure of my DataFrame. I put the original structure together in haste while inspecting a single user, so it was too simple. @liliscent's remark that the "data accidentally satisfies this condition" is true, and value_counts plus cumsum() solves that case. But user_ids also change, and different users can have the same meet_id if they have the same text.

Updated DataFrames structure:

    mytable = pd.DataFrame({
        'user_id': ['3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c',
                    '3d', '3d', '3d', '3d', '3e', '3e', '3r', '3w', '3w', '3w', '3w'],
        'meet_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 1, 2, 1, 1],
        'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz',
                 'npq', 'npq', 'npq', 'npq', 'tt', 'op', 'li', 'abc', 'xyz', 'abc', 'abc'],
        'label': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'A',
                  'G', 'H', 'H', 'H', 'A', 'A', 'B', 'E', 'G', 'B', 'B']})
    mytable = mytable[['user_id', 'meet_id', 'text', 'label']]  # order the columns the way I would like them printed

   user_id  meet_id  text label
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     B
   3c        1      abc     B
   3c        2      xyz     B
   3c        2      xyz     B
   3c        2      xyz     B
   3c        2      xyz     C
   3c        2      xyz     C
   3c        2      xyz     A
   3d        3      npq     G
   3d        3      npq     H
   3d        3      npq     H
   3d        3      npq     H
   3e        4      tt      A
   3e        5      op      A
   3r        6      li      B
   3w        1      abc     E
   3w        2      xyz     G 
   3w        1      abc     B
   3w        1      abc     B

I would like to group by the [user_id, meet_id] columns and concatenate the label column in such a way that the most frequent label for that group is left untouched, the second most frequent label has the first label concatenated to it, and the least frequent label has all of the labels concatenated.

This updated DataFrame is the output I am looking for:

    mytable_pro = pd.DataFrame({
        'user_id': ['3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c',
                    '3d', '3d', '3d', '3d', '3e', '3e', '3r', '3w', '3w', '3w', '3w'],
        'meet_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 1, 2, 1, 1],
        'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz',
                 'npq', 'npq', 'npq', 'npq', 'tt', 'op', 'li', 'abc', 'xyz', 'abc', 'abc'],
        'label': ['A', 'A', 'A', 'A', 'A', 'B,A', 'B,A', 'B', 'B', 'B', 'B, C', 'B, C', 'A,B,C',
                  'H,G', 'H', 'H', 'H', 'A', 'A', 'B', 'E,B', 'G', 'B', 'B']})
    mytable_pro = mytable_pro[['user_id', 'meet_id', 'text', 'label']]  # order the columns the way I would like them printed

This gives:

    user_id  meet_id text  label
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc     B,A
   3c        1       abc     B,A
   3c        2       xyz      B
   3c        2       xyz      B
   3c        2       xyz      B
   3c        2       xyz    B, C
   3c        2       xyz    B, C
   3c        2       xyz    A,B,C
   3d        3       npq     H,G
   3d        3       npq      H
   3d        3       npq      H
   3d        3       npq      H
   3e        4       tt       A
   3e        5       op       A
   3r        6       li       B
   3w        1       abc     E,B
   3w        2       xyz      G
   3w        1       abc      B
   3w        1       abc      B

The answer given by @piRSquared:

    mytable.groupby('meet_id').label.value_counts().groupby('meet_id').apply(
        lambda d: d.index.to_series().str[1].cumsum().str.join(', '))

is the CORRECT ANSWER to the WRONG question I asked. Thanks a ton, and really sorry. It solves the ordering problem mentioned previously, but it does not work when a different user has the same meet_id. Just to be exhaustive: if the label frequencies in a group are equal, it does not matter which label gets the others concatenated.

It gives:

     user_id  meet_id  text       label
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc        A, B
   3c          1         abc        A, B
   3c          2         xyz           B
   3c          2         xyz           B
   3c          2         xyz           B
   3c          2         xyz        B, C
   3c          2         xyz        B, C
   3c          2         xyz     B, C, A
   3d          3         npq        H, G
   3d          3         npq           H
   3d          3         npq           H
   3d          3         npq           H
   3e          4          tt           A
   3e          5          op           A
   3r          6          li           B
   3w          1         abc     A, B, E
   3w          2         xyz    B, C, A, G
   3w          1         abc        A, B
   3w          1         abc        A, B

The labels for 3w are off, since the labels for each meet_id are picked up ignoring the difference in user_id. My bad!

Now, since user_id must also be considered, I tried the following:

    s = mytable.groupby(['user_id', 'meet_id']).label.value_counts().groupby(['user_id', 'meet_id']).apply(
        lambda d: d.index.to_series().str[1].cumsum().str.join(', '))

This throws:

    AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
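A minimal sketch of why this happens (my illustration, with a made-up two-row index): with the two-key groupby, the value_counts index tuples have three levels, `(user_id, meet_id, label)`, so `.str[1]` picks out the integer `meet_id` rather than the label, and `.cumsum()` over integers followed by `.str.join` raises exactly this AttributeError. `.str[-1]` always takes the last tuple element, the label:

```python
import pandas as pd

# With groupby(['user_id', 'meet_id']), the value_counts index tuples are
# (user_id, meet_id, label), so position 1 is the integer meet_id.
vc = pd.Series(
    [2, 1],
    index=pd.MultiIndex.from_tuples(
        [('3c', 1, 'A'), ('3c', 1, 'B')],
        names=['user_id', 'meet_id', 'label'],
    ),
)
tuples = vc.index.to_series()
print(tuples.str[1].tolist())   # [1, 1] -- integers, so .cumsum().str fails
print(tuples.str[-1].tolist())  # ['A', 'B'] -- the labels we actually want
```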

Ah! Another small update: in reality I have words in my label column.

    dummy_boo = pd.DataFrame({'user_id': ['3g', '3g', '3g'], 'meet_id': [9, 9, 9],
                              'text': ['baby', 'baby', 'baby'], 'label': ['hello', 'hello', 'why']})

Output:

    user_id  meet_id  text  label
      3g        9     baby  hello
      3g        9     baby  hello
      3g        9     baby   why

Applying the above code results in each character being separated by a comma:

    user_id  meet_id  text   label
    3g        9      baby  h, e, l, l, o
    3g        9      baby  h, e, l, l, o
    3g        9      baby  h, e, l, l, o, w, h, y

Instead I need:

    user_id  meet_id  text   label
    3g        9      baby    hello
    3g        9      baby    hello
    3g        9      baby    hello, why

The dtype for label is object; should we use astype instead? A big thank you to everyone for helping me out.
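A side note on the character splitting (my sketch, not from the answers): string cumsum concatenates raw characters, and `.str.join(', ')` then iterates each string character by character, which is why the words get shredded. One workaround is to accumulate with a delimiter and split on it afterwards; this sketch only demonstrates the word handling and does not do the frequency ordering:

```python
import pandas as pd

dummy_boo = pd.DataFrame({'user_id': ['3g', '3g', '3g'], 'meet_id': [9, 9, 9],
                          'text': ['baby', 'baby', 'baby'],
                          'label': ['hello', 'hello', 'why']})

# Accumulate 'word|' chunks so labels stay intact, then split on the
# delimiter and de-duplicate while keeping first-seen order.
out = (dummy_boo.groupby(['user_id', 'meet_id'])['label']
       .transform(lambda x: (x + '|').cumsum())
       .str.rstrip('|')
       .str.split('|')
       .apply(lambda ws: ', '.join(dict.fromkeys(ws))))
print(out.tolist())  # ['hello', 'hello', 'hello, why']
```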

3 Answers


value_counts and cumsum

value_counts sorts by descending count
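A one-line check of that claim (my illustration): the most frequent value comes first, which is what puts the majority label at the front of each group's cumulative string.

```python
import pandas as pd

# value_counts orders by descending count: 'A' appears twice, 'B' once.
print(pd.Series(['B', 'A', 'A']).value_counts().index.tolist())  # ['A', 'B']
```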

cols = ['meet_id', 'user_id']

# value_counts sorts each group's labels by descending frequency;
# cumsum over the labels then builds the cumulative strings.
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].cumsum().str.join(', ')
)

# look each row's cumulative label up by its (meet_id, user_id, label) key
mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])

   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  B, C, A
13      3d        3  npq     H, G
14      3d        3  npq        H
15      3d        3  npq        H
16      3d        3  npq        H
17      3e        4   tt        A
18      3e        5   op        A
19      3r        6   li        B
20      3w        1  abc     B, E
21      3w        2  xyz        G
22      3w        1  abc        B
23      3w        1  abc        B

Include sorted as well

cols = ['meet_id', 'user_id']
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].cumsum().apply(sorted).str.join(', ')
)

mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])

   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  A, B, C
13      3d        3  npq     G, H
14      3d        3  npq        H
15      3d        3  npq        H
16      3d        3  npq        H
17      3e        4   tt        A
18      3e        5   op        A
19      3r        6   li        B
20      3w        1  abc     B, E
21      3w        2  xyz        G
22      3w        1  abc        B
23      3w        1  abc        B

And to adjust for words rather than single characters

cols = ['meet_id', 'user_id']
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].add('|').cumsum().apply(
        lambda e: ', '.join(sorted(e.strip('|').split('|')))
    )
)

mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])

Old Answer

With transform and a custom cumulative unique function

from collections import Counter

def cum_unique(x):
    return pd.Series(list(map(
        Counter, x
    ))).cumsum().str.join(', ')

mytable.assign(label=mytable.groupby('meet_id').label.transform(cum_unique))

   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  B, C, A

Shortened version

mytable.assign(label=mytable.groupby('meet_id').label.transform(
    lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ')
))
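A quick note on why the Counter trick works (my sketch, not part of the original answer): `Counter` supports `+`, so cumsum over a Series of Counters accumulates every label seen so far, and `.str.join(', ')` joins each Counter's keys. It also shows where the later word-splitting problem comes from: `Counter('hello')` counts characters, so whole-word labels would need to be wrapped first, e.g. `Counter([label])`:

```python
from collections import Counter

import pandas as pd

labels = pd.Series(['A', 'A', 'B'])

# Counter + Counter merges counts, so cumsum builds a running multiset.
counters = pd.Series([Counter([v]) for v in labels])  # Counter([v]) keeps v whole
print(counters.cumsum().str.join(', ').tolist())      # ['A', 'A', 'A, B']

# Counter(string) counts characters -- the root of the word-splitting issue.
print(Counter('hello'))
```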

Per comment

by liliscent

We can sort first by meet_id and group size

sizes = mytable.groupby(['meet_id', 'label']).label.transform('size')

m1 = mytable.assign(sizes=sizes).sort_values(
    ['meet_id', 'sizes'], ascending=[True, False]).drop(columns='sizes')
m1

m1.assign(label=m1.groupby('meet_id').label.transform(
    lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ')
)).reindex(mytable.index)

2 Comments

Hi Sir, can you help me with this? I really cannot find the problem here: mytable.groupby(['user_id', 'meet_id', 'text'])['label'].apply(lambda x: x.rolling(len(x), min_periods=1).apply(lambda x: ','.join(list(x))))
OP wants "higher frequency for that group is left untouched", not in the original order. OP's data accidentally satisfies this condition. Strangely, it seems all the answers here are flawed...

You could try something like the following:

mytable['label'] = (mytable.groupby('meet_id')
                    .label.transform(lambda x: list(x.cumsum()))
                    .apply(set))

>>> mytable
   user_id  meet_id text      label
0       3c        1  abc        {A}
1       3c        1  abc        {A}
2       3c        1  abc        {A}
3       3c        1  abc        {A}
4       3c        1  abc        {A}
5       3c        1  abc     {A, B}
6       3c        1  abc     {A, B}
7       3c        2  xyz        {B}
8       3c        2  xyz        {B}
9       3c        2  xyz        {B}
10      3c        2  xyz     {C, B}
11      3c        2  xyz     {C, B}
12      3c        2  xyz  {C, B, A}

If you want to get rid of the set data type and just have it as a string (as in your desired output), you could apply ', '.join(sorted(set(x))) instead of simply set (thanks @Wen and @ScottBoston):

mytable['label'] = (mytable.groupby('meet_id')
                    .label.transform(lambda x: list(x.cumsum()))
                    .apply(lambda x: ', '.join(sorted(set(x)))))
>>> mytable
   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  A, B, C

4 Comments

add join maybe ?
@Wen, yeah, you're right, that way it matches the expected output, though I would actually prefer to have it as a set I think... Thanks, added as an edit :)
(mytable.groupby('meet_id') .label.transform(lambda x: list(x.cumsum())) .apply(lambda x: sorted(set(x)))) gets back OP's sorted list. I like this solution. +1
I got the same results without using list in lambda x:

Edit: Okay, a much simpler solution:

mytable['label'] = mytable.groupby(['user_id','meet_id','text'])['label']\
       .apply(lambda x: x.cumsum()).apply(lambda x: sorted(set(x)))

My ugly attempt:

mytable['label'] = mytable.groupby(['user_id','meet_id','text'])['label']\
      .apply(lambda x: x.cumsum().str.extractall('(.)')\
                        .groupby(level=0)[0].apply(lambda x: sorted(set(x))))

Output:

   user_id  meet_id text      label
0       3c        1  abc        [A]
1       3c        1  abc        [A]
2       3c        1  abc        [A]
3       3c        1  abc        [A]
4       3c        1  abc        [A]
5       3c        1  abc     [A, B]
6       3c        1  abc     [A, B]
7       3c        2  xyz        [B]
8       3c        2  xyz        [B]
9       3c        2  xyz        [B]
10      3c        2  xyz     [B, C]
11      3c        2  xyz     [B, C]
12      3c        2  xyz  [A, B, C]
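For what it's worth (my note), the regex step in the longer attempt works because `extractall('(.)')` explodes each cumulative string into one row per character, which is also why this route only handles one-letter labels:

```python
import pandas as pd

# extractall('(.)') captures every single character as its own match row.
s = pd.Series(['AB'])
print(s.str.extractall('(.)')[0].tolist())  # ['A', 'B']
```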

