
I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this.

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'], 'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B'],})

    A   B   C   D
0   A   B   A   A
1   B   C   B   C
2   A   D   D   C
3   B   A   D   B

I have to create new columns containing the number of occurrences of certain strings in each row. Currently I do it like this:

for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)

   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1

However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?

  • the column names and the values are the same for a reason..? Commented Aug 12, 2020 at 13:59
  • No, I just did it like this for simplicity. Commented Aug 12, 2020 at 14:29

4 Answers


Use stack + str.get_dummies, then sum on level=0 and join the result back to df:

df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))

Result:

print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
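One compatibility caveat worth flagging (not part of the original answer): `sum(level=0)` was deprecated in pandas 1.5 and removed in 2.0, so on current pandas the same idea can be sketched with a `groupby` on the stacked index instead:

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'],
                   'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B']})

# groupby(level=0) on the stacked (row, column) MultiIndex replaces the
# removed sum(level=0) shortcut; the result is the same counts table.
df1 = df.join(df.stack().str.get_dummies().groupby(level=0).sum().add_prefix('n_'))
print(df1)
```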


Rather than using apply to loop over each row, I looped over each letter and computed its count with a vectorized comparison across the columns:

for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)

This is an improvement in this example, but (from quick testing not shown) it can be roughly equal or worse depending on the shape and size of the data, and probably on how many strings you are looking for.
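With many target strings (the OP mentions 250 in a comment below), inserting the count columns one at a time can also fragment the DataFrame; a sketch that builds all counts in a dict comprehension and attaches them with a single concat (the `targets` list here is just the example letters):

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'],
                   'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B']})

# One vectorized comparison per target string, then a single concat
# instead of hundreds of individual column insertions.
targets = ['A', 'B', 'C', 'D']
counts = pd.concat({f'n_{t}': (df == t).sum(axis=1) for t in targets}, axis=1)
out = pd.concat([df, counts], axis=1)
```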

Some time comparisons:

%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)    
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And for other answers here:

%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df1=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1 Comment

Thank you, this method is the fastest, also on the actual data. I have to do this for 250 string values (creating 250 new columns), checking strings in 10 columns. My way was 757 seconds, yours 5.9 seconds. Massive improvement!

Try get_dummies and sum with level=0; here we do not need stack :-)

df=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
Out[57]: 
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
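As with the stack answer above, `sum(1, level=0)` was removed in pandas 2.0. One way to collapse the duplicate `n_*` columns on current pandas (a sketch, not the original answer) is to transpose, group the duplicated labels, and transpose back:

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'],
                   'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B']})

# get_dummies with a shared prefix yields duplicate column names
# (one n_X block per source column); grouping the transposed frame
# on its flat index sums those duplicates per original row.
dummies = pd.get_dummies(df, prefix='n', prefix_sep='_')
out = df.join(dummies.T.groupby(level=0).sum().T)
```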



You could do:

counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
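One caveat worth knowing about this approach: `value_counts` only reports the letters present in each row, so `fillna(0)` leaves float columns, and a letter missing from every row gets no column at all. A sketch that reindexes against an explicit letter list (here just the example letters) and casts back to int:

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'],
                   'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B']})

# Reindexing guarantees a column per target letter even if it never
# occurs, and astype(int) undoes the float upcast from the NaNs.
letters = ['A', 'B', 'C', 'D']
counts = (df.apply(lambda s: s.value_counts(), axis=1)
            .reindex(columns=letters)
            .fillna(0)
            .astype(int)
            .add_prefix('n_'))
out = df.join(counts)
```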
