
NumPy's string functions are all very slow and perform worse than plain Python lists. I am looking to optimize the usual string functions using Cython.

For instance, let's take a numpy array of 100,000 unicode strings, with dtype either unicode or object, and lowercase each one.

import numpy as np

alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')

%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using a plain list comprehension is just as fast (slightly faster, in fact):

%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For the object dtype we cannot use np.char at all. A list comprehension over it is about 3x as fast as np.char.lower on the unicode array:

%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.

import numpy as np
cimport numpy as np
from numpy cimport ndarray

def lower(ndarray[object] arr):
    # the loop itself runs at C speed, but each element still goes
    # through a Python-level str.lower call
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        result[i] = arr[i].lower()
    return result

This yields a modest improvement over the list comprehension:

%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
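(For reference, I build these Cython snippets with a minimal setup.py along the lines of the sketch below; the module name cython_lower is just whatever I happened to call the .pyx file, so treat it as a placeholder.)

# setup.py -- minimal build sketch for the Cython snippets in this question
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "cython_lower",
    sources=["cython_lower.pyx"],
    include_dirs=[np.get_include()],  # needed because the .pyx does 'cimport numpy'
)
setup(ext_modules=cythonize([ext]))

followed by python setup.py build_ext --inplace.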

I have tried accessing the memory directly with the data ndarray attribute like this:

def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        # no idea here

I believe data is one contiguous block of memory holding all the raw bytes one after the other. Accessing those bytes is extremely fast, and it seems that operating directly on them could improve performance by two orders of magnitude. I found a tolower function in C/C++ that might work, but I don't know how to hook it into Cython.
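For what it's worth, hooking a plain C function such as tolower from ctype.h into Cython only takes a cdef extern block. Here is a rough, byte-wise sketch (the module and function names are placeholders, and it shares the ASCII-only limitation of the approach in the update below):

# lower_ctype.pyx -- illustrative sketch only; names are placeholders
from numpy cimport ndarray

cdef extern from "ctype.h":
    int tolower(int c)

def lower_ascii_bytes(ndarray arr):
    # walk the raw buffer byte by byte; with the default C locale this only maps A-Z
    cdef char *data = arr.data
    cdef Py_ssize_t i, n = arr.shape[0] * arr.itemsize
    for i in range(n):
        data[i] = <char> tolower(<unsigned char> data[i])
    return arr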

Update with fastest method (doesn't work for unicode)

Here is the fastest method I found by far, from another SO post. It lowercases all the ASCII characters by accessing the array's raw memory via the data attribute. I think it will also mangle non-ASCII characters that happen to contain bytes between 65 and 90, but the speed is very good.

cdef void f(char *a, int itemsize, int shape):
    # walk every byte of the buffer and shift ASCII A-Z (65-90) down to a-z
    cdef int i
    cdef int num
    for i in range(shape * itemsize):
        num = a[i]
        if 65 <= num <= 90:
            a[i] += 32

def lower_fast(ndarray arr):
    # modifies arr in place and returns it
    cdef char *inp = arr.data
    f(inp, arr.itemsize, arr.shape[0])
    return arr

This is roughly 100x faster than the other approaches and is the kind of speed I am looking for.

%timeit lower_fast(arr)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
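Since numpy's 'U' dtype stores fixed-width UCS-4 code points (4 bytes per character, native byte order), a variant that walks the buffer 4 bytes at a time should avoid the mangling issue while still only handling ASCII letters. An untested sketch, with placeholder names:

# ucs4_lower.pyx -- illustrative sketch; assumes a contiguous, native-byte-order 'U' array
from libc.stdint cimport uint32_t
from numpy cimport ndarray

cdef void _ascii_lower_ucs4(uint32_t *cp, Py_ssize_t n):
    cdef Py_ssize_t i
    for i in range(n):
        # compare whole 4-byte code points, so bytes inside non-ASCII characters are never touched
        if 65 <= cp[i] <= 90:
            cp[i] += 32

def lower_fast_ucs4(ndarray arr):
    # a 'U' array stores one UCS-4 code point (4 bytes) per character
    _ascii_lower_ucs4(<uint32_t *> arr.data, arr.shape[0] * (arr.itemsize // 4))
    return arr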
  • What's arr? You defined alist, arr_unicode, and arr_object, but then referred to some unknown arr. Commented Dec 27, 2017 at 21:14
  • Thanks @MaxU. That's a much better string for testing. Commented Dec 27, 2017 at 22:05
  • Hmm, I think I got the point. Thanks guys. Appreciate the feedback! Commented Dec 27, 2017 at 22:07
  • (Also, consider that Unicode case mappings aren't all 1 character to 1 character. For example, 'ß'.upper() == 'SS'. str.upper handles this fine, but np.char.upper doesn't realize the result needs more room.) Commented Dec 27, 2017 at 22:45
  • @usr2564301: I believe the capital Eszett was nonstandard in German orthography until very recently. Standard practice was to use SS. While Unicode has encoded the capital Eszett since 5.1, the Unicode standard still specifies SS as the capital of ß. See the Unicode FAQ (although that entry might currently be a bit out of date). Commented Dec 28, 2017 at 1:13

1 Answer

This was only slightly faster than the list comprehension on my machine, but if you want real Unicode support it might be the fastest way of doing it. You'll need to apt-get install libunistring-dev, or whatever is appropriate for your OS / package manager.

In some C file, say, _lower.c, have

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistr.h>
#include <unicase.h>

/* Lowercase a buffer of NUL-padded UTF-8 strings in place. */
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while(s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {
            s_ptr += 1;
            continue;
        }
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}

Then, in lower.pxd you do

cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)

And finally, in lower.pyx:

from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char * _arr
    _arr = <unsigned char *> arr.data
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
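In case the build step isn't obvious, a setup.py roughly like the sketch below should work; the -lunistring link is the important part, and since _lower.c is pulled in textually by the cdef extern block it doesn't need to be listed as a source:

# setup.py -- rough build sketch for lower.pyx / lower.pxd / _lower.c
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "lower",
    sources=["lower.pyx"],
    include_dirs=[np.get_include()],
    libraries=["unistring"],   # link against libunistring for u8_strlen / u8_tolower
)
setup(ext_modules=cythonize([ext]))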

On my laptop, I got 46ms for the list comprehension you had above and 37ms for this method (and 0.8ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.

There are a few possible improvements, though I don't know whether they will make much of a difference:

  • arr.data is a flat buffer of fixed-width items (I don't use numpy much), with the ends of the shorter strings padded with \x00s. I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times (the EDIT below does exactly this).
  • I'm doing a lot of freeing/memcpying. You can probably avoid that if you're clever.
  • I think it's the case that every lowercase unicode character is at most as wide as its uppercase variant, so this should not run into any segfaults or buffer overwrites or just overlapping substring issues, but don't take my word for that.

Not really an answer, but hope it helps your further investigations!

PS You'll notice that this does the lowering in-place, so the usage would be like this:

>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower_2(arr_unicode)
>>> for x in arr_unicode:
...     print x
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо

>>> from time import time
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.037489891052246094

EDIT

DUH, you modify the C function to look like this

void _c_tolower(uint8_t  **s, uint32_t total_len) {
    size_t lower_len;
    uint8_t *lowered;

    lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
    memcpy(*s, lowered, lower_len);
    free(lowered);
}

and then it does it all in one go. This looks more dangerous in terms of possibly having something from the old data left over if lower_len is shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL PROBABLY BREAK.

Anyway, ~40% faster this way:

>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.022463043975830078 

5 Comments

Why pass &arr and not just arr? You never assign *s =, so I see no reason to.
Also, this fails on non-contiguous arrays like arr_unicode[::2]
@Eric you're right -- it was from a first version of the function I tried before using memcpy. And, definitely. I mostly wrote it up because OP said he's not sure how to hook his C++ function into python, so it's mostly illustrative.
Thanks very much for putting this all together. I think a more straightforward approach would be to not do the operation in place, but instead pass each individual string to to_lower in a for loop and return the new string. This would avoid any memory copying.
Yes, but then a) your new string will not be wrapped in a numpy.ndarray object, and b) if you are not careful you will have a potential memory leak. It would also be slower to do that one string at a time, because each string would be converted from a python unicode object to an unsigned char * and then back. When possible, send as little data as possible, as few times as possible, through the python-C barrier.
