NumPy's string functions are all very slow and less performant than looping over a plain Python list. I am looking to optimize all the normal string functions using Cython.
For instance, let's take a numpy array of 100,000 unicode strings, with data type either unicode or object, and lowercase each one.
import numpy as np

alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a list comprehension is just as fast
%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the object data type, we cannot use np.char. A list comprehension over the object array is about 3x as fast as np.char.lower on the unicode array.
%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
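To make that point concrete: the np.char functions expect a fixed-width bytes or unicode dtype, so calling one on the object array simply fails (the exact error message may vary across NumPy versions).

np.char.lower(arr_object)
# TypeError: string operation on non-string array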
The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.
import numpy as np
cimport numpy as np
from numpy cimport ndarray
def lower(ndarray[object] arr):
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        result[i] = arr[i].lower()
    return result
This yields a modest improvement
%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
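For reference, a minimal build setup for the extension above (a sketch; the file name str_lower.pyx is just a placeholder). The include_dirs entry supplies the NumPy headers that cimport numpy needs.

# setup.py -- minimal build script; "str_lower.pyx" is a placeholder file name
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("str_lower.pyx"),
    include_dirs=[np.get_include()],  # headers required by `cimport numpy`
)

Build it in place with python setup.py build_ext --inplace and import the function from the compiled module.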
I have tried accessing the memory directly with the data ndarray attribute like this:
def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        pass  # no idea here
I believe data is one contiguous block of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast, and it seems that working directly on the raw bytes could improve performance by two orders of magnitude. I found a C/C++ tolower function that might work, but I don't know how to hook it in with Cython.
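For what it's worth, here is a sketch of how the C standard library's tolower could be wired in from Cython (assuming that is the function meant above; note that tolower works one byte at a time, so on this buffer it has the same ASCII-only behavior as the approach below).

# sketch: expose <ctype.h> tolower to Cython and apply it byte-by-byte
cdef extern from "ctype.h":
    int tolower(int c)

cdef void ascii_lower_bytes(char *data, Py_ssize_t nbytes):
    cdef Py_ssize_t i
    for i in range(nbytes):
        # cast through unsigned char so bytes >= 128 are not passed as negative ints
        data[i] = <char>tolower(<unsigned char>data[i])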
Update with fastest method (ASCII only)
Here is the fastest method I found by far, from another SO post. This lowercases all the ASCII characters by walking the raw bytes of the array buffer through the data attribute. I think it will mangle non-ASCII characters whose individual UTF-32 bytes happen to fall between 65 and 90, but the speed is very good.
cdef void f(char *a, int itemsize, int shape):
    cdef int i
    cdef int num
    for i in range(shape * itemsize):
        num = a[i]
        if 65 <= num <= 90:   # ASCII 'A'..'Z'
            a[i] += 32        # shift down to 'a'..'z'

def lower_fast(ndarray arr):
    cdef char *inp = arr.data
    f(inp, arr.itemsize, arr.shape[0])
    return arr
This is 100x faster than the others and what I am looking for.
%timeit lower_fast(arr)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
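A possible refinement (a sketch only, not timed here): NumPy's unicode dtype stores each character as a 4-byte UTF-32 code point, so the same buffer can be walked as 32-bit integers instead of single bytes. Testing 65 <= code point <= 90 at that level still lowercases only ASCII letters, but it no longer risks mangling non-ASCII characters whose individual bytes fall in that range.

import numpy as np
cimport numpy as np
from numpy cimport ndarray

cdef void ascii_lower_codepoints(np.uint32_t *cp, Py_ssize_t n):
    cdef Py_ssize_t i
    for i in range(n):
        if 65 <= cp[i] <= 90:   # code points 'A'..'Z' only
            cp[i] += 32         # shift to 'a'..'z'

def lower_ascii(ndarray arr):
    # assumes a C-contiguous, native-endian 'U' array; each code unit is 4 bytes,
    # so itemsize // 4 is the padded length of every string in the array
    ascii_lower_codepoints(<np.uint32_t *>arr.data, arr.shape[0] * (arr.itemsize // 4))
    return arr

Non-ASCII uppercase letters (such as the Cyrillic in 'УКРАЇНА') are left unchanged rather than lowercased, so this is still not a full lower().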
Comments:

arr? You defined alist, arr_unicode, and arr_object, but then referred to some unknown arr.

'ß'.upper() == 'SS'; str.upper handles this fine, but np.char.upper doesn't realize the result needs more room.