from fastai.text.all import *
chunked??
Let's look at how long it takes to tokenize a sample of 1,000 IMDB reviews.
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head(2)
| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False |
| 1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... | False |
ss = L(list(df.text))
ss[0]
"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!"
We'll start with the simplest approach:
def delim_tok(s, delim=' '): return L(s.split(delim))
s = ss[0]
delim_tok(s)
(#69) ['Un-bleeping-believable!','Meg','Ryan',"doesn't",'even','look','her','usual','pert','lovable'...]
...and a general way to tokenize a bunch of strings:
def apply(func, items): return list(map(func, items))
Let's time it:
%%timeit -n 2 -r 3
global t
t = apply(delim_tok, ss)
29.9 ms ± 722 µs per loop (mean ± std. dev. of 3 runs, 2 loops each)
...and the same thing with 2 workers:
%%timeit -n 2 -r 3
parallel(delim_tok, ss, n_workers=2, progress=False)
325 ms ± 3.52 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)
How about if we put half the work in each worker?
batches32 = [L(list(o)).map(str) for o in np.array_split(ss, 32)]
batches8  = [L(list(o)).map(str) for o in np.array_split(ss, 8)]
batches   = [L(list(o)).map(str) for o in np.array_split(ss, 2)]
%%timeit -n 2 -r 3
parallel(partial(apply, delim_tok), batches, progress=False, n_workers=2)
146 ms ± 7.26 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)
So there's a lot of overhead in using parallel processing in Python. :(
Let's see why. What if we do nothing interesting in our function?
%%timeit -n 2 -r 3
global t
t = parallel(noop, batches, progress=False, n_workers=2)
52 ms ± 5.06 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)
That's quite fast! (Although still slower than single process.)
What if we don't return much data?
def f(x): return 1
%%timeit -n 2 -r 3
global t
t = parallel(f, batches, progress=False, n_workers=2)
44.7 ms ± 2.82 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)
That's a bit faster still.
What if we still create the lists of tokens, but don't actually return them?
def f(items):
o = [s.split(' ') for s in items]
return [s for s in items]
%%timeit -n 2 -r 3
global t
t = parallel(f, batches, progress=False, n_workers=2)
65.4 ms ± 445 µs per loop (mean ± std. dev. of 3 runs, 2 loops each)
So creating the tokens isn't what takes the time; returning them over the process boundary is.
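One way to see that process-boundary cost directly (a side sketch of mine, not part of the original experiment) is to time just the serialization step: `parallel` has to pickle results in the worker and unpickle them in the parent, which the single-process version never pays for. Here I use made-up stand-in strings rather than the IMDB sample:

```python
import pickle, timeit

# Hypothetical stand-in for the IMDB sample: 1,000 short "reviews".
docs = ["this movie was surprisingly good fun to watch"] * 1000
toks = [s.split(' ') for s in docs]

# Round-tripping the token lists through pickle approximates the
# per-result cost of crossing the process boundary.
secs = timeit.timeit(lambda: pickle.loads(pickle.dumps(toks)), number=10) / 10
size = len(pickle.dumps(toks))
print(f"{size/1e6:.2f} MB pickled, {secs*1000:.1f} ms per round-trip")
```

The many small list and string objects make the pickle both large and slow relative to the trivial `split` itself, which is consistent with the timings above.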
Is numpy any faster?
sarr = np.array(ss)
%%timeit -n 2 -r 3
global t
t = np.char.split(sarr)
53.2 ms ± 55.4 µs per loop (mean ± std. dev. of 3 runs, 2 loops each)
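Not really. A quick check (on made-up strings) shows why: `np.char.split` returns an object array holding ordinary Python lists, so it still builds the same per-string list objects under the hood rather than doing anything vectorized:

```python
import numpy as np

# Two made-up strings, standing in for the reviews.
arr = np.array(["Meg Ryan stars", "first-rate film"])
out = np.char.split(arr)

print(out.dtype)  # object
print(out[0])     # → ['Meg', 'Ryan', 'stars']
```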
from spacy.lang.en import English
def conv_sp(doc): return L(doc).map(str)
class SpTok:
def __init__(self):
nlp = English()
self.tok = nlp.Defaults.create_tokenizer(nlp)
    def __call__(self, x): return L(self.tok(str(x))).map(str)
Let's see how long it takes to create a tokenizer in spaCy:
%%timeit -n 2 -r 3
SpTok()
478 ms ± 21 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)
nlp = English()
sp_tokenizer = nlp.Defaults.create_tokenizer(nlp)
def spacy_tok(s): return L(sp_tokenizer(str(s))).map(str)
Let's time tokenization in spaCy using a loop:
%%timeit -r 3
global t
t = apply(spacy_tok, ss)
2.63 s ± 37.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
...and the same thing in parallel:
%%timeit -r 3
global t
t = parallel(partial(apply, spacy_tok), batches, progress=False, n_workers=2)
1.65 s ± 28.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
...and with more workers:
%%timeit -r 3
global t
t = parallel(partial(apply, spacy_tok), batches8, progress=False, n_workers=8)
527 ms ± 5.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
...and with creating the tokenizer in the child process:
def f(its):
tok = SpTok()
return [[str(o) for o in tok(p)] for p in its]
%%timeit -r 3
global t
t = parallel(f, batches8, progress=False, n_workers=8)
2.08 s ± 10.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Let's try `nlp.tokenizer.pipe`:
%%timeit -r 3
global t
t = L(nlp.tokenizer.pipe(ss)).map(conv_sp)
2.51 s ± 41.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
def f(its): return L(nlp.tokenizer.pipe(its)).map(conv_sp)
%%timeit -r 3
global t
t = parallel(f, batches8, progress=False, n_workers=8)
539 ms ± 5.86 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
test_eq(chunked(range(12),n_chunks=4), [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
test_eq(chunked(range(11),n_chunks=4), [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]])
test_eq(chunked(range(10),n_chunks=4), [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]])
test_eq(chunked(range( 9),n_chunks=3), [[0, 1, 2], [3, 4, 5], [6, 7, 8]])
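The `n_chunks` behaviour tested above can be sketched in a few lines (a rough reimplementation for illustration, not fastcore's actual code): every chunk gets `ceil(len/n_chunks)` items except possibly the last.

```python
import math

def chunked_sketch(it, n_chunks):
    "Split `it` into `n_chunks` pieces; all but the last have the same size."
    items = list(it)
    sz = math.ceil(len(items) / n_chunks)
    return [items[i:i+sz] for i in range(0, len(items), sz)]

chunked_sketch(range(11), 4)  # → [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
```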
%%timeit -r 3
global t
t = parallel_chunks(f, ss, n_workers=8, progress=False)
607 ms ± 7.87 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
def array_split(arr, n): return chunked(arr, math.floor(len(arr)/n))
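One caveat worth checking: because this uses `floor`, the fixed chunk size can undershoot `len(arr)/n`, so you can end up with more than `n` chunks. A self-contained check, with `chunked` stood in by a hypothetical fixed-size chunker:

```python
import math

def chunked_by_size(items, chunk_sz):
    "Fixed-size chunking, standing in for fastcore's `chunked`."
    items = list(items)
    return [items[i:i+chunk_sz] for i in range(0, len(items), chunk_sz)]

def array_split(arr, n): return chunked_by_size(arr, math.floor(len(list(arr))/n))

print(array_split(range(10), 4))  # floor(10/4)=2, so 5 chunks of 2, not 4 chunks
```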