Here we review the common optimizers used in machine learning and compare how quickly they fit a simple linear regression.

Gradient Descent

In [1]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Data

Let's use a simple dataset: average salaries of developers and machine learning engineers in five Chinese cities in 2019.

In [2]:
# developer salary in Beijing, Shanghai, Hangzhou, Shenzhen and Guangzhou in 2019
x = [13854, 12213, 11009, 10655, 9503]
x = np.reshape(x, newshape=(5, 1)) / 10000.0  # scale down to keep the numbers small

# machine learning engineer salary in the same five cities
y = [21332, 20162, 19138, 18621, 18016]
y = np.reshape(y, newshape=(5, 1)) / 10000.0

Functions

Objective function:

$$ y = ax + b + \epsilon $$

Cost function, where $\hat{y}_i = a x_i + b$ is the model's prediction:

$$ J(a,b) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$

Optimization function (the optimizer), with learning rate $\alpha$:

$$ \theta = \theta - \alpha \frac{\partial J}{\partial \theta} $$

Here, for the univariate linear regression:

$$ a = a - \alpha \frac{\partial J}{\partial a} $$
$$ b = b - \alpha \frac{\partial J}{\partial b} $$

Here $\frac{\partial J}{\partial a}$ and $\frac{\partial J}{\partial b}$ are:

$$ \frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=1}^{n}x_i(\hat{y}_i-y_i) $$
$$ \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i) $$
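These follow from the chain rule applied to the cost function; for the slope $a$, for example,

$$ \frac{\partial J}{\partial a} = \frac{1}{2n}\sum_{i=1}^{n}2\,(y_i-\hat{y}_i)\,\frac{\partial(y_i-\hat{y}_i)}{\partial a} = \frac{1}{2n}\sum_{i=1}^{n}2\,(y_i-\hat{y}_i)(-x_i) = \frac{1}{n}\sum_{i=1}^{n}x_i(\hat{y}_i-y_i) $$

and the same steps without the factor $x_i$ give $\frac{\partial J}{\partial b}$.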
In [3]:
def model(a, b, x):
    return a*x + b

def cost_function(a, b, x, y):
    # halved mean squared error
    n = 5
    return 0.5/n * (np.square(y-a*x-b)).sum()

# despite the name, this is full-batch gradient descent:
# the dataset is so small that every step uses all five points
def sgd(a, b, x, y):
    n = 5
    alpha = 1e-1                          # learning rate
    y_hat = model(a, b, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()    # dJ/da
    db = (1.0/n) * ((y_hat-y).sum())      # dJ/db
    a = a - alpha*da
    b = b - alpha*db
    return a, b
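As a quick sanity check, the analytic gradients can be compared against central finite differences of the cost function. This is a minimal sketch; the helper name numerical_grad and the step size h are just ad-hoc choices here.

def numerical_grad(a, b, x, y, h=1e-6):
    # central finite differences of the cost with respect to a and b
    da = (cost_function(a+h, b, x, y) - cost_function(a-h, b, x, y)) / (2*h)
    db = (cost_function(a, b+h, x, y) - cost_function(a, b-h, x, y)) / (2*h)
    return da, db

# compare with the analytic gradients at the starting point (a, b) = (0, 0)
y_hat = model(0.0, 0.0, x)
print((1.0/5)*((y_hat-y)*x).sum(), (1.0/5)*((y_hat-y).sum()))
print(numerical_grad(0.0, 0.0, x, y))

Both pairs should agree to several decimal places.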
In [4]:
def iterate_sgd(a,b,x,y,times):
    for i in range(times):
        a,b = sgd(a,b,x,y)

    y_hat=model(a,b,x)
    cost = cost_function(a, b, x, y)
    print(a,b,cost)
    plt.scatter(x,y)
    plt.plot(x,y_hat)
    plt.show()
    return a,b, cost
In [5]:
a=0
b=0
_, _, sgd_cost = iterate_sgd(a,b,x,y,100)
0.950768563083351 0.8552812669346652 0.00035532090622957674
In [6]:
sgd_cost
Out[6]:
0.00035532090622957674

After 100 iterations the regression has nearly converged. We record the final cost so that, for each optimizer explored below, we can compare how many iterations it takes to reach the same cost.

In [7]:
def iterate(a, b, x, y, target_cost, func):
    # run the optimizer `func` until the cost drops below target_cost,
    # returning the iteration index at which that happens
    for i in range(1000):
        a, b = func(a, b, x, y)
        cost = cost_function(a, b, x, y)
        if cost < target_cost:
            break
    return i

This is just a small sanity check to see if it works in reverse: starting again from a = b = 0, how many iterations does plain gradient descent need to reach the recorded cost?

In [8]:
iterate(a,b,x,y, sgd_cost, sgd)
Out[8]:
100
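For reference, the cost floor that all of the optimizers below are heading toward can be computed in closed form. A minimal sketch using np.polyfit (any least-squares routine would do):

# exact least-squares fit of the same line
a_opt, b_opt = np.polyfit(x.flatten(), y.flatten(), deg=1)
print(a_opt, b_opt, cost_function(a_opt, b_opt, x, y))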

Momentum

Momentum optimization was proposed by Boris Polyak in 1964. Compared with plain gradient descent, it updates the parameter $\theta$ through an extra term, the momentum $m$, which keeps a fraction $\beta$ of its previous value. While the gradient keeps pointing in the same direction, the momentum builds up and the optimum is reached sooner; when the gradient changes direction, the momentum shrinks, then builds up again once the direction settles.

$$ m = \beta m - \alpha \frac{\partial J}{\partial \theta} $$
$$ \theta = \theta + m $$
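Unrolling the recursion (with $m$ initialized to zero) shows that the momentum is an exponentially weighted sum of all past gradients,

$$ m_t = -\alpha\sum_{k=1}^{t}\beta^{\,t-k}\,\frac{\partial J}{\partial \theta}\Big|_k $$

so gradients that keep pointing in the same direction reinforce each other (by up to a factor of $1/(1-\beta) = 10$ for $\beta = 0.9$), while gradients that flip sign largely cancel.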
In [9]:
def momentum(a, b, ma, mb, x, y):
    n = 5
    alpha = 1e-1
    beta = 0.9
    y_hat = model(a, b, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()
    db = (1.0/n) * ((y_hat-y).sum())
    # accumulate momentum, then step along it
    ma = beta*ma - alpha*da
    mb = beta*mb - alpha*db
    a = a + ma
    b = b + mb
    return a, b, ma, mb
In [10]:
def iterate_momentum(x, y, target_cost):
    a = 0
    b = 0
    ma = 0
    mb = 0
    for i in range(1000):
        a, b, ma, mb = momentum(a, b, ma, mb, x, y)
        print(f"{ma}\t{mb}")
        cost = cost_function(a, b, x, y)
        if cost < target_cost:
            break
    return i
In [11]:
iterate_momentum(x, y, target_cost=sgd_cost)
0.22441501579999998	0.19453800000000002
0.37422247273264564	0.32448006197140566
0.42205082627737606	0.3661434135397697
0.3669569357577927	0.31871477461500164
0.23200196838148313	0.2021526955988872
0.056494030025446756	0.050474753607803874
-0.11476727348134302	-0.09754962817390435
-0.24244627038078237	-0.20787942869918896
-0.3012619304723101	-0.25863596662407884
-0.28445487409046366	-0.24396840343676746
-0.20350545891969496	-0.17380977569712325
-0.08364332123714681	-0.0699911702901115
0.04338773008400103	0.040019175300583965
0.147354432709795	0.13006006211489604
0.20640546147925654	0.1812234868328625
0.21130873793163232	0.1855214000311385
0.16633382496141397	0.146649293292915
0.08690993480643269	0.0779595676235392
-0.00507401470797135	-0.0016055486586756679
-0.08699979992056481	-0.07247278613120942
-0.1408466468646064	-0.11904732814606164
-0.15691741066134915	-0.13293724917152472
-0.1352587212815897	-0.1141824310137057
-0.08467582859413841	-0.07040205626263701
-0.01981150872173977	-0.014266840612897652
0.0428388633316896	0.03994932131351764
0.0889440734138505	0.07984525590808778
0.10944946511171101	0.09758582125685551
0.10215246324062663	0.09126528657265356
0.07152894890411822	0.06475708853737841
0.02702573023397084	0.026236225728736323
-0.01963087668527036	-0.014149754660337372
-0.05738680603496786	-0.04683505435206044
-0.0783606725095876	-0.06499936572619444
-0.07935720748013644	-0.06587751992946814
-0.06214086021009116	-0.05099624589363139
-0.03252999577624324	-0.02539033468548647
0.0013599704310883268	0.003917662426907619
0.031231311028720562	0.029747420490065724
0.05054959967538386	0.04644447498706586
0.055885184899792356	0.051041064960017556
0.047399341455049274	0.04367682409456289
0.028447708284948785	0.02725561708852082
0.0044813976566155725	0.006494616802269353
-0.018428741728553195	-0.013352721762541375
-0.035064239440853656	-0.027770553086432857
-0.04218591581428813	-0.033955812608976274
Out[11]:
46

Voilà, we reach the same cost in just 46 iterations. The sign flips in the momentum trace above show it repeatedly overshooting the optimum before settling down.

Nesterov Accelerated Gradient

One small variant of momentum optimization, proposed by Yurii Nesterov in 1983, is almost always faster than vanilla momentum optimization. The Nesterov Accelerated Gradient (NAG) method, also known as Nesterov momentum optimization, measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta m$.

Update rule:

$$ m = \beta m - \alpha \frac{\partial J(\theta+\beta m)}{\partial \theta} $$

$$ \theta = \theta + m $$

[figure: Nesterov accelerated gradient]

For the linear regression:

$$ \frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=1}^{n}x_i(\hat{y}_i-y_i) $$
$$ \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i) $$

where the prediction $\hat{y}$ is now made at the look-ahead point (both here and in the code below, the full momentum term $m$ is used as the look-ahead instead of $\beta m$, a small simplification):

$$ \hat{y} = (a+m_a)x + (b+m_b) $$

In [12]:
def nesterov(a, b, ma, mb, x, y):
    n = 5
    alpha = 1e-1
    beta = 0.9
    # the only change from momentum(): the gradient is evaluated
    # at the look-ahead position (a+ma, b+mb) instead of (a, b)
    y_hat = model(a+ma, b+mb, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()
    db = (1.0/n) * ((y_hat-y).sum())
    ma = beta*ma - alpha*da
    mb = beta*mb - alpha*db
    a = a + ma
    b = b + mb
    return a, b, ma, mb
In [13]:
def iterate_nesterov(x, y, target_cost):
    a=0
    b=0
    ma=0
    mb=0
    for i in range(1000):
        a, b, ma, mb = nesterov(a,b, ma, mb, x,y)
        print(ma, mb)
        cost = cost_function(a, b, x, y)
        if cost<target_cost:
            break
    return i
In [14]:
iterate_nesterov(x, y, target_cost=sgd_cost)
0.22441501579999998 0.19453800000000002
0.3220564154452913 0.2793379239428112
0.3123373193267518 0.2712021012049902
0.23316116137430112 0.2029331301200368
0.12597323633919164 0.11039828486837797
0.02495663925447264 0.02318026316585743
-0.04849493913806344 -0.0402057492598321
-0.08649709799457425 -0.07293501030570788
-0.09201745463834532 -0.07757373717860767
-0.0745806082761438 -0.062362386777225415
-0.04587718632305795 -0.03741593563963194
-0.016328775542062378 -0.011751334705692
0.006910985026297077 0.008442254305498949
0.020532049949227454 0.020301604537172682
0.024559560846665853 0.02384948551560824
0.02123968741689477 0.021030536818391725
0.013779790759404111 0.014621078911301263
0.005283948295620864 0.0073079790564409405
-0.001933819282152938 0.0010943603685497844
-0.006624599047500719 -0.0029384067739360196
-0.008540833610722172 -0.004575531028581929
-0.008161152003470043 -0.004230677649705485
Out[14]:
21

Well, it takes only 21 iterations, less than half of what plain momentum needed.

AdaGrad

Consider the elongated-bowl problem: gradient descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, and then very slowly descends to the bottom of the valley. It would be nice if the algorithm could correct its direction earlier so that it points a bit more toward the global optimum. The AdaGrad algorithm achieves this correction by scaling down the gradient vector along the steepest dimensions.

With $\epsilon = 10^{-10}$:

$$ s = s + \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta} $$

$$ \theta = \theta - \alpha \frac{\partial J}{\partial \theta} \oslash \sqrt{s+\epsilon} $$

[figure: AdaGrad]
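Because $s$ only accumulates and never decays, the effective per-parameter step size

$$ \alpha_{\text{eff}} = \frac{\alpha}{\sqrt{s+\epsilon}} $$

can only shrink as training goes on. This adaptive scaling is what corrects the direction on elongated cost surfaces, but it is also why AdaGrad can slow down too early, as the iteration count below will show.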

In [59]:
def ada_grad(a, b, sa, sb, x, y):
    epsilon = 1e-10
    n = 5
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()
    db = (1.0/n) * ((y_hat-y).sum())
    # accumulate squared gradients; epsilon is folded into the
    # accumulator here instead of being added inside the square root
    sa = sa + da*da + epsilon
    sb = sb + db*db + epsilon
    a = a - alpha*da / np.sqrt(sa)
    b = b - alpha*db / np.sqrt(sb)
    return a, b, sa, sb
In [62]:
def iterate_ada_grad(x, y, target_cost):
    a=0
    b=0
    sa=0
    sb=0
    for i in range(1000):
        a, b, sa, sb = ada_grad(a,b, sa, sb, x,y)
        print(f"{sa}\t{sb}")
        cost = cost_function(a, b, x, y)
        if cost<target_cost:
            break
    return i
In [63]:
iterate_ada_grad(x, y, target_cost=sgd_cost)
5.036209931751423 3.7845033445000005
9.022051275308753 6.780559696352509
12.37760305876143 9.303433839935428
15.277271622803307 11.48402050624463
17.82104257560732 13.397362865404352
20.075029302440374 15.093083234334934
22.086658442962953 16.606779524600782
23.891795301524446 17.965372780402355
25.518582625798217 19.189988154097705
26.989713284113275 20.297660542873828
28.323871617739783 21.30241632396213
29.536696895192907 22.215996430588216
30.641454625769647 23.048360177403552
31.64952032868011 23.808048334075067
32.570737998208436 24.502452158091806
33.41369201864302 25.138017471314626
34.1859175911238 25.72040258936349
34.89406641053884 26.254602665813465
35.5440390853296 26.745049076615274
36.14109238402617 27.195689911681644
36.68992711613361 27.610055932293786
37.19476089865728 27.991315184952946
37.65938897276373 28.342318646661948
38.08723546173832 28.665638696230218
38.48139690171807 28.96360178615045
38.84467946537304 29.238316380907396
39.179630992211976 29.491696997550104
39.4885687078732 29.725485010754486
39.773603338143715 29.94126675204925
40.036660187143156 30.140489330578298
40.27949764286802 30.324474523037136
40.50372348972214 30.494431017700293
40.71080934136093 30.651465247698887
40.90210345416155 30.79659100891801
41.078842138921175 30.930738025829122
41.24215995373791 31.054759602568705
41.393098832737515 31.169439475341107
41.53261628206476 31.275497964778044
41.66159275534983 31.373597512470813
41.78083830489881 31.464347673911906
41.89109859152528 31.548309630076606
41.99306032474861 31.626000271475725
42.087356195645654 31.6978959014265
42.17456935664612 31.76443559928711
42.25523749575812 31.82602427929392
42.329856546897744 31.883035476277904
42.39888407301105 31.935813885795167
42.46274235438478 31.984677682985065
42.52182121083374 32.029920641686175
42.57648058323726 32.071814072927175
42.6270528971003 32.110608599810575
42.67384522837379 32.14653578397523
42.71714128963229 32.17980961722002
42.75720325283092 32.210627890463314
42.79427342321258 32.239173450973354
42.82857577747798 32.26561535771065
42.86031737804119 32.29010994365495
42.889689674048164 32.31280179313012
42.91686969881755 32.3338246413763
42.94202117245539 32.35330220293709
42.96529551758434 32.37134893482122
42.986832795402066 32.38807073985339
43.00676256863235 32.40356561513984
43.025204697347036 32.417924250135606
43.04227007311023 32.43123057840443
43.058061296420725 32.4435622868061
43.07267330199957 32.45499128552349
43.086193936081116 32.46558414205043
43.09870448951406 32.475402481997115
43.110280190159735 32.48450335933047
43.12099065778465 32.49293959844891
43.130900324380626 32.50076011029321
43.14006882260533 32.50801018451441
43.14855134481713 32.51473175955591
43.15639897497798 32.52096367235614
43.163658995515654 32.52674188924168
43.17037517106928 32.53209971945499
43.17658801088964 32.53706801264641
43.182335011525204 32.54167534155483
43.18765088129709 32.54594817100533
43.19256774794783 32.54991101426352
43.19711535074128 32.55358657770543
43.20132121819138 32.556995894687155
43.205210832506516 32.560158449430105
43.20880778175224 32.563092291674785
43.21213390065799 32.56581414279806
43.2152094009225 32.56833949403565
43.218052991807106 32.57068269740245
43.220681991746005 32.57285704985823
43.22311243164701 32.574874871224374
43.22535915050512 32.576747576319114
43.22743588390416 32.578485741743236
43.22935534593807 32.58009916771559
43.2311293050434 32.58159693532748
43.23276865419746 32.58298745955748
43.2342834759022 32.584278538362064
43.23568310234268 32.58547739813427
43.23697617107932 32.58659073580031
43.238170676606536 32.58762475780386
43.23927401808534 32.58858521620936
43.24029304353432 32.5894774421379
43.24123409074243 32.59030637673369
43.24210302514704 32.59107659984421
43.242905274902746 32.59179235658335
43.243645863349556 32.59245758193453
43.24432943907341 32.593075923538784
43.244960303737784 32.5936507628022
43.24554243785174 32.5941852344471
43.24607952462745 32.59468224462204
43.24657497206891 32.59514448767715
43.24703193342299 32.595574461703606
43.24745332611418 32.59597448292836
43.24784184927554 32.59634669904891
43.248199999979725 32.596693101586226
43.2485300882666 32.59701553732847
Out[63]:
114

As we can see, AdaGrad is even slower than plain gradient descent here: 114 iterations.

RMSProp

As we’ve seen, AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step.

With $\epsilon = 10^{-10}$:

$$ s = \beta s + (1-\beta) \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta} $$

$$ \theta = \theta - \alpha \frac{\partial J}{\partial \theta} \oslash \sqrt{s+\epsilon} $$
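Unrolling the recursion (with $s$ initialized to zero) shows that $s$ is an exponentially weighted moving average of the squared gradients,

$$ s_t = (1-\beta)\sum_{k=1}^{t}\beta^{\,t-k}\,\frac{\partial J}{\partial \theta}\Big|_k \odot \frac{\partial J}{\partial \theta}\Big|_k $$

so with $\beta = 0.9$ the average effectively covers only about the last $1/(1-\beta) = 10$ iterations, and the effective step size can grow again once recent gradients become small.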
In [70]:
def rmsprop(a, b, sa, sb, x, y):
    epsilon = 1e-10
    beta = 0.9
    n = 5
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()
    db = (1.0/n) * ((y_hat-y).sum())
    # exponentially decaying average of squared gradients
    sa = beta*sa + (1-beta)*da*da + epsilon
    sb = beta*sb + (1-beta)*db*db + epsilon
    a = a - alpha*da / np.sqrt(sa)
    b = b - alpha*db / np.sqrt(sb)
    return a, b, sa, sb
In [71]:
def iterate_rmsprop(x, y, target_cost):
    a=0
    b=0
    sa=0
    sb=0
    for i in range(1000):
        a, b, sa, sb = rmsprop(a,b, sa, sb, x,y)
        print(sa, sb)
        cost = cost_function(a, b, x, y)
        if cost<target_cost:
            break
    return i
In [72]:
iterate_rmsprop(x, y, target_cost=sgd_cost)
0.5036209932651422 0.37845033453999993
0.6666748340963962 0.5011779498389735
0.7035624038659631 0.5290949609920477
0.6846116014928423 0.5150052874480152
0.6413660413519212 0.48260564824209634
0.5892427995404017 0.44348725811529754
0.5358164308595106 0.40335218230125897
0.48463120243995755 0.3648742872141607
0.4371551733415317 0.32916597779828294
0.39382062269916923 0.2965594306913857
0.3545748318305831 0.26702044359109756
0.3191617858943141 0.24036036007818096
Out[72]:
11

Wow, that takes only 11 iterations.

Adam

Adam, which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp: it keeps an exponentially decaying average of past gradients (like momentum) and an exponentially decaying average of past squared gradients (like RMSProp).

$$ m = \beta_1 m - (1-\beta_1)\frac{\partial J}{\partial \theta} $$
$$ s = \beta_2 s + (1-\beta_2) \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta} $$
$$ \hat{m} = \frac{m}{1-\beta_1^t} $$
$$ \hat{s} = \frac{s}{1-\beta_2^t} $$
$$ \theta = \theta + \alpha \hat{m} \oslash \sqrt{\hat{s}+\epsilon} $$

where $t$ is the iteration number, starting at 1.
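The bias-correction terms matter mostly at the start, when $m$ and $s$ are still close to their zero initialization. At $t = 1$, for example,

$$ m_1 = -(1-\beta_1)\frac{\partial J}{\partial \theta}, \qquad \hat{m}_1 = \frac{m_1}{1-\beta_1} = -\frac{\partial J}{\partial \theta} $$

so the first corrected step uses the full gradient instead of one scaled down by $1-\beta_1 = 0.1$; as $t$ grows, $\beta_1^t \to 0$ and the correction fades away.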

In [77]:
def adam(a, b, ma, mb, sa, sb, t, x, y):
    epsilon = 1e-10
    beta1 = 0.9
    beta2 = 0.9   # the usual default is 0.999; 0.9 works fine on this tiny problem
    n = 5
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()
    db = (1.0/n) * ((y_hat-y).sum())
    # first moment (momentum-style) and second moment (RMSProp-style) estimates
    ma = beta1 * ma - (1-beta1)*da
    mb = beta1 * mb - (1-beta1)*db
    sa = beta2 * sa + (1-beta2)*da*da
    sb = beta2 * sb + (1-beta2)*db*db
    # bias correction
    ma_hat = ma/(1-beta1**t)
    mb_hat = mb/(1-beta1**t)
    sa_hat = sa/(1-beta2**t)
    sb_hat = sb/(1-beta2**t)

    # epsilon is left out of the denominator here; on this problem
    # sa_hat and sb_hat never get close to zero
    a = a + alpha*ma_hat / np.sqrt(sa_hat)
    b = b + alpha*mb_hat / np.sqrt(sb_hat)

    return a, b, ma, mb, sa, sb
In [86]:
def iterate_adam(x, y, target_cost):
    a=0
    b=0
    ma=0
    mb=0
    sa=0
    sb=0
    for i in range(1000):
        a, b, ma, mb, sa, sb = adam(a,b, ma, mb, sa, sb, i+1, x, y)
        print(f"{ma}\t{mb}\t{sa}\t{sb}")
        cost = cost_function(a, b, x, y)
        if cost<target_cost:
            break
    return i
In [87]:
iterate_adam(x, y, target_cost=sgd_cost)
0.22441501579999992	0.19453799999999996	0.5036209931651422	0.3784503344399999
0.4016192340199999	0.3481753999999999	0.8518430281932292	0.6402109361703999
0.5363759738971576	0.46503883320196493	1.07262411300758	0.8062610188683121
0.6330149656607605	0.5488792785222582	1.1911922341823753	0.895531357068881
0.6954922736152254	0.603123968069419	1.2302760874303023	0.925077503166402
0.7274510180595801	0.6309289839343443	1.2102871620886997	0.9102165370096031
0.7322854961686895	0.6352347958137915	1.1494443580778129	0.8646207447731215
0.7132116293046631	0.6188272743886445	1.0638272189106417	0.8003578054256075
0.6733462945505432	0.5844063952845583	0.9673391882821758	0.7278635560226704
0.6157957995957747	0.5346628581915572	0.8715625623331551	0.6558335963281313
0.5437474965633194	0.4723574225921566	0.7855022477276057	0.5910315423836943
0.4605465905355612	0.4003874237356603	0.7152614958663444	0.5380462226395261
0.369725397655793	0.3218121173778424	0.6637757717878439	0.49909226801269685
0.274947831152388	0.23980462538427427	0.6308124057706669	0.47400962325270635
0.17985640530446415	0.15751947629598656	0.6134242262809029	0.46060302566421035
0.08785962237036239	0.07790852397792443	0.6068582956517541	0.45532244789580134
0.001933592815800661	0.003549186015826572	0.6056783659441023	0.4541038358094726
-0.07550214246825582	-0.06346307358846585	0.6047743758547716	0.45312546332404074
-0.14262068152365687	-0.12154597442587403	0.6000511654663822	0.4493241456796001
-0.198158083452391	-0.16960472211934757	0.5887657091557674	0.4406482004363841
-0.2413503318769978	-0.2069774262424965	0.5695892904193119	0.4261043208993387
-0.2718627632509636	-0.23337399319470653	0.5424938152075657	0.40567262875423904
-0.28972757335619564	-0.2488220132766666	0.5085404375723384	0.38014845346026394
-0.29529721236352313	-0.25362640497715555	0.4696181652662369	0.3509465461704309
-0.2892167641183978	-0.2483454971123596	0.4281550327777841	0.3198846514087885
-0.27241533612699387	-0.2337835718350304	0.3868085337177365	0.2889514543955107
Out[87]:
25

Adam takes 25 iterations.

NAdam

NAdam (Nesterov-accelerated Adam) is Adam plus the Nesterov trick: just as NAG does for momentum, the gradient of the cost function is measured slightly ahead in the direction of the momentum rather than at the current position.

$$ m = \beta_1 m - (1-\beta_1)\frac{\partial J(\theta+\beta_1 m)}{\partial \theta} $$
$$ s = \beta_2 s + (1-\beta_2) \frac{\partial J(\theta+\beta_1 m)}{\partial \theta} \odot \frac{\partial J(\theta+\beta_1 m)}{\partial \theta} $$
$$ \hat{m} = \frac{m}{1-\beta_1^t} $$
$$ \hat{s} = \frac{s}{1-\beta_2^t} $$
$$ \theta = \theta + \alpha \hat{m} \oslash \sqrt{\hat{s}+\epsilon} $$

In [15]:
def nadam(a, b, ma, mb, sa, sb, t, x, y):
    epsilon = 1e-10
    beta1 = 0.9
    beta2 = 0.9
    n = 5
    alpha = 1e-1
    # the only change from adam(): the gradient is evaluated at the
    # look-ahead position (a+ma, b+mb) instead of (a, b)
    y_hat = model(a+ma, b+mb, x)
    da = (1.0/n) * ((y_hat-y)*x).sum()
    db = (1.0/n) * ((y_hat-y).sum())
    ma = beta1 * ma - (1-beta1)*da
    mb = beta1 * mb - (1-beta1)*db
    sa = beta2 * sa + (1-beta2)*da*da
    sb = beta2 * sb + (1-beta2)*db*db
    ma_hat = ma/(1-beta1**t)
    mb_hat = mb/(1-beta1**t)
    sa_hat = sa/(1-beta2**t)
    sb_hat = sb/(1-beta2**t)

    a = a + alpha*ma_hat / np.sqrt(sa_hat)
    b = b + alpha*mb_hat / np.sqrt(sb_hat)

    return a, b, ma, mb, sa, sb
In [18]:
def iterate_nadam(x, y, target_cost):
    a=0
    b=0
    ma=0
    mb=0
    sa=0
    sb=0
    for i in range(1000):
        a, b, ma, mb, sa, sb = nadam(a,b, ma, mb, sa, sb, i+1, x, y)
        print(f"{ma}\t{mb}\t{sa}\t{sb}")
        cost = cost_function(a, b, x, y)
        if cost<target_cost:
            break
    return i
In [19]:
iterate_nadam(x, y, target_cost=sgd_cost)
0.22441501579999992	0.19453799999999996	0.5036209931651422	0.3784503344399999
0.34945317673264553	0.30303326197140557	0.6707614023970666	0.5043149255896258
0.4086630974407369	0.3545220463126516	0.6923373513038595	0.5207829264938997
0.42479852013231584	0.36869895522307994	0.655595591180131	0.4933351229538528
0.41338276013846714	0.3590014409177542	0.5996858101920048	0.45138499366847756
0.3850680115669883	0.33465143621864796	0.5414133518432255	0.40758055150153066
0.34719317806152206	0.30200342133598246	0.487276010490025	0.36682917334514187
0.3048126007276661	0.26543462733685447	0.43913535841653084	0.3305518278022634
0.2613794636593488	0.22793641452525237	0.3968993337519565	0.2986967105147332
0.21920663475537913	0.19151305757275972	0.35978057495684657	0.2706847309093604
0.17978502825765205	0.15745724253108329	0.32686534752782104	0.24583770178851216
0.1440100206019037	0.12654600408884117	0.29734596861667506	0.22355385981586914
0.11234680538118677	0.09918382962474853	0.27059121178700213	0.20336160118017668
0.08495332320331386	0.0755090552163363	0.24614315931263117	0.1849178241182976
0.061772280422955676	0.05547351561060023	0.22368554429852547	0.16798470258860718
Out[19]:
14

Well, NAdam needs only 14 iterations.
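To sum up the experiments above, the number of iterations needed to reach the same target cost was: gradient descent 100, Momentum 46, Nesterov 21, AdaGrad 114, RMSProp 11, Adam 25, and NAdam 14. On this tiny convex problem RMSProp converges fastest and AdaGrad slowest, but the ranking depends heavily on the learning rate and the shape of the cost surface, so it should not be over-generalized.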