import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
quantity score: Takes into account different metrics, such as the number of startups, coworking spaces, and accelerators, to establish the activity level of the startup ecosystem. (The cities file spells this column 'quatity score'.)
quality score: Studies parameters that indicate the qualitative results achieved by the ecosystem. These parameters include the traction of the ecosystem’s top startups and the “special entities” the ecosystem has produced: Unicorns, Exits, and Pantheons.
business score: A mix of business and economic indicators at the national level, discounted for cities that haven't reached a critical mass in either Quantity or Quality.
total score: The total score of the rankings is the sum of the quantity, quality, and business scores.
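As a quick sanity check of the total score definition, the sketch below re-adds the three component scores for the first five cities in the ranking and compares the result with the published total. The values are copied from the output of cities.head() in this notebook; the tolerance of 0.01 is an assumption chosen to allow for rounding in the published scores.

```python
import pandas as pd

# First five rows of the cities dataset, copied from cities.head().
# Note that the cities file spells the column 'quatity score'.
sample = pd.DataFrame({
    'city': ['San Francisco Bay', 'New York', 'Beijing', 'Los Angeles Area', 'London'],
    'total score': [328.966, 110.777, 66.049, 58.441, 56.913],
    'quatity score': [29.14, 11.43, 5.01, 11.23, 15.77],
    'quality score': [296.02, 95.55, 58.61, 43.41, 37.44],
    'business score': [3.80, 3.80, 2.43, 3.80, 3.70],
})

# Sum the three component scores row by row.
component_sum = sample[['quatity score', 'quality score', 'business score']].sum(axis=1)

# Absolute gap between the published total and the recomputed sum.
difference = (sample['total score'] - component_sum).abs()

# True when every row agrees within the assumed rounding tolerance.
sums_match = bool((difference < 0.01).all())
print(sums_match)
```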
What are the correlations between the metrics of the best startup cities and countries?
cities = pd.read_csv('Best Cities for Startups.csv')
countries = pd.read_csv('Best Countries for Startups.csv')
print(cities.shape)
cities.head()
(1000, 9)
 | position | change in position from 2020 | city | country | total score | quatity score | quality score | business score | sign of change in position |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | San Francisco Bay | United States | 328.966 | 29.14 | 296.02 | 3.80 | NaN |
1 | 2 | 0 | New York | United States | 110.777 | 11.43 | 95.55 | 3.80 | NaN |
2 | 3 | 3 | Beijing | China | 66.049 | 5.01 | 58.61 | 2.43 | + |
3 | 4 | 1 | Los Angeles Area | United States | 58.441 | 11.23 | 43.41 | 3.80 | + |
4 | 5 | 2 | London | United Kingdom | 56.913 | 15.77 | 37.44 | 3.70 | - |
cities.tail()
 | position | change in position from 2020 | city | country | total score | quatity score | quality score | business score | sign of change in position |
---|---|---|---|---|---|---|---|---|---|
995 | 996 | 26 | Ouagadougou | Burkina Faso | 0.060 | 0.02 | 0.02 | 0.02 | - |
996 | 997 | new | Baghdad | Iraq | 0.058 | 0.01 | 0.02 | 0.03 | NaN |
997 | 998 | 13 | Mbabane | Swaziland | 0.057 | 0.01 | 0.02 | 0.03 | - |
998 | 999 | new | Conakry | Guinea | 0.047 | 0.01 | 0.02 | 0.02 | NaN |
999 | 1000 | new | Sanaa | Yemen | 0.037 | 0.01 | 0.02 | 0.01 | NaN |
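# Correlate numeric columns only (text columns raise an error in pandas >= 2.0)
cities.select_dtypes('number').corr()
 | position | total score | quatity score | quality score | business score |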
---|---|---|---|---|---|
position | 1.000000 | -0.279582 | -0.358950 | -0.186913 | -0.806506 |
total score | -0.279582 | 1.000000 | 0.921991 | 0.991778 | 0.358643 |
quatity score | -0.358950 | 0.921991 | 1.000000 | 0.881724 | 0.453313 |
quality score | -0.186913 | 0.991778 | 0.881724 | 1.000000 | 0.244499 |
business score | -0.806506 | 0.358643 | 0.453313 | 0.244499 | 1.000000 |
sns.pairplot(cities)
<seaborn.axisgrid.PairGrid at 0x2e646570ee0>
plt.figure(figsize=(8,5))
ax = sns.heatmap(cities.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
There is a strong correlation between the quality and quantity scores (0.88), and only a weak correlation between quality score and business score (0.24). There is, however, a strong negative correlation between business score and position (-0.81): cities with a higher business score tend to sit nearer the top of the ranking. Moreover, even though the city rankings are determined by total score, the correlation between position and total score is negative but weak (-0.28). Let's take the top 100 cities and examine their correlations to understand why.
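One possible reason for the weak coefficient (an assumption, not something this notebook tests) is that corr() computes Pearson correlation by default, which measures linear association, while total score falls off very steeply from the top of the ranking. The sketch below uses ten rows copied from cities.head() and cities.tail() to compare Pearson correlation with Spearman rank correlation, computed here as Pearson correlation on the ranks.

```python
import pandas as pd

# Top five and bottom five cities, copied from cities.head() and cities.tail().
sample = pd.DataFrame({
    'position': [1, 2, 3, 4, 5, 996, 997, 998, 999, 1000],
    'total score': [328.966, 110.777, 66.049, 58.441, 56.913,
                    0.060, 0.058, 0.057, 0.047, 0.037],
})

# Pearson: strength of the linear relationship between the raw values.
pearson = sample['position'].corr(sample['total score'])

# Spearman: Pearson correlation applied to the ranks of the values,
# so it only asks whether the relationship is monotonic.
spearman = sample['position'].rank().corr(sample['total score'].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On these ten rows the Spearman coefficient is -1, because total score never increases as position increases, while the Pearson coefficient is noticeably weaker.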
cities_top_100 = cities[:100]
cities_top_100.select_dtypes('number').corr()
 | position | total score | quatity score | quality score | business score |
---|---|---|---|---|---|
position | 1.000000 | -0.434891 | -0.578036 | -0.416559 | 0.110732 |
total score | -0.434891 | 1.000000 | 0.915673 | 0.998811 | 0.069569 |
quatity score | -0.578036 | 0.915673 | 1.000000 | 0.896382 | 0.058255 |
quality score | -0.416559 | 0.998811 | 0.896382 | 1.000000 | 0.052109 |
business score | 0.110732 | 0.069569 | 0.058255 | 0.052109 | 1.000000 |
sns.pairplot(cities_top_100)
<seaborn.axisgrid.PairGrid at 0x2e645472970>
plt.figure(figsize=(8,5))
ax = sns.heatmap(cities_top_100.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
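In the top 100 cities, the correlations of the quantity and quality scores with position become stronger (-0.58 and -0.42, compared with -0.36 and -0.19 for all 1000 cities), while business score is almost uncorrelated with position (0.11). So the best startup cities start to stand out. Let's check the top 25 cities to verify that.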
cities_top_25 = cities[:25]
cities_top_25.select_dtypes('number').corr()
 | position | total score | quatity score | quality score | business score |
---|---|---|---|---|---|
position | 1.000000 | -0.575365 | -0.659024 | -0.562125 | -0.053008 |
total score | -0.575365 | 1.000000 | 0.913301 | 0.999004 | 0.233821 |
quatity score | -0.659024 | 0.913301 | 1.000000 | 0.894715 | 0.288322 |
quality score | -0.562125 | 0.999004 | 0.894715 | 1.000000 | 0.215706 |
business score | -0.053008 | 0.233821 | 0.288322 | 0.215706 | 1.000000 |
sns.pairplot(cities_top_25)
<seaborn.axisgrid.PairGrid at 0x2e6567251f0>
plt.figure(figsize=(8,5))
ax = sns.heatmap(cities_top_25.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
The top 25 cities support the idea above. Because the correlation between business score and position has almost vanished (-0.05), the quantity and quality scores become the determining factors (-0.66 and -0.56). So let's examine those cities.
cities_top_25_copy = cities_top_25.copy()
cities_top_25_copy['best_ratio'] = cities_top_25_copy['quality score'] / cities_top_25_copy['quatity score']
cities_top_25_copy[['city','best_ratio','business score']]
 | city | best_ratio | business score |
---|---|---|---|
0 | San Francisco Bay | 10.158545 | 3.80 |
1 | New York | 8.359580 | 3.80 |
2 | Beijing | 11.698603 | 2.43 |
3 | Los Angeles Area | 3.865539 | 3.80 |
4 | London | 2.374128 | 3.70 |
5 | Boston Area | 7.369091 | 3.80 |
6 | Shanghai | 10.131653 | 2.43 |
7 | Tel Aviv Area | 4.930693 | 3.13 |
8 | Moscow | 2.122117 | 2.39 |
9 | Bangalore | 3.561508 | 2.38 |
10 | Paris | 3.153700 | 3.41 |
11 | Seattle | 4.882521 | 3.80 |
12 | Berlin | 4.057072 | 3.49 |
13 | New Delhi | 2.363946 | 2.61 |
14 | Tokyo | 6.400000 | 3.30 |
15 | Mumbai | 4.140673 | 2.61 |
16 | Chicago | 2.505721 | 3.80 |
17 | Austin | 3.625000 | 3.80 |
18 | Washington DC Area | 3.109510 | 3.80 |
19 | Sao Paulo | 4.836502 | 2.29 |
20 | Shenzhen | 7.761364 | 1.99 |
21 | San Diego | 4.102273 | 3.80 |
22 | Seoul | 7.393750 | 3.24 |
23 | Stockholm | 5.535519 | 3.78 |
24 | Singapore City | 2.927215 | 3.30 |
plt.figure(figsize=(11,9))
plt.barh(cities_top_25_copy['city'],cities_top_25_copy['best_ratio'])
plt.xlabel('Startup Ratio')
plt.ylabel('Cities')
plt.show()
When we compare quality to quantity as a ratio, a high ratio might mean that a city has many startups even though it has fewer coworking spaces and accelerators, but the data cannot prove that. It does raise a question: would Beijing become the best startup city in the world if it had the same quantity score as San Francisco Bay?
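The question can be explored with simple arithmetic on the scores shown above. The sketch below gives Beijing the quantity score of San Francisco Bay under two hypothetical scenarios, both of which are assumptions made here for illustration: either Beijing's quality score stays as it is, or its quality score grows so that its quality-to-quantity ratio stays constant. In both cases the total is taken to be the sum of the three scores, as defined at the top.

```python
# Scores copied from the cities table in this notebook.
sf_total, sf_quantity = 328.966, 29.14
ny_total = 110.777
beijing_quantity, beijing_quality, beijing_business = 5.01, 58.61, 2.43

# Scenario A (assumption): Beijing's quality score is unchanged.
total_fixed_quality = sf_quantity + beijing_quality + beijing_business

# Scenario B (assumption): Beijing's quality-to-quantity ratio is unchanged,
# so its quality score scales up with the new quantity score.
beijing_ratio = beijing_quality / beijing_quantity
total_fixed_ratio = sf_quantity + sf_quantity * beijing_ratio + beijing_business

print(f"Scenario A total: {total_fixed_quality:.3f}")
print(f"Scenario B total: {total_fixed_ratio:.3f}")
```

Under scenario A, Beijing's total (about 90.2) would still be below New York's 110.777. Under scenario B, it (about 372.5) would exceed San Francisco Bay's 328.966. The answer therefore depends entirely on which assumption is made.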
cities.iloc[[0,2]].plot.bar()
<AxesSubplot:>
cities.iloc[[0,2]].plot.bar(x='city', y= 'business score')
<AxesSubplot:xlabel='city'>
The business score of San Francisco Bay (3.80) is higher than that of Beijing (2.43). So even if Beijing had the same quantity score as San Francisco Bay, Unicorn and Pantheon startups might still be built in San Francisco Bay because of Beijing's lower business score. That could mean the best startups would still be found in San Francisco Bay.
countries.head(8)
 | ranking | change in position from 2020 | country | total score | quantity score | quality score | business score | change in position sign |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0 | United States | 124.420 | 19.45 | 101.17 | 3.80 | NaN |
1 | 2.0 | 0 | United Kingdom | 28.719 | 8.16 | 16.86 | 3.70 | NaN |
2 | 3.0 | 0 | Israel | 27.741 | 5.48 | 19.14 | 3.13 | NaN |
3 | 4.0 | 0 | Canada | 19.876 | 6.58 | 9.75 | 3.55 | NaN |
4 | 5.0 | 0 | Germany | 17.053 | 3.64 | 9.93 | 3.49 | NaN |
5 | 6.0 | 4 | Sweden | 15.423 | 2.40 | 9.24 | 3.78 | + |
6 | 7.0 | 7 | China | 15.128 | 1.33 | 11.46 | 2.34 | + |
7 | 8.0 | 0 | Switzerland | 14.943 | 3.82 | 7.58 | 3.54 | NaN |
countries.tail()
 | ranking | change in position from 2020 | country | total score | quantity score | quality score | business score | change in position sign |
---|---|---|---|---|---|---|---|---|
96 | 97.0 | 8 | Uganda | 0.180 | 0.07 | 0.04 | 0.07 | - |
97 | 98.0 | 2 | Nepal | 0.172 | 0.06 | 0.04 | 0.08 | + |
98 | 99.0 | new entry | Namibia | 0.165 | 0.04 | 0.05 | 0.07 | NaN |
99 | 100.0 | new entry | Ethiopia | 0.162 | 0.07 | 0.06 | 0.03 | NaN |
100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
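The last row returned by countries.tail() contains only NaN values, which suggests a trailing empty line in the CSV file. It does not affect corr(), which excludes missing values, but it could be removed before further analysis. A minimal sketch on a small stand-in frame (the two data rows are copied from the output above; the names countries_sample and countries_clean are hypothetical):

```python
import numpy as np
import pandas as pd

# Stand-in for the end of the countries dataset, including the empty row.
countries_sample = pd.DataFrame({
    'ranking': [99.0, 100.0, np.nan],
    'country': ['Namibia', 'Ethiopia', np.nan],
    'total score': [0.165, 0.162, np.nan],
})

# how='all' drops a row only when every column in it is missing.
countries_clean = countries_sample.dropna(how='all')
print(len(countries_sample), len(countries_clean))
```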
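# Correlate numeric columns only (text columns raise an error in pandas >= 2.0)
countries.select_dtypes('number').corr()
 | ranking | total score | quantity score | quality score | business score |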
---|---|---|---|---|---|
ranking | 1.000000 | -0.520022 | -0.613922 | -0.397828 | -0.952646 |
total score | -0.520022 | 1.000000 | 0.958664 | 0.988242 | 0.480830 |
quantity score | -0.613922 | 0.958664 | 1.000000 | 0.914458 | 0.576589 |
quality score | -0.397828 | 0.988242 | 0.914458 | 1.000000 | 0.350522 |
business score | -0.952646 | 0.480830 | 0.576589 | 0.350522 | 1.000000 |
sns.pairplot(countries)
<seaborn.axisgrid.PairGrid at 0x2e65786c8e0>
plt.figure(figsize=(8,5))
ax = sns.heatmap(countries.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
business_score_top = countries[countries['business score'] >= 3.0]
business_score_top
 | ranking | change in position from 2020 | country | total score | quantity score | quality score | business score | change in position sign |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0 | United States | 124.420 | 19.45 | 101.17 | 3.80 | NaN |
1 | 2.0 | 0 | United Kingdom | 28.719 | 8.16 | 16.86 | 3.70 | NaN |
2 | 3.0 | 0 | Israel | 27.741 | 5.48 | 19.14 | 3.13 | NaN |
3 | 4.0 | 0 | Canada | 19.876 | 6.58 | 9.75 | 3.55 | NaN |
4 | 5.0 | 0 | Germany | 17.053 | 3.64 | 9.93 | 3.49 | NaN |
5 | 6.0 | 4 | Sweden | 15.423 | 2.40 | 9.24 | 3.78 | + |
7 | 8.0 | 0 | Switzerland | 14.943 | 3.82 | 7.58 | 3.54 | NaN |
8 | 9.0 | 2 | Australia | 13.835 | 4.46 | 5.88 | 3.50 | - |
10 | 11.0 | 5 | The Netherlands | 13.700 | 3.44 | 6.96 | 3.30 | - |
11 | 12.0 | 0 | France | 13.286 | 3.03 | 6.85 | 3.41 | NaN |
12 | 13.0 | 2 | Estonia | 12.428 | 3.19 | 5.77 | 3.47 | - |
13 | 14.0 | 1 | Finland | 11.582 | 2.68 | 5.26 | 3.64 | - |
14 | 15.0 | 6 | Spain | 11.146 | 3.48 | 4.35 | 3.31 | - |
15 | 16.0 | 1 | Lithuania | 9.992 | 3.77 | 2.98 | 3.25 | - |
17 | 18.0 | 0 | Ireland | 9.633 | 2.51 | 3.68 | 3.44 | NaN |
18 | 19.0 | 0 | South Korea | 8.888 | 0.68 | 4.96 | 3.24 | NaN |
20 | 21.0 | 0 | Japan | 8.709 | 0.99 | 4.42 | 3.30 | NaN |
21 | 22.0 | 0 | Denmark | 8.368 | 2.04 | 2.68 | 3.65 | NaN |
22 | 23.0 | 1 | Belgium | 7.359 | 2.07 | 1.98 | 3.31 | + |
25 | 26.0 | 4 | Taiwan | 6.946 | 1.50 | 2.09 | 3.36 | + |
27 | 28.0 | 0 | Austria | 6.936 | 1.75 | 1.67 | 3.52 | NaN |
28 | 29.0 | 4 | Italy | 6.602 | 1.68 | 1.87 | 3.06 | - |
29 | 30.0 | 3 | Poland | 6.515 | 1.40 | 1.95 | 3.17 | - |
30 | 31.0 | 2 | Norway | 6.386 | 1.15 | 1.57 | 3.66 | + |
31 | 32.0 | 6 | Czechia | 6.226 | 1.24 | 1.72 | 3.26 | - |
32 | 33.0 | 14 | New Zealand | 5.865 | 1.05 | 1.12 | 3.69 | + |
business_score_top.plot.barh(x='country', y='business score', legend=False, figsize=(10,7),
                             title='Countries with the Best Business Scores')
<AxesSubplot:title={'center':'Countries with the Best Business Scores'}, ylabel='country'>
countries_20 = countries[:20]
countries_20
 | ranking | change in position from 2020 | country | total score | quantity score | quality score | business score | change in position sign |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0 | United States | 124.420 | 19.45 | 101.17 | 3.80 | NaN |
1 | 2.0 | 0 | United Kingdom | 28.719 | 8.16 | 16.86 | 3.70 | NaN |
2 | 3.0 | 0 | Israel | 27.741 | 5.48 | 19.14 | 3.13 | NaN |
3 | 4.0 | 0 | Canada | 19.876 | 6.58 | 9.75 | 3.55 | NaN |
4 | 5.0 | 0 | Germany | 17.053 | 3.64 | 9.93 | 3.49 | NaN |
5 | 6.0 | 4 | Sweden | 15.423 | 2.40 | 9.24 | 3.78 | + |
6 | 7.0 | 7 | China | 15.128 | 1.33 | 11.46 | 2.34 | + |
7 | 8.0 | 0 | Switzerland | 14.943 | 3.82 | 7.58 | 3.54 | NaN |
8 | 9.0 | 2 | Australia | 13.835 | 4.46 | 5.88 | 3.50 | - |
9 | 10.0 | 6 | Singapore | 13.745 | 3.21 | 7.69 | 2.84 | + |
10 | 11.0 | 5 | The Netherlands | 13.700 | 3.44 | 6.96 | 3.30 | - |
11 | 12.0 | 0 | France | 13.286 | 3.03 | 6.85 | 3.41 | NaN |
12 | 13.0 | 2 | Estonia | 12.428 | 3.19 | 5.77 | 3.47 | - |
13 | 14.0 | 1 | Finland | 11.582 | 2.68 | 5.26 | 3.64 | - |
14 | 15.0 | 6 | Spain | 11.146 | 3.48 | 4.35 | 3.31 | - |
15 | 16.0 | 1 | Lithuania | 9.992 | 3.77 | 2.98 | 3.25 | - |
16 | 17.0 | 0 | Russia | 9.813 | 2.17 | 5.14 | 2.51 | NaN |
17 | 18.0 | 0 | Ireland | 9.633 | 2.51 | 3.68 | 3.44 | NaN |
18 | 19.0 | 0 | South Korea | 8.888 | 0.68 | 4.96 | 3.24 | NaN |
19 | 20.0 | 3 | India | 8.833 | 1.83 | 4.40 | 2.61 | + |
Let's split the top 20 countries into two pieces: the first will be the top eight countries and the second will be the remaining twelve.
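The split can be sketched as follows. The rows are copied from the top 20 countries table above; the helper name ranking_corr is hypothetical and simply collects the correlation of each score with ranking so the two groups can be read side by side.

```python
import pandas as pd

# (country, total, quantity, quality, business), copied from the top 20 table.
rows = [
    ('United States', 124.420, 19.45, 101.17, 3.80),
    ('United Kingdom', 28.719, 8.16, 16.86, 3.70),
    ('Israel', 27.741, 5.48, 19.14, 3.13),
    ('Canada', 19.876, 6.58, 9.75, 3.55),
    ('Germany', 17.053, 3.64, 9.93, 3.49),
    ('Sweden', 15.423, 2.40, 9.24, 3.78),
    ('China', 15.128, 1.33, 11.46, 2.34),
    ('Switzerland', 14.943, 3.82, 7.58, 3.54),
    ('Australia', 13.835, 4.46, 5.88, 3.50),
    ('Singapore', 13.745, 3.21, 7.69, 2.84),
    ('The Netherlands', 13.700, 3.44, 6.96, 3.30),
    ('France', 13.286, 3.03, 6.85, 3.41),
    ('Estonia', 12.428, 3.19, 5.77, 3.47),
    ('Finland', 11.582, 2.68, 5.26, 3.64),
    ('Spain', 11.146, 3.48, 4.35, 3.31),
    ('Lithuania', 9.992, 3.77, 2.98, 3.25),
    ('Russia', 9.813, 2.17, 5.14, 2.51),
    ('Ireland', 9.633, 2.51, 3.68, 3.44),
    ('South Korea', 8.888, 0.68, 4.96, 3.24),
    ('India', 8.833, 1.83, 4.40, 2.61),
]
top20 = pd.DataFrame(rows, columns=['country', 'total score', 'quantity score',
                                    'quality score', 'business score'])
top20.insert(0, 'ranking', list(range(1, 21)))


def ranking_corr(df):
    """Pearson correlation of each numeric score column with 'ranking'."""
    return df.select_dtypes('number').corr()['ranking'].drop('ranking')


# Positional split: first eight rows, then the remaining twelve.
top8, last12 = top20.iloc[:8], top20.iloc[8:]

summary = pd.DataFrame({'top 8': ranking_corr(top8),
                        'last 12': ranking_corr(last12)})
print(summary.round(3))
```

Putting both groups in one frame avoids repeating the corr() call for each slice and makes the differences between the groups easier to compare.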
countries_20.select_dtypes('number').corr()
 | ranking | total score | quantity score | quality score | business score |
---|---|---|---|---|---|
ranking | 1.000000 | -0.546039 | -0.616779 | -0.522993 | -0.398569 |
total score | -0.546039 | 1.000000 | 0.955399 | 0.997841 | 0.338033 |
quantity score | -0.616779 | 0.955399 | 1.000000 | 0.934549 | 0.453367 |
quality score | -0.522993 | 0.997841 | 0.934549 | 1.000000 | 0.295646 |
business score | -0.398569 | 0.338033 | 0.453367 | 0.295646 | 1.000000 |
sns.pairplot(countries_20)
<seaborn.axisgrid.PairGrid at 0x2e659c48640>
plt.figure(figsize=(8,5))
ax = sns.heatmap(countries_20.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
plt.figure(figsize=(11,9))
plt.barh(countries_20['country'],countries_20['total score'])
plt.xlabel('total score')
plt.ylabel('Best Countries')
plt.title('The Best Startup Countries')
plt.show()
countries_20_copy = countries_20.copy()
countries_20_copy['best_ratio'] = countries_20_copy['quality score'] / countries_20_copy['quantity score']
compare_c_20 = countries_20_copy[['country','best_ratio','business score']]
compare_c_20
 | country | best_ratio | business score |
---|---|---|---|
0 | United States | 5.201542 | 3.80 |
1 | United Kingdom | 2.066176 | 3.70 |
2 | Israel | 3.492701 | 3.13 |
3 | Canada | 1.481763 | 3.55 |
4 | Germany | 2.728022 | 3.49 |
5 | Sweden | 3.850000 | 3.78 |
6 | China | 8.616541 | 2.34 |
7 | Switzerland | 1.984293 | 3.54 |
8 | Australia | 1.318386 | 3.50 |
9 | Singapore | 2.395639 | 2.84 |
10 | The Netherlands | 2.023256 | 3.30 |
11 | France | 2.260726 | 3.41 |
12 | Estonia | 1.808777 | 3.47 |
13 | Finland | 1.962687 | 3.64 |
14 | Spain | 1.250000 | 3.31 |
15 | Lithuania | 0.790451 | 3.25 |
16 | Russia | 2.368664 | 2.51 |
17 | Ireland | 1.466135 | 3.44 |
18 | South Korea | 7.294118 | 3.24 |
19 | India | 2.404372 | 2.61 |
compare_c_20.plot.bar(x='country',figsize=(12,7))
<AxesSubplot:xlabel='country'>
countries_20_top_8 = countries_20[:8]
countries_20_top_8
 | ranking | change in position from 2020 | country | total score | quantity score | quality score | business score | change in position sign |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0 | United States | 124.420 | 19.45 | 101.17 | 3.80 | NaN |
1 | 2.0 | 0 | United Kingdom | 28.719 | 8.16 | 16.86 | 3.70 | NaN |
2 | 3.0 | 0 | Israel | 27.741 | 5.48 | 19.14 | 3.13 | NaN |
3 | 4.0 | 0 | Canada | 19.876 | 6.58 | 9.75 | 3.55 | NaN |
4 | 5.0 | 0 | Germany | 17.053 | 3.64 | 9.93 | 3.49 | NaN |
5 | 6.0 | 4 | Sweden | 15.423 | 2.40 | 9.24 | 3.78 | + |
6 | 7.0 | 7 | China | 15.128 | 1.33 | 11.46 | 2.34 | + |
7 | 8.0 | 0 | Switzerland | 14.943 | 3.82 | 7.58 | 3.54 | NaN |
countries_20_top_8.select_dtypes('number').corr()
 | ranking | total score | quantity score | quality score | business score |
---|---|---|---|---|---|
ranking | 1.000000 | -0.681746 | -0.791867 | -0.652999 | -0.405576 |
total score | -0.681746 | 1.000000 | 0.959042 | 0.998174 | 0.339937 |
quantity score | -0.791867 | 0.959042 | 1.000000 | 0.940510 | 0.477379 |
quality score | -0.652999 | 0.998174 | 0.940510 | 1.000000 | 0.298494 |
business score | -0.405576 | 0.339937 | 0.477379 | 0.298494 | 1.000000 |
sns.pairplot(countries_20_top_8,hue = 'country', diag_kind="hist")
<seaborn.axisgrid.PairGrid at 0x2e65aab2d90>
plt.figure(figsize=(8,5))
ax = sns.heatmap(countries_20_top_8.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
countries_20_last_12 = countries_20[8:]
countries_20_last_12
 | ranking | change in position from 2020 | country | total score | quantity score | quality score | business score | change in position sign |
---|---|---|---|---|---|---|---|---|
8 | 9.0 | 2 | Australia | 13.835 | 4.46 | 5.88 | 3.50 | - |
9 | 10.0 | 6 | Singapore | 13.745 | 3.21 | 7.69 | 2.84 | + |
10 | 11.0 | 5 | The Netherlands | 13.700 | 3.44 | 6.96 | 3.30 | - |
11 | 12.0 | 0 | France | 13.286 | 3.03 | 6.85 | 3.41 | NaN |
12 | 13.0 | 2 | Estonia | 12.428 | 3.19 | 5.77 | 3.47 | - |
13 | 14.0 | 1 | Finland | 11.582 | 2.68 | 5.26 | 3.64 | - |
14 | 15.0 | 6 | Spain | 11.146 | 3.48 | 4.35 | 3.31 | - |
15 | 16.0 | 1 | Lithuania | 9.992 | 3.77 | 2.98 | 3.25 | - |
16 | 17.0 | 0 | Russia | 9.813 | 2.17 | 5.14 | 2.51 | NaN |
17 | 18.0 | 0 | Ireland | 9.633 | 2.51 | 3.68 | 3.44 | NaN |
18 | 19.0 | 0 | South Korea | 8.888 | 0.68 | 4.96 | 3.24 | NaN |
19 | 20.0 | 3 | India | 8.833 | 1.83 | 4.40 | 2.61 | + |
countries_20_last_12.select_dtypes('number').corr()
 | ranking | total score | quantity score | quality score | business score |
---|---|---|---|---|---|
ranking | 1.000000 | -0.983246 | -0.762573 | -0.739960 | -0.373214 |
total score | -0.983246 | 1.000000 | 0.707677 | 0.803915 | 0.367816 |
quantity score | -0.762573 | 0.707677 | 1.000000 | 0.184262 | 0.376132 |
quality score | -0.739960 | 0.803915 | 0.184262 | 1.000000 | -0.012712 |
business score | -0.373214 | 0.367816 | 0.376132 | -0.012712 | 1.000000 |
sns.pairplot(countries_20_last_12,hue = 'country', diag_kind="hist")
<seaborn.axisgrid.PairGrid at 0x2e65c31ac70>
plt.figure(figsize=(8,5))
ax = sns.heatmap(countries_20_last_12.select_dtypes('number').corr(), vmin=-1, vmax=1,
                 cbar=False, cmap='RdBu', annot=True)
To summarize: