figsize( 12.5 ,4)
import scipy.stats as stats
Most of you are likely familar with the git-repository website Github. An observed phenomenon on Github is the scale-ness of the popularity of repositories. Here, for lack of a better measure, we use the numbers of stars and forks to measure popularity. This is not a bad measure, but it can ignore page-views, downloads and tends to overemphasize older repositories. Since we will be studying all repositories and not a single one, the absense of these measures is not as relevant.
Contained in this folder is a Python script for scrapping data from Github on the popularity of repos. The script requires the Requests
and BeautifulSoup
libraries, but if you don't have that installed, provided in the ./data
folder is the same data from a previous date (Feburary 18, 2013 at last pull). The data is the fraction of repositories with stars equal to or greater than $2^k,\; k = 0,...,15$ and the fraction of repositories with forks equal to or than $2^k,\; k = 0,...,15$.
run github_datapull.py
Scrapping data from Github. Sorry Github... The data is contained in variables `foo_to_explore` and `repo_with_foo` stars first... number of repos with greater than or equal to 0 stars: 2738541 number of repos with greater than or equal to 1 stars: 1704779 number of repos with greater than or equal to 2 stars: 493529 number of repos with greater than or equal to 4 stars: 212099 number of repos with greater than or equal to 8 stars: 106973 number of repos with greater than or equal to 16 stars: 58101 number of repos with greater than or equal to 32 stars: 31877 number of repos with greater than or equal to 64 stars: 17370 number of repos with greater than or equal to 128 stars: 9239 number of repos with greater than or equal to 256 stars: 4578 number of repos with greater than or equal to 512 stars: 2150 number of repos with greater than or equal to 1024 stars: 872 number of repos with greater than or equal to 2048 stars: 286 number of repos with greater than or equal to 4096 stars: 84 number of repos with greater than or equal to 8192 stars: 22 number of repos with greater than or equal to 16384 stars: 5 number of repos with greater than or equal to 32768 stars: 1 forks second... number of repos with greater than or equal to 0 forks: 2738548 number of repos with greater than or equal to 1 forks: 334539 number of repos with greater than or equal to 2 forks: 159206 number of repos with greater than or equal to 4 forks: 74836 number of repos with greater than or equal to 8 forks: 36532 number of repos with greater than or equal to 16 forks: 17948 number of repos with greater than or equal to 32 forks: 8394 number of repos with greater than or equal to 64 forks: 3841 number of repos with greater than or equal to 128 forks: 1580 number of repos with greater than or equal to 256 forks: 605 number of repos with greater than or equal to 512 forks: 222 number of repos with greater than or equal to 1024 forks: 69 number of repos with greater than or equal to 2048 forks: 17 number of repos with greater than or equal to 4096 forks: 4 number of repos with greater than or equal to 8192 forks: 2 number of repos with greater than or equal to 16384 forks: 0 number of repos with greater than or equal to 32768 forks: 0
plt.plot( ( stars_to_explore), repo_with_stars, label = "stars" )
plt.plot( ( forks_to_explore), repo_with_forks, label = "forks" )
plt.legend(loc = "lower right")
plt.title("Popularity of Repos (as measured by stars and forks)" )
plt.xlabel("$K$")
plt.ylabel("number of repos with stars/forks $K$" )
plt.xlim( -200, 35000 )
(-200, 35000)
Clearly, we need to adjust the scale of this plot as most of the action is hidden. The number of repos falls very quickly. We will put it on a log-log plot.
plt.plot( log2( stars_to_explore+1), log2(repo_with_stars+1), 'o-', label = "stars" )
plt.plot( log2( forks_to_explore+1), log2(repo_with_forks+1), 'o-', label = "forks", )
plt.legend(loc = "upper right")
plt.title("Log-Log plot of Popularity of Repos (as measured by stars and forks)" )
plt.xlabel("$\log{K}$")
plt.ylabel("$\log$(number of repos with stars/forks < K )" )
<matplotlib.text.Text at 0x5d64590>
Both characteristics look like a straight line plotted on a log-log plot. What does this mean? Denote the fraction of repos with greater than or equal to $k$ stars (or forks) $P(k)$. So in the above plot, $\log{P(k)}$ on the y-axis and $\log{k}$ is on the x-axis. The above linear relationship can be written as:
$$ \log_2{P(k)} = \beta\log_2{k} + \alpha$$rearranging by taking both sides to the power of 2:
$$ P(k) = 2^\alpha k^{\beta} = C k^{\beta}, \;\; C = 2^{\alpha}$$This relationship is very interesting. It is called a power-law, and occurs very freqently in social datasets. Why does it occur so frequently in social datasets? It has much to do with a "winner-take-all" or "winner-take-most" effect. Winners in a power-law enviroment are components that seem take a disproportiante amount of the popularity, and keep winning. In term of popularity of repos, winning repos are repos that are very good quailty (intially are winners), and are shared/talked about often (keep winning).
The above plot is also telling us that the majority of repos have very few stars and forks, only a handful have hundreds, and an incredibly small number have thousands. This is not-so obvious after browsing Github's website, where you see some repos with 36000+ stars, but fail to see the millions that do not have any stars (as they are not popular, they won't be common on your tour of the site.)
Distributions like this are also said to have fat-tails, i.e. the probability does not drop quickly as we extend into the tail of the dataset, but most of the probability is still centered near zero.
The heaviness of the tail and strength of "winner-take-all" effect are both influenced by the $\beta$ parameter. The small the $\beta$, the more pronounced these effects. Below is a list of distributions that follow a power-law and an approximate $\beta$ exponent [1]. Recall though that we never observe these numbers, we must infer them from the data.
Phenomenon | Assumed Exponent | |
---|---|---|
Frequency of word use | -1.2 | |
Number of hits on website | -1.4 | |
US book sales | -1.5 | |
Intensity of wars | -0.8 | |
New worth of Americans | -1.1 | |
Github Stars | ?? |
It is very easy to overestimate the true paramter $\beta$. This is because the tail events (the events of 500+ stars) are very rare. For example, suppose in our Github dataset we only observe 100 samples. With very high probability (about 30%), all of these samples will have less than 31 stars. This is because approximately 99% ( Number of all repos - Number of repos with greater than 31 stars)/(Number of all repos) of all repos have less than 31 stars. Thus, we would have no samples in our dataset from the tail of the distribution. If I then told you that there existed a repo with 36000+ stars, you would call me crazy, as it would be about 1000 times larger than your observed most popular repo. You would assign a very large $\beta$ exponent to your dataset (recall large $\beta$ means thinner tails). Similarly, with the same 30% probability we would not see repos more popular than 64 stars if we had a sample of 1000. Taking this to its logical conclusion, how confident should we be that there might not exist a theoretical repo that can attain 72000+ stars, or 150000+ stars, one which would push an estimated $\beta$ down even more.
The
from scipy.special import beta
import pymc as mc
param = mc.Exponential("param", 1 )
@mc.stochastic( dtype = int, observed = True )
def yule_simon( value = repo_with_stars, rho = param ):
"""test"""
def logp( value, rho):
return np.log( rho ) + np.log( beta( value, rho+1) )
def random(rho):
W = stats.expon.rvs(scale=1./rho)
return stats.geom.rvs( np.exp( -W ) )
model = mc.Model( [param, yule_simon] )
mcmc = mc.MCMC( model )
mcmc.sample( 10000, 8000)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-83-fe84742cff9d> in <module>() 8 9 @mc.stochastic( dtype = int, observed = True ) ---> 10 def yule_simon( value = repo_with_stars, rho = param ): 11 """test""" 12 c:\Python27\lib\site-packages\pymc\InstantiationDecorators.pyc in instantiate_p(__func__) 147 def instantiate_p(__func__): 148 value, parents = _extract(__func__, kwds, keys, 'Stochastic') --> 149 return __class__(value=value, parents=parents, **kwds) 150 151 keys = ['logp','random','rseed'] c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients) 714 if check_logp: 715 # Check initial value --> 716 if not isinstance(self.logp, float): 717 raise ValueError("Stochastic " + self.__name__ + "'s initial log-probability is %s, should be a float." %self.logp.__repr__()) 718 c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in get_logp(self) 833 logp = float(logp) 834 except: --> 835 raise TypeError(self.__name__ + ': computed log-probability ' + str(logp) + ' cannot be cast to float') 836 837 if logp != logp: TypeError: yule_simon: computed log-probability [-20.51503062 -19.93158602 -18.31405136 -17.21386783 -16.32349938 -15.53050299 -14.75010755 -13.96101721 -13.13877723 -12.2264853 -11.23694781 -10.06769225 -8.63087616 -7.0237458 -5.33941252 -3.44559118 -1.4738842 ] cannot be cast to float
def logp( value, rho):
return np.log( rho ) + np.log( beta( value, rho+1) )
beta( repo_with_stars, 1.3)
array([ 3.96781274e-09, 7.12048348e-09, 3.60230434e-08, 1.08508004e-07, 2.64859823e-07, 5.86390404e-07, 1.28195491e-06, 2.82711390e-06, 6.44529816e-06, 1.60819963e-05, 4.33570545e-05, 1.39960612e-04, 5.90762159e-04, 2.95765600e-03, 1.59980669e-02, 1.06728840e-01, 7.69230769e-01])
x = np.linspace( 1, 50, 200 )
plt.plot( x, exp( -(x-1)**2 ), label = "Normal distribution" )
plt.plot( x, x**(-2), label = r"Power law, $\beta = -2$" )
plt.plot( x, x**(-1), label = r"Power law, $\beta = -1$" )
plt.legend()
<matplotlib.legend.Legend at 0x1487d518>
from IPython.core.display import HTML
def css_styling():
styles = open("../styles/custom.css", "r").read()
return HTML(styles)
css_styling()
import pymc as mc
beta = mc.Uniform("beta", -100, 100)
@mc.observed
def survival( value = y_, beta = beta):
return np.sum( [ value[i-1]*np.log( (i+0.)**beta - (i+1.)**beta ) for i in range(1,99) ] )
model = mc.Model( [survival, beta] )
#map_ = mc.MAP( model )
#map_.fit()
mcmc = mc.MCMC( model )
mcmc.sample( 50000, 40000 )
[****************100%******************] 50000 of 50000 complete
from pymc.Matplot import plot as mcplot
mcplot(mcmc)
Plotting beta
stars_to_explore[1:]
array([ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768])
a = stats.pareto.rvs( 2.5, size = (50000,1) )
hist(a, bins = 100)
print
y = [ (a >= i).sum() for i in range(1,100 ) ]
y_ = -np.diff(y)
print y_
print y
[41264 5572 1638 646 313 181 101 70 48 47 22 14 14 14 7 11 4 6 4 0 1 0 2 0 0 1 2 1 3 0 2 2 1 0 1 1 2 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [50000, 8736, 3164, 1526, 880, 567, 386, 285, 215, 167, 120, 98, 84, 70, 56, 49, 38, 34, 28, 24, 24, 23, 23, 21, 21, 21, 20, 18, 17, 14, 14, 12, 10, 9, 9, 8, 7, 5, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b = -2.3
np.sum( [ y_[i-1]*np.log( (i+0.)**b - (i+1.)**b ) for i in range(1,7) ] ) + y[-1]*np.log(7)
-13526.483069774908
y_
array([48930, 940, 103, 19, 7, 1])
np.append( y_, y[-1] )
array([48930, 940, 103, 19, 7, 1, 0])
mc.Uninformative?