1 Introduction

This project is about a lottery game called "array 3", a means of raising money by selling numbered tickets and giving prizes to the holders of numbers drawn at random. There are three ways to win (a quick expected-value sketch follows the list):

  1. You choose a number from 000 to 999. If your choice exactly matches the winning number, you get 1040 RMB.
  2. You choose one number A from 0 to 9 twice and a different number B at the same time, such as '112'. If the winning number contains two A's and one B (3 arrangements in total, e.g. 112, 121, 211), you get 346 RMB.
  3. You choose three different numbers from 0 to 9, such as '123'. If the winning number contains '1', '2' and '3' in any order (6 arrangements in total), you get 173 RMB.
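A quick sanity check on the prize structure (a minimal sketch, not part of the original notebook): under pure randomness the expected payout per bet is nearly identical across the three ways, and assuming a ticket costs 2 RMB (an assumption, not stated above) the expected payout ratio is about 0.52, which matches the set ratio discussed in section 4-1-2.

# expected payout per bet under pure randomness (1000 equally likely outcomes)
ev_way1 = (1 / 1000) * 1040   # exact match: 1 winning arrangement
ev_way2 = (3 / 1000) * 346    # two A's and one B: 3 arrangements
ev_way3 = (6 / 1000) * 173    # three distinct digits: 6 arrangements
print(ev_way1, ev_way2, ev_way3)   # 1.04 1.038 1.038

TICKET_PRICE = 2               # RMB per bet -- an assumption
print(ev_way1 / TICKET_PRICE)  # about 0.52, cf. section 4-1-2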
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import scipy.stats as st
%matplotlib inline

2 Data & cleaning

The data in the first CSV file come from the official website of the lottery. I copied the data and saved it as a CSV file.

2-1 Information of the data

In [2]:
url_all = "data-all.csv"
df_all = pd.read_csv(url_all)
In [3]:
df_allcopy=df_all.copy()
df_allcopy.dtypes
Out[3]:
time_serial     int64
password       object
way1            int64
way1_number     int64
way2            int64
way2_number     int64
way3           object
way3_number     int64
total_sales    object
time           object
dtype: object

time_serial is a serial number for each draw.

password is the winning number.

way1, way2 and way3 are the counts of winning bets for the three ways, and the way*_number columns give the prize (in RMB) for each way.

In [4]:
df_allcopy.head(5)
Out[4]:
time_serial password way1 way1_number way2 way2_number way3 way3_number total_sales time
0 20098 4 4 9 8148 1040 10839 346 0 173 25,888,230 2020/5/27
1 20097 9 3 5 9422 1040 0 346 31680 173 25,527,170 2020/5/26
2 20096 7 1 8 11124 1040 0 346 40734 173 26,909,786 2020/5/25
3 20095 4 2 5 12017 1040 0 346 34704 173 24,962,754 2020/5/24
4 20094 3 2 8 17743 1040 0 346 58498 173 26,033,312 2020/5/23

2-2 Cleaning the data

In [5]:
df_allcopy['time'] = pd.to_datetime(df_allcopy['time'])

#split the 'password' into three columns
df_allcopy['password_1'] = df_allcopy['password'].str.split(' ').str[0].astype(int)
df_allcopy['password_2'] = df_allcopy['password'].str.split(' ').str[1].astype(int)
df_allcopy['password_3'] = df_allcopy['password'].str.split(' ').str[2].astype(int)

# replace the '- -' in df_allcopy['way3'] with 0 and convert the values to int
df_allcopy.loc[df_allcopy['way3']=='- -', 'way3'] = 0
df_allcopy['way3'] = df_allcopy['way3'].astype(int)

# replace the '- -' in df_allcopy['total_sales'] with NaN and convert the values to float
df_allcopy.loc[df_allcopy['total_sales']=='- -', 'total_sales'] = np.nan
df_allcopy['total_sales'] = df_allcopy['total_sales'].str.replace(',', '').astype(float)


df_A = df_allcopy.set_index('time')
df_A.head(5)
Out[5]:
time_serial password way1 way1_number way2 way2_number way3 way3_number total_sales password_1 password_2 password_3
time
2020-05-27 20098 4 4 9 8148 1040 10839 346 0 173 25888230.0 4 4 9
2020-05-26 20097 9 3 5 9422 1040 0 346 31680 173 25527170.0 9 3 5
2020-05-25 20096 7 1 8 11124 1040 0 346 40734 173 26909786.0 7 1 8
2020-05-24 20095 4 2 5 12017 1040 0 346 34704 173 24962754.0 4 2 5
2020-05-23 20094 3 2 8 17743 1040 0 346 58498 173 26033312.0 3 2 8
In [6]:
df_A.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5505 entries, 2020-05-27 to 2004-11-14
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   time_serial  5505 non-null   int64  
 1   password     5505 non-null   object 
 2   way1         5505 non-null   int64  
 3   way1_number  5505 non-null   int64  
 4   way2         5505 non-null   int64  
 5   way2_number  5505 non-null   int64  
 6   way3         5505 non-null   int32  
 7   way3_number  5505 non-null   int64  
 8   total_sales  1345 non-null   float64
 9   password_1   5505 non-null   int32  
 10  password_2   5505 non-null   int32  
 11  password_3   5505 non-null   int32  
dtypes: float64(1), int32(4), int64(6), object(1)
memory usage: 473.1+ KB

2-3 Supplementary data

The column total_sales, which records the total sales of each draw, has only 1345 non-null values, which is not enough.

The following data come from other websites linked on the official lottery site. I used a web crawler to collect more data and saved it as CSV files.

2-3-1 Cleaning the data

I won't show the raw data in detail; instead I clean it with a data-cleaning function.

In [7]:
#data-clean function
def csv_clean1(url):
    df_new = pd.read_csv(url)
    
    # parse dates given as YYYYMMDD
    strftime = "%Y%m%d"
    df_new.loc[:, 'date'] = pd.to_datetime(df_new['date'], format=strftime)
    
    # strip thousands separators and stray characters, then convert to numbers
    df_new.loc[:, 'amount'] = df_new['amount'].str.replace(',', '').astype(float)
    df_new.loc[:, 'result'] = df_new['result'].str.replace(' ', '').str.replace('、', '').astype(int)
    
    df1 = df_new.set_index('date')
    return df1
In [8]:
url1= "data12897--13500-1.csv"
url2= "data13500--13796-1.csv"
#url3= "data13796--13806-1.csv"
df1=pd.concat([csv_clean1(url1), csv_clean1(url2)], axis=0)
df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 909 entries, 2010-04-28 to 2012-11-05
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   amount  909 non-null    float64
 1   result  909 non-null    int32  
dtypes: float64(1), int32(1)
memory usage: 17.8 KB
In [9]:
#data-clean function
def csv_clean2(url):
    df_new = pd.read_csv(url)
    
    #change the date strings into datetime
    strftime = "%Y年%m月%d日"  # the date strings in the csv contain Chinese characters
    df_new.loc[:, 'date'] = pd.to_datetime(df_new['date'].str.replace(' ', ''), format= strftime)
    
    #convert the other string columns to numbers; an amount of -1 marks a missing value
    df_new.loc[:, 'amount'] = df_new['amount'].str.replace(',', '').astype(int)
    df_new.loc[df_new['amount']==-1, 'amount'] = np.nan
    df_new.loc[:, 'result'] = df_new['result'].str.replace(' ', '').astype(int)
    
    df1=df_new.set_index('date')
    return df1
In [10]:
url4= "data13806--14445-1--na.csv"
df_10=csv_clean2(url4)
url5= "data14000--14233-1.csv"
url6= "data14233--14445-1--na.csv"
url7= "data14445--15058-1.csv"
df9=pd.concat([ csv_clean2(url5), csv_clean2(url6), csv_clean2(url7)], axis = 0)

Some of the data in df_10 duplicate data in df9, so let's merge them.

In [11]:
#rename df_10's columns so they don't clash after the merge
names={
    'amount': 'amount2',
    'result': 'result2'
}
df_10.rename(columns=names, inplace=True)

#outer merge on the date index; the columns from the two dataframes now have distinct names
df11 = df9.merge(df_10,left_index=True, right_index= True , how="outer")

#where a value in df9 is NaN, replace it with the value from df_10
df11.loc[df11['amount'].isnull(), 'amount']=df11.loc[df11['amount'].isnull(), 'amount2']
df11.loc[df11['result'].isnull(), 'result']=df11.loc[df11['result'].isnull(), 'result2']

df_B = pd.concat([df1, df11.loc[:, ['amount', 'result']]], axis=0)
df_B.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2161 entries, 2010-04-28 to 2016-05-09
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   amount  2023 non-null   float64
 1   result  2161 non-null   float64
dtypes: float64(2)
memory usage: 50.6 KB
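As a design note, pandas' combine_first does this align-and-fill in one step. A minimal sketch, assuming df_10 still has its original column names (i.e. before the rename above):

# keep df9's values and fill its gaps from df_10, aligned on the date index
df11_alt = df9.combine_first(df_10)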

2-3-2 Merging and completing the two dataframes

In [12]:
df_total = df_A.merge(df_B,left_index=True, right_index= True , how="left")
df_total.loc[df_total['total_sales'].isnull(),'total_sales' ] = df_total.loc[df_total['total_sales'].isnull(),'amount' ]

#compute the total prize paid out and its ratio to the total sales
df_total['total_prize'] = df_total['way1']*df_total['way1_number']+df_total['way2']*df_total['way2_number']+df_total['way3']*df_total['way3_number']
df_total['ratio_of_prize'] = df_total['total_prize']/df_total['total_sales']
df_total.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5505 entries, 2020-05-27 to 2004-11-14
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   time_serial     5505 non-null   int64  
 1   password        5505 non-null   object 
 2   way1            5505 non-null   int64  
 3   way1_number     5505 non-null   int64  
 4   way2            5505 non-null   int64  
 5   way2_number     5505 non-null   int64  
 6   way3            5505 non-null   int32  
 7   way3_number     5505 non-null   int64  
 8   total_sales     3368 non-null   float64
 9   password_1      5505 non-null   int32  
 10  password_2      5505 non-null   int32  
 11  password_3      5505 non-null   int32  
 12  amount          2023 non-null   float64
 13  result          2161 non-null   float64
 14  total_prize     5505 non-null   int64  
 15  ratio_of_prize  3368 non-null   float64
dtypes: float64(4), int32(4), int64(7), object(1)
memory usage: 805.1+ KB

3 Is the password random?

If the password is random, each digit independently follows the discrete uniform distribution on 0–9 (denoted P).

There are 3 digits in each draw. For convenience, I average them; this mean follows another distribution (denoted Q), independently across draws. Since each digit has mean 4.5 and variance (10² − 1)/12 = 8.25, Q has mean 4.5 and variance 8.25/3 = 2.75.

In [13]:
#the mean and var of Q
var_theory_Q=np.arange(10).var()/3
mean_theory_Q=np.arange(10).mean()

print(mean_theory_Q, var_theory_Q)
4.5 2.75

When we further average Q over N consecutive draws, we get yet another distribution (denoted NQ).

In [14]:
data_3_1_df = df_total.loc[:, ['password_1', 'password_2', 'password_3']].copy()  # .copy() avoids SettingWithCopyWarning below
#get Q
data_3_1_df['Q'] = data_3_1_df.mean(axis=1)

#get the mean of Q over windows of N draws -- the NQ statistic
N=200
data_3_1_df['mean'] = data_3_1_df['Q'].rolling(N).mean()
#data_3_1_df['var'] = data_3_1_df['Q'].rolling(N).var()
#data_3_1_df['var_of_mean'] = data_3_1_df['var']/N

#the mean and var of NQ on theory
mean_theory_NQ= mean_theory_Q
var_theory_NQ= var_theory_Q/N

#take non-overlapping windows so the NQ observations are independent
data_3_1_plot = data_3_1_df.iloc[range(N-1, 5505, N ),4:].copy()

#Standardize NQ
data_3_1_plot['e'] = (data_3_1_plot['mean']-mean_theory_NQ)/(var_theory_NQ)**0.5
data_3_1_plot['e_pdf']=st.norm.pdf(data_3_1_plot['e'])

data_3_1_plot.describe()
Out[14]:
mean e e_pdf
count 27.000000 27.000000 27.000000
mean 4.564259 0.548005 0.262317
std 0.113061 0.964186 0.125638
min 4.363333 -1.165497 0.047707
25% 4.460000 -0.341121 0.156323
50% 4.570000 0.596962 0.303592
75% 4.660833 1.371591 0.373573
max 4.741667 2.060940 0.394933

With N = 200, each NQ observation averages 200 independent Q values, i.e. 600 independent P values.

By the central limit theorem, NQ approximately follows the normal distribution N(mean_theory_NQ, var_theory_NQ), independently across windows.

In [15]:
fig, ax = plt.subplots(1)
data_3_1_plot.plot(
    ax=ax,
    kind='scatter',
    x='e',
    y='e_pdf',
    s=8,
    c='black'
);


a=np.linspace(-4,4,100)
plt.plot(a,st.norm.pdf(a),color='y')


# 5%
x=st.norm.isf(0.025)
y=st.norm.pdf(x)
plt.vlines(st.norm.isf(0.025),0,y,ls='--' ,color='b');
plt.vlines(-st.norm.isf(0.025),0,y,ls='--',color='b');

# 1%
y2=st.norm.pdf(st.norm.isf(0.005))
plt.vlines(st.norm.isf(0.005),0,y2,ls='--',color='r');
plt.vlines(-st.norm.isf(0.005),0,y2,ls='--',color='r');

We can see that of the 27 observations, two fall outside the 95% confidence interval, while all of them lie within the 99% confidence interval.

But there is still a problem: most of the observations lie on the right side (e > 0).
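Both observations can be quantified under the null (a minimal sketch, my addition): the count of points outside the 95% band is Binomial(27, 0.05), and the count of positive e values is Binomial(27, 0.5).

# how surprising are 2 of 27 points outside the 95% band?
print(st.binom.sf(1, 27, 0.05))   # P(X >= 2), roughly 0.39 -- not unusual by itself

# sign test for the rightward shift
k = int((data_3_1_plot['e'] > 0).sum())
print(st.binom_test(k, 27, 0.5))  # two-sided p-value (st.binomtest in newer SciPy)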

Let's adjust N to 5500.

In [16]:
data_3_1_df = df_total.loc[:, ['password_1', 'password_2', 'password_3']].copy()  # .copy() avoids SettingWithCopyWarning below
#get Q
data_3_1_df['Q'] = data_3_1_df.mean(axis=1)

#get the mean of Q over windows of N draws -- the NQ statistic
N=5500
data_3_1_df['mean'] = data_3_1_df['Q'].rolling(N).mean()

#the mean and var of NQ on theory
mean_theory_NQ= mean_theory_Q
var_theory_NQ= var_theory_Q/N

#take non-overlapping windows so the NQ observations are independent
data_3_1_plot = data_3_1_df.iloc[range(N-1, 5505, N ),4:].copy()

#Standardize NQ
data_3_1_plot['e'] = (data_3_1_plot['mean']-mean_theory_NQ)/(var_theory_NQ)**0.5
data_3_1_plot['e_pdf']=st.norm.pdf(data_3_1_plot['e'])

With N = 5500, the single NQ observation averages essentially all of the draws. By the central limit theorem it follows the normal distribution N(mean_theory_NQ, var_theory_NQ) very closely.

In [17]:
fig, ax = plt.subplots(1)
data_3_1_plot.plot(
    ax=ax,
    kind='scatter',
    x='e',
    y='e_pdf',
    s=8,
    c='black'
);

a=np.linspace(-4,4,100)
plt.plot(a,st.norm.pdf(a),color='y')

# 5%
x=st.norm.isf(0.025)
y=st.norm.pdf(x)
plt.vlines(st.norm.isf(0.025),0,y,ls='--' ,color='b');
plt.vlines(-st.norm.isf(0.025),0,y,ls='--',color='b');

# 1%
y2=st.norm.pdf(st.norm.isf(0.005))
plt.vlines(st.norm.isf(0.005),0,y2,ls='--',color='r');
plt.vlines(-st.norm.isf(0.005),0,y2,ls='--',color='r');

We can see that the single observation falls outside the 95% confidence interval and sits at the edge of the 99% confidence interval. At the same time, it lies on the right side (e > 0).
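To quantify this, a two-sided z-test on this single standardized observation (a minimal sketch, my addition):

z = data_3_1_plot['e'].iloc[0]   # the one standardized mean for N = 5500
p = 2 * st.norm.sf(abs(z))       # two-sided p-value under the null
print(z, p)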

So the randomness of the numbers is not as perfect as we might think.

4 The sales & the prize ratio

4-1 In general

In [18]:
#get a dataframe with the complete time series
value = {'nan': np.nan}
time = pd.date_range(start='2004-11-14', end='2020-05-27')
df_time = pd.DataFrame(value, index=time)
a = [ 'way1', 'way2', 'way3', 'total_sales', 'total_prize']
data_4_1_df = df_time.merge(df_total.loc[:, a].dropna(),left_index=True, right_index= True , how="left").iloc[:, 1:]

4-1-1 The total sales

In [19]:
#get a dataframe with the mean over each business quarter
data_4_1_Q=data_4_1_df.resample("BQ").mean()
data_4_1_Q.dropna(inplace=True)
data_4_1_Q['ratio_of_prize'] = data_4_1_Q.loc[:, 'total_prize']/data_4_1_Q.loc[:, 'total_sales']
data_4_1_Q['Y']=data_4_1_Q.index.year

data_4_1_Q.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 41 entries, 2010-06-30 to 2020-06-30
Freq: BQ-DEC
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   way1            41 non-null     float64
 1   way2            41 non-null     float64
 2   way3            41 non-null     float64
 3   total_sales     41 non-null     float64
 4   total_prize     41 non-null     float64
 5   ratio_of_prize  41 non-null     float64
 6   Y               41 non-null     int64  
dtypes: float64(6), int64(1)
memory usage: 2.6 KB
In [20]:
#get a dataframe with the mean over each year
data_4_1_Y=data_4_1_df.resample("Y").mean()
data_4_1_Y.dropna(inplace=True)
data_4_1_Y['ratio_of_prize'] = data_4_1_Y.loc[:, 'total_prize']/data_4_1_Y.loc[:, 'total_sales']
data_4_1_Y['Y']=data_4_1_Y.index.year
data_4_1_Y.head()
Out[20]:
way1 way2 way3 total_sales total_prize ratio_of_prize Y
2010-12-31 8169.415323 2431.403226 12527.794355 2.250945e+07 1.095191e+07 0.486547 2010
2011-12-31 8773.952514 1963.958101 11970.259777 2.317832e+07 1.131766e+07 0.488286 2011
2012-12-31 7589.482249 2025.369822 9459.044379 1.981185e+07 9.751048e+06 0.492183 2012
2013-12-31 6574.768025 1757.429467 7981.943574 1.720629e+07 8.414256e+06 0.489022 2013
2014-12-31 5801.340502 1585.340502 6981.899642 1.567297e+07 7.594769e+06 0.484578 2014
In [21]:
#merge them
data_4_1_QY=data_4_1_Q.reset_index().merge(data_4_1_Y.reset_index(), on='Y', how='left').set_index('index_x')
data_4_1_QY.head()
data_4_1_QY['Q/Y']= data_4_1_QY['total_sales_x']/data_4_1_QY['total_sales_y']
data_4_1_QY.head()
Out[21]:
way1_x way2_x way3_x total_sales_x total_prize_x ratio_of_prize_x Y index_y way1_y way2_y way3_y total_sales_y total_prize_y ratio_of_prize_y Q/Y
index_x
2010-06-30 7657.468750 2062.234375 12303.546875 2.328062e+07 1.028595e+07 0.441825 2010 2010-12-31 8169.415323 2431.403226 12527.794355 2.250945e+07 1.095191e+07 0.486547 1.034260
2010-09-30 8477.021739 2239.652174 12485.782609 2.190322e+07 1.119144e+07 0.510949 2010 2010-12-31 8169.415323 2431.403226 12527.794355 2.250945e+07 1.095191e+07 0.486547 0.973068
2010-12-31 8217.945652 2879.967391 12725.804348 2.257920e+07 1.117566e+07 0.494954 2010 2010-12-31 8169.415323 2431.403226 12527.794355 2.250945e+07 1.095191e+07 0.486547 1.003099
2011-03-31 7997.963855 2315.421687 11951.722892 2.382693e+07 1.065117e+07 0.447022 2011 2011-12-31 8773.952514 1963.958101 11970.259777 2.317832e+07 1.131766e+07 0.488286 1.027983
2011-06-30 9213.142857 2253.626374 11893.230769 2.502334e+07 1.183722e+07 0.473047 2011 2011-12-31 8773.952514 1963.958101 11970.259777 2.317832e+07 1.131766e+07 0.488286 1.079601
In [22]:
fig, ax = plt.subplots(1)

data_4_1_Y['total_sales'].plot(
    ax=ax,
    color='r',
    ylim=[10**7,2.75*10**7]
);

data_4_1_Q['total_sales'].plot(
    ax=ax,
    color='y',
    ylim=[10**7,2.75*10**7]
);

From this, we can see that:

Sales declined significantly between 2011 and 2016.

Sales rose slowly from 2016 to 2019. (Sales appear to rise rapidly in 2020, possibly an artifact of missing data.)

Sales tend to fluctuate between quarters.

In [23]:
fig, ax = plt.subplots(1)
data_4_1_QY.plot(
    ax=ax,
    y='Q/Y'
);
plt.hlines(1,'2010-06-30','2020-06-30',ls='--' ,color='b');
for i in range(2010,2021,1):
    a=f'{i}'+'-06-30'
    plt.vlines(a, 0.8,1.1,ls='--',color='r');

From this, we can see that:

Sales fluctuate clearly from quarter to quarter. Sales in the first quarter are often higher than in the other quarters.

4-1-2 The prize ratio

Let's discuss the ratio of prizes paid out to sales.

From 2010 to 2014, this ratio was set at around 50%.

From August 2014 on, it was set to 52%.
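A quick check of the realized ratio in the two regimes (a minimal sketch, my addition; the cutoff 2014-08-01 is taken from the text above):

# mean realized payout ratio before and after the rule change
cut = '2014-08-01'
print(df_total.loc[df_total.index < cut, 'ratio_of_prize'].mean())
print(df_total.loc[df_total.index >= cut, 'ratio_of_prize'].mean())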

In [24]:
fig, ax = plt.subplots(1)

data_4_1_Q['ratio_of_prize'].plot(
    ax=ax,
    color='b'
);

data_4_1_Y['ratio_of_prize'].plot(
    ax=ax,
    color='r'
);

plt.hlines(0.5,'2010-06-30','2014-07-30',ls='--' ,color='b');
plt.hlines(0.52,'2014-07-30','2020-06-30',ls='--' ,color='b');
plt.hlines(0.49,'2010-06-30','2014-07-30',ls='--' ,color='r');
plt.hlines(0.515,'2014-07-30','2020-06-30',ls='--' ,color='r');

We can see that the actual ratio is often smaller than the set ratio.

Since 2014, the annual average ratio has exceeded the set ratio only once.

Before 2014, the annual average of the actual ratio was always around 0.49, lower than 0.50.

4-2 Some details

In [25]:
value = {'nan': np.nan}
time = pd.date_range(start='2004-11-14', end='2020-05-27')
df_time = pd.DataFrame(value, index=time)
a = [ 'way1', 'way2', 'way3', 'total_sales', 'total_prize', 'ratio_of_prize']
data_4_2_df = df_time.merge(df_total.loc[:, a].dropna(),left_index=True, right_index= True , how="left").iloc[:, 1:]

data_4_2_df['year']=data_4_2_df.index.year
data_4_2_df['weekofyear']=data_4_2_df.index.weekofyear
data_4_2_df['dayofweek']=data_4_2_df.index.dayofweek

data_4_2_df['day']=data_4_2_df.index.day
data_4_2_df['month']=data_4_2_df.index.month

data_4_2_df.head()
Out[25]:
way1 way2 way3 total_sales total_prize ratio_of_prize year weekofyear dayofweek day month
2004-11-14 NaN NaN NaN NaN NaN NaN 2004 46 6 14 11
2004-11-15 NaN NaN NaN NaN NaN NaN 2004 47 0 15 11
2004-11-16 NaN NaN NaN NaN NaN NaN 2004 47 1 16 11
2004-11-17 NaN NaN NaN NaN NaN NaN 2004 47 2 17 11
2004-11-18 NaN NaN NaN NaN NaN NaN 2004 47 3 18 11

4-2-1 Week

In [26]:
data_4_2_1_df=data_4_2_df.loc[:, ['year', 'weekofyear', 'dayofweek', 'total_sales']]
data_4_2_1_df.head()
Out[26]:
year weekofyear dayofweek total_sales
2004-11-14 2004 46 6 NaN
2004-11-15 2004 47 0 NaN
2004-11-16 2004 47 1 NaN
2004-11-17 2004 47 2 NaN
2004-11-18 2004 47 3 NaN
In [27]:
b=data_4_2_1_df.pivot_table( index=['year','weekofyear'], columns="dayofweek", values='total_sales').dropna()
b.idxmax(axis=1).value_counts().sort_index().plot.bar();

It can be clearly seen that draws with results announced on Friday most often had the best sales of the week, while draws announced on Saturday most often had the worst.

So people are more inclined to buy tickets for draws whose results are announced on Friday.
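The pivot above counts which weekday wins each week; as a cross-check, we can also average sales by weekday directly (a minimal sketch, my addition):

# mean sales per announcement weekday (0 = Monday, ..., 6 = Sunday)
data_4_2_1_df.groupby('dayofweek')['total_sales'].mean().plot.bar();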

4-2-2 Month

How about the month?

In [28]:
data_4_2_2_df=data_4_2_df.loc[:, ['year', 'month', 'day', 'total_sales']]
data_4_2_2_df.head()
Out[28]:
year month day total_sales
2004-11-14 2004 11 14 NaN
2004-11-15 2004 11 15 NaN
2004-11-16 2004 11 16 NaN
2004-11-17 2004 11 17 NaN
2004-11-18 2004 11 18 NaN
In [29]:
b=data_4_2_2_df.pivot_table(index=['year','month'], columns="day", values='total_sales')
indexs_full=b.isnull().sum(axis=1)<4  # keep only months with at most 3 missing days

b.loc[indexs_full, :].idxmax(axis=1).value_counts().sort_index().plot.bar();

We can't see anything obvious in this picture.
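As a final cross-check (a minimal sketch, my addition), average sales by day of month directly:

# mean sales per day of month, ignoring missing values
data_4_2_2_df.groupby('day')['total_sales'].mean().plot.bar();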