```python
# Initialize Otter
import otter
grader = otter.Notebook("ps4.ipynb")
```
Before getting started on the assignment, run the cell at the very top that imports `otter` and the cell below, which imports the packages we need.
Important: As mentioned in Problem Set 0, if you leave this notebook alone for a while and come back, DataHub will "forget" which code cells you have run in order to save memory, and you may need to restart your kernel and run all of the cells from the top. That includes this code cell that imports packages. If you get `<something> not defined` errors, it is because you didn't run an earlier code cell that you needed to run. It might be this cell or the `otter` cell above.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
```
Does the stock market efficiently use information in valuing stocks? The Efficient Markets Hypothesis (“EMH”), developed by Nobel Prize winner Eugene Fama, maintains that current stock prices fully reflect all available information. An implication of this hypothesis is that returns in the current period should not be systematically related to information known in earlier periods. Otherwise, we could use this information to predict stock returns, thus violating EMH. As an analyst at an investment management company, you have been tasked with examining the validity of the EMH. You obtained a dataset of 142 randomly selected firms listed on the New York Stock Exchange, consisting of the following four variables:
Variable | Description |
---|---|
return | Total return from holding a firm’s stock over a one-year period, from January 2014 to December 2014. Note that an annual return such as 31.4% is entered in the dataset as 31.4. |
dkr | A firm’s debt to capital ratio in 2013. |
lnetincome | Natural log of the net income for a firm in 2013. |
lsalary | Natural log of the total compensation for a firm’s CEO in 2013. |
Using these data, you estimated the following two regressions.
Regression 1
Regression 2
Question 1.a. Based on the results for the two OLS regressions, what is the sign of the correlation between `dkr` and `lnetincome`? Alternatively, is there not enough information to determine the sign of the correlation?
Type your answer here, replacing this text.
Question 1.b. Interpret the coefficient on `lnetincome` in Regression 2.
Type your answer here, replacing this text.
Now suppose you added another variable to the regression, and obtained the following regression results.
Regression 3
Question 1.c. Suppose that you use Regression 3 to examine whether EMH holds. What are the null and alternative hypotheses?
Type your answer here, replacing this text.
Question 1.d. Carry out the test in part (c) at the 5% level. Do you reject or fail to reject the null hypothesis?
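As a rough illustration of the decision rule for a joint test at the 5% level, the sketch below looks up critical values with `scipy.stats`. It assumes, purely for illustration, that the null restricts three coefficients (`q = 3`) in a regression with three regressors and 142 observations. The test statistic `f_stat_hypothetical` is a made-up placeholder, not a value from the regression output, and which critical value your course uses (finite-sample or large-sample) is also an assumption to check against your notes.

```python
from scipy import stats

n_obs = 142   # sample size stated in the problem
q = 3         # ASSUMPTION: number of restrictions under the null
k = 3         # ASSUMPTION: number of regressors, excluding the constant
alpha = 0.05  # significance level

# Finite-sample critical value from the F(q, n - k - 1) distribution.
crit_f = stats.f.ppf(1 - alpha, dfn=q, dfd=n_obs - k - 1)

# Large-sample critical value, F(q, infinity), which equals chi2(q) / q.
crit_f_large = stats.chi2.ppf(1 - alpha, df=q) / q

# HYPOTHETICAL test statistic; replace it with the one from the regression output.
f_stat_hypothetical = 3.10
reject = f_stat_hypothetical > crit_f

print(f"F({q}, {n_obs - k - 1}) critical value: {crit_f:.3f}")
print(f"F({q}, inf) critical value: {crit_f_large:.3f}")
print("Reject the null" if reject else "Fail to reject the null")
```

The rule is the same for any test statistic: reject the null when the statistic exceeds the critical value for the chosen significance level.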
Type your answer here, replacing this text.
Question 1.e. Interpret the result you obtained in part (d), in light of your task of examining the validity of EMH.
Type your answer here, replacing this text.
Question 1.f. Provide (at least) two reasons why there might be imperfect multicollinearity present in Regression 3.
Type your answer here, replacing this text.
Question 1.g. Which of the following statements is true based on a comparison of Regression 2 and Regression 3?
- `dkr` and `lnetincome` are highly correlated.
- `dkr` and `lsalary` are highly correlated.
- `lnetincome` and `lsalary` are highly correlated.

Type your answer here, replacing this text.
Question 1.h. The sample of 142 stocks includes only companies that were traded on the NYSE as of the end of 2013. A company that went out of business before the end of that year, for instance, could not enter the sample. How would this sampling affect the estimated coefficients relative to the population regression?
Type your answer here, replacing this text.
Antitrust authorities have long been concerned that airline carriers may exercise their market power by charging higher fares. The greatest concern arises when one airline runs the vast majority of flights in and out of an airport. Usually this happens when an airline designates an airport as a national or regional “hub” of its operations. The dataset `airfares.csv` consists of average fares and other characteristics of popular U.S. origin-destination pairs (e.g., Boston-Chicago) for the year 2000.
Variable | Description | Units |
---|---|---|
lfare | logarithm of the average fare on the route | log of fare in 2000 dollars |
dist | distance of the route | thousands of miles |
passen | average number of passengers per day | thousands of passengers |
concen | market share of biggest airline carrier on the route, measured in terms of passengers carried | fraction (e.g., 0.55 = 55% market share) |
origin | city of origin of flight | |
destin | city of destination of flight | |
```python
af = pd.read_csv("airfares.csv")
af.head()
```
Question 2.a. Regress `lfare` on `dist`, `passen`, and `concen`, with robust standard errors. Make sure the cell below (and all regression questions in this assignment) shows your regression results as in previous assignments; otherwise we cannot give credit. This assignment will be a little less guided. Make sure to use different variable names for each separate coding part to avoid unexpected errors from reusing variables. Refer to previous assignments if you need a refresher on how we performed different regressions. Don't forget to add a constant to your regressions.
...
Question 2.b. What is the interpretation of the coefficient on `passen`?
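When the dependent variable is in logs and the regressor is in levels, a coefficient can be read as an approximate percentage change. The short sketch below compares the usual approximation with the exact conversion. The coefficient value is a hypothetical placeholder, not an estimate from this dataset.

```python
import math

# HYPOTHETICAL coefficient on a regressor x when the outcome is log(y).
beta_hypothetical = 0.05

# Approximation: a one-unit increase in x changes y by about 100 * beta percent.
approx_pct = 100 * beta_hypothetical

# Exact conversion implied by the log specification.
exact_pct = 100 * (math.exp(beta_hypothetical) - 1)

print(f"approximate: {approx_pct:.2f}%, exact: {exact_pct:.2f}%")
```

The two numbers are close when the coefficient is small, which is why the approximation is commonly used in interpretation. Remember to state the units of the regressor when describing a "one-unit" change.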
Type your answer here, replacing this text.
Question 2.c. Based on your OLS estimates, and assuming the OLS assumptions hold, what is the partial effect of the market share of the largest carrier on air fares? Is your answer consistent with the hypothesis that firms use their market power to charge higher prices?
Type your answer here, replacing this text.
Question 2.d. How would you test whether market power is used the same way on more popular and less popular routes? Write down the model and the hypothesis, carry out the estimation and the test.
This question is for your code, the next is for your explanation.
...
Question 2.e. Explain.
Type your answer here, replacing this text.
Question 2.f. We need to question whether the results of the regression in part (d) reveal a causal relationship between concentration and airfares. In particular, we are concerned about whether our estimation results on U.S. data are valid for other markets, such as Europe and Asia. Give one reason why the results would not be “externally valid” if applied to the airline industry in one of these other two regions.
Type your answer here, replacing this text.
Question 2.g. We are also aware of several potential threats to “internal validity” of the results. For each one of the five main internal validity threats, describe one possibility that could plausibly lead to that particular threat.
Type your answer here, replacing this text.
The World Health Organization (“WHO”) collects data that assess the health care outcomes of the populations in 191 countries across the globe and explore potential explanations for those outcomes. These data are published in the annual “World Health Report.” The file `who.csv` contains five years (1993-1997) of these data. The variables in the panel of countries include:
Variable | Description |
---|---|
comp | composite measure of health care attainment |
dale | disability-adjusted life expectancy |
year | 1993,1994,1995,1996,1997 |
hexp | per capita health expenditure |
hc3 | educational attainment (tertiary schooling) |
country | number assigned to country |
oecd | dummy indicator for an OECD member country |
gini | Gini coefficient for income inequality |
geff | World Bank measure of government effectiveness |
voice | World Bank measure of democratization of the political process |
tropics | dummy indicator of tropical location |
popden | population density (people per square mile) |
pubthe | proportion of health expenditure paid by public authorities |
gdpc | normalized per-capita GDP |
```python
who = pd.read_csv("who.csv")
who.head()
```
Question 3.a. Create a new variable for the dataset that is the square of educational attainment (`hc3`). Then regress life expectancy (`dale`) on health expenditures (`hexp`), the educational attainment in the country (`hc3`), and its square (the variable you created). For now, select rows from 1997 and use only these rows in the regression. Use robust standard errors and don't forget to add a constant term. Comment on whether you think the relationship between life expectancy and education is linear or quadratic and why you came to that conclusion.
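The two data-preparation steps described above (adding a squared column and keeping only one year's rows) can be sketched on a toy DataFrame. The frame `toy` and its columns are made up for illustration; the same pattern would apply to the real dataset with its own column names.

```python
import pandas as pd

# Toy data with hypothetical columns; not the WHO dataset.
toy = pd.DataFrame({
    "year": [1996, 1997, 1997],
    "x": [1.0, 2.0, 3.0],
})

# Add the square of an existing column as a new column.
toy["x_sq"] = toy["x"] ** 2

# Keep only the rows for a single year using a boolean mask.
toy_1997 = toy[toy["year"] == 1997]
print(toy_1997)
```

Creating the squared column before filtering means both the full and the filtered frames contain it, which is convenient if a later part reuses the variable.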
This question is for your code, the next is for your explanation.
...
Question 3.b. Explain.
Type your answer here, replacing this text.
Question 3.c. To the specification in part (a), add the additional control variables: `gini`, `tropics`, `popden`, `pubthe`, `gdpc`, `voice`, and `geff`. Test whether these additional regressors are jointly significant (we do the F-test for you in this part; you just have to interpret it). What effect does inclusion of these additional controls have on the coefficients of the other included regressors?
This question is for your code, the next is for your explanation.
```python
# This is the code for your regression.
# We give you starter code for this one so that we know what the variable name is
# for the regression results, which we use in the code cell below.
model_3b = ...
results_3b = ...
results_3b.summary()
```
```python
# Please don't change this cell, just run it.
# This is how you do an F-test. Notice that we do .f_test on the results of the
# unrestricted model, and then we give the names of the variables we want to
# test inside quotation marks.
results_3b.f_test("gini, tropics, popden, pubthe, gdpc, voice, geff").summary()
```
Question 3.d. Explain.
Type your answer here, replacing this text.
Question 3.e. Return to the simpler regression specification in part (a). We want to see if the determinants of life expectancy are different for rich and poor countries. Use membership in the “Organisation for Economic Co-operation and Development” (`oecd`) as the indicator of a rich country. The OECD had 30 member countries during this time period. Perform a test of the hypothesis that all three of the coefficients in the population regression are equal for OECD and non-OECD countries.
Hint: You will need to create three new variables.
This question is for your code, the next is for your explanation.
...
```python
# This extra code cell may be helpful
...
```
Question 3.f. Explain.
Type your answer here, replacing this text.
The table of results that will be shared on bCourses (under Problem Sets) is copied from a paper by Jaeger and Page (1996) entitled “Degrees Matter: New Evidence on Sheepskin Effects in the Returns to Education,” The Review of Economics and Statistics. The question is whether employers pay in relation to years of education or whether there is an additional premium for obtaining a degree. Such a premium might be called the “sheepskin effect” (because diplomas at one time were printed on a sheet of sheepskin) or the “diploma effect.” The Jaeger and Page paper estimates the magnitude of this effect. Note: an empty cell in the table means that the variable is not included in the regression.
Question 4.a. Why do you think Jaeger and Page estimate their model using only people of a single race and gender (in this particular case the sample consists of white males)?
Type your answer here, replacing this text.
Question 4.b. Look at column (3) of the table. In words, interpret the coefficient on the dummy variable “9”.
Hint: Note that “12” is the omitted category.
Type your answer here, replacing this text.
Question 4.c. Why do you think the effect of the 14th year of education is larger than that of the 15th?
Type your answer here, replacing this text.
Question 4.d. Now look at column (4). Think about a student who is currently a senior. What is the average difference between the student's wage now and the wage that the student could get at the end of the year following graduation?
Type your answer here, replacing this text.
Question 4.e. Based on the results presented in this column, would you rather choose to complete a PhD or a professional degree? Explain.
Type your answer here, replacing this text.
Question 4.f. Using the results from columns (3) and (4), how would you test for the presence of a “diploma effect”? Carry out the test at a 5% significance level.
Hint: You may find some of the information you need in the footnote of the table.
Type your answer here, replacing this text.
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a PDF file for you to submit. Please save before exporting!
```python
# Save your notebook first, then run this cell to export your submission.
grader.to_pdf(pagebreaks=False, display_link=True)
```