{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(AER)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instrumental Variables\n",
"\n",
"by Jonas Peters, Niklas Pfister, 29.12.2017\n",
"\n",
"This notebook aims to give you a basic understanding of the instrumental variable approach and when it can be used to infer causal relations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of this method is to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. The idea of the instrumental variable approach is to account for this confounding by considering an additional variable $I$ called an instrument. Although there exist numerous extensions, here, we focus on the classical case. We provide two definitions.\n",
"\n",
"\n",
"First, assume the following SCM\n",
"\\begin{align}\n",
"I &:= N_I\\\\\n",
"H &:= N_H\\\\ \n",
"X &:= \\gamma I + \\delta_X H + N_X\\\\\n",
"Y &:= \\beta X + \\delta_Y H + N_Y.\\\\\n",
"\\end{align}\n",
"The corresponding DAG looks as follows.\n",
"\\begin{align}\n",
" &\\phantom{0}\\\\\n",
" &\\begin{array}{ccc}\n",
" & & &H & \\\\\n",
" & &\\phantom{abcdefgh}\\overset{\\delta_X}{\\swarrow} & & \\overset{\\delta_Y}{\\searrow}\\phantom{abcdefgh}\\\\\n",
" & & & & \\\\\n",
" I &\\overset{\\gamma}{\\longrightarrow} &X & \\overset{\\beta}{\\longrightarrow} & Y\\\\\n",
" \\end{array}\\\\\n",
" &\\phantom{0}\n",
"\\end{align}\n",
"Here, $I$ is called an instrumental variable for the causal effect from $X$ to $Y$. It is essential that $I$ effects $Y$ only via $X$ (and not directly).\n",
"\n",
"\n",
"\n",
"Second, it is possible to define instrumental variables without SCMs, too. Let us therefore write\n",
"\\begin{equation}\n",
"Y = \\beta X + \\epsilon_Y\n",
"\\end{equation}\n",
"(this can always be done). Here, $\\epsilon_Y$ is allowed to depend on $X$ (if there is a confounder $H$ between $X$ and $Y$, this is likely to be the case). We then call a variable $I$ an instrumental variable if it satisfies the following two conditions:\n",
"\n",
"1. $\\operatorname{cov}(X,I)\\neq 0$ (relevance)\n",
"2. $\\operatorname{cov}(\\epsilon_Y,I)=0$ (exogenity).\n",
"\n",
"Informally speaking, these conditions mean that $I$ affects $Y$ only through its effect on $X$."
]
},
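{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the hidden confounder $H$ is problematic, the following minimal simulation sketch (not part of the original analysis; the coefficients are chosen arbitrarily for illustration) generates data from the SCM above and fits an ordinary least squares regression of $Y$ on $X$. The resulting coefficient does not recover the causal effect $\\beta$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# simulation sketch: generate data from the SCM above with arbitrarily chosen coefficients\n",
"set.seed(1)\n",
"n <- 10000\n",
"beta <- 2; gamma <- 1; delta_X <- 3; delta_Y <- -4\n",
"I_sim <- rnorm(n)\n",
"H_sim <- rnorm(n)\n",
"X_sim <- gamma * I_sim + delta_X * H_sim + rnorm(n)\n",
"Y_sim <- beta * X_sim + delta_Y * H_sim + rnorm(n)\n",
"# ordinary least squares of Y on X is biased because of the hidden confounder H\n",
"coef(lm(Y_sim ~ X_sim))"
]
},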
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimation\n",
"\n",
"We now want to illustrate how the instrumental variable $I$ can be used to estimate the causal effect $\\beta$ in the model above. To this end we use the CollegeDistance data set from [1] available in the R package AER."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# load CollegeDistance data set\n",
"data(\"CollegeDistance\")\n",
"# read out relevant variables\n",
"Y <- CollegeDistance$score\n",
"X <- CollegeDistance$education\n",
"I <- CollegeDistance$distance"
]
},
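{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick, optional look at the data set and the three variables used here (not part of the original analysis):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: dimensions and first rows of the relevant variables\n",
"dim(CollegeDistance)\n",
"head(CollegeDistance[, c(\"score\", \"education\", \"distance\")])"
]
},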
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data set consists of $4739$ observations on $14$ variables from high school student survey conducted by the Department of Education in $1980$, with a follow-up in $1986$. In this notebook, we only consider the following variables:\n",
"* $Y$ - base year composite test score. These are achievement tests given to high school seniors in the sample.\n",
"* $X$ - number of years of education.\n",
"* $I$ - distance from closest 4-year college (units are in 10 miles).\n",
"\n",
"We now assume that $I$ is a valid instrument (we come back to this question in Exercise 2 below). To estimate the causal effect of $X$ on $Y$ we can then use a so-called 2-stage least squares (2SLS) procedure, which goes as follows:\n",
"* Step 1: Regress $X$ on $I$ and compute the corresponding predicted values $\\hat{X}$ of $X$.\n",
"* Step 2: Regress $Y$ on $\\hat{X}$, then the resulting regression coefficient is asymptotically equivalent to the causal effect of $X$ on $Y$.\n",
"\n",
"The following four exercises go over some of the details of the 2SLS and apply it to the above data set."
]
},
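{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before turning to the exercises, here is a minimal sketch of the two 2SLS steps, applied to the simulated data from the code sketch above rather than to the CollegeDistance data (that part is left for Exercise 3)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2SLS sketch on the simulated data from above (true causal effect: beta = 2)\n",
"# Step 1: regress X on I and compute the fitted (predicted) values\n",
"X_hat <- fitted(lm(X_sim ~ I_sim))\n",
"# Step 2: regress Y on the fitted values; the slope estimates the causal effect\n",
"coef(lm(Y_sim ~ X_hat))"
]
},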
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1\n",
"Assume the following two structural assignments\n",
"\\begin{align*}\n",
"Y &:= \\beta X + \\epsilon_Y \\\\\n",
"X &:= \\gamma I + \\epsilon_X,\n",
"\\end{align*}\n",
"where $\\epsilon_X$ and $\\epsilon_Y$ are not necessarily independent, but the instrument $I$ is assumed to satisfy the assumptions 1 and 2 above. Prove that the 2-step least square method does indeed give a consistent estimator of causal effect in this case. Hint: For simplicity you may also assume that $\\operatorname{cov}(I, \\epsilon_X)=0$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Solution 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End of Solution 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"\n",
"Argue whether the variable $I$ can be used as an instrumental variable to infer the causal effect of $X$ on $Y$. Why might it not be a valid instrument? Hint: You can perform a regression in order to test if there is significant correlation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Solution 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End of Solution 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"Use 2SLS to estimate the causal effect of $X$ on $Y$ based on the instrument $I$. Compare your results with a standard OLS regression of $Y$ on $X$ (that includes an intercept). What happens to the correlation between $X$ and the residuals in both methods? Which attempt yields smaller variance of residuals?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Solution 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End of Solution 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A slightly different approach to 2SLS is to use the formula\n",
"\n",
"\\begin{equation} \\tag{1}\n",
"\\beta=\\frac{\\operatorname{cov}(Y,I)}{\\operatorname{cov}(X,I)}.\n",
"\\end{equation}\n",
"\n",
"This formula can be shown quite easily using the same setting as in Exercise 1 (try proving it). Replacing the population covariance by the corresponding empirical estimates again results in a consistent estimator."
]
},
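{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of formula (1), the following sketch applies it to the simulated data from the first code sketch above; Exercise 4 asks you to apply it to the CollegeDistance data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# covariance-ratio estimator (1) on the simulated data from above (true causal effect: beta = 2)\n",
"cov(Y_sim, I_sim) / cov(X_sim, I_sim)"
]
},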
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 4\n",
"Apply the above estimator (1) to CollegeDistance data and compare your result with the one from Exercise 3."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Solution 4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End of Solution 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[1] Kleiber, C., A. Zeileis (2008). Applied Econometrics with R. Springer-Verlag New York."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.3.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}