Linear Model and Extensions

Peng Ding

To students and readers
who are interested in linear models

Acronyms

I try hard to avoid using acronyms to reduce the unnecessary burden for reading. The following are standard and will be used repeatedly.

ANOVA	(Fisher’s) analysis of variance
CLT	central limit theorem
CV	cross-validation
EHW	Eicker–Huber–White (robust covariance matrix or standard error)
FWL	Frisch–Waugh–Lovell (theorem)
GEE	generalized estimating equation
GLM	generalized linear model
HC	heteroskedasticity-consistent (covariance matrix or standard error)
IID	independent and identically distributed
LAD	least absolute deviations
lasso	least absolute shrinkage and selection operator
MLE	maximum likelihood estimate
OLS	ordinary least squares
RSS	residual sum of squares
WLS	weighted least squares

Symbols

All vectors are column vectors as in R unless stated otherwise. Let the superscript “ ${}^{\tiny\textsc{t}}$ ” denote the transpose of a vector or matrix.

$\stackrel{{\scriptstyle\textup{a}}}{{\sim}}$	approximation in distribution
$\mathbb{R}$	the set of all real numbers
$\beta$	regression coefficient
$\varepsilon$	error term
$H$	hat matrix $H=X(X^{\tiny\textsc{t}}X)^{-1}X^{\tiny\textsc{t}}$
$h_{ii}$	leverage score: the $(i,i)$th element of the hat matrix $H$
$I_{n}$	identity matrix of dimension $n\times n$
$x_{i}$	covariate vector for unit $i$
$X$	covariate matrix
$Y$	outcome vector
$y_{i}$	outcome for unit $i$
$\perp\!\!\!\perp$	independence and conditional independence
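
As a rough illustration of the hat matrix and leverage scores listed above, the following minimal sketch computes $H$ and $h_{ii}$ directly from the formula and compares them with the output of base R. The simulated data, the sample size, the coefficient values, and the seed are all assumptions made for illustration only.

```r
# Minimal sketch: hat matrix and leverage scores on simulated data.
# Sample size, coefficients, and seed are arbitrary illustrative choices.
set.seed(1)
n <- 50
X <- cbind(1, matrix(rnorm(n * 2), n, 2))      # covariate matrix with an intercept column
Y <- as.vector(X %*% c(1, 2, -1)) + rnorm(n)   # outcome vector from a linear model

H <- X %*% solve(t(X) %*% X) %*% t(X)          # H = X (X^T X)^{-1} X^T
h <- diag(H)                                   # leverage scores h_ii

fit <- lm(Y ~ X - 1)                           # OLS on X; "- 1" avoids a duplicate intercept
max(abs(h - hatvalues(fit)))                   # numerically zero
max(abs(drop(H %*% Y) - fitted(fit)))          # H maps Y to the fitted values
sum(h)                                         # trace of H equals the number of columns of X
```

Forming $H$ explicitly is convenient for small examples like this one; for large $n$ the $n\times n$ matrix becomes expensive, and `hatvalues()` returns the leverage scores without building it.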

Useful R packages

This book uses the following R packages and functions.

package	function or data	use
car	hccm	Eicker–Huber–White robust standard error
	linearHypothesis	testing linear hypotheses in linear models
foreign	read.dta	read Stata data
gee	gee	generalized estimating equation
HistData	GaltonFamilies	Galton’s data on parents’ and children’s heights
MASS	lm.ridge	ridge regression
	glm.nb	negative-binomial regression
	polr	proportional odds logistic regression
glmnet	cv.glmnet	lasso with cross-validation
mlbench	BostonHousing	Boston housing data
Matching	lalonde	LaLonde data
nnet	multinom	multinomial logistic regression
quantreg	rq	quantile regression
survival	coxph	Cox proportional hazards regression
	survdiff	log rank test
	survfit	Kaplan–Meier curve

Preface

The importance of studying the linear model

A central task in statistics is to use data to build models, either to make inferences about the underlying data-generating process or to predict future observations. Although real problems are very complex, the linear model can often serve as a good approximation to the true data-generating process. Even when the true data-generating process is nonlinear, the linear model can be a useful approximation if we properly transform the data based on prior knowledge. In highly nonlinear problems, the linear model can still be a useful first attempt in the data analysis process.

Moreover, the linear model has many elegant algebraic and geometric properties. Under the linear model, we can derive many explicit formulas to gain insights about various aspects of statistical modeling. In more complicated models, deriving explicit formulas may be impossible. Nevertheless, we can use the linear model to build intuition and make conjectures about more complicated models.

Pedagogically, the linear model serves as a building block in statistical training. This book builds on my lecture notes for a master’s level “Linear Model” course at UC Berkeley, taught over the past eight years. Most students are master’s students in statistics. Some are undergraduate students with strong technical preparation. Some are Ph.D. students in statistics. Some are master’s or Ph.D. students in other departments. This book requires readers to have basic training in linear algebra, probability theory, and statistical inference.

Recommendations for instructors

This book has twenty-seven chapters in the main text and four chapters as the appendices. As I mentioned before, this book grows out of my teaching of “Linear Model” at UC Berkeley. In different years, I taught the course in different ways, and this book is a union of my lecture notes over the past eight years. Below I make some recommendations for instructors based on my own teaching experience. Since UC Berkeley is on the semester system, instructors on the quarter system should make some adjustments to my recommendations below.

Version 1: a basic linear model course assuming minimal technical preparation

If you want to teach a basic linear model course without assuming strong technical preparation from the students, you can start with the appendices by reviewing basic linear algebra, probability theory, and statistical inference. Then you can cover Chapters LABEL:chapter::ols-1d–LABEL:chapter::interaction. If time permits, you can consider covering Chapter LABEL:chapter::binary-logit due to the importance of the logistic model for binary data.

Version 2: an advanced linear model course assuming strong technical preparation

If you want to teach an advanced linear model course assuming strong technical preparation from the students, you can start with the main text directly. When I did this, I asked my teaching assistants to review the appendices in the first two lab sessions and assigned homework problems from the appendices to remind the students to review the background materials. Then you can cover Chapters LABEL:chapter::ols-1d–LABEL:chapter::sandwich. You can omit Chapter LABEL:chapter::rols and some sections in other chapters due to their technical complications. If time permits, you can consider covering Chapter LABEL:chapter::gee due to the importance of the generalized estimating equation and of its byproduct, the “cluster-robust standard error”, which is important for many social science applications. Furthermore, you can consider covering Chapter LABEL:chapter::survival-analysis due to the importance of the Cox proportional hazards model.

Version 3: an advanced generalized linear models course

If you want to teach a course on generalized linear models, you can use Chapters LABEL:chapter::binary-logit–LABEL:chapter::survival-analysis.

Additional recommendations for readers and students

Readers and students can first read my recommendations for instructors above. In addition, I have three other recommendations.

More simulation studies

This book contains some basic simulation studies. I encourage the readers to conduct more intensive simulation studies to deepen their understanding of the theory and methods.
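
As one possible starting point, the following minimal sketch shows the shape of a small simulation study in base R. It checks that the OLS slope estimate is centered at the true value and that its reported standard error matches its Monte Carlo standard deviation. The data-generating process, sample size, number of replications, and seed are all assumptions chosen for illustration and are not taken from the simulation studies in this book.

```r
# Minimal sketch of a simulation study; every design choice below is an assumption.
set.seed(2024)
n    <- 100       # sample size
nsim <- 1000      # number of Monte Carlo replications
beta <- 2         # true slope
x    <- rnorm(n)  # covariate, held fixed across replications

one_rep <- function() {
  y   <- 1 + beta * x + rnorm(n)             # homoskedastic normal errors
  tab <- summary(lm(y ~ x))$coefficients     # coefficient table of the OLS fit
  c(est = tab["x", "Estimate"], se = tab["x", "Std. Error"])
}
res <- replicate(nsim, one_rep())            # 2 x nsim matrix with rows "est" and "se"

mean(res["est", ])   # close to beta if the estimator is unbiased
sd(res["est", ])     # Monte Carlo standard deviation of the estimator
mean(res["se", ])    # average reported standard error, comparable to the line above
```

Natural extensions are to replace the error term with heteroskedastic or heavy-tailed noise and to see which of the three summaries change.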

Practical data analysis

Box wrote wisely that “all models are wrong but some are useful.” The usefulness of models strongly depends on the applications. When teaching “Linear Model”, I sometimes replaced the final exam with a final project to encourage students to practice data analysis and make connections between the theory and applications.

Homework problems

This book contains many homework problems. It is important to try some homework problems. Moreover, some homework problems contain useful theoretical results. Even if you do not have time to figure out the details for those problems, it is helpful to at least read the statements of the problems.

Omitted topics

Although “Linear Model” is a standard course offered by most statistics departments, it is not entirely clear what we should teach as the field of statistics is evolving. Although I made some suggestions to the instructors above, you may still feel that this book has omitted some important topics related to the linear model.

Advanced econometric models

After the linear model, many econometric textbooks cover instrumental variable models and panel data models. For these more specialized topics, Wooldridge (2010) is a canonical textbook.

Advanced biostatistics models

This book covers the generalized estimating equation in Chapter LABEL:chapter::gee. For analyzing longitudinal data, linear and generalized linear mixed effects models are powerful tools. Fitzmaurice et al. (2012) is a canonical textbook on applied longitudinal data analysis. This book also covers the Cox proportional hazards model in Chapter LABEL:chapter::survival-analysis. For more advanced methods for survival analysis, Kalbfleisch and Prentice (2011) is a canonical textbook.

Causal inference

I intentionally do not cover causal inference in this book. To minimize the overlap of the materials, I wrote another textbook on causal inference (Ding, 2023). However, I did teach a version of “Linear Model” with a causal inference unit after introducing the basics of the linear model and the logistic model. Students seemed to like it because of the connections between statistical models and causal inference.

Features of the book

The linear model is an old topic in statistics. There are already many excellent textbooks on the linear model. This book has the following features.

• This book provides an intermediate-level introduction to the linear model. It balances rigorous proofs and heuristic arguments.
• This book provides not only theory but also simulation studies and case studies.
• This book provides the R code to replicate all simulation studies and case studies.
• This book covers the theory of the linear model related to not only social sciences but also biomedical studies.
• This book provides homework problems with different technical difficulties. The solutions to the problems are available to instructors upon request.

Other textbooks may also have one or two of the above features. This book has the above features simultaneously. I hope that instructors and readers find these features attractive.

Acknowledgments

Many students at UC Berkeley made critical and constructive comments on early versions of my lecture notes. As teaching assistants for my “Linear Model” course, Sizhu Lu, Chaoran Yu, and Jason Wu read early versions of my book carefully and helped me improve it considerably.

Professors Hongyuan Cao and Zhichao Jiang taught related courses based on an early version of the book. They made very valuable suggestions.

I am also very grateful for the suggestions from Nianqiao Ju.

When I was a student, I took a linear model course based on Weisberg (2005). In my early years of teaching, I used Christensen (2002) and Agresti (2015) as reference books. I also sat in on Professor Jim Powell’s econometrics courses and got access to his wonderful lecture notes. They all heavily impacted my understanding and formulation of the linear model.

If you identify any errors, please feel free to email me.