Google Summer of Code 2015 is coming to an end. During this summer, I have learned too many things to list here about statistical modeling, Ruby and software development in general, and I had a lot of fun in the process!
Linear mixed models
My GSoC project is the Ruby gem mixed_models. Mixed models are statistical models which predict the value of a response variable as a result of fixed and random effects. The gem in its current version can be used to fit statistical linear mixed models and perform statistical inference on the model parameters as well as to predict future observations. A number of tutorials/examples in IRuby notebook format are accessible from the
mixed_models github repository.
Linear mixed models are implemented in the class
LMM. The constructor method
LMM#initialize provides a flexible model specification interface, where an arbitrary covariance structure of the random effects terms can be passed as a
Proc or a block.
A convenient user-friendly interface to the basic model fitting algorithm is
LMM#from_formula, which uses the formula language of the R mixed models package
lme4 for model specification. With the
#from_formula method, the user can conveniently fit models with categorical predictor variables, interaction fixed or random effects, as well as multiple crossed or nested random effects, all with just one line of code.
Examples are given in the sections below.
The parameter estimation in
LMM#initialize is largely based on the approach developed by the authors of the R mixed models package
lme4, which is delineated in the
lme4 vignette. I have tried to make the code of the model fitting algorithm in
LMM#initialize easy to read, especially compared to the corresponding implementation in
lme4 code is largely written in C++, which is integrated in R via the packages
RcppEigen. It uses CHOLMOD code for various sparse matrix tricks, and it involves passing pointers to C++ object to R (and vice versa) many times, and passing different R environments from function to function. All this makes the
lme4 code rather hard to read. Even Douglas Bates, the main developer of
lme4, admits that “The end result is confusing (my fault entirely) and fragile”, because of all the utilized performance improvements. I have analyzed the
lme4 code in three blog posts (part 1, part 2 and part 3) before starting to work on my gem
LMM#initialize is written in a more functional style, which makes the code shorter and (I find) easier to follow. All matrix calculations are performed using the gem
nmatrix, which has a quite intuitive syntax and contributes to the overall code readability as well.
The Ruby gem loses with respect to memory consumption and speed in comparison to
lme4, because it is written in pure Ruby and does not utilize any sparse matrix tricks. However, for the same reasons the
mixed_models code is much shorter and easier to read than
lme4. Moreover, the linear mixed model formulation in
mixed_models is a little bit more general, because it does not assume that the random effects covariance matrix is sparse. More about the implementation of
LMM#initialize can be found in this blog post.
Other existing tools
Popular existing software packages for mixed models include the R package
lme4 (which is arguably the standard software for linear mixed models), the R package
nlme (an older package developed by the same author as
lme4, still widely used), Python’s
statmodels, and the Julia package
Below, I give a couple of examples illustrating some of the capabilities of
mixed_models and explore how it compares to the alternatives.
A usage example and discussion
As an example, we use data from the UCI machine learning repository, which originate from blog posts from various sources in 2010-2012, in order to model (the logarithm of) the number of comments that a blog post receives. The linear predictors are the text length, the log-transform of the average number of comments at the hosting website, the average number of trackbacks at the hosting website, and the parent blog posts. We assume a random effect on the number of comments due to the day of the week on which the blog post was published. In
mixed_models this model can be fit with
and we can display some information about the estimated fixed effects with
which produces the following output:
1 2 3 4 5 6
We can also display the estimated random effects coefficients and the random effects standard deviation,
1 2 3 4
1 2 3 4 5 6 7
Interestingly, the estimates of the random effects coefficients and standard deviation are all zero! That is, we have a singular fit. Thus, our results imply that the day of the week on which a blog post is published has no effect on the number of comments that the blog post will receive.
It is worth pointing out that such a model fit with a singular covariance matrix is problematic with the current version of Python’s
statmodels (described as “numerically challenging” in the documentation) and the R package
nlme (“Singular covariance matrices correspond to infinite parameter values”, a mailing list reply by Douglas Bates, the author of
MixedModels.jl can handle singular fits without problems.
In fact, like
lme4 estimates the random effects coefficients and standard deviation to be zero, as we can see from the following R output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
mixed_models is rather slow when applied to such a large data set (
blog_data is a data frame of size 22435×8), especially when compared to
lme4 which uses many sparse matrix tricks and is mostly written in C++ (integrated in R via
Rcpp) to speed up computation. The difference in performance between
lme4 is on the order of hours for large data, and Julia’s
MixedModels.jl promises to be even faster than
lme4. However, there is no noticeable difference in performance speed for smaller data sets.
A second example and statistical inference on the parameter estimates
Often, the experimental design or the data suggests a linear mixed model whose random effects are associated with multiple grouping factors. A specification of multiple random effects terms which correspond to multiple grouping factors is often referred to as crossed random effect, or nested random effects if the corresponding grouping factors are nested in each other.
A good reference on such models is Chapter 2 of Douglas Bates’
mixed_models is particularly well suited for models with crossed or nested random effects. The current release of
statmodels, however, does not support crossed or nested random effects (according to the documentation).
As an example we fit a linear mixed model with nested random effects to a data frame with 100 rows, of the form:
1 2 3 4 5 6 7
We consider the following model:
- We take
yto be the response and
- We consider the factor
bto be nested within the factor
- We assume that the intercept varies due to variable
a; that is, a different (random) intercept term for each level of
- Moreover, we assume that the intercept varies due to the factor
bwhich is nested in
a; that is, different (random) intercept for each combination of levels of
That is, mathematically the model can be expressed as
gamma(a) ~ N(0, phi**2) and
delta(a,b) ~ N(0, psi**2) are normally distributed random variables which assume different realizations for different values of
b, and where
epsilon is a random Gaussian noise term with variance
sigma**2. The goal is to estimate the parameters
We fit this model in
mixed_models, and display the estimated random effects correlation structure with
1 2 3
which produces the output
1 2 3
The correlation between the factor
a and the nested random effect
a_and_b is denoted as
nil, because the random effects in the model at hand are assumed to be independent.
An advantage of
mixed_models over some other tools is the simplicity with which p-values and confidence intervals for the parameter estimates can be calculated using a multitude of available methods. Such methods include a likelihood ratio test implementation, multiple bootstrap based methods (which run in parallel by default), and methods based on the Wald Z statistic.
We can compute five types of 95% confidence intervals for the fixed effects coefficients with the following line of code:
which yields the result
1 2 3 4 5 6
For example, we see here that the intercept term is likely not significantly different from zero. We could proceed now by performing hypotheses tests using
#likelihood_ratio_test, or by refitting a model without an intercept using
We can also test the nested random effect for significance, in order to decide whether we should drop that term from the model to reduce model complexity. We can use a bootstrap based version of likelihood ratio test as follows.
We get a p-value of 9.99e-4, suggesting that we probably should keep the term
(1|a:b) in the model formula.
A third example — a less conventional model fit
Another advantage of
mixed_models against comparable tools is the ease of fitting models with arbitrary covariance structures of the random effects, which are not covered by the formula interface of
lme4. This can be done in a user-friendly manner by providing a block or a
Proc to the
LMM constructor. This unique feature of the Ruby language makes the implementation and usage of the method incredibly convenient. A danger of allowing for arbitrary covariance structures is, of course, that such a flexibility gives the user the freedom to specify degenerate and computationally unstable models.
As an example we look at an application to genetics, namely to SNP data (single-nucleotide polymorphism) with known pedigree structures (family relationships of the subjects). The family information is prior knowledge that we can model in the random effects of a linear mixed effects model.
We model the quantitative trait
y (a vector of length 1200) as
X is a
1200 x 130 matrix containing the genotypes (i.e. 130 SNPs for each of the 1200 subjects);
epsilon is a vector of independent random noise terms with variances equal to
beta is a vector of unknown fixed effects coefficients measuring the contribution of each SNP to the quantitative trait
b is a vector of random effects.
If we denote the kinship matrix by
K, then we can express the probability distribution of
b ~ N(0, delta**2 * 2 * K), where we multiply
2 because the diagonal of
K is constant
0.5, and where
delta**2 is a unknown scaling factor.
The goal is to estimate the unknown parameters
delta, and to determine which of the fixed effects coefficients are significantly different from 0 (i.e. which SNPs are possibly causing the variability in the trait
In order to specify the covariance structure of the random effects, we need to pass a block or
Proc that produces the upper triangular Cholesky factor of the covariance matrix of the random effects from an input Array. In this example, that would be the multiplication of the prior known Cholesky factor of the kinship matrix with a scaling factor.
Having all the model matrices and vectors, we compute the Cholesky factor of the kinship matrix and fit the model with
1 2 3 4 5 6 7 8
Then we can use the available hypotheses test and confidence interval methods to determine which SNPs are significant predictors of the quantitative trait. Out of the 130 SNPs in the model, we find 24 to be significant as linear predictors.
See this blog post for a full analysis of this data with
Room for improvement and future work
Writing the formula language interpretation code used by
LMM#from_formulafrom scratch was not easy. Much of the code can be reorganized to be easier to read and to use in other projects. Possibly, the formula interface should be separated out, similar to how it is done with the Python package patsy. Also, some shortcut symbols (namely
||) in the model specification formula language are currently not implemented.
I plan to add linear mixed models for high-dimensional data (i.e. more predictors than observations) to
mixed_models, because that work would be in line with my current PhD research.
I plan to add generalized linear mixed models capabilities to
mixed_models, which can be used to fit mixed models to discrete data (such as binary or count data).
I want to thank Google and the Ruby Science Foundation for giving me this excellent opportunity! I especially want to thank Pjotr Prins who was my mentor for the project for much helpful advice and suggestions as well as his prompt responses to any of my concerns. I also want to thank my fellow GSoC participants Will, Ivan, and Sameer for their help with certain aspects of my project.