Google Summer of Code 2015 is coming to an end. During this summer, I have learned too many things to list here about statistical modeling, Ruby and software development in general, and I had a lot of fun in the process!
Linear mixed models
My GSoC project is the Ruby gem mixed_models. Mixed models are statistical models which predict the value of a response variable as a result of fixed and random effects. The gem in its current version can be used to fit statistical linear mixed models and perform statistical inference on the model parameters as well as to predict future observations. A number of tutorials/examples in IRuby notebook format are accessible from the mixed_models
github repository.
Linear mixed models are implemented in the class LMM
. The constructor method LMM#initialize
provides a flexible model specification interface, where an arbitrary covariance structure of the random effects terms can be passed as a Proc
or a block.
A convenient userfriendly interface to the basic model fitting algorithm is LMM#from_formula
, which uses the formula language of the R mixed models package lme4
for model specification. With the #from_formula
method, the user can conveniently fit models with categorical predictor variables, interaction fixed or random effects, as well as multiple crossed or nested random effects, all with just one line of code.
Examples are given in the sections below.
Implementation
The parameter estimation in LMM#initialize
is largely based on the approach developed by the authors of the R mixed models package lme4
, which is delineated in the lme4
vignette. I have tried to make the code of the model fitting algorithm in LMM#initialize
easy to read, especially compared to the corresponding implementation in lme4
.
The lme4
code is largely written in C++, which is integrated in R via the packages Rcpp
and RcppEigen
. It uses CHOLMOD code for various sparse matrix tricks, and it involves passing pointers to C++ object to R (and vice versa) many times, and passing different R environments from function to function. All this makes the lme4
code rather hard to read. Even Douglas Bates, the main developer of lme4
, admits that “The end result is confusing (my fault entirely) and fragile”, because of all the utilized performance improvements. I have analyzed the lme4
code in three blog posts (part 1, part 2 and part 3) before starting to work on my gem mixed_models
.
The method LMM#initialize
is written in a more functional style, which makes the code shorter and (I find) easier to follow. All matrix calculations are performed using the gem nmatrix
, which has a quite intuitive syntax and contributes to the overall code readability as well.
The Ruby gem loses with respect to memory consumption and speed in comparison to lme4
, because it is written in pure Ruby and does not utilize any sparse matrix tricks. However, for the same reasons the mixed_models
code is much shorter and easier to read than lme4
. Moreover, the linear mixed model formulation in mixed_models
is a little bit more general, because it does not assume that the random effects covariance matrix is sparse. More about the implementation of LMM#initialize
can be found in this blog post.
Other existing tools
Popular existing software packages for mixed models include the R package lme4
(which is arguably the standard software for linear mixed models), the R package nlme
(an older package developed by the same author as lme4
, still widely used), Python’s statmodels
, and the Julia package MixedModels.jl
.
Below, I give a couple of examples illustrating some of the capabilities of mixed_models
and explore how it compares to the alternatives.
A usage example and discussion
As an example, we use data from the UCI machine learning repository, which originate from blog posts from various sources in 20102012, in order to model (the logarithm of) the number of comments that a blog post receives. The linear predictors are the text length, the logtransform of the average number of comments at the hosting website, the average number of trackbacks at the hosting website, and the parent blog posts. We assume a random effect on the number of comments due to the day of the week on which the blog post was published. In mixed_models
this model can be fit with
1 2 

and we can display some information about the estimated fixed effects with
1


which produces the following output:
1 2 3 4 5 6 

We can also display the estimated random effects coefficients and the random effects standard deviation,
1 2 3 4 

which produces
1 2 3 4 5 6 7 

Interestingly, the estimates of the random effects coefficients and standard deviation are all zero! That is, we have a singular fit. Thus, our results imply that the day of the week on which a blog post is published has no effect on the number of comments that the blog post will receive.
It is worth pointing out that such a model fit with a singular covariance matrix is problematic with the current version of Python’s statmodels
(described as “numerically challenging” in the documentation) and the R package nlme
(“Singular covariance matrices correspond to infinite parameter values”, a mailing list reply by Douglas Bates, the author of nlme
). However, mixed_models
, lme4
and MixedModels.jl
can handle singular fits without problems.
In fact, like mixed_models
above, lme4
estimates the random effects coefficients and standard deviation to be zero, as we can see from the following R output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

Unfortunately, mixed_models
is rather slow when applied to such a large data set (blog_data
is a data frame of size 22435×8), especially when compared to lme4
which uses many sparse matrix tricks and is mostly written in C++ (integrated in R via Rcpp
) to speed up computation. The difference in performance between mixed_models
and lme4
is on the order of hours for large data, and Julia’s MixedModels.jl
promises to be even faster than lme4
. However, there is no noticeable difference in performance speed for smaller data sets.
The full data analysis of the blog post data can be found in this IRuby notebook.
A second example and statistical inference on the parameter estimates
Often, the experimental design or the data suggests a linear mixed model whose random effects are associated with multiple grouping factors. A specification of multiple random effects terms which correspond to multiple grouping factors is often referred to as crossed random effect, or nested random effects if the corresponding grouping factors are nested in each other.
A good reference on such models is Chapter 2 of Douglas Bates’ lme4
book.
Like lme4
, mixed_models
is particularly well suited for models with crossed or nested random effects. The current release of statmodels
, however, does not support crossed or nested random effects (according to the documentation).
As an example we fit a linear mixed model with nested random effects to a data frame with 100 rows, of the form:
1 2 3 4 5 6 7 

We consider the following model:
 We take
y
to be the response andx
its predictor.  We consider the factor
b
to be nested within the factora
.  We assume that the intercept varies due to variable
a
; that is, a different (random) intercept term for each level ofa
.  Moreover, we assume that the intercept varies due to the factor
b
which is nested ina
; that is, different (random) intercept for each combination of levels ofa
andb
.
That is, mathematically the model can be expressed as
1


where gamma(a) ~ N(0, phi**2)
and delta(a,b) ~ N(0, psi**2)
are normally distributed random variables which assume different realizations for different values of a
and b
, and where epsilon
is a random Gaussian noise term with variance sigma**2
. The goal is to estimate the parameters beta_0
, beta_1
, phi
, psi
and sigma
.
We fit this model in mixed_models
, and display the estimated random effects correlation structure with
1 2 3 

which produces the output
1 2 3 

The correlation between the factor a
and the nested random effect a_and_b
is denoted as nil
, because the random effects in the model at hand are assumed to be independent.
An advantage of mixed_models
over some other tools is the simplicity with which pvalues and confidence intervals for the parameter estimates can be calculated using a multitude of available methods. Such methods include a likelihood ratio test implementation, multiple bootstrap based methods (which run in parallel by default), and methods based on the Wald Z statistic.
We can compute five types of 95% confidence intervals for the fixed effects coefficients with the following line of code:
1


which yields the result
1 2 3 4 5 6 

For example, we see here that the intercept term is likely not significantly different from zero. We could proceed now by performing hypotheses tests using #fix_ef_p
or #likelihood_ratio_test
, or by refitting a model without an intercept using #drop_fix_ef
.
We can also test the nested random effect for significance, in order to decide whether we should drop that term from the model to reduce model complexity. We can use a bootstrap based version of likelihood ratio test as follows.
1 2 

We get a pvalue of 9.99e4, suggesting that we probably should keep the term (1a:b)
in the model formula.
A third example — a less conventional model fit
Another advantage of mixed_models
against comparable tools is the ease of fitting models with arbitrary covariance structures of the random effects, which are not covered by the formula interface of lme4
. This can be done in a userfriendly manner by providing a block or a Proc
to the LMM
constructor. This unique feature of the Ruby language makes the implementation and usage of the method incredibly convenient. A danger of allowing for arbitrary covariance structures is, of course, that such a flexibility gives the user the freedom to specify degenerate and computationally unstable models.
As an example we look at an application to genetics, namely to SNP data (singlenucleotide polymorphism) with known pedigree structures (family relationships of the subjects). The family information is prior knowledge that we can model in the random effects of a linear mixed effects model.
We model the quantitative trait y
(a vector of length 1200) as
1


where X
is a 1200 x 130
matrix containing the genotypes (i.e. 130 SNPs for each of the 1200 subjects); epsilon
is a vector of independent random noise terms with variances equal to sigma**2
; beta
is a vector of unknown fixed effects coefficients measuring the contribution of each SNP to the quantitative trait y
; and b
is a vector of random effects.
If we denote the kinship matrix by K
, then we can express the probability distribution of b
as b ~ N(0, delta**2 * 2 * K)
, where we multiply K
by 2
because the diagonal of K
is constant 0.5
, and where delta**2
is a unknown scaling factor.
The goal is to estimate the unknown parameters beta
, sigma
, and delta
, and to determine which of the fixed effects coefficients are significantly different from 0 (i.e. which SNPs are possibly causing the variability in the trait y
).
In order to specify the covariance structure of the random effects, we need to pass a block or Proc
that produces the upper triangular Cholesky factor of the covariance matrix of the random effects from an input Array. In this example, that would be the multiplication of the prior known Cholesky factor of the kinship matrix with a scaling factor.
Having all the model matrices and vectors, we compute the Cholesky factor of the kinship matrix and fit the model with
1 2 3 4 5 6 7 8 

Then we can use the available hypotheses test and confidence interval methods to determine which SNPs are significant predictors of the quantitative trait. Out of the 130 SNPs in the model, we find 24 to be significant as linear predictors.
See this blog post for a full analysis of this data with mixed_models
.
Room for improvement and future work
Writing the formula language interpretation code used by
LMM#from_formula
from scratch was not easy. Much of the code can be reorganized to be easier to read and to use in other projects. Possibly, the formula interface should be separated out, similar to how it is done with the Python package patsy. Also, some shortcut symbols (namely*
,/
, and
) in the model specification formula language are currently not implemented.I plan to add linear mixed models for highdimensional data (i.e. more predictors than observations) to
mixed_models
, because that work would be in line with my current PhD research.I plan to add generalized linear mixed models capabilities to
mixed_models
, which can be used to fit mixed models to discrete data (such as binary or count data).
Acknowledgement
I want to thank Google and the Ruby Science Foundation for giving me this excellent opportunity! I especially want to thank Pjotr Prins who was my mentor for the project for much helpful advice and suggestions as well as his prompt responses to any of my concerns. I also want to thank my fellow GSoC participants Will, Ivan, and Sameer for their help with certain aspects of my project.