ASReml R Tutorial
ASReml fits the general linear mixed model and estimates variance components by residual maximum likelihood (REML). It is designed for quite large data sets with complex variance models, and it extends the range of variance models available for the analysis of experimental data. Typical applications include:
• balanced and unbalanced longitudinal data;
• repeated measures data (multivariate analysis of variance and spline-type models);
• balanced and unbalanced designed experiments;
• multi-environment trials and meta-analysis;
• univariate and multivariate animal breeding and genetics data (involving a relationship matrix for correlated effects);
• regular or irregular spatial data.
ASReml has an extensive command language designed for use in batch mode, and a Windows interface helps with the preparation of input files and the viewing of text and graphics output. It is available for Windows, Macintosh and Linux systems as an R package or as a standalone program.
A Practical Guide to Mixed Models in R

Preface

I created this guide so that students can learn about important statistical concepts while remaining firmly grounded in the programming required to use statistical tests on real data. I want this to be a guide students can keep open in one window while running R in another, because it is directly relevant to their work. In that spirit of openness and relevance, note that I created this guide in R v 3.1.0 and used the following packages:

• car v 2.0
• MASS v 7.3
• lme4 v 1.1
• mlmRev v 1.0
• agridat v 1.8
• MCMCglmm v 2.19
• ggplot2 v 0.9.3.1
• scapeMCMC v 1.1

1. Is a mixed model right for your needs?

A mixed model is similar in many ways to a linear model. It estimates the effects of one or more explanatory variables on a response variable. The output of a mixed model will give you a list of explanatory variables, estimates and confidence intervals of their effect sizes, p-values for each effect, and at least one measure of how well the model fits.
You should use a mixed model instead of a simple linear model when you have a variable that describes your data sample as a subset of the data you could have collected. What do I mean by that? Let's take a look at some preliminary data from a paper wasp kin recognition project I'm working on.

##   Test.ID Observer Relation Aggression Tolerance Season
## 1       1  Charles     Same          4         4  Early
## 2       2    Tyler     Same          1        34  Early
## 3       3 Michelle     Same         15        14  Early
## 4       4    Tyler     Same          2        31  Early
## 5       5  Charles     Same          1         4  Early
## 6       6    Rhyan     Same          0        13  Early

My response variables of interest are Aggression and Tolerance. Aggression is the number of aggressive behaviors in a sixty-minute period. Tolerance is the number of tolerant behaviors in a sixty-minute period. I am interested in the effects of relation (whether the wasps came from the same or different colonies) and season (early or late in the colony cycle) on these response variables.
These effects are 'fixed' because no matter where, how, or how many wasps I had sampled, I would still have had the same levels of the same variables: same colony vs. different colony, and early season vs. late season.
There are two other variables, though, that would not remain fixed between samples. Observer represents the person who scored the interaction, and Test.ID represents the group of three wasps observed for sixty minutes. The levels of Observer would be different if I had sampled in a different year, because different undergraduate volunteers would be available to observe behavior. The levels of Test.ID would also vary between samples, because I could always rearrange which wasps participate in each experimental trial. Each trial is a unique sub-sample of the wasps I collected at that time.
If I had been able to test the wasps individually, and if all observers had scored all interactions, I wouldn't have any random effects. But instead, my data are inherently 'lumpy,' and the random effects describe that lumpiness.
Mixed models allow us to account for the lumpiness of data. Before you proceed, you will also want to think about the structure of your random effects. Are your random effects nested or crossed? In the case of my study, the random effects are nested, because each observer recorded a certain number of trials, and no two observers recorded the same trial, so here Test.ID is nested within Observer. But say I had collected wasps that clustered into five different genetic lineages. The 'genetics' random effect would have nothing to do with observer or arena; it would be orthogonal to these other two random effects.
Therefore this random effect would be crossed with the others.

2. What probability distribution best fits your data?

Say you've decided you want to run a mixed model. The next thing you have to do is find a probability distribution that best fits your data. There are many ways to test this, but I'll show you one. This method requires the packages 'car' and 'MASS'. Note that the negative binomial and gamma distributions can only handle positive numbers, and the Poisson distribution can only handle positive whole numbers.
The binomial and Poisson distributions are different from the others because they are discrete rather than continuous, which means they quantify distinct, countable events or the probability of these events. Now let's find a fitting distribution for my Aggression variable. Check out the plots I've generated using qqp.
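The plots come from calls along these lines (a sketch only: I'm assuming the wasp data live in a data frame called wasps):

```r
library(car)   # for qqp()
library(MASS)  # for fitdistr()

aggr <- wasps$Aggression  # assumed data frame and column names

qqp(aggr, "norm")   # normal
qqp(aggr, "lnorm")  # lognormal

# For the negative binomial (and likewise Poisson and gamma), estimate the
# distribution's parameters with fitdistr() first, then hand them to qqp().
# Gamma and lognormal fits require strictly positive values.
nbinom <- fitdistr(aggr, "Negative Binomial")
qqp(aggr, "nbinom",
    size = nbinom$estimate[["size"]],
    mu   = nbinom$estimate[["mu"]])
```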
The y axis represents the observations and the x axis represents the quantiles modeled by the distribution. The solid red line represents a perfect distribution fit and the dashed red lines are the confidence intervals of the perfect distribution fit. You want to pick the distribution for which the largest number of observations falls between the dashed lines.
In this case, that's the lognormal distribution, in which only one observation falls outside the dashed lines. Now, armed with the knowledge of which probability distribution fits best, I can try fitting a model.
3. How to fit a mixed model to your data

3a. If your data are normally distributed

First, a note: if your data best fit the lognormal distribution, do not transform them. The same goes for any other transformation you might apply to your data to make them normal.
If you can transform your data to normality, common wisdom says you should use the transformed data. More recent statistical literature has entirely changed its stance on this matter, however, because transformation makes interpretation of model results more difficult, and it makes mischief with the variance of the transformed variable.
Even if your data are transformable to normality, they are still not normal, and you should move on to the next section. If your data are normally distributed, your life will be a little easier, because you can use a linear mixed model (LMM).
You will want to load the lme4 package and make a call to the function lmer. The first argument to the function is a formula of the form y ~ x1 + x2 + ..., where y is the response variable and x1, x2, etc. are explanatory variables. Random effects are added in with the explanatory variables.
Crossed random effects take the form (1 | r1) + (1 | r2), while nested random effects take the form (1 | r1/r2). The next argument is where you designate the data frame your variables come from. The argument after that is an important one: it is where you designate whether the mixed model will estimate the parameters using maximum likelihood or restricted maximum likelihood (REML). If your random effects are nested, or you have only one random effect, and if your data are balanced (i.e., similar sample sizes in each factor group), set REML to FALSE, because you can use maximum likelihood. If your random effects are crossed, don't set the REML argument, because it defaults to TRUE anyway.
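Schematically, with placeholder names (this is a sketch of the syntax, not real data), the calls look like this:

```r
library(lme4)

# Crossed random effects; REML left at its default of TRUE
m_crossed <- lmer(y ~ x1 + x2 + (1 | r1) + (1 | r2), data = mydata)

# Nested random effects (r2 nested within r1), fit by maximum likelihood
m_nested <- lmer(y ~ x1 + x2 + (1 | r1/r2), data = mydata, REML = FALSE)
```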
Lest this all seem too abstract, let's try this with some data. We will use some data from my very first published study about song in superb starlings.
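The model summarized below was fit with a call along these lines (reconstructed from the summary output, so treat it as a sketch rather than the exact original code):

```r
library(lme4)

# Maximum likelihood fit: one (nested-free) random effect, balanced-ish design
lmm <- lmer(Mean.Pitch ~ Sex + Social.Rank + (1 | Group),
            data = starlings, REML = FALSE)
summary(lmm)
```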
In this study, we were interested in sexual dimorphism, the differences between male and female song, and whether birds of different social ranks, helper and breeder, sang differently. Our random effect was social group. Mean pitch of song fits a normal probability distribution.

## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Mean.Pitch ~ Sex + Social.Rank + (1 | Group)
##    Data: starlings
##
##      AIC      BIC   logLik deviance df.resid
##    389.3    396.0   -189.7    379.3       23
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -2.0559 -0.6272  0.0402  0.5801  2.0110
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Group    (Intercept)     0        0
##  Residual             44804      212
## Number of obs: 28, groups: Group, 5
##
## Fixed effects:
##                   Estimate Std. Error t value
## (Intercept)         3099.0       82.2    37.7
## SexM                  51.7       81.3     0.6
## Social.RankHelper    -45.0       82.4    -0.5
##
## Correlation of Fixed Effects:
##             (Intr) SexM
## SexM        -0.630
## Scl.RnkHlpr -0.668  0.106

Let's look at these results. First we get some measures of model fit, including AIC, BIC, log likelihood, and deviance. Then we get an estimate of the variance explained by the random effect.
This number is important, because if it's indistinguishable from zero, then your random effect probably doesn't matter and you can go ahead and do a regular linear model instead. Next we have estimates of the fixed effects, with standard errors.
This information may be enough for you. Some journals like you to report the results of these models as effect sizes with confidence intervals. Certainly when I look at the fixed effects estimate, I can already tell that mean pitch does not differ between sexes and social ranks.
But some journals want you to report p-values. The creators of the lme4 package are philosophically opposed to p-values, for reasons I'll not go into here, so if you want some p-values you'll have to turn to the Anova function in the car package.

## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: Mean.Pitch
##             Chisq Df Pr(>Chisq)
## Sex           0.4  1       0.52
## Social.Rank   0.3  1       0.58

The Anova function does a Wald test, which tells us how confident we are of our estimate of the effect of sex and social rank on pitch, and the p-values tell me that I should not be confident at all.
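For reference, a table like the one above comes from a call of this form (assuming the fitted model object is called lmm, as in the sketch earlier):

```r
library(car)
Anova(lmm)  # Type II Wald chi-square tests of the fixed effects
```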
There is one complication you might face when fitting a linear mixed model. R may throw you a 'failure to converge' error, which is usually phrased 'iteration limit reached without convergence.' That means your model has too many factors and not a big enough sample size, and cannot be fit. Unfortunately, I don't have any data on hand that actually fail to converge, but let's pretend that last model didn't converge. What you should then do is drop fixed effects and random effects from the model, one at a time, and compare the resulting models to see which fits best. Hold the fixed effects constant and drop random effects one at a time to find what works best.
Then hold random effects constant and drop fixed effects one at a time. Here I have only one random effect, but I'll show you by example with fixed effects.

## Data: starlings
## Models:
## nofixedlmm: Mean.Pitch ~ 1 + (1 | Group)
## noranklmm: Mean.Pitch ~ Sex + (1 | Group)
## nosexlmm: Mean.Pitch ~ Social.Rank + (1 | Group)
##            Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## nofixedlmm  3 386 390   -190      380
## noranklmm   4 388 393   -190      380  0.48      1       0.49
## nosexlmm    4 388 393   -190      380  0.00      0       1.00

Note that this anova function is not the same as the Anova function we used to evaluate the significance of fixed effects in the model. This anova function with a lowercase 'a' is for comparing models.
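Reconstructed from the formulas listed in the comparison table above, the calls would have been roughly:

```r
library(lme4)

noranklmm  <- lmer(Mean.Pitch ~ Sex + (1 | Group), data = starlings, REML = FALSE)
nosexlmm   <- lmer(Mean.Pitch ~ Social.Rank + (1 | Group), data = starlings, REML = FALSE)
nofixedlmm <- lmer(Mean.Pitch ~ 1 + (1 | Group), data = starlings, REML = FALSE)

anova(nofixedlmm, noranklmm, nosexlmm)  # lowercase anova() compares the models
```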
The p values indicate that there are no ground-shakingly important differences between the models. We can also compare the AIC values and note that the model with the lowest AIC value is the one with no fixed effects at all, which fits with our understanding that sex and social rank have no effect on song pitch. Whatever approach you take, be sure to report in your manuscript the p values or AIC values you used to pick the best model.

3b. If your data are not normally distributed

This is where life gets interesting. You see, the REML and maximum likelihood methods for estimating the effect sizes in the model make assumptions of normality that don't apply to your data, so you have to use a different method for parameter estimation. The problem is that there are many alternative estimation methods, each run from a different R package, and it can be hard to decide which one suits your situation.
I will guide you through this decision process with examples. First, we need to test whether we can use penalized quasilikelihood (PQL) or not. PQL is a flexible technique that can deal with non-normal data, unbalanced designs, and crossed random effects. However, it produces biased estimates if your response variable fits a discrete count distribution, like Poisson or binomial, and the mean is less than 5, or if your response variable is binary. The Aggression variable fits the lognormal distribution, which is not a discrete distribution.
That means we can proceed with the PQL method. But before we proceed, let's return to the matter of transformation to normality.
The reason we want to use a GLMM for this is that, thinking in terms of expectations E(x), E(ln(x)) is not the same as ln(E(x)). The former is what you model when you perform an LMM on a log-transformed variable, while the latter is what you model when you perform a GLMM with a log link on the untransformed variable. The latter is better because it better captures the variance of x.
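Here is roughly what that looks like for the Aggression data (a sketch: the data frame name wasps and the exact fixed-effect structure are my assumptions, not code from the study):

```r
library(MASS)

PQL <- glmmPQL(Aggression ~ Relation + Season,
               random  = ~ 1 | Observer/Test.ID,  # Test.ID nested within Observer
               family  = gaussian(link = "log"),  # lognormal-style model via a log link
               data    = wasps,
               verbose = FALSE)
summary(PQL)
```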
Make sure you have the MASS package loaded. Note that instead of taking all the fixed and random effects as one formula, the random effects get their own argument in the glmmPQL function. To set the distribution to lognormal, we set the family to gaussian (another word for normal) and the link to log. The link can be anything, though if you want to use something besides log or inverse then you'll have to research how to customize the link function yourself.

What if PQL is off the table, for example because your response variable is binary? Let's look at a different data set: bdf, from the mlmRev package, which describes Dutch schoolchildren.

## 'data.frame':   2287 obs. of  4 variables:
##  $ schoolNR: Factor w/ 131 levels "1","2","10","12",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Minority: Factor w/ 2 levels "N","Y": 1 2 1 1 1 2 2 1 2 2 ...
##  $ ses     : num  23 10 15 23 10 10 23 10 13 15 ...
##  $ repeatgr: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ...

Let's say we want to find out whether membership in a racial minority and socioeconomic status affect students' likelihood to repeat a grade. Our response variable is 'repeatgr', a binary response indicating whether the student repeated a grade or not. Racial minority status is a binary Y/N category and socioeconomic status is represented by 'ses', a numeric scale that ranges from 10 to 50, where 50 is the richest. Our random factor is 'schoolNR', which represents the schools from which the students were sampled. Because the response variable is binary, we will need a generalized linear mixed model with a binomial distribution, and because we have fewer than five random effects, we can use the Laplace approximation.
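The two summaries shown a little further down were produced by calls along these lines (the formula is taken from the output; everything else is a reasonable guess):

```r
library(lme4)
library(mlmRev)  # provides the bdf data set

# Laplace approximation (glmer's default, nAGQ = 1)
glmm_laplace <- glmer(repeatgr ~ Minority + ses + ses * Minority + (1 | schoolNR),
                      data = bdf, family = binomial(link = "logit"))

# Adaptive Gauss-Hermite quadrature with 25 quadrature points
glmm_ghq <- glmer(repeatgr ~ Minority + ses + ses * Minority + (1 | schoolNR),
                  data = bdf, family = binomial(link = "logit"), nAGQ = 25)

summary(glmm_ghq)
summary(glmm_laplace)
```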
Strictly speaking, the Laplace approximation is a special case of a parameter estimation method called Gauss-Hermite quadrature (GHQ), with one iteration. GHQ is more accurate than Laplace due to repeated iterations, but becomes less flexible after the first iteration, so you can only use it for one random effect. We can use it in this example because our only random effect is 'schoolNR.' To go ahead with this method, we use the lme4 package again.

## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 25) [glmerMod]
##  Family: binomial ( logit )
## Formula: repeatgr ~ Minority + ses + ses * Minority + (1 | schoolNR)
##    Data: bdf
##
##      AIC      BIC   logLik deviance df.resid
##   1672.8   1701.4   -831.4   1662.8     2282
##
## Scaled residuals:
##    Min     1Q Median     3Q    Max
## -0.964 -0.407 -0.314 -0.221  5.962
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  schoolNR (Intercept) 0.264    0.514
## Number of obs: 2287, groups: schoolNR, 131
##
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -0.45194    0.20763   -2.18     0.03 *
## MinorityY      0.47957    0.47345    1.01     0.31
## ses           -0.06205    0.00798   -7.78  7.5e-15 ***
## MinorityY:ses  0.01196    0.02289    0.52     0.60
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##             (Intr) MnrtyY ses
## MinorityY   -0.400
## ses         -0.905  0.369
## MinortyY:ss  0.299 -0.866 -0.321

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial ( logit )
## Formula: repeatgr ~ Minority + ses + ses * Minority + (1 | schoolNR)
##    Data: bdf
##
##      AIC      BIC   logLik deviance df.resid
##   1672.8   1701.5   -831.4   1662.8     2282
##
## Scaled residuals:
##    Min     1Q Median     3Q    Max
## -0.960 -0.407 -0.316 -0.222  5.950
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  schoolNR (Intercept) 0.258    0.508
## Number of obs: 2287, groups: schoolNR, 131
##
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -0.45456    0.20611   -2.21    0.027 *
## MinorityY      0.48005    0.47121    1.02    0.308
## ses           -0.06191    0.00791   -7.83  4.9e-15 ***
## MinorityY:ses  0.01194    0.02274    0.53    0.600
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##             (Intr) MnrtyY ses
## MinorityY   -0.400
## ses         -0.906  0.369
## MinortyY:ss  0.299 -0.866 -0.321

You can see there's no important difference between Laplace and GHQ in this case.
Both show that socioeconomic class has a highly significant effect on students' likelihood to repeat a grade, though even with the logit transformation we can see that the effect size is small. There is one more consideration, though, when using this method. It becomes inaccurate when used on overdispersed data - that is, when the sum of the squared residuals is much larger than the residual degrees of freedom. When you use this method, you should check the model to make sure the data are not overdispersed. Here is some code that will help you with that.
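This is one common version of such a check, adapted from the approach in Ben Bolker's GLMM FAQ (see Resources), applied here to the model object from the sketch above (glmm_laplace); treat it as illustrative rather than definitive:

```r
overdisp_fun <- function(model) {
  rdf  <- df.residual(model)                  # residual degrees of freedom
  rp   <- residuals(model, type = "pearson")  # Pearson residuals
  chsq <- sum(rp^2)                           # sum of squared residuals
  data.frame(chisq = chsq,
             ratio = chsq / rdf,              # much greater than 1 suggests overdispersion
             rdf   = rdf,
             p     = pchisq(chsq, df = rdf, lower.tail = FALSE))
}
overdisp_fun(glmm_laplace)
```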
Now for a different example. Here is the structure of a data set on barley harvests:

## 'data.frame':   102 obs. of  6 variables:
##  $ year    : int  1901 1901 1901 1901 1902 1902 1902 1902 1902 1902 ...
##  $ farmer  : Factor w/ 18 levels "Allardyce","Dooley",..: 10 5 3 18 10 5 18 17 4 12 ...
##  $ place   : Factor w/ 16 levels "Arnestown","Bagnalstown",..: 3 16 14 11 3 16 11 4 8 6 ...
##  $ district: Factor w/ 4 levels "CentralPlain",..: 2 2 1 1 2 2 1 1 4 4 ...
##  $ gen     : Factor w/ 2 levels "Archer","Goldthorpe": 1 1 1 1 1 1 1 1 1 1 ...
##  $ profit  : num  1 1 1 1 1 1 1 1 1 1 ...

In this case, let's say we have no explanatory variables at all. We don't have any ideas about what fixed effects might influence whether a barley harvest turns a profit or not. We're interested in which random effects contribute to the variability of profit.
After all, random effects are factors that change the variance of a response variable; sometimes we're trying to account for that variance to make the fixed effects clearer, but sometimes we're interested in those variances for their own sake. To do this, we will use MCMCglmm, which not only handles many random effects but also provides confidence intervals for the random effects, which none of the other packages we've used here provide in their summary (though in lme4 you can use confint() on a fitted model to achieve the same end). The confusing part about MCMCglmm is that it is a Bayesian statistical method. All models make assumptions about the distribution of the variance in your data, but in a Bayesian method these assumptions are explicit, and we need to specify these assumed distributions. In Bayesian statistics, we call these priors. The priors I've used here are bog-standard and will work for most data.
Feel free to use them yourself. Just keep in mind that one G structure needs to be specified for each random effect, while the R structure specifies the prior for the residual variance.
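For concreteness, a bog-standard prior specification and call might look something like this. This is an illustrative sketch only: the data frame name barley and the exact set of random effects are my assumptions, not the original analysis.

```r
library(MCMCglmm)

barley$year <- factor(barley$year)  # treat year as a grouping factor

# One G element per random effect, plus the R structure for the residual variance
prior <- list(R = list(V = 1, nu = 0.002),
              G = list(G1 = list(V = 1, nu = 0.002),   # gen
                       G2 = list(V = 1, nu = 0.002),   # year
                       G3 = list(V = 1, nu = 0.002),   # farmer
                       G4 = list(V = 1, nu = 0.002)))  # place

mcmc_mod <- MCMCglmm(profit ~ 1,                       # no fixed effects, intercept only
                     random  = ~ gen + year + farmer + place,
                     family  = "categorical",          # binary response
                     prior   = prior,
                     data    = barley,
                     verbose = FALSE)
summary(mcmc_mod)
```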
Also remember my caution about the lognormal distribution: these priors may not play nicely with data modeled with a log link, so do some research on what priors to use for data on a log scale.

One last piece of advice: plot your raw data before you model it. Here I split up my data by season and relation, my two fixed effects. We can see right away that the dataset contains an extreme positive outlier; by far most of the observations fall between 0 and 20, and there's an outlier or two throwing it off. This is good to know.
We can also see that a high proportion of the late season observations are equal to zero. Plotting is also important for assessing model fit. You can tell which model you fit does the best job describing the data by plotting the fitted values in various ways. One easy application is graphing the residuals of a model. If you imagine a model as a best-fit line going through the scatterplot of your data, the residuals are the distances of the points in the scatterplot from the best-fit line.
If the model fits, then if you plot residuals against the fitted values, you should see random scatter. If the scatter is not random that means there's some other random or fixed effect that explains variation in your data that you're missing. Let's try plotting the residuals of the mixed model I fit for song pitch in superb starlings.
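One way to produce the plot described below (a sketch, assuming the starling model object is called lmm as in the earlier example):

```r
plot(fitted(lmm), residuals(lmm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                       # dashed line: zero deviation
lines(lowess(fitted(lmm), residuals(lmm)))   # solid line: smoothed residual trend
```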
What this plot does is create a dashed horizontal line representing zero: an average of zero deviation from the best-fit line. It also creates a solid line that represents the residual deviation from the best-fit line. I want the solid line to overlay the dashed line.

Back to the barley profit models: the plot comparing their estimates is reassuring because the estimates are very similar between the two models (though the estimate for year is a little lower in the second), but the confidence interval for year is markedly smaller in the second model, which means we can be more confident about this estimate. We can feel pretty good about our inferences from the second model, i.e., that genotype and year are the main contributors to variability. Our brains are best at detecting patterns when they are presented visually, so plot your data and your models whenever you can.
Learn to use the base, lattice, or ggplot2 package and it will serve you well for years to come.

Resources

• Bates, D. lme4: Mixed-effects modeling with R.
• The GLMM FAQ, by Ben Bolker (Professor of Botany and Zoology, University of Florida).
• Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J.-S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127–135.
• Hilborn, R., & Mangel, M. (1997). The ecological detective: confronting models with data. Princeton University Press.
• Lindsey, J. K., & Jones, B. (1998). Choosing among generalized linear models applied to medical data. Statistics in Medicine, 17(1), 59–68.