RATTACA

RATTACA (RAT Trait Ascertainment using Common Alleles) is a new research service provided by the NIDA Center for GWAS in Outbred Rats. RATTACA uses genetic prediction to identify Heterogeneous Stock (HS) rats expected to exhibit extreme addiction-related traits, which we provide to researchers for behavioral study. New HS rats are available for prediction with each generation produced by the HS West colony at UC San Diego, four times per year.

Are you interested in predicting behavioral phenotypes (and more) in HS rats? Browse our trait list and contact Dr. Palmer for more info!

For technical details, check this bioRxiv preprint. Want a digestible summary? Check out this pretty poster:

What is RATTACA for?

Genetic prediction enables us to estimate expected phenotypes in HS rats without the need for experimental measurement. These predictions guide the selection of phenotypically divergent samples, enabling unique study designs that can identify causal genetic mechanisms shared between traits.


Take, for example, a study of the correlation between cocaine addiction and gene expression. A common traditional study design involves measuring one trait (e.g., cocaine self-administration), identifying divergent individuals (e.g., high and low cocaine addiction), and then measuring a second trait (gene expression) that is hypothesized to be genetically correlated with the first. Unfortunately, observed correlations in selected or inbred lines often exist simply due to population structure and genetic drift, not necessarily due to a shared genetic basis for the two traits. RATTACA avoids these confounds by identifying correlations between measured and predicted phenotypes drawn from a single outbred population, effectively selecting phenotypically distinct samples without population effects.


Critically, RATTACA predictions can be made in behaviorally naïve rats. A fundamental problem in studying addiction behavior is the confounding effects introduced by experimental exposure to drugs. For example, observed expression differences between high and low cocaine addiction may reflect either differences in genetic risk for developing addiction-like traits, or differences in drug exposure between the two groups. Did the pre-existing genetic differences that caused rats to vary in drug usage also cause the observed differences in gene expression, or did exposure to different amounts of cocaine cause the observed differences? Disentangling these alternatives is extremely difficult using traditional experimental approaches. RATTACA avoids this fundamental problem. We can use RATTACA to identify naïve HS rats that are genetically predisposed to be divergent for drug self-administration (and many other traits outside of this example). Subsequent observed differences in gene expression between high- and low-predicted rats that have never been exposed to cocaine would necessarily be due to genetic predisposition.


Samples predicted via RATTACA to have either high or low trait values are defined and differentiated by their genotypes, meaning individuals from one sample must be more closely related to each other than they are to the alternative sample (on average). Because RATTACA samples are drawn from a single population, this differentiation cannot be attributed to confounding demographic factors that arise between separately bred laboratory strains. Therefore, if a pair of RATTACA samples is observed to differ for a second trait (not the trait used for prediction), this observed correlation is likely to have arisen due to a causative genetic architecture shared with the predicted trait. All trait correlations observed between RATTACA samples are thus evidence for genetic correlation.

What phenotypes can RATTACA predict?

We can predict any trait for which we have GWAS results from HS rats, which we use to train our prediction models. Prediction performance (and thus, the quality of selected samples) depends heavily on (1) the number of rats in the training sample and (2) the heritability of the trait to predict. As a general rule, we expect RATTACA predictions to reliably succeed for higher-heritability traits (h² > 0.2) with training sample sizes greater than 300 individuals, and for lower-heritability traits (h² = 0.1–0.2) with samples greater than 600. Browse the table for a list of traits available for prediction. Note that for many of these traits studies are ongoing, so their sample sizes are still growing.
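
As a rough illustration, the rule of thumb above can be written as a small check. The function name and the handling of boundary values (for example, treating a heritability below 0.1 as not covered by the rule) are assumptions made for this sketch, not part of the RATTACA pipeline.

```python
def expected_to_predict_reliably(h2: float, n_training: int) -> bool:
    """Apply the general rule of thumb stated in the text.

    h2          -- estimated heritability of the trait
    n_training  -- number of rats in the training sample
    """
    if h2 > 0.2:
        # Higher-heritability traits: more than 300 training individuals.
        return n_training > 300
    if 0.1 <= h2 <= 0.2:
        # Lower-heritability traits: more than 600 training individuals.
        return n_training > 600
    # Assumption: traits with h2 below 0.1 fall outside the stated rule.
    return False


# Hypothetical traits (made-up values) checked against the rule.
for name, h2, n in [("trait_A", 0.25, 350), ("trait_B", 0.15, 350)]:
    print(name, expected_to_predict_reliably(h2, n))
```

A trait that fails this check today may pass later, since training sample sizes grow as ongoing studies add rats.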


RATTACA can also use eQTL data to predict gene expression in HS rats. We have data for 4,086 genes for which the heritability of gene expression is >0.1 and sample size is 330. Of these, 1,309 genes have heritability >0.3. Unlike behavior, where we use the whole genome for prediction, prediction of gene expression is based on just one or a few SNPs near the gene, which are detected as cis-eQTL SNPs. Comparisons of groups with predicted high or low expression of a gene would be analogous to comparing knock-down and over-expressing mutant lines, but without the labor, cost, and time needed to make such mutant lines. Another possible application is to predict the expression of many genes in order to estimate general gene expression profiles. This approach could be used for genetic correlation as outlined above or to recreate an observed gene expression pattern, such as following treatment with a pharmacological agent. Genes available for prediction can be found at Rat TWAS Hub.
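
To make the contrast with whole-genome prediction concrete, here is a minimal sketch of expression prediction from one or a few cis-eQTL SNPs, assuming a simple additive model in which each SNP contributes its effect weight times the allele dosage (0, 1, or 2). The function name, rat IDs, and the weight of 0.5 are hypothetical values chosen for illustration; real weights would come from the eQTL analysis.

```python
def predict_expression(dosages, weights, intercept=0.0):
    """Additive prediction: intercept + sum(weight_j * dosage_j).

    dosages -- allele counts (0, 1, or 2) at each cis-eQTL SNP for one rat
    weights -- per-allele effect of each SNP on expression (assumed known)
    """
    if len(dosages) != len(weights):
        raise ValueError("need one weight per SNP")
    return intercept + sum(w * d for w, d in zip(weights, dosages))


# Hypothetical gene with a single cis-eQTL SNP and a made-up effect of 0.5.
weights = [0.5]
rats = {"rat_A": [0], "rat_B": [1], "rat_C": [2]}
predicted = {rat: predict_expression(d, weights) for rat, d in rats.items()}
print(predicted)
```

With a single SNP, the predicted groups correspond directly to the genotype classes at that SNP, which is why such comparisons resemble contrasts between mutant lines.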


Finally, we have identified 403 genes in the HS rat population that are annotated as having “loss of function” mutations, and 808 genes with structural variants that either delete or duplicate the entire gene. These mutations can be predicted with almost 100% accuracy. RATTACA can thus identify groups of rats that are equivalent to “knock outs” or “wild types” for these genes, or in the case of duplications, equivalent to “over expressers” and “wild types.” We will publicize lists of these genes and expect that some of them will be of interest to the scientific community. As we finalize these results you may contact Dr. Palmer for information.

How does RATTACA operate?

Researchers may request a desired trait and sample design for prediction at any time (e.g., 10 high and 10 low locomotor total distance). Every three months, we produce a new generation of 400-500 HS rats in the HS West colony at UC San Diego. When these pups are weaned, we sample ear punch tissues for rapid DNA extraction, low-coverage whole-genome sequencing, and genome-wide SNP genotyping. We then use SNP genotypes to predict trait values for requested phenotypes using G-BLUP (details below). We train prediction models using all phenotypic and genotypic data available for a given trait in our database, then use model predictions to identify current-generation rat pups with predicted extreme trait values. These pups are selected and shipped to researchers for downstream experimentation. When processing multiple requests (predictions for multiple traits and/or multiple projects), we maximize the difference between selected samples across all projects to ensure sufficient phenotypic differentiation for all traits.
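
The selection step for a single request can be sketched as follows, assuming predicted trait values are already available for each pup. The function name, rat IDs, and values are hypothetical, and the sketch does not attempt the joint optimization across multiple traits and projects described above.

```python
def select_extremes(predicted, n_high, n_low):
    """Return (high, low) lists of rat IDs with the most extreme predictions.

    predicted -- dict mapping rat ID to predicted trait value
    n_high    -- number of rats requested for the high group
    n_low     -- number of rats requested for the low group
    """
    if n_high + n_low > len(predicted):
        raise ValueError("requested more rats than are available")
    ranked = sorted(predicted, key=predicted.get)  # lowest prediction first
    low = ranked[:n_low]
    # Take the top n_high and list them highest first.
    high = ranked[len(ranked) - n_high:][::-1]
    return high, low


# Made-up predictions for five pups from one generation.
predicted = {"r1": -1.2, "r2": 0.3, "r3": 2.1, "r4": -0.4, "r5": 1.0}
high, low = select_extremes(predicted, n_high=2, n_low=2)
print("high:", high, "low:", low)
```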

What is the statistical basis for RATTACA?

RATTACA relies on phenotype and genotype data that we have already collected in HS rats. Without the data that have been accumulated over the prior 10 years, RATTACA would not be possible. While RATTACA is highly innovative for rodent genetics and for many of our intended applications, the statistical basis for RATTACA builds on fundamental techniques (G-BLUP) that have been used for decades in agricultural and human genetics. Therefore, the statistical basis for this approach is very well established.


To make predictions, we use the R package rrBLUP v4.6.2. This method predicts complex traits using marker data as a random effect in a linear mixed model. This marker-based method calculates best linear unbiased predictor (BLUP) solutions for individual SNP effects on a given sampled phenotype. This approach contrasts with genome-wide association studies, in which the goal is often to identify a small set of significant genomic loci. Rather, G-BLUP uses a representative sample of genome-wide SNPs, which are presumed to be a mixture of causative and non-causative loci. This method is robust to the inclusion of both true and false positive predictors, and performs best when sufficient SNPs of small effect size are used for prediction. These BLUP predictions sum to produce genomic-estimated breeding values (GEBVs, also known as polygenic scores) for each individual, and these scores identify HS rats with high or low predicted phenotypes.


In our approach, we model predictions on phenotype data processed to remove significant variance introduced by experimental covariates (e.g., sex, cohort, cage, and age). As a result, predicted phenotypes are not returned in the original (and more intuitive) unit of measurement, but the predictions identify individuals with the greatest genetic contribution to the phenotype, which is the key variable of interest when evaluating genetic correlations between traits.
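
The marker-based calculation can be illustrated with a small, self-contained sketch. It assumes the model y = mu + Zu + e, with marker effects u treated as random, genotypes coded -1/0/1, and a shrinkage parameter lam (the ratio of residual to marker-effect variance) supplied as a fixed number. This is a simplification: rrBLUP estimates the variance components from the data, whereas this sketch does not, and all function names and data below are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rr_blup(Z, y, lam):
    """Return (mu, u): estimated mean and BLUPs of marker effects.

    Z   -- n x m genotype matrix (rows are rats, columns are SNPs)
    y   -- n phenotype values (assumed already adjusted for covariates)
    lam -- assumed ratio of residual variance to marker-effect variance
    """
    n, m = len(Z), len(Z[0])
    # V = Z Z' + lam * I, proportional to the phenotypic covariance.
    V = [[sum(Z[i][k] * Z[j][k] for k in range(m)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    a = solve(V, y)            # V^-1 y
    b = solve(V, [1.0] * n)    # V^-1 1
    mu = sum(a) / sum(b)       # generalized least squares estimate of the mean
    w = [a[i] - mu * b[i] for i in range(n)]   # V^-1 (y - mu)
    u = [sum(Z[i][k] * w[i] for i in range(n)) for k in range(m)]
    return mu, u


def gebv(Z_new, u):
    """Sum marker effects over each rat's genotypes to get its GEBV."""
    return [sum(z * e for z, e in zip(row, u)) for row in Z_new]


# Toy training data: six rats, one SNP, phenotype equal to the genotype code.
Z_train = [[-1], [0], [1], [1], [-1], [0]]
y_train = [-1.0, 0.0, 1.0, 1.0, -1.0, 0.0]
mu, u = rr_blup(Z_train, y_train, lam=1.0)
scores = gebv([[1], [-1]], u)  # two new, unphenotyped rats
print(mu, u, scores)
```

In this toy case the estimated SNP effect is shrunk below the value of 1 that ordinary least squares would give, which is the behavior that lets the method tolerate many non-causative markers. With real data the same scores would be computed over genome-wide SNPs and then ranked to pick high and low rats.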