Quick Start
This tutorial will walk through the process of constructing a simple PRS with the make_prs()
function. It will introduce arguments which are used throughout the package and is aimed at useRs who are at least comfortable loading and doing simple data manipulations in R
. More complex analyses, as well as a more in depth discussion of the options available for this function may be found in later sections.
Before getting started, it's important to understand the basics behind a polygenic risk score (PRS).
A PRS is a weighted sum of genetic load across loci. For a vector of estimated effects and a genetic matrix with rows and columns for individuals and bi-allelic genetic variants coded according to the number of non-reference alleles present, the estimated polygenic risk score for individual , , may be defined by For more details on the above method, including description of assumptions and properties, please see Properties of PRS.
To construct a PRS we need to provide, at minimum:
- Genetic Data
- Estimated Effects
- A value threshold
Genetic Data
In practice, one of the most common formats for genetic data storage is the plink
binary format.
plink
style binary files come packaged in three separate files:
plink.fam
plink.bim
plink.bed
plink.fam
houses the information about sample names and phenotypes, while plink.bim
stores the names, base pair coordinates, and allelic information for the genetic loci. plink.bed
is the workhorse of the three, and is a binary file containing the genotypes of each individual in plink.fam
for each loci in plink.bim
.
For more information about the format, please see the plink1.9
website.
For reasons of speed and simplicity, we require that all genetic files used in this package are of the plink
binary format. Fortunately, it is very simply to convert from most formats to plink
binary. For more information and conversion instructions from common formats, please see the Input Format section.
Estimated Effects
The estimated effect sizes, whatever they may be, should be provided in some analogue of a plink.assoc
or plink.qassoc
file. The important columns that must be present are:
SNP
orRSID
: The rs ID of the SNP under questionBETA
orOR
: The estimated effects for the SNPP
: P value for the estimated coefficient against the null of no effect.
The program attempts to guess non-standard column names, but you can't go wrong with the above.
value Threshold
To make our score, we need one more input.
Typically, researchers may wish to include SNPs in the score with a certain level of evidence behind them. For this, we may specify a value which is required for inclusion in the score. Variants with a value less than this threshold will be included in the score . This option defaults to 0.05 if not specified.
Now that we have all the pieces in place, we can install our package and get started.
Installing the package
pRs
is currently under development and as such is not yet released on CRAN or any other service. As such, it must be installed from Github. The devtools
package provides a simple to use interface for dealing with packages from Github.
First we install the devtools
package.
We first install the package from CRAN:
install.packages("devtools")
Then we add it to our local workspace:
library(devtools)
Now that we have the devtools
package installed, we can go ahead and install pRs
.
We specify the owner of the repository (Chris1221
) and the name of the repository (pRs
):
devtools::install_github("Chris1221/pRs")
Now that the package has been installed, we can go ahead and create a PRS.
The syntax for the make_prs()
function is as follows:
make_prs(
bfile = bfile,
assoc = assoc,
p = p,
pheno = pheno
)
Arguments
bfile
: The path to the root of your binary file. For example, if your trio of binaryplink
files was calledplink.bim plink.bed plink.fam
and they live in your~/Documents
folder, you would specify~/Documents/plink
without any file extension. You must ensure that all three files are present.- Must be a
character
(i.e.is.character(bfile)
returnsT
). - Files must exist to pass initial tests.
- Must be a
assoc
: The file path to your association file with the estimated effects used to construct the score. If you do not include a file extension it will be guessed, but it is best to input the direct path, for example~/Docuemnts/plink.assoc
.- Must be a
string
(i.e.is.character(assoc)
returnsT
). - Files must exist to pass initial tests.
- Must be a
p
: P value threshold at which to include variants in your PRS.- Must be a
numeric
(i.e.is.numeric(p)
returnsT
).
- Must be a
- (optional)
pheno
: A seperate phenotype aside from the one included in the sixth column of theplink.fam
file for use in analysis.- Must be a
string
(i.e.is.character(pheno)
returnsT
) pointing to an existing file OR adata.frame
which has the following columns:FID
,IID
,PHENO
.
- Must be a
Continuing from the example given above, we input the required arguments in order to construct our score, which will be returned as a simple data.frame
.
s_i <- make_prs(
bfile = "./plink",
assoc = "./plink.qassoc",
p = 0.05
)
We can take a look at the first six rows of the output.
The combination of FID
and IID
uniquely identifies each individual and assigns them to a score in the SCORE
column.
head(s_i)
FID IID SCORE
CEU id1 12.25
CEU id2 -5.22
CEU id3 2.87
CEU id4 8.27
CEU id5 1.13
CEU id6 -8.39
Using the Score
If you did not input a phenotype, then you may do your own analysis with the resulting vector of scores.
Given a typical plink
style phenotype file plink.phen
.
FID IID PHEN
CEU id1 2.5
CEU id2 1.2
CEU id3 5.6
CEU id4 2.7
CEU id5 3.9
CEU id6 8.39
You must first read the plink.phen
file into your R
workspace with read.table
or fread
or similar.
Using base functions:
pheno <- read.table(
"plink.phen", # The file name
header = TRUE # We want column headers
)
Using data.table
:
pheno <- as.data.frame(
data.table::fread(
"plink.phen",
header = TRUE
)
Note we transform the data.table
to a data.frame
for consistancy with the rest of this tutorial.
You may then merge together the phenotype information with the score information.
df <- merge(
s_i, # Score information
pheno, # Phenotype information
by = c("FID", "IID"), # Merge by the two ID columns
all.x = TRUE, # Don't throw away non-matching rows in s_i
all.y = FALSE # Throw away non-matching rows in pheno
)
And analyze the resulting relationships with linear regression or a whole host of other options. The results of these analyses are beyond the scope of this introductory tutorial.
lm(
SCORE ~ PHENO,
data = df
)
Summary
In this brief introduction to the package, we have constructed a simple PRS from toy data. Basic arguements were introduced and explained.
Going forward into constructing more complex scores, these basic arguements remain the same.
Once you have mastered the material in this quick start guide, you may wish to consult other guides in the Getting Started section for more details on data format and computational considerations. We also present specific explanation for each of the more complex score types in their respective sections.