Quick Start

This tutorial will walk through the process of constructing a simple PRS with the make_prs() function. It will introduce arguments which are used throughout the package and is aimed at useRs who are at least comfortable loading and doing simple data manipulations in R. More complex analyses, as well as a more in depth discussion of the options available for this function may be found in later sections.


Before getting started, it's important to understand the basics behind a polygenic risk score (PRS).

A PRS is a weighted sum of genetic load across loci. For a vector of estimated effects β^\underline{\hat{\beta}} and a genetic matrix G\mathbf{G} with rows i=1...ni = 1 ... n and columns j=1...mj = 1 ... m for nn individuals and mm bi-allelic genetic variants coded 0,1,20,1,2 according to the number of non-reference alleles present, the estimated polygenic risk score for individual ii, S^i \hat{S}_i , may be defined by S^i=i=1mβ^iGi,j \hat{S}_i = \sum^m_{i = 1} \hat{\beta}_i \mathbf{G}_{i,j} For more details on the above method, including description of assumptions and properties, please see Properties of PRS.

To construct a PRS we need to provide, at minimum:

  • Genetic Data
  • Estimated Effects
  • A PP value threshold

Genetic Data

In practice, one of the most common formats for genetic data storage is the plink binary format.

plink style binary files come packaged in three separate files:

plink.fam
plink.bim
plink.bed

plink.fam houses the information about sample names and phenotypes, while plink.bim stores the names, base pair coordinates, and allelic information for the genetic loci. plink.bed is the workhorse of the three, and is a binary file containing the genotypes of each individual in plink.fam for each loci in plink.bim.

For more information about the format, please see the plink1.9 website.

For reasons of speed and simplicity, we require that all genetic files used in this package are of the plink binary format. Fortunately, it is very simply to convert from most formats to plink binary. For more information and conversion instructions from common formats, please see the Input Format section.

Estimated Effects

The estimated effect sizes, whatever they may be, should be provided in some analogue of a plink.assoc or plink.qassoc file. The important columns that must be present are:

  • SNP or RSID: The rs ID of the SNP under question
  • BETA or OR: The estimated effects for the SNP
  • P: P value for the estimated coefficient against the null of no effect.

The program attempts to guess non-standard column names, but you can't go wrong with the above.

PP value Threshold

To make our score, we need one more input.

Typically, researchers may wish to include SNPs in the score with a certain level of evidence behind them. For this, we may specify a PP value which is required for inclusion in the score. Variants with a PP value less than this threshold will be included in the score S^i\hat{S}_i. This option defaults to 0.05 if not specified.

Now that we have all the pieces in place, we can install our package and get started.


Installing the package

pRs is currently under development and as such is not yet released on CRAN or any other service. As such, it must be installed from Github. The devtools package provides a simple to use interface for dealing with packages from Github.

First we install the devtools package.

We first install the package from CRAN:

install.packages("devtools")

Then we add it to our local workspace:

library(devtools)

Now that we have the devtools package installed, we can go ahead and install pRs.

We specify the owner of the repository (Chris1221) and the name of the repository (pRs):

devtools::install_github("Chris1221/pRs")

Now that the package has been installed, we can go ahead and create a PRS.

The syntax for the make_prs() function is as follows:

make_prs(
    bfile = bfile,
    assoc = assoc,
    p = p,
    pheno = pheno
   )

Arguments

  • bfile: The path to the root of your binary file. For example, if your trio of binary plink files was called plink.bim plink.bed plink.fam and they live in your ~/Documents folder, you would specify ~/Documents/plink without any file extension. You must ensure that all three files are present.
    • Must be a character (i.e. is.character(bfile) returns T).
    • Files must exist to pass initial tests.
  • assoc: The file path to your association file with the estimated effects used to construct the score. If you do not include a file extension it will be guessed, but it is best to input the direct path, for example ~/Docuemnts/plink.assoc.
    • Must be a string (i.e. is.character(assoc) returns T).
    • Files must exist to pass initial tests.
  • p: P value threshold at which to include variants in your PRS.
    • Must be a numeric (i.e. is.numeric(p) returns T).
  • (optional) pheno: A seperate phenotype aside from the one included in the sixth column of the plink.fam file for use in analysis.
    • Must be a string (i.e. is.character(pheno) returns T) pointing to an existing file OR a data.frame which has the following columns: FID, IID, PHENO.

Continuing from the example given above, we input the required arguments in order to construct our score, which will be returned as a simple data.frame.

s_i <- make_prs(
          bfile = "./plink",
          assoc = "./plink.qassoc",
          p = 0.05
       )

We can take a look at the first six rows of the output.

The combination of FID and IID uniquely identifies each individual and assigns them to a score in the SCORE column.

head(s_i)

  FID    IID    SCORE
  CEU    id1    12.25
  CEU    id2    -5.22
  CEU    id3     2.87
  CEU    id4     8.27
  CEU    id5     1.13
  CEU    id6    -8.39

Using the Score

If you did not input a phenotype, then you may do your own analysis with the resulting vector of scores.

Given a typical plink style phenotype file plink.phen.

  FID    IID    PHEN
  CEU    id1     2.5
  CEU    id2     1.2
  CEU    id3     5.6
  CEU    id4     2.7
  CEU    id5     3.9
  CEU    id6     8.39

You must first read the plink.phen file into your R workspace with read.table or fread or similar.

Using base functions:

pheno <- read.table(
            "plink.phen", # The file name
            header = TRUE # We want column headers
         )

Using data.table:

pheno <- as.data.frame(
            data.table::fread(
                "plink.phen",
                header = TRUE
              )

Note we transform the data.table to a data.frame for consistancy with the rest of this tutorial.

You may then merge together the phenotype information with the score information.

df <- merge(
        s_i, # Score information
        pheno, # Phenotype information
        by = c("FID", "IID"), # Merge by the two ID columns
        all.x = TRUE, # Don't throw away non-matching rows in s_i
        all.y = FALSE # Throw away non-matching rows in pheno
      )

And analyze the resulting relationships with linear regression or a whole host of other options. The results of these analyses are beyond the scope of this introductory tutorial.

lm( 
    SCORE ~ PHENO,
    data = df
  )

Summary

In this brief introduction to the package, we have constructed a simple PRS from toy data. Basic arguements were introduced and explained.

Going forward into constructing more complex scores, these basic arguements remain the same.

Once you have mastered the material in this quick start guide, you may wish to consult other guides in the Getting Started section for more details on data format and computational considerations. We also present specific explanation for each of the more complex score types in their respective sections.

results matching ""

    No results matching ""