Background R Primer For Bioinformatics

RDataScience

References: R & Bioconductor Manual by Thomas Girke @ UC Riverside.  

This is an excellent resource; the document below summarizes and annotates material building from it.

Resources

Introduction

RStudio is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It’s a good idea to upgrade regularly so you can take advantage of the latest and greatest features. For this book, make sure you have at least RStudio 1.0.0.

When you start RStudio, you’ll see two key regions in the interface: the console pane, where you type commands, and the output pane.


For now, all you need to know is that you type R code in the console pane, and press enter to run it. You’ll learn more as we go along!

The tidyverse

You’ll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.

You can install the complete tidyverse with a single line of code:

install.packages("tidyverse")

On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer.

You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.

Running R code

The previous section showed you a couple of examples of running R code. Code in the book looks like this:

1 + 2 
#> [1] 3 

Install Package

The first time, you need to install the package:

install.packages("ggplot2")

From then on you just load the library

library(my_package) # Loads a particular package.

library() # Shows all libraries available on a system.

library(help=my_package) # Lists all functions/objects of a package.

Reading and Writing From Files

Reads into a data frame

my_frame <- read.table(file="my_table") 
my_frame <- read.table(file="my_table", header=TRUE, sep="\t")
df <- read.csv('cohort.tpms.csv.gz',header = TRUE, sep = ",", 
 quote = "\"", dec = ".", fill = TRUE, row.names = 1)
write.table(iris, "clipboard", sep="\t", col.names=NA, quote=FALSE) # Writes a table to the clipboard (Windows only).

Commands to save an R object to an external file and to read it back in from that file.

save(df, file="my_file.RData"); load(file="my_file.RData")

Navigating

dir() # Lists the content of the current working directory. 
getwd() # Returns the current working directory.
setwd("my_dir") # Changes the current working directory to the specified directory.
ls() # Lists R objects; objects() does the same. 
rm(my_object1, my_object2) # Removes the specified objects. 
rm(list = ls()) # Removes all objects without warning! 
str(object) # Displays the type and structure of an R object. 
ls.str(pattern="") # Lists object type info on all objects in a session. 
lsf.str(pattern="") # Lists object type info on all functions in a session. 
class(object) # Prints the object type. 
mode(object) # Prints the storage mode of an object. 
summary(object) # Generic summary info for all kinds of objects. 
attributes(object) # Returns an object's attribute list. 
gc() # Causes garbage collection to take place.
length(object) # Provides the length of an object.

Datatypes

vectors:

ordered collection of numeric, character, complex and logical values.

x <- c("1", "2", "3");

lists

general form of vectors with different types of elements

mylist <- list("hi", 1, TRUE) # list() preserves mixed types; c("hi", 1, "1") would coerce everything to character.
mylist[[2]] # Returns the second element, 1.


data frames (objects)

Two dimensional structures with different data types


Creating Dataframes

## Sample Set: the following transforms the iris data set into a ggplot2-friendly format
# Calculates the mean values for the aggregates given by the Species column in the iris data set.
iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean)
# Calculates the standard deviations for the aggregates given by the Species column in the iris data set.
iris_sd <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=sd)
# Defines a function to convert wide data frames into a ggplot2-friendly long format.
convertDF <- function(df, mycolnames=c("Species", "Values", "Samples")) {
  myfactor <- rep(colnames(df)[-1], each=length(df[,1]))
  mydata <- as.vector(as.matrix(df[,-1]))
  df <- data.frame(df[,1], mydata, myfactor)
  colnames(df) <- mycolnames
  return(df)
}
# Converts iris_mean.
df_mean <- convertDF(iris_mean, mycolnames=c("Species", "Values", "Samples"))
# Converts iris_sd.
df_sd <- convertDF(iris_sd, mycolnames=c("Species", "Values", "Samples"))
# Defines standard deviation limits for error bars.
limits <- aes(ymax = df_mean[,2] + df_sd[,2], ymin = df_mean[,2] - df_sd[,2])

# A small data frame built directly from vectors:
genes <- c('APOE', 'PTEN', 'MTFMT')
disease <- c("AD", "Autism", "Mitochondrial")
discovered <- as.Date(c('1995-11-1', '2004-3-25', '2001-3-14'))
gene_disease <- data.frame(genes, disease, discovered)

Subsets, Slicing, Accessing dataframes

factors: special type of vector with grouping information for its components
matrices: two dimensional structures with data of the same type
arrays: multidimensional structures holding data of the same type
functions: reusable pieces of code
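
Minimal sketches of each type, for quick reference:

f <- factor(c("a", "b", "a")) # A factor; levels(f) returns the grouping levels "a" and "b".
m <- matrix(1:6, nrow=2) # A 2 x 3 matrix; all elements share one type.
a <- array(1:24, dim=c(2,3,4)) # A three-dimensional array.
square <- function(x) x^2 # A function; square(3) returns 9.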

my_object[index] # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.

my_logical <- my_object > 10 # Generates a logical vector as example.
my_object[my_logical] # Returns the elements where my_logical contains TRUE values.
iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")] # Has the same effect as the previous step.
# Generic syntax to access columns and rows in data frames.
my_frame[rows, columns] 
# Provides number of columns or rows of data frame, respectively.
length(my_frame); length(my_frame$y1)
# Gives column and row names of data frame.
colnames(my_frame); rownames(my_frame) 
# Prints row names or indexing column of data frame.
row.names(my_frame) 
# Sorts the rows of a data frame by the specified columns, here 'y2'; 
my_frame[order(my_frame$y2, decreasing=TRUE), ] 
 # for increasing order use 'decreasing=FALSE'.
# Subsequent sub-sorts can be performed by specifying additional columns. By adding a "-" sign one can reverse the sort order. 
my_frame[order(my_frame[,4], -my_frame[,3]),]
# Notation to print entire column of a data frame as vector or factor.
my_frame$y1
# Notation to access column element(s) of a data frame.
my_frame$y1[2:4]
# Notation for returning the value of an individual cell; here, the third value of the fourth column.
v <-my_frame[,4]; v[3] 
# View only the first five rows of all columns.
my_frame[1:5,] 
# View all rows of the first two columns.
my_frame[, 1:2] 
# View all rows of the specified columns.
my_frame[, c(1,3)]
# View only the first five rows of the columns 1-2. 
my_frame[1:5,1:2] 
# Retrieve row values by their index name (here "August").
my_frame["August",]

Basic Operators and Calculations

Comparing two things and returning a boolean:

equal: ==
not equal: !=
greater/less than: > <
greater/less than or equal: >= <=

AND: &

x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.

OR: |

x == y | x != y # Returns TRUE where at least one comparison is TRUE.

Regex

?regexp # Opens general help page for regular expression support in R. 
month.name[grep("A", month.name)] # The grep function can be used for finding patterns in strings, here letter 
# 'A' in vector 'month.name'. 
gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE) # Example for using regular expressions to substitute a pattern by another one using a back reference. 
# Remember: single escapes '\' need to be double escaped '\\' in R

Functions

name <- function(arg_1, arg_2, ...) expression # Basic syntax for functions.
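
For example, here is a small, hypothetical function with a default argument:

log2_pseudo <- function(x, pc=1) { log2(x + pc) } # Adds a pseudocount before log-transforming.
log2_pseudo(c(0, 1, 3)) # Returns 0 1 2.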

Let’s use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

The mpg data frame

You can test your answer with the mpg data frame found in ggplot2 (aka ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

mpg
#> # A tibble: 234 × 11
#>   manufacturer model displ  year   cyl      trans   drv   cty   hwy    fl
#>          <chr> <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>
#> 1         audi    a4   1.8  1999     4   auto(l5)     f    18    29     p
#> 2         audi    a4   1.8  1999     4 manual(m5)     f    21    29     p
#> 3         audi    a4   2.0  2008     4 manual(m6)     f    20    31     p
#> 4         audi    a4   2.0  2008     4   auto(av)     f    21    30     p
#> 5         audi    a4   2.8  1999     6   auto(l5)     f    16    26     p
#> 6         audi    a4   2.8  1999     6 manual(m5)     f    18    26     p
#> # ... with 228 more rows, and 1 more variables: class <chr>

Among the variables in mpg are:

  1. displ, a car’s engine size, in litres.
  2. hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

ggplot

To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with big engines use more fuel. Does this confirm or refute your hypothesis about fuel efficiency and engine size?

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph, but it’s not very interesting so I’m not going to show it here.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

The rest of this chapter will show you how to complete and extend this template to make different types of graphs. We will begin with the <MAPPINGS> component.

In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:
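
As a quick sketch, those levels can also be set directly in code (outside of aes(), a technique covered at the end of this section):

ggplot(data.frame(x = 1, y = 1), aes(x, y)) + geom_point(size = 2, shape = 17, colour = "blue") # small, triangular, blue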

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

In the above example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a warning here, because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class))
#> Warning: Using size for a discrete variable is not advised.

Or we could have mapped class to the alpha aesthetic, which controls the transparency of the points, or the shape of the points.

# Left 
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class)) 
# Right 
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))

What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.

For each aesthetic, you use aes() to associate the name of the aesthetic with a variable to display. The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. The syntax highlights a useful insight about x and y: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.

Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.

You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Graphing with ggplot2

library(ggplot2)
# Plots two vectors (columns) in the form of a scatter plot against each other. 
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()

Plots larger dots and colors them with default color scheme.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4)

Colors dots with custom scheme.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) + ylim(2,4) + xlim(4,8) + scale_color_manual(values=rainbow(10))

# Adds a linear regression line (without the confidence band).
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + stat_smooth(method="lm", se=FALSE)

# Plots three line plots, one for each species in the Species column.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) + facet_wrap(~Species, ncol=1)


WORKFLOW: BASICS

You now have some experience running R code. I didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that it’s both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.

Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

Coding basics

Let’s review some basics we’ve so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:

1 / 200 * 30 
#> [1] 0.15 
(59 + 73 + 2) / 3 
#> [1] 44.7 
sin(pi / 2) 
#> [1] 1

You can create new objects with <-:

x <- 3 * 4

All R statements where you create objects, assignment statements, have the same form:

object_name <- value

When reading that code say “object name gets value” in your head.

You will make lots of assignments and <- is a pain to type. Don’t be lazy and use =: it will work, but it will cause confusion later. Instead, use RStudio’s keyboard shortcut: Alt + – (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.

Functions

R has a large collection of built-in functions that are called like this:

function_name(arg1 = val1, arg2 = val2, ...)

Let’s try using seq() which makes regular sequences of numbers and, while we’re at it, learn more helpful features of RStudio. Type se and hit TAB. A popup shows you possible completions. Specify seq() by typing more (a “q”) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.

Press TAB once more when you’ve selected the function you want. RStudio will add matching opening (() and closing ()) parentheses for you. Type the arguments 1, 10 and hit return.

seq(1, 10)
#> [1]  1  2  3  4  5  6  7  8  9 10

Type this code and notice you get similar assistance with the paired quotation marks:

x <- "hello world"

Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:

> x <- "hello +

The + tells you that R is waiting for more input; it doesn’t think you’re done yet. Usually that means you’ve forgotten either a " or a ). Either add the missing pair, or press ESCAPE to abort the expression and try again.

If you make an assignment, you don’t get to see the value. You’re then tempted to immediately double-check the result:

y <- seq(1, 10, length.out = 5) 
y 
#> [1] 1.00 3.25 5.50 7.75 10.00

This common action can be shortened by surrounding the assignment with parentheses, which causes assignment and “print to screen” to happen.

(y <- seq(1, 10, length.out = 5))
#> [1]  1.00  3.25  5.50  7.75 10.00

Now look at your environment in the upper right pane:

Here you can see all of the objects that you’ve created.

In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

library(nycflights13)
library(tidyverse)

Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
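
For instance, base R’s stats::filter() is a time-series filter, not a row filter (presidents is a built-in time series, used here purely for illustration):

stats::filter(presidents, rep(1/3, 3)) # A 3-quarter moving average; unrelated to dplyr::filter().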

Logical operators

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”.

Figure 5.1: Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.

The following code finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)

The order of operations doesn’t work like English. You can’t write filter(flights, month == 11 | 12), which you might literally translate into “finds all flights that departed in November or December”. Instead it finds all months that equal 11 | 12, an expression that evaluates to TRUE. In a numeric context (like here), TRUE becomes one, so this finds all flights in January, not November or December. This is quite confusing!

A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:

nov_dec <- filter(flights, month %in% c(11, 12))

Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

As well as & and |, R also has && and ||. Don’t use them here!

Whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. You’ll learn how to create new variables shortly.
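
As a sketch of that idea, you could use mutate() (introduced properly later) to name the condition, then filter on the name:

delayed_flags <- mutate(flights, too_delayed = arr_delay > 120 | dep_delay > 120)
filter(delayed_flags, !too_delayed)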

Missing values

One important feature of R that can make comparisons tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA
NA / 2
#> [1] NA

The most confusing result is this one:

NA == NA
#> [1] NA

It’s easiest to understand why this is true with a bit more context:

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
#> [1] NA
# We don't know!

If you want to determine if a value is missing, use is.na():

is.na(x)
#> [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1     3
filter(df, is.na(x) | x > 1)
#> # A tibble: 2 × 1
#>       x
#>   <dbl>
#> 1    NA
#> 2     3

Grouped summaries with summarise()

The last key verb is summarise(). It collapses a data frame to a single row:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   delay
#>   <dbl>
#> 1  12.6

(We’ll come back to what that na.rm = TRUE means very shortly.)

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#> 
#>    year month   day delay
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.55
#> 2  2013     1     2 13.86
#> 3  2013     1     3 10.99
#> 4  2013     1     4  8.95
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # ... with 359 more rows

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
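
As a preview, here is the same grouped summary written with the pipe (%>%):

flights %>% 
  group_by(year, month, day) %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))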


Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it?

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#> 
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1    NA
#> 2  2013     1     2    NA
#> 3  2013     1     3    NA
#> 4  2013     1     4    NA
#> 5  2013     1     5    NA
#> 6  2013     1     6    NA
#> # ... with 359 more rows

We get a lot of missing values! That’s because aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#> 
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.55
#> 2  2013     1     2 13.86
#> 3  2013     1     3 10.99
#> 4  2013     1     4  8.95
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # ... with 359 more rows

In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. We’ll save this dataset so we can reuse it in the next few examples.

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#> 
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.44
#> 2  2013     1     2 13.68
#> 3  2013     1     3 10.91
#> 4  2013     1     4  8.97
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # ... with 359 more rows


DATA IMPORT

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you’ll learn how to read plain-text rectangular files into R. Here, we’ll only scratch the surface of data import, but many of the principles will translate to other forms of data. We’ll finish with a few pointers to packages that are useful for other types of data.

In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.

Most of readr’s functions are concerned with turning flat files into data frames:

  • read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place), read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.
  • read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.
  • read_log() reads Apache style log files.

These functions all have similar syntax: once you’ve mastered one, you can use the others with ease. For the rest of this chapter we’ll focus on read_csv(). Not only are csv files one of the most common forms of data storage, but once you understand read_csv(), you can easily apply your knowledge to all the other functions in readr.

The first argument to read_csv() is the most important: it’s the path to the file to read.

heights <- read_csv("data/heights.csv")
#> Parsed with column specification:
#> cols(
#>   earn = col_double(),
#>   height = col_double(),
#>   sex = col_character(),
#>   ed = col_integer(),
#>   age = col_integer(),
#>   race = col_character()
#> )

When you run read_csv() it prints out a column specification that gives the name and type of each column.

You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:

read_csv("a,b,c
1,2,3
4,5,6")
#> # A tibble: 2 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:

  1. Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.
    read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)
    #> # A tibble: 1 × 3
    #>       x     y     z
    #>   <int> <int> <int>
    #> 1     1     2     3
    
    read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
    #> # A tibble: 1 × 3
    #>       x     y     z
    #>   <int> <int> <int>
    #> 1     1     2     3
  2. The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:
    read_csv("1,2,3\n4,5,6", col_names = FALSE)
    #> # A tibble: 2 × 3
    #>      X1    X2    X3
    #>   <int> <int> <int>
    #> 1     1     2     3
    #> 2     4     5     6

    ("\n" is a convenient shortcut for adding a new line.)

    Alternatively you can pass col_names a character vector which will be used as the column names:

    read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
    #> # A tibble: 2 × 3
    #>       x     y     z
    #>   <int> <int> <int>
    #> 1     1     2     3
    #> 2     4     5     6

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 × 3
#>       a     b     c
#>   <int> <int> <chr>
#> 1     1     2  <NA>

This is all you need to know to read ~75% of CSV files that you’ll encounter in practice. You can also easily adapt what you’ve learned to read tab separated files with read_tsv() and fixed width files with read_fwf(). To read in more challenging files, you’ll need to learn more about how readr parses each column, turning them into R vectors.
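
As a small taste, readr exposes those column parsers directly (readr is loaded as part of the tidyverse):

parse_integer(c("1", "2", "3"))
#> [1] 1 2 3
parse_number("$1,000")
#> [1] 1000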

CREATING MARKDOWNS

R Markdown is a format for writing reproducible, dynamic reports with R. Use it to embed R code and results into slideshows, pdfs, html documents, Word files and more.

Markdown Cheatsheet

Markdown Example:

Open File: Start by saving a text file with the extension .Rmd, or open an RStudio Rmd template. In the menu bar, click File ▶ New File ▶ R Markdown… A window will open. Select the class of output you would like to make with your .Rmd file, select the specific type of output with the radio buttons (you can change this later), and click OK.

Plain text
End a line with two spaces to start a new paragraph.
*italics* and _italics_
**bold** and __bold__
superscript^2^
~~strikethrough~~
[link](www.rstudio.com)
# Header 1
## Header 2
### Header 3
#### Header 4
##### Header 5
###### Header 6
endash: --
emdash: ---
ellipsis: ...
inline equation: $A = \pi*r^{2}$
image: ![](path/to/smallorb.png)
horizontal rule (or slide break):
***
> block quote
* unordered list
* item 2
    + sub-item 1
    + sub-item 2
1. ordered list
2. item 2
    + sub-item 1
    + sub-item 2
Table Header  | Second Header
------------- | -------------
Table Cell    | Cell 2
Cell 3        | Cell 4

Data Wrangling in R

We have our two datasets, expression_results.csv.gz and sample_info.csv.gz, which we have likely decompressed by unzipping to expression_results.csv and sample_info.csv within the directory RNASeqExample.

Let’s open up RStudio and create a new markdown called “RNASeq_Wrangle.Rmd”.  Let’s make sure we don’t have the demo code in there: delete everything from “R Markdown” down on line 10, such that it reads:

---
title: "RNASeq_Wrangle"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Now let’s start by loading our libraries (if the dplyr package isn’t installed, you should type install.packages("dplyr") in the console window).  Let’s first create an “R” code block by clicking “Insert” and then selecting “R” from the pull down; this is directly above, next to the green “+c”.   Let’s load the files into two dataframes, samples and genes.  The following should be in your code block.

Note that the quote argument takes an escaped double quote, written \" inside the string.

library(dplyr)
setwd("~/RNASeqExample/")
samples <- read.csv('sample_info.csv', header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, row.names = 1)
genes <- read.csv('expression_results.csv', header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, row.names = 1)

Press the green arrow on the upper right side to run the code block.  There may be a few notes involving dplyr.  You should see in the upper right that there are two dataframes, samples and genes.  We can click on each to get a preview.  For example, the samples dataframe is shown below.

samples

genes

One thing we notice is that the UID, or unique ID, appears as the column names of genes and as the row names of samples.  UIDs are an important concept: they are guaranteed to identify a unique row within a table, a study, or in fact globally… or universally (UUID).
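
A quick sanity check of that assumption, verifying that every sample row name appears among the columns of the expression matrix:

all(rownames(samples) %in% colnames(genes)) # TRUE if the two tables agree on sample UIDs.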

Let’s understand the data we are looking at a little better, and graph the expression values for two different samples by typing into the console

plot(genes$KITA_01,genes$KITA_02)

While we graphed 63,677 different points for these two samples, you wouldn’t know it, as most of them have very low values.  Moreover, the data look correlated, driven by 4 large points.  This is an instance of what is likely non-parametric data (data that doesn’t fit a normal distribution).  We can make a distribution of where the data are:

d <- density(genes$KITA_01) # returns the density data
plot(d) # plots the results

Compare that to something more likely to be random normally distributed such as RIN for our samples:

d <- density(samples$RIN) # returns the density data
plot(d) # plots the results


Clearly we aren’t normally distributed.  Generally it’s wise to keep the central-limit theorem in mind when looking at data – or presume it to be non-parametric (Spearman) instead of parametric (Pearson).  What we are doing is the same as graphing on a log plot:

plot(density(log2(genes$KITA_01[(genes$KITA_01>0)])))

Aren’t logs fairly magical at making things normal?  Note that we are not showing the zeroes, due to the filtering done inline (genes$KITA_01>0).  Now let’s actually plot the log2 values.

plot(log2(genes$KITA_01[(genes$KITA_01>0 |genes$KITA_02>0 )]),log2(genes$KITA_02[(genes$KITA_01>0 |genes$KITA_02>0 )]))

Now we see a plot showing us how two different samples correlate.  If we wanted, we could calculate a Spearman correlation or a Pearson correlation.  Pearson correlates the values, and Spearman correlates the ranks.  Generally, Spearman is robust to extreme values and does not suffer from the “tail wagging the dog”.  Squaring the Pearson correlation coefficient (r²) tells us the percentage of variance in one variable explained by the other.
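
Both are a single call to cor():

cor(genes$KITA_01, genes$KITA_02, method="pearson") # Correlates the values.
cor(genes$KITA_01, genes$KITA_02, method="spearman") # Correlates the ranks.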

OK – so in this case, say we wanted to compute a dynamic range – how many orders of magnitude lie between the greatest and smallest value for each sample – and then store those in a new dataframe.

Let’s make a new dataframe with some summary information (base R’s sapply does the work here, not dplyr).

samples$uid <- rownames(samples)
genes_summary <- data.frame(
  UID = colnames(genes), # one row per sample (the columns of genes)
  min = sapply(genes, function(x) min(x[x > 0])), # smallest non-zero value per sample
  max = sapply(genes, function(x) max(x)) # largest value per sample
)
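
With min and max in hand, the dynamic range described above is one more (illustrative) column:

genes_summary$dynamic_range <- log10(genes_summary$max / genes_summary$min) # Orders of magnitude per sample.
head(genes_summary)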

We introduce some major statistical ideas and present them with the knowledge that the best statistical tool should always be chosen by working with experts in biostatistics or by following examples laid out in analogous high impact articles.  In applied bioinformatics we are often not inventing a new analysis but rather following a pre-defined analysis approach.  Still, it’s worthwhile to review some of the major tools that come up time and again.

Correlation.  As we introduced above, correlation is generally a way to measure the strength of dependence of two variables on one another.  By knowing one variable, how much do we know about the other?  One simple way to think about it is the percentage by which one variable describes another variable: this is r², the squared correlation coefficient.  Sometimes we just refer to r, without squaring; this ranges from -1 to 1, where -1 is perfectly inversely correlated – that is to say, as one value goes up, the other goes down.  We are going to discuss this a few different ways, so do your best not to get hung up in the intricacies, as that is not our goal in this course.

Also, as above, there are many types of correlation coefficients for correlating two vectors, or two samples.  One is to correlate the numbers directly, as above; this is the Pearson correlation coefficient.  Another is to correlate their ranks, which provides a way to conduct correlation where the relative ranks of each value are compared; this is the Spearman correlation coefficient.  Spearman is appropriate whenever there is concern that the data are not normally distributed.  At a high level, making a histogram of the data can provide us insight.
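
For example, base R’s hist() provides that quick look:

hist(samples$RIN, breaks=20) # Does RIN look roughly bell-shaped?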

In the dataset we are using to learn there are 93 samples.  What if we measured the correlation coefficient between all pairs of samples?  This would create a 93 by 93 matrix of correlation values.  The function to help us with this is cor in R, and we can apply it to genes.

corr <- cor(genes) # Pearson by default; pass method="spearman" for rank correlation.

We can type `corr` to display it within the console, and we see something like this (shown for only 6 samples):

          KITA_01   KITA_02   KITA_03   KITA_04   KITA_05   KITA_06
KITA_01 1.0000000 0.9172325 0.7059474 0.7890462 0.8727615 0.7956106
KITA_02 0.9172325 1.0000000 0.6796419 0.8972362 0.9245204 0.9117325
KITA_03 0.7059474 0.6796419 1.0000000 0.7943030 0.7975012 0.7706625
KITA_04 0.7890462 0.8972362 0.7943030 1.0000000 0.9774844 0.9919876
KITA_05 0.8727615 0.9245204 0.7975012 0.9774844 1.0000000 0.9828198
KITA_06 0.7956106 0.9117325 0.7706625 0.9919876 0.9828198 1.0000000

We note the diagonal of ones: a variable is perfectly correlated with itself, just as we would expect.

Now we can’t see the matrix easily, and honestly such matrices are difficult to consume.  We could reshape this as a list of pairs, and that would be easier.  This is the core concept of “melt”, which turns a square matrix into minimal (in this case pairwise) components.

library(reshape2) # provides melt()
melted_corr <- melt(corr)
head(melted_corr)

The second command provides us the first few lines to get the idea of what is happening.

     Var1    Var2     value
1 KITA_01 KITA_01 1.0000000
2 KITA_02 KITA_01 0.9172325
3 KITA_03 KITA_01 0.7059474
4 KITA_04 KITA_01 0.7890462
5 KITA_05 KITA_01 0.8727615
6 KITA_06 KITA_01 0.7956106

We can visualize this as a heatmap to get a better view:

ggplot(data = melted_corr, aes(x=Var1, y=Var2, fill=value)) + geom_tile()

However, we can do quite a bit more, and it’s best to find examples on the web.  One example I found was:

ggplot(melted_corr , aes(x = Var1, y = Var2)) + geom_raster(aes(fill = value)) + scale_fill_gradient2(low="navy", mid="white", high="red", midpoint=0.5) + theme_classic()

Now one thing we notice is that there are some samples that are like other samples.  In fact, the default organization of the samples shows some that are all like each other near the top, and some that are different.

We are going to need to install a few libraries. See below:

install.packages('devtools')  
devtools::install_github('hadley/ggplot2') 
install.packages('reshape2')

Let’s use a different library, “plotly”, which adds additional functionality to ggplot.   You may need to install the plotly package via install.packages('plotly'), and some individuals may be asked to install the development version of ggplot2, which you can do via devtools::install_github('hadley/ggplot2') as above.  Here we remove some labels and make the plot respond to hovering.  The entire block is:

library(ggplot2)
library(reshape2)
library(plotly) # for ggplotly()
corr <- cor(genes)
melted_corr <- melt(corr)
p <- ggplot(melted_corr, aes(x = Var1, y = Var2)) + geom_raster(aes(fill = value)) + scale_fill_gradient2(low="navy", mid="white", high="red", midpoint=0.5) + theme(plot.title = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.y = element_blank(), axis.title.x = element_blank())
ggplotly(p)


One concept is supervised vs. unsupervised analysis methods.  Supervised methods work with classes that are already assigned, while unsupervised methods identify classes using algorithms.  In the case of our data, the supervised classes are “Kit”.  Unsupervised groups will be identified through algorithms.  One such algorithm is hierarchical clustering.
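
A minimal unsupervised sketch, reusing the corr matrix from above, clusters samples with a correlation-based distance:

hc <- hclust(as.dist(1 - corr)) # 1 - r turns similarity into distance.
plot(hc) # Dendrogram: strongly correlated samples cluster together.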


Joins and other functions

We refer to our dplyr cheatsheet, but we note that these are database functions common to tables.  We want to join in our PCA analysis and maybe color by something related to the sample.  Here is some info on joining from the cheatsheet.

Now, we are just introducing joins, and they are important with databases.  The main concept to think about when merging tables is that sometimes they share the same rows and sometimes they don’t.  In this case it’s easy, because they do.  Still, let’s do an inner join.

We know the first thing we need is a common column to join on.  When we look at the samples dataframe, you’ll notice there is no such column, but that the rownames are what we want to join on.  So let’s make it easy and create a new column from these rownames.
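
That is one line (the same line reappears in the PCA block below):

samples$uid <- rownames(samples) # The rownames become an explicit 'uid' column to join on.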


Running a PCA

```{r}
pcadf <- prcomp(genes, center = TRUE, scale. = TRUE)
pcadf <- data.frame(pcadf$rotation) # The rotation matrix holds one row per sample (the columns of genes).
```

```{r}
library(dplyr)
samples$uid<-rownames(samples)
pcadf$uid<-rownames(pcadf)
samples<-inner_join(samples,pcadf,by="uid")
```
```{r}
library(plotly)
plot_ly(samples, x = ~PC1, y = ~PC2, z = ~PC3, size=~reads, marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 25), color = ~Kit) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
  yaxis = list(title = 'PC2'),
  zaxis = list(title = 'PC3')))
```


Now let’s make our graph a little more informative.  That’s five dimensions, for those counting:

plot_ly(samples, x = ~PC1, y = ~PC2, z = ~PC3, size=~reads,marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(5, 25), color = ~RIN, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
  yaxis = list(title = 'PC2'),
  zaxis = list(title = 'PC3')))