Spoon-Fed R

Matt Shirley
October 24 2013

Overview

  1. interacting with R
  2. using R as a calculator
  3. variables
  4. data structures
  5. summarizing data
  6. loops, flow-control
  7. apply
  8. basic stats in R
  9. reading and writing delimited data
  10. plotting with base R graphics
  11. loading and installing packages
  12. plotting with ggplot2

interacting with R

  • command-line interpreter
  • GUI interpreter: RStudio

command-line interpreter

  • every R installation includes one
  • just type R at your command-line shell:
R version 3.0.2 -- "Frisbee Sailing"
Platform: x86_64-apple-darwin13.0.0 (64-bit)
...

Type 'q()' to quit R.

> 
  • The caret (>) is your prompt for entering commands
  • I will omit the caret for the rest of the presentation

GUI interpreter: RStudio

RStudio is an integrated development environment including:

  • interpreter with code completion
  • text editor with syntax highlighting and completion
  • file browser
  • version control manager
  • visual object workspace
  • command history

the R interpreter

# This is a comment, which is ignored
# functions are applied with ()
print("hello") 
[1] "hello"
  • anything in quotes is a “string”
  • anything else is either a number or:
    • function
    • class
    • operator (+-/?%&=<>|!^*)

using R as a calculator

Addition

2 + 2
[1] 4

Subtraction

5 - 2
[1] 3

using R as a calculator

Multiplication

2 * 2
[1] 4

Division

5 / 2
[1] 2.5

using R as a calculator

Exponents

2^4
[1] 16

Logarithms

log10(100)
[1] 2
log2(4)
[1] 2

using R as a calculator

Order of operations

10 / 2 - 1
[1] 4
10 - 5 / 5
[1] 9
(10 - 5) / 5
[1] 1

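Be careful. Operators are evaluated by precedence, not strictly left to right: multiplication and division happen before addition and subtraction. Use parentheses to force a different order.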
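
A few more cases may help make precedence concrete. The expressions below are illustrative and were not part of the original slides.

```r
# Exponentiation binds tighter than unary minus, so this is -(2^2)
-2^2        # -4
(-2)^2      # 4

# ^ is evaluated right to left: 2^(3^2), not (2^3)^2
2^3^2       # 512

# Operators of equal precedence (here - and +) go left to right
10 - 5 + 2  # 7
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.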

variables

x <- 1
x
[1] 1

Variables can be assigned (<-) a value

variables

x <- 1
y <- 2
x <- y
x
[1] 2
y
[1] 2

But be careful because they can be re-assigned

data structures: comparisons

x <- 0
x > 1 ## x is greater than 1
[1] FALSE
x < 1 ## x is less than 1
[1] TRUE

data structures: comparisons

x == 1
[1] FALSE
x == 0
[1] TRUE
x != 0
[1] FALSE

Comparisons result in boolean values

data structures: vectors

x <- 3
y <- c(1,2,x)
y
[1] 1 2 3

Vectors can only hold elements of the same type.
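
As a small sketch of what the single-type rule means in practice, here is what happens when types are mixed inside c(). The variable name mixed is made up for this example.

```r
# Mixing types in c() coerces every element to the most general type
mixed <- c(1, "a", TRUE)
mixed              # "1" "a" "TRUE"
class(mixed)       # "character"

# Logical values become numbers when combined with numbers
c(1, TRUE, FALSE)  # 1 1 0
```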

data structures: vectors

names(y) <- c("one", "two", "three")
y
  one   two three 
    1     2     3 

Vectors can also have names for each element.

data structures: vectors

z <- y * 3
z
  one   two three 
    3     6     9 
sum(z)
[1] 18

Arithmetic can be performed on a vector, which applies that operation to every element and returns a new vector.
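
The same element-by-element behavior applies when both operands are vectors. A minimal sketch, using two made-up vectors a and b:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b   # 11 22 33
a * b   # 10 40 90

# If one vector is shorter, it is recycled to match the longer one
a + c(100, 200, 300, 400, 500, 600)   # 101 202 303 401 502 603
```

Recycling is convenient but easy to trigger by accident, so it is worth checking vector lengths when results look odd.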

data structures: vector indexing

  one   two three 
    3     6     9 
z[1]
one 
  3 
z["one"]
one 
  3 

Vectors can be indexed by 1-based position or by name.

data structures: vector slicing

z
  one   two three 
    3     6     9 
z[2:3]
  two three 
    6     9 

Slicing a vector is as easy as specifying start:end.

data structures: vector slicing

z[-1]
  two three 
    6     9 
z[-2:-3]
one 
  3 

Remove elements from a vector using negative indices.
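
Index vectors built with c() work for both selecting and dropping elements. The sketch below rebuilds z with the same values used above so that it runs on its own.

```r
z <- c(one = 3, two = 6, three = 9)   # same values as z above

z[c(1, 3)]    # select positions 1 and 3: one = 3, three = 9
z[-c(1, 2)]   # drop positions 1 and 2: three = 9

# Parentheses matter here: -(1:2) is -1, -2, but -1:2 is -1, 0, 1, 2,
# and mixing negative and positive indices is an error
z[-(1:2)]     # three = 9
```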

data structures: lists

q <- list(y, z)
q
[[1]]
  one   two three 
    1     2     3 

[[2]]
  one   two three 
    3     6     9 

Lists can contain vectors.

data structures: list indexing

q[[1]]
  one   two three 
    1     2     3 
q[[1]][1]
one 
  1 

List elements are extracted with double brackets ([[ ]]); the result can then be indexed like any vector.
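
Single and double brackets behave differently on lists, which the following sketch tries to show. The element names first and second are assumptions added for illustration; the list on the slides is unnamed.

```r
y <- c(one = 1, two = 2, three = 3)
z <- y * 3
q <- list(first = y, second = z)   # element names are hypothetical

class(q[1])            # "list": single brackets return a shorter list
class(q[[1]])          # "numeric": double brackets return the element itself

q$second               # named elements can also be reached with $
q[["second"]]["two"]   # two = 6
```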

data structures: sequences

v <- seq(1,9) ## or 1:9
v
[1] 1 2 3 4 5 6 7 8 9

Let's construct a sequence of 9 numbers.

data structures: sequences

c(v,v)
 [1] 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9
rep(v, times=3)
 [1] 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9

We can concatenate or repeat a vector as well.

data structures: matrices

mt <- matrix(v, nrow=3)
mt
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
matrix(v, nrow=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Matrices are created from vectors and filled column by column by default; byrow=TRUE fills them row by row.

data structures: matrix indexing

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
mt[1,1]
[1] 1
mt[3,3]
[1] 9

Matrices are indexed as [row,col]
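
Leaving one side of the comma empty selects a whole row or column. A short sketch, rebuilding mt so the block runs on its own:

```r
mt <- matrix(1:9, nrow = 3)

mt[1, ]        # first row: 1 4 7
mt[, 2]        # second column: 4 5 6
mt[1:2, 2:3]   # 2 x 2 submatrix with rows (4, 7) and (5, 8)

# By default a single row or column is simplified to a vector;
# drop = FALSE keeps it as a matrix
dim(mt[1, , drop = FALSE])   # 1 3
```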

data structures: dimension

dim(mt)
[1] 3 3
nrow(mt)
[1] 3
ncol(mt)
[1] 3

Dimensionality, number of rows and number of columns can be computed using these functions.

data structures: dataframes

df <- data.frame(y, z)
colnames(df) <- c("first","second")
df
      first second
one       1      3
two       2      6
three     3      9

Dataframes are like matrices, but each column has a name and can hold a different type.

data structures: dataframe indexing

      first second
one       1      3
two       2      6
three     3      9
df$first
[1] 1 2 3

Dataframes can be indexed by column name with $ to return a vector.

data structures: dataframe indexing

      first second
one       1      3
two       2      6
three     3      9
df["first"]
      first
one       1
two       2
three     3

Dataframes can be indexed by column name in single brackets to return a one-column dataframe.

data structures: dataframe indexing

      first second
one       1      3
two       2      6
three     3      9
df$first[1]
[1] 1

Dataframes can be further indexed to return individual elements
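
Dataframes also accept the [row, col] form used for matrices. The sketch below rebuilds df directly, rather than from y and z, so that it is self-contained.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 6, 9),
                 row.names = c("one", "two", "three"))

df[2, "second"]   # single value: 6
df["two", ]       # whole row, returned as a one-row dataframe
df[, "first"]     # whole column as a vector: 1 2 3
df[["first"]]     # same column, list-style
```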

data structures: logical indexing

      first second
one       1      3
two       2      6
three     3      9
df > 3
      first second
one   FALSE  FALSE
two   FALSE   TRUE
three FALSE   TRUE

Dataframes, just like other structures, can be compared, resulting in boolean values.

data structures: logical indexing

      first second
one   FALSE  FALSE
two   FALSE   TRUE
three FALSE   TRUE
df[df > 3]
[1] 6 9

Passing the boolean result of comparison as an index returns only elements where the comparison was TRUE.

data structures: logical indexing

      first second
one   FALSE  FALSE
two   FALSE   TRUE
three FALSE   TRUE
which(df > 3)
[1] 5 6

The which function converts a boolean index to a numeric index.
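
In practice, logical indexing is often used to keep whole rows that meet a condition on one column. A minimal sketch; the names big and hits are made up for this example.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 6, 9),
                 row.names = c("one", "two", "three"))

# Keep the rows where 'second' exceeds 3 (note the trailing comma)
big <- df[df$second > 3, ]
big                    # rows "two" and "three"

which(df$second > 3)   # 2 3

# arr.ind = TRUE reports row and column positions instead of
# counting down the columns
hits <- which(df > 3, arr.ind = TRUE)
hits                   # (row 2, col 2) and (row 3, col 2)
```

The positions 5 and 6 returned by which(df > 3) come from counting down the first column and then the second.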

data structures: dataframe binding

      first second
one       1      3
two       2      6
three     3      9
cbind(df, data.frame("third"=c(9,18,27)))
      first second third
one       1      3     9
two       2      6    18
three     3      9    27

Dataframe columns can be bound to form a new dataframe.

data structures: dataframe binding

      first second
one       1      3
two       2      6
three     3      9
rbind(df, data.frame("first"=4, "second"=12, row.names="four"))
      first second
one       1      3
two       2      6
three     3      9
four      4     12

Dataframe rows can be bound to form a new dataframe.

summarizing data

library(datasets)
dim(cars)
[1] 50  2
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

summarizing data

mean(cars$speed)
[1] 15.4
median(cars$speed)
[1] 15
sd(cars$speed)
[1] 5.288

Mean, median and standard deviation.

summarizing data

summary(cars)
     speed           dist    
 Min.   : 4.0   Min.   :  2  
 1st Qu.:12.0   1st Qu.: 26  
 Median :15.0   Median : 36  
 Mean   :15.4   Mean   : 43  
 3rd Qu.:19.0   3rd Qu.: 56  
 Max.   :25.0   Max.   :120  

Summarizing a dataframe returns the minimum, quartiles, mean and maximum of each column.
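
The individual pieces of that summary can also be computed directly. A short sketch using the same cars data:

```r
# Default quantiles are 0%, 25%, 50%, 75% and 100%
quantile(cars$speed)                # 4 12 15 19 25

# Any other percentile can be requested with probs
quantile(cars$speed, probs = 0.9)

range(cars$dist)                    # 2 120
```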

loops, flow-control: for loops

for (x in 1:10) {
  print(x)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Use for loops to repeat a task a certain number of times.
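
Loops are often used to fill in a result one element at a time. A minimal sketch; the vector name squares is made up for this example.

```r
squares <- numeric(5)             # pre-allocate a vector of five zeros
for (i in seq_along(squares)) {   # i takes the values 1, 2, 3, 4, 5
  squares[i] <- i^2
}
squares                           # 1 4 9 16 25
```

For simple cases like this one, the vectorized form (1:5)^2 gives the same values without a loop.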

loops, flow-control: if/else

x <- 0
if (x == 0) { print("yes") }
[1] "yes"
if (x > 1) { print("yes") } else { print("no") }
[1] "no"
  • If statements only execute code if the condition evaluates to TRUE.
  • Else statements execute when the condition is not satisfied.
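
Several conditions can be chained with else if, and ifelse() offers a vectorized alternative. A small sketch; sign_label is a made-up name.

```r
x <- 5
if (x < 0) {
  sign_label <- "negative"
} else if (x == 0) {
  sign_label <- "zero"
} else {
  sign_label <- "positive"
}
sign_label                             # "positive"

# ifelse() tests every element of a vector at once
ifelse(c(-1, 0, 2) > 0, "yes", "no")   # "no" "no" "yes"
```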

loops, flow-control: while loops

x <- 0
while (x < 5) {
  print(x)
  x <- x + 1
}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4

Use while loops to repeat a task while a condition (x<5) is true.

apply: functional application

      first second
one       1      3
two       2      6
three     3      9
apply(df, 1, sum)
  one   two three 
    4     8    12 
apply(df, 2, sum)
 first second 
     6     18 

Apply a function over array rows (1) or columns (2).
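
apply also accepts functions you write yourself. The sketch below passes an anonymous function; the argument name row is arbitrary.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 6, 9),
                 row.names = c("one", "two", "three"))

# MARGIN = 1 walks over rows: spread between largest and smallest value
apply(df, 1, function(row) max(row) - min(row))   # one 2, two 4, three 6

# MARGIN = 2 walks over columns
apply(df, 2, mean)                                # first 2, second 6

# For plain sums, rowSums() and colSums() are shortcuts
rowSums(df)                                       # one 4, two 8, three 12
```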

sapply: simpler apply

      first second
one       1      3
two       2      6
three     3      9
sapply(df, sqrt)
     first second
[1,] 1.000  1.732
[2,] 1.414  2.449
[3,] 1.732  3.000

sapply applies a function to each element of a list or vector (here, each column of df) and simplifies the result to a vector or matrix where possible.
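
Comparing sapply with lapply shows what the simplification does. A minimal sketch on the same df:

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 6, 9),
                 row.names = c("one", "two", "three"))

lapply(df, max)     # a list: $first is 3, $second is 9
sapply(df, max)     # a named vector: first 3, second 9

# When the function returns two values per column, sapply gives a matrix
sapply(df, range)   # 2 x 2: column first is (1, 3), column second is (3, 9)
```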

reading and writing delimited data

write.table(df, file = "example.txt")
write.table(df, file = "example.tsv", sep = "\t")
write.csv(df, file = "example.csv")

Write 1) space-delimited, 2) tab-delimited, 3) comma-delimited files containing dataframe df.

reading and writing delimited data

df1 <- read.table("example.txt", header=TRUE)
df2 <- read.delim("example.tsv", sep = "\t")
df3 <- read.csv("example.csv", row.names = 1)
identical(df1,df2)
[1] TRUE
identical(df2,df3)
[1] TRUE

All three files result in equivalent dataframes.

reading and writing delimited data

Issues to consider when reading and writing delimited files:

  1. Do I want/have column names (header)?
  2. Do I want/have row names?
  3. What is my delimiter?
  4. Do I want/have quotes surrounding each value?

Check the default behavior of the reading/writing function first.
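
Setting these options explicitly avoids surprises. The sketch below writes to a temporary file (path and back are made-up names) with no row names and no quotes, then reads it back.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 6, 9),
                 row.names = c("one", "two", "three"))

path <- tempfile(fileext = ".csv")   # scratch file, removed when R exits

# Answer the four questions explicitly: header yes, row names no,
# comma delimiter, no quotes
write.table(df, file = path, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = TRUE)

readLines(path)   # "first,second" "1,3" "2,6" "3,9"

back <- read.table(path, sep = ",", header = TRUE)
back              # same values; row names are now 1, 2, 3
```

Because the row names were not written, the original labels (one, two, three) are lost on the way back in.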

plotting with base R graphics

head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

plotting with base R graphics: scatterplot

plot(cars)

  • plot accepts a dataframe with two columns
  • column 1 = x axis
  • column 2 = y axis

plotting with base R graphics: line plot

plot(cars, type="l")

  • valid plot types:
    • “p” for points
    • “l” for lines
    • “b” for both (“o” for overplotted)
    • “h” for ‘histogram’-like lines
    • “s” for stair steps (“S” for other)
    • “n” for no plotting.

plotting with base R graphics: linear regression

lmcars <- lm(dist ~ speed, cars)
lmcars

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
     -17.58         3.93  
  • lm fits a linear model: response ~ terms
  • in this case the response is stopping distance (dist), modeled as a function of speed
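
Once fitted, the model object can be queried for its coefficients or used for prediction. A brief sketch; the speed value of 20 is an arbitrary example, and the numbers in the comments are rounded.

```r
lmcars <- lm(dist ~ speed, data = cars)

coef(lmcars)                  # intercept about -17.58, slope about 3.93

# Predicted stopping distance at a speed of 20
predict(lmcars, newdata = data.frame(speed = 20))   # about 61.1

summary(lmcars)$r.squared     # about 0.65
```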

plotting with base R graphics: linear regression

plot(cars)
abline(lmcars)

  • abline draws a line from slope and intercept
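
abline can also take the intercept and slope directly, or draw horizontal and vertical reference lines. A sketch using the rounded coefficients printed above; the dashed mean lines are an added illustration.

```r
plot(cars)

# Intercept (a) and slope (b) given explicitly, rounded from the fit
abline(a = -17.58, b = 3.93)

# Horizontal and vertical dashed lines at the column means
abline(h = mean(cars$dist), lty = 2)
abline(v = mean(cars$speed), lty = 2)
```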

plotting with base R graphics: graphics parameters

plot(cars, main="Speed vs. Distance", xlab="Speed", ylab="Distance", ylim=c(0,100))
abline(lmcars)

plotting with base R graphics: graphics parameters

plot(cars, col="red", pch=16, cex=2)
abline(lmcars, col="blue")

plotting with base R graphics: histograms

hist(cars$speed)