R Programming

R Overview and History

What is R?

R is a dialect of the S language.

What is S?

  • Developed by John Chambers et al. at Bell Labs

  • Initiated in 1976 as an internal statistical analysis environment; early versions did not contain functions for statistical modeling

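  • In 1988 it was rewritten in C (Version 3); the White Book (Statistical Models in S) documents the statistical modeling functionality that was added

  • Version 4 was released in 1998 and is documented in the Green Book (Programming with Data); the language has changed little since then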

S Philosophy by John Chambers:

“Create an interactive environment where users did not consciously think of themselves as programming, but as their needs evolved they should be able to slide into programming, when the language and system aspects would become more important.”

  • R was created in 1991 in New Zealand by Ross Ihaka and Robert Gentleman (their experience developing R is documented in a 1996 JCGS paper)

  • 1993 - First announcement of R to the public

  • 1995 - R is released under the GNU General Public License (R becomes free software)

  • 1996 - The R-help and R-devel public mailing lists are created

  • 1997 - The R Core Group is formed (the core group controls the source code of R)

  • 2000 - R version 1.0.0 is released

Features of R

  • Quite lean -> functionality is divided into modular packages

  • Graphics capabilities are very sophisticated and better than those of most stats packages

  • Useful for interactive work but contains a powerful programming language for developing new tools

  • Very active and vibrant user community

  • It’s free!

Free Software

  1. Freedom to run the program, for any purpose

  2. Freedom to study how the program works and adapt it to your needs

  3. Freedom to redistribute copies

  4. Freedom to improve the program and release those improvements

Drawbacks of R

  • Based on 40-year-old technology

  • Little support for dynamic or 3-D graphics

  • Functionality is driven by consumer demand and user contributions: if a feature you want is missing, you have to implement it yourself

  • Objects must be stored in physical memory

Design of the R System

The R system is divided into 2 conceptual parts:

  1. The “base” R system that you download from CRAN

  2. Everything else

The R functionality is divided into a number of packages:

  1. The “base” R system contains the base package, which is required to run R

  2. Other packages included in the “base” system: utils, stats, datasets, graphics...

  3. Recommended packages: boot, class, cluster, KernSmooth, lattice...

  4. About 4000 packages on CRAN

  5. And more at Bioconductor Project

Using R

R Console Input and Evaluation

What we type at the R prompt is called an expression.

<- is the assignment operator

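When an expression is evaluated (after you press Enter) and its value is printed, the number in square brackets at the start of each output line is the position in the vector of the first element shown on that line

: is the operator that creates a sequence of integers (e.g. 1:20)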
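
The points above can be sketched at the prompt as follows. This is a minimal illustration; the variable names x and s are arbitrary choices, not part of the original notes.

```r
x <- 5L      # assignment: stores the integer 5 in x, nothing is printed
x            # auto-printing: typing the name prints the value
print(x)     # explicit printing gives the same output

s <- 1:20    # the : operator builds the integer sequence 1, 2, ..., 20
s            # each output line starts with the index of its first element
```

On a narrow console the sequence wraps onto several lines, and the bracketed number at the start of each line shows where in the vector that line begins.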

R Data Types: Objects and Attributes

R has five basic “atomic” classes of objects:

  1. character

  2. numeric (real numbers)

  3. integer

  4. complex

  5. logical (TRUE/FALSE)

  • A vector can only contain objects of the same class

  • A list is represented as a vector but can contain objects of different classes

  • Empty vectors are created with the vector() function

Numbers

  • Numbers are generally treated as numeric objects

  • If you explicitly want an integer, you need to use the suffix L (e.g. entering 1 gives you a numeric object; entering 1L gives you an integer)

  • Special number Inf - represents infinity (e.g. 1/0)

  • NaN - “Not a number”: an undefined value (e.g. 0/0); it can also be thought of as a missing value
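
A short sketch of the atomic classes and special numbers described above. The variable name v is only illustrative.

```r
class(1)        # "numeric"
class(1L)       # "integer"
class("a")      # "character"
class(TRUE)     # "logical"
class(1 + 2i)   # "complex"

# vector() creates an empty vector of a given class and length
v <- vector("numeric", length = 3)
v               # 0 0 0

1 / 0           # Inf
-1 / 0          # -Inf
0 / 0           # NaN
is.nan(0 / 0)   # TRUE
```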

Attributes

R objects can have attributes:

  • names, dimnames

  • dimensions (e.g. matrices, arrays)

  • class

  • length

  • other user-defined attributes/metadata

  • Attributes of an object can be accessed using attributes()
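
As an illustration of the list above, the sketch below inspects the attributes of a matrix and of a named vector. The attribute name "units" is a made-up example of user-defined metadata.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
attributes(m)    # a list with one component, $dim
dim(m)           # 2 3
length(m)        # 6

x <- c(a = 1, b = 2)
attributes(x)    # a list with one component, $names

# attr() sets or reads a single attribute; "units" is a hypothetical name
attr(x, "units") <- "cm"
attributes(x)    # now contains $names and $units
```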

R Data Types: Vectors and Lists

Vectors

c() -> concatenates objects into a vector

When mixing different classes, R coerces every element to the least common denominator:

  • Numeric and Char -> Char

  • Logical and Numeric -> Numeric (FALSE = 0 / TRUE = 1)

  • Char and Logical -> Char (“TRUE”)

Explicit coercion -> objects can be explicitly coerced from one class to the other:

  • as.numeric()

  • as.logical()

  • as.character()
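
Both kinds of coercion can be tried directly. This is a small sketch; the values are arbitrary.

```r
# Implicit coercion when mixing classes in c()
c(1.7, "a")      # character: "1.7" "a"
c(TRUE, 2)       # numeric:   1 2
c("a", TRUE)     # character: "a" "TRUE"

# Explicit coercion
x <- 0:3
class(x)         # "integer"
as.numeric(x)    # 0 1 2 3
as.logical(x)    # FALSE TRUE TRUE TRUE
as.character(x)  # "0" "1" "2" "3"

# Coercion that makes no sense produces NA (normally with a warning)
y <- suppressWarnings(as.numeric(c("a", "b")))
y                # NA NA
```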

Lists

Elements of a list are printed and indexed with double brackets, e.g. x[[1]]
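
A minimal sketch of a list holding objects of different classes. The contents are arbitrary.

```r
x <- list(1, "a", TRUE, 1 + 4i)
x          # elements are printed under [[1]], [[2]], ...
x[[2]]     # "a": the element itself
x[2]       # a list of length 1 that contains "a"
```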

R Data Types: Matrices

Matrices are vectors with a dimension attribute, which is an integer vector of length 2 (nrow, ncol).

matrix(nrow = 2, ncol = 3)

Matrices are constructed column-wise, so:

matrix(1:6, nrow=2, ncol=3)

1 3 5

2 4 6

Create a matrix from a vector just adding the dimension attributes:

m <- 1:10

dim(m) <- c(2,5)

Create a matrix by binding columns and rows:

x <- 1:3

y <- 10:12

cbind(x,y)

1 10

2 11

3 12

rbind(x,y)

1 2 3

10 11 12

R Data Types: Factors

Factors are used to represent categorical data (ordered or unordered)

factor(c(characters))

The order of the factors can be organized by using the levels argument (in modelling the first level is used as the baseline level):

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
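
The effect of the levels argument can be checked as follows. This is a sketch using the same data as above; y is an illustrative name.

```r
# Without levels, the level order is alphabetical
x <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x)     # "no" "yes"
table(x)      # counts per level: no = 2, yes = 3
unclass(x)    # the underlying integer codes plus a levels attribute

# With levels, "yes" becomes the first (baseline) level
y <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(y)     # "yes" "no"
```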

R Data Types: Missing values

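Missing values are denoted by NA; undefined mathematical results are denoted by NaN

is.na() -> tests whether objects are NA

is.nan() -> tests for NaN

A NaN value is also NA, but the converse is not true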

NA values have a class also, so there are integer NA, character NA, etc.
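
A short sketch of how the two tests behave, and of typed NA values. The vectors are arbitrary.

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)     # FALSE FALSE TRUE FALSE FALSE
is.nan(x)    # FALSE FALSE FALSE FALSE FALSE

y <- c(1, 2, NaN, NA, 4)
is.na(y)     # FALSE FALSE TRUE TRUE FALSE (NaN counts as NA)
is.nan(y)    # FALSE FALSE TRUE FALSE FALSE (NA is not NaN)

class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"
```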

R Data Types: Data Frames

  • Represented as a special type of list where every element of the list has to have the same length

  • Data frames can store different classes of objects in each column

  • Have a special attribute called row.names

  • Usually created when importing data with read.table() or read.csv()

  • Can be converted to a matrix using data.matrix()

data.frame(foo = 1:4, bar = c(T, T, F, F))

foo bar

1 TRUE

2 TRUE

3 FALSE

4 FALSE
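
The properties listed above can be inspected on the same data frame. This is a minimal sketch; df and m are illustrative names.

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
nrow(df)             # 4
ncol(df)             # 2
sapply(df, class)    # foo is "integer", bar is "logical"
row.names(df)        # "1" "2" "3" "4"

# data.matrix() converts every column to numbers (TRUE/FALSE become 1/0)
m <- data.matrix(df)
m
```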

R Data Types: The names attribute

x <- 1:3

names(x) <- c("a", "b", "c")

a b c

1 2 3
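
Names are not limited to vectors. The sketch below shows names on a list and dimnames on a matrix; the labels are arbitrary.

```r
x <- list(a = 1, b = 2, c = 3)
names(x)      # "a" "b" "c"

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))   # row names, then column names
m
m["a", "d"]   # 3
```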

Reading tabular data

Principal functions for reading data into R:

  • read.table and read.csv for reading tabular data

  • readLines for reading lines of a text file

  • source for reading in R code files (inverse of dump)

  • dget for reading in R code files (inverse of dput)

  • load for reading in saved workspace

  • unserialize for reading single R objects in binary form

Principal functions for writing data from R to files (the analogues of the functions above):

  • write.table

  • writeLines

  • dump

  • dput

  • save

  • serialize

Small to moderate datasets

read.table and read.csv arguments:

  • file -> name of a file or a connection

  • header -> logical indicating whether the file has a header line

  • sep -> string indicating how the columns are separated

  • colClasses -> a character vector indicating the class of each column in the dataset

  • nrows -> the number of rows in the dataset

  • comment.char -> a character string indicating the comment character

  • skip -> the number of lines to skip from the beginning

  • stringsAsFactors -> should character variables be coded as factors?
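
The arguments above can be combined in a single call. The sketch below first writes a small temporary file so that it is self-contained; the file contents and column names are invented for illustration.

```r
# Create a tiny comma-separated file with one leading line to skip
path <- tempfile(fileext = ".csv")
writeLines(c("generated for this example",
             "id,name,score",
             "1,ann,9.5",
             "2,bob,7.0"), path)

dat <- read.table(path,
                  header = TRUE,             # first line read is the header
                  sep = ",",                 # columns are comma-separated
                  colClasses = c("integer", "character", "numeric"),
                  nrows = 2,                 # number of data rows to read
                  comment.char = "",         # the file contains no comments
                  skip = 1,                  # skip the first line of the file
                  stringsAsFactors = FALSE)  # keep strings as character
str(dat)
unlink(path)                                 # remove the temporary file
```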

Large datasets

Things that help:

  • Specify the colClasses argument:

    initial <- read.table("data.txt", nrows = 100)

    classes <- sapply(initial, class)

    tabAll <- read.table("data.txt", colClasses = classes)

Memory calculation:

Example: a data frame with 1,500,000 rows and 120 columns (all numeric):

1,500,000 * 120 * 8 bytes/numeric

= 1,440,000,000 bytes = 1,373.29 MB = 1.34 GB
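
The same arithmetic can be checked in R. The variable names are illustrative, and the conversion assumes 2^20 bytes per MB and 2^30 bytes per GB.

```r
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8   # 8 bytes per numeric value
bytes                      # 1.44e+09
bytes / 2^20               # about 1373.29 MB
bytes / 2^30               # about 1.34 GB
```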

Textual Data Formats

dump and dput are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable

Textual formats:

  • can work much better with version control programs

  • can be longer-lived

  • adhere to the “Unix philosophy”

dput and dget objects:

y<-data.frame(a=1,b="a")

dput(y)

dput(y, file = "y.R") -> writes the output of dput(y) to a file

new.y <- dget("y.R")
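
For comparison, dump and source work on several named objects at once. The sketch below uses a temporary file so that it is self-contained; the object values are arbitrary.

```r
x <- "foo"
y <- data.frame(a = 1, b = "a")

path <- tempfile(fileext = ".R")
dump(c("x", "y"), file = path)   # write both objects to the file as R code
rm(x, y)                         # remove them from the workspace

source(path, local = TRUE)       # run the file: x and y are re-created here
y
unlink(path)                     # remove the temporary file
```

Unlike dput, dump takes the names of the objects as a character vector, and source restores them under those same names.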

Interfaces to the outside world

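R can open connections to the outside world with these functions:

  • file -> opens a connection to a file

  • gzfile -> opens a connection to a file compressed with gzip

  • bzfile -> opens a connection to a file compressed with bzip2

  • url -> opens a connection to a webpage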
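
A minimal sketch of working with a connection, here a gzip-compressed temporary file. The file contents are invented for illustration.

```r
path <- tempfile(fileext = ".gz")

con <- gzfile(path, "w")       # open a compressed file for writing
writeLines(c("first line", "second line"), con)
close(con)

con <- gzfile(path, "r")       # reopen it for reading
x <- readLines(con, n = 1)     # read only the first line
close(con)
x                              # "first line"

unlink(path)                   # remove the temporary file
```

Connections make it possible to read part of a file, as with n = 1 here, instead of loading all of it.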

Subsetting R Objects

Basics

Operators for subsetting:

  • [ -> returns an object of the same class as the original (can extract multiple elements)

  • [[ -> extracts a single element of a list or a data frame; the returned object is not necessarily a list or data frame

  • $ -> extract elements of a list or data frame by name
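
The three operators can be compared on a vector and a list. This is a small sketch; the names and values are arbitrary.

```r
x <- c("a", "b", "c", "c", "d", "a")
x[1]          # "a"
x[1:4]        # "a" "b" "c" "c"
x[x > "a"]    # logical index: "b" "c" "c" "d"

l <- list(foo = 1:4, bar = 0.6)
l[1]          # a list containing foo (same class as the original)
l[[1]]        # the integer vector 1 2 3 4
l$bar         # 0.6
l[["bar"]]    # 0.6
```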

Lists

Subsetting nested elements of a list:

x <-list(a = list(10,12,14), b=c(3.14, 2.81))

x[[c(1,3)]] -> third element of the first element of x

14

Matrices

x[row, column]
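
A short sketch of matrix subsetting. Omitting an index selects the whole row or column, and drop = FALSE keeps the result as a matrix.

```r
x <- matrix(1:6, nrow = 2, ncol = 3)
x[1, 2]                 # 3
x[1, ]                  # first row: 1 3 5
x[, 2]                  # second column: 3 4

# By default the result is simplified to a vector; drop = FALSE prevents that
x[1, 2, drop = FALSE]   # a 1 x 1 matrix
x[1, , drop = FALSE]    # a 1 x 3 matrix
```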

Partial matching

x <- list(aardvark = 1:5)

x$a -> gets aardvark

x[["a", exact = FALSE]] -> gets aardvark (by default [[ requires an exact match)

Removing NA’s

x <- c(1,2,NA,4,NA,5)

bad <- is.na(x)

x[!bad]

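complete.cases() -> returns a logical vector indicating which cases (rows of a data frame, or positions across several vectors) have no missing values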
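
A sketch of complete.cases() on two vectors and on a data frame. The data are invented for illustration.

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")

good <- complete.cases(x, y)   # TRUE where both x and y are non-missing
good                           # TRUE TRUE FALSE TRUE FALSE TRUE
x[good]                        # 1 2 4 5
y[good]                        # "a" "b" "d" "f"

df <- data.frame(p = c(1, NA, 3), q = c("u", "v", NA))
df[complete.cases(df), ]       # keeps only the first row
```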

R Programming

Control structures

Allow you to control the flow of execution of the program

  • if, else -> testing a condition

  • for -> execute a loop a fixed number of times

  • while -> execute a loop while a condition is true

  • repeat -> execute an infinite lo