R Programming

R Overview and History

What is R?

R is a dialect of the S language.

What is S?

  • Developed by John Chambers et al. at Bell Labs

  • Initiated in 1976 as an internal statistical analysis environment; early versions did not contain functions for statistical modeling

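  • In 1988 it was rewritten in C (Version 3); the White Book (Statistical Models in S, by Chambers and Hastie) documents the statistical analysis functionality added in this version

  • Version 4 was released in 1998 and is documented in the Green Book (Programming with Data, by John Chambers); the language has not changed much since then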

S Philosophy by John Chambers:

“Create an interactive environment where they did not consciously think of themselves as programming but as the needs evolved they should be able to slide in to programming when the language and system aspects would become more important”

  • R was created in 1991 in New Zealand by Ross Ihaka and Robert Gentleman (their experience developing R is documented in a 1996 JCGS paper)

  • 1993 - First announcement of R to the public

  • 1995 - R is released under the GNU General Public License (R becomes free software)

  • 1996 - The public mailing lists R-help and R-devel are created

  • 1997 - The R Core Group is formed (the core group controls the R source code)

  • 2000 - R version 1.0.0 is released

Features of R

  • Quite lean -> functionality is divided into modular packages

  • Graphics capabilities are very sophisticated and better than those of most statistics packages

  • Useful for interactive work but contains a powerful programming language for developing new tools

  • Very active and vibrant user community

  • It’s free!

Free Software

  1. Freedom to run the program, for any purpose

  2. Freedom to study how the program works and adapt it to your needs

  3. Freedom to redistribute copies

  4. Freedom to improve the program and release those improvements

Drawbacks of R

  • Based on 40-year-old technology

  • Little support for dynamic or 3-D graphics

  • Functionality is based on consumer demand and user contributions: if a feature you want is missing, you have to implement it yourself!

  • Objects must be stored in physical memory

Design of the R System

The R system is divided into 2 conceptual parts:

  1. The “base” R system that you download from CRAN

  2. Everything else

The R functionality is divided into a number of packages:

  1. The “base” R system contains the base package, which is required to run R

  2. Other packages included in the “base” system: utils, stats, datasets, graphics...

  3. Recommended packages: boot, class, cluster, KernSmooth, lattice...

  4. About 4000 packages on CRAN

  5. And more at the Bioconductor project

Using R

R Console Input and Evaluation

What we type at the R prompt is called an expression:

<- is the assignment operator

When an expression is evaluated (after you press Enter) and its result is printed, the number in square brackets indicates the position in the vector of the first element shown on that line

: is the operator that creates a sequence of integers (e.g. 1:20)
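The points above can be tried at the prompt with a short sketch like the following. The variable names x, msg and y are only illustrative.

```r
x <- 1          # assignment: nothing is printed
print(x)        # explicit printing
x               # auto-printing: typing the name prints [1] 1

msg <- "hello"  # a character object

y <- 1:20       # the : operator creates the integer sequence 1, 2, ..., 20
y               # each printed line starts with the index of its first element
```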

R Data Types: Objects and Attributes

R has five basic “atomic” classes of objects:

  1. character

  2. numeric (real numbers)

  3. integer

  4. complex

  5. logical (TRUE/FALSE)

  • A vector can only contain objects of the same class

  • A list is represented as a vector but can contain objects of different classes

  • Empty vectors are created with the vector() function


  • Numbers are generally treated as numeric objects

  • If you explicitly want an integer, you need to use the suffix L (Ex: Entering 1 gives you a numeric object; entering 1L gives you an integer)

  • Special number Inf - represents infinity (e.g. 1/0)

  • NaN - “Not a number”; represents an undefined value (e.g. 0/0) and can also be thought of as a missing value
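A minimal sketch of the points above on vectors and numbers. The variable names are illustrative only.

```r
v <- vector("numeric", length = 3)  # empty numeric vector, filled with 0

a <- 1     # numeric (double precision)
b <- 1L    # integer, because of the L suffix
class(a)
class(b)

1 / 0      # Inf
0 / 0      # NaN
is.nan(0 / 0)
```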


R objects can have attributes:

  • names, dimnames

  • dimensions (e.g. matrices, arrays)

  • class

  • length

  • other user-defined attributes/metadata

  • Attributes of an object can be accessed using attributes()
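As a small illustration of attributes, the sketch below inspects a matrix and a named vector. The "units" attribute is a made-up example of user-defined metadata.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
attributes(m)               # a list with one element, dim

x <- c(a = 1, b = 2)
attributes(x)               # a list with one element, names

attr(x, "units") <- "cm"    # hypothetical user-defined attribute
attributes(x)               # now contains names and units
```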

R Data Types: Vectors and Lists


c() -> concatenates objects into a vector

When mixing different classes R coerces every element to the least common denominator:

  • Numeric and Char -> Char

  • Logical and Numeric -> Numeric (FALSE = 0 / TRUE = 1)

  • Char and Logical -> Char (TRUE becomes “TRUE”)
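The three coercion rules listed above can be checked directly. This is a sketch; y1, y2 and y3 are illustrative names.

```r
y1 <- c(1.7, "a")    # numeric + character -> character
y2 <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
y3 <- c("a", TRUE)   # character + logical -> character (TRUE becomes "TRUE")

class(y1)
class(y2)
class(y3)
```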

Explicit coercion -> objects can be explicitly coerced from one class to the other:

  • as.numeric()

  • as.logical()

  • as.character()
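A short sketch of explicit coercion with the functions above. When a conversion makes no sense, R returns NA and issues a warning, which is suppressed here only to keep the output clean.

```r
x <- 0:6

as.numeric(x)
as.logical(x)     # 0 becomes FALSE, every other number becomes TRUE
as.character(x)

# Nonsensical coercion produces NA values (normally with a warning)
bad <- suppressWarnings(as.numeric(c("a", "b", "c")))
bad
```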


Elements of a list are indexed by double brackets
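For illustration, a list holding objects of four different classes might look like this (the contents are arbitrary):

```r
x <- list(1, "a", TRUE, 1 + 4i)

x               # each element is printed under [[1]], [[2]], ...
x[[2]]          # the second element, a character vector
class(x[[4]])   # the fourth element keeps its own class
```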

R Data Types: Matrices

Matrices are vectors with a dimension attribute, which is itself an integer vector of length 2 (nrow, ncol).

matrix(nrow = 2, ncol = 3)

Matrices are constructed column-wise, so:

matrix(1:6, nrow=2, ncol=3)

1 3 5

2 4 6

Create a matrix from a vector by adding the dimension attribute:

m <- 1:10

dim(m) <- c(2,5)

Create a matrix by binding columns (cbind) or rows (rbind):

x <- 1:3

y <- 10:12

cbind(x, y)

1 10

2 11

3 12

rbind(x, y)

1 2 3

10 11 12

R Data Types: Factors

Factors are used to represent categorical data (ordered or unordered)


The order of the levels can be set using the levels argument of factor() (in modelling, the first level is used as the baseline level):

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
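The effect of the levels argument can be seen by comparing a factor created with and without it. This is a sketch; the names f_default and f_ordered are illustrative.

```r
f_default <- factor(c("yes", "yes", "no", "yes", "no"))
levels(f_default)    # alphabetical order: "no" comes first
table(f_default)     # counts per level
unclass(f_default)   # the underlying integer codes

f_ordered <- factor(c("yes", "yes", "no", "yes", "no"),
                    levels = c("yes", "no"))
levels(f_ordered)    # "yes" is now the first (baseline) level
```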

R Data Types: Missing values

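Missing values are denoted by NA, or by NaN for undefined mathematical operations.

is.na() -> tests objects if they are NA

is.nan() -> tests for NaN

A NaN value is also NA, but the converse is not true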

NA values have a class also, so there are integer NA, character NA, etc.
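A small sketch of the two tests and of typed NA values; the vector y is illustrative.

```r
y <- c(1, 2, NaN, NA, 4)

is.na(y)    # TRUE for both the NaN and the NA element
is.nan(y)   # TRUE only for the NaN element

class(NA)             # logical by default
class(NA_integer_)    # integer NA
class(NA_character_)  # character NA
```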

R Data Types: Data Frames

  • Represented as a special type of list where every element of the list has to have the same length

  • Data frames can store different classes of objects in each column

  • Have a special attribute called row.names

  • Usually created when importing data with read.table() or read.csv()

  • Can be converted to a matrix using data.matrix()
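The properties listed above can be checked on a small data frame. This is only a sketch: the column names id, name and ok are made up, and stringsAsFactors = FALSE is set so the behaviour is the same across R versions.

```r
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 ok = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

nrow(df)             # number of rows
ncol(df)             # number of columns
sapply(df, class)    # each column can have its own class
row.names(df)        # the special row.names attribute

# data.matrix() coerces the selected columns to a numeric matrix
m <- data.matrix(df[, c("id", "ok")])
m
```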

data.frame(foo = 1:4, bar = c(T, T, F, F))

  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE

R Data Types: The names attribute

x <- 1:3

names(x) <- c("a", "b", "c")

a b c

1 2 3

Reading tabular data

Principal functions for reading data into R:

  • read.table and read.csv for reading tabular data

  • readLines for reading lines of a text file

  • source for reading in R code files (inverse of dump)

  • dget for reading in R code files (inverse of dput)

  • load for reading in saved workspace

  • unserialize for reading single R objects in binary form

Principal functions for writing data from R to files:

  • write.table

  • writeLines

  • dump

  • dput

  • save

  • serialize
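As a sketch of how the reading and writing functions pair up, the example below writes to temporary files and reads the data back. The file names come from tempfile(), and the data are made up.

```r
df <- data.frame(foo = 1:3, bar = c("a", "b", "c"))

# write.table / read.table for tabular data
tab_file <- tempfile(fileext = ".txt")
write.table(df, tab_file, row.names = FALSE)
df2 <- read.table(tab_file, header = TRUE)

# writeLines / readLines for plain lines of text
txt_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_file)
lines <- readLines(txt_file)
```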

Small to moderate datasets

read.table() and read.csv() arguments:

  • file -> name of a file or a connection

  • header -> logical indicating if the file has a header line

  • sep -> string indicating how the columns are separated

  • colClasses -> a character vector indicating the class of each column in the dataset

  • nrows -> the number of rows in the dataset

  • comment.char -> a character string indicating the comment character

  • skip -> the number of lines to skip from the beginning

  • stringsAsFactors -> should character variables be coded as factors?
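A sketch of a read.table() call that sets each of the arguments above explicitly. The file is a temporary one created for the example, and its contents are invented.

```r
csv_file <- tempfile(fileext = ".csv")
writeLines(c("hypothetical sample data",   # a title line to be skipped
             "id,name,score",
             "1,ann,9.5",
             "2,bob,7.0",
             "3,cy,8.25"),
           csv_file)

d <- read.table(csv_file,
                header = TRUE,             # the first line read holds column names
                sep = ",",                 # columns are comma-separated
                colClasses = c("integer", "character", "numeric"),
                nrows = 3,                 # number of data rows
                comment.char = "",         # the file contains no comments
                skip = 1,                  # skip the title line
                stringsAsFactors = FALSE)  # keep character columns as character
str(d)
```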

Large datasets

Things that help:

  • Specify the colClasses argument:
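One common way to obtain the column classes, sketched here under the assumption that the first 100 rows are representative, is to read a small part of the file, let R infer the classes, and reuse them for the full read. The temporary file stands in for a large dataset.

```r
# Stand-in for a large file (hypothetical data)
big_file <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:1000,
                       b = rnorm(1000),
                       c = sample(letters, 1000, replace = TRUE)),
            big_file, row.names = FALSE)

# Read only the first rows and record the class of each column
initial <- read.table(big_file, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Read the whole file without making R guess the classes again
tab_all <- read.table(big_file, header = TRUE, colClasses = classes)
```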




Memory calculation:

Ex) Data frame with 1,500,000 rows and 120 columns (all numeric):

1,500,000 * 120 * 8 bytes/numeric

= 1,440,000,000 bytes

= 1,440,000,000 / 2^20 bytes/MB = 1,373.29 MB

= 1.34 GB
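The same arithmetic can be done in R. This sketch only reproduces the rough estimate above and ignores any per-object overhead.

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8

bytes <- rows * cols * bytes_per_numeric
mb <- bytes / 2^20   # bytes per MB
gb <- bytes / 2^30   # bytes per GB

round(mb, 2)
round(gb, 2)
```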

Textual Data Formats

dump and dput are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable

Textual formats:

  • can work much better with version control programs

  • can be longer-lived

  • adhere to the “Unix philosophy”

dput and dget objects:



dput(y, file = "y.R") -> writes the output of dput(y) to the file y.R; dget("y.R") reads the object back
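A round trip with both pairs of functions might look like the sketch below. Temporary files are used instead of y.R so nothing is written to the working directory, and source() is called with local = TRUE so the objects are restored in the calling environment.

```r
# dput / dget: one object, no name stored
y <- data.frame(a = 1, b = "a")
f <- tempfile(fileext = ".R")
dput(y, file = f)
new_y <- dget(f)

# dump / source: several objects, names stored
x <- "foo"
z <- c(1.5, 2.5)
g <- tempfile(fileext = ".R")
dump(c("x", "z"), file = g)
rm(x, z)
source(g, local = TRUE)   # recreates x and z
```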


Interfaces to the outside world

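Data are read in through connection interfaces. Connections can be made to:

  • file -> opens a connection to a file

  • gzfile -> opens a connection to a file compressed with gzip

  • bzfile -> opens a connection to a file compressed with bzip2

  • url -> opens a connection to a web page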
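A minimal sketch of working with connections explicitly, using temporary files so that it runs without network access. A url() connection is opened and read in the same way.

```r
# Write three lines to a gzip-compressed file through a connection
path <- tempfile(fileext = ".txt.gz")
con <- gzfile(path, "w")
writeLines(c("line one", "line two", "line three"), con)
close(con)

# Read only the first two lines back
con <- gzfile(path, "r")
first_two <- readLines(con, 2)
close(con)

# A plain file connection works the same way
plain <- tempfile(fileext = ".txt")
writeLines("hello", plain)
con <- file(plain, "r")
greeting <- readLines(con, 1)
close(con)
```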

Subsetting R Objects


Operators for subsetting:

  • [ -> returns an object of the same class as the original (can extract multiple elements)

  • [[ -> extracts a single element of a list or a data frame; the class of the returned object is not necessarily a list or data frame

  • $ -> extracts elements of a list or data frame by name
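The difference between the three operators can be seen on a small vector and list. This is a sketch with illustrative names.

```r
x <- c("a", "b", "c", "c", "d", "a")
x[1:4]        # numeric index, returns a character vector
x[x > "a"]    # logical index

lst <- list(foo = 1:4, bar = 0.6)
lst[1]        # a list containing the element foo
lst[[1]]      # the integer vector itself
lst$bar       # extraction by name

name <- "foo"
lst[[name]]   # [[ accepts a computed name; $ only accepts a literal name
```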


Subsetting nested elements of a list:

x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))

x[[c(1, 3)]] -> third element of the first element of x (here 14)



Subsetting a matrix: x[row, column] (leave an index empty to select an entire row or column)
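A short sketch of matrix subsetting. By default a single element or a single row comes back as a plain vector; drop = FALSE keeps the matrix structure.

```r
m <- matrix(1:6, 2, 3)

m[1, 2]                 # element in row 1, column 2
m[1, ]                  # entire first row
m[, 2]                  # entire second column
m[1, 2, drop = FALSE]   # a 1 x 1 matrix instead of a vector
```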

Partial matching

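x <- list(aardvark = 1:5)

x$a -> gets aardvark ($ does partial matching by default)

x[["a", exact = FALSE]] -> also gets aardvark; without exact = FALSE, x[["a"]] returns NULL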

Removing NA’s

x <- c(1,2,NA,4,NA,5)

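bad <- is.na(x)

x[!bad]

complete.cases() -> returns a logical vector indicating which cases (positions across several vectors, or rows of a data frame) have no missing values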
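A sketch of complete.cases() applied to two vectors and to a data frame; the data are made up.

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")

good <- complete.cases(x, y)   # TRUE where neither x nor y is missing
x[good]
y[good]

df <- data.frame(x = x, z = c(1, NA, 3, 4, 5, 6))
clean <- df[complete.cases(df), ]   # keep only rows without any NA
clean
```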

R Programming

Control structures

Allow you to control the flow of execution of the program

  • if, else -> testing a condition

  • for -> execute a loop a fixed number of times

  • while -> execute a loop while a condition is true

  • repeat -> execute an infinite loop

  • break -> break the execution of a loop

  • next -> skip an iteration of a loop

  • return -> exit a function
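The less familiar keywords in the list above (repeat, next, break) can be combined as in this sketch, which sums the odd numbers from 1 to 9.

```r
total <- 0
i <- 0

repeat {
  i <- i + 1
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 9) break        # the only way out of a repeat loop
  total <- total + i
}

total
```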

If, else

In R you can assign the result of an if/else expression to a variable

y <- if(x>3) {10} else {0}

for loop

x <- c(1,2,3,4,5)

for(i in 1:5) OR for(i in seq_along(x))

seq_along() takes a vector and creates an integer sequence equal to the length of the vector
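A sketch of both loop styles. It also shows why seq_along() is the safer choice: for an empty vector it produces an empty sequence, while 1:length() counts down from 1 to 0.

```r
x <- c("a", "b", "c", "d")

# Loop over the indices
out <- character(length(x))
for (i in seq_along(x)) {
  out[i] <- toupper(x[i])
}

# Loop over the elements directly
for (letter in x) {
  print(letter)
}

empty <- character(0)
seq_along(empty)    # integer(0): a loop over it never runs
1:length(empty)     # 1 0: a loop over it would run twice
```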

while loop

Initialize a counter variable before the loop and increment it at every iteration
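A minimal sketch of that pattern:

```r
count <- 0            # initialize before the loop

while (count < 10) {  # the condition is tested before every iteration
  count <- count + 1  # without this increment the loop would never end
}

count
```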

Writing Functions

  • R always returns the result of the last expression of the function

  • Functions are created using function() and stored as R objects

  • Functions have named arguments that may have default values

  • The formal arguments are those included in the function definition

  • formals() takes a function as an input and returns a list of its formal arguments

  • Arguments can be matched by name, position or partially

  • Lazy evaluation -> the arguments are only evaluated if they are needed

  • The “...” argument represents a variable number of arguments, which are usually passed on to other functions

  • The “...” argument is also necessary when you don’t know in advance the number of arguments to be passed to the function

  • Arguments that come after the “...” must be named explicitly and cannot be partially matched
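The points above can be illustrated with three small functions. This is a sketch; above, lazy and paste_all are hypothetical names chosen for the example.

```r
# Default value and implicit return of the last expression
above <- function(x, n = 10) {
  x[x > n]
}
above(1:20)            # uses the default n = 10
above(1:20, n = 15)    # argument matched by name
formals(above)         # list of the formal arguments x and n

# Lazy evaluation: b is never used, so omitting it causes no error
lazy <- function(a, b) {
  a^2
}
lazy(2)

# "..." passes any number of arguments on to paste()
paste_all <- function(..., sep = "-") {
  paste(..., sep = sep)
}
paste_all("a", "b")              # "a-b"
paste_all("a", "b", sep = ":")   # sep must be named in full
paste_all("a", "b", se = ":")    # se is not matched to sep; it ends up in "..."
```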

Scoping Rules

How does R know which value to assign to a symbol when the same name may already be defined somewhere else (for example, by a function in a loaded package)?

When R tries to bind a value to a symbol, it searches through a series of environments to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order is roughly this:

  1. Search the global environment for a symbol name matching the one requested

  2. Search the namespaces of each of the packages currently loaded on the search list (this list can be found using search())

  • Every time you load a package with library(), it is placed in the second position of the search list and the packages already loaded are shifted down one position
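The search list can be inspected before and after loading a package. The sketch below uses splines only because it ships with R and is not attached by default; any other package would behave the same way.

```r
before <- search()
before[1]                # ".GlobalEnv" is always first
before[length(before)]   # "package:base" is always last

library(splines)         # attach a package that is not loaded by default
after <- search()
after[2]                 # the newly attached package sits in position 2
```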

Scoping rules determine how a value is associated with a free variable in a function:

  • R uses lexical scoping or static scoping. An alternative is dynamic scoping

  • Related to the scoping rules is how R uses the search list to bind a value to a symbol

  • Lexical scoping in R means that the values of free variables are searched for in the environment in which the function was defined
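The difference between lexical and dynamic scoping shows up in a sketch like this one, where y is a free variable inside g. The function names f and g are illustrative.

```r
y <- 10

f <- function(x) {
  y <- 2          # local to f
  y^2 + g(x)
}

g <- function(x) {
  x * y           # y is free here: looked up where g was defined
}

f(3)   # lexical scoping: 2^2 + 3 * 10 = 34
       # dynamic scoping would use the y inside f and give 2^2 + 3 * 2 = 10
```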

What is an environment?

  • A collection of (symbol, value) pairs, i.e. x is a symbol and 3.14 may be its value.

  • Every environment has a parent environment; it is possible for a single environment to have multiple “children”

  • The only environment without a parent is the empty environment

  • A function + an environment = a closure or function closure
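A closure can be sketched with a function that builds other functions. Here make.power is a hypothetical constructor, and each function it returns carries the environment in which its own n was defined.

```r
make.power <- function(n) {
  pow <- function(x) {
    x^n           # n is a free variable inside pow
  }
  pow
}

cube <- make.power(3)
square <- make.power(2)

cube(3)      # 27
square(3)    # 9

ls(environment(cube))          # symbols in the closure's environment
get("n", environment(cube))    # 3
get("n", environment(square))  # 2
```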

What happens when R is evaluating a function and finds a free variable?

  1. If the value is not found in the environment in which the function was defined, then the search continues in the parent environment

  2. The search continues down the sequence of parent environments until it reaches the top-level environment (usually the global environment or the namespace of a package)

  3. After the top-level environment, the search continues down the search list until it hits the empty environment.

  4. If after all that the value is not found an error is thrown
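The search described in these steps can be sketched by walking the chain of parent environments by hand and by triggering the final error. The symbol undefined_symbol_xyz is a made-up name that is assumed not to exist anywhere.

```r
f <- function() {
  undefined_symbol_xyz   # free variable that is defined nowhere
}

# Step 4: the lookup fails with an error once every environment has been searched
msg <- tryCatch(f(), error = function(e) conditionMessage(e))
msg

# Steps 1 to 3: follow the parent environments up to the empty environment
e <- environment(f)
depth <- 0
while (!identical(e, emptyenv())) {
  e <- parent.env(e)
  depth <- depth + 1
}
depth   # number of environments that were visited
```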