ROUGH DRAFT authorea.com/51901
  • R Programming

    R Overview and History

    What is R?

    R is a dialect of the S language.

    What is S?

    • Developed by John Chambers et al. at Bell Labs

    • Initiated in 1976 as an internal statistical analysis environment; early versions did not contain functions for statistical modeling

    • In 1988 it was rewritten in C (Version 3); the White Book documents the statistical modeling functions that were added

    • Version 4 was released in 1998 and is documented in the Green Book (the language has changed little since then)

    S Philosophy by John Chambers:

    “Create an interactive environment where they did not consciously think of themselves as programming but as the needs evolved they should be able to slide in to programming when the language and system aspects would become more important”

    • R was created in 1991 in New Zealand by Ross Ihaka and Robert Gentleman (their experience developing R is documented in a 1996 JCGS paper)

    • 1993 - First announcement of R to the public

    • 1995 - R is released under the GNU General Public License (R becomes free software)

    • 1996 - The R-help and R-devel mailing lists are created

    • 1997 - The R Core Group is formed (only its members control the R source code)

    • 2000 - R version 1.0.0 is released

    Features of R

    • Quite lean -> functionality is divided into modular packages

    • Graphics capabilities are very sophisticated and better than those of most stats packages

    • Useful for interactive work but contains a powerful programming language for developing new tools

    • Very active and vibrant user community

    • It’s free!

    Free Software

    1. Freedom to run the program, for any purpose

    2. Freedom to study how the program works and adapt it to your needs

    3. Freedom to redistribute copies

    4. Freedom to improve the program and release those improvements

    Drawbacks of R

    • Based on 40-year-old technology

    • Little support for dynamic or 3-D graphics

    • Functionality is based on consumer demand: if you want something that does not exist, you have to build it yourself!

    • Objects must be stored in physical memory

    Design of the R System

    The R system is divided into 2 conceptual parts:

    1. The “base” R system that you download from CRAN

    2. Everything else

    The R functionality is divided into a number of packages:

    1. The “base” R system contains the base package, which is required to run R

    2. Other packages included in the “base” system: utils, stats, datasets, graphics...

    3. Recommended packages: boot, class, cluster, KernSmooth, lattice...

    4. About 4000 packages on CRAN

    5. And more at Bioconductor Project

    Using R

    R Console Input and Evaluation

    What we type at the R prompt is called an expression:

    <- is the assignment operator

    When a result is printed (after you press Enter), the number in brackets at the start of each output line indicates the position in the vector of the first element shown on that line

    : - operator that creates a sequence of integers
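
    The points above can be tried at the prompt. This is a minimal sketch; the variable names x and s are arbitrary.

```r
# Assignment does not print anything
x <- 5

# Typing the name auto-prints the value; [1] marks the first element
x          # [1] 5

# The colon operator creates an integer sequence
s <- 1:20
s          # long vectors wrap, and each output line starts with an index
length(s)  # [1] 20
```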

    R Data Types: Objects and Attributes

    R has five basic “atomic” classes of objects:

    1. character

    2. numeric (real numbers)

    3. integer

    4. complex

    5. logical (TRUE/FALSE)

    • A vector can only contain objects of the same class

    • A list is represented as a vector but can contain objects of different classes

    • Empty vectors are created with the vector() function

    Numbers

    • Numbers are generally treated as numeric objects

    • If you explicitly want an integer, you need to use the suffix L (e.g. entering 1 gives you a numeric object; entering 1L gives you an integer)

    • Special number Inf - represents infinity (e.g. 1/0)

    • NaN - “Not a Number”, an undefined value (e.g. 0/0); it can also be thought of as a missing value
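
    A short sketch of how these number rules behave in practice; the variable names are only illustrative.

```r
num <- 1        # plain numbers are numeric (double precision)
int <- 1L       # the L suffix requests an integer
class(num)      # "numeric"
class(int)      # "integer"

pos_inf <- 1 / 0    # Inf
back <- 1 / Inf     # 0, so Inf can be used in ordinary calculations
undefined <- 0 / 0  # NaN
is.nan(undefined)   # TRUE
```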

    Attributes

    R objects can have attributes:

    • names, dimnames

    • dimensions (e.g. matrices, arrays)

    • class

    • length

    • other user-defined attributes/metadata

    • Attributes of an object can be accessed using attributes()
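
    As an illustration, the sketch below inspects attributes on a matrix and a named vector. The "units" attribute is a made-up example of user-defined metadata, set with attr().

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
attributes(m)            # a list with one element, $dim = 2 3

v <- c(a = 1, b = 2)
attributes(v)            # a list with one element, $names = "a" "b"

# Attach user-defined metadata (hypothetical attribute name)
attr(v, "units") <- "cm"
names(attributes(v))     # "names" "units"
```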

    R Data Types: Vectors and Lists

    Vectors

    c() -> concatenates objects into a vector

    When objects of different classes are mixed in a vector, R coerces them to the least common denominator:

    • Numeric and Char -> Char

    • Logical and Numeric -> Numeric (FALSE = 0 / TRUE = 1)

    • Char and Logical -> Char (“TRUE”)

    Explicit coercion -> objects can be explicitly coerced from one class to another:

    • as.numeric()

    • as.logical()

    • as.character()
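
    Both kinds of coercion can be checked with a few lines. This is a sketch with arbitrary variable names. The last line relies on standard R behavior not stated above: a coercion that makes no sense yields NA with a warning (suppressed here).

```r
# Implicit coercion when classes are mixed in c()
mixed_chr <- c(1.7, "a")    # numeric + character -> character
mixed_num <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
mixed_lgl <- c("a", TRUE)   # character + logical -> character ("TRUE")

# Explicit coercion
x <- 0:6
as.numeric(x)     # 0 1 2 3 4 5 6
as.logical(x)     # FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)   # "0" "1" "2" "3" "4" "5" "6"

# Nonsensical coercion gives NA
bad <- suppressWarnings(as.numeric(c("a", "b")))
```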

    Lists

    Elements are indexed by double brackets
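
    A small sketch of a list holding objects of different classes; the contents are arbitrary.

```r
x <- list(1, "a", TRUE, 1 + 4i)

x[[2]]          # "a"  (double brackets return the element itself)
class(x[[2]])   # "character"
class(x[2])     # "list" (single brackets return a sub-list)
length(x)       # 4
```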

    R Data Types: Matrices

    Matrices are vectors with a dimension attribute, which is itself an integer vector of length 2 (nrow, ncol).

    matrix(nrow = 2, ncol = 3)

    Matrices are constructed column-wise, so:

    matrix(1:6, nrow=2, ncol=3)

    1 3 5

    2 4 6

    Create a matrix from a vector just adding the dimension attributes:

    m <- 1:10

    dim(m) <- c(2,5)

    Create a matrix by binding columns and rows:

    x <- 1:3

    y <- 10:12

    cbind(x,y)

    1 10

    2 11

    3 12

    rbind(x,y)

    1 2 3

    10 11 12
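
    The three ways of building a matrix described above can be verified together. A sketch, with arbitrary variable names:

```r
# Filled column-wise: the first column is 1, 2
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)      # 2 3
m[2, 1]     # 2
m[1, 2]     # 3

# Adding a dim attribute turns a vector into a matrix
v <- 1:10
dim(v) <- c(2, 5)
v[1, 2]     # 3

# Binding vectors as columns or as rows
x <- 1:3
y <- 10:12
cb <- cbind(x, y)   # 3 rows, 2 columns
rb <- rbind(x, y)   # 2 rows, 3 columns
```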

    R Data Types: Factors

    Factors are used to represent categorical data (ordered or unordered)

    factor(x) -> creates a factor from a vector x (typically a character vector)

    The order of the levels can be set using the levels argument (in modelling, the first level is used as the baseline level):

    x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
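
    To see what the levels argument changes, the sketch below compares the factor above with one built using the default level order (sorted alphabetically for character input). The name default_order is illustrative.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(x)        # "yes" "no"  ("yes" is the baseline level)
table(x)         # counts per level: yes 3, no 2
as.integer(x)    # underlying integer codes: 1 1 2 1 2

# Without the levels argument, levels are sorted alphabetically
default_order <- factor(c("yes", "yes", "no", "yes", "no"))
levels(default_order)   # "no" "yes"
```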

    R Data Types: Missing values

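    Missing values are denoted by NA, or by NaN for undefined mathematical operations

    is.na -> tests whether objects are NA

    is.nan -> tests for NaN

    A NaN value is also NA, but the converse is not true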

    NA values have a class also, so there are integer NA, character NA, etc.
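
    A quick sketch of how the two tests differ, and of the class carried by NA values. NA_integer_ and NA_character_ are the typed NA constants built into R.

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)     # FALSE FALSE TRUE FALSE FALSE
is.nan(x)    # FALSE FALSE FALSE FALSE FALSE

y <- c(1, 2, NaN, NA, 4)
is.na(y)     # FALSE FALSE TRUE TRUE FALSE  (NaN counts as NA)
is.nan(y)    # FALSE FALSE TRUE FALSE FALSE (NA is not NaN)

class(NA)              # "logical"
class(NA_integer_)     # "integer"
class(NA_character_)   # "character"
```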

    R Data Types: Data Frames

    • Represented as a special type of list where every element of the list has to have the same length

    • Data frames can store different classes of objects in each column

    • Have a special attribute called row.names

    • Usually created when importing data using read.table() or read.csv()

    • Can be converted to a matrix using data.matrix()

    data.frame(foo = 1:4, bar = c(T, T, F, F))

      foo   bar

    1   1  TRUE

    2   2  TRUE

    3   3 FALSE

    4   4 FALSE
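
    The properties listed above can be checked on the same small data frame. A sketch; the name df is arbitrary.

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

nrow(df)            # 4
ncol(df)            # 2
sapply(df, class)   # foo is "integer", bar is "logical"
row.names(df)       # "1" "2" "3" "4"

# data.matrix() coerces every column to a number, so bar becomes 1/0
m <- data.matrix(df)
```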

    R Data Types: The names attribute

    x <- 1:3

    names(x) <- c("a", "b", "c")

    a b c

    1 2 3
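
    Names are not limited to vectors. The sketch below also names a list and sets dimnames on a matrix; both are standard R features, shown here as an extension of the example above.

```r
x <- 1:3
names(x) <- c("a", "b", "c")
x[["b"]]        # 2 (elements can now be looked up by name)

l <- list(a = 1, b = 2, c = 3)
names(l)        # "a" "b" "c"

# Matrices take row and column names through dimnames()
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m["a", "d"]     # 3
```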

    Reading tabular data

    Principal functions for reading data into R:

    • read.table and read.csv for reading tabular data

    • readLines for reading lines of a text file

    • source for reading in R code files (inverse of dump)

    • dget for reading in R code files (inverse of dput)

    • load for reading in saved workspace

    • unserialize for reading single R objects in binary form

    Principal functions for writing data from R to files:

    • write.table

    • writeLines

    • dump

    • dput

    • save

    • serialize
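
    Each writing function pairs with a reading function. The sketch below round-trips three of these pairs; all file paths are temporary files created for the example, and the object names are arbitrary.

```r
# Text lines: writeLines / readLines
txt_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_file)
lines_back <- readLines(txt_file)

# Workspace objects: save / load
rda_file <- tempfile(fileext = ".RData")
a <- 1:3
b <- "hello"
save(a, b, file = rda_file)
rm(a, b)
load(rda_file)            # restores a and b under their original names

# Single object in binary form: serialize / unserialize
raw_bytes <- serialize(list(n = 42), connection = NULL)  # returns a raw vector
obj <- unserialize(raw_bytes)
```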

    Small to moderate datasets

    read.table and read.csv arguments:

    • file -> name of a file or a connection

    • header -> logical indicating if the file has a header line

    • sep -> string indicating how the columns are separated

    • colClasses -> a character vector indicating the class of each column in the dataset

    • nrows -> the number of rows in the dataset

    • comment.char -> a character string indicating the comment character

    • skip -> the number of lines to skip from the beginning

    • stringsAsFactors -> should character variables be coded as factors?
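
    A sketch that uses most of these arguments in one call. The file contents and column names (id, name, score) are made up, and the file is a temporary one written just for the example.

```r
# Build a small comma-separated file with one preamble line to skip
csv_file <- tempfile(fileext = ".csv")
writeLines(c("exported by some tool",
             "id,name,score",
             "1,ann,9.5",
             "2,bob,7.25"), csv_file)

df <- read.table(csv_file,
                 header = TRUE,            # first line read is the header
                 sep = ",",                # columns are comma separated
                 skip = 1,                 # ignore the preamble line
                 comment.char = "",        # no comment handling in this file
                 colClasses = c("integer", "character", "numeric"),
                 nrows = 2,                # number of data rows
                 stringsAsFactors = FALSE) # keep strings as character
```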

    Large datasets

    Things that help:

    • Specify the colClasses argument:

      initial <- read.table("data.txt", nrows = 100)

      classes <- sapply(initial, class)

      tabAll <- read.table("data.txt", colClasses = classes)
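
    A runnable sketch of the colClasses approach, using a generated temporary file in place of data.txt. The columns id, value and label are invented for the example.

```r
set.seed(1)
data_file <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:500,
                       value = rnorm(500),
                       label = sample(letters, 500, replace = TRUE)),
            data_file, row.names = FALSE)

# Read a small sample and let R work out the column classes
initial <- read.table(data_file, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Reuse those classes so R does not have to guess them for the full file
tabAll <- read.table(data_file, header = TRUE, colClasses = classes)
```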

    Memory calculation:

    Example: data frame with 1,500,000 rows and 120 columns (all numeric):

    1,500,000 * 120 * 8 bytes/numeric

    = 1,440,000,000 bytes = 1,373.29 MB = 1.34 GB
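
    The same arithmetic as a small sketch, assuming 8 bytes per numeric value and binary units (1 MB = 2^20 bytes). It estimates the raw data only and ignores any overhead.

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8

bytes <- rows * cols * bytes_per_numeric   # 1,440,000,000
mb <- bytes / 2^20                         # about 1373.29
gb <- mb / 2^10                            # about 1.34
round(c(MB = mb, GB = gb), 2)
```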

    Textual Data Formats

    dump and dput are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable

    Textual formats:

    • can work much better with version control programs

    • can be longer-lived

    • adhere to the “Unix philosophy”

    dput and dget objects:

    y<-data.frame(a=1,b="a")

    dput(y)

    dput(y, file = "y.R") -> writes the dput output to the file y.R

    new.y <- dget("y.R")
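
    A round-trip sketch for both textual formats, writing to temporary files rather than the working directory. The dump/source half is an addition for comparison: dump can write several objects at once by name, and source recreates them.

```r
# dput / dget: one object, returned as a value
y <- data.frame(a = 1, b = "a")
y_file <- tempfile(fileext = ".R")
dput(y, file = y_file)
new.y <- dget(y_file)

# dump / source: several objects, restored under their own names
x <- "foo"
z <- 1:3
dump_file <- tempfile(fileext = ".R")
dump(c("x", "z"), file = dump_file)
rm(x, z)
source(dump_file)   # x and z exist again
```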

    Interfaces to the outside world

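    R interfaces with the outside world through connections:

    • file -> opens a connection to a file

    • gzfile -> opens a connection to a file compressed with gzip

    • bzfile -> opens a connection to a file compressed with bzip2

    • url -> opens a connection to a webpage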
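
    Connections are opened, used, then closed. The sketch below writes and reads part of a gzip-compressed temporary file; url() works the same way but needs network access, so it is not run here.

```r
gz_path <- tempfile(fileext = ".gz")

# Write three lines through a gzip connection
con <- gzfile(gz_path, open = "w")
writeLines(c("line 1", "line 2", "line 3"), con)
close(con)

# Read back only the first two lines
con <- gzfile(gz_path, open = "r")
first_two <- readLines(con, n = 2)
close(con)
```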

    Subsetting R Objects

    Basics

    Operators for subsetting:

    • [ -> returns an object of the same class as the original (can extract multiple elements)

    • [[ -> extracts a single element of a list or data frame; the returned object is not necessarily of the same class as the original

    • $ -> extracts elements of a list or data frame by name
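
    A sketch comparing the three operators on a vector and a list. The last line shows one standard difference not stated above: [[ accepts a computed name, while $ only takes a literal one.

```r
x <- c("a", "b", "c", "c", "d", "a")
x[1]          # "a"
x[1:4]        # "a" "b" "c" "c"
x[x > "a"]    # "b" "c" "c" "d"  (logical index)

l <- list(foo = 1:4, bar = 0.6)
class(l[1])       # "list"    (same class as the original)
class(l[[1]])     # "integer" (the element itself)
l$bar             # 0.6
l[["bar"]]        # 0.6

name <- "foo"
l[[name]]         # 1 2 3 4
```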

    Lists

    Subsetting nested elements of a list:

    x <-list(a = list(10,12,14), b=c(3.14, 2.81))

    x[[c(1,3)]] -> third element of the first element of x

    14
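
    The vector index inside [[ is shorthand for chained double brackets, as this short sketch shows.

```r
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))

x[[c(1, 3)]]   # 14
x[[1]][[3]]    # 14, the same element reached step by step
x[[c(2, 1)]]   # 3.14, first element of the second element
```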

    Matrices

    x[row, column]
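
    A sketch of the indexing forms. Leaving an index empty selects a whole row or column. The drop = FALSE argument is standard R, not covered above; it keeps the result a matrix instead of simplifying it to a vector.

```r
x <- matrix(1:6, nrow = 2, ncol = 3)

x[1, 2]    # 3
x[2, 1]    # 2
x[1, ]     # 1 3 5  (whole first row)
x[, 2]     # 3 4    (whole second column)

single <- x[1, 2, drop = FALSE]   # a 1 x 1 matrix rather than a vector
dim(single)                       # 1 1
```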

    Partial matching

    x <- list(aardvark = 1:5)

    x$a -> gets aardvark

    x[["a", exact = FALSE]]
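
    By default only $ does partial matching; [[ requires an exact name unless told otherwise. A short sketch:

```r
x <- list(aardvark = 1:5)

x$a                        # 1 2 3 4 5 (partial match on the name)
x[["a"]]                   # NULL      (exact matching by default)
x[["a", exact = FALSE]]    # 1 2 3 4 5
```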

    Removing NA’s

    x <- c(1,2,NA,4,NA,5)

    bad <- is.na(x)

    x[!bad]

    complete.cases() -> returns a logical vector indicating which cases (positions or rows) have no missing values
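
    A sketch of complete.cases() on two parallel vectors and on a data frame; the data are invented.

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")

good <- complete.cases(x, y)   # TRUE where neither x nor y is missing
x[good]                        # 1 2 4 5
y[good]                        # "a" "b" "d" "f"

# For a data frame, keep only the rows with no missing values
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
clean <- df[complete.cases(df), ]   # only the first row remains
```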