# How to use R

In this paper we will learn how to use R for statistical computing. You will need RStudio to work through the modules. RStudio is free and open source software that uses R as its back end. To work with this series of examples, start RStudio and copy and paste the code from this page into the script window.

## Set up

- Create three directories: data, code, and documents
- Use the script window (this file) and the console
- Keep all data in the data directory
- Keep all code and scripts in the code directory
- You can write a document in the script window
- In the script window you type the contents of your document
- You write it using markdown syntax
- The following table explains the markdown syntax

| Element | Markdown syntax                        |
|---------|----------------------------------------|
| Headers | Use a number of hash marks             |
| Table   | Pipe-delimited rows, as in this table  |
| Figures | `![name](filename.jpg)`                |

Type math in console:

```{r math_console}
(3 + 5) # type these in the console, not here
```

## Assign values to objects

```{r assign_values}
wt_kg <- 55 # will not print anything in the console
```

Note the following about object names:

- You can give an object almost any name you want
- Do not start a name with a number
- Object names are case sensitive
- Do not use reserved names (such as `if`, `else`, `for`)
- Use nouns for variable names
- Use verbs for function names
- Avoid dots in object names
- When you create an object, R does not print anything in the console
- If you want to print the value, wrap the assignment in parentheses ()

What can you do with variable names?

- Do arithmetic with them
- Change a variable's value by assigning a new value to it
- If you used this variable to compute another variable, then:
    - Changing this variable's value does not change the other variable

R code: variables

```{r variable_stuff}
wt_kg <- 100
wt_lb <- wt_kg * 2.2 # 1 kg is about 2.2 lb
wt_kg <- 120
(wt_lb) # what do you think wt_lb will print? 220 or 264? Why?
```
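The point that a derived value does not follow later changes to its input can be checked directly. The sketch below uses hypothetical object names (`len_cm`, `len_mm`, `stale`) that do not come from the text; it only illustrates that an assignment is evaluated once, at the moment it runs.

```r
# Hypothetical objects, used only for illustration
len_cm <- 10
len_mm <- len_cm * 10   # computed once, from the value len_cm holds right now

len_cm <- 25            # reassigning len_cm does not touch len_mm
stale <- len_mm         # keep a copy of the out-of-date value for comparison
print(stale)            # still based on the old value of len_cm

len_mm <- len_cm * 10   # rerun the assignment to bring len_mm up to date
print(len_mm)
```

If a derived value has to stay in step with its input, the usual remedy is to rerun the line that computes it after every change.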
## Functions and arguments

Functions:

- Automate complex and repeated sets of commands
- Are canned scripts
- Can be predefined, such as mean()
- Can also be made available by loading packages
- Take inputs called arguments
- Return a value
- Can return numeric or non-numeric values
- Are run by calling them

R code: example of a function

```{r example_function}
a <- 9 # assign 9 to variable a
b <- sqrt(a) # call the function sqrt with argument a (which is 9) and store the result in b
```

## Vectors and data types

- Vectors are the most common and basic data type in R
- A vector holds a single value or a series of values
- The values are either numbers or characters
- Create a vector using the c() function
- For a character vector the quote marks are essential; otherwise R thinks the values are objects and throws error messages
- length() tells you how many elements are present in a vector
- class() tells you what type of elements the object holds
- str() tells you the structure of the object

## Some examples to run

```{r exmples_vector}
wt_g <- c(50, 60, 70, 80)
animals <- c("mouse", "rat", "cat", "dog")
(length(wt_g)) # returns 4
(length(animals)) # returns 4
(class(wt_g)) # returns "numeric" as every element is a number
(class(animals)) # returns "character" as it is a character vector
str(wt_g) # gives more information about this vector: its type, length and values
wt_g <- c(wt_g, 90) # we can add more elements to the end this way
wt_g <- c(30, wt_g) # add an element to the front
# other types of vectors are logical (TRUE/FALSE),
# integer = whole numbers
# complex = complex numbers
# raw = raw bytes
```

## Data structures

- Vectors contain elements of a single type
- Columns of data sets are vectors
- Lists can contain a mix of element types (like the rows of data sets)
- Matrices are two-dimensional and contain elements of a single type
- Data frames hold rectangular data
- Factors represent categorical variables with levels
- Arrays generalise matrices to more than two dimensions

What happens when we mix elements?
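Because a vector holds a single type, c() converts mixed input to the most general type present, in the order logical, integer, numeric, character. A small sketch of that rule follows; the object names (`log_int`, `int_dbl`, `dbl_chr`) are hypothetical and not from the text.

```r
# Each vector mixes two types; c() converts the elements to the more general one
log_int <- c(TRUE, 2L)      # logical + integer   -> integer
int_dbl <- c(2L, 3.5)       # integer + double    -> numeric
dbl_chr <- c(3.5, "rat")    # double  + character -> character
print(class(log_int))
print(class(int_dbl))
print(class(dbl_chr))

# TRUE becomes 1 when coerced to a number, so logical values can be summed
print(sum(c(TRUE, FALSE, TRUE)))

# Conversion in the other direction has to be requested explicitly
print(as.numeric("4"))      # the text "4" can be read as a number
```

The conversion is silent, so checking class() after building a vector from mixed input is a cheap way to catch a number column that has turned into text.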
```{r example_mixing}
num_char <- c(1, 2, 3, "a")
(class(num_char))
num_log <- c(1, 2, 3, TRUE)
(class(num_log))
char_log <- c("a", "b", "c", TRUE)
(class(char_log))
mix_mix <- c(1, 2, 3, "4")
(class(mix_mix))
```

## Subsetting vectors

- Enclose the positions you want within square brackets
- Use c() to string together several positions
- The first position is 1, so indexing starts at 1

R code example of subsetting vectors

```{r subsetting_vectors}
ans <- c("mice", "rats", "dogs", "cats")
(ans[2]) # will return "rats"
(ans[c(3, 2)]) # will return "dogs" "rats"
```

## Conditional subsetting

- Subset from a vector by defining different conditions

R code example of conditional subsetting

```{r conditional_subsetting}
weight_g <- c(21, 34, 39, 54, 55)
(weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]) # we only want the 1st, 3rd and 4th elements
(weight_g > 50) # logical vector: TRUE where weight > 50
(weight_g[weight_g > 50]) # subset: keeps 54 and 55
(weight_g[weight_g > 50 | weight_g < 30]) # | means OR: keeps 21, 54 and 55
(weight_g[weight_g > 50 & weight_g < 30]) # & means AND: no value is both, so the result is empty
```

## How to search for strings in a vector

```{r search_strings}
animals <- c("cat", "rat") # define what you want to search for
statement <- c("a", "cat", "sat", "on a", "mat", "to catch a", "rat") # the vector to search in
(animals %in% statement) # are the animals in statement?
(animals[animals %in% statement]) # which animals?
```

## How to analyse real world data sets and missing data?

- Missing data in R are represented as NA
- If you operate on a vector that contains NA, the result will be NA
- You have to remove the NA values in these cases
- For many functions, such as mean(), set na.rm = TRUE

R code example of missing data

```{r missing_data}
height <- c(2, 4, 4, NA, 6)
(mean(height)) # will return NA
(mean(height, na.rm = TRUE)) # ignores the NA and returns 4
```

What will you do to remove missing values from data sets?
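For a whole data set rather than a single vector, one option is to drop every row that has a missing value in any column. A minimal sketch, assuming a small hypothetical data frame `toy` whose name and values are made up for illustration:

```r
# Hypothetical data frame; the values are made up for illustration
toy <- data.frame(
  id     = c(1, 2, 3, 4),
  height = c(2, NA, 4, 6),
  weight = c(10, 12, NA, 15)
)

print(colSums(is.na(toy)))     # count of NA values in each column

keep <- complete.cases(toy)    # TRUE for rows that have no NA at all
toy_complete <- toy[keep, ]    # keep only the complete rows
print(toy_complete)

# na.omit() drops the same rows in one step
print(nrow(na.omit(toy)) == nrow(toy_complete))
```

Dropping rows discards the non-missing values in those rows as well, so counting the NA values per column first helps to judge how much data would be lost.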
```{r missing_data_set}
(height[!is.na(height)]) # will return the 4 non-missing values
(na.omit(height)) # remove missing data
(height[complete.cases(height)]) # similar to !is.na()
lengths <- c(10, 24, NA, 18, NA, 20) # vector
lengths_without_NA <- lengths[!is.na(lengths)]
(median(lengths_without_NA)) # can you think of one other way of doing this?
```

## Working with data sets

We will analyse a data set that has the following variables:

Column            Description
----------------- ----------------------------
record_id         Unique ID
month             month of observation
day               day of observation
year              year of observation
plot_id           ID of a particular plot
species_id        ID of a particular species
sex               sex, male or female
hindfoot_length   length of the hindfoot
weight            weight in grams
genus             genus of the animal
species           species of the animal
taxa              the taxonomy
plot_type         type of plot

- Use the download.file() function to download the file
- Store it in the data folder
- Read the data into R using the read.csv() function
- This will save the data as a data.frame object

How to read data in R

```{r data_reading}
# the data directory must already exist (see Set up)
download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv") # download data
surveys <- read.csv("data/portal_data_joined.csv")
```

What do we do with the data set?
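The usual first step is to inspect it. The inspection functions can be previewed on a data frame small enough to verify by eye. This is only a sketch: `toy` and its values are hypothetical stand-ins, not part of the surveys data.

```r
# Hypothetical data frame standing in for surveys; the values are made up
toy <- data.frame(
  record_id  = 1:4,
  species_id = c("NL", "DM", "DM", "PF"),
  weight     = c(NA, 40, 48, 7)
)

print(nrow(toy))             # number of rows
print(ncol(toy))             # number of columns
print(names(toy))            # column names
print(head(toy, n = 2))      # first two rows instead of the default six
str(toy)                     # type and first values of each column
print(summary(toy$weight))   # summary of one numeric column, including the NA count
```

The same calls work unchanged on the full data set; only the amount of output differs.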
```{r what_data}
(head(surveys)) # first six rows
(tail(surveys)) # last six rows
str(surveys) # get the data structure
(nrow(surveys)) # number of rows
(ncol(surveys)) # number of columns
(names(surveys)) # lists the variables
(colnames(surveys)) # lists the variables, another style
(summary(surveys)) # get a summary of the data set
```

## Indexing and subsetting data sets

```{r subsetting_df}
(surveys[1, 1]) # first row, first column
(surveys[1, 6]) # element in row 1 and column 6
(surveys[, 1]) # contents of the first column, as a vector
(surveys[c(1:3), 7]) # first three rows, column 7
(surveys[, -1]) # data set minus the first column
(surveys[c(1:6), ]) # keep only the first 6 rows
(surveys["species_id"]) # returns the column by name, as a data frame
(surveys[, "species_id"]) # returns the column as a vector of values
```

## How to deal with factors

- Factors are used to represent categorical data
- They can be ordered or unordered
- Factors are stored as integers
- These integers have labels associated with them
- Even though they look like characters, they are integers underneath
- By default R sorts the levels in alphabetical order

R code samples to deal with factors

```{r factors_sorted}
sex <- factor(c("male", "female", "female", "male"))
(levels(sex)) # R assigns 1 to female and 2 to male
(nlevels(sex)) # returns the number of levels
surveys$sex <- factor(surveys$sex) # since R 4.0 read.csv() keeps text as character, so convert first
plot(surveys$sex) # plot the number of observations
(levels(surveys$sex)) # returns "", "F", "M"
levels(surveys$sex)[1] <- "not known" # change "" to "not known"
```

## How to format dates

- Convert dates and times to an appropriate and usable format
- Use the lubridate package and its ymd() function

How to format dates with R

```{r date_function}
library(lubridate) # load the lubridate package
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-")) # ymd converts the text to dates
str(surveys$date) # returns the structure of the date object
```
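Building a date from separate year, month and day columns can fail for rows whose parts do not form a real calendar date, and ymd() marks those rows as NA. A minimal sketch of the same paste-then-parse step, assuming lubridate is installed and using a hypothetical data frame `toy` with made-up values:

```r
library(lubridate)  # assumed to be installed

# Hypothetical data frame; the values are made up for illustration
toy <- data.frame(
  year  = c(2000, 2000, 2001),
  month = c(4, 4, 12),
  day   = c(30, 31, 25)     # 31 April does not exist
)

# Paste the parts into "year-month-day" text, then parse it;
# quiet = TRUE suppresses the warning about the entry that fails to parse
toy$date <- ymd(paste(toy$year, toy$month, toy$day, sep = "-"), quiet = TRUE)

print(class(toy$date))           # the new column holds Date values
print(toy[is.na(toy$date), ])    # rows whose date could not be built
```

Checking for NA in the new column after the conversion is a simple way to find rows that need a closer look.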