R

Introduction and use

When to use

Statistics is the emphasis, not matrix manipulation. It is an expression language.

When not to use

Text processing and general purpose programming are painful.

Writing, Building and executing code

Script execution

source("/path/file.R").

Reloading changed code

library("R.utils");

sourceDirectory("work", modifiedOnly=TRUE, pattern="[A-Z]*[.]R$", recursive=FALSE);

Important environment variables

R_LIBS: the place where R libraries are installed.

Working environment

getwd(), setwd(dir)

options(): Example options(digits=3)

Command history

history(), savehistory, loadhistory.

Can use Ctrl+R as in BASH.

Good IDEs

Native GUI

Graphical data entry: data.entry(x), edit(x).

Listing and memory

objects() lists objects.

rm(objA, objB) removes certain objects from memory. rm(list=ls()) removes all objects.

exists() checks for the existence of an object; the name is passed as a string, as in exists("x").

Development environment

RStudio is good. Interactivity with RapidMiner is useful.

Debugging

The following can be inserted in the midst of code:

browser(): breaks execution and allows debugging with arbitrary code: like keyboard in matlab. Use cont to continue the program. Q halts execution.

debug(): marks function for debugging.

trace() function modifies a function to allow debug code to be temporarily inserted.

setBreakpoint("fun.R#20")

traceback() is useful when error is encountered.

Help

help(command), ?command, apropos("log"), help.search("log"), example(commandName).

libraries

Seeking

help.search(fnName);

Using

Installing

Many packages are available from the CRAN repository on the internet.

chooseCRANmirror()

install.packages("igraph"). For usage, see the Packages subsection.

Data: types, values, variables

Names and namespace

Valid names

a.b and a_b are valid names; using such a name does not create a structure named a. A name can start with ‘.’, but then the next character cannot be a digit.

Namespace

Uses: avoid name conflicts, structure code well.

To call a function b in a namespace a, use: a::b(). If b is hidden, then use a:::b().

Accessing list members directly

Use attach(Lst). detach() reverses this.

Object

Everything is actually an object with attributes.

Accessing attributes

attrName(objName)

Setting attributes

attr(object, attribute) is used to get or set an attribute.

Or one can use structure(object, attr=value).
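
A minimal sketch of reading and setting attributes, assuming a hypothetical attribute name "units" and made-up values:

```r
# Hypothetical example object and attribute name.
x <- 1:6
attr(x, "units") <- "cm"          # set a custom attribute
units_of_x <- attr(x, "units")    # read it back

# structure() sets attributes at creation time; a dim attribute
# turns the plain vector into a 2 x 3 array.
y <- structure(1:6, dim = c(2L, 3L))
dims_of_y <- dim(y)               # accessor form: attrName(objName)
```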

Important attributes

mode: details about the type of data contained.

length.

Modes and type

Mode: details about the type of data contained. This is distinct from the type of the object itself (data.frame, or vector ..). So, type is akin to a generic/ template/ meta class in Java made concrete by specification of mode arguments corresponding to contents.

Basic modes

Logical (boolean). Numeric, which covers both the integer and double storage types. Character.
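
The difference between mode, storage type and class can be inspected directly. This is only an illustrative sketch; the variable names are made up:

```r
int_mode <- mode(1L)                   # "numeric"
int_type <- typeof(1L)                 # "integer" (storage type)
dbl_type <- typeof(1.5)                # "double"
lgl_mode <- mode(TRUE)                 # "logical"
df_mode  <- mode(data.frame(a = 1))    # "list": what a data frame contains
df_class <- class(data.frame(a = 1))   # "data.frame": the type of the object
```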

Scalar values and operations

Special values

TRUE, FALSE. Inf. NA: ‘not available’ or missing values. NaN (not a number) is also treated as a special case of NA.

These can be checked using the is.na() and is.nan() functions.

Data conversion

as.array(), as.data.frame etc..

Use methods(as) for a list of such methods.

Vectors

The basic math object

A scalar is actually a vector with one element in R. An array is also internally a vector. A string is a single element of a character vector.

Homogeneity

Vectors are homogeneous: all elements share one mode, such as character, numeric or logical.

Named entries

names(x) = stringList is a way of giving names to indices.

Indexing

A vector can be indexed with a vector of a] logical elements, b] positive integers, c] negative integers.

Eg: x[1:10] picks 1st 10 elements. x[-2]: all but x[2].

x["abc"] is valid if entries are named.

tail(x, k) picks the k last elements.

To append to a vector: append(v, val)
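
The indexing styles above can be tried on a small named vector. The values and names here are made up for illustration:

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_logical     <- x[x > 15]              # logical index: keeps b, c, d
by_position    <- x[1:2]                 # positive integers: first two elements
without_second <- x[-2]                  # negative integer: all but x[2]
by_name        <- x["c"]                 # name index
last_two       <- tail(x, 2)             # the 2 last elements
longer         <- append(x, c(e = 50))   # returns a new, longer vector
```

Note that append() does not modify x; the result has to be assigned.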

Creating vectors

Using the concatenation function - c(1, 23, 24) creates a vector of the given elements.

seq(); This can be abbreviated using the : operator.

rep(vector, timesToRepeat) is the replicator function.

vector() creates an empty vector.
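
A short sketch of these constructors; the variable names are only illustrative:

```r
v1 <- c(1, 23, 24)                # concatenation
v2 <- seq(2, 10, by = 2)          # 2 4 6 8 10
v3 <- 1:5                         # shorthand for seq(1, 5)
v4 <- rep(c(1, 2), times = 3)     # 1 2 1 2 1 2
v5 <- vector("numeric", 3)        # 0 0 0
empty <- vector()                 # zero-length logical vector
```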

Vector statistics functions

length, max, min, sum, mean, median, cummax, cummin, cumsum, cumprod, range, prod. cor: correlation.

which: returns the indices of the TRUE values in a boolean vector.

Apply scalar functions

lapply returns a list. sapply, a simpler wrapper around lapply, returns a vector by default. vapply is a variant of sapply in which the type of the return value is declared in advance (FUN.VALUE).

apply returns an array or a vector but acts on arrays.
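
The four functions can be compared side by side. This is a small sketch with a made-up squaring function:

```r
square <- function(i) i^2   # hypothetical scalar function

squares_list    <- lapply(1:3, square)                           # list of 3
squares_vec     <- sapply(1:3, square)                           # numeric vector
squares_checked <- vapply(1:3, square, FUN.VALUE = numeric(1))   # type-checked

m <- matrix(1:6, nrow = 2)     # rows are (1, 3, 5) and (2, 4, 6)
row_max <- apply(m, 1, max)    # 5 6
```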

Factor

Factors are ways of storing a label-vector. They can be created using factor(labelColumn).

An important attribute is levels, which contains the set of labels used.
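
A minimal factor example, with made-up labels:

```r
labels <- c("low", "high", "low", "mid")
f <- factor(labels)

lv     <- levels(f)        # "high" "low" "mid" (sorted set of labels)
codes  <- as.integer(f)    # 2 1 2 3: each entry is an index into the levels
counts <- table(f)         # number of occurrences per level
```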

Strings

These are character vectors. String literals are enclosed in "" or ''.

String manipulation

Concatenation: paste(a, b) joins strings element-wise; paste(vector, collapse="") joins the elements of one vector. substr gets a substring.

strsplit(v, sep) returns a list, having split each element in v.
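
A quick sketch of these string functions, using made-up inputs:

```r
full   <- paste("data", "frame", sep = ".")        # "data.frame"
joined <- paste(c("a", "b", "c"), collapse = "-")  # "a-b-c"
head4  <- substr("statistics", 1, 4)               # "stat"

# strsplit returns one list element per input string.
parts <- strsplit(c("a,b", "c,d,e"), ",")
```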

Dates

String conversion

d<-as.Date("1995/12/01",format="%Y/%m/%d") converts string to date.

format(dt) converts date to string.

Arithmetic operations

Then, one can add days with d+20.
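
Putting the date conversions and arithmetic together in one sketch; the second date is made up for illustration:

```r
d <- as.Date("1995/12/01", format = "%Y/%m/%d")

later      <- d + 20                         # 20 days later
later_text <- format(later, "%Y-%m-%d")      # "1995-12-21"

new_year <- as.Date("1996/01/01", format = "%Y/%m/%d")
gap_days <- as.numeric(new_year - d)         # 31
```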

Arrays/ matrices

As special vectors

Arrays are just vectors which support multiple subscript indexing. So all operations and restrictions that apply to vectors apply to arrays.

Arrays are stored column by column.

The dimensionality is stored as a vector in the dim attribute.

Indexing

A[a, …]. Any subscript can be replaced with the sort of vector used to index vectors. If a subscript is omitted all values in that dimension are chosen.

Eg: A[,c(3:5)]: picks 3 cols.

Data creation

array(vector, dimensions). Or just set the dim attribute after creating a vector.

Or concatenate various vectors or arrays: c(v1, v2)

Matrices

x = matrix(vector, nrow = 3). matrix(0,nrow=n,ncol=n)

Row or column binding functions: rbind(vec1, vec2); cbind.
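
The array and matrix constructors above can be checked on small inputs. This sketch also shows the column-by-column storage order:

```r
A <- array(1:6, dim = c(2, 3))   # filled column by column
a12 <- A[1, 2]                   # 3, not 2

v <- 1:6
dim(v) <- c(3, 2)                # same data, now a 3 x 2 array
v12 <- v[1, 2]                   # 4

Z  <- matrix(0, nrow = 2, ncol = 2)
Rb <- rbind(c(1, 2), c(3, 4))    # vectors become rows
Cb <- cbind(c(1, 2), c(3, 4))    # vectors become columns
```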

Dimension-wise operations

apply applies functions to margins of an array; apply(x,1,max): gives row max.

sweep(m, 2, colSums(m), FUN="/") divides each column of m by its column sum.
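
A small worked example of apply and sweep on a made-up 2 x 2 matrix:

```r
m <- matrix(c(1, 3, 2, 6), nrow = 2)   # columns are (1, 3) and (2, 6)

row_max <- apply(m, 1, max)            # 2 6

# Divide every column by its own sum, so that each column sums to 1.
col_normalized <- sweep(m, 2, colSums(m), FUN = "/")
```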

Matrix functions

t(A): transpose. %*%: matrix multiplication. %^%: matrix power, provided by the expm package rather than base R.

diag(A) extracts the diagonal of a matrix; given a vector, it creates a diagonal matrix instead.

sum, rowSums, colSums, rowMeans, colMeans. cor: correlation.

Linear algebra

eigen, svd, qr. solve(A, b). lsfit(A, b).
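
The base-R matrix functions and solve() can be exercised on a small symmetric matrix. The numbers are made up for illustration:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # rows are (2, 1) and (1, 3)
b <- c(3, 5)

At <- t(A)              # transpose (equal to A here, since A is symmetric)
AA <- A %*% A           # matrix product
d  <- diag(A)           # 2 3: extracts the diagonal
I2 <- diag(2)           # 2 x 2 identity matrix

x  <- solve(A, b)       # solves A x = b
ev <- eigen(A)$values   # eigenvalues, largest first
```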

Lists

Lists are heterogeneous. They combine the features of a hashmap and a list, and are very convenient to use as structures.

List creation

L <- list(1, a, b) or L <- list(a1 = 1, a2 = a, xyz = b) for named lists.

Accessing list items

L[[4]], L[["fieldName"]], L$fieldName. When indexing with $, list names can be abbreviated: Eg: ‘cov’ instead of ‘covar’, as long as the interpreter is still able to uniquely identify the intended member.

To see if a member exists, use: "fieldName" %in% names(Lst).

Concatenation

c(L1, L2) returns a single list with members from both.
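
A sketch of list creation, access and concatenation; the member names are hypothetical:

```r
L <- list(a1 = 1, covar = matrix(0, 2, 2), xyz = "b")

first   <- L[[1]]                  # by position
by_name <- L[["xyz"]]              # by full name
partial <- L$cov                   # $ allows unique partial matching: covar
has_xyz <- "xyz" %in% names(L)     # membership test

both <- c(L, list(extra = TRUE))   # a single list with 4 members
```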

Data frame

Data frames are R's data tables: like matrices, but the columns can have different types, and each column can have a name.

Creation

They are often read in from files - using read.csv for example.

By concatenating vectors: df = data.frame(n, s, b).

Conversion

Or from a matrix: data.frame(A). It can be reconverted to a matrix using data.matrix.

Indexing

Indexing is done as in the case of two dimensional arrays. If column headers exist, they can be used for indexing.

drops <- c("x","z")
DF[,!(names(DF) %in% drops)]
DF$colName

Searching

which(sbux.df$Date == "3/1/1994")
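
The indexing and searching idioms above, tried on a made-up data frame. The drop = FALSE argument is an addition here: without it, a single remaining column is returned as a plain vector.

```r
# Hypothetical data frame for illustration.
DF <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

drops <- c("x", "z")
kept  <- DF[, !(names(DF) %in% drops), drop = FALSE]   # only column y remains

col_y <- DF$y                  # one column, as a vector
rows  <- DF[DF$x >= 2, ]       # row selection with a logical index
hit   <- which(DF$y == "b")    # row number of the match: 2
```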

Concatenation

As in the case of matrices

Model Formulae

~ operator is used to separate left and right sides of formulae.

Syntax: response ~ predictor variables (separated by +).
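
A formula is an ordinary object that modeling functions interpret. As a sketch, using the built-in mtcars data set and lm() (neither is mentioned above; they are chosen only for illustration):

```r
f <- y ~ x1 + x2
vars <- all.vars(f)                           # "y" "x1" "x2"

# mpg is the response; wt and hp are the predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef_names <- names(coef(fit))
```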

Functions

Functions are actually objects.

Definition

my.mean <- function(x1 = defaultValue, y1) codeBlock

The value returned is the value evaluated by the last expression in the function definition. Multiple return values are usually handled using lists.
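
A minimal sketch of a function with a default argument that returns several values in a list. The function name summarize and its arguments are hypothetical:

```r
summarize <- function(x, trim = 0) {
  # The last expression is the return value.
  list(mean = mean(x, trim = trim), range = range(x))
}

res <- summarize(c(1, 2, 3, 10))
res_mean  <- res$mean     # 4
res_range <- res$range    # 1 10
```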

Variables used

Usually variables have a local scope: they cannot be accessed outside a function.

They can use variables from the parent scope, say g. The value of g is bound from the parent scope. If a normal assignment is made to g, as in g <- 0, then g is taken to refer to a local variable. If g is required in the function body but is bound neither in an enclosing scope nor by an argument, there is an error. However, if g <<- 0 is used, then the parent-scope variable is updated, and g acts as the ‘state’ of the function.
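
One common use of this behaviour is a counter that keeps state between calls. This is only a sketch; make_counter and count are made-up names:

```r
make_counter <- function() {
  count <- 0                 # lives in the enclosing (parent) scope
  function() {
    count <<- count + 1      # updates the parent-scope variable
    count
  }
}

counter <- make_counter()
first_call  <- counter()     # 1
second_call <- counter()     # 2
```

Had count <- count + 1 been used instead, each call would create a fresh local count and always return 1.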

Operator definitions

A binary operator is defined by using a name of the form %op% (quoted with backticks) in place of the function name; %*% is a built-in example.
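
A user-defined binary operator might look like this; the operator name %+% is hypothetical:

```r
# Backticks are needed because %+% is not a syntactic name.
`%+%` <- function(a, b) paste(a, b)

greeting <- "hello" %+% "world"   # "hello world"
```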

Invocation

Invocation can be done as in C, using a sequence of values. Together with positionally specified values, one can pass named arguments in any order. Eg: f(3, a=1, b = 2) or f(a=1).
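
A sketch of positional and named argument passing, with a made-up function f:

```r
f <- function(x, a = 0, b = 1) a + b * x

r1 <- f(3, a = 1, b = 2)      # 7: x by position, a and b by name
r2 <- f(b = 2, x = 3, a = 1)  # 7: named arguments in any order
r3 <- f(3)                    # 3: defaults a = 0, b = 1
```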

Operator

Assignment

<-, ->, assign(). <<- is used to make global assignments: assignment to a variable outside the local scope.

Scalar operators

Arithmetic operators

As in C.

Operators on booleans

|, &, ||, &&. | and & work element-wise on vectors; the latter two result in ‘short-circuit’ evaluation, where the second argument is evaluated only if necessary.
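
The difference between the two families shows up in a small sketch like this, with made-up values:

```r
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)   # TRUE FALSE

x <- NULL
# The first operand is FALSE, so x > 0 is never evaluated.
is_positive <- !is.null(x) && x > 0             # FALSE
```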

Operators which produce booleans

>, <, ==.

General Vector operators

Mapping over elements

lapply can be used to apply scalar functions to vector elements, while apply() can be used to apply vector functions on array rows.

Arithmetic and boolean ops

All scalar arithmetic operators are extended to be meaningful when provided vector arguments: even when they are not of the same size.

Lengthening of arguments

All shorter arguments are extended by repetition to have the size of the longest vector: thus 1 + c(2, 3) is defined.

Examples

c(1,2,3,4)/c(4,3,2,1). c(1,2,3,4) + c(4,3) yields 5 5 7 7.
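
A few more recycling cases, as a sketch with made-up inputs:

```r
scalar_plus <- 1 + c(2, 3)                  # 3 4: the scalar is repeated
recycled    <- 1:6 + c(10, 20)              # 11 22 13 24 15 26
ratio       <- c(1, 2, 3, 4) / c(4, 3, 2, 1)  # element-wise division
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning.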

ifelse op

ifelse operation uses a logical vector as a condition.
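
For instance, with made-up values and labels:

```r
x <- c(-2, 0, 3)
# One result per element of the condition vector.
sign_label <- ifelse(x > 0, "pos", "non-pos")   # "non-pos" "non-pos" "pos"
```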

Set membership

drops <- c("x","z")
DF[,!(names(DF) %in% drops)]

Missing value identification

is.na(x) returns TRUE for both NA and NaN values. Note that this is different from the expression x == NA, which always evaluates to NA and so cannot be used as a test.
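
The behaviour of these checks on a small made-up vector:

```r
x <- c(1, NA, NaN)

na_mask   <- is.na(x)    # FALSE TRUE TRUE
nan_mask  <- is.nan(x)   # FALSE FALSE TRUE
compared  <- x == NA     # NA NA NA: comparing with NA never gives TRUE
n_missing <- sum(is.na(x))   # 2
```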

Code structure

Every line of code is an expression or a sentence.

Code blocks

{} encloses code block.

Sentence syntax

Sentences end with newline or ;.

# comments.

Decision structures

if(..) codeBlock; optionally followed by else codeBlock. for(var in vector) codeBlock. while(cond) codeBlock.
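
A small sketch combining these structures; the computations are made up for illustration:

```r
total <- 0
for (i in 1:5) {
  if (i %% 2 == 0) {
    total <- total + i    # add even numbers
  } else {
    total <- total - i    # subtract odd numbers
  }
}
# total is now -3

n <- 10
steps <- 0
while (n > 1) {
  n <- n %/% 2            # integer division: 10, 5, 2, 1
  steps <- steps + 1
}
```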

Packages

To be able to use a package, one says: library(packageName). Standard packages are automatically available. Note that this is distinct from the idea of a namespace.

Organization with lists

Can group functions in lists which are declared in separate files. These files are then (re)loaded using commands like source or sourceDirectory.

IO

File I/O

To load data from a table, use read.table(), read.csv (fileName, header=FALSE, stringsAsFactors = FALSE), write.csv(x, file).
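
A round trip through a CSV file can be sketched as follows. The data frame is made up, the file is a temporary one, and header=TRUE is used here because write.csv writes a header row:

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")        # temporary file, illustrative only
write.csv(df, path, row.names = FALSE)    # omit the row-name column

back <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
```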

Write in matlab format: library(R.matlab), writeMat(filename, A=mat).

User interaction I/O

output

a+b prints a value, which is then lost.

print. sprintf is available in the base package; printf is provided by R.utils.

sink("fileName") diverts output to a file. sink() restores it to STDIO.
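
A sketch of formatted output and of diverting output with sink(), using a temporary file and made-up values:

```r
msg <- sprintf("mean = %.2f", mean(c(1, 2, 4)))   # "mean = 2.33"
cat(msg, "\n")                                    # print the formatted string

out <- tempfile()     # temporary file, illustrative only
sink(out)             # divert output to the file
print(1:3)
sink()                # restore normal output

captured <- readLines(out)   # "[1] 1 2 3"
```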

Input

x= scan(): keyboard input, no commas.

Plotting

plot(y ~ x) produces a scatter plot.

plot(x, y, main="title", sub="subtitle",
  xlab="X-axis label", ylab="y-axis label",
  xlim=c(xmin, xmax), ylim=c(ymin, ymax))
plot(NULL, xlim=c(0,1000), ylim=c(0,1000), xlab="Generations", ylab = "N_t") # Empty plot

Attributes

legend( x="topright",
        legend=c("k=2k","k=1.5k", "k=1k"),
        col=c("red","blue", "green"))

Add to same plot

plot(x,y1,type="l",col="red")
lines(x,y2,col="green")
points(x,y2,col="blue")

Combine plots

par(mfrow = c(1,2)) # Make 1*2 grid
par(mfg=c(1,1)) # Set the first plot
plot(NULL, xlim=c(0,1000), ylim=c(0,1000), xlab="Generations", ylab = "N_t")
par(mfg=c(1,2)) # Set the second plot
plot(NULL, xlim=c(0,50), ylim=c(0,1), xlab="Generations", ylab = "R_t")

Tables

Tables can be produced with xtabs(y ~ x).

Data preparation and exploration

scale(x, center = TRUE, scale = TRUE) normalizes columns using the mean and standard deviation.

Getting the correlation matrix: cor(x, y = NULL, use = "pairwise.complete.obs"). cov() takes the same arguments and returns the covariance matrix.
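
A sketch of scale(), cor() and cov() on a small made-up matrix:

```r
x <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 9))

z <- scale(x, center = TRUE, scale = TRUE)
z_means <- colMeans(z)        # approximately 0 for every column
z_sds   <- apply(z, 2, sd)    # 1 for every column

r <- cor(x)                   # correlation matrix, ones on the diagonal
v <- cov(x)                   # covariance matrix, variances on the diagonal
```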

Modeling

Several useful functions are provided to evaluate fitted models in the package stats, which is loaded by default.

Classification

Decision tree learning: rpart

Logistic regression

glm(model, data=tblName, family = binomial())
ret <- glm.fit(x=X, y=z, family=binomial())

Options for family include: binomial(link=logit)

The return value is a list which includes coefficients and fitted.values.
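
Both calls can be tried on the built-in mtcars data set. The choice of am as the response and wt as the predictor is an assumption made only for illustration:

```r
# Formula interface: is the car a manual (am = 1), given its weight?
fit <- glm(am ~ wt, data = mtcars, family = binomial())
coefs <- coef(fit)             # intercept and slope
p <- fitted(fit)               # fitted probabilities, one per row

# Matrix interface: the intercept column has to be added by hand.
X <- cbind(1, mtcars$wt)
ret <- glm.fit(x = X, y = mtcars$am, family = binomial())
```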

Logistic regression with l1/l2 regularization

Use the glmnet package (requires gfortran).

Example:

returnList <- cv.glmnet(X, y, family = "binomial");

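returnList contains the following members: lambda (the vector of regularization strengths tried, corresponding to the lagrangian multiplier for the l1-norm), cvm (the corresponding mean cross-validated errors), and glmnet.fit (the glmnet model fitted on the full data, which holds the fit weights).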

Decision tree: rpart and tree

ret <- rpart(model, tblRet)
print(ret)

Other library functions

Random sampling

Sampling from distributions: runif. sample(x, size, [replace=TRUE])
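
A sketch of both kinds of sampling. The seed value is arbitrary; set.seed() is added only to make the run repeatable:

```r
set.seed(42)                     # arbitrary seed, for reproducibility

u <- runif(5)                    # 5 draws from Uniform(0, 1)
s <- sample(1:10, size = 3)      # 3 distinct values, without replacement
flips <- sample(c("H", "T"), size = 10, replace = TRUE)   # with replacement
```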

Write C extensions

Write C code

Useful libraries: #include <R.h> #include <Rmath.h>

Signature: void getSamples(int *input, int *output)

Calling C code

Compile C code: R CMD SHLIB foo.c

Loading: dyn.load("foo.so")

Calling: .C("foo", n=as.integer(5), x=as.double(rnorm(5)))

This returns a list containing the arguments, including any values the C code modified.

Ref