Dealing with Rows and Columns

September 8, 2008

So, a problem that has been bugging me for a while is trying to get row and column sums, means, etc. etc. For the longest time I was assigning a column of identifiers and then using by(data, identifier, sum), or tapply (data, identifier, sum), but it typically makes your life a whole lot simpler if you don’t have to add in a column of identifiers, and if you don’t have to worry about what they’re doing to your data.

So here’s the most useful command I’ve discovered so far for this:

apply(x, margin, function)

where x is your data, margin indicates whether you want the operation performed over rows, margin = 1; columns, margin = 2; or both, margin = c(1,2); and of course function indicates the function you want performed (e.g. mean).

Splitting data

August 9, 2008

One of my new favourite commands to split my data by a given factor is split(). It takes a data set and breaks it into a group of dataframes or vectors based on the values of the factor you choose. There is a trick though, to use the new dataframes you have to attach the object which you used with your split command, and then you can attach the separate data frames by the name of the levels of the factor used to split them. Furthermore, thus far I haven’t been able to get it to work with numerical factors, just character factors.

#for example
y<-c(0,1,2,0,1,2)
x <- c(“Cat”,”Cat”,”Cat”,”Dog”,”Dog”,”Dog”)
z<-c(4,5,6,4,5,6)
data<-as.data.frame(cbind(x,y,z))
attach(data)
byspecies<-split(data,x)
attach(byspecies)
detach(data)
attach(Cat)
plot(y,z)

Every time you start R a set of packages is loaded that is stipulated in one of the background files that makes R tick. But what if you regularly use functions from a package that is not in the default group? Do we really have to go into the package manager (Packages & Data -> Package Manager) and check the little box to load the package every time we use these functions?
Of course not, all you need is the library command. For example, were we to need someone’s favourite package, vegan, all we would need to do is add:

library(vegan)

#to the beginning of our file. Note, it is also possible to change the default settings for the packages to be loaded in the background file. I’ve seen the code, and I was too afraid to touch it.

So a lab mate asked today about how to get x-axis labels to sit at a 45º angle. A little discussion of figure axes seems in order.

For starters, if you’re just looking for a basic figure you don’t really have to define much. R will make it’s best guess at what your axes should look like, and barring any complications (for example lots of different categories in a boxplot, or useless names of factor levels or variables) these figures are more or less serviceable. And, of course, most graphing functions offer easy ways to specify axis labels that differ from the default.

An example may be in order:

#First let’s start by making up some data to graph. The runif command generates random data with a uniform distribution. The first argument to this command (and the only one that I specify) indicates how many numbers you want to generate. Other (unused) arguments allow you to specify the range of generated numbers.

x <- runif(10)
y <- runif(10)

#So we’ve defined two vectors of random numbers, x and y (note that the arrow defines something, it’s kind of like an equals sign). Seeing as we’re interested in axis labels we should probably make a boxplot, because they’re a little more involved that plain x-y plots.

boxplot(x,y)

blogpost1

#So, here we have a plot with no general label for either axis, not particularly useful labels for our two categories, and y-axis values that hurt to read. The first step in making this figure nicer is turning those axis labels so they are perpendicular to the y-axis. Axis values are rotated by 90º using the las command. Where las = 1, the y-axis values are rotated to be perpendicular to the axis, when las = 2 the x-axis values are rotated to be parallel to the x-axis, when las = 3 their both parallel to the y-axis. Let’s make our y-axis labels perpendicular to the y-axis, eh?

boxplot(x,y,las=1)

blogpost2

#Now our y-axis labels look nice, but we’ve lost our x-axis labels (I swear that’s never happened to me before). Maybe we should make some new ones that are a little more descriptive?

boxplot(x,y,las=1,names=c(“Lake 1″,”Lake 2”))

blogpost3

#So we have our individual categories named. Note how we defined a vector of lake names. In R you define vectors by c(), with the elements of the vector separated by commas inside the brackets. But we have no idea what the y-axis represents. To add this in we’ll use the ylab argument, and we might as well specify our x-axis at the same time using the xlab argument.

boxplot(x,y,las=1,names=c(“Lake 1″,”Lake 2″),ylab=”Total Phosphorus”,xlab=”Lake”)

blogpost4

#Already, our graph looks pretty nice. If, however, we want to be petulant and redefine some element of our graph completely R is the program for us. We can do almost anything to our figure. To abolish the axis we have we put into the xaxt argument “n” to signify we don’t want an x-axis.

boxplot(x,y,las=1,names=c(“Lake 1″,”Lake 2″),ylab=”Total Phosphorus”,xlab=”Lake”,xaxt=”n”)

#Now we’re free to define our own axis using the axis command. N.B. that the axis command a command unto itself and not an argument of boxplot. It has to be run after boxplot, and boxplot has to be run every time you want to change the axis (otherwise it will just keep throwing the results of axis ontop of each other).

boxplot(x,y,las=1,names=c(“Lake 1″,”Lake 2″),ylab=”Total Phosphorus”,xlab=”Lake”,xaxt=”n”)
axis(1,at=c(1:2),labels=c(“Lake 1″,”Lake 2”),tick=TRUE)

#Here our figure looks the same, but only because we wanted it to. We could have done it with our tick marks wherever:

boxplot(x,y,las=1,names=c(“Lake 1″,”Lake 2″),ylab=”Total Phosphorus”,xlab=”Lake”,xaxt=”n”)
axis(1,at=c(0.7,2.2),labels=c(“Lake 1″,”Lake 2”),tick=TRUE)

blogpost5

#or with internal ticks:
boxplot(x,y,las=1,names=c(“Lake 1″,”Lake 2″),ylab=”Total Phosphorus”,xlab=”Lake”,xaxt=”n”)
axis(1,at=c(0.7,2.2),labels=c(“Lake 1″,”Lake 2”),tick=TRUE,tck=0.03,mpg=c(1,-1.5,0))

blogpost6

#Note here that only the x-axis ticks become internal because we are explicitly working with the x-axis (the first argument of axis asks which axis we are working with and 1 = the x-axis). We could do internal ticks with the y-axis if we ran the axis command for axis 2.

#That seems like enough for now. I realise I never said how to rotate your axes by 45º, which is more tricky than rotating them by 90º. It involves the srt command, which defines the amount of rotation, but the kicker is that srt only works on a text object. For those who are interested, the process is summarized in the R FAQ at: http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-create-rotated-axis-labels_003f

Vegan

June 12, 2008

I will not lie, this post is meant as a soppy, indecent, maudlin love-letter to the R package Vegan. The package is meant for community ecologists, and comes with a variety of functions that I love.

The package is mostly intended for community ecologists, especially those interested in ordination.

It provides an option for NMDS that improves on the functions provided in the MASS and Stats packages by including multiple random starts (which is important because

NMDS can be easily trapped in local stress minima).

There are attractive (i.e. flexible) commands for plotting your ordinations as biplots in two or three dimensions (ordiplot, ordiplot3d).

And there are other helpful functions, such as commonly used ecological data transformations (decostand)

and the calculation of often used community descriptors (e.g. diversity, with your choice of Shannon or Simpson (diversity))

The package seems to have been authored mostly by Jari Oksanen at the University of Oulu. He seems really nice and mans the help forum at:

http://r-forge.r-project.org/projects/vegan/

The vegan package, like sun in a blue sky, fills me with warmth


As I comment in a previous post, before you start working you need to tell R where to find the data you are interested in. You can do this using Misc -> Change Working Directory (shortcut ⌘D on a mac), but this can get tedious if you have to keep changing working directories. A better way of doing this is to change your working directory manually once, then find out what the directory is, and add a command to change to this working directory at the beginning of your document (or, more specifically, before you need the data file from that folder).

#Change your directory manually the first time
getwd()
setwd(“your directory path gleaned from the R console here – keep the quotes”)
data <- read.table(“youregoodtogo.txt”,header=TRUE)
attach(data)

Exporting dataframes

June 12, 2008

Once your data are in R, you may do crazy things to them. You may transform them, you may manipulate them, etc. etc. Sometimes I find it much easier to transform my data in R (with a simple line of code) than it would be to create whole spreadsheets and drag formulae everywhere as one would have to do in Excel.

The question then becomes, how do you get this lovely transformed data out in some format that you can use in Excel?

Let’s say you have a community matrix of horse abundances, horsies. You wish to apply a Hellinger transformation to this matrix to make it [more] appropriate for a PCA, which you can do using the function decostand() provided in the community ecology package vegan. But you don’t want to do your PCA in R! You want to do it in CANOCO! How do we get our data out to Excel so we can import it into Canoco?

#Transform your data
horsies.hel<-decostand(horsies,”hellinger”)
#Write your transformed data to a file that Excel can interpret
write.table(horsies.hel,file=”Hellinger Transformed Horse Data.csv”,sep=”,”,col.names=TRUE,row.names=FALSE)

This should give you a comma separated data file which can then be saved into another format by Excel (e.g. and Excel file, or a tab-delimited text file). I’m still trying to get a tab-delimited text file out of write.table(), but I have yet to succeed.

Getting Started

June 12, 2008

The first thing most people will probably want to do in R is get some data to work with. While you can input your data directly into R, the program does not provide an ideal environment for this. Really, you are much better off setting up your data in an Excel worksheet, and then importing the file into R.

Some important points:

1) Save your file as a tab delimited text file, this is the kind of file R likes to play with.

2) R does not like spaces.
This means every column header, character level, etc. etc. must be space free. Think of labels that don’t involve spaces, or use periods or underscores instead.

As with every procedure in R there are multiple ways to get your data in the system, here is what I usually do:

1) Change your working directory: R needs to know where your file is found. Misc -> Change Working Directory, pick the folder that includes your data file.

2) Get your data in the system: To get your data file into R use the following command:

read.table(“Your Data File”,header=T).

The header argument is used to indicate whether your first row is column names or data. TRUE implies column names, FALSE implies data. I always make the dataframe into an object, e.g.

horsies <-read.table(“FakeData.txt”,header=TRUE)

Now FakeData.txt is represented by my object horsies.

3) Attach your data. The data are in the system, but R still doesn’t know we want to play with them. To make your data front and centre to R use the following command:

attach(horsies).

You’re ready to start playing with stuff in R. To be sure that my data is the right data I usually pull out some uninteresting command just to make sure things are still going according to plan. For example:

names(horsies)

will return all the column names for my object horsies. That way I usually know if I’m working with the right dataframe.

PCA

June 12, 2008

I know of at least two ways to perform PCAs in R, both are very straightforward, and provide the exact same results, though they provide different default outputs. I provide my best understanding of PCA issues, but welcome additional tips.

The first [set] is the prcomp() command from the Stats package (Q-mode) and princomp() (R mode), the second is the PCA function written by Pierre Legendre, and available from his website (http://www.bio.umontreal.ca/legendre/indexEn.html#RFunctions). Legendre’s file includes a function to create a biplot of the PCA results (and can provide either Q or R mode PCA depending on the scaling option you choose).

Both offer easy options with respect to centering (which I believe to be a strictly aesthetic thing), standardization and scaling. Legendre’s function provides the eigenvalues for the components front and centre, whereas with prcomp() you have to have to ask for them using the summary() command. I personally prefer prcomp. You don’t have to define the function before each use, as you do with Legendre’s code,  prcomp gives you the eigenvector loadings front and centre, and you can get the principle components scores for each site using the scores() command, and then feed them into other functions that accept scores objects (lots of functions in vegan, etc. etc.).

#PCA using the prcomp() function
results <- prcomp(communitymatrix,center=TRUE,scale=TRUE)
summary(results)
scores(results)

Whether or not to standardize is the issue that I have the most trouble grappling with in PCAs. If your data has different units the problem is easy enough – you must use the correlations matrix (i.e. scale=TRUE). This matrix is based on variables standardized to a mean of 0 and unit variance (in this case the species vectors on a biplot indicate how well your ordination diagram approximates the species values (Leps ans Smilaurer (2003)) . If your data is in identical units, and the differences in the variance between variables is important, the covariance matrix may be appropriate (i.e. scale=FALSE),(here the species vectors on a biplot shows the variability in that species across the ordination space (Leps and Smilauer (2003)).

Note that this decision is important, and affects the outcome of your PCA. Quinn and Keough (2002) suggest that the choice on standardization depends on whether you wish for the variances in the species to become an important part of your analysis. If so, you should analyze the covariance matrix (that is the matrix derived by NOT standardizing), but if not, you should analyze the correlation matrix (that is the matrix derived by standardizing).

Another choice you will probably have to make is regarding the scaling of your ordination plot. Unlike the choice regarding standardization, this does not affect the output of your PCA, just the layout of your biplot. Essentially you are offered scaling options which preserve the distances between samples, or the relationships between species.

The R environment

June 12, 2008

So, let’s begin by touring the R interface. For me there are 3 useful parts of the R interface:

1) The R console: this is where R executes all of your commands, and this is where the output that R generates will be displayed.

2) The Quartz device window: this is where figures you generate will be shown. Only one window can be active at a time, which generally means that when you generate a new figure you lose the one you had there before. You can have more than one quartz window open, however, allowing you to have figures side by side (to open a second quartz window go to Window and select New Quartz Device Window).

3)The document window: this is where all of your coding action should happen. R doesn’t open this automatically, you have to do it yourself by going to File and picking New document. In this window you can write out all of the codes that you want to execute, and then execute them by pressing ⌘ Enter (on a mac).

Using this window allows you to fine tune your commands, if they don’t work, or if you want to spruce them up. And, most importantly, you can save your commands for use in the future, being able to recreate your analysis with a simple click of ⌘ Enter.