Before we start making graphs, we’ll learn a little bit about how data is loaded and manipulated in R. This is barely going to scratch the surface of R data management but it will be fine for our purposes.
First, download some sample datasets I put together at http://www.austinclemens.com/blog/data/sample_data.zip.
Now open R and load in the percomparison.csv data using the read.csv command. Type the following and then choose the percomparison.csv file from the resulting pop-up:
data <- read.csv(file.choose())
Breaking this down, this is telling R that we want to create a new object named data. The arrow (<-) is R's assignment operator: it stores whatever is on its right under the name on its left. read.csv is a function that reads comma-separated values files into R dataframes, and file.choose opens a file picker so we can find the file graphically. The resulting data object is what R calls a dataframe. There are ways to load data from Excel spreadsheets and other sources, but I recommend that you just save your data as a CSV in Excel and then load it into R this way.
We now have an R object called data with PER by year for four NBA players. You can view it by simply typing data.
Notice that the data object contains three columns: Player, Year, and PER. The numbers along the left side are not an actual column in the data, just a reference so you can quickly see what row something is in. We can retrieve columns from the data in two ways. You could include a number in brackets to tell R which column of data you want to see:
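As a rough sketch of bracket indexing, assuming a small made-up dataframe with the same three column names (the player names and values below are invented for illustration and are not the real PER data):

```r
# Toy stand-in for the percomparison data; all values are invented.
data <- data.frame(
  Player = c("Player A", "Player A", "Player B", "Player B"),
  Year   = c(1, 2, 1, 2),
  PER    = c(15.2, 18.4, 12.1, 14.9)
)

# A number in single brackets selects that column (here the second one, Year).
data[2]
```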
You can also use the column’s name by including the dollar sign operator. This command, for example, tells R to reach into the data dataframe and return only the Year column:
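A minimal sketch of the dollar sign operator, again on an invented toy dataframe that only borrows the column names Player, Year, and PER:

```r
# Toy stand-in for the percomparison data; all values are invented.
data <- data.frame(
  Player = c("Player A", "Player A", "Player B", "Player B"),
  Year   = c(1, 2, 1, 2),
  PER    = c(15.2, 18.4, 12.1, 14.9)
)

# Reach into the dataframe and pull out only the Year column.
data$Year
```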
You might notice that these are returned a bit differently. Using single brackets with a column number returns a one-column dataframe, while using the dollar sign returns a plain vector of values. We are pretty much always going to want to use the dollar sign.
You can view a summary of your data:
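Here is a sketch of the summary call on an invented toy dataframe. One assumption worth flagging: the per-player counts only show up when Player is stored as a factor. Since R 4.0, read.csv leaves text columns as plain character strings by default, so the toy data below asks for factors explicitly with stringsAsFactors = TRUE.

```r
# Toy stand-in for the percomparison data; all values are invented.
# stringsAsFactors = TRUE makes Player a factor so summary() counts rows per player.
data <- data.frame(
  Player = c("Player A", "Player A", "Player B", "Player B"),
  Year   = c(1, 2, 1, 2),
  PER    = c(15.2, 18.4, 12.1, 14.9),
  stringsAsFactors = TRUE
)

# Counts for the factor column, min/median/mean/max for the numeric columns.
summary(data)
```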
Not very useful in this case since this is already summarized data, but you can see how many years of data are available for each player, for example. In Excel you’ve probably used two columns from a dataset to create a new column by performing some mathematical operation. We can do that in R pretty easily:
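A sketch of building a new column from two existing ones, using an invented toy dataframe. The column name YearPlusPER is a hypothetical name chosen for this example:

```r
# Toy stand-in for the percomparison data; all values are invented.
data <- data.frame(
  Player = c("Player A", "Player A", "Player B", "Player B"),
  Year   = c(1, 2, 1, 2),
  PER    = c(15.2, 18.4, 12.1, 14.9)
)

# Create a new column inside data by adding two columns row by row.
# YearPlusPER is a hypothetical column name used only for illustration.
data$YearPlusPER <- data$Year + data$PER

# Print the dataframe to see the new column alongside the old ones.
data
```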
Again, this is telling R that you want to create a new object, that you want that object to be inside the data dataframe, and that you want that object to be the result of adding the Year column to the PER column. You can see the result above.
One final thing you might want to do with a dataframe is attach it. Just issue the command attach(data). Attaching a dataframe adds it to R's search path, so you can reference its columns by name without the dollar sign, like this:
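A short sketch of attach on an invented toy dataframe. The detach call at the end is my addition, to put things back the way they were once you're done:

```r
# Toy stand-in for the percomparison data; all values are invented.
data <- data.frame(
  Player = c("Player A", "Player A", "Player B", "Player B"),
  Year   = c(1, 2, 1, 2),
  PER    = c(15.2, 18.4, 12.1, 14.9)
)

# Put the dataframe's columns on the search path.
attach(data)

# Columns can now be used by name, without data$ in front.
Year
avg_per <- mean(PER)
avg_per

# Remove the dataframe from the search path again.
detach(data)
```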
So that’s working with dataframes. Let's do a really simple linear regression. The command for linear regression in R is lm. This will actually create an object with a bunch of useful properties, so we don't just want to run lm, we want to run it and store it in a variable:
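A sketch of that regression, run on invented toy data rather than the real PER file, so the numbers it prints will not match the results discussed in this post. I'm assuming the column is named Year, as in the dataframe described earlier:

```r
# Toy stand-in for the percomparison data; all values are invented.
data <- data.frame(
  Year = c(1, 2, 3, 4, 5, 6, 7),
  PER  = c(12, 16, 19, 21, 20, 17, 13)
)

# Regress PER on Year and keep the fitted model object.
results <- lm(PER ~ Year, data = data)

# Show coefficients, standard errors, p-values, and fit statistics.
summary(results)
```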
Breaking this down, I ran a linear regression and stored the resulting object in results. The code in the parentheses is the formula for my linear regression. The ~ symbol means that PER is modeled as a function of year. In other words, PER is the dependent variable, and year is the only independent variable. I then used the summary function to display the model's results. The results tell me that the year variable has a slightly negative relationship to PER, but that the variable is not significant (p=0.332).
That's to be expected though. Since this is a linear regression, it can't capture the possibility that PER initially rises with year but then falls. UNLESS of course we were to transform year. Let's do that by creating a new variable that is the square of year:
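A sketch of the transformation on invented toy data. The name year_sq follows the text; the Year column name is assumed from the dataframe described earlier:

```r
# Toy stand-in for the percomparison data; all values are invented.
data <- data.frame(
  Year = c(1, 2, 3, 4, 5, 6, 7),
  PER  = c(12, 16, 19, 21, 20, 17, 13)
)

# Square every value of Year and store it as a new column.
data$year_sq <- data$Year^2

data
```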
And now let's run the regression again, with two independent variables:
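A sketch of the two-variable regression. The toy data below are invented and deliberately shaped to rise and then fall, so treat the output as an illustration of the mechanics, not as the real PER results. The object name results2 is hypothetical:

```r
# Toy stand-in for the percomparison data; all values are invented
# and chosen so that PER rises and then falls with Year.
data <- data.frame(
  Year = c(1, 2, 3, 4, 5, 6, 7),
  PER  = c(12, 16, 19, 21, 20, 17, 13)
)
data$year_sq <- data$Year^2

# Separate independent variables with + on the right of the ~.
results2 <- lm(PER ~ Year + year_sq, data = data)

summary(results2)
```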
Now both independent variables are significant (the stars to the right of each variable's p-value tell you if the variable is significant, and at what level). The positive coefficient on year shows that PER increases as year increases at first, but the negative coefficient on year_sq means that eventually, as year_sq gets large, PER will begin to decrease with year. So that's a simple linear regression.
One other simple thing you can do with R is you can use it as a calculator:
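For example, a few arithmetic expressions typed straight at the prompt (the variable name total is just for illustration):

```r
# Basic arithmetic works as you would expect.
2 + 2          # 4
(3 + 4) * 5    # 35

# Built-in math functions are available too.
sqrt(16)       # 4

# Results can be stored and reused like any other object.
total <- 10 / 4
total          # 2.5
```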