Line Graphs and Scatter Plots

Graph 1: Simple Line

Time to make a graph. First we’re going to replicate this one:

To start, we have to load up the four R libraries we will need. You need to do this every time you open R! We also have to load our fonts up. I think technically you can skip loadfonts() but it was behaving erratically for me, so we might as well do it.

library(ggplot2)
library(grid)
library(nyloncalc)
library(extrafont)
loadfonts()

The data is in threes_by_year.csv, so first load it up and take a look at how it’s formatted:
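The loading step itself isn't shown here, so this is only a minimal sketch of what it could look like. It assumes threes_by_year.csv sits in your working directory and that the dataframe is called data (the name the rest of this walkthrough uses). csv_path is a placeholder variable of mine, not something from the tutorial files.

```r
# Assumption: the CSV lives in the current working directory (see getwd()).
csv_path <- "threes_by_year.csv"

if (file.exists(csv_path)) {
  data <- read.csv(csv_path)  # read the file into a dataframe
  print(head(data))           # first six rows
  str(data)                   # column names and types
} else {
  message("Can't find ", csv_path, "; check getwd() or use the full path.")
}
```

The column names that str() prints are the ones you will hand to aes() later, so it's worth checking their exact spelling and capitalization here.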

So that I don’t have to use the dollar sign constantly, I’m going to attach data. Then I’m going to set the size of my plotting region. For the moment, all graphs need to be 8x6, which we set using dev.new(). This tells R to open a new plotting region of a specific size (I am working on the whole 8x6 thing but at the moment, if you deviate from that size it will cause problems for the Nylon Calculus badge).

attach(data)
dev.new(width=8,height=6)

A blank window will open. Just ignore that for now. We're going to 'build up' our plot. By that I just mean that we are going to create a plot object, and then we're going to make gradual modifications to it until we have it the way we want. The first step is to tell R what your Y and X axes are going to be.

plot <- ggplot(data,aes(x=year,y=threes))

This creates the plot object, tells ggplot that we are going to be using columns in the data dataframe, and then tells ggplot what those columns are. aes stands for aesthetic. This has never been the most intuitive thing about ggplot for me, but our interaction with it is going to be pretty limited. Here, I tell ggplot that part of the graphic aesthetic is that the x-axis is year and the y-axis is threes (R cares about capitalization, by the way: notice that in our first tutorial dataframe year was capitalized, and now it's not). Now we have a graph but there's basically nothing in it. Before we can see anything, we have to tell ggplot what kind of plot this is. In ggplot this is done using geoms. In this case, we want geom_line:

plot <- plot + geom_line(size=1.2,alpha=.8,colour="cyan")

So if you look at that command, what we're doing is overwriting the plot object with the plot object + this new thing. The new thing is a cyan line (I spell it colour, but ggplot accepts color too) with a line width of 1.2 and an alpha of 0.8, meaning it's 80% opaque. You can experiment for yourself with putting nothing in the parentheses, you'll just get a thinner black line. Ok, let's plot what we've got:

plot
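A quick aside on the geom_line call above. This is a small sketch, assuming a reasonably recent ggplot2, showing that the two spellings of the colour argument end up as the same setting on the layer. The object names british and american are just mine. Also worth knowing: ggplot2 releases from 3.4.0 onward prefer linewidth over size for line thickness, so size may give you a deprecation warning there.

```r
library(ggplot2)

# Build the same layer twice, once with each spelling of the argument.
british  <- geom_line(colour = "cyan", alpha = 0.8)
american <- geom_line(color  = "cyan", alpha = 0.8)

# ggplot2 standardises the name internally, so both layers carry "colour".
print(british$aes_params$colour)
print(american$aes_params$colour)
```

Either spelling is fine; just pick one and stick with it so your code is easy to search.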

Pretty simple! Your dev window should now have something in it:

Ew. We need to keep building this plot object to make it look nicer. Let's apply the Nylon Calculus theme to it, give it a header, and label the axes.

plot <- plot + theme_nyloncalc()
plot <- plot + ggtitle("Increasing Reliance on the 3-pointer")
plot <- plot + ylab("% of Shots That are 3s")
plot <- plot + xlab("Year")

That first command is just something you're going to want to do for every graph you make. The next three add the title and label our axes. Hopefully they are pretty self explanatory. Type 'plot' again and check out how far our graph has come:

Much better! ggplot chose really weird axis breaks though. Since the x-axis is year, we'd really like there to be an axis break every year or maybe every other year (every year gives us a really crowded axis). We can specify our grid lines like so:

plot <- plot + scale_x_continuous(breaks=seq(1996,2013,2))
plot <- plot + scale_y_continuous(breaks=seq(16,30,2))
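If you want to see exactly which numbers those two seq() calls hand to breaks, you can run them on their own. This is plain base R; x_breaks and y_breaks are just names I picked for the sketch.

```r
# seq(from, to, by): start at `from`, step by `by`, never go past `to`.
x_breaks <- seq(1996, 2013, 2)
y_breaks <- seq(16, 30, 2)

print(x_breaks)  # the last value is 2012, because 2014 would overshoot 2013
print(y_breaks)  # 30 is reachable in steps of 2, so it is included
```

So the upper limit is a ceiling, not a guaranteed break. If you want 2013 on the axis you would need to start from an odd year or use a step of 1.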

The scale commands can do a bunch of stuff, but for now all we're gonna do is change the breaks. We do that by setting breaks equal to a sequence that starts at 1996, proceeds in steps of 2, and stops before it would pass 2013 (so the last break is 2012). I also changed the y axis to give it a little more granularity. Now the chart looks like this:

All we're missing is the Nylon Calculus badge. This takes some doing but you don't need to worry about the code too much, it will look pretty much exactly the same for every graph. Just copy and paste this stuff in:

x_min = ggplot_build(plot)$panel$ranges[[1]]$x.range[1]
x_max = ggplot_build(plot)$panel$ranges[[1]]$x.range[2]
y_min = ggplot_build(plot)$panel$ranges[[1]]$y.range[1]
y_max = ggplot_build(plot)$panel$ranges[[1]]$y.range[2]

domain = x_max - x_min
range = y_max - y_min
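One caveat on those four lines: the $panel$ranges path matches the ggplot2 version this was written against. On more recent ggplot2 releases, as far as I know, the same numbers live under $layout$panel_params instead. Here is a hedged sketch that tries the newer location first and falls back to the older one. The toy plot p on the built-in mtcars data is only there so the snippet runs on its own.

```r
library(ggplot2)

# Toy plot on built-in data, only so there is something to build.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

built <- ggplot_build(p)

# Newer ggplot2 keeps the panel ranges here...
params <- built$layout$panel_params[[1]]
# ...older versions keep them where the code above looks.
if (is.null(params)) params <- built$panel$ranges[[1]]

x_min <- params$x.range[1]
x_max <- params$x.range[2]
y_min <- params$y.range[1]
y_max <- params$y.range[2]

print(c(x_min, x_max, y_min, y_max))
```

If the copy-and-paste version gives you NULL values or an error, this is the first thing I would check.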

And then to actually add the badge to our plot object:

plot <- plot + annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot <- plot + annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)
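If you're curious what those two annotate() calls are actually computing, here is the same arithmetic pulled out into a small function. badge_coords() is a hypothetical helper of mine, not part of nyloncalc or ggplot2, and the 0 to 100 and 0 to 10 ranges are made-up inputs chosen so the numbers are easy to read.

```r
# Hypothetical helper: reproduces the badge arithmetic from the annotate() calls.
badge_coords <- function(x_min, x_max, y_min, y_max) {
  domain <- x_max - x_min         # width of the plotting area
  rng    <- y_max - y_min         # height of the plotting area
  left   <- x_max - 0.32 * domain # badge covers the right-hand 32% of the width
  list(
    xmin   = left,
    xmax   = x_max,
    ymin   = y_min,
    ymax   = y_min + 0.07 * rng,               # badge is 7% of the height
    text_x = (left + x_max) / 2,               # horizontally centered in the badge
    text_y = (y_min + (y_min + 0.08 * rng)) / 2  # uses .08, so slightly above center
  )
}

b <- badge_coords(x_min = 0, x_max = 100, y_min = 0, y_max = 10)
str(b)
```

The badge is a rectangle pinned to the bottom-right corner, sized as a fraction of the plotting area, which is why it looks the same on every graph regardless of the axis units.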

The last thing to do is to save this image. Here I save it to my desktop - you'll want to change the filename path to whatever makes sense for you.

ggsave(filename="/Users/austinc/Desktop/nyloncalc_line.png")
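By default ggsave() saves the last plot you created or modified, at the size of the current plotting window. If you'd rather not depend on either of those, you can spell them out. This is a sketch with a toy plot p standing in for the finished plot object, and out_file is a placeholder path in the session's temp folder.

```r
library(ggplot2)

# Toy plot standing in for the finished plot object.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_line()

# Placeholder output path; swap in wherever you want the image to go.
out_file <- file.path(tempdir(), "nyloncalc_line.png")

# Name the plot and the 8x6 size explicitly instead of relying on defaults.
ggsave(filename = out_file, plot = p, width = 8, height = 6, units = "in")
```

This way the saved image stays 8x6 even if you resized the window or drew some other plot in between.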

Here it is all together, in window:

Graph 2: Grouped Lines

Let's replicate this graph:

First, load up the data in the percomparison.csv file:

This data illustrates something very important about how your data should be formatted: every row should be a single x,y coordinate. If you were just putting this data in a spreadsheet, you might have a row for Tim Duncan, and then columns for his first year, second year, third year, and so on. But for our purposes, it's best to have each row representing just a single data point on the graph. Notice that I have a 'grouping' variable here, to tell ggplot that we want a line for each player. ggplot will also automatically create a key for me based on this column (the player column), as we'll see in a second.
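If your data is in the spreadsheet-style layout, base R can turn it into the one-row-per-point layout. Here's a rough sketch using reshape(). The players and numbers are made up purely for illustration (they are not real PER values), and the column names yr1, yr2, yr3 are my own invention.

```r
# Made-up wide data: one row per player, one column per year.
wide <- data.frame(
  Player = c("Player A", "Player B"),
  yr1 = c(10, 12),
  yr2 = c(14, 15),
  yr3 = c(18, 17)
)

# Stack the year columns so each row is one (Year, PER) point for one player.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("yr1", "yr2", "yr3"),  # columns to stack
  v.names   = "PER",                   # name of the stacked value column
  timevar   = "Year",                  # name of the new year column
  times     = 1:3,                     # values for the new year column
  idvar     = "Player"                 # the grouping column
)

# Tidy up the row order and row names.
long <- long[order(long$Player, long$Year), ]
rownames(long) <- NULL
print(long)
```

The result has three columns (Player, Year, PER), which is the shape ggplot wants for grouped lines.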

As before, I'm going to attach the dataset and set up the plotting window.

attach(data)
dev.new(width=8,height=6)

If your session has been open for a while and you've reused the data object, you may start getting these warning messages from attach() telling you that you are masking other objects. You can ignore these. I start building my graph the same way with one critical difference. I'm going to add a colour parameter to my graphing aesthetic and set that colour parameter equal to player. That's because I want each player to have a different colored line. I don't have to change my geom_line at all - it already knows I want grouped lines thanks to the colour aesthetic. And of course I won't set a colour for the line this time, since that's already set in the aesthetic.

plot <- ggplot(data,aes(x=Year,y=PER,colour=Player))
plot <- plot + geom_line(size=1.2,alpha=.8)

Now to set the theme, label my axes, and pick a colour palette for the lines (that's the scale_colour_brewer line at the end):

plot <- plot + theme_nyloncalc()
plot <- plot + ggtitle("Player PER Over Career")
plot <- plot + ylab("Player Efficiency Rating")
plot <- plot + xlab("Years Since Draft")
plot <- plot + scale_colour_brewer(palette="Dark2")

And now the exact same code adds the Nylon Calculus badge:

x_min = ggplot_build(plot)$panel$ranges[[1]]$x.range[1]
x_max = ggplot_build(plot)$panel$ranges[[1]]$x.range[2]
y_min = ggplot_build(plot)$panel$ranges[[1]]$y.range[1]
y_max = ggplot_build(plot)$panel$ranges[[1]]$y.range[2]

domain = x_max - x_min
range = y_max - y_min

plot <- plot + annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot <- plot + annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)

At this point you can type plot and hit enter to see the graph, or you can just save it directly using ggsave:

ggsave(filename="/Users/austinc/Desktop/groupedlines.png")

Graph 3: Scatter Plot with Fitted Line

Next up, this graph I used in an early NC article about forcing midrange shots:

The data for this one is midshots.csv. Load the data up:

Attach the data and open a new plotting window:

attach(data)
dev.new(width=8,height=6)

And now we can start the plot. For this one, we're going to be using a different type of geom. It's geom_point:

plot <- ggplot(data,aes(x=mid*100,y=drtg,label=Over))
plot <- plot + geom_point(shape=1,alpha=.8)

I'm going to change the breaks a bit (remember you can type plot and hit enter at any time to see what the graph you've built to this point looks like), add labels, and apply the NC theme.

plot <- plot + theme_nyloncalc()
plot <- plot + ggtitle("Teams That Force Midrange Shots Excel Defensively")
plot <- plot + ylab("Defensive Rating")
plot <- plot + xlab("% of Opponent Shots Taken from Midrange")
plot <- plot + scale_x_continuous(breaks=seq(27,38,2))
plot <- plot + scale_y_continuous(breaks=seq(94,110,2))

And throw the badge on there:

x_min = ggplot_build(plot)$panel$ranges[[1]]$x.range[1]
x_max = ggplot_build(plot)$panel$ranges[[1]]$x.range[2]
y_min = ggplot_build(plot)$panel$ranges[[1]]$y.range[1]
y_max = ggplot_build(plot)$panel$ranges[[1]]$y.range[2]

domain = x_max - x_min
range = y_max - y_min

plot <- plot + annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot <- plot + annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)

Now let's pause and see what we have. When I type plot and hit enter this is what I see:

Not exactly what I promised. We're missing two things: a fit line and team labels. Let's deal with the fit line first because that's really easy. Lines of best fit are just a geom type you add to the graph. In this case I'm going to use geom_smooth, but try replacing it with geom_line and seeing what happens, and give geom_smooth(method="lm") a whirl too.

plot <- plot + geom_smooth()
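To make the difference between those options concrete, here is a sketch on the built-in mtcars data (not the midrange data). With no arguments and a small dataset, geom_smooth() fits a flexible loess curve, while method="lm" fits a straight least-squares line. The object names are mine.

```r
library(ggplot2)

# Shared starting point: open-circle scatter on built-in data.
base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(shape = 1)

p_default <- base + geom_smooth()                           # flexible curve with a shaded band
p_linear  <- base + geom_smooth(method = "lm")              # straight line with a shaded band
p_bare    <- base + geom_smooth(method = "lm", se = FALSE)  # straight line, no band
```

Type p_default, p_linear, or p_bare and hit enter to compare them. The shaded band is a confidence interval around the fit, and se=FALSE turns it off.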

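If you plot that, you can see we're pretty close to a finished product. The last thing is to add team labels. This can be a little tricky. The way to do it is to use geom_text. We're going to try a few versions before settling on one, so just preview this first attempt without saving it back into plot (otherwise every attempt stacks another layer of labels on top of the last):

plot + geom_text(size=2.5,family="Gulim")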

That's pretty messy. Team names are sitting right on top of the points for the team. So let's add a little space (this one is only a preview too, so plot itself is unchanged):

plot + geom_text(size=2.5,family="Gulim",vjust=1.7,hjust=.3)

Notice the vjust and hjust arguments. These are vertical justification and horizontal justification. Increasing the vertical justification moved the labels down, while increasing the horizontal justification moved them to the left a bit. With that version on screen you should see what I have below.

It's much better, and maybe you could publish this, but the perfectionist in me wants to clean it up further. Unfortunately there's no simple way to do it. What I did was I created two columns in my data, vjust and hjust, and I fiddled with the values for each team, looking at the graph and choosing a vjust and hjust value for that specific team. Then I set vjust and hjust equal to those columns such that each team received its own location coordinates. Here's the final code:

plot <- plot + geom_text(size=2.5,family="Gulim",vjust=data$vjust+.3,hjust=data$hjust+.5)
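A variation on the same idea, offered as a sketch rather than what I actually did: geom_text also accepts vjust and hjust inside aes(), which keeps each offset tied to its own row even if the data gets sorted or filtered. The dataframe below is made up purely for illustration, and labels_df is a name I picked.

```r
library(ggplot2)

# Made-up points, labels, and per-label offsets, purely for illustration.
labels_df <- data.frame(
  x     = c(1, 2, 3),
  y     = c(3, 1, 2),
  name  = c("AAA", "BBB", "CCC"),
  vjust = c(1.7, -0.7, 1.7),
  hjust = c(0.3, 0.5, 0.8)
)

# Map the offset columns as aesthetics so each label gets its own nudge.
p <- ggplot(labels_df, aes(x = x, y = y, label = name)) +
  geom_point(shape = 1) +
  geom_text(aes(vjust = vjust, hjust = hjust), size = 2.5)
```

Type p and hit enter to see each label sitting at its own offset from its point.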

And now we can save it:

ggsave(filename="/Users/austinc/Desktop/midscatter.png")

Graph 4: Grouped Scatter Plot

Last one for this category. This is a replication of a chart Ian used.

I'll let you inspect the data - it's in fryemagic.csv. You've seen these lines many times now...

attach(data)
dev.new(width=8,height=6)

Just like with our grouped lines example above, we have a grouping variable, which is the group column here. The x-axis is height of player, and the y-axis is percent of shots that are 3s. Here's the aesthetic and geom:

plot <- ggplot(data,aes(x=Height,y=X3PTA.FGA,colour=group))
plot <- plot + geom_point(shape=1,alpha=.8,size=3)

Labeling my axes and applying the theme...

plot <- plot + theme_nyloncalc()
plot <- plot + ggtitle("Visualizing the Rise of the Stretch 4")
plot <- plot + ylab("3PTA/FGA")
plot <- plot + xlab("Height in Inches")

Throw the badge on:

x_min = ggplot_build(plot)$panel$ranges[[1]]$x.range[1]
x_max = ggplot_build(plot)$panel$ranges[[1]]$x.range[2]
y_min = ggplot_build(plot)$panel$ranges[[1]]$y.range[1]
y_max = ggplot_build(plot)$panel$ranges[[1]]$y.range[2]

domain = x_max - x_min
range = y_max - y_min

plot <- plot + annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot <- plot + annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)

And we're done! If you plot that you should get the finished product above.

ggsave(filename="/Users/austinc/Desktop/fryespacing.png")

Next: Histograms