Histograms

Graph 1: Simple Histogram

We're not going to look at a bunch of complexity in this one, just make a simple histogram. Histograms show the distribution of a data set. The graph below, for example, shows where shots in the NBA come from.

Load up our libraries...

library(ggplot2)
library(grid)
library(nyloncalc)
library(extrafont)

The data is in shot_distance.csv, so first load it up and take a look at how it’s formatted:
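A minimal sketch of that loading step, assuming the CSV sits in the working directory and has a shot_distance column (the column name is taken from the plotting code further down; load_shots is a hypothetical helper, not part of any package):

```r
# Hypothetical helper: read the shot log and check the column we plan to plot.
load_shots <- function(path = "shot_distance.csv") {
  shots <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot("shot_distance" %in% names(shots))
  shots
}

# Only runs when the file is actually present in the working directory.
if (file.exists("shot_distance.csv")) {
  data <- load_shots()
  print(head(data))                    # first few rows, to see the format
  print(nrow(data))                    # one row per shot
  print(summary(data$shot_distance))   # quick look at the spread
}
```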

You'll notice there's a lot of data! This is every shot taken in 2013-2014. I'm breaking my cardinal rule on this one: instead of having every row represent one plotted data point, this is the rare case where we're just using the raw data and letting ggplot2 summarize it for us. Now I'll attach the data and create the plotting window.

attach(data)
dev.new(width=8,height=6)

For the actual plot, the geom we want is geom_density. The y-axis in this graph is not in our data; it's derived from the data, so we only supply an x-variable.

plot <- ggplot(data,aes(x=shot_distance))
plot <- plot + geom_density(alpha=.4,fill="#1b9e77")

I chose the color for the histogram arbitrarily. #1b9e77 is a color hex code. If you google 'color hex codes' you'll find guides to them. You can also just specify the name of a color. In this case we specify a fill rather than a colour: fill sets the shaded area under the curve, while colour sets the outline. Don't worry too much about that detail; ggplot2 uses different terms for different sorts of color changes: lines, fills, groups, etc. I'm going to throw some labels on things and apply the theme.
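To see what a hex code actually encodes, here is a small base R sketch. It only uses col2rgb() and colors() from base R; "seagreen" is just an example name picked for illustration, not a color the original plot uses.

```r
# Decode the fill color into its red, green and blue parts (each 0-255).
rgb_parts <- col2rgb("#1b9e77")
print(rgb_parts)

# Hex codes are case-insensitive, so both spellings give the same color.
same_colour <- all(col2rgb("#1B9E77") == col2rgb("#1b9e77"))

# Named colors work too; colors() lists every name R knows.
has_name <- "seagreen" %in% colors()
```

Each pair of hex digits is one channel, so 1b, 9e and 77 are the red, green and blue amounts.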

plot <- plot + theme_nyloncalc()
plot <- plot + ggtitle("The Distribution of NBA Shots")
plot <- plot + ylab("Density")
plot <- plot + xlab("Distance From Basket (feet)")

And putting the badge on it:

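# ggplot2 3.0.0 and later keep the panel ranges under $layout$panel_params (older releases used $panel$ranges)
x_min = ggplot_build(plot)$layout$panel_params[[1]]$x.range[1]
x_max = ggplot_build(plot)$layout$panel_params[[1]]$x.range[2]
y_min = ggplot_build(plot)$layout$panel_params[[1]]$y.range[1]
y_max = ggplot_build(plot)$layout$panel_params[[1]]$y.range[2]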

domain = x_max - x_min
range = y_max - y_min

plot <- plot + annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot <- plot + annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)
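Where the panel ranges live inside the object returned by ggplot_build has moved between ggplot2 releases. The sketch below is a hypothetical helper (panel_ranges is my name, not a ggplot2 function) that tries each location I know of in turn. The data frame is a synthetic stand-in for the shot data, and the version boundaries in the comments are my understanding rather than something this post states.

```r
library(ggplot2)

# Hypothetical helper: x and y ranges of the first panel, whichever
# ggplot2 release built the plot.
panel_ranges <- function(p) {
  built <- ggplot_build(p)
  layout <- built$layout
  if (!is.null(layout)) {
    # 3.0.0 and later: panel_params; 2.2.x: panel_ranges
    params <- layout$panel_params
    if (is.null(params)) params <- layout$panel_ranges
    first <- params[[1]]
  } else {
    # releases before 2.2.0
    first <- built$panel$ranges[[1]]
  }
  list(x = first$x.range, y = first$y.range)
}

# Synthetic stand-in for the shot data
demo <- data.frame(shot_distance = c(0, 1, 2, 3, 8, 15, 22, 23, 24, 25))
p <- ggplot(demo, aes(x = shot_distance)) + geom_density()

r <- panel_ranges(p)
x_min <- r$x[1]; x_max <- r$x[2]
y_min <- r$y[1]; y_max <- r$y[2]
```

Building the plot once and reusing the result also avoids calling ggplot_build four times.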

The plot should look something like this:

The x-axis is using all values, and that's kinda bad. Before we get to that though, a word about the y-axis. Smoothed histograms like this one measure density on the y-axis, and density is hard to interpret directly. A lot of people look at histograms and want to say something like: "10% of shots come from within 2 feet of the basket." If the histogram weren't smoothed, we could interpret it like a bar chart. In fact you might want to try this: replace geom_density with geom_histogram and see what happens. But you can't read a density value off the chart in any particularly useful way, because the curve is scaled so that the area under it equals 1, which means its height depends on the units of the x-axis. This drives some people insane.
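A quick base R check of the density point, using synthetic distances rather than the real shot log (the two clusters are invented just to have something with a shape). density() is the base R kernel density estimator.

```r
set.seed(1)
# Synthetic stand-in for shot_distance: one cluster near the rim, one near the arc
shot_distance <- c(runif(400, 0, 4), runif(600, 15, 26))

# Kernel density estimate, the same kind of curve geom_density draws
d <- density(shot_distance)

# Trapezoid rule: the area under the curve comes out very close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Re-express distance in yards: same shots, but the curve is about 3x taller
d_yards <- density(shot_distance / 3)
height_ratio <- max(d_yards$y) / max(d$y)

# The interpretable number comes straight from the raw data
share_within_2ft <- mean(shot_distance <= 2)
```

The share of shots inside some distance is a statement about the raw data, so it is easier to compute it directly than to read it off the curve.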

Now to make this look a bit better, let's clean up the x-axis. There's no reason to show that long right tail. Remember from earlier that we can control axes using the scale_x_continuous function (there's also a scale_x_discrete function; if one throws errors, try the other). Starting from the top, I paste in all this graph code:

plot <- ggplot(data,aes(x=shot_distance))
plot <- plot + geom_density(alpha=.4,fill="#1b9e77")
plot <- plot + theme_nyloncalc()
plot <- plot + ggtitle("The Distribution of NBA Shots")
plot <- plot + ylab("Density")
plot <- plot + xlab("Distance From Basket (feet)")
plot <- plot + scale_x_continuous(breaks=seq(0,32,4),limits=c(0,32))

In the last line, I use both the breaks argument, to control where tick marks appear, and the limits argument, to truncate the graph at x=32. The seq(0,32,4) call generates the tick positions 0, 4, 8, and so on up to 32. The c() function that you see after limits is a common R function that combines values into a vector. It's used in R for a ton of different things, so it's a good one to know. In this case it's just telling ggplot2 that the limits for the x-axis are going to be 0 and 32.
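One thing worth knowing about limits on a scale: rows outside the limits are dropped (with a warning) before the density is computed, so the curve is re-estimated on the remaining shots only. If you would rather keep every shot in the estimate and just zoom the view, coord_cartesian is an alternative. The sketch below uses synthetic data, and the zoom variant is my suggestion, not what the original graph does.

```r
library(ggplot2)

ticks <- seq(0, 32, 4)   # tick positions: 0, 4, 8, ..., 32
x_limits <- c(0, 32)     # a two-element numeric vector

# Synthetic stand-in: mostly short shots plus a long right tail of heaves
set.seed(42)
shots <- data.frame(
  shot_distance = c(rexp(900, rate = 1 / 10), runif(100, 40, 90))
)

# How many rows scale_x_continuous(limits = x_limits) would throw away
n_dropped <- sum(shots$shot_distance < x_limits[1] |
                 shots$shot_distance > x_limits[2])

# Zoom instead of dropping: the density still uses every row
p_zoom <- ggplot(shots, aes(x = shot_distance)) +
  geom_density(alpha = .4, fill = "#1b9e77") +
  scale_x_continuous(breaks = ticks) +
  coord_cartesian(xlim = x_limits)
```

With the real shot data the two versions should look similar, since only a small share of shots come from beyond 32 feet, but the y-axis values will differ slightly.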

You could plot this now and it looks pretty good, but we need to badge it, and unfortunately this is going to require a small change to our badge code. If you look at a previous block of badge code you can see that we are creating these variables called x_min, x_max, y_min, and y_max. Since we are limiting x in the graph, we also need to change these to reflect that:

x_min = 0
x_max = 32
y_min = ggplot_build(plot)$layout$panel_params[[1]]$y.range[1]
y_max = ggplot_build(plot)$layout$panel_params[[1]]$y.range[2]

domain = x_max - x_min
range = y_max - y_min

plot <- plot + annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot <- plot + annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)

If you plot that you'll have something very close to my model graph above. In the final code I changed x_min to -1, which widens the badge slightly (its width is 32% of x_max - x_min) and looks a little better. Last step is to save it.
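The badge arithmetic is easier to follow pulled out into a function. This is a sketch only: badge_box is a hypothetical helper that mirrors the expressions in the annotate calls, and the y values passed in are made up for the comparison.

```r
# Hypothetical helper mirroring the badge arithmetic used in this post
badge_box <- function(x_min, x_max, y_min, y_max) {
  domain <- x_max - x_min
  height <- y_max - y_min
  list(
    xmin = x_max - .32 * domain,   # badge covers the right-hand 32% of the x span
    xmax = x_max,
    ymin = y_min,
    ymax = y_min + .07 * height,   # and the bottom 7% of the y span
    text_x = ((x_max - .32 * domain) + x_max) / 2,
    text_y = ((y_min + .08 * height) + y_min) / 2
  )
}

# Only the x arithmetic matters for this comparison
narrow <- badge_box(0, 32, 0, 0.1)
wide <- badge_box(-1, 32, 0, 0.1)

narrow_width <- narrow$xmax - narrow$xmin   # 10.24
wide_width <- wide$xmax - wide$xmin         # 10.56
```

Moving x_min from 0 to -1 stretches the span from 32 to 33, so the rectangle grows by about a third of a foot on the left.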

ggsave(filename="/Users/austinc/Desktop/shotdistance.png")
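Called with only a filename, ggsave writes the last plot that was created or modified and takes its size from the current graphics device. If you want the output to be reproducible without relying on dev.new, you can pass the plot and dimensions explicitly. A sketch with a throwaway plot and a temporary file path (both are placeholders, not the post's data or output location):

```r
library(ggplot2)

# Throwaway plot standing in for the finished graph
p <- ggplot(data.frame(shot_distance = c(1, 3, 8, 22, 24)),
            aes(x = shot_distance)) +
  geom_density(alpha = .4, fill = "#1b9e77")

# Temporary location; swap in your own path
out_file <- file.path(tempdir(), "shotdistance.png")

# Explicit plot and size, matching the 8 x 6 plotting window used above
ggsave(filename = out_file, plot = p, width = 8, height = 6,
       units = "in", dpi = 300)
```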

Here's all the code together:

dev.new(width=8,height=6)

plot<-ggplot(data,aes(x=shot_distance))
plot<-plot+geom_density(alpha=.4,fill="#1b9e77")
plot<-plot+theme_nyloncalc()
plot<-plot+ggtitle("The Distribution of NBA Shots")
plot<-plot+ylab("Density")
plot<-plot+xlab("Distance From Basket (feet)")
plot<-plot+scale_x_continuous(breaks=seq(0,32,4),limits=c(0,32))

x_min=-1
x_max=32
y_min=ggplot_build(plot)$layout$panel_params[[1]]$y.range[1]
y_max=ggplot_build(plot)$layout$panel_params[[1]]$y.range[2]

domain=x_max-x_min
range=y_max-y_min

plot<-plot+annotate("rect",xmin=x_max-.32*(domain),xmax=x_max,ymin=y_min,ymax=y_min+(range*.07),alpha=.8)

plot<-plot+annotate("text",x=((x_max-.32*(domain))+x_max)/2,y=((y_min+(range*.08))+y_min)/2,label="Nylon Calculus",colour="#ffffff",family="Chalk Line Outline",size=4.3,vjust=.5,hjust=.5)

ggsave(filename="/Users/austinc/Desktop/shotdistance.png")
