A simple look at linear regression, using a sample dataset that is not really psychology-related but lets us make some of the important points.

Salaries dataset

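Nine-month salaries for the 2008-2009 academic year at a US college (nine-month contracts are common there). There are n=397 observations of m=6 variables. The data come with the R package car, which accompanies the book An R Companion to Applied Regression (in current versions the datasets live in the companion package carData).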
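
The step that makes the data available is not shown here. A minimal sketch, assuming the carData and ggplot2 packages are already installed, could look like this:

```r
# Assumption: carData (the data package behind car) and ggplot2 are installed.
library(carData)   # provides the Salaries data frame
library(ggplot2)   # used for the plots further down

# Load the dataset explicitly into the workspace
data("Salaries", package = "carData")

dim(Salaries)      # number of observations (rows) and variables (columns)
str(Salaries)      # column types: which variables are factors, which are numeric
```

Checking `dim()` and `str()` first is a cheap way to confirm that the table matches the description above before plotting anything.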

First let’s have a look at the table of data, to get an idea. Use the function head to display the first few lines of the data table.

head(Salaries)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Next, let’s visualise some aspects of the data. In psychology speak, we are interested in explaining the dependent variable, which in this case is salary (although we could argue that `rank` - whether you are an Assistant or Full Professor - is also a kind of dependent variable).

Simple histogram plot

To see what the distribution of salaries is, regardless of the various ways in which we could split up the data:

ggplot(Salaries, aes(x = salary)) + geom_histogram(binwidth = 5000) +
  labs(title="Academic Salaries (US), histogram")

… and if we split by rank?

ggplot(Salaries, aes(x = salary, fill=rank)) + geom_histogram(binwidth = 5000) +
  facet_wrap(~rank,ncol=1) +
  labs(title="Academic Salaries (US), faceted by rank")

… and by gender?

ggplot(Salaries, aes(x = salary)) + 
  geom_histogram(binwidth = 5000) +
  facet_wrap(~sex,ncol=1) +
  labs(title="Is there a gender gap in these data?")
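
Count histograms are hard to compare across facets if the groups differ in size. As a rough companion to the plot, the sketch below tabulates the group sizes and compares median salaries using base R only. It is an exploratory check under the assumption that carData is installed, not a test of any gap:

```r
library(carData)

# How many observations fall into each group?
table(Salaries$sex)

# Median salary per group
aggregate(salary ~ sex, data = Salaries, FUN = median)

# Median salary per group within each rank, since rank also drives salary
aggregate(salary ~ rank + sex, data = Salaries, FUN = median)
```

Splitting by rank as well guards against reading a difference in rank composition as a difference in pay.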

… and salary against years of service?

ggplot(Salaries, aes(x = yrs.service, y = salary)) + 
  geom_point() +
  labs(title="Scatter plot of salary (as a function of yrs of service)")
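
Since the aim is linear regression, the scatter plot can be paired with a straight-line fit. The sketch below assumes the simplest possible model, with yrs.service as the only predictor. The names `fit` and `p` are ours, and a fuller analysis may well build the model differently:

```r
library(carData)
library(ggplot2)

# Simple linear regression: salary predicted from years of service
fit <- lm(salary ~ yrs.service, data = Salaries)
coef(fit)      # intercept and slope (change in salary per extra year of service)
summary(fit)   # standard errors, t-values, R-squared

# Same scatter plot with the least-squares line overlaid
p <- ggplot(Salaries, aes(x = yrs.service, y = salary)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Salary vs. years of service, with linear fit")
p
```

`geom_smooth(method = "lm")` draws the same line that `lm()` estimates, so the plot and the coefficient table can be read side by side.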