A simple look at linear regression with a sample dataset that’s not really Psychology-related, but allows us to make some of the important points.
The dataset contains the 2008-2009 salaries (for 9 months, as is common there) of professors at a US college. There are n=397 observations of m=6 variables. The data ship with the R package car, which accompanies the book An R Companion to Applied Regression.
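The rest of this section assumes that the Salaries data frame is available in the R session. A minimal sketch of one way to load it, assuming the carData package is installed (it is a dependency of car, so installing car should bring it along):

```r
# Assumption: carData is installed, e.g. via install.packages("car").
data("Salaries", package = "carData")  # load the data frame into the workspace

dim(Salaries)  # rows (observations) and columns (variables)
str(Salaries)  # type of each of the variables
```

The plots further down additionally need ggplot2, loaded with library(ggplot2).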
First let’s have a look at the table of data, to get an idea. Use the function head to display the first few lines of the data table.
head(Salaries)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000
Next, let’s visualise some aspects of the data. In psychology speak, we are interested in explaining the dependent variable, which in this case is salary (although we could make an argument that rank, i.e. whether you are an Assistant or Full Prof, is also a kind of dependent variable).
To see what the distribution of salaries is, regardless of the various ways in which we could split up the data:
ggplot(Salaries, aes(x = salary)) + geom_histogram(binwidth = 5000) +
labs(title="Academic Salaries (US), histogram")
… and if we split by rank?
ggplot(Salaries, aes(x = salary, fill=rank)) + geom_histogram(binwidth = 5000) +
facet_wrap(~rank,ncol=1) +
labs(title="Academic Salaries (US), faceted by rank")
… and by gender?
ggplot(Salaries, aes(x = salary)) +
geom_histogram(binwidth = 5000) +
facet_wrap(~sex,ncol=1) +
labs(title="Is there a gender gap in these data?")
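Histograms of groups that need not be the same size can be hard to compare by eye, so it may help to put a few numbers next to the plots. A minimal sketch using base R only, again assuming carData is installed; the names counts, by_sex and by_rank_sex are just illustrative:

```r
# Assumption: carData is installed.
data("Salaries", package = "carData")

# How many observations are there in each group?
counts <- table(Salaries$sex)
counts

# Median salary by sex, and by rank within sex
by_sex <- aggregate(salary ~ sex, data = Salaries, FUN = median)
by_rank_sex <- aggregate(salary ~ rank + sex, data = Salaries, FUN = median)
by_sex
by_rank_sex
```

Splitting by rank as well as sex matters here because a raw difference between the sexes could partly reflect a difference in how they are spread across ranks.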
… and does the discipline in which the Profs work matter?
A scatter plot of salary against years of service:
ggplot(Salaries, aes(x = yrs.service, y = salary)) +
geom_point() +
labs(title="Scatter plot of salary (as a function of yrs of service)")
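The scatter plot is the natural starting point for a simple linear regression of salary on years of service. A minimal sketch with base R's lm, assuming carData is installed; fit is just an illustrative name:

```r
# Assumption: carData is installed.
data("Salaries", package = "carData")

# Simple linear regression: salary as a function of years of service
fit <- lm(salary ~ yrs.service, data = Salaries)

coef(fit)     # intercept and slope (change in salary per extra year of service)
summary(fit)  # standard errors, t-tests and R-squared
```

In ggplot2, adding geom_smooth(method = "lm") to the scatter plot draws the corresponding fitted line.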