In an attempt to find something useful to plug into my new htmlwidget sunburstR
(see post), I rediscovered this insightful article by Peter Norvig.
English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU
The content is great, but even better, he has published the ngram data in Google Fusion Tables. As a simple example, let's look at 2-letter ngrams at the start of a word with sunburstR.
# devtools::install_github("timelyportfolio/sunburstR")
# use sunburst to analyze ngram data from Peter Norvig
# http://norvig.com/mayzner.html
library(sunburstR)
library(pipeR)
# read the csv data downloaded from the Google Fusion Table linked in the article
ngrams2 <- read.csv("./inst/examples/ngrams2.csv", stringsAsFactors = FALSE)
ngrams2 %>>%
  # let's look at ngrams at the start of a word, so columns 1 and 3
  (.[,c(1,3)]) %>>%
  # split the ngrams into a sequence by splitting each letter and adding -
  (
    data.frame(
      sequence = strsplit(.[,1],"") %>>%
        lapply( function(ng){ paste0(ng,collapse = "-") } ) %>>%
        unlist
      ,freq = .[,2]
      ,stringsAsFactors = FALSE
    )
  ) %>>%
  sunburst %>>%
  htmlwidgets::saveWidget("example_ngrams.html")
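To make the sequence step concrete, here is a minimal sketch in base R on a made-up table. The column names `ngram` and `start`, the helper `to_sequence`, and the counts are all hypothetical stand-ins, not values from Norvig's data. It only shows how each ngram becomes a hyphen-separated path with a count next to it, which, as far as I understand it, is the two-column shape that `sunburst` reads.

```r
# Toy stand-in for the ngram table; the letters and counts are invented
# for illustration and are not taken from Norvig's data.
toy <- data.frame(
  ngram = c("TH", "HE", "IN"),
  start = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each ngram into single letters and rejoin
# them with "-" so each letter becomes one level of the hierarchy.
to_sequence <- function(ngram) {
  vapply(strsplit(ngram, ""), paste0, character(1), collapse = "-")
}

# Two columns: the hyphen-separated path and its frequency.
seq_df <- data.frame(
  sequence = to_sequence(toy$ngram),
  freq = toy$start,
  stringsAsFactors = FALSE
)

print(seq_df)
```

The same helper works for longer ngrams, so a 3-letter ngram would simply produce a three-level path.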
Thanks
library(htmltools)
ngrams2 %>>%
  (
    lapply(
      seq.int(3,ncol(.))
      ,function(letpos){
        (.[,c(1,letpos)]) %>>%
          # split the ngrams into a sequence by splitting each letter and adding -
          (
            data.frame(
              sequence = strsplit(.[,1],"") %>>%
                lapply( function(ng){ paste0(ng,collapse = "-") } ) %>>%
                unlist
              ,freq = .[,2]
              ,stringsAsFactors = FALSE
            )
          ) %>>%
          sunburst
      }
    )
  ) %>>%
  tagList %>>%
  browsable
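The chain above ends in `browsable`, which only shows the combined result in the viewer. If you would rather keep a file, as `saveWidget` did for the single chart, one option is `htmltools::save_html`. This is only a rough sketch: plain `div` tags stand in for the sunburst widgets so that it runs without the ngram data, and the panel labels and the output file name are made up.

```r
library(htmltools)

# Placeholder panels standing in for the list of sunburst widgets;
# the labels are hypothetical and not column names from the data.
panels <- lapply(
  c("position 1", "position 2", "position 3"),
  function(label) tags$div(tags$h3(label))
)

# Combine the panels into one page, as in the pipeline above.
page <- browsable(tagList(panels))

# Write the page to a standalone HTML file in the temp directory.
out <- file.path(tempdir(), "ngram_panels.html")
save_html(page, out)
```

Opening the saved file in a browser should show the panels stacked in the order they were created.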