Playing with implementing a Markov chain that generates text similar to a given input. Running demo below.
<style type="text/css">
textarea, p, button { font: small Georgia, serif }
textarea, p { line-height: 1.5; width: 40em }
</style>
<textarea id="input">
Formally, a Markov chain is a random process with the Markov property. Often, the term "Markov chain" is used to mean a Markov process which has a discrete (finite or countable) state-space. Usually a Markov chain is defined for a discrete set of times (i.e., a discrete-time Markov chain) although some authors use the same terminology where "time" can take continuous values. The use of the term in Markov chain Monte Carlo methodology covers cases where the process is in discrete time (discrete algorithm steps) with a continuous state space. The following concentrates on the discrete-time discrete-state-space case.
A discrete-time random process involves a system which is in a certain state at each step, with the state changing randomly between steps. The steps are often thought of as moments in time, but they can equally well refer to physical distance or any other discrete measurement; formally, the steps are the integers or natural numbers, and the random process is a mapping of these to states. The Markov property states that the conditional probability distribution for the system at the next step (and in fact at all future steps) depends only on the current state of the system, and not additionally on the state of the system at previous steps.
Since the system changes randomly, it is generally impossible to predict with certainty the state of a Markov chain at a given point in the future. However, the statistical properties of the system's future can be predicted. In many applications, it is these statistical properties that are important.
The changes of state of the system are called transitions, and the probabilities associated with various state-changes are called transition probabilities. The set of all states and transition probabilities completely characterizes a Markov chain. By convention, we assume all possible states and transitions have been included in the definition of the processes, so there is always a next state and the process goes on forever.
A famous Markov chain is the so-called "drunkard's walk", a random walk on the number line where, at each step, the position may change by +1 or −1 with equal probability. From any position there are two possible transitions, to the next or previous integer. The transition probabilities depend only on the current position, not on the manner in which the position was reached. For example, the transition probabilities from 5 to 4 and 5 to 6 are both 0.5, and all other transition probabilities from 5 are 0. These probabilities are independent of whether the system was previously in 4 or 6.
Another example is the dietary habits of a creature who eats only grapes, cheese, or lettuce, and whose dietary habits conform to the following rules:
It eats exactly once a day.
If it ate cheese today, tomorrow it will eat lettuce or grapes with equal probability.
If it ate grapes today, tomorrow it will eat grapes with probability 1/10, cheese with probability 4/10 and lettuce with probability 5/10.
If it ate lettuce today, it will not eat lettuce again tomorrow but will eat grapes with probability 4/10 or cheese with probability 6/10.
This creature's eating habits can be modeled with a Markov chain since its choice tomorrow depends solely on what it ate today, not what it ate yesterday or even farther in the past. One statistical property that could be calculated is the expected percentage, over a long period, of the days on which the creature will eat grapes.
A series of independent events (for example, a series of coin flips) satisfies the formal definition of a Markov chain. However, the theory is usually applied only when the probability distribution of the next step depends non-trivially on the current state.
Many other examples of Markov chains exist.
</textarea><br/>
<button id="generate">Generate Again</button>
<p id="output"></p>
<script>
;(function() {
  // A state is a single word (or the sentence terminator '.').
  // nextStates maps each observed following word to its occurrence count.
  function State(text) {
    this.text = text
    this.nextStates = {}
  }

  // Record one observed transition from this state to the given word.
  State.prototype.addNextState = function(text) {
    this.nextStates[text] = this.nextStates[text] || 0
    this.nextStates[text]++
  }

  // Pick a following word at random, weighted by observed counts.
  State.prototype.nextState = function() {
    var total = 0, word
    for (word in this.nextStates) total += this.nextStates[word]
    var r = Math.random() * total, n = 0
    for (word in this.nextStates) {
      n += this.nextStates[word]
      if (r < n) break
    }
    return word
  }

  // A chain holds one State per distinct word, plus the empty-string
  // state '' that every sentence starts from.
  function Chain(input) {
    this.input = input
    this.states = {'': new State('')}
    this.parse(this.input)
  }

  Chain.prototype.beginState = function() {
    this.currentState = this.states['']
  }

  // Record a transition from the current state to `text`, creating the
  // target state on first sight, and advance to it.
  Chain.prototype.addState = function(text) {
    this.currentState.addNextState(text)
    this.currentState = this.states[text] = this.states[text] || new State(text)
  }

  // Build the transition counts: strip quotes, parentheses, and asterisks,
  // protect abbreviations like "Mr." from being treated as sentence ends,
  // then split into sentences and feed each word sequence into the chain.
  Chain.prototype.parse = function(input) {
    input = input.replace(/[“”",()*]/g, '')
    input = input.replace(/\b([mM][rs])\./g, '$1')
    input.split(/[;.?](\s+|$)/).forEach(function(sentence) {
      sentence = sentence.trim()
      if (sentence) {
        this.beginState()
        sentence.split(/\s+/).forEach(function(word) {
          if (word) this.addState(word)
        }, this)
        this.addState('.')  // every sentence ends in the terminator state
      }
    }, this)
  }

  // Random-walk the chain from the start state until the terminator is
  // reached (or 50 words, to guard against walks that never hit one).
  Chain.prototype.randomSentences = function(n) {
    var output = []
    for (var i = 0; i < n; i++) {
      var sentence = []
      this.beginState()
      while (this.currentState.text !== '.' && sentence.length < 50) {
        this.currentState = this.states[this.currentState.nextState()]
        sentence.push(this.currentState.text)
      }
      // Drop the terminator if we reached one, but keep the last real
      // word when the walk was cut off at the length limit.
      if (sentence[sentence.length - 1] === '.') sentence.pop()
      output.push(sentence.join(' ') + '.')
    }
    return output.join(' ')
  }

  function train() {
    window.chain = new Chain(document.getElementById('input').value)
  }

  function generate() {
    document.getElementById('output').innerText = chain.randomSentences(5)
  }

  document.getElementById('input').onchange = train
  document.getElementById('generate').onclick = generate
  train()
  generate()
})()
</script>
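As an aside, the "statistical property" the sample text mentions for the dietary creature (the expected long-run fraction of grape days) can actually be computed by iterating the transition matrix until the distribution stops changing. This is a standalone sketch, separate from the demo above; the matrix rows just encode the three rules from the text:

```javascript
// States: 0 = grapes, 1 = cheese, 2 = lettuce.
// P[i][j] = probability of eating j tomorrow given i today.
var P = [
  [0.1, 0.4, 0.5],  // grapes -> grapes 1/10, cheese 4/10, lettuce 5/10
  [0.5, 0.0, 0.5],  // cheese -> lettuce or grapes with equal probability
  [0.4, 0.6, 0.0]   // lettuce -> grapes 4/10, cheese 6/10, never lettuce
]

// One day's update: dist' = dist * P (a matrix-vector product).
function step(dist) {
  return dist.map(function(_, j) {
    return dist.reduce(function(sum, p, i) { return sum + p * P[i][j] }, 0)
  })
}

var dist = [1, 0, 0]  // start from a day the creature ate grapes
for (var day = 0; day < 100; day++) dist = step(dist)
console.log(dist[0].toFixed(4))  // → 0.3333
```

Because every column of this particular matrix also sums to 1, the stationary distribution is uniform: in the long run the creature eats grapes on one third of days, regardless of the starting state.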