> At first, mrkwords.sh will pick a random line from the model and pick the first word of the pair as the first word of our output message.
In my opinion, this is not the best way to do it. This will generate sentences that start with words that can never start sentences.
The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte. Then to generate a sentence you set the initial state to your sentinel value, and generate forwards from there until you have reached another sentinel, at which point the sentence ends. When building your dictionary, you prepend (and append) a sentinel value to each sentence you add.
It might not sound like it, but this is a big improvement in the quality of sentence generation. For example, the model will work out on its own that sentences start with capital letters and end with full stops, and it also tends to reduce the number of obvious "fragment" sentences that get generated. An example of a "fragment" sentence would be "own that sentences start with capital letters and" - all of the state transitions are valid, but the start and end of the sentence are not actually start and end states.
(I acknowledge that the linked project actually does end sentences at valid end states, but failing to do so is another common mistake of the same type).
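A minimal Python sketch of the sentinel approach described above (the corpus, function names, and choice of NUL as the sentinel are illustrative, not taken from mrkwords.sh):

```python
import random
from collections import defaultdict

SENTINEL = "\0"  # assumed never to appear in real input; escape it if it can

def build_model(sentences):
    """First-order Markov model with a sentinel prepended/appended per sentence."""
    model = defaultdict(list)
    for sentence in sentences:
        words = [SENTINEL] + sentence.split() + [SENTINEL]
        for a, b in zip(words, words[1:]):
            model[a].append(b)
    return model

def generate(model):
    """Start at the sentinel and walk until the sentinel is reached again."""
    out = []
    state = SENTINEL
    while True:
        state = random.choice(model[state])
        if state == SENTINEL:
            return " ".join(out)
        out.append(state)

corpus = ["The cat sat down.", "The dog sat up."]
print(generate(build_model(corpus)))
```

With this toy corpus, every generated sentence necessarily starts with a word that began a training sentence and ends with one that ended a training sentence, which is the whole point of the sentinel states.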
The $ sign would only get in the way if used as a standalone word, but I agree. If the sentinel is ever encountered in actual content, it should be escaped somehow.
The first random state should be picked with the start probability, which is equal to the frequency with which each word begins a sentence - and that is exactly equivalent to picking the first word of a uniformly random sentence.
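A small Python sketch of that equivalence (toy corpus and variable names are mine): both methods draw a start word with probability proportional to how many sentences begin with it.

```python
import random
from collections import Counter

# Toy corpus; assumed already split into sentences.
sentences = ["the cat sat", "the dog ran", "a bird flew"]

# Method 1: sample with explicit start probabilities, i.e. the
# frequency with which each word begins a sentence.
start_counts = Counter(s.split()[0] for s in sentences)
words = list(start_counts)
weights = [start_counts[w] for w in words]
start_a = random.choices(words, weights=weights)[0]

# Method 2: first word of a uniformly random sentence. Each sentence is
# equally likely, so each start word comes up with probability
# (sentences starting with it) / (total sentences) - the same
# distribution as Method 1.
start_b = random.choice(sentences).split()[0]

print(start_a, start_b)
```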