
> At first, mrkwords.sh will pick a random line from the model and pick the first word of the pair as the first word of our output message.

In my opinion, this is not the best way to do it. This will generate sentences that start with words that can never start sentences.

The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte. Then to generate a sentence you set the initial state to your sentinel value, and generate forwards from there until you have reached another sentinel, at which point the sentence ends. When building your dictionary, you prepend (and append) a sentinel value to each sentence you add.
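For anyone who wants to see the idea concretely, here is a minimal sketch in Python. It is not the linked project's code; the names `SENTINEL`, `build_model` and `generate` are made up for illustration, the NUL character is just one possible sentinel, and the chain is first-order (one word of state) to keep it short.

```python
import random
from collections import defaultdict

# Assumption: NUL never appears as a standalone word in the corpus.
SENTINEL = "\x00"

def build_model(sentences):
    """Map each token to the list of tokens observed right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        # Prepend and append the sentinel to every sentence.
        words = [SENTINEL] + sentence.split() + [SENTINEL]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate(model, rng=random, max_words=50):
    """Start at the sentinel and walk until the sentinel is drawn again."""
    out = []
    state = SENTINEL
    # max_words is a safety cap (an assumption, not part of the technique);
    # hitting it would cut the sentence off before a real end state.
    while len(out) < max_words:
        state = rng.choice(model[state])
        if state == SENTINEL:
            break
        out.append(state)
    return " ".join(out)

corpus = ["The cat sat.", "The dog ran."]
print(generate(build_model(corpus)))
```

Because the only successors of the sentinel are words that actually began a sentence, and the walk only stops on a transition into the sentinel, the output always starts and ends where some real sentence did.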

It might not sound like it, but this is a big improvement in the quality of sentence generation. For example, it will work out on its own that sentences start with capital letters and end with full stops, and it also tends to reduce the number of obvious "fragment" sentences that get generated. An example of a "fragment" sentence would be "own that sentences start with capital letters and": all of the state transitions are valid, but the start and end of the sentence are not actually start and end states.

(I acknowledge that the linked project actually does end sentences at valid end states, but failing to do so is another common mistake of the same type).



'^' and '$' are already the regex convention for start and end of string. With a corpus already broken into sentences, this would be readily applied.


> The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte.

The $ sign would get in the way if people actually use it in conversation, and maybe other things don't like the NUL byte.

Maybe '\x1F' (ASCII Unit Separator) or '\x1E' (ASCII Record Separator) are better candidates?


The $ sign would only get in the way if used as a standalone word, but I agree. If the sentinel is ever encountered in actual content, it should be escaped somehow.
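One way the escaping could look, as a rough sketch only: the backslash escape character and the `escape`/`unescape` helpers are hypothetical choices, not anything from the linked project. Tokens are escaped before they go into the model and unescaped when a sentence is printed.

```python
SENTINEL = "$"
ESCAPE = "\\"  # hypothetical escape character

def escape(token):
    """Make sure a content token can never be mistaken for the sentinel."""
    # Tokens that already start with the escape character are escaped too,
    # so that unescape() can undo the change unambiguously.
    if token == SENTINEL or token.startswith(ESCAPE):
        return ESCAPE + token
    return token

def unescape(token):
    """Reverse escape() when emitting generated text."""
    if token.startswith(ESCAPE):
        return token[len(ESCAPE):]
    return token

words = "that costs $ 5 or $5".split()
stored = [escape(w) for w in words]
print(stored)
print([unescape(w) for w in stored] == words)
```

Only the standalone "$" is changed; "$5" goes through untouched, which matches the point that the sentinel only collides with content when it appears as a whole word.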


Or just upper case first letter and full stop?


Because, as Mr. Obvious notes, capitalisation and stops never appear elsewhere in sentences, etc., etc.


In his dataset, all sentences start at the beginning of the line and end at its end.

This is strictly a toy, no one serious would use this for anything anyways.


> In his dataset, all sentences start at the beginning of the line and end at its end.

Right, but the sentence start is taken from the first word of a random state transition pair, not the first word of a random sentence.


The first state should be picked according to the start probabilities, which are equal to the frequency with which each word begins a sentence. That is exactly the same as picking the first word of a random sentence.
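To make the difference between the two ways of picking a first word concrete, here is a small Python sketch on a toy corpus. The function names are invented for illustration, and `pair_first_distribution` is only my reading of "pick a random line from the model and take the first word of the pair", not a port of the script.

```python
from collections import Counter

def normalise(counts):
    """Turn raw counts into probabilities."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def start_distribution(sentences):
    """Probability of each word being the first word of a random sentence."""
    firsts = [s.split()[0] for s in sentences if s.split()]
    return normalise(Counter(firsts))

def pair_first_distribution(sentences):
    """Probability of each word being the first word of a random (w1, w2) pair."""
    firsts = []
    for s in sentences:
        words = s.split()
        # Every word except the last one is the first element of some pair.
        firsts.extend(words[:-1])
    return normalise(Counter(firsts))

corpus = ["The cat sat.", "The dog ran.", "A bird flew."]
print(start_distribution(corpus))
print(pair_first_distribution(corpus))
```

On this corpus the start distribution only ever yields "The" or "A", while the pair-based pick can also start a sentence with "cat", "dog" or "bird", which is the fragment problem described upthread.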



