> At first, mrkwords.sh will pick a random line from the model and pick the firs...

dredmorbius · on Nov 9, 2019

'^' and '$' are already the regex convention for start and end of string. With a corpus already broken into sentences, this would be readily applied.

beefhash · on Nov 9, 2019

> The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte.

The $ sign would get in the way if people actually use it in conversation, and maybe other tools don't like the NUL byte.

Maybe '\x1F' (ASCII Unit Separator) or '\x1E' (ASCII Record Separator) are better candidates?

jstanley · on Nov 9, 2019

The $ sign would only get in the way if used as a standalone word, but I agree. If the sentinel is ever encountered in actual content, it should be escaped somehow.
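
A minimal sketch of the sentinel-plus-escaping idea, not the actual mrkwords.sh code. It assumes a word-level bigram model, borrows '^' and '$' as the sentinels, and uses a backslash as a hypothetical escape character; the names `escape`, `unescape`, `build_model` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict

START, END = "^", "$"   # sentinels (assumed choice, following the regex convention)
ESC = "\\"              # hypothetical escape prefix for clashing words

def escape(word):
    """Prefix ESC when a real word would collide with a sentinel or with ESC itself."""
    if word in (START, END) or word.startswith(ESC):
        return ESC + word
    return word

def unescape(word):
    """Undo escape(): drop exactly one leading ESC if present."""
    return word[1:] if word.startswith(ESC) else word

def build_model(sentences):
    """Map each token to the list of tokens seen right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + [escape(w) for w in sentence.split()] + [END]
        for cur, nxt in zip(tokens, tokens[1:]):
            model[cur].append(nxt)
    return model

def generate(model, rng=random, max_words=50):
    """Walk from START until END is drawn (or max_words is reached)."""
    word = rng.choice(model[START])
    out = []
    while word != END and len(out) < max_words:
        out.append(unescape(word))
        word = rng.choice(model[word])
    return " ".join(out)
```

With this layout, a standalone "$" in the corpus is stored as "\$", so it can never be mistaken for the end-of-sentence marker, and every generated sentence begins with a word that actually began some sentence.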

korpiq · on Nov 9, 2019

Or just upper case first letter and full stop?

dredmorbius · on Nov 9, 2019

Because, as Mr. Obvious notes, capitalisation and stops never appear elsewhere in sentences, etc., etc.

make3 · on Nov 9, 2019

In his dataset, all sentences start at the beginning of the line and end at its end.

This is strictly a toy; no one serious would use this for anything anyway.

jstanley · on Nov 9, 2019

> In his dataset, all sentences start at the beginning of the line and end at its end.

Right, but the sentence start is taken from the first word of a random state transition pair, not the first word of a random sentence.

make3 · on Nov 9, 2019

The first random state should be picked according to the start probabilities, which are equal to the frequency with which each word begins a sentence. That is exactly equivalent to picking the first word of a random sentence.
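
To make the difference between the two starting strategies concrete, here is a small sketch on a made-up three-line corpus. The function names are hypothetical, and it assumes one sentence per line split on whitespace. It compares the distribution of sentence-initial words with the distribution you get by taking the first word of a random transition pair.

```python
from collections import Counter
from fractions import Fraction

def start_probabilities(sentences):
    """P(word) when picking the first word of a uniformly random sentence."""
    firsts = [s.split()[0] for s in sentences if s.split()]
    counts = Counter(firsts)
    return {w: Fraction(c, len(firsts)) for w, c in counts.items()}

def pair_first_probabilities(sentences):
    """P(word) when picking the first word of a uniformly random (word, next) pair."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        counts.update(words[:-1])  # every word except the last opens one pair
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}

corpus = ["the cat sat", "the dog ran", "a bird flew away"]
start = start_probabilities(corpus)
pairs = pair_first_probabilities(corpus)
```

Here `start` only contains "the" (2/3) and "a" (1/3), while `pairs` also gives weight to mid-sentence words such as "cat" (1/7), which is why starting from a random pair can open a sentence in the middle.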