Raspel
Automated auto-correct of random sequences of characters. / Automatisierte Autokorrektur von zufälligen Buchstabenreihen.
Haspeln (German)
― Haspeln, 313 words ↷ HTML
A poem made from a limited reservoir of letters: " grund", " hitze" and " kraut". Lightly edited by hand, including the insertion of punctuation.
Rasp (English)
― Rasp. A novel, 50,612 words ↷ HTML
Rasp is my English-language contribution to Nano-NaNoGenMo 2022. NaNoGenMo (short for National Novel Generation Month) is an annual online event where people gather to write code that generates a novel of 50,000+ words. Nick Montfort introduced Nano-NaNoGenMo, a subgenre for novel-generating computer programs that are at most 256 characters long.
Every paragraph in Rasp is made of three sentences. Each sentence is derived from a random sequence of at most 7 characters, spaces included. The available characters are defined in three sets: " string", " frozen" and " wreath".
The process
Program
The program is a Bash shell script that can be executed from the command line in Linux. It makes use of some UNIX command line utilities and Hunspell, a common spell checker that is also used in LibreOffice, for example.
First, it creates an infinite stream of random characters by reading from /dev/urandom. “The random number generator gathers environmental noise from device drivers and other sources into an entropy pool.” (Linux Programmer’s Manual) Second, this stream is reduced to a fixed set of characters with the tr command.
cat /dev/urandom | tr -cd " string"
– Delete all characters from the stream except s, t, r, i, n, g and the space character.
If you execute this line, characters will be printed indefinitely. Therefore, some more operations to shape the output are added.
fold -w 7
– Wrap the stream into lines of 7 characters.
head -c 110
– Print the first 110 characters of the given input.
tr "\n" " "
– Replace every line break with a single space.
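Put together, the shaping steps can be tried as one bounded pipeline. The following is a minimal sketch, not the final script: the function name random_line and the LC_ALL=C setting (which makes tr treat the stream as plain bytes) are assumptions added for illustration.

```shell
#!/bin/bash
# random_line is a hypothetical helper name, not part of the original script.
# It prints 110 characters drawn from the given set, with a space wherever
# fold inserted a line break.
random_line() {
  LC_ALL=C tr -cd "$1" < /dev/urandom | fold -w 7 | head -c 110 | tr "\n" " "
}

line=$(random_line " string")
printf '%s\n' "$line"
```

Because head stops reading after 110 characters, the otherwise endless stream from /dev/urandom ends on its own.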
The result can now be piped into Hunspell. You need to specify a dictionary, for example -d en_US for US English. With -U, Hunspell accepts its suggestions for misspelled words and auto-corrects them. This option is undocumented; I learned about it on a Q&A platform.
hunspell -U -d en_US
– Auto-correct misspelled words using the en_US dictionary.
Characters
The crucial part was to determine which characters should be part of a set and how to limit and fold the sequence. Some combinations of characters give plenty of candidates for correction, some almost none. I wanted the choice of characters to set a specific ‘tone’, rather than let randomness simply present itself.
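One rough way to compare candidate sets is to count how many words of a word list can be spelled entirely from a set's letters. This is only a proxy, since Hunspell also suggests words that are merely close to the random input. The helper name count_spellable and the sample word list are made up for this sketch.

```shell
#!/bin/bash
# count_spellable is a hypothetical helper: it counts the lines of a word
# list (read from stdin) that use only the given letters and have at most
# 7 characters. It measures exact matches only, not Hunspell suggestions.
count_spellable() {
  grep -Eic "^[$1]{1,7}\$"
}

# Illustrative word list, not taken from any dictionary file
words=$(printf '%s\n' string sting ring grin frozen zone wreath heart water)

printf '%s\n' "$words" | count_spellable "string"
printf '%s\n' "$words" | count_spellable "wreath"
```

Running the same count against a real word list for several candidate sets gives a first impression of which sets are likely to yield many corrections.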
The Code
Script for Nano-NaNoGenMo. It uses a modified Hunspell dictionary file from which all words containing uppercase letters have been deleted.
#!/bin/bash
for c in {0..750}
do
a=(" string" " frozen" " wreath")
for i in {0..2}
do
cat /dev/urandom | tr -dc "${a[$i]}" | fold -w 7 | head -c 110 | tr "\n" " " | tr -s " " | hunspell -U -d en_USO >> r.txt
done
done
sed -i "s/ $//;s/^ //;s/$/.&/" r.txt
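The final sed call tidies each line of the output: it strips one trailing space and one leading space, then appends a full stop. A small sketch on a fixed sample line (the sample text is invented) shows the effect without touching r.txt:

```shell
#!/bin/bash
# Same three substitutions as in the script, applied to an invented sample
# line instead of editing r.txt in place.
sample=" a tiny sting "
cleaned=$(printf '%s\n' "$sample" | sed "s/ $//;s/^ //;s/$/.&/")
printf '%s\n' "$cleaned"   # prints: a tiny sting.
```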
The version for Haspeln differs slightly from the code above. It stores the generated stream of characters and the auto-corrected words in two separate files and merges them at the end. It uses three different character sets for the random generator; each one produces a paragraph of ~650 characters. Single-character words are deleted from the random sequence, but the auto-correction sometimes brings them back.
#!/bin/bash
a=(" grund" " hitze" " kraut" )
for i in {0..2}
do
  for e in $(cat /dev/urandom | tr -dc "${a[$i]}" | fold -w 7 | head -c 650 | sed "s/\b\w\b//" | tr ' ' '\n')
  do
    echo "$e" >> ta.txt
    echo "$e" | hunspell -d de_DE | grep "&" | awk -F "0:" '{print $2}' | xargs | cut -d, -f 1 | sed "" >> tb.txt
  done
done
awk 'FNR == NR {S[FNR] = $0; next}; /^$/{$0 = S[FNR]} 1' ta.txt tb.txt | tr '\n' ' ' > tx.txt
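The merge step at the end can be tried in isolation. In this sketch the two files are temporary stand-ins for ta.txt and tb.txt with invented content. An empty line in the second file stands for a word that received no correction, so the original word from the first file is used instead.

```shell
#!/bin/bash
# Temporary stand-ins for ta.txt (raw words) and tb.txt (corrections);
# the words are invented for illustration.
dir=$(mktemp -d)
printf '%s\n' grnd und rudn > "$dir/ta.txt"
printf '%s\n' grund "" rund > "$dir/tb.txt"

# First file: remember every line by its line number.
# Second file: replace empty lines with the remembered word, then print.
merged=$(awk 'FNR == NR {S[FNR] = $0; next}; /^$/{$0 = S[FNR]} 1' \
  "$dir/ta.txt" "$dir/tb.txt" | tr '\n' ' ')
printf '%s\n' "$merged"   # prints: grund und rund (plus a trailing space)
rm -r "$dir"
```

The merge relies on both files having the same number of lines, one per word, which is why the script writes to ta.txt and tb.txt once per loop iteration.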
Thanks to Moritz v. D.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.