nervousdata

Raspel

Automated auto-correction of random sequences of characters.


Haspeln (German)

― Haspeln, 313 words  ↷ HTML

A poem made from a limited reservoir of letters: " grund", " hitze" and " kraut". Lightly edited by hand, including the insertion of punctuation.

Rasp (English)

― Rasp. A novel, 50,612 words  ↷ HTML

Rasp is my English-language contribution to Nano-NaNoGenMo 2022. NaNoGenMo (short for National Novel Generation Month) is an annual online event where people gather to write code that generates a novel of 50,000+ words. Nick Montfort introduced Nano-NaNoGenMo, a subgenre for novel-generating computer programs that are at most 256 characters long.

Every paragraph in Rasp consists of three sentences. Each sentence is derived from random sequences of at most 7 characters, spaces included. The available characters are defined in three sets: " string", " frozen" and " wreath".

The process

Program

The program is a Bash shell script that can be executed from the command line in Linux. It makes use of some UNIX command line utilities and Hunspell, a common spell checker that is used in LibreOffice, for example.

First, it creates an endless stream of random bytes by reading from /dev/urandom. “The random number generator gathers environmental noise from device drivers and other sources into an entropy pool.” (Linux Programmer’s Manual). Second, this stream is reduced to a chosen set of characters with the tr command.

cat /dev/urandom | tr -cd " string" – Delete every character from the stream except s, t, r, i, n, g and the space.

If you execute this line, characters will be printed indefinitely. Therefore, some more operations to shape the output are added.
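To look at the filtered characters without an endless stream, the input can be capped first. The following is a minimal sketch, not part of the original script: the 4096 random bytes are an arbitrary amount chosen here, and LC_ALL=C is an assumption made for portability so that tr treats the input as raw bytes.

```shell
#!/bin/bash
# Read a fixed number of random bytes instead of an endless stream.
# 4096 is an arbitrary choice; on average only 7 of 256 byte values survive the filter.
sample=$(head -c 4096 /dev/urandom | LC_ALL=C tr -cd " string")

# Print what is left: only s, t, r, i, n, g and spaces.
echo "$sample"
```

The output differs on every run, but it never contains a character outside the set.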

fold -w 7 – Wrap the stream into lines of 7 characters.

head -c 110 – Keep only the first 110 characters (bytes) of the input.

tr "\n" " " – Replace every line break with a single space.
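Chained together, the three shaping steps turn the endless stream into one short line of raw material. The sketch below wraps them in a helper called raw_chunk, a hypothetical name that does not appear in the original script; LC_ALL=C is again an assumption for portability.

```shell
#!/bin/bash
# raw_chunk SET: print 110 random characters drawn from SET on a single line.
# Hypothetical helper for illustration, not part of the original script.
raw_chunk() {
  LC_ALL=C tr -cd "$1" < /dev/urandom | fold -w 7 | head -c 110 | tr "\n" " "
}

chunk=$(raw_chunk " string")
echo "[$chunk]"
echo "${#chunk} characters"
```

Because fold inserts a line break after every 7 characters and tr turns each break into a space, no run of letters in the result is longer than 7 characters.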

The result can now be piped into Hunspell. You need to specify a dictionary, for example -d en_US for US English. With -U, Hunspell accepts its own suggestions for misspelled words, i.e. it auto-corrects them. This option is undocumented; I learned about it on a Q&A platform.

hunspell -U -d en_US – Auto-correct misspelled words using the en_US dictionary.

Characters

The crucial part was to determine which characters should belong to a set and how to limit and fold the sequence. Some combinations of characters give plenty of candidates for correction, some almost none. I wanted to set a specific ‘tone’ through the choice of characters rather than let randomness speak for itself.
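One rough way to compare character sets before running the whole pipeline is to count how many words can be spelled with a set's letters alone. This is only a proxy sketched under assumptions, not the method used for Rasp: Hunspell also corrects near misses, so the real number of candidates is higher. The helper count_candidates and the eight sample words are hypothetical; a real word list with one word per line would take their place.

```shell
#!/bin/bash
# count_candidates SET FILE: count the words in FILE (one per line, max. 7 letters)
# that consist only of letters from SET. Hypothetical helper for illustration.
count_candidates() {
  local letters=${1// /}            # drop the space from the set
  grep -cE "^[${letters}]{1,7}\$" "$2"
}

# Tiny made-up word list standing in for a real dictionary.
words=$(mktemp)
trap 'rm -f "$words"' EXIT
printf '%s\n' ring sting grits tins zero wrath heart frozen > "$words"

for set in " string" " frozen" " wreath"; do
  echo "$set: $(count_candidates "$set" "$words")"
done
```

With this sample list the counts are 4, 2 and 2; with a full dictionary the differences between sets become much more telling.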

The code

Script for Nano-NaNoGenMo. It uses a modified Hunspell dictionary file from which all words containing uppercase letters have been deleted.

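#!/bin/bash
# Three character sets; each one yields one sentence per paragraph.
a=(" string" " frozen" " wreath")
for c in {0..750}
do
  for i in {0..2}
  do
    # The pipe symbol must end the line so that the command continues on the next one.
    cat /dev/urandom | tr -dc "${a[$i]}" | fold -w 7 | head -c 110 |
      tr "\n" " " | tr -s " " | hunspell -U -d en_USO >> r.txt
  done
done
# Strip a leading and a trailing space from each line, then append a full stop.
sed -i "s/ $//;s/^ //;s/$/.&/" r.txt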

The version for Haspeln differs slightly from the code above. It stores the generated stream of characters and the auto-corrected words in two separate files and merges them at the end. It has three different sets of characters for the random generator; each set is used to make one paragraph from ~650 characters. Single-character words are deleted from the random sequence output, but are sometimes brought back by the auto-correction.

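#!/bin/bash
a=(" grund" " hitze" " kraut")
for i in {0..2}
do
  # Split ~650 random characters into words; sed removes a single-character word
  # (first match per line, as there is no g flag).
  for e in $(cat /dev/urandom | tr -dc "${a[$i]}" | fold -w 7 | head -c 650 |
    sed "s/\b\w\b//" | tr ' ' '\n')
  do
    echo "$e" >> ta.txt   # raw word
    # First Hunspell suggestion for the word, if Hunspell flags it as misspelled.
    echo "$e" | hunspell -d de_DE | grep "&" | awk -F "0:" '{print $2}' |
      xargs | cut -d, -f 1 | sed "" >> tb.txt
  done
done
# Merge: take the correction from tb.txt, fall back to the raw word from ta.txt
# wherever the tb.txt line is empty.
awk 'FNR == NR {S[FNR] = $0; next}; /^$/{$0 = S[FNR]} 1' ta.txt tb.txt |
  tr '\n' ' ' > tx.txt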
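The two less obvious steps of this version, picking the first suggestion from Hunspell's output and merging the two files, can be tried without Hunspell installed. The sketch below uses made-up sample data: the line in the format `& word count offset: suggestions` and the three words are illustrations, not real output. On Linux, GNU xargs still prints an empty line when it receives no input, which is presumably why tb.txt ends up with an empty line for every word Hunspell accepts.

```shell
#!/bin/bash
# Made-up line in Hunspell's "& word count offset: suggestions" format.
line='& grudn 3 0: Grund, grub, rund'

# Same filter chain as in the script: keep the text after "0:",
# trim it with xargs, then take the first comma-separated entry.
first=$(printf '%s\n' "$line" | grep "&" | awk -F "0:" '{print $2}' | xargs | cut -d, -f 1)
echo "$first"

# Made-up stand-ins for ta.txt (raw words) and tb.txt (corrections).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' grudn und rdnu > "$dir/ta.txt"
printf '%s\n' Grund '' rund > "$dir/tb.txt"   # empty line: no suggestion for "und"

# First pass stores ta.txt by line number; second pass fills empty tb.txt lines from it.
merged=$(awk 'FNR == NR {S[FNR] = $0; next}; /^$/{$0 = S[FNR]} 1' "$dir/ta.txt" "$dir/tb.txt" | tr '\n' ' ')
echo "$merged"
```

The first echo prints Grund; the merged line reads "Grund und rund", with the untouched raw word filling the gap.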

Thanks to Moritz v. D.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


11/2022