Random lipsums and Faker

This morning, Gabe posted a link to Faker, a Python module for generating random sentences, paragraphs, dates, names, addresses, phone numbers (or any other kind of number, for that matter), and other sorts of data that can be helpful when you’re testing code or design layouts. I went to its GitHub repository to look at the source code and came away both impressed and a little disappointed.

First, the impressive part. The names, addresses, and numbers are everything I’d hoped for. The names, for example, are generated from a list of over 3000 first names and nearly 500 last names. It’s a snap to create lists of names like

Frances Wunsch
Grace Crona
Ibrahim Pagac
Nicholas Marks
Justyn Heaney
Susanna Kohler
Rose Jenkins DVM
Nico Harris
Miss Mervin Collier DVM
Heaven Nolan MD

which should remind you of the people who send you spam. Their email addresses could be

leannon.elias@hotmail.com
tschowalter@stehr.com
jovan40@yahoo.com
pacocha.ned@schaden.com
joelle.wolff@schulist.net
whoppe@lebsack.biz
hbahringer@fay.com
patience49@hotmail.com
ismael.ortiz@yahoo.com
trisha.conroy@yahoo.com

If you don’t like some of the more exotic names, you could go into the source code and delete them from the lists.
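To make the list-driven approach concrete, here is a minimal sketch of how names can be assembled from word lists. It is not Faker's actual code: the tiny lists, the `fake_name` function, and the `extra_chance` parameter are all assumptions made for illustration. It does show why trimming a list is enough to change what comes out.

```python
import random

# Tiny stand-in lists (assumptions for illustration); Faker's own lists are far longer.
FIRST_NAMES = ["Frances", "Grace", "Ibrahim", "Nicholas", "Susanna", "Rose"]
LAST_NAMES = ["Wunsch", "Crona", "Pagac", "Marks", "Kohler", "Jenkins"]
PREFIXES = ["Miss", "Mr.", "Dr."]
SUFFIXES = ["DVM", "MD"]


def fake_name(rng, extra_chance=0.2):
    """Build a name from the lists, occasionally adding a prefix or suffix."""
    parts = [rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)]
    if rng.random() < extra_chance:
        parts.insert(0, rng.choice(PREFIXES))
    if rng.random() < extra_chance:
        parts.append(rng.choice(SUFFIXES))
    return " ".join(parts)


# A seeded generator makes the run repeatable.
rng = random.Random(42)
for _ in range(5):
    print(fake_name(rng))
```

Deleting an entry from `FIRST_NAMES` or `LAST_NAMES` removes it from every later result, which is the same kind of edit you would make to the real lists.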

Faker also makes it easy to create lists of numbers in any format:

 $671.10
 $606.11
$2530.05
$2603.94
$2164.67
$2777.75
$2597.97
$2983.63
$4935.32
$9221.72
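For comparison, a right-aligned column like this can be produced with nothing but the standard library. The sketch below is a plain-Python stand-in, not Faker's API. The `fake_amounts` function and its `low` and `high` bounds are assumptions chosen to resemble the sample above.

```python
import random


def fake_amounts(rng, count, low=100.0, high=10000.0):
    """Return `count` random dollar amounts as right-aligned strings."""
    amounts = [f"${rng.uniform(low, high):.2f}" for _ in range(count)]
    # Pad every entry to the width of the longest one so the column lines up.
    width = max(len(a) for a in amounts)
    return [a.rjust(width) for a in amounts]


for row in fake_amounts(random.Random(7), 10):
    print(row)
```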

What I didn’t like about Faker was the random text it generates. Here’s a paragraph:

Dolores maxime consequatur consequatur quaerat rerum. Quidem
omnis sit doloribus quas. Voluptate molestiae dolor minus
cum id ducimus. Temporibus laboriosam eos quod.

As you can see, Faker uses the same sort of cod-Latin as the traditional Lorem ipsum text. The random text is generated from a list of only about 250 words, which leads to a lot of repetition. More important to me, though, is that the generated text looks nothing like English—a departure from the more realistic names, addresses, and other data that Faker makes.

What I’d like is something closer to my random Darwin generator, which uses bigrams from Origin of Species to create chunks of text that look like real English but read like a Victorian acid trip:

Of species; with fully developed structures. I will give
another instance of this, even in the middle; the masons
always piling up the cut-away cement, and adding up all that
are constructed for one purpose, might be specified. In the
skeletons of the bones of the skull are homologous--and
which shall die--which variety whether a small isolated
areas have invariably been the case, of some preceding forms.
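The bigram idea is simple enough to sketch in a few lines. What follows is not the Darwin script itself, just a minimal illustration of the technique. The `build_bigrams` and `generate` functions and the short `CORPUS` string are assumptions, and the string stands in for a real book-length text.

```python
import random
from collections import defaultdict

# Stand-in corpus (an assumption); a real run would read a full text file.
CORPUS = (
    "the small birds of the islands differ from the birds of the mainland "
    "and the plants of the islands differ from the plants of the coast"
)


def build_bigrams(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, rng, length=30):
    """Random-walk the bigram table to produce `length` words."""
    starts = sorted(followers)
    word = rng.choice(starts)
    output = [word]
    while len(output) < length:
        choices = followers.get(word)
        if choices:
            # Repeated followers appear more than once, so common pairs win more often.
            word = rng.choice(choices)
        else:
            # Dead end: this word only appears at the very end of the corpus.
            word = rng.choice(starts)
        output.append(word)
    return " ".join(output)


table = build_bigrams(CORPUS)
print(generate(table, random.Random(3), length=20))
```

Because each word depends only on the one before it, the output is locally plausible but wanders globally. Splitting on whitespace also leaves punctuation and capitalization attached to the words, which is one way stray quotation marks and mid-sentence capitals can end up in the result.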

The downside of my Darwin script is that it’s a quick hack—albeit a quick hack that I’ve been using unchanged for three years—that often puts syntactically incorrect artifacts like unmatched quotation marks and mid-sentence capitalized words in the text. Faker, to its credit, never does that.

For now, I intend to continue to use my ;darwin TextExpander snippet and its physics-oriented cousin, ;mech, for general placeholder text and to add snippets that use Faker for more specific placeholders. What I’d really like, though, is an expanded version of Faker that generates chunks of well-formatted but random text from a corpus, possibly using the Natural Language Toolkit’s ngram functions.