Define the conditional right / left distribution of a character as the distribution of characters immediately following / preceding it. These distributions carry important information about a language and can, for example, be used to separate vowels from consonants. However, when represented simply as frequency tables they aren't very illustrative, so I decided to make a visualization.
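As a rough illustration of the definition above, here is a minimal Python sketch that collects both conditional distributions from a string. The function name `conditional_distributions` and the toy input are my own assumptions for the example, not the code used to produce the visualizations.

```python
from collections import Counter, defaultdict


def conditional_distributions(text):
    """Return (right, left) dictionaries of conditional distributions.

    right[c] maps each character that immediately follows c to its probability;
    left[c] does the same for characters that immediately precede c.
    """
    right_counts = defaultdict(Counter)
    left_counts = defaultdict(Counter)

    # Walk over all adjacent character pairs (a, b) in the text.
    for a, b in zip(text, text[1:]):
        right_counts[a][b] += 1  # b follows a
        left_counts[b][a] += 1   # a precedes b

    def normalize(counts):
        # Turn raw counts into probabilities that sum to 1.
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()}

    right = {c: normalize(cnt) for c, cnt in right_counts.items()}
    left = {c: normalize(cnt) for c, cnt in left_counts.items()}
    return right, left


right, left = conditional_distributions("banana")
print(right["a"])  # in "banana", 'a' is only ever followed by 'n'
print(left["a"])   # 'a' is preceded by 'b' once and by 'n' twice
```

For a real corpus you would probably want to map spaces and line breaks to visible symbols first, as is done with '_' and '/' in the figures.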

A single conditional distribution $p$ is visualized in the following way. First, calculate Shannon's diversity index $d_1(p) = \exp\left(-\sum_i p_i \ln p_i\right)$, which measures how many different values a distribution can effectively take. As you may recognize, it is the exponentiated entropy. The diversity index of a uniform distribution over $n$ values is $n$.
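The index is easy to compute directly from the formula. The sketch below is only an illustration; the helper name `diversity_index` is an assumption of mine, and zero probabilities are skipped under the usual convention that 0 · ln 0 = 0.

```python
import math


def diversity_index(p):
    """Exponentiated Shannon entropy of a sequence of probabilities."""
    # Terms with zero probability contribute nothing to the entropy.
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return math.exp(entropy)


# A uniform distribution over 4 values gives an index of about 4.
print(diversity_index([0.25, 0.25, 0.25, 0.25]))

# A distribution concentrated on one value gives an index of 1.
print(diversity_index([1.0, 0.0, 0.0]))
```

A skewed distribution lands somewhere in between, which is what makes the index readable as an "effective number" of following or preceding characters.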

Then draw a bar with unit height and length $d_1(p)$, subdivide it into sections with lengths proportional to $p_i$ (so that section $i$ has length $p_i \, d_1(p)$), and sort the sections by length. The length of each section can be interpreted as the ratio of the corresponding conditional character probability to the weighted geometric mean of the $p_i$ with weights $p_i$, since $d_1(p) = 1 / \prod_i p_i^{p_i}$. The complete visualization is a stack of left $p_L$ and right $p_R$ conditional distribution bars, one pair per character, sorted by the total diversity index $d_1(p_L) \times d_1(p_R)$.
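To make the bar construction concrete, here is a small sketch that computes the section lengths for one distribution. The function name `bar_sections` is hypothetical, and sorting from longest to shortest is an assumption on my part, since only "sorted by length" is specified.

```python
import math


def bar_sections(p):
    """Return (d1, sections) for a distribution given as {char: probability}.

    sections is a list of (char, length) pairs whose lengths sum to d1.
    """
    # Total bar length: the exponentiated entropy of the distribution.
    d1 = math.exp(-sum(x * math.log(x) for x in p.values() if x > 0))

    # Each section takes a share of the bar proportional to its probability.
    sections = [(ch, x * d1) for ch, x in p.items()]

    # Assumed ordering: longest section first.
    sections.sort(key=lambda s: s[1], reverse=True)
    return d1, sections


d1, sections = bar_sections({"a": 0.5, "b": 0.25, "c": 0.25})
for ch, length in sections:
    print(ch, round(length, 3))
```

A section longer than 1 belongs to a character that is more probable than the weighted geometric mean, so the unit height gives a handy visual reference. Ordering the whole stack would then amount to sorting characters by the product of the two `d1` values of their left and right bars.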

Now let's take a look at the visualizations of English (statistics collected from Moby-Dick) and the Voynich manuscript. '_' denotes spaces, '/' denotes new lines, and '*' marks the unconditional distributions.

[Figure: Conditional character distributions in English language]

[Figure: Conditional character distributions in Voynich manuscript]

Some interesting, even if well-known, features of Voynichese are apparent:

