It has been claimed that the Voynich manuscript is an encrypted text: encrypted (natural?) human language. Others take it for a fraud, an educated hoax.
Recently I have been wondering about the statistical features of the Voynich manuscript, and about statistical features of human language texts that distinguish them from gibberish. In the following I will look at the word distribution and the entropy of the Voynich manuscript compared to Latin, modern German, and English.
Today it is well known that the occurrence of words in a natural language text (and in constructed languages as well) is not distributed equally or randomly. Since Zipf's studies (e.g., Zipf, G. 1932. Selective Studies and the Principles of Relative Frequency in Language. Harvard: Harvard University Press.) we know that if the words of a text are ranked by their frequency, the probability of the occurrence of a word is inversely proportional to its rank. This Zipfian distribution follows a power law and can easily be seen by plotting the ranked data on a log-log scale: the distribution should follow, more or less, a straight line. Normally this is shown using large corpora of texts. Since we only have one text, the Voynich manuscript, we will compare this text to three other texts in Latin (Philipp Melanchthon: Augsburg Confession), German (Franz Kafka: Die Verwandlung), and English (E. A. Poe: The Fall of the House of Usher). The Latin text has been chosen since it was written around the same time the Voynich manuscript is assumed to have been written. The modern German and English texts have been chosen without any further considerations. Any text could have been used here; the statistical features are all more or less the same.
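As a small sketch of how such a rank-frequency count can be computed (the toy sentence is, of course, just a stand-in for a real corpus text):

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Count words and return (rank, frequency) pairs, most frequent first."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    counts = Counter(words)
    return [(rank, freq) for rank, (_, freq) in enumerate(counts.most_common(), start=1)]

# Under Zipf's law, freq(r) ≈ C / r, so log(freq) ≈ log(C) - log(rank):
# plotting log(freq) against log(rank) should give a roughly straight line.
pairs = rank_frequency("the cat sat on the mat and the dog sat on the cat")
for rank, freq in pairs:
    print(rank, freq, math.log(rank), math.log(freq))
```

For a real comparison one would read each of the four texts from a file and plot the pairs on a log-log scale.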
At first sight one can see that the rank-frequency plot of the Voynich manuscript looks (more or less) identical to the plots of the texts of Melanchthon, Kafka, and Poe. The only difference: these three texts are considerably shorter, hence the maximum frequency of the words is smaller. Nonetheless, even the number of words used in the texts is not considerably different. The last plot shows a random text; it does not follow a power law and looks considerably different from the four other texts.
In German and English, the most common words are the articles. Latin does not have any articles.
The table above shows the five most frequent words in the four texts. As we can see, articles are among the top words: for both English and German, two articles are among the five highest-ranked words. In English, German, and Latin we find the word „and“ and its equivalents, as well as pronouns („er“, „I“, „quod“) and possible prepositions („zu“, „of“, „ut“).
The entropy, or information content, estimates the complexity of a message. Complexity here means the amount of information needed to, for example, compress and restore a message, with regard to the number of different characters and their distribution.
For example: Given a binary system with two (uh!) states (0,1), and a message A = (1010101010) and a message B = (1000111010), the first is less complex. There is a simple rule that describes it: every 1 is followed by a 0, and every 0 by a 1. This rule is the information contained in message A. In message B, the amount of information, or complexity, is larger. The entropy is given in bits.
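Note that a purely frequency-based entropy would be identical for A and B (both contain five 0s and five 1s); the difference in complexity only shows up when the order of the symbols is taken into account. A small sketch using the bigram (conditional) entropy, i.e., the average number of bits needed to predict the next symbol given the current one:

```python
import math
from collections import Counter

def conditional_entropy(msg):
    """H(next | current): average bits needed to predict the next symbol."""
    pair_counts = Counter(zip(msg, msg[1:]))  # bigram counts
    first_counts = Counter(msg[:-1])          # counts of the conditioning symbol
    total = len(msg) - 1
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_pair = n / total                    # P(a, b)
        p_cond = n / first_counts[a]          # P(b | a)
        h -= p_pair * math.log2(p_cond)
    return h

print(conditional_entropy("1010101010"))  # alternating, perfectly predictable: 0 bits
print(conditional_entropy("1000111010"))  # close to 1 bit: hardly predictable
```

This is only to illustrate the A/B example; as stated below, the analysis in this post uses the simple (frequency-only) entropy.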
To calculate the entropy of the texts used here (Voynich, Melanchthon, Kafka, Poe), the characters are counted and the probability of their distribution is calculated. (The so-called conditional entropy was not used: for example, in German the character „c“ is much more likely to be followed by an „h“ or „k“ than by any other consonant; in English the vowel „o“ is more likely to be followed by „u“ than by „a“. Instead I used the simple entropy, looking only at the frequency of the letters.)
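The simple (frequency-only) character entropy can be computed from the character counts like this:

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy in bits per character, from single-character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(char_entropy("aaaa"))  # one symbol only: 0 bits
print(char_entropy("abab"))  # two equally likely symbols: 1 bit
```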
To make the texts more comparable, special characters (periods, commas, and so on) have been removed. Also, the texts were converted to lower case (especially since the Latin text uses upper case letters only for names and at the beginning of paragraphs). Both the Voynich manuscript and the Latin text do not use special characters. In the Voynich transcript, special characters are used to encode unusual glyphs. This has been ignored, which will lead to a certain degree of error in the analysis of the Voynich text, since some characters are not recognized as such.
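A minimal sketch of such a normalization step (the character class is an assumption; for the Voynich transcript the set of allowed characters would have to be adapted):

```python
import re

def normalize(text):
    """Lower-case the text and keep only letters and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-zäöüß ]+", " ", text)  # drop punctuation, digits, etc.
    return re.sub(r" +", " ", text).strip()    # collapse runs of spaces

print(normalize("Die Verwandlung, Kapitel 1: Als Gregor..."))
```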
| Text | No. of unique chars | Entropy observed | Entropy expected |
|------|---------------------|------------------|------------------|
The Latin and the English texts both use the Latin alphabet with 28 characters here, including the space (of course English has the „th“, which should properly be treated as its own letter; this was ignored here). German has some further characters, 32 in total here. The Voynich transcript shows 47. This is considerably higher, which can be attributed to the transcription: sometimes groups of characters and special characters are used to describe what is in fact one character in the manuscript. This leads to a higher number of letters, but not to more complexity or entropy. On the contrary: the Voynich manuscript seems much more structured than the other texts; the lower entropy shows less complexity (and no randomness).
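The "expected" entropy here is the maximum for a given alphabet, reached when all characters are equally likely: log2(n) bits for n unique characters. A quick check with the character counts above:

```python
import math

# Maximum (expected) entropy for an alphabet of n equiprobable characters.
for name, n in [("Latin/English", 28), ("German", 32), ("Voynich transcript", 47)]:
    print(f"{name}: log2({n}) = {math.log2(n):.2f} bits")
```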
All four texts show an entropy of around 4 bits, which is what is to be expected of a human language text.
Now the statistics found above are to be reviewed and the four texts compared, starting with the word frequency. As stated above, all four frequency/rank plots are similar. All four exhibit non-random, power-law-like distributions typical of human language.
Looking at the Voynich manuscript, we have three possibilities: it is a hoax, not a real language; it is an encrypted language; it is a constructed language.
First, we will consider the last two possibilities before coming back to the first one at the end.
Looking at the five most frequent words, we can observe, first of all, a difference between the known languages Latin, German, and English. Latin does not have articles (at least written Latin does not; spoken Vulgar Latin showed the first signs of emerging articles two thousand years ago already). Articles are a rather young development in the Indo-European languages. Of the Indo-European languages spoken in Europe today, the Romance languages (descendants of Latin) and the Germanic languages (e.g., German, English) have articles. Except for the North Germanic languages, articles are usually words placed before a noun (the North Germanic languages append the definite article to the end of a noun).
If Voynich is an encrypted language from the 15th century, the question is: what language could that be? The spoken languages of that time had articles. In this case it would be likely to find articles among the top five words. Of course, it could always be a language of yet another language family and not an Indo-European language at all. In that case it would more likely not contain articles.
If it is a constructed language, we cannot know whether the language contained articles or not.
In the list above we found pronouns and prepositions among the most frequent words. We should expect to find these in the unknown language as well. And in fact, if we look at the most frequent words in the Voynich manuscript, we find „daiin“ and „aiin“, and „chedy“ and „sheddy“: words that look and probably sound alike and could be derivations of a common form (an article? a conjunction? a preposition or a pronoun?).
So far we can see that the distribution of frequency and rank of the words in the Voynich manuscript is comparable to the other texts. Furthermore, looking at the most frequent words we find patterns: words that seem to be similar or connected, maybe conjugations or declensions of one form, as we find, e.g., in the German text with the articles „die“ and „der“.
Now we will briefly look at the entropy. As found and stated above, the entropy of all four texts is comparable and close to 4 bits, which was to be expected for a human language. The information content of the Voynich text does not seem to be random, but rather follows the same laws that determine the entropy of the German, English, and Latin texts.
Looking at the statistics above: how probable is it that the Voynich manuscript is a hoax?
While Zipf's law is well known in corpus linguistics today, it was not general knowledge at the time the Voynich manuscript was written. This is not proof that the Voynich manuscript is not a hoax (in computer science there is a quite long history of efforts to create random texts that meet the common statistics of human language). But if it is one, the author must have had a very deep statistical knowledge of languages. This is even more so as the most common words seem not to be random, but somehow related, as could be expected of a human language.
If the strings were purely random, a more or less equal distribution of words could be expected (the shorter the words, the more likely they are to occur more than once, which matters only in the range of about two characters), and that is not what we find in any of the four texts!
The entropy, and thereby the distribution of letters in the Voynich manuscript, is not random. I created a random text and found its entropy to be much higher, almost 6 bits. There is no conditionality in random strings: every character is equally probable at any position of the text. The probability is one divided by the number of unique characters.
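That a uniformly random text over an alphabet of 47 characters (the size of the Voynich transcript's character set) lands near log2(47) ≈ 5.55 bits can be checked directly; a sketch with stand-in symbols:

```python
import math
import random
from collections import Counter

def char_entropy(text):
    """Shannon entropy in bits per character, from single-character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

random.seed(0)  # make the sketch reproducible
alphabet = [chr(c) for c in range(ord("a"), ord("a") + 47)]  # 47 stand-in symbols
random_text = "".join(random.choice(alphabet) for _ in range(100_000))
print(char_entropy(random_text))  # close to log2(47) ≈ 5.55 bits
```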
The entropy, or complexity, of the Voynich manuscript is a little lower than that of the German, English, and Latin texts. This indicates a more structured, less arbitrary distribution of the characters. It might hint at an artificial nature rather than at an encrypted language, though there is of course no conclusive evidence for that. (I briefly compared the entropy of texts in Tolkien's „Quenya“ and in Esperanto and found them to be closer to the Voynich manuscript than to the other texts' entropy.)
What the statistics do show is the following: the Voynich manuscript is not purely random. It either mimics a natural language very well, is an encrypted language, or, which seems most plausible to me after looking at the statistics, it is a constructed language, written in a constructed alphabet, but nonetheless a human language that contains information and a message.