Back to: [Classical Cryptography and list of ciphers]
[Classical Cryptanalysis] [Cryptography Home]
On this page: [Introduction] [Frequency Analysis] [Index of Coincidence]
[Chi-Squared test] [Unicity Distance] [References]
This page is still under construction
Introduction
This section describes all the commonly used statistical methods for cryptanalysis of classical ciphers.
Most of the methods described here are simple, more complex methods such as markov models are described
in their own section.
Frequency Analysis
Frequency analysis is the practice of counting the number of occurances of different ciphertext characters in the
hope that the information can be used to break ciphers. Frequency analysis is most effective on substitution type
ciphers such as the caesar cipher, substitution cipher, polybius square etc. The following javascript will count the occurance of each character
and display the result. See
this section on applications of frequency counts for solving substitution ciphers.
The javascript above counts monogram frequencies, however it is also possible to count bigram and trigram frequencies. Bigram frequency
distributions are generally 'flatter' than the monogram counts because there are far more bigrams (676) than monograms (26). This means it
is more difficult to base conclusions on bigram frequency information, unless you have very large amounts of ciphertext. The situation is
even worse for trigrams.
Index of Coincidence
To put it simply, the index of coincidence is an indicator that you are working with text that
repeats roughly as often as English text (assuming you are working with english).
The index of coincidence can be used to distinguish monoalphabetic ciphers from polyalphabetic ciphers,
or in determining the key length of a polyalphabetic cipher.
Chi-Squared test
Unicity Distance
The unicity distance is a number that indicates the quantity of ciphertext required in order to
uniquely determine the plaintext of a message. It is a function of the length of the key used to encipher
the message and the statistical nature of the plaintext language. Given enough time, it is guaranteed that
any cipher can be broken given a length of ciphertext such that the unicity distance is 1. The unicity
distance is computed as:
H(K)
(|M|-H(M))
where H(K) is the information content of the key, H(M) is the information content per symbol of the message,
and |M| is the information content per symbol of the message assuming that all symbols are equally likely.
If K is chosen at random, H(K)=|K|, so for randomly chosen letters in English, H(K)=|K|=log2(26)=4.7 bits.
H(M) has been empirically determined to be 2.9 bits for English, so the unicity distance for English is 1
when |M| is (4.7/1.8)*|K|. In other words, any time the length of all messages sent under a particular key
in English exceeds 2.6 times the length of that key, the key and plaintext can eventually be determined.
Shannon noted that in a system with an infinite length random key, the unicity distance is infinite,
and that for any alphabetic substitution cipher with a random key of length greater than or equal to the
length of the message, plaintext cannot be derived from ciphertext alone. This type of cipher is called a
"one-time-pad", because of the use of pads of paper to implement it in WW2 and before.
References
Choose your way out: [Classical Cryptography and list of ciphers]
[Classical Cryptanalysis] [Cryptography Home]
[Modern Cryptography] [Modern Cryptanalysis]