In response to some discussion on Twitter, I dug into how common different letters are at different positions in a word. Since the discussion was prompted by the popular word game Wordle, the focus was on five-letter words.
My approach was to download a word list from the first Google hit for "word list download", and then throw Python’s Counter class at it.
from collections import Counter
# Download a wordlist such as the one available here:
# https://github.com/dwyl/english-words
with open("../resources/words/words_alpha.txt") as d:
    wl = [x.strip() for x in d.readlines()]
firsts = [x[0] for x in wl]
fives = [x for x in wl if len(x) == 5]
def top_letters(ctr, n=10):
    return ",".join([x[0].upper() for x in ctr.most_common(n)])
print("For all words in the list")
print("Most common first letters")
print(top_letters(Counter(firsts)))
print("Most common all positions")
print(top_letters(Counter("".join(wl))))
print("\n")
print("For five letter words:")
print("Top letters any position")
print(top_letters(Counter("".join(fives))))
for i in range(5):
    print(f"Top letters in position {i}")
    letters = [x[i] for x in fives]
    print(top_letters(Counter(letters)))
And here’s the result:
For all words in the list
Most common first letters
S,P,C,A,U,M,T,D,B,R
Most common all positions
E,I,A,O,N,S,R,T,L,C
For five letter words:
Top letters any position
A,E,S,O,R,I,L,T,N,U
Top letters in position 0
S,C,A,B,T,P,M,D,G,F
Top letters in position 1
A,O,E,I,U,R,L,H,N,T
Top letters in position 2
R,A,I,N,O,L,E,U,T,S
Top letters in position 3
E,A,I,T,N,L,O,R,S,U
Top letters in position 4
S,E,Y,A,T,N,R,D,L,O
So CARES is a pretty good first word, since it contains the most common letter in four out of five positions and the second most common letter in the other position.
It also contains four of the five most common letters in five-letter words across all positions.
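One way to make that reasoning mechanical is to score every candidate by how often its letters occur in the same positions across the list. This is only a rough sketch: the names position_counts and positional_score are made up for illustration, the sample list is invented, and the score says nothing about repeated letters or about how much a guess actually narrows the field.

```python
from collections import Counter


def position_counts(words, length=5):
    """Return one Counter per position, built from words of the given length."""
    same_length = [w for w in words if len(w) == length]
    return [Counter(w[i] for w in same_length) for i in range(length)]


def positional_score(word, counters):
    """Sum, over positions, how many listed words share this word's letter there."""
    return sum(counters[i][ch] for i, ch in enumerate(word))


# Tiny invented sample; swap in the `fives` list from the script for real results.
sample = ["cares", "bares", "cores", "canes", "slate", "crane", "pious"]
counters = position_counts(sample)

# Highest score first: the word whose letters are most typical for their positions.
ranked = sorted(sample, key=lambda w: positional_score(w, counters), reverse=True)
print(ranked[0])
```

Run against the full five-letter list, the top of the ranking gives candidate opening words by this one measure.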
If you’re thinking “Wait a minute, the most common letters are ETAOINSHRDL, or something like that”, you’re right, but that’s taking account of word frequency, which this word list does not.
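To see why word frequency matters, here is a toy comparison between counting each distinct word once (what a word list does) and counting every occurrence (what running prose does). The sentence is invented purely for illustration.

```python
from collections import Counter

# Invented scrap of prose, purely for illustration.
text = "the cat and the dog and the bird saw the zebra"
tokens = text.split()

# Word-list view: each distinct word counts once, however often it is used.
by_type = Counter("".join(set(tokens)))

# Running-text view: each word counts every time it occurs.
by_token = Counter("".join(tokens))

print("h:", by_type["h"], "vs", by_token["h"])
print("z:", by_type["z"], "vs", by_token["z"])
```

A letter that sits in a very common word like "the" gets a big boost in the running-text count, while a letter that only appears in a rare word does not. That may be part of why H makes the prose-based rankings but not the word-list ones.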
So let’s find a corpus of English prose we can count the letters in.
What about, for example, a plain text version of the complete works of Shakespeare?
with open("../resources/words/shakespeare.txt") as d:
    swl = [x.strip() for x in d.readlines()]
print("For the complete works of Shakespeare")
print("Most common letters")
print(top_letters(Counter("".join(swl)), n=11))
This yields E,T,O,A,H,S,N,R,I,L.
I use n=11 here because the most common character is in fact the space, which takes the first slot.
I could strip out spaces, split it into words, and repeat the exercise of finding the most common letters in five-letter words etc., but I couldn’t be bothered.
And, in honour of its new status as a work in the public domain, running the same process on Winnie the Pooh yields E,T,O,A,I,N,H,S,R,D.
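For the record, the step skipped above (splitting the prose into words and repeating the five-letter exercise) might look something like the sketch below. It makes a few assumptions that the scripts above do not: it lowercases everything, treats any run of ASCII letters as a word (so "don't" becomes "don" and "t"), and uses a made-up sentence in place of the real text. The helper name five_letter_tokens is hypothetical, and top_letters is restated so the block runs on its own.

```python
import re
from collections import Counter


def five_letter_tokens(text):
    """Lowercase the text and return every run of exactly five ASCII letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) == 5]


def top_letters(ctr, n=10):
    """Same idea as the helper in the script: the n most common keys, uppercased."""
    return ",".join(letter.upper() for letter, _ in ctr.most_common(n))


# Stand-in sentence; in practice, read shakespeare.txt (or similar) into `text`.
text = "There would never be peace while those three lived under these trees."
tokens = five_letter_tokens(text)

# Repeated words stay in `tokens`, so these counts are weighted by word frequency.
print("Any position:", top_letters(Counter("".join(tokens)), n=3))
for i in range(5):
    print(f"Position {i}:", top_letters(Counter(w[i] for w in tokens), n=3))
```

Because the tokens are not deduplicated, common words count once per occurrence, which is exactly the word-frequency weighting the word list lacks.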