At the weekend, inspired by the 50th anniversary shenanigans, I spent many an hour transferring my entire collection of original Doctor Who episodes from DVD onto the external hard drive on my media server, so I could access them directly through XBMC instead of faffing around with discs.
Much fun and frivolity ensued, as I wrestled for quite a while, trying to find a naming convention for the files that would allow XBMC to index the episodes [about which, more in another post].
As I scanned through the hundreds of episodes I had, I got the feeling that certain words such as ‘terror’, ‘horror’ etc. cropped up quite a lot. So I thought what jolly japery it might be to do a bit of nerdy geeky data-analysis and find out the most popular words for a Doctor Who series name. Then I could string a few of the top words together to make the ultimate Doctor Who series title.
If you want to know how I did the nerdy stuff, then read on after the big reveal. If you only want to know the results, I can tell you that the most popular words [i.e. 3 or more occurrences] used in classic Doctor Who series titles are:
99 THE
46 OF
9 DALEKS
7 SPACE
7 PLANET
6 DEATH
5 INVASION
4 TIME
4 TERROR
4 IN
4 EVIL
3 FROM
I therefore propose that the best Doctor Who story never made would have been:
Daleks Space Invasion of Planet Death
Royalty cheques to the usual address please!
Anyway, for my fellow geeks amongst you, here are the meat and potatoes of how I came to this great revelation:
For the task in hand, I used my favourite text-cruncher, Vim in its MacVim disguise. I started with a list of the series titles, snaffled from the BBC website.
An Unearthly Child
The Daleks
The Edge of Destruction
Marco Polo
The Keys of Marinus
The Aztecs
The Sensorites
The Reign of Terror
Planet of Giants
The Dalek Invasion of Earth
The Rescue
The Romans
The Web Planet
The first task was to break up this list into individual words. Time for some cryptic Vim mastery:
:%s/\>/\r/g
Inserts a line-break after each word. The list now looks like this:
An
Unearthly
Child

The
Daleks

The
Edge
of
Destruction

Marco
Polo

The
Keys
of
Marinus
Now it’s time to clean up a bit:
:%s/\W//g
Will remove any non-word characters. The list now looks like this:
An
Unearthly
Child

The
Daleks

The
Edge
of
Destruction

Marco
Polo

The
Keys
of
Marinus
Actually, those both look the same when pasted into Tumblr, but in Vim the previous version of the list was a lot more ragged, with lots of spaces before and after words.
Let’s get rid of the blank lines:
:v/\w/d
Now the list looks like this:
An
Unearthly
Child
The
Daleks
The
Edge
of
Destruction
Marco
Polo
The
Keys
of
Marinus
The
Aztecs
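For the Vim-averse, the three steps so far (split into words, strip the non-word characters, drop the blank lines) can be sketched in a few lines of Python. This is only an illustration and not part of what I actually ran; the function name split_titles is made up for the example, and the regular expression \w+ is assumed to be a close enough stand-in for Vim's idea of a word.

```python
import re

# A few of the titles from the list above.
titles = [
    "An Unearthly Child",
    "The Daleks",
    "The Edge of Destruction",
    "Marco Polo",
]


def split_titles(lines):
    """Return every word in the titles, one list entry per word."""
    words = []
    for line in lines:
        # \w+ grabs each run of word characters, so spaces and punctuation
        # are dropped and no empty entries are ever produced.
        words.extend(re.findall(r"\w+", line))
    return words


words = split_titles(titles)
print("\n".join(words))
```

Because the pattern only ever matches whole words, there is nothing left to clean up afterwards, which is why three Vim commands collapse into one call here.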
Now we’re getting somewhere. Next step is to uppercase all the words, so that the counting and sorting functions don’t treat upper and lower case versions of the same word as being different words:
:%s/.*/\U&/
This gives us the following:
AN
UNEARTHLY
CHILD
THE
DALEKS
THE
EDGE
OF
DESTRUCTION
MARCO
POLO
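To see why the uppercasing matters, here is a small Python sketch of what happens to the counts with and without it. The word list is a made-up sample, chosen only because it contains the same words in different cases; it is not taken from the real titles.

```python
from collections import Counter

# Made-up sample: the same two words spelled in different cases.
words = ["The", "of", "the", "Of", "THE"]

# Counted as-is, every spelling is treated as a separate word.
raw_counts = Counter(words)

# Uppercasing first folds the variants together, much as :%s/.*/\U&/ does.
folded_counts = Counter(word.upper() for word in words)

print(raw_counts)
print(folded_counts)
```

Counted raw, the five entries stay as five different "words"; once uppercased they fold down to just THE and OF.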
Now, at last, we’re ready to start counting the individual words. First we need to sort them alphabetically:
:%sort
Giving us this:
AN
AZTECS
CHILD
DALEK
DALEKS
DESTRUCTION
EARTH
EDGE
GIANTS
INVASION
KEYS
MARCO
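The sort is not just for tidiness. The uniq command only merges identical lines that sit next to each other, so unsorted input gets counted in pieces. Python's itertools.groupby works on adjacent runs in the same way, which makes it handy for a rough demonstration. The helper name count_runs is invented for this sketch.

```python
from itertools import groupby

# Words from a few of the titles, already uppercased but not yet sorted.
words = ["THE", "DALEKS", "THE", "EDGE", "OF", "DESTRUCTION",
         "THE", "KEYS", "OF", "MARINUS"]


def count_runs(items):
    """Count each run of identical adjacent items, in the style of uniq -c."""
    return [(len(list(run)), item) for item, run in groupby(items)]


# Unsorted: THE is never next to another THE, so it is counted three times over.
unsorted_counts = count_runs(words)

# Sorted: duplicates are adjacent, so each word gets exactly one total.
sorted_counts = count_runs(sorted(words))

for count, word in sorted_counts:
    print(count, word)
```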
After sorting the words alphabetically, we can now use the uniq command to count the individual occurrences of each word:
:%!uniq -c
At last we have a breakdown of the popularity of individual words:
1 AN
1 AZTECS
1 CHILD
1 DALEK
1 DALEKS
1 DESTRUCTION
1 EARTH
1 EDGE
1 GIANTS
1 INVASION
1 KEYS
1 MARCO
1 MARINUS
Again, Tumblr is helpfully stripping whitespace from this list. In reality, I found that the uniq command re-introduced some spaces before the lines, so I used:
:%s/^\s\+
to get rid of those. Then finally I used
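:%sort! n
to sort the word list in reverse numerical order, so the most popular words were at the top. The n flag matters here: without it, Vim compares the lines as text, and 9 DALEKS would end up above 46 OF. The top of my actual real-world list ended up like this: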
99 THE
46 OF
9 DALEKS
7 SPACE
7 PLANET
6 DEATH
5 INVASION
4 TIME
4 TERROR
4 IN
4 EVIL
3 FROM
2 WEB
2 WARRIORS
2 WAR
2 TO
2 SEEDS
2 POWER
2 PELADON
2 MONSTER
2 MIND
2 MAKERS
2 FEAR
2 ENEMY
2 CYBERMEN
2 ARK
All other words on the list had only one occurrence.
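One thing to watch when ordering lines that start with a count: comparing them as text is not the same as comparing them as numbers. In Vim the n flag to :sort takes care of this. The Python sketch below shows the difference using the first few lines of the list above; the helper name leading_count is made up for the example.

```python
# The first few counted lines from the list above.
lines = ["99 THE", "46 OF", "9 DALEKS", "7 SPACE", "7 PLANET", "6 DEATH"]

# Reverse sort on the raw text: "9 DALEKS" lands above "46 OF",
# because the character "9" sorts after the character "4".
as_text = sorted(lines, reverse=True)


def leading_count(line):
    """Return the number at the start of a counted line as an integer."""
    return int(line.split()[0])


# Reverse sort on the leading number instead of the text.
as_numbers = sorted(lines, key=leading_count, reverse=True)

print(as_text)
print(as_numbers)
```

The text-based version pushes 46 OF right down to the bottom of this sample, while the numeric version keeps it in second place where it belongs.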
So there you have it. Doctor Who statistics gathering powered by Vim. Could I write a more geeky post if I tried?