How I Determined The Author Of A Text In 3 Simple Steps

Daria Spizheva
7 min read · May 9, 2018

The first time I read the story of how J. K. Rowling was revealed to have written “The Cuckoo’s Calling” under the name of a mysterious Robert Galbraith, I was excited by the idea of uncovering the author of a text. It took me some time to research the existing approaches and tools.

The point is that everyone uses language differently. Some personal habits are evident, others are not. Instead of making wild guesses, those who know a little programming have a straightforward way to prove or disprove authorship claims. It is a piece-of-cake method: no neural networks, no AI. Rough, plain, reliable.

In this story I share my own experience of creating an authorship attribution program that works well for different languages and text genres and adjusts itself to the data you want to analyze. I have gathered only the most useful insights and discoveries, together with a step-by-step formula for building a working program. Make yourself comfortable, we’re ready to start!

What Makes The Guide Unique?

Even a brief survey of existing methods can be pretty bewildering, especially for a newbie. I had a small advantage here, as I have a university degree in computational linguistics. Still, most of the proposed techniques seem too obscure and intricate for someone who just wants to grasp how it works. The most frequent problems are:

  • There are no online tools for authorship identification. To use any program, you have to download the source code, run it on your machine, and learn how to operate it before you can start the analysis. This problem becomes even sharper in combination with the next one:
  • Someone else’s code is dreadfully hard to understand. Since most such programs were written by amateurs, helpful comments are often missing or written in a foreign language. On top of that, the implementations are often far from modern coding patterns and standards and include obscure solutions and terms. If you want to understand the principle, you are better off building your own.
  • Finally, no one tells you how to build such a program in a straightforward way. Authors write articles about which techniques can be used and give you thousands of recommendations, but no one tells you what you should actually do. For many, this becomes an insurmountable obstacle that makes them give up. When I started my research, I was looking for such a cheat sheet and found nothing. That’s why I think this guide can be useful for everyone curious about authorship attribution and may inspire them to get to know the area a bit better.

Enough preamble.

Step 1. Text Preprocessing

Prepare the working environment and choose a programming language. (In my own project, I used C# and Windows Forms Foundation in Visual Studio. That said, you’re free to make any other choice; the steps are general enough.)

The goal of this step is to create a class that reads the texts from a file or text field line by line, cleans them of digital noise, and splits them into words.

Reading the files and parsing them line by line

Input texts usually contain plenty of noise that is invisible to us but disturbs the software. Tags, markup, and unexpected symbols are redundant for our purposes, so you should strip them out.
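The original listing was written in C# and is not reproduced here. As a rough illustration of the idea, here is a minimal Python sketch. The helper names (`clean_line`, `read_clean_lines`) and the exact set of characters treated as noise are my assumptions, not the author's implementation, so adjust them to your data.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # HTML/XML-style tags
NOISE_RE = re.compile(r"[^\w\s'’-]")   # not a word character, whitespace, apostrophe, or hyphen
SPACES_RE = re.compile(r"\s+")         # runs of whitespace


def clean_line(line: str) -> str:
    """Strip tags, decode HTML entities, and drop stray symbols from one line."""
    line = TAG_RE.sub(" ", line)
    line = html.unescape(line)
    line = NOISE_RE.sub(" ", line)
    return SPACES_RE.sub(" ", line).strip()


def read_clean_lines(path: str, encoding: str = "utf-8"):
    """Yield the non-empty cleaned lines of a text file, one at a time."""
    with open(path, encoding=encoding) as handle:
        for raw_line in handle:
            cleaned = clean_line(raw_line)
            if cleaned:
                yield cleaned
```

Because `\w` is Unicode-aware in Python 3, the same pattern keeps letters from non-Latin alphabets, which matters if you analyze texts in several languages.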

For each individual text, the separated words should be stored in a dictionary (hash table) where the key is the word and the value is the number of its occurrences in the text.

Separating the words and counting their occurrences

A token is an individual occurrence of a linguistic unit in speech or writing (source). Put simply, it is a sequence of characters delimited by spaces on both sides. We want “Tell” and “tell” to be treated as the same word, “tell”, which is why we use the .ToLowerInvariant() method to make all tokens lowercase.
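A minimal Python sketch of this counting step might look as follows. It uses `str.lower()` as a stand-in for C#'s `.ToLowerInvariant()`, and the function name `count_words` is hypothetical. It assumes the lines have already been cleaned of tags and symbols.

```python
from collections import Counter


def count_words(lines):
    """Split cleaned lines on whitespace and count lowercase tokens."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Lowercase the token and trim leftover apostrophes or hyphens at its edges.
            token = token.lower().strip("'’-")
            if token:
                counts[token] += 1
    return counts


word_counts = count_words(["Tell me a story", "tell Me again"])
```

Here `word_counts["tell"]` and `word_counts["me"]` are both 2, because the capitalized and lowercase forms land on the same key.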

Next, keep only the 10–35 most frequently used words in the dictionary and discard all the others.
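In Python this trimming can be sketched with `Counter.most_common`. The function name `most_frequent` and the default of 20 words are assumptions for illustration; any value in the 10–35 range from the text would do.

```python
from collections import Counter


def most_frequent(counts, top_n=20):
    """Keep only the top_n most frequent words and their counts."""
    return dict(Counter(counts).most_common(top_n))


profile = most_frequent({"the": 5, "a": 3, "cat": 1}, top_n=2)
```

The resulting `profile` contains only "the" and "a"; the rarer "cat" is dropped.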

Note. For simplicity, I will show you how to determine the author of a text using the most frequently used words. This feature is generally known as one of the most popular and accurate. If you wish, you can choose any other feature later, for example, from this paper. The program and the tutorial were inspired by exactly this work.

Step 2. Training Phase

Quantify the similarity between texts

Suppose you have two dictionaries built from two input texts. How can we quantify their similarity? For this, let’s use the Manhattan distance function by implementing the formula:

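$$d(A, B) = \sum_{i=1}^{n} |x_i - y_i|$$

Manhattan distance function (source)

$x_i$ stands for the number of occurrences of the $i$-th word in Text A.

$y_i$ stands for the number of occurrences of the $i$-th word in Text B.

$n$ is the number of distinct words across both dictionaries.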

When applied, it returns a scalar value describing the distance between the texts. The code looks like this:

Counting the similarity between two texts
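A minimal Python sketch of the Manhattan distance over two word-count dictionaries is shown below. The function name is hypothetical, and treating a word that is missing from one dictionary as having a count of zero is my assumption.

```python
def manhattan_distance(counts_a, counts_b):
    """Sum of absolute count differences over every word seen in either text."""
    words = set(counts_a) | set(counts_b)
    return sum(abs(counts_a.get(word, 0) - counts_b.get(word, 0)) for word in words)


text_a = {"the": 5, "and": 3}
text_b = {"the": 2, "of": 4}
distance = manhattan_distance(text_a, text_b)  # |5-2| + |3-0| + |0-4| = 10
```

If your texts differ a lot in length, it may be worth dividing each count by the total number of tokens first, so that a long text is not judged distant simply because it is long. That normalization is a suggestion of mine, not something the original program is known to do.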

On its own, the distance value tells us little. We get a number, but we can’t tell whether the texts are similar unless we have a key to interpret it. This key is called the acceptance threshold, and your next step is to find it.

Acceptance threshold generation

The acceptance threshold is the result of training on texts with proven authorship. It’s best if the set of authors used for training includes the author you suspect of having written the text you are analyzing. At a minimum, the training set needs two authors with two texts each.

How to generate text pairs

Here, the “Problem” class stands for a pair of texts compared with each other, and the answer is a boolean: “Yes, the author is the same” or “No, the authors are different.” The program should fill in the answer field automatically by comparing the file names (they should start with the author’s first name and surname).
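The pair generation can be sketched in Python as follows. The `Problem` name comes from the text; its fields, the helper `author_of`, and the `Name_Surname_Title.txt` naming convention are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from pathlib import Path


@dataclass
class Problem:
    text_a: str
    text_b: str
    same_author: bool


def author_of(file_name):
    """Assumed convention: the file name starts with Name_Surname_."""
    parts = Path(file_name).stem.split("_")
    return " ".join(parts[:2]).lower()


def generate_problems(file_names):
    """Build every unordered pair of texts and label it by comparing authors."""
    return [
        Problem(a, b, author_of(a) == author_of(b))
        for a, b in combinations(file_names, 2)
    ]


files = [
    "Jane_Austen_Emma.txt",
    "Jane_Austen_Persuasion.txt",
    "Mark_Twain_Tom_Sawyer.txt",
    "Mark_Twain_Huck_Finn.txt",
]
problems = generate_problems(files)
```

With the minimum training set of two authors and two texts each, this yields six pairs, two of which are same-author pairs.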

The training itself looks like this:

Training procedure
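The original training listing is not reproduced here, and the text does not spell out how the threshold is chosen. One plausible approach, shown purely as an assumption, is to try candidate thresholds between the observed distances and keep the one that classifies the most training pairs correctly. The sketch below is in Python; `train_threshold` and its input format are hypothetical.

```python
def train_threshold(scored_pairs):
    """Pick the distance threshold that best separates same-author pairs.

    scored_pairs is a list of (distance, same_author) tuples. A pair is
    predicted "same author" when its distance is at or below the threshold.
    """
    if not scored_pairs:
        raise ValueError("at least one training pair is required")

    distances = sorted({distance for distance, _ in scored_pairs})
    # Candidates: below the smallest distance, midpoints between neighbours,
    # and above the largest distance.
    candidates = [distances[0] - 1]
    candidates += [(low + high) / 2 for low, high in zip(distances, distances[1:])]
    candidates.append(distances[-1] + 1)

    def accuracy(threshold):
        correct = sum((distance <= threshold) == same for distance, same in scored_pairs)
        return correct / len(scored_pairs)

    best = max(candidates, key=accuracy)
    return best, accuracy(best)


pairs = [(10, True), (12, True), (14, True), (25, False), (30, False), (40, False)]
threshold, training_accuracy = train_threshold(pairs)
```

For this toy data the two groups are cleanly separable, so the chosen threshold falls in the gap between 14 and 25. Real training sets usually overlap, and the accuracy returned alongside the threshold shows how much.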

Step 3. Comparing The Texts

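Finally, you need to compare two texts using the threshold you found previously. To do that, compute their similarity score and compare it with the threshold. If the texts are closer to each other than the threshold allows, they were written by the same author. If not, the authors are different. Note that with a raw distance such as the Manhattan distance, closer means a smaller number, so the same-author verdict corresponds to a distance below the threshold. That’s all!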

The visualization of the similarity values distribution (source)

Note. My own program was a bit more complicated and closer to the one described in the paper. It tracks more features and decides which of them show the best results. However, the list of the most frequently used words showed the best performance in all cases. That is also why I present here only the parts of the code that are sufficient on their own to convey the principle.

As for the results, the program was able to determine authorship with up to 89% accuracy in several cases. It all depends on the training set and the length of the texts. So, if you get different results, try expanding the set of authors or texts. And if you need more details, don’t hesitate to reach out.

If you want to dive deeper into this field, have a look at the following works:

Sources:

If you found the material useful, please clap or follow me on Medium. Thanks for reading!
