Comparing Text Complexity Of Steinbeck And Mr Men Books
On the 2nd of March 2019, the BBC published an article titled “Mr Greedy ‘almost as hard to read’ as Steinbeck classics.” The article can be found here –
At a glance, Mr Greedy seems much simpler than Steinbeck’s classics, including Of Mice & Men. Because of this, I decided it was worth doing our own analysis to compare the complexity of the texts.
The article refers to Renaissance UK, a company that analyses text complexity. As stated in the article, the research “examined more than 33,000 books for children and young people, scanning every page for sentence length, average word length and word difficulty level.”
For the purpose of this analysis I decided to use a number of the most commonly used text complexity calculations, specifically those that have been consistently used in the public sector and within commercial applications.
The four metrics I’m going to use to determine the overall complexity of a passage of text are the SMOG grade, the Flesch-Kincaid grade level, the Automated Readability Index and the Gunning Fog Index. I’ve chosen these four because their formulas look at relatively different inputs and because they are used in many different applications.
SMOG Grade
The SMOG grade is a mathematical formula that looks to quantify the complexity of a text by considering the number of polysyllables (defined as words with three or more syllables) in relation to the number of sentences. SMOG stands for Simple Measure of Gobbledygook.
A 2010 study published in the Journal of the Royal College of Physicians of Edinburgh refers to the SMOG index as “the gold standard”.
The formula for the SMOG grade can be seen below
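The standard published form of the SMOG grade, which is the version assumed here, is:

$$
\text{SMOG grade} = 1.0430 \sqrt{\text{polysyllables} \times \frac{30}{\text{sentences}}} + 3.1291
$$

Here, polysyllables is the number of words with three or more syllables in the sample and sentences is the number of sentences in the sample.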
Flesch-Kincaid Grade Level
The Flesch-Kincaid grade level is a mathematical formula that looks to quantify the readability of English text by looking at the number of words in each sentence and the number of syllables in each word.
The US Government uses the Flesch-Kincaid formula to determine whether the details of an insurance policy are simple enough for the majority of the population to understand easily.
The formula for the Flesch-Kincaid Grade can be seen below
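The standard published form of the Flesch-Kincaid grade level, which is the version assumed here, is:

$$
\text{FK grade} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59
$$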
Automated Readability Index
The Automated Readability Index is a mathematical formula that looks to quantify the complexity of English text by considering the number of characters per word and the number of words per sentence.
Although there is debate around the validity of the Automated Readability Index when compared to the formulas which take into account syllables, the Automated Readability Index is very simple to calculate and lightweight to run.
The formula for the Automated Readability Index can be seen below
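The standard published form of the Automated Readability Index, which is the version assumed here, is:

$$
\text{ARI} = 4.71 \left( \frac{\text{characters}}{\text{words}} \right) + 0.5 \left( \frac{\text{words}}{\text{sentences}} \right) - 21.43
$$

Because it needs no syllable counting, this is the easiest of the four metrics to sketch by hand. The Python below is a minimal illustration of the formula, not the code used in this analysis. The function name and the counting rules (letters and digits count as characters, and every run of ".", "!" or "?" ends a sentence) are assumptions, so its scores may differ slightly from those of a package such as textstat.

```python
import re


def automated_readability_index(text: str) -> float:
    """Score a passage with the standard ARI formula (illustrative sketch)."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")

    # Assumption: only letters and digits count as characters; punctuation is ignored.
    characters = sum(ch.isalnum() for ch in text)

    # Assumption: each run of '.', '!' or '?' marks one sentence end.
    # Abbreviations such as "Mr." are therefore miscounted as sentence ends.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))

    return 4.71 * (characters / len(words)) + 0.5 * (len(words) / sentences) - 21.43


if __name__ == "__main__":
    print(automated_readability_index("The cat sat. The dog ran."))
```

Short, simple sentences can produce a negative score, which is one reason very short samples give unreliable grade levels.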
Gunning Fog Index
The Gunning Fog Index is a mathematical formula that looks to determine what reading level is required to comprehend a passage of text by considering the words per sentence and the proportion of complex words.
The Gunning Fog Index is commonly used to ensure text is relatively simple for most audiences to read.
The formula for the Gunning Fog Index can be seen below
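The standard published form of the Gunning Fog Index, which is the version assumed here, is:

$$
\text{Fog index} = 0.4 \left[ \left( \frac{\text{words}}{\text{sentences}} \right) + 100 \left( \frac{\text{complex words}}{\text{words}} \right) \right]
$$

Complex words are conventionally taken to be words of three or more syllables.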
Comparison Of Books
Once I had decided which metrics to use, I wanted to apply each of them to sample passages from Steinbeck’s classics and the Mr. Men books to test the claim in the BBC’s article. I also decided to apply the metrics to a sample of the books listed on Project Gutenberg. For reference, Project Gutenberg is a collection of free, public domain e-books that anyone can access online.
Due to a lack of availability of Mr. Greedy, I compared the Mr. Men book Mr. Tickle with Of Mice & Men and another Steinbeck novel, Cannery Row.
The results can be seen below
Of Mice & Men is calculated as being noticeably more complex than Mr. Tickle across all four metrics. Interestingly, when comparing Mr. Tickle with Cannery Row, the difference in complexity becomes even more apparent. Cannery Row is significantly more difficult than Mr. Tickle across all of the metrics, which suggests that a much higher reading level is required to digest the book.
The BBC article includes an extract from the opening of both Of Mice & Men and Mr. Greedy. The extracts look vastly different in complexity, but they are used to suggest that this intuition is incorrect and that the two samples are similar. Running the two samples from the article through the four formulas above shows that the Of Mice & Men sample is vastly more complex than the Mr. Greedy sample.
Part of the reason why Of Mice & Men ends up with a lower complexity than Steinbeck’s Cannery Row may be that Lennie from Of Mice & Men tends to use quite simplistic language, which may have artificially lowered the complexity of the text.
It would be interesting to see the methodology that was used in the original study, as the four popular metrics used in this analysis suggest that Steinbeck’s novels are significantly more complex than the Mr. Men books.
The caveat to add is that the metrics used in this analysis are relatively simple and don’t take into account more complicated factors such as the structure of the sentence.
Nevertheless, the article suggested that the factors used in the original study were “Sentence length, average word length and word difficulty level”. The four metrics I’ve used in my analysis cover these factors and should therefore offer a similar result to the original study.
Harib (Data Science) – Natural Language Processing (NLP) and RegEx were used to clean up the texts and to extract samples. This also ensured the samples weren’t cut off mid-sentence and were in a clean format so that they could be used with the complexity metrics.
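As a rough illustration of this clean-up and sampling step, the sketch below collapses whitespace with a regular expression and then takes whole sentences, so that a sample never starts or ends mid-sentence. It is a minimal sketch rather than the script used for the analysis: the function names (clean_text, split_sentences, sample_sentences) and the sentence-splitting rule are assumptions made for illustration.

```python
import re
from typing import List

# Split on whitespace that follows '.', '!' or '?', except after "Mr." or "Mrs."
# (assumed rule; other abbreviations are not handled).
SENTENCE_BREAK = re.compile(r"(?<!\bMr\.)(?<!\bMrs\.)(?<=[.!?])\s+")


def clean_text(raw: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def split_sentences(text: str) -> List[str]:
    """Split cleaned text into sentences using the naive rule above."""
    return [s for s in SENTENCE_BREAK.split(text) if s]


def sample_sentences(raw: str, start: int, count: int) -> str:
    """Return `count` whole sentences beginning at sentence index `start`."""
    sentences = split_sentences(clean_text(raw))
    return " ".join(sentences[start:start + count])


if __name__ == "__main__":
    raw = "Mr. Tickle   was happy.\nHe laughed!  Did he?\tYes."
    print(sample_sentences(raw, 1, 2))
```

The splitter skips “Mr.” and “Mrs.” explicitly because the Mr. Men titles would otherwise be broken after every “Mr.”.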
The analysis was done within Python primarily using the following packages: math, stats, textstat. Textstat is a great package that contains a handful of useful functions for working with unstructured textual data.
Some of the four metrics give quite extreme outputs when calculated with the samples of text obtained from the BBC’s article. I believe this is primarily due to the short length of text and the fact that the opening paragraphs aren’t indicative of the rest of the book’s complexity.
I also ran the same analysis against a sample of Project Gutenberg books. To see these results, please clone the GitHub repository mentioned below.
Jordan (Data Engineering) – The Project Gutenberg books were retrieved by web scraping, using the requests library to download each page and the Beautiful Soup (bs4) library to store it as text. Due to differences in the index, I used a random range of numbers in the URL string for the loop.
Beautiful Soup – a package for parsing documents.
Requests – a package for making HTTP requests.
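The scraping loop could look something like the sketch below. This is not the script from the repository: the URL pattern, the upper bound on book IDs and the function names are all assumptions made for illustration, and should be checked against the Project Gutenberg site before use. The requests and Beautiful Soup calls are kept inside fetch_text so that the URL helper can be run without either package installed.

```python
import random
from typing import List, Optional

# Assumed URL pattern for a book's plain-text file; verify before relying on it.
URL_TEMPLATE = "https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def random_book_urls(count: int, max_id: int = 60000,
                     seed: Optional[int] = None) -> List[str]:
    """Build URLs for `count` distinct, randomly chosen book IDs.

    `max_id` is an assumed upper bound on IDs, not a documented limit.
    """
    rng = random.Random(seed)
    book_ids = rng.sample(range(1, max_id + 1), count)
    return [URL_TEMPLATE.format(book_id=book_id) for book_id in book_ids]


def fetch_text(url: str) -> Optional[str]:
    """Download one page and return its text, or None if the request fails."""
    # Third-party imports are local so random_book_urls works without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        # Not every randomly chosen ID corresponds to an available book.
        return None
    return BeautifulSoup(response.text, "html.parser").get_text()


if __name__ == "__main__":
    for url in random_book_urls(3, seed=1):
        text = fetch_text(url)
        print(url, "skipped" if text is None else len(text))
```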
All scripts, data sources and outputs can be found at –