Does Zipf’s Law Apply to Alzheimer’s Patients?



I was fascinated by Zipf’s Law when I came across it in a VSauce video. It is an empirical law stating that the frequency of a word in a large text corpus is inversely proportional to its rank in the frequency table. The frequency distribution resembles a Pareto distribution: the 2nd most common word occurs about 1/2 as often as the first, the 3rd about 1/3 as often, and the nth about 1/n as often. The law appears to hold across languages, even ones we do not understand yet. Curious, I decided to test it on a text corpus of Alzheimer’s patients describing a picture.

Alzheimer’s Disease (AD) is a neurodegenerative condition that usually occurs in people over 65 and worsens over time.

AD kills more people than breast and prostate cancer combined.

There is no cure for AD yet, and it is the most expensive disease in the US – accounting for approximately 68% of the Medicare funding. The Alzheimer’s Association estimates that over 16 million people could be affected by AD by 2050 in the US alone.

Previous Work

Fraser’s paper in 2016 identified the top linguistic features that stand out in AD patients when describing the cookie theft picture. Originally used by Giles et al., the cookie theft picture (shown below) is now the standard test used by linguists to extract features in patients with Dementia. There is no time limit imposed on this task, and the patients are asked to describe everything they see in it. Given the many elements in play in the picture, the patients’ incoherence becomes evident. Fraser extracted 370 linguistic features from the descriptions and detected Dementia patients with an accuracy of 81% using the top 35 features.


The Cookie-Theft Picture

Generating the Corpuses

DementiaBank is a corpus of control (healthy) and dementia patients describing the cookie theft picture. I gathered the raw text of the descriptions for each group, tokenized it, and sorted the words by frequency of occurrence. The figure below shows the distribution of the 30 most frequent words in both groups and compares it to an ideal Pareto distribution that follows Zipf’s law:


Frequency Distribution of the top 30 most occurring words
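The tokenize-and-rank step described above could be sketched roughly as follows. This is not the code used for the post: the regex tokenizer, the `rank_frequencies` helper, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def rank_frequencies(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

# Hypothetical narration, not taken from DementiaBank.
sample = "the boy is on the stool and the girl is reaching for the cookie"
ranked = rank_frequencies(sample)
top_30 = ranked[:30]  # the figure uses the 30 most frequent words
print(top_30[:2])
```

On a real corpus, the text of every narration in a group would be concatenated before calling `rank_frequencies`.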

Error with the frequency of the most-occurring word fixed

If we treat the frequency of the most frequent word as fixed, we can calculate the expected frequencies under Zipf’s law: the nth most common word should occur 1/n times as often as the most common one.

I calculated the mean chi-squared error and the mean squared error (MSE) for both groups. In both cases, the control group had a lower error than the dementia group. The results are summarized in the table below. An ideal distribution would have an error of zero in every case. Fifty random books from the Gutenberg Corpus are used as a baseline for comparison.

Group                                 Average Chi-Squared Error    Mean Squared Error
Dementia Group                        9.34                         1186.7
Control Group                         7.31                         731.6
Gutenberg Corpus (50 Random Books)    2.37                         412.86
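For readers who want to reproduce numbers of this kind, here is a minimal sketch of the two error measures. The exact formulas are an assumption: the post does not spell them out, so this takes the chi-squared error at each rank as (observed - expected)^2 / expected and averages over ranks. The function names are hypothetical.

```python
def zipf_expected(top_frequency, n_ranks):
    """Expected counts under Zipf's law with the top frequency held fixed."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]

def zipf_errors(observed):
    """Return (mean chi-squared error, mean squared error) against ideal Zipf.

    `observed` is a list of word counts sorted from most to least frequent.
    """
    expected = zipf_expected(observed[0], len(observed))
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    mean_chi2 = sum(s / e for s, e in zip(squared, expected)) / len(observed)
    mse = sum(squared) / len(observed)
    return mean_chi2, mse

# A perfectly Zipfian toy list gives zero error; a deviating one does not.
print(zipf_errors([120, 60, 40, 30]))
print(zipf_errors([100, 40]))
```

Because the MSE is not normalized by the expected count, it is dominated by the highest-ranked words, which is one reason to report both measures.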

Determining if the distributions fit the Power Law

Pleased with the results above, I decided to check whether the two distributions can be fit as a Pareto distribution. Using Alstott’s implementation in Python, I checked the goodness of fit for both groups with the minimum frequency set to 1 and to 800 (frequencies below the minimum are excluded from the goodness-of-fit test). The results are summarized in the table below. Results for an ideal case (start at the maximum frequency of the dementia group and set the nth of the next 30 elements to 1/n of that value) are included for comparison. Alpha is the constant that defines the Zipf’s Law distribution (in most ideal cases it is 1), and Sigma is the standard error of the fitted alpha.

[Table: Alpha and Sigma by minimum frequency for the Dementia Group, the Control Group, and the ideal case; the values are missing from this copy.]
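To give a feel for what such a fit computes, here is a hand-rolled sketch of the standard maximum-likelihood estimate of a power-law exponent above a minimum value, using the continuous approximation alpha = 1 + n / sum(ln(x / xmin)) and sigma = (alpha - 1) / sqrt(n). This is an illustration only, not Alstott’s package and not necessarily the estimator it applies to discrete data; `fit_power_law` is a hypothetical helper.

```python
import math

def fit_power_law(values, xmin):
    """Continuous-approximation MLE for a power-law exponent.

    Only values >= xmin are used, mirroring the minimum-frequency cutoff.
    Returns (alpha, sigma), where sigma is the standard error of alpha.
    """
    tail = [v for v in values if v >= xmin]
    log_sum = sum(math.log(v / xmin) for v in tail)
    if not tail or log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    alpha = 1 + len(tail) / log_sum
    sigma = (alpha - 1) / math.sqrt(len(tail))
    return alpha, sigma

# Toy word counts (made up); compare a low and a high cutoff.
counts = [900, 450, 300, 225, 180, 12, 7, 3, 2, 1, 1, 1]
print(fit_power_law(counts, xmin=1))
print(fit_power_law(counts, xmin=200))
```

With the `powerlaw` package installed, the analogous call should be along the lines of `powerlaw.Fit(counts, xmin=1)`, with the estimates available as `fit.power_law.alpha` and `fit.power_law.sigma`. Raising the cutoff leaves fewer data points, so sigma tends to grow.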

Overall, when we look at all the words in a corpus, both groups follow the power law with a reasonably low deviation. The deviation is higher when we look at just the first few words in the corpus. Even though the dementia group’s distribution is more Zipfian than the control group’s because of its lower deviation, the difference is almost negligible, and it is safe to say that both curves follow Zipf’s law to a statistically adequate level.

Using Error as a metric for detection of AD

Having established that both groups are about equally Zipfian, I decided to run a final test before discarding the idea that Zipf’s law might help in detecting AD. For every narration, I calculated (Zipfian MSE) / (total number of words) and compared it to an empirical threshold of 1.65, where the Zipfian MSE is the metric from the section on error with the most-occurring word fixed.

The test detects the right group for a narration with an accuracy of ~62.5% (63.2% for Dementia group, 62.2% for Control group).
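A per-narration test of this kind could be sketched as below. The threshold of 1.65 comes from the post; the direction of the comparison (a score above the threshold is labeled dementia) is an assumption based on the dementia group having the higher error, and the function names are hypothetical.

```python
def zipfian_mse(frequencies):
    """MSE between ranked word counts and ideal Zipf with the top count fixed."""
    top = frequencies[0]
    squared = [(f - top / rank) ** 2 for rank, f in enumerate(frequencies, start=1)]
    return sum(squared) / len(frequencies)

def classify_narration(frequencies, threshold=1.65):
    """Label one narration from its word counts (sorted most to least frequent)."""
    total_words = sum(frequencies)
    score = zipfian_mse(frequencies) / total_words
    # Assumption: higher normalized error points to the dementia group.
    return "dementia" if score > threshold else "control"

print(classify_narration([12, 6, 4, 3]))   # perfectly Zipfian counts, score 0
print(classify_narration([50, 2, 1, 1]))   # one word dominates, large score
```

Dividing by the total word count keeps long narrations from being penalized simply for having larger raw counts.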

Conclusions, Discussions and Future Work

Overall, the text of AD patients follows Zipf’s law as closely as that of healthy participants when analyzed over all the words in the respective corpuses. The average Zipfian MSE per word, with the frequency of the most-occurring word fixed, can classify narrations of the cookie-theft picture with an accuracy of 62.5% (which is better than using all of Fraser’s 370 features without feature selection).

It would be interesting to use this metric in combination with Fraser’s top 35 features and see whether the overall accuracy increases. It is possible that this Zipfian metric overlaps with the “Use of high-frequency words” feature used by Fraser (one of the most distinguishing features in her experiments) and will not affect the accuracy at all.

As the cookie-theft problem is used as a standard test, it would be interesting to view this from a topic modeling approach given the many elements in the picture; using LDA or LDA2Vec on this corpus to set baselines. That will probably be a part of what I do next when I play around with this data.

I will give an idea of the general direction of my thesis work in my next blog post – so stay tuned!

One thought on “Does Zipf’s Law Apply to Alzheimer’s Patients?”

  1. luckytoilet says:

    Nice, haha, I’m not too surprised that Alzheimer’s speech follows a Zipf distribution. A lot of things do this — if you count the tokens in a large source code file in the Linux kernel, it will also follow a Zipf distribution.

    Also, to illustrate that a distribution is Zipfian, it might be helpful to plot log(freq) vs log(rank) — it should be roughly linear. Then it’s obvious if your distribution is not Zipfian (for example, exponential).

