Dear Brazilian Data Scientists: Please understand why Word Cloud is horrible for Text Analysis and try TF-IDF + Word N-Grams
2019 Oct 11

I know the title sounds a bit like clickbait, so please bear with me as the tone here will always be one of moderation.
I have made this mistake myself in the past (this thread shows that, like a stubborn and foolish phoenix, I keep making it), and I confess that when we see those word clouds we feel like the best analysts in the world, because we can explain the subject to laypeople in a direct and simple way. After all, who wouldn’t want to be known as an excellent storyteller, right?
Despite this, in these poorly written lines I will try to make my point as to why the Word Cloud is a terrible instrument for analyzing texts and discourses. Besides being flawed because it relies only on the absolute frequency of words, this method leaves too much room for what I call “interpretative leaps”, which often look more like delusions in the analyst’s head than actual analysis.
The problem…
Time and again, whenever elections arrive, or even when someone important makes some kind of announcement via a speech, the first thing we see is the famous Word Cloud (a.k.a. Tag Cloud).
I myself have loved using Word Clouds on presidential speeches, as can be seen on my Github account, where I applied this method to all the presidential speeches I could find at Itamaraty; the shameful images below show the result:
Word Cloud of speeches by former President Fernando Henrique Cardoso (FHC)
Word Cloud of speeches by former President Lula
Word Cloud of speeches by former President Dilma
Word Cloud of speeches by former President Temer
Word Cloud of speeches by President Bolsonaro
(PS: I know the code is hideous, but the daily grind is crushing and I haven’t had time to clean it up yet; if anyone wants to continue, just fork it and “do it yourself”)
Some examples of the recurrent poor use of the Word Cloud can be seen at the BBC, regarding the speeches of Brazilian presidents at the UN; in Época magazine, which did the same thing; in Folha de São Paulo, which made a word cloud of the speeches of former president Dilma Rousseff; and in Estado de Minas, which did the same with the diploma ceremony speech of President Jair Bolsonaro.
It’s always good when we have Data Journalism (Author’s Note [1]: Since the beginning of time, I have always thought that all journalism was based on data, but that’s for another post. Live and learn.) using quantitative tools for the analysis of this data. However, the Word Cloud resource says very little within a general context and gives room for “interpretative leaps” that would make Tarot and Astrology envious in terms of subjective nonsense.
A clear example of these interpretative leaps can be seen in an article on LinkedIn called “Word cloud: our candidates by themselves”, in which the author projected more of his own personal views than he adhered to what was actually said.
On Tarcízio Silva’s blog, in the post “What hides behind a word cloud”, he speaks of some analysis resources that are interesting from the point of view of discourse analysis. However, when we talk about computational linguistics and information retrieval, the Word Cloud reveals very little and can even be misleading depending on the case, and I will show this later.
In NLP, Semantics and Sequence aren’t everything, but they are 99%…
However, when we talk about Natural Language Processing (NLP), the use of Word Cloud shows very little given that this method is based only on the combination of Word Frequency + Removal of Stopwords. With this, two main aspects of text analysis are lost, which are:
a) Semantics regarding what is being said and how it relates to the corpus as a whole, rather than to free interpretation (e.g., the word “manga” in Portuguese can have different meanings in different contexts, such as the fruit or a clothing sleeve); and
b) The sequence, without which we lose the chance to understand the probability of words appearing within a logical chain (e.g., given the words “bebe” (drinks/baby), “mata” (kills/forest), and “fome” (hunger), we can generate sequences like “baby kills hunger” or “hunger kills baby”, which are absolutely different facts), as the sketch below shows.
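Just to make point (b) concrete, here is a minimal toy sketch using the three words from the example above: a pure frequency count, which is all a Word Cloud sees, cannot tell the two orderings apart, while word 2-grams can.

```python
from collections import Counter

# Two orderings of the same words with opposite meanings
s1 = "bebe mata fome".split()  # "baby kills hunger"
s2 = "fome mata bebe".split()  # "hunger kills baby"

# A word cloud only sees the bag of frequencies: identical for both
print(Counter(s1) == Counter(s2))  # True

def word_ngrams(tokens, n):
    """Contiguous word n-grams, preserving local order."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Word 2-grams distinguish the two sequences
print(word_ngrams(s1, 2))  # [('bebe', 'mata'), ('mata', 'fome')]
print(word_ngrams(s2, 2))  # [('fome', 'mata'), ('mata', 'bebe')]
```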
I believe these two simple examples have made my point that the Word Cloud by itself doesn’t say much and, to make matters worse, leaves too much room for “interpretative leaps.”
Let’s see below a simple way to perform textual analysis using two techniques: TF-IDF and Word N-Grams.
TF-IDF + Word N-Grams = Better interpretation = Better insights
Basically, when we use Term Frequency–Inverse Document Frequency (TF-IDF) instead of simply performing an analysis of the frequency of words, we are placing a certain level of semantics in the text analysis. While we consider the frequency of the words (TF), we also consider the inverse document frequency (IDF). This combination of TF and IDF will measure the relevance of the word according to the degree of information it brings to the text as a whole.
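To make the weighting concrete: in the classic formulation (libraries apply smoothed variants of the IDF term, but the intuition is the same), for a term $t$ in a document $d$ of a corpus with $N$ documents:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is how often $t$ occurs in $d$ and $\text{df}(t)$ is the number of documents that contain $t$. A word that shows up in nearly every document has $N/\text{df}(t) \approx 1$, so its IDF approaches zero and the word is pushed down no matter how frequent it is; a word concentrated in few documents is pushed up.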
In other words, TF-IDF performs this weighting between the relative occurrence of words and their degree of information within a corpus. This chart from the post “Process Text using TFIDF in Python” by Shivangi Sareen illustrates well how TF-IDF behaves over a corpus.
Source: Process Text using TF-IDF in Python
This issue of relative co-occurrence of words will be clearer in the practical example discussed later.
While the Word Cloud does not concern itself with sequence, Word N-Grams fill this gap: not only do the words themselves matter, but also the probability of those words appearing in sequence within a certain linguistic context and/or semantic chain. (Author’s Note [2]: I know some people will nitpick that these are just n-Grams, but I am borrowing the concept of Word n-Grams from Facebook’s FastText, which has a great paper called “Bag of Tricks for Efficient Text Classification” about this implementation: using local sequences of words (i.e., small concatenated local contexts) instead of letters for text classification. But that’s a topic for another post. By the way, I wrote about FastText here some time ago.)
This post does not claim to be a reference on TF-IDF/Word N-Grams, but I strongly recommend what I consider the bible of NLP, the book “Neural Network Methods for Natural Language Processing” by Yoav Goldberg.
That said, let’s go to a practical example of why using TF-IDF and Word N-Grams shows much more than Word Cloud and how you can use them in your text analysis.
Practical Example: Text analysis of some authors from Instituto Mises Brasil
For some time now, I have been following Brazilian politics from the perspective of essayists from branches linked to libertarianism, anarcho-capitalism, secession, self-ownership, and related subjects; and one fact that caught my attention was the editorial shift that has slowly been happening in one of the main liberal think tanks in Brazil, the Instituto Mises Brasil (IMB). (Author’s Note [3]: I think the editorial line has become poor over time and the institute has turned into a dull parody focused on developmentalism and banking financialism, with a reduced focus on freedom.)
My hypothesis is that due to editorial differences on subjects linked to secession, there was a split between Hélio Beltrão and the Chiocca Brothers (Cristiano and Fernando) with the consequent change of editorial management at IMB and the founding of the Rothbard Institute.
I am working on some additional posts about this, but just to show my point about Word Cloud, I will use as an example some texts from an IMB author named Fernando Ulrich.
Quickly looking at Fernando Ulrich’s editorial line, we can see that he speaks mostly about 3 subjects: Currency, Bitcoin, and the Banking System. Even a reader who started following this author yesterday knows this.
However, generating a simple Word Cloud with the texts from this author gives the following result:
Word Cloud of all posts by Fernando Ulrich
We can see that the words that appear most are Market, year, Government, and Economy, while Bitcoin, Currency (Moeda), and Money (Dinheiro) appear with a lower frequency (Author’s Note [4]: I intentionally did not use any kind of stemming or lemmatization because I wanted more interpretability in this analysis).
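For reference, this is roughly how a cloud like the one above is generated with the Python wordcloud package; a minimal sketch, assuming the author’s posts are already concatenated into a single string (the toy raw_text and the stopword list below are placeholders, not the actual data or script used here):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Placeholder standing in for the concatenated posts
raw_text = "mercado ano governo economia bitcoin moeda dinheiro mercado ano"

# The library's built-in STOPWORDS set is English-only, so a
# Portuguese list has to be supplied explicitly
stopwords_pt = {"o", "a", "os", "as", "de", "do", "da", "que", "e", "em"}

# Frequency count + stopword removal: this is all the method does
wc = WordCloud(stopwords=stopwords_pt, background_color="white",
               max_words=100).generate(raw_text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```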
However, let’s now look at the same texts using TF-IDF:
TF-IDF with Fernando Ulrich’s texts
Taking the top 30 words with the highest TF-IDF scores, we can see that Bitcoin, Banks, Bacen (the Brazilian Central Bank), and Currency (Moeda) score much higher than the words highlighted by the Word Cloud (remember: Market, year, Government, Economy). To any reader of Fernando Ulrich’s articles, it is clear that these are the subjects he actually writes about most often, a fact that can also be seen in the videos on his excellent YouTube channel.
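Here is a minimal sketch of how a top-30 ranking like this can be produced with scikit-learn’s TfidfVectorizer; the three toy posts are placeholders for the real articles, and ranking by the mean score across documents is one reasonable choice among several (I am not claiming this is the exact script behind the chart above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the author's articles, one string per post
posts = [
    "bitcoin é uma moeda digital que dispensa o banco central",
    "o bacen regula os bancos e a emissão de moeda",
    "os bancos operam com reservas fracionárias",
]

# TF weighs how often a term occurs in a post; IDF penalizes terms
# that occur in almost every post (low information content)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)

# Rank terms by mean TF-IDF score across all posts (top 30)
scores = np.asarray(X.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
for i in np.argsort(scores)[::-1][:30]:
    print(f"{terms[i]}: {scores[i]:.3f}")
```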
Now, from the perspective of mapping the local sequences of the texts using Word N-Grams, we have the following result for the IMB texts that Fernando Ulrich wrote:
Word N-Grams of Fernando Ulrich’s texts (n=3)
We can see here that the themes converge towards subjects like central banks, the banking system, and issues related to the fractional-reserve system, which also matches the themes of his YouTube channel, something every long-time reader of this author already knows.
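And a companion sketch for the trigram count (n=3), using scikit-learn’s CountVectorizer over the same kind of toy corpus; again, the posts below are placeholders rather than the actual texts:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "o sistema de reservas fracionárias dos bancos",
    "o banco central sustenta o sistema bancário",
    "o sistema de reservas fracionárias gera inflação",
]

# ngram_range=(3, 3) extracts contiguous three-word sequences;
# the custom token_pattern keeps one-letter Portuguese words like "o"
vectorizer = CountVectorizer(ngram_range=(3, 3),
                             token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(posts)

# Most frequent trigrams across the corpus
counts = np.asarray(X.sum(axis=0)).ravel()
grams = vectorizer.get_feature_names_out()
for i in np.argsort(counts)[::-1][:5]:
    print(f"{grams[i]}: {counts[i]}")
```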
Conclusion
As we saw above, when performing text analysis in order to interpret what was said and/or identify the theme of what is being discussed, the Word Cloud is definitely not the best choice. The combination of TF-IDF + Word N-Grams generates results that represent reality much better, because it brings in a higher level of semantics and word chaining than raw frequency alone.
Final Notes
Final Note [1]: Intolerant friends at the extremes of both ends of the spectrum, hold your emotions in the comments, because I have a very large pipeline of things in the works in this direction that will yet displease a lot of people.
Even in 2019, this type of warning is important: analyzing people’s texts does not mean being in line with what they say, let alone any kind of endorsement of their opinions.
Believe it or not, there are many people with a mental model robust enough to read everything from the biography of a billionaire used as corporate fanfiction to accounts of the struggle of people who, despite being 50.94% of the population as a racial group, do not have a single representative of weight in society, such as a CEO of a multinational or someone of relevance in the executive or judicial branches.
Final Note [2]: I know there is code involved, and I will share it by the end of next week (at the very least I will release the database). I haven’t shared it yet because I have to get it to a minimum level of readability, so that people don’t think I whipped a monkey until he managed to type the entire script.
Final Note [3]: There are plenty of grammatical, semantic, and logical errors in this post; they are all mine, and I will gradually correct them. Besides having been educated in Brazilian public schools during the 90s, I have the aggravating factor of typing this text on an Austrian keyboard with a spell checker in… German. So, bear with me.