Brazilian Heavy Metal: An Exploratory Data Analysis using NLP and LDA
2023 May 23
Originally posted on 03.07.2019
An analysis of the lyrics of two of the greatest bands of Brazilian Heavy Metal: Angra and Sepultura
I spend a lot of my time experimenting with NLP techniques, and I noticed that even with the plethora of resources available, it’s very hard to find NLP technical material tied to Data Analysis, i.e. related to general knowledge extracted from the data, as in Text Mining.
It’s very cool to have a lot of scripts, applied blog posts, and GitHub repositories with code, but at least for me the analysis is where the technique shines most, because anyone is able to write a script but only a few can extract knowledge from the data.
The idea here is to get the lyrics of two bands that I like, check their literary characteristics, and try to find some relation or distinction between them.
For very deep and technical posts about NLP, LDA and so on, feel free to jump directly to the end of this post and pick from a lot of very nice references about these topics.
That is what this post is about, and it was deeply inspired by the great work of Machine Learning Plus.
Why Angra and Sepultura?
Heavy Metal is one of the most borderless music styles in the world, and I would like to show two of the most iconic bands of my country and their literary characteristics in a simple way using Python, LDA, NLP and some imagination (you will see it during the “interpretation” of the topics).
About the Bands
Sepultura is a Brazilian heavy metal band from Belo Horizonte. Formed in 1984 by brothers Max and Igor Cavalera, the band was a major force in the groove metal, thrash metal, and death metal genres during the late 1980s and early 1990s. Sepultura has also been credited as one of the second wave of thrash metal acts from the late 1980s and early-to-mid 1990s.
Sepultura Official Website | Sepultura on Spotify
Angra is a Brazilian heavy metal band formed in 1991 that has gone through some line-up changes since its foundation. Led by Rafael Bittencourt, the band has gained a degree of popularity in Japan and Europe.
Angra Official Website | Angra on Spotify
Some personal questions that I always had about these bands, and that I’ll try to answer with this notebook, are:
1) What are the literary characteristics of Angra and Sepultura?
2) Which kinds of themes do they write about?
3) Who has more diversity in their topics?
NLP is still an unsolved problem, even with all the overpromising about it. These two anthological pieces by Yoav Goldberg and The Gradient put that in perspective.
The creative process, even with some patterns, is very complex and can involve a lot of poetic license. In this video, Rafael Bittencourt explains the whole process of composing a single lyric for the new album, and in this video, Max Cavalera speaks about the creative process behind the classic album “Roots” from 1996.
Natural Language Processing
Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
In machine learning and natural language processing, a Topic Model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body. Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body.
Latent Dirichlet Allocation
In natural language processing, Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model.
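To make the “each document is a mixture of topics” idea concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The two toy topics, their word probabilities, and the `alpha` values are made up for illustration and have nothing to do with the real lyrics.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Each document gets its own mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Each word position first picks a topic from that mixture...
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. ...then picks a word from the chosen topic's word distribution.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics (word -> probability); not taken from the real lyrics.
toy_topics = [
    {"time": 0.4, "dream": 0.3, "heart": 0.3},
    {"war": 0.5, "death": 0.3, "pain": 0.2},
]
rng = random.Random(42)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(t, 2) for t in theta], words)
```

Fitting LDA means running this story backwards: given only the words, estimate the topics and the per-document mixtures that most plausibly produced them.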
Code & Data
All code, datasets, and images are stored in [this GitHub repo](https://github.com/fclesio/metalbr).
Data extraction and load
To extract the lyrics I used the PyLyrics library with this script. Important: this library hasn’t received any update or bug fix since last year. Below we import some libraries and start a small pre-processing of our data.
<iframe src=”https://medium.com/media/9a68b2384ebb65de1a0587526aee9528” frameborder=0></iframe>
The wrapper fetched 325 songs, bringing the artist, lyric, and album for each one.
One of the main challenges is that these bands usually write songs in more than one language (EN and PT-BR). For the sake of simplicity, we’ll concentrate only on the EN lyrics.
To filter out the PT-BR songs I’ll use the textblob library, which uses the Google API to check the language. The main caveat is that if you re-run it many times you may receive the error HTTP Error 429: Too Many Requests.
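If the rate limit gets in the way, a crude offline stand-in can be enough for a two-language corpus like this one. The sketch below is not the textblob call used in this post: it is a hypothetical heuristic that counts common function words, and both hint lists and the sample lines are assumptions made up for the example.

```python
# Small, hand-picked hint lists (assumptions, not an exhaustive resource).
EN_HINTS = {"the", "and", "you", "of", "to", "in", "is", "my", "we", "for"}
PT_HINTS = {"o", "a", "e", "de", "que", "do", "da", "em", "um", "para", "não", "os"}

def guess_language(text):
    """Return 'en' or 'pt' by counting common function words (crude heuristic)."""
    tokens = text.lower().split()
    en_hits = sum(token in EN_HINTS for token in tokens)
    pt_hits = sum(token in PT_HINTS for token in tokens)
    return "pt" if pt_hits > en_hits else "en"

# Toy rows that mimic the artist/lyric columns; the lines are invented.
lyrics = [
    {"artist": "Angra", "lyric": "carry on and we will find the way"},
    {"artist": "Angra", "lyric": "o vento que sopra em um dia de sol"},
]
english_only = [row for row in lyrics if guess_language(row["lyric"]) == "en"]
print(len(english_only), "of", len(lyrics), "lyrics kept")
```

A heuristic like this will misjudge very short or mixed EN/PT-BR lyrics, so it is worth spot-checking the rows it drops.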
Here we can see that of the 325 lyrics, 38% are from Angra and 62% are from Sepultura. Angra has 96% (119) of its lyrics in EN and Sepultura has 96% (194) of its lyrics in EN.
The most remarkable song in PT-BR from Angra (IMHO) is Caça e Caçador, a song from the album Hunters and Prey. In the Temple of Shadows album, the song Late Redemption is a good piece in EN/PT-BR.
Sepultura has some songs in PT-BR like Filhos do Mundo from Bestial Devastation, Prenuncio from Against, and A Hora E A Vez Do Cabelo Nascer from the Beneath the Remains album, which is a cover of a Mutantes song. The most remarkable one in PT-BR is the song Policia.
With all PT-BR lyrics removed, let’s perform a quick check of all albums of these bands.
<iframe src=”https://medium.com/media/bb65cae491e3ef0b3952454c21648e02” frameborder=0></iframe>
At first look, considering our dataframe, we can see the first difference between these two bands: Sepultura has a larger discography and more songs per album.
This can be explained by the fact that, even though both bands faced a hiatus when they changed their main singers (Andre Matos and Edu Falaschi in Angra, and Max Cavalera in Sepultura), Sepultura released 8 albums after their break (all of them with Derrick Green) while Angra released 6 in the meantime; Sepultura is simply a more prolific band.
Let’s keep that information in mind, because maybe it can be explained later in this analysis.
Let’s check the average number of songs per album.
<iframe src=”https://medium.com/media/d1086f6a3016cd2ea1c4eab6f3f1db15” frameborder=0></iframe>
As our visual inspection suggested, Sepultura not only has more albums, but also has more songs per album.
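The “songs per album” figure is just a group-and-average. Here is a small standard-library sketch of that computation; it is not the notebook’s dataframe code, and the rows are toy data rather than the real discography.

```python
from collections import defaultdict

def average_songs_per_album(rows):
    """Return {artist: mean number of songs per album} from (artist, album, song) rows."""
    albums = defaultdict(lambda: defaultdict(int))
    for artist, album, _song in rows:
        albums[artist][album] += 1
    return {
        artist: sum(counts.values()) / len(counts)
        for artist, counts in albums.items()
    }

# Toy rows, not the real discography counts.
rows = [
    ("Angra", "Album A", "song 1"), ("Angra", "Album A", "song 2"),
    ("Angra", "Album B", "song 3"),
    ("Sepultura", "Album X", "song 4"), ("Sepultura", "Album X", "song 5"),
    ("Sepultura", "Album X", "song 6"),
]
averages = average_songs_per_album(rows)
print(averages)  # {'Angra': 1.5, 'Sepultura': 3.0}
```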
To start our analysis, one important aspect of Text Analysis is the [data pre-processing](https://en.wikipedia.org/wiki/Data_pre-processing). This is where we can literally ruin the whole analysis, because the pre-processing is responsible for removing noise from the data and normalizing it to get meaningful results. Kavita Ganesan made a great analysis of this topic and I strongly recommend reading it.
The first step will be to remove all English stopwords from all lyrics.
PS: Personally I don’t like to use off-the-shelf stopword lists, because every domain demands specific word subsets to decide whether a word is important or not. But let’s keep it that way for the sake of simplicity. This nice text by Martina Pugliese explains it in detail. In terms of implementation, this article by ML Whiz is probably the best resource available on the internet.
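As a rough idea of what this step does, here is a minimal sketch with a tiny hand-written stopword set. The list and the sample sentence are assumptions for illustration; a real run would use a fuller list. Dropping apostrophes before tokenizing is also an assumption, chosen because it yields tokens like “dont” and “youre” of the kind that show up in the topics later.

```python
import re

# Tiny illustrative list; a real run would use a fuller (ideally domain-specific) list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "i", "you", "my", "on"}

def remove_stopwords(lyric, stopwords=STOPWORDS):
    """Lowercase, drop apostrophes, keep alphabetic tokens, and filter out stopwords."""
    tokens = re.findall(r"[a-z]+", lyric.lower().replace("'", ""))
    return [token for token in tokens if token not in stopwords]

cleaned = remove_stopwords("The pain is in my mind, and I can't stop the war")
print(cleaned)  # ['pain', 'mind', 'cant', 'stop', 'war']
```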
<iframe src=”https://medium.com/media/560ce46ad742905111930f074b5f5adb” frameborder=0></iframe>
After the stopword removal, let’s perform a quick visual check of the most frequent words used by these two bands. In other words: what are the most used words in their compositions?
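Counting the most frequent words boils down to a few lines with `collections.Counter`. This is a sketch on toy token lists, not the real corpus, and `top_words` is a hypothetical helper name.

```python
from collections import Counter

def top_words(token_lists, n=3):
    """Count tokens across all songs and return the n most frequent (word, count) pairs."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(n)

# Each inner list stands for one already-cleaned song (toy data).
songs = [["time", "life", "dream"], ["time", "heart"], ["time", "life"]]
print(top_words(songs, n=2))  # [('time', 3), ('life', 2)]
```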
<iframe src=”https://medium.com/media/4698aa2f2c7583a49f6ea0fe48482ffd” frameborder=0></iframe>
<iframe src=”https://medium.com/media/416fa3feffb426f31b431251ef3a038d” frameborder=0></iframe>
If I were to classify Angra’s lyrics based on their most common expressions, it would look like this:
Time relation: Time, Day, Wait, Night
Feelings: Like, Heart, Soul, Lie
Movement and distance: Come, Way, Away, Carry
Living and mind: Life, Let, Know, Dream, Mind, Live, inside, Leave
The absolute state of the world: Die
Typical Heavy Metal cliché expression: Oh
Now, a quick check of Sepultura’s lyrics:
<iframe src=”https://medium.com/media/8538f06589c8a84c93919e26f115920c” frameborder=0></iframe>
State of the modern world: Death, War, Hate, Die, Lie, Pain, Lose, Fear, Blood, World
Action, distance and time: Way, Come, Stop, Make, Rise, Time, Think, Hear, Know
Mind issues: Live, Mind, Feel, Want, Look, Inside
Some latent differences between the themes discussed by Sepultura and Angra arise:
The axis of Sepultura’s lyrical composition converges on subjects linked to death, pain, war, and hatred (if you don’t already know, Sepultura means “Grave” in PT-BR), which are considered the most aggressive/heavy themes;
Angra has a lighter theme, talking more about existential issues that involve the passage of time, as well as some songs with feelings linked to dreams and to internal conflicts.
Let’s see the word clouds of the most frequent words of the two bands, just for a small comparison of the vocabulary used by each.
<iframe src=”https://medium.com/media/87ac65b435e8d0bf15be0e8b40062397” frameborder=0></iframe>
<iframe src=”https://medium.com/media/ccd0a7059cca558c01a746b26b3fc686” frameborder=0></iframe>
Main words Angra:
- life, time, know, heart, day, away, know, dream
<iframe src=”https://medium.com/media/fa4aceef1f04626e759872f1a5d20bc4” frameborder=0></iframe>
Main words Sepultura:
- way, life, death, world, mind
According to Johansson (2009), lexical diversity is a measure of how many different words are used in a text. A practical definition of Lexical Diversity (LD) is given by McCarthy and Jarvis (2010): LD is the range of different words used in a text, with a greater range indicating a higher diversity.
A special consideration here is that Heavy Metal songs are not supposed to contain a lot of different words, i.e. great lexical richness. That’s because in most cases each band follows a single artistic concept and shapes its creative efforts around some themes, and of course because most of the time this kind of song has many choruses.
For example (regarding band concept), Avantasia is a Heavy Metal supergroup that talks about fiction, fantasy and religion; on the other side, Dream Theater talks about almost everything, from religion to modern politics.
With this disclaimer, let’s check the Lexical Diversity of these bands.
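One simple way to put a number on lexical diversity is the type-token ratio: distinct words divided by total words. I’m assuming that measure here only for illustration; the exact formula used in the notebook may differ.

```python
def lexical_diversity(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# A chorus-heavy toy lyric scores low; a lyric with no repetition scores 1.0.
print(lexical_diversity(["roots", "bloody", "roots", "roots"]))  # 0.5
print(lexical_diversity(["time", "soul", "fate", "dream"]))      # 1.0
```

Keep in mind that this ratio tends to fall as a text gets longer, so it is fairer to compare texts of similar size.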
<iframe src=”https://medium.com/media/4e6c3a35d4f41286d6682d924f5372e3” frameborder=0></iframe>
<iframe src=”https://medium.com/media/8518a34ec129e29408870a985afcde23” frameborder=0></iframe>
There is almost no difference in lexical diversity between these two bands, i.e. even though they use different words to shape their themes, there are no substantial differences in how varied their vocabulary is.
According to Wikipedia, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
In other words, n-grams are sequences of n words that can be used to model the probability that some sequence appears in a corpus. In our case, n-grams let us examine the most frequent combinations of n words in the bands’ literary vocabulary.
For the sake of simplicity, we will focus on bigrams (n=2) and trigrams (n=3).
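Extracting n-grams from a token list takes very little code. This is a minimal sketch on a made-up token sequence; `ngrams` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy token sequence with a repeated phrase, like a chorus.
tokens = "carry on carry on time to forget".split()
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
print(bigram_counts.most_common(1))  # [(('carry', 'on'), 2)]
```

This also shows why a repeated chorus, or the same song present on two records, pushes its n-grams to the top of the ranking.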
<iframe src=”https://medium.com/media/89c641b1fe5df29fa5597d71f00be815” frameborder=0></iframe>
<iframe src=”https://medium.com/media/e4b4c97efe265799719244dbe086b534” frameborder=0></iframe>
Here we can see a few things:
“You’re” is the top combination for n=2. This indicates that across the whole corpus of Angra’s songs, there are lyrics that contain some kind of message addressed to another person.
One of the most frequent bigrams is carry on, but there is a reason for this in the data: the data set includes both the Angels Cry and Holy Live records, which contain the song Carry On, and this causes double counting;
The me cathy and cathy come bigrams appear because of a cover of [Wuthering Heights](https://www.azlyrics.com/lyrics/angra/wutheringheights.html) by Kate Bush, which repeats this chorus a lot;
We have the traditional heavy metal chorus filler cliché oh oh appearing.
<iframe src=”https://medium.com/media/543ae4a92c4d246f73ce0d306442eef1” frameborder=0></iframe>
Again carry on appears, in carry on time, on time forget, and remains past carry;
The word Cathy appears in the trigrams heathcliff me cathy, me cathy come, and cathy come home;
Some bizarre patterns like ha ha ha, probably because of the data cleansing.
<iframe src=”https://medium.com/media/32d18f704e46785dcdc799fad4aa8aff” frameborder=0></iframe>
In these bi-grams, we can already see a little more of Sepultura’s themes linked to brutality, as I mentioned before. Some mentions:
The song “Choke” has a very repetitive chorus, and that contributes to this composition of bi-grams;
The same goes for the classic “Roots”, which has a very striking chorus.
Let’s go to the tri-grams:
<iframe src=”https://medium.com/media/8d63a8c497e4a769aa05bed6220d292a” frameborder=0></iframe>
Here we see basically the same pattern, with part of the tri-grams coming from some very striking choruses.
Now we know a little about the themes of the two bands. However, a question follows: within these themes, what are the latent topics behind each composition? That is, is there diversity among the themes inside each band’s concept? What if we could group these songs according to their literary composition?
And here’s where we’re going to use LDA.
First, we will filter each of the artists within their respective dataframe:
<iframe src=”https://medium.com/media/21ebdc7b3aa9004edf52d61d352f3bf2” frameborder=0></iframe>
For the topic modeling I’ll arbitrarily choose 7 topics for each artist (it could be more or fewer), only for didactic purposes and to keep things simple.
In other words: given all Angra and Sepultura lyrics, what are the top 7 topics that they write about most?
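For readers curious about what happens inside the model, here is a from-scratch sketch of LDA fitted with collapsed Gibbs sampling, using only the standard library. It is not the code behind the results in this post (a library implementation is the practical choice); the function name, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions for illustration, and it uses 2 topics instead of 7 to keep the output readable.

```python
import random

def fit_lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (theta, top_words): theta[d] is the
    topic mixture of document d, top_words[k] lists topic k's heaviest words.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    doc_topic = [[0] * n_topics for _ in docs]                       # count(doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # count(topic, word)
    topic_total = [0] * n_topics                                     # words per topic

    def update(d, word, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][word] += delta
        topic_total[k] += delta

    # Start from a random topic for every word occurrence.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for word, k in zip(doc, assignments[d]):
            update(d, word, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                update(d, word, assignments[d][i], -1)  # forget the current topic
                # P(topic | everything else) is proportional to
                # (how much the doc uses the topic) * (how much the topic uses the word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    for k in range(n_topics)
                ]
                new_k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = new_k
                update(d, word, new_k, +1)

    # Smoothed per-document topic mixtures; each row sums to 1.
    theta = [
        [(count + alpha) / (len(doc) + alpha * n_topics) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        sorted(vocab, key=lambda w: topic_word[k][w], reverse=True)[:5]
        for k in range(n_topics)
    ]
    return theta, top_words

# Toy documents, not the real lyrics.
toy_docs = [
    "time dream heart time life".split(),
    "war death pain war blood".split(),
    "dream life time heart".split(),
    "blood war pain death".split(),
]
theta, top_words = fit_lda_gibbs(toy_docs, n_topics=2)
for k, words in enumerate(top_words):
    print(f"Topic #{k}: {' '.join(words)}")
```

The number of topics is an input, not something the model discovers, which is why the choice of 7 is arbitrary and worth revisiting.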
<iframe src=”https://medium.com/media/3bce4e55a463cf5563e017fa58d8aeaa” frameborder=0></iframe>
<iframe src=”https://medium.com/media/b921af289c5aafea21e6c2e00a7f116e” frameborder=0></iframe>
<iframe src=”https://medium.com/media/d4353893097f00636572f192fc4aac1e” frameborder=0></iframe>
<iframe src=”https://medium.com/media/30146460c1a76a16a2b50b85d5e7e946” frameborder=0></iframe>
Topic Distribution Angra
<iframe src=”https://medium.com/media/8eef9f6055b8bb6bfdc31810f5e23ad9” frameborder=0></iframe>
We can see here that much of Angra’s lyrical content is concentrated in topics 4, 0, and 2, which I call Look and know about the world along the time (#4), Face the pain along the time (#0), and In life dreams come and go away (#2).
Topic #4: eyes time life ive world love say inside know got
Topic #0: time dont day away way youre face pain just cause
Topic #2: let come like away day life wont wonder cold dreams
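One plausible way to build a topic distribution like this is to assign each song to its highest-weighted topic and count songs per topic. The sketch below assumes that reading and uses made-up weights; the notebook may compute the chart differently.

```python
from collections import Counter

def dominant_topic_counts(doc_topic_weights):
    """Count how many documents have each topic as their highest-weighted one."""
    dominant = [
        max(range(len(weights)), key=weights.__getitem__)
        for weights in doc_topic_weights
    ]
    return Counter(dominant)

# Each row is one song's topic mixture (toy numbers with 3 topics).
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
]
counts = dominant_topic_counts(weights)
print(counts.most_common())  # [(0, 2), (2, 1)]
```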
Now let’s check Sepultura:
Topic Distribution Sepultura
<iframe src=”https://medium.com/media/d67136f1c90de502241d21aa596079f8” frameborder=0></iframe>
Sepultura focuses on topics such as life and fear time away (#3), being alive in a world with pain and death (#5), and Living in a world with war and blood spilling (#0).
Topic #3: dont just away time fear youre know life right look
Topic #5: end theres dead world death feel eyes pain left alive
Topic #0: war live world hear trust feel walk believe blood kill
Word per Topics
Here is just a table for us to check the order of the words in the topics that permeate the lyrics of these two bands.
A special highlight here is that this dataframe also considers the frequency and rank of each word within the topic.
<iframe src=”https://medium.com/media/a0812e2f1ad74e0c46cc180104cf4ebf” frameborder=0></iframe>
Word per Topics Angra
<iframe src=”https://medium.com/media/70a81a34adf461dd1a4a832079f47973” frameborder=0></iframe>
Word per Topics Sepultura
<iframe src=”https://medium.com/media/7acace4f5e69f1711deeeb9df10d3c60” frameborder=0></iframe>
Topic Plotting with word distribution
Here, with the pyLDAvis library, we can take a look at how the topics are distributed via visual inspection. The graph presented by pyLDAvis is called the Intertopic Distance Map, which consists of a two-dimensional plane in which the topic centers are determined by computing the distance between topics and then using multidimensional scaling to project the intertopic distances onto two dimensions.
With that, we’re able to interpret the composition of each topic and see which individual terms are most useful within a topic.
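To give a feel for the “distance between topics” part: as far as I know, LDAvis uses the Jensen-Shannon divergence between the topics’ word distributions by default. Here is a minimal sketch of that distance with toy distributions; it leaves out the scaling to two dimensions and is not pyLDAvis code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 only for identical distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Toy word distributions of two topics over the same 3-word vocabulary.
topic_a = [0.5, 0.5, 0.0]
topic_b = [0.0, 0.5, 0.5]
print(round(js_divergence(topic_a, topic_b), 4))  # 0.3466
print(js_divergence(topic_a, topic_a))            # 0.0
```

Topics that share most of their heavy words end up close together on the map, while topics with disjoint vocabularies reach the maximum distance of ln 2.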
A more comprehensive introduction to LDAvis can be found in Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63–70).
<iframe src=”https://medium.com/media/b767ba68475b5536f3dfcfeafcbd4809” frameborder=0></iframe>
LDA Plot for all Angra’s Topics
LDA Plot for all Sepultura’s Topics
Initially, I had 3 questions in mind, and after this whole trip through NLP and LDA I personally think that I have some answers to them.
1) What are the literary characteristics of Angra and Sepultura? Answer: Angra’s main literary characteristics are themes related to time, the soul and life, the mind and fate, and waiting for something.
Sepultura has a more aggressive literary composition in which they speak about death, war and pain, and several of their lyrics face death directly. They protest against a lost or sick world most of the time.
2) Which kinds of themes do they write about? Answer: Angra: Time, soul and fate. Sepultura: Death, War, and World of Pain.
3) Who has more diversity in their topics? Answer: Using an arbitrary number of 7 topics, we can see that Sepultura has more diversity in terms of the distribution of topics.
Further ideas and TODOs
Include track names
Compare Sepultura Eras (Max — Derrick)
Compare Angra Eras (Matos, Falaschi, Lione)
Similarity between tracks (Content-Based)
Sepultura/Angra LSTM music lyric generator
Dominant topic per album
Lyric Generation using LSTM
References and useful links
Machine Learning Plus — LDA in Python — How to grid search best topic models?
Susan Li — Building a Content Based Recommender System for Hotels in Seattle
Susan Li — Automatically Generate Hotel Descriptions with LSTM
Shashank Kapadia — End-To-End Topic Modeling in Python: Latent Dirichlet Allocation (LDA)
Meghana Bhange — Arctic Monkeys Lyrics Generator with Data Augmentation
Greg Rafferty — LDA on the Texts of Harry Potter
Code Academy — Using Machine Learning to Analyze Taylor Swift’s Lyrics
Alexander Bell — Music Lyrics Analysis: Using Natural Language Processing to create a Lyrics-Based Music Recommender
Trucks and Beer — Amazing project with Lyrics scrapper
Anders Olson-Swanson — Natural Language Processing and Rap Lyrics
Brandon Punturo — Drake — Using Natural Language Processing to understand his lyrics
Degenerate State — Heavy Metal and Natural Language Processing — Part 1
Degenerate State — Heavy Metal and Natural Language Processing — Part 2
Degenerate State — Heavy Metal and Natural Language Processing — Part 3
Packt_Pub — Generating Lyrics Using Deep (Multi-Layer) LSTM
Mohammed Ma’amari — AI Generates Taylor Swift’s Song Lyrics
Notebook Taylor Swift’s Song Lyrics — Link in Colab
enrique a. — Word-level LSTM text generator. Creating automatic song lyrics with Neural Networks
Ivan Liljeqvist — Using AI to generate lyrics
Sarthak Anand — Music Generator
Franklin Wang — song-lyrics-generation
Tony Beltramelli — Deep-Lyrics