A small journey in the valley of Natural Language Processing and Text Pre-Processing for German language

Originally posted in MyHammer blog.

TL;DR: If you find yourself in the same situation I was in (i.e. millions of records with labeling problems, no fluency in the language, 200+ classes to predict, and all of this in a very specific business segment), invest as much time as you can in text pre-processing, in generating word embeddings, and in using some language rules/heuristics to refine your corpora.

Warning: Very long post with tons of references. Will take at least 40 min of reading.

A small journey in the German language for Pre-Processing in NLP

This is a summary of a talk that I was supposed to give at Data Council in Berlin last year; instead, I gave a broader one called Low Hanging Fruit Projects in Machine Learning. This post expands on some of the bullet points that I prepared for that presentation. If you saw my talk about LHF Projects, some of the content will be familiar to you.

Disclaimer Project Report: This is only a project report with additional personal views and experiences, i.e. this is not a post in Towards Data Science, Keynote in O’Reilly Strata conference, best practices talk, Top-10-rule-list-that-you-must-do, Cautionary Tale, cognitive linguistics, applied linguistics, computational linguistics or any other kind of science at all.


With all this hype about language models like BERT, GPT-2, RoBERTa, and others, there is no doubt that NLP is one of the hottest topics nowadays.

NLP is at the center of countless discussions today, for instance the one around the “dangerous” GPT-2. OpenAI said that the model would be dangerous to society and did not release the weights. A few months later some good developers managed to replicate all the code, and afterwards everyone saw that it was a good model that sometimes generates very brittle results.

Debates aside, one positive point today is that there are countless resources that bring the state of the art together with everyday applications, for example this NLP e-mail list provided by Sebastian Ruder.

However, what I am going to describe in the following lines are some small aspects of NLP for the German language regarding the pre-processing part of a text project.

I will cover some basic aspects of our journey, some more project-level considerations that deal with natural language processing, and some other things that I saw during that time.

Los geht’s?

German Language: Respecting the unknown unknowns

A hard lesson that I learned during this time was: Language is extremely hard! No Deep Neural Network architecture will rescue you, no AutoML will solve your problem, and no big pre-trained model will be useful unless you do proper text pre-processing.

If I could choose a single piece of advice from this very long note, it would probably be the following mottos:

Original Saying:

“ Give me six hours to chop down a tree and I will spend the first four sharpening the ax.” (Abraham Lincoln)

Machine Learning Saying:

“ Give me six hours to deliver a Machine Learning Model and I will spend the first four doing Feature Engineering.”

German NLP Saying:

“ Give me six hours to deliver a German NLP Model and I will spend the first five hours and thirty minutes doing text pre-processing.”

As a few of you know, I wasn’t born or raised in any German-speaking country, and this already places a very big initial barrier on some aspects of the language, such as its nuances and even the understanding of trivial issues such as grammatical structure.

And here is the first tip: If you are not a native speaker of the language, I suggest reaching at least the equivalent of an A2 certificate, so that you understand the basic grammatical structure before dealing directly with that language.

German has a very different sentence structure from my mother tongue (Brazilian Portuguese). Portuguese follows the SVO (Subject-Verb-Object) structure, while in German this order does not always hold; for instance, the verbs can be at the end of a sentence.

It may seem small, but for a Portuguese speaker a negation or even the action of the verb is indicated in the first words of a sentence, not at the end.

In German, this is not necessarily true. As someone literate in Brazilian Portuguese, I carry an extra mental load: I literally have to read the sentence to the end and do the translation work before I have the right context for what is going on in it.

One factor that helped me a lot was that I was dealing with simple service request texts on an internet platform. This eased things somewhat, because the textual structure is quite similar whenever someone wants certain types of services.

In other words, my corpora would be very restricted: they would need a very large degree of specialization, but in a single domain with a single corpus.

If it were a type of text that required a very high degree of specialization, such as a constitutional, legal, or scientific text, I would have to go a little beyond A2 just as an initial prerequisite.

Obviously, this is not a mandatory requirement, but in my experience understanding the language accounts for 20% or more of your model’s performance, simply because you know what makes sense or not in a sentence, or even in how you preprocess it.

Here is the simple tip: Respect the complexities of language, and if you are not a native speaker of the language, respect them even more and try to understand its structures before writing the first line of code. I will explore the language a bit more in a few topics later in this post.

First, let’s take a look at the MyHammer case.

Context: Classification as a triage for a better match between tradesmen and consumers

For those who don’t know, MyHammer is a marketplace that connects craftsmen with consumers who need home services of the best quality.

Our main objective is for craftsmen to receive relevant jobs to work on, and for consumers to place jobs that need to be done and receive good offers for high-quality services.

To reach that goal, we need to deliver the most suitable job to each craftsman, considering their skills, availability, relevance, and how interesting the job is to them in economic terms.

Our Text Classification project enters that equation by re-labeling some jobs that sit in different categories inside our platform, helping those matches happen.

In summary, our data hold the following characteristics: 

  • 200+ classes
  • Overlap of keywords between several classes
  • Mislabeled past data: as we created new classes, we didn’t correct the past (a phenomenon I call “Class Drift”)
  • Tons of abbreviations
  • Hierarchical data (taxonomy) in business terms, but not related at all in terms of language semantics
  • A lot of classes with 1000+ words per record
  • Dominance of imbalanced data (Top 10 categories have 26% of all data, Bottom 100 have less than 10%)
  • Miscellaneous classes that encompass several categories, which raises the entropy between classes

With this scenario, we made the first hard decision about the project, which was to invest at least 95% of our time in understanding the language across each class and building a strong pre-processing pipeline.

In other words: If we understand the language inside our corpora well, we can use that to our advantage even with plain vanilla models.

With that in mind, we jumped into understanding our language better instead of starting with algorithms and hoping that some very complex one would work.

Language is Hard

Language is not hard. Language is very hard. I don’t want to go into much detail about the hype around it and the brittleness of the state of the art.

But personally speaking, I strongly believe that we are far away from even getting close to solving the kind of problem that involves language in terms of conversation, or even from having machines generate texts good enough to pass a simple essay.

Language contains tons of aspects and complexities that make everything hard. This very good post from MonkeyLearn describes some of those complexities, such as:

polysemy: words that have several meanings

synonymy: different words that have similar meanings

ambiguity: statement or resolution is not explicitly defined, making several interpretations plausible.

phonology: systematic organization of sounds in spoken languages and signs in sign languages

morphology: study of the internal structure of words and forms a core part of linguistic study today.

syntax: Set of rules, principles, and processes that govern the sentence structure in a given language

semantics:  study of meaning in language that is concerned with the relationship between signifiers—like words, phrases, signs, and symbols—and what they stand for in reality, their denotation.

To understand even one of these aspects in depth would demand at least a full-time master’s degree.

The point that I would like to make here is that it pays to know these aspects and to accept that language is hard.

From the beginning, our strategy was to start with a statistical approach to prune non-relevant words out of our corpora. After the heavy lifting was done, we would move to the language/symbolic approach to fine-tune the corpora before training models, to be on the safer side in terms of NLP modeling.

Symbolic or statistical, what’s the best approach?

There’s a huge discussion about Symbolic versus Statistical approaches for language occurring nowadays.

Some proponents of the statistical approach as the main paradigm are Yann LeCun and Yoshua Bengio, and on the other side of the debate is Gary Marcus. There are some resources and debates available about that, like this one between LeCun and Marcus and this thread about it.

For practitioners who are in the trenches daily, I would suggest pragmatism: use all tools and methods that solve your problem in an efficient and scalable way.

Here at MyHammer, I adopted a statistical approach for the heavy lifting and some language rules for tuning. Here are the quotes from Lexalytics that I like about it:

[…]The good point about statistical methods is that you can do a lot with a little. So if you want to build a NLP application, you may want to start with this family of methods[…]

[…]Statistical approaches have their limitations. When the era of HMM-based PoS taggers started, performances were around 95%. Well, it seems a very good result, an error rate of 5% seems acceptable. Maybe, but if you consider sentences of 20 words on average, 5% means that each sentence will have a word mislabeled […]

Source: Machine Learning Micromodels: More Data is Not Always Better

Yoav Goldberg gave a great talk at the spaCy IRL conference called “The missing elements in NLP”, where I think he excelled in saying that as we move from linguistic expertise to a more Deep Learning approach to model NLP, we are on a path towards less debuggability and a more black-box approach. We can see this better in the following slide:

NLP Tomorrow

After that, I took a very hard decision: to stay with a tool for training the Text Classification model that could provide me a certain minimum level of debuggability and transparency. That’s why in the beginning I chose Facebook’s FastText.

I know that FastText uses neural networks internally, but since FastText relies a lot on word n-grams (the WordNGrams parameter), if I needed to debug some result or a convergence problem in the classes, I could use simple data analysis to explain why we were getting some rogue results.

The strategy here was: Let’s do very extreme pre-processing on our corpora to get them as lean as we can, and after this optimization we can play with different models and see what happens.
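As a minimal sketch of the kind of "simple data analysis" meant here, the snippet below counts the most frequent word n-grams per class so that a suspicious prediction can be traced back to the n-grams that dominate a label. The function names and the toy records are hypothetical and only illustrate the idea; this is not FastText's internal representation.

```python
from collections import Counter, defaultdict

def word_ngrams(tokens, n):
    """Return all contiguous word n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams_per_class(records, n=2, k=3):
    """Count word n-grams per label and return the k most frequent per label."""
    counts = defaultdict(Counter)
    for text, label in records:
        tokens = text.lower().split()
        counts[label].update(word_ngrams(tokens, n))
    return {label: counter.most_common(k) for label, counter in counts.items()}

# Hypothetical toy records: (request text, class label)
records = [
    ("umzug von hamburg nach köln", "moving"),
    ("umzug von hamburg nach berlin", "moving"),
    ("wohnung streichen in hamburg", "painting"),
]
report = top_ngrams_per_class(records, n=2, k=2)
for label, ngrams in report.items():
    print(label, ngrams)
```

Reading such a report class by class is usually enough to spot n-grams that say more about the customer's location or writing habits than about the service itself.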

To illustrate, this figure from Kavita Ganesan explains our point:

Level of Text preprocessing. Source: All you need to know about text pre-processing for NLP and Machine Learning.  https://www.freecodecamp.org/news/all-you-need-to-know-about-text-preprocessing-for-nlp-and-machine-learning-bc1c5765ff67/

Some specifics in Pre-Processing for the German language

Umlauts (ä, ö and ü)  and Encoding
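A minimal sketch of two common ways to handle umlauts and encoding, assuming the text is already decoded to Python strings: Unicode NFC normalization (so that "u" plus a combining diaeresis and a precomposed "ü" compare equal) and an optional transliteration to ASCII digraphs. The helper names are hypothetical, and the handling of "ß" is an addition beyond the three umlauts named in the heading.

```python
import unicodedata

# Transliteration table; "ß" is included as an assumption, it is not an umlaut.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def normalize_unicode(text):
    """Compose characters so that 'u' + combining diaeresis becomes a single 'ü'."""
    return unicodedata.normalize("NFC", text)

def transliterate_umlauts(text):
    """Normalize first, then replace umlauts with their ASCII digraphs."""
    return normalize_unicode(text).translate(UMLAUT_MAP)

print(transliterate_umlauts("Küche"))
```

Whether to transliterate at all is a design choice: it makes "Kueche" and "Küche" collapse into one token, but it can also merge words that are genuinely different.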

Long German Nouns

As these long nouns can appear depending on the situation inside a class, our strategy was to analyze the WordNGrams and TF-IDF scores and check the relevance of these words. If a word was relevant, we used some rules to break it down and kept it; if not, we removed it from the vocabulary.
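The exact rules are not listed here, so the following is only a sketch of one simple rule-based approach: a dictionary-driven splitter that tries to cut a compound into known words, allowing a few common linking elements between the parts. The vocabulary, the linking elements and the function name are assumptions for illustration.

```python
# Assumed linking elements between compound parts ("" means no linking element).
LINKING_ELEMENTS = ("", "s", "n", "es", "en")

def split_compound(word, vocabulary, min_len=3):
    """Split a compound into known parts; return [] if no full split is found."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head not in vocabulary:
            continue
        for link in LINKING_ELEMENTS:
            if not tail.startswith(link):
                continue
            rest = tail[len(link):]
            if len(rest) < min_len:
                continue
            parts = split_compound(rest, vocabulary, min_len)
            if parts:
                return [head] + parts
    return []

# Hypothetical domain vocabulary of base words
vocabulary = {"bad", "sanierung", "küche", "montage", "arbeit", "platte"}
for noun in ("Badsanierung", "Küchenmontage", "Arbeitsplatte"):
    print(noun, split_compound(noun, vocabulary))
```

A splitter like this is only as good as its vocabulary, which is one more reason to build the vocabulary from your own domain corpus.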

Part-of-Speech Tagging

  • We used Part-of-Speech tagging to remove some words from our corpora. Our strategy consisted of: a) always keeping the verbs; b) using placeholders to abstract data entities inside our text (e.g. in our domain we use placeholders like sqm (square meters), and this placeholder gives some information gain in all the particular classes that contain this word); c) cutting out pronouns and conjunctions, as in our case they do not contain any meaningful information most of the time.

For those interested in some examples, this table from NLTK is a good start:
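The three rules above can be sketched as follows. To keep the example independent of any particular tagger, it assumes the text has already been tagged into (token, tag) pairs using the universal tagset (VERB, NOUN, PRON, CONJ, ...); the regular expression for area mentions and the function names are hypothetical.

```python
import re

# Assumption: area mentions such as "80 qm" or "25,5 m2" collapse to the placeholder "sqm".
AREA_PATTERN = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:qm|m2|quadratmeter)\b", re.IGNORECASE
)

# Tags we drop; verbs are never in this set, so they are always kept.
DROP_TAGS = {"PRON", "CONJ"}

def replace_area(text):
    """Replace concrete area values with the 'sqm' placeholder."""
    return AREA_PATTERN.sub("sqm", text)

def filter_by_pos(tagged_tokens, drop_tags=DROP_TAGS):
    """Keep every token whose tag is not in drop_tags."""
    return [token for token, tag in tagged_tokens if tag not in drop_tags]

# Hypothetical tagger output for "ich möchte wohnung und flur streichen"
tagged = [
    ("ich", "PRON"), ("möchte", "VERB"), ("wohnung", "NOUN"),
    ("und", "CONJ"), ("flur", "NOUN"), ("streichen", "VERB"),
]
print(filter_by_pos(tagged))
print(replace_area("Wohnung mit 80 qm streichen"))
```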

German examples. Source: NLTK Universal Part-of-Speech Tagset

Stopwords: Analyze first, cut after…

One of the biggest endeavors in the project was to find a good library with consolidated German corpora, in a world where the majority of implementations and SOTA algorithms are crafted for English as a first-class citizen language.

(Short Note: this blog post was written in July/2019. Since then, NLP libraries for German have evolved a lot. However, as we consolidated all the ideas contained here in our own NLP library, we decided not to depend on those libraries anymore.)

As our task was only a multi-label text classification and we had a good amount of data, we decided to cut out stopwords in an extreme way: since we were not going to build any later application that relied heavily on sequences, like LSTM or seq2seq, we wouldn’t need to “save words” for our classifier. This gave us more room to use a very unorthodox approach.

In the work of Silva and Ribeiro called “The importance of stop word removal on recall values in text categorization”, the authors showed a positive relation between stopword removal and recall, and we followed that methodology in our work.

If I could give one specific piece of advice on that matter, I would suggest using the out-of-the-box stopwords lists from those packages only if you have no time at all to analyze your corpora. Otherwise, always perform the analysis and create your personalized list.

To make my point clear about this matter, I’ll use the example from Chris Diehl.

Chris Diehl, in his post called “Social Signaling and Language Use”, provided a linguistic analysis of an e-mail from a company called Enron. This company was involved in a gigantic case of financial/corporate fraud, and the whole story is described in the documentary “Enron: The Smartest Guys in the Room”.

The analysis consisted of discovering whether a manager-subordinate social relationship exists.

The original e-mail is presented below:

Skimming it, we can see that there is a clear social relationship that characterizes subordination. However, if we give this same text to a regular stopwords package, this will be the outcome:

As Chris Diehl pointed out, the terms that matter in this message are function words, not content words. In other words, the removal of these words could mischaracterize the whole message and meaning. For more, I suggest reading the entire article written by Chris.

A great post about the differences in stopwords across several open source packages was written by Gosia Adamczyk: Common pitfalls with the preprocessing of German text for NLP, where she shows some differences between those packages.

Source: Common pitfalls with the preprocessing of German text for NLP, Gosia Adamczyk 

The key takeaway that we got here was: Trust the stopwords from packages, but check them and, if necessary, mix all of them and use some information about your domain to enhance the result.
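A minimal sketch of that takeaway: merge the lists coming from several packages, add domain-specific stopwords, and explicitly protect words that the analysis showed to be meaningful. All word lists below are toy examples and the function name is hypothetical; in practice the package lists would be loaded from the libraries you compare.

```python
def build_stopword_list(package_lists, domain_stopwords=(), protected_words=()):
    """Union of all package lists plus domain words, minus protected words."""
    merged = set()
    for words in package_lists:
        merged.update(word.lower() for word in words)
    merged.update(word.lower() for word in domain_stopwords)
    return merged - {word.lower() for word in protected_words}

# Toy stand-ins for lists taken from two different packages
list_a = ["und", "oder", "nicht"]
list_b = ["Und", "aber"]

stopwords = build_stopword_list(
    [list_a, list_b],
    domain_stopwords=["hamburg"],   # found through corpus analysis
    protected_words=["nicht"],      # negation kept on purpose
)
print(sorted(stopwords))
```

Keeping the protected words as a separate argument documents the decision, which helps when a package update silently adds one of them back.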

Stopwords as Hyperparameters

In our project, as we started to go deeper into our corpora, we discovered very quickly that the normal list of stopwords was not only unsuitable for us; we also needed to consolidate as many of them as possible to remove from our corpora.

This was necessary because, dealing with such an amount of text, we wished to reduce training time as much as possible, and having lean corpora was mandatory for that.

The main problem that I see in the current stopwords lists is that they are built on top of tons of texts that are suitable for general purposes (e.g. German Senate corpora, Wikipedia dumps, etc.). When we needed to go into a specific domain like Craftsmanship, the coverage of those stopwords lists wasn’t enough, and not knowing that was a huge source of inefficiency for us. I talk about this later on, in how German city names almost broke our classifier.

We followed a strategy of analyzing the results: if we noticed some improvement in model performance or gains in processing time, we added more stopwords to the list. Roughly speaking, it was a kind of “stopwords list as hyperparameters”.

This example from Lousy Linguist illustrates my point:
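One way to picture "stopwords as hyperparameters" is a greedy loop that tries one candidate group of stopwords at a time and keeps it only if the evaluation score does not get worse. This is a sketch under assumptions: `evaluate` is a hypothetical callable that would train and validate a model for a given stopword set (replaced here by a toy scoring function), and only a single score is tracked, whereas processing time could be added as a second criterion.

```python
def tune_stopwords(base_stopwords, candidate_groups, evaluate, tolerance=0.0):
    """Greedily add candidate groups while the score stays within tolerance."""
    selected = set(base_stopwords)
    best_score = evaluate(selected)
    history = [("base", best_score)]
    for name, words in candidate_groups.items():
        trial = selected | set(words)
        score = evaluate(trial)
        history.append((name, score))
        if score >= best_score - tolerance:
            selected = trial
            best_score = max(best_score, score)
    return selected, history

def toy_evaluate(stopwords):
    """Toy stand-in for 'train a model and return its validation score'."""
    score = 0.80
    if "hamburg" in stopwords:
        score += 0.05   # pretend removing city names helps
    if "nicht" in stopwords:
        score -= 0.10   # pretend removing negation hurts
    return score

selected, history = tune_stopwords(
    {"und"},
    {"cities": ["hamburg", "münchen"], "negation": ["nicht"]},
    toy_evaluate,
)
print(sorted(selected))
```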

Source: https://twitter.com/lousylinguist/status/1068285983483822085

Here is a single example of how the lack of a personalized stopwords list almost broke our classifier.

At some point our models started to give very strange results when the text contained the name Hamburg or München. Basically, every time a model received those words, it always gave a single service as the prediction. In other words, it was a clear case of overfitting.

Long story short: We discovered that when customers placed a service request for our craftsmen with those two cities in the text, our classifier always returned the same class, no matter what other words the request contained (which makes total sense, since when someone needs to move, the cities are mentioned most of the time).

A single example to illustrate that:

  • I would like to move my piano from Hamburg to Köln (Moving service)
  • I would like to paint my apartment. I’m located in downtown Hamburg. (Painting service)

The problem was that for the second case we always ended up with the Moving service classification.

The solution here was to debug the model by analyzing the WordNGrams composition for this service and to include the cities as stopwords; after that, everything worked well.

There’s no exhaustive list, but if I can give some hints, I would group the stopwords lists like this:
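A sketch of the kind of check that can surface such a problem: for every token, compute which label dominates the documents containing it and how large that share is. Tokens that appear in many documents and point almost exclusively to one class, without describing the service, are candidates for the stopwords list. The records and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def token_class_share(records):
    """Map token -> (dominant label, share of that label, number of documents)."""
    per_token = defaultdict(Counter)
    for text, label in records:
        for token in set(text.lower().split()):
            per_token[token][label] += 1
    result = {}
    for token, label_counts in per_token.items():
        label, count = label_counts.most_common(1)[0]
        total = sum(label_counts.values())
        result[token] = (label, count / total, total)
    return result

# Hypothetical toy records: (request text, class label)
records = [
    ("umzug klavier von hamburg nach köln", "moving"),
    ("umzug von hamburg nach münchen", "moving"),
    ("umzug hamburg", "moving"),
    ("wohnung streichen in hamburg", "painting"),
]
shares = token_class_share(records)
print(shares["hamburg"])
```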

  • German states and cities: bayern, baden-württemberg, nordrhein-westfalen, hessen, sachsen, niedersachsen, rheinland-pfalz, thüringen
  • W-Fragen: wer, was, wann, wo, warum, wie, wozu
  • Pronouns: das, dein, deine, der, dich, die, diese, diesem, diesen, dieser, dieses, dir, du, er, es, euch, eur, eure, ich, ihm, ihn, ihnen, ihr, ihre, mein, meine, meinem, meinen, meiner, meines, mich, mir, sie, uns, unser, unsere, unserem, unseren, unserer, unseres, wir
  • Numbers: null, eins, zwei, drei, vier, fünf, sechs, sieben, acht, neun, zehn, elf, zwölf
  • Ordinals: erste, zweite, dritte, vierte, fünfte, sechste, siebte, achte, neunte, zehnte, elfte, zwölfte, dreizehnte
  • Greetings: ciao, hallo, bis, später, guten, tag, tschüss, wiederhören, wiedersehen, wochenende
  • Clarification words: dass, dafür, daher, dabei, ab, zb, usw, schon, sowie, sowieso, seit, bereits, hierfür, oft, mehr, na
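Keeping the groups above as named sets makes it easy to switch each one on or off during experiments. A minimal sketch, using only a small excerpt of the words and a hypothetical function name:

```python
# Excerpt of the groups; the full lists would go here.
STOPWORD_GROUPS = {
    "w_fragen": {"wer", "was", "wann", "wo", "warum", "wie", "wozu"},
    "greetings": {"ciao", "hallo", "tschüss"},
    "cities": {"hamburg", "münchen"},
}

def remove_stopwords(text, groups, enabled=None):
    """Remove tokens belonging to the enabled groups (all groups by default)."""
    names = groups.keys() if enabled is None else enabled
    stopwords = set().union(*(groups[name] for name in names))
    return " ".join(t for t in text.lower().split() if t not in stopwords)

print(remove_stopwords("Hallo wer kann in Hamburg streichen", STOPWORD_GROUPS))
print(remove_stopwords("Hallo wer kann in Hamburg streichen",
                       STOPWORD_GROUPS, enabled=["greetings"]))
```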

Some lessons along the way…

We learned some lessons along the way for our case. The point here is not to define any truths but only to exemplify that, during the process of data analysis and experimentation with our dataset, we found something totally different from what we usually hear about NLP and text pre-processing.

I divide this section between what did work and what didn’t work.

What didn’t work
  • Lemmatization: This was, I think, the most surprising takeaway. Using Lemmatization as a common “best” practice, we had a 3% decrease in Top@5 accuracy. The reason was that some categories have a specific subset of words that makes them distinguishable, and Lemmatization was causing an involuntary “category binning”. For example, we have some services that, despite having similar words in our corpora, are distinct, like Wunschlackierung (desired painting), Lackaufbereitung (paint preparation), Unfallreparaturen (accident repairs) and Kratzer im Lack (scratches in the paint); Lemmatization caused a great loss in the distinguishing aspects of the corpora of each class and, as a consequence, harmed the algorithm’s performance. To solve this problem, we removed Lemmatization from our pre-processing pipeline.

  • Lemmatization was too slow for our data: Another deal-breaker was that Lemmatization, even using a very good API such as spaCy, took ages on our data in the beginning. Our dataset contains millions of records, with text fields that contain dozens or hundreds of words per line. Even on a machine with 128 CPUs it took a long time, and as we also got this decrease in model performance, we abandoned that approach. [Note: In the beginning, we used an older version of spaCy that didn’t contain several improvements present in the current version]

  • Hyperparametrization of FastText: at the start of this project we relied a lot on FastText, and we don’t regret that. But a huge limiting factor is that FastText has only a small number of meaningful parameters to use as hyperparameters. Here I’m specifically talking about the parameters Window Size and Dimension, which on our data didn’t show any sign of being meaningful and/or useful in a grid-search or tuning strategy.

  • Spark for data pipeline and model building and training: Spark is a cool tool for Data Processing and I really like it, but for NLP, the integrations that use native Scala didn’t provide the minimum in terms of libraries, flexibility and ease of use for us to rely upon for text pre-processing in the German language. I personally think that text processing using linguistic features is one of the weakest points of the Spark libraries. We’re still using it, but only for data aggregation and to dump data from one place to another.
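For readers unfamiliar with the Top@5 accuracy mentioned in the list above: a prediction counts as correct if the true class is among the five highest-ranked classes returned by the model. A minimal sketch, with hypothetical names and toy data:

```python
def top_k_accuracy(true_labels, ranked_predictions, k=5):
    """Share of records whose true label is among the first k ranked predictions."""
    hits = sum(
        1 for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Toy example: each inner list is a model's ranking, best guess first
truth = ["moving", "painting", "plumbing"]
ranked = [
    ["moving", "painting"],
    ["moving", "painting"],
    ["moving", "painting"],
]
print(top_k_accuracy(truth, ranked, k=1))
print(top_k_accuracy(truth, ranked, k=2))
```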

What worked
  • Own library of Pre-Processing and Personalized list of Stopwords: This can sound a bit like over-engineering, but it was the only way for us to get maximum flexibility for our needs. We just united the best of all tools for German NLP, like stopwords lists, stemming, PoS and so on, and crafted something that saves us tons of time.

  • Hierarchical Models: This one deserves a special blog post, but for us, using the natural ontology of the classes helped a lot in terms of Top@5 accuracy, because with a quasi-segmented architecture for our models we naturally prune out the clashes between classes that are unrelated but contain some words in common.

  • Using EDA + TF-IDF score to remove low-frequency words: We learned that, for our case, removing words with a low TF-IDF score helped a lot in terms of experimentation without sacrificing performance. In the beginning, I just created a full list of TF-IDF scores, and as I added more words to the removal list, I monitored the performance to see if there was some loss. My heuristic was: keep as few words as possible, tolerating a loss in model performance of no more than 1%. This can look like quite a harsh metric, but at least in the beginning I was more concerned with having very lean corpora than with having good performance and a more complex and brittle model.

  • The oldie but goldie Regex: This worked way better than lambda and map() in Python for word replacement. The takeaway here is to use lambda and map() only if really necessary.

  • Word Embeddings: We didn’t use embeddings at first, to force ourselves to exhaust all possibilities with our corpora and hyperparameters. However, the loss in performance that I had from removing low IDF score words and using a personalized list of stopwords, I gained back just by plugging embeddings into the model using FastText.
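The TF-IDF pruning from the list above can be sketched in plain Python. Assumptions, all of them illustrative: whitespace tokenization, an unsmoothed IDF of log(N / df), the maximum score of a term across documents as its relevance, and a hypothetical threshold. In a real experiment the threshold would be lowered or raised while monitoring the model, as described above.

```python
import math
from collections import Counter

def max_tfidf_per_term(documents):
    """Return, for every term, its highest TF-IDF score across all documents."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    best = {}
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), tf * idf)
    return best

def prune_vocabulary(documents, threshold):
    """Drop every term whose best TF-IDF score is below the threshold."""
    scores = max_tfidf_per_term(documents)
    keep = {term for term, score in scores.items() if score >= threshold}
    return [
        " ".join(t for t in doc.lower().split() if t in keep)
        for doc in documents
    ]

# Toy corpus: "bitte" occurs in every document, so its IDF (and score) is 0
docs = ["bitte wohnung streichen", "bitte umzug planen", "bitte bad sanieren"]
pruned = prune_vocabulary(docs, threshold=0.1)
print(pruned)
```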

The key takeaway that we took was: Use the best practices as a starting point, but always check some alternatives, because our data can have some specificities that can generate better results.
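As an illustration of the regex-based word replacement mentioned in the list above, the sketch below compiles all replacements into a single pattern and applies them in one pass over the text. The replacement table is a hypothetical example, not our production mapping.

```python
import re

# Hypothetical abbreviation table: token -> replacement
REPLACEMENTS = {
    "qm": "sqm",
    "zb": "zum beispiel",
    "usw": "und so weiter",
}

# One alternation of whole words, longest keys first to avoid partial matches
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True))
    + r")\b"
)

def _lookup(match):
    """Return the replacement for the matched word."""
    return REPLACEMENTS[match.group(0)]

def replace_words(text):
    """Replace all known words in a single pass over the text."""
    return _PATTERN.sub(_lookup, text)

print(replace_words("bad 10 qm fliesen usw"))
```

Compiling the pattern once and reusing it matters when the same replacement runs over millions of records.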

Final Remarks

If I have to wrap up all of these things, I would highlight as the main ones:

  1. Recognize your own language and jargon: For us, understanding what we were dealing with, i.e. web-based texts in the context of Craftsmanship and their language, helped us to perform better pre-processing that enhanced our models;
  2. Make your own corpus/vocabulary/embeddings: Out-of-the-box stopwords lists and common text pre-processing helped a lot, but at the end of the day we needed to create our own stopwords list, and we used a list of low TF-IDF score words to prune out unnecessary words, reducing our corpora and consequently the training time;
  3. Do not automate misleading data: In our case, one of the capital mistakes in the beginning was to try to automate the data pipeline and cleaning without considering some specifics of our case, like jargon and abbreviations. It means: In Text Classification, data pre-processing is gold, models are silver.

Libraries for German NLP

  • German_stopwords: Last commit 3yo, but contains a good set of loosen words and abbreviations  
  • DEMorphy: morphological analyzer for German language (several types of Tagset for dictionary filtering)
  • textblob-de by markuskiller: PoS
  • GermaLemma: First one that uses PoS before Lemma (Spacy will do that out of the box in next version)
  • tmtoolkit: Wrapper with PoS, Lemmatization and Stemming for German. Good for EDA with Latent Dirichlet Allocation.
  • NLTK: There’s a small trick to use PoS with German
  • StanfordNLP: (for future test)
  • SpaCy: NLP with batteries included (syntactic dependency parsing, named entity recognition, PoS)
  • Python Stopwords: Generic compilation of several Stopwords