One of the hardest tasks in Machine Learning is Text Analysis or Classification. This is due to the nature of text data itself, which can carry arbitrary complexity in terms of vocabulary, semantics, etymology, morphology, grammar and polysemy, to name a few examples.
The paper by Kowsari et al., Text Classification Algorithms: A Survey, is probably one of the best resources on the topic in Machine Learning, at least from a practitioner's perspective.
The paper wraps up almost all of the available tools for Text Classification and explains, in clear language, the advantages and caveats of each one. It is worth stressing that the paper also considers the role of embeddings in capturing syntactic information (the position of a word in the text) and semantic information (the meaning of the words) to improve what the algorithms learn.
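To make the point about word position concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the two sentences and the helper names `bag_of_words` and `bigrams` are made up for illustration, and bigram counts stand in for the much richer positional information that embeddings can carry. It shows that a representation that ignores word order cannot tell the two sentences apart, while one that keeps local order can.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercase token, ignoring word order entirely."""
    return Counter(text.lower().split())


def bigrams(text):
    """Count adjacent token pairs, which preserves local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


# Two illustrative sentences with the same words but opposite meanings.
a = "dog bites man"
b = "man bites dog"

same_bow = bag_of_words(a) == bag_of_words(b)
same_bigrams = bigrams(a) == bigrams(b)

print(same_bow, same_bigrams)  # prints: True False
```

A classifier fed only the bag-of-words counts would see these two sentences as identical, which is one reason to use features that retain some notion of position.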
This is a mandatory resource for anyone who needs to apply Text Classification in practice, and the authors even built a repository on GitHub to make the blog post and the code of the paper available.
Abstract: In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods. Finally, the limitations of each technique and their application in real-world problems are discussed.
Kowsari et al.
Link: https://www.mdpi.com/2078-2489/10/4/150