Topic Modeling using LDA on the White Paper on Artificial Intelligence: a European approach to excellence and trust

This month the European Commission released a white paper called “White Paper on Artificial Intelligence: a European approach to excellence and trust”. The idea of this white paper is to establish guidelines for the EU regarding Artificial Intelligence (AI).

For those who don’t know, the report recognizes that the EU is now behind the USA and China when it comes to AI systems and user data, and it tries to give some perspective on the use of AI with industrial data, where the EU has a great advantage over those regions.

I won’t discuss the political aspects of the document, but there are some good summaries like this one from Covington Digital Health and this one also from them.

I’ve read the report and, at least for me, it was very disappointing, especially because EU member states have far better initiatives.

The reason I found the report disappointing is that the EU Commission, instead of looking at the best of some local initiatives, such as Finland’s (a very deep report that covers AI in business, self-regulation, and values like morals, ethics and policies as a north star) or Estonia’s (an ambitious plan to implement most of the initiatives in the public sector to increase its efficiency and give a proper return to taxpayers), just used trigger words about risks and regulation, as we’ll see below.

On top of that, considering the importance of AI nowadays and the problems and challenges that the EU will face in the foreseeable future (e.g. an aging population and its economic and social impacts, a challenging economic environment with central banks delaying a financial crisis via Quantitative Easing, an open tariff war between two of the biggest players, and so on), the report focuses too much on privacy and risks, with almost no mention of innovation, use in the public sector, or even the potential of AI.

The report speaks for itself, but my point here is just to bring some charts and some analysis of the text contained in the report.

First of all, let’s check as usual the most common words in the report:

As expected, the word {data} is number 1 (sorry, but there’s no AI without data). Along the way we have an interesting sequence, {eu, systems, risks}, that I consider to be the drivers of the whole report (I’ve read the full report).
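For readers who want to reproduce this kind of count, here is a minimal sketch using only the Python standard library. It is not necessarily the code in the repo: the `tokenize` and `top_words` helpers, the tiny stop-word list and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real analysis
# would use a much fuller one.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on",
              "that", "with", "are", "be", "by"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def top_words(text, k=10):
    """Return the k most frequent tokens together with their counts."""
    return Counter(tokenize(text)).most_common(k)

# Made-up sample text standing in for the report.
sample = (
    "AI systems need data. The use of data in AI systems creates risks, "
    "and the EU wants rules for AI systems and data."
)
print(top_words(sample, k=3))
```

On the real report, the same function would be called on the full extracted text instead of `sample`.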

Using a simple vanilla Word Cloud, we have:

Again, as we can see, the most frequent words in the report were {eu, ai, data, system, risk}.

But as I’ve said before, I really dislike Word Clouds and other kinds of raw word frequencies, so I decided to run TF-IDF over the whole document to check the real importance of each word within it. For simplicity, I took only the top 30 terms.

As expected, we have the same words at the top, {ai, data, systems, risks}, but we have some notable mentions: a) the word {regulatory} ahead of {human}; b) the word {commission} ahead of {citizens} and {innovation}; and c) the word {digital} outside the top 10.

But single words can be misleading in some cases, so let’s take a look at the most frequent word bigrams in the report:
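TF-IDF needs more than one "document" to compute the IDF part, and the post does not say how the report was split. The sketch below assumes the report is cut into chunks (for example paragraphs), scores each term per chunk, and sums the scores across chunks. The `tfidf_scores` helper, the smoothed IDF formula and the sample chunks are assumptions for illustration, not the code from the repo.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(chunks):
    """Sum per-chunk TF-IDF weights for every term across all chunks."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n_docs = len(docs)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for doc in docs for term in doc)
    scores = Counter()
    for doc in docs:
        total = sum(doc.values())
        for term, count in doc.items():
            tf = count / total
            # Smoothed IDF, so terms present in every chunk keep a positive weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += tf * idf
    return scores

# Made-up chunks standing in for paragraphs of the report.
chunks = [
    "AI systems use data",
    "AI applications carry risks",
    "data protection and AI rules",
]
scores = tfidf_scores(chunks)
print(scores.most_common(5))
```

Taking the top 30 terms, as in the chart, would then be `scores.most_common(30)`.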

In the top 3, we have {ai, systems}, {ai, applications} and {use, ai}, which are the drivers of the report. Not a surprising result. After those bigrams we have some interesting combinations like {fundamental, rights}, {regulatory, framework} and {personal, data}.

Let’s take a look at the trigrams:

If the bigrams tell a story about a regulatory framework for AI systems, the trigrams tell a clearer story about the major points of the report, which are: i) {highrisk, ai, applications}, ii) {remote, biometric, identification} and iii) {regulatory, biometric, ai}.

Using a simple LDA analysis with 7 topics (an arbitrarily chosen number), we got the following topics and their main terms, along with their Intertopic Distance Map:
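The bigram and trigram counts can be sketched with one small function. This is an illustration under assumptions, not the repo's code: `ngram_counts` is a hypothetical helper, the sample sentence is made up, and no stop words are removed here (the charts above suggest the original analysis did remove them first).

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count contiguous sequences of n tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Zipping n shifted views of the token list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sample text standing in for the report.
sample = ("high risk AI applications and high risk AI systems "
          "need a regulatory framework")
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(2))
```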

Topics found via LDA:

Topic #1:

{europe, ensure, assessment, law, economic, member, conformity}

Topic #2:

{systems, requirements, safety, product, existing, information, ensuring}

Topic #3:

{ai, data, use, applications, legal, national, highrisk}

Topic #4:

{risks, products, services, public, certain, authorities, enforcement}

Topic #5:

{eu, ai, commission, european, rules, framework, including}

Topic #6:

{rights, protection, system, relevant, particular, citizens, eg}

Topic #7:

{legislation, ai, liability, digital, need, set, paper}
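The post does not say which library or inference method produced these topics. As a rough illustration of what LDA does, here is a small collapsed Gibbs sampler in pure Python. It is a swapped-in technique, not the author's method: the `lda_gibbs` and `top_terms` helpers, the `alpha` and `beta` priors, the number of iterations and the toy documents are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (topic_word_counts, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]       # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                     # tokens per topic
    assignments = []
    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return topic_word, vocab

def top_terms(topic_word, vocab, k=3):
    """List the k highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:k]]
        for row in topic_word
    ]

# Made-up, already tokenized documents standing in for chunks of the report.
docs = [
    "ai data systems data ai".split(),
    "ai systems data applications".split(),
    "rights protection citizens rights".split(),
    "citizens protection rights law".split(),
]
topic_word, vocab = lda_gibbs(docs, n_topics=2)
for n, terms in enumerate(top_terms(topic_word, vocab), start=1):
    print(f"Topic #{n}:", terms)
```

For a document the size of the report, a library implementation would be the practical choice; the sampler above only shows the mechanics, and its topics depend on the seed and the priors.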


Today’s post was a short one because, as a matter of principle and for personal reasons, I don’t talk about politics, but I believe that with those charts anyone can at least guess the tone of the words in the report.

As usual, all the code and data are in the repo.

PS: These are my personal views, and this post doesn’t represent anyone other than me. I do not endorse any kind of political affiliation, candidate, or even any kind of political association inside of traditional frameworks.