report

Sentiment Analysis in the PERSONA project

Published in Social acceptance by

AA

What is Sentiment Analysis in the PERSONA context, and is it privacy-friendly?

Over the last couple of decades, well-known advances in computing power and techniques have permitted the development of a practical application of computational linguistics known as natural language processing (NLP). One of the main applications of NLP has been sentiment analysis (SA), that is, the interpretation of the emotion underlying texts. The process allows analysts to identify, categorize, and extract subjective information from written source material in order to help an organization understand the social sentiments expressed online about it or its work.

At CyberEthics Lab., we recently conducted an SA on Twitter and Google results for the PERSONA project, whose goal is to develop an integrated impact assessment method for evaluating the all-around acceptability of automated technologies in border crossing(1). PERSONA intended to disseminate questionnaires regarding said technologies to large samples of travellers at airports, ports, and train stations. However, travel restrictions triggered by the COVID-19 pandemic reduced the likelihood of success for that specific effort. Therefore, SA was viewed as a valid alternative to gather stakeholder feedback. Our goal was to assess the feelings users expressed online towards border control technologies such as aerial, land, and water drones, artificial intelligence and facial recognition, and automated gates.

Our analyses were conducted in full compliance with Twitter and Google’s terms and conditions and did not involve the processing of personal data. Indeed, in the case of Google, data is anonymized by the search engine itself(2). In the case of Twitter, which informs its users of the public nature of the content they decide to post(3), our algorithm discarded any personal information during the mining process.

The Google service used for this analysis is called Google Trends, and it is one of the many services made available by Google LLC. At the time of conducting our analysis, the EU-US Privacy Shield, which gave us the legal basis to conduct the analysis, had not yet been invalidated by the Court of Justice of the European Union. The purpose of using Google Trends is to measure interest in a particular topic across time in a given geographical area(4).

Methodology

Our analysis was conducted in two distinct phases over the course of three months. In April 2020, we produced preliminary results, which confirmed the soundness of our method. Similar data were analysed more extensively in July 2020. Although the frequency of social network mentions and web search activity may have increased following specific events (e.g. the usage of facial recognition by American law enforcement agencies during activist protests may have spurred Google users to learn more about the technology), the associated sentiments have been shown to remain stable over time. For a much more detailed, technical explanation, follow this link.

We analysed the frequency of searches for and mentions of the following technologies:

  • Artificial Intelligence;
  • Automated Border Control gates;
  • Drones (aerial, land, water);
  • e-passport;
  • Facial recognition;
  • Fingerprint enrolment;
  • Iris Enrolment & ID;
  • Sensors.

Google Trends

In order to extract trends from the Google Trends data, we used a powerful and fast algorithm called STL (Seasonal-Trend decomposition using Loess). Although STL is an iterative procedure, it is well suited to “long” time series and Big Data contexts where a multidimensional analysis is used. In our work, we grouped Google search queries(5) into the following four dimensions (i.e. categories):

  • All categories (AC)
  • Intelligence and Counterterrorism (IC)
  • Law Enforcement (LE)
  • Public Safety (PS)

The first variable accounts for the total number of searches performed in a given country or at the global level. Basically, this datum is indicative of the overall direction taken by a single keyword over time. The other three categories have been considered not only for their consistency with the project but also because of their relatively high probability of providing valuable information. In fact, especially when “technical” topics are under consideration, it is not rare to face situations where a category-driven investigation yields poor outcomes or none at all. This is explained by the relatively small number of searches found when examining too specific a category.

As for the time window, all the time series have been sampled at a weekly frequency for a time span of about 5.5 years, starting from the first week of January 2015. Our analysis is based on the data extracted up until July 7, 2020.
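To give a feel for what a seasonal-trend split of a weekly series involves, here is a minimal sketch in plain Python. It is not STL and not the code used in the project: it swaps in the much simpler classical additive decomposition (centered moving average plus per-position seasonal means), and the function names and the synthetic series are hypothetical. For real weekly data the period would typically be set to 52.

```python
from statistics import mean


def centered_moving_average(series, period):
    """Estimate the trend; entries are None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half: t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the two end points each count for half a weight.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend + seasonal + residual (classical method)."""
    trend = centered_moving_average(series, period)
    detrended = [x - tr if tr is not None else None
                 for x, tr in zip(series, trend)]
    # Average the detrended values at each position within the cycle.
    seasonal_means = []
    for k in range(period):
        vals = [d for i, d in enumerate(detrended)
                if i % period == k and d is not None]
        seasonal_means.append(mean(vals) if vals else 0.0)
    # Centre the seasonal component so that it averages to zero.
    offset = mean(seasonal_means)
    seasonal_means = [s - offset for s in seasonal_means]
    seasonal = [seasonal_means[i % period] for i in range(len(series))]
    resid = [x - tr - s if tr is not None else None
             for x, tr, s in zip(series, trend, seasonal)]
    return trend, seasonal, resid


# Synthetic example (made-up data): linear trend plus a period-4 pattern.
pattern = [1, -1, 2, -2]
series = [2 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, resid = decompose_additive(series, period=4)
```

STL replaces the moving average and the simple per-position means with iterated Loess smoothing, which makes it more robust to outliers and allows the seasonal shape to change over time; the decomposition into three additive components is the same idea.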

Twitter

We applied the NRC Emotion Lexicon, a popular lexicon-based resource developed by the National Research Council of Canada, for sentiment categorization. The Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). Despite cultural differences, it has been shown that a majority of affective norms are stable across languages; therefore, non-English sentences can be translated automatically and their words subsequently compared to the same list. To put it more practically, the average Twitter user needs to reach a certain level of emotional activation prior to tweeting; this internal motivation needs to be “strong enough” to follow through with the envisioned action on the social network. Given the necessary strength of the prerequisite inner state behind it, a user’s Tweet can be compared to the associations listed in the Lexicon in order to generate scores that identify the strength of the emotions and sentiments expressed through the words in that Tweet.
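The lexicon lookup described above can be sketched roughly as follows. This is an illustration only: the tiny word list is invented for the example and is not taken from the actual NRC Emotion Lexicon, and the function name and tokenization rule are assumptions rather than the project's implementation.

```python
import re
from collections import Counter

# Toy entries, invented for illustration; NOT real NRC Emotion Lexicon data.
TOY_LEXICON = {
    "safe": {"positive", "trust"},
    "fast": {"positive", "joy"},
    "scary": {"negative", "fear"},
    "delay": {"negative", "anger"},
}


def score_text(text, lexicon):
    """Count how many words in the text are associated with each label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for label in lexicon.get(token, ()):
            counts[label] += 1
    return counts


scores = score_text("The new gates feel safe and fast, not scary.", TOY_LEXICON)
```

Note that a plain word count like this ignores negation ("not scary" still adds to fear), which is one reason lexicon scores are usually read in aggregate over many Tweets rather than for a single message.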

The Twitter analysis has been based on the data extracted in April and July 2020. We collected Tweets geo-localized both near four of the ten busiest airports in Europe(6) (London Heathrow, Madrid Adolfo Suarez, Paris Charles de Gaulle, and Rome Fiumicino) and far from them.
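One way to separate Tweets posted near an airport from those posted far from it is a great-circle distance check, sketched below. The article does not state how proximity was determined, so the 10 km radius, the function names, and the rounded airport coordinates are all assumptions made for the example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate coordinates (latitude, longitude), rounded for illustration.
AIRPORTS = {
    "London Heathrow": (51.47, -0.45),
    "Madrid Adolfo Suarez": (40.49, -3.57),
    "Paris Charles de Gaulle": (49.01, 2.55),
    "Rome Fiumicino": (41.80, 12.24),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def label_tweet(lat, lon, radius_km=10.0):
    """Return the nearest airport's name if within radius_km, else 'far'."""
    name, dist = min(
        ((n, haversine_km(lat, lon, *coords)) for n, coords in AIRPORTS.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= radius_km else "far"


at_airport = label_tweet(51.47, -0.45)          # at Heathrow itself
central_london = label_tweet(51.5074, -0.1278)  # roughly 23 km away
```

The radius controls the trade-off between sample size and how confidently a Tweet can be attributed to travellers, so it would need tuning against the actual data.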

Results

While SA results do not replace questionnaire data entirely, they have provided valuable additional insights. Some of the technologies examined did not yield significant results in one or more types of searches. This in itself is significant, given that it may indicate weak interest in those technologies and little desire to comment on them.

Nevertheless, the overall view that emerges from the SA is that the public has mixed feelings towards the set. Among the eight selected, the two technologies that seem to have the best chances of being accepted easily by the general public in border crossing scenarios, based on the available data, are e-passports and artificial intelligence.

Both of these technologies might succeed mainly because of their ability to induce a positive sense of familiarity. In the case of electronic documents, the sensitive data are supposed to be collected and processed by an official entity that is very likely to be familiar to the people involved. As for artificial intelligence, the general public is likely to know that automatic methods based on it are routinely employed in many contexts (e.g. in supermarkets to study purchasing behaviours).

Older technologies seem to be more positively accepted than newer ones. Moreover, notoriously more invasive technologies, whose users and beneficiaries might be perceived to be separate entities, seem to instill a greater sense of fear than those whose users and beneficiaries are one and the same person. To better explain this final point: facial recognition, which shows oscillating interest on Google Trends and is generally negatively perceived on Twitter, shares with drones, which are similarly perceived, the characteristic that its operators and its subjects are different (i.e. Law Enforcement Officers and citizens respectively). E-passports, on the other hand, are operated by the subjects themselves.

For a more in-depth report, feel free to contact us.

Notes

1 During the year 2019, following a growing trend, approximately 3.8 billion passengers flew across the globe according to the United Nations’ International Civil Aviation Organization. As rigid infrastructures, airports require large investments to increase capacity, which is nevertheless bound by physical limits. Therefore, a logical solution for accommodating a growing number of passengers is not to expand airport structures, but rather to expedite procedures thanks to automation.
2 https://support.google.com/trends/answer/4365533?hl=it&ref_topic=6248052
3 Clause 1.2, third paragraph https://twitter.com/en/privacy. The language in the privacy policy meets the criteria set forth at page 13, Box 4 of the document at the following link: https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-data-protection_en.pdf
4 Rogers, S. What is Google Trends data — and what does it mean? Jul. 1, 2016 https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8
5 Especially in Europe, Google search queries are a good proxy for overall internet searches, since Google has a volume market share for searches of over 90%. https://gs.statcounter.com/search-engine-market-share/all/europe
6 Eurostat, Airport traffic data by reporting airport and airlines, Consulted: 20-07-2020 https://appsso.eurostat.ec.europa.eu/nui/submitViewTableAction.do