By Sameer Sharma

On the face of it, the Consultation Paper of the Information and Communication Technologies Authority (ICTA) regarding the amendments to the ICT Act for regulating the use and addressing the abuse and misuse of social media in Mauritius seems to be attempting to fight a noble cause. A closer reading of the proposal opens up a can of worms when it comes to both intended and unintended consequences. As a liberal democrat who holds individual freedoms dear, I cannot support such ideas, but this is not the purpose of this article.

Let us assume that the law passes, and let us even assume that the National Digital Ethics Committee is independent, well meaning and a staunch defender of democratic values. Even then, the devil will end up being in the details of implementation. The required investment, the continuous monitoring and the potential for unintended consequences, even in such a scenario, should not be underestimated. Laws always sound easy to implement and enforce on paper, but practice is another story. The objective of this article, then, is to explain how and where, from a data monitoring perspective, the risks could come from.

If a data scientist were to be given the task to develop a social media monitoring ecosystem tomorrow, the first step would be to design a highly scalable data architecture. Natural language processing is a highly complex field in machine learning because interpreting any language is no easy task. Language is about vocabulary, culture and context. Different people even within the same country can speak it in different ways and even type it differently. In the case of Mauritius, Facebook messages may be in French, in Creole, in English or in a combination of all three. Different ethnic groups may also bring some more flavour to the way they express themselves online.

Labelling

Messages are considered to be unstructured data, and before any rules-based or Artificial Intelligence (AI) model can be used to filter what is harmful and what is not, the data must be cleaned. From stop word removal to lemmatization and stemming, cleaning the data will not be easy in the case of Mauritian languages given their heterogeneity. Most Mauritians, for example, do not even type Creole in the same way. Context in language also matters, and in order to better capture this, more sophisticated AI models will need to be used. But before getting there, data cleaning and the conversion of unstructured data to model-readable structured data will be a mammoth task.
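To give a feel for what even the simplest cleaning step involves, here is a minimal sketch in Python. The stop word list and the spelling map are hypothetical and purely illustrative; real ones would need to be curated by linguists, and lemmatization or stemming for Creole is not attempted here at all.

```python
import re
import unicodedata

# Hypothetical, tiny stop word list mixing English, French and Creole.
# A real list would be far longer and built with linguistic experts.
STOP_WORDS = {"the", "a", "is", "le", "la", "les", "de", "ki", "pe", "enn"}

# Hypothetical spelling map: because people type Creole in different ways,
# variant spellings are folded onto one chosen form.
SPELLING_VARIANTS = {"zordy": "zordi", "ojordi": "zordi"}


def strip_accents(text: str) -> str:
    """Remove diacritics so that 'alé' and 'ale' become the same token."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_message(message: str) -> list[str]:
    """Lowercase, strip URLs and accents, tokenize, normalize, drop stop words."""
    text = strip_accents(message.lower())
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"(.)\1{2,}", r"\1", text)    # "bonnnn" -> "bon"
    tokens = re.findall(r"[a-z]+", text)        # keep alphabetic tokens only
    tokens = [SPELLING_VARIANTS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]
```

Every choice in this sketch (which accents to strip, which spellings to merge, which words carry no meaning) is a judgment call that differs across the three languages, which is precisely where the difficulty lies.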

In the United States, data scientists team up with linguistic experts in order to refine the data cleaning and feature engineering process. Given the specificities of the local context and our human capital limitations (do we have many AI experts and linguists who are used to social media Mauritian style?), transforming the data into relevant model inputs will be very difficult. Models need numeric inputs. Typically for such use cases, words will be tokenized and will need to go through a word embedding layer which, in the case of Mauritius, will likely need to be trained along with the harmful versus not harmful classifier model. Beyond acting as an easy lookup table and being more efficient than one-hot encoding, which leads to highly sparse matrices, embeddings trained along with the model itself will capture a certain degree of closeness between words.
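The difference between one-hot encoding and an embedding lookup can be illustrated with a small sketch. The function names are hypothetical, and the embedding table here is only randomly initialized: in a real system its values would be learned during training of the classifier, which is what gives related words similar vectors.

```python
import random


def build_vocab(tokenized_messages):
    """Map each distinct token to an integer index; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tokens in tokenized_messages:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(index, vocab_size):
    """Sparse representation: one vector entry per word in the vocabulary."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def init_embeddings(vocab_size, dim, seed=0):
    """Dense lookup table with one short vector per word (untrained here)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(tokens, vocab, table):
    """Replace each token by its row in the table; unseen tokens map to row 0."""
    return [table[vocab.get(t, 0)] for t in tokens]
```

With a vocabulary of, say, 50,000 spellings, each one-hot vector has 50,000 entries of which only one is non-zero, whereas an embedding row might have only a few hundred entries.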


On the data architecture side, unstructured data consumes a lot of storage. Over time, only a cloud-based architecture will work. At the same time, a hybrid cloud architecture will be needed in order to properly manage data privacy concerns. Essentially, on-premises data lakes will require data to be masked with keys before it is moved to the cloud. The most scalable architecture on Azure will likely revolve around Spark. Note that beyond the costs involved in setting up a hybrid cloud ecosystem that is scalable (no 30 day storage capacity like Safe City!), information security risk will need to be optimally managed, and this entails human capital costs and investments.

If we think about it, the dependent variable here will be what is harmful and what is not harmful (it is possible to have more than a binary classification but it does not change the logic). Before any model is even used, the labelling process will require a team of data scientists to label thousands of messages and also to make use of what is known as semi-supervised learning in order to obtain a representative and large enough training sample. While such classification problems are typically purely supervised (we need to know and label the Y variable in the training process), semi-supervised learning has proven to be a more scalable and cheaper approach to labelling hundreds of thousands of training and validation data points.

With Creole, French, English and combinations to label and to segregate, this process can be prone to error in the wrong hands. When it comes to languages, especially in such use cases, human biases in the labelling process need to be well managed in order to avoid building a biased model. Typically, there should be three layers of independent labellers for the same sentence when building the training and validation sample sets, on top of clearly defined labels. The National Digital Ethics Committee will need to be very precise in terms of how to define these labels, which goes beyond legal jargon. The labelling process, even when using semi-supervised learning, can be expensive and is quite time consuming. Compared with purely supervised learning, semi-supervised learning will not be as accurate but will be faster and cheaper.
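As a rough illustration of how three independent labellers could be reconciled, consider the sketch below. The majority rule, the function names and the label strings are assumptions made for the example, not a prescribed procedure.

```python
from collections import Counter


def resolve_label(votes, min_agreement=2):
    """Return the label chosen by at least `min_agreement` labellers.

    Returns None when no label reaches that level of agreement, in which
    case the message should go to an adjudicator rather than into the
    training set.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


def agreement_rate(all_votes):
    """Share of messages on which every labeller gave the same label."""
    unanimous = sum(1 for votes in all_votes if len(set(votes)) == 1)
    return unanimous / len(all_votes)
```

A low agreement rate is itself a warning sign: it suggests that the label definitions are not precise enough for different people to apply them in the same way.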

Modelling

On the modelling side, it should be understood that even the best bi-directional Long Short-Term Memory (LSTM) neural network models, or even GRUs, will need to be combined with more traditional rules-based filters in order to optimize performance. Condition-based rules come with their own set of problems in how they are defined, and further rules will be needed to filter out fake accounts and past offenders. We are likely to need to build an ensemble of models per language cluster.
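How rules and models might sit side by side can be sketched as follows. This is only one possible arrangement under stated assumptions: the models are stand-in functions returning a score between 0 and 1, the scores are simply averaged, and the blocked terms, the offender list and the 0.5 threshold are all hypothetical.

```python
from statistics import mean


def rule_flags(tokens, author, blocked_terms, past_offenders):
    """Condition-based check: a blocked term or a known past offender."""
    return bool(set(tokens) & blocked_terms) or author in past_offenders


def ensemble_score(tokens, models):
    """Average the harmfulness scores of the models for one language cluster."""
    return mean(model(tokens) for model in models)


def classify(tokens, author, models, blocked_terms, past_offenders,
             threshold=0.5):
    """Apply the rules first, then fall back on the model ensemble."""
    if rule_flags(tokens, author, blocked_terms, past_offenders):
        return "flagged_by_rule"
    score = ensemble_score(tokens, models)
    return "harmful" if score >= threshold else "not_harmful"
```

Even in this toy version, the outcome depends on who defines the blocked terms and where the threshold is set, which are policy decisions as much as technical ones.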

Even when relying on highly scalable data architectures and ensembling, combining AI models with traditional rules-based engines, and assuming decent data quality and unbiased labelling of messages, models will produce predictions with a certain degree of false negatives (genuinely harmful messages that are missed) and false positives (which may have unwanted consequences for the author of the messages). Because the way we express ourselves can be linked to ethnicity, age and even gender, it will be important to test the entire ecosystem (rules based plus ensemble of models) for disparate outcomes, be it by comparing the confusion matrices per sensitive class (gender, age, race) and/or by using Fairness GANs (adversarial networks).


We do not want, for example, to see higher false positive rates on messages sent by certain communities versus others. Models need to be corrected for this. Language is full of biases, especially on social media. The unintended consequences of blocking and filtering out messages that were intrinsically not harmful, because models made mistakes even after being optimized, can be quite severe.
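Comparing confusion matrices per sensitive class comes down to a calculation like the one sketched here. The record layout (group, actual label, predicted label) and the function names are assumptions for illustration; 1 stands for harmful and 0 for not harmful.

```python
from collections import defaultdict


def confusion_by_group(records):
    """Build one confusion table per group.

    records: iterable of (group, actual, predicted) tuples.
    """
    tables = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, actual, predicted in records:
        if predicted == 1:
            key = "tp" if actual == 1 else "fp"
        else:
            key = "fn" if actual == 1 else "tn"
        tables[group][key] += 1
    return dict(tables)


def false_positive_rate(table):
    """Share of genuinely harmless messages that were wrongly flagged."""
    negatives = table["fp"] + table["tn"]
    return table["fp"] / negatives if negatives else 0.0


def fpr_gap(records):
    """Per-group false positive rates and the gap between best and worst."""
    rates = {group: false_positive_rate(table)
             for group, table in confusion_by_group(records).items()}
    return rates, max(rates.values()) - min(rates.values())
```

A large gap means that one group's harmless messages are being blocked more often than another's, which is exactly the disparate outcome that needs to be tested for and corrected.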

Models will also need to be continuously monitored for performance and re-trained over time. Monitoring will involve continued bias testing and continuous labelling on a random sample of production data in order to calculate performance (you need to compare actual Y to predicted Y!) and to get new training and validation data for continued model re-trainings. We are talking about a massive undertaking which can be costly, and we are talking about very advanced AI here in order to properly filter messages. Those who think otherwise have likely not worked in NLP before.
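The monitoring loop described above can be sketched in a few lines. The sampling helper, the metric choice (precision and recall) and the retraining thresholds are hypothetical; an actual programme would set them as a matter of policy.

```python
import random


def sample_for_labelling(production_ids, k, seed=None):
    """Draw a random sample of production messages to be labelled by humans."""
    ids = list(production_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def precision_recall(actual, predicted):
    """Compare human labels (actual Y) to model output (predicted Y).

    1 = harmful, 0 = not harmful.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def needs_retraining(precision, recall, min_precision=0.9, min_recall=0.8):
    """Flag the model when either metric drops below its (assumed) floor."""
    return precision < min_precision or recall < min_recall
```

The expensive part is not this arithmetic but the step in the middle: every sampled message still has to be read and labelled by people, again and again, for as long as the system runs.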

Mauritius is not yet known to have any AI research centre of excellence or to have made much progress in this field. The Mauritian context and the way we write will require significant customization because the way we speak and the way other countries speak are different. Those who think that they can just approach a vendor who will supply a moderately customized model will end up with an even more biased model. Vendors also look after their own interests, and their models are black boxes. Models like these need to be developed domestically and be well understood.

While significant progress has been made in the AI explainability field (how did the black box neural network come up with its decision?), especially with Shapley values and integrated gradient approaches, producing explanations that are human interpretable is no easy task. We are talking about explaining highly complex neural networks (because simpler models may not work as well to capture nuance and context) with non-linear activation functions across each hidden layer.

When we did not even think to store more than 30 days of data for the Safe City project and when we do not yet have a pool of AI experts in the country, we should perhaps think twice about such consultation papers and any derived laws. To intercept the message is one thing but to fairly filter out “harmful” content is much easier said than done. Lawyers in Mauritius are good at writing laws but this is a complex data science, data privacy and information security related task. The devil is in the details of implementation.

Sameer Sharma is a chartered alternative investment analyst and a certified financial risk manager.