Learning outcomes

  1. Learn about the key paradigms and algorithms in machine learning.
  2. Gain an understanding of data analytics based on machine learning, using modern programming tools such as Python or R.
  3. Experience how machine learning and data analytics can be used in real-world applications.
  4. Acquire the ability to gather and synthesise information from multiple sources to aid in the systematic analysis of complex problems using machine learning tools and algorithms.

Exploratory Data Analysis

After the introduction to machine learning, one of the most useful exercises was the EDA tutorial, the focus of Unit 2. I have used EDA on many medical datasets, focusing in particular on missing values, but I had never used the Python package `missingno` before. This was very instructive, especially the heatmap method for examining nullity correlation (Fig. 1), which I will use again in future EDA.

Fig. 1: `missingno` nullity correlation heatmap.
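A minimal sketch of the workflow behind Fig. 1 (the file name `patients.csv` is a hypothetical stand-in for a medical dataset):

```python
# Minimal sketch of the missingno nullity workflow.
# 'patients.csv' is a hypothetical dataset used purely for illustration.
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

df = pd.read_csv("patients.csv")

# Bar chart of non-null counts per column: a quick view of completeness.
msno.bar(df)
plt.show()

# Nullity correlation heatmap: values near +1 mean two columns tend to be
# missing together; values near -1 mean one is present when the other is absent.
msno.heatmap(df)
plt.show()
```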

K-means

The next notable technique introduced in the module was clustering; we focused on K-means. This is a very useful and informative technique, which I had used in coursework for a previous module to develop a prototype product to segment the customers of a hypothetical bank. It was useful, however, to return to this technique, as I learned how to choose the optimal number of clusters. Going into the unit, I was aware that the number of clusters, k, is a parameter of the algorithm that must be chosen for each run. Through the workbook and seminar, I learned that methods such as silhouette analysis can be used to identify the optimal (or approximately optimal) number of clusters.

Figure: K-means silhouette plot.
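A minimal sketch of silhouette analysis for choosing k with scikit-learn (synthetic data stands in for a real feature matrix):

```python
# Sketch: silhouette analysis to choose the number of clusters, k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure, for illustration only.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette coefficient: values close to 1 indicate well-separated clusters.
    scores[k] = silhouette_score(X, labels)

plt.plot(list(scores), list(scores.values()), marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("mean silhouette score")
plt.show()

print("Best k by silhouette score:", max(scores, key=scores.get))
```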

AirBnB Team Task

The first team task to be undertaken was the AirBnB task. Here the problem setup was to define an important business question using the AirBnB listings dataset from Kaggle. Piotr and I discussed it over video chat and had a range of hypotheses; however, we both agreed that the most important business question was which factors affected the listing price the most. We moved from linear regression to XGBoost regression, a tree-based method. Whilst XGBoost has built-in feature importance, we decided instead to use permutation importance, which is model-agnostic. I had heard about SHapley Additive exPlanations (SHAP) as a useful 'explainable AI' tool and explained this method to the team. Piotr thought it was great, especially the visually engaging plots that show which variables are most influential.
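A sketch of the permutation importance step (synthetic data stands in for the prepared AirBnB features and listing prices; this is not our exact code):

```python
# Sketch: XGBoost regression plus model-agnostic permutation importance.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic stand-in for the AirBnB listings features and prices.
X, y = make_regression(n_samples=1000, n_features=6, noise=10, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = XGBRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance: shuffle one column at a time and measure how much
# the held-out score degrades; works for any fitted model.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```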

The key reflections from the AirBnB task were around remote collaboration on the project. We wanted to use a shared Jupyter notebook but were not yet using Google Colab to share the code (we used this later in the CIFAR task). In terms of collaboration, we met regularly and were able to keep to our deadlines, which meant that getting the document prepared on time was straightforward.

Figure: SHAP analysis.
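Continuing from the permutation importance sketch above (`model` and `X_test` are assumed from that snippet), a SHAP summary plot can be produced like this:

```python
# Sketch: SHAP values for the fitted tree-based model from the previous snippet.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: each point is one observation; position shows how much the
# feature pushed the prediction up or down, colour encodes the feature value.
shap.summary_plot(shap_values, X_test)
```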

Learning about neural networks

As the module progressed further, we moved towards neural networks. I was particularly looking forward to this phase of the module, as this was new territory for me. I was keen to use the workbooks and seminars to get a grounding in how a neural network, starting with a simple feedforward model, functions and is constructed, as I knew that we would have to build and test a neural network in the team task. Using the workbooks, I was able to examine the basic unit of the neural network, the perceptron, which functions much like a logistic regression: multiple inputs, a coefficient for each one, and a sigmoid function. There are differences too, however, in that the neurons in a more complex neural network can have differing activation functions, such as rectified linear (ReLU) or linear.
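A minimal sketch of a single neuron along these lines (illustrative weights; not taken from the workbooks):

```python
# Sketch: one neuron = weighted sum of inputs plus a bias, passed through
# an activation function, much like logistic regression when using a sigmoid.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=sigmoid):
    # One coefficient (weight) per input, plus a bias term.
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # illustrative inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input
b = 0.2

print(neuron(x, w, b))           # sigmoid output in (0, 1)
print(neuron(x, w, b, relu))     # the same neuron with a ReLU activation
```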

Building an image recognition neural network (CIFAR-10)

In the CIFAR neural network project, we met initially to determine a strategy. This was a difficult task, as neither of us had experience in building a neural network. We decided to use the same Python package, Keras, so that we could combine our code. We split the task so that I would perform the initial experiment: trying a plain ANN structure on the image data, and then building in a single convolutional layer, which clearly showed the advantage of a convolutional architecture for image data. I then added layers such as pooling, and finally plotted the activation maps for the convolutional layers. This was a very interesting process, and we really got to grips with how a CNN works 'under the hood'.
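A minimal sketch of the kind of single-convolutional-layer Keras model described above (layer sizes and epochs are illustrative, not our exact configuration):

```python
# Sketch: a small CNN on CIFAR-10 with one convolutional layer plus pooling.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```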

Figure: CNN activation maps.
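A sketch of how activation maps can be extracted (continuing from the model above; the sub-model approach is one common way to do this in Keras):

```python
# Sketch: visualise the first convolutional layer's activation maps by
# building a sub-model that outputs that layer's activations.
import matplotlib.pyplot as plt
from tensorflow import keras

conv_layer = model.layers[0]  # the Conv2D layer from the model above
activation_model = keras.Model(inputs=model.inputs, outputs=conv_layer.output)

# Activations for one test image: shape (1, 30, 30, 32) for a 3x3 convolution.
activations = activation_model(x_test[:1]).numpy()

# Plot the first 8 of the 32 feature maps.
fig, axes = plt.subplots(1, 8, figsize=(16, 2))
for i, ax in enumerate(axes):
    ax.imshow(activations[0, :, :, i], cmap="viridis")
    ax.axis("off")
plt.show()
```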

Cause for concern?

Another cause for concern is deepfakes. In the reading list of Unit 12, an independent government report for the Centre for Data Ethics and Innovation covered the risks of deepfakes ('Snapshot Paper - Deepfakes and Audiovisual Disinformation', N.D.). In particular, deepfakes can pose a threat to society (e.g. political disinformation) and to the individual (e.g. identity theft). Strategies to combat deepfakes, including tools to detect them and raising public awareness, may help.

Reference: 'Snapshot Paper - Deepfakes and Audiovisual Disinformation', Centre for Data Ethics and Innovation (N.D.).