<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Warwick Data Science Society on Medium]]></title>
        <description><![CDATA[Stories by Warwick Data Science Society on Medium]]></description>
        <link>https://medium.com/@WDSS?source=rss-9d4df5140985------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*J27EIc4uwPrtSoLTf83TVg.jpeg</url>
            <title>Stories by Warwick Data Science Society on Medium</title>
            <link>https://medium.com/@WDSS?source=rss-9d4df5140985------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 05:18:00 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@WDSS/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Stranger Weather Ahead: Detecting Anomalies in Temporal Weather Data]]></title>
            <link>https://medium.com/@WDSS/stranger-weather-ahead-detecting-anomalies-in-temporal-weather-data-9630eae33ecf?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/9630eae33ecf</guid>
            <category><![CDATA[weather-forecasts]]></category>
            <category><![CDATA[anomaly-detection]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Tue, 21 Nov 2023 14:45:20 GMT</pubDate>
            <atom:updated>2023-11-21T14:45:20.458Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/480/0*idD2MGBNU0vHoK8d" /><figcaption>Hurricane GIF, Giphy</figcaption></figure><h3>Introduction</h3><p>Science fiction writer and biochemistry professor Isaac Asimov once said that “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘<em>That’s funny</em>…’”. How can machine learning aid this process? In the spirit of discovery, our team investigated methods to detect anomalies in weather data.</p><p>Detecting anomalies in weather data has many different practical applications. Nowcasting, for example, aims to predict the weather in the next 0 to 6 hours, which is crucial for crisis management and avoiding disasters [1]. Similarly, anomaly detection helps to predict extreme weather events and how they behave over time. For example, researchers have used anomaly detection to track Cyclone Baaz in the North Indian Ocean [2].</p><p>But it’s not just about keeping track of what’s happening now. Anomaly detection also helps researchers predict disasters, such as floods and forest fires, before they happen. That way, firefighters and emergency responders have more time to prepare and protect people and property [3]. Additionally, in smart cities, weather can disrupt transportation systems. Therefore, we need a reliable way to deal with extreme weather. Anomaly detection helps us classify the weather conditions, making it easier to keep transport systems running smoothly [4]. 
So, whether it’s for immediate crisis response, long-term disaster prevention, or just making our cities work well, anomaly detection in weather data is a secret weapon against unpredictable weather.</p><h4>Our Code</h4><p>You can run our code on this Google Colaboratory notebook:</p><p><a href="https://colab.research.google.com/drive/1zE6PCvyXEggSqQsJpUCIWv__DnXyu93v?usp=sharing">Google Colaboratory</a></p><h3>The Dataset</h3><p>We chose the Jena Climate (jena_climate_2009_2016) dataset produced by the Max Planck Institute for Biogeochemistry in Germany. The temporal (timestamped) dataset has 15 features, including time, air temperature, atmospheric pressure, and humidity. These were collected every 10 minutes, beginning in 2003. We used only the data collected between 2009 and 2016. Researchers collected the data over a long time and with high granularity, which made the Max Planck dataset well-suited to our project.</p><p>The first stage in the project was to process the dataset so that we could apply different anomaly detection methods effectively. We inspected and cleaned the dataset to remove errors and inconsistencies, ensuring the accuracy and reliability of our models. Next, we visualised the data as shown in Figure 1. As expected, there was a clear yearly trend for many weather features. Therefore, we decided to focus on identifying anomalous years of weather.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/560/0*TrxVLdXOHPzot43f" /><figcaption>Figure 1. Visualisation of some features of the dataset.</figcaption></figure><p>We used different methods for exploratory data analysis to gain a better understanding of the dataset’s variables and the relationships between them. These included heat maps to identify correlated features and scatter plots to analyse the relationships between features.
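As a minimal sketch of this correlation step (using a toy pandas DataFrame with the dataset’s column names; the values below are made up for illustration, not the real Jena data), the matrix behind a heat map can be computed with pandas’ corr():

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned weather DataFrame (hypothetical values)
rng = np.random.default_rng(0)
t = rng.normal(10, 8, 500)  # temperature
df = pd.DataFrame({
    "T (degC)": t,
    "sh (g/kg)": 4 + 0.3 * t + rng.normal(0, 1, 500),  # correlated with T
    "wv (m/s)": rng.normal(2, 1, 500),                 # roughly independent
})

# Pairwise Pearson correlations -- the values a heat map visualises
corr = df.corr()
print(corr.round(2))
```

A plotting library such as seaborn can then render corr as a heat map.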
Figure 2 shows that there was significant correlation between temperature and specific humidity, and between temperature and dew point, which could be analysed further.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/903/0*pFO4Ol6iMwxCYsAi" /><figcaption>Figure 2. Heat-map to show correlated features.</figcaption></figure><h3>Anomaly Detection Methods</h3><p>Throughout the project, we explored four different methods for anomaly detection: the k-NN algorithm, Z-scores, cluster-based local outlier factor (CBLOF) and autoencoders. These methods allowed us to approach anomaly detection from different angles, considering statistical measures, analysing relationships between features and training neural networks.</p><h4>k-NN</h4><p>The k-Nearest Neighbours (k-NN) algorithm is a supervised machine learning algorithm that is widely used to solve classification problems; here, we used the distances to each point’s nearest neighbours in an unsupervised way to score anomalies. The k-NN algorithm assumes that similar things exist in close proximity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iK7fMaLPFHsulLsC" /><figcaption>The k-NN algorithm</figcaption></figure><p>We applied this algorithm to the dataset as follows:</p><ul><li>After preprocessing, we normalised the temperature values to ensure uniform scaling.</li><li>We applied the k-NN algorithm to identify anomalies in the temperature data.</li><li>We selected the k value that maximised the accuracy of the model. A common choice is k = 5.
We therefore set k = 5 in our example and proceeded.</li><li>We calculated z-scores for the distances between data points and their neighbours, allowing us to standardise the distances.</li></ul><pre>from sklearn.preprocessing import StandardScaler<br>from sklearn.neighbors import NearestNeighbors<br><br># Normalize temperature values<br>scaler = StandardScaler()<br>train_df[&#39;T (degC)&#39;] = scaler.fit_transform(train_df[[&#39;T (degC)&#39;]])<br><br># Choose an appropriate value of k (number of neighbors)<br>k = 5<br><br># Train KNN model<br>knn = NearestNeighbors(n_neighbors=k)<br>knn.fit(train_df[[&#39;T (degC)&#39;]])<br><br># Detect anomalies<br>distances, _ = knn.kneighbors(train_df[[&#39;T (degC)&#39;]])<br># Calculate z-scores for distances<br>z_scores = ((distances - distances.mean()) / distances.std())<br><br># Define a threshold for anomaly detection<br>anomaly_threshold = 2.0  # Adjust as needed<br><br># Flag rows where any neighbour distance has an anomalous z-score<br>anomalies = train_df[(z_scores &gt; anomaly_threshold).any(axis=1)]</pre><p>The red data points in Figure 3 represent anomalies, indicating unusual temperature variations or extreme weather conditions in 2009, for the Temperature, Wind Speed and Humidity features.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uNNyPOkp_dN1PSSfU6BApA.png" /><figcaption>Figure 3. Scatter plots of the temperature, wind speed and humidity (left to right) with anomalies highlighted in red.</figcaption></figure><p>The application of the k-NN algorithm to detect weather anomalies in temperature data provides a starting point for further investigation and analysis. The k-NN algorithm’s performance degrades in high-dimensional spaces, so it is best applied to different weather features individually. Additionally, selecting the optimal k value can be challenging and requires some domain knowledge.</p><h4>Z-Score</h4><p>The z-score (also known as the standard score) is the number of standard deviations by which a value lies above or below the mean.
We computed the z-scores via the following formula,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/66/0*ch189H-dt5lV_xZS" /></figure><p>where x is the raw data, mu is the population mean, and sigma is the population standard deviation.</p><p>For a normal distribution, the Empirical Rule states that we can detect outliers via the following observations:</p><p>a) 68.3% of the data points lie within 1 standard deviation of the mean.</p><p>b) 95.4% of the data points lie within 2 standard deviations.</p><p>c) 99.7% of the data points lie within 3 standard deviations.</p><p>We assumed that the temperature data was approximately normal, even though for some years we observed slightly skewed data, and plotted the maximum temperature z-score obtained in each year.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/567/0*BjfQdKcmyCwyDxSp" /><figcaption>Figure 4. Bar plot of the largest temperature z-score values by year</figcaption></figure><p>From Figure 4, we can see that 2015 has the largest maximum z-score value amongst all years. One potential explanation is that the distribution of temperature in 2015 might not be close to a normal distribution. We can check this by plotting a histogram of the temperature variable in 2015.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5t3bbNC8GUNLOEHZ" /><figcaption>Figure 5. Histogram for temperature in 2015</figcaption></figure><p>Figure 5 shows a strongly positively skewed histogram, which suggests that the distribution is far from normal.</p><h4>Cluster-Based Local Outlier Factor</h4><p>Cluster-Based Local Outlier Factor (CBLOF) classifies points as anomalies based on their proximity to clusters in the data. Two variables with a distinct correlation are read from the dataset. Data is then extracted per year to analyse these weather-condition relationships.
An anomaly score is assigned to each point based on its proximity to the centre of its cluster. Using a set threshold value, data points are classified as inliers or outliers based on the anomaly score. The variable relationships used were:</p><p>1. Temperature <em>( °C) </em>and Specific Humidity <em>(g/kg)</em></p><p>2. Temperature <em>( °C)</em> and Wind Speed <em>(m/s)</em></p><p>3. Temperature<em> ( °C) </em>and Dew Point <em>( °C)</em></p><p>Relationships 1 and 2 can indicate the presence of many extreme weather events, namely hurricanes and tornadoes [5]. Relationship 3 relates to cold-weather phenomena such as blizzards [6]. Monitoring the progress of these related variables can point to the formation of some of the events listed.</p><p>The threshold value is set at 1% and is constant throughout the investigated years, with the shape determined by the whole set of data. A threshold cluster is created, and the data is plotted. CBLOF creates smaller clusters which can be used to classify outliers. However, in this instance, they are ignored.</p><p>From the data set, points are separated into years and plotted. Upon comparison, anomalous years are determined from the shape of the plot and the number of outliers relative to the other years.</p><pre>import pandas as pd<br>from sklearn.preprocessing import MinMaxScaler<br><br>data = pd.read_csv(csv_path)<br>data[&#39;Date Time&#39;] = pd.to_datetime(data[&#39;Date Time&#39;])<br>data.set_index(&#39;Date Time&#39;).head(2)<br>data1 = data[data[&#39;Date Time&#39;].between(&#39;01.01.2013&#39;, &#39;31.12.2013&#39;)]<br>minmax = MinMaxScaler(feature_range=(0, 1))</pre><p>Because CBLOF here uses two variables, outliers are plotted in the plane of one variable against the other.
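The clustering and scoring step itself is not shown in the snippets. A minimal sketch of the idea (a simplification, not the full CBLOF algorithm: scikit-learn’s KMeans with distance to the nearest cluster centre as the anomaly score and a 1% contamination threshold; real CBLOF additionally distinguishes small and large clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy two-variable data standing in for a scaled (temperature, humidity) pair
rng = np.random.default_rng(42)
X = rng.normal(0.5, 0.1, size=(1000, 2))

# Cluster the data; distance to the assigned centre acts as the anomaly score
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
scores = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the top 1% of scores as outliers (the fixed threshold from the text)
threshold = np.quantile(scores, 0.99)
outliers = scores > threshold
```

Plotting X coloured by outliers then gives a picture comparable to the contour plots in Table 1.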
The threshold region determining these outliers is kept constant by fitting the scaler to the entire dataset rather than to individual years.</p><pre>data[[&#39;T (degC)&#39;,&#39;sh (g/kg)&#39;]] = minmax.fit_transform(data[[&#39;T (degC)&#39;,&#39;sh (g/kg)&#39;]])<br>data[[&#39;T (degC)&#39;,&#39;sh (g/kg)&#39;]].head()<br><br>X1 = data1[&#39;T (degC)&#39;].values.reshape(-1, 1)<br>X2 = data1[&#39;sh (g/kg)&#39;].values.reshape(-1, 1)<br>X = np.concatenate((X1, X2), axis=1)</pre><p>In the plots in Table 1:</p><p>1. Purple contours represent the anomaly score increasing beyond the threshold.</p><p>2. Green contours represent the points at which the anomaly score is equal to the threshold value.</p><p>3. Pink contours show the region in which the anomaly score is less than the threshold.</p><pre>plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Purples_r)<br>plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors=&#39;green&#39;)<br>plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors=&#39;pink&#39;)</pre><p>The most anomalous year for the data set was determined to be 2015, as shown in Table 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Uqx0uHanBgCepHPUmuQGRg.png" /><figcaption>Table 1. CBLOF Clusters for weather data feature pairs in 2015.</figcaption></figure><p>Overall, CBLOF provides an efficient method to observe weather trends and relationships but has limited use when observing individual weather conditions. Observing these trends as a function of weather variables provided a useful insight into anomalous weather modelling, which would have been missed using other detection techniques. However, this benefit is limited to conditions with clear correlations. Almost no observations can be made when choosing two variables that are not closely associated.</p><h4>Autoencoders</h4><p>In an autoencoder, there are two core components: the encoder and the decoder.
The encoder reduces high-dimensional data to a lower-dimensional representation, and the decoder attempts to reconstruct it. The autoencoder is used for anomaly detection by evaluating how accurately the data is reconstructed. Three main stages are involved: data pre-processing, building the autoencoder (comprising encoder and decoder models), and detecting anomalies (measuring reconstruction loss to identify anomalies).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/431/0*QcVX3FqBq8FcvGPU" /><figcaption>Figure 6. Diagram of autoencoder architecture [9].</figcaption></figure><p>The other anomaly detection methods have highlighted temperature as a useful feature for detecting anomalous weather patterns. Therefore, we focused on detecting anomalies in temperature patterns. First, the temperature data required some further preprocessing. The Z-score anomaly detection method highlighted that the temperature data for the years 2014 and 2015 had a score greater than 3 and so could be considered anomalous. Therefore, days from these years were considered anomalous and labelled “1”, whereas days from all other years were considered non-anomalous and labelled “0”. Figure 7 shows sampled anomalous and non-anomalous days.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SmEHXN6JUWa5wbBIQXHVnQ.png" /><figcaption>Figure 7.
Sample anomalous and non-anomalous weather-days.</figcaption></figure><p>Next, we built and trained the autoencoder using the Keras API, and plotted the training and validation loss (see Figure 8).</p><pre>from tensorflow import keras<br><br># Define the dimensions<br>input_dim = X_train_label_0.shape[1] # input data dimension<br>encoding_dim = 8 # dimension of the encoded data<br><br># Encoder<br>input_layer = keras.Input(shape=(input_dim,))<br>encoder = keras.layers.Dense(encoding_dim, activation=&#39;relu&#39;)(input_layer)<br><br># Decoder<br>decoder = keras.layers.Dense(input_dim, activation=&#39;sigmoid&#39;)(encoder)<br><br># Create the autoencoder model<br>autoencoder = keras.Model(inputs=input_layer, outputs=decoder)<br>autoencoder.compile(optimizer=&#39;adam&#39;, loss=&#39;mean_squared_error&#39;)<br><br># Train the autoencoder<br>history = autoencoder.fit(X_train_label_0, X_train_label_0, epochs=100, batch_size=64, shuffle=True, validation_data=(X_test, X_test))</pre><pre>import matplotlib.pyplot as plt<br><br># Plot the training curves<br>plt.plot(history.history[&quot;loss&quot;], label=&quot;Training Loss&quot;)<br>plt.plot(history.history[&quot;val_loss&quot;], label=&quot;Validation Loss&quot;)<br>plt.legend()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/0*z3rwtl09qG3jY9ve" /><figcaption>Figure 8. Training loss and validation loss for the autoencoder model.</figcaption></figure><p>Lastly, anomalies were detected by calculating the reconstruction loss (the difference between the original data and the data after it has passed through the autoencoder).
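As a small numerical sketch of this reconstruction-loss check (made-up arrays stand in for the real days and their autoencoder outputs; not the project’s actual code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: 200 normal days (144 ten-minute readings each)
# and their near-perfect autoencoder reconstructions
normal = rng.normal(0, 1, size=(200, 144))
normal_recon = normal + rng.normal(0, 0.1, size=normal.shape)

# Per-day reconstruction error: mean absolute error between input and output
errors = np.abs(normal - normal_recon).mean(axis=1)

# Threshold: mean error on normal days plus one standard deviation
threshold = errors.mean() + errors.std()

# A badly reconstructed day exceeds the threshold and is flagged anomalous
bad_day = rng.normal(0, 1, 144)
bad_recon = bad_day + rng.normal(0, 1.0, 144)  # much larger reconstruction error
bad_error = np.abs(bad_day - bad_recon).mean()
print(bad_error > threshold)  # True
```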
If the loss is greater than a set threshold, then a day’s weather is detected as an anomaly.</p><p>The threshold was set by:</p><ul><li>Calculating the mean absolute error for normal weather-days before and after they have passed through the autoencoder.</li><li>Classifying any weather-days as anomalous if the reconstruction error is higher than one standard deviation from the normal weather-days.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y-1P2CkEk9annhoXVwB3gA.png" /><figcaption>Figure 9. Histograms of the reconstruction error on anomalous and non-anomalous weather-days.</figcaption></figure><p>Overall, this method was not very accurate, with an accuracy score of only 47%. This could be improved by building a multi-modal model using all the relevant features for each timestep and using a long short-term memory (LSTM) neural network to process the temporal data. Also, we could investigate different strategies for setting the threshold above which weather-days are classified as anomalous. Finally, autoencoders perform better at anomaly detection when the anomalies are more common. Therefore, they may not be suitable for detecting rare extreme weather events for applications such as nowcasting.</p><h3>Conclusion</h3><p>Our project’s aim was to apply unsupervised learning methods and statistical analysis to detect anomalies in temporal weather data. Through a variety of machine learning models, we succeeded in detecting anomalies in weather data. Our dataset used data from one location over time rather than focusing on an extreme weather event, as would be the case for nowcasting applications. It would be interesting to see how these methods perform in different contexts.</p><p>Overall, anomalies offer valuable insights into unusual weather events, which can be of interest to meteorologists, climate scientists, and decision-makers responsible for weather forecasting and climate monitoring.
However, the interpretation of anomalies should be done in conjunction with domain expertise and additional data sources to ensure accurate and meaningful insights.</p><h4>Future Work</h4><p>Currently, weather forecasting is a popular research area due to its importance. An extension to our research would be to incorporate radar observations and use self-organising maps (SOMs), an unsupervised learning method, to detect patterns in how the radar products’ values transition from one time moment to another in both normal and severe weather conditions [7]. For detecting more extreme weather events, a supplementary classification model based on Moran’s I index could ensure that detected anomalies are not due to either a sensor failure or a security attack [8].</p><h3>References</h3><ol><li>Met Office:<a href="https://www.metoffice.gov.uk/binaries/content/assets/metofficegovuk/pdf/data/nowcasting-datasheet_2019.pdf"> https://www.metoffice.gov.uk/binaries/content/assets/metofficegovuk/pdf/data/nowcasting-datasheet_2019.pdf</a></li><li>C. Piruthevi and C. S. K. Selvi, “Filtering of anomalous weather events and tracing their behavior,” 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 2017, pp. 1–5, doi: 10.1109/ICIIECS.2017.8275913.</li><li>M. R. Nosouhi, K. Sood, N. Kumar, T. Wevill and C. Thapa, “Bushfire Risk Detection Using Internet of Things: An Application Scenario,” in IEEE Internet of Things Journal, vol. 9, no. 7, pp. 5266–5274, April 2022, doi: 10.1109/JIOT.2021.3110256.</li><li>J. C. Villarreal Guerra, Z. Khanam, S. Ehsan, R. Stolkin and K. McDonald-Maier, “Weather Classification: A new multi-class dataset, data augmentation approach and comprehensive evaluations of Convolutional Neural Networks,” 2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Edinburgh, UK, 2018, pp.
305–310, doi: 10.1109/AHS.2018.8541482.</li><li>P. B. Rutkevich and P. P. Rutkevych, “Tornado-type stationary vortex with nonlinear term due to moisture transport,” Advances in Science and Research, vol. 4, 2010, pp. 77–82, doi: 10.5194/asr-4-77-2010.</li><li>Met Office: <a href="https://www.metoffice.gov.uk/weather/learn-about/weather/types-of-weather/snow/blizzard">https://www.metoffice.gov.uk/weather/learn-about/weather/types-of-weather/snow/blizzard</a></li><li>A. Mihai, G. Czibula and E. Mihuleţ, “Analyzing Meteorological Data Using Unsupervised Learning Techniques,” 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 2019, pp. 529–536, doi: 10.1109/ICCP48234.2019.8959777.</li><li>C. Piruthevi and C. S. K. Selvi, “Filtering of anomalous weather events and tracing their behavior,” 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 2017, pp. 1–5, doi: 10.1109/ICIIECS.2017.8275913.</li><li>TensorFlow tutorial: Intro to Autoencoders (<a href="https://www.tensorflow.org/tutorials/generative/autoencoder">https://www.tensorflow.org/tutorials/generative/autoencoder</a>)</li></ol><p><em>This story was originally written by Vivek Chander, Dogukan Turkoz, Mariam Zenaishvili and </em><a href="https://www.linkedin.com/in/evyanneewusie/"><em>Evyanne Ewusie </em></a><em>— Educational Content Creator for </em><a href="https://www.linktr.ee/wdss"><em>WDSS</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9630eae33ecf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Language in Focus, Part 1: Predicting Movie Magic]]></title>
            <link>https://medium.com/@WDSS/language-in-focus-part-1-predicting-movie-magic-921534288ade?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/921534288ade</guid>
            <category><![CDATA[movies]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Tue, 05 Sep 2023 19:02:07 GMT</pubDate>
            <atom:updated>2023-09-05T19:02:07.098Z</atom:updated>
            <content:encoded><![CDATA[<iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgiphy.com%2Fembed%2F3o7aD7TFLsZjgzXrZS%2Ftwitter%2Fiframe&amp;display_name=Giphy&amp;url=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F3o7aD7TFLsZjgzXrZS%2Fgiphy.gif&amp;image=https%3A%2F%2Fi.giphy.com%2Fmedia%2F3o7aD7TFLsZjgzXrZS%2Fgiphy.gif&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=giphy" width="435" height="217" frameborder="0" scrolling="no"><a href="https://medium.com/media/eff535554198b73f0b880de207bb26bd/href">https://medium.com/media/eff535554198b73f0b880de207bb26bd/href</a></iframe><p>Natural Language Processing (NLP) is one of the most active topics in machine learning (ML). With the commercialisation of ChatGPT [3] and the integration of chatbots into business environments [4], NLP has become a part of our everyday lives. Before Generative Pre-trained Transformers (GPTs), a turning point in the field of NLP was the development of Recurrent Neural Networks (RNNs). Researchers developed RNNs to better model the dependencies in sequential data [2]. Interpreting and predicting sequences of words was essential to the success of NLP.<br>This blog post will introduce how researchers approach NLP, exploring the most common data processing and classification methods. Next, it will cover RNNs in depth, including memory cells, the back-propagation through time (BPTT) algorithm and the challenges RNNs face. Lastly, it will illustrate how we can use NLP to classify film genres by summarising one of WDSS’s own research projects!</p><h3>Introduction to NLP</h3><p>Data processing is a crucial stage in any ML project. As NLP is a well-established field, many tricks for processing and normalising text data exist:</p><ul><li><em>Non-linguistic analysis</em> is used on social media platforms where users write posts containing icons, special characters, platform-specific prefixes and web links.
This preprocessing method involves removing characters, assigning values to represent non-text features, and removing images and links [1].</li><li><em>Morphological analysis</em> [1] is another method for text data processing. It includes tokenisation, where researchers use a dictionary to convert each word into a unique integer, removing punctuation, and removing “stop words” such as <em>the</em>, <em>is</em>, <em>a</em>, and <em>and</em> in English.</li><li><em>Syntactic analysis</em> is a text data processing method where researchers tag words as a “part of speech”. These would be nouns, verbs, articles, adjectives and more.</li><li>Lastly, <em>semantic analysis</em> focuses on the meaning of words. This method uses emotion dictionaries to index the emotions associated with words.</li></ul><p>Once the text data is processed, it can be classified. ML algorithms like Support Vector Machines (SVM), Naive Bayes and Decision Trees can all solve language classification tasks. However, deep neural networks have become more common due to their better performance in multi-class classification [7]. These neural networks include Convolutional Neural Networks (CNNs) [5] and RNNs [6].</p><h3>What are Recurrent Neural Networks?</h3><p>An RNN is a neural network with an internal state that stores previous inputs [2]. RNNs model the dependencies in sequential data, which is necessary for NLP as words often need to be understood in the context of a sentence or wider text. This is best illustrated by a diagram.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w9eOEynjJB996epzVZmY2g.png" /><figcaption>Neurons with recurrence [8].</figcaption></figure><p>Each recurrent cell represents a discrete time step. At each time step, the cell has weights that map the input vector x_t to an output vector y_t. These weights are shared between cells through the hidden states h_t. The recurrent network creates a summary of previous observations via these connections.
In this way, the network “remembers” relationships between events over time in the data [2]. A recurrence relation updates the hidden state h_t at each time step. This relation is defined by a function f_h of the input x_t to that cell at that time and the previous state h_t-1 of the cell.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*myvi9vq2t_BWYJG_5dmK2A.png" /><figcaption>The recurrence relation [2].</figcaption></figure><p>An activation function is a non-linear function that defines the output of a neuron in a network. Here, f_h represents the activation function tanh. Therefore, the complete formula for calculating the hidden state is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/752/1*P3dNHLfZXLK3WVfnhg9M_w.png" /><figcaption>Formula for calculating the hidden state [2].</figcaption></figure><p>The output of a single recurrent neuron is a function of the hidden state.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/224/1*7KlA0W6osNcjocdKPtv-Bg.png" /><figcaption>Output of a single recurrent neuron [2].</figcaption></figure><p>Specifically, the hidden state is weighted and a bias is added. Next, the result o_t is passed through another activation function: Softmax.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/412/1*AMBvPPW-hbxSM9VXYpttVg.png" /><figcaption>The hidden state is weighted and biased and the result is passed through the Softmax function [2].</figcaption></figure><p>RNNs are trained by an algorithm called backpropagation through time (BPTT). Backpropagation takes the derivative of the loss with respect to each parameter and shifts the parameters to minimise the loss [2]. The BPTT algorithm backpropagates the loss for each of the individual time steps, from the current time to the initial time in the sequence. This means that the gradient is computed repeatedly, which can cause problems.
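The effect of this repeated computation can be seen in a scalar toy example: multiplying by a recurrent weight below or above 1 over many time steps shrinks or blows up the product exponentially (illustrative numbers only):

```python
import numpy as np

# Scalar stand-in for the repeated multiplication BPTT performs:
# over T time steps the gradient scales roughly like w**T
T = 50
vanishing = np.prod(np.full(T, 0.5))  # weight < 1: gradient shrinks towards 0
exploding = np.prod(np.full(T, 1.5))  # weight > 1: gradient grows without bound
print(vanishing, exploding)
```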
The two most common issues that this causes are the vanishing and exploding gradient problems. In short, gradients “explode” when the weight matrices are large. This prevents the network from converging at a stable predicted value. Similarly, gradients vanish when the weights are small. This prevents the network from using all the important information when training, leading to incorrect predictions. Exploding gradients can be prevented using a method called gradient clipping. Vanishing gradients can be mitigated by using the rectified linear unit (ReLU) activation function, or by using gated cells. These cells selectively control the flow of information into the neural network, filtering out what is not important [8]. Long Short-Term Memory (LSTM) cells and Gated Recurrent Units (GRUs) are common types of gated cells.</p><h3>An example: Movie Genre Classification</h3><p>One common application of NLP is grouping entertainment media, such as movies or blog posts, into categories for users to choose from. This type of problem requires a many-to-one RNN. The image below shows an example of a many-to-one RNN for classifying the sentiment (emotional positivity or negativity) of a sentence.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1018/1*1LHLAwWdzJBLnYAxTqRShg.png" /><figcaption>Many-to-one RNN [8].</figcaption></figure><p>At WDSS, we complete research projects exploring different areas of data science (see more <a href="https://research.wdss.io/">here</a>). For example, one research project aimed to classify movies into their respective genres based on their IMDb descriptions using an RNN. First, the researcher analysed the data to identify the uneven frequencies of genre classes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/765/1*yeg6AMattKw-ICy9IMcbqg.png" /><figcaption>Analysis of the class imbalance.</figcaption></figure><p>They then processed the data using the NLTK Python library and split the training and testing datasets.
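A dependency-free sketch of what this processing stage does, following the morphological-analysis steps described earlier (the stop-word list here is a tiny hand-written stand-in; NLTK supplies full tokenizers and stop-word lists):

```python
import string

# Hypothetical stop-word list (NLTK ships much fuller versions)
STOP_WORDS = {"the", "is", "a", "and", "of", "in"}

def preprocess(text: str) -> list[str]:
    # Lowercase, strip punctuation, then drop stop words
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Map each word to a unique integer via a dictionary (tokenisation)
    return [vocab.setdefault(w, len(vocab)) for w in tokens]

vocab: dict[str, int] = {}
tokens = preprocess("A detective hunts the killer in a quiet town.")
print(tokens)                 # ['detective', 'hunts', 'killer', 'quiet', 'town']
print(encode(tokens, vocab))  # [0, 1, 2, 3, 4]
```

The resulting integer sequences are what a many-to-one RNN consumes.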
Next, they set up the learning frameworks using the ML library PyTorch. This included building an RNN class and addressing the vanishing gradients problem with an LSTM class and a GRU class. Then, they built a function to train the model, using the cross-entropy loss function and the Adam optimiser. Lastly, they evaluated the model’s accuracy and found that the unbalanced dataset meant that the model could classify dramas and documentaries well but not other genres.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/428/1*E00phFzjyJvsuC4dUQ24uw.png" /><figcaption>RNN with GRUs model accuracy.</figcaption></figure><p>This shows that finding a good dataset is important!</p><h3>Next Steps</h3><p>The research project covered in this blog used the PyTorch library: an ML framework developed by Meta AI. You can find tutorials for using this library in your own projects <a href="https://pytorch.org/tutorials/">here</a>. The <a href="https://www.nltk.org/">Python Natural Language Toolkit (NLTK)</a> is another great resource for language processing. It provides language datasets, libraries for text data processing, guides and documentation. Lastly, serverless data processing is essential when working with real-time text data, such as social media feeds or chat forums. Amazon Web Services has a<a href="https://aws.amazon.com/getting-started/projects/build-serverless-real-time-data-processing-app-lambda-kinesis-s3-dynamodb-cognito-athena/5/"> tutorial on serverless data processing</a> that you could adapt to advance your NLP skills!</p><h3>References</h3><ol><li>D. Rogers, A. Preece, M. Innes and I. Spasić, “Real-Time Text Classification of User-Generated Content on Social Media: Systematic Review,” in IEEE Transactions on Computational Social Systems, vol. 9, no. 4, pp. 1154–1166, Aug.
2022, doi: 10.1109/TCSS.2021.3120138.</li><li>M. Abdel-Basset, N. Moustafa and H. Hawash, “Introducing Recurrent Neural Networks,” in Deep Learning Approaches for Security Threats in IoT Environments, IEEE, 2023, pp. 189–207, doi: 10.1002/9781119884170.ch8.</li><li>T. Wu et al., “A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development,” in IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, May 2023, doi: 10.1109/JAS.2023.123618.</li><li>M. Banisharif, A. Mazloumzadeh, M. Sharbaf and B. Zamani, “Automatic Generation of Business Intelligence Chatbot for Organizations,” 2022 27th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, Islamic Republic of, 2022, pp. 1–5, doi: 10.1109/CSICC55295.2022.9780490.</li><li>Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition”, <em>Proc. IEEE</em>, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.</li><li>R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks”, <em>Neural Comput.</em>, vol. 1, no. 2, pp. 270–280, 1989.</li><li>H. A. Sayyed, S. Rushikesh Sugave, S. Paygude and B. N. Jazdale, “Study and Analysis of Emotion Classification on Textual Data,” 2021 6th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 2021, pp. 
1128–1132, doi: 10.1109/ICCES51350.2021.9489204.</li><li>Alexander Amini and Ava Amini, “MIT 6.S191: Introduction to Deep Learning”, Accessible at: <a href="http://introtodeeplearning.com/">IntroToDeepLearning.com</a></li></ol><p><em>This story was originally written by </em><a href="https://www.linkedin.com/in/evyanneewusie/"><em>Evyanne Ewusie </em></a><em>— Educational Content Creator for </em><a href="https://www.linktr.ee/wdss"><em>WDSS</em></a><em>, and used content from a research project by </em><a href="https://www.linkedin.com/in/peter-hyland-53a2951ba/"><em>Peter Hyland</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=921534288ade" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reinforcement Learning in Action, Part 1: Electric Vehicle Routing]]></title>
            <link>https://medium.com/@WDSS/reinforcement-learning-in-action-part-1-electric-vehicle-routing-39e70569d7d6?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/39e70569d7d6</guid>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[autonomous-cars]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Tue, 11 Jul 2023 15:51:38 GMT</pubDate>
            <atom:updated>2023-07-11T15:51:38.007Z</atom:updated>
<content:encoded><![CDATA[<iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgiphy.com%2Fembed%2F2A3dXPpN6gqTGMatfY%2Ftwitter%2Fiframe&amp;display_name=Giphy&amp;url=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F2A3dXPpN6gqTGMatfY%2Fgiphy-downsized-large.gif&amp;image=https%3A%2F%2Fi.giphy.com%2Fmedia%2F2A3dXPpN6gqTGMatfY%2Fgiphy-downsized-large.gif&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=giphy" width="435" height="317" frameborder="0" scrolling="no"><a href="https://medium.com/media/84a28d91ed126f3ad3a5c5fdaef3285b/href">https://medium.com/media/84a28d91ed126f3ad3a5c5fdaef3285b/href</a></iframe><blockquote>“The only real mistake is the one from which we learn nothing.” — Henry Ford</blockquote><p>As we generate more and more data, engineering problems are increasingly solvable using machine learning and optimisation. Much current research explores a subset of machine learning (ML) called reinforcement learning (RL) to support many areas of control systems engineering. RL trains an agent to take actions based on observations of its environment to maximise a reward [1].</p><p>This blog post will introduce some of the main components of RL: policies, value functions, Markov Decision Processes (MDPs), and optimisation. Next, it will investigate the first application of RL for this blog series: Autonomous Driving Systems. Lastly, this post will highlight some valuable resources to start your RL projects, including the Python library OpenAI Gym [2], created by the same research laboratory behind ChatGPT.</p><h4>About Reinforcement Learning</h4><p>Imagine a robot whose body includes a standing component that must balance on top of a cart with wheels. The robot is the <strong>agent</strong> and can take the <strong>actions</strong> of moving left and right in its <strong>environment</strong>. When the robot balances successfully, it receives a <strong>reward. 
</strong>The robot can make four types of <strong>observations</strong>: the cart’s position, the cart’s velocity, the robot body’s angle, and the robot body’s velocity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*tXVMCcBUafvv54AJy18NoQ.gif" /><figcaption>The Cart Pole Environment from OpenAI [3].</figcaption></figure><p>Sometimes each action has a single, fixed outcome, such as balancing a pole on a cart using classical mechanics. These scenarios are called deterministic environments. Alternatively, the environment could be probabilistic, with random outcomes for actions, such as playing against a random opponent in chess. Processes in probabilistic environments are called Markov Decision Processes (MDPs) [4]. In RL, we often assume the agent makes decisions as part of an MDP to account for unexpected events, such as someone bumping into the robot cart.</p><p>In RL, rewards are delayed: an agent often only receives confirmation that its decisions were correct after making many choices. For example, the cart may need to move left and right a few times to balance the robot’s body. Therefore, the agent must decide on a set of rules to follow, called a policy, to maximise the possibility of a future reward.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/780/1*eUIde-wDZ3dTzcOckV9FJw.png" /><figcaption>The Policy Equation [4]</figcaption></figure><p>The policy <strong><em>π</em></strong> describes the probability of the agent taking action <strong><em>a</em></strong> given that it is currently in state <strong><em>s</em></strong>. RL aims to maximise value functions, which determine the value of being in a state given the policy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/1*jgkgSnIJvfTjzQ_ATVzlvg.png" /><figcaption>The Value Equation [4]</figcaption></figure><p>This equation describes the expected future reward given that the agent starts in a given state and enacts that policy. 
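</p><p>The discounting idea behind this equation can be sketched in a few lines of Python (the function name and reward values are illustrative):</p><pre>
```python
# Sketch of a discounted return: each reward t steps in the future is
# weighted by gamma**t, so later rewards count for less.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Three steps of reward 1 each: 1 + 0.9 + 0.81 = 2.71
total = discounted_return([1, 1, 1], gamma=0.9)
print(total)
```
</pre><p>The value of a state is then the expectation of this quantity over the trajectories the policy produces from that state.</p><p>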
The value is regulated by the discount rate <strong><em>γ</em></strong>, a factor between 0 and 1 that controls how much the agent favours immediate rewards over future ones. Each state has its own value, which quantifies the total reward an agent can receive in the future if it starts in that state.</p><p>Training an ML model, designing a controller, and determining an RL policy are all optimisation problems. Many different methods exist to determine policies for RL, but recent research often focuses on iteratively improving policies using algorithms like Q-learning or deep learning methods like DQNs [1].</p><h4>What is an Autonomous Driving (AD) system?</h4><p>Before exploring how RL can be applied to Autonomous Driving, we first need to describe what an AD system is.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yKaRzRizHunQgPBr7n34fA.png" /><figcaption><strong>Diagram of an AD system [5].</strong></figcaption></figure><p>In general, the key tasks carried out by an AD system can be split into two groups: “scene understanding” and “decision making and planning” [5]. Scene understanding involves mapping the surrounding area and locating the vehicle on that map. Decision-making and planning involve finding an appropriate route based on the information from the scene understanding stage. This includes trajectory optimisation, which ensures that the planned motion is feasible given how the vehicle can move. This also includes control of the vehicle, which defines the speed, steering angle, and braking actions necessary to move through the mapped environment [5].</p><h4>An Example</h4><p>The electric vehicle (EV) routing problem described in the paper by Lin, B. et al. [6] is a key research application of RL. The main objective of the problem is to minimise the total distance travelled by a fleet of electric delivery vehicles. Hard-coding the solution would make it less generalisable to other variants. 
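</p><p>Before looking at how the paper approaches routing, the tabular Q-learning mentioned above can be sketched on a deliberately tiny toy problem: an agent on a five-cell track that is rewarded for reaching the end. All names and hyperparameter values here are illustrative, not taken from the papers cited.</p><pre>
```python
import random

random.seed(0)

n_states = 5
actions = [1, -1]                      # move right / move left
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(300):
    s = 0
    for _ in range(50):                # cap episode length
        # Epsilon-greedy: mostly exploit the current Q-table, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Core update: nudge Q(s, a) toward reward + gamma * best future value.
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next
        if s == n_states - 1:
            break

# The greedy policy learnt from Q should now move right from the start state.
```
</pre><p>Deep RL methods such as DQNs replace this table with a neural network, which is what makes much larger state spaces, like those in routing or driving, tractable.</p><p>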
The paper states that a set of customers is scattered in a region, all of whom a fleet of EVs must visit during a time window. All EVs start fully charged at a depot, and there are stations to recharge their batteries during the time frame. By the end of the time frame, the EVs must return to the depot. RL can be applied to find routes for the EVs such that all the customer demands are satisfied during their time windows and the total distance travelled by the fleet is minimised.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/672/1*ATvvz8KUHngLgDI8GsXHEw.png" /><figcaption>The Electric Vehicle Routing Problem [6].</figcaption></figure><p>To solve this type of problem, the paper by Kiran, B. et al. [5] demonstrates how deep RL has been used to ensure vehicles keep to lanes. Policies for more complex actions, including lane changes, overtaking, and navigating intersections, can be learnt using Q-learning and hybrid neural network architectures.</p><p>The paper by Kiran, B. et al. [5] also highlights that future research should focus on developing more consistent ways of validating RL autonomous driving systems, given that the testing environment is often difficult to control. Furthermore, because RL systems are often trained and tested in simulations, [5] suggests that future research consider ways to “bridge the gap” between results in simulation and reality.</p><h4>Next Steps</h4><p>The self-balancing robot example described in this tutorial is inspired by the Cart Pole tutorial from OpenAI. <a href="https://www.youtube.com/@NicholasRenotte">Nicholas Renotte’s YouTube Channel</a> has a great tutorial on how to run this. If you are new to RL, data science, or programming in general, the “<a href="https://www.geeksforgeeks.org/what-is-reinforcement-learning/">Geeks for Geeks</a>” website is a goldmine! If you are interested in developing your own RL projects, see pages 50–52 of the paper by Li, Y. et al. 
[1], which lists many RL problems to solve and possible approaches to try out! Finally, for a challenge, look for data science competitions on Kaggle, like <a href="https://www.kaggle.com/c/lux-ai-2021">this past competition</a> to build the best bot using RL. You may even win a prize!</p><h4>Related Content</h4><p>If you want to read more from us, click <a href="https://research.wdss.io/oh-hell/">here</a> to see the write-up for a project on RL completed by the WDSS team.</p><h4>References</h4><ol><li>Li, Y. (2017). Deep Reinforcement Learning: An Overview (Version 6). arXiv. <a href="https://doi.org/10.48550/ARXIV.1701.07274">https://doi.org/10.48550/ARXIV.1701.07274</a></li><li>Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., &amp; Zaremba, W. (2016). OpenAI Gym (Version 1). arXiv. <a href="https://doi.org/10.48550/ARXIV.1606.01540">https://doi.org/10.48550/ARXIV.1606.01540</a></li><li>A. G. Barto, R. S. Sutton and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” in IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, Sept.-Oct. 1983, doi: 10.1109/TSMC.1983.6313077.</li><li>S. L. Brunton (2021, February 12). Machine Learning Meets Control Theory. Cassyni. <a href="https://doi.org/10.52843/cassyni.x2t0sp">https://doi.org/10.52843/cassyni.x2t0sp</a></li><li>Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Sallab, A. A. A., Yogamani, S., &amp; Pérez, P. (2020). Deep Reinforcement Learning for Autonomous Driving: A Survey (Version 2). arXiv. <a href="https://doi.org/10.48550/ARXIV.2002.00444">https://doi.org/10.48550/ARXIV.2002.00444</a></li><li>Lin, B., Ghaddar, B., &amp; Nathwani, J. (2020). Deep Reinforcement Learning for Electric Vehicle Routing Problem with Time Windows. arXiv. 
<a href="https://doi.org/10.48550/ARXIV.2010.02068">https://doi.org/10.48550/ARXIV.2010.02068</a></li></ol><p><em>This story was originally written by </em><a href="https://www.linkedin.com/in/evyanneewusie/"><em>Evyanne Ewusie </em></a><em>— Educational Content Creator for </em><a href="https://www.linktr.ee/wdss"><em>WDSS</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=39e70569d7d6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ride-Hailing: Data Science Solving the Challenges]]></title>
            <link>https://medium.com/@WDSS/ride-hailing-data-science-solving-the-challenges-32341edb9446?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/32341edb9446</guid>
            <category><![CDATA[bolt]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[uber]]></category>
            <category><![CDATA[ride-hailing]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Tue, 16 May 2023 11:31:39 GMT</pubDate>
            <atom:updated>2023-05-16T11:32:38.756Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Pe5YpIon7Y3OvxJEedO-QA.jpeg" /><figcaption>Bolt is a transportation platform company that provides ride-hailing, scooter-sharing, and food-delivery services</figcaption></figure><h3><strong>What is Ride-Hailing? Uber, Bolt, Lyft, Didi, Waymo, Tesla?</strong></h3><p>Ride-hailing is a service, accessed through a mobile app, that enables users to request a ride on demand from a private driver who uses their own vehicle to transport them to their destination. Popular ride-hailing companies include Uber, Bolt, Lyft, and Didi. These companies offer a convenient and often cheaper alternative to traditional taxi services. Passengers can use the app to request a ride, view the driver’s details, track the driver’s location, and pay for the ride. The ride-hailing industry has revolutionized the way people get around in urban areas, providing an affordable and convenient alternative to traditional taxis, and one that is clearly better in terms of safety, security, and comfort than poor-quality public transport [1]. Additionally, some ride-hailing companies (Waymo, Tesla) are exploring the use of self-driving cars to further disrupt the traditional transportation industry.</p><h3>Differential Pricing Strategies to Encourage Platform Participation</h3><p>There are four main pricing strategies in the transportation industry: uniform pricing, differential customer pricing, differential driver pricing, and bilateral differential pricing. Various studies analyze how these strategies impact the platform, the regulator, and the platform participants. The researchers conclude that each pricing strategy may be optimal in different situations based on the maximal number of potential drivers and customer preferences. 
The studies provide managerial implications for ride-sharing platforms to tailor their pricing strategies according to supply conditions and user cognition. They have highlighted the effects of customers’ and drivers’ attitudes toward differential pricing and suggest that regulators focus on sustainable development and market stability. [2]</p><h3>Traffic Forecasting Using Machine Learning</h3><p>The Traffic4cast Traffic Map Movie datasets provide a huge amount of real-world traffic data across different cities. They have already been used for successful short-term traffic predictions and transfer learning. In 2022, Traffic4cast tackled the new challenge of predicting traffic congestion using only sparse traffic count data. The resulting dataset has been made available for 10 cities, and it could be a first step towards a traffic graph benchmark dataset in machine learning. While the models seem to capture the historic distribution well, more input signals and better placement of vehicle detectors could improve the detection of the global traffic state. [3]</p><p>The authors found that message passing did not improve results compared to city global principal component features. They think that this is due to the heuristic nature of their feature engineering and that a GNN may work better. However, combining the city&#39;s global context features and the spatial coordinates of roads was enough to learn about interactions and local context. Target encodings were critical, and the degree to which they could be leveraged depended on label density. Gradient-boosted decision tree ensembles were competitive but had shortcomings such as difficulty in embedding sequential features. An RNN or hybrid approach could be used in future traffic research. [4]</p><h3>Easing Traffic Congestion</h3><p>Ride-hail companies aimed to combine the benefits of personal cars with public transportation’s efficiency and lower costs. 
However, despite efforts to promote pooled trips, most users chose solo-ride services like UberX and Lyft over shared rides. This increased vehicle miles traveled (VMT) and negated hopes that pooling would mitigate VMT growth. The use of ride-hail to connect to public transportation or reduce cruising for parking did not significantly mitigate VMT growth. This has important implications for hopes that autonomous ride-hail services will lead to VMT reductions. Public policy must balance individual benefits against societal costs and prioritize space-efficient modes of transportation like public transportation, walking, and biking. [5]</p><h3>Economic Benefits</h3><p>The study used stated-preference data to estimate the utility coefficients for ride-sharing options. The negative and statistically significant intercept implies that pooling has an inherent disutility for consumers. They found that the inconvenience of taking a pooled ride was valued at $3.61 per ride. The negative price coefficient explains why a significant number of ride-sharing users would be willing to choose cheaper pooled rides. The research also found that low-income respondents are more likely to choose pooling, while respondents whose households own more cars per person are less likely to choose pooling. In addition, people in a rush are less likely to choose pooling. The study used a mixed logit model to account for heterogeneity in respondents that may bias their repeated choices. [6]</p><h3>Inequalities Through Simulations</h3><p>The authors of this research conducted a simulation of an agent-based automated taxi system to explore the fairness of driver incomes. The simulation considered various city layouts, traffic conditions, and matching algorithms. They found that inequality in driver incomes is influenced by factors such as the demand-to-supply ratio, the spatial distribution of requests, driver idling strategy, and the matching algorithm. 
To address income inequality, they proposed a new matching algorithm that promotes drivers with lower incomes, leading to fairer income distribution without negatively affecting average incomes. The study highlights the importance of simulation models for testing policy changes and monitoring the social effects of platform-based systems. However, it is important to note that short-term income differences among drivers can have significant consequences, such as increased overtime or driver attrition. The simulation does not capture the full complexity of emerging inequalities and feedback loops, which can further amplify wage gaps. Incorporating additional factors like driver skills and varying working hours could provide a more comprehensive understanding of long-term inequalities. Additionally, the current setup of ride-hailing companies, with its gamified and algorithm-driven incentives, poses challenges for implementing hourly wage systems or unionizing efforts. This study has limitations, such as simplified city models, averaged traffic conditions, and the omission of factors like driver skills and surge pricing. Despite these limitations, their findings shed light on the need for addressing income inequality in ride-hailing platforms and the potential role of algorithmic fairness policies in achieving better social outcomes. [7]</p><h3>Autonomous Taxis</h3><p>This study on dynamic autonomous taxi operations proposes dispatching strategies for autonomous taxis in both single-request mode and hybrid-request mode. In the single-request mode, a network flow model is used to centrally plan routes for taxis based on real-time information and requests. In the hybrid-request mode, a combination of centralized and decentralized autonomous dispatchers is employed to plan short-term and long-term routes for taxis. Experimental results show that the proposed strategies outperform other approaches in terms of service quality and economic efficiency. 
However, the study has limitations, such as the reliance on static travel time estimates and the need for further improvement in long-term reservation assignment and computational efficiency. Future research will focus on addressing these limitations and extending the strategies to ride-sharing scenarios. [8]</p><p>References:<br>1. Tirachini, A. Ride-hailing, travel behaviour and sustainable mobility: an international review. <em>Transportation</em> <strong>47</strong>, 2011–2047 (2020). <a href="https://doi.org/10.1007/s11116-019-10070-2">https://doi.org/10.1007/s11116-019-10070-2</a></p><p>2. Zhao, D., Yuan, Z., Chen, M. and Yang, S. (2022), Differential pricing strategies of ride-sharing platforms: choosing customers or drivers?. Intl. Trans. in Op. Res., 29: 1089–1131. <a href="https://doi.org/10.1111/itor.13045">https://doi.org/10.1111/itor.13045</a></p><p>3. Neun, M., Eichenberger, C., Martin, H., Spanring, M., Siripurapu, R., Springer, D., … &amp; Hochreiter, S. (2023). Traffic4cast at NeurIPS 2022 — Predict Dynamics along Graph Edges from Sparse Node Data: Whole City Traffic and ETA from Stationary Vehicle Detectors. <em>arXiv preprint arXiv:2303.07758</em></p><p>4. Lumiste, M., &amp; Ilie, A. (2022). Large scale traffic forecasting with gradient boosting, Traffic4cast 2022 challenge. <em>arXiv preprint arXiv:2211.00157</em></p><p>5. Schaller, B. (2021). Can sharing a ride make for less traffic? Evidence from Uber and Lyft and implications for cities. <em>Transport policy</em>, <em>102</em>, 1–10.</p><p>6. Naumov, S., &amp; Keith, D. (2023). Optimizing the economic and environmental benefits of ride-hailing and pooling. <em>Production and Operations Management</em>, 32, 904– 929. <a href="https://doi.org/10.1111/poms.13905">https://doi.org/10.1111/poms.13905</a></p><p>7. Bokányi, E., Hannák, A. (2020). Understanding Inequalities in Ride-Hailing Services Through Simulations. Sci Rep 10, 6500. 
<a href="https://doi.org/10.1038/s41598-020-63171-9">https://doi.org/10.1038/s41598-020-63171-9</a></p><p>8. Duan, L., Wei, Y., Zhang, J., &amp; Xia, Y. (2020). Centralized and decentralized autonomous dispatching strategy for dynamic autonomous taxi operation in hybrid request mode. Transportation Research Part C: Emerging Technologies, 111, 397–420.</p><p>Bolt <a href="https://bolt.eu/">Website</a>, <a href="https://www.linkedin.com/company/bolt-eu/">LinkedIn</a>, <a href="https://www.instagram.com/bolt/">Instagram</a>, <a href="https://www.facebook.com/Bolt/">Facebook</a></p><p><em>This story was originally written by </em><a href="https://uk.linkedin.com/in/alexandru-pascu"><em>Alexandru Pascu</em></a><em> — President of the </em><a href="https://www.linktr.ee/wdss"><em>WDSS</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=32341edb9446" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Revolutionizing Cancer Detection and Diagnosis with Artificial Intelligence in Healthcare]]></title>
            <link>https://medium.com/@WDSS/revolutionizing-cancer-detection-and-diagnosis-with-artificial-intelligence-in-healthcare-3a8a458ffce2?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/3a8a458ffce2</guid>
            <category><![CDATA[sybil]]></category>
            <category><![CDATA[healthcare]]></category>
            <category><![CDATA[deepmind]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[cancer]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Mon, 10 Apr 2023 19:42:34 GMT</pubDate>
            <atom:updated>2023-04-10T20:42:28.193Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*694_Z21zPFxrJY3kclFpLQ.png" /><figcaption>Rayscape AI tools that assist radiologists in the analysis of medical images</figcaption></figure><h3>How AI Can Transform Cancer Imaging</h3><p>Cancer detection and diagnosis is a critical area in healthcare where Artificial Intelligence (AI) is revolutionizing imaging techniques. It is enabling faster, more accurate, and more informative assessments, helping doctors identify early-stage cancers and better assess cancer risk factors. AI techniques such as machine learning and deep learning can analyze and interpret large datasets to identify complex patterns and relationships that humans might not detect. AI can aid in screening tests for cancers like breast cancer, identify cervical precancers, and determine the stage of prostate cancer. Yet despite the excitement AI generates, the question remains whether it is ready for clinical implementation and whether it will help all patients.</p><h3>AI for Cancer Screening, Diagnosis, and Prognosis</h3><p>AI algorithms can analyze various data types to identify asymptomatic patients at risk of developing cancer, helping clinicians investigate and triage symptomatic patients. Multi-modal data integration, such as combining radiomic, genomic, and clinical factors, improves diagnostic precision. Artificial Intelligence can automate detection and classification of pre-malignant lesions, identify and classify indeterminate pulmonary nodules, and predict treatment response. Liquid biopsy tests analyze cell-free DNA, and machine learning can enhance high-dimensional data approaches. Digital pathology and deep learning models predict recurrence risk following surgical resection of several cancers. 
Further research is needed to validate and generalize findings for broader clinical implementation.</p><h3>Sybil: The AI Tool for Personalized Lung Cancer Risk Analysis</h3><p>The development of Sybil represents a significant advance in lung cancer screening. Low-dose computed tomography (LDCT) scans have limitations and can produce false positives, leading to unnecessary invasive procedures. Sybil’s deep-learning algorithm can predict a patient’s likelihood of developing lung cancer within six years, based on personalized risk factors and imaging data. It was trained on hundreds of CT (computed tomography) scans to detect subtle patterns indicating future cancer risk. Sybil’s performance was tested on diverse sets of lung LDCT scans, achieving strong C-index and ROC-AUC (receiver operating characteristic, area under the curve) scores. Sybil’s predictive ability has implications for personalized medicine beyond lung cancer screening, such as predicting other types of cancer or diseases.</p><h3>DeepMind Mammography AI System</h3><p>Researchers from Google Health &amp; Cancer Research UK Imperial Centre, Northwestern University, and Royal Surrey County Hospital have developed an AI system capable of outperforming clinical specialists in detecting breast cancer from mammograms. The study used large datasets from breast cancer screening programmes in the UK and US, and the AI system accurately predicted, from screening mammograms alone, whether a patient would have a biopsy positive for breast cancer, with lower false-positive and false-negative rates than human experts. The AI system also demonstrated generalisation across populations and screening settings. 
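</p><p>As an aside, ranking metrics like the ROC-AUC mentioned above are straightforward to compute with scikit-learn; below is a minimal sketch using made-up labels and risk scores, not data from either study:</p><pre>
```python
from sklearn.metrics import roc_auc_score

# Illustrative only: ground-truth labels (1 = cancer present) and model
# risk scores. An AUC of 1.0 means every positive case is ranked above
# every negative one; 0.5 is no better than chance.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.8, 0.7, 0.4, 0.9]
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 here, since every positive score exceeds every negative score
```
</pre><p>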
Future research is required to understand the full extent to which this technology can benefit breast cancer screening programs.</p><h3>Ethical Implications of AI in Healthcare</h3><p>Patient privacy is a fundamental right that must be respected and protected in all healthcare settings, including those that involve the use of AI. These systems must be designed to ensure that patient data is secure and protected from unauthorized access. They must be designed and trained in a way that ensures fairness and prevents bias. There must be clear lines of responsibility and accountability for the use of AI in healthcare, and patients must be informed about the use of such technology in their healthcare and must provide their consent for its use.</p><h3>The Future of AI in Healthcare</h3><p>AI has been tested to optimize imaging acquisition and improve image quality in MRI (Magnetic resonance imaging) examinations. In the future, it will play a crucial role in patient-specific cancer treatment planning, personalized radiation therapy, and the management of cancer drug resistance. Further advances in machine learning, and especially deep learning, could result in quicker and more accurate diagnoses and personalized treatment options for patients with cancer.</p><p>Alex Ouyang | Abdul Latif Jameel Clinic for Machine Learning in Health. January 20, 2023. MIT <a href="https://news.mit.edu/2023/ai-model-can-detect-future-lung-cancer-0120">https://news.mit.edu/2023/ai-model-can-detect-future-lung-cancer-0120</a></p><p>Nadia Jaber. Can Artificial Intelligence Help See Cancer in New, and Better, Ways? March 22, 2022. NCI <a href="https://www.cancer.gov/news-events/cancer-currents-blog/2022/artificial-intelligence-cancer-imaging">https://www.cancer.gov/news-events/cancer-currents-blog/2022/artificial-intelligence-cancer-imaging</a></p><p>Hunter, B.; Hindocha, S.; Lee, R.W. The Role of Artificial Intelligence in Early Cancer Diagnosis. Cancers 2022, 14, 1524. 
<a href="https://doi.org/10.3390/cancers14061524">https://doi.org/10.3390/cancers14061524</a></p><p>Koh, DM., Papanikolaou, N., Bick, U. <em>et al.</em> Artificial intelligence and machine learning in cancer imaging. <em>Commun Med</em> <strong>2</strong>, 133 (2022). <a href="https://doi.org/10.1038/s43856-022-00199-0">https://doi.org/10.1038/s43856-022-00199-0</a></p><p>McKinney, S.M., Sieniek, M., Godbole, V. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020). <a href="https://doi.org/10.1038/s41586-019-1799-6">https://doi.org/10.1038/s41586-019-1799-6</a></p><p>Naik N, Hameed BMZ, Shetty DK, Swain D, Shah M, Paul R, Aggarwal K, Ibrahim S, Patil V, Smriti K, Shetty S, Rai BP, Chlosta P and Somani BK. Legal and Ethical Consideration in Artificial Intelligence in Healthcare: Who Takes Responsibility? <em>Front. Surg.</em> 9:862322 (2022). <a href="https://doi.org/10.3389/fsurg.2022.862322">https://doi.org/10.3389/fsurg.2022.862322</a></p><p>Rayscape <a href="https://rayscape.ai/">Website</a>, <a href="https://www.facebook.com/RayscapeAI/">Facebook account</a>, <a href="https://www.linkedin.com/company/rayscape/">LinkedIn account</a><br><a href="https://www.linktr.ee/wdss">WDSS relevant links</a></p><p><em>This story was originally written by </em><a href="https://uk.linkedin.com/in/alexandru-pascu"><em>Alexandru Pascu</em></a><em> and </em><a href="https://www.linkedin.com/in/ruxancus/"><em>Ruxandra Ilinca Stilpeanu</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3a8a458ffce2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Generative AI and Technology in Art]]></title>
            <link>https://medium.com/@WDSS/generative-ai-and-technology-in-art-76a9512b9dc6?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/76a9512b9dc6</guid>
            <category><![CDATA[midjourney]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[technology-in-design]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Thu, 09 Mar 2023 10:23:46 GMT</pubDate>
            <atom:updated>2023-03-09T11:31:05.865Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>Generative AI in Art &amp; Design</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eLbch_AqrOdRJ_0-EJ-5MQ.png" /><figcaption>Graph in the art nouveau style generated using Midjourney</figcaption></figure><p>So, how does generative AI work? Generative AI uses machine learning algorithms to create new content based on existing data. This means that the AI system can analyze vast amounts of data, such as images, texts, and audio, and learn from this data to generate new content that is similar in style or form to the original data.</p><p><strong>Midjourney benefits and challenges</strong></p><p>The Midjourney tool was developed and trained by a team of researchers at an independent research lab. They trained the AI system using a deep neural network, a Stable Diffusion model, which was fed a large dataset of images. The system then learned to generate new designs based on this data, using a process known as “latent space interpolation”.</p><p>One of the key features of Midjourney is its ability to take into account a wide range of design constraints and preferences. For example, a user can specify the size, shape, and orientation of a building, as well as the materials that should be used and the overall aesthetic style. Midjourney will then generate a range of design options that meet these constraints and preferences, providing the user with a diverse set of options to choose from.</p><p>One of the primary benefits of using Midjourney is that it allows architects and designers to quickly and easily explore a wide range of design options, without the need for extensive manual work. This can save time and resources, and allow them to focus on more creative and strategic aspects of the design process.</p><p>One of the challenges with generative AI technologies like Midjourney is the issue of copyright infringement. 
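As an aside on the “latent space interpolation” mentioned above: blending two latent vectors produces intermediate points that a trained decoder would render as in-between images. A hedged, pure-Python toy sketch (made-up 4-dimensional vectors, not Midjourney’s actual pipeline):

```python
# Hedged, model-free sketch: linear interpolation between two made-up
# latent vectors. A real text-to-image system would decode each blended
# vector into an image; here we only show the vector arithmetic.

def lerp(a, b, t):
    """Blend vectors a and b; t=0 returns a, t=1 returns b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

cat_latent = [0.0, 1.0, 0.5, -1.0]  # hypothetical latent for one prompt
dog_latent = [1.0, 0.0, 0.5, 1.0]   # hypothetical latent for another

# Walk the latent space in five evenly spaced steps.
path = [lerp(cat_latent, dog_latent, t / 4) for t in range(5)]
```

Sliding t from 0 to 1 traces a path through the latent space, which is what produces the smooth morphing effects these models are known for.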
Because the AI system is generating new content based on existing data, there is a risk that the resulting designs could be too similar to existing designs, potentially infringing on the original creator’s intellectual property rights.</p><p><strong>Other Generative AI tools (MusicLM, Med-PaLM, etc)</strong></p><p>However, generative AI also has many potential applications beyond architecture, design and visual arts. For example, it could be used to create new music (<em>Google researchers have introduced MusicLM, an AI model that can generate high-fidelity music from text. MusicLM creates music at a constant 24 kHz throughout a number of minutes by modeling the conditional music generating process as a hierarchical sequence-to-sequence modeling problem</em>. Source <a href="https://www.infoq.com/news/2023/02/google-musiclm-ai-music/#:~:text=Google%20researchers%20have%20introduced%20MusicLM,%2Dto%2Dsequence%20modeling%20problem.">infoq</a>), art, or literature. It could also be used in fields such as medicine (<em>Google has built the best artificial intelligence yet for answering medical questions. The Med-PaLM AI can answer multiple-choice questions from medical licensing exams and common health queries on search engines with greater accuracy than any previous AI and almost as well as human doctors</em>. Source <a href="https://www.newscientist.com/article/2355689-googles-ai-is-best-yet-at-answering-medical-and-health-questions/#:~:text=Google%20has%20built%20the%20best,as%20well%20as%20human%20doctors.">newscientist</a>) or engineering to generate new solutions to complex problems.</p><p><strong>ChatGPT</strong></p><p>ChatGPT is a transformer language model that uses generative AI techniques to generate natural language responses. Specifically, it uses a deep neural network architecture known as a transformer to generate text based on the input it receives from the user. 
This transformer architecture allows ChatGPT to analyze and learn from vast amounts of data, enabling it to generate coherent and contextually appropriate responses to a wide range of queries and prompts. So, while ChatGPT is primarily a transformer language model, it also leverages generative AI techniques to generate natural language text.</p><p><strong>Conclusions</strong></p><p>In conclusion, generative AI is a rapidly evolving field with many exciting possibilities for design and architecture. The Midjourney tool is just one example of how this technology can be used to create innovative designs and models. However, it is important to consider the potential challenges and ethical implications of using AI to generate new content, and to ensure that appropriate measures are taken to protect intellectual property rights.</p><p>To address some of these intriguing questions and to help us explore the world of Midjourney and Generative AI, as part of our DataBasic episode for 2023, we have <strong>Hassan Ragab</strong>, Egyptian Designer/Conceptual Artist in California — one of the world’s best-known AI artists. Stay tuned for this gripping and engaging episode as we unravel the usage of AI in Art with Hassan, as he shares his experience and perspectives.</p><p>P.S. ChatGPT has been used to generate parts of this article and we thought it would be a fun use case considering the topic of the blog. 
I first encountered this approach on <a href="https://kozyrkov.medium.com/introducing-chatgpt-aa824ad89623">Cassie Kozyrkov’s Medium</a>.</p><p>Hassan Ragab’s <a href="https://www.instagram.com/hsnrgb/">Instagram account</a>, <a href="https://www.linkedin.com/in/hsnrgb">LinkedIn account</a> and <a href="https://www.hsnrgb.com/">website</a><br><a href="https://www.linktr.ee/wdss">WDSS relevant links</a> including the podcast, YouTube, TikTok and social media (LinkedIn, Instagram, Twitter)</p><p><em>This story was originally written by </em><a href="https://uk.linkedin.com/in/alexandru-pascu"><em>Alexandru Pascu</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=76a9512b9dc6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fighting Financial Fraud with Data Science]]></title>
            <link>https://medium.com/@WDSS/fighting-financial-fraud-with-data-science-f0be078c5ab0?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/f0be078c5ab0</guid>
            <category><![CDATA[financial]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[financial-fraud]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[fraud]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Sun, 25 Dec 2022 16:21:39 GMT</pubDate>
            <atom:updated>2023-03-09T10:23:09.885Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*inlcBkoj7x3kPc2WZipmtg.png" /><figcaption>Money joke that can become the reality of a fraud</figcaption></figure><p>The world lost approximately $28.6 billion to credit card fraud in 2020, according to a 2021 Nilson Report. Credit card fraud increased by 42% in the last quarter of 2021 — a trend that is expected to continue well into 2022. Identity theft, money laundering, credit card fraud, and authorized push payment (APP) fraud are a few types of financial crime committed by fraudsters who are becoming increasingly innovative in their theft methodologies. It is becoming extremely challenging to detect and isolate fraudulent cases, especially those committed in real-time.</p><p><strong>Fraud Rate vs Decline Rate</strong></p><p>It is pertinent for financial institutions to identify and isolate fraudulent cases from a plethora of regular transactions, i.e. accurately capture the fraud rate (number of fraudulent attempts/total attempts in a given period). One way to achieve this is through anomaly detection, i.e. checking whether an incoming transaction by an individual fits the person’s profile and prior behavior. If it does not, it can be flagged as a potentially fraudulent activity. However, it is possible that the flagged transaction is not fraudulent and is falsely declined. False declines can prove to be costly due to lost revenue, increased operational effort in non-fraudulent cases, and loss in customer satisfaction/experience. Currently, the industry false positive rate is around 20 to 1, i.e. for every 20 fraud alerts, one is a correctly identified fraud attempt. 
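The two rates above are simple ratios, and made-up counts show how quickly false declines dominate at the quoted 20-to-1 alert ratio (a hedged sketch with illustrative numbers, not real industry data):

```python
# Illustrative numbers only, not industry figures: one million transactions,
# of which 500 are fraudulent.
total_attempts = 1_000_000
fraud_attempts = 500
fraud_rate = fraud_attempts / total_attempts  # 0.05% of all traffic

# A "20 to 1" alert ratio: for every 20 fraud alerts, one is real fraud.
# Catching all 500 frauds at that precision means ~10,000 alerts in total,
# 9,500 of which are false declines of legitimate customers.
alerts = fraud_attempts * 20
false_declines = alerts - fraud_attempts
```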
Hence, it is important for any financial institution to maintain a healthy balance between fraud and decline rates.</p><p><strong>Future of Fraud</strong></p><p>New-age innovations such as Blockchain technology, a decentralized ledger system, are trusted immensely for their immutability and security provisions. However, cryptocurrency hacks and thefts in 2020 alone involved $513 Million worth of bitcoin and other cryptocurrencies.</p><p>With the world becoming increasingly digital are we becoming more vulnerable to crime? What is the target age group and the different methods used to conduct fraud? Can data science combat financial crime and save the day? Can machine learning and rule-based writing predict and prevent fraud? Can the evolution of data science outsmart the evolution of fraud?</p><p>To address some of these intriguing questions and to help us explore the world of Fraud Analytics, as part of our DataBasic episode for 2022, we have <strong>Humera Yakub</strong>, Head of Fraud at Pay.UK — one of the UK’s foremost retail interbank payment systems. Stay tuned for this gripping and engaging episode as we unravel the Fraud Analytics industry with Humera, as she shares her experience and industry expertise.</p><p>Humera Yakub <a href="https://uk.linkedin.com/in/humera-yakub-52069742">Linkedin account</a>, Pay.UK <a href="https://www.wearepay.uk/">website</a><br><a href="https://www.linktr.ee/wdss">WDSS relevant links</a> including the podcast, YouTube, TikTok and social media (LinkedIn, Instagram, Twitter)</p><p><em>This story was originally written by </em><a href="https://www.linkedin.com/in/ria-govila/"><em>Ria Govila</em></a><em>. The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f0be078c5ab0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Visualising the Intersection of Disease and the Human Genome]]></title>
            <link>https://medium.com/geekculture/visualising-the-intersection-of-disease-and-the-human-genome-a11f96f85ddb?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/a11f96f85ddb</guid>
            <category><![CDATA[chemistry]]></category>
            <category><![CDATA[wd]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[visualization]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Tue, 26 Oct 2021 07:47:48 GMT</pubDate>
            <atom:updated>2021-11-23T07:41:48.582Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eG0O3ZcSFK1XMEkM.jpg" /></figure><p><strong>Try it for Yourself</strong><br>This article is the write-up for an interactive web application, allowing you to visualise CpG islands and SnP sites for different diseases. You can access the app <a href="https://cloud.wdss.io/genome">here</a> to test it on diseases of your choosing.</p><h3>Introduction</h3><p>Beginning in the early ’90s, the Human Genome Project (HGP) was an international research collaboration that attempted to sequence and map all of the genes of Homo sapiens. The HGP was completed in 2003 and paved the way for significant developments to be made in drug discovery.</p><p>In 2000, the University of California Santa Cruz (UCSC) and collaborators of the HGP started work on creating a free public access human genome assembly. To this day, it remains open access and has many features which can be used to relate information from the human genome to disease.</p><p>It was found that single nucleotide polymorphisms (SnPs) and CpG islands are commonly located around disease-associated regions. An SnP is the most common type of genetic variation between people. They occur, on average, once in every 1000 nucleotides (roughly 4 million SnPs in total). As an example, an SnP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a certain stretch of DNA. CpG islands are important because they represent areas of the genome that have, for some reason, been protected from the mutating properties of methylation through evolutionary time (which tends to change the G in CpG pairs to an A). Often, they point to the presence of an important piece of intergenic DNA, such as that found in the promoter regions of genes where transcription factors bind. 
In cancers, loss of expression of genes occurs about 10 times more frequently by hypermethylation of promoter CpG islands than by mutations.</p><h3>Building a Visualisation</h3><p>The National Center for Biotechnology Information (NCBI) contains information on where many diseases can be found within the human genome. These positions were collated onto a spreadsheet, along with a link to more information on the relevant disease.</p><figure><img alt="Screenshot of a spreadsheet with data of diseases found in the human genome" src="https://cdn-images-1.medium.com/max/602/1*979KLObMaunL3S6k6zuIBw.png" /></figure><p>From the locations of the diseases obtained from the NCBI, information on SnP sites and CpG islands was collected via the UCSC genome browser API, using the SnP and CpG island tracks.</p><figure><img alt="Screenshot of what the API used looked like" src="https://cdn-images-1.medium.com/max/602/1*ujavluMRZ-EFU_24ic5TLQ.png" /></figure><p>Gviz, a Bioconductor package for R, was then used to display the genome location, producing a graphical representation of the SnP sites and CpG islands.</p><figure><img alt="Two horizontal rectangles, the upper one is blue and labelled “SNPs”, the lower one is orange-pink and labelled “CpG Islands”. The blue one has many vertical, thin lines, the other one has two green rectangles" src="https://cdn-images-1.medium.com/max/602/1*jjMU6pFiCr1ATxpvpEEUkw.png" /></figure><p>The lines on the SnP track show sites that have a minor allele frequency of at least 1% and are mapped to a single location in the reference genome assembly. The highlighted areas on the CpG island track are where there is a GC content of higher than 50% for greater than 200bp. 
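The CpG-island criterion just described (GC content above 50% across a stretch of more than 200bp) can be sketched directly in Python. This is a simplified illustration of the thresholds quoted above, not the genome browser’s full island-calling algorithm:

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def meets_island_criterion(seq, min_len=200, min_gc=0.5):
    """Simplified CpG-island test: long enough and sufficiently GC-rich."""
    return len(seq) > min_len and gc_content(seq) > min_gc

# A 300bp GC-rich stretch qualifies; an AT-rich stretch of the same
# length, or a GC-rich stretch of only 100bp, does not.
```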
The red line on the chromosome highlights the area in which the disease is located, and the tracks display data from within this highlighted region.</p><p>One significant challenge faced was that, due to the vast amount of data needed to produce the genome plots, once a disease had been selected, it took a considerable amount of time for the API to collect the data. This was less than ideal, as users do not want to wait a lengthy period of time. It was overcome by pre-generating the plots for each disease in the spreadsheet, meaning that the plots were shown instantaneously.</p><p>Future improvements to the programme could include, but are not limited to, caching the data collected rather than having pre-generated plots. This would work well with another improvement that was considered, whereby the disease spreadsheet is connected to the NCBI database, which contains every known disease, with more being identified all the time. This would mean the programme could keep up to date with new diseases entering the database and would not need manual updating.</p><p><em>This post was originally written by </em><a href="https://www.linkedin.com/in/joshua-magiera-55b362174/">Joshua Magiera</a><em>. 
Alternative text for images provided by </em><a href="https://www.linkedin.com/in/dominika-dara%C5%BC-b13335162/"><em>Dominika Daraż</em></a><em>.</em></p><p><em>The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a11f96f85ddb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/geekculture/visualising-the-intersection-of-disease-and-the-human-genome-a11f96f85ddb">Visualising the Intersection of Disease and the Human Genome</a> was originally published in <a href="https://medium.com/geekculture">Geek Culture</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Higher or Lower: Reinventing a Classic Card Game]]></title>
            <link>https://medium.com/geekculture/higher-or-lower-reinventing-a-classic-card-game-d35d41d212fd?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/d35d41d212fd</guid>
            <category><![CDATA[shiny]]></category>
            <category><![CDATA[api]]></category>
            <category><![CDATA[games]]></category>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[wd]]></category>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Mon, 18 Oct 2021 12:03:14 GMT</pubDate>
            <atom:updated>2021-10-25T13:02:20.827Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>This post is the corresponding write-up for a Warwick Data Science Society (WDSS) project in which a small team of society members collaborated to produce a web-toy that plays a game of Higher or Lower using the Twitter follower counts of celebrities. You can play this game at <a href="https://app.wdss.io/higher-or-lower/"><em>this link</em></a>.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n26_4GzvC2ySG3_WWIsvHw.png" /><figcaption>credits: Ron Maijen</figcaption></figure><h3>Motivation</h3><p>Sometimes, simplicity is beautiful. Higher or Lower is a game embodying this philosophy. Played solo with a standard deck of cards, play consists of revealing these one at a time after first guessing whether the next card will have a higher or lower value. In recent years, this game has been re-envisioned as a <a href="http://www.higherlowergame.com/">popular web toy</a>, in which card values are replaced by the number of global monthly Google searches for various topics. Not wishing to limit ourselves to search results, we decided to implement our own online version of the classic game based on the follower counts of Twitter celebrities. The final app can be found at the link above, and we will spend the rest of this post looking into the techniques behind our approach as well as reviewing the lessons this project can teach us about collaborative data science at WDSS.</p><p>For our purposes, this project is an ideal medium to practise web scraping and creating sharable products for others to enjoy. Web scraping is a way of extracting data from websites, leveraging automation to gather information efficiently and without unnecessary repetition. In all, three members of WDSS worked on this project, combining their specific skills to develop the final product. 
<a href="https://www.linkedin.com/in/tim-hargreaves/">Tim Hargreaves</a> focused on the backbone of the app, <a href="https://www.linkedin.com/in/mhbardsley/">Matthew Bardsley</a> on the visuals of the game, and I (<a href="https://www.linkedin.com/in/parthdevalia/">Parth Devalia</a>) on the communication of results.</p><h3>Implementation</h3><p><em>The source code for the web app and scraping scripts have been open-sourced in </em><a href="https://github.com/warwickdatascience/higher-or-lower">this repository</a><em>.</em></p><p>With 330 million monthly users, Twitter has become an indispensable medium for instant news and opinions from politicians, brands, and of course, celebrities. Manually defining celebrityhood and iterating through matching accounts would be a difficult task, so we decided to look at what existing resources we could take advantage of. We eventually settled on a website called <a href="http://profilerehab.com/twitter-help/celebrity_twitter_list">ProfileRehab</a>. On this site, links to celebrities’ Twitter accounts are sorted into categories. By scraping this information we were able to collect Twitter profile URLs for around five hundred celebrities, matched to their names. We then interfaced with the Twitter API to read their respective follower counts and download their profile pictures. This entire scraping process was performed using Python, to take advantage of the rich ecosystem of web scraping packages the language has.</p><p>You may well be asking, “What is an API?”, and so I will take a moment to introduce this term. Application Programming Interfaces (APIs) simply allow applications to communicate with each other and are responsible for much of the connectivity we rely upon. They act as messengers, taking your request, telling the target application what you want to do, and then returning the response. 
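Concretely, the exchange boils down to building a request URL and decoding the JSON that comes back. A hedged sketch: the endpoint, field names, and canned response below are invented for illustration, and the real Twitter API requires authentication and has its own URL scheme:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and field names, for illustration only.
def build_request(screen_name):
    """The 'order' handed to the API waiter: a URL describing what we want."""
    return "https://api.example.com/users/show?" + urlencode({"screen_name": screen_name})

# A canned stand-in for the JSON 'dish' the API would send back.
canned_response = '{"screen_name": "celebrity_one", "followers_count": 1234567}'

profile = json.loads(canned_response)
followers = profile["followers_count"]
```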
Rather than accessing the application server directly, APIs offer us a dedicated access point, improving security and reliability. A common analogy is that of a restaurant — the API is the waiter, the interface between your table and the kitchen, taking your request and returning the response (the food).</p><p>With regard to our project, we decided to access the data we wanted through an API, as Twitter have recently obfuscated their website code to make direct scraping more difficult. Use of Twitter’s API, as is often the case, is subject to terms and conditions regarding the usage of the data obtained. Additionally, the company have implemented a rate limit; that is, a maximum number of requests they can handle in a given timespan (just like with a restaurant waiter). Careful examination of these limitations is needed when using an API, but fortunately we found them to be adequate for our needs.</p><p>The application is made using Shiny, a package for the R programming language that allows you to build interactive web applications. The framework allows for the development of powerful and flexible web applications with no need for HTML, CSS or JavaScript knowledge. For this reason, Shiny stands out for its unrivalled speed of development.</p><p>Despite its benefits, the raw product of Shiny development is not always the prettiest and can lack strong mobile support. To overcome this, Matthew implemented additional styling using CSS to improve the aesthetics of the final application.</p><h3>Takeaways</h3><p>This project allowed us three WDSS members to work together in creating something that we wouldn’t have done individually. Further, if not for the society, we would not have had the opportunity to work together. This highlights the role of WDSS, bringing people from different backgrounds together to solve challenging problems.</p><p>Projects are extremely important in growing your skills and are critical for developing a strong portfolio. 
Through collaboration, we can see problems from new perspectives, build our professional networks, and gain experience of working in a team. This is in comparison to university work, which is often done alone, without using real-world data, and usually without a solid final product.</p><p>This project leverages infrastructure offered by WDSS, such as our blogging platform and Shiny server. For this reason, alongside the support offered by experienced students, working with WDSS to complete research makes it easier to get projects off the ground and showcase what you can do.</p><p>Thank you for reading and we hope you enjoy playing <a href="https://app.wdss.io/higher-or-lower/"><em>our implementation</em></a><em> </em>of Higher or Lower.</p><p><em>This post was originally written by </em><a href="https://www.linkedin.com/in/parthdevalia/"><em>Parth Devalia</em></a><em>. Alternative text for images provided by Dominika Daraż.</em></p><p><em>The article is licensed under </em><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><em>CC BY-NC-SA 4.0</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d35d41d212fd" width="1" height="1" alt=""><hr><p><a href="https://medium.com/geekculture/higher-or-lower-reinventing-a-classic-card-game-d35d41d212fd">Higher or Lower: Reinventing a Classic Card Game</a> was originally published in <a href="https://medium.com/geekculture">Geek Culture</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Data-Driven Dive into UK Party Conference Leaders’ Speeches]]></title>
            <link>https://medium.com/geekculture/a-data-driven-dive-into-uk-party-conference-leaders-speeches-211afe3b1bd9?source=rss-9d4df5140985------2</link>
            <guid isPermaLink="false">https://medium.com/p/211afe3b1bd9</guid>
            <dc:creator><![CDATA[Warwick Data Science Society]]></dc:creator>
            <pubDate>Thu, 23 Sep 2021 06:13:00 GMT</pubDate>
            <atom:updated>2021-10-18T11:02:24.773Z</atom:updated>
            <content:encoded><![CDATA[<p>Party conferences are a mainstay in British politics whereby politicians, party members and affiliated people descend on a chosen city in order to set the party agenda, raise funds and attempt to get a soundbite into the mainstream media. The hallmark of these conferences is the Leaders’ speeches, where the current head of the party aims to appeal to their party base or even attract some new voters through media coverage.</p><h3>Data Background</h3><p>This analysis would not have been possible without the transcripts provided by <a href="http://www.britishpoliticalspeech.org/speech-archive.htm">British Political Speech</a>. They describe themselves as “an online archive of British political speech and a place for the discussion, analysis, and critical appreciation of political rhetoric” and provide speeches dating back to 1895.</p><p>For my study, I aim to observe the nuances of party conference leadership speeches from 2010 to 2018. These dates were chosen as they coincide with a change in the British political landscape following the 2010 election, whilst still providing us with enough data to conduct meaningful analysis. For this study, I will only observe the three ‘mainstay’ political parties: the Conservatives, the Labour Party and the Liberal Democrats.</p><p>Upon importing and tidying the data, we can observe the 5 most used words within the speeches.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/261/1*LMP6kYW10X-xQsXMhIjHww.png" /></figure><p>There are no surprises here. In fact, the top 5 most used words here are from the top 6 most used words in the English language according to the <a href="https://enacademic.com/dic.nsf/enwiki/2822326">Oxford English Corpus</a>, a text corpus comprising over 2 billion words.</p><p>Carrying on our analysis with these common words would create a dull analysis, so to counteract this, we will temporarily remove them. 
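The removal step is just a filter against a lexicon. A sketch in Python with a tiny illustrative stop list and a made-up snippet of speech (real stop-word lexicons run to over a thousand entries):

```python
# Tiny illustrative stop list; real lexicons hold 1,000+ entries.
stop_words = {"the", "and", "to", "of", "a", "we", "is", "in"}

speech = "we will build a country that works for the many and not the few"
tokens = speech.split()

# Keep only the words that carry content.
content_words = [w for w in tokens if w not in stop_words]
```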
We do this using the tidytext package in R, which contains a comprehensive list of <a href="https://rdrr.io/cran/tidytext/man/stop_words.html">stop words</a>. These are common words in the English language which would add nothing to certain parts of our analysis if they were to be included. A separate dataframe was created to store the non-stopwords, which totalled 56,989 words, meaning that 105,883 words were removed.</p><p><em>It is worth noting that there is no definitive list of stopwords. Instead, different words would be considered stopwords depending on the context, although we have just used a generic list for simplicity. We will see later in the post a more nuanced way of handling uninformative words called TF-IDF.</em></p><p>We can now observe the 5 most used non-stopwords.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/341/1*P1Mu_K2xNPDUEeGUllB9RA.png" /></figure><p>This is more like what we would have expected the vocabulary of a Leader’s speech to look like.</p><p>We can go further and visualise the set of each party’s 100 most commonly used non-stopwords through wordclouds. This is done below in each party’s traditional colour (blue for Conservative, red for Labour, yellow for Liberal Democrat).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tSxQTGmby97r42f4xC42Uw.png" /></figure><p>We can see that the words identified as most common before appear most often in these wordclouds too (denoted by their large size). The only visible differences are the parties’ names, particularly visible for both Labour and the Liberal Democrats. Here we see the obvious flaws in word clouds: they barely allow us to observe any differences between the parties and provide no numerical insight. 
We will aim to address this weakness later with different methods.</p><p>Before we dive into some more detailed text analysis, we can have a quick exploration of the word count for each Leader’s speech.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M-FvLnHYmCdTGqy9fM1gUw.png" /></figure><p>From this, we can see that Ed Miliband can be quite the rambler at times.</p><h3>Text Analysis</h3><h3>Analysing Sentiment</h3><p>Basic counts and summaries are great, but with modern data science techniques, we can go much further. For example, we can infer the sentiment (loosely, how positive or negative in tone) of each speech.</p><p>We do this by referencing the contents of each speech against the AFINN lexicon, created by Finn Årup Nielsen. This lexicon assigns an integer value from -5 to 5 to a vast number of English words, with negative numbers indicating negative sentiment and positive numbers indicating positive sentiment. Here we list a random word for each sentiment value.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/367/1*_PPl9LIaOvhwfEbQHDfBOg.png" /></figure><p>We can then take an average of the sentiments over all words in each speech and visualise it to see the trend of speech sentiment over time. This is conducted on the dataset with stopwords included, as removing them could otherwise distort the sentiment (though note that most stopwords have a neutral sentiment).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f36vtuRoVGjyhkYounxzvA.png" /></figure><p>As we can see, the speeches are overwhelmingly positive, with the only negative score being Jeremy Corbyn’s 2018 speech to the Labour Party conference. 
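The sentiment scoring above is just an average of lexicon values over a speech’s words. A sketch with a made-up three-word mini-lexicon in the AFINN style (the real list scores thousands of words):

```python
# Made-up mini-lexicon in the AFINN style: integer scores from -5 to 5.
lexicon = {"great": 3, "crisis": -3, "hope": 2}

def speech_sentiment(text):
    """Average lexicon score over all words; unlisted words score 0."""
    words = text.lower().split()
    return sum(lexicon.get(w, 0) for w in words) / len(words)

positive = speech_sentiment("great hope for the future")
negative = speech_sentiment("a crisis with no hope")
```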
Other notable features include David Cameron’s consistency between 2010 and 2015 for the Conservative Party and the significant jump in positivity when Theresa May took over the Conservative Leadership in 2016.</p><h3>Term Frequency and Zipf’s Law</h3><p>We have seen that raw word counts on their own aren’t particularly useful. One flaw among many is that longer texts will naturally have higher word counts for all words. Instead, a more useful metric is how often a certain word (also called a term) appears as a proportion of all words. This is known as <em>term frequency</em> and defined as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*Sak0qGaHxyM_oM3AxIvadw.png" /></figure><p>So that we can compare across parties, we will compute each term’s frequency across all speeches by the party the term came from. We start by looking at the distribution of term frequencies for each party.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CKTT2NijELkE5gmKIRUlqw.png" /></figure><p><em>It should be noted that there are longer tails for these graphs that have not been shown. Instead, we have truncated the really popular words such as ‘the’, ‘and’ and ‘to’ to make it easier to see the main body of the plot.</em></p><p>The plots all display a similar distribution for each party, with many ‘rare’ words and fewer popular words.</p><p>It turns out that these long-tailed distributions are common in almost all natural language. In fact, George Zipf, a 20th-century American linguist, formulated <em>Zipf’s law</em>. 
This formalises the above observation, stating that the frequency with which a word appears in a text is inversely proportional to its rank.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/365/1*Fu214Vs8goY3MR5JO-9C_A.png" /></figure><p>Put simply, the most frequent word will appear at twice the rate of the second most frequent word and at three times the rate of the third most frequent word.</p><p>Zipf’s law is largely accurate for many natural languages, including English (though as always, there are exceptions). For example, in the Brown Corpus of American English text, which contains slightly over 1 million words, ‘the’ appears most often, at ~70,000 occurrences; ‘of’ is second, at ~36,000; and ‘and’ is third, at ~29,000, roughly as Zipf’s law would predict.</p><p>We can attempt to visualise this law for our own text by plotting rank on the x-axis and term frequency on the y-axis, both on log scales.</p><p><strong>Why the logs?</strong><br>By definition, if two values <em>x</em> and <em>y</em> are inversely proportional, then we can find a constant <em>a</em> such that <em>y</em>=<em>a</em>/<em>x. </em>Taking logarithms and rearranging gives log(<em>y</em>) = log(<em>a</em>) - log(<em>x</em>). In other words, <em>x</em> and <em>y</em> are inversely proportional if and only if their logarithms lie on a straight line with slope -1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TAKxhz9sS2FfsZOGvKbjCQ.png" /></figure><p>We can see that all three parties have similar text structures that largely obey Zipf’s law. That said, our curve deviates from a straight line at the lower-rank tail, suggesting that the most popular words in the speeches are used more often than they typically would be in natural language. 
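Since log(<em>y</em>) = log(<em>a</em>) - log(<em>x</em>), the log–log points of a perfectly Zipfian rank–frequency table lie on a line of slope -1, and a least-squares fit recovers that slope. A small self-contained check on synthetic counts (not the speech data):

```python
import math

# Synthetic counts following Zipf's law exactly: frequency ∝ 1 / rank.
# Real speech counts would only follow this approximately.
ranks = range(1, 51)
counts = [12000 / r for r in ranks]

# Least-squares slope of log(frequency) against log(rank).
xs = [math.log(r) for r in ranks]
ys = [math.log(c) for c in counts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 2))  # -1.0: the signature of Zipfian data
```

Running the same fit on real rank–frequency data gives a slope near, but rarely exactly, -1.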
Additionally, we would expect a slope of approximately −1; by fitting a linear model (shown in grey), we obtain a coefficient close to this value.</p><h3>TF-IDF Analysis</h3><p>We’ve seen that we can use a list of stop words to filter our data to leave only meaningful words. However, this list is fixed and not linked to our data in any way. We’ve already seen that ‘people’ is used very commonly in our speeches and so doesn’t provide much meaningful insight. Could we construct a value that captures the relative frequency of a term among our speeches, in order to see how important a word is to a specific speech compared to the others?</p><p>We can indeed. In fact, the work has already been done for us in the form of a value called the TF-IDF. It is calculated by multiplying the term frequency (TF) from earlier by a new value called the inverse document frequency (IDF).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qTiqPuVQJ3QI3bDsQmF_Sw.png" /></figure><p>Loosely speaking, TF-IDF asks two questions:</p><ul><li>Is the specific term used more than expected in a given speech?</li><li>Is it rare for a speech to contain the specific term?</li></ul><p>If the answer to both of these questions is "yes", then the TF-IDF is large, and the term is considered to be relatively important.</p><p>We can calculate the TF-IDF score for each word in each speech before using these to find the most ‘important’ word in each speech.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/931/1*Ag8hflD0S5s9q8Gd9zHvyQ.png" /></figure><p>We obtain some interesting results here. For example, it’s clear to see the Liberal Democrats’ sharp pivot to an anti-Brexit strategy following the referendum of 2016, or how, in 2014, the Conservatives announced their plan to increase the 40% income tax threshold (known as the 40p tax rate). 
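Using a common definition of the inverse document frequency, idf(t) = ln(number of documents / number of documents containing t), the TF-IDF calculation can be sketched as below. The three toy “speeches” are invented for illustration:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Per-document TF-IDF: tf = count / document length,
    idf = ln(n_docs / n_docs containing the term)."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # document frequency of each distinct term
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (c / len(doc)) * math.log(n_docs / df[term])
            for term, c in counts.items()
        })
    return scores

speeches = [
    "stop brexit stop brexit now".split(),
    "tax cuts for working people".split(),
    "people deserve a kinder politics".split(),
]
scores = tf_idf(speeches)

# 'people' appears in two of the three speeches, so its TF-IDF is low;
# terms unique to one speech, like 'brexit', score highest there.
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

Note how a word that appears in every document gets idf = ln(1) = 0, so ubiquitous words like ‘people’ are automatically downweighted without any fixed stop-word list.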
We also see Jeremy Corbyn’s plan for a ‘kinder’ politics emerge in his first conference speech as leader in 2015, alongside the Grenfell Tower disaster mentioned in 2017.</p><p>Names such as ‘Harry’ and ‘Maurice’ that crop up here were intriguing at first glance. These refer to ‘Harry Beckough’ and ‘Maurice Reeves’: respectively, a longstanding Conservative member and a furniture shop owner whose premises were burned to the ground during the London riots.</p><h3>Complexity Consideration</h3><p>There are a number of ways that we can measure the complexity of a text, or in this case a speech. For this piece we choose the average number of syllables per word. The data for this was taken from the quanteda package and we can visualise the results as follows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZTe3bovWSpvwRNt-OUdCoQ.png" /></figure><p>We can see pronounced variation between different leaders in this plot. Ed Miliband and David Cameron, the leaders of Labour and the Conservatives who gave speeches between 2010–2014 and 2010–2015, respectively, had a much lower complexity than the most recent leaders such as Jeremy Corbyn of Labour and Vince Cable of the Liberal Democrats, who together account for the top 6 most complex speeches.</p><p><em>We used mean syllable count in this piece as a metric for speech complexity as it is simple for a layperson to understand. That said, there are many more subtle and interesting complexity measures available through </em><em>quanteda, such as the Flesch–Kincaid readability score.</em></p><h3>A Different Way of Deciding Elections?</h3><p>The <a href="https://www.electoral-reform.org.uk/voting-systems/types-of-voting-system/first-past-the-post/">First Past the Post system</a> is often bemoaned in the UK as being unsuitable for modern-day politics. 
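One tongue-in-cheek alternative, explored next, scores each word by its Scrabble value. A per-word scorer is easy to sketch using the standard English Scrabble letter values (the post itself uses quanteda rather than this hand-rolled Python version):

```python
# Standard English Scrabble letter values.
LETTER_SCORES = {
    **dict.fromkeys("eaionrtlsu", 1), **dict.fromkeys("dg", 2),
    **dict.fromkeys("bcmp", 3), **dict.fromkeys("fhvwy", 4),
    "k": 5, **dict.fromkeys("jx", 8), **dict.fromkeys("qz", 10),
}

def scrabble_score(word):
    """Sum of letter values, ignoring any non-letter characters."""
    return sum(LETTER_SCORES.get(ch, 0) for ch in word.lower())

def mean_scrabble_score(tokens):
    """Average Scrabble score per word across a speech."""
    return sum(scrabble_score(t) for t in tokens) / len(tokens)

print(scrabble_score("unequivocally"))   # 30, Nick Clegg's record word
print(scrabble_score("czechoslovakia"))  # 37, Theresa May's (disallowed) proper noun
```

The two example words reproduce the scores quoted for the leaders below, which is a reassuring sanity check on the letter table.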
Now, it is not my place to comment on First Past the Post, but if pushed to suggest another system, the aforementioned quanteda package does give us another option…</p><p>We can calculate the mean Scrabble score per word of each leader’s party conference speech each year! First let us observe the most impressive efforts that the politicians managed:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ePj71SrKy51Q3PtCNCF7FQ.png" /></figure><p>Theresa May managed an incredible score of 37 in 2018 with ‘Czechoslovakia’, but this would of course be disqualified for being a proper noun. As a result, Nick Clegg holds the record with 30 points scored for ‘unequivocally’! We can also visualise the mean score per word as follows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tbJ7vExa7b4qxlQHZGHGiw.png" /></figure><p>As we can see, the Conservatives, who have been in power since 2010, would not win a single year should elections be decided by Scrabble. In fact, the Liberal Democrats would win 6 of the 9 years we have studied, with Labour, under Jeremy Corbyn, taking the other 3; I’m sure both parties would be happy with that in hindsight!</p><p>Just in case anyone was under any illusion: of course mean Scrabble score is a poor way of deciding elections and I am not endorsing its use. At the very least, a game of Pictionary would be more appropriate…</p><h3>Takeaways</h3><p>With that, I end my brief foray into British political speeches. While I have barely begun to scratch the surface of Natural Language Processing (NLP) methods, I hope that I have shown how powerfully these techniques can summarise large pieces of text through sentiment, TF-IDF and syllable complexity.</p><p>I had minimal experience with NLP methods upon embarking on this project and would like to thank WDSS (in particular, Janique Krasnowska) for supporting me until completion. 
I feel that I have learned a lot and certainly furthered my knowledge and experience. I would suggest that anyone who would like to conduct data science research outside their degree look out for opportunities with WDSS and seize them with both hands; I will certainly be looking out for more chances!</p><h3>Appendix: Summary Table of All Speech Metrics</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tSTmdX_7l7_H6zbYCvTFDA.png" /><figcaption>Part 1</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MYeWM0QmFZuafb_03OT2nQ.png" /><figcaption>Part 2</figcaption></figure><p>This post was written by <a href="https://www.linkedin.com/in/ewan-yeaxlee-7b13a9181/">Ewan Yeaxlee</a>. Alternative image text by Dominika Daraz. Article licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a>.</p><hr><p><a href="https://medium.com/geekculture/a-data-driven-dive-into-uk-party-conference-leaders-speeches-211afe3b1bd9">A Data-Driven Dive into UK Party Conference Leaders’ Speeches</a> was originally published in <a href="https://medium.com/geekculture">Geek Culture</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>