Stories by Rajanie Prabha on Medium

Graph Analysis Made Easy with PyG Explainability

Rajanie Prabha — Sun, 14 May 2023 05:03:50 GMT

By: Anh Hoang Nguyen, Rajanie Prabha, Kevin Su as part of the Stanford CS224W course project.

Greetings, fellow graph enthusiasts! Do you want to understand why your GNN is doing what it’s doing, without having to resort to mind-reading or witchcraft? Staring at your model for a really long time might work. In case that doesn’t give you the ‘why’ of things, you can turn to the PyTorch Geometric (PyG) Explainability module. Please refer to the Colab for the full code!

Quick outline of the blog:

First, we train a GNN model on the node property prediction task using the ogbn-arxiv dataset
Then, we’ll use PyG’s explainability module to apply the GNNExplainer algorithm (Ying et al. 2019) to explain this model’s predictions, and show how to visualize the explanation result
Next, we discuss explanation evaluation metrics
Finally, we take a closer look at PyG’s implementation of the GNNExplainer algorithm

Why AI Explainability Matters: Trust, Ethics, and Accountability

Explainability in Machine Learning is an ongoing effort in the research community. In traditional machine learning pipelines, neural networks are often viewed as black boxes that learn arbitrary non-linear functions to optimize certain objective functions. Many issues can arise, however, because there are no guarantees that neural networks actually perform the task they were optimized to do. Unwanted behavior can creep into neural networks from bad data, or from artifacts of the inductive bias of the network. The field of machine learning explainability aims to build trust and transparency in models by providing important context for why a machine learning model made its predictions.

The PyG community has been actively working on an explainability framework along with many benchmark datasets, evaluation methods, etc., to start exploring the world of interpretability in Graph Machine Learning.

What PyG’s Explainability Module Can Do:

PyG’s explainability module provides a set of tools for understanding how our GNNs make decisions based on graphs. By providing detailed visualizations and explanations of the decision-making process, it lets us gain a deeper understanding of the inner workings of our models and make more informed decisions about how to improve them.

The Explainability Toolset

The PyG Explainability module has four main parts:

1. Explainer: Class for instance-level explanations of GNNs. Explainer is the centerpiece of the Explainability module. On a high level, it takes in:

a model to explain,
explanation configurations, and
an explainability algorithm (represented by ExplainerAlgorithm class). The output of this class is an explanation wrapped in the Explanation class.

High-level overview of the PyG Explainability framework [Credits: https://medium.com/@pytorch_geometric/graph-machine-learning-explainability-with-pyg-ff13cffc23c2]

2. Explanations: The Explanation class is a type of Data or HeteroData object that holds masks for different components of the graph data such as nodes, edges, and features, along with their attributes. With the help of the Explanation class, one can extract the induced explanation subgraph, which consists of non-zero explanation attributions, and the complement to the explanation subgraph. The class also provides methods for thresholding and visualizing the explanation.

3. Algorithms: ExplainerAlgorithm is an abstract base class for implementing explainer algorithms. We use GNNExplainer to explain the predictions of a model trained on the ogbn-arxiv dataset. Given a trained model and a prediction, GNNExplainer identifies a subgraph structure and a subset of node features that are most influential for the prediction.

4. Explanation Metrics: Metrics to judge the quality of explanations [Explained later in detail].

Okay, but how do these interpretations work?

In the context of GNNs, an “explanation” refers to a subset of the original graph that is represented as a mask or subgraph. This subset consists of weighted nodes, edges, and possibly node features, and the weights assigned to these entities reflect their relative significance in explaining the model’s results.

More formally, we can define our problem as follows.

Let G = (V, E) represent a graph with |𝑉 | nodes and |𝐸| edges. Each node can have d-dimensional features. Also, we can treat our GNN model as a function f: V → Y where in the context of node classification, Y is the finite set of possible labels.

Our explanation for the predicted class 𝑦ₜ of a target node 𝑣ₜ will consist of an edge mask M_E(𝐸, 𝑓, 𝑣ₜ, 𝑦ₜ) ∈ ℝ^(|𝑉|×|𝑉|) and, for a complex model, a node feature mask M_NF(𝑉, 𝑓, 𝑣ₜ, 𝑦ₜ) ∈ ℝ^(|𝑉|×d), where each element is the importance score of that edge or node feature for the prediction. The importance score (sometimes also referred to as the weight) lies in the range [0,1] for soft masking, and in {0,1} for hard masking.

(See Amara et al. (2022) for more details)
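To make the soft versus hard masking distinction concrete, here is a minimal sketch in plain Python. The function name `topk_hard_mask` and the example scores are hypothetical and only for illustration; this is not PyG's implementation, just one simple way to turn soft scores in [0,1] into a {0,1} mask by keeping the k largest entries.

```python
def topk_hard_mask(soft_mask, k):
    """Keep the k largest scores as 1 and set every other entry to 0."""
    # Indices sorted by descending score; ties keep their original order.
    ranked = sorted(range(len(soft_mask)), key=lambda i: soft_mask[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(soft_mask))]


# Hypothetical soft importance scores for five edges.
soft_edge_mask = [0.91, 0.05, 0.40, 0.77, 0.12]
hard_edge_mask = topk_hard_mask(soft_edge_mask, k=2)
print(hard_edge_mask)  # [1, 0, 0, 1, 0]
```

The soft mask ranks every edge, while the hard mask only says which edges are in the explanation and which are not.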

Dataset ogbn-arxiv: Scholarly Network Exploration

Let’s unlock the secrets of scholarly communication networks. The ogbn-arxiv dataset, an Open Graph Benchmark (OGB) dataset, provides a wealth of graph data that can be used to gain valuable insight into the world of research papers and their references. With roughly 170 thousand nodes and over a million edges, this dataset offers a unique opportunity to understand the relationships and patterns of scholarly work.

Each paper is represented as a node, and each edge between nodes represents a citation relationship, where one paper cites another. The dataset also includes a 128-dimensional feature vector for each paper, which is created by averaging the embeddings of words in the paper’s title and abstract. The word embeddings are generated using the skip-gram model applied over the MAG corpus. The dataset contains 169,343 papers and 1,166,243 citation relationships.

Prediction task: a 40-class classification problem that aims to predict the primary categories of arXiv papers where categories are subject areas of the arXiv CS papers, such as cs.AI, cs.LG, and cs.OS.

Data split: the idea is to train models on past papers and subsequently use them to predict the topic areas of newly released papers. Specifically, papers published through 2017 are used for training, those published in 2018 for validation, and those published from 2019 onward for testing.
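As a rough illustration of such a time-based split, here is a small sketch in plain Python. It assumes the cutoffs of the official OGB split (train through 2017, validate on 2018, test from 2019 on); the function name `split_by_year` and the list of years are hypothetical. In practice you would not write this yourself, since the OGB dataset object provides the split via `get_idx_split()`.

```python
def split_by_year(years):
    """Return (train, valid, test) index lists for a time-based split."""
    train = [i for i, y in enumerate(years) if y <= 2017]
    valid = [i for i, y in enumerate(years) if y == 2018]
    test = [i for i, y in enumerate(years) if y >= 2019]
    return train, valid, test


# Hypothetical publication years for six papers.
paper_years = [2015, 2019, 2018, 2017, 2020, 2016]
train_idx, valid_idx, test_idx = split_by_year(paper_years)
print(train_idx, valid_idx, test_idx)  # [0, 3, 5] [2] [1, 4]
```

Splitting by time rather than at random mimics the real use case: the model never sees papers from the "future" during training.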

Let’s put on our coding caps and explore this resource.

Model: Graph Convolution Networks

Let’s use a Graph Convolutional Network (GCN) to build our GNN model (Kipf and Welling 2017). PyG’s built-in GCNConv layer will come in handy for our implementation. Each hidden GCNConv layer is followed by a Batch Normalization (BN) layer, a ReLU activation, and a Dropout layer.

The below snippet shows how you can define various layers for your network:

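# Inside the model's __init__; assumes `import torch` and
# `from torch_geometric.nn import GCNConv`.
# First layer maps input_dim -> hidden_dim, middle layers keep hidden_dim,
# and the last layer maps hidden_dim -> output_dim.
self.convs = torch.nn.ModuleList(
        [GCNConv(in_channels=input_dim, out_channels=hidden_dim)] +
        [GCNConv(in_channels=hidden_dim, out_channels=hidden_dim)
        for i in range(num_layers - 2)] +
        [GCNConv(in_channels=hidden_dim, out_channels=output_dim)]
        )
# One BatchNorm per hidden layer (none after the output layer).
self.bns = torch.nn.ModuleList(
        [torch.nn.BatchNorm1d(num_features=hidden_dim)
        for i in range(num_layers - 1)]
        )
# Normalize over the class dimension explicitly.
self.softmax = torch.nn.LogSoftmax(dim=-1)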

And, this is how you can proceed with the forward function:

out = None
for i in range(len(self.bns)):
    x = self.convs[i](x, adj_t)
    x = self.bns[i](x)
    x = torch.nn.functional.relu(x)
    x = torch.nn.functional.dropout(x, p=self.dropout, training=self.training)

x = self.convs[-1](x, adj_t)
if not self.return_embeds:
    x = self.softmax(x)
out = x

Here, adj_t is the graph connectivity in COO format with shape [2, num_edges] (the training loop below passes data.edge_index for it).
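If the COO layout is new to you, the following sketch shows the idea in plain Python, without any PyG dependency. The helper `to_coo` and the three example citations are hypothetical; in PyG the same structure is stored as an integer tensor rather than nested lists.

```python
def to_coo(edges):
    """Convert a list of (source, target) pairs to two parallel rows."""
    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    return [sources, targets]


# Hypothetical citations: paper 0 cites 1, paper 2 cites 1, paper 3 cites 0.
citations = [(0, 1), (2, 1), (3, 0)]
edge_index = to_coo(citations)
print(edge_index)  # [[0, 2, 3], [1, 1, 0]]
```

Column j of the result describes edge j, so the structure always has 2 rows and num_edges columns.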

And this is how you can train your model. We use the NLL (Negative Log Likelihood) loss for training and accuracy as the evaluation metric.

import numpy as np
import torch

def train(model, data, train_idx, optimizer, loss_fn):
  model.train()
  optimizer.zero_grad()
  # Full-batch forward pass, then keep only the training nodes.
  o = model(data.x, data.edge_index)
  o = o[train_idx]
  y = torch.flatten(data.y[train_idx])
  loss = loss_fn(o, y)
  loss.backward()
  optimizer.step()
  return loss.item()

def _eval_acc(self, y_true, y_pred):
  # y_true and y_pred are NumPy arrays of shape [num_nodes, num_tasks].
  acc_list = []
  for i in range(y_true.shape[1]):
    # NaN != NaN, so this mask drops unlabeled (NaN) entries.
    is_labeled = y_true[:,i] == y_true[:,i]
    correct = y_true[is_labeled,i] == y_pred[is_labeled,i]
    acc_list.append(float(np.sum(correct))/len(correct))
  return {'acc': sum(acc_list)/len(acc_list)}


### Training Params:
args = {
  'device': device,
  'num_layers': 3,
  'hidden_dim': 64,
  'dropout': 0.5,
  'lr': 0.01,
  'epochs': 100 
  }

What do we get?

Train Accuracy: 70.73%, Valid Accuracy: 70.17% Test Accuracy: 69.36%

Now, you are wondering, What? Why? How? What does it mean? We can answer some of these burning questions via the PyG explainability framework!

Let’s open this black box:

IT IS GAME TIME!

Below is the code snippet that initializes the Explainer class. It explains the prediction for a single node in the graph and keeps only the top 40 contributions. The Explainer object is instantiated with various configuration parameters, including the model being explained, the algorithm used for generating explanations (GNNExplainer in this case), and the explanation_type, which is set to 'model' to indicate that we explain the model's own prediction rather than the underlying phenomenon (the ground-truth label).

The node_mask_type and edge_mask_type parameters indicate that masks are learned over individual node attributes ('attributes') and over whole edges ('object'). Additionally, the model_config dictionary specifies the task being performed by the model (multiclass_classification), the level at which the explanation is generated (node), and the form of the model output (log_probs). The threshold_config dictionary specifies the method used to threshold the contributions and is set to keep the top 40.

# Explainability for a single node
from torch_geometric.explain import Explainer, GNNExplainer
from torch_geometric.explain.metric import unfaithfulness, fidelity, characterization_score

# Threshold contributions by the top 40 features.
topk = 40
node_index=10

explainer_individual = Explainer(
    model=model,
    algorithm=GNNExplainer(epochs=200),
    explanation_type='model',
    node_mask_type='attributes',
    edge_mask_type='object',
    model_config=dict(
        mode='multiclass_classification',
        task_level='node',
        return_type='log_probs',
    ),
    threshold_config=dict(threshold_type  = 'topk', value=topk)
)

# Get explanation for node indexed 10
explanation_individual = explainer_individual(data.x, data.edge_index, index=node_index)

In order to interpret the prediction for node indexed 10, the GNNExplainer provides the below feature importance bar plot for the top 10 features.

We can visualize the feature importance and the explanation subgraph with:

path_features = "feature_importance.png"
explanation_individual.visualize_feature_importance(path_features, top_k=10)

path_graph = "graph_importance.png"
explanation_individual.visualize_graph(path_graph)

Feature importance for node 10 in the ogbn-arxiv dataset

Graph Visualization for node 10, top 40 nodes

How do we know if this makes sense?

Metrics are all you need!

unfaithfulness

This metric evaluates how faithful an Explanation is to the underlying GNN predictor, as described in the paper (Agarwal and Queen 2023). GEF (graph explanation unfaithfulness) can be expressed as:

GEF(y, ŷ) = 1 − exp(−KL(y ‖ ŷ))

where y refers to the prediction probability vector obtained from the original graph, and ŷ (y_hat) refers to the prediction probability vector obtained from the masked subgraph. The Kullback-Leibler (KL) divergence score quantifies the distance between the two probability distributions, so GEF is 0 when the two predictions agree and approaches 1 as they diverge.
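As a sanity check on the intuition, here is a minimal sketch in plain Python. It assumes the form GEF = 1 − exp(−KL(y ‖ ŷ)) as written in the PyG documentation; the function names and the probability vectors are hypothetical, and PyG's own `unfaithfulness` function works on an Explainer and an Explanation rather than on raw lists.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def graph_explanation_unfaithfulness(y, y_hat):
    """GEF = 1 - exp(-KL(y || y_hat)); 0 means the two predictions agree."""
    return 1.0 - math.exp(-kl_divergence(y, y_hat))


# Hypothetical class probabilities on the original graph.
y = [0.7, 0.2, 0.1]
gef_same = graph_explanation_unfaithfulness(y, [0.7, 0.2, 0.1])
gef_close = graph_explanation_unfaithfulness(y, [0.6, 0.3, 0.1])
gef_far = graph_explanation_unfaithfulness(y, [0.1, 0.2, 0.7])
print(gef_same, gef_close, gef_far)
```

The closer the prediction on the masked subgraph stays to the original prediction, the lower (better) the unfaithfulness score.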

fidelity

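Fidelity evaluates the contribution of the produced explanatory subgraph to the initial prediction, either by giving only the subgraph to the model (fidelity-) or by removing it from the entire graph (fidelity+). The fidelity scores capture how well an explainable model reproduces the natural phenomenon or the GNN model logic. It evaluates the fidelity of an Explainer given an Explanation, as described in the paper (Amara et al. 2022).

Below, yᵢ is the true label of node i, ŷᵢ is the model’s prediction on the full graph, ŷᵢ^(G_S) is the prediction when only the explanation subgraph is kept, and ŷᵢ^(G_C∖S) is the prediction when the explanation subgraph is removed.

For phenomenon explanations, the fidelity scores are given by:

fid+ = (1/N) Σᵢ | 𝟙(ŷᵢ = yᵢ) − 𝟙(ŷᵢ^(G_C∖S) = yᵢ) |
fid− = (1/N) Σᵢ | 𝟙(ŷᵢ = yᵢ) − 𝟙(ŷᵢ^(G_S) = yᵢ) |

For model explanations, the fidelity scores are given by:

fid+ = 1 − (1/N) Σᵢ 𝟙(ŷᵢ^(G_C∖S) = ŷᵢ)
fid− = 1 − (1/N) Σᵢ 𝟙(ŷᵢ^(G_S) = ŷᵢ)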
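For model explanations, the two scores boil down to counting how often the predicted label survives when the explanation subgraph is removed (fid+) or kept on its own (fid-). The sketch below shows this counting in plain Python, under that assumption. The function name `model_fidelity` and the label lists are hypothetical; PyG's `fidelity` function does the masking and the forward passes for you.

```python
def model_fidelity(pred_full, pred_without_expl, pred_only_expl):
    """Return (fid+, fid-) for model explanations.

    pred_full:          predicted labels on the original graph
    pred_without_expl:  predicted labels after removing the explanation subgraph
    pred_only_expl:     predicted labels when only the explanation subgraph is kept
    """
    n = len(pred_full)
    same_without = sum(a == b for a, b in zip(pred_full, pred_without_expl))
    same_only = sum(a == b for a, b in zip(pred_full, pred_only_expl))
    return 1.0 - same_without / n, 1.0 - same_only / n


# Hypothetical predicted labels for four explained nodes.
fid_plus, fid_minus = model_fidelity(
    pred_full=[3, 1, 2, 2],
    pred_without_expl=[0, 1, 5, 7],   # prediction changes for 3 of 4 nodes
    pred_only_expl=[3, 1, 2, 0],      # prediction is kept for 3 of 4 nodes
)
print(fid_plus, fid_minus)  # 0.75 0.25
```

A high fid+ says that removing the explanation breaks the prediction, and a low fid- says the explanation alone is enough to reproduce it.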

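To compute the metrics:

from torch_geometric.explain.metric import fidelity, characterization_score

# explainer_metrics is assumed to be an Explainer configured like
# explainer_individual above; its definition is not shown in this post.
explanation_metrics = explainer_metrics(data.x, data.edge_index)
print(f'Generated explanations in {explanation_metrics.available_explanations}')

# fidelity returns the pair (fid+, fid-).
fid_pm = fidelity(explainer_metrics, explanation_metrics)
print("Fidelity:", fid_pm)

char_score = characterization_score(fid_pm[0], fid_pm[1])
print("Characterization score:", char_score)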

The Fidelity values achieved are (0.9929, 0.4951)

Characterization score: 0.7426

But what exactly does the above result mean for our explanation?

Before going into what this score means, we need some background on what types of explanations are considered good.

Given a GNN model, there can be many possible explanations, which Amara et al. (2022) group into two categories:

Sufficient Explanation: An explanation is considered sufficient if it can independently lead to the model’s initial prediction. The same prediction may have multiple sufficient explanations due to the graph’s different configurations. The fidelity score fid– of a sufficient explanation is close to 0.
Necessary Explanation: An explanation is considered necessary if the model’s prediction changes when it’s removed from the original graph. Necessary explanations have a fidelity score fid+ close to 1.

An explanation is a characterization of the prediction if it is both necessary and sufficient.

The characterization_score is recommended as a global evaluation metric. Because it is a weighted harmonic mean of fid+ and 1 − fid–, it balances the sufficiency and necessity requirements for an explanation. Ring any bells? We can’t read your mind, but we bet you are also thinking about the F1-score that combines precision and recall.

Characterization score:

charact = (w₊ + w₋) / (w₊ / fid+ + w₋ / (1 − fid–))

The characterization score with equal weights on fid+ and 1 − fid– is low as soon as one of the two terms is low.
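To see the "low as soon as one term is low" behavior, here is a small sketch of a weighted harmonic mean of fid+ and 1 − fid– in plain Python. The function name `char_score_sketch` and the fidelity values are hypothetical and chosen only to illustrate the shape of the metric; for real evaluations use PyG's `characterization_score`.

```python
def char_score_sketch(fid_plus, fid_minus, w_plus=0.5, w_minus=0.5):
    """Weighted harmonic mean of fid+ and (1 - fid-)."""
    return (w_plus + w_minus) / (w_plus / fid_plus + w_minus / (1.0 - fid_minus))


# Hypothetical values: necessary AND sufficient explanation.
balanced = char_score_sketch(fid_plus=0.9, fid_minus=0.1)
# Hypothetical values: necessary but NOT sufficient explanation.
lopsided = char_score_sketch(fid_plus=0.9, fid_minus=0.9)
print(round(balanced, 2), round(lopsided, 2))
```

An arithmetic mean would still give the lopsided case a score of 0.5, whereas the harmonic mean pulls it down sharply, just as the F1-score does for precision and recall.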

Our explanation has a characterization score of 0.7 which means it is pretty good!

Implementing Your Own Explainer Algorithm

To implement your own custom explainability function, you can extend the ExplainerAlgorithm abstract base class. All you need to implement are the following two abstract methods:

forward(model, x, edge_index, target, index): The function that computes the explanations. model is the model used for explanations, x is the input node features, edge_index is the input edge indices, target is the target of the model that is being explained (used in phenomenon explanations), and index is the index of the model output to explain.
supports(self) -> bool: Returns whether or not the algorithm supports the current explainer_config and model_config parameters

The ExplainerAlgorithm abstract base class also contains a list of helpful utility functions that can be used when implementing your explainer algorithm. See the source for more details.

Finally, A Closer Look at GNNExplainer

In case you are still lost, let’s take a closer look at the GNNExplainer algorithm under the hood so that we can examine how this explainability algorithm works.

In the GNNExplainer problem setup, we perform multi-class node classification. We let G be a graph on edges E and nodes V with d-dimensional node features 𝞦 = {𝑥₁, …, 𝑥ₙ}. Additionally, let f be a label function on nodes f: V → {1, …, C} that maps each node to one of the C classes. The GNN model ɸ seeks to approximate f.

Recall that at layer l, the update of a GNN model ɸ involves three key computations.

1. First, the model computes a message for every pair of nodes. This message function is

m(l)ᵢⱼ = MSG(h(l-1)ᵢ, h(l-1)ⱼ, rᵢⱼ)

where the h terms are the representations of nodes 𝑣ᵢ and 𝑣ⱼ in layer l-1 and rᵢⱼ is the relation between the nodes.

2. Second, for each node 𝑣ᵢ the GNN aggregates the messages from 𝑣ᵢ’s neighborhood Nᵥᵢ via an aggregation function

M(l)ᵢ = AGG({m(l)ᵢⱼ | 𝑣ⱼ ∈ Nᵥᵢ})

3. Finally, the GNN takes the aggregated message M(l)ᵢ and 𝑣ᵢ’s previous representation h(l-1)ᵢ, and non-linearly transforms them to obtain 𝑣ᵢ’s representation

h(l)ᵢ = UPDATE(M(l)ᵢ, h(l-1)ᵢ)
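The three steps above can be sketched in a few lines of plain Python. This is a deliberately simplified toy layer, not GCNConv: node states are scalars, the message is just the neighbor's state, the aggregation is a mean, and the update is a fixed 50/50 mix followed by ReLU. All names and numbers are hypothetical.

```python
def message_passing_layer(h, neighbors):
    """One simplified GNN layer on scalar node states."""
    new_h = {}
    for i, h_i in h.items():
        # 1. Message: here simply the neighbor's previous representation.
        messages = [h[j] for j in neighbors[i]]
        # 2. Aggregate: mean over the neighborhood (0 if the node is isolated).
        agg = sum(messages) / len(messages) if messages else 0.0
        # 3. Update: combine with the node's own state, then apply ReLU.
        new_h[i] = max(0.0, 0.5 * h_i + 0.5 * agg)
    return new_h


# Hypothetical 3-node star graph: node 0 is connected to nodes 1 and 2.
h0 = {0: 1.0, 1: 3.0, 2: -5.0}
nbrs = {0: [1, 2], 1: [0], 2: [0]}
h1 = message_passing_layer(h0, nbrs)
print(h1)  # {0: 0.0, 1: 2.0, 2: 0.0}
```

Note that after one layer the new state of node 1 depends only on nodes 0 and 1. Node 2 is outside its one-hop computation graph, so changing node 2's features cannot affect it; this locality is what an explainer can exploit.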

The key insight of GNNExplainer is that the computation graph of a node 𝑣, which is defined by the aggregation procedure, fully determines all of the information the GNN uses to make a prediction for that node. The mathematical details of the specific optimization that GNNExplainer utilizes are out of scope for this tutorial, mostly because there is a different objective function for single-instance explanations vs. joint learning of graph structural and node feature information. For greater detail, see (Ying et al., 2019).

A figure demonstrating how aggregation determines node + node feature importance, from Ying et al., 2019.

Conclusion

In this blog post, we trained a GCN on the ogbn-arxiv dataset (an Open Graph Benchmark (OGB) dataset) and showed how to use PyG's built-in explainability module to produce explanations for both node features and graph structure. Then, we examined ways to evaluate the quality of the explanations, and also how the GNNExplainer algorithm works under the hood. Hopefully, this tutorial has given you a greater understanding of the capabilities of the Explainability module in PyG and now life has started to make sense again!

References:

Graph Analysis Made Easy with PyG Explainability was originally published in Stanford CS224W: Machine Learning with Graphs on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Prag Saga

Rajanie Prabha — Mon, 21 Feb 2022 09:18:00 GMT

Hey y’all,

It’s 12:53 am and owing to my bad habits, I am high on caffeine.

This post was long overdue (it had been sitting in my drive in draft mode for 2 years now), so I had to change the opening to reflect my present life. It’s pretty much the same, though: still trying to find that work-life balance.

December 2017

With a two-week Christmas break and being a proud owner of a Schengen visa, this seemed like the perfect opportunity to visit someplace in Europe. How to decide where? The answer to that was pretty simple because Prague is known to be one of the cheapest party places across Europe. Forgive me, I’m a student, and I’m not earning yet. I tried making plans with some of my friends, but nothing seemed to work out, because people either had budget issues or they had folks visiting them. Me being myself, I had to go, even if I had to go alone.

I booked a hostel in the city center and a BlaBlaCar for 29th December morning. The interesting part was that I talked to one of my friends the night before, and he wanted to go too. I was so excited to have some company because yes, I was a little scared of a solo trip. So, he, his roommate, and I packed our bags and had an amazing car ride to Prague.

Prague is a beautiful city, a city with a typical touristy vibe. We roamed around the city: Charles Bridge, Petrin Tower, the Castle, museums, the Christmas market, Old Town Square, etc. We kept walking, observing, and feeling the excitement of the city. I would suggest Popocafepetl for their cheap alcohol and lively atmosphere. We went to a lot of chocolate factories tasting chocolates for free (YASS), had amazing Indian food (my friends were really into Indian food, well we don’t have to guess why) and some tasteless (we can agree to disagree) Czech food, went for a stupid (weird?) haunted underground tour but, all in all, had a lot of fun. We also explored Zizkov in Praha 3, which had a typical hippie vibe, ending the night with really good Mexican food (would highly recommend Las Adelitas), and happy lively bars. Freaking go to Bukowski. Period.

The guys had to leave on the 31st, so I was on my own for the next two days. I didn’t visit a lot of touristy places after that, but I did walk around and explore the city with the two new Australian friends I had just made at the hostel I was staying at. Ickle and Bickle (had to anonymize the names LOL) are two amazing crazy humans who I will never forget. I met a lot of other people in the hostel itself, Canadians, Germans, Mexicans, Indians, Australians, Americans, etc. Nobody had any plans for the evening, and we were all just hanging out in the common area of the hostel, drinking and chilling. I was literally the only sober person in the entire city that night (New Year’s Eve), and trust me, I had as much fun as the other guys and maybe even more. Intoxication is not always proportional to having fun. Also, I didn’t want to get drunk, because it was my first solo experience and I didn’t want to take any risks. At around 10 PM, we got out on the road and just walked towards the Charles Bridge to watch the fireworks. Oh, man! The fireworks were so insane, I was shaking for the next few hours. The bridge got so crowded that I felt squished, I was shouting (literally) and holding onto Ickle and Bickle. I thought I was going to die that night and I thought that was it. But somehow, we got pushed away from the crowd and I could eventually breathe. I can safely say that it was something I can never forget. The craziness, the drunk atmosphere, the fireworks, the crowd, it was all so thrilling. We watched the fireworks, took a lot of pictures, and hugged each other at midnight. A part of me missed my people who were not there with me to look at the fireworks, to hug, and to wish. We sang, we danced, we got back to the hostel, we played pool till 5 in the morning, we talked about life and death and all the weird things in the world. Exhausted, I went to bed at 6 in the morning with a little ache in my heart, for I would never see any of these guys again.
It was one hell of a trip. I met so many people, learned about so many cultures, traditions, languages, festivals, and drinking habits!

That night was one of the most amazing nights of my life. It was absolutely what I wanted it to be like. Fun and insane. Prague is a beautiful place. I fell in love with the city.

Solo trips are subjective. You might just want to chill alone or you might just want to explore the world or you might just want to meet new people and hang around. Mine had all these things. Being an extrovert, I love meeting people, so that was the highlight of my trip (also chocolate factories) but yes, I would say that everyone should take a solo trip because it will remove the baggage of always finding someone to go on a trip with you, and it will make you realize that having fun on your own is something cool. Not just the city, explore your pure interests. Cheers!

Thank you, Prague :)

Also, thanks to Ronn Jacob for helping me with the editing!

Tacotron-2 : Implementation and Experiments

Rajanie Prabha — Fri, 03 Aug 2018 13:11:04 GMT

Why do we want to do Text-to-Speech?

There is not one but many places where TTS can be used: accessibility features for people with little to no vision, communication aids for people who cannot speak, voice assistants such as Siri, screen readers, automated telephony systems, audiobooks, easier language learning, etc.

In December 2017, Google released its new research called ‘Tacotron-2’, a neural network implementation for Text-to-Speech synthesis. Before moving forward, I would like you to check out the results they posted on their blog https://google.github.io/tacotron/publications/tacotron2/ and get excited about the mechanism.

Aren’t the results awesome and so human-like? Yes, that’s what motivated me to figure out how they did it and try to implement it eventually. I worked on Tacotron-2’s implementation and experimentation as a part of my grad school course for three months with a Munich-based AI startup called Luminovo.AI. I wanted to develop such a synthesizer on Angela Merkel’s speech.

SEQ2SEQ MODEL WITH ATTENTION

The working of the system was described by Jonathan Shen and Ruoming Pang, Software Engineers, Google Brain and Machine Perception Teams:

“In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.”

The first part of the model is the Seq2Seq architecture, which is responsible for converting text into mel-spectrograms; these spectrograms are then fed into a WaveNet model to produce audio waveforms. One interesting thing is that these two parts of the Tacotron architecture (Seq2Seq and WaveNet vocoder) can be trained independently. I worked on the Seq2Seq model.
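To get a feel for the numbers in the quote, here is a small back-of-the-envelope calculation in Python. It only uses the figures quoted above (80 mel channels, one frame every 12.5 ms, 24 kHz audio); the helper `spectrogram_shape` is hypothetical and ignores windowing and edge padding, so real frame counts can differ by a frame or two.

```python
SAMPLE_RATE_HZ = 24_000   # waveform sample rate from the quote
FRAME_SHIFT_S = 0.0125    # one spectrogram frame every 12.5 ms
N_MELS = 80               # spectrogram channels per frame

# Number of audio samples the vocoder must produce per spectrogram frame.
hop_samples = round(SAMPLE_RATE_HZ * FRAME_SHIFT_S)


def spectrogram_shape(duration_s):
    """Approximate (frames, channels) of the mel-spectrogram for a clip."""
    n_frames = round(duration_s / FRAME_SHIFT_S)
    return n_frames, N_MELS


print(hop_samples)             # 300
print(spectrogram_shape(2.0))  # (160, 80)
```

So the Seq2Seq part predicts 80 frames per second of speech, and the vocoder expands every frame into 300 waveform samples.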

The model is an encoder-attention-decoder setup that uses ‘location-sensitive attention’. The first part is an Encoder, which converts the character sequence into a hidden feature representation. This representation is later consumed by the Decoder to predict spectrograms. Since I was using a German dataset, I made sure that my character set covered the German alphabet.

The Encoder is composed of 3 convolutional layers, each containing 512 filters of shape 5 x 1, followed by batch normalization and ReLU activations.
The output of the final convolutional layer is passed into a single bi-directional LSTM layer containing 512 units (256 in each direction) to generate the encoded features.
The next part is the attention network, which takes the encoder output as input and tries to summarize the full encoded sequence as a fixed-length context vector for each decoder output step.

ATTENTION-BASED MODELS FOR SPEECH RECOGNITION:

The attention mechanism used here takes into account both the location of the focus in the previous step and the features of the input sequence.

Let’s say we have data x = {x(1), x(2), …, x(N)}. We pass this data to the encoder, which produces an encoded output sequence h = {h(1), h(2), …, h(N)}.

A(i) = Attention( s(i-1), A(i-1), h ) where s(i-1) is the previous decoding state and A(i-1) is the previous alignment.

s(i-1) is 0 for the first decoding step.

The Attention function is usually implemented by scoring each element in h separately and then normalizing the scores.

G(i) = A(i,1) h(1) + A(i,2) h(2) + … + A(i,N) h(N)

Y(i) ~ Generate ( s(i-1), G(i) )

where Y(i) is the decoder output, A(i) is a vector of attention weights called the alignment, and G(i) is the resulting context vector.

Finally, s(i) = Recurrency ( s(i-1), G(i), Y(i) )

Recurrency is usually an LSTM.
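The "score, normalize, then take a weighted sum" part of the recipe above can be sketched in plain Python as follows. This is only the generic softmax-and-sum step: the scores are hard-coded hypothetical numbers, whereas in the real model they come from a learned function of s(i-1), A(i-1) and h. The names `softmax` and `context_vector` are my own.

```python
import math


def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(alignment, h):
    """G(i) = sum_j A(i, j) * h(j) for vector-valued encoder outputs h(j)."""
    dim = len(h[0])
    return [sum(a * h_j[d] for a, h_j in zip(alignment, h)) for d in range(dim)]


# Hypothetical encoder outputs for three input characters.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]  # hypothetical unnormalized scores for one decoder step
A_i = softmax(scores)
G_i = context_vector(A_i, h)
print(A_i, G_i)
```

Plotting A(i) for every decoder step i gives the alignment image shown later in this post; a clean diagonal means the decoder walks through the text in order.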

DECODER :

The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time. The prediction from the previous time step is first passed through a small pre-net containing 2 fully connected layers of 256 hidden ReLU units. The pre-net output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units. The concatenation of the LSTM output and the attention context vector is then projected through a linear transform to predict the target spectrogram frame. Finally, the predicted mel spectrogram is passed through a 5-layer convolutional post-net which predicts a residual to add to the prediction to improve the overall reconstruction. Each post-net layer consists of 512 filters of shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer.

LOSS FUNCTION:

Summed mean squared error (MSE) from before and after the post-net.

In parallel to spectrogram frame prediction, the concatenation of decoder LSTM output and the attention context is projected down to a scalar and passed through a sigmoid activation to predict the probability that the output sequence has completed. This “stop token” prediction is used during inference to allow the model to dynamically determine when to terminate generation instead of always generating for a fixed duration.
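The stop-token mechanism can be sketched as a simple loop in plain Python. The logits below are hypothetical stand-ins for the scalar projection described above, and the 0.5 threshold and `max_steps` safety cap are assumptions for illustration; the function name `decode_until_stop` is my own.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_until_stop(stop_logits, threshold=0.5, max_steps=1000):
    """Return the number of frames generated before the stop token fires."""
    for step, logit in enumerate(stop_logits[:max_steps], start=1):
        # Stop as soon as the predicted "sequence finished" probability is high.
        if sigmoid(logit) > threshold:
            return step
    # The stop token never fired: fall back to the safety cap.
    return min(len(stop_logits), max_steps)


# Hypothetical per-frame stop logits produced during inference.
logits = [-4.0, -3.0, -1.0, 2.5, 3.0]
n_frames = decode_until_stop(logits)
print(n_frames)  # 4
```

The safety cap matters in practice: a model that has not learned the stop token (or the alignment) would otherwise keep generating frames forever.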

I decided to go with PyTorch for my implementation, tracked the training with TensorBoard, used a Tesla K80 GPU on Google Cloud, connected to server ports with ‘ssh -NfL’, and heavily used JupyterLab during development. [life saver kit]

I referenced various GitHub repositories [1, 2] to understand the paper and the implementation, and to correct bugs in my own code. Due to the natural complexity of the problem, I could not get astonishing human-like speech results, but I learned a lot about Text-to-Speech, and that was the major goal when I started the project.

Some results for reference:

Predicted mel-spectrogram : As you can compare the upper regions, it has a lot of gaps and still needs a lot of training. The right side (solid green) is just padding in one batch.

Target mel-spectrogram

Attention (As you can see in the lower left side, it looks like it is learning to align but it still needs around one week of training to get that perfect diagonal for attention)

All the images above were produced after 50K iterations (1 iteration = 1 batch), i.e. 3 days of training. This model needs around 300K iterations to get anywhere close to human-like. You can see that the predicted mel-spectrograms look pretty nice even when the attention is not learned properly. Save yourself from the trap and care about the attention!

Training loss from tensorboard

Validation loss from Tensorboard

EXPERIMENTS and CONCLUSIONS:

Implementing the model and training it was not as trivial as I initially thought. I came across numerous issues that I want you to know about beforehand, so you can save hours on your GPU.

Study your data. This is the most important part of the project. Listen to your data samples, check the length of text samples, duration of audio samples, etc. You can save a lot of time during training if you know your data well. M-AILABS announced their huge speech dataset earlier this year, covering many different languages. I used the Angela Merkel data from the German female section, which has 12 hours of speech from her public speeches and interviews. This is small compared to LJSpeech (the most popular English dataset, 24 hours of speech). I figured this out only when I started training and spent days observing the results. So, heads up!
TTS is highly computationally expensive. Being a student, I just had access to one GPU (Nvidia Tesla K80) on Google Cloud. Given the structure of the dataset I was using, my GPU only allowed a batch size of 8 while training. Google says they train with a batch size of 64. I first tried a batch size of 2 (because of limited GPU memory), and when the model failed to show any convergence after 2–3 days of training, I sorted my data by text length and audio duration and started training with a batch size of 8. I couldn’t optimize any further with the dataset and the GPU I had. So, plan accordingly.
Teacher-forcing ratio. In teacher-forced training, the model is assisted by true labels, i.e. it uses the current frame of the ground truth to predict the next decoding step. The paper is not clear about what ratio to use. Even if the attention is not learned, the model will predict good frames for training data in teacher-forced mode, but in evaluation mode it will not work because we don’t have the ground truth (I thought the model was working since the predicted mels looked nice regardless of poor alignments). I trained with ratios of 1.0, 0.75 and 0.5 to make the model learn alignments. During eval mode, teacher-forcing should be turned off.
It takes days to train and get alignments. Training a TTS system is a really cumbersome process. It might take around 7–10 days to train the model if you have limited GPU support (we are no Google). And then, debugging the code with such a model is another story.
Hyperparameter tuning is a very important part of the Tacotron-2 system. The batch size, learning rate, teacher-forcing ratio, and batch length are some of the parameters you should pay extra attention to. The model is very sensitive to them, and the right values vary with the dataset!
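One of the tips above was to sort the data by length before batching. The sketch below, in plain Python, shows why that helps: batches of similar-length samples need far less padding, so more of the limited GPU memory goes to real data. The text lengths and both helper functions are hypothetical, and this is a simplified illustration rather than the exact scheme I used.

```python
def length_sorted_batches(lengths, batch_size):
    """Group sample indices into batches of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_cost(lengths, batches):
    """Total number of padded positions across all batches."""
    return sum(
        max(lengths[i] for i in batch) * len(batch) - sum(lengths[i] for i in batch)
        for batch in batches
    )


# Hypothetical text lengths (in characters) for eight training samples.
text_lengths = [12, 95, 14, 90, 11, 99, 13, 97]
naive = [list(range(i, i + 4)) for i in range(0, 8, 4)]  # batches in file order
bucketed = length_sorted_batches(text_lengths, batch_size=4)

print(padding_cost(text_lengths, naive))     # 345
print(padding_cost(text_lengths, bucketed))  # 21
```

A common refinement is to shuffle the order of the batches (not the samples inside them) each epoch, so the model does not always see short sentences first.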

CONCLUSION

Text-to-speech is still a really complex research problem and it was exciting to work on this. My overall experience was amazing and I learnt a lot of things about TTS systems, audio waveforms, recurrent networks, mel-spectrograms, attention mechanisms and I hope that this post can help you in any way in your journey with TTS systems. In future, I would like to see an optimized version of Tacotron-2 model, something which is more robust across languages, easier to train and less computationally heavy.

So, I would just say, preprocess your data well, tune your hyperparameters, log everything on tensorboard and get going! All the best!

Special thanks to Luminovo.AI for their support!