Stories by Texar on Medium

Introducing Texar-PyTorch: An ML Library Integrating the Best of TensorFlow into PyTorch

Texar — Thu, 17 Oct 2019 16:39:31 GMT

Crossposted on the Petuum blog.

We are excited to introduce Texar-PyTorch, an open-source general-purpose machine learning toolkit that supports a broad set of applications with a focus on natural language processing (NLP) and text generation tasks.

Stemming from its already-popular Texar TensorFlow equivalent, Texar-PyTorch integrates many of the best features of TensorFlow into PyTorch. The toolkit is highly customizable, exposing APIs at multiple abstraction levels to suit both novice and experienced users.

In particular, Texar-PyTorch replicates a comprehensive set of useful TensorFlow (TF) modules to significantly enhance PyTorch’s existing functionality, including:

Data: Best practices of tf.data for easy data processing, batching, and iteration, all made efficient by buffered shuffling, caching, and lazy loading. We also replicate TFRecord to ingest arbitrarily complex data types and large files.
Modeling: Abundant functions and excellent modularization of ML models, such as the principled design of sequence models, including text generation decoders, attention mechanisms, and RNNs.
Training: We replicate high-level APIs of TF Estimator and keras.Model but with much greater flexibility, for turnkey model training, evaluation, prediction, TensorBoard visualization, and seamless combination with external hyperparameter tuning tools.

What Texar-PyTorch Provides

With the best TF features integrated into the intuitive PyTorch programming model, Texar-PyTorch provides comprehensive support for building ML applications:

State-of-the-Art Model Building Blocks: Building an ML model is like assembling Lego bricks. Plug in and swap out modules as you like.
Easy and Efficient Data Processing: Rich built-in processors for common types of datasets, and simple but powerful interfaces for arbitrary custom ones. Best practices are integrated, so there is no need to worry about efficiency.
Turnkey and Flexible Model Training with Executors: Get rid of boilerplate code for training and evaluation loops while staying flexible enough to customize for your specialized needs.

Code Example 1 demonstrates the complete code of using Texar-PyTorch to build and train a state-of-the-art sequence-to-sequence model for, e.g., text summarization and machine translation.

Code Example 1: Building and training a conditional GPT-2 model (e.g., for text summarization) with Texar-PyTorch.

Why Choose Texar?

Supports both TensorFlow & PyTorch. Sometimes it’s not your choice of which underlying framework to use, and learning a new higher-level framework is probably just as time-consuming as writing the parts yourself. Now with Texar, you can use the same interfaces with minimal changes in both frameworks. The two versions can even share pre-trained model weights that you’ve downloaded.
Provides Natural Language Processing, All in One Kit. Texar has a comprehensive coverage of neural models on natural language processing tasks, especially text generation. Figure 1 gives a snapshot of Texar modules. With Texar, not only will you have access to a complete range of state-of-the-art pre-trained models, but you’ll also find all the utilities you need, from data processing to modeling to training and evaluation. We’ve got you covered.
Novice- and Expert-Friendly. Whether you’ve just picked up deep learning or you’re an experienced researcher, you’ll find Texar easy to use. Texar provides state-of-the-art built-in components but remains flexible enough for customization.

Figure 1: Texar provides a comprehensive set of modules for data processing, model architectures, loss functions, training, and evaluation, as well as a range of state-of-the-art pre-trained ML/NLP models (e.g., BERT and GPT-2).

In the following, we provide more details on the three key parts of Texar-PyTorch: modeling, data, and training.

Modeling

As shown in Figure 1, Texar-PyTorch offers a full set of ML modules. With the well-designed interfaces, users can freely build arbitrary models by assembling the building blocks.

The following example shows how flexible the module interfaces are to meet the needs of different learning algorithms, such as maximum-likelihood learning and adversarial learning. Moreover, Texar provides interfaces at multiple abstraction levels for users of different expertise. For example:

It’s straightforward to invoke a common inference method, e.g., teacher-forcing decoding, by simply setting the decoder argument `decoding_strategy='train_greedy'`.
On the other hand, to perform advanced inference, e.g., Gumbel-softmax decoding for adversarial learning, users can use a GumbelSoftmaxHelper. Expert users can further define new Helpers to implement custom decoding strategies.
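
To make the second case more concrete, here is a minimal, framework-free sketch of what Gumbel-softmax sampling computes. It is not Texar's GumbelSoftmaxHelper and does not show its interface; the function name `gumbel_softmax_sample`, the example logits, and the temperature value are assumptions chosen for illustration.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=0.5, rng=random):
    """Draw one relaxed (soft) one-hot sample from unnormalized logits.

    Hypothetical helper for illustration only; not part of Texar.
    """
    perturbed = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        u = max(rng.random(), 1e-20)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(perturbed)
    exps = [math.exp(v - top) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
sample = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=0.5)
```

Lower temperatures push the sample toward a one-hot vector. When the same computation is done on tensors, the output stays differentiable with respect to the logits, which is what makes this relaxation useful for adversarial training of text generators.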

Code Example 2: Building a pre-trained GPT-2 language model, fine-tuning with maximum-likelihood learning and adversarial learning (using BERT as the discriminator).

To summarize, modeling with Texar-PyTorch features the following key advantages:

Excellent modularization — switching between different learning contexts is enabled by simply plugging in/swapping out a couple of modules.
Multi-level interfaces — high-level intuitive interfaces for novice users and low-level highly-customizable ones for expert users.
Built-in state-of-the-art pre-trained models — BERT, GPT-2, RoBERTa, XLNet and more, for tasks of text encoding, classification, sequence tagging, and generation.

Data

Texar-PyTorch data modules are designed for easy, efficient, and customizable data access for any ML and NLP task. Combining the best practices from TensorFlow tf.data, the modules greatly enhance the native PyTorch DataLoader by:

Decoupling single instance processing and batching — for clearer program logic and easier customization
Buffer-based shuffling, caching, and lazy-loading — for greater efficiency
Extensive dataset iterators — no extra user configuration needed
More intuitive APIs — no expertise needed to get the best practices in your project
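
The first two ideas in this list can be sketched in plain Python. This is only an illustration of the techniques, not Texar's implementation: `buffered_shuffle` and `batches` are hypothetical names, and the per-instance `process` and per-batch `collate` callables stand in for whatever a real dataset would define.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in randomized order while holding at most `buffer_size` of them."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit one random element; the rest stay in the buffer.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain what is left once the source is exhausted.
    rng.shuffle(buffer)
    yield from buffer


def batches(items, process, collate, batch_size):
    """Apply `process` to each instance, then `collate` to each batch, lazily."""
    batch = []
    for item in items:
        batch.append(process(item))
        if len(batch) == batch_size:
            yield collate(batch)
            batch = []
    if batch:
        yield collate(batch)


# The source is a generator, so nothing is loaded until it is needed.
lines = (f"sentence {i}" for i in range(10))
shuffled = buffered_shuffle(lines, buffer_size=4, rng=random.Random(0))
result = list(batches(shuffled, process=str.split, collate=list, batch_size=3))
```

Because instance processing and batching are separate callables, either one can be swapped without touching the other, and the shuffle never needs the whole dataset in memory.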

Texar-PyTorch Built-in Datasets

For common types of datasets, Texar-PyTorch already includes ready-to-use modules, as shown in Figure 2 below.

Figure 2: Texar-PyTorch built-in datasets for a majority of ML and NLP tasks.

In particular, RecordData is Texar’s equivalent of TensorFlow’s well-known TFRecordData, which reads files in binary format and thus supports arbitrary data types ranging from text to images. Cool, isn’t it? What’s more, the usage pattern is very similar to that of TFRecordData. The example below says it all.

Let’s say you want to train an image captioning model. Each data example would typically contain an image, a caption, and other meta info. Below is how you would do it in Texar-PyTorch.

Code Example 3: Loading complex image captioning data with Texar-PyTorch RecordData.

Creating Custom Datasets

Users can customize how data instances are processed and batched, and Texar will take care of caching, lazy processing, and iteration for you. The toy example below illustrates this.

Code Example 4: A customized dataset that performs BPE tokenization for input text.

Executor

Have you ever been bored by writing the training-evaluation loop, again and again, each time when starting a new project? Have you desired a single API to automate the loop, equipped with logging, checkpointing, visualization, and hyperparameter tuning? Do you even want the API to be flexible enough for your non-traditional algorithms, e.g., alternating multiple losses in adversarial learning? Texar Executor is here for you.

Executor is the PyTorch equivalent of the widely-used TF Estimator and tf.keras.Model, but is designed to be lightweight and much more customizable.

To demonstrate the power of Executor, we show an example of a hand-written train-eval loop vs. Executor:

Let’s say we want the following functions in our project:

Print logs every `logging_steps` iterations to the console, a log file, and TensorBoard.
Perform validation every `validate_steps` iterations, evaluating the model output with the BLEU metric.
If validation results improve, save the current checkpoint. If results fail to improve for `patience` consecutive trials, load the previous checkpoint and scale the learning rate.
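
The third requirement is the easiest one to get subtly wrong, so here is a minimal, framework-free sketch of just that rule. It replays a list of made-up validation scores. The function name, the event labels, and the `lr_scale` default are assumptions for illustration, and checkpoint saving and loading are only recorded as events rather than performed.

```python
def replay_plateau_rule(scores, patience, lr, lr_scale=0.5):
    """Apply the save / reload-and-scale rule to a sequence of validation scores."""
    best = float("-inf")
    bad_rounds = 0
    events = []
    for step, score in enumerate(scores):
        if score > best:
            # Improvement: remember it and (conceptually) save a checkpoint.
            best = score
            bad_rounds = 0
            events.append((step, "save"))
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                # Plateau: (conceptually) reload the best checkpoint, scale the LR.
                lr *= lr_scale
                bad_rounds = 0
                events.append((step, "reload_and_scale_lr"))
    return best, lr, events


best, final_lr, events = replay_plateau_rule(
    scores=[10.0, 12.0, 11.5, 11.0, 12.5, 12.0], patience=2, lr=1e-3)
```

Even this stripped-down version needs three pieces of state that must be kept consistent, which is the kind of bookkeeping a full hand-written loop has to interleave with logging and evaluation.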

The steps above describe a pretty universal training loop. Here’s what a hand-written training loop would look like:

Code Example 5: A typical hand-written train-eval loop.

The code is very lengthy and tedious. Things can get even more troublesome when you need to add or change some functionalities. Now, what will the code look like, if we used Executors?

Code Example 6: The same train-eval loop with Executor.

And this is how Executor logs look in the command line:

Here you can observe that the validation BLEU is updated in place, based on the values predicted so far. This is thanks to Executor’s streaming metrics, which allow incremental computation of metric values. No need to wait until the end to see results on a large validation set!
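
The idea behind a streaming metric can be sketched with a simple running average. This is a hypothetical class written for illustration, not Texar's metric interface, and the per-example scores are made up.

```python
class StreamingAverage:
    """Running mean that can be read after every batch.

    Hypothetical class for illustration; not Texar's metric interface.
    """

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, values):
        """Fold one batch of per-example scores into the running state."""
        self.total += sum(values)
        self.count += len(values)

    def value(self):
        """Return the metric over everything seen so far."""
        return self.total / self.count if self.count else 0.0


metric = StreamingAverage()
history = []
for batch_scores in [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]:
    metric.add(batch_scores)
    history.append(metric.value())  # available immediately after each batch
```

A corpus-level metric such as BLEU would accumulate n-gram match counts and lengths instead of a single sum, but the pattern is the same: keep sufficient statistics, update them per batch, and compute the value on demand.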

As we can see, code with Executor is much more structured and readable. It is also much more extensible:

Q: What if we also want to do validation after each epoch?
A: Simply change `validate_every` to:

Q: What if we want to perform early stopping after we’ve scaled the learning rate `early_stop_patience` times?
A: Simply change `action_on_plateau` to:

Q: What if we also want to measure the word-level loss?
A: Simply add a new metric to `valid_metrics`:

Q: What if we want to do hyperparameter tuning and train the model multiple times?
A: Simply create an Executor for each set of hyperparameters that you want to test. Since Executor takes care of everything besides model creation, you don’t need to worry about consuming extra memory or accidentally retaining objects from previous runs. Here’s an example of using Executor with hyperopt.
Q: What if, at the end of each epoch, we want to upload the current checkpoint to the server, send an email containing the training progress, and take the dog out for a walk?
A: Weird, but okay. Simply register a custom action on a condition of your choice, and do whatever you wish:

Switching from Texar-TF to Texar-PyTorch

If you already use Texar-TF, switching to Texar-PyTorch requires only minimal effort. Texar-PyTorch has almost the same interfaces as Texar-TF, making transitions between backends easy.

Although the interfaces are similar, we also follow the coding conventions of each framework, so you won’t feel like you’re learning a new sub-language. To this end, we changed some of the lower-level extensible interfaces to match the native design of the respective frameworks more closely. Most of the changes lie in the data and executor modules, but as you’ve already seen, they’re still pretty easy to pick up.

Getting Started

To get started, visit our GitHub repository and follow the installation instructions. Useful resources include:

Documentation: We have detailed documentation for every module and function.
Examples: We strongly encourage you to check out our examples to get a basic idea of how Texar is used in practice. The examples are clearly documented and cover rich use cases.
ASYML Library: Find quick links to all Texar resources in one place.

*Petuum is a corporate sponsor of Texar. Petuum engineers are continuously contributing to the Texar code base and have been pivotal in this release.

Connecting the Dots Between MLE and RL for Sequence Generation

Texar — Tue, 27 Nov 2018 20:09:19 GMT

Crossposted on the Petuum blog.

Sequence generation is a ubiquitous problem in many applications, such as machine translation, text summarization, image captioning, and so forth.

Recently, we published a paper presenting a unified perspective on a variety of widely used learning algorithms for sequence generation, based on a generalized entropy regularized policy optimization formulation. We show that these algorithms are mathematically equivalent to specifying certain hyperparameter configurations in the framework. The new principled treatment provides systematic understanding and comparison among the algorithms and inspires further enhancement. We also propose a new interpolation algorithm based on the universal framework, which shows consistent improvement in machine translation and text summarization.

The development of sequence models such as recurrent neural networks (RNNs) with different cells and attention mechanisms has enabled great advances in tasks requiring sequence generation. These models can be trained with a variety of learning algorithms, which we’ll outline below.

Popular Algorithms (The Dots)

The standard training algorithm is based on maximum-likelihood estimation (MLE), which seeks to maximize the log-likelihood of ground-truth sequences. Despite its computational simplicity and efficiency, MLE training suffers from exposure bias: during training, the model learns to predict the next token given the ground-truth tokens that came before, but at test time it has no access to the ground truth, so the tokens generated by the model itself are used to make the next prediction instead. This discrepancy between training and testing can cause mistakes in prediction to quickly accumulate.

There have been several efforts to alleviate this issue, many of which resort to reinforcement learning (RL) techniques. For example, Ranzato et al., 2015, adopt a policy gradient algorithm that avoids the training/testing discrepancy by using the same decoding strategy at both training and test time. However, RL-based approaches for sequence generation can face prohibitively poor sample efficiency and high variance.

For more practical training, others have developed a diverse set of methods that are in a middle ground between the MLE and RL paradigms. For example, RAML (Norouzi et al., 2016) adds reward-aware perturbation to the MLE data examples, SPG (Ding & Soricut, 2017) leverages reward distribution for effective sampling of policy gradient, and other approaches such as data noising (Xie et al., 2017) also show improved results.

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation is the most widely-used approach to train a sequence generation model due to its simplicity and efficiency. MLE aims to find the optimal parameter value that maximizes the data log-likelihood:
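
Written out in the notation used in the rest of this post, with (x, y*) a training pair and pθ the model, the objective can be stated as follows. The labels θ* and L_MLE are names assumed here for reference.

```latex
\theta^{*} = \arg\max_{\theta} \mathcal{L}_{\mathrm{MLE}}(\theta),
\qquad
\mathcal{L}_{\mathrm{MLE}}(\theta) = \log p_{\theta}(y^{*} \mid x)
```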

Reward Augmented Maximum Likelihood (RAML)

RAML was originally proposed to incorporate task metric rewards into MLE training and has shown superior performance to vanilla MLE. Specifically, RAML introduces an exponentiated reward distribution e(y|y*) ∝ exp{R(y|y*)} where R, as in vanilla policy optimization, is a task metric such as BLEU. RAML maximizes the following objective:
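
With e(y|y*) as defined above, a form of the objective consistent with the rest of this post is the expected log-likelihood under the exponentiated reward distribution. The label L_RAML is an assumption, and the temperature that is sometimes attached to the reward is omitted.

```latex
\mathcal{L}_{\mathrm{RAML}}(\theta)
  = \mathbb{E}_{y \sim e(y \mid y^{*})}\left[ \log p_{\theta}(y \mid x) \right]
```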

The RAML objective reduces to the vanilla MLE objective if we replace the task reward R in e(y|y*) with the MLE δ-reward, which is a reward function defined as:
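
Consistent with the later remark that any sample other than y* receives a negative infinite reward, the δ-reward can be written as below. The value 1 in the matching case is a convention assumed here; any finite constant yields the same normalized distribution.

```latex
R_{\delta}(y \mid y^{*}) =
\begin{cases}
  1       & \text{if } y = y^{*}, \\
  -\infty & \text{otherwise.}
\end{cases}
```

With this reward, e(y|y*) ∝ exp{Rδ(y|y*)} places all of its mass on y*, so the expectation in the RAML objective collapses to log pθ(y*|x).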

Data Noising

Adding noise to training data is a widely adopted technique for regularizing models. Previous work has proposed several data noising strategies in the sequence generation context. For example, unigram noising with probability γ replaces each token in data y* with a sample from the unigram frequency distribution. The resulting noisy data is then used in MLE training. Formally, it is equivalent to using a reward:
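
One way to make this concrete, derived here from the description above rather than taken from the paper (whose expression may be written differently), is to take the reward to be the log-probability that token-wise unigram noising turns y* into y. This assumes each position is replaced independently with probability γ and that the length is unchanged.

```latex
R'_{\delta}(y \mid y^{*}) =
\begin{cases}
  \sum_{t} \log\left[ (1-\gamma)\,\mathbb{1}[y_t = y^{*}_t] + \gamma\, u(y_t) \right]
          & \text{if } |y| = |y^{*}|, \\
  -\infty & \text{otherwise,}
\end{cases}
```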

where u(·) is the unigram frequency distribution. With a relaxed (i.e., smoothed) reward, data noising expands the exploration space of vanilla MLE locally. The effect is essentially the same as the RAML algorithm, except that RAML expands the exploration space based on the task metric reward.

Softmax Policy Gradient (SPG)

SPG was developed to adapt the vanilla policy gradient so that the reward is used for sampling. SPG has the following objective:
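
A form consistent with the update rule discussed later in this post is a log-expectation of the exponentiated reward under the model distribution. The label L_SPG is an assumption.

```latex
\mathcal{L}_{\mathrm{SPG}}(\theta)
  = \log \mathbb{E}_{y \sim p_{\theta}(y \mid x)}\left[ \exp\{ R(y \mid y^{*}) \} \right]
```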

where R is a common reward. As a variant of the standard policy gradient algorithm, SPG aims to address the exposure bias problem and shows promising results.

Figure 1. Effective exploration space of different algorithms. (a): The exploration space of MLE is exactly the set of training examples. (b): RAML and Data Noising use smooth rewards and allow larger exploration space surrounding the training examples. (c): Common policy optimization such as SPG basically allows the whole exploration space.

Connecting the Dots

We establish a unified perspective of this broad set of learning algorithms. Specifically, we present a generalized entropy regularized policy optimization (ERPO) framework and show that the apparently diverse algorithms, such as MLE, RAML, SPG, and data noising, can all be re-formulated as special instances of the framework with the only difference being the choice of reward and the values of a couple of hyperparameters.

In addition to a new understanding of existing algorithms, our unified perspective also facilitates the development of new algorithms for improved learning. We present an example new algorithm that, as training proceeds, gradually expands the exploration space by annealing the reward and hyperparameter values. The annealing, in effect, dynamically interpolates among the existing algorithms. Experiments on machine translation and text summarization show that the interpolation algorithm achieves significant improvement over the various existing methods.

The General Framework

Our general framework is aimed at unifying all of the above algorithms with a common mathematical formulation. The framework is based on policy optimization, which, in general, maximizes the expected reward under the model distribution. A rich line of research into entropy regularized policy optimization (ERPO) has stabilized learning by augmenting policy optimization with information theoretic regularizers. Here, we present a generalized formulation of ERPO. Specifically, assuming a variational distribution q(y|x), we adopt the objective:
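
Assembled from the terms defined immediately after it, and with signs chosen so that the special cases discussed later come out as stated, the objective can be written as follows. The label L(q, θ) is an assumption.

```latex
\mathcal{L}(q, \theta)
  = \mathbb{E}_{q}\left[ R(y \mid y^{*}) \right]
  - \alpha\, \mathrm{KL}\left( q(y \mid x) \,\|\, p_{\theta}(y \mid x) \right)
  + \beta\, \mathrm{H}(q)
```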

where (x, y*) is the pair from training data; y is the sentence sampled following distribution q(y|x); KL(·||·) is the KL divergence; H(·) is the Shannon Entropy; α and β are balancing weights of the respective terms; and pθ is the sequence generation model parameterized with θ.

Using the Lagrange multipliers method, this objective can be maximized with an EM-style procedure that iterates two coordinate ascent steps optimizing q and θ, respectively. At iteration n:
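
For an objective made of an expected reward, a KL term weighted by α, and an entropy term weighted by β, the two steps take the following closed form. This is a sketch worked out from those three terms, with the superscript n indexing iterations.

```latex
\text{E-step:}\quad
q^{n+1}(y \mid x) \propto
  \exp\left\{ \frac{\alpha \log p_{\theta^{n}}(y \mid x) + R(y \mid y^{*})}{\alpha + \beta} \right\}

\text{M-step:}\quad
\theta^{n+1} = \arg\max_{\theta}\;
  \mathbb{E}_{q^{n+1}}\left[ \log p_{\theta}(y \mid x) \right]
```

The E-step follows from maximizing over q with θ fixed, since the objective then equals E_q[R + α log pθ] − (α + β) E_q[log q]. The M-step follows because only the KL term depends on θ.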

Other Algorithms as Special Instances

By assuming the ERPO framework, we can characterize other sequence generation algorithms as special instances within it.

Maximum Likelihood Estimation (MLE)

Let (R = Rδ, α → 0, β = 1). From the E-step of ERPO, we have q(y|x) = 1 if y = y*, and 0 otherwise. The M-step is therefore equivalent to
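
Since q puts all of its mass on y*, the expectation over q in the M-step collapses to a single term:

```latex
\theta^{n+1} = \arg\max_{\theta}\; \log p_{\theta}(y^{*} \mid x)
```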

which recovers precisely the MLE objective.

That is, MLE can be seen as an instance of the policy optimization algorithm with the δ-reward and the above weight values. Any sample y that fails to match precisely the data y* will receive a negative infinite reward and never contribute to model learning.

Reward Augmented Maximum Likelihood (RAML)

As we discussed, the RAML objective reduces to the vanilla MLE objective if we replace the task reward R in e(y|y*) with the MLE δ-reward. The relation between MLE and RAML still holds within ERPO. Similar to the way we recovered MLE from ERPO, if we let (α → 0, β = 1), but set R to the task metric reward, then the M-step of ERPO is precisely equivalent to maximizing the above RAML objective.

Data Noising

Although previous literature has commonly treated data noising as a data pre-processing step distinct from the above learning algorithms, the ERPO framework can also subsume it as a special instance. Specifically, starting from the ERPO reformulation of MLE, which takes (R = Rδ, α → 0, β = 1), data noising can be formulated as using the unigram-relaxed Rδ discussed above.

Softmax Policy Gradient (SPG)

SPG can also readily fit into our ERPO framework. By taking the gradient of the objective of SPG w.r.t. θ, we immediately get the same update rule as in ERPO with (α = 1, β = 0, R = common reward).

Note that α + β = 1 in both cases, so the essential difference between the SPG and RAML configurations is that now α = 1 (and correspondingly β = 0). SPG thus moves a step further than RAML by leveraging both the reward and the model distribution for full exploration. Sufficient exploration at training time would, in theory, boost the test-time performance. However, with the increased learning difficulty, additional sophisticated optimization and approximation techniques must be used (Ding & Soricut, 2017) in order to make the training practical.

Figure 2. A unified formulation of different learning algorithms. Each algorithm is a special instance of the general ERPO framework taking certain specifications of the hyperparameters (R, α, β).
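
The three configurations can be checked numerically on a toy candidate set. The sketch below assumes the sampling distribution has the form q(y) ∝ exp{(α log p(y) + R(y)) / (α + β)}, which is what maximizing the ERPO objective over q gives. The candidate probabilities and task rewards are made-up numbers, `erpo_e_step` is a hypothetical helper, and a tiny positive α stands in for α → 0.

```python
import math


def erpo_e_step(log_p, reward, alpha, beta):
    """Return q(y) proportional to exp((alpha * log p(y) + R(y)) / (alpha + beta))."""
    scores = [(alpha * lp + r) / (alpha + beta) for lp, r in zip(log_p, reward)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(weights)
    return [w / total for w in weights]


# Toy candidate set; index 0 plays the role of the ground truth y*.
model_probs = [0.2, 0.5, 0.3]                # made-up p_theta(y|x)
log_p = [math.log(p) for p in model_probs]
delta_reward = [1.0, -math.inf, -math.inf]   # delta-reward used by MLE
task_reward = [1.0, 0.6, 0.0]                # made-up task metric scores

tiny = 1e-8  # stands in for alpha -> 0
q_mle = erpo_e_step(log_p, delta_reward, alpha=tiny, beta=1.0)
q_raml = erpo_e_step(log_p, task_reward, alpha=tiny, beta=1.0)
q_spg = erpo_e_step(log_p, task_reward, alpha=1.0, beta=0.0)
```

Under these assumptions, `q_mle` puts all mass on the ground truth, `q_raml` is a softmax over the task rewards alone, and `q_spg` reweights the model distribution by the exponentiated reward.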

Application: Interpolation Algorithm

In our generalized ERPO framework, a series of widely used learning algorithms can all be understood as instances of the framework with certain specifications of the three hyperparameters (R, α, β). Each of the algorithms can be seen as a point in the hyperparameter space (Figure 2). Generally, a point with a more restricted reward function R and a very small α tends to have a smaller effective exploration space and allow efficient learning (e.g., MLE), while in contrast, a point with smooth R and a larger α would lead to a more difficult learning problem, but permit more sufficient exploration and better test-time performance (e.g., (softmax) policy gradient). In our paper, we also explore an example algorithm that interpolates the existing ones.

The interpolation algorithm exploits the natural idea of starting learning from the most restricted yet easiest problem configuration and gradually expanding the exploration space to reduce the discrepancy from test time, following an easy-to-hard learning paradigm. As we have mapped common algorithms to points in the hyperparameter space, interpolation becomes very straightforward and only requires annealing of the hyperparameter values.
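
As a rough illustration of what annealing the weights could look like, the sketch below moves (α, β) linearly from the MLE/RAML end (α near 0, β = 1) to the SPG end (α = 1, β = 0). The linear shape, the function name, and the `alpha_min` floor are assumptions made for this example; the schedule used in the paper may differ, and annealing the reward R from the δ-reward toward the task reward is not shown.

```python
def interpolation_weights(step, total_steps, alpha_min=1e-8):
    """Linearly anneal (alpha, beta) from (~0, 1) toward (1, 0).

    Hypothetical schedule for illustration only.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    alpha = max(progress, alpha_min)  # keep alpha strictly positive
    beta = 1.0 - progress
    return alpha, beta


start = interpolation_weights(0, 1000)
middle = interpolation_weights(500, 1000)
end = interpolation_weights(1000, 1000)
```

Keeping α + β close to 1 throughout means the schedule changes how much the model distribution contributes to sampling without changing the overall temperature.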

Experimental Results

We evaluate the above interpolation algorithm on the tasks of machine translation and text summarization. The proposed algorithm consistently improves over a variety of previous methods, as shown in the figures below.

Figure 3. The graph on the left is the convergence curve of different learning algorithms in the machine translation task. On the right is the improvement on text summarization in comparison to MLE.

Code

Our code for experiments is available here. Implementations are based on Texar, a general-purpose and easy-to-use text generation toolkit.

Introducing Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation and Beyond

Texar — Tue, 18 Sep 2018 18:41:13 GMT

Crossposted on the Petuum blog.

We are excited to introduce Texar, an open-source, general-purpose toolkit that supports a broad set of machine learning applications with a focus on text generation tasks. Texar is particularly suitable for researchers and practitioners who need fast model prototyping and experimentation.

Text Generation at a Glance

Text generation spans a broad set of natural language processing (NLP) tasks that aim to generate natural language from input data or machine representations. Such tasks include machine translation, dialog systems, text summarization, article writing, text paraphrasing and manipulation, image captioning, and more. While this field has undergone rapid progress in both academic and industry settings, in part due to the integration of modern deep learning approaches, considerable research efforts are still needed in order to improve techniques and enable real-world applications.

Text generation tasks have many common properties and share two central goals:

Generating human-like, grammatical, and readable text.
Generating text that contains all relevant information inferred from inputs. For example, in machine translation, the translated sentence that is generated must express the same meaning as the source sentence.

To this end, a few key techniques are increasingly widely used, such as neural encoder-decoders, attention mechanisms, memory networks, adversarial methods, reinforcement learning, and structured supervision, as well as optimization, data pre-processing and result post-processing procedures, and evaluation. These techniques are often combined in various ways to tackle different problems (Figure 1).

Figure 1. An example of various model architectures used in text generation tasks, where E refers to encoder, D to decoder, C to Classifier, A to attention, Prior to prior distribution, and M to memory.

It is therefore highly desirable to have an open-source platform that unifies the development of these diverse yet closely-related text generation applications, backed with clean and consistent implementations of the core algorithms. Such a unified platform would enable reuse of common components and functionalities; standardize design, implementation, and experimentation; foster reproducible research; and, importantly, encourage technique sharing among different text generation tasks so that an algorithmic advance developed for a specific task can quickly be evaluated and generalized to many other tasks.

Introducing Texar

To that end, we have developed Texar, an open-source toolkit focused on text generation tasks and built on TensorFlow. Texar is modular, versatile, and extensible. It extracts common patterns underlying the diverse tasks and methodologies within text generation and creates a library of highly reusable modules and functionalities.

Figure 2. Texar’s main modules and functionalities.

Versatility

Texar contains a wide range of modules and functionalities for composing arbitrary model architectures and implementing various learning algorithms such as maximum likelihood learning, reinforcement learning, adversarial learning, probabilistic modeling, and so forth (Figure 2).

Modularity

Texar decomposes diverse complex machine learning models/algorithms into highly-reusable model architecture, loss, and learning process modules, among others.

Users can easily construct their own models at a high conceptual level by assembling Texar’s modules like building blocks. Texar makes plugging-in and swapping-out modules simple — for example, switching between maximum likelihood learning and reinforcement learning only involves changing a few lines of code.

Extensibility

Texar can be effortlessly integrated with any user-customized, external modules, and is fully compatible with the TensorFlow open source community, including TensorFlow-native interfaces, features, and resources.

Usability

With Texar, users can customize models with templates/examples and simple Python/YAML configuration files, or program from Texar’s Python Library APIs for maximal customizability.

Texar provides convenient automatic variable reuse (no need to worry about complicated TensorFlow variable scopes), simple function-like calls to perform module logic, and rich configuration options with sensible default values for every module.

Texar emphasizes well-structured, highly-readable code with uniform design patterns and consistent styles, along with clean documentation and rich tutorial examples.

Texar is currently supporting several research and engineering projects at Petuum, Inc. We hope the toolkit can also empower the community to accelerate technique development in text generation and beyond. We also invite researchers and practitioners to join and further enrich the toolkit so that, together, we can advance text generation research and applications.

Please check out the following resources to learn more about Texar:

Website: https://texar.io

GitHub: https://github.com/asyml/texar

Examples: https://github.com/asyml/texar/blob/master/examples

Documentation: https://texar.readthedocs.io/

Tech report: https://arxiv.org/pdf/1809.00794.pdf