ML – Multi-Task Learning

Last Updated : 24 Feb, 2023

Multi-task learning improves generalization by pooling the examples (which act as soft constraints on the parameters) arising from several related tasks. When part of a model is shared across tasks, that part is more strongly constrained towards good parameter values (provided the sharing is justified), which often leads to better generalization.

The diagram below shows a common form of multi-task learning in which several supervised tasks (predicting $\mathbf{y}^{(i)}$ given $\mathbf{x}$) share the same input $\mathbf{x}$ as well as an intermediate-level representation $\mathbf{h}^{(\text{shared})}$ that captures a common pool of factors. The model can be divided into two kinds of parts, each with its own parameters:

  1. Task-specific parameters – these benefit only from the examples of their own task and are needed to achieve good generalization for that task. They correspond to the upper layers of the neural network in the diagram below.
  2. Generic parameters – these are shared across all tasks and benefit from the pooled data of all the tasks. They correspond to the lower layers of the neural network in the diagram below.

[Figure: Multi-task learning architecture – a shared representation $\mathbf{h}^{(\text{shared})}$ computed from the input $\mathbf{x}$ feeds task-specific units $\mathbf{h}^{(1)}$ and $\mathbf{h}^{(2)}$, which predict $\mathbf{y}^{(1)}$ and $\mathbf{y}^{(2)}$.]

Multi-task learning can take many different forms in deep learning frameworks, and this diagram illustrates a common scenario in which the tasks share a common input but produce different target random variables. The lower layers of a deep network (whether supervised and feedforward, or including a generative component with downward arrows) can be shared across tasks, while task-specific parameters (associated with the weights into and out of $\mathbf{h}^{(1)}$ and $\mathbf{h}^{(2)}$, respectively) are learned on top of the shared representation $\mathbf{h}^{(\text{shared})}$. The core idea is that the variations in the input $\mathbf{x}$ are explained by a common pool of factors, and that each task is associated with a subset of these factors. In this example, the top-level hidden units $\mathbf{h}^{(1)}$ and $\mathbf{h}^{(2)}$ are specialized for their respective tasks (predicting $\mathbf{y}^{(1)}$ and $\mathbf{y}^{(2)}$), while the intermediate-level representation $\mathbf{h}^{(\text{shared})}$ is shared across all tasks. In the unsupervised learning setting, some of the top-level factors may be associated with none of the output tasks ($\mathbf{h}^{(3)}$): these are the factors that explain some of the input variations but are not useful for predicting $\mathbf{y}^{(1)}$ or $\mathbf{y}^{(2)}$.
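To make the split between generic and task-specific parameters concrete, here is a minimal sketch of this architecture in PyTorch. The framework choice, the class name MultiTaskNet, the layer sizes, and the number of outputs per task are illustrative assumptions, not part of the original diagram: a shared trunk computes $\mathbf{h}^{(\text{shared})}$, and two task-specific heads compute $\mathbf{h}^{(1)}$ and $\mathbf{h}^{(2)}$ to predict $\mathbf{y}^{(1)}$ and $\mathbf{y}^{(2)}$.

```python
import torch
import torch.nn as nn


class MultiTaskNet(nn.Module):
    """Hard parameter sharing: a shared trunk (generic parameters)
    feeds two task-specific heads (task-specific parameters)."""

    def __init__(self, in_dim=784, shared_dim=256, head_dim=128,
                 n_out_task1=10, n_out_task2=5):
        super().__init__()
        # Lower layers: shared across tasks, trained on the pooled data
        self.shared = nn.Sequential(
            nn.Linear(in_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, shared_dim), nn.ReLU(),
        )
        # Upper layers: one head per task, trained only on that task's examples
        self.head1 = nn.Sequential(
            nn.Linear(shared_dim, head_dim), nn.ReLU(),
            nn.Linear(head_dim, n_out_task1),
        )
        self.head2 = nn.Sequential(
            nn.Linear(shared_dim, head_dim), nn.ReLU(),
            nn.Linear(head_dim, n_out_task2),
        )

    def forward(self, x):
        h_shared = self.shared(x)  # h^(shared)
        # Predictions for y^(1) and y^(2)
        return self.head1(h_shared), self.head2(h_shared)
```

Because both heads read from the same trunk, the generic parameters see examples from every task, while each head sees only the examples of its own task.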

[Figure: Negative log-likelihood learning curves on the training and validation sets for a maxout network trained on MNIST.]

The learning curves show how the negative log-likelihood loss changes over time (measured as the number of training iterations over the dataset, or epochs). In this scenario, a maxout network is trained on MNIST. The training objective decreases steadily over time, while the average loss on the validation set eventually begins to rise again, forming an asymmetric U-shaped curve.
Improved generalization and generalization error bounds can be achieved thanks to the shared parameters, for which statistical strength can be greatly improved. Of course, this only happens if certain assumptions about the statistical relationship between the different tasks are valid, meaning that some tasks really are related. From the point of view of deep learning, the underlying prior belief is that among the factors explaining the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
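As a rough illustration of how the pooled data strengthens the shared parameters, the sketch below (reusing the hypothetical MultiTaskNet from above, with random toy data and arbitrary hyperparameters) performs one joint training step: the per-task losses are summed, so gradients from both tasks flow into the shared trunk, while each head is updated only by its own task.

```python
import torch
import torch.nn as nn

# Hypothetical setup reusing the MultiTaskNet sketch above
model = MultiTaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# A toy batch with labels for both tasks (random data, for illustration only)
x = torch.randn(32, 784)
y1 = torch.randint(0, 10, (32,))   # task-1 labels
y2 = torch.randint(0, 5, (32,))    # task-2 labels

out1, out2 = model(x)
# Summing the per-task losses means the shared trunk receives gradients
# from both tasks, while each head is updated only by its own task.
loss = criterion(out1, y1) + criterion(out2, y2)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```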


