Jaderberg et al. (2015) introduced the learnable Spatial Transformer (ST) module, which can be used to empower standard neural networks to actively spatially transform feature maps or input data. In essence, the ST can be understood as a black box that applies some spatial transformation (e.g., crop, scale, rotate) to a given input (or part of it), conditioned on that particular input, during a single forward pass. More generally, STs can also be seen as a learnable attention mechanism (including spatial transformation of the region of interest). Notably, STs can be easily integrated into existing neural network architectures without any additional supervision or modification to the optimization, i.e., STs are differentiable plug-in modules. The authors showed that STs help models learn invariance to translation, scale, rotation and more generic warping, which resulted in state-of-the-art performance on several benchmarks, see image below.
ST Example: Results (after training) of using an ST as the first layer of a fully-connected network (ST-FCN Affine, left) or a convolutional neural network (ST-CNN Affine, right) trained for cluttered MNIST digit recognition are shown. Clearly, the output of the ST exhibits much less translation variance and attends to the digit. Taken from the video linked in Jaderberg et al. (2015).
Model Description
The aim of STs is to provide neural networks with spatial transformation and attention capabilities in a reasonable and efficient way. Note that standard neural network architectures (e.g., CNNs) are limited in this regard1. To this end, the ST applies a parametrized transformation \(\mathcal{T}_{\boldsymbol{\theta}}\) that transforms a regular grid into a new sampling grid, see image below. Then, some form of interpolation is used to compute the pixel values at the new sampling grid locations (i.e., by interpolating between values on the old grid).
Two examples of applying the parametrised sampling grid to an image \(\textbf{U}\) producing the output \(\textbf{V}\). The green dots represent the new sampling grid which is obtained by transforming the regular grid \(\textbf{G}\) (defined on \(\textbf{V}\)) using the transformation \(\mathcal{T}\). (a) The sampling grid is the regular grid \(\textbf{G} = \mathcal{T}_{\textbf{I}} (\textbf{G})\), where \(\textbf{I}\) is the identity transformation matrix. (b) The sampling grid is the result of warping the regular grid with an affine transformation \(\mathcal{T}_{\boldsymbol{\theta}} (\textbf{G})\). Taken from Jaderberg et al. (2015).
To this end, the ST is divided into three consecutive parts:
Localisation Network: Its purpose is to predict the parameters \(\boldsymbol{\theta}\) of the spatial transformation \(\mathcal{T}_{\boldsymbol{\theta}}\), taking the current feature map \(\textbf{U}\) as input, i.e., \(\boldsymbol{\theta} = f_{\text{loc}} \left(\textbf{U} \right)\). Thereby, the spatial transformation is conditioned on the input. Note that the dimensionality of \(\boldsymbol{\theta}\) depends on the transformation type, which needs to be defined beforehand, see some examples below. Furthermore, the localisation network can take any differentiable form, e.g., a CNN or FCN.
Examples of Spatial Transformations
The following examples highlight how a regular grid defined on the output/target map \(\textbf{V}\) (i.e., \(H^t\) and \(W^t\) denote height and width of \(\textbf{V}\)) can be transformed into a new sampling grid defined on the input/source feature map \(\textbf{U}\) using a parametrized transformation \(\mathcal{T}_{\boldsymbol{\theta}}\), i.e., \(\widetilde{\textbf{G}} = \mathcal{T}_{\boldsymbol{\theta}} (\textbf{G})\). Visualizations have been created by me; interactive versions can be found here.
Affine Transformation: This transformation allows cropping, translation, rotation, scaling and skew to be applied to the input feature map. It has 6 degrees of freedom (DoF).

Attention Transformation: This transformation is more constrained with only 3 DoF. Therefore, it only allows cropping, translation and isotropic scaling to be applied to the input feature map.

Projective Transformation: This transformation has 8 DoF and can be seen as an extension of the affine transformation. The main difference is that affine transformations are constrained to preserve parallelism, while projective transformations are not.
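For reference, the pointwise grid mappings for these cases can be written out explicitly. The affine and attention forms follow Jaderberg et al. (2015); the projective form shown here is one common 8-DoF parametrization and is my own addition:

\[
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_{\boldsymbol{\theta}} (G_i)
= \textbf{A}_{\boldsymbol{\theta}} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
\quad \text{with} \quad
\textbf{A}_{\boldsymbol{\theta}}^{\text{affine}} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix},
\qquad
\textbf{A}_{\boldsymbol{\theta}}^{\text{attention}} = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix},
\]

and for the projective case

\[
\begin{pmatrix} \tilde{x}_i^s \\ \tilde{y}_i^s \\ \tilde{z}_i \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \\ \theta_{31} & \theta_{32} & 1 \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix},
\qquad
x_i^s = \frac{\tilde{x}_i^s}{\tilde{z}_i}, \quad y_i^s = \frac{\tilde{y}_i^s}{\tilde{z}_i},
\]

where \(\begin{bmatrix} x_i^t & y_i^t \end{bmatrix}^T\) are the (target) coordinates of the regular grid \(\textbf{G}\) and \(\begin{bmatrix} x_i^s & y_i^s \end{bmatrix}^T\) the resulting (source) sampling coordinates.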
Grid Generator: Its purpose is to create the new sampling grid \(\widetilde{\textbf{G}}\) on the input feature map \(\textbf{U}\) by applying the predefined parametrized transformation with the parameters \(\boldsymbol{\theta}\) obtained from the localisation network, see examples above.
Sampler: Its purpose is to compute the warped version of the input feature map \(\textbf{U}\) by computing the pixel values at the new sampling grid \(\widetilde{\textbf{G}}\) obtained from the grid generator. Note that the new sampling grid does not necessarily align with the input feature map grid, therefore some kind of interpolation is needed. Jaderberg et al. (2015) formulate this interpolation as the application of a sampling kernel centered at a particular location in the input feature map, i.e.,

\[
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, k \left( x_i^s - m; \boldsymbol{\Phi}_x \right) k \left( y_i^s - n; \boldsymbol{\Phi}_y \right),
\]

where \(V_i^c\) denotes the new pixel value of the \(c\)-th channel at the \(i\)-th position of the output map \(\textbf{V}\) (of size \(H^t \times W^t\) per channel) with corresponding sampling grid coordinates2 \(\begin{bmatrix} x_i^s & y_i^s\end{bmatrix}^{T}\), \(U_{nm}^c\) is the input value at location \((n, m)\) in channel \(c\), and \(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_y\) are the parameters of a generic sampling kernel \(k()\) which defines the image interpolation. As the sampling grid coordinates are not channel-dependent, each channel is transformed in the same way, resulting in spatial consistency between channels. Note that although in theory we need to sum over all input locations, in practice we can ignore this sum by just looking at the kernel support region for each \(V_i^c\) (similar to CNNs).
The sampling kernel can be chosen freely as long as (sub-)gradients can be defined with respect to \(x_i^s\) and \(y_i^s\). Some possible choices are shown below.
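Putting the three parts together, a minimal affine ST can be sketched in PyTorch, whose `affine_grid` and `grid_sample` functions provide a grid generator and a bilinear sampler out of the box. The layer sizes of the localisation network below are my own assumptions for \(1 \times 42 \times 42\) inputs, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal affine ST: localisation net + grid generator + bilinear sampler.

    Sketch only; layer sizes are assumptions for 1x42x42 inputs.
    """

    def __init__(self):
        super().__init__()
        # Localisation network: predicts the 6 affine parameters theta.
        self.loc_features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.loc_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(10 * 7 * 7, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize to the identity transformation so training starts
        # from an untransformed input.
        self.loc_head[-1].weight.data.zero_()
        self.loc_head[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, u):
        theta = self.loc_head(self.loc_features(u)).view(-1, 2, 3)
        # Grid generator: map the regular output grid through T_theta.
        grid = F.affine_grid(theta, u.size(), align_corners=False)
        # Sampler: bilinear interpolation at the new grid locations.
        return F.grid_sample(u, grid, align_corners=False)
```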
Motivation: With the introduction of GPUs, convolutional layers enabled computationally efficient training of feature detectors on patches due to their weight sharing and local connectivity concepts. Since then, CNNs have proven to be the most powerful framework when it comes to computer vision tasks such as image classification or segmentation.
Despite their success, Jaderberg et al. (2015) note that CNNs still lack mechanisms for spatial invariance to the input data in a computationally and parameter efficient manner. While convolutional layers are translation-equivariant to the input data and the use of max-pooling layers has helped to make the network somewhat spatially invariant to the position of features, this invariance is limited to the (typically) small spatial support of max-pooling (e.g., \(2\times 2\)). As a result, CNNs are typically not invariant to larger transformations and thus need to learn complicated functions to approximate these invariances.
What if we could enable the network to learn transformations of the input data? This is the main idea of STs! Learning spatial invariances is much easier when you have spatial transformation capabilities. The second aim of STs is to be computationally and parameter efficient. This is done by using structured, parameterized transformations which can be seen as a weight sharing scheme.
Implementation
Jaderberg et al. (2015) performed several supervised learning tasks (distorted MNIST, Street View House Numbers, fine-grained bird classification) to test the performance of a standard architecture (FCN or CNN) against an architecture that includes one or several ST modules. They empirically validated that including STs results in performance gains, i.e., higher accuracies across multiple tasks.
The following reimplementation aims to reproduce a subset of the distorted MNIST experiment (RTS distorted MNIST), comparing a standard CNN with an ST-CNN architecture. A starting point for the implementation was this PyTorch tutorial by Ghassen Hamrouni.
RTS Distorted MNIST
While Jaderberg et al. (2015) explored multiple distortions of the MNIST handwriting dataset, this reimplementation focuses on the rotation-translation-scale (RTS) distorted MNIST, see image below. As described in appendix A.4 of Jaderberg et al. (2015), this dataset can easily be generated by augmenting the standard MNIST dataset as follows:

* randomly rotate by sampling the angle uniformly in \([-45^{\circ}, +45^{\circ}]\),
* randomly scale by sampling the factor uniformly in \([0.7, 1.2]\),
* translate by picking a random location on a \(42\times 42\) image (MNIST digits are \(28 \times 28\)).
RTS Distorted MNIST Examples
Note that this transformation could also be used as a data augmentation technique, as the resulting images remain (mostly) valid digit representations (humans could still assign correct labels).
The code below can be used to create this dataset:
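A possible way to generate such a dataset on the fly is to compose the augmentation with torchvision transforms. The snippet below is a sketch under the assumption that the distortion is applied as a random affine transform on a padded \(42 \times 42\) canvas; the parameter values only approximate the description above:

```python
from torchvision import datasets, transforms

# Sketch of an RTS distortion pipeline (assumed names/values): pad MNIST
# digits to 42x42 and apply a random rotation, isotropic scaling and translation.
rts_transform = transforms.Compose([
    transforms.Pad(7),                      # 28x28 -> 42x42 canvas
    transforms.RandomAffine(
        degrees=45,                         # rotation in [-45°, +45°]
        scale=(0.7, 1.2),                   # isotropic scaling factor
        translate=(1 / 6, 1 / 6),           # move digit to a random location
    ),
    transforms.ToTensor(),
])

train_dataset = datasets.MNIST(
    root="data", train=True, download=True, transform=rts_transform)
test_dataset = datasets.MNIST(
    root="data", train=False, download=True, transform=rts_transform)
```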
The model implementation can be divided into three tasks:
Network Architectures: The network architectures are based upon the description in appendix A.4 of Jaderberg et al. (2015). Note that there is only one ST at the beginning of the network, such that the resulting transformation is only applied to one channel (the input channel). For the sake of simplicity, we only implement an affine transformation matrix. Clearly, including an ST increases the network's capacity due to the number of added trainable parameters. To allow for a fair comparison, we therefore increase the capacity of the convolutional and linear layers in the standard CNN.
The code below creates both architectures and counts their trainable parameters.
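Since the exact layer configuration from appendix A.4 is not reproduced in this excerpt, the snippet below is only a sketch with assumed layer sizes. It reuses the SpatialTransformer module sketched earlier, defines both networks under the names used later (`cnn`, `st_cnn`) and counts their trainable parameters:

```python
import torch.nn as nn

def make_cnn(in_channels=1, n_classes=10, width=40):
    """Plain CNN baseline; layer sizes are assumptions, not the paper's exact
    configuration. Increase `width` to roughly match the ST-CNN's capacity."""
    return nn.Sequential(
        nn.Conv2d(in_channels, width, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        nn.Conv2d(width, width, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(width * 7 * 7, 64), nn.ReLU(),
        nn.Linear(64, n_classes),
        nn.LogSoftmax(dim=1),  # log-probabilities for NLLLoss (see below)
    )

# ST-CNN: the SpatialTransformer sketched above followed by a (smaller) CNN.
cnn = make_cnn(width=40)
st_cnn = nn.Sequential(SpatialTransformer(), make_cnn(width=32))

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"CNN:    {count_parameters(cnn):,} trainable parameters")
print(f"ST-CNN: {count_parameters(st_cnn):,} trainable parameters")
```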
Training Procedure: As described in appendix A.4 of Jaderberg et al. (2015), the networks are trained with standard SGD, batch size of \(256\) and base learning rate of \(0.01\). To reduce computation time, the number of epochs is limited to \(50\).
The loss function is the multinomial cross entropy loss, i.e.,

\[
\mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C} p_i^{(k)} \log \left( \widehat{p}_i^{(k)} \right),
\]

where \(k\) enumerates the \(C\) classes, \(i\) enumerates the \(N\) images, \(p_i^{(k)} \in \{0, 1\}\) denotes the true probability of image \(i\) belonging to class \(k\) and \(\widehat{p}_i^{(k)} \in [0, 1]\) is the probability predicted by the network. Note that the true probability distribution is categorical (hard labels), i.e., \(p_i^{(k)} = 1\) if and only if \(k\) equals the class label \(y_i\), such that the loss reduces to

\[
\mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} \log \left( \widehat{p}_{i, y_i} \right),
\]
which is the definition of the negative log likelihood loss (NLLLoss) in PyTorch, when the logarithmized predictions \(\log \left( \widehat{p}_i^{(k)} \right)\) (a matrix of size \(N\times C\)) and the class labels \(y_i\) (a vector of size \(N\)) are given as input.
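As a quick illustration (my own addition, not part of the original experiments), the following snippet verifies numerically that NLLLoss applied to log-probabilities matches PyTorch's cross entropy loss applied to raw logits:

```python
import torch
import torch.nn.functional as F

# Sanity check with random numbers: NLLLoss on log-probabilities equals
# the cross entropy loss on the underlying logits.
logits = torch.randn(4, 10)          # N=4 images, C=10 classes
labels = torch.randint(0, 10, (4,))  # class labels y_i

nll = F.nll_loss(F.log_softmax(logits, dim=1), labels)
ce = F.cross_entropy(logits, labels)
print(torch.allclose(nll, ce))  # True
```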
The code below summarizes the whole training procedure.
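The full training code is not shown in this excerpt; a minimal sketch of a `train` function compatible with the calls below (plain SGD with the hyperparameters stated above, returning the trained model) could look as follows, with loss logging and plotting omitted for brevity:

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=50, batch_size=256, lr=0.01):
    """Sketch of the training loop described above (plain SGD, NLLLoss)."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.NLLLoss()  # expects log-probabilities (see above)

    model.train()
    for epoch in range(epochs):
        running_loss = 0.0
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * images.size(0)
        print(f"epoch {epoch + 1:02d}: loss = {running_loss / len(dataset):.4f}")
    return model
```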
Lastly, the results can also be divided into three sections:
Training Results: Firstly, we train our models on the training dataset and compare the logarithmized losses:
trained_cnn = train(cnn, train_dataset)
trained_st_cnn = train(st_cnn, train_dataset)
The logarithmized losses already indicate that the ST-CNN performs better than the standard CNN (at least, it decreases the loss faster). However, it can also be noted that training the ST-CNN seems less stable.
Test Performance: While the performance on the training dataset may be a good indicator, test set performance is much more meaningful. Let’s compare the losses and accuracies between both trained models:
for trained_model in [trained_cnn, trained_st_cnn]:
    test(trained_model, test_dataset)
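The `test` helper is not shown in this excerpt; a sketch consistent with the call above (reporting the average loss and accuracy on the given dataset) could look like this:

```python
import torch

def test(model, dataset, batch_size=256):
    """Sketch of a `test` helper (assumed signature, matching the call above)."""
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    criterion = torch.nn.NLLLoss(reduction="sum")

    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for images, labels in loader:
            log_probs = model(images)
            total_loss += criterion(log_probs, labels).item()
            correct += (log_probs.argmax(dim=1) == labels).sum().item()
    n = len(dataset)
    print(f"loss: {total_loss / n:.4f}, accuracy: {correct / n:.2%}")
```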
Clearly, the ST-CNN performs much better than the standard CNN. Note that training for more epochs would probably result in even better accuracies for both models.
Visualization of Learned Transformations: Lastly, it might be interesting to see what the ST module actually does after training:
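The plotting code is omitted here; a sketch of how such a visualization could be produced (assuming the ST is the first submodule of the trained ST-CNN, as in the architecture sketches above) is:

```python
import matplotlib.pyplot as plt
import torch
from torchvision.utils import make_grid

# Feed a small test batch through the ST alone and compare input vs. output.
images = torch.stack([test_dataset[i][0] for i in range(16)])
with torch.no_grad():
    transformed = trained_st_cnn[0](images)  # assumes ST is the first submodule

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, batch, title in zip(axes, [images, transformed], ["ST input", "ST output"]):
    grid = make_grid(batch, nrow=4).permute(1, 2, 0).numpy()
    ax.imshow(grid)
    ax.set_title(title)
    ax.axis("off")
plt.show()
```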
Clearly, the ST module attends to the digits, such that the ST output shows much less variation in terms of rotation, translation and scale, making the classification task for the follow-up CNN easier.
Pretty cool, huh?
Footnotes
Clearly, convolutional layers are not rotation or scale invariant. Even the translation-equivariance property does not necessarily make CNNs translation-invariant, as typically some fully connected layers are added at the end. Max-pooling layers can introduce some translation invariance; however, they are limited by their size, such that large translations are often not captured.↩︎
Jaderberg et al. (2015) define the transformation with normalized coordinates, i.e., \(-1 \le x_i^s, y_i^s \le 1\). However, in the sampling kernel equations it seems more likely that they assume unnormalized/absolute coordinates, e.g., in equation 4 of the paper normalized coordinates would be nonsensical.↩︎