Parametrizations Tutorial#

Author: Mario Lezcano

Regularizing deep-learning models is a surprisingly challenging task. Classical techniques such as penalty methods often fall short when applied to deep models due to the complexity of the function being optimized. This is particularly problematic when working with ill-conditioned models; examples of these are RNNs trained on long sequences and GANs. A number of techniques have been proposed in recent years to regularize these models and improve their convergence. On recurrent models, it has been proposed to control the singular values of the recurrent kernel so that the RNN is well-conditioned. Another way to regularize recurrent models is via "weight normalization". This approach proposes to decouple the learning of the parameters from the learning of their norms. To do so, the parameter is divided by its Frobenius norm, and a separate parameter encoding its norm is learned. A similar regularization was proposed for GANs under the name of "spectral normalization". This method controls the Lipschitz constant of the network by dividing its parameters by their spectral norm rather than their Frobenius norm.

All these methods have a common pattern: they all transform a parameter in an appropriate way before using it. In the first case, they make it orthogonal by using a function that maps matrices to orthogonal matrices. In the case of weight normalization and spectral normalization, they divide the original parameter by its norm.

More generally, all these examples use a function to put extra structure on the parameters. In other words, they use a function to constrain the parameters.
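
As a rough sketch of this pattern (the function normalize_weight and the variables below are illustrative, not part of any library API), a weight-normalization-style transform might look like this:

import torch

def normalize_weight(weight, g):
    # Divide the raw parameter by its Frobenius norm and rescale it by a
    # separately learned norm parameter g
    return g * weight / weight.norm()

weight = torch.rand(3, 3)        # raw, unconstrained parameter
g = torch.tensor(2.0)            # separately learned norm
W = normalize_weight(weight, g)  # the tensor the layer would actually use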

In this tutorial, you will learn how to implement and use this pattern to put constraints on your model. Doing so is as easy as writing your own nn.Module.

Implementing parametrizations by hand#

Assume that we want to have a square linear layer with symmetric weights, that is, weights X such that X = Xᵀ. One way to do so is to copy the upper-triangular part of the matrix into its lower-triangular part:

import torch
from torch import nn
from torch.nn.utils import parametrize


def symmetric(X):
    return X.triu() + X.triu(1).transpose(-1, -2)

X = torch.rand(3, 3)
A = symmetric(X)
assert torch.allclose(A, A.T)  # A is symmetric
print(A)                       # Quick visual check
tensor([[0.6385, 0.1805, 0.1233],
        [0.1805, 0.6446, 0.0409],
        [0.1233, 0.0409, 0.3928]])

We can then use this idea to implement a linear layer with symmetric weights:

class LinearSymmetric(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(n_features, n_features))

    def forward(self, x):
        A = symmetric(self.weight)
        return x @ A

This layer can then be used as a regular linear layer:

layer = LinearSymmetric(3)
out = layer(torch.rand(8, 3))

This implementation, although correct and self-contained, presents a number of problems:

  1. It reimplements the layer. We had to implement the linear layer as x @ A. This is not very problematic for a linear layer, but imagine having to reimplement a CNN or a Transformer …

  2. It does not separate the layer and the parametrization. If the parametrization were more difficult, we would have to rewrite its code for every layer that we want to use it in.

  3. It recomputes the parametrization every time we use the layer. If we used the layer several times during the forward pass (imagine the recurrent kernel of an RNN), it would compute the same A on every call to the layer (see the sketch after this list).
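
A minimal sketch of that last point (the toy loop below is only illustrative, not an actual RNN):

layer = LinearSymmetric(3)
x = torch.rand(8, 3)
for _ in range(5):
    # symmetric(self.weight) is recomputed on each of these five calls,
    # even though the underlying weight has not changed
    x = layer(x)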

Introduction to parametrizations#

Parametrizations can solve all of these problems.

Let's start by reimplementing the code above using torch.nn.utils.parametrize. The only thing we have to do is write the parametrization as a regular nn.Module:

class Symmetric(nn.Module):
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)

This is all we need to do. Once we have this, we can transform any regular layer into a symmetric layer:

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())
ParametrizedLinear(
  in_features=3, out_features=3, bias=True
  (parametrizations): ModuleDict(
    (weight): ParametrizationList(
      (0): Symmetric()
    )
  )
)

Now, the matrix of the linear layer is symmetric:

A = layer.weight
assert torch.allclose(A, A.T)
print(A)
tensor([[ 0.5477,  0.4742, -0.3670],
        [ 0.4742,  0.1533,  0.4901],
        [-0.3670,  0.4901,  0.1949]], grad_fn=<AddBackward0>)

We can do the same thing with any other layer. For example, we can create a CNN with skew-symmetric kernels. We use a similar parametrization, copying the upper-triangular part with its sign flipped into the lower-triangular part:

class Skew(nn.Module):
    def forward(self, X):
        A = X.triu(1)
        return A - A.transpose(-1, -2)


cnn = nn.Conv2d(in_channels=5, out_channels=8, kernel_size=3)
parametrize.register_parametrization(cnn, "weight", Skew())
# Print a few kernels
print(cnn.weight[0, 1])
print(cnn.weight[2, 2])
tensor([[ 0.0000,  0.0043,  0.1344],
        [-0.0043,  0.0000,  0.0796],
        [-0.1344, -0.0796,  0.0000]], grad_fn=<SelectBackward0>)
tensor([[ 0.0000,  0.1431,  0.1439],
        [-0.1431,  0.0000,  0.1359],
        [-0.1439, -0.1359,  0.0000]], grad_fn=<SelectBackward0>)

Inspecting a parametrized module#

When a module is parametrized, we find that the module has changed in three ways:

  1. model.weight is now a property

  2. There is a new module.parametrizations attribute

  3. The unparametrized weight has been moved to module.parametrizations.weight.original

After parametrizing weight, layer.weight is turned into a Python property. This property computes parametrization(weight) every time we request layer.weight, just as we did in the LinearSymmetric implementation above.
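
As a quick check of this (it assumes, as in current PyTorch versions, that the property is injected on a class created on the fly; the check itself is only illustrative):

# `layer` is the Symmetric-parametrized linear layer from above
print(type(layer))          # a dynamically created ParametrizedLinear class
print(type(layer).weight)   # layer.weight is now a Python property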

Registered parametrizations are stored under a parametrizations attribute within the module.

layer = nn.Linear(3, 3)
print(f"Unparametrized:\n{layer}")
parametrize.register_parametrization(layer, "weight", Symmetric())
print(f"\nParametrized:\n{layer}")
Unparametrized:
Linear(in_features=3, out_features=3, bias=True)

Parametrized:
ParametrizedLinear(
  in_features=3, out_features=3, bias=True
  (parametrizations): ModuleDict(
    (weight): ParametrizationList(
      (0): Symmetric()
    )
  )
)

This parametrizations attribute is an nn.ModuleDict, and it can be accessed as such:

print(layer.parametrizations)
print(layer.parametrizations.weight)
ModuleDict(
  (weight): ParametrizationList(
    (0): Symmetric()
  )
)
ParametrizationList(
  (0): Symmetric()
)

Each element of this nn.ModuleDict is a ParametrizationList, which behaves like an nn.Sequential. This list will allow concatenating parametrizations on one weight. Since it is a list, we can access the parametrizations by indexing into it. This is where our Symmetric parametrization sits:

print(layer.parametrizations.weight[0])
Symmetric()

The other thing that we notice is that, if we print the parameters, we see that the parameter weight has been moved:

print(dict(layer.named_parameters()))
{'bias': Parameter containing:
tensor([-0.4708, -0.4272, -0.5326], requires_grad=True), 'parametrizations.weight.original': Parameter containing:
tensor([[ 0.4326, -0.5063, -0.5035],
        [-0.2364,  0.0440, -0.2994],
        [-0.3409, -0.3391,  0.0666]], requires_grad=True)}

It now sits under layer.parametrizations.weight.original:

print(layer.parametrizations.weight.original)
Parameter containing:
tensor([[ 0.4326, -0.5063, -0.5035],
        [-0.2364,  0.0440, -0.2994],
        [-0.3409, -0.3391,  0.0666]], requires_grad=True)

Besides these three small differences, the parametrization is doing exactly the same as our manual implementation:

symmetric = Symmetric()
weight_orig = layer.parametrizations.weight.original
print(torch.dist(layer.weight, symmetric(weight_orig)))
tensor(0., grad_fn=<DistBackward0>)

Parametrizations are first-class citizens#

Since layer.parametrizations is an nn.ModuleDict, the parametrizations are properly registered as submodules of the original module. As such, the same rules for registering parameters in a module also apply to registering a parametrization. For example, if a parametrization has parameters, these will be moved from CPU to CUDA when calling model = model.cuda().
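
A minimal sketch of what this means in practice (the class name LearnableScale below is purely illustrative, not a PyTorch API): a parametrization that owns an nn.Parameter has that parameter registered, trained, and moved together with the rest of the module:

class LearnableScale(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, X):
        # Rescale the original weight by a learnable factor
        return self.scale * X

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", LearnableScale())
# The parametrization's own parameter appears among the layer's parameters,
# so an optimizer will update it and .to() / .cuda() will move it
print([name for name, _ in layer.named_parameters()])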

Caching the value of a parametrization#

Parametrizations come with a built-in caching system, available via the context manager parametrize.cached():

class NoisyParametrization(nn.Module):
    def forward(self, X):
        print("Computing the Parametrization")
        return X

layer = nn.Linear(4, 4)
parametrize.register_parametrization(layer, "weight", NoisyParametrization())
print("Here, layer.weight is recomputed every time we call it")
foo = layer.weight + layer.weight.T
bar = layer.weight.sum()
with parametrize.cached():
    print("Here, it is computed just the first time layer.weight is called")
    foo = layer.weight + layer.weight.T
    bar = layer.weight.sum()
Computing the Parametrization
Here, layer.weight is recomputed every time we call it
Computing the Parametrization
Computing the Parametrization
Computing the Parametrization
Here, it is computed just the first time layer.weight is called
Computing the Parametrization

Concatenating parametrizations#

Concatenating two parametrizations is as easy as registering them on the same tensor. We can use this to build more complex parametrizations from simpler ones. For example, the Cayley map takes skew-symmetric matrices to orthogonal matrices of positive determinant. We can concatenate Skew and a parametrization that implements the Cayley map to get a layer with orthogonal weights:

class CayleyMap(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.register_buffer("Id", torch.eye(n))

    def forward(self, X):
        # (I + X)(I - X)^{-1}
        return torch.linalg.solve(self.Id - X, self.Id + X)

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Skew())
parametrize.register_parametrization(layer, "weight", CayleyMap(3))
X = layer.weight
print(torch.dist(X.T @ X, torch.eye(3)))  # X is orthogonal
tensor(1.9679e-07, grad_fn=<DistBackward0>)

This can also be used to prune a parametrized module, or to reuse parametrizations. For example, the matrix exponential maps symmetric matrices to symmetric positive definite (SPD) matrices, but it also maps skew-symmetric matrices to orthogonal matrices. Using these two facts, we can reuse the parametrizations from before:

class MatrixExponential(nn.Module):
    def forward(self, X):
        return torch.matrix_exp(X)

layer_orthogonal = nn.Linear(3, 3)
parametrize.register_parametrization(layer_orthogonal, "weight", Skew())
parametrize.register_parametrization(layer_orthogonal, "weight", MatrixExponential())
X = layer_orthogonal.weight
print(torch.dist(X.T @ X, torch.eye(3)))         # X is orthogonal

layer_spd = nn.Linear(3, 3)
parametrize.register_parametrization(layer_spd, "weight", Symmetric())
parametrize.register_parametrization(layer_spd, "weight", MatrixExponential())
X = layer_spd.weight
print(torch.dist(X, X.T))                      # X is symmetric
print((torch.linalg.eigh(X).eigenvalues > 0.).all())  # X is positive definite
tensor(1.7764e-07, grad_fn=<DistBackward0>)
tensor(9.6571e-08, grad_fn=<DistBackward0>)
tensor(True)

Initializing parametrizations#

Parametrizations come with a mechanism to initialize them. If we implement a method right_inverse with signature

def right_inverse(self, X: Tensor) -> Tensor

it will be used when assigning to the parametrized tensor.

Let's upgrade our implementation of the Skew class to support this:

class Skew(nn.Module):
    def forward(self, X):
        A = X.triu(1)
        return A - A.transpose(-1, -2)

    def right_inverse(self, A):
        # We assume that A is skew-symmetric
        # We take the upper-triangular elements, as these are those used in the forward
        return A.triu(1)

We may now initialize a layer that is parametrized with Skew:

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Skew())
X = torch.rand(3, 3)
X = X - X.T                             # X is now skew-symmetric
layer.weight = X                        # Initialize layer.weight to be X
print(torch.dist(layer.weight, X))      # layer.weight == X
tensor(0., grad_fn=<DistBackward0>)

This right_inverse works as expected when we concatenate parametrizations. To see this, let's upgrade the Cayley parametrization to also support being initialized:

class CayleyMap(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.register_buffer("Id", torch.eye(n))

    def forward(self, X):
        # Assume X skew-symmetric
        # (I + X)(I - X)^{-1}
        return torch.linalg.solve(self.Id - X, self.Id + X)

    def right_inverse(self, A):
        # Assume A orthogonal
        # See https://en.wikipedia.org/wiki/Cayley_transform#Matrix_map
        # (A - I)(A + I)^{-1}
        return torch.linalg.solve(self.Id + A, A - self.Id)

layer_orthogonal = nn.Linear(3, 3)
parametrize.register_parametrization(layer_orthogonal, "weight", Skew())
parametrize.register_parametrization(layer_orthogonal, "weight", CayleyMap(3))
# Sample an orthogonal matrix with positive determinant
X = torch.empty(3, 3)
nn.init.orthogonal_(X)
if X.det() < 0.:
    X[0].neg_()
layer_orthogonal.weight = X
print(torch.dist(layer_orthogonal.weight, X))  # layer_orthogonal.weight == X
tensor(2.2423e-07, grad_fn=<DistBackward0>)

This initialization step can be written more succinctly as:

layer_orthogonal.weight = nn.init.orthogonal_(layer_orthogonal.weight)

The name of this method comes from the fact that we would often expect that forward(right_inverse(X)) == X. This is a direct way of saying that the forward pass after initializing with value X should return the value X. That said, it can sometimes be interesting to relax this relation. For example, consider the following implementation of a randomized pruning method:

class PruningParametrization(nn.Module):
    def __init__(self, X, p_drop=0.2):
        super().__init__()
        # sample zeros with probability p_drop
        mask = torch.full_like(X, 1.0 - p_drop)
        self.mask = torch.bernoulli(mask)

    def forward(self, X):
        return X * self.mask

    def right_inverse(self, A):
        return A

In this case, it is not true that forward(right_inverse(A)) == A for every matrix A; it only holds when A has zeros in the same positions as the mask. Even so, if we assign a tensor to a pruned parameter, it comes as no surprise that the tensor will, in fact, be pruned:

layer = nn.Linear(3, 4)
X = torch.rand_like(layer.weight)
print(f"Initialization matrix:\n{X}")
parametrize.register_parametrization(layer, "weight", PruningParametrization(layer.weight))
layer.weight = X
print(f"\nInitialized weight:\n{layer.weight}")
Initialization matrix:
tensor([[0.4091, 0.5341, 0.9634],
        [0.1564, 0.7707, 0.7291],
        [0.8022, 0.4453, 0.7149],
        [0.5246, 0.1759, 0.7719]])

Initialized weight:
tensor([[0.0000, 0.5341, 0.9634],
        [0.1564, 0.7707, 0.7291],
        [0.8022, 0.4453, 0.7149],
        [0.0000, 0.1759, 0.7719]], grad_fn=<MulBackward0>)

Removing parametrizations#

We can remove all the parametrizations from a parameter or a buffer in a module by using parametrize.remove_parametrizations():

layer = nn.Linear(3, 3)
print("Before:")
print(layer)
print(layer.weight)
parametrize.register_parametrization(layer, "weight", Skew())
print("\nParametrized:")
print(layer)
print(layer.weight)
parametrize.remove_parametrizations(layer, "weight")
print("\nAfter. Weight has skew-symmetric values but it is unconstrained:")
print(layer)
print(layer.weight)
Before:
Linear(in_features=3, out_features=3, bias=True)
Parameter containing:
tensor([[ 0.3413, -0.2740,  0.3453],
        [ 0.0479, -0.4459, -0.5757],
        [ 0.0605,  0.5496,  0.4016]], requires_grad=True)

Parametrized:
ParametrizedLinear(
  in_features=3, out_features=3, bias=True
  (parametrizations): ModuleDict(
    (weight): ParametrizationList(
      (0): Skew()
    )
  )
)
tensor([[ 0.0000, -0.2740,  0.3453],
        [ 0.2740,  0.0000, -0.5757],
        [-0.3453,  0.5757,  0.0000]], grad_fn=<SubBackward0>)

After. Weight has skew-symmetric values but it is unconstrained:
Linear(in_features=3, out_features=3, bias=True)
Parameter containing:
tensor([[ 0.0000, -0.2740,  0.3453],
        [ 0.2740,  0.0000, -0.5757],
        [-0.3453,  0.5757,  0.0000]], requires_grad=True)

When removing a parametrization, we may choose to leave the original parameter (i.e. the one in layer.parametrizations.weight.original) rather than its parametrized version by setting the flag leave_parametrized=False:

layer = nn.Linear(3, 3)
print("Before:")
print(layer)
print(layer.weight)
parametrize.register_parametrization(layer, "weight", Skew())
print("\nParametrized:")
print(layer)
print(layer.weight)
parametrize.remove_parametrizations(layer, "weight", leave_parametrized=False)
print("\nAfter. Same as Before:")
print(layer)
print(layer.weight)
Before:
Linear(in_features=3, out_features=3, bias=True)
Parameter containing:
tensor([[ 0.2074,  0.0732,  0.3313],
        [-0.5749, -0.4251,  0.1003],
        [-0.1283, -0.0165,  0.5685]], requires_grad=True)

Parametrized:
ParametrizedLinear(
  in_features=3, out_features=3, bias=True
  (parametrizations): ModuleDict(
    (weight): ParametrizationList(
      (0): Skew()
    )
  )
)
tensor([[ 0.0000,  0.0732,  0.3313],
        [-0.0732,  0.0000,  0.1003],
        [-0.3313, -0.1003,  0.0000]], grad_fn=<SubBackward0>)

After. Same as Before:
Linear(in_features=3, out_features=3, bias=True)
Parameter containing:
tensor([[0.0000, 0.0732, 0.3313],
        [0.0000, 0.0000, 0.1003],
        [0.0000, 0.0000, 0.0000]], requires_grad=True)