MP6: AutoVC
In this MP, you will construct neural network layers using PyTorch for use with the system described in "AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss" (ICML 2019) by Kaizhi Qian et al. (of the professor's own group!). The file you actually need to complete is submitted.py. The unit tests, provided in tests/test_visible.py, may be run using grade.py.
import numpy as np
import matplotlib.figure
import matplotlib.pyplot as plt
%matplotlib inline
import torch
torch.manual_seed(417)
np.random.seed(417)
import importlib
import submitted
importlib.reload(submitted)
<module 'submitted' from 'C:\\Users\\mahir.000\\Desktop\\ece417\\ece417_mp6\\src\\submitted.py'>
PyTorch basics
For this MP, we will be primarily using the PyTorch machine learning framework.
Compared to other machine learning frameworks in current use, PyTorch has achieved wide adoption across disciplines, both in research and in production, thanks to the depth of its functionality and to how much the library handles automatically (especially its simple automatic gradient calculation interface). This leaves the user with relatively little extra work when implementing a given machine learning model.
A few comparisons with NumPy hint at how easy PyTorch is to use. To start with the basics: just as the primary object of manipulation in NumPy is the N-dimensional array (ndarray), PyTorch's object of choice is the N-dimensional tensor.
x = np.array([1,2,3,4,5])
y = torch.Tensor([1,2,3,4,5])
print(torch.zeros(5), np.zeros(5))
print(torch.ones(5), np.ones(5))
print(torch.stack([torch.randn(4) for _ in range(3)]),np.stack([np.random.rand(4) for _ in range(3)]))
tensor([0., 0., 0., 0., 0.]) [0. 0. 0. 0. 0.]
tensor([1., 1., 1., 1., 1.]) [1. 1. 1. 1. 1.]
tensor([[ 0.6558, -0.7298, -1.3288, 0.4065],
[-0.7632, -0.4722, 1.2657, 0.0846],
[ 0.0128, 0.5777, -0.4009, 0.1150]]) [[0.01897779 0.98651468 0.09738904 0.53350548]
[0.26298251 0.9601228 0.19629425 0.42507458]
[0.33849926 0.20808288 0.40207791 0.51957192]]
The two behave very similarly, since many methods used in PyTorch are designed with their direct equivalents in NumPy in mind:
randa, randb = torch.randn(5), np.random.rand(5)
print(randa[1:3], randb[1:3])
print(randa.unsqueeze(0).T, np.expand_dims(randb, 0).T)
print(torch.cos(torch.sin(randa)), np.cos(np.sin(randb)))
print(torch.outer(randa, torch.from_numpy(randb)), np.outer(randa.numpy(), randb))
print(torch.cumsum(randa, 0), np.cumsum(randb))
print(torch.msort(randa), np.msort(randb))
print(torch.argmax(randa), np.argmax(randb))
tensor([-0.0171, 0.7800]) [0.27820731 0.73292514]
tensor([[-1.7107],
[-0.0171],
[ 0.7800],
[ 0.1523],
[ 0.8582]]) [[0.80200273]
[0.27820731]
[0.73292514]
[0.96839393]
[0.98492235]]
tensor([0.5485, 0.9999, 0.7627, 0.9885, 0.7271]) [0.75262939 0.96252497 0.78441341 0.67930824 0.67248935]
tensor([[-1.3720, -0.4759, -1.2538, -1.6566, -1.6849],
[-0.0137, -0.0048, -0.0126, -0.0166, -0.0169],
[ 0.6255, 0.2170, 0.5717, 0.7553, 0.7682],
[ 0.1222, 0.0424, 0.1116, 0.1475, 0.1500],
[ 0.6883, 0.2388, 0.6290, 0.8311, 0.8453]], dtype=torch.float64) [[-1.37200102 -0.47593443 -1.2538287 -1.65664955 -1.68492502]
[-0.01374873 -0.00476931 -0.01256453 -0.01660118 -0.01688452]
[ 0.62553075 0.21699082 0.57165293 0.75530938 0.7682009 ]
[ 0.12217011 0.04237968 0.11164743 0.1475167 0.1500345 ]
[ 0.6882811 0.23875833 0.62899851 0.83107852 0.84526325]]
tensor([-1.7107, -1.7279, -0.9479, -0.7956, 0.0626]) [0.80200273 1.08021004 1.81313518 2.78152912 3.76645147]
tensor([-1.7107, -0.0171, 0.1523, 0.7800, 0.8582]) [0.27820731 0.73292514 0.80200273 0.96839393 0.98492235]
tensor(4) 4
Of course, as useful as these simple functions may be, they're certainly not the entire story.
A typical PyTorch model consists of, at minimum, a class with two procedures:
an initialization method (as with any Python class), in which one assembles a number of layers into a coherent whole, and
a forward method, in which said layers are used in the forward propagation of a number of inputs.
The layers of a neural network are all typically organized into modules, which when combined form a graph of computations (think the graph in lecture 22 slide 24) based on which gradients can be computed:
class MyFirstModule(torch.nn.Module):
    # the initialization method, with any number of input parameters as desired
    def __init__(self, input_size, output_size):
        # every module must be subclassed from nn.Module and thus be initialized with respect to it
        super(MyFirstModule, self).__init__()
        # we can declare a number of layers
        self.linear1 = torch.nn.Linear(input_size, 35)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(35, 100)
        self.relu6 = torch.nn.ReLU6()
        self.linear3 = torch.nn.Linear(100, output_size)
        self.selu = torch.nn.SELU()

    def forward(self, module_input):
        return self.selu(
            self.linear3(
                self.relu6(
                    self.linear2(
                        self.relu(
                            self.linear1(module_input))))))
Since each of the layers takes in exactly one input tensor and returns exactly one output tensor, a more concise way to define the class above might be as follows:
class MyFirstModule(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(MyFirstModule, self).__init__()
        # wrap all layers in a Sequential container (which is itself a Module)
        layer_list = [
            torch.nn.Linear(input_size, 35),
            torch.nn.ReLU(),
            torch.nn.Linear(35, 100),
            torch.nn.ReLU6(),
            torch.nn.Linear(100, output_size),
            torch.nn.SELU()
        ]
        self.layers = torch.nn.Sequential(*layer_list)

    def forward(self, module_input):
        return self.layers(module_input)
We can of course directly manipulate an instance of this module after having constructed it, for instance when needing to load saved parameters:
mfm = MyFirstModule(10,5)
for parameter_name, parameter_value in mfm.named_parameters():
    # randomizing-out all parameters because we can
    # (assign through .data so the change actually reaches the stored Parameter;
    #  rebinding the loop variable alone would leave the model untouched)
    parameter_value.data = torch.randn(parameter_value.shape)
    print("Set", parameter_name, "to random values")
Set layers.0.weight to random values
Set layers.0.bias to random values
Set layers.2.weight to random values
Set layers.2.bias to random values
Set layers.4.weight to random values
Set layers.4.bias to random values
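In practice, saved parameters are usually restored in one shot through the module's state dict rather than assigned one at a time. A minimal sketch (the file name my_first_module.pt is hypothetical):
# save the current parameters to disk...
torch.save(mfm.state_dict(), "my_first_module.pt")
# ...and later restore them into a freshly constructed module
mfm_restored = MyFirstModule(10, 5)
mfm_restored.load_state_dict(torch.load("my_first_module.pt"))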
To obtain an output from this model, we can just call it with an input (the arguments after self in the forward method):
current_input = torch.randn(2,10)
desired_output = torch.randn(2,5)
current_output = mfm(current_input)
We can use any of the loss functions provided in torch.nn to obtain a metric for model performance. To then obtain the gradients of this loss with respect to every parameter, all we need to do is call the backward method of the resulting loss tensor:
loss_function = torch.nn.MSELoss() # a Module, just like any other neural net layer
current_loss = loss_function(current_output,desired_output)
# calculate the gradients for each parameter in the model
current_loss.backward()
# now print these gradients
for parameter_name, parameter_value in mfm.named_parameters():
    print(parameter_name, parameter_value.grad)
layers.0.weight tensor([[ 1.5921e-03, -2.8884e-02, -1.8274e-02, -2.5900e-03, -1.5458e-02,
-1.9175e-02, -2.4949e-02, 2.5715e-02, 7.3068e-03, -1.3617e-02],
[ 4.8482e-02, -2.8644e-02, 2.0272e-03, 3.9995e-03, 5.9004e-03,
3.0402e-04, 4.8454e-03, 2.3276e-02, 6.0636e-03, 4.8217e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 7.7546e-02, -4.5816e-02, 3.2425e-03, 6.3971e-03, 9.4375e-03,
4.8628e-04, 7.7502e-03, 3.7230e-02, 9.6987e-03, 7.7123e-02],
[ 1.0122e-01, -5.9804e-02, 4.2325e-03, 8.3502e-03, 1.2319e-02,
6.3475e-04, 1.0116e-02, 4.8597e-02, 1.2660e-02, 1.0067e-01],
[-7.0236e-02, 4.4936e-02, -6.7954e-04, -5.4592e-03, -6.6216e-03,
1.9207e-03, -3.9294e-03, -3.6792e-02, -9.6592e-03, -6.7982e-02],
[ 6.7414e-02, 4.9666e-04, 2.9288e-02, 9.4887e-03, 3.0793e-02,
2.8110e-02, 4.2973e-02, -3.6418e-03, -1.8261e-03, 8.8983e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[-7.8391e-03, -2.3531e-03, -4.9123e-03, -1.3269e-03, -4.8664e-03,
-4.8446e-03, -7.0596e-03, 2.4730e-03, 7.9619e-04, -1.1596e-02],
[ 1.6632e-02, -9.8265e-03, 6.9544e-04, 1.3720e-03, 2.0241e-03,
1.0430e-04, 1.6622e-03, 7.9851e-03, 2.0802e-03, 1.6541e-02],
[ 7.0787e-04, -1.2842e-02, -8.1252e-03, -1.1516e-03, -6.8729e-03,
-8.5254e-03, -1.1093e-02, 1.1433e-02, 3.2487e-03, -6.0545e-03],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[-1.8115e-02, -2.3038e-02, -2.2904e-02, -4.7805e-03, -2.1104e-02,
-2.3279e-02, -3.2129e-02, 2.1430e-02, 6.3167e-03, -3.6371e-02],
[-7.6777e-04, 1.3929e-02, 8.8128e-03, 1.2490e-03, 7.4546e-03,
9.2469e-03, 1.2032e-02, -1.2401e-02, -3.5237e-03, 6.5669e-03],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[-5.9929e-02, 1.0144e-01, 4.0839e-02, 1.4875e-03, 2.9696e-02,
4.4963e-02, 5.3348e-02, -8.7737e-02, -2.4293e-02, -2.3679e-02],
[ 2.0305e-04, -3.6838e-03, -2.3307e-03, -3.3033e-04, -1.9715e-03,
-2.4455e-03, -3.1820e-03, 3.2796e-03, 9.3190e-04, -1.7367e-03],
[-9.8917e-02, 9.9451e-02, 2.2781e-02, -4.1662e-03, 1.0932e-02,
2.7535e-02, 2.6963e-02, -8.4108e-02, -2.2803e-02, -7.6069e-02],
[-8.3610e-04, 1.5169e-02, 9.5971e-03, 1.3602e-03, 8.1180e-03,
1.0070e-02, 1.3102e-02, -1.3505e-02, -3.8373e-03, 7.1513e-03],
[ 2.0367e-01, -1.2034e-01, 8.5163e-03, 1.6802e-02, 2.4788e-02,
1.2772e-03, 2.0356e-02, 9.7785e-02, 2.5473e-02, 2.0256e-01],
[-1.9947e-03, 3.6189e-02, 2.2896e-02, 3.2451e-03, 1.9368e-02,
2.4024e-02, 3.1259e-02, -3.2218e-02, -9.1548e-03, 1.7061e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[-2.4398e-02, 1.4415e-02, -1.0202e-03, -2.0127e-03, -2.9693e-03,
-1.5300e-04, -2.4384e-03, -1.1713e-02, -3.0514e-03, -2.4265e-02],
[ 1.0981e-01, -5.0660e-02, 1.3924e-02, 1.0443e-02, 2.1328e-02,
1.0450e-02, 2.3750e-02, 4.0025e-02, 1.0117e-02, 1.1694e-01],
[-5.9253e-02, 3.5008e-02, -2.4776e-03, -4.8880e-03, -7.2112e-03,
-3.7157e-04, -5.9219e-03, -2.8448e-02, -7.4108e-03, -5.8930e-02],
[ 1.6658e-01, -1.3658e-01, -1.8083e-02, 1.0026e-02, -1.1017e-03,
-2.5156e-02, -1.7641e-02, 1.1405e-01, 3.0542e-02, 1.4492e-01],
[-3.5633e-02, 2.1053e-02, -1.4900e-03, -2.9396e-03, -4.3367e-03,
-2.2345e-04, -3.5613e-03, -1.7108e-02, -4.4567e-03, -3.5439e-02],
[-3.6142e-02, 2.1354e-02, -1.5112e-03, -2.9815e-03, -4.3986e-03,
-2.2664e-04, -3.6121e-03, -1.7352e-02, -4.5203e-03, -3.5945e-02],
[ 7.9156e-02, -4.6768e-02, 3.3098e-03, 6.5300e-03, 9.6335e-03,
4.9638e-04, 7.9111e-03, 3.8004e-02, 9.9001e-03, 7.8725e-02],
[ 3.5131e-04, -6.3734e-03, -4.0324e-03, -5.7152e-04, -3.4110e-03,
-4.2311e-03, -5.5053e-03, 5.6742e-03, 1.6123e-03, -3.0048e-03]])
layers.0.bias tensor([ 0.0196, -0.0214, 0.0000, 0.0000, 0.0000, -0.0342, -0.0447, 0.0285,
-0.0591, 0.0000, 0.0000, 0.0085, -0.0073, 0.0087, 0.0000, 0.0325,
-0.0095, 0.0000, -0.0215, 0.0025, 0.0139, -0.0103, -0.0900, -0.0246,
0.0000, 0.0000, 0.0000, 0.0108, -0.0588, 0.0262, -0.0458, 0.0157,
0.0160, -0.0350, 0.0043])
layers.2.weight tensor([[ 0.0000e+00, -2.3097e-03, 0.0000e+00, ..., -2.8995e-02,
-1.6771e-02, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[-1.1915e-02, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, -1.7853e-02],
...,
[ 0.0000e+00, 2.5191e-03, 0.0000e+00, ..., 3.1623e-02,
1.8291e-02, 0.0000e+00],
[-7.1061e-03, -1.7084e-05, 0.0000e+00, ..., -2.1446e-04,
-1.2405e-04, -1.0648e-02],
[ 0.0000e+00, 1.4920e-03, 0.0000e+00, ..., 1.8730e-02,
1.0834e-02, 0.0000e+00]])
layers.2.bias tensor([-6.8678e-02, 0.0000e+00, -2.0062e-02, 4.3527e-02, 4.1833e-02,
0.0000e+00, 4.3875e-02, 0.0000e+00, 6.2778e-02, 0.0000e+00,
-6.3562e-02, -9.4380e-02, -1.1849e-02, 8.3106e-02, 0.0000e+00,
-1.6889e-05, 0.0000e+00, -3.8472e-02, -1.6180e-02, 3.1494e-02,
-7.2846e-02, -2.2661e-02, 0.0000e+00, -8.8434e-02, -4.8880e-02,
-9.6579e-03, 0.0000e+00, 0.0000e+00, -7.4028e-03, 0.0000e+00,
5.7478e-02, 0.0000e+00, -7.8950e-02, 0.0000e+00, 1.0900e-02,
-1.2709e-02, -1.7813e-02, 0.0000e+00, 1.8243e-02, -1.3147e-02,
0.0000e+00, -4.4632e-02, -2.1418e-02, 0.0000e+00, 4.3657e-02,
8.5483e-02, 0.0000e+00, -2.7920e-02, 1.1365e-02, 0.0000e+00,
5.7112e-02, -5.2439e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, -8.9939e-03, 6.2273e-03, 0.0000e+00, -7.8890e-02,
0.0000e+00, 0.0000e+00, -6.7656e-02, -1.6824e-02, 0.0000e+00,
0.0000e+00, 9.0385e-02, 1.9273e-02, 0.0000e+00, -2.2012e-02,
0.0000e+00, -4.6660e-03, -2.5817e-02, -3.8371e-02, -2.1670e-02,
-3.0830e-03, -3.2138e-02, 4.1012e-02, -1.7648e-02, 0.0000e+00,
3.3021e-02, -5.9520e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
5.2843e-02, -7.1051e-02, 4.0850e-02, 0.0000e+00, 0.0000e+00,
-6.1965e-03, 0.0000e+00, -2.1183e-02, 0.0000e+00, -7.1938e-02,
0.0000e+00, 2.1085e-02, 7.4903e-02, -1.2473e-02, 4.4364e-02])
layers.4.weight tensor([[-6.7523e-02, 0.0000e+00, 9.0370e-03, -6.5435e-02, 2.7447e-03,
0.0000e+00, -7.6078e-02, 0.0000e+00, -1.3680e-02, 0.0000e+00,
-9.4432e-02, -4.5662e-03, 7.1972e-03, -2.2315e-02, 0.0000e+00,
-9.5683e-02, 0.0000e+00, -4.5218e-02, -7.9139e-02, -4.1864e-02,
-4.6618e-03, 2.7548e-02, 0.0000e+00, -3.0045e-02, -4.3538e-03,
4.6304e-03, 0.0000e+00, 0.0000e+00, -3.2340e-02, 0.0000e+00,
-1.5212e-02, 0.0000e+00, -3.6960e-02, 0.0000e+00, -2.0725e-02,
-7.4335e-04, 1.4566e-02, 0.0000e+00, 8.1939e-03, 1.3637e-02,
0.0000e+00, -1.6721e-02, -9.1496e-03, 0.0000e+00, -1.3181e-01,
-1.1075e-01, 0.0000e+00, 2.3071e-03, -1.2785e-01, 0.0000e+00,
-1.2130e-02, -1.0071e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, -1.5816e-02, -1.5375e-01, 0.0000e+00, -1.5986e-01,
0.0000e+00, 0.0000e+00, -7.9905e-02, -1.0209e-01, 0.0000e+00,
0.0000e+00, -1.0206e-01, 6.7378e-03, 0.0000e+00, 6.6634e-03,
0.0000e+00, -3.8311e-02, 1.0815e-02, -1.5496e-01, 1.0542e-02,
-8.8653e-02, 3.9243e-03, 2.2292e-03, -2.5498e-02, 0.0000e+00,
-6.3205e-02, -9.6039e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00,
-9.6072e-02, -1.3223e-01, -8.0264e-02, 0.0000e+00, 0.0000e+00,
6.2892e-03, 0.0000e+00, -3.5603e-02, 0.0000e+00, -3.6629e-02,
0.0000e+00, 1.3718e-02, -2.8573e-02, -1.1925e-02, -7.5153e-02],
[-9.2272e-04, 0.0000e+00, -4.4959e-02, -8.9419e-04, -1.3655e-02,
0.0000e+00, -1.1924e-01, 0.0000e+00, -6.8391e-02, 0.0000e+00,
-1.1422e-01, -6.2399e-05, -3.5806e-02, -6.2933e-03, 0.0000e+00,
-1.3075e-03, 0.0000e+00, -6.1791e-04, -1.0815e-03, -1.0864e-01,
-6.3705e-05, -1.3705e-01, 0.0000e+00, -4.1057e-04, -5.9495e-05,
-2.3036e-02, 0.0000e+00, 0.0000e+00, -4.4193e-04, 0.0000e+00,
-2.0787e-04, 0.0000e+00, -8.7952e-02, 0.0000e+00, -1.0777e-02,
-1.9212e-02, -7.2468e-02, 0.0000e+00, -4.0765e-02, -6.7845e-02,
0.0000e+00, -6.1456e-02, -1.2503e-04, 0.0000e+00, -1.1674e-01,
-1.6963e-02, 0.0000e+00, -1.1478e-02, -3.8022e-02, 0.0000e+00,
-3.7210e-02, -6.6990e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, -2.1613e-04, -2.1010e-03, 0.0000e+00, -9.2028e-03,
0.0000e+00, 0.0000e+00, -2.1831e-02, -4.3847e-02, 0.0000e+00,
0.0000e+00, -4.8261e-02, -3.3520e-02, 0.0000e+00, -3.3150e-02,
0.0000e+00, -5.2353e-04, -5.3803e-02, -7.2773e-02, -5.2444e-02,
-3.1929e-02, -1.9523e-02, -1.1090e-02, -3.4371e-02, 0.0000e+00,
-1.1081e-01, -1.3124e-04, 0.0000e+00, 0.0000e+00, 0.0000e+00,
-1.3129e-03, -5.7174e-02, -1.2882e-01, 0.0000e+00, 0.0000e+00,
-3.1289e-02, 0.0000e+00, -1.0374e-01, 0.0000e+00, -7.6372e-03,
0.0000e+00, -6.8248e-02, -3.9046e-04, -5.8478e-02, -1.0270e-03],
[-7.6839e-02, 0.0000e+00, 5.6959e-02, -7.4464e-02, 1.7300e-02,
0.0000e+00, 3.5798e-02, 0.0000e+00, 5.5046e-02, 0.0000e+00,
9.4583e-03, -5.1962e-03, 4.5363e-02, -1.9194e-02, 0.0000e+00,
-1.0888e-01, 0.0000e+00, -5.1457e-02, -9.0059e-02, 6.4246e-02,
-5.3050e-03, 1.7363e-01, 0.0000e+00, -3.4191e-02, -4.9545e-03,
2.9185e-02, 0.0000e+00, 0.0000e+00, -3.6802e-02, 0.0000e+00,
-1.7310e-02, 0.0000e+00, 4.8476e-02, 0.0000e+00, -1.2720e-02,
1.9034e-02, 9.1811e-02, 0.0000e+00, 5.1645e-02, 8.5953e-02,
0.0000e+00, 4.4363e-02, -1.0412e-02, 0.0000e+00, -3.1002e-02,
-1.1003e-01, 0.0000e+00, 1.4541e-02, -1.0793e-01, 0.0000e+00,
2.4549e-02, -4.6677e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, -1.7998e-02, -1.7496e-01, 0.0000e+00, -1.7465e-01,
0.0000e+00, 0.0000e+00, -6.9458e-02, -7.2223e-02, 0.0000e+00,
0.0000e+00, -6.7618e-02, 4.2467e-02, 0.0000e+00, 4.1998e-02,
0.0000e+00, -4.3597e-02, 6.8164e-02, -1.0319e-01, 6.6442e-02,
-6.9082e-02, 2.4734e-02, 1.4050e-02, 6.2087e-03, 0.0000e+00,
4.1904e-02, -1.0929e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
-1.0933e-01, -9.3146e-02, 4.0894e-02, 0.0000e+00, 0.0000e+00,
3.9640e-02, 0.0000e+00, 6.6381e-02, 0.0000e+00, -3.4294e-02,
0.0000e+00, 8.6463e-02, -3.2516e-02, 4.6804e-02, -8.5523e-02],
[ 1.5439e-01, 0.0000e+00, 9.6841e-03, 1.4962e-01, 2.9413e-03,
0.0000e+00, 2.5352e-01, 0.0000e+00, 7.7191e-02, 0.0000e+00,
2.9194e-01, 1.0441e-02, 7.7125e-03, 5.5055e-02, 0.0000e+00,
2.1878e-01, 0.0000e+00, 1.0339e-01, 1.8095e-01, 1.6847e-01,
1.0659e-02, 2.9520e-02, 0.0000e+00, 6.8699e-02, 9.9549e-03,
4.9620e-03, 0.0000e+00, 0.0000e+00, 7.3945e-02, 0.0000e+00,
3.4781e-02, 0.0000e+00, 1.4337e-01, 0.0000e+00, 5.4453e-02,
1.4625e-02, 1.5609e-02, 0.0000e+00, 8.7806e-03, 1.4614e-02,
0.0000e+00, 7.9448e-02, 2.0921e-02, 0.0000e+00, 3.7876e-01,
2.6362e-01, 0.0000e+00, 2.4723e-03, 3.1674e-01, 0.0000e+00,
5.2671e-02, 2.7445e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 3.6163e-02, 3.5155e-01, 0.0000e+00, 3.7024e-01,
0.0000e+00, 0.0000e+00, 1.9666e-01, 2.6200e-01, 0.0000e+00,
0.0000e+00, 2.6491e-01, 7.2202e-03, 0.0000e+00, 7.1405e-03,
0.0000e+00, 8.7598e-02, 1.1589e-02, 4.0188e-01, 1.1296e-02,
2.2338e-01, 4.2053e-03, 2.3888e-03, 8.1204e-02, 0.0000e+00,
2.1853e-01, 2.1959e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00,
2.1967e-01, 3.3961e-01, 2.6950e-01, 0.0000e+00, 0.0000e+00,
6.7395e-03, 0.0000e+00, 1.5091e-01, 0.0000e+00, 8.8556e-02,
0.0000e+00, 1.4700e-02, 6.5333e-02, 6.6522e-02, 1.7184e-01],
[ 2.4896e-02, 0.0000e+00, -2.6980e-02, 2.4126e-02, -8.1946e-03,
0.0000e+00, -3.3951e-02, 0.0000e+00, -3.0734e-02, 0.0000e+00,
-2.4421e-02, 1.6836e-03, -2.1488e-02, 5.0865e-03, 0.0000e+00,
3.5279e-02, 0.0000e+00, 1.6672e-02, 2.9179e-02, -4.1253e-02,
1.7188e-03, -8.2246e-02, 0.0000e+00, 1.1078e-02, 1.6053e-03,
-1.3824e-02, 0.0000e+00, 0.0000e+00, 1.1924e-02, 0.0000e+00,
5.6086e-03, 0.0000e+00, -3.2244e-02, 0.0000e+00, 2.1369e-03,
-9.7984e-03, -4.3489e-02, 0.0000e+00, -2.4463e-02, -4.0715e-02,
0.0000e+00, -2.5953e-02, 3.3735e-03, 0.0000e+00, -1.1692e-02,
3.2728e-02, 0.0000e+00, -6.8879e-03, 2.8109e-02, 0.0000e+00,
-1.4960e-02, 2.7151e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 5.8314e-03, 5.6688e-02, 0.0000e+00, 5.5259e-02,
0.0000e+00, 0.0000e+00, 1.8583e-02, 1.5372e-02, 0.0000e+00,
0.0000e+00, 1.3045e-02, -2.0116e-02, 0.0000e+00, -1.9894e-02,
0.0000e+00, 1.4125e-02, -3.2288e-02, 2.0071e-02, -3.1472e-02,
1.6574e-02, -1.1716e-02, -6.6554e-03, -8.4458e-03, 0.0000e+00,
-3.4369e-02, 3.5410e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00,
3.5422e-02, 1.9709e-02, -3.7403e-02, 0.0000e+00, 0.0000e+00,
-1.8777e-02, 0.0000e+00, -4.1033e-02, 0.0000e+00, 9.7615e-03,
0.0000e+00, -4.0956e-02, 1.0535e-02, -2.6193e-02, 2.7709e-02]])
layers.4.bias tensor([-0.2561, -0.2397, -0.0469, 0.7446, -0.0295])
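Once gradients have been computed, they are usually consumed by an optimizer from torch.optim rather than applied by hand. Here is a minimal sketch of a single training step; the choice of SGD and the learning rate of 1e-3 are arbitrary, for illustration only:
optimizer = torch.optim.SGD(mfm.parameters(), lr=1e-3)
optimizer.zero_grad()                                            # clear gradients left over from earlier calls to backward()
current_loss = loss_function(mfm(current_input), desired_output)
current_loss.backward()                                          # recompute gradients for every parameter
optimizer.step()                                                 # nudge each parameter along its negative gradient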
AutoVC
AutoVC, put very simply, is a zero-shot style transfer autoencoder for voice conversion.
(There's a lot to unpack in that sentence, so read on...)
"style transfer"
The primary assumption that AutoVC makes is that any given speech utterance depends on two components, each separately distributed (Sec. 3.1):
1) a content-specific component, corresponding roughly to the information about a sentence that would be captured in a textual transcription, and
2) a speaker-specific component, carrying information about how a given individual vocally produces that sentence.
A converted utterance that adopts the speaker-specific information of a target speaker should sound as much like that target speaker as possible, while keeping the content-specific information unchanged (Eq. 2). Achieving this requires that the two components can be readily disentangled.
"autoencoder"
An autoencoder (Fig. 1) is a combination of an encoder network and a decoder network, the output of the former serving as the input of the latter. It is often used to learn a lower-dimensional representation or 'embedding' of a given piece of data; because some information is lost in this dimension reduction, the reduction may be considered an information 'bottleneck'. An autoencoder is often trained to approximate its input as closely as possible, improving the quality of the embedding in the process.
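To make the bottleneck idea concrete, here is a minimal, purely illustrative autoencoder in the style of the modules above; the sizes are arbitrary and are not the ones AutoVC uses.
class TinyAutoencoder(torch.nn.Module):
    def __init__(self, input_size=80, bottleneck_size=8):
        super(TinyAutoencoder, self).__init__()
        # the encoder squeezes the input through a low-dimensional bottleneck...
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_size, bottleneck_size),
            torch.nn.ReLU())
        # ...and the decoder tries to reconstruct the original input from that embedding
        self.decoder = torch.nn.Linear(bottleneck_size, input_size)

    def forward(self, x):
        return self.decoder(self.encoder(x))
# training would then minimize a reconstruction loss such as torch.nn.MSELoss()(TinyAutoencoder()(x), x)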
AutoVC's encoder attempts to output an embedding for the content-specific component of an utterance by one speaker. This output, together with a similarly produced speaker-specific embedding from an utterance by another speaker, is then fed into AutoVC's decoder to yield a converted utterance. It is the size of the bottleneck (Sec. 3.3) that is tuned to ensure that the content embedding contains as little residual information about the first speaker as possible.
"zero-shot"
Most, if not all, prior voice conversion systems require that both the source and the target speakers be seen by the system during training. AutoVC, however, is able to handle speakers that it never encountered in training. This ability stems largely from the speaker embedding being more than just a one-hot vector: a separate encoder is trained to generate it.
Trying it yourself
Provided for you are two files, source_utterance.flac and target_utterance.flac. Once you have completed the main MP, you can try AutoVC out, attempting to convert the voice in the source utterance into the voice in the target utterance, by running python _main.py and listening to the file converted_utterance.flac. You may also specify the files to use manually (for instance, transferring the voice in a.wav into b.wav and saving the result in c.wav), by running python _main.py b.wav a.wav c.wav.
What to deliver
This MP primarily consists of implementing several PyTorch modules. Some of them correspond to existing PyTorch layers, while others are direct re-implementations of AutoVC components. (Most of the code you do not have to write is adapted from Kaizhi's original code and an adjusted version thereof.)
Each function to write has type hints in its signature for both inputs and output. In the line def f(x: int, y: float) -> str:, x is an integer, y is a floating-point number, and the output from calling f(x,y) is a string.
The hints used for individual tensors are supplied by the torchtyping package, which can make PyTorch code you write somewhat easier to understand since it allows you to specify information about dimensions. Here's a brief summary of what you will encounter in the hints:
Each hint is of the form TensorType[ ... ], where ... is a comma-separated list of dimensions. Note that these names are not exposed to you in the function itself, so you will still need to extract them from the inputs.
For each dimension list, an arbitrary string is mapped to a single dimension, and later occurrences of that string refer to that same dimension. For example, the function def f(x: TensorType["batch", "length"]) -> TensorType["length", "batch"]: takes two-dimensional tensors of shape (batch, length) as input and returns tensors with those two dimensions transposed; a short sketch of working with such shapes follows this list.
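For instance, a function hinted as taking TensorType["batch", "length"] and returning TensorType["length", "batch"] still receives an ordinary tensor, so any sizes you need must be read off the tensor itself. A minimal sketch (the function name is made up for illustration):
def swap_batch_and_length(x):    # x: TensorType["batch", "length"]
    # the dimension names in the hint are not passed in; recover the sizes from the shape instead
    batch, length = x.shape
    return x.transpose(0, 1)     # TensorType["length", "batch"]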
1. Linear layers
The first Module you will design is a simple linear, fully-connected layer, appropriately named LineEar.
importlib.reload(submitted)
help(submitted.LineEar.__init__)
help(submitted.LineEar.forward)
Help on function __init__ in module submitted:
__init__(self, input_size: int, output_size: int) -> None
Sets up the following Parameters:
self.weight - A Parameter holding the weights of the layer,
of size (output_size, input_size).
self.bias - A Parameter holding the biases of the layer,
of size (output_size,).
You may also set other instance variables at this point, but these are not strictly necessary.
Help on function forward in module submitted:
forward(self, inputs: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'input_size',), 'cls_name': 'TensorType'}]) -> typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'output_size',), 'cls_name': 'TensorType'}]
Performs forward propagation of the inputs.
Input:
inputs - the inputs to the cell.
Output:
outputs - the outputs from the cell.
Note that all dimensions besides the last are preserved
between inputs and outputs.
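As a refresher on how trainable parameters are registered, here is a generic sketch of an affine layer built around torch.nn.Parameter; it shows one possible shape of such a module, not necessarily what your submission has to look like, and the random initialization below is purely illustrative:
class AffineSketch(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(AffineSketch, self).__init__()
        # wrapping a tensor in Parameter registers it with the module, so it appears
        # in named_parameters() and receives gradients during backward()
        self.weight = torch.nn.Parameter(torch.randn(output_size, input_size))
        self.bias = torch.nn.Parameter(torch.randn(output_size))

    def forward(self, inputs):
        # multiplying by the transposed weight leaves all leading (batch) dimensions intact
        return inputs @ self.weight.T + self.bias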
importlib.reload(submitted)
layer = submitted.LineEar(11,2)
layer.weight = torch.nn.Parameter(torch.Tensor([[0, 1, 4, 13, 28, 33, 47, 54, 64, 70, 72], [0, 1, 9, 19, 24, 31, 52, 56, 58, 69, 72]])) # 11-length Golomb rulers
layer.bias = torch.nn.Parameter(torch.randn_like(layer.bias))
current_input = torch.stack([sum(torch.eye(y+1,11)[y] for y in z) for z in torch.randint(11,size=(22,2))]) # each row is zero except for one (with value 2) or two (each with value 1)
current_output = layer(current_input)
print(current_output)
tensor([[ 41.2861, 42.7532],
[ 61.2861, 54.7532],
[142.2861, 140.7532],
[ 5.2861, 9.7532],
[ 58.2861, 64.7532],
[126.2861, 127.7532],
[142.2861, 140.7532],
[ 13.2861, 18.7532],
[ 75.2861, 75.7532],
[ 97.2861, 88.7532],
[ 77.2861, 76.7532],
[100.2861, 95.7532],
[126.2861, 127.7532],
[ 97.2861, 88.7532],
[ 58.2861, 64.7532],
[ 28.2861, 23.7532],
[ 33.2861, 30.7532],
[100.2861, 95.7532],
[ 71.2861, 69.7532],
[ 64.2861, 57.7532],
[ 28.2861, 23.7532],
[ 5.2861, 9.7532]], grad_fn=<AddBackward0>)
2. Long Short-Term Memory
The next Module you will design is an LSTM module, appropriately named EllEssTeeEmm, constructed from torch.nn.LSTMCell modules. In addition to handling more than one layer, it needs to handle an optional bidirectional mode:
importlib.reload(submitted)
help(submitted.EllEssTeeEmm.__init__)
help(submitted.EllEssTeeEmm.forward)
Help on function __init__ in module submitted:
__init__(self, input_size: int, hidden_size: int, num_layers: int, bidirectional: bool = False) -> None
Sets up the following:
self.forward_layers - A ModuleList of num_layers EllEssTeeEmmCell layers.
The first layer should have an input size of input_size
and an output size of hidden_size,
while all other layers should have input and output both of size hidden_size.
If bidirectional is True, then the following apply:
- self.reverse_layers - A ModuleList of num_layers EllEssTeeEmmCell layers,
of the exact same size and structure as self.forward_layers.
- In both self.forward_layers and self.reverse_layers,
all layers other than the first should have an input size of two times hidden_size.
Help on function forward in module submitted:
forward(self, x: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'length', 'input_size',), 'cls_name': 'TensorType'}]) -> typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'length', 'output_size',), 'cls_name': 'TensorType'}]
Performs the forward propagation of an EllEssTeeEmm layer.
Inputs:
x - The inputs to the cell.
Outputs:
output - The resulting (hidden state) output h.
If bidirectional was True when initializing the EllEssTeeEmm layer, then the "output_size"
of the output should be twice the hidden_size.
Otherwise, this "output_size" should be exactly the hidden size.
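The key pattern here is unrolling a cell by hand over the time dimension. A generic sketch with a single torch.nn.LSTMCell (the sizes are arbitrary, chosen only for illustration):
cell = torch.nn.LSTMCell(input_size=6, hidden_size=4)
x = torch.randn(1, 22, 6)              # (batch, length, input_size)
h = torch.zeros(x.shape[0], 4)         # initial hidden state
c = torch.zeros(x.shape[0], 4)         # initial cell state
outputs = []
for t in range(x.shape[1]):            # step through the sequence one frame at a time
    h, c = cell(x[:, t, :], (h, c))
    outputs.append(h)
output = torch.stack(outputs, dim=1)   # (batch, length, hidden_size)
A bidirectional layer would additionally run a second cell over the time-reversed sequence and concatenate the two resulting outputs along the last dimension.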
importlib.reload(submitted)
# 6-length Golomb rulers
more_rulers = torch.Tensor([[0,1,4,10,12,17],[0,1,4,10,15,17],[0,1,8,11,13,17],[0,1,8,12,14,17]])
normalized_rulers = [more_rulers/torch.norm(more_rulers,p=k,dim=1,keepdim=True) for k in range(1,5)]
layer = submitted.EllEssTeeEmm(6,4,2)
layer.forward_layers[0].weight_ih = torch.nn.Parameter(torch.cat(normalized_rulers))
layer.forward_layers[0].weight_hh = torch.nn.Parameter(torch.cat(normalized_rulers)[:,:4])
layer.forward_layers[1].weight_ih = torch.nn.Parameter(torch.cat(normalized_rulers)[:,1:5])
layer.forward_layers[1].weight_hh = torch.nn.Parameter(torch.cat(normalized_rulers)[:,2:])
layer.forward_layers[0].bias_ih = torch.nn.Parameter(torch.randn_like(layer.forward_layers[0].bias_ih))
layer.forward_layers[0].bias_hh = torch.nn.Parameter(torch.randn_like(layer.forward_layers[0].bias_hh))
layer.forward_layers[1].bias_ih = torch.nn.Parameter(torch.randn_like(layer.forward_layers[1].bias_ih))
layer.forward_layers[1].bias_hh = torch.nn.Parameter(torch.randn_like(layer.forward_layers[1].bias_hh))
current_input = torch.stack([sum(torch.eye(y+1,6)[y] for y in z) for z in torch.randint(6,size=(22,4))]).unsqueeze(0) # each row is zero except for one (with value 2) or two (each with value 1)
current_output = layer(current_input)
print(current_output)
tensor([[[-0.0539, -0.3133, 0.0547, 0.2526],
[-0.0661, -0.3994, 0.1034, 0.3050],
[-0.0200, -0.5075, 0.2088, 0.3843],
[ 0.0725, -0.6093, 0.3442, 0.4672],
[ 0.1486, -0.6592, 0.4361, 0.5158],
[ 0.2279, -0.7065, 0.5137, 0.5660],
[ 0.2810, -0.7280, 0.5519, 0.5954],
[ 0.3102, -0.7367, 0.5677, 0.6103],
[ 0.3548, -0.7604, 0.5997, 0.6391],
[ 0.3655, -0.7575, 0.5954, 0.6414],
[ 0.4028, -0.7790, 0.6249, 0.6674],
[ 0.3644, -0.7459, 0.5795, 0.6354],
[ 0.3626, -0.7532, 0.5907, 0.6391],
[ 0.3902, -0.7727, 0.6152, 0.6591],
[ 0.3865, -0.7640, 0.6029, 0.6532],
[ 0.2578, -0.6784, 0.4997, 0.5617],
[ 0.1430, -0.6204, 0.4426, 0.4906],
[ 0.1367, -0.6448, 0.4683, 0.4997],
[ 0.1737, -0.6771, 0.5025, 0.5285],
[ 0.1739, -0.6674, 0.4908, 0.5228],
[ 0.1580, -0.6536, 0.4773, 0.5106],
[ 0.1797, -0.6748, 0.5010, 0.5296]]], grad_fn=<SelectBackward0>)
3. Gated Recurrent Units
The GRU was introduced in Kyunghyun Cho et al., "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". It omits the LSTM's output gate and reworks the remaining gating so that the hidden state at time t is a convex combination, weighted by an update gate (playing a role similar to the LSTM's forget and input gates), of the state at time t−1 and a candidate state; a reset gate controls how much of the previous state enters the computation of that candidate (loosely analogous to the LSTM's memory cell update). Although the GRU is theoretically somewhat more limited than the LSTM when processing very long sequences, it has been shown to rival the LSTM (and in some cases outperform it) on many of the same tasks, with a noticeable reduction in training time and complexity.
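Concretely, using the gate layout of torch.nn.GRUCell (whose weight_ih and weight_hh stack the reset, update, and candidate weights), one step of a GRU can be sketched as follows; the function below is illustrative, with the weights passed in as placeholders:
def gru_step(x_t, h_prev, weight_ih, weight_hh, bias_ih, bias_hh):
    # split the stacked weights into their reset/update/candidate pieces
    W_ir, W_iz, W_in = weight_ih.chunk(3, dim=0)
    W_hr, W_hz, W_hn = weight_hh.chunk(3, dim=0)
    b_ir, b_iz, b_in = bias_ih.chunk(3, dim=0)
    b_hr, b_hz, b_hn = bias_hh.chunk(3, dim=0)
    r = torch.sigmoid(x_t @ W_ir.T + b_ir + h_prev @ W_hr.T + b_hr)      # reset gate
    z = torch.sigmoid(x_t @ W_iz.T + b_iz + h_prev @ W_hz.T + b_hz)      # update gate
    n = torch.tanh(x_t @ W_in.T + b_in + r * (h_prev @ W_hn.T + b_hn))   # candidate state
    return (1 - z) * n + z * h_prev                                      # new hidden state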
In addition to the LSTM layer, you will also design a GRU module, appropriately named GeeArrYou, constructed from torch.nn.GRUCell modules. While this does not need to handle a bidirectional mode, it still needs to handle multiple GRU layers, and in addition it does need to handle an optional dropout value.
importlib.reload(submitted)
help(submitted.GeeArrYou.__init__)
help(submitted.GeeArrYou.forward)
Help on function __init__ in module submitted:
__init__(self, input_size: int, hidden_size: int, num_layers: int, dropout: float = 0) -> None
Sets up the following:
self.forward_layers - A ModuleList of num_layers GeeArrYouCell layers.
The first layer should have an input size of input_size
and an output size of hidden_size,
while all other layers should have input and output both of size hidden_size.
self.dropout - A dropout probability, usable as the "p" value of F.dropout.
Help on function forward in module submitted:
forward(self, x: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'length', 'input_size',), 'cls_name': 'TensorType'}]) -> typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'length', 'hidden_size',), 'cls_name': 'TensorType'}]
Performs the forward propagation of a GeeArrYou layer.
Inputs:
x - The inputs to the cell.
Outputs:
output - The resulting (hidden state) output h.
Note that the input to each GeeArrYouCell (except the first) should be
passed through F.dropout with the dropout probability provided when
initializing the GeeArrYou layer.
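A brief aside on F.dropout: unlike the nn.Dropout module, the functional form does not know whether the surrounding module is in training mode, so its training flag can be passed explicitly. A one-line sketch of the call:
from torch.nn import functional as F
x = F.dropout(torch.randn(1, 22, 4), p=0.5, training=True)   # zeroes elements with probability p and rescales the survivors by 1/(1-p)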
# 6-length Golomb rulers
more_rulers = torch.Tensor([[0,1,4,10,12,17],[0,1,4,10,15,17],[0,1,8,11,13,17],[0,1,8,12,14,17]])
normalized_rulers = [more_rulers/torch.norm(more_rulers,p=k,dim=1,keepdim=True) for k in range(1,4)]
layer = submitted.GeeArrYou(6,4,2)
layer.forward_layers[0].weight_ih = torch.nn.Parameter(torch.cat(normalized_rulers))
layer.forward_layers[0].weight_hh = torch.nn.Parameter(torch.cat(normalized_rulers)[:,:4])
layer.forward_layers[1].weight_ih = torch.nn.Parameter(torch.cat(normalized_rulers)[:,1:5])
layer.forward_layers[1].weight_hh = torch.nn.Parameter(torch.cat(normalized_rulers)[:,2:])
layer.forward_layers[0].bias_ih = torch.nn.Parameter(torch.randn_like(layer.forward_layers[0].bias_ih))
layer.forward_layers[0].bias_hh = torch.nn.Parameter(torch.randn_like(layer.forward_layers[0].bias_hh))
layer.forward_layers[1].bias_ih = torch.nn.Parameter(torch.randn_like(layer.forward_layers[1].bias_ih))
layer.forward_layers[1].bias_hh = torch.nn.Parameter(torch.randn_like(layer.forward_layers[1].bias_hh))
current_input = torch.stack([sum(torch.eye(y+1,6)[y] for y in z) for z in torch.randint(6,size=(22,3))]).unsqueeze(0) # each row is zero except for at most three values
current_output = layer(current_input)
print(current_output)
tensor([[[ 0.2865, -0.0184, -0.0327, 0.1857],
[ 0.4593, -0.0210, 0.0905, 0.3131],
[ 0.5668, -0.0133, 0.2838, 0.4147],
[ 0.6381, -0.0012, 0.4486, 0.4913],
[ 0.6919, 0.0103, 0.5414, 0.5427],
[ 0.7331, 0.0231, 0.6190, 0.5859],
[ 0.7661, 0.0362, 0.6765, 0.6212],
[ 0.7923, 0.0501, 0.7314, 0.6543],
[ 0.8135, 0.0644, 0.7775, 0.6839],
[ 0.8326, 0.0773, 0.7990, 0.7052],
[ 0.8482, 0.0909, 0.8272, 0.7273],
[ 0.8616, 0.1042, 0.8510, 0.7473],
[ 0.8736, 0.1171, 0.8673, 0.7643],
[ 0.8850, 0.1287, 0.8668, 0.7742],
[ 0.8944, 0.1410, 0.8810, 0.7883],
[ 0.9025, 0.1532, 0.8954, 0.8019],
[ 0.9097, 0.1651, 0.9081, 0.8146],
[ 0.9166, 0.1764, 0.9133, 0.8243],
[ 0.9230, 0.1872, 0.9144, 0.8317],
[ 0.9285, 0.1982, 0.9216, 0.8408],
[ 0.9334, 0.2090, 0.9290, 0.8496],
[ 0.9381, 0.2193, 0.9316, 0.8563]]], grad_fn=<CopySlices>)
4.1. AutoVC's Content encoder
The first of the AutoVC modules you will be implementing is the content encoder (Fig. 3(a)), appropriately named Encoder.
Note that you do not need to handle the concatenation of speaker embedding and spectrogram at the beginning, nor the dimensionality reduction at the end, as these are performed for you in _main.py.
importlib.reload(submitted)
help(submitted.Encoder.__init__)
help(submitted.Encoder.forward)
Help on function __init__ in module submitted:
__init__(self, dim_neck: int, dim_emb: int, freq: int)
Sets up the following:
self.convolutions - the 1-D convolution layers.
The first should have 80 + dim_emb input channels and 512 output channels,
while each following convolution layer should have 512 input and 512 output channels.
All such layers should have a 5x5 kernel, with a stride of 1,
a dilation of 1, and a padding of 2.
The output of each convolution layer should be fed into a BatchNorm1d layer of 512 input features,
and the output of each BatchNorm1d should be fed into a ReLU layer.
self.recurrents - a bidirectional EllEssTeeEmm with two layers, an input size of 512,
and an output size of dim_neck.
Help on function forward in module submitted:
forward(self, x: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'input_dim', 'length',), 'cls_name': 'TensorType'}]) -> Tuple[Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'length', 'dim_neck',), 'cls_name': 'TensorType'}], Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'length', 'dim_neck',), 'cls_name': 'TensorType'}]]
Performs the forward propagation of the AutoVC encoder.
After passing the input through the convolution layers, the last two dimensions
should be transposed before passing those layers' output through the EllEssTeeEmm.
The output from the EllEssTeeEmm should then be split *along the last dimension* into two chunks,
one for the forward direction (the first self.recurrent_hidden_size columns)
and one for the backward direction (the last self.recurrent_hidden_size columns).
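A generic pattern for a convolution stack like the one described above, interpreting the "5x5 kernel" of these 1-D convolutions as kernel_size=5; the dim_emb value and overall structure here are placeholder assumptions, not necessarily what the tests expect:
dim_emb = 256                                  # hypothetical embedding size, for illustration only
channels = [80 + dim_emb, 512, 512, 512]
blocks = []
for in_channels, out_channels in zip(channels[:-1], channels[1:]):
    blocks.append(torch.nn.Sequential(
        torch.nn.Conv1d(in_channels, out_channels, kernel_size=5,
                        stride=1, padding=2, dilation=1),
        torch.nn.BatchNorm1d(out_channels),
        torch.nn.ReLU()))
convolutions = torch.nn.Sequential(*blocks)
# note that Conv1d expects (batch, channels, length), so the feature axis plays the role of channels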
4.2. AutoVC's decoder
The second of the AutoVC modules you will be implementing is the decoder (much of Fig. 3(c)), appropriately named Decoder.
importlib.reload(submitted)
help(submitted.Decoder.__init__)
help(submitted.Decoder.forward)
Help on function __init__ in module submitted:
__init__(self, dim_neck: int, dim_emb: int, dim_pre: int) -> None
Sets up the following:
self.recurrent1 - a unidirectional EllEssTeeEmm with one layer, an input size of 2*dim_neck + dim_emb
and an output size of dim_pre.
self.convolutions - the 1-D convolution layers.
Each convolution layer should have dim_pre input and dim_pre output channels.
All such layers should have a 5x5 kernel, with a stride of 1,
a dilation of 1, and a padding of 2.
The output of each convolution layer should be fed into a BatchNorm1d layer of dim_pre input features,
and the output of that BatchNorm1d should be fed into a ReLU.
self.recurrent2 - a unidirectional EllEssTeeEmm with two layers, an input size of dim_pre
and an output size of 1024.
self.fc_projection = a LineEar layer with an input size of 1024 and an output size of 80.
Help on function forward in module submitted:
forward(self, x: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'input_length', 'input_dim',), 'cls_name': 'TensorType'}]) -> typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'input_length', 'output_dim',), 'cls_name': 'TensorType'}]
Performs the forward propagation of the AutoVC decoder.
It should be enough to pass the input through the first EllEssTeeEmm,
the convolution layers, the second EllEssTeeEmm, and the final LineEar
layer in that order--except that the "input_length" and "input_dim" dimensions
should be transposed before input to the convolution layers, and this transposition
should be undone before input to the second EllEssTeeEmm.
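The transposition mentioned above is the usual dance between recurrent layers, which produce (batch, length, features), and Conv1d, which expects (batch, channels, length). A minimal sketch:
x = torch.randn(1, 22, 512)   # (batch, length, features) as produced by a recurrent layer
x = x.transpose(1, 2)         # -> (batch, channels, length) for the convolution layers
# ... pass x through the convolution stack here ...
x = x.transpose(1, 2)         # back to (batch, length, features) for the next recurrent layer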
4.3. AutoVC's decoder post-network
The third of the AutoVC modules you will be implementing is the decoder post-network (part of Fig. 3(c)), appropriately named Postnet.
importlib.reload(submitted)
help(submitted.Postnet.__init__)
help(submitted.Postnet.forward)
Help on function __init__ in module submitted:
__init__(self) -> None
Sets up the following:
self.convolutions - a Sequential object with five Conv1d layers, each with 5x5 kernels,
a stride of 1, a padding of 2, and a dilation of 1:
The first should take an 80-channel input and yield a 512-channel output.
The next three should take 512-channel inputs and yield 512-channel outputs.
The last should take a 512-channel input and yield an 80-channel output.
Each layer's output should be passed into a BatchNorm1d,
and (except for the last layer) from there through a Tanh,
before being sent to the next layer.
Help on function forward in module submitted:
forward(self, x: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'input_channels', 'n_mels',), 'cls_name': 'TensorType'}]) -> typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'input_channels', 'n_mels',), 'cls_name': 'TensorType'}]
Performs the forward propagation of the AutoVC decoder.
If you initialized this module properly, passing the input through self.convolutions here should suffice.
4.4. Speaker Embedder
The last module you will implement is a speaker embedding encoder, appropriately named SpeakerEmbedderGeeArrYou. This is not exactly the encoder used in the original AutoVC; it is simplified somewhat through the use of GRUs.
importlib.reload(submitted)
help(submitted.SpeakerEmbedderGeeArrYou.__init__)
help(submitted.SpeakerEmbedderGeeArrYou.forward)
Help on function __init__ in module submitted:
__init__(self, n_hid: int, n_mels: int, n_layers: int, fc_dim: int, hidden_p: float) -> None
Sets up the following:
self.rnn_stack - an n_layers-layer GeeArrYou with n_mels input features,
n_hid hidden features, and a dropout of hidden_p.
self.projection - a LineEar layer with an input size of n_hid
and an output size of fc_dim.
Help on function forward in module submitted:
forward(self, x: typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'frames', 'n_mels',), 'cls_name': 'TensorType'}]) -> typing.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'fc_dim',), 'cls_name': 'TensorType'}]
Performs the forward propagation of the SpeakerEmbedderGeeArrYou.
After passing the input through the RNN, the last frame of the output
should be taken and passed through the fully connected layer.
Each of the frames should then be normalized so that its Euclidean norm is 1.
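The "take the last frame, then normalize" step can be sketched as follows; this is a generic illustration with arbitrary sizes, not necessarily the exact code the tests expect:
x = torch.randn(3, 22, 64)                               # (batch, frames, features) out of the RNN
last = x[:, -1, :]                                       # keep only the final frame: (batch, features)
embedding = last / last.norm(p=2, dim=1, keepdim=True)   # unit Euclidean norm for each batch element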
Extra Credit: Custom LSTMCells and GRUCells
As an extra credit option, you may implement your own LSTM or GRU cell classes (EllEssTeeEmmCell and GeeArrYouCell). These should take the same parameters as LSTMCell and GRUCell respectively, and must exhibit the same behavior as those classes in forward propagation. Note that your implementations of EllEssTeeEmm and GeeArrYou must be unchanged from those of the main MP except for the substitution of your own cell classes for the PyTorch ones.
To check the behavior of your cell classes, it is enough to add from extra import * to the list of imports in submitted.py, substitute LSTMCell with EllEssTeeEmmCell and GRUCell with GeeArrYouCell wherever they occur, and reload the notebook/run grade.py as before.
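One way to sanity-check a custom cell before plugging it in is to copy the weights from the corresponding PyTorch cell and compare a single step. The sketch below assumes your EllEssTeeEmmCell uses the same parameter names and shapes as LSTMCell, so that load_state_dict carries over directly:
reference = torch.nn.LSTMCell(6, 4)
mine = EllEssTeeEmmCell(6, 4)                  # your extra-credit implementation
mine.load_state_dict(reference.state_dict())   # assumes matching parameter names and shapes
x, h, c = torch.randn(1, 6), torch.zeros(1, 4), torch.zeros(1, 4)
h_ref, c_ref = reference(x, (h, c))
h_mine, c_mine = mine(x, (h, c))
print(torch.allclose(h_ref, h_mine, atol=1e-6), torch.allclose(c_ref, c_mine, atol=1e-6))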
Extra Credit: Transferring your voice
As an extra credit option, you may record an utterance of your own (about five to seven seconds in length, as a 16 kHz WAV file) and attempt voice transfer both to it and from it using an utterance from the VCTK corpus.
Five points will be awarded for a successful transfer of your voice onto an utterance from the VCTK corpus.
Five points will be awarded for a successful transfer of a voice from the VCTK corpus onto an utterance of yours.
In both cases, half that amount will be awarded if the voice transfer is evident but the resulting utterance is unintelligible.
Alternatively, you and a partner may record different 5-7 second utterances and attempt to transfer them over between each other. A full ten points will be awarded if both directions are intelligible; five points will be awarded if one direction is intelligible but the other direction has issues.
(These point valuations may be adjusted upward if intelligibility issues persist across attempts at this task.)
Caveats for this MP
You are not allowed to use any of the modules listed under "Linear Layers" in the torch.nn documentation within your implementation of this MP, and the same goes for the modules listed under "Recurrent Layers" (apart from LSTMCell and GRUCell, as noted in sections 2 and 3 above). In both cases, however, you are welcome to peruse their documentation and source code. This will be checked (either on submission or manually), and you will lose points if such use is discovered.
Rather than importing torch, torch.nn, or torch.nn.functional wholesale, you are encouraged to import only the specific functions or modules you need (that is, use from torch import sort or from torch.nn.functional import hardtanh). For your convenience, the functions and modules used in the reference implementation of this MP are provided as a list of imports in submitted.py.