!pip install -Uqq tsai
Developing an Intuition for Transformers and Applying Them to Time Series Classification
Struggling to learn a new deep learning architecture, such as Transformer, can be quite challenging. However, it doesn’t have to be so daunting. In this blog post, I will demonstrate a practical approach to start using a new architecture, specifically the Transformer. We will construct a basic Transformer architecture and progressively fine-tune it to achieve the performance of the TST architecture (A Transformer-based Framework for Multivariate Time Series Representation Learning).
Content:
- Dataset to use: FaceDetection
- Reference Model: TST Model implemented in tsai repository
- Baseline Model: LSTM Model
- Our Simple TST Model
- Upgrades
References:
You'll need fastai and tsai to run the code in this blog post.
import torch
from torch.utils.data import Dataset
from torch import nn
from fastai.data.core import DataLoader, DataLoaders
from fastai.learner import Learner
from fastai.losses import LabelSmoothingCrossEntropyFlat
from fastai.metrics import RocAucBinary, accuracy
from fastai.torch_core import Module
from fastai.layers import Flatten
from tsai.models.TST import TST
from tsai.models.RNN import LSTM
from tsai.data.external import get_UCR_data
from tsai.callback.core import ShowGraph as ShowGraphCallback2
from tsai.learner import plot_metrics
from tsai.imports import default_device
import numpy as np
from torch.nn.modules.transformer import TransformerEncoderLayer, TransformerEncoder
Dataset: FaceDetection
Why Time Series? Although the Transformer originated in the NLP domain, where it outperformed previous architectures, I believe that, for those not yet familiar with NLP, it is more advantageous to start with a domain that requires less preprocessing, such as time series. This way, we can focus our attention on understanding the architecture itself.
In this tutorial, we will be using a dataset from the well-known UEA & UCR Time Series repository. Although we won’t delve into the details of this dataset in this blog post, it’s worth mentioning its purpose. The objective is to classify whether a given MEG signal (Magnetoencephalography) represents a face or not. The input dimension is 144, and the sequence length is 62.
I chose this dataset because it contains a reasonable amount of data (5,890 training instances and 3,524 testing instances) and has been used in a Transformer tutorial in the tsai repository. This ensures that we have a reliable reference model to aim to outperform.
We will utilize utility functions from the tsai and fastai libraries to facilitate our work and streamline the process.
batch_size, c_in, c_out, seq_len = 64, 144, 2, 62
X, y, splits = get_UCR_data('FaceDetection', return_split=False)

X_train = X[splits[0]]
y_train = y[splits[0]]
X_valid = X[splits[1]]
y_valid = y[splits[1]]

mean_trn = np.mean(X_train, axis=(0,2), keepdims=True)
std_trn = np.std(X_train, axis=(0,2), keepdims=True)
class TSDataset(Dataset):
    """TimeSeries Dataset for FaceDetection"""
    def __init__(self, X, y):
        super(TSDataset, self).__init__()
        self.X = torch.tensor(X)
        # labels arrive as strings ('0'/'1'), so convert them to integer class indices
        self.Y = torch.concat([torch.tensor([_y == '0'], dtype=int) for _y in y])
    def __len__(self): return len(self.X)
    def __getitem__(self, i):
        return self.X[i], self.Y[i]
The following code demonstrates how to create data loaders for the training and validation sets:
dset_train = TSDataset(X_train, y_train)
dset_valid = TSDataset(X_valid, y_valid)

dl_train = DataLoader(dset_train, batch_size=batch_size, shuffle=True)
dl_valid = DataLoader(dset_valid, batch_size=batch_size, shuffle=False)

dls = DataLoaders(dl_train, dl_valid)
dls = dls.cuda()

x, y = next(iter(dl_train))
x.shape, y.shape
(torch.Size([64, 144, 62]), torch.Size([64]))
Reference Model
The reference model we will be using is the TST model ("A Transformer-based Framework for Multivariate Time Series Representation Learning"), as implemented in the tsai library.
def evaluate_model(model, n_epoch=30):
    learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropyFlat(),
                    metrics=[RocAucBinary(), accuracy], cbs=ShowGraphCallback2())
    learn.fit_one_cycle(n_epoch, 1e-4)

model = TST(c_in, c_out, seq_len, dropout=0.3, fc_dropout=0.9, n_heads=1, n_layers=1)
evaluate_model(model)
epoch | train_loss | valid_loss | roc_auc_score | accuracy | time |
---|---|---|---|---|---|
0 | 0.943482 | 0.710201 | 0.503936 | 0.505675 | 00:02 |
1 | 0.959244 | 0.705551 | 0.515424 | 0.515607 | 00:01 |
2 | 0.940436 | 0.697859 | 0.534342 | 0.532350 | 00:01 |
3 | 0.900420 | 0.692391 | 0.563204 | 0.543133 | 00:01 |
4 | 0.872391 | 0.682138 | 0.601740 | 0.567537 | 00:01 |
5 | 0.873357 | 0.674679 | 0.632084 | 0.591090 | 00:01 |
6 | 0.827063 | 0.668141 | 0.662991 | 0.618331 | 00:01 |
7 | 0.802059 | 0.658147 | 0.682879 | 0.635925 | 00:01 |
8 | 0.753277 | 0.653781 | 0.693962 | 0.641884 | 00:01 |
9 | 0.744286 | 0.647927 | 0.698235 | 0.643303 | 00:01 |
10 | 0.727256 | 0.646205 | 0.705161 | 0.654370 | 00:01 |
11 | 0.699470 | 0.644024 | 0.707230 | 0.654654 | 00:01 |
12 | 0.707701 | 0.639246 | 0.713946 | 0.663167 | 00:01 |
13 | 0.677575 | 0.637510 | 0.716578 | 0.663734 | 00:01 |
14 | 0.690399 | 0.635938 | 0.720142 | 0.667991 | 00:01 |
15 | 0.651648 | 0.635806 | 0.720622 | 0.667991 | 00:01 |
16 | 0.639079 | 0.634356 | 0.722736 | 0.665153 | 00:01 |
17 | 0.672199 | 0.633124 | 0.724921 | 0.665437 | 00:01 |
18 | 0.639468 | 0.631378 | 0.728070 | 0.668558 | 00:01 |
19 | 0.638936 | 0.629368 | 0.728399 | 0.668558 | 00:01 |
20 | 0.625763 | 0.627779 | 0.731610 | 0.672247 | 00:01 |
21 | 0.619361 | 0.626804 | 0.732947 | 0.671680 | 00:01 |
22 | 0.633200 | 0.626732 | 0.732423 | 0.673666 | 00:01 |
23 | 0.633324 | 0.625232 | 0.735347 | 0.673099 | 00:01 |
24 | 0.639257 | 0.625497 | 0.733850 | 0.672531 | 00:01 |
25 | 0.626306 | 0.625547 | 0.733988 | 0.671396 | 00:01 |
26 | 0.616693 | 0.625246 | 0.734917 | 0.674234 | 00:01 |
27 | 0.627592 | 0.625567 | 0.733930 | 0.671396 | 00:01 |
28 | 0.616872 | 0.625370 | 0.734323 | 0.671680 | 00:01 |
29 | 0.617217 | 0.624772 | 0.734979 | 0.672815 | 00:01 |
For simplicity, I do not apply any normalization technique and train for fewer epochs than the original reference notebook. After 100 epochs, the reference reaches an accuracy of around 0.70.
Baseline
We’ll begin our journey by exploring an LSTM model, which was commonly used for sequence classification in the pre-transformer era.
Note: Observing the validation loss may lead you to believe that the model is overfitting. However, this is not the case, as the final metric (accuracy) continues to increase.
model = LSTM(144, 2, rnn_dropout=0.3, fc_dropout=0.3)
evaluate_model(model)
epoch | train_loss | valid_loss | roc_auc_score | accuracy | time |
---|---|---|---|---|---|
0 | 0.698941 | 0.696725 | 0.522427 | 0.517026 | 00:02 |
1 | 0.696502 | 0.695753 | 0.526428 | 0.516175 | 00:01 |
2 | 0.694726 | 0.694219 | 0.533478 | 0.519013 | 00:01 |
3 | 0.691055 | 0.692617 | 0.541788 | 0.529228 | 00:01 |
4 | 0.691306 | 0.691157 | 0.549403 | 0.538309 | 00:01 |
5 | 0.684514 | 0.689195 | 0.559889 | 0.545119 | 00:01 |
6 | 0.680696 | 0.686942 | 0.571706 | 0.545970 | 00:01 |
7 | 0.668403 | 0.685266 | 0.581042 | 0.554767 | 00:01 |
8 | 0.659610 | 0.685288 | 0.585615 | 0.558456 | 00:01 |
9 | 0.656271 | 0.686517 | 0.588135 | 0.557889 | 00:01 |
10 | 0.651706 | 0.688422 | 0.589830 | 0.557889 | 00:01 |
11 | 0.630258 | 0.691245 | 0.590714 | 0.557605 | 00:01 |
12 | 0.620531 | 0.697241 | 0.591923 | 0.557605 | 00:01 |
13 | 0.606660 | 0.700341 | 0.597773 | 0.571793 | 00:01 |
14 | 0.595829 | 0.706305 | 0.598964 | 0.574347 | 00:01 |
15 | 0.581478 | 0.708097 | 0.604908 | 0.579455 | 00:01 |
16 | 0.570084 | 0.709881 | 0.610240 | 0.580590 | 00:01 |
17 | 0.560884 | 0.712225 | 0.612305 | 0.583144 | 00:01 |
18 | 0.570109 | 0.713588 | 0.616308 | 0.583995 | 00:01 |
19 | 0.558659 | 0.713829 | 0.616927 | 0.581725 | 00:01 |
20 | 0.545606 | 0.714969 | 0.618493 | 0.580874 | 00:01 |
21 | 0.541346 | 0.715872 | 0.620730 | 0.582293 | 00:01 |
22 | 0.538415 | 0.716904 | 0.622545 | 0.580590 | 00:01 |
23 | 0.535115 | 0.717429 | 0.623803 | 0.581725 | 00:01 |
24 | 0.531399 | 0.718325 | 0.623906 | 0.582577 | 00:01 |
25 | 0.540339 | 0.718253 | 0.624539 | 0.582009 | 00:01 |
26 | 0.539049 | 0.718644 | 0.624380 | 0.582293 | 00:01 |
27 | 0.529248 | 0.718712 | 0.624360 | 0.583144 | 00:01 |
28 | 0.524167 | 0.718756 | 0.624453 | 0.583712 | 00:01 |
29 | 0.524851 | 0.718751 | 0.624465 | 0.583712 | 00:01 |
Our TST
The diagram above illustrates the Transformer architecture as presented in the “Attention is All You Need” paper. The breakthrough in this architecture is the Multi-Head Attention. The idea behind Attention is that if your model can focus on the most important parts of a long sequence, it can perform better without being affected by noise.
How does it work? Well, in my experience, when we are not very familiar with a new architecture, we shouldn’t focus too much on understanding every detail of the architecture. I spent a lot of time reading various tutorials, trying to grasp the clever idea behind this, only to realize that I still didn’t know how to apply it to a real case. I will attempt to cover building Self-Attention from scratch in a future blog post. However, in this one, we will start by learning how to use the Transformer module from PyTorch.
What do we need to pay attention to here? We will mainly focus on the shapes of the input and output: the input keeps its shape after passing through the Transformer Encoder. The output is then flattened and passed through a linear layer, which produces one logit per class for the given classification task.
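To make the shape bookkeeping concrete, here is a minimal standalone sketch using random data with the same dimensions as our batches. Note that `batch_first=True` is used here only so the batch dimension is explicit; it is an assumption of this sketch (it requires a reasonably recent PyTorch) and is not how the model below is written:

import torch
from torch import nn
from torch.nn.modules.transformer import TransformerEncoderLayer

bs, c_in, seq_len, n_classes = 64, 144, 62, 2        # toy dimensions matching this dataset
x = torch.randn(bs, seq_len, c_in)                   # [bs, seq_len, c_in], i.e. after swapping axes

layer = TransformerEncoderLayer(d_model=c_in, nhead=1, batch_first=True)
out = layer(x)
print(out.shape)                                     # torch.Size([64, 62, 144]): shape is preserved

head = nn.Linear(seq_len * c_in, n_classes)
print(head(out.reshape(bs, -1)).shape)               # torch.Size([64, 2]): one logit per class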
Our first model, shown below, is a very simple architecture with just one TransformerEncoder layer and one Linear layer.
batch_size, c_in, d_model, c_out, seq_len, dropout, fc_dropout = 64, 144, 128, 2, 62, 0.7, 0.9
class OurTST(Module):
    def __init__(self, c_in, c_out, seq_len, dropout):
        self.c_in, self.c_out, self.seq_len = c_in, c_out, seq_len
        encoder_layer = TransformerEncoderLayer(d_model=c_in, nhead=1, dropout=dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=1)
        self.head = nn.Linear(seq_len*c_in, c_out)

    def forward(self, x):
        o = x.swapaxes(1, 2)               # [bs, c_in, seq_len] -> [bs, seq_len, c_in]
        o = self.transformer_encoder(o)    # [bs, seq_len, c_in] -> [bs, seq_len, c_in]
        o = o.reshape(o.shape[0], -1)      # [bs, seq_len, c_in] -> [bs, seq_len x c_in]
        o = self.head(o)                   # [bs, seq_len x c_in] -> [bs, c_out]
        return o
model = OurTST(c_in, c_out, seq_len, 0.9)
model
OurTST(
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=144, out_features=144, bias=True)
        )
        (linear1): Linear(in_features=144, out_features=2048, bias=True)
        (dropout): Dropout(p=0.9, inplace=False)
        (linear2): Linear(in_features=2048, out_features=144, bias=True)
        (norm1): LayerNorm((144,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((144,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.9, inplace=False)
        (dropout2): Dropout(p=0.9, inplace=False)
      )
    )
  )
  (head): Linear(in_features=8928, out_features=2, bias=True)
)
evaluate_model(model, 30)
epoch | train_loss | valid_loss | roc_auc_score | accuracy | time |
---|---|---|---|---|---|
0 | 0.771023 | 0.765103 | 0.501380 | 0.503405 | 00:01 |
1 | 0.767232 | 0.743445 | 0.526820 | 0.517877 | 00:01 |
2 | 0.760522 | 0.713145 | 0.577938 | 0.552781 | 00:01 |
3 | 0.718567 | 0.686044 | 0.636998 | 0.599603 | 00:01 |
4 | 0.707388 | 0.671847 | 0.679890 | 0.627128 | 00:01 |
5 | 0.667916 | 0.675715 | 0.702388 | 0.650114 | 00:01 |
6 | 0.646124 | 0.692107 | 0.713854 | 0.657492 | 00:01 |
7 | 0.626370 | 0.714405 | 0.720573 | 0.665153 | 00:01 |
8 | 0.623622 | 0.739222 | 0.721124 | 0.671396 | 00:01 |
9 | 0.597921 | 0.755921 | 0.722508 | 0.669410 | 00:01 |
10 | 0.595011 | 0.766809 | 0.725383 | 0.672815 | 00:01 |
11 | 0.574535 | 0.771822 | 0.729465 | 0.675936 | 00:01 |
12 | 0.573907 | 0.776058 | 0.732046 | 0.671112 | 00:01 |
13 | 0.570190 | 0.785060 | 0.733693 | 0.673950 | 00:01 |
14 | 0.552551 | 0.789218 | 0.734351 | 0.674234 | 00:01 |
15 | 0.575410 | 0.794398 | 0.736869 | 0.678774 | 00:01 |
16 | 0.563437 | 0.795728 | 0.738365 | 0.676788 | 00:01 |
17 | 0.563861 | 0.796549 | 0.739659 | 0.680193 | 00:01 |
18 | 0.538637 | 0.797076 | 0.740454 | 0.679909 | 00:01 |
19 | 0.548644 | 0.796367 | 0.741566 | 0.681896 | 00:01 |
20 | 0.549104 | 0.798111 | 0.741827 | 0.685017 | 00:01 |
21 | 0.539753 | 0.801571 | 0.741463 | 0.683598 | 00:01 |
22 | 0.542557 | 0.801905 | 0.742229 | 0.683031 | 00:01 |
23 | 0.547139 | 0.803032 | 0.742469 | 0.682463 | 00:01 |
24 | 0.532426 | 0.802947 | 0.743231 | 0.683598 | 00:01 |
25 | 0.524330 | 0.803357 | 0.743226 | 0.683314 | 00:01 |
26 | 0.525560 | 0.803959 | 0.743130 | 0.683598 | 00:01 |
27 | 0.534251 | 0.804135 | 0.743205 | 0.682747 | 00:01 |
28 | 0.541934 | 0.804315 | 0.743135 | 0.682747 | 00:01 |
29 | 0.527424 | 0.804290 | 0.743156 | 0.682747 | 00:01 |
Well, our model outperforms the LSTM baseline and is even better than the TST reference model after 30 epochs.
Upgrades
In this section, I will discuss how we can build upon our basic Transformer architecture to achieve even greater results. We will explore several ideas inspired by the original paper and general deep learning concepts.
1- Feature Standardization:
To enhance neural network training, it is recommended that input data have zero mean and unit standard deviation (for more details, refer to this lesson from fast.ai). In line with the original paper, we employ feature standardization, which standardizes each feature separately.
mean_trn = np.mean(X_train, axis=(0,2), keepdims=True)
std_trn = np.std(X_train, axis=(0,2), keepdims=True)

# In the Dataset
...
self.X = (self.X - mean_trn)/std_trn
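As a quick sanity check (a sketch that reuses `X_train`, `mean_trn`, and `std_trn` from above), each channel of the standardized training data should come out with roughly zero mean and unit standard deviation:

X_std = (X_train - mean_trn) / std_trn   # broadcasts over (samples, channels, time steps)
print(X_std.mean(axis=(0, 2))[:3])       # per-channel means, expected ~0
print(X_std.std(axis=(0, 2))[:3])        # per-channel stds, expected ~1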
2- Input Projection
Before feeding the input into the TransformerEncoder layer, it can be projected into another dimension (d_model), allowing us to control the size of the representation the TransformerEncoder receives. In general, with suitable techniques, a deeper network can potentially outperform a shallow one.
def __init__(self, ...):
    ...
    self.W_P = nn.Linear(c_in, d_model)

def forward(self, x):
    o = x.swapaxes(1, 2)
    o = self.W_P(o)  # input projection
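Since `nn.Linear` operates on the last dimension, the projection only changes the feature size. A tiny sketch with the dimensions assumed in this post illustrates the resulting shapes:

import torch
from torch import nn

W_P = nn.Linear(144, 128)        # c_in -> d_model
o = torch.randn(64, 62, 144)     # [bs, seq_len, c_in]
print(W_P(o).shape)              # torch.Size([64, 62, 128]) -> [bs, seq_len, d_model]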
3- Positional Encoding
Transformers do not inherently capture the positional order of input data, which can be crucial for certain tasks. To embed this information, we can either pass the positions through a fixed function (e.g., a sinusoidal encoding) or create learnable parameters for each position (as implemented in our code).
def __init__(self, ...):
    ...
    # Positional encoding
    W_pos = torch.empty((seq_len, d_model), device=default_device())
    nn.init.uniform_(W_pos, -0.02, 0.02)
    self.W_pos = nn.Parameter(W_pos, requires_grad=True)

def forward(self, x):
    o = x.swapaxes(1, 2)
    o = self.W_P(o)
    o = o + self.W_pos  # positional encoding
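For comparison, the fixed sinusoidal encoding from "Attention Is All You Need" could be built as below. This is only a sketch of the standard formulation; it is not used in the model trained here, where `W_pos` is learned instead:

import numpy as np
import torch

def sinusoidal_pos_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len).unsqueeze(1)                                     # [seq_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))  # [d_model/2]
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # [seq_len, d_model], would be added to the projected input just like W_pos

print(sinusoidal_pos_encoding(62, 128).shape)  # torch.Size([62, 128])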
4- Dropout
Deep neural networks can be prone to overfitting. To mitigate this, we introduce dropout layers in the model. In our architecture there are two kinds of dropout: one inside the TransformerEncoder layer and another just before the final Linear layer.
def __init__(self, ...):
    ...
    # Transformer encoder layers
    encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=1, dropout=drop_out)  # dropout inside the Transformer layer
    self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=n_layers)
    self.head = nn.Sequential(
        nn.GELU(),
        Flatten(),
        nn.Dropout(fc_dropout),  # fully connected dropout
        nn.Linear(seq_len * d_model, c_out)
    )
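As a small illustration (a standalone sketch, not part of the model code), `nn.Dropout` randomly zeroes activations during training and is a no-op at inference time:

import torch
from torch import nn

drop = nn.Dropout(p=0.9)   # same rate as the fc_dropout used in this post
a = torch.ones(8)

drop.train()
print(drop(a))             # ~90% of entries zeroed, survivors scaled by 1/(1-p) = 10
drop.eval()
print(drop(a))             # identity at inference time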
batch_size, c_in, d_model, c_out, seq_len, fc_dropout = 64, 144, 128, 2, 62, 0.9
X, y, splits = get_UCR_data('FaceDetection', return_split=False)

X_train = X[splits[0]]
y_train = y[splits[0]]
X_valid = X[splits[1]]
y_valid = y[splits[1]]

mean_trn = np.mean(X_train, axis=(0,2), keepdims=True)
std_trn = np.std(X_train, axis=(0,2), keepdims=True)
class TSDataset(Dataset):
    """TimeSeries Dataset for FaceDetection"""
    def __init__(self, X, y):
        super(TSDataset, self).__init__()
        self.X = torch.tensor(X)
        # standardize each feature with the training-set statistics
        self.X = (self.X - mean_trn)/std_trn
        self.Y = torch.concat([torch.tensor([_y == '0'], dtype=int) for _y in y])
    def __len__(self): return len(self.X)
    def __getitem__(self, i):
        return self.X[i], self.Y[i]
class OurTST(Module):
    def __init__(self, c_in, c_out, d_model, seq_len, n_layers, drop_out, fc_dropout):
        self.c_in, self.c_out, self.seq_len = c_in, c_out, seq_len
        self.W_P = nn.Linear(c_in, d_model)
        # Positional encoding
        W_pos = torch.empty((seq_len, d_model), device=default_device())
        nn.init.uniform_(W_pos, -0.02, 0.02)
        self.W_pos = nn.Parameter(W_pos, requires_grad=True)
        # Transformer encoder layers
        encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=1, dropout=drop_out)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Sequential(
            nn.GELU(),
            Flatten(),
            nn.Dropout(fc_dropout),
            nn.Linear(seq_len * d_model, c_out)
        )

    def forward(self, x):
        o = x.swapaxes(1, 2)               # [bs, c_in, seq_len] -> [bs, seq_len, c_in]
        o = self.W_P(o)                    # [bs, seq_len, c_in] -> [bs, seq_len, d_model]
        o = o + self.W_pos                 # add learnable positional encoding
        o = self.transformer_encoder(o)    # [bs, seq_len, d_model] -> [bs, seq_len, d_model]
        o = o.contiguous()
        o = self.head(o)                   # [bs, seq_len, d_model] -> [bs, c_out] (Flatten is inside the head)
        return o

model = OurTST(c_in, c_out, d_model, seq_len, 3, 0.4, 0.9)
evaluate_model(model, n_epoch=30)
epoch | train_loss | valid_loss | roc_auc_score | accuracy | time |
---|---|---|---|---|---|
0 | 0.925684 | 0.714104 | 0.499395 | 0.500000 | 00:02 |
1 | 0.941063 | 0.708397 | 0.513332 | 0.510499 | 00:02 |
2 | 0.910563 | 0.697380 | 0.547459 | 0.532066 | 00:02 |
3 | 0.875563 | 0.684338 | 0.586375 | 0.553916 | 00:02 |
4 | 0.788294 | 0.675705 | 0.634576 | 0.583144 | 00:02 |
5 | 0.729792 | 0.669882 | 0.668022 | 0.613224 | 00:02 |
6 | 0.698079 | 0.661823 | 0.690192 | 0.640182 | 00:02 |
7 | 0.678686 | 0.650137 | 0.703406 | 0.646141 | 00:02 |
8 | 0.669151 | 0.639466 | 0.713121 | 0.653235 | 00:02 |
9 | 0.640111 | 0.631399 | 0.720097 | 0.656924 | 00:02 |
10 | 0.636223 | 0.625919 | 0.728919 | 0.664586 | 00:02 |
11 | 0.609655 | 0.623980 | 0.735162 | 0.663734 | 00:02 |
12 | 0.600260 | 0.626050 | 0.739055 | 0.669694 | 00:02 |
13 | 0.595127 | 0.623003 | 0.741405 | 0.671396 | 00:02 |
14 | 0.592127 | 0.624625 | 0.742487 | 0.675369 | 00:02 |
15 | 0.572983 | 0.630688 | 0.746745 | 0.677923 | 00:02 |
16 | 0.573267 | 0.628453 | 0.747862 | 0.682463 | 00:02 |
17 | 0.564986 | 0.627425 | 0.749735 | 0.680761 | 00:02 |
18 | 0.567010 | 0.626230 | 0.751817 | 0.684733 | 00:02 |
19 | 0.555792 | 0.628040 | 0.751030 | 0.680761 | 00:02 |
20 | 0.546230 | 0.633149 | 0.751046 | 0.683031 | 00:02 |
21 | 0.547146 | 0.631326 | 0.752979 | 0.684166 | 00:02 |
22 | 0.548416 | 0.632791 | 0.752416 | 0.684449 | 00:02 |
23 | 0.548463 | 0.634959 | 0.752966 | 0.682747 | 00:02 |
24 | 0.541696 | 0.634788 | 0.753698 | 0.683598 | 00:02 |
25 | 0.541297 | 0.634837 | 0.753692 | 0.684733 | 00:02 |
26 | 0.530234 | 0.635058 | 0.753956 | 0.683314 | 00:02 |
27 | 0.539544 | 0.635267 | 0.753939 | 0.683031 | 00:02 |
28 | 0.546123 | 0.635198 | 0.753946 | 0.683314 | 00:02 |
29 | 0.535531 | 0.635230 | 0.753916 | 0.683314 | 00:02 |
Let’s try training for more epochs. In the following example, we train for 100 epochs, the same as in this tutorial from tsai.
model = OurTST(c_in, c_out, d_model, seq_len, 3, 0.6, 0.9)
evaluate_model(model, n_epoch=100)
epoch | train_loss | valid_loss | roc_auc_score | accuracy | time |
---|---|---|---|---|---|
0 | 0.986113 | 0.723611 | 0.495537 | 0.501986 | 00:02 |
1 | 0.985938 | 0.715433 | 0.500191 | 0.498581 | 00:02 |
2 | 0.959804 | 0.714567 | 0.505571 | 0.503121 | 00:02 |
3 | 0.935837 | 0.711624 | 0.512527 | 0.509081 | 00:02 |
4 | 0.949967 | 0.708520 | 0.520180 | 0.511351 | 00:02 |
5 | 0.938500 | 0.703618 | 0.531381 | 0.521283 | 00:02 |
6 | 0.914983 | 0.698704 | 0.546194 | 0.528944 | 00:02 |
7 | 0.916452 | 0.693311 | 0.562272 | 0.545687 | 00:02 |
8 | 0.863562 | 0.688229 | 0.581601 | 0.557605 | 00:02 |
9 | 0.825720 | 0.684290 | 0.603750 | 0.570658 | 00:02 |
10 | 0.814094 | 0.675768 | 0.626344 | 0.585982 | 00:02 |
11 | 0.770269 | 0.671013 | 0.646063 | 0.598751 | 00:02 |
12 | 0.734332 | 0.666361 | 0.663611 | 0.621169 | 00:02 |
13 | 0.728247 | 0.666351 | 0.673837 | 0.612372 | 00:02 |
14 | 0.703636 | 0.659145 | 0.688090 | 0.640749 | 00:02 |
15 | 0.695590 | 0.652872 | 0.696829 | 0.652667 | 00:02 |
16 | 0.685371 | 0.648839 | 0.701994 | 0.639330 | 00:02 |
17 | 0.677378 | 0.640827 | 0.706331 | 0.654370 | 00:02 |
18 | 0.659291 | 0.637410 | 0.711023 | 0.658059 | 00:02 |
19 | 0.654945 | 0.636273 | 0.717630 | 0.659478 | 00:02 |
20 | 0.642616 | 0.639057 | 0.721688 | 0.658343 | 00:02 |
21 | 0.629476 | 0.639091 | 0.728980 | 0.667991 | 00:02 |
22 | 0.618588 | 0.642530 | 0.735525 | 0.673099 | 00:02 |
23 | 0.612620 | 0.642298 | 0.741208 | 0.671680 | 00:02 |
24 | 0.606063 | 0.644125 | 0.745705 | 0.677639 | 00:02 |
25 | 0.597042 | 0.653301 | 0.746442 | 0.673099 | 00:02 |
26 | 0.591410 | 0.649610 | 0.749507 | 0.680761 | 00:02 |
27 | 0.599989 | 0.644790 | 0.752807 | 0.684449 | 00:02 |
28 | 0.580830 | 0.655860 | 0.755328 | 0.683598 | 00:02 |
29 | 0.579203 | 0.667195 | 0.754020 | 0.683314 | 00:02 |
30 | 0.576821 | 0.666351 | 0.757188 | 0.679342 | 00:02 |
31 | 0.571501 | 0.670810 | 0.757713 | 0.688138 | 00:02 |
32 | 0.577291 | 0.670262 | 0.760897 | 0.689841 | 00:02 |
33 | 0.565669 | 0.672775 | 0.759541 | 0.687003 | 00:02 |
34 | 0.581174 | 0.681387 | 0.757111 | 0.685868 | 00:02 |
35 | 0.571533 | 0.672656 | 0.760768 | 0.689841 | 00:02 |
36 | 0.568005 | 0.684501 | 0.759986 | 0.685868 | 00:02 |
37 | 0.558923 | 0.689911 | 0.758634 | 0.687287 | 00:02 |
38 | 0.544595 | 0.691652 | 0.760238 | 0.692111 | 00:02 |
39 | 0.547767 | 0.690527 | 0.760010 | 0.690976 | 00:02 |
40 | 0.547662 | 0.695285 | 0.760551 | 0.694949 | 00:02 |
41 | 0.545006 | 0.692685 | 0.761823 | 0.692111 | 00:02 |
42 | 0.557433 | 0.701815 | 0.761501 | 0.696084 | 00:02 |
43 | 0.556936 | 0.702578 | 0.758627 | 0.687003 | 00:02 |
44 | 0.542515 | 0.713648 | 0.757356 | 0.688422 | 00:02 |
45 | 0.540261 | 0.718203 | 0.757706 | 0.686152 | 00:02 |
46 | 0.531610 | 0.718681 | 0.758163 | 0.692111 | 00:02 |
47 | 0.525029 | 0.722788 | 0.760487 | 0.685868 | 00:02 |
48 | 0.531557 | 0.714598 | 0.761604 | 0.695516 | 00:02 |
49 | 0.528427 | 0.720070 | 0.757874 | 0.688422 | 00:02 |
50 | 0.533936 | 0.731660 | 0.760347 | 0.694381 | 00:02 |
51 | 0.532107 | 0.734147 | 0.758814 | 0.687571 | 00:02 |
52 | 0.528784 | 0.726680 | 0.761935 | 0.690125 | 00:02 |
53 | 0.525845 | 0.736096 | 0.760733 | 0.694665 | 00:02 |
54 | 0.535664 | 0.743276 | 0.758939 | 0.689841 | 00:02 |
55 | 0.521018 | 0.735471 | 0.761122 | 0.691544 | 00:02 |
56 | 0.523641 | 0.729754 | 0.761421 | 0.692963 | 00:02 |
57 | 0.525150 | 0.735508 | 0.761838 | 0.689274 | 00:02 |
58 | 0.516105 | 0.739418 | 0.763721 | 0.699205 | 00:02 |
59 | 0.511782 | 0.742465 | 0.761415 | 0.694098 | 00:02 |
60 | 0.523468 | 0.742341 | 0.760741 | 0.695800 | 00:02 |
61 | 0.520403 | 0.743021 | 0.761011 | 0.694381 | 00:02 |
62 | 0.514355 | 0.741946 | 0.762210 | 0.700624 | 00:02 |
63 | 0.511770 | 0.744806 | 0.762579 | 0.694949 | 00:02 |
64 | 0.514207 | 0.747035 | 0.761403 | 0.694665 | 00:02 |
65 | 0.504732 | 0.747706 | 0.760959 | 0.689841 | 00:02 |
66 | 0.507337 | 0.746426 | 0.761844 | 0.693530 | 00:02 |
67 | 0.502984 | 0.753452 | 0.761258 | 0.696084 | 00:02 |
68 | 0.503284 | 0.748423 | 0.762783 | 0.694381 | 00:02 |
69 | 0.510511 | 0.748741 | 0.762635 | 0.697219 | 00:02 |
70 | 0.502065 | 0.757950 | 0.761098 | 0.690692 | 00:02 |
71 | 0.499528 | 0.758117 | 0.760646 | 0.696084 | 00:02 |
72 | 0.508516 | 0.758758 | 0.759330 | 0.694665 | 00:02 |
73 | 0.497975 | 0.761433 | 0.759172 | 0.694098 | 00:02 |
74 | 0.497121 | 0.762476 | 0.758746 | 0.695233 | 00:02 |
75 | 0.494813 | 0.761791 | 0.759932 | 0.696368 | 00:02 |
76 | 0.496914 | 0.761701 | 0.760781 | 0.698922 | 00:02 |
77 | 0.492758 | 0.762113 | 0.760283 | 0.697219 | 00:02 |
78 | 0.495429 | 0.760835 | 0.760252 | 0.698354 | 00:02 |
79 | 0.500510 | 0.766248 | 0.759687 | 0.695800 | 00:02 |
80 | 0.491652 | 0.764638 | 0.760198 | 0.693530 | 00:02 |
81 | 0.495746 | 0.766025 | 0.760069 | 0.694098 | 00:02 |
82 | 0.492518 | 0.767861 | 0.759896 | 0.694381 | 00:02 |
83 | 0.492122 | 0.767575 | 0.759625 | 0.695800 | 00:02 |
84 | 0.495317 | 0.768247 | 0.758977 | 0.695233 | 00:02 |
85 | 0.503044 | 0.767743 | 0.759250 | 0.696935 | 00:02 |
86 | 0.488295 | 0.768618 | 0.759364 | 0.696368 | 00:02 |
87 | 0.495461 | 0.769804 | 0.759051 | 0.694381 | 00:02 |
88 | 0.499347 | 0.769520 | 0.758997 | 0.695233 | 00:02 |
89 | 0.511932 | 0.769175 | 0.758995 | 0.695233 | 00:02 |
90 | 0.504696 | 0.769521 | 0.758882 | 0.694949 | 00:02 |
91 | 0.491524 | 0.769784 | 0.758762 | 0.694098 | 00:02 |
92 | 0.498932 | 0.769638 | 0.758810 | 0.694665 | 00:02 |
93 | 0.493265 | 0.770175 | 0.758754 | 0.694665 | 00:02 |
94 | 0.497908 | 0.770099 | 0.758764 | 0.694665 | 00:02 |
95 | 0.486489 | 0.769981 | 0.758742 | 0.694665 | 00:02 |
96 | 0.496604 | 0.770035 | 0.758724 | 0.694665 | 00:02 |
97 | 0.485646 | 0.769963 | 0.758733 | 0.694381 | 00:02 |
98 | 0.499033 | 0.769941 | 0.758741 | 0.694381 | 00:02 |
99 | 0.490694 | 0.769939 | 0.758742 | 0.694381 | 00:02 |
There may be differences between the Transformer implementations in PyTorch and tsai (for example, PyTorch uses LayerNorm in its TransformerEncoderLayer, which is popular in NLP, while tsai employs BatchNorm).
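To illustrate that difference, here is a sketch of the two normalizations applied to a `[bs, seq_len, d_model]` activation. `LayerNorm` normalizes over the feature dimension for each sample and time step, while `BatchNorm1d` normalizes each feature over the batch and time steps and therefore needs the feature dimension moved to the channel axis first (this shows the general idea, not tsai's exact implementation):

import torch
from torch import nn

o = torch.randn(64, 62, 128)                              # [bs, seq_len, d_model]

layer_norm = nn.LayerNorm(128)                            # PyTorch-style: per (sample, time step) over d_model
batch_norm = nn.BatchNorm1d(128)                          # BatchNorm idea: per feature over (batch, time steps)

out_ln = layer_norm(o)
out_bn = batch_norm(o.transpose(1, 2)).transpose(1, 2)    # move d_model to the channel axis and back
print(out_ln.shape, out_bn.shape)                         # both remain torch.Size([64, 62, 128])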