Transformer for timeseries

timeseries
deeplearning
transformer
Author

Dien-Hoa Truong

Published

April 26, 2023

Developing an Intuition for Transformers and Applying Them to Time Series Classification

Transformer Architecture

Learning a new deep learning architecture, such as the Transformer, can be quite challenging. However, it doesn’t have to be so daunting. In this blog post, I will demonstrate a practical approach to getting started with a new architecture, specifically the Transformer. We will construct a basic Transformer architecture and progressively refine it until it matches the performance of the TST architecture (A Transformer-based Framework for Multivariate Time Series Representation Learning).


You’ll need fastai and tsai to run the code in this blog post.

!pip install -Uqq tsai
import torch
from torch.utils.data import Dataset
from torch import nn
from fastai.data.core import DataLoader, DataLoaders
from fastai.learner import Learner
from fastai.losses import LabelSmoothingCrossEntropyFlat
from fastai.metrics import RocAucBinary, accuracy
from fastai.torch_core import Module
from fastai.layers import Flatten
from tsai.models.TST import TST
from tsai.models.RNN import LSTM
from tsai.data.external import get_UCR_data
from tsai.callback.core import ShowGraph as ShowGraphCallback2
from tsai.learner import plot_metrics
from tsai.imports import default_device
import numpy as np
from torch.nn.modules.transformer import TransformerEncoderLayer, TransformerEncoder

Dataset: FaceDetection

Note

Why Time Series? Although the Transformer originates from the NLP domain and outperforms all previous architectures, I believe that, for those not yet familiar with NLP, it is more advantageous to start with a domain that requires less preprocessing, such as Time Series. This way, we can focus our attention on understanding the architecture itself.

In this tutorial, we will be using a dataset from the well-known UEA & UCR Time Series repository. Although we won’t delve into the details of this dataset in this blog post, it’s worth mentioning its purpose. The objective is to classify whether a given MEG signal (Magnetoencephalography) represents a face or not. The input dimension is 144, and the sequence length is 62.

I chose this dataset because it contains a reasonable amount of data (5,890 training instances and 3,524 testing instances) and has been used in a Transformer tutorial in the tsai repository. This ensures that we have a reliable reference model to aim to outperform.

We will utilize utility functions from the tsai and fastai libraries to facilitate our work and streamline the process.

batch_size, c_in, c_out, seq_len = 64, 144, 2, 62
X, y, splits = get_UCR_data('FaceDetection', return_split=False)

X_train = X[splits[0]]
y_train = y[splits[0]]
X_valid = X[splits[1]]
y_valid = y[splits[1]]

mean_trn = np.mean(X_train, axis=(0,2), keepdims=True)
std_trn = np.std(X_train, axis=(0,2), keepdims=True)
class TSDataset(Dataset):
    """TimeSeries DataSet for FaceDetection"""
    def __init__(self, X, y):
        super(TSDataset, self).__init__()
        self.X = torch.tensor(X)
        self.Y = torch.concat([torch.tensor([_y == '0'], dtype=int) for _y in y])
    
    def __len__(self): return len(self.X)
    
    def __getitem__(self, i):
        return self.X[i], self.Y[i]

The following code demonstrates how to create data loaders for the training and validation sets:

dset_train = TSDataset(X_train, y_train)
dset_valid = TSDataset(X_valid, y_valid)

dl_train = DataLoader(dset_train, batch_size=batch_size, shuffle=True)
dl_valid = DataLoader(dset_valid, batch_size=batch_size, shuffle=False)

dls = DataLoaders(dl_train, dl_valid) 
dls = dls.cuda()
x, y = next(iter(dl_train))
x.shape, y.shape
(torch.Size([64, 144, 62]), torch.Size([64]))

Reference Model

The reference model we will be using is TST (A Transformer-based Framework for Multivariate Time Series Representation Learning), as implemented in the tsai library.

def evaluate_model(model, n_epoch=30):
    learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropyFlat(), 
                    metrics=[RocAucBinary(), accuracy],  cbs=ShowGraphCallback2())
    learn.fit_one_cycle(n_epoch, 1e-4) 
model = TST(c_in, c_out, seq_len, dropout=0.3, fc_dropout=0.9, n_heads=1, n_layers=1)
evaluate_model(model)
epoch train_loss valid_loss roc_auc_score accuracy time
0 0.943482 0.710201 0.503936 0.505675 00:02
1 0.959244 0.705551 0.515424 0.515607 00:01
2 0.940436 0.697859 0.534342 0.532350 00:01
3 0.900420 0.692391 0.563204 0.543133 00:01
4 0.872391 0.682138 0.601740 0.567537 00:01
5 0.873357 0.674679 0.632084 0.591090 00:01
6 0.827063 0.668141 0.662991 0.618331 00:01
7 0.802059 0.658147 0.682879 0.635925 00:01
8 0.753277 0.653781 0.693962 0.641884 00:01
9 0.744286 0.647927 0.698235 0.643303 00:01
10 0.727256 0.646205 0.705161 0.654370 00:01
11 0.699470 0.644024 0.707230 0.654654 00:01
12 0.707701 0.639246 0.713946 0.663167 00:01
13 0.677575 0.637510 0.716578 0.663734 00:01
14 0.690399 0.635938 0.720142 0.667991 00:01
15 0.651648 0.635806 0.720622 0.667991 00:01
16 0.639079 0.634356 0.722736 0.665153 00:01
17 0.672199 0.633124 0.724921 0.665437 00:01
18 0.639468 0.631378 0.728070 0.668558 00:01
19 0.638936 0.629368 0.728399 0.668558 00:01
20 0.625763 0.627779 0.731610 0.672247 00:01
21 0.619361 0.626804 0.732947 0.671680 00:01
22 0.633200 0.626732 0.732423 0.673666 00:01
23 0.633324 0.625232 0.735347 0.673099 00:01
24 0.639257 0.625497 0.733850 0.672531 00:01
25 0.626306 0.625547 0.733988 0.671396 00:01
26 0.616693 0.625246 0.734917 0.674234 00:01
27 0.627592 0.625567 0.733930 0.671396 00:01
28 0.616872 0.625370 0.734323 0.671680 00:01
29 0.617217 0.624772 0.734979 0.672815 00:01

Note

For the reader's convenience, I do not use any normalization technique here and train for fewer epochs than the original reference notebook. After 100 epochs, that notebook reaches an accuracy of around 0.70.

Baseline

We’ll begin our journey by exploring an LSTM model, which was commonly used for sequence classification in the pre-transformer era.

Note

Observing the validation loss may lead you to believe that the model is overfitting. However, this is not the case, as the final metric (accuracy) continues to improve.

model = LSTM(144, 2, rnn_dropout=0.3, fc_dropout=0.3)
evaluate_model(model)
epoch train_loss valid_loss roc_auc_score accuracy time
0 0.698941 0.696725 0.522427 0.517026 00:02
1 0.696502 0.695753 0.526428 0.516175 00:01
2 0.694726 0.694219 0.533478 0.519013 00:01
3 0.691055 0.692617 0.541788 0.529228 00:01
4 0.691306 0.691157 0.549403 0.538309 00:01
5 0.684514 0.689195 0.559889 0.545119 00:01
6 0.680696 0.686942 0.571706 0.545970 00:01
7 0.668403 0.685266 0.581042 0.554767 00:01
8 0.659610 0.685288 0.585615 0.558456 00:01
9 0.656271 0.686517 0.588135 0.557889 00:01
10 0.651706 0.688422 0.589830 0.557889 00:01
11 0.630258 0.691245 0.590714 0.557605 00:01
12 0.620531 0.697241 0.591923 0.557605 00:01
13 0.606660 0.700341 0.597773 0.571793 00:01
14 0.595829 0.706305 0.598964 0.574347 00:01
15 0.581478 0.708097 0.604908 0.579455 00:01
16 0.570084 0.709881 0.610240 0.580590 00:01
17 0.560884 0.712225 0.612305 0.583144 00:01
18 0.570109 0.713588 0.616308 0.583995 00:01
19 0.558659 0.713829 0.616927 0.581725 00:01
20 0.545606 0.714969 0.618493 0.580874 00:01
21 0.541346 0.715872 0.620730 0.582293 00:01
22 0.538415 0.716904 0.622545 0.580590 00:01
23 0.535115 0.717429 0.623803 0.581725 00:01
24 0.531399 0.718325 0.623906 0.582577 00:01
25 0.540339 0.718253 0.624539 0.582009 00:01
26 0.539049 0.718644 0.624380 0.582293 00:01
27 0.529248 0.718712 0.624360 0.583144 00:01
28 0.524167 0.718756 0.624453 0.583712 00:01
29 0.524851 0.718751 0.624465 0.583712 00:01

Our TST

Transformer Architecture

The diagram above illustrates the Transformer architecture as presented in the “Attention is All You Need” paper. The breakthrough in this architecture is the Multi-Head Attention. The idea behind Attention is that if your model can focus on the most important parts of a long sequence, it can perform better without being affected by noise.

How does it work? Well, in my experience, when we are not very familiar with a new architecture, we shouldn’t focus too much on understanding every detail right away. I spent a lot of time reading various tutorials, trying to grasp the clever idea behind it, only to realize that I still didn’t know how to apply it to a real case. I will attempt to cover building Self-Attention from scratch in a future blog post. However, in this one, we will start by learning how to use the Transformer module from PyTorch.

What do we need to pay attention to here? We will mainly focus on the shapes of the input and output. The input maintains its shape after passing through the Transformer Encoder. Subsequently, the output is flattened and passed through a linear layer, which produces one logit per class for the given classification task.
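
To make the shape bookkeeping concrete, here is a minimal sketch (my own illustration, not from the original tutorial) that passes a random batch through a one-layer TransformerEncoder followed by a linear head, using the same sizes as our dataset:

enc_layer = TransformerEncoderLayer(d_model=144, nhead=1)
encoder = TransformerEncoder(enc_layer, num_layers=1)
x = torch.randn(64, 62, 144)                          # [bs, seq_len, c_in]
out = encoder(x)                                      # shape is preserved: [64, 62, 144]
logits = nn.Linear(62 * 144, 2)(out.reshape(64, -1))  # [64, 62 x 144] -> [64, 2]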

Our first model, shown below, is a very simple architecture with just one TransformerEncoder layer and one Linear layer.

Simple TST Architecture
batch_size, c_in, d_model, c_out, seq_len, dropout, fc_dropout = 64, 144, 128, 2, 62, 0.7, 0.9
class OurTST(Module):
    def __init__(self, c_in, c_out, seq_len, dropout):
        self.c_in, self.c_out, self.seq_len = c_in, c_out, seq_len
        encoder_layer = TransformerEncoderLayer(d_model=c_in, nhead=1, dropout=dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=1)
        self.head = nn.Linear(seq_len*c_in, c_out)
    def forward(self, x):
        o = x.swapaxes(1,2) # [bs, c_in, seq_len] -> [bs, seq_len, c_in]
        o = self.transformer_encoder(o) # [bs, seq_len, c_in] -> [bs, seq_len, c_in]
        o = o.reshape(o.shape[0], -1) # [bs, seq_len, c_in] -> [bs, seq_len x c_in]
        o = self.head(o) # [bs, seq_len x c_in] -> [bs, c_out]
        return o
model = OurTST(c_in, c_out, seq_len, 0.9)
model
OurTST(
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=144, out_features=144, bias=True)
        )
        (linear1): Linear(in_features=144, out_features=2048, bias=True)
        (dropout): Dropout(p=0.9, inplace=False)
        (linear2): Linear(in_features=2048, out_features=144, bias=True)
        (norm1): LayerNorm((144,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((144,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.9, inplace=False)
        (dropout2): Dropout(p=0.9, inplace=False)
      )
    )
  )
  (head): Linear(in_features=8928, out_features=2, bias=True)
)
evaluate_model(model, 30)
epoch train_loss valid_loss roc_auc_score accuracy time
0 0.771023 0.765103 0.501380 0.503405 00:01
1 0.767232 0.743445 0.526820 0.517877 00:01
2 0.760522 0.713145 0.577938 0.552781 00:01
3 0.718567 0.686044 0.636998 0.599603 00:01
4 0.707388 0.671847 0.679890 0.627128 00:01
5 0.667916 0.675715 0.702388 0.650114 00:01
6 0.646124 0.692107 0.713854 0.657492 00:01
7 0.626370 0.714405 0.720573 0.665153 00:01
8 0.623622 0.739222 0.721124 0.671396 00:01
9 0.597921 0.755921 0.722508 0.669410 00:01
10 0.595011 0.766809 0.725383 0.672815 00:01
11 0.574535 0.771822 0.729465 0.675936 00:01
12 0.573907 0.776058 0.732046 0.671112 00:01
13 0.570190 0.785060 0.733693 0.673950 00:01
14 0.552551 0.789218 0.734351 0.674234 00:01
15 0.575410 0.794398 0.736869 0.678774 00:01
16 0.563437 0.795728 0.738365 0.676788 00:01
17 0.563861 0.796549 0.739659 0.680193 00:01
18 0.538637 0.797076 0.740454 0.679909 00:01
19 0.548644 0.796367 0.741566 0.681896 00:01
20 0.549104 0.798111 0.741827 0.685017 00:01
21 0.539753 0.801571 0.741463 0.683598 00:01
22 0.542557 0.801905 0.742229 0.683031 00:01
23 0.547139 0.803032 0.742469 0.682463 00:01
24 0.532426 0.802947 0.743231 0.683598 00:01
25 0.524330 0.803357 0.743226 0.683314 00:01
26 0.525560 0.803959 0.743130 0.683598 00:01
27 0.534251 0.804135 0.743205 0.682747 00:01
28 0.541934 0.804315 0.743135 0.682747 00:01
29 0.527424 0.804290 0.743156 0.682747 00:01

Well, our model outperforms the LSTM model and even performs better than the TST model after 30 epochs.

Upgrades

In this section, I will discuss how we can build upon our basic Transformer architecture to achieve even greater results. We will explore several ideas inspired by the original paper and general deep learning concepts.

1- Feature Standardization

To enhance neural network training, it is recommended that the input data have zero mean and unit standard deviation (for more details, refer to this lesson from fast.ai). In line with the original paper, we employ feature standardization, which standardizes each feature separately.

mean_trn = np.mean(X_train, axis=(0,2), keepdims=True)
std_trn = np.std(X_train, axis=(0,2), keepdims=True)
... # In the Dataset
self.X = (self.X - mean_trn)/std_trn
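
As a quick sanity check (an illustrative snippet, not part of the original notebook), the standardized training data should have roughly zero mean and unit standard deviation for each of the 144 channels:

X_train_std = (X_train - mean_trn) / std_trn
X_train_std.mean(axis=(0, 2)).round(3)   # ~0 for every channel
X_train_std.std(axis=(0, 2)).round(3)    # ~1 for every channel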

2- Input Projection

Before feeding the input into the TransformerEncoder layer, it can be projected into another dimension, allowing us to control the dimensionality of the input received by the TransformerEncoder. In general, with suitable techniques, a deeper network can potentially outperform a shallow one.

    def __init__( ... )
        self.W_P = nn.Linear(c_in, d_model)
    def forward(self, x):
        o = x.swapaxes(1, 2)  
        o = self.W_P(o)  # Input Projection
        
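As a quick shape check (illustrative only, not from the original notebook), the projection maps each time step from c_in = 144 input features to a d_model = 128 dimensional representation:

W_P = nn.Linear(c_in, d_model)
o = torch.randn(batch_size, seq_len, c_in)   # [64, 62, 144]
W_P(o).shape                                 # torch.Size([64, 62, 128])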

3- Positional Encoding

Transformers do not inherently capture the positional order of input data, which can be crucial for certain tasks. To embed this information, we can employ techniques such as passing the input through a specific function (e.g., a sinusoidal function) or creating learnable parameters for position (as implemented in our code).

    def __init__( ... )
        # Positional encoding
        W_pos = torch.empty((seq_len, d_model), device=default_device())
        nn.init.uniform_(W_pos, -0.02, 0.02)
        self.W_pos = nn.Parameter(W_pos, requires_grad=True)
    def forward(self, x):
        o = x.swapaxes(1, 2)  
        o = self.W_P(o)  
        o = o + self.W_pos # Positional Encoding
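
The snippet above uses learned positional embeddings. As an alternative, here is a small sketch of the fixed sinusoidal encoding mentioned earlier (the helper sinusoidal_pos_encoding is my own name for it and is not used in the final model; it assumes an even d_model):

def sinusoidal_pos_encoding(seq_len, d_model):
    position = torch.arange(seq_len).unsqueeze(1)                # [seq_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                 # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                 # odd dimensions
    return pe

# usage: o = o + sinusoidal_pos_encoding(seq_len, d_model).to(o.device)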

4- Dropout

Deep neural networks can be prone to overfitting. To mitigate this issue, we can introduce dropout layers in our model, making it more robust. In our architecture, there are two types of dropout: one within the TransformerEncoder layer and another just before the final Linear layer.

    def __init__( ... )
        # Transformer encoder layers
        encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=1, dropout=drop_out) # dropout inside Transformer Layer
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=n_layers)

        self.head = nn.Sequential(
            nn.GELU(),
            Flatten(),
            nn.Dropout(fc_dropout), # fully connected dropout
            nn.Linear(seq_len * d_model, c_out)
        )
batch_size, c_in, d_model, c_out, seq_len, fc_dropout = 64, 144, 128, 2, 62, 0.9
X, y, splits = get_UCR_data('FaceDetection', return_split=False)

X_train = X[splits[0]]
y_train = y[splits[0]]
X_valid = X[splits[1]]
y_valid = y[splits[1]]

mean_trn = np.mean(X_train, axis=(0,2), keepdims=True)
std_trn = np.std(X_train, axis=(0,2), keepdims=True)
class TSDataset(Dataset):
    """TimeSeries DataSet for FaceDetection"""
    def __init__(self, X, y):
        super(TSDataset, self).__init__()
        self.X = torch.tensor(X)
        self.X = (self.X - mean_trn)/std_trn
        self.Y = torch.concat([torch.tensor([_y == '0'], dtype=int) for _y in y])
    
    def __len__(self): return len(self.X)
    
    def __getitem__(self, i):
        return self.X[i], self.Y[i]
class OurTST(Module):
    def __init__(self, c_in, c_out, d_model, seq_len, n_layers, drop_out, fc_dropout):
        self.c_in, self.c_out, self.seq_len = c_in, c_out, seq_len
        self.W_P = nn.Linear(c_in, d_model)

        # Positional encoding
        W_pos = torch.empty((seq_len, d_model), device=default_device())
        nn.init.uniform_(W_pos, -0.02, 0.02)
        self.W_pos = nn.Parameter(W_pos, requires_grad=True)

        # Transformer encoder layers
        encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=1, dropout=drop_out)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=n_layers)

        self.head = nn.Sequential(
            nn.GELU(),
            Flatten(),
            nn.Dropout(fc_dropout),
            nn.Linear(seq_len * d_model, c_out)
        )

    def forward(self, x):
        o = x.swapaxes(1, 2)  # [bs,c_in,seq_len] -> [bs,seq_len,c_in]
        o = self.W_P(o)  # [bs,seq_len,c_in] -> [bs,seq_len,d_model]
        o = o + self.W_pos
        o = self.transformer_encoder(o)  # [bs, seq_len, d_model] -> [bs, seq_len, d_model]
        o = o.contiguous()
        o = self.head(o)  # [bs, seq_len, d_model] -> [bs, c_out]
        return o
model = OurTST(c_in, c_out, d_model, seq_len, 3, 0.4, 0.9)
evaluate_model(model, n_epoch=30)
epoch train_loss valid_loss roc_auc_score accuracy time
0 0.925684 0.714104 0.499395 0.500000 00:02
1 0.941063 0.708397 0.513332 0.510499 00:02
2 0.910563 0.697380 0.547459 0.532066 00:02
3 0.875563 0.684338 0.586375 0.553916 00:02
4 0.788294 0.675705 0.634576 0.583144 00:02
5 0.729792 0.669882 0.668022 0.613224 00:02
6 0.698079 0.661823 0.690192 0.640182 00:02
7 0.678686 0.650137 0.703406 0.646141 00:02
8 0.669151 0.639466 0.713121 0.653235 00:02
9 0.640111 0.631399 0.720097 0.656924 00:02
10 0.636223 0.625919 0.728919 0.664586 00:02
11 0.609655 0.623980 0.735162 0.663734 00:02
12 0.600260 0.626050 0.739055 0.669694 00:02
13 0.595127 0.623003 0.741405 0.671396 00:02
14 0.592127 0.624625 0.742487 0.675369 00:02
15 0.572983 0.630688 0.746745 0.677923 00:02
16 0.573267 0.628453 0.747862 0.682463 00:02
17 0.564986 0.627425 0.749735 0.680761 00:02
18 0.567010 0.626230 0.751817 0.684733 00:02
19 0.555792 0.628040 0.751030 0.680761 00:02
20 0.546230 0.633149 0.751046 0.683031 00:02
21 0.547146 0.631326 0.752979 0.684166 00:02
22 0.548416 0.632791 0.752416 0.684449 00:02
23 0.548463 0.634959 0.752966 0.682747 00:02
24 0.541696 0.634788 0.753698 0.683598 00:02
25 0.541297 0.634837 0.753692 0.684733 00:02
26 0.530234 0.635058 0.753956 0.683314 00:02
27 0.539544 0.635267 0.753939 0.683031 00:02
28 0.546123 0.635198 0.753946 0.683314 00:02
29 0.535531 0.635230 0.753916 0.683314 00:02

Let’s try training with more epochs. In the following example, we will train for 100 epochs, the same as in this tutorial from tsai.

model = OurTST(c_in, c_out, d_model, seq_len, 3, 0.6, 0.9)
evaluate_model(model, n_epoch=100)
epoch train_loss valid_loss roc_auc_score accuracy time
0 0.986113 0.723611 0.495537 0.501986 00:02
1 0.985938 0.715433 0.500191 0.498581 00:02
2 0.959804 0.714567 0.505571 0.503121 00:02
3 0.935837 0.711624 0.512527 0.509081 00:02
4 0.949967 0.708520 0.520180 0.511351 00:02
5 0.938500 0.703618 0.531381 0.521283 00:02
6 0.914983 0.698704 0.546194 0.528944 00:02
7 0.916452 0.693311 0.562272 0.545687 00:02
8 0.863562 0.688229 0.581601 0.557605 00:02
9 0.825720 0.684290 0.603750 0.570658 00:02
10 0.814094 0.675768 0.626344 0.585982 00:02
11 0.770269 0.671013 0.646063 0.598751 00:02
12 0.734332 0.666361 0.663611 0.621169 00:02
13 0.728247 0.666351 0.673837 0.612372 00:02
14 0.703636 0.659145 0.688090 0.640749 00:02
15 0.695590 0.652872 0.696829 0.652667 00:02
16 0.685371 0.648839 0.701994 0.639330 00:02
17 0.677378 0.640827 0.706331 0.654370 00:02
18 0.659291 0.637410 0.711023 0.658059 00:02
19 0.654945 0.636273 0.717630 0.659478 00:02
20 0.642616 0.639057 0.721688 0.658343 00:02
21 0.629476 0.639091 0.728980 0.667991 00:02
22 0.618588 0.642530 0.735525 0.673099 00:02
23 0.612620 0.642298 0.741208 0.671680 00:02
24 0.606063 0.644125 0.745705 0.677639 00:02
25 0.597042 0.653301 0.746442 0.673099 00:02
26 0.591410 0.649610 0.749507 0.680761 00:02
27 0.599989 0.644790 0.752807 0.684449 00:02
28 0.580830 0.655860 0.755328 0.683598 00:02
29 0.579203 0.667195 0.754020 0.683314 00:02
30 0.576821 0.666351 0.757188 0.679342 00:02
31 0.571501 0.670810 0.757713 0.688138 00:02
32 0.577291 0.670262 0.760897 0.689841 00:02
33 0.565669 0.672775 0.759541 0.687003 00:02
34 0.581174 0.681387 0.757111 0.685868 00:02
35 0.571533 0.672656 0.760768 0.689841 00:02
36 0.568005 0.684501 0.759986 0.685868 00:02
37 0.558923 0.689911 0.758634 0.687287 00:02
38 0.544595 0.691652 0.760238 0.692111 00:02
39 0.547767 0.690527 0.760010 0.690976 00:02
40 0.547662 0.695285 0.760551 0.694949 00:02
41 0.545006 0.692685 0.761823 0.692111 00:02
42 0.557433 0.701815 0.761501 0.696084 00:02
43 0.556936 0.702578 0.758627 0.687003 00:02
44 0.542515 0.713648 0.757356 0.688422 00:02
45 0.540261 0.718203 0.757706 0.686152 00:02
46 0.531610 0.718681 0.758163 0.692111 00:02
47 0.525029 0.722788 0.760487 0.685868 00:02
48 0.531557 0.714598 0.761604 0.695516 00:02
49 0.528427 0.720070 0.757874 0.688422 00:02
50 0.533936 0.731660 0.760347 0.694381 00:02
51 0.532107 0.734147 0.758814 0.687571 00:02
52 0.528784 0.726680 0.761935 0.690125 00:02
53 0.525845 0.736096 0.760733 0.694665 00:02
54 0.535664 0.743276 0.758939 0.689841 00:02
55 0.521018 0.735471 0.761122 0.691544 00:02
56 0.523641 0.729754 0.761421 0.692963 00:02
57 0.525150 0.735508 0.761838 0.689274 00:02
58 0.516105 0.739418 0.763721 0.699205 00:02
59 0.511782 0.742465 0.761415 0.694098 00:02
60 0.523468 0.742341 0.760741 0.695800 00:02
61 0.520403 0.743021 0.761011 0.694381 00:02
62 0.514355 0.741946 0.762210 0.700624 00:02
63 0.511770 0.744806 0.762579 0.694949 00:02
64 0.514207 0.747035 0.761403 0.694665 00:02
65 0.504732 0.747706 0.760959 0.689841 00:02
66 0.507337 0.746426 0.761844 0.693530 00:02
67 0.502984 0.753452 0.761258 0.696084 00:02
68 0.503284 0.748423 0.762783 0.694381 00:02
69 0.510511 0.748741 0.762635 0.697219 00:02
70 0.502065 0.757950 0.761098 0.690692 00:02
71 0.499528 0.758117 0.760646 0.696084 00:02
72 0.508516 0.758758 0.759330 0.694665 00:02
73 0.497975 0.761433 0.759172 0.694098 00:02
74 0.497121 0.762476 0.758746 0.695233 00:02
75 0.494813 0.761791 0.759932 0.696368 00:02
76 0.496914 0.761701 0.760781 0.698922 00:02
77 0.492758 0.762113 0.760283 0.697219 00:02
78 0.495429 0.760835 0.760252 0.698354 00:02
79 0.500510 0.766248 0.759687 0.695800 00:02
80 0.491652 0.764638 0.760198 0.693530 00:02
81 0.495746 0.766025 0.760069 0.694098 00:02
82 0.492518 0.767861 0.759896 0.694381 00:02
83 0.492122 0.767575 0.759625 0.695800 00:02
84 0.495317 0.768247 0.758977 0.695233 00:02
85 0.503044 0.767743 0.759250 0.696935 00:02
86 0.488295 0.768618 0.759364 0.696368 00:02
87 0.495461 0.769804 0.759051 0.694381 00:02
88 0.499347 0.769520 0.758997 0.695233 00:02
89 0.511932 0.769175 0.758995 0.695233 00:02
90 0.504696 0.769521 0.758882 0.694949 00:02
91 0.491524 0.769784 0.758762 0.694098 00:02
92 0.498932 0.769638 0.758810 0.694665 00:02
93 0.493265 0.770175 0.758754 0.694665 00:02
94 0.497908 0.770099 0.758764 0.694665 00:02
95 0.486489 0.769981 0.758742 0.694665 00:02
96 0.496604 0.770035 0.758724 0.694665 00:02
97 0.485646 0.769963 0.758733 0.694381 00:02
98 0.499033 0.769941 0.758741 0.694381 00:02
99 0.490694 0.769939 0.758742 0.694381 00:02

Note

There may be differences between the implementation of the Transformer in PyTorch and tsai (for example, PyTorch uses LayerNorm in the TransformerEncoder layer, which is popular in NLP, while tsai employs BatchNorm).
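
To make the difference concrete, here is an illustrative comparison (not taken from either library) of the two normalization choices applied to a [bs, seq_len, d_model] activation:

o = torch.randn(batch_size, seq_len, d_model)
ln = nn.LayerNorm(d_model)(o)                                     # normalizes the d_model features of each time step
bn = nn.BatchNorm1d(d_model)(o.transpose(1, 2)).transpose(1, 2)   # normalizes each feature over the batch and time steps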