double down on dual patch norm, fix MAE and Simmim to be compatible with dual patchnorm

seeing a signal with dual patchnorm in another repository, fully incorporate
adopt dual patchnorm paper for as many vit as applicable, release 1.0.0
2025-12-30 16:12:29 +00:00 · 2023-02-10 10:39:50 -08:00 · 2023-02-06 09:45:12 -08:00 · 2023-02-03 08:11:29 -08:00 · 2022-12-05 10:47:36 -08:00 · 2022-12-02 11:28:11 -08:00
31 changed files with 1917 additions and 129 deletions
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@
 - [Install](#install)
 - [Usage](#usage)
 - [Parameters](#parameters)
+- [Simple ViT](#simple-vit)
 - [Distillation](#distillation)
 - [Deep ViT](#deep-vit)
 - [CaiT](#cait)
@@ -29,6 +30,8 @@
 - [Adaptive Token Sampling](#adaptive-token-sampling)
 - [Patch Merger](#patch-merger)
 - [Vision Transformer for Small Datasets](#vision-transformer-for-small-datasets)
+- [3D Vit](#3d-vit)
+- [ViVit](#vivit)
 - [Parallel ViT](#parallel-vit)
 - [Learnable Memory ViT](#learnable-memory-vit)
 - [Dino](#dino)
@@ -51,6 +54,8 @@ The official Jax repository is <a href="https://github.com/google-research/visio

 A tensorflow2 translation also exists <a href="https://github.com/taki0112/vit-tensorflow">here</a>, created by research scientist <a href="https://github.com/taki0112">Junho Kim</a>! 🙏

+<a href="https://github.com/conceptofmind/vit-flax">Flax translation</a> by <a href="https://github.com/conceptofmind">Enrico Shippole</a>!
+
 ## Install

 ```bash
@@ -106,6 +111,33 @@ Embedding dropout rate.
 - `pool`: string, either `cls` token pooling or `mean` pooling


+## Simple ViT
+
+<a href="https://arxiv.org/abs/2205.01580">An update</a> from some of the same authors of the original paper proposes simplifications to `ViT` that allows it to train faster and better.
+
+Among these simplifications include 2d sinusoidal positional embedding, global average pooling (no CLS token), no dropout, batch sizes of 1024 rather than 4096, and use of RandAugment and MixUp augmentations. They also show that a simple linear at the end is not significantly worse than the original MLP head
+
+You can use it by importing the `SimpleViT` as shown below
+
+```python
+import torch
+from vit_pytorch import SimpleViT
+
+v = SimpleViT(
+    image_size = 256,
+    patch_size = 32,
+    num_classes = 1000,
+    dim = 1024,
+    depth = 6,
+    heads = 16,
+    mlp_dim = 2048
+)
+
+img = torch.randn(1, 3, 256, 256)
+
+preds = v(img) # (1, 1000)
+```
+
 ## Distillation

 <img src="./images/distill.png" width="300px"></img>
@@ -633,7 +665,7 @@ preds = v(img) # (2, 1000)

 <img src="./images/nest.png" width="400px"></img>

-This <a href="https://arxiv.org/abs/2105.12723">paper</a> decided to process the image in hierarchical stages, with attention only within tokens of local blocks, which aggregate as it moves up the heirarchy. The aggregation is done in the image plane, and contains a convolution and subsequent maxpool to allow it to pass information across the boundary.
+This <a href="https://arxiv.org/abs/2105.12723">paper</a> decided to process the image in hierarchical stages, with attention only within tokens of local blocks, which aggregate as it moves up the hierarchy. The aggregation is done in the image plane, and contains a convolution and subsequent maxpool to allow it to pass information across the boundary.

 You can use it with the following code (ex. NesT-T)

@@ -647,7 +679,7 @@ nest = NesT(
    dim = 96,
    heads = 3,
    num_hierarchies = 3,        # number of hierarchies
-    block_repeats = (2, 2, 8),  # the number of transformer blocks at each heirarchy, starting from the bottom
+    block_repeats = (2, 2, 8),  # the number of transformer blocks at each hierarchy, starting from the bottom
    num_classes = 1000
 )

@@ -937,6 +969,117 @@ img = torch.randn(4, 3, 256, 256)
 tokens = spt(img) # (4, 256, 1024)
 ```

+## 3D ViT
+
+By popular request, I will start extending a few of the architectures in this repository to 3D ViTs, for use with video, medical imaging, etc.
+
+You will need to pass in two additional hyperparameters: (1) the number of frames `frames` and (2) patch size along the frame dimension `frame_patch_size`
+
+For starters, 3D ViT
+
+```python
+import torch
+from vit_pytorch.vit_3d import ViT
+
+v = ViT(
+    image_size = 128,          # image size
+    frames = 16,               # number of frames
+    image_patch_size = 16,     # image patch size
+    frame_patch_size = 2,      # frame patch size
+    num_classes = 1000,
+    dim = 1024,
+    depth = 6,
+    heads = 8,
+    mlp_dim = 2048,
+    dropout = 0.1,
+    emb_dropout = 0.1
+)
+
+video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)
+
+preds = v(video) # (4, 1000)
+```
+
+3D Simple ViT
+
+```python
+import torch
+from vit_pytorch.simple_vit_3d import SimpleViT
+
+v = SimpleViT(
+    image_size = 128,          # image size
+    frames = 16,               # number of frames
+    image_patch_size = 16,     # image patch size
+    frame_patch_size = 2,      # frame patch size
+    num_classes = 1000,
+    dim = 1024,
+    depth = 6,
+    heads = 8,
+    mlp_dim = 2048
+)
+
+video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)
+
+preds = v(video) # (4, 1000)
+```
+
+3D version of <a href="https://github.com/lucidrains/vit-pytorch#cct">CCT</a>
+
+```python
+import torch
+from vit_pytorch.cct_3d import CCT
+
+cct = CCT(
+    img_size = 224,
+    num_frames = 8,
+    embedding_dim = 384,
+    n_conv_layers = 2,
+    frame_kernel_size = 3,
+    kernel_size = 7,
+    stride = 2,
+    padding = 3,
+    pooling_kernel_size = 3,
+    pooling_stride = 2,
+    pooling_padding = 1,
+    num_layers = 14,
+    num_heads = 6,
+    mlp_radio = 3.,
+    num_classes = 1000,
+    positional_embedding = 'learnable'
+)
+
+video = torch.randn(1, 3, 8, 224, 224) # (batch, channels, frames, height, width)
+pred = cct(video)
+```
+
+## ViViT
+
+<img src="./images/vivit.png" width="350px"></img>
+
+This <a href="https://arxiv.org/abs/2103.15691">paper</a> offers 3 different types of architectures for efficient attention of videos, with the main theme being factorizing the attention across space and time. This repository will offer the first variant, which is a spatial transformer followed by a temporal one.
+
+```python
+import torch
+from vit_pytorch.vivit import ViT
+
+v = ViT(
+    image_size = 128,          # image size
+    frames = 16,               # number of frames
+    image_patch_size = 16,     # image patch size
+    frame_patch_size = 2,      # frame patch size
+    num_classes = 1000,
+    dim = 1024,
+    spatial_depth = 6,         # depth of the spatial transformer
+    temporal_depth = 6,        # depth of the temporal transformer
+    heads = 8,
+    mlp_dim = 2048
+)
+
+video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)
+
+preds = v(video) # (4, 1000)
+```
+
 ## Parallel ViT

 <img src="./images/parallel-vit.png" width="350px"></img>
@@ -1227,6 +1370,47 @@ logits, embeddings = v(img)
 embeddings # (1, 65, 1024) - (batch x patches x model dim)
 ```

+Or say for `CrossViT`, which has a multi-scale encoder that outputs two sets of embeddings for 'large' and 'small' scales
+
+```python
+import torch
+from vit_pytorch.cross_vit import CrossViT
+
+v = CrossViT(
+    image_size = 256,
+    num_classes = 1000,
+    depth = 4,
+    sm_dim = 192,
+    sm_patch_size = 16,
+    sm_enc_depth = 2,
+    sm_enc_heads = 8,
+    sm_enc_mlp_dim = 2048,
+    lg_dim = 384,
+    lg_patch_size = 64,
+    lg_enc_depth = 3,
+    lg_enc_heads = 8,
+    lg_enc_mlp_dim = 2048,
+    cross_attn_depth = 2,
+    cross_attn_heads = 8,
+    dropout = 0.1,
+    emb_dropout = 0.1
+)
+
+# wrap the CrossViT
+
+from vit_pytorch.extractor import Extractor
+v = Extractor(v, layer_name = 'multi_scale_encoder') # take embedding coming from the output of multi-scale-encoder
+
+# forward pass now returns predictions and the attention maps
+
+img = torch.randn(1, 3, 256, 256)
+logits, embeddings = v(img)
+
+# there is one extra token due to the CLS token
+
+embeddings # ((1, 257, 192), (1, 17, 384)) - (batch x patches x dimension) <- large and small scales respectively
+```
+
 ## Research Ideas

 ### Efficient Attention
@@ -1669,6 +1853,48 @@ Coming from computer vision and new to transformers? Here are some resources tha
 }
 ```

+```bibtex
+@misc{Beyer2022BetterPlainViT
+    title     = {Better plain ViT baselines for ImageNet-1k},
+    author    = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
+    publisher = {arXiv},
+    year      = {2022}
+}
+
+```
+
+```bibtex
+@article{Arnab2021ViViTAV,
+    title   = {ViViT: A Video Vision Transformer},
+    author  = {Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lucic and Cordelia Schmid},
+    journal = {2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
+    year    = {2021},
+    pages   = {6816-6826}
+}
+```
+
+```bibtex
+@article{Liu2022PatchDropoutEV,
+    title   = {PatchDropout: Economizing Vision Transformers Using Patch Dropout},
+    author  = {Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith},
+    journal = {ArXiv},
+    year    = {2022},
+    volume  = {abs/2208.07220}
+}
+```
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2302.01327,
+    doi     = {10.48550/ARXIV.2302.01327},
+    url     = {https://arxiv.org/abs/2302.01327},
+    author  = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},
+    title   = {Dual PatchNorm},
+    publisher = {arXiv},
+    year    = {2023},
+    copyright = {Creative Commons Attribution 4.0 International}
+}
+```
+
 ```bibtex
@misc{vaswani2017attention,
    title   = {Attention Is All You Need},
--- a/examples/cats_and_dogs.ipynb
+++ b/examples/cats_and_dogs.ipynb
@@ -16,7 +16,7 @@
    "\n",
    "* Dogs vs. Cats Redux: Kernels Edition - https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition\n",
    "* Base Code - https://www.kaggle.com/reukki/pytorch-cnn-tutorial-with-cats-and-dogs/\n",
-    "* Effecient Attention Implementation - https://github.com/lucidrains/vit-pytorch#efficient-attention"
+    "* Efficient Attention Implementation - https://github.com/lucidrains/vit-pytorch#efficient-attention"
   ]
  },
  {
@@ -342,7 +342,7 @@
    "id": "ZhYDJXk2SRDu"
   },
   "source": [
-    "## Image Augumentation"
+    "## Image Augmentation"
   ]
  },
  {
@@ -497,7 +497,7 @@
    "id": "TF9yMaRrSvmv"
   },
   "source": [
-    "## Effecient Attention"
+    "## Efficient Attention"
   ]
  },
  {
@@ -1307,7 +1307,7 @@
  "celltoolbar": "Edit Metadata",
  "colab": {
   "collapsed_sections": [],
-   "name": "Effecient Attention | Cats & Dogs",
+   "name": "Efficient Attention | Cats & Dogs",
   "provenance": [],
   "toc_visible": true
  },
--- a/images/vivit.png
+++ b/images/vivit.png
--- a/setup.py
+++ b/setup.py
@@ -3,9 +3,10 @@ from setuptools import setup, find_packages
 setup(
  name = 'vit-pytorch',
  packages = find_packages(exclude=['examples']),
-  version = '0.34.1',
+  version = '1.0.2',
  license='MIT',
  description = 'Vision Transformer (ViT) - Pytorch',
+  long_description_content_type = 'text/markdown',
  author = 'Phil Wang',
  author_email = 'lucidrains@gmail.com',
  url = 'https://github.com/lucidrains/vit-pytorch',
@@ -15,7 +16,7 @@ setup(
    'image recognition'
  ],
  install_requires=[
-    'einops>=0.4.1',
+    'einops>=0.6.0',
    'torch>=1.10',
    'torchvision'
  ],
@@ -23,7 +24,9 @@ setup(
    'pytest-runner',
  ],
  tests_require=[
-    'pytest'
+    'pytest',
+    'torch==1.12.1',
+    'torchvision==0.13.1'
  ],
  classifiers=[
    'Development Status :: 4 - Beta',
--- a/vit_pytorch/init.py
+++ b/vit_pytorch/init.py
@@ -1,3 +1,5 @@
 from vit_pytorch.vit import ViT
+from vit_pytorch.simple_vit import SimpleViT
+
 from vit_pytorch.mae import MAE
 from vit_pytorch.dino import Dino
--- a/vit_pytorch/ats_vit.py
+++ b/vit_pytorch/ats_vit.py
@@ -230,7 +230,9 @@ class ViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/cait.py
+++ b/vit_pytorch/cait.py
@@ -150,7 +150,9 @@ class CaiT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches, dim))
--- a/vit_pytorch/cct.py
+++ b/vit_pytorch/cct.py
@@ -1,9 +1,17 @@
 import torch
-import torch.nn as nn
+from torch import nn, einsum
 import torch.nn.functional as F

+from einops import rearrange, repeat
+
 # helpers

+def exists(val):
+    return val is not None
+
+def default(val, d):
+    return val if exists(val) else d
+
 def pair(t):
    return t if isinstance(t, tuple) else (t, t)

@@ -50,8 +58,9 @@ def cct_16(*args, **kwargs):
 def _cct(num_layers, num_heads, mlp_ratio, embedding_dim,
         kernel_size=3, stride=None, padding=None,
         *args, **kwargs):
-    stride = stride if stride is not None else max(1, (kernel_size // 2) - 1)
-    padding = padding if padding is not None else max(1, (kernel_size // 2))
+    stride = default(stride, max(1, (kernel_size // 2) - 1))
+    padding = default(padding, max(1, (kernel_size // 2)))
+
    return CCT(num_layers=num_layers,
               num_heads=num_heads,
               mlp_ratio=mlp_ratio,
@@ -61,13 +70,22 @@ def _cct(num_layers, num_heads, mlp_ratio, embedding_dim,
               padding=padding,
               *args, **kwargs)

+# positional
+
+def sinusoidal_embedding(n_channels, dim):
+    pe = torch.FloatTensor([[p / (10000 ** (2 * (i // 2) / dim)) for i in range(dim)]
+                            for p in range(n_channels)])
+    pe[:, 0::2] = torch.sin(pe[:, 0::2])
+    pe[:, 1::2] = torch.cos(pe[:, 1::2])
+    return rearrange(pe, '... -> 1 ...')
+
 # modules

 class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, attention_dropout=0.1, projection_dropout=0.1):
        super().__init__()
-        self.num_heads = num_heads
-        head_dim = dim // self.num_heads
+        self.heads = num_heads
+        head_dim = dim // self.heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
@@ -77,17 +95,20 @@ class Attention(nn.Module):

    def forward(self, x):
        B, N, C = x.shape
-        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
-        q, k, v = qkv[0], qkv[1], qkv[2]

-        attn = (q @ k.transpose(-2, -1)) * self.scale
+        qkv = self.qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        q = q * self.scale
+
+        attn = einsum('b h i d, b h j d -> b h i j', q, k)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

-        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
+        x = einsum('b h i j, b h j d -> b h i d', attn, v)
+        x = rearrange(x, 'b h n d -> b n (h d)')
+
+        return self.proj_drop(self.proj(x))


 class TransformerEncoderLayer(nn.Module):
@@ -97,7 +118,8 @@ class TransformerEncoderLayer(nn.Module):
    """
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 attention_dropout=0.1, drop_path_rate=0.1):
-        super(TransformerEncoderLayer, self).__init__()
+        super().__init__()
+
        self.pre_norm = nn.LayerNorm(d_model)
        self.self_attn = Attention(dim=d_model, num_heads=nhead,
                                   attention_dropout=attention_dropout, projection_dropout=dropout)
@@ -108,50 +130,34 @@ class TransformerEncoderLayer(nn.Module):
        self.linear2  = nn.Linear(dim_feedforward, d_model)
        self.dropout2 = nn.Dropout(dropout)

-        self.drop_path = DropPath(drop_path_rate) if drop_path_rate > 0 else nn.Identity()
+        self.drop_path = DropPath(drop_path_rate)

        self.activation = F.gelu

-    def forward(self, src: torch.Tensor, *args, **kwargs) -> torch.Tensor:
+    def forward(self, src, *args, **kwargs):
        src = src + self.drop_path(self.self_attn(self.pre_norm(src)))
        src = self.norm1(src)
        src2 = self.linear2(self.dropout1(self.activation(self.linear1(src))))
        src = src + self.drop_path(self.dropout2(src2))
        return src

-
-def drop_path(x, drop_prob: float = 0., training: bool = False):
-    """
-    Obtained from: github.com:rwightman/pytorch-image-models
-    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
-    This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
-    the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
-    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
-    changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
-    'survival rate' as the argument.
-    """
-    if drop_prob == 0. or not training:
-        return x
-    keep_prob = 1 - drop_prob
-    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
-    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
-    random_tensor.floor_()  # binarize
-    output = x.div(keep_prob) * random_tensor
-    return output
-
-
 class DropPath(nn.Module):
-    """
-    Obtained from: github.com:rwightman/pytorch-image-models
-    Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
-    """
    def __init__(self, drop_prob=None):
-        super(DropPath, self).__init__()
-        self.drop_prob = drop_prob
+        super().__init__()
+        self.drop_prob = float(drop_prob)

    def forward(self, x):
-        return drop_path(x, self.drop_prob, self.training)
+        batch, drop_prob, device, dtype = x.shape[0], self.drop_prob, x.device, x.dtype

+        if drop_prob <= 0. or not self.training:
+            return x
+
+        keep_prob = 1 - self.drop_prob
+        shape = (batch, *((1,) * (x.ndim - 1)))
+
+        keep_mask = torch.zeros(shape, device = device).float().uniform_(0, 1) < keep_prob
+        output = x.div(keep_prob) * keep_mask.float()
+        return output

 class Tokenizer(nn.Module):
    def __init__(self,
@@ -164,34 +170,35 @@ class Tokenizer(nn.Module):
                 activation=None,
                 max_pool=True,
                 conv_bias=False):
-        super(Tokenizer, self).__init__()
+        super().__init__()

        n_filter_list = [n_input_channels] + \
                        [in_planes for _ in range(n_conv_layers - 1)] + \
                        [n_output_channels]

+        n_filter_list_pairs = zip(n_filter_list[:-1], n_filter_list[1:])
+
        self.conv_layers = nn.Sequential(
            *[nn.Sequential(
-                nn.Conv2d(n_filter_list[i], n_filter_list[i + 1],
+                nn.Conv2d(chan_in, chan_out,
                          kernel_size=(kernel_size, kernel_size),
                          stride=(stride, stride),
                          padding=(padding, padding), bias=conv_bias),
-                nn.Identity() if activation is None else activation(),
+                nn.Identity() if not exists(activation) else activation(),
                nn.MaxPool2d(kernel_size=pooling_kernel_size,
                             stride=pooling_stride,
                             padding=pooling_padding) if max_pool else nn.Identity()
            )
-                for i in range(n_conv_layers)
+                for chan_in, chan_out in n_filter_list_pairs
            ])

-        self.flattener = nn.Flatten(2, 3)
        self.apply(self.init_weight)

    def sequence_length(self, n_channels=3, height=224, width=224):
        return self.forward(torch.zeros((1, n_channels, height, width))).shape[1]

    def forward(self, x):
-        return self.flattener(self.conv_layers(x)).transpose(-2, -1)
+        return rearrange(self.conv_layers(x), 'b c h w -> b (h w) c')

    @staticmethod
    def init_weight(m):
@@ -214,106 +221,104 @@ class TransformerClassifier(nn.Module):
                 sequence_length=None,
                 *args, **kwargs):
        super().__init__()
-        positional_embedding = positional_embedding if \
-            positional_embedding in ['sine', 'learnable', 'none'] else 'sine'
+        assert positional_embedding in {'sine', 'learnable', 'none'}
+
        dim_feedforward = int(embedding_dim * mlp_ratio)
        self.embedding_dim = embedding_dim
        self.sequence_length = sequence_length
        self.seq_pool = seq_pool

-        assert sequence_length is not None or positional_embedding == 'none', \
+        assert exists(sequence_length) or positional_embedding == 'none', \
            f"Positional embedding is set to {positional_embedding} and" \
            f" the sequence length was not specified."

        if not seq_pool:
            sequence_length += 1
-            self.class_emb = nn.Parameter(torch.zeros(1, 1, self.embedding_dim),
-                                          requires_grad=True)
+            self.class_emb = nn.Parameter(torch.zeros(1, 1, self.embedding_dim), requires_grad=True)
        else:
            self.attention_pool = nn.Linear(self.embedding_dim, 1)

-        if positional_embedding != 'none':
-            if positional_embedding == 'learnable':
-                self.positional_emb = nn.Parameter(torch.zeros(1, sequence_length, embedding_dim),
-                                                   requires_grad=True)
-                nn.init.trunc_normal_(self.positional_emb, std=0.2)
-            else:
-                self.positional_emb = nn.Parameter(self.sinusoidal_embedding(sequence_length, embedding_dim),
-                                                   requires_grad=False)
-        else:
+        if positional_embedding == 'none':
            self.positional_emb = None
+        elif positional_embedding == 'learnable':
+            self.positional_emb = nn.Parameter(torch.zeros(1, sequence_length, embedding_dim),
+                                               requires_grad=True)
+            nn.init.trunc_normal_(self.positional_emb, std=0.2)
+        else:
+            self.positional_emb = nn.Parameter(sinusoidal_embedding(sequence_length, embedding_dim),
+                                               requires_grad=False)

        self.dropout = nn.Dropout(p=dropout_rate)
+
        dpr = [x.item() for x in torch.linspace(0, stochastic_depth_rate, num_layers)]
+
        self.blocks = nn.ModuleList([
            TransformerEncoderLayer(d_model=embedding_dim, nhead=num_heads,
                                    dim_feedforward=dim_feedforward, dropout=dropout_rate,
-                                    attention_dropout=attention_dropout, drop_path_rate=dpr[i])
-            for i in range(num_layers)])
+                                    attention_dropout=attention_dropout, drop_path_rate=layer_dpr)
+            for layer_dpr in dpr])
+
        self.norm = nn.LayerNorm(embedding_dim)

        self.fc = nn.Linear(embedding_dim, num_classes)
        self.apply(self.init_weight)

    def forward(self, x):
-        if self.positional_emb is None and x.size(1) < self.sequence_length:
+        b = x.shape[0]
+
+        if not exists(self.positional_emb) and x.size(1) < self.sequence_length:
            x = F.pad(x, (0, 0, 0, self.n_channels - x.size(1)), mode='constant', value=0)

        if not self.seq_pool:
-            cls_token = self.class_emb.expand(x.shape[0], -1, -1)
+            cls_token = repeat(self.class_emb, '1 1 d -> b 1 d', b = b)
            x = torch.cat((cls_token, x), dim=1)

-        if self.positional_emb is not None:
+        if exists(self.positional_emb):
            x += self.positional_emb

        x = self.dropout(x)

        for blk in self.blocks:
            x = blk(x)
+
        x = self.norm(x)

        if self.seq_pool:
-            x = torch.matmul(F.softmax(self.attention_pool(x), dim=1).transpose(-1, -2), x).squeeze(-2)
+            attn_weights = rearrange(self.attention_pool(x), 'b n 1 -> b n')
+            x = einsum('b n, b n d -> b d', attn_weights.softmax(dim = 1), x)
        else:
            x = x[:, 0]

-        x = self.fc(x)
-        return x
+        return self.fc(x)

    @staticmethod
    def init_weight(m):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=.02)
-            if isinstance(m, nn.Linear) and m.bias is not None:
+            if isinstance(m, nn.Linear) and exists(m.bias):
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

-    @staticmethod
-    def sinusoidal_embedding(n_channels, dim):
-        pe = torch.FloatTensor([[p / (10000 ** (2 * (i // 2) / dim)) for i in range(dim)]
-                                for p in range(n_channels)])
-        pe[:, 0::2] = torch.sin(pe[:, 0::2])
-        pe[:, 1::2] = torch.cos(pe[:, 1::2])
-        return pe.unsqueeze(0)
-
-
 # CCT Main model
+
 class CCT(nn.Module):
-    def __init__(self,
-                 img_size=224,
-                 embedding_dim=768,
-                 n_input_channels=3,
-                 n_conv_layers=1,
-                 kernel_size=7,
-                 stride=2,
-                 padding=3,
-                 pooling_kernel_size=3,
-                 pooling_stride=2,
-                 pooling_padding=1,
-                 *args, **kwargs):
-        super(CCT, self).__init__()
+    def __init__(
+        self,
+        img_size=224,
+        embedding_dim=768,
+        n_input_channels=3,
+        n_conv_layers=1,
+        kernel_size=7,
+        stride=2,
+        padding=3,
+        pooling_kernel_size=3,
+        pooling_stride=2,
+        pooling_padding=1,
+        *args, **kwargs
+    ):
+        super().__init__()
        img_height, img_width = pair(img_size)

        self.tokenizer = Tokenizer(n_input_channels=n_input_channels,
--- a/vit_pytorch/cct_3d.py
+++ b/vit_pytorch/cct_3d.py
@@ -0,0 +1,376 @@
+import torch
+from torch import nn, einsum
+import torch.nn.functional as F
+
+from einops import rearrange, repeat
+
+# helpers
+
+def exists(val):
+    return val is not None
+
+def default(val, d):
+    return val if exists(val) else d
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+# CCT Models
+
+__all__ = ['cct_2', 'cct_4', 'cct_6', 'cct_7', 'cct_8', 'cct_14', 'cct_16']
+
+
+def cct_2(*args, **kwargs):
+    return _cct(num_layers=2, num_heads=2, mlp_ratio=1, embedding_dim=128,
+                *args, **kwargs)
+
+
+def cct_4(*args, **kwargs):
+    return _cct(num_layers=4, num_heads=2, mlp_ratio=1, embedding_dim=128,
+                *args, **kwargs)
+
+
+def cct_6(*args, **kwargs):
+    return _cct(num_layers=6, num_heads=4, mlp_ratio=2, embedding_dim=256,
+                *args, **kwargs)
+
+
+def cct_7(*args, **kwargs):
+    return _cct(num_layers=7, num_heads=4, mlp_ratio=2, embedding_dim=256,
+                *args, **kwargs)
+
+
+def cct_8(*args, **kwargs):
+    return _cct(num_layers=8, num_heads=4, mlp_ratio=2, embedding_dim=256,
+                *args, **kwargs)
+
+
+def cct_14(*args, **kwargs):
+    return _cct(num_layers=14, num_heads=6, mlp_ratio=3, embedding_dim=384,
+                *args, **kwargs)
+
+
+def cct_16(*args, **kwargs):
+    return _cct(num_layers=16, num_heads=6, mlp_ratio=3, embedding_dim=384,
+                *args, **kwargs)
+
+
+def _cct(num_layers, num_heads, mlp_ratio, embedding_dim,
+         kernel_size=3, stride=None, padding=None,
+         *args, **kwargs):
+    stride = default(stride, max(1, (kernel_size // 2) - 1))
+    padding = default(padding, max(1, (kernel_size // 2)))
+
+    return CCT(num_layers=num_layers,
+               num_heads=num_heads,
+               mlp_ratio=mlp_ratio,
+               embedding_dim=embedding_dim,
+               kernel_size=kernel_size,
+               stride=stride,
+               padding=padding,
+               *args, **kwargs)
+
+# positional
+
+def sinusoidal_embedding(n_channels, dim):
+    pe = torch.FloatTensor([[p / (10000 ** (2 * (i // 2) / dim)) for i in range(dim)]
+                            for p in range(n_channels)])
+    pe[:, 0::2] = torch.sin(pe[:, 0::2])
+    pe[:, 1::2] = torch.cos(pe[:, 1::2])
+    return rearrange(pe, '... -> 1 ...')
+
+# modules
+
+class Attention(nn.Module):
+    def __init__(self, dim, num_heads=8, attention_dropout=0.1, projection_dropout=0.1):
+        super().__init__()
+        self.heads = num_heads
+        head_dim = dim // self.heads
+        self.scale = head_dim ** -0.5
+
+        self.qkv = nn.Linear(dim, dim * 3, bias=False)
+        self.attn_drop = nn.Dropout(attention_dropout)
+        self.proj = nn.Linear(dim, dim)
+        self.proj_drop = nn.Dropout(projection_dropout)
+
+    def forward(self, x):
+        B, N, C = x.shape
+
+        qkv = self.qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        q = q * self.scale
+
+        attn = einsum('b h i d, b h j d -> b h i j', q, k)
+        attn = attn.softmax(dim=-1)
+        attn = self.attn_drop(attn)
+
+        x = einsum('b h i j, b h j d -> b h i d', attn, v)
+        x = rearrange(x, 'b h n d -> b n (h d)')
+
+        return self.proj_drop(self.proj(x))
+
+
+class TransformerEncoderLayer(nn.Module):
+    """
+    Inspired by torch.nn.TransformerEncoderLayer and
+    rwightman's timm package.
+    """
+    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
+                 attention_dropout=0.1, drop_path_rate=0.1):
+        super().__init__()
+
+        self.pre_norm = nn.LayerNorm(d_model)
+        self.self_attn = Attention(dim=d_model, num_heads=nhead,
+                                   attention_dropout=attention_dropout, projection_dropout=dropout)
+
+        self.linear1  = nn.Linear(d_model, dim_feedforward)
+        self.dropout1 = nn.Dropout(dropout)
+        self.norm1    = nn.LayerNorm(d_model)
+        self.linear2  = nn.Linear(dim_feedforward, d_model)
+        self.dropout2 = nn.Dropout(dropout)
+
+        self.drop_path = DropPath(drop_path_rate)
+
+        self.activation = F.gelu
+
+    def forward(self, src, *args, **kwargs):
+        src = src + self.drop_path(self.self_attn(self.pre_norm(src)))
+        src = self.norm1(src)
+        src2 = self.linear2(self.dropout1(self.activation(self.linear1(src))))
+        src = src + self.drop_path(self.dropout2(src2))
+        return src
+
+class DropPath(nn.Module):
+    def __init__(self, drop_prob=None):
+        super().__init__()
+        self.drop_prob = float(drop_prob)
+
+    def forward(self, x):
+        batch, drop_prob, device, dtype = x.shape[0], self.drop_prob, x.device, x.dtype
+
+        if drop_prob <= 0. or not self.training:
+            return x
+
+        keep_prob = 1 - self.drop_prob
+        shape = (batch, *((1,) * (x.ndim - 1)))
+
+        keep_mask = torch.zeros(shape, device = device).float().uniform_(0, 1) < keep_prob
+        output = x.div(keep_prob) * keep_mask.float()
+        return output
+
+class Tokenizer(nn.Module):
+    def __init__(
+        self,
+        frame_kernel_size,
+        kernel_size,
+        stride,
+        padding,
+        frame_stride=1,
+        frame_pooling_stride=1,
+        frame_pooling_kernel_size=1,
+        pooling_kernel_size=3,
+        pooling_stride=2,
+        pooling_padding=1,
+        n_conv_layers=1,
+        n_input_channels=3,
+        n_output_channels=64,
+        in_planes=64,
+        activation=None,
+        max_pool=True,
+        conv_bias=False
+    ):
+        super().__init__()
+
+        n_filter_list = [n_input_channels] + \
+                        [in_planes for _ in range(n_conv_layers - 1)] + \
+                        [n_output_channels]
+
+        n_filter_list_pairs = zip(n_filter_list[:-1], n_filter_list[1:])
+
+        self.conv_layers = nn.Sequential(
+            *[nn.Sequential(
+                nn.Conv3d(chan_in, chan_out,
+                          kernel_size=(frame_kernel_size, kernel_size, kernel_size),
+                          stride=(frame_stride, stride, stride),
+                          padding=(frame_kernel_size // 2, padding, padding), bias=conv_bias),
+                nn.Identity() if not exists(activation) else activation(),
+                nn.MaxPool3d(kernel_size=(frame_pooling_kernel_size, pooling_kernel_size, pooling_kernel_size),
+                             stride=(frame_pooling_stride, pooling_stride, pooling_stride),
+                             padding=(frame_pooling_kernel_size // 2, pooling_padding, pooling_padding)) if max_pool else nn.Identity()
+            )
+                for chan_in, chan_out in n_filter_list_pairs
+            ])
+
+        self.apply(self.init_weight)
+
+    def sequence_length(self, n_channels=3, frames=8, height=224, width=224):
+        return self.forward(torch.zeros((1, n_channels, frames, height, width))).shape[1]
+
+    def forward(self, x):
+        x = self.conv_layers(x)
+        return rearrange(x, 'b c f h w -> b (f h w) c')
+
+    @staticmethod
+    def init_weight(m):
+        if isinstance(m, nn.Conv3d):
+            nn.init.kaiming_normal_(m.weight)
+
+
+class TransformerClassifier(nn.Module):
+    def __init__(
+        self,
+        seq_pool=True,
+        embedding_dim=768,
+        num_layers=12,
+        num_heads=12,
+        mlp_ratio=4.0,
+        num_classes=1000,
+        dropout_rate=0.1,
+        attention_dropout=0.1,
+        stochastic_depth_rate=0.1,
+        positional_embedding='sine',
+        sequence_length=None,
+        *args, **kwargs
+    ):
+        super().__init__()
+        assert positional_embedding in {'sine', 'learnable', 'none'}
+
+        dim_feedforward = int(embedding_dim * mlp_ratio)
+        self.embedding_dim = embedding_dim
+        self.sequence_length = sequence_length
+        self.seq_pool = seq_pool
+
+        assert exists(sequence_length) or positional_embedding == 'none', \
+            f"Positional embedding is set to {positional_embedding} and" \
+            f" the sequence length was not specified."
+
+        if not seq_pool:
+            sequence_length += 1
+            self.class_emb = nn.Parameter(torch.zeros(1, 1, self.embedding_dim))
+        else:
+            self.attention_pool = nn.Linear(self.embedding_dim, 1)
+
+        if positional_embedding == 'none':
+            self.positional_emb = None
+        elif positional_embedding == 'learnable':
+            self.positional_emb = nn.Parameter(torch.zeros(1, sequence_length, embedding_dim))
+            nn.init.trunc_normal_(self.positional_emb, std = 0.2)
+        else:
+            self.register_buffer('positional_emb', sinusoidal_embedding(sequence_length, embedding_dim))
+
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+        dpr = [x.item() for x in torch.linspace(0, stochastic_depth_rate, num_layers)]
+
+        self.blocks = nn.ModuleList([
+            TransformerEncoderLayer(d_model=embedding_dim, nhead=num_heads,
+                                    dim_feedforward=dim_feedforward, dropout=dropout_rate,
+                                    attention_dropout=attention_dropout, drop_path_rate=layer_dpr)
+            for layer_dpr in dpr])
+
+        self.norm = nn.LayerNorm(embedding_dim)
+
+        self.fc = nn.Linear(embedding_dim, num_classes)
+        self.apply(self.init_weight)
+
+    @staticmethod
+    def init_weight(m):
+        if isinstance(m, nn.Linear):
+            nn.init.trunc_normal_(m.weight, std=.02)
+            if isinstance(m, nn.Linear) and exists(m.bias):
+                nn.init.constant_(m.bias, 0)
+        elif isinstance(m, nn.LayerNorm):
+            nn.init.constant_(m.bias, 0)
+            nn.init.constant_(m.weight, 1.0)
+
+    def forward(self, x):
+        b = x.shape[0]
+
+        if not exists(self.positional_emb) and x.size(1) < self.sequence_length:
+            x = F.pad(x, (0, 0, 0, self.n_channels - x.size(1)), mode='constant', value=0)
+
+        if not self.seq_pool:
+            cls_token = repeat(self.class_emb, '1 1 d -> b 1 d', b = b)
+            x = torch.cat((cls_token, x), dim=1)
+
+        if exists(self.positional_emb):
+            x += self.positional_emb
+
+        x = self.dropout(x)
+
+        for blk in self.blocks:
+            x = blk(x)
+
+        x = self.norm(x)
+
+        if self.seq_pool:
+            attn_weights = rearrange(self.attention_pool(x), 'b n 1 -> b n')
+            x = einsum('b n, b n d -> b d', attn_weights.softmax(dim = 1), x)
+        else:
+            x = x[:, 0]
+
+        return self.fc(x)
+
+# CCT Main model
+
+class CCT(nn.Module):
+    def __init__(
+        self,
+        img_size=224,
+        num_frames=8,
+        embedding_dim=768,
+        n_input_channels=3,
+        n_conv_layers=1,
+        frame_stride=1,
+        frame_kernel_size=3,
+        frame_pooling_kernel_size=1,
+        frame_pooling_stride=1,
+        kernel_size=7,
+        stride=2,
+        padding=3,
+        pooling_kernel_size=3,
+        pooling_stride=2,
+        pooling_padding=1,
+        *args, **kwargs
+    ):
+        super().__init__()
+        img_height, img_width = pair(img_size)
+
+        self.tokenizer = Tokenizer(
+            n_input_channels=n_input_channels,
+            n_output_channels=embedding_dim,
+            frame_stride=frame_stride,
+            frame_kernel_size=frame_kernel_size,
+            frame_pooling_stride=frame_pooling_stride,
+            frame_pooling_kernel_size=frame_pooling_kernel_size,
+            kernel_size=kernel_size,
+            stride=stride,
+            padding=padding,
+            pooling_kernel_size=pooling_kernel_size,
+            pooling_stride=pooling_stride,
+            pooling_padding=pooling_padding,
+            max_pool=True,
+            activation=nn.ReLU,
+            n_conv_layers=n_conv_layers,
+            conv_bias=False
+        )
+
+        self.classifier = TransformerClassifier(
+            sequence_length=self.tokenizer.sequence_length(
+                n_channels=n_input_channels,
+                frames=num_frames,
+                height=img_height,
+                width=img_width
+            ),
+            embedding_dim=embedding_dim,
+            seq_pool=True,
+            dropout_rate=0.,
+            attention_dropout=0.1,
+            stochastic_depth=0.1,
+            *args, **kwargs
+        )
+
+    def forward(self, x):
+        x = self.tokenizer(x)
+        return self.classifier(x)
--- a/vit_pytorch/cross_vit.py
+++ b/vit_pytorch/cross_vit.py
@@ -186,7 +186,9 @@ class ImageEmbedder(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/deepvit.py
+++ b/vit_pytorch/deepvit.py
@@ -105,7 +105,9 @@ class DeepViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/efficient.py
+++ b/vit_pytorch/efficient.py
@@ -17,7 +17,9 @@ class ViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/extractor.py
+++ b/vit_pytorch/extractor.py
@@ -4,14 +4,27 @@ from torch import nn
 def exists(val):
    return val is not None

+def identity(t):
+    return t
+
+def clone_and_detach(t):
+    return t.clone().detach()
+
+def apply_tuple_or_single(fn, val):
+    if isinstance(val, tuple):
+        return tuple(map(fn, val))
+    return fn(val)
+
 class Extractor(nn.Module):
    def __init__(
        self,
        vit,
        device = None,
+        layer = None,
        layer_name = 'transformer',
        layer_save_input = False,
-        return_embeddings_only = False
+        return_embeddings_only = False,
+        detach = True
    ):
        super().__init__()
        self.vit = vit
@@ -23,17 +36,24 @@ class Extractor(nn.Module):
        self.ejected = False
        self.device = device

+        self.layer = layer
        self.layer_name = layer_name
        self.layer_save_input = layer_save_input # whether to save input or output of layer
        self.return_embeddings_only = return_embeddings_only

+        self.detach_fn = clone_and_detach if detach else identity
+
    def _hook(self, _, inputs, output):
-        tensor_to_save = inputs if self.layer_save_input else output
-        self.latents = tensor_to_save.clone().detach()
+        layer_output = inputs if self.layer_save_input else output
+        self.latents = apply_tuple_or_single(self.detach_fn, layer_output)

    def _register_hook(self):
-        assert hasattr(self.vit, self.layer_name), 'layer whose output to take as embedding not found in vision transformer'
-        layer = getattr(self.vit, self.layer_name)
+        if not exists(self.layer):
+            assert hasattr(self.vit, self.layer_name), 'layer whose output to take as embedding not found in vision transformer'
+            layer = getattr(self.vit, self.layer_name)
+        else:
+            layer = self.layer
+
        handle = layer.register_forward_hook(self._hook)
        self.hooks.append(handle)
        self.hook_registered = True
@@ -62,7 +82,7 @@ class Extractor(nn.Module):
        pred = self.vit(img)

        target_device = self.device if exists(self.device) else img.device
-        latents = self.latents.to(target_device)
+        latents = apply_tuple_or_single(lambda t: t.to(target_device), self.latents)

        if return_embeddings_only or self.return_embeddings_only:
            return latents
--- a/vit_pytorch/learnable_memory_vit.py
+++ b/vit_pytorch/learnable_memory_vit.py
@@ -118,7 +118,9 @@ class ViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/local_vit.py
+++ b/vit_pytorch/local_vit.py
@@ -126,7 +126,9 @@ class LocalViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/mae.py
+++ b/vit_pytorch/mae.py
@@ -24,11 +24,14 @@ class MAE(nn.Module):

        self.encoder = encoder
        num_patches, encoder_dim = encoder.pos_embedding.shape[-2:]
-        self.to_patch, self.patch_to_emb = encoder.to_patch_embedding[:2]
-        pixel_values_per_patch = self.patch_to_emb.weight.shape[-1]
+
+        self.to_patch = encoder.to_patch_embedding[0]
+        self.patch_to_emb = nn.Sequential(*encoder.to_patch_embedding[1:])
+
+        pixel_values_per_patch = encoder.to_patch_embedding[2].weight.shape[-1]

        # decoder parameters
-
+        self.decoder_dim = decoder_dim
        self.enc_to_dec = nn.Linear(encoder_dim, decoder_dim) if encoder_dim != decoder_dim else nn.Identity()
        self.mask_token = nn.Parameter(torch.randn(decoder_dim))
        self.decoder = Transformer(dim = decoder_dim, depth = decoder_depth, heads = decoder_heads, dim_head = decoder_dim_head, mlp_dim = decoder_dim * 4)
@@ -73,7 +76,7 @@ class MAE(nn.Module):

        # reapply decoder position embedding to unmasked tokens

-        decoder_tokens = decoder_tokens + self.decoder_pos_emb(unmasked_indices)
+        unmasked_decoder_tokens = decoder_tokens + self.decoder_pos_emb(unmasked_indices)

        # repeat mask tokens for number of masked, and add the positions using the masked indices derived above

@@ -81,13 +84,15 @@ class MAE(nn.Module):
        mask_tokens = mask_tokens + self.decoder_pos_emb(masked_indices)

        # concat the masked tokens to the decoder tokens and attend with decoder
-
-        decoder_tokens = torch.cat((mask_tokens, decoder_tokens), dim = 1)
+        
+        decoder_tokens = torch.zeros(batch, num_patches, self.decoder_dim, device=device)
+        decoder_tokens[batch_range, unmasked_indices] = unmasked_decoder_tokens
+        decoder_tokens[batch_range, masked_indices] = mask_tokens
        decoded_tokens = self.decoder(decoder_tokens)

        # splice out the mask tokens and project to pixel values

-        mask_tokens = decoded_tokens[:, :num_masked]
+        mask_tokens = decoded_tokens[batch_range, masked_indices]
        pred_pixel_values = self.to_pixels(mask_tokens)

        # calculate reconstruction loss
--- a/vit_pytorch/max_vit.py
+++ b/vit_pytorch/max_vit.py
@@ -100,12 +100,14 @@ def MBConv(
    stride = 2 if downsample else 1

    net = nn.Sequential(
-        nn.Conv2d(dim_in, dim_out, 1),
-        nn.BatchNorm2d(dim_out),
-        nn.SiLU(),
-        nn.Conv2d(dim_out, dim_out, 3, stride = stride, padding = 1, groups = dim_out),
-        SqueezeExcitation(dim_out, shrinkage_rate = shrinkage_rate),
-        nn.Conv2d(dim_out, dim_out, 1),
+        nn.Conv2d(dim_in, hidden_dim, 1),
+        nn.BatchNorm2d(hidden_dim),
+        nn.GELU(),
+        nn.Conv2d(hidden_dim, hidden_dim, 3, stride = stride, padding = 1, groups = hidden_dim),
+        nn.BatchNorm2d(hidden_dim),
+        nn.GELU(),
+        SqueezeExcitation(hidden_dim, shrinkage_rate = shrinkage_rate),
+        nn.Conv2d(hidden_dim, dim_out, 1),
        nn.BatchNorm2d(dim_out)
    )

--- a/vit_pytorch/mobile_vit.py
+++ b/vit_pytorch/mobile_vit.py
@@ -13,9 +13,9 @@ def conv_1x1_bn(inp, oup):
        nn.SiLU()
    )

-def conv_nxn_bn(inp, oup, kernal_size=3, stride=1):
+def conv_nxn_bn(inp, oup, kernel_size=3, stride=1):
    return nn.Sequential(
-        nn.Conv2d(inp, oup, kernal_size, stride, 1, bias=False),
+        nn.Conv2d(inp, oup, kernel_size, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        nn.SiLU()
    )
--- a/vit_pytorch/nest.py
+++ b/vit_pytorch/nest.py
@@ -131,7 +131,7 @@ class NesT(nn.Module):
        fmap_size = image_size // patch_size
        blocks = 2 ** (num_hierarchies - 1)

-        seq_len = (fmap_size // blocks) ** 2   # sequence length is held constant across heirarchy
+        seq_len = (fmap_size // blocks) ** 2   # sequence length is held constant across hierarchy
        hierarchies = list(reversed(range(num_hierarchies)))
        mults = [2 ** i for i in reversed(hierarchies)]

@@ -144,7 +144,9 @@ class NesT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (p1 p2 c) h w', p1 = patch_size, p2 = patch_size),
+            LayerNorm(patch_dim),
            nn.Conv2d(patch_dim, layer_dims[0], 1),
+            LayerNorm(layer_dims[0])
        )

        block_repeats = cast_tuple(block_repeats, num_hierarchies)
--- a/vit_pytorch/simmim.py
+++ b/vit_pytorch/simmim.py
@@ -18,8 +18,11 @@ class SimMIM(nn.Module):

        self.encoder = encoder
        num_patches, encoder_dim = encoder.pos_embedding.shape[-2:]
-        self.to_patch, self.patch_to_emb = encoder.to_patch_embedding[:2]
-        pixel_values_per_patch = self.patch_to_emb.weight.shape[-1]
+
+        self.to_patch = encoder.to_patch_embedding[0]
+        self.patch_to_emb = nn.Sequential(*encoder.to_patch_embedding[1:])
+
+        pixel_values_per_patch = encoder.to_patch_embedding[2].weight.shape[-1]

        # simple linear head

--- a/vit_pytorch/simple_vit.py
+++ b/vit_pytorch/simple_vit.py
@@ -0,0 +1,118 @@
+import torch
+from torch import nn
+
+from einops import rearrange
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+def posemb_sincos_2d(patches, temperature = 10000, dtype = torch.float32):
+    _, h, w, dim, device, dtype = *patches.shape, patches.device, patches.dtype
+
+    y, x = torch.meshgrid(torch.arange(h, device = device), torch.arange(w, device = device), indexing = 'ij')
+    assert (dim % 4) == 0, 'feature dimension must be multiple of 4 for sincos emb'
+    omega = torch.arange(dim // 4, device = device) / (dim // 4 - 1)
+    omega = 1. / (temperature ** omega)
+
+    y = y.flatten()[:, None] * omega[None, :]
+    x = x.flatten()[:, None] * omega[None, :] 
+    pe = torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim = 1)
+    return pe.type(dtype)
+
+# classes
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Linear(hidden_dim, dim),
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        self.norm = nn.LayerNorm(dim)
+
+        self.attend = nn.Softmax(dim = -1)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+        self.to_out = nn.Linear(inner_dim, dim, bias = False)
+
+    def forward(self, x):
+        x = self.norm(x)
+
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                Attention(dim, heads = heads, dim_head = dim_head),
+                FeedForward(dim, mlp_dim)
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class SimpleViT(nn.Module):
+    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64):
+        super().__init__()
+        image_height, image_width = pair(image_size)
+        patch_height, patch_width = pair(patch_size)
+
+        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
+
+        num_patches = (image_height // patch_height) * (image_width // patch_width)
+        patch_dim = channels * patch_height * patch_width
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (h p1) (w p2) -> b h w (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
+        )
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim)
+
+        self.to_latent = nn.Identity()
+        self.linear_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, img):
+        *_, h, w, dtype = *img.shape, img.dtype
+
+        x = self.to_patch_embedding(img)
+        pe = posemb_sincos_2d(x)
+        x = rearrange(x, 'b ... d -> b (...) d') + pe
+
+        x = self.transformer(x)
+        x = x.mean(dim = 1)
+
+        x = self.to_latent(x)
+        return self.linear_head(x)
--- a/vit_pytorch/simple_vit_1d.py
+++ b/vit_pytorch/simple_vit_1d.py
@@ -0,0 +1,127 @@
+import torch
+from torch import nn
+
+from einops import rearrange
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def posemb_sincos_1d(patches, temperature = 10000, dtype = torch.float32):
+    _, n, dim, device, dtype = *patches.shape, patches.device, patches.dtype
+
+    n = torch.arange(n, device = device)
+    assert (dim % 2) == 0, 'feature dimension must be multiple of 2 for sincos emb'
+    omega = torch.arange(dim // 2, device = device) / (dim // 2 - 1)
+    omega = 1. / (temperature ** omega)
+
+    n = n.flatten()[:, None] * omega[None, :]
+    pe = torch.cat((n.sin(), n.cos()), dim = 1)
+    return pe.type(dtype)
+
+# classes
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Linear(hidden_dim, dim),
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        self.norm = nn.LayerNorm(dim)
+
+        self.attend = nn.Softmax(dim = -1)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+        self.to_out = nn.Linear(inner_dim, dim, bias = False)
+
+    def forward(self, x):
+        x = self.norm(x)
+
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                Attention(dim, heads = heads, dim_head = dim_head),
+                FeedForward(dim, mlp_dim)
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class SimpleViT(nn.Module):
+    def __init__(self, *, seq_len, patch_size, num_classes, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64):
+        super().__init__()
+
+        assert seq_len % patch_size == 0
+
+        num_patches = seq_len // patch_size
+        patch_dim = channels * patch_size
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (n p) -> b n (p c)', p = patch_size),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
+        )
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim)
+
+        self.to_latent = nn.Identity()
+        self.linear_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, series):
+        *_, n, dtype = *series.shape, series.dtype
+
+        x = self.to_patch_embedding(series)
+        pe = posemb_sincos_1d(x)
+        x = rearrange(x, 'b ... d -> b (...) d') + pe
+
+        x = self.transformer(x)
+        x = x.mean(dim = 1)
+
+        x = self.to_latent(x)
+        return self.linear_head(x)
+
+if __name__ == '__main__':
+
+    v = SimpleViT(
+        seq_len = 256,
+        patch_size = 16,
+        num_classes = 1000,
+        dim = 1024,
+        depth = 6,
+        heads = 8,
+        mlp_dim = 2048
+    )
+
+    time_series = torch.randn(4, 3, 256)
+    logits = v(time_series) # (4, 1000)
--- a/vit_pytorch/simple_vit_3d.py
+++ b/vit_pytorch/simple_vit_3d.py
@@ -0,0 +1,130 @@
+import torch
+import torch.nn.functional as F
+from torch import nn
+
+from einops import rearrange
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+def posemb_sincos_3d(patches, temperature = 10000, dtype = torch.float32):
+    _, f, h, w, dim, device, dtype = *patches.shape, patches.device, patches.dtype
+
+    z, y, x = torch.meshgrid(
+        torch.arange(f, device = device),
+        torch.arange(h, device = device),
+        torch.arange(w, device = device),
+    indexing = 'ij')
+
+    fourier_dim = dim // 6
+
+    omega = torch.arange(fourier_dim, device = device) / (fourier_dim - 1)
+    omega = 1. / (temperature ** omega)
+
+    z = z.flatten()[:, None] * omega[None, :]
+    y = y.flatten()[:, None] * omega[None, :]
+    x = x.flatten()[:, None] * omega[None, :] 
+
+    pe = torch.cat((x.sin(), x.cos(), y.sin(), y.cos(), z.sin(), z.cos()), dim = 1)
+
+    pe = F.pad(pe, (0, dim - (fourier_dim * 6))) # pad if feature dimension not cleanly divisible by 6
+    return pe.type(dtype)
+
+# classes
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Linear(hidden_dim, dim),
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        self.norm = nn.LayerNorm(dim)
+
+        self.attend = nn.Softmax(dim = -1)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+        self.to_out = nn.Linear(inner_dim, dim, bias = False)
+
+    def forward(self, x):
+        x = self.norm(x)
+
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                Attention(dim, heads = heads, dim_head = dim_head),
+                FeedForward(dim, mlp_dim)
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class SimpleViT(nn.Module):
+    def __init__(self, *, image_size, image_patch_size, frames, frame_patch_size, num_classes, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64):
+        super().__init__()
+        image_height, image_width = pair(image_size)
+        patch_height, patch_width = pair(image_patch_size)
+
+        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
+        assert frames % frame_patch_size == 0, 'Frames must be divisible by the frame patch size'
+
+        num_patches = (image_height // patch_height) * (image_width // patch_width) * (frames // frame_patch_size)
+        patch_dim = channels * patch_height * patch_width * frame_patch_size
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (f pf) (h p1) (w p2) -> b f h w (p1 p2 pf c)', p1 = patch_height, p2 = patch_width, pf = frame_patch_size),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
+        )
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim)
+
+        self.to_latent = nn.Identity()
+        self.linear_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, video):
+        *_, h, w, dtype = *video.shape, video.dtype
+
+        x = self.to_patch_embedding(video)
+        pe = posemb_sincos_3d(x)
+        x = rearrange(x, 'b ... d -> b (...) d') + pe
+
+        x = self.transformer(x)
+        x = x.mean(dim = 1)
+
+        x = self.to_latent(x)
+        return self.linear_head(x)
--- a/vit_pytorch/simple_vit_with_patch_dropout.py
+++ b/vit_pytorch/simple_vit_with_patch_dropout.py
@@ -0,0 +1,143 @@
+import torch
+from torch import nn
+
+from einops import rearrange
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+def posemb_sincos_2d(patches, temperature = 10000, dtype = torch.float32):
+    _, h, w, dim, device, dtype = *patches.shape, patches.device, patches.dtype
+
+    y, x = torch.meshgrid(torch.arange(h, device = device), torch.arange(w, device = device), indexing = 'ij')
+    assert (dim % 4) == 0, 'feature dimension must be multiple of 4 for sincos emb'
+    omega = torch.arange(dim // 4, device = device) / (dim // 4 - 1)
+    omega = 1. / (temperature ** omega)
+
+    y = y.flatten()[:, None] * omega[None, :]
+    x = x.flatten()[:, None] * omega[None, :] 
+    pe = torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim = 1)
+    return pe.type(dtype)
+
+# patch dropout
+
+class PatchDropout(nn.Module):
+    def __init__(self, prob):
+        super().__init__()
+        assert 0 <= prob < 1.
+        self.prob = prob
+
+    def forward(self, x):
+        if not self.training or self.prob == 0.:
+            return x
+
+        b, n, _, device = *x.shape, x.device
+
+        batch_indices = torch.arange(b, device = device)
+        batch_indices = rearrange(batch_indices, '... -> ... 1')
+        num_patches_keep = max(1, int(n * (1 - self.prob)))
+        patch_indices_keep = torch.randn(b, n, device = device).topk(num_patches_keep, dim = -1).indices
+
+        return x[batch_indices, patch_indices_keep]
+
+# classes
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Linear(hidden_dim, dim),
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        self.norm = nn.LayerNorm(dim)
+
+        self.attend = nn.Softmax(dim = -1)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+        self.to_out = nn.Linear(inner_dim, dim, bias = False)
+
+    def forward(self, x):
+        x = self.norm(x)
+
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                Attention(dim, heads = heads, dim_head = dim_head),
+                FeedForward(dim, mlp_dim)
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class SimpleViT(nn.Module):
+    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, patch_dropout = 0.5):
+        super().__init__()
+        image_height, image_width = pair(image_size)
+        patch_height, patch_width = pair(patch_size)
+
+        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
+
+        num_patches = (image_height // patch_height) * (image_width // patch_width)
+        patch_dim = channels * patch_height * patch_width
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (h p1) (w p2) -> b h w (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
+        )
+
+        self.patch_dropout = PatchDropout(patch_dropout)
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim)
+
+        self.to_latent = nn.Identity()
+        self.linear_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, img):
+        *_, h, w, dtype = *img.shape, img.dtype
+
+        x = self.to_patch_embedding(img)
+        pe = posemb_sincos_2d(x)
+        x = rearrange(x, 'b ... d -> b (...) d') + pe
+
+        x = self.patch_dropout(x)
+
+        x = self.transformer(x)
+        x = x.mean(dim = 1)
+
+        x = self.to_latent(x)
+        return self.linear_head(x)
--- a/vit_pytorch/twins_svt.py
+++ b/vit_pytorch/twins_svt.py
@@ -71,7 +71,12 @@ class PatchEmbedding(nn.Module):
        self.dim = dim
        self.dim_out = dim_out
        self.patch_size = patch_size
-        self.proj = nn.Conv2d(patch_size ** 2 * dim, dim_out, 1)
+
+        self.proj = nn.Sequential(
+            LayerNorm(patch_size ** 2 * dim),
+            nn.Conv2d(patch_size ** 2 * dim, dim_out, 1),
+            LayerNorm(dim_out)
+        )

    def forward(self, fmap):
        p = self.patch_size
--- a/vit_pytorch/vit.py
+++ b/vit_pytorch/vit.py
@@ -93,7 +93,9 @@ class ViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
@@ -114,7 +116,7 @@ class ViT(nn.Module):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

-        cls_tokens = repeat(self.cls_token, '1 n d -> b n d', b = b)
+        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)
--- a/vit_pytorch/vit_1d.py
+++ b/vit_pytorch/vit_1d.py
@@ -0,0 +1,135 @@
+import torch
+from torch import nn
+
+from einops import rearrange, repeat, pack, unpack
+from einops.layers.torch import Rearrange
+
+# classes
+
+class PreNorm(nn.Module):
+    def __init__(self, dim, fn):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.fn = fn
+    def forward(self, x, **kwargs):
+        return self.fn(self.norm(x), **kwargs)
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim, dropout = 0.):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(hidden_dim, dim),
+            nn.Dropout(dropout)
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        project_out = not (heads == 1 and dim_head == dim)
+
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+
+        self.attend = nn.Softmax(dim = -1)
+        self.dropout = nn.Dropout(dropout)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+
+        self.to_out = nn.Sequential(
+            nn.Linear(inner_dim, dim),
+            nn.Dropout(dropout)
+        ) if project_out else nn.Identity()
+
+    def forward(self, x):
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+        attn = self.dropout(attn)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
+                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class ViT(nn.Module):
+    def __init__(self, *, seq_len, patch_size, num_classes, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
+        super().__init__()
+        assert (seq_len % patch_size) == 0
+
+        num_patches = seq_len // patch_size
+        patch_dim = channels * patch_size
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (n p) -> b n (p c)', p = patch_size),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
+        )
+
+        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
+        self.cls_token = nn.Parameter(torch.randn(dim))
+        self.dropout = nn.Dropout(emb_dropout)
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
+
+        self.mlp_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, series):
+        x = self.to_patch_embedding(series)
+        b, n, _ = x.shape
+
+        cls_tokens = repeat(self.cls_token, 'd -> b d', b = b)
+
+        x, ps = pack([cls_tokens, x], 'b * d')
+
+        x += self.pos_embedding[:, :(n + 1)]
+        x = self.dropout(x)
+
+        x = self.transformer(x)
+
+        cls_tokens, _ = unpack(x, ps, 'b * d')
+
+        return self.mlp_head(cls_tokens)
+
+if __name__ == '__main__':
+
+    v = ViT(
+        seq_len = 256,
+        patch_size = 16,
+        num_classes = 1000,
+        dim = 1024,
+        depth = 6,
+        heads = 8,
+        mlp_dim = 2048,
+        dropout = 0.1,
+        emb_dropout = 0.1
+    )
+
+    time_series = torch.randn(4, 3, 256)
+    logits = v(time_series) # (4, 1000)
--- a/vit_pytorch/vit_3d.py
+++ b/vit_pytorch/vit_3d.py
@@ -0,0 +1,131 @@
+import torch
+from torch import nn
+
+from einops import rearrange, repeat
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+# classes
+
+class PreNorm(nn.Module):
+    def __init__(self, dim, fn):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.fn = fn
+    def forward(self, x, **kwargs):
+        return self.fn(self.norm(x), **kwargs)
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim, dropout = 0.):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(hidden_dim, dim),
+            nn.Dropout(dropout)
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        project_out = not (heads == 1 and dim_head == dim)
+
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+
+        self.attend = nn.Softmax(dim = -1)
+        self.dropout = nn.Dropout(dropout)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+
+        self.to_out = nn.Sequential(
+            nn.Linear(inner_dim, dim),
+            nn.Dropout(dropout)
+        ) if project_out else nn.Identity()
+
+    def forward(self, x):
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+        attn = self.dropout(attn)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
+                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class ViT(nn.Module):
+    def __init__(self, *, image_size, image_patch_size, frames, frame_patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
+        super().__init__()
+        image_height, image_width = pair(image_size)
+        patch_height, patch_width = pair(image_patch_size)
+
+        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
+        assert frames % frame_patch_size == 0, 'Frames must be divisible by frame patch size'
+
+        num_patches = (image_height // patch_height) * (image_width // patch_width) * (frames // frame_patch_size)
+        patch_dim = channels * patch_height * patch_width * frame_patch_size
+
+        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (f pf) (h p1) (w p2) -> b (f h w) (p1 p2 pf c)', p1 = patch_height, p2 = patch_width, pf = frame_patch_size),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim),
+        )
+
+        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
+        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
+        self.dropout = nn.Dropout(emb_dropout)
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
+
+        self.pool = pool
+        self.to_latent = nn.Identity()
+
+        self.mlp_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, video):
+        x = self.to_patch_embedding(video)
+        b, n, _ = x.shape
+
+        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
+        x = torch.cat((cls_tokens, x), dim=1)
+        x += self.pos_embedding[:, :(n + 1)]
+        x = self.dropout(x)
+
+        x = self.transformer(x)
+
+        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
+
+        x = self.to_latent(x)
+        return self.mlp_head(x)
--- a/vit_pytorch/vit_with_patch_dropout.py
+++ b/vit_pytorch/vit_with_patch_dropout.py
@@ -0,0 +1,152 @@
+import torch
+from torch import nn
+
+from einops import rearrange, repeat
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+# classes
+
+class PatchDropout(nn.Module):
+    def __init__(self, prob):
+        super().__init__()
+        assert 0 <= prob < 1.
+        self.prob = prob
+
+    def forward(self, x):
+        if not self.training or self.prob == 0.:
+            return x
+
+        b, n, _, device = *x.shape, x.device
+
+        batch_indices = torch.arange(b, device = device)
+        batch_indices = rearrange(batch_indices, '... -> ... 1')
+        num_patches_keep = max(1, int(n * (1 - self.prob)))
+        patch_indices_keep = torch.randn(b, n, device = device).topk(num_patches_keep, dim = -1).indices
+
+        return x[batch_indices, patch_indices_keep]
+
+class PreNorm(nn.Module):
+    def __init__(self, dim, fn):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.fn = fn
+    def forward(self, x, **kwargs):
+        return self.fn(self.norm(x), **kwargs)
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim, dropout = 0.):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(hidden_dim, dim),
+            nn.Dropout(dropout)
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        project_out = not (heads == 1 and dim_head == dim)
+
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+
+        self.attend = nn.Softmax(dim = -1)
+        self.dropout = nn.Dropout(dropout)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+
+        self.to_out = nn.Sequential(
+            nn.Linear(inner_dim, dim),
+            nn.Dropout(dropout)
+        ) if project_out else nn.Identity()
+
+    def forward(self, x):
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+        attn = self.dropout(attn)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
+                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class ViT(nn.Module):
+    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0., patch_dropout = 0.25):
+        super().__init__()
+        image_height, image_width = pair(image_size)
+        patch_height, patch_width = pair(patch_size)
+
+        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
+
+        num_patches = (image_height // patch_height) * (image_width // patch_width)
+        patch_dim = channels * patch_height * patch_width
+        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.Linear(patch_dim, dim),
+        )
+
+        self.pos_embedding = nn.Parameter(torch.randn(num_patches, dim))
+        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
+
+        self.patch_dropout = PatchDropout(patch_dropout)
+        self.dropout = nn.Dropout(emb_dropout)
+
+        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
+
+        self.pool = pool
+        self.to_latent = nn.Identity()
+
+        self.mlp_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, img):
+        x = self.to_patch_embedding(img)
+        b, n, _ = x.shape
+
+        x += self.pos_embedding
+
+        x = self.patch_dropout(x)
+
+        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
+
+        x = torch.cat((cls_tokens, x), dim=1)
+        x = self.dropout(x)
+
+        x = self.transformer(x)
+
+        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
+
+        x = self.to_latent(x)
+        return self.mlp_head(x)
--- a/vit_pytorch/vit_with_patch_merger.py
+++ b/vit_pytorch/vit_with_patch_merger.py
@@ -121,7 +121,9 @@ class ViT(nn.Module):

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
+            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
--- a/vit_pytorch/vivit.py
+++ b/vit_pytorch/vivit.py
@@ -0,0 +1,185 @@
+import torch
+from torch import nn
+
+from einops import rearrange, repeat, reduce
+from einops.layers.torch import Rearrange
+
+# helpers
+
+def exists(val):
+    return val is not None
+
+def pair(t):
+    return t if isinstance(t, tuple) else (t, t)
+
+# classes
+
+class PreNorm(nn.Module):
+    def __init__(self, dim, fn):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.fn = fn
+    def forward(self, x, **kwargs):
+        return self.fn(self.norm(x), **kwargs)
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, hidden_dim, dropout = 0.):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(hidden_dim, dim),
+            nn.Dropout(dropout)
+        )
+    def forward(self, x):
+        return self.net(x)
+
+class Attention(nn.Module):
+    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
+        super().__init__()
+        inner_dim = dim_head *  heads
+        project_out = not (heads == 1 and dim_head == dim)
+
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+
+        self.attend = nn.Softmax(dim = -1)
+        self.dropout = nn.Dropout(dropout)
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+
+        self.to_out = nn.Sequential(
+            nn.Linear(inner_dim, dim),
+            nn.Dropout(dropout)
+        ) if project_out else nn.Identity()
+
+    def forward(self, x):
+        qkv = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
+
+        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
+
+        attn = self.attend(dots)
+        attn = self.dropout(attn)
+
+        out = torch.matmul(attn, v)
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+class Transformer(nn.Module):
+    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(depth):
+            self.layers.append(nn.ModuleList([
+                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
+                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
+            ]))
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+        return x
+
+class ViT(nn.Module):
+    def __init__(
+        self,
+        *,
+        image_size,
+        image_patch_size,
+        frames,
+        frame_patch_size,
+        num_classes,
+        dim,
+        spatial_depth,
+        temporal_depth,
+        heads,
+        mlp_dim,
+        pool = 'cls',
+        channels = 3,
+        dim_head = 64,
+        dropout = 0.,
+        emb_dropout = 0.
+    ):
+        super().__init__()
+        image_height, image_width = pair(image_size)
+        patch_height, patch_width = pair(image_patch_size)
+
+        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
+        assert frames % frame_patch_size == 0, 'Frames must be divisible by frame patch size'
+
+        num_image_patches = (image_height // patch_height) * (image_width // patch_width)
+        num_frame_patches = (frames // frame_patch_size)
+
+        patch_dim = channels * patch_height * patch_width * frame_patch_size
+
+        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
+
+        self.global_average_pool = pool == 'mean'
+
+        self.to_patch_embedding = nn.Sequential(
+            Rearrange('b c (f pf) (h p1) (w p2) -> b f (h w) (p1 p2 pf c)', p1 = patch_height, p2 = patch_width, pf = frame_patch_size),
+            nn.LayerNorm(patch_dim),
+            nn.Linear(patch_dim, dim),
+            nn.LayerNorm(dim)
+        )
+
+        self.pos_embedding = nn.Parameter(torch.randn(1, num_frame_patches, num_image_patches, dim))
+        self.dropout = nn.Dropout(emb_dropout)
+
+        self.spatial_cls_token = nn.Parameter(torch.randn(1, 1, dim)) if not self.global_average_pool else None
+        self.temporal_cls_token = nn.Parameter(torch.randn(1, 1, dim)) if not self.global_average_pool else None
+
+        self.spatial_transformer = Transformer(dim, spatial_depth, heads, dim_head, mlp_dim, dropout)
+        self.temporal_transformer = Transformer(dim, temporal_depth, heads, dim_head, mlp_dim, dropout)
+
+        self.pool = pool
+        self.to_latent = nn.Identity()
+
+        self.mlp_head = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, num_classes)
+        )
+
+    def forward(self, video):
+        x = self.to_patch_embedding(video)
+        b, f, n, _ = x.shape
+
+        x = x + self.pos_embedding
+
+        if exists(self.spatial_cls_token):
+            spatial_cls_tokens = repeat(self.spatial_cls_token, '1 1 d -> b f 1 d', b = b, f = f)
+            x = torch.cat((spatial_cls_tokens, x), dim = 2)
+
+        x = self.dropout(x)
+
+        x = rearrange(x, 'b f n d -> (b f) n d')
+
+        # attend across space
+
+        x = self.spatial_transformer(x)
+
+        x = rearrange(x, '(b f) n d -> b f n d', b = b)
+
+        # excise out the spatial cls tokens or average pool for temporal attention
+
+        x = x[:, :, 0] if not self.global_average_pool else reduce(x, 'b f n d -> b f d', 'mean')
+
+        # append temporal CLS tokens
+
+        if exists(self.temporal_cls_token):
+            temporal_cls_tokens = repeat(self.temporal_cls_token, '1 1 d-> b 1 d', b = b)
+
+            x = torch.cat((temporal_cls_tokens, x), dim = 1)
+
+        # attend across time
+
+        x = self.temporal_transformer(x)
+
+        # excise out temporal cls token or average pool
+
+        x = x[:, 0] if not self.global_average_pool else reduce(x, 'b f d -> b d', 'mean')
+
+        x = self.to_latent(x)
+        return self.mlp_head(x)
Author	SHA1	Message	Date
Phil Wang	5699ed7d13	double down on dual patch norm, fix MAE and Simmim to be compatible with dual patchnorm	2023-02-10 10:39:50 -08:00
Phil Wang	46dcaf23d8	seeing a signal with dual patchnorm in another repository, fully incorporate	2023-02-06 09:45:12 -08:00
Phil Wang	bdaf2d1491	adopt dual patchnorm paper for as many vit as applicable, release 1.0.0	2023-02-03 08:11:29 -08:00
Phil Wang	500e23105a	need simple vit with patch dropout for another project	2022-12-05 10:47:36 -08:00
Phil Wang	89e1996c8b	add vit with patch dropout, fully embrace structured dropout as multiple papers are now corroborating each other	2022-12-02 11:28:11 -08:00
Phil Wang	2f87c0cf8f	offer 1d versions, in light of https://arxiv.org/abs/2211.14730	2022-12-01 10:31:05 -08:00
Phil Wang	59c8948c6a	try to fix tests	2022-10-29 11:44:17 -07:00
Phil Wang	cb6d749821	add a 3d version of cct, addressing https://github.com/lucidrains/vit-pytorch/issues/238 0.38.1	2022-10-29 11:35:06 -07:00
Phil Wang	6ec8fdaa6d	make sure global average pool can be used for vivit in place of cls token	2022-10-24 19:59:48 -07:00
Phil Wang	13fabf901e	add vivit	2022-10-24 09:34:04 -07:00
Ryan Russell	c0eb4c0150	Improving Readability (#220 ) Signed-off-by: Ryan Russell <git@ryanrussell.org> Signed-off-by: Ryan Russell <git@ryanrussell.org>	2022-10-17 10:42:45 -07:00
Phil Wang	5f1a6a05e9	release updated mae where one can more easily visualize reconstructions, thanks to @Vishu26	2022-10-17 10:41:46 -07:00
Srikumar Sastry	9a95e7904e	Update mae.py (#242 ) update mae so decoded tokens can be easily reshaped back to visualize the reconstruction	2022-10-17 10:41:10 -07:00
Phil Wang	b4853d39c2	add the 3d simple vit	2022-10-16 20:45:30 -07:00
Phil Wang	29fbf0aff4	begin extending some of the architectures over to 3d, starting with basic ViT	2022-10-16 15:31:59 -07:00
Phil Wang	4b8f5bc900	add link to Flax translation by @conceptofmind	2022-07-27 08:58:18 -07:00
Phil Wang	f86e052c05	offer way for extractor to return latents without detaching them	2022-07-16 16:22:40 -07:00
Phil Wang	2fa2b62def	slightly more clear of einops rearrange for cls token, for https://github.com/lucidrains/vit-pytorch/issues/224	2022-06-30 08:11:17 -07:00
Phil Wang	9f87d1c43b	follow @arquolo feedback and advice for MaxViT	2022-06-29 08:53:09 -07:00
Phil Wang	2c6dd7010a	fix hidden dimension in MaxViT thanks to @arquolo	2022-06-24 23:28:35 -07:00
Phil Wang	6460119f65	be able to accept a reference to a layer within the model for forward hooking and extracting the embedding output, for regionvit to work with extractor	2022-06-19 08:22:18 -07:00
Phil Wang	4e62e5f05e	make extractor flexible for layers that output multiple tensors, show CrossViT example	2022-06-19 08:11:41 -07:00
Phil Wang	b3e90a2652	add simple vit, from https://arxiv.org/abs/2205.01580	2022-05-03 20:24:14 -07:00