Compare commits


3 Commits

Author SHA1 Message Date
lucidrains
6032a54b48 patch 2026-02-11 11:49:51 -08:00
Harikrishna KP
06a1f42924 Fix ViViT Transformer not passing use_flash_attn to Attention and duplicate mask reshape (#360)
Two related bugs in vivit.py:

1. Transformer.__init__ accepted use_flash_attn but never forwarded it to the
   Attention modules it creates. Since Attention defaults to use_flash_attn=True,
   setting use_flash_attn=False on ViViT had no effect on the factorized_encoder
   variant's spatial and temporal transformers.

2. Attention.forward reshaped the mask from 2D to 4D before the flash/non-flash
   branch (line 82), then attempted to reshape it again inside the non-flash
   branch (line 92). When the non-flash code path is actually reached with a
   mask, einops raises an error because the mask is already 4D.

   Bug #1 masked bug #2: because use_flash_attn=False never reached the
   Attention modules, the non-flash path was never taken even when requested,
   so the duplicate reshape never had a chance to trigger.

Fix: pass use_flash_attn through to Attention in Transformer.__init__, and
remove the redundant second mask rearrange in the non-flash branch.
2026-02-11 11:49:31 -08:00
Phil Wang
6ae6a3ab64 cleanup 2026-02-04 13:29:40 -08:00
3 changed files with 3 additions and 4 deletions

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "vit-pytorch"
version = "1.17.7"
version = "1.17.8"
description = "Vision Transformer (ViT) - Pytorch"
readme = { file = "README.md", content-type = "text/markdown" }
license = { file = "LICENSE" }

View File

@@ -189,7 +189,7 @@ class ViT(Module):
x = self.transformer(x)
-if self.mlp_head is None:
+if not exists(self.mlp_head):
return x
x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
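For context on this hunk: exists is the small None-check helper used throughout vit-pytorch (it also appears as exists(mask) in the next file). A sketch of its conventional definition, assuming this file follows the same pattern:

    def exists(val):
        return val is not None

    # so `if not exists(self.mlp_head)` behaves the same as the old
    # `if self.mlp_head is None`, just written in the codebase's usual style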

View File

@@ -89,7 +89,6 @@ class Attention(Module):
dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
if exists(mask):
-mask = rearrange(mask, 'b j -> b 1 1 j')
dots = dots.masked_fill(~mask, -torch.finfo(dots.dtype).max)
attn = self.attend(dots)
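The line removed above is the duplicate reshape described as bug #2 in the commit message: the mask has already been lifted to 4D earlier in Attention.forward, so repeating the same einops pattern fails. A minimal standalone sketch of that failure, using only torch and einops (not the repo's code):

    import torch
    from einops import rearrange

    mask = torch.ones(2, 16, dtype = torch.bool)   # (batch, seq) boolean mask
    mask = rearrange(mask, 'b j -> b 1 1 j')       # first reshape -> (2, 1, 1, 16)

    try:
        rearrange(mask, 'b j -> b 1 1 j')          # same 2D pattern applied to a 4D tensor
    except Exception as e:
        print(type(e).__name__, e)                 # einops rejects the already-reshaped mask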
@@ -109,7 +108,7 @@ class Transformer(Module):
self.layers = ModuleList([])
for _ in range(depth):
self.layers.append(nn.ModuleList([
-Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
+Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout, use_flash_attn = use_flash_attn),
FeedForward(dim, mlp_dim, dropout = dropout)
]))
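With use_flash_attn now forwarded to Attention, disabling flash attention on the factorized_encoder variant actually takes effect and the non-flash mask path above becomes reachable. A hedged usage sketch: the constructor arguments follow the repository's README example, the values are illustrative, and use_flash_attn is taken from the commit message rather than verified against the current signature:

    import torch
    from vit_pytorch.vivit import ViT

    v = ViT(
        image_size = 128,                # frame height / width
        frames = 16,                     # number of frames
        image_patch_size = 16,
        frame_patch_size = 2,
        num_classes = 1000,
        dim = 512,
        spatial_depth = 6,               # depth of the spatial transformer
        temporal_depth = 6,              # depth of the temporal transformer
        heads = 8,
        mlp_dim = 2048,
        variant = 'factorized_encoder',
        use_flash_attn = False           # now reaches the non-flash attention branch
    )

    video = torch.randn(1, 3, 16, 128, 128)   # (batch, channels, frames, height, width)
    preds = v(video)                           # (1, 1000)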