intent to build (#210)

complete SepViT, from bytedance AI labs
Phil Wang
2022-03-31 14:30:23 -07:00
committed by GitHub
parent 8c54e01492
commit d65a742efe
4 changed files with 337 additions and 3 deletions


@@ -19,6 +19,7 @@
- [CrossFormer](#crossformer)
- [RegionViT](#regionvit)
- [ScalableViT](#scalablevit)
- [SepViT](#sepvit)
- [NesT](#nest)
- [MobileViT](#mobilevit)
- [Masked Autoencoder](#masked-autoencoder)
@@ -559,13 +560,42 @@ model = ScalableViT(
reduction_factor = (8, 4, 2, 1), # downsampling of the key / values in SSA. in the paper, this was represented as (reduction_factor ** -2)
window_size = (64, 32, None, None), # window size of the IWSA at each stage. None means no windowing needed
dropout = 0.1, # attention and feedforward dropout
)
img = torch.randn(1, 3, 256, 256)
preds = model(img) # (1, 1000)
```
## SepViT
<img src="./images/sep-vit.png" width="400px"></img>
Another <a href="https://arxiv.org/abs/2203.15380">Bytedance AI paper</a>, it proposes a depthwise-pointwise self-attention layer that seems largely inspired by MobileNet's depthwise-separable convolution. The most interesting aspect is the reuse of the feature map from the depthwise self-attention stage as the values for the pointwise self-attention, as shown in the diagram above.
I have decided to include only the version of `SepViT` with this specific self-attention layer, as the grouped attention layers are neither remarkable nor novel, and the authors were not clear on how they treated the window tokens for the group self-attention layer. Besides, it seems that with the `DSSA` layer alone, they were able to beat Swin.
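For intuition, below is a rough, single-head sketch of that depthwise-pointwise attention. The `DSSASketch` module and its shapes are invented for illustration and are not the implementation in `vit_pytorch/sep_vit.py`: self-attention first runs within each window (with a learned window token appended), then the pointwise attention computed over the window tokens mixes the depthwise feature maps as its values.
```python
import torch
from torch import nn
from einops import rearrange, repeat

class DSSASketch(nn.Module):
    def __init__(self, dim, window_size):
        super().__init__()
        self.window_size = window_size
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        self.window_token = nn.Parameter(torch.randn(dim))        # hypothetical learned per-window token
        self.to_pointwise_qk = nn.Linear(dim, dim * 2, bias = False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, height, width, dim), height and width divisible by window_size
        b, h, w, d, ws = *x.shape, self.window_size

        # partition into non-overlapping windows
        x = rearrange(x, 'b (nh ws1) (nw ws2) d -> (b nh nw) (ws1 ws2) d', ws1 = ws, ws2 = ws)

        # prepend a learned window token to every window
        win_tok = repeat(self.window_token, 'd -> bw 1 d', bw = x.shape[0])
        x = torch.cat((win_tok, x), dim = 1)

        # depthwise stage: plain self-attention within each window
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim = -1)
        x = attn @ v

        win_tok, feats = x[:, 0], x[:, 1:]                          # window summaries / window feature maps

        # pointwise stage: window tokens attend to one another ...
        win_tok = rearrange(win_tok, '(b nw) d -> b nw d', b = b)
        pq, pk = self.to_pointwise_qk(win_tok).chunk(2, dim = -1)
        pw_attn = (pq @ pk.transpose(-1, -2) * self.scale).softmax(dim = -1)

        # ... but the values are the feature maps produced by the depthwise stage
        feats = rearrange(feats, '(b nw) n d -> b nw n d', b = b)
        out = torch.einsum('b i j, b j n d -> b i n d', pw_attn, feats)

        # merge windows back into a spatial feature map
        out = rearrange(out, 'b (nh nw) (ws1 ws2) d -> b (nh ws1) (nw ws2) d',
                        nh = h // ws, ws1 = ws, ws2 = ws)
        return self.to_out(out)

dssa = DSSASketch(dim = 32, window_size = 7)
x = torch.randn(1, 56, 56, 32)   # (batch, height, width, dim)
out = dssa(x)                    # (1, 56, 56, 32)
```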
ex. SepViT-Lite
```python
import torch
from vit_pytorch.sep_vit import SepViT
v = SepViT(
num_classes = 1000,
dim = 32, # dimensions of first stage, which doubles every stage (32, 64, 128, 256) for SepViT-Lite
dim_head = 32, # attention head dimension
heads = (1, 2, 4, 8), # number of heads per stage
depth = (1, 2, 6, 2), # number of transformer blocks per stage
window_size = 7, # window size of DSS Attention block
dropout = 0.1 # dropout
)
img = torch.randn(1, 3, 224, 224)
preds = v(img) # (1, 1000)
```
## NesT
<img src="./images/nest.png" width="400px"></img>
@@ -1506,6 +1536,14 @@ Coming from computer vision and new to transformers? Here are some resources tha
}
```
```bibtex
@inproceedings{Li2022SepViTSV,
title = {SepViT: Separable Vision Transformer},
author = {Wei Li and Xing Wang and Xin Xia and Jie Wu and Xuefeng Xiao and Minghang Zheng and Shiping Wen},
year = {2022}
}
```
```bibtex
@misc{vaswani2017attention,
title = {Attention Is All You Need},