intent to build (#210)

complete SepViT, from bytedance AI labs
Phil Wang
2022-03-31 14:30:23 -07:00
committed by GitHub
parent 8c54e01492
commit d65a742efe
4 changed files with 337 additions and 3 deletions


@@ -19,6 +19,7 @@
- [CrossFormer](#crossformer)
- [RegionViT](#regionvit)
- [ScalableViT](#scalablevit)
- [SepViT](#sepvit)
- [NesT](#nest)
- [MobileViT](#mobilevit)
- [Masked Autoencoder](#masked-autoencoder)
@@ -559,13 +560,42 @@ model = ScalableViT(
reduction_factor = (8, 4, 2, 1), # downsampling of the key / values in SSA. in the paper, this was represented as (reduction_factor ** -2)
window_size = (64, 32, None, None), # window size of the IWSA at each stage. None means no windowing needed
dropout = 0.1, # attention and feedforward dropout
)
img = torch.randn(1, 3, 256, 256)
preds = model(img) # (1, 1000)
```
## SepViT
<img src="./images/sep-vit.png" width="400px"></img>
Another <a href="https://arxiv.org/abs/2203.15380">Bytedance AI paper</a>, it proposes a depthwise-pointwise self-attention layer that seems largely inspired by MobileNet's depthwise-separable convolution. The most interesting aspect is the reuse of the feature map from the depthwise self-attention stage as the values for the pointwise self-attention, as shown in the diagram above.
I have decided to include only the version of `SepViT` with this specific self-attention layer, as the grouped attention layers are neither remarkable nor novel, and the authors were not clear on how they treated the window tokens for the group self-attention layer. Besides, it seems that with the `DSSA` layer alone, they were able to beat Swin.
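For intuition, below is a rough, single-head sketch of that depthwise-pointwise attention. The `DSSASketch` module and its shapes are invented for illustration and are not the implementation in `vit_pytorch/sep_vit.py`: self-attention first runs within each window (with a learned window token appended), then the pointwise attention computed over the window tokens mixes the depthwise feature maps as its values.
```python
import torch
from torch import nn
from einops import rearrange, repeat

class DSSASketch(nn.Module):
    def __init__(self, dim, window_size):
        super().__init__()
        self.window_size = window_size
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        self.window_token = nn.Parameter(torch.randn(dim))        # hypothetical learned per-window token
        self.to_pointwise_qk = nn.Linear(dim, dim * 2, bias = False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, height, width, dim), height and width divisible by window_size
        b, h, w, d, ws = *x.shape, self.window_size

        # partition into non-overlapping windows
        x = rearrange(x, 'b (nh ws1) (nw ws2) d -> (b nh nw) (ws1 ws2) d', ws1 = ws, ws2 = ws)

        # prepend a learned window token to every window
        win_tok = repeat(self.window_token, 'd -> bw 1 d', bw = x.shape[0])
        x = torch.cat((win_tok, x), dim = 1)

        # depthwise stage: plain self-attention within each window
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim = -1)
        x = attn @ v

        win_tok, feats = x[:, 0], x[:, 1:]                          # window summaries / window feature maps

        # pointwise stage: window tokens attend to one another ...
        win_tok = rearrange(win_tok, '(b nw) d -> b nw d', b = b)
        pq, pk = self.to_pointwise_qk(win_tok).chunk(2, dim = -1)
        pw_attn = (pq @ pk.transpose(-1, -2) * self.scale).softmax(dim = -1)

        # ... but the values are the feature maps produced by the depthwise stage
        feats = rearrange(feats, '(b nw) n d -> b nw n d', b = b)
        out = torch.einsum('b i j, b j n d -> b i n d', pw_attn, feats)

        # merge windows back into a spatial feature map
        out = rearrange(out, 'b (nh nw) (ws1 ws2) d -> b (nh ws1) (nw ws2) d',
                        nh = h // ws, ws1 = ws, ws2 = ws)
        return self.to_out(out)

dssa = DSSASketch(dim = 32, window_size = 7)
x = torch.randn(1, 56, 56, 32)   # (batch, height, width, dim)
out = dssa(x)                    # (1, 56, 56, 32)
```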
ex. SepViT-Lite
```python
import torch
from vit_pytorch.sep_vit import SepViT
v = SepViT(
num_classes = 1000,
dim = 32, # dimensions of first stage, which doubles every stage (32, 64, 128, 256) for SepViT-Lite
dim_head = 32, # attention head dimension
heads = (1, 2, 4, 8), # number of heads per stage
depth = (1, 2, 6, 2), # number of transformer blocks per stage
window_size = 7, # window size of DSS Attention block
dropout = 0.1 # dropout
)
img = torch.randn(1, 3, 224, 224)
preds = v(img) # (1, 1000)
```
## NesT
<img src="./images/nest.png" width="400px"></img>
@@ -1506,6 +1536,14 @@ Coming from computer vision and new to transformers? Here are some resources tha
}
```
```bibtex
@inproceedings{Li2022SepViTSV,
title = {SepViT: Separable Vision Transformer},
author = {Wei Li and Xing Wang and Xin Xia and Jie Wu and Xuefeng Xiao and Minghang Zheng and Shiping Wen},
year = {2022}
}
```
```bibtex
@misc{vaswani2017attention,
title = {Attention Is All You Need},