diff --git a/README.md b/README.md
index 367184d..e21dc0c 100644
--- a/README.md
+++ b/README.md
@@ -493,6 +493,14 @@ img = torch.randn(1, 3, 224, 224)
pred = model(img) # (1, 1000)
```
+## CrossFormer (wip)
+
+<img src="./images/crossformer.png" width="400px"></img>
+
+This paper beats PVT and Swin using alternating local and global attention. The global attention is done across the windowing dimension for reduced complexity, much like the scheme used for axial attention.
+
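+Below is a minimal sketch (not this repository's implementation) of how the two alternating attention types could regroup tokens, assuming a `(batch, height, width, dim)` feature map and a window size of 8. Local attention attends within each window, while global attention gathers tokens at the same position across windows, analogous to axial attention.
+
+```python
+import torch
+from einops import rearrange
+
+# minimal sketch of the token regrouping behind the two attention types -
+# assumes a feature map of shape (batch, height, width, dim) and window size 8
+
+x = torch.randn(1, 64, 64, 96)
+w = 8
+
+# local attention: each contiguous 8 x 8 window attends within itself
+local_tokens = rearrange(x, 'b (h p1) (w p2) d -> (b h w) (p1 p2) d', p1 = w, p2 = w)
+
+# global attention: tokens at the same position within each window are grouped,
+# so attention runs across the windowing dimension (axial-attention style)
+global_tokens = rearrange(x, 'b (p1 h) (p2 w) d -> (b h w) (p1 p2) d', p1 = w, p2 = w)
+
+# each group would then be passed through standard multi-head self attention
+print(local_tokens.shape, global_tokens.shape) # both (64, 64, 96)
+```
+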
+They also propose a cross-scale embedding layer, which they show to be a generic layer that can improve all vision transformers. Dynamic relative positional bias was also formulated to allow the net to generalize to images of greater resolution.
+
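+Since the module is still marked as a work in progress, the following is only a hypothetical usage sketch assuming it will follow this repository's conventions; the `vit_pytorch.crossformer` import path and all constructor arguments are assumptions.
+
+```python
+import torch
+from vit_pytorch.crossformer import CrossFormer # hypothetical import path (module is wip)
+
+model = CrossFormer(
+    num_classes = 1000,                # number of classification classes
+    dim = (64, 128, 256, 512),         # assumed feature dimension per stage
+    depth = (2, 2, 8, 2),              # assumed number of transformer blocks per stage
+    global_window_size = (8, 4, 2, 1), # assumed global (long distance) window size per stage
+    local_window_size = 7              # assumed local (short distance) window size, constant across stages
+)
+
+img = torch.randn(1, 3, 224, 224)
+
+pred = model(img) # (1, 1000)
+```
+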
## NesT
@@ -1045,6 +1053,17 @@ Coming from computer vision and new to transformers? Here are some resources tha
}
```
+```bibtex
+@misc{wang2021crossformer,
+ title = {CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention},
+ author = {Wenxiao Wang and Lu Yao and Long Chen and Binbin Lin and Deng Cai and Xiaofei He and Wei Liu},
+ year = {2021},
+ eprint = {2108.00154},
+ archivePrefix = {arXiv},
+ primaryClass = {cs.CV}
+}
+```
+
```bibtex
@misc{caron2021emerging,
title = {Emerging Properties in Self-Supervised Vision Transformers},
diff --git a/images/crossformer.png b/images/crossformer.png
new file mode 100644
index 0000000..b0d2120
Binary files /dev/null and b/images/crossformer.png differ