## Vision Transformer - PyTorch

Implementation of the Vision Transformer in PyTorch: a simple way to achieve SOTA in vision classification with only a single transformer encoder. There's really not much to code here, but may as well lay it all out for everyone so we expedite the attention revolution.

## Install

```bash
$ pip install vit-pytorch
```

## Usage

```python
import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,    # input images are 256 x 256
    patch_size = 32,     # each patch is 32 x 32 pixels
    num_classes = 1000,  # number of output classes
    dim = 1024,          # token embedding dimension
    depth = 6,           # number of transformer blocks
    heads = 8,           # number of attention heads
    mlp_dim = 2048       # hidden dimension of the feedforward layers
)

img = torch.randn(1, 3, 256, 256)  # (batch, channels, height, width)
preds = v(img) # (1, 1000)
```

## Citations

```bibtex
@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}
```
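
## Patch arithmetic

For picking `image_size` and `patch_size` in the usage example, it can help to see how many patches the encoder will attend over. The sketch below is plain Python and does not call `vit_pytorch`. The helper `patch_stats` is hypothetical, and it assumes square images, 3 input channels, and non-overlapping patches that tile the image exactly, as described in the cited paper. The library's internals may differ.

```python
def patch_stats(image_size, patch_size, channels=3):
    """Return (num_patches, patch_dim) for a square image cut into square patches.

    Hypothetical helper for illustration only; not part of vit_pytorch.
    """
    # Assumption: patches must tile the image exactly, with no overlap.
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2        # tokens fed to the encoder
    patch_dim = channels * patch_size ** 2     # length of one flattened patch
    return num_patches, patch_dim


# Values from the usage example: 256 x 256 image, 32 x 32 patches.
num_patches, patch_dim = patch_stats(256, 32)
print(num_patches, patch_dim)  # 64 3072
```

With the settings above, each image becomes 64 patches, each flattened to 3072 values before being projected to `dim`. Since self-attention cost grows with the square of the token count, halving `patch_size` quadruples the number of patches.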