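Note
You are reading the documentation for MMClassification 0.x. MMClassification 0.x will be switched to a secondary branch at the end of 2022. We recommend upgrading to MMClassification 1.0 for more new features and functionality. Please refer to the MMClassification 1.0 installation tutorial, migration tutorial, and changelog.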
mmcls.models.VisionTransformer
- class mmcls.models.VisionTransformer(arch='base', img_size=224, patch_size=16, in_channels=3, out_indices=-1, drop_rate=0.0, drop_path_rate=0.0, qkv_bias=True, norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=True, with_cls_token=True, output_cls_token=True, interpolate_mode='bicubic', patch_cfg={}, layer_cfgs={}, init_cfg=None)
Vision Transformer.
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
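As a rough illustration of how the signature above is typically filled in, the following sketch writes the backbone as a config-style dict. It assumes the usual MMClassification convention of selecting a class through a `type` key; the variable name `backbone` and the non-default `drop_path_rate` value are chosen here only for illustration and do not come from an official config.

```python
# Config-style description of the backbone (a sketch, not an official config).
# Every key except `type` mirrors a constructor argument from the signature.
backbone = dict(
    type='VisionTransformer',    # assumed registry name of the class
    arch='base',
    img_size=224,
    patch_size=16,
    in_channels=3,
    out_indices=-1,              # only the last stage is returned
    drop_path_rate=0.1,          # illustrative value; the default is 0.0
    final_norm=True,
    with_cls_token=True,
    output_cls_token=True,       # requires with_cls_token=True
    interpolate_mode='bicubic',
)
```

Arguments left out of the dict fall back to the defaults listed in the signature.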
- Parameters
arch (str | dict) – Vision Transformer architecture. If a string, choose from ‘small’, ‘base’, ‘large’, ‘deit-tiny’, ‘deit-small’ and ‘deit-base’. If a dict, it should have the following keys:
embed_dims (int): The dimensions of embedding.
num_layers (int): The number of transformer encoder layers.
num_heads (int): The number of heads in attention modules.
feedforward_channels (int): The hidden dimensions in feedforward modules.
Defaults to ‘base’.
img_size (int | tuple) – The expected input image shape. Because we support dynamic input shape, just set the argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 16.
in_channels (int) – The number of input channels. Defaults to 3.
out_indices (Sequence | int) – Indices of the stages whose outputs are returned. Defaults to -1, which means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
with_cls_token (bool) – Whether to concatenate the class token with the image tokens as transformer input. Defaults to True.
output_cls_token (bool) – Whether to output the cls_token. If set to True, with_cls_token must be True. Defaults to True.
interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
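The parameters above interact in a few simple ways that can be sketched in plain Python. The helper names below (`check_arch`, `num_tokens`, `resolve_out_indices`) are hypothetical and are not part of mmcls. The sketch assumes non-overlapping patches (stride equal to `patch_size`) and assumes negative `out_indices` count from the last layer, as in Python indexing; the custom `arch` values are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Keys a custom `arch` dict is documented to contain.
REQUIRED_ARCH_KEYS = ('embed_dims', 'num_layers', 'num_heads',
                      'feedforward_channels')


def to_2tuple(value: Union[int, Sequence[int]]) -> Tuple[int, ...]:
    """Expand an int into (value, value); pass sequences through as tuples."""
    return (value, value) if isinstance(value, int) else tuple(value)


def check_arch(arch: Dict[str, int]) -> None:
    """Raise KeyError if a custom arch dict lacks a documented key."""
    missing = [key for key in REQUIRED_ARCH_KEYS if key not in arch]
    if missing:
        raise KeyError(f'Custom arch is missing keys: {missing}')


def num_tokens(img_size, patch_size, with_cls_token: bool = True) -> int:
    """Token sequence length, assuming non-overlapping patches."""
    height, width = to_2tuple(img_size)
    patch_h, patch_w = to_2tuple(patch_size)
    patches = (height // patch_h) * (width // patch_w)
    return patches + 1 if with_cls_token else patches


def resolve_out_indices(out_indices, num_layers: int) -> List[int]:
    """Map possibly negative stage indices to absolute layer indices."""
    if isinstance(out_indices, int):
        out_indices = [out_indices]
    resolved = []
    for index in out_indices:
        if index < 0:
            index += num_layers  # assumption: -1 refers to the last layer
        if not 0 <= index < num_layers:
            raise IndexError(f'Invalid out_indices entry for {num_layers} layers')
        resolved.append(index)
    return resolved


# Illustrative custom architecture (values are not taken from mmcls).
custom_arch = dict(embed_dims=384, num_layers=6, num_heads=6,
                   feedforward_channels=1536)
check_arch(custom_arch)
print(num_tokens(224, 16))                                 # 14 * 14 + 1 = 197
print(resolve_out_indices(-1, custom_arch['num_layers']))  # [5]
```

With the default `img_size=224` and `patch_size=16`, this gives 196 patch tokens plus one class token when `with_cls_token=True`.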