ViTEVA02¶
- class mmpretrain.models.backbones.ViTEVA02(arch='tiny', sub_ln=False, drop_rate=0.0, attn_drop_rate=0.0, proj_drop_rate=0.0, drop_path_rate=0.0, qkv_bias=True, norm_cfg={'type': 'LN'}, with_cls_token=True, layer_cfgs={}, **kwargs)[source]¶
EVA02 Vision Transformer.
A PyTorch implementation of EVA-02: A Visual Representation for Neon Genesis.
- Parameters:
arch (str | dict) – Vision Transformer architecture. If a string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If a dict, it should have the following keys:
embed_dims (int): The dimensions of embedding.
num_layers (int): The number of transformer encoder layers.
num_heads (int): The number of heads in attention modules.
mlp_ratio (float): The expansion ratio of the MLP module.
Defaults to ‘tiny’.
sub_ln (bool) – Whether to add sub layer normalization in the SwiGLU module. Defaults to False.
drop_rate (float) – Probability of an element to be zeroed in the mlp module. Defaults to 0.
attn_drop_rate (float) – Probability of an element to be zeroed after the softmax in the attention. Defaults to 0.
proj_drop_rate (float) – Probability of an element to be zeroed after projection in the attention. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
with_cls_token (bool) – Whether to concatenate the class token with the image tokens as transformer input. Defaults to True.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
**kwargs (dict, optional) – Other args for Vision Transformer.
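As an illustration, a backbone entry in an mmpretrain-style model config might look like the sketch below. The head settings, the `in_channels` value, and the choice of enabling `sub_ln` are assumptions for demonstration, not defaults stated in this docstring.

```python
# Sketch of a classifier config using ViTEVA02 as the backbone.
# All concrete values here are illustrative assumptions.
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='tiny',              # or 'small' / 'base' / 'large', or an arch dict
        sub_ln=True,              # add sub layer normalization in SwiGLU
        drop_path_rate=0.1,       # stochastic depth rate
        norm_cfg=dict(type='LN'),
        with_cls_token=True,
    ),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=192,          # assumed to match the backbone's embed_dims
    ),
)
```

Passing an `arch` dict instead of a string lets you override `embed_dims`, `num_layers`, `num_heads` and `mlp_ratio` directly, at the cost of keeping `in_channels` of the head consistent yourself.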