MultiheadAttention
- class mmpretrain.models.utils.MultiheadAttention(embed_dims, num_heads, input_dims=None, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.0, 'type': 'Dropout'}, qkv_bias=True, qk_scale=None, proj_bias=True, v_shortcut=False, use_layer_scale=False, layer_scale_init_value=0.0, init_cfg=None)
Multi-head Attention Module.
This module implements multi-head attention that supports different input dims and embed dims. It also supports a shortcut from `value`, which is useful when the input dims are not the same as the embed dims. A usage sketch follows the parameter list below.
- Parameters:
embed_dims (int) – The embedding dimension.
num_heads (int) – Parallel attention heads.
input_dims (int, optional) – The input dimension. If None, use `embed_dims`. Defaults to None.
attn_drop (float) – Dropout rate of the dropout layer after the attention calculation of query and key. Defaults to 0.
proj_drop (float) – Dropout rate of the dropout layer after the output projection. Defaults to 0.
dropout_layer (dict) – The dropout config before adding the shortcut. Defaults to `dict(type='Dropout', drop_prob=0.)`.
qkv_bias (bool) – If True, add a learnable bias to q, k and v. Defaults to True.
qk_scale (float, optional) – Override the default qk scale of `head_dim ** -0.5` if set. Defaults to None.
proj_bias (bool) – Whether to add a learnable bias to the output projection. Defaults to True.
v_shortcut (bool) – Add a shortcut from value to output. It's usually used when `input_dims` is different from `embed_dims`. Defaults to False.
use_layer_scale (bool) – Whether to use layer scale. Defaults to False.
layer_scale_init_value (float or torch.Tensor) – Init value of layer scale. Defaults to 0.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
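A minimal usage sketch, assuming the module consumes a token sequence shaped (batch, num_tokens, input_dims) and returns (batch, num_tokens, embed_dims); the dimensions and values below are illustrative choices, not part of the documented API:

```python
import torch
from mmpretrain.models.utils import MultiheadAttention

# Attention layer whose input feature size differs from its embedding size.
# 128 -> 256 is an arbitrary illustrative choice.
attn = MultiheadAttention(
    embed_dims=256,
    num_heads=8,
    input_dims=128,
)

x = torch.rand(2, 197, 128)  # (batch, num_tokens, input_dims), assumed layout
out = attn(x)
print(out.shape)             # expected: torch.Size([2, 197, 256])
```

Setting `v_shortcut=True` additionally adds the value tensor to the output, which, per the description above, is mainly useful when `input_dims` and `embed_dims` differ.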