MultiheadAttention
- class mmpretrain.models.utils.MultiheadAttention(embed_dims, num_heads, input_dims=None, attn_drop=0.0, proj_drop=0.0, dropout_layer={'drop_prob': 0.0, 'type': 'Dropout'}, qkv_bias=True, qk_scale=None, proj_bias=True, v_shortcut=False, use_layer_scale=False, layer_scale_init_value=0.0, init_cfg=None)
Multi-head Attention Module.
This module implements multi-head attention that supports different input dims and embed dims. It also supports a shortcut from `value`, which is useful when the input dims are not the same as the embed dims. A usage sketch follows the parameter list below.
- Parameters:
embed_dims (int) – The embedding dimension.
num_heads (int) – Parallel attention heads.
input_dims (int, optional) – The input dimension. If None, use `embed_dims`. Defaults to None.
attn_drop (float) – Dropout rate of the dropout layer after the attention calculation of query and key. Defaults to 0.
proj_drop (float) – Dropout rate of the dropout layer after the output projection. Defaults to 0.
dropout_layer (dict) – The dropout config before adding the shortcut. Defaults to `dict(type='Dropout', drop_prob=0.)`.
qkv_bias (bool) – If True, add a learnable bias to q, k and v. Defaults to True.
qk_scale (float, optional) – Override the default qk scale of `head_dim ** -0.5` if set. Defaults to None.
proj_bias (bool) – Whether to add a learnable bias to the output projection. Defaults to True.
v_shortcut (bool) – Add a shortcut from value to output. It's usually used when `input_dims` is different from `embed_dims`. Defaults to False.
use_layer_scale (bool) – Whether to use layer scale. Defaults to False.
layer_scale_init_value (float or torch.Tensor) – Init value of layer scale. Defaults to 0.
init_cfg (dict, optional) – The Config for initialization. Defaults to None.
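A minimal usage sketch, assuming the module consumes a token sequence shaped (batch, num_tokens, input_dims) and returns (batch, num_tokens, embed_dims); the dimensions and values below are illustrative choices, not part of the documented API:

```python
import torch
from mmpretrain.models.utils import MultiheadAttention

# Attention layer whose input feature size differs from its embedding size.
# 128 -> 256 is an arbitrary illustrative choice.
attn = MultiheadAttention(
    embed_dims=256,
    num_heads=8,
    input_dims=128,
)

x = torch.rand(2, 197, 128)  # (batch, num_tokens, input_dims), assumed layout
out = attn(x)
print(out.shape)             # expected: torch.Size([2, 197, 256])
```

Setting `v_shortcut=True` additionally adds the value tensor to the output, which, per the description above, is mainly useful when `input_dims` and `embed_dims` differ.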