MixMIMPretrainTransformer¶
- class mmpretrain.models.selfsup.MixMIMPretrainTransformer(arch='base', mlp_ratio=4, img_size=224, patch_size=4, in_channels=3, window_size=[14, 14, 14, 7], qkv_bias=True, patch_cfg={}, norm_cfg={'type': 'LN'}, drop_rate=0.0, drop_path_rate=0.0, attn_drop_rate=0.0, use_checkpoint=False, mask_ratio=0.5, range_mask_ratio=0.0, init_cfg=None)[source]¶
MixMIM backbone for MixMIM pre-training.
A PyTorch implementation of `MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning <https://arxiv.org/abs/2205.13137>`_.
- Parameters:
arch (str | dict) – MixMIM architecture. If a string, choose from ‘base’, ‘large’ and ‘huge’. If a dict, it should have the following keys:
embed_dims (int): The dimensions of embedding.
depths (int): The number of transformer encoder layers.
num_heads (int): The number of heads in attention modules.
Defaults to ‘base’.
mlp_ratio (int) – The mlp ratio in FFN. Defaults to 4.
img_size (int | tuple) – The expected input image shape. Because we support dynamic input shapes, just set the argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 4.
in_channels (int) – The num of input channels. Defaults to 3.
window_size (list) – The height and width of the window in each stage. Defaults to [14, 14, 14, 7].
qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to an empty dict.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
attn_drop_rate (float) – Attention drop rate. Defaults to 0.
use_checkpoint (bool) – Whether use the checkpoint to reduce GPU memory cost. Defaults to False.
mask_ratio (float) – The base ratio of total number of patches to be masked. Defaults to 0.5.
range_mask_ratio (float) – The range of mask ratio. Defaults to 0.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
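The parameters above are normally set through an mmpretrain-style config. The fragment below is a hypothetical sketch of how this backbone might be referenced in such a config; the field names follow the signature above, but the surrounding config structure is illustrative rather than taken from a shipped config file.

```python
# Hypothetical config fragment (field names follow the class signature above;
# the values shown are the documented defaults).
backbone = dict(
    type='MixMIMPretrainTransformer',
    arch='base',
    patch_size=4,
    window_size=[14, 14, 14, 7],
    mask_ratio=0.5,
    range_mask_ratio=0.0,
)
```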
- forward(x, mask=True)[source]¶
Generate features for masked images.
This function generates a mask, randomly masks some patches, and computes the hidden features for the visible patches.
- Parameters:
x (torch.Tensor) – Input images of shape B x C x H x W.
mask (bool, optional) – Whether to apply masking in the forward pass. Defaults to True.
- Returns:
x (torch.Tensor): Hidden features of shape B x L x C.
mask_s4 (torch.Tensor): The mask tensor for the last layer.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
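The mixing idea behind MixMIM's forward pass can be sketched in NumPy. This is illustrative only, not the module's code: per the paper, the masked token positions of one image are filled with the tokens of a second image (here emulated by flipping the batch), so every position stays visible to the encoder.

```python
import numpy as np

# Illustrative sketch of MixMIM-style token mixing (not mmpretrain code).
# x: a batch of token sequences of shape B x L x C; mask: 1 = take the token
# from the batch-flipped image, 0 = keep the original token.
B, L, C = 2, 4, 3
rng = np.random.default_rng(0)
x = rng.random((B, L, C))
mask = np.array([0.0, 1.0, 0.0, 1.0]).reshape(1, L, 1)  # per-token binary mask
x_mix = x * (1 - mask) + x[::-1] * mask  # mix each image with its batch partner
```

After mixing, `x_mix[0]` keeps image 0's tokens at unmasked positions and image 1's tokens at masked ones, so no position is left empty.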
- random_masking(x, mask_ratio=0.5)[source]¶
Generate the mask for MixMIM Pretraining.
- Parameters:
x (torch.Tensor) – Image tokens with data augmentation applied, of shape B x L x C.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.5.
- Returns:
mask_s1 (torch.Tensor): mask with stride of self.encoder_stride // 8.
mask_s2 (torch.Tensor): mask with stride of self.encoder_stride // 4.
mask_s3 (torch.Tensor): mask with stride of self.encoder_stride // 2.
mask (torch.Tensor): mask with stride of self.encoder_stride.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
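A minimal NumPy sketch of the random-masking idea follows; it is not the mmpretrain implementation. It samples an effective ratio in [mask_ratio, mask_ratio + range_mask_ratio], ranks patches by random noise, masks the top fraction, and then emulates the finer-stride masks by nearest-neighbour upsampling of the coarse mask (the exact stride arithmetic in `random_masking` is an assumption here).

```python
import numpy as np

def random_masking_sketch(h, w, mask_ratio=0.5, range_mask_ratio=0.0, seed=0):
    # Illustrative sketch only (NOT mmpretrain's random_masking): sample the
    # effective mask ratio, rank the h*w patches by random noise, and mark the
    # top fraction as masked (1 = masked, 0 = visible).
    rng = np.random.default_rng(seed)
    ratio = mask_ratio + rng.uniform(0.0, range_mask_ratio)
    num_mask = int(h * w * ratio)
    noise = rng.random(h * w)
    ids = np.argsort(noise)                      # ascending noise order
    mask = np.zeros(h * w)
    mask[ids[h * w - num_mask:]] = 1.0           # mask the noisiest patches
    mask = mask.reshape(h, w)
    # Coarser strides correspond to finer grids: each halving of the stride
    # doubles the spatial resolution, emulated here by nearest-neighbour repeat.
    mask_s3 = np.repeat(np.repeat(mask, 2, axis=0), 2, axis=1)
    mask_s2 = np.repeat(np.repeat(mask_s3, 2, axis=0), 2, axis=1)
    mask_s1 = np.repeat(np.repeat(mask_s2, 2, axis=0), 2, axis=1)
    return mask_s1, mask_s2, mask_s3, mask

masks = random_masking_sketch(7, 7, mask_ratio=0.5)
print([m.shape for m in masks])  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```

Because every finer mask is a pure upsampling of the coarsest one, all four masks describe the same spatial masking pattern at different strides, which is what lets the hierarchical stages share one masking decision.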