Note

You are reading the documentation for MMClassification 0.x, which will soon be deprecated at the end of 2022. We recommend you upgrade to MMClassification 1.0 to enjoy fruitful new features and better performance brought by OpenMMLab 2.0. Check the installation tutorial, migration tutorial and changelog for more details.

Getting Started¶

This page provides basic tutorials about the usage of MMClassification.

Prepare datasets¶

It is recommended to symlink the dataset root to $MMCLASSIFICATION/data. If your folder structure is different, you may need to change the corresponding paths in config files.

mmclassification
├── mmcls
├── tools
├── configs
├── docs
├── data
│   ├── imagenet
│   │   ├── meta
│   │   ├── train
│   │   ├── val
│   ├── cifar
│   │   ├── cifar-10-batches-py
│   ├── mnist
│   │   ├── train-images-idx3-ubyte
│   │   ├── train-labels-idx1-ubyte
│   │   ├── t10k-images-idx3-ubyte
│   │   ├── t10k-labels-idx1-ubyte

For ImageNet, it has multiple versions, but the most commonly used one is ILSVRC 2012. It can be accessed with the following steps.

Register an account and login to the download page.
Find download links for ILSVRC2012 and download the following two files
- ILSVRC2012_img_train.tar (~138GB)
- ILSVRC2012_img_val.tar (~6.3GB)
Untar the downloaded files
Download meta data using this script

For MNIST, CIFAR10 and CIFAR100, the datasets will be downloaded and unzipped automatically if they are not found.

For using custom datasets, please refer to Tutorial 3: Customize Dataset.

Inference with pretrained models¶

We provide scripts to inference a single image, inference a dataset and test a dataset (e.g., ImageNet).

Inference a single image¶

python demo/image_demo.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE}

# Example
python demo/image_demo.py demo/demo.JPEG configs/resnet/resnet50_8xb32_in1k.py \
  https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth

Inference and test a dataset¶

single GPU
CPU
single node multiple GPU
multiple node

You can use the following commands to infer a dataset.

# single-gpu
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}]

# CPU: disable GPUs and run single-gpu testing script
export CUDA_VISIBLE_DEVICES=-1
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}]

# multi-gpu
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--metrics ${METRICS}] [--out ${RESULT_FILE}]

# multi-node in slurm environment
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}] --launcher slurm

Optional arguments:

RESULT_FILE: Filename of the output results. If not specified, the results will not be saved to a file. Support formats include json, yaml and pickle.
METRICS：Items to be evaluated on the results, like accuracy, precision, recall, etc.

Examples:

Infer ResNet-50 on CIFAR10 validation set to get predicted labels and their corresponding predicted scores.

python tools/test.py configs/resnet/resnet50_8xb16_cifar10.py \
  https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth \
  --out result.pkl

Train a model¶

MMClassification implements distributed training and non-distributed training, which uses MMDistributedDataParallel and MMDataParallel respectively.

All outputs (log files and checkpoints) will be saved to the working directory, which is specified by work_dir in the config file.

By default we evaluate the model on the validation set after each epoch, you can change the evaluation interval by adding the interval argument in the training config.

evaluation = dict(interval=12)  # Evaluate the model per 12 epochs.

Train with a single GPU¶

python tools/train.py ${CONFIG_FILE} [optional arguments]

If you want to specify the working directory in the command, you can add an argument --work_dir ${YOUR_WORK_DIR}.

Train with CPU¶

The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.

export CUDA_VISIBLE_DEVICES=-1

And then run the script above.

Warning

The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.

Train with multiple GPUs in single machine¶

./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

Optional arguments are:

--no-validate (not suggested): By default, the codebase will perform evaluation at every k (default value is 1) epochs during the training. To disable this behavior, use --no-validate.
--work-dir ${WORK_DIR}: Override the working directory specified in the config file.
--resume-from ${CHECKPOINT_FILE}: Resume from a previous checkpoint file.

Difference between resume-from and load-from: resume-from loads both the model weights and optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used for resuming the training process that is interrupted accidentally. load-from only loads the model weights and the training epoch starts from 0. It is usually used for finetuning.

Train with multiple machines¶

If you launch with multiple machines simply connected with ethernet, you can simply run following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

Usually it is slow if you do not have high speed networking like InfiniBand.

If you run MMClassification on a cluster managed with slurm, you can use the script slurm_train.sh. (This script also supports single machine training.)

[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}

You can check slurm_train.sh for full arguments and environment variables.

If you have just multiple machines connected with ethernet, you can refer to PyTorch launch utility. Usually it is slow if you do not have high speed networking like InfiniBand.

Launch multiple jobs on a single machine¶

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict.

If you use dist_train.sh to launch training jobs, you can set the port in commands.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4

If you use launch training jobs with Slurm, you need to modify the config files (usually the 6th line from the bottom in config files) to set different communication ports.

In config1.py,

dist_params = dict(backend='nccl', port=29500)

In config2.py,

dist_params = dict(backend='nccl', port=29501)

Then you can launch two jobs with config1.py ang config2.py.

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}

Train with IPU¶

The process of training on the IPU is consistent with single GPU training. We just need to have IPU machine and environment and add an extra argument --ipu-replicas ${IPU_NUM}

Useful tools¶

We provide lots of useful tools under tools/ directory.

Get the FLOPs and params (experimental)¶

We provide a script adapted from flops-counter.pytorch to compute the FLOPs and params of a given model.

python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]

You will get the result like this.

==============================
Input shape: (3, 224, 224)
Flops: 4.12 GFLOPs
Params: 25.56 M
==============================

Warning

This tool is still experimental and we do not guarantee that the number is correct. You may well use the result for simple comparisons, but double check it before you adopt it in technical reports or papers.

FLOPs are related to the input shape while parameters are not. The default input shape is (1, 3, 224, 224).
Some operators are not counted into FLOPs like GN and custom operators. Refer to mmcv.cnn.get_model_complexity_info() for details.

Publish a model¶

Before you publish a model, you may want to

Convert model weights to CPU tensors.
Delete the optimizer states.
Compute the hash of the checkpoint file and append the hash id to the filename.

python tools/convert_models/publish_model.py ${INPUT_FILENAME} ${OUTPUT_FILENAME}

E.g.,

python tools/convert_models/publish_model.py work_dirs/resnet50/latest.pth imagenet_resnet50.pth

The final output filename will be imagenet_resnet50_{date}-{hash id}.pth.

Tutorials¶

Currently, we provide five tutorials for users.