Swin Transformer for Object Detection - Inference with COCO dataset 2. Cascade Mask RCNN (ResNet50)
on Projects
1. Reference
2. Checkpoints
Since there are no checkpoint files, we need to train the model for the COCO dataset.
2.1. Edit the code
configs/swin/cascade_mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py
_base_ = './cascade_mask_rcnn_swin-t-p4-w7_fpn_fp16_ms-crop-3x_coco.py'
pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth' # noqa
model = dict(
backbone=dict(
depths=[2, 2, 18, 2],
init_cfg=dict(type='Pretrained', checkpoint=pretrained)))
configs/swin/cascade_mask_rcnn_swin-t-p4-w7_fpn_fp16_ms-crop-3x_coco.py
_base_ = './cascade_mask_rcnn_swin-t-p4-w7_fpn_ms-crop-3x_coco.py'
# you need to set mode='dynamic' if you are using pytorch<=1.5.0
fp16 = dict(loss_scale=dict(init_scale=512))
configs/swin/cascade_mask_rcnn_swin-t-p4-w7_fpn_ms-crop-3x_coco.py
_base_ = [
'../_base_/models/cascade_mask_rcnn_r50_fpn.py',
'../_base_/datasets/coco_instance.py',
'../_base_/schedules/schedule_1x.py', '../_base_/default_runtime.py'
]
pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa
model = dict(
type='CascadeRCNN',
backbone=dict(
_delete_=True,
type='SwinTransformer',
embed_dims=96,
depths=[2, 2, 6, 2],
num_heads=[3, 6, 12, 24],
window_size=7,
mlp_ratio=4,
qkv_bias=True,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.2,
patch_norm=True,
out_indices=(0, 1, 2, 3),
with_cp=False,
convert_weights=True,
init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
neck=dict(in_channels=[96, 192, 384, 768]))
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
# augmentation strategy originates from DETR / Sparse RCNN
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='AutoAugment',
policies=[[
dict(
type='Resize',
img_scale=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
(608, 1333), (640, 1333), (672, 1333), (704, 1333),
(736, 1333), (768, 1333), (800, 1333)],
multiscale_mode='value',
keep_ratio=True)
],
[
dict(
type='Resize',
img_scale=[(400, 1333), (500, 1333), (600, 1333)],
multiscale_mode='value',
keep_ratio=True),
dict(
type='RandomCrop',
crop_type='absolute_range',
crop_size=(384, 600),
allow_negative_crop=True),
dict(
type='Resize',
img_scale=[(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
multiscale_mode='value',
override=True,
keep_ratio=True)
]]),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
data = dict(train=dict(pipeline=train_pipeline))
optimizer = dict(
_delete_=True,
type='AdamW',
lr=0.0001,
betas=(0.9, 0.999),
weight_decay=0.05,
paramwise_cfg=dict(
custom_keys={
'absolute_pos_embed': dict(decay_mult=0.),
'relative_position_bias_table': dict(decay_mult=0.),
'norm': dict(decay_mult=0.)
}))
lr_config = dict(warmup_iters=1000, step=[27, 33])
runner = dict(max_epochs=36)
2.2. Train
- 4 GPU 3090 RTX
- 3~4 days
cascade_mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py
bash ./tools/dist_train.sh configs/swin/cascade_mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py 4
3. Test
work_dirs/cascade_mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco/latest.pth
python tools/test.py configs/swin/cascade_mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py work_dirs/cascade_mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco/latest.pth --eval bbox segm
4. Results
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.508
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.701
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.555
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.335
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.542
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.663
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.630
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.630
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.630
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.455
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.664
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.779
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.443
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.675
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.480
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.245
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.474
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.636
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.555
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.555
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.555
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.370
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.594
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.716
Reference인 51.8boxAP, 44.7MaskAP보다 결과값이 작게 나옴.
Head로 사용한 Cascade Mask-RCNN이 ResNet50으로 pretrain된거여서 그런 것 같음.
SwinTransformer로 사전학습된 Cascade Mask-RCNN을 사용한 Swin transformer를 training해보기.