Mask R-CNN Review and Code Implementation (Image Segmentation): Let's Custom-Train It

June 19, 2022 · 4 min read

Mask R-CNN

Broadly, there are three tasks involved in finding objects in an image:

Classification
Object detection
Instance segmentation

Models up to Faster R-CNN were designed for object detection.

Mask R-CNN adds a mask branch to Faster R-CNN, and it is generally used for instance segmentation rather than for plain detection.

Here, instance segmentation is the task of detecting every object in an image while also classifying each individual object (instance) precisely at the pixel level.
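To make "each instance" concrete, here is a minimal NumPy sketch on a toy 4x6 "image" with two hypothetical cars (made-up data, not from this post's dataset). Instance segmentation keeps one binary mask per object; merging those masks into a single per-pixel class map, as semantic segmentation does, keeps the class but loses which pixels belong to which car.

```python
import numpy as np

H, W = 4, 6
# Two hypothetical car instances, each stored as its own binary mask: shape (N, H, W)
instance_masks = np.zeros((2, H, W), dtype=np.uint8)
instance_masks[0, 1:3, 0:2] = 1   # car #1
instance_masks[1, 1:3, 3:6] = 1   # car #2
instance_labels = np.array([1, 1])  # both instances belong to class 1 ("car")

# Semantic segmentation: a single class id per pixel, so instance identity is lost
semantic_map = np.zeros((H, W), dtype=np.int64)
for mask, label in zip(instance_masks, instance_labels):
    semantic_map[mask.astype(bool)] = label

print("separate instances:", len(instance_masks))
print("class ids in the semantic map:", np.unique(semantic_map))
```

Both representations cover the same pixels, but only the instance masks can tell the two cars apart.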

If you are not familiar with Faster R-CNN, see the following post:

https://panggu15.github.io/detection/%EA%B0%9D%EC%B2%B4-%ED%83%90%EC%A7%80(Faster-R-CNN)/

Architecture

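Mask R-CNN reuses Faster R-CNN for object detection, with a classification branch that predicts a class for each RoI and a bounding-box regression branch that refines each box. In parallel with these two branches, it adds a small FCN (fully convolutional network) that performs mask segmentation.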

Key Features

RoIAlign

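In Faster R-CNN, RoI pooling was designed for object detection, so it snaps each box to the nearest pixels before pooling. Whenever an RoI coordinate or bin boundary is fractional, the fractional part is discarded by rounding (quantization), which misaligns the pooled features with the input and destroys spatial information about neighboring pixels. This is tolerable for classification but harmful for pixel-accurate masks.

Mask R-CNN solves this by replacing RoIPool with RoIAlign, which never rounds coordinates and instead uses bilinear interpolation to compute feature values at the exact (fractional) sampling locations.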
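The difference can be seen on a toy feature map. The NumPy sketch below is a simplification under stated assumptions: it takes one sample at each bin centre, whereas real implementations such as torchvision.ops.roi_align typically average several samples per bin (controlled by sampling_ratio). The "rounded" variant is not RoIPool itself (which also max-pools over quantized bins); it only illustrates the rounding error. All function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

def bin_centers(box, out_size):
    """Centres of an out_size x out_size grid of bins inside box=(x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    return [[(y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
             for j in range(out_size)] for i in range(out_size)]

def roi_align_simple(feat, box, out_size=2):
    """RoIAlign-style: keep fractional coordinates, one bilinear sample per bin."""
    return np.array([[bilinear_sample(feat, y, x) for y, x in row]
                     for row in bin_centers(box, out_size)])

def roi_sample_rounded(feat, box, out_size=2):
    """Quantized variant: round each sampling point to the nearest pixel."""
    return np.array([[feat[int(round(y)), int(round(x))] for y, x in row]
                     for row in bin_centers(box, out_size)])

# Toy feature map where value = 10*row + col, so exact values are easy to check
feat = np.add.outer(10.0 * np.arange(6), np.arange(6.0))
box = (0.6, 0.6, 3.4, 3.4)  # fractional RoI in feature-map coordinates

aligned = roi_align_simple(feat, box)    # exact 10*y + x at the bin centres (1.3 and 2.7)
rounded = roi_sample_rounded(feat, box)  # values read from the rounded pixels 1 and 3
```

Because the toy map is linear, bilinear interpolation recovers the exact values at the bin centres, while rounding shifts every sample by up to half a pixel.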

Mask Branch

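A segmentation task must classify every pixel, so it needs a more precise spatial layout (information about how things are arranged in space) than a detection task. The Mask Branch provides this: it uses several convolution layers to extract pixel-to-pixel information. A small FCN predicts K masks of size m x m for each RoI, where K is the number of classes. Fully connected (FC) layers are avoided so that the spatial structure of the features is preserved instead of being flattened into a vector.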

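For each RoI, the mask branch outputs one binary mask per class. At inference time, only the mask for the class predicted by the classification branch is used.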
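A minimal NumPy sketch of that selection step, assuming raw mask logits of shape (N, K, m, m) as described above; select_class_masks is a hypothetical helper and the logits are random stand-ins. m = 28 is only an example value (it matches the 28x28 masks of the FPN mask head in the original paper).

```python
import numpy as np

def select_class_masks(mask_logits, labels, threshold=0.5):
    """Pick the mask of each RoI's class and binarise it.

    mask_logits: (N, K, m, m) raw outputs of the mask branch, one mask per class
    labels:      (N,) class index chosen for each RoI
    """
    n = mask_logits.shape[0]
    chosen = mask_logits[np.arange(n), labels]   # (N, m, m): one mask per RoI
    probs = 1.0 / (1.0 + np.exp(-chosen))        # per-pixel sigmoid
    return (probs > threshold).astype(np.uint8)

rng = np.random.default_rng(0)
N, K, m = 3, 2, 28                               # 3 RoIs, 2 classes (background + car)
mask_logits = rng.normal(size=(N, K, m, m))
labels = np.array([1, 1, 0])
binary_masks = select_class_masks(mask_logits, labels)
print(binary_masks.shape)  # (3, 28, 28)
```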

Loss

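Like Faster R-CNN, Mask R-CNN is a two-stage method.

In the first stage, the RPN generates RoIs; in the second stage, each RoI is used to output a class, box offsets, and a binary mask. The mask branch produces a Km^2-dimensional output for each RoI, where K is the number of classes and m is the mask resolution. Although the mask branch computes a mask for all K classes, the mask loss L_mask is computed only on the mask of the RoI's ground-truth class k, using a per-pixel sigmoid and the average binary cross-entropy. Because the other classes' masks do not contribute to the loss, masks do not compete across classes. The total loss per RoI is L = L_cls + L_box + L_mask.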
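A NumPy sketch of that L_mask definition, assuming (N, K, m, m) logits and 0/1 target masks; mask_loss is a hypothetical name, and the BCE is written in the usual numerically stable "with logits" form. The example checks two properties: all-zero logits give a loss of ln 2, and changing the mask of a class other than the ground-truth one leaves the loss unchanged.

```python
import numpy as np

def mask_loss(mask_logits, gt_labels, gt_masks):
    """Average per-pixel binary cross-entropy on the ground-truth class's mask only.

    mask_logits: (N, K, m, m) mask branch outputs (logits)
    gt_labels:   (N,) ground-truth class k of each RoI
    gt_masks:    (N, m, m) binary target masks (0/1)
    """
    n = mask_logits.shape[0]
    x = mask_logits[np.arange(n), gt_labels]   # only the k-th mask contributes
    # Stable BCE with logits: max(x, 0) - x*y + log(1 + exp(-|x|))
    bce = np.maximum(x, 0) - x * gt_masks + np.log1p(np.exp(-np.abs(x)))
    return bce.mean()

logits = np.zeros((1, 2, 4, 4))
gt_masks = np.ones((1, 4, 4))
base = mask_loss(logits, np.array([1]), gt_masks)

logits_other = logits.copy()
logits_other[0, 0] = 100.0                     # change only the non-ground-truth class's mask
other = mask_loss(logits_other, np.array([1]), gt_masks)
print(np.isclose(base, np.log(2)))  # True
print(np.isclose(other, base))      # True
```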

Mask R-CNN Implementation

import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
from glob import glob
from tqdm import tqdm

import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
import warnings
warnings.filterwarnings('ignore')

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

Loading the Data

box = pd.read_csv('/content/drive/MyDrive/dataset/data/train_solution_bounding_boxes (1).csv')
box.head()

             image        xmin        ymin        xmax        ymax
0   vid_4_1000.jpg  281.259045  187.035071  327.727931  223.225547
1  vid_4_10000.jpg   15.163531  187.035071  120.329957  236.430180
2  vid_4_10040.jpg  239.192475  176.764801  361.968162  236.430180
3  vid_4_10020.jpg  496.483358  172.363256  630.020260  231.539575
4  vid_4_10060.jpg   16.630970  186.546010  132.558611  238.386422

sample = cv2.imread('/content/drive/MyDrive/dataset/data/training_images/vid_4_1000.jpg')
sample = cv2.cvtColor(sample, cv2.COLOR_BGR2RGB)
point = box.iloc[0]  # first bounding box in the CSV (vid_4_1000.jpg)
pt1 = (int(point['xmin']), int(point['ymin']))  # top-left corner
pt2 = (int(point['xmax']), int(point['ymax']))  # bottom-right corner
cv2.rectangle(sample, pt1, pt2, color=(255,0,0), thickness=2)
plt.imshow(sample)
plt.show()
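This dataset only provides bounding boxes, with no pixel-level annotations, so the CarDataset below builds each instance mask by filling its box with a filled cv2.rectangle. The same idea as a tiny standalone NumPy sketch (box_to_mask is a hypothetical helper, not part of the post's code; it truncates coordinates to ints and includes both corners, roughly like the cv2 call):

```python
import numpy as np

def box_to_mask(box, height, width):
    """Fill an axis-aligned box (xmin, ymin, xmax, ymax) with 1s in an H x W mask."""
    xmin, ymin, xmax, ymax = (int(v) for v in box)
    # Clip to the image so out-of-range boxes do not index outside the mask
    xmin, xmax = max(xmin, 0), min(xmax, width - 1)
    ymin, ymax = max(ymin, 0), min(ymax, height - 1)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[ymin:ymax + 1, xmin:xmax + 1] = 1   # both corners included
    return mask

m = box_to_mask((1.7, 0.2, 3.9, 2.5), height=4, width=5)
print(m.sum())  # 9 pixels: columns 1-3, rows 0-2
```

Because the training targets are rectangles, the mask head can only learn box-shaped masks here; real per-pixel annotations would be needed to learn actual car silhouettes.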

class CarDataset(Dataset):
    def __init__(self, df, image_dir, transforms=None):
        super().__init__()
        
        self.image_ids = df["image"].unique()  # all image filenames
        self.df = df
        self.image_dir = image_dir  # dir to image files
        self.transforms = transforms

    def __getitem__(self, idx: int):
        image_id = self.image_ids[idx]
        records = self.df[self.df["image"] == image_id]
        image = cv2.imread(f"{self.image_dir}/{image_id}", cv2.IMREAD_COLOR)
        heights, widths = image.shape[:2]
        # BGR -> RGB, scale to [0, 1], then HWC -> CHW as the model expects
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image /= 255.0
        image = torch.from_numpy(image).permute(2, 0, 1)

        boxes = records[["xmin", "ymin", "xmax", "ymax"]].values

        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        area = torch.as_tensor(area, dtype=torch.float32)

        # The CSV only has bounding boxes, so each instance mask is the filled box
        masks = []
        for box in boxes:
            mask = np.zeros([int(heights), int(widths)], np.uint8)
            masks.append(cv2.rectangle(mask, (int(box[0]), int(box[1])), (int(box[2]), int(box[3])), 1, -1))

        # Stack into one (N, H, W) array first; building a tensor from a list of arrays is slow
        masks = torch.as_tensor(np.stack(masks), dtype=torch.uint8)

        # There is only one class (car), so every label is 1
        labels = torch.ones((records.shape[0]), dtype=torch.int64)

        target = {}
        target["boxes"] = torch.as_tensor(boxes, dtype=torch.float32)  # float32 boxes, like the images
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = torch.tensor([idx])
        target["area"] = area

        # Optional albumentations-style transform (unused here: transforms=None).
        # Note that it only updates the image and boxes, not the masks.
        if self.transforms:
            sample = {"image": image, "boxes": target["boxes"], "labels": labels}
            sample = self.transforms(**sample)
            image = sample["image"]
            target["boxes"] = torch.stack(tuple(map(torch.tensor, zip(*sample["boxes"])))).permute(1, 0)

        return image, target

    def __len__(self):
        return self.image_ids.shape[0]

def collate_fn(batch):
    return tuple(zip(*batch))

dir_train = "/content/drive/MyDrive/dataset/data/training_images"
train_ds = CarDataset(box, dir_train)

train_dl = DataLoader(train_ds, batch_size=8, shuffle=False, num_workers=2, collate_fn=collate_fn)
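collate_fn is needed because each image has a different number of cars, so the per-image targets cannot be stacked into one batch tensor the way PyTorch's default collate would roughly attempt. A NumPy sketch with two hypothetical samples (one car vs. three cars) shows what it does:

```python
import numpy as np

def collate_fn(batch):
    # Same as in the post: [(img1, tgt1), (img2, tgt2)] -> ((img1, img2), (tgt1, tgt2))
    return tuple(zip(*batch))

# Two hypothetical samples: one image with 1 car, one with 3 cars
sample_a = (np.zeros((3, 8, 8)), {"boxes": np.zeros((1, 4))})
sample_b = (np.zeros((3, 8, 8)), {"boxes": np.zeros((3, 4))})

images, targets = collate_fn([sample_a, sample_b])
print([t["boxes"].shape for t in targets])  # [(1, 4), (3, 4)]

# Stacking the per-image boxes into a single array fails
try:
    np.stack([t["boxes"] for t in targets])
except ValueError:
    print("boxes cannot be stacked: different number of objects per image")
```

Keeping the batch as tuples matches what torchvision detection models expect: a list of image tensors and a list of target dicts.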

Building the Model

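The model is built on top of the existing Faster R-CNN. The commented-out block below is the head replacement used for Faster R-CNN in the previous post, kept for comparison; for Mask R-CNN, the mask predictor is replaced as well.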

# https://panggu15.github.io/detection/%EA%B0%9D%EC%B2%B4-%ED%83%90%EC%A7%80(Faster-R-CNN)/
# Faster R-CNN

# num_classes = 2 # 1 class (car) + background

# Load model pretrained on COCO
# model = fasterrcnn_resnet50_fpn(pretrained=True)

# get number of input features for the classifier
# in_features = model.roi_heads.box_predictor.cls_score.in_features

# replace pre-trained head with new one
# model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

def get_instance_segmentation_model(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)

    return model
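Replacing the heads only changes their output sizes to match num_classes, which counts background as a class (hence num_classes = 2). A shape-only sketch, assuming torchvision's defaults as I understand them: FastRCNNPredictor's linear layers output num_classes scores and num_classes * 4 box deltas, and the mask head receives 14x14 RoI features that MaskRCNNPredictor upsamples with a 2x2, stride-2 transposed convolution to 28x28. head_output_shapes is a hypothetical helper.

```python
def transposed_conv_out(size, kernel=2, stride=2, padding=0):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def head_output_shapes(num_rois, num_classes, mask_roi_size=14):
    """Output shapes of the two replaced heads for num_rois RoIs."""
    m = transposed_conv_out(mask_roi_size)
    return {
        "cls_score": (num_rois, num_classes),          # one score per class
        "bbox_pred": (num_rois, num_classes * 4),      # 4 box deltas per class
        "mask_logits": (num_rois, num_classes, m, m),  # one mask per class
    }

print(head_output_shapes(num_rois=100, num_classes=2))
# {'cls_score': (100, 2), 'bbox_pred': (100, 8), 'mask_logits': (100, 2, 28, 28)}
```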

# class 1 + background 1 = 2
num_classes = 2

model = get_instance_segmentation_model(num_classes)
model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=0.0005, weight_decay=0.0005)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
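Since scheduler.step() is called once per epoch, StepLR's learning rate follows the closed form base_lr * gamma ** (epoch // step_size). A pure-Python sketch of that formula (step_lr is a hypothetical helper) for the settings above:

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in a given 0-based epoch under a StepLR-style schedule."""
    return base_lr * gamma ** (epoch // step_size)

for epoch in range(10):
    print(f"epoch {epoch + 1}: lr = {step_lr(0.0005, epoch):.0e}")
```

With step_size=3 and gamma=0.1, the learning rate drops tenfold after epochs 3, 6 and 9, so the last epoch trains at 5e-7.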

Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /root/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth


model.train()

num_epochs = 10

for epoch in range(num_epochs):
    for i, (images, targets) in enumerate(train_dl):
        optimizer.zero_grad()
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # In training mode the model returns a dict of losses instead of predictions
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        losses.backward()
        optimizer.step()

        if (i + 1) % 40 == 0:
            print(f'Epoch {epoch+1} - Total: {losses:.4f}, Regression: {loss_dict["loss_box_reg"]:.4f}, Classifier: {loss_dict["loss_classifier"]:.4f}')

    # StepLR: multiply the learning rate by gamma every step_size epochs
    scheduler.step()

Epoch 1 - Total: 0.5321, Regression: 0.1252, Classifier: 0.0785
Epoch 2 - Total: 0.4848, Regression: 0.1046, Classifier: 0.0807
Epoch 3 - Total: 0.4614, Regression: 0.1080, Classifier: 0.0954
Epoch 4 - Total: 0.3701, Regression: 0.0717, Classifier: 0.0521
Epoch 5 - Total: 0.3382, Regression: 0.0701, Classifier: 0.0475
Epoch 6 - Total: 0.3417, Regression: 0.0743, Classifier: 0.0485
Epoch 7 - Total: 0.3255, Regression: 0.0710, Classifier: 0.0402
Epoch 8 - Total: 0.3335, Regression: 0.0711, Classifier: 0.0469
Epoch 9 - Total: 0.3306, Regression: 0.0710, Classifier: 0.0452
Epoch 10 - Total: 0.3295, Regression: 0.0720, Classifier: 0.0468
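In training mode, torchvision's Mask R-CNN returns a dictionary of losses (loss_classifier, loss_box_reg, loss_mask, loss_objectness, loss_rpn_box_reg), and the loop sums all of them into Total. The log above prints only two terms, so the rest of Total comes from the mask and RPN losses. A small hypothetical helper could print every term; the values below are made up for illustration only.

```python
def format_losses(epoch, loss_dict):
    """Format the total and every individual loss term in one log line."""
    values = {k: float(v) for k, v in loss_dict.items()}  # also works on 0-dim tensors
    total = sum(values.values())
    parts = ", ".join(f"{k}: {v:.4f}" for k, v in values.items())
    return f"Epoch {epoch} - Total: {total:.4f}, {parts}"

# Made-up values, for illustration only
example = {
    "loss_classifier": 0.05,
    "loss_box_reg": 0.07,
    "loss_mask": 0.15,
    "loss_objectness": 0.01,
    "loss_rpn_box_reg": 0.02,
}
line = format_losses(1, example)
print(line)
```

Inside the training loop it could replace the existing print, e.g. print(format_losses(epoch + 1, loss_dict)).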

threshold = 0.8

image = cv2.imread("/content/drive/MyDrive/dataset/data/testing_images/vid_5_26640.jpg", cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
image /= 255.0
sample = image.copy()  # HWC copy used for drawing the results
images = [torch.from_numpy(image).permute(2, 0, 1).to(device)]  # the model takes a list of CHW tensors

model.eval()
cpu_device = torch.device("cpu")

with torch.no_grad():  # no gradients are needed for inference
    preds = model(images)
outputs = [{k: v.to(cpu_device) for k, v in t.items()} for t in preds]

# Keep only detections whose confidence score is above the threshold
keep = outputs[0]["scores"] > threshold
boxes = outputs[0]["boxes"][keep].numpy().astype(np.int32)

for box in boxes:
    x1, y1, x2, y2 = box.tolist()
    # The image is float in [0, 1], so the box colour is given on the same scale
    cv2.rectangle(sample, (x1, y1), (x2, y2), (1.0, 0.0, 0.0), 3)

plt.imshow(sample)
plt.show()
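The inference code above only draws boxes, but in eval mode torchvision's Mask R-CNN also returns outputs[0]["masks"]: soft masks of shape (N, 1, H, W) with values in the 0-1 range, which the torchvision documentation suggests thresholding at 0.5. A minimal NumPy sketch of binarising and blending them over the image, with fake arrays standing in for real predictions (overlay_masks is a hypothetical helper):

```python
import numpy as np

def overlay_masks(image, soft_masks, scores, score_thr=0.8, mask_thr=0.5,
                  color=(1.0, 0.0, 0.0), alpha=0.5):
    """Blend every confident instance mask over a float RGB image in [0, 1].

    image:      (H, W, 3) float array
    soft_masks: (N, 1, H, W) per-pixel probabilities
    scores:     (N,) detection confidences
    """
    out = image.copy()
    keep = scores > score_thr
    for soft in soft_masks[keep]:
        binary = soft[0] > mask_thr   # (H, W) boolean mask for this instance
        out[binary] = (1 - alpha) * out[binary] + alpha * np.array(color)
    return out

# Fake predictions: two instances, only the first one is confident
H, W = 4, 4
image = np.zeros((H, W, 3))
soft_masks = np.zeros((2, 1, H, W))
soft_masks[0, 0, :2, :2] = 0.9
soft_masks[1, 0, 2:, 2:] = 0.9
scores = np.array([0.95, 0.30])
blended = overlay_masks(image, soft_masks, scores)
```

With the real predictions above, a call could look like overlay_masks(sample, outputs[0]["masks"].numpy(), outputs[0]["scores"].numpy()). Given the box-shaped training masks in this post, expect roughly rectangular masks.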
