Kvasir-Instruments and Polyp Segmentation Using UNet

This paper describes the methodology used to develop, fine-tune, and analyze a UNet-based model for generating segmentation masks for the polyp and instrument segmentation tasks held at MedAI 2021. We used the same methodology for both tasks. Evaluation on the hidden testing dataset resulted in an IoU of 0.73 and a Dice score of 0.7980 for the instrument segmentation task, and an IoU of 0.41 and a Dice score of 0.41 for the polyp segmentation task.


Introduction
Over the last few years, the use of Deep Learning (DL) for Medical Image Segmentation (MIS) has attracted considerable interest in the medical community. MIS is considered a complicated task because of factors such as the complexity of the data, the complexity of the objects of interest, and the complexity of validation [1].
Compared to classical machine learning and computer vision techniques, deep learning offers higher segmentation accuracy and speed for MIS [2]. In particular, fully convolutional networks, generative adversarial networks (GANs), and U-Net have emerged as the most commonly used models for MIS. This paper summarizes our work on developing deep learning models for the instrument and polyp segmentation tasks of MedAI 2021 [2].
We used the U-Net model for both segmentation tasks (polyp and instrument segmentation); the materials and methods are detailed below.

Materials and methods
The methodology used for training and testing the U-Net model is divided into three parts: data pre-processing, the UNet model, and the initialization of model parameters and weights.

Data pre-processing
The datasets used in this paper were obtained from Simula open datasets (Kvasir-Instrument dataset [3] and Polyp dataset [4]). Each dataset was divided into a development set and a testing set in an 80:20 ratio. Before training, the images were first resized to 256 × 256 pixels and then normalized to the range 0 to 1.
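The pre-processing steps described above (80:20 split, resizing to 256 × 256, scaling to the range 0 to 1) can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code used in this work: the random shuffling, the nearest-neighbour interpolation, the 8-bit intensity range, and the file names are all assumptions, since the text does not specify them. In practice an image library would handle the resizing.

```python
import random


def split_dataset(items, dev_fraction=0.8, seed=0):
    """Shuffle items and split them into development and testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: the split is random
    cut = int(round(len(shuffled) * dev_fraction))
    return shuffled[:cut], shuffled[cut:]


def resize_nearest(image, out_h=256, out_w=256):
    """Resize a 2-D image (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image, max_value=255.0):
    """Scale pixel intensities from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


# Hypothetical file names, used only to demonstrate the 80:20 split.
names = [f"image_{i:03d}.jpg" for i in range(10)]
dev, test = split_dataset(names)

# A toy 2 x 2 grayscale image stands in for a real endoscopy frame.
toy = [[0, 255], [128, 64]]
prepared = normalize(resize_nearest(toy))
```

The same resizing has to be applied to the ground-truth masks so that images and masks stay aligned; nearest-neighbour sampling is a common choice there because it keeps the mask binary.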

UNet Model
The U-Net architecture (shown in Figure 1) is based on a fully convolutional neural network. In this work, the architecture was modified and extended so that it works with fewer training images and produces an output of the same size as the input.
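To illustrate how an encoder-decoder of this kind can return a mask with the same spatial size as its input, the sketch below tracks the feature-map side length through the network. The depth of four levels, the 'same'-padded convolutions, the 2 × 2 pooling, and the 2× upsampling are assumptions made for illustration; the exact configuration is the one shown in Figure 1.

```python
def unet_spatial_sizes(input_size=256, depth=4):
    """Return the feature-map side lengths along the encoder and decoder.

    Assumes 'same'-padded convolutions (which keep the size unchanged),
    2x2 max pooling in the encoder, and 2x upsampling in the decoder.
    """
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")

    encoder = [input_size]
    for _ in range(depth):
        encoder.append(encoder[-1] // 2)  # each pooling step halves the size

    decoder = []
    size = encoder[-1]
    for _ in range(depth):
        size *= 2  # each upsampling step doubles the size
        decoder.append(size)
    return encoder, decoder


encoder, decoder = unet_spatial_sizes()
```

Under these assumptions a 256 × 256 input shrinks to 16 × 16 at the bottleneck and is restored to 256 × 256 at the output, and every decoder level matches the size of the encoder level it receives a skip connection from.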

Initialization of model parameters and weights
The weights of the U-Net models were randomly initialized. During training, the hyperparameters were set up as described below.
1. Learning rate: The learning rate determines how quickly the model learns the task. In this work, the learning rate is regulated by TensorFlow's 'reduceLROnPlat' function. Its 'monitor' parameter is set so that the validation loss is monitored continuously. We set the lowest learning rate to 10^-6 for U-Net. If the Dice similarity coefficient (DSC) loss did not change, to five decimal places, for 3 consecutive epochs, the learning rate was reduced by a factor of 0.05.

2. Number of epochs: The number of epochs determines how many times the full dataset is presented to the model. In this work, it was determined by TensorFlow's 'EarlyStopping' function: if the validation loss did not change for 15 consecutive epochs, training stopped automatically. For UNet, the number of epochs was set to 1000, which acts as an upper limit when early stopping is used.

3. Batch size: The batch size indicates how many samples are fed to the model at a time during training. The batch size for both tasks was chosen by iteratively training the models with different batch sizes. Due to memory limitations, the UNet model could not be trained with a batch size larger than 12. Based on performance, we chose 8 as the best batch size.

4. Activation function: In both tasks, the ReLU activation function was used in the hidden layers.

5. Optimizer: To minimize the loss, we used the Adam optimizer for both tasks.
6. Loss function: The loss function used in this work is the negative Dice coefficient (DSC).

Figure 2 shows the DSC loss versus the number of epochs for (a) the instrument task and (b) the polyp task. Both plots indicate that the model is neither over-fitting nor under-fitting. For the instrument task, training was stopped by the early-stopping callback when the loss had stayed the same for 15 epochs; the DSC for training and validation was 0.8. For the polyp task, training was stopped in the same way after 15 epochs without change; the DSC for training and validation was 0.65.

Table 1 shows the models' performance in terms of accuracy, Jaccard index, Dice coefficient, recall, F1 score, and precision on the testing sets of the Kvasir-Instruments and Polyp data. The model performed better on the instrument dataset: the average Dice score was 0.80 on the instrument dataset, whereas it was 0.41 on the polyp dataset.
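The negative Dice loss and the Jaccard index (IoU) reported in Table 1 can be illustrated with a small sketch. It operates on flattened masks given as plain Python lists and is only meant to make the definitions concrete; the function names and the toy masks are illustrative, and the handling of two empty masks (score 1.0) is an assumption rather than something stated in this work.

```python
def dice_coefficient(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def jaccard_index(pred, truth):
    """Jaccard index (IoU) |A∩B| / |A∪B| for flat masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    union = sum(pred) + sum(truth) - intersection
    return 1.0 if union == 0 else intersection / union


def dice_loss(pred, truth):
    """Negative Dice coefficient: minimizing it maximizes the overlap."""
    return -dice_coefficient(pred, truth)


# Toy flattened masks: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]

d = dice_coefficient(pred, truth)
j = jaccard_index(pred, truth)
loss = dice_loss(pred, truth)
```

For a single pair of masks the two scores are tied by J = D / (2 - D), so the Dice score is never lower than the IoU. Scores averaged over many images, as in a test-set evaluation, need not satisfy this relation exactly.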

Discussion
In this work, we established a methodology using UNet to generate segmentation masks for the instrument and polyp datasets. A key advantage of this methodology is that it automatically adjusts the learning rate within a defined range. Hence, the model does not overfit on the validation dataset, as shown in Figure 2.
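The interplay between the plateau-based learning-rate reduction and early stopping described in the methods can be sketched as a replay over a sequence of validation losses. This is a simplified illustration, not the TensorFlow callbacks themselves. The reduction factor (0.05), the two patience values (3 and 15 epochs), and the learning-rate floor (10^-6) follow the text; the starting learning rate of 10^-3, the reading of the factor as a multiplier, and the reading of "five decimal places" as a minimum change of 10^-5 are assumptions.

```python
def simulate_training_control(val_losses, initial_lr=1e-3, factor=0.05,
                              lr_patience=3, min_lr=1e-6, min_delta=1e-5,
                              stop_patience=15):
    """Replay validation losses; return (learning rate per epoch, stop epoch).

    The stop epoch is None if early stopping never triggers.
    """
    lr = initial_lr  # assumption: the starting value is not given in the text
    best_for_lr, lr_wait = float("inf"), 0
    best_for_stop, stop_wait = float("inf"), 0
    lrs = []

    for epoch, loss in enumerate(val_losses, start=1):
        lrs.append(lr)  # learning rate in effect during this epoch

        # Plateau rule: shrink the learning rate after lr_patience stale epochs.
        if loss < best_for_lr - min_delta:
            best_for_lr, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)  # never go below the floor
                lr_wait = 0

        # Early stopping: halt after stop_patience stale epochs.
        if loss < best_for_stop - min_delta:
            best_for_stop, stop_wait = loss, 0
        else:
            stop_wait += 1
            if stop_wait >= stop_patience:
                return lrs, epoch

    return lrs, None


# Toy loss curve: three improving epochs followed by a long plateau.
losses = [-0.5, -0.6, -0.7] + [-0.7] * 20
lrs, stop_epoch = simulate_training_control(losses)
```

With this toy curve the learning rate is cut after every third stale epoch until it reaches the floor, and training halts once 15 epochs in a row bring no improvement, well before all of the supplied epochs are used.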
A close look at Table 1 and Figure 2 shows that, for the instrument dataset, the model performed well on the training and validation sets (Dice score around 0.8, as shown in Figure 2(a)) and on the testing set (Dice score of 0.7980, as shown in Table 1). For the polyp dataset (Figure 2(b)), the model also does not appear to overfit (DSC around 0.6 for both sets), but Table 1 shows that it failed at the segmentation task on the testing dataset, where the DSC was only 0.41. From the analysis on both datasets, we conclude that the model performed well on the instrument segmentation task. The likely reason is that the features of the instruments (color, pixel intensity, etc.) differ from those of the surrounding tissue, which makes the task easier. In the polyp segmentation task, by contrast, the region of interest (ROI) lies on the tissue itself, which makes it difficult for the UNet model to distinguish the ROI from its surroundings because of their similar pixel features. As future work, other models such as DeepLabv3 or Pix2Pix-GAN could be tried for this segmentation problem.

Conflict of interest
The authors state no conflict of interest.