Precision in Building Segmentation Competition Submission - Team UiAI

The MapAI competition "Precision in Building Segmentation" tasked researchers with developing neural networks to segment buildings in aerial images, using RGB and LiDAR data from a labeled dataset. This paper presents the Team UiAI submission for this competition. To solve the tasks, convolutional autoencoders with pretrained encoder backbones were trained using only the dataset provided by the competition. After the final withheld dataset was released, the trained model achieved a score of 65.58% for task 1 and 83.16% for task 2.


Introduction
This paper presents the Team UiAI submission for the MapAI competition "Precision in Building Segmentation", which is hosted by the Norwegian Artificial Intelligence Research Consortium (NORA) and other institutions [1]. The goal of the competition is to use artificial intelligence or computer vision methods to segment buildings in aerial images of land. While aerial images provide an overview of the land structure and population, deriving useful data from such images is a challenging task because of varying image quality, noise, shadows from objects, and objects blocking the view of other objects.
Several image segmentation methods exist, divided into classical computer vision and deep learning techniques. Classical computer vision techniques perform semantic segmentation by analyzing each pixel based on its surrounding area to classify objects [2]. Deep learning, on the other hand, enables neural networks to be trained with supervised learning to segment objects in images. Supervised learning requires a labeled dataset of images to train and update the weights of the neural network, which is provided in this competition.

Method
The models were evaluated on a subset of the training dataset that acted as the test dataset until the actual competition test dataset was released.
Of the models we tried, the one with the best result on the Danish dataset was an autoencoder whose backbone was a feature pyramid network based on resnext101_32x8d (pretrained with ResNeXt101_32X8D_Weights.IMAGENET1K_V2). The first four feature maps were used. All feature maps were upscaled to the size of the largest feature map using bilinear interpolation, concatenated along the channel axis, and reduced in channel count with a 1x1 2D convolution. The result is passed through two residual blocks, each consisting of two 3x3 2D convolutions with GELU activation and batch normalization. It is then upscaled to the size of the original image using two transposed convolutions while keeping 256 channels, and the number of channels is finally reduced to 2 with a 1x1 2D convolution. The model is illustrated in Figure 1.
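The decoder head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the exact competition code: the class and parameter names are ours, the pretrained resnext101_32x8d backbone is omitted (the sketch takes its four feature maps as input), and the two transposed convolutions assume the largest feature map is at 1/4 of the input resolution, as is typical for ResNeXt-style backbones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and GELU, plus a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        return x + self.body(x)


class SegmentationHead(nn.Module):
    """Fuses four backbone feature maps into a 2-channel segmentation map."""

    def __init__(self, in_channels, mid_channels=256):
        super().__init__()
        # 1x1 convolution reduces the concatenated channel count
        self.reduce = nn.Conv2d(sum(in_channels), mid_channels, 1)
        self.res1 = ResidualBlock(mid_channels)
        self.res2 = ResidualBlock(mid_channels)
        # two 2x transposed convolutions recover the original resolution
        self.up1 = nn.ConvTranspose2d(mid_channels, mid_channels, 2, stride=2)
        self.up2 = nn.ConvTranspose2d(mid_channels, mid_channels, 2, stride=2)
        self.out = nn.Conv2d(mid_channels, 2, 1)

    def forward(self, feats):
        # upscale every feature map to the size of the largest one
        target = feats[0].shape[-2:]
        feats = [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats
        ]
        x = self.reduce(torch.cat(feats, dim=1))  # concatenate on channel axis
        x = self.res2(self.res1(x))
        x = self.up2(self.up1(x))
        return self.out(x)
```

With a 512x512 input, the four feature maps of the backbone would arrive at strides 4, 8, 16, and 32, and the head returns a 2-channel map at full resolution.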
To improve results, the model was trained with Soft mIoU [3] as the loss function, which directly targets the competition score. Data augmentation was also used during training: each image and its corresponding label were randomly rotated and mirrored horizontally and vertically, effectively increasing the size of the dataset.
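A minimal sketch of this training setup is shown below. The function names are ours, the loss is a standard differentiable (soft) mean-IoU formulation that may differ in detail from [3], and the augmentation assumes 90-degree rotations; the paper does not specify the rotation angles used.

```python
import torch
import torch.nn.functional as F


def soft_miou_loss(logits, target, eps=1e-6):
    """Differentiable mean-IoU loss over the classes (background, building)."""
    probs = F.softmax(logits, dim=1)                         # (N, C, H, W)
    onehot = F.one_hot(target, num_classes=probs.shape[1])   # (N, H, W, C)
    onehot = onehot.permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3)) - inter
    iou = (inter + eps) / (union + eps)                      # per-class soft IoU
    return 1.0 - iou.mean()


def augment(image, label):
    """Random 90-degree rotation and horizontal/vertical flips, applied jointly."""
    k = torch.randint(0, 4, (1,)).item()
    image = torch.rot90(image, k, dims=(-2, -1))
    label = torch.rot90(label, k, dims=(-2, -1))
    if torch.rand(1) < 0.5:  # horizontal mirror
        image, label = torch.flip(image, dims=(-1,)), torch.flip(label, dims=(-1,))
    if torch.rand(1) < 0.5:  # vertical mirror
        image, label = torch.flip(image, dims=(-2,)), torch.flip(label, dims=(-2,))
    return image, label
```

Applying the same random transform to the image and its label keeps the pixel-wise supervision aligned, which is what makes joint augmentation valid for segmentation.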

Results
We evaluated the results using the framework provided for the MapAI competition, which computes half the score from IoU [4] and half from BIoU [5]. Our model was trained without freezing any part of the pretrained backbone, and each batch was trained on 30 times.
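The scoring scheme can be summarized as follows. This sketch shows plain IoU and the equal weighting of the two metrics; BIoU additionally restricts the comparison to pixels near mask boundaries, which is omitted here for brevity, and the function names are ours rather than those of the competition framework.

```python
import numpy as np


def iou(pred, target, eps=1e-6):
    """Intersection over union for binary masks (1 = building)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)


def competition_score(iou_value, biou_value):
    """The MapAI score weights IoU and Boundary IoU equally."""
    return 0.5 * iou_value + 0.5 * biou_value
```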

Discussion
The results show that the model performs better in task 2 than in task 1, which means that the model improves when it is provided with more data as input. For task 1, the model performs well during training and testing, but the score drops drastically when evaluated on the withheld competition dataset. A possible explanation could be that the model is being trained on incomplete image labels, as the competition states that the labels contain errors with missing segments [1]. Another explanation could be that, even though the scores are similar on the Danish images we used for testing, with distance data the model learns to recognize buildings in a vastly different way that generalizes better to the Norwegian dataset. There are numerous ways to improve these results, such as gathering additional labeled datasets like SpaceNet [6]. Another way to enhance the result could be to include our model in an ensemble of autoencoders [7]. A different model, such as a Transformer [8], which has shown promising results in image segmentation [9], could also be developed and used together with the existing model in an ensemble.

Conflict of interest
The authors state no conflict of interest.