LAB-Net: Lidar and aerial image-based building segmentation using U-Nets

We describe our approach in the 2022 NORA MapAI competition. The objective of the competition was to construct methods able to detect and segment buildings from aerial imaging and laser data. There were two subtasks: (1) building segmentation from aerial imaging; (2) building segmentation from lidar data, optionally combined with aerial images. We trained multiple dynamic U-Net models with self-attention layers. For Task 1, we used a ResNet34-based encoder pre-trained on the ImageNet challenge dataset and further pre-trained the U-Net on another similar aerial image dataset. For Task 2, we adapted the dynamic U-Net to deal with multispectral data. Our ensembles placed us in second place, with the top score on Task 1. The complete source code for reproducing our results is available at https://github.com/HVL-ML/LAB-Net.


Introduction
Accurate detection and segmentation of buildings from aerial imaging and lidar data enables valuable analyses in remote sensing, useful for, among other things, urban planning, map construction, risk assessment, property assessment, and insurance purposes [1].
Deep learning-based methods have become the go-to approach across a range of problems in remote sensing because of the increased accuracy in deriving results from aerial and satellite imaging compared to traditional methods [2,3,4]. In this work, we explored the use of deep learning for building segmentation using the data and tasks of a remote sensing challenge.

Materials and methods
The data and task were from the NORA MapAI Challenge [5]. There were two tasks: (1) using aerial imaging to segment buildings and (2) doing the same using data from lidar, optionally combined with the aerial images. The training set had 7000 500x500 image patches derived from 70 different locations in Denmark, each covered by a 5000x5000 image. The validation set had 1500 image patches from 15 locations in Denmark. The hidden test set on which the submission was scored was derived from locations in Norway, both rural and urban. The "ground truth" labels were generated using a digital surface model (DSM) for the training and validation data, while the test set labels were from a digital terrain model (DTM) [5,6]. The submissions were ranked by the average of the intersection over union (IoU) and the boundary IoU on the test set.
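The ranking metric can be sketched as follows. This is an illustrative implementation of plain IoU for binary masks, not the official evaluation code; boundary IoU applies the same formula restricted to a thin band of pixels around the mask boundaries.

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union for binary masks (1 = building pixel)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Two empty masks agree perfectly by convention.
    return float(inter / union) if union > 0 else 1.0
```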
We used the fastai library to construct and train our models [7]. We explored multiple strategies for data augmentation, loss function designs, optimizers, model architectures, and training strategies. In the end, we used an ensemble of four models for Task 1 and two models for Task 2. We used a single NVIDIA RTX A6000 GPU to train the models. In the following, we describe some of our major design choices. The complete setup can be found in the accompanying code repository https://github.com/HVL-ML/LAB-Net.
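The exact combination rule for the ensemble members is in the repository; a common and minimal approach, sketched here under that assumption, is to average the per-pixel building probabilities across models and threshold the mean (`ensemble_predict` is an illustrative name, not from our code):

```python
import numpy as np

def ensemble_predict(prob_maps: list, threshold: float = 0.5) -> np.ndarray:
    """Average per-pixel building probabilities from several models,
    then threshold the mean to obtain a single binary mask."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)
```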
Data preparation and cleaning. The images were made available by the organizers as 500x500 patches, and we stitched these back together into the original 5000x5000 images. To better deal with issues arising from buildings spanning multiple patches and sometimes being cut in unfortunate ways, we extracted new 500x500 patches from random locations and used them as additional training data. A significant number of image patches had erroneous "ground truth" labels, with empty masks where buildings could clearly be seen. To find such images, we first trained a building-detection classifier to detect whether a patch contained buildings. By sorting the images this classifier misclassified by the magnitude of the loss, we efficiently surfaced several labeling mistakes. We then used an early version of our segmentation model and manually inspected cases where our model disagreed the most with the ground truth labels. In total, we identified 231 clearly mislabeled training patches and 31 mislabeled validation patches.
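The random patch extraction from the stitched tiles can be sketched as below (a minimal version; `random_patch` is an illustrative name, and the repository version operates within the fastai data pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch(image: np.ndarray, mask: np.ndarray, size: int = 500):
    """Cut a random size x size patch from a stitched tile and its mask,
    so buildings cut at the original patch borders can appear whole."""
    h, w = image.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return image[y:y + size, x:x + size], mask[y:y + size, x:x + size]
```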
Pseudo-labeling. Pseudo-labeling is a technique in which a model trained on labeled data is used to generate labels for a portion of unlabeled or mislabeled data. This data can then be used to train or fine-tune the model. Depending on the performance of the model used to pseudo-label data, this can be a useful way to use large amounts of unlabeled data and reduce the need for ground truth labels. In our case, we used pseudo-labeling to correct erroneous ground truth labels.
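An illustrative sketch of this correction step, assuming a simple disagreement-based rule (the threshold and the exact criterion are assumptions; the actual procedure is in the repository):

```python
import numpy as np

def pseudo_label(gt_mask: np.ndarray, model_prob: np.ndarray,
                 disagreement_thresh: float = 0.3) -> np.ndarray:
    """Replace a suspect ground-truth mask with the model's prediction
    when the two disagree on a large fraction of pixels."""
    pred = (model_prob >= 0.5).astype(np.uint8)
    disagreement = np.mean(pred != gt_mask)
    return pred if disagreement > disagreement_thresh else gt_mask
```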
Pretraining. We used the Inria Aerial Image Labeling dataset [8] to pre-train our aerial models. It consists of images and corresponding building masks, with 180 training tiles of size 5000 x 5000 from various locations across the world. To conform with the MapAI data, we split each tile into 100 patches of size 500 x 500 using a regular 10 x 10 grid.
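The grid split can be sketched as (illustrative helper name; a 5000 x 5000 tile yields 10 x 10 = 100 non-overlapping patches):

```python
import numpy as np

def grid_split(tile: np.ndarray, size: int = 500) -> list:
    """Split a large tile into non-overlapping size x size patches
    on a regular grid, row by row."""
    h, w = tile.shape[:2]
    return [tile[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]
```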
Data augmentation. Motivated by the observed variation in the aerial imaging data, we transformed the images batch-wise on the GPU by randomly modifying the contrast, saturation, and brightness. We also used dihedral rotations, continuous rotations of ±45°, zooming with a factor of ±0.1, and reflection padding. Additionally, we used a CutMix data augmentation strategy [9] by randomly extracting and inserting small patches.
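The CutMix-style step can be sketched as copying a random square region, image and label together, from one sample into another (a minimal single-pair version with an illustrative name and patch size; the repository version works batch-wise):

```python
import numpy as np

rng = np.random.default_rng(0)

def cutmix_pair(img_a, mask_a, img_b, mask_b, patch: int = 64):
    """Copy a random patch x patch square (image and mask together)
    from sample B into sample A, keeping pixels and labels consistent."""
    h, w = img_a.shape[:2]
    y = rng.integers(0, h - patch + 1)
    x = rng.integers(0, w - patch + 1)
    img_out, mask_out = img_a.copy(), mask_a.copy()
    img_out[y:y + patch, x:x + patch] = img_b[y:y + patch, x:x + patch]
    mask_out[y:y + patch, x:x + patch] = mask_b[y:y + patch, x:x + patch]
    return img_out, mask_out
```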
Progressive resizing. Our pre-trained models and our combined aerial and lidar models were first trained on down-scaled images (250 × 250) before they were trained further on the full-scale image patches. This progressive resizing strategy, first introduced by fastai [7], has been shown to increase performance [10].
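The downscaling step can be approximated by 2x2 mean pooling, which maps a 500 x 500 patch to 250 x 250 (an illustrative sketch; in practice, fastai's resizing transforms handle this inside the data pipeline):

```python
import numpy as np

def downscale_half(img: np.ndarray) -> np.ndarray:
    """Halve spatial resolution by 2x2 mean pooling, e.g. 500x500 -> 250x250.
    Works for 2-D masks and for H x W x C images alike."""
    h, w = img.shape[:2]
    cropped = img[:h // 2 * 2, :w // 2 * 2]  # drop odd edge rows/cols
    return cropped.reshape(h // 2, 2, w // 2, 2, *img.shape[2:]).mean(axis=(1, 3))
```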
Model architectures. Our model architectures were variants of U-Nets with a ResNet34 encoder pre-trained on ImageNet data. We equipped the models with self-attention layers, and some of our ensemble members were trained with the Mish activation function [11], others with ReLU.
Loss functions. We set up and explored multiple loss functions tailored to the task of segmenting objects in images while paying particular attention to capturing their boundaries. In the end, the models trained with cross entropy combined with either the Tversky loss [12], Focal Tversky [13], or Focal Tanimoto [12] had the highest validation performance.
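For reference, the Tversky loss generalizes the Dice loss by weighting false positives and false negatives separately; a minimal binary-mask sketch (illustrative parameter values, not necessarily those used in training):

```python
import numpy as np

def tversky_loss(prob: np.ndarray, target: np.ndarray,
                 alpha: float = 0.3, beta: float = 0.7,
                 eps: float = 1e-7) -> float:
    """Tversky loss for one binary mask: alpha weights false positives,
    beta weights false negatives; alpha = beta = 0.5 recovers the Dice loss."""
    p, t = prob.ravel().astype(float), target.ravel().astype(float)
    tp = (p * t).sum()
    fp = (p * (1 - t)).sum()
    fn = ((1 - p) * t).sum()
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)
```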
Optimizer and learning rate policies. We used the Ranger optimizer [14] combined with cosine annealing of the learning rate. Experiments have shown that such learning rate schedules can be useful for models with complex loss landscapes by helping models escape local minima [15].
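Ranger itself combines RAdam with Lookahead; the cosine annealing schedule alone can be sketched as a function of training progress (a standard formulation, with an illustrative function name):

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing: decay the learning rate from lr_max to lr_min
    over total_steps, following half a cosine period."""
    t = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```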

Results
Our ensemble of four models trained on the aerial images (Task 1) achieved test set IoU and BIoU scores of 0.7879 and 0.6245, respectively, the best Task 1 result in the competition. Our ensemble of two models trained on combined lidar and aerial images (Task 2) scored 0.8711 and 0.7504. Overall, we finished in second place.

Discussion
Using generic techniques from modern deep learning, we were able to obtain competitive results.
As this was a competition with a strict deadline, we could not pursue many of the ideas we had at the outset. Our ablation study to isolate the most critical design choices was relatively limited and only performed using a few epochs for each choice of settings. Given more time, we would expand upon this. Moreover, we would have drawn more on our experience in medical imaging (e.g., [16,17]), an area with many problems that are analogous to those in remote sensing, including large images with relatively small objects of interest and images that are often multispectral.