Team Fundator: Weighted UNet ensembles with enhanced datasets

The 2022 MapAI: Precision in Building Segmentation competition invited participants to develop systems that segment buildings in orthophotos. In this paper, we present our winning submission for the competition. Submissions are ranked based on the mean of two metrics: Intersection over Union (IoU) and Boundary Intersection over Union (BIoU). The competition evaluates these metrics on two separate tasks with RGB and RGB-Z images, respectively. In our solution, we incorporate heterogeneous, weighted UNet ensembles, multiple extensions of the training data, and area-based post-processing of predictions to achieve leading results on the test and validation data, with scores of 0.7635 and 0.9266, respectively.


Introduction
Segmentation of buildings has a broad range of applications, e.g., in urban planning, disaster recovery, and cartography.
The MapAI: Precision in Building Segmentation competition [1] comprises two tasks: one in which buildings are segmented from RGB images, and another in which a Z-dimension is added, consisting of LIDAR measurements of the pixel-wise height in meters above the local terrain. The competition is graded based on the mean of IoU and BIoU [2] on the two tasks. For further details, we refer to the competition paper [1].

Structure
We continue the paper with a four-part overview of our approach, and end with a brief discussion of the networks' results on the validation data.

Materials and approach
We present our approach in four parts: the data we used for training, our model architecture, post-processing steps, and training setup.

Data preparation and augmentation
Our first step for data preparation was to supersample the training and validation data by combining images into their original 10000 × 5000 and 5000 × 5000 tiles, and then splitting the data with a stride of 500/3, effectively increasing the dataset size eight-fold. This re-tiling is sketched below.
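A minimal sketch of the re-tiling, assuming 500 × 500 crops taken from numpy arrays and that the original crops were disjoint; the rounding and tile handling are illustrative choices, not our verbatim pipeline:

import numpy as np

def supersample(tile, crop=500, stride=500 / 3):
    """Yield overlapping crops from an (H, W, C) orthophoto tile."""
    h, w = tile.shape[:2]
    ys = [round(y) for y in np.arange(0, h - crop + 1, stride)]
    xs = [round(x) for x in np.arange(0, w - crop + 1, stride)]
    for y in ys:
        for x in xs:
            yield tile[y:y + crop, x:x + crop]

# A 5000 x 5000 tile yields 28 x 28 = 784 overlapping crops instead of
# 100 disjoint ones, i.e., roughly the eight-fold increase noted above.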
We also reclassified the dataset masks by adding two new classes: building edges (red) and background pixels nested between two separate buildings (blue). This was done in hopes of highlighting edges to improve BIoU, and to disincentivise models from merging adjacent buildings into one building. An example of these new masks is seen in Figure 1. When reclassifying, a pixel is considered nested if the combined L2 distance to its two closest building-edges is below 18 pixels. We later introduced the option to only add the edge class. We call these datasets Mapai-Reclassified and Mapai-Edge, respectively.
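A minimal sketch of this reclassification, assuming the edge class consists of building pixels bordering the background and using scipy's morphology and distance transforms; the exact class definitions in our pipeline may differ:

import numpy as np
from scipy import ndimage

def reclassify(mask, nested_thresh=18.0):
    """Return 0 = background, 1 = building, 2 = edge, 3 = nested background."""
    mask = mask.astype(bool)
    out = mask.astype(np.uint8)
    edges = mask & ~ndimage.binary_erosion(mask)  # building pixels bordering background
    labels, n = ndimage.label(mask)
    if n >= 2:
        # For every pixel, the distance to each individual building's edge pixels.
        dists = np.stack([ndimage.distance_transform_edt(~(edges & (labels == i)))
                          for i in range(1, n + 1)])
        # Combined L2 distance to the two closest buildings' edges.
        two_closest = np.sort(dists, axis=0)[:2].sum(axis=0)
        out[(two_closest < nested_thresh) & ~mask] = 3
    out[edges] = 2
    return out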
Lastly, we resize the images to a resolution of 768×768 and employ a standard suite of image augmentations by way of the albumentations library [3].
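The exact suite of transforms is not central to the approach; a representative albumentations pipeline could look as follows, where the specific transforms are illustrative assumptions:

import albumentations as A

transform = A.Compose([
    A.Resize(768, 768),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
])

# augmented = transform(image=image, mask=mask)  # same geometry applied to both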
Our LIDAR pre-processing consists of augmentation and normalisation. To augment the LIDAR data, we use random scaling and random noise. Random scaling multiplies the data by scalars drawn from the uniform distribution U(0.85, 1.15). Random noise adds Gaussian noise with mean 0 and standard deviation 0.1.
For normalisation, we first clip the LIDAR into the range [−1, 40] and then divide by 40. The negative LIDAR values exclusively constitute noise, and are often present around water and forests. We do not clip these values to zero, because 0 is semantically meaningful in the data and represents the areas in which the Digital Surface Model (DSM) and Digital Terrain Model (DTM) are equal. Moreover, in task 2, we implement the option to add a third class to the training data to represent regions in which the LIDAR mask is 0.
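These two steps translate directly into code; the following is a minimal sketch:

import numpy as np

rng = np.random.default_rng()

def augment_lidar(z):
    z = z * rng.uniform(0.85, 1.15)             # random scaling
    z = z + rng.normal(0.0, 0.1, size=z.shape)  # additive Gaussian noise
    return z

def normalise_lidar(z):
    return np.clip(z, -1.0, 40.0) / 40.0        # 0 stays semantically meaningful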

Network Architecture
For both task 1 and task 2, we use UNet [4] ensembles to segment buildings. To increase the diversity of the ensembles, we use two encoders, Resnest26d [5] and efficientnet-b1 [6], both pretrained on ImageNet. In task 2, the additional LIDAR channel is accommodated by repeating the weights of the first colour channel.
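A sketch of this channel expansion in PyTorch; the function operates on the encoder's first convolution, whose attribute path depends on the concrete encoder implementation:

import torch
import torch.nn as nn

def expand_first_conv(conv):
    """Return a 4-channel copy of a pretrained 3-channel convolution."""
    new = nn.Conv2d(4, conv.out_channels, conv.kernel_size, stride=conv.stride,
                    padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new.weight[:, :3] = conv.weight          # keep the pretrained RGB weights
        new.weight[:, 3:] = conv.weight[:, :1]   # repeat the first colour channel for Z
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new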
For task 1, we initially train three models per backbone: one on the supersampled MapAI data, and two more on the supersampled Mapai-Edge and Mapai-Reclassified datasets. For task 2, we also use a fourth dataset in which LIDAR pixels with value 0 constitute an additional class.
To increase variety further, we train three more models per task with image size 1024 × 1024. For this purpose, we used the two best-performing datasets, Mapai and Mapai-Edge. In total, we train 9 models for task 1, and 11 models for task 2. We did not have time to train efficientnet-b1 models on 1024-resolution Mapai.
For the standard single-class dataset, we use the sigmoid activation followed by rounding to the nearest integer to get the binary classification mask. Otherwise, we use a softmax activation function, and map its multiclass probabilities to a single-class prediction.
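A sketch of both mappings; for the multiclass case, which extra classes count as building is an assumption (here, an assumed class order of background, building, edge, with edges mapped to building):

import torch

def to_binary_mask(logits, multiclass):
    if not multiclass:                    # single-class head: (N, 1, H, W) logits
        return torch.sigmoid(logits).round()
    pred = logits.softmax(dim=1).argmax(dim=1)  # (N, H, W) class indices
    # Assumed class order: 0 = background, 1 = building, 2 = edge.
    return ((pred == 1) | (pred == 2)).float()  # edges counted as building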
After training, we collect the predictions of each individual model, and test every combination of three or more models with even weighting to find the combinations which perform best. With the two best subsets, we use an evolutionary algorithm to evolve optimised weights over 30 generations with population size 500.
Although this has a relatively small impact on the ensembles' performance, the weighting clearly prefers models with higher validation scores, while seemingly optimising the subsets for diversity as well.
Consequently, the task 1 ensemble uses 5 out of 7 Mapai and Mapai-Edge models (two 768 models are excluded). In task 2, all 1024 resolution models are used alongside Mapai-trained resnest and Mapai-Edge-trained efficientnet models, both at 768 resolution.
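The exhaustive subset search with even weighting can be sketched as follows; score_fn, the mean of IoU and BIoU on the validation data, is assumed to be given, and the evolved per-model weights later replace the even weighting:

from itertools import combinations
import numpy as np

def best_subsets(logits, score_fn, min_size=3, keep=2):
    """logits: one (N, H, W) prediction array per model."""
    results = []
    for k in range(min_size, len(logits) + 1):
        for subset in combinations(range(len(logits)), k):
            avg = np.mean([logits[i] for i in subset], axis=0)  # even weighting
            results.append((score_fn(avg), subset))
    results.sort(reverse=True)
    return results[:keep]  # the two best subsets, fed to the weight evolution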

Post-Processing
We employ post-processing of predictions to remove noise and resize the output to 500 × 500, i.e., the size of the ground truth masks. For resizing we use bilinear interpolation without antialiasing. We found that resizing the logits from the ensemble gave the best performance; we tried resizing the binary output mask, but found that BIoU decreased by 0.02 on the validation data.
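In PyTorch terms, this resizing step amounts to the following sketch:

import torch.nn.functional as F

def resize_and_threshold(logits):
    """logits: (N, 1, 768, 768) ensemble logits."""
    logits = F.interpolate(logits, size=(500, 500), mode="bilinear",
                           align_corners=False, antialias=False)
    return (logits > 0).float()  # sigmoid(x) > 0.5 is equivalent to x > 0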
After resizing, we find the area of each individual building in the prediction. We remove buildings with an area under 190 pixels, unless more than 5% of this area is along the border of the image. We found this approach to be marginally better than morphological erosion followed by dilation. We decided not to fill in holes in the prediction, because any improvement seemed to cancel out by introducing false positives.
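A minimal sketch of this area-based filtering with scipy connected components; the border test is our reading of the 5% criterion:

import numpy as np
from scipy import ndimage

def filter_small_buildings(mask, min_area=190, border_frac=0.05):
    border = np.zeros_like(mask, dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    labels, n = ndimage.label(mask)
    for i in range(1, n + 1):
        building = labels == i
        area = building.sum()
        if area < min_area and (building & border).sum() / area <= border_frac:
            mask[building] = 0  # drop small buildings away from the image border
    return mask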

Training
We train the networks for 50 and 40 epochs for the 768 and 1024 models, respectively, using Jaccard loss and the RAdam optimiser [7] with an initial learning rate of 0.0002 on the PolyLR schedule. We used an RTX 3090 GPU with batch sizes of 12 and 8 for the Resnest26d and efficientnet-b1 models, respectively. For the 1024 models, we used batch sizes of 6 and 5.
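A sketch of the optimiser and schedule; the polynomial decay power (0.9 below) is an assumption, as the text only specifies RAdam, the initial learning rate, and PolyLR:

import torch

def make_optimizer(model, max_epochs):
    opt = torch.optim.RAdam(model.parameters(), lr=2e-4)  # initial lr 0.0002
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda e: (1.0 - e / max_epochs) ** 0.9)  # PolyLR decay
    return opt, sched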

Discussion
From the results, we find that the task 1 ensemble is outperformed by the task 2 models on both metrics. This observation also holds for standalone networks, where even the lowest-performing task 2 models outperform the task 1 ensemble. It is clear that a deficiency in BIoU is the main cause of this.
We think that the performance drop between the dataset splits is partly caused by the DTM orthophotos in the test set, which skew rooftops away from their foundations and thus away from the ground truth. Nonetheless, we think further research into this discrepancy is warranted.

Conflict of interest
The authors state no conflict of interest.