BuildSeg: A General Framework for the Segmentation of Buildings

Building segmentation from aerial images and 3D laser scanning (LiDAR) is a challenging task due to the diversity of backgrounds, building textures, and image quality. While current research using different types of convolutional and transformer networks has considerably improved the performance on this task, even more accurate segmentation methods for buildings are desirable for applications such as automatic mapping. In this study, we propose a general framework termed \emph{BuildSeg} employing a generic approach that can be quickly applied to segment buildings. Different data sources were combined to increase generalization performance. The approach yields good results for different data sources as shown by experiments on high-resolution multi-spectral and LiDAR imagery of cities in Norway, Denmark and France. We applied ConvNeXt and SegFormer based models on the high resolution aerial image dataset from the MapAI-competition. The methods achieved an IOU of 0.7902 and a boundary IOU of 0.6185. We used post-processing to account for the rectangular shape of the objects. This increased the boundary IOU from 0.6185 to 0.6189.


I. INTRODUCTION
Detecting buildings from remote sensing imagery has been extensively studied [1]-[3] as it is of great importance for many fields, such as urban planning, population estimation, economic development, and topographic map production. Since the amount of data cannot be processed manually, data-driven machine learning methods are needed to reduce the manual work required to obtain reliable urban development mappings.
Segmenting buildings on a large scale is a challenging task because satellite or aerial images can be very diverse, for example, because of different styles of architecture, building materials, and topography. A number of benchmarks for the segmentation of buildings have been published [4]-[12]. Since the silhouettes of buildings can be very different, combining several datasets with different characteristics can lead to more generally applicable building segmentation models.
In this study, we propose a framework for building segmentation referred to as BuildSeg. We consider the Inria Aerial Image Labeling Benchmark [4] combined with the MapAI-competition dataset [6] to improve the segmentation performance on the latter. When designing the BuildSeg framework, our goal was a segmentation pipeline that is generally applicable. Therefore, several benchmarks and corresponding models are available within the framework [4]-[12].
Neural networks, and in particular convolutional neural networks (CNNs), have become the go-to methods for image segmentation; see [13] for a recent review. The U-Net [14] is one of the fundamental segmentation architectures, and we have successfully applied it to remote sensing imagery (e.g., [15], [16]). It uses an encoder/decoder structure that processes the input image at different scales and allows the detection of high-frequency patterns while remaining computationally feasible.
The original U-Net architecture can be generalized by replacing the encoder and decoder with tailored networks. This makes the U-Net very versatile and allows the use of state-of-the-art encoders. In our framework, we consider two different U-Net variants, SegFormer [17] and ConvNeXt U-Net [18], where the decoder of the ConvNeXt U-Net consists of backwards strided convolutions [19].
The main contributions of this study can be summarized as follows: (1) we propose a general framework called BuildSeg based on [20] for segmenting buildings in aerial images of different resolutions; (2) we explore how 3D information from LiDAR affects the performance of deep CNN models; (3) we combine different datasets and apply rectangle-aware post-processing to create rectangular boundaries that match the labels more accurately. The proposed approach achieved an IOU of 0.7902 for the segmentation of images in MapAI.

II. METHOD
We developed our framework for the MapAI challenge [6], which provides both aerial images and LiDAR data. The challenge formulates two tasks. The first is the segmentation of buildings using only the aerial imagery. In the second task, the LiDAR data must be segmented either with or without aerial images.
We additionally used the data from [4] to improve the performance. A subset of 5000 images was considered as additional training data. To align the image sizes, we cropped the input images to 500 × 500.
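The cropping step can be illustrated with a minimal sketch. The text does not state how the crops were taken (tiled, random, or centered), so the non-overlapping tiling below is an assumption, and the function name `crop_tiles` is hypothetical.

```python
import numpy as np


def crop_tiles(image, tile=500):
    """Split an (H, W, ...) array into non-overlapping tile x tile crops.

    Assumption: crops are taken on a regular grid starting at the top-left
    corner; partial tiles at the right and bottom edges are discarded.
    """
    height, width = image.shape[:2]
    tiles = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles


# Usage with a synthetic image; real data would be loaded from disk.
example = np.zeros((1200, 1000, 3), dtype=np.uint8)
crops = crop_tiles(example, tile=500)
```

Discarding partial edge tiles keeps every crop at exactly 500 × 500; padding the edges instead would be an equally plausible choice.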
We tried different models such as the standard U-Net and variants of it, namely ConvNeXt and SegFormer [17]. For the ConvNeXt model, ConvNeXt [18] is used as encoder and backwards strided convolution [19] as decoder. We also tried EfficientNet [21] as encoder, but the results were not as good. All encoders were pre-trained on ImageNet.
Two metrics were considered to measure the performance: intersection over union (IOU) and boundary intersection over union (BIOU) [22].
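As a rough illustration of the two metrics, the sketch below computes IOU on binary masks and a boundary IOU in the spirit of [22], where the boundary region of a mask is approximated as the mask minus its eroded version. The `width` parameter, the 3 × 3 structuring element, and the convention for two empty masks are assumptions; the boundary width used in the experiments is not stated in the text.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)


def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square element; outside the image is background."""
    out = np.asarray(mask, dtype=bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        acc = np.ones((h, w), dtype=bool)
        for dy in range(3):
            for dx in range(3):
                acc &= padded[dy:dy + h, dx:dx + w]
        out = acc
    return out


def boundary(mask, width=1):
    """Pixels of the mask that lie within `width` erosion steps of its contour."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~erode(mask, width)


def boundary_iou(pred, target, width=1):
    """IOU restricted to the boundary regions of both masks."""
    return iou(boundary(pred, width), boundary(target, width))


# Usage: a 4x4 square and the same square shifted by one pixel.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True
```

For this shifted square, the boundary IOU is lower than the plain IOU, which reflects that the boundary variant penalizes contour misalignment more strongly.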
LiDAR height data were added directly as an additional channel to the multi-spectral data when available.
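Adding the LiDAR heights as an extra channel could look like the following sketch. The channel-last layout, the float32 conversion, and the function name `stack_lidar` are assumptions for illustration; any normalization of the height values is left out because the text does not describe it.

```python
import numpy as np


def stack_lidar(image, height):
    """Append a LiDAR height map as one extra channel.

    image:  (H, W, C) array of spectral channels (channel-last, assumed)
    height: (H, W) array of LiDAR heights
    returns an (H, W, C + 1) float32 array
    """
    image = np.asarray(image, dtype=np.float32)
    height = np.asarray(height, dtype=np.float32)
    if image.shape[:2] != height.shape:
        raise ValueError("image and height map must share the same H x W")
    return np.concatenate([image, height[..., None]], axis=-1)


# Usage with synthetic data.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
lidar = np.full((4, 4), 2.5)
stacked = stack_lidar(rgb, lidar)
```

With this layout, the first convolution of the encoder has to accept one more input channel than for the image-only task.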
For post-processing, we applied a sequence of morphological opening and closing operations to detect lines and then removed points not matching the hypothesis of a rectangular structure.
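The morphological part of this step can be sketched as follows. The 3 × 3 structuring element, the single pass of each operation, and the border handling are assumptions; the subsequent removal of points that do not fit a rectangular structure is not reproduced here because the text does not specify it.

```python
import numpy as np


def _neighborhood(mask, require_all):
    """Apply a 3x3 binary erosion (require_all=True) or dilation (False)."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    # Pad so the image border never changes the result:
    # True for erosion, False for dilation (an assumption of this sketch).
    padded = np.pad(mask, 1, mode="constant", constant_values=require_all)
    shifted = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    return shifted.all(axis=0) if require_all else shifted.any(axis=0)


def erode(mask):
    return _neighborhood(mask, True)


def dilate(mask):
    return _neighborhood(mask, False)


def opening(mask):
    """Erosion followed by dilation: removes small isolated foreground specks."""
    return dilate(erode(mask))


def closing(mask):
    """Dilation followed by erosion: fills small holes and gaps."""
    return erode(dilate(mask))


def clean_mask(mask):
    """One opening followed by one closing."""
    return closing(opening(mask))


# Usage: an 8x8 building with a one-pixel hole plus an isolated speck.
noisy = np.zeros((12, 12), dtype=bool)
noisy[2:10, 2:10] = True
noisy[5, 5] = False   # hole inside the building
noisy[0, 11] = True   # isolated false positive
cleaned = clean_mask(noisy)
```

In this toy case the opening removes the speck and the closing fills the hole, leaving a solid square.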

III. EXPERIMENTAL RESULTS
The results are summarized in Table I. The model SegFormer-B5 performed best in terms of IOU and BIOU for the images. Note that SegFormer-B5 and SegFormer-B4 have more layers than SegFormer-B0.
Figure 2 illustrates results of the SegFormer-B5 model, and it can be seen that the buildings were nicely captured.
The averaged score, computed as the mean of IOU and BIOU, was 0.7044 without post-processing and increased slightly to 0.7045 after post-processing, so the latter should be preferred if IOU and BIOU are weighted equally. When combined with LiDAR, the method reached an IOU of 0.8506 and a BIOU of 0.7461.
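The averaged score is a plain arithmetic mean, as in the short sketch below; the function name `averaged_score` is hypothetical. Inserting the reported image-only values (IOU 0.7902, BIOU 0.6185) gives 0.70435, which agrees with the reported 0.7044 up to rounding of the published figures.

```python
def averaged_score(iou, biou):
    """Mean of IOU and boundary IOU, with both metrics weighted equally."""
    return 0.5 * (iou + biou)


# Usage with the values reported for the aerial-image task without post-processing.
score = averaged_score(0.7902, 0.6185)
```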

IV. CONCLUSION
For the MapAI-competition, we proposed a solution that utilizes additional building datasets and current state-of-the-art deep learning architectures. The method achieved an IOU of 0.7902 and a boundary IOU of 0.6185 for the task of segmenting buildings in aerial images. Using additional information from LiDAR further improved the results, increasing the IOU to 0.8506 and the BIOU to 0.7461.

Figure 1. Examples from the MapAI-competition [6]. The left image (a) shows the ground-truth data, and the right image (b) the prediction created by the ConvNeXt model.

Figure 2. Building predictions with SegFormer; the prediction results of the model are shown as an overlay.

Table I. PERFORMANCE OF DIFFERENT MODELS ON THE MAPAI-COMPETITION IMAGE TEST SET (WITHOUT POST-PROCESSING). AS BASELINE WE SHOW A STANDARD U-NET [14].