
Semantic Segmentation of Geospatial Imagery with Deep Learning Part 4 – U-Net

Introduction

In the previous post, we successfully created the image crops that we are going to use for training the U-Net model. In this post, we talk about what the U-Net architecture looks like and how it can be implemented using PyTorch. Finally, we discuss how the training data can be accessed so that it is usable by the U-Net.

U-Net Architecture

The U-Net was developed by Ronneberger et al. in 2015. The name stems from the fact that its architecture somewhat resembles the letter “U”. On a very abstract level, the architecture comprises two “paths”. The first path, the contracting path, compresses the input image into gradually higher-level representations by reducing the image dimensions while increasing the number of channels. For example, in the original U-Net architecture, an image with a spatial extent of 572 x 572 pixels and one channel (i.e. a grayscale image) is ultimately compressed into a feature map of 28 x 28 pixels and 1024 channels. The original spatial information hence needs to be encoded into these channels to be preserved. The second path, the expanding path, takes this compressed information and gradually transforms it back until its spatial extent matches the desired extent of the resulting segmentation. From the previous post, we have learned that the output, i.e. the segmentation image, is smaller than the input image to be analyzed. For each level of compression, there is a corresponding level of expansion whose feature maps have the same spatial extents. Hence, these can be concatenated in the expanding path to combine information at a lower abstraction level with information at a higher abstraction level. The concrete U-Net architecture we are going to implement deviates somewhat from the original one. Most importantly, the input and output dimensions are smaller, there is no dropout, and cropping to the desired output extent is only carried out once as part of one of the final layers instead of multiple times at intermediate steps.
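
To make this size arithmetic concrete, here is a small sketch (not part of the implementation below) that walks an input of 572 x 572 pixels through the contracting path of the original architecture, assuming two unpadded 3 x 3 convolutions per level (each removing two pixels per dimension) followed by a 2 x 2 max pooling that halves the extent:

# Dimension bookkeeping for the contracting path of the original U-Net:
# two unpadded 3x3 convolutions per level, then a 2x2 max pooling.
size = 572
for channels in (64, 128, 256, 512):
    size = (size - 2 - 2) // 2
    print(f"after the level with {channels} channels: {size} x {size} pixels")
size = size - 2 - 2  # the bottleneck applies two more convolutions, no pooling
print(f"bottleneck: {size} x {size} pixels with 1024 channels")

Running this prints 284, 140, 68 and 32 pixels for the four levels and 28 x 28 pixels for the bottleneck, matching the figure in the original paper.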

U-Net Implementation

The U-Net architecture can be implemented using three different types of blocks, which we call ContractionBlock, ExpansionBlock and BottleneckBlock. As one can imagine, the ContractionBlock and the ExpansionBlock are part of the contracting and expanding paths, respectively. The BottleneckBlock represents the innermost block, which performs the highest compression before its output is fed into the first ExpansionBlock. The first thing we need to do is to import the relevant PyTorch libraries:

from torch.nn import ConvTranspose2d
from torch.nn import Conv2d
from torch.nn import MaxPool2d
from torch.nn import Module
from torch.nn import ReLU
from torchvision.transforms import CenterCrop
from torch.nn import functional as F
import torch

Now, we can define the ContractionBlock module:

class ContractionBlock(Module):
    def __init__(self, inChannels, outChannels):
        super().__init__()
        self.conv1 = Conv2d(inChannels, outChannels, 3)
        self.relu1 = ReLU()
        self.conv2 = Conv2d(outChannels, outChannels, 3)
        self.relu2 = ReLU()
        self.pool = MaxPool2d(2)
        
        
    def forward(self, x):
        x_padded = F.pad(x, (1, 1, 1, 1))
        conv1 = self.conv1(x_padded)
        conv1_padded = F.pad(conv1, (1, 1, 1, 1))
        conv1_relu = self.relu1(conv1_padded)
        conv2 = self.conv2(conv1_relu)
        conv2_relu = self.relu2(conv2)
        pooled = self.pool(conv2_relu)
        
        return pooled

In PyTorch, it is common practice to structure architectures in modules, i.e., logical blocks that typically comprise a fixed set of layers. In the case of the ContractionBlock, we define two convolutional layers, each accompanied by a ReLU activation layer, and a final max pooling layer (lines 4-8). The convolutional layers analyze the input spatially, the ReLU layers introduce non-linearity into the network, and the max pooling layer essentially halves the spatial dimensions of the input by taking the maximum activation of each block of four neighboring cells in the image grid. For more information about these layers, you might want to have a look at the official documentation of convolutional layers, ReLU activations and max pooling layers. The only parameters we need to define are the number of channels this module receives and the number of channels it should return. The latter directly determines the number of kernels used by the convolutional layers.
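
For intuition, the following small sketch shows what these layer types do to the shape of a feature map; the 16 input channels, 32 output channels and the 320 x 320 pixel extent are just example values:

conv = Conv2d(16, 32, 3)              # unpadded 3x3 convolution
pool = MaxPool2d(2)                   # 2x2 max pooling
x = torch.randn(1, 16, 320, 320)      # a batch containing one feature map
print(conv(x).shape)                  # torch.Size([1, 32, 318, 318]): extent shrinks by 2
print(pool(x).shape)                  # torch.Size([1, 16, 160, 160]): extent is halved
print(ReLU()(x).shape)                # torch.Size([1, 16, 320, 320]): shape unchanged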

Once the relevant layers are defined, they are applied one after another in the forward method, each working on the output of the previous layer (lines 12-18). Note that we also introduce padding operations before applying the convolutions. The reason for this lies in the property of unpadded convolutional layers to reduce the spatial extent of a feature map by two in each dimension. Without padding, this would require intermediate cropping operations so that the max pooling layers can consistently work on dimensions that are divisible by two. To give an example: without padding, the original input of 320 x 320 pixels would be reduced to 316 x 316 pixels after applying two convolutional layers. Dividing by two yields 158 x 158 pixels and, after applying another ContractionBlock, 77 x 77 pixels, which are dimensions that cannot be divided by two anymore. This problem could be solved by applying intermediate cropping operations. Using padding, however, allows these dimensions to become 160 x 160 pixels and 80 x 80 pixels, respectively. Hence, a division by two is consistently possible, which yields a more elegant way to assemble the network architecture.
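
To see these dimensions in practice, here is a small shape check; the input extent of 320 x 320 pixels and the channel numbers 3 and 16 are simply the values used later in this post:

block = ContractionBlock(3, 16)
dummy_input = torch.randn(1, 3, 320, 320)  # a batch containing one RGB crop
print(block(dummy_input).shape)
# torch.Size([1, 16, 160, 160]): padding keeps the extent at 320 x 320 through
# both convolutions, the max pooling then halves it to 160 x 160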

The ExpansionBlock somewhat performs the complementary operations to the ContractionBlock:

class ExpansionBlock(Module):
    def __init__(self, inChannels, outChannels):
        super().__init__()
        self.convT = ConvTranspose2d(inChannels, outChannels, 2, 2)
        self.relu1 = ReLU()
        self.conv2 = Conv2d(outChannels, outChannels, 3)
        self.relu2 = ReLU()
        
        
    def forward(self, x):
        convT = self.convT(x)
        conv1_padded = F.pad(convT, (1, 1, 1, 1))
        conv1_relu = self.relu1(conv1_padded)
        conv2 = self.conv2(conv1_relu)
        conv2_relu = self.relu2(conv2)
        
        return conv2_relu

This block consists of a transposed convolutional layer, i.e., a convolutional layer that doubles the spatial extent of the input feature map, followed by a normal convolutional layer (lines 4 & 6). Both are accompanied by ReLU layers (lines 5 & 7). In the forward method, these layers are applied to the output of the respective preceding layer. Note that, again, a padding operation is necessary to prevent the creation of unfavorable dimensions.
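
Again, a small shape check illustrates the behavior; the channel numbers 128 and 32 and the 20 x 20 pixel input correspond to the innermost expansion block of the network assembled below:

block = ExpansionBlock(128, 32)
dummy_input = torch.randn(1, 128, 20, 20)  # e.g. the output of the bottleneck
print(block(dummy_input).shape)
# torch.Size([1, 32, 40, 40]): the transposed convolution doubles the spatial
# extent, padding keeps it stable through the subsequent convolution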

Finally, the bottleneck module looks as follows:

class BottleneckBlock(Module):
    def __init__(self, inChannels, outChannels):
        super().__init__()
        self.conv1 = Conv2d(inChannels, outChannels, 3)
        self.relu1 = ReLU()
        self.conv2 = Conv2d(outChannels, outChannels, 3)
        self.relu2 = ReLU()
        
        
    def forward(self, x):
        x_padded = F.pad(x, (1, 1, 1, 1))
        conv1 = self.conv1(x_padded)
        conv1_padded = F.pad(conv1, (1, 1, 1, 1))
        conv1_relu = self.relu1(conv1_padded)
        conv2 = self.conv2(conv1_relu)
        conv2_relu = self.relu2(conv2)
        
        return conv2_relu

Essentially, the BottleneckBlock is the same as the ContractionBlock, the only difference being that the max pooling layer is omitted because the feature map does not need to be downsampled any further. Now that we have all our main blocks defined, we can assemble our U-Net:

class UNet(Module):
    def __init__(self, out_size=(200,  200)):
        super().__init__()
        
        self.out_size = out_size
        
        self.contraction_block1 = ContractionBlock(3, 16) # 160
        self.contraction_block2 = ContractionBlock(16, 32) # 80
        self.contraction_block3 = ContractionBlock(32, 64) # 40
        self.contraction_block4 = ContractionBlock(64, 128) # 20
        
        self.bottleneck_block = BottleneckBlock(128, 128)

        self.expansion_block4 = ExpansionBlock(128, 32)
        self.expansion_block3 = ExpansionBlock(96, 16)
        self.expansion_block2 = ExpansionBlock(48, 8)
        self.expansion_block1 = ExpansionBlock(24, 4)
        
        self.head = Conv2d(4, 1, 1)
        
        
    def forward(self, x):
        x_d_1 = self.contraction_block1(x)
        x_d_2 = self.contraction_block2(x_d_1)
        x_d_3 = self.contraction_block3(x_d_2)
        x_d_4 = self.contraction_block4(x_d_3)
        
        x_bl = self.bottleneck_block(x_d_4)
        
        x_u_4 = self.expansion_block4(x_bl)
        concat_4 = torch.cat([x_d_3, x_u_4], dim=1)

        x_u_3 = self.expansion_block3(concat_4)
        concat_3 = torch.cat([x_d_2, x_u_3], dim=1)
        
        x_u_2 = self.expansion_block2(concat_3)
        concat_2 = torch.cat([x_d_1, x_u_2], dim=1)

        x_u_1 = self.expansion_block1(concat_2)
        
        head = self.head(x_u_1)
        out = CenterCrop((self.out_size[0], self.out_size[1]))(head)
        
        return out
        

As we can see, the UNet module is assembled from the building blocks we have defined before. We first create the respective blocks of the contracting path (lines 7-10), then the bottleneck (line 12) and finally the blocks of the expanding path (lines 14-17). Note that the number of input channels of each ExpansionBlock is the sum of the number of output channels of the previous ExpansionBlock and the number of output channels of the corresponding ContractionBlock. For example, expansion_block3 (line 15) works on 96 channels: 32 channels from expansion_block4 and another 64 from contraction_block3 (line 9). We also define the size the output should have (line 5) and the “head”, which ensures that the output only has a single channel (line 19). In the forward method, we first gradually apply the contraction blocks to the input feature maps (lines 23-26) and pass the result into the bottleneck block (line 28). The output of the bottleneck is provided to the first expansion block, expansion_block4 (line 30). Afterwards, the output of expansion_block4 and the output of the corresponding block of the contracting path are concatenated (line 31). The concatenated feature map is then passed to expansion_block3 (line 33). This process is repeated two more times (lines 34-39) until the final feature map is provided to the head (line 41) and cropped to the desired output size (line 42).
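
As a quick sanity check, we can feed a random tensor with the shape of one of our image crops (320 x 320 pixels, three channels) through the network and inspect the output shape:

model = UNet()
dummy_input = torch.randn(1, 3, 320, 320)  # a batch containing one RGB crop
print(model(dummy_input).shape)
# torch.Size([1, 1, 200, 200]): a single-channel segmentation map,
# center-cropped to the desired output extent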

The Dataset Class

Now that we have a functioning U-Net implementation, we also need to discuss how the U-Net can be provided with the data it requires for training. For this purpose, PyTorch provides a Dataset class from which a concrete class can be derived that defines how the required data is accessed:

import torch
from torch.utils.data import Dataset
from osgeo import gdal
import numpy as np

class SegmentationDataset(Dataset):
    def __init__(self, swissimage_paths, mask_paths):
        assert(len(swissimage_paths) == len(mask_paths))

        self.swissimage_paths = swissimage_paths
        self.mask_paths = mask_paths        
        
        
    def __len__(self):
        return len(self.swissimage_paths)
        
        
    def __getitem__(self, idx):
        swissimage_dataset = gdal.Open(self.swissimage_paths[idx])
        swissimage_array = swissimage_dataset.ReadAsArray() / 255
    
        mask_dataset = gdal.Open(self.mask_paths[idx])
        mask_array = np.expand_dims(mask_dataset.ReadAsArray(), 0) / 255
        
        return (swissimage_array, mask_array)

The class contains three methods. The __init__ method stores the paths of the data to be loaded (lines 8-11): swissimage_paths and mask_paths are lists of paths to the individual cropped SWISSIMAGE images and lake masks, respectively. Note that we use an assert statement to make sure that we have the same number of images for SWISSIMAGE and the lake masks (line 8). The __len__ method (line 14) returns the number of available image crops for training. Finally, the __getitem__ method is responsible for returning a single tuple consisting of a single SWISSIMAGE crop and the corresponding lake mask. Both images are opened via GDAL and then normalized to the range [0, 1] (lines 19-23). The concrete images are referenced using the idx parameter provided to the method. An additional dummy dimension is added to the lake mask array to mimic a single channel, conforming to the interface required by PyTorch (line 23).
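
To illustrate how the class might be used, here is a small sketch; the file paths are purely hypothetical and the batch size is just an example value. Note that the arrays are loaded as float64, so a cast to float32 is included to match the data type of the model weights:

from torch.utils.data import DataLoader

# Hypothetical lists of crop paths; in practice they point to the image
# crops created in the previous post.
swissimage_paths = ["crops/swissimage/crop_0.tif", "crops/swissimage/crop_1.tif"]
mask_paths = ["crops/masks/crop_0.tif", "crops/masks/crop_1.tif"]

dataset = SegmentationDataset(swissimage_paths, mask_paths)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for images, masks in loader:
    # the default collate function stacks the NumPy arrays into float64
    # tensors; the model weights are float32, hence the cast
    images, masks = images.float(), masks.float()
    print(images.shape, masks.shape)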

Conclusion

In this blog post, we have discussed the implementation of the U-Net architecture as well as the Dataset class we need to access the data during training. The U-Net architecture deviates from the original one in some aspects, which might make it slightly more elegant to use, but in practice should not yield substantial differences regarding its performance. In the next blog post, we are going to have a look at how the model can be trained.
