Comparing CNNs VS Vision Transformers for Medical Image Segmentation

Both models learn to recognize the liver through supervised learning:

After seeing thousands of CT scans with manually labeled liver masks, the models learn:

Convolution kernels (3x3x3 filters) slide across the image detecting patterns:

Each convolution only sees a small local window

SwinUNETR (Vision Transformer) - How It Works

SwinTransformer does perform better but that comes with a price- More memory usage, more training time.

The model essentially asks at every voxel:

All these factors combine into a probability score for each voxel being liver or not.