FundusSSM: A Hybrid CNN–State Space Model with Geometry-Aware Ring-Scan Tokenization for Retinal Disease Classification

Automated retinal disease classification from colour fundus photographs is a critical screening tool for the early diagnosis of sight-threatening conditions, especially in regions with limited access to ophthalmologists. Convolutional neural networks (CNNs) and vision transformers achieve strong performance on this task; however, both families treat the fundus image as a generic two-dimensional grid and ignore the circular geometry of fundus photography and the concentric anatomical organisation of the retina. In this paper, we propose FundusSSM, a hybrid architecture that combines a pretrained ConvNeXt-Tiny feature extractor with a geometry-aware Ring-Scan State Space Model. The Ring-Scan tokenizer partitions the CNN feature map into equal-area concentric rings that align with the optic disc, the macula, and the peripheral retina; each ring is then processed by a bidirectional Mamba block, and information is exchanged across rings every two layers through a lightweight cross-ring attention module. We evaluate FundusSSM on a 4,217-image, four-class fundus dataset (cataract, diabetic retinopathy, glaucoma and normal) under stratified five-fold cross-validation. FundusSSM achieves the highest mean F1-score among the evaluated models (95.78%), outperforming ConvNeXt-Tiny, Swin-Tiny, EfficientNet-B4 and ResNet-50, with a cross-fold standard deviation of 0.59% that is smaller than those of the closest baselines (ConvNeXt-Tiny and Swin-Tiny). An ablation study indicates that the proposed Ring-Scan ordering reduces the cross-fold variance by approximately 46% relative to a raster-scan ablation that uses the same architecture but a standard row-major token order.
We further introduce a ring-level explainability analysis that produces per-ring feature-contribution scores aligned with clinical anatomical zones. We observe that the model concentrates most on central optic-disc tokens for glaucoma while activating all rings nearly uniformly for cataract, patterns that agree with how clinicians read the same images. We believe that this approach and these findings may be useful to researchers developing geometry-aware deep-learning models for fundus screening.
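
To make the equal-area ring partition concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes the rings are centred on the feature-map centre, that tokens within a ring are ordered by polar angle, and that cells outside the inscribed circle fall into the outermost ring; the function name `ring_scan_order` and the parameter `num_rings` are hypothetical, and the number of rings is not stated here. Equal area follows from placing ring boundaries at radii proportional to the square root of k / num_rings, so the ring index depends linearly on the squared radius.

```python
import math


def ring_scan_order(height, width, num_rings):
    """Group feature-map cells into equal-area concentric rings.

    Returns a list of `num_rings` lists. Each inner list holds the
    (row, col) cells of one ring, sorted by polar angle. All choices
    here (centre, clamping, angular order) are assumptions of this
    sketch, not details taken from the paper.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0  # geometric centre
    radius_sq = (min(height, width) / 2.0) ** 2      # inscribed circle

    rings = [[] for _ in range(num_rings)]
    for row in range(height):
        for col in range(width):
            dy, dx = row - cy, col - cx
            # Ring boundaries sit at R * sqrt(k / num_rings), so the
            # index is linear in the squared radius.
            k = int(num_rings * (dy * dy + dx * dx) / radius_sq)
            k = min(k, num_rings - 1)  # corners go to the outer ring
            rings[k].append((math.atan2(dy, dx), row, col))

    # Order the tokens of each ring by angle and drop the angle key.
    return [[(r, c) for _, r, c in sorted(ring)] for ring in rings]


if __name__ == "__main__":
    for index, ring in enumerate(ring_scan_order(8, 8, 4)):
        print(index, len(ring))
```

On a coarse grid the rings hold only approximately equal numbers of cells, and the corner cells inflate the outermost ring; a real implementation might mask the corners instead, since they lie outside the circular fundus region.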
