Abstract :
Automated classification of retinal disease from colour fundus photographs is an important screening tool for the early diagnosis of sight-threatening conditions, especially in regions with limited access to ophthalmologists. Convolutional neural networks (CNNs) and vision transformers perform strongly on this task, but both families treat the fundus image as a generic two-dimensional grid and ignore the circular geometry of fundus photography and the concentric anatomical organisation of the retina. We propose FundusSSM, a hybrid architecture that combines a pretrained ConvNeXt-Tiny feature extractor with a geometry-aware Ring-Scan State Space Model. The Ring-Scan tokenizer partitions the CNN feature map into equal-area concentric rings that align with the optic disc, the macula, and the peripheral retina; each ring is processed by a bidirectional Mamba block, and a lightweight cross-ring attention module exchanges information across rings every two layers. We evaluate FundusSSM on a four-class fundus dataset of 4,217 images (cataract, diabetic retinopathy, glaucoma, and normal) under stratified five-fold cross-validation. FundusSSM achieves the highest mean F1-score among the evaluated models (95.78%), ahead of ConvNeXt-Tiny, Swin-Tiny, EfficientNet-B4, and ResNet-50, and its cross-fold standard deviation of 0.59% is lower than those of the two closest baselines (ConvNeXt-Tiny and Swin-Tiny). An ablation study shows that Ring-Scan ordering reduces cross-fold variance by approximately 46% relative to a raster-scan variant that uses the same architecture with a standard row-major token order.
We further introduce a ring-level explainability analysis that produces per-ring feature-contribution scores aligned with clinical anatomical zones. The model concentrates on central optic-disc tokens for glaucoma and activates all rings nearly uniformly for cataract, patterns that agree with how clinicians read the same images. We expect the approach and findings to be useful to researchers working on geometry-aware deep-learning models for fundus screening.
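As an illustration of the equal-area ring partition mentioned in the abstract, the following is a minimal sketch and not the FundusSSM implementation. It assumes the rings are centred on the feature-map centre, that the fundus disc is inscribed in the map (cells outside it are assigned to the outermost ring), and that tokens within a ring are ordered by polar angle. The function names, the 14×14 map size, and the choice of three rings are hypothetical. Because area grows with the square of the radius, equal-area boundaries for K rings in a disc of radius R fall at R·sqrt(k/K).

```python
import math


def ring_index(row, col, height, width, num_rings):
    """Return the equal-area ring (0 = innermost) that contains a feature-map cell."""
    # Cell centre relative to the map centre, in cell units.
    dy = (row + 0.5) - height / 2.0
    dx = (col + 0.5) - width / 2.0
    # Assumption: the fundus disc is inscribed in the feature map.
    max_r = min(height, width) / 2.0
    # Fraction of the disc area enclosed at this radius: (r / R)^2.
    area_frac = (dx * dx + dy * dy) / (max_r * max_r)
    # Cells outside the disc (image corners) are clipped to the outermost ring.
    return min(int(area_frac * num_rings), num_rings - 1)


def ring_scan_order(height, width, num_rings):
    """Group cell coordinates by ring; sort each ring by polar angle (assumed order)."""
    rings = [[] for _ in range(num_rings)]
    for row in range(height):
        for col in range(width):
            rings[ring_index(row, col, height, width, num_rings)].append((row, col))
    for cells in rings:
        cells.sort(key=lambda rc: math.atan2((rc[0] + 0.5) - height / 2.0,
                                             (rc[1] + 0.5) - width / 2.0))
    return rings


# Hypothetical example: a 14x14 feature map split into three rings.
rings = ring_scan_order(14, 14, 3)
for k, cells in enumerate(rings):
    print(f"ring {k}: {len(cells)} tokens")
```

Each inner list can then be treated as one token sequence for a per-ring sequence model. A raster-scan baseline would instead flatten the same cells in row-major order.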
Keywords :
Explainable AI, fundus photography, geometry-aware tokenization, Mamba, retinal disease classification, state space models

