Histopathological images provide medical evidence that supports disease diagnosis. However, manual review of these images by pathologists is very time-consuming. Moreover, the variations of pathological images across organs, cell sizes, and magnification factors make it difficult to develop a general method for histopathological image classification. To address these issues, we propose a novel cross-scale fusion (CSF) transformer, which consists of a multiple field-of-view patch embedding module, transformer encoders, and cross-fusion modules. With these modules, the CSF transformer effectively integrates patch embeddings of different fields of view to learn cross-scale contextual correlations, which represent tissues and cells of different sizes and magnification factors, while using less memory and computation than state-of-the-art transformers. To verify the generalization ability of the CSF transformer, experiments are performed on four public datasets covering different organs and magnification factors. The CSF transformer outperforms state-of-the-art task-specific methods, convolutional neural network-based methods, and transformer-based methods. The source code will be available at https://github.com/nchucvml/CSFT after acceptance.
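To make the idea of multiple field-of-view patch embeddings concrete, the following is a minimal NumPy sketch: the same image is tokenized with two different patch sizes (a fine and a coarse field of view), and each patch is linearly projected to a shared embedding dimension so that a downstream cross-fusion step could correlate tokens across scales. Function names, patch sizes, and the random projection are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of multi-field-of-view patch embedding (NOT the
# CSF transformer's actual code): tokenize one image at two patch
# sizes and project both to a common embedding dimension.
import numpy as np

def patch_embed(image, patch_size, embed_dim, rng):
    """Split an HxWxC image into non-overlapping patches and
    linearly project each flattened patch to embed_dim."""
    h, w, c = image.shape
    p = patch_size
    patches = (
        image[: h - h % p, : w - w % p]
        .reshape(h // p, p, w // p, p, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, p * p * c)            # (num_patches, patch_dim)
    )
    # Hypothetical random linear projection standing in for a learned one.
    proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    return patches @ proj                  # (num_patches, embed_dim)

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64, 3))

# Two fields of view: small patches capture fine cellular detail,
# large patches capture broader tissue context; both map to the same
# embedding dimension so a cross-fusion module can relate them.
tokens_fine = patch_embed(image, patch_size=8, embed_dim=96, rng=rng)
tokens_coarse = patch_embed(image, patch_size=16, embed_dim=96, rng=rng)

print(tokens_fine.shape)    # (64, 96): an 8x8 grid of tokens
print(tokens_coarse.shape)  # (16, 96): a 4x4 grid of tokens
```

Because the two token sequences share an embedding dimension but differ in length, a cross-fusion module can attend between them to learn cross-scale contextual correlations.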
Acknowledgements
This work was supported in part by the National Science and Technology Council, Taiwan under Grant NSTC 111-2634-F-006-012.
We thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources.