TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

1Stony Brook University 2MICS, CentraleSupélec, Université Paris-Saclay 3UNC Charlotte 4Argonne National Laboratory 5University of Chicago 6Archimedes/Athena RC 7Independent Researcher
* Co-first and Co-second Authors
Teaser

TICON: An Omni Tile Contextualizer that can contextualize embeddings from any tile encoder. Solid lines (—) represent input projectors for the tile encoders used in pretraining; dashed lines (- -) represent input projectors used when adapting TICON to new tile encoders.

Abstract

The interpretation of small tiles in large whole slide images (WSIs) often requires broader image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile-encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile encoders excel at different downstream tasks, so a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained with a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on top of TICON, using only 11K WSIs, to form a slide-level foundation model that outperforms SoTA slide-level foundation models pretrained with up to 350K WSIs.

Pretraining: Omni-Feature Masked Modeling

Method

Overview of the pretraining framework. (Left) Grid sampling and tile embedding extraction using a set of tile encoders (φ1, φ2, ..., φT). (Right) An input tile encoder (φi) is sampled randomly at each iteration, and its embeddings are masked. The remaining visible embeddings are passed through a φi-specific input projector (ρi) and then a shared encoder. A shared decoder, paired with output projectors specific to each tile encoder, then reconstructs the masked embeddings corresponding to all tile encoders (φ1, ..., φT).
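
To make the pretraining recipe concrete, the following is a minimal PyTorch-style sketch of one omni-feature masked-modeling step. The module names, widths, masking ratio, and optimizer setup are illustrative assumptions rather than the released implementation, and positional information for the tile grid is omitted for brevity.

# Minimal sketch of one omni-feature masked-modeling step (hypothetical names and dimensions).
import random
import torch
import torch.nn as nn

T, D_MODEL, MASK_RATIO = 3, 768, 0.75           # number of tile encoders, shared width, mask ratio (assumed)
TILE_DIMS = [1024, 1536, 768]                   # embedding dims of tile encoders phi_1..phi_T (assumed)

input_projs = nn.ModuleList([nn.Linear(d, D_MODEL) for d in TILE_DIMS])    # rho_i, one per tile encoder
output_projs = nn.ModuleList([nn.Linear(D_MODEL, d) for d in TILE_DIMS])   # one output head per tile encoder
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=12)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=4)
mask_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))

def pretrain_step(tile_embs, optimizer):
    """tile_embs[t]: (B, N, TILE_DIMS[t]) grid-sampled embeddings of the same tiles from tile encoder phi_t."""
    B, N, _ = tile_embs[0].shape
    i = random.randrange(T)                                                       # sample one input tile encoder per iteration
    keep = torch.rand(B, N).argsort(dim=1)[:, : int(N * (1 - MASK_RATIO))]        # visible positions
    masked = torch.ones(B, N, dtype=torch.bool).scatter_(1, keep, False)          # True = masked position

    # The shared encoder sees only the visible embeddings of phi_i, projected by rho_i.
    visible = torch.gather(tile_embs[i], 1, keep.unsqueeze(-1).expand(-1, -1, TILE_DIMS[i]))
    latent = encoder(input_projs[i](visible))                                     # (B, n_keep, D_MODEL)

    # The shared decoder re-inserts mask tokens at masked positions, then decodes the full sequence.
    full = mask_token.expand(B, N, -1).clone()
    full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D_MODEL), latent)
    decoded = decoder(full)

    # Reconstruction loss on masked positions, summed over all target tile encoders phi_1..phi_T.
    loss = sum(((output_projs[t](decoded) - tile_embs[t]) ** 2)[masked].mean() for t in range(T))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()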

Inference Modes

Method

Overview of TICON's inference modes. (Left) Standard preprocessing pipeline: tiling the WSI followed by embedding extraction. (Middle) Contextualized Inference: the default mode, where the entire sequence of WSI tile embeddings is passed through the TICON Encoder. This allows the model to contextualize each tile with information from the full slide-level neighborhood. (Right) Isolated Inference: an alternative mode where a single tile embedding is passed through TICON independently. In this setting, the Transformer effectively functions as a deep MLP (sequence length of 1). Although this is not the primary design intent, we empirically find that TICON exhibits an emergent property in this mode: it enhances individual tile representations even when slide-level context is unavailable (e.g., in the THUNDER benchmark).
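
As a complement to the figure, here is a hedged sketch of the two inference modes, reusing the hypothetical input_projs and encoder modules from the pretraining sketch above; only the shared encoder is used at inference time in this sketch.

# Hypothetical inference helpers, reusing input_projs / encoder from the pretraining sketch.
@torch.no_grad()
def contextualized_inference(slide_tile_embs, i):
    """slide_tile_embs: (N, TILE_DIMS[i]) all tile embeddings of one WSI from tile encoder phi_i."""
    tokens = input_projs[i](slide_tile_embs).unsqueeze(0)   # (1, N, D_MODEL): the whole slide as one sequence
    return encoder(tokens).squeeze(0)                       # each tile attends to the full slide-level context

@torch.no_grad()
def isolated_inference(tile_emb, i):
    """tile_emb: (TILE_DIMS[i],) a single tile embedding with no slide context available."""
    token = input_projs[i](tile_emb).view(1, 1, -1)         # sequence length 1: the Transformer acts as a deep MLP
    return encoder(token).view(-1)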

Tile-Level Tasks with Slide Context

Table

Tile-Level Tasks without Slide Context

Table

TICON as a Slide-Level Foundation Model

Table

Adapting to Unseen Tile Encoders

Table
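
The teaser figure indicates that adapting TICON to a new tile encoder only introduces a new input projector (dashed lines). The sketch below shows one plausible adaptation setup under that reading; freezing the pretrained TICON weights and reusing the masked-modeling step are our own assumptions, not the paper's stated procedure.

# Hypothetical adaptation to an unseen tile encoder phi_new (embedding dim D_NEW is assumed).
D_NEW = 512
new_input_proj = nn.Linear(D_NEW, D_MODEL)      # the dashed input projector from the teaser figure

# Assumption: the pretrained shared encoder/decoder and existing projectors stay frozen;
# only the new input projector is trained, e.g. by rerunning pretrain_step() with
# input_projs[i] replaced by new_input_proj, so that phi_new embeddings are reconstructed
# into the feature spaces of the pretraining tile encoders.
for module in (encoder, decoder, input_projs, output_projs):
    module.requires_grad_(False)
adapt_optimizer = torch.optim.AdamW(new_input_proj.parameters(), lr=1e-4)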

Visualization

Figure

Visualization of tile classification results on a WSI from the CATCH dataset. The left panel (baseline) shows classification using non-contextual tile embeddings, whereas the right panel (TICON) shows classification with contextualized embeddings. TICON produces less noisy predictions and corrects many local misclassifications (green boxes). However, we also observe shared failure modes (orange box) where both methods misclassify a region, suggesting limitations in the underlying tile encoder's features: they can lack information that even contextualization cannot recover.