Accepted at ICML 2026
Supervised Classification Heads as Semantic Prototypes
Unlocking Vision-Language Alignment via Weight Recycling
We recycle supervised classifier weights as semantic prototypes, showing that they can connect frozen image and text encoders with little or even no paired multimodal data.
Main idea
Recycle supervised classifier weights as semantic prototypes to align vision and language with little or no paired image-text data.
Weight recycling
Classifier rows become alignment data for lightweight post-hoc bridges.
Two settings
Use weights alone for zero-shot alignment, or combine them with scarce image-text pairs.
Motivation
Can supervised image models already contain reusable semantic anchors for language alignment?
Why this problem matters
Vision-language models need a shared image-text space, but large-scale contrastive pretraining depends on massive paired datasets. Post-hoc alignment freezes existing image and text encoders, yet still often needs many image-caption pairs.
Overlooked supervision
A supervised classifier head is usually discarded after pretraining, even though each row was learned to recognize a named visual class.
We show that the rows of the discarded classifier head can be reused as class-level prototypes instead of being treated as task-specific leftovers.
Semantic prototypes
Studying the rows of the classifier head as reusable concepts
Supervised pretraining
The classifier head is learned with the encoder, then usually thrown away
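A supervised image model learns an encoder $f_I$ and a linear classifier head:
$$W f_I(x) + b$$
where $W \in \mathbb{R}^{C \times d}$, $b \in \mathbb{R}^{C}$, and
$$W = \begin{bmatrix} w_1^\top \\ \vdots \\ w_C^\top \end{bmatrix}$$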
Instead of treating the trained head as disposable, we reuse each row $w_i$ as the semantic prototype for class $i$. The evidence below asks whether those rows preserve class-level meaning strongly enough to align with language.
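As a minimal sketch of what reusing each row means, the NumPy snippet below builds a toy head from random values and checks that the logit for class i is the inner product of the feature with row i plus the bias. The sizes, the variable names, and the helper `class_prototypes` are illustrative assumptions, not the paper's code. The L2 normalization of the rows is also an assumption made for convenience.

```python
import numpy as np

def class_prototypes(W: np.ndarray) -> np.ndarray:
    """Return the rows of the head W (shape C x d), L2-normalized (an assumption)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
C, d = 5, 8                        # toy sizes: 5 classes, 8-dim features
W = rng.normal(size=(C, d))        # stand-in for a trained classifier head
b = rng.normal(size=C)             # stand-in for the bias
feat = rng.normal(size=d)          # stand-in for the encoder output f_I(x)

logits = W @ feat + b              # usual classification logits, one per class
prototypes = class_prototypes(W)   # one reusable vector per class, same space as feat
```

Each prototype lives in the same d-dimensional space as the encoder output, which is what makes it a candidate stand-in for image features of that class.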
Alignment workflow
From post-hoc alignment to weight recycling
Post-hoc alignment
Learn lightweight mappings g and ḡ between fixed, independently pretrained image and text encoders, so representations with the same semantic content are mapped into a shared space.
This is the base alignment workflow used in the paper: the encoders remain frozen, and only the small bridging functions are trained.
Weight recycling
Classifier rows become extra alignment data once each $w_i$ is paired with the text embedding of its class name.
That turns the discarded classification head into reusable supervision, either by itself or as augmentation for scarce image-text pairs.
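The pairing of classifier rows with class-name embeddings can be sketched on synthetic data. This page does not specify the form of g and ḡ or the training objective, so the sketch swaps in a deliberately simple stand-in: g is a linear map fitted by closed-form ridge regression and ḡ is taken to be the identity. All arrays are random placeholders, and `fit_linear_bridge` is a hypothetical helper.

```python
import numpy as np

def fit_linear_bridge(X, T, ridge=1e-3):
    """Ridge regression: A minimizing ||X A - T||^2 + ridge * ||A||^2 (assumed stand-in for g)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ T)

def normalize(Z):
    return Z / np.clip(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12, None)

rng = np.random.default_rng(0)
C, d_img, d_txt = 50, 16, 12
W = rng.normal(size=(C, d_img))               # stand-in for classifier rows
A_true = rng.normal(size=(d_img, d_txt))      # used only to synthesize toy text embeddings
T = W @ A_true + 0.01 * rng.normal(size=(C, d_txt))  # stand-in for class-name text embeddings

A = fit_linear_bridge(W, T)                   # g(v) = v @ A, with the text side left as-is
sims = normalize(W @ A) @ normalize(T).T      # cosine similarity: mapped rows vs class names
top1 = float(np.mean(sims.argmax(axis=1) == np.arange(C)))  # training-set sanity check only
```

The `top1` value is measured on the same pairs used for fitting, so it only confirms that the bridge was fitted, not that it generalizes to images.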
Two regimes
1. Weights-only alignment.
2. Weights augment scarce pairs.
Key idea
Reusable semantic anchors.
Classifier heads as semantic prototypes
The weight vector associated with each supervised class can act as a meaningful prototype of that concept, rather than just a disposable decision boundary.
Alignment without image-text pairs
By pairing recycled classifier weights with text embeddings of their class names, we can learn a post-hoc bridge between independently pretrained image and text encoders even when paired multimodal data is absent.
Low-data augmentation from recycled weights
When real image-text pairs do exist, the recycled weights provide a compatible and effective augmentation source, with the largest gains appearing in low-data regimes.
Two settings
We evaluate weights alone first, then weights as augmentation
Setting I: No image-text pairs
Train g and ḡ using classifier weights and class names only. This setting tests the strongest claim of the work: a frozen vision encoder can acquire non-trivial vision-language capabilities without any image-text pairs.
The key observations are that weights-only alignment can be competitive with CLIP on some benchmarks, and that ImageNet-21K weights provide broader semantic coverage than ImageNet-1K alone.
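Once the bridges are trained, zero-shot classification could look like the sketch below: map image features through g, map class-name embeddings through ḡ, and pick the most similar class. The function `zero_shot_predict` and every array here are hypothetical. The toy data is aligned by construction (image features are noisy copies of class vectors), so it illustrates the inference step only and says nothing about accuracy on real benchmarks.

```python
import numpy as np

def l2norm(Z):
    return Z / np.clip(np.linalg.norm(Z, axis=-1, keepdims=True), 1e-12, None)

def zero_shot_predict(image_feats, g, class_text_feats, g_bar):
    """Assign each image to the class whose bridged name embedding is most similar."""
    zi = l2norm(g(image_feats))
    zt = l2norm(g_bar(class_text_feats))
    return (zi @ zt.T).argmax(axis=1)

rng = np.random.default_rng(0)
C, d_img, d_shared = 4, 16, 8
A = rng.normal(size=(d_img, d_shared))
g = lambda x: x @ A                    # assumed linear image-side bridge
g_bar = lambda t: t                    # assumed identity text-side bridge

class_vecs = rng.normal(size=(C, d_img))
class_text_feats = class_vecs @ A      # toy text embeddings, aligned by construction
labels = np.array([0, 1, 2, 3, 1, 0])
image_feats = class_vecs[labels] + 0.05 * rng.normal(size=(len(labels), d_img))

preds = zero_shot_predict(image_feats, g, class_text_feats, g_bar)
```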
Setting II: Data augmentation
Once some paired data exists, recycled weights remain useful because they are compatible with image representations and help most in the low-pair regime.
Here, weight recycling acts as complementary supervision rather than a replacement for real image-text pairs: the gains are strongest where paired data is hardest to collect.
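One simple way to picture the augmentation setting is to stack the recycled (classifier row, class name) pairs onto the real (image, caption) pairs before fitting the bridges. The paper's actual mixing recipe is not described on this page, so plain concatenation, the helper `build_alignment_set`, and the toy sizes are all assumptions.

```python
import numpy as np

def build_alignment_set(image_feats, caption_feats, class_rows, class_name_feats):
    """Stack real (image, caption) pairs with recycled (classifier row, class name) pairs."""
    X = np.concatenate([image_feats, class_rows], axis=0)         # vision-side inputs
    T = np.concatenate([caption_feats, class_name_feats], axis=0) # text-side targets
    is_recycled = np.concatenate([
        np.zeros(len(image_feats), dtype=bool),   # real pairs
        np.ones(len(class_rows), dtype=bool),     # recycled pairs
    ])
    return X, T, is_recycled

rng = np.random.default_rng(0)
d_img, d_txt = 16, 12
image_feats = rng.normal(size=(20, d_img))        # 20 scarce image embeddings (toy)
caption_feats = rng.normal(size=(20, d_txt))      # their caption embeddings (toy)
class_rows = rng.normal(size=(100, d_img))        # 100 classifier rows (toy)
class_name_feats = rng.normal(size=(100, d_txt))  # their class-name embeddings (toy)

X, T, is_recycled = build_alignment_set(
    image_feats, caption_feats, class_rows, class_name_feats
)
```

Keeping the `is_recycled` mask makes it possible to weight or sample the two sources differently later on.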
Retrieval with scarce paired data
Figure: Flickr30K retrieval as a function of the number of image-caption pairs used for alignment (x-axis). Panels: i2t P@1, i2t mAP, t2i P@1.
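For readers unfamiliar with the retrieval metrics, the sketch below computes P@1 and mAP from a query-by-candidate similarity matrix using their standard definitions. The exact evaluation protocol of the paper is not given here, so the function names and the tiny hand-made example are assumptions.

```python
import numpy as np

def precision_at_1(sims, relevant):
    """Fraction of queries whose top-ranked candidate is relevant."""
    top = sims.argmax(axis=1)
    return float(relevant[np.arange(len(sims)), top].mean())

def mean_average_precision(sims, relevant):
    """Mean over queries of average precision at the ranks of relevant candidates."""
    order = np.argsort(-sims, axis=1)                       # best candidate first
    ranked = np.take_along_axis(relevant, order, axis=1)    # relevance in ranked order
    hits = np.cumsum(ranked, axis=1)                        # relevant items seen so far
    ranks = np.arange(1, sims.shape[1] + 1)
    precision_at_k = hits / ranks
    n_rel = np.clip(ranked.sum(axis=1), 1, None)            # avoid division by zero
    ap = (precision_at_k * ranked).sum(axis=1) / n_rel
    return float(ap.mean())

# Two queries, three candidates; query 0 has two relevant candidates.
sims = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.3]])
relevant = np.array([[True, False, True],
                     [False, False, True]])

p1 = precision_at_1(sims, relevant)            # 0.5
m_ap = mean_average_precision(sims, relevant)  # 0.75
```

For image-to-text (i2t) retrieval the queries are images and the candidates are captions; for text-to-image (t2i) the roles are swapped.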
Further analyses
Why the recycled weights work, and where the limits are
Image and weight representations are different
Further analyses show a clear modality gap: classifier weights and image embeddings occupy different regions of the frozen vision feature space, even though they remain compatible enough for augmentation.
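One common way to quantify such a gap is the distance between the centroids of the two L2-normalized point sets. Whether the paper uses this measure is not stated here, so the sketch below is an assumption, with a hypothetical `modality_gap` helper and synthetic data in which the image features are shifted along one axis.

```python
import numpy as np

def modality_gap(weights, image_feats):
    """Euclidean distance between centroids of the L2-normalized sets (assumed measure)."""
    def unit(Z):
        return Z / np.clip(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12, None)
    return float(np.linalg.norm(unit(weights).mean(axis=0) - unit(image_feats).mean(axis=0)))

rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 3.0                                        # shift the toy image features along one axis
weights = rng.normal(size=(200, d))                    # stand-in for classifier rows
image_feats = rng.normal(size=(200, d)) + offset       # stand-in for image embeddings

gap = modality_gap(weights, image_feats)   # clearly positive for the shifted sets
same = modality_gap(weights, weights)      # exactly zero for identical sets
```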
Source dataset quality matters
ImageNet-21K provides broader semantic coverage than narrower supervised heads, which is why its recycled classifier rows transfer more strongly across downstream tasks.
Weights beat images on an equal budget
When the number of alignment pairs is matched, recycled classifier weights provide stronger text-aligned supervision than image representations.
Limitations and future work
Because classifier weights and image embeddings occupy different regions of the feature space, better methods are still needed to combine them.
Combining heads from multiple supervised datasets is promising, but only when they share the same frozen feature space; extending this across different backbones remains open.
What the results establish
Across the motivating analysis, the weights-only setting, and the low-data augmentation setting, the classifier head carries reusable semantic structure.
The broader message is not only that recycling works, but that it does so in a resource-efficient way that complements standard multimodal supervision.
Conclusion
Reusable semantic prototypes are the through-line of the paper
Without image-text data
Supervised classifier heads can endow a vision encoder with text-aligned capabilities even when no paired image-text data is available.
With scarce image-text data
The same recycled weights remain useful as augmentation, with the strongest benefits appearing in the low-data regime.
Citation
Cite the paper
If you use this repository, please cite the ICML 2026 paper:
@inproceedings{mendez2026supervised,
title = {Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling},
author = {Mendez, David and Confalonieri, Roberto and Diaz-Rodriguez, Natalia},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026},
archivePrefix = {arXiv},
eprint = {2605.22484},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.22484}
}