Accepted at ICML 2026

Supervised Classification Heads as Semantic Prototypes

Unlocking Vision-Language Alignment via Weight Recycling

We recycle supervised classifier weights as semantic prototypes, showing that they can connect frozen image and text encoders with little or even no paired multimodal data.

David Mendez*, Roberto Confalonieri, Natalia Diaz-Rodriguez

University of Granada · Department of Computer Science and Artificial Intelligence, DaSCI Institute, Granada, Spain

University of Padova · Department of Mathematics "Tullio Levi-Civita", Padova, Italy

* Corresponding author: davidmendez@ugr.es

Main idea

Recycle supervised classifier weights as semantic prototypes to align vision and language with little or no paired image-text data.

1. Motivation

Supervised image models may already hide reusable semantic anchors.

2. Weight recycling

Classifier rows become alignment data for lightweight post-hoc bridges.

3. Two settings

Use weights alone for zero-shot alignment, or combine them with scarce image-text pairs.

Motivation

Can supervised image models already contain reusable semantic anchors for language alignment?

Why this problem matters

Vision-language models need a shared image-text space, but large-scale contrastive pretraining depends on massive paired datasets. Post-hoc alignment freezes existing image and text encoders, yet still often needs many image-caption pairs.

Overlooked supervision

A supervised classifier head is usually discarded after pretraining, even though each row was learned to recognize a named visual class.

We show that the rows of the discarded classifier head can be reused as class-level prototypes instead of being treated as task-specific leftovers.

Semantic prototypes

Studying the rows of the classifier head as reusable concepts

Figure: Supervised pretraining workflow. An input image x, drawn with its label y from a supervised pretraining dataset D (e.g. ImageNet), passes through a trainable image encoder f_I(·), then a trainable linear classifier W f_I(x) + b that outputs the prediction ŷ, and finally a supervised loss L. The trained W and b are commonly discarded after pretraining.

Supervised pretraining

The classifier head is learned with the encoder, then usually thrown away

A supervised image model learns an encoder f_I and a linear classifier head:

W f_I(x) + b
where W ∈ ℝ^{C×d}, b ∈ ℝ^C, and W = [w_1, …, w_C]^⊤, with one row w_i ∈ ℝ^d per class.

Instead of treating the trained head as disposable, we reuse each row w_i as the semantic prototype for class i. The evidence below asks whether those rows preserve class-level meaning strongly enough to align with language.
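As a minimal sketch of this reading of the head, the NumPy snippet below builds a toy linear head and treats each row as a class prototype. The dimensions, the random values, and the names `W`, `b`, `features`, and `prototypes` are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 5, 8                      # hypothetical: 5 classes, 8-dim features

W = rng.normal(size=(C, d))      # stand-in for a trained head, W in R^{C x d}
b = np.zeros(C)                  # bias, b in R^C (zero here for simplicity)

# Stand-in for encoder outputs f_I(x) for a batch of 3 images.
features = rng.normal(size=(3, d))

# Standard linear classification: logits = W f_I(x) + b.
logits = features @ W.T + b      # shape (3, C)

# Row i of W is the vector w_i that gets reused as the prototype of class i.
prototypes = [W[i] for i in range(C)]

# Each logit is the dot product of the feature with one prototype, plus bias.
per_row = np.stack([features @ w for w in prototypes], axis=1) + b
```

Because `logits` and `per_row` coincide, predicting a class with the head amounts to picking the prototype with the largest (bias-adjusted) dot product, which is what makes the rows natural candidates for class-level anchors.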

Figure: Semantic neighborhood alignment. Line chart of the mNN metric (k = 3, 5, 10) against the number of images in the averaged representation (1, 3, 5, 10, 50), comparing averaged image representations with classifier-head vectors w_i.
Classifier-head rows preserve semantic neighborhoods with respect to text better than equally budgeted averaged image representations.
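One way to make this comparison concrete is a mutual k-nearest-neighbor overlap score. The sketch below assumes that mNN denotes such a score (the paper's exact definition may differ) and uses random stand-ins for all embeddings; `knn_indices`, `mutual_knn`, and every array name are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest rows (cosine similarity) for each row, excluding itself."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)           # a row is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn(A, B, k):
    """Average fraction of shared k-nearest neighbors between two spaces.

    Row i of A and row i of B must describe the same class.
    """
    na, nb = knn_indices(A, k), knn_indices(B, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(na, nb)]
    return float(np.mean(overlap))

rng = np.random.default_rng(0)
n_classes, d_img, d_txt = 20, 16, 12
text_emb = rng.normal(size=(n_classes, d_txt))    # stand-in for class-name embeddings
head_rows = rng.normal(size=(n_classes, d_img))   # stand-in for classifier rows w_i

# Baseline from the chart: average n image features per class.
n_imgs = 5
class_images = rng.normal(size=(n_classes, n_imgs, d_img))
avg_repr = class_images.mean(axis=1)              # shape (n_classes, d_img)

score_self = mutual_knn(text_emb, text_emb, k=3)  # identical spaces share all neighbors
score_rows = mutual_knn(head_rows, text_emb, k=3)
score_avg = mutual_knn(avg_repr, text_emb, k=3)
```

With real features, `score_rows` and `score_avg` would correspond to the two curves in the chart; with the random stand-ins used here their values carry no meaning beyond showing the call pattern.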

Alignment workflow

From post-hoc alignment to weight recycling

Post-hoc alignment

Learn lightweight mappings g and ḡ between fixed, independently pretrained image and text encoders, so that representations with the same semantic content are mapped into a shared space.

This is the base alignment workflow used in the paper: the encoders remain frozen, and only the small bridging functions are trained.

Weight recycling

Classifier rows become extra alignment data once each w_i is paired with the text embedding of its class name.

That turns the discarded classification head into reusable supervision, either by itself or as augmentation for scarce image-text pairs.
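The data construction described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the bridges are taken to be linear maps (`G` on the vision side, `G_bar` on the text side), the objective is a CLIP-style symmetric contrastive loss chosen as a stand-in, and all features are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_shared = 8, 6, 4           # hypothetical dimensions

# Real image-text pairs: (f_I(image), f_T(caption)). Random stand-ins here.
img_feats = rng.normal(size=(10, d_img))
cap_embs = rng.normal(size=(10, d_txt))

# Recycled pairs: (w_i, f_T(class name i)). Random stand-ins here.
head_rows = rng.normal(size=(5, d_img))
name_embs = rng.normal(size=(5, d_txt))

# Weight recycling: classifier rows join the vision side of the alignment data.
vision_side = np.concatenate([img_feats, head_rows], axis=0)
text_side = np.concatenate([cap_embs, name_embs], axis=0)

# Assumed linear bridges (untrained, small random initialization).
G = rng.normal(size=(d_img, d_shared)) * 0.1
G_bar = rng.normal(size=(d_txt, d_shared)) * 0.1

def l2_normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def symmetric_contrastive_loss(V, T, temperature=0.07):
    """CLIP-style loss: row i of V and row i of T form the positive pair."""
    logits = l2_normalize(V) @ l2_normalize(T).T / temperature
    idx = np.arange(len(V))

    def xent(L):
        L = L - L.max(axis=1, keepdims=True)          # numerical stability
        log_prob = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

loss = symmetric_contrastive_loss(vision_side @ G, text_side @ G_bar)
```

Training would minimize this loss with respect to `G` and `G_bar` only, leaving both encoders frozen. Using `head_rows` and `name_embs` alone gives the weights-only case, while the concatenated arrays give the augmentation case.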

Figure: Post-hoc alignment. An image and a short text description ("A dog in a field ...") are encoded by the frozen image encoder f_I(·) and the frozen text encoder f_T(·), then projected through the trainable functions g and ḡ into aligned representations.
Post-hoc alignment learns lightweight bridges between frozen image and text encoders before weight recycling broadens the available supervision.
Figure: Weight recycling. Image-text representation pairs (f_I(·) of images captioned "tabby kitten", "pink primrose", "runway planes", ...) and classifier-head weights (w_1 "tench", w_2 "tree frog", w_3 "ice cream", ...) are both paired with text embeddings and used to learn the lightweight alignment functions g and ḡ through a loss.

Two regimes

1. Weights-only alignment. 2. Weights augment scarce pairs.

Key idea

Reusable semantic anchors.

Weight vectors and real image-text pairs can both act as alignment data for lightweight post-hoc bridges.
Recycled classifier rows and real image-text pairs both become alignment data for lightweight post-hoc bridges.

Classifier heads as semantic prototypes

The weight vector associated with each supervised class can act as a meaningful prototype of that concept, rather than just a disposable decision boundary.

Alignment without image-text pairs

By pairing recycled classifier weights with text embeddings of their class names, we can learn a post-hoc bridge between independently pretrained image and text encoders even when paired multimodal data is absent.

Low-data augmentation from recycled weights

When real image-text pairs do exist, the recycled weights provide a compatible and effective augmentation source, with the largest gains appearing in low-data regimes.

Two settings

We evaluate weights alone first, then weights as augmentation

Setting I: No image-text pairs

Train g and ḡ using classifier weights and class names only. This setting tests the strongest claim of the work: a frozen vision encoder can acquire non-trivial vision-language capabilities without any image-text pairs.

The key observations are that weight-only alignment can be competitive with CLIP on some benchmarks, and that ImageNet-21K weights provide broader semantic coverage than ImageNet-1K alone.

Figure: Zero-shot classification. Stacked bar chart of accuracy (%) on nine datasets (RESISC, EUSAT, FLOW, OXPET, FOOD, CFR10, CFR100, DTD, PLACES), comparing ImageNet-1K based alignment (IN1K), the additional gain from ImageNet-21K weights (+ IN21K), and CLIP reference scores.
Setting I: with no image-text pairs, weight-only alignment already supports non-trivial zero-shot transfer.
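Once the bridges are trained, zero-shot classification can be run in the usual nearest-text-embedding way. The sketch below assumes linear bridges `G` and `G_bar` and cosine similarity in the shared space; the function name, the toy inputs, and the identity bridges are illustrative assumptions rather than the paper's evaluation code.

```python
import numpy as np

def l2_normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def zero_shot_predict(img_feats, class_text_embs, G, G_bar):
    """Assign each image to the class whose mapped name embedding is most similar."""
    v = l2_normalize(img_feats @ G)              # g(f_I(x))
    t = l2_normalize(class_text_embs @ G_bar)    # g_bar(f_T(class name))
    return np.argmax(v @ t.T, axis=1)

# Toy check: three classes with one-hot "text embeddings" and identity bridges.
class_text_embs = np.eye(3)
img_feats = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.1, 0.6, 0.2],
])
G = G_bar = np.eye(3)

preds = zero_shot_predict(img_feats, class_text_embs, G, G_bar)
```

Note that the target classes never need to appear in the classifier head used for training; only their names are embedded by the frozen text encoder at test time.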

Setting II: Data augmentation

Once some paired data exists, recycled weights remain useful because they are compatible with image representations and help most in the low-pair regime.

Here, weight recycling acts as complementary supervision rather than a replacement for real image-text pairs: the gains are strongest where paired data is hardest to collect.

Retrieval with scarce paired data

Flickr30K retrieval; the x-axis is the number of image-caption pairs used for alignment.

Figure: Flickr30K retrieval in three panels: image-to-text precision at 1 (i2t P@1), image-to-text mean average precision (i2t mAP), and text-to-image precision at 1 (t2i P@1), each plotted against the alignment set size (0, 1k, 5k, 10k, 30k pairs).
Setting II: classifier weights are most useful as augmentation exactly in the low-pair regime, where paired data is hardest to collect.
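For readers unfamiliar with the retrieval metrics in these panels, here is a small sketch of precision at 1 and mean average precision computed from a query-by-candidate similarity matrix. The function names and the toy matrix are assumptions for illustration; the paper's evaluation script may differ in details such as tie handling.

```python
import numpy as np

def precision_at_1(sim, relevant):
    """Fraction of queries whose top-ranked candidate is relevant."""
    top1 = np.argmax(sim, axis=1)
    return float(np.mean(relevant[np.arange(len(sim)), top1]))

def mean_average_precision(sim, relevant):
    """Mean over queries of average precision at the ranks of relevant items."""
    aps = []
    for scores, rel in zip(sim, relevant):
        order = np.argsort(-scores)                  # best candidate first
        hits = rel[order].astype(float)
        if hits.sum() == 0:
            continue                                 # skip queries with no relevant item
        precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append((precision_at_k * hits).sum() / hits.sum())
    return float(np.mean(aps))

# Toy example: 2 queries, 3 candidates; True marks a relevant candidate.
sim = np.array([[0.9, 0.2, 0.5],
                [0.1, 0.3, 0.8]])
relevant = np.array([[True, False, True],
                     [False, True, False]])

p1 = precision_at_1(sim, relevant)           # 0.5: only the first query hits at rank 1
m_ap = mean_average_precision(sim, relevant) # 0.75: APs are 1.0 and 0.5
```

Allowing several relevant candidates per query matters for image-to-text retrieval on Flickr30K, where each image comes with multiple captions.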

Further analyses

Why the recycled weights work, and where the limits are

1. Image and weight representations are different

Further analyses show a clear modality gap: classifier weights and image embeddings occupy different regions of the frozen vision feature space, even though they remain compatible enough for augmentation.

2. Source dataset quality matters

A classifier head trained on ImageNet-21K covers a broader set of concepts than heads trained on narrower label sets such as ImageNet-1K, which is why its recycled rows transfer more strongly across downstream tasks.

3. Weights beat images on equal budget

When the number of alignment pairs is matched, recycled classifier weights provide stronger text-aligned supervision than image representations.
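The gap between weights and image embeddings described in the first point can be probed with a simple diagnostic. The sketch below measures the distance between the centroids of two L2-normalized point sets; this is one possible measure chosen for illustration, not necessarily the analysis used in the paper, and the data is synthetic with a deliberately planted offset.

```python
import numpy as np

def centroid_gap(A, B):
    """Euclidean distance between the centroids of two L2-normalized point sets."""
    a = (A / np.linalg.norm(A, axis=1, keepdims=True)).mean(axis=0)
    b = (B / np.linalg.norm(B, axis=1, keepdims=True)).mean(axis=0)
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(0)
d = 16
offset = 2.0 * np.eye(d)[0]                          # planted shift along the first axis

img_feats = rng.normal(size=(200, d)) + offset       # synthetic "image embeddings"
head_rows = rng.normal(size=(50, d)) - offset        # synthetic "classifier rows"

gap_between = centroid_gap(img_feats, head_rows)             # across the two sets
gap_within = centroid_gap(img_feats[:100], img_feats[100:])  # two halves of one set
```

A between-set gap that is much larger than the within-set gap indicates that the two kinds of vectors sit in different regions, which is the situation the limitations below refer to.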

Limitations and future work

Weights and image embeddings occupy different regions of the feature space, so better methods for combining them are still needed.

Combining heads from multiple supervised datasets is promising, but only when they share the same frozen feature space; extending this across different backbones remains open.

What the results establish

Across the motivating analysis, the weight-only setting, and the low-data augmentation setting, the classifier head consistently carries reusable semantic structure.

The broader message is not only that recycling works, but that it does so in a resource-efficient way that complements standard multimodal supervision.

Conclusion

Reusable semantic prototypes are the through-line of the paper

Without image-text data

Supervised classifier heads can endow a vision encoder with text-aligned capabilities even when no paired image-text data is available.

With scarce image-text data

The same recycled weights remain useful as augmentation, with the strongest benefits appearing in the low-data regime.

Citation

Cite the paper

If you use this repository, please cite the ICML 2026 paper:

BibTeX
@inproceedings{mendez2026supervised,
  title = {Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling},
  author = {Mendez, David and Confalonieri, Roberto and Diaz-Rodriguez, Natalia},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year = {2026},
  archivePrefix = {arXiv},
  eprint = {2605.22484},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2605.22484}
}