Accepted at ICML 2026

Supervised Classification Heads as Semantic Prototypes

Unlocking Vision-Language Alignment via Weight Recycling

We recycle supervised classifier weights as semantic prototypes, showing that they can connect frozen image and text encoders with little or even no paired multimodal data.

David Mendez* Roberto Confalonieri Natalia Diaz-Rodriguez

University of Granada · Department of Computer Science and Artificial Intelligence, DaSCI Institute, Granada, Spain

University of Padova · Department of Mathematics "Tullio Levi-Civita", Padova, Italy

* Corresponding author: davidmendez@ugr.es


Main idea

Recycle supervised classifier weights as semantic prototypes to align vision and language with little or no paired image-text data.

Motivation

Supervised image models may already hide reusable semantic anchors.

Weight recycling

Classifier rows become alignment data for lightweight post-hoc bridges.

Two settings

Use weights alone for zero-shot alignment, or combine them with scarce image-text pairs.

Motivation

Can supervised image models already contain reusable semantic anchors for language alignment?

Why this problem matters

Vision-language models need a shared image-text space, but large-scale contrastive pretraining depends on massive paired datasets. Post-hoc alignment freezes existing image and text encoders, yet still often needs many image-caption pairs.

Overlooked supervision

A supervised classifier head is usually discarded after pretraining, even though each row was learned to recognize a named visual class.

We show that the rows of the discarded classifier head can be reused as class-level prototypes instead of being treated as task-specific leftovers.

Semantic prototypes

Studying the rows of the classifier head as reusable concepts

Supervised pretraining

The classifier head is learned with the encoder, then usually thrown away

A supervised image model learns an encoder f_I and a linear classifier head:

W f_I(x) + b

where W ∈ ℝ^{C×d}, b ∈ ℝ^C, and W = [w_1, …, w_C]^⊤ stacks the class vectors w_i ∈ ℝ^d as its rows.

Instead of treating the trained head as disposable, we reuse each row w_i as the semantic prototype for class i. The evidence below asks whether those rows preserve class-level meaning strongly enough to align with language.
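As a minimal sketch of this reading of the head, the snippet below computes the logits W f_I(x) + b and then treats each normalized row of W as a prototype. The matrix here is random and the sizes are hypothetical; a real head would come from a trained supervised model.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 5, 8                      # hypothetical: 5 classes, 8-dim features
W = rng.normal(size=(C, d))      # stand-in for a trained head's weight matrix
b = np.zeros(C)                  # stand-in for the head's bias

def head_logits(feature):
    """Linear classifier head: W f_I(x) + b."""
    return W @ feature + b

def prototypes(weight_matrix):
    """Treat each row w_i as a unit-norm prototype for class i."""
    norms = np.linalg.norm(weight_matrix, axis=1, keepdims=True)
    return weight_matrix / norms

P = prototypes(W)
feature = 3.0 * W[2]             # a feature pointing along class 2's row
pred_by_prototype = int(np.argmax(P @ feature))
```

A feature that points along one row is assigned to that row's class by cosine similarity alone, which is the sense in which a row can stand in for its class.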

Classifier-head rows preserve semantic neighborhoods with respect to text better than equally budgeted averaged image representations.

Alignment workflow

From post-hoc alignment to weight recycling

Post-hoc alignment

Learn lightweight mappings g and ḡ between fixed, independently pretrained image and text encoders, so representations with the same semantic content are mapped into a shared space.

This is the base alignment workflow used in the paper: the encoders remain frozen, and only the small bridging functions are trained.
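The shape of this workflow can be sketched as follows, assuming for illustration that both bridges are plain linear maps. `G_img`, `G_txt`, and all dimensions are hypothetical names and values, and the random arrays stand in for outputs of the frozen encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_shared = 8, 6, 4   # hypothetical dimensions

# Only these two matrices would be trained; the encoders stay frozen.
G_img = rng.normal(size=(d_shared, d_img))   # bridge g on the image side
G_txt = rng.normal(size=(d_shared, d_txt))   # bridge ḡ on the text side

def to_shared(bridge, features):
    """Apply a linear bridge, then L2-normalize each row."""
    z = features @ bridge.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)

image_features = rng.normal(size=(3, d_img))  # stand-in for frozen image encoder outputs
text_features = rng.normal(size=(5, d_txt))   # stand-in for frozen text encoder outputs

# Cosine similarity between every image and every text in the shared space.
similarity = to_shared(G_img, image_features) @ to_shared(G_txt, text_features).T
```

Training would adjust only `G_img` and `G_txt` so that matching image-text pairs score highest in `similarity`.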

Weight recycling

Classifier rows become extra alignment data once each w_i is paired with the text embedding of its class name.

That turns the discarded classification head into reusable supervision, either by itself or as augmentation for scarce image-text pairs.
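The pairing step can be sketched in a few lines. `embed_text` below is a hypothetical stand-in for a frozen text encoder (it just produces a deterministic pseudo-random vector per string), and the class names are made up for illustration.

```python
import numpy as np

def embed_text(text, dim=6):
    """Hypothetical stand-in for a frozen text encoder."""
    seed = sum(ord(ch) for ch in text)
    return np.random.default_rng(seed).normal(size=dim)

def recycle_head(W, class_names):
    """Pair each classifier row w_i with the text embedding of class name i."""
    if W.shape[0] != len(class_names):
        raise ValueError("need exactly one class name per classifier row")
    text = np.stack([embed_text(name) for name in class_names])
    return W, text

class_names = ["tabby cat", "golden retriever", "school bus"]  # illustrative only
W = np.random.default_rng(0).normal(size=(3, 8))               # stand-in head rows
vision_side, text_side = recycle_head(W, class_names)
```

Row i of `vision_side` and row i of `text_side` now form one alignment pair, in the same format an image-caption pair would take.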

Figure: Post-hoc alignment passes image and text inputs through frozen encoders and lightweight trainable bridges into a shared representation space; weight recycling then broadens the available supervision.

Two regimes

1. Weights-only alignment. 2. Weights augment scarce pairs.

Key idea

Reusable semantic anchors.

Weight vectors and real image-text pairs can both act as alignment data for lightweight post-hoc bridges.


Classifier heads as semantic prototypes

The weight vector associated with each supervised class can act as a meaningful prototype of that concept, rather than just a disposable decision boundary.

Alignment without image-text pairs

By pairing recycled classifier weights with text embeddings of their class names, we can learn a post-hoc bridge between independently pretrained image and text encoders even when paired multimodal data is absent.

Low-data augmentation from recycled weights

When real image-text pairs do exist, the recycled weights provide a compatible and effective augmentation source, with the largest gains appearing in low-data regimes.

Two settings

We evaluate weights alone first, then weights as augmentation

Setting I: No image-text pairs

Train g and ḡ using classifier weights and class names only. This setting tests the strongest claim of the work: a frozen vision encoder can acquire non-trivial vision-language capabilities without any image-text pairs.

The key observations are that weight-only alignment can be competitive with CLIP on some benchmarks, and that ImageNet-21K weights provide broader semantic coverage than ImageNet-1K alone.
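To make the weights-only setting concrete, here is a toy sketch on fully synthetic data. It swaps in closed-form ridge regression as the bridge and leaves the text side untouched (the text bridge is the identity); both are simplifying assumptions for illustration, not the training objective used in the paper. The synthetic "images" are placed close to the classifier rows by construction, which real image features need not be.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_img, d_txt = 20, 8, 6                 # hypothetical sizes

W = rng.normal(size=(C, d_img))            # stand-in classifier rows
A_true = rng.normal(size=(d_img, d_txt))   # synthetic ground-truth relation
T = W @ A_true                             # stand-in class-name text embeddings

def fit_ridge_bridge(X, Y, lam=1e-6):
    """Closed-form ridge regression: argmin_A ||X A - Y||^2 + lam ||A||^2."""
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def zero_shot_predict(features, bridge, class_text):
    """Map features through the bridge and pick the most similar class text."""
    z = features @ bridge
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    t = class_text / np.linalg.norm(class_text, axis=1, keepdims=True)
    return np.argmax(z @ t.T, axis=1)

bridge = fit_ridge_bridge(W, T)            # trained on weights and class names only

# Synthetic "images": features scattered tightly around each class row.
labels = np.repeat(np.arange(C), 5)
images = W[labels] + 0.01 * rng.normal(size=(labels.size, d_img))
accuracy = float(np.mean(zero_shot_predict(images, bridge, T) == labels))
```

No image ever enters `fit_ridge_bridge`; images appear only at evaluation time, which mirrors the structure of Setting I.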

Setting I: with no image-text pairs, weight-only alignment already supports non-trivial zero-shot transfer.

Setting II: Data augmentation

Once some paired data exists, recycled weights remain useful because they are compatible with image representations and help most in the low-pair regime.

Here, weight recycling acts as complementary supervision rather than a replacement for real image-text pairs: the gains are strongest where paired data is hardest to collect.
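One simple way to picture the augmentation is as concatenation of two pair sets that live in the same spaces. The helper below is a hypothetical sketch with synthetic arrays; how the paper weights or samples the two sources is not specified here.

```python
import numpy as np

def build_alignment_set(image_feats, caption_embs, head_rows, class_name_embs):
    """Stack scarce real pairs and recycled weight pairs into one training set."""
    if image_feats.shape[0] != caption_embs.shape[0]:
        raise ValueError("each image feature needs exactly one caption embedding")
    if head_rows.shape[0] != class_name_embs.shape[0]:
        raise ValueError("each classifier row needs exactly one class-name embedding")
    if image_feats.shape[1] != head_rows.shape[1]:
        raise ValueError("images and classifier rows must share one feature space")
    vision = np.concatenate([image_feats, head_rows], axis=0)
    text = np.concatenate([caption_embs, class_name_embs], axis=0)
    # Flag recycled rows so they can be re-weighted or ablated later.
    is_recycled = np.arange(vision.shape[0]) >= image_feats.shape[0]
    return vision, text, is_recycled

rng = np.random.default_rng(0)
vision, text, is_recycled = build_alignment_set(
    rng.normal(size=(4, 8)),    # 4 scarce image features (synthetic)
    rng.normal(size=(4, 6)),    # their caption embeddings (synthetic)
    rng.normal(size=(10, 8)),   # 10 recycled classifier rows (synthetic)
    rng.normal(size=(10, 6)),   # their class-name embeddings (synthetic)
)
```

With only four real pairs, the recycled rows make up most of the training set, which is consistent with the gains being largest when pairs are scarce.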

Retrieval with scarce paired data

Figure: Flickr30K retrieval against the number of image-caption pairs used for alignment (x-axis). Panels: i2t P@1, i2t mAP, t2i P@1.

Setting II: classifier weights are most useful as augmentation exactly in the low-pair regime, where paired data is hardest to collect.
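For readers unfamiliar with the retrieval metrics, the sketch below shows one standard way to compute P@1 and mAP from a query-by-candidate similarity matrix and a boolean relevance matrix. The function names and the tiny example are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def precision_at_1(similarity, relevant):
    """Fraction of queries whose top-ranked candidate is relevant."""
    top = np.argmax(similarity, axis=1)
    return float(np.mean(relevant[np.arange(similarity.shape[0]), top]))

def mean_average_precision(similarity, relevant):
    """Mean over queries of average precision at each relevant hit."""
    average_precisions = []
    for scores, rel in zip(similarity, relevant):
        hits = rel[np.argsort(-scores)]          # relevance in ranked order
        if not hits.any():
            continue                             # skip queries with no relevant item
        ranks = np.flatnonzero(hits) + 1         # 1-based ranks of relevant items
        precisions = np.arange(1, ranks.size + 1) / ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(average_precisions))

# Rows are image queries and columns are candidate captions (i2t).
similarity = np.array([[0.9, 0.1, 0.5],
                       [0.2, 0.3, 0.8]])
relevant = np.array([[True, False, True],
                     [False, True, False]])
p_at_1 = precision_at_1(similarity, relevant)
m_ap = mean_average_precision(similarity, relevant)
```

Transposing both matrices gives the text-to-image (t2i) direction. A relevance matrix is used rather than a single index because an image can have several matching captions.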

Further analyses

Why the recycled weights work, and where the limits are

Image and weight representations are different

Further analyses show a clear modality gap: classifier weights and image embeddings occupy different regions of the frozen vision feature space, even though they remain compatible enough for augmentation.
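One common proxy for such a gap is the distance between the centroids of two L2-normalized embedding sets. The sketch below applies it to synthetic data with an artificial offset; it is an assumed illustration and not necessarily the analysis used in the paper.

```python
import numpy as np

def centroid_gap(a, b):
    """Distance between centroids of two L2-normalized embedding sets."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

rng = np.random.default_rng(0)
d = 8
weights = rng.normal(size=(50, d))            # synthetic classifier rows
offset = np.zeros(d)
offset[0] = 5.0                               # artificial shift along one axis
images = rng.normal(size=(200, d)) + offset   # synthetic, shifted image features

gap_between = centroid_gap(weights, images)            # weights vs. images
gap_within = centroid_gap(weights[:25], weights[25:])  # baseline: weights vs. weights
```

Comparing the between-set gap against a within-set baseline helps separate a genuine offset from sampling noise.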

Source dataset quality matters

A head trained on ImageNet-21K covers a broader range of concepts than heads trained on narrower label sets, which is why its recycled classifier rows transfer more strongly across downstream tasks.

Weights beat images on equal budget

When the number of alignment pairs is matched, recycled classifier weights provide stronger text-aligned supervision than image representations.
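A simple way to quantify how well one representation preserves the neighborhood structure of another is k-nearest-neighbor overlap. The sketch below is an assumed illustration on synthetic data; this page does not state which measure the paper uses.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest neighbors by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]

def neighborhood_overlap(vision_side, text_side, k=3):
    """Mean fraction of k-NN shared between two views of the same classes."""
    nn_v = knn_indices(vision_side, k)
    nn_t = knn_indices(text_side, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_v, nn_t)]
    return float(np.mean(shared))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(30, 6))                  # synthetic class-name embeddings
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))  # random orthogonal matrix
rotated = text_emb @ rotation                        # same geometry, new basis
unrelated = rng.normal(size=(30, 6))                 # no shared structure

overlap_rotated = neighborhood_overlap(rotated, text_emb)
overlap_unrelated = neighborhood_overlap(unrelated, text_emb)
```

Under this measure, comparing classifier rows and averaged image features against the same text embeddings, with the same number of classes, gives an equal-budget comparison.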

Limitations and future work

Weights and image embeddings occupy different regions of the feature space, so combining them effectively still calls for better methods.

Combining heads from multiple supervised datasets is promising, but only when they share the same frozen feature space; extending this across different backbones remains open.

What the results establish

Across the motivating analysis, the weights-only setting, and the low-data augmentation setting, the classifier head carries reusable semantic structure.

The broader message is not only that recycling works, but that it does so in a resource-efficient way that complements standard multimodal supervision.

Conclusion

Reusable semantic prototypes are the through-line of the paper

Without image-text data

Supervised classifier heads can endow a vision encoder with text-aligned capabilities even when no paired image-text data is available.

With scarce image-text data

The same recycled weights remain useful as augmentation, with the strongest benefits appearing in the low-data regime.

Citation

Cite the paper

If you use this repository, please cite the ICML 2026 paper:

@inproceedings{mendez2026supervised,
  title = {Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling},
  author = {Mendez, David and Confalonieri, Roberto and Diaz-Rodriguez, Natalia},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year = {2026},
  archivePrefix = {arXiv},
  eprint = {2605.22484},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2605.22484}
}