Linearly Mapping from Image to Text Space (ICLR'23)
EffL LAB. Regular Seminar
The Problem with Language Models
Emily M. Bender and Alexander Koller, "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data", ACL 2020
A system exposed only to form during training cannot, in principle, learn meaning.
##Form & Meaning in Language** Form
- Anything we can find in a language (e.g., symbols, mouth movements)
Meaning
- Relationship between form and non-linguistic parts
- Including Communicative intent
Is form alone meaningful?
Octopus Thought Experiment
Setup: two people, A and B, communicate over an underwater cable, and a highly intelligent octopus taps the line.
- The octopus knows nothing about human language
- It is excellent at spotting statistical patterns
- It observes certain words being used in similar forms
- It may even notice common lexical patterns
- Eventually it starts impersonating B and replying to A
- But the octopus does not know the referents of the words: it has no idea what bears or sticks are
- => Octopus = LM
Octopus Thought Experiment - Conclusion
- LMs do not tend to learn conceptual representations (meanings) of language.
- Humans acquire language not only through its form (representation) but also through interaction with the physical world.
Question: how well can a text-only language model learn aspects of the physical world?
Previous Works
- Showed success in mapping images to language-model soft prompts as a method for multimodal pre-training (e.g., MAGMA, Frozen)
  - Constantin Eichenberg et al., "MAGMA - Multimodal Augmentation of Generative Models through Adapter-based Finetuning", EMNLP 2022
  - Maria Tsimpoukelli et al., "Multimodal Few-Shot Learning with Frozen Language Models", NeurIPS 2021
- However, no prior work restricts the mechanism behind this mapping in order to understand how it works.
Language & Image representation
- Hypothesis.
Conceptual representations (between language and image embeddings) can be approximately mapped to one through a linear transformation
- Why train on linear transformation?
- because of the simplicity !
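To make the hypothesis concrete, here is a minimal sketch (illustrative only, not the paper's training procedure) of fitting a single matrix W between paired embeddings by least squares; the dimensions and the random placeholder data are assumptions.

```python
import torch

# Illustration of the linearity hypothesis: if paired image/text
# embeddings share conceptual structure, one matrix W should align them.
n, d_img, d_txt = 1000, 512, 768      # illustrative sizes
X = torch.randn(n, d_img)             # placeholder image embeddings
Y = torch.randn(n, d_txt)             # placeholder paired text embeddings

# Closed-form least squares: W = argmin_W ||X W - Y||_F
W = torch.linalg.lstsq(X, Y).solution  # shape (d_img, d_txt)

# Relative residual: with real paired embeddings, a low value would
# support the hypothesis (random data, as here, will not align well).
print(f"relative residual: {((X @ W - Y).norm() / Y.norm()):.3f}")
```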
Method
LiMBeR (Linearly Mapping Between Representation spaces)
- Train a linear projection from image representations into the text space of a language model to perform image-to-text tasks
- i.e., transform an image representation into "soft prompts" (vectors that do not correspond to discrete language tokens); a training sketch follows below.
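A minimal PyTorch sketch of this setup, assuming a frozen image encoder with feature size img_dim and a frozen LM with embedding size lm_embed_dim (both names are illustrative); only the linear projection is trained, here with a standard captioning loss.

```python
import torch
import torch.nn as nn

class LiMBeRProjection(nn.Module):
    """Single linear layer mapping frozen image features into the
    frozen LM's input-embedding space. These are the only trained
    weights in the pipeline."""
    def __init__(self, img_dim: int, lm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(img_dim, lm_embed_dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, n_patches, img_dim) from the frozen encoder
        # -> soft prompts: (batch, n_patches, lm_embed_dim)
        return self.proj(img_feats)

# Training sketch (HuggingFace-style interface, illustrative):
#   prompts = projection(frozen_encoder(images))       # soft prompts
#   text_emb = lm.get_input_embeddings()(caption_ids)  # frozen embeddings
#   inputs = torch.cat([prompts, text_emb], dim=1)
#   loss = lm(inputs_embeds=inputs, labels=...).loss   # CE on caption tokens
#   loss.backward()  # gradients reach only projection.proj
```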
Experiments: Captioning
Experiments: VQA (Visual Question Answering)
Experiments: Visual Concepts
Why do BEiT prompts perform so poorly on VQA despite performing decently on captioning?
- Hypothesis: BEiT does not encode visual information that corresponds to lexical categories
- Metric
  - Wu-Palmer similarity
  - Computes the distance between the ground-truth and generated words in the WordNet taxonomy
  - Measures how close a generated word is to the correct answer (see the sketch below)
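A minimal sketch of the metric using NLTK's WordNet interface (requires nltk.download('wordnet')); taking the max over noun-synset pairs is a common convention, not necessarily the paper's exact protocol.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def wup_score(generated: str, ground_truth: str) -> float:
    """Max Wu-Palmer similarity over all noun-synset pairs of the two
    words: 1.0 = same concept, lower = farther apart in WordNet."""
    syns_gen = wn.synsets(generated, pos=wn.NOUN)
    syns_gt = wn.synsets(ground_truth, pos=wn.NOUN)
    scores = [g.wup_similarity(t) for g in syns_gen for t in syns_gt]
    return max((s for s in scores if s is not None), default=0.0)

print(wup_score("dog", "cat"))  # high: close in the taxonomy
print(wup_score("dog", "car"))  # lower: distant concepts
```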
Conclusion
- Showed that the amount of linguistic supervision in the vision model's pretraining objective correlates with how similar its representations are to the LM's text space
- Verified the hypothesis: training only a linear layer is enough to map visual pretrained knowledge into text space
- This enables downstream tasks (e.g., few-/zero-shot VQA, image captioning) that utilize knowledge stored in both modalities
- Future work (or open questions)
  - Could results improve with different model sizes (e.g., larger or smaller CLIP models, supervised ResNets, or BEiTs)?
  - Do the probing results get better or worse with image encoder size?