There is a blog post that uses CLIP as its core example to demonstrate how this works.

A common technique used here is contrastive learning. There are other approaches too, but I have not explored them yet.
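
To make the idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is an illustration under assumptions, not the blog's or CLIP's actual code: the function names (`l2_normalize`, `log_softmax`, `contrastive_loss`), the toy 2-D embeddings, and the fixed temperature of 0.07 are all picked for the example. The idea is that matching image/text pairs sit on the diagonal of a similarity matrix, and the loss pushes those diagonal scores up relative to every other entry in the same row and column.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Pair k (image k, text k) is treated as the positive; every other
    combination in the batch is treated as a negative.
    The temperature value is an assumption chosen for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image -> text direction: each row should peak at its own index.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text -> image direction: same thing over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return (loss_i2t + loss_t2i) / 2


# Toy embeddings (made up for illustration): two images, two captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(images, matching_texts)
swapped = contrastive_loss(images, swapped_texts)
print(f"aligned pairs: {aligned:.6f}")
print(f"swapped pairs: {swapped:.6f}")
```

When the pairs line up, the loss should come out close to zero; when the captions are swapped, it should be much larger. In a real setup the embeddings would come from trained image and text encoders and the loss would be minimized with gradient descent, which this sketch leaves out.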