- Sigmoid: turns a single continuous value into a probability in (0, 1). Takes a single value as input
- Softmax: turns an array of values into a probability distribution that sums to 1; it effectively tries to push the max value far away from all the other values. Takes an array of values as input
- Entropy: measure of uncertainty. H(p) = -Σ p(x) * log(p(x)). This is like variance
- Cross Entropy: measures the difference between two probability distributions. CE(p, q) = -Σ p(x) * log(q(x)). Also known as Negative Log Likelihood (minimizing it is a proxy for maximizing the log likelihood). Binary CE is a special case of Cross Entropy loss. CE is like co-variance
- KL Divergence: measures how a probability distribution P differs from a reference distribution Q. KLD(p||q) = Σ p(x) * log(p(x) / q(x)). It relates to cross-entropy as:
CE(p, q) = H(p) + KLD(p||q)
KLD is like co-variance wrt distribution Q. It's kind of a normalized cross-entropy wrt Q: the cross-entropy with the entropy of P subtracted off, so only the mismatch with Q remains. Refer YT video
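A minimal NumPy sketch of the quantities above, using made-up toy distributions; the final assertion checks the identity CE(p, q) = H(p) + KLD(p||q) numerically.

```python
import numpy as np

def sigmoid(x):
    # Maps a single real value to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def softmax(logits):
    # Maps an array of values to a probability distribution that sums to 1.
    # Subtracting the max is only for numerical stability.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def entropy(p):
    # H(p) = -sum p(x) * log p(x)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # CE(p, q) = -sum p(x) * log q(x)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    # KLD(p||q) = sum p(x) * log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])            # toy "ground truth" distribution
q = softmax(np.array([2.0, 1.0, 0.5]))   # toy "predicted" distribution

# CE(p, q) = H(p) + KLD(p||q)
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
print(sigmoid(0.0), entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```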
<aside>
💡
If you have two probability distributions P & Q, and one of them is the ground truth, then use cross-entropy. But if both P & Q are just some arbitrary distributions (i.e. neither is the ground truth), then use KL Divergence.
</aside>
Representation / Similarity Learning
1. Supervised
- Siamese Network with BCE: basically a classification problem on pairs (same class vs. different class). The network learns to pull similar samples close together (see the sketch at the end of this section)
- Siamese Network with ArcFace Loss: same as BCE but adds an extra Angular Margin Loss on top. This explicitly tries to push samples from different classes apart based on the margin value (see the angular-margin sketch at the end of this section). Read this blog for better understanding
2. Unsupervised Learning (better representations)
Motivation: for any down-stream task, we are better off if we can learn better representations of the samples. With better representations, the MLP head can easily classify samples (as in the Siamese network with BCE above).
- Empirically, the projection layer from the embedding (2048-D) down to 128-D helps a lot while training. During inference, we take the embeddings (2048-D) and not the output after projection. Refer diagram above (a rough sketch of this setup is at the end of this section).
- Another important thing to note is that you need to train for a lot more epochs (~600) and with a much bigger batch size (~8K). Contrastive Learning is hard to optimize because we don't have direct labels, and the problem is much harder.
- Triplet Loss: only one positive and one negative sample per anchor. Hinge-loss-like arrangement: L = max(0, d(A, P) - d(A, N) + alpha). Just minimizing d(A, P) - d(A, N) doesn't work in practice because it is an unbounded loss: the value of d(A, N) can go up to infinity, i.e. the model would put every sample extremely far from all other samples, which is not desired. So we bound the loss using alpha and the max with 0 (see the triplet-loss sketch at the end of this section).
In the loss plot above (blue line), the loss is positive until 1; after that point it's zero, i.e. there is no reward for pushing samples any further apart. The point 1 is controlled using alpha.
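A minimal PyTorch sketch of the Siamese-with-BCE setup from item 1 above: a shared encoder embeds both samples of a pair, and a small head predicts same-class vs. different-class with BCE. The layer sizes, the |e1 - e2| comparison, and the random pair data are illustrative assumptions, not a fixed recipe.

```python
import torch
import torch.nn as nn

class SiameseBCE(nn.Module):
    def __init__(self, in_dim=128, emb_dim=64):
        super().__init__()
        # Shared encoder applied to both inputs (the weight sharing makes it "Siamese").
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        # Head that turns a pair of embeddings into a single "same class?" logit.
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        # Element-wise |difference| is one common way to compare the two embeddings.
        return self.head(torch.abs(e1 - e2)).squeeze(-1)

model = SiameseBCE()
criterion = nn.BCEWithLogitsLoss()

x1, x2 = torch.randn(32, 128), torch.randn(32, 128)  # random pairs for illustration
same = torch.randint(0, 2, (32,)).float()            # 1 = same class, 0 = different class
loss = criterion(model(x1, x2), same)
loss.backward()
```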
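A rough sketch of the angular-margin idea behind ArcFace. Note that this follows the standard ArcFace formulation (an angular margin m added to the target-class angle before softmax cross-entropy) rather than a BCE head; the scale s, margin m, and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, emb_dim=512, n_classes=10, s=30.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weight vectors.
        cos = F.linear(F.normalize(emb), F.normalize(self.W)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin m only to the ground-truth class angle, which forces
        # samples to sit at least m radians closer to their own class center.
        target_cos = torch.cos(theta + self.m)
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (onehot * target_cos + (1 - onehot) * cos)
        return F.cross_entropy(logits, labels)

head = ArcFaceHead()
emb = torch.randn(8, 512)            # embeddings from some encoder
labels = torch.randint(0, 10, (8,))
loss = head(emb, labels)
loss.backward()
```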
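For the unsupervised part, a rough SimCLR-style sketch of the projection-head idea: the 128-D projection feeds a contrastive NT-Xent loss during training, while the wider backbone embedding is what you keep for inference. The stand-in MLP backbone, the dimensions, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim=512, emb_dim=2048, proj_dim=128):
        super().__init__()
        # Stand-in backbone; in practice this would be a ResNet or similar.
        self.backbone = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())
        # Projection head used only during contrastive training.
        self.projector = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, proj_dim)
        )

    def forward(self, x):
        h = self.backbone(x)   # 2048-D embedding, kept for inference / downstream tasks
        z = self.projector(h)  # 128-D projection, used only in the contrastive loss
        return h, z

def nt_xent(z1, z2, temperature=0.5):
    # z1[i] and z2[i] are projections of two augmented views of the same sample i.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature
    # Mask self-similarity so a view is never its own positive.
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))
    # The positive for row i is its other view: i + n for the first half, i - n for the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

enc = Encoder()
x1, x2 = torch.randn(8, 512), torch.randn(8, 512)  # two augmented views per sample
_, z1 = enc(x1)
_, z2 = enc(x2)
loss = nt_xent(z1, z2)
loss.backward()
```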
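A small sketch of the triplet loss L = max(0, d(A, P) - d(A, N) + alpha). The Euclidean distance and the margin value are illustrative choices; PyTorch also ships this as nn.TripletMarginLoss.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)  # d(A, P)
    d_an = F.pairwise_distance(anchor, negative)  # d(A, N)
    # Clamping at 0 bounds the loss: once d(A, N) > d(A, P) + margin,
    # there is no reward for pushing the negative any further away.
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

a, p, n = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64)
print(triplet_loss(a, p, n))
```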