- Sigmoid: turns a single continuous value into a probability in (0, 1). Takes a single value as input
- Softmax: turns an array of values into a probability distribution that sums to 1; it effectively tries to push the max value far away from all the other values. Takes an array of values as input
- Entropy: measure of uncertainty. H(p) = -Σ p(x) * log(p(x)). This is like variance
- Cross Entropy: measures the difference between two probability distributions. CE(p, q) = -Σ p(x) * log(q(x)). Also known as Negative Log Likelihood (minimizing it is a proxy for maximizing the log likelihood). Binary CE is a special case of Cross Entropy loss. CE is like co-variance
- KL Divergence: measures how a probability distribution P differs from a reference distribution Q. KLD(p||q) = Σ p(x) * log(p(x) / q(x)). It relates to cross-entropy as:
CE(p, q) = H(p) + KLD(p||q)
KLD is like co-variance wrt distribution Q. It's kind of a normalized cross-entropy wrt Q: the cross-entropy with the entropy of P subtracted off, so only the mismatch with Q remains. Refer YT video
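A minimal NumPy sketch of the quantities above, using made-up toy distributions; the final assertion checks the identity CE(p, q) = H(p) + KLD(p||q) numerically.

```python
import numpy as np

def sigmoid(x):
    # Maps a single real value to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def softmax(logits):
    # Maps an array of values to a probability distribution that sums to 1.
    # Subtracting the max is only for numerical stability.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def entropy(p):
    # H(p) = -sum p(x) * log p(x)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # CE(p, q) = -sum p(x) * log q(x)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    # KLD(p||q) = sum p(x) * log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])            # toy "ground truth" distribution
q = softmax(np.array([2.0, 1.0, 0.5]))   # toy "predicted" distribution

# CE(p, q) = H(p) + KLD(p||q)
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
print(sigmoid(0.0), entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```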
<aside>
💡
If you have two probability distributions P & Q, and one of them is the ground truth, then use cross-entropy. But if both P & Q are just some arbitrary distributions (i.e. neither is the ground truth), then use KL Divergence.
</aside>
Representation / Similarity Learning
1. Supervised
- Siamese Network with BCE: basically a classification problem on pairs (same class vs. different class). The network learns to pull similar samples close together (see the sketch at the end of this section)
- Siamese Network with ArcFace Loss: same as BCE but adds an extra Angular Margin Loss on top. This explicitly tries to push samples from different classes apart based on the margin value (see the angular-margin sketch at the end of this section). Read this blog for better understanding
2. Unsupervised Learning (better representations)
Motivation: for any down-stream task, we are better off if we can learn better representations of the samples. With better representations, the MLP head can easily classify samples (as in the Siamese network with BCE above).
- Empirically, the projection layer from the embedding (2048-D) down to 128-D helps a lot while training. During inference, we take the embeddings (2048-D) and not the output after projection. Refer diagram above (a rough sketch of this setup is at the end of this section).
- Another important thing to note is that you need to train for a lot more epochs (~600) and with a much bigger batch size (~8K). Contrastive Learning is hard to optimize because we don't have direct labels, and the problem is much harder.
- Triplet Loss: only one positive and one negative sample per anchor. Hinge-loss-like arrangement: L = max(0, d(A, P) - d(A, N) + alpha). Just minimizing d(A, P) - d(A, N) doesn't work in practice because it is an unbounded loss: the value of d(A, N) can go up to infinity, i.e. the model would put every sample extremely far from all other samples, which is not desired. So we bound the loss using alpha and the max with 0 (see the triplet-loss sketch at the end of this section).
In the loss plot above (blue line), the loss is positive until 1; after that point it's zero, i.e. there is no reward for pushing samples any further apart. The point 1 is controlled using alpha.
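A minimal PyTorch sketch of the Siamese-with-BCE setup from item 1 above: a shared encoder embeds both samples of a pair, and a small head predicts same-class vs. different-class with BCE. The layer sizes, the |e1 - e2| comparison, and the random pair data are illustrative assumptions, not a fixed recipe.

```python
import torch
import torch.nn as nn

class SiameseBCE(nn.Module):
    def __init__(self, in_dim=128, emb_dim=64):
        super().__init__()
        # Shared encoder applied to both inputs (the weight sharing makes it "Siamese").
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        # Head that turns a pair of embeddings into a single "same class?" logit.
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        # Element-wise |difference| is one common way to compare the two embeddings.
        return self.head(torch.abs(e1 - e2)).squeeze(-1)

model = SiameseBCE()
criterion = nn.BCEWithLogitsLoss()

x1, x2 = torch.randn(32, 128), torch.randn(32, 128)  # random pairs for illustration
same = torch.randint(0, 2, (32,)).float()            # 1 = same class, 0 = different class
loss = criterion(model(x1, x2), same)
loss.backward()
```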
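A rough sketch of the angular-margin idea behind ArcFace. Note that this follows the standard ArcFace formulation (an angular margin m added to the target-class angle before softmax cross-entropy) rather than a BCE head; the scale s, margin m, and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, emb_dim=512, n_classes=10, s=30.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weight vectors.
        cos = F.linear(F.normalize(emb), F.normalize(self.W)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin m only to the ground-truth class angle, which forces
        # samples to sit at least m radians closer to their own class center.
        target_cos = torch.cos(theta + self.m)
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (onehot * target_cos + (1 - onehot) * cos)
        return F.cross_entropy(logits, labels)

head = ArcFaceHead()
emb = torch.randn(8, 512)            # embeddings from some encoder
labels = torch.randint(0, 10, (8,))
loss = head(emb, labels)
loss.backward()
```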
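For the unsupervised part, a rough SimCLR-style sketch of the projection-head idea: the 128-D projection feeds a contrastive NT-Xent loss during training, while the wider backbone embedding is what you keep for inference. The stand-in MLP backbone, the dimensions, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim=512, emb_dim=2048, proj_dim=128):
        super().__init__()
        # Stand-in backbone; in practice this would be a ResNet or similar.
        self.backbone = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())
        # Projection head used only during contrastive training.
        self.projector = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, proj_dim)
        )

    def forward(self, x):
        h = self.backbone(x)   # 2048-D embedding, kept for inference / downstream tasks
        z = self.projector(h)  # 128-D projection, used only in the contrastive loss
        return h, z

def nt_xent(z1, z2, temperature=0.5):
    # z1[i] and z2[i] are projections of two augmented views of the same sample i.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature
    # Mask self-similarity so a view is never its own positive.
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))
    # The positive for row i is its other view: i + n for the first half, i - n for the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

enc = Encoder()
x1, x2 = torch.randn(8, 512), torch.randn(8, 512)  # two augmented views per sample
_, z1 = enc(x1)
_, z2 = enc(x2)
loss = nt_xent(z1, z2)
loss.backward()
```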
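A small sketch of the triplet loss L = max(0, d(A, P) - d(A, N) + alpha). The Euclidean distance and the margin value are illustrative choices; PyTorch also ships this as nn.TripletMarginLoss.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)  # d(A, P)
    d_an = F.pairwise_distance(anchor, negative)  # d(A, N)
    # Clamping at 0 bounds the loss: once d(A, N) > d(A, P) + margin,
    # there is no reward for pushing the negative any further away.
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

a, p, n = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64)
print(triplet_loss(a, p, n))
```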