LLMs - New Research

Transformer Visualizer

https://poloclub.github.io/transformer-explainer/ Great visualization to better understand transformer architecture.

Mixed Precision Training (AMP)

https://claude.ai/chat/11f9de54-f9b5-4ff1-9539-d7e042703aa9 Discussion of how the memory reductions materialize in practice.

A methodology for training DNNs using half-precision floating-point numbers without losing model accuracy or having to modify hyperparameters. Since half precision has a narrower range than single precision, the paper proposes three techniques for preventing the loss of critical information:

Maintaining a single-precision (FP32) copy of the weights that accumulates the gradients after each optimizer step (this copy is rounded to half precision for the forward and backward pass).
- Why? Updates (weight gradient * learning_rate) can be too small to represent in FP16, in which case they become zero. Even when an update is representable, it can be so much smaller than the weight it is added to that the FP16 addition rounds it away.
Loss scaling to preserve gradient values with small magnitude.
- In practice, many gradient values are small in magnitude. Values below $2^{-14}$ fall under the smallest normal FP16 value and lose precision as subnormals, and values below $2^{-24}$ become zero.
- The scaling factor depends on how small the gradient values are. The paper tried different values between 8 and 1024.
Using half-precision arithmetic that accumulates into single-precision outputs, which are converted to half-precision before storing to memory.
- By and large, neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise operations.
- A dot-product accumulates its partial products in FP32, then converts the result to FP16 before writing it to memory.
- Large reductions like norms, softmax, etc. should be carried out in FP32.
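
The three effects above can be reproduced with plain NumPy scalars. This is only a minimal sketch, not the paper's implementation: the function names and the example values (a weight of 1.0, an update of 1e-4, a gradient of 1e-8, a scale of 1024, 4096 addends of 0.01) are assumptions chosen to make each effect visible.

```python
import numpy as np


def master_weight_demo(weight=1.0, update=1e-4):
    """Add a small update to an FP16 weight and to an FP32 master copy."""
    # Near 1.0 the spacing between FP16 values is 2^-10, so an update of 1e-4
    # is less than half a spacing and the FP16 addition rounds it away.
    w16 = np.float16(np.float16(weight) + np.float16(update))
    # The FP32 copy has enough precision to keep the update.
    w32 = np.float32(np.float32(weight) + np.float32(update))
    return w16, w32


def loss_scaling_demo(grad=1e-8, scale=1024.0):
    """Cast a tiny gradient to FP16 with and without scaling."""
    g = np.float32(grad)
    s = np.float32(scale)
    unscaled = np.float16(g)       # 1e-8 is below 2^-24, so it becomes 0
    scaled = np.float16(g * s)     # scaled value is nonzero in FP16
    recovered = np.float32(scaled) / s  # unscale in FP32 before the update
    return unscaled, scaled, recovered


def accumulation_demo(n=4096, value=0.01):
    """Sum n FP16 values with an FP16 accumulator and an FP32 accumulator."""
    x = np.full(n, value, dtype=np.float16)
    acc16 = np.float16(0.0)
    acc32 = np.float32(0.0)
    for v in x:
        acc16 = np.float16(acc16 + v)             # running sum kept in FP16
        acc32 = np.float32(acc32 + np.float32(v))  # running sum kept in FP32
    return acc16, acc32


if __name__ == "__main__":
    print("master weights:", master_weight_demo())
    print("loss scaling:  ", loss_scaling_demo())
    print("accumulation:  ", accumulation_demo())
```

In the accumulation case the FP16 running sum stops growing once the spacing between representable values exceeds twice the addend, which is why the partial products are accumulated in FP32.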

Nsight Compute and Systems

Nsight Systems is a high-level profiler meant for system-level profiling. Nsight Compute, on the other hand, is a low-level profiler meant for profiling individual CUDA kernels.

nsys - very similar to a Chrome trace. You have various streams like CPU, GPU, MPI, and some other libraries. For each stream you can see the operations performed, and hovering over one shows more details. You get system-level details about sync and async operations between the different streams.
ncu - gives a detailed report for each kernel in the code. Each report has sections like GPU Speed of Light (SOL), Compute Workload Analysis, Memory Workload Analysis, source code view, roofline plot, etc.
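
To make the roofline plot mentioned above more concrete, here is a small sketch of the standard roofline formula: attainable throughput is the minimum of peak compute and arithmetic intensity times peak memory bandwidth. The function name `roofline` and the hardware numbers (10 TFLOP/s, 1 TB/s) are hypothetical placeholders, not values taken from any ncu report or real GPU.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (arithmetic intensity, attainable FLOP/s, limiting resource)."""
    intensity = flops / bytes_moved             # FLOP per byte of memory traffic
    memory_roof = intensity * peak_bandwidth    # FLOP/s the memory system can feed
    attainable = min(peak_flops, memory_roof)
    bound = "compute" if memory_roof >= peak_flops else "memory"
    return intensity, attainable, bound


if __name__ == "__main__":
    # FP32 vector add c = a + b over n elements:
    # 1 FLOP per element, 12 bytes per element (two 4-byte loads, one 4-byte store).
    n = 1 << 20
    # Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s peak bandwidth.
    print(roofline(flops=n, bytes_moved=12 * n,
                   peak_flops=10e12, peak_bandwidth=1e12))
```

A kernel with low arithmetic intensity, like the vector add here, sits on the sloped (memory-bound) part of the roofline, so more compute throughput would not speed it up.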