Lion EvoLved Sign Momentum: The New Optimizer Discovered by Google Brain

📌 According to the authors of the paper, a suitable learning rate for Lion is typically 3-10 times lower than that used with Adam(w). Since the effective weight decay is lr * λ, the value of the decoupled weight decay λ used for Lion is 3-10 times larger than that used with Adam(w) to maintain similar strength.

📌 The initial value, peak value and final value in learning rate scheduling must change simultaneously with the same relationship to Adam(w).

📌 Learning rate scheduling: The authors use the same learning rate schedule for Lion as for Adam(w) in the paper. However, they observed a higher gain when using a cosine decay schedule to train ViT, compared to a square -root decay schedule .

📌 β1 and β2: The default values ​​for β1 and β2 in Adam(w) are set to 0.9 and 0.999, respectively, with an ε of 1e−8, while in Lion, the default values ​​for β1 and β2 are discovered through the program discovery process and are set to 0.9 and 0.99, respectively.

📌 Just as humans reduce β2 to 0.99 or lower and increase ε to 1e-6 in Adam(w) to improve stability, using β1=0.95, β2=0.98 in Lion may also be helpful to mitigate instability during training, the authors suggest. This was corroborated by a researcher.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Translate »