Parameter | Description | Effect on VRAM | Effect on Training Time | Effect on Loss |
---|---|---|---|---|
Auto Find Batch Size | Adjusts batch size dynamically to avoid Out-Of-Memory errors. | Reduces VRAM usage when triggered. | May increase training time due to smaller batch sizes. | Minimal indirect effect. |
Distributed Backend | Specifies the multi-GPU training strategy, such as DDP (Distributed Data Parallel). | DDP replicates the full model on every GPU, so per-GPU VRAM is not reduced; sharded backends (e.g., DeepSpeed, FSDP) split memory across GPUs. | Speeds up training on multiple GPUs. | No direct effect. |
Mixed Precision | Uses lower precision (e.g., FP16) to reduce memory usage and improve speed. | Significantly reduces VRAM usage. | Speeds up training. | Minor potential numerical stability issues. |
Padding Side | Defines where padding is applied in input sequences (left or right). | Minimal impact. | Minimal impact. | Minimal impact unless it affects data alignment. |
Batch Size | Defines the number of samples processed per step (see the trainer configuration sketch after the table). | Larger values increase VRAM usage. | Larger values reduce training time per epoch. | Smaller batches give noisier gradient estimates, which can destabilize optimization. |
Gradient Accumulation | Accumulates gradients over multiple steps to simulate a larger batch size. | Keeps VRAM low by allowing a small per-device batch while reaching a larger effective batch. | Increases training time. | Stabilizes loss when per-device batches are small. |
Chat Template | Defines the structure of chat-based training inputs. | Minimal impact. | Minimal impact. | Can indirectly affect loss if templates interfere with learning. |
Evaluation Strategy | Determines when evaluations occur (e.g., per epoch). | Minimal impact. | Increases time with frequent evaluations. | No direct effect. |
Optimizer | Algorithm for updating weights (e.g., AdamW). | Stateful optimizers such as AdamW store extra moment estimates per trainable parameter, adding VRAM; 8-bit variants reduce this. | Depends on the efficiency of the optimizer. | Significant effect on convergence. |
Quantization | Reduces weight precision (e.g., INT4) to save memory. | Significantly reduces VRAM usage. | Can slightly slow training because quantized weights are dequantized on the fly. | May cause slight accuracy loss. |
Use Flash Attention | Memory-efficient attention mechanism. | Reduces VRAM usage. | Speeds up training, especially for long sequences. | No direct effect. |
Block Size | Maximum sequence length for training. | Larger values increase VRAM usage. | Larger values increase computation time. | Longer sequences provide more context, which can improve loss on long-context data. |
Epochs | Number of passes through the dataset. | No impact. | Training time scales roughly linearly with the number of epochs. | More epochs improve loss but risk overfitting. |
Learning Rate | Step size for parameter updates. | No impact. | No per-step impact, though a poorly chosen value can require more steps to converge. | Critical for convergence stability. |
LoRA Parameters | Control low-rank adaptation (rank, alpha, dropout) for parameter-efficient fine-tuning (see the LoRA sketch after the table). | Reduces VRAM usage because only small adapter matrices are trained. | Speeds up fine-tuning. | Helps maintain performance on small datasets. |
Scheduler | Adjusts the learning rate over time (e.g., linear). | No impact. | No impact. | Affects convergence behavior. |
Max Grad Norm | Clips gradients to avoid exploding gradients. | No impact. | Minimal impact. | Improves stability in training. |
Model Max Length | Defines the maximum input length (see the tokenizer sketch after the table). | Larger lengths increase VRAM usage. | Larger lengths increase computation time. | Longer inputs provide more context, which can improve loss. |
Warmup Proportion | Fraction of steps for learning rate warmup. | No impact. | Minimal impact. | Stabilizes training at the start. |
Seed | Sets the random seed for reproducibility. | No impact. | No impact. | No direct effect; makes runs reproducible. |
Weight Decay | Regularization method to prevent overfitting. | Minimal impact. | Minimal impact. | Reduces overfitting. |
Target Modules | Specifies which model modules (e.g., attention projections) receive fine-tuning adapters. | Limits memory usage by fine-tuning fewer layers. | Speeds up fine-tuning. | Targeting fewer modules limits how much the model can adapt, which may raise loss on harder tasks. |
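
Most of the trainer-level rows above (batch size, gradient accumulation, mixed precision, epochs, learning rate, scheduler, warmup, weight decay, gradient clipping, optimizer, evaluation strategy, seed) map directly onto Hugging Face `TrainingArguments`. The sketch below is illustrative only: the output directory and every numeric value are placeholder assumptions, not tuned recommendations.

```python
from transformers import TrainingArguments

# Minimal sketch: how the table's trainer-level parameters map to TrainingArguments.
# All values are illustrative placeholders, not recommendations.
args = TrainingArguments(
    output_dir="finetune-output",          # placeholder path
    auto_find_batch_size=True,             # Auto Find Batch Size
    per_device_train_batch_size=4,         # Batch Size
    gradient_accumulation_steps=4,         # Gradient Accumulation (effective batch = 4 * 4 = 16 per GPU)
    bf16=True,                             # Mixed Precision (use fp16=True on GPUs without bfloat16 support)
    num_train_epochs=3,                    # Epochs
    learning_rate=2e-4,                    # Learning Rate
    lr_scheduler_type="linear",            # Scheduler
    warmup_ratio=0.1,                      # Warmup Proportion
    weight_decay=0.01,                     # Weight Decay
    max_grad_norm=1.0,                     # Max Grad Norm
    optim="adamw_torch",                   # Optimizer
    eval_strategy="epoch",                 # Evaluation Strategy ("evaluation_strategy" in older releases)
    seed=42,                               # Seed
)
```

The batch-size/accumulation pair is the usual VRAM-versus-time trade: 4 samples per device with 4 accumulation steps gives the optimizer an effective batch of 16 without the VRAM cost of holding 16 samples at once.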
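
The quantization, Flash Attention, LoRA, and target-module rows usually come together in one model-loading step. The sketch below assumes the `transformers`, `peft`, `bitsandbytes`, and `flash-attn` packages; the model id, the LoRA values, and the `target_modules` names are assumptions and must match the architecture you actually fine-tune.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantization: load weights in 4-bit NF4 to cut VRAM, computing in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # placeholder model id
    quantization_config=bnb_config,            # Quantization
    attn_implementation="flash_attention_2",   # Use Flash Attention (requires flash-attn installed)
    device_map="auto",
)

# LoRA Parameters and Target Modules: only small adapter matrices are trained.
lora_config = LoraConfig(
    r=16,                                      # adapter rank
    lora_alpha=32,                             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],       # Target Modules (names are model-specific)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()             # confirms how few parameters actually train
```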
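
Padding side, model max length / block size, and the chat template are tokenizer-level settings rather than trainer settings. A minimal sketch, assuming a tokenizer that ships a chat template (the model id and the length value are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model id
tokenizer.padding_side = "right"      # Padding Side
tokenizer.model_max_length = 2048     # Model Max Length / Block Size

# Chat Template: renders a conversation into the exact string the model trains on.
messages = [
    {"role": "user", "content": "What does LoRA do?"},
    {"role": "assistant", "content": "It trains small low-rank adapters instead of the full model."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```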