Glossary of Training Parameters



| Parameter | Description | Effect on VRAM | Effect on Training Time | Effect on Loss |
|---|---|---|---|---|
| Auto Find Batch Size | Adjusts the batch size dynamically to avoid out-of-memory (OOM) errors. | Reduces VRAM usage when triggered. | May increase training time due to smaller batch sizes. | Minimal indirect effect. |
| Distributed Backend | Specifies the training strategy, such as DDP (Distributed Data Parallel). | Shares VRAM usage across GPUs. | Speeds up training on multiple GPUs. | No direct effect. |
| Mixed Precision | Uses lower precision (e.g., FP16) to reduce memory usage and improve speed. | Significantly reduces VRAM usage. | Speeds up training. | Minor potential numerical stability issues. |
| Padding Side | Defines where padding is applied in input sequences (left or right). | Minimal impact. | Minimal impact. | Minimal impact unless it affects data alignment. |
| Batch Size | Defines the number of samples processed per step. | Larger values increase VRAM usage. | Larger values reduce training time per epoch. | Smaller values can destabilize optimization. |
| Gradient Accumulation | Accumulates gradients over multiple steps to simulate larger batch sizes. | Reduces VRAM usage. | Increases training time. | Stabilizes loss with small batch sizes. |
| Chat Template | Defines the structure of chat-based training inputs. | Minimal impact. | Minimal impact. | Can indirectly affect loss if the template interferes with learning. |
| Evaluation Strategy | Determines when evaluations occur (e.g., per epoch). | Minimal impact. | Increases time with frequent evaluations. | No direct effect. |
| Optimizer | Algorithm used to update model weights (e.g., AdamW). | Minimal impact. | Depends on the optimizer's efficiency. | Significant effect on convergence. |
| Quantization | Reduces precision (e.g., INT4) to save memory. | Significantly reduces VRAM usage. | Slightly reduces training time. | May cause slight accuracy loss. |
| Use Flash Attention | Memory-efficient attention mechanism. | Reduces VRAM usage. | Speeds up training slightly. | No direct effect. |
| Block Size | Maximum sequence length for training. | Larger values increase VRAM usage. | Increases computation time. | Improves context understanding. |
| Epochs | Number of passes through the dataset. | No impact. | Increases linearly with more epochs. | More epochs can improve loss but risk overfitting. |
| Learning Rate | Step size for parameter updates. | No impact. | No impact. | Critical for convergence stability. |
| LoRA Parameters | Controls low-rank adaptation for fine-tuning. | Reduces VRAM usage for fine-tuning. | Speeds up fine-tuning. | Helps maintain performance on small datasets. |
| Scheduler | Adjusts the learning rate over time (e.g., linear). | No impact. | No impact. | Affects convergence behavior. |
| Max Grad Norm | Clips gradients to avoid exploding gradients. | No impact. | Minimal impact. | Improves training stability. |
| Model Max Length | Defines the maximum input length. | Larger lengths increase VRAM usage. | Increases computation time. | Improves context understanding. |
| Warmup Proportion | Fraction of steps used for learning rate warmup. | No impact. | Minimal impact. | Stabilizes training at the start. |
| Seed | Sets the random seed for reproducibility. | No impact. | No impact. | No direct effect; ensures reproducible results. |
| Weight Decay | Regularization method to prevent overfitting. | Minimal impact. | Minimal impact. | Reduces overfitting. |
| Target Modules | Specifies which parts of the model to fine-tune. | Limits memory usage by fine-tuning fewer layers. | Speeds up fine-tuning. | Limits flexibility in fine-tuning. |
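
As a concrete illustration, the sketch below shows one way several of the parameters above (batch size, gradient accumulation, mixed precision, optimizer, scheduler, warmup, weight decay, max grad norm, epochs, evaluation strategy, and seed) map onto `transformers.TrainingArguments`. The values and output directory are placeholder assumptions, not recommendations, and exact argument names can differ between `transformers` versions.

```python
# Minimal sketch: mapping glossary parameters onto TrainingArguments.
# Values are illustrative; argument names may vary across transformers versions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",            # placeholder checkpoint directory
    per_device_train_batch_size=4,     # Batch Size: samples per step per GPU
    gradient_accumulation_steps=8,     # Gradient Accumulation: effective batch = 4 * 8 = 32
    auto_find_batch_size=True,         # Auto Find Batch Size: back off automatically on OOM
    num_train_epochs=3,                # Epochs: passes through the dataset
    learning_rate=2e-4,                # Learning Rate: step size for parameter updates
    lr_scheduler_type="linear",        # Scheduler: how the learning rate decays over time
    warmup_ratio=0.1,                  # Warmup Proportion: fraction of steps to ramp up the LR
    weight_decay=0.01,                 # Weight Decay: regularization against overfitting
    max_grad_norm=1.0,                 # Max Grad Norm: clip gradients for stability
    optim="adamw_torch",               # Optimizer: AdamW
    fp16=True,                         # Mixed Precision: lower precision to save VRAM (needs a GPU)
    eval_strategy="epoch",             # Evaluation Strategy ("evaluation_strategy" in older versions)
    seed=42,                           # Seed: reproducibility
)
```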

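For the memory-related entries (Quantization, LoRA Parameters, Target Modules, Use Flash Attention, Model Max Length, Padding Side), a QLoRA-style setup is a common way to combine them. The sketch below uses the `bitsandbytes` and `peft` integrations in `transformers`; the model name, rank, alpha, and target module names are illustrative assumptions, and flash attention requires the `flash-attn` package and compatible hardware.

```python
# Minimal sketch: 4-bit quantization + LoRA adapters + flash attention.
# Model name, rank, alpha, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Quantization: load weights in INT4 (NF4) to cut VRAM at a small accuracy cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Use Flash Attention: memory-efficient attention
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 2048  # Model Max Length / Block Size: longer sequences cost more VRAM
tokenizer.padding_side = "right"   # Padding Side: where padding tokens are added

# LoRA Parameters + Target Modules: train small adapter matrices on selected layers only.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Target Modules: which layers receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are actually updated
```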

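The Chat Template entry refers to how conversational examples are rendered into plain token sequences before training. A minimal sketch using the tokenizer's built-in template is shown below; the model name is a placeholder and must ship a chat template for this to work.

```python
# Minimal sketch: rendering a chat-format example with the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # placeholder model

messages = [
    {"role": "user", "content": "What does gradient accumulation do?"},
    {"role": "assistant", "content": "It sums gradients over several small batches before updating."},
]

# The template controls the exact markup the model sees; a mismatched template
# can quietly hurt loss even though it uses no extra VRAM or compute.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```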