Parameter | Description | Effect on VRAM | Effect on Training Time | Effect on Loss |
---|---|---|---|---|
Auto Find Batch Size | Adjusts batch size dynamically to avoid Out-Of-Memory errors. | Reduces VRAM usage when triggered. | May increase training time due to smaller batch sizes. | Minimal indirect effect. |
Distributed Backend | Specifies the multi-GPU training strategy, such as DDP (Distributed Data Parallel). | DDP replicates the full model on every GPU, so per-GPU VRAM is not reduced; sharded backends (e.g., DeepSpeed, FSDP) split memory across GPUs. | Speeds up training on multiple GPUs. | No direct effect. |
Mixed Precision | Uses lower precision (e.g., FP16) to reduce memory usage and improve speed. | Significantly reduces VRAM usage. | Speeds up training. | Minor potential numerical stability issues. |
Padding Side | Defines where padding is applied in input sequences (left or right). | Minimal impact. | Minimal impact. | Minimal impact unless it affects data alignment. |
Batch Size | Defines the number of samples processed per step (see the trainer configuration sketch after the table). | Larger values increase VRAM usage. | Larger values reduce training time per epoch. | Smaller batches give noisier gradient estimates, which can destabilize optimization. |
Gradient Accumulation | Accumulates gradients over multiple steps to simulate a larger batch size. | Keeps VRAM low by allowing a small per-device batch while reaching a larger effective batch. | Increases training time. | Stabilizes loss when per-device batches are small. |
Chat Template | Defines the structure of chat-based training inputs. | Minimal impact. | Minimal impact. | Can indirectly affect loss if templates interfere with learning. |
Evaluation Strategy | Determines when evaluations occur (e.g., per epoch). | Minimal impact. | Increases time with frequent evaluations. | No direct effect. |
Optimizer | Algorithm for updating weights (e.g., AdamW). | Stateful optimizers such as AdamW store extra moment estimates per trainable parameter, adding VRAM; 8-bit variants reduce this. | Depends on the efficiency of the optimizer. | Significant effect on convergence. |
Quantization | Reduces weight precision (e.g., INT4) to save memory. | Significantly reduces VRAM usage. | Can slightly slow training because quantized weights are dequantized on the fly. | May cause slight accuracy loss. |
Use Flash Attention | Memory-efficient attention mechanism. | Reduces VRAM usage. | Speeds up training, especially for long sequences. | No direct effect. |
Block Size | Maximum sequence length for training. | Larger values increase VRAM usage. | Larger values increase computation time. | Longer sequences provide more context, which can improve loss on long-context data. |
Epochs | Number of passes through the dataset. | No impact. | Training time scales roughly linearly with the number of epochs. | More epochs improve loss but risk overfitting. |
Learning Rate | Step size for parameter updates. | No impact. | No per-step impact, though a poorly chosen value can require more steps to converge. | Critical for convergence stability. |
LoRA Parameters | Control low-rank adaptation (rank, alpha, dropout) for parameter-efficient fine-tuning (see the LoRA sketch after the table). | Reduces VRAM usage because only small adapter matrices are trained. | Speeds up fine-tuning. | Helps maintain performance on small datasets. |
Scheduler | Adjusts the learning rate over time (e.g., linear). | No impact. | No impact. | Affects convergence behavior. |
Max Grad Norm | Clips gradients to avoid exploding gradients. | No impact. | Minimal impact. | Improves stability in training. |
Model Max Length | Defines the maximum input length (see the tokenizer sketch after the table). | Larger lengths increase VRAM usage. | Larger lengths increase computation time. | Longer inputs provide more context, which can improve loss. |
Warmup Proportion | Fraction of steps for learning rate warmup. | No impact. | Minimal impact. | Stabilizes training at the start. |
Seed | Sets the random seed for reproducibility. | No impact. | No impact. | No direct effect; makes runs reproducible. |
Weight Decay | Regularization method to prevent overfitting. | Minimal impact. | Minimal impact. | Reduces overfitting. |
Target Modules | Specifies which model modules (e.g., attention projections) receive fine-tuning adapters. | Limits memory usage by fine-tuning fewer layers. | Speeds up fine-tuning. | Targeting fewer modules limits how much the model can adapt, which may raise loss on harder tasks. |
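
Most of the trainer-level rows above (batch size, gradient accumulation, mixed precision, epochs, learning rate, scheduler, warmup, weight decay, gradient clipping, optimizer, evaluation strategy, seed) map directly onto Hugging Face `TrainingArguments`. The sketch below is illustrative only: the output directory and every numeric value are placeholder assumptions, not tuned recommendations.

```python
from transformers import TrainingArguments

# Minimal sketch: how the table's trainer-level parameters map to TrainingArguments.
# All values are illustrative placeholders, not recommendations.
args = TrainingArguments(
    output_dir="finetune-output",          # placeholder path
    auto_find_batch_size=True,             # Auto Find Batch Size
    per_device_train_batch_size=4,         # Batch Size
    gradient_accumulation_steps=4,         # Gradient Accumulation (effective batch = 4 * 4 = 16 per GPU)
    bf16=True,                             # Mixed Precision (use fp16=True on GPUs without bfloat16 support)
    num_train_epochs=3,                    # Epochs
    learning_rate=2e-4,                    # Learning Rate
    lr_scheduler_type="linear",            # Scheduler
    warmup_ratio=0.1,                      # Warmup Proportion
    weight_decay=0.01,                     # Weight Decay
    max_grad_norm=1.0,                     # Max Grad Norm
    optim="adamw_torch",                   # Optimizer
    eval_strategy="epoch",                 # Evaluation Strategy ("evaluation_strategy" in older releases)
    seed=42,                               # Seed
)
```

The batch-size/accumulation pair is the usual VRAM-versus-time trade: 4 samples per device with 4 accumulation steps gives the optimizer an effective batch of 16 without the VRAM cost of holding 16 samples at once.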
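
The quantization, Flash Attention, LoRA, and target-module rows usually come together in one model-loading step. The sketch below assumes the `transformers`, `peft`, `bitsandbytes`, and `flash-attn` packages; the model id, the LoRA values, and the `target_modules` names are assumptions and must match the architecture you actually fine-tune.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantization: load weights in 4-bit NF4 to cut VRAM, computing in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # placeholder model id
    quantization_config=bnb_config,            # Quantization
    attn_implementation="flash_attention_2",   # Use Flash Attention (requires flash-attn installed)
    device_map="auto",
)

# LoRA Parameters and Target Modules: only small adapter matrices are trained.
lora_config = LoraConfig(
    r=16,                                      # adapter rank
    lora_alpha=32,                             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],       # Target Modules (names are model-specific)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()             # confirms how few parameters actually train
```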
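
Padding side, model max length / block size, and the chat template are tokenizer-level settings rather than trainer settings. A minimal sketch, assuming a tokenizer that ships a chat template (the model id and the length value are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model id
tokenizer.padding_side = "right"      # Padding Side
tokenizer.model_max_length = 2048     # Model Max Length / Block Size

# Chat Template: renders a conversation into the exact string the model trains on.
messages = [
    {"role": "user", "content": "What does LoRA do?"},
    {"role": "assistant", "content": "It trains small low-rank adapters instead of the full model."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```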