Training NLP models from scratch takes hundreds of hours of training time, so in practice we fine-tune a pretrained model instead. With the transformers library this means loading the pretrained weights via from_pretrained(), for example BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2); any weights not present in the checkpoint (such as the new classification head) are instantiated randomly. But what hyperparameters should we use for this fine-tuning?

One hyperparameter we will pay particular attention to is weight decay. Weight decay, or L2 regularization, is a regularization technique applied to the weights of a neural network: at every update a constant times the weight is subtracted from the weight, which keeps the weights small.

The library provides a simple but feature-complete training and evaluation interface, and thanks to the tight interoperability between the TensorFlow and PyTorch implementations you can even save a model and then reload it as a PyTorch model (or vice-versa). To calculate additional metrics in addition to the loss, you can also define your own compute_metrics function (TFTrainer() expects the passed datasets to be TensorFlow Dataset objects). Many of the knobs we will touch are exposed directly as training arguments:

- adam_beta2 (float, defaults to 0.999): the beta2 hyperparameter for the AdamW optimizer.
- adam_epsilon (float, defaults to 1e-8): the epsilon to use in Adam.
- warmup_steps (int): the number of steps for the warmup part of training, during which the learning rate increases linearly from 0 to the initial lr set in the optimizer.
- include_in_weight_decay (List[str], optional): list of the parameter names (or re patterns) to apply weight decay to.
- dataloader_num_workers: 0 means that the data will be loaded in the main process.
- tpu_num_cores: when training on TPU, the number of TPU cores (automatically passed by the launcher script).
- metric_for_best_model / greater_is_better: greater_is_better defaults to False if metric_for_best_model is not set, or is set to "loss" or "eval_loss".
- do_predict: whether to run predictions on the test set.
- report_to: the integrations to log to ("comet_ml", "mlflow", "tensorboard", "wandb").
- output_dir: can also point to a checkpoint directory when resuming training.

We will compare three ways of choosing the remaining hyperparameters. Compared to a standard grid search baseline, we will see that Bayesian optimization provides a 1.5% accuracy improvement and Population Based Training provides a 5% improvement; picking the best Population Based Training configuration gives a test set accuracy of 70.5%. But first, the baseline: we start with a simple grid search over a set of pre-defined hyperparameters. The whole experiment took ~6 minutes to run, and out of these trials the final validation accuracy for the top 5 ranged from 71% to 74%. Although it only took ~6 minutes to run the 18 trials, grid search scales poorly: every new value that we want to search over means 6 additional trials.
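As a concrete starting point, here is a minimal sketch of what such a grid-search baseline could look like with the built-in Trainer. The hyperparameter values, the train_dataset/val_dataset objects and the metric key are illustrative assumptions rather than the exact settings used in the experiments above; with three learning rates, three weight decays and two epoch counts it covers 18 trials.

    import itertools
    import numpy as np
    from transformers import BertForSequenceClassification, Trainer, TrainingArguments

    def compute_metrics(eval_pred):
        # EvalPrediction unpacks into (predictions, label_ids)
        logits, labels = eval_pred
        return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

    learning_rates = [2e-5, 3e-5, 5e-5]   # illustrative grid
    weight_decays = [0.0, 0.01, 0.1]
    epoch_counts = [2, 3]

    best_acc, best_config = 0.0, None
    for lr, wd, epochs in itertools.product(learning_rates, weight_decays, epoch_counts):
        model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
        args = TrainingArguments(
            output_dir=f"./results/lr{lr}_wd{wd}_ep{epochs}",
            learning_rate=lr,
            weight_decay=wd,
            num_train_epochs=epochs,
            per_device_train_batch_size=16,
            evaluation_strategy="epoch",
        )
        trainer = Trainer(model=model, args=args,
                          train_dataset=train_dataset,   # assumed to be defined elsewhere
                          eval_dataset=val_dataset,      # assumed to be defined elsewhere
                          compute_metrics=compute_metrics)
        trainer.train()
        acc = trainer.evaluate()["eval_accuracy"]
        if acc > best_acc:
            best_acc, best_config = acc, (lr, wd, epochs)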
The simple grid search did alright, but it had a very limited search space and only considered 3 hyperparameters. Interestingly, we see that weight_decay is the second most important hyperparameter in the importance analysis, showing the importance of searching over more hyperparameters. We also conclude with a couple of tips and tricks for hyperparameter tuning for Transformer models.

How is weight decay actually applied? We are subtracting a constant times the weight from the original weight at each step. The library ships the pieces needed to do this correctly:

- an optimizer with weight decay fixed that can be used to fine-tune models (AdamW on the PyTorch side, AdamWeightDecay on the TensorFlow side, with defaults beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07 and weight_decay_rate = 0.0, plus adam_clipnorm for optional gradient clipping);
- a create_optimizer(init_lr: float, ...) helper whose weight_decay argument (float, optional, defaults to 0) is the weight decay to apply, and which also builds the schedule;
- schedules such as the linear decay (just as with PyTorch, it creates a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0 after a warmup period) and PolynomialDecay, whose power argument (float, optional) defaults to 1.0;
- the Adafactor optimizer (paper: "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost", https://arxiv.org/abs/1804.04235); the adafactor flag (bool, optional, defaults to False) tells the Trainer to use it instead of AdamW, and an external learning rate is only honored with relative_step=False.

For comparison, PyTorch's plain Adam exposes weight_decay as an L2 penalty (default: 0), amsgrad, whether to use the AMSGrad variant of the algorithm from the paper "On the Convergence of Adam and Beyond" (default: False), and foreach, whether the multi-tensor implementation of the optimizer is used (default: None).

A few more training arguments that matter here: num_train_epochs (total number of training epochs to perform), dataloader_drop_last (whether to drop the last incomplete batch if the length of the dataset is not divisible by the batch size), eval_steps (number of update steps between two evaluations if evaluation_strategy="steps"), gradient_accumulation_steps (number of update steps to accumulate the gradients for before performing a backward/update pass; gradients are accumulated locally on each replica, without synchronization), warmup_steps (defaults to 0; number of steps used for a linear warmup from 0 to learning_rate), label_smoothing_factor (zero means no label smoothing; otherwise the underlying one-hot-encoded labels are changed from 0s and 1s to label_smoothing_factor/num_labels and 1 - label_smoothing_factor + label_smoothing_factor/num_labels), and ignore_data_skip (when resuming training, whether or not to skip the first epochs and batches to get to the same training data).

Finally, a question that comes up in the library's Questions & Help forum: "I notice that we should set weight decay of bias and LayerNorm.weight to zero and set weight decay of other parameters in BERT to 0.01." That is indeed the convention used in the example scripts: weight decay is applied to all parameters except bias and layer norm parameters, and the include_in_weight_decay / exclude_from_weight_decay arguments express the same idea by parameter name or regex pattern (if include_in_weight_decay is passed, the names in it supersede the exclusion list).
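In PyTorch this grouping is usually written with two parameter groups; a minimal sketch, assuming model is the BERT model instantiated above:

    from torch.optim import AdamW  # transformers' AdamW implementation can be used the same way

    # Parameters whose names contain these substrings get no weight decay.
    no_decay = ["bias", "LayerNorm.weight"]
    grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.01,
        },
        {
            "params": [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(grouped_parameters, lr=5e-5, eps=1e-8)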
Taking the best configuration from the grid search, we get a test set accuracy of 65.4%, which leaves room for improvement.

This guide assumes that you are familiar with training deep neural networks in either PyTorch or TensorFlow. It shows how to use the included Trainer() class, and the Transformers Notebooks contain dozens of example notebooks from the community; the same recipe carries over to other tasks, for instance fine-tuning BERT for state-of-the-art named entity recognition. One detail worth remembering: when we call a classification model with the labels argument, the first returned element is the cross-entropy loss between the predictions and the labels.

Some additional options and utilities referenced below:

- per_device_eval_batch_size: the batch size per GPU/TPU core/CPU for evaluation.
- label_names: the list of keys in your dictionary of inputs that correspond to the labels (for XxxForQuestionAnswering models this defaults to ["start_positions", "end_positions"]).
- eval_accumulation_steps: if left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but requires more memory).
- ParallelMode.DISTRIBUTED: several GPUs, each having its own process; find_unused_parameters is the flag passed to DistributedDataParallel in that mode, and dataloader_pin_memory controls whether to pin memory for the DataLoader.
- init_lr: the desired learning rate at the end of the warmup phase; decay_schedule_fn: the schedule function to apply after the warmup for the rest of training.
- a gradient accumulation utility for accumulating the gradients of multiple batches.
- Adafactor's own knobs: eps (defaults to (1e-30, 1e-3)), the regularization constants for the square gradient and parameter scale respectively; clip_threshold (defaults to 1.0), the threshold of the root mean square of the final gradient update; decay_rate (defaults to -0.8), the coefficient used to compute running averages of the square gradient; beta1, the coefficient used for computing running averages of the gradient; weight_decay (defaults to 0); scale_parameter (defaults to True), so the learning rate is scaled by the root mean square of the parameter; relative_step (defaults to True), so a time-dependent learning rate is computed instead of using an external one; and warmup_init (defaults to False), which makes that time-dependent learning rate start from a warm-up initialization.

To beat the grid search we turn to guided search. Bayesian optimization uses the outcomes of completed trials to choose which configurations to try next. Population Based Training goes one step further: it still uses guided hyperparameter search, but it doesn't need to restart training for new hyperparameter configurations, since poorly performing trials periodically copy the weights of better ones and perturb their hyperparameters while training continues. The top few runs get a validation accuracy ranging from 72% to 77%. We also use Weights & Biases to visualize our results (click here to view the plots on W&B), and you can follow any individual run by launching TensorBoard in your specified logging_dir directory.
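Below is a sketch of how such a Population Based Training run can be wired up through Trainer.hyperparameter_search with the Ray Tune backend, assuming the Trainer was built with a model_init function so that each trial starts from fresh weights. The search ranges, population size and reported metric name are illustrative assumptions rather than the exact settings behind the numbers above; extra keyword arguments are forwarded to ray.tune.run.

    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining

    pbt = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="objective",          # assumed name of the value the Trainer reports
        mode="max",
        perturbation_interval=1,     # exploit/explore after every reported iteration
        hyperparam_mutations={
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "weight_decay": tune.uniform(0.0, 0.3),
            "per_device_train_batch_size": [16, 32, 64],
        },
    )

    best_run = trainer.hyperparameter_search(
        hp_space=lambda _: {
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "weight_decay": tune.uniform(0.0, 0.3),
            "per_device_train_batch_size": tune.choice([16, 32, 64]),
        },
        backend="ray",
        n_trials=8,                  # population size
        scheduler=pbt,               # forwarded to ray.tune.run
    )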
This way we can start more runs in parallel and thus test a larger number of hyperparameter configurations.

Before tuning anything, though, it helps to have the basic fine-tuning loop in place. First you install the transformers package by huggingface (pip install transformers). Fine-tuning in the transformers library involves using a pre-trained model and a tokenizer that is compatible with that model's architecture; the checkpoint name usually doubles as the pretrained tokenizer name. The library covers both inference and optimization (if you only need models for inference, see the task summary). For optimization it provides:

- AdamW, which implements the Adam algorithm with the weight decay fix introduced in Decoupled Weight Decay Regularization; correct_bias defaults to True, epsilon / adam_epsilon (a small constant for numerical stability) defaults to 1e-7 or 1e-8 depending on the class, and schedules are attached for num_training_steps (int), the total number of training steps.
- A few learning rate scheduling tools, implemented as torch.optim.lr_scheduler.LambdaLR with the appropriate schedule: a constant schedule using the learning rate set in the optimizer; a linear warmup that increases the rate from 0 to the initial lr and then decays it; a cosine schedule that decreases following the values of the cosine function after the warmup period, where num_cycles (float, optional, defaults to 0.5) is the number of waves, so the default just decreases from the max value to 0; and PolynomialDecay, with power (float, optional) defaulting to 1.0.
- create_optimizer, which creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.
- closure support in the optimizer step (closure is an optional callable that reevaluates the model and returns the loss), and clipnorm-style clipping on the TensorFlow side.

A handful of other arguments show up in the example scripts: max_steps (if > 0, sets the total number of training steps to perform and overrides num_train_epochs), run_name (an optional descriptor for the run), the Apex AMP optimization level for fp16 (see the Apex documentation), load_best_model_at_end (when set to True, save_steps is ignored and the model is saved after each evaluation), and ParallelMode.NOT_DISTRIBUTED (several GPUs in one single process, using torch.nn.DataParallel, as opposed to torch.nn.DistributedDataParallel with one process per GPU). TrainingArguments can be serialized to a JSON string, with a sanitized serialization available for TensorBoard's hparams. To calculate metrics beyond the loss, define your own compute_metrics function and pass it to the trainer, as in the grid-search sketch earlier. When saving a model for inference, it is only necessary to save the trained model's learned parameters.

Weight decay itself is a form of regularization: after calculating the gradients and taking the step, we multiply the weights by a constant slightly smaller than 1 (e.g., 0.99), which amounts to subtracting a constant times the weight from the original weight. A recurring question concerns the AdamW optimizer's default weight_decay value; we come back to it shortly. (This notebook uses HuggingFace's datasets library to get the data; in the PyTorch Lightning version the data is wrapped in a LightningDataModule.) The familiar fragment "warmup_steps = 500, weight_decay = 0.01, logging_dir = './logs'" comes from the canonical TrainingArguments example, shown in full below.
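That snippet, filled out, looks roughly like the following; model, train_dataset and val_dataset are assumed to have been created beforehand, and the directory names are placeholders.

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="./results",          # output directory for checkpoints
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir="./logs",            # directory for TensorBoard logs
    )

    trainer = Trainer(
        model=model,                     # the instantiated model to be trained
        args=training_args,
        train_dataset=train_dataset,     # training dataset
        eval_dataset=val_dataset,        # evaluation dataset
    )
    trainer.train()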
Now to that question: does the default weight_decay of 0.0 in transformers.AdamW make sense? In the Docs we can clearly see that the AdamW optimizer sets the default weight decay to 0.0, so if none is passed, no decay is applied at all. Given that making weight decay work properly is the whole point of AdamW, shouldn't it make more sense to have the default weight decay for AdamW be greater than 0? Tuning the value certainly matters: in the tests we ran, the best learning rate with L2 regularization was 1e-6 (with a maximum learning rate of 1e-3), while 0.3 was the best value for weight decay (with a learning rate of 3e-3).

AdamW() implements Adam with the usual gradient bias correction as well as decoupled weight decay. Its params argument is an iterable of torch.nn.parameter.Parameter to optimize, or dictionaries defining parameter groups; betas (defaults to (0.9, 0.999)) are the coefficients used for computing running averages of the gradient and its square; lr defaults to 1e-3. In the example scripts the training sampler is chosen as train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset), and the data collator prepares everything we might need to pass to the model. On the TensorFlow side, AdamWeightDecay accepts a learning_rate that is either a float or a LearningRateSchedule (defaulting to 0.001), Adam enables L2 weight decay and clip_by_global_norm on gradients, and a GradientAccumulator class accumulates the gradients of multiple batches: when used with a distribution strategy, the accumulator should be called in a replica context; then call .gradients, scale the gradients if required, and pass the result to apply_gradients. One caveat: additional optimizer operations like gradient clipping should not be used alongside Adafactor.

The schedules come as schedule objects that inherit from _LRSchedule: a constant schedule with the learning rate set in the optimizer, a linear warmup that increases the rate linearly between 0 and the initial lr set in the optimizer before decaying it, a cosine schedule, and a cosine schedule with hard restarts whose num_cycles (int) defaults to 1. get_scheduler is the unified API to get any scheduler from its name, where name is a string or a SchedulerType.
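For instance, a linear schedule with warmup can be created either by name or with the dedicated helper; model is assumed to be defined as before and the step counts are placeholders.

    from torch.optim import AdamW
    from transformers import get_scheduler, get_linear_schedule_with_warmup

    optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)

    lr_scheduler = get_scheduler(
        name="linear",            # a SchedulerType name
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=1000,
    )
    # Equivalent explicit call:
    # lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)

    # In the training loop, step the scheduler once per optimizer step:
    #   loss.backward(); optimizer.step(); lr_scheduler.step(); optimizer.zero_grad()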
With these tools in hand, Bayesian optimization does noticeably better than the grid: on our test set, we pick the best configuration and get an accuracy of 66.9%, a 1.5 percent improvement over the best configuration from grid search.

A few notes on regularization during fine-tuning. With plain L2 regularization we minimize a loss function comprising both the primary loss function and a penalty on the L2 norm of the weights, $L_{new}(w) = L_{original}(w) + \lambda w^{T} w$, where $\lambda$ is a value determining the strength of the penalty. Note: if training the BERT layers too, try the Adam optimizer with weight decay, which can help reduce overfitting and improve generalization [1]. In the optimizer classes, weight decay is applied to all parameters by default (unless they are in exclude_from_weight_decay); include_in_weight_decay (List[str], optional) lists the parameter names (or re patterns) to apply weight decay to; the value for the params key should be a list of named parameters when you build parameter groups by hand; and correct_bias (bool, optional, defaults to True) controls whether to correct the bias in Adam (for instance, in the BERT TF repository they use False). Adafactor internally adjusts the learning rate depending on scale_parameter, relative_step and warmup_init, which is another reason not to combine it with additional optimizer operations such as gradient clipping.

The Trainer conveniently handles the moving parts of training Transformers models: you can train, fine-tune, and evaluate, and of course you can train on GPU by calling to('cuda') on the model and inputs. Related options: the actual batch size for evaluation may differ from per_gpu_eval_batch_size in distributed training; sharded_ddp (bool, optional, defaults to False) enables Sharded DDP training from FairScale in distributed mode; DeepSpeed is configured through a ds_config.json file; a label smoothing epsilon can be applied (zero means no label smoothing); group_by_length decides whether or not to group samples of roughly the same length together when batching; output_dir is overwritten by the env variable SM_OUTPUT_DATA_DIR on SageMaker; and mixed precision training with AMP or APEX (--fp16) can only be used on CUDA devices. When using gradient accumulation, one step is counted as one step with a backward pass.

For a TensorFlow end-to-end example, we use tensorflow_datasets to load the MRPC dataset from GLUE, tokenize it, and convert it to a TensorFlow Dataset object; the model can then be compiled and trained as any Keras model.

PyTorch also ships Stochastic Weight Averaging (SWA) utilities: the torch.optim.swa_utils.AveragedModel class implements SWA models, torch.optim.swa_utils.SWALR implements the SWA learning rate scheduler, and torch.optim.swa_utils.update_bn() is a utility function used to update SWA batch normalization statistics at the end of training.
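A minimal sketch of how these three utilities fit together; the model, data loader, loss computation and epoch counts are placeholders, not part of the experiments above.

    import torch
    from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
    swa_model = AveragedModel(model)          # keeps a running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=0.005)
    swa_start = 5                             # epoch at which averaging begins

    for epoch in range(10):
        for batch in loader:                  # `loader` assumed to yield training batches
            loss = compute_loss(model, batch) # placeholder loss function
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        if epoch >= swa_start:
            swa_model.update_parameters(model)
            swa_scheduler.step()

    update_bn(loader, swa_model)              # recompute BatchNorm statistics for the averaged model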
Stepping back: in this quickstart we have shown how to fine-tune (or train from scratch) a model using the standard training tools available in either framework, focusing on the nuances of training models in native PyTorch and TF2. A few loose ends worth collecting:

- seed (int, optional, defaults to 42): the random seed that will be set at the beginning of training; to ensure reproducibility across runs, use the model_init function to instantiate the model if it has some randomly initialized parameters.
- load_best_model_at_end: whether or not to load the best model found during training at the end of training.
- For the schedules, lr_end (float, optional, defaults to 1e-7) is the end LR of the polynomial decay, min_lr_ratio (float, optional, defaults to 0) means the final learning rate at the end of the linear decay will be init_lr * min_lr_ratio, and decay_schedule_fn drives the optimizer to the end lr defined by lr_end after a warmup period during which the rate increases linearly from 0 to the initial lr set in the optimizer.
- transformers.create_optimizer(init_lr: float, num_train_steps: int, ...) bundles an optimizer and such a schedule; on the optimizer itself, lr (float, optional, defaults to 1e-3) is the learning rate to use and weight_decay (float, optional, defaults to 0) is the decoupled weight decay to apply.
- Saving: saving the model's state_dict with the torch.save() function will give you the most flexibility for restoring the model later, which is why it is the recommended method for saving models; a common PyTorch convention is to save models using either a .pt or .pth file extension.
- The Trainer can be used to train with distributed strategies and even on TPU, and if you're inclined to try the hyperparameter search on a multi-node cluster, feel free to give the Ray Cluster Launcher a shot to easily start up a cluster on AWS.

Put together, the guided searches let us train a model with 5% better accuracy in the same amount of time, and the key takeaway here is that Population Based Training is the most effective approach to tune the hyperparameters of the Transformer model. (Figure: the learning rate and weight decay during the training process.)

One last point on weight decay and Adam. Just adding the square of the weights to the loss function is not the correct way of using L2 regularization/weight decay with Adam, since that penalty interacts with the m and v parameters (the running moment estimates) in strange ways, as shown in Decoupled Weight Decay Regularization. Instead we want to decay the weights in a manner that doesn't interact with the m/v parameters, which is what the decoupled formulation does (and, if none is passed explicitly, it is applied to all parameters). Relatedly, in Scaling Vision Transformers a stronger decay on the head surprisingly yields the best results; the authors speculate that a strong weight decay in the head results in representations with a larger margin between classes.
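The "Ist / IInd" code fragment from the notes, filled out as a sketch (wd is the weight-decay coefficient, lr the learning rate, and model an arbitrary torch.nn.Module; illustrative, not a full optimizer implementation):

    import torch

    # 1) L2 regularization: add the squared weights to the loss, so the penalty flows
    #    through the gradients and, with Adam, through the m/v moment estimates.
    l2_penalty = wd * sum((p ** 2).sum() for p in model.parameters()) / 2
    final_loss = loss + l2_penalty

    # 2) Decoupled weight decay: leave the loss unchanged and shrink each weight
    #    directly at update time, bypassing the m/v statistics entirely.
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(1 - lr * wd)   # i.e. w = w - lr * wd * w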
This is also the crux of the default-value question raised earlier: in the Docs we can clearly see that the AdamW optimizer sets the default weight decay to 0.0, and given that the whole purpose of AdamW is to decouple the weight decay regularization, my understanding is that the results anyone can get with AdamW and with Adam, if both are used with weight_decay=0.0 (that is, without weight decay), should be exactly the same. The grouped-parameter pattern shown earlier is what the example scripts actually do; see huggingface/transformers/blob/a75c64d80c76c3dc71f735d9197a4a601847e0cd/examples/contrib/run_openai_gpt.py#L230-L237. For amsgrad (bool, optional, defaults to False), whether to apply the AMSGrad variant of the algorithm, see "On the Convergence of Adam and Beyond"; for a broader, disciplined approach to choosing learning rate, batch size, momentum and weight decay, see arXiv preprint arXiv:1803.09820, 2018.

The learning rate schedule deserves the same care as the optimizer. For instance, the original Transformer paper used a warmup followed by an inverse square-root decay of the learning rate, and the library's cosine schedule decreases the rate following a half-cosine after the warmup. Recurring arguments in the optimization module include optimizer (the optimizer for which to schedule the learning rate), num_train_steps (int, the total number of training steps), last_epoch (int, defaults to -1), and power (float, optional, defaults to 1; the power to use for the polynomial warmup, the default being a linear warmup) for the schedules, plus beta_1 (float, optional, defaults to 0.9; the exponential decay rate for the 1st momentum estimates) and weight_decay_rate (float, optional, defaults to 0; the weight decay to use) for the optimizers.

On the Trainer side, prediction_loss_only (bool, optional, defaults to False) makes evaluation and prediction return only the loss; output_dir is only optional if it can get inferred from the environment; greater_is_better is used in conjunction with load_best_model_at_end and metric_for_best_model to specify whether better models should have a greater metric or not; per_device_train_batch_size (int, optional, defaults to 8) is the batch size per GPU/TPU core/CPU for training; gradient_accumulation_steps is the number of update steps to accumulate before performing a backward/update pass, so logging, evaluation and saving are conducted every gradient_accumulation_steps * logging/eval/save steps of actual training; and dataloader_num_workers = 0 means that the data will be loaded in the main process. The library also includes a number of task-specific final layers or heads on top of the base models, and the TensorFlow models can be instantiated with the same from_pretrained() interface. For memory-constrained setups there is Adafactor (ported from https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py), whose lr argument is the optional external learning rate.
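A sketch of using Adafactor with its internal, time-dependent learning rate (warmup_init requires relative_step=True); whether this or an external learning rate with relative_step=False works better depends on the model, and model is again assumed to be defined.

    from transformers import Adafactor

    optimizer = Adafactor(
        model.parameters(),
        scale_parameter=True,   # scale the step by the RMS of each parameter
        relative_step=True,     # compute a time-dependent learning rate internally
        warmup_init=True,       # start that internal rate from a warmup
        lr=None,                # no external learning rate
    )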
A few last configuration notes: beta_2 (float, optional, defaults to 0.999) is the exponential decay rate for the 2nd momentum estimates in Adam; learning_rate (float, optional, defaults to 5e-5) is the initial learning rate for the AdamW optimizer in the Trainer; evaluation_strategy="no" means no evaluation is done during training; logging_steps controls how often metrics are logged, and eval_steps defaults to the same value if not set; weight_decay_rate (float, optional, defaults to 0) is the weight decay to use; and report_to is the list of integrations to report the results and logs to. Remember to call model.train() to put the model in training mode, and you can view the results, including any calculated metrics, in whichever logging integration you chose.

On the terminology, the distinction is worth keeping straight: L2 regularization adds a penalty on the squared weights to the loss function to discourage large weights, while decoupled weight decay shrinks the weights directly during the update; with Adam-style optimizers only the latter behaves as intended. With weight_decay=0.0 the two optimizers coincide, so the default value is not much of a major issue in practice, but it may be a factor when comparing results.

To wrap up, we fine-tune BERT using more advanced search algorithms like Bayesian Optimization and Population Based Training, and both improve on the simple grid search baseline.
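For completeness, here is a sketch of the Bayesian-style search through the same Trainer.hyperparameter_search interface, using the Optuna backend (whose default TPE sampler stands in for the Bayesian optimizer discussed above); the ranges and trial count are illustrative assumptions, and the Trainer again needs a model_init.

    def hp_space(trial):
        # Ranges are illustrative, not the ones used in the experiments above.
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
            "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3),
            "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        }

    best_run = trainer.hyperparameter_search(
        hp_space=hp_space,
        direction="maximize",
        backend="optuna",
        n_trials=20,
    )
    print(best_run.hyperparameters)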