Metrics, speed improvements, new hooks and flags
Pre-release
Overview
The highlights of this release are the brand new Metrics package plus new hooks and flags to customize your workflow.
Major features:
- brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki) (example below)
- `hparams` can now be anything! (call `self.save_hyperparameters()` to register anything in the `__init__`)
- many speed improvements (how we move data, adjusted some flags & PL now adds only ~300 ms of overhead per epoch!)
- much faster `ddp` implementation; the old one was renamed `ddp_spawn`
- better support for Hydra
- added the `overfit_batches` flag and corrected some bugs with the `limit_[train,val,test]_batches` flags
- added conda support
- tons of bug fixes 😉
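As a quick illustration of the first two highlights, here is a minimal sketch of a LightningModule using `self.save_hyperparameters()` and a metric from the new Metrics package. The functional `accuracy` import path follows this release's docs (adjust it if your installed version differs), and the model itself is purely illustrative:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy  # new Metrics package


class LitClassifier(pl.LightningModule):
    def __init__(self, learning_rate=1e-3):
        super().__init__()
        # hparams can now be anything: this registers every __init__ argument
        # under self.hparams and stores it in checkpoints
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x.view(x.size(0), -1))
        loss = F.cross_entropy(logits, y)
        acc = accuracy(torch.argmax(logits, dim=1), y)  # functional metric from the new package
        return {"loss": loss, "log": {"train_acc": acc}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
```

The reworked `ddp` backend is selected as before with `Trainer(distributed_backend='ddp')`; the previous spawn-based implementation stays available as `distributed_backend='ddp_spawn'`.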
Detailed changes
Added
- Added `overfit_batches`, `limit_{val|test}_batches` flags (overfit now uses training set for all three) (#2213) (see the sketch after this list)
- Added metrics
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723)
- Allow dataloaders without sampler field present (#1907)
- Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` (#1908)
- Early stopping checks `on_validation_end` (#1458)
- Attribute `best_model_path` to `ModelCheckpoint` for storing and later retrieving the path to the best saved model file (#1799)
- Speed up single-core TPU training by loading data using `ParallelLoader` (#2033)
- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device (#1756)
- Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as `ddp_spawn` (#2115)
- Added loading checkpoints from URLs (#1667)
- Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training (#2134)
- Added a decorator `auto_move_data` that moves data to the correct device when using the LightningModule for inference (#1905)
- Added `ckpt_path` option to `LightningModule.test(...)` to load a particular checkpoint (#2190)
- Added `setup` and `teardown` hooks for model (#2229)
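A rough sketch tying several of these additions together: the `setup`/`teardown` and `transfer_batch_to_device` hooks, the `auto_move_data` decorator, `save_last`, and the `overfit_batches` flag. The model body and the dict-shaped batch are illustrative assumptions, not part of the release:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.core.decorators import auto_move_data


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def setup(self, stage):
        # new hook: runs at the start of fit/test, e.g. to build datasets
        pass

    def transfer_batch_to_device(self, batch, device):
        # new hook: move custom batch structures (here: a dict of tensors) to the device
        if isinstance(batch, dict):
            return {key: value.to(device) for key, value in batch.items()}
        return super().transfer_batch_to_device(batch, device)

    @auto_move_data
    def forward(self, x):
        # inputs passed at inference time are moved to the module's device automatically
        return self.layer(x)

    def teardown(self, stage):
        # new hook: clean up at the end of fit/test
        pass


checkpoint_cb = ModelCheckpoint(save_last=True)  # also keep a checkpoint of the final epoch
trainer = pl.Trainer(
    limit_val_batches=0.25,  # renamed from val_percent_check
    checkpoint_callback=checkpoint_cb,
)

# quick sanity check: overfit on 10 training batches (used for train/val/test)
debug_trainer = pl.Trainer(overfit_batches=10)

# after fitting, checkpoint_cb.best_model_path holds the path to the best checkpoint
```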
Changed
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in `LRFinder` (#1862)
- Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed `ModelCheckpoint`'s attributes `best` to `best_model_score` and `kth_best_model` to `kth_best_model_path` (#1799) (see the example below)
- Re-enable Logger's `ImportError`s (#1938)
- Changed the default value of the Trainer argument `weights_summary` from `full` to `top` (#2029)
- Raise an error when Lightning replaces an existing sampler (#2020)
- Enabled `prepare_data` from correct processes - clarify local vs global rank (#2166)
- Remove explicit flush from TensorBoard logger (#2126)
- Changed epoch indexing to start from 1 instead of 0 (#2206)
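Two of these changes in code form (a small sketch; the fresh `ModelCheckpoint` instance is only there to show the renamed attributes):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# weights_summary now defaults to "top"; request the previous behaviour explicitly
trainer = pl.Trainer(weights_summary="full")

# ModelCheckpoint renames: best -> best_model_score, kth_best_model -> kth_best_model_path
checkpoint_cb = ModelCheckpoint()
print(checkpoint_cb.best_model_score)     # formerly .best
print(checkpoint_cb.kth_best_model_path)  # formerly .kth_best_model
```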
Deprecated
- Deprecated flags: (#2213) (migration example below)
  - `overfit_pct` in favour of `overfit_batches`
  - `val_percent_check` in favour of `limit_val_batches`
  - `test_percent_check` in favour of `limit_test_batches`
- Deprecated `ModelCheckpoint`'s attributes `best` and `kth_best_model` (#1799)
- Dropped official support/testing for older PyTorch versions <1.3 (#1917)
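Migrating off the deprecated flags is a one-for-one rename on the Trainer (sketch):

```python
import pytorch_lightning as pl

# before (deprecated, still works but emits a warning)
trainer = pl.Trainer(overfit_pct=0.01, val_percent_check=0.5, test_percent_check=0.5)

# after
trainer = pl.Trainer(overfit_batches=0.01, limit_val_batches=0.5, limit_test_batches=0.5)
```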
Removed
- Removed unintended Trainer argument `progress_bar_callback`, the callback should be passed in by `Trainer(callbacks=[...])` instead (#1855)
- Removed obsolete `self._device` in Trainer (#1849)
- Removed deprecated API (#2073)
  - Packages: `pytorch_lightning.pt_overrides`, `pytorch_lightning.root_module`
  - Modules: `pytorch_lightning.logging.comet_logger`, `pytorch_lightning.logging.mlflow_logger`, `pytorch_lightning.logging.test_tube_logger`, `pytorch_lightning.overrides.override_data_parallel`, `pytorch_lightning.core.model_saving`, `pytorch_lightning.core.root_module`
  - Trainer arguments: `add_row_log_interval`, `default_save_path`, `gradient_clip`, `nb_gpu_nodes`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`
  - Trainer attributes: `nb_gpu_nodes`, `num_gpu_nodes`, `gradient_clip`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`, `default_save_path`, `tng_tqdm_dic`
Fixed
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of `EarlyStopping` callback (#1863)
- Fixed an issue with `Trainer.from_argparse_args` when passing in unknown Trainer args (#1932)
- Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
- Fixed `LearningRateLogger` in multi-scheduler setting (#1944)
- Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
- Fixed `save_weights_only` in ModelCheckpoint (#1780)
- Allow use of same `WandbLogger` instance for multiple training loops (#2055)
- Fixed an issue with `_auto_collect_arguments` collecting local variables that are not constructor arguments and not working for signatures that have the instance not named `self` (#2048)
- Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and `example_input_array` depending on a specific ordering of the submodules in a LightningModule (#1773)
- Fixed TPU logging (#2230)
- Fixed PID port + duplicate `rank_zero` logging (#2140, #2231)
Contributors
@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu
If we forgot someone due to not matching commit email with GitHub account, let us know :]