@SeanNaren SeanNaren commented Mar 10, 2021

What does this PR do?

I noticed that additional memory was being allocated when a ZeRO Stage 3 DeepSpeed config was passed as input. The cause was that DeepSpeed partitioned the model onto the wrong devices internally, because we did not set the default device early enough.

This PR fixes the issue by setting the default device before DeepSpeed is initialized, without moving the model itself.
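
To illustrate why the ordering matters, here is a minimal toy sketch in plain Python. It does not use PyTorch or DeepSpeed, and every name in it (`FakeRuntime`, `set_device`, `partition`, `setup_rank`) is hypothetical. It only models the assumption that a partitioning step allocates on whichever device is "current" at the time it runs, so setting the device afterwards is too late.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRuntime:
    """Hypothetical stand-in for process-wide device state (not a real API)."""

    current_device: int = 0  # assumed default: device 0 until told otherwise
    allocations: dict = field(default_factory=dict)

    def set_device(self, index):
        # Change the device that later allocations will land on.
        self.current_device = index

    def partition(self, shard_name):
        # A shard is placed on whichever device is current right now.
        self.allocations.setdefault(self.current_device, []).append(shard_name)


def setup_rank(local_rank, set_device_first):
    """Simulate one process; return a mapping of device index to shards."""
    runtime = FakeRuntime()
    if set_device_first:
        runtime.set_device(local_rank)
    runtime.partition(f"shard-{local_rank}")
    if not set_device_first:
        # Too late: the shard has already been placed on device 0.
        runtime.set_device(local_rank)
    return runtime.allocations


late = setup_rank(1, set_device_first=False)
early = setup_rank(1, set_device_first=True)
print(late)   # {0: ['shard-1']}
print(early)  # {1: ['shard-1']}
```

In the "late" case, rank 1's shard ends up on device 0, which is the kind of extra allocation described above. In the "early" case it lands on the rank's own device.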

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added the bug and distributed labels Mar 10, 2021
@SeanNaren SeanNaren added this to the 1.2.x milestone Mar 10, 2021
@SeanNaren SeanNaren requested a review from a team March 10, 2021 11:58
@SeanNaren SeanNaren requested a review from awaelchli as a code owner March 10, 2021 11:58
@SeanNaren SeanNaren self-assigned this Mar 10, 2021
@SeanNaren SeanNaren added the 3rd party label Mar 10, 2021
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
@carmocca carmocca added the ready label Mar 10, 2021
@mergify mergify bot removed the has conflicts label Mar 10, 2021
@tchaton tchaton left a comment

LGTM !

@SeanNaren SeanNaren merged commit 1c013b4 into master Mar 10, 2021
@SeanNaren SeanNaren deleted the fix/default_device branch March 10, 2021 16:29
@carmocca carmocca mentioned this pull request Mar 15, 2021
SeanNaren added a commit that referenced this pull request Mar 16, 2021
…6460)

* Ensure we set the default device before initializing deepspeed

* Add CHANGELOG.md

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

(cherry picked from commit 1c013b4)
lexierule pushed a commit that referenced this pull request Mar 16, 2021
…6460)

* Ensure we set the default device before initializing deepspeed

* Add CHANGELOG.md

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

(cherry picked from commit 1c013b4)