When learning with Transformer, loss becomes nan after backpropagation. #37

Closed
sooftware opened this issue Jul 23, 2020 · 11 comments · Fixed by #48
Labels
help wanted Extra attention is needed

Comments

@sooftware
Owner

Two models are currently implemented, Seq2seq and Transformer, and when training with the Transformer the loss keeps becoming nan after backpropagation. I have tried debugging, but I have not yet found which part is wrong. If you have had a similar experience or have any guesses, I would appreciate your help.

sooftware added the help wanted label Jul 23, 2020
sooftware pinned this issue Jul 23, 2020
@affjljoo3581
Contributor

Applying attention masks with -np.inf may lead to nan in both the outputs and the gradients (of course). You can test this simply with:

import numpy as np
import torch

# SpeechTransformer here refers to the model implemented in this repository.
model = SpeechTransformer(
    num_classes=10, d_model=64, input_dim=80, d_ff=256,
    num_encoder_layers=3, num_decoder_layers=3)

inputs = torch.rand((32, 16, 80), dtype=torch.float)        # (batch, time, input_dim)
targets = torch.randint(0, 10, (32, 16), dtype=torch.long)  # random target tokens
lengths = torch.empty((32,), dtype=torch.long).fill_(80)    # input lengths

with torch.no_grad():
    predicted = model(inputs, lengths, targets)
    print(np.isnan(predicted.numpy()).any())                # does the output contain nan?

Using -np.inf in masked_fill_:

True

Using -1e9 in masked_fill_:

False

Implementations of the Transformer usually choose a finite (bounded) constant such as -1e9 instead of unbounded values (e.g. np.inf, float('inf')).

Sometimes you can see implementations that use -np.inf, but note that they are written for evaluation, not training. Of course there is no problem with using -np.inf at inference time.
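
For intuition, when a row of attention scores is fully masked (e.g. a completely padded position), softmax over a row of -inf values becomes 0/0 = nan, while a large finite constant still produces a valid (uniform) distribution. A minimal sketch, not taken from either repository:

import torch
import torch.nn.functional as F

scores = torch.randn(1, 4)                       # attention scores for one query
mask = torch.tensor([[True, True, True, True]])  # every key position is padding

print(F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1))  # tensor([[nan, nan, nan, nan]])
print(F.softmax(scores.masked_fill(mask, -1e9), dim=-1))           # tensor([[0.25, 0.25, 0.25, 0.25]])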

@sooftware
Owner Author

I never thought there was a bug in that part! Thank you, I'll try it out!

@sooftware
Owner Author

After experimenting, the loss still becomes nan. This was one problem, but there seems to be another.

@sooftware
Owner Author

Also, if you refer to this repo, it works normally with -np.inf. That part probably needs further checking.

@affjljoo3581
Contributor

Let's check why that repository (kaituoxu/Speech-Transformer) works well with -np.inf. First, create two dummy tensors as follows:

import numpy as np
import torch

# Predictions with nan injected at random positions.
pred = np.random.rand(64, 32, 1024)
pred = np.where(pred < 0.999999, pred, np.nan)
pred = torch.tensor(pred, dtype=torch.float)

# Targets, with the nan positions mapped to the ignore index (0).
target = np.random.randint(0, 1024, (64, 32), dtype=np.int64)
target = np.where(np.any(np.isnan(pred), axis=-1), 0, target)
target = torch.tensor(target, dtype=torch.long)

The pred tensor contains nan at random positions, and target marks those positions with the ignore index so that they should be excluded from the loss. Let's compare the losses from the two criterions (LabelSmoothedCrossEntropyLoss here and cal_loss from kaituoxu/Speech-Transformer).

  • LabelSmoothedCrossEntropyLoss
loss = LabelSmoothedCrossEntropyLoss(
    num_classes=1024,
    ignore_index=0,
    smoothing=0.1,
    architecture='transformer',
    reduction='mean')
print(loss(pred.view(-1, pred.size(-1)), target.view(-1)))

Output: tensor(nan)

  • cal_loss in kaituoxu/Speech-Transformer
print(cal_loss(pred.view(-1, pred.size(2)), target.view(-1),
               smoothing=0.1))

Output: tensor(6.9713)

Why does this happen? Both actually work well without label smoothing; the problem is in how the loss tensor is reduced.

  • LabelSmoothedCrossEntropyLoss
# ...
with torch.no_grad():
    label_smoothed = torch.zeros_like(logit)
    label_smoothed.fill_(self.smoothing / (self.num_classes - 1))
    label_smoothed.scatter_(1, target.data.unsqueeze(1), self.confidence)
    label_smoothed[target == self.ignore_index, :] = 0
return self.reduction_method(-label_smoothed * logit)
# ...
  • cal_loss in kaituoxu/Speech-Transformer
# ...
non_pad_mask = gold.ne(IGNORE_ID)
n_word = non_pad_mask.sum().item()
loss = -(one_hot * log_prb).sum(dim=1)
loss = loss.masked_select(non_pad_mask).sum() / n_word
# ...

While your code reduces over every position of the smoothed logits, cal_loss uses masked_select to exclude the IGNORE_ID elements (which are exactly the nan positions) before reducing. Zeroing the corresponding rows of label_smoothed does not help, because 0 * nan is still nan, so the nan logits still propagate into the total loss.
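
The key point is that a zero weight does not neutralize a nan logit, while dropping the ignored positions before the reduction does. A minimal sketch, not taken from either repository:

import torch

logit = torch.tensor([2.0, float('nan')])   # second position holds a nan logit
weight = torch.tensor([1.0, 0.0])           # zero weight on the ignored position
keep = torch.tensor([True, False])          # mask of non-ignored positions

print((weight * logit).mean())                      # tensor(nan), since 0 * nan == nan
print((weight * logit).masked_select(keep).mean())  # tensor(2.)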

So if you want to use -np.inf in attention masks, you should change the code as below:

with torch.no_grad():
    label_smoothed = torch.zeros_like(logit)
    label_smoothed.fill_(self.smoothing / (self.num_classes - 1))
    label_smoothed.scatter_(1, target.data.unsqueeze(1), self.confidence)
    # label_smoothed[target == self.ignore_index, :] = 0

    score = (-label_smoothed * logit).sum(1)
    score = score.masked_select(target != self.ignore_index)
return self.reduction_method(score)

Output: tensor(6.9704)

@sooftware
Owner Author

Oh, thanks for letting me know.
But the loss is still nan.
I'd appreciate it if you could share any other guesses about where the problem might be.

@affjljoo3581
Contributor

I've never seen RampUpLR scheduling before. Can you explain the concept of ramp-up LR decay?

@sooftware
Owner Author

Never mind.
I checked, and it makes no difference whether it is turned on or off.

@affjljoo3581
Contributor

No, a Transformer with post-LN basically needs learning-rate warm-up, so you need to keep that in mind. I don't have any dataset for this project, so I cannot test your code accurately. When does the loss diverge? Can you show me the training logs in detail?
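
For reference, here is a minimal sketch of the inverse-square-root warm-up schedule from "Attention Is All You Need" (the names below are placeholders, not the scheduler used in this repository):

import torch

d_model, warmup_steps = 512, 4000
model = torch.nn.Linear(d_model, d_model)  # placeholder for the Transformer

# lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
def noam_lr(step: int) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# Call scheduler.step() once per optimization step; without warm-up, the early
# updates of a post-LN Transformer can easily push the loss to nan.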

@sooftware
Owner Author

Can you come to gitter and talk to me in real time?

@sooftware
Owner Author

Confirmed that there is a problem with the masking => debugging in progress

sooftware added a commit that referenced this issue Jan 4, 2021
sooftware added a commit that referenced this issue Jan 4, 2021