[transformer] Add moe_noisy_gate #2495

Open. llleohk wants to merge 9 commits into main.

Conversation

llleohk commented Apr 18, 2024

Added a noisy gate.
Experimental results (AISHELL-1, 20 epochs):

| decoding mode | Normal Gate | Noisy Gate |
|---|---|---|
| ctc_prefix_beam_search | 9.60% | 8.88% |
| att_rescoring | 8.97% | 8.23% |

How to use:
[image]

Mddct requested a review from xingchensong on April 18, 2024, 08:42

xingchensong (Member)

Please merge main first.

xingchensong (Member)

If there is a paper link, please post it.

llleohk (Author) commented Apr 18, 2024

> If there is a paper link, please post it.

Sure. The reference is the Google paper: https://arxiv.org/pdf/1701.06538.pdf

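Mddct (Collaborator) commented Apr 18, 2024

Please paste it below the class definition.
I'm curious what the results look like after the full set of epochs.

Does this speed up convergence, or does it also improve the final result?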

Comment on lines 113 to 116

    if self.gate_type == 'noisy':
        noisy_router = self.noisy_gate(xs)
        noisy_router = torch.randn_like(router) * F.softplus(noisy_router)
        router = router + noisy_router
Member

Is this also needed at inference time? My understanding is that it mainly serves the training stage, to avoid some experts never taking part in training.

Author

In theory it is not needed at inference. We can run an experiment to see how much difference it makes.
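
The exchange above turns on whether the noise should be active at inference. As a minimal, framework-free sketch (not the PR's code), noisy gating as described in the referenced paper adds Gaussian noise, scaled by the softplus of a learned noise logit, to each clean router logit. The `training` flag, the function names, and the example logits are assumptions made for illustration. In the PyTorch module under review, one way to get the same behaviour might be to guard the noisy branch with `self.training`.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def noisy_router_logits(clean_logits, noise_logits, training, rng=random):
    """Return router logits, adding scaled Gaussian noise only in training.

    clean_logits: per-expert logits from the ordinary gate.
    noise_logits: per-expert logits from a second (noise) gate.
    """
    if not training:
        # Inference: route deterministically on the clean logits.
        return list(clean_logits)
    return [
        clean + rng.gauss(0.0, 1.0) * softplus(noise)
        for clean, noise in zip(clean_logits, noise_logits)
    ]


def top_k_experts(logits, k):
    # Indices of the k largest logits, best first.
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


# Hypothetical logits for one frame and four experts.
clean = [2.0, 0.5, 1.9, -1.0]
noise = [0.0, 0.0, 0.0, 0.0]
print(top_k_experts(noisy_router_logits(clean, noise, training=False), k=2))  # [0, 2]
print(top_k_experts(noisy_router_logits(clean, noise, training=True), k=2))
```

With `training=False` the selection is always experts 0 and 2. With `training=True`, experts 0 and 2 have close logits, so their order (and occasionally the selected set) can change from call to call, which is the exploration effect the reviewer describes.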

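llleohk (Author) commented Apr 18, 2024

> Please paste it below the class definition. I'm curious what the results look like after the full set of epochs.
>
> Does this speed up convergence, or does it also improve the final result?

I'll run the full set of epochs and see. In my earlier tests the final result also improved, but in those tests the MoE was not applied to the encoder.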

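Mddct (Collaborator) commented Apr 20, 2024

[Screenshot 2024-04-21 00.08.27]
Model parallelism will be supported later, and MoE needs special handling there. I came across this and am posting a screenshot here for reference. ref: https://zhuanlan.zhihu.com/p/681154742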

llleohk (Author) commented Apr 22, 2024

> Model parallelism will be supported later, and MoE needs special handling there. I came across this and am posting a screenshot here for reference. ref: https://zhuanlan.zhihu.com/p/681154742

Got it, thanks. I'll look into this.

xingchensong (Member)

How is it going? Any final results yet?

llleohk (Author) commented Apr 25, 2024

> How is it going? Any final results yet?

The model is still training; the GPUs are a bit slow. I should have results tomorrow.

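rookie0607

llleohk (Author) commented Apr 27, 2024

The results are in.
Setup: AISHELL-1, 100 training epochs, MoE in the encoder.

| decoding mode | Normal Gate | Noisy Gate (train and decode) | Noisy Gate (train only) |
|---|---|---|---|
| ctc_prefix_beam_search | 5.60% | 5.62% | 5.62% |
| att_rescoring | 5.23% | 5.27% | 5.27% |

Judging from these results, adding noise has no real effect, although the dataset may simply be too small. Applying noise at inference also gives the same result as not applying it; a quick check of gate-output consistency between the two showed roughly 96% agreement.
Judging from the loss in the logs, the noisy gate converges faster than the normal gate, but both end up at about the same value. Here is the plot:
[image]

I also measured the mean standard deviation of the gate outputs over 1000 utterances, and the noisy gate does not seem to improve load balancing:
Normal_std: 38.99761676367856
Noisy_only_train_std: 41.180588894741426
Noisy_train_decode_std: 41.19181262290304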
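
The comment does not say exactly how the standard deviation or the 96% consistency figure was computed. One plausible reading, sketched here purely as an assumption, is that load balance is the standard deviation of per-expert token counts within an utterance, averaged over utterances, and that consistency is the fraction of tokens routed to the same expert by two runs. All function names and the toy routing data are hypothetical.

```python
from statistics import mean, pstdev


def expert_load_std(assignments, num_experts):
    """Std of per-expert token counts for one utterance (0 = perfectly balanced)."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    return pstdev(counts)


def mean_load_std(utterances, num_experts):
    """Average the per-utterance load std over a collection of utterances."""
    return mean(expert_load_std(a, num_experts) for a in utterances)


def gate_agreement(run_a, run_b):
    """Fraction of tokens that both runs route to the same expert."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same tokens")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


# Toy data: top-1 expert index per token, four experts.
balanced = [0, 1, 2, 3]
collapsed = [0, 0, 0, 0]
print(expert_load_std(balanced, 4))    # 0.0
print(expert_load_std(collapsed, 4))
print(gate_agreement([0, 1, 2, 3], [0, 1, 2, 0]))  # 0.75
```

Under this reading a lower value means more even expert usage, so the slightly higher numbers for the noisy variants would be consistent with the author's conclusion that noise did not help balance.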

llleohk (Author) commented May 2, 2024

I tested noisy MoE in the decoder. As with large models, it seems to perform better when used in the decoder.

I don't have enough GPU memory to run MoE in both the encoder and the decoder, so I'll need others to verify that setup.

| system | ctc_prefix_beam_search | att_rescoring |
|---|---|---|
| U2++-baseline | 5.80% | 5.06% |
| Normal Gate-Encoder | 5.60% | 5.23% |
| Noisy Gate(decode)-Encoder | 5.62% | 5.27% |
| Noisy Gate(only train)-Encoder | 5.62% | 5.27% |
| Normal Gate-Decoder | 5.83% | 5.07% |
| Noisy Gate(decode)-Decoder | 5.77% | 4.99% |
| Noisy Gate(only train)-Decoder | 5.77% | 4.99% |

fclearner (Contributor)

> I tested noisy MoE in the decoder. As with large models, it seems to perform better when used in the decoder.

Why does it work better in the decoder?

llleohk (Author) commented May 7, 2024

> Why does it work better in the decoder?

My feeling is that with a small dataset, adding noise to an encoder MoE may make routing more balanced during training, but the experts are then hard to train sufficiently, so the result is worse.

I'm also trying MoE in only the last few layers to see how that performs.

llleohk (Author) commented May 20, 2024

An update with results for the method Mddct posted. The number of encoder experts needs to be chosen according to the amount of data; routing that is too sparse hurts performance.

| system | ctc_prefix_beam_search | att_rescoring |
|---|---|---|
| U2++-baseline | 5.80% | 5.06% |
| Normal Gate-Encoder | 5.60% | 5.23% |
| Noisy Gate-Encoder | 5.62% | 5.27% |
| Normal Gate-Decoder | 5.83% | 5.07% |
| Noisy Gate-Decoder | 5.77% | 4.99% |
| mask Noisy Gate(4experts)-Encoder | 5.46% | 5.06% |
| mask Noisy Gate(8experts)-Encoder | 5.82% | 5.40% |
| mask Noisy Gate(4experts)-Decoder | 5.85% | 5.09% |
| mask Noisy Gate(8experts)-Decoder | 5.76% | 5.04% |
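
Because the absolute differences in this table are small, it may help to look at each system relative to the U2++-baseline. The snippet below is only an arithmetic aid: the numbers are copied from three rows of the table, and the helper name `relative_change` is made up for this example.

```python
def relative_change(baseline, value):
    """Relative change in percent; negative means a lower CER than the baseline."""
    return (value - baseline) / baseline * 100.0


# CER (%) values copied from the table: (ctc_prefix_beam_search, att_rescoring).
baseline = (5.80, 5.06)
systems = {
    "Noisy Gate-Decoder": (5.77, 4.99),
    "mask Noisy Gate(4experts)-Encoder": (5.46, 5.06),
    "mask Noisy Gate(8experts)-Encoder": (5.82, 5.40),
}

for name, (ctc, rescoring) in systems.items():
    print(
        f"{name}: ctc {relative_change(baseline[0], ctc):+.2f}%, "
        f"att_rescoring {relative_change(baseline[1], rescoring):+.2f}%"
    )
```

Going from 4 to 8 encoder experts turns a relative CTC gain of just under 6% into a small loss, which matches the remark that overly sparse routing hurts on a dataset of this size.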

MXuer commented May 21, 2024

> I tested noisy MoE in the decoder. As with large models, it seems to perform better when used in the decoder.

Two questions:

  1. Are the CER results above from streaming or non-streaming decoding?
  2. For the U2++-baseline in the latest comment, attention rescoring reaches 4.63% in the aishell README. Is the gap because you only trained for 100 epochs?

Thanks.

llleohk (Author) commented May 21, 2024

> 1. Are the CER results above from streaming or non-streaming decoding?
> 2. For the U2++-baseline in the latest comment, attention rescoring reaches 4.63% in the aishell README. Is the gap because you only trained for 100 epochs?

  1. The CER results are from non-streaming decoding. If you need them, I can test streaming as well.
  2. My U2++-baseline does not fully match the training parameters in the aishell README: I used 4 GPUs, a batch size of 8, and 100 training epochs, and set average_num to 5 at decoding.
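
For readers unfamiliar with `average_num`: assuming it refers to the usual model-averaging step, where the parameters of several saved checkpoints are averaged element-wise before decoding, the idea can be sketched as follows. This is a toy illustration with plain Python lists rather than tensors; `average_checkpoints` and the two-checkpoint example are hypothetical, and how the checkpoints are selected is not stated in the thread.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean over a list of {param_name: [floats]} dictionaries."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Two toy checkpoints with a single parameter vector each.
ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

Differences in how many checkpoints are averaged are one of several setup choices, alongside GPU count and batch size, that can move the final CER by a few tenths of a percent.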

MXuer commented May 21, 2024

> 1. The CER results are from non-streaming decoding. If you need them, I can test streaming as well.
> 2. My U2++-baseline does not fully match the training parameters in the aishell README: I used 4 GPUs, a batch size of 8, and 100 training epochs, and set average_num to 5 at decoding.

No need to test streaming; I just wanted to know the decoding strategy.
Thank you for the answers and for sharing.


6 participants