
[WIP][transformer] bring llm component #2363

Open · wants to merge 9 commits into main from Mddct-llm-component

Conversation

Mddct (Collaborator) commented Feb 22, 2024

This PR will bring the following LLM components into wenet (all used by LLaMA, Gemma, etc.); a sketch of one such component follows the TODO list.

TODO

  • benchmark multiquery vs multiheaded
  • @Mddct the RoPE implementation itself is fine, but it is missed during model init and needs a fix
  • [train_engine] support fsdp #2412
  • fix some comments
    btw: this will also make it convenient to load LLM models later on
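
For reference, a minimal sketch of one such component, an RMSNorm in the LLaMA/Gemma style (illustrative only; not necessarily the exact module this PR adds):

```python
import torch


class RMSNorm(torch.nn.Module):
    """RMS layer norm as used by LLaMA/Gemma-style models:
    no mean subtraction and no bias, unlike standard LayerNorm."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS over the last dimension.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```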

xingchensong (Member) commented Feb 23, 2024

Going forward, do you plan to bring in LLMs by renaming ckpt weights as well (rather than importing transformers)?

Mddct (Collaborator, Author) commented Feb 23, 2024

> Going forward, do you plan to bring in LLMs by renaming ckpt weights as well (rather than importing transformers)?

Yes. We will use fsdp/deepspeed later on; working with transformers directly brings a bunch of odd issues, and it is also inconvenient for deployment and similar work.
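
For illustration, a minimal sketch of the ckpt-renaming approach, assuming a hypothetical key map (the real mapping depends on the specific LLM and wenet module names):

```python
import torch

# Hypothetical mapping from HuggingFace-style keys to wenet-style keys;
# the real table depends on the specific LLM architecture.
KEY_MAP = {
    "model.embed_tokens.weight": "decoder.embed.weight",
    "model.norm.weight": "decoder.after_norm.weight",
}


def convert_ckpt(src_path: str, dst_path: str) -> None:
    """Rename checkpoint keys so LLM weights load into wenet modules."""
    state = torch.load(src_path, map_location="cpu")
    torch.save({KEY_MAP.get(k, k): v for k, v in state.items()}, dst_path)
```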

xingchensong (Member)

Agreed!

Mddct (Collaborator, Author) commented Feb 23, 2024

Here is a rough rundown of mainstream LLMs; details may vary somewhat across versions.

| Model | Params | Hidden dim | Layers | Attn heads | Training data | Positional encoding | Activation | Normalization | Attention | Vocab size | Tokenizer | Max length | Linear bias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA | 6.7B | 4096 | 32 | 32 | 1T | RoPE | SwiGLU | RMSNorm (pre-norm) | Multi-head attention (MHA) | 32000 | BBPE | 2048 | |
| LLaMA2 | 7B | 4096 | 32 | 32 | 2.0T | RoPE | SwiGLU | RMSNorm (pre-norm) | Multi-head attention (MHA) | 32000 | BBPE | 4096 | false |
| chatglm-6B | 6.2B | 4096 | 28 | 32 | 1T | RoPE, 2D positional encoding | GELU | LayerNorm (post-norm) | Multi-head attention (MHA) | 130528 | BBPE | 2048 | |
| chatglm2-6B | 6.2B | 4096 | 28 | 32 | 1.4T | RoPE (2D positional encoding dropped at inference; back to decoder-only) | SwiGLU | RMSNorm (post-norm) | Multi-query attention (MQA) | 65024 | BBPE | 32768 | |
| baichuan-7b | 7B | 4096 | 32 | 32 | 1.2T | RoPE | SwiGLU | RMSNorm (pre-norm) | Multi-head attention (MHA) | 64000 | BBPE | 4096 | false |
| Qwen-7B | 7B | 4096 | 32 | 32 | 2.2T | RoPE | SwiGLU | RMSNorm (pre-norm) | Multi-head attention (MHA) | 151851 | BBPE | 2048 | false |
| gemma | | | | | | RoPE | GELU | RMSNorm | Multi-query attention (MQA) | | | | false |
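
Most of the models above pair RoPE with a SwiGLU gated MLP and no linear bias. For reference, a minimal sketch of such a gated MLP, assuming the common gate/up/down layout (not necessarily the module this PR adds):

```python
import torch


class GatedMLP(torch.nn.Module):
    """SwiGLU-style feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden: int, bias: bool = False):
        super().__init__()
        self.gate = torch.nn.Linear(dim, hidden, bias=bias)
        self.up = torch.nn.Linear(dim, hidden, bias=bias)
        self.down = torch.nn.Linear(hidden, dim, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU gates the up-projection elementwise before projecting down.
        return self.down(torch.nn.functional.silu(self.gate(x)) * self.up(x))
```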

Mddct (Collaborator, Author) commented Feb 23, 2024

Same configuration as before: #2333 (comment)

| Config | Batch size | Data type | Training time | WER (att/rescore/ctc greedy/ctc beam) |
|---|---|---|---|---|
| step mode, avg 20, step 1000 save interval (no stage1 shuffle) | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 10h36min | 5.63/5.30/5.84/5.85 |
| step mode, avg 20, step 1000 save interval (stage1 shuffle) + sdpa | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | | 5.70/5.38/5.89/5.90 |
| step mode, avg 20, step 1000 save interval (no stage1 shuffle) + sdpa | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | | 5.67/5.32/5.89/5.89 |
| | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h11min | 5.53/5.24/5.85/5.85 |
| +gated mlp | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 15h36min | 5.34/5.06/5.51/5.51 |
| +encoder no bias (mlp with bias) | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h4min | 5.66/5.29/5.91/5.91 |
| +encoder/decoder no bias (mlp with no bias) | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 13h32min | 5.56/5.28/5.91/5.91 |
| +encoder/decoder no bias + gated mlp | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 15h14min | 5.32/5.11/5.58/5.59 |
| +rms norm, eps=1e-5 | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h22min | 5.65/5.26/5.82/5.82 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-5 | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 15h22min | 5.46/5.17/5.60/5.60 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6 | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 15h23min | 5.29/5.00/5.42/5.42 |
| transformer encoder, no pos, directly through blocks | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h5min | 6.46/5.46/6.13/6.13 |
| +rope google | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h18min | 5.65/5.28/5.87/5.87 |
| +rope llama | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h4min | 5.72/5.36/5.99/5.99 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6 + rope | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 15h35min | 5.24/5.02/5.53/5.53 |
| +multiquery | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h10min | 5.54/5.18/5.86/5.86 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6 + rope + multiquery | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 15h18min | 5.62/5.27/5.85/5.85 |

NOTE: In the experiments above, both the subsampling (conv2d) and the CTC dense layer have bias.
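
The "+rope google" and "+rope llama" rows above presumably differ in implementation details such as how dimensions are paired for rotation. For reference, a minimal sketch of the common half-split RoPE variant, assuming (batch, time, head, head_dim) inputs:

```python
import torch


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding, half-split pairing: dim i rotates with
    dim i + d/2. The interleaved (RoFormer-style) variant pairs adjacent
    dims instead; the rotation angles are the same."""
    _, t, _, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, time, 1, d/2)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```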


train_conformer.yaml

| Config | Batch size | Data type | Training time | WER (att/rescore/ctc greedy/ctc beam) |
|---|---|---|---|---|
| | static batch size = 18 | raw | / | 5.18/4.61/4.94/4.94 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h42min | 4.75/4.52/4.88/4.89 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: rms norm | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h48min | 4.73/4.5/4.85/4.85 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn + rope | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h25min | 4.64/4.49/4.73/4.74 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn + rope, sync bn | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h | 4.65/4.43/4.70/4.70<br>avg30: 4.65/4.39/4.66/4.66 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn + rope, sync bn, no final ln | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h | 4.57/4.32/4.58/4.58 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn + rope + tie word emb | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h29min | 4.70/4.51/4.79/4.79 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn + rope + tie word emb + dec no linear bias | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h50min | 4.67/4.52/4.80/4.80 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: rms norm + rope | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h55min | 4.70/4.52/4.88/4.87 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: rms norm, conv no bias + rope | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h29min | 4.70/4.50/4.86/4.87 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn, conv norm no bias + rope + multiquery | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h22min | 4.62/4.60/6.26/6.26 |
| +encoder/decoder no bias + gated mlp + rms_norm, eps=1e-6, conv_norm: bn, conv norm bias + rope + multiquery | bucket_boundaries: [500, 1000, 1500]<br>bucket_batch_sizes: [128, 64, 32, 16] | raw | 21h20min | 4.71/4.49/4.74/4.76 |
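
The "+multiquery" rows replace multi-head attention with multi-query attention: all query heads share a single K/V head. A minimal sketch of the idea using PyTorch's scaled_dot_product_attention (the module in this PR will differ in details):

```python
import torch


def multi_query_attention(q: torch.Tensor, k: torch.Tensor,
                          v: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_head, time, head_dim); k, v: (batch, 1, time, head_dim).
    Sharing K/V across query heads shrinks the KV cache by a factor of
    n_head at decode time."""
    # Expand the single K/V head across all query heads (a view, no copy).
    k = k.expand(-1, q.size(1), -1, -1)
    v = v.expand(-1, q.size(1), -1, -1)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)
```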

For conformer MoE, see: #2474 (comment)

Mddct (Collaborator, Author) commented Mar 8, 2024

This PR will be split into several smaller PRs to be completed.
