MLTalks

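Megatron-LM Source Code Series (6): Distributed Optimizer Implementation, Part 1

Posted on 2023-12-31 | Category: Machine Learning | Word count: 1.8k | Reading time ≈ 7 min

1. Usage

Passing --use-distributed-optimizer to Megatron enables the distributed optimizer; the argument is defined in megatron/arguments.py. The idea is to shard the optimizer state evenly across the data-parallel ranks, which is equivalent to training with ZeRO-1.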

group.add_argument('--use-distributed-optimizer', action='store_true',
                   help='Use distributed optimizer.')

When --use-distributed-optimizer is set, two other arguments are also checked: args.DDP_impl == 'local' (the default) and args.use_contiguous_buffers_in_local_ddp (enabled by default).

# If we use the distributed optimizer, we need to have local DDP
# and we should make sure use-contiguous-buffers-in-local-ddp is on.
if args.use_distributed_optimizer:
    assert args.DDP_impl == 'local'
    assert args.use_contiguous_buffers_in_local_ddp
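
To make the idea of spreading optimizer state evenly over the data-parallel ranks more concrete, here is a minimal sketch. It assumes the state lives in one flat buffer that is cut into contiguous, near-equal slices; the function name shard_range and the ceil-division layout are hypothetical and are not taken from Megatron's code.

```python
from typing import Tuple


def shard_range(numel: int, dp_size: int, dp_rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a flat buffer owned by one data-parallel rank.

    Hypothetical helper: the buffer is cut into dp_size contiguous shards of
    ceil(numel / dp_size) elements; the last shard may be shorter.
    """
    shard = (numel + dp_size - 1) // dp_size  # ceil division
    start = min(dp_rank * shard, numel)
    end = min(start + shard, numel)
    return start, end


# Each rank only keeps optimizer state for its own slice of the buffer.
for rank in range(4):
    print(rank, shard_range(10, 4, rank))
```

With 10 elements and 4 ranks, the slices are (0, 3), (3, 6), (6, 9) and (9, 10), so every element is owned by exactly one rank.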

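The memory the distributed optimizer saves in theory depends on the parameter and gradient data types. The table below gives the theoretical number of bytes per parameter, where d is the data-parallel size, i.e. the number of GPUs in one data-parallel group, which equals the total number of GPUs divided by \(TP \times PP\):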

Training data types	Non-distributed optim (bytes)	Distributed optim (bytes)
float16 param, float16 grads	20	4 + 16/d
float16 param, fp32 grads	18	6 + 12/d
fp32 param, fp32 grads	16	8 + 8/d
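
The table can be turned into a small calculator. The sketch below only encodes the numbers from the table above; the function name bytes_per_param and the example data-parallel size are assumptions made for illustration.

```python
# (fixed bytes, bytes that get divided by d) per parameter, copied from the table.
BYTES_PER_PARAM = {
    ("float16", "float16"): (4, 16),
    ("float16", "fp32"): (6, 12),
    ("fp32", "fp32"): (8, 8),
}


def bytes_per_param(param_dtype, grad_dtype, dp_size, distributed=True):
    """Theoretical bytes per parameter for one of the table's dtype combinations."""
    fixed, sharded = BYTES_PER_PARAM[(param_dtype, grad_dtype)]
    if not distributed:
        # Without the distributed optimizer every rank holds everything.
        return float(fixed + sharded)
    return fixed + sharded / dp_size


# Hypothetical setting: float16 params and grads, data-parallel size 8.
print(bytes_per_param("float16", "float16", dp_size=8, distributed=False))
print(bytes_per_param("float16", "float16", dp_size=8))
```

As d grows, the per-parameter cost approaches the fixed term (4, 6 or 8 bytes), which is where the saving comes from.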

Read more »

FP16 Data Format Explained

Posted on 2023-12-23 | Category: Machine Learning | Word count: 384 | Reading time ≈ 1 min

1. Floating-Point Format Overview

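A floating-point format usually has three parts: the sign bit, the exponent, and the significand (fraction). The total number of bits depends on the format; for example, IEEE 754 single precision (float) uses 32 bits and double precision (double) uses 64 bits. Reference: Floating-point arithmetic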

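The resulting floating-point value is given below, where s is the significand, p is the precision (the number of digits in the significand), b is the base (10 or 2 here), and e is the exponent: \[ \frac{s}{b^{p-1}} \times b^e\]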

A concrete example:
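
As one possible illustration (not the figure from the original post), the sketch below applies the formula \(\frac{s}{b^{p-1}} \times b^e\) to IEEE 754 half precision, where b = 2 and p = 11 (10 stored fraction bits plus the implicit leading 1). The helper name decode_fp16 is hypothetical, and the sketch only handles normal numbers.

```python
import struct


def decode_fp16(bits: int) -> float:
    """Decode a normal float16 bit pattern with the s / b**(p-1) * b**e formula."""
    sign = (bits >> 15) & 0x1
    exp_field = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    assert 0 < exp_field < 31, "zero, subnormals, inf and NaN are not handled here"
    b, p = 2, 11                 # base 2, 11 significant bits
    s = (1 << 10) | frac         # integer significand including the implicit 1
    e = exp_field - 15           # remove the exponent bias of 15
    value = s / b ** (p - 1) * b ** e
    return -value if sign else value


def reference_fp16(bits: int) -> float:
    """Decode the same bit pattern with the standard library for comparison."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for pattern in (0x3C00, 0xC000, 0x7BFF):
    print(hex(pattern), decode_fp16(pattern), reference_fp16(pattern))
```

The pattern 0x7BFF is the largest finite half-precision value, 65504.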

Comparison of the four formats float32, bfloat16, float16, and tf32:
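
In place of the comparison figure, the following sketch derives a few properties from each format's bit layout (1 sign bit plus the exponent and stored fraction bits listed in the code). The dictionary and the describe helper are written for this illustration only.

```python
FORMATS = {
    # name: (exponent bits, stored fraction bits); each format also has 1 sign bit
    "float32": (8, 23),
    "bfloat16": (8, 7),
    "float16": (5, 10),
    "tf32": (8, 10),
}


def describe(exp_bits: int, frac_bits: int) -> dict:
    """Derive size, largest finite value and machine epsilon from the bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "total_bits": 1 + exp_bits + frac_bits,
        "max_finite": (2 - 2.0 ** -frac_bits) * 2.0 ** bias,
        "epsilon": 2.0 ** -frac_bits,  # gap between 1.0 and the next value
    }


for name, (exp_bits, frac_bits) in FORMATS.items():
    print(name, describe(exp_bits, frac_bits))
```

The formats with 8 exponent bits share roughly the same range as float32, while float16 trades range for three more fraction bits than bfloat16.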

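Read more »

Megatron-LM Source Code Series (5): Using FP16

Posted on 2023-12-21 | Category: Machine Learning | Word count: 952 | Reading time ≈ 3 min

Megatron-LM repository: Megatron-LM

1. Specifying FP16 Arguments

To train a model in fp16, pass --fp16 in the launch arguments. The corresponding definition in megatron/arguments.py is: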

group.add_argument('--fp16', action='store_true',
                   help='Run model in fp16 mode.')

By default the LM cross-entropy is computed in fp32. With --fp16 enabled, passing --fp16-lm-cross-entropy computes the LM cross-entropy in fp16 instead. The corresponding definition in megatron/arguments.py is:

group.add_argument('--fp16-lm-cross-entropy', action='store_true',
                   help='Move the cross entropy unreduced loss calculation'
                   'for lm head to fp16.')
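
The two flags can be tried out in isolation with a small standalone parser. This is a sketch, not Megatron's argument handling: the parse helper is hypothetical, and the dependency check simply encodes the statement above that --fp16-lm-cross-entropy is meant to be used together with --fp16.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="fp16 flag sketch")
    group = parser.add_argument_group(title="mixed precision")
    group.add_argument('--fp16', action='store_true',
                       help='Run model in fp16 mode.')
    group.add_argument('--fp16-lm-cross-entropy', action='store_true',
                       help='Compute the lm cross entropy in fp16.')
    return parser


def parse(argv):
    """Parse argv and enforce the assumed dependency between the two flags."""
    args = build_parser().parse_args(argv)
    if args.fp16_lm_cross_entropy and not args.fp16:
        raise ValueError('--fp16-lm-cross-entropy requires --fp16')
    return args


args = parse(['--fp16', '--fp16-lm-cross-entropy'])
print(args.fp16, args.fp16_lm_cross_entropy)
```

Note that argparse converts the dashes in --fp16-lm-cross-entropy to underscores, so the value is read as args.fp16_lm_cross_entropy.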

Read more »

Causal Attention Paper Explained

Posted on 2023-10-17 | Category: Machine Learning | Word count: 1.3k | Reading time ≈ 5 min

1. Background

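The Causal Attention paper combines causal inference with attention. It mainly targets tasks that join vision and text, such as VQA (Visual Question Answering).

A basic VQA pipeline works as follows: the input image is encoded with self-attention to produce the K and V vectors, the text provides the Q vector, and attention over them yields the fill-in answer (riding). This process can be viewed as causal inference, pictured as X->Z->Y, where X is the input, Z is the model's processing, Y is the output, and each arrow denotes a dependency.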
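
The attention step in this pipeline can be sketched in a few lines. The code below is plain scaled dot-product attention for a single text query over a handful of image-region keys and values; the toy vectors are made up for illustration and it does not implement the paper's causal intervention.

```python
import math


def attention(q, keys, values):
    """Scaled dot-product attention for one query vector.

    q: query from the text side; keys/values: one vector per image region.
    Returns the attended output vector and the attention weights.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights


# Toy data: the query is most similar to the first image region.
q = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output, weights = attention(q, keys, values)
print(weights)
print(output)
```

The output is a weighted mix of the region values, dominated by the region whose key best matches the query.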

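Read more »

Megatron-LM Source Code Series (4): Recomputation (recompute)

Posted on 2023-09-25 | Category: Machine Learning | Word count: 1.8k | Reading time ≈ 7 min

github: https://github.com/NVIDIA/Megatron-LM

1. Recompute Argument Configuration

The recompute arguments are defined in megatron/arguments.py as follows: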

group.add_argument('--recompute-activations', action='store_true',
                   help='recompute activation to allow for training '
                   'with larger models, sequences, and batch sizes.')
group.add_argument('--recompute-granularity', type=str, default=None,
                   choices=['full', 'selective'],
                   help='Checkpoint activations to allow for training '
                   'with larger models, sequences, and batch sizes. '
                   'It is supported at two granularities 1) full: '
                   'whole transformer layer is recomputed, '
                   '2) selective: core attention part of the transformer '
                   'layer is recomputed.')
group.add_argument('--distribute-saved-activations',
                   action='store_true',
                   help='If set, distribute recomputed activations '
                   'across model parallel group.')
group.add_argument('--recompute-method', type=str, default=None,
                   choices=['uniform', 'block'],
                   help='1) uniform: uniformly divide the total number of '
                   'Transformer layers and recompute the input activation of '
                   'each divided chunk at specified granularity, '
                   '2) recompute the input activations of only a set number of '
                   'individual Transformer layers per pipeline stage and do the '
                   'rest without any recomputing at specified granularity'
                   'default) do not apply activations recompute to any layers')
group.add_argument('--recompute-num-layers', type=int, default=1,
                   help='1) uniform: the number of Transformer layers in each '
                   'uniformly divided recompute unit, '
                   '2) block: the number of individual Transformer layers '
                   'to recompute within each pipeline stage.')

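Notes:

--recompute-activations: setting recompute_activations is equivalent to setting recompute_granularity to selective. selective runs more efficiently, and in most cases setting only this flag is enough. If GPU memory is even tighter, switch to full via --recompute-granularity.
--recompute-granularity: selects the granularity of recomputation. full recomputes the entire transformer layer; selective recomputes only the core_attention part of the transformer layer.
--distribute-saved-activations: stores the saved activations split across the tensor-parallel (TP) ranks.
--recompute-method: uniform divides all transformer layers into groups and keeps only each group's input activation in memory; when GPU memory is short, increasing the number of layers per group makes it possible to run a larger model. block works per pipeline stage: the input activations of some transformer layers are checkpointed, and the remaining layers are not checkpointed. For a pipeline stage with 8 layers and a setting of 5, the input activation of each of the first 5 layers is checkpointed, and the last 3 layers run the backward pass normally, without recomputation.
--recompute-num-layers: for uniform, the number of layers in each recomputed transformer layer group; the default of 1 checkpoints every transformer layer individually. For block, a value of N means the first N layers of a single pipeline stage checkpoint their input activation.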
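
The difference between the two --recompute-method values can be sketched as follows. Both helpers are hypothetical and only model which layers are grouped or checkpointed as described in the notes above; they do not reproduce Megatron's implementation.

```python
def uniform_chunks(num_layers: int, recompute_num_layers: int):
    """uniform: split the layers into consecutive groups of recompute_num_layers.

    Only the input activation of each group is kept; the group is recomputed
    during the backward pass. Returns [start, end) layer index pairs.
    """
    return [(start, min(start + recompute_num_layers, num_layers))
            for start in range(0, num_layers, recompute_num_layers)]


def block_flags(layers_per_stage: int, recompute_num_layers: int):
    """block: the first recompute_num_layers layers of a stage are checkpointed.

    Returns one flag per layer; True means the layer is recomputed in backward.
    """
    return [i < recompute_num_layers for i in range(layers_per_stage)]


print(uniform_chunks(8, 2))
print(block_flags(8, 5))
```

For the 8-layer stage from the notes with a setting of 5, block_flags marks the first 5 layers as recomputed and leaves the last 3 untouched.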

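Read more »

PyTorch LayerNorm Source Code Explained

Posted on 2023-08-15 | Category: Machine Learning | Word count: 3.1k | Reading time ≈ 11 min

1. Using LayerNorm

The function is defined in PyTorch as follows: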

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, device=None, dtype=None)

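The parameters are:

normalized_shape: the dimensions over which LayerNorm is applied. For a multi-dimensional tensor [N, C, H, W], normalized_shape must match the trailing dimensions of the tensor, here [C, H, W]. In the formula, \(\gamma\) and \(\beta\) have shape [C, H, W], and \(X\) and \(Y\) have shape [N, C, H, W].
eps: a small value added to the denominator to avoid division by zero. Default: 1e-5.
elementwise_affine: when True, an elementwise affine transform is applied, so \(\gamma\) and \(\beta\) take effect and are learned as parameters during training; when False they are not used. All elements of \(\gamma\) are initialized to 1 and all elements of \(\beta\) to 0. In the code, \(\gamma\) corresponds to gamma and \(\beta\) corresponds to beta.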

LayerNorm is defined mathematically as:

\[\begin{align*} Y &= \frac{X - E[X]}{\sqrt{Var[X] + \epsilon}} * \gamma + \beta \end{align*}\]
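
The formula can be checked with a dependency-free sketch. It normalizes a single 1-D vector, i.e. the case where normalized_shape covers the whole input, and uses the biased variance (dividing by n), as PyTorch's LayerNorm does. The function name layer_norm is chosen for this illustration.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Apply Y = (X - E[X]) / sqrt(Var[X] + eps) * gamma + beta to a list of floats."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    if gamma is None:
        gamma = [1.0] * n  # same initial value as elementwise_affine=True
    if beta is None:
        beta = [0.0] * n
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


print(layer_norm([1.0, 2.0, 3.0]))
```

With the default gamma and beta the output has zero mean and a variance just below 1 (the small gap comes from eps).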

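Read more »

Grouped Query Attention Paper Reading Notes

Posted on 2023-08-06 | Category: Machine Learning | Word count: 741 | Reading time ≈ 3 min

Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

1. Background

This is a 2023 paper from Google on Transformer attention. It is clearly written and easy to read, and its idea is simple but works well. The paper's short name is GQA, which actually stands for two things: (1) Generalized Multi-Query Attention and (2) Grouped Query Attention.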

Read more »

LLaMA-2 Paper Reading Notes

Posted on 2023-07-29 | Category: Machine Learning | Word count: 2.2k | Reading time ≈ 8 min

1. Overview

LLaMA-2 is the second generation of LLaMA, released by Meta on July 18, 2023. Notable differences from LLaMA-1:

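A large model that is free for commercial use
Context length doubled, from 2K to 4K
Total training tokens increased from 1.0T/1.4T to 2.0T (\(2 \times 10^{12}\)), about 40% more than 1.4T
The largest model grew from 65B to 70B parameters (\(70 \times 10^{9}\)), and the 34B and 70B versions use Grouped-Query Attention (GQA)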
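
As a rough illustration of what Grouped-Query Attention changes, the sketch below maps each query head to the key/value head it shares with the other heads in its group. The helper name and the head counts (8 query heads, 2 key/value heads) are hypothetical and are not LLaMA-2's configuration.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head used by a given query head.

    Query heads are split into num_kv_heads equal groups; every head in a
    group attends with the same key/value head.
    """
    assert num_q_heads % num_kv_heads == 0, "query heads must divide evenly"
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Hypothetical configuration: 8 query heads sharing 2 key/value heads.
print([kv_head_for_query_head(h, 8, 2) for h in range(8)])
```

Setting num_kv_heads equal to num_q_heads gives ordinary multi-head attention, and setting it to 1 gives multi-query attention; GQA sits between the two.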

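Read more »

Megatron-LM Source Code Series (3): Pipeline Model Parallel Training Implementation Explained

Posted on 2023-07-28 | Category: Machine Learning | Word count: 4.2k | Reading time ≈ 15 min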

github: https://github.com/NVIDIA/Megatron-LM

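Building on "Megatron-LM Source Code Series (2): Tensor Model Parallel and Sequence Model Parallel Training", this post adds pipeline model parallel training; for the idea behind pipeline model parallelism, see "MegatronLM Pipeline Model Parallel Training Explained (Pipeline Parallel)". In pipeline parallelism the network is split vertically at layer granularity, and within each communication group the communication runs horizontally between the different stages of the pipeline. In the figure below, with 2 machines and 16 GPUs, each colored block is one pipeline communication group, and forward-pass communication goes from left to right.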
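
Since the figure is not reproduced here, the following sketch shows one way 16 ranks could be arranged into pipeline communication groups. The strided layout and the pipeline size of 4 are assumptions made for illustration; the post does not state the TP/PP configuration behind the figure, and the helper name pipeline_groups is hypothetical.

```python
def pipeline_groups(world_size: int, pipeline_parallel_size: int):
    """Group ranks into pipelines using a strided layout (assumed for this sketch).

    Each group lists the ranks of one pipeline in stage order, so forward-pass
    communication goes from the first rank in a group to the last.
    """
    assert world_size % pipeline_parallel_size == 0
    num_groups = world_size // pipeline_parallel_size
    return [list(range(i, world_size, num_groups)) for i in range(num_groups)]


# Hypothetical setting: 2 machines x 8 GPUs = 16 ranks, 4 pipeline stages.
for group in pipeline_groups(16, 4):
    print(group)
```

Under these assumptions the first pipeline consists of ranks 0, 4, 8 and 12, so consecutive stages of one pipeline sit 4 ranks apart.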

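Read more »

Megatron-LM Source Code Series (2): Tensor Model Parallel and Sequence Model Parallel Training

Posted on 2023-07-23 | Category: Machine Learning | Word count: 3.9k | Reading time ≈ 14 min

Repository: https://github.com/NVIDIA/Megatron-LM/tree/23.05

1. Overview

The core code for model parallel training lives under megatron/core/. According to its README.md, Megatron Core is an efficient, scalable compute library built specifically for transformer models.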

Read more »