Complete Guide to Installing NVIDIA APEX and Fixing Megatron-LM/PyTorch Runtime Errors (fused_layer_norm_cuda/packaging/amp_C/libc10.so)

1. List of problems

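The following errors can appear when running Megatron-LM/PyTorch:

  1. No module named 'fused_layer_norm_cuda': apex is either not installed or installed incorrectly. Note that pip install apex does not install the real NVIDIA apex; it must be built and installed from source.
  2. ModuleNotFoundError: No module named 'packaging': this error occurs when building newer versions of apex; switch to an earlier version of the code.
  3. No module named 'amp_C': build with pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./, and after the build also run python setup.py install.
  4. ImportError: libc10.so: cannot open shared object file: No such file or directory: libc10.so is installed together with PyTorch.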
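The list above can be turned into a small lookup for scanning a failing log. This is only a sketch: the names FIXES and suggest_fix are hypothetical, and the table just restates the fixes described in this post.

```python
# Hypothetical helper: maps the error messages listed above to the fixes
# described in this post. The names FIXES and suggest_fix are illustrative.
FIXES = {
    "No module named 'fused_layer_norm_cuda'":
        "Build NVIDIA apex from source; `pip install apex` installs a different package.",
    "No module named 'packaging'":
        "Check out an earlier apex version before building.",
    "No module named 'amp_C'":
        "Rebuild apex with the --cpp_ext and --cuda_ext options.",
    "libc10.so":
        "Install PyTorch (libc10.so ships with it) and import torch first.",
}


def suggest_fix(error_message):
    """Return the fix for the first known pattern found in error_message, else None."""
    for pattern, fix in FIXES.items():
        if pattern in error_message:
            return fix
    return None


if __name__ == "__main__":
    print(suggest_fix("ModuleNotFoundError: No module named 'amp_C'"))
```

Matching on substrings keeps the helper usable on full tracebacks, where the message is preceded by the exception class name.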

NVIDIA APEX repository: https://github.com/NVIDIA/apex

2. Complete APEX build and install commands

  • Step 1: on Ubuntu, install the build dependencies first:
apt-get install -y ninja-build libssl-dev libffi-dev

If the dependencies above are not enough, try the following:

apt install -y ninja-build build-essential pkg-config zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev libbz2-dev liblzma-dev
  • Step 2: install Python (skip this step if it is already installed):
wget https://www.python.org/ftp/python/3.10.12/Python-3.10.12.tgz
tar zxf Python-3.10.12.tgz && cd Python-3.10.12
./configure
make altinstall

By default Python is installed under /usr/local/bin, so update PATH and create symlinks:

export PATH=/usr/local/bin:$PATH
ln -s /usr/local/bin/python3.10 /usr/local/bin/python
ln -s /usr/local/bin/pip3.10 /usr/local/bin/pip
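After setting PATH and the symlinks, it can help to confirm which executables the shell will actually pick up. A minimal sketch using the standard library's shutil.which; the helper name resolve_tools is hypothetical:

```python
import shutil


def resolve_tools(names=("python", "pip"), path=None):
    """Map each command name to the executable found on PATH, or None if absent.

    `path` overrides the search path; by default the PATH environment
    variable of the current process is used.
    """
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    for name, location in resolve_tools().items():
        print(f"{name}: {location}")
```

If the symlinks from this step are in effect, both entries are expected to point into /usr/local/bin.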
  • Step 3: install the GPU build of PyTorch 1.12.1. This fixes the missing libc10.so error, and the apex build itself also depends on torch:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
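To check that libc10.so really arrived with the PyTorch install, one option is to look for it inside the torch package directory. The sketch below assumes the usual Linux pip wheel layout, where the library sits at torch/lib/libc10.so; the function names are hypothetical.

```python
import importlib.util
from pathlib import Path


def find_libc10(torch_dir):
    """Return the path to libc10.so under torch_dir/lib, or None if it is missing.

    Assumes the Linux pip wheel layout (torch/lib/libc10.so).
    """
    candidate = Path(torch_dir) / "lib" / "libc10.so"
    return candidate if candidate.is_file() else None


def installed_torch_dir():
    """Return the installed torch package directory without importing torch, or None."""
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    torch_dir = installed_torch_dir()
    if torch_dir is None:
        print("torch is not installed in this environment")
    else:
        found = find_libc10(torch_dir)
        print(found or f"libc10.so not found under {torch_dir / 'lib'}")
```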
  • Step 4: reinstall apex:
pip uninstall apex
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 22.04-dev
pip install -r requirements.txt
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  • Step 5: test. Import torch before importing amp_C:
import torch
import amp_C
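The two-line test above can be extended into a small smoke test that also covers fused_layer_norm_cuda and reports each failure instead of stopping at the first one. This is a sketch rather than an official apex check, and the helper name try_imports is hypothetical.

```python
import importlib


def try_imports(names):
    """Import each module in order; map name -> None on success or the error text."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    # torch goes first so that its shared libraries (such as libc10.so) are
    # loaded before the apex extension modules are imported.
    modules = ["torch", "amp_C", "fused_layer_norm_cuda"]
    for name, error in try_imports(modules).items():
        print(f"{name}: {'ok' if error is None else error}")
```

Any error text printed here can be compared against the problem list in section 1.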

3. References