# Horovod
[TOC]

## Official introduction

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

## Official benchmark results

## Running Horovod

The example commands below show how to run distributed training. See the Running Horovod page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

1. Single machine, 4 GPUs:

        # docker: start the container, then run mpirun inside it
        nvidia-docker run -it 172.16.10.10:5000/horovod:0.12.1-tf1.8.0-py3.5
        mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py

        # singularity: open a shell in the image, then run mpirun inside it
        singularity shell --nv /scratch/containers/ubuntu.simg
        mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py

2. Multiple machines, multiple GPUs:

        $ mpirun -np 16 \
            -H server1:4,server2:4,server3:4,server4:4 \
            ... python train.py

3. Complete Docker usage with Horovod ...
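In both commands above, the value passed to `-np` equals the sum of the per-host slot counts given to `-H` (4 for `localhost:4`, 16 for four servers with 4 GPUs each). The following is a minimal sketch of that relationship, not part of Horovod itself: `build_mpirun_command` is a hypothetical helper name, and it only emits the `-np` and `-H` flags shown above, leaving out whatever additional flags the multi-machine command elides with `...`.

```python
def build_mpirun_command(hosts, script):
    """Build an mpirun argument list from a {hostname: gpu_count} mapping.

    Hypothetical helper for illustration: it derives -np as the total
    number of slots so the two values cannot drift apart.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    for name, slots in hosts.items():
        if not isinstance(slots, int) or slots < 1:
            raise ValueError(f"invalid slot count for {name!r}: {slots!r}")

    total_processes = sum(hosts.values())  # one process per GPU
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["mpirun", "-np", str(total_processes), "-H", host_spec,
            "python", script]


if __name__ == "__main__":
    # Single machine, 4 GPUs
    print(" ".join(build_mpirun_command({"localhost": 4},
                                        "keras_mnist_advanced.py")))
    # Four machines, 4 GPUs each
    servers = {"server1": 4, "server2": 4, "server3": 4, "server4": 4}
    print(" ".join(build_mpirun_command(servers, "train.py")))
```

The returned list can be passed directly to `subprocess.run` if the command should be launched from Python rather than typed by hand.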