## Training
### 1. Requirements

- python==3.6
- cuda==10.1
- cudnn==7.6.5
- mxnet-cu101==1.6.0.post0
- `pip install easydict mxboard opencv-python tqdm`
- [nccl](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html)
- [openmpi](mxnet/setup-utils/install-mpi.sh)==4.0.0
- [horovod](mxnet/setup-utils/install-horovod.sh)==0.19.2
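
A quick way to sanity-check the environment is an import test. This is only an illustrative sketch using standard `mxnet` and `horovod.mxnet` calls; the expected version string is an assumption based on the pins above:

```python
# Minimal environment sanity check (illustrative, not part of this repo).
import mxnet as mx
import horovod.mxnet as hvd

print(mx.__version__)   # expect 1.6.0 for mxnet-cu101==1.6.0.post0
hvd.init()              # fails here if horovod was built without MXNet support
print("rank %d of %d" % (hvd.rank(), hvd.size()))
```
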
### 2. Run with horovodrun

Typically one GPU will be allocated per process, so if a server has 8 GPUs, you will run 8 processes.

In horovodrun, the number of processes is specified with the `-np` flag.

To run on a machine with 8 GPUs:
```shell script
horovodrun -np 8 -H localhost:8 bash config.sh
```

To run on two machines with 16 GPUs:

```shell script
horovodrun -np 16 -H ip1:8,ip2:8 bash config.sh
```

### 3. Run with mpi

```shell script
bash run.sh
```
### Failures due to SSH issues

The host where horovodrun is executed must be able to SSH to all the other hosts without any prompts, e.g. with passwordless public-key authentication set up via `ssh-keygen` and `ssh-copy-id`.

## Troubleshooting
### 1. Horovod installed successfully?

Run `horovodrun --check-build` to check the installation of horovod.

```shell script
horovodrun --check-build
# Horovod v0.19.2:
#
# Available Frameworks:
# [ ] TensorFlow
# [X] PyTorch
# [X] MXNet
#
# Available Controllers:
# [X] MPI
# [X] Gloo
#
# Available Tensor Operations:
# [X] NCCL
# [ ] DDL
# [ ] CCL
# [X] MPI
# [X] Gloo
```

### 2. MXNet Version!

Some versions of MXNet have bugs when used together with Horovod.
It is recommended to try version **1.5 or 1.6**.

**The community has found that Horovod cannot be installed with mxnet 1.5.1.**

### 3. Check CUDA version!

```shell script
# Make sure your CUDA version matches your mxnet build, e.g. mxnet-cu101 requires CUDA 10.1.
/usr/local/cuda/bin/nvcc -V
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2019 NVIDIA Corporation
# Built on Wed_Apr_24_19:10:27_PDT_2019
# Cuda compilation tools, release 10.1, V10.1.168
```

### 4. Block IO

You can turn on the debug mode to check whether IO is the cause of your slow training speed.

### 5. Training Speed

If you find that IO is the bottleneck of your training speed, you can mount the dataset to RAM
using the following commands.

```shell script
# e.g. if your machine has 256G of RAM
sudo mkdir /train_tmp
sudo mount -t tmpfs -o size=140G tmpfs /train_tmp
```

## Our Method


### 1. The classification layer is model parallel

Class centers are evenly distributed across different GPUs. It takes only three communications to complete
a lossless Softmax calculation:

#### 1. Synchronization of features

Make sure each GPU holds the features from all GPUs, as shown in `AllGather(x_i)`.

#### 2. Synchronization of the softmax denominator

We first calculate the local sum on each GPU, and then compute the global sum through communication, as shown
in `Allreduce(sum(exp(logits_i)))`.

#### 3. Synchronization of the feature gradients

The gradient of the logits can be calculated independently, and so can the gradient of the features. Finally, we collect the
feature gradients on every GPU and send them back to the backbone, as shown in `Allreduce(delta(X))`.
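
The three steps can be put together in a short forward sketch. This is only an illustration assuming Horovod's MXNet collectives (`horovod.mxnet.allgather` / `allreduce`); `x_i`, `W_i` and `parallel_softmax` are illustrative names, not this repository's API:

```python
# Minimal sketch of the three communications; not the repo's actual implementation.
import horovod.mxnet as hvd
import mxnet as mx

hvd.init()

def parallel_softmax(x_i, W_i):
    """x_i: local batch features (B_i, D); W_i: this GPU's shard of class centers (D, C_i)."""
    # 1. AllGather(x_i): every GPU receives the features of the whole batch.
    X = hvd.allgather(x_i)                                    # shape (B, D)

    # Local logits against this GPU's slice of the class centers.
    logits_i = mx.nd.dot(X, W_i)                              # shape (B, C_i)
    exp_i = mx.nd.exp(logits_i)

    # 2. Allreduce(sum(exp(logits_i))): global softmax denominator.
    denom = hvd.allreduce(exp_i.sum(axis=1, keepdims=True), average=False)
    prob_i = mx.nd.broadcast_div(exp_i, denom)                # local slice of the full softmax

    # 3. Allreduce(delta(X)): each GPU computes the partial feature gradient from
    # its class shard (the one-hot subtraction on the positive slice is omitted);
    # summing across GPUs yields the full gradient sent back to the backbone.
    delta_X = hvd.allreduce(mx.nd.dot(prob_i, W_i.T), average=False)
    return prob_i, delta_X
```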

### 2. Softmax approximation

The softmax computation can be approximated with just a subset of the class centers, as long as the positive class centers are kept in that subset.
This can be done with the following code:

```python
import numpy as np

num_classes, num_sample = 1000, 300  # illustrative sizes; indices stand in for the class centers
label = np.array([3, 42, 7])         # labels of the local batch
centers_p = np.unique(label)         # select the positive class centers by the labels of the samples
negatives = np.setdiff1d(np.arange(num_classes), centers_p)  # candidates left after excluding the positive classes
centers_n = np.random.choice(negatives, num_sample - centers_p.size, replace=False)  # randomly sampled negative class centers
centers_final = np.concatenate([centers_n, centers_p])  # class centers that participate in the softmax calculation
```
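
Continuing the sketch above, the sampled indices pick out the sub-matrix of class centers that actually enters the logits; `W` and `features` below are hypothetical stand-ins for the full center matrix and the backbone output:

```python
D = 512
W = np.random.randn(num_classes, D)   # all class centers (hypothetical full matrix)
features = np.random.randn(3, D)      # backbone features of the local batch (hypothetical)

W_final = W[centers_final]            # only the sampled centers join this iteration
logits = features @ W_final.T         # softmax over num_sample classes instead of num_classes
```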