insightface

mirror of https://github.com/deepinsight/insightface.git synced 2026-05-14 12:17:55 +00:00

anxiangsir 9aca365456 Update Glint360K.

2020-10-26 14:15:16 +08:00

Name                 Last commit        Date
evaluation           Update Folder.     2020-10-19 01:07:30 +08:00
hosts                Update Folder.     2020-10-19 01:07:30 +08:00
setup-utils          Update Folder.     2020-10-19 01:07:30 +08:00
symbol               Update Folder.     2020-10-19 01:07:30 +08:00
callbacks.py         Update Folder.     2020-10-19 01:07:30 +08:00
config.sh            Update Glint360K.  2020-10-26 14:15:16 +08:00
default.py           Update Glint360K.  2020-10-26 14:15:16 +08:00
image_iter.py        Update Folder.     2020-10-19 01:07:30 +08:00
memory_bank.py       Update Folder.     2020-10-19 01:07:30 +08:00
memory_module.py     Update Folder.     2020-10-19 01:07:30 +08:00
memory_samplers.py   Update Folder.     2020-10-19 01:07:30 +08:00
memory_scheduler.py  Update Folder.     2020-10-19 01:07:30 +08:00
memory_softmax.py    Update Folder.     2020-10-19 01:07:30 +08:00
optimizer.py         Update Folder.     2020-10-19 01:07:30 +08:00
README.md            Update README.md   2020-10-21 13:43:14 +08:00
run.sh               Update bugs.       2020-10-21 14:00:16 +08:00
train_memory.py      Update codes       2020-10-21 14:16:21 +08:00

README.md

Train

Requirements

python==3.6
cuda==10.1
cudnn==7.6.5
mxnet-cu101==1.6.0.post0
pip install easydict mxboard opencv-python tqdm
nccl
openmpi==4.0.0
horovod==0.19.2

Failures due to SSH issues

The host where horovodrun is executed must be able to SSH to all other hosts without any prompts.
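One way to verify this before launching is to try a non-interactive SSH connection to each host. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and it assumes the standard OpenSSH client, whose `BatchMode=yes` option makes `ssh` fail instead of asking for a password or passphrase.

```python
import subprocess


def ssh_check_command(host, timeout=5):
    """Build an ssh command that fails rather than prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                  # never ask for a password
        "-o", "ConnectTimeout={}".format(timeout),
        host,
        "true",                                 # trivial remote command
    ]


def can_ssh_without_prompt(host, timeout=5):
    """Return True if `host` accepts a prompt-free SSH login."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The ssh client itself is missing or not executable.
        return False
    return result.returncode == 0
```

Calling `can_ssh_without_prompt("ip1")` for every host in the `-H` list (here `ip1` is a placeholder) should return `True` on the machine that runs horovodrun; a `False` usually points to a missing key or an unconfirmed host key.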

Run with horovodrun

Typically one GPU will be allocated per process, so if a server has 8 GPUs, you will run 8 processes. In horovodrun, the number of processes is specified with the -np flag.

To run on a machine with 8 GPUs:

horovodrun -np 8 -H localhost:8 bash config.sh

To run on two machines with 8 GPUs each (16 GPUs in total):

horovodrun -np 16 -H ip1:8,ip2:8 bash config.sh
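The two commands above follow the same pattern: `-np` is the total number of GPUs and `-H` lists each host with its GPU count. As an illustration only, a small helper (hypothetical, not part of this repository) can assemble the command from a list of hosts:

```python
def horovodrun_command(hosts, script="config.sh"):
    """Build a horovodrun command line.

    hosts: list of (hostname, gpu_count) pairs, one process per GPU.
    """
    total_processes = sum(gpus for _, gpus in hosts)
    host_arg = ",".join("{}:{}".format(name, gpus) for name, gpus in hosts)
    return ["horovodrun", "-np", str(total_processes),
            "-H", host_arg, "bash", script]


# Two machines with 8 GPUs each; ip1 and ip2 are placeholders.
print(" ".join(horovodrun_command([("ip1", 8), ("ip2", 8)])))
# prints: horovodrun -np 16 -H ip1:8,ip2:8 bash config.sh
```

Deriving `-np` from the host list keeps the process count and the per-host slots from drifting apart when machines are added or removed.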

Run with mpi

bash run.sh

Troubleshooting

Block IO

You can turn on the debug mode to check whether IO is the cause of your slow training speed.
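Independently of the debug mode, a rough way to tell whether data loading is the limit is to time the data iterator on its own, without running the network. The sketch below is not this repository's debug mode; `samples_per_second` is a hypothetical helper that works on any Python iterable of batches.

```python
import time
from itertools import islice


def samples_per_second(batch_iter, batch_size, max_batches=100):
    """Measure how fast batches can be read, with no model in the loop."""
    start = time.perf_counter()
    batches_read = 0
    for _ in islice(batch_iter, max_batches):
        batches_read += 1
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return batches_read * batch_size / elapsed
```

If the rate measured this way is close to the samples per second reported during training, the data pipeline is likely the bottleneck; if it is far higher, the slowdown probably lies elsewhere.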

Training Speed

If IO is the bottleneck for your training speed, you can mount the dataset in RAM using the following commands.

# Example for a machine with 256G of RAM
sudo mkdir /train_tmp
sudo mount -t tmpfs -o size=140G tmpfs /train_tmp
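Before copying the dataset into the tmpfs mount, it can help to confirm that it fits in the space that was reserved. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the dataset path is whatever directory holds your training records.

```python
import os
import shutil


def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks
                total += os.path.getsize(file_path)
    return total


def fits_in_mount(dataset_dir, mount_point="/train_tmp"):
    """Return True if the dataset fits in the free space of the mount."""
    free_bytes = shutil.disk_usage(mount_point).free
    return directory_size_bytes(dataset_dir) <= free_bytes
```

Keep in mind that tmpfs pages count against system memory, so the mount size should leave enough RAM for the training processes themselves.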