Paid for Multiple GPUs on Azure and Can't Use Them with Deep Learning? This Post is for You!

You went to the Azure portal and got an NC24R that has a wonderful 224 GB of memory, 24 cores, over 1 TB of disk, and, best of all, 4 Tesla K80 cards for your complete enjoyment in Deep Learning training. All perfect, right? Almost. Right at the start, I tried to run a training script and, with a simple htop to monitor the training, I saw that TensorFlow was dumping all the work onto the CPUs. Even with those wonderful 24 cores running at 100% usage, that doesn't come close to what our monstrous GPUs can produce. (Note: you wouldn't trade four 2017 Ferraris for twenty-four 1985 Fiat 147s, right?) Logging into our wonderful machine to see what had happened, I first checked whether the GPUs were even present, which indeed they were.

[code language="shell"]
azure_teste@deep-learning:~$ nvidia-smi
[/code]

[code language="shell"]
Tue Jun 27 18:21:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | B2A3:00:00.0     Off |                    0 |
| N/A   47C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | C4D8:00:00.0     Off |                    0 |
| N/A   57C    P0    61W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | D908:00:00.0     Off |                    0 |
| N/A   52C    P0    56W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | E2A2:00:00.0     Off |                    0 |
| N/A   53C    P0    56W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[/code]

The GPUs are all there and recognized by the driver, but utilization sits at 0% and no processes are listed, which means the CPUs are doing all the work. This is why I always suggest doing a sanity check of the hardware before starting a long training run. The fix was to reinstall the recommended drivers:

[code language="shell"]
$ sudo apt-get update && sudo apt-get install -y ubuntu-drivers-common
$ sudo ubuntu-drivers autoinstall
[/code]
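You can also confirm the symptom from inside TensorFlow itself, before and after reinstalling the drivers, by asking it which devices it has registered. This is a minimal sketch, assuming a TensorFlow 1.x installation with the GPU-enabled build (the releases current at the time of writing); on the broken setup it listed only the CPU.

[code language="python"]
# List the devices TensorFlow has registered (TensorFlow 1.x API).
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)

# With broken or missing NVIDIA drivers this prints only the CPU device.
# Once the drivers are fixed, it should also list the four K80 GPUs.
[/code]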

After running the ubuntu-drivers commands above to install all the recommended drivers, my training job finally started using the GPUs. But why did that work? Because the drivers installed before were built for a different kernel version and even for a different CUDA version.

The key point is that sudo ubuntu-drivers autoinstall detects the hardware and installs the recommended drivers at their most recent stable versions. After that, the problem was solved.
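To double-check that all four K80s were really being used, and not just detected, it helps to run a tiny computation pinned to each GPU while logging device placement. Below is a minimal sketch, again assuming TensorFlow 1.x; the matrix sizes and the explicit /gpu:N device strings are only illustrative.

[code language="python"]
import tensorflow as tf

# Pin one small matrix multiplication to each of the four GPUs.
products = []
for i in range(4):
    with tf.device('/gpu:%d' % i):
        a = tf.random_normal([2000, 2000])
        b = tf.random_normal([2000, 2000])
        products.append(tf.matmul(a, b))

# log_device_placement=True prints where every op actually ran,
# so you can see the work landing on the GPUs instead of the CPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(products)
[/code]

While this runs, nvidia-smi in another terminal should now show non-zero utilization on the four GPUs and the Python process in the processes table.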

I also wanted to make sure that installing CUDA, whether from the NVIDIA website or from the standard Ubuntu package repositories, would not leave the system unable to boot, because if that happens we have more trouble than it's worth.

The purpose of this post is to make sure that anyone using this kind of VM for deep learning on Azure does not end up in the same situation, and to show that it is perfectly possible to use GPUs in the cloud. Thanks for reading!