---
layout: post
permalink: /:title
title: "Paid for Multiple GPUs on Azure and Can’t Use Them with Deep Learning? This Post Is for You."
image: Cover
excerpt: "Paid for multiple GPUs on Azure and can’t use them with Deep Learning? This post is for you."
lang: en-us
---

You went to the Azure portal and provisioned an NC24R with a marvelous 224 GB of RAM, 24 cores, over 1 TB of disk, and best of all: 4 Tesla K80 GPUs for your complete Deep Learning training pleasure. Everything perfect, right? Almost.

When I ran my first training script and watched it with a simple htop, I saw that TensorFlow was dumping all of the training work onto the CPUs. Even with these 24 wonderful cores at 100% utilization, they don’t come close to what our colossal GPUs can deliver. (Note: You wouldn’t trade 4 Ferraris from 2017 for 24 Fiat 147s from 1985, right?)

Logging in to our marvelous machine to see what had happened, I first checked whether the GPUs were actually present in the machine, and indeed they were.

[code language="shell"]
azure_teste@deep-learning:~$ nvidia-smi
[/code]

[code language="shell"]
Tue Jun 27 18:21:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | B2A3:00:00.0     Off |                    0 |
| N/A   47C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | C4D8:00:00.0     Off |                    0 |
| N/A   57C    P0    61W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | D908:00:00.0     Off |                    0 |
| N/A   52C    P0    56W / 14… [truncated]