layout: post
permalink: /:title
title: "Paid for Multiple GPUs on Azure and Can’t Use Them with Deep Learning? This Post Is for You."
image: Cover
excerpt: "Paid for multiple GPUs on Azure and can’t use them with Deep Learning? This post is for you."
lang: en-us
---
You went to the Azure portal and provisioned an NC24R with a marvelous 224 GB of RAM, 24 cores, over 1 TB of disk, and best of all, 4 K80 GPUs for your complete Deep Learning training pleasure. Everything is perfect, right? Almost.

Right from the start, I ran a training script and, watching it with a simple htop, saw that TensorFlow was running all of the training on the CPU. Even with those 24 wonderful cores at 100% utilization, that doesn't come close to what our colossal GPUs can deliver. (Note: you wouldn't trade 4 Ferraris from 2017 for 24 Fiat 147s from 1985, right?) Logging in to our marvelous machine to see what had happened, I first checked whether the GPUs were actually present, which indeed they were.
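If you want to repeat this check from a script while training runs, something like the following could work. It is only a minimal sketch: it assumes `nvidia-smi` is on the PATH and uses its CSV query mode (`--query-gpu` with `--format=csv,noheader,nounits`). The function names `query_nvidia_smi`, `parse_gpu_rows`, and `idle_gpus` are my own, and the sample rows are illustrative rather than captured from the machine.

```python
import subprocess

# Fields requested from nvidia-smi's CSV query mode.
QUERY_FIELDS = "index,name,memory.used,utilization.gpu"


def query_nvidia_smi():
    """Run nvidia-smi in CSV mode and return its raw text output.

    Assumes nvidia-smi is installed and on the PATH.
    """
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + QUERY_FIELDS,
        "--format=csv,noheader,nounits",
    ]
    return subprocess.check_output(cmd).decode("utf-8")


def parse_gpu_rows(text):
    """Turn lines such as '0, Tesla K80, 0, 0' into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, mem_used, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(mem_used),
            "utilization_pct": int(util),
        })
    return gpus


def idle_gpus(gpus):
    """Return the indices of GPUs with no memory in use and 0% utilization."""
    return [
        gpu["index"]
        for gpu in gpus
        if gpu["memory_used_mib"] == 0 and gpu["utilization_pct"] == 0
    ]


# Illustrative sample rows (hypothetical, not captured output).
SAMPLE = "0, Tesla K80, 0, 0\n1, Tesla K80, 0, 0\n"
```

On the VM you would call `idle_gpus(parse_gpu_rows(query_nvidia_smi()))` while the training script is running. If every GPU index comes back, the work is almost certainly on the CPU. Interactively, the plain command gives the same information: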
[code language="shell"]
azure_teste@deep-learning:~$ nvidia-smi
[/code]
[code language="shell"]
Tue Jun 27 18:21:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | B2A3:00:00.0     Off |                    0 |
| N/A   47C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | C4D8:00:00.0     Off |                    0 |
| N/A   57C    P0    61W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | D908:00:00.0     Off |                    0 |
| N/A 52C P0 56W / 14… [truncated]