Use GPU with Docker Container

This forum is dedicated to users of the system installed at the ISASI institute of the CNR, consisting of:
- an Nvidia DGX-A100 server
- a Dell MX-7000 blade enclosure with an MX-750c server
- a SAN storage unit of about 200 TB serving the compute nodes

Post by mdelcoco »

The following guide shows how to train a PyTorch-based network inside a Docker container with GPU access.

In your home directory, create and enter a folder named "dataDocker":

Code: Select all

mkdir dataDocker
cd dataDocker
Then create two files:

"Dockerfile" with the following content

Code: Select all

# CUDA base image; the GPUs are exposed by the host at run time via --gpus
FROM nvidia/cuda:11.4.0-base-ubuntu20.04

# install Python 3 and pip, then PyTorch built against CUDA 11.6
RUN apt-get update
RUN apt-get install -y python3 python3-pip
RUN pip3 install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

# copy the training script into the image and run it when the container starts
COPY traintest.py .
ENTRYPOINT ["python3", "traintest.py"]
"traintest.py:"with the following content

Code: Select all

import torch
import torchvision
import torchvision.transforms as transforms
import sys

# redirect all print() output to a file on the /data volume mounted from the host
sys.stdout = open('/data/somefile.txt', 'w')

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
           
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# train on the first GPU if one is visible inside the container, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

net = Net()
net.to(device)

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')
The script redirects its stdout to the file "/data/somefile.txt" inside the container.
That file is accessible from the host machine thanks to the bind mount of the data folder in the docker run command below.
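For example, since the run command at the end of this guide mounts $PWD/data on /data, the training output can be followed directly from the host (assuming the commands below are launched from the dataDocker folder). Keep in mind that Python buffers the output file, so lines may show up with some delay or only when the script finishes.

Code: Select all

tail -f data/somefile.txt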

Finally, run the following commands.

The first one builds the Docker image:

Code: Select all

docker build -t test_nvidia_torch_image .
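Before launching the training container, you can optionally check that Docker is able to expose the GPUs through the NVIDIA container runtime, for example by running nvidia-smi in a throw-away container based on the same CUDA image used in the Dockerfile (if the usual GPU table is printed, the --gpus all flag of the next command will work as well):

Code: Select all

docker run --rm --gpus all nvidia/cuda:11.4.0-base-ubuntu20.04 nvidia-smi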
The second one runs a container based on the previously built image:

Code: Select all

docker run -it --name testTorch -d --gpus all -v $PWD/data:/data  test_nvidia_torch_image
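Once the container is running in detached mode, a few standard Docker commands can be used to monitor it and to clean up afterwards (testTorch is the container name chosen above):

Code: Select all

# check that the container is up
docker ps
# show anything written to the container's own stdout/stderr
# (the training prints themselves end up in data/somefile.txt on the host)
docker logs testTorch
# stop and remove the container, e.g. before re-running it with the same name
docker stop testTorch
docker rm testTorch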