Overview
There are several GPU nodes in the Europa cluster available to DSA students.
- 2 nodes with 3 GPUs each
- 2 nodes with 1 GPU each
Card Model:
- Nvidia Tesla P100 GPUs
- CUDA Capability: 6.0 (Pascal Gen)
- Device Memory: 12 GiB
Node Configuration:
- at least 4 CPU cores per GPU
- 18 GiB host memory per GPU
User Quota:
- 1 concurrent job
- 1-3 GPU(s) per job
- all cores
- all RAM
Get Access
Because GPUs are a high-contention resource, only job submission is allowed.
You will need an access token to submit jobs to the GPU nodes.
Treat it like a private key and DO NOT share it with anyone.
If you requested GPUs from us, this will be configured for you and should work out of the box when you run the GPU job submission commands.
To check for yourself that you do have a token:
GPURUN_EXPORT=ON /dsa/scripts/gpu-conf
About the GPU Container
Your code will execute in an Nvidia Docker container, so the environment and installed packages are different from those of the JupyterHub containers on Europa. The code will run with your own identity, and privilege escalation is not allowed.
The following images are allowed:
- tensorflow/tensorflow
  - 2.0.1-gpu-py3
  - 1.15.2-gpu-py3
  - 1.14.0-gpu-py3
  - 1.12.3-gpu-py3
- pytorch/pytorch
  - 1.2-cuda10.0-cudnn7-runtime
If this is not sufficient, you may request that additional images be allowed. For coursework, this requires instructor approval: have your instructor email our IT Admin on your behalf. For projects, we can grant you special permission to run images outside this list.
The GPU container will have access to:
- /dsa/home/$USER (aka your DSA home folder)
- /dsa/data
In other words, the GPU nodes can read and write, under your identity, almost anything you can see from your Europa Jupyter Server, except for shared group folders, whose access is not implemented yet.
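For example, a job could read a shared dataset from /dsa/data and write its results into your home folder. This is only a sketch; the file names below are placeholders, and it assumes the USER environment variable is set inside the container:
import os
# Hypothetical paths for illustration; substitute your actual files.
data_path = '/dsa/data/example_dataset/sample.csv'
out_path = os.path.join('/dsa/home', os.environ.get('USER', ''), 'gpu-job-output.txt')
# Count the lines of the shared dataset ...
with open(data_path) as f:
    n_lines = sum(1 for _ in f)
# ... and write the result under your own home folder.
with open(out_path, 'w') as f:
    f.write('lines: {}\n'.format(n_lines))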
Job Management and Lifecycle
The following functions are available to manage GPU jobs.
Submit a new job to a GPU worker node
Use Case 1.1
/dsa/scripts/gpu-run a.ipynb
In this case, submit.yaml in the current directory supplies the default parameter values (e.g., the container image).
Use Case 1.2
Specify a Python script instead of a Jupyter notebook.
/dsa/scripts/gpu-run a.py
Use Case 2.1
Specify the container image at the same time.
/dsa/scripts/gpu-run tensorflow/tensorflow:1.14.0-gpu-py3 a.ipynb
Use Case 2.2
/dsa/scripts/gpu-run tensorflow/tensorflow:1.14.0-gpu-py3 a.py
Use Case 3.1
Store all the job submission parameters in a yaml file.
It must contain a complete set of parameters required to submit a job.
/dsa/scripts/gpu-run a.yaml
# Content of a.yaml
token: *****                                   # optional if already configured
image: tensorflow/tensorflow:1.14.0-gpu-py3    # optional if specified on the command line
gpus: 1                                        # optional; this is the default value
script: a.ipynb                                # required here; optional only in submit.yaml
Use Case 3.2
Use submit.yaml from the current directory.
/dsa/scripts/gpu-run
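For reference, a submit.yaml could look like the following sketch. It is assumed to take the same keys as a.yaml above; the only difference is that script is optional here as well:
# Content of submit.yaml (illustrative)
token: *****                                   # optional if already configured
image: tensorflow/tensorflow:1.14.0-gpu-py3    # optional if specified on the command line
gpus: 1                                        # optional; this is the default value
script: a.ipynb                                # optional in submit.yaml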
Example 1: What happens during job submission
jupyterhub ~/jupyter/GPU$ /dsa/scripts/gpu-run GPUTest.ipynb
GPURUN_IMAGE from submit.yaml
GPURUN_TOKEN from /home/test_user/jupyter/.gpu-run/default.yaml
[NbConvertApp] Converting notebook GPUTest.ipynb to python
[NbConvertApp] Writing 290 bytes to GPUTest.py
renamed 'GPUTest.py' -> 'GPUTest.gpu.py'
Preparing GPUTest.gpu.py
Submitting tensorflow/tensorflow:1.14.0-gpu-py3 GPU/GPUTest.gpu.py
task_id: 90520cd2-51ce-11ea-8803-9268bcb2efec
jupyterhub ~/jupyter/GPU$ /dsa/scripts/gpu-check 90520cd2-51ce-11ea-8803-9268bcb2efec
Succeeded
This script prepares the notebook or Python script for job submission by generating a slightly modified file (*.gpu.py) and then requests that the GPU cluster run this file. It returns a task id so that the user can refer to this job in the future.
If no resources are available, the job will be queued and remain in the Pending status.
Example 2: Query your reserved GPUs
import os
# List the GPUs reserved for this job and save the result to a file.
os.system('nvidia-smi -L > nvidia-smi.txt')
This writes nvidia-smi.txt to your current folder. Currently, writing to disk is the only way to obtain any kind of output; standard output is not available.
With that being said, it is entirely feasible to create a wrapper script and redirect all your standard output to a file.
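For instance, a minimal sketch of such a wrapper could look like this (the file name stdout.txt is arbitrary):
import sys
# Send everything printed to stdout/stderr into a file in the working directory,
# since the GPU nodes do not expose console output.
log = open('stdout.txt', 'w', buffering=1)  # line-buffered so output lands promptly
sys.stdout = log
sys.stderr = log
print('this line ends up in stdout.txt')
# ... the rest of your job code follows ...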
Example 3: A Trivial TensorFlow Example
import tensorflow as tf
a = tf.constant(100)
with tf.Session() as sess:
    _a = sess.run(a)
# Write the computed value to a file, since stdout is not available.
with open('out.txt', 'w') as f:
    f.write(str(_a))
Check status
/dsa/scripts/gpu-check task_id
The task_id is provided upon a successful job submission. This job submission system is a minimal implementation; we do not store the task_id anywhere, so come up with your own way to keep track of it.
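One possible approach, only a sketch that assumes the "task_id: ..." output format shown in Example 1, is to submit through a small helper that appends each task id to a personal log file:
import datetime
import subprocess
# Submit the job and capture gpu-run's output (notebook name as in Example 1).
out = subprocess.check_output(['/dsa/scripts/gpu-run', 'GPUTest.ipynb'],
                              universal_newlines=True)
# Record the task id, together with a timestamp, in task_ids.log.
for line in out.splitlines():
    if line.startswith('task_id:'):
        task_id = line.split(':', 1)[1].strip()
        with open('task_ids.log', 'a') as f:
            f.write('{} {}\n'.format(datetime.datetime.now().isoformat(), task_id))
        print(task_id)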
Possible states:
- Terminal states
- "Completed": job terminated without detectable error
- "Succeeded": same as Completed; it exists for compatibility
- "Failed": job terminated with error (either from user or system); or the job may have been canceled or killed
- Any other state that the job submission system does not recognize will be labeled as "Failed", and that label is final.
- Transitional states
- "Pending": job is waiting for some type of resource to become available
- "Unknown": the cluster has lost track of the task. Ironically, this is a known state to the job submission system; it simply means the cluster cannot reach the container.
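Since Pending and Unknown are transitional, one simple way to wait for a result is to poll gpu-check until a terminal state comes back. A minimal sketch, assuming gpu-check prints just the state as in Example 1 (the polling interval is arbitrary):
import subprocess
import time
TERMINAL_STATES = {'Completed', 'Succeeded', 'Failed'}
def wait_for_job(task_id, interval=60):
    # Poll /dsa/scripts/gpu-check every `interval` seconds until the job
    # reaches a terminal state, then return that state.
    while True:
        state = subprocess.check_output(['/dsa/scripts/gpu-check', task_id],
                                        universal_newlines=True).strip()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
# Example: print(wait_for_job('90520cd2-51ce-11ea-8803-9268bcb2efec'))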
Cancel a job
/dsa/scripts/gpu-rm task_id
This script will signal the cluster to terminate the job on a best-effort basis and provides no feedback. If the task is in a transitional state, it will become "Failed" shortly.
Troubleshoot
Debugging code
Debugging is inconvenient at best without seeing any logs or output from the GPU nodes. The GPU nodes do have access to your home folder, so writing output to a file can give you some insight into what went wrong (see the sketch below). Other than that, we highly recommend testing the code first in the TensorFlow CPU container or on your own computer before submitting a job.
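One way to capture at least the error is to wrap the job body so that any exception is written to a file you can inspect from JupyterHub afterwards. This is only a sketch; main() and error.log are placeholders:
import traceback
def main():
    # ... your actual job code goes here ...
    pass
try:
    main()
except Exception:
    # Persist the full traceback so it can be read from your home folder later.
    with open('error.log', 'w') as f:
        traceback.print_exc(file=f)
    raise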
Job has been returning Failed status
More likely than not, there is an error in the user code, so reach out for help if you cannot resolve it. It is highly unlikely that this type of error will resolve itself over time or through retries.
Error message unreadable
If the error message does not make any sense, it is possible that a server-side crash led to a cascading wave of errors that the client-side script does not expect, e.g. a parsing error, compilation error, etc. Please report it and we will investigate and fix it as a high priority.
Are the submission parameters incorrect?
If you suspect your GPU access is configured incorrectly, the following command will print the submission parameters for your reference.
GPURUN_EXPORT=ON /dsa/scripts/gpu-conf
Sample output:
GPURUN_IMAGE from submit.yaml
GPURUN_GPUS from submit.yaml
GPURUN_TOKEN from /home/test_user/jupyter/.gpu-run/default.yaml
GPURUN_USER='user'
GPURUN_UID='1000'
GPURUN_GID='1000'
GPURUN_TOKEN='********-ae69-43fa-a06d-57937d04a6aa'
GPURUN_IMAGE='tensorflow/tensorflow:1.14.0-gpu-py3'
GPURUN_GPUS='2'
Need a real person?
Email me: zy5f9@mail.missouri.edu
This is mostly for bugs within our system and limited usage support. Other questions, including feature requests and user code errors, will be redirected.