GPU Resources
Overview
There are several GPU nodes in the Europa cluster available to DSA students.
- 2 nodes with 3 GPUs each
- 2 nodes with 1 GPU each
Card Model:
- Nvidia Tesla P100 GPUs
- CUDA Compute Capability: 6.0 (Pascal generation)
- Device Memory: 12 GiB
Node Configuration:
- at least 4 CPU cores per GPU
- 18 GiB host memory per GPU
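For example, a job using all 3 GPUs on a 3-GPU node gets at least 12 CPU cores and about 54 GiB of host memory.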
User Quota:
- 1 concurrent job
- 1-3 GPU(s) per job
- all CPU cores of the assigned node
- all host RAM of the assigned node
Get Access
Because GPUs are a high-contention resource, access is limited to job submission only.
You will need an access token to submit jobs to the GPU nodes.
Treat it like a private key and DO NOT share it with anyone.
If you requested GPU access from us, the token will already be configured for you and should work out of the box when you run the GPU job submission commands.
To check for yourself that you do have a token:
GPURUN_EXPORT=ON /dsa/scripts/gpu-conf
About the GPU Container
Your code will execute in an Nvidia Docker container, so the environment and installed packages differ from those of the JupyterHub containers on Europa. The code runs under your own identity, and privilege escalation is not allowed.
The following images are allowed:
- tensorflow/tensorflow
- 2.0.1-gpu-py3
- 1.15.2-gpu-py3
- 1.14.0-gpu-py3
- 1.12.3-gpu-py3
- pytorch/pytorch
- 1.2-cuda10.0-cudnn7-runtime
If this is not sufficient, you may request that additional images be allowed. For coursework, this requires instructor approval: have your instructor email our IT Admin on your behalf. For projects, we can grant you special permission to run images outside of this list.
The GPU container will have access to:
- /dsa/home/$USER (aka your DSA home folder)
- /dsa/data
In other words, the GPU nodes can read and write, under your identity, almost anything you can see from your Europa Jupyter server, except for shared group folders, whose access is not implemented yet.
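As a quick check, a submitted script can touch both locations. The sketch below assumes $USER is set inside the container (substitute your username otherwise); the output file names are arbitrary.
import os

# Your DSA home folder is mounted read-write inside the GPU container.
home = os.path.join('/dsa/home', os.environ.get('USER', 'your_username'))

# Prove write access by dropping a marker file into your home folder.
with open(os.path.join(home, 'gpu_container_check.txt'), 'w') as f:
    f.write('written from inside the GPU container\n')

# Prove read access to /dsa/data by saving the first few directory entries.
with open(os.path.join(home, 'dsa_data_listing.txt'), 'w') as f:
    f.write('\n'.join(sorted(os.listdir('/dsa/data'))[:20]))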
Job Management and Lifecycle
The following functions are available to manage GPU jobs.
Submit a new job to a GPU worker node
Use Case 1.1
/dsa/scripts/gpu-run a.ipynb
In this case, submit.yaml in the current directory stores the default values for the submission parameters.
Use Case 1.2
Specify a Python script instead of a Jupyter notebook.
/dsa/scripts/gpu-run a.py
Use Case 2.1
Specify the container image at the same time.
/dsa/scripts/gpu-run tensorflow/tensorflow:1.14.0-gpu-py3 a.ipynb
Use Case 2.2
/dsa/scripts/gpu-run tensorflow/tensorflow:1.14.0-gpu-py3 a.py
Use Case 3.1
Store all the job submission parameters in a yaml file.
It must contain a complete set of parameters required to submit a job.
/dsa/scripts/gpu-run a.yaml
# Content of a.yaml
token: *****                                  # optional if already configured
image: tensorflow/tensorflow:1.14.0-gpu-py3   # optional if specified in the cmdline
gpus: 1                                       # optional; this is the default value
script: a.ipynb                               # optional only in submit.yaml
Use Case 3.2
Use submit.yaml from the current directory.
/dsa/scripts/gpu-run
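For reference, submit.yaml follows the same format as a.yaml above; a minimal sketch (with illustrative values) might look like this:
# Content of submit.yaml (illustrative values)
token: *****                                  # optional if already configured
image: tensorflow/tensorflow:1.14.0-gpu-py3
gpus: 1
# script: may be omitted here and given on the command line instead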
Example 1: What happens during the job submission
jupyterhub ~/jupyter/GPU$ /dsa/scripts/gpu-run GPUTest.ipynb
GPURUN_IMAGE from submit.yaml
GPURUN_TOKEN from /home/test_user/jupyter/.gpu-run/default.yaml
[NbConvertApp] Converting notebook GPUTest.ipynb to python
[NbConvertApp] Writing 290 bytes to GPUTest.py
renamed 'GPUTest.py' -> 'GPUTest.gpu.py'
Preparing GPUTest.gpu.py
Submitting tensorflow/tensorflow:1.14.0-gpu-py3 GPU/GPUTest.gpu.py
task_id: 90520cd2-51ce-11ea-8803-9268bcb2efec
jupyterhub ~/jupyter/GPU$ /dsa/scripts/gpu-check 90520cd2-51ce-11ea-8803-9268bcb2efec
Succeeded
This script prepares the notebook or Python script for job submission by generating a slightly modified file (*.gpu.py) and then submits a request for the GPU cluster to run this file. On success it returns a task_id so that you can refer to this job in the future.
If no resources are available, the job will be queued and remain in Pending status.
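If you would rather not re-run gpu-check by hand, a small polling wrapper from your Jupyter terminal is enough. This is only a sketch; it assumes gpu-check prints just the status string, as in the transcript above, and the task id and sleep interval are placeholders.
import subprocess
import time

TASK_ID = '90520cd2-51ce-11ea-8803-9268bcb2efec'  # replace with your own task_id
TERMINAL_STATES = {'Completed', 'Succeeded', 'Failed'}

while True:
    # gpu-check prints the current status of the task (e.g. Pending, Succeeded).
    result = subprocess.run(['/dsa/scripts/gpu-check', TASK_ID],
                            stdout=subprocess.PIPE, universal_newlines=True)
    status = result.stdout.strip()
    print(status)
    if status in TERMINAL_STATES:
        break
    time.sleep(60)  # poll once a minute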
Example 2: Query your reserved GPUs
import os

# List the GPUs reserved for this job and save the result to a file.
os.system('nvidia-smi -L > nvidia-smi.txt')
This writes an output file to your current folder. Currently, writing to disk is the only way to obtain any kind of output; standard output is not available.
That said, it is entirely feasible to wrap your code and redirect all of your standard output to a file, as sketched below.
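A minimal wrapper along these lines captures everything printed to stdout and stderr into a log file; the main() body and the log file name are placeholders for your own code.
import contextlib

def main():
    # Replace this with your actual workload; anything it prints is captured.
    print('starting job')
    print('finished')

# The GPU job system does not surface standard output, so send it to a file.
with open('job_output.log', 'w') as log:
    with contextlib.redirect_stdout(log), contextlib.redirect_stderr(log):
        main()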
Example 3: A Trivial TensorFlow Example
import tensorflow as tf

# Evaluate a trivial constant and write the result to a file.
a = tf.constant(100)
with tf.Session() as sess:
    _a = sess.run(a)
with open('out.txt', 'w') as f:
    f.write(str(_a))
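Once the job reaches a terminal state, out.txt should appear in the folder the script was submitted from and contain 100.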
Check status
/dsa/scripts/gpu-check task_id
The task_id is provided upon a successful job submission. This job submission system is a minimal implementation; we do not store the task_id anywhere, so come up with your own way to keep track of it.
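For example, you could append each submission's output, which includes the task_id line, to a log file of your own, e.g. /dsa/scripts/gpu-run a.ipynb | tee -a gpu-jobs.log (the log file name is arbitrary).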
Possible states:
- Terminal states
- "Completed": job terminated without detectable error
- "Succeeded": same as Completed; it exists for compatibility
- "Failed": job terminated with error (either from user or system); or the job may have been canceled or killed
- Any other state that the job submission system does not recognize will be labeled "Failed", and that status is final.
- Transitional states
- "Pending": job is waiting for some type of resource to become available
- "Unknown": the cluster has lost track of the task. Ironically, it is a known state for the job submission system where the cluster simply cannot reach the container.
Cancel a job
/dsa/scripts/gpu-rm task_id
This script signals the cluster to terminate the job on a best-effort basis and provides no feedback. If the task is in a transitional state, it will become "Failed" shortly.
Troubleshoot
Debugging code
Debugging is inconvenient at best without seeing any logs or output from the GPU nodes. The GPU nodes do have access to your home folder, so writing output to a file can give you some insight into what went wrong. Beyond that, we highly recommend testing your code first in the TensorFlow CPU container or on your own computer before submitting a job.
Job has been returning Failed status
More likely than not, there is an error in your code, so reach out for help if you cannot resolve it. It is highly unlikely that this type of error will resolve itself over time or through retries.
Error message unreadable
If the error message does not make any sense, it is possible that a server-side crash led to a cascade of errors that the client-side script does not expect, e.g. a parsing error or compilation error. Please report it and we will investigate and fix it as a high priority.
Are the submission parameters incorrect?
If you suspect your GPU access is configured incorrectly, the following command will print the submission parameters for your reference.
GPURUN_EXPORT=ON /dsa/scripts/gpu-conf
Sample output:
GPURUN_IMAGE from submit.yaml
GPURUN_GPUS from submit.yaml
GPURUN_TOKEN from /home/test_user/jupyter/.gpu-run/default.yaml
GPURUN_USER='user'
GPURUN_UID='1000'
GPURUN_GID='1000'
GPURUN_TOKEN='********-ae69-43fa-a06d-57937d04a6aa'
GPURUN_IMAGE='tensorflow/tensorflow:1.14.0-gpu-py3'
GPURUN_GPUS='2'
Need a real person?
Email me: zy5f9@mail.missouri.edu
This is mostly for bugs within our system and limited usage support. Other questions, including feature requests and user code errors, will be redirected.