Workarounds

From DeepSense Docs

Users may run into problems when using the DeepSense platform. Some problems are caused by the OS or applications; others are related to how the systems are used. We try our best to solve every problem, but restrictions in the OS or applications sometimes prevent a full solution. In those cases, the workarounds on this page can help avoid a problem or reduce its impact.

Jobs fail due to broken sessions on compute nodes

When a user submits an interactive LSF job, a compute node is assigned for the user to interact with. For example, the user can open a Jupyter notebook to edit and execute scripts. However, if the job runs for a long time, the session opened for the user may be closed; the user then loses the session and the job fails. This wastes the user's time and can be frustrating.
A job can be kept running in the background after the session is closed by putting 'nohup' before the script name and '&' after it. The syntax is: 'nohup <script name> &'. For example, starting Jupyter Notebook on a compute node with the command 'nohup jupyter notebook &' keeps it running when the session closes. Users who submit batch jobs do not need to worry about this problem, because LSF manages all computing resources for batch jobs.
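
The 'nohup' pattern described above can be sketched as follows. This is a minimal illustration, not a DeepSense-specific recipe: 'long_job.sh' and 'long_job.log' are hypothetical names standing in for a real workload and its log file. Redirecting output to a named log is an addition to the basic syntax; without it, 'nohup' normally writes to 'nohup.out' in the current directory.

```shell
# Hypothetical long-running script standing in for a real workload.
cat > long_job.sh <<'EOF'
#!/bin/bash
sleep 1
echo "job finished"
EOF
chmod +x long_job.sh

# Start the script so that it ignores the hangup signal sent when the
# session closes, and send its output to a log file we choose.
nohup ./long_job.sh > long_job.log 2>&1 &

# $! holds the PID of the most recent background command; note it down
# so the process can be checked or stopped later.
job_pid=$!
echo "Started long_job.sh with PID $job_pid"
```

After reconnecting, 'ps -p <PID>' shows whether the job is still running, and the log file shows its output so far.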

GPU memory occupied without any running processes

Sometimes 'nvidia-smi' reports no running processes on a GPU even though the GPU's memory is almost fully occupied, as with GPU 0 in the output below:

[username@ds-cmgpu-01 ~]$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000002:01:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |  15682MiB / 16280MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000006:01:00.0 Off |                    0 |
| N/A   27C    P0    28W / 300W |      0MiB / 16280MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+ 
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

There could be various reasons for this. For example, a user may press Ctrl+C while jobs are running, which can cause the jobs to exit while some of their processes keep running. To clean up and release the memory these processes occupy, the user needs to find the processes that he or she still owns on the system and kill them. The command 'ps -ef | grep yourusername' lists the processes owned by the user. These processes may be visible to the OS but not to 'nvidia-smi'. Once sure that a process should be terminated, the user can run 'kill -9 pid', where 'pid' is the process ID of that process. After all of the user's leftover processes are killed, the occupied GPU memory should be released automatically.
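
The clean-up steps above might look like the following sketch. Two things here are assumptions rather than part of the original instructions: the 'sleep 300' process is only a stand-in for a real leftover process, and the script first sends the default termination signal and uses 'kill -9' only if the process survives. 'ps -u' is used instead of 'ps -ef | grep' so that only processes owned by the current user are listed.

```shell
# Stand-in for a leftover process; on a real node it would already exist
# and its PID would be read from the ps output instead.
sleep 300 &
leftover_pid=$!

# List every process owned by the current user: PID, elapsed time, command.
ps -u "$(id -un)" -o pid,etime,args

# Ask the process to exit (SIGTERM), then give it a moment to do so.
kill "$leftover_pid" 2>/dev/null
sleep 1

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$leftover_pid" 2>/dev/null; then
    # Still alive, so force termination as described in the text.
    kill -9 "$leftover_pid"
fi

# Reap the background job, then confirm that it is gone.
wait "$leftover_pid" 2>/dev/null || true
still_running=no
kill -0 "$leftover_pid" 2>/dev/null && still_running=yes
echo "still running: $still_running"
```

Running 'nvidia-smi' again afterwards shows whether the GPU memory has been released.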

Cannot use Caffe on login node or compute nodes without GPUs

Cuda number of devices: -579579216
Current device id: -579579216
Current device name: 
[==========] Running 2207 tests from 293 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from AccuracyLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] AccuracyLayerTest/0.TestSetup
E0206 15:59:26.604874  7990 common.cpp:121] Cannot create Cublas handle. Cublas won't be available.
E0206 15:59:26.611477  7990 common.cpp:128] Cannot create Curand generator. Curand won't be available.
F0206 15:59:26.611616  7990 syncedmem.cpp:500] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***

You may see this error when attempting to use Caffe on a node without GPUs or a GPU node without specifically requesting a GPU.

To resolve this problem, use a GPU node and request a GPU. Caffe cannot run without an available GPU.
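
Before starting Caffe, a quick check such as the sketch below can show whether the current node exposes any GPU at all. It relies only on 'nvidia-smi -L', which lists the GPUs the driver can see. Note the limitation: it detects a node without GPUs, but it may not detect a GPU node where the job simply did not request a GPU, since that depends on how the scheduler restricts access. The variable name 'gpu_status' is illustrative.

```shell
# Check that the nvidia-smi tool exists and that it can list at least
# one GPU; either failure is treated as "no GPU available here".
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    gpu_status="available"
else
    gpu_status="unavailable"
fi

echo "GPU status on $(hostname): $gpu_status"
```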

Nested anaconda environments may cause strange behaviour

Some users have experienced strange behaviour when activating an anaconda environment within another environment. This may include permission errors, incorrect versions of software being loaded, or unexpected conflicts when attempting to install packages. If you encounter problems with a nested anaconda environment, first try deactivating all anaconda environments and then activating only the desired environment.
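
One way to check for nesting is to look at the 'CONDA_SHLVL' variable, which conda sets to the number of currently activated environments. The sketch below only reports that depth and suggests the next step; the function name 'conda_nesting_depth' is hypothetical, and treating any depth above 1 as "nested" is an assumption, since an environment activated on top of the base environment also counts as depth 2.

```shell
# Print how many conda environments are currently activated.
# CONDA_SHLVL is unset when conda has never been activated in this shell,
# so fall back to 0 in that case.
conda_nesting_depth() {
    echo "${CONDA_SHLVL:-0}"
}

depth=$(conda_nesting_depth)
if [ "$depth" -gt 1 ]; then
    echo "Nested conda environments detected (depth $depth)."
    echo "Run 'conda deactivate' $depth times, then 'conda activate <env>'."
else
    echo "No nested conda environments detected (depth $depth)."
fi
```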