Users may run into problems when using the DeepSense platform. These problems can be caused by the OS or by applications, and sometimes they are related to how users use the systems. We try our best to solve every problem, but for some problems we cannot provide a full solution because of restrictions in the OS or applications. In those cases, we can use workarounds to either avoid the problem or mitigate it to some extent.
== Jobs fail due to broken sessions on compute nodes ==
When a user submits an interactive LSF job, he/she is assigned a compute node on which to interact with the systems. For example, a user can open a Jupyter notebook to edit and execute scripts. However, if a job runs for a long time, the session opened for the user may be closed; the user then loses the session and the job fails. This wastes users' time and can be frustrating.
Fortunately, it is easy for a user to keep jobs running in the background even after the session is closed. Prefix the command with 'nohup' and append '&' to run it in the background. The syntax is: 'nohup <script name> &'. For example, a user can start Jupyter Notebook on a compute node with the command 'nohup jupyter notebook &' so that it is not terminated when the session is closed automatically.
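A minimal sketch of this workaround is shown below. The log file name 'jupyter.log' is just an illustrative choice; by default, nohup writes its output to 'nohup.out' in the current directory.

<nowiki># Start the notebook server so that it survives the interactive session being closed;
# by default nohup appends the output to nohup.out in the current directory
nohup jupyter notebook &

# Or redirect the server log to a file of your choice (jupyter.log is just an example name)
nohup jupyter notebook > jupyter.log 2>&1 &

# Confirm that the server is still running
ps -ef | grep jupyter
</nowiki>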
A user who submits batch jobs does not need to worry about this problem, because all the computing resources for batch jobs are administered by LSF.
== GPU memory occupied without any running processes ==
Sometimes, a user may see that no processes are running on a GPU while its memory is fully occupied. For example, see the output of 'nvidia-smi' below:
<nowiki>[username@ds-cmgpu-01 ~]$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000002:01:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |  15682MiB / 16280MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000006:01:00.0 Off |                    0 |
| N/A   27C    P0    28W / 300W |      0MiB / 16280MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
</nowiki>
There can be various reasons for this. For example, a user may press Ctrl+C while his/her jobs are running, which can cause the jobs to exit while some processes are still running. To clean up all of his/her processes and release the memory they occupy, the user needs to find out which of his/her processes are still running on the system and kill them. Run the command 'ps -ef|grep yourusername' to list the processes owned by your account; these processes may be visible to the OS but not to 'nvidia-smi'. If you are sure the processes should be terminated, use 'kill -9 pid' to kill them, where 'pid' is the process ID of a process. After all of a user's processes are killed, the occupied GPU memory should be released automatically.
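As a minimal sketch of the clean-up steps, the commands below walk through the sequence. The username 'yourusername' and the PID 12345 are placeholders; substitute your own account name and the actual process IDs reported by ps.

<nowiki># List the processes owned by your account
ps -ef | grep yourusername

# Terminate a leftover process by its PID (the second column of the ps output);
# 12345 is a placeholder PID
kill -9 12345

# After all leftover processes are gone, the GPU memory should be released
nvidia-smi
</nowiki>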
== Cannot use Caffe on login node or compute nodes without GPUs ==
<nowiki>Cuda number of devices: -579579216
Current device id: -579579216
Current device name:
[==========] Running 2207 tests from 293 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from AccuracyLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] AccuracyLayerTest/0.TestSetup
E0206 15:59:26.604874 7990 common.cpp:121] Cannot create Cublas handle. Cublas won't be available.
E0206 15:59:26.611477 7990 common.cpp:128] Cannot create Curand generator. Curand won't be available.
F0206 15:59:26.611616 7990 syncedmem.cpp:500] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
</nowiki>
You may see this error when attempting to use Caffe on a node without GPUs or a GPU node without specifically requesting a GPU.
To resolve this problem, use a GPU node and request a GPU. Caffe cannot run without an available GPU.
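As a rough sketch, an interactive GPU job can be requested through LSF along the lines of the command below. The queue name is a placeholder and the exact GPU resource string depends on how LSF is configured on the cluster, so check the job submission documentation for the correct options.

<nowiki># Request an interactive session with one GPU (the queue name is a placeholder)
bsub -Is -q <gpu_queue_name> -gpu "num=1" bash

# Once on the GPU node, confirm a GPU is visible before running Caffe
nvidia-smi
</nowiki>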