From DeepSense Docs
Revision as of 15:21, 1 December 2020 by Lyang

Users may run into problems when using the DeepSense platform. Some problems are caused by the OS or the applications; others are related to how the systems are used. We try our best to solve every problem, but for some of them the restrictions of the OS or applications mean we cannot provide a complete solution. In those cases a workaround can often avoid the problem or solve it to some extent.

Jobs fail due to broken sessions on compute nodes

When a user submits an interactive LSF job, a compute node is assigned for the user to interact with. For example, the user can open a Jupyter notebook there to edit and execute scripts. If the job runs for a long time, however, the session opened for the user may be closed before the job finishes; the user then loses the session and the job fails. This wastes users' time and can be frustrating.
It is easy to keep a job running in the background even after the session is closed: put 'nohup' before the script name and '&' after it. The syntax is: 'nohup <script name> &'. For example, a user can start Jupyter Notebook on a compute node with the command 'nohup jupyter notebook &' so that it keeps running if the session is closed. Users who submit batch jobs do not need to worry about this problem, because LSF manages all the computing resources for batch jobs.
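The 'nohup ... &' pattern above can be sketched as follows. This is a minimal illustration rather than a DeepSense-specific recipe: 'sleep 30' stands in for the user's own script, and the file names 'job.log' and 'job.pid' are hypothetical. Redirecting the output and saving the process id makes it easier to check on the job after logging back in.

```shell
# Start a long-running command so that it survives the session being closed.
# 'sleep 30' is a stand-in for your own script (for example: jupyter notebook).
# 'job.log' is a hypothetical log file that collects stdout and stderr.
nohup sleep 30 > job.log 2>&1 &

# $! holds the process id of the most recent background job.
job_pid=$!

# Save the PID (hypothetical file name) so the job can be found later.
echo "$job_pid" > job.pid

# 'kill -0' sends no signal; it only checks that the process still exists.
if kill -0 "$job_pid" 2>/dev/null; then
    echo "job $job_pid is running in the background"
fi
```

After reconnecting, 'kill -0 "$(cat job.pid)"' shows whether the job is still alive, and the log file holds whatever it has printed so far.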

GPU memory occupied without any running processes

Sometimes 'nvidia-smi' reports no processes running on a GPU even though its memory is almost fully occupied. In the example output below, GPU 0 shows 15682MiB of 16280MiB in use, yet no running processes are listed:

[username@ds-cmgpu-01 ~]$ nvidia-smi
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-SXM2...  On   | 00000002:01:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |  15682MiB / 16280MiB |      0%   E. Process |
|   1  Tesla P100-SXM2...  On   | 00000006:01:00.0 Off |                    0 |
| N/A   27C    P0    28W / 300W |      0MiB / 16280MiB |      0%   E. Process |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No running processes found                                                 |

There can be various reasons for this. For example, a user may press Ctrl+C while a job is running, which can cause the job to exit while some of its processes are still running. To clean up these processes and release the memory they occupy, the user needs to find out which of their processes are still running on the system and kill them. The command 'ps -ef | grep yourusername' lists the processes owned by the user; these processes may be visible to the OS but not to 'nvidia-smi'. If the user is sure that a process should be terminated, it can be killed with 'kill -9 pid', where 'pid' is the process ID. After all the leftover processes owned by the user are killed, the occupied GPU memory should be released automatically.
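The cleanup steps above can be sketched as a short shell session. This is an illustration under assumptions, not an official procedure: a background 'sleep 300' plays the role of a leftover process, and 'stale_pid' is a hypothetical variable name. Trying a plain 'kill' (SIGTERM) before 'kill -9' is a suggested refinement that gives the process a chance to exit cleanly; the force kill is used only if the process is still alive.

```shell
# List every process owned by the current user, with PID, parent PID,
# elapsed time and full command line.
ps -u "$(id -u)" -o pid,ppid,etime,args

# Stand-in for a leftover job: a background process that we then clean up.
# On a real system, take the PID from the ps output above instead.
sleep 300 &
stale_pid=$!

# Ask the process to exit cleanly first (SIGTERM).
kill "$stale_pid" 2>/dev/null || true
sleep 1

# If it is still alive, force it to stop (SIGKILL, the same as 'kill -9 pid').
if kill -0 "$stale_pid" 2>/dev/null; then
    kill -9 "$stale_pid"
fi

# Reap the background job so it does not linger as a zombie.
wait "$stale_pid" 2>/dev/null || true
```

Running 'nvidia-smi' again afterwards should show whether the memory has been released.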