Difference between revisions of "Known problems"

From DeepSense Docs
(Update with information about IBM WMLA conda channel)

Revision as of 19:28, 17 June 2020

Where did TensorFlow go?

On June 26, 2020, we will update the GPU compute nodes to a new version of IBM Watson Machine Learning Accelerator. This will change the way you access deep learning packages like TensorFlow and PyTorch. Instead of "activating" these packages, you will be able to install new versions directly in your Anaconda environment.

See Installing local software and Getting started with Deep Learning for more information.

We are actively updating the wiki documentation to explain the new method of accessing deep learning packages. Please bear with us during these updates, as some documentation may still refer to the old method of "activating" deep learning packages.

The bhist command is not working

We are aware of a problem that prevents the bhist command from finding the job log file it needs to print information about completed jobs. Instead, the output is always "no matching job found", even if the user specifies the "-a" option to print information about all running, completed, and failed jobs.

While we work on solving this issue, you can manually specify the location of the log file using the -f option. For example, the following command will print all running, completed, and failed jobs for the current user:

bhist -f /lsfshare/lsfswg/lsf/work/DeepSenseLSFCluster/logdir/lsb.events -a
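To avoid retyping the log path, the workaround above could be wrapped in a small shell function. This is only a sketch: bhist_fixed and LSB_EVENTS are hypothetical names chosen for illustration, and the path is copied from the command above.

```shell
# Path to the LSF event log, copied from the workaround command above.
LSB_EVENTS=/lsfshare/lsfswg/lsf/work/DeepSenseLSFCluster/logdir/lsb.events

# bhist_fixed is a hypothetical helper name, not part of LSF.
# It passes -f with the log path, then forwards every argument to bhist.
bhist_fixed() {
    bhist -f "$LSB_EVENTS" "$@"
}
```

With this defined in your shell (for example in your .bashrc), "bhist_fixed -a" should behave like the full command shown above.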


Where is my job output file?

Output may not be written to the specified file immediately when using the -o <filename> or -oo <filename> options. There are two workarounds for this problem:

  1. You can use the bpeek <jobid> command to view the output of a currently running job.
  2. You can redirect the output of your program to a file with the usual Unix redirection syntax, such as > <filename>, or specify an output file in programs that support such an option.
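The second workaround can be sketched as a helper that makes the job itself perform the redirection. This assumes bsub accepts the job command as a single quoted string; submit_redirected is a hypothetical name, not an LSF command.

```shell
# submit_redirected is a hypothetical helper name, not an LSF command.
# Usage: submit_redirected <output-file> <command> [args...]
# The command line is handed to bsub as one quoted string so that the
# redirection is carried out by the job, not by the login shell.
# Arguments that contain spaces are not re-quoted in this simple sketch.
submit_redirected() {
    outfile=$1
    shift
    bsub "$* > $outfile 2>&1"
}
```

For example, "submit_redirected train.log python train.py" would submit a job whose standard output and standard error both go to train.log (the file and script names here are placeholders).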

Jupyter notebooks or other programs fail trying to access a /run directory

The default login shell is Bash. Make sure the .bashrc file in your home directory contains the line "unset XDG_RUNTIME_DIR", which prevents a problem where some types of jobs fail when run through the LSF queue. The following command appends that line to your .bashrc:

echo 'unset XDG_RUNTIME_DIR' >> ~/.bashrc

This line is already in the default .bashrc file for new users, but older user accounts may need to add it manually.
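Running the echo command more than once adds the line to .bashrc each time. A minimal sketch of a variant that appends the line only when it is missing is shown below; ensure_unset_xdg is a hypothetical helper name.

```shell
# ensure_unset_xdg is a hypothetical helper name.
# Usage: ensure_unset_xdg [file]   (defaults to ~/.bashrc)
ensure_unset_xdg() {
    rcfile=${1:-$HOME/.bashrc}
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF 'unset XDG_RUNTIME_DIR' "$rcfile" 2>/dev/null; then
        echo 'unset XDG_RUNTIME_DIR' >> "$rcfile"
    fi
}
```

Calling ensure_unset_xdg with no arguments checks ~/.bashrc and leaves it unchanged if the line is already present.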

Browser fails to connect to Jupyter Notebooks

On our MacBook Pros, Jupyter notebooks work in Chrome but not in Safari. Unfortunately, no error is given; Safari just fails to connect. Please let us know if you have issues with any other browsers, and we can add that information here.

Cannot Install PyTorch dependencies

UnsatisfiableError: The following specifications were found to be in conflict:
  - powerai-pytorch-prereqs=0.4.1_12295.5cb3523

You may see this error when attempting to install the PyTorch dependencies in a local Anaconda environment. This error indicates that some of your installed Python packages are not compatible with the PyTorch prerequisites. In particular, we see this error when conda has been updated to version 4.6 (which may sometimes happen when installing the TensorFlow dependencies first).

To resolve this problem, create a new environment with a 4.5.x conda version and then install the PyTorch dependencies in that environment.
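Before installing, it may help to confirm which conda version is active. The sketch below assumes "conda --version" prints a line of the form "conda 4.5.12"; conda_is_4_5 is a hypothetical helper name.

```shell
# conda_is_4_5 is a hypothetical helper name.
# It succeeds (exit status 0) only when conda reports a 4.5.x version.
conda_is_4_5() {
    # Assumed output format: "conda 4.5.12"; the second field is the version.
    version=$(conda --version 2>&1 | awk '{print $2}')
    case $version in
        4.5.*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, "conda_is_4_5 || echo 'conda is not 4.5.x'" prints a warning when the active conda is a different version.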

Cannot use Caffe on login node or compute nodes without GPUs

Cuda number of devices: -579579216
Current device id: -579579216
Current device name: 
[==========] Running 2207 tests from 293 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from AccuracyLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] AccuracyLayerTest/0.TestSetup
E0206 15:59:26.604874  7990 common.cpp:121] Cannot create Cublas handle. Cublas won't be available.
E0206 15:59:26.611477  7990 common.cpp:128] Cannot create Curand generator. Curand won't be available.
F0206 15:59:26.611616  7990 syncedmem.cpp:500] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***

You may see this error when attempting to use Caffe on a node without GPUs or a GPU node without specifically requesting a GPU.

To resolve this problem, use a GPU node and request a GPU. Caffe cannot run without an available GPU.

Cannot see GPUs in an LSF job

$ nvidia-smi 
No devices were found

GPUs must be requested with the -gpu option to bsub (for example, "-gpu -" requests the default GPU settings). See LSF#GPU_Computation for more information.
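A job script can check for a visible GPU before starting real work, so that a missing -gpu request fails early with a clear message. This is a sketch only: gpu_visible and the NVIDIA_SMI override variable are hypothetical names, and it relies on "nvidia-smi -L" printing one line per GPU beginning with "GPU ".

```shell
# gpu_visible is a hypothetical helper name.
# It succeeds only if nvidia-smi runs and lists at least one GPU.
# NVIDIA_SMI is a hypothetical override for the command name.
gpu_visible() {
    # nvidia-smi -L prints one line per GPU, e.g. "GPU 0: <name> (UUID: ...)"
    "${NVIDIA_SMI:-nvidia-smi}" -L 2>/dev/null | grep -q '^GPU '
}
```

At the top of a job script, a line such as "gpu_visible || { echo 'No GPU visible; was the job submitted with bsub -gpu?' >&2; exit 1; }" would stop the job before a framework such as Caffe fails with a less obvious error.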

Nested anaconda environments may cause strange behaviour

Some users have experienced strange behaviour when activating an Anaconda environment within another environment. This may include permission errors, loading incorrect versions of software, or unexpected conflicts when attempting to install packages. If you encounter problems with a nested Anaconda environment, first try deactivating all Anaconda environments and then activating just the desired environment.
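Deactivating every nested environment can be sketched as a loop. This assumes conda 4.4 or later, which records the nesting depth in the CONDA_SHLVL variable and provides "conda deactivate"; deactivate_all_conda is a hypothetical helper name.

```shell
# deactivate_all_conda is a hypothetical helper name.
# Assumes conda 4.4 or later: CONDA_SHLVL holds the number of active
# (nested) environments and "conda deactivate" leaves the innermost one.
deactivate_all_conda() {
    while [ "${CONDA_SHLVL:-0}" -gt 0 ]; do
        before=$CONDA_SHLVL
        conda deactivate || break
        # Stop if the depth did not decrease, to avoid looping forever.
        [ "${CONDA_SHLVL:-0}" -lt "$before" ] || break
    done
}
```

The function has to be defined and called in your interactive shell (not in a subshell) so that the environment changes persist. Afterwards, activate only the environment you need.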