I had been using RStudio Server on AWS successfully for several months, and the GPU greatly accelerated training of my deep networks (by almost two orders of magnitude over the same code running on the CPU). A few weeks ago, however, performance slowed considerably. Before, I could train a 6-million-parameter network for 5000 epochs overnight; now the same training would take weeks. I had thought it might be a memory overload issue as I moved to larger datasets, but even accounting for that, the system runs much slower than it used to.
Recently, I restarted both the R session and the Keras backend and tried reloading a saved model. Unfortunately, training was still slogging along at mere CPU speeds, but I noticed that I got the following error message right after calling the very first Keras function in my script:
2019-05-11 00:02:47.673851: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-11 00:02:47.753743: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
2019-05-11 00:02:47.753808: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel driver does not appear to be running on this host (ip-10-217-37-254): /proc/driver/nvidia/version does not exist
It looks like the EC2 instance is no longer able to access the GPU at all, even though it clearly could before. Other than downloading all my data, terminating this instance, spinning up a new instance of RStudio Server, and reuploading, is there any way I can get the GPU working for me again? Any help is appreciated.
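For anyone hitting the same log lines, one quick sanity check is whether the NVIDIA kernel driver is loaded at all, since the last message complains that /proc/driver/nvidia/version does not exist. Below is a minimal sketch, assuming Python 3 is available on the instance. The function name `gpu_driver_status` is made up for illustration, and the `nvidia-smi` lookup only tests whether the tool is on the PATH, not whether it runs successfully.

```python
import os
import shutil


def gpu_driver_status(version_path="/proc/driver/nvidia/version"):
    """Report whether the NVIDIA kernel driver appears to be loaded.

    gpu_driver_status is a hypothetical helper, not part of any library.
    version_path defaults to the file named in the TensorFlow log message.
    """
    return {
        # The kernel driver exposes this file only while it is loaded.
        "driver_loaded": os.path.exists(version_path),
        # True if an nvidia-smi executable can be found on the PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


status = gpu_driver_status()
for name, value in status.items():
    print(name, "=", value)
```

If `driver_loaded` comes back False, the CUDA libraries have no driver to talk to, which would be consistent with the `cuInit` failure and the CPU-speed training described above.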
The first two ran fine, but I have a sudo problem with the last one:
sudo: unable to resolve host ip-10-217-37-254
[sudo] password for rstudio-user:
rstudio-user is not in the sudoers file. This incident will be reported.
I had to run "install_keras()" in RStudio to get Keras to work again. Unfortunately, it still runs very slowly.
Now, when I first define a Keras layer, I get the following:
WARNING:tensorflow:From /home/rstudio-user/.virtualenvs/r-tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-13 20:57:14.132375: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-13 20:57:14.136902: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300065000 Hz
2019-05-13 20:57:14.137122: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x1f477a80 executing computations on platform Host. Devices:
2019-05-13 20:57:14.137150: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
I'm not sure whether it's even trying to use the GPU now.
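One way to answer that question directly is to ask TensorFlow which devices it can see. The sketch below is an illustration under assumptions, not something taken from the RStudio or Keras documentation: it assumes it is run with the Python interpreter from the same r-tensorflow virtualenv, it uses `device_lib.list_local_devices()` from TensorFlow's Python client module, and it wraps that call in a hypothetical helper, `visible_device_types`, which returns None when TensorFlow cannot be imported.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7


def visible_device_types():
    """Return the sorted device types TensorFlow reports, or None if
    TensorFlow cannot be imported from this interpreter.

    visible_device_types is a hypothetical helper name.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Each entry has a device_type such as "CPU" or "GPU".
    return sorted({d.device_type for d in device_lib.list_local_devices()})


types = visible_device_types()
if types is None:
    print("TensorFlow is not importable from this interpreter")
elif "GPU" in types:
    print("TensorFlow sees a GPU:", types)
else:
    print("TensorFlow sees no GPU, only:", types)
```

In the log above, the XLA service reports only "platform Host", which appears to point to CPU-only execution; a device list without a "GPU" entry would confirm that.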
So I realized that I could run the tutorials from the cmd.exe tunnel (my company has a lot of firewalls, so I have to run a tunnel on cmd.exe to access the 8787 port through my browser). I ran the following commands and rebooted the EC2 instance: