cloudml: job keeps running even after successful deployment and execution

Hello,

I am trying to deploy a test job from RStudio Desktop to GCP AI Platform. After applying the suggested amendment ("Attempt to fix write() argument must be str") to .\library\cloudml\cloudml\cloudml\deploy.py, which changes the write call to use line.decode('utf-8'), the job deploys successfully. However, the job keeps running and consuming resources even after it has completed successfully. I can see the output at gs://bucket/r-cloudml/runs/auto-generated-job-id/iris.rds, and the file gs://bucket/r-cloudml/runs/auto-generated-job-id/tfruns.d/completed has its value set to TRUE. Is anyone else encountering the same issue? Any help is appreciated!

One more thing: it doesn't use the jobId provided in the job.yml file, but auto-generates one instead (cloudml_datetimestamp).

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
[4] LC_NUMERIC=C                   LC_TIME=English_India.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3    tinytex_0.20   xfun_0.12

cloudml::gcloud_version()
$`Google Cloud SDK`
[1] ‘301.0.0’

$beta
[1] ‘2020.7.10’

$bq
[1] ‘2.0.58’

$core
[1] ‘2020.7.10’

$gsutil
[1] ‘4.51’

# test_file.R
saveRDS(lm(iris), "iris.rds")
print("End of Code")

# cloudml.yml
trainingInput:
  runtimeVersion: '2.1'
  pythonVersion: '3.7'
  scaleTier: CUSTOM
  masterType: 'n1-standard-4'

# job.yml
jobId: local-r-heramb
storage: gs://bucket-name/r-cloudml
custom_commands: ~

# execution.R
library(cloudml)
setwd("./r-keras-tensorflow/") # dir where I keep my test_file.R and yml configs.
cloudml_train(file = "test_file.R")

Thanks in advance!
Heramb

I found a possible bug and a workaround (I am unsure whether the workaround is best practice). Reporting it here for the community and developers.

The problem is with the below chunk in path-to-library/cloudml/cloudml/cloudml/deploy.py

# Stream output from subprocess to console.
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line.decode('utf-8'))

Once the execution is completed, this loop does not halt and hence runs indefinitely.

Resolution: comment out the above chunk in deploy.py, and the execution will finish successfully.
Downside: you won't be able to see the step-by-step installation progress, and hence the logs won't give a hint if there is an error in the script. The chunk below will still check for successful execution. If there is an error in the script, it will keep running endlessly.

# Finalize the process.
stdout, stderr = process.communicate()

# Detect a non-zero exit code.
if process.returncode != 0:
  fmt = "Command %s failed: exit code %s"
  print(fmt % (commands, process.returncode))
else:
  print("Command %s ran successfully." % (commands, ))
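
A possible alternative to commenting out the streaming loop, offered as a sketch rather than a confirmed fix for cloudml: on Python 3, a pipe opened without text mode returns bytes, so readline() returns b"" at end of stream. The sentinel "" in the loop above never compares equal to b"", which would explain why the loop never stops. Changing the sentinel to b"" keeps the live output. The function name run_and_stream and the Popen arguments below are assumptions for illustration; they are not taken from deploy.py.

```python
import subprocess
import sys


def run_and_stream(commands):
    """Run a command, echo its output line by line, and return its exit code.

    Hypothetical helper: the name and the Popen arguments are assumptions,
    not the actual code in cloudml's deploy.py.
    """
    process = subprocess.Popen(
        commands,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: merge stderr into stdout
    )

    # The pipe yields bytes, so end of stream is b"", not "".
    # With "" as the sentinel, iter() would never see a match and would spin forever.
    for line in iter(process.stdout.readline, b""):
        sys.stdout.write(line.decode("utf-8", errors="replace"))

    process.stdout.close()
    # Wait for the process to exit and hand back its return code.
    return process.wait()
```

Calling run_and_stream([sys.executable, "-c", "print('End of Code')"]) should print the line and return 0; a non-zero return value can then be reported the same way as in the exit-code check above.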
