I am trying to sort through data that I pulled directly off the site. It is in PDF format, and when I run my code the result comes out extremely sloppy. I would like to automate the process so that whenever the site updates the data, it is read into R and automatically converted into a data frame that I can use.
Hi Ryan, welcome!
To help us help you, could you please turn this into a proper reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:
This is probably so fragile that it will not work with a different file. I hope it gives you a start on solving the general case. Note that all of the columns in the final data frame are factors! Most will probably need to be converted to numeric.
library(pdftools)
#> Warning: package 'pdftools' was built under R version 3.5.3
library(stringr)
suppressPackageStartupMessages(library(dplyr))
download.file("http://www.mslc.com/Indiana/Resources/documents/ltcisrpt6.pdf",
"ltcisrpt6.pdf", mode = "wb") #returns only a status code, so there is nothing useful to assign
RevenuePatientDay <- pdf_text("ltcisrpt6.pdf")
RawPage <- str_split(RevenuePatientDay, "\\n") #break into lines
Hdr <- RawPage[[1]][9] #Define col names from the 9th line
Hdr <- str_replace(Hdr, "^\\s+", "") #remove leading whitespace
Hdr <- str_replace(Hdr, "\\s+$", "") #remove trailing space
Hdr <- str_replace(Hdr, "For\\s+Profit", "For_Profit") #remove space within col name
Hdr <- str_split(Hdr, "\\s+")
Data <- RawPage[[1]][10:length(RawPage[[1]])] #get all rows after header
Data <- str_replace_all(Data, ",", "") #remove , from numbers
Boundary <- which(grepl("Revenues Per Patient Day", Data)) #Find text-only line
Data <- Data[-Boundary] #remove text only line
Data <- str_replace_all(Data, "(\\w)\\s(\\w)", "\\1_\\2") #replace space with _
Data <- str_replace(Data, "\\s+$", "") #remove trailing space
Data <- Data[-length(Data)] #remove empty line at end
ForDF <- str_split(Data, "\\s+")
Mat <- matrix(unlist(ForDF), byrow = TRUE, ncol = length(Hdr[[1]])) #one column per header name
dfFinal <- as.data.frame(Mat)
colnames(dfFinal) <- Hdr[[1]]
dfFinal
#> Number Description State For_Profit
#> 1 142 Beds_Available 98 99
#> 2 143 Total_Bed_Days_Available 35797 36222
#> 3 144 Medicaid_Patient_Days 16897 13850
#> 4 148 Total_Patient_Days 26661 23816
#> 5 151 Occupancy_Percentage 74.48% 65.75%
#> 6 152 Medicaid_Utilization 63.38% 58.16%
#> 7 153 Total_Hours_Worked 161533 147805
#> 8 158 Hours_Worked_PPD 6.06 6.21
#> 9 160 Total_Number_of_Providers 525 27
#> 10 211 Routine_Daily_Service 278.25 281.37
#> 11 231 Physical_Therapy 25.34 35.64
#> 12 232 Speech_and_Audiology_Therapy 7.76 9.76
#> 13 233 Occupational_Therapy 24.53 34.42
#> 14 234 Respiratory_Therapy 2.64 0.05
#> 15 235 Sale_of_Routine_Medical_Supplies 0.69 0.85
#> 16 236 Sale_of_Non-Routine_Medical_Supplies 4.19 1.52
#> 17 237 X-Ray_and_Laboratory 1.34 2.94
#> 18 238 Pharmacy_and_Drugs 13.91 11.40
#> 19 239 Parenteral_and_Enteral_Nutrition 0.14 0.00
#> 20 241 Florist 0.00 0.00
#> 21 242 Barber/Beauty_Shop 0.29 0.13
#> 22 243 Vending_Machines 0.02 0.02
#> 23 244 Personal_Purchases 0.01 0.04
#> 24 245 Meals_Sold_to_Guests_and_Employees 0.18 0.09
#> 25 246 Activity_Sales 0.00 0.00
#> 26 247 Investment_Income 0.63 0.11
#> 27 248 Other_Revenue 3.29 3.33
#> 28 261 Gross_Revenues 363.22 381.65
#> 29 262 Less_Bad_Debts -2.24 -6.00
#> 30 263 Less_Contractual_Charity_Allowances -78.78 -89.71
#> 31 267 Less_Other_Reductions -0.29 -1.03
#> 32 268 Net_Revenues 281.91 284.91
#> Non-Profit Government
#> 1 59 99
#> 2 21598 36124
#> 3 6510 17322
#> 4 18878 27012
#> 5 87.41% 74.77%
#> 6 34.49% 64.13%
#> 7 167659 162145
#> 8 8.88 6.00
#> 9 12 486
#> 10 277.87 278.10
#> 11 37.65 24.62
#> 12 5.61 7.69
#> 13 32.63 23.90
#> 14 1.29 2.79
#> 15 0.57 0.69
#> 16 2.87 4.34
#> 17 2.68 1.24
#> 18 21.23 13.91
#> 19 0.00 0.15
#> 20 0.00 0.00
#> 21 1.05 0.29
#> 22 0.26 0.01
#> 23 0.00 0.01
#> 24 1.71 0.16
#> 25 0.54 0.00
#> 26 15.25 0.40
#> 27 7.67 3.21
#> 28 408.89 361.53
#> 29 -1.96 -2.07
#> 30 -47.57 -78.78
#> 31 -0.01 -0.26
#> 32 359.36 280.43
Created on 2019-05-24 by the reprex package (v0.2.1)
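On the point about the columns being factors: here is a minimal sketch of one way to convert them, not a definitive fix. `dfDemo` and `to_number` are hypothetical names, and the toy data frame only copies a few values from the output above so the example runs on its own. In R 4.0 and later `as.data.frame()` gives character columns instead of factors; the helper handles both.

```r
# Toy stand-in for a few rows of dfFinal (hypothetical object, values copied from above)
dfDemo <- data.frame(
  Number      = c("142", "151", "262"),
  Description = c("Beds_Available", "Occupancy_Percentage", "Less_Bad_Debts"),
  State       = c("98", "74.48%", "-2.24"),
  For_Profit  = c("99", "65.75%", "-6.00"),
  stringsAsFactors = TRUE
)

# Convert a factor or character vector to numeric, dropping a trailing "%".
# as.character() comes first because as.numeric() on a factor returns level codes.
to_number <- function(x) {
  as.numeric(sub("%$", "", as.character(x)))
}

# Every column except the text description holds numbers
value_cols <- setdiff(names(dfDemo), "Description")
dfDemo[value_cols] <- lapply(dfDemo[value_cols], to_number)
dfDemo$Description <- as.character(dfDemo$Description)

str(dfDemo)
```

Percentages stay in percent units (74.48, not 0.7448), so divide by 100 if you need proportions.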
library(tidyverse)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(cronR)
library(miniUI)
library(shiny)
library(shinyFiles)
library(pdftools)
library(tm)
#> Loading required package: NLP
#>
#> Attaching package: 'NLP'
#> The following object is masked from 'package:ggplot2':
#>
#> annotate
library(xlsx)
#> Warning in system("/usr/libexec/java_home", intern = TRUE): running command
#> '/usr/libexec/java_home' had status 1
#> Error: package or namespace load failed for 'xlsx':
#> .onLoad failed in loadNamespace() for 'rJava', details:
#> call: dyn.load(file, DLLpath = DLLpath, ...)
#> error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so':
#> dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: /Library/Java/JavaVirtualMachines/jdk-11.0.1.jdk/Contents/Home/lib/server/libjvm.dylib
#> Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so
#> Reason: image not found
library(readtext)
library(stringr)
library(plyr)
#> -------------------------------------------------------------------------
#> You have loaded plyr after dplyr - this is likely to cause problems.
#> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
#> library(plyr); library(dplyr)
#> -------------------------------------------------------------------------
#>
#> Attaching package: 'plyr'
#> The following objects are masked from 'package:dplyr':
#>
#> arrange, count, desc, failwith, id, mutate, rename, summarise,
#> summarize
#> The following object is masked from 'package:purrr':
#>
#> compact
library(datapasta)
datapasta::df_paste(download.file("http://www.mslc.com/Indiana/Resources/documents/ltcisrpt6.pdf",
"ltcisrpt6.pdf", mode = "wb"))
#> Could not format input_table as table. Unexpected class.
datapasta::df_paste(RevenuePatientDay <- pdf_text("ltcisrpt6.pdf"))
#> Could not format input_table as table. Unexpected class.
RevenuePatientDay
#> [1] "Sort By: Organization Type Myers and Stauffer LC 10/01/18\n Quarter:\n Indiana Medicaid\n Date: 12/05/18\n Long Term Care Information System\n Page: 8\n Statistical Data Per Facility\n Line Proprietary Voluntary\n Number Description State For Profit Non-Profit Government\n142 Beds Available 98 99 59 99\n143 Total Bed Days Available 35,797 36,222 21,598 36,124\n144 Medicaid Patient Days 16,897 13,850 6,510 17,322\n148 Total Patient Days 26,661 23,816 18,878 27,012\n151 Occupancy Percentage 74.48% 65.75% 87.41% 74.77%\n152 Medicaid Utilization 63.38% 58.16% 34.49% 64.13%\n153 Total Hours Worked 161,533 147,805 167,659 162,145\n158 Hours Worked PPD 6.06 6.21 8.88 6.00\n160 Total Number of Providers 525 27 12 486\n Revenues Per Patient Day\n211 Routine Daily Service 278.25 281.37 277.87 278.10\n231 Physical Therapy 25.34 35.64 37.65 24.62\n232 Speech and Audiology Therapy 7.76 9.76 5.61 7.69\n233 Occupational Therapy 24.53 34.42 32.63 23.90\n234 Respiratory Therapy 2.64 0.05 1.29 2.79\n235 Sale of Routine Medical Supplies 0.69 0.85 0.57 0.69\n236 Sale of Non-Routine Medical Supplies 4.19 1.52 2.87 4.34\n237 X-Ray and Laboratory 1.34 2.94 2.68 1.24\n238 Pharmacy and Drugs 13.91 11.40 21.23 13.91\n239 Parenteral and Enteral Nutrition 0.14 0.00 0.00 0.15\n241 Florist 0.00 0.00 0.00 0.00\n242 Barber/Beauty Shop 0.29 0.13 1.05 0.29\n243 Vending Machines 0.02 0.02 0.26 0.01\n244 Personal Purchases 0.01 0.04 0.00 0.01\n245 Meals Sold to Guests and Employees 0.18 0.09 1.71 0.16\n246 Activity Sales 0.00 0.00 0.54 0.00\n247 Investment Income 0.63 0.11 15.25 0.40\n248 Other Revenue 3.29 3.33 7.67 3.21\n261 Gross Revenues 363.22 381.65 408.89 361.53\n262 Less Bad Debts -2.24 -6.00 -1.96 -2.07\n263 Less Contractual Charity Allowances -78.78 -89.71 -47.57 -78.78\n267 Less Other Reductions -0.29 -1.03 -0.01 -0.26\n268 Net Revenues 281.91 284.91 359.36 280.43\n"
Created on 2019-05-24 by the reprex package (v0.2.1)
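Given raw text shaped like the `pdf_text()` output above, one possible way to avoid hard-coding the header position is to keep only the lines that look like data rows. This is a sketch under assumptions: every data row starts with a line number and ends with exactly four values, `raw_lines` is a shortened hand-copied sample rather than the real file, and the column names in `value_names` are my own choice. It uses base R only.

```r
# A few lines in the shape returned by pdf_text() above (whitespace shortened)
raw_lines <- c(
  " Number Description State For Profit Non-Profit Government",
  "142 Beds Available 98 99 59 99",
  "143 Total Bed Days Available 35,797 36,222 21,598 36,124",
  "151 Occupancy Percentage 74.48% 65.75% 87.41% 74.77%",
  " Revenues Per Patient Day",
  "262 Less Bad Debts -2.24 -6.00 -1.96 -2.07",
  ""
)

# One data row = line number, free-text description, then four values
value   <- "(-?[0-9][0-9,.]*%?)"
pattern <- paste0("^\\s*([0-9]+)\\s+(.+?)\\s+",
                  paste(rep(value, 4), collapse = "\\s+"), "\\s*$")

# Header, section-title, and empty lines do not match and come back as character(0)
hits <- regmatches(raw_lines, regexec(pattern, raw_lines, perl = TRUE))
rows <- do.call(rbind, hits[lengths(hits) > 0])

# Column 1 of rows is the full match; columns 2 to 7 are the captured groups
parsed <- data.frame(
  Number      = as.integer(rows[, 2]),
  Description = rows[, 3],
  stringsAsFactors = FALSE
)
value_names <- c("State", "For_Profit", "Non_Profit", "Government")
# Strip thousands separators and percent signs before converting
parsed[value_names] <- lapply(4:7, function(j) as.numeric(gsub("[,%]", "", rows[, j])))

parsed
```

With the real file you would build `raw_lines` with something like `strsplit(pdf_text("ltcisrpt6.pdf"), "\n")[[1]]`. A report with a different number of value columns would need the pattern adjusted.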
Looks like you got an error loading a package:
library(xlsx)
#> Warning in system("/usr/libexec/java_home", intern = TRUE): running command
#> '/usr/libexec/java_home' had status 1
#> Error: package or namespace load failed for 'xlsx':
#> .onLoad failed in loadNamespace() for 'rJava', details:
#> call: dyn.load(file, DLLpath = DLLpath, ...)
#> error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so':
#> dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: /Library/Java/JavaVirtualMachines/jdk-11.0.1.jdk/Contents/Home/lib/server/libjvm.dylib
#> Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so
#> Reason: image not found
There is a discussion here on a way to solve this (note the replies with instructions to update the JDK):
Or the discussions here: