I need to create a table with patient names (IDs) in the first column and one column per disease code after that, each containing 0 or 1 depending on whether the patient had that disease. I also need to know how to keep the tibble/TSV small: when I fill it with NA instead of 0, the table is ~130 GB. Which is smallest, integer (0 or 1) or boolean/logical (TRUE/FALSE)?
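As a rough check on the size question, here is a minimal sketch that compares in-memory sizes with base R's object.size(). The vector length n is an arbitrary value chosen for illustration. In R, logical and integer vectors both take 4 bytes per element, so neither is smaller in memory; in a text file such as a TSV, what matters is how many characters each cell takes.

```r
n <- 1e6  # arbitrary length, for illustration only

# Logical and integer vectors both use 4 bytes per element in R;
# double vectors use 8 bytes per element
size_integer <- object.size(integer(n))
size_logical <- object.size(logical(n))
size_double  <- object.size(numeric(n))

print(size_integer)
print(size_logical)
print(size_double)

# In a TSV the cost per cell is the number of characters written
chars_per_cell <- nchar(c("0", "1", "NA", "TRUE", "FALSE"))
print(chars_per_cell)
```

On this basis, writing 0/1 should give a smaller TSV than writing NA or TRUE/FALSE, while integer and logical columns should cost the same in memory.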
CURRENT DATA (anonymized stub example). Blank cells need to be ignored: id1 has a blank in f.41270.3, id2 has a blank in f.41270.2 (its 184.11 is in the f.41270.3 column), and id3 is all blanks.
ID   f.41270.1  f.41270.2  f.41270.3
id1  184.11     151.11
id2  987                   184.11
id3
CONVERTED DATA TO CREATE
Either add an X_ prefix or force the columns to be col_character(), because ICD9 codes contain no letters and are read as double, while ICD10 codes are character.
ID   X_184.11  X_151.11  X_987
id1  1         1         0
id2  1         0         1
id3  0         0         0
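For reference, the conversion above could be sketched with data.table's melt() and dcast(). This is only a sketch under assumptions: it uses the stub data from this post, treats both "" and NA as blank, and the helper column name present is made up for illustration. The output columns come out in sorted order rather than the order shown above.

```r
library(data.table)

# Stub data from this post; blanks are empty strings here
patients <- data.table(
  ID = c("id1", "id2", "id3"),
  f.41270.1 = c("184.11", "987", ""),
  f.41270.2 = c("151.11", "", ""),
  f.41270.3 = c("", "184.11", "")
)

# Long format: one row per (ID, code) pair, with blank cells dropped
long <- melt(patients, id.vars = "ID", value.name = "code")
long <- unique(long[!is.na(code) & code != "", .(ID, code)])
long[, code := paste0("X_", code)]  # prefix so column names do not start with a digit
long[, present := 1L]               # hypothetical indicator column

# Wide format: one 0/1 integer column per code
wide <- dcast(long, ID ~ code, value.var = "present", fill = 0L)

# Patients with no codes at all (id3) vanish in the cast; add them back as all zeros
wide <- merge(patients[, .(ID)], wide, by = "ID", all.x = TRUE)
setnafill(wide, fill = 0L, cols = setdiff(names(wide), "ID"))

print(wide)
```

The long intermediate table only holds the codes that actually occur, so it stays small even when the wide table is mostly zeros.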
EDIT:
library(data.table)
patients = data.table(
  ID = c("id1", "id2", "id3"),
  f.41270.1 = c("184.11", "987", ""),
  f.41270.2 = c("151.11", "", ""),
  f.41270.3 = c("", "184.11", "")
)
ICDl = data.table(c("184.11", "151.11", "987", "184.11"))
END EDIT
library(tidyverse)
library(data.table)
library(berryFunctions) # provides addRows()
ICDl <- read_tsv("ICD_long.txt") # one row of all specific ICD disease codes (184.11 151.11 987 184.11), NOT the general field names f.41270.1 f.41270.2 f.41270.3
patients <- read_tsv("patients.txt") # column of patient IDs
patientsRows = nrow(patients) # number of patients
patientsRowsToAdd = patientsRows - 1L # ICDl already has one row
# https://www.rdocumentation.org/packages/berryFunctions/versions/1.20.1/topics/addRows
ICDl = addRows(ICDl, patientsRowsToAdd)
# This fills the blank cells in the table with NA, which makes the full table too big (~130 GB)
# https://www.rdocumentation.org/packages/berryFunctions/versions/1.20.1/topics/addRows
ICDl = addRows(ICDl, patientsRowsToAdd, values = "0")
# The process gets "Killed", so I guess I cannot use values = "0" to try to make the table smaller
# https://readr.tidyverse.org/reference/cols.html
# https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/type.convert
ICDl = type.convert(ICDl)
# This could make the table smaller by converting the col_character() columns to col_integer() or col_logical(), but I cannot test it yet because values = "0" doesn't work
# https://stackoverflow.com/questions/19508256/how-to-add-new-column-to-an-dataframe-to-the-front-not-end
ICDlPD = cbind(patients, ICDl)
# addRows was required to get cbind to work, because cbind only adds a column if the row counts are equal
# This should add the patient IDs as the first column, but it leaves all of the ICD disease codes in the first row
# Then I can figure out how to put a 1 everywhere the patient has the ICD disease code. Or maybe I could skip all of the above if you tell me how to add the patients one row at a time.