Hello. I am doing my bachelor thesis and I need help because I have never done anything in RStudio. I have an Excel file with a column called INITIAL SAMPLE DESCRIPTION that describes ancestry. This column consists of thousands of rows, which look like this:
[1] "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls"
[2] "286 Japanese ancestry cases, 557 Japanese ancestry controls"
[3] "2,031 Han Chinese ancestry cases, 2,044 Han Chinese ancestry controls"
I need a function in R that will split each row on commas and then keep only the words that start with a capital letter. My goal is to have, for each row, a list of the words that describe ancestry:
[1] "European" "Papua New Guinean"
[2] "Japanese"
[3] "Han Chinese"
Then calculate the frequency of each of these ancestry terms.
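A direct sketch of those steps, assuming the column has already been read into a character vector. The regex that grabs the run of capitalized words before "ancestry" is a guess at the pattern, not something verified against the full dataset:

```r
# representative data standing in for the spreadsheet column
descs <- c(
  "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls",
  "286 Japanese ancestry cases, 557 Japanese ancestry controls",
  "2,031 Han Chinese ancestry cases, 2,044 Han Chinese ancestry controls"
)
# a comma followed directly by a digit is a thousands separator,
# so split only on the remaining (group-separating) commas
parts <- strsplit(descs, ",(?!\\d)", perl = TRUE)
# keep the run of capitalized words immediately before "ancestry"
terms <- lapply(parts, function(p) {
  m <- regmatches(p, regexpr("([A-Z][a-z]+ )+(?=ancestry)", p, perl = TRUE))
  unique(trimws(m))
})
terms            # terms[[1]] is c("European", "Papua New Guinean")
table(unlist(terms))
```

`table()` at the end gives the frequency of each ancestry term across rows; drop the `unique()` call if repeated mentions within a single row should be counted as well.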
Ouch, this could be nasty. The problem itself looks easy, but the layout of the data can be a problem.
Am I reading this correctly? You have one column, INITIAL SAMPLE DESCRIPTION, with
European ancestry cases
European ancestry controls
Papua New Guinean ancestry cases
Papua New Guinean ancestry controls
Japanese ancestry cases
Japanese ancestry controls
Han Chinese ancestry cases
Han Chinese ancestry controls
We can ignore the numbers for the moment. If that reading is right, someone with a bit more skill in regex should be able to help.
Do you need the numbers as a variable as well as the categories? Not a problem, but we need to know.
Provided that the data consistently alternate between ancestry cases and ancestry controls as described, each row can be converted into a data frame, and ultimately those data frames can be combined.
# representative data
(a_row <- "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls")
#> [1] "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls"
# remove the thousands separators (the embedded commas in the numbers) and trim whitespace
(b_row <- trimws(gsub("(?<=\\d),(?=\\d)", "", a_row, perl = TRUE)))
#> [1] "1281 European ancestry cases, 6015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls"
# remove the space before each count so the comma split below yields clean pieces
(c_row <- gsub(" (?=\\d)", "", b_row, perl = TRUE))
#> [1] "1281 European ancestry cases,6015 European ancestry controls,290 Papua New Guinean ancestry cases,286 Papua New Guinean ancestry controls"
# split on commas
(d_row <- strsplit(c_row,","))
#> [[1]]
#> [1] "1281 European ancestry cases"
#> [2] "6015 European ancestry controls"
#> [3] "290 Papua New Guinean ancestry cases"
#> [4] "286 Papua New Guinean ancestry controls"
# Only if confident that cases and controls consistently alternate:
# extract every other element (the odd positions) to make a vector
(e_row <- d_row[[1]][c(TRUE, FALSE)])
#> [1] "1281 European ancestry cases"
#> [2] "290 Papua New Guinean ancestry cases"
# remove the string "ancestry cases"
(f_row <- lapply(e_row, function(x) gsub(" ancestry cases","",x)))
#> [[1]]
#> [1] "1281 European"
#>
#> [[2]]
#> [1] "290 Papua New Guinean"
# replace first space with a comma (∴ sub, not gsub)
(g_row <- lapply(f_row, function(x) sub(" ",",",x)))
#> [[1]]
#> [1] "1281,European"
#>
#> [[2]]
#> [1] "290,Papua New Guinean"
# split each piece on the comma between count and ancestry term
(h_row <- lapply(g_row, function(x) strsplit(x, ",")) |> unlist(recursive = FALSE))
#> [[1]]
#> [1] "1281" "European"
#>
#> [[2]]
#> [1] "290" "Papua New Guinean"
# convert to data frame
(d <- do.call(rbind, unlist(h_row, recursive = FALSE)) |> as.data.frame())
#> V1 V2
#> 1 1281 European
#> 2 290 Papua New Guinean
# Set the column names
(colnames(d) <- c("count", "ancestry"))
#> [1] "count" "ancestry"
d
#> count ancestry
#> 1 1281 European
#> 2 290 Papua New Guinean
Created on 2023-12-26 with [reprex v2.0.2]
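To take this per-row walk-through over the whole column and get the frequencies asked about, the steps can be condensed into a function and applied with `lapply()`. Reading the sheet (e.g. with `readxl::read_excel()`) and the exact column name are assumptions; a small stand-in vector is used here:

```r
# Hypothetical wrapper around the per-row steps above
parse_row <- function(a_row) {
  # drop the thousands separators so the remaining commas all delimit groups
  b_row <- gsub("(?<=\\d),(?=\\d)", "", a_row, perl = TRUE)
  parts <- trimws(strsplit(b_row, ",")[[1]])
  # strip the leading count and the trailing "ancestry cases/controls"
  unique(gsub("^\\d+ | ancestry (cases|controls)$", "", parts))
}

# in practice something like:
# descs <- readxl::read_excel("file.xlsx")$`INITIAL SAMPLE DESCRIPTION`
descs <- c(
  "286 Japanese ancestry cases, 557 Japanese ancestry controls",
  "2,031 Han Chinese ancestry cases, 2,044 Han Chinese ancestry controls"
)
ancestries <- lapply(descs, parse_row)
table(unlist(ancestries))  # frequency of each ancestry term across rows
```

Because `parse_row()` deduplicates within a row, the table counts how many rows mention each ancestry rather than how many times the term occurs.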
John is right, but if the data are not representative it will get much messier. Data that have been manipulated in a spreadsheet are presumptively contaminated. If that is the case here, the assumption that these rows are representative of the whole column is likely to be disappointed.