Leaving words that start with capital letters in an excel file with lots of rows

Hello. I am doing my bachelor thesis and I need help because I have never done anything in R studio. I have an excel file with a column called INITIAL SAMPLE DESCRIPTION that describes ancestry. This column consists of thousands of rows. The rows look like this:
[1] "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls"
[2] "286 Japanese ancestry cases, 557 Japanese ancestry controls"
[3] "2,031 Han Chinese ancestry cases, 2,044 Han Chinese ancestry controls"

I need a function in R studio that will separate the rows by commas. Then will leave only the words that start with a capital letter in each row. My goal is to have a list of words that describe ancestry for each of the row:
[1] "European" "Papua New Guinean"
[2] Japanese
[3] Han Chinese
Then calculate the frequency of each of these ancestry terms.

Frequency per row or per table?

Ouch, this could be nasty. The problem, itself, looks easy but the layout of the data can be a problem.

Am I reading this correctly. You have one column "SAMPLE DESCRIPTION" with

 European ancestry controls
Papua New Guinean ancestry cases
Papua New Guinean ancestry controls
Japanese ancestry cases
Japanese ancestry controls
Han Chinese ancestry cases
Han Chinese ancestry controls

We can ignore the numbers for the moment. If so, someone with a bit more skill in regex should be able to help.

Do you need the numbers as a variable as well as the categories? Not a problem but we need to know.

1 Like

I took your description literally, constructing a list of vectors showing the regions in each row.

DF <- data.frame(Descr = c("1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls",
                 "286 Japanese ancestry cases, 557 Japanese ancestry controls",
                 "2,031 Han Chinese ancestry cases, 2,044 Han Chinese ancestry controls"))
library(stringr)
library(dplyr)

RegionFunc <- function(Txt) {
  tmp <- str_remove_all(Txt, "\\d+,\\d+|\\d+")
  tmp <- str_split_1(tmp, ",")
  tmp <- str_extract_all(tmp, "[A-Z].+?(?= [a-z])")
  tmp <- unlist(tmp)
  unique(tmp)
}
lapply(DF$Descr, RegionFunc)
#> [[1]]
#> [1] "European"          "Papua New Guinean"
#> 
#> [[2]]
#> [1] "Japanese"
#> 
#> [[3]]
#> [1] "Han Chinese"

Created on 2023-12-25 with reprex v2.0.2

What these three rows have in common:

  1. Commas embedded in numbers
  2. The occurrence of both ancestry cases and ancestry controls in alternating order

Provided, that this accurately describes the data, each row can be converted into a data frame and ultimately those data frames can be combined.

# representative data
(a_row <- "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls")
#> [1] "1,281 European ancestry cases, 6,015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls"
# kill the radix, the embedded comma in the numbers and trim white space
(b_row <- gsub("(?<=\\d),(?=\\d)", "", a_row, perl = TRUE)) |> trimws()
#> [1] "1281 European ancestry cases, 6015 European ancestry controls, 290 Papua New Guinean ancestry cases, 286 Papua New Guinean ancestry controls"
# trim spacing
(c_row <- gsub(" (?=\\d)", "", b_row, perl = TRUE))
#> [1] "1281 European ancestry cases,6015 European ancestry controls,290 Papua New Guinean ancestry cases,286 Papua New Guinean ancestry controls"
# split on commas
(d_row <- strsplit(c_row,","))
#> [[1]]
#> [1] "1281 European ancestry cases"           
#> [2] "6015 European ancestry controls"        
#> [3] "290 Papua New Guinean ancestry cases"   
#> [4] "286 Papua New Guinean ancestry controls"
# Only if confident that the sequence is consistently
# Extract every other element to make a vector
(e_row <- d_row[[1]][1:(d_row |> unlist() |> length()) %% 2 != 0])
#> [1] "1281 European ancestry cases"        
#> [2] "290 Papua New Guinean ancestry cases"
# remove the string "ancestry cases"
(f_row <- lapply(e_row, function(x) gsub(" ancestry cases","",x)))
#> [[1]]
#> [1] "1281 European"
#> 
#> [[2]]
#> [1] "290 Papua New Guinean"
# replace first space with a comma (∴ sub, not gsub)
(g_row <- lapply(f_row, function(x) sub(" ",",",x)))
#> [[1]]
#> [1] "1281,European"
#> 
#> [[2]]
#> [1] "290,Papua New Guinean"
# split on the number
(h_row <- lapply(g_row, function(x) strsplit(x,","))) |> unlist(x = _, recursive = FALSE)
#> [[1]]
#> [1] "1281"     "European"
#> 
#> [[2]]
#> [1] "290"               "Papua New Guinean"
# convert to data frame
(d <- do.call(rbind, unlist(h_row,recursive = FALSE)) |> as.data.frame())
#>     V1                V2
#> 1 1281          European
#> 2  290 Papua New Guinean
# Set the column names
(colnames(d) <- c("count", "ancestry"))
#> [1] "count"    "ancestry"
d
#>   count          ancestry
#> 1  1281          European
#> 2   290 Papua New Guinean

Created on 2023-12-26 with [reprex v2.0.2]

John is right, but if the data are not representative it will get much more so. Data that has been manipulated in a spreadsheet is presumptively contaminated. If that is the case with these data, the assumption that the data used are representative is likely to be disappointed.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.