Using grepl to match characters in a dataframe column

I am trying to use grepl to match a pattern in a column of a dataframe. The column is a list of Irish people's surnames, and I want to return the first letter of each surname. However, some surnames start with Mc, Mac or O'. In those cases I want to return the prefix and the next letter after it.
I have some code that successfully does this for the names that have the Mc and Mac prefixes. But I can't get it to work for cases where the name begins with O'.

I have the following code:
ifelse(grepl("^Mc", DF$surname), substr(DF$surname, 1, 3),
ifelse(grepl("^Mac", DF$surname), substr(DF$surname, 1, 4),
ifelse(grepl("^O'", DF$surname), substr(DF$surname, 1, 3), substr(DF$surname, 1, 1))))

This code works if I run it on a vector I created myself, such as surnames <- c("O'Connell", "O'Callaghan"),
but it doesn't work on a dataframe column.
What is the difference between a vector and a dataframe column?

Any help would be appreciated.
Thanks!

I can't think why a data frame would act differently. Can you post a few rows of data? To post rows 2, 6, and 10 of the two columns surname and givenname, use syntax like this

dput(DF[c(2,6,10), c("surname", "givenname")])

Post the output of that, putting three backticks on a line of their own before and after the output, like this:
```
output of dput() goes here
```

structure(list(Op_surname = c("Fulp", "Pyatt", "O’Dwyer"),
Op_forename = c("Pia", "Drusilla", "Mac")), row.names = c(2L,
6L, 51L), class = "data.frame")

I replaced index 10 with index 51, since row 51 is one of the values where the code fails to pick up that the surname starts with O'.

The problem is that in the data frame, the O' names have a "curly" single quote. Notice the difference between O' and O’. When you tried the vector, you probably typed the entries manually and thus had a straight single quote in O'. We humans don't care about the difference, unless we are very carefully proofreading, but the two quote characters are different to the computer.
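
One way to see the difference for yourself is to look at the Unicode code points of the two characters. This is a minimal sketch; the variable names `straight` and `curly` are only for illustration, and the curly quote is written as the escape `\u2019` so the example does not depend on how the script file is saved.

```r
# Straight apostrophe (U+0027) versus right single quotation mark (U+2019)
straight <- "O'Dwyer"
curly    <- "O\u2019Dwyer"

# Code point of the second character in each string
straight_code <- utf8ToInt(substr(straight, 2, 2))  # 39, i.e. U+0027
curly_code    <- utf8ToInt(substr(curly, 2, 2))     # 8217, i.e. U+2019

# A pattern written with a straight quote matches only the first string
matches <- grepl("^O'", c(straight, curly))
matches
#> [1]  TRUE FALSE
```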

DF <- structure(list(Op_surname = c("Fulp", "Pyatt", "O’Dwyer"),
               Op_forename = c("Pia", "Drusilla", "Mac")), row.names = c(2L,
                                                                         6L, 51L), class = "data.frame")
ifelse(grepl("^Mc", DF$Op_surname), substr(DF$Op_surname, 1, 3),
       ifelse(grepl("^Mac", DF$Op_surname), substr(DF$Op_surname, 1, 4),
              ifelse(grepl("^O'", DF$Op_surname), substr(DF$Op_surname, 1, 3), substr(DF$Op_surname, 1, 1))))
#> [1] "F" "P" "O"

ifelse(grepl("^Mc", DF$Op_surname), substr(DF$Op_surname, 1, 3),
       ifelse(grepl("^Mac", DF$Op_surname), substr(DF$Op_surname, 1, 4),
              ifelse(grepl("^O’", DF$Op_surname), substr(DF$Op_surname, 1, 3), substr(DF$Op_surname, 1, 1))))
#> [1] "F"   "P"   "O’D"

Created on 2023-10-25 with reprex v2.0.2
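
If the column might contain a mix of straight and curly quotes, it may be safer to handle both rather than pick one. The sketch below shows two possible approaches under that assumption: a character class in the pattern that accepts either quote, or normalizing the curly quote to a straight one before matching. The example data frame and the names `x`, `prefix`, `x_norm` and `prefix_norm` are made up for illustration.

```r
# Hypothetical data with both kinds of quote character
DF <- data.frame(
  Op_surname = c("Fulp", "McCarthy", "MacNamara", "O'Connell", "O\u2019Dwyer")
)
x <- DF$Op_surname

# Option 1: the class ['\u2019] matches a straight or a curly quote
prefix <- ifelse(grepl("^Mc", x), substr(x, 1, 3),
          ifelse(grepl("^Mac", x), substr(x, 1, 4),
          ifelse(grepl("^O['\u2019]", x), substr(x, 1, 3),
                 substr(x, 1, 1))))
prefix
# "F" "McC" "MacN" "O'C" "O’D"

# Option 2: replace curly quotes with straight ones first,
# then the original straight-quote pattern works unchanged
x_norm <- gsub("\u2019", "'", x, fixed = TRUE)
prefix_norm <- ifelse(grepl("^Mc", x_norm), substr(x_norm, 1, 3),
               ifelse(grepl("^Mac", x_norm), substr(x_norm, 1, 4),
               ifelse(grepl("^O'", x_norm), substr(x_norm, 1, 3),
                      substr(x_norm, 1, 1))))
prefix_norm
# "F" "McC" "MacN" "O'C" "O'D"
```

Option 1 leaves the data untouched, so the returned prefix keeps whichever quote the surname had. Option 2 changes the quote character, which is usually what you want if the prefixes will later be grouped or compared.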


Yes that did the trick. Thank you for the help. A very subtle difference!
