I'm new to R and brand new to the Rstudio community. I'm working with a dataset for a request process and one of the variables is the name of the requestor. I want to run some basic descriptive stats on the dataset. Problem is, the names of the requestors have been input very inconsistently. I see variations similar to what's below but with many different names.
Requestor
Doe, John
JDoe
John Doe
Johndoe
JJones
Jane Jones
Jones, Jane
Is there a way to consolidate the different versions of these names into one unique name per person (John Doe and Jane Jones)? I have many different names in the dataset, but it seems like there are a limited number of different structures (Doe, John vs. John Doe).
It's doable, but not necessarily easily, depending on the degree of comfort with regular expressions and the conformity of the example to the data to be processed. In particular, names in ALLCAPS or alllower, such as JANETROPE and janetrope (is that Jane Trope or Janet Rope?). For that, you need some data source of surnames to split against.
Here's a starter to show how to identify the different patterns shown in the data. Play around with that and come back with questions on how to take the results and transform into some standard form, such as lname,fname.
I've made some headway on this. Here's what I've done so far. Sorry if it's hard to follow, reprex's are next on my list of things to learn.
I changed all characters to lower case using a combination of mutate() and str_to_lower() with the intent of removing any complication caused by differences in case.
I learned some about the format of the data by using str_detect() and regular expressions like you had listed in your previous comment.
I noticed that I had a pretty limited number of last names so I exported each unique value in my requestor column and created a wordbank of surnames based on what I saw.
I returned to R and used str_extract() to create a column that extracted the appropriate surname from the original column.
I'm now almost where I need to be. I want to find a way to pull a first initial and put it with the last name I have extracted. For example, I want to return "jdoe" instead of just "doe". Is there a way to extract all of the characters that do NOT match a pattern in stringr? I think I can figure out the rest of the problem but I'm stuck on this specific piece.