Changing the value of a duplicated column

timwelch · June 26, 2020, 7:19pm

Hello, first post here! I'm trying to use reprex, but I'm not sure if I did it right.

I am trying to identify and modify duplicate values in my dataset:

df
#> function (x, df1, df2, ncp, log = FALSE) 
#> {
#>     if (missing(ncp)) 
#>         .Call(C_df, x, df1, df2, log)
#>     else .Call(C_dnf, x, df1, df2, ncp, log)
#> }
#> <bytecode: 0x0000000013e75248>
#> <environment: namespace:stats>

Created on 2020-06-26 by the reprex package (v0.3.0)

I have two types of duplicate values that I want to remove and/or change:

Columns 7 and 8 contain duplicates across all columns.
Columns 3 and 4 contain duplicates across all columns except the "values" column.

I figured out how to remove these duplicates using this code:

df2 <- df %>% distinct(ID, FamId, question, wave, .keep_all = TRUE)
#> Error in df %>% distinct(ID, FamId, question, wave, .keep_all = TRUE): could not find function "%>%"

Created on 2020-06-26 by the reprex package (v0.3.0)

However, I think the duplicates are a data entry problem (this is a secondary data analysis), so I would like to change the second "8" in the wave column to "12" (so that it reads 1 4 8 12 1 4 8 12). How can I identify the duplicates and then change the value in the wave column so that the duplicate becomes 12?

I am happy to clarify, or to retry the reprex if I didn't do it correctly.

Thanks so much, happy to be joining the R community.

Tim

technocrat · June 26, 2020, 8:49pm

Hi, and welcome!

Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract more and quicker answers.

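In this case it looks like you've named your data frame df, which is also the name of a function in the stats package (the density of the F distribution). Because no data frame named df was defined inside the reprex, that function is what appears in the post.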
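As a rough illustration of both points: a reprex runs in a clean session, so the data and any packages have to be created or loaded inside it. The sketch below uses a made-up data frame; the column names come from your distinct() call and your mention of a "values" column, but every number in it is invented, and my_df is just a name chosen to avoid clashing with stats::df.

```r
# With no object called `df` in the workspace, typing `df` prints the
# F-distribution density function from the stats package.
is.function(stats::df)

# library(dplyr)  # would also be needed inside the reprex before using
#                 # %>% or distinct(); otherwise "%>%" is not found

# Hypothetical stand-in data (all values invented for illustration),
# stored under a name that does not collide with an existing function.
my_df <- data.frame(
  ID       = c(1, 1, 1, 1),
  FamId    = c(10, 10, 10, 10),
  question = c("q1", "q1", "q1", "q1"),
  wave     = c(1, 4, 8, 8),
  values   = c(3, 2, 5, 4)
)

# Printing my_df now shows the data rather than a function definition.
is.data.frame(my_df)
```

If the real data can't be shared, a small invented data frame like this with the same structure is usually enough.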

The easiest way to remove a duplicated column, say column_dupe, is

my_df %>% select(-column_dupe) -> my_df

For columns 3 and 4 it's not clear what is duplicated. Do you have a row named waves?

If so, you may want to consider reorganizing your data frame to a tidy format, with variables, such as wave represented as columns and observations as rows.

In general, it's not a good idea to hand edit a data frame. If you do, meticulous notes are necessary to have any hope of recreating results.

Changing just the single value 8 to 12 can be done with subsetting.

Data frames are indexed row first, column second. So, for example, if the value 8 appears in row 15, column 2, the value can be replaced by

my_df[15,2] <- 12
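If you'd rather not hard-code the row number, base R's duplicated() can locate the repeated row for you. This is only a sketch on invented data; it assumes the key columns are ID, FamId, question and wave, and that it is always a repeated wave of 8 that should become 12.

```r
# Hypothetical data: the fourth row repeats wave 8 for the same
# ID / FamId / question, but with a different value.
my_df <- data.frame(
  ID       = c(1, 1, 1, 1),
  FamId    = c(10, 10, 10, 10),
  question = c("q1", "q1", "q1", "q1"),
  wave     = c(1, 4, 8, 8),
  values   = c(3, 2, 5, 4)
)

# duplicated() marks each row whose key columns repeat an earlier row.
key_cols <- c("ID", "FamId", "question", "wave")
is_dupe  <- duplicated(my_df[, key_cols])

# Overwrite wave only where the row is a repeat and the wave is 8.
my_df$wave[is_dupe & my_df$wave == 8] <- 12

my_df$wave
```

Because the change is made by a rule rather than by position, it can be rerun on the raw data and documents itself.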

It would be preferable, however, to run down the error in the source.

arnabp · June 27, 2020, 3:15am

All of this can be done with dplyr. I can suggest some steps and hints:

Install package tidyverse
Load package tidyverse
Convert your data frame to a tibble using as_tibble()
To find duplicates, use group_by() and then a summarise() with n()
Having found the duplicates, use distinct() as you are already trying to do
Finally, use an if_else() or a case_when() to change specific values and achieve your final goal
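
The steps above might look roughly like this. It is a sketch on invented data, not your actual dataset: the column names are taken from your post, all values are made up, and it assumes dplyr 1.0.0 or later (for the .groups argument) and that the second occurrence of wave 8 within each ID, FamId and question is the one that should be 12.

```r
library(dplyr)

# Hypothetical data: for each ID, wave 8 appears three times. Two of those
# rows are identical in every column; the third differs only in `values`.
my_df <- tibble(
  ID       = rep(c(1, 2), each = 5),
  FamId    = rep(c(10, 20), each = 5),
  question = "q1",
  wave     = c(1, 4, 8, 8, 8, 1, 4, 8, 8, 8),
  values   = c(3, 2, 5, 4, 4, 1, 2, 2, 3, 3)
)

# Find the key combinations that occur more than once.
dupes <- my_df %>%
  group_by(ID, FamId, question, wave) %>%
  summarise(n = n(), .groups = "drop") %>%
  filter(n > 1)

fixed <- my_df %>%
  # Drop rows that are identical across all columns.
  distinct() %>%
  # Number the remaining repeats within each key combination.
  group_by(ID, FamId, question, wave) %>%
  mutate(occurrence = row_number()) %>%
  ungroup() %>%
  # Recode the second wave-8 row in each group to wave 12.
  mutate(wave = if_else(wave == 8 & occurrence == 2, 12, wave)) %>%
  select(-occurrence)

fixed$wave
```

Inspecting dupes before recoding is worth doing, since it shows whether the repeats really are limited to wave 8.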

~Arnab
