Hello everyone,
I have a large dataset like this:
ID sire Dam
1 2 3
4 1 3
5 1 4
6 5 4
7 5 6
8 7 6
...
I would like to replace the sire number for 10 percent of the records with a wrong sire number.
For example, for ID=1 I would like the sire (currently 2) to become 1, 5 or 7, so that
10 percent of the IDs end up with a wrong sire.
How can I do this?
Hi,
Welcome to the RStudio community!
Here is an example of a function I created to do this for any categorical variable:
set.seed(1)
#Dummy data
df = data.frame(ID = 1:100, sire = sample(c(1,2,5,7), 100, replace = TRUE),
Dam = sample(c(3,4,6), 100, replace = TRUE))
head(df)
#>   ID sire Dam
#> 1  1    1   3
#> 2  2    7   3
#> 3  3    5   3
#> 4  4    1   6
#> 5  5    2   4
#> 6  6    1   3
#Function to replace with wrong values
wrongVal = function(x, perc){
#Get the unique values
uniqueVal = unique(x)
#Pick the positions to replace at random (perc percent of x, rounded up)
toReplace = sample(seq_along(x), ceiling(length(x) * perc / 100))
#Replace each picked value with one that is not the same as the current value.
#A position is sampled rather than a value, because sample(n, 1) treats a
#single number n as the range 1:n
x[toReplace] = sapply(x[toReplace], function(y){
candidates = uniqueVal[uniqueVal != y]
candidates[sample(length(candidates), 1)]
})
return(x)
}
#Run the function on your data
df$sire2 = wrongVal(df$sire, 10)
head(df)
#>   ID sire Dam sire2
#> 1  1    1   3     1
#> 2  2    7   3     2
#> 3  3    5   3     5
#> 4  4    1   6     1
#> 5  5    2   4     2
#> 6  6    1   3     1
#Sanity check: percent of incorrect values from original
sum(df$sire != df$sire2) / length(df$sire) * 100
#> [1] 10
Created on 2023-02-23 by the reprex package (v2.0.1)
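For the pedigree in the question, the same idea can be applied directly to the sire column. This is only a minimal sketch under a few assumptions: `ped` is the six-row sample from the question, `corruptSires` and `sire_wrong` are hypothetical names, and a wrong sire is assumed to be any other sire ID already present in the column. It samples a position rather than a value, because `sample(x, 1)` treats a single number `x` as the range `1:x`.

```r
set.seed(42)

# Six-row sample pedigree from the question
ped = data.frame(ID = c(1, 4, 5, 6, 7, 8),
                 sire = c(2, 1, 1, 5, 5, 7),
                 Dam = c(3, 3, 4, 4, 6, 6))

# Hypothetical helper: replace perc percent of sires with a different sire
corruptSires = function(sire, perc){
  pool = unique(sire)
  # Number of records to change, rounded up so at least one is changed
  n = ceiling(length(sire) * perc / 100)
  idx = sample(seq_along(sire), n)
  for (i in idx) {
    candidates = pool[pool != sire[i]]
    # Sample a position, not a value, so a single candidate is returned as-is
    sire[i] = candidates[sample(length(candidates), 1)]
  }
  sire
}

ped$sire_wrong = corruptSires(ped$sire, 10)

# Number of records whose sire was changed (1 of 6 here, because of ceiling())
sum(ped$sire != ped$sire_wrong)
```

On a full dataset with many records, the changed share will be close to 10 percent; on very small data the rounding up makes it larger.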
Hope this helps,
PJ
Thank you! It works