deidentify and duplicate data

cwiggz · March 1, 2019, 10:44pm

Hi - I am trying to use the deindentify() command and I get this error

The student ID numbers have an "@" character in front of them, which I think is one of the issues, and there are duplicate ID numbers listed as well. Is there a way to de-identify in R with the @ in front of the ID? I also need the duplicates renamed consistently, so that every occurrence of the same ID gets the same new name. I hope this makes sense. I appreciate any help and advice. Thanks!

rensa · March 1, 2019, 11:17pm

Welcome to RStudio community, @cwiggz! We can give you a bit of general guidance here, but I think we'll probably need you to make a reprex, or reproducible example, in order to properly help you.

FAQ: What's a reproducible example (`reprex`) and how do I create one? meta

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution. What's in a Reproducible Example? Parts of a reproducible example: background information - Describe what you are trying to do. What have you already done? complete set up - include any library() calls and data to reproduce your issue. data for a reprex: Here's a discussion on setting up data for a reprex make it run - include the minimal code required to reproduce your error on the data…

The reprex will have stuff like:

The code you're using (not just the line you're stuck on or the error you're getting); and
A sample of the data you're using or, if you can't provide that, some simulated data with a similar shape (e.g., the same columns).

If you can prep something like this for us, it'll give us a whole lot more context that can help us get to the root of the problem.
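As a rough sketch of what simulated data could look like: the column names and values below are invented for illustration (I don't know what your real data contains), and only base R is used.

```r
# Simulated data: every value here is made up, only the shape matters
sample_data <- data.frame(
  student_id = c("@1001", "@1001", "@1002"),  # hypothetical IDs with a leading @
  score      = c(88, 71, 93),                 # hypothetical second column
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object, ready to paste into a post
dput(sample_data)
```

Pasting the output of dput() into a post lets others rebuild exactly the same object on their machine.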

That said, it seems like there are a few things going on here that we can help with. I'm not familiar with a deidentify() function in R. Is this supplied by a package you're using? (This is one of the benefits of supplying a reprex: it can help us establish where things come from!)

If there are @ symbols in your student numbers, you can remove them using the str_replace() function in the stringr package.

I'm not quite sure I understand your explanation of how you want duplicates to be handled. If you could give us an example of a correctly handled duplicate along with your reprex, we can probably help you work that out.

Thanks!

Chuck · March 1, 2019, 11:43pm

Here's a quick illustration of how to get rid of @ characters if you don't need them. Run:

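library(dplyr)    # provides %>%, tibble() and mutate()
library(stringr)  # provides str_remove_all()

illustration <- tibble(value = c("@3", "@4"))
illustration
illustration <- illustration %>% mutate(value = str_remove_all(value, "@"))
illustration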

Chuck · March 2, 2019, 12:03am

And here is the str_replace() variant mentioned above.

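library(dplyr)    # provides %>%, tibble() and mutate()
library(stringr)  # provides str_replace()

illustration <- tibble(value = c("@3", "@4"))
illustration
illustration <- illustration %>% mutate(value = str_replace(value, "@", ""))
illustration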
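If you'd rather not load any packages, base R's sub() does the same job. A minimal sketch, assuming the @ only ever appears as the first character of an ID:

```r
ids <- c("@3", "@4", "@4")

# "^@" matches an @ only at the very start of each string
clean_ids <- sub("^@", "", ids)
clean_ids
```

Anchoring the pattern with ^ means an @ anywhere else in the string would be left alone.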

cwiggz · March 2, 2019, 6:39am

@rensa Thanks for your reply! I will work on a reprex. I am REALLY new to R, so this may take me a little bit to figure out. The deindentify() function is from the deidentifyr package. I found this on GitHub when searching for a way to de-identify my students. As for the duplication, the data set is all students who have taken math courses at my college. I am tracking their grades and subsequent success. Thus, their student ID is repeated every time they took a math course.
What the set looks like now:
Student ID   Course   Grade
@11111111    MAT137   A
@11111111    MAT167   C+
@11111111    MAT186   B
@2222222     MAT137   C
@3333333     MAT137   A-

When de-identified:
Student Rename   Course   Grade
fghj2345sd       MAT137   A
fghj2345sd       MAT167   C+
fghj2345sd       MAT186   B
abcdf6789f       MAT137   C
wrytu2746r       MAT137   A-

The same student ID needs to be given the same new name every time it appears, so that the rows are still trackable as one student. I hope this helps explain my problem a little better. I will work on the reprex! And thank you for the advice on how to get rid of the "@" symbol.

Yarnabrina · March 2, 2019, 7:29am

Apart from the link that rensa posted, this is also extremely helpful to understand what a reprex is:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Now, I'm not sure, but are you looking for something like this?

dataset <- data.frame(Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
                      Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
                      Grade = c("A", "C+", "B", "C", "A-"))

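dataset <- within(data = dataset,
                  expr = {
                    # Convert to a factor first: since R 4.0, data.frame() keeps strings as
                    # character, and as.integer() on "@11111111" would return NA
                    Student.ID <- as.integer(x = factor(x = Student.ID))
                  })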

dataset
#>   Student.ID Course Grade
#> 1          1 MAT137     A
#> 2          1 MAT167    C+
#> 3          1 MAT186     B
#> 4          2 MAT137     C
#> 5          3 MAT137    A-

Created on 2019-03-02 by the reprex package (v0.2.1)

PS: I found the deidentifyr package, but not a deindentify() function. Perhaps you typed the extra n by mistake?
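One thing to be aware of with the factor approach: the numbers follow the sorted order of the original IDs. If numbering in order of first appearance suits you better, here is a small sketch with match() and unique(); the IDs below are made up for illustration.

```r
ids <- c("@3333333", "@11111111", "@3333333", "@2222222")

# match() gives, for each ID, its position among the distinct IDs,
# so a repeated ID always receives the same number
student_number <- match(ids, unique(ids))
student_number
```

Either way, repeated IDs end up with the same number, which is what keeps a student trackable across rows.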

Chuck · March 2, 2019, 1:50pm

Try the package anonymizer. Here's an expanded version of my initial illustration.

library(dplyr)    # provides %>%, tibble() and mutate()
library(stringr)  # provides str_replace()
illustration <- tibble(value = c("@3", "@4", "@4"))
illustration
illustration <- illustration %>% mutate(value = str_replace(value, "@", ""))
illustration

library(anonymizer)
# The duplicated ID gets the same hash both times
illustration <- illustration %>% mutate(value = anonymize(value, .algo = "crc32", .seed = 1))
illustration

andresrcs · March 2, 2019, 2:29pm

Actually, your problem with the deidentifyr package is not the "@" character. The problem is that it does not accept duplicate IDs. If you add the Course column so that each row is unique, it works.

dataset <- data.frame(Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
                      Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
                      Grade = c("A", "C+", "B", "C", "A-"))

library(deidentifyr)
deidentify(dataset, Student.ID, Course)
#>           id Grade
#> 1 f9a2c0fa32     A
#> 2 88796051c5    C+
#> 3 9f215474a0     B
#> 4 d17788bf5f     C
#> 5 300f0621e9    A-
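To confirm that duplicate IDs are what you are running into, you can check for them before calling deidentify(). This is just a base R sketch on the same sample IDs; the variable names are mine.

```r
ids <- c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333")

# TRUE if at least one ID occurs more than once
has_duplicates <- anyDuplicated(ids) > 0

# Count rows per ID and keep only the IDs that repeat
id_counts <- table(ids)
repeated_ids <- names(id_counts[id_counts > 1])

has_duplicates
repeated_ids
```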

But the idea here is to keep each student identifiable across multiple rows and tables as well, so I would go with @Chuck's advice and use the anonymizer package, because it works with duplicate IDs.

dataset <- data.frame(Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
                      Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
                      Grade = c("A", "C+", "B", "C", "A-"))
library(dplyr)
library(anonymizer)
dataset %>%
    mutate(Student.ID = anonymize(Student.ID, .algo = "crc32", .seed = 1))
#>   Student.ID Course Grade
#> 1   3ac8169d MAT137     A
#> 2   3ac8169d MAT167    C+
#> 3   3ac8169d MAT186     B
#> 4   1c846636 MAT137     C
#> 5   1f97526b MAT137    A-
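If installing anonymizer is not an option, a similar result can be sketched in base R with a lookup table of random codes. This is an illustration under my own assumptions, not how anonymizer works internally: the second table, its Term column, and the make_code() helper are all hypothetical.

```r
set.seed(1)  # fixed seed only so the sketch is repeatable

grades <- data.frame(Student.ID = c("@11111111", "@11111111", "@2222222"),
                     Grade = c("A", "C+", "C"),
                     stringsAsFactors = FALSE)
# Hypothetical second table that shares the same IDs
enrolment <- data.frame(Student.ID = c("@2222222", "@11111111"),
                        Term = c("F18", "S19"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: one random 10-character code
make_code <- function() {
  paste(sample(c(letters, 0:9), 10, replace = TRUE), collapse = "")
}

# One code per distinct ID across both tables, stored as a named vector
all_ids <- unique(c(grades$Student.ID, enrolment$Student.ID))
key <- setNames(replicate(length(all_ids), make_code()), all_ids)
stopifnot(anyDuplicated(key) == 0)  # guard against an (unlikely) collision

# Replace the IDs by looking them up in the key
grades$Student.ID <- unname(key[grades$Student.ID])
enrolment$Student.ID <- unname(key[enrolment$Student.ID])
```

Because both tables use the same key, a student keeps the same code everywhere. Whoever holds the key can reverse the mapping, so it should be stored securely or discarded.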

cwiggz · March 2, 2019, 2:33pm

Thank you! The deidentify() function was not needed, your code worked PERFECTLY! You just saved me a ton of time, I am very grateful!

Yarnabrina · March 2, 2019, 3:24pm

Glad I could help.
