Copilot Data Safety Revisited

rmacey · September 8, 2024, 1:00pm

This is a follow-up to a closed post. Professionally, I handle sensitive financial data. I haven't used the Copilot integration for fear that sensitive data (not to mention API keys and such) might be uploaded. My question is whether using the GitHub Copilot integration in the RStudio IDE might expose my data to the AI system. The response to the earlier post cites RStudio's documentation:

I'm concerned about the word "primarily". To use this professionally, the sensitive data should never be uploaded. Also, the wording suggests that the system relies on the code (not the data) but that doesn't mean the data is not uploaded.

mduvekot · September 9, 2024, 10:50am

Copilot: When you ask for coding assistance, I focus on the structure and logic of the code, ensuring I provide relevant help without processing the actual data. If you have any specific coding challenges or need further clarification, feel free to share!

You: But in order to do that, you have to read the data.

Copilot: I’m sorry, but I cannot continue this conversation. Thank you for your understanding.

benhmin · September 18, 2024, 10:32pm

I agree this is a concern. I want to be able to explicitly control which things the copilot agent has access to so that I can limit it to non-sensitive data. Anyone from Posit or Copilot able to comment more specifically on this?

mduvekot · September 19, 2024, 2:58pm

It's like selfie nudes; you can't tell someone to "only look at the selfies where I'm clothed".

randyzwitch · September 19, 2024, 3:47pm

The source code for RStudio can be inspected to determine how/what is being indexed:

https://github.com/rstudio/rstudio/blob/8e3a3124784eab15cc2dd51e088bb06481e03c04/src/cpp/session/modules/SessionCopilot.cpp#L249


      
          
          // A queue of pending responses, sent via the agent's stdout.
          std::queue<std::string> s_pendingResponses;
          
          // Whether we're about to shut down.
          bool s_isSessionShuttingDown = false;
          
          // Project-specific Copilot options.
          projects::RProjectCopilotOptions s_copilotProjectOptions;
          
          bool isIndexableFile(const FilePath& documentPath)
          {
             // Don't index hidden files.
             if (documentPath.isHidden())
                return false;
             
             // Don't index R files which might contain secrets.
             std::string name = documentPath.getFilename();
             if (name == "Renviron.site")
                return false;

The highlighted C++ explicitly states the intent not to index hidden files, Renviron.site files, files within hidden folders such as .ssh, or .Rproj files, and it also checks that a file is not binary.
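
To make those checks concrete, here is a minimal sketch of the same kind of filter. It is not RStudio's code: the helper names (hasHiddenComponent, looksBinary, isIndexableSketch) are hypothetical, "hidden" is assumed to mean a leading dot in any path component, and the binary test is an assumed NUL-byte heuristic on the first bytes of the file.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helper: true if any path component starts with a dot,
// e.g. ".ssh/config" or ".Renviron". Assumes dotfile-style hiding only.
bool hasHiddenComponent(const fs::path& path)
{
   for (const fs::path& part : path)
   {
      const std::string name = part.string();
      if (name.size() > 1 && name[0] == '.' && name != "..")
         return true;
   }
   return false;
}

// Hypothetical heuristic: a NUL byte in the sampled bytes suggests binary.
bool looksBinary(std::string_view head)
{
   return head.find('\0') != std::string_view::npos;
}

// Sketch of an "is this file safe to index?" filter. 'head' holds the
// first bytes of the file, read by the caller.
bool isIndexableSketch(const fs::path& path, std::string_view head)
{
   if (hasHiddenComponent(path))
      return false;   // hidden files and hidden folders
   if (path.filename() == "Renviron.site")
      return false;   // R environment file that might contain secrets
   if (path.extension() == ".Rproj")
      return false;   // project files
   if (looksBinary(head))
      return false;   // e.g. spreadsheets, Parquet files
   return true;
}
```

With this sketch, a path such as project/analysis.R passes, while /home/u/.ssh/id_rsa, project/.Renviron and a file whose first bytes contain a NUL are rejected. The real function may differ in both the checks and their order.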

Later in the codebase, you can see that RStudio checks the file extension to see if it's reasonably a code file:

https://github.com/rstudio/rstudio/blob/8e3a3124784eab15cc2dd51e088bb06481e03c04/src/cpp/session/modules/SessionCopilot.cpp#L1270


      
                {
                   WLOG("Received response with id '{}', but no continuation is registered for that response.", requestId);
                }
             }
          }
          
          namespace file_monitor {
          
          namespace {
          
          void indexFile(const core::FileInfo& info)
          {
             // Don't index overly-large files
             if (info.size() >= kMaxIndexingFileSize)
                return;
             
             // Verify this file has an indexable type
             FilePath documentPath = module_context::resolveAliasedPath(info.absolutePath());
             if (!isIndexableFile(documentPath))
                return;

The file's extension is checked against s_extToLanguageIdMap, a list of known extensions, to determine whether to index the file. Line 1273 additionally checks the size of the file, so that "larger" files such as data files have another exit path from indexing.
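
As a rough illustration of how an extension allow-list and a size cap combine, here is a small sketch. Everything in it is assumed for the example: the map contents, the language IDs, the 1 MiB limit (I have not looked up the real value of kMaxIndexingFileSize), and the lowercasing of the extension. Only the overall shape follows the excerpt above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>

// Assumed cap for illustration only; not RStudio's actual limit.
constexpr std::uintmax_t kMaxSizeSketch = 1024 * 1024;

// Hypothetical allow-list mapping file extensions to language IDs.
const std::unordered_map<std::string, std::string>& extToLanguageSketch()
{
   static const std::unordered_map<std::string, std::string> map = {
      {".r", "r"},
      {".rmd", "rmarkdown"},
      {".py", "python"},
      {".cpp", "cpp"},
      {".html", "html"},
   };
   return map;
}

// True only if the file is below the size cap AND its extension is listed.
bool shouldIndexSketch(std::string extension, std::uintmax_t sizeBytes)
{
   // Size gate first, mirroring the ">=" comparison in the excerpt.
   if (sizeBytes >= kMaxSizeSketch)
      return false;

   // Case-insensitive lookup (an assumption of this sketch).
   std::transform(extension.begin(), extension.end(), extension.begin(),
                  [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

   return extToLanguageSketch().count(extension) > 0;
}
```

Under these assumptions, shouldIndexSketch(".R", 2048) is true, while a .csv file of any size, or an .R file at or above the cap, is skipped. The point is that an unlisted extension never reaches the indexer, regardless of what the file contains.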

So the question becomes "What constitutes sufficient proof for your organization, that using a tool like this is safe relative to your security level?" I would argue that Posit/RStudio is clearly making a good faith effort to not index data (sensitive or otherwise) and provide it to GitHub, but open-source software can have bugs. As of yet, I'm not aware of any reports that RStudio is leaking data.

In the end, Posit provides the following as the Terms of Service. Please consult your internal legal team as to whether it is appropriate for you to use the tool, given these terms.

https://docs.posit.co/ide/user/ide/guide/tools/copilot.html#support-and-terms-of-service

Best,
Randy

rmacey · September 19, 2024, 4:49pm

Randy, that is very helpful. My coding is tangential to my job (I'm mediocre in R and don't know C++ - probably better with FORTRAN if that tells you anything). Let me rephrase or add to my question to be sure I understand.

Does Copilot only read "code" files (e.g., *.r, *.rmd) but not data files including *.rda, *.rdata, *.csv, *.xlsx?

Does it read memory (e.g., the global environment), including variables? What about code in the source window?

Does it read output files (*.pdf, *.html)?

Where can it access files: Only the working directory? Only the working directory and below? Anywhere?

randyzwitch · September 19, 2024, 5:20pm

Yes, per my comment, that is the expectation. There are checks to determine whether the files themselves have the customary file extension for code files in a given language, as well as testing if the files are plain-text or binary (where binary files like Excel sheets, Parquet files, etc. aren't indexed), and there is a test of the overall file size so that "data-sized" files aren't indexed.

Does it read memory (e.g., the global environment), including variables?

From my reading of the source code, it operates on files. But I too am not a C++ expert.

What about code in the source window?

Yes, as this is the main interface for Copilot to make recommendations.

Does it read output files (*.pdf, *.html)?

HTML almost certainly yes, as it is a plain-text markup language. I would guess no for PDF, per the "is this a binary file?" check in the code.

But again, if what you are doing is that sensitive, you should be working with your in-house legal/IT teams. They are the ones who need to certify that how Copilot and RStudio interact meets your company's security needs.

Best,
Randy

rmacey · September 19, 2024, 5:51pm

Again, thanks. Our firm is a small professional office with 4 people. No lawyers. No professional IT. The HTML is interesting because if we knit an RMD into an HTML file it might contain client info. I'm not using Copilot. Wish I could. But this topic is too unclear. I wonder if it reads the command window (because sensitive data might be output there).

tom_rstudio · September 20, 2024, 1:31am

Howdy, RStudio and Posit Workbench PM here:

Ultimately, as Randy mentions, Posit doesn't control the data here; it's being sent to GitHub servers (see Support and Terms of Service). Each team needs to evaluate their comfort level with external APIs, whether for data storage, processing, or GenAI purposes.

However, we do our best to avoid sending known sensitive data. The Copilot feature targets source files, and we exclude common environment files (e.g., .Renviron) and do not attempt to index data files.

I have worked with regulated industry customers (pharma, finance, banking, insurance, etc) who are bringing their own LLMs to bear and need to ensure data never leaves their environment, even if it would otherwise be fine to use an enterprise-grade GenAI tool.

One approach is chattr (Interact with Large Language Models in RStudio), which allows for local, on-workstation LLMs via tools like Ollama. This gives you a chat interface and GenAI with custom prompting but removes the "assumed risk" of sending data outside of your local environment.

Alternatively, other teams are using the .rs.api.setGhostText() API in RStudio + an RStudio add-in to add their own GenAI tooling as a "copilot".
