Support links to explain two specific steps involved in building a reproducible example

Andrea · October 6, 2018, 4:14pm

Hi, all,

I'm looking for links explaining two very important parts of writing a reproducible example:

don't load data from your local file system - maybe even better, "don't load data at all", because some people don't feel comfortable downloading data from a URL (personally I'm fine with downloading csv files or accessing remote databases through the API of well-known CRAN packages such as eurostat, but it's not BP for reprexes and anyway YMMV)
you don't actually load the reprex package in your reprex - this can be a bit difficult to explain over the Internet sometimes, the discussion risks becoming quite meta.
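As a rough illustration of both points, here is a sketch of what the code inside a reprex might look like. The file path, the `sales` data frame and its columns are made up for this example. The data is built inline (or taken from a built-in dataset), and `library(reprex)` never appears in the code being shared:

```r
# Not reproducible for anyone else (hypothetical local path):
# sales <- read.csv("C:/Users/me/Desktop/sales.csv")

# Option 1: create a tiny data frame inline (made-up values)
sales <- data.frame(
  month = c("Jan", "Feb", "Mar"),
  units = c(10, 15, 12),
  stringsAsFactors = FALSE
)
avg_units <- mean(sales$units)
print(avg_units)

# Option 2: use a dataset that ships with R
print(head(mtcars, 3))

# Note: no library(reprex) here. The reprex package is only the tool
# that renders this code; it is not part of the problem being shown.
```

To render it, one would copy the code above to the clipboard and then run `reprex::reprex()` in the console, as a separate step outside the example itself.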

Which links do you think would be the best for these specific points? I'm going to use reprex dos and don'ts (h/t Mara) and the reprex video by Jenny Bryan, which is great, but it's a bit long, so I'm not sure whether that could put off some learners. Do you think these are ok for these two points, or would you use something else? Ideally, I need material that is specific to these two points, rather than an all-encompassing list of links about reprexes. The reason is that I already linked to our great reprex FAQ, which contains all of the above and much more, so now I need to be specific.

cderv · October 6, 2018, 5:26pm

There is also a recent webinar about reprex:

Andrea · October 6, 2018, 5:44pm

Indeed! It's the second link I included.

mara · October 6, 2018, 6:23pm

We're working on making a shorter video out of the longer one, but if you go to the basic usage chapter mark (the whole outline is in the video FAQ), it's about a 2 minute overview.

If you'd like to write something up, feel free to contribute it to the reprex FAQ thread. I know @EconomiCurtis is always trying to update and improve it.

jcblum · October 6, 2018, 6:32pm

I don’t know of anything super specific and to-the-point covering these specific topics, and I agree it’s a real need.

The entire data topic is one of the biggest sticking points and it’s already covered here (and elsewhere), but I’m not sure any of that is at the dead-simple, step-by-step level that can be parsed by someone who’s new to R and reproducible examples and also stressed out about the problem they’re trying to solve.

There are so many angles on the data problem that I think that itself confuses people. I agree that having some modular posts that take on a single small topic would be valuable.

If you’re game, you could start a couple of posts here in #meta for the topics that you’re envisioning and I can make them into wiki-posts so that others can collaborate!

ETA: posted at the same time as @mara

Andrea · October 6, 2018, 6:37pm

I'm not sure... I don't think I could write anything better or more attractive to learners than what's already there: it even includes Jesse Mostipak's awesome short intro which, at just 4 minutes long, shouldn't scare away even the most time-constrained readers.

Also, I think the thread is already long enough as it is. I mean, I do like it! And I use it either as a link to share with others, or as a reference for myself when I have doubts. But I'm not sure if it's better to make it longer and more comprehensive (with the risk of having people not reading through it), or just to point to it as a start, and then point to more specific links if/when needed. When I say "I'm not sure" I mean it - I literally don't know whether it would be more useful for the community to grow that thread, or to ask new questions such as this one.

Andrea · October 6, 2018, 6:41pm

If you’re game, you could start a couple of posts here in meta for the topics that you’re envisioning and I can make them into wiki-posts so that others can collaborate!

Wiki-posts may work - I was unsure whether to add to the reprex FAQ thread (pros: all in one place! cons: the longer it gets, the higher the risk of people not reading the thread!), but having a separate resource might be better.

For today, I'll just use some of the links mentioned, but in the future I may write some posts - not sure I can improve on what's already in the FAQ, but I could write two posts which focus on just these two points, trying to make them as simple as possible.

jcblum · October 6, 2018, 6:43pm

Moderators here have the power to merge posts into other threads and to split threads that get too long off into their own topics, so I wouldn’t worry too much about the format. (But your concern for keeping things tidy is very much appreciated!)

That said, I think a “safe” approach (and one I believe @EconomiCurtis endorses) is to start a new topic in #meta with suggested FAQ content. Once it’s developed, the decision can be made to merge or copy it into a different existing FAQ topic, to simply link to it from other topics, or to let it stand on its own.

prosoitos · October 6, 2018, 6:44pm

Don't forget that there is also dput(). Nobody seems to use it in this forum and in this reprex age, but it is what people on Stack Overflow advise all the time for producing a minimal reproducible example, and it is indeed a very convenient (if unaesthetic, due to the long output) way to turn data into text. I think it might be worth adding it as an option in the reprex workflow.

prosoitos · October 6, 2018, 6:48pm

I think that a third one could be not to use sensitive data and to either make up data following the same structure or anonymise the data first in those cases.
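A minimal sketch of the "make up data following the same structure" option, assuming the real data were a table of patients. Every name and value here (`fake_patients`, `patient_id`, `age`, `diagnosis`) is invented for illustration:

```r
# Made-up stand-in for a sensitive table: same column names and types,
# but none of the real values.
set.seed(42)  # so everyone who runs this gets the same fake data

fake_patients <- data.frame(
  patient_id = sprintf("P%03d", 1:5),                    # fake IDs
  age        = sample(18:90, 5, replace = TRUE),         # random ages
  diagnosis  = sample(c("A", "B"), 5, replace = TRUE),   # placeholder labels
  stringsAsFactors = FALSE
)

str(fake_patients)
```

If the problem only shows up with particular values (missing data, duplicated IDs, odd dates), those features would need to be reproduced in the fake data too.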

We recently saw that this is not obvious to everyone.

Andrea · October 6, 2018, 6:50pm

Sure - I wasn't implying those were the only important parts, or even the main ones. I just need those specific parts for a thread I'm replying to.

prosoitos · October 6, 2018, 6:51pm

Awesome !!!

I really love the video but it is too long for most people. A short version will be extremely valuable and I am thrilled to hear that. Thanks for all the work!

prosoitos · October 6, 2018, 6:52pm

Oh, I see. Sorry!

Having all key points together in all posts about reprex might be good though (from a pedagogical perspective).

Andrea · October 6, 2018, 6:54pm

Great suggestion! I know dput quite well - I've been using it for quite some time on Cross Validated and Stack Overflow, before reprex came out. Actually, I did suggest using dput in the thread from where all this originated, but I wasn't able to explain myself:

https://forum.posit.co/t/prepping-and-importing-time-series-data-for-noobs/

The other party thought I was suggesting that they include the dput command inside the reprex. Eh, it's not easy to understand each other remotely!

prosoitos · October 6, 2018, 6:55pm

I think that a short summary in bullet form with all the key points would be great! (And there could be links to the longer, more detailed threads.)

Because you are right: some people are less likely to read something that seems very long and complicated.

prosoitos · October 6, 2018, 6:58pm

dput() is really great. But the ugly text output can be very scary and that really plays against it.

It might be worth writing a little educational post on how to use it using a very small piece of data so that people understand that it is only a matter of copying and pasting and that the structure of the bit to copy and paste does not matter and shouldn't be scary.
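Something along these lines might do as a starting point. It is only a sketch with a made-up three-row data frame (`df_small` and its columns are hypothetical), and the exact text that dput() prints can differ a little between R versions:

```r
# Step 1 (person asking): run dput() on a small piece of the data
df_small <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
dput(df_small)

# Step 2: copy whatever was printed in the console and paste it into the
# question after `<-`. In recent R versions the printed text looks
# roughly like the structure(...) call below.
df_pasted <- structure(
  list(id = 1:3, score = c(2.5, 3.1, 4.8)),
  class = "data.frame",
  row.names = c(NA, -3L)
)

# Step 3 (person helping): run the pasted line and get the same data back
print(df_pasted)
```

The `structure(...)` text does not need to be read or understood; it is just copied and pasted as a whole.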

prosoitos · October 6, 2018, 7:00pm

In comparison, the formatted RMarkdown reprex is incredibly sleek. But both play different roles and are not exclusive.

jcblum · October 6, 2018, 7:02pm

Well, our existing guidance here mentions dput() — though I think it could be more prominent — and I see people (@Andrea included!) recommend it quite often. It’s also in the Reprex Dos and Don’ts and in the Advanced R 1st Ed section on reproducible examples, which is linked to by the Dos and Don’ts page.

I think it would help to have a very explicit set of instructions on using dput() to link to. The process is quite counter-intuitive and somewhat convention-breaking since it involves a weird, non-reproducible step (copy that thing you got from the console and paste it back up into your script). I rarely see anybody get it right on their first try.
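For anyone unsure whether the copy-and-paste step loses anything, here is a small sketch that simulates it in code. The `original` data frame is made up, and `capture.output()` plus `eval(parse())` merely stand in for the manual copy and paste; they are not something to put in a real reprex:

```r
# A made-up data frame with a factor column, which people often worry about
original <- data.frame(
  group = factor(c("a", "b", "a")),
  value = c(1.5, 2, 3.25)
)

# Capture what dput() would print to the console (the "copy" step)
dput_text <- capture.output(dput(original))

# Evaluate that text as code (the "paste it back into the script" step)
restored <- eval(parse(text = dput_text))

# Check that the restored object matches the original, factor levels included
print(identical(original, restored))
```

Very large objects, or numbers with many decimal places, may not survive this as neatly, which is one more reason to share only a small slice of the data.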

What I really want is, like, a dichotomous branching key to including data in a reproducible example that covers everything from using built-in data to including your own data, with very simple and explicit recommendations for each ending node... but if I want that to exist, I probably need to start working on it myself!

prosoitos · October 6, 2018, 7:02pm

I think that what you could add to this is creating a new and minimal summary listing the key points (followed by links to more info if the key points don't make sense).

There is a growing and phenomenal amount of info on reprexes, but it is a bit overwhelming.

Having a central ultra short post could be very useful I think. Then we could always refer to that one small condensed post. And that could be the central point from which other reprex threads could be accessed.

prosoitos · October 6, 2018, 7:10pm

Sorry I didn't realize it was already out there. I don't see it used often and I overlooked this.

That's ok because you don't use dput() in a script. It is only a trick to share data through copy and paste. And it is no different from having to copy some code before running reprex::reprex(). Those are not script tools.

I like the branching idea. A nice scheme goes a long way. Maybe we should start drawing something.

And if we go with a branching scheme, then one point that is never mentioned here (because it adds complexity and is not necessary) is the concept of a minimal reproducible example (MRE). Ideally, this is the best. A reprex which we can run is good enough, but an MRE is the sleekest form. And if we have a scheme, there could be a branch going into this so that people get exposed to it, but more as an optional side path off the main, necessary workflow.