What to call a data rectangle: dataset / data frame / tibble / other?

mine · October 2, 2017, 3:10am

I'm teaching a data science course to complete R novices (first year undergraduate students) and primarily using the tidyverse toolkit. If loading an external dataset, we use readr so the resulting object is a tibble. We also use dplyr heavily, so datasets loaded from other packages that might not have been tibbles get converted to tibbles along the way. I'm not interested in going through the details of how an object of class tibble differs from a data.frame object (but if you think one should, I'd love to hear your thoughts). However, I catch myself using the term "data frame" sometimes in class, and I think I really just mean a dataset/data matrix as opposed to the data.frame class, and I'm using this term out of habit.

What term do others teaching the tidyverse use to refer to "data rectangles"? Do you exclusively use the word tibble? I'm trying to train myself to say "dataset" when I just want to talk about a data rectangle, and tibble only when it's necessary to discuss the class of the object, but I'd love to hear others' thoughts on this.

(As an aside, I'm still trying to learn to not say "we subset with filter" and instead say "we filter the dataset", because I don't want to use a word that's also the name of another function that works differently, but old habits die hard... I find that precise wording helps students google things better, hence my semi-obsession with it.)
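To make the subset/filter distinction concrete, here is a minimal sketch comparing base R's subset() with dplyr's filter() on the built-in mtcars data. It assumes dplyr is installed; the object names by_subset and by_filter are only illustrative.

```r
# Both calls keep the rows of mtcars where cyl equals 4.
by_subset <- subset(mtcars, cyl == 4)
by_filter <- dplyr::filter(mtcars, cyl == 4)

# Same number of rows either way.
nrow(by_subset)
nrow(by_filter)

# subset() can also pick columns in the same call via `select`;
# filter() only ever chooses rows.
mpg_only <- subset(mtcars, cyl == 4, select = c(mpg, cyl))
names(mpg_only)
```

Because subset() does two jobs and filter() does one, saying "we filter the dataset" points students at the function they actually typed.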

terence · October 2, 2017, 3:51am

I almost always use the term 'dataset' when I teach social science undergraduates, but that's because it's the norm in my discipline. When I teach tidy data, I add an adjective 'tidy dataset' as opposed to 'messy', 'untidy', or 'datasets designed to get you super frustrated, and for some reason is the default for most international institutions like the World Bank'.

I use tibble in conjunction with "All hail Hadley, [R]ockstar". For instance: "I say tibble, you say All hail Hadley, [R]ockstar." "Tibble?" Most of the time, I'm greeted with silence and stony stares. But on rare occasions, someone gives the correct response and gets an A automatically (I jest, of course!). More seriously, I typically mention tibbles only in my session on readr and tidy data, and have students see how non-tibbles make our console look like The Matrix if we have a sufficiently large number of observations. I'm interested in hearing if others think we should go into the details of how a tibble differs from a data.frame as well. I don't think it's necessary, especially in an intro course.

(On filter(), I suppose I should start learning to not say "we subset with filter". Let's see if I remember to do that tomorrow. I agree that precise wording with respect to how most useRs do things is definitely useful! On precise names for things: I've been trying to say brackets for [] as opposed to 'square brackets', and braces for {} as opposed to 'curly braces', after having read someone inquire why we don't say round (or semi-circle) parentheses... but I still slip into old habits.)

Tazinho · October 2, 2017, 7:36am

How about table (same as in SQL)? When one wants to be more specific, one can introduce class(). In general I name tibbles df_something, because they inherit from data frame anyway. Only for data tables do I use dt_something, since they really behave differently.
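A short sketch of what class() shows for a base data frame and for a tibble, assuming the tibble package is installed. The names df_base and df_tbl are illustrative and follow the df_something convention mentioned above.

```r
library(tibble)

df_base <- data.frame(x = 1:3)
df_tbl <- as_tibble(df_base)

class(df_base)  # "data.frame"
class(df_tbl)   # "tbl_df" "tbl" "data.frame"

# A tibble inherits from data.frame, so code that asks
# "is this a data frame?" answers yes for both objects.
is.data.frame(df_base)
is.data.frame(df_tbl)
```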

EconomiCurtis · October 2, 2017, 7:50am

When I teach (my experience lately is with social science undergrads getting an intro to R), I like to introduce students to the broad idea of a rectangular dataset object with a certain number of rows and single-type columns, following the basic structure of the tidy data vignette (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).

I hadn't thought about it, but I then try to stick to relational database jargon, referring to datasets as "tables".

I suppose a term like data.frame or tibble is more precise, but less useful to students who might later work with other technologies, or with people not familiar with R-specific jargon. I haven't double-checked this, (and I'm particularly curious if others disagree) but I had always thought "table" was the most old-school and broadly popular term to refer to a rectangular dataset?

I suppose that is a good case for calling tables "datasets". But I always thought that term might be risky since it could also refer to lists, json files, and other data.

I'm curious what others have to say!

martin.R · October 2, 2017, 8:25am

I cannot recall where I read this, but ...
Q: Why is a 'tibble' named like that?
A: Because that's how New Zealanders pronounce 'table'!

mine · October 2, 2017, 1:47pm

I try to avoid the word table, because of table() and also because many times when I've asked students to find a dataset of interest for a project they've found things like census tables, went down a path of analyzing it, later to realize the rows don't represent the units of interest.
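As a small illustration of why "table" can mislead in R, the sketch below uses base R's table() on the built-in mtcars data. The result is a set of counts, one per category, not a dataset with one row per car. The name cyl_counts is illustrative.

```r
# table() cross-tabulates: it counts how many cars have each cylinder value.
cyl_counts <- table(mtcars$cyl)
class(cyl_counts)  # "table"
cyl_counts

# Converted to a data frame, each row is a category, not a car.
counts_df <- as.data.frame(cyl_counts)
nrow(counts_df)    # number of distinct cylinder values
nrow(mtcars)       # number of cars (the units of interest)
```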

martin.R · October 2, 2017, 2:11pm

I agree that referring to 'tables' could be confusing, particularly with data.table cropping up when students look for answers on SO, etc.

A dataset would seem a neutral enough term when starting out, but there should be no problem in referring to dataframes once students are actively using them, even in the form of tibbles. If I were a student I think I would appreciate a brief description of what a base data.frame is and how a tibble is an enhanced version of it (without all the technical details).

The biggest problem I had initially with R (some years ago pre-tidyverse) was not appreciating that there were so many different ways to refer to very similar objects such as dataframes and I became thoroughly confused. Once I understood that different packages (e.g. plyr, dplyr, data.table) had their own syntaxes which differ from base then I was able to move on and select which best suited my purposes.

Starting out with the tidyverse and tibbles may make things much easier initially for students to get up and running, but they will still have to learn some base syntax in order to understand any other on-line code or help on SO.

jennybryan · October 2, 2017, 6:30pm

I say "data frame" for rectangular, spreadsheet-y datasets. For me, this is a concept and not tied, e.g., to any particular vector of S3 classes.

I say "tibble" when we're really talking about code and actual R objects that, indeed, are of class tbl_df.

I try to do all of the above consistently in what I say in class and what I write. I am probably not 100% successful at that.

raviolli77 · October 2, 2017, 6:50pm

Likewise, I think I've come to associate any rectangular dataset with "data frame". Since both R and Python use that term a lot, it helps with speaking across multiple languages.

alistaire · October 3, 2017, 3:19am

"Data frame" seems a necessary term if students are trying to search for answers on StackOverflow, mailing lists, or their favorite search engine; "tibble" won't turn up many relevant results.

Plus "data frame" is distinct—there's no confusion with tables, matrices, or arrays (and if there is with data.tables it's self-inflicted)—but not overly general (rectangular data, dataset). It's portable, both within R (S3 methods, docs, etc.) and outside (to Python, Spark, and Julia at least). It also works fine within the tidyverse: I tend to use data_frame and as_data_frame more than tibble and as_tibble. The print method for tibbles does start with A tibble: though, which may require further explanation.

One problem, though, is whether to use "data.frame", "DataFrame", "dataframe", or "data frame". The first is most directly derived from R, but doesn't make sense in a Python environment (.frame is not a method of data). In pandas, Spark, Julia, and Maple (apparently) they're called "DataFrames", though for general usage the camel case seems overly technical. SO data frame users decided to make [data-frame] and [data.frame] synonyms of [dataframe], but the tag info page uses "data frame" when talking about the concept independent of a language. I like the idea of "data frame" as a general concept and the local spelling in context, but the disparity does make searching a pain if a search engine doesn't realize they're all the same thing.

Edit: R's docs use "data frame" for the concept (as opposed to the function), as well; see ?data.frame:

Data Frames

Description

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.

(h/t @Frank)

thoughtfulnz · October 3, 2017, 8:24am

In the Introductory R sessions I run, I say something to the effect of

"The example data is tabular data in rows and columns, where the rows are the individual records of the things we are studying and the columns are the different aspects of the thing that we are keeping track of"

After that I just say "data" or "entries".

jepusto · October 6, 2017, 2:29pm

I struggled with the same issue last semester. I settled on using "data frames" as a generic, and data.frame or data_frame or tibble to refer to specific objects in R. I think it's also helpful to style the text differently for R objects as a further contextual signal.

There is surely a better set of terminology, but I am content with imperfection for now. I think training myself to use new terminology would take a level of mental effort that would detract from other aspects of my teaching.

jtr13 · October 10, 2017, 5:24pm

To respond to a small piece of this, I came to the conclusion that it's ok to refer to a tibble as a data frame since a tibble is also a data frame. Similarly, it seemed odd at first to do things like mytidydf <- tibble(...) but I'm ok with that now. For undergrads I say that if they see tibble they should think data frame; for grad students or anyone with more knowledge of R, I explain the difference.

dtkaplan · November 17, 2017, 3:13pm

I'm detecting a consensus around "data frame." Note the space. I think that without the space, or with a period or underscore, it stops being a general idea and becomes a class in R or python/pandas. I'm interested in people's reaction to usage that lets the word "frame" be used, e.g.

"SQL tables are frames for data."
"The mtcars frame is often used for examples."
"The cross-tabulation in Table 3 is not a data frame. It's a summary or 'presentation' of the contents of the NHANES frame."

It will also be helpful to have a consensus for what to call the components of data frames.
There are at least three, following the architectural metaphor: columns, rows, and individual cells. Just to start with something, I'll propose:

"frame variable" This leaves "variable" to be more general, e.g. an algebraic unknown or a measurable value.
"frame row" as in "Each unit of observation will become a row in the frame."
"frame cell"

Frank · November 17, 2017, 3:44pm

I prefer "table", with components...

row = row or tuple
column = column or variable
cell = "value of the variable in the tuple"

Examples of named tuples: (val = 1); (lat = 44, long = 45); (name = "Bubba", age = 44). You can then talk about tables as (possibly sorted) sets of tuples,

... explaining that data sets where tuples/rows are observations are a special case.
... referring to dplyr's setdiff / union / intersect / slice / filter / anti_join in terms of set theory and conditions before explaining grouped operations and other joins.

I'm not a fan of "data frame" unless referring to the class in R. I also like "rectangular data" but would use it for hyperrectangles (arrays in R) or their long-form analogues, like expand.grid(1:2, 3:4, 5:6), a "rectangular grid."
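The "rectangular grid" example can be run directly in base R. This sketch builds the grid from the post and, as an assumed illustration of the array versus long-form contrast, converts a small 2 x 2 x 2 array to long form. The names grid, arr, and long_form are illustrative.

```r
# Every combination of the three inputs: 2 * 2 * 2 = 8 rows, 3 columns.
grid <- expand.grid(1:2, 3:4, 5:6)
dim(grid)

# A hyperrectangle stored as an R array...
arr <- array(seq_len(8), dim = c(2, 2, 2))
dim(arr)

# ...and its long-form analogue: one row per array entry,
# with one column per dimension plus a value column.
long_form <- as.data.frame(as.table(arr))
dim(long_form)
```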

Ideally, the student would understand the concepts well enough to apply them beyond R and not get too hung up on R's or the 'verse's lingo. Anyway, I'm not actually teaching; these are just my thoughts.

dtkaplan · November 17, 2017, 5:57pm

A lot depends on the audience. I'm thinking of students who are taking an introductory data science class. (I'm trying to help move statistics education into a more data-centric orientation, so that taking an intro stats course would give some data science. See, for instance, StatPREP.org.)

I take your point about using common-sense terms like "row", "column", and "cell", or their more erudite equivalents like "tuple" or "variable." For someone focussed on querying (or "wrangling" or "data manipulation") those terms adequately describe the layout of a table (or "relation" or "data frame").

But we also have to deal with entities outside the wrangling process. There are printed tables in books that summarise data, kitchen tables, water tables, 2x2 tables, pivot tables, and tables of contents. It's a lower cognitive load (I think) for students when there is a word that doesn't have so many other associations. "Frame" is such a word. There are mathematical variables, statistical variables, random variables, explanatory variables, dependent variables, to say nothing of variance, variability, varieties, ... So maybe "frame column" would be better than "frame variable" for the wrangling entity.

What's valuable about "data frame" is that students will know when they don't know what that is. If you talk about "tables" in class, it's the rare student who would be brave enough to ask, "What's a table?" because "table" has an obvious everyday meaning. But no shame in asking, "What do you mean by 'data frame?'"

alistaire · November 19, 2017, 10:13pm

I feel obligated to say that I hate the word "cell". It has too many echoes of spreadsheets, and sounds like it's an independently mutable object, which it's not. "Value" is better.

"Row" and "column" are nicely unambiguous, but refer to locations, not objects. "Variable" refers to an object, but it may or may not be part of a data frame. An adjective for when it is may be handy. "Element variable" may work, as it makes sense in that such a variable is an element of the list that makes up a data frame. That's probably R-centric, though.

"Frame" (as a noun without "data") doesn't make much sense to me. A frame is rectangular, yes, but referring to it directly implies it's empty or a framework, which is not true in this context. Though I wouldn't use it, "framed data" does have some sense, as it explains how a list is fit into a rectangular shape. It is actually more apt than "data frame", which intuitively would indicate a frame for data like "bottle cap" or "book cover".

I think we're ultimately doomed to use both "data frame" (or a variant, but I think this is the right one for the concept) and "table", because R, python, etc. use "data frame" for a class, SQL uses "table" (as does R for a different set of classes), and none of those languages or classes are going away soon.
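The remark that a variable is "an element of the list that makes up a data frame" can be checked in base R. A minimal sketch using the built-in mtcars data follows; the name mpg is illustrative.

```r
# A data frame is a list whose elements are its columns.
is.list(mtcars)
length(mtcars) == ncol(mtcars)

# Extracting one element gives the variable itself: a plain vector
# with one value per row.
mpg <- mtcars[["mpg"]]
is.numeric(mpg)
length(mpg) == nrow(mtcars)
```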

dtkaplan · November 19, 2017, 10:41pm

I'm liking your idea about "framed," treating it as a verb. One reason is that there's no currently dominant word (that I can think of) for the process of organizing data into the rectangular format. "Tidying" is a possibility, but that has so many other connotations ....

alistaire · November 19, 2017, 11:27pm

Given Hadley's paper on tidy data, "to tidy" is more specific than to make data rectangular; rectangular data is only tidy if in a particular arrangement. Moreover data frames are not always totally rectangular, e.g. when they have list columns or variables with attributes.
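A small base R sketch of the list-column case: the data frame below has two rows and two columns, yet the entries of one column have different lengths and types. The names df, id, and vals are illustrative.

```r
df <- data.frame(id = 1:2)

# Assign a list with one element per row; it is stored as a list column.
df$vals <- list(1:3, c("a", "b"))

dim(df)            # still 2 rows by 2 columns
is.list(df$vals)   # the column is a list, not an atomic vector
lengths(df$vals)   # the entries hold 3 and 2 values respectively
```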

More generally, organizing data can be to "clean"/"cleanse", "munge", "wrangle" etc. There are some differences in meaning, but to "tidy" now has a more clearly defined resulting data structure thanks to the paper linked above.

danr · November 20, 2017, 3:50am

I'm not sure that row and column are all that ambiguous. In a database table, a row is an entity and a column is a scalar attribute. I'm still getting my head around the stats world, but it seems like a row would be a sample and a column would be a measurement.