# How to compare several variables in the same column

See the FAQ: How to do a minimal reproducible example (`reprex`) for beginners. Providing one keeps potential answerers from being put off by the drag of reverse-engineering the data.

``````
# fake up some data
tib <- data.frame(
name =
c("Denis Rutherford-Sipes", "Denis Rutherford-Sipes", "Tishie Stehr", "Tishie Stehr", "Diann O'Connell", "Diann O'Connell", "Mareli Mertz", "Doll Nader V", "Sharyn Casper", "Sharyn Casper"), date = structure(c(19322, 19323, 19324, 19325, 19326, 19327, 19328, 19329, 19330, 19331),
class = "Date"
),
points =
c(1383L, 1361L, 1455L, 1413L, 1331L, 1381L, 1414L, 1350L, 1466L, 1476L)
)

suppressPackageStartupMessages({
library(dplyr)
})

# data
tib
#>                      name       date points
#> 1  Denis Rutherford-Sipes 2022-11-26   1383
#> 2  Denis Rutherford-Sipes 2022-11-27   1361
#> 3            Tishie Stehr 2022-11-28   1455
#> 4            Tishie Stehr 2022-11-29   1413
#> 5         Diann O'Connell 2022-11-30   1331
#> 6         Diann O'Connell 2022-12-01   1381
#> 7            Mareli Mertz 2022-12-02   1414
#> 8            Doll Nader V 2022-12-03   1350
#> 9           Sharyn Casper 2022-12-04   1466
#> 10          Sharyn Casper 2022-12-05   1476

# preprocessing
census <- tib %>% group_by(name) %>% count()
census <- census %>% filter(n == 2)
tib <- left_join(census,tib,by="name")

# arrange by date
tib <- tib %>% arrange(date)

# setup
# create result table
names <- unique(tib$name)
result <- rep(0,length(names))
compared <- data.frame(names,result)

# main
# find index positions of odd and even rows
# odd rows hold the earlier entry of each pair, even rows the later one
begin  <- tib[seq_len(nrow(tib)) %% 2 |> as.logical(),]
finish <- tib[!(seq_len(nrow(tib)) %% 2 |> as.logical()),]
# calculate change in points (later score minus earlier score)
delta <- finish$points - begin$points

# finish table
compared$result <- delta
compared
#>                    names result
#> 1 Denis Rutherford-Sipes    -22
#> 2           Tishie Stehr    -42
#> 3        Diann O'Connell     50
#> 4          Sharyn Casper     10
``````

Created on 2022-11-26 by the reprex package (v2.0.1)

This example is a mix of `{dplyr}` and `{base}`. Some things are more convenient to do in one and some things are more convenient in the other. Be flexible and don't get married to tools.

The example isn't as "efficient" as it could be, and that's all right. The purpose is not to get the code to run faster, because it runs faster than an interactive user can notice and it wouldn't be until really large data sets that speed would be a consideration. The inefficiency should help the user better understand what is being done at each step. Data analysis is all about steps, breaking data down to its constituent parts and transforming them. Divide and conquer.

Thinking in terms of school algebra, f(x) = y, where x is what you have, y is what you want, and f is the function or chain of functions that gets you from x to y. You've described x and y well. Let's look at f.

By the statement of the problem, each `name` has either one or two entries. As it's not possible to compare two `points` scores if there is only one, those should be eliminated, which is what the `# preprocessing` code block does. First, a count of names is made after grouping by name, and that result is overwritten by filtering on the count `n` to include only names that appear twice. The `left_join` operation overwrites the `tib` object by joining `census` to it; because `census` is on the left, the single entries missing from `census` are discarded.
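To see the singleton filter in isolation, here is a minimal base R sketch of the same idea. The data frame `toy` and its values are made up for illustration, and `table()` stands in for the `group_by()`/`count()` step, so treat it as an equivalent under those assumptions rather than the code used above.

```r
# hypothetical toy data: "a" appears twice, "b" only once
toy <- data.frame(
  name   = c("a", "a", "b"),
  points = c(10L, 12L, 7L)
)

# count how often each name occurs
counts <- table(toy$name)

# names that occur exactly twice
pairs <- names(counts)[as.vector(counts) == 2]

# keep only the rows that belong to those names; the singleton "b" drops out
kept <- toy[toy$name %in% pairs, ]
kept
```

Only the two rows for `"a"` survive, which is the same effect the count, filter and join have on the real data.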

Next, we need to ensure that the begin and finish dates are in chronological order, which is what `arrange` does. Following that, we create a placeholder data frame to hold the results.

The heavy lifting is done by this ugly-looking operation:

``````
tib[seq_len(nrow(tib)) %% 2 |> as.logical(),]
``````

This looks scarier than it actually is.

We begin with our re-arranged data frame `tib` and subset it with the `[]` square bracket operator. Notice the comma `,` before the closing `]` bracket. That's because of the syntax of the subset operator: rows go before the comma and columns after it.

``````
object[1]   # first column
object[1,3] # third column in first row
object[3,]  # all columns in third row
``````
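The three forms can be tried on a small data frame. This is only a sketch: `object` and its contents are invented for the illustration, and the last line adds the logical-vector form of row selection, which is the one the main code relies on.

```r
# hypothetical 3 x 3 data frame for illustration
object <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  z = c(10, 20, 30)
)

first_col <- object[1]      # a data frame holding only column x
one_cell  <- object[1, 3]   # the value in row 1, column 3
third_row <- object[3, ]    # all columns of row 3

# a logical vector before the comma keeps the rows marked TRUE
lgl_rows <- object[c(TRUE, FALSE, TRUE), ]
```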

So we now know that whatever ends up between the `[]` brackets will be used to pull rows out of `tib`. `begin` pulls out the odd-numbered rows, and `finish` the even-numbered ones. That's what the `%%` modulus operator is for: it returns the remainder after division, so a row number `%% 2` is 1 for odd rows and 0 for even rows. To get the even rows, we negate the resulting logical vector with `!`. Prefixing the row numbers with `-` does not flip the selection, because in R `-1 %% 2` is still 1.
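A quick way to convince yourself of how `%%` behaves is to run it on a short vector of row numbers. The sketch below uses six made-up row numbers, not the poster's data.

```r
rows <- seq_len(6)               # 1 2 3 4 5 6

remainders <- rows %% 2          # 1 for odd row numbers, 0 for even ones
odd  <- as.logical(remainders)   # TRUE for odd row numbers
even <- !odd                     # TRUE for even row numbers

rows[odd]                        # odd row numbers
rows[even]                       # even row numbers

# negating the row numbers leaves the remainders unchanged,
# so a leading "-" cannot be used to switch from odd to even
neg_remainders <- (-rows) %% 2
```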

The vector of remainders is then piped with `|>` to `as.logical` for conversion to `TRUE`/`FALSE`, which is what `[` uses to keep or drop each row. The end result is that `tib` is divided into two pieces with odd and even rows, and we can subtract the odd rows from the even rows to get the change `delta` between the first and second `points` scores. We put that into the results table `compared` and we're done.
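The split-and-subtract step can be sketched on a toy table as well. The names and scores here are hypothetical, and the `stopifnot()` check is an extra safeguard I am assuming you would want: the odd/even trick only pairs rows correctly when each name's two entries sit next to each other.

```r
# hypothetical toy data, ordered so each name's two rows are adjacent
toy <- data.frame(
  name   = c("a", "a", "b", "b"),
  points = c(10L, 12L, 7L, 4L)
)

odd    <- as.logical(seq_len(nrow(toy)) %% 2)
begin  <- toy[odd, ]    # rows 1 and 3: first score of each pair
finish <- toy[!odd, ]   # rows 2 and 4: second score of each pair

# the pairing only makes sense if the names line up row by row
stopifnot(identical(begin$name, finish$name))

# change from first to second score
delta <- finish$points - begin$points
data.frame(names = begin$name, result = delta)
```

If the check fails on real data, sorting by name and then by date before splitting is one way to restore the adjacency.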

The mantra should always be: I have x, I want y, and if I apply f_1 to x I can move one step closer, or if I apply f_2 to y I can get closer the other way.

Finally, see the Homework FAQ. I generally provide only hints, rather than solutions. In this case I made an exception, partly because new users get a break and partly because this assignment seems unreasonably difficult for a beginner. Don't count on solutions as a matter of course.
