Equivalent of "rank" and "partition" statements in R

omario · December 16, 2021, 3:41am

I am working with the R programming language. I have the following SQL query:

select b.var1 as var1, b.var2 from
(select *, rank() over( partition by var1 order by var3) as rank1
from my_table)b;

The goal of this SQL code is to:

Find groups of records containing duplicate values of the same "var1"
For each of these groups, sort these records based on their values of "var3"
For each of these groups of sorted duplicate records, only keep the the record with the largest value of "var3"
Note: Records containing non-duplicate values of "var1" are left untouched

My Question: Does anyone know if it is possible to run this same code in R? For example:

library(RODBC)
library(dbi)
library(odbc)
library(sqldf)
library(SQLite)

dbWriteTable(my_db_connection, SQL("   select b.var1 as var1, b.var2 from
    (select *, rank() over( partition by var1 order by var3) as rank1
    from my_table)b " ), results_of_this_query)

Does anyone know if it is possible to do this in R? Does R recognize SQL commands like "rank()", "over" and "partition"?

Thanks!

FJCC · December 16, 2021, 4:02am

This code finds the maximum var3 for each value of var1 and filters the data to keep only those rows. Is that what you want?

library(dplyr)
#invent data
set.seed(123)
DF <- data.frame(Name=sample(LETTERS[1:4],10,replace = TRUE),
                 var1=sample(1:4,10,replace = TRUE),
                 var3=runif(10))
DF
#>    Name var1      var3
#> 1     C    4 0.8895393
#> 2     C    2 0.6928034
#> 3     C    2 0.6405068
#> 4     B    1 0.9942698
#> 5     C    2 0.6557058
#> 6     B    3 0.7085305
#> 7     B    4 0.5440660
#> 8     B    1 0.5941420
#> 9     C    3 0.2891597
#> 10    A    3 0.1471136

#find max var3 for each var1
MaxVar3 <- DF |> group_by(var1) |> summarize(Max=max(var3))
MaxVar3
#> # A tibble: 4 x 2
#>    var1   Max
#>   <int> <dbl>
#> 1     1 0.994
#> 2     2 0.693
#> 3     3 0.709
#> 4     4 0.890

# filter DF
DF2 <- semi_join(DF,MaxVar3,by=c("var1",var3="Max"))
DF2
#>   Name var1      var3
#> 1    C    4 0.8895393
#> 2    C    2 0.6928034
#> 3    B    1 0.9942698
#> 4    B    3 0.7085305

^{Created on 2021-12-15 by the reprex package (v2.0.1)}

Yarnabrina · December 16, 2021, 4:56pm

Where exactly in your query are you doing this? There is no WHERE clause to filter on ranks.

My interpretation of the problem using FJCC's data:

set.seed(123)
DF <- data.frame(Name=sample(LETTERS[1:4],10,replace = TRUE),
                 var1=sample(1:4,10,replace = TRUE),
                 var3=runif(10))
DF
#>    Name var1      var3
#> 1     C    4 0.8895393
#> 2     C    2 0.6928034
#> 3     C    2 0.6405068
#> 4     B    1 0.9942698
#> 5     C    2 0.6557058
#> 6     B    3 0.7085305
#> 7     B    4 0.5440660
#> 8     B    1 0.5941420
#> 9     C    3 0.2891597
#> 10    A    3 0.1471136

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

DF |>
    group_by(var1) |>
    filter(var3 == max(var3)) |>
    ungroup()
#> # A tibble: 4 × 3
#>   Name   var1  var3
#>   <chr> <int> <dbl>
#> 1 C         4 0.890
#> 2 C         2 0.693
#> 3 B         1 0.994
#> 4 B         3 0.709

(Never did it in R with a DB connection, so no comment if the question is that only)

system · January 6, 2022, 4:56pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.