How do I limit the number of results in a dataset in RStudio?

Deborah_Lowman · February 10, 2023, 10:48pm

I imported a .csv file into a dataset called bestsellers:
bestsellers <- read.csv("AmazonBooks_workingcopy.csv")

Then I created a dataset called YearsPerTitle to discover how many years a book was a bestseller:
YearsPerTitle <- bestsellers %>%
group_by(Name) %>%
summarize(NumberofYears = n())

Then I created a dataset called MultipleYears to return the books that were bestsellers more than once:
MultipleYears <- YearsPerTitle %>%
group_by(Name) %>%
filter(NumberofYears>1)

Now I want to create a dataset called topfifty to return only the top 50 bestsellers in descending order by the NumberofYears they were bestsellers. Here is the code I have:
topfifty <- MultipleYears %>%
filter(NumberofYears>1) %>%
group_by(NumberofYears)
arrange(topfifty, desc(NumberofYears))
head(topfifty,50)

It works until the last line. My tibble says +40 more rows. So it's grabbing only the 50 records I get the error "Error in if ((ni <- n[i]) < 0L) max(d[i] + ni, 0L) else min(ni, d[i]) :
the condition has length > 1"

Also, they are not arranged in descending order by NumberofYears.

I have also tried top_n()

Here are the first 20 rows of my dataset bestsellers:
dput(head(bestsellers, 20))
structure(list(Name = c("10-Day Green Smoothie Cleanse", "11/22/63: A Novel",
"12 Rules for Life: An Antidote to Chaos", "1984 (Signet Classics)",
"5,000 Awesome Facts (About Everything!) (National Geographic Kids)",
"A Dance with Dragons (A Song of Ice and Fire)", "A Game of Thrones / A Clash of Kings / A Storm of Swords / A Feast of Crows / A Dance with Dragons",
"A Gentleman in Moscow: A Novel", "A Higher Loyalty: Truth, Lies, and Leadership",
"A Man Called Ove: A Novel", "A Man Called Ove: A Novel", "A Patriot's History of the United States: From Columbus's Great Discovery to the War on Terror",
"A Promised Land", "A Stolen Life: A Memoir", "A Wrinkle in Time (Time Quintet)",
"Act Like a Lady, Think Like a Man: What Men Really Think About Love, Relationships, Intimacy, and Commitment",
"Adult Coloring Book Designs: Stress Relief Coloring Book: Garden Designs, Mandalas, Animals, and Paisley Patterns",
"Adult Coloring Book: Stress Relieving Animal Designs", "Adult Coloring Book: Stress Relieving Patterns",
"Adult Coloring Books: A Coloring Book for Adults Featuring Mandalas and Henna Inspired Flowers, Animals, and Paisley"
), Author = c("JJ Smith", "Stephen King", "Jordan B. Peterson",
"George Orwell", "National Geographic Kids", "George R. R. Martin",
"George R. R. Martin", "Amor Towles", "James Comey", "Fredrik Backman",
"Fredrik Backman", "Larry Schweikart", "Barack Obama", "Jaycee Dugard",
"Madeleine L'Engle", "Steve Harvey", "Adult Coloring Book Designs",
"Blue Star Coloring", "Blue Star Coloring", "Coloring Books for Adults"
), Price = c("$8.00 ", "$22.00 ", "$15.00 ", "$6.00 ", "$12.00 ",
"$11.00 ", "$30.00 ", "$15.00 ", "$3.00 ", "$8.00 ", "$8.00 ",
"$2.00 ", "$23.00 ", "$32.00 ", "$5.00 ", "$17.00 ", "$4.00 ",
"$6.00 ", "$6.00 ", "$8.00 "), Year = c(2016L, 2011L, 2018L,
2017L, 2019L, 2011L, 2014L, 2017L, 2018L, 2016L, 2017L, 2010L,
2020L, 2011L, 2018L, 2009L, 2016L, 2015L, 2015L, 2015L), Genre = c("Non-fiction",
"Fiction", "Non-fiction", "Fiction", "Non-fiction", "Fiction",
"Fiction", "Fiction", "Non-fiction", "Fiction", "Fiction", "Non-fiction",
"Non-fiction", "Non-fiction", "Fiction", "Non-fiction", "Non-fiction",
"Non-fiction", "Non-fiction", "Non-fiction")), row.names = c(NA,
20L), class = "data.frame")

Here is my dataset YearsPerTitle:
dput(head(YearsPerTitle, 20))
structure(list(Name = c("10-Day Green Smoothie Cleanse", "11/22/63: A Novel",
"12 Rules for Life: An Antidote to Chaos", "1984 (Signet Classics)",
"5,000 Awesome Facts (About Everything!) (National Geographic Kids)",
"A Dance with Dragons (A Song of Ice and Fire)", "A Game of Thrones / A Clash of Kings / A Storm of Swords / A Feast of Crows / A Dance with Dragons",
"A Gentleman in Moscow: A Novel", "A Higher Loyalty: Truth, Lies, and Leadership",
"A Man Called Ove: A Novel", "A Patriot's History of the United States: From Columbus's Great Discovery to the War on Terror",
"A Promised Land", "A Stolen Life: A Memoir", "A Wrinkle in Time (Time Quintet)",
"Act Like a Lady, Think Like a Man: What Men Really Think About Love, Relationships, Intimacy, and Commitment",
"Adult Coloring Book Designs: Stress Relief Coloring Book: Garden Designs, Mandalas, Animals, and Paisley Patterns",
"Adult Coloring Book: Stress Relieving Animal Designs", "Adult Coloring Book: Stress Relieving Patterns",
"Adult Coloring Books: A Coloring Book for Adults Featuring Mandalas and Henna Inspired Flowers, Animals, and Paisley",
"Alexander Hamilton"), NumberofYears = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

Here is MultipleYears:
dput(head(MultipleYears, 20))
structure(list(Name = c("A Man Called Ove: A Novel", "All the Light We Cannot See",
"Becoming", "Between the World and Me", "Brown Bear, Brown Bear, What Do You See?",
"Catching Fire (The Hunger Games)", "Crazy Love: Overwhelmed by a Relentless God",
"Cutting for Stone", "Dear Zoo: A Lift-the-Flap Book", "Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5",
"Divergent", "Dog Man: Brawl of the Wild: From the Creator of Captain Underpants (Dog Man #6)",
"Eat to Live: The Amazing Nutrient-Rich Program for Fast and Sustained Weight Loss, Revised Edition",
"Educated: A Memoir", "Fahrenheit 451", "Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series)",
"First 100 Words", "Game of Thrones Boxed Set: A Game of Thrones/A Clash of Kings/A Storm of Swords/A Feast for Crows",
"Giraffes Can't Dance", "Girl, Wash Your Face: Stop Believing the Lies About Who You Are So You Can Become Who You Were Meant to Be"
), NumberofYears = c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 4L, 2L,
2L, 2L, 2L, 3L, 2L, 2L, 5L, 3L, 6L, 2L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
Name = c("A Man Called Ove: A Novel", "All the Light We Cannot See",
"Becoming", "Between the World and Me", "Brown Bear, Brown Bear, What Do You See?",
"Catching Fire (The Hunger Games)", "Crazy Love: Overwhelmed by a Relentless God",
"Cutting for Stone", "Dear Zoo: A Lift-the-Flap Book", "Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5",
"Divergent", "Dog Man: Brawl of the Wild: From the Creator of Captain Underpants (Dog Man #6)",
"Eat to Live: The Amazing Nutrient-Rich Program for Fast and Sustained Weight Loss, Revised Edition",
"Educated: A Memoir", "Fahrenheit 451", "Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series)",
"First 100 Words", "Game of Thrones Boxed Set: A Game of Thrones/A Clash of Kings/A Storm of Swords/A Feast for Crows",
"Giraffes Can't Dance", "Girl, Wash Your Face: Stop Believing the Lies About Who You Are So You Can Become Who You Were Meant to Be"
), .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L,
20L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .drop = TRUE))

Here is topfifty:
dput(head(topfifty, 20))
structure(list(Name = c("A Man Called Ove: A Novel", "All the Light We Cannot See",
"Becoming", "Between the World and Me", "Brown Bear, Brown Bear, What Do You See?",
"Catching Fire (The Hunger Games)", "Crazy Love: Overwhelmed by a Relentless God",
"Cutting for Stone", "Dear Zoo: A Lift-the-Flap Book", "Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5",
"Divergent", "Dog Man: Brawl of the Wild: From the Creator of Captain Underpants (Dog Man #6)",
"Eat to Live: The Amazing Nutrient-Rich Program for Fast and Sustained Weight Loss, Revised Edition",
"Educated: A Memoir", "Fahrenheit 451", "Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series)",
"First 100 Words", "Game of Thrones Boxed Set: A Game of Thrones/A Clash of Kings/A Storm of Swords/A Feast for Crows",
"Giraffes Can't Dance", "Girl, Wash Your Face: Stop Believing the Lies About Who You Are So You Can Become Who You Were Meant to Be"
), NumberofYears = c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 4L, 2L,
2L, 2L, 2L, 3L, 2L, 2L, 5L, 3L, 6L, 2L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
NumberofYears = 2:6, .rows = structure(list(c(1L, 2L, 3L,
4L, 8L, 10L, 11L, 12L, 13L, 15L, 16L, 20L), c(5L, 6L, 7L,
14L, 18L), 9L, 17L, 19L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))

I have spent 2 hours on this one line of code, looking at R Help topics galore, and don't know if my logic is wrong or if I just don't know enough about R functions to get it right (I am new to this). Thanks for any help you can give (that works!)

FJCC · February 10, 2023, 11:35pm

The following works for me, starting with your data set named topfifty. I suspect that your problem was that you were not storing the result of the arrange() function.
Three other notes:

You are overusing the group_by() function. You do not need it before you filter() or arrange()
Are you sure you have 50 books that spent more than one year on the list? You are filtering out all the values of 1 and you may end up with less than 50 candidate titles.
Your NumberOfYears variable is a little funny. It counts, as I understand it, the number of years for any part of which the book appeared on the best seller list. If a book made the list in November of 2018 and stayed until February 2019, wouldn't it get a value of 2? But a book on the list from January - November of a single year would have a value of 1.

topfifty <- structure(list(Name = c("A Man Called Ove: A Novel", "All the Light We Cannot See",
                        "Becoming", "Between the World and Me", "Brown Bear, Brown Bear, What Do You See?",
                        "Catching Fire (The Hunger Games)", "Crazy Love: Overwhelmed by a Relentless God",
                        "Cutting for Stone", "Dear Zoo: A Lift-the-Flap Book", "Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5",
                        "Divergent", "Dog Man: Brawl of the Wild: From the Creator of Captain Underpants (Dog Man #6)",
                        "Eat to Live: The Amazing Nutrient-Rich Program for Fast and Sustained Weight Loss, Revised Edition",
                        "Educated: A Memoir", "Fahrenheit 451", "Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series)",
                        "First 100 Words", "Game of Thrones Boxed Set: A Game of Thrones/A Clash of Kings/A Storm of Swords/A Feast for Crows",
                        "Giraffes Can't Dance", "Girl, Wash Your Face: Stop Believing the Lies About Who You Are So You Can Become Who You Were Meant to Be"
), NumberofYears = c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 4L, 2L,
                     2L, 2L, 2L, 3L, 2L, 2L, 5L, 3L, 6L, 2L)), class = c("grouped_df",
                                                                         "tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
                                                                           NumberofYears = 2:6, .rows = structure(list(c(1L, 2L, 3L,
                                                                                                                         4L, 8L, 10L, 11L, 12L, 13L, 15L, 16L, 20L), c(5L, 6L, 7L,
                                                                                                                                                                       14L, 18L), 9L, 17L, 19L), ptype = integer(0), class = c("vctrs_list_of",
                                                                                                                                                                                                                               "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
                                                                                                                                                                                                                               ), row.names = c(NA, -5L), .drop = TRUE))


library(dplyr)
topfiftyArrg <- arrange(topfifty, desc(NumberofYears))
topfiftyArrg <- head(topfiftyArrg,50)

^{Created on 2023-02-10 with reprex v2.0.2}

Deborah_Lowman · February 11, 2023, 5:04pm

Is there a way to use SQL in R? Something like this:

dataset <- SQL statement

I thought I was storing the results of the arrange() in my dataset topfifty? That's what the assignment <- is for, right?

Also, if I don't include the group_by() I get an error. I can't remember it off-hand.

I have 101 books that were on the bestseller list for more than 1 year. This list is a once-a-year thing. At the end of the year, Amazon lists the top 50 bestsellers. It isn't by months and there is no carry-over to the next year.

FJCC · February 11, 2023, 5:38pm

Is there a way to use SQL in R? Something like this:

You can certainly query a database and store the result in a data frame. It is not clear to me how that is related to the current problem.

I thought I was storing the results of the arrange() in my dataset topfifty? That's what the assignment <- is for, right?

In the code you posted :

topfifty <- MultipleYears %>%
filter(NumberofYears>1) %>%
group_by(NumberofYears)
arrange(topfifty, desc(NumberofYears))
head(topfifty,50)

there is no %>% after the group_by(). The arrange() that follows uses topfifty but the result is not stored anywhere.

My code using the topfifty data frame that you posted seems to give the desired result. If it does not work with your full topfifty data set, please describe how it fails.

andresrcs · February 11, 2023, 5:54pm

You can run sql queries on data frames with the sqldf package if that is what you meant.

Deborah_Lowman · February 11, 2023, 10:29pm

Typos really can get you, right? Thanks for pointing out the missing piping. I'm not sure why, but I couldn't get your code to work in piping. However, it worked line by line so I'm good. I have so much to learn, and I tend to over-complicate things sometimes. Thanks for your help and your patience.

Deborah_Lowman · February 11, 2023, 10:32pm

That's what I was looking for. Thanks. I have a SQL that I ran in BigQuery so I knew what my resulting dataset should contain but couldn't figure out how to run SQL in R, so I was trying it the other way (the way you suggested). Now, when all else fails I can use SQL.

WangYong · February 17, 2023, 3:06am

You can use DBplyr package with related sql driver to support databse query in R.
Introduction to dbplyr • dbplyr (tidyverse.org)
The above link has show an detail example, here is the brief steps:
1. install the dbplyr package and the database driver. For example, postgres.

install.packages("dbplyr")
install.packages("RPostgre")

2. load the library

library(RPostgres)
library(DBI)
library(dbplyr)

3. establish the database connection

con <- DBI::dbConnect(RPostgres::Postgres(),
host = "hostname_or_ipaddress",
user = "username",
dbname ='dbname',
password = 'password'
)
flights_db <- tbl(con, "flights") # connect to the database table

4. glimpse dataset directly

flights_db
#> # Source: table [?? x 19]
#> # Database: sqlite 3.40.0 [:memory:]
#> year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>
#> 1 2013 1 1 517 515 2 830 819 11
#> 2 2013 1 1 533 529 4 850 830 20
#> 3 2013 1 1 542 540 2 923 850 33
#> 4 2013 1 1 544 545 -1 1004 1022 -18
#> 5 2013 1 1 554 600 -6 812 837 -25
#> 6 2013 1 1 554 558 -4 740 728 12
#> # … with more rows, 10 more variables: carrier , flight ,
#> # tailnum , origin , dest , air_time ,
#> # distance , hour , minute , time_hour , and
#> # abbreviated variable names ¹sched_dep_time, ²dep_delay, ³arr_time,
#> # ⁴sched_arr_time, ⁵arr_delay

5. query the databset in dplyR mode

flights_db %>% select(year:day, dep_delay, arr_delay)
#> # Source: SQL [?? x 5]
#> # Database: sqlite 3.40.0 [:memory:]
#> year month day dep_delay arr_delay
#>
#> 1 2013 1 1 2 11
#> 2 2013 1 1 4 20
#> 3 2013 1 1 2 33
#> 4 2013 1 1 -1 -18
#> 5 2013 1 1 -6 -25
#> 6 2013 1 1 -4 12
#> # … with more rows

flights_db %>% filter(dep_delay > 240)
#> # Source: SQL [?? x 19]
#> # Database: sqlite 3.40.0 [:memory:]
#> year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>
#> 1 2013 1 1 848 1835 853 1001 1950 851
#> 2 2013 1 1 1815 1325 290 2120 1542 338
#> 3 2013 1 1 1842 1422 260 1958 1535 263
#> 4 2013 1 1 2115 1700 255 2330 1920 250
#> 5 2013 1 1 2205 1720 285 46 2040 246
#> 6 2013 1 1 2343 1724 379 314 1938 456
#> # … with more rows, 10 more variables: carrier , flight ,
#> # tailnum , origin , dest , air_time ,
#> # distance , hour , minute , time_hour , and
#> # abbreviated variable names ¹sched_dep_time, ²dep_delay, ³arr_time,
#> # ⁴sched_arr_time, ⁵arr_delay

flights_db %>%
group_by(dest) %>%
summarise(delay = mean(dep_delay))
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use na.rm = TRUE to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.40.0 [:memory:]
#> dest delay
#>
#> 1 ABQ 13.7
#> 2 ACK 6.46
#> 3 ALB 23.6
#> 4 ANC 12.9
#> 5 ATL 12.5
#> 6 AUS 13.0
#> # … with more rows

show the dbplyr to sql language

mf %>%
mutate(z = foofify(x, y)) %>%
show_query()
#>
#> SELECT *, foofify(x, y) AS z
#> FROM dbplyr_001

mf %>%
filter(x %LIKE% "%foo%") %>%
show_query()
#>
#> SELECT *
#> FROM dbplyr_001
#> WHERE (x LIKE '%foo%')

strong text7. store the result to database
you can store the analytic result back to datbase by copy_to function as following.

copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)

last but not last the DBplyr package support bigquery database interface as well.

Good luck!

WangYong

system · March 31, 2023, 3:06am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.