stringr lubridate data wrangling

jak123 · February 7, 2022, 9:49am

Hi R community,

I'm using this dataset: Billboard "The Hot 100" Songs | Kaggle

and running this code:

music_df <- billboard100 %>%
  select(date:artist, weeks_popular = "weeks.on.board")

library(lubridate)
library(stringr)

music_df %>%
  mutate(date = ymd(date)) %>%
  distinct(date) %>%
  mutate(month = floor_date(date, "month"))

music_df$artist <- as.character(music_df$artist)

music_df %>%
  mutate(date = ymd(date)) %>%
  primary_artist = ifelse(str_detect(artist, "Featuring"),
                          str_match(artist, "(.*)\\sFeaturing")[,2],
                          artist) %>%
  select(artist, primary_artist)

I want to split the artist column into the primary artist and the featured artist, but I'm getting an error:

Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
object 'artist' not found

Thanks!

technocrat · February 7, 2022, 10:00am

Try

select(date,artist, weeks_popular = "weeks.on.board")

substituting a comma for a colon

jak123 · February 7, 2022, 11:14am

not it

any other ideas?

nirgrahamuk · February 7, 2022, 11:31am

To save me from downloading an 18 MB file from Kaggle, can you please share a small sample of the data in a forum-friendly way? I.e. share the results of

dput(head(billboard100))

jak123 · February 7, 2022, 1:29pm

dput(head(music_df,5)) gives me 75000 characters

jak123 · February 7, 2022, 1:30pm

same for dput(head(music_df))

jrkrideau · February 7, 2022, 1:42pm

What does str(music_df) give you? I get the impression that R is not reading a delimiter properly, so instead of several columns of data you are getting a single vector.

nirgrahamuk · February 7, 2022, 1:48pm

what about the original dataset though ? i.e. not music_df

jak123 · February 7, 2022, 2:33pm

nirgrahamuk · February 7, 2022, 2:50pm

dput outputs are 'too large' because of the use of factors where characters would do, therefore:

dput(head(mutate(billboard100,across(where(is.factor),as.character))))
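A minimal sketch of why factors inflate the output, using a made-up vector rather than the Billboard data: head() keeps every level of a factor, so dput() writes out the full set of levels even when only six values are shown. The names artist_fct, n_factor and n_char are just for illustration.

```r
# Hypothetical stand-in for a column with many distinct values
# (not the actual Billboard data)
artist_fct <- factor(paste("Artist", 1:1000))

# head() subsets the values but keeps all 1000 levels
n_levels_head <- length(levels(head(artist_fct)))

# Size of the dput() text for six values: factor vs. character
n_factor <- sum(nchar(capture.output(dput(head(artist_fct)))))
n_char   <- sum(nchar(capture.output(dput(head(as.character(artist_fct))))))
n_factor > n_char   # TRUE: the factor version carries all the levels

# droplevels() is another way to shrink the factor version
n_levels_dropped <- length(levels(droplevels(head(artist_fct))))
```

Converting to character, as in the call above, avoids the problem entirely; droplevels() is an alternative if the columns should stay factors.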

jak123 · February 7, 2022, 3:56pm

structure(list(date = c("2021-11-06", "2021-11-06", "2021-11-06",
"2021-11-06", "2021-11-06", "2021-11-06"), rank = 1:6, song = c("Easy On Me",
"Stay", "Industry Baby", "Fancy Like", "Bad Habits", "Way 2 Sexy"
), artist = c("Adele", "The Kid LAROI & Justin Bieber", "Lil Nas X & Jack Harlow",
"Walker Hayes", "Ed Sheeran", "Drake Featuring Future & Young Thug"
), last.week = 1:6, peak.rank = c(1L, 1L, 1L, 3L, 2L, 1L), weeks.on.board = c(3L,
16L, 14L, 19L, 18L, 8L)), row.names = c(NA, 6L), class = "data.frame")

jak123 · February 7, 2022, 4:59pm

so everything was factor before, but how does that make the dput longer? and thanks

jak123 · February 8, 2022, 5:18pm

anyone ?????
br. Rasmus

nirgrahamuk · February 8, 2022, 5:36pm

billboard100 <- structure(list(
  date = c("2021-11-06", "2021-11-06", "2021-11-06",
           "2021-11-06", "2021-11-06", "2021-11-06"),
  rank = 1:6,
  song = c("Easy On Me", "Stay", "Industry Baby",
           "Fancy Like", "Bad Habits", "Way 2 Sexy"),
  artist = c("Adele", "The Kid LAROI & Justin Bieber", "Lil Nas X & Jack Harlow",
             "Walker Hayes", "Ed Sheeran", "Drake Featuring Future & Young Thug"),
  last.week = 1:6,
  peak.rank = c(1L, 1L, 1L, 3L, 2L, 1L),
  weeks.on.board = c(3L, 16L, 14L, 19L, 18L, 8L)
), row.names = c(NA, 6L), class = "data.frame")

library(tidyverse)
library(lubridate)

billboard100 %>%
  select(-last.week, -peak.rank) %>%
  mutate(date = ymd(date),
         split_artist = str_split_fixed(artist, "Featuring", 2))
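As for the original error, one likely cause is that primary_artist = ifelse(...) was piped on its own instead of being wrapped in mutate(), so artist was looked up outside the data frame. A minimal sketch on two of the sample rows, assuming featured artists are always introduced by the literal word "Featuring"; sample_df, result and featured_artist are illustrative names, not from the posts above:

```r
library(dplyr)
library(stringr)

# Two artist strings copied from the sample data
sample_df <- data.frame(
  artist = c("Adele", "Drake Featuring Future & Young Thug"),
  stringsAsFactors = FALSE
)

result <- sample_df %>%
  mutate(
    # Text before " Featuring"; the whole string when there is no feature
    primary_artist = ifelse(str_detect(artist, "Featuring"),
                            str_match(artist, "(.*)\\sFeaturing")[, 2],
                            artist),
    # Text after "Featuring "; NA when there is no feature
    featured_artist = str_match(artist, "Featuring\\s(.*)")[, 2]
  )

result$primary_artist    # "Adele" "Drake"
result$featured_artist   # NA "Future & Young Thug"
```

By comparison, str_split_fixed() returns a two-column matrix and leaves the spaces around "Featuring" in place, so its pieces may need str_trim() afterwards.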
