Selecting data until current month automatically

ksingh19 · October 9, 2020, 4:07pm

Hi,

I am interested in knowing if I can select the data automatically until the current month with any function or such. Current data will keep on changing and growing, but for initial processing, I need to consider data until current month, ie, until September for example this month and next time when I run the program, the data considered should be until October automatically, and so on.

Is there a way for this? Thanks for your help!

df <- data.frame(
          month = c("2020 Jan",
                    "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
                    "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
                    "2020 Dec"),
          sales = c(1343771,1582470,
                    1233360,1271090,1337158,1024617,969186,580039,
                    745976,588006,594815,549379)
)

GreyMerchant · October 9, 2020, 4:24pm

Hello @ksingh19,

Just want to check something with you first before proposing the full solution based on your data. So I understand you correctly that if we're now in October, we only need to go unto September which would mean 9 months correct? And then the same for when we go into November, we should only evaluate until October?

If so, then we can at least do the below to get the current month number. Yes?

library(lubridate)
library(tidyverse)

Sys.time() %>% lubridate::month() - 1
#> [1] 9

^{Created on 2020-10-09 by the reprex package (v0.3.0)}

ksingh19 · October 9, 2020, 4:44pm

Yes that is correct @GreyMerchant and the only reason we are collecting until one previous month is because we always have actual values until then.

Thank you!

GreyMerchant · October 9, 2020, 7:23pm

Let me know if this is the solution you had in mind

library(tidyverse)
library(lubridate)

df <- data.frame(
  month = c(
    "2020 Jan",
    "2020 Feb", "2020 Mar", "2020 Apr", "2020 May", "2020 Jun",
    "2020 Jul", "2020 Aug", "2020 Sep", "2020 Oct", "2020 Nov",
    "2020 Dec"
  ),
  sales = c(
    1343771, 1582470,
    1233360, 1271090, 1337158, 1024617, 969186, 580039,
    745976, 588006, 594815, 549379
  )
)

# Splitting your variable into year month
df_output <-
  df %>% separate(month, c("year", "month"), sep = " ")

# changing your abreviated month into a number
df_output <- df_output %>% mutate(month = month %>% unlist() %>% match(month.abb))

# creating our filter variable
filter_month <- Sys.time() %>% lubridate::month() - 1

# filter the data to the currect subset
df_filter <- df_output %>% filter(month <= filter_month)

df_filter
#>   year month   sales
#> 1 2020     1 1343771
#> 2 2020     2 1582470
#> 3 2020     3 1233360
#> 4 2020     4 1271090
#> 5 2020     5 1337158
#> 6 2020     6 1024617
#> 7 2020     7  969186
#> 8 2020     8  580039
#> 9 2020     9  745976

^{Created on 2020-10-09 by the reprex package (v0.3.0)}

ksingh19 · October 9, 2020, 8:19pm

Thank you so much @GreyMerchant! Yes , the result is right. But can we still perform this without splitting the month column at all because I still needed the result data in the same format as original format.

GreyMerchant · October 9, 2020, 8:23pm

There we are Feel free to mark it as the solution if it works.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- data.frame(
  month = c(
    "2020 Jan",
    "2020 Feb", "2020 Mar", "2020 Apr", "2020 May", "2020 Jun",
    "2020 Jul", "2020 Aug", "2020 Sep", "2020 Oct", "2020 Nov",
    "2020 Dec"
  ),
  sales = c(
    1343771, 1582470,
    1233360, 1271090, 1337158, 1024617, 969186, 580039,
    745976, 588006, 594815, 549379
  )
)

# Splitting your variable into year month
df_output <-
  df %>% separate(month, c("s_year", "s_month"), sep = " ",  remove = FALSE)

# changing your abreviated month into a number
df_output <- df_output %>% mutate(s_month = s_month %>% unlist() %>% match(month.abb))

# creating our filter variable
filter_month <- Sys.time() %>% lubridate::month() - 1

# filter the data to the currect subset
df_filter <- df_output %>% filter(s_month <= filter_month) %>% select(-c(s_month,s_year))

df_filter
#>      month   sales
#> 1 2020 Jan 1343771
#> 2 2020 Feb 1582470
#> 3 2020 Mar 1233360
#> 4 2020 Apr 1271090
#> 5 2020 May 1337158
#> 6 2020 Jun 1024617
#> 7 2020 Jul  969186
#> 8 2020 Aug  580039
#> 9 2020 Sep  745976

^{Created on 2020-10-09 by the reprex package (v0.3.0)}

ksingh19 · October 9, 2020, 8:46pm

Thank you so much @GreyMerchant! Everything looks good here. But it just shows the full data set for me. Not sure why, though.

GreyMerchant · October 9, 2020, 8:48pm

@ksingh19, I made several small changes to the original script. So just have a look

ksingh19 · October 9, 2020, 8:53pm

I am also trying just the following now as I don't need to split the month, but still nothing changes and the result shows the whole data.

# creating our filter variable
filter_month <- Sys.time() %>% lubridate::month() - 1

# filter the data to the currect subset
df_filter <- df %>% filter(month <= filter_month)

GreyMerchant · October 9, 2020, 8:56pm

No Look carefully. We are now not removing the original variable. We are retaining it by doing remove = FALSE.

You will see that I have created s_year and s_month now for the split columns. We then do all the steps on s_month and then we filter filter_month against s_month and then drop the s_year and s_month columns. Just copy my whole script over.

ksingh19 · October 9, 2020, 11:14pm

Thank you @GreyMerchant ! I should have seen this carefully earlier. It does work partially.

Initial Reprex data contains data until 2020 and my real data contains data for next few years. The result using your code now shows values until September 2020 and nothing else for 2020, but then starts back from Jan 2021 until next few years. Because data is quite large, I couldn't see that it worked until September 2020 and started from Jan next year. This time I am including reprex data until 2021, but the real data continues for next few years. I would like to have the subset to completely stop at September 2020 in this case.

df <- data.frame(
          month = c("2020 Jan",
                    "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
                    "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
                    "2020 Dec","2021 Jan","2021 Feb","2021 Mar","2021 Apr",
                    "2021 May","2021 Jun","2021 Jul","2021 Aug","2021 Sep",
                    "2021 Oct","2021 Nov","2021 Dec"),
          sales = c(1337158,1024617,969186,580039,
                    745976,588006,594815,549379,1277256,980282,853458,1007229,
                    1246084,1193005,1503203,1451290,1343771,1582470,
                    1233360,1271090,1337158,1024617,969186,580039)
)

Thank you!

ksingh19 · October 10, 2020, 3:59am

Thanks @GreyMerchant for giving information on Sys.time(). I was not aware of this. I am now able to use the following code and it works now the way we want the result. Thank you so much for the help again!

df <- df%>%
  separate(month, into = c("year", "month"))%>%
  mutate(date = paste0(`year`, "-", `month`, -1))%>%
  mutate(date = ymd(date))

df1 <- df %>% 
  filter(date <= Sys.Date() - 31)

GreyMerchant · October 10, 2020, 8:45am

Hello,

I am glad you got to a solution. I would not recommend this solution you're following though. Rather make use of the split s_year column to also do a filter check by year. See the new updated version based on your new data below:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- data.frame(
  month = c("2020 Jan",
            "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
            "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
            "2020 Dec","2021 Jan","2021 Feb","2021 Mar","2021 Apr",
            "2021 May","2021 Jun","2021 Jul","2021 Aug","2021 Sep",
            "2021 Oct","2021 Nov","2021 Dec"),
  sales = c(1337158,1024617,969186,580039,
            745976,588006,594815,549379,1277256,980282,853458,1007229,
            1246084,1193005,1503203,1451290,1343771,1582470,
            1233360,1271090,1337158,1024617,969186,580039)
)

# Splitting your variable into year month
df_output <-
  df %>% separate(month, c("s_year", "s_month"), sep = " ",  remove = FALSE) %>% 
  mutate(s_year = as.numeric(s_year))

# changing your abreviated month into a number
df_output <- df_output %>% mutate(s_month = s_month %>% unlist() %>% match(month.abb))

# creating our filter variables
filter_month <- Sys.time() %>% lubridate::month() - 1
filter_year <- Sys.time() %>% lubridate::year() 

# filter the data to the currect subset
df_filter <- df_output %>% filter(s_month <= filter_month & s_year == filter_year) %>% select(-c(s_month,s_year))

df_filter
#>      month   sales
#> 1 2020 Jan 1337158
#> 2 2020 Feb 1024617
#> 3 2020 Mar  969186
#> 4 2020 Apr  580039
#> 5 2020 May  745976
#> 6 2020 Jun  588006
#> 7 2020 Jul  594815
#> 8 2020 Aug  549379
#> 9 2020 Sep 1277256

^{Created on 2020-10-10 by the reprex package (v0.3.0)}

ksingh19 · October 10, 2020, 9:58pm

Thanks @GreyMerchant! It works now and I have replaced my code with this one. Is there any reason why the other solution is not recommended. Thanks again for your help!

GreyMerchant · October 11, 2020, 10:41am

@ksingh19 most welcome . With the constant of -31 you will run into problems where if you do the run on a specific day within a month you might extract the wrong set depending on if that month has 30,31 or 28 days. In this new version you guarantee that you're always explicitly testing for the two criteria that matter.

ksingh19 · October 11, 2020, 5:15pm

Thanks! It works great! But it shows data only for 2020 and not previous years. So I have kept everything like you suggested with minor change below and it works brilliantly.

# filter the data to the currect subset
df_filter <- df_output %>% filter(s_month <= filter_month & s_year <= filter_year) %>% select(-c(s_month,s_year))

Thanks again!

ksingh19 · October 11, 2020, 8:29pm

Sorry! I take it back. It does work partially still. For previous years, it stops at September, but we want to see full years for all previous years and until September 2020, removing all future months and years

GreyMerchant · October 12, 2020, 8:04am

@ksingh19, this is why it is important to as clearly as possible frame the problem. In this instance I propose the below. You will see that the constant now updates based on the current day (so it is 12 today given it is the 12th and not 31)

library(tidyverse)
library(lubridate)


df <- data.frame(
  month = c("2020 Jan",
            "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
            "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
            "2020 Dec","2021 Jan","2021 Feb","2021 Mar","2021 Apr",
            "2021 May","2021 Jun","2021 Jul","2021 Aug","2021 Sep",
            "2021 Oct","2021 Nov","2021 Dec"),
  sales = c(1337158,1024617,969186,580039,
            745976,588006,594815,549379,1277256,980282,853458,1007229,
            1246084,1193005,1503203,1451290,1343771,1582470,
            1233360,1271090,1337158,1024617,969186,580039)
)





df <- df%>%
  separate(month, into = c("year", "month"))%>%
  mutate(date = paste0(`year`, "-", `month`, -1))%>%
  mutate(date = ymd(date))


filter_day <- Sys.time() %>% lubridate::day()

df1 <- df %>% 
  filter(date <= Sys.Date() - 31) 

df1
#>   year month   sales       date
#> 1 2020   Jan 1337158 2020-01-01
#> 2 2020   Feb 1024617 2020-02-01
#> 3 2020   Mar  969186 2020-03-01
#> 4 2020   Apr  580039 2020-04-01
#> 5 2020   May  745976 2020-05-01
#> 6 2020   Jun  588006 2020-06-01
#> 7 2020   Jul  594815 2020-07-01
#> 8 2020   Aug  549379 2020-08-01
#> 9 2020   Sep 1277256 2020-09-01


df2 <- df %>% 
  filter(date <= Sys.Date() - filter_day) 

df2
#>   year month   sales       date
#> 1 2020   Jan 1337158 2020-01-01
#> 2 2020   Feb 1024617 2020-02-01
#> 3 2020   Mar  969186 2020-03-01
#> 4 2020   Apr  580039 2020-04-01
#> 5 2020   May  745976 2020-05-01
#> 6 2020   Jun  588006 2020-06-01
#> 7 2020   Jul  594815 2020-07-01
#> 8 2020   Aug  549379 2020-08-01
#> 9 2020   Sep 1277256 2020-09-01

ksingh19 · October 12, 2020, 5:07pm

Thanks @GreyMerchant! I am a little lost with day this time. df1 is same as what I proposed I believe unless I am missing something here. df2 is what I want to focus on. How does this work again? What I have learnt so far with your help is that Sys.time() is more important, I believe. Thanks again for your help!

system · October 19, 2020, 5:07pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.