Create turnover point variables

idsflitsie · May 27, 2022, 1:46pm

I am trying to recreate the following picture, specifically the 2 variables "Last year of tenureship" and "First year of tenureship". My question consists of two parts, Firstly, the thing is that for a specific company, let's say company A, i want to highlight the first year of the tenureship with their client, and the last year of the tenureship with their client, to calculate the turnover point. If it is not the first or last year, the value is 0 (i was thinking of the mutate and if_else function?).
Secondly, the datas i am using is from 1999 to 2021. If i have like in the picture company C, which only has data from 2005 until 2018, the "first year value" cannot have the value 1 in 2005, but instead 0, as for this company, the data is only available from 2005, and not from the start (1999) , so i cannot measure the tenureship due to a lack of data (in executive, we don't now with who the company was working with in 2004, so therefore, the "first year variable" cannot be equal to 1 in 2005). Same goes for the year 2018, where the data stops again, so before 2021. I would like to give it also the value 0 instead of 1 for the last year variable, as we do not have the information about the year of 2019. Can someone help me out a bit with this? I have explained it a little bit in the image as well, so hopefully it clarifies it a little bit more. I do not know it this is possible to recreate the second part of my question. Thanks in advance already!

https://imgur.com/a/wr3HYGN

EDIT: After the comment from nirgrahamuk, i also am writing the code over here

df <- data.frame(
stringsAsFactors = FALSE,
year = c(1999L,2000L,
2001L,2002L,2003L,2004L,2005L,2006L,2007L,
2008L,2009L,2010L,2011L,2012L,2013L,2014L,
2015L,2016L,2017L,2018L,2019L,2020L,2021L,
1999L,2000L,2001L,2002L,2003L,2004L,2005L,
2006L,2007L,2008L,2009L,2010L,2011L,2012L,
2013L,2014L,2015L,2016L,2017L,2018L,2019L,
2020L,2021L,2005L,2006L,2007L,2008L,2009L,
2010L,2011L,2012L,2013L,2014L,2015L,2016L,
2017L,2018L),
Company = c("A","A","A",
"A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","C","C","C","C","C","C",
"C","C","C","C","C","C","C","C"),
clientnumber = c(1,1,1,1,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,
3,3,3,3,3,3,3,4,1,1,1,2,2,2,2,2,
2,2,2,2,2,2)
)

nirgrahamuk · May 27, 2022, 3:04pm

Hello,
I'm sure you shared this image with the best intentions, but perhaps you didnt realise what it implies.
If someone wished to use example data to test code against, they would type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore please see this guide on how to reprex data. Key to this is use of either datapasta, or dput() to share your data as code

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

idsflitsie · May 27, 2022, 4:12pm

df <- data.frame(
  stringsAsFactors = FALSE,
                      year = c(1999L,2000L,
                               2001L,2002L,2003L,2004L,2005L,2006L,2007L,
                               2008L,2009L,2010L,2011L,2012L,2013L,2014L,
                               2015L,2016L,2017L,2018L,2019L,2020L,2021L,
                               1999L,2000L,2001L,2002L,2003L,2004L,2005L,
                               2006L,2007L,2008L,2009L,2010L,2011L,2012L,
                               2013L,2014L,2015L,2016L,2017L,2018L,2019L,
                               2020L,2021L,2005L,2006L,2007L,2008L,2009L,
                               2010L,2011L,2012L,2013L,2014L,2015L,2016L,
                               2017L,2018L),
                   Company = c("A","A","A",
                               "A","A","A","A","A","A","A","A","A","A",
                               "A","A","A","A","A","A","A","A","A","A",
                               "B","B","B","B","B","B","B","B","B",
                               "B","B","B","B","B","B","B","B","B","B",
                               "B","B","B","B","C","C","C","C","C","C",
                               "C","C","C","C","C","C","C","C"),
              clientnumber = c(1,1,1,1,2,
                               2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
                               2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,
                               3,3,3,3,3,3,3,4,1,1,1,2,2,2,2,2,
                               2,2,2,2,2,2)
        )

You mean like this? Sorry, i am quite new to this website, and also quite new to rstudio. I will also update the main post. I by the way do not know why there is a L after every year, this happened when i used dpasta for some reason. Thanks for the comment, noted

nirgrahamuk · May 27, 2022, 6:01pm

It seems important to know the first year that each company has data for, as it matters if its 1999 or after....
so calculate that and then bring it in to use it.

(company_bounds_df <- df %>% group_by(Company) %>%
  summarise(first_year_data=min(year),
            last_year_data = max(year)))


(df2 <- left_join(
  df,
  company_bounds_df
) %>% group_by(Company, clientnumber) %>% mutate(
  first_year_ten = as.integer(
    year == min(year) & (year != first_year_data | first_year_data==1999)
)) )%>% print(n=100)

I'm not sure I understood your logic for first year of tenure, I think it should be grouped at company/client level. then for each year, its got to be the earliest year in order to get a 1 , and the exceptions are if its the first year for which that company has had data for .... unless that year is 1999

idsflitsie · May 30, 2022, 1:38pm

Thanks! This was indeed where i was looking for, and i have played a bit around with your script, and got totally what i wanted to recreate. I still have one problem where i am running into right now, it is where i am creating a new variable which calculates the length of the partnership between the company and the clientnumber. If again, as is happening with company C, there is missing data before 2005, the value gets 0 or NA (until there is a change of clientnumber, then it starts counting from the start again) as we cannot know the length of the partnership due to lack of information. This is what i am trying to recreate ( i have put this in manually):

Summary

df <- data.frame(
stringsAsFactors = FALSE,
year = c(1999L,2000L,2001L,2002L,
2003L,2004L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L,
2019L,2020L,2021L,1999L,2000L,2001L,2002L,2003L,
2004L,2005L,2006L,2007L,2008L,2009L,2010L,2011L,
2012L,2013L,2014L,2015L,2016L,2017L,2018L,2019L,
2020L,2021L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L),
Company = c("A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","C","C","C","C","C",
"C","C","C","C","C","C","C","C","C"),
clientnumber = c(1,1,1,1,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,1,
1,1,2,2,2,2,2,2,2,2,2,2,2),
first_year_data = c(1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,2005L,2005L,2005L,2005L,2005L,2005L,
2005L,2005L,2005L,2005L,2005L,2005L,2005L,2005L),
last_year_data = c(2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2018L,2018L,2018L,2018L,2018L,2018L,
2018L,2018L,2018L,2018L,2018L,2018L,2018L,2018L),
last_year_tenure = c(0L,0L,0L,1L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,1L,0L,0L,1L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
first_year_tenure = c(1L,0L,0L,0L,1L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,0L,0L,0L,
1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
tenure = c(1,2,3,4,1:19,1,2,3,1:19,1,0,0,0,1:11)

)

I tried this using the row_number() formula, but it gave me this:

Summary

df data.frame(
stringsAsFactors = FALSE,
year = c(1999L,2000L,2001L,2002L,
2003L,2004L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L,
2019L,2020L,2021L,1999L,2000L,2001L,2002L,2003L,
2004L,2005L,2006L,2007L,2008L,2009L,2010L,2011L,
2012L,2013L,2014L,2015L,2016L,2017L,2018L,2019L,
2020L,2021L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L),
Company = c("A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","C","C","C","C","C",
"C","C","C","C","C","C","C","C","C"),
clientnumber = c(1,1,1,1,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,1,
1,1,2,2,2,2,2,2,2,2,2,2,2),
first_year_data = c(1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,2005L,2005L,2005L,2005L,2005L,2005L,
2005L,2005L,2005L,2005L,2005L,2005L,2005L,2005L),
last_year_data = c(2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2018L,2018L,2018L,2018L,2018L,2018L,
2018L,2018L,2018L,2018L,2018L,2018L,2018L,2018L),
last_year_ten = c(0L,0L,0L,1L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,1L,0L,0L,1L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
first_year_tenure = c(1L,0L,0L,0L,1L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,0L,0L,0L,
1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
tenure = c(1L,0L,0L,0L,0L,1L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,1L,0L,0L,0L,1L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,0L)
)

What am i doing wrong here? I used this code (changed up your code a bit, as for a beginner like myself this gives me more overview and clarity for myself)

df2 <- left_join(df,company_bounds_df) %>% group_by(Company, clientnumber)
df2 <- df2 %>% mutate(last_year_ten = as.integer(year == max(year) & (year != last_year_data | last_year_data==2021)))
df2 <- df2 %>% mutate(first_year_tenure = as.integer(year == min(year) & (year != first_year_data | first_year_data==1999)))
df2 <- df2 %>% group_by(Company, clientnumber) %>% mutate(tenure = as.integer(clientnumber == row_number() & (year != first_year_data | first_year_data==1999)))

I would guess having something with the place of the row_number(), but i am not sure how i could fix this.

system · June 20, 2022, 1:38pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.