Show data as node to node based on key match

abhi_jain80 · May 5, 2022, 7:24am

I am dealing with complex dataset with millions of rows. This dataset is a combination of shipments from plant to plant. Movement Types 555 & 666 are outbound shipments and 100 are inbound shipments.

Here is the sample of the dataset:

Tx_ref	Movement_Type	Quantity	Delivery_Item	Delivery_Code	Date	Description	Plant	Region
1	555	100	A	123	04/01/2022	DescA123	Plant1	EMEA
2	555	200	A	123	10/01/2022	DescA123	Plant1	EMEA
3	100	300	A	123	11/02/2022	DescA123	Plant2	LATAM
4	555	10	B	256	04/03/2022	DescB256	Plant1	EMEA
5	555	10	B	256	04/03/2022	DescB256	Plant1	EMEA
6	666	300	A	123	22/02/2022	DescA123	Plant2	LATAM
7	100	20	B	256	22/03/2022	DescB256	Plant3	APAC
8	100	300	A	123	12/04/2022	DescA123	Plant3	APAC
9	666	5	C	123	12/04/2022	DescC123	Plant1	EMEA
10	555	5	D	123	12/04/2022	DescD123	Plant1	EMEA
11	666	5	C	123	12/04/2022	DescC123	Plant1	EMEA
12	100	5	C	123	20/04/2022	DescC123	Plant2	LATAM

For outbound transactions, the date/plant/region refers the shipping date/plant/region and for inbound transactions, the date/plant/region refers the receiving date/plant/region.
The key to matching the shipment to the receipt is by the combination of Delivery_Item and the Delivery_Code.

I need to show the data as node to node i.e. starting from shipping plant to end receiving plant. For example: 100 shipped from plant1 on 04/01, 200 shipped from plant1 on 10/01, 300 received at plant2 on 11/02, again this 300 shipped from plant2 on 22/02 and received at plant3 on 12/04. Over here, plant2 is the intermediary plant which I don't want to show. plant1 is the starting plant and plant3 is the end plant.

This is the output I required from this sample data. Just wondering how could I do:

Delivery_Item	Delivery_Code	Description	Start_Plant	Start_Region	End_Plant	End_Region	Start_Date	End_Date	Total Qty Shipped	Total Qty Received
A	123	DescA123	Plant1	EMEA	Plant3	APAC	04/01/2022	12/04/2022	300	300
B	256	DescB256	Plant1	EMEA	Plant3	APAC	04/03/2022	22/03/2022	20	20
C	123	DescC123	Plant1	EMEA	Plant2	LATAM	12/04/2022	20/04/2022	10	5

Thanks in advance....

nirgrahamuk · May 5, 2022, 8:29am

what have you tried so far ? where do you get stuck ?
some initial things you could think about doing.
You have 3 movement codes, but you interpret them as two types (important for structuring your data)
so write some code to make a 2 level movement category.
The relationships between entries in your table are defined by matching Delivery_Item and Delivery_Codes, therefore in the tidyverse/dplyr world, we might start by grouping on these , and then summarising.

abhi_jain80 · May 5, 2022, 9:42am

To start with, I have used the code from one of the previous topic;

Now I have to find lookup the first date/plant/region matching Delivery_Item and Delivery_Code and aggregate the shipping/receiving quantities. Got stuck over here. Hope I am going in the right direction.

library(tidyverse)

df1 <- read_csv('sample.csv')

df1 %>% 
  select(-c(Tx_ref)) %>% 
  mutate(Movement_Type = case_when(
    (Movement_Type == 555 | Movement_Type == 666)  ~ 'Outbound',
    Movement_Type == 100 ~ 'Inbound',
    TRUE ~ 'NA')
  ) %>% pivot_wider(
    values_fn = list,
    names_from = 'Movement_Type',
    values_from = c('Date','Quantity')
  ) %>% unnest(matches("Outbound|Inbound"))

nirgrahamuk · May 5, 2022, 10:16am

I think you should add Plant and region to the values_from as you seemed to want to know which plants are inbound and which are outbound to pick them for the start and end etc.

Need to think about in theory what logic could be used to determine what plant is intermediary vs a start or end plant ?
also its not clear to me whether there could be / would be multiple starts and ends for a delivery, or guaranteed just one for those roles ?

abhi_jain80 · May 5, 2022, 10:36am

Yes, there could be multiple starts or ends for a delivery but need to consider only the first start and last end matching Delivery_Item and Delivery_Code.
If there is any intermediary plant, need to be ignored.

nirgrahamuk · May 5, 2022, 10:36am

A hint for determining what info to pick from which records is to interpret the interim for terminal nodes by what seems to me the logic of if it sends and receives then its in the middle / interim.

df2 %>% arrange(Delivery_Item,
                     Delivery_Code) %>% rowwise() %>% 
  mutate(interim_or_term = 
           case_when(any(is.na(c(Date_Outbound,Date_Inbound))) ~ 'term',
                     TRUE ~ 'interim')) %>% relocate(interim_or_term)

system · May 26, 2022, 10:37am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.