I am dealing with complex dataset with millions of rows. This dataset is a combination of shipments from plant to plant. Movement Types 555 & 666 are outbound shipments and 100 are inbound shipments.
Here is the sample of the dataset:
Tx_ref
Movement_Type
Quantity
Delivery_Item
Delivery_Code
Date
Description
Plant
Region
1
555
100
A
123
04/01/2022
DescA123
Plant1
EMEA
2
555
200
A
123
10/01/2022
DescA123
Plant1
EMEA
3
100
300
A
123
11/02/2022
DescA123
Plant2
LATAM
4
555
10
B
256
04/03/2022
DescB256
Plant1
EMEA
5
555
10
B
256
04/03/2022
DescB256
Plant1
EMEA
6
666
300
A
123
22/02/2022
DescA123
Plant2
LATAM
7
100
20
B
256
22/03/2022
DescB256
Plant3
APAC
8
100
300
A
123
12/04/2022
DescA123
Plant3
APAC
9
666
5
C
123
12/04/2022
DescC123
Plant1
EMEA
10
555
5
D
123
12/04/2022
DescD123
Plant1
EMEA
11
666
5
C
123
12/04/2022
DescC123
Plant1
EMEA
12
100
5
C
123
20/04/2022
DescC123
Plant2
LATAM
For outbound transactions, the date/plant/region refers the shipping date/plant/region and for inbound transactions, the date/plant/region refers the receiving date/plant/region.
The key to matching the shipment to the receipt is by the combination of Delivery_Item and the Delivery_Code.
I need to show the data as node to node i.e. starting from shipping plant to end receiving plant. For example: 100 shipped from plant1 on 04/01, 200 shipped from plant1 on 10/01, 300 received at plant2 on 11/02, again this 300 shipped from plant2 on 22/02 and received at plant3 on 12/04. Over here, plant2 is the intermediary plant which I don't want to show. plant1 is the starting plant and plant3 is the end plant.
This is the output I required from this sample data. Just wondering how could I do:
what have you tried so far ? where do you get stuck ?
some initial things you could think about doing.
You have 3 movement codes, but you interpret them as two types (important for structuring your data)
so write some code to make a 2 level movement category.
The relationships between entries in your table are defined by matching Delivery_Item and Delivery_Codes, therefore in the tidyverse/dplyr world, we might start by grouping on these , and then summarising.
To start with, I have used the code from one of the previous topic;
Now I have to find lookup the first date/plant/region matching Delivery_Item and Delivery_Code and aggregate the shipping/receiving quantities. Got stuck over here. Hope I am going in the right direction.
I think you should add Plant and region to the values_from as you seemed to want to know which plants are inbound and which are outbound to pick them for the start and end etc.
Need to think about in theory what logic could be used to determine what plant is intermediary vs a start or end plant ?
also its not clear to me whether there could be / would be multiple starts and ends for a delivery, or guaranteed just one for those roles ?
Yes, there could be multiple starts or ends for a delivery but need to consider only the first start and last end matching Delivery_Item and Delivery_Code.
If there is any intermediary plant, need to be ignored.
A hint for determining what info to pick from which records is to interpret the interim for terminal nodes by what seems to me the logic of if it sends and receives then its in the middle / interim.