Merging Two Datasets & Codes to "Validate" Certain Values in a Dataset.

HTP · March 1, 2022, 12:45am

Good day everyone!

I am a "newbie" in R (I am one of those Excel plebs trying to learn R, haha) and was wondering if people could assist me with two problems that I'm trying to solve. I reckon that the first problem is going to be super simple for everyone here, but the second one might be tricky.

Please note that you do not have to answer all questions if you feel unsure about something. Partial answers are perfectly fine since they will still give me some starting points. Again, I am new to R, so please be as descriptive as possible to help me understand and learn.

It should also be noted that the actual datasets I'm working with are a lot bigger and more complex than the example I'm using below. The example only shows 10 units in two regions (region 1 & 2) with four fuel categories (liquid fuel, wind, mixed & solar) and two liquid fuel types (gasoline & diesel), but the actual data has over 100 units across multiple regions that use various types of fuels. Therefore, the codes/solution would have to be as "general/flexible" as possible and should be able to be used for much larger datasets. (I feel like I'm overexplaining here, but the solution shouldn't be too "rigid" to the point that it can ONLY be used to address the examples below).

Problem 1:
I've imported two datasets (Data A and Data B). Each dataset has unique information about ten heat generating units in two regions. Specifically, Data A contains information regarding the location of each unit, general category of fuel it uses, and the fuel cost. Data B has additional information about specific fuel types and fuel fractions for liquid and mixed fuel units. Therefore,

Data A:

Unit ID	Region	Fuel	Fuel Cost ($)
Unit 1	1	Liquid Fuel	200
Unit 2	2	Wind	0
Unit 3	1	Mixed	168
Unit 4	1	Liquid Fuel	120
Unit 5	2	Liquid Fuel	200
Unit 6	2	Mixed	183.5
Unit 7	1	Solar	0
Unit 8	2	Liquid Fuel	90
Unit 9	1	Liquid Fuel	200
Unit 10	2	Liquid Fuel	90

Data B:

Unit ID	Fuel Type	Fuel Fraction
Unit 1	Gasoline	1
Unit 3	Gasoline	0.6
Unit 3	Diesel	0.4
Unit 4	Diesel	1
Unit 5	Gasoline	1
Unit 6	Gasoline	0.85
Unit 6	Diesel	0.15
Unit 8	Diesel	1
Unit 9	Gasoline	1
Unit 10	Diesel	1

Note that Data B does not have any information about Unit 2 and Unit 7 in Data A because they use wind and solar, respectively. Also, Unit 3 and 6 have two rows each because they mix different types of fuels (i.e., Unit 3 uses 60% gasoline and 40% diesel).

I wanted to merge these two datasets by Unit ID and create something like this (Data C):

Unit ID	Region	Fuel	Fuel Cost ($)	Fuel Type	Fuel Fraction
Unit 1	1	Liquid Fuel	200	Gasoline	1
Unit 2	2	Wind	0
Unit 3	1	Mixed	168	Gasoline	0.6
Unit 3	1	Mixed	168	Diesel	0.4
Unit 4	1	Liquid Fuel	120	Diesel	1
Unit 5	2	Liquid Fuel	200	Gasoline	1
Unit 6	2	Mixed	183.5	Gasoline	0.85
Unit 6	2	Mixed	183.5	Diesel	0.15
Unit 7	1	Solar	0
Unit 8	2	Liquid Fuel	90	Diesel	1
Unit 9	1	Liquid Fuel	200	Gasoline	1
Unit 10	2	Liquid Fuel	90	Diesel	1

In order to do this, I've used the merge function and wrote this:

DataC<- merge(DataA, DataB, by = "Unit ID")

But the result that I got was:

Unit ID	Region	Fuel	Fuel Cost ($)	Fuel Type	Fuel Fraction
Unit 1	1	Liquid Fuel	200	Gasoline	1
Unit 3	1	Mixed	168	Gasoline	0.6
Unit 3	1	Mixed	168	Diesel	0.4
Unit 4	1	Liquid Fuel	120	Diesel	1
Unit 5	2	Liquid Fuel	200	Gasoline	1
Unit 6	2	Mixed	183.5	Gasoline	0.85
Unit 6	2	Mixed	183.5	Diesel	0.15
Unit 8	2	Liquid Fuel	90	Diesel	1
Unit 9	1	Liquid Fuel	200	Gasoline	1
Unit 10	2	Liquid Fuel	90	Diesel	1

The issue with this is that the function removes Unit 2 and Unit 7 from the output because Unit 2 and 7 in Data A do not have corresponding/matching Unit IDs in Data B.

So anybody know how to solve this? I feel like the solution would be super simple, but please let me know what you think!

Problem 2:

Now, suppose that we found a way to make Data C looking like the way that I wanted. That is,

Data C:

Unit ID	Region	Fuel	Fuel Cost ($)	Fuel Type	Fuel Fraction
Unit 1	1	Liquid Fuel	200	Gasoline	1
Unit 2	2	Wind	0
Unit 3	1	Mixed	168	Gasoline	0.6
Unit 3	1	Mixed	168	Diesel	0.4
Unit 4	1	Liquid Fuel	120	Diesel	1
Unit 5	2	Liquid Fuel	200	Gasoline	1
Unit 6	2	Mixed	183.5	Gasoline	0.85
Unit 6	2	Mixed	183.5	Diesel	0.15
Unit 7	1	Solar	0
Unit 8	2	Liquid Fuel	90	Diesel	1
Unit 9	1	Liquid Fuel	200	Gasoline	1
Unit 10	2	Liquid Fuel	90	Diesel	1

Now, I wanna come up with a chunk of codes that will "validate" the "Fuel Cost ($)" values for each unit. In other words, I wanna check whether the fuel cost information in Data C match/satisfy certain conditions. These conditions are:

Gasoline units spend $200 for fuel
Diesel units in region 1 spend $120 for fuel
Diesel units in region 2 spends $90 for fuel because diesel is cheaper there
For "mixed" units, the fuel costs depend on the fraction of fuels that they use.
For example, Unit 3 uses 60% gasoline and 40% diesel, so the fuel cost is going to be: ($200 x 60%) + ($120 x 40%) = $168.
Similarly, Unit 6 is in region 2 and uses 85% gasoline and 15% diesel, so the fuel cost is going to be: ($200 x 85%) + ($90 x 15%) = $183.5
Solar or wind units spends $0 for fuel (these units have "blank" fuel type and fuel fraction).

So, for example, if the fuel cost for Unit 4 is $140 instead of $120 (2nd condition), the code should flag it by saying "No" under "Validation" column. Please see the chart below:

Unit ID	Region	Fuel	Fuel Cost ($)	Fuel Type	Fuel Fraction (%)	Validation
Unit 1	1	Liquid Fuel	200	Gasoline	1	Yes
Unit 2	2	Wind	0			Yes
Unit 3	1	Mixed	168	Gasoline	0.6	Yes
Unit 3	1	Mixed	168	Diesel	0.4	Yes
Unit 4	1	Liquid Fuel	140	Diesel	1	No
Unit 5	2	Liquid Fuel	200	Gasoline	1	Yes
Unit 6	2	Mixed	183.5	Gasoline	0.85	Yes
Unit 6	2	Mixed	183.5	Diesel	0.15	Yes
Unit 7	1	Solar	0			Yes
Unit 8	2	Liquid Fuel	90	Diesel	1	Yes
Unit 9	1	Liquid Fuel	200	Gasoline	1	Yes
Unit 10	2	Liquid Fuel	90	Diesel	1	Yes

I was thinking maybe I need to use the if...else... function? I tried to figure it out, but I wasn't sure how to approach it due to my lack of experience with R (maybe I am a bit too ambitious with the stuff that I'm trying to do....). Please let me know how I should approach this!

I reckon that the most difficult part of this would be dealing with units that mix different fuels because their fuel costs depend on multiple factors (types of fuel, fuel fraction, and location/region). If it's way too complicated, we can just focus on validating units that do not mix fuels for now.

Thank you for reading my long questions.

FJCC · March 1, 2022, 2:09am

For the first question, you have to set the argument all.x of merge() to TRUE.

DFA <- data.frame(Unit = 1:3, Value = c(2,4,6))
> DFB <- data.frame(Unit = c(1,3), Value = c(12,16))
> DFA
  Unit Value
1    1     2
2    2     4
3    3     6
> DFB
  Unit Value
1    1    12
2    3    16
> merge(DFA, DFB, all.x = TRUE, by = "Unit")
  Unit Value.x Value.y
1    1       2      12
2    2       4      NA
3    3       6      16

FJCC · March 1, 2022, 2:49am

Here is a solution to your Problem 2. Instead of nested ifelse statements, I used case_when. That provides a much cleaner syntax for that kind of logic. The left side consists of a logical test and the right side has values. The first test that evaluates to TRUE provides the value.
The mutate function adds new columns or modifies existing columns. I first calculate the calculated cost as Fuel_Fraction times the CostCheck. I use coalesce() to provide a 0 if Fuel_Fraction is NA. Then for each Unit I sum the calculated cost and name that ValidCost. This will be the sum of one number unless there is mixed fuel. Finally, I compare the ValidCost to the Fuel_Cost, returning TRUE or FALSE.
Note that I changed column names to not have spaces or special characters. I also removed spaces from the unit ID but that was only to ease my translation of your table into a csv file. Rather than posting a table, it is better to define your data as a data frame or tibble and then use the dput() function to post code that others can copy and use to recreate your data quickly.

library(dplyr)
DFC <- read.csv("~/R/Play/Dummy.csv")
DFC
#>    Unit_ID Region        Fuel Fuel_Cost Fuel_Type Fuel_Fraction
#> 1    Unit1      1 Liquid Fuel     200.0  Gasoline          1.00
#> 2    Unit2      2        Wind       0.0                      NA
#> 3    Unit3      1       Mixed     168.0  Gasoline          0.60
#> 4    Unit3      1       Mixed     168.0    Diesel          0.40
#> 5    Unit4      1 Liquid Fuel     120.0    Diesel          1.00
#> 6    Unit5      2 Liquid Fuel     200.0  Gasoline          1.00
#> 7    Unit6      2       Mixed     183.5  Gasoline          0.85
#> 8    Unit6      2       Mixed     183.5    Diesel          0.15
#> 9    Unit7      1       Solar       0.0                      NA
#> 10   Unit8      2 Liquid Fuel      90.0    Diesel          1.00
#> 11   Unit9      1 Liquid Fuel     200.0  Gasoline          1.00
#> 12  Unit10      2  LiquidFuel      90.0    Diesel          1.00
DFD <- DFC |> mutate(CostCheck = case_when(
  Fuel %in% c("Wind", "Solar") ~ 0,
  Fuel_Type == "Gasoline" ~ 200,
  Fuel_Type == "Diesel" & Region == 1 ~ 120,
  Fuel_Type == "Diesel" & Region == 2 ~ 90
)) |> mutate(CalcCost = coalesce(Fuel_Fraction,0) * CostCheck) |> 
  group_by(Unit_ID) |> 
  mutate(ValidCost = sum(CalcCost),
         Validation = Fuel_Cost == ValidCost)

^{Created on 2022-02-28 by the reprex package (v2.0.1)}

HTP · March 1, 2022, 3:14pm

Thank you so much for both of your answers!

If I can't change column names to not have spaces or special characters (i.e., "Fuel Fraction" instead of "Fuel_Fraction"), how should I modify the code for the second problem? I tried putting them in quotation marks, but it's not working.

FJCC · March 1, 2022, 3:43pm

The function that reads in the data may change the names automatically unless you override that.

Putting them in back ticks should work. That key is to the left of the 1 key on a US keyboard.

I would be tempted to change the column names after reading in the data just to save trouble. You can change them back if you have to write out data.

system · March 8, 2022, 3:43pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.