Good day everyone!
I am a "newbie" in R (I am one of those Excel plebs trying to learn R, haha) and was wondering if people could assist me with two problems that I'm trying to solve. I reckon that the first problem is going to be super simple for everyone here, but the second one might be tricky.
Please note that you do not have to answer all questions if you feel unsure about something. Partial answers are perfectly fine since they will still give me some starting points. Again, I am new to R, so please be as descriptive as possible to help me understand and learn.
It should also be noted that the actual datasets I'm working with are a lot bigger and more complex than the example I'm using below. The example only shows 10 units in two regions (region 1 & 2) with four fuel categories (liquid fuel, wind, mixed & solar) and two liquid fuel types (gasoline & diesel), but the actual data has over 100 units across multiple regions that use various types of fuels. Therefore, the codes/solution would have to be as "general/flexible" as possible and should be able to be used for much larger datasets. (I feel like I'm overexplaining here, but the solution shouldn't be too "rigid" to the point that it can ONLY be used to address the examples below).
Problem 1:
I've imported two datasets (Data A and Data B). Each dataset has unique information about ten heat generating units in two regions. Specifically, Data A contains information regarding the location of each unit, general category of fuel it uses, and the fuel cost. Data B has additional information about specific fuel types and fuel fractions for liquid and mixed fuel units. Therefore,
Data A:
Unit ID | Region | Fuel | Fuel Cost ($) |
---|---|---|---|
Unit 1 | 1 | Liquid Fuel | 200 |
Unit 2 | 2 | Wind | 0 |
Unit 3 | 1 | Mixed | 168 |
Unit 4 | 1 | Liquid Fuel | 120 |
Unit 5 | 2 | Liquid Fuel | 200 |
Unit 6 | 2 | Mixed | 183.5 |
Unit 7 | 1 | Solar | 0 |
Unit 8 | 2 | Liquid Fuel | 90 |
Unit 9 | 1 | Liquid Fuel | 200 |
Unit 10 | 2 | Liquid Fuel | 90 |
Data B:
Unit ID | Fuel Type | Fuel Fraction |
---|---|---|
Unit 1 | Gasoline | 1 |
Unit 3 | Gasoline | 0.6 |
Unit 3 | Diesel | 0.4 |
Unit 4 | Diesel | 1 |
Unit 5 | Gasoline | 1 |
Unit 6 | Gasoline | 0.85 |
Unit 6 | Diesel | 0.15 |
Unit 8 | Diesel | 1 |
Unit 9 | Gasoline | 1 |
Unit 10 | Diesel | 1 |
Note that Data B does not have any information about Unit 2 and Unit 7 in Data A because they use wind and solar, respectively. Also, Unit 3 and 6 have two rows each because they mix different types of fuels (i.e., Unit 3 uses 60% gasoline and 40% diesel).
I wanted to merge these two datasets by Unit ID and create something like this (Data C):
Unit ID | Region | Fuel | Fuel Cost ($) | Fuel Type | Fuel Fraction |
---|---|---|---|---|---|
Unit 1 | 1 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 2 | 2 | Wind | 0 | ||
Unit 3 | 1 | Mixed | 168 | Gasoline | 0.6 |
Unit 3 | 1 | Mixed | 168 | Diesel | 0.4 |
Unit 4 | 1 | Liquid Fuel | 120 | Diesel | 1 |
Unit 5 | 2 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 6 | 2 | Mixed | 183.5 | Gasoline | 0.85 |
Unit 6 | 2 | Mixed | 183.5 | Diesel | 0.15 |
Unit 7 | 1 | Solar | 0 | ||
Unit 8 | 2 | Liquid Fuel | 90 | Diesel | 1 |
Unit 9 | 1 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 10 | 2 | Liquid Fuel | 90 | Diesel | 1 |
In order to do this, I've used the merge function and wrote this:
DataC<- merge(DataA, DataB, by = "Unit ID")
But the result that I got was:
Unit ID | Region | Fuel | Fuel Cost ($) | Fuel Type | Fuel Fraction |
---|---|---|---|---|---|
Unit 1 | 1 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 3 | 1 | Mixed | 168 | Gasoline | 0.6 |
Unit 3 | 1 | Mixed | 168 | Diesel | 0.4 |
Unit 4 | 1 | Liquid Fuel | 120 | Diesel | 1 |
Unit 5 | 2 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 6 | 2 | Mixed | 183.5 | Gasoline | 0.85 |
Unit 6 | 2 | Mixed | 183.5 | Diesel | 0.15 |
Unit 8 | 2 | Liquid Fuel | 90 | Diesel | 1 |
Unit 9 | 1 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 10 | 2 | Liquid Fuel | 90 | Diesel | 1 |
The issue with this is that the function removes Unit 2 and Unit 7 from the output because Unit 2 and 7 in Data A do not have corresponding/matching Unit IDs in Data B.
So anybody know how to solve this? I feel like the solution would be super simple, but please let me know what you think!
Problem 2:
Now, suppose that we found a way to make Data C looking like the way that I wanted. That is,
Data C:
Unit ID | Region | Fuel | Fuel Cost ($) | Fuel Type | Fuel Fraction |
---|---|---|---|---|---|
Unit 1 | 1 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 2 | 2 | Wind | 0 | ||
Unit 3 | 1 | Mixed | 168 | Gasoline | 0.6 |
Unit 3 | 1 | Mixed | 168 | Diesel | 0.4 |
Unit 4 | 1 | Liquid Fuel | 120 | Diesel | 1 |
Unit 5 | 2 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 6 | 2 | Mixed | 183.5 | Gasoline | 0.85 |
Unit 6 | 2 | Mixed | 183.5 | Diesel | 0.15 |
Unit 7 | 1 | Solar | 0 | ||
Unit 8 | 2 | Liquid Fuel | 90 | Diesel | 1 |
Unit 9 | 1 | Liquid Fuel | 200 | Gasoline | 1 |
Unit 10 | 2 | Liquid Fuel | 90 | Diesel | 1 |
Now, I wanna come up with a chunk of codes that will "validate" the "Fuel Cost ($)" values for each unit. In other words, I wanna check whether the fuel cost information in Data C match/satisfy certain conditions. These conditions are:
-
Gasoline units spend $200 for fuel
-
Diesel units in region 1 spend $120 for fuel
-
Diesel units in region 2 spends $90 for fuel because diesel is cheaper there
-
For "mixed" units, the fuel costs depend on the fraction of fuels that they use.
For example, Unit 3 uses 60% gasoline and 40% diesel, so the fuel cost is going to be: ($200 x 60%) + ($120 x 40%) = $168.
Similarly, Unit 6 is in region 2 and uses 85% gasoline and 15% diesel, so the fuel cost is going to be: ($200 x 85%) + ($90 x 15%) = $183.5 -
Solar or wind units spends $0 for fuel (these units have "blank" fuel type and fuel fraction).
So, for example, if the fuel cost for Unit 4 is $140 instead of $120 (2nd condition), the code should flag it by saying "No" under "Validation" column. Please see the chart below:
Unit ID | Region | Fuel | Fuel Cost ($) | Fuel Type | Fuel Fraction (%) | Validation |
---|---|---|---|---|---|---|
Unit 1 | 1 | Liquid Fuel | 200 | Gasoline | 1 | Yes |
Unit 2 | 2 | Wind | 0 | Yes | ||
Unit 3 | 1 | Mixed | 168 | Gasoline | 0.6 | Yes |
Unit 3 | 1 | Mixed | 168 | Diesel | 0.4 | Yes |
Unit 4 | 1 | Liquid Fuel | 140 | Diesel | 1 | No |
Unit 5 | 2 | Liquid Fuel | 200 | Gasoline | 1 | Yes |
Unit 6 | 2 | Mixed | 183.5 | Gasoline | 0.85 | Yes |
Unit 6 | 2 | Mixed | 183.5 | Diesel | 0.15 | Yes |
Unit 7 | 1 | Solar | 0 | Yes | ||
Unit 8 | 2 | Liquid Fuel | 90 | Diesel | 1 | Yes |
Unit 9 | 1 | Liquid Fuel | 200 | Gasoline | 1 | Yes |
Unit 10 | 2 | Liquid Fuel | 90 | Diesel | 1 | Yes |
I was thinking maybe I need to use the if...else... function? I tried to figure it out, but I wasn't sure how to approach it due to my lack of experience with R (maybe I am a bit too ambitious with the stuff that I'm trying to do....). Please let me know how I should approach this!
I reckon that the most difficult part of this would be dealing with units that mix different fuels because their fuel costs depend on multiple factors (types of fuel, fuel fraction, and location/region). If it's way too complicated, we can just focus on validating units that do not mix fuels for now.
Thank you for reading my long questions.