Condition & Comparition among values of a column

PalomaLlorente · October 9, 2019, 9:15am

I need to group and compare a list of values depending on several factors.

An extract of the data I have to process is (real one is more that 2600000 rows):

data.frame(stringsAsFactors=FALSE,
Start.ID = c("%%EDH", "%DIPA", "%DIPI", "%DRSI", "*%200", "**EG1",
"**TNT", "*01RD"),
End.ID = c("WSN", "PITES", "LADAT", "SITET", "BAKER", "MID",
"IBUGO", "IVLUT"),
Lon_i = c(9.23027777777778, 6.39722222222222, 7.35638888888889,
-0.0458333333333333, 0.516666666666667,
-0.922777777777778, 11.1397222222222, 4.78694444444444),
Lat_i = c(53.5088888888889, 49.7688888888889, 49.4375,
50.1633333333333, 51.35, 51.2816666666667,
46.0202777777778, 52.1555555555556),
Lon_f = c(8.87472222222222, 6.51944444444444, 7.83944444444444, 0,
0.298055555555556, -0.625, 11.3766666666667,
5.25694444444444),
Lat_f = c(53.3472222222222, 49.7286111111111, 49.2652777777778,
50.1, 51.495, 51.0538888888889, 45.6411111111111,
52.2441666666667),
Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336, 253)
)

What I need to do is:

First I need to compare their "Rumbo_circular" value so that they are different. Afterward, I need to compare their Lon_i value; if the difference between their value is less than 3 units, another condition is introduced, I need to check Lat_i; if their Lat_i difference is less than 3 units a solution is reached: that combination is identified as "Possible_Conflict", and the process should start with another row. In short, I need to ask the command these questions:

Have they different "Rumbo_circular" value?

If yes: Go to question 2) and question 4).
If not: That combination is identified as "Non_Possible_Conflict", as they have the same "Rumbo_circular" value.

Have they a difference in "Lat_i" value under 3 units?

If yes: Go to question 3)
If not: That combination is identified as "Non_Possible_Inicial_Conflict"

Have they a difference in "Lon_i" value under 3 units?

If yes: That combination is identified as "Possible_Inicial_Conflict"
If not: That combination is identified as "Non_Possible_Inicial_Conflict"

Have they a difference in "Lat_f" value under 3 units?

If yes: Go to question 5)
If not: That combination is identified as "Non_Possible_End_Conflict"

Have they a difference in "Lon_f" value under 3 units?

If yes: That combination is identified as "Possible_End_Conflict"
If not: That combination is identified as "Non_Possible_End_Conflict"

Here is a numeric example of what I expect R to do for me:

Lat_i: 58; Lon_i: 27; Lat_f: 60; Lon_f: 65; Rumbo_Circular: 60
Lat_i: 59; Lon_i: 27; Lat_f: 60; Lon_f: 70; Rumbo_Circular: 60
Lat_i: 55; Lon_i: 29 ; Lat_f: 55; Lon_f: 65; Rumbo_Circular: 63
Lat_i: 55; Lon_i: 29; Lat_f: 57; Lon_f: 65; Rumbo_Circular: 65

Comparing 1&2: When answering question 1), they do NOT have different "Rumbo_Circular" value and therefore the combination 1_2 is categorized as "Non_Possible_Conflict".
Comparing 1&3: Answer to question 1) is yes. Answer to question 2) is yes. Answer to question 3) is yes and therefore the combination 1_3 is identified as "Possible_Inicial_Conflict". However as answer to question 1) is yes I need to answer also question 4). Answer to question 4) is no, so they are identified as "Non_Possible_End_Conflict". Then, combination 1_3 has 2 identifications: "Possible_Inicial_Conflict", "Non_Possible_End_Conflict". (I think these identifications could be shown as a table format, that is to say, one column for "Non_Possible_Conflict", other for "Non_Possible_Inicial_Conflict", other for "Possible_Inicial_Conflict", other for "Possible_End_Conflict" and another for "Non_Possible_End_Conflict". Therefore, if they are identified under one of those categories the identification would appear in the column of the categorie and if not it would be blank).
Comparing 1&4: Answer to question 1) is yes; Answer to question 2) and 3) is yes, so the combination 1_4 has is categorized as "Possible_Inicial_Conflict". As question 1) was affirmative, question 4) must be answered. 4) and 5) are also true, therefore the combination is categorized as "Possible_End_Conflict". Then, this combination has 2 identifications: "Possible_Inicial_Conflict", "Possible_End_Conflict".

Hope somebody could help me writing this someway in r

pieterjanvc · October 9, 2019, 10:44am

Hi,

Thanks for the nice explanation and providing a data extract. I have a few remarks/questions before I start tackling this:

Looking at Question 1: "Have they different "Rumbo_circular" value?". First you say, if yes, go to 2 and 4, but in your example below you say, dentified as "Non_Possible_Conflict" (which is the answer for No)
Your dataset provided might not be the best example as none of the Rumbo_circular values are the same. Could you add data so at least every outcome can be seen

There are relatively easy ways to implement this in R, but since you have such a huge number of rows it would be best for me to think a bit about the fastest implementation since normal logic one by one is going to be pretty slow on the full data I think

If you clarify your question 1 and get a bit more data, I'll start thinking about the best implementation.

Kind regards,
PJ

PalomaLlorente · October 9, 2019, 2:25pm

Sorry for the mistake, I have edited the first post.

I have created a data.frame that summarizes what I expect R Studio to do. In the example I have compared 1 and 2 with all the others rows (by hand) , reaching the solutions that are exposed in each column regarding the answer to the questions.

data.frame(stringsAsFactors=FALSE,
Start.ID = c("%%EDH", "%DIPA", "%DIPI", "%DRSI",
"*%200", "**EG1", "**TNT", "*01RD"),
End.ID = c("WSN", "PITES", "LADAT", "SITET", "BAKER",
"MID", "IBUGO", "IVLUT"),
Lon_i = c(9.23027777777778, 6.39722222222222,
7.35638888888889, -0.0458333333333333,
0.516666666666667, -0.922777777777778,
7.1397222222222, 9.78694444444444),
Lat_i = c(53.5088888888889, 49.7688888888889,
52.4375, 50.1633333333333, 51.35,
51.2816666666667, 52.0202777777778,
52.1555555555556),
Lon_f = c(8.87472222222222, 6.51944444444444,
7.83944444444444, 0, 8.29805555555555,
-0.625, 11.3766666666667, 8.25694444444444),
Lat_f = c(53.3472222222222, 49.7286111111111,
52.2652777777778, 50.1, 51.495,
51.0538888888889, 45.6411111111111, 52.2441666666667),
Rumbo_circular = c(53, 53, 50, 50, 137, 321, 336, 253),
Non_Possible_Conflict = c("%%EDH_WSN & %DIPA_PITES", NA, NA, NA, NA,
NA, NA, NA),
Possible_Inicial_Conflict = c("%%EDH_WSN & %DIPI_LADAT",
"%%EDH_WSN & **TNT_IBUGO",
"%%EDH_WSN & *01RD_IVLUT", "%DIPA_PITES & %DIPI_LADAT",
"%DIPA_PITES & **TNT_IBUGO",
"%DIPA_PITES & *01RD_IVLUT", NA, NA),
Non_Possible_Inicial_Conflict = c("%%EDH_WSN & DRSI_SITET",
"%%EDH_WSN & *%200_BAKER",
"%%EDH_WSN & **EG1_MID", "%DIPA_PITES & %DRSI_SITET",
"%DIPA_PITES & **EG1_MID",
"%DIPA_PITES & *%200_BAKER", NA, NA),
Posible_End_Conflict = c("%%EDH_WSN & %DIPI_LADAT",
"%%EDH_WSN & *%200_BAKER",
"%%EDH_WSN &*01RD_IVLUT", "%DIPA_PITES & %DIPI_LADAT",
"%DIPA_PITES & *01RD_IVLUT", NA, NA, NA),
Non_Possible_End_Conflict = c("%%EDH_WSN & DRSI_SITET",
"%%EDH_WSN & **EG1_MID",
"%%EDH_WSN & **TNT_IBUGO", "%DIPA_PITES & %DRSI_SITET",
"%DIPA_PITES & **EG1_MID",
"%DIPA_PITES & **TNT_IBUGO", "%DIPA_PITES & *%200_BAKER", NA)
)

As it can be seen the combinations under each category are:

Non_Possible_Conflict: 1&2
Possible_Inicial_Conflict: 1&3, 1&7, 1&8, 2&3, 2&7, 2&8
Non_Possible_Inicial_Conflict: 1&4, 1&5, 1&6, 2&4, 2&6, 2&5
Possible_End_Conflict: 1&3, 1&5, 1&8, 2&3, 2&8
Non_Possible_End_Conflict: 1&4, 1&6, 1&7, 2&4, 2&6, 2&7, 2&5
where:
1 is %%EDH_WSN
2 is DIPA_PITES
3 is %DIPI_LADAT
4 is DRSI_SITET
5 is *%200_BAKER
6 is **EG1_MID
7 is **TNT_IBUGO
8 is *01RD_IVLUT
(I just linked startID and endID by a "_" to name each row)

Hope now there is not mistake. Thanks in advance for the help!

pieterjanvc · October 10, 2019, 1:54am

Hi,

As promised,I came up with an implementation that could be pretty fast ...
For the sake of saving space, I did not copy the whole data frame you provided in the reprex but saved it in the variable myData

#Give each row a unique ID (easier for prosessing)
myData$ID = 1:nrow(myData)

#Evaluate every row
myResult = purrr::map_df(1:(nrow(myData)-1), function(currentRow){
  #Get the data for the current row
  myRow = myData %>% slice(currentRow)
  
  rbind(
    #Get all cases where Rumbo_circular is identical to the current row and set the values
    myData %>% slice(-c(1:currentRow)) %>% filter(Rumbo_circular == myRow$Rumbo_circular) %>%  
      transmute(ID1 = myRow$ID, ID2 = ID, Non_Possible_Conflict = T,
                Possible_Inicial_Conflict = F, Possible_End_Conflict = F),
    
    #Get all cases where Rumbo_circular is different to the current row and set the values
    myData %>% slice(-c(1:currentRow)) %>% filter(Rumbo_circular != myRow$Rumbo_circular) %>% 
      transmute(ID1 = myRow$ID, ID2 = ID, Non_Possible_Conflict = F,
                Possible_Inicial_Conflict = abs(Lon_i - myRow$Lon_i) < 3 & abs(Lat_i - myRow$Lat_i) < 3,
                Possible_End_Conflict = abs(Lon_f - myRow$Lon_f) < 3 & abs(Lat_f - myRow$Lat_f) < 3)
  )
})

> myResult
   ID1 ID2 Non_Possible_Conflict Possible_Inicial_Conflict Possible_End_Conflict
1    1   2                  TRUE                     FALSE                 FALSE
2    1   3                 FALSE                      TRUE                  TRUE
3    1   4                 FALSE                     FALSE                 FALSE
4    1   5                 FALSE                     FALSE                  TRUE
5    1   6                 FALSE                     FALSE                 FALSE
6    1   7                 FALSE                      TRUE                 FALSE
7    1   8                 FALSE                      TRUE                  TRUE
8    2   3                 FALSE                      TRUE                  TRUE
9    2   4                 FALSE                     FALSE                 FALSE
10   2   5                 FALSE                     FALSE                  TRUE
11   2   6                 FALSE                     FALSE                 FALSE
12   2   7                 FALSE                      TRUE                 FALSE
13   2   8                 FALSE                     FALSE                  TRUE
14   3   4                  TRUE                     FALSE                 FALSE
15   3   5                 FALSE                     FALSE                  TRUE
16   3   6                 FALSE                     FALSE                 FALSE
17   3   7                 FALSE                      TRUE                 FALSE
18   3   8                 FALSE                      TRUE                  TRUE
19   4   5                 FALSE                      TRUE                 FALSE
20   4   6                 FALSE                      TRUE                  TRUE
21   4   7                 FALSE                     FALSE                 FALSE
22   4   8                 FALSE                     FALSE                 FALSE
23   5   6                 FALSE                      TRUE                 FALSE
24   5   7                 FALSE                     FALSE                 FALSE
25   5   8                 FALSE                     FALSE                  TRUE
26   6   7                 FALSE                     FALSE                 FALSE
27   6   8                 FALSE                     FALSE                 FALSE
28   7   8                 FALSE                      TRUE                 FALSE

By using map, I can get a faster implementation than if I'd loop over each row
I created two sets for each row I then bound together (rbind)
- The first set looks for all Rumbo_circular values that are equal and assigns Non_Possible_Conflict as TRUE, the rest as FALSE
- The second set takes all Rumbo_circular values that are different, and with some simple logic and calculations assigns Possible_Inicial_Conflict and Possible_End_Conflict to TRUE or FALSE depending on the outcome
By using the slice operator, I made sure that combinations of rows that were previously evaluated, would not be evaluated again. E.g.row 1 evaluates row 2, 3, 4, ... 8, so by the time row 3 for example is evaluated it will not evaluate anything below that (because 1-3 and 2-3 have already been evaluated)

The result of the final combinations and their comparisons can be seen in the table myResult

THE PROBLEM OF COMBINATIONS
However, since you want to compare every row with all other rows, this generates a combinatorial problem that will quickly go out of hand. As you can see, 8 rows already create 28 combinations, 1000 rows will generate ~500,000 combinations. You have 2600000 rows, which will be a number of combinations so high that no computer will be able to get the results in a reasonable time...

You should really think about ways you could filter your data or improve your algorithm to fix this problem, as the code, which is very fast, will fail on the size of the task because it will generate billions of combinations in a file that you'll never be able to process

So in a way I provided a solution, but your implementation will not be practical.

If I did miss something, or you find other ways of improving the algorithm, I;m happy to help with the coding!

PJ

PalomaLlorente · October 10, 2019, 10:01am

I have managed to reduced the file to 35833 rows and 5 columns (Lon_i, Lat_i, Lon_f, Lat_f, Rumbo_circular, ID); how ever, when running the code an error message appeared: " Error: cannot allocate vector of size 266 Kb " . I reduced the items to the minimun (I also rounded the numbers to the decimals). Do you know if this message is due to my computer capacity, the disk where I am storing R information or due to R capacity? And in any of the cases, have you got any idea how could I solve this problem? I have an Intel Core i7-2630QM CPU 2Ghz processor with a RAM available of 3.85 GB. Thanks in advace for the help

pieterjanvc · October 10, 2019, 12:12pm

Hi,

Yes indeed, I told you the data would become very big
I expanded my code to help you create data partitions that will save a fixed set of rows of the results to file to prevent your R from crashing

#Give each row a unique ID (easier for prosessing)
myData$ID = 1:nrow(myData)

maxNperPart = 10 #must be equal or larger than n rows

#Caluclate the number of parts to split the data in
myGroups = data.frame(row = 1:(nrow(myData)-1), nResults = (nrow(myData)-1):(1), group = 0)
myGroup = 1
mySum = 0
for(x in 1:nrow(myGroups)){ # x = 4
  mySum = sum(mySum, myGroups$nResults[x])
  if(mySum <= maxNperPart){
    myGroups$group[x] = myGroup
  } else {
    myGroup = myGroup + 1
    myGroups$group[x] = myGroup
    mySum = myGroups$nResults[x]
  }
  
}

#Read this output to know how many parts you'll have
print(paste("There will be", max(myGroups$group), "data partitions"))

#Select which parts you like to run
startPart = 1 
stopPart = max(myGroups$group) #max value here is max number of parts

#Run parts
for(myGroup in startPart:stopPart){
  startRow = myGroups %>% filter(group == myGroup) %>% pull(row) %>% min()
  endRow = myGroups %>% filter(group == myGroup) %>% pull(row) %>% max()
  
  #Evaluate every row in partition
  myResult = purrr::map_df(startRow:endRow, function(currentRow){
    #Get the data for the current row
    myRow = myData %>% slice(currentRow)
    
    rbind(
      #Get all cases where Rumbo_circular is identical to the current row and set the values
      myData %>% slice(-c(1:currentRow)) %>% filter(Rumbo_circular == myRow$Rumbo_circular) %>%  
        transmute(ID1 = myRow$ID, ID2 = ID, Non_Possible_Conflict = T,
                  Possible_Inicial_Conflict = F, Possible_End_Conflict = F),
      
      #Get all cases where Rumbo_circular is different to the current row and set the values
      myData %>% slice(-c(1:currentRow)) %>% filter(Rumbo_circular != myRow$Rumbo_circular) %>% 
        transmute(ID1 = myRow$ID, ID2 = ID, Non_Possible_Conflict = F,
                  Possible_Inicial_Conflict = abs(Lon_i - myRow$Lon_i) < 3 & abs(Lat_i - myRow$Lat_i) < 3,
                  Possible_End_Conflict = abs(Lon_f - myRow$Lon_f) < 3 & abs(Lat_f - myRow$Lat_f) < 3)
    )
  })
  
  print(myResult) #replace this with a save function to save each file (see below)
  # write.csv(myResult, paste0("myPath/result_part", myGroup, ".csv"), row.names = F)
}

You now have to set the variable maxNperPart to how many rows you like per output file. This makes you control the file size. R can easily handle 100,000 rows in a data file for example.

The next part of the code calculates how many data partitions you'll generate if you have set maxNperPart to be the limit per file. You can then set startPart and stopPart to only run the partitions you like to evaluate and store.

The code will run for these partitions and save a file for each, having a max number of rows defined by maxNperPart. You need to uncomment (and update the filepath) the save function at the bottom of the code for this to work. Now it'll just print the results instead.

In the example given above, I set maxNperPart = 10 (very small but also dummy set) and run all 4 partitions (each containing no more than 10 results). It looks like this:

  ID1 ID2 Non_Possible_Conflict Possible_Inicial_Conflict Possible_End_Conflict
1   1   2                  TRUE                     FALSE                 FALSE
2   1   3                 FALSE                      TRUE                  TRUE
3   1   4                 FALSE                     FALSE                 FALSE
4   1   5                 FALSE                     FALSE                  TRUE
5   1   6                 FALSE                     FALSE                 FALSE
6   1   7                 FALSE                      TRUE                 FALSE
7   1   8                 FALSE                      TRUE                  TRUE
  ID1 ID2 Non_Possible_Conflict Possible_Inicial_Conflict Possible_End_Conflict
1   2   3                 FALSE                      TRUE                  TRUE
2   2   4                 FALSE                     FALSE                 FALSE
3   2   5                 FALSE                     FALSE                  TRUE
4   2   6                 FALSE                     FALSE                 FALSE
5   2   7                 FALSE                      TRUE                 FALSE
6   2   8                 FALSE                     FALSE                  TRUE
  ID1 ID2 Non_Possible_Conflict Possible_Inicial_Conflict Possible_End_Conflict
1   3   4                  TRUE                     FALSE                 FALSE
2   3   5                 FALSE                     FALSE                  TRUE
3   3   6                 FALSE                     FALSE                 FALSE
4   3   7                 FALSE                      TRUE                 FALSE
5   3   8                 FALSE                      TRUE                  TRUE
6   4   5                 FALSE                      TRUE                 FALSE
7   4   6                 FALSE                      TRUE                  TRUE
8   4   7                 FALSE                     FALSE                 FALSE
9   4   8                 FALSE                     FALSE                 FALSE
  ID1 ID2 Non_Possible_Conflict Possible_Inicial_Conflict Possible_End_Conflict
1   5   6                 FALSE                      TRUE                 FALSE
2   5   7                 FALSE                     FALSE                 FALSE
3   5   8                 FALSE                     FALSE                  TRUE
4   6   7                 FALSE                     FALSE                 FALSE
5   6   8                 FALSE                     FALSE                 FALSE
6   7   8                 FALSE                      TRUE                 FALSE

You can see the output is exactly the same as the first code I gave you, but now it's cut into 4 datasets.

In your dataset, now ~36000 rows, if we set maxNperPart = 110,000, we'll generate 6554 data partitions, each containing max 110,000 rows of the result. By playing with the startPart and sendPart, you can make sure you don't do them all at the same time, but test with for example 10 (e.g. startPart = 1, stopPart = 10).

You see this will still generate huge amount of data...

One thing you could do (depending on what you need from the data) is see if you can filter the data after it has been processed, meaning you filter the myResult data frame. If you can reduce it, you can increase the maxNperPart because it will be shrunk after processing.

One example: imagine you know that many combinations will never have a conflict, so first row is TRUE, all the rest FALSE, you can upgrade the code after the rbind to filter those out. You then know that if a combination is not in your final dataset, it means there was no conflict.

Good luck!
PJ

system · October 17, 2019, 12:12pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.