Help with more complicated filtering conditions

Hello,

I have a data frame with real-estate properties data, which contain errors because were entered manually by people. I created column "price_m2_controll" with intention to use it for filtering-out false records.

I have created these rules:

  # Filter-out rules for column "price_m2_controll"
  # 
  # for RESIDENTIAL & COMMERCIAL property types
  #   - sale: >100€/m2 & <20000€/m2
  #   - rent: >1€/m2/month & <100€/m2/month
  # 
  # for LAND property type
  #   - sale: >0.1€/m2 & <5000€/m2
  #   - rent: >0.01€/m2/month & <10€/m2/month

I want to filter out and drop all records outside these rules and keep only those which fit.

Sample of simplified df

records <- tribble(
        ~property_type, ~operation_type, ~price_m2_controll,
        "residential", "rent", 0.5,
        "residential", "rent", 5, 
        "residential", "rent", 200,
        "residential", "sale", 20,
        "residential", "sale", 1000,
        "residential", "sale", 25000,
        
        "commercial", "rent", 0.25,
        "commercial", "rent", 48, 
        "commercial", "rent", 180,
        "commercial", "sale", 9,
        "commercial", "sale", 222,
        "commercial", "sale", 28000,
        
        "land", "rent", 0.005,
        "land", "rent", 5, 
        "land", "rent", 12,
        "land", "sale", 0.1,
        "land", "sale", 1000,
        "land", "sale", 60000
      )
A tibble: 18 × 3
   property_type operation_type price_m2_controll
   <chr>         <chr>                      <dbl>
 1 residential   rent                       0.5  
 2 residential   rent                       5    
 3 residential   rent                     200    
 4 residential   sale                      20    
 5 residential   sale                    1000    
 6 residential   sale                   25000    
 7 commercial    rent                       0.25 
 8 commercial    rent                      48    
 9 commercial    rent                     180    
10 commercial    sale                       9    
11 commercial    sale                     222    
12 commercial    sale                   28000    
13 land          rent                       0.005
14 land          rent                       5    
15 land          rent                      12    
16 land          sale                       0.1  
17 land          sale                    1000    
18 land          sale                   60000 

This is my filtering code

records <- records %>%

# Residential & commercial (other) - SALE
filter( property_type != "land" | operation_type == "sale" & price_m2_controll > 100 & price_m2_controll < 20000) %>%
      
 # Residential & commercial - RENT
 filter(property_type != "land" | operation_type == "rent" & price_m2_controll > 1 & price_m2_controll < 100 ) %>%
      
 # Land - SALE
 filter(property_type == "land"| operation_type == "sale" & price_m2_controll > 0.1 & price_m2_controll < 10) %>%
      
 # Land - RENT
 filter(property_type = "land" | operation_type == "rent" & price_m2_controll > 0.01 & price_m2_controll < 100)

I am stuck in this , I cant figure out the filtering rule to get correct results. I will appreciate some help please...
Thanks in advance

Are these typeof numeric?

Yes, they are numeric. It's my explanation of desired rules in half-word half-code way..:slight_smile:

1 Like

Wow. Can you share how this works? I’ve never seen arithmetic operations work on numbers shown formatted as string numerals. I really didn’t think it possible.

@technocrat,
my apologies, I didnt understand you correctly, and as I now see I did not describe my mistake; apologies. Its getting late night here already.:slight_smile:

Basically, for each property type (commercial, residential and land) and for each operation type (sale, rent) I need to filter-out records with extreme values in price_m2_controll column.
These extreme values are different for each property type and operation type, because there apply different price-ranges; and are defined in rules you were asking about.
I am not able a write a filtering code to apply these rules correctly to the "records" tibble.

1.) if column property_type is "residential" or "commercial" and column operation_type is "sale", keep only records where column price_m2_controll has value between 100 and 20000

2.) if column property_type is "residential" or "commercial" and column operation_type is "rent" and column price_m2_controll is between 1 and 100

3.) if column property_type is "land" and column operation_type is "sale" and column price_m2_controll is between 0.1 and 5000

4.) if column property_type is "land" and column operation_type is "rent" and column price_m2_controll is between 0.01 and 100

Apologies for the mild snark. I mistook your pcode. The immediate problem is

The assignment operator = instead of the equality operator ==, but beyond that, after all the filtering there is nothing left. You get some sleep while I work on it overnight your time.

After this operation there is only a single row left over and that is filtered out by the last filter()

records <- tibble::tribble(
  ~property_type, ~operation_type, ~price_m2_controll,
  "residential", "rent", 0.5,
  "residential", "rent", 5, 
  "residential", "rent", 200,
  "residential", "sale", 20,
  "residential", "sale", 1000,
  "residential", "sale", 25000,
  "commercial", "rent", 0.25,
  "commercial", "rent", 48, 
  "commercial", "rent", 180,
  "commercial", "sale", 9,
  "commercial", "sale", 222,
  "commercial", "sale", 28000,
  "land", "rent", 0.005,
  "land", "rent", 5, 
  "land", "rent", 12,
  "land", "sale", 0.1,
  "land", "sale", 1000,
  "land", "sale", 60000
)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
piece1 <- records |>
  # Residential & commercial (other) - SALE
  filter(property_type != "land" | 
         operation_type == "sale" & 
         price_m2_controll > 100 & 
         price_m2_controll < 20000)
dim(piece1)
#> [1] 13  3

piece2<- piece1 |>
# Residential & commercial - RENT
  filter(property_type != "land" | operation_type == "rent" & price_m2_controll > 1 & price_m2_controll < 100 )
dim(piece2)
#> [1] 12  3

piece3 <- piece2 |>
  # Residential & commercial - RENT
  filter(property_type != "land" | operation_type == "rent" & price_m2_controll > 1 & price_m2_controll < 100)
dim(piece3)
#> [1] 12  3

piece4 <- piece3 |>
 # Land - SALE
  filter(property_type == "land"| operation_type == "sale" & price_m2_controll > 0.1 & price_m2_controll < 10) 
dim(piece4) 
#> [1] 1 3

piece5 <- piece4 |>
  # Land - RENT
  filter(property_type == "land" | operation_type == "rent" & price_m2_controll > 0.01 & price_m2_controll < 100)
dim(piece5)
#> [1] 0 3

Created on 2023-11-21 with reprex v2.0.2

Thank you @technocrat, I appreciate your willingness.

I think that after filtering, there should remain 6 records in this test data frame...

I am lost in combinations of & and | and struggling even with the implementation of even only the first rule.
I would expect that code:

filter(property_type != "land" | operation_type == "sale" & between(price_m2_controll, 100, 20000))

will filter out any records for "commercial" and "residential" property types and "sale" operation_type, where "price_m2_controll" is below 100 or above 20000.
But this is not happening, I still have these values there...I dont understand why...

When in doubt with logical operators there are two choices

  1. Try to keep the order of precedence straight (which is fine if you do it all the time)
  2. Use () to specify order of evaluation
filter(property_type != "land" | operation_type == "sale" & between(price_m2_controll, 100, 20000))

filter(property_type (!= "land" | operation_type == "sale") & between(price_m2_controll, 100, 20000))

filter(property_type != "land" | operation_type == "sale" & between(price_m2_controll, 100, 20000))

behave differently. () evaluates from the inside out.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.