Issue with the left join when combining tables with different number of rows

nikos_geo · June 27, 2024, 7:48am

I have 6 csv files which I want to join them into one data.table. 5 out 6 dt have 4994 rows and one dt has 13802 rows. All of them have two columns, and one of them is common (called LSOA21CD). When I perform a left join I would expect the resulting dt to have maximum 4994 rows but it has 11379. Any idea why is that?

library(dplyr)
library(data.table)

wd <- "path/"

disability <- read.table(paste0(wd, "disability.csv"), sep = ",", header = TRUE)
unemployment <- read.table(paste0(wd, "unemployment.csv"), sep = ",", header = TRUE)
poverty <- read.table(paste0(wd, "poverty.csv"), sep = ",", header = TRUE)
dwellings <- read.table(paste0(wd, "dwellings.csv"), sep = ",", header = TRUE)
education <- read.table(paste0(wd, "education.csv"), sep = ",", header = TRUE)
income <- read.table(paste0(wd, "income.csv"), sep = ",", header = TRUE)

# Join all CSV files by LSOA21CD
comb_csv <- disability %>%
  left_join(unemployment, by = "LSOA21CD") %>%
  left_join(poverty, by = "LSOA21CD") %>%
  left_join(dwellings, by = "LSOA21CD") %>%
  left_join(education, by = "LSOA21CD") %>% 
  na.omit()

Some info about the dts:

>glimpse(disability)
Rows: 4,994
Columns: 2
$ LSOA21CD   <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01032739", "E01032740", "E01000006", "E01000007", "E…
$ disability <dbl> 1.73e-05, 1.67e-05, 2.50e-05, 2.14e-05, 1.33e-05, 1.09e-05, 2.34e-05, 3.05e-05, 2.41e-05, 2.60e-05, 1.97e-…

> glimpse(unemployment)
Rows: 4,994
Columns: 2
$ LSOA21CD     <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01032739", "E01032740", "E01000006", "E01000007", …
$ unemployment <dbl> 4.90e-06, 4.50e-06, 9.60e-06, 8.30e-06, 6.30e-06, 6.20e-06, 8.20e-06, 2.17e-05, 9.10e-06, 1.15e-05, 9.60…

>glimpse(poverty)
Rows: 4,994
Columns: 2
$ LSOA21CD <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01032739", "E01032740", "E01000006", "E01000007", "E01…
$ poverty  <dbl> 0.0000841, 0.0000827, 0.0001545, 0.0000908, 0.0000791, 0.0000847, 0.0001013, 0.0001986, 0.0001195, 0.0001136…

>glimpse(education)
Rows: 4,994
Columns: 2
$ LSOA21CD  <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01000006", "E01000007", "E01000008", "E01000009", "E0…
$ education <dbl> 0.0001585, 0.0001529, 0.0001485, 0.0000545, 0.0000702, 0.0001474, 0.0000564, 0.0000667, 0.0000597, 0.000086…

> glimpse(dwellings)
Rows: 13,802
Columns: 14
$ LSOA21CD     <chr> "K04000001", "E92000001", "K04000001", "E92000001", "K04000001", "K04000001", "E12000008", "E92000001", …
$ BP_PRE_1900  <dbl> 0.0234840, 0.0216032, 0.0073970, 0.0070164, 0.0038138, 0.0042477, 0.0030158, 0.0034661, 0.0037748, 0.002…
$ BP_1900_1918 <dbl> 0.0080375, 0.0073906, 0.0025434, 0.0024284, 0.0014341, 0.0016321, 0.0009104, 0.0012901, 0.0014291, 0.001…
$ BP_1919_1929 <dbl> 0.0074172, 0.0070351, 0.0021863, 0.0021462, 0.0015143, 0.0014130, 0.0008918, 0.0014169, 0.0012956, 0.001…
$ BP_1930_1939 <dbl> 0.0152449, 0.0148074, 0.0028105, 0.0027793, 0.0034948, 0.0030263, 0.0019092, 0.0033848, 0.0029499, 0.002…
$ BP_1945_1954 <dbl> 0.0099544, 0.0094076, 0.0036657, 0.0035650, 0.0020153, 0.0023276, 0.0014179, 0.0018730, 0.0021321, 0.001…
$ BP_1955_1964 <dbl> 0.0152505, 0.0144511, 0.0035432, 0.0034092, 0.0040905, 0.0027278, 0.0027751, 0.0038973, 0.0025433, 0.002…
$ BP_1965_1972 <dbl> 0.0146310, 0.0138550, 0.0033734, 0.0032575, 0.0039386, 0.0029219, 0.0024486, 0.0037553, 0.0027939, 0.002…
$ BP_1973_1982 <dbl> 0.0137278, 0.0130055, 0.0034935, 0.0033814, 0.0031260, 0.0030291, 0.0023649, 0.0029675, 0.0029078, 0.001…
$ BP_1983_1992 <dbl> 0.0106068, 0.0100728, 0.0018814, 0.0018213, 0.0023124, 0.0022804, 0.0020788, 0.0022021, 0.0022008, 0.001…
$ BP_1993_1999 <dbl> 0.0076928, 0.0072429, 0.0009978, 0.0009628, 0.0016073, 0.0013678, 0.0011846, 0.0014971, 0.0013123, 0.001…
$ BP_2000_2009 <dbl> 0.0107114, 0.0101158, 0.0013318, 0.0013037, 0.0024646, 0.0020593, 0.0016876, 0.0023641, 0.0020208, 0.002…
$ BP_2010_2015 <dbl> 0.0041631, 0.0039439, 0.0006243, 0.0006043, 0.0009972, 0.0008512, 0.0006563, 0.0009474, 0.0008245, 0.000…
$ BP_UNKNOWN   <dbl> 0.0018659, 0.0017546, 0.0005002, 0.0004710, 0.0002967, 0.0002955, 0.0002918, 0.0002825, 0.0002819, 0.000…

And the resulting dt:

> glimpse(comb_csv)
Rows: 11,379
Columns: 18
$ LSOA21CD     <chr> "E01000001", "E01000001", "E01000002", "E01000002", "E01000002", "E01000003", "E01000003", "E01000005", …
$ disability   <dbl> 1.73e-05, 1.73e-05, 1.67e-05, 1.67e-05, 1.67e-05, 2.50e-05, 2.50e-05, 2.14e-05, 2.34e-05, 2.34e-05, 2.34…
$ unemployment <dbl> 4.90e-06, 4.90e-06, 4.50e-06, 4.50e-06, 4.50e-06, 9.60e-06, 9.60e-06, 8.30e-06, 8.20e-06, 8.20e-06, 8.20…
$ poverty      <dbl> 0.0000841, 0.0000841, 0.0000827, 0.0000827, 0.0000827, 0.0001545, 0.0001545, 0.0000908, 0.0001013, 0.000…
$ BP_PRE_1900  <dbl> 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 2e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1900_1918 <dbl> 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 1.0e-07, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_1919_1929 <dbl> 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 7.0e-07, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_1930_1939 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 6e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1945_1954 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1955_1964 <dbl> 1e-07, 0e+00, 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1965_1972 <dbl> 1.4e-06, 0.0e+00, 5.0e-07, 0.0e+00, 0.0e+00, 4.0e-07, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_1973_1982 <dbl> 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1983_1992 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1993_1999 <dbl> 2e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_2000_2009 <dbl> 1.0e-07, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_2010_2015 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_UNKNOWN   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ education    <dbl> 0.0001585, 0.0001585, 0.0001529, 0.0001529, 0.0001529, 0.0001485, 0.0001485, 0.0000545, 0.0000702, 0.000…

I can share the csv files if needed.

R 4.4.0, RStudio 2024.04.2 Build 764, Windows 11

nikos_geo · June 27, 2024, 8:01am

I think I found it. It was relatively easy. I found the duplicates in the LSOA21CD column of the comb_csv and I remove them, like so:

# remove duplicates found in the LSOA21CD column
merged_df <- comb_csv %>% distinct(LSOA21CD, .keep_all = TRUE)

craig.parylo · June 27, 2024, 9:46am

Hi @nikos_geo, I think you have your finger on the issue - one of your tables contains multiple rows for the same LSOA21CD code, so when you left-join each of the tables together you get some rows duplicated.

It seems likley the dwellings is where the duplicates are as I can see duplicate codes in your glimpsed output.

Do you know why there are multliple rows for the same LSOA21CD in the dwellings file? If it is anything like the data available from the London datastore then there is one row per LSOA code for each band of dwelling, i.e. either ALL or bands A:I.

Your solution to distinct comb_csv on LSOA21CD will give you the correct number of rows but its not clear if the data you're getting for dwellings is for the correct house band.

mikecrobp · June 28, 2024, 12:46pm

I have 2 useful "tools" in my kit bag:

Check for duplicates based on one or more columns
DeDeuplicate based on an ID which should be unique and a field to use to choose between duplicates

Maybe there are better standard functions, but I didn't find them a few years ago.

DupCheckMulti = function(dfin , ...) {
  if (missing(...)) {
    cat("DupCheck. ID parameter missing\n")
    stop()
  }
  quoID <- enquos(...)
  
  
  dfout = dfin %>%
    group_by(!!!quoID) %>%
    summarise (count=n(), .groups = "drop") %>%
    filter (count > 1)
  # TO DO suppress join by warning 
  dfin %>% inner_join(dfout, by = colnames(dfout %>% select(-count))) %>% arrange(!!!quoID)
}

DeDupByMax = function(dfin , ID, maxfield) {
  if (missing(ID)) {
    cat("DeDupByMax ID parameter missing\n")
    stop()
  }
  quoID <- enquo(ID)
  maxfieldquo <- enquo(maxfield)
  
  dfout <- dfin %>%
    group_by(!!quoID) %>%
    filter(testequal(!!maxfieldquo, max(!!maxfieldquo)) )
  
  ungroup(dfout)
}

system · September 26, 2024, 12:46pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.