I have 6 csv files which I want to join them into one data.table. 5 out 6 dt have 4994 rows and one dt has 13802 rows. All of them have two columns, and one of them is common (called LSOA21CD). When I perform a left join I would expect the resulting dt to have maximum 4994 rows but it has 11379. Any idea why is that?
library(dplyr)
library(data.table)
wd <- "path/"
disability <- read.table(paste0(wd, "disability.csv"), sep = ",", header = TRUE)
unemployment <- read.table(paste0(wd, "unemployment.csv"), sep = ",", header = TRUE)
poverty <- read.table(paste0(wd, "poverty.csv"), sep = ",", header = TRUE)
dwellings <- read.table(paste0(wd, "dwellings.csv"), sep = ",", header = TRUE)
education <- read.table(paste0(wd, "education.csv"), sep = ",", header = TRUE)
income <- read.table(paste0(wd, "income.csv"), sep = ",", header = TRUE)
# Join all CSV files by LSOA21CD
comb_csv <- disability %>%
left_join(unemployment, by = "LSOA21CD") %>%
left_join(poverty, by = "LSOA21CD") %>%
left_join(dwellings, by = "LSOA21CD") %>%
left_join(education, by = "LSOA21CD") %>%
na.omit()
Some info about the dts:
>glimpse(disability)
Rows: 4,994
Columns: 2
$ LSOA21CD <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01032739", "E01032740", "E01000006", "E01000007", "E…
$ disability <dbl> 1.73e-05, 1.67e-05, 2.50e-05, 2.14e-05, 1.33e-05, 1.09e-05, 2.34e-05, 3.05e-05, 2.41e-05, 2.60e-05, 1.97e-…
> glimpse(unemployment)
Rows: 4,994
Columns: 2
$ LSOA21CD <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01032739", "E01032740", "E01000006", "E01000007", …
$ unemployment <dbl> 4.90e-06, 4.50e-06, 9.60e-06, 8.30e-06, 6.30e-06, 6.20e-06, 8.20e-06, 2.17e-05, 9.10e-06, 1.15e-05, 9.60…
>glimpse(poverty)
Rows: 4,994
Columns: 2
$ LSOA21CD <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01032739", "E01032740", "E01000006", "E01000007", "E01…
$ poverty <dbl> 0.0000841, 0.0000827, 0.0001545, 0.0000908, 0.0000791, 0.0000847, 0.0001013, 0.0001986, 0.0001195, 0.0001136…
>glimpse(education)
Rows: 4,994
Columns: 2
$ LSOA21CD <chr> "E01000001", "E01000002", "E01000003", "E01000005", "E01000006", "E01000007", "E01000008", "E01000009", "E0…
$ education <dbl> 0.0001585, 0.0001529, 0.0001485, 0.0000545, 0.0000702, 0.0001474, 0.0000564, 0.0000667, 0.0000597, 0.000086…
> glimpse(dwellings)
Rows: 13,802
Columns: 14
$ LSOA21CD <chr> "K04000001", "E92000001", "K04000001", "E92000001", "K04000001", "K04000001", "E12000008", "E92000001", …
$ BP_PRE_1900 <dbl> 0.0234840, 0.0216032, 0.0073970, 0.0070164, 0.0038138, 0.0042477, 0.0030158, 0.0034661, 0.0037748, 0.002…
$ BP_1900_1918 <dbl> 0.0080375, 0.0073906, 0.0025434, 0.0024284, 0.0014341, 0.0016321, 0.0009104, 0.0012901, 0.0014291, 0.001…
$ BP_1919_1929 <dbl> 0.0074172, 0.0070351, 0.0021863, 0.0021462, 0.0015143, 0.0014130, 0.0008918, 0.0014169, 0.0012956, 0.001…
$ BP_1930_1939 <dbl> 0.0152449, 0.0148074, 0.0028105, 0.0027793, 0.0034948, 0.0030263, 0.0019092, 0.0033848, 0.0029499, 0.002…
$ BP_1945_1954 <dbl> 0.0099544, 0.0094076, 0.0036657, 0.0035650, 0.0020153, 0.0023276, 0.0014179, 0.0018730, 0.0021321, 0.001…
$ BP_1955_1964 <dbl> 0.0152505, 0.0144511, 0.0035432, 0.0034092, 0.0040905, 0.0027278, 0.0027751, 0.0038973, 0.0025433, 0.002…
$ BP_1965_1972 <dbl> 0.0146310, 0.0138550, 0.0033734, 0.0032575, 0.0039386, 0.0029219, 0.0024486, 0.0037553, 0.0027939, 0.002…
$ BP_1973_1982 <dbl> 0.0137278, 0.0130055, 0.0034935, 0.0033814, 0.0031260, 0.0030291, 0.0023649, 0.0029675, 0.0029078, 0.001…
$ BP_1983_1992 <dbl> 0.0106068, 0.0100728, 0.0018814, 0.0018213, 0.0023124, 0.0022804, 0.0020788, 0.0022021, 0.0022008, 0.001…
$ BP_1993_1999 <dbl> 0.0076928, 0.0072429, 0.0009978, 0.0009628, 0.0016073, 0.0013678, 0.0011846, 0.0014971, 0.0013123, 0.001…
$ BP_2000_2009 <dbl> 0.0107114, 0.0101158, 0.0013318, 0.0013037, 0.0024646, 0.0020593, 0.0016876, 0.0023641, 0.0020208, 0.002…
$ BP_2010_2015 <dbl> 0.0041631, 0.0039439, 0.0006243, 0.0006043, 0.0009972, 0.0008512, 0.0006563, 0.0009474, 0.0008245, 0.000…
$ BP_UNKNOWN <dbl> 0.0018659, 0.0017546, 0.0005002, 0.0004710, 0.0002967, 0.0002955, 0.0002918, 0.0002825, 0.0002819, 0.000…
And the resulting dt:
> glimpse(comb_csv)
Rows: 11,379
Columns: 18
$ LSOA21CD <chr> "E01000001", "E01000001", "E01000002", "E01000002", "E01000002", "E01000003", "E01000003", "E01000005", …
$ disability <dbl> 1.73e-05, 1.73e-05, 1.67e-05, 1.67e-05, 1.67e-05, 2.50e-05, 2.50e-05, 2.14e-05, 2.34e-05, 2.34e-05, 2.34…
$ unemployment <dbl> 4.90e-06, 4.90e-06, 4.50e-06, 4.50e-06, 4.50e-06, 9.60e-06, 9.60e-06, 8.30e-06, 8.20e-06, 8.20e-06, 8.20…
$ poverty <dbl> 0.0000841, 0.0000841, 0.0000827, 0.0000827, 0.0000827, 0.0001545, 0.0001545, 0.0000908, 0.0001013, 0.000…
$ BP_PRE_1900 <dbl> 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 2e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1900_1918 <dbl> 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 1.0e-07, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_1919_1929 <dbl> 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 7.0e-07, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_1930_1939 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 6e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1945_1954 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1955_1964 <dbl> 1e-07, 0e+00, 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1965_1972 <dbl> 1.4e-06, 0.0e+00, 5.0e-07, 0.0e+00, 0.0e+00, 4.0e-07, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_1973_1982 <dbl> 1e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1983_1992 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_1993_1999 <dbl> 2e-07, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_2000_2009 <dbl> 1.0e-07, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+00, 0.0e+…
$ BP_2010_2015 <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00,…
$ BP_UNKNOWN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ education <dbl> 0.0001585, 0.0001585, 0.0001529, 0.0001529, 0.0001529, 0.0001485, 0.0001485, 0.0000545, 0.0000702, 0.000…
I can share the csv files if needed.
R 4.4.0, RStudio 2024.04.2 Build 764, Windows 11