How to merge (fully join) SUBSETS from the same original data frame?

lostandfound8 · September 6, 2022, 4:06pm

Hey there RStudio-Community! I hope someone can help me out here:

1) Context

I am a student from Germany currently working on my master thesis in educational science and therefore working with PIAAC datasets from various countries using EdSurvey in R-Studio. (Also I would call myself a beginner in R.)

So far I used EdSurveys getData-function to convert the edSurvey.data.frames to regular data.frames, so I can use basic R-functions for data-manipulation. I then created various subsets from the same original data.frame in order to add new variables to these. The new variables are the same for each subset, but differentiate in their response category (LitLevel: 1, 2 or 3 - depending on the subset and LitLow, LitAve and LitHigh each coded binary 0/1). Each row has an ID called "seqid".
(see R-Script below for Details.)

2) Aim

Now I would like to merge these subsets (not data.frames!) again (using their ID "seqid"). I specifically do not want new rows for each subset I manipulated (and therefore have double ID's with Variables that are 0/1 or NA, which would distort my subsequent regression analysis), but would like them to merge accordingly (all in one row, by their ID "seqid"). The latter is what I had mistakenly done the first time (see R-Script below).

3) Problem and attempted approaches

So far I have tried various functions (i.e. merge, dplyr/full_join, combine.subsets), yet always receive errors such as:

Error in `dplyr::full_join()`:
! `suffix` must be a character vector of length 2.
ℹ `suffix` is a `data.frame` object of length 124.

or

Error in `dplyr::full_join()`:
! Join columns must be present in data.
✖ Problem with `sdf_deuRaw$seqid`.
---
Backtrace:
 1. dplyr::full_join(...)
 2. dplyr:::full_join.data.frame(...)

or

Error in fix.by(by.x, x) :
   'by' must specify one or more columns as numbers, names, or logical values

or

Error in combine.subsets(sdf_deuRaw_LitLow, sdf_deuRaw_LitAve, sdf_deuRaw_LitHigh, :
   could not find function "combine.subsets"

4) desired solution / question
Is there a way to fully join these subsets or is there an alternative to prepare the data.frames I would like to use for the regression analysis? I would like to do two binary logistic regressions, the first between LitLow and LitAve and the second between LitAve and LitHigh (see R-Script below for Details).

5) Additional Info
If required I can give more Info regarding my R-Script or answer questions about the thesis.

Excerpt of R-Script for more Details

# 1.1.2.1) (requires) creating regular data.frame via getData

# Germany
sdf_deuRaw <- getData(
  data = sdf_deu,
  varnames = c('seqid', 'lit', 'num', 'psl', 'spfwt0',
               'ageg10lfs', 'gender_r', 'nativelang',
               'readytolearn_wle_ca', 'pared', 'edlevel3', 'j_q02a', 'j_q03a'
               ),
  drop = FALSE,
  dropUnusedLevels = TRUE,
  omittedLevels = TRUE,
  defaultConditions = TRUE,
  formula = NULL,
  recode = NULL,
  includeNaLabel = FALSE,
  addAttributes = FALSE,
  returnJKreplicates = TRUE
)
head(sdf_deuRaw)
dim(sdf_deuRaw) # N=2899 Variables=120

# --------------------------------------------------------------------------------------------

# a) subsetting each competence sample and creating new variable naming each level 
# LitLevel: Low (1), Average (2) and High (3) 
# LitLow / LitAve / LitHigh (for each Litlevel): no (0) and yes (1)

  # Literacy (N=1542)
  
    # Low
sdf_deuRaw_LitLow <- subset(x = sdf_deuRaw,
                          subset = pvlit1 < 226 & pvlit2 < 226 & pvlit3 < 226 & pvlit4 < 226 & pvlit5 < 226 & pvlit6 < 226 & pvlit7 < 226 & pvlit8 < 226 & pvlit9 < 226 & pvlit10 < 226,
                          inside = FALSE)
dim(sdf_deuRaw_LitLow) # N=100 Variables=120

sdf_deuRaw_LitLow$LitLevel <- 1 
sdf_deuRaw_LitLow$LitLow <- 1
sdf_deuRaw_LitLow$LitAve <- 0
sdf_deuRaw_LitLow$LitHigh <- 0

dim(sdf_deuRaw_LitLow) # N=100 Variables=124

    # Average
sdf_deuRaw_LitAve <- subset(x = sdf_deuRaw,
                          subset = (pvlit1 >=226 & pvlit1 <326) & (pvlit2 >=226 & pvlit2 <326) & (pvlit3 >=226 & pvlit3 <326) & (pvlit4 >=226 & pvlit4 <326) & (pvlit5 >=226 & pvlit5 <326) & (pvlit6 >=226 & pvlit6 <326) & (pvlit7 >=226 & pvlit7 <326) & (pvlit8 >=226 & pvlit8 <326) & (pvlit9 >=226 & pvlit9 <326) & (pvlit10 >=226 & pvlit10 <326),
                          inside = FALSE)
dim(sdf_deuRaw_LitAve) # N=1371 Variables=120

sdf_deuRaw_LitAve$LitLevel <- 2 
sdf_deuRaw_LitAve$LitLow <- 0
sdf_deuRaw_LitAve$LitAve <- 1
sdf_deuRaw_LitAve$LitHigh <- 0

dim(sdf_deuRaw_LitAve) # N=1371 Variables=124

    # High
sdf_deuRaw_LitHigh <- subset(x = sdf_deuRaw,
                         subset = pvlit1 >=326 & pvlit2 >=326 & pvlit3 >=326 & pvlit4 >=326 & pvlit5 >=326 & pvlit6 >=326 & pvlit7 >=326 & pvlit8 >=326 & pvlit9 >=326 & pvlit10 >=326,
                         inside = FALSE)
dim(sdf_deuRaw_LitHigh) # N=71 Variables=120

sdf_deuRaw_LitHigh$LitLevel <- 3 
sdf_deuRaw_LitHigh$LitLow <- 0
sdf_deuRaw_LitHigh$LitAve <- 0
sdf_deuRaw_LitHigh$LitHigh <- 1

dim(sdf_deuRaw_LitHigh) # N=71 Variables=124


  # Numeracy (N=1389)

    # Low
sdf_deuRaw_NumLow <- subset(x = sdf_deuRaw,
                            subset = pvnum1 < 226 & pvnum2 < 226 & pvnum3 < 226 & pvnum4 < 226 & pvnum5 < 226 & pvnum6 < 226 & pvnum7 < 226 & pvnum8 < 226 & pvnum9 < 226 & pvnum10 < 226,
                            inside = FALSE)
dim(sdf_deuRaw_NumLow) # N=109 Variables=120

sdf_deuRaw_NumLow$NumLevel <- 1 
sdf_deuRaw_NumLow$NumLow <- 1
sdf_deuRaw_NumLow$NumAve <- 0
sdf_deuRaw_NumLow$NumHigh <- 0

dim(sdf_deuRaw_NumLow) # N=109 Variables=124

    # Average
sdf_deuRaw_NumAve <- subset(x = sdf_deuRaw,
                            subset = (pvnum1 >=226 & pvnum1 <326) & (pvnum2 >=226 & pvnum2 <326) & (pvnum3 >=226 & pvnum3 <326) & (pvnum4 >=226 & pvnum4 <326) & (pvnum5 >=226 & pvnum5 <326) & (pvnum6 >=226 & pvnum6 <326) & (pvnum7 >=226 & pvnum7 <326) & (pvnum8 >=226 & pvnum8 <326) & (pvnum9 >=226 & pvnum9 <326) & (pvnum10 >=226 & pvnum10 <326),
                            inside = FALSE)
dim(sdf_deuRaw_NumAve) # N=1148 Variables=120

sdf_deuRaw_NumAve$NumLevel <- 2 
sdf_deuRaw_NumAve$NumLow <- 0
sdf_deuRaw_NumAve$NumAve <- 1
sdf_deuRaw_NumAve$NumHigh <- 0

dim(sdf_deuRaw_NumAve) # N=1148 Variables=124

    # High
sdf_deuRaw_NumHigh <- subset(x = sdf_deuRaw,
                             subset = pvnum1 >=326 & pvnum2 >=326 & pvnum3 >=326 & pvnum4 >=326 & pvnum5 >=326 & pvnum6 >=326 & pvnum7 >=326 & pvnum8 >=326 & pvnum9 >=326 & pvnum10 >=326,
                             inside = FALSE)
dim(sdf_deuRaw_NumHigh) # N=132 Variables=120

sdf_deuRaw_NumHigh$NumLevel <- 3 
sdf_deuRaw_NumHigh$NumLow <- 0
sdf_deuRaw_NumHigh$NumAve <- 0
sdf_deuRaw_NumHigh$NumHigh <- 1

dim(sdf_deuRaw_NumHigh) # N=132 Variables=124


  # PSL / ICT (N=1233)

    # Low
sdf_deuRaw_pslLow <- subset(x = sdf_deuRaw,
                            subset = pvpsl1 < 226 & pvpsl2 < 226 & pvpsl3 < 226 & pvpsl4 < 226 & pvpsl5 < 226 & pvpsl6 < 226 & pvpsl7 < 226 & pvpsl8 < 226 & pvpsl9 < 226 & pvpsl10 < 226,
                            inside = FALSE)
dim(sdf_deuRaw_pslLow) # N=49 Variables=120

sdf_deuRaw_pslLow$pslLevel <- 1 
sdf_deuRaw_pslLow$pslLow <- 1
sdf_deuRaw_pslLow$pslAve <- 0
sdf_deuRaw_pslLow$pslHigh <- 0

dim(sdf_deuRaw_pslLow) # N=49 Variables=124
  
    # Average
sdf_deuRaw_pslAve <- subset(x = sdf_deuRaw,
                            subset = (pvpsl1 >=226 & pvpsl1 <326) & (pvpsl2 >=226 & pvpsl2 <326) & (pvpsl3 >=226 & pvpsl3 <326) & (pvpsl4 >=226 & pvpsl4 <326) & (pvpsl5 >=226 & pvpsl5 <326) & (pvpsl6 >=226 & pvpsl6 <326) & (pvpsl7 >=226 & pvpsl7 <326) & (pvpsl8 >=226 & pvpsl8 <326) & (pvpsl9 >=226 & pvpsl9 <326) & (pvpsl10 >=226 & pvpsl10 <326),
                            inside = FALSE)
dim(sdf_deuRaw_pslAve) # N=1120 Variables=120

sdf_deuRaw_pslAve$pslLevel <- 2 
sdf_deuRaw_pslAve$pslLow <- 0
sdf_deuRaw_pslAve$pslAve <- 1
sdf_deuRaw_pslAve$pslHigh <- 0

dim(sdf_deuRaw_pslAve) # N=1120 Variables=124

    # High
sdf_deuRaw_pslHigh <- subset(x = sdf_deuRaw,
                             subset = pvpsl1 >=326 & pvpsl2 >=326 & pvpsl3 >=326 & pvpsl4 >=326 & pvpsl5 >=326 & pvpsl6 >=326 & pvpsl7 >=326 & pvpsl8 >=326 & pvpsl9 >=326 & pvpsl10 >=326,
                             inside = FALSE)
dim(sdf_deuRaw_pslHigh) # N=84 Variables=120

sdf_deuRaw_pslHigh$pslLevel <- 3 
sdf_deuRaw_pslHigh$pslLow <- 0
sdf_deuRaw_pslHigh$pslAve <- 0
sdf_deuRaw_pslHigh$pslHigh <- 1

dim(sdf_deuRaw_pslHigh) # N=84 Variables=124

# --------------------------------------------------------------------------------------------

# b) combine to new data.frame and then rebind to light.edSurvey.data.frame, so edSurvey-functions can be used for further analysis

sdf_deuRaw_List <- list(sdf_deuRaw_LitLow, sdf_deuRaw_LitAve, sdf_deuRaw_LitHigh,
                        sdf_deuRaw_NumLow, sdf_deuRaw_NumAve, sdf_deuRaw_NumHigh,
                        sdf_deuRaw_pslLow, sdf_deuRaw_pslAve, sdf_deuRaw_pslHigh)
sdf_deuRaw_Final <- Reduce(function(x, y) merge(x, y, all=TRUE), sdf_deuRaw_List)

sdf_deu_rebinded <- rebindAttributes(sdf_deuRaw_Final, sdf_deu) # final (deu) light.edSurvey.data.frame for further analysis

dim(sdf_deu_rebinded) # N=4090 Variables=131

# --------------------------------------------------------------------------------------------

# c) creating suitable subsets for logistic regressions

# Literacy
sdf_deu_rebinded_EX_LitHigh <- subset(sdf_deu_rebinded, LitHigh !=1)
dim(sdf_deu_rebinded_EX_LitHigh) # N=4019; Variables=131

sdf_deu_rebinded_EX_LitLow <- subset(sdf_deu_rebinded, LitLow !=1)
dim(sdf_deu_rebinded_EX_LitLow) # N=3990; Variables=131


# Numeracy
sdf_deu_rebinded_EX_NumHigh <- subset(sdf_deu_rebinded, NumHigh !=1)
dim(sdf_deu_rebinded_EX_NumHigh) # N=3958; Variables=131

sdf_deu_rebinded_EX_NumLow <- subset(sdf_deu_rebinded, NumLow !=1)
dim(sdf_deu_rebinded_EX_NumLow) # N=3981; Variables=131


# PSL
sdf_deu_rebinded_EX_pslHigh <- subset(sdf_deu_rebinded, pslHigh !=1)
dim(sdf_deu_rebinded_EX_pslHigh) # N=4006; Variables=131

sdf_deu_rebinded_EX_pslLow <- subset(sdf_deu_rebinded, pslLow !=1)
dim(sdf_deu_rebinded_EX_pslLow) # N=4040; Variables=131

technocrat · September 7, 2022, 5:43am

Restatement of the problem f(x) = y, where x is the data frame to start and y is the data frame to be constructed by f, a composite function. f begins by dividing x into unique subsets and adding one or more new variables to each subset, which may be the same as or different from variables added to other subsets. All or some variables present in x are present in each subset. Subsets have, or may have, differing numbers of rows. All subsets are to be combined into a data frame containing all rows and columns of the original data frame x and all rows and columns created by operations on the subsets of x.

One way to do this, illustrated below, is having a key variable, in this case rn that uniquely identifies each row of x and is carried over to each subset. Following the operations on each subset, all variables except the key and the new variables are joined to the original data frame.

library(data.table)
# some data
DT <- data.table(mtcars, keep.rownames = TRUE)
# subset
sub1 <- DT[gear == 5 & carb == 2,]
# add new variable
sub1$type <- "Sports"
# join to original 
dplyr::left_join(DT,sub1[,c("rn","type")], by = "rn")
#>                      rn  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#>  1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#>  2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#>  3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#>  4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#>  5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#>  6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#>  7:          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#>  8:           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#>  9:            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10:            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11:           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12:          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13:          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14:         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15:  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16: Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17:   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18:            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19:         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20:      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21:       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22:    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23:         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24:          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25:    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26:           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27:       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28:        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29:      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30:        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31:       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32:          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#>                      rn  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#>       type
#>  1:   <NA>
#>  2:   <NA>
#>  3:   <NA>
#>  4:   <NA>
#>  5:   <NA>
#>  6:   <NA>
#>  7:   <NA>
#>  8:   <NA>
#>  9:   <NA>
#> 10:   <NA>
#> 11:   <NA>
#> 12:   <NA>
#> 13:   <NA>
#> 14:   <NA>
#> 15:   <NA>
#> 16:   <NA>
#> 17:   <NA>
#> 18:   <NA>
#> 19:   <NA>
#> 20:   <NA>
#> 21:   <NA>
#> 22:   <NA>
#> 23:   <NA>
#> 24:   <NA>
#> 25:   <NA>
#> 26:   <NA>
#> 27: Sports
#> 28: Sports
#> 29:   <NA>
#> 30:   <NA>
#> 31:   <NA>
#> 32:   <NA>
#>       type

^{Created on 2022-09-06 by the reprex package (v2.0.1)}

system · September 14, 2022, 5:44am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.