Examples of interconnected datasets

Hi everyone,

This is the beginning of what I hope will become a very long thread of quality responses from the community. Just like many of you here, I teach some data science with R using the tidyverse. I must say that my favorite datasets to do this are the data in the nycflights13 package, namely: flights, weather, airport, planes and airlines. This is because of how rich they are in terms of size and multiplicity of data types (e.g. numeric, character, date, ...). You can manipulate, join, visualize, hack and slice them in so many interesting ways. I use them to teach: dplyr, ggplot2, tidyr, purrr, lubridate, ... And I am about to start doing some modeling with the data to answer the question: which of the weather measures provided has the highest explanatory power to predict flight delays? In short, nycflights13 is awesome.

I am reaching out to you, the community, to ask if you have other examples of publicly available, interconnected, rich datasets like that. It would be great to not have to use the same datasets over and over again.

If you are participating, please provide the source of the data and also provide a brief explanation on how the different parts are interconnected.

Thank you.

1 Like

I have a few datasets available through AWS.

The first is approximately 9Kx17 binary table suitable for logistic regression topics. It derives from a medical database with a response variable (internal injuries) and 16 treatment variates (skeletal injuries). All fields are binary. The field name are completely anonymized to protect the intellectual property of the compiler.

The second, of more general interest is approximately 100K records with the following fields

MariaDB [dlf]> describe y7;
+----------+---------------+------+-----+---------+-------+
| Field    | Type          | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| city     | varchar(25)   | YES  |     | NULL    |       |
| cltv     | decimal(10,0) | YES  |     | NULL    |       |
| ctapeno  | decimal(10,0) | YES  |     | NULL    |       |
| deal     | varchar(25)   | YES  |     | NULL    |       |
| down     | int(11)       | YES  |     | NULL    |       |
| dti      | decimal(10,0) | YES  |     | NULL    |       |
| dtype    | varchar(25)   | YES  |     | NULL    |       |
| ebal     | decimal(10,0) | YES  |     | NULL    |       |
| fico     | decimal(10,0) | YES  |     | NULL    |       |
| fpd      | date          | YES  |     | NULL    |       |
| gainloss | decimal(10,0) | YES  |     | NULL    |       |
| grade    | varchar(5)    | YES  |     | NULL    |       |
| irate    | decimal(10,0) | YES  |     | NULL    |       |
| issuer   | varchar(10)   | YES  |     | NULL    |       |
| lat      | decimal(10,0) | YES  |     | NULL    |       |
| lien     | int(11)       | YES  |     | NULL    |       |
| liq      | decimal(10,0) | YES  |     | NULL    |       |
| lng      | decimal(10,0) | YES  |     | NULL    |       |
| lstatus  | int(11)       | YES  |     | NULL    |       |
| ltype    | varchar(25)   | YES  |     | NULL    |       |
| margin   | decimal(10,0) | YES  |     | NULL    |       |
| metro    | varchar(25)   | YES  |     | NULL    |       |
| nrate    | decimal(10,0) | YES  |     | NULL    |       |
| obal     | decimal(10,0) | YES  |     | NULL    |       |
| odate    | date          | YES  |     | NULL    |       |
| oltv     | decimal(10,0) | YES  |     | NULL    |       |
| orate    | decimal(10,0) | YES  |     | NULL    |       |
| oterm    | int(11)       | YES  |     | NULL    |       |
| payments | int(11)       | YES  |     | NULL    |       |
| pmiflag  | int(11)       | YES  |     | NULL    |       |
| pocode   | int(11)       | YES  |     | NULL    |       |
| pod      | date          | YES  |     | NULL    |       |
| ppp      | int(11)       | YES  |     | NULL    |       |
| ptd      | date          | YES  |     | NULL    |       |
| purpose  | varchar(25)   | YES  |     | NULL    |       |
| remit    | decimal(10,0) | YES  |     | NULL    |       |
| sbal     | decimal(10,0) | YES  |     | NULL    |       |
| servno   | decimal(10,0) | YES  |     | NULL    |       |
| sint     | decimal(10,0) | YES  |     | NULL    |       |
| sprin    | decimal(10,0) | YES  |     | NULL    |       |
| spymt    | decimal(10,0) | YES  |     | NULL    |       |
| st       | varchar(4)    | YES  |     | NULL    |       |
| zip      | int(11)       | YES  |     | NULL    |       |
| otype    | varchar(25)   | YES  |     | NULL    |       |
| rdate    | date          | YES  |     | NULL    |       |
| ptype    | varchar(15)   | YES  |     | NULL    |       |
| dptd     | int(7)        | YES  |     | NULL    |       |
| dfpd     | int(7)        | YES  |     | NULL    |       |
+----------+---------------+------+-----+---------+-------+
48 rows in set (0.004 sec)

It derives from publicly available data on underwriting criteria and loan performance of subprime mortgage backed securities issued in 2006. If an SQL server is not available to host the source file (which is how I've used it, due to memory constraints when the database was assembled in 2007), I can convert it to a csv file readily, and provide a codesheet.

The attraction of this data set is the combination of date, quantitative, categorical, binary, and geographic data variables and the opportunity to compare loan characteristics at origination with payment history.

It's big enough to provide large n power for machine learning training/testing/validation sets.

I would want to anonymize the data compilation to mask its origin, but that would not detract from its usefulness.

Finally, I have a 35K structured email set derived from the Enron Corpus described in my paper on application of latent network analysis to identify users of interest more efficiently than keyword analysis in litigation discovery. An rda tibble is available. This is also suitable for NLP use.

In addition, the caret::GermanCredit dataset has 1K records of credit evaluations ("good/bad") based on multiple underwriting criteria.

This dataset can be used either for lm or glm

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
suppressPackageStartupMessages(library(dplyr)) 
data(GermanCredit)
# Convert "good/bad" factor to 1/0 integer
credit <- GermanCredit %>% mutate(Score = as.integer(Class)-1) %>% select(Score, -Class, everything())
# examine saturated logistic model with the converted Class, good/bad credit as response variable
# logistic model
saturated <- glm(Score ~ Duration + Amount + InstallmentRatePercentage +  
ResidenceDuration  + Age + NumberExistingCredits + NumberPeopleMaintenance + Telephone + ForeignWorker + CheckingAccountStatus.lt.0 + CheckingAccountStatus.0.to.200 + CheckingAccountStatus.gt.200 + CheckingAccountStatus.none + CreditHistory.NoCredit.AllPaid + CreditHistory.ThisBank.AllPaid + CreditHistory.PaidDuly + CreditHistory.Delay + CreditHistory.Critical + Purpose.NewCar + Purpose.UsedCar + Purpose.Furniture.Equipment + Purpose.Radio.Television + Purpose.DomesticAppliance + Purpose.Repairs + Purpose.Education + Purpose.Vacation + Purpose.Retraining + Purpose.Business + Purpose.Other + SavingsAccountBonds.lt.100 + SavingsAccountBonds.100.to.500 + SavingsAccountBonds.500.to.1000 + SavingsAccountBonds.gt.1000 + SavingsAccountBonds.Unknown + EmploymentDuration.lt.1 + EmploymentDuration.1.to.4 + EmploymentDuration.4.to.7 + EmploymentDuration.gt.7 + EmploymentDuration.Unemployed + Personal.Male.Divorced.Seperated + Personal.Female.NotSingle + Personal.Male.Single + Personal.Male.Married.Widowed + Personal.Female.Single + OtherDebtorsGuarantors.None + OtherDebtorsGuarantors.CoApplicant + OtherDebtorsGuarantors.Guarantor + Property.RealEstate + Property.Insurance + Property.CarOther + Property.Unknown + OtherInstallmentPlans.Bank + OtherInstallmentPlans.Stores + OtherInstallmentPlans.None + Housing.Rent + Housing.Own + Housing.ForFree + Job.UnemployedUnskilled + Job.UnskilledResident + Job.SkilledEmployee + Job.Management.SelfEmp.HighlyQualified, data = credit)
summary(saturated)
#> 
#> Call:
#> glm(formula = Score ~ Duration + Amount + InstallmentRatePercentage + 
#>     ResidenceDuration + Age + NumberExistingCredits + NumberPeopleMaintenance + 
#>     Telephone + ForeignWorker + CheckingAccountStatus.lt.0 + 
#>     CheckingAccountStatus.0.to.200 + CheckingAccountStatus.gt.200 + 
#>     CheckingAccountStatus.none + CreditHistory.NoCredit.AllPaid + 
#>     CreditHistory.ThisBank.AllPaid + CreditHistory.PaidDuly + 
#>     CreditHistory.Delay + CreditHistory.Critical + Purpose.NewCar + 
#>     Purpose.UsedCar + Purpose.Furniture.Equipment + Purpose.Radio.Television + 
#>     Purpose.DomesticAppliance + Purpose.Repairs + Purpose.Education + 
#>     Purpose.Vacation + Purpose.Retraining + Purpose.Business + 
#>     Purpose.Other + SavingsAccountBonds.lt.100 + SavingsAccountBonds.100.to.500 + 
#>     SavingsAccountBonds.500.to.1000 + SavingsAccountBonds.gt.1000 + 
#>     SavingsAccountBonds.Unknown + EmploymentDuration.lt.1 + EmploymentDuration.1.to.4 + 
#>     EmploymentDuration.4.to.7 + EmploymentDuration.gt.7 + EmploymentDuration.Unemployed + 
#>     Personal.Male.Divorced.Seperated + Personal.Female.NotSingle + 
#>     Personal.Male.Single + Personal.Male.Married.Widowed + Personal.Female.Single + 
#>     OtherDebtorsGuarantors.None + OtherDebtorsGuarantors.CoApplicant + 
#>     OtherDebtorsGuarantors.Guarantor + Property.RealEstate + 
#>     Property.Insurance + Property.CarOther + Property.Unknown + 
#>     OtherInstallmentPlans.Bank + OtherInstallmentPlans.Stores + 
#>     OtherInstallmentPlans.None + Housing.Rent + Housing.Own + 
#>     Housing.ForFree + Job.UnemployedUnskilled + Job.UnskilledResident + 
#>     Job.SkilledEmployee + Job.Management.SelfEmp.HighlyQualified, 
#>     data = credit)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -1.02398  -0.31980   0.08878   0.28380   0.94333  
#> 
#> Coefficients: (13 not defined because of singularities)
#>                                          Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                             1.742e+00  2.049e-01   8.500  < 2e-16
#> Duration                               -4.724e-03  1.496e-03  -3.158 0.001639
#> Amount                                 -1.801e-05  7.130e-06  -2.525 0.011726
#> InstallmentRatePercentage              -4.525e-02  1.326e-02  -3.412 0.000673
#> ResidenceDuration                       5.024e-04  1.324e-02   0.038 0.969738
#> Age                                     1.734e-03  1.343e-03   1.291 0.197116
#> NumberExistingCredits                  -3.811e-02  2.801e-02  -1.360 0.174030
#> NumberPeopleMaintenance                -3.341e-02  3.804e-02  -0.878 0.380039
#> Telephone                              -4.419e-02  2.964e-02  -1.491 0.136289
#> ForeignWorker                          -1.453e-01  7.052e-02  -2.060 0.039625
#> CheckingAccountStatus.lt.0             -2.702e-01  3.450e-02  -7.832 1.28e-14
#> CheckingAccountStatus.0.to.200         -1.918e-01  3.370e-02  -5.690 1.69e-08
#> CheckingAccountStatus.gt.200           -8.319e-02  5.575e-02  -1.492 0.136000
#> CheckingAccountStatus.none                     NA         NA      NA       NA
#> CreditHistory.NoCredit.AllPaid         -2.579e-01  7.177e-02  -3.593 0.000344
#> CreditHistory.ThisBank.AllPaid         -2.672e-01  6.940e-02  -3.850 0.000126
#> CreditHistory.PaidDuly                 -1.157e-01  3.651e-02  -3.168 0.001585
#> CreditHistory.Delay                    -7.039e-02  5.113e-02  -1.377 0.168909
#> CreditHistory.Critical                         NA         NA      NA       NA
#> Purpose.NewCar                         -2.504e-01  1.254e-01  -1.996 0.046169
#> Purpose.UsedCar                        -1.870e-02  1.278e-01  -0.146 0.883696
#> Purpose.Furniture.Equipment            -1.224e-01  1.271e-01  -0.963 0.335562
#> Purpose.Radio.Television               -1.163e-01  1.265e-01  -0.919 0.358252
#> Purpose.DomesticAppliance              -1.730e-01  1.719e-01  -1.007 0.314373
#> Purpose.Repairs                        -2.100e-01  1.510e-01  -1.391 0.164457
#> Purpose.Education                      -2.719e-01  1.364e-01  -1.994 0.046399
#> Purpose.Vacation                               NA         NA      NA       NA
#> Purpose.Retraining                      4.091e-03  1.852e-01   0.022 0.982380
#> Purpose.Business                       -1.324e-01  1.291e-01  -1.026 0.305118
#> Purpose.Other                                  NA         NA      NA       NA
#> SavingsAccountBonds.lt.100             -1.249e-01  3.569e-02  -3.499 0.000488
#> SavingsAccountBonds.100.to.500         -7.366e-02  5.081e-02  -1.450 0.147444
#> SavingsAccountBonds.500.to.1000        -4.392e-02  5.941e-02  -0.739 0.459967
#> SavingsAccountBonds.gt.1000             2.277e-02  6.612e-02   0.344 0.730644
#> SavingsAccountBonds.Unknown                    NA         NA      NA       NA
#> EmploymentDuration.lt.1                 7.409e-03  6.942e-02   0.107 0.915031
#> EmploymentDuration.1.to.4               3.196e-02  6.636e-02   0.482 0.630156
#> EmploymentDuration.4.to.7               1.210e-01  6.938e-02   1.744 0.081516
#> EmploymentDuration.gt.7                 5.353e-02  6.617e-02   0.809 0.418758
#> EmploymentDuration.Unemployed                  NA         NA      NA       NA
#> Personal.Male.Divorced.Seperated       -8.416e-02  7.316e-02  -1.150 0.250309
#> Personal.Female.NotSingle              -2.941e-02  4.898e-02  -0.600 0.548420
#> Personal.Male.Single                    4.492e-02  4.869e-02   0.922 0.356504
#> Personal.Male.Married.Widowed                  NA         NA      NA       NA
#> Personal.Female.Single                         NA         NA      NA       NA
#> OtherDebtorsGuarantors.None            -1.651e-01  6.058e-02  -2.725 0.006540
#> OtherDebtorsGuarantors.CoApplicant     -2.416e-01  8.660e-02  -2.790 0.005381
#> OtherDebtorsGuarantors.Guarantor               NA         NA      NA       NA
#> Property.RealEstate                     1.218e-01  6.350e-02   1.918 0.055470
#> Property.Insurance                      7.741e-02  6.253e-02   1.238 0.216028
#> Property.CarOther                       8.894e-02  6.079e-02   1.463 0.143746
#> Property.Unknown                               NA         NA      NA       NA
#> OtherInstallmentPlans.Bank             -8.194e-02  3.901e-02  -2.100 0.035980
#> OtherInstallmentPlans.Stores           -8.112e-02  6.230e-02  -1.302 0.193181
#> OtherInstallmentPlans.None                     NA         NA      NA       NA
#> Housing.Rent                           -1.192e-01  7.215e-02  -1.652 0.098931
#> Housing.Own                            -4.808e-02  6.886e-02  -0.698 0.485195
#> Housing.ForFree                                NA         NA      NA       NA
#> Job.UnemployedUnskilled                 7.048e-02  1.023e-01   0.689 0.491076
#> Job.UnskilledResident                  -4.170e-03  5.386e-02  -0.077 0.938314
#> Job.SkilledEmployee                    -1.553e-02  4.396e-02  -0.353 0.723941
#> Job.Management.SelfEmp.HighlyQualified         NA         NA      NA       NA
#>                                           
#> (Intercept)                            ***
#> Duration                               ** 
#> Amount                                 *  
#> InstallmentRatePercentage              ***
#> ResidenceDuration                         
#> Age                                       
#> NumberExistingCredits                     
#> NumberPeopleMaintenance                   
#> Telephone                                 
#> ForeignWorker                          *  
#> CheckingAccountStatus.lt.0             ***
#> CheckingAccountStatus.0.to.200         ***
#> CheckingAccountStatus.gt.200              
#> CheckingAccountStatus.none                
#> CreditHistory.NoCredit.AllPaid         ***
#> CreditHistory.ThisBank.AllPaid         ***
#> CreditHistory.PaidDuly                 ** 
#> CreditHistory.Delay                       
#> CreditHistory.Critical                    
#> Purpose.NewCar                         *  
#> Purpose.UsedCar                           
#> Purpose.Furniture.Equipment               
#> Purpose.Radio.Television                  
#> Purpose.DomesticAppliance                 
#> Purpose.Repairs                           
#> Purpose.Education                      *  
#> Purpose.Vacation                          
#> Purpose.Retraining                        
#> Purpose.Business                          
#> Purpose.Other                             
#> SavingsAccountBonds.lt.100             ***
#> SavingsAccountBonds.100.to.500            
#> SavingsAccountBonds.500.to.1000           
#> SavingsAccountBonds.gt.1000               
#> SavingsAccountBonds.Unknown               
#> EmploymentDuration.lt.1                   
#> EmploymentDuration.1.to.4                 
#> EmploymentDuration.4.to.7              .  
#> EmploymentDuration.gt.7                   
#> EmploymentDuration.Unemployed             
#> Personal.Male.Divorced.Seperated          
#> Personal.Female.NotSingle                 
#> Personal.Male.Single                      
#> Personal.Male.Married.Widowed             
#> Personal.Female.Single                    
#> OtherDebtorsGuarantors.None            ** 
#> OtherDebtorsGuarantors.CoApplicant     ** 
#> OtherDebtorsGuarantors.Guarantor          
#> Property.RealEstate                    .  
#> Property.Insurance                        
#> Property.CarOther                         
#> Property.Unknown                          
#> OtherInstallmentPlans.Bank             *  
#> OtherInstallmentPlans.Stores              
#> OtherInstallmentPlans.None                
#> Housing.Rent                           .  
#> Housing.Own                               
#> Housing.ForFree                           
#> Job.UnemployedUnskilled                   
#> Job.UnskilledResident                     
#> Job.SkilledEmployee                       
#> Job.Management.SelfEmp.HighlyQualified    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.1586705)
#> 
#>     Null deviance: 210.0  on 999  degrees of freedom
#> Residual deviance: 150.9  on 951  degrees of freedom
#> AIC: 1046.7
#> 
#> Number of Fisher Scoring iterations: 2
# Second approach
# Create a NPV from Duration + Amount + InstallmentRatePercentage
# OLS linear regression

Created on 2020-01-17 by the reprex package (v0.3.0)

I'm available gratis remotely or by recording to provide domain context for non-profit institutions if helpful.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.