I have a few datasets available through AWS.
The first is approximately 9Kx17 binary table suitable for logistic regression topics. It derives from a medical database with a response variable (internal injuries) and 16 treatment variates (skeletal injuries). All fields are binary. The field name are completely anonymized to protect the intellectual property of the compiler.
The second, of more general interest is approximately 100K records with the following fields
MariaDB [dlf]> describe y7;
+----------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| city | varchar(25) | YES | | NULL | |
| cltv | decimal(10,0) | YES | | NULL | |
| ctapeno | decimal(10,0) | YES | | NULL | |
| deal | varchar(25) | YES | | NULL | |
| down | int(11) | YES | | NULL | |
| dti | decimal(10,0) | YES | | NULL | |
| dtype | varchar(25) | YES | | NULL | |
| ebal | decimal(10,0) | YES | | NULL | |
| fico | decimal(10,0) | YES | | NULL | |
| fpd | date | YES | | NULL | |
| gainloss | decimal(10,0) | YES | | NULL | |
| grade | varchar(5) | YES | | NULL | |
| irate | decimal(10,0) | YES | | NULL | |
| issuer | varchar(10) | YES | | NULL | |
| lat | decimal(10,0) | YES | | NULL | |
| lien | int(11) | YES | | NULL | |
| liq | decimal(10,0) | YES | | NULL | |
| lng | decimal(10,0) | YES | | NULL | |
| lstatus | int(11) | YES | | NULL | |
| ltype | varchar(25) | YES | | NULL | |
| margin | decimal(10,0) | YES | | NULL | |
| metro | varchar(25) | YES | | NULL | |
| nrate | decimal(10,0) | YES | | NULL | |
| obal | decimal(10,0) | YES | | NULL | |
| odate | date | YES | | NULL | |
| oltv | decimal(10,0) | YES | | NULL | |
| orate | decimal(10,0) | YES | | NULL | |
| oterm | int(11) | YES | | NULL | |
| payments | int(11) | YES | | NULL | |
| pmiflag | int(11) | YES | | NULL | |
| pocode | int(11) | YES | | NULL | |
| pod | date | YES | | NULL | |
| ppp | int(11) | YES | | NULL | |
| ptd | date | YES | | NULL | |
| purpose | varchar(25) | YES | | NULL | |
| remit | decimal(10,0) | YES | | NULL | |
| sbal | decimal(10,0) | YES | | NULL | |
| servno | decimal(10,0) | YES | | NULL | |
| sint | decimal(10,0) | YES | | NULL | |
| sprin | decimal(10,0) | YES | | NULL | |
| spymt | decimal(10,0) | YES | | NULL | |
| st | varchar(4) | YES | | NULL | |
| zip | int(11) | YES | | NULL | |
| otype | varchar(25) | YES | | NULL | |
| rdate | date | YES | | NULL | |
| ptype | varchar(15) | YES | | NULL | |
| dptd | int(7) | YES | | NULL | |
| dfpd | int(7) | YES | | NULL | |
+----------+---------------+------+-----+---------+-------+
48 rows in set (0.004 sec)
It derives from publicly available data on underwriting criteria and loan performance of subprime mortgage backed securities issued in 2006. If an SQL server is not available to host the source file (which is how I've used it, due to memory constraints when the database was assembled in 2007), I can convert it to a csv file readily, and provide a codesheet.
The attraction of this data set is the combination of date, quantitative, categorical, binary, and geographic data variables and the opportunity to compare loan characteristics at origination with payment history.
It's big enough to provide large n
power for machine learning training/testing/validation sets.
I would want to anonymize the data compilation to mask its origin, but that would not detract from its usefulness.
Finally, I have a 35K structured email set derived from the Enron Corpus
described in my paper on application of latent network analysis to identify users of interest more efficiently than keyword analysis in litigation discovery. An rda tibble
is available. This is also suitable for NLP
use.
In addition, the caret::GermanCredit
dataset has 1K records of credit evaluations ("good/bad") based on multiple underwriting criteria.
This dataset can be used either for lm
or glm
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
suppressPackageStartupMessages(library(dplyr))
data(GermanCredit)
# Convert "good/bad" factor to 1/0 integer
credit <- GermanCredit %>% mutate(Score = as.integer(Class)-1) %>% select(Score, -Class, everything())
# examine saturated logistic model with the converted Class, good/bad credit as response variable
# logistic model
saturated <- glm(Score ~ Duration + Amount + InstallmentRatePercentage +
ResidenceDuration + Age + NumberExistingCredits + NumberPeopleMaintenance + Telephone + ForeignWorker + CheckingAccountStatus.lt.0 + CheckingAccountStatus.0.to.200 + CheckingAccountStatus.gt.200 + CheckingAccountStatus.none + CreditHistory.NoCredit.AllPaid + CreditHistory.ThisBank.AllPaid + CreditHistory.PaidDuly + CreditHistory.Delay + CreditHistory.Critical + Purpose.NewCar + Purpose.UsedCar + Purpose.Furniture.Equipment + Purpose.Radio.Television + Purpose.DomesticAppliance + Purpose.Repairs + Purpose.Education + Purpose.Vacation + Purpose.Retraining + Purpose.Business + Purpose.Other + SavingsAccountBonds.lt.100 + SavingsAccountBonds.100.to.500 + SavingsAccountBonds.500.to.1000 + SavingsAccountBonds.gt.1000 + SavingsAccountBonds.Unknown + EmploymentDuration.lt.1 + EmploymentDuration.1.to.4 + EmploymentDuration.4.to.7 + EmploymentDuration.gt.7 + EmploymentDuration.Unemployed + Personal.Male.Divorced.Seperated + Personal.Female.NotSingle + Personal.Male.Single + Personal.Male.Married.Widowed + Personal.Female.Single + OtherDebtorsGuarantors.None + OtherDebtorsGuarantors.CoApplicant + OtherDebtorsGuarantors.Guarantor + Property.RealEstate + Property.Insurance + Property.CarOther + Property.Unknown + OtherInstallmentPlans.Bank + OtherInstallmentPlans.Stores + OtherInstallmentPlans.None + Housing.Rent + Housing.Own + Housing.ForFree + Job.UnemployedUnskilled + Job.UnskilledResident + Job.SkilledEmployee + Job.Management.SelfEmp.HighlyQualified, data = credit)
summary(saturated)
#>
#> Call:
#> glm(formula = Score ~ Duration + Amount + InstallmentRatePercentage +
#> ResidenceDuration + Age + NumberExistingCredits + NumberPeopleMaintenance +
#> Telephone + ForeignWorker + CheckingAccountStatus.lt.0 +
#> CheckingAccountStatus.0.to.200 + CheckingAccountStatus.gt.200 +
#> CheckingAccountStatus.none + CreditHistory.NoCredit.AllPaid +
#> CreditHistory.ThisBank.AllPaid + CreditHistory.PaidDuly +
#> CreditHistory.Delay + CreditHistory.Critical + Purpose.NewCar +
#> Purpose.UsedCar + Purpose.Furniture.Equipment + Purpose.Radio.Television +
#> Purpose.DomesticAppliance + Purpose.Repairs + Purpose.Education +
#> Purpose.Vacation + Purpose.Retraining + Purpose.Business +
#> Purpose.Other + SavingsAccountBonds.lt.100 + SavingsAccountBonds.100.to.500 +
#> SavingsAccountBonds.500.to.1000 + SavingsAccountBonds.gt.1000 +
#> SavingsAccountBonds.Unknown + EmploymentDuration.lt.1 + EmploymentDuration.1.to.4 +
#> EmploymentDuration.4.to.7 + EmploymentDuration.gt.7 + EmploymentDuration.Unemployed +
#> Personal.Male.Divorced.Seperated + Personal.Female.NotSingle +
#> Personal.Male.Single + Personal.Male.Married.Widowed + Personal.Female.Single +
#> OtherDebtorsGuarantors.None + OtherDebtorsGuarantors.CoApplicant +
#> OtherDebtorsGuarantors.Guarantor + Property.RealEstate +
#> Property.Insurance + Property.CarOther + Property.Unknown +
#> OtherInstallmentPlans.Bank + OtherInstallmentPlans.Stores +
#> OtherInstallmentPlans.None + Housing.Rent + Housing.Own +
#> Housing.ForFree + Job.UnemployedUnskilled + Job.UnskilledResident +
#> Job.SkilledEmployee + Job.Management.SelfEmp.HighlyQualified,
#> data = credit)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.02398 -0.31980 0.08878 0.28380 0.94333
#>
#> Coefficients: (13 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.742e+00 2.049e-01 8.500 < 2e-16
#> Duration -4.724e-03 1.496e-03 -3.158 0.001639
#> Amount -1.801e-05 7.130e-06 -2.525 0.011726
#> InstallmentRatePercentage -4.525e-02 1.326e-02 -3.412 0.000673
#> ResidenceDuration 5.024e-04 1.324e-02 0.038 0.969738
#> Age 1.734e-03 1.343e-03 1.291 0.197116
#> NumberExistingCredits -3.811e-02 2.801e-02 -1.360 0.174030
#> NumberPeopleMaintenance -3.341e-02 3.804e-02 -0.878 0.380039
#> Telephone -4.419e-02 2.964e-02 -1.491 0.136289
#> ForeignWorker -1.453e-01 7.052e-02 -2.060 0.039625
#> CheckingAccountStatus.lt.0 -2.702e-01 3.450e-02 -7.832 1.28e-14
#> CheckingAccountStatus.0.to.200 -1.918e-01 3.370e-02 -5.690 1.69e-08
#> CheckingAccountStatus.gt.200 -8.319e-02 5.575e-02 -1.492 0.136000
#> CheckingAccountStatus.none NA NA NA NA
#> CreditHistory.NoCredit.AllPaid -2.579e-01 7.177e-02 -3.593 0.000344
#> CreditHistory.ThisBank.AllPaid -2.672e-01 6.940e-02 -3.850 0.000126
#> CreditHistory.PaidDuly -1.157e-01 3.651e-02 -3.168 0.001585
#> CreditHistory.Delay -7.039e-02 5.113e-02 -1.377 0.168909
#> CreditHistory.Critical NA NA NA NA
#> Purpose.NewCar -2.504e-01 1.254e-01 -1.996 0.046169
#> Purpose.UsedCar -1.870e-02 1.278e-01 -0.146 0.883696
#> Purpose.Furniture.Equipment -1.224e-01 1.271e-01 -0.963 0.335562
#> Purpose.Radio.Television -1.163e-01 1.265e-01 -0.919 0.358252
#> Purpose.DomesticAppliance -1.730e-01 1.719e-01 -1.007 0.314373
#> Purpose.Repairs -2.100e-01 1.510e-01 -1.391 0.164457
#> Purpose.Education -2.719e-01 1.364e-01 -1.994 0.046399
#> Purpose.Vacation NA NA NA NA
#> Purpose.Retraining 4.091e-03 1.852e-01 0.022 0.982380
#> Purpose.Business -1.324e-01 1.291e-01 -1.026 0.305118
#> Purpose.Other NA NA NA NA
#> SavingsAccountBonds.lt.100 -1.249e-01 3.569e-02 -3.499 0.000488
#> SavingsAccountBonds.100.to.500 -7.366e-02 5.081e-02 -1.450 0.147444
#> SavingsAccountBonds.500.to.1000 -4.392e-02 5.941e-02 -0.739 0.459967
#> SavingsAccountBonds.gt.1000 2.277e-02 6.612e-02 0.344 0.730644
#> SavingsAccountBonds.Unknown NA NA NA NA
#> EmploymentDuration.lt.1 7.409e-03 6.942e-02 0.107 0.915031
#> EmploymentDuration.1.to.4 3.196e-02 6.636e-02 0.482 0.630156
#> EmploymentDuration.4.to.7 1.210e-01 6.938e-02 1.744 0.081516
#> EmploymentDuration.gt.7 5.353e-02 6.617e-02 0.809 0.418758
#> EmploymentDuration.Unemployed NA NA NA NA
#> Personal.Male.Divorced.Seperated -8.416e-02 7.316e-02 -1.150 0.250309
#> Personal.Female.NotSingle -2.941e-02 4.898e-02 -0.600 0.548420
#> Personal.Male.Single 4.492e-02 4.869e-02 0.922 0.356504
#> Personal.Male.Married.Widowed NA NA NA NA
#> Personal.Female.Single NA NA NA NA
#> OtherDebtorsGuarantors.None -1.651e-01 6.058e-02 -2.725 0.006540
#> OtherDebtorsGuarantors.CoApplicant -2.416e-01 8.660e-02 -2.790 0.005381
#> OtherDebtorsGuarantors.Guarantor NA NA NA NA
#> Property.RealEstate 1.218e-01 6.350e-02 1.918 0.055470
#> Property.Insurance 7.741e-02 6.253e-02 1.238 0.216028
#> Property.CarOther 8.894e-02 6.079e-02 1.463 0.143746
#> Property.Unknown NA NA NA NA
#> OtherInstallmentPlans.Bank -8.194e-02 3.901e-02 -2.100 0.035980
#> OtherInstallmentPlans.Stores -8.112e-02 6.230e-02 -1.302 0.193181
#> OtherInstallmentPlans.None NA NA NA NA
#> Housing.Rent -1.192e-01 7.215e-02 -1.652 0.098931
#> Housing.Own -4.808e-02 6.886e-02 -0.698 0.485195
#> Housing.ForFree NA NA NA NA
#> Job.UnemployedUnskilled 7.048e-02 1.023e-01 0.689 0.491076
#> Job.UnskilledResident -4.170e-03 5.386e-02 -0.077 0.938314
#> Job.SkilledEmployee -1.553e-02 4.396e-02 -0.353 0.723941
#> Job.Management.SelfEmp.HighlyQualified NA NA NA NA
#>
#> (Intercept) ***
#> Duration **
#> Amount *
#> InstallmentRatePercentage ***
#> ResidenceDuration
#> Age
#> NumberExistingCredits
#> NumberPeopleMaintenance
#> Telephone
#> ForeignWorker *
#> CheckingAccountStatus.lt.0 ***
#> CheckingAccountStatus.0.to.200 ***
#> CheckingAccountStatus.gt.200
#> CheckingAccountStatus.none
#> CreditHistory.NoCredit.AllPaid ***
#> CreditHistory.ThisBank.AllPaid ***
#> CreditHistory.PaidDuly **
#> CreditHistory.Delay
#> CreditHistory.Critical
#> Purpose.NewCar *
#> Purpose.UsedCar
#> Purpose.Furniture.Equipment
#> Purpose.Radio.Television
#> Purpose.DomesticAppliance
#> Purpose.Repairs
#> Purpose.Education *
#> Purpose.Vacation
#> Purpose.Retraining
#> Purpose.Business
#> Purpose.Other
#> SavingsAccountBonds.lt.100 ***
#> SavingsAccountBonds.100.to.500
#> SavingsAccountBonds.500.to.1000
#> SavingsAccountBonds.gt.1000
#> SavingsAccountBonds.Unknown
#> EmploymentDuration.lt.1
#> EmploymentDuration.1.to.4
#> EmploymentDuration.4.to.7 .
#> EmploymentDuration.gt.7
#> EmploymentDuration.Unemployed
#> Personal.Male.Divorced.Seperated
#> Personal.Female.NotSingle
#> Personal.Male.Single
#> Personal.Male.Married.Widowed
#> Personal.Female.Single
#> OtherDebtorsGuarantors.None **
#> OtherDebtorsGuarantors.CoApplicant **
#> OtherDebtorsGuarantors.Guarantor
#> Property.RealEstate .
#> Property.Insurance
#> Property.CarOther
#> Property.Unknown
#> OtherInstallmentPlans.Bank *
#> OtherInstallmentPlans.Stores
#> OtherInstallmentPlans.None
#> Housing.Rent .
#> Housing.Own
#> Housing.ForFree
#> Job.UnemployedUnskilled
#> Job.UnskilledResident
#> Job.SkilledEmployee
#> Job.Management.SelfEmp.HighlyQualified
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.1586705)
#>
#> Null deviance: 210.0 on 999 degrees of freedom
#> Residual deviance: 150.9 on 951 degrees of freedom
#> AIC: 1046.7
#>
#> Number of Fisher Scoring iterations: 2
# Second approach
# Create a NPV from Duration + Amount + InstallmentRatePercentage
# OLS linear regression
Created on 2020-01-17 by the reprex package (v0.3.0)
I'm available gratis remotely or by recording to provide domain context for non-profit institutions if helpful.