Fault in R code

iren_plnk · June 1, 2022, 10:45am

I have a problem with my code. I am trying to build a data frame with the number of people in each industry who earn at most 50 thousand and more than 50 thousand, but the counts come out wrong. First I replaced all unknown values with the mode, then combined everything into 4 groups, but for some reason the counts are not found. The same process works fine on other factors, and the dataset itself is fine. What is the problem?

levels(adult$workclass)[1] <- 'Unknown'

adult$workclass <- gsub(' Federal-gov', 'Government', adult$workclass)
adult$workclass <- gsub(' Local-gov', 'Government', adult$workclass)
adult$workclass <- gsub(' State-gov', 'Government', adult$workclass) 

adult$workclass <- gsub(' Self-emp-inc', 'Self-Employed', adult$workclass)
adult$workclass <- gsub(' Self-emp-not-inc', 'Self-Employed', adult$workclass)

adult$workclass <- gsub(' Never-worked', 'Other/Unknown', adult$workclass)
adult$workclass <- gsub(' Without-pay', 'Other/Unknown', adult$workclass)
adult$workclass <- gsub(' Other', 'Other/Unknown', adult$workclass)
adult$workclass <- gsub(' Unknown', 'Other/Unknown', adult$workclass)

adult$workclass <- as.factor(adult$workclass)

summary(adult$workclass)

count <- table(adult[adult$workclass == 'Government',]$income_class)["<=50K"]
count <- c(count, table(adult[adult$workclass == 'Government',]$income_class)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Other/Unknown',]$income_class)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Other/Unknown',]$income_class)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Private',]$income_class)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Private',]$income_class)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Self-Employed',]$income_class)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Self-Employed',]$income_class)[">50K"])
count <- as.numeric(count)


industry <- rep(levels(adult$workclass), each = 2)
income <- rep(c('<=50K', '>50K'), 4)
idf <- data.frame(industry, income, count)
idf

After running the code, I get this in the console:

       industry income count
1       Private  <=50K    NA
2       Private   >50K    NA
3    Government  <=50K    NA
4    Government   >50K    NA
5 Other/Unknown  <=50K    NA
6 Other/Unknown   >50K    NA
7 Self-Employed  <=50K    NA
8 Self-Employed   >50K    NA

nirgrahamuk · June 1, 2022, 2:45pm

Hello.
Thanks for providing code, but you could take further steps to make it more convenient for other forum users to help you.

Share some representative data that will enable your code to run and show the problematic behaviour.

You might use tools such as the datapasta package or the base function dput() to share a portion of your data in code form, i.e. something that can be copied from the forum and pasted into an R session.

Reprex Guide
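
As a rough illustration of the dput() suggestion, here is a minimal sketch. The data frame `adult_sample` and its values are made up for this example; only the column names `workclass` and `income_class` come from the question.

```r
# Toy stand-in for the real `adult` data; these values are invented for illustration.
adult_sample <- data.frame(
  workclass    = c(" Private", " State-gov", " Self-emp-inc", " ?"),
  income_class = c(" <=50K", " >50K", " >50K", " <=50K"),
  stringsAsFactors = TRUE
)

# dput() prints R code that recreates the object, so it can be pasted into a post.
dput(adult_sample)

# With the real data, the first rows of the relevant columns are usually enough:
# dput(head(adult[, c("workclass", "income_class")], 20))

# Round trip: capture the dput() text and rebuild the object from it.
dput_text <- paste(capture.output(dput(adult_sample)), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
all.equal(rebuilt, adult_sample)
```

Because dput() keeps factor levels exactly as stored, stray whitespace in the values survives the copy and paste, which a printed data frame tends to hide.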

andresrcs · June 1, 2022, 9:55pm

As Nir said, we would need sample data to give you a working (tested) solution, but I can give you a pointer if you are interested. Using dplyr functions could greatly simplify your code. For example, you could do something along these lines:

library(dplyr)

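idf <- adult %>% 
    # trimws() guards against values stored with a leading space, e.g. ' Private'
    mutate(industry = case_when(
        trimws(workclass) %in% c('Federal-gov', 'Local-gov', 'State-gov') ~ 'Government',
        trimws(workclass) %in% c('Self-emp-inc', 'Self-emp-not-inc') ~ 'Self-Employed',
        trimws(workclass) == 'Private' ~ 'Private',
        TRUE ~ 'Other/Unknown'
    )) %>% 
    # the income column in the question is named income_class
    count(industry, income_class, name = 'count')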

idf
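
One possible explanation for the NA values, offered as a guess since no data was shared: the gsub() patterns in the question suggest that the workclass values carry a leading space. If so, ' Private' is never renamed and never equals 'Private', which would also explain why Private sorts first in the printed output. If income_class has the same leading space (an assumption), the table has no element named "<=50K" and every lookup returns NA. The sketch below reproduces this on made-up toy data and shows one base R way around it; the toy values are hypothetical.

```r
# Made-up toy data mimicking values with a leading space (an assumption about `adult`).
adult <- data.frame(
  workclass    = c(" Private", " Private", " State-gov", " Local-gov",
                   " Self-emp-inc", " Never-worked"),
  income_class = c(" <=50K", " >50K", " <=50K", " >50K", " >50K", " <=50K"),
  stringsAsFactors = TRUE
)

# Looking up a name that is not in the table gives NA, as in the question.
na_lookup <- table(adult$income_class)["<=50K"]
na_lookup

# Strip surrounding whitespace, then recode by assigning groups of old levels.
wc <- factor(trimws(as.character(adult$workclass)))
levels(wc) <- list(
  "Government"    = c("Federal-gov", "Local-gov", "State-gov"),
  "Other/Unknown" = c("Never-worked", "Without-pay", "?", "Unknown"),
  "Private"       = "Private",
  "Self-Employed" = c("Self-emp-inc", "Self-emp-not-inc")
)
adult$industry <- wc
adult$income <- factor(trimws(as.character(adult$income_class)),
                       levels = c("<=50K", ">50K"))

# One two-way table replaces the eight separate lookups, so labels and counts
# cannot drift out of order.
idf <- as.data.frame(table(industry = adult$industry, income = adult$income),
                     responseName = "count")
idf <- idf[order(idf$industry, idf$income), ]
idf
```

If the data was loaded with read.csv(), passing strip.white = TRUE at load time may avoid the problem in the first place.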
