Problem with as.numeric function

woodland2081 · June 19, 2024, 2:18pm

Hi,

I'm new to R and having trouble with the as.numeric function in RStudio. I have a variable "TOT_EMP_Ind" that is a character variable - it doesn't appear that any values have commas or other symbols and no missing data. When I try "as.numeric" I get the warning that NAs were introduced by coercion but the "TOT_EMP_Ind" remains a character variable. Not sure what I'm missing!

dput(Nat_Ind_Total2)
structure(list(YEAR = c(2012, 2012, 2012, 2012, 2012, 2012, 2013,
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014,
2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016,
2016, 2017, 2017, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018,
2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020,
2020, 2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021, 2022, 2022,
2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023, 2023, 2023),
AREA_TITLE = c("U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S."), INDUSTRY = c("Soc", "Gov", "Ins", "Amb",
"Hosp", "Total", "Ins", "Total", "Soc", "Gov", "Amb", "Hosp",
"Ins", "Amb", "Total", "Soc", "Gov", "Hosp", "Hosp", "Ins",
"Amb", "Soc", "Gov", "Total", "Total", "Hosp", "Ins", "Amb",
"Soc", "Gov", "Hosp", "Soc", "Ins", "Amb", "Total", "Gov",
"Amb", "Hosp", "Soc", "Gov", "Ins", "Total", "Ins", "Total",
"Amb", "Hosp", "Soc", "Gov", "Amb", "Ins", "Total", "Hosp",
"Soc", "Gov", "Amb", "Hosp", "Ins", "Total", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Gov", "Total", "Soc", "Ins", "Amb",
"Hosp", "Soc", "Gov", "Total"), TOT_EMP_Ind = c("10470",
"7110", "500", "6490", "3410", "38020", "570", "45800", "11940",
"7880", "8950", "3690", "400", "9980", "47880", "13080",
"8060", "4080", "4600", "480", "10510", "12040", "8190",
"48130", "51900", "5730", "720", "10900", "13520", "8920",
"5750", "14170", "940", "11330", "54760", "9460", "10610",
"5770", "14120", "11310", "1110", "56130", "0", "58950",
"11760", "5240", "14220", "11210", "11430", "2580", "58670",
"5390", "13950", "11340", "11020", "5220", "3610", "61010",
"12740", "13150", "3410", "11150", "6480", "11660", "61300",
"13220", "2300", "11950", "6400", "12870", "11880", "58550"
), H_MEAN = c(14.95, 19.38, 24.94, 18.44, 21.73, 18.02, 24,
18.1, 15.53, 19.45, 17.64, 21.66, 23.72, 17.95, 18.35, 15.78,
19.79, 21.4, 22.41, 23.24, 18.31, 17.11, 20.82, 19.3, 19.8,
23.11, 23.17, 18.94, 17.37, 20.9, 23.9, 18.16, 22.12, 19.62,
20.36, 21.63, 20.16, 23.85, 18.48, 22.54, 23.5, 20.9, 26.47,
21.34, 20.58, 24.43, 18.97, 22.96, 21.32, 26.94, 22.12, 24.77,
19.92, 23.49, 22.71, 25.9, 28.11, 22.97, 20.33, 24.05, 28.96,
22.91, 27.53, 24.81, 23.99, 20.82, 30.75, 23.93, 28.66, 22.52,
26.4, 25.3), H_MEDIAN = c(14.48, 17.81, 23.73, 16.9, 20.49,
16.64, 23.92, 16.64, 14.62, 17.96, 16.36, 20.28, 23.97, 16.56,
16.76, 14.76, 18.21, 19.86, 20.93, 22.37, 16.7, 15.96, 18.86,
17.45, 17.95, 21.73, 22, 17.24, 16.55, 18.97, 22.28, 17.27,
20.96, 17.88, 18.45, 19.59, 18.38, 22.04, 17.6, 20.76, 22.03,
19.01, 26.03, 19.41, 18.78, 22.72, 17.8, 21.27, 19.43, 26.18,
20.19, 23.15, 18.35, 21.87, 21.39, 23.68, 29.39, 22.4, 18.6,
22.79, 28.97, 21.26, 26.15, 22.97, 22.21, 19.67, 30.27, 22.18,
26.48, 21.83, 24.3, 23.17), TOT_EMP_Nat = c(38020, 38020,
38020, 38020, 38020, 38020, 45800, 45800, 45800, 45800, 45800,
45800, 47880, 47880, 47880, 47880, 47880, 47880, 48130, 48130,
48130, 48130, 48130, 48130, 51900, 51900, 51900, 51900, 51900,
51900, 54760, 54760, 54760, 54760, 54760, 54760, 56130, 56130,
56130, 56130, 56130, 56130, 58950, 58950, 58950, 58950, 58950,
58950, 58670, 58670, 58670, 58670, 58670, 58670, 61010, 61010,
61010, 61010, 61010, 61010, 61300, 61300, 61300, 61300, 61300,
61300, 58550, 58550, 58550, 58550, 58550, 58550)), row.names = c(NA,
-72L), class = "data.frame")

str(Nat_Ind_Total2)
'data.frame': 72 obs. of 7 variables:
YEAR : num 2012 2012 2012 2012 2012 ... AREA_TITLE : chr "U.S." "U.S." "U.S." "U.S." ...
INDUSTRY : chr "Soc" "Gov" "Ins" "Amb" ... TOT_EMP_Ind: chr "10470" "7110" "500" "6490" ...
H_MEAN : num 14.9 19.4 24.9 18.4 21.7 ... H_MEDIAN : num 14.5 17.8 23.7 16.9 20.5 ...
$ TOT_EMP_Nat: num 38020 38020 38020 38020 38020 ...

as.numeric("TOT_EMP_Ind")
[1] NA
Warning message:
NAs introduced by coercion

str(Nat_Ind_Total2)
'data.frame': 72 obs. of 7 variables:
YEAR : num 2012 2012 2012 2012 2012 ... AREA_TITLE : chr "U.S." "U.S." "U.S." "U.S." ...
INDUSTRY : chr "Soc" "Gov" "Ins" "Amb" ... TOT_EMP_Ind: chr "10470" "7110" "500" "6490" ...
H_MEAN : num 14.9 19.4 24.9 18.4 21.7 ... H_MEDIAN : num 14.5 17.8 23.7 16.9 20.5 ...
$ TOT_EMP_Nat: num 38020 38020 38020 38020 38020 ...

dromano · June 19, 2024, 2:34pm

Thanks for sharing your dput() output, @woodland2081 !

jrkrideau · June 19, 2024, 2:38pm

Hi and welcome to the forum

You have a couple of issues.
First, you have

"TOT_EMP_Ind"

which R understands as a literal set of letters and if you try

it will not work because R does not know to look in the data.frame Nat_Ind_Total2

This should work

as.numeric(Nat_Ind_Total2$TOT_EMP_Ind)

Here "Nat_Ind_Total2$" tells R to look in the data.frame "Nat_Ind_Total2" for a variable named "TOT_EMP_Ind"

woodland2081 · June 19, 2024, 3:21pm

Thank you so very much! The code that you suggested seemed to help (I got an actual output for the as.numeric function) but it still seems to not actually convert the variable to numeric. I think the original issue was that my data had a ** value in it, and even when I converted it to 0 in R is still wasn't recognizing as a a number. I'm showing a dataset below with slightly different variable names but it's the same structure/issue (you can see the "**" value in TOT_EMP):

dput(Nat_Ind_Total)
structure(list(AREA_TITLE = c("U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S."), YEAR = c(2023, 2023, 2023, 2023,
2023, 2022, 2022, 2022, 2022, 2022, 2021, 2021, 2021, 2021, 2021,
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018,
2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2017, 2016, 2016,
2016, 2016, 2016, 2015, 2015, 2015, 2015, 2015, 2014, 2014, 2014,
2014, 2014, 2013, 2013, 2013, 2013, 2013, 2012, 2012, 2012, 2012,
2012, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014,
2013, 2012), INDUSTRY = c("Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp",
"Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb",
"Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins",
"Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Total", "Total", "Total", "Total", "Total", "Total",
"Total", "Total", "Total", "Total", "Total", "Total"), TOT_EMP = c("2300",
"11950", "6400", "12870", "11880", "3410", "11150", "6480", "13220",
"11660", "3610", "11020", "5220", "12740", "13150", "2580", "11430",
"5390", "13950", "11340", "**", "11760", "5240", "14220", "11210",
"1110", "10610", "5770", "14120", "11310", "940", "11330", "5750",
"14170", "9460", "720", "10900", "5730", "13520", "8920", "480",
"10510", "4600", "12040", "8190", "400", "9980", "4080", "13080",
"8060", "570", "8950", "3690", "11940", "7880", "500", "6490",
"3410", "10470", "7110", "58550", "61300", "61010", "58670",
"58950", "56130", "54760", "51900", "48130", "47880", "45800",
"38020"), H_MEAN = c(30.75, 23.93, 28.66, 22.52, 26.4, 28.96,
22.91, 27.53, 20.82, 24.81, 28.11, 22.71, 25.9, 20.33, 24.05,
26.94, 21.32, 24.77, 19.92, 23.49, 26.47, 20.58, 24.43, 18.97,
22.96, 23.5, 20.16, 23.85, 18.48, 22.54, 22.12, 19.62, 23.9,
18.16, 21.63, 23.17, 18.94, 23.11, 17.37, 20.9, 23.24, 18.31,
22.41, 17.11, 20.82, 23.72, 17.95, 21.4, 15.78, 19.79, 24, 17.64,
21.66, 15.53, 19.45, 24.94, 18.44, 21.73, 14.95, 19.38, 25.3,
23.99, 22.97, 22.12, 21.34, 20.9, 20.36, 19.8, 19.3, 18.35, 18.1,
18.02), H_MEDIAN = c(30.27, 22.18, 26.48, 21.83, 24.3, 28.97,
21.26, 26.15, 19.67, 22.97, 29.39, 21.39, 23.68, 18.6, 22.79,
26.18, 19.43, 23.15, 18.35, 21.87, 26.03, 18.78, 22.72, 17.8,
21.27, 22.03, 18.38, 22.04, 17.6, 20.76, 20.96, 17.88, 22.28,
17.27, 19.59, 22, 17.24, 21.73, 16.55, 18.97, 22.37, 16.7, 20.93,
15.96, 18.86, 23.97, 16.56, 19.86, 14.76, 18.21, 23.92, 16.36,
20.28, 14.62, 17.96, 23.73, 16.9, 20.49, 14.48, 17.81, 23.17,
22.21, 22.4, 20.19, 19.41, 19.01, 18.45, 17.95, 17.45, 16.76,
16.64, 16.64)), row.names = c(NA, -72L), class = c("tbl_df",
"tbl", "data.frame"))

as.numeric(Nat_Ind_Total$TOT_EMP)
[1] 2300 11950 6400 12870 11880 3410 11150 6480 13220 11660 3610 11020 5220
[14] 12740 13150 2580 11430 5390 13950 11340 NA 11760 5240 14220 11210 1110
[27] 10610 5770 14120 11310 940 11330 5750 14170 9460 720 10900 5730 13520
[40] 8920 480 10510 4600 12040 8190 400 9980 4080 13080 8060 570 8950
[53] 3690 11940 7880 500 6490 3410 10470 7110 58550 61300 61010 58670 58950
[66] 56130 54760 51900 48130 47880 45800 38020
Warning message:
NAs introduced by coercion

##^here is says that the ** was replaced by NA, but when I looked at the data after this the ** is still there

dput(Nat_Ind_Total)
structure(list(AREA_TITLE = c("U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S."), YEAR = c(2023, 2023, 2023, 2023,
2023, 2022, 2022, 2022, 2022, 2022, 2021, 2021, 2021, 2021, 2021,
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018,
2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2017, 2016, 2016,
2016, 2016, 2016, 2015, 2015, 2015, 2015, 2015, 2014, 2014, 2014,
2014, 2014, 2013, 2013, 2013, 2013, 2013, 2012, 2012, 2012, 2012,
2012, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014,
2013, 2012), INDUSTRY = c("Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp",
"Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb",
"Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins",
"Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Total", "Total", "Total", "Total", "Total", "Total",
"Total", "Total", "Total", "Total", "Total", "Total"), TOT_EMP = c("2300",
"11950", "6400", "12870", "11880", "3410", "11150", "6480", "13220",
"11660", "3610", "11020", "5220", "12740", "13150", "2580", "11430",
"5390", "13950", "11340", "**", "11760", "5240", "14220", "11210",
"1110", "10610", "5770", "14120", "11310", "940", "11330", "5750",
"14170", "9460", "720", "10900", "5730", "13520", "8920", "480",
"10510", "4600", "12040", "8190", "400", "9980", "4080", "13080",
"8060", "570", "8950", "3690", "11940", "7880", "500", "6490",
"3410", "10470", "7110", "58550", "61300", "61010", "58670",
"58950", "56130", "54760", "51900", "48130", "47880", "45800",
"38020"), H_MEAN = c(30.75, 23.93, 28.66, 22.52, 26.4, 28.96,
22.91, 27.53, 20.82, 24.81, 28.11, 22.71, 25.9, 20.33, 24.05,
26.94, 21.32, 24.77, 19.92, 23.49, 26.47, 20.58, 24.43, 18.97,
22.96, 23.5, 20.16, 23.85, 18.48, 22.54, 22.12, 19.62, 23.9,
18.16, 21.63, 23.17, 18.94, 23.11, 17.37, 20.9, 23.24, 18.31,
22.41, 17.11, 20.82, 23.72, 17.95, 21.4, 15.78, 19.79, 24, 17.64,
21.66, 15.53, 19.45, 24.94, 18.44, 21.73, 14.95, 19.38, 25.3,
23.99, 22.97, 22.12, 21.34, 20.9, 20.36, 19.8, 19.3, 18.35, 18.1,
18.02), H_MEDIAN = c(30.27, 22.18, 26.48, 21.83, 24.3, 28.97,
21.26, 26.15, 19.67, 22.97, 29.39, 21.39, 23.68, 18.6, 22.79,
26.18, 19.43, 23.15, 18.35, 21.87, 26.03, 18.78, 22.72, 17.8,
21.27, 22.03, 18.38, 22.04, 17.6, 20.76, 20.96, 17.88, 22.28,
17.27, 19.59, 22, 17.24, 21.73, 16.55, 18.97, 22.37, 16.7, 20.93,
15.96, 18.86, 23.97, 16.56, 19.86, 14.76, 18.21, 23.92, 16.36,
20.28, 14.62, 17.96, 23.73, 16.9, 20.49, 14.48, 17.81, 23.17,
22.21, 22.4, 20.19, 19.41, 19.01, 18.45, 17.95, 17.45, 16.76,
16.64, 16.64)), row.names = c(NA, -72L), class = c("tbl_df",
"tbl", "data.frame"))

##So then I tried manually replacing the ** with NA, and then manually replacing the NA with 0, and both changes did actually work/show up in the data.

Nat_Ind_Total[Nat_Ind_Total == "**"] <- NA
Nat_Ind_Total$TOT_EMP[is.na(Nat_Ind_Total$TOT_EMP)] <- 0

as.numeric(Nat_Ind_Total$TOT_EMP)
[1] 2300 11950 6400 12870 11880 3410 11150 6480 13220 11660 3610 11020 5220
[14] 12740 13150 2580 11430 5390 13950 11340 0 11760 5240 14220 11210 1110
[27] 10610 5770 14120 11310 940 11330 5750 14170 9460 720 10900 5730 13520
[40] 8920 480 10510 4600 12040 8190 400 9980 4080 13080 8060 570 8950
[53] 3690 11940 7880 500 6490 3410 10470 7110 58550 61300 61010 58670 58950
[66] 56130 54760 51900 48130 47880 45800 38020

str(Nat_Ind_Total)
tibble [72 × 6] (S3: tbl_df/tbl/data.frame)
AREA_TITLE: chr [1:72] "U.S." "U.S." "U.S." "U.S." ... YEAR : num [1:72] 2023 2023 2023 2023 2023 ...
INDUSTRY : chr [1:72] "Ins" "Amb" "Hosp" "Soc" ... TOT_EMP : chr [1:72] "2300" "11950" "6400" "12870" ...
H_MEAN : num [1:72] 30.8 23.9 28.7 22.5 26.4 ... H_MEDIAN : num [1:72] 30.3 22.2 26.5 21.8 24.3 ...

##But then still even with trying as.numeric after that, the TOT_EMP variable remains character.

When I went back to the original excel file and just changed the ** to 0 I didn't have this problem (that's what I did with my original data set I had posted) so that seems to be the easiest solution, but just curious if there is something I'm missing here for doing it in R.

Thank you so much for your time!

dromano · June 20, 2024, 7:07pm

Hi @woodland2081 ,

A few observations, but first of all, I would recommend placing any code or output you share between a pair of triple backticks, like this:

```
place code or output here
```

which produces a copyable code block like this one:

Nat_Ind_Total <- 
structure(list(AREA_TITLE = c("U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S."), YEAR = c(2023, 2023, 2023, 2023,
2023, 2022, 2022, 2022, 2022, 2022, 2021, 2021, 2021, 2021, 2021,
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018,
2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2017, 2016, 2016,
2016, 2016, 2016, 2015, 2015, 2015, 2015, 2015, 2014, 2014, 2014,
2014, 2014, 2013, 2013, 2013, 2013, 2013, 2012, 2012, 2012, 2012,
2012, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014,
2013, 2012), INDUSTRY = c("Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp",
"Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb",
"Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins",
"Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Total", "Total", "Total", "Total", "Total", "Total",
"Total", "Total", "Total", "Total", "Total", "Total"), TOT_EMP = c("2300",
"11950", "6400", "12870", "11880", "3410", "11150", "6480", "13220",
"11660", "3610", "11020", "5220", "12740", "13150", "2580", "11430",
"5390", "13950", "11340", "**", "11760", "5240", "14220", "11210",
"1110", "10610", "5770", "14120", "11310", "940", "11330", "5750",
"14170", "9460", "720", "10900", "5730", "13520", "8920", "480",
"10510", "4600", "12040", "8190", "400", "9980", "4080", "13080",
"8060", "570", "8950", "3690", "11940", "7880", "500", "6490",
"3410", "10470", "7110", "58550", "61300", "61010", "58670",
"58950", "56130", "54760", "51900", "48130", "47880", "45800",
"38020"), H_MEAN = c(30.75, 23.93, 28.66, 22.52, 26.4, 28.96,
22.91, 27.53, 20.82, 24.81, 28.11, 22.71, 25.9, 20.33, 24.05,
26.94, 21.32, 24.77, 19.92, 23.49, 26.47, 20.58, 24.43, 18.97,
22.96, 23.5, 20.16, 23.85, 18.48, 22.54, 22.12, 19.62, 23.9,
18.16, 21.63, 23.17, 18.94, 23.11, 17.37, 20.9, 23.24, 18.31,
22.41, 17.11, 20.82, 23.72, 17.95, 21.4, 15.78, 19.79, 24, 17.64,
21.66, 15.53, 19.45, 24.94, 18.44, 21.73, 14.95, 19.38, 25.3,
23.99, 22.97, 22.12, 21.34, 20.9, 20.36, 19.8, 19.3, 18.35, 18.1,
18.02), H_MEDIAN = c(30.27, 22.18, 26.48, 21.83, 24.3, 28.97,
21.26, 26.15, 19.67, 22.97, 29.39, 21.39, 23.68, 18.6, 22.79,
26.18, 19.43, 23.15, 18.35, 21.87, 26.03, 18.78, 22.72, 17.8,
21.27, 22.03, 18.38, 22.04, 17.6, 20.76, 20.96, 17.88, 22.28,
17.27, 19.59, 22, 17.24, 21.73, 16.55, 18.97, 22.37, 16.7, 20.93,
15.96, 18.86, 23.97, 16.56, 19.86, 14.76, 18.21, 23.92, 16.36,
20.28, 14.62, 17.96, 23.73, 16.9, 20.49, 14.48, 17.81, 23.17,
22.21, 22.4, 20.19, 19.41, 19.01, 18.45, 17.95, 17.45, 16.76,
16.64, 16.64)), row.names = c(NA, -72L), class = c("tbl_df",
"tbl", "data.frame"))

That makes it so that folks here can easily copy your code and data for troubleshooting.

This means "make a new vector by converting all the elements of the vector Nat_Ind_Total$TOT_EMP into doubles," which is what it did, but creating a new vector doesn't replace the old vector unless you take steps for that to happen, too, like you did here:

My preference for doing the same, so that it's clearer to me what's going on, would be to first view the row where the ** appears:

library(tidyverse) # contains useful functions for manipulating tables
Nat_Ind_Total |> 
  # extract row in table where value of TOT_EMP is "**", just to view
  filter(TOT_EMP == "**")
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>    <chr>    <dbl>    <dbl>
#> 1 U.S.        2019 Ins      **        26.5     26.0

There are a few ways to convert TOT_EMP to a numeric column. One is with the function as.numeric() you used above:

Nat_Ind_Total |> 
  # TEMPORARILY change (mutate) table by replacing TOT_EMP with numeric version
  mutate(TOT_EMP = as.numeric(TOT_EMP)) |> 
  # extract row in table where value of TOT_EMP is NA, just to view
  filter(TOT_EMP |> is.na())
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = as.numeric(TOT_EMP)`.
#> Caused by warning:
#> ! NAs introduced by coercion
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#> 1 U.S.        2019 Ins           NA   26.5     26.0

Another is to use the parse_number() function from the tidyverse package, which extracts numbers from strings:

Nat_Ind_Total |> 
  # TEMPORARILY change (mutate) table by replacing TOT_EMP with number extracted
  mutate(TOT_EMP = parse_number(TOT_EMP))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **
#> # A tibble: 72 × 6
#>    AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>    <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#>  1 U.S.        2023 Ins         2300   30.8     30.3
#>  2 U.S.        2023 Amb        11950   23.9     22.2
#>  3 U.S.        2023 Hosp        6400   28.7     26.5
#>  4 U.S.        2023 Soc        12870   22.5     21.8
#>  5 U.S.        2023 Gov        11880   26.4     24.3
#>  6 U.S.        2022 Ins         3410   29.0     29.0
#>  7 U.S.        2022 Amb        11150   22.9     21.3
#>  8 U.S.        2022 Hosp        6480   27.5     26.2
#>  9 U.S.        2022 Soc        13220   20.8     19.7
#> 10 U.S.        2022 Gov        11660   24.8     23.0
#> # ℹ 62 more rows

The warning message says the issue is in row 21, which we can inspect with the slice() function:

Nat_Ind_Total |> 
  mutate(TOT_EMP = parse_number(TOT_EMP)) |> 
  # extract row number 21 from table, just to view
  slice(21)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#> 1 U.S.        2019 Ins           NA   26.5     26.0

Now, once NAs are introduced in column, you can replace them with your value of choice:

Nat_Ind_Total |> 
  mutate(TOT_EMP = parse_number(TOT_EMP)) |> 
  # replace NA in TOT_EMP by 0
  replace_na(list(TOT_EMP = 0)) |> 
  # extract row in table where value of TOT_EMP is 0 just to view
  filter(TOT_EMP == 0)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#> 1 U.S.        2019 Ins            0   26.5     26.0

Since we've only been viewing temporary changes, we have to assign the desired result in order to change the table itself:

# assign previous result to actually change table
Nat_Ind_Total <- 
  Nat_Ind_Total |> 
  mutate(TOT_EMP = parse_number(TOT_EMP)) |> 
  # replace NA in TOT_EMP by 0
  replace_na(list(TOT_EMP = 0))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **

^{Created on 2024-06-20 with reprex v2.0.2}

I hope that helps, and here is all the code in a single, copyable code block:

Full reprex

Nat_Ind_Total <- 
structure(list(AREA_TITLE = c("U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.",
"U.S.", "U.S.", "U.S.", "U.S."), YEAR = c(2023, 2023, 2023, 2023,
2023, 2022, 2022, 2022, 2022, 2022, 2021, 2021, 2021, 2021, 2021,
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018,
2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2017, 2016, 2016,
2016, 2016, 2016, 2015, 2015, 2015, 2015, 2015, 2014, 2014, 2014,
2014, 2014, 2013, 2013, 2013, 2013, 2013, 2012, 2012, 2012, 2012,
2012, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014,
2013, 2012), INDUSTRY = c("Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp",
"Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb",
"Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov", "Ins",
"Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc", "Gov",
"Ins", "Amb", "Hosp", "Soc", "Gov", "Ins", "Amb", "Hosp", "Soc",
"Gov", "Total", "Total", "Total", "Total", "Total", "Total",
"Total", "Total", "Total", "Total", "Total", "Total"), TOT_EMP = c("2300",
"11950", "6400", "12870", "11880", "3410", "11150", "6480", "13220",
"11660", "3610", "11020", "5220", "12740", "13150", "2580", "11430",
"5390", "13950", "11340", "**", "11760", "5240", "14220", "11210",
"1110", "10610", "5770", "14120", "11310", "940", "11330", "5750",
"14170", "9460", "720", "10900", "5730", "13520", "8920", "480",
"10510", "4600", "12040", "8190", "400", "9980", "4080", "13080",
"8060", "570", "8950", "3690", "11940", "7880", "500", "6490",
"3410", "10470", "7110", "58550", "61300", "61010", "58670",
"58950", "56130", "54760", "51900", "48130", "47880", "45800",
"38020"), H_MEAN = c(30.75, 23.93, 28.66, 22.52, 26.4, 28.96,
22.91, 27.53, 20.82, 24.81, 28.11, 22.71, 25.9, 20.33, 24.05,
26.94, 21.32, 24.77, 19.92, 23.49, 26.47, 20.58, 24.43, 18.97,
22.96, 23.5, 20.16, 23.85, 18.48, 22.54, 22.12, 19.62, 23.9,
18.16, 21.63, 23.17, 18.94, 23.11, 17.37, 20.9, 23.24, 18.31,
22.41, 17.11, 20.82, 23.72, 17.95, 21.4, 15.78, 19.79, 24, 17.64,
21.66, 15.53, 19.45, 24.94, 18.44, 21.73, 14.95, 19.38, 25.3,
23.99, 22.97, 22.12, 21.34, 20.9, 20.36, 19.8, 19.3, 18.35, 18.1,
18.02), H_MEDIAN = c(30.27, 22.18, 26.48, 21.83, 24.3, 28.97,
21.26, 26.15, 19.67, 22.97, 29.39, 21.39, 23.68, 18.6, 22.79,
26.18, 19.43, 23.15, 18.35, 21.87, 26.03, 18.78, 22.72, 17.8,
21.27, 22.03, 18.38, 22.04, 17.6, 20.76, 20.96, 17.88, 22.28,
17.27, 19.59, 22, 17.24, 21.73, 16.55, 18.97, 22.37, 16.7, 20.93,
15.96, 18.86, 23.97, 16.56, 19.86, 14.76, 18.21, 23.92, 16.36,
20.28, 14.62, 17.96, 23.73, 16.9, 20.49, 14.48, 17.81, 23.17,
22.21, 22.4, 20.19, 19.41, 19.01, 18.45, 17.95, 17.45, 16.76,
16.64, 16.64)), row.names = c(NA, -72L), class = c("tbl_df",
"tbl", "data.frame")) 

library(tidyverse) # contains useful functions for manipulating tables
Nat_Ind_Total |> 
  # extract row in table where value of TOT_EMP is "**", just to view
  filter(TOT_EMP == "**")
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>    <chr>    <dbl>    <dbl>
#> 1 U.S.        2019 Ins      **        26.5     26.0
  
Nat_Ind_Total |> 
  # TEMPORARILY change (mutate) table by replacing TOT_EMP with numeric version
  mutate(TOT_EMP = as.numeric(TOT_EMP)) |> 
  # extract row in table where value of TOT_EMP is NA, just to view
  filter(TOT_EMP |> is.na())
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = as.numeric(TOT_EMP)`.
#> Caused by warning:
#> ! NAs introduced by coercion
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#> 1 U.S.        2019 Ins           NA   26.5     26.0

Nat_Ind_Total |> 
  # TEMPORARILY change (mutate) table by replacing TOT_EMP with number extracted
  mutate(TOT_EMP = parse_number(TOT_EMP))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **
#> # A tibble: 72 × 6
#>    AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>    <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#>  1 U.S.        2023 Ins         2300   30.8     30.3
#>  2 U.S.        2023 Amb        11950   23.9     22.2
#>  3 U.S.        2023 Hosp        6400   28.7     26.5
#>  4 U.S.        2023 Soc        12870   22.5     21.8
#>  5 U.S.        2023 Gov        11880   26.4     24.3
#>  6 U.S.        2022 Ins         3410   29.0     29.0
#>  7 U.S.        2022 Amb        11150   22.9     21.3
#>  8 U.S.        2022 Hosp        6480   27.5     26.2
#>  9 U.S.        2022 Soc        13220   20.8     19.7
#> 10 U.S.        2022 Gov        11660   24.8     23.0
#> # ℹ 62 more rows

Nat_Ind_Total |> 
  mutate(TOT_EMP = parse_number(TOT_EMP)) |> 
  # extract row number 21 from table, just to view
  slice(21)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#> 1 U.S.        2019 Ins           NA   26.5     26.0

Nat_Ind_Total |> 
  mutate(TOT_EMP = parse_number(TOT_EMP)) |> 
  # replace NA in TOT_EMP by 0
  replace_na(list(TOT_EMP = 0)) |> 
  # extract row in table where value of TOT_EMP is 0 just to view
  filter(TOT_EMP == 0)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **
#> # A tibble: 1 × 6
#>   AREA_TITLE  YEAR INDUSTRY TOT_EMP H_MEAN H_MEDIAN
#>   <chr>      <dbl> <chr>      <dbl>  <dbl>    <dbl>
#> 1 U.S.        2019 Ins            0   26.5     26.0

# assign previous result to actually change table
Nat_Ind_Total <- 
Nat_Ind_Total |> 
  mutate(TOT_EMP = parse_number(TOT_EMP)) |> 
  # replace NA in TOT_EMP by 0
  replace_na(list(TOT_EMP = 0))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `TOT_EMP = parse_number(TOT_EMP)`.
#> Caused by warning:
#> ! 1 parsing failure.
#> row col expected actual
#>  21  -- a number     **

^{Created on 2024-06-20 with reprex v2.0.2}

jrkrideau · June 20, 2024, 8:00pm

@ dromano has given an excellent explanation and solution using {dplyr} from {tidyverse}

Here is a slightly different approach using {data.table}. I tend to prefer the syntax.

library(data.table
# Convert to from tibble to data.table.  Normally I'd load in the data in data.table format
setDT(Nat_Ind_Total)  

# find the problem(s)
which(is.na(as.numeric(as.character(Nat_Ind_Total$TOT_EMP))))  # pure, basic R

# Replace ** with the correct input in character format
Nat_Ind_Total[21, TOT_EMP := "100"]

# convert the TOT_EMP to numeric.  The := is doing the assignment. 
Nat_Ind_Total[, TOT_EMP := as.numeric(TOT_EMP)]

{data.table} uses quite different syntax than basic R or {dplyr} . These links may help you see what I am doing.
data.table in R – The Complete Beginners G
A data.table and dplyr tour

OOPS, forgot, you will probably have to install {data.table}

install.packages("data.table")

system · September 18, 2024, 8:01pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.