Generally speaking, it sounds like dist() had to coerce some of the values you gave it to numeric, and the values it couldn’t convert became NAs. You might take a good look at the documentation for dist() and see if what you’re asking it to do makes sense. The data frame you’re feeding it has a lot of variables of a lot of different types (you don’t seem to have included all of the str() output above, but already I see both factors and numerics).

dist() uses one of several possible distance measures to “compute the distances between the rows of a data matrix”. It expects to get a numeric matrix, but it will try to work with a data frame if that’s what you give it. However, dist() doesn’t know what to do with factors (= categorical data), and I strongly suspect this is the source of your NAs:
# Create a data frame with some categorical (factor) data
# and some numeric data
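dfr <- data.frame(
  lc = letters[1:4],
  uc = LETTERS[1:4],
  num1 = c(1, 1, 1, 1),
  num2 = c(0, 1, 0, 1),
  # Needed on R >= 4.0, where strings are no longer
  # converted to factors by default
  stringsAsFactors = TRUE
)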
dfr
#>   lc uc num1 num2
#> 1  a  A    1    0
#> 2  b  B    1    1
#> 3  c  C    1    0
#> 4  d  D    1    1
str(dfr)
#> 'data.frame': 4 obs. of 4 variables:
#> $ lc : Factor w/ 4 levels "a","b","c","d": 1 2 3 4
#> $ uc : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
#> $ num1: num 1 1 1 1
#> $ num2: num 0 1 0 1
# Compute Euclidean distance of the rows, looking only at
# the numeric columns
dist(dfr[3:4])
#>   1 2 3
#> 2 1
#> 3 0 1
#> 4 1 0 1
# Compute Euclidean distance of the rows, looking only at
# the factor columns
dist(dfr[1:2])
#> Warning in dist(dfr[1:2]): NAs introduced by coercion
#>    1  2  3
#> 2 NA
#> 3 NA NA
#> 4 NA NA NA
# For the whole data frame...
dist(dfr)
#> Warning in dist(dfr): NAs introduced by coercion
#>          1        2        3
#> 2 1.414214
#> 3 0.000000 1.414214
#> 4 1.414214 0.000000 1.414214
Created on 2018-10-01 by the reprex package (v0.2.1)
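One thing that may look odd in the last result: the whole-data-frame distances are 1.414214 where the numeric-only distances were 1. As far as I can tell from the dist() documentation, when some columns are excluded (here, the two that turned into NAs), the sum is scaled up in proportion to the number of columns actually used. A minimal sketch to check that, using the same toy data (the names d_numeric, d_all and rescaled are just mine for illustration):

```r
# Same toy data as above; stringsAsFactors = TRUE keeps the letter
# columns as factors on R >= 4.0 too
dfr <- data.frame(
  lc = letters[1:4],
  uc = LETTERS[1:4],
  num1 = c(1, 1, 1, 1),
  num2 = c(0, 1, 0, 1),
  stringsAsFactors = TRUE
)

# Numeric columns only: 2 of 2 columns are usable
d_numeric <- dist(dfr[3:4])

# Whole data frame: 2 of 4 columns are usable, the rest become NA
d_all <- suppressWarnings(dist(dfr))

# For Euclidean distance the squared sum is multiplied by
# (total columns / usable columns), so the distance itself
# is multiplied by the square root of that ratio
rescaled <- d_numeric * sqrt(ncol(dfr) / 2)

all.equal(as.vector(rescaled), as.vector(d_all))
#> [1] TRUE
```

So the numbers from the whole data frame are not wrong as such, but they are inflated to make up for columns that contributed nothing.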
Compare the dist(dfr) result above with what you get if you first convert the factors into numeric values yourself:
dfr2 <- as.data.frame(lapply(dfr, as.numeric))
dfr2
#>   lc uc num1 num2
#> 1  1  1    1    0
#> 2  2  2    1    1
#> 3  3  3    1    0
#> 4  4  4    1    1
dist(dfr2)
#>          1        2        3
#> 2 1.732051
#> 3 2.828427 1.732051
#> 4 4.358899 2.828427 1.732051
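But that result isn’t terribly meaningful! The integer codes only reflect the (alphabetical) order of the factor levels, so treating "a" and "d" as being 3 units apart is arbitrary.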
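If the numeric columns are what you actually want to compare, one option is to find them programmatically and pass only those to dist(). This is just a sketch on the toy data from above, not a recommendation for your real data, and is_num is a name I made up for the example:

```r
# Same toy data as above
dfr <- data.frame(
  lc = letters[1:4],
  uc = LETTERS[1:4],
  num1 = c(1, 1, 1, 1),
  num2 = c(0, 1, 0, 1),
  stringsAsFactors = TRUE
)

# TRUE for each column that is numeric, FALSE otherwise
is_num <- vapply(dfr, is.numeric, logical(1))

# Columns that dist() would have turned into NAs
names(dfr)[!is_num]
#> [1] "lc" "uc"

# Distances computed from the numeric columns only, with no
# coercion warning
d <- dist(dfr[is_num])
```

Running str() on your own data frame and checking which columns come out FALSE here should tell you where the NAs are coming from.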