Trouble with Multi-Linear Regression Model

blackish952 · June 7, 2018, 12:57pm

Hello,

I hope I am posting this question in the right forum.

I am building a regression model:

> m2 = lm(log(abs(MarginDollars)) ~
            CUST_REGION_DESCR + log(abs(Sales)) + QtySold + log(abs(MFGCOST)) +
            PRODUCT_SUB_LINE_DESCR, data)

The reason why I used abs() is that some values in my variables are negative:

> summary(MFGCOST)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3900.00    13.72    33.29    65.78    78.05 53138.51 
> summary(QtySold)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-40.000   1.000   1.000   2.806   3.000 499.000 
> summary(MarginDollars)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2222.00     6.43    16.95    28.77    37.62 24316.27

I am using a log transformation because some values in my variables are very large, so taking the log scales the numbers down and helps me see the correlation better.

I am getting an error message:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'x'

I did some checks:

> all(is.na(MarginDollars))
[1] FALSE
> all(is.na(log(abs(MarginDollars))))
[1] FALSE
> all(is.na(log(abs(MFGCOST))))
[1] FALSE
> all(is.na(log(abs(Sales))))
[1] FALSE
> all(is.na(CUST_REGION_DESCR))
[1] FALSE
> all(is.na(QtySold))
[1] FALSE
> all(is.na(PRODUCT_SUB_LINE_DESCR))
[1] FALSE

If none of the variables involved contain any NAs, what is the error message telling me?

Thanks!

tbradley · June 7, 2018, 1:02pm

There are likely NA, NaN, or Inf values in one of your variables. The all function checks whether every data point is NA. You should use the any function instead, which tells you whether there are any NA values in the variable. You could also use the which function to find their locations.

Here is a toy example showing the difference:

dummy <- c(1, 3, 45, 3, 5, NA_real_)

# only returns TRUE if all elements are NA
all(is.na(dummy))
#> [1] FALSE

# returns TRUE if any elements are NA
any(is.na(dummy))
#> [1] TRUE

# gives you the index of which elements are NA
which(is.na(dummy))
#> [1] 6

blackish952 · June 7, 2018, 1:06pm

@tbradley
I did check.

All returned false!

tbradley · June 7, 2018, 1:09pm

You checked with any? Did you also check using is.nan and is.infinite instead of is.na?
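
A minimal sketch of how these checks differ, using a made-up numeric vector (the values and the names x and bad are illustrative, not from the original data). For numeric vectors, is.finite flags NA, NaN, Inf, and -Inf in a single pass, which may be a convenient one-stop check:

```r
# Toy vector (invented for illustration) with one NA, one NaN, and one -Inf
x <- c(1.5, NA, NaN, -Inf, 4)

is.na(x)        # TRUE for NA and NaN, but FALSE for -Inf
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE only for Inf and -Inf

# is.finite() is FALSE for NA, NaN, Inf, and -Inf, so one check covers them all
bad <- !is.finite(x)
any(bad)    # is there at least one problematic value?
which(bad)  # positions of the problematic values
```

The same pattern could be applied to each transformed column, for example to the result of log(abs(...)), before calling lm.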

blackish952 · June 7, 2018, 1:28pm

@tbradley
Hello Tyler,
So here we go:

> any(is.infinite(log(abs(MarginDollars))))
[1] TRUE
> any(is.infinite(log(abs(Sales))))
[1] TRUE
> any(is.infinite(log(abs(MFGCOST))))
[1] TRUE
> any(is.infinite(CUST_REGION_DESCR))
[1] FALSE
> any(is.infinite(QtySold))
[1] FALSE
> any(is.infinite(PRODUCT_SUB_LINE_DESCR))
[1] FALSE

So, my doubt is: maybe some of my values are 0 so log(0) is not defined...

jcblum · June 7, 2018, 1:44pm

Yes, in R, log(0) returns -Inf. This StackOverflow discussion might help:

Basically, if your data have meaningful zeroes, then a log transformation is not appropriate because the natural logarithm is only defined for x > 0. If the zeroes are really just missing data, then they need to be encoded and dealt with as missing data. There are other transformations (such as square root) that might be more appropriate for data like yours.
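
As a rough sketch of the "zeroes as missing data" case, the toy example below recodes zeros to NA before fitting. The data frame df and its columns margin and sales are invented for illustration, and whether zeros really mean "missing" depends on the actual data:

```r
# Toy data frame (values invented for illustration)
df <- data.frame(
  margin = c(10, 0, 25, 40, 5, 60),
  sales  = c(100, 50, 0, 400, 60, 500)
)

log(0)  # -Inf, which lm() refuses to fit

# Assumption: zeros stand for missing data, so recode them as NA first
df$margin[df$margin == 0] <- NA
df$sales[df$sales == 0]   <- NA

# With NA in place of 0, lm() drops the incomplete rows
# (the default na.action is na.omit)
fit <- lm(log(margin) ~ log(sales), data = df)
nobs(fit)  # number of rows actually used in the fit
```

If the zeros are meaningful values rather than missing ones, dropping them this way would bias the model, so a different transformation would be the safer route.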

blackish952 · June 7, 2018, 1:46pm

@jcblum
So much to learn about Data Science every day. I love it!
Also, I do have some negative values in my variables (Customer returns, etc.), so I don't think square root will work either.
Where do I find readings on Square Root Transformation?

Thanks!

blackish952 · June 7, 2018, 2:38pm

@tbradley
Hello Tyler,
How can I interpret this plot?
[Image: screenshot of the plot, not preserved]

EconKid · June 8, 2018, 12:28am

Hi, @blackish952,

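another way to transform MFGCOST while preserving its original ordering is to shift the variable so that its minimum becomes 1 before taking the log:

df$MFGCOST <- log(df$MFGCOST - min(df$MFGCOST) + 1)

Subtracting the minimum shifts the smallest value to 0, and adding 1 then avoids log(0).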
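
A small sketch of this shift-then-log idea on a toy vector (the values and the names cost, shifted, and log_cost are illustrative). Note that the minimum is subtracted, so a negative minimum moves every value upward:

```r
# Toy cost vector with a negative minimum (values invented for illustration)
cost <- c(-3900, 0, 13.72, 33.29, 78.05)

# Subtract the minimum so the smallest value becomes 0,
# then add 1 so that it maps to log(1) = 0
shifted  <- cost - min(cost) + 1
log_cost <- log(shifted)

min(shifted)              # 1
all(is.finite(log_cost))  # TRUE: no NaN or -Inf left

# The transformation is monotone, so the ordering is unchanged
identical(order(cost), order(log_cost))  # TRUE
```

One caveat with this approach: the shift depends on the sample minimum, so coefficients on the transformed variable are harder to interpret than with a plain log.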

rahmed · June 26, 2018, 1:25am

Could we use log1p function instead?

blackish952 · June 26, 2018, 1:35am

Please explain what log1p does.

rahmed · June 26, 2018, 2:03am

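The function log1p computes log(1 + x), where x is a numeric vector, so log1p(0) is equivalent to log(1), which is 0. It uses the natural logarithm (base e), not base 10, and it is more accurate than log(1 + x) when x is close to 0. It is only defined for x > -1, so it works well for non-negative x but not for the large negative values in this data.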
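
A quick sketch of log1p's behaviour in base R (the input values are made up for illustration):

```r
x <- c(0, 1, 9)

log1p(x)                         # natural log of (1 + x)
all.equal(log1p(x), log(1 + x))  # TRUE up to numerical tolerance

log1p(0)   # 0, because log(1) = 0
log1p(-1)  # -Inf, because log(0) = -Inf

# For x < -1 the argument of the log is negative and the result is NaN
# (R also emits a warning, suppressed here)
suppressWarnings(log1p(-2))

# One way to get a base-10 version is to rescale by log(10)
log1p(9) / log(10)  # approximately 1, since log10(10) = 1
```

Because of the NaN for x < -1, log1p alone would not handle negative values such as customer returns; those would still need a shift or a separate treatment.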