Fitting a complicated slope

Hello,

I'm trying to replicate some analysis where two quantities x and y are related by the following equation, in which A through G are dimensionless constants:

y = \frac{Ax^2+Bx+C+Dx^{-1}}{Ex^2+Fx+Gx}

I am at a complete loss as to how to turn a tibble (for the sake of simplicity, say it is named df and the columns are just x and y) into a model that follows this equation. Is this an nlm() job?

This is something I struggle with: transforming a known mathematical expression into the R formula that could produce it. For full context, this is a model taken from an Excel spreadsheet where A through G were given and you just had to plug in your value of x. I have my own data that I'd like to use to recreate the same kind of x-y curve.

Thanks in advance,
Jack

Hi there,

If I understand correctly, you are trying to calculate y given x and the 7 constants?
In that case, here is an example of how to do that:

set.seed(5) #only for reproducibility

#Set the 7 constants
cons = c(1,2,3,4,5,6,7)

#Get data with x values
data = data.frame(x = runif(10, 0, 1))
data
#>            x
#> 1  0.2002145
#> 2  0.6852186
#> 3  0.9168758
#> 4  0.2843995
#> 5  0.1046501
#> 6  0.7010575
#> 7  0.5279600
#> 8  0.8079352
#> 9  0.9565001
#> 10 0.1104530

#Plug in the formula to calculate y
data$y = (cons[1] * data$x^2 + cons[2] * data$x + 
            cons[3] + cons[4] * data$x^(-1)) / 
  (cons[5] * data$x^2 + cons[6] * data$x + cons[7] * data$x)

#Plot x and y
plot(data$x, data$y)

Created on 2022-01-27 by the reprex package (v2.0.1)

Hope this helps,
PJ

Ah, no, apologies for the misunderstanding. I'm trying to fit my own model with that structure and get my own constants out of it.

So you have the values of x and y, and you want to find the constants A-G?

Yes, that is correct!

This doesn't seem workable on its face...
Fx and Gx are both in the denominator, so only their sum is identifiable; that's an infinity of solutions even for fixed, known x/y pairs...
Maybe it should be only Fx, dropping the Gx, since x appears at the same power?
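
To make that concrete, here is a small sketch (with arbitrary values): because Fx and Gx have the same power of x, only the sum F + G matters, so different F/G pairs produce exactly the same denominator.

x <- seq(0.5, 2, 0.1)
d1 <- 5 * x^2 + 6 * x + 7 * x   # E = 5, F = 6, G = 7
d2 <- 5 * x^2 + 10 * x + 3 * x  # same E; F + G is still 13
all.equal(d1, d2)
#> [1] TRUE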


I think it's hard to do; here's my best:

# generate data that conforms to some example params

yfunc <- function(x,
                  A_ = 100,
                  B_ = -75,
                  C_ = 40,
                  D_ = -15,
                  E_ = 10,
                  F_ = -1) {
  (A_ * x^2 + B_ * x + C_ + D_ / x) / (E_ * x^2 + F_ * x)
}

(x_ <- seq(1,10,0.1))
(y_ <- yfunc(x_))

plot(x_,y_)



y_solve_func <- function(p) {
  # p[[1]] .. p[[6]] are candidate values for A_ .. F_
  lft <- purrr::map_dbl(x_,
                        ~ yfunc(., p[[1]], p[[2]], p[[3]],
                                p[[4]], p[[5]], p[[6]]))
  rgt <- y_
  sqrt(sum((lft - rgt)^2))  # root sum of squared residuals
}


optim(rep(30, 6), y_solve_func,
      method = "BFGS", control = list(maxit = 10000, trace = TRUE))
# converged
# $par
# [1] 93.440883 24.016749 -9.362049 -6.886836  9.314845  8.862910
# 
# $value
# [1] 0.02698084
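
A quick sanity check (a sketch): save the optim result and overlay the recovered curve on the generated data.

res <- optim(rep(30, 6), y_solve_func,
             method = "BFGS", control = list(maxit = 10000))
plot(x_, y_)
lines(x_, purrr::map_dbl(x_, ~ yfunc(., res$par[[1]], res$par[[2]],
                                     res$par[[3]], res$par[[4]],
                                     res$par[[5]], res$par[[6]])),
      col = "red")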

I think this requires a nonlinear optimization with optim.

You must write a function that accepts the parameter values A, B, C, ... as an argument and returns the sum of squared errors.

Then ask optim to adjust the parameter values to minimize the error.

In my experience, you have to be careful that the arguments are consistent between all your functions. You'll have a function for the model, a function that returns the error, and optim, which requires the arguments to be specified a certain way.

Here's my solution adapted from work I've done in the past. method = "BFGS" generally works best for me.

library(tidyverse)

# model
my_model <- function(p, x) {
  y <- p["A"] * x^2 + p["B"] * x + p["C"] + p["D"] / x # just the numerator
  return(y)
}

# real values
p <- c(A = 1, B = 2, C = 3, D = 4)

# simulated data 
# make x order 1 otherwise some model terms will dominate
my_data <- tibble(
  x = seq(0.5, 1.5, .001),  
  y = my_model(p, x) # + 0.01 + runif(length(x))
)

with(my_data, qplot(x, y))

# objective fn
calc_obj <- function(p, model, df) {
  resid <- model(p, df$x) - df$y
  return(sqrt(sum(resid^2)))
}

# initial guess
p0 <- c(A = 1.1, B = 2.2, C = 3.3, D = 4.4) 

# try to recover the real values through nonlinear opt
soln <- optim(
  par = p0, 
  fn = calc_obj, 
  method = "BFGS", 
  control = list(trace = 5, REPORT = TRUE),
  model = my_model,
  df = my_data
)
#> initial  value 33.193450 
#> iter   2 value 5.247466
#> iter   3 value 4.227183
#> iter   4 value 1.328340
#> iter   5 value 0.789765
#> ...
#> iter  98 value 0.000416
#> iter  99 value 0.000415
#> iter 100 value 0.000413
#> final  value 0.000413 
#> stopped after 100 iterations

# results
tibble(
  name       = names(p),
  true_value = p,
  guess      = p0,
  solution   = soln$par
) %>% 
  print()
#> # A tibble: 4 x 4
#>   name  true_value guess solution
#>   <chr>      <dbl> <dbl>    <dbl>
#> 1 A              1   1.1    0.999
#> 2 B              2   2.2    2.00 
#> 3 C              3   3.3    3.00 
#> 4 D              4   4.4    4.00

Created on 2022-01-28 by the reprex package (v2.0.1)

Hi,

I'm now also interested to see whether there is indeed a way to do this with the provided formula. I tried to use nls() but I keep getting this error:

data = data.frame(x = 10:1, y = 1:10)
nls(y ~ (Aa*x^2 + Bb*x + Cc + Dd*x^(-1)) * Ee,
    data, start = list(Aa = 1, Bb = 1, Cc = 1, Dd = 1, Ee = 1))

Error in nlsModel(formula, mf, start, wts, scaleOffset = scOff, nDcentral = nDcntr) : 
  singular gradient matrix at initial parameter estimates

As @nirgrahamuk showed, this might be because of the formula, and I like his approach for estimating the initial values. I'm not good at this type of maths, so I will follow closely to see which approach works best in the end 🙂

Try some different starting values; don't make them all the same.

I tried many combinations and they all gave the same error 😛 Then again, I was doing this totally at random...

Oh. I think the issue is that Ee multiplies the whole numerator, so scaling Aa-Dd up and scaling Ee down by the same factor gives an identical fit; the parameters are only identified up to that scaling. That's probably the source of the singular gradient error.
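
A sketch of a fix under that reading: drop the redundant Ee (i.e. fix it at 1). Note that nls() is documented to misbehave on artificial zero-residual data, so a little noise is added to the toy example:

set.seed(1)
data <- data.frame(x = 10:1)
data$y <- (11 - data$x) + rnorm(10, sd = 0.01) # the toy data follow y = 11 - x
nls(y ~ Aa * x^2 + Bb * x + Cc + Dd * x^(-1),
    data, start = list(Aa = 1, Bb = 1, Cc = 1, Dd = 1))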


This is not the kind of problem that one typically encounters in applied statistics: you know the data-generating process in functional form but not the coefficients. Think of someone telling you, "I know that the circumference of a circle is related to the radius, but I don't know what the constant is." How would you find π? But this relationship has many more degrees of freedom.


So you have the data for x and y, and you have a prior assumption of what the relationship is; this could be framed as a Bayesian estimation problem.
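
For example, a minimal sketch with brms (assuming a data frame df with the x and y columns, using the reduced form with G merged into F since only their sum is identifiable; the priors are placeholders):

library(brms)

bform <- bf(
  y ~ (A * x^2 + B * x + C + D / x) / (E * x^2 + F * x),
  A + B + C + D + E + F ~ 1,
  nl = TRUE
)
# placeholder priors; tighten them with whatever is known about A-F
bprior <- prior(normal(0, 10), nlpar = "A") +
  prior(normal(0, 10), nlpar = "B") +
  prior(normal(0, 10), nlpar = "C") +
  prior(normal(0, 10), nlpar = "D") +
  prior(normal(0, 10), nlpar = "E") +
  prior(normal(0, 10), nlpar = "F")
fit <- brm(bform, data = df, prior = bprior)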


So if we plug your original equation into something like Desmos, we can see that if all the parameters equal 1, we get a really interesting plot with lines going down each axis.

If you make a scatter plot of your XY data, you could use Desmos by adjusting the sliders until you get something that you could imagine would fit your data. Then use those values as starting values in nls().

I guess my concern is: what sort of data would you have that requires such a function? Do you have negative X and Y values too? Are you only concerned with what is going on when X and Y are both positive?
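
For example (a sketch; the starting values below are placeholders for whatever the Desmos sliders suggest, with G merged into F since only their sum is identifiable):

fit <- nls(
  y ~ (A * x^2 + B * x + C + D / x) / (E * x^2 + F * x),
  data = df,
  start = list(A = 1, B = 1, C = 1, D = 1, E = 1, F = 1)
)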


I agree with @arthur.t that coming up with the parameter values will require some clever work with nls and optim. This is one reason why self-starting functions are popular (some are included in stats, and more are available through other packages like nlraa).

Just as an example, here is a reprex that uses a self-starting function with 6 parameters. That is pretty close to your 7, and it demonstrates that your problem may well be solvable.

library(tidyverse)
library(nlraa)
library(minpack.lm)

# fit the 6-parameter self-starting model SSsharp (from nlraa)
fit <- minpack.lm::nlsLM(
    Petal.Width ~ SSsharp(Sepal.Length, a, b, c, d, e, f), data = iris
    )

fit

iris %>% 
    ggplot(aes(Sepal.Length, Petal.Width)) +
    geom_point() +
    geom_line(aes(y = predict(fit))) +
    # or
    geom_line(stat = "smooth",
              method = "nlsLM",
              formula = y ~ SSsharp(x, a, b, c, d, e, f),
              se = FALSE,
              color = "red")

If the equation is something that an engineer came up with in Excel solver, frankly you may want to scrap it and start fresh. I've seen some terrible models created by people who didn't understand overfitting.

The practice of "curve fitting" with polynomials and Excel solver is best left in the past.

