Creating a data frame based of values created using lm() for correlation between variables in ggplot

So I'm trying to create a data table that is grouped by a certain column in my data frame, "locid" based on the value that was created by the lm() function in ggplot2. The number highlighted in yellow is what I am looking to get.(Sorry for including image but I don't know the exact technical term for that number)


Is there any way to do this? This is what I have tried so far but it doesn't work

df_cor<-final_new_df%>%
dplyr::group_by(locid)%>%
dplyr::summarise(correl=lm(formula = y~x,Specific_conductance,tds))

Sample Data:

structure(list(stdate = structure(c(16611, 16611, 16615, 16615,
14004, 14004, 16616, 16616, 16616, 17485, 17485, 17483, 17483,
16678, 16678, 14000, 14000, 17211, 17211, 17210), class = "Date"),
sttime = structure(c(37800, 37800, 35100, 35100, 42600, 42600,
38700, 38700, 32400, 35400, 35400, 33000, 33000, 49800, 49800,
34200, 34200, 37800, 37800, 30600), class = c("hms", "difftime"
), units = "secs"), locid = c("USGS-01388500", "USGS-01388500",
"USGS-01464585", "USGS-01464585", "USGS-01464515", "USGS-01464515",
"USGS-01407330", "USGS-01407330", "USGS-01466500", "USGS-01387500",
"USGS-01387500", "USGS-01395000", "USGS-01395000", "USGS-01400860",
"USGS-01400860", "USGS-01377000", "USGS-01377000", "USGS-01367625",
"USGS-01367625", "USGS-01398000"), Specific_conductance = c(525,
525, 184, 184, 226, 226, 203, 203, 41, 674, 674, 466, 466,
312, 312, 540, 540, 844, 844, 683), valunit_sc = c("uS/cm @25C",
"uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C",
"uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C",
"uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C",
"uS/cm @25C", "uS/cm @25C", "uS/cm @25C", "uS/cm @25C"),
HUC14 = c("HUC02030103110020", "HUC02030103110020", "HUC02040201100030",
"HUC02040201100030", "HUC02040201060020", "HUC02040201060020",
"HUC02030104070070", "HUC02030104070070", "HUC02040202030070",
"HUC02030103100030", "HUC02030103100030", "HUC02030104050060",
"HUC02030104050060", "HUC02030105090020", "HUC02030105090020",
"HUC02030103170060", "HUC02030103170060", "HUC02020007010010",
"HUC02020007010010", "HUC02030105030060"), WMA_sc = c(3L,
3L, 20L, 20L, 20L, 20L, 12L, 12L, 19L, 3L, 3L, 7L, 7L, 10L,
10L, 5L, 5L, 2L, 2L, 8L), tds = c(294, 275, 119, 100, 155,
116, 155, 115, 43, 403, 382, 286, 274, 177, 173, 328, 277,
435, 440, 347), valunit_tds = c("mg/l", "mg/l", "mg/l", "mg/l",
"mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l",
"mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l"
), WMA_tds = c(3L, 3L, 20L, 20L, 20L, 20L, 12L, 12L, 19L,
3L, 3L, 7L, 7L, 10L, 10L, 5L, 5L, 2L, 2L, 8L), Chloride = c(109,
109, 31.9, 31.9, 33, 33, 36.4, 36.4, 3.38, 153, 153, 72.6,
72.6, 41.5, 41.5, 105, 105, 179, 179, 161), valunit = c("mg/l",
"mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l",
"mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l", "mg/l",
"mg/l", "mg/l", "mg/l"), WMA = c(3L, 3L, 20L, 20L, 20L, 20L,
12L, 12L, 19L, 3L, 3L, 7L, 7L, 10L, 10L, 5L, 5L, 2L, 2L,
8L)), .Names = c("stdate", "sttime", "locid", "Specific_conductance",
"valunit_sc", "HUC14", "WMA_sc", "tds", "valunit_tds", "WMA_tds",
"Chloride", "valunit", "WMA"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))

I would appreciate any help or guidance or if there is another way to do this. Thanks in advance.

Could be wrong but I think you're going to want to join your two tables first and then the lm() stats will be available in your main table.

If you can post a simple reprex we can probably help you out.

@rywhale Not sure what you mean by joining my two tables first? I have one data frame
Also, what else would you need to make reproducible? I added a sample of my data and the code I tried.

Sorry, I thought you had used the lm() function rather than calling it in ggplot(). Your code above won't work because lm() returns a whole bunch of stuff, rather than just a single value.

e.g. for your data

Call:
lm(formula = Specific_conductance ~ tds, data = my_data)

Coefficients:
(Intercept)          tds  
     -36.30         1.92 

You can access these internally by index like so:

test_data <- my_data %>%
    group_by(locid) %>%
    summarize(correl = lm(Specific_conductance ~ tds)$coefficients[[2]][1],
              intercept = lm(Specific_conductance ~ tds)$coefficients[[1]][1])

head(test_data)
# A tibble: 6 x 3
  locid             correl intercept
  <chr>              <dbl>     <dbl>
1 USGS-01367625   0.            844 
2 USGS-01377000   0.            540 
3 USGS-01387500  -7.66e-15      674.
4 USGS-01388500  -8.46e-15      525.
5 USGS-01395000  -6.70e-15      466.
6 USGS-01398000  NA             683 

Not sure if this helps. Maybe someone else will jump in with more knowledge.

1 Like

@rywhale It's okay! I appreciate your response. I just have a few questions..
1.) What is the difference between the correl & intercept columns? Is correl the r-squared value?
2.) How did you know which index to use?

No, I believe correl is the correlation coefficient. See here for more information on interpreting the output of lm().

You can get more information on the model using summary() like so:

my_lm <- lm(Specific_conductance ~ tds, data = my_data)
summary(my_lm)

Call:
lm(formula = Specific_conductance ~ tds, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.588 -26.736   2.567  33.750  52.950 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.30481   19.97640  -1.817   0.0858 .  
tds           1.92033    0.07351  26.122 9.18e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 38.85 on 18 degrees of freedom
Multiple R-squared:  0.9743,	Adjusted R-squared:  0.9729 
F-statistic: 682.4 on 1 and 18 DF,  p-value: 9.182e-16

Going to bow out now as this is not my area of expertise, hopefully someone can help with any further questions.

2 Likes

You can use broom to extract the model parameters tidily.

3 Likes