I was trying to knit an R Markdown file (.Rmd) to PDF and it showed this message:
```
processing file: clt_and_t-distribution.Rmd
|...... | 9%
ordinary text without R code
|............. | 18%
label: options (with options)
List of 1
$ echo: logi FALSE
|................... | 27%
ordinary text without R code
|......................... | 36%
label: unnamed-chunk-1
|................................ | 45%
ordinary text without R code
|...................................... | 55%
label: unnamed-chunk-2 (with options)
List of 1
$ message: logi FALSE
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
|............................................. | 64%
ordinary text without R code
|................................................... | 73%
label: population_histograms (with options)
List of 3
$ fig.cap : chr "Histograms of all weights for both populations."
$ fig.width : num 10.5
$ fig.height: num 5.25
|......................................................... | 82%
ordinary text without R code
|................................................................ | 91%
label: population_qqplots (with options)
List of 3
$ fig.cap : chr "Quantile-quantile plots of all weights for both populations."
$ fig.width : num 10.5
$ fig.height: num 5.25
|......................................................................| 100%
ordinary text without R code
"F:/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS clt_and_t-distribution.utf8.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output clt_and_t-distribution.tex --self-contained --highlight-style tango --pdf-engine pdflatex --variable graphics --lua-filter "F:/R-3.6.3/library/rmarkdown/rmd/lua/pagebreak.lua" --lua-filter "F:/R-3.6.3/library/rmarkdown/rmd/lua/latex-div.lua" --variable "geometry:margin=1in"
output file: clt_and_t-distribution.knit.md
Error: LaTeX failed to compile clt_and_t-distribution.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips.
In addition: Warning message:
In system2(..., stdout = if (use_file_stdout()) f1 else FALSE, stderr = f2) :
'"pdflatex"' not found
Execution halted
No LaTeX installation detected (LaTeX is required to create PDF output). You should install a LaTeX distribution for your platform: https://www.latex-project.org/get/
If you are not sure, you may install TinyTeX in R: tinytex::install_tinytex()
Otherwise consider MiKTeX on Windows - http://miktex.org
MacTeX on macOS - https://tug.org/mactex/
(NOTE: Download with Safari rather than Chrome _strongly_ recommended)
Linux: Use system package manager
```
I tried to install TinyTeX using that function, and this is what I got:
```
> tinytex::install_tinytex()
trying URL 'http://mirror.ctan.org/systems/texlive/tlnet/install-tl.zip'
Content length 339 bytes
downloaded 23.1 MB
trying URL 'https://yihui.org/gh/tinytex/tools/pkgs-custom.txt'
Content length 81 bytes
downloaded 534 bytes
trying URL 'https://yihui.org/gh/tinytex/tools/tinytex.profile'
Content length 81 bytes
downloaded 183 bytes
Starting to install TinyTeX to C:\Users\DEEP & AVRA\AppData\Roaming/TinyTeX. It will take a few minutes.
'A~1\AppData\Local\Temp\RtmpU7qBGQ\install-tl-20200509\' is not recognized as an internal or external command,
operable program or batch file.
'perl' is not recognized as an internal or external command,
operable program or batch file.
Please quit and reopen your R session and IDE (if you are using one, such as RStudio or Emacs) and check if tinytex:::is_tinytex() is TRUE.
Warning message:
In file.remove("TinyTeX/install-tl.log") :
cannot remove file 'TinyTeX/install-tl.log', reason 'No such file or directory'
>
```
I quit my R session, reopened it, and ran the check as prompted to see whether the installation succeeded. This is what I got:
```
> tinytex:::is_tinytex()
[1] FALSE
>
```
Can anyone help me out?
I don't know where I'm going wrong.
By the way, this is the file that I'm trying to knit to PDF:
---
title: "Central Limit Theorem and t-distribution"
output: pdf_document
layout: page
---
```{r options, echo=FALSE}
library(knitr)
# Store figures under figure/<file name>- ; note the escaped dot before Rmd in the pattern
opts_chunk$set(fig.path = paste0("figure/", sub("(.*)\\.Rmd", "\\1", basename(knitr:::knit_concord$get('infile'))), "-"))
```
## Central Limit Theorem and t-distribution
Below we will discuss the Central Limit Theorem (CLT) and the t-distribution, both of which help us make important calculations related to probabilities. Both are frequently used in science to test statistical hypotheses. To use them, we have to make assumptions, and the assumptions needed for the CLT differ from those needed for the t-distribution. However, if the assumptions hold, then we are able to calculate the exact probabilities of events through the use of mathematical formulas.
#### Central Limit Theorem
The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average $\bar{Y}$ of a random sample follows a normal distribution centered at the population average $\mu_Y$ and with standard deviation equal to the population standard deviation $\sigma_Y$, divided by the square root of the sample size $N$. We refer to the standard deviation of the distribution of a random variable as the random variable's _standard error_.
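To make this concrete, here is a minimal simulation sketch; the population and sample size below are invented for illustration and are not the mice data. We draw many samples of size $N$ and check that the standard deviation of the sample averages is close to $\sigma_Y/\sqrt{N}$:

```{r clt_se_sketch}
set.seed(1)
population <- rnorm(100000, mean = 30, sd = 5) # hypothetical population
N <- 50
# Standard deviation of many sample averages, i.e. the standard error
averages <- replicate(10000, mean(sample(population, N)))
sd(averages)
sd(population) / sqrt(N) # the CLT prediction
```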
Please note that if we subtract a constant from a random variable, the
mean of the new random variable shifts by that
constant. Mathematically, if $X$ is a random variable with mean $\mu$
and $a$ is a constant, the mean of $X - a$ is $\mu-a$. A similarly
intuitive result holds for multiplication and the standard deviation (SD).
If $X$ is a random
variable with mean $\mu$ and SD $\sigma$, and $a$ is a constant, then
the mean and SD of $aX$ are $a \mu$ and $\mid a \mid \sigma$
respectively. To see how intuitive this is, imagine that we subtract
10 grams from each of the mice weights. The average weight should also
drop by that much. Similarly, if we change the units from grams to
milligrams by multiplying by 1000, then the spread of the numbers
becomes larger.
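These rules are easy to check numerically. Here is a quick sketch with an arbitrary simulated random variable (the numbers are made up for illustration):

```{r transformation_rules_sketch}
set.seed(1)
X <- rnorm(100000, mean = 10, sd = 2) # arbitrary illustrative variable
a <- 3
mean(X - a) # shifts to approximately mean(X) - a = 7
sd(X - a)   # unchanged, approximately 2
mean(a * X) # scales to approximately a * mean(X) = 30
sd(a * X)   # scales to approximately |a| * sd(X) = 6
```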
This implies that if we take many samples of size $N$, then the quantity:
$$
\frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}}
$$
is approximated with a normal distribution centered at 0 and with standard deviation 1.
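A simulation sketch (again with invented population values) shows that this standardized quantity behaves like a standard normal:

```{r standardized_average_sketch}
set.seed(1)
mu <- 30; sigma <- 5; N <- 50
Z <- replicate(10000, (mean(rnorm(N, mu, sigma)) - mu) / (sigma / sqrt(N)))
mean(Z)          # close to 0
sd(Z)            # close to 1
mean(abs(Z) > 2) # close to the normal tail probability, about 0.05
```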
Now we are interested in the difference between two sample averages. Here again a mathematical result helps. If we have two random variables $X$ and $Y$ with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ respectively, then we have the following result: the mean of the sum $Y + X$ is the sum of the means $\mu_Y + \mu_X$. Using one of the facts we mentioned earlier, the mean of $Y - X$ is the mean of $Y + aX$ with $a = -1$, which is $\mu_Y - \mu_X$. This is intuitive. However, the next result is perhaps not as intuitive. If $X$ and $Y$ are independent of each other, as they are in our mouse example, then the variance (SD squared) of $Y + X$ is the sum of the variances $\sigma_Y^2 + \sigma_X^2$. This implies that the variance of the difference $Y - X$ is the variance of $Y + aX$ with $a = -1$, which is $\sigma^2_Y + a^2 \sigma_X^2 = \sigma^2_Y + \sigma_X^2$. So the variance of the difference is also the sum of the variances. If this seems like a counterintuitive result, remember that if $X$ and $Y$ are independent of each other, the sign does not really matter. It can be considered random: if $X$ is normal with a certain variance, for example, so is $-X$. Finally, another useful result is that the sum of normal variables is again normal.
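These properties can also be checked with a short simulation sketch using two independent, arbitrary random variables:

```{r variance_sum_sketch}
set.seed(1)
X <- rnorm(100000, mean = 5, sd = 2) # independent illustrative variables
Y <- rnorm(100000, mean = 8, sd = 3)
var(Y + X) # close to var(Y) + var(X) = 13
var(Y - X) # the difference has the same variance, also close to 13
```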
All this math is very helpful for the purposes of our study because we have two sample averages and are interested in the difference. Because both are normal, the difference is normal as well, and the variance (the standard deviation squared) is the sum of the two variances.
Under the null hypothesis that there is no difference between the population averages, the difference between the sample averages $\bar{Y}-\bar{X}$, with $\bar{X}$ and $\bar{Y}$ the sample averages for the two diets respectively, is approximated by a normal distribution centered at 0 (there is no difference) and with standard deviation $\sqrt{\sigma_X^2 +\sigma_Y^2}/\sqrt{N}$ (here both samples have size $N$).
This suggests that this ratio:
$$
\frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{M} + \frac{\sigma_Y^2}{N}}}
$$
is approximated by a normal distribution centered at 0 and with standard deviation 1. Using this approximation makes computing p-values simple because we know the proportion of the distribution under any value. For example, only about 5% of these values are larger than 2 in absolute value:
```{r}
pnorm(-2) + (1 - pnorm(2))
```
We don't need to buy more mice; 12 and 12 suffice.
However, we can't claim victory just yet because we don't know the population standard deviations: $\sigma_X$ and $\sigma_Y$. These are unknown population parameters, but we can get around this by using the sample standard deviations, call them $s_X$ and $s_Y$. These are defined as:
$$ s_X^2 = \frac{1}{M-1} \sum_{i=1}^M (X_i - \bar{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2 $$
Note that we are dividing by $M-1$ and $N-1$, instead of by $M$ and $N$. There is a theoretical reason for doing this that we do not explain here. But to get an intuition, think of the case when you have just 2 numbers. The average distance to the mean is basically 1/2 the difference between the two numbers, so you really only have information from one number. This is somewhat of a minor point. The main point is that $s_X$ and $s_Y$ serve as estimates of $\sigma_X$ and $\sigma_Y$.
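Note that R's built-in `var` and `sd` already use the $M-1$ denominator, as this small sketch confirms with arbitrary numbers:

```{r sample_variance_sketch}
x <- c(2, 4, 4, 6, 9) # arbitrary small sample
M <- length(x)
sum((x - mean(x))^2) / (M - 1) # the definition above
var(x)                         # R divides by M - 1, so these match
```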
So we can redefine our ratio as
$$
\sqrt{N} \frac{\bar{Y}-\bar{X}}{\sqrt{s_X^2 +s_Y^2}}
$$
if $M=N$ or in general,
$$
\frac{\bar{Y}-\bar{X}}{\sqrt{\frac{s_X^2}{M} + \frac{s_Y^2}{N}}}
$$
The CLT tells us that when $M$ and $N$ are large, this random variable is normally distributed with mean 0 and SD 1. Thus we can compute p-values using the function `pnorm`.
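Putting the pieces together, here is a sketch of the full computation on two hypothetical samples; the samples are simulated here because we have not loaded the mice data yet:

```{r clt_pvalue_sketch}
set.seed(1)
X <- rnorm(12, mean = 24, sd = 3) # hypothetical control sample
Y <- rnorm(12, mean = 27, sd = 3) # hypothetical treatment sample
ratio <- (mean(Y) - mean(X)) / sqrt(var(X)/12 + var(Y)/12)
2 * pnorm(-abs(ratio)) # two-sided p-value under the CLT approximation
```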
#### The t-distribution
The CLT relies on large samples; we refer to results of this type as _asymptotic results_. When the CLT does not apply, there is another option that does not rely on asymptotic results. When the original population from which a random variable, say $Y$, is sampled is normally distributed with mean 0, then we can calculate the distribution of:
$$
\sqrt{N} \frac{\bar{Y}}{s_Y}
$$
This is the ratio of two random variables so it is not
necessarily normal. The fact that the denominator can be small by
chance increases the probability of observing large
values. [William Sealy Gosset](http://en.wikipedia.org/wiki/William_Sealy_Gosset),
an employee of the Guinness brewing company, deciphered the
distribution of this random variable and published a paper under the
pseudonym "Student". The distribution is therefore called Student's
t-distribution. Later we will learn more about how this result is
used.
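To see how the t-distribution differs from the normal, we can compare tail probabilities for small degrees of freedom; the heavier tails reflect the extra variability that comes from estimating the standard deviation:

```{r t_vs_normal_sketch}
2 * pnorm(-2)       # normal probability of exceeding 2 in absolute value
2 * pt(-2, df = 4)  # larger under a t-distribution with 4 degrees of freedom
2 * pt(-2, df = 30) # with more degrees of freedom the t approaches the normal
```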
Here we will use the mice phenotype data as an example. We start by
creating two vectors, one for the control population and one for the
high-fat diet population:
```{r, message=FALSE}
library(dplyr)
dat <- read.csv("mice_pheno.csv") # We downloaded this file in a previous section
controlPopulation <- filter(dat, Sex == "F" & Diet == "chow") %>%
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat, Sex == "F" & Diet == "hf") %>%
  select(Bodyweight) %>% unlist
```
It is important to keep in mind that what we are assuming to be normal here is the distribution of $y_1,y_2,\dots,y_n$, not the random variable $\bar{Y}$. Although we can't do this in practice, in this illustrative example we get to see the full distribution for both the control and the high-fat diet mice:
```{r population_histograms, fig.cap="Histograms of all weights for both populations.",fig.width=10.5,fig.height=5.25}
library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)
```
We can use *qq-plots* to confirm that the distributions are relatively
close to being normally distributed. We will explore these plots in
more depth in a later section, but the important thing to know is that
a qq-plot compares data (on the y-axis) against a theoretical
distribution (on the x-axis). If the points fall on the identity line,
then the data are close to the theoretical distribution.
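Before looking at the mice, here is a quick sketch with simulated data: normally distributed values fall along the line, while skewed values bend away from it.

```{r qqplot_simulated_sketch, fig.width=10.5, fig.height=5.25}
set.seed(1)
mypar(1, 2)
z <- rnorm(1000) # simulated normal data: points follow the line
qqnorm(z); qqline(z)
w <- rexp(1000)  # simulated skewed data: points bend away from the line
qqnorm(w); qqline(w)
```

Now we make the same plots for the two mouse populations: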
```{r population_qqplots, fig.cap="Quantile-quantile plots of all weights for both populations.",fig.width=10.5,fig.height=5.25}
mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)
```
The larger the sample, the more forgiving the result is to weaknesses
in this approximation. In the next section, we will see that for this
particular dataset the t-distribution works well even for sample sizes
as small as 3.
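As a preview of that result, here is a simulation sketch: for samples of size 3 drawn from a normal population, the tail behavior of the t-statistic matches a t-distribution with $N-1 = 2$ degrees of freedom rather than the normal approximation.

```{r small_sample_t_sketch}
set.seed(1)
N <- 3
tstats <- replicate(10000, {
  y <- rnorm(N) # normal population with mean 0
  sqrt(N) * mean(y) / sd(y)
})
mean(abs(tstats) > 2)  # observed tail probability
2 * pt(-2, df = N - 1) # t-distribution prediction
2 * pnorm(-2)          # the normal approximation is far too small
```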