I'm writing documentation for datasets included with my R package, as described in the "Documenting Datasets" section in "Chapter 9: External Data" of the R Packages book (r-pkgs.org).
One of these datasets contains 888 variables. The first six variables are distinct and require individual documentation. The remaining 882 variables hold the same type of data in the same format, each from a different source. Specifically, this is a version of the PLINK .traw genetic marker data format, loaded into R.
Here is an excerpt from my attempt at documentation. The format described for the seventh variable applies to all subsequent variables; to avoid repeating myself, I have documented only the seven variables below.
#'
#' \describe{
#' \item{CHR}{Integer or character value indicating the chromosome or scaffold
#' on which a given SNP is found}
#' \item{SNP}{Character providing a label for a given SNP; this is optional
#' and the column contains "." for all values by default if SNPs are not
#' named. This variable is not used by the `mtmcskat` package.}
#' \item{X.C.M}{Position of a given SNP in morgans or centimorgans; this is
#' optional and can be filled with "0" if not used. This variable is not used
#' by the `mtmcskat` package.}
#' \item{POS}{Integer providing the position of a given SNP, in base pairs}
#' \item{COUNTED}{Character, either "A", "T", "C" or "G" indicating the
#' common (or most common) allele at the position of the given SNP}
#' \item{COUNTER}{Character, either "A", "T", "C" or "G" indicating the
#' alternative allele, also known as the rare allele,
#' at the position of the given SNP. If multiple alternative alleles exist
#' for a position, they are provided on separate rows with the same common
#' allele and position.}
#' \item{X201782_400194}{Columns 7-888 contain alternative allele
#' counts for each of 882 genotypes in the poplar GWAS population. These are
#' integer values from 0 to 2, giving the number of copies of the alternative
#' allele that a given genotype (column) carries at a given SNP (row).
#' The name of each column from 7-888 includes the Family ID and
#' Individual ID for each genotype, following the format
#' X<FID>_<IID>}
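To make the column naming concrete, here is a minimal sketch of how one of these names splits into its two IDs. It uses only the one column name quoted in the excerpt above and assumes neither ID contains an underscore; the variable names (`col_name`, `ids`, `fid`, `iid`) are just for illustration.

```r
# One genotype column name, copied from the documentation excerpt above.
col_name <- "X201782_400194"

# Drop the leading "X", then split on the underscore.
# Assumption: neither the Family ID nor the Individual ID contains "_".
ids <- strsplit(sub("^X", "", col_name), "_", fixed = TRUE)[[1]]

fid <- ids[1]  # Family ID     -> "201782"
iid <- ids[2]  # Individual ID -> "400194"
```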
When I run a check on my package, I receive the following warning. I have inserted ellipses because the full output lists all 888 variable names.
Variables in data frame 'sample_genodata'
Code: ALT CHR COUNTED POS SNP X.C.M X201782_400194 X201782_400495 ...
...Docs: CHR COUNTED COUNTER POS SNP X.C.M X201782_400194
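As far as I understand it, the warning compares the data frame's column names with the names documented in `\item{}` entries. Below is a rough sketch of that comparison, using a toy stand-in rather than the real `sample_genodata`; the names `code_names`, `toy`, and `doc_names` are made up for this example, and only the column names visible in the warning above are included.

```r
# Column names reported under "Code:" in the warning (truncated there,
# so only the names that are visible are used here).
code_names <- c("ALT", "CHR", "COUNTED", "POS", "SNP", "X.C.M",
                "X201782_400194", "X201782_400495")

# Toy one-row data frame with those column names; the real dataset has 888.
toy <- as.data.frame(matrix(0L, nrow = 1, ncol = length(code_names),
                            dimnames = list(NULL, code_names)))

# Names reported under "Docs:" in the warning.
doc_names <- c("CHR", "COUNTED", "COUNTER", "POS", "SNP", "X.C.M",
               "X201782_400194")

# Present in the data but not documented.
undocumented <- setdiff(names(toy), doc_names)
# -> "ALT" "X201782_400495"

# Documented but not present in the data.
not_in_data <- setdiff(doc_names, names(toy))
# -> "COUNTER"
```

With the full dataset, the first result would also include the other undocumented genotype columns.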
The large dataset is essential to the R package, which is for population-scale genetic analysis. It is critical to the unit tests and cannot be excluded from the package.
How can I write this documentation so that it avoids the warning, does not require repeating myself hundreds of times, and allows my package to be accepted by CRAN?
Thank you for your time and help!