# Cross-validation procedures for AMMI and BLUP

#### Tiago Olivoto

#### 2022-09-21

Source:`vignettes/vignettes_cross-validation.Rmd`

`vignettes_cross-validation.Rmd`

## Getting started

In this section, we will use the data in `data_ge`

. For
more information, please, see `?data_ge`

. Other data sets can
be used provided that the following columns are in the dataset:
environment, genotype, block/replicate and response variable(s). See the
section Rendering engine to know how HTML
tables were generated.

## Predictive accuracy

The predictive accucary of both AMMI and BLUP models may be obtained
using a cross-validation procedure implemented by the functions
`cv_ammif()`

and `cv_blup()`

The
`cv_ammif()`

function provides a complete cross-validation
procedure for all member of AMMI model family (AMMI0-AMMIF) using
replicate-based data, according to the diagram below. Automatically the
first validation is carried out considering the AMMIF (all possible axis
used). Considering this model, the original data set is split up into
two sets: training set and validation set.

The training set has all combinations (genotype x environment) with
the number of replications informed in `nrepval`

argument.
The validation set has one replication that were not included in the
training set. The splitting of the data set into training and validation
sets depends on the design considered. For a Randomized Complete Block
Design (default option) and the procedure we used in the article,
completely blocks are randomly selected within environments, as
suggested by Piepho (1994). The remaining block serves as
validation data. If `design = "CRD"`

is informed, thus
declaring that a completely randomized design was used, single
observations are randomized for each treatment (genotype-by-environment
combination). This is the same procedure suggested by Gauch and Zobel (1988). The estimated values for each
member of the AMMI model family in each re-sampling cycle are compared
with the observed values in the validation data. Then, the Root Mean
Square Prediction Difference is computed as follows:

\[ RMSPD = {\left[ {\left( {\sum\nolimits_{i = 1}^n {{{\left( {{{\hat y}_{ij}} - {y_{ij}}} \right)}^2}} } \right)/n} \right]^{0.5}} \]

where \(\widehat{y}_{ij}\) is the
model predicted value; and \(y_{ij}\)
is the observed value in the validation set. The number of random
selection of blocks/replicates (*n*) is defined in the argument
`nboot`

. At the end of the *n* cycles for all models,
a list with all estimated RMSPD and the average of RMSPD is
returned.

## Cross-validation for AMMI model

The following code computes the cross-validation of the oat data using the AMMI model. There are two options for doing that. The first is to perform the cross-validation for a specific member of the AMMI-family model. The second (and more realistic) is to perform the cross-validation for all AMMI-family models in the same procedure.

### Declaring the number of axes

The function `cv_ammi()`

is used to compute a
cross-validation procedure for the AMMI0, AMMI2 and AMMIF (9 axes)
models.

### AMMI0 to AMMIF

The function `cv_ammif()`

is used to compute a
cross-validation procedure for all members of the AMMI-family model. In
this case, AMMI0-AMMI9.

`AMMIF <- cv_ammif(data_ge, ENV, GEN, REP, GY)`

## Cross-validation for BLUP prediction

The function `cv_blup()`

provides a cross-validation of
replicate-based data using mixed models. By default, complete blocks are
randomly selected for each environment. Using the argument
`random`

it is possible to chose the random effects of the
model, as shown bellow.

### Genotype and genotype-vs-environment as random effects

`BLUP_g <- cv_blup(data_ge, ENV, GEN, REP, GY, random = "gen")`

### Environment, replication-within-environment and interaction as random effects

`BLUP_e <- cv_blup(data_ge, ENV, GEN, REP, GY, random = "env")`

### A random model (all terms as random effects)

`BLUP_ge <- cv_blup(data_ge, ENV, GEN, REP, GY, random = "all")`

### Printing the means of RMSPD estimates

```
bind_mod <- bind_cv(AMMIF, BLUP_g, bind = "means")
print_table(bind_mod$RMSPD)
```

The table above showns the descriptive statistics (mean, standard deviation, standar error of the mean, and quantiles 2.5% and 97.5%) of the 100 RMSPD estimates for each model, and are presented from the most accurate model (smallest RMSPD mean) to the least accurate model (highest RMSPD mean).

### Plotting the RMSPD values

The values of the RMSPD estimates obtained in the cross-validation
process may be plotted using the function`plot()`

.

```
bind1 <- bind_cv(AMMI0, AMMI2, AMMI9)
bind2 <- bind_cv(AMMIF, BLUP_g, BLUP_e, BLUP_ge)
a <- plot(bind1, violin = TRUE)
b <- plot(bind2,
width.boxplot = 0.6,
order_box = TRUE,
plot_theme = theme_metan_minimal())
arrange_ggplot(a, b, tag_levels = "a")
```

Six statistics are shown in this boxplot. The mean (black rhombus),
the median (black line), the lower and upper hinges that correspond sto
the first and third quartiles (the 25th and 75th percentiles,
respectively). The upper whisker extends from the hinge to the largest
value no further than \(1.5\times{IQR}\) from the hinge (where IQR
is the inter-quartile range). The lower whisker extends from the hinge
to the smallest value at most \(1.5\times{IQR}\) of the hinge. Data beyond
the end of the whiskers are considered outlying points. If the condition
`violin = TRUE`

, a violin plot is added along with the
boxplot. A violin plot is a compact display of a continuous distribution
displayed in the same way as a boxplot.

## Rendering engine

This vignette was built with pkgdown. All tables were produced
with the package `DT`

using the
following function.

```
library(DT) # Used to make the tables
# Function to make HTML tables
print_table <- function(table, rownames = FALSE, digits = 3, ...){
df <- datatable(table, rownames = rownames, extensions = 'Buttons',
options = list(scrollX = TRUE,
dom = '<<t>Bp>',
buttons = c('copy', 'excel', 'pdf', 'print')), ...)
num_cols <- c(as.numeric(which(sapply(table, class) == "numeric")))
if(length(num_cols) > 0){
formatSignif(df, columns = num_cols, digits = digits)
} else{
df
}
}
```

## References

*Theor. Appl. Genet.*76 (1): 1–10. https://doi.org/10.1007/BF00288824.

*Theor. Appl. Genet.*89 (5): 647–54. https://doi.org/10.1007/BF00222462.