Structuring genotype × environment interaction — an overview*

The phenotype of an individual is determined by both the genotype and environment. Farmers and scientists aim to determine a superior genotype over a wide range of environmental conditions but also over years. The basic cause of differences between genotypes in their yield stability is when these two effects are not only additive, i.e. when genotype × environment interaction (GEI) is present in the data. Multi-location trials play an important role in plant breeding and agronomic research. The data from such trials have three main points: (i) to accurately estimate and predict yield based on limited experimental data; (ii) to determine yield stability and the pattern of response of genotypes across environments; and (iii) to provide reliable guidance for selecting the best genotypes or agronomic treatments for planting in future years and at new sites (Crossa, 1990). The purpose of the present paper is (i) to describe various multivariate statistical methods for analyzing interactions in general and GEI in particular, and (ii) to present a selected bibliography of 142 references to previous work.


INTRODUCTION
Farmers and scientists aim at determining a genotype which is superior over a wide range of environmental conditions and also over a number of years (multi-environment trials). The basic cause of differences between genotypes in their yield stability is when these two effects (genotype and environment) are not simply additive, i.e. when GEI is present in the data.
Various statistical methods have been proposed for the analysis of interaction in general, and GEI in particular and these will be reviewed in this paper. First of all it is necessary to find if there is an interaction effect present in the data, then one has to consider this effect and importance as a subsequent work. Freeman (1972) described the aim of structuring GEI as: "This is all part of an attempt to discover more about the data than the crude tests reveal; thus, assuming that data = pattern + noise, one wants to find as much as possible of the pattern while eliminating the maximum noise".
To the best of our knowledge there is no recent review paper which provides an up-todate and extensive list of references about the most used techniques for analyzing and structuring GEI.
The decision as to which papers are most important and relevant is necessarily subjective and doubtful, a lot of very good papers are not included. The aim of this selection is to give some key references as a possible basis for a further reading. Some of the references are very recent and with high quality which means that this subject is still under development.

MOTIVATION AND PREVIEW OF THE REVIEW PAPERS
Literally hundreds (even thousands) of papers have been written on this subject by statisticians, breeders, agronomists and geneticists. Over time some authors presented helpful works with many references about GEI: the paper of Aastveit and Mejza (1992) presents a collection with more than 120 references. Also the review works by Denis and Vincourt (1982), Cox (1984), Westcoutt (1986) and Crossa (1990) and the books by Kang (1990), Gauch (1992) and Kang and Gauch (1996) contain many references. The MSc. Thesis by Alberts (2004) also provides some useful references. However it is not easy to keep up with the expanding literature on this subject and the main reasons for this were pointed out by Kang and Gauch Jr. (1996), (i) "many important developments on interaction are published in statistical journals not read routinely by many agricultural researchers, and relevant statistical applications appear in an even wider assortment of medical, engineering, and other journals"; and (ii) "merely the agricultural literature itself is enormous, appearing in various languages and published in numerous countries. Some important items are published as departmental monographs or as chapters in books that may not come to some researcher's attention." 4. MULTIVARIATE ANALYSIS Crossa (1990) pointed out that multivariate analysis has three main purposes: (i) to eliminate noise from the data pattern (i.e., to distinguish systematic from nonsystematic variation); (ii) to summarize the data; and (iii) to reveal a structure in the data. He also divided the multivariate techniques used to structuring GEI in two groups: (i) ordination techniques -PCA, biplots and graphical representation, principal coordinates analysis and factor analysis -which assume that the data is continuous and attempt to represent genotype and environment relationships in a low-dimensional space; and (ii) classification techniques -cluster analysis and discriminant analysis -which seek for discontinuities in the data, group similar entities in clusters and summarize redundancy in the data. Here we follow the same division as Crossa (1990) to present the most used multivariate techniques for structuring GEI.

Principal component analysis
When dealing with large data sets, one of the main problems is how to extract the information. Principal component analysis (PCA) is the most used technique for reducing the data set while preserving significant features (e.g. Jolliffe, 2002).
The most immediate objective of PCA is to verify if a small number of the first principal components exists which explains a high proportion of the variation in the original data set. If so, a few principal components can be used to represent the original data set without a big loss of information. This procedure corresponds to the dimension reduction of the original data set.
Another goal of PCA is the visualization of variables underlying the original structure (the principal components) which have physical meaning and, therefore, help to see the initial structure from another point of view.
This analysis can effectively reduce the structure of a two-way genotype × environment data matrix of G (genotypes) points in E (environments) dimensions in a subspace of fewer dimensions (Crossa, 1990).
The model can be written as where Yij is the observed mean yield of the i-th genotype at the j-th environment, µ is the general mean, kn is the singular value of the n-th axis/principal component ( 2 is the eigenvalue), vni is the eigenvector of the i-th genotype for the n-th axis, snj is the eigenvector of the j-th environment for the n-th axis, and ∑ ℎ . Johnson and Graybill (1972) and Corsten and van Eijnsbergen (1972) gave the full statistical analysis underlying the principal component approach and derived the appropriate distribution for a test of significance of the first component.
Another generalization which includes PCA (together with the analysis of variance -ANOVA) is the Additive Main effects and Multiplicative Interaction (AMMI) model (Fisher and Mackenzie, 1923;Mandel, 1971;Gauch, 1988;Zobel, Wright and Gauch 1988;Gauch and Zobel, 1997;Gauch, 2006). The AMMI model, also called doubly-centered PCA, applies the singular value decomposition (SVD) to the data minus the genotype and environment means (Gauch, 2006). In the AMMI model, ANOVA estimates the additive effects of genotypes and environments and SVD or PCA estimates the GEI.
Three mode PCA (Kroonenberg, 1983) has been used to analyse and structuring GEI when the data are considered as a three dimensional matrix, e.g. Genotype × Environment × Year: Basford et al., 1991;van Eeuwijk and Kroonenberg, 1998;de la Vega and Kroonenberg, 2002). Furthermore, one can find some more applications on Kroonenberg and Basford (1989); and de la Vega and Chapman (2001, 2006. Varela et al. (2006 presented as a generalization of three mode PCA for AMMI models. Although PCA is more efficient than the linear regression method in describing genotypic performance (Crossa, 1990), the interpretation of resulting principal components is difficult (Aastveit and Mejza, 1992). This weakness can be overcome by the use of graphical visualisations and biplots which are described in the next subsection. More weak points of PCA can be found in the literature: (i) the reduction of dimensionality of multivariate data may lead to distortions on interpretations (Crossa, 1990); (ii) if the proportion of variance explained by the first principal components is small, the individuals that are quite different may be represented by points that are close together (Gower, 1967); (iii) as a multiplicative model it has the problem of not describing the additive main effects (Zobel et al., 1998) and confuses the additive (main effects of genotypes and environments) structure of the data with nonadditivity effects (GEI) (Crossa, 1990); (iv) and nonlinear association in the data prevents PCA from efficiently describing the real relationships between entities (Williams, 1976). Perkins (1972) also pointed out that PCA was not useful for studying the adaptation of a group of inbred lines of Nicotiana rustica.

Biplot and graphical visualization
The display of genotypes and environments along the first two principal component axes for the interaction table of residuals is called a biplot (Gabriel, 1971;Bradu and Gabriel, 1978;Kempton, 1984;Gower and Hand, 1996). This technique can provide useful information on grouping similar genotypes and/or environments and can also provide some useful information about the GEI to identify genotypes which are well adapted to a particular environment (Zobel et al., 1988;Crossa, 1990;Crossa et al., 2002). Three dimensional biplots were presented by Gower and Digby (1981).
Since Bradu and Gabriel (1978) explored the use of the biplot as a diagnostic tool for choosing an appropriate model for the analysis of two-way data it has been used in multienvironment trial data analysis (Yan and Hunt, 2002). Consequently some generalizations and applications have been proposed: GGE biplots display both genotype main effects (G) and genotype × environment interaction (GE) which are two sources of yield variation that are relevant in cultivar evaluation (Cooper et al., 1997;Yan, 1999Yan, , 2001Yan and Kang, 2003); Yan and Tinker (2005) presented a third type of biplot (after biplot and GGE biplot) which displays the yield-trait relations in individual environments and addresses whether and how the GEI for yield can be explored by indirect selection for the other traits. A genotype x trait biplot (Yan and Rajcan, 2002;Yan and Kang, 2003) graphically approximates a genotype × trait two-way table and can be used to visualize the genetic correlations among traits.
Applications of the biplot methodology to diallel data (Yan and Hunt, 2002), to hostby-pathogen data (Yan and Falk, 2002), to barley yield (Dehghani et al., 2006) and to linear-bilinear models , among others, can be found in the literature.

Principal coordinates analysis
Principal coordinates analysis (PCO) was pioneered by Gower (1966) as an alternative to PCA, which is a metric multidimensional scaling method based on projection and uses spectral decomposition to approximate a matrix of distances/dissimilarities by the distances between a set of points in several dimensions. It can be seen as the dual of PCA, i.e. instead of the use of the matrix WW' whose size is the number of variables or environments (PCA), it consider the matrix W'W whose size is the number of individuals or genotypes, where W represents all the variation within genotypes, including that for the error (Freeman, 1972).
In contrast to PCA, PCO employs a broader range of distances (PCO is equivalent to PCA on a covariance matrix of a transposed data matrix if the distance matrix is Euclidean; and PCO is equivalent to PCA on a correlation matrix of a transposed data matrix if the distance matrix is Penrose) or dissimilarity coefficients between objects (Crossa, 1990). The objectives and limitations of PCO are similar to those of PCA with some advantages as pointed out by Crossa (1988Crossa ( , 1990: (i) it is trustworthy when used for data that include extremely low or high yielding sites; (ii) it does not depend on the set of genotypes included in the analysis; and (iii) it is simple to identify stable varieties from the sequence of graphic displays.

Factor Analysis
Factor Analysis (FA) is similar to PCA where the principal components are replaced by "factors". As in PCA, in FA a large number of correlated variables is reduced to a small number of main (uncorrelated) factors (Spearman, 1904;Cattell, 1965), and assumed that each variable can be expressed as a linear compound of k (lower than the number of individuals) hypothetical variables, the common factors, plus an additional term depending only on the particular variable and known as the specific factor (Gower, 1966). Another difference between PCA and FA is that in FA the rotation of the obtained factors is possible.
A rotation which requires the factors to remain uncorrelated is an orthogonal rotation, while others are oblique rotations. Oblique rotations often achieve greater simple structure, though at the cost that one must also consider the matrix of factor intercorrelations when interpreting results.
FA has been used to analyse GEI. Grafius and Kiesling (1960) constructed orthogonal vectors representing environmental effects using FA and predicted genotype responses in terms of these vectors. Walton (1972), and Seiler and Stafford (1985) used FA to understand relationships between yield components and morphological characteristics. Gollob (1968) refers to the fixed two-way models 1 , 1,..., ; 1,..., , where μ is the general mean, τi and βj represent the main effects, and the ~(0, 2 ), as factor analysis of variance or simply FANOVA models because they incorporate the benefits of data reduction from a factor decomposition of residuals and the ease of interpretation permitted by the analysis of variance. Latter, Yochmowitz and Cornell (1978) proposed a stepwise test base upon likelihood ratio statistics to determine the number of multiplicative components in the MANOVA (Multivariate ANOVA) model.
Applications of FA to GEI can be found, among others, in Peterson and Pfeiffer (1989), Oliveira et al. (2005) and Garbuglio et al. (2007).

Cluster analysis
Cluster analysis (CA) (Cormack, 1971;Everitt et al., 2001) seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation.
Two types of clustering can be distinguished: (i) hierarchical and (ii) partitional/nonhierarchical. Hierarchical clustering algorithms find successive clusters using previously established clusters and do not require preset knowledge of the number of groups (the traditional representation is called a hierarchical tree diagram or dendrogram). Hierarchical algorithms can be agglomerative -beginning with each element as a separate cluster and merging them into successively larger clusters; or divisive -beginning with the whole set and proceeding to divide it into successively smaller clusters.
Partitional/nonhierarchical algorithms determine all clusters at once and one has to decide in the beginning of the analysis the number of clusters (the most well known partitional algorithm is K-means which uses the Euclidian distance). CA also requires a measure of distance between the individuals which determines how the similarity/dissimilarity of two objects is calculated. The most known and used distance functions are: Euclidean; Mahalanobis, Manhattan, maximum norm and Hamming.
CA is one of the most important methods to analyzing and structuring GEI (Aastveit and Mejza, 1992) and the most widely used technique for classifying environments or genotypes into homogeneous groups . CA has been used to study genotype adaptation by simplifying the pattern of responses and to subdivide genotypes and environments into more homogeneous groups (Crossa, 1990). Lin et al. (1986) and Westcott (1987) presented review papers about the application of classification methods to GEI and discussed their problems. Crossa (1990) pointed out some of the disadvantages of CA: (i) numerous hierarchical grouping exist, and each of them may produce different cluster groups; (ii) the truncation level of the classificatory hierarchies may be decided arbitrarily; (iii) many different similarity measures can be used yielding different results; and (iv) CA may produce misleading results by showing structures and patterns in the data when they do not exist (Gordon, 1999).
With the two-stage Ward-Modified Location Model (Ward-MLM), proposed by Franco et al. (1998) where the initial groups are generated by the Ward (1963) minimum variance within-groups hierarchical method, it is possible to avoid the "independence across sites" assumption and therefore to estimate variance of the traits within sites and their covariances across sites (Franco et al., 2003). Franco et al. (1999) extended the Ward-MLM strategy to the case of clustering threeway data of cultivar x environment x trait. This strategy considers the same trait measured in different environments as different variables, and it considers the clustering of n individuals on the basis of p continuous and q categorical traits measured in r different environments (Franco et al., 2003).
Multiplicative models for multi-environment cultivar trials that consider one trait at a time (unvaried case) have been used for developing methods for clustering environments or genotypes into groups with statistically negligible crossover interaction (Cornelius et al., 1992Crossa et al., 1993;Kang and Gauch, 1996;Crossa and Cornelius, 1997;Abdalla et al., 1997). The multivariate case is when one considers all the traits (continuous and categorical) simultaneously (three-way Ward-MLM methodology) (Franco et al., 2003). Basford and McLachlan (1985) also proposed a classification method for three-way cluster analysis when all measured traits are continuous, and an association between categorical and continuous variables can be found in  and  with application in .
The first attempt to apply CA to GEI was by Abou-El-Fittouh et al. (1969) using cotton. Lin (1982) proposed a cluster method to group genotypes according to their response to environments when the GEI is large. Ramey and Rosielle (1983) proposed a new method of cluster analysis, termed the hierarchical agglomerative sums of squares method (HASS) to facilitate clustering of genotypes and environments where GEI exists. Corsten and Denis (1990) proposed an agglomerative hierarchical clustering procedure for an orthogonal twoway table. Franco et al. (1998) proposed the two-stage clustering Ward-Modified Location Model, and Lefkovitch (1985) discussed the problems of multi-criteria clustering in GEI. Huhn and Truberg (2002) compare different cluster techniques, present theoretical results and give wide list of references to preview work in CA, in particular applied to GEI. The experimental results are given by Truberg and Huhn (2002).
Many references to the application of CA to GEI can be found: Robert (1997) used a clustering procedure to identify groups of environments which were homogeneous for interaction in bread wheat, in two multi-location series of trials. Berger et al. (2002) used the K-means clustering algorithm to group genotypes of undomesticated Mediterranean Vicia. CA was undertaken by Adugna and Labuschagne (2003) for 10 genotypes of linseed that were tested in a four-times replicated randomized block design across 18 environments (six localities by 3 years) of Ethiopia. Gruneberg et al. (2005) derived three environmental groups, for sweet potato clones from CA. A CA was used by Isik and Kleinschmit (2005) in order to illustrate the general pattern of similarities among the test sites of Picea abies in northern Germany. Padi (2007) used hierarchical CA to identify discrete groups of cowpea genotypes.
Many times the multivariate methods constitute a basis for a CA (e.g. Seif et al., 1979;Calinski et al, 1989 a, b;Adugna and Labuschagne, 2003 -for canonical analysis).

Linear discriminant analysis
Linear discriminant function analysis or discriminant analysis (LDA) (Fisher, 1936) aims to find the linear combinations of features which best separate two or more predefined classes of objects or events. The multivariate case, known as Multiple Discriminant Analysis, Discriminant Factor Analysis or Canonical Discriminant Analysis (CDA), derives the canonical coefficients and finds linear combinations of the quantitative variables that provide maximal separation between the classes or groups in much the same way that principal components summarize total variation (from the n original variables it constructs n linear functions -canonical discriminant functions -and the discriminant functions may be used to calculate a set of n discriminant scores for each discriminant function).
Applications of DA and CDA may be found in the literature about GEI. Gallacher (1997) used both a monothetic and a polythetic discriminant analysis on selected characteristics of Australian sugarcane germplasm. Martinez et al. (1997) used DA on Grenache N grapevine, and Salehuzzaman and Joarder (1979) on soybean. Crown shape of four different clones planted out in six experimental fields located in five European countries are described and compared using discriminant analysis (Santini and Camussi, 2000). Johnston et al. (2001) used CDA to infer the environmental factors most strongly associated with the different genotypic groups in a Louisiana iris hybrid population.
Canonical analysis is concerned with determining the relationships between groups of variables in a data set (Hotelling, 1936;Cliff and Krus, 1976) and has also been used in GEI. Applications of this methodology to GEI can be found in Calinski et al. (1987aCalinski et al. ( , 1987b; Lejeune and Calinski (2000) and Adugna and Labuschagne (2003).
The AMMI method first applies the addictive analysis of variance model to a two-way data (e.g. genotypes and environments), and then applies the multiplicative principal component analysis model to the residual from the additive model, i.e. to the interaction (e.g. GEI) (Gauch, 1988(Gauch, , 1992. The AMMI model is given by, see Gauch (1988) where: Yge is the yield of genotype g in environment e; µ is the grand mean; αg are the genotype mean deviations (the genotype means minus the grand mean); βe are the environment mean deviations; λn is the eigenvalue of PCA axis n; γgn and δen are the genotype and environment PCA scores for PCA axis n; Ν is the number of PCA axes retained in the model; θge is the residual. If the experiment is replicated, an error term εger, which is the difference between the Yge mean and the single observation for replicate r, may be added. Crossa (1990) pointed out three main purposes of AMMI models: (i) model diagnosis -AMMI is more appropriate in the initial statistical analysis of yield trials because it provides an analytical tool for diagnosing other models as subclasses when these are better for a particular data set (Bradu and Gabriel, 1978); (ii) to clarify GEI -AMMI models summarize patterns and relations of genotypes and environments (Kempton, 1984;Zobel et al., 1988;Crossa et al. 1990);and (iii) to improve the accuracy of yield estimatesgains have been obtained in the accuracy of yield estimates that are equivalent to increasing the number of replicates by a factor of two to five (Zobel et al., 1988;Crossa et al., 1990) which can be used to reduce the costs by reducing the number of replications, to include more treatments in the experiment, or to improve efficiency in selecting the best genotypes.
There are several possible AMMI models characterized by a number of significant principal component axes ranging from zero (AMMI-0, i.e. additive model) to min(g-1,e-1), where g is the number of genotypes and e is the number of environments. The full model (AMMI-F), with the highest number of principal component axes, provides a perfect fit between expected and observed data. Models including one (AMMI-1) or two (AMMI-2) PC axes are usually the most appropriate where there is significant GEI. Due to their simplicity, they provide a notable reduction of dimensionality for the adaptation patterns relative to observed data (Annicchiarico, 2002).

Multivariate analysis of variance
Multivariate analysis of variance (MANOVA) is an extension of ANOVA for several dependent variables. The main aims of MANOVA are identifying whether changes in the independent variables have a significant effect on the dependent variables, and the interactions among the independent variables and the association between dependent variables, if any.
MANOVA has been used to analyse GEI. Yang and Baker (1991) proposed two heuristic tests for the significance of the two causes of GEI. A comparison between MANOVA and REML -restricted maximum likelihood analysis -was presented by Yang (2002).

Additional information -environmental variables
Many times the environment part of GEI can be characterized with some additional variables (Wood, 1976) such as soil characteristics and climate (e.g. temperature, rainfall, radiation level, altitude). Usually the available data is composed of two-way data for yields to structure GEI and of data characterizing the environment (environmental variables), with the authors aiming to relate these two data sets to find relationships.
The most usual way to introduce environmental variables in the analysis of GEI is using linear regression and related techniques. The multilinear regression method can be used for predictive purpose when environmental data are used as independent variables (Knight, 1970).
Factorial regression analysis (FA) (van Eeuwijk et al., 1996) was used by Signor et al. (2001) to model the interaction effect using additional genotypic and environmental information for grain yield of early maize, and by Campbell et al. (2004) to explain GEI and quantitative trait × environment interaction (QEI) on wheat. Crossa et al. (1999) found that temperature differences across environments accounted for a large portion of the GEI detected in tropical maize (Zea mays L.). Ceretta and van Eeuwlik (2008) used FA to model GEI directly in relation to measured environmental variables in malting barley cultivars. Aastveit and Martens (1986) used the Partial Least Squares (PLS) regression (Wold, 1966;Wold, 1975) to relate year differences in straw length of 15 barley genotypes to climate variation over 9 years. A data set of rainfall, temperature, and global radiation at different stages throughout the growth season over years was related with the residual table of GEI after the factors' effects were removed by an ANOVA. Other applications of PLS regression can be found, for example, in Vargas et al. (1999Vargas et al. ( , 2001; Reynolds et al. (2004);and Ortiz et al. (2007). Vargas et al. (1999) compared results obtained from PLS and FR in two large multienvironment trials and showed that both methods identified similar environmental and genotypic covariates that explain a large proportion of the GEI. Ortiz et al. (2007) also presented a comparison between PLS and FR with application to tomato genotypes. PCA, combined with multiple regression, may be useful for reducing the number of environmental variables to be included in the final analysis (Perkins, 1972;Crossa, 1990). Reynolds et al. (2004) added environmental variables in an AMMI biplot.  presented another approach for incorporating genetic and environmental covariates in multi-environment trials data analysis, in which the observed GGE (combination between genotypes and genotype x environment interaction) patterns were explained as interactions between genetic covariates and environmental covariates. This was achieved by relating the genetic-environmental covariates to the genotypic-environmental scores of the first two principal components derived from GGE biplot analysis (Yan and Tinker, 2005).

CONCLUDING REMARKS
A wide range of multivariate methods can be used to structure and analyze GEI on multilocation trial data. Although some of these have limitations on interpretation of the achieved results, they often yield suggestions into particular complex situations and they usually overcome the limitations of linear regression. Nowadays researchers combine different techniques to get better results (e.g. AMMI models joint together ANOVA and PCA).
In this paper we reviewed the most used multivariate techniques and we gave a selection with some references to be considered as a possible basis for further reader. If the reader finds out some important reference missing please let us know.