Commonly used statistical methods for detecting differential gene expression in microarray experiments

Bioinformatics tools provide the needs for carrying out microarray analysis. Differential gene expression analysis reveals the substantial aspects of biological pathways and takes a leading part for further hypothesis development. The principle in analyzing the gene expression data is the need for determination the genes of which expression models differ by phenotype or an experimental condition. Microarrays are employed to detect the distinctive gene expression profiles in a wide variety of experimental conditions. In the field of bioinformatics, different computer science technologies and statistical methods are considered. This multidisciplinary approach allows the understanding of the relationship between statistical methods, bioinformatics applications and computer science technologies. Images resulting from microarray experiments are not directly interpreted to reveal differences in sample condition. To make microarray experiments interpretable, a number of algorithms and statistical approaches need to be applied. The raw dataset has to go through preprocessing step prior to a series of subsequent analyses as significance analysis clustering and visualization of the relevant biological components and samples. Many microarray studies are designed to detect genes associated with different phenotypes. This review attempts to give an overview of statistical methods for identifying differentially expressed genes in microarray experiments. In this review, our aim is to summarize briefly basic methods as a guide for differential gene expression studies. All the summarized methods are given from the basic to more complicated in the review. .


Introduction
Bioinformatics research studies are concerned with the analysis of large quantities of biological data with the help of computational techniques. In recent years, advances in molecular biology and information technology have allowed a major part of the genomes of various species to be investigated. Current bioinformatics studies are concerned with the structural and functional aspects of genes and proteins since the amount of data produced in the field of molecular biology is enormous. Most of these studies are related to the Human Genome Project. In summary, bioinformatics is an interdisciplinary area at the intersection of information technology, statistics and biology. 1,2 The most basic tasks in the field of bioinformatics are the creation and development of biological databases. The majority of these databases consist of nucleic acid sequences and protein sequences derived from them. One of the most important applications in the field of bioinformatics is the analysis of sequence information. Bioinformatics investigates the genetic structures of all living organisms through the development of new information technologies to clarify complex biological questions. 1 A primary goal of empirical genetic studies is the identification, quantification, and comparison of genetic differentiation among individuals, populations, species and studies. 3 Microarray technology has made it possible to measure the expression of thousands of genes simultaneously. 4 Statistical analysis of microarray data is started through software programs using CEL files defined as raw data. Prior to the start of the analysis, quality assessment of raw data is performed as the first step. In order to evaluate the homogeneity of the arrays and to compare the density distribution between the arrays, box graphs are plotted for each array using the densities of the logarithm2 base of the raw data. Images of the CEL files are obtained to observe the dimensional distributions of the densities on each array and to detect dimensional artifacts. MA-plots are used to compare the expression values for all possible pair of arrays with a probeset-wise median array. The MA plots are generated by plotting M values which are obtained by logarithmic ratios versus A values which are average logarithmic intensity values. The pre-normalization quality control step can be complemented by histograms drawn to assess the density distributions of each array. 5 After quality control of raw data, background correction and normalization should be applied to the data using background correction methods such as RMA (Robust Multiple-Array Average) method. With the RMA method, the probe-level signal is removed from the background signal. Quantile normalization is performed by the RMA method and it is ensured that all the arrays have the same quantile. Using the RMA method, the expression set to be used in the analysis is generated by normalized and the background corrected intensities. After the background correction and the normalization methods are performed, box charts related to each array are drawn to re-evaluate the quality control. Following normalization and background correction, a list of genes that differ between two different conditions can be obtained by applying various statistical tests to the expression dataset to be used for analysis. 5

Preprocessing of microarray data
Gene expression measurement is generally obtained as a measure of fluorescence intensity. 6 Background fluorescence can arise from many sources such as non-specific binding, residual precipitates after the washing step, optical noise from the scanner. 5,7 Measurement values may have undergone various adjustments in the device system, such as calibration. Thus, in the presentation of gene expression data, it must be explained how the values are generated by the device system. 5,6 These expression measures always contain a component called "background noise." Local background noise levels are measured from the areas of the glass slide that do not contain probes. The background correction tries to remove non-specific background noise and local variations of the overall signal level on each chip. 5 The most common method to remove the background effect is to remove the measured fluorescence intensity around the spots. 8 Microarray gene expression data sets consist of is given as follows [6]; The most common methods used for background correction in microarray analysis are; The "Robust Multi-Array Average (RMA) Background Correction" method and the "MAS 5.0 Background Extraction" methods. 9 RMA background correction: RMA background correction is a method that uses only Perfect Match (PM) intensities. PM values are corrected using a global model for the distribution of probe intensities. 7 The model is based on the experimental distribution of probe intensities. Observed PM probes are modeled as a Gaussian noise component with µ average and 2 σ variance. 7 To avoid negative expression values, the normal distribution is truncated at zero. If the observed density is assumed to be Y, the correction will be as follows; RMA background correction has been one of the most commonly used pre-processing method in the recent literature. [10][11][12] Performed assessments of the measure's precision, consistency of fold change, and specificity and sensitivity of the measure's ability to detect differential expression and demonstrated the substantial benefits of using the RMA measure to users of the Gene Chip technology. They used data from spike-in and dilution experiments to conduct various assessments on the MAS 5.0, dChip and RMA expression measures. Irizarry have demonstrated that RMA has similar accuracy but better precision than the other two summaries and RMA provides more consistent estimates of fold change. 12 The study of 13 implements seven data extraction methods including MAS 5.0 and RMA to calculate expression indices from Affymetrix microarray gene expression data and tested use of different background correction methods calculated the correlation coefficient for each pairwise comparison of background correction methods. 15 Quality assessment: It is necessary to evaluate the quality of the data before the normalization of the arrays. Quality control assessment should be carried out to determine whether the quality of experimental data is acceptable and whether any hybridization should be repeated. 5,7 Various descriptive data plots are drawn to identify potential problems with hybridization or other experimental structures in the quality control evaluation process. Quality control plots are basically divided into diagnostic and spot statistics. 6,7 Diagnostic Plots: The diagnostic plots include various plots such as MA-plots for evaluating intensity bias and histograms for examining signal-to-noise ratios for each channel. Diagnostic plots are usually used to observe non-linear trends between two channels. 7 a.
MA plots: M and A are commonly used variables in the analysis of two-color arrays. A is defined as follows; Cy5 and Cy3 denote green and red dye intensities for a given spot, respectively. A variable is a measure of the total intensity of the logarithmic transformation of a spot. Thus, if the combined red and green intensities are high for a particular spot, the A value will also be high. 7,9 M variable is defined as follows; ( ) ( ) The M variable is the logarithmic transformation of the intensity ratio. The M value shows which of the red and green dyes are more binding to a particular spot array. 7 MA plots are used to investigate density bias. A disproportionate amount of spot above or below the x-axis on the graph indicates a problem in the array. MA plots are an indication of whether normalization within the array is required. 6,7 Alvord et al.demonstrated the use of some of the exploratory plots including boxplots, volcano plots and MA plots for the expression level data on the soybean genome 14 . Lu et al.'s study, can be cited as an example of MA-plot application in method comparison studies, in which MA-plots were created on the raw data and normalized data to compare normalization methods. 15

b.
Histograms: In microarray designs, it is very important to obtain the histograms of the p-values of tests conducted to identify different gene expression. Histograms are graphs that are easy to interpret and contain considerable information. A histogram is an indication of whether there is a signal in the gene and whether the genes are differently expressed. Histograms also allow for estimation of how many genes are differentially expressed in reality. 16 Spot statistics plots: Spot Statistics help to predict the structures of spot and hybridizations. The main plots that can be obtained with spot statistics are spatial plots, box plots, scatter diagrams and volcanic plots. 16 a. Spatial plots: Spatial plots are used to reveal irregular spot and hybridization structures. Spatial plots are used to observe the spatial distributions of the intensities on each array and to detect the artifacts. Spatial plots play a fundamental role in determining the background correction, depending on whether there are any dimensional artifacts on the arrays. 7 b. Box plots: Box plots are one of the most commonly used plots for displaying spot and hybridization structures. At the same time, box plots can be drawn to understand the scale differences between different arrays. It is necessary to evaluate the box plots to see if between-array normalization is required. The homogeneity of the arrays can be observed quite clearly from the box plots. 7,16 c. Scatter diagrams: Scatter diagrams used to compare the expression values of two samples are the most commonly used plots for visualizing microarray data. 17 In the first step of the microarray data analysis, a scatter diagram is drawn between the two intensity channels to view the general structures and variability. Scatter diagrams are also commonly used to find out slides lying away from the center, which have abnormal expression structures. 18 d. Volcano plots: Volcano Plots are used to summarize fold change and t-test criteria. A volcano plot is constructed by plotting the negative log of the p-value on the y-axis and log of the fold change between the two conditions on the x-axis. For each gene, there is a point on the graph that represents the t-statistic and the fold change 16 (Figure 1).

Normalization:
The purpose of the normalization phase is to adjust the data according to the technical variation. Variations can cause measurement differences between general fluorescence intensity levels of various arrays. The normalization process is necessary to make the measured values obtained from different arrays comparable. 9 Normalization methods depend on which microarray technology is used. Generally, logarithmically transformed data are used for further analysis. 19 The most commonly used methods of normalization are as follows. 7 1. Scaling Normalization Method 2. Nonlinear Normalization Methods

Contrast Normalization
Scaling normalization method: The Scaling Normalization Method scales all the arrays with the same average intensity by choosing a reference array. The constructed procedure is to determine a reference array and then create a linear regression between each array and the chosen array without a constant term. Subsequently, the regression line is used as a normalization relation. 7

Nonlinear normalization methods: Non-Linear Normalization
Methods perform non-linear corrections between arrays. These methods generally perform better than linear corrections such as the scaling method. 19 There are many nonlinear normalization methods in the literature. Workman et al. proposed a nonlinear normalization method using array signal distribution analysis and cubic splines. 18 Chen et al. proposed a subset normalization to adjust for location biases combined with global normalization for intensity biases. 19 Edwards 19 proposed a nonlinear LOWESS normalization method used in one channel cDNA microarrays mainly for correcting spatial heterogeneity.
Quantile normalization: The purpose of quantile normalization is to impose the same experimental density distribution on each array. If two data vectors have the same distribution, a Q-Q graph has a straight diagonal line with 1 slope and 0 intercept. 20 Drawing quantiles of two data vectors against each other and designing each point on a 45-degree diagonal line leads to a transformation that allows both data vectors to have the same distribution. 20 Cyclic loess normalization: The cyclic loess method is a generalization of the global loess method in which the Cy5 and Cy3 channel intensities are normalized using MA graphics. When dealing with single channel array data, array pairs are normalized according to each other. The Cyclic Loess method normalizes the intensities for an array set by working in dual form. 21 Contrast normalization: In contrast normalization, the data is transformed into a contrast set and a nonlinear MA-plot normalization is performed. Afterward, inverse transformation is applied. 7 Normalization procedures are essential as the first step of expression analyses for adjusting artifacts on samples and making samples comparable. Choice of normalization procedure plays a fundamental role in the final results of gene expression analysis. There are several methods for normalization in the literature. Quantile normalization procedure is one of the most commonly used between these methods.
Qian et al. used quantile algorithm for normalization in their study which is aimed to identify differentially expressed genes and compare the expression profiles. 22 The raw expression data was prepro cessed Statistical methods used in differential gene expression analysis: The principle in analysing the gene expression data is the need for determination the genes of which expression models differ by phenotype or an experimental condition. 1 A simple approach to selecting genes is to use the "fold change" criteria. This is only possible if there are no or only a few repetitions. However, an analysis based only on the fold change does not allow for the assessment of the significance of expression differences in the presence of biological and experimental variations that may vary from gene to gene. This is the main reason for using statistical tests to evaluate differential expressions. Parametric tests generally have a higher power if the underlying model assumptions are met. 9 When doing the statistical analysis of microarray data, an important question is determining which scale to analyze the data. Generally, a logarithmic scale is used to approximate the distribution of the repeated measures for each gene to roughly symmetric and normal. 7 The variance-stabilizing transformation derived from an error model for microarray measurements can be used to make the variance of the measured intensities independent of their expected value. This may be advantageous for gene-based statistical tests based on variance homogeneity. 16 This will reduce variance differences between experimental conditions arising from differences in intensity level. However, it should not be forgotten that differences in the variance between conditions may also have gene-specific biological causes. 19,20 t-test comparisons for one or two groups, variance analysis for multiple groups, and trend tests are frequently used models to assess differential gene expression. Due to lack of knowledge about the coregulation of genes, linear models are usually calculated separately for each gene. When lists of relevant genes are identified, researchers can begin coordinated regulatory studies to further model their common actions. 7,20 A statistical testing approach for each gene is common. However, this approach has some difficulties. Most importantly, a large number of hypothesis tests are being performed. This potentially leads to a number of false positives. Multiple test procedures allow the assessment of the overall significance of the results of a group of hypothesis tests. These procedures focus on specificity by controlling type I (false positive) error rates such as experimental error rate or false discovery rate. These controls are statistical methods used in multiple hypothesis tests to correct for multiple comparisons. However, testing multiple hypotheses remains a challenge. Because an increase in specificity is related to a loss of sensitivity, as provided by the p-value correction methods. Therefore, the possibility of detecting the true positivity decreases. 7,9 Most microarray experiments involve only a few repetitions for each condition, making it difficult to predict gene-specific variances. Different methods have been developed to take advantage of variance information from all gene data. 18

Fold change criteria:
A simple microarray experiment is carried out to determine the expression differences between two conditions. Each condition can be represented by one or more RNA samples. The simplest method used to test expression differences of genes is the "Fold Change" criterion. 16 The method calculates the logarithm rate between expression levels of the two conditions. It then evaluates whether all genes are greater than a threshold value determined for differential expression. The Fold change method is not considered reliable because it does not take statistical variability into consideration. The fold-change method is subject to bias if the data have not been properly normalized. 18 For gene i, fold change is defined by as follows; Where i x and i y denote the expression levels in the control and treatment groups for the i th gene, respectively. 16 The standard t-test is frequently used to identify differentially expressed genes between two conditions in microarray studies.

Shi et al. used student's t-test to obtain differentially expressed genes between embryonic stem cells and urinary induced pluripotent stem cells. 25
Satterthwaite-welch t-test: Homogeneity of variances is rarely seen in microarray experiments. Heterogeneous sample or cell samples may cause heterogeneous variance in microarray experiments.
Changing the correlation structure of expression change by the transcription factor can also cause heterogeneous variance. Therefore, Welch (Satterthwaite's) t-test method would be more appropriate for independent samples with different variances. 21 Let 1 x and 2 x be two independent gene expression data under two conditions, The Welch t-statistic is calculated as follows for a given gene; The Welch method has a special correction for the degree of freedom under the variances of different samples; i ω is the estimated squared standard error for sample i; Let j x and j s be the mean difference and standard deviation of differences, respectively. Paired sample test statistic is given by; With n-1 degrees of freedom.

Moderated t-test:
Moderated t-test is one of the most common methods used in microarray studies. The modification is to add a small value to the standard deviation, which reduces the variability of the t-value, while the t-statistic is being calculated. The main purpose of the moderated t-test is to reduce the statistical significance of genes with a small standard deviation to avoid false positives. 19,28 Moderated t-statistic is described by as follows; Where 0 s is a selected constant to reduce the variability of ' i T .
The moderated t-statistic improves the ranking performance if the purpose is to create a short sorted list. 21,29 It has been proven in many studies that the moderated t-statistic performs better than the classical t-test and the fold change criteria. 30,31 Moderated t-test is most frequently used method to identify differentially expressed genes between two conditions in the literature.  The ANOVA model for microarray data can be determined in two steps. 34 The first stage is the normalization model; Where ijgr y is the logarithm of signal intensity for the i th array, j th dye, g th gene and r th measurement. μ is the overall average expression level.
A is the effect of the array at the measured intensity, D is the effect of the dye, and AD is the effect of interaction between dye and array. The first step is to form ijgr r term from the measured intensities. In the second step, gene-specific effects are modeled in terms of residuals of the normalization method. 21 The gene-specific model is expressed as follows; G is the average intensity for a particular gene, i AG is the effect of an array on this gene, j DG is the effect of dye on this gene, and ijr ∈ is the residual value. The variety-by-gene (VG) term is the main interest in the analysis, and reflects the variability in expression levels between samples for a particular gene. ij VG is expressed as a "catchall" term for the effects related to samples. In the simplest case, ij VG is an indicator of the condition represented by the sample for j dye and i array. In more complex experiments, the design structure at the biological sample level is reflected in the VG terms. 34 When there are duplicated spots in an array, the model should include additional terms for labeling and spot effects. The Gene-specific model can be modified by removing the terms DG and AG for singlecolor data. 34 Hypothesis testing involves comparing two models. The null hypothesis suggests that there are no differential expressions (the VG values are equal to zero) and the alternative hypothesis suggests that there are differential expressions between conditions (the VG values are not equal to zero). 29

Conclusion
In this review, we have attempted to give a broad overview of the statistical methods used for differential gene expression analysis. The microarray has made possible the simultaneous investigation of thousands of genes. Microarray technology helps researchers learn about various kinds of diseases. The identification of differentially expressed genes has a great importance to understand biological issues. Scientists can find out the expression levels of thousands of genes by using microarrays. Many of the genetic discoveries show that fundamental need is to implement an appropriate statistical method for finding the differentially expressed gene lists. For this reason, both information technologies and statistical methods have been adapted and developed according to this need. The use of bioinformatics tools for the studies related to gene expression datasets has seen a massive increase in recent years. It is without a doubt, progress in the field of bioinformatics in the future will be at the forefront of other branches of science to define genetic structures. Bioinformatics has empowered the scientific researchers to understand the significant points of genetic issues and various kinds of diseases.