Introduction
Bioinformatics research studies are concerned with the analysis of large quantities of biological data with the help of computational techniques. In recent years, advances in molecular biology and information technology have allowed a major part of the genomes of various species to be investigated. Current bioinformatics studies are concerned with the structural and functional aspects of genes and proteins since the amount of data produced in the field of molecular biology is enormous. Most of these studies are related to the Human Genome Project. In summary, bioinformatics is an interdisciplinary area at the intersection of information technology, statistics and biology.1,2
The most basic tasks in the field of bioinformatics are the creation and development of biological databases. The majority of these databases consist of nucleic acid sequences and protein sequences derived from them. One of the most important applications in the field of bioinformatics is the analysis of sequence information. Bioinformatics investigates the genetic structures of all living organisms through the development of new information technologies to clarify complex biological questions.1
A primary goal of empirical genetic studies is the identification, quantification, and comparison of genetic differentiation among individuals, populations, species and studies.3 Microarray technology has made it possible to measure the expression of thousands of genes simultaneously.4 Statistical analysis of microarray data is started through software programs using CEL files defined as raw data. Prior to the start of the analysis, quality assessment of raw data is performed as the first step. In order to evaluate the homogeneity of the arrays and to compare the density distribution between the arrays, box graphs are plotted for each array using the densities of the logarithm2 base of the raw data. Images of the CEL files are obtained to observe the dimensional distributions of the densities on each array and to detect dimensional artifacts. MA-plots are used to compare the expression values for all possible pair of arrays with a probeset-wise median array. The MA plots are generated by plotting M values which are obtained by logarithmic ratios versus A values which are average logarithmic intensity values. The pre-normalization quality control step can be complemented by histograms drawn to assess the density distributions of each array.5
After quality control of raw data, background correction and normalization should be applied to the data using background correction methods such as RMA (Robust Multiple-Array Average) method. With the RMA method, the probe-level signal is removed from the background signal. Quantile normalization is performed by the RMA method and it is ensured that all the arrays have the same quantile. Using the RMA method, the expression set to be used in the analysis is generated by normalized and the background corrected intensities. After the background correction and the normalization methods are performed, box charts related to each array are drawn to re-evaluate the quality control. Following normalization and background correction, a list of genes that differ between two different conditions can be obtained by applying various statistical tests to the expression dataset to be used for analysis.5
Preprocessing of microarray data
Gene expression measurement is generally obtained as a measure of fluorescence intensity.6 Background fluorescence can arise from many sources such as non-specific binding, residual precipitates after the washing step, optical noise from the scanner.5,7
Measurement values may have undergone various adjustments in the device system, such as calibration. Thus, in the presentation of gene expression data, it must be explained how the values are generated by the device system.5,6 These expression measures always contain a component called "background noise." Local background noise levels are measured from the areas of the glass slide that do not contain probes. The background correction tries to remove non-specific background noise and local variations of the overall signal level on each chip.5 The most common method to remove the background effect is to remove the measured fluorescence intensity around the spots.8
Microarray gene expression data sets consist of
gene expression values, with
genes and
samples.
values are arranged in a
data matrix, where each gene corresponds to one row and each sample to one column. The readout gene expression value
can be statistically defined as the sum of the true gene expression value
and the background noise
components [6];
(1)
The structure and correction of the background noise depend on the microarray technology used. Spot array data provides an estimate of background noise
, with uncorrected expression intensities
values. If the background estimate is expressed as
, background corrected expression value,
is given as follows[6];
(2)
The most common methods used for background correction in microarray analysis are; The "Robust Multi-Array Average (RMA) Background Correction" method and the "MAS 5.0 Background Extraction" methods.9
RMA background correction:RMA background correction is a method that uses only Perfect Match (PM) intensities. PM values are corrected using a global model for the distribution of probe intensities.7
The model is based on the experimental distribution of probe intensities. Observed PM probes are modeled as a Gaussian noise component with
average and
variance.7
To avoid negative expression values, the normal distribution is truncated at zero. If the observed density is assumed to be Y, the correction will be as follows;
(3)
and
where S is an averaged exponential signal component with
mean.
and
are the standard normal density and distribution functions, respectively.7
MAS 5.0 background correction:In the MAS 5.0 background correction method, the chip is divided into a rectangular grid with k rectangular regions. At each region, at least 2% of the probe intensities are used to calculate a background value for this grid. Then, each probe intensity is corrected based on a weighted average of the background values. The weights depend on the distance between the probes and the center of gravity of the grid.7
Weights are calculated as follows;
(4)
Where
is a Euclidean distance from (x, y) position to the center of gravity of region k and
is correction coefficient.
In MAS 5.0 Background Correction method, both Perfect Match (PM) and Mismatch (MM) probes are corrected.7
RMA background correction has been one of the most commonly used pre-processing method in the recent literature.10-12 Performed assessments of the measure’s precision, consistency of fold change, and specificity and sensitivity of the measure’s ability to detect differential expression and demonstrated the substantial benefits of using the RMA measure to users of the Gene Chip technology. They used data from spike-in and dilution experiments to conduct various assessments on the MAS 5.0, dChip and RMA expression measures. Irizarry have demonstrated that RMA has similar accuracy but better precision than the other two summaries and RMA provides more consistent estimates of fold change.12
The study of13 implements seven data extraction methods including MAS5.0 and RMA to calculate expression indices from Affymetrix microarray gene expression data and tested use of different background correction methods calculated the correlation coefficient for each pair-wise comparison of background correction methods.15
Quality assessment: It is necessary to evaluate the quality of the data before the normalization of the arrays. Quality control assessment should be carried out to determine whether the quality of experimental data is acceptable and whether any hybridization should be repeated.7,7
Various descriptive data plots are drawn to identify potential problems with hybridization or other experimental structures in the quality control evaluation process. Quality control plots are basically divided into diagnostic and spot statistics.6,7
- Diagnostic Plots: The diagnostic plots include various plots such as MA-plots for evaluating intensity bias and histograms for examining signal-to-noise ratios for each channel. Diagnostic plots are usually used to observe non-linear trends between two channels.7
- MA plots: M and A are commonly used variables in the analysis of two-color arrays. A is defined as follows;
(5)
Cy5 and Cy3 denote green and red dye intensities for a given spot, respectively. A variable is a measure of the total intensity of the logarithmic transformation of a spot. Thus, if the combined red and green intensities are high for a particular spot, the A value will also be high.7,9
M variable is defined as follows;
(6)
The M variable is the logarithmic transformation of the intensity ratio. The M value shows which of the red and green dyes are more binding to a particular spot array.7
MA plots are used to investigate density bias. A disproportionate amount of spot above or below the x-axis on the graph indicates a problem in the array. MA plots are an indication of whether normalization within the array is required.6,7
Alvord et al.demonstrated the use of some of the exploratory plots including boxplots, volcano plots and MA plots for the expression level data on the soybean genome [14]. Lu et al.’s study, can be cited as an example of MA-plot application in method comparison studies, in which MA-plots were created on the raw data and normalized data to compare normalization methods.15
- Histograms: In microarray designs, it is very important to obtain the histograms of the p-values of tests conducted to identify different gene expression. Histograms are graphs that are easy to interpret and contain considerable information. A histogram is an indication of whether there is a signal in the gene and whether the genes are differently expressed. Histograms also allow for estimation of how many genes are differentially expressed in reality.16
- Spot statistics plots: Spot Statistics help to predict the structures of spot and hybridizations. The main plots that can be obtained with spot statistics are spatial plots, box plots, scatter diagrams and volcanic plots.16
- Spatial plots: Spatial plots are used to reveal irregular spot and hybridization structures. Spatial plots are used to observe the spatial distributions of the intensities on each array and to detect the artifacts. Spatial plots play a fundamental role in determining the background correction, depending on whether there are any dimensional artifacts on the arrays.7
- Box plots: Box plots are one of the most commonly used plots for displaying spot and hybridization structures. At the same time, box plots can be drawn to understand the scale differences between different arrays. It is necessary to evaluate the box plots to see if between-array normalization is required. The homogeneity of the arrays can be observed quite clearly from the box plots.7,16
- Scatter diagrams: Scatter diagrams used to compare the expression values of two samples are the most commonly used plots for visualizing microarray data.17 In the first step of the microarray data analysis, a scatter diagram is drawn between the two intensity channels to view the general structures and variability. Scatter diagrams are also commonly used to find out slides lying away from the center, which have abnormal expression structures.18
- Volcano plots: Volcano Plots are used to summarize fold change and t-test criteria. A volcano plot is constructed by plotting the negative log of the p-value on the y-axis and log of the fold change between the two conditions on the x-axis. For each gene, there is a point on the graph that represents the t-statistic and the fold change16 (Figure 1).
|
Figure 1: Volcano Plot.
|
Normalization:The purpose of the normalization phase is to adjust the data according to the technical variation. Variations can cause measurement differences between general fluorescence intensity levels of various arrays. The normalization process is necessary to make the measured values obtained from different arrays comparable.9 Normalization methods depend on which microarray technology is used. Generally, logarithmically transformed data are used for further analysis.19
The most commonly used methods of normalization are as follows.7
- Scaling Normalization Method
- Nonlinear Normalization Methods
- Quantile Normalization
- Cyclic Loess Normalization
- Contrast Normalization
Scaling normalization method:The Scaling Normalization Method scales all the arrays with the same average intensity by choosing a reference array. The constructed procedure is to determine a reference array and then create a linear regression between each array and the chosen array without a constant term. Subsequently, the regression line is used as a normalization relation.7
Nonlinear normalization methods:Non-Linear Normalization Methods perform non-linear corrections between arrays. These methods generally perform better than linear corrections such as the scaling method.19
There are many nonlinear normalization methods in the literature. Workman et al. proposed a nonlinear normalization method using array signal distribution analysis and cubic splines.18 Chen et al. proposed a subset normalization to adjust for location biases combined with global normalization for intensity biases.19 Edwards19 proposed a nonlinear LOWESS normalization method used in one channel cDNA microarrays mainly for correcting spatial heterogeneity.
Quantile normalization:The purpose of quantile normalization is to impose the same experimental density distribution on each array. If two data vectors have the same distribution, a Q-Q graph has a straight diagonal line with 1 slope and 0 intercept.20
Drawing quantiles of two data vectors against each other and designing each point on a 45-degree diagonal line leads to a transformation that allows both data vectors to have the same distribution.20
Cyclic loess normalization:The cyclic loess method is a generalization of the global loess method in which the Cy5 and Cy3 channel intensities are normalized using MA graphics. When dealing with single channel array data, array pairs are normalized according to each other. The Cyclic Loess method normalizes the intensities for an array set by working in dual form.21
Contrast normalization:In contrast normalization, the data is transformed into a contrast set and a nonlinear MA-plot normalization is performed. Afterward, inverse transformation is applied.7
Normalization procedures are essential as the first step of expression analyses for adjusting artifacts on samples and making samples comparable. Choice of normalization procedure plays a fundamental role in the final results of gene expression analysis. There are several methods for normalization in the literature. Quantile normalization procedure is one of the most commonly used between these methods.
Qian et al. used quantile algorithm for normalization in their study which is aimed to identify differentially expressed genes and compare the expression profiles.22 The raw expression data was preprocessed using the RMA which includes default configuration for background correction, normalization and calculation of expression values in the most of microarray studies as Zhang et al.’s study and Kupfer et al.’s study.11,14,23 Wu et al. demonstrated that cyclic loess normalization procedures performed better than quantile normalization procedures at reducing the number of false-positive up-regulated miRNAs.24
Statistical methods used in differential gene expression analysis: The principle in analysing the gene expression data is the need for determination the genes of which expression models differ by phenotype or an experimental condition.1 A simple approach to selecting genes is to use the "fold change" criteria. This is only possible if there are no or only a few repetitions. However, an analysis based only on the fold change does not allow for the assessment of the significance of expression differences in the presence of biological and experimental variations that may vary from gene to gene. This is the main reason for using statistical tests to evaluate differential expressions. Parametric tests generally have a higher power if the underlying model assumptions are met.9
When doing the statistical analysis of microarray data, an important question is determining which scale to analyze the data. Generally, a logarithmic scale is used to approximate the distribution of the repeated measures for each gene to roughly symmetric and normal.7 The variance-stabilizing transformation derived from an error model for microarray measurements can be used to make the variance of the measured intensities independent of their expected value. This may be advantageous for gene-based statistical tests based on variance homogeneity.16 This will reduce variance differences between experimental conditions arising from differences in intensity level. However, it should not be forgotten that differences in the variance between conditions may also have gene-specific biological causes.19,20
t-test comparisons for one or two groups, variance analysis for multiple groups, and trend tests are frequently used models to assess differential gene expression. Due to lack of knowledge about the co-regulation of genes, linear models are usually calculated separately for each gene. When lists of relevant genes are identified, researchers can begin coordinated regulatory studies to further model their common actions.7,20.
A statistical testing approach for each gene is common. However, this approach has some difficulties. Most importantly, a large number of hypothesis tests are being performed. This potentially leads to a number of false positives. Multiple test procedures allow the assessment of the overall significance of the results of a group of hypothesis tests. These procedures focus on specificity by controlling type I (false positive) error rates such as experimental error rate or false discovery rate. These controls are statistical methods used in multiple hypothesis tests to correct for multiple comparisons. However, testing multiple hypotheses remains a challenge. Because an increase in specificity is related to a loss of sensitivity, as provided by the p-value correction methods. Therefore, the possibility of detecting the true positivity decreases.7,9
Most microarray experiments involve only a few repetitions for each condition, making it difficult to predict gene-specific variances. Different methods have been developed to take advantage of variance information from all gene data.18
Fold Change Criteria: A simple microarray experiment is carried out to determine the expression differences between two conditions. Each condition can be represented by one or more RNA samples. The simplest method used to test expression differences of genes is the "Fold Change" criterion.16
The method calculates the logarithm rate between expression levels of the two conditions. It then evaluates whether all genes are greater than a threshold value determined for differential expression. The Fold change method is not considered reliable because it does not take statistical variability into consideration. The fold-change method is subject to bias if the data have not been properly normalized.18
For gene i, fold change is defined by as follows;
(7)
Where
and
denote the expression levels in the control and treatment groups for the ith gene, respectively.16
Student t-test:The Student's t-test is one of the most basic statistical methods used to determine differentially expressed genes if there is constant variance. Student t-distribution should be used if the genes are independent of each other and the observations are normally distributed.20
Let
and
be two independent gene expression data under two conditions; For example, normal and disease groups, two samples t-statistic for a given gene is;
(8)
Where
and
denote the average expression level of a given gene in the control and disease group, respectively. S is the pooled standard deviation. Pooled variance is
is estimated by:
(9)
Where
and
are the variances for control and disease groups, respectively.
The calculated test statistic is compared to the critical value of
degrees of freedom where
and
are the sample sizes for the control and disease group, respectively. Since the t-test makes use of the variances of the samples, it also draws attention to the shortcomings of the fold change approach.20
The standard t-test is frequently used to identify differentially expressed genes between two conditions in microarray studies. Shi et al. used student's t-test to obtain differentially expressed genes between embryonic stem cells and urinary induced pluripotent stem cells.25
Satterthwaite-welch t-test: Homogeneity of variances is rarely seen in microarray experiments. Heterogeneous sample or cell samples may cause heterogeneous variance in microarray experiments. Changing the correlation structure of expression change by the transcription factor can also cause heterogeneous variance. Therefore, Welch (Satterthwaite's) t-test method would be more appropriate for independent samples with different variances.21
Let
and
be two independent gene expression data under two conditions, The Welch t-statistic is calculated as follows for a given gene;
(10)
Where
is the mean value of given gene in the control group with
sample size,
is the mean value in the disease group with
sample size, and
and
are the variances for the control and disease groups, respectively.21
The Welch method has a special correction for the degree of freedom under the variances of different samples;
(11)
is the estimated squared standard error for sample i;
(12)
Barajas et al. used Welch test to compare imaging and histopathologic variables in their study.26 Mosig et al. compared monocyte gene expression profiles from FH patients with healthy controls using a Welch T-test.27
Paired sample t-test: The dependent design is often used in two-channel experiments where gene expression comparisons involving a natural match of experimental units are made. Student t-test and Welch t-test are used for independent samples. The dependent sample t-test should be used when the samples are dependent.21
Let
and
be the mean difference and standard deviation of differences, respectively. Paired sample test statistic is given by;
(13)
With n-1 degrees of freedom.
Moderated t-test:Moderated t-test is one of the most common methods used in microarray studies. The modification is to add a small value to the standard deviation, which reduces the variability of the t-value, while the t-statistic is being calculated. The main purpose of the moderated t-test is to reduce the statistical significance of genes with a small standard deviation to avoid false positives.19,28
Moderated t-statistic is described by as follows;
(14)
Where
is a selected constant to reduce the variability of
.
The moderated t-statistic improves the ranking performance if the purpose is to create a short sorted list.21,29 It has been proven in many studies that the moderated t-statistic performs better than the classical t-test and the fold change criteria.30,31
Moderated t-test is most frequently used method to identify differentially expressed genes between two conditions in the literature. Shi et al. obtained differentially expressed genes in the five datasets of coronary heart disease using moderated-t test.32 Dolah et al. used moderated t-test for genes in dolphin skin differentially expressed according to sex.33
Wilcoxon rank sum test:The Wilcoxon rank sum test is a nonparametric method used to compare gene expressions for two groups when
expression values are not normally distributed. In order to compare the samples, the ranks of the observations are used instead of the original observations.20
Wilcoxon rank sum statistic is given by;
(15)
Where
is the ranking between all the
and
values of
. Sorting starts from 1 for the smallest value and the highest value is
. If all the
values are smaller than
values, then
will be
. If all the
values are higher than
values, then
will be
. All other possibilities are between these two values.20
ANOVA F-test:ANOVA is used to investigate the significance of the effects of factors that may affect gene expression in microarray data analysis. The use of the ANOVA model for two-color microarray designs was first proposed by Kerr et al.7 and has been an important part of the literature. Jiang et al. tested barley genes for differential expression by one-way analysis of variance and controlled FDR according to the standard method of Benjamini and Hochberg.13
The ANOVA model for microarray data can be determined in two steps.34 The first stage is the normalization model;
(16)
Where
is the logarithm of signal intensity for the ith array, jth dye, gth gene and rth measurement. is the overall average expression level. A is the effect of the array at the measured intensity, D is the effect of the dye, and AD is the effect of interaction between dye and array. The first step is to form
term from the measured intensities. In the second step, gene-specific effects are modeled in terms of residuals of the normalization method.21 The gene-specific model is expressed as follows;
(17)
G is the average intensity for a particular gene,
is the effect of an array on this gene,
is the effect of dye on this gene, and
is the residual value. The variety-by-gene (VG) term is the main interest in the analysis, and reflects the variability in expression levels between samples for a particular gene.
is expressed as a "catch-all" term for the effects related to samples. In the simplest case,
is an indicator of the condition represented by the sample for j dye and i array. In more complex experiments, the design structure at the biological sample level is reflected in the VG terms.34
When there are duplicated spots in an array, the model should include additional terms for labeling and spot effects. The Gene-specific model can be modified by removing the terms DG and AG for single-color data.34
Hypothesis testing involves comparing two models. The null hypothesis suggests that there are no differential expressions (the VG values are equal to zero) and the alternative hypothesis suggests that there are differential expressions between conditions (the VG values are not equal to zero).29 F statistics are calculated with residual sum of squares;
(18)
(19)
(20)
and
are the residual sum of squares with
and
degrees of freedom, for zero and alternative models, respectively.
is the pooled variance between all the genes.34
Moderated F-test:Moderated t statistics lead to F statistics that can be used for test hypotheses about any set of comparisons. The appropriate quadratic forms of the moderated t statistics follow F distributions.35
Where the average of the contrast estimators is
, suppose that all comparisons for a given gene will be tested. While
is the diagonal matrix, C is the contrast matrix, and
is the positive definite matrix, the correlation matrix of
will be
. Let r be the column order of C,
and
.
is obtained by as follows;
(21)
If the columns of
are chosen to be eigenvectors covering the spacing of
, the diagonal matrix will be
.
is obtained by as follows;
(22)
Moderated F-statistic tests whether any of the comparisons for a given gene is zero. The result of the test is whether or not the gene is differently expressed in any comparison. In complex experiments involving too many comparisons, it is primarily preferable to select genes based on moderated F-statistics.35
False discovery rate:The great majority of discussions on error rate in the literature relate to experimental and per comparison error rates. In recent years, the false discovery rate (FDR) proposed by Benjaminini and Hochberg35 has become very widespread. FDR is defined as the expected value of the ratio of false rejections to total rejections by the authors. Benjamini and Hochberg point out that controlling the FDR is more reasonable than controlling the experimental or per comparison error rates.36
Let
be the means to be compared, we are interested in testing
pair of hypotheses.
is the number of correctly rejected hypotheses from
rejection sets.
is the number of pairs incorrectly rejected.36
Benjamini and Hochberg point out that the false rejected null hypothesis can be expressed by the Q random variable.31
(23)
When
, Q is defined as zero. If there is no rejection, the error rate is zero. Benjamini and Hochberg defined the FDR as the average of .36
Consequently;
(24)
(25)
(26)