International Journal of Biomedical Science

Received- January 7, 2011; Accepted- February 15, 2011

International Journal of Biomedical Science 7(3), 172-180, Sep 15, 2011

ORIGINAL ARTICLE

Comparison of Two Methods for Detecting Alternative Splice Variants Using GeneChip^{^®} Exon Arrays

Wenhong Fan¹, Derek L. Stirewalt², Jerald P. Radich², Lueping Zhao¹

¹ Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109, USA;

² Clinical Research Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109, USA

Corresponding Author: Lue Ping Zhao, 1100 Fairview Ave. N, Seattle, WA 98109, USA. Tel/Fax: (206) 667-6927/2437; E-mail: lzhao@fhcrc.org.

	ABSTRACT
	INTRODUCTION
	METHODS
	RESULTS
	DISCUSSION
	ACKNOWLEDGEMENTS
	REFERENCES

	ABSTRACT

The Affymetrix GeneChip Exon Array can be used to detect alternative splice variants. Microarray Detection of Alternative Splicing (MIDAS) and Partek^® Genomics Suite (Partek^® GS) are among the most popular analytical methods used to analyze exon array data. While both methods utilize statistical significance for testing, MIDAS and Partek^® GS could produce somewhat different results due to different underlying assumptions. Comparing MIDAS and Partek^® GS is quite difficult due to their substantially different mathematical formulations and assumptions regarding alternative splice variants. For meaningful comparison, we have used the previously published generalized probe model (GPM) which encompasses both MIDAS and Partek^® GS under different assumptions. We analyzed a colon cancer exon array data set using MIDAS, Partek^® GS and GPM. MIDAS and Partek^{^®} GS produced quite different sets of genes that are considered to have alternative splice variants. Further, we found that GPM produced results similar to MIDAS as well as to Partek^® GS under their respective assumptions. Within the GPM, we show how discoveries relating to alternative variants can be quite different due to different assumptions. MIDAS focuses on relative changes in expression values across different exons within genes and tends to be robust but less efficient. Partek^® GS, however, uses absolute expression values of individual exons within genes and tends to be more efficient but more sensitive to the presence of outliers. From our observations, we conclude that MIDAS and Partek^® GS produce complementary results, and discoveries from both analyses should be considered.

KEY WORDS: alternative splicing; exon; gene expression analysis

INTRODUCTION

	INTRODUCTION

The Affymetrix GeneChip^® Exon Array (Affymetrix, Santa Clara, CA) is designed to evaluate expression of individual exons within genes and has been used to detect alternative splice variants (1-3). (MIDAS) (4) and Partek^® Genomics Suite (5) are among the commonly used methods for analyzing exon array data. When applying both methods to analysis of the same data set (6), we have noted that they produce different results (unpublished data), which could create controversies in practice. Here we intend to compare these two software methods from the perspective of the generalized probe model (GPM) that was published by our group (7). As a general model, GPM can be simplified to be equivalent to either MIDAS or Partek^® GS, depending on which assumptions are made.

MIDAS software assumes that relative changes in exon signals within genes are predictive of alternative splice variants. Under this assumption, MIDAS divides an individual exon signal by its gene expression value. In other words, it creates a gene normalized exon signal by computing ratios between individual exons and whole gene signals. Hereafter, we refer such ratios as “ratio signals.” In MIDAS, the ratio signal for each exon is used to compare exons to one another between comparison groups using analysis of variance (ANOVA). Implicitly such an ANOVA further assumes that the error distribution on the relative scale is normal on the logarithmic scale. Using variations detected by ANOVA, MIDAS predicts which exon is likely to have alternative splicing. Figure 1 illustrates how each exon signal is divided by its gene expression value to obtain the ratio signal. X and Y are the ratio signals for exon 3 in samples A and B, respectively. If X significantly differs from Y, a splice variant exists in exon 3 between samples A and B.

In contrast, Partek^® GS directly uses the exon signals to detect differences in the expression of exons in the ANOVA. It then declares the presence of relevant alternative splice variants when one or more exons appear to have different expression mean values between the two comparison groups. Figure 2 illustrates the basic strategy underlying Partek^® GS. Essentially it detects significant differences in the exon signals across all exons in the gene between the comparison groups. Implicitly assumed in Partek^® GS is the normally distributed error for the raw intensity values on the logarithmic scale.

While both software methods are designed for analyzing exon array data, there are some fundamental differences. Firstly, definitions of alternative splice variants are different at the conceptual level. MIDAS conceptualizes the presence of alternative splice variants via the relative changes of exon-specific expression values within a gene, while Partek^® GS assumes that alternative splice variants are absolute changes of exon-specific expression values within a gene. Secondly, MIDAS focuses on the presence of alternative splice variants at the individual exon level so that it generates a p-value for each exon in the statistical testing. Partek^® GS simply detects the presence of alternative splice variants within a gene, and it therefore produces a p-value for each gene. Thirdly, the underlying statistical assumptions on stochastic processes are rather different between these two methods; MIDAS assumes an error distribution for relative changes, and Partek assumes an error distribution for raw intensity values. The software methods are therefore substantially different, presenting challenges to scientists who desire to examine the biological relevance of alternative splice variants. Resolving their differences is the primary motivation for us to consider using GPM to analyze exon array data.

While GPM was recently proposed for analyzing probe-level data from GeneChip^® gene expression array data (7), it is rather general and encompasses statistical models used by MIDAS and Partek^® GS under their respective assumptions. Fundamentally, GPM is a linear regression model and is directly applicable to exon array data by equating “probe” with “exon.” Because it uses the generalized estimating equation (GEE) technique (8), GPM is particularly suitable for modeling multiple correlated exon expression values within genes. Under the typical multivariate normal distribution assumption, the GPM method produces efficient estimates. Efficiency in the current context refers to statistical efficiency, where a higher efficiency method obtains the same conclusion with relatively fewer samples. GPM encompasses the statistical method in MIDAS because the ANOVA method used in MIDAS is a special case of GPM as long as relative exon-specific expression signals are used in the analysis. Similarly, GPM encompasses Partek^® GS. This is also because the multivariate ANOVA used in Partek^® GS is a special case of GPM as long as exon-specific expression values are used and a single gene-specific parameter is used to identify alternative splice variants.

In addition to the above reasoning, it is important to demonstrate differences and similarities between these two software methods on an actual data set. For this purpose, we used an empirical data set that was generated from a panel of colon normal and cancerous tissues produced by researchers at Affymetrix (6). We applied MIDAS and Partek^® GS, together with GPM, to analyze the colon cancer data set. To eliminate the influence of data pre-processing, we used PLIER in the Affymetrix Power Tools (APT) Software Package (9) to produce both exon signals and gene expression values. We then compared results from MIDAS and Partek^® GS with GPM, and results from GPM under different assumptions.

View larger version :
[in a new window]

Figure 1. Schematic diagram to illustrate alternative splice variant detection in MIDAS.

View larger version :
[in a new window]

Figure 2. Schematic diagram to illustrate alternative splice variant detection in Partek^® GS.

METHODS

	METHODS

Data sets

The colon cancer data set includes 20 paired normal/tumor samples from colon tissue (6). Affymetrix Human Exon 1.0 ST array contains over one million exons covering the entire human genome. Exons are categorized into three groups: core content, extended content and full content. For our comparison, the core content exons were used because they had the most supportive evidence in the annotation databases.

Running MIDAS

MIDAS statistics were calculated using the software provided by Affymetrix in the APT packages. MIDAS uses the two-sample t-test, or 1-way ANOVA, to detect alternative splicing exon by exon. Let Y denote the ratio of the exon signal over the gene expression. MIDAS 1-way ANOVA model can be written as:
log(Y) | T = μ + T + ε (a)

where µ quantifies the baseline mean value, T quantifies the difference between the treatment groups, i.e. tumor versus normal in the colon tumor dataset, and ε denotes a normally distributed error. This error is assumed to be independent after accounting for the difference between tumor and normal tissues. The p-values and F-statistics were output into separate files for each exon.

Running Partek^® GS

Partek GS models exon signal directly using mixed ANOVA. For the colon tumor dataset, its model can be written as:
log(Y) | T, E, P = μ + T + P + T * E + S(P * T) + ε (b)

where Y is the exon signal (not the ratio as in MIDAS), µ estimates the baseline mean value, T estimates the difference between the treatment groups (i.e. differential gene expression parameter), P estimates the patient random effect, E estimates the Exon effect, S(P*T)is the sample random effect which is nested in tissue type and the patient variable, and ε denotes a normally distributed error. Besides the independence across individuals, errors from multiple exons within the same transcript cluster were assumed to be independent. Such conditional independence allows one to derive a joint multivariate normal distribution so that relevant parameters and their standard errors can be estimated. Note that we added the log-stabilizing factor of 8 to the data prior to the logarithmic transformation, to keep raw data as consistent as possible with the MIDAS calculation. We invoked the ALT-SPLICE ANOVA option under the STAT menu per its manual. P-values and F-statistics on the exon and tissue interactions were extracted from Partek^® GS outputs.

Running GPM

GPM was developed for probe-level analysis on the GeneChip^® gene expression array data on which multiple probes are used to interrogate one gene. Purely from the modeling perspective, a typical GPM can be written as the linear model used in Partek^® GS. To acknowledge the dependence among multiple exons within a transcript cluster, GPM can be formally defined as

(c)

where n is used to denote the number of exons, μ’s are mean intensity values for every exon, and β’s are differences between tumor and normal tissues where differences greater than zero imply the presence of alternative splice variants. The residuals ε' = (ε₁ ε₂ … ε_n) are assumed to be independent across subjects, but are correlated among themselves. Due to using the estimating equation, we did not have to assume a multivariate distribution for these residuals. Once estimates were obtained, we computed their covariance matrix, and hence Wald-statistics to test if all β’s equaled zero.

There are two pertinent remarks regarding use of GPM. In connection with the linear models underlying Partek^® GS, one could start with the linear model (b) and compute mean structures. Those random effects are the primary sources contributing to the covariance matrix of multiple exons within the transcript cluster. Specification of their mean structure and covariance matrix allows one to adopt the same estimating equation technique without requiring any normality assumption. The second point is relating to the application of GPM to pair matched observations. The simplest way for GPM to incorporate such a design is to replace the outcome vector with the corresponding differences between exon intensity values (on a log scale), and then to model the association of differences with the tissue type. More generally, if multiple tissue samples are obtained from the same subject, one would expand the above model (c) to include replicates in a straightforward fashion. Again, one does not have to make any unverifiable distributional assumption, resulting in robust estimations and inferences.

RESULTS

	RESULTS

MIDAS versus Partek^® GS

As noted above, MIDAS and Partek^® GS are the two most commonly used software methods for detecting alternative splice variants. We ran each analysis using the corresponding default settings, using the Bonferroni corrected p-value of 0.05 as the significance level. In this exercise, MIDAS yielded no signals for any alternative splice variants. On the other hand, Partek^® GS identified 368 transcript clusters (Table OL1; also see detail information on test statistics and annotation in additional files 1 and 2 in supplemental material online at www.ijbs.org). A transcript cluster includes multiple exons within the same transcriptional locus. It is conceptually equivalent to a gene but may not have the exactly same boundaries. For example, the Akt3 transcript cluster has many probe sets that fall outside the region annotated by the RefSeq database. We use gene and transcript cluster interchangeably in this paper.

Since MIDAS produces statistics on the exon level and Partek^® GS on the gene level, it is not feasible to compare their statistics directly. Instead we focused our comparisons on top candidate gene lists selected from both analyses. By ranking p-values from MIDAS, we picked the top 462 most significant exons in MIDAS, because these probes are from a list of 363 unique transcript clusters (Table OL2; also see detail information on test statistics and annotation in additional files 3 and 4 in supplemental material online at www.ijbs.org). The choice of 363 transcript clusters is comparable to 368 transcript clusters identified by Partek^® GS. Comparing these two lists resulted in a total of 152 transcript clusters that overlap (Table OL3; also see detail information on test statistics and annotation in additional files 5 and 6 in supplemental material online at www.ijbs.org).

To gain insight into the differences and similarities between these two methods, we identified three examples in distinct scenarios: transcript clusters selected by MIDAS only, by Partek^® GS only, and by both. As shown in Figures 3, 4, and 5, we plotted the ratio signals from MIDAS (bottom panel) and the exon signals from Partek^® GS analysis (top panel) for each exon in three transcript clusters. In Figure 3, for transcript cluster 2330133, exon 14 has a p-value of 0.04 and had been selected by MIDAS into the top 462 significant exons, but the transcript cluster was not selected by Partek^® GS. Exon 14 was selected due to significant ratio signal difference between normal (red) and tumor (blue) tissues. From the top panel, one can see that the exon signals for exon 14 are also quite different between normal and tumor tissue. Transcript cluster 2330133 was not selected by Partek^® GS because this method tests the entire set of exons in a transcript cluster, not each individual exon. In Figure 4, transcript cluster 3406329 was selected by Partek^® GS, but none of the exons in this transcript cluster had p-values sufficiently small to be in the top 462 clusters selected by MIDAS. For this transcript cluster, it seems that expression from the first half of the exons (1-14) was lower than those from the second half (15-33) in both normal and tumor tissues. Expression values for the second half exons were lower in normal than in tumor tissues, but this was not the case for the first half exons. This transcript cluster was selected by Partek^® GS because the expression patterns changed from the first half exons to the second half in this transcript cluster between the normal and tumor tissues. In MIDAS, analyzing the ratio signals for each exon in transcript cluster 3406329 did not reveal any significantly different exons between normal and tumor tissues. In Figure 5, transcript cluster 2584134 was selected by both Partek^® GS and MIDAS. MIDAS reported that exon 13 had a p-value of 0.025. Partek^® GS also detected alternative splicing patterns in this transcript cluster.