|
[ Index | PDF ]
Practical Robust Fit of Enzyme Inhibition Data1 BioKin Ltd., 15 Main St, Suite 232, Watertown MA 02472 2 Celera Genomics, Department of Enzymology and High-Throughput Screening, 180 Kimball Way, South San Francisco, CA 94080
Corresponding AuthorDr. Petr Kuzmic
Published In:Methods in Enzymology, vol. 383, pp. 366-381 (2004).
Keywordsenzyme kinetics; mathematics; statistics; robust regression; outlier detection; Huber’s Minimax; inhibition; high-throughput screening
IntroductionThe analysis of enzyme inhibition data in the context of preclinical drug screening presents unique challenges to the data analyst. The good news is recent advances in laboratory robotics, miniaturization, and computing technology. However, the good news is also the bad news. Now that we can perform thousands of enzyme assays at a time, how do we sensibly manage and interpret the vast amount of generated information? New methods of biochemical data analysis are needed to match the improvements in research hardware. Another challenge is presented by increasing constraints on material resources, given the need to assay an ever larger number (thousands or hundreds of thousands) of individual compounds in any given project. Of course, only a small number of hits advance from high-throughput screening, using a single-point assay, into the dose-response screening to determine the inhibition constant. Even so, the sheer number of enzyme inhibitors that need to be screened often leads to sub-optimal experimental design. For example, a strategic decision may have been made that all inhibitor dose-response curves will contain only eight data points (running down the columns of a 96-well plate), and that the concentration-velocity data points will not be duplicated, so that each 96-well plate can accommodate up to twelve inhibitors. With only eight non-duplicated data points, and with as many as four adjustable nonlinear parameters (e.g., in the four-parameter logistic equation), the experimental data better be extremely accurate and the concentrations optimally chosen. Alas, too often neither is the case. This paper is concerned with one particular nuisance arising in secondary preclinical screening of enzyme inhibitors, namely, the presence of gross outliers. For our purposes, outliers are data points that are affected by gross errors caused by malfunctioning volumetric equipment, by a human error in data entry, or by countless other possible mishaps. It is shown that Huber’s Minimax approach to robust statistical estimation is particularly preferable over the conventional least-squares analysis.
TheoryIteratively reweighted least squares. Assume that the dependent variable y is related to the independent variable x through the functional relationship y = f(x,p), where p is the m-vector of adjustable model parameters to be estimated from the available data pairs {{x1,y1},{x2,y2},...,{xn,yn}}. The usual Ordinary Least Squares1, 2, 3 (OLS) estimation problem can be formulated as is shown in equation (1).
Many efficient computational methods exist to accomplish this minimization. Unfortunately, the OLS estimate of the model parameters is very sensitive to the presence of outliers,4 which has lead to the design of various alternatives. For example, according to equation (2), instead of minimizing the sum of squared deviations, one might minimize the sum of absolute deviations.
Computationally, the Least Absolute Deviation (LAD) fit is more difficult than OLS. One approach4 is to resort to derivative-free methods, such as the Nelder-Mead simplex algorithm.5 A more feasible approach, leading to the same results, is to use iteratively reweighted least squares. This is based on the fact that the LAD fit can be accomplished as a sequence of weighted LS fits.
In each step, the best-fit values A similar iteratively reweighted least-squares procedure forms the basis of the robust fit method discussed in this article. Huber’s method. The LAD fit has been used occasionally for data analysis in biochemical kinetics.6 It does solve the outlier problem, but it is probably not appropriate in most experimental situations arising in biochemistry. As had been pointed out,4 the LAD fit does not provide maximum likelihood parameter estimates (ML, or M-estimates) if the underlying statistical distribution of random errors is Gaussian, according to the probability density function (4).
The LAD fit does produce ML parameter estimates if (and only if) the underlying error distribution function is a double-sided exponential, but such distribution is seen only infrequently.6 On the other hand, the very presence of outliers in real-world experimental data proves the fact that strictly Gaussian error distribution is also unrealistic. What is the solution to this quandary? Huber7 proposed that random experimental errors arising in the physical sciences
could be described by using the contaminated Gaussian distribution (5), where
According to Huber,7 Starting from similar distributional assumptions, and from the central role of statistical influence functions,8, 9 a rigorous theory of robust estimation had been built.7, 10, 11, 12 Regardless of the particular form of the influence function, many computational algorithms for robust regression analysis rely on iteratively reweighted least squares,13, 14, 15, 16 as does Huber’s method used here. Huber’s influence function7 is constructed as follows. All “good” data points (to be defined below) are assigned the same weight in the iteratively reweighted series of LS estimations, exactly as they are in OLS. In contrast, deviant or “bad” points are progressively deemphasized, by being assigned progressively smaller weights according to equation (7). Here, a “good” data point is one for which the standardized residual Ri, defined in equation (8), is smaller in absolute value than a certain multiple of the estimated standard deviation of fit. The cut-off criterion c serves as an empirical tuning constant.
It should be noted that different authors (including manual writers for advanced statistical software packages, such as S-PLUS, SAS, and MATLAB) variously refer to Ri either as standardized residuals or as Studentized residuals. This confusion is clearly explained by Rawlins.17 Importantly, Huber established that with c = 1.345, the M-estimator so defined is 95% efficient. By “efficiency” we mean the ratio of variances from Huber’s M-estimate relative to normal regression models, assuming that the underlying error distribution is in fact normal. The standard deviation of fit,
The “hat” matrix and nonlinear leverages. The quantity hi appearing in the denominator of equation (8) is the leverage of the i-th data point in the nonlinear least-squares regression. It is the diagonal element of the n×n “hat” matrix H defined in equation (10), where J is the familiar n × m Jacobian matrix of first derivatives.18 Recall that m is the number of adjustable parameters in the fitting model.
In linear regression, the experimental values y and the least-squares fit
values Because we are only interested in the diagonal elements of the “hat” matrix, we can compute them directly by using equation (12), where ji is the i-th row vector in the Jacobian matrix J.
The m×m matrix inverse The kinetic model. The kinetics of tight-binding enzyme inhibition21, 22 is described here by equation (13), where V b is the baseline or background reaction rate, V 0 is the reaction rate observed in the absence of inhibitor (“negative control”), [E] and [I] are respectively the concentrations of the enzyme and the inhibitor, and Ki is the apparent inhibition constant.23
Numerical ExampleThe experimental data in the first three columns of Table 1 represent the micromolar concentration of an inhibitor (xi) and the corresponding initial velocities of an enzyme reaction (yi). Ordinary least squares fit. According to Huber’s method, the robust regression analysis always begins with the ordinary least-squares fit, summarized in columns four through seven in Table 1. In the OLS fit to equation (13), the enzyme concentration ([E] = 10 nM) and the background rate (V b = 0) were treated as fixed constants; the apparent inhibition constant Ki and the control rate V 0 were treated as adjustable model parameters. The best fit values are shown in the first row of Table 2.
Table 1Results of fit for a representative enzyme inhibitor.
Table 2The results of fit using various analysis methods; nw<1 is the number of data points
for which the final weight (equation (7)) was lower than unity;
It is important to note the difference between ordinary residuals ri = yi - This difference between ri and Ri is caused by the nonlinear leverages hi for each
data point. Note that the leverage for the fourth data point (0.57) is more than
twice the leverage for the fifth data point (0.26), which means that in the
iteratively reweighted least squares the fourth data point will be initially
given larger weight (1/ The leverages hi for data points seven through nine are practically zero, which means that these data points contribute practically no useful information. This indicates a sub-optimal experimental design. Huber7 points out that large values of hi should “serve as warning signals that the ith observation may have a decisive, yet hardly checkable, influence. Values hi < 0.2 appear safe, values between 0.2 and 0.5 are risky, and if we can control the design at all, we had better avoid values above 0.5.” From this discussion it is clear the fourth data point is what Huber calls a leverage point (i.e., a data point associated with a dangerously high value of hi), whereas data points number seven through nine are useless. This is yet another unpleasant consequence of one-size-fits-all experimental designs traditionally seen in inhibitor screening. Most often, the same dilution ratio and the same maximum concentration is used for all inhibitor on the same 96-well plate, but if the inhibitors differ significantly in their inhibition constants, this uniform design generates a large number of data points with very low information value. Robust fit. In the second stage of this analysis, the leverages hi computed during the preliminary OLS fit were used in the iteratively reweighted OLS regression, employing the default value of Huber’s tuning constant (c = 1.345). After several repeated OLS fits, the adjustable parameters converged to the values listed in the third row of Table 2. The results of the fit are shown graphically in Figure 1.
Note in Table 1 that the Huber reweighted regression ended with assigning unit weights (that is, giving the Ordinary Least Squares treatment) to all data points except data point number four, which is assigned a very small weight (wi = 0.12). This result strongly suggests that the fourth data point is an outlier. Its standardized residual is almost equal to three (Ri = 2.98), which is yet another strong indication that the data point is affected by gross error. Some authors8 recommend that data points with Ri > 2 should simply be deleted. Others recommend a two-stage robust regression, starting with Huber’s influence function (7) followed by Tukey’s Biweight scheme (14), where the 95% asymptotic efficiency of the standard normal distribution is achieved with c = 4.6851.
Note that Tukey’s outliers are given zero weights, and thus are effectively excluded from analysis. We have experimented with Tukey’s Biweight and found that, too often, it deleted too many data points from our small data sets. Instead, in the context of inhibitor screening, we formulated a heuristic policy for data exclusion as follows. If and only if the Huber method produces a single data point with the final weight wi < 1, this single data point is deleted (by being assigned wi = 0), and the analysis is repeated one last time as ordinary least squares. For our example inhibitor, where this condition does apply, the results are shown graphically as the thin dashed curve in Figure 1. The solid and dashed curves in Figure 1 do appear pleasingly similar, suggesting that Huber’s fit with c = 1.345 and OLS fit with the fourth data point deleted are consistent with each other. The best-fit values of adjustable parameters, shown for the OLS fit with deletion in the last row in Table 2, are also comparable for the two methods, although they are not identical. The difference is caused by the outlier point being assigned a nonzero weight, w4 = 0.12, in Huber’s method. However, the difference between Ki = 131 nM (Huber) and Ki = 146 nM (OLS with deletion) is only about 10%, whereas the OLS method with full data set produced an inhibition constant (Ki = 43 nM) that is off by more than a factor of three. Thus, the application of Huber’s method alone produced two desirable effects. First, it reduced the systematic error in Ki due to a single outliers, from 330% to 10%. Second, it clearly diagnosed the presence of this outlier so it could be deleted. Variations in Huber’s tuning constant. Although the particular value c = 1.345 for Huber’s tuning constant is rooted in statistical theory (it has been chosen because it leads to an M-estimate that is 95% efficient), it is important to examine in practice how variations in c might affect the outcome of robust regression analyses in our particular experimental setting. Based on theoretical considerations (see equation (7)), we can predict that as c becomes very large all data points will be assigned unit weights and the Huber regression turns into OLS. On the other hand, as c becomes very small, we expect the Huber algorithm to resemble the LAD fit, because the weighting factors in the influence function (7) simply become reciprocal absolute residuals.
Figure 2 and the fourth row in Table 2 show the results of Huber regression with c = 1.0. Although this is only marginally lower than the recommended value c = 1.345, the results of fit are strikingly different. As is illustrated by the bottom panel in Figure 2, the robust fit is dominated by six of the nine data points, which are assigned unit weights. Data points number one, four and five are de-emphasized with weights w1 = 0.11, w4 = 0.03, and w5 = 0.10. In other words, the Huber fit discovered too many outliers. Further decreasing c to 0.1 created another problem. With c = 0.1 or lower, no points were assigned full weights in reweighted least squares. The best-fit values of model parameters remained approximately the same between c = 0.1 and c = 0.001, but the variances of the model parameters decreased drastically. This is expected from theory, because the Huber M-estimate loses asymptotic efficiency as c moves away from its 95% efficient value of 1.345. Thus in any software system where the Huber tuning constant c is adjustable by the user (e.g., SAS, S-PLUS, MATLAB, or in the software built by us for inhibitor screening), one must proceed with caution. Lowering c might not only pick up too many “outliers” if the data set is small, it will also unrealistically shrink parameter variances.
As is expected from theoretical considerations, when we increased the tuning constant c above its 95% efficient value (c = 1.345), the algorithm simply turned into OLS. This is seen from Figure 3 and the second row in Table 2. In some respect, these results are disconcerting. At least for this particular inhibitor, decreasing c only slightly (in fact, from 1.345 to 1.3, see Figure 4 below) has lead to the false identification of too many “outliers”. In contrast, increasing c eventually missed the single outlier altogether. An important question then is how wide of a range can c have for the Huber method to remain useful for analyzing data sets as small as ours are (nine data points, two to four adjustable parameters). To answer this question, we have varied c systematically between 0.1 and 10.0,
stepping by
It is encouraging that the range of c values, within which the Huber algorithm successfully detected only a single data point with the final weight wi < 1, is relatively wide (c = 1.345 through c = 5.6). Unfortunately, as the weight of this data point increases, the outlier progressively distorts the best-fit value of the inhibition constant. Also note that the tuning constant’s default value (c = 1.345) precariously sits at the lower end of this interval. Thus, perhaps a miniscule change in the data could cause the algorithm to tip over that edge, and suggest falsely that the data set contains three “outliers” instead of one. These are serious challenges for the designer and administrator of a software system designed for automatic, robust, high-throughput analysis of enzyme inhibition data.
Implementation NotesIn the course of our ongoing work on new methods for automated data analysis in preclinical screening,24, 25, 26 we have tested Huber’s robust regression method on tens of thousands of enzyme inhibitors. The efficiency of the method in handling occasional outliers was very good. This is in contrast with the OLS fit, where data points with very large deviations have unduly large influence, and with the LAD fit,6, 4 where data points with very small deviations dominate the fitted curve. The latter statement is consistent with the fact that, in LAD regression
implemented as iteratively reweighted least squares, the weights are 1/|yi - The practical success of Huber’s method applied even to relatively small data sets, such as those arising in preclinical screening, is due to the fact that the method behaves as OLS does if the data are “good”, but at the same time it gives the LAD treatment to suspected outliers, while maintaining 95% asymptotic efficiency. The following are a few of many possible implementation issues, which were encountered in translating the theory of Huber’s regression into a practically useful software system. Replicated measurements. Huber’s robust regression method is suitable for the analysis of either replicated or nonreplicated data. However, when the number of replicates is small, it is best not to automatically average these replicates (e.g., duplicates) and then analyze the averaged data. Consider the hypothetical example where the replicated initial velocities are [100,99], [81,79], [62,30], [40,38], [19,21], and so on. These are ten data points, only one of which is clearly an outlier. By averaging we obtain five data points [99.5], [80], [46], [39], [20], among which, the outlier would be more difficult to detect if the fitting model is nonlinear. In this hypothetical example, a better alternative to robust regression might be weighted least squares (WLS) fit, where the weighting factors are reciprocal standard deviations from each replicate. Our experience shows that WLS with a small number of replicates can be treacherous for the following reason. If the inhibitor dose-response curve contains only a small number of (replicated) data points, it is possible that one particular duplicate might be fortuitously accompanied by a very small standard error, much smaller than the standard errors from other averages. In such a case, the particular data point would be assigned a disproportionately large weight, and consequently it might unduly influence the regression. In a production software system, the user or the administrator should have a choice to decide whether to use robust regression or WLS, but preferably not both at the same time. Outliers vs. deviations from the fitting model. Practical experience with Huber’s regression shows that in many cases the method will assign weights smaller than one to more than one data point, and sometimes even to all data points in the given dose-response curve. Obviously not all data points can be “outliers” if our distributional assumption (equation (5)) is correct. Rather, too many “outliers” (data points with wi < 1) simply suggest that the fitted model is incorrect. In such cases, a sensible software system would disregard the robust fit and revert to OLS with a suitable warning. Alternately, if only one of the weights is very much smaller than the others, it might make sense to delete the corresponding data point (by setting its wi = 0) and run one final OLS fit. A further refinement of this policy would be to take into account the total number of data points with wi < 1, or the sum of weights for the entire data set (see Figure 4). One might be willing to accept the results of Huber’s robust regression analysis only if the majority of data points end up with wi = 1, or otherwise conclude that the model is inadequate, issue a warning, and end with OLS as the last resort. However the software system is constructed, it is important to avoid situations illustrated in Figure 2, where very few data points completely dominate the regression while the remaining data points are effectively excluded through weighting. Again, in a production software system, the user or the administrator should be able to control these policies.
ConclusionsWe have discussed several mathematical quantities that should be of interest to the
biochemical data analyst, but, to our knowledge, are hardly ever mentioned in the
mainstream biochemical literature. First, standardized residuals defined by equation
(8) are significantly more informative than ordinary residuals (y - Secondly, nonlinear leverages, which are diagonal elements of the “hat” matrix H (equation (10)), are very useful for quickly assessing the presence of unduly influential data points (whether outliers or not) and the optimality of experiment design. If, after a least squares fit, we find that a particular data point is associated with hi > 0.7, it means that this is a leverage point. Very small, random changes in this single data value might have a very large effect on model parameters, which is undesirable. On the other hand, if we find that too many data points are associated with zero leverages (hi = 0), it means these data points were wasted, because they contribute no useful information at all about the model parameters. In such case, one should seriously consider improving the experimental design (in the case of inhibitor screening, the layout of concentrations) for the next round of experiments. Both standardized residuals and leverages play a role in the Huber’s method of robust regression analysis, implemented as iteratively reweighted least squares. We found that it is a good alternative to ordinary least squares when the goal is to exclude a single gross outlier from a relatively small data set. This approach can increase productivity in preclinical screening laboratories, faced with determining inhibition constants for thousands of enzyme inhibitors in a single project. Kinetic data analysis does present unique challenges in an extensively automated and robotized enzymology laboratory, where success means a smooth flow of the massive data stream connecting microtiter plate readers with structure-activity databases. At the present time, no technology can completely replace a well qualified enzymologist supervising the process. However, our practical experience shows that a software system judiciously implementing Huber’s variant of robust regression does help in the specific task of objectively identifying grossly outlying data points.
References
1. M. L. Johnson and S. G. Frasier, Methods Enzymol. 117, 301-342 (1985). 2. M. L. Johnson and L. M. Faunt, Methods Enzymol. 210, 1-37 (1992). 3. M. L. Johnson, Anal. Biochem. 206, 215-25 (1992). 4. M. L. Johnson, Methods Enzymol. 321, 417-424 (2000). 5. J. A. Nelder and R. Mead, Computer J. 7, 308-315 (1965). 6. I. B. C. Matheson, Computers & Chem. 14, 49-57 (1990). 7. P. J. Huber, “Robust Statistics”, John Wiley & Sons, New York, 1981. 8. D. A. Belsley, E. Kuh, and R. E. Welsch, “Regression Diagnostics: Identifying Influential Data and Sources of Collinearity”, John Wiley & Sons, New York, 1980. 9. R. D. Cook and S. Weisberg, “Residuals and Influence in Regression”, Chapman and Hall, New York, 1982. 10. W. J. J. Rey, “Introduction to Robust and Quasi-Robust Statistical Methods”, Springer-Verlag, New York, 1983. 11. F. R. Hampel, E. M. Roncheti, P. J. Rousseeuw, and W. A. Stahel, “Robust Statistics”, John Wiley & Sons, New York, 1986. 12. P. J. Rousseeuw and A. M. Leroy, “Robust Regression and Outlier Detection”, WileyInterscience, New York, 1987. 13. P. W. Holland and R. E. Welsch, Commun. Statist. - Theory Meth. A6, 813-827 (1977). 14. D. Coleman, P. Holland, N. Kaden, V. Klema, and S. C. Peters, ACM Trans. Math. Software. 6, 327-336 (1980). 15. J. O. Street, R. J. Carroll, and D. Ruppert, Amer. Statistician. 42, 152-154 (1988). 16. R. Heiberger and R. A. Becker, J. Comput. Graph. Statist. 1, 181-196 (1992). 17. J. O. Rawlings, “Applied Regression Analysis - A Research Tool”, Wadsworth, Belmont, 1988. 18. M. L. Johnson, Methods Enzymol. 321, 425-446 (2000). 19. D. C. Lay, “Linear Algebra and its Applications”, Addison-Wesley, Reading, MA, 1994. 20. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and Brian P. Flannery, “Numerical Recipes in C”, Cambridge University Press, Cambridge, 1992. 21. J. F. Morrison, Biochim. Biophys. Acta. 185, 269-286 (1969). 22. J. W. Williams and J. F. Morrison, Meth. Enzymol. 63, 437-467 (1979). 23. S. Cha, Biochem. Pharmacol. 24, 2177-2185 (1975). 24. P. Kuzmic, S. Sideris, L. M. Cregar, K. C. Elrod, K. D. Rice, and J. W. Janc, Anal. Biochem. 281, 62-67 (2000). 25. P. Kuzmic, K. C. Elrod, L. M. Cregar, S. Sideris, R. Rai, and J. W. Janc, Anal. Biochem. 286, 45-50 (2000). 26. P. Kuzmic, C. Hill, M. P. Kirtley, and J. W. Janc, Anal. Biochem. in press (2003).
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||