The backtesting framework developed by the Committee is based on that adopted by many of the banks that use internal market risk measurement models. These backtesting programs typically consist of a periodic comparison of the bank's daily value-at-risk measures with the subsequent daily profit or loss ("trading outcome"). The value-at-risk measures are intended to be larger than all but a certain fraction of the trading outcomes, where that fraction is determined by the confidence level of the value-at-risk measure. Comparing the risk measures with the trading outcomes simply means that the bank counts the number of times that the risk measures were larger than the trading outcome. The fraction actually covered can then be compared with the intended level of coverage to gauge the performance of the bank's risk model. In some cases, this last step is relatively informal, although there are a number of statistical tests that may also be applied.
The supervisory framework for backtesting in this document involves all of the steps identified in the previous paragraph, and attempts to set out as consistent an interpretation of each step as is feasible without imposing unnecessary burdens. Under the value-at-risk framework, the risk measure is an estimate of the amount that could be lost on a set of positions due to general market movements over a given holding period, measured using a specified confidence level.
The backtests to be applied assess whether the observed percentage of outcomes covered by the risk measure is consistent with a 99% level of confidence. That is, they attempt to determine if a bank's 99th percentile risk measures truly cover 99% of the firm's trading outcomes. While it can be argued that the extreme-value nature of the 99th percentile makes it more difficult to estimate reliably than other, lower percentiles, the Committee has concluded that it is important to align the test with the confidence level specified in the Amendment to the Capital Accord.
An additional consideration in specifying the appropriate risk measures and trading outcomes for backtesting arises because the value-at-risk approach to risk measurement is generally based on the sensitivity of a static portfolio to instantaneous price shocks. That is, end-of-day trading positions are input into the risk measurement model, which assesses the possible change in the value of this static portfolio due to price and rate movements over the assumed holding period.
While this is straightforward in theory, in practice it complicates the issue of backtesting. For instance, it is often argued that value-at-risk measures cannot be compared against actual trading outcomes, since the actual outcomes will inevitably be "contaminated" by changes in portfolio composition during the holding period. According to this view, fee income, together with trading gains and losses resulting from changes in the composition of the portfolio, should not be included in the definition of the trading outcome because they do not relate to the risk inherent in the static portfolio that was assumed in constructing the value-at-risk measure.
This argument is persuasive with regard to the use of value-at-risk measures based on price shocks calibrated to longer holding periods. That is, comparing the ten-day, 99th percentile risk measures from the internal models capital requirement with actual ten-day trading outcomes would probably not be a meaningful exercise. In particular, in any given ten-day period, significant changes in portfolio composition relative to the initial positions are common at major trading institutions. For this reason, the backtesting framework described here involves the use of risk measures calibrated to a one-day holding period. Other than the restrictions mentioned in this paper, the test would be based on how banks model risk internally.
Given the use of one-day risk measures, it is appropriate to employ one-day trading outcomes as the benchmark to use in the backtesting program. The same concerns about "contamination" of the trading outcomes discussed above continue to be relevant, however, even for one-day trading outcomes. That is, there is a concern that the overall one-day trading outcome is not a suitable point of comparison, because it reflects the effects of intra-day trading, possibly including fee income that is booked in connection with the sale of new products.
On the one hand, intra-day trading will tend to increase the volatility of trading outcomes, and may result in cases where the overall trading outcome exceeds the risk measure. This event clearly does not imply a problem with the methods used to calculate the risk measure; rather, it is simply outside the scope of what the value-at-risk method is intended to capture. On the other hand, including fee income may similarly distort the backtest, but in the other direction, since fee income often has annuity-like characteristics. Since this fee income is not typically included in the calculation of the risk measure, problems with the risk measurement model could be masked by including fee income in the definition of the trading outcome used for backtesting purposes.
Some have argued that the actual trading outcomes experienced by the bank are the most important and relevant figures for risk management purposes, and that the risk measures should be benchmarked against this reality, even if the assumptions behind their calculations are limited in this regard. Others have also argued that the issue of fee income can be addressed sufficiently, albeit crudely, by simply removing the mean of the trading outcomes from their time series before performing the backtests. A more sophisticated approach would involve a detailed attribution of income by source, including fees, spreads, market movements, and intra-day trading results.
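The crude adjustment mentioned above, removing the mean from the series of trading outcomes, can be sketched as follows. This is only an illustration: the function name `demean_outcomes` is hypothetical, and the sketch assumes that trading outcomes are held as a simple list of signed daily profit-and-loss figures.

```python
from statistics import fmean


def demean_outcomes(trading_outcomes):
    """Subtract the sample mean from each daily trading outcome.

    Assumption for illustration: a steady, annuity-like income stream
    shows up mainly as a positive average outcome, so removing the
    average is a rough proxy for removing that income.
    """
    mean = fmean(trading_outcomes)
    return [outcome - mean for outcome in trading_outcomes]


# Hypothetical daily outcomes (positive = profit, negative = loss).
adjusted = demean_outcomes([3.0, -1.0, 4.0])
```

The adjusted series has zero mean by construction. It does not separate fees from other sources of income, which is why the text describes the approach as crude.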
To the extent that the backtesting program is viewed purely as a statistical test of the integrity of the calculation of the value-at-risk measure, it is clearly most appropriate to employ a definition of daily trading outcome that allows for an "uncontaminated" test. To meet this standard, banks should develop the capability to perform backtests based on the hypothetical changes in portfolio value that would occur were end-of-day positions to remain unchanged.
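A hypothetical trading outcome of this kind might be computed along the following lines. The sketch makes a strong simplifying assumption that is not part of the framework: each position is valued as quantity times price, so that the change in value is linear in the price change. The function and parameter names are illustrative only.

```python
def hypothetical_outcome(positions, prices_today, prices_next_day):
    """Change in value of end-of-day positions held unchanged for one day.

    positions:        mapping of instrument name -> end-of-day quantity
    prices_today:     mapping of instrument name -> end-of-day price
    prices_next_day:  mapping of instrument name -> next end-of-day price

    Assumption: linear instruments valued as quantity * price. Options
    and other non-linear positions would need full revaluation instead.
    """
    return sum(
        quantity * (prices_next_day[name] - prices_today[name])
        for name, quantity in positions.items()
    )


# Hypothetical example: long 100 units of A, short 50 units of B.
outcome = hypothetical_outcome(
    {"A": 100, "B": -50},
    {"A": 10.0, "B": 20.0},
    {"A": 9.5, "B": 21.0},
)
```

Because the positions are frozen at their end-of-day values, intra-day trading and fee income play no part in the result.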
Backtesting using actual daily profits and losses is also a useful exercise since it can uncover cases where the risk measures are not accurately capturing trading volatility in spite of being calculated with integrity.
For these reasons, the Committee urges banks to develop the capability to perform backtests using both hypothetical and actual trading outcomes. Although national supervisors may differ in the emphasis that they wish to place on these different approaches to backtesting, it is clear that each approach has value. In combination, the two approaches are likely to provide a strong understanding of the relation between calculated risk measures and trading outcomes.
The next step in specifying the backtesting program concerns the nature of the backtest itself, and the frequency with which it is to be performed. The framework adopted by the Committee, which is also the most straightforward procedure for comparing the risk measures with the trading outcomes, is simply to calculate the number of times that the trading outcomes are not covered by the risk measures ("exceptions"). For example, over 200 trading days, a 99% daily risk measure should cover, on average, 198 of the 200 trading outcomes, leaving two exceptions.
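The counting procedure can be sketched in a few lines. The sign convention is an assumption made for illustration: the risk measure is taken to be a positive number, trading outcomes are signed (negative for a loss), and an exception is a day on which the loss is strictly larger than the risk measure. The function names are hypothetical.

```python
def count_exceptions(risk_measures, trading_outcomes):
    """Count days on which the trading loss exceeded the risk measure.

    Assumed convention: risk measures are positive amounts; trading
    outcomes are signed, with losses negative.
    """
    if len(risk_measures) != len(trading_outcomes):
        raise ValueError("both series must cover the same days")
    return sum(
        1
        for measure, outcome in zip(risk_measures, trading_outcomes)
        if -outcome > measure
    )


def expected_exceptions(n_days, confidence=0.99):
    """Average number of exceptions implied by the confidence level."""
    return n_days * (1.0 - confidence)


# Hypothetical four-day sample: only the second day is an exception.
n_exceptions = count_exceptions([10, 10, 10, 10], [5, -12, -10, -3])
```

With 200 trading days and a 99% confidence level, `expected_exceptions(200)` reproduces the two exceptions of the example in the text, up to floating-point rounding.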
With regard to the frequency of the backtest, the desire to base the backtest on as many observations as possible must be balanced against the desire to perform the test on a regular basis. The backtesting framework to be applied entails a formal testing and accounting of exceptions on a quarterly basis using the most recent twelve months of data.
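One way to organise the quarterly accounting is to keep a daily record of exception flags and, at each quarter-end, count the flags that fall within the preceding twelve months. The sketch below assumes such a record exists as a mapping from date to flag, and assumes that the test date is a quarter-end. Both the data layout and the function name are illustrative.

```python
from datetime import date
from typing import Mapping


def trailing_year_exceptions(flags_by_date: Mapping[date, bool], as_of: date) -> int:
    """Count exceptions in the twelve months ending on `as_of` (inclusive).

    Assumption: `as_of` is a quarter-end date, so it is never 29 February
    and shifting the year back by one always gives a valid date.
    """
    start = as_of.replace(year=as_of.year - 1)
    return sum(
        1
        for day, is_exception in flags_by_date.items()
        if start < day <= as_of and is_exception
    )


# Hypothetical record of daily exception flags.
record = {
    date(1997, 12, 31): True,   # falls outside the window ending 31 Dec 1998
    date(1998, 1, 2): True,
    date(1998, 6, 15): False,
    date(1998, 12, 31): True,
    date(1999, 1, 4): True,     # after the test date
}
count = trailing_year_exceptions(record, date(1998, 12, 31))
```

Repeating the call at each successive quarter-end gives overlapping windows, so a single exception is counted in four consecutive quarterly tests.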
The implementation of the backtesting program should formally begin on the date that the internal models capital requirement becomes effective, that is, by year-end 1997 at the latest. This implies that the first formal accounting of exceptions under the backtesting program would occur by year-end 1998. This of course does not preclude national supervisors from requesting backtesting results prior to that date, and in particular does not preclude their usage, at national discretion, as part of the internal model approval process.
Using the most recent twelve months of data yields approximately 250 daily observations for the purposes of backtesting. The national supervisor will use the number of exceptions (out of 250) generated by the bank's model as the basis for a supervisory response. In many cases, there will be no response. In other cases, the supervisor may initiate a dialogue with the bank to determine if there is a problem with a bank's model. In the most serious cases, the supervisor may impose an increase in a bank's capital requirement or disallow use of the model.
The appeal of using the number of exceptions as the primary reference point in the backtesting process is the simplicity and straightforwardness of this approach. From a statistical point of view, using the number of exceptions as the basis for appraising a bank's model requires relatively few strong assumptions. In particular, the primary assumption is that each day's test (exception/no exception) is independent of the outcome of any of the others.
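Under the independence assumption, and if each day carries the same probability of an exception, the number of exceptions in a sample follows a binomial distribution. The sketch below evaluates that distribution. The defaults of 250 observations and a 1% daily exception probability correspond to twelve months of data and a 99% confidence level; the function name is hypothetical.

```python
from math import comb


def exception_count_probability(k, n_days=250, exception_prob=0.01):
    """Probability of exactly k exceptions in n_days independent daily tests.

    Assumes every day has the same exception probability and that days
    are independent, which gives the binomial probability mass function.
    """
    return (
        comb(n_days, k)
        * exception_prob ** k
        * (1.0 - exception_prob) ** (n_days - k)
    )


# Probabilities of observing 0, 1, 2, ... 5 exceptions for an accurate model.
probabilities = [exception_count_probability(k) for k in range(6)]
```

If exceptions cluster in time, so that independence fails, these probabilities are only an approximation.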
The Committee of course recognises that tests of this type are limited in their power to distinguish an accurate model from an inaccurate model. To a statistician, this means that it is not possible to calibrate the test so that it correctly signals all the problematic models without giving false signals of trouble at many others. This limitation has been a prominent consideration in the design of the framework presented here, and should also be prominent among the considerations of national supervisors in interpreting the results of a bank's backtesting program. However, the Committee does not view this limitation as a decisive objection to the use of backtesting. Rather, conditioning supervisory standards on a clear framework, though limited and imperfect, is seen as preferable to a purely judgmental standard or one with no incentive features whatsoever.
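The trade-off described here can be made concrete with a small calculation. The sketch assumes independent daily tests over 250 days, an accurate model with a 1% daily exception probability, and, purely for illustration, an inaccurate model whose true coverage is 97%. The cutoff values, the 97% figure, and the function names are hypothetical and are not taken from the framework.

```python
from math import comb


def prob_fewer_than(cutoff, n_days, exception_prob):
    """Probability of observing fewer than `cutoff` exceptions (binomial)."""
    return sum(
        comb(n_days, k)
        * exception_prob ** k
        * (1.0 - exception_prob) ** (n_days - k)
        for k in range(cutoff)
    )


def error_rates(cutoff, n_days=250, accurate_prob=0.01, inaccurate_prob=0.03):
    """Return (false_alarm, missed_problem) for a rule that signals
    trouble when the number of exceptions is `cutoff` or more.

    false_alarm:    chance that an accurate model triggers the signal
    missed_problem: chance that the inaccurate model does not trigger it
    """
    false_alarm = 1.0 - prob_fewer_than(cutoff, n_days, accurate_prob)
    missed_problem = prob_fewer_than(cutoff, n_days, inaccurate_prob)
    return false_alarm, missed_problem


# Compare a strict and a lenient hypothetical cutoff.
strict = error_rates(cutoff=3)
lenient = error_rates(cutoff=8)
```

Moving from the strict to the lenient cutoff lowers the chance of a false alarm but raises the chance of missing the inaccurate model, and no cutoff drives both to zero at once.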