Consensus Forecasts and Inefficient Information Aggregation

Christopher Crowe
Published Date:
July 2010

I. Introduction

Consensus forecasts (mean or median forecasts from a panel of individual forecasters) are always inefficient, as long as forecasters make individually rational forecasts based on different information sets.2 However, the empirical relevance of this fact has yet to be tested. This paper provides such a test using the well-known and widely used forecaster survey dataset produced by Consensus Economics. It demonstrates that consensus forecasts are indeed inefficient, with out-of-sample efficiency gains (reduction in root mean square forecast error) from a simple adjustment technique of around 5 percent. The paper goes on to discuss some potential ramifications, both for users of this kind of survey data and for information aggregation more generally.

To understand this paper’s empirical strategy, one must first appreciate the intuition behind the theoretical result. Consider a simple stylized model of the forecasting environment, in which a group of agents form independent forecasts based on a private signal and a common prior. Each individual forecaster, if producing an honest minimum variance forecast, will derive his or her forecast using Bayesian updating, that is, from a weighted average of the prior and the forecaster’s private signal, where the weights reflect their relative variance. The consensus forecast will then weight the prior and the average private signal using the same relative weights (on average) as the private forecasters. But the mean private signal has a lower variance than each of the individual signals (this is the argument for employing the consensus forecast in the first place). In other words, the consensus forecast over-weights the prior, and as a result is inefficient. Notably, it is (negatively) correlated with its own forecast error. This result is a general one and does not rely on strategic forecasting behavior or heterogeneous signal precision.
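The intuition can be checked with a small simulation. The sketch below is illustrative only: the parameter values, the finite panel of 200 forecasters and all variable names are assumptions made here, not taken from the paper. Each forecaster applies the Bayesian weight to a private signal, and the consensus is the cross-sectional mean of those forecasts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration.
mu = 2.0             # common prior mean
var_y = 1.0          # variance of the target around the prior
var_eps = 1.0        # variance of each forecaster's idiosyncratic noise
n_forecasters = 200
n_periods = 5_000

# Bayesian weight a single forecaster places on the private signal.
weight = var_y / (var_y + var_eps)

y = mu + rng.normal(0.0, np.sqrt(var_y), size=n_periods)
noise = rng.normal(0.0, np.sqrt(var_eps), size=(n_periods, n_forecasters))
signals = y[:, None] + noise

individual = (1 - weight) * mu + weight * signals
consensus = individual.mean(axis=1)

# Efficient aggregate forecast: the mean signal is far less noisy than any
# single signal, so it deserves a larger weight relative to the prior.
weight_star = var_y / (var_y + var_eps / n_forecasters)
efficient = (1 - weight_star) * mu + weight_star * signals.mean(axis=1)

# Forecast error defined as forecast minus realization, as in the text.
corr_individual = np.corrcoef(individual[:, 0], individual[:, 0] - y)[0, 1]
corr_consensus = np.corrcoef(consensus, consensus - y)[0, 1]

rmse_consensus = np.sqrt(np.mean((consensus - y) ** 2))
rmse_efficient = np.sqrt(np.mean((efficient - y) ** 2))

print(f"individual forecast vs. own error: {corr_individual:+.3f}")
print(f"consensus forecast vs. own error:  {corr_consensus:+.3f}")
print(f"RMSE consensus {rmse_consensus:.3f}, efficient {rmse_efficient:.3f}")
```

With these settings the individual correlation should be close to zero while the consensus correlation is strongly negative. The size of the gap depends entirely on the assumed variances and the absence of aggregate signal noise in this sketch.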

One can test the empirical relevance of the result by analyzing the relationship between the forecast errors associated with the consensus forecast (the difference between the consensus forecast and the actual realization) and the revisions to the consensus forecast (the difference between the current consensus forecast and the previous one). To provide some initial evidence, Table 1 presents simple bivariate correlations between forecast updates and forecast errors, for the full sample as well as for different forecast horizons, using the Consensus Economics cross-country dataset of real GDP growth forecasts.3 The first column provides correlation coefficients for the individual forecasts, while the second provides coefficients for the consensus forecasts. Column 2 provides strong evidence for negative correlation between consensus forecast updates and forecast errors, particularly at forecast horizons of under 13 months. Estimated negative correlation coefficients are as high as 40 percent in absolute value. Comparison with Column 1 suggests that this negative correlation for the consensus forecasts is not driven by individual forecaster irrationality.4
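As a rough sketch of how such correlations can be computed, the function below takes a panel of forecasts arranged by target and horizon. The array layout, the function name `update_error_correlations` and the toy numbers are assumptions made for illustration; they do not describe the Consensus Economics data format.

```python
import numpy as np

def update_error_correlations(forecasts, actuals):
    """Correlate forecast updates with forecast errors, by horizon and pooled.

    forecasts: array of shape (n_targets, n_horizons); column h holds the
        forecast made at horizon h, so column h + 1 is the previous forecast.
    actuals: array of shape (n_targets,) with the realized values.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    actuals = np.asarray(actuals, dtype=float)

    # The longest horizon has no earlier forecast, so it yields no update.
    updates = forecasts[:, :-1] - forecasts[:, 1:]
    errors = forecasts[:, :-1] - actuals[:, None]

    by_horizon = np.array([
        np.corrcoef(updates[:, h], errors[:, h])[0, 1]
        for h in range(updates.shape[1])
    ])
    pooled = np.corrcoef(updates.ravel(), errors.ravel())[0, 1]
    return by_horizon, pooled

# Tiny made-up panel: three targets, forecasts at horizons 0, 1 and 2.
example_forecasts = [[1.0, 0.0, 0.5], [2.0, 0.0, 0.0], [3.0, 0.0, 1.0]]
example_actuals = [2.0, 4.0, 6.0]
by_horizon, pooled = update_error_correlations(example_forecasts, example_actuals)
```

The same function can be applied to a single forecaster's panel or to the consensus series, which is the comparison the two columns of Table 1 draw.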

Table 1. Correlation Coefficients

Forecast horizon | Individual forecasts | Consensus forecasts
0 | 0.03 *** | -0.13 ***
1 | 0.10 *** | -0.19 ***
2 | 0.05 *** | -0.21 ***
3 | -0.01 | -0.28 ***
4 | 0.06 *** | -0.24 ***
5 | 0.02 | -0.36 ***
6 | 0.01 | -0.28 ***
7 | 0.03 *** | -0.40 ***
8 | 0.03 *** | -0.21 ***
9 | 0.06 *** | -0.32 ***
10 | 0.06 *** | -0.25 ***
11 | 0.14 *** | -0.07
12 | 0.02 | -0.35 ***
13 | 0.14 *** | -0.18 ***
14 | 0.15 *** | -0.09 *
15 | 0.21 *** | 0.03
16 | 0.20 *** | -0.08 *
17 | 0.21 *** | -0.02
18 | 0.23 *** | 0.03
19 | 0.19 *** | -0.12 **
20 | 0.24 *** | 0.03
21 | 0.22 *** | -0.06
22 | 0.24 *** | -0.05
Pooled | 0.13 *** | -0.15 ***

Correlation between forecast errors and forecast updates. Significance level denoted by *** (1 percent); ** (5 percent); * (10 percent).

This initial evidence of consensus forecast inefficiency is subjected to a more rigorous econometric investigation later in the paper. To this end, the paper carefully outlines the assumed data structure and several important econometric factors that have hindered testing in this area. It builds on insights in the existing literature on efficiency/rationality testing using forecast data (Zarnowitz, 1985; Keane and Runkle, 1990; Batchelor and Dua, 1991; Davies and Lahiri, 1995; Bonham and Cohen, 2001) and provides some new insights. In particular, it shows that the assumed forecast structure in Davies and Lahiri (1995), while approximately correct for analyzing forecaster behavior in the context of repeated forecasts at different time horizons and multiple forecasters per horizon, is inconsistent with rational forecaster behavior in a fully specified model of the information structure. In a fully specified model, rational forecaster behavior creates additional econometric challenges, which this paper’s empirical strategy addresses.

The results of this empirical exercise provide unambiguous support for the intuition given in Table 1. There is very strong evidence that consensus forecasts are indeed inefficient, and that this inefficiency is quantitatively important, particularly at forecast horizons of around 12 months or less. These results are robust to different definitions of the consensus forecast, different vintages of the actual data and different forecast datasets. Notably, very similar results are obtained using forecasts of quarterly nominal U.S. GDP from the Survey of Professional Forecasters.

These empirical results have a number of interesting implications. Consensus forecasts—the mean or median forecast from a panel of professional forecasters—are a widely used tool for researchers and policymakers. For instance, central banks frequently present consensus private sector forecasts alongside their own in-house forecasts as a means of increasing the credibility of their own forecasts and policymaking.5 Consensus forecasts are also employed in the academic literature as proxies for expectations more broadly (e.g. Johnson, 2002) or as a benchmark against which to test individual public or private sector forecasters (Barakchian and Crowe, 2009; Romer and Romer, 2000). However, this paper argues that more caution should be exercised when using consensus forecasts. In particular, the consensus forecast does not typically represent the best (most efficient) estimate of the true state of the world, conditional on the aggregate information set. Its use as a benchmark may therefore flatter the performance of alternative forecasts.

This paper also has implications for the debate over the efficacy of releasing more transparent public information sparked in part by the contribution of Morris and Shin (2002) and, in the policy sphere, by moves toward greater transparency by central banks, notably in the context of inflation targeting (IT) regimes. One can show that—even without Morris and Shin’s Beauty Contest component to the forecasting process—more transparent public information can increase the forecast error for the consensus forecast, even as the forecast errors of individual forecasts are reduced.6 This lesson that better public information can lead to worse collective judgments is a general one with many applications: the paper discusses an application to the U.S. financial crisis.

This paper’s results can also be related to discussions of groupthink or collective delusion (e.g., Benabou, 2008). The model outlined here differs from models of social learning that generate herd behavior (e.g. Banerjee, 1992) in that the inefficiency of the consensus forecast does not stem from forecasts being made public sequentially, but from the communication of posterior beliefs rather than raw signals. Nevertheless, it shares the core feature of these models identified by Benabou (2008): the key problem is a failure to aggregate private signals and its cure resides in more communication. A subtle insight from this model is that the form of communication is central to efficient aggregation. Agents must be induced to communicate not their best guess, but rather their idiosyncratic information. Moreover, communication may not be sufficient: the empirical evidence suggests that agents fail to optimally use the aggregate information that does become available.

Finally, this paper’s results point to an additional kind of aggregation problem—moving between microeconomic and macroeconomic models. For instance, it may be invalid to use macroeconomic or average data to derive insights into individual behavior: as demonstrated in this paper, the empirical finding that consensus forecasts are inefficient does not imply that individual forecasts are not rational. More significantly, the results suggest that insights based on individual behavior do not necessarily carry over to a macroeconomic (multi-agent) environment. Even when agents’ behavior imposes no externality, insights derived from individual rationality (e.g. that expectations are rational) may not hold in aggregate. The economy does not behave as if it were populated by a single rational representative agent.7 Since asset prices also reflect the aggregation of idiosyncratic private information, the results in this paper could also shed light on asset price phenomena (for instance, the success of momentum trading strategies detailed in Jegadeesh and Titman, 1993 and 2001, and Hong and Stein, 1999), although a treatment of this subject is beyond the scope of the current paper.

Section II provides a formal discussion of the paper’s key argument. It goes on to demonstrate that an efficient forecast can be recovered by reducing the overweighting on the prior using a linear combination of the consensus forecast and the prior. The size of the required adjustment is increasing in the relative variance of the individual forecasters’ idiosyncratic private information.8 This section also illustrates how the paper’s predictions can be formulated as empirical hypotheses in the context of the data, and discusses some important econometric concerns that additionally dictate the empirical strategy.

Section III illustrates the argument using a large dataset of individual monthly forecasts of annual economic growth for a cross-section of countries between 1989 and 2008. It describes the data, then presents evidence of the existence and size of the inefficiency associated with employing consensus forecasts for different countries and time horizons. There is strong evidence of inefficiency for consensus forecasts, much less so for individual forecasts, as predicted by the model.9 The results are robust to employing a median rather than a mean forecast as the measure of the consensus forecast, to using real time (early) GDP estimates rather than revised data to measure the actual GDP out-turn, and to using an alternative dataset (nominal GDP data for the US from the Survey of Professional Forecasters). Using the adjustment method outlined above and employing the most recent previous consensus forecast as a proxy for the prior, one can obtain adjusted forecasts with mean squared errors on average more than 5 percent lower than the raw consensus forecasts, for forecasting horizons of a year or less, even out of sample. This effect may appear small; however, the reduction in forecast errors associated with the adjustment is typically greater than the reduction associated with using the next month’s consensus forecasts (i.e., with one month’s additional data available to forecasters). Moreover, since consensus forecasts are generally thought to be at, or near, the frontier in terms of forecast accuracy (see, e.g., Ang, and others, 2007, for an application to inflation forecasts), the fact that a simple adjustment technique can reduce forecast errors at all is significant. However, there is no evidence that forecasters make the necessary adjustment to the previous month’s consensus forecast in updating their prior.

Section IV provides a fuller discussion of how the results relate to some of the existing literature discussed above. Section V then concludes with a discussion of implications for optimal forecasting methods and the use of consensus forecasts by central banks and other institutions.

II. Model and Empirical Strategy

A. Basic Model and Results

The following model captures the basic features of this discussion. A variable y is distributed iid with mean μ and variance σ_v²:

A continuum of agents each receives a private signal of y, given by x_i:

where u is an aggregate error, iid mean-zero with variance σ_u², and ε_i are iid mean-zero idiosyncratic errors with variance σ_ε². The information structure, including all variances, is common knowledge.

Agents form forecasts following the usual Bayesian updating:

where Fi minimizes the root mean squared forecast error and is therefore given by:

The consensus forecast is inefficient, even as individual forecasts are all efficient.

Proof: The individual forecasts are uncorrelated with their forecast errors:

Define the consensus forecast as the mean forecast:

Then the consensus forecast is correlated with its forecast errors:

This result applies equally to the median if ε_i is symmetric (or, more generally, if its median is zero), and generalizes to the case with heterogeneous σ_ε² or a finite number of agents (Kim, Lim and Shaw, 2001 present a proof for the latter case). The easiest way to understand this result is to compare the individually-rational relative weight F on the individual signal with the optimal weight F* that would arise from minimizing the root mean squared forecast error associated with the aggregate signal:
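The algebra behind this result can be sketched as follows. The display is a reconstruction from the definitions in the text (y with mean μ and variance σ_v², aggregate error u, idiosyncratic errors ε_i, a continuum of agents). The labels v for the deviation of y from μ, f_i for an individual forecast, f̄ for the consensus and x̄ for the aggregate signal are assumptions made here.

```latex
\begin{align*}
y &= \mu + v, \qquad x_i = y + u + \varepsilon_i, \\
f_i &= (1-F)\,\mu + F\,x_i, \qquad
F = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_u^2 + \sigma_\varepsilon^2}, \\
\operatorname{Cov}(f_i,\, f_i - y)
  &= -F(1-F)\,\sigma_v^2 + F^2\left(\sigma_u^2 + \sigma_\varepsilon^2\right) = 0, \\
\bar f &= (1-F)\,\mu + F\,(y + u), \\
\operatorname{Cov}(\bar f,\, \bar f - y)
  &= -F(1-F)\,\sigma_v^2 + F^2\,\sigma_u^2 = -F^2\,\sigma_\varepsilon^2 < 0, \\
F^* &= \frac{\sigma_v^2}{\sigma_v^2 + \sigma_u^2} > F
  \quad \text{whenever } \sigma_\varepsilon^2 > 0, \\
f^* &= (1-F^*)\,\mu + F^*\,\bar x
     = \mu + \frac{F^*}{F}\left(\bar f - \mu\right),
  \qquad \bar x = y + u .
\end{align*}
```

The third line uses (1-F)σ_v² = F(σ_u² + σ_ε²), which follows directly from the definition of F. The last line shows why a linear combination of the consensus and the prior recovers the efficient forecast: the scaling factor F*/F equals 1 + σ_ε²/(σ_v² + σ_u²), which grows with the relative variance of the idiosyncratic information.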

difference is that this paper derives forecasters’ optimal forecasts based wholly on the underlying information structure, whereas Davies and Lahiri have an explicit information structure for common shocks but assume idiosyncratic forecasting errors rather than deriving forecast dispersal based on idiosyncratic signals. In fact, one can show that the forecast structure assumed by Davies and Lahiri is inconsistent with optimal forecasts. Once one derives dispersed signals endogenously, then optimal forecasting behavior with respect to the common shocks changes, and this then has implications for the behavior of consensus forecasts. With respect to more cosmetic changes, this paper ignores forecaster-specific bias (since this paper is not concerned with testing for bias) and also allows for a final shock to the variable outside the forecasting window (to capture forecast revisions).

The forecast target, output y, is assumed to be a function of 25 iid monthly shocks, one per month during the two year maximum forecast horizon and a final shock after the forecast horizon is complete due to data revision. That is:

where h indexes the forecast horizon (the horizon 0 shock is due to data revision, and since the maximum forecasting horizon is two years).

At the start of each month, agents i observe two signals of y. The first is a common prior μh, based on the shocks observed through the previous month:11

Agents also receive an idiosyncratic, mean-zero signal of the current month’s shock, which provides a second signal of y:

With this information structure, agents’ optimal forecast of y is given by:12

with consensus forecasts denoted:

Here, the key characteristic of consensus forecasts—underreaction to aggregate new information—is clearly observable. By contrast, the optimal forecast based on the aggregate signal fully absorbs the new shock :
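In symbols, one reading of this structure is sketched below. It is a reconstruction under stated assumptions rather than the paper's own notation: shocks are taken to be mean-zero with a common variance σ_v², the private signal of the current shock is written s_{ih} (a label introduced here), signal noise is purely idiosyncratic with variance σ_ε², and a higher h denotes an earlier forecast date.

```latex
\begin{align*}
y &= \sum_{h=0}^{24} v_h, \\
\mu_h &= \sum_{k=h+1}^{24} v_k
  && \text{(common prior: shocks through the previous month)} \\
s_{ih} &= v_h + \varepsilon_{ih}
  && \text{(private signal of the current shock)} \\
f_{ih} &= \mu_h + F\,s_{ih},
  \qquad F = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\varepsilon^2}
  && \text{(individual forecast)} \\
\bar f_h &= \mu_h + F\,v_h
  && \text{(consensus: absorbs only a share } F \text{ of the shock)} \\
f^*_h &= \mu_h + v_h
  && \text{(efficient forecast from the aggregate signal)}
\end{align*}
```

Under these assumptions the gap between the efficient and consensus forecasts is (1-F)v_h, the part of the current shock that averaging individually rational forecasts leaves out.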

The unique panel structure of forecast data poses additional econometric challenges, which are also addressed here. Finally, this section outlines a measure of the inefficiency embedded in consensus forecasts.

In the context of the information structure outlined above, the assertion in that consensus forecasts are inefficient still holds:

significant coefficients on lagged forecast updates and this pattern of alternating signs would lead one to reject the hypothesis that priors are rational, in favor of the naive case.

Under both rational and naive priors, the error terms echτ have a complex structure over forecast horizons h and adjacent forecast periods τ. Because includes only subsequent shocks that are orthogonal to information through horizon h, the errors are contemporaneously uncorrelated with the regressors. However, neither (24) nor the restricted version (25) can be estimated over a sample that pools across forecast horizons, since echτ includes subsequent shocks that are correlated with the regressors when subsequent forecasts are included (the Appendix provides a further discussion of this issue, which is common to consensus and individual forecasts).

Parameter estimates from separate regressions (per h) should yield consistent estimates. However, to improve the efficiency of our baseline estimates this paper adopts the following iterative procedure. Assuming that the naive priors case is the appropriate one, (25) is estimated for. In this case,

One can then subtract the estimated residuals from the left hand side of (25) estimated for :

The residuals can then be used as a proxy for in equation estimated for :

and so on. Since the iterated use of residuals could generate more noise, results from this procedure are presented alongside results from simple regressions of (25).

Finally, an estimate of the efficient forecast can be obtained by transforming the consensus forecast using the estimated regression coefficient from (25):15

To provide a metric for the efficiency gain that can be obtained via this transformation, one can compare the root mean square error (RMSE) of the raw forecasts with the RMSE of the adjusted forecasts:
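A minimal sketch of this adjustment and comparison is given below. It assumes the forecast error is defined as forecast minus realization, fits a single slope by ordinary least squares, and subtracts only the slope term from the consensus. The function names and the synthetic data are hypothetical and are not the paper's own code or data.

```python
import numpy as np

def fit_slope(update, error):
    """OLS slope from regressing forecast errors on forecast updates (with a constant)."""
    slope, _intercept = np.polyfit(update, error, deg=1)
    return float(slope)

def adjust_forecast(consensus, update, slope):
    """Remove the part of the forecast error that the update predicts."""
    return consensus - slope * update

def rmse(forecast, actual):
    return float(np.sqrt(np.mean((forecast - actual) ** 2)))

def efficiency_gain(raw, adjusted, actual):
    """Percent reduction in RMSE from using the adjusted forecast."""
    return 100.0 * (1.0 - rmse(adjusted, actual) / rmse(raw, actual))

# Synthetic illustration: the consensus absorbs only half of each new shock.
rng = np.random.default_rng(0)
n = 5_000
previous = np.full(n, 2.0)               # previous consensus, used as the prior
shock_now = rng.normal(size=n)           # news arriving this period
shock_later = rng.normal(size=n)         # news arriving after the forecast
consensus = previous + 0.5 * shock_now
actual = previous + shock_now + shock_later

update = consensus - previous
error = consensus - actual
slope = fit_slope(update, error)
adjusted = adjust_forecast(consensus, update, slope)
gain = efficiency_gain(consensus, adjusted, actual)
print(f"estimated slope {slope:.2f}, efficiency gain {gain:.1f} percent")
```

In this synthetic setup the true slope is -1, so the adjustment roughly doubles the weight on the update. With real data the slope would be estimated separately by horizon, as the text describes.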

B. Efficiency Tests

This subsection presents the results of the efficiency tests outlined in section II.C. First, it estimates equation (24), including the full set of lagged forecast updates as in the rational priors case, to assess evidence of the aggregation problem. These results suggest that the rational priors case can be rejected in favor of the simpler naive priors case, and the remainder of the results assume this latter case is the relevant one. The baseline results under naive priors are then presented, both with and without the adjustment to the dependent variable using residuals from regressions for subsequent forecast horizons outlined in section II.C.

Table 2 presents a summary of results from estimating regression equations including the full set of lagged forecast updates, as in (24). This table presents the weighted mean beta coefficient from the set of relevant per-forecast horizon regressions (weighted by the inverse of the variance of the coefficient estimates).18 Significance levels are based on a z-test using the standard error of the weighted mean. The last two columns give the proportion of relevant per-forecast horizon regressions in which the beta coefficient is found to be significant at the 1 percent level and to have a positive and negative sign, respectively. The predicted sign under the rational priors case is highlighted in bold.
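The weighting and test described here can be sketched as follows. The standard error formula treats the per-horizon estimates as independent, which is an assumption of this sketch; the text does not spell out the exact formula used. The function name is hypothetical.

```python
import math
import numpy as np

def weighted_mean_beta(betas, std_errors):
    """Inverse-variance weighted mean of coefficient estimates with a z-test.

    Returns the weighted mean, its standard error, the z statistic and the
    two-sided p-value for the null that the weighted mean is zero.
    """
    betas = np.asarray(betas, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    mean = float(np.sum(weights * betas) / np.sum(weights))
    # Assumes the individual estimates are independent of one another.
    se = float(1.0 / np.sqrt(np.sum(weights)))
    z = mean / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return mean, se, z, p_value

# Two made-up estimates with equal precision.
mean, se, z, p_value = weighted_mean_beta([-1.0, -1.0], [0.5, 0.5])
```

Precisely estimated coefficients dominate the weighted mean, so a few noisy per-horizon regressions have little influence on the summary figure.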

Table 2. Naïve vs. Rational Priors

Update at lag | Beta coefficient (weighted mean) | Proportion of significant coefficients: Positive | Proportion of significant coefficients: Negative
1 | -0.686 *** | 0.00 | 0.30
2 | -0.355 ** | 0.00 | 0.14
16 | 0.278 * | 0.00 | 0.00

This table summarizes the results of 23 regressions by forecast horizon. Each regression includes the full list of available forecast updates, plus a constant term. The mean coefficient is weighted by the inverse of the variance. Significance level denoted by *** (1 percent); ** (5 percent); * (10 percent), based on a z-test of the null that the weighted mean is equal to zero. “Positive” and “Negative” give the proportion of individual coefficients that are statistically significant at the 1 percent level, and >0 and <0 respectively. Standard errors clustered by country. The predicted sign under rational priors is highlighted in bold.

Looking at the mean beta coefficient estimates, only two additional lagged forecast updates (in addition to the most recent update) are statistically significant, and of these one has the wrong sign. A similar pattern emerges from the raw per-forecast horizon results. Hence, the evidence is not supportive of the rational priors case. In terms of forecaster behavior, this result suggests that forecasters do not adjust the previous month’s consensus forecast in updating their forecasts, but rather use it naively. In fact, the significant negative average coefficient on the first lagged update (as opposed to the positive coefficient predicted under rational priors) suggests that even the raw consensus forecast is not fully absorbed into the prior in the first month. Given the evidence against rational priors, the remainder of the results therefore focus on the simpler naive priors case, and include only the most recent forecast update.

Tables 3 and 4 present the key results. They show regression results for univariate specifications with the consensus forecast errors as the dependent variable and the most recent update to the consensus forecast as the independent variable for each of the 23 forecast horizons, based on the methodology outlined in Section II C. Table 3 presents results for simple regressions with the raw forecast error as the dependent variable, while Table 4 presents results in which the forecast error is adjusted using residuals from prior forecast periods.19

Table 3. Baseline Efficiency Tests

Standard errors clustered by country. Significance level denoted by *** (1 percent); ** (5 percent); * (10 percent). Dependent variable is mean (consensus) forecast error.

Table 4. Efficiency Tests: Iterative Error Adjustment

Standard errors clustered by country. Significance level denoted by *** (1 percent); ** (5 percent); * (10 percent). Dependent variable transformed by iterative subtraction of residuals, as described in text.

Table 5. In-Sample Efficiency Gains, by Country

Country | Efficiency Gain (Percent): Adjusted Forecast | Efficiency Gain (Percent): Next Month’s Consensus
China P.R.: Hong Kong | 4.9 | 3.6
China P.R.: Mainland | 2.9 | 2.9
Czech Republic | 6.7 | -0.8
Korea Republic of | 11.0 | 4.9
New Zealand | 0.5 | 1.5
Russian Federation | 5.0 | 5.2
Slovak Republic | 3.1 | 6.6
Taiwan Province of China | 6.1 | 3.0
United Kingdom | 0.2 | 2.5
United States | 0.0 | 2.7

Author’s calculations, as in text. Forecasts made for years to 2006. Forecast horizon = 1 not included, to facilitate comparison between columns 2 and 3.

There is highly significant evidence of underweighting of new information in the consensus forecasts, as predicted. The results are strongest at shorter forecast horizons: the largest beta coefficient estimate in absolute terms is for the 8-month horizon, using both the raw and adjusted forecast errors. Coefficient estimates are somewhat larger using the raw errors as the dependent variable as in Table 3, but the resulting adjustment accounts for a much larger share of the overall forecast error, reflected in significantly higher R² values, for the adjusted forecast error specifications in Table 4.20 Coefficient estimates are also more uniform and significant over a larger range of forecast horizons in this latter case.

These results shed some light on the relative variance of idiosyncratic and shared information available to forecasters at different forecasting horizons. Using the results from Table 4 as representing the best estimates, the estimated coefficient on the consensus forecast update is generally in the range of -0.5 to -1.5, suggesting that the variance of forecasters’ idiosyncratic signals σ_ε², at the shorter forecast horizons where the effect is significant, is of a similar magnitude to the variance of the common signals, σ_v². This implies a value of F of around 0.5: the consensus forecast typically absorbs only half of the aggregate new information available each period.
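The arithmetic behind this reading can be made explicit. The mapping below assumes that the regression slope equals -b with b equal to the ratio of idiosyncratic to common signal variance, and that F is the ratio of common variance to total signal variance. Both are assumptions consistent with the statement that similar variances imply F of around 0.5, not formulas quoted from the paper.

```latex
\begin{align*}
b &= \frac{\sigma_\varepsilon^2}{\sigma_v^2} = \frac{1-F}{F}
  \quad\Longrightarrow\quad F = \frac{1}{1+b}, \\
b = 0.5 &\;\Rightarrow\; F = \tfrac{1}{1.5} \approx 0.67, \qquad
b = 1.0 \;\Rightarrow\; F = 0.50, \qquad
b = 1.5 \;\Rightarrow\; F = \tfrac{1}{2.5} = 0.40 .
\end{align*}
```

On this reading, the reported coefficient range corresponds to the consensus absorbing between roughly 40 and 67 percent of the aggregate new information, with the midpoint at one half.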

C. Efficiency Gains from Adjusted Consensus Forecasts

The next step is to calculate the efficiency gain associated with an appropriate adjustment of the consensus forecasts, as in (31). In all cases, the naive priors case is assumed to be the relevant one. A complicating factor is that the relative variance of private and public signals is likely to differ systematically across countries as well as across time periods, and a good correction technique should allow for this. Allowing b_j to differ across each country and time horizon would risk over-parameterizing the model, reducing the precision of parameter estimates and worsening the model’s out-of-sample performance. However, analysis of the b_h coefficients in Table 3 suggests a systematic pattern across three broad time horizons, which points to a relatively parsimonious parameterization, in which the slope coefficient b_j is assumed to differ across each country c and within the three broad time horizons. The following specification is therefore run:

Table 6. In-Sample Efficiency Gains, by Forecast Horizon

Forecast Horizon | Efficiency Gain (Percent): Adjusted Forecast | Efficiency Gain (Percent): Next Month’s Consensus

Author’s calculations, as in text. Forecasts made for years to 2006. Drop forecast horizon = 1 to allow comparison between second and third columns.

Out-of-sample efficiency gains are somewhat lower overall, but not markedly so (Tables 7 and 8).22 Again, out-of-sample efficiency gains are somewhat better than the gains associated with having an additional month’s worth of data to inform the individual forecasts. Some countries with very good in-sample performance (e.g. Indonesia) register a very poor out-of-sample performance, suggesting that some caution should be exercised in extrapolating in-sample patterns based on extreme events (such as the Asian crisis). Overall, however, one could have improved near-term forecast accuracy for forecasts made in 2007 and 2008 by more than 5 percent, with respect to the raw consensus forecast, based on the adjustment technique outlined here and data available through end-2006.
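The out-of-sample comparison can be sketched as below: the slope is fitted on targets up to 2006 and both gains are then measured on the held-out years. The single pooled slope, the function name, the argument layout and the synthetic data are simplifying assumptions of this sketch, not the country and horizon specification used in the paper.

```python
import numpy as np

def rmse(forecast, actual):
    return float(np.sqrt(np.mean((forecast - actual) ** 2)))

def out_of_sample_gains(year, current, previous, following, actual,
                        last_fit_year=2006):
    """Percent RMSE reductions in the holdout years, relative to the raw consensus.

    Returns the gain from the adjusted forecast and the gain from waiting for
    the following month's consensus.
    """
    fit = year <= last_fit_year
    hold = ~fit
    update = current - previous
    error = current - actual
    # Slope estimated only on the fitting sample (regression includes a constant).
    slope = np.polyfit(update[fit], error[fit], deg=1)[0]
    adjusted = current - slope * update
    base = rmse(current[hold], actual[hold])
    gain_adjusted = 100.0 * (1.0 - rmse(adjusted[hold], actual[hold]) / base)
    gain_next = 100.0 * (1.0 - rmse(following[hold], actual[hold]) / base)
    return gain_adjusted, gain_next

# Synthetic panel in which each consensus absorbs half of the current shock.
rng = np.random.default_rng(1)
n = 20_000
year = rng.integers(1990, 2009, size=n)          # target years 1990 to 2008
previous = rng.normal(2.0, 1.0, size=n)
shock_now, shock_next, shock_later = rng.normal(size=(3, n))
current = previous + 0.5 * shock_now
following = current + 0.5 * shock_next
actual = previous + shock_now + shock_next + shock_later

gain_adjusted, gain_next = out_of_sample_gains(
    year, current, previous, following, actual
)
```

Which of the two gains is larger depends on the assumed shock variances; in the synthetic data above nothing forces the ranking reported in the paper's tables.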

Table 7. Out of Sample Efficiency Gains, by Country

Country | Efficiency Gain (Percent): Adjusted Forecast | Efficiency Gain (Percent): Next Month’s Consensus
China P.R.: Hong Kong | 4.6 | 2.8
China P.R.: Mainland | 4.7 | 4.8
Czech Republic | -25.7 | -1.1
Korea Republic of | 5.6 | 1.6
New Zealand | 2.0 | 5.1
Russian Federation | 3.2 | 0.9
Slovak Republic | -3.4 | 1.4
Taiwan Province of China | 4.3 | 2.2
United Kingdom | 2.0 | 4.7
United States | 0.8 | 6.0

Author’s calculations, as in text. Forecasts made from Jan 2007 onwards, for 2007/08. Forecast horizon = 1 not included, to facilitate comparison between columns 2 and 3.

Table 8. Out of Sample Efficiency Gains, by Forecast Horizon

Forecast Horizon | Efficiency Gain (Percent): Adjusted Forecast | Efficiency Gain (Percent): Next Month’s Consensus

Author’s calculations, as in text. Forecasts made from Jan 2007 onwards, for 2007/08. Drop forecast horizon = 1 to allow comparison between second and third columns.

D. Robustness Checks

This subsection briefly outlines the results of three robustness checks. The first simply replicates the results in Table 3 using the median forecast (rather than the mean) as the appropriate definition of the consensus forecast (Table 9). Results are almost identical to the baseline.

Table 9. Efficiency Tests: Median Forecasts

Standard errors clustered by country. Significance level denoted by *** (1 percent); ** (5 percent); * (10 percent). Dependent variable is median, rather than mean, forecast error.

The second robustness exercise deals with the issue of data revisions. A number of authors have suggested that forecasters aim to predict early releases of the data rather than subsequent revised estimates: this issue is of particular relevance to forecasts of GDP growth, where the data go through several rounds of (often substantial) data revisions up to a year after the end of the period in question (Keane and Runkle, 1990; Loungani, 2001). Following Loungani (2001), one can obtain preliminary or “real time” estimates of GDP growth using GDP estimates taken from the relevant version of the IMF’s WEO data archive. Specifically, the data presented in the April version of the dataset typically represents a first estimate of GDP growth during the previous year. Table 10 replicates the results in Table 3 using these ‘real time’ measures of actual GDP growth, which are available for 2002-08. Results are again almost identical to the baseline: in fact, the coefficient on the forecast update is now negative and significant across a wider set of individual forecast horizons, despite the reduction in sample size.

Table 10. Efficiency Tests: Real Time Growth Data

Standard errors clustered by country. Significance level denoted by *** (1 percent); ** (5 percent); * (10 percent). Forecast error calculated with respect to real time data (April WEO of following year).

As an additional robustness check, a similar analysis was undertaken using estimates of quarterly, seasonally adjusted US nominal GDP (in levels) taken from the Survey of Professional Forecasters (SPF).23 The SPF is a survey of professional economic forecasters maintained by the Federal Reserve Bank of Philadelphia. Forecasts are made quarterly, and forecasts are available for up to four quarters ahead, in addition to a nowcast of the current quarter and a backcast of the previous quarter. Since the furthest out horizon is dropped (because there is no base for calculating the forecast update), and the backcast is somewhat redundant (because initial GDP estimates are already available for that quarter when the backcast is made), there are four relevant forecast horizons available, from the current quarter to three quarters ahead.

Table 11 replicates the results of Table 3 for the SPF dataset. The point estimates for -b have the correct (negative) sign at all four horizons, and are statistically significant for the 2- and 3-quarter ahead forecasts. Efficiency gains from adjusted forecasts are presented in the final row of the table. Gains are in a broadly similar range to those for the Consensus Economics dataset, albeit slightly lower. The largest gain, at the 3-quarter ahead horizon, is 5 percent. Interestingly, the relative magnitudes of gains at the different forecast horizons are similar to those using the CE data, with gains largest at forecast horizons of 2 to 3 quarters, and smaller gains at the shorter horizons. The estimates of b suggest that the under-weighting of new information is somewhat less pronounced for the SPF dataset than for the Consensus Economics data. Other things being equal, this suggests that forecasters in this case have less informative idiosyncratic signals, compared to the shared public information, which seems plausible given the significant quantity of public information available on the U.S. economy (the relatively small efficiency gains available for the U.S. using the CE data presented in Tables 5 and 7 point to a similar conclusion).

Table 11. SPF Nominal GDP Forecasts

IV. Discussion

A. Morris and Shin (2002): A Reassessment

Morris and Shin (2002) use a global games framework to model Keynes’s (1936) beauty contest. Keynes argues that agents’ attempts to second-guess the consensus forecast can damage the information-revelation role of market prices. Morris and Shin extend this logic to show that more transparent (lower variance) public signals could lead to higher forecast errors on the part of market participants. In their model, public information is overweighted relative to agents’ private signals because it is more helpful for second-guessing other agents and hence for aligning an agent’s own forecast more closely with the consensus. More accurate public signals can exacerbate this overweighting problem, potentially increasing the volatility of agents’ individual forecasts.

Morris and Shin’s arguments have been criticized, notably for the ad-hoc nature of the beauty contest element and because the parameter values necessary for their argument on the negative effect of public information provision to hold are unrealistic (Svensson, 2006). In particular, the public information must be at least eight times noisier than the private information, which seems unlikely (in the context of making central bank economic forecasts public) in light of evidence on the apparent superiority of central bank forecasts over their private sector counterparts (Barakchian and Crowe, 2009; Romer and Romer, 2000).

This apparent superiority may be exaggerated by comparing central bank forecasts with inefficient consensus forecasts rather than an efficient aggregator of the private sector’s information set. However, the logic of our model suggests that, even without the beauty contest element, the provision of more accurate public information could harm the accuracy of the consensus forecast, even if it has an unambiguously positive effect on the accuracy of individual forecasts. Given the attention paid to the consensus forecast and its potential role for uninformed agents in the economy in forming their expectations (particularly if agents do not make the necessary adjustment to the consensus forecast in practice), this could argue against greater transparency.

To formalize this argument, I first illustrate the effect of changing the accuracy of public information in the context of the simple signal extraction model, before analyzing the effect in Morris and Shin’s model that includes the beauty contest element. In both cases more accurate public information is modeled as a reduction in the variance of the public prior, $\sigma_v^2$. In the context of the signal extraction model outlined in Section II.A, the mean squared consensus forecast error is given by:
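As a hedged illustration rather than the paper's own derivation, the Python sketch below evaluates this mean squared error under assumptions not taken from the text: the outcome equals the common prior mean plus a shock with variance `var_prior`, each forecaster observes the outcome plus independent noise with variance `var_signal`, and there is a continuum of forecasters so that the noise averages out of the consensus. Under those assumptions the consensus error is (1 - w) times the shock, where w is the Bayesian weight on the private signal. All function names are hypothetical.

```python
def signal_weight(var_prior, var_signal):
    """Bayesian weight on the private signal (assumed normal two-signal model)."""
    return var_prior / (var_prior + var_signal)


def consensus_mse(var_prior, var_signal):
    """MSE of the mean forecast when idiosyncratic noise averages out."""
    w = signal_weight(var_prior, var_signal)
    # Consensus error is (1 - w) * shock, so its variance is (1 - w)^2 * var_prior.
    return (1.0 - w) ** 2 * var_prior


def individual_mse(var_prior, var_signal):
    """MSE of a single forecaster's posterior mean."""
    w = signal_weight(var_prior, var_signal)
    # Error is (1 - w) * shock - w * noise, with shock and noise independent.
    return (1.0 - w) ** 2 * var_prior + w ** 2 * var_signal


if __name__ == "__main__":
    var_signal = 1.0
    for var_prior in (4.0, 2.0, 1.0, 0.5):
        print(var_prior,
              round(consensus_mse(var_prior, var_signal), 4),
              round(individual_mse(var_prior, var_signal), 4))
```

With the private-signal variance fixed at 1, lowering the prior variance from 4 to 2 raises the consensus MSE in this sketch while lowering the individual MSE. Once the prior variance falls below the signal variance, both decline together.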

B. Groupthink, Bank Behavior and the Credit Crunch

The insights from this paper can also be related to recent work on groupthink or collective delusion by Benabou (2008). Benabou’s model is based on a modification to a standard utility function to incorporate anticipatory feelings (i.e. deriving utility from expected as well as actual consumption). With this modification, Benabou is able to generate externalities from individual behavior that give rise to equilibria with self-fulfilling collective delusion. Agents rationally choose to ignore pertinent information, giving rise to collective judgments that differ systematically from reality.

The phenomenon of groupthink involves a number of features that distinguish it from the model in this paper, notably the introduction of anticipatory feelings that lead to rational self-denial (and ultimately mutually reinforcing denial within groups) and the absence of a major role for idiosyncratic private information. Nevertheless, there are some important complementarities between the two approaches. The collective failure to rationally process the available aggregate information, which leads to the discarding of useful information, is the most obvious feature common to both models. Another commonality is the role of common group beliefs. In the groupthink model agents choose to ignore information at odds with the group prior, leading to a less efficient aggregation of private signals. In the simple signal extraction model in this paper the common prior is also overweighted collectively, although this behavior is individually rational.26

Moreover, as discussed in relation to Morris and Shin’s paper, a more accurate (lower variance) prior can lead to a less accurate collective forecast in our model. This result can provide insights into some of the groupthink phenomena discussed in Appendix A to Benabou’s paper. An example of relevance to the recent U.S. subprime crisis is that of groups of bank employees who appeared to systematically underestimate the risks associated with complex structured assets. For instance, increased use of sophisticated quantitative models for pricing assets may have genuinely provided a more accurate assessment of expected risks and returns that led to a lower-variance prior among managers and a (positive) direct effect on the accuracy of their consensus forecast. However, this more accurate signal then caused the managers, receiving diffuse but ultimately correlated signals of potential downside risk (e.g. from their knowledge of the underlying assets’ quality and experience of running traditional loan books), to rationally put greater weight on the group prior, collectively underweighting their noisy but (in aggregate) informative private signals, and therefore reducing consensus forecast accuracy.

The condition for this underweighting effect to dominate the direct effect—that the group prior be no less noisy than the managers’ own idiosyncratic signals—seems plausible, given that the models turned out to perform rather poorly out of sample.27 In other words, the claims prior to the crisis that these new techniques were reducing risk, and the emerging consensus post-crisis that their use reduced the ability to foresee and forestall emerging risks, may both be accurate assessments. The first relates to the group prior; the second to the group posterior.

V. Conclusions

This paper has provided an empirical assessment and discussed the implications of a result first noted by Kim, Lim and Shaw (2001), that consensus forecasts tend to overweight the prior at the expense of new information. Testing this result with respect to forecasts of economic growth taken from one of the most widely used cross-country forecast datasets, one finds robust evidence of this underreaction to new information. Moreover, this underreaction is clearly due to the forecast aggregation process, as individual forecasters appear to respond more or less appropriately to new information. Applying a suitable adjustment to the consensus forecasts using additional information contained in the change to the consensus forecast from the previous month’s, one is able to obtain more accurate forecasts, particularly at horizons of a year or less (where the efficiency gain is more than 5 percent on average, even out of sample). The gain in efficiency is, on average, larger than that associated with having access to an additional month’s data to inform the forecast. There is no evidence that forecasters make the necessary adjustment in practice, suggesting that this inefficiency in the consensus forecast is not well understood even by the sophisticated market participants included in the survey.
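The adjustment summarized above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure used in the paper: the forecast error is defined as actual minus consensus, so a positive coefficient `b` means the consensus under-reacts, and `b` is taken as given (in practice it would be estimated by horizon on earlier data). The function name `adjust_consensus` is hypothetical.

```python
def adjust_consensus(current, previous, b):
    """Push the consensus further in the direction of its latest revision.

    current  -- this period's consensus forecast
    previous -- the preceding consensus forecast for the same target
    b        -- assumed slope from regressing (actual - consensus) on the revision
    """
    revision = current - previous
    return current + b * revision


if __name__ == "__main__":
    # Hypothetical example: growth forecast revised up from 2.0 to 2.4 percent, b = 0.5.
    print(adjust_consensus(2.4, 2.0, 0.5))
```

When the consensus has not been revised, the adjusted forecast equals the consensus itself, so the correction only matters in periods where new information has moved the panel.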

This paper has implications for consumers of consensus forecasts. Those using the forecasts need to be aware of the nature and source of their inefficiency as aggregators of the information available to market participants. Forecast aggregation services (such as the one whose data are employed in this paper) may need to adjust their methodology to provide more efficient consensus forecasts (although, for many countries, the efficiency gain is likely to be relatively small). One method is to employ an ex post adjustment similar to the one used in this paper. An alternative method would be to attempt to elicit individual forecasts that place a higher weight on each forecaster’s private signal compared to their best forecast. One means of achieving this in practice would be to provide incentives for strategic forecasting such that forecasters attempt to differentiate their forecasts from those of others (Ottaviani and Sørensen, 2006). Alternatively, forecasters could be invited to submit two forecasts: their best guess (as currently) and a noisier forecast based only on their newest information.

Although the paper’s main insights relate to the aggregation of individual forecasts, the paper also has implications for those engaged in forecasting themselves. Forecasting technologies that decentralize information-gathering but centralize the production of a final forecast (as are typically employed in a central bank) may help to alleviate this problem of overweighting the prior. For instance, when combining insights from a number of quantitative forecasting models, it is better to apply judgment (i.e. shade towards the prior) only at the level of the aggregate forecast, rather than at the level of each individual model, particularly if in the latter case the weights placed on each model’s ‘raw’ results relative to the prior are endogenously determined by their relative noisiness, either explicitly or implicitly.


Testing for Individual Forecast Rationality

Within the context of this paper, which is concerned primarily with establishing whether consensus forecasts are inefficient aggregators of individual information, individual forecast rationality is pertinent only to the extent that it sheds light on the performance of consensus forecasts. In particular, if one can establish that individual forecasts are approximately rational, then any inefficiency identified in consensus forecasts can be attributed to aggregation, and not to underlying forecaster behavior.

Individual forecasts are rational as long as the signals are weighted as in (14):

Taking this relationship to the data is hindered by the fact that we do not observe the prior μchτ. However, under the null of individual forecaster rationality the previous month’s consensus forecast can be used instead, since rationality implies a zero coefficient in a regression of forecast errors on either (i.e., assuming μh, is available) or (substituting for μh). The first case is intuitive, and follows from (40). In the second case:

for each forecast target will be the same (this corresponds to the case of, which is also the only case in which the consensus forecast is efficient, and indeed there is nothing to be gained from employing pooled individual as opposed to consensus forecasts in this instance).

In fact the same argument applies to pooling forecasts across time horizons, both for individual forecasts and for the consensus forecast. Again, the problem is that information sets differ across time horizons, so that rationality arguments can no longer be invoked to justify the assumption of orthogonality between forecasts and forecast errors. Hence, for a given realization of ycτ, forecast errors are once again positively correlated with forecasts when observations from different forecast horizons are pooled. To illustrate this, note that, assuming rationality, the residuals from (42) are given by:

where eichτ are therefore uncorrelated with for a given forecaster i. However, the presence of the last term in (43) means that eichτ are correlated with, for Hence, pooling across time periods introduces correlation between regressors and the error term.28

Given these constraints, this paper follows Davies and Lahiri (2001) and exploits a further characteristic of rational individual forecasts—that forecast revisions are uncorrelated with information known at the time of the earlier forecast—to test for rationality. Specifically, I test for correlation between forecast revisions at adjacent forecast horizons, since the previous forecast revision is known at the time of the current revision and the two should therefore be uncorrelated. It turns out that this Martingale test of efficiency is not subject to some of the empirical concerns raised with respect to more conventional tests based on forecast errors. It also allows one to ignore issues of data revision, as it does not make reference to the actual data y.
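A minimal sketch of such a martingale test for one forecaster is given below in Python. It assumes the forecaster's successive forecasts of a single target are available in chronological order, computes the revisions, and regresses each revision on the one before it by ordinary least squares with an intercept. The function names are hypothetical, and the sketch omits the pooling across targets, the clustered standard errors and the significance testing that a full analysis would need.

```python
def revisions(forecasts):
    """Successive forecast revisions for one target, oldest forecast first."""
    return [later - earlier for earlier, later in zip(forecasts, forecasts[1:])]


def martingale_slope(forecasts):
    """OLS slope of the current revision on the previous revision.

    A slope near zero is what rational forecasting implies; a positive
    slope points to under-weighting of new information.
    """
    r = revisions(forecasts)
    x, y = r[:-1], r[1:]  # lagged revision, current revision
    n = len(x)
    if n < 2:
        raise ValueError("need at least three revisions")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("lagged revisions have no variation")
    return sxy / sxx
```

Because the regression uses only forecasts and never the realized outcome, the sketch shares the property noted above: it is unaffected by the choice of data vintage.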

In the context of the assumed information structure, the forecast revision at horizon h is given by:

These results suggest some modest underweighting of new information on the part of individual forecasters. The (weighted) mean estimated coefficient is positive, as is the median, while 11 percent of individual coefficients are positive and statistically significant (while less than 2 percent are negative and significant). A positive estimate of β implies that the relative weight put on new information is below the optimal.

However, this under weighting is limited compared to that identified for the consensus forecast. It turns out that, although the Martingale efficiency test undertaken for the individual forecasters is somewhat different to the simpler test undertaken for the consensus forecasts, the estimated β and b coefficients are directly comparable:

Analysis of SPF data

The Survey of Professional Forecasters (SPF) is carried out by the Federal Reserve Bank of Philadelphia, which took over responsibility for the survey from the ASA/NBER in 1990. A panel of professional forecasters is surveyed every quarter, just after the release of initial GDP estimates for the previous quarter by the Bureau of Economic Analysis in the first month of the quarter. There is significant persistence in the make-up of the forecaster panel, at least at the institutional level, although forecasters also drop in and out of the sample. For the post-1992 sample used in the analysis, the number of forecasters each quarter varies between 30 and 52.

The survey contains forecasts of 31 economic variables. This paper uses data on one of these, nominal GDP (seasonally adjusted, annual rate, in levels). This series goes back to 1968. However, prior to 1992 the forecasts were of nominal GNP not GDP. In addition, the documentation on the data prior to the Philadelphia Fed assuming responsibility for the survey in 1990 is limited (e.g. it can be harder to consistently identify forecasters across time, and the relative timing of forecasts and data releases is not so clear). Hence, this study uses only the GDP data, covering forecasts made in 1992Q1 through 2009Q4. Since this data is not expected to contain any outliers, unlike the Consensus Economics (CE) data, it was not trimmed.

Actual GDP data for comparison is taken from the Federal Reserve Bank of St Louis’s ALFRED database, which contains GDP estimates for different vintages. An important question is which GDP release forecasters are forecasting: the first estimate, available in the first month after the quarter in question has ended; the first revised estimate, available the next month; or subsequent revised estimates. A comparison of forecast errors suggested that forecasters aimed to match the first estimate, so this estimate was used as the actual for comparison.

Since nominal GDP has grown significantly over time, using raw data in levels would create a significant problem of heteroskedasticity. Effectively, recent data would receive a much higher relative weight. Hence, the data must be scaled appropriately to ensure that the magnitude of the observations remains stationary. For the consensus forecast regressions, both the forecast error and forecast update are divided by the previous quarter’s consensus forecast (the proxy for the prior). For the individual forecast regressions, the most recent and the lagged forecast update are divided by the two-period lagged forecast. Otherwise, the methodology is identical to that underlying the baseline results presented in Table 3 and the results for individual forecasts presented in Table A1.
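The scaling used for the consensus regressions can be sketched as follows. This Python fragment is illustrative only: the function name and the example figures are hypothetical, and the forecast error is assumed to be defined as actual minus forecast.

```python
def scale_by_prior(actual, consensus, prior):
    """Return (scaled forecast error, scaled forecast update).

    prior is the previous consensus forecast of the same quarter, used as
    the proxy for the common prior. Dividing by it expresses both series
    as proportions of the level of nominal GDP.
    """
    if prior <= 0:
        raise ValueError("prior must be a positive level")
    error = (actual - consensus) / prior
    update = (consensus - prior) / prior
    return error, update


if __name__ == "__main__":
    # Hypothetical levels in billions of dollars.
    print(scale_by_prior(actual=10100.0, consensus=10050.0, prior=10000.0))
```

Dividing both series by the same positive number leaves the slope of a regression of one on the other unchanged within an observation, while preventing later, larger observations from dominating the fit.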

Table A1. Individual Regression Results Summary

Results for individual SPF forecasters are presented in Table A2. This table summarizes results for regressions for the 57 individual forecasters in the sample for whom there were at least 40 forecast observations across the relevant forecast horizons. As with the CE data, there is some modest evidence that forecasters under-weight new information, leading to positive correlation across forecast revisions, although the extent of under-weighting is significantly less than for consensus forecasts. Undertaking the same back-of-the-envelope calculations as before, the estimate of b is around 0.5 for the forecast horizons at which it is significantly different from zero (Table 11). This implies an actual weight on new information of around 0.67. Using the expression for β and its estimate, the individually optimal weight on new private information is around 0.8. The estimated actual weight, 0.67, is thus too low, both from the perspective of optimal individual forecasts (where the weight should be 0.8) and from that of optimal consensus forecasts (where the weight should be 1). As with the CE dataset, the extent of under-weighting is considerably greater for consensus than for individual forecasts.
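The arithmetic behind the 0.67 figure can be reproduced with a short Python sketch. It rests on assumptions not spelled out in this passage: the consensus equals the prior plus a weight w times the gap between the average signal and the prior, the average signal equals the outcome, and b is the slope of the consensus forecast error (actual minus consensus) on the consensus update. The error is then (1 - w) times the gap and the update is w times the gap, so b = (1 - w) / w. The function name is hypothetical.

```python
def implied_weight(b):
    """Weight on new information implied by slope b, given b = (1 - w) / w."""
    if b <= -1:
        raise ValueError("b must exceed -1 for a positive weight")
    return 1.0 / (1.0 + b)


if __name__ == "__main__":
    print(round(implied_weight(0.5), 2))  # 0.67
```

A slope of zero maps to a weight of 1, which is the benchmark for an efficient consensus forecast in this setup.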

Table A2. Individual SPF Regression Results Summary
Proportion significant
Total observations: 5,937
Notes: Summary of 57 Martingale regressions by forecaster. Significance defined at the 1 percent level. Individual regressions have standard errors clustered by year. Means are weighted by the inverse of the variance.

Comparing results in Table A1 and Table A2, the median and mean point to greater evidence of inefficiency in individual forecasts for the SPF dataset than for the CE dataset, although there is a higher proportion of significant individual coefficients in the case of the latter. No forecaster in the SPF dataset significantly over-weights new information, and there is no significant evidence of bias, positive or negative, for any forecaster in the SPF dataset.


Subject to the usual caveats, the author is extremely grateful to Kajal Lahiri, David Romer, Allan Timmermann, and Ken Wallis, seminar participants at the IMF Research Department and George Washington University and conference participants at the European Central Bank’s 6th Workshop on Forecasting Techniques for useful comments on an earlier draft, and to Prakash Loungani, Alin Mirestean, Marco Ottaviani, Mauro Roca, Hyun Shin, Natalia Tamirisa and Ken West for helpful discussions.

Kim, Lim and Shaw (2001) appears to be the only published reference on the theoretical finding, although an early draft of Ottaviani and Sørensen (2006) also makes reference to it (see also Wallis, forthcoming). Other authors have discussed in more general terms the fact that pooling forecasts is inferior to pooling information sets (see, e.g., Granger, 1989 and Timmermann, 2006), which is the underlying cause of the consensus forecasts’ inefficiency. A related issue—that consensus forecasts should not be used for testing individual forecaster rationality—has received more attention (Figlewski and Wachtel, 1983; Keane and Runkle, 1990; Bonham and Cohen, 2001).

See section 3.1 for a full description of the data and sample.

There is evidence for some positive correlation for the individual forecasters, which would suggest over weighting of new information. However, this appears due to the positive bias created by pooling across forecasters with common forecast targets noted by Zarnowitz (1985) and others. The appendix outlines a method for assessing rationality among individual forecasters that is not subject to this bias, which finds some very limited evidence of under weighting of new information.

Examples include the ECB’s Survey of Professional Forecasters and the Survey of Professional Forecasters carried out by the Federal Reserve Bank of Philadelphia, which is employed in a robustness check in this paper. The Bank of England also carries out a quarterly survey of external forecasters (see, for instance, the August 2009 Inflation Report, p.50).

Amato and Shin (2006) and Roca (2009) arrive at similar results in a New Keynesian DSGE setting. By contrast, this paper analyzes the impact of public information on consensus forecasts in the context of Morris and Shin’s (2002) global games model, and its special case (without the beauty contest element)—the simple signal extraction model with public and private signals.

The fact that this aggregation problem shows up with respect to expectations is particularly worrisome given the central role ascribed to rational expectations in most modern macroeconomic models.

This section replicates the key results in Kim, Lim and Shaw (2001) for the case with a continuum rather than a finite number of individual forecasters.

The difference in performance between consensus and individual forecast rationality tests has been noted before. For instance, Batchelor and Dua (1991) argue that rationality tests on the consensus forecasts provide a poor guide to the extent of rationality among individual forecasters. Similarly, Harvey and others (2001) note the strong autocorrelation of revisions to consensus forecasts, which, they argue, implies that the consensus forecasts cannot be optimal. However, this phenomenon was far less apparent when the records of particular individual forecasters were examined.

To match the apparent data generating process (DGP) more closely one could assume heterogeneous priors. However, the model outlined here approximates the DGP reasonably well while maintaining traditional assumptions about forecaster rationality.

Davies and Lahiri (1995) implicitly set (that is, forecasters fully absorb the common shock on average) but forecasts are also dispersed around the mean. This is inconsistent with optimal forecasting behavior: the common shock νh will only be fully absorbed if it is observed without any noise, otherwise it is optimal to place some weight on the prior (). If forecasters have access to a signal without noise, then they will ignore any noisy signal and the noise term εih should therefore not enter into their forecast. The forecast structure that Davies and Lahiri propose is consistent only with a measurement error explanation for

Since the theory predicts that consensus forecasts are unbiased (as long as the underlying individual forecasts are unbiased), the adjustment does not take into account the estimated bias coefficient ah.

The number of regressions from which the underlying beta coefficients are drawn varies across updates at different lags. E.g., the update at lag 1 is included in all 23 underlying regressions, while the update at lag 23 is included only in the regression at. Note that these results exclude observations for countries and years with bimonthly forecasts, since this specification requires the full set of 23 forecast horizons.

The results in Table 4 exclude countries and years with bimonthly forecasts, since the adjustment methodology requires the full set of 23 forecast horizons.

This is what one would expect, since a large fraction of the uncertainty at horizon h that shows up in the residual is resolved in subsequent forecast periods, and subtracting subsequent residuals from the forecast error therefore tends to reduce the unexplained (residual) portion of the error, leading to a higher $R^2$.

In fact, gains are significantly higher for some countries and time periods.

Nominal as opposed to real GDP was used to avoid the additional complications associated with matching forecasts with appropriate vintages of actual GDP data, given changes in base years. Actuals are the first GDP estimate, available in the first month of the subsequent quarter. There is some ambiguity as to whether forecasters are aiming for the first GDP estimate or the second, revised, estimate available in the second month of the subsequent quarter. However, forecast errors are smaller using the first estimate as actual at all forecast horizons, suggesting that this is the appropriate benchmark. Because nominal magnitudes increase significantly over the sample period, forecast updates and forecast errors are all divided by the relevant prior (previous consensus forecast) to minimize heteroskedasticity. This is similar to taking a log transformation, and indeed results based on taking logs of individual forecasts prior to aggregating to form the consensus forecasts are almost identical. The scaling method used in the paper is preferred to taking logs as the latter could artificially introduce forecast bias (due to the Jensen inequality). The SPF data before 1992 refer to GNP rather than GDP: to avoid additional complications raised by changing output definitions, data prior to 1992 is dropped. Additional details of the SPF dataset and the analysis, including results for individual forecasts, where (as for the Consensus Economics data) there is evidence for some minor under-weighting of new information, are in the Appendix.

Benabou’s mechanism is also individually rational conditional on the inclusion of anticipatory feelings in the utility function.

Explanations for the crisis have tended to emphasize that the models performed worse than expected (i.e. their signal to noise ratio turned out to be lower than believed). However, while this may indeed have been the case, it is not necessary as an explanation for the crisis.
