### MHW observation

MHWs were identified based on v.2.1 of NOAA’s Optimum Interpolation Sea Surface Temperature (OISST v.2.1)^{41,42}. OISST v.2.1 was released in April 2020, and is identical to v.2.0 for data up until 2015 but includes significant quality improvements starting in 2016 (https://www.ncdc.noaa.gov/oisst/optimum-interpolation-sea-surface-temperature-oisst-v21). We obtained SST data at daily frequency and 0.25° horizontal resolution from NOAA’s Physical Sciences Laboratory (https://psl.noaa.gov/data/gridded/data.noaa.oisst.v2.highres.html).

The bulk of our analysis was performed using monthly SST data for both observations and forecasts, but see the section ‘Sensitivity to defining MHWs from daily versus monthly SST’ for a discussion of the implications and practicality of using daily instead of monthly output. Daily 0.25° OISST data were averaged to monthly temporal resolution and 1° spatial resolution for consistency with the forecasts being evaluated (see the next section). MHWs were identified based on methods proposed in a previous study^{43} and adapted for monthly data as described in ref. ^{19}. First, SST anomalies at each grid cell were computed by subtracting the 1991–2020 monthly climatology. MHW thresholds specific to each month of the year were then calculated as the 90th percentile of observed SST anomalies in a 3-month window (for example, for January MHWs, the 90th percentile of all December to February SST anomalies). SST anomalies were then converted to binary time series (MHW or no MHW) depending on whether they were above or below their respective thresholds.
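The monthly MHW identification described above can be sketched as follows. This is a minimal illustration for a single grid cell, assuming a complete monthly series starting in January 1991; the function and variable names are ours, not from the study's code.

```python
import numpy as np

def identify_mhws(sst, q=90):
    """Classify each month of a 1991-2020 monthly SST series as MHW / no MHW.

    Follows the monthly adaptation described in the text: subtract the
    per-calendar-month climatology, then compare anomalies against a
    seasonally varying threshold (the q-th percentile of anomalies pooled
    over a 3-month window centred on each calendar month).
    """
    sst = np.asarray(sst, dtype=float)
    n = sst.size
    months = np.arange(n) % 12                    # 0 = Jan, ..., 11 = Dec

    # Monthly climatology and anomalies
    clim = np.array([sst[months == m].mean() for m in range(12)])
    anom = sst - clim[months]

    # Seasonally varying threshold, e.g. the January threshold is the 90th
    # percentile of all December-February anomalies
    thresh = np.empty(12)
    for m in range(12):
        window = np.isin(months, [(m - 1) % 12, m, (m + 1) % 12])
        thresh[m] = np.percentile(anom[window], q)

    return anom >= thresh[months]                 # binary MHW time series
```

By construction of the 90th-percentile threshold, roughly 10% of months at each grid cell are flagged as MHWs.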

### Global climate forecasts

Underlying the MHW forecasts described in this study are seasonal SST forecasts obtained from six global climate models contributing to the North American Multimodel Ensemble^{13,14}. For each of the six models, an ensemble of forecasts is initialized each month, with the number of ensemble members and the forecast lead time varying between models (Extended Data Table 1). In addition to real-time forecasts, a multidecadal set of reforecasts has been performed for each model. Reforecasts, sometimes also referred to as retrospective forecasts or hindcasts, are forecasts simulated for past periods using only information available at the time of forecast initialization (that is, ignoring information that has subsequently become available). The long historical suite of (re)forecasts is necessary to rigorously evaluate the skill and biases of the forecast systems. Here we obtained monthly averaged SST forecast output for 1991–2020, a period available from all six models, from the IRI/LDEO climate data library (https://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/). Output from all models is served on a common grid with 1° resolution in longitude and latitude.

### MHW forecasts

To develop MHW forecasts based on the SST forecasts described above, a series of steps were performed for each model. First, the reforecast and forecast periods were concatenated to produce a single set of forecasts for analysis. For models that have more ensemble members in the real-time forecasts than in the retrospective forecasts (Extended Data Table 1), we kept the same number of ensemble members as the retrospective forecasts to maintain consistency throughout the analysis period. Next, the model mean forecasts were calculated by averaging together the individual ensemble members of each model. The model mean forecasts were used to calculate model-specific monthly forecast climatologies for each initialization month and lead time, as is customary in climate forecast skill evaluation^{44,45}, and forecast anomalies were calculated for each individual ensemble member by subtracting the model mean climatology. Next, seasonally varying MHW thresholds for each model, lead time and initialization month were calculated with the same methodology described above for SST observations. Forecasts with SST anomalies at or above their respective thresholds were classified as MHWs, resulting in an ensemble of forecasts for binary outcomes (MHW or no MHW). The above steps were repeated for each of the six models, resulting in a multimodel ensemble of 73 members that was used to generate probabilistic monthly MHW forecasts. As forecasts are initialized at the beginning of the month, and we report monthly averages, lead times range from 0.5 months (for example, forecasts of January MHWs, initialized at the beginning of January) to 11.5 months (for example, forecasts of December MHWs, initialized at the beginning of January).
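The core of the forecast pipeline, from raw ensemble SST to a probabilistic MHW forecast, can be sketched for a single grid cell, target month, initialization month and lead time. The array layout and names are hypothetical, and for brevity the threshold here is computed from the single target month's anomalies rather than the 3-month window used in the study:

```python
import numpy as np

def mhw_forecast_prob(sst_fc, q=90):
    """Probabilistic MHW forecast from an ensemble of SST forecasts.

    sst_fc: (n_years, n_members) array of forecast SST for one grid cell,
    target month, initialization month and lead time (illustrative layout).
    Steps follow the text: model-mean climatology -> member anomalies ->
    90th-percentile threshold -> binary member forecasts -> ensemble
    fraction as the forecast probability.
    """
    clim = sst_fc.mean(axis=1).mean()     # climatology of the model mean
    anom = sst_fc - clim                  # anomalies per ensemble member
    thresh = np.percentile(anom, q)       # simplified: no 3-month pooling
    binary = anom >= thresh               # MHW / no MHW for each member
    return binary.mean(axis=1)            # forecast probability per year
```

In the study this is repeated per model, lead time and initialization month, and the binary members from all six models are pooled into the 73-member multimodel ensemble before the probability is taken.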

### Sensitivity to defining MHWs from daily versus monthly SST

In general, the time scale of predictable events increases with forecast lead time, such that one might look at daily output from weather-scale forecasts (for example, 1–2 weeks lead time) whereas monthly output is more appropriate for seasonal forecasts (up to a year). However, although the most impactful MHWs are overwhelmingly longer-lived events (>1 month)^{46}, there is also interest in more ephemeral warm extremes (lasting days to weeks) that may be missed in monthly averaged SST. To illustrate the influence of using daily rather than monthly SST forecasts for MHW prediction, we compare forecasts of MHWs identified based on daily and monthly output from CCSM4 for the locations highlighted in Fig. 2. For these locations, we obtained daily output of forecast SST for the entire 1991–2020 period from the CCSM4 model. We then repeated the analysis of observed and forecast MHWs using daily data with the definition described previously^{43}, which requires MHW thresholds to be exceeded for at least five days. Skill metrics for MHW forecasts generated from daily SST output were calculated using the same methods as those applied to monthly SST forecasts (see ‘MHW forecast evaluation’ below).

Relative to MHW forecasts defined from monthly data, forecasts at daily resolution show shorter mean MHW durations and often slightly lower skill, but no change in the reported patterns in MHW forecast skill (Extended Data Figs. 6 and 7). The consistency between the monthly and daily forecast skill is not surprising given that MHWs defined with daily data are still strongly driven by low frequency variability. However, it is important to note that even though seasonal forecasts can predict the enhanced or reduced likelihood of MHWs on daily time scales, this does not mean that one can predict the details of a specific short (for example, five-day) warming event months in advance. Rather, the skill in MHW forecasts provided at daily resolution is still reflective of predictable longer-lived SST anomalies, and forecast skill tends to be lower for shorter-lived events (Extended Data Fig. 6).

We also note that daily output is often not publicly available for seasonal forecasts (for example, NMME output is provided as monthly averages). Fortunately, we were able to obtain daily output from the CCSM4 model to conduct the comparison shown here, but at least in the near term a global MHW forecast system will necessarily be based on monthly output. The same may not be true for subseasonal forecasts (for example, 45 days or less), for which daily MHW forecasts would be more appropriate and daily model output would more likely be available.

### Accounting for warming trends

Owing to long-term warming trends in the world’s oceans, the rate of MHW occurrence increases over time if fixed thresholds are used to identify them. This effect is prominent even over the relatively short 30-year period examined here, with MHW occurrence increasing two- to threefold if the warming trend is not accounted for (Extended Data Fig. 5). There has been debate in the literature about whether (or when) it is appropriate to retain or remove warming trends in MHW research^{3,19,47}. Here we present results in the main text for MHWs calculated from detrended SST anomalies, but all analyses have been conducted using both methods. For the detrended analysis, we removed linear trends over the 1991–2020 period from the observed SST anomalies and the lead-time-dependent forecast SST anomalies at each grid cell. In the context of MHW forecasts, warming trends may be removed or included depending on the user and the application, but it is important to understand the implications of how trends are handled. Some forecast skill metrics are sensitive to the rate of events, so if trends are retained (and MHW frequency increases over time), those skill metrics will also show trends that are unrelated to the actual capabilities of the model^{48,49} (Extended Data Fig. 5). In the following section we expand on this point in the context of specific forecast skill metrics.
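The detrending step is a per-grid-cell least-squares linear fit over 1991–2020, removed before the climatology and threshold calculations. A minimal sketch for one anomaly series (names are illustrative):

```python
import numpy as np

def detrend(anom):
    """Remove the least-squares linear trend from a monthly anomaly series.

    Applied independently at each grid cell (and, for forecasts, at each
    lead time) before MHW identification, as in the detrended analysis.
    """
    t = np.arange(anom.size, dtype=float)
    slope, intercept = np.polyfit(t, anom, 1)   # fit anom ~ slope*t + intercept
    return anom - (slope * t + intercept)
```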

### MHW forecast evaluation

Our MHW forecast assessment follows common methods for evaluating climate and weather forecast skill, particularly for extreme events, which present challenges because of their relatively rare occurrence. For forecast verification, we first classify each ensemble member at each time step according to its position in the 2 × 2 contingency table: true positives (MHW is forecast and occurs), true negatives (no MHW is forecast and MHW does not occur), false positives (MHW is forecast but does not occur) and false negatives (no MHW is forecast but MHW occurs). From the contingency table we calculate two skill metrics, the forecast accuracy and the SEDI, described below. We also calculate the Brier skill score, which is derived from the MHW forecast probability (that is, the average of the binary forecasts from all ensemble members for a given forecast). Below, each of these metrics is described further. All three skill metrics show similar spatial patterns (Extended Data Fig. 8).

Of the many skill metrics proposed for forecasts of extreme events, SEDI^{49} has several desirable qualities^{50}, including (1) it is non-degenerate, meaning that it does not trend towards a meaningless limit (for example, zero or infinity) as event rarity increases, (2) it is base-rate independent, meaning that it is not influenced by changes in the frequency of events, and (3) it is equitable, meaning its expected value is the same (zero) for random forecasts, regardless of what method is used to generate the random forecasts^{51}. SEDI is calculated as

$${\rm{SEDI}}=\frac{\log F-\log H-\log \left(1-F\right)+\log (1-H)}{\log F+\log H+\log \left(1-F\right)+\log (1-H)},$$

where *H* is the hit rate (ratio of true positives to total observed events) and *F* is the false alarm rate (ratio of false positives to total observed non-events). The maximum SEDI score is one and scores above (below) zero indicate forecasts better (worse) than random chance.
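Given counts from the 2 × 2 contingency table, the SEDI calculation is direct; a sketch (function and argument names are ours):

```python
import numpy as np

def sedi(hits, misses, false_alarms, correct_negatives):
    """Symmetric extremal dependence index from 2x2 contingency counts.

    H (hit rate) = true positives / total observed events;
    F (false alarm rate) = false positives / total observed non-events.
    Undefined when H or F equals exactly 0 or 1 (log of 0).
    """
    H = hits / (hits + misses)
    F = false_alarms / (false_alarms + correct_negatives)
    num = np.log(F) - np.log(H) - np.log(1 - F) + np.log(1 - H)
    den = np.log(F) + np.log(H) + np.log(1 - F) + np.log(1 - H)
    return num / den
```

Note that when H = F (a forecast no better than chance) the numerator vanishes, giving SEDI = 0, consistent with the equitability property described above.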

For completeness, we also calculate two additional forecast skill metrics: the Brier skill score (BSS) and forecast accuracy. The Brier score is an estimate of the mean square error of the probabilistic forecast:

$${\rm{BrS}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({f}_{i}-{o}_{i}\right)}^{2},$$

where *N* is the total number of forecasts being evaluated, *f*_{i} is the forecast probability computed from all ensemble members (that is, the fraction of forecasts predicting a MHW) for forecast *i* and *o*_{i} is the observed probability, which is either zero (no MHW) or one (MHW). The Brier skill score normalizes the Brier score relative to the skill of a reference forecast (BrS_{ref}):

$${\rm{BSS}}=1-{\rm{BrS}}/{{\rm{BrS}}}_{{\rm{ref}}}.$$

Here the reference forecast is simply the climatological rate of MHW occurrence (that is, always predicting a 10% chance of a MHW occurring). The BSS ranges from one (perfect skill) to negative infinity (no skill); as for SEDI, scores above (below) zero indicate forecasts better (worse) than random chance.
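The two equations above combine into a few lines of code; a sketch assuming forecast probabilities and binary outcomes for one grid cell (names are illustrative):

```python
import numpy as np

def brier_skill_score(f, o, p_ref=0.1):
    """BSS of forecast probabilities f against binary outcomes o.

    The reference forecast is climatology: always predicting a p_ref
    chance of a MHW (10% here, by construction of the 90th-percentile
    threshold). BSS = 1 is a perfect forecast; BSS = 0 matches climatology.
    """
    f, o = np.asarray(f, float), np.asarray(o, float)
    brs = np.mean((f - o) ** 2)            # Brier score of the forecast
    brs_ref = np.mean((p_ref - o) ** 2)    # Brier score of climatology
    return 1 - brs / brs_ref
```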

Forecast accuracy is included as a common and easily understandable skill metric; it is simply the fraction of forecasts that are correct:

$${\rm{forecast}}\;{\rm{accuracy}}=({\rm{true}}\;{\rm{positives}}+{\rm{true}}\;{\rm{negatives}})/N.$$

For events that occur on average 10% of the time, the forecast accuracy for random forecasts is 0.82. Thus, MHW forecast accuracy above (below) 0.82 indicates skill better (worse) than random chance.

Significance of forecast skill metrics is quantified using a Monte Carlo simulation with block bootstrapping. Specifically, for a given grid cell we (1) calculate the MHW decorrelation time scale, *τ* (that is, the lag at which the autocorrelation drops below 1/e), and then (2) randomly sample (with replacement) blocks of length *τ* from the observed MHW time series and concatenate them to create a forecast of length 360 months (the same as the model forecast verification period). This process is repeated to create 1,000 random forecasts, and forecast skill is calculated for each one. The 95% confidence intervals are then calculated from the skill values of the random forecasts, with significance defined as forecast skill exceeding the 97.5th percentile of the random forecast skill distribution.
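The block-bootstrap step can be sketched as follows, assuming a binary observed MHW series and a precomputed decorrelation time scale (names and defaults are illustrative):

```python
import numpy as np

def block_bootstrap_forecasts(obs_mhw, tau, n_boot=1000, seed=0):
    """Random MHW 'forecasts' built by resampling blocks of the observations.

    Blocks of length tau (the decorrelation time scale) are drawn with
    replacement from the observed binary MHW series and concatenated to
    the original series length. Skill computed over these random forecasts
    gives the null distribution; real forecast skill is significant if it
    exceeds its 97.5th percentile.
    """
    rng = np.random.default_rng(seed)
    n = obs_mhw.size
    n_blocks = int(np.ceil(n / tau))                       # enough blocks to cover n
    starts = rng.integers(0, n - tau + 1, size=(n_boot, n_blocks))
    out = np.empty((n_boot, n_blocks * tau), dtype=obs_mhw.dtype)
    for b in range(n_boot):
        out[b] = np.concatenate([obs_mhw[s:s + tau] for s in starts[b]])
    return out[:, :n]                                      # trim to series length
```

Resampling blocks rather than individual months preserves the serial persistence of MHWs, so the null distribution is not artificially narrowed by treating autocorrelated months as independent.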

When calculating time series of forecast skill (Fig. 3b and Extended Data Fig. 5), skill metrics are calculated over all grid cells at each time, rather than over all times at each grid cell. For example, the forecast accuracy for a given month in Extended Data Fig. 5b is the fraction of the ice-free global ocean for which the MHW state that month was correctly predicted. Temporal patterns in skill are similar between different metrics (Extended Data Fig. 5), with the exception that there is a base rate dependence in the forecast accuracy and in the individual components of the contingency table (true/false positives/negatives). That dependency is apparent during the strongest El Niño events (when SEDI and BSS increase but forecast accuracy declines), and also in the influence of long-term warming (Extended Data Fig. 5). If SST data are not detrended and consequently the rate of MHWs increases, then forecast accuracy declines, true and false positives increase, and true and false negatives decrease. These trends simply reflect changes in the frequency of events, whereas the performance of the forecast system (for example, as measured by SEDI) does not show a long-term trend (Extended Data Fig. 5). Thus, whether long-term temperature trends are removed or retained during MHW identification and forecasting, one must understand the implications for skill assessment.
