Testing for Trends on a Regional Scale: Beyond Local Significance

Radan Huth; Martin Dubrovský

doi:10.1175/JCLI-D-19-0960.1

Journal of Climate

Abstract
1. Introduction
2. Statistical tests
3. Synthetic data
4. Implementation of tests
5. Null distributions and critical values
6. Evaluation of tests: Type-II errors
7. Application to real data
8. Conclusions
Acknowledgments
REFERENCES

Fig. 1.

Null distributions for test statistics, 20 × 20 grid; from left to right: sign test, counting test, M-K test, Walker test, FDR test, and regression test. A local t test is used for the counting, Walker, and FDR tests. The third row (highlighted by a rectangle) is for zero temporal and spatial autocorrelations; the upper two rows illustrate the effect of nonzero temporal autocorrelations (0.3 and 0.6); the lower three rows illustrate the effect of nonzero spatial autocorrelations (0.6, 0.9, and 0.97). Note that the range of horizontal and vertical axes varies among plots. Values of the Walker test statistic are multiplied by 1000.

Fig. 2.

Effect of prewhitening on the test statistics. Null distributions are shown for the (top) counting, (middle) Walker, and (bottom) FDR tests for the (left) local M-K test, (center) local prewhitened Mann–Kendall test, and (right) local Mann–Kendall test with effective sample size, all for temporal autocorrelations of 0.0, 0.3, and 0.6 (for the first, second, and third rows in each section, respectively).

Fig. 3.

Effect of prewhitening on temporal autocorrelations: histograms of (left) temporal autocorrelations for time series with autocorrelations set to 0.0, (center) autocorrelations set to 0.3 without prewhitening, and (right) autocorrelations set to 0.3 after prewhitening. Histograms consist of values of all 10 000 realizations and 400 grid points (i.e., 4 000 000 values altogether).

Fig. 4.

Trend detectability, i.e., the magnitude of standardized trend (per 10 time units) for which type-II error is equal to 5%, plotted against the length of the series (in time units). Tests are coded by color (sign test in red, count test in pink, extended M-K test in black, Walker test in light blue, FDR test in blue, and regression test in green), local tests are coded by line (t test as thick solid, M-K test as thin solid, prewhitened M-K test as long-dashed, and M-K test with effective sample size as short-dashed), combinations of temporal and spatial autocorrelation are coded by symbol (0.0/0.0—open circles; 0.3/0.0—diamonds; 0.0/0.9—filled circles; 0.6/0.97—filled squares), and grid size is coded by the size of symbols (20 × 20 points—large symbols and thick solid lines; 10 × 10 points—small symbols and thin dashed lines). Values of trend detectability outside the range from 0.005 to 1.0 could not be estimated because of the design of the experiment and hence are not shown. Information is divided into four graphs for better visibility, focusing on (a)–(c) the dependence of trend detectability on the test for various autocorrelation values and (d) the effect of grid size.

Fig. 5.

HadCRUT4 data: (left) histogram of temporal autocorrelations and (right) histogram of spatial autocorrelations between neighboring grid points.

Editorial Type: Article

Article Type: Research Article

Radan Huth: ^a Department of Physical Geography and Geoecology, Faculty of Science, Charles University, Prague, Czechia; ^b Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czechia

Martin Dubrovský: ^b Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czechia; ^c Global Change Research Institute, Czech Academy of Sciences, Brno, Czechia

Online Publication: 19 May 2021

Print Publication: 01 Jul 2021

DOI: https://doi.org/10.1175/JCLI-D-19-0960.1

Page(s): 5349–5365

Received: 02 Jan 2020

Accepted: 26 Mar 2021

Abstract

Studies detecting trends in climate elements typically concentrate on their local significance, ignoring the question of whether the significant local trends may or may not have occurred as a result of chance. This paper fills this gap by examining several approaches to detecting statistical significance of trends defined on a grid (i.e., on a regional scale). To this end, we introduce a novel simple procedure of significance testing that is based on counting signs of local trends (sign test), and we compare it with five other approaches to testing collective significance of trends: counting, extended Mann–Kendall, Walker, false detection rate (FDR), and regression tests. Synthetic data are used to construct null distributions of trend statistics, to determine critical values of the tests, and to assess the performance of tests in terms of type-II error. For lower values of spatial and temporal autocorrelations, the sign test and extended Mann–Kendall test perform slightly better than the counting test; these three tests outperform the Walker, FDR, and regression tests by a wide margin. For high autocorrelations, which is a more realistic case, all tests become similar in their performance, with the exception of the regression test, which performs somewhat worse. Some tests cannot be used under specific conditions because of their construction: the Walker and FDR tests for high temporal autocorrelations, and the sign test under high spatial autocorrelations.

Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JCLI-D-19-0960.s1.

Corresponding author: Radan Huth, huth@ufa.cas.cz

Keywords: Statistical techniques; Time series; Trends

1. Introduction

Detection of ongoing and recent climate change is one of the most important tasks in present-day climatology. Climate change detection studies are conducted on various spatial scales of aggregation, from global means (e.g., Jones et al. 2012; Hawkins and Jones 2013; Jiménez-Muñoz et al. 2013; Karl et al. 2015; Rahmstorf et al. 2017) through continental and regional means (e.g., Giorgi 2002; Jones and Moberg 2003; Scherrer et al. 2005; de Barros Soares et al. 2017; Hong et al. 2017) to means for individual countries or their parts (e.g., Brázdil et al. 2009; El Kenawy et al. 2012; Vose et al. 2012; Gonzalez-Hidalgo et al. 2016; Tao et al. 2017) and local time series. A typical study detecting climate change on a local scale is performed at a set of stations (or, alternatively, on a regular grid) over a region, a country, or a group of countries, up to a continent. Linear trends (i.e., slopes of linear regression lines of the variable against time) are calculated and their statistical significance is assessed. Statistical significance usually means a rejection of the null hypothesis that there is no correlation between the climate variable and time. Only sites (stations or grid points) with significant trends are then usually discussed, while those with insignificant trends are neglected. Examples of recent studies conducted in this “typical” way include Unkašević and Tošić (2013), Dumitrescu et al. (2015), Fathian et al. (2015), Caloiero (2017), Burger et al. (2018), and Caloiero et al. (2019). This common approach to trend detection and climate change analysis suffers from several weaknesses, which provide a motivation and a starting point for this study.

First, as stressed by Nicholls (2001), who criticized a misuse and overuse of significance testing in climate change studies, a considerable amount of information may be lost when insignificant trends are neglected. The lack of significance may not indicate the absence of a trend, which is a frequent interpretation of such a situation; it may rather reflect the fact that the data series is simply not long enough for the trend, which is embedded in shorter-term climate fluctuations, to achieve significance. Arguments against the overuse of significance testing of trends, similar to those of Nicholls (2001), were also raised by Griffiths et al. (2005) and Moberg and Jones (2005). Very little seems to have changed in the practice of local climate change detection studies since the time of publication of these papers, however. The dichotomous view, considering trends either significant or absent, leads to difficulties in interpretation, which are not admitted by most authors. For example, how should one interpret the situation in which just one of two nearby stations has a significant trend? Is climate change proceeding at one station but absent at the other? Or, more intriguingly, what if the trend is significant in maximum temperature and insignificant in minimum temperature at one station, while the opposite is the case at a neighboring station? This is a realistic situation that appears, for example, on a map of annual temperature trends for Bangladesh in Rahman and Lateh (2016), without being noticed and commented on by the authors.

Second, the multiplicity of individual (local) tests is ignored. The question of the collective significance of local tests (also referred to as “global” or “field” significance), that is, whether the observed number of significant trends (i.e., of rejected null hypotheses) at individual sites may or may not have occurred due to mere chance, has very rarely been posed in climate change detection studies. Very little has changed since the first studies where the collective significance of local trends was assessed (Iskenderian and Rosen 2000; Yue and Hashino 2003); neglect of this issue is still ubiquitous. As Wilks (2016) points out, awareness of the need to also account for the collective significance has been weak or even negligible among atmospheric scientists, although the importance of evaluating the collective significance has been stressed repeatedly in meteorological and climatological literature (Livezey and Chen 1983; Katz and Brown 1991; Ventura et al. 2004; Wilks 2006a, 2016; DelSole and Yang 2011). Some other scientific disciplines, such as hydrology, seem to have paid more attention to the issue of collective significance (e.g., Douglas et al. 2000; Khaliq et al. 2009; Ledvinka and Lamacova 2015). The absence of testing for collective significance typically leads to statistical significance being overstated and falsely detected.

The third concern is policy related. Society is usually interested in climate change over a certain region, not at every single observing site (or grid point) individually. For example, one of the objectives of the U.S. Historical Climatological Network “was to detect temporal changes in regional rather than local climate” (Easterling et al. 1996). However, ordinary trend detection studies concentrate on individual sites and do not provide regional-scale information on significance.

All three of the above concerns call for an assessment of trends and their significance on a regional level. Consider a region, chosen a priori, with a network of sites where trends have been determined. It is reasonable to assume that the prevalence of one sign of trends at individual sites is indicative of a high confidence we may have that the trend over the region is not zero (unless very high spatial autocorrelation is present in the data), regardless of how large or small the trend magnitude at individual sites is. In an extreme case when the trends are of the same sign at all (or nearly all) sites (temperature trends over several recent decades over continental- or subcontinental-scale domains may serve as a good example; e.g., Lu et al. 2005; Vose et al. 2012; Capparelli et al. 2013; Spinoni et al. 2015a; Vincent et al. 2015; Jury 2017), it is natural to expect that the widespread long-term change that we see is not due to mere chance, although the majority of the individual trends may formally be locally insignificant. It is surprising that no attempts at significance testing based on this presumption have been made. Just a few studies made a short step forward by counting significant and insignificant individual local trends of either sign (e.g., Brázdil et al. 2009; Capparelli et al. 2013; Piccarreta et al. 2015; Tabari et al. 2015; Scherrer et al. 2016; Beranová and Kyselý 2018; Pokorná et al. 2018; Mullick et al. 2019), but without evaluating whether these counts occur merely because of chance.

All the above facts motivated us to conduct this study, which has two main objectives. First, we design a novel method of evaluating significance of trends over a region, which is based on counting signs of local trends; it is referred to as the “sign test” in the rest of the paper. Second, we compare the performance of several tests suitable for evaluating regional-scale significance of trends, including the one designed by us, on the basis of synthetic datasets.

This study deals with the detection of trends of one sign over the analyzed domain. Although this may seem to be a limitation, it is in fact a fairly common task since the expected manifestations of ongoing climate change as well as potential policy-relevant adverse effects are frequently related to one specific sign of a trend only. Simply speaking, it is more relevant to society to know whether temperature is rising rather than whether it is changing; whether hot extremes occur more often rather than whether they change their frequency; whether wind becomes stronger rather than changing its speed; whether precipitation intensity is growing rather than changing; and whether drought periods are becoming longer rather than changing their duration (e.g., Spinoni et al. 2015b; Soulis et al. 2016; Brázdil et al. 2019; Finkel and Katz 2018). The relevance of detecting trends of one sign is supported by the fact that another test, the extended Mann–Kendall test (which is included in our study), was developed for exactly the same purpose (e.g., Khaliq et al. 2009). On the other hand, the detection and assessment of significance of trends of both signs at the same time is also a common and policy-relevant task; for example, this relates to trends in pressure, frequency of synoptic events, or temperature variability (e.g., Gillett et al. 2005; Alexander and Perkins 2013; Kučerová et al. 2017).

2. Statistical tests

First of all, six statistical tests that allow for assessing regional-scale significance of trends defined at individual sites are described in this section. While the sign test and extended Mann–Kendall test are directly designed to test for trends of one sign in the domain, the other four tests—the counting test, Walker test, test based on the false detection rate (referred to as the FDR test here), and test of regression patterns—can be used for testing the significance of trends of both signs together. Three of the tests use local tests at individual sites as an input (counting test, Walker test, and FDR test), while the other three handle the data at all sites simultaneously (extended Mann–Kendall test, test of regression patterns, and the sign test, proposed by us). We start with the description of the local tests that we use in the first group of tests of collective significance, and then we list all the tests of collective significance. In the tests utilizing local tests, we suppose trends are evaluated for their significance locally at level α_L at N sites.

a. Local tests

The classical t test, nonparametric Mann–Kendall test, Mann–Kendall test with prewhitening, and Mann–Kendall test with the equivalent sample size are employed as local tests for the counting, Walker, and FDR tests. All of the tests evaluate the relationship between the tested time series and the order of values in time (e.g., the year if the series is composed of annual values). The t test evaluates the linearity of the relationship and is based on the correlation coefficient r between the tested time series and the order of the values, k = 1, …, K: the quantity

t = \frac{r}{\sqrt{1 - r^{2}}} \sqrt{K - 2}

(1)

is t-distributed with K − 2 degrees of freedom.
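As an illustration of Eq. (1), the following Python sketch computes the correlation r between a series and its time order and the resulting t statistic. The function name `trend_t_statistic` and the sample values are hypothetical; the p value would then be read from a t distribution with K − 2 degrees of freedom, which is not done here.

```python
import math

def trend_t_statistic(series):
    """Return (r, t): correlation of the series with its time order, and Eq. (1)."""
    K = len(series)
    times = range(1, K + 1)
    mean_x = sum(series) / K
    mean_k = (K + 1) / 2
    sxy = sum((k - mean_k) * (x - mean_x) for k, x in zip(times, series))
    sxx = sum((k - mean_k) ** 2 for k in times)
    syy = sum((x - mean_x) ** 2 for x in series)
    r = sxy / math.sqrt(sxx * syy)
    # Eq. (1); undefined for a perfectly linear series (|r| = 1)
    t = r / math.sqrt(1.0 - r * r) * math.sqrt(K - 2)
    return r, t

r, t = trend_t_statistic([1.0, 3.0, 2.0, 4.0])
print(r, t)
```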

The Mann–Kendall test (M-K test) is a nonparametric test for assessing significance of a trend, assuming its monotonicity, not linearity. The local M-K test statistic is the sum of signs of differences for all pairs of values: it is defined at site i for time series x_ik, k = 1, …, K, as

T_{i} = \sum_{k = 1}^{K - 1} \sum_{l = k + 1}^{K} sgn (x_{i k} - x_{i l}) .

(2)

The test statistic is normally distributed with zero mean and variance

Var (T_{i}) = \frac{1}{18} [K (K - 1) (2 K + 5) - \sum_{m = 1}^{M} q_{m} m (m - 1) (2 m + 5)],

(3)

where the sum over m accounts for the effect of ties: q_m is the number of ties of extent m and M is the number of tied groups.
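A minimal Python sketch of Eqs. (2) and (3) follows. The function names are hypothetical. Ties are handled by grouping equal values, which is equivalent to the sum over tie extents in Eq. (3), and the factor 1/18 is applied to the whole bracket, as in the standard formula. Note that with the ordering of the difference as written in Eq. (2), a steadily increasing series yields a negative statistic; many references use the opposite ordering.

```python
from collections import Counter

def sign(v):
    return (v > 0) - (v < 0)

def mk_statistic(x):
    """Local M-K statistic with the sign convention written in Eq. (2)."""
    K = len(x)
    return sum(sign(x[k] - x[l]) for k in range(K - 1) for l in range(k + 1, K))

def mk_variance(x):
    """Null variance of the M-K statistic, including the tie correction of Eq. (3)."""
    K = len(x)
    # Size of each group of equal values; groups of size 1 contribute zero.
    tie_sizes = Counter(x).values()
    tie_term = sum(m * (m - 1) * (2 * m + 5) for m in tie_sizes)
    return (K * (K - 1) * (2 * K + 5) - tie_term) / 18.0

print(mk_statistic([1, 2, 3, 4]), mk_variance([1, 2, 3, 4]))
```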

The presence of temporal autocorrelation affects results of significance testing for the above local tests. Typically, it inflates test statistics so that the null hypothesis is rejected (and a trend is detected) more often than it should be. There are various approaches to cope with autocorrelation. Some of them handle it a posteriori, such as by replacing the real sample size by its effective size in the calculation of critical values (e.g., Wilks 2006b; Decremer et al. 2014; Tabari et al. 2015); some of them handle it prior to testing by modifications of the time series entering the test.

We take both approaches, using procedures that have been widely used in climatological literature. Temporal autocorrelation is removed by the prewhitening procedure that retains trend magnitude (Yue and Wang 2002a; Yue et al. 2002): the linear trend is removed from the time series first, then the lag-1 autocorrelation r₁ is calculated in the detrended series x_{i}^{'}, and the new prewhitened time series y_i, from which the lag-1 autocorrelation is removed, is formed by

y_{k} = x_{k}^{'} - r_{1} x_{k - 1}^{'} .

(4)

Last, the trend is added to the prewhitened series. This new variable then enters the standard testing procedure.
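The prewhitening steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the trend is estimated by ordinary least squares (the cited procedure may use a different slope estimator), time is indexed from 0, and all function names are hypothetical. Because the first value has no predecessor in Eq. (4), the output is one element shorter than the input.

```python
def ols_slope(x):
    """Least-squares slope of x against the time index 0, 1, ..., K-1."""
    K = len(x)
    mean_k = (K - 1) / 2
    mean_x = sum(x) / K
    num = sum((k - mean_k) * (v - mean_x) for k, v in enumerate(x))
    den = sum((k - mean_k) ** 2 for k in range(K))
    return num / den

def lag1_autocorrelation(x):
    """Lag-1 autocorrelation; assumes the series is not constant."""
    K = len(x)
    mean_x = sum(x) / K
    num = sum((x[k] - mean_x) * (x[k - 1] - mean_x) for k in range(1, K))
    den = sum((v - mean_x) ** 2 for v in x)
    return num / den

def prewhiten_keep_trend(x):
    """Detrend, remove lag-1 autocorrelation via Eq. (4), then add the trend back."""
    slope = ols_slope(x)
    detrended = [v - slope * k for k, v in enumerate(x)]
    r1 = lag1_autocorrelation(detrended)
    whitened = [detrended[k] - r1 * detrended[k - 1] for k in range(1, len(x))]
    return [w + slope * k for k, w in enumerate(whitened, start=1)]

series = [0.0, 1.5, 1.0, 3.5, 3.0, 5.5, 5.0, 7.5]
print(prewhiten_keep_trend(series))
```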

The test with equivalent sample size is based on replacing the actual sample size K by the effective sample size K^*, which accounts for the inflation of variance of the test statistic due to temporal autocorrelation. The general formula (e.g., Yue and Wang 2004; Tabari et al. 2015) reduces for the first-order autoregressive process to

K^{*} = K \frac{1 - r_{1}}{1 + r_{1}},

(5)

where r₁ is the lag-1 autocorrelation coefficient of the detrended series.
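For instance, Eq. (5) can be evaluated with the short Python sketch below; `effective_sample_size` is a hypothetical helper name. For a series of 200 values, lag-1 autocorrelations of 0.3 and 0.6 leave roughly 108 and 50 effective values, respectively.

```python
def effective_sample_size(K, r1):
    """Effective sample size of Eq. (5) for a first-order autoregressive process."""
    return K * (1.0 - r1) / (1.0 + r1)

for r1 in (0.0, 0.3, 0.6):
    print(r1, round(effective_sample_size(200, r1), 1))
```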

b. Counting test

The counting test is a classical test for collective significance, which was introduced into atmospheric sciences by Livezey and Chen (1983). Its essence is that the collective significance is achieved if significance is detected for a sufficiently large number of local tests.

If the null hypothesis of no trend is true, the chance of wrongly rejecting the local null hypothesis (i.e., detecting a significant trend) at any site is α_L. For independent local tests, the number of locally significant trends under the null hypothesis follows a binomial distribution: the probability that their number M equals or exceeds a preset value m is

P (M \geq m) = \sum_{i = m}^{N} \frac{N!}{i! (N - i)!} α_{L}^{i} {(1 - α_{L})}^{N - i} .

(6)

The mean value of the distribution of M is Nα_L. However, the number of locally significant tests m_C necessary to achieve collective significance level α_C is larger than Nα_L. The collective significance level, which in formal terms is the probability that at least the given number of local null hypotheses are falsely rejected, can be determined from Eq. (6) by setting its right-hand side equal to α_C. Livezey and Chen (1983) demonstrate that for 50 local tests, collective significance α_C = 5% is achieved if at least 12% of local tests are significant at level α_L = 5%. Even for 1000 tests, at least 7% of local tests must be significant to achieve the collective significance of 5%.
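Under the independence assumption, the critical count follows directly from Eq. (6). The Python sketch below (function names hypothetical) searches for the smallest m whose binomial tail probability does not exceed α_C; for 50 local tests at α_L = α_C = 5% it returns 6, that is, 12% of the tests, in agreement with the Livezey and Chen (1983) figure quoted above.

```python
from math import comb

def binomial_tail(N, m, alpha_local):
    """P(M >= m) from Eq. (6) for N independent local tests."""
    return sum(
        comb(N, i) * alpha_local**i * (1.0 - alpha_local) ** (N - i)
        for i in range(m, N + 1)
    )

def min_significant_count(N, alpha_local=0.05, alpha_collective=0.05):
    """Smallest number of locally significant tests giving collective significance."""
    for m in range(N + 1):
        if binomial_tail(N, m, alpha_local) <= alpha_collective:
            return m
    return None  # collective level not attainable even if all tests are significant

m_c = min_significant_count(50)
print(m_c, 100.0 * m_c / 50)
```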

The above numbers apply to the situation that all local tests are independent, which is rarely the case in atmospheric sciences. For dependent tests, that is, in the presence of spatial autocorrelation between the sites where trends are evaluated, the test becomes too permissive, that is, rejects local null hypotheses too frequently. Consequently, the number of significant local tests needed to achieve the global significance is higher than in the case of independence. Wilks (2006a) demonstrates that the sensitivity of the counting test to the mutual dependence of local tests is high.

c. Walker test

The counting test has the property of not taking into account the magnitude of significance achieved by local tests, which may be considered a disadvantage. The same number of highly significant local tests and of the tests that exceed the local significance level only marginally result in the same collective significance according to the counting test.

The Walker test approaches the issue of collective significance in a different way. It is built on the premise that collective significance can be achieved even if just one of the local tests is sufficiently highly significant. The Walker test detects collective significance at level α_C if the smallest p value of all local tests, p_min, satisfies the condition (Katz and Brown 1991; Wilks 2006a)

p_{\min} \leq 1 - {(1 - α_{C})}^{1 / N} .

(7)
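To give a sense of how strict the condition in Eq. (7) is, the following Python sketch (function names hypothetical) evaluates the critical p value and applies the test to a list of local p values. For N = 400 local tests and α_C = 5%, the smallest local p value must not exceed about 1.28 × 10⁻⁴.

```python
def walker_critical_p(N, alpha_collective=0.05):
    """Right-hand side of Eq. (7)."""
    return 1.0 - (1.0 - alpha_collective) ** (1.0 / N)

def walker_test(p_values, alpha_collective=0.05):
    """True if the smallest local p value satisfies Eq. (7)."""
    return min(p_values) <= walker_critical_p(len(p_values), alpha_collective)

print(walker_critical_p(400))
```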

d. False detection rate

The test based on false detection rate (FDR) can be viewed as a generalization of the Walker test. It was introduced into atmospheric sciences by Ventura et al. (2004) and later advocated by Wilks (2006a, 2016). The test is based on the Benjamini and Hochberg (1995) procedure, which, as Wilks (2006a) demonstrates, also provides a field significance test. Whereas the Walker test looks at the most significant local test only, the FDR test takes into account the strength of significance of all local tests and rejects the null hypothesis of no trend if any of the local tests yields a sufficiently small p value, the threshold depending on its order.

The FDR test detects collective significance if the following inequality holds for the p value of at least one local test:

p \leq \max_{i = 1, \dots, N} [p_{(i)}; p_{(i)} \leq α_{C} \frac{i}{N}],

(8)

where p_(i) denotes sorted p values of local tests; that is, p_(i) is the ith-smallest p value.
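The condition in Eq. (8) can be sketched in Python as follows; `fdr_test` is a hypothetical name, and local p values are assumed to be available as a plain list. The function returns whether collective significance is detected together with the threshold p value, that is, the largest sorted p value that lies at or below its order-dependent bound.

```python
def fdr_test(p_values, alpha_collective=0.05):
    """Return (detected, threshold) following Eq. (8)."""
    N = len(p_values)
    sorted_p = sorted(p_values)
    # Keep every sorted p value p_(i) that satisfies p_(i) <= alpha_C * i / N.
    passing = [
        p for i, p in enumerate(sorted_p, start=1)
        if p <= alpha_collective * i / N
    ]
    threshold = max(passing) if passing else None
    return threshold is not None, threshold

print(fdr_test([0.001, 0.02, 0.03, 0.5]))
```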

A considerable advantage of the Walker and FDR tests is that they are much less sensitive to the mutual dependence of local tests than the counting test (Wilks 2006a). Unlike the counting test, the FDR test tends to be conservative if local tests are not independent. Under high spatial autocorrelations, Wilks (2016) recommends doubling the significance level used to determine the critical value; that is, to achieve significance of 5%, the critical value for the 10% significance level should be used. The FDR test with the modified critical value was applied, for example, by Rohrer et al. (2019).

e. Extended Mann–Kendall test

Unlike the previous three tests, the following statistical tests are designed specifically for testing for trends. The local M-K test described above is extended to a collective test by simply summing the local M-K statistics over all sites (Yue and Wang 2002b; Khaliq et al. 2009; Ledvinka and Lamacova 2015). The collective M-K test statistic is then

T = \sum_{i = 1}^{N} T_{i} = \sum_{i = 1}^{N} \sum_{k = 1}^{K - 1} \sum_{l = k + 1}^{K} sgn (x_{i k} - x_{i l}),

(9)

which is also normally distributed. Its variance is approximated as

Var (T) \dot{=} \sum_{i = 1}^{N} Var (T_{i}) + 2 \sum_{i = 1}^{N - 1} \sum_{j = 1}^{N - i} \sqrt{Var (T_{i}) Var (T_{i + j})} \cdot cor (x_{i}, x_{i + j}),

(10)

where cor(x_i, x_{i+j}) is the correlation between time series at sites i and i + j. The second term on the right-hand side was introduced as a compensation for spatial autocorrelation. Relationship (10) is written as an equality in papers introducing and describing the extended M-K test (Douglas et al. 2000; Yue and Wang 2002b; Ledvinka and Lamacova 2015), which is not correct, however. The degree to which Eq. (10) approximates the variance of the M-K statistic and a short discussion are provided in the online supplemental material. Here we note that the second term on the right-hand side of Eq. (10) effectively accounts for spatial autocorrelation between sites, as we show later in section 5.
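The variance approximation of Eq. (10) amounts to a sum over all pairs of sites. A minimal Python sketch is given below; the function name and the input layout (a list of local variances and a nested list of intersite correlations) are assumptions made for illustration.

```python
import math

def extended_mk_variance(local_variances, correlations):
    """Approximate Var(T) from Eq. (10).

    local_variances[i] is Var(T_i); correlations[i][j] is cor(x_i, x_j).
    """
    N = len(local_variances)
    total = sum(local_variances)
    # Each unordered pair of sites (i, j), i < j, contributes once with factor 2.
    for i in range(N - 1):
        for j in range(i + 1, N):
            total += (
                2.0
                * math.sqrt(local_variances[i] * local_variances[j])
                * correlations[i][j]
            )
    return total

print(extended_mk_variance([4.0, 9.0], [[1.0, 0.5], [0.5, 1.0]]))
```

With uncorrelated sites the second term vanishes and the variance reduces to the sum of the local variances.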

f. Test of regression patterns

DelSole and Yang (2011) propose a procedure for testing whether the set of linear regression coefficients (regression pattern) is collectively nonzero; the testing procedure is intended to account for the multiplicity and interdependence of the local tests. They demonstrate that the task results in F-distributed statistic with N and K − N − 1 degrees of freedom:

F = \frac{R^{2}}{1 - R^{2}} \frac{K - N - 1}{N},

(11)

where R² is a multiple correlation coefficient between all of the individual local time series and the order of values in time,

R^{2} = c^{T} R^{- 1} c

(12)

where c is a vector containing correlations between the time series and time index k = (1, 2, …, K),

c = (r_{i T}) = [cor (x_{i}, k)],

(13)

and R is a matrix containing correlations of time series between the sites,

R = (r_{i j}) = [cor (x_{i}, x_{j})] .

(14)

Here, i, j = 1, …, N index sites.

The statistic cannot be calculated if the number of sites N is larger than the length of the series K, which is often the case. The solution is to approximate the N local time series by a smaller number of their principal components and to calculate the multiple correlation coefficient and the F statistic from the principal components instead of the original series. The term N in Eq. (11) then represents the number of retained principal components that represent a sufficient portion of the total variance. We refer to this test as the regression test.

g. Test based on counts of signs (sign test)

As announced in the introduction, we propose a new simple test for significance of trends at a network of sites. Its basic idea is that if there is no real trend in the data, the number of sites with positive trends will be approximately the same as the number of sites with negative trends. In the presence of a trend at all the sites, the numbers of negative and positive trends will differ. Therefore, the test statistic is simply the difference between the number of positive and negative trends, n₊ and n₋, respectively, relative to the grid size:

S = \frac{(n_{+} - n_{-})}{N} .

(15)

The numbers of trends of either sign follow a binomial distribution with parameters p = 0.5 and N being the number of sites. The probability that the number of trends of the given sign (positive or negative), M, equals or exceeds the preset value n is

P (M \geq n) = \sum_{i = n}^{N} \frac{N!}{i! (N - i)!} {0.5}^{N} .

(16)

The advantage of this test is its simplicity: no significance tests are needed at individual sites; just a sign of a trend must be calculated. Any method of determination of a trend, whether parametric or nonparametric, is applicable here.
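A minimal Python sketch of Eqs. (15) and (16) is given below, with hypothetical function names. The tail probability is valid only under the binomial model, which presumes that the signs at individual sites are independent; it is offered as an illustration of the formulas rather than as a substitute for critical values derived from synthetic data.

```python
from math import comb

def sign_statistic(trends):
    """Eq. (15): difference between counts of positive and negative trends over N."""
    n_pos = sum(1 for b in trends if b > 0)
    n_neg = sum(1 for b in trends if b < 0)
    return (n_pos - n_neg) / len(trends)

def sign_tail_probability(N, n):
    """Eq. (16): probability that at least n of N trends share a given sign."""
    return sum(comb(N, i) for i in range(n, N + 1)) / 2**N

print(sign_statistic([0.1, 0.2, -0.1, 0.3]))
print(sign_tail_probability(4, 3))
```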

3. Synthetic data

Null distributions of the test statistics outlined above are determined and the power of the tests is evaluated on a set of synthetic time series, which are generated for a regular grid consisting of 20 × 20 points. In generating these data, the local variance, temporal autocorrelation, spatial autocorrelation, and trend magnitude are prespecified.

The synthetic data are produced by the Spatial Generator for Trend Analysis stochastic weather generator (SPAGETTA; Dubrovský et al. 2020). Temporal and spatial autocorrelations are represented by a first-order multivariate autoregressive process, in which the modeled vector consists of values related to individual grid points (i.e., having 20 × 20 grid points, the dimension of the vector is 400). Spatial autocorrelations between grid points are assumed to isotropically decrease with their distance increasing:

cor [x_{i} (k), x_{j} (k)] = C_{0}^{d (i, j)}

(17)

and

cor [x_{i} (k), x_{j} (k - 1)] = L_{0} C_{0}^{d (i, j)},

(18)

where C₀ < 1 and 0 < L₀ < 1 are user-defined parameters, k is the time index, and d(i, j) is the distance between grid points i and j. If we set d = 1 for the neighboring (closest) grid points, the two equations above imply that C₀ represents the spatial autocorrelation between two closest grid points, p and q, at zero lag:

cor [x_{p} (k), x_{q} (k)] = C_{0}

(19)

and L₀ represents lag-1 temporal autocorrelation at any grid point:

cor [x_{i} (k), x_{i} (k - 1)] = L_{0} .

(20)

For the sake of simplicity, we refer to C₀ and L₀ as spatial autocorrelation and temporal autocorrelation, respectively, in further text. Using the algorithm just described, synthetic series 200 time units long (the units may represent years in our setting) are generated. They are produced for a wide range of values of temporal and spatial autocorrelations (listed in Table 1), which cover values occurring in the real world for various variables on grids of reasonable horizontal resolutions. Since the signal-to-noise ratio, which is equivalent to the ratio of the trend magnitude to the standard deviation, is what matters for trend detection, we only list this ratio in Table 1. The standard deviation is identical for all time series.
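The correlation structure prescribed by the two parameters can be illustrated with a short Python sketch. It is not the SPAGETTA implementation: the function names are hypothetical, grid points are given as (row, column) pairs, and the distance d is assumed to be Euclidean in units of the grid spacing, so that neighboring points have d = 1.

```python
import math

def spatial_correlation(p, q, C0):
    """Zero-lag correlation between grid points p and q: C0 raised to their distance."""
    return C0 ** math.dist(p, q)

def lagged_correlation(p, q, C0, L0):
    """Lag-1 correlation between grid points p and q: L0 times the zero-lag value."""
    return L0 * spatial_correlation(p, q, C0)

print(spatial_correlation((0, 0), (0, 1), 0.9))      # neighboring points
print(spatial_correlation((0, 0), (3, 4), 0.9))      # distance 5
print(lagged_correlation((2, 2), (2, 2), 0.9, 0.3))  # same point, lag 1
```

For neighboring points the zero-lag value equals C₀, and for a single point the lag-1 value equals L₀, which matches the interpretation of the two parameters given above.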

Table 1.

Values of statistical properties used for generating random data. The dimensionless trend is the magnitude of trend divided by standard deviation.

The stationary series, that is, the data with the above statistical properties without any trend imposed, are used to construct null distributions and determine critical values of the statistical tests. The time series with imposed linear trends of various magnitudes (see Table 1) are then used to evaluate the power of the tests.

For each combination of values of statistical properties, 10 000 random time series that are 200 time units long at 20 × 20 grid points are generated. Additional analyses are conducted on subsets of these data for a smaller grid (10 × 10 points) and for shorter series (30, 40, 50, 70, 100, and 150 time units).

4. Implementation of tests

For each random time series, the test statistics described in the following, pertaining to the individual statistical tests, are calculated for all values of the statistical properties (spatial and temporal autocorrelation and trend magnitude), listed in Table 1. The counting, Walker, and FDR tests are conducted with all the four local tests, namely the t test, M-K test, prewhitened M-K test, and M-K test with equivalent sample size at the 5% level. The possible ranges for all tests are listed in Table 2, which also indicates the tail where the region of rejection is located.

Table 2.

Possible ranges of values and location of the region of rejection for the tests.

For the counting test, the number of grid points where the local significance of a trend is detected, n_sig, is counted. Then the test statistic C is simply the percentage of grid points with locally significant tests:

C = (n_{sig} / N) \times 100.

(21)

The statistic of the Walker test W is the minimum p value of all N local t tests:

W = p_{(1)} = \min_{i = 1, \dots, N} p_{i} .

(22)

The statistic of the FDR test follows from its description in Eq. (8): we seek the p value for which the inequality in Eq. (8) is fulfilled most strongly; that is, we seek the index i for which the ith smallest p value divided by i is minimum. The minimum value of this ratio is then multiplied by the number of grid points for convenience; the statistic of the FDR test, f, is therefore

f = \min_{i = 1, \dots, N} [p_{(i)} / i] N .

(23)
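Equations (22) and (23) can be sketched together as follows, again assuming an array of local p values as input; the function names are illustrative. Note that the i = 1 term inside the minimum in Eq. (23) is the Walker statistic multiplied by N.

```python
import numpy as np

def walker_statistic(p_values):
    """Minimum local p value, W in Eq. (22)."""
    return float(np.min(np.asarray(p_values, dtype=float)))

def fdr_statistic(p_values):
    """f in Eq. (23): minimum over i of p_(i) / i, multiplied by N."""
    p_sorted = np.sort(np.asarray(p_values, dtype=float).ravel())
    n = p_sorted.size
    ranks = np.arange(1, n + 1)          # i = 1, ..., N
    return float(n * np.min(p_sorted / ranks))

p = [0.30, 0.02, 0.50, 0.04]
print(walker_statistic(p))   # 0.02
print(fdr_statistic(p))      # sorted ratios 0.02, 0.02, 0.1, 0.125 -> 4 * 0.02
```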

For the extended M-K test, we use test statistic T calculated from Eq. (9) standardized to unit variance by dividing it by its standard deviation from Eq. (10), that is,

Z = [T - sgn (T)] / \sqrt{Var (T)} .

(24)
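A small sketch of the standardization in Eq. (24), assuming that T and Var(T) have already been obtained from Eqs. (9) and (10), which lie outside this section; the function name is illustrative:

```python
import math

def standardized_mk(t_stat, var_t):
    """Z in Eq. (24): T shifted by sgn(T) and divided by its standard deviation."""
    sign = (t_stat > 0) - (t_stat < 0)   # sgn(T); equals 0 when T == 0
    return (t_stat - sign) / math.sqrt(var_t)

print(standardized_mk(101, 400.0))   # 5.0
print(standardized_mk(0, 400.0))     # 0.0
```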

Since the number of grid points (400) is larger than the length of the time series (30 to 200), the multiple correlation coefficient in Eq. (11) would be ill defined. To avoid this, the F statistic for the regression test is calculated from principal components, with N in Eq. (11) referring to the number of retained components (DelSole and Yang 2011). Principal component analysis is conducted on the data matrix consisting of gridpoint values in columns and time realizations in rows (S-mode in the standard nomenclature; Richman 1986) using correlation as the similarity measure. Numerous criteria exist for separating signal from noise, that is, for deciding how many components to retain; these criteria typically yield widely different numbers (e.g., Serrano et al. 1999). Here we adopt a simple rule: we retain the smallest number of components that together explain at least 80% of the total variance. If the number of components needed to explain 80% of the variance is larger than half of the series length, we retain a number of components equal to half of the series length. The criterion for selecting the number of principal components to retain differs from that used originally by DelSole and Yang (2011); hence results might differ from what one would obtain if the original criterion were used.
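The retention rule can be sketched as below. This is only an illustration under stated assumptions: the data matrix has time in rows and grid points in columns, no grid-point series is constant, half of the series length is taken by integer division, and the function name and the `var_fraction` argument are hypothetical.

```python
import numpy as np

def n_retained_components(data, var_fraction=0.80):
    """Number of principal components to keep for a (time, gridpoint) matrix."""
    x = np.asarray(data, dtype=float)
    n_time = x.shape[0]
    # Correlation-based PCA: standardize each grid-point series first
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Squared singular values are proportional to the component variances
    s = np.linalg.svd(z, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k whose cumulative explained variance reaches var_fraction
    k = int(np.searchsorted(explained, var_fraction)) + 1
    # Cap at half of the series length
    return min(k, n_time // 2)

# Three perfectly correlated series need a single component
t = np.arange(10.0)
print(n_retained_components(np.column_stack([t, 2 * t, -t])))  # 1
```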

Last, the statistic of the sign test in Eq. (15) can be simplified if all trends are assumed to be either positive or negative, that is, no trends have exactly zero magnitude (this assumption is fully realistic; no trend with exactly zero magnitude in fact appeared in our very large sample). For ease of display and for consistency with the counting test, we express this statistic as a percentage:

S = (2 \frac{n_{+}}{N} - 1) \times 100.

(25)
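A minimal sketch of Eq. (25), assuming the local trend slopes are available as an array and that none of them is exactly zero (the function name is illustrative):

```python
import numpy as np

def sign_statistic(slopes):
    """S in Eq. (25), computed from local trend slopes (none exactly zero)."""
    s = np.asarray(slopes, dtype=float).ravel()
    n_plus = np.count_nonzero(s > 0)     # number of positive local trends
    return 100.0 * (2.0 * n_plus / s.size - 1.0)

print(sign_statistic([0.1, 0.3, 0.2, -0.1]))  # 50.0
```

The statistic ranges from -100 (all local trends negative) to +100 (all local trends positive), in line with the bounds discussed for the sign test in section 5.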

Since we focus on trends with one sign over the analyzed domain, we opt for one-sided hypotheses for all tests. That is, the null hypothesis of no trend is tested against the alternative of a positive trend.

5. Null distributions and critical values

Figure 1 displays empirical distributions (histograms) of test statistics for the six tests for an illustrative selection of temporal and spatial autocorrelation values. All tests appear to be sensitive to the presence of autocorrelations, although to a different extent and in a different way.

The distributions of the sign test are symmetric around zero because the probability of positive and negative trends is the same under the null hypothesis. The sign test, thanks to its construction, is insensitive to temporal autocorrelation as the magnitude of autocorrelation does not affect the number of positive and negative trends. Growing spatial autocorrelations make the distribution of the sign test widen because the probability of a simultaneous detection of trends of one sign at a large number of sites grows with increasing spatial autocorrelation. For high spatial autocorrelations, the distribution becomes U-shaped, with most values concentrated at the lower and upper bound (−100 and +100). That is, there is a strong tendency for local trends to be either all negative or all positive under high spatial autocorrelations, even though no real trend is present.

Histograms for the extended M-K test confirm that its modification succeeds in eliminating the sensitivity to spatial autocorrelation: the histogram of the test statistic does not change as spatial autocorrelation increases. Nonzero temporal autocorrelations result in a larger variance of the empirical distribution, without considerable changes in its shape.

The sensitivity of the regression test to spatial autocorrelations is relatively weak: with increasing spatial autocorrelations, the peak of the distribution of its statistic moves toward zero and its variance increases. Nonzero temporal autocorrelation results in the null distribution moving to the right, while keeping the ratio of its standard deviation to the mean approximately constant.

For the remaining three tests, we first discuss their behavior for the t test used as a local test. Nonzero temporal autocorrelations result in the statistic of the counting test being shifted to higher values; that is, the number of trends detected by local tests increases with increasing temporal autocorrelation. High spatial autocorrelations make the distribution of the test statistic widen and shift toward zero. For very high spatial autocorrelations, there is a large probability that no local test will detect a trend if the null hypothesis is true (for spatial autocorrelation of 0.97, this probability is as large as 70%), but at the same time there is a nonzero, although small, probability that a trend is detected at any number of sites, including the possibility that it is detected everywhere.

Unlike the other four tests, the Walker and FDR tests have their region of rejection in the left tail. The null distribution of the Walker test tends to stick to zero: the probabilities are largest for near-zero values and decrease toward the right tail. In other words, the smallest p value of all N local tests is usually very small even if the null hypothesis is true. Relatively large p values are nevertheless also plausible, despite occurring with much smaller probability. The autocorrelation values control the degree to which the Walker test statistic sticks to zero. For large spatial autocorrelations, the distribution becomes wider; for autocorrelation of 0.97, all values of the test statistic up to 1.0 are possible (i.e., the minimum p value may take on any value), although the probability of near-zero values remains the largest. The opposite effect appears for large temporal autocorrelations: the probability concentrates in the vicinity of zero.

A similar behavior can be observed for the FDR test. Its distribution is approximately uniform in the absence of autocorrelations. The probability concentrates at the upper bound (1.0) for high spatial autocorrelations while at the lower bound (0.0) for high temporal autocorrelations.

Distributions of test statistics for counting, Walker, and FDR tests are insensitive to whether t tests or Mann–Kendall tests are used as local tests (not shown): the distributions are almost identical. The two approaches of accounting for temporal autocorrelations result in different modifications of distributions of test statistics.

The use of prewhitening makes a rather minor difference. For nonzero temporal autocorrelations, the distribution for the counting test shifts to somewhat higher values, while the distributions for the Walker and FDR tests stick more strongly to zero (cf. middle and left columns in Fig. 2). This makes the distributions for the prewhitened series even more different from the distributions for the series with zero temporal autocorrelation than the distributions for the original series are. The likely reason is illustrated in Fig. 3. The histogram of autocorrelations in time series generated with temporal autocorrelation equal to 0.3 is rather wide (middle panel in Fig. 3); the actual autocorrelations range from 0.04 to 0.50 because of the random component in the data. The histogram for the series from which the temporal autocorrelation was removed by prewhitening (right panel of Fig. 3) is much narrower than the histogram for series generated under the assumption of zero temporal autocorrelation (left panel). This effect occurs because the prewhitening procedure mistakes a part of the stochastic variability for autocorrelation and removes it, although it should have been retained.
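The narrowing of the histogram can be reproduced qualitatively with a toy experiment. The sketch below is an assumption-laden illustration, not the prewhitening procedure defined earlier in the paper: it uses a plain AR(1) generator, a simple lag-1 estimator, and prewhitening of the form x_t − r₁x_{t−1} with r₁ estimated from each series; all function names are hypothetical.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

def prewhiten(x):
    """Remove the lag-1 autocorrelation estimated from the series itself."""
    x = np.asarray(x, dtype=float)
    return x[1:] - lag1_autocorr(x) * x[:-1]

def ar1_series(n, phi, rng):
    """AR(1) series with unit variance and lag-1 autocorrelation phi."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + np.sqrt(1.0 - phi**2) * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
# Autocorrelations of truly white series vs. prewhitened AR(1) series
white = [lag1_autocorr(ar1_series(200, 0.0, rng)) for _ in range(300)]
whitened = [lag1_autocorr(prewhiten(ar1_series(200, 0.3, rng))) for _ in range(300)]
print(np.std(white), np.std(whitened))   # the second spread should be smaller
```

Because the coefficient removed from each series is the one estimated from that same series, the residual autocorrelation is forced close to zero in every realization, which is why its spread is smaller than the natural sampling spread of white noise.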

The use of effective sample size in local tests alters the distributions of test statistics substantially, making them close to distributions for series with zero autocorrelation. For the counting test, the distribution is shifted left close to the origin, although it still has somewhat higher mean and variance than that for zero autocorrelation. For the Walker test, the distribution after accounting for temporal autocorrelation has somewhat higher concentration of probability near the origin, while an extra peak appears near the upper bound for the FDR test.

Whether and how temporal autocorrelation is accounted for has no effect on how the distributions depend on spatial autocorrelations: for nonzero spatial autocorrelations, the distributions for prewhitened local tests and tests with effective sample size are almost identical to the standard t test and local Mann–Kendall test (not shown).

The empirical distributions constitute null distributions, from which critical values of the tests are to be deduced. The critical value for the significance level of 5% is the 95th percentile of the empirical distributions for the sign, counting, extended M-K, and regression tests, and the 5th percentile for Walker and FDR tests. A strong concentration of probability at the bound near which the critical value lies constitutes a potential problem because the test may become inapplicable. This concerns Walker and FDR tests under high temporal autocorrelations, the sign test under high spatial autocorrelations, and the counting test when both temporal and spatial autocorrelations are high. The use of local tests with effective sample size alleviates the limitations of the counting, Walker, and FDR tests under high temporal autocorrelations: these tests become applicable under high temporal autocorrelations if an effective sample size is used. If the largest (when the critical value is in the upper tail) or smallest (when the critical value is in the lower tail) possible value of the test statistic occurs in more than 5% of cases, the critical value is equal to the highest (or lowest for Walker and FDR tests) possible value, there are no values of the test statistic that could be larger (or smaller) than the critical value, and the test becomes unusable. This issue is a result of discreteness of the statistic for the sign and counting tests. For Walker and FDR tests, whose statistics are continuous, it results from a rounding in the calculation, which is technically unavoidable.
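The determination of critical values and the check for an unusable test can be sketched as follows. The function names are illustrative, the bound of the statistic is passed explicitly (e.g., 100 for the sign and counting tests, 0 for the Walker and FDR tests), and NumPy's default percentile interpolation is an assumption that need not match the authors' computation.

```python
import numpy as np

def critical_value(null_stats, alpha=0.05, upper_tail=True):
    """Empirical critical value: 95th (upper tail) or 5th (lower tail) percentile."""
    s = np.asarray(null_stats, dtype=float)
    q = 100.0 * (1.0 - alpha) if upper_tail else 100.0 * alpha
    return float(np.percentile(s, q))

def is_unusable(null_stats, bound, alpha=0.05):
    """True if the extreme possible value occurs in more than alpha of the cases."""
    s = np.asarray(null_stats, dtype=float)
    return bool(np.mean(s == bound) > alpha)

# Toy null sample: 10% of realizations sit at the upper bound of 100
null = np.array([100.0] * 10 + [0.0] * 90)
print(critical_value(null))        # 100.0, equal to the bound
print(is_unusable(null, 100.0))    # True
```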

Critical values for selected tests are provided in Table 3 for various values of temporal and spatial autocorrelation and their combinations, for the grid size of 20 × 20 points, and time series 200 units long. Critical values reflect the behavior of empirical distributions in Fig. 1. They depend on both spatial and temporal autocorrelations except for the sign test, which is insensitive to temporal autocorrelations, and the extended M-K test, which is insensitive to spatial autocorrelations. The critical value of the extended M-K test is the 95th percentile of the standardized Gaussian distribution. Note that the critical values for the FDR test under zero temporal autocorrelations remain around 0.05 for spatial autocorrelations of up to 0.6, whereas they grow to about 0.12 for autocorrelations between 0.90 and 0.97. This is in rough correspondence with the finding of Wilks (2016) that under high spatial autocorrelations, the FDR test performs at double the nominal significance level. The cases when the critical value is equal to the maximum (or minimum for the Walker and FDR tests) possible value, that is, when the test is unusable as explained above, are denoted by italics in Table 3. It is worth noting that Fig. 1 and Table 3 suggest that under high spatial autocorrelations (0.97 and more), even the presence of a positive trend at all 400 grid points does not ensure statistical significance of the sign test.

Table 3.

Critical values for individual tests, for time series that are 200 units long, and for a 20 × 20 grid for various values of temporal (in rows) and spatial (in columns) autocorrelation. Cases in which the critical value is equal to the largest (or smallest for Walker and FDR tests) possible value, because of which the test becomes unusable, are denoted in italics.

Whether the critical values depend on the length of time series is examined by calculating Spearman correlations of the critical values with the series length; they are displayed in Table 4 for selected values of spatial and temporal autocorrelations. (Recall that we examine seven different series lengths, namely 30, 40, 50, 70, 100, 150, and 200 time units, from which the correlations are calculated.) One would expect that, in general terms, the shorter the time series, the wider the null distribution because of a larger uncertainty. First, this is not true for all tests; for example, the variance of the null distribution for the sign test under zero spatial autocorrelations, and hence the critical value, is independent of the series length since the probability of detecting a positive or negative trend at a single site when no trend is imposed is equal to 0.5 regardless of the length of the series. Second, the way that the width of the null distribution projects onto the critical value depends on the shape of the distribution and is different for unbounded distributions (such as those for the extended M-K test; see Fig. 1), distributions bounded from the side where the critical value lies (e.g., the Walker test), distributions bounded from the other side (e.g., the FDR test for positive spatial autocorrelations), and U-shaped distributions (such as the sign test for high spatial autocorrelations). This means that there is no simple general rule describing how the critical value depends on the series length; a similar argument applies to the dependence on the grid size, discussed below. Spearman correlations are insignificant for the sign test (this holds for other values of autocorrelations not shown in Table 4 as well), indicating that the variability of critical values with the series length is a manifestation of sampling variability. The dependence of critical values on the series length is more systematic for the other tests; it is particularly strong for the regression test and for tests with a local test with prewhitening.

Table 4.

Dependence of critical values on the length of the series. Displayed are Spearman correlations (boldface type indicates values significantly different from zero at the 5% level) between the series length and critical values for individual tests for selected autocorrelations and a 20 × 20 grid. The critical value for the highest autocorrelations cannot be determined for some tests (for an explanation see the text); therefore, the critical value for the highest feasible autocorrelation is displayed instead, as indicated by asterisks: one asterisk for temporal/spatial autocorrelation of 0.6/0.95, and two for 0.3/0.97.

The effect of the grid size on the critical values can be deduced from Table 5, which compares critical values for the 20 × 20 and 10 × 10 grids. The extended Mann–Kendall and FDR tests appear to be insensitive to the grid size, and the regression test is only weakly sensitive to it, especially for small autocorrelations. Critical values of all the other tests seem to be systematically affected by the grid size.

Table 5.

Critical values of tests for 20 × 20 and 10 × 10 grids for selected autocorrelations (temporal/spatial autocorrelation from left to right: 0.0/0.0, 0.0/0.9, and 0.6/0.97) and for the series length of 200.

If the test is not sensitive to the series length and/or the grid size it may be sensible to determine the critical values from the longest possible time series and/or for the largest grid because they are determined with the smallest uncertainty. Nevertheless, we prefer to determine the critical values for every series length and grid size separately for all tests in order to handle all the tests in the same way.

6. Evaluation of tests: Type-II errors

The power of a test can be evaluated against a specific alternative hypothesis in terms of the type-II error, that is, the probability that the null hypothesis is not rejected if the alternative hypothesis is true. Various magnitudes of trend (see Table 1) are used as an alternative hypothesis. Type-II errors for the 20 × 20 grid and the time series 200 units long are displayed in Table 6 for a selection of tests; a full table for all tests, including all combinations with local tests, is provided as Table S1 in the online supplemental material. In the absence of autocorrelations, the sign, counting, and extended M-K tests are able to detect even the weakest trend considered in our study, 0.005 per decade, with near certainty (type-II error being 0.2% or less). Dimensionless trends of 0.075 per decade and larger are detected with type-II errors below 5% by all tests even for the strongest autocorrelations, with the exception of the regression test and the largest autocorrelations considered.
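Given test statistics simulated under an alternative hypothesis and a critical value from the corresponding null distribution, the type-II error can be estimated as sketched below. The function name is illustrative, and rejection is assumed to require a statistic strictly beyond the critical value, consistent with the discussion of unusable tests in section 5.

```python
import numpy as np

def type2_error(alt_stats, crit, upper_tail=True):
    """Percentage of realizations under the alternative in which H0 is not rejected."""
    s = np.asarray(alt_stats, dtype=float)
    # Upper-tail tests reject for large values, Walker and FDR for small values
    rejected = s > crit if upper_tail else s < crit
    return 100.0 * (1.0 - np.mean(rejected))

print(type2_error([1.0, 2.0, 3.0, 4.0], crit=2.5))                       # 50.0
print(type2_error([0.01, 0.2, 0.3, 0.4], crit=0.05, upper_tail=False))   # 75.0
```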

Table 6.

Type-II errors (%) for various values of spatial and temporal autocorrelation and various trend magnitudes as specific alternative hypotheses for selected tests. Crosses denote cases in which the test is unusable; see the text for more explanation.

We set the type-II error of 5% as a limit of detectability of a trend, analogously to the significance level (type-I error), which is also set to 5%. The trend magnitude corresponding to this limit can be determined by interpolation between values in Table 6. For example, the value of trend corresponding to the 5% detectability for the sign test for zero temporal autocorrelation and spatial autocorrelation of 0.97 is between 0.03 (for which type-II error is 8.9%) and 0.04 (type-II error is 1.3%). Using linear interpolation, the detectable trend, that is, the trend corresponding to type-II error of 5%, is approximated to be 0.0351 for this case. The values of trends detectable at the 5% level, estimated by linear interpolation, are displayed in Fig. 4 for a range of lengths of the series, two grid sizes, and four combinations of temporal and spatial autocorrelation.
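The interpolation in the example above can be checked with a few lines; the function name and argument names are illustrative.

```python
def detectable_trend(trend_lo, err_lo, trend_hi, err_hi, target=5.0):
    """Linearly interpolate the trend at which the type-II error equals target (%)."""
    frac = (err_lo - target) / (err_lo - err_hi)
    return trend_lo + frac * (trend_hi - trend_lo)

# Sign test, zero temporal autocorrelation, spatial autocorrelation of 0.97
print(round(detectable_trend(0.03, 8.9, 0.04, 1.3), 4))  # 0.0351
```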

The ability of all tests to detect a trend decreases as the time series shortens. For the high autocorrelations (0.6/0.97), even an extremely steep trend of one standard deviation per decade cannot be detected for the 30-yr series by any test. It is also clearly visible that trends are easier to detect on the larger grid.

For a given trend magnitude, the ability to detect a trend worsens (i.e., type-II error grows) as spatial and temporal autocorrelations grow. The sign, counting, and extended M-K tests are superior to the regression, Walker, and FDR tests for all autocorrelations considered, with the regression test tending to perform better than the Walker and FDR tests. The extended M-K test is comparable to the sign test in its performance for most configurations of autocorrelations, grid size, and series length; it tends to be slightly more powerful for higher spatial autocorrelations. The extended M-K and sign tests perform consistently better than the counting test, although by a relatively narrow margin. The performance of the FDR test is consistently but only slightly better than that of the Walker test, which is an expected finding given that the former is a generalization and improvement of the latter. The differences between tests tend to decrease with increasing autocorrelations. Note that for temporal autocorrelation of 0.6 and spatial autocorrelation of 0.97, performance of the Walker and FDR tests is shown for the local tests with effective sample size because it is the only variant available (all other variants of local tests are unusable because of the construction of the tests as discussed above).

The effect of the choice of the local test on the detectability of trends is small. It is illustrated for the counting test for zero spatial autocorrelation. The effect is similar but smaller for the Walker and FDR tests and for higher spatial autocorrelations (not shown for the sake of clarity of the graphs). The performance of the local t test appears to be slightly better than that of the Mann–Kendall test. Prewhitening results in a marginal improvement of performance, while the use of effective sample size leads to a slight deterioration. This is illustrated for the counting test in Figs. 4a and 4b, but applies to the Walker and FDR tests as well. For higher autocorrelations, the trend detectability is very similar for all the local tests.

In summary, Table 6 and Fig. 4 suggest that extended M-K and sign tests perform almost equally well, whereas the performance of the other tests is consistently worse. However, the applicability of the sign test is limited because it is unusable under very high spatial autocorrelations.

7. Application to real data

It is important to note that the tests were compared under assumptions that may not be met in reality. As an example, we take annual mean temperatures from HadCRUT4 surface temperature data (ensemble median from version HadCRUT.4.5.0.0; Morice et al. 2012) for the period 1961–2010 on a 20 × 20 grid with a 5° spacing, covering southeastern Asia, most of Australia, and adjacent parts of the Indian and Pacific Oceans. The grid extends from 32.5°S to 62.5°N and from 82.5° to 177.5°E. The reason for choosing this domain is that it is the only domain on Earth comprising 20 × 20 grid points where data in this dataset are almost complete for the given 50-yr period. As Fig. 5 indicates, the range of values of autocorrelations is considerable: temporal autocorrelations range from slightly negative values to almost 0.8, while spatial autocorrelations have a much more skewed distribution, with values from below 0.1 to over 0.98. The assumption of spatially homogeneous autocorrelations is clearly not met.

The assumption of an AR(1) process in both the spatial and temporal domains is not realistic either: temperature is known to exhibit long-range dependence (e.g., Lennartz and Bunde 2009; Mann 2011; Gil-Alana et al. 2019), whereas the assumption of isotropy of spatial autocorrelations is broken by the existence of teleconnections, that is, highly correlated but mutually distant regions, of which the Southern Oscillation is the most relevant one in southeastern Asia and Australia. The spatial autocorrelations are also likely to differ between the north–south and west–east directions, in particular if a regular latitude–longitude grid is used, in which the distance between grid points in the west–east direction decreases toward the pole. Whether the rating of methods remains valid under long-range persistence and in the presence of teleconnections, and more generally when other assumptions are not met, may be the topic of further work.

The idealized assumptions of the constancy of autocorrelations and variance and of the lack of long-range dependence in both time and space are necessary for the simulation study and comparison of tests to be feasible and tractable: analogous assumptions form the grounds of similar studies (e.g., Yue et al. 2002; Wilks 2006a, 2016). In practical applications it is advisable that the critical values be determined from empirical null distributions obtained by resampling the real data rather than taken from synthetic data created under idealized assumptions. The resampling must respect the properties of data relevant for the estimation of trends and their significance (e.g., McKinnon and Deser 2018), that is, temporal and spatial autocorrelations must be preserved in resampled data in particular.
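One common resampling scheme that retains spatial structure and short-range temporal dependence is a moving-block bootstrap of whole fields. The sketch below is only an illustration of that general idea and is not a procedure prescribed in this paper; the function name, the `block_len` parameter, and the choice of block length are assumptions, and a real application would need to choose the block length with the temporal autocorrelation of the data in mind.

```python
import numpy as np

def block_resample(fields, block_len, rng):
    """Resample whole fields in contiguous time blocks.

    fields: array of shape (n_time, n_points). Each resampled time step is a
    complete field, so spatial relationships are kept; consecutive steps within
    a block stay consecutive, so short-range temporal dependence is kept.
    """
    x = np.asarray(fields)
    n_time = x.shape[0]
    n_blocks = -(-n_time // block_len)               # ceiling division
    starts = rng.integers(0, n_time - block_len + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n_time]
    return x[idx]

rng = np.random.default_rng(1)
data = rng.standard_normal((50, 400))                # 50 years, 20 x 20 grid
resampled = block_resample(data, block_len=5, rng=rng)
print(resampled.shape)                               # (50, 400)
```

Applying a test statistic to many such resampled datasets would give an empirical null distribution, since the random reordering of blocks removes any long-term trend from the sequence.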

Despite the simplifying assumptions, results of this study provide reliable guidelines for the selection of an appropriate test of collective significance of trends.

8. Conclusions

This paper focuses on collective (global, field) significance of tests of local trends. Our objective was to introduce a novel simple procedure of significance testing for the presence of trend on a grid, which is based on counting signs of local trends (sign test), and compare it with other approaches to testing collective significance of trends (counting, extended Mann–Kendall, Walker, FDR, and regression tests).

First, the null distributions of trend statistics are constructed using synthetic data. The critical values of tests are then determined as corresponding quantiles of the null distributions. The performance of tests is quantified in terms of trend magnitudes for which the type-II error (i.e., the probability of failing to reject the null hypothesis when the alternative hypothesis is true) attains 5%: the lower the trend magnitude that is detectable with the given type-II error, the better the performance of the test. For lower values of spatial and temporal autocorrelations, the sign test and extended M-K test outperform the counting test by a narrow margin; these three tests are considerably more efficient than the Walker, FDR, and regression tests for the entire range of autocorrelations. The spread of performance of individual tests becomes narrower for high autocorrelations. Some tests become unusable under high spatial or temporal autocorrelations because their critical values are identical to the highest or lowest possible value of the null distribution. Taken together, the extended (multisite) version of the M-K test appears to be the best choice. The sign test proposed by us, which has the advantage of computational simplicity, is an almost equally appropriate choice whenever it is usable, that is, for low to moderate autocorrelations when its critical values remain below the maximum possible value of the test statistic. Another advantage of the sign test is the independence of its critical values from the length of the series.

An example of annual mean temperatures in HadCRUT4 data is used to demonstrate that real data are unlikely to comply with the assumptions on which the synthetic data are constructed and critical values determined. Therefore, real-world applications of the testing procedures will require data resampling (Monte Carlo) procedures to create null distributions and to determine critical values. Our results highlight the necessity of testing for collective significance in trend detection studies and should serve as guidance for the selection of the appropriate collective significance test.

Acknowledgments

This study was supported by the Czech Science Foundation, project 16-04676S. Tomáš Krauskopf kindly assisted with processing the real-world data. The authors thank Prof. Richard Vogel of Tufts University for insightful discussions about Eq. (10) and the anonymous reviewers whose comments led to substantially improved clarity of the paper.

REFERENCES

Alexander, L., and S. Perkins, 2013: Debate heating up over changes in climate variability. Environ. Res. Lett., 8, 041001, https://doi.org/10.1088/1748-9326/8/4/041001.
Benjamini, Y., and Y. Hochberg, 1995: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc., 57B, 289–300, https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
Beranová, R., and J. Kyselý, 2018: Trends of precipitation characteristics in the Czech Republic over 1961–2012, their spatial patterns and links to temperature and the North Atlantic Oscillation. Int. J. Climatol., 38, e596–e606, https://doi.org/10.1002/joc.5392.
Brázdil, R., K. Chromá, P. Dobrovolný, and R. Tolasz, 2009: Climate fluctuations in the Czech Republic during the period 1961–2005. Int. J. Climatol., 29, 223–242, https://doi.org/10.1002/joc.1718.
Brázdil, R., J. Mikšovský, P. Štěpánek, P. Zahradníček, L. Řezníčková, and P. Dobrovolný, 2019: Forcings and projections of past and future wind speed over the Czech Republic. Climate Res., 77 (1), 1–21, https://doi.org/10.3354/cr01540.
Burger, F., B. Brock, and A. Montecinos, 2018: Seasonal and elevational contrasts in temperature trends in central Chile between 1979 and 2015. Global Planet. Change, 162, 136–147, https://doi.org/10.1016/j.gloplacha.2018.01.005.
Caloiero, T., 2017: Trend of monthly temperature and daily extreme temperature during 1951–2012 in New Zealand. Theor. Appl. Climatol., 129, 111–127, https://doi.org/10.1007/s00704-016-1764-3.
Caloiero, T., R. Coscarelli, R. Gaudio, and G. P. Leonardo, 2019: Precipitation trend and concentration in the Sardinia region. Theor. Appl. Climatol., 137, 297–307, https://doi.org/10.1007/s00704-018-2595-1.
Capparelli, V., C. Franzke, A. Vecchio, M. P. Freeman, N. W. Watkins, and V. Carbone, 2013: A spatiotemporal analysis of U.S. station temperature trends over the last century. J. Geophys. Res. Atmos., 118, 7427–7434, https://doi.org/10.1002/jgrd.50551.
de Barros Soares, D., H. Lee, P. C. Loikith, A. Barkhordarian, and C. R. Mechoso, 2017: Can significant trends be detected in surface air temperature and precipitation over South America in recent decades? Int. J. Climatol., 37, 1483–1493, https://doi.org/10.1002/joc.4792.
Decremer, D., C. E. Chung, A. M. L. Ekman, and J. Brandefelt, 2014: Which significance test performs the best in climate simulations? Tellus, 66A, 23139, https://doi.org/10.3402/tellusa.v66.23139.
DelSole, T., and X. S. Yang, 2011: Field significance of regression patterns. J. Climate, 24, 5094–5107, https://doi.org/10.1175/2011JCLI4105.1.
Douglas, E. M., R. M. Vogel, and C. N. Kroll, 2000: Trends in floods and low flows in the United States: Impact of spatial correlation. J. Hydrol., 240, 90–105, https://doi.org/10.1016/S0022-1694(00)00336-X.
Dubrovský, M., R. Huth, H. Dabhi, and M. W. Rotach, 2020: Parametric gridded weather generator for use in present and future climates: Focus on spatial temperature characteristics. Theor. Appl. Climatol., 139, 1031–1044, https://doi.org/10.1007/s00704-019-03027-z.
Dumitrescu, A., R. Bojariu, M.-V. Birsan, L. Marin, and A. Manea, 2015: Recent climatic changes in Romania from observational data (1961–2013). Theor. Appl. Climatol., 122, 111–119, https://doi.org/10.1007/s00704-014-1290-0.
Easterling, D. R., T. R. Karl, E. H. Mason, P. Y. Hughes, D. P. Bowman, R. C. Daniels, and T. Boden, 1996: United States Historical Climatology Network (U.S. HCN). ORNL Rep. 30, NDP-019/R1, 214 pp.
El Kenawy, A., J. I. López-Moreno, and S. M. Vicente-Serrano, 2012: Trend and variability of surface air temperature in northeastern Spain (1920–2006): Linkage to atmospheric circulation. Atmos. Res., 106, 159–180, https://doi.org/10.1016/j.atmosres.2011.12.006.
Fathian, F., S. Morid, and E. Kahya, 2015: Identification of trends in hydrological and climatic variables in Urmia Lake basin, Iran. Theor. Appl. Climatol., 119, 443–464, https://doi.org/10.1007/s00704-014-1120-4.
Finkel, J. M., and J. I. Katz, 2018: Changing world extreme temperature statistics. Int. J. Climatol., 38, 2613–2617, https://doi.org/10.1002/joc.5342.
Gil-Alana, L. A., M. Monge, and M. F. Romero Rojo, 2019: Sea surface temperatures: Seasonal persistence and trends. J. Atmos. Oceanic Technol., 36, 2257–2266, https://doi.org/10.1175/JTECH-D-19-0090.1.
Gillett, N. P., R. J. Allan, and T. J. Ansell, 2005: Detection of external influence on sea level pressure with a multi-model ensemble. Geophys. Res. Lett., 32, L19714, https://doi.org/10.1029/2005GL023640.
Giorgi, F., 2002: Variability and trends of sub-continental scale surface climate in the twentieth century. Part I: Observations. Climate Dyn., 18, 675–691, https://doi.org/10.1007/s00382-001-0204-x.
Gonzalez-Hidalgo, J. C., D. Peña-Angulo, M. Brunetti, and N. Cortesi, 2016: Recent trend in temperature evolution in Spanish mainland (1961–2010): From warming to hiatus. Int. J. Climatol., 36, 2405–2416, https://doi.org/10.1002/joc.4519.
Griffiths, G. M., and Coauthors, 2005: Change in mean temperature as a predictor of extreme temperature change in the Asia-Pacific region. Int. J. Climatol., 25, 1301–1330, https://doi.org/10.1002/joc.1194.
Hawkins, E., and P. D. Jones, 2013: On increasing global temperatures: 75 years after Callendar. Quart. J. Roy. Meteor. Soc., 139, 1961–1963, https://doi.org/10.1002/qj.2178.
Hong, X. W., R. Y. Lu, and S. L. Li, 2017: Amplified summer warming in Europe–West Asia and Northeast Asia after the mid-1990s. Environ. Res. Lett., 12, 094007, https://doi.org/10.1088/1748-9326/aa7909.
Iskenderian, H., and R. D. Rosen, 2000: Low-frequency signals in midtropospheric submonthly temperature variance. J. Climate, 13, 2323–2333, https://doi.org/10.1175/1520-0442(2000)013<2323:LFSIMS>2.0.CO;2.
Jiménez-Muñoz, J. C., J. A. Sobrino, and C. Mattar, 2013: Has the Northern Hemisphere been warming or cooling during the boreal winter of the last few decades? Global Planet. Change, 106, 31–38, https://doi.org/10.1016/j.gloplacha.2013.02.010.
Jones, P. D., and A. Moberg, 2003: Hemispheric and large-scale surface air temperature variations: An extensive revision and an update to 2001. J. Climate, 16, 206–223, https://doi.org/10.1175/1520-0442(2003)016<0206:HALSSA>2.0.CO;2.
Jones, P. D., D. H. Lister, T. J. Osborn, C. Harpham, M. Salmon, and C. P. Morice, 2012: Hemispheric and large-scale land surface air temperature variations: An extensive revision and an update to 2010. J. Geophys. Res., 117, D05127, https://doi.org/10.1029/2011JD017139.
Jury, M., 2017: Spatial gradients in climatic trends across the southeastern Antilles 1980–2014. Int. J. Climatol., 37, 5181–5191, https://doi.org/10.1002/joc.5156.
Karl, T. R., and Coauthors, 2015: Possible artifacts of data biases in the recent global surface warming hiatus. Science, 348, 1469–1473, https://doi.org/10.1126/science.aaa5632.
Katz, R. W., and B. G. Brown, 1991: The problem of multiplicity in research on teleconnections. Int. J. Climatol., 11, 505–513, https://doi.org/10.1002/joc.3370110504.
Khaliq, M. N., T. B. M. J. Ouarda, P. Gachon, L. Sushama, and A. St-Hilaire, 2009: Identification of hydrological trends in the presence of serial and cross correlations: A review of selected methods and their application to annual flow regimes of Canadian rivers. J. Hydrol., 368, 117–130, https://doi.org/10.1016/j.jhydrol.2009.01.035.
Kučerová, M., C. Beck, A. Philipp, and R. Huth, 2017: Trends in frequency and persistence of atmospheric circulation types in COST733 classifications over Europe derived from a multitude of classifications. Int. J. Climatol., 37, 2502–2521, https://doi.org/10.1002/joc.4861.
Ledvinka, O., and A. Lamacova, 2015: Detection of field significant long-term monotonic trends in spring yields. Stochastic Environ. Res. Risk Assess., 29, 1463–1484, https://doi.org/10.1007/s00477-014-0969-1.
Lennartz, S., and A. Bunde, 2009: Trend evaluation in records with long-term memory: Application to global warming. Geophys. Res. Lett., 36, L16706, https://doi.org/10.1029/2009GL039516.
Livezey, R. E., and W. Y. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. Mon. Wea. Rev., 111, 46–59, https://doi.org/10.1175/1520-0493(1983)111<0046:SFSAID>2.0.CO;2.
Lu, Q. Q., R. Lund, and L. Seymour, 2005: An update of U.S. temperature trends. J. Climate, 18, 4906–4914, https://doi.org/10.1175/JCLI3557.1.
Mann, M. E., 2011: On long range dependence in global surface temperature series. Climatic Change, 107, 267–276, https://doi.org/10.1007/s10584-010-9998-z.
McKinnon, K. A., and C. Deser, 2018: Internal variability and regional climate trends in an observational large ensemble. J. Climate, 31, 6783–6802, https://doi.org/10.1175/JCLI-D-17-0901.1.
Moberg, A., and P. D. Jones, 2005: Trends in indices for extremes in daily temperature and precipitation in central and western Europe, 1901–99. Int. J. Climatol., 25, 1149–1171, https://doi.org/10.1002/joc.1163.
Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones, 2012: Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset. J. Geophys. Res., 117, D08101, https://doi.org/10.1029/2011JD017187.
Mullick, M. R. A., R. M. Nur, M. J. Alam, and K. M. A. Islam, 2019: Observed trends in temperature and rainfall in Bangladesh using pre-whitening approach. Global Planet. Change, 172, 104–113, https://doi.org/10.1016/j.gloplacha.2018.10.001.
Nicholls, N., 2001: The insignificance of significance testing. Bull. Amer. Meteor. Soc., 82, 981–986, https://doi.org/10.1175/1520-0477(2001)082<0981:CAATIO>2.3.CO;2.
Piccarreta, M., M. Lazzari, and A. Pasini, 2015: Trends in daily temperature extremes over the Basilicata region (southern Italy) from 1951 to 2010 in a Mediterranean climatic context. Int. J. Climatol., 35, 1964–1975, https://doi.org/10.1002/joc.4101.
Pokorná, L., M. Kučerová, and R. Huth, 2018: Annual cycle of temperature trends in Europe, 1961–2000. Global Planet. Change, 170, 146–162, https://doi.org/10.1016/j.gloplacha.2018.08.015.
Rahman, M. R., and H. Lateh, 2016: Spatio-temporal analysis of warming in Bangladesh using recent observed temperature data and GIS. Climate Dyn., 46, 2943–2960, https://doi.org/10.1007/s00382-015-2742-7.
Rahmstorf, S., G. Foster, and N. Cahill, 2017: Global temperature evolution: Recent trends and some pitfalls. Environ. Res. Lett., 12, 054004, https://doi.org/10.1088/1748-9326/aa6825.
Richman, M. B., 1986: Rotation of principal components. J. Climatol., 6, 293–335, https://doi.org/10.1002/joc.3370060305.
Rohrer, M., S. Bronnimann, O. Martius, C. C. Raible, and M. Wild, 2019: Decadal variations of blocking and storm tracks in centennial reanalyses. Tellus, 71A, 1586236, https://doi.org/10.1080/16000870.2019.1586236.
Scherrer, S. C., C. Appenzeller, M. A. Liniger, and C. Schär, 2005: European temperature distribution changes in observations and climate change scenarios. Geophys. Res. Lett., 32, L19705, https://doi.org/10.1029/2005GL024108.
Scherrer, S. C., E. M. Fischer, R. Posselt, M. A. Liniger, M. Croci-Maspoli, and R. Knutti, 2016: Emerging trends in heavy precipitation and hot temperature extremes in Switzerland. J. Geophys. Res. Atmos., 121, 2626–2637, https://doi.org/10.1002/2015JD024634.
Serrano, A., J. A. García, V. L. Mateos, M. L. Cancillo, and J. Garrido, 1999: Monthly modes of variation of precipitation over the Iberian Peninsula. J. Climate, 12, 2894–2919, https://doi.org/10.1175/1520-0442(1999)012<2894:MMOVOP>2.0.CO;2.
Soulis, E. D., A. Sarhadi, M. Tinel, and M. Suthar, 2016: Extreme precipitation time trends in Ontario, 1960–2010. Hydrol. Processes, 30, 4090–4100, https://doi.org/10.1002/hyp.10969.
Spinoni, J., and Coauthors, 2015a: Climate of the Carpathian region in the period 1961–2010: Climatologies and trends of 10 variables. Int. J. Climatol., 35, 1322–1341, https://doi.org/10.1002/joc.4059.
Spinoni, J., G. Naumann, J. Vogt, and P. Barbosa, 2015b: European drought climatologies and trends based on a multi-indicator approach. Global Planet. Change, 127, 50–57, https://doi.org/10.1016/j.gloplacha.2015.01.012.
Tabari, H., M. T. Taye, and P. Willems, 2015: Statistical assessment of precipitation trends in the upper Blue Nile River basin. Stochastic Environ. Res. Risk Assess., 29, 1751–1761, https://doi.org/10.1007/s00477-015-1046-0.
Tao, H., T. Fischer, B. D. Su, W. Y. Mao, T. Jiang, and K. Fraedrich, 2017: Observed changes in maximum and minimum temperatures in Xinjiang autonomous region, China. Int. J. Climatol., 37, 5120–5128, https://doi.org/10.1002/joc.5149.
Unkašević, M., and I. Tošić, 2013: Trends in temperature indices over Serbia: Relationships to large-scale circulation patterns. Int. J. Climatol., 33, 3152–3161, https://doi.org/10.1002/joc.3652.
Ventura, V., C. J. Paciorek, and J. S. Risbey, 2004: Controlling the proportion of falsely rejected hypotheses when conducting multiple tests with climatological data. J. Climate, 17, 4343–4356, https://doi.org/10.1175/3199.1.
Vincent, L. A., X. Zhang, R. D. Brown, Y. Feng, E. Mekis, E. J. Milewska, H. Wan, and X. L. Wang, 2015: Observed trends in Canada’s climate and influence of low-frequency variability modes. J. Climate, 28, 4545–4560, https://doi.org/10.1175/JCLI-D-14-00697.1.
Vose, R. S., S. Applequist, M. J. Menne, C. N. Williams Jr., and P. Thorne, 2012: An intercomparison of temperature trends in the U.S. Historical Climatology Network and recent atmospheric reanalyses. Geophys. Res. Lett., 39, L10703, https://doi.org/10.1029/2012GL051387.
Wilks, D. S., 2006a: On “field significance” and the false discovery rate. J. Appl. Meteor. Climatol., 45, 1181–1189, https://doi.org/10.1175/JAM2404.1.
Wilks, D. S., 2006b: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp.
Wilks, D. S., 2016: “The stippling shows statistically significant grid points”: How research results are routinely overstated and overinterpreted, and what to do about it. Bull. Amer. Meteor. Soc., 97, 2263–2273, https://doi.org/10.1175/BAMS-D-15-00267.1.
Yue, S., and C. Y. Wang, 2002a: Applicability of prewhitening to eliminate the influence of serial correlation on the Mann–Kendall test. Water Resour. Res., 38, 1068, https://doi.org/10.1029/2001WR000861.
Yue, S., and C. Y. Wang, 2002b: Regional streamflow trend detection with consideration of both temporal and spatial correlation. Int. J. Climatol., 22, 933–946, https://doi.org/10.1002/joc.781.
Yue, S., and M. Hashino, 2003: Temperature trends in Japan: 1900–1996. Theor. Appl. Climatol., 75, 15–27, https://doi.org/10.1007/s00704-002-0717-1.
Yue, S., and C. Y. Wang, 2004: The Mann–Kendall test modified by effective sample size to detect trend in serially correlated hydrological series. Water Resour. Manage., 18, 201–218, https://doi.org/10.1023/B:WARM.0000043140.61082.60.
Yue, S., P. Pilon, B. Phinney, and G. Cavadias, 2002: The influence of autocorrelation on the ability to detect trend in hydrological series. Hydrol. Processes, 16, 1807–1829, https://doi.org/10.1002/hyp.1095.

Testing for Trends on a Regional Scale: Beyond Local Significance

Abstract

1. Introduction