Volume 43, Issue 6 p. 2678-2698
RESEARCH ARTICLE

On using self-organizing maps and discretized Sammon maps to study links between atmospheric circulation and weather extremes

Jan Stryhal

Corresponding Author

Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czech Republic

Correspondence

Jan Stryhal, Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czech Republic.

Email: [email protected]

Contribution: Conceptualization, Data curation, Formal analysis, Methodology, Visualization, Writing - original draft, Writing - review & editing

Eva Plavcová


Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czech Republic

Contribution: Conceptualization, Funding acquisition, Project administration, Supervision, Visualization, Writing - review & editing

First published: 06 January 2023

Abstract

Classifications by self-organizing maps (SOMs) have been used to study links between extreme weather events and atmospheric circulation. However, SOMs often provide an image of circulation that is too generalized to properly resolve meteorological extremes. Several approaches have been used to circumvent this limitation, including an increase in the number of nodes and/or a relaxation of the stress on the topological ordering of nodes through adjusting the neighbourhood radius. Here, we (1) test whether weighting training fields by the extremity of their circulation can lead to SOMs that better represent circulation extremes without impairing the topological ordering of nodes, and (2) investigate whether Sammon maps (SM), which have so far been limited to the validation of SOMs in climatology, can provide a viable alternative to SOMs in studying weather extremes. We show that removing a certain percentage of circulation fields closest to the climatological mean has a greater effect on the representation of circulation extremes than more than doubling the SOM array size. Additionally, this step allows for decreasing the neighbourhood radius, the parameter responsible for topological ordering, which would not otherwise be possible without impairing the topology of the map. The SOMs are compared with two newly developed methods that discretize SMs based on arbitrarily chosen and distribution-based bins, respectively. We demonstrate the usefulness of both methods in climate model validation and show that both capture circulation variability in its extremes in more detail than the SOMs. The distribution-based method does not suffer from sample-size issues and outperforms all other classifications in extracting ERA5 patterns leading to extreme temperature minima over central Europe.
Our results suggest that classifications based on SM could be a viable alternative to SOMs when SOMs lack the ability to discriminate circulation outliers leading to temperature extremes.

1 INTRODUCTION

The study of weather extremes often requires linking weather elements to synoptic- and/or large-scale atmospheric circulation, especially in extra-tropical regions, where the links tend to be particularly strong (Horton et al., 2015). Many methods have been developed within the fields of synoptic and dynamic climatology to provide a simplified image of the variable atmospheric circulation and to study its links to weather elements. Nonetheless, not all of them are useful in studying weather extremes, because the output of many of the methods is too general to be representative of the extreme circulation fields that typically lead to weather extremes (Gibson et al., 2017). One approach particularly prone to this limitation is automated classification of atmospheric circulation (hereafter referred to as "circulation classifications"), which has become very popular since the simultaneous boom in atmospheric modelling and computation technology. This approach relies on computer-assisted processing of circulation data that minimizes—though does not completely eliminate—the tedious and subjective decision making on the part of the researcher, unavoidable in the traditionally used manual catalogues of weather types developed mainly for weather prediction (Huth et al., 2008).

Many circulation classification methods have been developed so far; their overview and inter-comparison was provided within the COST 733 Action (Huth et al., 2008; Philipp et al., 2016). The output of a classification is a catalogue of representative circulation patterns, or circulation types (CTs) for short, and a calendar linking each input field (e.g., a daily mean sea level pressure map) to one—the most similar—of these CTs. The resulting classification can then be used to link circulation to a meteorological extreme simply by evaluating how each individual CT conditions the extreme (e.g., its conditional probability of occurrence), representing a "bottom-up" approach (Wolski et al., 2018). However, since the automated process is typically unsupervised, a researcher has only a few options—via the specification of classification parameters—to influence the definition of CTs that could be directly linked to extreme events. One way to circumvent this drawback is to base the classification process on indices that are known to be relevant to the variability of an element, such as the strength and direction of airflow or the vorticity area used in the Jenkinson-Collison method (Jones et al., 2013), which was originally designed to capture circulation variability over the British Isles. Naturally, these methods—also called hybrid methods, because some expert input is required to define the representative CTs although they otherwise rely on computer assistance—cannot be blindly applied in climate zones other than those for which they were developed.

To circumvent the limitations of the bottom-up approach, a "top-down" approach (Wolski et al., 2018) is often used instead, in which one first finds all episodes of an extreme and subsequently analyses the circulation fields occurring during those episodes (e.g., Loikith and Broccoli, 2015; Aragon et al., 2020). The top-down approach is particularly useful in scrutinizing one or a few events of one particular extreme, but it has its own limitations. For example, the absence of an image of overall circulation variability does not allow for studying the synoptic paths leading to an extreme, since these include "non-extreme" days. It is also problematic to study multiple types of extremes in parallel (including compound extremes) and/or multiple datasets in which the circulation leading to the extreme may differ considerably from one dataset to another.

However, rather than relying on classifications, the specific needs of research on extremes can also be addressed by an alternative paradigm of feature extraction, namely exploratory data projection. Exploratory data projection consists in the simplification and visualization of high-dimensional data and aims at a better understanding of the data structure (Lerner et al., 1998). Projection has been used in synoptic climatology in many forms; principal component analysis (PCA) is probably the most important one. PCA provides a linear representation of multi-variable spatial–temporal circulation datasets by only a few derived orthogonal variables, or modes of variability, that maximize the explained variance and typically have a rather strong link to those weather elements that are conditioned by circulation. The projection of data onto the phase space spanned by one (in the form of a teleconnection index) or more of these modes of variability can be used as a method to study linear data structures. PCA can also be used as a data pre-processing tool (e.g., Philipp et al., 2007), for regionalization (e.g., Gibson et al., 2020) and—assuming a correct configuration of input data—even as a means of classification itself (e.g., Huth, 2000; Compagnucci and Richman, 2008).

The idea of classifying data by discretizing projections of high-dimensional circulation data onto a low-dimensional phase space is, nevertheless, not limited to PCA. Another branch of classifications, the so-called hybrid methods (Philipp et al., 2016), is based on the same principle. The above-mentioned Jenkinson-Collison method and the classification of Großwettertypes (Beck et al., 2007) project data onto predefined indices or idealized patterns and then classify the data according to certain predefined threshold criteria. Since the indices are quite simple and refer to well-known latent features of the complex circulation fields, the output can often be visualized and easily understood even without plotting the spatial patterns. On the other hand, similarly to PCA, not all defined indices can typically be viewed at the same time, which may pose some difficulties. With or without the subsequent classification, the Jenkinson-Collison circulation indices have been extensively used to visualize complex relationships, for example in meta-analyses of findings obtained by a number of classifications (Stryhal and Huth, 2019), in synoptic-scale classifications of multiple spatial domains (Otero et al., 2018), and in an evaluation of the synoptic forcing of compound extreme events associated with increased mortality (Plavcová and Urban, 2020), among many others.

Various non-linear alternatives to linear projection have been developed in order to account for the nonlinear structures of complex data and improve their embedding in low-dimensional spaces. However, applications of most of these methods have so far been scarce in climatology—in contrast to both linear projection and non-linear classification methods. One notable exception is the method of self-organizing maps (SOMs; Kohonen, 2001; Sheridan and Lee, 2011). The goal of SOMs is to map the input data onto a (typically two-dimensional) projection space formed by a regular array of nodes, which approximates the topological structure of the data space. SOMs are therefore not a pure projection method, as they also discretize the output by mapping individual data points onto the nodes, thus providing a classification (Ripley, 1996).

In a way, SOMs combine the paradigms of exploratory data projection and classification, potentially benefitting from both. However, the requirements for a good representation of data topology and a good separation into clusters—that is, for a high-quality informative projection and a high-quality classification—cannot be fulfilled simultaneously. Using SOMs in applications without a need for the ordering of clusters may therefore lead to suboptimal results compared with methods that optimize data separation, such as cluster analysis (e.g., Webb and Copsey, 2002; Jiang et al., 2015; Gibson et al., 2017). Such applications are, however, rare, since the topological ordering of CTs is the greatest benefit of SOMs, making interpretation more manageable, especially when larger SOMs are trained to provide a better image of extreme circulation.

The limitations of SOMs in research on extremes have been tackled in several ways. Arguably the most common consists in the above-mentioned top-down approach, in which only the circulation fields co-occurring with an extreme are used to train a SOM. This removes most of the circulation variability, and relatively small SOMs then suffice to capture the variability of an extreme. For example, Aragon et al. (2020) used a 4 × 3 SOM to study heavy-precipitation days, and Loikith and Broccoli (2015) captured temperature extremes in only a 3 × 3 SOM. On the other hand, when SOMs are trained in complete datasets, larger SOMs may hypothetically be necessary to represent the extremes. Ohba et al. (2015) used a 12 × 12 SOM to identify and subsequently cluster nodes that link circulation to extreme rainfall. Cassano et al. (2015) suggest that larger SOMs better represent extremes and that the ideal size depends on the regional domain; Gibson et al. (2017), furthermore, point to the dependence on the type of extreme considered. Last but not least, the role of some specific SOM parameters, mainly the neighbourhood radius (more details in Section 2.4), has been considered by a few studies (Jiang et al., 2015; Gibson et al., 2017). The results clearly show that decreasing the value of the radius in later iterations of SOM training (or even effectively using the nodes as seeds for a cluster analysis that follows the training of the SOM) is preferable if one's goal is to link the nodes to extreme events. This, however, decreases the topological ordering of nodes (Gibson et al., 2017), especially in SOMs with larger arrays (Jiang et al., 2015).

The method of non-linear mapping, more often called Sammon mapping (SM) after Sammon (1969), is one of the most popular non-linear projection and dimensionality-reduction approaches. SM belongs to the so-called metric multi-dimensional scaling (MDS) methods, which comprise algorithms that take a distance metric between objects in a high-dimensional metric space and construct its low-dimensional Euclidean representation. The high-dimensional objects are embedded in a low-dimensional latent Cartesian space (typically two dimensions are used to aid visualization) in a way that minimizes a certain stress function. The function quantifies the differences between the high- and low-dimensional distances between pairs of objects; in other words, it points to the distortion of data topology due to the scaling (Lee and Verleysen, 2007).

Unlike PCA, which preserves variance, SM aims to preserve (typically Euclidean) distances among input circulation fields. Another distinction is that the Sammon representation is non-linear; that is, the latent dimensions, or axes, of the Sammon space cannot be represented by a linear combination of the original variables (Lerner et al., 1998). Moreover, despite its name, the method does not provide a mapping function that would allow for projecting objects on existing maps (Yin, 2003). Therefore, in order to embed data in a Sammon map, the data must be included in the training sample.

To our knowledge, the use of Sammon mapping in atmospheric sciences has so far been limited to visualizing SOM topology (e.g., Sheridan and Lee, 2011). Other MDS methods have also been used rather scarcely, although compared with SM their application has been somewhat more diverse, including dimensionality reduction prior to cluster analysis (Kremer et al., 2020), comparison of methods (Bürger et al., 2013) and datasets (Gibson et al., 2019), and visualizing spatial and temporal differences in temperature and precipitation climatologies (Evtimov and Ivanov, 2007; Rahman et al., 2018).

In the present study, we focus on (1) several issues regarding the applicability of SOMs, and topology-preserving classifications in general, for studying extremes of circulation and their links to meteorological extremes. We investigate whether weighting of input fields based on extremity of their circulation leads to SOMs that better resolve extreme circulation fields, especially if combined with an optimization of the neighbourhood radius parameter, without losing the topological ordering of nodes even if larger array sizes (up to 120 CTs) are chosen. SM is used to demonstrate effects of various methodological choices on SOM arrays. (2) We present two methods that avoid using SOMs to achieve a topology-preserving classification but comprise discretization of Sammon maps instead, and we compare their performance to SOMs. The methodology is schematically depicted in Figure 1b.

FIGURE 1 (a) The area of interest with ERA5 orography. Mean SLP fields were evaluated within the broken-lined box; mean TMIN was calculated for the solid-lined box. (b) Schematic description of the methodology. The order of calculations is indicated by numbers

The paper is organized as follows: Datasets and methodology are described in Section 2. In Section 3, seven SOMs and two SM-based classifications are applied to outputs of the ERA5 reanalysis and a randomly selected regional climate model (RCM) output for central Europe. Sections 4 and 5 provide the discussion and conclusions, respectively.

2 DATA AND METHODS

2.1 Data

The primary dataset, in which SM and SOMs are trained, is the ERA5 atmospheric reanalysis (Hersbach et al., 2020). We use two variables – “mean sea level pressure” (SLP) from the analysis project and “minimum temperature (TMIN) at 2 m since previous post-processing” from the forecasts. Both variables have temporal and spatial resolution of 1 h and 0.25°, respectively, and can be accessed via https://cds.climate.copernicus.eu/. While SLP fields are instantaneous, TMIN is calculated over the previous hour as the minimum of five values (the native resolution of the model is 12 min).

The primary variables are further processed. First, daily mean SLP fields are calculated for the domain of central Europe shown in Figure 1a (broken-line box), which corresponds to the “D07” domain defined in Philipp et al. (2016). Second, calendar daily areal TMIN are calculated for a smaller region in the centre of the larger domain (spanning 48°–52° N and 10°–19° E, Figure 1a). This region excludes high-altitude mountain ranges and therefore has a relatively homogenous climate. Only data from the extended winter season (November to March) are analysed.

Furthermore, daily output of the CCLM RCM driven by the historical run of the HadGEM global climate model, from the EURO-CORDEX (COordinated Regional climate Downscaling EXperiment) project (Jacob et al., 2014) was retrieved and processed identically to the reanalysis data.

Since the reanalysis (at present available only since 1979) and the RCM historical run (1950–2005) cover different periods, only the overlapping period of January 1979 to December 2005 was selected for the comparison. However, to train the SM and SOMs, the whole available ERA5 time series (1979–2017) was utilized.

2.2 Sammon mapping

SM (Sammon, 1969) is a method that provides a non-linear representation of high-dimensional objects, in our case the daily mean SLP fields, in a low-dimensional Cartesian space, or Sammon map. We chose two dimensions here to aid visual interpretation. The method finds such low-dimensional representation for which the following error, or stress, function is minimized:
E = \frac{1}{\sum_{i<j} d_{ij}^{*}} \sum_{i<j} \frac{\left(d_{ij}^{*} - d_{ij}\right)^{2}}{d_{ij}^{*}}, \quad (1)

where d_{ij}^{*} (d_{ij}) denotes the Euclidean distance between the ith and jth SLP fields in the original (Sammon) space. The core of the function is the squared error (d_{ij}^{*} - d_{ij})^{2}, which quantifies the distortion of data topology due to the mapping. Furthermore, the algorithm uses (d_{ij}^{*})^{-1} as a weight factor, which warrants a better representation of the distances between similar input fields, that is, of the local structure. This factor is specific to SM and distinguishes it from other metric MDS methods (Lee and Verleysen, 2007). Last, the sum of input distances is used as a normalizing factor.
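As a concrete illustration, Equation (1) can be evaluated from two pairwise-distance matrices in a few lines of numpy. This is a minimal sketch for clarity, not the code used in the study (the study works in R):

```python
import numpy as np

def sammon_stress(D_high, D_low):
    """Sammon stress E between two pairwise-distance matrices.

    D_high: Euclidean distances d*_ij between the original fields.
    D_low:  Euclidean distances d_ij between their Sammon projections.
    """
    iu = np.triu_indices_from(D_high, k=1)   # all pairs with i < j
    d_star, d = D_high[iu], D_low[iu]
    # squared errors weighted by 1/d*_ij, normalized by the sum of d*_ij
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)
```

A perfect embedding (identical distance matrices) yields zero stress; any distortion of the pairwise distances increases E, with errors on small distances penalized most.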

The function "sammon" of the R package MASS (Venables and Ripley, 2002) was employed to obtain the mapping. Unlike the decomposition of correlation/covariance matrices by PCA, the transformation of the distance matrix via SM does not have an analytical solution; therefore, a numerical optimization method must be used that iteratively searches for the optimal low-dimensional representation. This particular algorithm uses the diagonal Newton method. The mapping also requires an initial configuration to be specified. One option is a random seed; however, that would require repeating the calculation several times to avoid sub-optimal maps. Here, classical MDS (principal coordinates analysis) was chosen instead. It provides a good initial configuration and, compared with a random seed, also ensures that the SM can be exactly replicated at any time. For more information on optimization and initialization, the reader is referred to Lee and Verleysen (2007, p. 258) and the MASS package documentation (Ripley, 2022).
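The classical-MDS initialization mentioned above can be sketched as follows (a numpy illustration of principal coordinates analysis, not the MASS internals): square and double-centre the distance matrix, eigendecompose it, and scale the leading eigenvectors:

```python
import numpy as np

def classical_mds(D, k=2):
    """Principal coordinates analysis: a deterministic k-dimensional
    starting configuration from a matrix of pairwise Euclidean
    distances D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]         # top-k components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

Because this solution is deterministic, using it as the starting configuration makes the subsequent Sammon iterations exactly replicable, which is the property the text emphasizes.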

For our chosen two-dimensional representation, the function returns a list of pairs of coordinates that localize all fields in the trained SM. Figure 2 shows the SM of the 5,899 ERA5 fields over 1979–2017 extended-winter seasons.

FIGURE 2 Sammon map of ERA5 central European daily SLP fields (1979–2017). Individual fields (grey dots) are embedded in the two-dimensional Sammon space, formed by two orthogonal axes. The units are not defined, but the Euclidean distance between any two points approximates the Euclidean distance between the respective high-dimensional fields (in hPa). The box highlights the area of SM shown in Figure 6

2.3 Embedding of model data in the Sammon map

As mentioned above, SM provides no mapping function by which new data could be projected onto an existing map (Yin, 2003). At the same time, the iterative training process is computationally too demanding to allow for very large training datasets. In our case, pooling the reanalysis and model data together prior to training would be manageable on a personal computer; however, this approach would not be feasible for larger model ensembles and, therefore, we avoid it. Nor can one train a SM independently in each individual dataset, because the resulting maps would not be comparable. Here, the following approach is chosen to embed model fields in the ERA5 map:
  1. The CCLM dataset consisting of 4,050 fields is split chronologically into subsets limited to 100 fields, which results in 40 sets of 100 fields and one of 50 fields.
  2. 41 pooled datasets are formed, each of which combines one CCLM subset with all (5,899) ERA5 fields.
  3. A SM is calculated independently for each of the new datasets, coordinates of CCLM fields are retrieved, and a complete CCLM time series reconstructed.

Experimentally, we found that inclusion of 100 model fields alters the coordinates of ERA5 fields on average by <1 unit in each direction relative to the map trained only in ERA5, which is negligible given the size of the Sammon space (see Figure 2) and its intended use. Inclusion of additional datasets would increase the overall time complexity linearly, as opposed to the quadratic increase due to inflation of the training sample.
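The chunked embedding described in steps 1–3 can be sketched schematically in Python, with a hypothetical `sammon_fn` standing in for the actual Sammon mapping (MASS::sammon in the study); the chunk size of 100 follows the text:

```python
import numpy as np

def chunked_embedding(model_fields, era5_fields, sammon_fn, chunk=100):
    """Embed model fields in the reference (ERA5) Sammon map chunk by
    chunk: each call to sammon_fn pools one model chunk with the full
    reference sample, and only the coordinates of the model rows are
    kept. sammon_fn(X) -> (n, 2) array of Sammon coordinates."""
    coords = []
    for start in range(0, len(model_fields), chunk):
        block = model_fields[start:start + chunk]
        pooled = np.vstack([era5_fields, block])
        Y = sammon_fn(pooled)
        coords.append(Y[len(era5_fields):])   # keep model rows only
    return np.vstack(coords)                  # reconstructed time series
```

For the 4,050 CCLM fields and a chunk size of 100, this yields the 41 pooled mappings of the text (40 chunks of 100 fields plus one of 50), and the cost grows linearly with the number of additional datasets rather than quadratically with a pooled training sample.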

2.4 Discretization of a Sammon map

A discretization of the Sammon map is achieved in two specific ways, which are hereafter referred to as “equal-area” (EASM) and “equal-frequency” (EFSM).

The EASM approach utilizes solely the dimensions and units of a Sammon map. It requires that (1) an approximate number of bins be selected a priori, (2) the boundaries of the bins be established in terms of SM units, and (3) the data be classified into the bins. We chose an orthogonal grid with a step of 100 hPa in each Sammon direction (Figure 3) and open outer boundaries, which yields 98 bins that fully cover the ERA5 SM. The open boundaries ensure that extreme circulation fields mapped outside of the ERA5 projection space, should they occur in CCLM, can still be classified. In Figure 4a, centroids are plotted for all bins with non-zero frequency; the centroid patterns are calculated by averaging all SLP fields whose projections fall within the given bin.
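The grid assignment with open outer boundaries can be sketched as follows. This is an illustrative Python version only: the grid origin, the 14 × 7 layout (which gives the 98 bins of the text), and the bin coding are assumptions, not the study's actual configuration:

```python
import numpy as np

def easm_bin(x, y, step=100.0, nx=14, ny=7, x0=-700.0, y0=-350.0):
    """Assign Sammon coordinates to cells of an nx-by-ny grid with open
    outer boundaries: points falling outside the grid are clipped into
    the edge cells, so extreme fields mapped beyond the ERA5 projection
    space remain classifiable. Grid origin (x0, y0) and the 14 x 7 = 98
    cells are illustrative values, not the paper's exact layout."""
    ix = np.clip(np.floor((np.asarray(x) - x0) / step).astype(int), 0, nx - 1)
    iy = np.clip(np.floor((np.asarray(y) - y0) / step).astype(int), 0, ny - 1)
    return iy * nx + ix                      # single bin code 0..97
```

Classification of a new (e.g., CCLM) field then reduces to two floor divisions and a clip, with no reference to the data distribution.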

FIGURE 3 Equal-area Sammon map discretization (EASM) of ERA5 SLP fields (1979–2017, see Figure 2): the lines show boundaries of 98 bins of size of 100 × 100 hPa, the numbers indicate codes of selected bins, shade of symbols shows absolute frequency of bins in more detail (four categories at 50 × 50 hPa per bin)
FIGURE 4 Discretization of ERA5 Sammon map (1979–2017): centroid patterns over 48°–52°N and 10°–19°E. (a) Equal-area Sammon map (EASM) discretization method. For parameters and ordering of bins see Figure 3. (b) Equal-frequency Sammon map (EFSM) discretization method. Sammon projections of cluster centroids are indicated by dots. Not all spatial patterns are shown

The EFSM method is based on a two-step procedure that (1) splits the Sammon space into several sectors and (2) splits each sector into bins based on the distribution of data points within it. The process is illustrated in Figure 5a–c for quadrant I of the SM (see Figure 5d) and consists of the following steps: First, we find the slopes of the two lines that pass through the centre of the SM and split the quadrant into three equally populated sectors. Note that the slopes are not simply multiples of 1/3, because the variance of the data along the two Sammon axes differs. Second, the data points in each sector are classified into bins according to deciles of the Euclidean distance between the data points and the centre of the SM. This approach would be adequate for Sammon maps of approximately spherical shape. In our case, however, the considerable eccentricity of the ERA5 SM results in some bins being highly elongated and, consequently, containing rather heterogeneous fields (see the outer bins of sector 3 in Figure 5b). Therefore, we split sector 3 (as well as the other three sectors neighbouring the vertical axis in the other quadrants) into halves and utilize quintiles instead of deciles to obtain equal-size bins (Figure 5c). Sammon projections of the centroids of the resulting 120 bins are shown in Figure 5d, and their spatial patterns in Figure 4b.
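The two-step idea—equal-frequency angular sectors, then equal-frequency radial bins—can be sketched in a simplified form. Note this is not the paper's exact construction (which builds three sectors per quadrant and refines the elongated ones); here sectors are taken uniformly as quantiles of the polar angle, which also yields 120 equally populated bins:

```python
import numpy as np

def efsm_bins(x, y, n_sectors=12, n_radial=10):
    """Simplified equal-frequency discretization of a 2-D map: split
    points into angular sectors of (near-)equal population via
    quantiles of the polar angle, then split each sector by quantiles
    of the distance from the map centre. Returns integer bin labels
    0 .. n_sectors * n_radial - 1."""
    x, y = np.asarray(x), np.asarray(y)
    theta = np.arctan2(y, x)                 # polar angle of each point
    r = np.hypot(x, y)                       # distance from the centre
    edges = np.quantile(theta, np.linspace(0, 1, n_sectors + 1))
    sector = np.clip(np.searchsorted(edges, theta, side="right") - 1,
                     0, n_sectors - 1)
    labels = np.empty(len(x), dtype=int)
    for s in range(n_sectors):
        m = sector == s
        q = np.quantile(r[m], np.linspace(0, 1, n_radial + 1))
        radial = np.clip(np.searchsorted(q, r[m], side="right") - 1,
                         0, n_radial - 1)
        labels[m] = s * n_radial + radial
    return labels
```

Because every boundary is a quantile of the data, each bin receives roughly the same number of fields regardless of the eccentricity of the map, which is the property that makes the EFSM approach robust to sample-size issues.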

FIGURE 5 Equal-frequency Sammon map (EFSM) discretization of ERA5 SLP fields (1979–2017). (a) Separation of quadrant I into sectors 1–3 of equal frequency. (b) Classification of data points in sectors into bins based on deciles of their Euclidean distance from the centre of the SM. Black dots show centroids of bins. (c) As in (b) except for splitting sector 3 into halves and using quintiles of Euclidean distance of data points to classify the resulting two sectors. (d) Position of 120 bins. For explanation of axes see Figure 2. Shading is used only to distinguish sectors/bins from each other

2.5 Training and assessment of SOMs

The "kohonen" R package (Wehrens and Buydens, 2007; Wehrens and Kruisselbrink, 2018) was used to train the SOMs. Training a SOM is an iterative and unsupervised process. At the start, n starting types, or vectors—in our case SLP fields—are randomly selected from the dataset to initialize the SOM array; n also equals the resultant number of circulation types. Subsequently, each daily pattern is, step by step and over many iterations, projected onto the array, and the positions of the types that are close (in terms of Euclidean distance) to the pattern are adjusted. The adjustment depends on three parameters that have to be specified in advance. First, the neighbourhood radius (hereafter radius for short) defines the size of the affected area around the projected data point. Second, the neighbourhood function defines the mode of the adjustment, and third, the learning rate specifies how strongly the types within the affected area are attracted to the projected data point. The Gaussian neighbourhood function was chosen here. The values of the remaining two parameters are typically set to decrease with each iteration step, which allows the overall topological structure of the dataset to be captured first and the resulting classification to be fine-tuned during the later iteration steps.
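A single online-SOM training step with a Gaussian neighbourhood can be sketched as follows. This is a generic textbook illustration in Python, not the internals of the kohonen package:

```python
import numpy as np

def som_update(nodes, grid_xy, field, radius, rate):
    """One online-SOM step: find the best-matching node (BMU) for
    `field`, then pull every node toward the field, weighted by a
    Gaussian of its grid distance to the BMU.

    nodes:   (n, d) array of node vectors (here, SLP fields)
    grid_xy: (n, 2) array positions of the nodes in the SOM sheet
    radius:  neighbourhood radius; rate: learning rate
    """
    bmu = np.argmin(np.linalg.norm(nodes - field, axis=1))   # winner node
    g = np.linalg.norm(grid_xy - grid_xy[bmu], axis=1)       # grid distance
    h = np.exp(-(g ** 2) / (2 * radius ** 2))                # Gaussian kernel
    return nodes + rate * h[:, None] * (field - nodes)
```

In full training, `radius` and `rate` would decrease linearly between their preset initial and terminal values over the iterations of each pass, as in Table 1; a small terminal radius sharpens the separation of nodes at the cost of topological ordering.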

Several parameters of SOMs were identified as decisive when training SOMs applicable in studying weather extremes, based on both literature review and our initial sensitivity study. Namely, it is the size (number of CTs) and shape (number of rows and columns) of a SOM and the neighbourhood radius, especially its terminal value. In Table 1, we list seven SOMs that are analysed in our study, including their parameters. Their choice is discussed in detail in Section 4. Training of each SOM was split into two passes. Pass 1 is initialized randomly from training fields and comprises 1,000 iterations over which rough ordering of nodes is obtained. Pass 2 is initialized from the output of pass 1 and consists of 3,000 iterations resulting in the final solution. Spatial patterns and Sammon projections of all four SOMs with 5 × 6 array (SOMs 1–4) are shown in Figure 6. For the three larger SOMs 5–7, only their Sammon projections are shown in Figure 7. Note that coordinates of SOM nodes were obtained identically to coordinates of model fields, that is, by pooling ERA5 fields and SOM patterns together prior to SM training (see Section 2.3 for details). The calculation was carried out independently for each SOM.

TABLE 1. SOMs trained for the study

| SOM code | SOM size | Data^a | Learning rate (pass 1)^b | Learning rate (pass 2)^b | Neighbourhood radius (pass 1)^b | Neighbourhood radius (pass 2)^b | Conditional quantization error^c |
|----------|----------|--------|--------------------------|--------------------------|---------------------------------|---------------------------------|----------------------------------|
| SOM1     | 5 × 6    | All    | 0.05–0.02                | 0.02–0.01                | 5.00–1.70                       | 1.50–1.00                       | 100.0                            |
| SOM2     | 5 × 6    | −33%   | 0.05–0.02                | 0.02–0.01                | 5.00–1.70                       | 1.50–1.00                       | 90.5                             |
| SOM3     | 5 × 6    | −50%   | 0.05–0.02                | 0.02–0.01                | 5.00–1.70                       | 1.50–1.00                       | 77.5                             |
| SOM4     | 5 × 6    | −33%   | 0.05–0.02                | 0.02–0.01                | 5.00–1.70                       | 1.50–0.90                       | 69.8                             |
| SOM5     | 7 × 10   | All    | 0.10–0.02                | 0.02–0.01                | 6.00–2.50                       | 2.50–1.30                       | 86.4                             |
| SOM6     | 7 × 10   | −33%   | 0.10–0.02                | 0.02–0.01                | 6.00–2.50                       | 2.00–1.20                       | 67.3                             |
| SOM7     | 10 × 12  | −33%   | 0.10–0.02                | 0.02–0.01                | 6.00–2.50                       | 2.00–1.15                       | 63.7                             |

a Either all ERA5 1979–2017 fields were used to train a SOM, or a certain percentage of fields whose Sammon projection is closest to the centre of the ERA5 SM were excluded.
b Learning rate and neighbourhood radius decrease linearly over 1,000 (3,000) iterations during pass 1 (2) between their preset initial and terminal values.
c The error is shown in percent of the error in SOM1. See Section 3.1 for definition.
FIGURE 6 SOMs (right) and their projections onto the ERA5 1979–2017 SM (left). In the SM, projections of SOM patterns are indicated by red dots; orthogonally neighbouring SOM patterns are connected by red line segments whose length approximates the distance between the patterns in the original high-dimensional space. Data used (not used) for training of SOMs are shown in dark (light) grey. Axis captions in left-hand panels as in Figure 2. Ordering of SOM CTs is indicated by numbers in panel (a)
FIGURE 7 SMs of SOMs 5–7 obtained by projecting the SOMs onto the SM of 1979–2017 ERA5 fields. SMs of respective 5 × 6 SOMs trained in the same data as the depicted SOMs (see also Table 1) are shown in the background

Not all ERA5 fields were used to train each SOM. Some SOMs were trained in one of two reduced datasets, which were obtained as follows: (1) The SM coordinates were standardized. (2) The Euclidean distance from the SM centre was calculated for each field from its standardized coordinates. (3) The reduced datasets were created by removing all fields closer to the SM centre than the 33rd and the 50th percentile of the Euclidean distances of the ERA5 dataset, respectively. The selection of training datasets is indicated in Table 1, and the three datasets can be compared in Figure 6a–c. Note that the complete ERA5 time series was subsequently classified with the SOMs; that is, the purpose of the reduction was solely to affect the definition of the SOM nodes.
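The three-step reduction can be sketched directly from its description (a Python illustration; the study performs the equivalent steps in R):

```python
import numpy as np

def reduce_by_centrality(coords, drop_pct):
    """Mask out the fields whose standardized Sammon projection lies
    closest to the map centre: standardize each coordinate, compute
    the Euclidean distance to the origin, and keep only the fields
    beyond the given percentile of those distances.

    coords: (n, 2) Sammon coordinates; drop_pct: e.g. 33 or 50.
    Returns a boolean mask of retained fields."""
    z = (coords - coords.mean(axis=0)) / coords.std(axis=0)
    dist = np.linalg.norm(z, axis=1)
    return dist > np.percentile(dist, drop_pct)
```

The mask selects the training sample only; the full time series is still classified with the resulting SOM afterwards.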

Two kinds of quality metrics are typically used to assess SOMs—the quantization error and the topological error (Kohonen, 2001). Although their exact specification varies among studies, their purpose does not: while the quantization error shows the degree of data separation into clusters and can therefore be understood as a measure of the quality of a classification, the topological error shows the degree to which the organization of types within the SOM array captures the structure of the classified data. Since both metrics depend on predefined parameters, they are typically used to select a single optimum SOM from multiple trained ones. Here, the quantization error of a SOM is calculated as the Euclidean distance between each circulation field and its closest type, averaged over all classified fields. The topological error represents the percentage of circulation fields whose two closest types do not lie next to each other in the SOM array.
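The two error definitions used here can be computed as follows (an illustrative Python sketch; treating a grid distance greater than 1 as "not adjacent" is an assumption about the adjacency convention, which the text does not spell out):

```python
import numpy as np

def som_errors(fields, nodes, grid_xy):
    """Quantization error: mean Euclidean distance of each field to its
    nearest node. Topological error: percentage of fields whose two
    nearest nodes are not adjacent in the SOM sheet (here: grid
    distance > 1, i.e. orthogonal neighbours count as adjacent).

    fields: (m, d); nodes: (n, d); grid_xy: (n, 2) node positions."""
    D = np.linalg.norm(fields[:, None, :] - nodes[None, :, :], axis=2)
    order = np.argsort(D, axis=1)            # nodes ranked by distance
    qe = D[np.arange(len(fields)), order[:, 0]].mean()
    g = np.linalg.norm(grid_xy[order[:, 0]] - grid_xy[order[:, 1]], axis=1)
    te = 100.0 * np.mean(g > 1.0)            # percent of fields
    return qe, te
```

A low quantization error with a high topological error signals a SOM that separates the data well but whose array ordering no longer reflects the data structure, which is exactly the trade-off discussed in Section 1.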

First, we used the metrics to find optimum values of parameters for given data and SOM array size (Table 1). Second, each of the seven SOMs was selected as the best of 10 randomly initialized variants. As an additional quality measure, we projected the SOMs onto the SM calculated in ERA5 fields. This step also helps visualize the link between SM bins and SOM nodes.

2.6 Classification of data

Classification of circulation fields with SOM CTs is straightforward. The method itself ensures that each training data point is classified with the closest node (in terms of Euclidean distance), and the class of each field is thus obtained directly in the output. Additional data can be projected onto SOMs by classifying it with CTs utilizing the same metric, that is, by finding the node that is closest to the projected field in the original (high-dimensional) data space.
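Classifying a field with a trained SOM thus reduces to a nearest-node search in the original data space. A minimal sketch (the function name and array layout are ours):

```python
import numpy as np

def classify_with_som(fields, nodes):
    """Assign each circulation field to the label of its nearest SOM node,
    using Euclidean distance in the original high-dimensional space."""
    d = np.linalg.norm(fields[:, None, :] - nodes[None, :, :], axis=2)
    return d.argmin(axis=1)
```

The same call works both for training data and for any additional fields projected onto the SOM afterwards.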

In the case of SMs, classification is carried out in the two-dimensional Sammon space. Both methods enable one to classify each data point (whether used in training or projected later) directly according to its SM coordinates relative to the boundaries between bins. However, we decided to include one optimization step for the EFSM approach. The specification of bins described above, which includes an arbitrary definition of directional sectors, does not ensure that each data point is classified with the closest bin centroid. Therefore, the training ERA5 fields were re-classified such that each field was assigned to the closest bin centroid in the Sammon space. Consequently, the final classification slightly differs from the intermediate one indicated in Figure 5c. While not strictly necessary, this simple step decreases the average distance between classified fields and their respective bin centroids by about 4%.
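The re-classification step is a single nearest-centroid pass in the two-dimensional Sammon space (not a full k-means iteration to convergence). A minimal illustration, assuming bin labels are coded 0 to n_bins−1:

```python
import numpy as np

def reassign_to_nearest_centroid(sm_coords, labels, n_bins):
    """One optimization pass: recompute bin centroids in the 2-D Sammon
    space from the current labels, then re-classify each field with the
    closest centroid. Reassignment can only decrease (never increase)
    each field's distance to its centroid."""
    cents = np.array([sm_coords[labels == k].mean(axis=0)
                      for k in range(n_bins)])
    d = np.linalg.norm(sm_coords[:, None, :] - cents[None, :, :], axis=2)
    return d.argmin(axis=1), cents
```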

3 APPLICATIONS

Here, we compare the performance of classifications obtained by SOMs and by discretization of SMs in capturing the frequencies of occurrence of CTs (Section 3.1) and daily TMIN (Section 3.2) in ERA5 and the CCLM model.

3.1 Frequency of circulation types in ERA5 and CCLM bias

In Figure 8, differences in the representation of ERA5 circulation variability, including its extremes, by the nine classifications are shown. The two versions of SM discretization differ considerably. The EASM approach (Figure 8i) results in several highly populated clusters (so-called snowballing) in the centre of the Sammon space, which are surrounded by rarely occurring CTs containing extreme circulation fields. The EFSM approach (Figure 8h) results in clusters that are closer to the centre of the Sammon space than in EASM, but the representation of extremes is still better than in any of the SOMs (Figure 8a–g). SOM1 (Figure 8a), trained in all fields, classifies the most extreme cyclonic and anticyclonic fields with two corner nodes, 25 and 6 (see Figure 6a), respectively. This snowballing effect is weaker when a certain percentage of fields that are in the centre of the data Sammon map are omitted in training (SOMs 2 and 3). The resulting SOMs have larger SMs with their outer nodes considerably closer to the edges of the data space populated by extreme circulation fields. An additional way to train SOMs with larger SMs (compare SOMs 2 and 4) is to decrease the terminal value of the neighbourhood radius. This, however, potentially comes at the cost of damaging the topology. A damaged topology is visible in SOM5, whose two corners fold over the array. Lastly, SOMs with larger SMs can also be obtained by increasing the SOM array size, as documented by SOMs 5–7 (Figure 8e–g). However, the effect of array size is relatively weak: Even though SOM7 and the EFSM discretization both result in 120 clusters, the SM of the SOM is markedly smaller.

Relative frequency of CTs in ERA5 (1979–2005) represented by various methods. (a–g) SOMs 1–7, in turn (see Table 1), (h) Equal-frequency Sammon map (EFSM) discretization, (i) Equal-area Sammon map (EASM) discretization. The scales of symbols (see legend) and both axes are identical in all panels for ease of comparing methods. Axes captions as in Figure 2

The size of SMs of SOMs is a useful parameter for assessing the discriminatory power of the classifications for extreme circulation fields. However, to quantify and compare the effects of different methodological choices, we further calculate a conditional quantization error (Table 1). The error is defined the same way as the overall quantization error (see Section 2.5), except it considers only the 10% of days with the most extreme circulation (those classified with the outer bins of EFSM discretization as indicated in Figure 5b). The results show that the optimum combination of evaluated parameters is that used to train SOM4 (the nodes representing extreme fields are 30% closer compared with SOM1, which has the same number of CTs). A comparable result is achieved by SOM6, that is, the positive effect of increasing the number of CTs in this case does not outweigh the increase in the radius necessary to maintain good topology. Overall, SOM7 has the lowest error. However, the improvement relative to SOM4 seems fairly small considering that it requires quadrupling the number of nodes.
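The conditional quantization error amounts to restricting the averaging to the extreme subset of days. A minimal sketch (the boolean mask selecting the 10% most extreme days, here assumed precomputed from the outer EFSM bins, is illustrative):

```python
import numpy as np

def conditional_quantization_error(fields, nodes, mask):
    """Quantization error computed only over a subset of days, e.g. the
    10% with the most extreme circulation (mask: boolean, length n)."""
    sub = fields[mask]
    d = np.linalg.norm(sub[:, None, :] - nodes[None, :, :], axis=2)
    return d.min(axis=1).mean()
```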

All nine classifications provide a relatively similar pattern of CCLM biases in CT frequency (Figure 9). The model clearly overestimates circulation captured by most outer CTs. However, the size of the bias depends considerably on the method, and only the SM-based methods (and to some extent the largest SOMs) clearly show that the bias tends to increase proportionally to the extremity of circulation.

As in Figure 8, except for differences between CCLM and ERA5. The difference d in the frequency f of a circulation type CT is calculated as d = (f_CCLM(CT) − f_ERA5(CT)) / f_ERA5(CT) × 100%

The EASM discretization (Figure 9i; see Figure 4a for spatial patterns) detects several regions in which simulated fields occur exclusively—mainly extreme cyclonic (middle and upper left region of the SM) and anticyclonic (middle and bottom right region) flow—or considerably more often than in reanalysis—mainly westerly flow (bottom left and centre regions). The method distinguishes between patterns with an anticyclone north or north-east of the domain (upper right), which are underestimated by the model, and patterns with an anticyclone over central Europe or the Alps, which are overestimated.

The EFSM discretization (Figure 9h) groups extreme fields into larger clusters with their centroids closer to the centre of the map. Consequently, it provides more readable information on CCLM circulation variability as a whole. The method shows underestimation of most patterns close to the centre of the map (circulation with only weak SLP gradients and average SLP) and overestimation of extreme circulation in general, with the exception of south-easterly flow (centre of the upper right quadrant of the map).

The SOMs, in particular those with snowballing in corner CTs (SOMs 1 and 2), fail to show the considerable overestimation of extreme cyclonic and anticyclonic fields, as these fields occupy regions of the data space far from any of the SOM nodes. The model also overestimates the intensity of south-easterly and easterly flow around the Mediterranean trough (top centre), which the SOMs fail to show or underplay due to their limited extension in the direction of the second Sammon projection.

3.2 Daily minimum temperature in ERA5 and CCLM bias

Daily TMIN conditioned by individual CTs varies between approximately 4°C and −12°C in ERA5 (Figure 10). The extremes are linked to westerly and cyclonic flow (maxima) and south-easterly and easterly flow (minima). The values are clearly organized in the Sammon space, which is apparent in Figure 10h, i, and SOMs capture this organization well. Similarly to CT frequency, the range of obtained values differs depending on the method, and especially the smaller SOMs fail to distinguish the patterns leading to TMIN extremes.

As in Figure 8, except for daily TMIN averaged over all days within each CT

CCLM seems to have a cold bias in cyclonic and westerly flow and in CTs with a pronounced anticyclone over north-eastern Europe (Figure 11). The remaining CTs are overwhelmingly warmer in the model, although the result is noisier compared with CT frequency. The larger SOMs and the two SM-based classifications provide a valuable distinction regarding the TMIN bias in CTs that lead to TMIN minima over the domain (high SLP and/or advection from the eastern quadrant)—the model is generally colder in CTs with a pronounced anticyclone over north-eastern Europe, and warmer in most other CTs that are conducive to lowest TMIN, suggesting that different factors contribute to TMIN bias.

As in Figure 10, except for differences between CCLM and ERA5. The difference d in the daily mean TMIN t of a circulation type CT is calculated as d = t_CCLM(CT) − t_ERA5(CT)

In order to quantify and compare the ability of the methods to capture extreme events, we focus in detail on extreme daily TMIN, which are defined as days with TMIN below the fifth percentile of the TMIN distribution in ERA5 1979–2005 extended winters. The distribution of these events (shown in Figure S1) closely follows that of the mean daily TMIN.

A successful application of a classification to the analysis of an extreme depends on the ability of the classification to define CTs conducive to the extreme, and to separate the days with an extreme from the time series as well as possible. We compare the nine classifications based on their ability to separate the 5% of days with the most extreme TMIN in ERA5. Figure 12 illustrates the separability by comparing cumulative distributions of extreme days (vertical axis) and all days (horizontal axis). In an ideal situation, given the chosen fifth percentile of TMIN, 100% of extreme days would map onto 5% CT relative frequency, forming a steep line in the graph. The closer a curve lies to this limit, the better the method separates the extremes.
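The separability curve can be computed by sorting CTs by their conditional probability of an extreme and accumulating frequencies. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def separation_curve(labels, extreme):
    """Cumulative distribution of extreme days vs. all days across CTs.

    labels  : (n,) CT label of each day
    extreme : (n,) boolean, True on days below the TMIN threshold
    Returns (x, y): cumulative relative frequencies of all days and of
    extreme days, with CTs sorted by descending P(extreme | CT).
    """
    cts = np.unique(labels)
    n_all = np.array([(labels == c).sum() for c in cts])
    n_ext = np.array([((labels == c) & extreme).sum() for c in cts])
    prob = n_ext / n_all                  # P(extreme | CT)
    order = np.argsort(-prob)             # most extreme-prone CTs first
    x = np.cumsum(n_all[order]) / n_all.sum()
    y = np.cumsum(n_ext[order]) / max(n_ext.sum(), 1)
    return x, y
```

With perfect separation, y reaches 1 while x is still at the overall fraction of extreme days (5% for the threshold used in the text).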

Separation of days with extreme TMIN (TMIN below the fifth percentile of TMIN distribution in ERA5 extended winters) in different classifications and time periods. Prior to calculating the cumulative frequencies, CTs were sorted according to the probability of occurrence of extreme TMIN in descending order. The 5% threshold is defined independently for each period. The two thin black lines show theoretical limits of the distribution in which all extreme days map on 5% (best separation; steep line) and 100% (worst separation) cumulative relative frequency, respectively

Performance of the methods is relatively comparable when considering the whole time series (4,084 days; Figure 12a); somewhat better (worse) results are achieved by EFSM and SOM 7 (EASM). The rather small differences are likely due to the fact that TMIN extremes are not restricted to only a few CTs but can occur in a majority of CTs (e.g., in 60% of CTs in the case of EFSM; see Figure S1). When analysing shorter periods (Figure 12b–d), the extremes are considerably more “localized”, that is, constrained to a smaller subset of CTs. For example, <30% of EFSM CTs include an extreme day in 1988–1996 (not shown). Consequently, there are larger differences in the representation of the extremes and best separation is achieved by EFSM and SOM 7.

4 DISCUSSION

4.1 SOMs

The use of circulation classifications, including those by SOMs, to study extreme weather has considerable limitations stemming from the fact that the output of classifications is typically too generalized (e.g., Gibson et al., 2017). By design, nodes of a SOM array tend to reside in dense regions of data space, providing less discriminatory power for rare and extreme circulation fields. Our study shows that representation of extreme circulation patterns by SOMs can be improved to some extent by (1) increasing the size of the SOM array, (2) removing a certain fraction of circulation fields that are similar to the mean circulation pattern, and (3) decreasing the terminal value of the neighbourhood radius. In particular, combining these three steps can help train a SOM that has an optimal representation of sparsely populated outer regions of the data space, while having a good topology—arguably the main benefit of the method compared with cluster analysis. Nevertheless, our results suggest that reasonably good representation can be achieved even without increasing the size of the SOM array.

In synoptic climatology, SOMs with relatively small arrays have typically been used. Jiang et al. (2015) suggest that SOMs of 10–20 nodes are sufficient for classification, and that larger SOMs should only be used with the goal of exploratory data projection. Cassano et al. (2015), who focus specifically on extremes, compare SOMs of up to 35 nodes and advise against increasing the size of the SOM array blindly, as it alone may not be sufficient to identify patterns relevant to the analysis. Cassano et al. (2016) and Ohba et al. (2015) use SOMs of 35 and even 144 nodes, respectively; however, in both cases SOM nodes were further meta-classified into only a few clusters of nodes that were then interpreted.

Here, we utilized SMs to visualize the effect of increasing SOM array size. Sammon projection has been used in evaluating the topological structure of SOMs (e.g., Hewitson and Crane, 2002). In Figure 6, we show that SMs can also be used to understand the position of SOM nodes within the data space (or more precisely, within its Sammon projection) and to estimate the effect of increasing SOM array size on the representation of circulation extremes. In Figure 7, differences in the size of SMs of SOMs of different sizes are shown. When contrasting the SMs of 70-node SOMs with those of 30-node SOMs, it is evident that the effect of such an increase in SOM size is only small relative to the size of the data space. In other words, a larger SOM array provides a more detailed image of the densely populated region in the centre of the data space, but little additional information on extremes.

One way to improve the limited representation of extreme fields, regardless of the SOM array size, lies in removing some of the circulation fields available for SOM training. A similar approach has been used before (e.g., Aragon et al., 2020) to select for SOM training only fields that co-occurred with a particular meteorological extreme. However, when a SOM is trained only in the data subspace conducive to a particular extreme, it provides no information on overall circulation variability. Such an approach can be useful but has its limitations. Here, instead of removing fields that do not lead to an extreme in the training sample, we used a different approach. Utilizing Sammon projection of data, we assess the extremity of circulation simply by calculating the Euclidean distance of a point from the centre of the two-dimensional SM. Subsequently, a certain percentage of fields with the smallest extremity of circulation (33% or 50%) is removed from the training sample prior to SOM training. This results in SOM nodes being “pushed away” farther from the heavily populated data space centre toward the data space edges, which in turn allows for a better discrimination of circulation extremes. The approach assumes that the meteorological extreme of interest is predominantly linked to extreme circulation fields.

Unlike the above-mentioned approach used by Aragon et al. (2020), our reduction of the training sample still allows for classification of all fields, including those not used during training. During training, most SOM nodes—but not all—become organized around the centre region free of training data, which is indicated in Figure 6b–d by light grey colour. Reusch et al. (2005) showed that when calculating a SOM in a dataset consisting of a few distinct clusters, some nodes with zero frequency of occurrence are still placed between the clusters. We observed a similar behaviour—all SOMs calculated in a reduced ERA5 dataset still contain several nodes close to the centre of the Sammon space, on which the fields not used for training can be projected in the next step, resulting in a complete classification.

Additionally, the approach of training SOMs in a reduced dataset seems to have other advantages. The effect of data reduction on the representation of circulation extremes seems to be greater than that of more than doubling the number of nodes. More importantly, the reduced datasets allow for decreasing the terminal radius to a value that would otherwise lead to considerable deterioration of SOM topology, which is apparent for SOM5 in Figure 7a in the form of a slight folding of the array in two of its corners. Previous research has shown that opting for a lower terminal radius leads to a lower quantization error, or a better classification (Jiang et al., 2015; Gibson et al., 2017), but the topology constraint may prevent one from making such a choice. In our case, using reduced datasets together with a lower terminal radius led to a representation of extreme circulation comparable with that achieved when quadrupling the number of nodes, while using only a fraction of computational resources and being far easier to interpret.

4.2 Discretization of Sammon maps

The results regarding SOMs seem promising; nevertheless, discretization of a SM of the data may arguably offer a competing alternative. It seems quite arduous and computationally expensive to find a combination of SOM parameters that would optimize the representation of a particular extreme in a particular dataset, without resorting to very low values of the neighbourhood radius (effectively carrying out k-means clustering). Since circulation fields tend to form a continuum rather than a few distinct clusters (e.g., Philipp et al., 2016), discretizing a SM by either defining arbitrarily set boundaries between data bins or utilizing information on data variability across the SM should also, hypothetically, lead to a topology-preserving classification useful in studying extremes. Here, two specific ways of discretization were attempted and compared with SOMs.

The EASM discretization is similar to the SOMs that we defined in reduced datasets in the sense that there are several large clusters in the centre of the map, snowballing fields close to the climatological mean, which are not of much interest. By setting the size of bins to 100 × 100 hPa, and calculating centroid maps for all bins, one obtains a rectangular map of circulation patterns similar to a SOM. This map, however, better represents the extremes of the data space, and if a larger map is defined than would be necessary for the training dataset (such as in Figure 4a), it becomes particularly useful in comparing the extent of circulation variability in different datasets. On the other hand, quantifying, plotting, and statistically evaluating properties of the bins becomes complicated since the frequencies across the bins span several orders of magnitude.
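Conceptually, the EASM discretization is just a regular tiling of the Sammon plane followed by per-bin averaging. A minimal sketch, assuming square bins of a fixed side length (the function name and encoding of the 2-D bin index are ours):

```python
import numpy as np

def easm_bins(sm_coords, fields, cell=100.0):
    """Equal-area discretization: tile the Sammon plane with square
    cell x cell bins (100 x 100 hPa in the text) and average the
    circulation fields in each occupied bin into a centroid pattern."""
    ij = np.floor(sm_coords / cell).astype(int)   # 2-D bin index per day
    ij -= ij.min(axis=0)                          # shift indices to >= 0
    ncols = ij[:, 1].max() + 1
    labels = ij[:, 0] * ncols + ij[:, 1]          # encode as a single label
    centroids = {k: fields[labels == k].mean(axis=0)
                 for k in np.unique(labels)}
    return labels, centroids
```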

The EFSM discretization utilizes the information of the data structure obtained via SOMs and EASM, mainly the facts that (1) CTs with “mirrored” patterns (e.g., cyclonic vs. anticyclonic, westerly vs. easterly flow) are organized in opposite parts of the projection space and (2) in any direction from the centre of the SM toward its edge, similar patterns occur in terms of pattern-to-pattern correlation, while the extremity of circulation (strength of flow and/or SLP) increases. The EFSM therefore defines the bins according to how data is distributed in 16 directions from the SM centre, utilizing selected percentiles of each distribution as boundaries. The resulting classification has as many CTs as the largest SOM7, but the classes are more evenly spread across the data space (compare for instance Figure 8g, h), especially along the second Sammon projection.
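The directional-plus-radial binning can be sketched as follows. The 16 sectors match the text; the particular percentile levels and the label encoding below are illustrative assumptions, since the text specifies only "selected percentiles" as boundaries.

```python
import numpy as np

def efsm_bins(sm_coords, n_sectors=16, radial_pcts=(20, 40, 60, 80)):
    """Equal-frequency discretization: split the Sammon plane into
    n_sectors directional sectors around the map centre and, within each
    sector, into radial bins at the given percentiles of distance."""
    z = sm_coords - sm_coords.mean(axis=0)
    theta = np.arctan2(z[:, 1], z[:, 0])          # direction from centre
    r = np.hypot(z[:, 0], z[:, 1])                # extremity of circulation
    sector = ((theta + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    labels = np.empty(len(z), dtype=int)
    n_rad = len(radial_pcts) + 1
    for s in range(n_sectors):
        m = sector == s
        if not m.any():
            continue
        # per-sector radial boundaries -> near-equal bin frequencies
        edges = np.percentile(r[m], radial_pcts)
        labels[m] = s * n_rad + np.searchsorted(edges, r[m], side='right')
    return labels
```

Because the radial boundaries are percentiles computed within each sector, bin frequencies are close to equal by construction.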

By definition, frequencies of EFSM bins are very similar to one another. Unlike EASM and the SOMs trained in reduced datasets, this is not achieved by compromising the classification of fields close to the climatological mean in order to better represent circulation extremes. Despite its simplicity, the method seems to outperform all SOMs trained here in separating circulation fields leading to extreme TMIN, notwithstanding all the optimization steps carried out in SOM training. Even though the equal-frequency bins are reminiscent of the output of a cluster analysis, the quality of the classification will unavoidably be lower than that of a cluster analysis, since the bins are grounded in the two-dimensional Sammon space topology.

Lastly, whether a scaling method—be it SOMs, SM discretization, or any other method—provides an adequate image of circulation variability depends also on the structure of data. In our case, the structure of the circulation dataset is rather simple: it contains only one variable (SLP), 85% of whose variability can be explained by the leading two principal components. This suggests that a two-dimensional representation of the data should be quite accurate and the topology of the output relatively easy to interpret. Nonetheless, PCA provides only a linear model and is consequently unable to cope with nonlinear dependences hidden in the data (Lee and Verleysen, 2007). More complex data (e.g., larger spatial domain, more input variables, or summer fields) could bring additional methodological challenges that would have to be addressed.
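The variance fraction captured by the leading components is a quick check of how faithful a two-dimensional projection can be. A minimal sketch via SVD of the anomaly matrix (function name is ours):

```python
import numpy as np

def explained_by_leading_pcs(fields, n_pcs=2):
    """Fraction of total variance captured by the leading n_pcs principal
    components, computed from the singular values of the anomaly matrix."""
    anomalies = fields - fields.mean(axis=0)
    s = np.linalg.svd(anomalies, compute_uv=False)
    var = s ** 2                       # variance along each PC
    return var[:n_pcs].sum() / var.sum()
```

A value near 1 (85% in this study) indicates that a planar projection such as a SM can represent the data with little distortion.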

5 CONCLUSIONS

Previous studies showed that SOMs can be applied to studying weather extremes and their links to atmospheric circulation but also that such applications are inherently difficult and limited, both due to the nature of classifications of atmospheric circulation in general and SOMs in particular. We propose the following recommendations to abate the limitations of SOMs:

(1) Removing circulation fields closest to the climatological mean from the training sample allows for better representation of circulation extremes. This is mainly because a lower value of the neighbourhood radius can be chosen for later iterations of SOM training, which would otherwise compromise the topology of the map. If necessary, the fields removed from the training sample can be projected onto the SOM after the training concludes.

(2) SMs can be used not only to validate the topology of a SOM, but also to visualize the topology of the full dataset and the link between the two. This seems particularly useful during SOM training, as it provides a visual alternative (or at least a complement) to the typically used quantization and topological errors.

Regardless of the proposed optimization steps, a trained SOM may still perform poorly in discriminating circulation extremes and consequently be unable to capture the links to a meteorological extreme. For such cases, we propose another utilization of SMs, consisting of discretizing the map. Two simple methods are defined, one based on arbitrarily set boundaries between data bins, the other utilizing information on the data distribution across the SM. While the former method can quickly provide information on differences between datasets (for example in the extremity of circulation, frequency of CTs, and how CTs condition a particular meteorological extreme), it tends to suffer from sampling issues. The latter approach seems more useful for quantification purposes and, in our case, it matched or even outperformed the optimized SOMs in discretizing types of circulation leading to extremes in minimum temperature in ERA5.

Our results suggest that the tedious and lengthy search for optimal configuration of SOM parameters necessary to make the SOM applicable in a study on weather extremes, while retaining a good topological ordering of nodes useful for interpretation and visualization of results, could be circumvented by discretizing a SM of data instead.

AUTHOR CONTRIBUTIONS

Jan Stryhal: Conceptualization; data curation; formal analysis; methodology; visualization; writing – original draft; writing – review and editing. Eva Plavcová: Conceptualization; funding acquisition; project administration; supervision; visualization; writing – review and editing.

ACKNOWLEDGEMENTS

We thank all involved in the development of the ERA5 reanalysis and the CCLM and HadGEM models. We acknowledge the Copernicus Climate Change Service for providing access to ERA5 and the Earth System Grid Federation for providing access to model outputs. We thank Ondřej Lhotka for obtaining model data.

FUNDING INFORMATION

The study was supported by the Czech Science Foundation, project 19-24425Y.