Volume 128, Issue 20 e2023JD039183
Research Article
Full Access

Representation of Modes of Atmospheric Circulation Variability by Self-Organizing Maps: A Study Using Synthetic Data

J. Stryhal

Corresponding Author

J. Stryhal

Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czechia

Correspondence to:

J. Stryhal,

[email protected]

Search for more papers by this author
R. Beranová

R. Beranová

Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czechia

Search for more papers by this author
R. Huth

R. Huth

Institute of Atmospheric Physics, Czech Academy of Sciences, Prague, Czechia

Faculty of Science, Charles University, Prague, Czechia

Search for more papers by this author
First published: 17 October 2023

Abstract

Self-organizing maps (SOMs) represent a popular tool for classifying atmospheric circulation patterns. One of their traditional applications has been to link typical synoptic-scale patterns to large-scale teleconnections, or modes of low-frequency circulation variability. However, recently there have been attempts to interpret an array of SOM nodes directly as a continuum of teleconnections, grounded in SOMs' ability to combine two otherwise distinct approaches to data analysis, that is, exploratory projection (or, dimensionality reduction) and classification. This conceptual shift calls for methodological studies that would improve our understanding of how orthogonal modes of variability, typically used to describe teleconnections, relate to SOM outputs. Here, we define three idealized modes of variability and use their various combinations to generate synthetic data sets. Many variants of SOMs are generated for SOMs of various shapes and sizes. The results show that projection of modes on a SOM array is sensitive not only to data structure, but also to various SOM parameters. The leading mode of variability projects rather strongly on SOMs if its explained variance is markedly higher than that of the second-order mode; the remaining modes project considerably more weakly, and all modes tend to blend when their explained variance is similar, which leads to underrepresentation of some phases of modes and/or combinations of modes among the SOM patterns. Furthermore, we show that some features of SOM topology that were previously considered a proof of data nonlinearity appear even if the underlying modes of variability are strictly linear.

Key Points

  • Synthetic fields are generated as linear combinations of three idealized functions of spatial co-variability, or circulation modes

  • Self-organizing maps have limited ability to represent the second and third modes, and modes tend to blend when their variance is similar

  • Principal component analysis may detect data structures conducive to high sensitivity of self-organizing maps to the choice of parameters

Plain Language Summary

One of the tasks of a climatologist is to group together days with similar weather over a region, and to find out how such types of regional weather are influenced by distant locations. These distant influences, also known as teleconnections, are intricate and studying how the methods we use to study them deal with these intricacies is one of the keys that helps us understand the climate. One way to start this task is to apply simple and well-understood data sets to methods and examine how they interpret the data. In the paper, we studied the popular method of self-organizing maps in such a way. We found that the method usually identifies the strongest teleconnection, but the images of weaker teleconnections tend to be unclear or even blend together. We use our results to revisit findings of some previous studies and to discuss how self-organizing maps could be improved in the future to give us more accurate information on the atmosphere.

1 Introduction

Self-organizing maps (SOMs; Kohonen, 2001; Hewitson & Crane, 2002; Sheridan & Lee, 2011) have become increasingly popular in synoptic climatology. Among their numerous applications, the one toward classifying circulation patterns—typically mean sea level pressure (SLP) or geopotential height fields—has been most common. Like other classification methods—see for example, Huth et al. (2008) and Philipp et al. (2016) for detailed reviews of the topic of circulation typing—SOMs enable approximating the broad variability in circulation fields by a small number of representative nodes, or circulation types. However, SOMs offer the additional benefit of organizing these types into an array, where the Euclidean distance between individual types approximates the distance between them in the original multidimensional space. This means that SOMs not only classify data but also represent their topological structure, making them popular for visualization and interpretation of complex results.

An alternative utilization of SOMs, first proposed by Johnson et al. (2008) and subsequently applied in several related studies (Chang & Johnson, 2015; Johnson, 2013; Rousi et al., 20152020; Yuan et al., 2015), involves a different interpretation of the SOM output. Rather than interpreting individual types in a SOM array as discrete, representative geophysical patterns, the entire array is viewed as a continuum of teleconnections. Teleconnections have traditionally been understood as patterns that express statistical relationships between climatic variables at distant locations (Ångström, 1935; Hannachi et al., 2017; Wallace & Gutzler, 1981). As such, they have been investigated via local-based indices, one-point correlation or regression maps (e.g., Corti et al., 1997; Franzke & Feldstein, 2005; Wallace & Gutzler, 1981), and principal component analysis (PCA; e.g., Barnston & Livezey, 1987; Beranová & Huth, 2008; Comas-Bru & Hernández, 2018; Corti et al., 1997; Davini & Cagnazzo, 2014; Feldstein, 2000; Horel, 1981; Huth & Beranová, 2021; Jung et al., 2003). The latter method has arguably been the most popular.

PCA decomposes data transformed into a similarity (typically correlation or covariance) matrix whose values represent relations between time series of a typically gridded input meteorological variable. The decomposition results in a set of new variables that are linear combinations of input variables, are mutually orthogonal, maximize the explained variance, and have traditionally been referred to as modes of variability (Barnston & Livezey, 1987), hereafter “modes” for short. Since it was shown by Horel (1981) that the leading modes obtained by rotated PCA well reproduce the teleconnection patterns found earlier by Wallace and Gutzler (1981), the terms teleconnections and modes have been used more or less interchangeably. Although some instances can be found in literature in which the term mode of variability refers to variability patterns in general, including circulation types and/or transitions between them (e.g., Cannon, 2020; Lennard & Hegerl, 2015; Reusch et al., 2005; Risbey et al., 2015), here we use the term to address latent spatial variables or functions of variability, such as PC loading vectors or point-correlation maps.

One particular motivation behind the application of SOMs (and less often also cluster analysis) to analyses of teleconnections instead of PCA and other linear methods is to study their non-linear features, such as differences in the spatial structure teleconnections may have in their opposite phases, which cannot be captured by the abovementioned traditional methods due to their imposed linearity and/or orthogonality (Risbey et al., 2015). A typical application of SOMs is to study the extent to which occurrences of typical atmospheric patterns are conditioned by intensity and phase of leading modes (e.g., Gao et al., 2019; Johnson & Feldstein, 2010; Reusch, 2010; Reusch et al., 2007; Verdon-Kidd & Kiem, 2009). Because SOMs preserve topology of data, the opposite phases of leading modes tend to strongly project on specific regions of a SOM array, for example, on its two opposite corners. In the continuum perspective of teleconnections, such regions are then interpreted to form positive and negative continua of the mode's “members” (e.g., Chang & Johnson, 2015; Johnson et al., 2008), or “flavors” (Johnson, 2013; Rousi et al., 2020; Verdon-Kidd, 2018), which are in turn applied to describe the mode's temporal changes (in phase and intensity) and/or changes in its spatial patterns by means of analyzing shifts in the frequency of occurrence among the pertinent SOM types. Note that compared to the traditional “discrete modal perspective” (Chang & Johnson, 2015), in the continuum perspective a teleconnection does not have one fixed spatial pattern, although its generalized positive and negative patterns can be retrieved by weighted averaging of pertinent SOM types (e.g., Chang & Johnson, 2015). In the continuum perspective, the definition of a teleconnection is shifted from a statistical link expressing co-variability among distant locations to less constrained “low-frequency anomaly patterns” (Yuan et al., 2015). These patterns show particular temporal realizations of such a statistical link rather than the link itself. In reality, nevertheless, typical daily anomalies (i.e., high-frequency circulation types) have been interpreted as teleconnections in this context more often than low-pass filtered data (e.g., Chang & Johnson, 2015; Johnson et al., 2008; Rousi et al., 2015; Yuan et al., 2015).

A typical way to delimit individual “canonical” teleconnections (leading modes defined by linear methods) within a SOM teleconnection continuum consists in projecting modes (principal components) on SOM arrays and comparing similarities between the two representations of teleconnections in the spatial (e.g., Rousi et al., 2020; Yuan et al., 2015) or temporal (e.g., Johnson et al., 2008) domain, which links the two perspectives of teleconnections together. Theoretical studies on how modes project on SOM arrays seem, therefore, particularly relevant to the continuum perspective. It is, however, an area of research that has so far received only very little attention.

Jiang et al. (2013) and Lee (2017) have shown that influences of various teleconnection indices on the frequency, intensity and airmass characteristics of typical geopotential height patterns are complex and can counteract one another; consequently, attributing changes in the frequency of these patterns to one particular teleconnection seems far from trivial. Most of previous studies indicate that the leading variability mode tends to project strongly on types around two opposite corners of a SOM array, and the very corners are often representing positive and negative polarities of the mode. According to Lennard and Hegerl (2015), the two diagonals of a SOM array are analogous to first two principal components and thus to the leading two teleconnections. This somewhat contradicts the suggestion of Reusch (2010) that if the leading mode projects on one diagonal of a SOM array, the second-order mode may or may not project on the other diagonal. Furthermore, Reusch (2010) suggests that differences in the length of two the diagonals on which the leading modes project—measured in the phase space that approximates Euclidean distances among data (see Section 2, Sammon mapping, for more details)—may be attributed to data nonlinearities. However, little attention has been paid to the third-order mode, possibly due to its scattered signal in two-dimensional SOMs. In other words, lower-order modes, regardless of their strength and significance, have so far been overlooked when interpreting SOMs.

Here, we present a study on how orthogonal modes project on SOMs and how their signal depends on various SOM parameters, such as SOM size and shape, as well as the structure of data (number and strength of modes). Various synthetic data sets are generated from three predefined modes and used to train SOMs. The data and methods are described in Section 2. In Section 3, we analyze the links of SOM arrays to modes of variability and to data structure. In Section 4, we discuss the results, and Section 5 summarizes main findings.

2 Data and Methods

2.1 Synthetic Data

In the study, multiple synthetic data sets with predefined—and thus known—structure are used to train SOMs. The data sets are composed of circulation fields, which are linear combinations of three modes of circulation variability that describe variations in zonal (Z), meridional (M), and circular (C) flow. The three modes are defined as mutually orthogonal and of unit length to ensure that the contribution of individual modes to the total variability can be fully controlled by generated indices (amplitudes). To avoid calculation with very small numbers and to improve the readability of the resulting generated fields, all three patterns are scaled up by the same factor, the size of which is not relevant, but explains the larger isoline intervals in the spatial patterns of modes (Figure 1). The patterns can be interpreted as anomalies of an unspecified circulation variable that represent the modes in their positive phase and intensity of +1.0.

Details are in the caption following the image

Spatial patterns of predefined circulation modes in their positive phase: (a) zonal, (b) meridional, (c) circular. The interval of isolines is two units; positive (negative) values are in red (blue), the zero isoline is in green.

Each data set comprises 1,000 circulation fields; each field consisting of 20 × 20 grid points is a linear combination of the three patterns, whereby each mode is multiplied by a scalar index. The index therefore represents the intensity (amplitude) of a particular mode at a particular time instant. For each mode, the indices are drawn from normal distributions, the mean of which is zero and variance is a predefined fraction (explained variance hereafter) of the total variance σ2 = 1. A total of 12 combinations of explained variances (EV), and thus data sets, are predefined (Table 1) such that urn:x-wiley:2169897X:media:jgrd59011:jgrd59011-math-0001. For example, in the data set “Z70-M20-C10,” 70% of the total variability is in the zonal flow, while the contribution of the meridional and circular flow is 20% and 10%, respectively. We also define several simple “bivariate” data sets in which the third mode is “turned off” (e.g., Z70-M30). Data sets in which other than the zonal mode was the strongest (e.g., Z30-M70) were not included in the study as they did not lead to any new findings. Note that the whole process can be understood as reverse PCA whereby a circulation data set is reconstructed from PCs by a linear superposition of orthonormal eigenvectors of a correlation matrix (i.e., PC loadings, or spatial patterns of modes) multiplied by time series of amplitudes (or, PC scores).

Table 1. Combinations of Explained Variances of Modes Tested in the Study
Mode Explained variance (%)
Z 90 70 70 60 60 50 50 50 50 40 40 33
M 10 30 20 30 20 50 40 30 25 40 30 33
C 0 0 10 10 20 0 10 20 25 20 30 33

2.2 Training and Assessment of Self-Organizing Maps

Here, we present the methodology used for training, selecting and evaluating SOMs. A schematic diagram illustrating the methodology is shown in Figure 2. We discuss the choices of parameters and their effects on the results in Section 4.2. For a more detailed description of SOMs, we refer the reader to Kohonen (2001) and Sheridan and Lee (2011). Note that we use the term type instead of node in the paper to avoid any confusion with the term mode. We use the same term also when referring to SOM patterns to avoid any confusion with the patterns of modes.

Details are in the caption following the image

Schematic depiction of the methodology.

The training of a SOM is an unsupervised and iterative process that does not require intervention from the researcher. However, several parameters affect the SOM's properties, some of which require subjective decisions to be made beforehand. To initialize the SOM array, n starting vectors (in our case, generated circulation fields) are randomly selected from a data set; n is also the resultant number of types. Subsequently, the fields are projected onto the array in multiple iterations, and the position of vectors that are close (in terms of Euclidean distance) to the field is adjusted. The adjustment depends on three parameters that have to be specified in advance. First, the neighborhood radius (referred to as radius for short) determines the size of the affected area around the projected data point that is adjusted. Second, the neighborhood function defines the adjustment method, and third, the learning rate specifies how much the vectors within the affected area are attracted to the projected field. The Gaussian neighborhood function was chosen over the commonly used bubble function, as it produced SOMs with somewhat better topology. The remaining two parameters decrease in value in each iteration step, allowing the overall topological structure of the data set to be captured first before fine-tuning the resulting classification over later iteration steps. Two more parameters relate to the SOM architecture: the algorithm type (sequential vs. batch) and form of array (rectangular vs. hexagonal). Kohonen (2001) suggests that batch algorithm and hexagonal array be used, but most applications of SOMs in climatology appear to rely on the other two options. To make our study comparable, we respect the common practice in the field with our selection.

The “Kohonen” R package (Wehrens & Buydens, 2007; Wehrens & Kruisselbrink, 2018) was used to train SOMs. For each of the 12 data sets, SOMs of 13 different shapes (number of rows × number of columns) of arrays were calculated: 2 × 3, 2 × 4, 3 × 3, 3 × 4, 3 × 5, 4 × 4, 4 × 5, 4 × 6, 5 × 5, 5 × 6, 5 × 7, 6 × 6, and 6 × 7, the resulting classifications ranging from six (2 × 3 SOMs) to forty-two (6 × 7) types. The number of iterations was set to 1,000 as a higher value did not lead to better SOMs, likely due to the relatively simple structure of data. The initial value of the learning rate was set to between 0.04 (smallest SOMs) and 0.12 (largest SOMs) and decrease linearly to 0.01 over the 1,000 iterations. The radius parameter was initialized as the 66th percentile of type-to-type Euclidean distances, calculated from coordinates (i.e., row and column) of types within the SOM array, and was set to linearly decrease to a specific terminal radius. To account for the sensitivity of SOMs to the terminal radius, described by Jiang et al. (2015) and Gibson et al. (2017) and also apparent in our preliminary analyses, 16 values of this parameter (0.0–1.5 at a step of 0.1) were tested for SOMs up to 4 × 4 types and 10 values (0.9–1.8 at a step of 0.1) for larger SOMs, which are computationally more expensive. Note that radius less than 1.0 has no physical meaning and is specific to the “Kohonen” R package; there may be differences in other available software (Gibson et al., 2017). It allows for specifying the fraction of iterations with no neighbor modification. For instance, radius set to linearly decrease from 3 to 0 means that the last third of iterations do not modify neighbors. Lastly, to account for the sensitivity of SOMs to random initialization, 20 variants of SOMs were trained for each data set—SOM shape—terminal radius combination, resulting in a total of 37,440 SOMs.

Two kinds of quality metrics are typically used to assess SOMs—quantization error and topological error (Kohonen, 2001). While the quantization error shows the degree of data separation into classes and can be therefore understood as a measure of quality of a classification, the topological error expresses the degree to which the organization of types within the rectangular SOM array captures the structure of high-dimensional classified data. Here, quantization error is calculated as the Euclidean distance between each field and its closest type, averaged over all 1,000 classified fields. The topological error represents the percentage of fields whose two closest types do not lie next to each other in the SOM array. From the trained SOMs, one SOM was chosen for each data set and SOM shape by selecting the SOM with the smallest quantization error among the SOMs in whose projections onto the Sammon space folding of arrays did not occur. In case multiple SOMs met these conditions, the one with the smallest topological error was selected. A total of 156 SOMs were retained for the study.

2.3 Sammon Mapping

Sammon mapping is a non-linear dimension reduction method that maps multivariate data points (here, 400-dimensional fields) to lower- (here, two-) dimensional space such that the data structure is approximately preserved; that is, inter-point distances (typically, Euclidean) in the original and the projection Sammon space correspond (Sammon, 1969). This is achieved by iteratively looking for a solution that minimizes the error function
urn:x-wiley:2169897X:media:jgrd59011:jgrd59011-math-0002(1)
where urn:x-wiley:2169897X:media:jgrd59011:jgrd59011-math-0003 denotes the Euclidean distance between ith and jth fields in the original space and dij denotes their Euclidean distance in the projection space.

Sammon maps were trained using the function “sammon” of the R package MASS (Venables & Ripley, 2002). To find a solution close to the global optimum, the function utilizes classical multidimensional scaling (principal coordinates analysis) to obtain the initial configuration of the Sammon map. A reader is referred to Lee and Verleysen (2007, p. 258) for more information on initialization and optimization of Sammon mapping and Stryhal and Plavcová (2023) for an up-to-date review of its applications in climatology.

The purpose of Sammon mapping in the study is twofold. First, it is used as an additional quality measure to assess the topological structure of trained SOMs. In Figure 3, Sammon map of one data set is shown together with one SOM and its projection onto the Sammon map. Note that in the Sammon map in Figure 3a, relative to the orthogonal array of types in Figure 3b, the array is slightly deformed such that the distance between individual SOM types approximates the Euclidean distance between the spatial patterns of the types in the original space. A SOM whose Sammon projection is significantly deformed (e.g., when folding of arrays occurs and type-to-type links cross each other) does not capture data topology.

Details are in the caption following the image

Sammon map and a 4 × 5 SOM of Z60-M30-C10 data set: (a) Sammon map of 1,000 circulation fields (gray dots) and a SOM projected onto the Sammon map (red circles and lines); (b) Spatial patterns (centroids) of SOM types in panel (a) (units as in Figure 1), with coordinates denoting row and column of types.

Second, we utilize Sammon mapping as a means of visualizing relationships between the generated data sets, representation of these data sets by SOMs and the underlying modes. The following approach is used to project SOMs and modes on Sammon maps: The projection of a SOM pattern p onto a Sammon map is obtained by finding the point in the Sammon space that minimizes the error F:
urn:x-wiley:2169897X:media:jgrd59011:jgrd59011-math-0004(2)
where urn:x-wiley:2169897X:media:jgrd59011:jgrd59011-math-0005 (dip) denotes the Euclidean distance between the ith circulation field and type p in the original (projection) space. Note that F is almost identical to E, except that it does not integrate over all pairs of data points. Equation 2 is also used to project the modes onto Sammon maps. However, since the modes are not defined as anomaly patterns but rather as functions from which anomaly patterns are generated, we multiply each of the patterns in Figure 1 by several scalar indices (−3.0 to +3.0 at the step of +0.5). As explained in Section 2.1, we can understand the modes in terms of PCA, that is, as orthogonal vectors that form a 3-dimensional feature space. By multiplying the patterns of modes, we obtain an arbitrary coordinate system that represents this space and that can be projected onto Sammon maps.

3 Results

3.1 Links Between SOM Parameters, SOM Quality, and Data Structure

In Figure 4, quantization and topological errors and their dependence on terminal radius, SOM shape and data set structure are shown for selected cases. Not surprisingly, SOMs with larger arrays have a lower quantization error, as they better explain data variations. The error is also smaller in data sets with two modes and highest in data sets having three modes of similar variance, or very little structure. Notably, a significant drop in the error occurs when the terminal radius decreases below 1.0, but there is no further reduction beyond this threshold. For future studies, experimenting with the terminal radius close to the threshold of 1.0 should, therefore, suffice. However, this may not be the case if large SOMs were combined with data having little structure. Here, for example, terminal radius of 1.4–1.6 was the minimum that led to SOMs with reasonable topological structure for arrays of 30–42 types combined with data sets of little structure (not shown here).

Details are in the caption following the image

The dependence of quantization and topological errors on the terminal radius (horizontal axes), SOM shape (columns) and data set structure (rows). Twenty variants of SOMs were trained for each radius; variants selected for the study are highlighted by red crosses. The red box highlights specific case in which three clusters appear in the topological error distribution, while blue stars indicate three local optima that are shown in more detail in Figure 5. The units of vertical axes are (a) Euclidean distance of an unspecified variable and (b) fraction of 1.00 (refer to Section 2 for the definition of the errors).

The relationship between SOM parameters and the topological error is somewhat less clear, especially for terminal radius values above 1.0. The relationship can be indifferent, directly proportional, and indirectly proportional, depending on the particular data set—SOM shape combination. However, for almost all 156 combinations tested, a sharp increase in error can be observed when the terminal radius drops below 1.0, except for a few cases with simple data sets. The relatively high spread of the error, particularly when the terminal radius is set under 1.0, underscores the sensitivity of SOMs to initialization and, thus, the importance of calculating multiple variants of SOMs to avoid the risk of achieving only suboptimal local optima of classification.

Although SOMs trained with identical parameters may have nearly identical quantization error, they typically have a wide range of topological errors depending on initialization. For instance, we provide a detailed analysis of a specific case (Z50-M40-C10 data set, 3 × 4 array, and radius of 0.7) for which three distinct topological structures appear. The case is indicated by the red box in Figure 4b and the blue stars highlight the three local optima in the topological error, which are shown in more detail in Figure 5. Note that the SOM in Figure 5a has a markedly better topology than the other two.

Details are in the caption following the image

Sammon maps of three variants of 3 × 4 SOMs representing three local optima of topological structure indicated in Figure 4b. TE and QE indicate the topological and quantization errors associated with the respective SOMs. A few outlying data points were omitted from the figure. See Figure 3 for legend.

3.2 SOM Topology and the Modes of Variability

Here, we present how SOM types are organized within an array according to their spatial similarity to modes. Sammon mapping is utilized to visualize the topological structure of both data and SOM arrays, and Pearson's pattern correlation is used to measure the spatial similarity between patterns of types and modes. Additionally, several spatial patterns representative of various intensities and phases of the modes are projected onto Sammon maps to obtain a generalized non-linear projection of the underlying linear modes.

3.2.1 Two Modes of Variability

The generated circulation fields are clearly organized in the Sammon map according to their spatial similarity to patterns of modes. In all three simple data sets (Z90-M10, Z70-M30, Z50-M50), projections of the two modes are orthogonal to each other. In Figure 6, this is apparent both indirectly from correlations between data and modes (small colored dots) and directly from projections of idealized patterns representing the modes onto the Sammon maps (large colored dots). Furthermore, projections of these idealized patterns are linear. In the two data sets of unequal strength of modes, the modes project along the two axes of the Sammon space (Figures 6a and 6b). In the special case of their equal strength (Figure 6c), the modes remain orthogonal after projection, but do not project on the horizontal and vertical axes of the Sammon map. This is likely due to the fact that the horizontal axis is by default oriented in the direction of maximum explained variance, which is along the leading mode. This direction is unstable for data of spherical shape, that is, when two leading modes explain the same share of variance. However, this does not affect our analysis, as relationships between projected objects do not change if the Sammon map is linearly transformed (e.g., rotated).

Details are in the caption following the image

Sammon maps of 4 × 4, 4 × 5, and 4 × 6 SOMs of bivariate data sets. The Sammon maps of data sets are shown in the background and colored according to Pearson's correlation between patterns of circulation fields and the positive phase of respective modes (Figure 1). Each row shows results for one data set: (a) Z90-M10, (b) Z70-M30, and (c) Z50-M50. In each row, the two leftmost panels show the identical 4 × 4 SOM array (in black dots and lines), while each panel shows correlation of data with one mode (Z for zonal, M for meridional). In the right-hand half of the figure, the two columns show projections of 4 × 5 and 4 × 6 SOMs; correlations with the zonal mode only are shown. Larger blue/red dots denote projections of negative/positive phases of a respective mode, each dot representing a certain multiple of the pattern in Figure 1, the dots closest to the Sammon map center [0,0] being ±0.5 times the pattern, the second closest being ±1.0 times the pattern, etc.

In Figure 6, three SOMs (4 × 4, 4 × 5, and 4 × 6) are projected onto each of the three Sammon maps (data sets) to illustrate the link between their topology and projection of modes. The diagonals of the square 4 × 4 SOMs project on the axes of the Sammon space in the case of Z90-M10 and Z70-M30 data sets, that is, opposite corners of SOM arrays depict opposite phases of modes in their rather extreme state. In the case of Z50-M50 (Figure 6c), however, the modes strongly project on the (theoretical) center row and column of the array, with the corner types representing fields with about equal contribution of both modes.

In general, with an increasing elongation of SOM arrays, its center row rather than a diagonal tends to be aligned with the leading mode, but a wide range of results was obtained in different data sets. The leading mode may strongly project on two opposite corners (as in 4 × 5 and 4 × 6 SOMs of Z90-M10 data), on only one corner (4 × 5 SOM of Z70-M30 data) or none at all (4 × 6 SOM of Z70-M30 data). The weaker mode does not project most strongly on the remaining two corners but rather on the types next to and second next to the corners in 4 × 5 and 4 × 6 SOMs, respectively. The corner types themselves represent more complex fields with non-negligible contribution of both modes. The results obtained for SOMs of other shapes are similar to those presented above: Modes project well on the diagonals (center row and column in the case of Z50-M50) of square SOM arrays regardless of their size, while only the leading mode occasionally projects on a diagonal in oblong SOMs.

3.2.2 Three Modes of Variability of Different Strength

Four data sets composed of three modes of different strength were analyzed in the study (Z70-M20-C10, Z60-M30-C10, Z50-M40-C10, and Z50-M30-C20). Relative to the bivariate data sets, shifting 10%–20% of total variability to the third-order circular mode does not markedly affect the projection of the first two modes onto the Sammon map: the zonal (meridional) modes project on the horizontal (vertical) axes of the projection plane (Figure 7). When projecting the circular mode on the Sammon map, its stronger realizations (i.e., ±1 × C and beyond) appear along the vertical axis (i.e., orthogonally to the leading mode, i.e., more or less parallel to the second-order mode). However, fields with such strong contribution of circular flow are generated only very rarely in these four data sets, the maximum index (amplitude) of the mode being—in absolute values—1.23 (C10 data set) and 1.49 (C20 data set), while the 95th percentile is approximately 0.6. On the contrary, the ±0.5 × C patterns project in all four cases very close to the center of the Sammon map. In general, fields highly correlated with the circular mode are scattered across the whole Sammon map, but their highest density is close to the center. In the center, fields with only weak circulation (“turned-off” zonal and meridional modes) occur, and therefore, even a small contribution of the circular mode leads to relatively high pattern correlation between its pattern and data fields.

Details are in the caption following the image

As in Figure 6, except for (a) Z60-M30-C10, (b) Z50-M40-C10, and (c) Z50-M30-C20, and inclusion of the third-order (C) mode. In each row, the three panels on the left show the Sammon map of one data set and projections of the 4 × 6 SOM array (in black dots and lines) and circulation modes (larger blue and red dots) onto it, while the two figures on the right show shape of 4 × 4 and 4 × 5 SOMs.

The orientation of SOM arrays relative to modes and the shape of the arrays also resemble those obtained for bivariate data sets: the two leading modes project onto diagonals in all square SOMs except those of the Z50-M40-C10 data set. On the other hand, the two leading modes project along the central row and column in most SOMs that are strongly elongated (i.e., those with two more columns than rows), the corner types then comprising linear combinations of the two leading modes. The orientation of SOMs with one more column than rows varies between the typical orientation of square and strongly elongated SOMs, depending on data set and SOM parameters (number of types, terminal radius). The corner types of these slightly elongated SOMs can be either linear combinations of two or three modes or strong projections of one particular mode. In the latter case, patterns in opposite corners typically are not mirror images of each other (see Figures 8a and 8b); such a situation is further referred to as “non-mirrored patterns.”

Details are in the caption following the image

Spatial patterns of selected SOMs: (a–c) 4 × 6 SOMs shown in Figure 7, (d)–(f) 4 × 6 SOMs shown in Figure 9. For legend, see Figure 3. The SOMs' Sammon maps in Figures 7 and 9 are oriented such that type [1,1] is in the upper left.

Consistent with the projection of the ±0.5 × C patterns close to the center of Sammon maps, also types resembling the pattern of this mode occur close to the center of SOM arrays. Note, however, that the proximity to the center is in terms of the Sammon map rather than SOM array coordinates. This is apparent in types [1,3] in Figure 8b, and [4,3] and [4,4] in Figure 8c, which—despite their location at the edge of SOM arrays—lie relatively close to the center of their respective Sammon maps, due to a rather deformed shape of the SOM array in the former case (Figure 7b) and rather small array (consequence of a higher terminal radius necessary to produce a SOM with good topological structure) in the latter (Figure 7c). However, the occurrence of types resembling the circular mode in the center does not mean that other types do not show signs of circular flow; rather the opposite holds, which is apparent from the spatial patterns of types (Figure 8).

3.2.3 Three Modes of Variability of Equal Strength

One example of equal strength of modes was presented above, regarding the bivariate Z50-M50 data set. Additional five data sets were defined in which the first- and second-order modes (Z40-M40-C20), the second- and third-order modes (Z60-M20-C20, Z50-M25-C25, and Z40-M30-C30), or all three modes (Z33-M33-C33) have equal strength.

Both the Sammon map and projection of modes in the case of Z40-M40-C20 data (not shown) resemble those of Z50-M40-C10 (Figure 7b), except the leading two modes are not aligned with the Sammon map axes, similarly to Z50-M50 (Figure 6c). The three cases with equal strength of the meridional and circular modes are presented in Figure 9, including Sammon maps of 4 × 4, 4 × 5, and 4 × 6 SOMs—the spatial patterns of the latter are shown in Figures 8d–8f. In all three cases, the leading mode is approximately aligned with the horizontal axis of the Sammon map. However, despite the equal strength of the two remaining modes, the three cases differ in how these two modes project onto the Sammon map of data as well as onto the SOMs.

Details are in the caption following the image

As in Figure 7, except for: (a) Z60-M20-C20, (b) Z50-M25-C25, and (c) Z40-M30-C30.

In the case of Z60-M20-C20 (Figure 9a), both modes project onto the vertical axis of the Sammon map, such that their positive (negative) values occur mostly above (below) the horizontal axis. A similar distribution is also apparent in patterns of SOM types (Figure 8d): excluding the clearly zonal types, the upper two rows tend to capture southerly (M+) and cyclonic (C+) flow, and their combinations, while northerly (M) and anticyclonic (C) flow prevails in the bottom two rows. Consequently, the SOM selectively and clearly underrepresents concurrence of phases M+C and MC+ (in general, types with circular features in the right-hand side of their patterns). Additionally, the two types that project onto the vertical axis of the Sammon map ([1,3] and [4,4]) have non-mirrored patterns, the former resembling M+, while the latter being a linear combination of M and C. This illustrates the fact that the position of types within an array and their distances from the center of its Sammon map are functions of Euclidean distance. As a consequence, the distances do not necessarily convey the information on the organization of anomalies within the patterns, at least not to the same degree as in the case of pattern correlation.

In the case of the Z50-M25-C25 data set (Figure 9b), the three modes project onto the Sammon map similarly to Z50-M30-C20 (Figure 7c), with the zonal and meridional modes aligned with the two axes and the circular mode scattered across the map with increased density of high correlations in the center of the map. The SOM is in this case deformed such that its upper left corner (see type [1,1] in Figure 8e) and the center of the bottom row (type [4,4]) lie close to this area of increased data density and represent both phases of the circular mode. On the other hand, strong projections of the meridional mode occur in types that lie farthest from the horizontal axis in the Sammon map (types [1,3] and [4,6]). Thus, while two corners of the SOM that lie close to the horizontal axis show opposite phases of the leading mode, the remaining two corners show one phase of one of the remaining two modes.

The Sammon map of the Z40-M30-C30 data set (Figure 9c) reveals that in this case it is the circular mode that projects orthogonally to the leading zonal mode. The projection of the meridional mode on the vertical axis is somewhat less clear than in Z60-M20-C20, but still apparent; note, however, that it does not have a clear representation among the types (Figure 8f). Nonetheless, the northerly (southerly) component of flow does project on most types in the upper (bottom) half of the SOM. Given the alignment of the zonal and circular modes with the central row and column of SOMs (in the case of this particular data set), it matters whether the number of rows and columns of a SOM is odd or even; the odd number causing the center row/column to clearly represent the spectrum of realizations of one particular mode, the remaining types being linear combinations of (typically) all three modes (not shown). This particular alignment occurred in Z50-M40-C10, Z50-M30-C20, and Z40-M30-C30 and also in most of small strongly elongated (2 × 4, 3 × 5) SOMs.

4 Discussion

4.1 On Data

Synthetic data sets are crucial in studying the ability of methods to capture atmospheric circulation and its structure. However, the definition of such data sets as well as the structures emphasized in them can vary considerably. These differences arise mainly from variations in how the studied feature, in this case, the mode of variability or teleconnection, is conceptualized.

In two previous studies that directly compared the skill of PCA and SOMs in extracting known patterns, or modes, from synthetic data sets (Liu et al., 2006; Reusch et al., 2005), most of the data sets were generated by repeating a few predefined patterns. This resulted in data consisting of a few distinct clusters, which is suitable for studying the ability of methods to classify spatial fields based on similarities in their patterns. Since SOMs represent a topologically ordered nonlinear classification, their application to such a task seems straightforward. However, applying PCA to this task is a misconception. Compagnucci and Richman (2008) showed that for PCA to produce a classification, the similarity matrix to be decomposed must express relations (e.g., pattern correlations) between individual fields (so-called T-mode PCA) and not spatial correlations, which is the case for the commonly used S-mode PCA. The reader is referred to chapter 6 of Richman (1986) for more information on modes of decomposition of data by PCA. To our knowledge, nevertheless, a comparison of SOMs with T-mode PCA has not been done yet.

In this paper, we defined our data sets such that they mimic real circulation, which consists of a continuum of circulation fields rather than a few well separated clusters (Philipp et al., 2016). In contrast to previous studies, the modes are not defined as anomaly fields recurring more or less unaltered in synthetic data sets, but rather represent various statistical links across the spatial domain, linear combinations of which create individual circulation fields. In other words, the generation of data can be understood as a reversed PCA. Such data do not limit the applicability of SOMs in any way, and because they are generated from orthogonal modes, they also make it possible to study how these modes project on SOM arrays, including lower-order modes, which has not been attempted before.

4.2 On Methodology

Beside the data sets, two other specific issues seem particularly important for our study due to its aim at methodological aspects. First, any conclusions are constrained by the quality of trained SOMs, while the assessment of quality is to a large extent goal-specific and a result of choice of multiple parameters (e.g., Gibson et al., 2017). Previous studies provide only limited guidelines regarding these choices, and in many cases the parameters were selected randomly or not mentioned at all (Sheridan & Lee, 2011). The second issue regards the application of Sammon mapping beyond a mere tool for validation, toward providing a visual link between data, modes and SOMs.

The training was initialized using randomly selected data points. We did not choose the alternative “linear” initialization from leading two principal components to assure that the relative orientation of SOMs and predefined modes in the Sammon space was not biased by this step. Since SOM topology appeared to be sensitive to initial conditions, 20 variants of SOM were calculated for each combination of the remaining parameters to avoid locally optimal solutions. Experimentally we found that training additional variants did not lead to markedly better SOMs, and we conclude that the variants retained for the study are very close to their respective global optima. Rousi et al. (2015) arrived at the same conclusion after training 10 randomly initialized variants of SOMs of Euro-Atlantic circulation.

Furthermore, the commonly used Euclidean distance was chosen as the metric of similarity, as other metrics provided by the package (e.g., Manhattan, sum-of-squares) did not lead to higher quality SOMs. Last, we kept the number of training iterations fairly low (1,000), and the training was done in a single phase. Increasing the number of iterations and/or splitting the training into two phases of “rough” and “fine” ordering, as suggested by Kohonen (2013), did not lead to qualitatively different SOMs, probably due to the combination of small arrays and small data sets.

The R package provides two neighborhood functions: bubble and Gaussian. Given the limited options provided by the algorithm, we did not test other functions such as the Epanechikov function, which was found superior by Liu et al. (2006), although Jiang et al. (2015) found no notable difference between Epanechikov and Gaussian functions for SOMs optimized for classification. Both available functions led to SOMs with good topology; however, each of them required a specific range of neighborhood radii. In general, the Gaussian function led to SOMs less sensitive to the radius, notably to its lower terminal value. According to Wehrens and Kruisselbrink (2018), the Gaussian neighborhood always adjusts all types to some extent, even if the terminal radius is close or equal to the threshold of 1.0. For the bubble neighborhood, iterations calculated at the threshold account only for the winning type, which turns the training into k-means clustering (Kohonen, 2001). In our case, including such iterations typically led to a marked change in quantization (decrease) and topological (decrease) errors and to dissipation of SOM topology, regardless of the neighborhood function. Gibson et al. (2017) suggest training two sets of SOMs that differ in the inclusion of “k-means” iterations, arguing that the two sets may complement each other, one providing good topology, the other good clustering. Since SOM topology itself was the subject of our study, SOMs were only allowed to include a “k-means phase” if no folding of SOMs was detected in the Sammon projection.

We experimentally found that the effects of the learning rate and the initial radius on SOM quality were relatively small, if chosen reasonably. In a few cases, which were discarded, an insufficient initial radius led to poor initial ordering of types in larger arrays. Additionally, too large learning rate in later iterations broke the topology of SOMs. These findings are in line with previous studies (e.g., Gibson et al., 2017; Sheridan & Lee, 2011).

The last point regards the use of Sammon mapping. Strictly speaking, Sammon mapping adds a step into the analysis that could introduce new uncertainty. It evaluates nonlinear relations among all data points to define a two-dimensional projection plane, but stresses an accurate representation of local topology. It represents only one of many possible frameworks, in which the results could be presented. However, the highly idealized rectangular array of SOMs limits the study of its topological structure to such an extent that the benefits of Sammon mapping seem to considerably outweigh its limitations. Furthermore, utilization of a data projection technique allows one to see individual data points instead of SOM type frequencies only, which makes it possible to show how the linear modes of variability project on data in a nonlinear continuum framework. An interested reader is referred to Stryhal and Plavcová (2023) for more information on Sammon mapping.

4.3 On Using SOMs to Study Teleconnections

Given the high number of combinations retained for the analysis (13 SOMs for each of the 12 data sets), only a relatively small subset of results could be chosen for presentation that, nevertheless, well represents the range of results. While the number of SOM types (size of array) does not seem to play that important a role in SOM topology (contrary to its strong link to explained variation), the shape of the array (ratio of the number of columns and rows) is an important factor. Multiple local optima in SOM topology are often found for oblong SOMs of data that have equal or nearly equal strength of two modes (zonal and meridional, or meridional and circular). This suggests that exploratory data projection by PCA prior to SOM training may aid selection of SOM parameters, such as initialization and shape of array. One such PCA application can be found in Lee (2017), who chose a square SOM array after finding that the leading two PCs accounted for nearly equal variance in their data.

A characteristic type of SOM's behavior, or topology pattern, consisting in a projection of two leading modes of variability on the diagonals of a SOM array, which was suggested in earlier studies (Lennard & Hegerl, 2015; Reusch et al., 2007), seems quite rare. An alignment of SOM diagonals with two leading modes—which results in mirrored patterns of positive and negative phases of modes in opposite corners—was detected only in square SOMs (in about 50% of cases). On the other hand, in about 60% of strongly elongated SOMs (2 × 4, 3 × 5, 4 × 6, and 5 × 7) another topology pattern was observed, in which the two leading modes are approximately aligned with the center row and column of the array, the diagonals then representing their combinations. Consequently, SOMs with an odd number of rows/columns show the spatial structure of modes somewhat better, especially those with small arrays. However, most of the remaining cases, and also most large SOMs and SOMs of data sets in which two (or all three) modes have approximately the same strength, cannot be described with either of the two idealized topology patterns, the projection and identification of modes becoming less straightforward.

The projection of the third-order mode on SOM arrays has so far received only little attention. Here, when the circular mode is weaker than the leading two modes, it projects strongly only on a few types close to the center of the array; but some curvature of isolines is apparent in most types. This is in line with how the mode projects on the Sammon map of data—fields correlated with the pattern of the mode may appear anywhere in the phase space, but in the center of the Sammon map such cases are more common. Fields very strongly correlated with the circular mode occur only when the amplitude of the two leading modes is close to zero (the modes are close to their neutral phase). Without the contribution of the leading modes, fields with only relatively weak anomalies are generated, which project close to the center of the Sammon space. A similar SOM structure can be seen in the results of Reusch et al. (2007) who analyzed North-Atlantic monthly mean SLP patterns (see Figures 1 and 6 of their study). While the leading two principal components, which explained 39% and 13% of variability, projected relatively strongly on all the four corners of the SOM, the third-order principal component (11%) did not have a clear representation within the 5 × 6 SOM array. Despite similarities between the pattern and several types in the center of the array (e.g., type 16, Figure 1), the two phases of this pattern were not recognized as noteworthy regimes of North-Atlantic variability, owing to their overall weak projection on a SOM, at variance with the similar strength of the second- and third-order modes.

The identification of the second and third modes in SOM arrays becomes particularly challenging if their strength is identical. Several different topology patterns were found here, each pointing to a particular limitation of SOMs. First, one of the modes may project relatively weakly onto one or a few types (typically in the center of the SOM) as if its variance was smaller than that of the other mode—the mode thus being underrepresented. Second, parts of the SOM array not occupied by patterns similar to the leading mode (e.g., two opposite corners) may represent one (any) phase of one mode each, the remaining phases being underrepresented. Third, one part of a SOM array (e.g., upper half) may predominantly show one particular combination of phases (positive or negative) of the two modes, the other (e.g., lower) half showing a different combination of phases, but not necessarily opposite to the upper half for both modes, the remaining two combinations of phases being underrepresented. Each of these three topology patterns may have specific utility, except that the pattern itself is largely outside of researchers' control due to the unsupervised nature of SOMs. In real data, the underrepresentation of specific combinations of phases of modes may be, but equally likely may not be, a consequence of these combinations occurring less frequently in the training data set—here it was a result of only minor differences due to sampling. This illustrates the fact that one has to pay a considerable price for the topology preservation constraint of SOMs, relative to simple cluster analysis.

Last, it has to be stressed that in both the Euro-Atlantic and the Pacific-American regions at least four modes of variability are necessary to describe variations in winter circulation, summer circulation being even more complex (Barnston & Livezey, 1987; Casado & Pastor, 2012; Cortesi et al., 2019; Hynčica & Huth, 2020). Therefore, selective and uncontrollable underrepresentation of important modes of variability and their interactions will be an even more important issue in SOMs of real data. We would also like to emphasize that the primary goal and strength of SOMs do not lie in explicitly identifying leading modes of circulation variability, but rather their various combinations represented by recurrent circulation fields. Whether SOMs provide high-quality classifications that outperform other methods, such as k-means, is a different aspect that was not evaluated in this study. However, this does not diminish its importance.

4.4 On Using SOMs to Study the Nonlinearity of Teleconnections

We clearly show that the topology of SOMs strongly depends on the structure of data. Even if the leading two modes of variability project along elements of a SOM array (such as its diagonals), marked irregularities can be identified in the spatial patterns of types if the data comprise more than two modes of variability. One way these irregularities demonstrate themselves in the spatial patterns of SOM types is a presence of non-mirrored patterns in opposite types within a SOM array, for example, in opposite corners. Reusch et al. (2007) pointed out that such non-mirrored patterns are a feature that distinguishes SOMs from linear analyses (such as PCA). However, one needs to remember that such apparent nonlinearities do not necessarily point to nonlinearity of the underlying modes of variability—non-mirrored patterns in (any) two opposite positions within a SOM array clearly prevail even if data are linear. In the case of our synthetic data sets, these differences may be due to sampling, or simply be artifacts of projecting multidimensional data onto a two-dimensional phase space. We showed that especially lower order modes can project on SOM arrays in various ways, which, nevertheless, all lead to some kind of selective underrepresentation of co-occurrences of some phases of modes. Our data can be understood bivariate or trivariate—as regardless of the number of grid points, or input variables, all variability can be linked to only two or three modes. The fact that nonlinearity appears so clearly in SOMs of such simplistic data sets suggests that it will occur in any complex (multimodal, noisy) real-world data regardless of how well such data may be approximated by linear models. The tendency of nonlinear methods to detect nonlinearity where there is none has already been documented (Christiansen, 2005), and one should, therefore, not forget the strong predisposition of results by the character of the used research method.

5 Summary and Outlook

The main findings of the study are as follows:
  • We constructed synthetic data sets as various linear combinations of three idealized modes of variability, each mode expressing a specific statistical link across the spatial domain. The data sets were used to evaluate (a) how the structure of data affects the ability of SOMs to capture individual modes and their combinations, and (b) the extent to which SOMs can depict modes one would obtain by traditional methods such as PCA.

  • We propose a novel utilization of Sammon mapping, a non-linear projection method previously limited to evaluating SOM topology in climatology. In this study, Sammon mapping is used to compare topologies of data sets, visualize linear modes of variability, study how SOMs represent data, and better understand conceptual differences between distinct approaches used to represent high-dimensional data sets.

  • It has been known that a careful choice of SOM parameters can help reduce quantization and topological errors, in other words, lead to better SOMs. However, we found that a reduction in the errors is conditional on the structure of data, or relations between individual modes of variability. We suggest that PCA become routinely utilized prior to SOM training to identify data structures conducive to a high sensitivity of SOM topology to some parameters, such as initialization and shape of SOM array. Some examples of such data structures discovered in this study are small or large differences between the strength of leading modes of variability. Nevertheless, we note that the specifics of how to apply PCA in this context will require further analyses.

  • The two leading modes of variability only rarely project on the diagonals of SOM arrays, a pattern suggested in some previous studies. Consequently, patterns or types in opposite corners of a SOM array do not show mirrored patterns representing positive and negative phases of one mode. Instead, various combinations of modes are observed. Additionally, the lack of presence of such mirrored patterns in opposite corners cannot be considered proof of data non-linearity since all data sets analyzed in this study are strictly linear.

  • When the third-order mode of variability is weak, it projects strongly only on the center of a SOM array. As its strength approaches that of the second-order mode, one of three following SOM responses occurs: (a) One of the two modes is under-represented, that is, it projects on the SOM as if it was considerably weaker than the other mode. (b) One phase of each of the two modes is underrepresented, despite both phases of each mode being generated equally strong. (c) Certain combinations of phases of the two modes appear in certain parts of the SOM array, that is, the two modes blend together, while other combinations remain under-represented.

  • These results suggest that using SOMs to study teleconnections in the context of recurrent physical (e.g., flow) patterns, rather than statistical links, has its own limitations. While interpreting the output of SOMs is more intuitive compared with the phase space spanned by orthogonal spatial-temporal functions of co-variability, the compression of high-dimensional data to a typically two-dimensional array of nodes results in an output that is not able to properly resolve leading modes of variability, even if the modes are limited in number and highly idealized.

Generating synthetic data sets from real modes of variability—and potentially from a larger number of modes, adding noise to the generated data, and/or defining more complex nonlinear modes of data variability may represent other useful steps in understanding how SOM arrays relate to teleconnections. Additionally, only spatial relations were analyzed here. The response of the frequency of occurrence of individual SOM types to changes in the underlying modes may also be of interest, especially regarding the ability of SOMs to capture inter-decadal changes in teleconnections and distinguishing them from sampling variability. The temporal aspects will be analyzed in another study.

Last, we want to emphasize that our evaluation does not assess the strengths and weaknesses of SOMs in identifying recurrent patterns (or, in other words, providing meaningful and useful classifications), nor do we suggest that SOMs should not be used as a powerful exploratory tool under less stringent constraints than those involved in studying teleconnections. However, we believe that this area has also been considerably understudied from a methodological and theoretical perspective. Therefore, we recommend conducting more studies that would compare SOMs with other classification approaches, such as nonhierarchical cluster analysis.

Acknowledgments

This research was supported by the Czech Science Foundation, project 17-07043S. We acknowledge the work of all authors of the Kohonen and MASS R packages, without which this study would not have been possible.

    Data Availability Statement

    The “Synthetic data set for studying SOMs, version 1” (Stryhal, 2020) containing spatial patterns of the predefined modes of variability and all generated data sets used in the study can be accessed at Figshare under the CC 4.0 license.