ShowCorrelation

Visual Correlation

© 2002 Inductive Solutions, Inc. All rights reserved.

Help and Documentation

What ShowCorrelation Does

ShowCorrelation creates a visual color-coded representation of a given correlation matrix. It also determines whether your correlation matrix is valid ("positive definite") and creates a Principal Components Report. It is server-based: you submit the correlation matrix as input through an HTML form. The color-coded representation is shown in three orderings (the original order and two sorted orders) that let you easily see patterns and data groupings.

Data patterns and groupings are much harder to see from a table of numbers -- the usual representation of correlation -- than with the ShowCorrelation color-coded representation.

Here is an example of a 4 variable economic time series:

The correlation of these 4 variables is shown in this table:
x1 x2 x3 x4
x1 1.000
x2 0.992 1.000
x3 0.621 0.604 1.000
x4 0.465 0.446 -0.177 1.000
So, for example, the correlation between x1 and x3 is 0.621. Note that correlation values always range from +1 (perfect positive correlation) to -1 (perfect negative correlation). The correlation matrix is symmetric: the correlation between x1 and x3 is the same as the correlation between x3 and x1. Consequently, we usually show only the lower triangle.

The ShowCorrelation representations are as follows (legend of the ShowCorrelation color codes):

Not sorted.
(color-coded matrix, variables in the order x1 x2 x3 x4)

Sorted by Contribution of Major Principal Component
(color-coded matrix, variables in the order x1 x2 x3 x4)

Sorted by Angular order of First Two Principal Components
(color-coded matrix, variables in the order x2 x1 x4 x3)

The first representation (Not Sorted) lists the variables in their original order.
This is the legend of the ShowCorrelation color codes.

The next two representations sort the variables by the contribution of the Principal Components.

Note that the {x1, x2} grouping (= {GNP Implicit Price Deflator, Gross National Product}) and, to a lesser extent, the {x3, x1, x2} grouping can be easily seen.
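As an illustration of angular ordering, the following Python sketch places each variable in the plane of the first two Principal Components and sorts the variables by angle. The loadings are the first two columns of the Principal Components matrix reported for this example. The sorting rule (NumPy's arctan2, descending angle) is an assumption, since the exact rule ShowCorrelation applies is not described here.

```python
import numpy as np

labels = ["x1", "x2", "x3", "x4"]

# First two columns of the Principal Components matrix from the report
# for this example (one row per variable).
loadings = np.array([
    [0.6092, -0.0287],
    [0.6044, -0.0250],
    [0.4224,  0.6192],
    [0.2919, -0.7843],
])

# Angle of each variable in the (PC1, PC2) plane, in radians.
angles = np.arctan2(loadings[:, 1], loadings[:, 0])

# Assumed rule: sort by descending angle. Which variable comes first
# (where the circle is "cut") is arbitrary in this sketch.
order = [labels[i] for i in np.argsort(-angles)]
print(order)
```

Read as a circle, the resulting order keeps x1 and x2 adjacent and agrees with the x2 x1 x4 x3 ordering shown above, apart from the choice of starting variable.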

The other output is the Principal Components Report:

Variance (Eigenvalue) and Principal Components Report

Principal   Variance       Per Cent    Accumulated   Accumulated
Component   (Eigenvalue)   Variance    Variance      % Variance
x1          2.6375         65.938 %    2.6375        65.938 %
x2          1.1710         29.275 %    3.8085        95.213 %
x3          0.1848         4.621 %     3.9934        99.834 %
x4          0.0066         0.166 %     4.0000        100.000 %

This report shows how the variance is distributed among the Principal Components. In this example, 95% of the variance of the original 4 variables (the sum of the "eigenvalues" of the Correlation Matrix) is contained in the first 2 Principal Components. This provides additional insight for the Visual Correlation displays.
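The percentage columns of the report can be reproduced from the eigenvalues alone. This is a minimal Python sketch, not ShowCorrelation's own code; it assumes NumPy and uses the eigenvalues printed in the report above.

```python
import numpy as np

# Eigenvalues copied from the report above.
eigenvalues = np.array([2.6375, 1.1710, 0.1848, 0.0066])

# The trace of a correlation matrix equals the number of variables,
# so the total variance is simply the variable count (4 here).
total = len(eigenvalues)

percent = 100.0 * eigenvalues / total
accumulated = np.cumsum(eigenvalues)
accumulated_percent = 100.0 * accumulated / total

for i in range(total):
    print(f"PC{i + 1}  {eigenvalues[i]:.4f}  {percent[i]:.3f} %  "
          f"{accumulated[i]:.4f}  {accumulated_percent[i]:.3f} %")
```

Dividing by the number of variables rather than by the rounded eigenvalue sum keeps the percentages consistent with the report even when the printed eigenvalues have lost a digit to rounding.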

The second part of the Principal Components Report lists all Principal Components (the "eigenvectors" of the Correlation Matrix):

Principal Components (Row Vectors)

x1 x2 x3 x4
0.6092 -0.0287 -0.2999 -0.7336
0.6044 -0.0250 -0.4219 0.6754
0.4224 0.6192 0.6595 0.0569
0.2919 -0.7843 0.5451 0.0503

The Principal Components transformation is given by the columns of this matrix M. To transform a vector
x = (x1, x2, x3, x4)
in the original data set to a Principal Components representation y, (matrix) multiply it by M:
y = x M
Note: You can use fewer columns of M in this matrix multiplication; the number of columns corresponds to the number of principal components you require. So, suppose M* is the 4-row x 2-column submatrix consisting of the columns (x1 x2):
x1 x2
0.6092 -0.0287
0.6044 -0.0250
0.4224 0.6192
0.2919 -0.7843
If we perform the matrix multiplication
y = x M*
(where x is in the original 4-dimensional space and y is in the 2-dimensional transformed space corresponding to the first 2 Principal Components), then the new 2-dimensional vectors y retain 95% of the variance of the original 4-dimensional vectors x.
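The multiplication y = x M* can be sketched in a few lines of Python with NumPy. The matrix values are the ones listed above; the observation x is a made-up example, and it is assumed to be standardized (zero mean, unit variance per variable), as is usual when Principal Components are derived from a correlation matrix.

```python
import numpy as np

# Principal Components matrix M from the report (columns are components).
M = np.array([
    [0.6092, -0.0287, -0.2999, -0.7336],
    [0.6044, -0.0250, -0.4219,  0.6754],
    [0.4224,  0.6192,  0.6595,  0.0569],
    [0.2919, -0.7843,  0.5451,  0.0503],
])

# M* keeps only the first two columns (the first two Principal Components).
M_star = M[:, :2]

# Hypothetical standardized observation (x1, x2, x3, x4).
x = np.array([1.0, 0.5, -0.2, 0.3])

# Row vector times matrix: a 4-dimensional x becomes a 2-dimensional y.
y = x @ M_star
print(y)
```

For a whole data set, stack the observations as rows of a matrix X and compute X @ M_star in the same way.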

Here is another example. The correlation matrix is (with 0 in the upper triangle):
r1 r2 r3 r4 r5 r6 r7
1.000 0.000 0.000 0.000 0.000 0.000 0.000
-0.141 1.000 0.000 0.000 0.000 0.000 0.000
-0.100 0.456 1.000 0.000 0.000 0.000 0.000
0.976 -0.329 0.688 1.000 0.000 0.000 0.000
-0.787 -0.272 0.485 -0.507 1.000 0.000 0.000
-0.235 0.273 0.061 -0.229 0.255 1.000 0.000
-0.631 0.278 0.269 -0.745 0.643 0.746 1.000

The ShowCorrelation output is:

Visual Correlation

Not sorted.
(color-coded matrix, variables in the order r1 r2 r3 r4 r5 r6 r7)

Note: The correlation matrix you submitted is not valid

It is not positive definite (Eigenvalues are not all strictly >0). Warning: Results may be questionable.
Continuing Visual Correlation Report anyway

This Correlation does not have the theoretical property of being positive definite. Some reasons for this are discussed below.

The output continues...

Visual Correlation

Sort by Contribution of Major Principal Component
(color-coded matrix, variables in the order r7 r5 r6 r2 r3 r4 r1)
These colors reveal two variable groups: {r7, r5, r6} and {r1, r4}.

Visual Correlation

Sort by Angular order of First Two Principal Components
(color-coded matrix, variables in the order r1 r4 r3 r2 r6 r5 r7)

On visual inspection, these colors reveal two variable groups: {r1, r4} and {r7, r5}.

The Principal Components Report explicitly shows the reason that this Correlation Matrix is not valid: Principal Component vectors #6 and #7 have negative variances (eigenvalues). The Report continues...

Variance (Eigenvalue) and Principal Components Report

Note: The correlation matrix you submitted is not valid

It is not positive definite (Eigenvalues are not all strictly >0). Warning: Results may be questionable.
Continuing Principal Components Report anyway.
Principal   Variance       Per Cent    Accumulated   Accumulated
Component   (Eigenvalue)   Variance    Variance      % Variance
x1          3.4336         49.052 %    3.4336        49.052 %
x2          1.7292         24.703 %    5.1629        73.755 %
x3          1.3158         18.797 %    6.4786        92.552 %
x4          0.8958         12.797 %    7.3744        105.348 %
x5          0.1372         1.959 %     7.5115        107.308 %
x6 **       -0.0329        -0.470 %    7.4786        106.838 %
x7 **       -0.4786        -6.838 %    7.0000        100.000 %

** Warning. Your input correlation matrix is not valid (not positive definite).
All eigenvalues must be positive. ** These are not **

Principal Components (Row Vectors)

x1 x2 x3 x4 x5 x6 ** x7 **
-0.4916 0.1255 -0.1741 -0.3354 -0.5034 -0.5469 -0.2175
0.1479 0.3289 -0.6711 0.4698 0.0487 -0.3048 0.3220
0.0612 0.7715 0.2110 0.2091 -0.0089 0.1490 -0.5390
-0.4676 0.4578 0.1834 -0.2212 0.1263 0.2282 0.6490
0.4156 0.1260 0.5701 -0.0082 0.0158 -0.6429 0.2696
0.3057 0.1901 -0.3250 -0.7115 0.4843 -0.0861 -0.1284
0.4978 0.1393 -0.1027 -0.2603 -0.7024 0.3361 0.2192


** Warning. Your input correlation matrix is not valid (not positive definite).
All eigenvalues must be positive. ** These Principal Components have non-positive eigenvalues **

Copying Tab-Delimited Tables to ShowCorrelation

There are two inputs to ShowCorrelation. The first is an input box that asks for the number of variables; the second is a text area box.

In the text area box you paste a table: the first row of the table contains variable labels, the remainder of the table contains the Correlation Matrix.

The table must be in tab-delimited format: the columns are separated by tabs and the rows are separated by carriage returns. When you copy a table from an Excel spreadsheet to a text area box in the ShowCorrelation dialog, it will automatically be in tab-delimited text.

Here is an example of the ShowCorrelation input page:


How many variables? Note: Maximum Number of Variables: 60
Copy a Correlation Table in tab-delimited text and paste it here:

In this sample problem, there are 7 variables, labeled r1, r2,...,r7. Their correlation matrix was already pasted (and is shown) in the text area box. The Grab Table button (not activated here) grabs the submitted table and verifies that it is in the correct form.

Here is a text file containing the table in tab-delimited text.

Here is a Microsoft Excel file containing the table in tab-delimited text.

Note: You only need to fill in the lower triangle of the Correlation Matrix. The upper triangle can be set to zero or to any other values. ShowCorrelation reads only the lower triangle of the submitted tab-delimited table.
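To make the input format concrete, here is a minimal Python sketch (assuming NumPy) that reads a tab-delimited table of this form and rebuilds the full symmetric matrix from the lower triangle. The function name parse_correlation_table is hypothetical, and this is not the parser ShowCorrelation itself uses; the sample values are the 4-variable correlations from the first example.

```python
import numpy as np

def parse_correlation_table(text):
    """Parse a tab-delimited table: one label row, then the matrix rows."""
    rows = [line.split("\t") for line in text.strip().splitlines()]
    labels = rows[0]
    values = np.array([[float(v) for v in row] for row in rows[1:]])
    # Keep only the lower triangle; whatever is above the diagonal is ignored.
    lower = np.tril(values)
    # Mirror the lower triangle and avoid doubling the diagonal.
    full = lower + lower.T - np.diag(np.diag(lower))
    return labels, full

table = (
    "x1\tx2\tx3\tx4\n"
    "1.000\t0\t0\t0\n"
    "0.992\t1.000\t0\t0\n"
    "0.621\t0.604\t1.000\t0\n"
    "0.465\t0.446\t-0.177\t1.000\n"
)

labels, R = parse_correlation_table(table)
print(labels)
print(R)
```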

If you want to try a sample run, click here.

How to Find or Compute a Correlation Matrix

You can find many correlation tables in published research on the web. This Excel workbook has several correlation tables (with web page reference links) showing applications in:

Use ShowCorrelation on the outlined tables to see evidence of data clustering.

To compute your own correlation matrix, use the Analysis ToolPak Add-In for Microsoft Excel (it comes with most versions of Excel). To install it, run Excel. On the Tools menu, select

Tools=>Add-Ins...>Analysis ToolPak

Now, to run the Correlation Utility: suppose you have a table of variables in an Excel worksheet with the first row containing variable labels. On the Tools menubar, select

Tools=>Data Analysis...>Correlation

In the dialog, select the table as the input range for your data, check Labels in First Row, and select an output range on your worksheet for the correlation matrix. Excel will place the Correlation Matrix there. Then you can copy it to ShowCorrelation.
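If you work outside Excel, the same kind of table can be produced in Python. The sketch below is only an illustration under stated assumptions: it uses NumPy's corrcoef, the data set is randomly generated, and the labels v1, v2, v3 are hypothetical. It writes the label row followed by the lower triangle in tab-delimited form.

```python
import numpy as np

# Hypothetical data: 50 observations (rows) of 3 variables (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
data[:, 1] += 0.8 * data[:, 0]  # make v1 and v2 correlated
labels = ["v1", "v2", "v3"]

# rowvar=False tells NumPy that each column is a variable.
R = np.corrcoef(data, rowvar=False)

# First row: labels. Remaining rows: lower triangle, zeros above the diagonal.
lines = ["\t".join(labels)]
for row in np.tril(R):
    lines.append("\t".join(f"{value:.3f}" for value in row))
table = "\n".join(lines)
print(table)
```

The printed text can be pasted into the text area box in the same way as a table copied from Excel.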

Why is My Correlation Matrix Invalid?
Reasons Why Your Correlation may be "Not Positive Definite"

ShowCorrelation assumes that the input correlation you submit is valid (we do not compute correlations).

One of the theoretical properties of correlation and covariance matrices is that they are positive definite. Consequently, if a given correlation matrix is not positive definite, it is not valid: theoretical conclusions about the underlying data set may not be correct.

A symmetric matrix is "positive definite" if all of its eigenvalues are positive. From a practical perspective, this means that the variances (eigenvalues) of the Principal Components (eigenvectors) must all be strictly greater than zero. ShowCorrelation displays these eigenvalues in its Principal Components Report.
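The eigenvalue test is easy to run yourself. The following Python sketch (assuming NumPy; the function name is_positive_definite is hypothetical and this is not ShowCorrelation's code) applies it to the 7-variable matrix from the second example above, after mirroring the lower triangle into a full symmetric matrix.

```python
import numpy as np

def is_positive_definite(matrix, tol=0.0):
    """Return (flag, eigenvalues) for a symmetric matrix."""
    eigenvalues = np.linalg.eigvalsh(matrix)  # sorted in ascending order
    return bool(eigenvalues[0] > tol), eigenvalues

# Lower triangle of the 7-variable example (r1 ... r7).
lower = np.array([
    [ 1.000,  0.000, 0.000,  0.000, 0.000, 0.000, 0.000],
    [-0.141,  1.000, 0.000,  0.000, 0.000, 0.000, 0.000],
    [-0.100,  0.456, 1.000,  0.000, 0.000, 0.000, 0.000],
    [ 0.976, -0.329, 0.688,  1.000, 0.000, 0.000, 0.000],
    [-0.787, -0.272, 0.485, -0.507, 1.000, 0.000, 0.000],
    [-0.235,  0.273, 0.061, -0.229, 0.255, 1.000, 0.000],
    [-0.631,  0.278, 0.269, -0.745, 0.643, 0.746, 1.000],
])

# Mirror the lower triangle; subtract the diagonal so it is not doubled.
full = lower + lower.T - np.diag(np.diag(lower))

valid, eigenvalues = is_positive_definite(full)
print("positive definite:", valid)
print("eigenvalues (ascending):", np.round(eigenvalues, 4))
```

One visible source of trouble in this matrix is the triple r1, r3, r4: r4 is strongly correlated with both r1 (0.976) and r3 (0.688), yet r1 and r3 are reported as nearly uncorrelated (-0.100), which no real data set can produce.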

Note that the correlation matrices we normally deal with in the "real world" are statistically estimated. The nature of this estimation is what presents problems.

Some common reasons for correlations not being positive definite are:

  1. Data Input Errors
    Data errors may be due to typographical mistakes, round-off, or copying the wrong rows or columns.
  2. Sampling Errors
    The fewer the samples in your data set, the more likely these errors are to arise. Outliers in your samples can also skew the computation. A variation of this is the problem of computing a sample correlation pair by pair versus statistically estimating the correlation matrix as a whole (the so-called Polychoric Correlation problem). Most spreadsheet routines (e.g., Excel) estimate the correlation matrix by a pair-by-pair computation.
  3. No variance in a variable
    If the values of a variable are all constant, its variance is zero and its correlation with any other variable is undefined.
  4. Perfect linear relations between variables
    If a set of variables has an exact linear relationship, the correlation matrix is singular: at least one eigenvalue is zero.

In these cases, ShowCorrelation issues a warning and continues with its computations.

Does it Matter? Fixing the Correlation Matrix

For exploratory data analysis and heuristic clustering applications, it may not matter if all eigenvalues are positive: in any case, ShowCorrelation uses the first two eigenvalues to determine rankings, and in most correlations that we have seen, these first two eigenvalues are positive.

Are there any fixes besides re-sampling or re-estimating the data set? Some researchers recommend "tuning" the correlation matrix by slowly reducing the magnitude of the off-diagonal values until the matrix becomes positive definite. One heuristic method that "preserves the correlation structure" is to multiply all off-diagonal matrix elements by a constant (0.9, 0.8, 0.7...) until the matrix becomes positive definite.
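The shrinking heuristic can be sketched as follows in Python with NumPy. This is one reading of the method, not a definitive implementation: the function name shrink_until_positive_definite is hypothetical, each constant is applied to the original matrix (rather than compounding), and the input is assumed to have ones on the diagonal. The sample matrix is the r1, r3, r4 sub-block of the 7-variable example.

```python
import numpy as np

def shrink_until_positive_definite(matrix):
    """Scale off-diagonal entries by 1.0, 0.9, 0.8, ... until positive definite.

    Assumes a symmetric matrix with ones on the diagonal.
    Returns (factor, shrunk_matrix).
    """
    identity = np.eye(matrix.shape[0])
    for k in range(10, 0, -1):
        factor = k / 10
        # Off-diagonals become factor * r; the diagonal stays at 1.
        shrunk = factor * matrix + (1.0 - factor) * identity
        if np.linalg.eigvalsh(shrunk)[0] > 0:
            return factor, shrunk
    raise ValueError("no factor down to 0.1 gave a positive definite matrix")

# Correlations among r1, r3 and r4, taken from the 7-variable example.
R = np.array([
    [ 1.000, -0.100, 0.976],
    [-0.100,  1.000, 0.688],
    [ 0.976,  0.688, 1.000],
])

factor, fixed = shrink_until_positive_definite(R)
print("factor:", factor)
print(np.round(fixed, 4))
```

Because every off-diagonal element is scaled by the same constant, the signs and the relative sizes of the correlations are unchanged, which is the sense in which the structure is preserved.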