ShowCorrelation: Visual Correlation and Documentation
Data patterns and groupings are much harder to see from a table of numbers -- the usual representation of correlation -- than with the ShowCorrelation color-coded representation.
Here is an example using a 4-variable economic time series:
What ShowCorrelation Does
ShowCorrelation creates a visual color-coded representation of a given correlation matrix. It also determines whether your correlation matrix is valid ("positive definite") and creates a Principal Components Report. It is server-based: you submit the correlation matrix as input through an HTML form. The color-coded representation is shown unsorted and sorted by two different methods, which let you easily see patterns and data groupings.
The correlation of these 4 variables is shown in this table:
 | x1 | x2 | x3 | x4 |
x1 | 1.000 | | | |
x2 | 0.992 | 1.000 | | |
x3 | 0.621 | 0.604 | 1.000 | |
x4 | 0.465 | 0.446 | -0.177 | 1.000 |
The ShowCorrelation representations are as follows (see the legend of the ShowCorrelation color codes). Only the variable order of each color-coded matrix is listed here:
Not sorted: x1 | x2 | x3 | x4
Sorted by Contribution of Major Principal Component: x1 | x2 | x3 | x4
Sorted by Angular order of First Two Principal Components: x2 | x1 | x4 | x3
The first representation (Not sorted) lists the variables in the original order.
The next two representations sort the variables by the contribution of the Principal Components.
Note that the {x1, x2} grouping (= {GNP Implicit Price Deflator, Gross National Product}) and, to a lesser extent, the {x3, x1, x2} grouping can be easily seen.
The other output is the Principal Components Report:
Variance (Eigenvalue) and Principal Components Report
Principal Component | Variance (Eigenvalue) | Per Cent Variance | Accumulated Variance | Accumulated % Variance |
x1 | 2.6375 | 65.938 % | 2.6375 | 65.938 % |
x2 | 1.1710 | 29.275 % | 3.8085 | 95.213 % |
x3 | 0.1848 | 4.621 % | 3.9934 | 99.834 % |
x4 | 0.0066 | 0.166 % | 4.0000 | 100.000 % |
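The percentage columns of this report can be reproduced with simple arithmetic. The sketch below assumes that each percentage is the eigenvalue divided by the number of variables (the eigenvalues of a correlation matrix sum to the number of variables); that reading is inferred from the figures in the table, and the function name `variance_report` is made up for illustration.

```python
# Eigenvalues copied from the 4-variable report above.
eigenvalues = [2.6375, 1.1710, 0.1848, 0.0066]


def variance_report(eigs):
    """Return rows of (eigenvalue, percent, accumulated, accumulated percent).

    Assumption: percentages are taken relative to the number of variables,
    which is what the eigenvalues of a correlation matrix sum to.
    """
    total = float(len(eigs))
    rows = []
    running = 0.0
    for ev in eigs:
        running += ev
        rows.append((ev, 100.0 * ev / total, running, 100.0 * running / total))
    return rows


report = variance_report(eigenvalues)
for i, (ev, pct, acc, acc_pct) in enumerate(report, start=1):
    print(f"PC{i}: {ev:8.4f} {pct:8.3f} % {acc:8.4f} {acc_pct:8.3f} %")
```

Small differences in the last digit against the report are expected, because the eigenvalues used here are already rounded to four decimals.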
The second part of the Principal Components Report lists all
Principal Components (the "eigenvectors" of the Correlation Matrix):
Principal Components (Row Vectors)
x1 | x2 | x3 | x4 |
0.6092 | -0.0287 | -0.2999 | -0.7336 |
0.6044 | -0.0250 | -0.4219 | 0.6754 |
0.4224 | 0.6192 | 0.6595 | 0.0569 |
0.2919 | -0.7843 | 0.5451 | 0.0503 |
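For readers who want to check these numbers offline, the eigenvalues and eigenvectors of a small symmetric matrix can be computed with the classical cyclic Jacobi rotation method. This is only a sketch using the standard library; how ShowCorrelation itself computes them is not stated on this page, and the function name `jacobi_eigen` is an assumption for illustration. Eigenvector signs are arbitrary, so a computed vector may be the negative of the one in the report.

```python
import math


def jacobi_eigen(matrix, max_sweeps=50, tol=1e-12):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (values, vectors) sorted by descending eigenvalue, where
    vectors[k] is the unit eigenvector paired with values[k].
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # V <- V J accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    values = [a[i][i] for i in order]
    vectors = [[v[k][i] for k in range(n)] for i in order]
    return values, vectors


# The 4-variable correlation matrix from this example, filled in symmetrically.
corr = [
    [1.000, 0.992, 0.621, 0.465],
    [0.992, 1.000, 0.604, 0.446],
    [0.621, 0.604, 1.000, -0.177],
    [0.465, 0.446, -0.177, 1.000],
]
values, vectors = jacobi_eigen(corr)
print([round(x, 4) for x in values])
```

The printed eigenvalues should agree with the Variance (Eigenvalue) column up to the rounding of the three-decimal input.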
x1 | x2 |
0.6092 | -0.0287 |
0.6044 | -0.0250 |
0.4224 | 0.6192 |
0.2919 | -0.7843 |
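The two sort orders can be reproduced from these first two columns. The page does not spell out the sorting rules, so the rules below are inferred from this example: variables are ordered by descending loading on the first component, and by descending angle of the point (first loading, second loading) measured counter-clockwise in the range 0 to 2π. The function names are made up for illustration, and because eigenvector signs are arbitrary, the result depends on the signs as printed in the report.

```python
import math


def sort_by_major_component(labels, pc1):
    """Order variables by descending loading on the first principal component."""
    pairs = sorted(zip(labels, pc1), key=lambda t: t[1], reverse=True)
    return [label for label, _ in pairs]


def sort_by_angle(labels, pc1, pc2):
    """Order variables by descending angle of (pc1, pc2), taken in [0, 2*pi)."""
    keyed = []
    for label, x, y in zip(labels, pc1, pc2):
        angle = math.atan2(y, x) % (2.0 * math.pi)
        keyed.append((angle, label))
    return [label for _, label in sorted(keyed, reverse=True)]


# Loadings copied from the first two columns of the report above.
labels = ["x1", "x2", "x3", "x4"]
pc1 = [0.6092, 0.6044, 0.4224, 0.2919]
pc2 = [-0.0287, -0.0250, 0.6192, -0.7843]

print(sort_by_major_component(labels, pc1))  # ['x1', 'x2', 'x3', 'x4']
print(sort_by_angle(labels, pc1, pc2))       # ['x2', 'x1', 'x4', 'x3']
```

Both results match the variable orders shown in the sorted representations of this example.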
Here is another example. The correlation matrix is (with 0 in the upper triangle):
r1 | r2 | r3 | r4 | r5 | r6 | r7 |
1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
-0.141 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
-0.100 | 0.456 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
0.976 | -0.329 | 0.688 | 1.000 | 0.000 | 0.000 | 0.000 |
-0.787 | -0.272 | 0.485 | -0.507 | 1.000 | 0.000 | 0.000 |
-0.235 | 0.273 | 0.061 | -0.229 | 0.255 | 1.000 | 0.000 |
-0.631 | 0.278 | 0.269 | -0.745 | 0.643 | 0.746 | 1.000 |
The ShowCorrelation output is:
Visual Correlation
Not sorted: r1 | r2 | r3 | r4 | r5 | r6 | r7 (color-coded matrix not reproduced)
Note: The correlation matrix you submitted is not valid
It is not positive definite (Eigenvalues are not all strictly >0). Warning: Results may be questionable.
Continuing Visual Correlation Report anyway
This Correlation does not have the theoretical property of being positive definite. Some reasons for this are discussed below.
The output continues...
Visual Correlation
Sort by Contribution of Major Principal Component: r7 | r5 | r6 | r2 | r3 | r4 | r1 (color-coded matrix not reproduced)
These colors reveal two variable groups: {r7, r5, r6} and {r1, r4}.
Visual Correlation
Sort by Angular order of First Two Principal Components: r1 | r4 | r3 | r2 | r6 | r5 | r7 (color-coded matrix not reproduced)
On visual inspection, these colors reveal two variable groups: {r1, r4} and {r7, r5}.
The Principal Components Report explicitly shows the reason that this Correlation Matrix is not valid: Principal Component vectors #6 and #7 have negative variances (eigenvalues). The Report continues...
There are two inputs to ShowCorrelation. The first is an input box that asks for the number of variables; the second is a text area box.
In the text area box you paste a table: the first row of the table contains variable labels, the remainder of the table contains the Correlation Matrix.
The table must be in tab-delimited format: the columns are separated by tabs and the rows are separated by carriage returns. When you copy a table from an Excel spreadsheet to a text area box in the ShowTheBest dialog, it will automatically be in tab-delimited text.
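As a rough illustration of this input format, the sketch below splits tab-delimited text into a label row and a numeric matrix. The sample text and the function name `parse_correlation_table` are made up for illustration; this is not ShowCorrelation's own parser.

```python
def parse_correlation_table(text):
    """Split tab-delimited text into (labels, matrix of floats).

    The first non-empty row holds the variable labels; the remaining rows
    hold the matrix, one row per line and one value per tab-separated cell.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    labels = [cell.strip() for cell in rows[0]]
    matrix = [[float(cell) for cell in row] for row in rows[1:]]
    n = len(labels)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("expected a square matrix matching the label row")
    return labels, matrix


# Hypothetical 3-variable table, as it would look when copied from a spreadsheet.
sample = "a\tb\tc\n1.0\t0\t0\n0.5\t1.0\t0\n-0.2\t0.3\t1.0\n"
labels, matrix = parse_correlation_table(sample)
print(labels)
print(matrix)
```

The size check mirrors the first input box on the form: the number of labels must equal the number of rows and columns of the matrix.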
Here is an example of the ShowCorrelation input page:
In this sample problem, there are 7 variables, labeled r1, r2,...,r7.
Their correlation matrix was already pasted (and is shown) in the text area box. The Grab Table button (not activated here) grabs the submitted table and verifies that it is in the correct form.
Here is a text file containing the table in tab-delimited text.
Here is a Microsoft Excel file containing the same table.
Note: You only need to fill in the lower triangle of the Correlation Matrix.
The upper triangle can be set to zero or to any other values: ShowCorrelation reads only the lower triangle of the submitted tab-delimited table.
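One way to picture this is that the full symmetric matrix is rebuilt by mirroring the lower triangle, whatever the upper triangle contains. The sketch below shows that idea on the first three variables of the 7-variable sample; whether ShowCorrelation does exactly this internally is an assumption, and the function name is made up for illustration.

```python
def symmetrize_from_lower(matrix):
    """Build a full symmetric matrix from the lower triangle, ignoring the upper."""
    n = len(matrix)
    return [[matrix[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]


# Rows and columns r1..r3 of the sample matrix, with zeros above the diagonal.
lower = [
    [1.000, 0.000, 0.000],
    [-0.141, 1.000, 0.000],
    [-0.100, 0.456, 1.000],
]
full = symmetrize_from_lower(lower)
for row in full:
    print(row)
```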
If you want to try a sample run, click here.
You can find many correlation tables in published research on the web. This
Excel workbook has several correlation tables (with web page
reference links) showing applications in:
To compute your own correlation matrix, use the Microsoft Excel Add-In (it comes with most versions of Excel). To install it, run Excel.
On the Tools menubar, select
Now, to run the Correlation Utility: suppose you have a table of variables in an Excel worksheet with the first row containing variable labels.
On the Tools menubar, select
In the dialog, select the table as the input range for your data, check
Labels in First Row, and select an output range on your worksheet for the correlation matrix. Excel will place the Correlation Matrix there. Then you can copy it to ShowCorrelation.
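If Excel is not at hand, a correlation matrix can also be computed directly. The sketch below uses the ordinary Pearson correlation, computed pair by pair, on a small made-up data set; the column values and function names are illustrative assumptions, not taken from this page.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlation matrix; columns[i] holds the samples of variable i."""
    n = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(n)] for i in range(n)]


# Hypothetical data: three variables with five samples each.
data = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 3, 4, 1, 2],
]
for row in correlation_matrix(data):
    print("\t".join(f"{value:.3f}" for value in row))
```

The output is printed tab-delimited so that, with a label row added on top, it has the shape ShowCorrelation expects. Note that `pearson` divides by zero if a variable is constant.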
One of the theoretical properties of correlation and covariance matrices is that they are positive definite. Consequently, if a given correlation matrix is not positive definite, it is not valid, and theoretical conclusions about the underlying data set may not be correct.
A matrix is "positive definite" if all of its eigenvalues are positive. From a practical perspective, it means that the variances
(eigenvalues) of the Principal Components (eigenvectors) must all be
strictly greater than zero. ShowCorrelation displays these
eigenvalues in its Principal Components Report.
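A quick offline check that does not need the eigenvalues is to attempt a Cholesky factorization, which succeeds only when a symmetric matrix is positive definite. This is a different technique from the eigenvalue report that ShowCorrelation produces, shown here only as a sketch; the function name is made up for illustration. It reads only the lower triangle of its input.

```python
import math


def is_positive_definite(matrix):
    """Return True if the Cholesky factorization of the matrix succeeds.

    Only the lower triangle (including the diagonal) is read.
    """
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: not positive definite
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


# The 4-variable example from this page.
example_1 = [
    [1.000, 0.992, 0.621, 0.465],
    [0.992, 1.000, 0.604, 0.446],
    [0.621, 0.604, 1.000, -0.177],
    [0.465, 0.446, -0.177, 1.000],
]
# The 7-variable example from this page (lower triangle, zeros above).
example_2 = [
    [1.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.141, 1.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.100, 0.456, 1.000, 0.000, 0.000, 0.000, 0.000],
    [0.976, -0.329, 0.688, 1.000, 0.000, 0.000, 0.000],
    [-0.787, -0.272, 0.485, -0.507, 1.000, 0.000, 0.000],
    [-0.235, 0.273, 0.061, -0.229, 0.255, 1.000, 0.000],
    [-0.631, 0.278, 0.269, -0.745, 0.643, 0.746, 1.000],
]
print(is_positive_definite(example_1))  # True
print(is_positive_definite(example_2))  # False
```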
Note that the correlation matrices we normally deal with in the "real world" are statistically estimated. The nature of this estimation is what presents problems.
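Some common reasons for correlations not being positive definite are data errors, small sample sizes or pair-by-pair estimation, constant variables, and exact linear relationships among variables.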
In these cases, ShowCorrelation issues a warning and continues with its computations.
For exploratory data analysis and heuristic clustering applications, it may not matter if all eigenvalues are positive: in any case, ShowCorrelation uses the first two eigenvalues to determine rankings, and in most correlations that we have seen, these first two eigenvalues are positive.
Are there any fixes besides re-sampling or re-estimating the data set? Some researchers recommend "tuning" the correlation matrix by gradually reducing the magnitude of the off-diagonal values until the matrix becomes positive definite. One heuristic method that "preserves the correlation structure" is to multiply all off-diagonal matrix elements by a constant (0.9, 0.8, 0.7...) until the matrix becomes positive definite.
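The heuristic can be sketched as a short loop. The version below is one reading of it: each candidate constant is applied to the original off-diagonal values (rather than compounding), and validity is tested with a Cholesky factorization instead of eigenvalues. Both choices, and the function names, are assumptions for illustration.

```python
import math


def is_positive_definite(matrix):
    """Return True if a Cholesky factorization of the lower triangle succeeds."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


def shrink_until_valid(matrix):
    """Scale off-diagonal entries by 1.0, 0.9, 0.8, ... until the matrix is valid.

    Returns (factor, scaled matrix). A factor of 1.0 means no change was needed.
    """
    n = len(matrix)
    for step in range(11):
        factor = round(1.0 - 0.1 * step, 1)
        scaled = [
            [matrix[i][j] * (1.0 if i == j else factor) for j in range(n)]
            for i in range(n)
        ]
        if is_positive_definite(scaled):
            return factor, scaled
    raise ValueError("diagonal entries must be positive")


# The 7-variable example from this page (lower triangle, zeros above).
example_2 = [
    [1.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.141, 1.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.100, 0.456, 1.000, 0.000, 0.000, 0.000, 0.000],
    [0.976, -0.329, 0.688, 1.000, 0.000, 0.000, 0.000],
    [-0.787, -0.272, 0.485, -0.507, 1.000, 0.000, 0.000],
    [-0.235, 0.273, 0.061, -0.229, 0.255, 1.000, 0.000],
    [-0.631, 0.278, 0.269, -0.745, 0.643, 0.746, 1.000],
]
factor, fixed = shrink_until_valid(example_2)
print("off-diagonal values scaled by", factor)
```

Because every off-diagonal value is scaled by the same constant, the relative sizes and signs of the correlations are unchanged; only their overall magnitude shrinks.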
Variance (Eigenvalue) and Principal Components Report
Note: The correlation matrix you submitted is not valid
It is not positive definite (Eigenvalues are not all strictly >0). Warning: Results may be questionable.
Continuing Principal Components Report anyway.
Principal Component | Variance (Eigenvalue) | Per Cent Variance | Accumulated Variance | Accumulated % Variance |
x1 | 3.4336 | 49.052 % | 3.4336 | 49.052 % |
x2 | 1.7292 | 24.703 % | 5.1629 | 73.755 % |
x3 | 1.3158 | 18.797 % | 6.4786 | 92.552 % |
x4 | 0.8958 | 12.797 % | 7.3744 | 105.348 % |
x5 | 0.1372 | 1.959 % | 7.5115 | 107.308 % |
x6 ** | -0.0329 | -0.470 % | 7.4786 | 106.838 % |
x7 ** | -0.4786 | -6.838 % | 7.0000 | 100.000 % |
** Warning. Your input correlation matrix is not valid (not positive definite).
All eigenvalues must be positive. ** These are not **
Principal Components (Row Vectors)
x1 | x2 | x3 | x4 | x5 | x6 ** | x7 ** |
-0.4916 | 0.1255 | -0.1741 | -0.3354 | -0.5034 | -0.5469 | -0.2175 |
0.1479 | 0.3289 | -0.6711 | 0.4698 | 0.0487 | -0.3048 | 0.3220 |
0.0612 | 0.7715 | 0.2110 | 0.2091 | -0.0089 | 0.1490 | -0.5390 |
-0.4676 | 0.4578 | 0.1834 | -0.2212 | 0.1263 | 0.2282 | 0.6490 |
0.4156 | 0.1260 | 0.5701 | -0.0082 | 0.0158 | -0.6429 | 0.2696 |
0.3057 | 0.1901 | -0.3250 | -0.7115 | 0.4843 | -0.0861 | -0.1284 |
0.4978 | 0.1393 | -0.1027 | -0.2603 | -0.7024 | 0.3361 | 0.2192 |
** Warning. Your input correlation matrix is not valid (not positive definite).
All eigenvalues must be positive. ** These Principal Components have non-positive eigenvalues **
Copying Tab-Delimited Tables to ShowCorrelation
How to Find or Compute a Correlation Matrix
Use ShowCorrelation on the outlined tables to see evidence of data clustering.
Why is My Correlation Matrix Invalid?
ShowCorrelation assumes that the input correlation you submit is valid (we do not compute correlations).
Reasons Why Your Correlation may be "Not Positive Definite"
Data errors may be due to typographical mistakes, round-off, or copying the wrong rows or columns.
The fewer the samples in your data set, the more likely these errors are to arise. Outliers in your samples could also skew the computation. A variation of this is the problem of computing a sample correlation pair by pair versus statistically estimating the correlation matrix as a whole (the so-called Polychoric Correlation problem). Most spreadsheet routines (e.g., Excel) estimate the correlation matrix pair by pair.
If the values of a variable are all constant, its variance is zero and its correlation with the other variables is undefined.
If the values of a set of variables have an exact linear relationship between them, the correlation matrix is singular: at least one eigenvalue is zero, so the matrix cannot be positive definite.
Does it Matter? Fixing the Correlation Matrix