Help ShowCorrelation - Correlation Visualization

ShowCorrelation

Visual Correlation

Help and Documentation

What ShowCorrelation Does
- Inputs and Outputs
- The meaning of the colors
- The Principal Components Report
ShowCorrelation Input: Tab-Delimited Tables
- Samples from Excel
- Sample from Text
How to Find or Compute a Correlation Matrix
- Computing Correlation in Excel
- Why is My Correlation Matrix Invalid? Does it Matter?
- Reasons Why Your Correlation may be "Not Positive Definite"
Sample Run: Using ShowCorrelation

What ShowCorrelation Does

ShowCorrelation creates a visual color-coded representation of a given correlation matrix. It also determines whether your correlation matrix is valid ("positive definite") and creates a Principal Components Report. It is server-based: you submit (via HTML form) the correlation as input. The color-coded representation is sorted by three different methods that let you easily see patterns and data groupings.

Data patterns and groupings are much harder to see from a table of numbers -- the usual representation of correlation -- than with the ShowCorrelation color-coded representation.

Here is an example of a 4 variable economic time series:

x1=GNP Implicit Price Deflator
x2=Gross National Product
x3=Unemployment
x4=Size of Armed Forces

The correlation of these 4 variables is shown in this table:

	x1	x2	x3	x4
x1	1.000
x2	0.992	1.000
x3	0.621	0.604	1.000
x4	0.465	0.446	-0.177	1.000

So, for example, the correlation between x1 and x3 is 0.621. Correlation values always range from -1 (perfect negative correlation) to +1 (perfect positive correlation). The correlation matrix is symmetric: the correlation between x1 and x3 is the same as the correlation between x3 and x1. Consequently, we usually show only the lower triangle.

The ShowCorrelation representations are as follows (legend of the ShowCorrelation color codes):

Not sorted.

	x1	x2	x3	x4
x1
x2
x3
x4

Sorted by Contribution of Major Principal Component

	x1	x2	x3	x4
x1
x2
x3
x4

Sorted by Angular order of First Two Principal Components

	x2	x1	x4	x3
x2
x1
x4
x3

The first representation (Not Sorted) lists the variables in the original order.
This is the legend of the ShowCorrelation color codes.

The next two representations sort the variables by the contribution of the Principal Components.

Note that the {x1,x2} grouping (={GNP Implicit Price Deflator, Gross National Product}) and to a lesser extent, {x3, x1, x2} can be easily seen.

The other output is the Principal Components Report:

Variance (Eigenvalue) and Principal Components Report

Principal Component	Variance (Eigenvalue)	Per Cent Variance	Accumulated Variance	Accumulated % Variance
x1	2.6375	65.938 %	2.6375	65.938 %
x2	1.1710	29.275 %	3.8085	95.213 %
x3	0.1848	4.621 %	3.9934	99.834 %
x4	0.0066	0.166 %	4.0000	100.000 %

This report shows how the variances (the "eigenvalues" of the Correlation Matrix) are distributed among the Principal Components. In this example, 95% of the variance of the original 4 variables is contained in the first 2 Principal Components. This provides additional insight into the Visual Correlation displays.
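The percentage columns of the report can be reproduced with simple arithmetic. The sketch below is only an illustration, not ShowCorrelation's own code: it copies the four eigenvalues from the table above and assumes that each percentage is the eigenvalue divided by the number of variables (the eigenvalues of an n-variable correlation matrix sum to n). The function name `variance_report` is hypothetical.

```python
# Eigenvalues copied from the 4-variable report above.
eigenvalues = [2.6375, 1.1710, 0.1848, 0.0066]


def variance_report(eigs):
    """Return (eigenvalue, percent, accumulated, accumulated percent) rows."""
    # Assumption: percentages are taken relative to n, the number of
    # variables, because a correlation matrix has n ones on its diagonal.
    total = len(eigs)
    rows = []
    running = 0.0
    for ev in eigs:
        running += ev
        rows.append((ev, 100.0 * ev / total, running, 100.0 * running / total))
    return rows


report = variance_report(eigenvalues)
for number, (ev, pct, acc, acc_pct) in enumerate(report, start=1):
    print(f"PC{number}\t{ev:.4f}\t{pct:.3f} %\t{acc:.4f}\t{acc_pct:.3f} %")
```

The second row of `report` carries the accumulated percentage for the first two Principal Components, which is the 95% figure quoted above.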

The second part of the Principal Components Report lists all Principal Components (the "eigenvectors" of the Correlation Matrix):

Principal Components (Row Vectors)

x1	x2	x3	x4
0.6092	-0.0287	-0.2999	-0.7336
0.6044	-0.0250	-0.4219	0.6754
0.4224	0.6192	0.6595	0.0569
0.2919	-0.7843	0.5451	0.0503

The Principal Components transformation is given by the columns of this matrix M. To transform a vector
x = (x1, x2, x3, x4)
in the original data set to a Principal Components representation y, (matrix) multiply it by M:
y = x M
Note: You can use fewer columns of M in this matrix multiplication; the number of columns corresponds to the number of principal components you require. For example, let M* be the 4-row x 2-column submatrix of M made up of the columns (x1 x2):

x1	x2
0.6092	-0.0287
0.6044	-0.0250
0.4224	0.6192
0.2919	-0.7843

If we perform the matrix multiplication transformation
y = x M*
(where x is in the original 4-dimensional space and y is in the 2-dimensional transformed space corresponding to the first 2 Principal Components), then the new 2-dimensional vectors y retain 95% of the variance of the original 4-dimensional vectors x.
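As a minimal sketch of the multiplication y = x M*, the code below uses the matrix M from the report and keeps only its first two columns. The observation `x` is made up for illustration, and the code assumes that x has already been standardized (each variable shifted to mean 0 and scaled to variance 1), which is the usual convention when the components come from a correlation matrix. The helper name `project` is hypothetical.

```python
# Matrix M copied from the Principal Components report above:
# one row per original variable, one column per Principal Component.
M = [
    [0.6092, -0.0287, -0.2999, -0.7336],
    [0.6044, -0.0250, -0.4219, 0.6754],
    [0.4224, 0.6192, 0.6595, 0.0569],
    [0.2919, -0.7843, 0.5451, 0.0503],
]


def project(x, matrix, k):
    """Compute y = x M*, where M* is the first k columns of matrix."""
    return [
        sum(x[i] * matrix[i][j] for i in range(len(x)))
        for j in range(k)
    ]


# Hypothetical standardized observation (not taken from the data set).
x = [1.2, 0.9, -0.4, 0.3]
y = project(x, M, 2)
print(y)  # a 2-dimensional vector in Principal Components coordinates
```

Passing `k=4` uses all of M and gives the full 4-dimensional Principal Components representation.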

Here is another example. The correlation matrix is (with 0 in the upper triangle):

r1	r2	r3	r4	r5	r6	r7
1.000	0.000	0.000	0.000	0.000	0.000	0.000
-0.141	1.000	0.000	0.000	0.000	0.000	0.000
-0.100	0.456	1.000	0.000	0.000	0.000	0.000
0.976	-0.329	0.688	1.000	0.000	0.000	0.000
-0.787	-0.272	0.485	-0.507	1.000	0.000	0.000
-0.235	0.273	0.061	-0.229	0.255	1.000	0.000
-0.631	0.278	0.269	-0.745	0.643	0.746	1.000

The ShowCorrelation output is:

Visual Correlation

Not sorted.

	r1	r2	r3	r4	r5	r6	r7
r1
r2
r3
r4
r5
r6
r7

Note: The correlation matrix you submitted is not valid

It is not positive definite (Eigenvalues are not all strictly >0). Warning: Results may be questionable.
Continuing Visual Correlation Report anyway

This Correlation does not have the theoretical property of being positive definite. Some reasons for this are discussed below.

The output continues...

Visual Correlation

Sort by Contribution of Major Principal Component

	r7	r5	r6	r2	r3	r4	r1
r7
r5
r6
r2
r3
r4
r1

These colors reveal two variable groups: {r7, r5, r6} and {r1, r4}.

Visual Correlation

Sort by Angular order of First Two Principal Components

	r1	r4	r3	r2	r6	r5	r7
r1
r4
r3
r2
r6
r5
r7

On visual inspection, these colors reveal two variable groups: {r1, r4} and {r7, r5}.

The Principal Components Report explicitly shows the reason that this Correlation Matrix is not valid: Principal Component vectors #6 and #7 have negative variances (eigenvalues). The Report continues...

Variance (Eigenvalue) and Principal Components Report

Note: The correlation matrix you submitted is not valid

It is not positive definite (Eigenvalues are not all strictly >0). Warning: Results may be questionable.
Continuing Principal Components Report anyway.

Principal Component	Variance (Eigenvalue)	Per Cent Variance	Accumulated Variance	Accumulated % Variance
x1	3.4336	49.052 %	3.4336	49.052 %
x2	1.7292	24.703 %	5.1629	73.755 %
x3	1.3158	18.797 %	6.4786	92.552 %
x4	0.8958	12.797 %	7.3744	105.348 %
x5	0.1372	1.959 %	7.5115	107.308 %
x6 **	-0.0329	-0.470 %	7.4786	106.838 %
x7 **	-0.4786	-6.838 %	7.0000	100.000 %

** Warning. Your input correlation matrix is not valid (not positive definite).
All eigenvalues must be positive. ** These are not **

Principal Components (Row Vectors)

x1	x2	x3	x4	x5	x6 **	x7 **
-0.4916	0.1255	-0.1741	-0.3354	-0.5034	-0.5469	-0.2175
0.1479	0.3289	-0.6711	0.4698	0.0487	-0.3048	0.3220
0.0612	0.7715	0.2110	0.2091	-0.0089	0.1490	-0.5390
-0.4676	0.4578	0.1834	-0.2212	0.1263	0.2282	0.6490
0.4156	0.1260	0.5701	-0.0082	0.0158	-0.6429	0.2696
0.3057	0.1901	-0.3250	-0.7115	0.4843	-0.0861	-0.1284
0.4978	0.1393	-0.1027	-0.2603	-0.7024	0.3361	0.2192

** Warning. Your input correlation matrix is not valid (not positive definite).
All eigenvalues must be positive. ** These Principal Components have non-positive eigenvalues **
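The two sorted variable orders shown earlier can be reproduced from the first two columns of the Principal Components table, reading each row as one variable (r1 to r7). The sketch below is an assumption about the ordering rules, since the sorting procedure is not spelled out here: "contribution" is taken to be the first-column value in descending order, and "angular order" is taken to be the angle of the point (first column, second column), measured from 0 to 360 degrees, in descending order. These two rules happen to match both examples in this document; the function names are hypothetical.

```python
import math

labels = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
# First and second columns of the 7-variable Principal Components table.
pc1 = [-0.4916, 0.1479, 0.0612, -0.4676, 0.4156, 0.3057, 0.4978]
pc2 = [0.1255, 0.3289, 0.7715, 0.4578, 0.1260, 0.1901, 0.1393]


def by_contribution(names, first):
    """Order variables by their first-component value, largest first."""
    order = sorted(range(len(names)), key=lambda i: first[i], reverse=True)
    return [names[i] for i in order]


def by_angle(names, first, second):
    """Order variables by angle in the plane of the first two components."""
    def angle(i):
        # atan2 returns a value in (-pi, pi]; the modulo maps it to [0, 2*pi).
        return math.atan2(second[i], first[i]) % (2.0 * math.pi)

    order = sorted(range(len(names)), key=angle, reverse=True)
    return [names[i] for i in order]


print(by_contribution(labels, pc1))
print(by_angle(labels, pc1, pc2))
```

Variables that end up next to each other in the angular order point in similar directions in the plane of the first two Principal Components, which is why groups such as {r1, r4} appear as adjacent blocks.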

Copying Tab-Delimited Tables to ShowCorrelation

There are two inputs to ShowCorrelation. The first is an input box that asks for the number of variables; the second is a text area box.

In the text area box you paste a table: the first row of the table contains variable labels, the remainder of the table contains the Correlation Matrix.

The table must be in tab-delimited format: the columns are separated by tabs and the rows are separated by carriage returns. When you copy a table from an Excel spreadsheet to a text area box in the ShowTheBest dialog, it will automatically be in tab-delimited text.

Here is an example of the ShowCorrelation input page:

In this sample problem, there are 7 variables, labeled r1, r2,...,r7. Their correlation matrix was already pasted (and is shown) in the text area box. The Grab Table button (not activated here) grabs the submitted table and verifies that it is in the correct form.

Here is a text file containing the table in tab-delimited text.

Here is a Microsoft Excel file containing the table in tab-delimited text.

Note: You only need to fill in the lower triangle of the Correlation Matrix. The upper triangle can be set to zero or any other values. ShowCorrelation reads only the lower triangle of the submitted tab-delimited table.
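To make the input format concrete, here is a minimal sketch of how such a table could be read: the first row holds the labels, each following row holds one row of the matrix, and only the lower triangle is used. This is not ShowCorrelation's own parser; the function name `parse_correlation_table` and its error messages are hypothetical, and the sample uses the first three variables of the economic example.

```python
def parse_correlation_table(text):
    """Parse a tab-delimited table: a label row, then one matrix row per label."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    labels = [cell.strip() for cell in rows[0]]
    n = len(labels)
    if len(rows) - 1 != n:
        raise ValueError(f"expected {n} matrix rows, found {len(rows) - 1}")
    matrix = [[0.0] * n for _ in range(n)]
    for i, cells in enumerate(rows[1:]):
        if len(cells) < i + 1:
            raise ValueError(f"row {i + 1} is missing lower-triangle values")
        # Read only the lower triangle (columns 0..i) and mirror it,
        # so whatever sits in the upper triangle is ignored.
        for j in range(i + 1):
            value = float(cells[j])
            matrix[i][j] = value
            matrix[j][i] = value
    return labels, matrix


sample = (
    "x1\tx2\tx3\n"
    "1.000\t0.000\t0.000\n"
    "0.992\t1.000\t0.000\n"
    "0.621\t0.604\t1.000\n"
)
labels, matrix = parse_correlation_table(sample)
print(labels)
for row in matrix:
    print(row)
```

Because the lower triangle is mirrored, the returned matrix is symmetric regardless of what the upper triangle contained.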

If you want to try a sample run, click here .

How to Find or Compute a Correlation Matrix

You can find many correlation tables in published research on the web. This Excel workbook has several correlation tables (with web page reference links) showing applications in:

economics
biology
finance
politics
health care

Use ShowCorrelation on the outlined tables to see evidence of data clustering.

To compute your own correlation matrix, use the Microsoft Excel Analysis ToolPak Add-In (it comes with most versions of Excel). To install it, run Excel. On the Tools menubar, select

Tools=>Add-Ins...=>Analysis ToolPak

Now, to run the Correlation Utility: suppose you have a table of variables in an Excel worksheet with the first row containing variable labels. On the Tools menubar, select

Tools=>Data Analysis...=>Correlation

In the dialog, select the table as the input range for your data, check Labels in First Row, and select an output range on your worksheet for the correlation matrix. Excel will place the Correlation Matrix there. Then you can copy it to ShowCorrelation.
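If your data is not in Excel, the same kind of table can be produced with a short script. The sketch below is an alternative to the Excel route, not part of ShowCorrelation: it computes the Pearson correlation pair by pair and writes a tab-delimited table with labels in the first row and zeros in the upper triangle. The data values, labels, and function names are made up for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # A constant column has zero variance, so this division fails for it.
    return cov / math.sqrt(var_a * var_b)


def correlation_table(labels, columns):
    """Build tab-delimited text: label row, then lower-triangle matrix rows."""
    n = len(columns)
    lines = ["\t".join(labels)]
    for i in range(n):
        cells = [
            pearson(columns[i], columns[j]) if j <= i else 0.0
            for j in range(n)
        ]
        lines.append("\t".join(f"{value:.3f}" for value in cells))
    return "\n".join(lines)


# Hypothetical data: three variables observed five times each.
labels = ["a", "b", "c"]
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
table = correlation_table(labels, columns)
print(table)
```

The printed text can be pasted into the text area box as is, since rows are separated by line breaks and columns by tabs.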

Why is My Correlation Matrix Invalid?
Reasons Why Your Correlation may be "Not Positive Definite"

ShowCorrelation assumes that the input correlation you submit is valid (we do not compute correlations).

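A theoretical property of a correlation matrix (and of a covariance matrix) computed from a complete data set is that it has no negative eigenvalues; when no variable is constant or an exact linear combination of the others, it is positive definite. Consequently, if a given correlation matrix is not positive definite, then it is not valid: theoretical conclusions about the underlying data set may not be correct.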

A matrix is "positive definite" if all of its eigenvalues are positive. From a practical perspective, it means that the variances (eigenvalues) of the Principal Components (eigenvectors) must all be strictly greater than zero. ShowCorrelation displays these eigenvalues in its Principal Components Report.
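One way to test this property yourself, without computing eigenvalues, is to attempt a Cholesky factorization: for a symmetric matrix it succeeds exactly when the matrix is positive definite. The sketch below uses that technique as a stand-in; it is not how ShowCorrelation performs its check (ShowCorrelation reports the eigenvalues themselves). The function name is hypothetical, and the two matrices are the examples from this page with zeros in the upper triangle.

```python
import math


def is_positive_definite(matrix):
    """Return True if a Cholesky factorization of the matrix succeeds.

    Only the lower triangle of the input is read.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    return False  # a non-positive pivot: not positive definite
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


economic = [
    [1.000, 0.000, 0.000, 0.000],
    [0.992, 1.000, 0.000, 0.000],
    [0.621, 0.604, 1.000, 0.000],
    [0.465, 0.446, -0.177, 1.000],
]
seven = [
    [1.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.141, 1.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.100, 0.456, 1.000, 0.000, 0.000, 0.000, 0.000],
    [0.976, -0.329, 0.688, 1.000, 0.000, 0.000, 0.000],
    [-0.787, -0.272, 0.485, -0.507, 1.000, 0.000, 0.000],
    [-0.235, 0.273, 0.061, -0.229, 0.255, 1.000, 0.000],
    [-0.631, 0.278, 0.269, -0.745, 0.643, 0.746, 1.000],
]
print(is_positive_definite(economic))
print(is_positive_definite(seven))
```

The test only answers yes or no; to see which Principal Components are responsible, look at the eigenvalues in the Principal Components Report.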

Note that the correlation matrices we normally deal with in the "real world" are statistically estimated. The nature of this estimation is what presents problems.

Some common reasons for correlations not being positive definite are:

Data Input Errors
Data errors may be due to typographical mistakes, round-off, or copying the wrong rows or columns.
Sampling Errors
The fewer the samples in your data set, the more likely these errors are to arise. Outliers in your samples can also skew the computation. A variation of this is the problem of computing a sample correlation pair by pair versus statistically estimating the correlation matrix as a whole (the so-called Polychoric Correlation problem). Most spreadsheet routines (e.g., Excel) estimate the correlation matrix by a pair-by-pair computation.
No variance in a variable
If the values of a variable are all constant, its variance is zero and its correlation with any other variable is undefined (the computation divides by zero).
Perfect linear relations between variables
If a set of variables has an exact linear relationship, the correlation matrix is singular: at least one eigenvalue is zero, so the matrix is not strictly positive definite.

In these cases, ShowCorrelation issues a warning and continues with its computations.

Does it Matter? Fixing the Correlation Matrix

For exploratory data analysis and heuristic clustering applications, it may not matter if all eigenvalues are positive: in any case, ShowCorrelation uses the first two eigenvalues to determine rankings, and in most correlations that we have seen, these first two eigenvalues are positive.

Are there any fixes besides re-sampling or re-estimating the data set? Some researchers recommend "tuning" the correlation matrix by gradually reducing the magnitude of the off-diagonal values until the matrix becomes positive definite. One heuristic method that "preserves the correlation structure" is to multiply all off-diagonal matrix elements by a constant (0.9, 0.8, 0.7...) until the matrix becomes positive definite.
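The multiply-by-a-constant heuristic can be sketched as follows, using the 7-variable matrix from the earlier example. This is an illustration under stated assumptions rather than a recommended procedure: the constant is stepped down in tenths, positive definiteness is tested with a Cholesky factorization (a substitute for inspecting eigenvalues), and the function names are hypothetical.

```python
import math


def is_positive_definite(matrix):
    """Return True if a Cholesky factorization succeeds (lower triangle only)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


def shrink_until_valid(matrix):
    """Scale off-diagonal entries by 1.0, 0.9, 0.8, ... until positive definite."""
    n = len(matrix)
    for tenths in range(10, 0, -1):
        factor = tenths / 10
        scaled = [
            [matrix[i][j] if i == j else factor * matrix[i][j] for j in range(n)]
            for i in range(n)
        ]
        if is_positive_definite(scaled):
            return factor, scaled
    return None, None  # no factor down to 0.1 produced a valid matrix


seven = [
    [1.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.141, 1.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [-0.100, 0.456, 1.000, 0.000, 0.000, 0.000, 0.000],
    [0.976, -0.329, 0.688, 1.000, 0.000, 0.000, 0.000],
    [-0.787, -0.272, 0.485, -0.507, 1.000, 0.000, 0.000],
    [-0.235, 0.273, 0.061, -0.229, 0.255, 1.000, 0.000],
    [-0.631, 0.278, 0.269, -0.745, 0.643, 0.746, 1.000],
]
factor, tuned = shrink_until_valid(seven)
print(factor)
```

Scaling every off-diagonal entry by the same constant c changes each eigenvalue from v to c*v + (1 - c), so the eigenvectors, and therefore the variable groupings, stay the same while the negative eigenvalues are pulled up toward positive values.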
