Overview of SiDCo

SiDCo (SIgned Distance COrrelation) calculates pairwise distance correlation coefficients between all columns of a .xlsx datasheet. The primary use of SiDCo is in metabolomics and lipidomics, although the site applies signed distance correlation and partial distance correlation to any dataset.


The main advantage of distance correlation is the ability to quantify linear and non-linear correlations simultaneously, while allowing for comparisons of matrices of different dimensions through the calculation of distance covariances. Due to this unique ability, distance correlation can be used to calculate one-to-all (linear and non-linear correlations between each feature and all the other features) or one-to-one correlations (pairwise correlations between individual features). Both options are available in SiDCo.
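As an illustration of the idea above, here is a minimal sketch of the standard sample distance correlation, computed from double-centered Euclidean distance matrices. It is not taken from the SiDCo code base; the function names `_centered_distances` and `distance_correlation` are hypothetical, and NumPy is assumed to be available. Because the calculation only needs the sample-by-sample distance matrices, the two inputs may have different numbers of columns, which is what makes the one-to-all comparison possible.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered Euclidean distance matrix between the rows (samples) of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # a single feature becomes an (n_samples, 1) matrix
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between x and y (same samples, any widths)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2_xy = (a * b).mean()  # squared distance covariance
    dvar2_x = (a * a).mean()   # squared distance variance of x
    dvar2_y = (b * b).mean()   # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0             # at least one input is constant
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


# A purely non-linear relationship: Pearson is ~0, distance correlation is not.
x = np.linspace(-1.0, 1.0, 50)
y = x ** 2
print(np.corrcoef(x, y)[0, 1], distance_correlation(x, y))
```

For a one-to-all comparison under the same assumptions, pass one column as `x` and all remaining columns as `y`, for example `distance_correlation(data[:, 0], data[:, 1:])`.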


SiDCo capitalizes on the Gaussian Graphical Model (GGM) method to determine pairwise associations while removing the confounding effects of other variables. The GGM method calculates the inverse of the distance covariance to remove orthogonal contributions without any matrix shrinkage. If the distance covariance matrix is singular, this inverse is calculated using the Moore–Penrose inverse A⁺ of the matrix.
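The inversion step can be sketched with the textbook GGM formula, in which the partial correlation between features i and j is the negated, rescaled off-diagonal element of the inverse matrix. This is a generic illustration, not SiDCo's own code: the function name `partial_correlations` is hypothetical, and whether SiDCo rescales the inverse in exactly this way is an assumption. `numpy.linalg.pinv` returns the Moore–Penrose inverse, which coincides with the ordinary inverse when the matrix is non-singular.

```python
import numpy as np


def partial_correlations(cov):
    """Partial correlations from a covariance-like matrix via its (pseudo-)inverse."""
    cov = np.asarray(cov, dtype=float)
    precision = np.linalg.pinv(cov)        # Moore-Penrose inverse, no shrinkage
    scale = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(scale, scale)
    np.fill_diagonal(partial, 1.0)
    return partial


# Covariance of a chain X -> Y -> Z: X and Z are correlated only through Y,
# so their partial correlation (entry [0, 2]) vanishes.
chain_cov = np.array([[1.0, 1.0, 1.0],
                      [1.0, 2.0, 2.0],
                      [1.0, 2.0, 3.0]])
print(np.round(partial_correlations(chain_cov), 3))
```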


SiDCo is implemented in Python and R Shiny. Two analytical tabs allow users to choose between signed distance correlation or partial distance correlation:

Distance Correlation:

  • Calculates the distance correlation and p-value either between each feature and all the other features combined (n_i versus all n_j with j ≠ i taken together), that is, one-to-all (correlation with the network), or between individual pairs of features (n_i versus each n_j with j ≠ i), that is, one-to-one.
  • For one-to-one pairwise comparisons, SiDCo additionally attaches the sign of the Pearson correlation to the distance correlation value. The sign (positive or negative) indicates only the overall linear trend given by the Pearson correlation; it does not indicate a significant linear correlation.
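The signing step described above can be sketched as follows. The helper `apply_pearson_sign` is a hypothetical name, and it takes an already computed distance correlation value as input; how SiDCo handles a Pearson correlation of exactly zero is not stated, so treating it as positive here is an assumption.

```python
import numpy as np


def apply_pearson_sign(dcor_value, x, y):
    """Attach the sign of the Pearson correlation of x and y to a distance correlation."""
    r = np.corrcoef(x, y)[0, 1]
    # Assumption of this sketch: a zero (or undefined) Pearson value keeps
    # the distance correlation positive.
    return -abs(dcor_value) if r < 0 else abs(dcor_value)


# A decreasing trend gives a negative signed distance correlation.
print(apply_pearson_sign(0.9, [1, 2, 3, 4], [4, 3, 2, 1]))
```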

Outputs are:

  • For the one-to-all comparison, the output is an Excel file (.xlsx) that includes the distance correlation value for each feature to all the other features and the corresponding p-value for this calculation.
  • For the one-to-one calculation, the output is an Excel file (.xlsx) that includes the signed distance correlation values and corresponding p-values as well as the Pearson and Spearman correlation values with their corresponding p-values. Distance correlation values are set to zero if their absolute value is below the user-defined threshold value or their corresponding p-value is above the user-defined p-value.
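The thresholding rule for the one-to-one output can be written as a small sketch. The function and parameter names (`apply_thresholds`, `min_abs_value`, `max_p_value`) are hypothetical; only the rule itself comes from the description above.

```python
import numpy as np


def apply_thresholds(values, p_values, min_abs_value=0.0, max_p_value=1.0):
    """Zero out values whose magnitude is too small or whose p-value is too large."""
    values = np.asarray(values, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    keep = (np.abs(values) >= min_abs_value) & (p_values <= max_p_value)
    return np.where(keep, values, 0.0)


signed_dcor = [0.8, -0.3, 0.6]
p_vals = [0.01, 0.01, 0.20]
# -0.3 fails the magnitude threshold and 0.6 fails the p-value threshold.
print(apply_thresholds(signed_dcor, p_vals, min_abs_value=0.5, max_p_value=0.05))
```

With the defaults (threshold 0 and p-value 1) every value is kept.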

Partial Distance Correlation:

  • Partial distance correlation is calculated using the Gaussian Graphical Model (GGM), with p-values determined from the cumulative normal distribution function of Fisher z-transformed correlations.
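A minimal sketch of a p-value obtained from a Fisher z-transformed correlation is shown below. It uses the textbook form for partial correlations; the function name `fisher_z_p_value`, the two-sided test, and the degrees-of-freedom correction (samples minus conditioned variables minus 3) are assumptions and may differ from what SiDCo uses internally.

```python
import math


def fisher_z_p_value(r, n_samples, n_conditioned=0):
    """Two-sided p-value for a (partial) correlation r via the Fisher z-transform."""
    dof = n_samples - n_conditioned - 3
    if dof <= 0:
        raise ValueError("not enough samples for the Fisher z approximation")
    r = max(min(r, 0.999999), -0.999999)   # keep atanh finite at |r| = 1
    z = math.atanh(r) * math.sqrt(dof)     # approximately standard normal under H0
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


print(fisher_z_p_value(0.5, n_samples=10))
```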

Outputs are:

  • An Excel file (.xlsx) with partial distance correlation values and their corresponding p-values.

In both cases, data are preprocessed by z-score normalization of features across all samples. Any missing values must be imputed by the user prior to analysis or SiDCo will not function.
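The preprocessing step can be sketched as below. The function name `zscore_features` is hypothetical, and the use of the sample standard deviation (`ddof=1`) and the rejection of constant features are choices of this sketch, not documented SiDCo behaviour.

```python
import numpy as np


def zscore_features(data):
    """Z-score each feature (column) across all samples (rows)."""
    data = np.asarray(data, dtype=float)
    if np.isnan(data).any():
        raise ValueError("missing values found; impute them before analysis")
    std = data.std(axis=0, ddof=1)         # assumption: sample standard deviation
    if np.any(std == 0):
        raise ValueError("constant feature; a z-score is undefined")
    return (data - data.mean(axis=0)) / std


z = zscore_features([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(z)
```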


The running time of the distance calculation typically scales as N², where N is the sample size. For typical metabolomics and lipidomics datasets (~500 x 500), both the distance correlation and partial distance correlation calculations take less than 2 minutes. For very large datasets of more than one million elements, calculations can be time-consuming.



Preparing your data for SiDCo

SiDCo calculates distance correlation between features listed in columns using data across rows. The SiDCo input must be a single .xlsx file with features (for example, metabolites or lipids) in columns and samples in rows. The file should contain column names in the top row and row names in the first column (column A). Additional information can be included, and the user can specify the start row and column as well as the stop row. All numeric data should be below and to the right of the specified start column and start row. If there is any non-numeric data to the right of the specified start column, the analysis will abort. Because distance correlation calculations cannot work with data that have missing values, users should impute missing data with the method most appropriate for their dataset prior to using SiDCo.
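Before uploading, the data block can be checked for the two conditions named above, non-numeric cells and missing values. The sketch below is a hypothetical pre-flight helper (`find_bad_cells`), not part of SiDCo. It works on plain rows of cell values taken from the region at and after the chosen start row and start column, so the reported indices are relative to that region.

```python
import math
from numbers import Real


def find_bad_cells(rows):
    """Return (row, column, value) for every cell that is missing or non-numeric."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            is_number = isinstance(value, Real) and not isinstance(value, bool)
            if not is_number or math.isnan(value):
                bad.append((i, j, value))
    return bad


block = [[1.0, None],
         ["n/a", 2.0]]
print(find_bad_cells(block))
```

An empty list means the block contains only numbers. Rows in this form can be obtained from any Excel reader; how the file is read is left open here.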


Sample Data

The sample dataset is provided in the allowed input format (.xlsx) with features in columns and samples in rows. Note that column A contains group names and row 1 contains feature names. To calculate distance correlations separately for each group, the user should enter, for Group 1: Start Column: B; First Row: 2 (or -1, indicating the first data row); Last Row: 31. For Group 2, the user should enter: Start Column: B; First Row: 32; Last Row: 46 (or -1, indicating the last row).
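To make the row and column settings concrete, the sketch below converts them into zero-based Python indices. Both helpers (`column_letter_to_index`, `row_slice`) are hypothetical and only mirror the conventions described above; the default of row 2 as the first data row assumes a single header row, as in the sample dataset.

```python
def column_letter_to_index(letters):
    """Convert an Excel column label ("A", "B", ..., "AA") to a zero-based index."""
    letters = letters.strip().upper()
    if not letters:
        raise ValueError("empty column label")
    index = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def row_slice(first_row, last_row, first_data_row=2):
    """Turn 1-based inclusive Excel rows into a slice; -1 means first data row / last row."""
    start = (first_data_row if first_row == -1 else first_row) - 1
    stop = None if last_row == -1 else last_row
    return slice(start, stop)


sheet_rows = list(range(1, 47))             # stand-in for Excel rows 1..46
group1 = sheet_rows[row_slice(2, 31)]       # Group 1: rows 2 to 31
group2 = sheet_rows[row_slice(32, -1)]      # Group 2: row 32 to the last row
print(column_letter_to_index("B"), len(group1), len(group2))
```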


Troubleshooting SiDCo

When troubleshooting, please review this list of common reasons for SiDCo failing to run. If you are still experiencing difficulties running our tool, please contact ldomic@uottawa.ca for further assistance. Please include your input dataset and a description of the problem you experienced. We will reproduce the problem and provide you with a solution.


1. My file loads but does not output any analysis.

SiDCo only accepts Excel (.xlsx) files as input. Data can start from any row and column; however, all cells to the right of and below the user-defined start point must contain numeric values. Make sure that there are no missing data in your input, as they will prevent SiDCo calculations.


2. All obtained values are zero.

Revise your tolerance settings, i.e., the distance correlation threshold and the p-value threshold. SiDCo sets to zero any value that is below the correlation threshold or whose p-value is above the specified p-value. If you prefer to see all values, enter 0 for the distance correlation threshold and 1 for the p-value threshold.


3. Why are there no negative values for one-to-all distance correlations?

As Pearson correlation cannot be calculated for vectors of different lengths, it is not possible to determine a linear sign in the one-to-all distance calculation.


4. Why are there no Pearson and Spearman values for one-to-all correlation?

Pearson and Spearman correlations cannot be calculated for the one-to-all set and thus cannot be included in this output.


5. I get results from the distance correlation but not from the partial distance correlation tab.

Partial distance correlation analysis is based on the inversion of the distance covariance matrix. If this matrix is singular, inversion is not possible. This occurs when the input has fewer samples than features or if there are some features that can only be obtained as a linear combination of other features in the dataset. To address these issues, add more sample measurements or reduce the number of features in your partial distance correlation analysis.
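The two conditions named above can be checked on the raw data before running the partial distance correlation tab. The sketch below is a hypothetical diagnostic (`singularity_report`) that inspects the rank of the mean-centered data matrix; it does not compute the distance covariance matrix itself, so it is only a rough indicator.

```python
import numpy as np


def singularity_report(data):
    """Flag too few samples or linearly dependent features in a samples x features array."""
    data = np.asarray(data, dtype=float)
    n_samples, n_features = data.shape
    rank = int(np.linalg.matrix_rank(data - data.mean(axis=0)))
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "rank": rank,
        "likely_singular": bool(rank < n_features),
    }


rng = np.random.default_rng(0)
base = rng.normal(size=(20, 2))
# The third feature is the sum of the first two, i.e. a linear combination.
dependent = np.column_stack([base, base[:, 0] + base[:, 1]])
print(singularity_report(dependent))
```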


Contact us

ldomic@uottawa.ca


Cite the use of SiDCo in a publication

Monti F, Stewart D, Surendra A, Alecu I, Nguyen-Tran T, Bennett SAL, Cuperlovic-Culf M: Signed Distance Correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis. Bioinformatics 2023, 39(5), doi: 10.1093/bioinformatics/btad210


Public Server

SiDCo: https://complimet.ca/SiDCo/


Software License

SiDCo is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License v3 (or later versions) as published by the Free Software Foundation. As per the GNU General Public License, SiDCo is distributed as a bioinformatic tool to assist users WITHOUT ANY WARRANTY and without any implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. All limitations of warranty are indicated in the GNU General Public License.