Overview of SiDCo


SiDCo (SIgned Distance COrrelation) calculates pairwise distance correlation coefficients between all columns of a .xlsx datasheet. The primary use of SiDCo is in metabolomics and lipidomics, although this site allows signed distance correlation and partial distance correlation to be applied to any dataset.

The main advantage of distance correlation is the ability to quantify linear and non-linear correlations simultaneously, while allowing for comparisons of matrices of different dimensions through the calculation of distance covariances. Due to this unique ability, distance correlation can be used to calculate one-to-all (linear and non-linear correlations between each feature and all the other features) or one-to-one correlations (pairwise correlations between individual features). Both options are available in SiDCo.
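As an illustration of the paragraph above, here is a minimal NumPy sketch of the sample distance correlation (the standard biased estimator based on double-centered distance matrices). It is not SiDCo's own code; the function names `_centered_distances` and `distance_correlation` are hypothetical and chosen for this example only.

```python
import numpy as np

def _centered_distances(a):
    """Pairwise Euclidean distances between the rows of `a`, double-centered."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # a vector is treated as N samples of a single feature
    diff = a[:, None, :] - a[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Subtract row and column means and add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between x (N x p) and y (N x q).

    p and q may differ; only the number of samples N must match.
    """
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2_xy = (A * B).mean()  # squared distance covariance
    dcov2_xx = (A * A).mean()  # squared distance variance of x
    dcov2_yy = (B * B).mean()  # squared distance variance of y
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))
```

Passing a single column as `x` and a matrix of all remaining columns as `y` corresponds to the one-to-all case; passing two single columns corresponds to the one-to-one case. For a symmetric quadratic relationship such as y = x², the Pearson correlation is close to zero while the distance correlation is clearly positive.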


SiDCo capitalizes on the Gaussian Graphical Model (GGM) method to determine pairwise associations while removing the confounding effects of other variables. The GGM method calculates the inverse of the distance covariance matrix to remove orthogonal contributions without any matrix shrinkage. If the distance covariance matrix is singular, this inverse is calculated using the Moore–Penrose pseudoinverse.
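The GGM step can be sketched as follows, assuming the textbook relation between a precision (inverse covariance) matrix P and partial correlations, -P_ij / sqrt(P_ii P_jj). Whether SiDCo applies exactly this normalization to the distance covariance matrix is an assumption; the function name `partial_correlations` is hypothetical.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlations from a square covariance-like matrix.

    Uses the Moore-Penrose pseudoinverse, which coincides with the
    ordinary inverse whenever `cov` is non-singular.
    """
    cov = np.asarray(cov, dtype=float)
    precision = np.linalg.pinv(cov)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor
```

For a chain in which the first variable influences the third only through the second, the partial correlation between the first and third variables is zero even though their ordinary correlation is not.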


SiDCo is implemented in Python with an RShiny front-end. Two analytical tabs allow users to choose between signed distance correlation and partial distance correlation:


Tab dCor:

  • Calculates either the distance correlation and p-value between each feature and all the other features combined (n_i to n_j, ∀ j ≠ i), i.e., one-to-all (correlation with the network), or pairwise distance correlations between individual features (n_i to each n_j, j ≠ i), i.e., one-to-one comparisons.
  • For one-to-one pairwise comparisons, SiDCo additionally assigns the sign of the Pearson correlation to the distance correlation. The sign (positive or negative) only indicates the overall linear trend. The sign does not imply a significant linear correlation.

    Outputs are:
    - For the one-to-all comparison, the output is an Excel file that includes the distance correlation value for each feature to all the other features and the corresponding p-value for this calculation.

    - For the one-to-one calculation, the output is an Excel file that includes the signed distance correlation values and corresponding p-values as well as the Pearson and Spearman correlation values with their corresponding p-values. Distance correlation values are set to zero if their absolute value is below the user-defined threshold value or their corresponding p-value is above the user-defined p-value.
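The signing and thresholding described for the one-to-one output can be sketched as below. The sketch assumes that a matrix of distance correlations and a matrix of p-values have already been computed; the function name `signed_thresholded_dcor` and the parameter names `min_dcor` and `max_p` are hypothetical, not SiDCo identifiers.

```python
import numpy as np

def signed_thresholded_dcor(dcor, pvals, data, min_dcor=0.0, max_p=1.0):
    """Attach the Pearson sign to each dCor value, then zero out weak entries.

    dcor, pvals : (features x features) arrays, assumed precomputed
    data        : (samples x features) array used only for the Pearson sign
    """
    dcor = np.asarray(dcor, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    pearson = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    sign = np.where(pearson < 0, -1.0, 1.0)
    signed = sign * dcor
    # Keep a value only if it passes both user-defined cut-offs.
    keep = (np.abs(signed) >= min_dcor) & (pvals <= max_p)
    return np.where(keep, signed, 0.0)
```

With the defaults (`min_dcor=0`, `max_p=1`) nothing is zeroed, so every signed value is reported.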


Tab pdCor:

  • Partial distance correlation is calculated using the Gaussian Graphical Model (GGM), with p-values determined from the cumulative normal distribution function of the Fisher z-transformed correlations.

    Outputs are:
    - An Excel spreadsheet with partial distance correlation values and their corresponding p-values.
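A p-value of the kind described for the pdCor tab can be sketched with the standard library alone. The scaling by sqrt(n - k - 3), where k is the number of variables conditioned on, is the usual choice for partial correlations and is an assumption here; the text does not state which degrees of freedom SiDCo uses. The function name `fisher_z_pvalue` is hypothetical.

```python
import math

def fisher_z_pvalue(r, n_samples, n_conditioned=0):
    """Two-sided p-value for a (partial) correlation via the Fisher z-transform."""
    dof = n_samples - n_conditioned - 3
    if dof <= 0:
        raise ValueError("not enough samples for the Fisher z approximation")
    # Clip r slightly inside (-1, 1) so that atanh stays finite.
    r = max(min(r, 1.0 - 1e-12), -1.0 + 1e-12)
    z = math.atanh(r) * math.sqrt(dof)
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Under this sketch, conditioning on more variables with the same number of samples lowers the effective degrees of freedom and therefore raises the p-value for the same coefficient.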


In both cases data is preprocessed by z-score normalization of features across all samples. Any missing values must be imputed by the user prior to analysis or SiDCo will not function.
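The preprocessing step can be sketched as follows. The use of the sample standard deviation (`ddof=1`) and the explicit error on missing values are assumptions made for this example; the function name `zscore_features` is hypothetical.

```python
import numpy as np

def zscore_features(data):
    """z-score each feature (column) across all samples (rows)."""
    x = np.asarray(data, dtype=float)
    if np.isnan(x).any():
        raise ValueError("missing values found; impute them before analysis")
    std = x.std(axis=0, ddof=1)  # assumption: sample standard deviation
    if (std == 0).any():
        raise ValueError("constant feature found; its z-score is undefined")
    return (x - x.mean(axis=0)) / std
```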


The running time of the distance calculation typically scales as N², where N is the sample size. For typical datasets in metabolomics and lipidomics (~500 x 500), both dCor and pdCor calculations take less than 2 minutes. For extremely large datasets of more than 1M elements, calculations can be time-consuming.



SiDCo workflow.

Preparing your data for SiDCo


SiDCo calculates distance correlation between features listed in columns using data across rows. The SiDCo input must be a single .xlsx file with features (for example, metabolites or lipids) in columns and samples in rows. The file should contain column names in the top row and row names in the first column (column A). Additional information can be included, and the user can specify the start row and column as well as the stop row. All numeric data should be below and to the right of the specified start column and start row. If there is any non-numeric data to the right of the specified start column, the analysis will abort. Because distance correlation calculations cannot work with data that have missing values, users should impute missing data with the method most appropriate for their dataset prior to using SiDCo.


Sample Data


The sample datasets are provided in both allowed input formats (.csv and .xlsx) with features (metabolites or lipids) in columns and samples in rows. Note that column A includes group names and row 1 includes feature names. To calculate distance correlations for each group separately, the user should enter for Group 1: Start Column: B; First Row: 2 (or -1, indicating the first data row); Last Row: 31. For analysis of Group 2, the user should enter: Start Column: B; First Row: 32; Last Row: 46 (or -1, indicating the last row).
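To make the Start Column / First Row / Last Row settings concrete, the sketch below shows how such settings could map onto a block of a sheet held as a list of rows. It is an illustration only: the helper names are hypothetical, and treating -1 as "row 2" for the first row assumes a single header row, as in the sample data.

```python
def column_letter_to_index(letter):
    """Convert a spreadsheet column label ('A', 'B', ..., 'AA') to a 0-based index."""
    label = letter.strip().upper()
    if not label:
        raise ValueError("empty column label")
    index = 0
    for ch in label:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letter!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

def select_block(rows, start_column, first_row, last_row):
    """Return the cells from `start_column` rightwards for sheet rows first_row..last_row.

    rows      : list of lists; rows[0] is sheet row 1 (the header row)
    first_row : 1-based sheet row, or -1 for the first data row (assumed to be row 2)
    last_row  : 1-based sheet row (inclusive), or -1 for the last row of the sheet
    """
    col = column_letter_to_index(start_column)
    first = 2 if first_row == -1 else first_row
    last = len(rows) if last_row == -1 else last_row
    return [row[col:] for row in rows[first - 1:last]]
```

For a sheet laid out like the sample data, `select_block(rows, "B", 2, 31)` would return the 30 Group 1 samples and `select_block(rows, "B", 32, -1)` the 15 Group 2 samples.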


  1. exampleinput.xlsx

When troubleshooting, please review this list of common reasons for SiDCo failing to run. If you are still experiencing difficulties, please contact ldomic@uottawa.ca for further assistance. Please include your input dataset and a description of the problem that you experienced. We will reproduce the problem and provide you with a solution.

  1. My file loads but does not output any analysis

    SiDCo only accepts .xlsx or comma-delimited (.csv) files as input. Tab-delimited files (.txt) can be read but not analyzed and will not produce any results. Please convert your input data into one of the accepted formats before running SiDCo. Additionally, ensure that all values to the right of and below your start cell are numeric. Data can start from any row and column; however, all data must be numeric to the right of and below the user-defined start point. Make sure that there are no missing data in your input, as they will prevent SiDCo calculations.

  2. All obtained values are zero

    Revise your tolerance settings, i.e., the correlation threshold and the p-value threshold. SiDCo sets to zero any values that are below the correlation threshold or above the specified p-value. If you prefer to see all values, please enter 0 for the distance correlation threshold and 1 for the p-value threshold.

  3. Why are there no negative values for one-to-all distance correlations?

    As Pearson correlation cannot be calculated for vectors of different lengths, it is not possible to determine a linear sign in the one-to-all distance calculation.

  4. Why are there no Pearson and Spearman values for one-to-all correlation?

    Pearson and Spearman correlations cannot be calculated for the one-to-all set and thus cannot be included in this output.

  5. I get results from the dCor tab but not from the pdCor tab

    pdCor analysis is based on the inversion of the distance covariance matrix. If this matrix is singular, inversion is not possible. This occurs when the input has fewer samples than features or if there are some features that can only be obtained as a linear combination of other features in the dataset. To address these issues, add more sample measurements or reduce the number of features in your pdCor analysis.
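Before running pdCor, the two conditions named above can be checked on the input itself. The sketch below inspects the rank of the mean-centered data matrix as a simple proxy; it does not compute the distance covariance matrix that SiDCo inverts, and the function name `pdcor_input_diagnostics` is hypothetical.

```python
import numpy as np

def pdcor_input_diagnostics(data):
    """Report sample/feature counts and whether the centered data are rank-deficient."""
    x = np.asarray(data, dtype=float)
    n_samples, n_features = x.shape
    centered = x - x.mean(axis=0)
    rank = int(np.linalg.matrix_rank(centered))
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "rank": rank,
        "fewer_samples_than_features": n_samples < n_features,
        # True when some feature is a linear combination of others,
        # or when there are too few samples to separate the features.
        "rank_deficient": rank < n_features,
    }
```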

Contact Us

ldomic@uottawa.ca


Cite your use of SiDCo in a publication

F. Monti, D. Stewart, A. Surendra, I. Alecu, T. Nguyen-Tran, S. A. L. Bennett, M. Čuperlović-Culf, Signed Distance Correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad210, https://doi.org/10.1093/bioinformatics/btad210


Public Server

SiDCo: https://complimet.ca/SiDCo/


Software License

SiDCo is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License v3 (or later versions) as published by the Free Software Foundation. As per the GNU General Public License, SiDCo is distributed as a bioinformatic tool to assist users WITHOUT ANY WARRANTY and without any implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. All limitations of warranty are indicated in the GNU General Public License.

Remember that the sign of the coefficient comes from the Pearson correlation. For example, a high coefficient with a negative sign does NOT mean a significant negative trend; it only indicates a strong correlation in which an overall negative linear trend was also detected.