Chapter 7 - Data Normalization#
Why do we need to apply data normalization methods when analyzing omics datasets?#
Data normalization is an important step in the analysis of omics datasets because it helps to remove systematic biases and variations that can affect the accuracy and reliability of the results. There are many different sources of bias and variation in omics data, such as differences in sample preparation and measurement techniques, and normalization helps to correct for these sources of variability.
For example, when analyzing gene expression data, normalization is often used to correct for differences in the amount of total RNA that was extracted from each sample, as well as differences in sequencing depth and library preparation efficiency between RNA-seq runs. This ensures that the gene expression levels are comparable across samples, allowing for more accurate comparison and interpretation of the results.
Normalization is also important for data from other omics fields, such as proteomics and metabolomics. In these cases, normalization can be used to correct for differences in the sample preparation and measurement methods, as well as differences in the overall abundance of proteins or metabolites in the samples.
Overall, normalization is an essential step in the analysis of omics data, as it helps to ensure the accuracy and reliability of the results, allowing for more meaningful comparison and interpretation of the data.
Data Normalization Methods Summary#
There are many different normalization methods that are used in the field of bioinformatics, and the choice of method often depends on the specific type of data and the goals of the analysis. Some of the most common normalization methods used in bioinformatics include:
Log transformation is a mathematical operation that is commonly used in statistics and data analysis to normalize data and reduce the effects of skewness. In a log transformation, each value in the dataset is replaced by its logarithm, typically using the base 10 or base 2 logarithm. This has the effect of compressing the values at the high end of the range and expanding the values at the low end of the range. As a result, the data becomes more symmetrical and more amenable to statistical analysis. Log transformations are commonly used in bioinformatics to normalize gene expression data and other types of biological data.
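As a rough illustration, here is a minimal NumPy sketch of a base-2 log transformation applied to a small, hypothetical expression matrix; the pseudocount of 1 is an assumption used here to avoid taking the logarithm of zero:

import numpy as np

# Hypothetical expression matrix: rows are genes, columns are samples
counts = np.array([[10, 1000, 50],
                   [200, 8, 12000],
                   [0, 35, 7]])

# Add a pseudocount of 1 so zero counts do not produce -infinity,
# then take the base-2 logarithm of every value
log_counts = np.log2(counts + 1)
print(log_counts)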
Total count normalization: This method is often used for RNA-seq data, where it is used to correct for differences in the total number of reads that were generated for each sample. This ensures that the gene expression levels are comparable across samples, regardless of the total amount of RNA that was extracted.
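For illustration, here is a minimal sketch of total count normalization on a hypothetical read-count matrix; rescaling to counts per million (the factor of 1e6) is one common convention, not the only option:

import numpy as np

# Hypothetical read-count matrix: rows are genes, columns are samples
counts = np.array([[100, 300, 50],
                   [200, 600, 100],
                   [700, 2100, 350]], dtype=float)

# Divide each sample (column) by its total read count, then scale to
# counts per million so the values are easier to read
library_sizes = counts.sum(axis=0)
cpm = counts / library_sizes * 1e6
print(cpm)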
Quantile normalization: This method is often used for microarray data, where it is used to correct for systematic biases in the intensity values of the probes. The method works by ranking the intensity values for each probe across all samples, and then reordering the values so that they have the same distribution across all samples. This ensures that the intensity values for each probe are comparable across samples, allowing for more accurate comparison and analysis of the data.
Z-score normalization: This method is often used for proteomics and metabolomics data, where it is used to correct for differences in the overall abundance of proteins or metabolites in each sample. The method works by calculating the mean and standard deviation of the data for each sample, and then transforming the values so that they have a mean of zero and a standard deviation of one. This ensures that the data for each sample is centered around the same mean and has the same spread, allowing for more accurate comparison and analysis of the data.
Median-ratio normalization: This method is like quantile normalization, but instead of ranking the intensity values for each probe, it calculates the median value for each probe across all samples. The method then divides the intensity values for each probe by the median value, resulting in a set of normalized values that are comparable across samples.
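Here is a minimal sketch of the per-probe scaling described above, using a hypothetical intensity matrix:

import numpy as np

# Hypothetical intensity matrix: rows are probes, columns are samples
intensities = np.array([[1.0, 5.0, 3.0],
                        [2.0, 4.0, 6.0],
                        [3.0, 3.0, 1.0]])

# Median of each probe (row) across all samples
probe_medians = np.median(intensities, axis=1, keepdims=True)

# Divide each probe's values by its median to obtain ratios that can be
# compared across samples
ratios = intensities / probe_medians
print(ratios)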
Standard deviation normalization: This method is like Z-score normalization, but instead of transforming the values so that they have a mean of zero and a standard deviation of one, it simply divides each value by the standard deviation of the data for that sample. This results in a set of normalized values that have the same spread (a standard deviation of one) across samples, although, unlike Z-scores, they are not re-centered around zero.
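In NumPy, this per-sample scaling can be sketched in a single operation; the data matrix below is hypothetical:

import numpy as np

# Hypothetical data matrix: rows are features, columns are samples
data = np.array([[2.0, 4.0, 6.0],
                 [3.0, 5.0, 7.0],
                 [4.0, 6.0, 8.0]])

# Divide each sample (column) by its own standard deviation so that every
# sample ends up with a standard deviation of one (values are not re-centered)
scaled = data / data.std(axis=0)
print(scaled)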
Trimmed mean normalization: This method is used to correct for extreme values in the data that can affect the overall distribution of the data. The method works by calculating the mean and standard deviation of the data for each sample, and then removing any values that are more than a certain number of standard deviations away from the mean. The resulting set of values is then used to calculate a new mean and standard deviation, which are used to normalize the data.
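Here is a minimal sketch of the trimming procedure described above for a single hypothetical sample; the cutoff of two standard deviations is an arbitrary choice:

import numpy as np

def trimmed_mean_normalize(values, k=2.0):
    # Remove values more than k standard deviations from the mean, then
    # normalize using the mean and standard deviation of what remains
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    kept = values[np.abs(values - mean) <= k * std]
    trimmed_mean, trimmed_std = kept.mean(), kept.std()
    return (values - trimmed_mean) / trimmed_std

# Hypothetical sample with one extreme value
sample = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 50.0]
print(trimmed_mean_normalize(sample))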
Median polish: This method is often used for microarray data, where it is used to correct for systematic biases in the probe intensity values. The method works by iteratively subtracting the median of each row (probe) and the median of each column (sample) from the log-transformed intensity matrix until the residuals stop changing. The accumulated column effects capture sample-level differences, and removing them yields probe values that are comparable across samples. The same decomposition can also be used to absorb additional sources of variation, such as batch effects, to further improve the normalization.
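Below is a simplified sketch of Tukey's median polish applied to a hypothetical matrix of log2-transformed intensities; a full implementation would also track an overall (grand) effect and a proper convergence criterion:

import numpy as np

def median_polish(matrix, n_iter=10):
    # Alternately remove row medians and column medians from the residual
    # matrix, accumulating the removed amounts as row and column effects
    residuals = np.array(matrix, dtype=float)
    row_effects = np.zeros(residuals.shape[0])
    col_effects = np.zeros(residuals.shape[1])
    for _ in range(n_iter):
        row_medians = np.median(residuals, axis=1)
        row_effects += row_medians
        residuals -= row_medians[:, None]
        col_medians = np.median(residuals, axis=0)
        col_effects += col_medians
        residuals -= col_medians[None, :]
    return row_effects, col_effects, residuals

# Hypothetical probe intensities (rows: probes, columns: samples)
log_intensities = np.log2(np.array([[120.0, 95.0, 210.0],
                                    [30.0, 25.0, 60.0],
                                    [400.0, 310.0, 750.0]]))
probe_effects, sample_effects, residuals = median_polish(log_intensities)
print(probe_effects)    # per-probe summary values
print(sample_effects)   # per-sample effects (systematic differences)
print(residuals)        # what is left after removing both effects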
These are just a few examples of the many normalization methods that are used in bioinformatics. The choice of method will depend on the specific type of data and the goals of the analysis, and it is important to carefully consider the appropriate method to use in each situation.
Quantile Normalization#
Quantile normalization is a method used to correct for systematic biases in a dataset, for example, the intensity values of probes in microarray data. It is often used to ensure that the values for each measured gene or protein are comparable across samples, allowing for more accurate comparison and analysis of the data.
Here is an example to illustrate how quantile normalization works:
Let’s say we have the following dataset, which consists of three columns of data:
| Row | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| 1 | 1 | 5 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 3 | 1 |
To perform quantile normalization on this dataset, we would first sort the values in each column:
| Rank | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| 1 | 1 | 3 | 1 |
| 2 | 2 | 4 | 3 |
| 3 | 3 | 5 | 6 |
Next, we would compute the mean of the sorted values across all columns at each rank. Each rank corresponds to a quantile of the data (with three rows, the smallest value is the 1/3 quantile, the middle value the 2/3 quantile, and the largest the 3/3 quantile), and these rank means define the reference distribution that every column will share:

| Rank | Mean of sorted values |
|---|---|
| 1 | (1 + 3 + 1) / 3 = 1.6667 |
| 2 | (2 + 4 + 3) / 3 = 3.0000 |
| 3 | (3 + 5 + 6) / 3 = 4.6667 |
Next, we would replace each original value with the rank mean that corresponds to its rank within its own column. For example, the smallest value in Column 1 (the value 1) becomes 1.6667, the middle value (2) becomes 3.0000, and the largest value (3) becomes 4.6667; the same mapping is applied to Columns 2 and 3. The resulting normalized data would be:

| Row | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| 1 | 1.6667 | 4.6667 | 3.0000 |
| 2 | 3.0000 | 3.0000 | 4.6667 |
| 3 | 4.6667 | 1.6667 | 1.6667 |
After this step, every column contains exactly the same set of values (1.6667, 3.0000, and 4.6667), just in a different order, so all three columns now share the same distribution.
Overall, quantile normalization is a useful method for correcting for systematic biases in microarray data, allowing for more accurate comparison and analysis of the data.
Python example of quantile normalization#
Here is an example of how you could implement quantile normalization in Python:
import numpy as np

def quantile_normalize(data):
    data = np.array(data, dtype=float)
    # rank of each value within its column (0 = smallest);
    # this simple ranking assumes there are no ties within a column
    ranks = data.argsort(axis=0).argsort(axis=0)
    # sort each column, then average the sorted values across columns at
    # each rank; these rank means form the shared reference distribution
    rank_means = np.sort(data, axis=0).mean(axis=1)
    # replace every value by the rank mean that matches its rank
    return rank_means[ranks]

# example data: rows are probes, columns are samples
data = [[1, 5, 3], [2, 4, 6], [3, 3, 1]]

# normalize the data
normalized_data = quantile_normalize(data)
print(normalized_data)
In this example, the quantile_normalize function is defined to perform quantile normalization on a given dataset. The function ranks the values within each column, computes the mean of the sorted values across all columns at each rank, and then replaces every value with the rank mean that corresponds to its rank. As a result, every column ends up with the same distribution of values.
The example also defines an example dataset and applies the quantile_normalize function to it. The resulting normalized data is then printed to the console. This should result in the following output:
[[1.66666667 4.66666667 3.        ]
 [3.         3.         4.66666667]
 [4.66666667 1.66666667 1.66666667]]

This matches the result from the worked example above (up to rounding): each column now contains the same set of values, so the intensity values for each probe are comparable across samples. This allows for more accurate comparison and analysis of the data.
Z-score normalization#
Z-score normalization is a method used to correct for differences in the overall level of a variable in a data matrix, for example, the abundance of a gene, a protein, or a metabolite measured in transcriptomics, proteomics, or metabolomics. It is often used to ensure that the data for each sample is centered around the same mean and has the same spread, allowing for more accurate comparison and analysis of the data.
Here is an example to illustrate how Z-score normalization works.
Let’s say you have a table with the following values:
| Row | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| 1 | 2 | 4 | 6 |
| 2 | 3 | 5 | 7 |
| 3 | 4 | 6 | 8 |
To normalize the data using z-scores, you would first need to calculate the mean and standard deviation for each column.
For column 1, the mean is (2 + 3 + 4)/3 = 3, and the standard deviation is sqrt(((2 - 3)^2 + (3 - 3)^2 + (4 - 3)^2)/3) = sqrt(2/3) ≈ 0.8165.
For column 2, the mean is (4 + 5 + 6)/3 = 5, and the standard deviation is sqrt(((4 - 5)^2 + (5 - 5)^2 + (6 - 5)^2)/3) = sqrt(2/3) ≈ 0.8165.
For column 3, the mean is (6 + 7 + 8)/3 = 7, and the standard deviation is sqrt(((6 - 7)^2 + (7 - 7)^2 + (8 - 7)^2)/3) = sqrt(2/3) ≈ 0.8165.
(Here the population standard deviation is used, which matches NumPy's default behavior in the code example below.)
Once you have the mean and standard deviation for each column, you can apply the z-score normalization formula to each value in the table:
z-score = (value - mean) / standard deviation
Using this formula, you apply the column means and standard deviations computed above to each cell in the table:

| | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| Mean | 3 | 5 | 7 |
| Standard deviation | 0.8165 | 0.8165 | 0.8165 |
This results in the following normalized table:
| Row | Column 1 | Column 2 | Column 3 |
|---|---|---|---|
| 1 | -1.2247 | -1.2247 | -1.2247 |
| 2 | 0 | 0 | 0 |
| 3 | 1.2247 | 1.2247 | 1.2247 |
Python example of Z-score normalization#
Here is an example of how you could implement Z-score normalization in Python:
import numpy as np
# Original data
data = [[2, 4, 6], [3, 5, 7], [4, 6, 8]]
# Convert the data to a NumPy array
data = np.array(data)
# Calculate the means and standard deviations for each column
means = np.mean(data, axis=0)
stds = np.std(data, axis=0)
# Normalize the data using the z-score formula
normalized_data = (data - means) / stds
# Print the normalized data
print(normalized_data)
This code will output the following normalized data:
[[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
Authors: chatGPT, Avi Ma’ayan, Heesu Kim
Maintainers: chatGPT, Avi Ma’ayan, Heesu Kim
Version: 0.1
License: CC-BY-NC-SA 4.0