When running differential expression analysis, we see this error:
my_ad_DESeq2 <- phyloseq_to_deseq2(ad_counts, design = ~group)
my_comparison <- c("group", "AD", "control")
Significant_DEseq2_ASVs <- Differential_Abundance(my_ad_DESeq2, my_comparison, 0.05)
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means
To the best of my understanding, this error is caused by having a zero-inflated dataset (data with a disproportionate number of counts compared to what is expected) where there are 0s in every ASV.
DESeq2 has difficulty handling this kind of data; phyloseq is really borrowing its functions for differential expression to do differential abundance, and zero-inflated data is more common in microbiome datasets than RNA-seq datasets.
We can troubleshoot this in a number of different ways:
-
We can add 1 to every count in the matrix to remove 0’s. While this technically works, I’m not really sure on what this would mean statistically. On one hand, I notice a lot of our data is low count. It might not be a big issue to add 1 count to 1000 counts, but adding 1 to 10 counts seems like it could be significant. It also means that although we are avoiding 0s, the difference between samples at these ASVs is still minimal, and I’m also not sure what effect this might have on the results.
-
We can be more stringent when we filter ASVs and choose to analyze only the most common ones. While this approach is not uncommon, to the best of my knowledge there is no standardized cut off point accepted by the community.
-
We can use the alternative geometric mean (see here and all links within) which I understand ignores the 0s/NA values:
########## Function (for anyone who wishes to see)
Differential_Abundance_AltGeo <- function(w, x, y, z) {
# Accept the input and give it a human friendly name
my_counts <- w
my_DESeq2_object <- x
my_comparision_list <- y
my_pvalue <- z
#Do the DESeq2 analysis
gm_mean = function(x, na.rm = TRUE){
exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}
geoMeans = apply(counts(my_DESeq2_object), 1, gm_mean)
my_DESeq2_object = estimateSizeFactors(my_DESeq2_object, geoMeans = geoMeans)
my_DESeq2_object = estimateDispersions(my_DESeq2_object, fitType = "parametric")
my_DESeq2_object <- nbinomWaldTest(my_DESeq2_object)
#Change the contrast section to compare different variables and groups.
res = results(my_DESeq2_object, cooksCutoff = FALSE, contrast= my_comparision_list)
#Select only abundances with a p-value of z or below
alpha = my_pvalue
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(my_counts)[rownames(sigtab), ], "matrix"))
#Let's look at the OTUs that were significantly different between the two groups. The following makes a nice ggplot2 summary of the results
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
scale_fill_brewer(palette = palname, ...)
}
# Phylum order
x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
# Class order
x = tapply(sigtab$log2FoldChange, sigtab$Class, function(x) max(x))
x = sort(x, TRUE)
sigtab$Class = factor(as.character(sigtab$Class), levels=names(x))
# Order order
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x, TRUE)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
# Family order
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x, TRUE)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))
# Genus order
x = tapply(sigtab$log2FoldChange, sigtab$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtab$Genus = factor(as.character(sigtab$Genus), levels=names(x))
# Species order
x = tapply(sigtab$log2FoldChange, sigtab$Species, function(x) max(x))
x = sort(x, TRUE)
sigtab$Species = factor(as.character(sigtab$Species), levels=names(x))
return(sigtab)
}
########## For Students
# STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design
my_ad_DESeq2 <- phyloseq_to_deseq2(ad_counts, design = ~group)
# STEP 2: Select the groups to compare
my_comparison <- c("group", "AD", "control")
# STEP 3: Run the differential abundance analysis at the chosen p-value
Significant_DEseq2_ASVs <- Differential_Abundance_AltGeo(ad_counts,my_ad_DESeq2, my_comparison, 0.05)
# STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups
Significant_DEseq2_ASVs
# STEP 5: Plot the results with your chosen x axis and legend
ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Class)) + geom_point(size=3, position = "jitter") +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
ggtitle("my title here")
A new function, Differential_Abundance_AltGeo will be used in this case. Note that it requires your counts table as an input, which the other differential abundance functions do not. We have three different differential abundance functions now:
- Differential_Abundance(): For unproblematic datasets like MISO
- Differential_Abundance_Continuous(): For continuous variables such as metabolites and age
- Differential_Abundance_AltGeo(): For 0-inflated datasets including FeFiFo, ad, and oc
Do note that with ad dataset, the table the function prints is wider because ASVs are not named ASV_1, ASV_2, etc. but contain the sequence data itself as the name (which is actually something nice to have). We can use the arrow button in the top right corner of the Significant_DEseq2_ASVs table to browse through the information and get the taxa.
I will update the learnr modules on SciServer with this new function. And post again when finished.