Every gene contains at least one zero, cannot compute log geometric means

When running differential expression analysis, we see this error:

my_ad_DESeq2 <- phyloseq_to_deseq2(ad_counts, design = ~group)
my_comparison <- c("group", "AD", "control")
Significant_DEseq2_ASVs <- Differential_Abundance(my_ad_DESeq2, my_comparison, 0.05)

estimating size factors

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means

To the best of my understanding, this error is caused by having a zero-inflated dataset (data with a disproportionate number of counts compared to what is expected) where there are 0s in every ASV.

DESeq2 has difficulty handling this kind of data; phyloseq is really borrowing its functions for differential expression to do differential abundance, and zero-inflated data is more common in microbiome datasets than RNA-seq datasets.

We can troubleshoot this in a number of different ways:

  1. We can add 1 to every count in the matrix to remove 0’s. While this technically works, I’m not really sure on what this would mean statistically. On one hand, I notice a lot of our data is low count. It might not be a big issue to add 1 count to 1000 counts, but adding 1 to 10 counts seems like it could be significant. It also means that although we are avoiding 0s, the difference between samples at these ASVs is still minimal, and I’m also not sure what effect this might have on the results.

  2. We can be more stringent when we filter ASVs and choose to analyze only the most common ones. While this approach is not uncommon, to the best of my knowledge there is no standardized cut off point accepted by the community.

  3. We can use the alternative geometric mean (see here and all links within) which I understand ignores the 0s/NA values:

########## Function (for anyone who wishes to see)
Differential_Abundance_AltGeo <- function(w, x, y, z) {
  
  # Accept the input and give it a human friendly name
  my_counts <- w
  my_DESeq2_object <- x
  my_comparision_list <- y
  my_pvalue <- z
  
  #Do the DESeq2 analysis
  
  gm_mean = function(x, na.rm = TRUE){
    exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
  }
  
  geoMeans = apply(counts(my_DESeq2_object), 1, gm_mean)
  my_DESeq2_object = estimateSizeFactors(my_DESeq2_object, geoMeans = geoMeans)
  my_DESeq2_object = estimateDispersions(my_DESeq2_object, fitType = "parametric")
  my_DESeq2_object <- nbinomWaldTest(my_DESeq2_object)
  
  #Change the contrast section to compare different variables and groups.
  res = results(my_DESeq2_object, cooksCutoff = FALSE, contrast= my_comparision_list)
  
  #Select only abundances with a p-value of z or below
  alpha = my_pvalue
  sigtab = res[which(res$padj < alpha), ]
  sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(my_counts)[rownames(sigtab), ], "matrix"))
  
  #Let's look at the OTUs that were significantly different between the two groups. The following makes a nice ggplot2 summary of the results
  theme_set(theme_bw())
  scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
  }
  # Phylum order
  x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
  
  # Class order
  x = tapply(sigtab$log2FoldChange, sigtab$Class, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Class = factor(as.character(sigtab$Class), levels=names(x))
  
  # Order order
  x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
  
  # Family order
  x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))
  
  # Genus order
  x = tapply(sigtab$log2FoldChange, sigtab$Genus, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Genus = factor(as.character(sigtab$Genus), levels=names(x))
  
  # Species order
  x = tapply(sigtab$log2FoldChange, sigtab$Species, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Species = factor(as.character(sigtab$Species), levels=names(x))
  
  return(sigtab)
}


########## For Students
# STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design 
my_ad_DESeq2 <- phyloseq_to_deseq2(ad_counts, design = ~group)

# STEP 2: Select the groups to compare
my_comparison <- c("group", "AD", "control")

# STEP 3: Run the differential abundance analysis at the chosen p-value
Significant_DEseq2_ASVs <- Differential_Abundance_AltGeo(ad_counts,my_ad_DESeq2, my_comparison, 0.05)

# STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups
Significant_DEseq2_ASVs

# STEP 5: Plot the results with your chosen x axis and legend
ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Class)) + geom_point(size=3, position = "jitter") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
  ggtitle("my title here")

A new function, Differential_Abundance_AltGeo will be used in this case. Note that it requires your counts table as an input, which the other differential abundance functions do not. We have three different differential abundance functions now:

  • Differential_Abundance(): For unproblematic datasets like MISO
  • Differential_Abundance_Continuous(): For continuous variables such as metabolites and age
  • Differential_Abundance_AltGeo(): For 0-inflated datasets including FeFiFo, ad, and oc

Do note that with ad dataset, the table the function prints is wider because ASVs are not named ASV_1, ASV_2, etc. but contain the sequence data itself as the name (which is actually something nice to have). We can use the arrow button in the top right corner of the Significant_DEseq2_ASVs table to browse through the information and get the taxa.

I will update the learnr modules on SciServer with this new function. And post again when finished.

1 Like

This is up on SciServer!
Let me know if you have any difficulties with it.

@gugulethusk @gauriipaul

1 Like

Hi @syork, I was trying Diff Abundance for age (and dosage) as continous variables. This is the code:

STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design

my_oc_DESeq2 ← phyloseq_to_deseq2(oc_counts, design = ~age)

STEP 2: Select the groups to compare

my_comparison ← c(“age”, “18”, “24”)

STEP 3: Run the differential abundance analysis at the chosen p-value

Significant_DEseq2_ASVs ← Differential_Abundance_Continuous(oc_counts,my_oc_DESeq2, my_comparison, 0.05)

STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups

Significant_DEseq2_ASVs

STEP 5: Plot the results with your chosen x axis and legend

ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Species)) + geom_point(size=3, position = “jitter”) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
ggtitle(“low vs high dosage”)

and this is the error: variables in design formula cannot contain NA: age

Is there a specific way to do this? What am I doing wrong?

Hi Gauri,

Thank you for bringing this to my attention. This error revealed some underlying issues in all DA functions.

There are a number of complicating factors here.

  1. Dosage should not be considered as a continuous variable as although represented with numbers, we really have 3 distinct groupings. I can adjust this in the phyloseq object; an easy fix. But we still have to address the following issues.
  2. Both age and dosage have missing data (ex. not every participant has an age recorded), and we need our inputs into the DA function to have the same dimensions (ex. the list of samples must be the same).
  3. Unfortunately, not all samples have the same missing fields (ex. the people who are missing an age sometimes have dosage information), so we would have to manipulate the data multiple times if we wanted to capture the largest sample size for both.
  4. For future analyses, we should keep in mind that there are missing fields in other facts too, so if we wanted to look at those factors we would need to repeat the process for each.

What I am thinking right now given the pressing deadline for ABRCMS is I will create a new object that contains the 74 samples with dosage data; it seems that all these people also have age data. If you want to explore age deeper for the microPublication we can work more on this then.

It will take me some time to work on this so I will post again when the fixes have been made.

Thank you so much!
Just for ABRCMS abstract submission, I can try the analysis of the new 74-sample-object. We can get into more details after the deadline.

Hi Gauri,

We have two new objects, oc_dosage and oc_dosage_counts, which only contain samples with information on dosage.

Since dosage a categorical variable from a 0-inflated dataset, we will use Differential_Abundance_AltGeo and will specify what dosages we are comparing. Thankfully there are only three dosage levels. Here is an example comparing the 30 and 20 mcg.

To get access to the new objects please delete and remake the container on SciServer.


# STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design 
my_oc_DESeq2 <- phyloseq_to_deseq2(oc_dosage_counts, design = ~OCE2Dosage)

# STEP 2: Select the groups to compare
my_comparison <- c("OCE2Dosage", "30", "20")

# STEP 3: Run the differential abundance analysis at the chosen p-value
Significant_DEseq2_ASVs <- Differential_Abundance_AltGeo(oc_dosage_counts, my_oc_DESeq2, my_comparison, 0.05)

# STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups
Significant_DEseq2_ASVs

# STEP 5: Plot the results with your chosen x axis and legend
ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Class)) + geom_point(size=3, position = "jitter") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
  ggtitle("my title here")



1 Like