Every gene contains at least one zero, cannot compute log geometric means

syork · July 18, 2025, 5:27pm

When running differential expression analysis, we see this error:

my_ad_DESeq2 <- phyloseq_to_deseq2(ad_counts, design = ~group)
my_comparison <- c("group", "AD", "control")
Significant_DEseq2_ASVs <- Differential_Abundance(my_ad_DESeq2, my_comparison, 0.05)

estimating size factors

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means

To the best of my understanding, this error is caused by having a zero-inflated dataset (data with a disproportionate number of counts compared to what is expected) where there are 0s in every ASV.

DESeq2 has difficulty handling this kind of data; phyloseq is really borrowing its functions for differential expression to do differential abundance, and zero-inflated data is more common in microbiome datasets than RNA-seq datasets.

We can troubleshoot this in a number of different ways:

We can add 1 to every count in the matrix to remove 0’s. While this technically works, I’m not really sure on what this would mean statistically. On one hand, I notice a lot of our data is low count. It might not be a big issue to add 1 count to 1000 counts, but adding 1 to 10 counts seems like it could be significant. It also means that although we are avoiding 0s, the difference between samples at these ASVs is still minimal, and I’m also not sure what effect this might have on the results.
We can be more stringent when we filter ASVs and choose to analyze only the most common ones. While this approach is not uncommon, to the best of my knowledge there is no standardized cut off point accepted by the community.
We can use the alternative geometric mean (see here and all links within) which I understand ignores the 0s/NA values:

########## Function (for anyone who wishes to see)
Differential_Abundance_AltGeo <- function(w, x, y, z) {
  
  # Accept the input and give it a human friendly name
  my_counts <- w
  my_DESeq2_object <- x
  my_comparision_list <- y
  my_pvalue <- z
  
  #Do the DESeq2 analysis
  
  gm_mean = function(x, na.rm = TRUE){
    exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
  }
  
  geoMeans = apply(counts(my_DESeq2_object), 1, gm_mean)
  my_DESeq2_object = estimateSizeFactors(my_DESeq2_object, geoMeans = geoMeans)
  my_DESeq2_object = estimateDispersions(my_DESeq2_object, fitType = "parametric")
  my_DESeq2_object <- nbinomWaldTest(my_DESeq2_object)
  
  #Change the contrast section to compare different variables and groups.
  res = results(my_DESeq2_object, cooksCutoff = FALSE, contrast= my_comparision_list)
  
  #Select only abundances with a p-value of z or below
  alpha = my_pvalue
  sigtab = res[which(res$padj < alpha), ]
  sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(my_counts)[rownames(sigtab), ], "matrix"))
  
  #Let's look at the OTUs that were significantly different between the two groups. The following makes a nice ggplot2 summary of the results
  theme_set(theme_bw())
  scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
  }
  # Phylum order
  x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
  
  # Class order
  x = tapply(sigtab$log2FoldChange, sigtab$Class, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Class = factor(as.character(sigtab$Class), levels=names(x))
  
  # Order order
  x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
  
  # Family order
  x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))
  
  # Genus order
  x = tapply(sigtab$log2FoldChange, sigtab$Genus, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Genus = factor(as.character(sigtab$Genus), levels=names(x))
  
  # Species order
  x = tapply(sigtab$log2FoldChange, sigtab$Species, function(x) max(x))
  x = sort(x, TRUE)
  sigtab$Species = factor(as.character(sigtab$Species), levels=names(x))
  
  return(sigtab)
}


########## For Students
# STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design 
my_ad_DESeq2 <- phyloseq_to_deseq2(ad_counts, design = ~group)

# STEP 2: Select the groups to compare
my_comparison <- c("group", "AD", "control")

# STEP 3: Run the differential abundance analysis at the chosen p-value
Significant_DEseq2_ASVs <- Differential_Abundance_AltGeo(ad_counts,my_ad_DESeq2, my_comparison, 0.05)

# STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups
Significant_DEseq2_ASVs

# STEP 5: Plot the results with your chosen x axis and legend
ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Class)) + geom_point(size=3, position = "jitter") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
  ggtitle("my title here")

A new function, Differential_Abundance_AltGeo will be used in this case. Note that it requires your counts table as an input, which the other differential abundance functions do not. We have three different differential abundance functions now:

Differential_Abundance(): For unproblematic datasets like MISO
Differential_Abundance_Continuous(): For continuous variables such as metabolites and age
Differential_Abundance_AltGeo(): For 0-inflated datasets including FeFiFo, ad, and oc

Do note that with ad dataset, the table the function prints is wider because ASVs are not named ASV_1, ASV_2, etc. but contain the sequence data itself as the name (which is actually something nice to have). We can use the arrow button in the top right corner of the Significant_DEseq2_ASVs table to browse through the information and get the taxa.

I will update the learnr modules on SciServer with this new function. And post again when finished.

syork · July 18, 2025, 6:26pm

This is up on SciServer!
Let me know if you have any difficulties with it.

@gugulethusk @gauriipaul

gauriipaul · July 20, 2025, 3:45am

Hi @syork, I was trying Diff Abundance for age (and dosage) as continous variables. This is the code:

STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design

my_oc_DESeq2 ← phyloseq_to_deseq2(oc_counts, design = ~age)

STEP 2: Select the groups to compare

my_comparison ← c(“age”, “18”, “24”)

STEP 3: Run the differential abundance analysis at the chosen p-value

Significant_DEseq2_ASVs ← Differential_Abundance_Continuous(oc_counts,my_oc_DESeq2, my_comparison, 0.05)

STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups

Significant_DEseq2_ASVs

STEP 5: Plot the results with your chosen x axis and legend

ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Species)) + geom_point(size=3, position = “jitter”) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
ggtitle(“low vs high dosage”)

and this is the error: variables in design formula cannot contain NA: age

Is there a specific way to do this? What am I doing wrong?

syork · July 20, 2025, 4:24pm

Hi Gauri,

Thank you for bringing this to my attention. This error revealed some underlying issues in all DA functions.

There are a number of complicating factors here.

Dosage should not be considered as a continuous variable as although represented with numbers, we really have 3 distinct groupings. I can adjust this in the phyloseq object; an easy fix. But we still have to address the following issues.
Both age and dosage have missing data (ex. not every participant has an age recorded), and we need our inputs into the DA function to have the same dimensions (ex. the list of samples must be the same).
Unfortunately, not all samples have the same missing fields (ex. the people who are missing an age sometimes have dosage information), so we would have to manipulate the data multiple times if we wanted to capture the largest sample size for both.
For future analyses, we should keep in mind that there are missing fields in other facts too, so if we wanted to look at those factors we would need to repeat the process for each.

What I am thinking right now given the pressing deadline for ABRCMS is I will create a new object that contains the 74 samples with dosage data; it seems that all these people also have age data. If you want to explore age deeper for the microPublication we can work more on this then.

It will take me some time to work on this so I will post again when the fixes have been made.

gauriipaul · July 20, 2025, 6:28pm

Thank you so much!
Just for ABRCMS abstract submission, I can try the analysis of the new 74-sample-object. We can get into more details after the deadline.

syork · July 20, 2025, 7:22pm

Hi Gauri,

We have two new objects, oc_dosage and oc_dosage_counts, which only contain samples with information on dosage.

Since dosage a categorical variable from a 0-inflated dataset, we will use Differential_Abundance_AltGeo and will specify what dosages we are comparing. Thankfully there are only three dosage levels. Here is an example comparing the 30 and 20 mcg.

To get access to the new objects please delete and remake the container on SciServer.


# STEP 1: Convert the phyloseq object to a DESeq2 object and tell R the experimental design 
my_oc_DESeq2 <- phyloseq_to_deseq2(oc_dosage_counts, design = ~OCE2Dosage)

# STEP 2: Select the groups to compare
my_comparison <- c("OCE2Dosage", "30", "20")

# STEP 3: Run the differential abundance analysis at the chosen p-value
Significant_DEseq2_ASVs <- Differential_Abundance_AltGeo(oc_dosage_counts, my_oc_DESeq2, my_comparison, 0.05)

# STEP 4: Retrieve the list of ASVs with a significant difference in abundance between the chosen groups
Significant_DEseq2_ASVs

# STEP 5: Plot the results with your chosen x axis and legend
ggplot(Significant_DEseq2_ASVs, aes(x = Phylum, y=log2FoldChange, color= Class)) + geom_point(size=3, position = "jitter") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
  ggtitle("my title here")

Topic		Replies	Views
Indicator analysis Code Cookbook R , 16S	0	25	July 19, 2025
Trouble plotting gene across all regions Help miniCURE	5	15	December 3, 2025
Trouble with tax plots Help R	1	24	July 17, 2025
readCounts questions for RNA-Seq Mini Cure Help RNA-seq	1	62	January 16, 2025
Unable to get any data Wnt1 or Wnt10b in R-Studio Help R	10	274	July 21, 2022