When I try to run my dataset (SRR; it is a very large dataset) through trim QC, it always stops at the fastp job. An error appears saying there isn't enough memory for the job to run, so I realized I have to split the dataset. My ultimate goal is to get it into a Krona pie chart to see the types of bacteria present in the dataset. (I am following Chapter 5 Taxonomy Profiling | BioDIGS miniCURE.)
Here is my history: Galaxy
Hi. Your dataset might just be too large to handle with the given memory. As a workaround, try splitting your dataset into smaller chunks:
- Use the Split Fasta tool on Galaxy to split the data into a smaller number of chunks (the auto-suggested value is 10 chunks); see the sketch after these steps for roughly what this does.
- Once the Split Fasta operation is complete, use the Extract Dataset tool to pull out one of the chunks. Extracting the first dataset is the simplest option, since all chunks are in theory 'equivalent' in the sense of being a random assortment of sequences from the dataset.
- Go ahead and try running fastp on this smaller chunk.
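If you are curious what the split step is doing conceptually, here is a minimal Python sketch of the same idea: split a FASTA file into N chunks and then work with just one of them. This is only an illustration under my own assumptions; the file names (reads.fasta, chunk_*.fasta) and the round-robin strategy are placeholders, not necessarily how Galaxy's Split Fasta tool is implemented.

```python
from itertools import cycle

N_CHUNKS = 10                   # the value Galaxy auto-suggests
INPUT = "reads.fasta"           # hypothetical file name for your dataset

# Open one output file per chunk.
outputs = [open(f"chunk_{i + 1}.fasta", "w") for i in range(N_CHUNKS)]

def records(path):
    """Yield one FASTA record (header line plus its sequence lines) at a time."""
    with open(path) as fh:
        record = []
        for line in fh:
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
        if record:
            yield "".join(record)

# Deal records out round-robin, so each chunk ends up a similar-sized,
# roughly random assortment of sequences from the full dataset.
for out, rec in zip(cycle(outputs), records(INPUT)):
    out.write(rec)

for out in outputs:
    out.close()

# You would then run fastp (and the rest of the taxonomy workflow)
# on a single chunk, e.g. chunk_1.fasta.
```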
I think I figured it out! Here is my history: Galaxy
The data I imported: SRR29980925
If you could please double-check the workflow, I would really appreciate it. Thank you for all your advice!
Good job!
I see you have successfully split your file and run the taxonomy workflow. Of note, since your original dataset was so large, the resulting 10% chunk was itself still very large (18.7 GB!). That is most likely why all the steps took so long. Next time I recommend splitting a dataset this large into more than 10 chunks, say 30 or even 50. This will speed up your workflow and should still produce similar taxonomy classification results.
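To put rough numbers on it (assuming your 18.7 GB chunk really was one tenth of the original, so the full dataset is roughly 187 GB), here is the back-of-the-envelope arithmetic behind that suggestion:

```python
# Estimated per-chunk size for different chunk counts, assuming the full
# dataset is about 10 x 18.7 GB = 187 GB (an estimate, not a measured value).
full_size_gb = 18.7 * 10
for n_chunks in (10, 30, 50):
    print(f"{n_chunks} chunks -> ~{full_size_gb / n_chunks:.1f} GB per chunk")
# 10 chunks -> ~18.7 GB per chunk
# 30 chunks -> ~6.2 GB per chunk
# 50 chunks -> ~3.7 GB per chunk
```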
Another suggestion is to remove duplicate Galaxy entries from your history, since they take up a lot of your precious storage space. I think you have duplicate input SRR files there, as well as duplicate Split Fasta entries.