Distribution-based clustering is available on conda:
conda install -c cduvallet -c conda-forge dbotu_q2
Note that you need the April 2018 QIIME 2 release (version 2018.4) or later for this plugin to work.
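The version requirement above can be checked before installing. The following is a minimal sketch, not part of the plugin: the helper name `qiime_release_ok` is hypothetical, and it assumes release strings of the form `YEAR.MONTH` or `YEAR.MONTH.PATCH`. The year and month are compared as separate integers because a plain string or decimal comparison would rank 2018.11 below 2018.4.

```shell
# Hypothetical helper (not part of dbotu_q2): succeed (exit status 0) if a
# QIIME 2 release string such as "2018.4" or "2018.11.0" is 2018.4 or later.
qiime_release_ok() {
    year=${1%%.*}        # text before the first dot, e.g. 2018
    rest=${1#*.}         # text after the first dot, e.g. 11.0
    month=${rest%%.*}    # first component of the remainder, e.g. 11
    [ "$year" -gt 2018 ] || { [ "$year" -eq 2018 ] && [ "$month" -ge 4 ]; }
}

# Example use, assuming the first line of `qiime --version` has the version
# as its third word (an assumption about the output format):
#   qiime_release_ok "$(qiime --version | awk 'NR == 1 { print $3 }')" \
#       || echo "QIIME 2 is older than 2018.4"
```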
If that doesn't work, you can clone or download the repo to your computer, activate your qiime environment, and then run:
python setup.py install
QIIME 2 plugin for distribution-based clustering.
To learn more about distribution-based clustering, check out the original publication or the Python implementation, dbOTU3 (and its associated publication). This plugin is essentially a QIIME 2 wrapper around that newer implementation.
Currently, the dbOTU plugin has only one function, the distribution-based OTU caller in call-otus.
From within a QIIME 2 environment (i.e. after running source activate qiime2-2018.4), run:
qiime dbotu-q2 call-otus \
--i-table test_data/counts.qza \
--i-sequences test_data/seq.qza \
--o-representative-sequences dbotu_seqs.qza \
--o-dbotu-table dbotu_table.qza
There are optional parameters that you can change to improve the performance of clustering.
You can see these parameters by typing qiime dbotu-q2 call-otus --help, and you can learn more about how to choose them by reading the original publication and the dbotu3 update.
Note that this plugin wraps the dbotu3 version of distribution-based clustering, which recommends using slightly different parameters than the original version.
Currently, the membership information (which shows which sequences are grouped into which OTU) can be printed using the --verbose flag.
The first column of the membership file has the representative sequence for each OTU, and all subsequent columns have the sequences which are grouped into that OTU.
To save the membership file, add the --verbose flag and redirect the output to a separate file:
qiime dbotu-q2 call-otus \
...
--verbose > membership_file.txt
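Once saved, the membership file can be summarized with standard tools. The sketch below is an illustration only: the helper name `otu_sizes` is hypothetical, and it assumes the file is tab-delimited and contains nothing but membership lines (any other messages printed by --verbose would need to be removed first). For each OTU it prints the representative sequence ID and the number of columns that follow it.

```shell
# Hypothetical helper: for each line of a membership file, print the
# representative sequence ID (column 1) and the number of columns after it.
# Assumes tab-delimited input with one OTU per line.
otu_sizes() {
    awk -F '\t' 'NF > 0 { print $1 "\t" (NF - 1) }' "$1"
}

# Example use:
#   otu_sizes membership_file.txt | sort -k2,2nr | head
```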
To run distribution-based clustering, you need (1) some dereplicated sequences and (2) a table of counts indicating how many times each of those sequences is in each of your samples. Dereplicated sequences can be:
The important thing is that the input sequence file contains only non-duplicated sequences (i.e. it is not just all the raw reads present in your dataset).
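One way to confirm that a FASTA file is dereplicated is to look for sequences that occur more than once. This is a rough sketch rather than anything the plugin provides: the helper name `duplicated_sequences` is hypothetical, and it assumes a plain-text FASTA file in which identical sequences are exact matches after ignoring case and line wrapping.

```shell
# Hypothetical helper: print "count<TAB>sequence" for every sequence that
# appears more than once in a FASTA file. No output means no duplicates.
duplicated_sequences() {
    awk '
        /^>/ {                          # header line: store the previous record
            if (seq != "") count[seq]++
            seq = ""
            next
        }
        { seq = seq toupper($0) }       # join wrapped sequence lines
        END {
            if (seq != "") count[seq]++ # store the final record
            for (s in count)
                if (count[s] > 1) print count[s] "\t" s
        }
    ' "$1"
}

# Example use:
#   duplicated_sequences your_sequence_file.fasta
```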
The sequence IDs in the counts table should match the IDs in the input sequences file, and every sequence ID in your dereplicated sequences file must be present in the table.
If you're using QIIME 2 to process your data, the sequences should have the data format FeatureData[Sequence] and the counts table should have the data format FeatureTable[Frequency].
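The requirement that every sequence ID in the sequences file is present in the counts table can be spot-checked on plain-text files before anything is imported. The sketch below makes several assumptions: the helper name `ids_missing_from_table` is hypothetical, the table is a tab-separated text file with sequence IDs in the first column and header lines starting with `#` (for example, as produced by `biom convert --to-tsv`), and IDs contain no whitespace.

```shell
# Hypothetical helper: print every sequence ID found in a FASTA file ($1)
# that does not appear in the first column of a TSV counts table ($2).
# No output means every sequence has a row in the table.
ids_missing_from_table() {
    awk '
        NR == FNR {                     # first file: the counts table
            if ($0 !~ /^#/ && NF > 0) in_table[$1] = 1
            next
        }
        /^>/ {                          # second file: FASTA header lines
            id = substr($1, 2)          # drop ">" and any description text
            if (!(id in in_table)) print id
        }
    ' "$2" "$1"
}

# Example use:
#   ids_missing_from_table your_sequence_file.fasta your_table.tsv
```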
If you want to use this plugin but you're not using QIIME 2 for any of your other steps, you'll need to first import your data (a feature table of counts and a FASTA file of dereplicated sequences) into QIIME 2. From within the QIIME 2 environment, you can do:
qiime tools import \
--input-path your_sequence_file.fasta \
--output-path your_sequence_file.qza \
--type 'FeatureData[Sequence]'
qiime tools import \
--input-path your_table.biom \
--output-path your_table.qza \
--type 'FeatureTable[Frequency]'