q2-dbotu

2018.4.2
QIIME 2 plugin for distribution-based clustering, which calls OTUs based on the similarity between their genetic sequences and their count distribution across samples.

install guide:

Distribution-based clustering is available on conda:

conda install -c cduvallet -c conda-forge dbotu_q2

Note that you need the April 2018 QIIME 2 release (version 2018.4) or later for this plugin to work.

If that doesn't work, you can clone or download the repo to your computer, activate your qiime environment, and then run:

python setup.py install

q2-dbotu

QIIME 2 plugin for distribution-based clustering.

To learn more about distribution-based clustering, check out the original publication or the python implementation, dbOTU3 (and its associated publication). This plugin is essentially a QIIME 2 wrapper around this new implementation.

Usage

Currently, the dbOTU plugin has only one function, the distribution-based OTU caller in call-otus. From within a QIIME 2 environment (i.e. after doing source activate qiime2-2018.4), run:

qiime dbotu-q2 call-otus \
	--i-table test_data/counts.qza \
	--i-sequences test_data/seq.qza \
	--o-representative-sequences dbotu_seqs.qza \
	--o-dbotu-table dbotu_table.qza

Genetic, abundance, and p-value criteria parameters

There are optional parameters that you can change to improve the performance of clustering. You can see these parameters by typing qiime dbotu-q2 call-otus --help, and you can learn more about how to choose them by reading the original publication and the dbotu3 update. Note that this plugin wraps the dbotu3 version of distribution-based clustering, which recommends using slightly different parameters than the original version.

Membership file

Currently, the membership information can be printed using the --verbose flag. The first column of the membership file has the representative sequence for each OTU, and all subsequent columns have the sequences which are grouped into that OTU. If you want to see the membership file (which shows which sequences are grouped into which "OTU"), use the --verbose flag (and optionally pipe the output to a separate file):

qiime dbotu-q2 call-otus \
    ...
    --verbose > membership_file.txt

Data

To run distribution-based clustering, you need (1) some dereplicated sequences and (2) a table of counts indicating how many times each of those sequences is in each of your samples. Dereplicated sequences can be:

  • exact sequence variants output by DADA2 or deblur's denoise commands,
  • dereplicated sequences, which you can get by clustering at 100% identity with vsearch,
  • some other set of representative sequences, perhaps output from clustering with vsearch at a different identity threshold (e.g. 97% OTUs).

The important thing is that the input sequence file contains only non-duplicated sequences (i.e. it is not just all the raw reads present in your dataset). The sequence IDs in the counts table should match the IDs in the input sequences file, and every sequence ID in your dereplicated sequences file must be present in the table. If you're using QIIME 2 to process your data, these should have the data format FeatureData[Sequence].

Using non-QIIME 2 artifact data

If you want to use this plugin but you're not using QIIME 2 for any of your other steps, you'll need to first import your data (a feature table of counts and a fasta file of dereplicated sequences) into qiime. From within the qiime environment, you can do:

qiime tools import \
  --input-path your_sequence_file.fasta \
  --output-path your_sequence_file.qza \
  --type 'FeatureData[Sequence]'

  qiime tools import \
    --input-path your_table.biom \
    --output-path your_table.qza \
    --type 'FeatureTable[Frequency]'

Versions

  • 2018.4.2 - output membership info with the --verbose flag
  • 2018.4.1 - fixed bug: transpose table before and after calling dbotu3, so that dbotu3 gets data in expected format (sequences in rows) despite input qiime2 format (sequences in columns)
  • 2018.4.0 - original "working" plugin, uploaded to conda