|
FISABIO Bioinformatics Service - Comparative Taxonomy Pipeline (V4)
Home | Introduction | Samples summary | Taxonomy tables | Barplots | Krona | Alpha diversity |
---|---|---|---|---|---|---|
Rarefaction | Beta diversity | Clustering | CCA | NMDS | Analysis of Variance | Paired tests |
Methods | References |
Covernote: This demo shows the functioning and outputs of Comparative Taxonomy pipeline (Downstream Analysis) from FISABIO Sequencing and Bioinformatics service. Data used belong to already published work and are used only for demonstrative purpose.
Project basename: DemoTaxonomyUpstream
Project advisor email: seqserbioinfo_fisabio@gva.es
Project advisor phone: +34 961 925929
Introduction
The aim of this document is to drive you through the reading of the analysis. The basic idea of the manuscript is to provide a user-friendly approach to 16S rDNA gene microbial analysis.
By clicking on hyperlinks it is possible to navigate among the table of contents, figures, tables and bibliographic references. All statistics have been obtained using The RStatistics software Rstatistics, making use of several Open Source libraries such as gdata, vegan, etc.
All figures can be found in the folder figures and subfolders. All tables are stored in the folder tables_and_biomes. Each file name self-explains the data. Figures are all in PDF (Portable Document Format) format and can be visualized by standard PDF viewer Acrobat or Evince. PDF allows to resize figures without loosing/altering the resolution. This format is easy to be exported or converted in other formats. Moreover you will find also low resolution images in PNG (Portable Network Graphic) format.
Data filtering
- Main data reported in contingency tables or KRONA diversity representation have been obtained considering the whole dataset including singletons or and under-represented taxa.
- All statistical representations (with exception of barplots) have been obtained reducing the dataset removing those taxa which were represented by less than 3 sequence.
- All statistics have been carried filtering out unidentified ud-Archaea and ud-Bacteria which collect reads with Phylum scores <= 0.8. Reads belonging to these category are generally not related to 16S rDNA gene and assigned to the higher taxonomic rank with very low score values.
Tables are formatted by tabs or commas. All tables can be easily imported in almost all spreadsheet softwares.
Some of the files are stored in gzip format, a common Unix compression algorithm and can be unzipped by the command:
# only unix systems # zipping by: gzip file # unzipping: gunzip file.gz
Another kind of file which include in just one element folders and subfolders is labelled by the extension tar.gz which means that all the elements are stored in just one file and could be decompressed using the command:
# only unix systems tar xvzf file.tar.gz
In Microsoft Windows systems, standard decompressing tools should be able to access these kinds of files.
We hope you find this report of help and for any suggestions or bug notification, please contact us by e-mail to seqserbioinfo_fisabio@gva.es.
By clicking on hyperlinks it is possible to navigate among the table of contents, figures, tables and bibliographic references.
For disambiguation of terminology we suggest the read of Marchesi et al. (2015).
For those analyses based on PCR reaction, we also suggest the reading of Gonzalez et al. (2012).
Main data summary
Hereafter some main project data and metadata used for the analysis.
Project base name: DemoV4
Project advisor: FISABIO Sequencing and Bioinformatics Service
Project advisor email: seqserbioinfo_fisabio@gva.es
Numeric factor(s) used:
Categorical factor(s) used: CMD.Type, CMD.Fraction, CMD.SampleID
Following table summarize the number of reads considered for each sample and its taxon distribution.
Sample | NReads | MeanLength | SDLength |
---|---|---|---|
AE-1 | 1108 | 490.00 | 95.10 |
AE-2 | 2782 | 481.18 | 112.18 |
AE-3 | 1343 | 480.93 | 112.75 |
AE-4 | 2952 | 474.37 | 117.16 |
AQ-1 | 1349 | 470.96 | 127.20 |
AQ-2 | 697 | 476.42 | 120.84 |
AQ-3 | 1172 | 479.72 | 107.16 |
AQ-4 | 1993 | 449.00 | 143.65 |
AW-1 | 3135 | 492.78 | 96.58 |
AW-2 | 2148 | 495.07 | 83.96 |
AW-3 | 1948 | 490.04 | 92.34 |
AW-4 | 3241 | 474.81 | 118.54 |
BR-1 | 2084 | 490.57 | 99.12 |
BR-2 | 3321 | 482.58 | 112.46 |
BR-3 | 57 | 342.93 | 196.08 |
BR-4 | 1837 | 410.74 | 173.01 |
BT-1 | 830 | 489.67 | 100.21 |
BT-2 | 3727 | 493.28 | 94.24 |
BT-3 | 60 | 353.77 | 196.29 |
BT-4 | 1636 | 454.02 | 140.68 |
Taxonomy reads distribution tables
This section provides taxonomy tables calculated at various taxonomic ranks: as Phylum, Class, Order, Family, Genus and Species.
NB: Species taxonomic rank is provided only if greengen database is chosen.
Only reads longer than 50nt have been considered.
Contingency and proportion tables are stored in tables and biomes main folder.
Taxon | AE.1 | AE.2 | AE.3 | AE.4 | AQ.1 | AQ.2 | AQ.3 | AQ.4 | AW.1 | AW.2 | AW.3 | AW.4 | BR.1 | BR.2 | BR.3 | BR.4 | BT.1 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Archaea;ud-Archaea | 3 | 18 | 1 | 5 | 4 | 0 | 1 | 12 | 6 | 7 | 1 | 12 | 3 | 7 | 0 | 17 | 1 | 1 |
Bacteria;Actinobacteria | 6 | 138 | 26 | 34 | 8 | 34 | 3 | 22 | 12 | 91 | 19 | 26 | 1 | 1 | 0 | 27 | 0 | ... |
Bacteria;Bacteroidetes | 55 | 229 | 16 | 54 | 110 | 37 | 10 | 17 | 262 | 81 | 21 | 45 | 841 | 182 | 0 | 132 | 367 | ... |
Bacteria;Cyanobacteria/Chloroplast | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
Bacteria;Deinococcus-Thermus | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
Bacteria;Firmicutes | 997 | 2058 | 1202 | 2710 | 1139 | 469 | 1104 | 1750 | 2751 | 1769 | 1834 | 2978 | 1122 | 2913 | 37 | 1372 | 434 | ... |
Bacteria;Fusobacteria | 0 | 18 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
Bacteria;Lentisphaerae | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... |
Bacteria;Proteobacteria | 19 | 161 | 4 | 5 | 10 | 109 | 1 | 4 | 11 | 115 | 1 | 10 | 63 | 20 | 0 | 2 | 7 | ... |
Bacteria;Synergistetes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | ... |
Bacteria;ud-Bacteria | 26 | 153 | 94 | 144 | 78 | 45 | 53 | 188 | 91 | 73 | 72 | 170 | 54 | 184 | 20 | 287 | 21 | ... |
Bacteria;Verrucomicrobia | 0 | 7 | 0 | 0 | 0td> | 1 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
NB: Family and Genus contingency tables are similarly provided but omitted in this demo.
Taxon | AE.1 | AE.2 | AE.3 | AE.4 | AQ.1 | AQ.2 | AQ.3 | AQ.4 | AW.1 | AW.2 | AW.3 | AW.4 | BR.1 | BR.2 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Archaea;ud-Archaea | 0.271 | 0.647 | 0.074 | 0.169 | 0.297 | 0.000 | 0.085 | 0.602 | 0.191 | 0.326 | 0.051 | 0.370 | 0.144 | 0.211 | ... |
Bacteria;Actinobacteria | 0.542 | 4.960 | 1.936 | 1.152 | 0.593 | 4.878 | 0.256 | 1.104 | 0.383 | 4.236 | 0.975 | 0.802 | 0.048 | 0.030 | ... |
Bacteria;Bacteroidetes | 4.964 | 8.231 | 1.191 | 1.829 | 8.154 | 5.308 | 0.853 | 0.853 | 8.357 | 3.771 | 1.078 | 1.388 | 40.355 | 5.480 | ... |
Bacteria;Cyanobacteria/Chloroplast | 0.181 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... |
Bacteria;Deinococcus-Thermus | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.032 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... |
Bacteria;Firmicutes | 89.982 | 73.976 | 89.501 | 91.802 | 84.433 | 67.288 | 94.198 | 87.807 | 87.751 | 82.356 | 94.148 | 91.885 | 53.839 | 87.715 | ... |
Bacteria;Fusobacteria | 0.000 | 0.647 | 0.000 | 0.000 | 0.000 | 0.287 | 0.000 | 0.000 | 0.032 | 0.279 | 0.000 | 0.000 | 0.000 | 0.000 | ... |
Bacteria;Lentisphaerae | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.030 | ... |
Bacteria;Proteobacteria | 1.715 | 5.787 | 0.298 | 0.169 | 0.741 | 15.638 | 0.085 | 0.201 | 0.351 | 5.354 | 0.051 | 0.309 | 3.023 | 0.602 | ... |
Bacteria;Synergistetes | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.391 | ... |
Bacteria;ud-Bacteria | 2.347 | 5.500 | 6.999 | 4.878 | 5.782 | 6.456 | 4.522 | 9.433 | 2.903 | 3.399 | 3.696 | 5.245 | 2.591 | 5.540 | ... |
Bacteria;Verrucomicrobia | 0.000 | 0.252 | 0.000 | 0.000 | 0.000 | 0.143 | 0.000 | 0.000 | 0.000 | 0.279 | 0.000 | 0.000 | 0.000 | 0.000 | ... |
NB: Family and Genus proportion tables are similarly provided but omitted in this demo.
Barplots
In this section accumulative barplots of reads distributions are shown. Columns are ordered by metadata and samples. In such a way it is possible to visualize different profiles according to the factor used for ordering columns.
Only taxa represented more than 1% are reported.
|
KronaViewer representation
The following link open a web page where all diversity is summarised using KRONA tool. Taxonomic annotation tables are parsed to this tool and an interactive viewer will show taxa distributions by samples. The tool allows browsing and zooming within each taxonomic rank. Images could be exported by clicking on Snapshot button for further uses. If you use these images, please cite Ondov, et al. (2011).
Alpha diversity
This section describes diversity indexes characterizing every sample at Phylum, Family and Genus taxonomy ranks levels. All tables can be found in the Diversity folder. Columns report for each sample:
- Diversity estimators
- Shannon Index
- Simpson Index
- invSimpson
- fisherAlpha
- Richness estimators
- Number of observations
- Chao1
- Chao1standard error
- ACE
- ACE standard error
Shannon | Simpson | invSimpson | fisherAlpha | OBS | CHAO1 | CHAO1.SE | ACE | ACE.SE | |
---|---|---|---|---|---|---|---|---|---|
AE-1 | 0.46 | 0.19 | 1.23 | 1.00 | 7 | 7 | 0.00 | 7.00 | 1.31 |
AE-2 | 0.98 | 0.44 | 1.78 | 1.01 | 8 | 8 | 0.00 | 8.00 | 0.94 |
AE-3 | 0.44 | 0.19 | 1.24 | 0.81 | 6 | 6 | 0.46 | 7.12 | 0.98 |
AE-4 | 0.37 | 0.15 | 1.18 | 0.72 | 6 | 6 | 0.00 | 6.00 | 1.15 |
AQ-1 | 0.60 | 0.28 | 1.38 | 0.81 | 6 | 6 | 0.00 | 6.00 | 1.22 |
AQ-2 | 1.06 | 0.51 | 2.06 | 1.08 | 7 | 7 | 0.23 | 8.00 | 0.97 |
AQ-3 | 0.26 | 0.11 | 1.12 | 0.83 | 6 | 7 | 2.22 | 9.18 | 1.72 |
AQ-4 | 0.47 | 0.22 | 1.28 | 0.76 | 6 | 6 | 0.00 | 6.00 | 0.91 |
AW-1 | 0.48 | 0.22 | 1.29 | 0.99 | 8 | 9 | 2.25 | 12.05 | 1.34 |
AW-2 | 0.74 | 0.31 | 1.46 | 1.05 | 8 | 8 | 0.00 | 8.00 | 1.37 |
AW-3 | 0.28 | 0.11 | 1.13 | 0.77 | 6 | 7 | 2.22 | NA | NA |
AW-4 | 0.37 | 0.15 | 1.18 | 0.71 | 6 | 6 | 0.00 | 6.00 | 0.91 |
BR-1 | 0.91 | 0.55 | 2.20 | 0.76 | 6 | 6 | 0.46 | 7.11 | 0.99 |
BR-2 | 0.50 | 0.22 | 1.29 | 0.98 | 8 | 9 | 2.25 | 12.07 | 1.45 |
BR-3 | 0.65 | 0.46 | 1.84 | 0.40 | 2 | 2 | 0.00 | NA | NA |
BR-4 | 0.81 | 0.41 | 1.70 | 0.77 | 6 | 6 | 0.00 | 6.00 | 0.91 |
BT-1 | 0.84 | 0.53 | 2.13 | 0.71 | 5 | 5 | 0.45 | 6.10 | 1.12 |
BT-2 | 0.48 | 0.22 | 1.29 | 0.83 | 7 | 7 | 0.23 | 8.11 | 1.40 |
BT-3 | 0.71 | 0.41 | 1.70 | 0.66 | 3 | 3 | 0.00 | 3.00 | 0.82 |
BT-4 | 0.66 | 0.31 | 1.46 | 0.79 | 6 | 6 | 0.00 | 6.00 | 1.15 |
NB: Family and Genus Alpha diversity tables are similarly provided but omitted in this demo.
Rarefaction analysis
Rarefaction analysis permits estimating the overall diversity covered by the obtained squences. Next picture shows the Rarefaction curves calculated for each sample.
Rarefaction calculated at Family level
General rarefaction graphs, The figures represent the number of sequences (x axis) versus the number of Families (y axis).
Global rarefaction curves
Phylum | Family | Genus |
---|---|---|
Rarefaction curves grouped by metadata.
group | Phylum | Family | Genus |
---|---|---|---|
CMD.Type-A | |||
CMD.Type-B |
etc…
Beta Diversity
Jaccard and Sorensen indexes.
Phylum
Jaccard and Sorensen distances boxplot for each group of samples. Below the figures distance matrices and paired statistics tables are also provided.
IndexesBoxplots | |
---|---|
Phylum-CMD.Type | |
Phylum-CMD.Fraction | |
Phylum-CMD.SampleID |
Following tables shows paired statistics tests. Only associations with p-values <=0.05 are reported.
metadata | level1 | level2 | index | mean1 | mean2 | wilcoxon.pr | |
---|---|---|---|---|---|---|---|
1 | CMD.Type | A | B | Sorensen | 0.8059905 | 0.5813109 | 0.0000084 |
14 | CMD.SampleID | 1 | 4 | Sorensen | 0.6588802 | 0.8000000 | 0.0002521 |
16 | CMD.SampleID | 2 | 4 | Sorensen | 0.6472727 | 0.8000000 | 0.0032936 |
17 | CMD.SampleID | 3 | 4 | Sorensen | 0.5120000 | 0.8000000 | 0.0096629 |
18 | CMD.Type | A | B | Jaccard | 0.7318783 | 0.4838542 | 0.0000084 |
31 | CMD.SampleID | 1 | 4 | Jaccard | 0.5683810 | 0.8000000 | 0.0002521 |
33 | CMD.SampleID | 2 | 4 | Jaccard | 0.5638095 | 0.8000000 | 0.0032936 |
34 | CMD.SampleID | 3 | 4 | Jaccard | 0.4400000 | 0.8000000 | 0.0096629 |
NB: Family and Genus Beta diversity data are similarly provided but omitted in this demo.
Clustering
In this section various cluster analyses are reported. A general clustering involving all samples initially reported. Clustering for each sample belonging to a group of metadata indicated by the user through the metadata file by CMD_Type, CMD_Fraction, CMD_SampleID categorical factors is carried out and figures are stored in the figures folder autodefined by the proper figure names.
NB: When running through metadata factors, if the number of samples is less than 3 clustering can not be calculated and will be skipped.
All clusterings have been obtained using distance matrices for families and genera. 1000 bootstraps have been applied to perform cluster statistics.
The dendrograms reports AU and BP scores (p-vlaues). AU p-value (printed in red color in default) is the abbreviation of "approximately unbiased" p-value, which is calculated by multiscale bootstrap resampling. BP value (printed in green color in default) is "bootstrap probability" value, which is less accurate than AU value as p-value. One can consider that clusters (edges) with high AU values (e.g. 95%) are strongly supported by data .
Red frames reports approximatively unbiased p-values greather than 95.
Phylum | Families | Genera |
---|---|---|
Clustering by factors
Following pictures show again clustering based on factors samples subgroups.
Goup | Phylum | Family | Genus |
---|---|---|---|
CMD.TypeA | |||
CMD.TypeB |
etc…
NB: Family and Genus Clustering plots are similarly provided but omitted in this demo.
CCA
Canonical Correspondence Analysis
This section provides Canonical Correspondence Analysis (CCA) explaining taxonomical data trough environmental factors. The functions are based on Legendre algorithm (Legendre, et al., 1999): in CCA Chi-square transformed data matrix is submitted to weighted linear regression on constraining variables, and the fitted values are submitted to correspondence analysis performed via singular value decomposition (svd).
Global CCA
Phylum | Family | Genus |
---|---|---|
CCA by factors
Following pictures show again CCA considering also factors-based ordination.
Factor | Phylum | Family | Genus |
---|---|---|---|
CMD.Type | |||
CMD.Fraction |
etc...
NMDS
Non Metric Multidimensional Scaling is applied as ordination method to reduce dimensions of distance matrices of samples or groups of samples. Distance matrices have been obtained using Bray-Curtis dissimilarity.
Factor | Phylum | Family | Genus |
---|---|---|---|
CMD.Type | |||
CMD.Fraction | |||
CMD.SampleID |
Analysis of variance
The analysis of variance is performed using the adonis function (Permutational Multivariate Analysis of Variance Using Distance Matrices) from the R vegdist package which performs partitioning sums of squares using semi-metric and metric distance matrices. We have applied ‘Bray-Curtis' distance. This analysis is analogous to MANOVA (multivariate analysis of variance). M.J. Anderson refers to the method as ‘permutational MANOVA' (formerly ‘nonparametric MANOVA') (Anderson, 2001). Further, as its inputs are linear predictors, and a response matrix of an arbitrary number of columns (2 to millions), it is a robust alternative to both parametric MANOVA and to ordination methods for describing how variation is attributed to different experimental treatments or uncontrolled covariates. It is also analogous to redundancy analysis (Legendre, et al., 1999).
Phylum
Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr..F. | |
---|---|---|---|---|---|---|
CMD.Type | 1 | 0.0514 | 0.0514 | 2.7380 | 0.1320 | 0.0889 |
Residuals | 18 | 0.3378 | 0.0188 | NA | 0.8680 | NA |
Total | 19 | 0.3892 | NA | NA | 1.0000 | NA |
CMD.Fraction | 4 | 0.0531 | 0.0133 | 0.5929 | 0.1365 | 0.8272 |
Residuals1 | 15 | 0.3361 | 0.0224 | NA | 0.8635 | NA |
Total1 | 19 | 0.3892 | NA | NA | 1.0000 | NA |
CMD.SampleID | 3 | 0.1780 | 0.0593 | 4.4941 | 0.4573 | 0.0020 |
Residuals2 | 16 | 0.2112 | 0.0132 | NA | 0.5427 | NA |
Total2 | 19 | 0.3892 | NA | NA | 1.0000 | NA |
NB: Family and Genus ADONIS tables are similarly provided but omitted in this demo.
Paired tests
Following tables show wilcoxon test between groups for each factors. Only associations with p-values <= 0.05 are reported.
Phylum
metadata | level1 | level2 | taxa | mean1 | mean2 | wilcoxon.pr | |
---|---|---|---|---|---|---|---|
1 | CMD.Type | A | B | Bacteria;Actinobacteria | 1.9281547 | 0.4856277 | 0.0149354 |
99 | CMD.SampleID | 1 | 3 | Bacteria;Bacteroidetes | 21.8628023 | 0.6593011 | 0.0119252 |
102 | CMD.SampleID | 1 | 3 | Bacteria;Firmicutes | 76.3832742 | 98.5750400 | 0.0119252 |
104 | CMD.SampleID | 1 | 3 | Bacteria;Proteobacteria | 1.3773228 | 0.0926583 | 0.0119252 |
105 | CMD.SampleID | 1 | 4 | Bacteria;Actinobacteria | 0.3263628 | 1.3743634 | 0.0121858 |
106 | CMD.SampleID | 1 | 4 | Bacteria;Bacteroidetes | 21.8628023 | 4.0612483 | 0.0367139 |
111 | CMD.SampleID | 1 | 4 | Bacteria;Proteobacteria | 1.3773228 | 0.2391984 | 0.0121858 |
113 | CMD.SampleID | 2 | 3 | Bacteria;Bacteroidetes | 6.5713870 | 0.6593011 | 0.0119252 |
114 | CMD.SampleID | 2 | 3 | Bacteria;Firmicutes | 83.9865679 | 98.5750400 | 0.0119252 |
117 | CMD.SampleID | 2 | 3 | Bacteria;Proteobacteria | 5.9110708 | 0.0926583 | 0.0119252 |
125 | CMD.SampleID | 2 | 4 | Bacteria;Proteobacteria | 5.9110708 | 0.2391984 | 0.0121858 |
129 | CMD.SampleID | 3 | 4 | Bacteria;Bacteroidetes | 0.6593011 | 4.0612483 | 0.0361451 |
NB: Family and Genus Paired tests tables are similarly provided but omitted in this demo.
Other methods
Library preparation
16S rDNA gene amplicons were amplified following the 16S rDNA gene Metagenomic Sequencing Library Preparation Illumina protocol. The gene-specific sequences used in this protocol target the 16S rDNA gene V3-V4 region. Illumina adapter overhang nucleotide sequences are added to the gene-specific sequences. The primers are selected from the Klindworth et al. publication (Klindworth, et al., 2013). The full length primer sequences, using standard IUPAC nucleotide nomenclature, to follow the protocol targeting this region are:
16S rDNA gene Amplicon PCR Forward Primer = 5'
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
16S rDNA gene Amplicon PCR Reverse Primer = 5'
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
We used microbial genomic DNA (5 ng/ul in 10 mM Tris pH 8.5) to initiate the protocol. After 16S rDNA gene amplification, the mutiplexing step was performed using Nextera XT Index Kit (FC-131-1096). We run 1 ul of the PCR product on a Bioanalyzer DNA 1000 chip to verify the size, the expected size on a Bioanalyzer trace is 550bp. After size verification the libraries were sequenced using a 2x300pb paired-end run (MiSeq Reagent kit v3 (MS-102-3001)) on a MiSeq Sequencer according to manufacturer's instructions (Illumina).
Quality assessment
Quality assessment was performed by the use of prinseq-lite program (Schmieder, et al., 2011) applying following parameters:
- min_length: 50
- trim_qual_right: 30
- trim_qual_type: mean
- trim_qual_window: 10
R1 and R2 from Illumina sequencing where joined using fastq-join from ea-tools suite (Aronesty, 2011)
Bioinformatic analysis
Data have been obtained using an ad-hoc pipeline written in R Rstatistics language (R Core Team, 2012), making use of several Open Source libraries such as gdata, vegan, etc. Data have been grouped and stratified according to the metadata file provided by the user.
All sequences have been maintained in one table. For each read, sequence length, taxonomic annotation, cluster of membership and user defined metadata, have been compiled.
Taxonomic annotation
Taxonomic affiliations have been assigned using the RDP_classifier from the Ribosomal Database Project (Cole, et al., 2009). All data have been tabulated. Reads with RDP score value below 0.8 have been assigned to the upper taxonomic rank leaving last rank as unidentified (e.g. Firmicutes, Bacillales, Bacillaceae, Unidentified Bacillaceae).
R packages
The pipeline core runs by R Rstatistics language (R Core Team, 2012). All calculation and statistics have been carried out within this environment. Here a list of the used packages and their references.
- lattice: graphics for R (Sarkar, 2008).
- knitr, knitcitations, markdown: report and reference environment (Xie, 2014; Boettiger, 2014; Allaire, et al., 2014).
- Bioconductor packages for genomics data: Biostrings.
References
Allaire J, et al. markdown: Markdown rendering for R. R package version 0.7.4. 2014. URL: http://CRAN.R-project.org/package=markdown.
M. J. Anderson. "A new method for non-parametric multivariate analysis of variance". In: Austral Ecology 26.1 (2001), pp. 32–46. ISSN: 14429985. DOI: 10.1046/j.1442-9993.2001.01070.x. URL: http://doi.wiley.com/10.1046/j.1442-9993.2001.01070.x.
E. Aronesty. ea-utils: Command-line tools for processing biological sequencing data. ea-utils: FASTQ processing utilities. 2011. URL: http://code.google.com/p/ea-utils.
C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0.5. 2014. URL: http://CRAN.R-project.org/package=knitcitations.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009 Jan;37(Database issue):D141-5. doi: 10.1093/nar/gkn879. Epub 2008 Nov 12. [PubMed Central:PMC2686447] [DOI:10.1093/nar/gkn879] [PubMed:19004872] , pp. D141–145.
Klindworth A, Pruesse E, Schweer T, Peplies J, Quast C, Horn M, Glöckner FO. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 2013 Jan 7;41(1):e1. doi: 10.1093/nar/gks808. Epub 2012 Aug 28 [PubMed Central:PMC3592464] [DOI:10.1093/nar/gks808] [PubMed:22933715] , p. e1.
Legendre, P, et al. (1999) "DISTANCE-BASED REDUNDANCY ANALYSIS : TESTING MULTISPECIES RESPONSES IN MULTIFACTORIAL ECOLOGICAL EXPERIMENTS". In: Ecological Monographs 69.1, pp. 1–24. ISSN: 00129615. DOI: 10.2307/2657192. URL: http://www.jstor.org/stable/2657192?origin=crossref.
Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011 Sep 30;12:385. doi: 10.1186/1471-2105-12-385. [PubMed Central:PMC4520061] [DOI:10.1186/s40168-015-0094-5] [PubMed:26229597] 12:385.
Marchesi JR, Ravel J. The vocabulary of microbiome research: a proposal.Microbiome. 2015 Jul 30;3:31. [PubMed Central:PMC4520061] [DOI:10.1186/s40168-015-0094-5] [PubMed:26229597] , p. 31.
R Core Team. R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0. R Foundation for Statistical Computing. Vienna, Austria, 2012. URL: http://www.R-project.org/.
Sarkar D. Lattice: Multivariate Data Visualization with R. ISBN 978-0-387-75968-5. New York: Springer, 2008. URL: http://lmdvr.r-forge.r-project.org.
Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011 Mar 15;27(6):863-4. . [PubMed Central:PMC3051327] [DOI:10.1093/bioinformatics/btr026] [PubMed:21278185] , pp. 863–864.
Xie Y. Dynamic Documents with R and knitr. ISBN 978-1482203530. Chapman and Hall/CRC, 2014. URL: http://yihui.name/knitr/.