Metabarcoding Applications for biodiversity monitoring and invasive species detection

Okay, so thank you very much for inviting me to speak here; it's a great pleasure to be here. I'm Emily, I'm working in Melania Cristescu's lab at McGill University, and I'm working for CAISN, the Canadian Aquatic Invasive Species Network. I'll be talking to you about metabarcoding and its applications for describing biodiversity and also identifying aquatic invasive species, and if I say AIS, that's what I'm referring to.

I'll start with a quick overview of the talk. I'll give an introduction to metabarcoding and something that's called OTU clustering, and then talk about how we've used mock zooplankton communities to validate the metabarcoding method, and how we're applying what we've learned there to the analysis of a bigger data set: CAISN sampled 16 ports around Canada with the aim to survey biodiversity and detect AIS. Then I'll wrap up with conclusions.

I don't know how many people here are aquatic biologists, but our aquatic environments are threatened by invasive species, and early detection has been recognized as being of great importance. It's been shown to be easier and more cost-effective to control invasive species before they become well established, so we need methods of detection that are capable of identifying species that are potentially at very low abundance, because invading species may first occur at low density. Traditionally, aquatic species would be identified using keys: for example, you'd examine an individual under the microscope and use diagnostic morphological criteria. But these keys are often based on adult criteria, so it's hard to identify, for example, larval forms; the nauplii of copepods are notoriously difficult to identify. It's also problematic for morphologically cryptic species. In addition, it's been shown that the sampling efforts typically employed are insufficient to identify species unless they're at medium to high density, so, as I said, this could be problematic for invading species that are at low
abundance. So there's an increasing interest in finding alternative methods that give us greater detection probability, and that's where metabarcoding comes in.

Metabarcoding is really a way to rapidly survey an environment. You could go out, for example, and sample a water body; you would take a subsample of that and do a DNA extraction, so you're aiming to extract everything that's there. Then you amplify an area of the genome that you're interested in, and this is your barcode region. So this is DNA-based identification, similar to traditional barcoding, but the difference here is that you're working on this very complex community, and the idea is that the information contained in this barcode region informs you of what species you have in your sample. This PCR-amplified community is then sent for sequencing. We used pyrosequencing, which is just a type of high-throughput sequencing, so it generates lots and lots of data. Traditionally these data are then clustered into what are known as operational taxonomic units, and the idea is that these units, these OTUs, correspond to species; I will explain a little bit more about that shortly. At this point you could just stop and report the number of OTUs, or, if you had a database available, you could BLAST the OTUs against that to actually identify them taxonomically.

Before I explain OTU clustering, I just want to point out why it's important that we validate this method and assess this assumption that OTUs reflect species well. This shows data that was generated by a colleague for Hamilton Harbor. You have the number of species plotted for two key groups, Copepoda and Cladocera; in green you have the number of species detected with traditional taxonomy, and in blue the number with metabarcoding. You can see that you get vastly more species with metabarcoding, and this is not unusual: many other researchers have reported estimates of biodiversity orders of magnitude
higher. So there's been lots of debate about this, trying to understand: have we been massively underestimating diversity, or is there something going on with the method?

So, just to give you a moment to digest that, I'll explain more about how OTU clustering works. You have this PCR-amplified community, so you have these barcode sequences, and OTU clustering works on sequence divergence. You pick a sequence divergence threshold, and that defines your unit. Often people pick 3%, so if sequences are more than 3% different they get pulled out into different clusters. OTU clustering usually starts with the most abundant sequences, and in the end you take the consensus sequence, and that's what you would BLAST against your database; you end up with a consensus for that whole unit.

But this sequence divergence threshold is often seen as quite arbitrarily defined, and there are a few issues that could complicate its use. Researchers often tend to focus on hypervariable regions. We are looking at, for example, the V4 region of 18S, and this is the most hypervariable region of 18S; it's been shown to have high levels of intragenomic, intraspecific and interspecific variation. So if you're looking at a community of very taxonomically diverse individuals, the level of variation in that marker may be quite variable, and picking a 3% divergence for everything may not be appropriate.

You can envision a couple of different scenarios. The perhaps naive assumption is that if you have two species you should have two OTUs generated. But if these two species are very closely related and there's less than 3% divergence in that marker region, then the sequences produced by those species may actually join together and form a shared OTU. So if you just use OTU number you're underestimating numbers, and when you come to BLAST you still may not identify the
other species. On the other hand, if you have a species that has very high levels of variation in this region, you may actually find that at the 3% level you generate more than one OTU. So you could have multiple OTUs for one species, and in this case you would actually overestimate diversity.

The other thing to consider is that you have what are known as singletons in your data. The plot I showed, where you had much higher diversity with metabarcoding: those kinds of discoveries have sparked a lot of debate about the rare biosphere, with the idea that this new diversity is mostly contributed by species that are at very low abundance. This is supported by the fact that often these species occur in our data only once. Singletons are reads that occur once across your whole data set. It could be that if a species is genuinely really rare in your sample, you just happen to sample it once and you get one sequence generated. On the other hand, during PCR and sequencing you can have errors and artifacts occurring, and given that they're random events they'll likely happen just once, so they would also generate a singleton. So there's debate about how to treat this type of data.

Now I'll move on to how we've used the mock community method to try and address these issues. I built mock zooplankton communities with set numbers of identified species. The idea here is that you know how many individuals and species you included in the community, so when you come to do the analysis you should know what to expect to find in terms of number of OTUs. I included both closely related and more distant species, with the idea of seeing whether we can distinguish closely related species based on our marker region. I also included both single individuals and populations, and the idea here is that if we include multiple individuals per species, and so potentially introduce
intraspecific variation, does this complicate OTU clustering? So the aim was to see whether we had a correspondence between OTU number and expected species, and could we detect all species.

Just to explain a little bit more about the method: we built two communities. The first one had one individual per species, so 69 species in total. I made a pooled community where I mixed together the individuals, DNA extracted and then PCR amplified. But I also used a method where I took single individuals of different species and processed them individually, so individual DNA extraction and PCR amplification with tagged primers, which means that when you analyze your data you can separate that individual out. You can really see what sequences were generated by that individual and OTU cluster that data. Then I used that same approach for a second community which, instead of having single individuals, had multiple individuals per species, so I had 276 individuals of 14 species. Again there was a pooled community where everything was mixed together, but then I had these populations that were separately DNA extracted and PCR amplified with the tagged primers. Where possible I had populations of different sizes for the same species, so again looking at what effect it has if we introduce multiple individuals.

We analyzed the data using UPARSE, which is a relatively recently developed algorithm, and the default threshold that it uses is 3%, so we started by analyzing the data that way. Just to summarize what we had: single individuals of 72 different species and populations of 13 species. We found that between 33 and 84 percent of those individuals and populations generated a single OTU, so you could argue that in the majority of cases there was a good correspondence, but in some cases we had to use more than a 10% divergence threshold to get a single OTU. We also found that, in general, including singletons in the analysis, so those
sequences that occur once, resulted in more OTUs per species, so it generally complicated the correlation. When we had multiple individuals per species the results were less clear. We would have cases where multiple OTUs were generated for populations, but there wasn't a clear pattern that when you had more individuals you got more OTUs; it seemed more sporadic. We also found that we couldn't distinguish closely related species: for example, for the genera Artemia, Daphnia and Gammarus we had closely related species and we weren't able to distinguish those in the data.

So, just to summarize, there are a few things going on. First of all, this highlights how important it is to consider the amount of resolution in the marker, the barcode region, that you use. As I said, for Daphnia we had a couple of closely related species and we found that they were joining together to form a single OTU, and this was also the case for some Balanus species. If you take the reference sequences from NCBI, for example, for your species and you see that you have less than 3% divergence, then using a 3% divergence threshold you're not going to be able to distinguish them, so you have this one OTU forming.

On the other hand, this also shows the importance of having reference databases. With Corbicula fluminea, the Asian clam, we had multiple OTUs being generated for a single individual or for multiple individuals. If we just stopped with OTU number and didn't have the tagging method, we might interpret these as different species, but because we can BLAST them against a database we can see that they're all matching Corbicula. And as I said, in the number of OTUs generated by the populations of different sizes there's not an obvious trend, so this could be more just like sampling: if these are genuine alleles, just random
sampling across populations, if there is intragenomic variation.

We also found that although singletons often resulted in more OTUs per species, sometimes they did allow the identification of what could be rare species. For Brachionus, a rotifer, when we analyzed the data without singletons it didn't generate an OTU, but when we included them we had an OTU that was formed of four singleton sequences. So this individual was at very low abundance, and when we included the singletons we were able to detect it. This potentially lends support to the argument that singletons could be genuine sequences when you're analyzing natural data.

All of this has really led us to think that maybe what we should be doing is just skipping OTU clustering. OTU clustering is useful when you don't have taxonomic information available, because you're still able to describe your community in some way. For example, if you're analyzing a bacterial community that has not been described at all, then you have to use some kind of proxy for species; you're just describing genetic diversity in the sample. But in cases where, for example, you're detecting aquatic invasive species and you have a target list of species, if you make sure that you have reference sequences available for those, then why not just BLAST your reads against them instead of using OTU clustering?

So now I'm going to talk about how we've gone on to analyze a bigger, natural port data set. As I said, CAISN sampled 16 ports around Canada, so four ports in each of four different geographic regions: the Pacific, Arctic, Atlantic and the Great Lakes. All those different ports generated over seven million sequences, and if we just BLAST those, without OTU clustering, against an 18S database, we find that we can taxonomically classify 95% of those reads to the family level. This is actually a figure produced by a colleague who I'm working with, but it's just to show you how, without OTU
clustering, you can get at least a family-level estimation of diversity, and you can see the samples are mostly composed of crustaceans. And then, just to show as well how we get patterns when we look at the different geographic regions: we have patterns that kind of make sense, in that the four ports within a region show similar patterns of diversity. I know these figures look kind of complicated, but the different colors represent different genera; it's just showing how for the different regions you get different patterns of diversity coming out.

Then, moving on to detecting aquatic invasive species. The way we did that was we compiled a list of invaders from online resources and the literature, and I checked to make sure that those species had 18S sequences available in databases. Then we BLASTed the reads that were generated for the ports against the reference database and picked out the invaders. We found 23 AIS, and when I say that, those are the ones that were unambiguously detected, because there were some cases where a single read would BLAST against multiple species with equivalent BLAST score, so based on the BLAST result you can't distinguish them. So the 23 are the cases where those reads BLASTed uniquely against that species. There are cases where we cannot say for sure that we've detected an AIS, because it could be a closely related species. For example, for Mya arenaria, the softshell clam, the reads also BLAST against two other species equally well.

We also tested OTU clustering, just to see how the results would differ, and we found that when we OTU clustered, nine of those AIS were no longer detected. For a number of those species I looked at the reads and mapped them back to which OTU they joined, and I found that they were forming OTUs with closely related species. Just to explain more how this is a problem: so for
example, if you have Mya arenaria reads, and then you have reads of a closely related species that could be more abundant: as I said, OTU clustering generally starts with the most abundant reads, so Mya arenaria might join this OTU, but because it's not the most abundant, when you make your consensus sequence the other, closely related species will be represented by the consensus. So when you BLAST it you'll detect that species, and this masks the presence of Mya arenaria.

On the other hand, you could have scenarios, like I said, where even when you're BLASTing reads, based on the BLAST results you cannot distinguish them. This is the case where your barcode is not giving you enough resolution in that group. I know these trees don't look particularly nice, the colors are kind of funny, but this is just to show that in some cases taking a phylogenetic approach can help you to be more certain about what species you have. In this case, Cercopagis is a species that's closely related to Bythotrephes, so they're more than 90 percent similar, but when you make this tree you can see that the Bythotrephes sequences do separate out. So looking at this you can be more confident that the reads you have do actually fit with Cercopagis. But then you can have other scenarios, for example Carcinus maenas, the European green crab. If you include reference sequences for closely related species, a Japanese crab and the Atlantic blue crab, it's more of a mess; your reads are also clustering close to those species. We found that, particularly for crab species, it was very difficult to distinguish different species with our marker. So again, if you were interested in those species you'd potentially have to think about a different region.

So now I'll just conclude everything. We've seen these two different problems, the collapsing and oversplitting issues. Both of these could result in a
miscalculation of diversity, so an overall over- or underestimation of diversity, and this can be particularly problematic for detection of aquatic invasive species, when you really want to be sure which species you're identifying. You don't want to report an invasive species somewhere when it's not really there, but at the same time you don't want to miss it. So this all highlights the importance of growing these well-developed molecular databases. It's really important to have reference sequences to be able to BLAST your OTUs or reads against, so there needs to be a really concerted effort to make sure that we have well-developed databases. And also, when you're designing these kinds of experiments and surveys, really think about the marker, the barcode region that you're using, and how much resolution you need in different groups. The benefit of something like 18S, for example, is that it's quite easy to get it to amplify across many different groups. COI has been used lots in barcoding, but it's problematic to get it to amplify across very diverse groups. So potentially you could envision a broad approach where you use a broadly amplifying primer, and then if you get, for example, Carcinus maenas coming up, you could use a more targeted primer for that group to really confirm that you have that species. In our lab there is a master's student who is working on developing primers for COI, potentially using more of a cocktail approach where we have primers specific for different groups.

So I'll finish there by saying thank you to my supervisors and collaborators, and to the people that contributed samples for those mock communities, and of course to McGill and NSERC CAISN, and again [unclear] today. Thank you.

Does anybody have any questions for Emily?

Thanks, Emily. Are you interested in abundance at all, or in trying to work out the abundance from your barcode?

Ideally we
would like to, but I think it's very difficult to do that. You could use read depth or number of reads as some estimation of abundance, but I think there are so many different things that could affect that, like primer biases, and it doesn't necessarily correspond to actual biomass. But it would be really interesting to try and investigate that further.

The big one is copy number variation in your ribosomal gene, because that can vary quite wildly. So, a second question: when you said that you had identical BLAST hits, what was that based on?

Score, actually. We filtered our BLAST outputs using a minimum percentage identity of 97, and we also filtered by minimum length, but then when it came to actually distinguishing top hits we based that on score.

Because one thing you could do, because you have a database of all of the sequences, is try and work out what the characteristic SNPs are in there and use those to try and guide those more difficult decisions.

Yeah, I guess that links to the kind of phylogenetic approach, how when you look in more detail at the sequences you can then see differences. So yeah, that's a good point.

Any other questions? All right, well, let's thank Emily again.
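
As a rough illustration of the collapsing and singleton effects described in the talk, here is a minimal sketch of greedy, abundance-sorted clustering at a 3% divergence threshold. It is not UPARSE or the pipeline used in the study. The sequences are made-up, pre-aligned toy strings; the function names (`divergence`, `greedy_otus`, `mutate`) are hypothetical; and the most abundant sequence stands in for the consensus of each OTU, which is a simplifying assumption.

```python
from collections import Counter


def divergence(a, b):
    """Fraction of mismatched positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects pre-aligned sequences of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.03, keep_singletons=True):
    """Cluster reads greedily, starting with the most abundant unique sequence.

    Each sequence joins the first OTU whose representative is within
    `threshold` divergence; otherwise it founds a new OTU.
    """
    counts = Counter(reads)
    if not keep_singletons:
        # Drop sequences seen only once across the whole data set.
        counts = Counter({seq: n for seq, n in counts.items() if n > 1})
    otus = []
    # Sort by decreasing abundance; ties are broken alphabetically for determinism.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for otu in otus:
            if divergence(seq, otu["centroid"]) <= threshold:
                otu["members"][seq] = n
                break
        else:
            otus.append({"centroid": seq, "members": {seq: n}})
    return otus


def mutate(seq, positions):
    """Return a copy of seq with a substitution at each given position (toy helper)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


# Hypothetical 100 bp marker sequences, invented for illustration only.
abundant = "ACGT" * 25                              # common species
rare_relative = mutate(abundant, [5, 55])           # close relative, 2% divergent
distant = mutate(abundant, range(0, 100, 10))       # unrelated species, 10% divergent
singleton = mutate(abundant, [1, 21, 41, 61, 81])   # one read, 5% divergent

reads = [abundant] * 50 + [distant] * 20 + [rare_relative] * 3 + [singleton]

with_singletons = greedy_otus(reads)
without_singletons = greedy_otus(reads, keep_singletons=False)

print(len(with_singletons), "OTUs with singletons")
print(len(without_singletons), "OTUs without singletons")
# The rare relative is absorbed into the abundant species' OTU.
print(rare_relative in with_singletons[0]["members"])
```

In this toy data the rare relative never gets its own OTU, because it sits within 3% of the abundant sequence that represents the cluster, so identifying the representative alone would miss it. The single 5% divergent read forms its own OTU only when singletons are kept, and whether that read is a rare species or an error cannot be decided from the clustering itself.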