WDM 7 Processing AND OR NOT queries

now having seen what an inverted index is let's go back to our query Brutus and Caesar and not Calpurnia and let's try to see how we can use the inverted index that we just discussed to answer this query in fact let's start from an even simpler query just an and query of Brutus and Caesar so consider processing the query Brutus and Caesar how do we use the inverted index the first thing we need to do is locate Brutus in the dictionary and retrieve its postings list and as I said the dictionary can be implemented in a variety of ways you could use a sorted array could use b-trees hash tables and so on but whatever implementation you use you can take this term Brutus and search for it in the dictionary and you will either find it or you won't find it if you don't find it it's as good as saying that there is no postings list for this term or saying that the postings list is simply empty but if you do find Brutus in the dictionary then you will also be able to follow the pointer to its postings list and retrieve the postings list from the disk into main memory for Brutus likewise we will take the second term which is Caesar search for it in the dictionary follow the pointer to the postings list on disk if it is one disk and bring it into main memory of course if the entire index is already fitting into main memory then we don't need to do that we just need to retrieve the postings from wherever it's located now once we have both these postings lists available to us we need to find their intersection that is we need to find which dock IDs are common to both Brutus and Caesar those are the documents which contain both the terms Brutus and Caesar right recall that in the postings list you have a sorted sequence of dock IDs indicating which documents contain this term so if you want to find out which documents contain both these terms you have to search for documents that are common to both these postings lists so how do you take the intersection of these two sorted lists in fact now you can probably understand why the postings list as it is sorted in increasing order of dock ID that's because it's very easy to intersect these two lists for queries like this so how do we perform this intersection operation sometimes it's called the merge operation because if you think about it it's very similar to how you would merge these two sorted lists in the merge step of merge sort I'm assuming here that you have encountered the merge sort algorithm before in either of data structures or an algorithms class and you would recall that the merge sort is a divide and conquer algorithm which divides an array which is to be sorted into two halves or two parts it divides the overall problem into two subproblems it recursively conquers both the subproblems and then you need to come combine the solutions to the two subproblems so once you have both halves of the array sorted then you can merge the two sorted the two smaller sorted arrays into the overall sorted sequence and that merge step is something that I'm assuming you're familiar with here if not you can go back and look at the merge sort algorithm we also have videos of the merge sort algorithm online on our website you can check them out if you are not familiar with the algorithm you can also try to see if you can understand this pseudocode directly this is the pseudocode for taking the intersection of the two postings lists so what we have initially is an empty answer list so the answer list is going to contain the documents that are or the dock IDs that are common to both the lists and initially this answer list is the empty list now you also have two pointers p1 and p2 pointing to the heads of both the postings lists and what we are going to do in the course of this merge step is we are going to keep advancing these pointers p1 and p2 we are going to walk through both these lists in a particular order so that when both the pointers have reached the ends of the lists the answer list is going to contain the intersection of the elements in both the lists that is it's going to contain the dock id's of the documents which contain both these terms so let's say that p1 and p2 are pointing to some location within the list initially they are pointing both of them are pointing to the first element in each step what we will do is we'll check if the dock ID that p1 is pointing to is equal to the dock ID that p2 is pointing to if if the answer to this is yes that means that this particular dock ID whatever that value is of the dock ID that both are pointing to that particular document contains both the terms and so it really belongs to the answer list it should be appended to the answer list so we will append that particular dock ID which we can call a stock idea of p1 or we can call it as da value of p2 because both are the same will append it to the answer list and having done that we will increment both p1 and p2 to point to the next element in their subsequent in in their respective lists if p1 and p2 are not pointing to the same values then there are right there are only two possibilities either v1 is pointing to a lower value that is dock idea of p1 is less than dock idea of p2 in which case we will increment p1 to point to the next element else p1 may be pointing to the larger element so dock idea of p1 may be greater than dock idea of V 2 that is this else condition and if that's the case then we will simply increment p2 to point to the next element so you can see that we are trying to walk these through pointers these two pointers through these lists in such a way that they are always trying to keep up with the value that the other pointer is pointing to so let's run this pseudocode in through this particular example so initially p1 is pointing over here e 2 is pointing over here now p2 is pointing to the smaller element which is 1 so what we will do is we will increment so dock I D of P 1 is greater than dock idea of p2 which means we will increment e 2 and make it point through this element now dock idea of p1 and dock idea of p2 are equal right so this first condition is true this means that the second document contains both these terms and so we will append the second that we will append the value 2 into our answer list so this if this is our answer list it's initially empty and now we've just added two to it and having added two we will increment both these pointers to point to the neck there next elements now among these two P two is pointing to the smaller element and so we will increment V 2 so P 2 will point to 5 now now 4 is less than 5 so we will increment P 1 to point to the next element which is 8 5 is less than 8 so we will increment V to 0.28 now P 1 and P 2 are both pointing to the same element to the same value sorry which means that we should take that value which is 8 and append it to her answer list and then increment both the pointers now P 2 is pointing to a smaller value 13 is less than 16 so we will increment P 2 now P 1 is pointing to a smaller value because 16 is less than 21 so we will increment V 1 21 is less than 32 so we will increment V 2 now 32 is less than 34 so we will increment P 1 34 is less than 64 so we will increment P 2 and now we find that P 2 has crossed the end of the second list that means this condition is no longer true so we need to repeat this loop as long as both P 1 and P 2 are pointing to some valid element but if either of them crosses the end of their respective lists then we exit this loop and that's because once we have reached the end of one of the lists clearly there are no other elements in that list 2 to explore which means there are there are going to be no further elements added to our answer list because if there are no more elements to consider from one of the lists clearly there there can't be any more elements in their intersection so we stopped there now you could have noticed that this is pretty similar to the merge step of merge sort because in the merge step what we did was we had to sort these two sequences into well these two sequences are already sorted but we had to merge them into an overall global sorted sequence so there we would we would argue that the smallest element globally would be either this element or this element because for these lists are sorted and so the smallest element must either be at the head of the first list or the head of the second list so we would choose whichever is the smaller of the two and say that that is going to be our first element in the sorted array and then we would take it out put it there and increment the pointer to point to the next element and then we would say that the second smallest element must be one of these two and in this way we would continue walking our two pointers through these two lists in a very similar way that we did over here except that we would be a appending every element that we are seeing to our answer list which is going to be an overall sorted sequence so with minor variations this merge procedure is pretty similar to the merge step of merge sort now notice that if the lengths of the two lists are x and y we took only a single but we need to make only a single pass through both these lists in order to generate the result which is the intersection of these two lists so if the list lengths are x and y the merge takes time filter off or Big O of X plus y and what was important here for this algorithm to work was that the postings lists had to be sorted by doc IDs if the postings this had not been sorted by doc IDs then when I look at the first element of say the first list I don't know whether this element is present in the second list or not I would have to do an entire linear scan through the second list in order to figure out whether or not two is present but if both the lists are already sorted then I could follow this kind of a walk through both these lists so that if I'm pointing at two you know in one of the lists on the other list I'll be somewhere near two if it contains a - then I'll be exactly at two if it doesn't contain two I'll either be one element before or maybe one element after two depending on you know very exactly I'm located in this algorithm within this pseudocode I can imagine tweaking this procedure to handle and/or query we just handled an an and query if we now know how to calculate Brutus and Caesar we take the postings list for both these terms and then we intersect them according to this procedure now what if we had to take Brutus or Caesar or how would we change our procedure to handle and/or query we would do something very similar to what we just did instead of appending only those doc IDs that are present in both the lists we would append every single dog ID that we are seeing into our answer list and of course if both the doc IDs are the same then we won't offend two twice we would just append it once and then increment both the pointers so so we would not just add the value to the answer list if this condition is true we would also add the value that we are seeing to the answer list over here and over here so Long's along with these two steps we would have append doc idea of p1 to the answer list over here and append doc idea of p2 to the answer list over here apart from incrementing the pointers and when we exit this while loop so we could exit this while loop when only one of the pointers has crossed beyond the boundary of one of the lists so we have to take the remaining elements of the other list and append them to the answer list as well because every single that element that's every single value that's present in these two lists is going to be present in their union when we take the union of for these two both these lists so we'll need to add some extra code over here to say that take the remaining elements of the other list the one that the lists other than the one that we just crossed and append their remaining elements to the answer list so that would be how we would do an or operation how about an odd operation that is how do we know which documents don't contain the word Brutus if we have the postings list for Brutus we can easily calculate which documents don't contain the word Brutus because those would be the documents whose doc IDs do not appear in that postings list so if this is the postings list for Brutus we would start from the beginning and we would append to our answer list all the doc IDs that we are skipping as we traverse this single list so since we are starting from two we would append one to the answer list then when we jump from two to four since we jumped over three we'll add three to her answer list when we jumped from four to eight we will add five six and seven to our answer list when we jumped from eight to sixteen we will add the numbers from 9 through 15 to our answer list and so on so we can generate another postings list with all the doc IDs that are missing from this list and that would be the answer to the query not Brutus I will leave this as an exercise for you to figure out whether or not or how exactly we would answer queries like this what you need to keep in mind is that as far as possible you want to do this efficiently so you want to do this in time proportional to the lengths of the two lists the or operation can be done in time because X plus y because it's exactly like the and operation with some minor tweaks here and there this may not be doable in linear time because if the length of the Brutus list is small and if the corpus size is large then so let's say that we have a total of D documents in the corpus and let's say that the length of the Brutus list is small B then the length of the postings list for not Brutus is going to be D minus B because we have all the documents that are not that don't contain the word Brutus in the postings list for not Brutus and so this need not be need not be doable in linear time because since we have to add to our answer list every single doc ID that we are skipping in the Brutus list it's possible that D minus B may be much much much larger than B so this is not doable in linear time at least not in the worst case so an and and an all query can be done in time proportional to the lengths of the two lists but an odd operation can't be and what you need to figure out over here is whether or not and and not operation or an or not operation can be done in time proportional to the length of the true list that is in linear time

Loading