WDM 18 Tokenization Process in an IR System

Now that we have defined what a document unit is, and now that we have extracted the character sequence, we can think about tokenization. So what is a token? A token is just a particular instance of a sequence of characters. For example, "Friends" is a token here: it's a particular instance of a sequence of characters from this document. If there were two occurrences of the word "Friends" in this document, two tokens would be generated, each having the same string, "Friends". But when we talk about a term, we are talking about an entry in the dictionary of your inverted index. Both those instances of "Friends" would map to the same term in the index, and that term would be "friend". So the term is what is stored in the dictionary, and the token is the output of the tokenizer. These two are different; keep that in mind as we discuss this section.

The simplest example of tokenization would be just to split on whitespace and throw away punctuation marks, and in this case those are the only two things the tokenizer has done. Each token generated by the tokenizer is then going to be normalized in the linguistic module that comes further down the pipeline. We're going to look at that later; right now let's just look at how the tokenizer works.

You may think that tokenization should be simple, because what is there to be done? But let's look at some practical issues that come up if you try to tokenize documents in English, and also in some other languages. Consider the phrase "Finland's capital". How would you tokenize the word "Finland's"? Would you tokenize it as "Finland", "Finlands", or "Finland's"? In other words, how would you deal with the apostrophe? What do you think makes sense among these three possibilities? Anybody want to guess?

By the way, the other thing you need to keep in mind here is that the same processing that is done on the documents needs to be done on the query. We discussed that in Chapter 1: we saw that if you do different kinds of processing on the query and the documents, you are going to mess up your results. Do you remember the example we took at that time? Let's say you convert plurals into singular when parsing the documents, but you don't do that in your query. In that case, if your query contains a word in the plural, that word will not map to any term in the index, because all plurals were changed into singular when you built the index. All the terms in your dictionary are in the singular: singular words were singular anyway, and plural words were converted into singular before they were stored in the dictionary. So if your query contains a word in the plural and you don't normalize it in exactly the same way, you're not going to get any results for that query. It's important that the tokenization and normalization you do is the same for the query and the document.

Keep that in mind when you think about what makes sense here. Think about the kinds of queries that could come to the search engine, and think about how a decision to tokenize in a particular way would affect the results returned for the query. Let's look at a query like "Finland". If the query is "Finland", do you think this document needs to be returned? It refers to Finland's capital, so yes, it is relevant in some way. But let's say we keep this string as it is in the tokenizer: the tokenizer doesn't do anything to the string, and "Finland's" is the term that gets stored. Of course we're going to do some other processing as well, but for now just imagine that whatever the tokenizer outputs goes into the index. Then when a query comes for "Finland", this
document is not going to match it. Let's look at the alternative: if you just remove the apostrophe, you will be storing "Finlands", so when a query comes for "Finland", again this document is not going to be returned in the result. So you can see that the way you decide to treat apostrophes can impact the performance of your system. If you store it as "Finland", then the document will be returned. This is the option that makes sense: you probably want to tokenize "Finland's" as "Finland". Is that clear?

But look at what happens if you generalize this. Let's look at a word like "aren't". In "Finland's" the apostrophe indicates possession: "Finland's" means something belonging to Finland. The apostrophe in "aren't" is used just to shorten the phrase "are not" into "aren't". If I use the same sort of tokenization that I used for "Finland's", then this would get converted into "aren". Now if your query is "aren't", would this document be returned? Yes, it would be, if the query was processed in the same way, which it should be. If "aren't" is in the query, it would also get converted into "aren", and so a document containing "aren" would be returned in the result.

Where else do you see apostrophes? Let's look at, I don't know if I got the name of this cricketer right, Kevin O'Brien. Suppose we apply the same technique. We would get "Kevin" as a separate token. How would "O'Brien" be tokenized? It's not clear. If we were just removing everything after the apostrophe, along with the apostrophe, then this would get tokenized as "O", and that would be a disaster. So you can see that this generalization wouldn't work in the case of names. These are three examples of the apostrophe being used in three different ways in the English language, and something that works for one use of the
apostrophe wouldn't work for another use. So dealing with apostrophes is itself a non-trivial problem. You have to decide whether something is a name or not, and if so, you would tokenize it in a particular way. You have to decide whether the apostrophe is being used to connote possession or to shorten a particular phrase, and in that case you would tokenize it in a different way. Any questions on this?

How would you tokenize "Hewlett-Packard"? The question here is how you would deal with the hyphen. The earlier examples were about the apostrophe; these examples are about dealing with the hyphen. One possibility is to split "Hewlett" and "Packard" into two tokens. What would be the pros and cons of this? Can you think of a potential problem that could result as a consequence of splitting it into two tokens?

[Student] Maybe there is a person with the name Hewlett, and if you split up "Hewlett-Packard" you may get confused.

Can you repeat what you said? I heard the first part but I couldn't hear the second part.

[Student, partly inaudible]

What if you don't split it and keep it just like this? OK, maybe this is an easier question: suppose you just tokenized it as "Hewlett-Packard", with the hyphen. What would be the problem? By the way, when I ask what the problem would be, I am not trying to say that this is necessarily a bad thing to do. Maybe I should ask whether there would be any problems if I do this; I don't want you to think that it will necessarily be a problem.

[Student] Sometimes the user may type in a query without the hyphen.

Yes, exactly. If somebody just types in "Hewlett Packard" with a space, then this document would not be
returned; it would not be considered relevant. That won't be very good. But it seems to work out if you make "Hewlett" and "Packard" two separate tokens: whether the query contains "Hewlett Packard" with a space, or "Hewlett-Packard" with a hyphen, or even just the word "Hewlett", this document would be returned.

OK, so we were talking about how to handle hyphens in tokenization, and we saw that if you have a compound name like "Hewlett-Packard", we can split it into two tokens, and maybe that would handle all kinds of queries: "Hewlett-Packard" with a hyphen, "Hewlett Packard" with a space, or "Packard Hewlett". So that would be fine, although you may want to think about whether, if the query had been "Packard" followed by "Hewlett", you necessarily want documents containing the hyphenated "Hewlett-Packard" to be returned. You need to think about that, because maybe "Hewlett-Packard" is not referring to the two people but to the company, and in that case, if somebody types a query like "Packard" followed by "Hewlett", maybe the company is not what they're looking for. Maybe they're looking for information about both these people. Let's say they type in the full names, Dave Packard and, I forget the first name, Hewlett; then "Hewlett-Packard" is probably not what they are looking for.

So those are some issues to think about: what are the different kinds of queries that could come, would documents containing this particular string be relevant to that query or not, and what would a typical user have in mind when submitting such a query? These are issues to think about carefully before you decide to tokenize one way or another.

Another use of the hyphen is to connect together words
which are really different words, but which you don't want to be connected to the word that immediately follows, which is why you would add hyphens in between. Take "this is the state-of-the-art technology": you don't want "art" to be combined with "technology" and "art technology" to be thought of together; you want "state of the art" to be thought of together, so you add these hyphens. For such a string it may make sense to break up the hyphenated sequence, and that should handle various queries.

Another use of the hyphen is between two vowels, as in "co-education", where you have a hyphen between "co" and "education". You may want to preserve the hyphen in this case, because "co-education" is the proper word and somebody searching for co-education will probably enter the hyphen. If you think that they won't, if you think that they will write "coeducation" as a single word, then you want to somehow ensure that the hyphenated word and the word "coeducation" without the hyphen map to the same term. One way to do that is to delete the hyphen and write "coeducation" as a single string. If you do that, you'll be able to retrieve documents containing not just "co-education" with the hyphen but also "coeducation" without the hyphen.

Now what about optional uses of the hyphen? In some cases "lowercase" may be written as "lower-case", in some cases as "lower case" with a space, and in other cases as "lowercase", a single word. You want to ensure that documents containing any of these variants are returned for a query containing the word "lowercase": if I want to search for "lowercase", documents containing any of these forms of the word would be relevant. In Westlaw, for example, they expect the users to put in hyphens. If the user types in "lower-case", they take this query and split it into three variants, "lowercase", "lower case", and "lower-case", and they add an OR operator
and expand the query like this. So a single term "lower-case" would be expanded into this OR query, and each of these strings is tokenized separately by them.

Broadly speaking, there are two ways in which you can handle tokens like this, such that documents containing one of these variants are returned for queries expressed in any of those forms. The first is to normalize all three of them to the same term. The other is to tokenize all three of them separately, but then to expand the query when the user submits it. This is called query expansion: you take the query and expand it into multiple strings with an OR operator in between. But Westlaw applies this strategy of query expansion only if the user puts in a hyphen; if I type in a query like "lower case", then they won't do it.

Web search engines don't do this kind of thing, because it's very hard to expect the users to write the query in a particular format. But for whatever reason, Westlaw has decided that for users entering the query in this particular format, they will expand the query. So you want to think about your audience, and whether your audience can be trained to enter queries in a particular way. In that case you may want to make your own life easier by not worrying about too many complications, and pass that burden on to the user to enter their queries properly. But in the context of the web you can't: there's such a large variety of users that you can't expect them to stick to any particular standard. So you have to do the work there to anticipate what sort of documents would be relevant to what kinds of queries.

Now, white space. If we just blindly split on white space, then "San" and "Francisco" would be split into separate
tokens. If somebody is searching for just "Francisco", this user is probably searching for a person of this name. But if you split "San Francisco" into two separate tokens, a document containing "San Francisco" would be returned as a result, and that wouldn't be relevant. So you definitely want to preserve this as a single token.

And how do you decide that it is one token? Well, maybe you have a list of city names available to you. Or let's say "San Francisco" appears somewhere in the middle of a sentence and you see that both these words are capitalized, so maybe this is the name of a person or a place or a city or whatever. If capitalized words appear in the middle of a sentence, that increases the likelihood that it's a name, so you may not want to split between two such words, both of which are capitalized.

Then other complications come up because there are different ways to write dates. In the U.S.,
for example, a date written as 3/12/91 would be interpreted as the 12th of March, but the way dates are written in India, it would be interpreted as the 3rd of December. According to the American convention, the month is written first, then the day, and then the year; but in Europe, and also in India, you would write the day first, then the month, and then the year. That adds complications, because if you want to map all documents referring to the 3rd of December 1991 to the same term, then you would need to treat European documents separately from American documents; you need to tokenize them differently.

And then the same date could appear written in a different format: "March 12, 1991". A document containing this date would be relevant if the query is just "3/12/91", read the American way. So how do you ensure that these two are interpreted in the same way? During tokenization, first of all, you need to ensure that you don't split on white space here, because "12" by itself is just meaningless. You don't split a date in the middle; you want to preserve the whole date as a single token. Moreover, you want to store it in some normalized form, so that if there are other ways of writing that same date, the same day, month, and year, then those documents are also present in the same postings list. If a query comes, it is normalized in the same way, and all documents referring to that date, in whatever format you can think of, would be in that postings list.

You also don't want to split on white space for dates like "55 B.C.". By the way, these are the two variants of the date convention, in America and in Europe. And I didn't note that this was there: "B-52". This may be the name of a bomber, or "F-16". You don't want to split on the hyphen here, because this stands for a single entity: the name of a kind of aircraft.
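
As a rough illustration of the date discussion above, here is a minimal Python sketch that maps several surface forms of a date to one normalized term. The function name `normalize_date`, the `day_first` flag, and the two short format lists are assumptions made for this example; they are not something prescribed in the lecture, and a real system would need many more patterns plus some way of guessing which convention a document follows.

```python
from datetime import datetime

# Hypothetical format lists for illustration only.
# %m/%d/%y is the American month-first reading; %d/%m/%y is the day-first one.
MONTH_FIRST_FORMATS = ["%m/%d/%y", "%B %d, %Y", "%B %d %Y"]
DAY_FIRST_FORMATS = ["%d/%m/%y", "%d %B %Y"]


def normalize_date(text, day_first=False):
    """Return a canonical YYYY-MM-DD string for a date, or None if unparsed.

    The whole date string is treated as ONE token; it is never split on
    white space or slashes before being interpreted.
    """
    formats = DAY_FIRST_FORMATS if day_first else MONTH_FIRST_FORMATS
    for fmt in formats:
        try:
            # Two-digit years use strptime's standard pivot (91 -> 1991).
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this pattern did not match; try the next one
        return parsed.date().isoformat()
    return None


# The same string maps to different terms under the two conventions.
american = normalize_date("3/12/91")                   # 12th of March 1991
european = normalize_date("3/12/91", day_first=True)   # 3rd of December 1991
spelled_out = normalize_date("March 12, 1991")         # same term as `american`
```

Under these assumptions, "3/12/91" in an American document and "March 12, 1991" end up as the same term and therefore in the same postings list, while the same string "3/12/91" in a day-first document maps to a different term. The query has to go through the same function as the documents for this to work.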
Phone numbers, PGP keys, social security numbers: many of these have embedded spaces in them, but you don't want to split on white space; if you do, you will break the entity apart. Earlier IR systems, earlier information retrieval systems, would not index numbers at all, because the problem with numbers is that they can hugely increase the size of your dictionary, in which case your dictionary wouldn't fit into main memory. But nowadays, as computer memory has gone up, systems do index numbers, and it's often useful to index numbers, because there may be users searching for "September 11, 2001" and you want documents to be retrieved for them. There are some dates that are important, and some phone numbers that may be important: a police station or some other public utility service. Error codes and stack traces are another case: you may be writing programs which halt by spitting out some exception code, and often if you just take that exception code and search online, you'll find people who have discussed that kind of error. Think of "404 Not Found", the 404 error: if "404" is not indexed, then you wouldn't find out what exactly could have gone wrong. So you can see that it's often useful to index numbers.

In Chapter 3 we are going to look at something called an n-gram index, which is another way to deal with numbers so that the index size doesn't explode. I won't go into what this is right now; we'll look at that separately.

The other thing you may often want to do is to index metadata separately. For example, if you are indexing a bunch of emails, then you may want to index the date field of each email separately from the body. Instead of thinking of there being a single index for the entire email, think of the
date being indexed separately and the body being indexed separately, so that you can handle more complicated queries of the form "get me all emails where the date field lies between this date and that date". If you index your metadata separately, you can interpret that metadata in more sophisticated ways. We can go back to what we talked about in the first lecture: if you can somehow detect structure in your data, if there are certain fields in your documents which have a clear-cut meaning, then you may want to preserve that meaning in some way so that you can handle more sophisticated queries. That would be an example of semi-structured search.

So this was about English, but other kinds of language issues come up if you are dealing with other languages. Here are some examples. For French documents, you have issues like the following, when a word beginning with a vowel has a definite article before it. For example, say you want to say "the ensemble" in French. The word for "the" in French is "le" (or "la"); I don't know French, so I may be murdering the pronunciation. But this article, when it appears before a word that begins with a vowel, is converted into a short form with an apostrophe. So you want to make sure that a document containing the string "l'ensemble" is split: the apostrophe should be taken out, and we need to keep track of the fact that this "l" was really the article and that the word "ensemble" appeared separately, so that if somebody searches just for "ensemble", they would be able to retrieve this document. You don't want to treat it as a single token: if you treated this whole string as a single token and somebody entered just "ensemble", you wouldn't be able to retrieve this document.

Many of these issues have been handled only fairly recently: as people find things not working, they try to
rectify these issues. In particular, there is a lot of improvement still to be made in search engines dealing with documents in other languages; for English, of course, there are a lot of people working on it.

In German, for example, there are other kinds of issues that you don't encounter in English. Take a bunch of nouns one after the other, for example "life insurance company employee". This phrase in English has four words separated by spaces, but if you translate it into German, these four words would be written together without any spaces. Noun compounds don't have spaces in German. So you would need a separate compound splitter module, which takes these kinds of compound words, splits them into their individual words, and tokenizes each word within the compound separately. If you don't do that, your performance will be about 15% poorer in terms of precision and recall; if you do, you get a boost in performance.

Chinese and Japanese are probably the languages which present the worst kinds of problems, because you don't have spaces in them at all, not between words, not even between sentences. So you really need very sophisticated techniques, machine learning techniques, in order to split a Chinese document into its component words. I mentioned something called an n-gram index a few slides ago, which I said we'll look at in Chapter 3; those are the kinds of indices that one may have to build to deal with languages like this. But let's ignore that for the moment.

Chinese, Japanese, Korean, Thai: all those languages have this problem, because they don't use any spaces. In Japanese in particular, not only are there no spaces, but you can also have different scripts intermingled in the same sentence. There are four different scripts in Japanese, one of which you can
recognize: romaji is basically the Roman alphabet here. But there are other scripts, hiragana, kanji, and katakana, and you can have these scripts intermingled in the same document, even in the same sentence. So imagine a user expressing the query entirely in hiragana, one of these scripts, while there are documents written in other scripts but referring to the same words. Those documents need to be retrieved for this query, and that complicates things a lot.

Arabic and Hebrew are languages that are written from right to left, but the complication is that certain items, for example numbers, are written from left to right. So in reading this sentence you would start from the right, but when you come to "1962" you would read it from left to right; then you would again read from right to left, and again from left to right when you're reading "132", and then again you'll go right to left. And then you often have these two-dimensional marks, analogous to diacritical marks: what the word is will not just depend on the linear sequence, because there are certain things happening on top as well, so it is a kind of two-dimensional representation. This sentence says "Algeria achieved its independence in 1962 after 132 years of French occupation".

If you look at the way this sentence is written, it's pretty complicated. But even though the surface presentation, the way the sentence is written on paper, is complex, the way in which it's stored as a sequence of bytes, let's say in Unicode, is pretty straightforward, because what you store is basically a linear sequence, corresponding to the sequence of sounds that would be produced when you read this sentence. Even though the writing is
two-dimensional, when you speak this sentence you are basically letting out a linear sequence of syllables, and that is the sequence that is actually stored in the file. Only when displaying it, when rendering the sentence, is it laid out in a two-dimensional way. That makes things easier, because interpreting a two-dimensional script would have been very hard; but if the stored text is one-dimensional, and the task of rendering it in 2D is kept separate from the task of interpreting what the words mean, then it becomes easier.

Any questions so far? These are just examples I am giving of the idiosyncrasies of tokenization when you deal with other languages: the kinds of problems you encounter which you don't in English. Any questions? You don't have to remember any of this; you can go through it again, and you can read the chapter in the book which discusses many of these examples. This is just to give you an idea of the kinds of things that people working on tokenization and linguistic processing need to do. You can probably understand now why it's important to have people who speak the language, or are familiar with it, implementing these modules: you really need to understand how the language works and how users type their queries in that language in order to build an effective tokenizer.

Now there's something called a stop word. A stop word is a word that is so common that it appears in almost every document. Think of a word like "the" or the other articles; those would all be stop words. Stop words have very little semantic content. For most queries, if I just say "the capital of India", for example, this "the" is probably going to appear in almost every document, so it doesn't mean anything,
and it's not going to help me filter the documents in any way. So one of the strategies that people used to adopt is to drop these stop words entirely from the index, because they have little semantic content and almost every document has them. In fact, consider dropping just the top 30 words; when I say top 30, I mean the top 30 words in terms of what is called the collection frequency.

This is another term that I am introducing here. Remember that we have talked about document frequency, which is the number of documents in which a term appears, or the length of the postings list for that term. Then we had the term frequency: the term frequency of a particular term t for a given document d is the number of times t appears in d. The third frequency that we'll be concerned with is called collection frequency. Again it is defined for a term t: it's the number of times t appears in the corpus as a whole. So the most frequent word in the whole corpus would be the word with the highest collection frequency.

When building the index, you can also keep track of the collection frequency of the different terms. If you take the top 30 words in terms of their collection frequencies, these would be the top 30 stop words, and if you just drop them from your index, the size of your index would go down by about 30%, because the postings lists for these top 30 words make up about 30% of the total number of postings in your index. That's why, to save space, people used to drop stop words.

But now the trend is away from doing this, and there are a couple of reasons for that. Firstly, we have very good techniques for compressing the postings lists. What that means
is that space is no longer an issue: even if we have very long postings lists for stop words, we can represent those postings lists in a very compact form, so we don't really lose out on space by keeping them. We're going to look at these compression techniques in a later chapter. The other thing is that we have good query optimization techniques, again something we'll look at in a later chapter. We don't have to traverse the whole postings list in order to decide whether or not the documents are relevant to the query: we can arrange the documents not by their doc IDs but using some other method, so that we can terminate our scan early once we have decided that we've already looked at the most important documents. Don't worry about that right now.

What you can note right now is that stop words can be very important in certain kinds of queries. For example, if you have a query like "King of Denmark", this "of" could be pretty important, because this is a phrase query saying "I would like information about the king of Denmark". If there was or is a king of that particular place with a particular name, then a document containing that name would be relevant to you. Or take various song titles and phrases, "Let It Be", or "To be or not to be": you can see that these are phrases mostly composed of stop words, and if we delete stop words from the index, then we won't be able to retrieve documents referring to them. And it makes a lot of difference whether you are referring to "flights to London" or "flights from London": somebody searching for "flights from London" is looking for flights out of London, and the kind of pages that would be relevant to such a person would be different from the pages that are relevant to someone who is flying to London. So the kinds of stop words that indicate
relationships are pretty important. You can see that these are all queries where you need stop words, and because we can now handle stop words in a relatively efficient way, the trend is to avoid dropping them.

So that finishes tokenization. We looked at a bunch of complications that should probably convince you now that tokenization is not as trivial as it sounds.
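
To make the three frequencies from the stop-word discussion concrete, here is a small Python sketch on a toy corpus. The corpus, the naive whitespace tokenizer, all function names, and the hand-written `ASSUMED_STOP_LIST` are made up for illustration. The figure of roughly 30% of postings for the top 30 words refers to real collections, not to this toy.

```python
from collections import Counter


def tokenize(text):
    # Deliberately naive: lowercase and split on white space.
    return text.lower().split()


def frequencies(docs):
    """Return (collection frequency, document frequency, per-document term frequency)."""
    cf, df, tf = Counter(), Counter(), []
    for doc in docs:
        counts = Counter(tokenize(doc))
        tf.append(counts)          # tf[d][t]: occurrences of t in document d
        cf.update(counts)          # cf[t]: occurrences of t in the whole corpus
        df.update(counts.keys())   # df[t]: number of documents containing t
    return cf, df, tf


def top_k_by_collection_frequency(cf, k):
    # Candidate stop list: the k terms with the highest collection frequency.
    return {term for term, _ in cf.most_common(k)}


def postings_share(df, terms):
    # One posting per (term, document) pair, so df[t] is the postings-list length.
    return sum(df[t] for t in terms) / sum(df.values())


def drop_stop_words(query, stop_words):
    return [t for t in tokenize(query) if t not in stop_words]


docs = [
    "to be or not to be that is the question",
    "the king of denmark is in the castle",
    "flights to london and flights from london",
]
cf, df, tf = frequencies(docs)

top_two = top_k_by_collection_frequency(cf, 2)   # "to" and "the" in this toy corpus
share = postings_share(df, top_two)              # fraction of postings they occupy

# Hypothetical hand-written stop list, used to show what dropping loses.
ASSUMED_STOP_LIST = {"to", "be", "or", "not", "the", "of", "from"}
hamlet = drop_stop_words("to be or not to be", ASSUMED_STOP_LIST)
outbound = drop_stop_words("flights from london", ASSUMED_STOP_LIST)
inbound = drop_stop_words("flights to london", ASSUMED_STOP_LIST)
```

With this assumed stop list, the query "to be or not to be" is left with no tokens at all, and "flights to london" and "flights from london" become indistinguishable, which is the loss described above when stop words are dropped.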