WDM 17: Parsing Documents and the Issues Associated with It

What exactly is a document? When we collect all the initial data into a set of files that we will then process in the tokenization and linguistic pre-processing phase, the task is not trivial, as we saw last time. To work with documents we need to be able to extract the character sequence in them. Ultimately a document is a sequence of bytes in a file; that is how it is stored on a machine. We need to take this sequence of bytes in the document file and extract from it a sequence of characters, and this character sequence is what we will then tokenize. In our pipeline diagram we were assuming that the character sequence had already been extracted, but there is an initial step where we take documents in various formats, languages and encodings and extract the character sequence out of each of them, and that may not be a trivial task.

Documents come in different formats: PDF, Word, Excel, HTML and so on. Extracting character sequences out of each of these kinds of documents requires different techniques, and mostly information retrieval people do not do this themselves; they use licensed libraries to convert PDF into a text file and so on. So ultimately we assume that all these documents have already been decoded and that we have obtained the character sequence out of them. If you are looking at an HTML document, even after converting it into a text file you may still need to process that file and remove all the markup, because markup is not something you want to index; it is meta-information and can be discarded. What we actually want is the body of the text when we parse the HTML file. We also want to figure out what language the text is in.
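As a sketch of the markup-removal step just described, here is a minimal body-text extractor built on Python's standard html.parser; the sample HTML is made up, and a production system would use a more robust extraction library:

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect character data, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0          # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html: str) -> str:
    """Strip markup and normalize whitespace, keeping only the text."""
    p = BodyTextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

print(html_to_text("<h1>Hi</h1><script>x=1;</script><p>Body text.</p>"))  # Hi Body text.
```

The tags and the script content are discarded; only the indexable body text survives.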
You have seen in the first chapter that the tokenization and linguistic pre-processing steps depend on the language: the kind of tokenization and the kind of linguistic pre-processing you will do both depend on it, so we need to know the language of the documents in advance in order to use the appropriate schemes for tokenization and other kinds of normalization. We also need to know what character set is in use. There are different character encodings, such as ASCII, Unicode's UTF-8 and so on, and in order to convert bytes into characters we have to know which character set the document is using.

Now, each of these tasks, figuring out the format, the language and the character set, is a classification problem. Recall from chapter 1 that a classification problem is one where you have a predefined set of classes and a set of examples of typical documents in each class, and you train the machine to recognize documents in each of those classes, so that when a new document arrives the machine can classify it into one of the existing classes. This is supervised learning, a particular subfield of machine learning, and classification is an example of it: the machine learns to recognize the different classes from the typical examples of documents that you provide for each class. It is supervised because you act as the supervisor who provides a labelled list of examples for each of the classes. The machine looks at each of those examples and tries to figure out patterns that are unique to each class. So there will be a training phase, where you show the computer these examples of documents belonging to all the classes.
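On the character-set point, the same byte sequence yields different character sequences under different decodings, which is why the encoding must be identified before anything can be tokenized; a tiny Python illustration:

```python
# "café" stored as UTF-8 is the byte sequence 63 61 66 c3 a9.
raw = "café".encode("utf-8")

# Decoding with the correct character set recovers the original text.
assert raw.decode("utf-8") == "café"

# Decoding with the wrong character set silently produces mojibake:
# the two-byte UTF-8 sequence c3 a9 becomes two Latin-1 characters.
print(raw.decode("latin-1"))    # cafÃ©
```

In practice the character set comes from document metadata, such as an HTTP header or an HTML meta tag, or from a statistical detector.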
From these examples the computer extracts patterns characterizing each class. Then, in the test phase, a new document comes in that is not classified, not labelled, so the machine does not know which class it belongs to; you feed it to the machine, and the machine uses the knowledge it accumulated in the training phase to classify it. Whether the classes correspond to different document formats, different languages or different character sets, it is the same classification problem, and we will study classification in more detail later in the course when we look at the machine learning section.

Very often, though, these classification tasks are done heuristically. For example, we saw last time that the file extension can often tell you the file format; the document metadata can often tell you the character set; and, assuming you know the character set and can extract some of the words, looking at just a few words can often tell you what language the document is in. So, using a combination of machine learning and heuristic techniques, we can extract the character sequence out of the sequence of bytes, and we will now assume that we have that character sequence extracted.

One complication arises when the documents you are trying to index are heterogeneous, that is, when there is no single language found in them. What if some of your documents are in German, some in French and some in English? In that case your single index is going to contain terms belonging to several languages, and you will need to do tokenization and linguistic pre-processing separately for documents in the different languages. That is something you would really want to avoid.
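The heuristics just mentioned can be sketched in a few lines; the extension table and stop-word lists below are tiny made-up samples, not production detectors:

```python
def guess_format(filename: str) -> str:
    """Heuristic: guess the file format from the extension."""
    table = {".pdf": "PDF", ".html": "HTML", ".doc": "Word", ".xls": "Excel"}
    for ext, fmt in table.items():
        if filename.lower().endswith(ext):
            return fmt
    return "unknown"

def guess_language(text: str) -> str:
    """Heuristic: guess the language by counting common function words."""
    stopwords = {
        "english": {"the", "and", "of", "to", "is"},
        "french": {"le", "la", "et", "les", "des"},
        "german": {"der", "die", "und", "das", "ist"},
    }
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in stopwords.items()}
    return max(scores, key=scores.get)

print(guess_format("report.PDF"))                       # PDF
print(guess_language("der Hund und die Katze ist da"))  # german
```

Real systems use magic bytes for formats and character n-gram models for languages, but the idea is the same: cheap surface evidence is usually enough.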
If you are the person building the search engine, you clearly have some language in mind, and it is very hard for one person to think about doing tokenization and linguistic pre-processing for multiple languages. In some cases a single document itself can contain multiple languages or multiple formats. For example, an email in English could quote, say, a passage in French, in which case a section of your email would be in French and the rest in English. Sometimes the attachments sent with emails are in other languages: you can have a French email with a German PDF attachment. These are some of the complications you need to deal with if you plan to build an IR system for the web.

So far this has been about the task of extracting the character sequence from the sequence of bytes making up the documents. The other issue you have to settle before this initial step can be considered done is how you are going to define a document. One simple definition is that every file becomes a separate document: if there are 37 plays of Shakespeare in 37 files, you have 37 documents. The number of documents you define determines the IDs you will be assigning to each of them. In some cases, though, you may not want to define a document as a single file. For example, UNIX stores all your mails in a single file, the mbox file. If you want to index such a file, you probably want to split it into its individual emails and make each email within the mbox file a separate document. This is an example where you may want to split a single file into multiple documents. What if you have an email that has five attachments?
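The mbox case can be sketched with Python's standard mailbox module; the two-message mbox content below is made up for illustration:

```python
import mailbox
import os
import tempfile

# A tiny mbox file containing two emails (made-up content).
sample = (
    "From alice@example.com Thu Jan  1 00:00:00 2015\n"
    "Subject: first\n"
    "\n"
    "hello world\n"
    "\n"
    "From bob@example.com Thu Jan  1 00:00:01 2015\n"
    "Subject: second\n"
    "\n"
    "second mail body\n"
    "\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample.mbox")
with open(path, "w") as f:
    f.write(sample)

# Split the single file into one document per email, assigning doc IDs.
docs = {doc_id: msg.get_payload() for doc_id, msg in enumerate(mailbox.mbox(path))}
print(len(docs))         # 2
print(docs[0].strip())   # hello world
```

Each email, rather than the whole mailbox file, now gets its own document ID.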
In that case you may want to convert the body of the email into a separate document and each of the attachments into a separate document, so that six documents are created out of this one email. What if one of the attachments is a zip file? In that case you may want to unzip that file, look at all the files it contains, and define each of them as a separate document.

In some cases people convert PowerPoint or LaTeX presentations into a sequence of HTML pages so that they can be viewed easily in a web browser. In that case a single original file, say a PowerPoint file, is split into a sequence of HTML pages: if there were 30 slides in the PPT file, this process would generate 30 HTML pages. You do not want to treat each HTML page as a separate document here; you may want to combine those 30 pages into a single document, which is the converse of the earlier cases: there you were splitting a single file into multiple documents, here you are merging multiple files into a single document.

Now suppose you have a book, say a single play of Shakespeare. Even there you may want to think about whether to define a single document as the entire play, or to break the play into a sequence of sections and say that each section is a separate document. Say you have a huge book in PDF format: you may want to split that book into individual chapters and define each chapter as a separate document. Now why would you want to do such a thing? Why would it sometimes make sense to split an entire book into a number of documents instead of treating it as a single document? What are the advantages and disadvantages of doing that? Is there a use case you can think of where this may make sense? Let me actually give you an example so
that you can think about why you may want to do this. Let's say some historian writes a book on the Middle Ages in Europe. In that book you are probably going to find the word Christ appearing in many places, because when you study the history of Europe in the Middle Ages you are studying, to a large extent, the history of Christianity, which was the dominant religion in Europe at that time. Moreover, the university system that we have today was started in the Middle Ages in Europe, so you would probably see the word university also appearing in some of those chapters.

Now suppose your query was Christ University. Would this book on the Middle Ages in Europe be relevant to you? Probably not. But Christ appears in the book and university appears in the book, maybe in entirely different chapters, or on different pages of the same chapter; certainly they will not appear side by side. In that case, if you define the entire book as a single document, this query, treated as an AND query, would return this book and many other such documents. Now suppose we were to split the book into a sequence of paragraphs: take the whole PDF file, extract the character sequence from it, and convert each paragraph into a separate document. In that case it is very unlikely that you will find Christ and University appearing in the same paragraph, because when the university system is being discussed, religion is probably not being discussed, or only in a generic sense. I hope you see what I am saying: by refining your definition of a document you will be
increasing the chance that when the search engine returns you a list of documents for this query, those documents are actually talking about Christ University, because if Christ and university appear very far from one another, as they will in many online books, you do not want those books returned in the result.

So my question to you: recall the terms precision and recall that we discussed last time. Consider a scenario where you treat every book, no matter how large, as a single document. What would the performance of the resulting search engine be? Would you say it is high precision or low precision, high recall or low recall? Think of a query like this, a document like this, and a scheme where every file, including huge books, is turned into an individual document. What about the precision? Do you remember what precision is? We discussed it last time. Can you speak into the microphone? For some reason I could not hear; the voice is very faint. Yes, yes.

So, if your list of results contains a lot of irrelevant documents, as it will in this case, because you will get many of these books which mention Christ and university not next to each other but in totally unrelated ways somewhere inside the book, then the precision is low: many of the retrieved documents were not relevant to your information need. Your information need was to find information about the entity called Christ University, which is in Bangalore, but what you get in your results is a number of documents which mention Christ and university without referring to
Christ University, but to university and Christ as two different and separate entities. As your document size grows, it becomes more and more likely that you will get examples like this, where the words in your query all appear in the document but so far from one another that the document is really not relevant. When you enter two or three words in your query, you expect those words to be found next to one another, or at least in the same sentence or the same paragraph; but because they are found at a great distance from one another inside the book, all those books are returned as well. So the precision is low, but the recall is high, because if there is a document genuinely referring to Christ University, it is definitely going to be in the result. The problem is that those are not the only documents in the result; you also get the other kind. So you get low precision and high recall: if the document size is too large, the precision will be low and the recall will be high.

What if the document size is too small? Say you turn every sentence into a separate document. For this query you might still do well, because Christ University is, strictly speaking, a phrase query: if you expect a reference to it, you expect the two words in the same sentence, right next to one another. But consider another example, say Chinese toys. This is not necessarily a phrase query, because you can think of Chinese toys as toys made in China, and it is possible for Chinese and toys to appear not together but not too far from one another, in a paragraph that is talking about toys and mentions that there are Chinese manufacturers of them. That kind of paragraph would also be relevant to you. What if
you make the document size too small? Maybe there are relevant documents where the word Chinese and the word toys appear in the same paragraph but not in the same sentence. If you were to make every sentence into a separate document, then you would lose those associations; you would lose those documents. Do you see that? If the document size is too small, for example if every sentence is converted into a separate document, then when you search for a query like Chinese toys you will get a list of documents in the result where the words Chinese and toys appear in the same sentence, by definition, because every sentence is being treated as a separate document: all the doc IDs in the result correspond to sentences where both words appear.

But my point is that in the original corpus, the corpus of large PDF books from which you turned every sentence into a separate document, there could be many other references to Chinese toys, references where the word Chinese and the word toys did not appear in the same sentence but in adjacent sentences. By converting every sentence into a separate document, those references are not returned in the result: one sentence has just the word Chinese, so that document is not returned; another sentence has just the word toys, so that document is not returned either. So in this case the recall is low and the precision is high. Why is the precision high? Because you are at least assured that whatever results you get are probably relevant: they are all results where Chinese and toys appear in the same sentence. And why is the recall low? Because in the corpus there were many references to
Chinese toys where Chinese and toys did not occur in the same sentence, and you simply do not get those references in the results list. So I hope you see the trade-off between making the document size too large and making it too small: on one hand you get low precision and high recall, and on the other hand high precision and low recall. Any questions about this? You need to avoid either extreme, and of course there are no set rules for doing so. You have to think carefully about the kinds of documents you are dealing with, the kinds of users that will be using your system, and the kinds of queries they will be submitting.

I have a question: when you say the document size is too big or too small, can you give any specification of how big that is, in MBs or GBs? So, the question is not whether it is too big in terms of MBs or GBs; it is more about the kind of language, the kind of queries and the kind of users of your system. Think about an entire novel, any favourite novel of yours. Pick a random word from the first chapter, a relatively rare word, not an article or anything like that, and pick a random word from the last chapter. Suppose somebody gave a query with those two words, one from the first chapter, the other from the last. Would that book be relevant to that query? It would not, because there is no reason to expect that document to be relevant to that user. Think of the kinds of queries you yourself enter on Google: you would mostly be entering phrase queries, or queries where you expect the words to be found close to one another in the document. But if the document size is too large,
you increase the likelihood that you will find all those words in the document but that they will not be close together. The larger the document, the larger the likelihood that those words are present, and present in unrelated places. That is all I mean by saying the document size is too large; it is not a question of MBs or GBs. Think of a typical PDF book: that would be a few MBs; I do not think a typical novel would be more than 1 or 2 MB in PDF form. Even that could be too large; maybe you want to index each page separately, or each paragraph, or each section. It depends on what documents you are looking at. As I said, there is no set rule here, no universal rule saying this document size is too large or this one is too small. It depends, and these are things you learn more by actually doing: evaluating the performance of your system, changing things, and seeing whether the performance goes up or down. There is no mathematical definition of too large or too small; you have to see what maximizes your performance.
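The whole trade-off can be made concrete in a toy experiment; the three-sentence "book" below is made up, and the query is treated as a plain Boolean AND:

```python
# One "book" whose reference to Chinese toys spans two ADJACENT sentences.
book = [
    "Chinese factories export millions of products.",
    "Many of those products are toys.",
    "Medieval European craftsmen made wooden toys by hand.",
]
query = {"chinese", "toys"}

def matches(doc: str) -> bool:
    """Boolean AND: every query term must occur somewhere in the document."""
    words = set(doc.lower().replace(".", "").replace(",", "").split())
    return query <= words

# Granularity 1: the whole book is a single document.
book_hit = matches(" ".join(book))
# Granularity 2: every sentence is a separate document.
sentence_hits = [i for i, s in enumerate(book) if matches(s)]

print(book_hit)        # True -> the book is returned even though the terms
                       #         are far apart: precision suffers
print(sentence_hits)   # []   -> the adjacent-sentence reference is lost:
                       #         recall suffers
```

At book granularity, any co-occurrence of the terms anywhere in the book produces a hit (low precision, high recall); at sentence granularity, cross-sentence references disappear from the result (high precision, low recall).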