What we're going to do now is look at text summarization. Specifically, this is an introduction to text analytics: using computational methods to analyze, in aggregate, what people are talking about in their text. Please remember this is just an introduction, so we'll be doing a very high-level overview of many of these methods. To start with the description, I'm not going to read this whole definition completely, but what you can see in it is that we're taking the short texts that make up a collection and processing them computationally so that we can understand what is being discussed. This can feed various additional techniques beyond summarization: you can take these aggregations of text and do things like text classification, novelty detection, and so on.

Text preprocessing, which is the first and foundational step in analyzing text, can generally be defined as bringing your text into a form that is predictable and analyzable for your task. What I mean by preprocessing text is that when a computer reads any given body of text created by humans, it has to go through a series of steps to prepare that text to be processed computationally. This in itself is a form of text summarization, or of trying to understand what is being said in the text; I would call it a primary or foundational way to get an understanding of the text using computational methods.

There are generally five steps to text preprocessing. First, we collect the text data. Second, we tokenize the text data. Third, we normalize the tokens, and I'll go over what a token is shortly. Fourth, we remove stop words. And fifth, we stem or lemmatize those tokens.

First, we need to collect the text data. In this course, we've already gone over collecting data from surveys. The way you would collect text data from a survey is to include some sort of open-ended question in your survey instrument, something like "Do you have any additional comments?", "How did you feel about this product?", or "What do you think about this brand?", and then give respondents an open text box so they can write their answer into the survey response. You can also collect text data from databases, whether private databases within your company or public databases available on the Internet. Another popular approach is to collect social media data via what is called an application programming interface, or API. I'm not going to go too much into that in this course, but it's basically the way a social media company enables you to pull data from them; that's not just text data, it could be image data or whatever else they allow you to pull. This is not an exhaustive list of how to collect text data, but these are three of the more popular ways to collect text data for marketing purposes.
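Just to make that first step a little more concrete, here is a minimal sketch of loading open-ended survey responses from an exported file. The file name survey_responses.csv and the column name open_comment are hypothetical placeholders; whatever survey tool or database you use will have its own export format.

```python
import csv

# Hypothetical survey export: one row per respondent, with an
# open-ended text field named "open_comment".
def load_survey_comments(path="survey_responses.csv", text_column="open_comment"):
    comments = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            text = (row.get(text_column) or "").strip()
            if text:  # skip respondents who left the box empty
                comments.append(text)
    return comments

# comments = load_survey_comments()
# print(len(comments), "non-empty responses collected")
```

The same idea applies to a database export or to data pulled down through an API: whatever the source, the end result is simply a collection of raw text documents ready to be preprocessed.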
The second step is to tokenize the text, so here is where I'll explain what tokenize means, or this concept of tokens. A token is the basic unit you get when you break text down in some way that lets the computer process it piece by piece; those pieces are what the computer calls tokens. Most of the time we can simply call words tokens, because tokenization is usually applied word by word. To give you an example, let's start with this sentence: "I love Fridays and hate Mondays, but this Monday I turned 21!" If we use whitespace word tokenization, which means the computer counts a new token after each whitespace, then what you have on this slide is that "I" is a token, "love" is a token, "Fridays" is a token, and so on. You can see that for our example, the tokens are simply the words separated by whitespace.

After you've tokenized the text, you normalize the tokens. We start with our whitespace word tokenization at the top of this slide: I, love, Fridays, and so on. Then you go through this process of normalization, and you can actually choose how you want to normalize depending on your goals for processing the text. Many times people will remove numbers, remove punctuation, and lowercase all the tokens so that everything is standardized, or again, normalized. I actually highlighted numbers in red here because I noticed that the number 21 is very important to this sentence. That's easy for me to see because we're looking at one sentence, but just so you know, in the normalization step you can make a decision based on your goals as to how you want to normalize. If you look at the normalization on the bottom of the slide, you'll see that punctuation is taken out (after 21 there's no more exclamation point) and everything is lowercased, but we did retain the number 21.

The fourth step is to remove stop words. I'll just read this definition verbatim: stop words are the most common words that appear in a text but are not actually important to understanding the topical content of what is being discussed. You see examples here: the, with, to, a, and. Those words are usually not very important to understanding the meaning of a body of text, and they're typically called stop words. Now, there is no single, universally agreed-upon list of stop words out there; everybody has to define what their stop words are. This is another chance for you, as a data scientist, to make a decision about what you want your corpus of stop words to be. Of course, there are very popular stop word lists on the internet that you can just use, and that's what most people do, but just know that stop words are not necessarily predefined and set in stone. If we remove stop words from the normalized tokens at the top of this slide, we're left at the bottom of the slide with: love, fridays, hate, mondays, monday, turned, 21.
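If you want to see what steps two through four look like in code, here is a minimal sketch using only the Python standard library. The tiny stop word list is hand-picked just for this example sentence; in practice you would start from a published stop word list or build your own. Matching the slide, punctuation is stripped and everything is lowercased, but the number 21 is kept.

```python
import re

text = "I love Fridays and hate Mondays, but this Monday I turned 21!"

# Step 2: whitespace word tokenization -- split on runs of whitespace.
tokens = text.split()
# ['I', 'love', 'Fridays', 'and', 'hate', 'Mondays,', 'but', 'this', 'Monday', 'I', 'turned', '21!']

# Step 3: normalization -- strip punctuation and lowercase, but keep numbers.
normalized = [re.sub(r"[^\w\s]", "", tok).lower() for tok in tokens]
normalized = [tok for tok in normalized if tok]  # drop tokens that were only punctuation
# ['i', 'love', 'fridays', 'and', 'hate', 'mondays', 'but', 'this', 'monday', 'i', 'turned', '21']

# Step 4: stop word removal -- a tiny illustrative list, not a standard one.
stop_words = {"i", "and", "but", "this", "the", "a", "to", "with"}
content_tokens = [tok for tok in normalized if tok not in stop_words]
print(content_tokens)
# ['love', 'fridays', 'hate', 'mondays', 'monday', 'turned', '21']
```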
The fifth and final step is to stem or lemmatize the tokens. Well, what is stemming? You can see these examples here: you may have various forms of the word run, such as running, ran, and runs. What stemming does is convert all of those forms to just "run," so that you know you have three instances of some form of run. Lemmatization takes this idea one step further. Take the words better, best, and good: through a lemmatization process, a computer can recognize that all of these map back to the root word good. Stemming would not get you to "good" three times from better, best, and good, but lemmatization will. Again, this is a decision you make: whether you want to stem, whether you want to lemmatize, or in some cases whether you want to combine both, though that's not what most people do. They'll make a decision to either stem or lemmatize. I've chosen stemming here. Where we started with love, fridays, hate, mondays, monday, turned, 21, we now have love, friday, hate, monday, monday, turn, 21. You can see that instead of a plural "mondays" and a singular "monday," we now just have the singular form twice: monday, monday, followed by turn and 21.
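And for the fifth step, here is a minimal sketch of stemming and lemmatizing using NLTK's PorterStemmer and WordNetLemmatizer. The lecture doesn't tie itself to a particular library, so treat this as just one possible implementation; the lemmatizer requires the WordNet data to be downloaded, and its output (for example, whether "better" resolves to "good") depends on the part-of-speech tag you pass and on the WordNet data itself.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # data needed by the lemmatizer

tokens = ["love", "fridays", "hate", "mondays", "monday", "turned", "21"]

# Stemming: rule-based suffix stripping (e.g. "fridays" -> "friday", "turned" -> "turn").
stemmer = PorterStemmer()
print([stemmer.stem(tok) for tok in tokens])

# Lemmatization: dictionary-based lookup; the pos argument ("a" = adjective here)
# tells WordNet how to interpret the word, so "better" can resolve to "good".
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word, pos="a") for word in ["better", "best", "good"]])
```

Whichever you choose, the goal is the same: collapsing different surface forms so that, for instance, "monday" and "mondays" count as the same token when you analyze the text in aggregate.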