[MUSIC] Okay, what we'll go through next is various methods of text summarization. I already mentioned in the first lecture using text pre-processing as a method of text summarization. Now we'll go into n-gram frequency counting, which builds on top of that text pre-processing, and then we'll go through a concept called phrase mining.

Text pre-processed frequency counting, which is basically taking what we've done previously and just counting the frequency of the words, can be considered a way of doing topic detection. It's a naive approach, but it's almost always a proper starting point for understanding what is being talked about. So let's go back to our first example from the first lesson, where pre-processing left us with the set of words love, friday, hate, monday, monday, turn, 21. The resulting frequency counts come out to be monday: 2, love: 1, friday: 1, and so on and so forth.

Now look at this list, and let's pretend that somehow we don't remember the sentence that started all of this. If you see monday (2), love, friday, hate, turn, 21, you might be able to piece it together: they're talking about something involving days of the week; they're probably thinking "hate Monday," because we all hate Mondays; and then something about 21, maybe a birthday, because you see the word turn. You can see how, just from this frequency count, you start to piece together what is being discussed. But it's not perfect, and, spoiler alert, no text analytics method at this point is perfect. We're just going to have to learn various techniques and try to see what works best for any given situation. Still, this one is pretty simple: we take what we had from pre-processing and count the frequency of the tokens, which in this case are just words.

Now there is this concept of n-grams. Remember I said that the tokens we chose previously were just whitespace-separated words. With n-grams, instead of plain whitespace tokenization, you slide a window over the words so that the tokens become multi-word phrases. You see examples here: a unigram (1-gram) is love; a bigram (2-gram) is love friday; a trigram is love friday hate; a 4-gram is love friday hate monday. These would be the tokens that you choose, and so the value of N, if you're using n-grams, is crucial, because it really does affect how you're going to summarize the text in general. And that's the difficulty of the n-gram technique, even though it's a very widely used way to tokenize and examine pre-processed text: it really matters how you choose N, and sometimes a value of N will work for one set of words but won't work for another.
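To make both of these steps concrete, here is a minimal sketch in Python, assuming the token list from our first-lesson example; `Counter` handles the frequency counting, and the small `n_grams` helper is just an illustrative stand-in for an n-gram tokenizer:

```python
from collections import Counter

# Tokens left over after the pre-processing from the first lesson.
tokens = ["love", "friday", "hate", "monday", "monday", "turn", "21"]

# Text pre-processed frequency counting: count each token (unigram).
print(Counter(tokens).most_common())
# [('monday', 2), ('love', 1), ('friday', 1), ('hate', 1), ('turn', 1), ('21', 1)]

def n_grams(tokens, n):
    """Slide a window of size n over the tokens to get multi-word tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

print(n_grams(tokens, 2))  # bigrams:  ('love', 'friday'), ('friday', 'hate'), ...
print(n_grams(tokens, 3))  # trigrams: ('love', 'friday', 'hate'), ...
```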
So now we have a more advanced technique, and this is really cutting-edge, called phrase mining. I'll read this definition verbatim, because this work is, in my mind, incredibly important: "Phrase mining refers to the process of automatic extraction of high-quality phrases (for example, scientific terms and general entity names) in a given corpus (for example, research papers and news). Representing the text with quality phrases instead of n-grams can improve computational models for applications such as information extraction/retrieval, taxonomy construction, and topic modeling."

I do want to mention that this is work from Dr. Shang when he was a PhD student here at the University of Illinois at Urbana-Champaign; it comes out of Jiawei Han's lab, and it's very cutting-edge work. Basically, instead of just using a fixed number N for n-grams, the question is: can we have a more intelligent method that considers various values of N to find quality phrases?

Now, this figure looks a bit complicated, but I'm going to walk us through it, because this is basically how Dr. Shang's phrase mining works. On the left of the figure you have two boxes: massive text corpora and Wikipedia. Let's start with massive text corpora, which just means you have a huge body of text; for example, take a bunch of news articles. And then you have Wikipedia. This is one of the advancements that Dr. Shang brought forward in automated phrase mining: we have this crowd-sourced body of topics called Wikipedia, where people have created entries that could be a topic or a person, for example Barack Obama. There is a Wikipedia entry for Barack Obama with a description of who he is. The thought was that one way to detect quality phrases is to look at the titles of articles on Wikipedia, and this was a major intuitive advancement.

So you take the various phrases coming out of the massive text corpora, which we see in the second box, labeled phrase candidates (frequent n-grams). These are n-grams of differing sizes; this is where your computational cost has to be pretty high, because it is calculating n-grams for various values of N. You have candidates like speaks at, and, skipping around, Barack Obama, Anderson Cooper, US President. You'll find that some of these n-grams have Wikipedia entries and some of them don't, and the ones that do get weighted up as more likely to be quality phrases. So in the middle of the figure you have a positive pool box and a noisy negative pool box. The quality phrases go into the positive pool: Barack Obama, Anderson Cooper, US President. In the noisy negative pool you have speaks at, a town, and Obama administration. Obama administration is an interesting example, because it seems like a quality phrase, but it might not necessarily have a Wikipedia entry. So there is a limitation there, maybe a problem with this method, but again, every method has its pros and cons. Let's stop right there for a moment.
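As a rough illustration of that pooling step, and not Dr. Shang's actual implementation, here is what building the two pools might look like; the `wiki_titles` set is a hypothetical stand-in for a real index of Wikipedia article titles:

```python
# Hypothetical stand-in for an index of Wikipedia article titles.
wiki_titles = {"barack obama", "anderson cooper", "us president"}

# Frequent n-gram candidates mined from the massive text corpora.
candidates = ["speaks at", "barack obama", "anderson cooper",
              "us president", "a town", "obama administration"]

# Candidates with a matching title are trusted as quality phrases;
# the rest fall into the noisy negative pool ("noisy" because phrases
# like "obama administration" may be good despite having no entry).
positive_pool = [c for c in candidates if c in wiki_titles]
noisy_negative_pool = [c for c in candidates if c not in wiki_titles]

print(positive_pool)        # ['barack obama', 'anderson cooper', 'us president']
print(noisy_negative_pool)  # ['speaks at', 'a town', 'obama administration']
```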
Now I'm going to jump to the right side of this figure, to the box that says POS-guided phrasal segmentation, that is, part-of-speech-guided phrasal segmentation. This takes the actual sentences from, let's say, these news articles; the first one is "US President Barack Obama speaks at a town hall meeting with CNN's Anderson Cooper." What's being done here computationally is something very common within data science: there are tools called part-of-speech taggers, or POS taggers, that can identify, within a sentence, what's the noun, what's the verb, what's the adjective, and so on. So what phrase mining is doing on this side, without looking at Wikipedia, is segmenting by parts of speech and weighting the noun sequences very highly, so US President Barack Obama, Anderson Cooper. It then mathematically combines what it finds, and we won't go too far into the mathematics of this, with the positive pool of phrases that are Wikipedia entries.

The result is a confidence score, shown in the middle right of the figure in the box called robust positive-only distant training. There it will say: with 0.9999, or 99.99%, confidence we believe US President is a quality phrase; with 0.98, or 98%, confidence Anderson Cooper is a quality phrase; and so on and so forth. Whereas with 0.3, or 30%, confidence we think speaks at is a quality phrase, and with 0.2, or 20%, confidence a town is a quality phrase. You can see through this phrase mining exercise, which again is very cutting-edge, that you get a much better picture of what is actually being discussed in this body of news articles than you would by choosing an arbitrary value of N and using n-grams.
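If you want to see what a POS tagger actually produces, here is a minimal sketch using NLTK's off-the-shelf tagger, the same toolkit SMILE exposes, on the example sentence from the figure; note that the exact model resource names can vary slightly across NLTK versions:

```python
import nltk

# Fetch the tokenizer and tagger models on first run; resource names
# may differ slightly depending on your NLTK version.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = ("US President Barack Obama speaks at a town hall "
            "meeting with CNN's Anderson Cooper")

# Tag each token with its part of speech, e.g. ('Obama', 'NNP').
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Runs of proper nouns (NNP NNP ...) such as "Barack Obama" are the
# kind of sequences the phrasal segmentation weights highly.
```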
So now what we'll do is switch to the Social Media Microscope, specifically its tool called SMILE, the Social Media Intelligence and Learning Environment, and we will actually try text pre-processed frequency counting as well as automated phrase mining.

>> Now we will see a demonstration of text pre-processing as well as phrase mining within the tool SMILE, from within the Social Media Microscope. Remember, SMILE is the Social Media Intelligence and Learning Environment; it is a general social media analytics tool for ingesting data as well as analyzing that data via data science methods.

The first step is to actually bring in data, so we'll click on Get Started here. The screen we're taken to is a request for citation; if you are conducting research using our tools, we would greatly appreciate that citation. We click on Next, and the first thing we need to do is bring in some data. There are various ways to pull social data into SMILE. One would be via a public API, and we've got three options here, or two options really that are public: you can pull in via the Twitter public API, or you can pull in via the Reddit API. If you have an account with a company like Crimson Hexagon, you could use their private API to pull in data, or if you have a CSV with your own data, you can use that to import data into SMILE. Please note that once your SMILE session is closed, all the data you have pulled in will be deleted, so you can rest assured that any external data you bring in via CSV will be securely deleted once you are done using SMILE.

For this demonstration we will authorize with our Twitter account. One of the things we have tried to do is make it simple to connect to the various public APIs. With Twitter, you see a screen like this. I'm already logged into my Twitter account, which is joeyunnresearch, and it's asking me to authorize the SMILE app to connect to my Twitter account. I will do that. It gives me a key, or a PIN; I take that PIN back into SMILE and submit it, and I am now successfully authorized to pull via Twitter's public API. I'll go with that for now and click on Next, and I'm taken to the search screen for SMILE, in which I can select tweets.

Say I would like to pull recent tweets about a brand, in this case Nike, which I've already pulled previously, but I will run through this just as an example and click on Search. This gives us a preview of some of the tweets currently being posted with regard to the keyword Nike. I can then save the set using a file name such as Nike, click on Save, and that data set will be pulled. Now, I've already pulled that data set for time's sake.

So what I can do next is go to Analytics Tools here and click on Natural Language Preprocessing, since that's the first technique we want to go over in this video. I select my data set, which I already saved previously; that's why you see Nike here. Let me close out this Twitter authorization screen, and you'll see some of the preview tweets regarding Nike. You can see right here that if you wanted to import your own file to do text pre-processing, you could do it right here, but we've got our Nike set that we've already pulled. Then I want to use the pre-processing step of stemming, and I'm going to select, currently the only option, the Natural Language Toolkit (NLTK) part-of-speech tagger. Again, what is different about the Social Media Microscope is that all the tools are open, which means various things, but one of them is that all of our methods have associated papers; you can click on these links to find out more about how each method works, for example specific information from this paper about NLTK.

I will click on Submit, and it goes through the tweet data set of Nike that I had previously pulled, stems it, performs various tokenization techniques, and presents us with a frequency chart. Now I can tag the results, and this works throughout the system; I'll say Nike, and pre-processing, and tag that result. Scrolling down, I can see that the most common tokens being discussed for Nike are Nike, air, Jordan, MSNSB, today, zoom, green, etc. This tool also has a word tree that you can interact with, so you can see how these phrases, or tweets, play out by exploring the word tree. So that's how you run text pre-processing, our natural language pre-processing, via the SMILE tool.

Now I will show you how to run automated phrase mining from SMILE. You go to Analytics Tools, just as I did, and click on Automated Phrase Mining. From here we'll go through a very similar process: you click on Nike, and then, with regard to this method, which is outlined here with the associated paper from Dr. Shang right here, you have to select a parameter called minimum support. Basically, this parameter is how many times a phrase has to occur in the corpus, in this case the data set of tweets, before it is even considered for the list of phrases. We'll set that threshold somewhat low; we'll say 10 here.
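Before clicking Submit, here is an illustrative sketch of what that minimum-support threshold does, using made-up counts rather than real SMILE output; it is simply a frequency cutoff over the candidate phrases:

```python
from collections import Counter

# Made-up phrase counts for illustration; not actual SMILE output.
phrase_counts = Counter({"air jordan": 42, "style code": 17,
                         "nike sb dunk low": 12, "free shipping": 3})

min_support = 10  # a candidate must occur at least this many times

kept = [p for p, c in phrase_counts.items() if c >= min_support]
print(kept)  # ['air jordan', 'style code', 'nike sb dunk low']
```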
When I click on Submit, it's going to ask for my email address, because this phrase mining takes quite a bit of time to process; once it is done, I will be sent an email saying that the phrase mining run is complete and I can check the results. I've already run this, so I'll click on Past Results here at the top right, and then on Automated Phrase Mining, and you'll see right here that I ran AutoPhrase on Nike with a minimum support value of 10.

If I click on this, you'll see the results as well. I did not mention this in the previous pre-processing step, but you can actually download the various outputs of all of these results, for all the algorithms within SMILE, from here. What you see here is a visualization of some of the phrases, as well as a preview of the highest-ranked phrases in this data set: General Motors, Air Jordan, Air Max, release date, style code, Nike SB Dunk Low, and so on. You can download the complete phrase listing right here, or right here a complete list of phrases extracted, and so on and so forth.

So this is how you use SMILE to conduct text pre-processing as well as phrase mining. Thank you.