As a self-proclaimed nerd in high school, I would spend hours binge-watching TED videos. It was the most easily accessible high-quality content on a variety of topics, delivered by everyone from politicians and scientists to comedians and actors. There was something for everyone. I enjoyed TED content so much that I even joined the TEDxNapierBridge chapter based in Chennai to contribute to this wonderful platform.
What is a TED talk?
TED is a non-profit institution that partners with individuals to assist in sharing ideas globally. Today, TED boasts a collection of over 3,000 TED talk videos. Additionally, they add new videos day in and day out, so there’s never a shortage of engaging content.
TED stands for Technology, Entertainment and Design, but the talks explore these three blanket themes through a wide variety of topics.
The goal of this project is to identify the main topics discussed in the 50 most popular TEDx talks.
The data for this project will be the transcripts of the 50 most popular videos on the TED platform; I am going to consider both TED and TEDx talks. Popularity is based on view counts of the videos hosted on the TED website, which is the most popular and accessible way TED talks are consumed.
Data Extraction
The data for this modelling comes from the transcripts of the videos on the TED website. The details captured for each video are listed below, followed by a small sketch of how they could be loaded:
1. The name of the speaker
2. The title of the talk
3. The timestamps and the transcript
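For illustration, here is a minimal sketch of how the extracted transcripts could be loaded into a dataframe. It assumes each transcript was saved locally under a transcripts/ folder as "<speaker> - <title>.txt"; the folder, file naming and the talks dataframe name are assumptions for this sketch, not the exact extraction code used in the project.

import pandas as pd
from pathlib import Path

records = []
for path in Path("transcripts").glob("*.txt"):                # hypothetical local copies of the transcripts
    speaker, title = path.stem.split(" - ", 1)                # assumed "<speaker> - <title>.txt" naming
    records.append({
        "speaker": speaker,
        "title": title,
        "raw_transcript": path.read_text(encoding="utf-8"),   # time-stamped transcript text
    })

talks = pd.DataFrame(records)
print(talks.shape)                                            # expect (50, 3) for the 50 most popular talks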
Data Cleaning
We’ll first look at a sample of a transcript to understand what needs to be cleaned.
Inferences
After the data cleanup, all the talks are combined into a dataframe.
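The cleanup itself could look roughly like this, assuming the hypothetical talks dataframe from the loading sketch above; the exact regex patterns depend on how the transcripts were copied, so treat them as assumptions.

import re

def clean_transcript(raw: str) -> str:
    text = re.sub(r"\d{1,2}:\d{2}", " ", raw)                                # drop time stamps like 12:34
    text = re.sub(r"\((laughter|applause|music)\)", " ", text, flags=re.I)   # drop stage directions
    text = text.lower()                                                      # lower case helps lemmatization later
    text = re.sub(r"[^a-z\s]", " ", text)                                    # keep letters only
    return re.sub(r"\s+", " ", text).strip()                                 # collapse whitespace

talks["transcript"] = talks["raw_transcript"].apply(clean_transcript)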
Now that we have clean data, let’s do some exploratory data analysis to understand our data a little better before we dive into topic modeling. The questions I want to explore are listed below, followed by a plotting sketch:
What is the average word count of the most popular talks’ titles?
What’s the distribution of the talk lengths?
Does speed (number of words per minute) play a part in the popularity of a talk?
Look at a word cloud to find the most popular words in the talks.
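A plotting sketch for these questions, again assuming the hypothetical talks dataframe built earlier; the wordcloud package is an assumption for the last plot.

import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word count of each title
talks["title_words"] = talks["title"].str.split().str.len()
talks["title_words"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("words in title"); plt.ylabel("number of talks"); plt.show()

# Talk length in minutes, read from the last time stamp in the raw transcript
def talk_minutes(raw: str) -> float:
    stamps = re.findall(r"(\d{1,2}):(\d{2})", raw)
    if not stamps:
        return 0.0                                   # no time stamps at all; handled later
    minutes, seconds = stamps[-1]
    return int(minutes) + int(seconds) / 60

talks["duration_min"] = talks["raw_transcript"].apply(talk_minutes)
talks["duration_min"].plot(kind="hist", bins=15)
plt.xlabel("talk length (minutes)"); plt.show()

# Word cloud of the most frequent words across all cleaned transcripts
wc = WordCloud(width=800, height=400, background_color="white")
plt.imshow(wc.generate(" ".join(talks["transcript"])))
plt.axis("off"); plt.show()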
Inferences:
We see that most title lengths are between 3 and 8 words.
The 2 most popular lengths are 3 and 4 words. Surprisingly, 4-word titles are very uncommon.
Some titles do have 8+ words, but those are not as common.
Here we have considered 3 different types of TED talks: the global conference talks, TEDx talks, and also a few TED-Ed talks. There is a prescribed time limit for each format; TED and TEDx talks are supposed to be under 18 minutes (one of the biggest challenges for the curation team). Let’s explore how the talk lengths are actually distributed.
Looking at the dataframe, I see an anomaly.
Some talks have a length of 00:00. The reason is that the transcripts for these talks come as one chunk instead of being separated by time stamps, as is the case with the majority of the talks. In these cases, the length is imputed with the average of all the other talks (minus the 40-minute talk).
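A sketch of that imputation on the hypothetical talks dataframe, keeping the 40-minute interview out of the average:

has_length = talks["duration_min"] > 0
mean_length = talks.loc[has_length & (talks["duration_min"] < 40), "duration_min"].mean()
talks.loc[~has_length, "duration_min"] = mean_length      # fill the 00:00 talks with the average length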
Inferences:
There is one talk that is 40 minutes long. How is that possible? A little digging gives us the reason: the 40-minute talk is not actually a standard TED talk. It’s an interview of Elon Musk by THE Chris Anderson (the head of TED, for the uninformed). This is obviously a special case and can defy the 18-minute time limit.
Most talks are capped at 20 minutes, and surprisingly a lot of talks are around 12-13 minutes long. This could partly be a result of the imputed length values.
Surprisingly, around 4-5 talks are less than 8 minutes long. These could be the TED-Ed talks.
The oration style of the speakers could also be a major contributor to encouraging someone to watch a video (multiple times, even). Some speakers talk fast, some talk slow. I want to assess with the TED talks whether that really is the case.
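A sketch of that check, assuming the rows of the hypothetical talks dataframe are already ordered from most to least viewed:

import matplotlib.pyplot as plt

talks["word_count"] = talks["transcript"].str.split().str.len()
talks["wpm"] = talks["word_count"] / talks["duration_min"]    # words per minute
talks["rank"] = range(1, len(talks) + 1)                      # 1 = most viewed (assumed ordering)

talks.plot(kind="scatter", x="wpm", y="rank")
plt.gca().invert_yaxis()                                      # put the most popular talks at the top
plt.show()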
Inferences:
We see that the number of words per minute or the length of the talk has no effect on the ranking of the talks.
We also see that there are a few talks with extremely low words per minute. The reason could be that there are a couple of visual or musical performances in the top 50; these are likely the talks with very low words per minute. About 1 in 10 of the talks is of this type. It’s good to know that performances have considerable success.
We now move towards the modeling part of the project with an aim to identify distinct themes and topics in the talks.
To derive the topics, I will be doing the following:
Organizing the data for modeling
Building a base model
Model tuning to derive the most coherent topics
The transcripts need to be organized in a format that can be fed to the LDA model.
The format I will use for that is a document-term matrix.
I will first tokenize the transcripts into words and then vectorize them using CountVectorizer.
I will also perform lemmatization of all the words to remove inflection from the text.
I will also strip the transcripts of all stop words.
There are a few libraries and methods to perform lemmatization. Here, inspired by an NLP blog (https://www.machinelearningplus.com/nlp/lemmatization-examples-python/), I want to try various methods to see which one yields the best result.
I’m using a quote from the Do schools kill creativity? talk to compare the various methods.
'There have been three themes running through the conference'
WordNet Lemmatizer
there have been three themes running through the conference
As we can see, hardly any modification is made to the text.
WordNet Lemmatizer with POS
POS stands for Part of Speech. I’m going to try Noun (n), Verb (v) and Adjective (a).
Verb: there have been three themes running through the conference
Noun: there have been three themes running through the conference
WordNet Lemmatization with appropriate POS tag
Here we identify the part of speech of each word in the sentence and use it for lemmatization.
There have be three theme run through the conference
From the above examples it is clear that WordNet lemmatization with the appropriate POS tag is the most effective, so I’ll use it to lemmatize the transcripts. Note that the strings must be converted to lower case for the lemmatization to be effective.
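A sketch of that lemmatization step with nltk, applied to the hypothetical talks dataframe (the download calls are shown as comments in case the resources aren’t installed):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# nltk.download("punkt"); nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag: str) -> str:
    # Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_with_pos(text: str) -> str:
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))   # lower-case first, as noted above
    return " ".join(lemmatizer.lemmatize(token, wordnet_pos(tag)) for token, tag in tagged)

print(lemmatize_with_pos("There have been three themes running through the conference"))
# -> "there have be three theme run through the conference", matching the example above

talks["lemmatized"] = talks["transcript"].apply(lemmatize_with_pos)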
In addition to the document-term matrix, we also need a dictionary that contains the vocabulary of all the terms in the transcripts and their frequencies.
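A sketch of building both, using scikit-learn’s CountVectorizer for the document-term matrix and gensim’s helpers for the corpus and vocabulary mapping (the lemmatized column comes from the earlier sketches):

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from gensim import matutils

cv = CountVectorizer(stop_words=stopwords.words("english"))    # drop the standard English stop words
dtm = cv.fit_transform(talks["lemmatized"])                    # document-term matrix (docs x terms)

corpus = matutils.Sparse2Corpus(dtm, documents_columns=False)  # gensim-style bag-of-words corpus
id2word = {idx: word for word, idx in cv.vocabulary_.items()}  # id -> term mapping for the model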
The models are based on LDA (Latent Dirichlet Allocation).
Latent - Hidden
Dirichlet - a type of probability distribution
Every document is a mixture of topics: a probability distribution over the various topics. Every topic, in turn, is a probability distribution over words.
Working of LDA
Goal: use LDA to learn the topic mix in each document and the word mix in each topic.
Steps:
i. Choose the number of topics; here we start with k=2 (a usual starting point).
ii. Randomly assign each word in each document to one of the topics.
iii. Go through every word and its topic assignment, and look at how often that topic occurs in the document and how often that word occurs in the topic overall. Change the assignment based on these counts.
Perform multiple iterations of this until the topics make sense.
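To make the steps above concrete, here is a toy sketch of that reassignment loop over a list of tokenized documents. It is only an illustration of the idea (a simplified Gibbs-style sampler with small smoothing constants), not how gensim actually fits LDA.

import random
from collections import defaultdict

def toy_lda(docs, k=2, iterations=50, seed=0):
    # docs: list of token lists, e.g. [["life", "universe", ...], ...]
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # ii. randomly assign each word occurrence to one of the k topics
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]                 # how often each topic occurs in each document
    topic_word = [defaultdict(int) for _ in range(k)]   # how often each word occurs in each topic
    topic_total = [0] * k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    alpha, beta = 0.1, 0.01                             # small smoothing constants
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1   # forget the current assignment
                # iii. weight each topic by (topic frequency in this document) x (word frequency in the topic)
                weights = [(doc_topic[d][j] + alpha) * (topic_word[j][w] + beta) / (topic_total[j] + beta * vocab_size)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1
    return doc_topic, topic_word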
Working of gensim
Input : Document-term matrix, Number of topics, number of iterations
Gensim will go through the process of finding the best distribution
Output : The top words in each topic. It’s our responsibility to interpret and see if the results make sense.
Tuning : Terms in the document term matrix, number of topics, number of iterations
I first build a base model with 2 topics and 10 passes.
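In gensim, that base model looks roughly like this, reusing the corpus and id2word built earlier:

from gensim.models import LdaModel

base_lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10, random_state=42)
for topic_id, top_words in base_lda.print_topics():
    print(topic_id, top_words)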
As we see from the base model, the 2 topics aren’t very distinct and there is a lot of overlap between them, so we will next perform model tuning.
The following are the parameters I’m going to tune.
The number of topics.
The number of passes.
The corpus (based on stop words and part-of-speech tags).
I built 8 models and below are the topics and inferences from the models.
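The first four models vary only the first two knobs, so they can be sketched as a small grid over the corpus built earlier (the corpus variants used in the later models are rebuilt separately before re-running this loop):

from gensim.models import LdaModel

for num_topics, passes in [(20, 30), (20, 80), (10, 30), (15, 30)]:
    lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, passes=passes, random_state=42)
    print(f"--- {num_topics} topics, {passes} passes ---")
    for topic in lda.print_topics(num_topics=num_topics, num_words=10):
        print(topic)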
1. The first model had 20 topics and 30 passes.
The above topic classification actually has some distinct themes. We can see one about relationship, infidelity, and affair; clearly this topic is about partnership.
But a lot of the topics still seem very generic. For example, topic 5 doesn’t have a distinct theme:
(5, '0.012*"people" + 0.011*"want" + 0.011*"come" + 0.011*"like" + 0.010*"work" + 0.010*"say" + 0.009*"know" + 0.009*"dont" + 0.009*"need" + 0.009*"time"'),
2. The second model had 20 topics and 80 passes.
Intuitively, it seems that increasing the number of passes would tune the model and provide more distinct topics. In practice, increasing the number of passes hasn’t made the topics more defined; in fact, it has had the opposite effect.
(8, '0.000*"like" + 0.000*"say" + 0.000*"thing" + 0.000*"want" + 0.000*"people" + 0.000*"think" + 0.000*"life" + 0.000*"dont" + 0.000*"make" + 0.000*"know"'),
(16, '0.000*"like" + 0.000*"people" + 0.000*"want" + 0.000*"say" + 0.000*"just" + 0.000*"thing" + 0.000*"think" + 0.000*"know" + 0.000*"im" + 0.000*"youre"'),
3. The third model had 10 topics and 30 passes.
Maybe decreasing the number of topics would cause less overlap, so here I’ve created 10 topics. Decreasing the number of topics hasn’t yielded better results either. Let’s try a middle ground of 15 topics.
'0.014*"like" + 0.012*"say" + 0.012*"think" + 0.011*"people" + 0.010*"know" + 0.010*"just" + 0.009*"thing" + 0.009*"youre" + 0.009*"want" + 0.009*"im"'),
'0.011*"like" + 0.010*"thing" + 0.009*"want" + 0.008*"think" + 0.008*"time" + 0.008*"make" + 0.008*"minute" + 0.008*"really" + 0.007*"know" + 0.007*"people"')
4. The fourth model had 15 topics and 30 passes.
I took the middle ground of 15 topics. These topics are looking a bit better, but we still see a lot of repetition of words that don’t add value.
Words such as: say, just, like, know.
I’ll add these to the stop words and try recreating the models.
(5, '0.019*"north" + 0.014*"korean" + 0.012*"family" + 0.010*"hand" + 0.010*"korea" + 0.008*"just" + 0.008*"like" + 0.007*"little" + 0.007*"pocket" + 0.006*"watch"'),
(1, '0.018*"universe" + 0.014*"just" + 0.014*"life" + 0.012*"trillion" + 0.010*"maybe" + 0.009*"galaxy" + 0.009*"earth" + 0.008*"question" + 0.007*"answer" + 0.006*"know"'),
5. Adding more stop words
The stop words currently being used are from the nltk corpus. I’m adding more stop words iteratively, based on the topics generated.
The final list of additional stop words is:
additional_stop_words=['say','just','like','know','like','know','ok','kb','im','way','em','yeah','thing','things','yours','people','ca','youre','thats']
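The extended list can then be combined with the nltk stop words when rebuilding the document-term matrix, for example:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from gensim import matutils

all_stop_words = stopwords.words("english") + additional_stop_words
cv = CountVectorizer(stop_words=all_stop_words)
dtm = cv.fit_transform(talks["lemmatized"])                    # rebuild the document-term matrix
corpus = matutils.Sparse2Corpus(dtm, documents_columns=False)
id2word = {idx: word for word, idx in cv.vocabulary_.items()}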
(1, '0.030*"room" + 0.022*"number" + 0.022*"youre" + 0.017*"im" + 0.011*"night" + 0.011*"rooms" + 0.009*"numbers" + 0.009*"people" + 0.008*"life" + 0.008*"bus"'),
(2, '0.015*"people" + 0.011*"time" + 0.009*"north" + 0.009*"life" + 0.008*"work" + 0.008*"something" + 0.008*"years" + 0.008*"im" + 0.008*"addiction" + 0.008*"family"'),
6. Perform topic modeling on the nouns in the transcript
Noun: a word (other than a pronoun) used to identify any of a class of people, places, or things ( common noun ), or to name a particular one of these ( proper noun ).
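A sketch of pulling out just the nouns with nltk’s part-of-speech tagger, so the vectorizer, corpus and LDA steps above can be rerun on a nouns-only column of the hypothetical talks dataframe:

import nltk

def nouns_only(text: str) -> str:
    tagged = nltk.pos_tag(nltk.word_tokenize(text))            # Penn Treebank tags
    return " ".join(token for token, tag in tagged if tag.startswith("NN"))

talks["nouns"] = talks["lemmatized"].apply(nouns_only)
# rebuild CountVectorizer / corpus / id2word from talks["nouns"] and rerun LdaModel as before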
Using the nouns in the transcript, we could maybe identify the topics more precisely. We see more defined topics as shown below.
[(0, '0.021*"life" + 0.015*"universe" + 0.014*"earth" + 0.014*"universes" + 0.010*"years" + 0.010*"galaxy" + 0.010*"questions" + 0.008*"planets" + 0.008*"stars" + 0.007*"answers"'),
(1, '0.021*"desire" + 0.015*"world" + 0.012*"time" + 0.012*"sex" + 0.009*"objects" + 0.008*"paper" + 0.007*"need" + 0.007*"place" + 0.007*"question" + 0.007*"partner"'),
7. Nouns only with 10 topics
Recreation of the above model but only taking 10 topics into consideration, to see if the topics can be even tighter.
(0, '0.013*"work" + 0.011*"time" + 0.010*"line" + 0.008*"number" + 0.007*"youre" + 0.007*"bit" + 0.007*"something" + 0.007*"kind" + 0.005*"hats" + 0.005*"ive"'),
(1, '0.016*"addiction" + 0.015*"addicts" + 0.010*"lot" + 0.010*"loads" + 0.010*"water" + 0.010*"drug" + 0.009*"youre" + 0.007*"life" + 0.007*"heroin" + 0.006*"alexander"'),
8. Modeling using Nouns and Adjectives
Adding adjectives to the above model could help strengthen the topics even more. The topics I mined are as follows:
'0.020*"brain" + 0.014*"desire" + 0.010*"time" + 0.008*"world" + 0.008*"sex" + 0.006*"happiness" + 0.006*"question" + 0.006*"body" + 0.006*"energy" + 0.006*"ive"'),
(1, '0.014*"time" + 0.009*"work" + 0.007*"life" + 0.007*"kind" + 0.007*"years" + 0.006*"something" + 0.006*"world" + 0.005*"lot" + 0.005*"sort" + 0.004*"day"'),
From the above 8 models, created by changing various parameters, I feel that Model 7 is the most effective, with the most distinct topics.
This model used only nouns and 10 topics. The themes I interpret from it are:
Work life
Drugs and addiction
Hotels
Human body
Time
World
Life & connections
Relationships
Education
Future
These 10 themes are just loose interpretations of the topics produced, but as we can see from the models above, by iteratively changing parameters we can create more and more distinct topics.