Header image

Introduction

As a self-proclaimed nerd in high school, I would spend hours binge-watching TED videos. It was the most easily accessible source of high-quality content on a variety of topics, delivered by everyone from politicians to scientists to comedians and actors. There was something for everyone. I enjoyed TED content so much that I even joined the TEDxNapierBridge chapter based in Chennai to contribute to this wonderful platform.

What is a TED talk?

TED is a non-profit institution that partners with individuals to assist in sharing ideas globally. Today, TED boasts a collection of over 3,000 TED talk videos. Additionally, they add new videos day in and day out, so there’s never a shortage of engaging content.

TED stands for Technology, Entertainment and Design, but the talks explore these 3 blanket themes through a wide variety of topics.

The goal of this project is to identify the main topics discussed in the 50 most popular TEDx talks.

Data Collection and Cleaning

The data for this project is the transcripts of the 50 most popular videos on the TED platform, considering both TED and TEDx talks. Popularity is based on the view counts of the videos hosted on the TED website, which is the most popular and accessible way TED talks are consumed.

Data Extraction

The data for this modelling comes from the transcripts of the videos on the TED website. The details collected for each video are:

1. The name of the speaker

2. The title of the talk

3. The timestamps and the transcript

Data Cleaning

We’ll first look at a sample of a transcript to understand its structure.

[Image: a sample raw transcript]

Inferences

  1. The text has a lot of escape characters that need to be cleaned.
  2. Many cues that aren’t part of the talk are mentioned within parentheses.
  3. The transcript is partitioned using timestamps that need to be removed, and the text needs to be combined.
  4. Remove all empty lines.
  5. Remove single-quote characters.
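The cleanup steps above can be sketched roughly as follows; the `raw` snippet and the exact regex patterns are illustrative, since the real scraped transcripts may differ slightly in format:

```python
import re

# A hedged sketch of the cleanup steps; `raw` is an illustrative snippet,
# and the exact patterns depend on how the transcripts were scraped.
raw = "00:12\nGood morning. (Laughter)\n\n00:15\nIt\\'s been great.\n"

text = raw.replace('\\', '')               # 1. strip escape characters
text = re.sub(r'\([^)]*\)', '', text)      # 2. drop cues like (Laughter)
text = re.sub(r'\d{1,2}:\d{2}', '', text)  # 3. drop the timestamps
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]  # 4. drop empty lines
clean = ' '.join(lines).replace("'", '')   # 3+5. combine text, drop single quotes
print(clean)  # Good morning. Its been great.
```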

After the data cleanup, all the talks are combined to a dataframe.

[Image: the cleaned talks combined into a dataframe]

Exploratory Data Analysis

Now that we have clean data, let’s do some exploratory data analysis to understand our data a little better before we dive into topic modeling. The questions I want to explore:

  1. What is the average word count of the most popular talks’ titles?

  2. What’s the distribution of the talk lengths?

  3. Does speed (number of words per minute) play a part in the popularity of a talk?

  4. Look at a word cloud to find the most popular words in the talks.

What is the distribution of the talk length:

Here we have considered 3 different types of TED talks: the global conference, TEDx talks and also a few TED-Ed talks. There is a prescribed time limit for each format. TED and TEDx talks are supposed to be under 18 minutes (one of the biggest challenges for the curation team). Let’s explore how the talk lengths are actually distributed.

Looking at the dataframe, I see an anomaly: some talks have a length of 00:00. The reason is that the transcripts for these talks are in one single chunk instead of being separated by timestamps, as is the case with the majority of the talks. In these cases, the value is imputed with the average length of all the talks (minus the 40-minute talk).
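A small sketch of that imputation, deriving each talk’s length from its last timestamp and filling in the average for single-chunk transcripts; the values and names here are illustrative:

```python
# A sketch of deriving talk length from the last timestamp and imputing
# the average for single-chunk transcripts; the values are illustrative.
def to_minutes(ts):
    m, s = ts.split(':')
    return int(m) + int(s) / 60

# Last timestamp seen in each transcript; '00:00' marks a single-chunk one
last_stamps = ['17:45', '12:10', '00:00', '14:20']
lengths = [to_minutes(ts) for ts in last_stamps]

known = [x for x in lengths if x > 0]
average = sum(known) / len(known)  # would also exclude the 40-minute outlier
lengths = [x if x > 0 else average for x in lengths]
print(lengths)
```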

[Figure: distribution of talk lengths]

Inferences:

  1. There is one talk that is 40 minutes long. How is that possible? A little digging gives us the reason: the 40-minute talk is not actually a standard TED talk. It’s an interview of Elon Musk by Chris Anderson, the head of TED. This is obviously a special case and can defy the 18-minute time limit.

  2. Most talks are capped at 20 minutes, and surprisingly a lot of talks are around 12-13 minutes. This could be a result of the imputed length values.

  3. Surprisingly, around 4-5 talks are less than 8 minutes. These could be the TED-Ed talks.

Does speed play a part in the popularity of the talk:

The oration style of speakers could also be a major contributor to encouraging someone to watch a video (multiple times, even). Some speakers talk fast, some talk slow. I want to assess whether that really affects the popularity of TED talks.

[Figure: words per minute vs. talk popularity]

Inferences:

  1. We see that neither the number of words per minute nor the length of the talk has an effect on the ranking of the talks.

  2. We also see that there are a few talks with extremely low words per minute. The reason could be that there are a couple of visual or musical performances in the top 50; these could be the talks with very low words per minute. A tenth of the talks are of this type. It’s good to know that performances enjoy considerable success.

Modeling

We now move towards the modeling part of the project with an aim to identify distinct themes and topics in the talks.

To derive the topics, we will do the following:

  1. Organizing the data for modeling

  2. Building a base model

  3. Model tuning to derive the most coherent topics

Organizing the data

The transcripts need to be organized in a way to be fed to the LDA model.

  1. The format I will use for that is a document-term matrix.

  2. I will first tokenize the transcripts into words and then vectorize them using CountVectorizer.

  3. I will also perform lemmatization of all the words to remove inflection in the text.

  4. I will also strip the transcripts of all stop words.

Lemmatization

There are a few libraries and methods to perform lemmatization. Inspired by an NLP blog (https://www.machinelearningplus.com/nlp/lemmatization-examples-python/), I want to try various methods to see which yields the best result.

I’m using a quote from the talk “Do schools kill creativity?” to compare the various methods.

 'There have been three themes running through the conference' 

WordNet Lemmatizer

 there have been three themes running through the conference 

As we can see, there isn’t much of any modification done to the text.

WordNet Lemmatizer with POS

POS stands for Part of Speech. I’m going to try Noun (n), Verb (v) and Adjective (a).

Verb:
 there have been three themes running through the conference 
Noun:
 there have been three themes running through the conference 

WordNet Lemmatization with appropriate POS tag

Here we identify the part of speech of each word in the sentence and use it for lemmatization.

 There have be three theme run through the conference 

From the above examples it is clear that WordNet lemmatization with the appropriate POS tag is the most effective, so I use it to lemmatize the transcripts. The strings must also be converted to lower case for the lemmatization to work well.

Dictionary

In addition to the document-term matrix, we also need a dictionary that contains the vocabulary of all the terms in the transcripts and their frequencies.

Base Model

The models are based on LDA - Latent Dirichlet Allocation

Latent - hidden

Dirichlet - a type of probability distribution

Every document is a probability distribution over various topics, and every topic is a probability distribution over words.

Working of LDA

Goal : We want to use LDA to learn the topic mix in each document and the word mix in each topic
Steps:
i. Choose the number of topics (we usually start with k=2)
ii. Randomly assign each word in a document to one of the topics
iii. Go through every word and its topic assignment. Look at
        how often the topic occurs in the document, and
        how often the word occurs in the topic overall,
    and change the assignment based on the results.

Perform multiple iterations of this until the topics make sense
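The assignment-and-update loop above can be sketched as a toy in plain Python. Note this is a greedy simplification for intuition only; real LDA (including gensim’s) uses Dirichlet priors with sampling or variational inference, and `docs` here is illustrative:

```python
import random
from collections import defaultdict

random.seed(0)
k = 2  # step i: choose the number of topics
docs = [
    ['school', 'child', 'teacher', 'school', 'creativity'],
    ['brain', 'neuron', 'brain', 'memory', 'child'],
]

# Step ii: randomly assign each word occurrence to a topic
assignments = [[random.randrange(k) for _ in doc] for doc in docs]

for _ in range(20):  # repeat until the topics stabilise
    # Step iii: count how often each topic occurs per document and per word
    doc_topic = [defaultdict(int) for _ in docs]
    word_topic = defaultdict(lambda: defaultdict(int))
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            word_topic[word][t] += 1
    # Reassign each word to the topic that is frequent both in its document
    # and for that word overall (a greedy stand-in for the sampling step)
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            scores = [(doc_topic[d][t] + 1) * (word_topic[word][t] + 1)
                      for t in range(k)]
            assignments[d][i] = scores.index(max(scores))

print(assignments)
```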

Working of gensim

Input : Document-term matrix, Number of topics, number of iterations

Gensim will go through the process of finding the best distribution

Output : The top words in each topic. It’s our responsibility to interpret and see if the results make sense.

Tuning : Terms in the document term matrix, number of topics, number of iterations

I first build a base model with 2 topics and 10 passes.

[Figure: topics from the base model]

As we see from the base model, the 2 topics aren’t very distinct and there is a lot of overlap between them. So we will next perform model tuning.

Model Tuning

The following are the parameters I’m going to tune.

  1. The number of topics.

  2. The number of passes

  3. The corpus - based on stop words and Parts of speech tags

I built 8 models and below are the topics and inferences from the models.

1. The first model had 20 topics and 30 passes.

The above topic classification actually has some distinct themes. We can see one about relationship, infidelity and affair; clearly this is about partnership.

But a lot of the topics still seem very generic. For example, topic 5 doesn’t have a distinct theme:

(5, '0.012*"people" + 0.011*"want" + 0.011*"come" + 0.011*"like" + 0.010*"work" + 0.010*"say" + 0.009*"know" + 0.009*"dont" + 0.009*"need" + 0.009*"time"'),

2. The second model had 20 topics and 80 passes.

Intuitively, it seems that increasing the number of passes would tune the model and provide more distinct topics. But increasing the number of passes hasn’t actually made the topics more defined. In fact, it has had the opposite effect.

(8, '0.000*"like" + 0.000*"say" + 0.000*"thing" + 0.000*"want" + 0.000*"people" + 0.000*"think" + 0.000*"life" + 0.000*"dont" + 0.000*"make" + 0.000*"know"'),

(16, '0.000*"like" + 0.000*"people" + 0.000*"want" + 0.000*"say" + 0.000*"just" + 0.000*"thing" + 0.000*"think" + 0.000*"know" + 0.000*"im" + 0.000*"youre"'),

3. The third model had 10 topics and 30 passes.

Maybe decreasing the number of topics would cause less overlap, so here I’ve created 10 topics. Decreasing the number of topics hasn’t yielded better results either. Let’s try a middle ground of 15 topics.

'0.014*"like" + 0.012*"say" + 0.012*"think" + 0.011*"people" + 0.010*"know" + 0.010*"just" + 0.009*"thing" + 0.009*"youre" + 0.009*"want" + 0.009*"im"'),

'0.011*"like" + 0.010*"thing" + 0.009*"want" + 0.008*"think" + 0.008*"time" + 0.008*"make" + 0.008*"minute" + 0.008*"really" + 0.007*"know" + 0.007*"people"')

4. The fourth model had 15 topics and 30 passes.

I took the middle ground of 15 topics. These topics are looking a bit better, but we still see a lot of repetition of words that don’t add value.

Words such as:

say, just, like, know

I’ll add these to the stop words and try recreating the models.

(5, '0.019*"north" + 0.014*"korean" + 0.012*"family" + 0.010*"hand" + 0.010*"korea" + 0.008*"just" + 0.008*"like" + 0.007*"little" + 0.007*"pocket" + 0.006*"watch"'),

(1, '0.018*"universe" + 0.014*"just" + 0.014*"life" + 0.012*"trillion" + 0.010*"maybe" + 0.009*"galaxy" + 0.009*"earth" + 0.008*"question" + 0.007*"answer" + 0.006*"know"'),

5. Adding more stop words

The stop words currently being used are part of the nltk corpus. I’m adding stop words iteratively based on the topics generated.

The final list of stop words are :

additional_stop_words=['say','just','like','know','like','know','ok','kb','im','way','em','yeah','thing','things','yours','people','ca','youre','thats']

(1, '0.030*"room" + 0.022*"number" + 0.022*"youre" + 0.017*"im" + 0.011*"night" + 0.011*"rooms" + 0.009*"numbers" + 0.009*"people" + 0.008*"life" + 0.008*"bus"'),

(2, '0.015*"people" + 0.011*"time" + 0.009*"north" + 0.009*"life" + 0.008*"work" + 0.008*"something" + 0.008*"years" + 0.008*"im" + 0.008*"addiction" + 0.008*"family"'),

6. Perform topic modeling on the nouns in the transcript

Noun:
a word (other than a pronoun) used to identify any of a class of people, places, or things ( common noun ), or to name a particular one of these ( proper noun ).

Using only the nouns in the transcripts, we could perhaps identify the topics more precisely. We do see more defined topics, as shown below.

[(0, '0.021*"life" + 0.015*"universe" + 0.014*"earth" + 0.014*"universes" + 0.010*"years" + 0.010*"galaxy" + 0.010*"questions" + 0.008*"planets" + 0.008*"stars" + 0.007*"answers"'),

(1, '0.021*"desire" + 0.015*"world" + 0.012*"time" + 0.012*"sex" + 0.009*"objects" + 0.008*"paper" + 0.007*"need" + 0.007*"place" + 0.007*"question" + 0.007*"partner"'),

7. Nouns only with 10 topics

Recreation of the above model but only taking 10 topics into consideration, to see if the topics can be even tighter.

(0, '0.013*"work" + 0.011*"time" + 0.010*"line" + 0.008*"number" + 0.007*"youre" + 0.007*"bit" + 0.007*"something" + 0.007*"kind" + 0.005*"hats" + 0.005*"ive"'),

(1, '0.016*"addiction" + 0.015*"addicts" + 0.010*"lot" + 0.010*"loads" + 0.010*"water" + 0.010*"drug" + 0.009*"youre" + 0.007*"life" + 0.007*"heroin" + 0.006*"alexander"'),

8. Modeling using Nouns and Adjectives

Adding adjectives to the above model could help strengthen the topics even more. The topics I mined are as follows:

(0, '0.020*"brain" + 0.014*"desire" + 0.010*"time" + 0.008*"world" + 0.008*"sex" + 0.006*"happiness" + 0.006*"question" + 0.006*"body" + 0.006*"energy" + 0.006*"ive"'),

(1, '0.014*"time" + 0.009*"work" + 0.007*"life" + 0.007*"kind" + 0.007*"years" + 0.006*"something" + 0.006*"world" + 0.005*"lot" + 0.005*"sort" + 0.004*"day"'),

Conclusion

From the 8 models created by changing various parameters, I feel Model 7 is the most effective, with as many distinct topics as possible.

This model used only nouns and 10 topics.

[Figure: topics from Model 7 (nouns only, 10 topics)]

  1. Work life

  2. Drugs and addiction

  3. Hotels

  4. Human body

  5. Time

  6. World

  7. Life & connections

  8. Relationships

  9. Education

  10. Future

These 10 topics are just loose interpretations of the topics produced, but as we can see from the models above, by iteratively changing parameters we can create more and more distinct topics.