Learn How to Investigate the 10,000 Movies Dataset (TMDb)

1. Introduction

Dataset Description

This data set contains information about roughly 10,000 movies (10,866 rows before cleaning) collected from The Movie Database (TMDb), including user ratings and revenue.

Notes:
Certain columns, such as ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters. There are some odd characters in the ‘cast’ column. The final two columns, ending with “_adj”, show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.

Column names:

  • id
  • imdb_id
  • popularity
  • budget
  • revenue
  • original_title
  • cast
  • homepage
  • director
  • tagline
  • keywords
  • overview
  • runtime
  • genres
  • production_companies
  • release_date
  • vote_count
  • vote_average
  • release_year
  • budget_adj
  • revenue_adj

Questions for Analysis

Who is the actor with the most movies?

Who is the director with the most movies?

Who is the actor whose movies make the most money?

Who is the director whose movies make the most money?

Which genres are most popular from year to year?

In [1]:
# All used packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# Function to split columns containing several values.
def spl(df, split_c, y):
    """Split each column in split_c on separator y into numbered columns (cast0, cast1, ...)."""

    new_df = pd.DataFrame()

    # Apply the split to each column.
    for c in split_c:
        df_s = df[c].str.split(y, expand=True)

        # Rename the new columns: source column name plus piece number.
        df_s = df_s.add_prefix(c)

        # Merge the new columns on the row index.
        new_df = new_df.merge(df_s, how='outer', left_index=True, right_index=True)

    # Remove the old columns (those containing several values).
    df = df.drop(split_c, axis=1)

    # Merge the new columns with the rest of the dataframe.
    new_df = df.merge(new_df, how='outer', left_index=True, right_index=True)

    return new_df
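
To make the effect of the split step concrete, here is a minimal sketch on a made-up two-row frame (the titles and genres are invented for illustration and are not rows from tmdb-movies.csv). It uses the same pandas calls as the function above, `str.split(..., expand=True)` followed by `add_prefix`:

```python
import pandas as pd

# Toy data, made up for illustration (not rows from tmdb-movies.csv).
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Drama", "Comedy"],
})

# One new column per piece; rows with fewer pieces get missing values.
parts = toy["genres"].str.split("|", expand=True).add_prefix("genres")

# Swap the original column for the numbered ones.
wide = toy.drop(columns="genres").join(parts)
print(wide)
```

The missing values in the later numbered columns are why the non-null counts shrink from one numbered column to the next after the split.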
In [3]:
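# Function to reshape several columns against one key column.
# Returns (df_s, df_f): df_s is the stacked view, df_f holds all values in a single column.

def comp(dataframe, multiple_columns, compare_with):

    a = multiple_columns
    b = compare_with
    x = dataframe

    d_a = x.loc[:, a]
    d_b = x.loc[:, [b]]

    # Put the key column next to the value columns and use it as the index.
    df_m = d_b.merge(d_a, how='outer', left_index=True, right_index=True)
    df_m = df_m.set_index(b)

    # Stacked view: one row per (key, source column) pair.
    df_s = pd.DataFrame(df_m.stack())

    # Single-column view: the value columns placed end to end.
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    df_f = pd.DataFrame(pd.concat([df_m[n] for n in a]))
    return df_s, df_f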
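
The reshaping above turns "wide" numbered columns into one long column so that values can be counted. As a hedged alternative sketch (not the approach used in this notebook), `DataFrame.melt` gives a similar long table in one call. The frame, the actor names and the `slot`/`actor` column names below are invented for illustration:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "cast0": ["Actor X", "Actor Y"],
    "cast1": ["Actor Y", None],
})

# Wide to long: one row per (movie, cast slot); empty slots are dropped.
long_df = toy.melt(
    id_vars="original_title",
    value_vars=["cast0", "cast1"],
    var_name="slot",      # hypothetical column name
    value_name="actor",   # hypothetical column name
).dropna(subset=["actor"])

# Number of movies per actor.
counts = long_df["actor"].value_counts()
print(counts)
```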

2. Data Wrangling

General Properties

In [4]:
# Load data
df = pd.read_csv('tmdb-movies.csv')
df.head(10)
Out[4]:
       id    imdb_id  popularity     budget     revenue  original_title
0  135397  tt0369610   32.985763  150000000  1513528810  Jurassic World
1   76341  tt1392190   28.419936  150000000   378436354  Mad Max: Fury Road
2  262500  tt2908446   13.112507  110000000   295238201  Insurgent
3  140607  tt2488496   11.173104  200000000  2068178225  Star Wars: The Force Awakens
4  168259  tt2820852    9.335014  190000000  1506249360  Furious 7
5  281957  tt1663202    9.110700  135000000   532950503  The Revenant
6   87101  tt1340138    8.654359  155000000   440603537  Terminator Genisys
7  286217  tt3659388    7.667400  108000000   595380321  The Martian
8  211672  tt2293640    7.404165   74000000  1156730962  Minions
9  150540  tt2096673    6.326804  175000000   853708609  Inside Out

   cast                                               homepage
0  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  http://www.jurassicworld.com/
1  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  http://www.madmaxmovie.com/
2  Shailene Woodley|Theo James|Kate Winslet|Ansel...  http://www.thedivergentseries.movie/#insurgent
3  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...  http://www.starwars.com/films/star-wars-episod...
4  Vin Diesel|Paul Walker|Jason Statham|Michelle ...  http://www.furious7.com/
5  Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn...  http://www.foxmovies.com/movies/the-revenant
6  Arnold Schwarzenegger|Jason Clarke|Emilia Clar...  http://www.terminatormovie.com/
7  Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ...  http://www.foxmovies.com/movies/the-martian
8  Sandra Bullock|Jon Hamm|Michael Keaton|Allison...  http://www.minionsmovie.com/
9  Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha...  http://movies.disney.com/inside-out

   director                     tagline                                          ...  overview
0  Colin Trevorrow              The park is open.                                ...  Twenty-two years after the events of Jurassic ...
1  George Miller                What a Lovely Day.                               ...  An apocalyptic story set in the furthest reach...
2  Robert Schwentke             One Choice Can Destroy You                       ...  Beatrice Prior must confront her inner demons ...
3  J.J. Abrams                  Every generation has a story.                    ...  Thirty years after defeating the Galactic Empi...
4  James Wan                    Vengeance Hits Home                              ...  Deckard Shaw seeks revenge against Dominic Tor...
5  Alejandro González Iñárritu  (n. One who has returned, as if from the dead.)  ...  In the 1820s, a frontiersman, Hugh Glass, sets...
6  Alan Taylor                  Reset the future                                 ...  The year is 2029. John Connor, leader of the r...
7  Ridley Scott                 Bring Him Home                                   ...  During a manned mission to Mars, Astronaut Mar...
8  Kyle Balda|Pierre Coffin     Before Gru, they had a history of bad bosses     ...  Minions Stuart, Kevin and Bob are recruited by...
9  Pete Docter                  Meet the little voices inside your head.         ...  Growing up can be a bumpy road, and it's no ex...

   runtime  genres                                     production_companies
0      124  Action|Adventure|Science Fiction|Thriller  Universal Studios|Amblin Entertainment|Legenda...
1      120  Action|Adventure|Science Fiction|Thriller  Village Roadshow Pictures|Kennedy Miller Produ...
2      119  Adventure|Science Fiction|Thriller         Summit Entertainment|Mandeville Films|Red Wago...
3      136  Action|Adventure|Science Fiction|Fantasy   Lucasfilm|Truenorth Productions|Bad Robot
4      137  Action|Crime|Thriller                      Universal Pictures|Original Film|Media Rights ...
5      156  Western|Drama|Adventure|Thriller           Regency Enterprises|Appian Way|CatchPlay|Anony...
6      125  Science Fiction|Action|Thriller|Adventure  Paramount Pictures|Skydance Productions
7      141  Drama|Adventure|Science Fiction            Twentieth Century Fox Film Corporation|Scott F...
8       91  Family|Animation|Adventure|Comedy          Universal Pictures|Illumination Entertainment
9       94  Comedy|Animation|Family                    Walt Disney Pictures|Pixar Animation Studios|W...

   release_date  vote_count  vote_average  release_year    budget_adj   revenue_adj
0        6/9/15        5562           6.5          2015  1.379999e+08  1.392446e+09
1       5/13/15        6185           7.1          2015  1.379999e+08  3.481613e+08
2       3/18/15        2480           6.3          2015  1.012000e+08  2.716190e+08
3      12/15/15        5292           7.5          2015  1.839999e+08  1.902723e+09
4        4/1/15        2947           7.3          2015  1.747999e+08  1.385749e+09
5      12/25/15        3929           7.2          2015  1.241999e+08  4.903142e+08
6       6/23/15        2598           5.8          2015  1.425999e+08  4.053551e+08
7       9/30/15        4572           7.6          2015  9.935996e+07  5.477497e+08
8       6/17/15        2893           6.5          2015  6.807997e+07  1.064192e+09
9        6/9/15        3935           8.0          2015  1.609999e+08  7.854116e+08

10 rows × 21 columns

In [5]:
df.shape
Out[5]:
(10866, 21)
In [6]:
df.nunique()
Out[6]:
id                      10865
imdb_id                 10855
popularity              10814
budget                    557
revenue                  4702
original_title          10571
cast                    10719
homepage                 2896
director                 5067
tagline                  7997
keywords                 8804
overview                10847
runtime                   247
genres                   2039
production_companies     7445
release_date             5909
vote_count               1289
vote_average               72
release_year               56
budget_adj               2614
revenue_adj              4840
dtype: int64
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [8]:
df.describe()
Out[8]:
                  id     popularity         budget        revenue        runtime
count   10866.000000   10866.000000   1.086600e+04   1.086600e+04   10866.000000
mean    66064.177434       0.646441   1.462570e+07   3.982332e+07     102.070863
std     92130.136561       1.000185   3.091321e+07   1.170035e+08      31.381405
min         5.000000       0.000065   0.000000e+00   0.000000e+00       0.000000
25%     10596.250000       0.207583   0.000000e+00   0.000000e+00      90.000000
50%     20669.000000       0.383856   0.000000e+00   0.000000e+00      99.000000
75%     75610.000000       0.713817   1.500000e+07   2.400000e+07     111.000000
max    417859.000000      32.985763   4.250000e+08   2.781506e+09     900.000000

          vote_count   vote_average   release_year     budget_adj    revenue_adj
count   10866.000000   10866.000000   10866.000000   1.086600e+04   1.086600e+04
mean      217.389748       5.974922    2001.322658   1.755104e+07   5.136436e+07
std       575.619058       0.935142      12.812941   3.430616e+07   1.446325e+08
min        10.000000       1.500000    1960.000000   0.000000e+00   0.000000e+00
25%        17.000000       5.400000    1995.000000   0.000000e+00   0.000000e+00
50%        38.000000       6.000000    2006.000000   0.000000e+00   0.000000e+00
75%       145.750000       6.600000    2011.000000   2.085325e+07   3.369710e+07
max      9767.000000       9.200000    2015.000000   4.250000e+08   2.827124e+09
In [9]:
df.duplicated().sum()
Out[9]:
1
In [10]:
df[df.duplicated()]
Out[10]:
         id    imdb_id  popularity    budget  revenue  original_title
2090  42194  tt0411951     0.59643  30000000   967000  TEKKEN

      cast                                               homepage
2090  Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...  NaN

      director          tagline              ...  overview
2090  Dwight H. Little  Survival is no game  ...  In the year of 2039, after World Wars destroy ...

      runtime  genres                                       production_companies
2090       92  Crime|Drama|Action|Thriller|Science Fiction  Namco|Light Song Films

      release_date  vote_count  vote_average  release_year  budget_adj  revenue_adj
2090       3/20/10         110           5.0          2010  30000000.0     967000.0

1 rows × 21 columns

In [11]:
df.isnull().sum(1).sum()
Out[11]:
13434
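
The single total above does not show where the missing values are. According to the `df.info()` output, `homepage` alone has 7,930 of the 13,434 nulls (10,866 rows minus 2,936 non-null values). A minimal sketch of a per-column breakdown, using a made-up three-row frame rather than the real file:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "homepage": [None, "http://example.com", None],
    "tagline": ["A tagline", None, "Another tagline"],
    "runtime": [120, 95, 101],
})

# Nulls per column, largest first.
nulls_per_column = toy.isnull().sum().sort_values(ascending=False)

# Summing the per-column counts gives the same total as sum(1).sum().
total_nulls = int(nulls_per_column.sum())
print(nulls_per_column)
print(total_nulls)
```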
In [12]:
df.hist(figsize=(8,8));

Observations:

  1. Several columns are not needed for our analysis and can be removed:

    • id
    • imdb_id
    • homepage
    • tagline
    • keywords
    • overview
    • release_date
  2. There is one duplicated row, which needs to be dropped.

  3. Many rows contain NaN values and need to be dropped.

  4. The histograms show many 0 values (for example in budget and revenue). These would distort the results, so we drop every row that contains a 0.

  5. Several columns contain multiple values separated by "|", so we need to split them:

    • cast
    • director
    • keywords
    • genres
    • production_companies

    Of these, only cast, genres and production_companies are split in the cleaning step below; keywords is removed in step 1 and director is left as is.

Data Cleaning

In [13]:
# Remove non-important columns

df=df.drop(['id', 'imdb_id', 'homepage',  'tagline', 'keywords', 'overview','release_date'],axis=1)
df.head(1)
Out[13]:
   popularity     budget     revenue  original_title
0   32.985763  150000000  1513528810  Jurassic World

   cast                                               director         runtime
0  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124

   genres                                     production_companies
0  Action|Adventure|Science Fiction|Thriller  Universal Studios|Amblin Entertainment|Legenda...

   vote_count  vote_average  release_year    budget_adj   revenue_adj
0        5562           6.5          2015  1.379999e+08  1.392446e+09
In [14]:
# Remove duplicated rows.

df=df.drop(df[df.duplicated()].index)
df.duplicated().sum()
Out[14]:
0
In [15]:
# Remove rows with null values.
df=df.dropna()
df.isnull().sum(1).sum()
Out[15]:
0
In [16]:
# Remove rows with 0 values.
df=df[df!=0]
df=df.dropna()
df.describe()
Out[16]:
          popularity         budget        revenue        runtime     vote_count
count    3805.000000   3.805000e+03   3.805000e+03    3805.000000    3805.000000
mean        1.203784   3.760800e+07   1.089734e+08     109.351117     534.159001
std         1.480569   4.232179e+07   1.772976e+08      19.845678     883.757588
min         0.010335   1.000000e+00   2.000000e+00      15.000000      10.000000
25%         0.470651   1.000000e+07   1.433379e+07      96.000000      74.000000
50%         0.810805   2.500000e+07   4.621664e+07     106.000000     209.000000
75%         1.387163   5.000000e+07   1.260695e+08     119.000000     584.000000
max        32.985763   4.250000e+08   2.781506e+09     338.000000    9767.000000

        vote_average   release_year     budget_adj    revenue_adj
count    3805.000000    3805.000000   3.805000e+03   3.805000e+03
mean        6.170565    2001.229172   4.471977e+07   1.387159e+08
std         0.792437      11.329616   4.488697e+07   2.169973e+08
min         2.200000    1960.000000   9.693980e-01   2.370705e+00
25%         5.700000    1995.000000   1.354637e+07   1.925371e+07
50%         6.200000    2004.000000   3.038360e+07   6.284688e+07
75%         6.700000    2010.000000   6.084153e+07   1.658054e+08
max         8.400000    2015.000000   4.250000e+08   2.827124e+09
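
Note that `df[df != 0]` masks zeros in every column, so a row is dropped if any column holds a 0. If a zero were a legitimate value in some column, the check could be restricted to the columns where 0 means "missing". The sketch below assumes this applies to `budget` and `revenue`; the `is_sequel` column is hypothetical and exists only to show the difference:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; is_sequel is a hypothetical column
# in which 0 is a valid value.
toy = pd.DataFrame({
    "budget": [100, 0, 50],
    "revenue": [300, 20, 0],
    "is_sequel": [0, 1, 0],
})

# Approach used above: any zero anywhere removes the row.
strict = toy[toy != 0].dropna()

# Alternative: treat zeros as missing only in the selected columns.
money_cols = ["budget", "revenue"]
cleaned = toy.copy()
cleaned[money_cols] = cleaned[money_cols].replace(0, np.nan)
cleaned = cleaned.dropna(subset=money_cols)

print(len(strict), len(cleaned))
```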
In [17]:
# Apply split function:

df_n=spl( df ,  ['cast', 'genres', 'production_companies'] , '|')
df_n.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3805 entries, 0 to 10848
Data columns (total 26 columns):
popularity               3805 non-null float64
budget                   3805 non-null float64
revenue                  3805 non-null float64
original_title           3805 non-null object
director                 3805 non-null object
runtime                  3805 non-null float64
vote_count               3805 non-null int64
vote_average             3805 non-null float64
release_year             3805 non-null int64
budget_adj               3805 non-null float64
revenue_adj              3805 non-null float64
cast0                    3805 non-null object
cast1                    3802 non-null object
cast2                    3802 non-null object
cast3                    3794 non-null object
cast4                    3776 non-null object
genres0                  3805 non-null object
genres1                  3169 non-null object
genres2                  2089 non-null object
genres3                  862 non-null object
genres4                  255 non-null object
production_companies0    3805 non-null object
production_companies1    2924 non-null object
production_companies2    1972 non-null object
production_companies3    1188 non-null object
production_companies4    692 non-null object
dtypes: float64(7), int64(2), object(17)
memory usage: 962.6+ KB

Exploratory Data Analysis

Q1 Who is the actor with the most movies?

In [18]:
# Apply combine Function to combine multiple columns into different shapes.

a,b = comp( df_n , ['cast0', 'cast1', 'cast2', 'cast3', 'cast4'] , 'original_title')
In [19]:
# Fast look
a
Out[19]:
                                        0
original_title
Jurassic World                   cast0  Chris Pratt
                                 cast1  Bryce Dallas Howard
                                 cast2  Irrfan Khan
                                 cast3  Vincent D'Onofrio
                                 cast4  Nick Robinson
Mad Max: Fury Road               cast0  Tom Hardy
                                 cast1  Charlize Theron
                                 cast2  Hugh Keays-Byrne
                                 cast3  Nicholas Hoult
                                 cast4  Josh Helman
Insurgent                        cast0  Shailene Woodley
                                 cast1  Theo James
                                 cast2  Kate Winslet
                                 cast3  Ansel Elgort
                                 cast4  Miles Teller
Star Wars: The Force Awakens     cast0  Harrison Ford
                                 cast1  Mark Hamill
                                 cast2  Carrie Fisher
                                 cast3  Adam Driver
                                 cast4  Daisy Ridley
Furious 7                        cast0  Vin Diesel
                                 cast1  Paul Walker
                                 cast2  Jason Statham
                                 cast3  Michelle Rodriguez
                                 cast4  Dwayne Johnson
The Revenant                     cast0  Leonardo DiCaprio
                                 cast1  Tom Hardy
                                 cast2  Will Poulter
                                 cast3  Domhnall Gleeson
                                 cast4  Paul Anderson
...                              ...    ...
Watership Down                   cast0  John Hurt
                                 cast1  Richard Briers
                                 cast2  Michael Graham Cox
                                 cast3  John Bennett
                                 cast4  Simon Cadell
Who's Afraid of Virginia Woolf?  cast0  Elizabeth Taylor
                                 cast1  Richard Burton
                                 cast2  George Segal
                                 cast3  Sandy Dennis
                                 cast4  Agnes Flanagan
Torn Curtain                     cast0  Paul Newman
                                 cast1  Julie Andrews
                                 cast2  Lila Kedrova
                                 cast3  Hansjörg Felmy
                                 cast4  Tamara Toumanova
El Dorado                        cast0  John Wayne
                                 cast1  Robert Mitchum
                                 cast2  James Caan
                                 cast3  Charlene Holt
                                 cast4  Paul Fix
The Sand Pebbles                 cast0  Steve McQueen
                                 cast1  Richard Attenborough
                                 cast2  Richard Crenna
                                 cast3  Candice Bergen
                                 cast4  Emmanuelle Arsan
Fantastic Voyage                 cast0  Stephen Boyd
                                 cast1  Raquel Welch
                                 cast2  Edmond O'Brien
                                 cast3  Donald Pleasence
                                 cast4  Arthur O'Connell

18979 rows × 1 columns

In [20]:
# The answer:
b[0].value_counts().sort_values(ascending=False).head(1)
Out[20]:
Robert De Niro    52
Name: 0, dtype: int64
In [21]:
# The top 20:
T_act = b[0].value_counts().sort_values(ascending=False).head(20)
T_act
Out[21]:
Robert De Niro           52
Bruce Willis             46
Samuel L. Jackson        44
Nicolas Cage             43
Matt Damon               36
Johnny Depp              35
Morgan Freeman           34
Brad Pitt                34
Tom Hanks                34
Harrison Ford            34
Sylvester Stallone       34
Tom Cruise               33
Eddie Murphy             32
Denzel Washington        32
Liam Neeson              31
Julianne Moore           30
Owen Wilson              30
Arnold Schwarzenegger    29
Robin Williams           29
Meryl Streep             29
Name: 0, dtype: int64
In [22]:
# visualization to the top 20 actors using Bar Charts:

T_act.plot(kind="bar")
plt.title("the top actors")
plt.xlabel("Actor Name")
plt.ylabel("Number of movies")
Out[22]:
Text(0,0.5,'Number of movies')

The top four actors (43 to 52 movies each) stand clearly above the rest of the top 20, who have 36 movies or fewer.

Q2 Who is the director with the most movies?

In [23]:
# Apply combine Function.

a,b = comp( df_n , ['director'] , 'original_title')
In [24]:
# Fast look
a
Out[24]:
                                                 0
original_title
Jurassic World                         director  Colin Trevorrow
Mad Max: Fury Road                     director  George Miller
Insurgent                              director  Robert Schwentke
Star Wars: The Force Awakens           director  J.J. Abrams
Furious 7                              director  James Wan
The Revenant                           director  Alejandro González Iñárritu
Terminator Genisys                     director  Alan Taylor
The Martian                            director  Ridley Scott
Minions                                director  Kyle Balda|Pierre Coffin
Inside Out                             director  Pete Docter
Spectre                                director  Sam Mendes
Jupiter Ascending                      director  Lana Wachowski|Lilly Wachowski
Ex Machina                             director  Alex Garland
Pixels                                 director  Chris Columbus
Avengers: Age of Ultron                director  Joss Whedon
The Hateful Eight                      director  Quentin Tarantino
Taken 3                                director  Olivier Megaton
Ant-Man                                director  Peyton Reed
Cinderella                             director  Kenneth Branagh
The Hunger Games: Mockingjay - Part 2  director  Francis Lawrence
Tomorrowland                           director  Brad Bird
Southpaw                               director  Antoine Fuqua
San Andreas                            director  Brad Peyton
Fifty Shades of Grey                   director  Sam Taylor-Johnson
The Big Short                          director  Adam McKay
Mission: Impossible - Rogue Nation     director  Christopher McQuarrie
Ted 2                                  director  Seth MacFarlane
Kingsman: The Secret Service           director  Matthew Vaughn
Spotlight                              director  Tom McCarthy
Maze Runner: The Scorch Trials         director  Wes Ball
...                                    ...       ...
The Sound of Music                     director  Robert Wise
Doctor Zhivago                         director  David Lean
Those Magnificent Men in Their Flying Machines or How I Flew from London to Paris in 25 hours 11 minutes  director  Ken Annakin
The Greatest Story Ever Told           director  George Stevens
On Her Majesty's Secret Service        director  Peter R. Hunt
Butch Cassidy and the Sundance Kid     director  George Roy Hill
Midnight Cowboy                        director  John Schlesinger
The Wild Bunch                         director  Sam Peckinpah
Grease                                 director  Randal Kleiser
Jaws 2                                 director  Jeannot Szwarc
Dawn of the Dead                       director  George A. Romero
Superman                               director  Richard Donner
Halloween                              director  John Carpenter
Animal House                           director  John Landis
The Deer Hunter                        director  Michael Cimino
Midnight Express                       director  Alan Parker
The Lord of the Rings                  director  Ralph Bakshi
Death on the Nile                      director  John Guillermin
F.I.S.T.                               director  Norman Jewison
Force 10 from Navarone                 director  Guy Hamilton
Convoy                                 director  Sam Peckinpah
Invasion of the Body Snatchers         director  Philip Kaufman
The Wiz                                director  Sidney Lumet
Damien: Omen II                        director  Don Taylor|Mike Hodges
Watership Down                         director  Martin Rosen
Who's Afraid of Virginia Woolf?        director  Mike Nichols
Torn Curtain                           director  Alfred Hitchcock
El Dorado                              director  Howard Hawks
The Sand Pebbles                       director  Robert Wise
Fantastic Voyage                       director  Richard Fleischer

3805 rows × 1 column

In [25]:
# The answer:
b.iloc[:,0].value_counts().sort_values(ascending=False).head(1)
Out[25]:
Steven Spielberg    27
Name: director, dtype: int64
In [26]:
# The top 20:
T_act = b.iloc[:,0].value_counts().sort_values(ascending=False).head(20)
T_act
Out[26]:
Steven Spielberg        27
Clint Eastwood          24
Ridley Scott            21
Woody Allen             18
Martin Scorsese         17
Steven Soderbergh       17
Tim Burton              16
Oliver Stone            15
Renny Harlin            15
Brian De Palma          15
Robert Zemeckis         15
Wes Craven              14
Joel Schumacher         14
Tony Scott              14
Ron Howard              14
Francis Ford Coppola    13
Richard Donner          13
Robert Rodriguez        12
Barry Levinson          12
Rob Reiner              12
Name: director, dtype: int64
In [27]:
# visualization to the top 20 directors using Bar Charts:

T_act.plot(kind="bar")
plt.title("the top directors")
plt.xlabel("Director Name")
plt.ylabel("Number of movies")
Out[27]:
Text(0,0.5,'Number of movies')

The top three directors (21 to 27 movies each) are clearly ahead of the rest, who have 18 movies or fewer.
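
One caveat: the director column is not split, so co-directed movies such as "Kyle Balda|Pierre Coffin" are counted as a separate "director" rather than being credited to each person. A minimal sketch of one way to credit every co-director, assuming pandas 0.25 or later for `Series.explode`; the names are invented for illustration:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Dir X|Dir Y", "Dir X", "Dir Z"],
})

# Split on the pipe, then give each co-director a row of their own.
per_director = toy["director"].str.split("|").explode().value_counts()
print(per_director)
```

Whether a co-directed movie should count fully for each director is a design choice; this sketch counts it once per person.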

Q3 Who is the actor whose movies make the most money?

In [28]:
# Apply combine Function to combine multiple columns into different shapes.

a,b = comp( df_n , ['cast0', 'cast1', 'cast2', 'cast3', 'cast4'] , 'revenue_adj')
In [29]:
# Total adjusted revenue per actor, highest first

b.columns =['Actors']
b=b.reset_index()
l=b.groupby(by='Actors').sum()
l.iloc[:,0].sort_values(ascending=False)
Out[29]:
Actors
Harrison Ford                1.428570e+10
Tom Cruise                   1.117507e+10
Tom Hanks                    1.043351e+10
Emma Watson                  8.790080e+09
Ian McKellen                 8.628837e+09
Johnny Depp                  8.518033e+09
Daniel Radcliffe             8.515082e+09
Eddie Murphy                 8.403307e+09
Rupert Grint                 8.358341e+09
Bruce Willis                 8.236476e+09
Samuel L. Jackson            7.948471e+09
Cameron Diaz                 7.836254e+09
Will Smith                   7.775643e+09
Carrie Fisher                7.678282e+09
Ralph Fiennes                7.591763e+09
Brad Pitt                    7.546207e+09
Orlando Bloom                7.467078e+09
Sean Connery                 7.407260e+09
Mark Hamill                  7.379360e+09
Leonardo DiCaprio            7.226350e+09
Gary Oldman                  7.116634e+09
Robert Downey Jr.            7.051118e+09
Robin Williams               6.963477e+09
Sandra Bullock               6.913063e+09
Sylvester Stallone           6.807269e+09
Ben Stiller                  6.694215e+09
Arnold Schwarzenegger        6.572779e+09
Robert De Niro               6.474663e+09
Dustin Hoffman               6.373330e+09
Liam Neeson                  5.996106e+09
                                 ...     
James Rolleston              4.300000e+01
Te Aho Aho Eketone-Whitu     4.300000e+01
Taika Waititi                4.300000e+01
Kathy Ireland                4.075569e+01
J. D. Cannon                 3.615428e+01
Robin Sherwood               3.615428e+01
Dolly Parton                 2.849233e+01
Heather Litteer              2.726311e+01
David Johansen               2.670238e+01
John D. LeMay                2.264205e+01
Kari Keegan                  2.264205e+01
Ciaran Owens                 1.701769e+01
Michael Legge                1.701769e+01
Joe Breen                    1.701769e+01
Wings Hauser                 1.574064e+01
Duane Whitaker               1.574064e+01
Tony Jayawardena             1.385334e+01
Stark Sands                  1.385334e+01
Ingvar Eggert Sigurðsson    1.029637e+01
Helgi Björnsson             1.029637e+01
Charlotte Bøving            1.029637e+01
Kristbjörg Kjeld            1.029637e+01
Steinn Ãrmann Magnússon     1.029637e+01
Martha Burns                 8.585801e+00
Angie Everhart               6.951084e+00
Clayton Watson               5.926763e+00
John Demita                  5.926763e+00
Kevin Michael Richardson     5.926763e+00
Jeremy London                2.861934e+00
Shannen Doherty              2.861934e+00
Name: revenue_adj, Length: 6747, dtype: float64
In [30]:
# The answer:
l.iloc[:,0].sort_values(ascending=False).head(1)
Out[30]:
Actors
Harrison Ford    1.428570e+10
Name: revenue_adj, dtype: float64
In [31]:
# The top 20:
T_act = l.iloc[:,0].sort_values(ascending=False).head(20)
T_act
Out[31]:
Actors
Harrison Ford        1.428570e+10
Tom Cruise           1.117507e+10
Tom Hanks            1.043351e+10
Emma Watson          8.790080e+09
Ian McKellen         8.628837e+09
Johnny Depp          8.518033e+09
Daniel Radcliffe     8.515082e+09
Eddie Murphy         8.403307e+09
Rupert Grint         8.358341e+09
Bruce Willis         8.236476e+09
Samuel L. Jackson    7.948471e+09
Cameron Diaz         7.836254e+09
Will Smith           7.775643e+09
Carrie Fisher        7.678282e+09
Ralph Fiennes        7.591763e+09
Brad Pitt            7.546207e+09
Orlando Bloom        7.467078e+09
Sean Connery         7.407260e+09
Mark Hamill          7.379360e+09
Leonardo DiCaprio    7.226350e+09
Name: revenue_adj, dtype: float64
In [32]:
# visualization to the top 20 actors using Bar Charts:

T_act.plot(kind="bar")
plt.title("the top actors")
plt.xlabel("Actor Name")
plt.ylabel("Revenue")
Out[32]:
Text(0,0.5,'Revenue')

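There is a clear gap between the first actor and the rest: movies featuring Harrison Ford total about 1.43e10 in adjusted revenue, compared with about 1.12e10 for Tom Cruise in second place. Note that this is the summed revenue of the movies an actor appears in, not the actor's personal earnings.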
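
Because the ranking is a sum, actors with many movies have an advantage, and each movie's full revenue is credited to every listed actor. As a hedged sketch (not part of the original analysis), the total can be reported next to the number of movies and the mean revenue per movie. The numbers below are invented, and `total`, `movies` and `mean` are hypothetical column names:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Actors": ["Actor X", "Actor X", "Actor X", "Actor Y"],
    "revenue_adj": [100.0, 200.0, 300.0, 500.0],
})

# Total, movie count and mean revenue per actor in one table.
summary = (
    toy.groupby("Actors")["revenue_adj"]
       .agg(total="sum", movies="count", mean="mean")
       .sort_values("total", ascending=False)
)
print(summary)
```

In this toy example Actor X leads by total revenue while Actor Y leads by mean revenue per movie, which shows how the choice of aggregate can change the ranking.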

Q4 Who is the director whose movies make the most money?

In [33]:
# Apply combine Function.

a,b = comp( df_n ,  ['director'] , 'revenue_adj')
In [34]:
# Total adjusted revenue per director, highest first

b.columns =['Director']
b=b.reset_index()
l=b.groupby(by='Director').sum()
l.iloc[:,0].sort_values(ascending=False)
Out[34]:
Director
Steven Spielberg                1.520245e+10
James Cameron                   7.327221e+09
Peter Jackson                   7.019848e+09
George Lucas                    6.313919e+09
Robert Zemeckis                 5.655648e+09
Michael Bay                     5.460672e+09
Chris Columbus                  4.893486e+09
Tim Burton                      4.529285e+09
David Yates                     4.177455e+09
Christopher Nolan               4.164262e+09
Ridley Scott                    4.141848e+09
Roland Emmerich                 4.076981e+09
Ron Howard                      4.011669e+09
Gore Verbinski                  3.926130e+09
Sam Raimi                       3.595782e+09
Clint Eastwood                  3.518207e+09
J.J. Abrams                     3.414677e+09
Richard Donner                  3.242881e+09
Tony Scott                      3.117253e+09
Francis Lawrence                3.067993e+09
M. Night Shyamalan              2.834729e+09
Barry Sonnenfeld                2.812794e+09
Joss Whedon                     2.779224e+09
Guy Hamilton                    2.752939e+09
Sam Mendes                      2.748417e+09
Francis Ford Coppola            2.664669e+09
William Friedkin                2.589000e+09
Carlos Saldanha                 2.546672e+09
Ivan Reitman                    2.504342e+09
Steven Soderbergh               2.475274e+09
                                    ...     
Hal Haberman|Jeremy Passmore    7.790181e+03
Anders Anderson                 7.425822e+03
Jonathan Newman                 5.989677e+03
Logan Miller                    5.753797e+03
Gillian Armstrong               3.744992e+03
 Frédéric Jardin              3.255239e+03
David Brooks                    2.858730e+03
Stephen Elliott                 2.852082e+03
David Weaver                    2.394305e+03
Ted Koland                      1.840604e+03
Dermot Mulroney                 1.335831e+03
Aaron Blaise|Robert Walker      2.963382e+02
Phil Alden Robinson             2.339664e+02
Prabhu Deva                     1.361977e+02
Jeff Pollack                    1.309053e+02
Michael Pressman                1.248852e+02
Duane Adler                     1.141961e+02
Stuart Gillard                  6.339774e+01
Taika Waititi                   4.300000e+01
Gene Quintano                   4.075569e+01
John Harrison                   2.670238e+01
Gregory Widen                   2.289547e+01
Adam Marcus                     2.264205e+01
Rusty Cundieff                  1.574064e+01
Andy Cadiff                     1.385334e+01
Benedikt Erlingsson             1.029637e+01
Bille August                    9.056820e+00
Peter Hall                      8.585801e+00
Gilbert Adler                   6.951084e+00
Shinichiro Watanabe             5.926763e+00
Name: revenue_adj, Length: 1683, dtype: float64
In [35]:
# The answer:
l.iloc[:,0].sort_values(ascending=False).head(1)
Out[35]:
Director
Steven Spielberg    1.520245e+10
Name: revenue_adj, dtype: float64
In [36]:
# The top 20:
T_act = l.iloc[:,0].sort_values(ascending=False).head(20)
T_act
Out[36]:
Director
Steven Spielberg     1.520245e+10
James Cameron        7.327221e+09
Peter Jackson        7.019848e+09
George Lucas         6.313919e+09
Robert Zemeckis      5.655648e+09
Michael Bay          5.460672e+09
Chris Columbus       4.893486e+09
Tim Burton           4.529285e+09
David Yates          4.177455e+09
Christopher Nolan    4.164262e+09
Ridley Scott         4.141848e+09
Roland Emmerich      4.076981e+09
Ron Howard           4.011669e+09
Gore Verbinski       3.926130e+09
Sam Raimi            3.595782e+09
Clint Eastwood       3.518207e+09
J.J. Abrams          3.414677e+09
Richard Donner       3.242881e+09
Tony Scott           3.117253e+09
Francis Lawrence     3.067993e+09
Name: revenue_adj, dtype: float64
In [37]:
# Visualize the top 20 directors with a bar chart:

T_act.plot(kind="bar");
plt.title("the top directors")
plt.xlabel("Director Name")
plt.ylabel("Revenue")
Out[37]:
Text(0,0.5,'Revenue')

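There is a very large gap between the first director and the rest: Steven Spielberg's movies total about 1.52e10 in adjusted revenue, more than twice the 7.33e9 of James Cameron in second place. Note that this is the revenue of the movies, not the director's personal earnings.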

Q5.1 Which genres are most popular (in general)?

In [38]:
# Apply combine Function to combine multiple columns into different shapes.

a,b = comp( df_n , ['genres0', 'genres1', 'genres2', 'genres3', 'genres4'] , 'release_year')
In [39]:
# Fast look
a
Out[39]:
                       0
release_year
2015          genres0  Action
              genres1  Adventure
              genres2  Science Fiction
              genres3  Thriller
              genres0  Action
              genres1  Adventure
              genres2  Science Fiction
              genres3  Thriller
              genres0  Adventure
              genres1  Science Fiction
              genres2  Thriller
              genres0  Action
              genres1  Adventure
              genres2  Science Fiction
              genres3  Fantasy
              genres0  Action
              genres1  Crime
              genres2  Thriller
              genres0  Western
              genres1  Drama
              genres2  Adventure
              genres3  Thriller
              genres0  Science Fiction
              genres1  Action
              genres2  Thriller
              genres3  Adventure
              genres0  Drama
              genres1  Adventure
              genres2  Science Fiction
              genres0  Family
...           ...      ...
1978          genres1  Comedy
              genres2  Drama
              genres0  Horror
              genres1  Thriller
              genres2  Science Fiction
              genres3  Mystery
              genres0  Adventure
              genres1  Family
              genres2  Fantasy
              genres3  Music
              genres4  Science Fiction
              genres0  Action
              genres1  Drama
              genres2  Horror
              genres3  Thriller
              genres0  Adventure
              genres1  Animation
              genres2  Drama
1966          genres0  Drama
              genres0  Mystery
              genres1  Thriller
              genres0  Action
              genres1  Western
              genres0  Action
              genres1  Adventure
              genres2  Drama
              genres3  War
              genres4  Romance
              genres0  Adventure
              genres1  Science Fiction

10180 rows × 1 column

In [40]:
# The answer:
b[0].value_counts().sort_values(ascending=False).head(1)
Out[40]:
Drama    1729
Name: 0, dtype: int64
In [41]:
# The top 20:
T_act = b[0].value_counts().sort_values(ascending=False).head(20)
T_act
Out[41]:
Drama              1729
Comedy             1335
Thriller           1194
Action             1076
Adventure           743
Romance             658
Crime               649
Science Fiction     517
Horror              459
Family              417
Fantasy             395
Mystery             343
Animation           199
Music               131
History             128
War                 119
Western              52
Documentary          26
Foreign               9
TV Movie              1
Name: 0, dtype: int64
In [42]:
# Visualize the genres with a bar chart:

T_act.plot(kind="bar")
plt.title("the top genres")
plt.xlabel("Genres Name")
plt.ylabel("Number of movies")
Out[42]:
Text(0,0.5,'Number of movies')

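The first four genres (Drama, Comedy, Thriller and Action) each appear in more than 1,000 movies, well above the rest. "Drama" is the most common, with 1,729 movies, almost 400 more than "Comedy". Here, popularity is measured by the number of movies in each genre.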

Q5.2 Which genres are most popular from year to year?

In [43]:
b.columns =['Genres']
b=b.reset_index()
b["Y_G"] = b["release_year"].astype(str) + "_" + b["Genres"]
b.head()
Out[43]:
   release_year  Genres     Y_G
0  2015          Action     2015_Action
1  2015          Action     2015_Action
2  2015          Adventure  2015_Adventure
3  2015          Action     2015_Action
4  2015          Action     2015_Action
In [44]:
T_act = b.iloc[:,2].value_counts()
T_act = pd.DataFrame(T_act)
T_act = T_act.reset_index()
In [45]:
T_act.columns=['Y_G','N']
T = T_act['Y_G'].str.split('_', expand=True)
T.columns=['release_year','Genres']
T['N'] = T_act['N']
t = T.sort_values(by = 'release_year')
df = t.reset_index( drop = True )
#df = t.set_index(['release_year'])
#df = df.transpose()

df
Out[45]:
     release_year           Genres   N
  0          1960          Romance   2
  1          1960          History   1
  2          1960           Action   2
  3          1960           Horror   1
  4          1960         Thriller   1
  5          1960           Comedy   2
  6          1960        Adventure   1
  7          1960            Drama   3
  8          1960          Western   1
  9          1961        Adventure   2
 10          1961          History   1
 11          1961           Comedy   4
 12          1961              War   1
 13          1961        Animation   1
 14          1961          Western   1
 15          1961           Horror   1
 16          1961            Crime   1
 17          1961          Romance   1
 18          1961           Family   2
 19          1961           Action   2
 20          1961            Drama   6
 21          1961            Music   1
 22          1962           Action   2
 23          1962          History   1
 24          1962        Adventure   3
 25          1962         Thriller   1
 26          1962            Crime   1
 27          1962            Drama   5
 28          1962          Western   2
 29          1962              War   1
...           ...              ...  ..
809          2014          History   4
810          2014           Action  44
811          2014            Drama  79
812          2014          Romance  20
813          2014        Adventure  29
814          2014           Comedy  48
815          2014              War   9
816          2014          Mystery   9
817          2014         Thriller  41
818          2014        Animation  10
819          2014           Family  14
820          2014          Fantasy  14
821          2014          Western   1
822          2015              War   5
823          2015            Music   9
824          2015            Drama  81
825          2015          Western   2
826          2015  Science Fiction  22
827          2015        Animation   9
828          2015          Mystery  13
829          2015        Adventure  34
830          2015            Crime  24
831          2015         Thriller  47
832          2015           Horror  17
833          2015          Romance  18
834          2015           Action  37
835          2015          History   5
836          2015           Family  13
837          2015          Fantasy  13
838          2015            Comedy  52

839 rows × 3 columns

In [46]:
df = df.pivot(index = 'release_year' , columns = 'Genres' , values = 'N' )
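The year-by-genre table built in the cells above (concatenate into `Y_G`, count, split again, pivot) can probably also be reached in one step with `groupby(...).size().unstack()`. The sketch below assumes a long-format frame with one row per (movie, genre) pair, like `b` after it is renamed; the `long_genres` frame and its five rows are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (movie, genre) pair.
long_genres = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015],
    "Genres": ["Drama", "Drama", "Action", "Drama", "Comedy"],
})

# Count rows per (year, genre), then move the genres into columns.
# fill_value=0 records a missing combination as zero movies.
year_genre_counts = (
    long_genres
    .groupby(["release_year", "Genres"])
    .size()
    .unstack(fill_value=0)
)

print(year_genre_counts)
```

One difference from the `pivot` call: `pivot` leaves year and genre combinations that never occur as NaN, while `fill_value=0` stores them as 0. It also avoids splitting on "_", which would break if a genre name ever contained an underscore.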
In [47]:
df.plot(kind="bar" , figsize = (20,15));
plt.title("Most popular genres from year to year")
plt.xlabel("Release year")
plt.ylabel("Number of movies")
Out[47]:
Text(0,0.5,'Number of movies')

The bar chart shows that the number of movies released per year has increased over time for almost all genres.

In [51]:
df.plot(kind="line", figsize=(20, 15));
plt.title("Most popular genres from year to year")
plt.xlabel("Release year")
plt.ylabel("Number of movies")
Out[51]:
Text(0,0.5,'Number of movies')

The line chart shows the same trend: the number of movies per year rises over time for almost all genres.
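Rising counts may partly reflect that the dataset simply contains more movies in recent years. One way to separate the two effects is to convert each year's counts into shares of that year's total. The sketch below is illustrative only: it uses just two genres, with the 1960 and 2015 counts shown in the table above, so the resulting shares are relative to this two-genre subset and not to the full dataset.

```python
import pandas as pd

# Two-genre subset of the yearly counts (values from the table above).
counts = pd.DataFrame(
    {"Drama": [3, 81], "Western": [1, 2]},
    index=pd.Index([1960, 2015], name="release_year"),
)

# Divide every row by its own total so that each year sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(3))
```

In this subset the Western count goes up from 1 to 2 while its share falls, which is the kind of shift that raw counts can hide.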

In [49]:
df.plot(kind="hist", figsize=(10, 7));
plt.title("Distribution of yearly movie counts per genre")
plt.xlabel("Number of movies per year")
plt.ylabel("Number of years")

Some genres, such as Western, were quite popular in the past and have since declined significantly. In contrast, other genres, such as Drama, were not popular at first and have become the most popular over time.
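To answer "which genre is most popular in each year" directly, the pivoted counts can be reduced with `idxmax` along the genre axis. This is a minimal sketch on a three-genre subset, using the 1960 and 2015 counts from the table above; on the notebook's `df` it would have to run on the raw pivoted counts, before any cumulative sum is applied.

```python
import pandas as pd

# Three-genre subset of the yearly counts (values from the table above).
counts = pd.DataFrame(
    {"Action": [2, 37], "Comedy": [2, 52], "Drama": [3, 81]},
    index=pd.Index([1960, 2015], name="release_year"),
)

# For each year, pick the genre with the largest count and that count.
summary = pd.DataFrame({
    "top_genre": counts.idxmax(axis=1),
    "movies": counts.max(axis=1),
})

print(summary)
```

`idxmax` skips NaN by default, so it also works on a pivot table that has gaps for missing year and genre combinations.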

Conclusions

  • This dataset was rich in information and supported many conclusions. Those conclusions helped answer the questions above, and the data could support much more analysis.
  • Limitations:
    Several columns were not relevant to this analysis (such as "imdb_id"), so I dropped them.
    There was one duplicated row, which I dropped.
    The dataset contains some rows with null or zero values in some features, and these rows needed to be removed.
    Many columns contained several values separated by "|", so I created a function to split them.
  • Who is the actor with most movies? Robert De Niro, with 52 movies.
  • Who is the director with most movies? Steven Spielberg, with 27 movies.
  • Who is the actor that makes the most money? Harrison Ford, with 1.428570e+10 dollars (about 14.3 billion).
  • Who is the director that makes the most money? Steven Spielberg, with 1.520245e+10 dollars (about 15.2 billion).
  • Which genres are most popular? Drama.
  • From the first question: the first four actors each appear in noticeably more movies than the rest.
  • From the second question: the first three directors each have noticeably more movies than the rest.
  • There is a huge gap between the top-earning actor and the others: the first actor earns far more than the rest. The same holds for question four, where the first director earns far more than the other directors.
  • From the last question: some genres, such as Western, were quite popular in the past and have since declined significantly. In contrast, other genres, such as Drama, were not popular at first and have become the most popular over time.
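The cleaning steps listed under the limitations (drop unused columns, drop the duplicated row, remove rows with null or zero values) could be sketched roughly as follows. The `raw` frame is hypothetical, and treating zeros in `budget_adj` and `revenue_adj` as missing is an assumption: the post does not say which features were checked for zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate, a zero budget and a null revenue.
raw = pd.DataFrame({
    "imdb_id": ["tt1", "tt1", "tt2", "tt3"],
    "original_title": ["A", "A", "B", "C"],
    "budget_adj": [100.0, 100.0, 0.0, 50.0],
    "revenue_adj": [300.0, 300.0, 20.0, None],
})

# Assumption: these are the columns where a zero means "unknown".
money_cols = ["budget_adj", "revenue_adj"]

# Drop the unused column, then remove exact duplicate rows.
clean = raw.drop(columns=["imdb_id"]).drop_duplicates()

# Treat zeros as missing, then drop rows missing either value.
clean[money_cols] = clean[money_cols].replace(0, np.nan)
clean = clean.dropna(subset=money_cols).reset_index(drop=True)

print(clean)
```

Only movie "A" survives in this toy example: the duplicate is removed, "B" has a zero budget, and "C" has no revenue.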

  • Andrew Samy
  • Mar, 27 2022
