Posted by Ronnen Nagal on March 25, 2019 · 41 mins read
Introduction
The film industry is on a steady growth trend. The global box office was worth $41.7 billion in 2018. Hollywood has the world's largest box office revenue, with 2.6 billion tickets sold and around 2,000 films produced annually.
One of the main interests of film studios and related stakeholders is predicting the revenue that a new movie can generate from a few given input attributes.
Background
Starting in 1929, during the Great Depression and the Golden Age of Hollywood, an insight began to emerge about movie-ticket consumption: even in that dire economic period, the film industry kept growing. The phenomenon repeated itself in the 2008 recession.
The primary goal is to build a machine-learning based model that will predict the revenue of a new movie given such features as cast, crew, keywords, budget, release dates, languages, production companies, and countries.
EDA was the first step, followed by an initial linear model that is compared against other models at the end of the process. Data on 7,398 movies was collected from The Movie Database (TMDB) as part of a kaggle.com Box Office Prediction competition. A train/test split is also provided for building and evaluating the developed model.
The Challenge
Consumer behaviour has changed over the years: the MeToo movement, along with other social developments, has surfaced in our society, and that is reflected in movie scripts. However, some preferences that were relevant 50 years ago are still relevant today; hence, an analysis based on the last few decades of movie production remains appropriate and can serve any stakeholder interested in predicting a new movie's revenue.
The packages I used in this exercise:
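The original import cell is not reproduced here; a plausible reconstruction, based on the libraries referenced later in the post (pandas, scikit-learn, LightGBM, vaderSentiment), might look as follows. Treat the exact list as an assumption, not the author's original cell:

```python
# Assumed imports, reconstructed from the libraries mentioned later in the post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# These two third-party packages are used later in the post; guard the
# imports so the sketch still runs in environments where they are missing.
try:
    from lightgbm import LGBMRegressor
except ImportError:
    LGBMRegressor = None
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
except ImportError:
    SentimentIntensityAnalyzer = None
```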
EDA
Pictures best illustrate the first findings from the dataset.
Begin the exploration with a scatter plot of ‘Revenue vs Budget’ to view the upper-end data points:
Continue the exploration with a scatter plot that shows the lower-end points: by taking log10 of the values, the 'Revenue vs Budget' plot changes:
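The plots themselves were stripped from this page; a minimal sketch of the log-scaled scatter, using a toy stand-in for the TMDB frame (the column names 'budget' and 'revenue' follow the dataset), could be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the TMDB training frame.
train = pd.DataFrame({"budget": [1e5, 1e6, 1e7, 1e8],
                      "revenue": [3e5, 2e6, 5e7, 9e8]})

fig, ax = plt.subplots()
# log10 compresses the upper end, so the lower-end points become visible
ax.scatter(np.log10(train["budget"]), np.log10(train["revenue"]))
ax.set_xlabel("log10(budget)")
ax.set_ylabel("log10(revenue)")
```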
Checking the scatter plot of ‘Revenue vs Popularity’.
Comparing the movies with the biggest budget values:
Comparing the movies with the biggest Revenue values:
Comparing the movies with the biggest Profit values:
Moving ahead to explore the highest Revenue by ‘genres’ as follow:
The 'belongs_to_collection' column was converted to a True/False column indicating whether or not a movie belongs to a collection of movies.
A simple box plot reveals that movies belonging to a collection enjoy higher revenue, as reflected by the median and the interquartile range (25th to 75th percentile): the orange (right) box plot sits higher.
Define a function (named ‘parse_json’) to parse the first ‘name’ value from this structure of a list of dictionaries:
Applying the ‘parse_json’ function on the ‘production_companies’ column yields:
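The body of 'parse_json' is not shown on this page; a minimal sketch, under the assumption that the TMDB columns store Python-literal strings of lists of dicts (which is how the Kaggle dump looks), could be:

```python
import ast

def parse_json(cell):
    """Return the first 'name' value from a stringified list of dicts.

    Sketch of the helper described above; the raw TMDB columns hold
    strings such as "[{'id': 4, 'name': 'Paramount Pictures'}]".
    """
    if not isinstance(cell, str) or not cell:
        return None  # NaN or empty cell
    try:
        records = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None  # malformed cell
    if isinstance(records, list) and records:
        return records[0].get("name")
    return None

first = parse_json("[{'id': 4, 'name': 'Paramount Pictures'}, {'id': 6, 'name': 'RKO'}]")
```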
Visualizing the production companies with the highest Revenue yields the plot:
Data Preparation
Starting with sentiment analysis of the 'overview' and 'tagline' columns, which contain a short verbal overview of the movie and its tagline.
I used the vaderSentiment package's 'compound' score to explore the question: is the sentiment correlated with the 'revenue' column?
Continue with a helper function that converts the given string data to a list; for example, it will convert '[1,2,3,4]' (a string) into [1,2,3,4] (a list).
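The helper itself is not shown; a sketch using ast.literal_eval (a standard way to evaluate such literal strings safely) might be:

```python
import ast

def text_to_list(cell):
    """Convert a string like '[1,2,3,4]' into a real Python list."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return cell  # already a list, or NaN

result = text_to_list("[1,2,3,4]")
```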
The next step is to combine the train and test sets into one combined set; all the preparations will be done on the combined set, which will be split back later.
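A common way to implement this combine-then-split pattern in pandas (the variable names here are illustrative, not necessarily the notebook's):

```python
import pandas as pd

# Toy stand-ins for the competition's train/test frames.
train = pd.DataFrame({"budget": [10, 20], "revenue": [100, 200]})
test = pd.DataFrame({"budget": [30]})  # the test set has no 'revenue' column

n_train = len(train)  # remember the boundary so we can split back later
combined = pd.concat([train, test], ignore_index=True, sort=False)
# ... feature engineering on `combined` goes here ...
train_out = combined.iloc[:n_train]
test_out = combined.iloc[n_train:].drop(columns=["revenue"])
```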
Drop all irrelevant columns from the combined dataset, i.e. columns that will not contribute to predicting the revenue.
Preparation for the parsing step, applying ‘text_to_list’ function on the relevant columns.
Convert the 'belongs_to_collection' column to a zero/one column:
Every non-null value (meaning the movie belongs to a collection) will be converted to 1.
Every NaN value (meaning the movie does not belong to a collection) will be converted to 0.
As a reminder, the sentiment analysis revealed no correlation between the 'overview' and 'tagline' columns and the 'revenue' column (our target).
Hence, we will create a binary label per movie for 'tagline' (and later for 'homepage' as well): whether or not the movie has one.
The second step is to create a new feature holding the number of characters in each movie's overview.
The head() of the new feature:
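The code for these two steps is not shown; one plausible sketch (the column names 'has_tagline', 'has_homepage', and 'overview_len' are my own, not necessarily the author's):

```python
import pandas as pd

# Toy stand-in for the combined frame.
combined = pd.DataFrame({
    "tagline": ["Just when you thought...", None],
    "homepage": [None, "http://example.com"],
    "overview": ["A shark terrorizes a beach town.", None],
})

# 1 if the field is present, 0 if it is NaN
combined["has_tagline"] = combined["tagline"].notna().astype(int)
combined["has_homepage"] = combined["homepage"].notna().astype(int)
# character count of the overview (0 for missing overviews)
combined["overview_len"] = combined["overview"].fillna("").str.len()
```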
Creating a new feature that contains the number of genres for each movie.
Moving on to parse the ‘genres’ names from the ‘genres’ column.
Some movies have more than one genre while others have no genre at all.
For this purpose, there is a helper function named 'parse_genre' that parses the first three genres related to a movie (if they exist) and creates three new columns, 'genres1', 'genres2', 'genres3', in the combined dataset.
Apply the function to create 3 new columns and drop the original ‘genres’ column:
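A sketch of a 'parse_genre' along these lines, assuming the literal-string format of the raw column; the same first-three pattern also fits the production companies, countries, spoken languages, and keywords columns handled below:

```python
import ast
import pandas as pd

def parse_genre(cell):
    """Return the first three genre names (padded with None) from a
    stringified list of dicts, as stored in the TMDB 'genres' column."""
    names = []
    if isinstance(cell, str) and cell:
        names = [d.get("name") for d in ast.literal_eval(cell)]
    names = names[:3] + [None] * (3 - len(names[:3]))  # pad to exactly 3
    return pd.Series(names, index=["genres1", "genres2", "genres3"])

combined = pd.DataFrame({"genres": [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    None,  # a movie with no genre at all
]})
combined[["genres1", "genres2", "genres3"]] = combined["genres"].apply(parse_genre)
combined = combined.drop(columns=["genres"])
```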
Creating a new column with the number of production companies related to each movie with the code-line:
Building a function to parse a movie's production companies.
A few movies have no production companies value, while some have more than one; the function parses only the first three production companies (if they exist) and creates three new columns, 'prod1', 'prod2', 'prod3', in the combined dataset.
Apply the function to create 3 new columns and drop the original ‘production companies’ column.
Create a new column with the number of production countries related to each movie with the code-line:
A few movies have no production countries value, while some have more than one.
A helper function will parse the production countries of a movie. It will parse only the first 3 production countries (if exist) and create 3 new columns named: ‘country1’, ‘country2’, ‘country3’ in the combined dataset.
Apply the function to create 3 new columns and drop the original ‘production countries’ column with the code-line:
The 'release_date' column needs parsing, and its NaN values need filling; that is done with the following code:
Fill the NaN values in the 'runtime' column with the median value.
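A sketch of both steps, assuming the m/d/yy date format used in the TMDB dump and an arbitrary placeholder for missing dates (that placeholder is a judgment call of mine, not the author's stated choice):

```python
import pandas as pd

combined = pd.DataFrame({
    "release_date": ["2/20/15", "6/29/79", None],
    "runtime": [120.0, None, 95.0],
})

# Fill missing dates with a placeholder, then parse the m/d/yy format.
combined["release_date"] = combined["release_date"].fillna("1/1/00")
dates = pd.to_datetime(combined["release_date"], format="%m/%d/%y")
# Note: two-digit years 69-99 are parsed as 19xx and 00-68 as 20xx;
# any movie that lands "in the future" would need an extra correction.
combined["year"] = dates.dt.year
combined["month"] = dates.dt.month
combined["day"] = dates.dt.day

# Fill missing runtimes with the column median.
combined["runtime"] = combined["runtime"].fillna(combined["runtime"].median())
```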
Create a new column with the number of spoken languages for each movie with the code-line:
A few movies have no spoken languages value, while some have more than one.
A helper function parses a movie's spoken languages. It parses only the first three (if they exist) and creates three new columns, 'lang1', 'lang2', 'lang3', in the combined dataset:
Apply the function to create 3 new columns and drop the original ‘spoken languages’ column:
Most of the 'status' column values are 'Released'; hence, the NaN values in this column are changed to 'Released'.
Create a new column with the number of Keywords for each movie.
A few movies have no keywords value, while some have more than one. The helper function parses only the first three keywords (if they exist) and creates three new columns, 'key1', 'key2', 'key3', in the combined dataset.
Apply the function to create 3 new columns and drop the original ‘Keywords’ column:
Create three new features counting the cast members of gender 0, 1, and 2 for each movie.
A sample to observe the head of one of the new columns:
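A sketch of the gender counts (TMDB encodes gender as 0 = unspecified, 1 = female, 2 = male; the output column names here are my own):

```python
import ast
import pandas as pd

combined = pd.DataFrame({"cast": [
    "[{'name': 'A', 'gender': 1}, {'name': 'B', 'gender': 2}, {'name': 'C', 'gender': 2}]",
    None,  # a movie with no cast information
]})

def count_gender(cell, gender):
    """Count cast entries with the given TMDB gender code."""
    if not isinstance(cell, str):
        return 0
    return sum(d.get("gender") == gender for d in ast.literal_eval(cell))

for g in (0, 1, 2):
    combined[f"gender_{g}_count"] = combined["cast"].apply(count_gender, gender=g)
```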
Create a new column with the number of cast values for each movie with the code-line:
Parsing the cast column. Taking the first five cast members by their cast_id values and creating five cast-related new columns:
Apply the function to create 5 new columns and drop the original ‘cast’ column:
Create a new column with the number of crew values for each movie:
A function to parse the Director and Producer from the ‘crew’ column:
Apply the function to create 2 new columns and drop the original ‘crew’ column:
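A sketch of such a crew parser, filtering the stringified crew list by its 'job' field:

```python
import ast
import pandas as pd

combined = pd.DataFrame({"crew": [
    "[{'name': 'Steven Spielberg', 'job': 'Director'}, "
    "{'name': 'Kathleen Kennedy', 'job': 'Producer'}]",
    None,  # a movie with no crew information
]})

def parse_crew_job(cell, job):
    """Return the name of the first crew member holding the given job."""
    if not isinstance(cell, str):
        return None
    for member in ast.literal_eval(cell):
        if member.get("job") == job:
            return member.get("name")
    return None

combined["director"] = combined["crew"].apply(parse_crew_job, job="Director")
combined["producer"] = combined["crew"].apply(parse_crew_job, job="Producer")
combined = combined.drop(columns=["crew"])
```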
Create two new columns (features) from the two columns that contain numeric values ('budget', 'popularity') using np.log1p, which computes log(1 + x), since a value may be zero and log(0) is undefined. RandomForest or LightGBM models can use both the original and log features without conflict; moreover, these two new features contribute to the models' accuracy.
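In code, the two log features could be created like this (the output column names are illustrative):

```python
import numpy as np
import pandas as pd

combined = pd.DataFrame({"budget": [0, 1_000_000], "popularity": [0.0, 8.5]})

# log1p(x) = log(1 + x) handles the zero budgets that a plain log cannot
combined["log_budget"] = np.log1p(combined["budget"])
combined["log_popularity"] = np.log1p(combined["popularity"])
```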
Apply a LabelEncoder to the five generated feature-group columns: fit, then transform as a second step.
Apply the LabelEncoder to the two remaining category columns:
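A sketch of the label-encoding step (sklearn's LabelEncoder assigns integer codes in sorted order of the category values; the column names here are examples):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

combined = pd.DataFrame({"genres1": ["Comedy", "Drama", "Comedy"],
                         "status": ["Released", "Released", "Rumored"]})

# Tree models need numeric inputs; encode each category column separately,
# casting to str first so NaN values do not break the encoder.
for col in ["genres1", "status"]:
    le = LabelEncoder()
    combined[col] = le.fit_transform(combined[col].astype(str))
```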
Split the combined dataset back to Test and Train sets
Another three steps of preparation:
Model Building
Start with a basic Linear Regression Model.
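The model code is not shown on this page; a common baseline for this competition fits on log1p(revenue), since the competition's metric is RMSLE. A self-contained sketch on synthetic data (the real notebook would use the prepared train features instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins for the prepared feature matrix and revenue target.
X = rng.uniform(0, 1, size=(100, 3))
y = np.expm1(X @ np.array([2.0, 1.0, 0.5]) + 10)  # fake revenue in dollars

model = LinearRegression()
model.fit(X, np.log1p(y))          # fit on log1p(revenue)
pred = np.expm1(model.predict(X))  # invert the transform back to dollars
```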
Continue with a random forest regression model (an improved result compared to the LinearRegression attempt).
View the feature importances of the random forest model in a bar plot, dropping the revenue column beforehand.
Continue with an LGBMRegressor model (fast execution); the results improved compared to the RandomForestRegressor attempt.
An explanation of this model's parameters: 0.4 means that for each of the 1,500 trees (n_estimators), only 40% of the features will be selected (randomly). max_depth is unbounded (-1) but is restricted by the number of leaves (20).
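Collecting the parameters described above into code (the 0.4 presumably maps to LightGBM's colsample_bytree; the import is guarded so the sketch runs even where lightgbm is not installed):

```python
# Parameter values as described in the text; other hyperparameters
# of the original notebook are unknown and left at their defaults.
params = dict(
    n_estimators=1500,     # number of boosting rounds / trees
    colsample_bytree=0.4,  # each tree sees a random 40% of the features
    max_depth=-1,          # depth unbounded...
    num_leaves=20,         # ...but each tree is capped at 20 leaves
)

try:
    from lightgbm import LGBMRegressor
    model = LGBMRegressor(**params)
except ImportError:
    model = None  # lightgbm missing; the params dict still documents the setup
```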
View the feature importances of the LGBMRegressor model in a bar plot, dropping the revenue column beforehand. According to this model, the year is the most important feature in predicting revenue, which makes sense: as the years pass, revenue increases (across all industries). The next most important features according to this model are the production company, budget, and director. This model's choices are relevant and lead to a better prediction outcome compared to the previous two models that I tried.
License
I open-sourced this Jupyter notebook for all to use as an entry point to the competition. If, however, you make progress and develop a better-performing model, please let me know, so that I can understand better and grow. Thank you. Ronnen.
This article, along with any associated source code and files, is licensed under the GPL (GPLv3).