YouTube View Prediction with Machine Learning

Before you put in the effort to create a video, would you like an idea of how many views it is going to get?

This article will guide you through creating a model that predicts YouTube views for a video that does not yet exist on YouTube.

As we all know, watching YouTube videos is an important part of our everyday lives. Many people are trying to build their influence, income, and impact with YouTube and online video. In a nutshell, everyone is trying to be a YouTube influencer. It would be nice if an influencer could get an idea of the likely view count before making and finalizing a video. Here we try to create a model that helps influencers predict the number of views for their next video.

You can go through the steps below and come up with a model to predict the view count.

Collecting Data

For our model, data was collected from a Kaggle data set that contains several months of data on daily trending YouTube videos. This is the most relevant data set currently available for use. If you want, you can create a data set from scratch by scraping data through the YouTube APIs. Data is included for the US, GB, DE, CA, FR, RU, MX, KR, JP, and IN regions (USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India respectively) over the same time period. Each region’s data is in a separate file. To train our model we used the data for the USA, which is included in the USvideos.csv file. This data set contains a total of 16 columns as follows.
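As a rough illustration of reading one region's file, here is a minimal sketch that uses only Python's standard csv module. The helper names load_rows and column_names are hypothetical, and the two sample rows (including the video_id and views columns) are made up for the example; only the file name USvideos.csv and the category_id column come from the article.

```python
import csv
import io


def load_rows(stream):
    """Read CSV text from an open file-like object into a list of dicts."""
    return list(csv.DictReader(stream))


def column_names(rows):
    """Return the column names of the first row (empty list if there are no rows)."""
    return list(rows[0].keys()) if rows else []


# Tiny in-memory stand-in for the real file; these rows are invented.
sample = io.StringIO(
    "video_id,category_id,views\n"
    "abc123,10,1500\n"
    "def456,24,98000\n"
)
rows = load_rows(sample)
print(len(rows), column_names(rows))
```

For the real file you would pass an open file instead of the in-memory sample, for example open("USvideos.csv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded. Note that csv.DictReader returns every value as a string, so numeric columns need converting before modelling.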

Feature Engineering and Data Preparation

Data Exploration

Feature engineering plays an important role in data preparation, which is a key component of the AI workflow. Some feature engineering steps are common in data preparation, such as filling missing values and one-hot encoding. Apart from that, there are many other ways to optimize data, including removing unnecessary columns, normalization, column aggregation, row aggregation, and generating new features.
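To make three of these steps concrete, the sketch below shows filling missing values, one-hot encoding, and normalization on small Python lists. It is only an illustration under assumptions: the function names are hypothetical, the median is one of several reasonable fill strategies, and min-max scaling is one of several normalization methods. It is not necessarily what was used for the model in this article.

```python
from statistics import median


def fill_missing(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode each value as a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


def min_max(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


likes = fill_missing([10, None, 30, 50])   # median of 10, 30, 50 is 30
encoded, cats = one_hot([10, 24, 10])      # two distinct categories: 10 and 24
scaled = min_max([0, 5, 10])
print(likes, encoded, cats, scaled)
```

One-hot encoding suits a column such as category_id because the IDs are labels, not quantities: a model should not treat category 24 as "larger" than category 10.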

Our next step is to prepare the data so that we can use it to train a good model. First of all, we need to look at the current data and analyze it before going further. We chose the data from USvideos.csv. There are a total of 40,949 data rows in this CSV file. Next, we need to find out how each feature is distributed across these 40,949 rows.
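A first look at the data usually starts with the row count and the number of missing values per column. The following is a minimal sketch of that check; the helper name missing_counts and the sample rows are hypothetical, and it assumes the rows are dictionaries such as those produced by csv.DictReader, where a missing value shows up as an empty string or None.

```python
def missing_counts(rows):
    """Count values that are None or blank, per column."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            counts.setdefault(col, 0)
            if val is None or str(val).strip() == "":
                counts[col] += 1
    return counts


# Invented sample rows; the real file has far more rows and columns.
sample_rows = [
    {"title": "a", "description": ""},
    {"title": "b", "description": None},
    {"title": " ", "description": "x"},
]
total_rows = len(sample_rows)
missing = missing_counts(sample_rows)
print(total_rows, missing)
```

Columns with many missing values are candidates for filling or removal during feature engineering.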

category_id - This is the first feature we are going to analyze. There are different types of categories on YouTube, and when uploading a video it is compulsory to select a category. In this data set, category_id refers to the category, and the mapping from ID to category name can be found in a separate JSON file. Since the category is mandatory on YouTube, there are no missing values for this feature in the data set. From the graph below, you can get a rough idea of how category_id is distributed.
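The category distribution can also be computed directly by counting rows per category and joining the counts with the JSON mapping. The sketch below assumes the mapping file has the layout {"items": [{"id": ..., "snippet": {"title": ...}}]}; that layout, the helper names, and the sample rows are assumptions for illustration, so check them against the actual JSON file shipped with the data set.

```python
import json
from collections import Counter


def category_titles(category_json):
    """Build {category_id: title} from the parsed mapping (layout is assumed)."""
    return {
        int(item["id"]): item["snippet"]["title"]
        for item in category_json["items"]
    }


def category_distribution(rows, titles):
    """Return (category title, row count) pairs, most frequent first."""
    counts = Counter(int(row["category_id"]) for row in rows)
    return [
        (titles.get(cid, f"unknown ({cid})"), n)
        for cid, n in counts.most_common()
    ]


# Invented stand-ins for the JSON mapping file and the CSV rows.
mapping = json.loads(
    '{"items": ['
    '{"id": "10", "snippet": {"title": "Music"}},'
    '{"id": "24", "snippet": {"title": "Entertainment"}}'
    ']}'
)
video_rows = [{"category_id": "24"}, {"category_id": "10"}, {"category_id": "24"}]
distribution = category_distribution(video_rows, category_titles(mapping))
print(distribution)
```

The resulting pairs can be passed to any plotting library to draw a bar chart of videos per category.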