Intro To Data Science and Data Visualisation

By Shaurya Singh - 24 July, 20


WELCOME!
 
In this article I will get you started in the world of Data Science and Machine Learning with the help of an easy and fun project walk-through.
  • Tip: Before starting this project walk-through, make sure you have downloaded and installed Anaconda for your operating system and can launch Jupyter Notebook.
  • You can find all the required data files and the completed project file in this GitHub repository - github.com
  • You need basic knowledge of the Python programming language, namely data types and arrays.

OBJECTIVE: We are going to use the libraries NumPy, pandas, scikit-learn and Matplotlib to explore the relationship between the Production Cost of a movie in USD and its Worldwide Revenue in USD.

DATA-SET: We will be using a data-set that has been pre-compiled for us and is available online. Our data contains two columns, Production Budget and Worldwide Gross, and has a total of 5035 entries. Before we can work with this data we have to clean it: first we remove all the commas and dollar symbols from our numbers, and then we remove the entries for which Worldwide Gross is zero. These are movies that weren't released, and keeping them in our data would hinder our model's ability to fit the data and give us undesired results. You can clean the data yourself using Microsoft Excel or just get the clean data-set from the GitHub repository.
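If you would rather do the cleaning in Python than in Excel, the two steps described above can be sketched with pandas as follows. This is only a minimal sketch: the rows are made-up placeholders (not real figures from the data-set), and the column names 'Production Budget' and 'Worldwide Gross' are assumed to match the raw file.

```python
import io

import pandas as pd

# Made-up rows in the raw format described above (NOT real movie figures).
# The column names are an assumption about the raw file.
raw_csv = io.StringIO(
    "Production Budget,Worldwide Gross\n"
    '"$1,000,000","$5,500,000"\n'
    '"$250,000","$0"\n'
    '"$40,000,000","$120,750,000"\n'
)
raw = pd.read_csv(raw_csv)


def clean_money(column: pd.Series) -> pd.Series:
    """Strip dollar signs and commas, then convert the text to integers."""
    return column.str.replace(r"[$,]", "", regex=True).astype("int64")


# Step 1: turn strings such as "$1,000,000" into plain numbers.
clean = raw.apply(clean_money)

# Step 2: drop the rows whose Worldwide Gross is zero.
clean = clean[clean["Worldwide Gross"] > 0].reset_index(drop=True)

print(clean)
```

With a real file you would pass its path to pd.read_csv instead of the in-memory sample.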

 

IMPORTING THE NECESSARY MODULES:


We import several modules, namely:

  1. Pandas - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
    built on top of the Python programming language.
  2. Matplotlib - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
  3. Scikit-learn - Simple and efficient tools for predictive data analysis.
  4. NumPy - The fundamental package for scientific computing with Python.
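A typical import cell for these four modules might look like the one below. The aliases pd and plt match the calls used later in this article; the alias np and the choice of LinearRegression from scikit-learn are assumptions on my part, so check the completed project file for the exact imports.

```python
# Conventional aliases; 'pd' and 'plt' are the ones used later in this article.
import pandas as pd
import matplotlib.pyplot as plt

# Assumed imports: 'np' is the usual alias for NumPy, and LinearRegression is
# one scikit-learn model that could be used to relate budget to revenue.
import numpy as np
from sklearn.linear_model import LinearRegression
```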

 

READING OUR CSV FILE:

In this step we create a variable called 'data' that stores the contents of our CSV file. The 'pd.read_csv('cost_revenue_clean.csv')' function reads a CSV file and returns it to us in the form of a DataFrame (the two-dimensional, table-like data structure provided by pandas). The 'data.describe()' method returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column, which gives us a brief view of the data that we will work with.
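Here is a small sketch of this step. So that it runs without the real file, it reads from an in-memory stand-in; the rows are made up and the column names are assumptions. In your notebook you would call pd.read_csv('cost_revenue_clean.csv') instead.

```python
import io

import pandas as pd

# Stand-in for 'cost_revenue_clean.csv' so that the snippet runs anywhere.
# The rows are made up and the column names are assumptions.
sample_csv = io.StringIO(
    "Production Budget,Worldwide Gross\n"
    "1000000,5500000\n"
    "40000000,120750000\n"
    "10000000,8000000\n"
)

# In the notebook: data = pd.read_csv('cost_revenue_clean.csv')
data = pd.read_csv(sample_csv)

# One row per statistic, one column per numeric column of 'data'.
summary = data.describe()
print(summary)
```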

 

DATA-VISUALIZATION:

Next, we create two different variables named X and y for storing our column values. Both variables are of type DataFrame and are created by selecting specific columns from our 'data' DataFrame.
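One way to create X and y is sketched below. I am assuming that X holds the Production Budget and y holds the Worldwide Gross, that the columns carry those names, and the tiny 'data' DataFrame is a made-up stand-in for the real one.

```python
import pandas as pd

# Made-up stand-in for the DataFrame loaded from the CSV file.
data = pd.DataFrame(
    {
        "Production Budget": [1_000_000, 40_000_000, 10_000_000],
        "Worldwide Gross": [5_500_000, 120_750_000, 8_000_000],
    }
)

# Double square brackets select a list of columns, so the result is a
# DataFrame with one column rather than a Series.
X = data[["Production Budget"]]
y = data[["Worldwide Gross"]]

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

Writing pd.DataFrame(data, columns=['Production Budget']) selects the same column and also returns a DataFrame.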

Now, we will visualize our data in order to better understand it.

The plot is built with the following pyplot functions, and we will go over them one by one.

  1. plt.figure(figsize=(10,6)) - The figure function creates a new figure and lets you set its properties. Here we have set the figsize (figure size) property to a tuple giving the width (10 inches) and the height (6 inches).
  2. plt.scatter(X, y , color='indigo', alpha=0.3) - As its name suggests, the scatter function creates a scatter plot of our two variables X and y. It can also take additional arguments such as the colour of the dots and their transparency (alpha) value.
  3. plt.title() - This function is used to add a title to our graph and takes a string as an argument. It can also be given additional arguments such as fontsize.
  4. plt.xlabel() - This function is used to add a label to our x-axis and takes a string as an argument. It can also be given additional arguments such as fontsize.
  5. plt.ylabel() - This function is used to add a label to our y-axis and takes a string as an argument. It can also be given additional arguments such as fontsize.
  6. plt.xlim() - This function is used to set a limit on the values that are displayed on the x-axis. It takes two arguments separated by a comma. The first value is the lower limit and the second value is the upper limit.
  7. plt.ylim() - This function is used to set a limit on the values that are displayed on the y-axis. It takes two arguments separated by a comma. The first value is the lower limit and the second value is the upper limit.
  8. plt.show() - This function displays the figure that has been built up by all the code written above it.
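Put together, the eight calls above could look like the sketch below. The title, the axis labels, the font sizes and the axis limits are placeholders I picked for illustration, and the three data points are made up, so swap in your own X and y and adjust the limits to suit the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data; in the notebook X and y come from the 'data' DataFrame.
data = pd.DataFrame(
    {
        "Production Budget": [1_000_000, 40_000_000, 10_000_000],
        "Worldwide Gross": [5_500_000, 120_750_000, 8_000_000],
    }
)
X = data[["Production Budget"]]
y = data[["Worldwide Gross"]]

fig = plt.figure(figsize=(10, 6))              # width and height in inches
plt.scatter(X, y, color="indigo", alpha=0.3)   # alpha=0.3 makes the dots see-through
plt.title("Film Cost vs Global Revenue", fontsize=17)   # placeholder title
plt.xlabel("Production Budget $", fontsize=14)          # placeholder label
plt.ylabel("Worldwide Gross $", fontsize=14)            # placeholder label
plt.xlim(0, 50_000_000)                        # lower limit, upper limit
plt.ylim(0, 150_000_000)
plt.show()
```

A low alpha value is handy with thousands of points, because areas where many dots overlap show up darker.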

 

  • Tip: You can pull up quick documentation for any function in Jupyter Notebook by placing the cursor on that function and pressing Shift+Tab, which gives you an understanding of what the function does.

 

CONGRATULATIONS! GIVE YOURSELF A PAT ON THE BACK FOR YOU HAVE ENTERED INTO THE WORLD OF DATA SCIENCE WITH MACHINE LEARNING AND HAVE WRITTEN YOUR FIRST EVER PROGRAM TO VISUALISE DATA USING PYTHON MODULES.

I WILL BE POSTING FURTHER ARTICLES, TEACHING YOU VARIOUS CONCEPTS SUCH AS REGRESSION AND GRADIENT DESCENT, ALL THE WAY TO MAKING YOUR OWN ML MODELS.