Titanic Climax with Matplotlib Numpy Pandas

brahmajit-mohapatra-f8fe5582 · 16 May 2021 19:42

Hello Data Enthusiasts,

The playground of data is Kaggle. No player ever achieved great success without practicing daily on the ground. Be it Messi, Ronaldo or Virat Kohli, or any of your own favorite players, the reason they are what they are today is sheer dedication and the daily grind on the ground to master their sport. In the same way, we need to get into the arena to fight and achieve greatness.

Kaggle is the arena for us. Let’s begin by solving an easy problem statement: Titanic.

The most interesting fact is that a novel predicted the sinking of the Titanic 14 years before the actual disaster. WOW! Crazy. In 1898, American author Morgan Robertson wrote a novel titled ‘The Wreck of the Titan.’ The book was about a fictional ocean liner that sinks after colliding with an iceberg. In the book, the ship is described as “unsinkable” and doesn’t have enough lifeboats for everyone on board. Sounds familiar? Yes, you’re right: it is the epic story of the Titanic, foretold years in advance.

We cannot say whether the author had any technical basis for his prediction, but we, as responsible data science enthusiasts, can use the data set to explore the possibilities and outcomes of the disaster, and we can even try to envision different versions of the climax.

I am sure that all of us know what happened to Rose and Jack in the movie Titanic. We all wished that the story had a different ending, didn’t we? Let’s try to make our wish come true by recreating the climax of the story through a simple analysis.

At the end of the analysis we will have created three climaxes and answered three questions:

• Was it possible for Jack to stay alive and for Rose to survive as well?

• Was there a chance for Jack and Rose together to narrate their adventurous story to their grandchildren?

• Did Cal Hockley (Rose’s fiancé) have a higher chance of survival because he belonged to the upper class, and what would it take for the villain to die?

We are carrying out our analysis using the ‘Matplotlib’, ‘NumPy’, ‘Pandas’, and ‘Seaborn’ libraries.

Let us see what each library does:

Matplotlib is a Python library used for visualizing data sets with a wide variety of plots, such as bar plots, line plots and histograms.

NumPy is also a Python library. It provides a high-performance multidimensional array and basic tools to compute with and manipulate these arrays.

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Let us start our journey…

Data Exploration:

import pandas as pd
import numpy as np
import random as rnd

# Visualization

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

We import all the necessary libraries and read the data set, which holds the statistics of the Titanic disaster, using the ‘pandas’ library.

titanic_df = pd.read_csv('titanic.csv')

Then we display the first five entries using the head command to get a glimpse of the nature of the data and the labels we are going to explore:

titanic_df.head()
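
For readers who do not have the CSV at hand, here is a minimal sketch of the same read-then-peek step. It is only an illustration: the rows are made up (they are not real passenger records), they are parsed from an in-memory string instead of titanic.csv, and the name sample_df is hypothetical. Only the column names mirror the ones discussed in this post.

```python
import io

import pandas as pd

# Made-up rows for illustration only; these are NOT real Titanic records.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
1,0,3,male,30,0,0,8.00
2,1,1,female,41,1,0,70.00
3,1,3,female,,0,0,7.50
4,0,2,male,54,0,0,26.00
5,1,2,female,4,1,1,23.00
6,0,3,male,,0,2,8.50
"""

# read_csv accepts a path or any file-like object, so StringIO stands in for the file.
sample_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; empty Age cells are parsed as NaN.
first_rows = sample_df.head()
print(first_rows)
print(sample_df.shape)
```

Passing a number, for example sample_df.head(3), changes how many rows are shown.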

Feature Analysis:

Looking at titanic_df.describe(), we gain a lot of useful insights and find the labels we can ignore:

titanic_df.describe()
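
One detail worth knowing when reading this table: describe() summarizes only the numeric columns by default, and its count row counts non-missing values, which is how gaps such as missing ages show up. A minimal sketch on made-up toy rows (not real Titanic records; toy_df is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Made-up toy rows for illustration only; these are NOT real Titanic records.
toy_df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass": [1, 3, 2, 3, 1, 3],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Age": [29.0, np.nan, 4.0, 35.0, np.nan, 50.0],
})

# Numeric columns only: the text column "Sex" is left out of the summary.
summary = toy_df.describe()

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()

print(summary)
print(missing_per_column)
```

Here the count for Age is lower than for the other columns because two of the six toy ages are missing.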

• PassengerId: Unique for each passenger, so it has no relation to the survival label and need not be considered in the analysis.
• Survived: Survival is binary, 0 if the passenger died and 1 if the passenger survived, so this will be the only ‘Y’ variable in XY plotting.
• Pclass: Integer equal to 1, 2, or 3 indicating the class of each passenger (upper, middle, or lower, respectively). It is worth analyzing, as its three categories may contribute to the survival of passengers.
• Age: Number representing the age of each passenger, though as we can see in titanic_df.tail(), some passengers have NaN for their age. It is worth considering, since younger passengers may have acted swiftly and escaped, so age can also contribute to the survival label.
• SibSp: Number of siblings and spouses also on board. We should not ignore this completely, as it may or may not support the survival label.
• Parch: Number of parents and children also on board. The same reasoning as for SibSp applies.
• Fare: Amount paid for the ticket by each passenger. This may add to the passenger class label, as the higher the fare, the higher the class of ticket.

For a quick comparison, we’ll use NumPy functions to verify the mean, standard deviation, min, and max of the numerical columns.

columns = ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

def describe_data(data, col):
    print('\n\n', col)
    print('_' * 40)
    print('Mean:', np.mean(data))  # NumPy mean
    print('STD:', np.std(data))    # NumPy standard deviation
    print('Min:', np.min(data))    # NumPy min
    print('Max:', np.max(data))    # NumPy max

for c in columns:
    describe_data(titanic_df[c], c)
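
When comparing these numbers with describe(), two things can cause small surprises: how missing values are treated, and which standard deviation formula is used. The sketch below illustrates both on a handful of made-up ages (not real Titanic records; all variable names are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only; these are NOT real Titanic records.
ages = pd.Series([22.0, 38.0, np.nan, 35.0])
raw_ages = ages.to_numpy()

# On a plain NumPy array, a single NaN makes the mean NaN.
plain_mean = np.mean(raw_ages)

# np.nanmean skips NaN, which matches what pandas does by default.
nan_mean = np.nanmean(raw_ages)

# pandas' Series.std() uses the sample formula (ddof=1), as describe() does.
pandas_std = ages.std()

# NumPy's std functions default to the population formula (ddof=0).
numpy_std = np.nanstd(raw_ages)

# Passing ddof=1 to NumPy reproduces the pandas value.
matched_std = np.nanstd(raw_ages, ddof=1)

print(plain_mean, nan_mean, ages.mean())
print(pandas_std, numpy_std, matched_std)
```

So if the STD printed with NumPy is slightly smaller than the std row of describe(), the ddof default is a likely reason.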

Insights from these are:

• Survived is a categorical label with 0 or 1 values.

• Around 38% of the passengers in the sample survived, which is close to the actual survival rate of 32%.

• Most passengers (> 75%) did not travel with parents or children.

• Nearly 30% of the passengers had siblings and/or spouses aboard.

• Fares varied significantly, with a few passengers (<1%) paying as much as $512.

• There were few elderly passengers (<1%) in the age range 65–80.
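
Percentages like the ones above can be computed by taking the mean of a 0/1 column or of a boolean condition. Here is a minimal sketch on made-up toy rows (not real Titanic records, so the resulting numbers are not the ones quoted above; toy_df is a hypothetical name):

```python
import pandas as pd

# Made-up toy rows for illustration only; these are NOT real Titanic records.
toy_df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "SibSp":    [1, 0, 0, 0, 3, 0, 1, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 2, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones.
survival_rate = toy_df["Survived"].mean()

# The mean of a boolean condition is the share of rows where it holds.
share_without_parch = (toy_df["Parch"] == 0).mean()
share_with_sibsp = (toy_df["SibSp"] > 0).mean()

print(f"Survived: {survival_rate:.0%}")
print(f"No parents/children aboard: {share_without_parch:.0%}")
print(f"Siblings/spouses aboard: {share_with_sibsp:.0%}")
```

The same three lines can be pointed at titanic_df to check the figures in the list.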

Great numbers! Let us move on to realize our dream climaxes…

Apart from our assumptions about the climax, there are certain limitations:

Some of these inferences were drawn from correlation, so it’s always important to remember that correlation does not imply causation.
Since some passengers did not have a recorded age, entries with ‘NaN’ (null) ages were not taken into account when running these numbers.
Conclusions were drawn from descriptive statistics and charts; we opted not to run t-tests on the sample.
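
As an illustration of the first two limitations, here is one way rows with a missing age might be set aside before computing a correlation. This is a sketch on made-up toy rows (not real Titanic records; toy_df and known_age are hypothetical names), not necessarily the exact steps behind the numbers in this post.

```python
import numpy as np
import pandas as pd

# Made-up toy rows for illustration only; these are NOT real Titanic records.
toy_df = pd.DataFrame({
    "Age":      [22.0, np.nan, 4.0, 35.0, np.nan, 50.0, 28.0, 61.0],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Keep only the rows where Age is recorded.
known_age = toy_df.dropna(subset=["Age"])

# Pearson correlation between age and survival on the remaining rows.
# It measures association only; it says nothing about cause.
age_survival_corr = known_age["Age"].corr(known_age["Survived"])

print(len(toy_df), "rows in total,", len(known_age), "with a recorded age")
print("Correlation:", age_survival_corr)
```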

Interesting Findings:

What proportion of passengers in the sample survived?

38% of the passengers in the sample survived.

Did women and children have a higher survival rate?

The female survival rate in this sample was 55.3 percentage points higher than the survival rate for males.
Women had a much higher rate of survival than men.
Children under the age of 5, regardless of sex, had a much higher rate of survival.
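
Group-wise survival rates like these can be computed with a groupby and a boolean mask. The sketch below uses made-up toy rows (not real Titanic records, so its output does not reproduce the figures above; toy_df is a hypothetical name) and reports the gap between the sexes as a difference in percentage points.

```python
import pandas as pd

# Made-up toy rows for illustration only; these are NOT real Titanic records.
toy_df = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 6,
    "Age": [30, 3, 45, 27, 40, 2, 4, 33, 19, 52],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Survival rate for each sex.
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()

# Gap between the two rates, expressed in percentage points.
gap_points = (rate_by_sex["female"] - rate_by_sex["male"]) * 100

# Survival rate of children under 5, regardless of sex.
is_young_child = toy_df["Age"] < 5
child_rate = toy_df.loc[is_young_child, "Survived"].mean()

print(rate_by_sex)
print(f"Gap: {gap_points:.1f} percentage points")
print(f"Children under 5: {child_rate:.0%}")
```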

Did upper-class passengers in the sample have an advantage that translated into a higher survival rate than lower-class passengers?

Passenger class has a strong correlation with survival, with upper-class passengers having a much higher rate of survival than lower-class passengers, regardless of sex and age.
Upper-class passengers were more likely to survive than lower-class passengers.
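
A comparison by class is a natural place to bring in Matplotlib. Here is a minimal sketch of a survival-rate bar chart, again on made-up toy rows (not real Titanic records; toy_df is a hypothetical name), so the bar heights are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy rows for illustration only; these are NOT real Titanic records.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Survival rate for each passenger class.
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()

# One bar per class; class labels are converted to strings for the x axis.
fig, ax = plt.subplots()
ax.bar(rate_by_class.index.astype(str), rate_by_class.values)
ax.set_xlabel("Pclass (1 = upper, 3 = lower)")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by passenger class (toy data)")

print(rate_by_class)
```

In a notebook with %matplotlib inline the chart appears under the cell; in a plain script, call plt.show() at the end.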

So, as we approach the climax of our post, let’s quickly summarize: we gained some insights into the Matplotlib, NumPy, pandas, and seaborn libraries, which are essential for data science.

Also, instead of mourning the loss of Jack and the separation of true love, we explored the possibilities of changing the climax. That is exactly the duty of a data scientist: to analyze the data and come up with useful possibilities for attaining desired outcomes.

Now it’s your turn, folks, to create your own customized climaxes and conclusions with this kind of simple analysis of the data set, and to come up with creative and innovative endings for your favorite historical epics. Kudos to the learners!

Keep Exploring!

Thank you.