How to Develop a Neural Net for Predicting Car Insurance Payout-1

swapneel-panda-419bc751 · 28 June 2021 11:33

Auto Insurance Regression Dataset

The first step is to define and explore the dataset.

We will be working with the “Auto Insurance” standard regression dataset.

The dataset describes Swedish car insurance. There is a single input variable, the number of claims, and the target variable is the total payment for those claims in thousands of Swedish kronor. The goal is to predict the total payment given the number of claims.

You can learn more about the dataset here:

Auto Insurance Dataset (auto-insurance.csv)
Auto Insurance Dataset Details (auto-insurance.names)

You can see the first few rows of the dataset below.

108,392.5
19,46.2
13,15.7
124,422.2
40,119.4
…

We can see that the values are numeric and may range from tens to hundreds. This suggests some type of scaling would be appropriate for the data when modeling with a neural network.
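As a rough illustration of what such scaling could look like, the sketch below normalizes the five sample rows shown above to the range 0 to 1. The use of scikit-learn's MinMaxScaler is an assumption made here for illustration; other scaling methods may work equally well.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The five sample rows shown above: (number of claims, total payment)
rows = np.array([
    [108, 392.5],
    [19, 46.2],
    [13, 15.7],
    [124, 422.2],
    [40, 119.4],
])

# Rescale each column independently to the range [0, 1]
# (MinMaxScaler is an illustrative choice, not prescribed by the dataset)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(rows)
print(scaled.round(3))
```

When evaluating a model, the scaler would normally be fit on the training rows only and then applied to the test rows, so that no information leaks from the test set.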

We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the dataset and summarize the shape
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize shape
print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can confirm that the dataset has two variables (one input and one output) and that the dataset has 63 rows of data.

This is not many rows of data for a neural network and suggests that a small network, perhaps with regularization, would be appropriate.

It also suggests that k-fold cross-validation would be a good idea. It gives a more reliable estimate of model performance than a single train/test split, and it is affordable here because a single model fits in seconds on this data, rather than the hours or days needed for the largest datasets.

(63, 2)
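To make the k-fold idea concrete, here is a minimal sketch of the evaluation loop. Several things in it are assumptions for illustration: the data is synthetic placeholder data with the same shape as the dataset (not the real insurance values), 10 folds and mean absolute error are arbitrary choices, and a linear regression stands in for the neural network so the example runs quickly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Placeholder data with the same shape as the dataset (63 rows, 1 input).
# These values are synthetic and are NOT the real insurance data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 124, size=(63, 1))
y = 3.4 * X[:, 0] + rng.normal(0, 35, size=63)

# 10 folds (an assumed value): every row is used for testing exactly once
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for train_ix, test_ix in kfold.split(X):
    # Stand-in model; a neural network would be fit here instead
    model = LinearRegression()
    model.fit(X[train_ix], y[train_ix])
    yhat = model.predict(X[test_ix])
    scores.append(mean_absolute_error(y[test_ix], yhat))

# Summarize performance across the folds
print('Mean MAE: %.3f (std %.3f)' % (np.mean(scores), np.std(scores)))
```

Reporting the mean and standard deviation across folds gives a sense of both expected performance and how much it varies with the particular split.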
Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the dataset
from pandas import read_csv
from matplotlib import pyplot
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
# load the dataset
df = read_csv(url, header=None)
# show summary statistics
print(df.describe())
# plot histograms
df.hist()
pyplot.show()

Running the example first loads the data and then prints summary statistics for each variable.

We can see that the mean value for each variable is in the tens, with values ranging from 0 to the hundreds. This confirms that scaling the data is probably a good idea.

                0           1
count   63.000000   63.000000
mean    22.904762   98.187302
std     23.351946   87.327553
min      0.000000    0.000000
25%      7.500000   38.850000
50%     14.000000   73.400000
75%     29.000000  140.000000
max    124.000000  422.200000

A histogram plot is then created for each variable.

We can see that each variable has a similar distribution. It looks like a skewed Gaussian distribution or an exponential distribution.

We may benefit from applying a power transform to each variable to make its probability distribution less skewed, which will likely improve model performance.
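A power transform of this kind might be sketched as follows, using the five sample rows shown earlier. The choice of scikit-learn's PowerTransformer with the Yeo-Johnson method is an assumption made here: it accepts zero values, which matter because both variables have a minimum of 0, whereas the Box-Cox method requires strictly positive inputs.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# The five sample rows shown earlier: (number of claims, total payment)
rows = np.array([
    [108, 392.5],
    [19, 46.2],
    [13, 15.7],
    [124, 422.2],
    [40, 119.4],
])

# Yeo-Johnson handles zeros; by default the result is also standardized
# to zero mean and unit variance for each column
pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(rows)

print(transformed.round(3))
# The fitted power parameter (lambda) for each column
print(pt.lambdas_)
```

On the full dataset, plotting histograms of the transformed columns would show whether the skew has actually been reduced.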