In this activity we practised calculating correlations and making regression models with Python - I learned a lot about which Python libraries and functions can be used for these tasks!
I started by generating random datasets and calculating the Pearson correlation coefficient.
from numpy.random import randn
from matplotlib import pyplot as plt
from scipy.stats import pearsonr
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
plt.scatter(data1, data2)
plt.show()
corr = pearsonr(data1, data2)
print("Pearson's correlation: %.3f" % corr.statistic)

Pearson's correlation: 0.893
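To see what `pearsonr` is doing, here is a minimal sketch that computes the coefficient by hand (covariance divided by the product of the standard deviations) and compares it with SciPy. The seeded generator and the names `rng`, `dx`, `dy` and `r_manual` are my own additions for illustration, so the number will not match the run above exactly.

```python
import numpy as np
from scipy.stats import pearsonr

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
data1 = 20 * rng.standard_normal(1000) + 100
data2 = data1 + (10 * rng.standard_normal(1000) + 50)

# Pearson's r = covariance(x, y) / (std(x) * std(y))
dx = data1 - data1.mean()
dy = data2 - data2.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The first element of the pearsonr result is the coefficient
r_scipy = pearsonr(data1, data2)[0]
print("manual: %.3f, scipy: %.3f" % (r_manual, r_scipy))
```

The two values should agree to many decimal places, since both implement the same formula.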
I added more variance to data2 - the correlation went down.
data2 = data1 + (20 * randn(1000) + 50)
plt.scatter(data1, data2)
plt.show()
corr = pearsonr(data1, data2)
print("Pearson's correlation: %.3f" % corr.statistic)

Pearson's correlation: 0.698
I added even more variance to data2 - the correlation went further down!
data2 = data1 + (50 * randn(1000) + 50)
plt.scatter(data1, data2)
plt.show()
corr = pearsonr(data1, data2)
print("Pearson's correlation: %.3f" % corr.statistic)

Pearson's correlation: 0.333
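The drop in correlation can be predicted. If data2 is data1 plus independent noise, the population correlation is sd(data1) / sqrt(sd(data1)^2 + sd(noise)^2). The sketch below evaluates that for the three noise levels used above; `expected_corr` is a hypothetical helper name, not a library function.

```python
import math

def expected_corr(signal_sd, noise_sd):
    """Population correlation between x and x + independent noise."""
    return signal_sd / math.sqrt(signal_sd ** 2 + noise_sd ** 2)

# data1 has standard deviation 20; the noise had 10, 20 and 50
for noise_sd in (10, 20, 50):
    r = expected_corr(20, noise_sd)
    print("noise sd %d: expected r = %.3f" % (noise_sd, r))
```

This gives roughly 0.894, 0.707 and 0.371. The sample values I measured (0.893, 0.698, 0.333) are close to these, and the gaps are what I would expect from random sampling with 1000 points.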
Next I fitted a linear regression model to the random data - I regenerated data2 with the low-noise settings that gave the high correlation (about 0.89).
from scipy import stats
data2 = data1 + (10 * randn(1000) + 50)
model = stats.linregress(data1, data2)
def model_function(x):
    return model.slope * x + model.intercept
mymodel = list(map(model_function, data1))
plt.scatter(data1, data2)
plt.plot(data1, mymodel, "y")
plt.show()
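The result of `linregress` holds more than the slope and intercept. As a rough sketch, assuming freshly generated and seeded data (so the numbers differ from my plot), the fitted values can be inspected like this. Because data2 was built as data1 plus noise centred on 50, I would expect a slope near 1 and an intercept near 50.

```python
import numpy as np
from scipy import stats

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(1)
data1 = 20 * rng.standard_normal(1000) + 100
data2 = data1 + (10 * rng.standard_normal(1000) + 50)

model = stats.linregress(data1, data2)

# rvalue is the Pearson correlation; squaring it gives R squared
r_squared = model.rvalue ** 2

print("slope: %.3f" % model.slope)
print("intercept: %.3f" % model.intercept)
print("R squared: %.3f" % r_squared)
print("slope standard error: %.4f" % model.stderr)
```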

Next I created a multiple linear regression model using house price data which I downloaded from Kaggle.
import pandas
from sklearn import linear_model
df = pandas.read_csv("real_estate.csv")
df.head()
| | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.500 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
I created the model using the features which seemed most relevant to house price - house age, distance to nearest MRT station, and number of convenience stores.
X = df[["X2 house age", "X3 distance to the nearest MRT station", "X4 number of convenience stores"]]
y = df["Y house price of unit area"]
regr = linear_model.LinearRegression()
regr.fit(X.values, y)
I used the model to predict the price of unit area for a house with these parameters: house age 10, distance to the nearest MRT station 100, and 7 convenience stores.
print(regr.predict([[10, 100, 7]]))

[48.99291231]
The predicted "house price of unit area" was about 49.
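A multiple linear regression prediction is just the intercept plus each coefficient times its feature value. The sketch below checks this against `predict()`. It uses synthetic stand-in data that I made up (the feature ranges and the "true" relationship are arbitrary assumptions), not the Kaggle file, so the numbers are not comparable with my result above.

```python
import numpy as np
from sklearn import linear_model

# Synthetic stand-in data, made up for illustration (not real_estate.csv)
rng = np.random.default_rng(2)
age = rng.uniform(0, 40, 200)
distance = rng.uniform(50, 5000, 200)
stores = rng.integers(0, 11, 200)
X = np.column_stack([age, distance, stores])

# Hypothetical noise-free relationship, chosen arbitrarily
y = 40 - 0.2 * age - 0.005 * distance + 1.2 * stores

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Prediction by hand: intercept + sum(coefficient * feature value)
house = np.array([10, 100, 7])
by_hand = regr.intercept_ + regr.coef_ @ house
by_model = regr.predict([house])[0]

print("coefficients:", regr.coef_)
print("by hand: %.3f, predict(): %.3f" % (by_hand, by_model))
```

Looking at `coef_` also shows the direction of each effect. In this made-up data, price falls with age and distance and rises with the number of stores.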
Finally I tried polynomial models of different degrees (the degree is the highest power of x in the model). The data is the speed of cars passing a toll booth at different times - x is the time and y is the speed.
import numpy
import matplotlib.pyplot as plt
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]
A degree-1 polynomial produces a straight line.
mymodel = numpy.poly1d(numpy.polyfit(x, y, 1))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

A degree-2 polynomial produces a simple curve.
mymodel = numpy.poly1d(numpy.polyfit(x, y, 2))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

A degree-3 polynomial produces a curve which fits the data well.
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

A degree-4 polynomial produces a curve which is starting to overfit the data.
mymodel = numpy.poly1d(numpy.polyfit(x, y, 4))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

A degree-10 polynomial produces a curve which definitely overfits the data.
mymodel = numpy.poly1d(numpy.polyfit(x, y, 10))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

The best fit for this data appears to be a degree-3 polynomial.
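I judged the fits by eye, but the choice could also be checked with numbers. The sketch below is one possible approach, not something we did in the activity. It computes R squared on the fitted data, which can only go up as the degree increases, and a leave-one-out error, which tends to get worse once a model overfits. The helper names `r_squared` and `loo_rmse` are my own.

```python
import warnings
import numpy as np

x = np.array([1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22], dtype=float)
y = np.array([100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100], dtype=float)

def r_squared(degree):
    """R squared of a polynomial fit, measured on the data it was fitted to."""
    model = np.poly1d(np.polyfit(x, y, degree))
    ss_res = ((y - model(x)) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

def loo_rmse(degree):
    """Leave-one-out error: fit without one point, then predict that point."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        model = np.poly1d(np.polyfit(x[keep], y[keep], degree))
        errors.append(y[i] - model(x[i]))
    return float(np.sqrt(np.mean(np.square(errors))))

with warnings.catch_warnings():
    # High degrees can trigger conditioning warnings from polyfit
    warnings.simplefilter("ignore")
    results = {d: (r_squared(d), loo_rmse(d)) for d in (1, 2, 3, 4, 10)}

for d, (r2, rmse) in results.items():
    print("degree %2d: training R2 = %.3f, leave-one-out RMSE = %.2f" % (d, r2, rmse))
```

A high training R squared on its own does not show that a model is good, because the degree-10 fit will score at least as well as degree 3 on the data it has seen. The leave-one-out column is the more useful one for comparing degrees.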
This was a great activity for practising creating different statistical models. We were encouraged to try different parameters - this helped me learn what effects they have, which will be useful in my future data science career!