From the course: Artificial Intelligence Foundations: Machine Learning

Visualizing and understanding data

When exploring your dataset, you'll need to visualize the data by plotting charts and graphs so that you understand it better. Matplotlib and Seaborn are widely used Python-based 2D plotting libraries. These libraries let you generate production-quality visualizations with just a few lines of code.

Here, I've navigated to the Jupyter notebook. Let's start with histograms. Histograms are similar to bar charts and show us the distribution of our numerical data. We'll plot the values for the target variable, median house value, using a series of bars on the histogram. Here's the histogram. You'll see on the x-axis at the bottom the cost of the home and on the y-axis on the left the count of homes at that value. We can see from the plot that the values of the median house value are distributed approximately normally, with a few outliers here on the right. Most of the homes are within the $100,000 to $200,000 range.

Now let's plot histograms for the remaining features to understand the data distributions. Since ocean proximity is non-numeric, a histogram is not needed. Here we have the histograms for longitude, latitude, housing median age, total rooms, total bedrooms, population, let's scroll down, households, median income, and median house value. Median income is not expressed in U.S. dollars (USD). The data collection team has pre-processed the data by scaling it and capping it at 15 for higher median incomes and 0.5 for lower median incomes.

Now these histograms tell us several things. Let's scroll back up to the housing median age. There are several outliers noted here on the right-hand side. Several local peaks are quite gradual. However, this peak here on the right-hand side, at the maximum value, is really odd, which indicates outliers. The peak becomes more visible when you adjust the bins parameter of the histogram function. Now let's find median house value down here. Median house value contains outliers too.
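The histogram steps described above can be sketched roughly as follows. This is a minimal illustration, not the course notebook: the DataFrame is synthetic, the column names are assumed from the feature names mentioned in the narration, and the caps of 52 and 500,000 are made-up stand-ins chosen only to reproduce a peak at the maximum value.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 1000
housing = pd.DataFrame({
    "housing_median_age": np.clip(rng.normal(30, 12, n).round(), 1, 52),
    "median_income": np.clip(rng.lognormal(1.2, 0.5, n), 0.5, 15),
    "median_house_value": np.clip(rng.normal(250_000, 120_000, n), 15_000, 500_000),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})

# Histogram of the target variable: x-axis is the value, y-axis is the count of homes.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(housing["median_house_value"], bins=50)
ax.set_xlabel("median_house_value")
ax.set_ylabel("count of homes")

# Histograms for the numeric features only; ocean_proximity is non-numeric and is skipped.
numeric = housing.select_dtypes(include="number")
axes = numeric.hist(bins=50, figsize=(12, 8))
```

Raising or lowering `bins` changes how finely the values are grouped, which is why a pile-up of capped values at the maximum can look more or less pronounced depending on that setting.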
There's an odd peak at the maximum of median house value, here around 500,000, which could indicate outliers. I'll show you later in the course how to handle outliers so they don't impact the performance of your model.

Now let's look at heat maps. Heat maps show the correlation between features, or how related one feature is to another feature. If features are highly correlated, that means those features could possibly teach the model the same thing. Duplicate features should be removed to speed up the training process, save money, and improve the model's prediction capabilities.

When reading a heat map, expect a line running from top-left to bottom-right, where each feature is compared with itself. In each cell, expect values from -1 to 1. Values closer to 0 show a low correlation, while values closer to 1 show a high positive correlation and values closer to -1 show a high negative correlation. Expect the heat map to be symmetrical, where the bottom-left is the same as the top-right, because the correlation between two features is the same in either order.

The heat map shows that several features are correlated. When the light pink appears more than once in a row, consider those features to be correlated. As expected, the total rooms feature is related to itself, but it is also related to the total bedrooms, population, and households features. These features are candidates for removal. Dimensionality reduction is the act of reducing the number of features to improve the runtime and effectiveness of your models.

Now that we understand our dataset, we are ready for feature engineering, a process that manipulates your data by adding, deleting, combining, and creating new features to improve training and prediction capabilities.
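A correlation heat map of the kind described above could be built along these lines. This is a sketch under stated assumptions: the data is synthetic, the column names are assumed from the narration, the 0.8 threshold is an arbitrary choice for illustration, and the plot uses Matplotlib's `imshow` to stay dependency-light (Seaborn's `heatmap` function can draw the same matrix).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: room, bedroom, and population counts all scale with households.
rng = np.random.default_rng(0)
n = 1000
households = rng.integers(100, 1000, n)
housing = pd.DataFrame({
    "households": households,
    "total_rooms": households * 5 + rng.normal(0, 200, n),
    "total_bedrooms": households * 1.1 + rng.normal(0, 50, n),
    "population": households * 3 + rng.normal(0, 300, n),
    "median_income": rng.lognormal(1.2, 0.5, n),
})

# Pairwise Pearson correlation: symmetric, with 1.0 on the diagonal.
corr = housing.corr()

# Draw the matrix as a heat map on a fixed -1..1 color scale.
fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)

# List each feature pair once if its absolute correlation exceeds the threshold.
threshold = 0.8  # arbitrary cut-off for illustration
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```

Listing the pairs alongside the plot makes it easier to decide which of the related features to keep, since the colors alone only show that a relationship exists.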
