Industry-Location Relationship Analysis

Next, we analyze how port locations correlate with dominant industries using a categorical-vs-categorical visualization, and practice log-scale transformations for skewed distributions.

We will create a heatmap to show the relationships between continents and top industries. Here's some reading to get you familiar with the concept:

Heatmaps and correlation plots, Ricardo García Ramírez, Medium.com
Demystifying heatmaps: A comprehensive beginner's guide

🔧 Your tasks:

First calculate the contingency table.
Plot the contingency table as a heatmap. It will likely look like this:

There is a problem here: the "Mineral Products" industry absolutely dominates all other industries, which prevents us from seeing the relationships in the other columns. We can use a log scale to fix this.
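To see why a log scale helps, here is a minimal sketch with made-up counts (the numbers are hypothetical and do not come from the dataset). One value is several orders of magnitude larger than the others, as with "Mineral Products":

```python
import numpy as np

# Hypothetical counts: the last category dwarfs the others.
counts = np.array([9, 99, 99_999])

# On the raw scale the largest value is about 11,000 times the smallest.
raw_ratio = counts.max() / counts.min()

# After the transformation the values are roughly 2.3, 4.6 and 11.5,
# so the largest is only about 5 times the smallest.
log_counts = np.log1p(counts)
log_ratio = log_counts.max() / log_counts.min()

print(log_counts.round(2))
print(round(raw_ratio), round(log_ratio, 1))
```

On a heatmap's color scale, this means the smaller categories are no longer squeezed into a single shade.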

Transform the contingency table values to log scale. Now you should have something like this:

Optionally, to reduce clutter and focus on the most important industries, you can first filter the dataset to include only the top 5 industries, and then calculate the contingency table from the filtered data. That would result in this:

Interpretation tasks:

🤔 Do some online research to find an explanation for the dominance of the "Mineral Products" industry in maritime trade.
🤔 Do some online research to find out what real-world factors explain Asia's high "Animal & Animal Products" presence.

You can use pandas' crosstab() function to compute a contingency table between two variables.
For the log scale transformation, you can use numpy's log1p(). This calculates the natural logarithm of the value plus 1. (🤔 Why plus one?)
You can use value_counts() with nlargest(x) to filter the top x values.
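The hints above can be combined roughly as follows. This is only a sketch on a tiny made-up table: the DataFrame `ports`, its column names `continent` and `industry`, and all row values are assumptions for illustration, so adapt them to the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; names and values are hypothetical.
rows = [
    ("Asia", "Mineral Products"), ("Asia", "Mineral Products"),
    ("Asia", "Mineral Products"), ("Asia", "Animal & Animal Products"),
    ("Asia", "Animal & Animal Products"), ("Europe", "Mineral Products"),
    ("Europe", "Mineral Products"), ("Europe", "Machinery"),
    ("Africa", "Mineral Products"), ("Africa", "Textiles"),
]
ports = pd.DataFrame(rows, columns=["continent", "industry"])

# Contingency table: one row per continent, one column per industry.
table = pd.crosstab(ports["continent"], ports["industry"])

# Log-scale version of the same table (element-wise ln(1 + count)).
log_table = np.log1p(table)

# Optional: keep only the N most frequent industries, then recompute.
top_n = 2  # the task text suggests 5 for the real dataset
top_industries = ports["industry"].value_counts().nlargest(top_n).index
filtered = ports[ports["industry"].isin(top_industries)]
top_table = pd.crosstab(filtered["continent"], filtered["industry"])

print(table)
print(log_table.round(2))
print(top_table)
```

Any of the three tables can then be handed to a heatmap function, for example seaborn's `heatmap()`, to produce the plots described in the tasks.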


Last updated 11 months ago
