Pareto Analysis
We saw earlier that the vessel count distributions are highly right-skewed, meaning that there are a few ports that handle significantly more traffic than the rest. This suggests that there is a small portion of all ports that is responsible for the majority of global maritime trade. The Pareto Principle (commonly known as the 80-20 rule) states that roughly 80% of the consequences come from 20% of the causes. Your task is to examine if it applies here.
🔧 Your tasks:
Calculate the cumulative sum of total vessel counts (What is a cumulative sum?)
Calculate the cumulative percentage of total vessels. You can do this by dividing the cumulative sum by the overall total of vessel counts across all ports and multiplying by 100.
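The two calculations above can be sketched on a small made-up table. This is only an illustration under assumptions: the column names `port` and `vessel_count` and all of the numbers are hypothetical, not taken from the real dataset, so you will need to adapt them to your own data.

```python
import pandas as pd

# Hypothetical toy data: the column names and values are assumptions for illustration.
ports = pd.DataFrame({
    "port": ["A", "B", "C", "D", "E"],
    "vessel_count": [40, 500, 100, 300, 60],
})

# Overall total of vessels across all ports.
total = ports["vessel_count"].sum()

# Sort from busiest to least busy and renumber the rows 0, 1, 2, ...
ranked = ports.sort_values("vessel_count", ascending=False).reset_index(drop=True)

# Running total down the ranking, then expressed as a percentage of the overall total.
ranked["cum_sum"] = ranked["vessel_count"].cumsum()
ranked["cum_pct"] = 100 * ranked["cum_sum"] / total

print(ranked)
```

In this toy table the second row has a cumulative percentage of 80, meaning the two busiest ports together handle 80% of all vessels.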
Interpretation tasks:
🤔 Why are we doing this? What does it mean for a port in this sorted dataset to have a cumulative percentage of X percent?
Determine the cutoff at which the cumulative percentage becomes larger than 80%. In other words: "What is the first port that has a cumulative percentage of more than 80%?"
Here you need the cutoff index and not the port name, because you want to know how many ports come before this cutoff. For example, if the cutoff port has index 10 and the index starts at 0, then the 10 ports before it account for roughly 80% of all vessels.
Finally, answer: What percentage of ports account for roughly 80% of global maritime trade? Does this correspond to the Pareto principle?
Make your own variation of the 80-20 principle. Tweak the percentage to reach another somewhat surprising insight. Present your answer like so: "The top X% of ports account for Y% of global maritime trade."
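One possible way to find the cutoff and phrase the result is sketched below. It assumes the cumulative percentages are already computed for ports sorted from busiest to least busy and that the index is the default one starting at 0; the five values are hypothetical, and `threshold` is simply a name chosen for this illustration.

```python
import pandas as pd

# Hypothetical cumulative percentages for five ports, sorted from busiest to least busy.
cum_pct = pd.Series([50.0, 80.0, 90.0, 96.0, 100.0])

threshold = 80  # change this value to build your own variation

# The comparison yields a True/False Series; idxmax() returns the index label
# of the first True, i.e. the first port whose cumulative percentage exceeds the threshold.
cutoff = int((cum_pct > threshold).idxmax())

# With an index starting at 0, the cutoff index equals the number of ports before it.
ports_before = cutoff
share_of_ports = 100 * ports_before / len(cum_pct)

# Cumulative percentage reached by the ports before the cutoff (0 if there are none).
share_of_traffic = cum_pct.iloc[cutoff - 1] if cutoff > 0 else 0.0

print(f"The top {share_of_ports:.0f}% of ports account for "
      f"{share_of_traffic:.0f}% of global maritime trade.")
```

Changing `threshold` is enough to explore other splits. Note that this toy example does not follow the 80-20 rule; whether the real data does is for you to find out.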
You can use .sum() to calculate the total sum.
You can use .sort_values() to sort the dataset. Make sure to reset the index afterwards (find out why and how on your own).
You can use .cumsum() for calculating the cumulative sum.
