# Pareto Analysis

We saw earlier that the vessel count distributions are highly right-skewed: a few ports handle significantly more traffic than the rest. This suggests that a small share of ports may be responsible for the majority of global maritime trade.

The Pareto Principle (commonly known as the 80-20 rule) states that roughly 80% of the consequences come from 20% of the causes. Your task is to examine whether it applies here.

***

### **🔧 Your tasks:**

* [ ] Calculate the sum of vessel counts (`TOTAL_VESSEL_COUNT`)
* [ ] Sort the data based on vessel counts in descending order. This is necessary for calculating the cumulative sum.

  **Interpretation tasks:**

  * [ ] 🤔 Why is it necessary to sort data before calculating the cumulative sum?
  * [ ] 🤔 Considering the question we are trying to answer, why is it necessary in this case to sort in descending order? When would we need to sort in ascending order?
  * [ ] 🤔 What would happen if we calculated the cumulative sum on unsorted data? Why is this undesirable?
* [ ] Calculate the cumulative sum of total vessel counts ([What is a cumulative sum?](https://www.google.com/search?q=What+is+a+cumulative+sum))
* [ ] Calculate the cumulative percentage of total vessels. You can do this by dividing the cumulative sum by the total sum you calculated in the first step.

  **Interpretation tasks:**

  * [ ] 🤔 Why are we doing this? What does it mean for a port in this **sorted** dataset to have a cumulative percentage of *X* percent?
* [ ] Determine the cutoff at which the cumulative percentage first exceeds 80%. In other words: "What is the first port with a cumulative percentage of more than 80%?"

  Here you need the cutoff index and not the port name, because you want to know how many ports come before this cutoff. For example, with a zero-based index (the pandas default after resetting the index), a cutoff port at index 10 has 10 ports before it, and the first 11 ports together account for just over 80% of the total.

* [ ] Finally, answer: What percentage of ports account for roughly 80% of global maritime trade? Does this correspond to the Pareto principle?
* [ ] Make your own variation of the 80-20 principle. Tweak the percentage to reach another somewhat surprising insight. Present your answer like so: "The top *X*% of ports account for *Y*% of global maritime trade."
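
The steps above can be sketched on a tiny made-up table before you apply them to the real data. This is only an illustration under stated assumptions: the port names and counts are invented, the column name `PORT` and the variable names are hypothetical, and only `TOTAL_VESSEL_COUNT` is taken from the task description. Your own solution may be structured differently.

```python
import pandas as pd

# Toy data, invented for illustration only (not the real port dataset).
ports = pd.DataFrame({
    "PORT": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "TOTAL_VESSEL_COUNT": [5, 400, 10, 250, 5, 160, 20, 90, 10, 50],
})

# Step 1: total vessel count across all ports.
total = ports["TOTAL_VESSEL_COUNT"].sum()

# Step 2: sort descending and reset the index so that
# index labels match the rank of each port (0 = busiest).
sorted_ports = (
    ports.sort_values("TOTAL_VESSEL_COUNT", ascending=False)
    .reset_index(drop=True)
)

# Steps 3 and 4: cumulative sum and cumulative percentage of the total.
sorted_ports["CUM_SUM"] = sorted_ports["TOTAL_VESSEL_COUNT"].cumsum()
sorted_ports["CUM_PCT"] = sorted_ports["CUM_SUM"] * 100 / total

# Step 5: index of the first port whose cumulative percentage exceeds 80%.
# idxmax() on a boolean Series returns the label of the first True value.
cutoff_idx = int((sorted_ports["CUM_PCT"] > 80).idxmax())

# Ports up to and including the cutoff, as a count and as a share of all ports.
ports_needed = cutoff_idx + 1
share_of_ports = ports_needed * 100 / len(sorted_ports)

print(sorted_ports)
print(f"Cutoff index: {cutoff_idx}")
print(f"{share_of_ports:.0f}% of ports cover just over 80% of vessel counts")
```

Note that the toy numbers were chosen so that the arithmetic is easy to check by hand; they say nothing about what you will find in the real dataset.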

> <img src="https://2669499530-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FnYNN3nXNuXMJpHACcH73%2Fuploads%2Ft1yAGmUambZeYVQvPSeu%2Fp.png?alt=media&#x26;token=01872756-9ca8-44f9-9ec1-1ff5f70ce561" alt="" data-size="line">
>
> You can use `.sum()` for the first task and `.sort_values()` to sort the dataset. Make sure to reset the index after sorting (find out why and how on your own).
>
> You can use `.cumsum()` to calculate the cumulative sum.
