Chapter 4 Missing values
For the missing values chapter, we chose slightly more variables than the ones mentioned in the previous chapters. This is to represent a more general version of the missing value patterns in the data set.
4.1 Number of missing attributes for each wave per question - Bar plots and Cleveland Plots
First, we drew a bar plot of the maximum number of missing answers (for variables 13 - 17) per wave for each question. This helps us identify the waves in which some participants left a considerable number of attributes unanswered for each question.
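As a rough illustration of the quantity being plotted, the per-wave maximum could be computed along the following lines. This is a minimal sketch, not the code used for the plots: the column names (`attr1_1`, `sinc1_1`, and so on), the `wave` key, the use of `None` for a missing answer, and the toy rows are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical column names for the five attributes of question 1_1.
ATTRS = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1"]


def max_missing_per_wave(rows, attrs=ATTRS):
    """Return {wave: largest number of missing attributes in any single row}."""
    worst = defaultdict(int)
    for row in rows:
        # A value counts as missing when it is None (an assumption of this sketch).
        n_missing = sum(row.get(a) is None for a in attrs)
        worst[row["wave"]] = max(worst[row["wave"]], n_missing)
    return dict(worst)


def make_row(wave, values):
    """Build a toy row from a wave number and five attribute values."""
    return {"wave": wave, **dict(zip(ATTRS, values))}


# Toy rows invented for illustration; they are not taken from the data set.
rows = [
    make_row(1, [15, 20, 20, 15, 15]),
    make_row(2, [None, None, None, None, None]),  # answered none of the five
    make_row(2, [35, 20, 20, 20, 5]),
    make_row(5, [20, 20, 20, 20, None]),          # one attribute missing
]

print(max_missing_per_wave(rows))  # {1: 0, 2: 5, 5: 1}
```

A maximum of 5 means at least one person in that wave skipped the whole question, while a maximum of 0 means everyone in the wave answered every attribute.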
For Question 1_1:
For Question 1_1
We can observe that in waves 2, 3, 6, 13, and 14 there are people who did not answer any of the 5 attributes in question 1_1. In wave 5, the maximum number of missing attributes is 1, so everyone rated at least 4 of the 5 attributes. In all other waves, all attributes were answered by everyone.
For Question 2_1:
For Question 2_1
As with question 1_1, we can observe that in waves 2, 3, 6, 13, and 14 there are people who did not answer any of the 5 attributes in question 2_1. In wave 5, the maximum number of missing attributes is 1, so everyone rated at least 4 of the 5 attributes. In all other waves, all attributes were answered by everyone.
For Question 3_1:
For Question 3_1
We can observe that in waves 2, 3, 6, 13, 14, 15, and 16 there are people who did not answer any of the 5 attributes in question 3_1. In all other waves, all attributes were answered by everyone. Here, the maximum number of missing answers is either all or none.
For Question 4_1:
For Question 4_1
We can observe that in waves 1, 2, 3, 4, 5, 6, 13, and 14 there are people who did not answer any of the 5 attributes in question 4_1. In all other waves, all attributes were answered by everyone. Here too, the maximum number of missing answers is either all or none.
For Question 5_1:
For Question 5_1
We can observe that in waves 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 14, 15, and 16 there are people who did not answer any of the 5 attributes in question 5_1. In all other waves, all attributes were answered by everyone. Here too, the maximum number of missing answers is either all or none.
4.2 Heatmap
The plots above only give the count of missing values. We can drill down to inspect the missing values within a wave. Heatmaps are one way to visualize in detail which columns are missing. The plots below show example heatmaps for several questions, each for a single wave.
For Question 1_1 and wave 5
For Question 1_1 and wave 6
For Question 2_1 and wave 5
For Question 2_1 and wave 6
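The data behind such a heatmap is a respondent-by-column indicator matrix for one wave. The sketch below builds that matrix and renders it as text so that it runs without a plotting library. It rests on the same assumptions as before: the column names, the `wave` key, `None` as the missing marker, and the toy rows are hypothetical, and the helper names `missing_matrix` and `render` are invented for this example.

```python
# Hypothetical column names for the five attributes of question 1_1.
COLUMNS = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1"]


def missing_matrix(rows, wave, columns=COLUMNS):
    """Return one 0/1 list per respondent in `wave`; 1 marks a missing value."""
    return [
        [int(row.get(col) is None) for col in columns]
        for row in rows
        if row["wave"] == wave
    ]


def render(matrix):
    """Render the matrix as text lines: '#' for missing, '.' for present."""
    return ["".join("#" if cell else "." for cell in line) for line in matrix]


def make_row(wave, values):
    """Build a toy row from a wave number and five attribute values."""
    return {"wave": wave, **dict(zip(COLUMNS, values))}


# Toy rows invented for illustration; they are not taken from the data set.
rows = [
    make_row(5, [20, 20, 20, 20, 20]),
    make_row(5, [25, 25, 25, 25, None]),          # only the last attribute missing
    make_row(5, [30, 10, 20, 20, 20]),
    make_row(6, [None, None, None, None, None]),  # whole question skipped
]

for line in render(missing_matrix(rows, wave=5)):
    print(line)
```

Each printed line is one respondent and each character is one column, so a solid row of `#` corresponds to a person who skipped the entire question. The same 0/1 matrix could be passed to a plotting library to draw a graphical heatmap.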
4.3 Missing patterns per question
Since we have 8378 observations, we used row percentages instead of row counts in our missing values plots. Below are the plots for missing values per question.
For question 1_1:
We can observe that we have 9 missing value patterns in the data for question 1_1. About 90% of the observations fall under the complete cases pattern. In the second most observed pattern, values are missing for attractiveness, sincerity, intelligence, fun, ambitious, age, goal, and race. In the third most observed pattern, only age is missing.
Moreover, we observe that about 14% of the rows have a missing value for ambitious, which is the highest percentage of missing rows for question 1_1. The second most missing column is age, with slightly less than 14% of the rows missing it, followed by fun, with about 13% of the rows missing it.
For question 2_1:
We can observe that we have 8 missing value patterns in the data for question 2_1. About 90% of the observations fall under the complete cases pattern. In the second most observed pattern, values are missing for attractiveness, sincerity, intelligence, fun, ambitious, age, goal, and race. In the third most observed pattern, only age is missing.
Moreover, we observe that about 14% of the rows have a missing value for age, which is the highest percentage of missing rows for question 2_1. The second most missing column is ambitious, with about 13% of the rows missing it, followed by fun, with about 12% of the rows missing it.
For question 3_1:
We can observe that we have 7 missing value patterns in the data for question 3_1. About 90% of the observations fall under the complete cases pattern. In the second most observed pattern, values are missing for attractiveness, sincerity, intelligence, fun, ambitious, age, goal, and race. In the third most observed pattern, only age is missing.
Moreover, we observe that each of the five attributes (attractiveness, sincerity, intelligence, fun, and ambitious) is missing in about 13.5% of the rows, which is the highest percentage of missing rows for question 3_1. The second most missing column is age, with about 12.5% of the rows missing it, followed by goal, with about 10% of the rows missing it.
For question 4_1:
We can observe that we have 8 missing value patterns in the data for question 4_1. About 75% of the observations fall under the complete cases pattern.
The second most observed pattern, covering about 20% of the observations, has missing values for all five attributes, i.e., attractiveness, sincerity, intelligence, fun, and ambitious. In the third most observed pattern, values are missing for attractiveness, sincerity, intelligence, fun, ambitious, age, goal, and race.
Moreover, we observe that each of the five attributes (attractiveness, sincerity, intelligence, fun, and ambitious) is missing in about 19% of the rows, which is the highest percentage of missing rows for question 4_1. The second most missing column is age, with about 1% of the rows missing it, followed by goal, with less than 1% of the rows missing it.
For question 5_1:
We can observe that we have 8 missing value patterns in the data for question 5_1. Almost 58% of the observations fall under the complete cases pattern.
The second most observed pattern, covering about 20% of the observations, has missing values for all five attributes, i.e., attractiveness, sincerity, intelligence, fun, and ambitious. In the third most observed pattern, values are missing for attractiveness, sincerity, intelligence, fun, ambitious, age, goal, and race.
Moreover, we observe that each of the five attributes (attractiveness, sincerity, intelligence, fun, and ambitious) is missing in about 20% of the rows, which is the highest percentage of missing rows for question 5_1. The second most missing column is age, with about 1% of the rows missing it, followed by goal, with less than 1% of the rows missing it.
We have noted only the three most frequent patterns and the three most frequently missing columns. We could not comment on the others, as their percentages are too small to read from the plot.
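When percentages are too small to read from a plot, one option is to tabulate them directly. The following is a minimal sketch of how the pattern percentages and the per-column missing percentages discussed in this section could be computed. It is not the method used to produce the plots. The column names, the `None` missing marker, the toy rows, and the helper names `pattern_percentages` and `column_missing_percentages` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical subset of the columns used in the missing-pattern plots.
COLUMNS = ["amb1_1", "fun1_1", "age"]


def pattern_percentages(rows, columns=COLUMNS):
    """Return [(pattern, percent of rows)], most frequent pattern first.

    A pattern is the tuple of columns that are missing in a row;
    the empty tuple () is the complete cases pattern.
    """
    counts = Counter(
        tuple(col for col in columns if row.get(col) is None) for row in rows
    )
    return [(pattern, 100 * n / len(rows)) for pattern, n in counts.most_common()]


def column_missing_percentages(rows, columns=COLUMNS):
    """Return {column: percent of rows in which that column is missing}."""
    return {
        col: 100 * sum(row.get(col) is None for row in rows) / len(rows)
        for col in columns
    }


# Toy rows invented for illustration: 7 complete, 2 missing everything, 1 missing age.
complete = {"amb1_1": 10, "fun1_1": 20, "age": 25}
all_missing = {"amb1_1": None, "fun1_1": None, "age": None}
age_missing = {"amb1_1": 10, "fun1_1": 20, "age": None}
rows = [complete] * 7 + [all_missing] * 2 + [age_missing]

for pattern, pct in pattern_percentages(rows):
    print(pattern or "complete cases", pct)
print(column_missing_percentages(rows))
```

Using percentages rather than counts keeps the table comparable across questions, and every pattern can be listed regardless of how rare it is.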