Categorical data in a DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "raw_grade": ["a", "b", "c", "d", "e"]})
Convert the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Rename the categories to more meaningful names:
df["grade"] = df["grade"].cat.rename_categories(["very bad", "very good", "better", "good", "bad"])
Reorder the categories and simultaneously add any missing categories (methods under Series.cat return
a new Series by default):
df["grade"] = df["grade"].cat.set_categories(["very bad", "very good", "better", "good", "bad"])
df["grade"]
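The effect of set_categories is easier to see when the new list differs from the existing categories. The following is a minimal sketch on a separate, hypothetical Series (the names s and extended are illustrative assumptions and are not part of the example above):

```python
import pandas as pd

# Hypothetical data, independent of the df used above.
s = pd.Series(["a", "b", "a"], dtype="category")

# set_categories returns a new Series; "c" becomes a category
# even though no value in the data uses it.
extended = s.cat.set_categories(["c", "b", "a"])

print(list(s.cat.categories))         # categories of the original Series
print(list(extended.cat.categories))  # reordered, with the unused "c" added
```

The original Series keeps its categories, which is why the result is assigned back to the column in the example above.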
Sorting follows the order of the categories, not lexical order:
df.sort_values(by="grade")
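As a rough illustration of the difference, the sketch below sorts the same hypothetical values once as plain strings and once as a categorical with an explicit category order (the names sizes and cat_sizes are assumptions made for this example):

```python
import pandas as pd

# Hypothetical values, stored as plain strings.
sizes = pd.Series(["large", "small", "medium"])

# The same values as a categorical whose categories are listed
# in the order small, medium, large.
cat_sizes = sizes.astype(pd.CategoricalDtype(["small", "medium", "large"]))

lexical = sizes.sort_values().tolist()          # alphabetical order
by_category = cat_sizes.sort_values().tolist()  # order of the categories

print(lexical)
print(by_category)
```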
Grouping by a categorical column with observed=False also shows empty categories:
df.groupby("grade", observed=False).size()
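A small sketch of how empty categories appear in a group-by, using a hypothetical frame in which the category "great" has no rows (frame, with_empty and only_observed are illustrative names). Passing observed explicitly avoids depending on the default, which has changed between pandas versions:

```python
import pandas as pd

# Hypothetical frame: "great" is a declared category with no rows.
frame = pd.DataFrame({
    "grade": pd.Categorical(["good", "good", "bad"],
                            categories=["bad", "good", "great"]),
})

# observed=False keeps every declared category, including empty ones.
with_empty = frame.groupby("grade", observed=False).size()

# observed=True keeps only categories that occur in the data.
only_observed = frame.groupby("grade", observed=True).size()

print(with_empty)
print(only_observed)
```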