Pandas Series: factorize() function
Encode the object in Pandas
The factorize() function is used to encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().
Syntax:
Series.factorize(self, sort=False, na_sentinel=-1)
Parameters:
Name | Description | Type/Default Value | Required / Optional |
---|---|---|---|
sort | Sort uniques and shuffle labels to maintain the relationship. | boolean Default Value: False | Optional |
na_sentinel | Value to mark “not found”. | int Default Value: -1 | Optional |
Returns:
- labels - ndarray
An integer ndarray that’s an indexer into uniques. uniques.take(labels) will have the same values as the input values.
- uniques - ndarray, Index, or Categorical
The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
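The relationship between the two return values can be checked directly. The following is a minimal sketch, assuming a plain Python list as input (so uniques comes back as a NumPy array); the names values and restored are illustrative only.

```python
import pandas as pd

values = ['q', 'q', 'p', 'r', 'q']
labels, uniques = pd.factorize(values)

# Each label is a position in uniques, so take() rebuilds the input.
restored = uniques.take(labels)
print(restored)
```

This round trip only holds when the input has no missing values, because a label of -1 does not point at a real entry in uniques.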
Example - These examples all show factorize as a top-level function, pd.factorize(values). The results are identical for methods like Series.factorize():
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'])
labels
Output:
array([0, 0, 1, 2, 0], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'])
uniques
Output:
array(['q', 'p', 'r'], dtype=object)
Example - With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained:
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'], sort=True)
labels
Output:
array([1, 1, 0, 2, 1], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'], sort=True)
uniques
Output:
array(['p', 'q', 'r'], dtype=object)
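To see what "the relationship is maintained" means in practice, the sketch below compares the default and sorted results for the same input. It assumes list input as in the examples above; the variable names are illustrative.

```python
import pandas as pd

values = ['q', 'q', 'p', 'r', 'q']
labels_plain, uniques_plain = pd.factorize(values)
labels_sorted, uniques_sorted = pd.factorize(values, sort=True)

# The integer codes differ between the two calls...
codes_differ = labels_plain.tolist() != labels_sorted.tolist()

# ...but each (labels, uniques) pair still decodes to the same input.
same_decoding = (uniques_plain.take(labels_plain).tolist()
                 == uniques_sorted.take(labels_sorted).tolist())
print(codes_differ, same_decoding)
```

Sorting is mainly useful when the codes themselves should follow the order of the values, for example when 0 should always stand for the smallest value.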
Example - Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques:
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', None, 'p', 'r', 'q'])
labels
Output:
array([ 0, -1, 1, 2, 0], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', None, 'p', 'r', 'q'])
uniques
Output:
array(['q', 'p', 'r'], dtype=object)
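Because -1 is also a valid NumPy index (it selects the last element), decoding labels that contain the sentinel needs a little care. The sketch below is one possible way to handle it, not the only one. It relies on the default sentinel rather than passing na_sentinel explicitly, since newer pandas releases replace that parameter with use_na_sentinel.

```python
import pandas as pd

values = ['q', None, 'p', 'r', 'q']
labels, uniques = pd.factorize(values)

# Positions whose label is -1 held a missing value in the input.
missing_positions = [i for i, code in enumerate(labels) if code == -1]

# Decode by hand: indexing uniques with -1 would silently return
# the last unique value, so missing entries are mapped back to None.
restored = [uniques[code] if code != -1 else None for code in labels]
print(missing_positions, restored)
```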
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)
labels
Output:
array([0, 0, 1], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)
uniques
Output:
[p, r]
Categories (3, object): [p, q, r]
Notice that 'q' is in uniques.categories, despite not being present in cat.values.
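The difference between the observed values and the declared categories can be inspected as follows. This is a small sketch using the same cat as above; observed, declared, unused and trimmed are illustrative names.

```python
import pandas as pd

cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)

observed = list(uniques)             # values that actually occur
declared = list(uniques.categories)  # every declared category
unused = [c for c in declared if c not in observed]

# Drop categories that never occur, if they are not wanted.
trimmed = uniques.remove_unused_categories()
print(observed, declared, unused, list(trimmed.categories))
```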
Example - For all other pandas objects, an Index of the appropriate type is returned:
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Series(['p', 'p', 'r'])
labels, uniques = pd.factorize(cat)
labels
Output:
array([0, 0, 1], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Series(['p', 'p', 'r'])
labels, uniques = pd.factorize(cat)
uniques
Output:
Index(['p', 'r'], dtype='object')
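As noted earlier, the method form gives the same result as the top-level function. A minimal sketch comparing the two on the same Series (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['p', 'p', 'r'])

func_labels, func_uniques = pd.factorize(s)  # top-level function
meth_labels, meth_uniques = s.factorize()    # Series method

same_labels = func_labels.tolist() == meth_labels.tolist()
same_uniques = func_uniques.equals(meth_uniques)
print(same_labels, same_uniques, type(meth_uniques).__name__)
```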