Pandas Series: factorize() function
Encode the object in Pandas
The factorize() function encodes an object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().
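As a rough illustration of the three entry points named above, the sketch below factorizes the same data through each of them. The sample values and variable names are made up for illustration. The input is wrapped in a NumPy object array, an assumption made here because some recent pandas releases warn when a plain list is passed.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (hypothetical, not from a real dataset).
values = np.array(['q', 'q', 'p', 'r', 'q'], dtype=object)

# Top-level function.
top_labels, top_uniques = pd.factorize(values)

# Series and Index methods applied to the same data.
ser_labels, ser_uniques = pd.Series(values).factorize()
idx_labels, idx_uniques = pd.Index(values).factorize()

# All three calls should produce the same integer labels and the same uniques.
print(top_labels.tolist(), ser_labels.tolist(), idx_labels.tolist())
print(list(top_uniques), list(ser_uniques), list(idx_uniques))
```

Only the container type of the uniques is expected to differ between the calls; the labels themselves should match.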
Syntax:
Series.factorize(self, sort=False, na_sentinel=-1)
Parameters:
Name | Description | Type/Default Value | Required / Optional |
---|---|---|---|
sort | Sort uniques and shuffle labels to maintain the relationship. | boolean Default Value: False | Optional |
na_sentinel | Value to mark “not found”. | int Default Value: -1 | Optional |
Returns:
- labels - ndarray
An integer ndarray that’s an indexer into uniques. uniques.take(labels) will have the same values as values.
- uniques - ndarray, Index, or Categorical
The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
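The relationship between the two return values, uniques.take(labels) giving back the original values, can be checked with a minimal sketch like the one below. The sample data is made up, and it deliberately contains no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample input without missing values.
values = np.array(['q', 'q', 'p', 'r', 'q'], dtype=object)
labels, uniques = pd.factorize(values)

# Each label is a position in uniques, so take() rebuilds the input.
rebuilt = uniques.take(labels)
print(rebuilt.tolist())
```

This round trip is only expected to hold when no label equals the missing-value sentinel, because a label of -1 would be read by take() as "last element" rather than "missing".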
Example - These examples all show factorize as a top-level function, pd.factorize(values). The results are identical for methods like Series.factorize():
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'])
labels
Output:
array([0, 0, 1, 2, 0], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'])
uniques
Output:
array(['q', 'p', 'r'], dtype=object)
Example - With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained:
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'], sort=True)
labels
Output:
array([1, 1, 0, 2, 1], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'], sort=True)
uniques
Output:
array(['p', 'q', 'r'], dtype=object)
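To make "the relationship is maintained" concrete, the following sketch decodes the labels from both the unsorted and the sorted call and compares the results. The variable names are illustrative, and the input is wrapped in an object array as an assumption rather than a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample input.
values = np.array(['q', 'q', 'p', 'r', 'q'], dtype=object)

# Order of first appearance (default) versus sorted uniques.
labels_first, uniques_first = pd.factorize(values)
labels_sorted, uniques_sorted = pd.factorize(values, sort=True)

# The integer labels differ, but each pair decodes to the same values.
decoded_first = uniques_first.take(labels_first).tolist()
decoded_sorted = uniques_sorted.take(labels_sorted).tolist()
print(decoded_first == decoded_sorted)
```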
Example - Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques:
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', None, 'p', 'r', 'q'])
labels
Output:
array([ 0, -1, 1, 2, 0], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', None, 'p', 'r', 'q'])
uniques
Output:
array(['q', 'p', 'r'], dtype=object)
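Since missing values never appear in uniques, decoding the labels needs a small amount of extra care. The sketch below is one possible way to do it, assuming the default sentinel of -1; the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample input with one missing value.
values = np.array(['q', None, 'p', 'r', 'q'], dtype=object)
labels, uniques = pd.factorize(values)

# Positions that were marked as missing (default sentinel assumed to be -1).
missing = labels == -1

# Decode manually, mapping the sentinel back to None instead of indexing.
decoded = [None if code == -1 else uniques[code] for code in labels]
print(missing.tolist())
print(decoded)
```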
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)
labels
Output:
array([0, 0, 1], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)
uniques
Output:
[p, r]
Categories (3, object): [p, q, r]
Notice that 'q' is in uniques.categories, despite not being present in cat.values.
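The difference between the observed values and the declared categories can be inspected directly. The following is a small sketch under the same setup as the example above; the names observed, declared, and unused are illustrative only.

```python
import pandas as pd

cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)

# Values that actually occur in the data.
observed = list(uniques)

# Categories carried over from the original Categorical.
declared = list(uniques.categories)

# Categories that are declared but never observed.
unused = [c for c in declared if c not in observed]
print(observed, declared, unused)
```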
Example - For all other pandas objects, an Index of the appropriate type is returned:
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Series(['p', 'p', 'r'])
labels, uniques = pd.factorize(cat)
labels
Output:
array([0, 0, 1], dtype=int64)
Python-Pandas Code:
import numpy as np
import pandas as pd
cat = pd.Series(['p', 'p', 'r'])
labels, uniques = pd.factorize(cat)
uniques
Output:
Index(['p', 'r'], dtype='object')
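The way the type of uniques follows the type of the input can be summarized in one short sketch. It assumes the same three values as above and uses made-up variable names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample input as a NumPy object array.
arr = np.array(['p', 'p', 'r'], dtype=object)

# Same data, three different containers.
_, from_array = pd.factorize(arr)
_, from_series = pd.factorize(pd.Series(arr))
_, from_cat = pd.factorize(pd.Categorical(arr))

# Print the class name of each uniques object.
print(type(from_array).__name__)
print(type(from_series).__name__)
print(type(from_cat).__name__)
```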