w3resource

Pandas Series: factorize() function

Encode the object in Pandas

The factorize() function is used to encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as the top-level function pandas.factorize() and as the methods Series.factorize() and Index.factorize().
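As a quick illustration of the "numeric representation" idea, here is a minimal sketch that encodes a text column as integer codes. The DataFrame, the column name city, and the new column city_code are hypothetical names chosen only for this example.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the encoding.
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Oslo', 'Pune', 'Lima']})

# Each distinct city gets an integer code in order of first appearance.
codes, cities = df['city'].factorize()
df['city_code'] = codes

print(df['city_code'].tolist())  # [0, 1, 0, 2, 1]
print(list(cities))              # ['Oslo', 'Lima', 'Pune']
```

The codes only identify distinct values; they carry no ordering or magnitude of their own.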

Syntax:

Series.factorize(self, sort=False, na_sentinel=-1)

Parameters:

Name | Description | Type / Default Value | Required / Optional
sort | Sort uniques and shuffle labels to maintain the relationship. | boolean, Default Value: False | Optional
na_sentinel | Value to mark "not found". | int, Default Value: -1 | Optional

Returns:

  • labels - ndarray
    An integer ndarray that’s an indexer into uniques. uniques.take(labels) will have the same values as values.
  • uniques - ndarray, Index, or Categorical
    The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
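The relationship between the two return values can be checked directly. The following is a minimal sketch, assuming a small illustrative input array named values; it verifies that using labels as an indexer into uniques rebuilds the original data.

```python
import numpy as np
import pandas as pd

# Illustrative input; any 1-D array without missing values would do.
values = np.array(['q', 'q', 'p', 'r', 'q'], dtype=object)

labels, uniques = pd.factorize(values)

# labels is an integer indexer into uniques, so take() rebuilds the input.
rebuilt = uniques.take(labels)
print(np.array_equal(rebuilt, values))  # True
```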

Example - These examples all show factorize as a top-level function, pd.factorize(values). The labels are the same for methods like Series.factorize():

Python-Pandas Code:

import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'])
labels

Output:

array([0, 0, 1, 2, 0], dtype=int64)

Python-Pandas Code:

import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'])
uniques

Output:

array(['q', 'p', 'r'], dtype=object)
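To compare the top-level function with the method form, the sketch below runs both on the same list. The variable names are illustrative. Note that the method is called on a Series, which is a pandas object, so its uniques come back as an Index rather than a plain array, in line with the Returns section above.

```python
import numpy as np
import pandas as pd

data = ['q', 'q', 'p', 'r', 'q']

# Top-level function applied to a plain list.
func_labels, func_uniques = pd.factorize(data)

# Method form applied to a Series built from the same list.
meth_labels, meth_uniques = pd.Series(data).factorize()

print(np.array_equal(func_labels, meth_labels))   # True
print(list(func_uniques) == list(meth_uniques))   # True
print(type(meth_uniques).__name__)                # Index
```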

Example - With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained:

Python-Pandas Code:

import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'], sort=True)
labels

Output:

array([1, 1, 0, 2, 1], dtype=int64)

Python-Pandas Code:

import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', 'q', 'p', 'r', 'q'], sort=True)
uniques

Output:

array(['p', 'q', 'r'], dtype=object)
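What "the relationship is maintained" means can be verified in a few lines. This is a small sketch with illustrative variable names: sorting changes the order of uniques and the label numbers, but indexing uniques with labels still returns the original data.

```python
import pandas as pd

data = ['q', 'q', 'p', 'r', 'q']

_, first_seen = pd.factorize(data)               # order of first appearance
labels, uniques = pd.factorize(data, sort=True)  # sorted order

print(list(first_seen))  # ['q', 'p', 'r']
print(list(uniques))     # ['p', 'q', 'r']

# The labels were renumbered to match the sorted uniques.
print(list(uniques.take(labels)) == data)  # True
```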

Example - Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques:

Python-Pandas Code:

import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', None, 'p', 'r', 'q'])
labels

Output:

array([ 0, -1,  1,  2,  0], dtype=int64)

Python-Pandas Code:

import numpy as np
import pandas as pd
labels, uniques = pd.factorize(['q', None, 'p', 'r', 'q'])
uniques

Output:

array(['q', 'p', 'r'], dtype=object)
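Because -1 is also a valid negative index, using labels directly as an indexer would silently map missing entries to the last unique value. The sketch below, with illustrative variable names, masks the sentinel before decoding. It relies on the default sentinel of -1 rather than passing na_sentinel explicitly, since pandas 2.0 and later replaced that parameter with use_na_sentinel.

```python
import numpy as np
import pandas as pd

data = ['q', None, 'p', 'r', 'q']
labels, uniques = pd.factorize(data)

# Positions holding the sentinel correspond to missing values.
missing = labels == -1
print(missing.tolist())  # [False, True, False, False, False]

# Decode only the valid positions; leave missing ones as None.
decoded = np.full(len(labels), None, dtype=object)
decoded[~missing] = uniques[labels[~missing]]
print(decoded.tolist())  # ['q', None, 'p', 'r', 'q']
```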

Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.

Python-Pandas Code:

import numpy as np
import pandas as pd
cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)
labels

Output:

array([0, 0, 1], dtype=int64)

Python-Pandas Code:

import numpy as np
import pandas as pd
cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)
uniques

Output:

[p, r]
Categories (3, object): [p, q, r]

Notice that 'q' is in uniques.categories, despite not being present in cat.values.
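The difference between the values and the categories of the returned Categorical can be inspected as follows. This is a short sketch reusing the cat object from the example above.

```python
import pandas as pd

cat = pd.Categorical(['p', 'p', 'r'], categories=['p', 'q', 'r'])
labels, uniques = pd.factorize(cat)

# Only values that actually occur appear as elements of uniques...
print(list(uniques))             # ['p', 'r']

# ...but the full set of categories, including the unused 'q', is kept.
print(list(uniques.categories))  # ['p', 'q', 'r']

print(isinstance(uniques, pd.Categorical))  # True
```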

Example - For all other pandas objects, an Index of the appropriate type is returned:

Python-Pandas Code:

import numpy as np
import pandas as pd
cat = pd.Series(['p', 'p', 'r'])
labels, uniques = pd.factorize(cat)
labels

Output:

array([0, 0, 1], dtype=int64)

Python-Pandas Code:

import numpy as np
import pandas as pd
cat = pd.Series(['p', 'p', 'r'])
labels, uniques = pd.factorize(cat)
uniques

Output:

Index(['p', 'r'], dtype='object')
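To see what "an Index of the appropriate type" looks like in practice, the sketch below factorizes a string Series and a datetime Series. The dates are made-up sample values, and the variable names are illustrative.

```python
import pandas as pd

text = pd.Series(['p', 'p', 'r'])
# Made-up sample dates, used only for illustration.
dates = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-01', '2021-06-30']))

_, text_uniques = pd.factorize(text)
date_labels, date_uniques = pd.factorize(dates)

print(isinstance(text_uniques, pd.Index))  # True
print(type(date_uniques).__name__)         # DatetimeIndex
print(date_labels.tolist())                # [0, 0, 1]
```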
