Pandas Series: describe() function
Generate descriptive statistics in Pandas
The describe() function is used to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
It analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types. The output varies depending on the input. Refer to the notes below for more detail.
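To illustrate the NaN handling mentioned above, here is a minimal sketch. The Series name `s_with_nan` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s_with_nan = pd.Series([1.0, np.nan, 3.0])

summary = s_with_nan.describe()

# The NaN is skipped: only the two valid values are counted and averaged
print(summary["count"])  # 2.0
print(summary["mean"])   # 2.0
```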
Syntax:
Series.describe(self, percentiles=None, include=None, exclude=None)
Parameters:
Name | Description | Type/Default Value | Required / Optional |
---|---|---|---|
percentiles | The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles. | list-like of numbers | Optional |
include | A whitelist of data types to include in the result. Ignored for Series. | 'all', list-like of dtypes or None (default) | Optional |
exclude | A blacklist of data types to omit from the result. Ignored for Series. | list-like of dtypes or None (default) | Optional |
Returns: Series or DataFrame
Summary statistics of the Series or DataFrame provided.
Notes: For numeric data, the result’s index will include count, mean, std, min, and max, as well as the lower, 50th, and upper percentiles. By default the lower percentile is the 25th and the upper percentile is the 75th. The 50th percentile is the same as the median.
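As a rough sketch of how the percentiles parameter changes the lower and upper percentiles, the example below asks for the 10th and 90th percentiles instead of the defaults. The variable name `custom` is chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([2, 3, 4])

# Request the 10th and 90th percentiles instead of the default 25th and 75th
custom = s.describe(percentiles=[0.1, 0.9])

# The result's index now carries '10%' and '90%' labels
print(custom["10%"])
print(custom["90%"])
```

Percentile values are interpolated between data points, so they need not be values that actually occur in the Series.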
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. In pandas versions before 2.0, timestamps also include the first and last items; from pandas 2.0 onward, datetime data is summarized like numeric data instead.
If multiple object values share the highest count, then the top result will be arbitrarily chosen from among those values.
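A small sketch of the tie case, using a made-up Series in which 'p' and 'q' both appear twice. Because the choice of top is arbitrary, the example checks only that it is one of the tied values.

```python
import pandas as pd

# Hypothetical Series where 'p' and 'q' are tied for most common value
s_tied = pd.Series(['p', 'p', 'q', 'q', 'r'])

tied = s_tied.describe()

# freq is the shared highest count; top may be either 'p' or 'q'
print(tied["freq"])
print(tied["top"] in {'p', 'q'})
```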
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
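The following sketch shows how include and exclude might be used on a DataFrame. The DataFrame `df` and its column names are invented for this example, and the text column is given an explicit object dtype so that selecting by object behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one numeric and one object column
df = pd.DataFrame({
    "numeric": [1, 2, 3],
    "object": pd.Series(["a", "b", "c"], dtype=object),
})

default_summary = df.describe()                  # numeric columns only
all_summary = df.describe(include="all")         # union of all attributes
object_summary = df.describe(include=[object])   # object columns only
no_numeric = df.describe(exclude=[np.number])    # everything except numeric

print(list(default_summary.columns))
print(list(object_summary.columns))
```

With include='all', statistics that do not apply to a column (for example, mean for the object column) are filled with NaN.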
Example - Describing a numeric Series:
Python-Pandas Code:
import numpy as np
import pandas as pd
s = pd.Series([2, 3, 4])
s.describe()
Output:
count    3.0
mean     3.0
std      1.0
min      2.0
25%      2.5
50%      3.0
75%      3.5
max      4.0
dtype: float64
Example - Describing a categorical Series:
Python-Pandas Code:
import numpy as np
import pandas as pd
s = pd.Series(['p', 'p', 'q', 'r'])
s.describe()
Output:
count     4
unique    3
top       p
freq      2
dtype: object
Example - Describing a timestamp Series:
Python-Pandas Code:
import numpy as np
import pandas as pd
s = pd.Series([
np.datetime64("2012-02-02"),
np.datetime64("2019-02-02"),
np.datetime64("2019-02-02")
])
s.describe()
Output (pandas versions before 2.0):
count                       3
unique                      2
top       2019-02-02 00:00:00
freq                        2
first     2012-02-02 00:00:00
last      2019-02-02 00:00:00
dtype: object