Matching / broadcasting behavior
DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), … for carrying
out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions,
you can match on either the index or the columns via the axis keyword:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(2), index=['a', 'b']),
    'two': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'three': pd.Series(np.random.randn(4), index=['b', 'c', 'd', 'f'])})
df
row = df.iloc[1]
column = df['two']
df.sub(row, axis='columns')
df.sub(row, axis=1)
df.sub(column, axis='index')
df.sub(column, axis=0)
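Because the frame above is random, the effect of the axis keyword can be hard to see. The following is a minimal sketch on a small hand-built frame; the data and the names frame, first_row and col_x are made up for illustration and do not appear in the example above.

```python
import pandas as pd

# Illustrative data, not taken from the example above.
frame = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]},
                     index=['a', 'b', 'c'])

first_row = frame.iloc[0]   # Series indexed by the column labels: x, y
col_x = frame['x']          # Series indexed by the row labels: a, b, c

# Match the Series on the column labels; it is subtracted from every row.
by_columns = frame.sub(first_row, axis='columns')

# Match the Series on the row labels; it is subtracted from every column.
by_index = frame.sub(col_x, axis='index')

print(by_columns)
print(by_index)
```

With axis='columns' the result is the same as the plain operator form frame - first_row; the axis='index' form has no operator equivalent, which is where the method is most useful.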
Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.
dfmi = df.copy()
dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
                                        (1, 'c'), (2, 'a'),
                                        (2, 'f')],
                                       names=['first', 'second'])
dfmi.sub(column, axis=0, level='second')
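To make the level alignment easier to follow, here is a small sketch with fixed numbers. The frame frame_mi and the Series offsets are hypothetical; only the level names 'first' and 'second' are borrowed from the example above.

```python
import pandas as pd

# Illustrative data; the level names mirror the example above.
index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')],
                                  names=['first', 'second'])
frame_mi = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]}, index=index)

# A Series keyed only by the labels of the 'second' level.
offsets = pd.Series([10.0, 20.0], index=['a', 'b'])

# Each row is matched to the offset whose label equals the row's
# value in the 'second' level, regardless of the 'first' level.
shifted = frame_mi.sub(offsets, axis=0, level='second')

print(shifted)
```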
Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at
the same time, returning a two-tuple of the same type as the left-hand side. For example:
s = pd.Series(np.arange(10))
s
div, rem = divmod(s, 3)
div
rem
idx = pd.Index(np.arange(8))
idx
div, rem = divmod(idx, 3)
div
rem
We can also do elementwise divmod():
div, rem = divmod(s, [1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
div
rem
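As a quick sanity check of both forms, the sketch below uses a short Series with made-up values (values and divisors are illustrative names) and verifies the usual identity quotient * divisor + remainder == original.

```python
import pandas as pd

values = pd.Series([7, 8, 9, 10])

# Scalar divisor: every element is divided by 3.
quotient, remainder = divmod(values, 3)

# List divisor of the same length: element i is divided by divisors[i].
divisors = [2, 3, 4, 5]
quotient_each, remainder_each = divmod(values, divisors)

# The defining identity of divmod holds element by element.
rebuilt = quotient * 3 + remainder

print(quotient.tolist(), remainder.tolist())
print(quotient_each.tolist(), remainder_each.tolist())
```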
Missing data / operations with fill values
In Series and DataFrame, the arithmetic functions accept a fill_value argument, namely a value to
substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects,
you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will
be NaN (you can later replace NaN with some other value using fillna if you wish).
df
df2 = pd.DataFrame(np.random.randint(low=8, high=10, size=(5, 5)),
                   columns=['a', 'b', 'c', 'd', 'f'])
df2
df = pd.DataFrame(np.random.randint(low=6, high=8, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'f'])
df
df + df2
df.add(df2, fill_value=0)
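The difference between + and add(..., fill_value=0) only shows up where values are actually missing. A minimal sketch with hand-placed NaNs follows; the frames left and right are made up for illustration.

```python
import numpy as np
import pandas as pd

# Row 0: both present. Row 1: only `right` present. Row 2: both missing.
left = pd.DataFrame({'x': [1.0, np.nan, np.nan]})
right = pd.DataFrame({'x': [10.0, 20.0, np.nan]})

plain_sum = left + right                     # NaN wherever either side is NaN
filled_sum = left.add(right, fill_value=0)   # NaN only where both sides are NaN

print(plain_sum)
print(filled_sum)
```

In row 1 the missing value on the left is treated as 0, so the filled result is 20.0; in row 2 both inputs are missing, so the result stays NaN.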
Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge, whose behavior is analogous
to the binary arithmetic operations described above:
df.gt(df2)
df2.ne(df)
These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool.
These boolean objects can be used in indexing operations.
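As an illustration of using such a boolean result for indexing, here is a small sketch with fixed values; left, right, mask and kept are hypothetical names.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 5], 'b': [7, 2]})
right = pd.DataFrame({'a': [3, 3], 'b': [3, 3]})

# Element-wise "greater than"; every column of the result has dtype bool.
mask = left.gt(right)

# Indexing with a boolean DataFrame keeps values where the mask is True
# and puts NaN everywhere else.
kept = left[mask]

print(mask)
print(kept)
```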
Boolean reductions
You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.
(df > 0).all()
(df > 0).any()
You can reduce to a final boolean value.
(df > 0).any().any()
You can test if a pandas object is empty, via the empty property.
df.empty
pd.DataFrame(columns=list('ABC')).empty
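With random data the outcome of these reductions varies from run to run, so here is a deterministic sketch. The frame and variable names are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [-1, 3]})
positive = frame > 0

per_column_all = positive.all()      # one boolean per column
per_column_any = positive.any()      # one boolean per column
any_at_all = positive.any().any()    # collapsed to a single boolean

# A frame with columns but no rows counts as empty.
no_rows = pd.DataFrame(columns=['a', 'b'])

print(per_column_all)
print(any_at_all, frame.empty, no_rows.empty)
```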
To evaluate single-element pandas objects in a boolean context, use the method bool():
pd.Series([True]).bool()
pd.Series([False]).bool()
pd.DataFrame([[True]]).bool()
pd.DataFrame([[False]]).bool()
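Recent pandas releases deprecate bool() (to our understanding, starting with version 2.1), so the calls above may emit a warning or be unavailable depending on the installed version. One possible alternative for a single-element Series, sketched here rather than prescribed, is Series.item(); the names flag and ambiguous are illustrative.

```python
import pandas as pd

# item() returns the sole element of a length-1 Series as a Python scalar
# and raises ValueError for any other length.
flag = pd.Series([True]).item()

try:
    pd.Series([True, False]).item()
    ambiguous = False
except ValueError:
    # More than one element: there is no single value to return.
    ambiguous = True

print(flag, ambiguous)
```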
Comparing if objects are equivalent
Often you may find that there is more than one way to compute the same result. As a simple example, consider df + df and df * 2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df + df == df * 2).all(). But in fact, when df contains missing values, this expression is False:
df + df == df * 2
(df + df == df * 2).all()
Notice that the boolean DataFrame df + df == df * 2 contains some False values wherever df holds a NaN. This is because NaNs
do not compare as equal:
np.nan == np.nan
So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding
locations treated as equal.
(df + df).equals(df * 2)
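The contrast between == and equals() is only visible when the data contain NaN. The following sketch places one NaN by hand; frame and the other names are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

# Element-wise comparison: the NaN position compares as False.
elementwise = (frame + frame == frame * 2)
all_equal_elementwise = elementwise.all().all()

# equals() treats NaNs in corresponding locations as equal.
same = (frame + frame).equals(frame * 2)

print(elementwise)
print(all_equal_elementwise, same)
```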
Note that the Series or DataFrame index needs to be in the same order for equality to be True:
df1 = pd.DataFrame({'col': ['boo', 0, np.nan]})
df2 = pd.DataFrame({'col': [np.nan, 0, 'boo']}, index=[2, 1, 0])
df1.equals(df2)
df1.equals(df2.sort_index())
Comparing array-like objects
You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:
pd.Series(['boo', 'far', 'baz']) == 'boo'
pd.Index(['boo', 'far', 'baz']) == 'boo'
Pandas also handles element-wise comparisons between different array-like objects of the same length:
pd.Series(['boo', 'far', 'aaz']) == pd.Index(['boo', 'far', 'qux'])
pd.Series(['boo', 'far', 'aaz']) == np.array(['boo', 'far', 'qux'])
Trying to compare Index or Series objects of different lengths will raise a ValueError:
pd.Series(['boo', 'far', 'aaz']) == pd.Series(['boo', 'far'])
ValueError: Series lengths must match to compare
pd.Series(['boo', 'far', 'aaz']) == pd.Series(['boo'])
ValueError: Series lengths must match to compare
Note that this is different from the NumPy behavior where a comparison can be broadcast:
np.array([1, 2, 3, 4]) == np.array([3])
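The three cases above can be put side by side in a short sketch. The variable names are illustrative, and the exact wording of the error message may differ between pandas versions, so only the exception type is checked.

```python
import numpy as np
import pandas as pd

names = pd.Series(['boo', 'far', 'aaz'])

# Same length: compared element by element.
same_length = names == np.array(['boo', 'far', 'qux'])

# Different lengths: pandas raises instead of broadcasting.
try:
    names == pd.Series(['boo', 'far'])
    raised = False
except ValueError:
    raised = True

# NumPy, in contrast, broadcasts the length-1 array.
broadcast = np.array([1, 2, 3, 4]) == np.array([3])

print(same_length.tolist(), raised, broadcast.tolist())
```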
Combining overlapping data sets
A problem that occasionally arises is the combination of two similar data sets where values in one are preferred
over the other. An example would be two data series representing a particular economic indicator where
one is considered to be of “higher quality”. However, the lower quality series might extend further back in history
or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values
in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing
this operation is combine_first(), which we illustrate:
df1 = pd.DataFrame({'A': [1., np.nan, 4., np.nan],
                    'B': [np.nan, 2., 3., 6.]})
df2 = pd.DataFrame({'A': [1., 2., 4., np.nan, 3.],
                    'B': [np.nan, 3., 4., 8., 5.]})
df1
df2
df1.combine_first(df2)
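To connect this with the "higher quality versus longer history" scenario, here is a sketch with two frames whose indexes only partly overlap. The frames preferred and fallback, the column name value, and all numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: `preferred` is the higher-quality but shorter series,
# `fallback` reaches further back.
preferred = pd.DataFrame({'value': [np.nan, 2.0, 3.0]},
                         index=[2001, 2002, 2003])
fallback = pd.DataFrame({'value': [1.5, 2.5, 9.9]},
                        index=[2000, 2001, 2002])

# Values from `preferred` win wherever they are present; gaps and
# labels missing from `preferred` are filled from `fallback`.
combined = preferred.combine_first(fallback)

print(combined)
```

The result covers the union of both indexes: 2000 comes from fallback, the NaN at 2001 is filled from fallback, and 2002 keeps the preferred value 2.0 rather than 9.9.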
General DataFrame combine
The combine_first() method above calls the more general DataFrame.combine(). This method takes another
DataFrame and a combiner function, aligns the two DataFrames, and then passes pairs of
Series (i.e., columns whose names are the same) to the combiner function.
So, for instance, to reproduce combine_first() as above:
def combiner(a, b):
    return np.where(pd.isna(a), b, a)
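A sketch of how such a combiner might be passed to DataFrame.combine() follows, reusing the same values as df1 and df2 from the combine_first() example. The names frame1, frame2 and combined are chosen here for illustration.

```python
import numpy as np
import pandas as pd

def combiner(a, b):
    # Take the value from `a` unless it is missing, then fall back to `b`.
    return np.where(pd.isna(a), b, a)

frame1 = pd.DataFrame({'A': [1., np.nan, 4., np.nan],
                       'B': [np.nan, 2., 3., 6.]})
frame2 = pd.DataFrame({'A': [1., 2., 4., np.nan, 3.],
                       'B': [np.nan, 3., 4., 8., 5.]})

# combine() aligns the two frames, then calls combiner once per shared column.
combined = frame1.combine(frame2, combiner)

print(combined)
```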