From dict of Series or dicts
import numpy as np
import pandas as pd
d = {'one': pd.Series([2., 3., 4.], index=['p', 'q', 'r']),
'two': pd.Series([2., 3., 4., 5.], index=['p', 'q', 'r', 's'])}
df = pd.DataFrame(d)
df
pd.DataFrame(d, index=['s', 'q', 'p'])
pd.DataFrame(d, index=['s', 'q', 'p'], columns=['two', 'three'])
The row and column labels are available through the index and columns attributes, respectively:
df.index
df.columns
From dict of ndarrays / lists
The ndarrays must all be the same length.
If an index is passed, it must also be the same length as the arrays.
If no index is passed, the index will be range(n), where n is the array length.
d = {'one': [4., 5., 6., 7.],
'two': [7., 6., 5., 4.]}
pd.DataFrame(d)
pd.DataFrame(d, index=['w', 'x', 'y', 'z'])
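The length rule above can be checked with a small sketch. The column names and values are illustrative only; the point is that arrays of different lengths are rejected rather than padded.

```python
import numpy as np
import pandas as pd

# Equal-length inputs (an ndarray and a list): the default index is 0..n-1.
ok = pd.DataFrame({'one': np.array([4., 5., 6.]), 'two': [7., 6., 5.]})
default_index = list(ok.index)

# Inputs of different lengths raise a ValueError.
try:
    pd.DataFrame({'one': [4., 5., 6.], 'two': [7., 6.]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
```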
From structured or record array
data = np.zeros((2, ), dtype=[('P', 'i4'), ('Q', 'f4'), ('R', 'S10')])  # 'S10' is the current spelling of the 'a10' bytes type code
data[:] = [(2, 3., 'Best'), (3, 4., "Friend")]
pd.DataFrame(data)
pd.DataFrame(data, index=['first', 'second'])
pd.DataFrame(data, columns=['R', 'P', 'Q'])
From a list of dicts
data2 = [{'p': 2, 'q': 4}, {'p': 5, 'q': 10, 'r': 15}]
pd.DataFrame(data2)
pd.DataFrame(data2, index=['first', 'second'])
pd.DataFrame(data2, columns=['p', 'q'])
From a dict of tuples
You can automatically create a MultiIndexed frame by passing a dictionary whose keys are tuples.
pd.DataFrame({('p', 'q'): {('P', 'Q'): 2, ('P', 'R'): 1},
('p', 'p'): {('P', 'R'): 4, ('P', 'Q'): 3},
('p', 'r'): {('P', 'Q'): 6, ('P', 'R'): 5},
('q', 'p'): {('P', 'R'): 8, ('P', 'Q'): 7},
('q', 'q'): {('P', 'S'): 10, ('P', 'Q'): 9}})
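A shorter sketch of the same idea makes the resulting structure easier to inspect. The frame below is an illustrative subset of the example above: the outer tuple keys become a column MultiIndex and the inner tuple keys become a row MultiIndex.

```python
import pandas as pd

# Outer keys -> column labels, inner keys -> row labels.
frame = pd.DataFrame({('p', 'q'): {('P', 'Q'): 2, ('P', 'R'): 1},
                      ('p', 'r'): {('P', 'Q'): 6, ('P', 'R'): 5}})

columns_are_multi = isinstance(frame.columns, pd.MultiIndex)
index_is_multi = isinstance(frame.index, pd.MultiIndex)

# A single cell is addressed by one tuple per axis.
value = frame.loc[('P', 'R'), ('p', 'r')]
```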
Missing data
To construct a DataFrame with missing data, we use np.nan to represent missing values.
Alternatively, you may pass a numpy.ma.MaskedArray as the data argument to the DataFrame
constructor, and its masked entries will be considered missing.
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'c', 'd'],
columns=['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
df
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df2
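The masked-array route mentioned above can be sketched as follows. The values and column names are illustrative; the mask marks a single entry, which should show up as missing in the resulting frame.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

# Mask the entry at row 0, column 1 (True means "masked").
masked = ma.masked_array([[1., 2.], [3., 4.]],
                         mask=[[False, True], [False, False]])

frame = pd.DataFrame(masked, columns=['one', 'two'])

# Boolean frame marking which entries are missing.
missing = frame.isna()
```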
Alternate constructors
DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame.
It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default,
but which can be set to 'index' in order to use the dict keys as row labels.
pd.DataFrame.from_dict(dict([('P', [2, 3, 4]), ('Q', [5, 6, 7])]))
If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired
column names:
pd.DataFrame.from_dict(dict([('P', [2, 3, 4]), ('Q', [5, 6, 7])]),
orient='index', columns=['one', 'two', 'three'])
DataFrame.from_records
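DataFrame.from_records takes a list of tuples or an ndarray with structured dtype.
It works like the normal DataFrame constructor, except that the resulting index may be a specific field
of the structured dtype.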
data
pd.DataFrame.from_records(data, index='R')
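The list-of-tuples form can be sketched the same way. The records and column names below are illustrative; with plain tuples the field names have to be supplied through columns, and index then names the field to use as row labels.

```python
import pandas as pd

records = [(2, 3.0, 'Best'), (3, 4.0, 'Friend')]

# Name the fields, then promote field 'R' to the index.
frame = pd.DataFrame.from_records(records, columns=['P', 'Q', 'R'], index='R')
```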
Column selection, addition, deletion
df['one']
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df
Columns can be deleted or popped, as with a dict:
del df['two']
three = df.pop('three')
df
When inserting a scalar value, it will naturally be propagated to fill the column:
df['foo'] = 'bar'
df
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the
DataFrame’s index:
df['one_trunc'] = df['one'][:2]
df
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
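A minimal sketch of the length requirement, using an illustrative four-row frame: an ndarray of matching length is assigned by position, while a shorter one is rejected.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1., 2., 3., 4.]}, index=['a', 'b', 'c', 'd'])

# Same length as the index: values are assigned by position.
frame['squares'] = np.arange(4) ** 2

# Wrong length: pandas raises instead of padding or truncating.
try:
    frame['too_short'] = np.arange(3)
    length_error = None
except ValueError as exc:
    length_error = exc
```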
By default, columns get inserted at the end. The insert function is available to insert at a particular
location in the columns:
df.insert(1, 'bar', df['one'])
df
Assigning new columns in method chains
iris = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv')
iris.head()
(iris.assign(sepal_ratio=iris['sepal_width'] / iris['sepal_length'])
.head())
In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be
evaluated on the DataFrame being assigned to.
iris.assign(sepal_ratio=lambda x: (x['sepal_width'] / x['sepal_length'])).head()
assign always returns a copy of the data, leaving the original DataFrame untouched.
(iris.query('sepal_length > 4')
.assign(sepal_ratio=lambda x: x.sepal_width / x.sepal_length,
petal_ratio=lambda x: x.petal_width / x.petal_length)
.plot(kind='scatter', x='sepal_ratio', y='petal_ratio'))
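The copy semantics of assign can be checked without downloading the iris data. The small frame below is a hypothetical stand-in with the same column names; after the call, the original frame still has only its two columns.

```python
import pandas as pd

# Hypothetical stand-in for the iris data, so the sketch runs offline.
flowers = pd.DataFrame({'sepal_length': [5.1, 4.9, 6.3],
                        'sepal_width': [3.5, 3.0, 3.3]})

# The callable receives the frame being assigned to.
with_ratio = flowers.assign(
    sepal_ratio=lambda x: x['sepal_width'] / x['sepal_length'])

original_columns = list(flowers.columns)
new_columns = list(with_ratio.columns)
```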
Indexing / selection
The basics of indexing are as follows:
Operation                       Syntax         Result
Select column                   df[col]        Series
Select row by label             df.loc[label]  Series
Select row by integer location  df.iloc[loc]   Series
Slice rows                      df[5:10]       DataFrame
Select rows by boolean vector   df[bool_vec]   DataFrame
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
import numpy as np
import pandas as pd
d = {'one': pd.Series([2., 3., 4.], index=['p', 'q', 'r']),
'two': pd.Series([2., 3., 4., 5.], index=['p', 'q', 'r', 's'])}
df = pd.DataFrame(d)
df.loc['q']
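The result types of the basic indexing operations can be verified with a short sketch. The frame is illustrative and the variable names are not part of the pandas API.

```python
import pandas as pd

frame = pd.DataFrame({'one': [2., 3., 4.], 'two': [5., 6., 7.]},
                     index=['p', 'q', 'r'])

column = frame['one']                   # select column -> Series
row_by_label = frame.loc['q']           # select row by label -> Series
row_by_pos = frame.iloc[1]              # select row by integer location -> Series
row_slice = frame[0:2]                  # slice rows -> DataFrame
filtered = frame[frame['one'] > 2.5]    # boolean vector -> DataFrame
```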
For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section
on indexing. We will address the fundamentals of reindexing / conforming to new sets of labels in the
section on reindexing.
Data alignment and arithmetic
Operations between DataFrame objects automatically align on both the columns and the index
(row labels). Again, the resulting object will have the union of the column and row labels.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8, 4), columns=['P', 'Q', 'R', 'S'])
df2 = pd.DataFrame(np.random.randn(9, 3), columns=['P', 'Q', 'R'])
df + df2
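With random data the alignment is hard to see, so here is a deterministic sketch with illustrative values. Only the cells whose row and column labels exist in both frames get a number; everything else in the union is missing.

```python
import pandas as pd

left = pd.DataFrame({'P': [1., 2., 3.], 'Q': [4., 5., 6.]}, index=[0, 1, 2])
right = pd.DataFrame({'P': [10., 20.], 'R': [30., 40.]}, index=[1, 2])

# Columns become the union {P, Q, R}; rows become the union {0, 1, 2}.
total = left + right
```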
When doing an operation between DataFrame and Series, the default behavior is to align the Series index
on the DataFrame columns, thus broadcasting row-wise. For example:
df - df.iloc[0]
Recent pandas versions no longer special-case time series data: if the DataFrame index contains dates,
a plain df - df['X'] still aligns the Series index (the dates) with the column labels.
To broadcast column-wise, use the sub method with axis=0:
index = pd.date_range('1/1/2019', periods=6)
df = pd.DataFrame(np.random.randn(6, 3), index=index, columns=list('XYZ'))
df
type(df['X'])
df.sub(df['X'], axis=0)
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
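Explicit control over which axis a Series is matched against is available through the arithmetic methods such as sub. A minimal sketch with illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({'X': [1., 2., 3.], 'Y': [10., 20., 30.]},
                     index=pd.date_range('2019-01-01', periods=3))

# axis=0 (or 'index') matches the Series index against the row labels,
# so column X is subtracted from every column.
by_rows = frame.sub(frame['X'], axis=0)

# axis=1 (or 'columns', the default) matches against the column labels,
# so the first row is subtracted from every row.
by_columns = frame.sub(frame.iloc[0], axis=1)
```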
Operations with scalars are just as you would expect:
df * 4 + 2
1 / df
df ** 6
Boolean operators work as well:
df1 = pd.DataFrame({'x': [1, 0, 1], 'y': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'x': [0, 1, 1], 'y': [1, 1, 0]}, dtype=bool)
df1 & df2
df1 | df2
df1 ^ df2
-df1
To transpose, access the T attribute; here only the first 5 rows are transposed:
df[:5].T
DataFrame interoperability with NumPy functions
np.exp(df)
np.asarray(df)
pandas automatically aligns labeled inputs as part of a ufunc with multiple inputs.
For example, using numpy.remainder() on two Series with differently ordered labels will
align before the operation.
ser1 = pd.Series([2, 3, 4], index=['p', 'q', 'r'])
ser2 = pd.Series([3, 4, 5], index=['q', 'p', 'r'])
ser1
ser2
np.remainder(ser1, ser2)
As usual, the union of the two indices is taken, and non-overlapping values are filled with missing values.
ser3 = pd.Series([4, 6, 8], index=['q', 'r', 's'])
ser3
np.remainder(ser1, ser3)
When a binary ufunc is applied to a Series and Index, the Series implementation takes precedence and
a Series is returned.
ser = pd.Series([2, 3, 4])
idx = pd.Index([5, 6, 7])
np.maximum(ser, idx)
NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays.
If possible, the ufunc is applied without converting the underlying data to an ndarray.
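As a sketch of the non-ndarray case, the example below uses the nullable 'Int64' dtype, which is one choice of extension-array-backed Series and not the only one. The ufunc result keeps the extension dtype and the missing entry.

```python
import numpy as np
import pandas as pd

# 'Int64' is pandas' nullable integer dtype, stored in a masked
# extension array rather than a plain ndarray.
ser = pd.Series([-1, 2, None], dtype='Int64')

# np.abs is applied without first converting to a float ndarray.
result = np.abs(ser)
```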
Console display
Very large DataFrames will be truncated when displayed in the console.
baseball = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/baseball.csv')
print(baseball)
baseball.info()
However, using to_string will return a string representation of the DataFrame in tabular form, though
it won’t always fit the console width:
print(baseball.iloc[-20:, :10].to_string())
Wide DataFrames will be printed across multiple rows by default:
pd.DataFrame(np.random.randn(4, 10))
You can change how much to print on a single row by setting the display.width option:
pd.set_option('display.width', 30)
pd.DataFrame(np.random.randn(4, 10))
You can adjust the max width of the individual columns by setting display.max_colwidth:
datafile = {'filename': ['filename_01', 'filename_02'],
'path': ["media/user_name/storage/folder_01/filename_01",
"media/user_name/storage/folder_02/filename_02"]}
pd.set_option('display.max_colwidth', 40)
pd.DataFrame(datafile)
pd.set_option('display.max_colwidth', 100)
pd.DataFrame(datafile)
You can also disable the wrapping of wide DataFrames via the expand_frame_repr option. This will print the table in one block.
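A minimal sketch of the effect of expand_frame_repr, using pd.option_context so that the settings are restored afterwards instead of changed globally. The frame and the width of 30 are illustrative, and display.max_columns is set to None here as an assumption to make sure no columns are elided.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(40).reshape(4, 10))

# Wrapping enabled: the 10 columns do not fit in 30 characters,
# so the repr is split into several stacked chunks.
with pd.option_context('display.width', 30,
                       'display.max_columns', None,
                       'display.expand_frame_repr', True):
    wrapped = repr(wide)

# Wrapping disabled: the table is printed as one block.
with pd.option_context('display.width', 30,
                       'display.max_columns', None,
                       'display.expand_frame_repr', False):
    one_block = repr(wide)
```

Comparing len(wrapped.splitlines()) with len(one_block.splitlines()) shows the difference: the single block has one header line plus one line per row.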
DataFrame column attribute access and IPython completion
If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:
df = pd.DataFrame({'boo1': np.random.randn(4),
'boo2': np.random.randn(4)})
df
df.boo2
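Attribute access only works when the label is a valid identifier and does not clash with an existing DataFrame attribute or method. The sketch below uses illustrative column names to show both limits; bracket access works in every case.

```python
import pandas as pd

frame = pd.DataFrame({'boo1': [1, 2], 'count': [3, 4], 'my col': [5, 6]})

by_attribute = frame.boo1    # valid identifier, no clash: the column
shadowed = frame.count       # clashes with DataFrame.count: the method
by_key = frame['count']      # bracket access always returns the column
spaced = frame['my col']     # not a valid identifier: brackets required
```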