Pandas standard deviation [Complete Guide]: DataFrames, Series, and groupby with examples
In this tutorial, you will learn how to calculate standard deviation in pandas.
Pandas has a built-in std() method for this. You can calculate the standard deviation for an entire DataFrame or for a single column.
Standard Deviation on Dataframes:
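Syntax: DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Note: this is the signature from older pandas releases. In pandas 2.0 and later the level parameter no longer exists and numeric_only defaults to False.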
Parameters:
axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N – ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
# pandas standard deviation example
import pandas as pd

data = pd.DataFrame({
    'name': ['ravi', 'david', 'raju', 'david', 'kumar', 'teju'],
    'experience': [1, 2, 3, 4, 5, 2],
    'salary': [15000, 20000, 30000, 45389, 50000, 20000],
    'join_year': [2017, 2017, 2018, 2018, 2019, 2018],
})

# Standard deviation of every numeric column.
# numeric_only=True skips the text column 'name'; without it, pandas 2.0 and later raise a TypeError.
print(data.std(numeric_only=True))

# Standard deviation of a specific column
print(data['salary'].std())
Output:
experience        1.471960
salary        14572.550229
join_year         0.752773
dtype: float64
14572.550228654787
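The ddof parameter described above decides whether the divisor is N - 1 (sample) or N (population). The following is a minimal sketch of the difference, reusing two numeric columns from the example; the variable names are illustrative only, and the comparison with NumPy is added here as an assumption about what readers often want to check, not something the original example covers.

```python
import numpy as np
import pandas as pd

# Two numeric columns from the example above (the 'name' column is left out).
data = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5, 2],
    'salary': [15000, 20000, 30000, 45389, 50000, 20000],
})

sample_std = data.std(ddof=1)       # divisor N - 1, the pandas default
population_std = data.std(ddof=0)   # divisor N

# np.std defaults to ddof=0, so it matches the population figure,
# not the pandas default.
numpy_std = np.std(data['experience'].to_numpy())

print(sample_std)
print(population_std)
print(numpy_std)
```

If results from pandas and NumPy disagree on the same data, a differing ddof default is a likely cause.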
Standard Deviation on Series:
Syntax: Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters:
axis : {index (0)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N – ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns:
std : scalar or Series (if level specified)
import pandas as pd
d = pd.Series([1, 2, 3, 6])
# Calculate the standard deviation of a Series
print(d.std())
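To make the N - ddof divisor and the skipna parameter concrete, here is a small sketch that recomputes the result for the same Series by hand and then repeats the call on a Series with a missing value. The names manual_std and with_gap are made up for this illustration.

```python
import math
import pandas as pd

d = pd.Series([1, 2, 3, 6])

# Manual sample standard deviation: sqrt(sum((x - mean)^2) / (N - ddof)), with ddof=1
mean = d.sum() / len(d)
squared_deviations = [(x - mean) ** 2 for x in d]
manual_std = math.sqrt(sum(squared_deviations) / (len(d) - 1))

print(manual_std)
print(d.std())   # same value as the manual calculation

# skipna=True (the default) ignores missing values; skipna=False propagates them.
with_gap = pd.Series([1, 2, None, 3, 6])
print(with_gap.std())               # computed from the four non-missing values
print(with_gap.std(skipna=False))   # nan
```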
Rolling standard deviation:
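Here you will learn how to calculate a rolling standard deviation. The top-level pandas.rolling_std function described below belongs to old pandas releases; it was deprecated in pandas 0.18 and has since been removed. In current releases, call .rolling(window).std() on a Series or DataFrame instead.
Legacy syntax: pandas.rolling_std(arg, window, min_periods=None, freq=None, center=False, how=None, **kwargs)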
Parameters:
arg : Series, DataFrame
window : int
Size of the moving window. This is the number of observations used for calculating the statistic.
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
freq : string or DateOffset object, optional (default None)
Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
center : boolean, default False
Set the labels at the center of the window.
how : string, default None
Method for down- or re-sampling
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N – ddof, where N represents the number of elements.
Returns:
y : type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
import pandas as pd
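d = pd.Series([1, 5, 8, 4, 15, 6, 37, 8, 49])
# Calculate the rolling standard deviation over a window of 2 observations.
# pd.rolling_std(d, 2) no longer exists in current pandas; the .rolling() method replaces it.
print(d.rolling(2).std())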
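The window, min_periods, and center parameters described above are also available on the .rolling() method in current pandas. The sketch below shows their effect on the same Series; the window size of 3 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

d = pd.Series([1, 5, 8, 4, 15, 6, 37, 8, 49])

# Window of 3 observations. The first two results are NaN because
# min_periods defaults to the window size for a fixed integer window.
trailing = d.rolling(window=3).std()

# min_periods=2 produces a result as soon as two observations are available.
early = d.rolling(window=3, min_periods=2).std()

# center=True labels each result at the middle of its window
# instead of the right edge.
centered = d.rolling(window=3, center=True).std()

print(pd.DataFrame({'trailing': trailing, 'early': early, 'centered': centered}))
```

The centered column holds the same values as the trailing column, shifted up by one row.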
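Unbiased standard deviation and standard error of the mean:
std() already applies Bessel's correction by default (ddof=1), so it returns the sample standard deviation, normalized by N-1. The sem() function is a different statistic: it returns the standard error of the mean, which is the sample standard deviation divided by the square root of N.
pandas.DataFrame.sem(): Return unbiased standard error of the mean over requested axis.
Syntax: DataFrame.sem(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)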
import pandas as pd
d = pd.Series([1, 5, 8, 4, 15, 6, 37, 8, 49])
# Calculate the standard error of the mean
print(d.sem())
Output:
5.57219729694
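To see how sem() relates to std(), the following sketch checks on the same Series that the standard error of the mean equals the sample standard deviation divided by the square root of the number of observations. The variable names are illustrative only.

```python
import math
import pandas as pd

d = pd.Series([1, 5, 8, 4, 15, 6, 37, 8, 49])

sample_std = d.std()   # ddof=1 by default

# count() returns the number of non-missing observations.
sem_manual = sample_std / math.sqrt(d.count())
sem_pandas = d.sem()

print(sample_std)
print(sem_manual)
print(sem_pandas)
```

Both sem values agree with the output shown above, while the standard deviation itself is three times larger because the Series has nine observations.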
pandas standard deviation groupby:
We can calculate the standard deviation of each group by using the GroupBy.std function.
import pandas as pd
df=pd.DataFrame({'A':[3,4,3,4],'B':[4,3,3,4],'C':[1,2,2,1]})
#To calculate standard deviation by groupby
print(df.groupby(['A']).std())
Output:
          B         C
A
3  0.707107  0.707107
4  0.707107  0.707107
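Grouped standard deviation is often needed for just one column, or alongside other statistics. The sketch below reuses the same DataFrame to show a few common variations; the choice of column 'B' and the statistics passed to agg() are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 4, 3, 4], 'B': [4, 3, 3, 4], 'C': [1, 2, 2, 1]})

# Standard deviation of a single column per group.
b_std = df.groupby('A')['B'].std()

# Population standard deviation (divisor N) per group.
pop_std = df.groupby('A').std(ddof=0)

# Several statistics at once with agg().
summary = df.groupby('A')['B'].agg(['mean', 'std', 'count'])

print(b_std)
print(pop_std)
print(summary)
```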