Functions In Python
Every data scientist relies on a range of built-in and library functions in Python (here mostly from pandas, NumPy and Matplotlib) for different purposes.
In this story I have tried to collect those functions in one place so that it can serve as a quick reference.
unique() → This function gives unique values
nunique() → This function gives count of unique values in columns of a data frame
df.duplicated() → This function returns a boolean Series that marks which rows of a data frame are duplicates of an earlier row
df.drop_duplicates() → this function returns a data frame with all duplicate rows dropped
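A small sketch of the four functions above on a made-up data frame (the column names and values are assumptions for illustration). Note that the pandas method is spelled drop_duplicates, with an underscore.

```python
import pandas as pd

# Hypothetical data: the row (Alice, 30) appears twice
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Cara"],
    "age": [30, 25, 30, 25],
})

unique_ages = df["age"].unique()   # distinct values, in order of appearance
distinct_counts = df.nunique()     # number of distinct values per column
dup_mask = df.duplicated()         # True for rows that repeat an earlier row
deduped = df.drop_duplicates()     # new data frame without the repeated rows

print(unique_ages)
print(distinct_counts)
print(dup_mask)
print(deduped)
```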
head() → By default gives first 5 rows of a data frame
head(n) → Gives the first n rows of a data frame
tail() → By default gives the last 5 rows of a data frame
type() → data type of container. Container means list, array, tuple, series, dataframe
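dtype → data type of the elements inside a container, e.g. of a column (Series) or a NumPy array. It is an attribute, not a function, so it is written without parentheses; a data frame has dtypes, one per column
describe() → Gives summary statistics like count, mean, SD, min, max and quartile values (the 50% quartile is the median) of a data set, by default for only the numerical columns in the data frame.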
info() → Gives the number of rows & columns and their types in a data frame. Also gives the count of rows with data in every column, helping to identify which columns have null values.
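To make the difference between type, dtype, describe and info concrete, here is a minimal sketch on a hypothetical data frame (the columns city and sales are made up for this example).

```python
import pandas as pd

# Hypothetical data with one missing value in the "city" column
df = pd.DataFrame({"city": ["Pune", "Delhi", None], "sales": [10.0, 20.0, 30.0]})

container_type = type(df)          # type of the container itself
column_dtype = df["sales"].dtype   # dtype is an attribute, so no parentheses
all_dtypes = df.dtypes             # one dtype per column
summary = df.describe()            # numeric columns only by default
first_two = df.head(2)             # first 2 rows
last_row = df.tail(1)              # last row

df.info()                          # prints row count, non-null counts and dtypes
print(summary)
```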
mean() → this will give mean for a particular set of data (This can be used on a dataframe)
merge() → this is used to merge 2 data frames
sample(n) → this returns n randomly chosen rows of the dataframe.
value_counts() → this is like group by, we can fetch counts of all categories in a column
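A quick sketch of mean, merge, sample and value_counts together. The two data frames and the key column cust_id are assumptions made up for this example; random_state is only passed so that the sample is repeatable.

```python
import pandas as pd

# Hypothetical orders and customers tables sharing the key "cust_id"
orders = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [50, 20, 30, 10]})
customers = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["gold", "silver", "gold"]})

merged = orders.merge(customers, on="cust_id", how="left")  # join on the shared key
segment_counts = merged["segment"].value_counts()           # rows per category
picked = merged.sample(n=2, random_state=0)                 # 2 random rows
mean_amount = merged["amount"].mean()                       # mean of one column

print(merged)
print(segment_counts)
print(picked)
print(mean_amount)
```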
plot() → used for plotting graph
xlabel() → gives name for x-axis in your chart
ylabel() → gives name for y-axis in your chart
title() → gives title for your chart
savefig() → used for saving a chart figure on to your local disk
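The five charting functions above live in matplotlib.pyplot. Below is a minimal sketch that uses all of them; the data, the labels and the file name are made up, and the Agg backend is selected only so that the script also runs on a machine without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical data
quarters = [1, 2, 3, 4]
sales = [10, 20, 15, 30]

plt.plot(quarters, sales)
plt.xlabel("Quarter")
plt.ylabel("Sales")
plt.title("Sales per quarter")

# Save into the system temp directory (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), "sales_chart.png")
plt.savefig(out_path)
plt.close()
```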
np.empty((3,2)) → creates an uninitialized array of the desired shape, here of size 3*2. The shape is passed as a tuple, and the values in the array are arbitrary until you assign them
arange(10,25,5) → creates an array of 3 elements with a step size of 5 i.e., 10, 15, 20. In this array 25 will not be there as last element is excluded
linspace(10,25,5) → This generates 5 evenly spaced numbers (not random ones) between 10 & 25, i.e., 10, 13.75, 17.5, 21.25, 25. Unlike range or arange, the stop element 25 is included by default as the last number.
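The three array constructors above are easy to mix up, so here is a short sketch of each. Note that np.empty takes the shape as a tuple and that linspace returns evenly spaced values, not random ones.

```python
import numpy as np

buf = np.empty((3, 2))           # 3*2 array; contents are arbitrary until assigned
steps = np.arange(10, 25, 5)     # step size 5, the stop value 25 is excluded
points = np.linspace(10, 25, 5)  # 5 evenly spaced values, the stop value 25 is included

print(buf.shape)
print(steps)
print(points)
```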
astype() → converts data type from one format to another i.e., astype(int) converts a different type to int type
corrcoef() → np.corrcoef() returns the matrix of Pearson correlation coefficients
std() → Standard deviation
median() → Mathematical Median
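A small sketch of astype, corrcoef, std and median on made-up NumPy arrays. Keep in mind that the NumPy function is spelled corrcoef and returns a matrix, so the coefficient between x and y sits at position [0, 1].

```python
import numpy as np

# Hypothetical data: y is exactly 2 * x, so the correlation is 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

as_int = np.array([1.7, 2.2]).astype(int)  # the fractional part is dropped
corr_matrix = np.corrcoef(x, y)            # 2*2 correlation matrix
corr_xy = corr_matrix[0, 1]                # coefficient between x and y
spread = x.std()                           # population standard deviation by default
middle = np.median(x)                      # median of the values

print(as_int, corr_xy, spread, middle)
```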
view() → creates a new array object that shares the data of the original array, so any change to the view affects the original array
sort() → sorts an array
copy() → creates an independent copy of the array with its own data, so changes made to the new array won’t affect the original array
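The difference between view and copy is easiest to see by writing into both. A minimal sketch with a made-up array:

```python
import numpy as np

original = np.array([1, 2, 3])
v = original.view()   # shares memory with original
c = original.copy()   # has its own memory

v[0] = 99             # this write shows up in original
c[1] = -1             # this write does not

print(original)
print(c)
```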
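resize() → changes the shape and size of an array. np.resize(a, new_shape) returns a new array and repeats the elements of a if the new size is larger, while a.resize(new_shape) changes the array in place and pads with zeros. To change only the shape, use reshape()
append() → np.append() returns a new array with the items added at the end; the original array is not changed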
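Here is a short sketch of the function forms np.resize and np.append next to reshape, on a made-up array. Both functions return new arrays and leave the input as it is.

```python
import numpy as np

a = np.arange(6)                   # [0 1 2 3 4 5]

reshaped = a.reshape(2, 3)         # same 6 elements, new shape
resized = np.resize(a, (2, 4))     # 8 slots, so elements of a are repeated
appended = np.append(a, [6, 7])    # new array with the items added at the end

print(reshaped)
print(resized)
print(appended)
```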
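title()/capitalize() → In a series these are used through the str accessor. series.str.title() converts the first letter of every word to upper case (Init Caps), while series.str.capitalize() converts only the first character of each element to upper case and the rest to lower case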
np.diff() → calculates n-th discrete difference along axis. It calculates arr[i+1]-arr[i]
np.sign() → fetches the sign of elements
np.where() → returns the indices of elements in an input array where the given condition is satisfied
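A small sketch chaining np.diff, np.sign and np.where on a made-up array: first the differences between neighbours, then their signs, then the positions where the value went up.

```python
import numpy as np

arr = np.array([3, 7, 4, 4, 10])     # hypothetical values

steps = np.diff(arr)                 # arr[i+1] - arr[i]
directions = np.sign(steps)          # -1, 0 or 1 for each step
rising_idx = np.where(steps > 0)[0]  # indices where the condition holds

print(steps)
print(directions)
print(rising_idx)
```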
np.is_busday() → tells which of the given dates are business days. Eg: np.is_busday(tmp) returns a boolean array with True for every date inside tmp that is a business day (Monday to Friday by default)
np.busday_count() → returns the number of business days between a begin date and an end date; the end date itself is not counted
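A minimal sketch of the two business-day functions with made-up dates. It assumes the default week mask (Monday to Friday) and no holidays; 1 January 2024 was a Monday.

```python
import numpy as np

# Hypothetical dates: a Friday, a Saturday and a Monday
tmp = np.array(["2024-01-05", "2024-01-06", "2024-01-08"], dtype="datetime64[D]")

is_business = np.is_busday(tmp)                          # one True/False per date
n_business = np.busday_count("2024-01-01", "2024-01-08") # end date is not counted

print(is_business)
print(n_business)
```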
np.pad() → does padding of rows and columns around an existing array
np.inner() → computes the inner product of 2 arrays
eg: for 1-D arrays, np.inner(a,b)=sum(a[:] * b[:])
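The identity above can be checked directly for 1-D arrays. A quick sketch with made-up vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

inner_ab = np.inner(a, b)    # 1*4 + 2*5 + 3*6
manual = (a * b).sum()       # the same sum written out by hand

print(inner_ab, manual)
```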
series.str.contains() → used to check, element by element, if a particular string (or regex pattern) is present in the series
df['columnname'].pct_change() → this calculates the fractional change between each element and the previous one (multiply by 100 for a percentage)
retail_df = retail_df.loc[retail_df["Invoice No"].str.startswith('C', na=False)] → the startswith function is used here to fetch the records whose value in a particular column starts with the given string. na=False treats missing values as not matching.
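A small sketch of the three string and change functions above. The invoice numbers and the Price column are made up for illustration; the pandas method is spelled str.contains, and pct_change returns fractions, so 0.1 means 10%.

```python
import pandas as pd

# Hypothetical retail data with one missing invoice number
retail_df = pd.DataFrame({
    "Invoice No": ["C1001", "5002", None, "C1003"],
    "Price": [100.0, 110.0, 99.0, 99.0],
})

# Rows whose invoice number starts with "C"; missing values count as no match
cancelled = retail_df.loc[retail_df["Invoice No"].str.startswith("C", na=False)]

# True where the invoice number contains "100"
has_100 = retail_df["Invoice No"].str.contains("100", na=False)

# Fractional change from the previous row; the first row has nothing before it
change = retail_df["Price"].pct_change()

print(cancelled)
print(has_100)
print(change)
```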
df1 = df.copy(deep=True) → this copies the data and structure of dataframe df to a new dataframe df1
df.style.highlight_max() → this function highlights the max value of each column when the dataframe is displayed
series.str.extract() → this function is used to extract data from a string column in a dataframe using the capture groups of a regular expression.
Let us say this is your data
Now, to extract data from this column of dataframe we’ll use extract as shown below
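A minimal sketch of str.extract, using made-up data (the column name code and the pattern are assumptions for illustration, not the data from the original example). The part of the pattern inside the parentheses is what gets extracted; rows that do not match give a missing value.

```python
import pandas as pd

# Hypothetical column with an order number embedded in a string
df = pd.DataFrame({"code": ["ORD-123", "ORD-456", "misc"]})

# Capture the digits after "ORD-"; expand=False returns a Series for one group
df["order_no"] = df["code"].str.extract(r"ORD-(\d+)", expand=False)

print(df)
```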
melt() → this function is used to change the DataFrame format from wide to long. One or more columns work as identifiers. All the remaining columns are unpivoted to the row axis, leaving only two non-identifier columns: variable (the original column name) and value.
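Wide to long is easier to see on a tiny example. In this sketch the student column is the identifier and the subject columns are unpivoted; all names and marks are made up.

```python
import pandas as pd

# Hypothetical wide table: one column per subject
wide = pd.DataFrame({
    "student": ["A", "B"],
    "math": [90, 80],
    "physics": [85, 75],
})

# "student" stays as identifier; the other columns become variable/value pairs
long_df = wide.melt(id_vars="student")

print(long_df)
```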
site.getsitepackages() → this function tells you the path where third-party packages are installed on your machine. For this we’ve to import site first
plt.legend(loc="upper right") → this function places the legend at the upper right corner of your plot.
np.random.rand(5,5) → Generates a 5*5 matrix of random numbers drawn uniformly from [0, 1)
np.random.randn(5,5) → Generates a 5*5 matrix of random numbers drawn from the standard normal distribution (mean 0, SD 1)
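A short sketch of the two generators. The seed value is an arbitrary choice made here so that repeated runs give the same numbers; rand draws from a uniform distribution and randn from the standard normal distribution.

```python
import numpy as np

np.random.seed(42)                 # arbitrary seed, only for repeatable output

uniform = np.random.rand(5, 5)     # values in the range [0, 1)
normal = np.random.randn(5, 5)     # values centred on 0, can be negative

print(uniform)
print(normal)
```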
NOTE: Will keep updating this story with few more functions.