Time Series Analysis Cheat Sheet


Threat Modeling Cheat Sheet: Introduction. Threat modeling is a structured approach to identifying and prioritizing potential threats to a system, and determining the value that potential mitigations would have in reducing or neutralizing those threats.

This lubridate cheatsheet covers how to round dates, work with time zones, extract elements of a date or time, parse dates into R and more. The back of the cheatsheet describes lubridate's three timespan classes: periods, durations, and intervals; and explains how to do math with date-times. Updated December 17.

This post updates a previous very popular post 50+ Data Science, Machine Learning Cheat Sheets by Bhavya Geethika. If we missed some popular cheat sheets, add them in the comments below.

Cheatsheets on Python, R and Numpy, Scipy, Pandas

Data science is a multi-disciplinary field, and the data science world offers thousands of packages and hundreds of programming functions. An aspiring data enthusiast need not know them all. A cheat sheet or reference card is a compilation of the most commonly used commands that helps you learn a language's syntax faster. Here are the most important ones, each condensed into a few compact pages.

Mastering data science involves understanding statistics and mathematics, having programming knowledge (especially in R, Python, and SQL), and then combining all of these with business understanding and human instinct to derive the insights that drive decisions.

Here are the cheat sheets by category:

Cheat sheets for Python:

Python is a popular choice for beginners, yet still powerful enough to back some of the world's most popular products and applications. Its design makes the programming experience feel almost as natural as writing in English. The Python basics and Python Debugger cheat sheets for beginners cover the important syntax you need to get started. Community-provided libraries such as NumPy, SciPy, scikit-learn and pandas are heavily relied on, and the NumPy/SciPy/Pandas Cheat Sheet provides a quick refresher on these.

  1. Python Cheat Sheet by DaveChild via cheatography.com
  2. Python Basics Reference sheet via cogsci.rpi.edu
  3. OverAPI.com Python cheatsheet
  4. Python 3 Cheat Sheet by Laurent Pointal

Cheat sheets for R:

The R ecosystem has expanded so much that a lot of referencing is needed. The R Reference Card covers most of the R world in a few pages. RStudio has also published a series of cheat sheets to make things easier for the R community. The data visualization with ggplot2 cheat sheet seems to be a favorite, as it helps when you are creating graphs of your results.

At cran.r-project.org:

At Rstudio.com:

  1. R markdown cheatsheet, part 2

Others:

  1. DataCamp's Data Analysis the data.table way

Cheat sheets for MySQL & SQL:

For a data scientist, the basics of SQL are as important as any other language. Both Pig and Hive Query Language are closely associated with SQL, the original Structured Query Language. SQL cheat sheets provide a five-minute quick guide to learning it, after which you can explore Hive and MySQL.

  1. SQL for dummies cheat sheet

Cheat sheets for Spark, Scala, Java:

Apache Spark is an engine for large-scale data processing. For certain applications, such as iterative machine learning, Spark can be up to 100x faster than Hadoop (using MapReduce). The essentials of Apache Spark cheatsheet explains its place in the big data ecosystem, walks through setup and creation of a basic Spark application, and explains commonly used actions and operations.

  1. DZone.com's Apache Spark reference card
  2. DZone.com's Scala reference card
  3. Openkd.info's Scala on Spark cheat sheet
  4. Java cheat sheet at MIT.edu
  5. Cheat Sheets for Java at Princeton.edu

Cheat sheets for Hadoop & Hive:

Hadoop emerged as an unconventional tool for solving what was thought to be unsolvable, providing an open source software framework for the parallel processing of massive amounts of data. Explore the Hadoop cheat sheets to find useful commands for working with Hadoop on the command line. A combination of SQL and Hive functions is another one to check out.

Cheat sheets for web application framework Django:

Django is a free and open source web application framework, written in Python. If you are new to Django, you can go over these cheat sheets to pick up the key concepts quickly and then dive deeper into each one.

  1. Django cheat sheet part 1, part 2, part 3, part 4

Cheat sheets for Machine learning:

We often find ourselves spending time thinking about which algorithm is best, and then going back to our big books for reference. These cheat sheets give an idea of both the nature of your data and the problem you're working to address, and then suggest an algorithm for you to try.

  1. Machine Learning cheat sheet at scikit-learn.org
  2. Scikit-Learn Cheat Sheet: Python Machine Learning from yhat (added by GP)
  3. Patterns for Predictive Learning cheat sheet at Dzone.com
  4. Equations and tricks Machine Learning cheat sheet at Github.com
  5. Supervised learning superstitions cheatsheet at Github.com

Cheat sheets for Matlab/Octave

MATLAB (MATrix LABoratory) has been sold commercially by MathWorks since 1984. MATLAB has long been one of the most popular languages for numeric computation in academia. It is suitable for a wide range of science and engineering tasks, with several highly optimized toolboxes. MATLAB is not open source; however, GNU Octave is a free alternative re-implementation that follows largely the same syntactic rules, so most code is compatible with MATLAB.

Cheat sheets for Cross Reference between languages

Pandas is one of the most important Python packages for data science. It provides many functions that make working with data easier. Its fast, flexible, and expressive data structures are designed to make real-world data analysis straightforward.

The Pandas Cheat Sheet is a quick guide to the basics of pandas that you will need to get started on wrangling your data with Python. If you want to begin your data science journey with pandas, you can use it as a handy reference.

This cheat sheet will guide you through the basics of the pandas library, from the data structures to I/O, selection, sorting and ranking, and so on.

Key and Imports

We use the following shorthand in the cheat sheet:

  • df: Refers to any pandas DataFrame object.
  • s: Refers to any pandas Series object.

You can use the following imports to get started:

  • import pandas as pd
  • import numpy as np

Importing Data

  • pd.read_csv(filename): Reads data from a CSV file.
  • pd.read_table(filename): Reads data from a delimited text file.
  • pd.read_excel(filename): Reads data from an Excel file.
  • pd.read_sql(query, connection_object): Reads data from a SQL table or database.
  • pd.read_json(json_string): Reads data from a JSON-formatted string, URL, or file.
  • pd.read_html(url): Parses an HTML URL, string, or file and extracts its tables into a list of DataFrames.
  • pd.read_clipboard(): Takes the contents of the clipboard and passes them to the read_table() function.
  • pd.DataFrame(dict): Creates a DataFrame from a dict, using the keys as column names and the values (lists) as the column data.
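
The readers above can be tried without any files on disk. The following is a minimal sketch, assuming hypothetical sample data (the city and temp columns and the readings table are invented for illustration) and using in-memory stand-ins for a CSV file, a JSON document, and a SQL database:

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "date,city,temp\n2024-01-01,Oslo,-3.5\n2024-01-02,Oslo,-1.0\n"
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Hypothetical JSON records, wrapped in StringIO so that read_json
# receives a file-like object rather than a raw string.
json_text = '[{"city": "Oslo", "temp": -3.5}, {"city": "Rome", "temp": 12.0}]'
df_json = pd.read_json(io.StringIO(json_text))

# An in-memory SQLite database stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)", [("Oslo", -3.5), ("Rome", 12.0)]
)
df_sql = pd.read_sql("SELECT city, temp FROM readings", conn)
conn.close()

# Build a DataFrame directly from a dict of column name -> list of values.
df_dict = pd.DataFrame({"city": ["Oslo", "Rome"], "temp": [-3.5, 12.0]})
```

Passing parse_dates to read_csv is optional, but it is usually the first step when the data is a time series, because it turns the date column into real timestamps.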

Exporting data

  • df.to_csv(filename): Writes to a CSV file.
  • df.to_excel(filename): Writes to an Excel file.
  • df.to_sql(table_name, connection_object): Writes to a SQL table.
  • df.to_json(filename): Writes to a file in JSON format.
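
As a rough illustration of the writers above, the sketch below exports a small, made-up DataFrame. It assumes an in-memory SQLite database in place of a real server and a hypothetical table name, readings. It relies on to_csv and to_json returning a string when no path is given, which is handy for quick inspection:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Rome"], "temp": [-3.5, 12.0]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# Likewise, to_json returns a string when no path is given.
json_text = df.to_json(orient="records")

# Write to a table in an in-memory SQLite database, then read it back.
conn = sqlite3.connect(":memory:")
df.to_sql("readings", conn, index=False)
round_trip = pd.read_sql("SELECT * FROM readings", conn)
conn.close()
```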

Create Test objects

These objects are useful for testing code segments.

  • pd.DataFrame(np.random.rand(7,18)): Creates a DataFrame with 7 rows and 18 columns of random floats.
  • pd.Series(my_list): Creates a Series from an iterable my_list.
  • df.index = pd.date_range('1940/1/20', periods=df.shape[0]): Adds a date index with one entry per row.
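
Put together, the three calls above produce a small daily time series to experiment with. This is a sketch only: it swaps np.random.rand for a seeded NumPy generator so that the numbers are reproducible, and my_list is an invented example list.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the numbers are reproducible

# 7 rows x 18 columns of random floats in [0, 1).
df = pd.DataFrame(rng.random((7, 18)))

# A Series built from an ordinary Python list.
my_list = [3, 1, 4, 1, 5]
s = pd.Series(my_list)

# Replace the default 0..6 index with one calendar day per row.
df.index = pd.date_range("1940-01-20", periods=df.shape[0])
```

Because periods is taken from df.shape[0], the index always has exactly as many dates as the DataFrame has rows.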

Viewing/Inspecting Data

  • df.head(n): Returns the first n rows of the DataFrame.
  • df.tail(n): Returns the last n rows of the DataFrame.
  • df.shape: Returns the number of rows and columns.
  • df.info(): Prints the index, data types, and memory information.
  • s.value_counts(dropna=False): Returns the unique values and their counts, including missing values.
  • df.apply(pd.Series.value_counts): Returns the unique values and their counts for every column.
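
A short sketch of these inspection calls on a made-up DataFrame with one missing row (the column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", None],
    "temp": [-3.5, 12.0, -1.0, np.nan],
})

first_two = df.head(2)
last_one = df.tail(1)
n_rows, n_cols = df.shape

# dropna=False keeps missing values as their own category in the counts.
city_counts = df["city"].value_counts(dropna=False)

df.info()  # prints a summary and returns None
```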

Selection

  • df[col1]: Returns the column with label col1 as a Series.
  • df[[col1, col2]]: Returns the columns as a new DataFrame.
  • s.iloc[0]: Selects by position.
  • s.loc['index_one']: Selects by index label.
  • df.iloc[0,:]: Returns the first row.
  • df.iloc[0,0]: Returns the first element of the first column.
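
The difference between label-based and position-based selection is easiest to see with a non-numeric index. A minimal sketch, assuming an invented three-row DataFrame indexed by the labels a, b, and c:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Lima"], "temp": [-3.5, 12.0, 19.0]},
    index=["a", "b", "c"],
)

temps = df["temp"]             # single column -> Series
subset = df[["city", "temp"]]  # list of columns -> DataFrame

by_position = temps.iloc[0]    # first element by integer position
by_label = temps.loc["b"]      # element whose index label is "b"

first_row = df.iloc[0, :]      # first row as a Series
first_cell = df.iloc[0, 0]     # first row, first column
```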

Data cleaning

  • df.columns = ['a','b','c']: Renames the columns.
  • pd.isnull(df): Checks for null values and returns a Boolean array.
  • pd.notnull(df): The opposite of pd.isnull().
  • df.dropna(): Drops all rows that contain null values.
  • df.dropna(axis=1): Drops all columns that contain null values.
  • df.dropna(axis=1,thresh=n): Drops all columns that have fewer than n non-null values (use axis=0 to apply the threshold to rows).
  • df.fillna(x): Replaces all null values with x.
  • s.fillna(s.mean()): Replaces all null values with the mean (the mean can be replaced with almost any function from the Statistics section).
  • s.astype(float): Converts the datatype of the Series to float.
  • s.replace(1, 'one'): Replaces all values equal to 1 with 'one'.
  • s.replace([1,3], ['one','three']): Replaces all 1 with 'one' and all 3 with 'three'.
  • df.rename(columns=lambda x: x+1): Mass renaming of columns (this particular lambda assumes numeric column labels).
  • df.rename(columns={'old_name': 'new_name'}): Selective renaming.
  • df.set_index('column_one'): Returns a DataFrame that uses column_one as the index.
  • df.rename(index=lambda x: x+1): Mass renaming of the index.
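
The axis and thresh arguments of dropna are easy to mix up, so here is a small sketch on invented data. The column names and values are assumptions chosen so that each call keeps something different:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

rows_complete = df.dropna()                  # keeps only the last row
cols_complete = df.dropna(axis=1)            # keeps only column "c"
cols_thresh = df.dropna(axis=1, thresh=2)    # keeps "a" and "c"

filled = df["a"].fillna(df["a"].mean())      # NaN -> mean of 1.0 and 3.0
renamed = df.rename(columns={"a": "alpha"})  # selective rename
upper = df.rename(columns=str.upper)         # rename every column

s = pd.Series([1, 3, 5])
words = s.replace([1, 3], ["one", "three"])
```

None of these calls modify df or s in place; each returns a new object, which is why the results are assigned to new names.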

Filter, Sort, and Groupby

  • df[df[col] > 0.5]: Returns the rows where column col is greater than 0.5.
  • df[(df[col] > 0.5) & (df[col] < 0.7)]: Returns the rows where 0.7 > col > 0.5.
  • df.sort_values(col1): Sorts the values by col1 in ascending order.
  • df.sort_values(col2,ascending=False): Sorts the values by col2 in descending order.
  • df.sort_values([col1,col2],ascending=[True,False]): Sorts the values by col1 in ascending order, then by col2 in descending order.
  • df.groupby(col1): Returns a groupby object for the values from one column.
  • df.groupby([col1,col2]): Returns a groupby object for the values from multiple columns.
  • df.groupby(col1)[col2].mean(): Returns the mean of the values in col2, grouped by the values in col1.
  • df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean'): Creates a pivot table that groups by col1 and calculates the mean of col2 and col3.
  • df.groupby(col1).agg(np.mean): Calculates the average across all the columns for every unique col1 group.
  • df.apply(np.mean): Applies the function np.mean() to each column.
  • df.apply(np.max,axis=1): Applies the function np.max() to each row.
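
The filter, sort, and group operations above can be sketched on a small invented dataset as follows. The city, year, and temp columns are assumptions, and the row-wise maximum uses a lambda rather than np.max purely to keep the example free of NumPy:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
    "year": [2023, 2023, 2024, 2024],
    "temp": [0.25, 0.6, 0.75, 0.65],
})

# Boolean masks must be combined with & and each wrapped in parentheses.
in_band = df[(df["temp"] > 0.5) & (df["temp"] < 0.7)]

# Sort by city ascending, then temp descending within each city.
ordered = df.sort_values(["city", "temp"], ascending=[True, False])

# Mean temp per city.
mean_by_city = df.groupby("city")["temp"].mean()

# The same aggregation expressed as a pivot table.
pivot = df.pivot_table(index="city", values="temp", aggfunc="mean")

# Row-wise maximum over the numeric columns.
row_max = df[["year", "temp"]].apply(lambda row: row.max(), axis=1)
```

The groupby result is a Series indexed by city, while pivot_table returns a DataFrame with the same numbers; which one to use is mostly a matter of how the result will be consumed.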

Join/Combine

  • df1.append(df2): Adds the rows of df2 to the end of df1 (columns should be identical). DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2]) instead.
  • pd.concat([df1, df2], axis=1): Adds the columns of df2 alongside the columns of df1 (the row indexes should be identical).
  • df1.join(df2,on=col1,how='inner'): SQL-style join of the columns in df1 with the columns in df2, matching the values in df1's col1 against the index of df2. 'how' can be one of 'left', 'right', 'outer', 'inner'.
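
A minimal sketch of the three ways of combining DataFrames, using invented frames (extra and countries are hypothetical helper tables). It uses pd.concat for row stacking, since that works on both old and new pandas versions:

```python
import pandas as pd

df1 = pd.DataFrame({"city": ["Oslo", "Rome"], "temp": [-3.5, 12.0]})
df2 = pd.DataFrame({"city": ["Lima"], "temp": [19.0]})

# Stack rows: df2's rows follow df1's; ignore_index renumbers 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side; rows are aligned on the index.
extra = pd.DataFrame({"humidity": [80, 60]})
side_by_side = pd.concat([df1, extra], axis=1)

# join matches df1's "city" column against the index of the other frame.
countries = pd.DataFrame(
    {"country": ["Norway", "Peru"]}, index=["Oslo", "Lima"]
)
joined = df1.join(countries, on="city", how="inner")
```

With how='inner', only Oslo survives the join, because Rome has no entry in the index of countries.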

Statistics

The following statistics functions are shown for a DataFrame; most can also be applied to a Series:

  • df.describe(): Returns summary statistics for the numerical columns.
  • df.mean(): Returns the mean of each column.
  • df.corr(): Returns the correlation between the columns in the DataFrame.
  • df.count(): Returns the count of non-null values in each DataFrame column.
  • df.max(): Returns the highest value in each column.
  • df.min(): Returns the lowest value in each column.
  • df.median(): Returns the median of each column.
  • df.std(): Returns the standard deviation of each column.
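
As a quick sketch of the statistics calls, the example below uses two invented, perfectly correlated columns so that the results are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

summary = df.describe()   # count, mean, std, min, quartiles, max
means = df.mean()
medians = df.median()
stds = df.std()           # sample standard deviation (ddof=1)
corr = df.corr()          # Pearson correlation matrix
```

Each of means, medians, and stds is a Series indexed by column name, while describe() and corr() return DataFrames.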
