CheatSheet For Data Analytics

🚀 Python Cheat Sheet

### Basic Syntax & Data Structures

  • Variables: `x = 5`, `name = "John"`

  • Data Types: Integer, Float, String, Boolean

  • Lists: `my_list = [1, 2, 3]`, `my_list.append(4)`, `my_list[0]`, `len(my_list)`

  • Dictionaries: `my_dict = {'key': 'value'}`, `my_dict['key']`, `my_dict.keys()`, `my_dict.values()`

  • Tuples: `my_tuple = (1, 2, 3)`, immutable

  • Sets: `my_set = {1, 2, 3}`, `my_set.add(4)`, unique elements

### Control Flow

  • If-Else: `if condition: ... elif condition: ... else: ...`

  • Loops: `for item in iterable: ...`, `while condition: ...`

### Functions & Lambda Expressions

  • Defining Functions: `def my_function(): ...`

  • Lambda Functions: `lambda arguments: expression`, e.g., `lambda x: x * 2`

### Pandas for Data Manipulation

  • Creating DataFrames: `import pandas as pd`, `df = pd.DataFrame(data)`

  • Reading Data: `pd.read_csv('file.csv')`, `pd.read_excel('file.xlsx')`

  • Data Selection: `df['column']`, `df[['col1', 'col2']]`, `df.iloc[rows, columns]`

  • Filtering: `df[df['column'] > value]`

  • Grouping: `df.groupby('column').agg({'column2': 'sum'})`

  • Pivot Tables: `df.pivot_table(index='column', values='column2', aggfunc='mean')`

  • Merging & Joining: `pd.merge(df1, df2, on='column', how='inner')`

### NumPy for Numerical Computing

  • Arrays: `import numpy as np`, `np.array([1, 2, 3])`

  • Array Operations: Element-wise operations, `+`, `-`, `*`, `/`

  • Statistics: `np.mean(array)`, `np.median(array)`, `np.std(array)`

### Matplotlib & Seaborn for Data Visualization

  • Matplotlib Basics: `import matplotlib.pyplot as plt`, `plt.plot(x, y)`, `plt.show()`

  • Seaborn: `import seaborn as sns`, `sns.lineplot(x='x_column', y='y_column', data=df)`

  • Plot Customization: Title, labels, legends

### Scikit-Learn for Machine Learning

  • Preparing Data: `from sklearn.model_selection import train_test_split`, `X_train, X_test, y_train, y_test = train_test_split(X, y)`

  • Model Training: `from sklearn.linear_model import LinearRegression`, `model = LinearRegression()`, `model.fit(X_train, y_train)`

  • Prediction: `predictions = model.predict(X_test)`

  • Evaluation: `from sklearn.metrics import mean_squared_error`, `mean_squared_error(y_test, predictions)`

### General Best Practices

  • Code Organization: Functions, classes, and modules for reusable code

  • Version Control: Use Git for version control and collaboration

  • Virtual Environments: Use `venv` or `conda` for managing project-specific dependencies

### Additional Libraries

  • Scrapy for web scraping, BeautifulSoup for HTML parsing

  • SQLAlchemy for database interaction

  • Plotly and Dash for interactive visualizations and web applications

------- end -------

### Basic SQL Syntax

  • SELECT: `SELECT column1, column2 FROM table_name;` - Retrieves specific columns from a table.

  • SELECT DISTINCT: `SELECT DISTINCT column1 FROM table_name;` - Returns unique values in the specified column.

  • WHERE: `SELECT * FROM table_name WHERE condition;` - Filters the result set to only include rows where the condition is TRUE.

  • ORDER BY: `SELECT * FROM table_name ORDER BY column1 ASC|DESC;` - Sorts the result set in ascending or descending order.

### Aggregation Functions

  • COUNT: `SELECT COUNT(column_name) FROM table_name WHERE condition;` - Returns the number of input rows that match a specific condition.

  • SUM: `SELECT SUM(column_name) FROM table_name WHERE condition;` - Returns the total sum of a numeric column.

  • AVG: `SELECT AVG(column_name) FROM table_name WHERE condition;` - Returns the average value of a numeric column.

  • MIN/MAX: `SELECT MIN(column_name) FROM table_name;` / `SELECT MAX(column_name) FROM table_name;` - Returns the smallest/largest value of the selected column.

### Joining Tables

  • INNER JOIN: `SELECT columns FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name;` - Selects records that have matching values in both tables.

  • LEFT JOIN (LEFT OUTER JOIN): `SELECT columns FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name;` - Returns all records from the left table, and the matched records from the right table.

  • RIGHT JOIN (RIGHT OUTER JOIN): `SELECT columns FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name;` - Returns all records from the right table, and the matched records from the left table.

  • FULL JOIN (FULL OUTER JOIN): `SELECT columns FROM table1 FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;` - Returns all records when there is a match in either left or right table.

### Subqueries and CTEs

  • Subquery: `SELECT column_name FROM table_name WHERE column_name IN (SELECT column_name FROM table_name WHERE condition);` - A query nested inside another query.

  • Common Table Expressions (CTE): `WITH cte_name AS (SELECT column_name FROM table_name) SELECT * FROM cte_name;` Temporary result set that you can reference within another SELECT, INSERT, UPDATE, or DELETE statement.

### Window Functions

  • ROW_NUMBER: `SELECT ROW_NUMBER() OVER (ORDER BY column_name) FROM table_name;` - Assigns a unique sequential integer to rows within a result set.

  • RANK / DENSE_RANK: `SELECT RANK() OVER (ORDER BY column_name) FROM table_name;` - Assigns a rank to each row within a partition of a result set.

  • LEAD/LAG: `SELECT LEAD(column_name) OVER (ORDER BY column_name) FROM table_name;` - Accesses data from a subsequent/preceding row in the same result set.

### Data Manipulation

  • INSERT INTO: `INSERT INTO table_name (column1, column2) VALUES (value1, value2);` - Inserts new rows into a table.

  • UPDATE: `UPDATE table_name SET column1 = value1 WHERE condition;` - Updates existing records in a table.

  • DELETE: `DELETE FROM table_name WHERE condition;` - Deletes existing records in a table.

### Data Definition and Control

  • CREATE TABLE: `CREATE TABLE table_name (column1 datatype, column2 datatype);` - Creates a new table.

  • ALTER TABLE: `ALTER TABLE table_name ADD column_name datatype;` - Adds a new column

------- end -------

🚀 SQL Cheat Sheet

### Basic Formulas

  • SUM: `=SUM(A1:A10)` - Adds all numbers in a range of cells.

  • AVERAGE: `=AVERAGE(B1:B10)` - Calculates the average of a range of cells.

  • MIN/MAX: `=MIN(C1:C10)`, `=MAX(C1:C10)` - Finds the minimum/maximum value in a range.

### Lookup & Reference

  • VLOOKUP: `=VLOOKUP(value, table_range, col_index_num, [range_lookup])` - Searches for a value in the first column of a table and returns a value in the same row from a specified column.

  • INDEX: `=INDEX(array, row_num, [col_num])` - Returns the value of a cell in a specific row and column within a range.

  • MATCH: `=MATCH(lookup_value, lookup_array, [match_type])` - Searches for a specified item in a range of cells, and then returns the relative position of that item.

### Text Functions

  • CONCATENATE / CONCAT: `=CONCATENATE(A1, " ", B1)`, `=CONCAT(A1, " ", B1)` - Combines two or more strings into one string.

  • LEFT / RIGHT: `=LEFT(text, number_of_characters)`, `=RIGHT(text, number_of_characters)` - Extracts a specified number of characters from a text string starting from the left/right.

### Logical

  • IF: `=IF(logical_test, value_if_true, value_if_false)` - Checks whether a condition is met, and returns one value if TRUE, and another value if FALSE.

  • AND / OR: `=AND(logical1, [logical2], ...)`, `=OR(logical1, [logical2], ...)` - Returns TRUE if all arguments are TRUE / if any arguments are TRUE.

### Date and Time

  • TODAY: `=TODAY()` - Returns the current date.

  • NOW: `=NOW()` - Returns the current date and time.

  • YEAR / MONTH / DAY: `=YEAR(A1)`, `=MONTH(A1)`, `=DAY(A1)` - Extracts the year/month/day from a date.

### Financial

  • NPV: `=NPV(rate, value1, [value2], ...)` - Calculates the net present value of an investment based on a constant interest rate.

  • PMT: `=PMT(rate, nper, pv, [fv], [type])` - Calculates the payment for a loan based on constant payments and a constant interest rate.

### Statistical

  • COUNT: `=COUNT(A1:A10)` - Counts the number of cells that contain numbers.

  • COUNTIF / COUNTIFS: `=COUNTIF(range, criteria)`, `=COUNTIFS(range1, criteria1, [range2, criteria2],...)` - Counts the number of cells that meet a criterion; can use multiple criteria.

  • SUMIF / SUMIFS: `=SUMIF(range, criteria, [sum_range])`, `=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2, criteria2],...)` - Adds the cells specified by a given condition(s).

### Data Analysis

  • PivotTables: Essential for summarizing, analyzing, exploring, and presenting summary data. PivotTables can dynamically sort, count, and total the data stored in one table or spreadsheet and create a second table displaying the summarized data.

  • Data Validation: Use Data Validation (`Data` > `Data Validation`) to create dropdown menus, restrict cell input values to certain types, or create rules that dictate what can be entered.

### Array Formulas

  • Array Formula: Entered by pressing `Ctrl` + `Shift` + `Enter`. Useful for performing multiple calculations on one or more items in an array. For example, `{=SUM(IF(A1:A10>5, A1:A10))}` sums only the numbers greater than 5 in the range.

------- end -------

🚀 Excel Cheat Sheet

### 1. Mean (Arithmetic Average)

  • Definition: The sum of all values divided by the number of values.

  • Example: For sales values $30, $50, $10, $170, $20, $80, the mean is ($30 + $50 + $10 + $170 + $20 + $80) / 6 = $60.

### 2. Median

  • Definition: The middle value of an ordered dataset.

  • Example: For the dataset $10, $20, $30, $50, $80, $170, the median is $40 (average of $30 and $50).

### 3. Mode

  • Definition: The most frequently occurring value in a dataset.

  • Example: For the datasets 3, 5, 1, 5, 2, 5, 7, the mode is 5.

### 4. Variance

  • Definition: A measure of the spread of numbers in a dataset; the average of the squared differences from the mean.

  • Example: For sales values $30, $50, $10, $170, $20, $80 with a mean of $60, the variance is [(30-60) ² + (50-60)² + (10-60)² + (170-60)² + (20-60)² + (80-60)²] / 6 = 2833.33.

### 5. Standard Deviation**

  • Definition: The square root of the variance; measures the average distance from the mean.

  • Example: For the same sales values, the standard deviation is √2833.33 = 53.22.

### 6. Descriptive Statistics

  • Definition: Statistical methods that describe and summarize data.

  • Example: Calculating the mean, median, mode, variance, and standard deviation for a dataset.

### 7. Exploratory Data Analysis (EDA)

  • Definition: Visualizing and exploring data to find patterns, trends, and relationships.

  • Example: Calculating the mean, median, mode, variance, and standard deviation for a dataset.

### 8. Predictive Statistics (Inferential Statistics)

  • Definition: Using data from a sample to make inferences about a population.

  • Example: Creating histograms, scatter plots, and box plots to visualize sales data.

### 9. Data Collection

  • Definition: Gathering relevant data from various sources.

  • Example: Collecting sales data through surveys, databases, and records.

### 10. Data Cleaning and Preprocessing

  • Definition: Ensuring data quality and reliability by removing outliers, correcting missing values, and formatting data.

  • Example: Removing a sales outlier value of $1000 from a dataset of typical sales values around $50.

### 11. Data Structure and Organization

  • Definition: Understanding the format and organization of data.

  • Example: Identifying data as numeric or categorical and arranging it in rows and columns.

------- end -------

🚀 Statistical Analysis Terminology Cheatsheet