Exploring different ways to represent data
- Simple models like the Linear Model and the Naive Bayes Classifier are the most strongly influenced by how the data is represented
- Models like decision trees are not heavily influenced, while models like neural networks, SVMs, and the Nearest Neighbor Method can be influenced as well
Important tasks of a Data Scientist
- There are things more important than adjusting model parameters, such as choosing a good representation of the data
One-hot Encoding
- Example: when a categorical variable like job takes three values {student, engineer, chef}, each value can be represented as an indicator vector using One-hot Encoding, e.g. student as {1,0,0}
- Sometimes each value is assigned an integer instead, but since the values are not continuous (their order has no meaning), these codes should not be treated as numerical values directly
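As a sketch, the job example above can be one-hot encoded in plain Python (the helper name is made up here; in practice pandas.get_dummies or scikit-learn's OneHotEncoder would typically be used):

```python
def one_hot(value, categories):
    """Map a categorical value to an indicator vector:
    1 at the value's position, 0 everywhere else."""
    return [1 if value == c else 0 for c in categories]

jobs = ["student", "engineer", "chef"]
print(one_hot("student", jobs))   # [1, 0, 0]
print(one_hot("chef", jobs))      # [0, 0, 1]
```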
Binning (Discretization)
- Dividing continuous values into intervals to create classes
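A minimal binning sketch using only the standard library (the edge values are made up for illustration):

```python
import bisect

def bin_index(x, edges):
    """Assign x to a bin index, given sorted interval edges.
    With edges [10, 20, 30] the bins are:
    x <= 10 -> 0, 10 < x <= 20 -> 1, 20 < x <= 30 -> 2, x > 30 -> 3."""
    return bisect.bisect_left(edges, x)

ages = [5, 15, 25, 35]
print([bin_index(a, [10, 20, 30]) for a in ages])  # [0, 1, 2, 3]
```

The resulting bin index can then be treated as a class label, or one-hot encoded in turn.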
Polynomial features
- In models like the Linear Model, adding terms such as x^2 and x^3 alongside x helps fit curved decision boundaries
- In contrast, it can decrease performance in models like decision trees
- Besides powers, functions such as sin, cos, and log can also be used
- The log transform is particularly useful for skewed data, which it can reshape to resemble a bell curve
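A sketch of the expansions above in plain Python (for whole feature matrices, scikit-learn's PolynomialFeatures does this automatically):

```python
import math

def poly_features(x, degree):
    """Expand a single value x into [x, x**2, ..., x**degree]."""
    return [x ** d for d in range(1, degree + 1)]

def log_feature(x):
    """Log transform: often reshapes right-skewed positive data
    to look more like a bell curve."""
    return math.log(x)

print(poly_features(2.0, 3))  # [2.0, 4.0, 8.0]
```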
Automated feature selection
- Automatically selecting useful features, thereby reducing their number, can improve generalization performance
- One approach is to select the features with the highest statistical correlation to the target
- Another is to train a model to obtain feature importances, then train again using only the important features
- These steps can also be repeated iteratively
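The correlation-based selection step can be sketched as follows (pure Python; the function names are hypothetical, and a real pipeline would use something like sklearn.feature_selection.SelectKBest instead):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def select_top_k(columns, target, k):
    """Keep the indices of the k feature columns whose absolute
    correlation with the target is highest."""
    scores = sorted(((abs(pearson(col, target)), i)
                     for i, col in enumerate(columns)), reverse=True)
    return sorted(i for _, i in scores[:k])

# Toy data: column 0 tracks the target, column 2 is its mirror image,
# column 1 is only weakly related.
cols = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]
y = [1, 2, 3, 4]
print(select_top_k(cols, y, 2))  # [0, 2]
```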
Utilizing domain knowledge
- Human experts may have insights that cannot be derived from the data by statistical means alone
- Such knowledge should be incorporated when performing Feature Engineering
# Getting Started with Machine Learning in Python