Thoughts on the Information Alpha Challenge 2: Training a CNN on CIFAR10
- I felt it was a process of balancing “tuning the model to raise accuracy” against “improving generalization performance”.
- It involved repeatedly deciding which direction to push next based on the curves of training accuracy and validation accuracy.
- I’m not sure if it’s the correct way, though.
from 東大1S情報α (University of Tokyo 1st Semester Information Alpha) Deep Learning Techniques
- Determining the initial parameter values
- What kind of initial values are good?
- Since the distribution tends to be centered around 0 after learning, it is good to start with values centered around 0.
- Also, the larger the layer’s input dimension, the smaller the variance should be.
- Additionally, it is good to use a nicely shaped normal distribution.
- This “nicely shaped normal distribution” is called Xavier’s normal distribution or He’s normal distribution.
- They differ in the spread (variance) of the distribution: Xavier scales it by both fan-in and fan-out, He by fan-in only (suited to ReLU); see the sketch below.
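- A minimal sketch of how this could look (assuming PyTorch, which the course may not have used; the layer sizes are made up): He initialization for ReLU layers, with Xavier as the commented alternative.
```python
# A minimal sketch (assuming PyTorch) of He / Xavier weight initialization.
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # He (Kaiming) normal: std ~ sqrt(2 / fan_in), good for ReLU layers.
    # Xavier normal:       std ~ sqrt(2 / (fan_in + fan_out)), good for tanh/sigmoid.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        # nn.init.xavier_normal_(module.weight)  # the Xavier alternative
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
)
model.apply(init_weights)  # recursively applies init_weights to every submodule
```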
- Gradient clipping
- Setting an upper limit on the gradient (norm) so that a single update cannot blow up; see the sketch below.
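- A minimal sketch (again assuming PyTorch; the tiny model and data are made up) of clipping the gradient norm inside one training step.
```python
# A minimal sketch (assuming PyTorch) of gradient clipping in a training step.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
# Rescale all gradients so their total norm never exceeds max_norm (the "upper limit").
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```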
- Preventing overfitting
- Regularization
- Add a penalty so that the parameters θ do not become too large (e.g. weight decay); see the sketch below.
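- A minimal sketch (assuming PyTorch; the model is a placeholder) of L2 regularization via the optimizer’s weight_decay option.
```python
# A minimal sketch (assuming PyTorch): L2 regularization through weight_decay.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# weight_decay adds a penalty proportional to ||theta||^2 to every update,
# pulling the parameters toward 0 and keeping them from growing too large.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
```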
- Dropout (Deep Learning)
- Randomly dropping connections during training; see the sketch below.
- It seems to be less commonly used recently.
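- A minimal sketch (assuming PyTorch; the layer sizes are made up) showing that Dropout only drops units in training mode.
```python
# A minimal sketch (assuming PyTorch) of Dropout: activations are zeroed at random during training only.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Dropout(p=0.5))

mlp.train()                           # training mode: each unit is dropped with probability 0.5
out_train = mlp(torch.randn(4, 100))
mlp.eval()                            # eval mode: nothing is dropped, outputs are deterministic
out_eval = mlp(torch.randn(4, 100))
```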
- Early stopping
- Just stop training before overfitting occurs (e.g. when the validation loss stops improving); see the sketch below.
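- A minimal sketch of one common early-stopping loop; `train_one_epoch`, `evaluate`, `model`, and the loaders are hypothetical helpers, not from the course.
```python
# A minimal sketch of early stopping with a patience counter (helpers are hypothetical).
best_val_loss = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(100):
    train_one_epoch(model, train_loader)     # hypothetical: one pass over the training set
    val_loss = evaluate(model, val_loader)   # hypothetical: average loss on the validation set
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss stopped improving: stop before overfitting gets worse
```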
- Batch normalization
- Correcting the variation in data within each mini-batch.
- Making the mean and variance of each mini-batch match those of the entire dataset (see the sketch after this block).
- On the micro level you still do stochastic gradient descent on mini-batches, while on the macro level each batch behaves statistically like the whole dataset, which is nice.
- It’s not clear why it works.
- In fact, the original purpose (reducing the deviation from the entire data) doesn’t have much meaning.
- But, for some reason, it smooths out the roughness of the loss function, so it is used.
- Magic~ (blu3mo)
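- A minimal sketch (assuming PyTorch; the conv block is made up) of batch normalization in a CNN layer, with a CIFAR10-sized input.
```python
# A minimal sketch (assuming PyTorch): BatchNorm2d normalizes each channel over the current mini-batch.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # per channel: subtract batch mean, divide by batch std, then learnable scale/shift
    nn.ReLU(),
)
x = torch.randn(8, 3, 32, 32)  # a CIFAR10-sized mini-batch of 8 images
y = block(x)                   # shape (8, 16, 32, 32)
```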
- Regularization
- There are various techniques, but
- Since the single best method is unknown, the practical approach is to combine several techniques.
- As for which one to use, well, it’s trial and error.
- For now, it seems that batch normalization is particularly strong, so I want to remember it.