A few points that I need to note:
3B1B’s ML explainer videos are pretty good
They explain that training involves a few distinct pieces. The training data is fed through the network layer by layer; within each layer, weights determine how much each input (each activation from the previous layer) contributes to a neuron, and a bias shifts the result. A cost (error) function then measures how far the network's output is from the actual label.
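As a quick sketch of what a single neuron computes (the names and values here are mine, and I'm assuming the sigmoid activation that 3B1B uses in the early videos):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron: a weighted sum of the previous layer's activations, plus a bias,
# squashed by an activation function.
def neuron(activations, weights, bias):
    return sigmoid(np.dot(weights, activations) + bias)

neuron(np.array([0.2, 0.9]), np.array([1.5, -0.7]), 0.1)
```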
Mean squared error tells us how far off a prediction is from the actual value; a perfect prediction drives it to zero, so training aims to minimize it.
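A minimal sketch of that (function and variable names are mine):

```python
import numpy as np

def mse(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    return np.mean((predictions - targets) ** 2)

mse(np.array([1.0, 2.0]), np.array([1.0, 2.0]))   # 0.0  -- perfect prediction
mse(np.array([1.5, 2.0]), np.array([1.0, 2.0]))   # 0.125 -- error grows with the miss
```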
Stochastic gradient descent differs from plain gradient descent in that each step computes the gradient on a small random batch of the data rather than the entire dataset, because computing the full gradient every step is computationally expensive. Neither version is guaranteed to find the global minimum; in practice we settle for a good local minimum.
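A rough sketch of minibatch SGD on a toy problem (the data, learning rate, and batch size are made up by me, just to show the random-batch idea):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + noise; we fit a single weight w by minimizing squared error.
X = rng.normal(size=1000)
y = 3 * X + 0.1 * rng.normal(size=1000)

def grad(w, xb, yb):
    # Gradient of the mean of 0.5 * (w*x - y)**2 with respect to w, over the batch.
    return np.mean((w * xb - yb) * xb)

w, lr, batch_size = 0.0, 0.1, 32
for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)  # small random batch, not the full dataset
    w -= lr * grad(w, X[idx], y[idx])               # step downhill along the batch gradient

print(w)  # ends up close to 3
```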