Lecture 3: Loss functions and optimization

todo:

define a loss function that quantifies our unhappiness with the scores across the training data
come up with a way of efficiently finding the parameters that minimize the loss function (optimization)

[Image: Screenshot 2023-07-01 8.25.20 PM.png]

SVM - Hinge loss

s_j → score of an incorrect label (j ≠ y_i)

s_{y_i} → score of the correct label

+1 → safety margin
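A minimal NumPy sketch of the per-example hinge loss built from the three pieces above (s_j, s_{y_i}, and the +1 margin). The function name `svm_loss_single` and the example scores are hypothetical and are not taken from the slide.

```python
import numpy as np

def svm_loss_single(scores, correct_class, margin=1.0):
    """Multiclass SVM (hinge) loss L_i for a single example.

    scores        : 1-D array of class scores s
    correct_class : index y_i of the correct label
    margin        : the "+1" safety margin
    """
    scores = np.asarray(scores, dtype=float)
    # max(0, s_j - s_{y_i} + margin) for every class j
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    # the sum runs over the incorrect classes only (j != y_i)
    margins[correct_class] = 0.0
    return margins.sum()

# Hypothetical scores for a 3-class problem (assumed values)
scores = np.array([2.0, 4.5, -1.0])
loss = svm_loss_single(scores, correct_class=0)
print(loss)
```

Only class 1 violates the margin here (4.5 − 2.0 + 1 > 0), so it is the only class that contributes to the loss.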
- Computing the loss for the example in the screenshot above
  
  Q. what if the sum was instead over all classes? (including j = y_i)
  
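  A. Every per-example loss L_i increases by 1, because the j = y_i term is max(0, s_{y_i} - s_{y_i} + 1) = 1. The final loss also increases by 1.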
  
  Q. what if we used a mean instead of a sum here
  
  A. It probably does not matter: the mean only rescales the loss by a constant, and what we care about is minimizing it.
  
  Q. what if we used
  
  A. Squaring makes the loss nonlinear, so it does make a difference (it is a different loss).
  
  Q. what is the min/max possible loss
  
  A. [0, ∞): the minimum is 0 and there is no upper bound.
  
  Q. usually at initialization W are small numbers, so all s ~= 0. what is the loss?
  
  A. 2 in this example; in general, (number of classes - 1). Useful as a sanity check.
  - There is a bug with the loss
    
    → Suppose that we found a W such that L = 0. Is this W unique? No: for example, 2W also gives L = 0.
    
    To get a unique weight:
    
    → Weight Regularization
    
    → a way of trading off training loss and generalization loss on the test set
    
    → w2 is preferred
    
    → diffuse over everything
    
    → We want the weights spread out as much as possible so that all input features are taken into account.
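The answers above can be checked numerically. The sketch below verifies the initialization sanity check (loss = number of classes − 1), shows that scaling a zero-loss W keeps the loss at zero, and compares the L2 penalty of a concentrated weight vector against a spread-out one. All weights and inputs are assumed illustrative values; `svm_loss_single` is a hypothetical helper name.

```python
import numpy as np

def svm_loss_single(scores, correct_class, margin=1.0):
    """Multiclass SVM (hinge) loss for one example, summed over j != y_i."""
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0
    return margins.sum()

num_classes = 3

# 1) At initialization all scores are ~0, so each of the (C - 1) incorrect
#    classes contributes max(0, 0 - 0 + 1) = 1.
init_loss = svm_loss_single(np.zeros(num_classes), correct_class=0)

# 2) W is not unique: if L = 0 for W, it is still 0 for 2W.
#    Assumed weights (classes x features) and input, chosen so that L = 0.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([3.0, 1.0])            # correct class is 0
loss_W = svm_loss_single(W @ x, correct_class=0)
loss_2W = svm_loss_single((2 * W) @ x, correct_class=0)

# 3) L2 regularization R(w) = sum(w^2) prefers spread-out weights.
#    Assumed vectors: both give the same score on x4, but w2 is penalized less.
x4 = np.ones(4)
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])
same_score = bool(np.isclose(w1 @ x4, w2 @ x4))
r1 = np.sum(w1 ** 2)
r2 = np.sum(w2 ** 2)

print(init_loss, loss_W, loss_2W, same_score, r1, r2)
```

Because the data loss alone cannot distinguish W from 2W, the regularization term is what breaks the tie.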
Softmax - Cross entropy loss
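- Why take the log?
  - Mainly because it is mathematically nicer and more convenient to work with. The log is monotonic, so maximizing the log-probability is the same as maximizing the probability.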

[Image: Screenshot 2023-07-01 8.59.58 PM.png]

Q. what is the min/max possible loss L_i

A. The input to the log is a probability p, so it lies in [0, 1]. L_i = -log(p) therefore ranges from 0 (at p = 1) to ∞ (as p → 0).

Q. usually at initialization W are small numbers, so all s ~= 0. what is the loss?

A. -log(1/3) in this example; in general, -log(1/number of classes) = log(number of classes).

→ Used as a sanity check
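A minimal sketch of the softmax cross-entropy loss for one example, including the initialization sanity check described above. The function name `softmax_cross_entropy` is hypothetical. Subtracting the maximum score is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    """L_i = -log(softmax(scores)[correct_class]) for a single example."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()          # avoids overflow in exp
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[correct_class]

num_classes = 3

# All scores ~0 at initialization -> uniform probabilities 1/C
init_loss = softmax_cross_entropy(np.zeros(num_classes), correct_class=0)
print(init_loss, np.log(num_classes))        # the two values should agree
```

If the first loss printed during training is far from log(number of classes), something is likely wrong in the implementation.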

→ Optimization: the process of finding the weights W that minimize the loss

[Image: Screenshot 2023-07-01 9.12.56 PM.png]

R(W) → a function of the weights only (it does not depend on the data).

Worst approach: random search, which reaches about 15.5% accuracy, while the state of the art (SOTA) is ~95%.
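A rough sketch of what random search looks like: sample many random W, evaluate the loss for each, and keep the best one. The data here is synthetic stand-in data (an assumption, not the dataset from the lecture), and `svm_loss` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 examples, 10 features, 3 classes
num_examples, num_features, num_classes = 200, 10, 3
X = rng.normal(size=(num_examples, num_features))
y = rng.integers(0, num_classes, size=num_examples)

def svm_loss(W, X, y, margin=1.0):
    """Mean multiclass SVM loss over the dataset (no regularization)."""
    scores = X @ W.T                                   # shape (N, C)
    correct = scores[np.arange(len(y)), y][:, None]    # shape (N, 1)
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(len(y)), y] = 0.0                # skip j == y_i
    return margins.sum(axis=1).mean()

best_loss, best_W = np.inf, None
history = []                                           # best loss so far per trial
for trial in range(100):
    W = rng.normal(size=(num_classes, num_features)) * 1e-2   # random guess
    loss = svm_loss(W, X, y)
    if loss < best_loss:
        best_loss, best_W = loss, W
    history.append(best_loss)

print(best_loss)
```

Each trial ignores everything learned from the previous ones, which is why this approach does so poorly compared with following the gradient.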
Follow the slope (numerical gradient)
- It is an approximation, so it is not exact. (approximate)
- It is very slow to evaluate. (slow)
- easy to write
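A minimal sketch of a numerical gradient using finite differences. It uses centered differences, which is a choice made for this sketch; a one-sided difference (f(w + h) − f(w)) / h would also work. The name `numerical_gradient` and the test function are hypothetical.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Approximate df/dw with centered finite differences."""
    w = np.array(w, dtype=float)       # work on a copy
    grad = np.zeros_like(w)
    # One pass per dimension, two evaluations of f each: this is the slow part.
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + h
        f_plus = f(w)
        w[idx] = old - h
        f_minus = f(w)
        w[idx] = old                   # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Simple test function with a known gradient: f(w) = sum(w^2), df/dw = 2w
f = lambda w: np.sum(w ** 2)
w0 = np.array([1.0, -2.0, 3.0])
grad = numerical_gradient(f, w0)
print(grad)
```

The cost grows linearly with the number of parameters, which is what makes this impractical for real models.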
analytic gradient

the loss is just a function of W

We only need the derivative of the loss with respect to W.
- exact, fast, error-prone
In practice, always use the analytic gradient, but use the numerical gradient to verify that the computation is correct. This is called a gradient check.
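A sketch of a gradient check, assuming a softmax cross-entropy loss on a single example with a linear classifier s = Wx. The analytic gradient is compared with a centered-difference numerical gradient. All names and the random data are hypothetical.

```python
import numpy as np

def softmax_loss_and_grad(W, x, y):
    """Softmax cross-entropy loss for one example and its analytic dL/dW."""
    scores = W @ x
    shifted = scores - scores.max()            # numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    loss = -np.log(probs[y])
    dscores = probs.copy()
    dscores[y] -= 1.0                          # dL/ds = p - one_hot(y)
    grad = np.outer(dscores, x)                # chain rule through s = W x
    return loss, grad

def numerical_gradient(f, W, h=1e-5):
    """Centered finite-difference approximation of df/dW."""
    W = np.array(W, dtype=float)               # work on a copy
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                    # 3 classes, 4 features (assumed)
x = rng.normal(size=4)
y = 1

_, analytic = softmax_loss_and_grad(W, x, y)
numeric = numerical_gradient(lambda W_: softmax_loss_and_grad(W_, x, y)[0], W)

# Relative error: small values suggest the analytic gradient is implemented correctly
rel_error = np.max(np.abs(analytic - numeric) /
                   np.maximum(1e-12, np.abs(analytic) + np.abs(numeric)))
print(rel_error)
```

A large relative error usually points to a bug in the analytic gradient, since the numerical version is the simpler of the two to get right.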