ML Part

2021.11.22 Experiments

<aside> 1️⃣ Model : RidgeRegression

Submission link : https://www.kaggle.com/rukimdev/ridgeregression-ensemble-3-remove-0-data
train data :
1. unintended toxic comment data
2. ruddit data
metric : MSE
train data preprocessing :
- Removed the non-toxic rows from each dataset
- Definition of non-toxic :
  - unintended competition : rows whose toxicity-related label values sum to 0
  - ruddit data : rows with a negative offensiveness score
Embedding : TF-IDF
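training procedure :
- Trained a Ridge model on each dataset, then predicted scores separately for the less_toxic and more_toxic comments of the competition validation set. Kept only the cases where the more_toxic score was larger, and ensembled the per-dataset predictions.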
Accuracy :
- public : 0.760
- local : 0.674 </aside>
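
The validation step above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the ensemble is a plain average of the per-dataset scores (the notes do not say how the predictions were combined), and the function names and toy scores are hypothetical.

```python
from statistics import mean

def ensemble_scores(per_model_scores):
    """Average the scores that several models assign to the same comments."""
    return [mean(scores) for scores in zip(*per_model_scores)]

def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the more_toxic comment gets the higher score."""
    correct = sum(1 for less, more in zip(less_scores, more_scores) if more > less)
    return correct / len(less_scores)

# Toy predictions from two hypothetical Ridge models (one per training set).
# Each inner list holds one model's scores for three validation pairs.
less_by_model = [[0.1, 0.4, 0.3], [0.2, 0.5, 0.1]]
more_by_model = [[0.3, 0.2, 0.6], [0.4, 0.3, 0.5]]

less = ensemble_scores(less_by_model)
more = ensemble_scores(more_by_model)
accuracy = pairwise_accuracy(less, more)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy case two of the three pairs are ordered correctly, so the local score would be about 0.667.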

<aside> 2️⃣ Model : LSTM

Submission link : https://www.kaggle.com/rukimdev/lstm-with-word2vec-first-second-ruddit?scriptVersionId=80567351
train data :
1. toxic comment classification
2. unintended toxic comment data
3. ruddit data
metric : MSE
train data preprocessing :
- Built a binary target in each dataset (non-toxic = 0, toxic = 1), then merged all three datasets
tokenization : PLMs tokenizer
Embedding : Word2Vec
training procedure :
- Word2Vec-LSTM-relu-relu-FullyConnected
- Word2Vec-LSTM-relu-relu-FullyConnected + Ridge ensemble : failed (OOM)
Accuracy :
- public : 0.623
- local : 0.740 </aside>
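
The binary target construction for this experiment might look like the sketch below. The field names (`text`, `label_sum`, `offensiveness`) are placeholders rather than the real column names, and the toxic/non-toxic thresholds reuse the definition from the first experiment, which is an assumption here.

```python
def binarize(rows, is_toxic):
    """Return (text, target) pairs with target 1 for toxic and 0 for non-toxic."""
    return [(row["text"], 1 if is_toxic(row) else 0) for row in rows]

# Toy rows; the field names are placeholders, not the real column names.
classification_rows = [{"text": "a", "label_sum": 0}, {"text": "b", "label_sum": 2}]
unintended_rows = [{"text": "c", "label_sum": 0.0}, {"text": "d", "label_sum": 0.7}]
ruddit_rows = [{"text": "e", "offensiveness": -0.3}, {"text": "f", "offensiveness": 0.4}]

# Assumed rule: a label sum of 0 or a negative offensiveness score means non-toxic.
merged = (
    binarize(classification_rows, lambda r: r["label_sum"] > 0)
    + binarize(unintended_rows, lambda r: r["label_sum"] > 0)
    + binarize(ruddit_rows, lambda r: r["offensiveness"] >= 0)
)
print(merged)
```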

<aside> 3️⃣ Model : RidgeRegression+BERT

Submission link : https://www.kaggle.com/rukimdev/tfidf-bert-ensemble-with-first-second-ruddit
train data :
1. toxic comment classification
2. ruddit data
metric : MSE
train data preprocessing :
- toxic comment classification :
  - Built the target by summing all label values, with a weight of 2 on severe_toxic
  - Kept two versions of the dataset: one with text cleaning applied and one without
- ruddit data : built the target by converting negative offensiveness scores to 0
tokenization : none
Embedding : TF-IDF
training procedure :
- Same as Yerim's model 1, plus an ensemble with Sangbin's model 1
Accuracy :
- public : 0.848
- local : 0.68 </aside>
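
The two target definitions above can be written out as a short sketch. The six label names are the columns of the Jigsaw Toxic Comment Classification data; the function names and the sample row are hypothetical.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Every label counts once, except severe_toxic, which counts twice.
WEIGHTS = {label: 1.0 for label in LABELS}
WEIGHTS["severe_toxic"] = 2.0

def classification_target(row):
    """Weighted sum of the binary labels of one comment."""
    return sum(WEIGHTS[label] * row[label] for label in LABELS)

def ruddit_target(offensiveness_score):
    """Ruddit target with negative offensiveness scores converted to 0."""
    return max(offensiveness_score, 0.0)

sample_row = {"toxic": 1, "severe_toxic": 1, "obscene": 0,
              "threat": 0, "insult": 0, "identity_hate": 0}
print(classification_target(sample_row))  # toxic (1) + 2 * severe_toxic (1)
print(ruddit_target(-0.4), ruddit_target(0.25))
```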

<aside> 4️⃣ Model : RidgeRegression

Submission link : https://www.kaggle.com/rukimdev/jigsaw-ridgeregression-with-ruddit-data
train data :
1. ruddit data
metric : MSE
train data preprocessing :
- ruddit data (two variants) :
  - Target with negative offensiveness scores converted to 0
  - Target using the offensiveness scores as-is, negatives included
tokenization : none
Embedding : TF-IDF
training procedure :
- Same as Yerim's model 1
Accuracy :
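- Scores with negatives converted to 0
  - public : 0.802
  - local : 0.625
- Scores with negatives kept as-is
  - public : 0.802
  - local : 0.625
- The two variants score identically, to the point that something may have gone wrong. </aside>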
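
One way to check whether the two variants really behave the same is to compare how they order comment pairs, not only their aggregate scores. The sketch below is a hypothetical diagnostic on toy numbers and is not part of the original notebook.

```python
def clip_negative(scores):
    """Convert negative scores to 0, as in the first target variant."""
    return [max(s, 0.0) for s in scores]

def count_order_flips(pred_a, pred_b, pairs):
    """Count comment pairs (i, j) that the two prediction lists order differently."""
    flips = 0
    for i, j in pairs:
        if (pred_a[i] > pred_a[j]) != (pred_b[i] > pred_b[j]):
            flips += 1
    return flips

# Toy targets: confirm that the two training targets are not accidentally identical.
raw_targets = [-0.5, -0.1, 0.3]
targets_differ = clip_negative(raw_targets) != raw_targets

# Toy predictions from the two hypothetical models on three comments.
pred_raw = [0.10, 0.20, 0.60]
pred_clipped = [0.25, 0.20, 0.60]
flips = count_order_flips(pred_raw, pred_clipped, pairs=[(0, 1), (1, 2)])
print(targets_differ, flips)
```

If the targets differ but no pair ordering ever flips, identical accuracy is plausible; if the targets are identical, the conversion step was probably not applied.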

<aside> 1️⃣ Data : Jigsaw Rate Severity of Toxic comments

Criterion : MRL (margin ranking loss)
Preprocessing : none
Embedding : PLMs tokenizer
Model : RoBERTa-base, RoBERTa-large, ALBERT-base, GroNLP, dehateBERT, hatexplain
Fine tuning : Fully Connected
Training runs and results :
- RoBERTa-base tokenizer-RoBERTa-base model-FullyConnected : 0.816 (top)
- RoBERTa-large tokenizer-RoBERTa-large model-FullyConnected : failed (OOM)
- ALBERT-base tokenizer-ALBERT-base model-FullyConnected : 0.807
- hatexplain tokenizer-hatexplain model-FullyConnected : 0.788
- dehateBERT tokenizer-dehateBERT model-FullyConnected : 0.719
- GroNLP tokenizer-GroNLP model-FullyConnected : 0.708 </aside>
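
Assuming MRL refers to the margin ranking loss, the criterion penalizes a pair whenever the more_toxic comment is not scored at least `margin` higher than the less_toxic one. A minimal pure-Python sketch of that formula follows; the margin value and the toy scores are assumptions, not values from these runs.

```python
def margin_ranking_loss(more_scores, less_scores, margin=0.5):
    """Mean of max(0, margin - (more - less)) over all comment pairs."""
    losses = [
        max(0.0, margin - (more - less))
        for more, less in zip(more_scores, less_scores)
    ]
    return sum(losses) / len(losses)

# Pair 1 is ranked correctly with a gap above the margin, so it adds no loss.
# Pair 2 is ranked the wrong way round, so it adds margin + gap = 0.7.
more_scores = [0.9, 0.2]
less_scores = [0.1, 0.4]
loss = margin_ranking_loss(more_scores, less_scores)
print(f"loss: {loss:.2f}")
```

This corresponds to the target `y = 1` case of a margin ranking loss, where the first input is always the one that should rank higher.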

<aside> 2️⃣ Data : Ruddit data

</aside>

<aside> 3️⃣ Data : Jigsaw Toxic comment classification challenge

</aside>