Natural language preprocessing and tokenization with nltk

February 23, 2022 2 min read

Preparing the dataset

Download the dataset from Kaggle: https://www.kaggle.com/arkhoshghalb/twitter-sentiment-analysis-hatred-speech

import pandas as pd

train = pd.read_csv("/content/drive/MyDrive/train.csv")

train.shape

(31962, 3)

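Data preprocessing

Lowercasing
Replacing non-English characters with spaces
Stopword removal

Stopwords are words that do not actually contribute to a model's training or prediction.

Words such as I, that, is, the, and a appear very often but add little to the meaning of a text, so they need to be removed.

Stemming

Stemming strips suffixes so that inflected forms of a word, such as drags, dragged, and dragging, are treated as one word. A rule-based stemmer does not merge irregular forms such as see, saw, and seen.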
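As a quick check of what a rule-based stemmer does and does not merge, the sketch below runs nltk's SnowballStemmer (the stemmer used later in this post) on a few hand-picked words. The word lists are illustrative assumptions and are not taken from the dataset.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Regular inflections end up with one shared stem after suffix stripping.
regular = ["drag", "drags", "dragged", "dragging"]
regular_stems = {word: stemmer.stem(word) for word in regular}

# Irregular forms have no common suffix to strip, so they stay apart.
irregular = ["see", "saw", "seen"]
irregular_stems = {word: stemmer.stem(word) for word in irregular}

print(regular_stems)
print(irregular_stems)
```

Merging irregular forms generally calls for a dictionary-based lemmatizer rather than a stemmer.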

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

True

import re

def text_cleaning(data):

    # Replace every character that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Lowercase and split into words
    no_capitals = only_english.lower().split()

    # Remove stopwords
    stops = set(stopwords.words('english'))
    no_stops = [word for word in no_capitals if word not in stops]

    # Stem each remaining word
    stemmer = nltk.stem.SnowballStemmer('english')
    stemmer_words = [stemmer.stem(word) for word in no_stops]

    # Join the words back into a space-separated string and return it
    return ' '.join(stemmer_words)

train['clean_text'] = train['tweet'].apply(text_cleaning)
train.head()

	id	tweet	clean_text
0	1	@user when a father is dysfunctional and is s...	user father dysfunct selfish drag kid dysfunct...
1	2	@user @user thanks for #lyft credit i can't us...	user user thank lyft credit use caus offer whe...
2	3	bihday your majesty	bihday majesti
3	4	#model i love u take with u all the time in ...	model love u take u time ur
4	5	factsguide: society now #motivation	factsguid societi motiv

from multiprocessing import Pool

def use_multiprocess(func, iterable, workers):
    # Map func over iterable in `workers` processes; the with-block cleans up the pool
    with Pool(processes=workers) as pool:
        return pool.map(func, iterable)

clean_processed_tweet = use_multiprocess(text_cleaning, train['tweet'], 3)
clean_processed_tweet

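Types of tokenizers (using the nltk library)

A tokenizer splits the input sentences into tokens.

Tokenizers fall into two broad groups: word tokenizers and subword tokenizers.

A word tokenizer produces one token per word.

A subword tokenizer breaks a word (for example a compound) into the smaller units found inside it.

The advantage of a subword tokenizer is that it copes well with words missing from the vocabulary, because an unseen word can still be expressed as a sequence of known subwords. WordPiece is one kind of subword tokenizer, and BPE (Byte Pair Encoding) is one of the most widely used subword methods.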
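To make the out-of-vocabulary point concrete, here is a minimal sketch of the greedy longest-match-first lookup commonly used with WordPiece vocabularies. The function name wordpiece_tokenize and the toy vocabulary are hypothetical, the "##" prefix for word-internal pieces follows the convention popularized by BERT, and real tokenizers learn their vocabulary from data rather than using a hand-written set.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary; "unbreakable" itself is not in it.
toy_vocab = {"un", "break", "##break", "##able", "##s"}

print(wordpiece_tokenize("unbreakable", toy_vocab))  # ['un', '##break', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

A word tokenizer with the same vocabulary would have to map "unbreakable" to a single unknown token, while the subword lookup still recovers three meaningful pieces.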

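Sentence tokenization

import nltk
nltk.download('punkt')

# clean_text no longer contains punctuation, so the whole tweet comes back as a single sentence
sentences = nltk.sent_tokenize(train['clean_text'][0])
sentences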

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

['user father dysfunct selfish drag kid dysfunct run']

Word tokenization

from nltk.tokenize import word_tokenize

word_tokenize(train['clean_text'][0])

['user', 'father', 'dysfunct', 'selfish', 'drag', 'kid', 'dysfunct', 'run']

Newline-based tokenization

from nltk.tokenize import LineTokenizer

line_tokenizer = LineTokenizer()

line_tokenizer.tokenize("I am a college student, I'm 23 years old \n I like to read books.")

["I am a college student, I'm 23 years old ", ' I like to read books.']

Space-based tokenization

from nltk.tokenize import SpaceTokenizer

space_tokenizer = SpaceTokenizer()
space_tokenizer.tokenize(train['clean_text'][0])

['user', 'father', 'dysfunct', 'selfish', 'drag', 'kid', 'dysfunct', 'run']
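The nltk documentation describes SpaceTokenizer as equivalent to s.split(' '), which differs from a plain s.split() when the text contains repeated spaces. The standard-library sketch below shows the difference on a made-up string; clean_text is joined with single spaces, so the two behave the same on that column.

```python
text = "user  father   run"  # made-up input with repeated spaces

# Splitting on a single space keeps an empty string for each extra space.
by_single_space = text.split(" ")

# Splitting on any run of whitespace drops them.
by_whitespace = text.split()

print(by_single_space)  # ['user', '', 'father', '', '', 'run']
print(by_whitespace)    # ['user', 'father', 'run']
```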

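WordPunctTokenizer

Splits text into runs of alphanumeric characters and runs of punctuation, so an apostrophe or a period becomes a token of its own.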

from nltk.tokenize import WordPunctTokenizer 

WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage goes for a pastry shop.")

['Don',
 "'",
 't',
 'be',
 'fooled',
 'by',
 'the',
 'dark',
 'sounding',
 'name',
 ',',
 'Mr',
 '.',
 'Jone',
 "'",
 's',
 'Orphanage',
 'goes',
 'for',
 'a',
 'pastry',
 'shop',
 '.']
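The nltk documentation gives the regular expression \w+|[^\w\s]+ for WordPunctTokenizer. The sketch below applies that pattern with the standard re module to show why "Jone's" splits into three tokens; the helper name word_punct_tokenize and the sample sentence are illustrative assumptions.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Return the matches of the pattern in order of appearance."""
    return WORD_PUNCT_PATTERN.findall(text)


tokens = word_punct_tokenize("Mr. Jone's shop...")
print(tokens)  # ['Mr', '.', 'Jone', "'", 's', 'shop', '...']
```

Consecutive punctuation characters stay together in one token, which is why the trailing "..." is not split into three periods.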

Emoticon-aware tokenization

from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer()
tweet_tokenizer.tokenize("This is a coool #dummysmiley: :-) : -P <3 :)")

['This',
 'is',
 'a',
 'coool',
 '#dummysmiley',
 ':',
 ':-)',
 ':',
 '-',
 'P',
 '<3',
 ':)']
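Since the tweets in this dataset start with @user handles, two TweetTokenizer options may be useful: strip_handles removes @-mentions and reduce_len shortens runs of three or more repeated characters to three. The sketch below uses a made-up tweet, not a row from the dataset.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions; reduce_len caps repeated characters at three.
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

tweet = "@user thaaaanks for the #lyft credit :)"  # made-up example
tokens = tokenizer.tokenize(tweet)
print(tokens)
```

Hashtags and emoticons are kept as single tokens, while the handle disappears from the result.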

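Tokenization with keras

keras.preprocessing.text.text_to_word_sequence

text_to_word_sequence lowercases the text, strips punctuation, and splits on spaces, but it does not remove stopwords or stem. Applied to clean_text it returns the same tokens as the nltk tokenizers above; applied to the raw tweet it returns every word in lowercase.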

from tensorflow.keras.preprocessing.text import text_to_word_sequence 

text_to_word_sequence(train['clean_text'][0])

['user', 'father', 'dysfunct', 'selfish', 'drag', 'kid', 'dysfunct', 'run']

text_to_word_sequence(train['tweet'][0])

['user',
 'when',
 'a',
 'father',
 'is',
 'dysfunctional',
 'and',
 'is',
 'so',
 'selfish',
 'he',
 'drags',
 'his',
 'kids',
 'into',
 'his',
 'dysfunction',
 'run']
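To see why the raw tweet comes out lowercased and without @ or #, the standard-library sketch below approximates what text_to_word_sequence does with its default arguments. The filter string is assumed to match the Keras default, and approx_text_to_word_sequence is a hypothetical name, not part of Keras.

```python
# Assumed to mirror the default `filters` argument of text_to_word_sequence.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def approx_text_to_word_sequence(text, filters=DEFAULT_FILTERS):
    """Lowercase, turn filtered characters into spaces, then split on spaces."""
    lowered = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    # Drop the empty strings left behind by consecutive spaces.
    return [tok for tok in lowered.translate(table).split(" ") if tok]


sample = "@user Thanks for #lyft credit, I can't use it!"  # made-up example
words = approx_text_to_word_sequence(sample)
print(words)
# ['user', 'thanks', 'for', 'lyft', 'credit', 'i', "can't", 'use', 'it']
```

The apostrophe is not in the filter string, so "can't" survives as one token, unlike in text_cleaning where every non-letter becomes a space.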
