기초 데이터 분석 및 시각화

February 21, 2022 4 분 소요

데이터셋 준비

사이킷런에서 당뇨병 데이터셋을 받아옵니다.

그리고 데이터셋을 판다스의 데이터 프레임으로 변환합니다.

from sklearn.datasets import load_diabetes, load_boston,load_wine

dataset = load_diabetes()
features = dataset['data']
feature_names = dataset['feature_names']

import pandas as pd

df = pd.DataFrame(features, columns=feature_names)
df.head()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068330	-0.092204
2	0.085299	0.050680	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022692	-0.009362
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031991	-0.046641

데이터 정보 확인

describe(): 특징에 대한 개수(count), 평균(mean), 표준편자(std), 최소(min), 최대(max) 등을 알 수 있습니다.

info(): 특징의 null값의 유무와 Dtype을 알 수 있습니다.

nlargest(int): 값이 큰 순서대로 int값의 개수를 출력합니다.

corr(): 각 특징간 상관관계를 알려줍니다.

df.describe()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
count	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02
mean	-3.634285e-16	1.308343e-16	-8.045349e-16	1.281655e-16	-8.835316e-17	1.327024e-16	-4.574646e-16	3.777301e-16	-3.830854e-16	-3.412882e-16
std	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02
min	-1.072256e-01	-4.464164e-02	-9.027530e-02	-1.123996e-01	-1.267807e-01	-1.156131e-01	-1.023071e-01	-7.639450e-02	-1.260974e-01	-1.377672e-01
25%	-3.729927e-02	-4.464164e-02	-3.422907e-02	-3.665645e-02	-3.424784e-02	-3.035840e-02	-3.511716e-02	-3.949338e-02	-3.324879e-02	-3.317903e-02
50%	5.383060e-03	-4.464164e-02	-7.283766e-03	-5.670611e-03	-4.320866e-03	-3.819065e-03	-6.584468e-03	-2.592262e-03	-1.947634e-03	-1.077698e-03
75%	3.807591e-02	5.068012e-02	3.124802e-02	3.564384e-02	2.835801e-02	2.984439e-02	2.931150e-02	3.430886e-02	3.243323e-02	2.791705e-02
max	1.107267e-01	5.068012e-02	1.705552e-01	1.320442e-01	1.539137e-01	1.987880e-01	1.811791e-01	1.852344e-01	1.335990e-01	1.356118e-01

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB

df['bmi'].nlargest(3)

367    0.170555
256    0.160855
366    0.137143
Name: bmi, dtype: float64

df.corr()

	age	bmi	bp	s1	s2	s3	s4	s5	s6	label
age	1.000000	0.185085	0.335427	0.260061	0.219243	-0.075181	0.203841	0.270777	0.301731	0.006930
bmi	0.185085	1.000000	0.395415	0.249777	0.261170	-0.366811	0.413807	0.446159	0.388680	0.116859
bp	0.335427	0.395415	1.000000	0.242470	0.185558	-0.178761	0.257653	0.393478	0.390429	0.048823
s1	0.260061	0.249777	0.242470	1.000000	0.896663	0.051519	0.542207	0.515501	0.325717	0.008234
s2	0.219243	0.261170	0.185558	0.896663	1.000000	-0.196455	0.659817	0.318353	0.290600	0.043795
s3	-0.075181	-0.366811	-0.178761	0.051519	-0.196455	1.000000	-0.738493	-0.398577	-0.273697	-0.086296
s4	0.203841	0.413807	0.257653	0.542207	0.659817	-0.738493	1.000000	0.617857	0.417212	0.062967
s5	0.270777	0.446159	0.393478	0.515501	0.318353	-0.398577	0.617857	1.000000	0.464670	0.015286
s6	0.301731	0.388680	0.390429	0.325717	0.290600	-0.273697	0.417212	0.464670	1.000000	-0.004950
label	0.006930	0.116859	0.048823	0.008234	0.043795	-0.086296	0.062967	0.015286	-0.004950	1.000000

함수 적용

apply()

def sortsex(x):
  if x > 0:
    return 'male'
  else:
    return 'female'

df['sex'] = df['sex'].apply(sortsex)
df['sex']

0        male
1      female
2        male
3      female
4      female
        ...  
437      male
438      male
439      male
440    female
441    female
Name: sex, Length: 442, dtype: object

import numpy as np
df['label'] = np.random.randint(1, 4, len(df))

그룹화

df.groupby('sex').get_group('male').head()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6	label
0	0.038076	male	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646	3
2	0.085299	male	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930	1
6	-0.045472	male	-0.047163	-0.015999	-0.040096	-0.024800	0.000779	-0.039493	-0.062913	-0.038357	1
7	0.063504	male	-0.001895	0.066630	0.090620	0.108914	0.022869	0.017703	-0.035817	0.003064	1
8	0.041708	male	0.061696	-0.040099	-0.013953	0.006202	-0.028674	-0.002592	-0.014956	0.011349	3

df.groupby('sex').mean()

	age	bmi	bp	s1	s2	s3	s4	s5	s6	label
sex
female	-0.007756	-0.003936	-0.010759	-0.001575	-0.006368	0.016923	-0.014826	-0.006693	-0.009291	2.004255
male	0.008805	0.004468	0.012215	0.001788	0.007229	-0.019212	0.016832	0.007598	0.010548	2.048309

df.groupby('sex').size()

sex
female    235
male      207
dtype: int64

df.groupby(['sex', 'label']).mean()

		age	bmi	bp	s1	s2	s3	s4	s5	s6
sex	label
female	1	-0.007983	-0.011001	-0.012674	-0.002239	-0.009755	0.023223	-0.020272	-0.007147	-0.013663
	2	-0.001230	-0.001010	-0.006186	-0.000511	-0.005365	0.011965	-0.009201	-0.000791	-0.001290
	3	-0.013975	0.000152	-0.013384	-0.001970	-0.004013	0.015599	-0.015003	-0.012071	-0.012875
male	1	0.007752	-0.003550	0.007619	0.003018	0.005739	-0.013468	0.013457	0.006458	0.012009
	2	0.002551	0.004152	0.009424	-0.002898	0.006504	-0.020188	0.015658	0.000316	0.011208
	3	0.014395	0.011708	0.018313	0.004214	0.009072	-0.023501	0.020655	0.014032	0.008779

df.groupby('label').mean().sort_values('bmi', ascending=False)

	age	bmi	bp	s1	s2	s3	s4	s5	s6
label
3	0.000210	0.005930	0.002464	0.001122	0.002529	-0.003951	0.002826	0.000981	-0.002048
2	0.000398	0.001213	0.000537	-0.001539	-0.000253	-0.001882	0.001505	-0.000314	0.004092
1	-0.000597	-0.007504	-0.003149	0.000228	-0.002482	0.006001	-0.004440	-0.000761	-0.001613

데이터 시각화

import matplotlib.pyplot as plt

plt.plot(df['age'].head(), label='age')
plt.xlabel('x')  # x축
plt.ylabel('y')  # y축
plt.legend()     # 범례

<matplotlib.legend.Legend at 0x7faf8828fbd0>

plt.plot(df['age'].head(), marker='o')
plt.plot(df['bmi'].head(), marker='v', linestyle='none')
plt.plot(df['bp'].head(), linestyle=':', color='g')
plt.plot(df['s1'].head(), 'ro--') # 포맷  color, marker, linestyle

[<matplotlib.lines.Line2D at 0x7faf87eacf50>]

# figsize: 그래프 크기, alpha: 투명도
plt.figure(figsize=(10,5))
plt.plot(df['age'].head(), 'bv:', alpha=0.3, label='age')
plt.plot(df['bmi'].head(), 's-.', label='bmi')
plt.legend(ncol=2)

<matplotlib.legend.Legend at 0x7faf87d222d0>

막대 그래프

plt.bar(df.index[:5], df['age'].head(), color=['r', 'g', 'b', 'r', 'g'])

<BarContainer object of 5 artists>

plt.bar(df.index[:5], df['age'].head(), width=0.3)
plt.xticks(rotation=45)
plt.yticks(rotation=45)

(array([-0.1 , -0.05,  0.  ,  0.05,  0.1 ]),
 <a list of 5 Text major ticklabel objects>)

plt.barh(df.index[:5], df['age'].head())

<BarContainer object of 5 artists>

bar = plt.bar(df.index[:5], df['age'].head())
bar[0].set_hatch('/')
bar[2].set_hatch('x')

for idx, rect in enumerate(bar):
  plt.text(idx, rect.get_height(), df['age'][idx])

데이터 프레임 적용

d = df.sort_values('age')
plt.plot(d['age'], d['bmi'])

plt.grid()

plt.bar(df.index[:5], df['age'].head(), color='r')
plt.bar(df.index[:5], df['bmi'].head(), color='b')

<BarContainer object of 5 artists>

plt.bar(df.index[:5], df['age'].head(), color='r')
plt.bar(df.index[:5], df['bmi'].head(), color='b', bottom=df['age'].head())
plt.bar(df.index[:5], df['s1'].head(), color='g', bottom=df['age'].head() + df['bmi'].head())

<BarContainer object of 5 artists>

w = 0.25

plt.bar(df.index[:5]-w, df['age'].head(), color='r', width=w)
plt.bar(df.index[:5], df['bmi'].head(), color='b', width=w)
plt.bar(df.index[:5]+w, df['s1'].head(), color='g', width=w)

<BarContainer object of 5 artists>

g = df.groupby('sex')

plt.pie(g.size(), labels=g.size().index, autopct='%.1f')
plt.show()

Twitter Facebook LinkedIn

기초 데이터 분석 및 시각화

데이터셋 준비

데이터 정보 확인

함수 적용

그룹화

데이터 시각화

데이터 프레임 적용

공유하기

댓글남기기

참고

DALL-E 2 사용법 (사용기), 텍스트로 이미지를 만드는 인공지능

구글 드라이브 파일 다운받는 gdown 사용법과 안될 시 해결법

스테이블 디퓨전(Stable Diffusion) 간단한 사용법과 가이드 및 원리 이해 by 코랩(colab)

머신러닝 - K-최근접 이웃(KNN classifier)을 이용한 분류