라이브러리 불러오기

  • pandas 라이브러리 불러오기, supermarket_sales.csv 파일 불러오기
1
2
3
import pandas as pd 
from google.colab import drive
drive.mount("/content/drive")
Mounted at /content/drive
1
2
3
DATA_PATH = "/content/drive/MyDrive/Colab Notebooks/data/supermarket_sales.csv"
sales = pd.read_csv(DATA_PATH)
sales

Invoice ID Branch City Customer type Gender Product line Unit price Quantity Date Time Payment
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 1/5/2019 13:08 Ewallet
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3/8/2019 10:29 Cash
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 3/3/2019 13:23 Credit card
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 1/27/2019 20:33 Ewallet
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 2/8/2019 10:37 Ewallet
... ... ... ... ... ... ... ... ... ... ... ...
995 233-67-5758 C Naypyitaw Normal Male Health and beauty 40.35 1 1/29/2019 13:46 Ewallet
996 303-96-2227 B Mandalay Normal Female Home and lifestyle 97.38 10 3/2/2019 17:16 Ewallet
997 727-02-1313 A Yangon Member Male Food and beverages 31.84 1 2/9/2019 13:22 Cash
998 347-56-2442 A Yangon Normal Male Home and lifestyle 65.82 1 2/22/2019 15:33 Cash
999 849-09-3807 A Yangon Member Female Fashion accessories 88.34 7 2/18/2019 13:28 Cash

1000 rows × 11 columns

  <script>
    const buttonEl =
      document.querySelector('#df-5de646d0-967f-4dc5-a51f-d7ef279760fb button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-5de646d0-967f-4dc5-a51f-d7ef279760fb');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Invoice ID     1000 non-null   object 
 1   Branch         1000 non-null   object 
 2   City           1000 non-null   object 
 3   Customer type  1000 non-null   object 
 4   Gender         1000 non-null   object 
 5   Product line   1000 non-null   object 
 6   Unit price     1000 non-null   float64
 7   Quantity       1000 non-null   int64  
 8   Date           1000 non-null   object 
 9   Time           1000 non-null   object 
 10  Payment        1000 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 86.1+ KB

Groupy by

  • (동의어) 집계함수를 배운다.
1
sales['Invoice ID'].value_counts()
750-67-8428    1
642-61-4706    1
816-72-8853    1
491-38-3499    1
322-02-2271    1
              ..
633-09-3463    1
374-17-3652    1
378-07-7001    1
433-75-6987    1
849-09-3807    1
Name: Invoice ID, Length: 1000, dtype: int64
1
sales.groupby('Customer type')['Quantity'].sum()
Customer type
Member    2785
Normal    2725
Name: Quantity, dtype: int64
1
sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].sum()
Customer type  Branch  Payment    
Member         A       Cash           308
                       Credit card    282
                       Ewallet        374
               B       Cash           284
                       Credit card    371
                       Ewallet        269
               C       Cash           293
                       Credit card    349
                       Ewallet        255
Normal         A       Cash           264
                       Credit card    298
                       Ewallet        333
               B       Cash           344
                       Credit card    228
                       Ewallet        324
               C       Cash           403
                       Credit card    194
                       Ewallet        337
Name: Quantity, dtype: int64

-

1
print(type(sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].sum()))
<class 'pandas.core.series.Series'>
1
sales.groupby(['Customer type', 'Branch', 'Payment'], as_index=False)['Quantity'].agg(['sum', 'mean']).reset_index()

Customer type Branch Payment sum mean
0 Member A Cash 308 5.500000
1 Member A Credit card 282 5.755102
2 Member A Ewallet 374 6.032258
3 Member B Cash 284 5.358491
4 Member B Credit card 371 5.888889
5 Member B Ewallet 269 5.489796
6 Member C Cash 293 4.966102
7 Member C Credit card 349 5.816667
8 Member C Ewallet 255 5.100000
9 Normal A Cash 264 4.888889
10 Normal A Credit card 298 5.418182
11 Normal A Ewallet 333 5.203125
12 Normal B Cash 344 6.035088
13 Normal B Credit card 228 4.956522
14 Normal B Ewallet 324 5.062500
15 Normal C Cash 403 6.200000
16 Normal C Credit card 194 5.105263
17 Normal C Ewallet 337 6.017857

  <script>
    const buttonEl =
      document.querySelector('#df-06183b9d-9668-4956-b5b7-bf3ab9256622 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-06183b9d-9668-4956-b5b7-bf3ab9256622');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
print(type(sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].agg(['sum', 'mean'])))
<class 'pandas.core.frame.DataFrame'>

결측치 다루기

결측치 데이터 생성

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd 
import numpy as np

dict_01 = {
'Score_A' : [80, 90, np.nan, 80],
'Score_B' : [30, 45, np.nan, np.nan],
'Score_C' : [np.nan, 50, 80, 90]
}

df = pd.DataFrame(dict_01)
df

Score_A Score_B Score_C
0 80.0 30.0 NaN
1 90.0 45.0 50.0
2 NaN NaN 80.0
3 80.0 NaN 90.0

  <script>
    const buttonEl =
      document.querySelector('#df-19e2d3df-d166-4556-96a3-7e87c689f899 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-19e2d3df-d166-4556-96a3-7e87c689f899');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
df.isnull().sum()
Score_A    1
Score_B    2
Score_C    1
dtype: int64
1
df.fillna("0")

Score_A Score_B Score_C
0 80.0 30.0 0
1 90.0 45.0 50.0
2 0 0 80.0
3 80.0 0 90.0

  <script>
    const buttonEl =
      document.querySelector('#df-8fdcaeb0-1a6d-42b3-a659-2ed1ac98f3c4 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-8fdcaeb0-1a6d-42b3-a659-2ed1ac98f3c4');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
df.fillna(method="pad")

Score_A Score_B Score_C
0 80.0 30.0 NaN
1 90.0 45.0 50.0
2 90.0 45.0 80.0
3 80.0 45.0 90.0

  <script>
    const buttonEl =
      document.querySelector('#df-0beb1959-e8a4-4550-b78e-1da677eaf1f7 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-0beb1959-e8a4-4550-b78e-1da677eaf1f7');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
3
4
5
6
7
dict_01 = {
"성별" : ["남자", "여자", np.nan, "남자"],
"Salary" : [30, 45, 90, 70]
}

df = pd.DataFrame(dict_01)
df

성별 Salary
0 남자 30
1 여자 45
2 NaN 90
3 남자 70

  <script>
    const buttonEl =
      document.querySelector('#df-cd4f0fa9-4594-42f6-b011-ec8d9c313a45 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-cd4f0fa9-4594-42f6-b011-ec8d9c313a45');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
df['성별'].fillna("성별 없음")
0       남자
1       여자
2    성별 없음
3       남자
Name: 성별, dtype: object
  • 결측치

–> 문자열 타입이랑 / 숫자 타입이랑 접근 방법이 다름
–> 문자열 (빈도 –> 가장 많이 나타나는 문자열 넣어주기!, 최빈값)
–> 숫자열 (평균, 최대, 최소, 중간, 기타 등등..)

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd 
import numpy as np

dict_01 = {
'Score_A' : [80, 90, np.nan, 80],
'Score_B' : [30, 45, np.nan, 60],
'Score_C' : [np.nan, 50, 80, 90],
'Score_D' : [50, 30, 80, 60]
}

df = pd.DataFrame(dict_01)
df

Score_A Score_B Score_C Score_D
0 80.0 30.0 NaN 50
1 90.0 45.0 50.0 30
2 NaN NaN 80.0 80
3 80.0 60.0 90.0 60

  <script>
    const buttonEl =
      document.querySelector('#df-4285262a-db62-49be-9c8d-273186ab08c4 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-4285262a-db62-49be-9c8d-273186ab08c4');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
df.dropna(axis = 1)

Score_D
0 50
1 30
2 80
3 60

  <script>
    const buttonEl =
      document.querySelector('#df-0e190d15-cdf3-464b-bcfb-9ef6867cfba1 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-0e190d15-cdf3-464b-bcfb-9ef6867cfba1');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
df.dropna(axis = 0)

Score_A Score_B Score_C Score_D
1 90.0 45.0 50.0 30
3 80.0 60.0 90.0 60

  <script>
    const buttonEl =
      document.querySelector('#df-c4cae1e4-71a0-446b-8e51-6af06230d2ee button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-c4cae1e4-71a0-446b-8e51-6af06230d2ee');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

이상치

1
sales

Invoice ID Branch City Customer type Gender Product line Unit price Quantity Date Time Payment
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 1/5/2019 13:08 Ewallet
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3/8/2019 10:29 Cash
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 3/3/2019 13:23 Credit card
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 1/27/2019 20:33 Ewallet
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 2/8/2019 10:37 Ewallet
... ... ... ... ... ... ... ... ... ... ... ...
995 233-67-5758 C Naypyitaw Normal Male Health and beauty 40.35 1 1/29/2019 13:46 Ewallet
996 303-96-2227 B Mandalay Normal Female Home and lifestyle 97.38 10 3/2/2019 17:16 Ewallet
997 727-02-1313 A Yangon Member Male Food and beverages 31.84 1 2/9/2019 13:22 Cash
998 347-56-2442 A Yangon Normal Male Home and lifestyle 65.82 1 2/22/2019 15:33 Cash
999 849-09-3807 A Yangon Member Female Fashion accessories 88.34 7 2/18/2019 13:28 Cash

1000 rows × 11 columns

  <script>
    const buttonEl =
      document.querySelector('#df-6f6e6109-861d-4be9-ba85-95e41759ab5e button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-6f6e6109-861d-4be9-ba85-95e41759ab5e');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 일반적인 통계적인 공식

  • IQR - 박스플롯 - 사분위수

  • Q0(0), Q1(25%), Q2(50%), Q3(75%), Q4(100%)

  • 이상치의 하한 경계값 : Q1 - (1.5 * (Q3-Q1))

  • 이상치의 상한 경계값 : Q3 + (1.5 * (Q3-Q1))

  • 도메인(각 비즈니스 영역, 미래 일자리)에서 바라보는 이상치 기준(관습)

1
sales[['Unit price']].describe()

Unit price
count 1000.000000
mean 55.672130
std 26.494628
min 10.080000
25% 32.875000
50% 55.230000
75% 77.935000
max 99.960000

  <script>
    const buttonEl =
      document.querySelector('#df-9b6fc1ab-b0b3-47ba-b180-b37f4ec43d85 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-9b6fc1ab-b0b3-47ba-b180-b37f4ec43d85');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
3
4
5
6
7
8
Q1 = sales['Unit price'].quantile(0.25)
Q3 = sales['Unit price'].quantile(0.75)

# Q1보다 낮은 값을 이상치로 간주
outliers_q1 = (sales['Unit price'] < Q1)

# Q3보다 높은 값을 이상치로 간주
outliers_q3 = (sales['Unit price'] > Q3)
1
print(sales['Unit price'][~(outliers_q1 | outliers_q3)])
0      74.69
2      46.33
3      58.22
6      68.84
7      73.56
       ...  
991    76.60
992    58.03
994    60.95
995    40.35
998    65.82
Name: Unit price, Length: 500, dtype: float64