728x90

26. Find and Remove Duplicate Rows

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')
users.shape

users.zip_code.duplicated() # 이전 열에 같은 값이 있으면 False
users.zip_code.duplicated().sum() # zip_code가 같은 148개의 duplicate 복제본이있음
users.duplicated().sum() # 모든 열이 같은 7개의 복제본이 있음

복제본 가져오기

users.loc[users.duplicated(keep='first'), :] # first가 기본값, 같은 row들중 맨 처음값을 가져온다.
users.loc[users.duplicated(keep='last'), :] # 가장 마지막 복사본을 유지한다.
users.loc[users.duplicated(keep=False), :] # 복사본 모두 표시

복제본 삭제하기

users.drop_duplicates(keep='first').shape # (936, 4)
users.drop_duplicates(keep='last').shape # (936, 4)
users.drop_duplicates(keep=False).shape # (929, 4)

특정 열만 뽑아서 복제본 조사하기

users.duplicated(subset=['age', 'zip_code']).sum() # age와 zip_code 열에서 복제본 개수를 반환

27. Avoid Setting With Copy Warning

파이썬에서 = 연산자를 쓰면,

오른쪽 변수의 주소를 왼쪽 변수가 가리키도록 합니다.

즉 값 복사가 아니라, 단순 reference 복사입니다.

따라서 이에 따라 문제가 발생하기도 하는데,

복사본을 수정하려고 했을 때, 원본이 같이 바뀔 수도 있기 때문에 이를 방지하는 경고가 뜹니다.

"SettingWithCopyWarning"이다.

import numpy as np
# 문제발생
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan # get item --> set item(copy) 순서로 진행, pandas가 오류발생

movies.content_rating.isna().sum() # NaN인것은 원래 3개 있었음.

# 해결
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan # 'single' set item(copy)
movies.content_rating.isna().sum() # NaN 65개가 추가되서 68이됨

모쪼록 이렇게 loc을 이용하는 것이 좋습니다. 가독성도 좋습니다.

하지만 loc에서도 문제가 발생할 수 있습니다.

top_movies = movies.loc[movies.star_rating >= 9, :]

# 문제발생
top_movies.loc[0, 'duration'] = 150

다음과 같이 .copy()를 이용해 값 복사를 해주면 문제가 없습니다.

# 해결
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

top_movies.head()

28. Change Display Options

pd.get_option('display.max_rows') # 기본 60개 보여준다.
pd.set_option('display.max_rows', None) # 원래 row 60개 까지 밖에 안보여주는데, row 전체를 보여준다.
# pd.reset_option('display.max_rows') # reset

pd.get_option('display.max_columns') # 기본 20개 보여준다.

pd.get_option('display.max_colwidth')
pd.set_option('display.max_colwidth', 100) # No more dot dot dot ... 100자 까지 보인다.


pd.get_option('display.precision') # 원래는 소수점 6자리 까지 보여준다.
pd.set_option('display.precision', 2) # 소수점 둘째짜리까지 보이게 한다.
# pd.reset_option('display.precision')

pd.set_option('display.float_format', '{:,}'.format) # float 포맷에만 적용가능(int에는 불가) 1000 --> 1,000 이런식이 됨

pd.describe_option('rows') # 모든 옵션에 대한 설명이 있음, 키워드로 검색가능
pd.reset_option('all')

29. Create DataFrame from another object

# dictionary로 작성
df = pd.DataFrame({'id': [100, 101, 102], 'color': ['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])

# list로 작성
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])

import numpy as np
arr = np.random.rand(4, 2) # 4 x 2 random numpy array
pd.DataFrame(arr, columns=['one', 'two'])

pd.DataFrame({'student': np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')

s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')

pd.concat([df, s], axis=1)
>>
    id    color    shape
a    100    red    NaN
b    101    blue    square
c    102    red    round

30. Apply a fuction to pandas Series or DataFrame

.apply로 다양한 함수를 적용할 수 있습니다.

train['Sex_num'] = train.Sex.map({'female': 0, 'male': 1})
train.loc[0:4, ['Sex', 'Sex_num']]
>>
    Sex    Sex_num
0    male    1
1    female    0
2    female    0
3    female    0
4    male    1

train['Name_length'] = train.Name.apply(len) # 새로운 열에 len함수적용
train.loc[0:4, ['Name', 'Name_length']]

train['Fare_ceil'] = train.Fare.apply(np.ceil)
train.loc[0:4, ['Fare', 'Fare_ceil']]

train.Name.str.split(',').head()

def get_element(my_list, position):
    return my_list[position]

train.Name.str.split(',').apply(get_element, position=0)

train.Name.str.split(',').apply(lambda x: x[0]).head()

drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1)
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1)
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)
drinks.loc[:, 'beer_servings':'wine_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)
drinks.head()

31. Use MultiIndex

stocks = pd.read_csv('http://bit.ly/smallstocks')
stocks.head()
>>
    Date    Close    Volume    Symbol
0    2016-10-03    31.50    14070500    CSCO
1    2016-10-03    112.52    21701800    AAPL
2    2016-10-03    57.42    19189500    MSFT
3    2016-10-04    113.00    29736800    AAPL
4    2016-10-04    57.24    20085900    MSFT

stocks.index

stocks.groupby('Symbol').Close.mean()
ser = stocks.groupby(['Symbol', 'Date']).Close.mean()
ser.index
>>
MultiIndex([('AAPL', '2016-10-03'),
            ('AAPL', '2016-10-04'),
            ('AAPL', '2016-10-05'),
            ('CSCO', '2016-10-03'),
            ('CSCO', '2016-10-04'),
            ('CSCO', '2016-10-05'),
            ('MSFT', '2016-10-03'),
            ('MSFT', '2016-10-04'),
            ('MSFT', '2016-10-05')],
           names=['Symbol', 'Date'])

.unstack() 을 이용해서 MultiIndex를 DataFrame으로 만들 수 있습니다.

ser.unstack()
>>
Date    2016-10-03    2016-10-04    2016-10-05
Symbol            
AAPL    112.52    113.00    113.05
CSCO    31.50    31.35    31.59
MSFT    57.42    57.24    57.64

pivot_table로도 만들 수 있습니다.

df = stocks.pivot_table(values='Close', index='Symbol', columns='Date')
df
>>
Date    2016-10-03    2016-10-04    2016-10-05
Symbol            
AAPL    112.52    113.00    113.05
CSCO    31.50    31.35    31.59
MSFT    57.42    57.24    57.64

ser.loc['AAPL'] # ser를 df로 바꾸어도 같음
ser.loc['AAPL', '2016-10-03'] # ser를 df로 바꾸어도 같음
ser.loc[:, '2016-10-03'] # ser를 df로 바꾸어도 같음

그 외 여러가지로 연습해 볼 수 있습니다.

stocks.set_index(['Symbol', 'Date'], inplace=True)
stocks
stocks.index
stocks.sort_index(inplace=True)
stocks

stocks.loc['AAPL']
stocks.loc[('AAPL', '2016-10-03'), :]
stocks.loc[('AAPL', '2016-10-03'), 'Close']
stocks.loc[(['AAPL', 'MSFT'], '2016-10-03'), :]
stocks.loc[('AAPL', ['2016-10-03', '2016-10-04']), :]

stocks.loc[(slice(None), ['2016-10-03', '2016-10-04']), :]
>>
        Close    Volume
Symbol    Date        
AAPL    2016-10-03    112.52    21701800
2016-10-04    113.00    29736800
CSCO    2016-10-03    31.50    14070500
2016-10-04    31.35    18460400
MSFT    2016-10-03    57.42    19189500
2016-10-04    57.24    20085900

멀티 인덱스는 ()로 묶어서 각각의 인덱스를 표현할 수 있습니다.

() 안에는 []로 여러개의 row가 선택가능합니다.

마지막으로 multiindex를 합치기 위해 merge를 이용해보겠습니다.

close = pd.read_csv('http://bit.ly/smallstocks', usecols=[0, 1, 3], index_col=['Symbol', 'Date'])

volume = pd.read_csv('http://bit.ly/smallstocks', usecols=[0, 2, 3], index_col=['Symbol', 'Date'])

both = pd.merge(close, volume, left_index=True, right_index=True)
>>
        Close    Volume
Symbol    Date        
CSCO    2016-10-03    31.50    14070500
AAPL    2016-10-03    112.52    21701800
MSFT    2016-10-03    57.42    19189500
AAPL    2016-10-04    113.00    29736800
MSFT    2016-10-04    57.24    20085900
CSCO    2016-10-04    31.35    18460400
MSFT    2016-10-05    57.64    16726400
CSCO    2016-10-05    31.59    11808600
AAPL    2016-10-05    113.05    21453100

이렇게 생성된 multiIndex를 해제 할 수도 있습니다.

both.reset_index()
>>

Symbol    Date    Close    Volume
0    CSCO    2016-10-03    31.50    14070500
1    AAPL    2016-10-03    112.52    21701800
2    MSFT    2016-10-03    57.42    19189500
3    AAPL    2016-10-04    113.00    29736800
4    MSFT    2016-10-04    57.24    20085900
5    CSCO    2016-10-04    31.35    18460400
6    MSFT    2016-10-05    57.64    16726400
7    CSCO    2016-10-05    31.59    11808600
8    AAPL    2016-10-05    113.05    21453100

[참고 - Data School https://www.youtube.com/watch?v=-NbY7E9hKxk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=32]

728x90

저작자표시

'ML, DL > Data Analysis' 카테고리의 다른 글

R 데이터 타입, 그래픽, 데이터 마트, 결측값, 이상값 (77)	2020.08.25
[Pandas] Part 3. Dataframe smaller and faster, Dummy, Dates and times, (981)	2019.10.12
[Pandas] Part 2. Groupby, Describe, Missing Value, String, Index (993)	2019.10.12
[Pandas] Part 1. DataFrame, Series, Rename, Remove, Sort, Filter (961)	2019.10.12
[Tutorial] scikit-learn과 pandas사용해서 kaggle submission 파일 만들기 (965)	2019.10.11

[Pandas] Part 4. Duplicate, SettingWithCopyWarnings, Display options, Apply fuction, MultiIndex

26. Find and Remove Duplicate Rows

27. Avoid Setting With Copy Warning

28. Change Display Options

29. Create DataFrame from another object

30. Apply a fuction to pandas Series or DataFrame

31. Use MultiIndex

'ML, DL > Data Analysis' 카테고리의 다른 글

댓글

티스토리툴바

[Pandas] Part 4. Duplicate, SettingWithCopyWarnings, Display options, Apply fuction, MultiIndex

26. Find and Remove Duplicate Rows

27. Avoid Setting With Copy Warning

28. Change Display Options

29. Create DataFrame from another object

30. Apply a fuction to pandas Series or DataFrame

31. Use MultiIndex

'ML, DL > Data Analysis' 카테고리의 다른 글

관련글

댓글

티스토리툴바