[Pandas] The Pandas Library and Series

바보1 2022. 2. 4. 19:12

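I used Jupyter Notebook to study the big-data material in these posts.

1. What is Pandas?

Pandas is a Python library for working with data.

It handles fairly large datasets reliably (as long as they fit in memory) and makes two-dimensional, table-like data easy to work with.

That's why most people who study data in Python end up using Pandas.

That said, I think what matters is learning about big data itself, not learning a library.

Anyway, I'll be studying data through Pandas.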

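2. Installation

I'm running Jupyter Notebook from an Anaconda install, and Anaconda already bundles Pandas, so there's no need to install it with pip install or through Settings.

A single line in the notebook, import pandas as pd, is enough to load it.

If you'd rather use PyCharm, either run

pip install pandas

or add the package yourself under Settings.

Since I'm on Jupyter Notebook, I'll just start with import pandas as pd.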
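A quick way to confirm the setup works is to import the library and print its version. This is just a minimal check, and the version string you see depends on your own installation.

```python
import pandas as pd

# If this line runs without ImportError, pandas is installed
# in the environment the notebook (or PyCharm) is using.
print(pd.__version__)
```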

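3. Series

A Series is a data structure for one-dimensional data, in other words an array.

The simplest way to create one is to pass a list to pd.Series.

Creating a Series

import pandas as pd

temp = pd.Series([-20, -10, 10, 20])
print(temp)

Pass in a list like this and a Series object is built from it. Printing it gives: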

0   -20
1   -10
2    10
3    20
dtype: int64

And there are the values!

To look up a value by its index number, use temp[index].

temp[0]

>>> -20

That's how you look up a value by its index number.
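With the default index, temp[0] is really a lookup of the label 0, which happens to match position 0. The post doesn't use them here, but the .loc (by label) and .iloc (by position) accessors make the intent explicit. A minimal sketch:

```python
import pandas as pd

temp = pd.Series([-20, -10, 10, 20])

# .iloc selects by position, .loc selects by index label.
first_by_position = temp.iloc[0]
first_by_label = temp.loc[0]

# With the default 0..n-1 index, both refer to the same element.
print(first_by_position, first_by_label)

# Negative positions count from the end, as with a list.
last = temp.iloc[-1]
print(last)
```

Once the index holds custom labels, .iloc keeps working by position while .loc expects those labels.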

Creating a Series with a custom index

As you can see, if we don't specify an index ourselves, it starts from 0.

We can set the index ourselves, so let's try it.

temp = pd.Series([-20, -10, 10, 20], index=['One', 'Two', "Three", "Four"])
print(temp)

Pass the labels as a list with index=[...] after the values, and you get:

One     -20
Two     -10
Three    10
Four     20
dtype: int64

The index now shows the labels I passed in.

As before, you can look up a value through its index label.

temp['One']

>>> -20

Other labels work too:

temp['Four']

>>> 20

This comes out as expected as well.

But if you pass a label that isn't in the index,

temp['Five']

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Five'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18000/4273798764.py in <module>
----> 1 temp['Five']

~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    940 
    941         elif key_is_scalar:
--> 942             return self._get_value(key)
    943 
    944         if is_hashable(key):

~\anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
   1049 
   1050         # Similar to Index.get_value, but we do not fall back to positional
-> 1051         loc = self.index.get_loc(label)
   1052         return self.index._get_values_for_loc(self, loc, label)
   1053 

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'Five'

Anyway, we get a KeyError!
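If a label might be missing, there are a few ways to avoid the long traceback. The sketch below shows three options: Series.get with a default, a membership check with in, and catching the KeyError. The fallback values are just for illustration.

```python
import pandas as pd

temp = pd.Series([-20, -10, 10, 20], index=['One', 'Two', 'Three', 'Four'])

# Series.get returns a default instead of raising KeyError.
missing = temp.get('Five')        # None when no default is given
fallback = temp.get('Five', 0)    # explicit default value

# Check that the label exists before indexing.
if 'Five' in temp:
    value = temp['Five']
else:
    value = None

# Or catch the exception directly.
try:
    temp['Five']
except KeyError as err:
    caught = err
    print('no such label:', err)
```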

Printing the values (returned as a NumPy array)

temp.values

>>> [-20 -10  10  20]

Printing the index

temp.index

>>> Index(['One', 'Two', 'Three', 'Four'], dtype='object')
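Note that temp.values is a NumPy array and temp.index is a Pandas Index object, not Python lists. If plain lists are needed, the following sketch shows one way to convert them.

```python
import numpy as np
import pandas as pd

temp = pd.Series([-20, -10, 10, 20], index=['One', 'Two', 'Three', 'Four'])

# to_numpy() returns the values as a NumPy array.
values_array = temp.to_numpy()

# tolist() converts the values or the index to a plain Python list.
values_list = temp.tolist()
index_list = temp.index.tolist()

print(type(values_array))
print(values_list)
print(index_list)
```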

Selecting several index labels at once

temp[['One', 'Four']]

One    -20
Four    20
dtype: int64

Series arithmetic

temp * 2

One     -40
Two     -20
Three    20
Four     40
dtype: int64
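Arithmetic with a single number is applied to every element. When two Series are combined, Pandas matches elements by index label rather than by position. The second Series below, other, is a made-up example to show that alignment.

```python
import pandas as pd

temp = pd.Series([-20, -10, 10, 20], index=['One', 'Two', 'Three', 'Four'])

# Hypothetical second Series: different order, and one label temp lacks.
other = pd.Series([1, 2, 3], index=['Four', 'One', 'Six'])

# Labels found in only one of the two Series produce NaN.
total = temp + other
print(total)

# Series.add with fill_value treats a missing side as 0 instead.
filled = temp.add(other, fill_value=0)
print(filled)
```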

Checking whether an index label or a value exists

'One' in temp

>>> True

10 in temp.values

>>> True

Everything comes out as expected!
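One thing to watch out for: the in operator on a Series checks the index labels, not the values. That's why the value check above goes through temp.values. A short sketch of the difference, with Series.isin as an alternative way to test values:

```python
import pandas as pd

temp = pd.Series([-20, -10, 10, 20], index=['One', 'Two', 'Three', 'Four'])

label_found = 'One' in temp        # checks index labels
value_as_label = 10 in temp        # 10 is a value, not a label
value_found = 10 in temp.values    # checks the underlying values

# isin builds a boolean Series; any() tells whether at least one value matched.
value_found_isin = temp.isin([10]).any()

print(label_found, value_as_label, value_found, value_found_isin)
```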

4. Creating a Series from a dictionary

Instead of a list, you can pass a dictionary to create a Series.

In that case the dictionary keys automatically become the index.

data = {'hi': 10, 'hello': 20, 'guten_tag': 30}
obj = pd.Series(data)
obj

hi           10
hello        20
guten_tag    30
dtype: int64

So even without specifying an index, the keys of the dictionary are used.

Lookups and printing the values or index all work the same way, of course.

Now, what happens if we set the index again?

data = {'hi': 10, 'hello': 20, 'guten_tag': 30}
index_data = ['hello', 'guten_tag', 'hi', '안녕']
obj = pd.Series(data, index= index_data)
obj

hello        20.0
guten_tag    30.0
hi           10.0
안녕            NaN
dtype: float64

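As you can see, the values are rearranged to follow the order of the index we passed.

We also added a new label, '안녕', which has no matching key in the dictionary, so its value shows up as NaN, meaning the value is missing. Because NaN is a floating-point value, the dtype also changes from int64 to float64.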
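Passing both a dictionary and an index behaves like building the Series first and then reindexing it. The sketch below compares the two and shows reindex with fill_value, which is one way to avoid NaN when a default value makes sense (the 0 here is an arbitrary choice).

```python
import pandas as pd

data = {'hi': 10, 'hello': 20, 'guten_tag': 30}
index_data = ['hello', 'guten_tag', 'hi', '안녕']

obj = pd.Series(data, index=index_data)

# Build from the dictionary first, then reorder with reindex.
reindexed = pd.Series(data).reindex(index_data)
same = obj.equals(reindexed)
print(same)

# fill_value supplies a value for labels that have no data,
# so no NaN appears and the values stay integers.
filled = pd.Series(data).reindex(index_data, fill_value=0)
print(filled)
```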

5. Finding missing data with isnull and notnull

I used the obj data from above.

pd.isnull(obj)

hello        False
guten_tag    False
hi           False
안녕            True
dtype: bool

With isnull, positions that hold null (missing) data become True and everything else becomes False.

pd.notnull(obj)

hello         True
guten_tag     True
hi            True
안녕           False
dtype: bool

Naturally, notnull gives the opposite result.
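These True/False results become useful as filters. Here is a small sketch that counts the missing entries, keeps only the non-missing ones, and fills the gap with a default (0 is just an example value).

```python
import pandas as pd

data = {'hi': 10, 'hello': 20, 'guten_tag': 30}
obj = pd.Series(data, index=['hello', 'guten_tag', 'hi', '안녕'])

# True counts as 1, so sum() gives the number of missing values.
missing_count = obj.isnull().sum()
print(missing_count)

# A boolean Series can be used to select rows.
present = obj[obj.notnull()]

# dropna() is a shortcut for the same selection.
dropped = obj.dropna()

# fillna() replaces missing values instead of removing them.
filled = obj.fillna(0)
print(filled)
```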

6. Setting names

You can give a Series object a name through name, and name its index through index.name.

obj.index.name = '인사'
obj.name = '데이터'
obj

인사
hello        20.0
guten_tag    30.0
hi           10.0
안녕            NaN
Name: 데이터, dtype: float64
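The names matter most when a Series is turned into a table. As a rough sketch, to_frame uses the Series name as the column name and keeps the index name. The name can also be passed when the Series is created, or changed later with rename ('values' below is an arbitrary new name).

```python
import pandas as pd

# name can be set at creation time as well.
obj = pd.Series({'hello': 20, 'guten_tag': 30, 'hi': 10}, name='데이터')
obj.index.name = '인사'

# The Series name becomes the column name of the DataFrame.
frame = obj.to_frame()
print(frame.columns.tolist())
print(frame.index.name)

# rename with a single value returns a new Series with a different name.
renamed = obj.rename('values')
print(renamed.name)
```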

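7. Summary

In the end, Pandas seems to be pretty much essential for working with data.

To work with one-dimensional arrays in Pandas, use a Series.

A Series object can be created from a list or a dictionary, among other inputs.

Beyond that, there's nothing particularly special.

As for things like interpolation, I'll post about them later if they become necessary.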

References:

나도코딩

https://eunguru.tistory.com/220
