Python Library/Pandas

[Pandas - Python] Data Aggregation - agg() (데이터 종합)

바보1 2022. 6. 16. 01:01
df


        key1	key2	data1		data2
0	a	one	-0.204708	1.393406
1	a	two	0.478943	0.092908
2	b	one	-0.519439	0.281746
3	b	two	-0.555730	0.769023
4	a	one	1.965781	1.246435

이때 key1을 기준으로 그룹화 한 뒤, data1에 해당하는 데이터를 연산할 수 있습니다.

grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    1.668413
b   -0.523068
Name: data1, dtype: float64
grouped.describe()

	data1	data2
count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
key1																
a	3.0	0.746672	1.109736	-0.204708	0.137118	0.478943	1.222362	1.965781	3.0	0.910916	0.712217	0.092908	0.669671	1.246435	1.319920	1.393406
b	2.0	-0.537585	0.025662	-0.555730	-0.546657	-0.537585	-0.528512	-0.519439	2.0	0.525384	0.344556	0.281746	0.403565	0.525384	0.647203	0.769023

Column-Wise and Multiple Function Application

 

 

기존의 데이터를 불러오겠습니다.

tips = pd.read_csv('examples/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]



total_bill	tip	smoker	day	time	size	tip_pct
0	16.99	1.01	No	Sun	Dinner	2	0.059447
1	10.34	1.66	No	Sun	Dinner	3	0.160542
2	21.01	3.50	No	Sun	Dinner	3	0.166587
3	23.68	3.31	No	Sun	Dinner	2	0.139780
4	24.59	3.61	No	Sun	Dinner	4	0.146808
5	25.29	4.71	No	Sun	Dinner	4	0.186240
grouped = tips.groupby(['day', 'smoker'])
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')


day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

그런 다음 day, smoker로 그룹화하고, 그 그룹에서 tip_pct에 해당하는 열의 데이터를 불러왔습니다.

이후 .agg() 함수를 통해 평균에 대한 종합을 불러왔습니다.

 

다른 예시를 보겠습니다.

def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)

grouped_pct.agg(['mean', 'std', peak_to_peak])


                mean		std		peak_to_peak
day	smoker			
Fri	No	0.151650	0.028123	0.067349
        Yes	0.174783	0.051293	0.159925
Sat	No	0.158048	0.039767	0.235193
        Yes	0.147906	0.061375	0.290095
Sun	No	0.160113	0.042347	0.193226
        Yes	0.187250	0.154134	0.644685
Thur	No	0.160298	0.038774	0.193350
        Yes	0.163863	0.039389	0.151240

mean, std, peak_to_peak() 함수를 불러왔습니다.

 

이때 (name, function) 형태의 리스트를 넘기면, 그에 해당하는 이름이 컬럼 이름으로 되면서, 해당 함수를 실행합니다.

grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])


            	foo		bar
day	smoker		
Fri	No	0.151650	0.028123
        Yes	0.174783	0.051293
Sat	No	0.158048	0.039767
        Yes	0.147906	0.061375
Sun	No	0.160113	0.042347
        Yes	0.187250	0.154134
Thur	No	0.160298	0.038774
        Yes	0.163863	0.039389

 

functions = ['count', 'mean', 'max']
result = grouped['tip_pct', 'total_bill'].agg(functions)
result


            	tip_pct					total_bill
        	count	mean		max		count	mean		max
day	smoker						
Fri	No	4	0.151650	0.187735	4	18.420000	22.75
        Yes	15	0.174783	0.263480	15	16.813333	40.17
Sat	No	45	0.158048	0.291990	45	19.661778	48.33
        Yes	42	0.147906	0.325733	42	21.276667	50.81
Sun	No	57	0.160113	0.252672	57	20.506667	48.17
        Yes	19	0.187250	0.710345	19	24.120000	45.35
Thur	No	45	0.160298	0.266312	45	17.113111	41.19
        Yes	17	0.163863	0.241255	17	19.190588	43.11
result['tip_pct']


		count	mean		max
day	smoker			
Fri	No	4	0.151650	0.187735
	Yes	15	0.174783	0.263480
Sat	No	45	0.158048	0.291990
	Yes	42	0.147906	0.325733
Sun	No	57	0.160113	0.252672
	Yes	19	0.187250	0.710345
Thur	No	45	0.160298	0.266312
	Yes	17	0.163863	0.241255

이해가 되시나요?

 

또 다른 예시입니다.

ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped['tip_pct', 'total_bill'].agg(ftuples)


		tip_pct				total_bill
		Durchschnitt	Abweichung	Durchschnitt	Abweichung
day	smoker				
Fri	No	0.151650	0.000791	18.420000	25.596333
        Yes	0.174783	0.002631	16.813333	82.562438
Sat	No	0.158048	0.001581	19.661778	79.908965
        Yes	0.147906	0.003767	21.276667	101.387535
Sun	No	0.160113	0.001793	20.506667	66.099980
        Yes	0.187250	0.023757	24.120000	109.046044
Thur	No	0.160298	0.001503	17.113111	59.625081
        Yes	0.163863	0.001551	19.190588	69.808518

이 뿐만 아니라, {name : function}에 해당하는 딕셔너리를 넣어서 실행할 수 있습니다.

grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})
             


		tip_pct								size
		min		max		mean		std		sum
day	smoker					
Fri	No	0.120385	0.187735	0.151650	0.028123	9
        Yes	0.103555	0.263480	0.174783	0.051293	31
Sat	No	0.056797	0.291990	0.158048	0.039767	115
        Yes	0.035638	0.325733	0.147906	0.061375	104
Sun	No	0.059447	0.252672	0.160113	0.042347	167
        Yes	0.065660	0.710345	0.187250	0.154134	49
Thur	No	0.072961	0.266312	0.160298	0.038774	112
        Yes	0.090014	0.241255	0.163863	0.039389	40

Returning Aggregated Data Without Row Indexes

 

 

 

이처럼 인덱스 없이 불러올 수 있습니다.

tips.groupby(['day', 'smoker'], as_index=False).mean()


	day	smoker	total_bill	tip		size		tip_pct
0	Fri	No	18.420000	2.812500	2.250000	0.151650
1	Fri	Yes	16.813333	2.714000	2.066667	0.174783
2	Sat	No	19.661778	3.102889	2.555556	0.158048
3	Sat	Yes	21.276667	2.875476	2.476190	0.147906
4	Sun	No	20.506667	3.167895	2.929825	0.160113
5	Sun	Yes	24.120000	3.516842	2.578947	0.187250
6	Thur	No	17.113111	2.673778	2.488889	0.160298
7	Thur	Yes	19.190588	3.030000	2.352941	0.163863

원래라면,

tips.groupby(['day', 'smoker']).mean()



		total_bill	tip		size		tip_pct
day	smoker				
Fri	No	18.420000	2.812500	2.250000	0.151650
        Yes	16.813333	2.714000	2.066667	0.174783
Sat	No	19.661778	3.102889	2.555556	0.158048
        Yes	21.276667	2.875476	2.476190	0.147906
Sun	No	20.506667	3.167895	2.929825	0.160113
        Yes	24.120000	3.516842	2.578947	0.187250
Thur	No	17.113111	2.673778	2.488889	0.160298
        Yes	19.190588	3.030000	2.352941	0.163863

이렇게 불러와야 합니다.