[Data] fraud

Author

김보람

Published

January 18, 2024

Required imports
import pandas as pd

data

    1. https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
    2. https://www.kaggle.com/datasets/whenamancodes/fraud-detection // seems to have no y (label)
    3. https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud
    4. https://www.kaggle.com/datasets/ealaxi/paysim1
    5. https://www.kaggle.com/datasets/mishra5001/credit-card
    6. https://www.kaggle.com/datasets/joebeachcapital/credit-card-fraud // complicated..
    7. https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023
    8. https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection
    9. https://www.kaggle.com/datasets/rupakroy/online-payments-fraud-detection-dataset // no time
    10. https://www.kaggle.com/datasets/dermisfit/fraud-transactions-dataset
    11. https://www.kaggle.com/datasets/kartik2112/fraud-detection
    12. https://www.kaggle.com/datasets/vardhansiramdasu/fraudulent-transactions-prediction

- Types (numbers refer to the datasets listed above)

  • Type 1 (V1~V28): 1 / 2 / 6 / 7

  • Type 2: 3

  • Type 3: 4 / 8 / 9 / 12

  • Type 4: 5

  • Type 5 (book): 10 / 11
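The grouping above can be sketched as a small column-based check. The signature sets below are an assumption built from the column names visible in the tables later in this post; datasets that were not loaded here (2, 6, 8, 9, 11, 12) may use different names, so treat `TYPE_SIGNATURES` and `guess_type` as hypothetical helpers rather than a verified classification.

```python
# Hypothetical signature columns per type, taken from the tables shown in this post.
TYPE_SIGNATURES = {
    "type1": {"V1", "V28", "Amount", "Class"},
    "type2": {"distance_from_home", "fraud"},
    "type3": {"step", "nameOrig", "nameDest", "isFraud"},
    "type4": {"SK_ID_CURR", "TARGET"},
    "type5": {"trans_date_trans_time", "cc_num", "amt", "is_fraud"},
}


def guess_type(columns):
    """Return the first type whose signature columns all appear in `columns`."""
    cols = set(columns)
    for name, signature in TYPE_SIGNATURES.items():
        if signature <= cols:  # every signature column is present
            return name
    return "unknown"


# Column lists copied (partially) from the outputs in this post.
paysim_cols = ["step", "type", "amount", "nameOrig", "oldbalanceOrg",
               "newbalanceOrig", "nameDest", "isFraud", "isFlaggedFraud"]
book_cols = ["trans_date_trans_time", "cc_num", "merchant", "category",
             "amt", "is_fraud"]

print(guess_type(paysim_cols), guess_type(book_cols))
```

With a loaded frame the call would be `guess_type(df.columns)`.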

1. creditcardfraud

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

credicard = pd.read_csv("~/Desktop/creditcard.csv")
credicard
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
284802 172786.0 -11.881118 10.071785 -9.834783 -2.066656 -5.364473 -2.606837 -4.918215 7.305334 1.914428 ... 0.213454 0.111864 1.014480 -0.509348 1.436807 0.250034 0.943651 0.823731 0.77 0
284803 172787.0 -0.732789 -0.055080 2.035030 -0.738589 0.868229 1.058415 0.024330 0.294869 0.584800 ... 0.214205 0.924384 0.012463 -1.016226 -0.606624 -0.395255 0.068472 -0.053527 24.79 0
284804 172788.0 1.919565 -0.301254 -3.249640 -0.557828 2.630515 3.031260 -0.296827 0.708417 0.432454 ... 0.232045 0.578229 -0.037501 0.640134 0.265745 -0.087371 0.004455 -0.026561 67.88 0
284805 172788.0 -0.240440 0.530483 0.702510 0.689799 -0.377961 0.623708 -0.686180 0.679145 0.392087 ... 0.265245 0.800049 -0.163298 0.123205 -0.569159 0.546668 0.108821 0.104533 10.00 0
284806 172792.0 -0.533413 -0.189733 0.703337 -0.506271 -0.012546 -0.649617 1.577006 -0.414650 0.486180 ... 0.261057 0.643078 0.376777 0.008797 -0.473649 -0.818267 -0.002415 0.013649 217.00 0

284807 rows × 31 columns

len(set(credicard.V1))
275663
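  • V1 looks like it could be some kind of unique value, but it is not: 284,807 rows minus 275,663 distinct values leaves 9,144 repeats

  • Has time / amt / is_fraud (the Time, Amount, and Class columns)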
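The 9,144 figure is the row count minus the number of distinct V1 values. A minimal sketch of that count on a toy column (the values are made up; on the real data the frame would be `credicard`):

```python
import pandas as pd

# Toy stand-in for the V1 column, not real data.
toy = pd.DataFrame({"V1": [0.5, 0.5, 1.2, -0.3, 1.2, 1.2]})

n_rows = len(toy)
n_unique = toy["V1"].nunique()   # same idea as len(set(toy.V1)) when there are no NaNs
n_surplus = n_rows - n_unique    # rows beyond the first occurrence of each value

# Rows whose V1 value occurs more than once (all occurrences counted).
n_in_dup_groups = int(toy["V1"].duplicated(keep=False).sum())

print(n_rows, n_unique, n_surplus, n_in_dup_groups)
```

The two counts answer different questions: `n_surplus` is what the subtraction above gives, while `n_in_dup_groups` says how many rows share their V1 with at least one other row.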

2. fraud-detection

  • Seems to be the same as no. 1?

3. credit-card-fraud

card_transdata = pd.read_csv("~/Desktop/card_transdata.csv")
card_transdata
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
0 57.877857 0.311140 1.945940 1.0 1.0 0.0 0.0 0.0
1 10.829943 0.175592 1.294219 1.0 0.0 0.0 0.0 0.0
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ...
999995 2.207101 0.112651 1.626798 1.0 1.0 0.0 0.0 0.0
999996 19.872726 2.683904 2.778303 1.0 1.0 0.0 0.0 0.0
999997 2.914857 1.472687 0.218075 1.0 1.0 0.0 1.0 0.0
999998 4.258729 0.242023 0.475822 1.0 0.0 0.0 1.0 0.0
999999 58.108125 0.318110 0.386920 1.0 1.0 0.0 1.0 0.0

1000000 rows × 8 columns

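  • distance_from_home - the distance from home at which the transaction took place.

  • distance_from_last_transaction - the distance from where the last transaction took place.

  • ratio_to_median_purchase_price - the ratio of the transaction's purchase price to the median purchase price.

  • repeat_retailer - whether the transaction was made at the same retailer as before.

  • used_chip - whether the transaction was made with the chip (credit card).

  • used_pin_number - whether the transaction was made using a PIN number.

  • online_order - whether the transaction was an online order.

  • fraud - whether the transaction is fraudulent.

  • No time column..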
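The flag columns in this table are stored as floats (1.0 / 0.0). As a rough sketch of how they could be used, the snippet below casts them to integers and compares the fraud rate between online and offline orders. The six rows are invented for illustration and say nothing about the real `card_transdata` file.

```python
import pandas as pd

# Invented rows using two of the column names described above.
toy = pd.DataFrame({
    "online_order": [0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
    "fraud":        [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Cast the 1.0 / 0.0 floats to 1 / 0 integers.
flags = toy.astype(int)

# Mean of a 0/1 column is the share of 1s, i.e. the fraud rate per group.
rate_by_online = flags.groupby("online_order")["fraud"].mean()
print(rate_by_online)
```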

4. paysim1

PS = pd.read_csv("~/Desktop/PS_20174392719_1491204439457_log.csv")
PS
step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud
0 1 PAYMENT 9839.64 C1231006815 170136.00 160296.36 M1979787155 0.00 0.00 0 0
1 1 PAYMENT 1864.28 C1666544295 21249.00 19384.72 M2044282225 0.00 0.00 0 0
2 1 TRANSFER 181.00 C1305486145 181.00 0.00 C553264065 0.00 0.00 1 0
3 1 CASH_OUT 181.00 C840083671 181.00 0.00 C38997010 21182.00 0.00 1 0
4 1 PAYMENT 11668.14 C2048537720 41554.00 29885.86 M1230701703 0.00 0.00 0 0
... ... ... ... ... ... ... ... ... ... ... ...
6362615 743 CASH_OUT 339682.13 C786484425 339682.13 0.00 C776919290 0.00 339682.13 1 0
6362616 743 TRANSFER 6311409.28 C1529008245 6311409.28 0.00 C1881841831 0.00 0.00 1 0
6362617 743 CASH_OUT 6311409.28 C1162922333 6311409.28 0.00 C1365125890 68488.84 6379898.11 1 0
6362618 743 TRANSFER 850002.52 C1685995037 850002.52 0.00 C2080388513 0.00 0.00 1 0
6362619 743 CASH_OUT 850002.52 C1280323807 850002.52 0.00 C873221189 6510099.11 7360101.63 1 0

6362620 rows × 11 columns

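  • step - maps to a unit of time in the real world. Here, 1 step is 1 hour. Total steps: 744 (30-day simulation).

  • type - CASH-IN, CASH-OUT, DEBIT, PAYMENT, and TRANSFER.

  • amount - the transaction amount in local currency.

  • nameOrig - the customer who started the transaction.

  • oldbalanceOrg - initial balance before the transaction.

  • newbalanceOrig - new balance after the transaction.

  • nameDest - the customer who is the recipient of the transaction.

  • oldbalanceDest - the recipient's initial balance before the transaction. There is no information for customers whose names start with M (merchants).

  • newbalanceDest - the recipient's new balance after the transaction. There is no information for customers whose names start with M (merchants).

  • isFraud - transactions made by the fraudulent agents inside the simulation. In this dataset, the agents' fraudulent behavior aims to profit by taking control of customer accounts, then trying to empty the funds by transferring them to another account and cashing out of the system.

  • isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.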
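Two of the descriptions above translate directly into small helpers: the hourly `step` and the 200,000 transfer rule. This is a sketch under assumptions, not the dataset's own logic: `step_to_day_hour` assumes step 1 is hour 0 of day 0 (the description does not say), and `exceeds_transfer_limit` only restates the written rule. Both function names are made up for this example.

```python
def step_to_day_hour(step: int) -> tuple[int, int]:
    """Map a 1-based hourly step to (day, hour_of_day), both counted from 0.

    Assumption: step 1 corresponds to hour 0 of day 0.
    """
    return divmod(step - 1, 24)


def exceeds_transfer_limit(tx_type: str, amount: float, limit: float = 200_000) -> bool:
    """Rule as written above: a TRANSFER of more than `limit` in one transaction."""
    return tx_type == "TRANSFER" and amount > limit


# Values copied from rows of the table shown earlier.
print(step_to_day_hour(1))                              # first step
print(step_to_day_hour(743))                            # last step in the table
print(exceeds_transfer_limit("TRANSFER", 6311409.28))   # row 6362616
print(exceeds_transfer_limit("CASH_OUT", 6311409.28))   # row 6362617
```

Note that row 6362616 in the table is a TRANSFER of 6,311,409.28 yet has isFlaggedFraud = 0, so the written rule alone does not reproduce the column; it is worth comparing the two before relying on either. Also, 744 hourly steps cover 31 days of 24 hours, so the "30-day" label in the description is approximate.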

len(set(PS.nameOrig))
6353307

5. credit-card

application_data = pd.read_csv("~/Desktop/application_data.csv")
application_data
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 0 Cash loans M N N 0 157500.0 254700.0 27558.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
307507 456252 0 Cash loans F N Y 0 72000.0 269550.0 12001.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
307508 456253 0 Cash loans F N Y 0 153000.0 677664.0 29979.0 ... 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0
307509 456254 1 Cash loans F N Y 0 171000.0 370107.0 20205.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
307510 456255 0 Cash loans F N N 0 157500.0 675000.0 49117.5 ... 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0

307511 rows × 122 columns

columns_description = pd.read_csv("~/Desktop/columns_description.csv",encoding='cp1252')
columns_description
Unnamed: 0 Table Row Description Special
0 1 application_data SK_ID_CURR ID of loan in our sample NaN
1 2 application_data TARGET Target variable (1 - client with payment diffi... NaN
2 5 application_data NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
3 6 application_data CODE_GENDER Gender of the client NaN
4 7 application_data FLAG_OWN_CAR Flag if the client owns a car NaN
... ... ... ... ... ...
155 209 previous_application.csv DAYS_FIRST_DUE Relative to application date of current applic... time only relative to the application
156 210 previous_application.csv DAYS_LAST_DUE_1ST_VERSION Relative to application date of current applic... time only relative to the application
157 211 previous_application.csv DAYS_LAST_DUE Relative to application date of current applic... time only relative to the application
158 212 previous_application.csv DAYS_TERMINATION Relative to application date of current applic... time only relative to the application
159 213 previous_application.csv NFLAG_INSURED_ON_APPROVAL Did the client requested insurance during the ... NaN

160 rows × 5 columns

previous_application = pd.read_csv("~/Desktop/previous_application.csv")
previous_application
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1670209 2300464 352015 Consumer loans 14704.290 267295.5 311400.0 0.0 267295.5 WEDNESDAY 12 ... Furniture 30.0 low_normal POS industry with interest 365243.0 -508.0 362.0 -358.0 -351.0 0.0
1670210 2357031 334635 Consumer loans 6622.020 87750.0 64291.5 29250.0 87750.0 TUESDAY 15 ... Furniture 12.0 middle POS industry with interest 365243.0 -1604.0 -1274.0 -1304.0 -1297.0 0.0
1670211 2659632 249544 Consumer loans 11520.855 105237.0 102523.5 10525.5 105237.0 MONDAY 12 ... Consumer electronics 10.0 low_normal POS household with interest 365243.0 -1457.0 -1187.0 -1187.0 -1181.0 0.0
1670212 2785582 400317 Cash loans 18821.520 180000.0 191880.0 NaN 180000.0 WEDNESDAY 9 ... XNA 12.0 low_normal Cash X-Sell: low 365243.0 -1155.0 -825.0 -825.0 -817.0 1.0
1670213 2418762 261212 Cash loans 16431.300 360000.0 360000.0 NaN 360000.0 SUNDAY 10 ... XNA 48.0 middle Cash X-Sell: middle 365243.0 -1163.0 247.0 -443.0 -423.0 0.0

1670214 rows × 37 columns
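Both `application_data` and `previous_application` carry an `SK_ID_CURR` column, so one plausible way to combine them is to aggregate the previous applications per `SK_ID_CURR` and left-join the result onto the application table. The sketch below uses tiny invented frames; the `n_previous` column name is hypothetical, and whether this join is appropriate for the real files is an assumption.

```python
import pandas as pd

# Invented stand-ins that reuse the key columns from the tables above.
application = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
})
previous = pd.DataFrame({
    "SK_ID_PREV": [1, 2, 3],
    "SK_ID_CURR": [100002, 100002, 100004],
})

# Count previous applications per current-application ID.
prev_counts = (
    previous.groupby("SK_ID_CURR").size().rename("n_previous").reset_index()
)

# Left join keeps every application, including those with no history.
merged = application.merge(prev_counts, on="SK_ID_CURR", how="left")
merged["n_previous"] = merged["n_previous"].fillna(0).astype(int)
print(merged)
```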

6. credit-card-fraud

  • Seems to be the same as no. 1?

7. credit-card-fraud-detection-dataset-2023

creditcard_2023 = pd.read_csv("~/Desktop/creditcard_2023.csv")
creditcard_2023
id V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0 -0.260648 -0.469648 2.496266 -0.083724 0.129681 0.732898 0.519014 -0.130006 0.727159 ... -0.110552 0.217606 -0.134794 0.165959 0.126280 -0.434824 -0.081230 -0.151045 17982.10 0.0
1 1 0.985100 -0.356045 0.558056 -0.429654 0.277140 0.428605 0.406466 -0.133118 0.347452 ... -0.194936 -0.605761 0.079469 -0.577395 0.190090 0.296503 -0.248052 -0.064512 6531.37 0.0
2 2 -0.260272 -0.949385 1.728538 -0.457986 0.074062 1.419481 0.743511 -0.095576 -0.261297 ... -0.005020 0.702906 0.945045 -1.154666 -0.605564 -0.312895 -0.300258 -0.244718 2513.54 0.0
3 3 -0.152152 -0.508959 1.746840 -1.090178 0.249486 1.143312 0.518269 -0.065130 -0.205698 ... -0.146927 -0.038212 -0.214048 -1.893131 1.003963 -0.515950 -0.165316 0.048424 5384.44 0.0
4 4 -0.206820 -0.165280 1.527053 -0.448293 0.106125 0.530549 0.658849 -0.212660 1.049921 ... -0.106984 0.729727 -0.161666 0.312561 -0.414116 1.071126 0.023712 0.419117 14278.97 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
163352 163352 -0.452617 0.259592 -0.006192 -0.964916 -0.051024 -0.378307 0.354511 0.165575 0.616150 ... -0.212255 -0.642926 0.260726 0.082772 -0.431475 0.319001 -0.225063 -0.443086 14523.93 0.0
163353 163353 1.988546 -0.676106 -0.066960 -1.247678 0.278521 0.134391 0.348558 -0.234717 0.089843 ... 0.063047 1.204999 -0.096033 0.591697 0.462605 0.162256 -0.276319 -0.242267 12033.69 0.0
163354 163354 0.156866 -0.088362 0.153504 -0.178948 0.566320 0.753835 0.224169 -0.860127 0.012889 ... -0.263018 0.397224 -0.656675 1.578449 1.221929 -0.856416 0.051583 0.599162 15302.77 0.0
163355 163355 0.120934 -0.108950 0.578018 -0.961257 0.423398 -0.222660 0.846739 -0.201579 0.545305 ... -0.220013 -0.535015 0.061926 0.126831 -0.753613 0.330297 0.198570 0.280656 20371.26 0.0
163356 163356 1.726261 -0.440380 -0.083019 -0.377200 0.253506 -0.486139 0.515510 -0.225861 1.020000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

163357 rows × 31 columns

  • Feels the same as no. 1, except that Time is dropped and an id column is added, and id is identical to the index.
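The observation that id matches the index can be checked directly, and the same pass can surface incomplete rows such as the last one displayed above, where V21 through Class are NaN. That trailing NaN row may mean the local file was cut off mid-row, but that is a guess. The toy frame below is invented; on the real data it would be `creditcard_2023`.

```python
import pandas as pd

# Invented stand-in with an id column and a truncated final row.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "V1": [-0.26, 0.98, -0.26, 1.72],
    "Amount": [17982.10, 6531.37, 2513.54, None],
    "Class": [0.0, 0.0, 0.0, None],
})

# True only if every id equals its positional index label.
id_matches_index = bool((toy["id"] == toy.index).all())

# Rows with at least one missing value.
incomplete_rows = toy[toy.isna().any(axis=1)]

print(id_matches_index)
print(incomplete_rows)
```

If `id_matches_index` holds on the real file, the id column carries no information beyond row order and could be dropped before modeling.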

8. online-payment-fraud-detection

  • Same format as no. 4.. only the order seems to differ

9. online-payments-fraud-detection-dataset

  • Same format as no. 4

10. fraud-transactions-dataset

  • Same format as the book (downloaded the test file too, just in case)
fraudTest = pd.read_csv("~/Desktop/fraudTest.csv")
fraudTest
Unnamed: 0 trans_date_trans_time cc_num merchant category amt first last gender street ... lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud
0 0 2020-06-21 12:14:25 2291163933867244 fraud_Kirlin and Sons personal_care 2.86 Jeff Elliott M 351 Darlene Green ... 33.9659 -80.9355 333497 Mechanical engineer 1968-03-19 2da90c7d74bd46a0caf3777415b3ebd3 1371816865 33.986391 -81.200714 0
1 1 2020-06-21 12:14:33 3573030041201292 fraud_Sporer-Keebler personal_care 29.84 Joanne Williams F 3638 Marsh Union ... 40.3207 -110.4360 302 Sales professional, IT 1990-01-17 324cc204407e99f51b0d6ca0055005e7 1371816873 39.450498 -109.960431 0
2 2 2020-06-21 12:14:53 3598215285024754 fraud_Swaniawski, Nitzsche and Welch health_fitness 41.28 Ashley Lopez F 9333 Valentine Point ... 40.6729 -73.5365 34496 Librarian, public 1970-10-21 c81755dbbbea9d5c77f094348a7579be 1371816893 40.495810 -74.196111 0
3 3 2020-06-21 12:15:15 3591919803438423 fraud_Haley Group misc_pos 60.05 Brian Williams M 32941 Krystal Mill Apt. 552 ... 28.5697 -80.8191 54767 Set designer 1987-07-25 2159175b9efe66dc301f149d3d5abf8c 1371816915 28.812398 -80.883061 0
4 4 2020-06-21 12:15:17 3526826139003047 fraud_Johnston-Casper travel 3.19 Nathan Massey M 5783 Evan Roads Apt. 465 ... 44.2529 -85.0170 1126 Furniture designer 1955-07-06 57ff021bd3f328f8738bb535c302a31b 1371816917 44.959148 -85.884734 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
555714 555714 2020-12-31 23:59:07 30560609640617 fraud_Reilly and Sons health_fitness 43.77 Michael Olson M 558 Michael Estates ... 40.4931 -91.8912 519 Town planner 1966-02-13 9b1f753c79894c9f4b71f04581835ada 1388534347 39.946837 -91.333331 0
555715 555715 2020-12-31 23:59:09 3556613125071656 fraud_Hoppe-Parisian kids_pets 111.84 Jose Vasquez M 572 Davis Mountains ... 29.0393 -95.4401 28739 Futures trader 1999-12-27 2090647dac2c89a1d86c514c427f5b91 1388534349 29.661049 -96.186633 0
555716 555716 2020-12-31 23:59:15 6011724471098086 fraud_Rau-Robel kids_pets 86.88 Ann Lawson F 144 Evans Islands Apt. 683 ... 46.1966 -118.9017 3684 Musician 1981-11-29 6c5b7c8add471975aa0fec023b2e8408 1388534355 46.658340 -119.715054 0
555717 555717 2020-12-31 23:59:24 4079773899158 fraud_Breitenberg LLC travel 7.99 Eric Preston M 7020 Doyle Stream Apt. 951 ... 44.6255 -116.4493 129 Cartographer 1965-12-15 14392d723bb7737606b2700ac791b7aa 1388534364 44.470525 -117.080888 0
555718 555718 2020-12-31 23:59:34 4170689372027579 fraud_Dare-Marvin entertainment 38.13 Samuel Frey M 830 Myers Plaza Apt. 384 ... 35.6665 -97.4798 116001 Media buyer 1993-05-10 1765bb45b3aa3224b4cdcb6e7a96cee3 1388534374 36.210097 -97.036372 0

555719 rows × 23 columns

fraudTest.is_fraud.mean()
0.0038598644278853163
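The mean of a 0/1 label is the fraud rate, so the value above corresponds to roughly 0.39% of rows. The arithmetic below turns it back into counts using only the two numbers printed in this section (555,719 rows and the mean); it is a quick sanity check, not a substitute for `fraudTest.is_fraud.value_counts()`.

```python
n_rows = 555_719                     # from "555719 rows × 23 columns"
fraud_rate = 0.0038598644278853163   # from fraudTest.is_fraud.mean()

n_fraud = round(n_rows * fraud_rate)  # number of rows with is_fraud == 1
n_legit = n_rows - n_fraud            # number of rows with is_fraud == 0
legit_per_fraud = n_legit / n_fraud   # imbalance ratio

print(n_fraud, n_legit, round(legit_per_fraud))
```

With a class this rare, accuracy alone is not informative: a model that always predicts "not fraud" would already be right on more than 99% of rows.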

11. fraud-detection

  • Same as the book. Same as no. 10

12. fraudulent-transactions-prediction

  • Same as no. 4