Wanna Be a Fraudster on Shopee? Think Twice!
Ratings are very important for sellers on an online marketplace. Buyers benefit from them too, since a product’s rating signals whether the product is of good quality. But what if the rating is manipulated by the seller? Then only the seller benefits. Rating manipulation has become increasingly popular among online sellers. To manipulate a rating, a seller uses another account to buy products from their own store, then gives them five stars so that the product’s rating is boosted. This is a big problem for a marketplace because it erodes customers’ trust. In this article, we will discuss how Shopee utilizes data analytics to tackle this issue.
First things first: I do not know whether the algorithm discussed here is actually used by Shopee. It is a solution to Shopee’s data analytics competition “I’m the Best Coder 2019”, which is hosted on Kaggle (https://www.kaggle.com/c/ptr-rd2-ahy). We will use Python and Pandas to analyze the data and solve the problem.
Here is the task. We are given thousands of transaction records, together with the identifying details of buyers and sellers (bank account, device, and credit card). An order is deemed fraudulent when its buyer and seller share the same bank account, device, or credit card. Our job is to label those orders.
Without further ado, let’s jump right into the code. First, we import the Pandas library and assign the data to variables.
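As a minimal sketch of this loading step: the CSV file names below and the helper `load_tables` are assumptions made for illustration, so adjust them to match the files actually downloaded from the competition page.

```python
from pathlib import Path

import pandas as pd

# Assumed file names (hypothetical); rename to match the downloaded files.
FILES = {
    "bank_accounts": "bank_accounts.csv",
    "credit_cards": "credit_cards.csv",
    "device": "devices.csv",
    "orders": "orders.csv",
}


def load_tables(data_dir="."):
    """Read every CSV listed in FILES from data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / fname) for name, fname in FILES.items()}
```

With the files in a folder called `data` (again an assumption), `tables = load_tables("data")` followed by `orders = tables["orders"]`, and likewise for the other keys, would give one variable per table.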
A few notes on the variables and terms used in this article:
- Each userid represents a unique Shopee account.
- Each orderid represents a unique transaction on Shopee.
- The variable ‘bank_accounts’ stores a list of bank accounts (already encrypted) and their owners.
- The variable ‘credit_cards’ stores a list of credit card numbers (already encrypted) and their owners.
- The variable ‘orders’ stores the transaction data.
Here is an overview of those variables:
Next, we initialize a list called “is_fraud”, which will label the fraudulent orders with ‘1’. To find fraudulent orders based on bank account, we first create a variable called ‘dup_bank’. It gives us a list of the bank accounts that are owned by more than one Shopee account. Here is an overview of the ‘dup_bank’ variable.
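One way to build such a list is sketched below on a tiny made-up table. The column names `userid` and `bank_account` are assumptions about the real file, and the values are toy data rather than competition data.

```python
import pandas as pd

# Toy stand-in for the real table; column names are assumed.
bank_accounts = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "bank_account": ["A", "A", "B", "C", "C"],
})

# Count the distinct userids per bank account, then keep the accounts
# that belong to more than one userid.
owners_per_account = bank_accounts.groupby("bank_account")["userid"].nunique()
dup_bank = owners_per_account[owners_per_account > 1].index.tolist()

print(dup_bank)  # ['A', 'C']
```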
Then, we pick one bank account from ‘dup_bank’ and call it ‘x’. We create a DataFrame ‘a’ that lists the userids owning x. Next, we create a DataFrame ‘b’ that stores the orders whose seller and buyer both appear in ‘a’. The orders listed in ‘b’ are labelled as fraudulent. We repeat this for every bank account in ‘dup_bank’.
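The loop over ‘dup_bank’ could look roughly like the sketch below. It runs on toy data, the column names `buyer_userid` and `seller_userid` are assumptions, and ‘a’ is kept as a Series of userids instead of a full DataFrame for brevity.

```python
import pandas as pd

# Toy stand-ins for the real tables; column names are assumed.
bank_accounts = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "bank_account": ["A", "A", "B", "C", "C"],
})
orders = pd.DataFrame({
    "orderid": [100, 101, 102, 103],
    "buyer_userid": [1, 3, 4, 2],
    "seller_userid": [2, 4, 5, 5],
})
dup_bank = ["A", "C"]  # bank accounts owned by more than one userid

fraud_orderids = set()
for x in dup_bank:
    # 'a': the userids that own bank account x.
    a = bank_accounts.loc[bank_accounts["bank_account"] == x, "userid"]
    # 'b': the orders whose buyer and seller both appear in 'a'.
    b = orders[orders["buyer_userid"].isin(a) & orders["seller_userid"].isin(a)]
    fraud_orderids.update(b["orderid"].tolist())

# One label per order: 1 for fraudulent, 0 otherwise.
is_fraud = orders["orderid"].isin(list(fraud_orderids)).astype(int).tolist()

print(is_fraud)  # [1, 0, 1, 0]
```

Order 103 stays at 0 because its buyer (userid 2) and seller (userid 5) each share a bank account with someone else, but not with each other.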
We repeat the same algorithm on ‘credit_cards’ and ‘device’, then compile each orderid and its label into a DataFrame ‘submission’.
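To avoid copying the same loop three times, the steps can be wrapped in a helper. The sketch below is one possible arrangement, not necessarily the original code: the function name `shared_identifier_orders`, all column names, and the toy values are assumptions.

```python
import pandas as pd


def shared_identifier_orders(orders, owners, key):
    """Return the orderids whose buyer and seller share a value in owners[key]."""
    flagged = set()
    # Only identifiers owned by more than one userid can link a buyer to a seller.
    counts = owners.groupby(key)["userid"].nunique()
    for x in counts[counts > 1].index:
        a = owners.loc[owners[key] == x, "userid"]
        b = orders[orders["buyer_userid"].isin(a) & orders["seller_userid"].isin(a)]
        flagged.update(b["orderid"].tolist())
    return flagged


# Toy stand-ins for the real tables.
bank_accounts = pd.DataFrame({"userid": [1, 2], "bank_account": ["A", "A"]})
credit_cards = pd.DataFrame({"userid": [3, 4], "credit_card": ["K", "K"]})
device = pd.DataFrame({"userid": [1, 5, 6], "device": ["D1", "D2", "D2"]})
orders = pd.DataFrame({
    "orderid": [100, 101, 102, 103],
    "buyer_userid": [1, 3, 5, 1],
    "seller_userid": [2, 4, 6, 6],
})

# Collect the flagged orders from all three identifier tables.
fraud_orderids = set()
for owners, key in [(bank_accounts, "bank_account"),
                    (credit_cards, "credit_card"),
                    (device, "device")]:
    fraud_orderids |= shared_identifier_orders(orders, owners, key)

submission = pd.DataFrame({
    "orderid": orders["orderid"],
    "is_fraud": orders["orderid"].isin(list(fraud_orderids)).astype(int),
})
print(submission)
```

Calling `submission.to_csv("submission.csv", index=False)` would then write the result to disk; the output file name here is only a placeholder.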
The DataFrame ‘submission’ should look like this.
In conclusion, this algorithm is not perfect. It scores 0.98472 out of 1 when submitted on Kaggle (https://www.kaggle.com/c/ptr-rd2-ahy). Any comments and suggestions are very welcome 😄