Wanna Be a Fraudster on Shopee? Think Twice!

Rizky Ramadhana
3 min read · Oct 17, 2020
Picture 1. The algorithm overview

Ratings are very important for sellers on an online marketplace. Buyers benefit from product ratings too, since they tell us whether a product is of good quality. But what if the ratings are manipulated by sellers? Then only the seller benefits. Manipulating product ratings has become increasingly popular among online sellers. To do it, a seller uses another account to buy products from their own store, then gives them five stars to boost the product’s rating. This is a big problem for a marketplace, since it erodes customers’ trust. In this article, we will discuss how Shopee could use data analytics to tackle this issue.

First things first: I do not know whether the algorithm discussed here is really used by Shopee. It is a solution to Shopee’s data analytics competition called “I’m the Best Coder 2019”, which is hosted on Kaggle (https://www.kaggle.com/c/ptr-rd2-ahy). We will use Python and Pandas to analyze the data and solve the problem.

Here is the task. We are given thousands of transaction records, along with identity data for buyers and sellers (bank accounts, devices, and credit cards). An order is deemed fraudulent when its buyer and seller share the same bank account, device, or credit card. Our job is to label those orders.

Without further ado, let’s jump right into the code. First, we import the Pandas library and load the data into variables.
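A minimal sketch of this loading step could look like the following. The column names (userid, bank_account, orderid, buyer_userid, seller_userid) are assumptions about the competition files, and in-memory text stands in for the real CSV files so that the snippet runs on its own. Dropping duplicate rows is only one plausible cleaning step.

```python
import io

import pandas as pd

# In-memory stand-ins for the competition CSV files.
# The column names here are assumptions, not confirmed by the article.
bank_csv = io.StringIO("userid,bank_account\n1,A\n2,A\n3,B\n3,B\n")
orders_csv = io.StringIO("orderid,buyer_userid,seller_userid\n100,1,2\n101,3,1\n")

# Load each table, then drop exact duplicate rows as a simple cleaning step.
bank_accounts = pd.read_csv(bank_csv).drop_duplicates()
orders = pd.read_csv(orders_csv).drop_duplicates()

print(bank_accounts)
print(orders)
```

With the real data, the `io.StringIO` objects would be replaced by the paths to the downloaded files.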

Code 1. Assigning and cleaning the data

A few notes on the variables and terms used in this article:

  1. Each userid represents a unique Shopee account.
  2. Each orderid represents a unique transaction on Shopee.
  3. The variable ‘bank_accounts’ stores a list of bank accounts (already encrypted) and their owners.
  4. The variable ‘credit_card’ stores a list of credit card numbers (already encrypted) and their owners.
  5. The variable ‘orders’ stores the transaction data.

Here is an overview of those variables:

Code 2. What the data looks like

Next, we initialize a list called “is_fraud”, which will mark fraudulent orders with ‘1’. To find fraud orders based on bank account, we first create a variable called ‘dup_bank’. It lists the bank accounts that are owned by more than one Shopee account. Here is an overview of the ‘dup_bank’ variable.
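One way to build such a list of shared bank accounts is sketched below. It is an illustration on toy data, not necessarily the original code, and the column names userid and bank_account are assumptions.

```python
import pandas as pd

# Toy data; the column names are assumptions about the competition files.
bank_accounts = pd.DataFrame(
    {"userid": [1, 2, 3, 4], "bank_account": ["A", "A", "B", "C"]}
)

# Count the distinct users attached to each bank account.
owners_per_account = bank_accounts.groupby("bank_account")["userid"].nunique()

# Keep only the accounts that belong to more than one user.
dup_bank = owners_per_account[owners_per_account > 1].index.tolist()

print(dup_bank)
```

Here only account "A" is shared (by users 1 and 2), so it is the only entry that ends up in dup_bank.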

Code 3. Overview on “dup” DataFrame

Then, we select one bank account from ‘dup_bank’ and call it ‘x’. We create a DataFrame ‘a’ listing the userids that own x. Next, we create a DataFrame ‘b’ that stores the orders whose seller and buyer are both listed in ‘a’. The orders listed in ‘b’ are labelled as fraud orders.
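For a single shared bank account, this step could be sketched as follows. The names x, a, and b follow the text. The toy data and the column names (buyer_userid, seller_userid) are assumptions, and ‘a’ is kept as a Series of userids here for brevity.

```python
import pandas as pd

# Toy data; the column names are assumptions about the competition files.
bank_accounts = pd.DataFrame({"userid": [1, 2, 3], "bank_account": ["A", "A", "B"]})
orders = pd.DataFrame(
    {
        "orderid": [100, 101, 102],
        "buyer_userid": [1, 3, 2],
        "seller_userid": [2, 1, 1],
    }
)

# x: one bank account that is shared by several users.
x = "A"

# a: the userids that own bank account x.
a = bank_accounts.loc[bank_accounts["bank_account"] == x, "userid"]

# b: orders whose buyer AND seller both appear in a.
b = orders[orders["buyer_userid"].isin(a) & orders["seller_userid"].isin(a)]

# The orderids in b are the ones to label as fraud.
fraud_orderids = set(b["orderid"])
print(fraud_orderids)
```

Order 101 is not flagged because its buyer (user 3) does not own account "A". In the full solution this block runs once for every account in ‘dup_bank’.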

Code 4. Finding fraud orders based on bank account

Repeat the same algorithm on ‘credit_cards’ and ‘device’. Then, compile each orderid and its label into a DataFrame ‘submission’.
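Since the three checks differ only in the table and column they use, they can be wrapped in one helper. The sketch below is a possible arrangement, not the original code: the helper name fraud_orders is hypothetical, and the column names, toy data, and the two-column submission layout (orderid, is_fraud) are all assumptions.

```python
import pandas as pd


def fraud_orders(identity, key, orders):
    """Return orderids whose buyer and seller share a value in column `key`."""
    counts = identity.groupby(key)["userid"].nunique()
    flagged = set()
    # Loop over every value (account, card, or device) owned by 2+ users.
    for x in counts[counts > 1].index:
        a = identity.loc[identity[key] == x, "userid"]
        b = orders[orders["buyer_userid"].isin(a) & orders["seller_userid"].isin(a)]
        flagged.update(b["orderid"])
    return flagged


# Toy data; the column names are assumptions about the competition files.
orders = pd.DataFrame(
    {
        "orderid": [100, 101, 102, 103],
        "buyer_userid": [1, 3, 5, 7],
        "seller_userid": [2, 4, 6, 8],
    }
)
bank_accounts = pd.DataFrame({"userid": [1, 2, 3], "bank_account": ["A", "A", "B"]})
credit_cards = pd.DataFrame({"userid": [3, 4], "credit_card": ["C1", "C1"]})
devices = pd.DataFrame({"userid": [5, 6, 7], "device": ["D1", "D2", "D3"]})

# An order is fraud if ANY of the three identity checks flags it.
flagged = set()
for table, key in [
    (bank_accounts, "bank_account"),
    (credit_cards, "credit_card"),
    (devices, "device"),
]:
    flagged |= fraud_orders(table, key, orders)

# One row per order: 1 for fraud, 0 otherwise.
submission = pd.DataFrame(
    {
        "orderid": orders["orderid"],
        "is_fraud": orders["orderid"].isin(flagged).astype(int),
    }
)
print(submission)
```

In this toy example, order 100 is flagged through a shared bank account and order 101 through a shared credit card, while orders 102 and 103 stay at 0 because no device is shared.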

Code 5. Finding fraud orders based on credit card and device

The ‘submission’ DataFrame should look like this.

Code 6. An overview of the result

In conclusion, this algorithm is not perfect: it scores 0.98472 out of 1 when submitted on Kaggle (https://www.kaggle.com/c/ptr-rd2-ahy). Any comments and suggestions are very welcome 😄
