Introduction
We are aiming to build a simple redirection analytics system similar to bit.ly / shorturl using opensource toolset with further investigation on the scope of data science in it.
Let take an example of a simple food review website. On the Website, users can see reviews of different dishes. If Users like a review of a particular dish and want to order it, the Website contains links to various food ordering platforms (such as Uber eats, Zomato, Swiggy). Users can click on one of the platforms and it will be redirected to the chosen platform with desired dish.
For the redirection, we simply embed the URL with the relevant link and integrate it into our website.
Now, We want to check how many users clicked on that link. We can do this by adding a URL shortener or redirection service [ https://bitly.com/, https://www.shorturl.at/ ]. This process can be manual and costly. Many statistics of redirection are generally paid. Although, most of the services give much-needed statistics for free.
One of the real-world examples of this feature is Twitter. When the user includes URL into his/her tweet it gets automatically replaced by a short URL (t.co/…”) and the user can see how many people clicked on that link. Even LinkedIn does this.
Here, we will mention the simplest way to implement this feature in Flask.
Simplest doesn’t mean best.
We will need to store all relevant links to all dishes with all supported platforms. We might store this in one big database table. But here, we will use CSV in place of the database for storing the dishes and their respective URLs. Following is the format of the CSV :
Dishname | Restaurant | Platform | URL |
---|---|---|---|
Pizza | Uncle John’s | Zomato | zomota.com/something |
Pasta | Uncle Bob’s | Uber Eats | ubereats.com/something |
We will use pandas for reading CSV. ( Note: we can use the CSV library of python standard package, too. But We will need some functionality to search for URLs. We can implement this search functionality easily with pandas)
df = pd.read_csv("database.csv")
Redirection Implementation in Flask
Now, We have redirection urls. Let’s take a look at the following code for redirection implementation.
@app.route('/redirect/<restaurant>/<dish>/<platform>', methods = ['GET'])
def redirect_stat(dish=None,restaurant=None,platform=None):
url_redirect = '/'
if dish and restaurant and platform:
result = df.query(f'dishname=="{dish}" & restaurant=="{restaurant}" ')
if len(result)>0:
if platform=='zomato' or platform=='swiggy':
return redirect(result['url_'+platform].iloc[0], code=302)
return redirect(url_redirect,code=302)
As we can see, We have defined a very simple URL routing scheme in Flask for redirection. If someone wants to order X dish from Y restaurant on the Z platform, they have to just type the following URL in their browser “ourwebsite.com/redirect/X/Y/Z” and we will redirect them to the relevant platform with that dish. Code also does a few additional checks making sure that all arguments are passed correctly and only supported platforms are allowed for redirection.
Now, We need to replace all URLs on our website with this routing scheme. Depending upon how we have implemented this on our website this can take from couple of minutes to a week.
Storing Data to Database
We need to store redirection entries. CSV? Yes, We can. But. multiple threads writing to single CSV doesn’t end up well in the long run. So, We’ll use a real database. MongoDB? No, It’s PostgreSQL. We will create a simple table for storing our information.
CREATE TABLE IF NOT EXISTS public.REDIRECTS
(IP TEXT NOT NULL,
DISH_NAME TEXT NOT NULL,
RESTAURANT TEXT NOT NULL,
DATE_TIME TEXT NOT NULL,
PLATFORM TEXT NOT NULL);
We also can create table using python.
self.conn = None
try:
# read connection parameters
params = config(section = section)
# connect to the PostgreSQL server
print('Connecting to the PostgreSQL database...')
self.conn = psycopg2.connect(**params)
if create_database:
cur = self.conn.cursor()
database_create_query = '''CREATE TABLE IF NOT EXISTS public.REDIRECTS
(IP TEXT NOT NULL,
DISH_NAME TEXT NOT NULL,
RESTAURANT TEXT NOT NULL,
DATE_TIME TEXT NOT NULL,
PLATFORM TEXT NOT NULL); '''
cur.execute(database_create_query)
result = cur.fetchall()
self.conn.commit()
cur.close()
print ('Connected')
except (Exception, psycopg2.DatabaseError) as error:
print(error)
Now, we have a table in place. We just need to insert data and everything is almost done. Following code implements insertion in the database.
@app.route('/redirect/<restaurant>/<dish>/<platform>', methods = ['GET'])
def redirect_stat(dish=None,restaurant=None,platform=None):
url_redirect = '/'
if dish and restaurant and platform:
# inserting entry into database
redirectdb.insert(ip=request.remote_addr,datetime=current_time(),
dish=dish, restaurant=restaurant, platform=platform)
result = df.query(f'dishname=="{dish}" & restaurant=="{restaurant}" ')
if len(result)>0:
if platform=='zomato' or platform=='swiggy':
return redirect(result['url_'+platform].iloc[0], code=302)
return redirect(url_redirect,code=302)
Analytics
At this point, we have data in our Postgres. We can do so many things with this just one simple table. We might need to do ETL depending upon what we trying to do with this data but I will put a few ideas for you to try out.
-
How many total redirections from our website happened last month?
Note we can not accurately calculate number of people who were redirected because many people might be using VPNs. But, We can roughly estimate this by counting unique IPs. It might have an error in it but we can still benefit from it.
- Which platform is popular?
-
Which dishes are popular and which restaurants are popular?
If we have additional information about restaurants in our database. For example, location of every restaurant, we can even tell which restaurant are popular in X city?
Here, I have used popular word because We never know redirected user placed an order on the platform or not. So, We should refrain from making such claims.
Few words on privacy: We implemented all this without using a single third-party tool. Our website users’ data is at our hands. We can even drop the IP column form our table and only track general statistics. This is the best way to maintain our user’s privacy rather than spending a few bucks on custom analytics tools that we don’t have full control of.
At this point, You have gotten the idea of how much powerful one simple table can be. But, If you’re coming from the Machine learning or Data Science domain, There are few eccentric ideas. You should only try to implement this if you have at least more than X amount of daily active users on your website. If not, better invest time to implement other important features.
-
Based on Data, Predicting URL redirection for a particular restaurant for the upcoming day.
Yes, You read this right. We can do this using time series analysis and enough data.
- Based on IP addresses, We can generate location-based statistics.
-
Detecting bots’ IP address, if any exists on our platform, from redirection patterns.
Bots have unique patterns. Sometimes they try to emulate person but no user will do 50 URL redirection on our website in one day. It can happen in rare case but we can look at general statistics from our records to determine this behaviour.
Optimization
If you have seen a real-world production system. You might already have figured it out that how to optimize this approach. I will just mention a few points here.
- We can tighten our attributes of the table in the schema. Use Datetime for storing DateTime stamp rather than text. For IP address, You can follow this post on stack overflow
- We have Write heavy system. Our Database might face issue if we have X amount of user redirecting every second. We can scale this because We don’t need strong consistency. Yay! We can deal with eventual consistency. Also, We are not reading from the database at all. We just occasionally read from DB for our analytics purpose. The simplest and must option is batching our writes. You can read more on optimizing DB for write heavy task. This topic deserves a blog in itself.
- Our search time for url in CSV is linear [ O(N) ]. We can speed up this by using a hashmap. We can even use Redis or Memcache. If we have too many URLs, we will not be using CSV but a real database. For a large scale, indexes in DB + single in-memory cache instance (Redis/Memcache) is more than enough to scale to at-least to million urls.
written by Dipkumar Patel and Yash Panchal