Jugnu Life :-): Architecture

Showing posts with label Architecture. Show all posts

Large scale Real time bidding for advertisement

Behavioural Targeting
Based on short term and long term activities of users
e.g User searched for new car
e.g User searched for new movie
Typical advertisement pipeline

Simple HBase data mode
Single Column family
Column Qualifier
<date><hour>:<type><value>

Challenges
User profile freshness
    Stagered freshness
    Update after every few hours
Scaling
    Partition by geo location
    Apps interactions
Pipeline failure
HBase scaling issues
    Salting of keys for even spread
    Optimal pre spits for reading also

http://www.slideshare.net/Hadoop_Summit/how-did-you-know-this-ad-would-be-relevant-for-me

Now see how Yahoo solves the similar problem

http://www.slideshare.net/Hadoop_Summit/interactive-analytics-in-human-time
Yahoo makes use of druid
http://druid.io/druid.html

Lambda Architecture implementation with Summingbird

http://www.slideshare.net/Hadoop_Summit/t-145phall1boykin-mac

Design and Architecture considerations for storing time series data

Design and Architecture considerations for storing time series data

Know your Access patterns in advance

Are we going to do analysis on full day of data or data for just one hour. Having advance notes on what would be the use cases with which data will be used is highly recommended.
The granularity of information required by the client application helps deciding underlining data model for storing the information.
Frequency with which data is generated
Identify the speed with which time is being produced by the source system. Do you produce multiple data points every second.

Though we might need to persist all the time series data but more than often we don't need to store each data point as a separate record in the database.

Most of the time series problems are similar in nature and predominant issues comes when we need to Scale the system . Making the systems evolve with changing schema is another dimension which adds to the complexity. All the problem show similar patterns with only variations in the data model.

If we can define a time window and store all the readings for that period of time as an array then
we can significantly cut the number of actual records persisted in the database, improving the performance of the system.

For example
Stocktock tick information is generated once a second for each stock i.e. 86K ticks per stock per day.
If we store that many records as separate rows then the time complexity to access this information would be huge, so we can group 5 minutes or 1 hour or one day worth of records as a single vector record.

The benefits of storing information in larger chunks is obvious as you would do way fewer lookups into the
NoSQL store to fetch the information for a specific period of time. Other point to remember is that
if your window size is very small then you will be doing a lot of read/write operation and if it is
too big then durability would be a concern as you can lose the information in the event of system failure.

So you need to balance out both the forces.

There is no single size fit each time series problem is different.Fine tune the system based on the requirements and the access patterns.

If the access pattern changes in future then you might have to re-index, re-calculate the array size to optimize your queries.

So each time-series application is very custom made, where you can apply the best practices but cannot just import the data modelling templates to a different time-series problem.

Links
http://www.infoq.com/articles/nosql-time-series-data-management
https://blog.twitter.com/ko/node/4403

Architecture pattern for real time processing in Bigdata

3Vrfc+I2EP5reOyMweCER+CgufbauR430+mjwALU2BYjy0nIX38rrBW2ZKghtpMrwzBojdf6vv0piZ4/i19+FWS/+4OHNOoNvPCl53/qDQb3ngefSnCwBFvBwlzU14KMhTQtiSTnkWT7snDNk4SuZUm24VFZ2Z5sqSNYrknkSv9modzpyQ2Ck/yBsu0OH9MPxvmVVB5QR0g3JIvkL0cRXFOXY4K6NMwXLx8Gd/n4oMd9Tyvck6Q0pVfO45JA0JS9lqe9YXpe+iErLkIqSqKIJY9F3vw5GElwDjeqb/HLjEbKUGiE/LbFmauGLkGT0qPP3aDpeCJRpqfu0Pe8Y5Iu92Stxs/gPD1/umFRNOMRz7H489Fstli4z9bTeaJCUu1oJ5DgiZTHVIqDoh/tp2nRXjjytTmeT8YPAv2bXdHwyCfRZG6N7hN6+KIJqCZDW/o6MhoAfW+BRh8sgEZ4RcxNQEa+O8c8tjDfdYgZZ9sx5gGmVMR83yHmmwK9Acx2QI87xOy/E2acvMYceB1i1oWxgBnQiUcQ/cZX8PltvvyuKKACAN2S68fjBbwaoGk0LNPkV+T6gV/BE6bLN/Gkn3XJN9Id2auvWRxN1lLhnyp0DLqTL2RFo688ZZJx1RasuJSqHTA/mERsqy5I3oRH3VlMDSqqYlVRxITzJqbuz3rU8q8vt3jQdDz3JqNGiojWgrwMXV5MY1AkBsl6Cy9muhc8iIbQzOphwhN6FrP64UXERURo1UpTCxoRyZ7KfWgVTv2Mr5zBXE65C6nRlA4xUFBFyjOxpvquYmNpKRrZxQ59FhVJIrZUOoqOtBvg9SwxdCzxOVnzmCVbtaAhkjiGAQcD4qZER+ka7KH6cid8YxaG6p5pE2UBcxkyUuWtVbbFPPkWb/V1pFzy1q3g2RXZyqzwyAo1HCd6PuHbBIxcAjC3N53vBzXwwx2whlXWrtEVGEL/2/4dlH1cD78HPLxaNi4OuwBfo5a3DN7q+HDYBXi3PHcN3gprHHYBvsaKvWXwVhOLww7AI8/vCH5kJfQOwdfpvtoFb3VKOOwCfI1sT4Tgzz9Bv2ktbQJPY7u637Tqj6OouX4T16kF+r8LkqQbLmIaHvd0gQ+4zqDx9HCxlEpBSd6TfoBm1LdCN8A+v5Nm1K1ZvlDEASSpJjnZw1nC+XDGhfk64hmorBPe1+9OWGvLfkV8V64tm+hW/Tp17adbWwbWnu/QzoV1Y91WZBYWzcc6uvv/zBJW7TKV+2pLWIr8Fi3hVvyHT4vlLRtQ6rBqbI71GjyuCipWtJXHVU1sQA3dLf1/Jt/+vIUQtaPbCiF3FcvA9ghx9/sfshgKM8gyKBx6w9JbkfRYpmE041Brofp+hJJsqgkGZcVaorWSPHT30JaHVNJYkZdmqxhOsgfev3ylxnljk58kfADiAmur0ThhF8T1bynV0MJIwR/NXxo0+R+6UbcjGw/yri0ZtqIAT7tb2Bh2izdsDAsaA8vk+MNkS9PjEY7KEMfeHFYN/gs5Oqb5uoiyGNKHGf8+2ag9ZdAObyNdkvgVpAUFl2Mjohs1bHXLeVRuYs2felqIDBie/raSG+v0RyN//gM=

The flow is as follows

1) Real time data is ingested into system via stream ingestion tools like Flume , Kafta or Samza. The choice of which tool to be used among these dependent on factors like do we need event guaranteed delivery , are we bothered about sequence of events. etc. This is topic for another posr

2) Apply the processing required on the incoming data via Spark streaming api and make it ready to be consumable by third party apps

3) Using the REST gateway job server of Spark , third party apps trigger further spark jobs to process the data and return results back to the applications.

Teradata meets Hadoop SQL-H

Teradata and Hadoop have been talking via Sqoop since long. In which Sqoop was used to import export data in and out of Hadoop and Teradata.

Teradata has new tool called SQL-H

SQL Hadoop which allows to connect directly to Hadoop from Teradata and perform operations.

The two links below explain all about them.

But before going to main link I will suggest you to have a quick read at Teradata architecture. ( only 2 mins read)

What is the function of

PE – Parsing Engine , parses the query entered by user

AMP –Do actiual disk operations

BYNET – Handles communication between AMP and PE

http://www.teradatatech.com/?p=103

Now comes one additional component EAH

EAH is the External Access Handler , which talks to HCatalog on Hadoop side to get information related to table

You can run queries like

SELECT Price

, CAST(Make AS VARCHAR(20))

, CAST(Model AS VARCHAR(20))

FROM LOAD_FROM_HCATALOG(

USING

SERVER('sdll4364.labs.teradata.com')

PORT('9083')

USERNAME ('hive')

DBNAME('default')

TABLENAME('CarPriceData')

COLUMNS('*')

TEMPLETON_PORT('1880')

) as DT;

Where server is details about hadoop cluster

And we are saying in Teradata , use Hcatalog and talk to Hadoop execute this query bring back the results.

There is very good presentation which explains more

https://teradata.activeevents.com/connect/fileDownload/session/861E603F0E963A510420A734D9102127/1436_Back-1436_Frazier-1436%20Frazier%20Partners%202013%20SQL-H.pdf