Your company runs a news portal and collects clickstream data from its readers.
As a data engineer, you are tasked with building a data pipeline that collects clickstream data, enriches
it with article metadata, and persists it for later analysis of user characteristics and behaviour.
Data pipeline elements:
- Stream for click events
- Real-time data processing
- Reading the Stream
- Base64 Decoding
- Enriching clicks data with articles metadata
- Converting into Parquet format
- Persisting into the Storage
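The decode-and-enrich steps above can be sketched in plain Python, independent of the streaming framework. The `ARTICLES` lookup and all field values below are hypothetical stand-ins: in the real pipeline this join would typically be done in Spark against the articles metadata table (e.g. via a broadcast join).

```python
import base64
import json

# Hypothetical in-memory article metadata lookup; in the real pipeline this
# would be a join in Spark against the articles metadata dataset.
ARTICLES = {
    42: {"category_id": 7, "words_count": 350},
}

def decode_click(payload: str) -> dict:
    """Base64-decode a stream record and parse the JSON click event."""
    return json.loads(base64.b64decode(payload))

def enrich_click(click: dict, articles: dict) -> dict:
    """Attach article metadata (category, word count) to a click event."""
    meta = articles.get(click["click_article_id"], {})
    return {**click, **meta}

# Simulate one record as it would arrive from the stream (JSON, Base64-encoded).
raw = base64.b64encode(json.dumps(
    {"user_id": 1, "click_article_id": 42, "click_timestamp": 1508211379}
).encode()).decode()

event = enrich_click(decode_click(raw), ARTICLES)
# event now carries both the click fields and the article metadata
```

The enriched events would then be written out in Parquet format to storage for the hourly batch job.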
- Batch processing for analytics:
- Provide insights about:
- Distribution of clicks across environment, device group and operating system
- The most and least popular articles
- The most and least popular categories
- The average word count per article and category
- Average time to click on an article (difference between session start timestamp and article click timestamp) per category
- Every hour, query the stored data and create an aggregated report in the form of a CSV with the required insights
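As a sketch of one of the required insights, the average time-to-click per category can be computed from enriched click rows and written as CSV. The sample rows and timestamps below are invented for illustration; the real job would run this hourly as a Spark SQL aggregation over the Parquet data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical enriched click rows (click events already joined with
# article metadata); timestamp values here are arbitrary.
rows = [
    {"category_id": 7, "session_start": 1000, "click_timestamp": 5000},
    {"category_id": 7, "session_start": 2000, "click_timestamp": 4000},
    {"category_id": 9, "session_start": 1000, "click_timestamp": 2000},
]

# Average time-to-click per category: mean(click_timestamp - session_start).
totals = defaultdict(lambda: [0, 0])  # category_id -> [sum_of_deltas, count]
for r in rows:
    t = totals[r["category_id"]]
    t[0] += r["click_timestamp"] - r["session_start"]
    t[1] += 1

# Emit the aggregated report as CSV, as the hourly batch job would.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["category_id", "avg_time_to_click"])
for cat, (total, count) in sorted(totals.items()):
    writer.writerow([cat, total / count])

report = buf.getvalue()
```

The other insights (click distributions, most/least popular articles and categories, average word counts) follow the same group-and-aggregate shape with different keys and measures.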
Producer
- A script simulating a real system where clicks would be tracked
- Use the provided clicks dataset sample as a reference
- Before publishing to the stream, each record should be converted from CSV to JSON and Base64
encoded
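A minimal sketch of the producer's conversion step, using only the standard library. The sample CSV content is invented, and the actual publish call is left out since it depends on whether Kinesis or Kafka is chosen.

```python
import base64
import csv
import io
import json

# Invented sample rows standing in for the provided clicks dataset.
SAMPLE_CSV = """user_id,session_id,click_article_id,click_timestamp
1,abc,42,1508211379
2,def,99,1508211467
"""

def csv_to_messages(csv_text: str):
    """Convert each CSV row to a JSON object, then Base64-encode it,
    yielding payloads ready to publish to the stream."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield base64.b64encode(json.dumps(row).encode()).decode()

messages = list(csv_to_messages(SAMPLE_CSV))
# Each message is a Base64 string; publishing would then be e.g.
# a KafkaProducer.send(...) or Kinesis put_record(...) per message.
```

Note that `csv.DictReader` yields all values as strings; the consumer side (or the producer, if preferred) would cast numeric fields before analysis.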
Dataset
[login to view URL]
Data structure
Click data:
- user_id
- session_id
- session_start
- session_size
- click_article_id
- click_timestamp
- click_environment
- Id of the Environment: 1 - Facebook Instant Article, 2 - Mobile App, 3 - AMP (Accelerated Mobile Pages), 4 - Web
- click_deviceGroup
- Id of the Device Type: 1 - Tablet, 2 - TV, 3 - Empty, 4 - Mobile, 5 - Desktop
- click_os
- Id of the Operating System: 1 - Other, 2 - iOS, 3 - Android, 4 - Windows Phone, 5 - Windows Mobile, 6 - Windows, 7 - Mac OS X, 8 - Mac OS, 9 - Samsung, 10 - FireHbbTV, 11 - ATV OS X, 12 - tvOS, 13 - Chrome OS, 14 - Debian, 15 - Symbian OS, 16 - BlackBerry OS, 17 - Firefox OS, 18 - Android, 19 - Brew MP, 20 - Chromecast, 21 - webOS, 22 - Gentoo, 23 - Solaris
- click_country
Articles metadata:
- article_id
- category_id
- created_at_ts
- publisher_id
- words_count
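The two record types above can be mirrored as plain Python dataclasses; in the Spark jobs these would be declared as `StructType` schemas instead. This is only a typed restatement of the fields listed above, with assumed (plausible) types.

```python
from dataclasses import dataclass

# Click event record, one per tracked click. Field types are assumptions
# based on the field names; the sample dataset should be the authority.
@dataclass
class Click:
    user_id: int
    session_id: str
    session_start: int
    session_size: int
    click_article_id: int
    click_timestamp: int
    click_environment: int
    click_deviceGroup: int
    click_os: int
    click_country: str

# Article metadata record used to enrich click events.
@dataclass
class ArticleMetadata:
    article_id: int
    category_id: int
    created_at_ts: int
    publisher_id: int
    words_count: int
```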
Notes:
- Suggested language: Python
- Suggested processing framework: Spark
- Spark Streaming
- Spark SQL / Hive
- Suggested streaming framework: Kinesis or Kafka
- Present reasoning for topics and partitions setup
- Please include diagram or description of your solution
- Why did you choose this particular solution architecture? What were some of the trade-offs? How would it handle event schema evolution?
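On the topics-and-partitions question, one common rationale is to key click events by `user_id`: all of a user's clicks then land in the same partition, preserving per-user ordering, while the partition count bounds consumer parallelism. Below is a toy illustration of such a key-based partitioner; it is not Kafka's actual default (which hashes keys with murmur2), just a deterministic sketch of the idea.

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable hash-based partitioner: keying by user_id keeps all of a
    user's clicks in one partition, preserving per-user ordering.
    Illustrative only -- real clients ship their own partitioners."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same user always maps to the same partition.
p1 = partition_for("user-123", 8)
p2 = partition_for("user-123", 8)
```

The trade-off: more partitions increase throughput and consumer parallelism but add overhead and weaken global ordering, so the count is usually sized to expected peak event rate divided by per-consumer throughput.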