Verisoul specializes in helping businesses identify and mitigate fraudulent activities, such as fake social media accounts and unauthorized account sharing. It provides clients with real-time signals via its APIs (70% faster than competitors) and lets customers investigate aggregate and user-level activity through its customer-facing dashboard.
Challenges
Figure 1. Original architecture
Verisoul keeps its source data in Google Spanner. In the original setup, that data was ingested into BigQuery to serve customer-facing dashboards, which created significant technical hurdles:
Mutable Data and Updates
A significant portion of Verisoul's data revolves around account-specific information, which is highly mutable because account statuses, such as the number of suspicious logins, keep changing.
BigQuery does not natively support conditional updates, which forced the team to manually write SQL statements to retrieve the latest data and perform updates. This was expensive in both labor-hours and dollars.
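To illustrate why "retrieve the latest data" is costly, here is a minimal Python sketch of the general pattern. It assumes account changes land as appended rows and that current state is the newest row per account. The `AccountEvent` type, its field names, and the sample values are hypothetical and are not Verisoul's schema or SQL.

```python
from typing import Dict, Iterable, NamedTuple


class AccountEvent(NamedTuple):
    """One appended row; field names are hypothetical."""
    account_id: str
    suspicious_logins: int
    updated_at: int  # e.g. a Unix timestamp


def latest_state(rows: Iterable[AccountEvent]) -> Dict[str, AccountEvent]:
    """Return the newest row per account by scanning every appended row."""
    latest: Dict[str, AccountEvent] = {}
    for row in rows:
        current = latest.get(row.account_id)
        if current is None or row.updated_at > current.updated_at:
            latest[row.account_id] = row
    return latest


history = [
    AccountEvent("acct-1", 0, 100),
    AccountEvent("acct-2", 1, 105),
    AccountEvent("acct-1", 3, 110),  # newer state for acct-1
    AccountEvent("acct-1", 2, 108),  # late-arriving row, older than the one above
]

state = latest_state(history)
print(state["acct-1"].suspicious_logins)  # 3
```

The point of the sketch is that every read of "current" state touches the full history, which is what makes this pattern expensive under a pricing model based on data scanned.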
Data Freshness
Because the dashboard data relied on BigQuery, the dashboard always lagged behind the source data. This was a challenge for customers when developing and integrating: they had to wait minutes to see whether their integration changes worked!
Improving freshness was prohibitively expensive because of the conditional update challenges described above, compounded by BigQuery's pricing model, which charges for the amount of data scanned and for streaming ingestion per row. This made frequent updates financially infeasible at Verisoul's scale.
Denormalization Makes Data Pipelines Rigid
Verisoul faced suboptimal query performance with BigQuery, which forced the team to frequently denormalize data to minimize JOIN costs.
This approach significantly constrained their ability to add new features. Each new feature required a schema change, and each change took more than two labor-hours to reconfigure the extensive denormalization pipelines and associated tables, slowing the delivery of new features to customers.
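The rigidity can be seen in a toy Python model. This is a sketch under stated assumptions, not Verisoul's pipeline: the `accounts` and `sessions` data, the `risk_score` field, and both function names are hypothetical. It contrasts a denormalized table built ahead of time with a lookup performed when the query runs.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical source data: one record per account, one row per session.
accounts: Dict[str, Row] = {
    "acct-1": {"status": "active"},
    "acct-2": {"status": "flagged"},
}
sessions: List[Row] = [
    {"session_id": "s1", "account_id": "acct-1"},
    {"session_id": "s2", "account_id": "acct-2"},
]


def build_denormalized(account_fields: List[str]) -> List[Row]:
    """Pipeline step: copy the listed account fields onto every session row."""
    wide = []
    for s in sessions:
        acct = accounts[s["account_id"]]
        wide.append({**s, **{f: acct.get(f) for f in account_fields}})
    return wide


def join_at_query_time() -> List[Row]:
    """Query step: look up the account record when the query runs."""
    return [{**s, **accounts[s["account_id"]]} for s in sessions]


# The pipeline materializes the wide table once, with the fields it knows about.
wide_table = build_denormalized(["status"])

# A new feature adds a field to the source data.
accounts["acct-1"]["risk_score"] = 0.2
accounts["acct-2"]["risk_score"] = 0.9

print("risk_score" in wide_table[0])            # False: the wide table is stale
print("risk_score" in join_at_query_time()[0])  # True: visible immediately

# The denormalized path needs the pipeline reconfigured and the table rebuilt.
rebuilt = build_denormalized(["status", "risk_score"])
```

In the denormalized path, every new field means editing the field list and rebuilding the table, which is the per-feature cost described above.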
Solution
Figure 2. New architecture
Verisoul started looking for a more suitable solution for its customer-facing workloads and found CelerData. CelerData, built on StarRocks, is a real-time data warehouse designed to solve the challenges Verisoul faced with its legacy solution. After promising benchmark results, Verisoul migrated to CelerData all of the data behind queries that require latency under 100 ms:
- Handling Real-Time Conditional Updates: CelerData's primary key table uses a record-level primary key index to apply updates directly on its columnar storage with data freshness under 10 seconds. Using the primary key table and its native conditional update capabilities, Verisoul achieved 5-10 second data freshness.
- Running All JOINs on the Fly: CelerData's superior JOIN performance let Verisoul eliminate all denormalization pipelines.
- Async Materialized View: For fields that are not updated often, such as overview data, CelerData's async materialized view pre-aggregates some of the results to further increase performance while decreasing the load on the cluster.
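The idea behind a conditional update on a primary key table can be sketched with a toy in-memory model in Python. This is an illustration only, not CelerData's or StarRocks' API: the `PrimaryKeyTable` class, the `AccountRow` fields, and the "apply when the incoming timestamp is greater than or equal to the stored one" rule are assumptions made for the example.

```python
from typing import Dict, NamedTuple, Optional


class AccountRow(NamedTuple):
    """Hypothetical row layout."""
    account_id: str        # primary key
    suspicious_logins: int
    updated_at: int        # column the update condition is checked against


class PrimaryKeyTable:
    """Toy stand-in for a table that keeps exactly one row per primary key."""

    def __init__(self) -> None:
        self._rows: Dict[str, AccountRow] = {}

    def conditional_upsert(self, incoming: AccountRow) -> bool:
        """Write the row unless a strictly newer row is already stored.

        Returns True if the row was applied, False if it was ignored.
        """
        current = self._rows.get(incoming.account_id)
        if current is not None and incoming.updated_at < current.updated_at:
            return False  # stale write: keep the stored row
        self._rows[incoming.account_id] = incoming
        return True

    def get(self, account_id: str) -> Optional[AccountRow]:
        """Point lookup by primary key; no history scan is needed."""
        return self._rows.get(account_id)


table = PrimaryKeyTable()
print(table.conditional_upsert(AccountRow("acct-1", 1, 100)))  # True: first row for the key
print(table.conditional_upsert(AccountRow("acct-1", 3, 110)))  # True: newer row replaces it
print(table.conditional_upsert(AccountRow("acct-1", 2, 105)))  # False: out-of-order row ignored
print(table.get("acct-1").suspicious_logins)                   # 3
```

Because the condition is checked at write time, readers always see one current row per account, and late or out-of-order writes cannot overwrite newer state.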
Results
Verisoul’s adoption of CelerData yielded transformative results:
- Real-Time Analytics: The new system based on CelerData delivers data freshness of 5-10 seconds, a huge improvement over the 15 minutes with BigQuery. This enables fast monitoring of and response to fraudulent activity and helps Verisoul stand out among its competitors.
- Query Performance: CelerData reduced query latency to 50-70 ms, allowing faster and more cost-effective operations and making the dashboards virtually instantaneous for customers.
- Faster Feature Development: Removing the denormalization pipelines simplified the data pipeline and made the data schema more flexible. Adding a feature that used to take two hours now takes a single command.
- Cross-Customer Intelligence: CelerData's ability to handle complex OLAP queries on the fly has enabled Verisoul to build brand-new functionality that gathers and analyzes data across multiple clients, a capability previously unattainable with BigQuery. This makes identifying suspicious activity through IP and device tracking and other metrics accurate to the minute, versus competitors that refresh only daily.
What’s Next
Looking ahead, Verisoul is planning several ambitious enhancements:
- Integrating Apache Iceberg: Currently, only performance-critical data is hosted in CelerData, with the remaining raw data still in BigQuery. Future plans include adopting Apache Iceberg for raw data storage and leveraging CelerData's external catalog feature for direct querying, enhancing data accessibility and processing efficiency.
- Enhanced Tracking Mechanisms: Leveraging CelerData's capacity to handle vast data volumes, Verisoul plans to track mouse movements across millions of pages simultaneously, which translates to a few orders of magnitude more data than before. This expansion will enable the detection of more sophisticated and subtle fraud patterns.
- Integration with Machine Learning Applications: Verisoul aims to use CelerData to better aggregate data for integration with machine learning models. This will bolster their ability to predict and preempt fraudulent activities, enhancing predictive analytics and preventive measures.
You can learn more about Verisoul's journey and the capabilities of CelerData that have made it the top choice for the world's leading enterprises by signing up for the CelerData Cloud 30-day free trial.