Data Engineering

Client

Property & Casualty (P&C) Insurance Underwriting Organization

Project

Data engineering to scrape, extract, and normalize web data that will fuel advanced analytical models

Challenge

  • Transform commercial P&C underwriting by tapping structured and unstructured data sources and applying mathematical and statistical models to them, in order to extract information and support objective decisions

  • Immediate Scope

    • Scrape hundreds of static and dynamic websites
    • Content is either HTML pages or PDF reports
    • Normalize the structure of the output data
    • Fully automate the scripts for future reuse
    • More than 50 million records to be extracted
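
To illustrate what normalizing the structure of the output data can look like, here is a minimal sketch in Python. The target schema (`NormalizedRecord`), the field names, and the alias table are all hypothetical; the actual schema used on the project is not stated here.

```python
from dataclasses import dataclass, asdict
from typing import Optional


# Hypothetical target schema; the real field names are not given in the source.
@dataclass
class NormalizedRecord:
    source: str
    business_name: str
    state: Optional[str]
    amount_usd: Optional[float]


# Hypothetical mapping from schema fields to the column names different sites use.
FIELD_ALIASES = {
    "business_name": ("name", "company", "business name"),
    "state": ("state", "st"),
    "amount_usd": ("amount", "total", "value"),
}


def _lookup(raw: dict, aliases) -> Optional[str]:
    """Return the first non-empty value found under any alias (case-insensitive keys)."""
    lowered = {k.strip().lower(): v for k, v in raw.items()}
    for alias in aliases:
        value = lowered.get(alias)
        if value is not None and str(value).strip():
            return str(value).strip()
    return None


def _parse_amount(text: Optional[str]) -> Optional[float]:
    """Strip currency formatting; return None when the text is not a number."""
    if text is None:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize(raw: dict, source: str) -> NormalizedRecord:
    """Map one raw scraped row onto the common schema."""
    name = _lookup(raw, FIELD_ALIASES["business_name"]) or ""
    state = _lookup(raw, FIELD_ALIASES["state"])
    return NormalizedRecord(
        source=source,
        business_name=" ".join(name.split()),  # collapse repeated whitespace
        state=state.upper() if state else None,
        amount_usd=_parse_amount(_lookup(raw, FIELD_ALIASES["amount_usd"])),
    )
```

For example, `normalize({"Company": "  Acme   Corp ", "ST": "ny", "Total": "$1,250.50"}, "site-a")` yields a record with a cleaned name, an upper-case state code, and a numeric amount, regardless of how the source site labeled its columns.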

Our Solution

Built Python-based web crawlers to automate scraping of both HTML and PDF data sets hosted on web servers. Some of the more advanced scripts used Selenium and other Python 3.x libraries (BeautifulSoup, Scrapy, etc.).
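
As a rough sketch of the crawler pattern described above, the following Python fetches a page, decides whether it is HTML or PDF, and pulls table rows out of HTML. It is an illustration under assumptions, not the project's code: it uses the standard library's `html.parser` and `urllib` as stand-ins for BeautifulSoup, Scrapy, and Selenium so that it runs without extra dependencies, and the function names, the `User-Agent` string, and the assumption that records sit in an HTML table are all hypothetical. PDF text extraction and JavaScript-rendered (dynamic) pages are not shown.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse repeated whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html: str):
    """Return table rows as dicts keyed by the first (header) row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows or [[]]
    return [dict(zip(header, row)) for row in body]


def classify(content_type: str, url: str) -> str:
    """Return 'pdf' or 'html' from the Content-Type header, falling back to the URL suffix."""
    if "pdf" in content_type.lower() or url.lower().endswith(".pdf"):
        return "pdf"
    return "html"


def fetch(url: str, timeout: float = 30):
    """Download a URL and return (content_type, body_bytes)."""
    req = Request(url, headers={"User-Agent": "example-crawler/0.1"})  # hypothetical UA
    with urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Type", ""), resp.read()
```

A driver loop would call `fetch`, route the response through `classify`, and hand HTML bodies to `extract_rows`; PDF bodies would go to a separate extractor, and dynamic sites would need a browser-driven tool such as Selenium in place of `fetch`.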

Value Delivered

  • Normalized data structure and output for records scraped from hundreds of static and dynamic websites

  • More than 15 GB of data extracted

  • Fully automated Python scripts