Transforming Web Data from Unstructured to Structured

quad-data

*Originally published on the Quad Analytix Blog

At first blush, it seems like data analytics is simply collecting and analyzing loads of information, but there’s actually a lot that happens in between. There’s a whole progression to it: on a broad level, the data analytics workflow generally follows a process called ETL (Extract, Transform, Load), where data is…

  1. Extracted—in our case, obtained from various retail websites
  2. Transformed—cleaned, normalized, checked for accuracy and consistency to get the data into analyzable form
  3. Loaded into the target database for in-depth analysis

Getting extraction perfect on a large scale is an incredibly challenging problem. There is so much variation in naming conventions, formatting conventions, etc. among different merchants and different product categories that transformation is absolutely necessary.

Data cleansing is a major part of transformation, the crucial middle step that is so often misunderstood. Raw data can be unusable—too inconsistent or incomplete to analyze directly. Cleansing identifies incomplete or inaccurate extracted information and modifies that information, converting unstructured, inscrutable data into structured, usable data ready for analysis.

It’s easiest to explain the process by example: let’s say you have a set of extracted data for fuzzy socks across multiple sellers on Amazon. Among other things, you want to know each product’s color and title, so you pull title and color tags from the HTML source of each product page. All products have titles, but only the well-documented ones have color tags, like these gray socks:

 

IMG1

 

Still, even when there is no color tag, color can almost always be found within the product title. Querying the extracted data, you can use the extracted title information to fill in the blanks for the sellers who don’t label their socks with color attributes: simple algorithms can search for all color-related words to find the word “blue” in the product title “Cozy Unisex Lapis Blue Fuzzy Microfiber Socks.”

 

IMG2

 

Of course, cleansing is an imperfect process—sometimes there will be exceptions to the rule, as in this case where there are multiple socks being sold together:

 

IMG3

When algorithms fail, it usually becomes necessary to resort to manual labor. For Amazon’s 6-pair fuzzy sock pack, you might assign the value “assorted” or “multi-colored” to the color attribute after taking a brief skim through the product page. Alternatively, maybe you decide to group sock multi-packs into an entirely different category, or even try to further automate processes by designing programs that identify color based on screenshots of the product image. Still, it’s almost inevitable that certain elements must be cleaned manually—the ultimate goal is to minimize that with efficient and expansive algorithms.

Quad Analytix has gone beyond Amazon’s fuzzy sock sphere to extract data across thousands of different merchants, individual sellers, and information types that range from product description to promotion patterns to customer review records. Over time, our experience in developing optimal ways to clean data has come to ensure maximal clarity, allowing us to build analyses and insights on as solid a foundation as possible. Our ultimate goal is to help you better understand your competitive landscape online —you’ll always be the first to hear about trending sock colors!

Have questions or want to learn more? Request a demo!

0
Share
Min-Jee Hwang

Min-Jee is the Marketing Director at Wiser, a dynamic pricing and merchandising engine for online retailers. She has extensive experience working with SaaS companies and holds a BA from Carnegie Mellon University and an MBA from NYU Stern.