Transforming Web Data from Unstructured to Structured
*Originally published on the Quad Analytix Blog*
At first blush, it seems like data analytics is simply collecting and analyzing loads of information, but there’s actually a lot that happens in between. There’s a whole progression to it: on a broad level, the data analytics workflow generally follows a process called ETL (Extract, Transform, Load), where data is…
- Extracted—in our case, obtained from various retail websites
- Transformed—cleaned, normalized, checked for accuracy and consistency to get the data into analyzable form
- Loaded into the target database for in-depth analysis
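The three steps can be sketched as a tiny Python pipeline. This is a minimal illustration under assumed record shapes, not Quad Analytix's actual code; every function and field name here is hypothetical:

```python
# Minimal ETL sketch: extract raw records, transform them into a
# consistent shape, then load them into a target store.
# All names and data are illustrative assumptions.

def extract():
    """Stand-in for scraping: raw product records as they might
    appear, inconsistently formatted, across retail sites."""
    return [
        {"title": "Cozy Gray Fuzzy Socks", "price": "$8.99"},
        {"title": "cozy unisex lapis blue fuzzy microfiber socks", "price": "8.99 USD"},
    ]

def transform(records):
    """Normalize title casing and parse prices into floats."""
    cleaned = []
    for r in records:
        cleaned.append({
            "title": r["title"].title(),
            "price": float(r["price"].lstrip("$").split()[0]),
        })
    return cleaned

def load(records, db):
    """Append cleaned records to the target 'database' (a plain list here)."""
    db.extend(records)

db = []
load(transform(extract()), db)
print(db)  # both records now share one title style and numeric prices
```

The point of the middle step is visible even at this toy scale: two sellers express the same price two different ways, and only after transformation can they be compared directly.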
Getting extraction perfect on a large scale is an incredibly challenging problem. There is so much variation in naming and formatting conventions across merchants and product categories that transformation is absolutely necessary.
Data cleansing is a major part of transformation, the crucial middle step that is so often misunderstood. Raw data can be unusable—too inconsistent or incomplete to analyze directly. Cleansing identifies incomplete or inaccurate extracted information and modifies that information, converting unstructured, inscrutable data into structured, usable data ready for analysis.
It’s easiest to explain the process by example: let’s say you have a set of extracted data for fuzzy socks across multiple sellers on Amazon. Among other things, you want to know each product’s color and title, so you pull title and color tags from the HTML source of each product page. All products have titles, but only the well-documented ones have color tags, like a listing for gray socks that carries an explicit color attribute.
Still, even when there is no color tag, color can almost always be found within the product title. Querying the extracted data, you can use the title information to fill in the blanks for sellers who don’t label their socks with color attributes: a simple algorithm can scan for color-related words and find “blue” in the product title “Cozy Unisex Lapis Blue Fuzzy Microfiber Socks.”
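A simple version of that fill-in-the-blanks step might look like the following sketch. The color vocabulary and the record shape are assumptions for illustration:

```python
# Sketch of filling a missing color attribute from the product title.
# COLOR_WORDS and the record fields are illustrative assumptions.

COLOR_WORDS = {"blue", "gray", "grey", "red", "green",
               "pink", "purple", "black", "white"}

def infer_color(record):
    """Return the tagged color if present; otherwise scan the title
    for the first recognized color word."""
    if record.get("color"):
        return record["color"]
    for word in record["title"].lower().split():
        if word in COLOR_WORDS:
            return word
    return None  # no color found; leave for another pass

record = {"title": "Cozy Unisex Lapis Blue Fuzzy Microfiber Socks",
          "color": None}
print(infer_color(record))  # -> blue
```

A real pipeline would use a much larger vocabulary (and handle multi-word colors like “navy blue”), but the shape of the logic is the same.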
Of course, cleansing is an imperfect process—sometimes there will be exceptions to the rule, as when multiple socks of different colors are sold together in a single listing.
When algorithms fail, it usually becomes necessary to resort to manual labor. For Amazon’s 6-pair fuzzy sock pack, you might assign the value “assorted” or “multi-colored” to the color attribute after a quick skim of the product page. Alternatively, maybe you decide to group sock multi-packs into an entirely different category, or even push automation further by designing programs that identify color from the product image itself. Still, it’s almost inevitable that certain elements must be cleaned manually—the goal is to minimize that with efficient, expansive algorithms.
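One cheap way to automate the multi-pack case, sketched below under the same assumed color vocabulary: if a title mentions more than one color word, label the listing “assorted” instead of guessing, and flag titles with no color word at all for manual review.

```python
# Sketch of an "assorted" fallback for multi-pack listings.
# The vocabulary and labels are illustrative assumptions.

COLOR_WORDS = {"blue", "gray", "grey", "red", "green",
               "pink", "purple", "black", "white"}

def infer_color_with_fallback(title):
    """Single color word -> that color; several -> 'assorted';
    none -> None, i.e. route the record to manual cleaning."""
    found = {w for w in title.lower().split() if w in COLOR_WORDS}
    if len(found) > 1:
        return "assorted"   # likely a multi-pack
    if len(found) == 1:
        return found.pop()
    return None             # flag for manual review

print(infer_color_with_fallback("Fuzzy Socks 6-Pack Pink Blue Green"))  # -> assorted
print(infer_color_with_fallback("Cozy Gray Fuzzy Socks"))               # -> gray
```

This keeps the manual queue small: human effort is spent only on records the heuristic genuinely cannot classify.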
Quad Analytix has gone beyond Amazon’s fuzzy sock sphere to extract data across thousands of different merchants, individual sellers, and information types that range from product descriptions to promotion patterns to customer review records. Over time, refining how we clean data has let us build analyses and insights on as solid a foundation as possible. Our ultimate goal is to help you better understand your competitive landscape online, so you’ll always be the first to hear about trending sock colors!
Have questions or want to learn more? Request a demo!