Products on eCommerce sites are typically accessed through shelf navigation of a merchant’s taxonomy. Studies have shown that typically 8–10% of product URLs available on a given shelf are misclassified. Could this be improved via machine learning to deliver end-users a better browsing experience? What about an end-user uploading a product to a marketplace such as eBay?
Can machines find the best fit within the marketplace’s taxonomy, so buyers can locate the product more easily? Given the lack of standardization in implementation and content/style across the vastness of the eCommerce web, taxonomies of varying morphologies, and the subjective placement of products across shelves, can we use machines to classify products effectively into a canonical taxonomy at industrial scale (hundreds of merchants, several hundred million SKUs, and thousands of product categories)?
Last month, Quad (now Wiser Solutions, Inc.) presented a well-received session at Strata + Hadoop World 2016 in San Jose about large-scale product classification using both text- and image-based signals. Here are our key learnings and our approach to taming this challenging problem.
Exploit Multiple Signals on Product Pages
Textual signals included titles and breadcrumbs (product descriptions and product recommendations were omitted because, as larger blocks of unstructured text, they proved too noisy).
The title of the product or SKU, typically found on the product page, is meant to be a concise description of what the product is, and hence carries significant semantic content useful for classification.
The breadcrumb typically describes the taxonomic location of the product within the specific merchant’s taxonomy. It is a classification label that can be used to map against a canonical taxonomy.
Image signals are from the product thumbnails on the product page. There may be one or many per page. These are collected along with the text titles for analysis.
Employ Multiple Algorithms, Progressively
A bag-of-words approach is essentially a sparse vector of word-occurrence counts, that is, a sparse histogram over the vocabulary. Its weakness is that it is purely keyword-oriented, with no regard for semantic meaning. As illustrated, a bag-of-words algorithm would put shorts and rugs in close proximity simply because the keyword “runner” appears in both titles.
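The keyword-overlap failure mode can be sketched in a few lines. This is a minimal, stdlib-only illustration (the product titles below are made up for the example); a production system would use a proper tokenizer and a fixed vocabulary.

```python
from collections import Counter
import math

def bow_vector(title):
    """Tokenize a title into a sparse histogram of word counts."""
    return Counter(title.lower().split())

def cosine_sim(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two unrelated products whose titles share the keyword "runner".
rug = bow_vector("hallway runner rug beige")
shorts = bow_vector("marathon runner shorts blue")

# Bag-of-words sees non-trivial similarity between a rug and shorts.
print(round(cosine_sim(rug, shorts), 2))  # 0.25
```

Because the representation has no notion of meaning, the single shared token “runner” is enough to pull a rug and a pair of shorts together.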
Word2Vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2Vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. The output of the Word2Vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words. We tried this in two ways.
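Querying such a vocabulary for word relationships can be sketched as follows. The three-dimensional vectors below are hand-made stand-ins for illustration only; real Word2Vec embeddings would be learned from a large product-title corpus and have hundreds of dimensions.

```python
import math

# Hypothetical toy embeddings standing in for trained Word2Vec output.
vectors = {
    "shorts": [0.9, 0.1, 0.2],
    "pants":  [0.8, 0.2, 0.1],
    "rug":    [0.1, 0.9, 0.7],
    "runner": [0.4, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def most_similar(word):
    """Query the vocabulary for the nearest other word by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("shorts"))  # "pants" under these toy vectors
```

Unlike bag-of-words, the similarity here comes from vector geometry rather than shared keywords: “shorts” lands near “pants” even though the two strings share no tokens.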
First, nearest neighbor classification. Formally, the nearest-neighbor search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q.
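Under that definition, 1-nearest-neighbor classification reduces to a minimum over distances. A minimal Euclidean sketch (the 2-d embeddings and taxonomy labels below are hypothetical; real queries would run over high-dimensional Word2Vec title embeddings with an index structure for speed):

```python
import math

def nearest_neighbor(S, q):
    """Given a set S of points and a query q, return the closest point in S."""
    return min(S, key=lambda p: math.dist(p, q))

def classify(labeled, q):
    """1-NN classification: return the label of the nearest labeled point."""
    point, label = min(labeled, key=lambda pl: math.dist(pl[0], q))
    return label

# Hypothetical title embeddings tagged with canonical-taxonomy labels.
labeled = [
    ((0.9, 0.1), "Apparel > Shorts"),
    ((0.1, 0.9), "Home > Rugs"),
]
print(classify(labeled, (0.8, 0.3)))  # nearer the shorts embedding
```

A brute-force scan like this is O(|S|) per query; at the scale discussed here (hundreds of millions of SKUs), an approximate nearest-neighbor index would be needed in practice.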
Second, support vector machine (SVM) classification, with the choice of Fisher vector pooling rather than simpler approaches like max pooling or average pooling. SVM is a discriminative classifier formally defined by a separating hyper-plane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyper-plane which categorizes new examples.
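The decision rule of a trained linear SVM is just a sign test against that hyperplane. A minimal sketch, with a hand-picked weight vector and bias standing in for values that training (e.g. with an off-the-shelf SVM library) would actually learn:

```python
def svm_predict(w, b, x):
    """Linear SVM decision rule: which side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned hyperplane separating two categories in 2-d feature space.
w, b = [1.0, -1.0], 0.0
print(svm_predict(w, b, [0.9, 0.2]))  # +1 side of the hyperplane
print(svm_predict(w, b, [0.1, 0.8]))  # -1 side of the hyperplane
```

The pooling choice mentioned above only affects how the per-word vectors are aggregated into the feature vector x; the decision rule itself stays this simple.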
While we saw directional improvements in accuracy, the approach still suffered from the fact that the text (product titles or breadcrumbs) can be sparse and cryptic, as in these examples: “Men’s Packed Out II Omni-Heat” and “Primaloft Quilted Bomber.”
We also used a convolutional neural network (CNN, or ConvNet) on images. A CNN is a type of feed-forward artificial neural network inspired by the organization of the animal visual cortex, whose individual neurons are arranged so that they respond to overlapping regions tiling the visual field. Compared to other image-classification algorithms, convolutional neural networks use relatively little pre-processing: the network itself learns the filters that in traditional algorithms were hand-engineered. This lack of dependence on prior knowledge and human effort in designing features is a major advantage.
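What a convolutional filter does can be shown in a few lines. The sketch below applies a hand-written vertical-edge filter to a tiny grayscale image; a CNN learns thousands of such filters from data instead of having them designed by hand. (The image and kernel values are illustrative only.)

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most CNN libraries)."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A hand-written vertical-edge detector; a CNN would learn filters like this.
edge = [[1, 0, -1],
        [1, 0, -1],
        [1, 0, -1]]

# A bright region on the left, dark on the right: one vertical edge.
image = [[1, 1, 1, 0, 0],
         [1, 1, 1, 0, 0],
         [1, 1, 1, 0, 0]]

# Zero response in the flat region, strong response around the edge.
print(conv2d(image, edge))  # [[0, 3, 3]]
```

Stacking many learned filters, interleaved with nonlinearities and pooling, is what lets a ConvNet build up from edges to textures to whole-object features.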
Our approach was to resume training, via Caffe (an open-source deep-learning framework), of AlexNet (pre-trained on 14 million images from ImageNet, organized via the WordNet hierarchy, where each node has more than 5,000 images), using 100,000 curated eCommerce images organized via our canonical taxonomy, where each node has approximately 50 to 1,000 images.
This approach did work well in many situations, although less so when confronted with gender signals and with images that closely resemble one another, as in the examples here.
Intelligent Classifier Fusion
Given the pros and cons of each approach outlined above, our solution for enhanced product classification is to exploit multiple modalities and fuse the various classifiers in an intelligent way. This optimizes classification with regard to both recall (the percentage of cases in which we arrive at a confident answer) and precision (a measure of accuracy, i.e., whether a human agrees with the machine).
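These two metrics, as defined here, can be computed directly from a batch of classifier outputs. A minimal sketch (the category labels below are hypothetical); note that “recall” in this sense is coverage, the share of cases with a confident answer, while precision is measured only over the answered cases:

```python
def coverage_and_precision(results):
    """results: list of (confident_answer_or_None, human_label) pairs."""
    answered = [(a, h) for a, h in results if a is not None]
    recall = len(answered) / len(results)  # share of cases with a confident answer
    precision = (sum(a == h for a, h in answered) / len(answered)
                 if answered else 0.0)     # share of answers a human agrees with
    return recall, precision

# Hypothetical batch: three confident answers (one wrong), one abstention.
results = [("Rugs", "Rugs"), ("Shorts", "Shorts"),
           ("Rugs", "Mats"), (None, "Lamps")]
print(coverage_and_precision(results))  # (0.75, 0.666...)
```

Tuning the fusion rule trades one metric against the other: abstaining more often lowers recall but raises precision.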
Choosing how to combine classifiers is more heuristic than exact. Given our focus on high-quality data, with minimal to no errors, our goal is a very high-precision classifier, even at the expense of recall. An approach that looks for mutual agreement among classifiers provided the best results in terms of accuracy without significant loss of recall. Formally, this can be represented as follows: (Bag-Of-Words && Word-2-Vec) || (Bag-Of-Words && CNN).
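That agreement rule can be sketched directly (the category labels are hypothetical; `None` stands for a classifier that produced no confident answer):

```python
def fuse(bow, w2v, cnn):
    """Accept a label only under mutual agreement:
    (Bag-Of-Words && Word-2-Vec) || (Bag-Of-Words && CNN);
    otherwise abstain by returning None."""
    if bow is not None and bow == w2v:
        return bow
    if bow is not None and bow == cnn:
        return bow
    return None

print(fuse("Rugs", "Rugs", "Mats"))    # BoW and Word2Vec agree: "Rugs"
print(fuse("Rugs", "Mats", "Rugs"))    # BoW and CNN agree: "Rugs"
print(fuse("Rugs", "Mats", "Shorts"))  # no agreement: None
```

Cases where `fuse` abstains are exactly the ones routed to crowd-based classification in the next step.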
When further integrated with crowd-based classification (for cases in which algorithmic methods don’t return a high-confidence response), an even higher degree of precision can be reached.
Promising Results and Future Refinements
Our approach, outlined in the figure below, has yielded significant success. Next steps include expanding the training set of images and improving fusion algorithms by incorporating a feedback loop from production data to training data. Also, a layer of rule-based classification could be introduced to overcome challenges such as confusing gender signals.
Thus, we aim to achieve even more impressive recall and precision numbers (closer to 100 percent), arguably making this implementation the most advanced eCommerce classification system in the world. A comparison with Google Shopping reveals that its classification algorithm has upwards of a 35 percent misclassification rate.
We believe you cannot reach deep insights that can be trusted unless the journey from unstructured to structured data is carefully managed, avoiding any signal loss or corruption. Every attribute collected is cross-checked via a system with multiple algorithms and human verification.
Photo by Eric Fischer, https://flic.kr/p/a8fJsc
Contributing Writer: Sreeni Iyer