CA684 Machine Learning Assignment
Dublin City University has teamed up with leading online fashion retailer Zalando to create
the 2022 CA684 Machine Learning assignment.
Introduction
As a customer proposition, Zalando strives for “trustworthy” prices. That is, the company
wants to offer competitive prices in each of its dynamic market environments, to relieve its
customers of having to compare prices, and to drive revenue growth. In order to do that
for its hundreds of thousands of individual products, Zalando needs to identify exact product
matches across the relevant European competitors.
A very similar use case exists at stores like Amazon or Walmart, which allow multiple sellers
to offer the same product on their platform: identical products need to be grouped together,
even when the names, descriptions, images, etc. are not exactly the same.
Challenge
Barcode systems like EAN allow for unique identification of every product. Unfortunately,
reliable EAN information is not always available. Zalando uses multi-modal data to solve the
problem, relying on images and text. For this challenge, we ask you to make intelligent
use of text data (such as product titles, colors and descriptions). As these are not
standardized, and often manually written / changed for marketing purposes, matching
products is a non-trivial task.
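To illustrate why free-text fields are hard to compare, here is a minimal sketch of one possible baseline: normalize two titles and measure their token overlap. This is not Zalando's method and not part of the assignment material. The function names `normalize` and `jaccard` are hypothetical, and the second title in each call is an invented example.

```python
import re
import unicodedata


def normalize(text: str) -> set[str]:
    """Lowercase, strip accents and punctuation, and return the set of tokens."""
    # Decompose accented characters (e.g. German umlauts) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only runs of ASCII letters and digits as tokens.
    return set(re.findall(r"[a-z0-9]+", without_accents))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]; 0.0 when neither string has tokens."""
    tokens_a, tokens_b = normalize(a), normalize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Same words in a different order and casing: full overlap.
print(jaccard("White Nike tennis top", "NIKE Tennis-Top, white"))
# No shared tokens: no overlap.
print(jaccard("White Nike tennis top", "Blue Adidas running shoe"))
```

A score like this ignores word order and synonyms (“ocean” versus “light blue”), which is exactly where a learned model is expected to do better.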
This challenge has a direct business impact for a retailer like Zalando. It is also closely
related to many other problems, like record deduplication in heterogeneous catalogues,
document retrieval, and many more.
Dataset
The dataset will contain files as follows. Two files containing offers of products, for training
and testing respectively, with the following fields:
Label        Description
offer_id     unique identifier for an offer of a product (i.e. a product x shop
             combination, where we don’t know the product component)
shop         “zalando”, “aboutyou”
lang         “de” (German)
brand        e.g. “Nike” - note that different `shop`s might have different `brand`
             nomenclature
color        e.g. “blue” - note that there may be more than one color (ordering
             matters) and that different `shop`s might have different `color`
             nomenclature (“ocean”, “light blue”, “...”)
title        e.g. “White Nike tennis top”
description  a long product description that may contain material composition,
             cleaning instructions, etc.
price        price in euro without any discount
url          url of the product description page
image_urls   list of product images, such as stock photos, photos with a model,
             lifestyle photos, or close-ups
A separate file containing the matches between offers that describe the same product,
referenced by offer id. This is only provided for the training offers.
Label        Description
zalando      offer_id from “zalando” shop
aboutyou     offer_id from “aboutyou” shop
brand        unique identifier for the brand representing the match
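The matches file can be combined with the training offers to produce labelled pairs for a classifier. The sketch below shows one possible way to do this on in-memory rows. The field names `offer_id`, `shop`, `title`, `zalando` and `aboutyou` come from the tables above; the sample rows, the offer ids and the function name `labelled_pairs` are invented for illustration, and the real file format is not assumed here.

```python
# Hypothetical sample rows; the real offers carry more fields and real ids.
offers = [
    {"offer_id": "z1", "shop": "zalando", "title": "White Nike tennis top"},
    {"offer_id": "z2", "shop": "zalando", "title": "Blue running shoe"},
    {"offer_id": "a1", "shop": "aboutyou", "title": "Nike Tennis-Top white"},
    {"offer_id": "a2", "shop": "aboutyou", "title": "Black leather belt"},
]
matches = [{"zalando": "z1", "aboutyou": "a1"}]


def labelled_pairs(offers, matches):
    """Yield (zalando_title, aboutyou_title, label) for every cross-shop pair.

    label is 1 when the pair appears in the matches rows, otherwise 0.
    """
    known = {(m["zalando"], m["aboutyou"]) for m in matches}
    zalando = [o for o in offers if o["shop"] == "zalando"]
    aboutyou = [o for o in offers if o["shop"] == "aboutyou"]
    for z in zalando:
        for a in aboutyou:
            label = int((z["offer_id"], a["offer_id"]) in known)
            yield z["title"], a["title"], label


pairs = list(labelled_pairs(offers, matches))
for pair in pairs:
    print(pair)
```

Enumerating every cross-shop pair grows quadratically with the number of offers, so on the full dataset the negative pairs would likely need to be sampled or restricted to plausible candidates.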
The task is to predict the matches for the test offers so as to maximize the F1 metric. See
the getting-started notebook for further details.
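As a rough illustration of the metric, the sketch below computes F1 over sets of predicted and true (zalando, aboutyou) offer-id pairs. It assumes each match is scored as one pair; the exact evaluation used for the assignment is defined in the getting-started notebook and may differ. The function name `pair_f1` and the offer ids are hypothetical.

```python
def pair_f1(predicted, actual):
    """F1 score over two collections of (zalando_offer_id, aboutyou_offer_id) pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # Also covers empty inputs, avoiding a division by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ids: one of two predictions is right, and one of three true matches is found.
actual = [("z1", "a1"), ("z2", "a2"), ("z3", "a3")]
predicted = [("z1", "a1"), ("z2", "a9")]
print(round(pair_f1(predicted, actual), 3))
```

Here precision is 1/2 and recall is 1/3, so predicting fewer but more reliable pairs trades recall for precision, and F1 rewards a balance of the two.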
Happy Hacking!
Disclaimer: the dataset should only be used for the purposes of the assignment and in no case
should be shared or distributed elsewhere.