November 04, 2025

Learning to Rank – To the Moon and Back(-propagation): Deep Neural Networks That Learn to Rank What You Love

Introduction: Learning to Rank at OTTO

Imagine yourself searching and instantly finding exactly what you are looking for. No scrolling, no frustration. Just hits.

What sounds like a moonshot is the daily mission that drives our search teams. By using machine learning, we can learn from millions of user interactions what our customers click, what they buy and what they love.

To optimize OTTO’s ranking function and search relevance, we use Learning to Rank (LTR) – a method for improving ranking performance. Unlike rule-based algorithms, LTR models learn product relevance for a query by analyzing user interactions tied to that query. You can learn about the basic principles of LTR in our earlier blog posts.

Our previous LTR model was a Gradient-Boosted Decision Tree (GBDT). As the e-commerce LTR research community increasingly demonstrated the potential of Deep Neural Networks (DNNs) to outperform GBDT-based models, we recognized an opportunity to improve our search relevance. Driven by the goal of enhancing our customers' search journey, we therefore decided to transition from our reliable GBDT model to a Deep Neural Network (DNN).

Why Deep Learning for Learning to Rank?

After numerous iterations of feature engineering for our GBDT ranking model, we had reached a point of diminishing returns. Each improvement added complexity to our tech stack, while the performance gains steadily decreased. To overcome this plateau, we shifted our focus from manual feature engineering to the model’s architecture.

Deep neural networks offer a versatile and scalable framework for innovation. They also significantly reduce the need for manual feature engineering, due to their ability to learn relevant features directly from the data. In this blog post, we’re excited to share our latest insights into optimizing ranking with a deep neural network.

Data Science Focus

This post focuses on the data science innovation behind our LTR models. If you want to learn more about the software engineering part, stay tuned for our second blog post! You will learn about our use of Clojure to build scalable big data pipelines and uncover its connection to space travel.

Two men looking at each other and the saying "We need to go deeper"

Ready to go even deeper into the technical intricacies of our LTR architecture? Our comprehensive paper provides all the details this blog post builds upon.

Our LTR Training Data: The basis of the model

Data is the foundation of machine learning, providing the necessary input for models to identify patterns and generalize to new scenarios. In the context of Learning to Rank for search, data enables the models to identify the relevance of products for a given query and optimize the ranking of search results. Our LTR model is trained on anonymized user interaction logs, leveraging three primary data categories:

  • Query as Context: The search query serves as the primary contextual signal, representing user intent. Additional contextual signals such as user behavior and device type may also be included.
  • Product Features: Numerical (e.g., price), categorical (e.g., brand), and textual (e.g., product descriptions) attributes of products are used to assess their relevance to the user’s query. These features are essential for enabling the model to compare and rank products effectively.
  • Clicks and Orders: Binary signals indicating whether a product was clicked or purchased are treated as implicit relevance labels. These interactions allow the model to infer user preferences and assign relevance scores to products, enabling it to learn a ranking that meets user expectations.
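
To make these three categories concrete, here is a purely illustrative sketch of what one anonymized training example could look like. The field names and values are ours for illustration, not OTTO's actual schema.

```python
# Purely illustrative (not OTTO's actual schema): one anonymized training example groups a
# query context with its candidate products and the implicit relevance labels.
training_example = {
    "query": {"tokens": ["wireless", "headphones"], "device_type": "mobile"},
    "candidates": [
        {
            "product_features": {
                "price": 79.99,                                 # numerical
                "brand": "acme",                                # categorical
                "title": "acme wireless over-ear headphones",   # textual
            },
            "clicked": 1,   # implicit relevance labels
            "ordered": 0,
        },
        # ... further candidate products shown for the same query
    ],
}
```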

Two-Tower Architecture: the core of our Learning-to-Rank model

After extensive experimentation, we selected a Two-Tower Deep Neural Network (DNN) architecture for our Learning to Rank task. This approach balances performance and scalability. It outperforms our previous Gradient-Boosted Decision Tree (GBDT) model in key metrics like clicks and revenue during offline experiments and an online A/B test. 

The Two-Tower DNN consists of two neural networks that process different input data types and combine their outputs to compute a relevance score for each query-product pair.

  • Query-Tower: The Query-Tower processes the search query by transforming its tokens into dense vectors using an embedding layer. The token embeddings are summed to produce a single dense vector that encodes the context of the query. Additional signals, such as user behavior or device type, can be included to enrich this contextual representation.
  • Product-Tower: The Product-Tower processes product features such as numerical attributes (e.g., price), categorical attributes (e.g., brand), and textual attributes (e.g., product title). Numerical features are normalized, categorical features are transformed into dense vectors using an embedding layer, and textual features are processed similarly to the query. These processed features are passed through a deep neural network, which outputs a dense vector representing a product.

The dense vectors produced by the Query-Tower and Product-Tower are combined through a dot product, resulting in a relevance score that quantifies how well each product matches the query.
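
As a rough illustration of this structure, here is a minimal PyTorch sketch of a two-tower scorer. The framework, layer sizes, and feature handling are simplifying assumptions for this post, not our production implementation.

```python
import torch
import torch.nn as nn


class TwoTowerRanker(nn.Module):
    """Minimal two-tower sketch: a query tower and a product tower meet in a dot product."""

    def __init__(self, vocab_size: int, num_product_features: int, embed_dim: int = 512):
        super().__init__()
        # Query tower: token embeddings are summed into one dense query vector
        # (padding handling is omitted for brevity).
        self.query_embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        # Product tower: a small feed-forward network over preprocessed product features.
        self.product_tower = nn.Sequential(
            nn.Linear(num_product_features, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, query_token_ids: torch.Tensor, product_features: torch.Tensor) -> torch.Tensor:
        # query_token_ids: (batch, query_len); product_features: (batch, num_candidates, num_product_features)
        h_c = self.query_embedding(query_token_ids)   # (batch, embed_dim)
        h_p = self.product_tower(product_features)    # (batch, num_candidates, embed_dim)
        # One relevance score per query-product pair via a dot product.
        return torch.einsum("bd,bnd->bn", h_c, h_p)   # (batch, num_candidates)
```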

To provide a more detailed understanding, the following subsection will walk through each step of the model architecture.

Step-by-Step Explanation of the Model Architecture

This subsection provides a comprehensive overview of our LTR model's architecture, as illustrated in the following figure.

Illustration of a Two-Tower Learning-to-Rank Architecture

1) Feature Engineering
Feature engineering preprocesses our diverse data with three type-specific approaches:

  • Numerical Features are normalized using techniques like z-score normalization for light-tailed distributions or power-law normalization for right-skewed distributions to ensure consistent scaling.
  • Categorical Features are transformed into indices, which serve as inputs to embedding layers in subsequent steps.
  • Textual Features are represented as bag-of-words vectors, which capture the presence of specific words in attributes like product titles or descriptions.
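
The following Python sketch illustrates these three preprocessing paths under simplified assumptions; our production pipeline differs in the details.

```python
import numpy as np


def z_score(values: np.ndarray) -> np.ndarray:
    """Z-score normalization for roughly light-tailed numerical features."""
    return (values - values.mean()) / (values.std() + 1e-8)


def power_law_normalize(values: np.ndarray) -> np.ndarray:
    """Log transform to compress right-skewed, power-law-like features."""
    return np.log1p(values)


def encode_categorical(values: list, vocabulary: dict) -> list:
    """Map categorical values (e.g. brands) to integer indices for later embedding layers; 0 = unknown."""
    return [vocabulary.get(value, 0) for value in values]


def bag_of_words(text: str, vocabulary: dict) -> np.ndarray:
    """Count word occurrences over a fixed vocabulary for textual attributes like product titles."""
    vector = np.zeros(len(vocabulary), dtype=np.float32)
    for token in text.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1.0
    return vector
```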

2.1a) Query-Embedding
The Query-Tower processes the search query and contextual signals. Queries are tokenized, and their tokens are transformed into dense vectors using an embedding layer.

2.2a) Aggregation
These embedded query tokens are summed to produce a single dense vector that encapsulates the query's context.
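
A minimal PyTorch sketch of this embed-and-sum step; the vocabulary size and embedding dimension are illustrative choices.

```python
import torch
import torch.nn as nn

# Illustrative sizes; token id 0 is assumed to be the padding token.
vocab_size, embed_dim = 50_000, 512
token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

query_token_ids = torch.tensor([[412, 7, 2981, 0]])   # one tokenized query, padded with id 0
token_vectors = token_embedding(query_token_ids)      # (1, 4, embed_dim): one vector per token
h_c = token_vectors.sum(dim=1)                        # (1, embed_dim): summed query embedding
```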

2.1b) Product-Encodings
In the Product-Encodings step, all features are transformed and merged into a dense vector representation for each product.

  • Numerical Features are normalized during the feature engineering phase and used as is.
  • Categorical Features are converted into dense vectors through embedding layers.
  • Textual Features, represented as Bag-of-words vectors, are passed through embedding layers, where the embeddings of individual words are summed into a dense representation.

The resulting embeddings and normalized numerical features are concatenated into a single dense vector, which is then fed into the deep neural network.
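
The sketch below shows how such a concatenated product encoding could be assembled in PyTorch; the feature choices and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative feature choices and sizes for a single product.
num_brands, vocab_size, embed_dim = 10_000, 50_000, 64

brand_embedding = nn.Embedding(num_brands, embed_dim)
title_embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")  # sums the word embeddings of the title

price_normalized = torch.tensor([[0.37]])          # numerical feature, already normalized
brand_index = torch.tensor([421])                  # categorical feature as an index
title_token_ids = torch.tensor([[17, 950, 3301]])  # token ids behind the bag-of-words title representation

product_encoding = torch.cat(
    [
        price_normalized,                  # (1, 1)
        brand_embedding(brand_index),      # (1, embed_dim)
        title_embedding(title_token_ids),  # (1, embed_dim)
    ],
    dim=-1,
)  # (1, 129): the concatenated input vector for the Product-Tower DNN
```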

2.2b) Deep Neural Network
The DNN in the Product-Tower is the component responsible for transforming the dense product vectors from the Product-Encodings step into compact embeddings. These embeddings are designed to capture the rich feature interactions necessary for downstream relevance scoring.

Network Architecture:
The Product-Tower processes the concatenated feature vectors through a series of fully connected layers. Each layer applies a structured sequence of operations:

  1. Fully Connected Layer: The input is linearly transformed and projected into a new representation space.
  2. Activation: A ReLU activation function is applied to introduce non-linearity.
  3. Normalization: Layer Normalization is used to stabilize training and improve convergence.
  4. Skip Connections: The input of each layer is added to its output, ensuring efficient gradient flow and enabling deeper networks to train effectively.

The network consists of three such layers, each building upon the previous one to refine the product embeddings. The final output is a dense vector of fixed size (512 dimensions), representing the product in a way that facilitates relevance scoring when combined with the query embeddings.

The Two-Tower model is trained using the Adam optimizer. The best-performing Product-Tower configuration consists of 3 layers, a hidden size of 1024, and a dropout rate of 0.
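
The sketch below illustrates this layer structure and training setup in PyTorch. The exact placement of the normalization relative to the skip connection, the input projection, and the learning rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One Product-Tower layer: Linear -> ReLU -> LayerNorm, plus a skip connection."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The layer input is added to its output so gradients flow through deeper stacks.
        return x + self.norm(torch.relu(self.linear(x)))


class ProductTower(nn.Module):
    """Sketch of the Product-Tower DNN: 3 residual layers, hidden size 1024, 512-dimensional output."""

    def __init__(self, input_dim: int, hidden_size: int = 1024, output_dim: int = 512, num_layers: int = 3):
        super().__init__()
        # Assumption: the concatenated product encoding is first projected to the hidden size.
        self.input_projection = nn.Linear(input_dim, hidden_size)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden_size) for _ in range(num_layers)])
        self.output_projection = nn.Linear(hidden_size, output_dim)
        # The best configuration uses a dropout rate of 0, so no dropout layers are added here.

    def forward(self, product_encoding: torch.Tensor) -> torch.Tensor:
        return self.output_projection(self.blocks(self.input_projection(product_encoding)))


model = ProductTower(input_dim=129)  # input_dim is illustrative (matches the toy encoding above)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam as stated; the learning rate is an assumption
```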

3) Dot-Product

The final step in the Two-Tower architecture combines the outputs of the Query-Tower and the Product-Tower to compute a relevance score for each product. This is achieved through a dot product operation, which quantifies the alignment between query and product embeddings. The query embedding, generated by the Query-Tower, represents the context of the user’s search intent, while the product embedding, produced by the Product-Tower, encodes the features of a specific product. The dot product between these two embeddings results in a scalar score that reflects how well the product matches the query.

Mathematically, this can be expressed as:

s = h_c · h_p

Here, h_c is the query (context) embedding and h_p is the product embedding.
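
In code, this scoring step is a single dot product per candidate; the sketch below uses illustrative shapes.

```python
import torch

# Illustrative shapes: one query embedding against its 10 candidate product embeddings.
h_c = torch.randn(512)       # query (context) embedding from the Query-Tower
h_p = torch.randn(10, 512)   # candidate product embeddings from the Product-Tower

scores = h_p @ h_c           # s_i = h_c · h_p_i: one raw relevance score per candidate
```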

4) Softmax
Once the relevance scores are computed for all candidate products per query using the dot product, the next step is to normalize these scores into probabilities. This is achieved through the Softmax function, which converts raw scores into a probability distribution. For a given query, the probability p_i of selecting product i from the candidate set is calculated as:

p_i = exp(s_i) / ∑_{j=1}^{n} exp(s_j)

Here, s_i represents the raw relevance score for product i, and n is the total number of candidate products for that query. This normalization step is crucial for training the model using the Cross-Entropy (CE) loss function, as it allows the model to compare the predicted probabilities against the ground truth labels.
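
A small sketch of this normalization step. The padding mask for queries with fewer candidates is an implementation detail we assume here, not something spelled out above.

```python
import torch

# Raw scores for one query's candidate list; the last slot is padding because candidate
# counts vary between queries (the mask is an assumed implementation detail).
scores = torch.tensor([[2.1, 0.4, -1.3, 0.0]])
mask = torch.tensor([[True, True, True, False]])

masked_scores = scores.masked_fill(~mask, float("-inf"))
probabilities = torch.softmax(masked_scores, dim=-1)  # p_i = exp(s_i) / sum_j exp(s_j)
```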

5) Cross Entropy Loss
The Cross Entropy (CE) Loss is the final step in training our Learning to Rank model. It measures the difference between the predicted probabilities, generated by the Softmax function, and the ground truth probabilities derived from actual user interactions, such as clicks or orders. By treating these interactions as implicit relevance labels and transforming them into a probability distribution, the CE-Loss guides the model to align its rankings with user preferences.

In the following figure, the red bars represent the model's predicted relevance probabilities, and the blue bars indicate the actual ground truth relevance derived from user interactions. The Cross Entropy Loss measures the discrepancy between these two distributions. For highly relevant products (like Products 1 and 2, where the blue bars are high), the loss encourages the model to increase its predicted probability (red bar). Conversely, for irrelevant products (like Products 3-10, where the blue bars representing ground truth relevance are zero), the loss penalizes high predicted probabilities, driving them down.

This mechanism ensures the model learns to assign high probabilities to relevant items and low probabilities to irrelevant ones.

Illustration of the Cross Entropy Loss in the LTR Model – Comparison of predicted and actual relevance scores

Mathematically, the CE Loss is defined as:

L_CE = -∑_{i=1}^{n} ỹ_i · log(p_i)

Here, ỹ_i is the normalized ground truth label for product i, and p_i is the predicted probability from the Softmax function. By minimizing this loss, the model learns to assign higher probabilities to products that users are more likely to interact with, improving the overall ranking quality.
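
Putting the last two steps together, here is a hedged sketch of the listwise loss: the interaction signals are normalized into a target distribution and compared against the log-Softmax of the raw scores. The numbers are illustrative.

```python
import torch

# Implicit relevance labels (e.g. clicks) for one query's candidates ...
interactions = torch.tensor([[1.0, 1.0, 0.0, 0.0]])
# ... normalized into the ground truth distribution ỹ.
targets = interactions / interactions.sum(dim=-1, keepdim=True)

# log p_i from the raw dot-product scores of the same candidates.
log_probs = torch.log_softmax(torch.tensor([[2.1, 0.4, -1.3, -0.7]]), dim=-1)

ce_loss = -(targets * log_probs).sum(dim=-1).mean()  # L_CE = -∑ ỹ_i · log(p_i)
```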

Results and Impact

During an 8-week A/B test, our neural ranking model delivered:

  • +1.86% increase in clicks
  • +0.56% lift in revenue

These improvements show that our search results have become more relevant and engaging for users. Based on these strong results, we decided to roll out the new model to all users, making searching on OTTO feel even more like magic.

Outlook

With the launch of our new LTR model, the next rocket is already on the launchpad, heading towards personalization. By integrating user-specific signals into the model, we aim to refine rankings to align even more closely with individual preferences. This step toward personalization has the potential to revolutionize the search experience, making it not only highly relevant but also uniquely tailored to each user. We've built a strong, versatile foundation for future improvements, and we're excited about the possibilities. Stay tuned as we remain committed to delivering the best possible experience for our users.

Written by

Team Jarvis
Developer Team @ OTTO
