Learning to Rank – To the Moon and Back(-propagation): Deep Neural Networks That Learn to Rank What You Love
Imagine yourself searching and instantly finding exactly what you are looking for. No scrolling, no frustration. Just hits.
What sounds like a moonshot is the daily mission that drives our search teams. By using machine learning, we can learn from millions of user interactions what our customers click, what they buy and what they love.
To optimize OTTO’s ranking function and search relevance, we use Learning to Rank (LTR) – a method for improving ranking performance. Unlike rule-based algorithms, LTR models learn product relevance for a query by analyzing user interactions tied to that query. You can learn about the basic principles of LTR in our earlier blog posts.
Our previous LTR model was a Gradient-Boosted Decision Tree (GBDT). As the e-commerce LTR research community increasingly demonstrated the potential of Deep Neural Networks (DNNs) to outperform GBDT-based models, we recognized an opportunity to improve our search relevance. Driven by the goal of enhancing our customers' search journey, we therefore decided to transition from our reliable GBDT model to a Deep Neural Network (DNN).
After numerous iterations of feature engineering for our GBDT ranking model, we’ve reached a point of diminishing returns. Each improvement has added complexity to our tech stack, while the performance gains have steadily decreased. To overcome this plateau, we’ve shifted our focus from manual feature engineering to the model’s architecture.
Deep neural networks offer a versatile and scalable framework for innovation. They also significantly reduce the need for manual feature engineering, due to their ability to learn relevant features directly from the data. In this blog post, we’re excited to share our latest insights into optimizing ranking with a deep neural network.
This post focuses on the data science innovation behind our LTR models. If you want to learn more about the software engineering part, stay tuned for our second blog post! You will learn about our use of Clojure to build scalable big data pipelines and uncover its connection to space travel.

Ready to go even deeper into the technical intricacies of our LTR architecture? Our comprehensive paper provides all the details this blog post builds upon.
Data is the foundation of machine learning, providing the necessary input for models to identify patterns and generalize to new scenarios. In the context of Learning to Rank for search, data enables the model to learn how relevant products are for a given query and to optimize the ranking of search results. Our LTR model is trained on anonymized user interaction logs, leveraging three primary data categories.
After extensive experimentation, we selected a Two-Tower Deep Neural Network (DNN) architecture for our Learning to Rank task. This approach balances performance and scalability. It outperforms our previous Gradient-Boosted Decision Tree (GBDT) model in key metrics like clicks and revenue during offline experiments and an online A/B test.
The Two-Tower DNN consists of two neural networks that process different input data types and combine their outputs to compute a relevance score for each query-product pair.
The dense vectors produced by the Query-Tower and Product-Tower are combined through a dot product, resulting in a relevance score that quantifies how well each product matches the query.
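To make the structure concrete, here is a minimal sketch of the two-tower scoring idea (we use PyTorch here purely for illustration; the actual towers and dimensions are described step by step in the next section):

```python
import torch
import torch.nn as nn

class TwoTowerRanker(nn.Module):
    """Minimal sketch: two towers produce dense vectors, a dot product scores them."""

    def __init__(self, query_tower: nn.Module, product_tower: nn.Module):
        super().__init__()
        self.query_tower = query_tower      # maps query + context features to a dense vector
        self.product_tower = product_tower  # maps product features to a dense vector

    def forward(self, query_features, product_features):
        h_c = self.query_tower(query_features)      # query (context) embedding
        h_p = self.product_tower(product_features)  # product embedding
        # Relevance score: dot product between the two tower outputs
        return (h_c * h_p).sum(dim=-1)
```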
To provide a more detailed understanding, the following subsection will walk through each step of the model architecture.
This subsection provides a comprehensive overview of our LTR model's architecture, as illustrated in the following figure.

1) Feature Engineering
Feature engineering preprocesses our diverse data using three type-specific approaches.
2.1a) Query-Embedding
The Query-Tower processes the search query and contextual signals. Queries are tokenized, and their tokens are transformed into dense vectors using an embedding layer.
2.2a) Aggregation
These embedded query tokens are summed to produce a single dense vector that encapsulates the query's context.
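A minimal sketch of this step, assuming PyTorch and a padded token-ID input (the tokenizer, the contextual signals, and the embedding size are simplified here):

```python
import torch
import torch.nn as nn

class QueryTower(nn.Module):
    """Embeds query tokens and sums them into a single dense query vector."""

    def __init__(self, vocab_size: int, embedding_dim: int = 512, padding_idx: int = 0):
        super().__init__()
        # padding_idx maps padding tokens to a zero vector, so they do not affect the sum
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, max_query_len), already tokenized and padded
        embedded = self.token_embedding(token_ids)  # (batch, max_query_len, embedding_dim)
        return embedded.sum(dim=1)                  # (batch, embedding_dim) -- sum aggregation
```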
2.1b) Product-Encodings
In the Product-Encodings step, all features are transformed and merged into a dense vector representation for each product.
The resulting embeddings and normalized numerical features are concatenated into a single dense vector, which is then fed into the deep neural network.
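As an illustration, such an encoding step could look like this (embedding categorical features and batch-normalizing numerical ones is a simplifying assumption on our part; the actual feature set is richer):

```python
import torch
import torch.nn as nn

class ProductEncoder(nn.Module):
    """Embeds categorical product features, normalizes numerical ones,
    and concatenates everything into one dense vector per product."""

    def __init__(self, category_cardinalities: list[int], embedding_dim: int, num_numerical: int):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(cardinality, embedding_dim) for cardinality in category_cardinalities]
        )
        # Batch normalization is one possible way to normalize numerical features
        self.numerical_norm = nn.BatchNorm1d(num_numerical)

    def forward(self, categorical_ids: torch.Tensor, numerical: torch.Tensor) -> torch.Tensor:
        # categorical_ids: (batch, num_categorical), numerical: (batch, num_numerical)
        embedded = [emb(categorical_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        return torch.cat(embedded + [self.numerical_norm(numerical)], dim=-1)
```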
2.2b) Deep Neural Network
The DNN in the Product-Tower is the component responsible for transforming the dense product vectors from the Product-Encodings step into compact embeddings. These embeddings are designed to capture the rich feature interactions necessary for downstream relevance scoring.
Network Architecture:
The Product-Tower processes the concatenated feature vectors through a series of fully connected layers, each applying a structured sequence of operations.
The network consists of three such layers, each building upon the previous one to refine the product embeddings. The final output is a dense vector of fixed size (512 dimensions), representing the product in a way that facilitates relevance scoring when combined with the query embeddings.
The Two-Tower model is trained using the Adam optimizer. The best-performing Product-Tower configuration consists of 3 layers, a hidden size of 1024, and a dropout rate of 0.
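A sketch of such a tower with the configuration above (3 layers, hidden size 1024, dropout 0, 512-dimensional output); the ReLU activations and the learning rate are illustrative assumptions, not the exact production setup:

```python
import torch
import torch.nn as nn

def build_product_tower(input_dim: int, hidden_dim: int = 1024, output_dim: int = 512,
                        num_layers: int = 3, dropout: float = 0.0) -> nn.Sequential:
    """Stack of fully connected layers refining the product encoding into a compact embedding."""
    layers: list[nn.Module] = []
    dim = input_dim
    for _ in range(num_layers - 1):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        dim = hidden_dim
    layers.append(nn.Linear(dim, output_dim))  # final 512-dimensional product embedding
    return nn.Sequential(*layers)

# The full Two-Tower model is then trained with Adam, e.g. (learning rate is an assumption):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```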
3) Dot-Product
Mathematically, this can be expressed as:

s = h_c · h_p

Here, h_c is the query (context) embedding and h_p is the product embedding.
4) Softmax
Once the relevance scores are computed for all candidate products per query using the dot product, the next step is to normalize these scores into probabilities. This is achieved through the Softmax function, which converts raw scores into a probability distribution. For a given query, the probability p_i of selecting product i from the candidate set is calculated as:

p_i = exp(s_i) / Σ_{j=1}^{n} exp(s_j)

Here, s_i represents the raw relevance score for product i, and n is the total number of candidate products for that query. This normalization step is crucial for training the model using the Cross-Entropy (CE) loss function, as it allows the model to compare the predicted probabilities against the ground truth labels.
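In code, this is a standard softmax over the candidate scores of a single query (the scores below are illustrative):

```python
import torch

# Raw relevance scores s_i for one query and its n = 4 candidate products (illustrative values)
scores = torch.tensor([2.3, 1.1, -0.4, 0.7])
probabilities = torch.softmax(scores, dim=-1)  # p_i = exp(s_i) / Σ_j exp(s_j)
print(probabilities, probabilities.sum())      # the probabilities sum to 1
```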
5) Cross Entropy Loss
The Cross Entropy (CE) Loss is the final step in training our Learning to Rank model. It measures the difference between the predicted probabilities, generated by the Softmax function, and the ground truth probabilities derived from actual user interactions, such as clicks or orders. By treating these interactions as implicit relevance labels and transforming them into a probability distribution, the CE-Loss guides the model to align its rankings with user preferences.
In the following figure, the red bars represent the model's predicted relevance probabilities, and the blue bars indicate the actual ground truth relevance derived from user interactions. The Cross Entropy Loss measures the discrepancy between these two distributions. For highly relevant products (like Product 1 and 2, where blue bars are high), the loss encourages the model to increase its predicted probability (red bar). Conversely, for irrelevant products (like Product 3-10, where the blue bars representing ground truth relevance are zero), the loss penalizes high predicted probabilities, driving them down.
This mechanism ensures the model learns to assign high probabilities to relevant items and low probabilities to irrelevant ones.

Mathematically, the CE Loss is defined as:

CE = −Σ_{i=1}^{n} ỹ_i · log(p_i)

Here, ỹ_i is the normalized ground truth label for product i, and p_i is the predicted probability from the Softmax function. By minimizing this loss, the model learns to assign higher probabilities to products that users are more likely to interact with, improving the overall ranking quality.
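A small worked example of this training signal, assuming that interaction counts are normalized into soft labels (the exact label construction in production may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative example for a single query with 4 candidate products
scores = torch.tensor([[2.3, 1.1, -0.4, 0.7]])       # raw dot-product scores s_i
interactions = torch.tensor([[3.0, 1.0, 0.0, 0.0]])  # e.g. clicks per product (assumed counts)
soft_labels = interactions / interactions.sum(dim=-1, keepdim=True)  # normalized labels ỹ_i

log_probs = F.log_softmax(scores, dim=-1)                # log p_i from the Softmax step
ce_loss = -(soft_labels * log_probs).sum(dim=-1).mean()  # CE = -Σ_i ỹ_i · log(p_i)
print(ce_loss)  # decreases as the predicted distribution approaches the ground truth
```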
During an 8-week A/B test, our neural ranking model delivered:
• +1.86% increase in clicks
• +0.56% lift in revenue
These improvements show that our search results have become more relevant and engaging for users. Based on these strong results, we decided to roll out the new model to all users, making searching on OTTO feel even more like magic.
With the launch of our new LTR model, the next rocket is already on the launchpad, heading towards personalization. By integrating user-specific signals into the model, we aim to refine rankings to align even more closely with individual preferences. This step toward personalization has the potential to revolutionize the search experience, making it not only highly relevant but also uniquely tailored to each user. We've built a strong, versatile foundation for future improvements, and we're excited about the possibilities. Stay tuned as we remain committed to delivering the best possible experience for our users.
Want to be part of the team?

