Machine learning challenges at used car sales companies like Spinny

Gundeep Singh
5 min read · May 18, 2021
Photo by Samuele Errico Piccarini on Unsplash

If you have ever bought or sold a used car, then you know how hard it is to figure out the right price for the vehicle. The same problem is faced thousands of times a day at companies like Spinny.com, Cars24.com, Shift.com, etc. The whole business of such companies is buying and selling pre-owned cars, and each of them has its own pros, cons, and unique selling propositions.

Predicting a used car's price is one thing; predicting it at a scale of thousands of cars is a whole other level of challenge.

I was the first Data Scientist hired at Spinny, in September 2019, and had a chance to work on this problem from the ground up. Most business operations depended heavily on manual expertise and the estimates of domain experts. This was becoming the company's bottleneck and had to be automated as much as possible for faster, more efficient operations.

A Glance at the complexity of Business Operations

“Tip of an Iceberg”, Photo by Jonathan Cooper on Unsplash

A huge amount of structured but fragmented data was available throughout the company's Google Sheets. Some sheets had lists of brands, models, and variants with ex-showroom prices; some had the most-searched vehicles on the website, or the details of cars sold and the prices at which they were bought, the repair costs, etc.
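One recurring chore with such fragmented sheets is reconciling free-text names against a canonical catalogue. As a minimal sketch (the catalogue and inputs here are made up, and a real pipeline would use learned embeddings rather than string similarity), Python's standard-library `difflib` already gets you surprisingly far:

```python
# Sketch: mapping raw variant strings from scattered sheets onto a
# canonical catalogue. All names below are hypothetical examples.
import difflib

CATALOGUE = ["Swift VXI", "Swift ZXI", "i20 Sportz", "i20 Asta"]

def normalize_variant(raw, cutoff=0.6):
    """Return the closest canonical variant for a raw sheet entry,
    or None if nothing is similar enough."""
    lookup = {v.lower(): v for v in CATALOGUE}  # match case-insensitively
    matches = difflib.get_close_matches(raw.lower(), list(lookup),
                                        n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

print(normalize_variant("swift vxl"))   # typo still resolves to "Swift VXI"
print(normalize_variant("Tata Nexon"))  # not in the catalogue -> None
```

Entries that fall below the similarity cutoff are exactly the ones that still need the manual effort mentioned later in this post.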

Simple Automation

The first step was to semi-automate things using the sheets themselves. Each brand was assigned a deterioration rate per year or per 1,000 kilometers driven. For example, a high-end Maruti Suzuki car's price might decrease by INR 50,000 every 50,000 kilometers, and by another INR 50,000 for every year since its production year. In this way, very detailed breakpoints were created in Google Sheets to estimate the price of a car given its make, model, variant, and other conditions. While calculating these prices, costs like the repair cost and the ownership transfer cost also had to be factored in, depending on the class and price of the car.
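The breakpoint logic above can be sketched as a small deterministic function. All rates and prices below are illustrative placeholders, not Spinny's actual figures:

```python
# Sketch of the rule-based pricing encoded in the sheets: depreciate
# the ex-showroom price per year and per 50,000 km, then subtract
# fixed costs. Every number here is invented for illustration.
def rule_based_price(ex_showroom_price, age_years, odometer_km,
                     per_year_drop=50_000, per_50k_km_drop=50_000,
                     repair_cost=0, transfer_cost=0):
    depreciation = (age_years * per_year_drop
                    + (odometer_km // 50_000) * per_50k_km_drop)
    price = ex_showroom_price - depreciation - repair_cost - transfer_cost
    return max(price, 0)  # an estimate should never go negative

# A 3-year-old car with 60,000 km on a 9,00,000 INR ex-showroom price:
print(rule_based_price(900_000, 3, 60_000, repair_cost=20_000))
# 900000 - 3*50000 - 1*50000 - 20000 = 680000
```

Simple as it is, a table of such breakpoints per brand and class was enough to replace a lot of case-by-case manual estimation.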

Other factors

If Hyundai announces that a new i20 will launch next month, do you think the price of second-hand i20s won't be affected? News from carmakers impacts the demand, supply, and pricing of this industry in real time.

Of course, there is no well-defined path for solving such a deep machine learning problem, which from the outside might look like just the tip of the iceberg. And machine predictions always have to be double-checked by human intelligence, because the ticket size is high and the ticket volume is comparatively low, unlike online ad bids.

While at a glance it might seem like the company buys a car at one price and sells it at a higher price, the real picture is much more complex. The demand for cars is dynamic. The company can't just buy, say, a 2002 Mitsubishi Lancer that's available for peanuts if there is no demand for such a car. A car, after all, requires space in a garage, and over time a stale car adds to the operating cost. So only in-demand cars can be bought, keeping the turnaround time short and the operating cost minimal.

Moreover, the cars for sale are purchased from multiple streams: owners selling directly to the company, online public listings, and online bidding systems like those of Cars24 or OLX Cars, where the company competes with other bidders. And if that complexity is not enough, think of a case where the company has to buy a car not for resale but just to display variety on the website, or of the unexpected delays in car handover, repairs, document processing, and so on.

Intertwined ecosystem of machine learning problems

In such a complex ecosystem, as a machine learning researcher or data scientist, you first need to figure out what these problems look like mathematically and how they relate to each other. While one can frame the relationships between them in different ways, here is a high-level overview of the problems and their relations if you had to optimize such operations:

Assume an average turnaround time (TAT) of 7 days, i.e. a car purchased from a seller is sold to a buyer within 7 days.

  1. Demand Prediction, D, for a cluster C (clustered on price range, make, model, variant, etc.)
  2. Supply Prediction, S, for cluster C
  3. Buying/Bid Price Prediction, B, for each car for sale in the market. This depends on the supply of cars in the market as well. For example, if you need a car in the next 2 days and there is an abundance of such cars, you can afford to lose 5 bids before winning one. To keep things simple, for now, we are not considering the probability of winning a bid or predicting a bargain range.
  4. Optimal Selling Price Prediction, O, optimized to be high enough to make a good profit but low enough to convert a sale.
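The clusters C that the demand and supply predictions hang off can be as simple as a composite key. A minimal sketch, with made-up car records and an arbitrary price-band width:

```python
# Sketch: forming the clusters C that D and S are predicted per-cluster
# for. The band width and records below are invented for illustration.
from collections import Counter

def cluster_key(car, band_size=200_000):
    """Cluster on make, model, variant, and a coarse price band."""
    band = car["expected_price"] // band_size
    return (car["make"], car["model"], car["variant"], band)

cars = [
    {"make": "Maruti", "model": "Swift", "variant": "VXI", "expected_price": 450_000},
    {"make": "Maruti", "model": "Swift", "variant": "VXI", "expected_price": 470_000},
    {"make": "Hyundai", "model": "i20", "variant": "Sportz", "expected_price": 620_000},
]

# Counting listings per cluster is the crudest possible supply signal;
# the real models would forecast these counts over time.
supply_by_cluster = Counter(cluster_key(c) for c in cars)
print(supply_by_cluster)
```

In practice the clustering itself can be learned, but even a hand-chosen key like this makes the per-cluster framing of problems 1 and 2 concrete.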

A high level relation to optimize the profits:

minimize |D - B|

maximize |S - (B + Repair Cost + Operation Cost)|

Combining these two gives us a very basic loss function to minimize for the business operations:

Loss = |D - B| / |S - (B + Repair Cost + Operation Cost)|

Each parameter in the above equation is a prediction itself, so optimizing the relation directly is not straightforward; it might help to build machine learning systems for each of these problems individually, while relying on human intelligence for the rest in the meantime.

Moreover, given the high ticket size and the low ticket volume, applying an out-of-the-box deep learning technique doesn't work in this domain. The structured but fragmented data needs to be normalized through manual effort and ML techniques like character embeddings, or automatic clustering of car blogs and reviews using various tagging techniques. And because the data is structured and low in volume, and a reasonably intelligent human would do some basic math to figure out an optimal price, it makes sense to use classical machine learning techniques like XGBoost. Such gradient-boosted decision-tree ensembles resemble the associated human reasoning more closely than the popular neural-network-based techniques, which are better suited to higher volumes of unstructured data.
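To see why tree-based models map so naturally onto this domain, consider a single hand-written "tree" that encodes the same kind of breakpoint rules a human appraiser uses. The thresholds and prices are invented for illustration; a library like XGBoost learns thousands of such splits from data instead of having them hand-tuned:

```python
# Toy illustration: one hand-built decision tree over age and odometer.
# A human appraiser's rules of thumb look exactly like these splits.
# All thresholds and prices are hypothetical.
def appraiser_tree(age_years, odometer_km):
    if age_years <= 3:                       # newer car
        return 700_000 if odometer_km <= 40_000 else 600_000
    else:                                    # older car
        return 450_000 if odometer_km <= 80_000 else 350_000

print(appraiser_tree(2, 30_000))   # young, lightly driven
print(appraiser_tree(5, 90_000))   # old, heavily driven
```

Gradient boosting fits an ensemble of such trees to the residual error, which is why it tends to win on small, structured, tabular data like this, where a deep network would mostly overfit.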

Thanks for reading!

Let me know your thoughts.
