ML Use Case — Part 2
In ML Use Case — Part 1, we established the need for machine learning use cases, with examples. In this part, let us see how to implement one such use case on the Databricks platform.
What is the use case?
- We have 50,000+ records of diamond prices, along with features such as carat, cut, color, etc.
- We also have a label for every record in this dataset: the price.
- We will train a model, aiming for a good R², and then deploy it.
- Finally, we will pass a new dataset (reserved out of the initial 50K+ records) to the model to predict its prices.
Those who know what I am doing: don’t shout “cheating” here. The reserved dataset has slight variations.
For those who do not know the ML language yet: features are the attributes of a record. For example, a 0.23 carat diamond with an Ideal cut, a color of ‘E’ and a depth of 61.5 is priced at $326.00, and this ‘price tag’ is called the label. We have labels for all the records, hence we will frame this as a supervised learning problem.
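To make the vocabulary concrete, here is a small sketch in plain Python that separates one record into features and label. It uses the values quoted above (in the diamonds data, `E` is a color grade); the helper name `split_features_and_label` is made up for this illustration and is not part of any library.

```python
# One record, using the values quoted in the text above.
record = {"carat": 0.23, "cut": "Ideal", "color": "E", "depth": 61.5, "price": 326.0}

LABEL_COLUMN = "price"  # the column we want the model to predict


def split_features_and_label(row, label_column=LABEL_COLUMN):
    """Return (features, label) for one record (hypothetical helper)."""
    features = {name: value for name, value in row.items() if name != label_column}
    return features, row[label_column]


features, label = split_features_and_label(record)
print("features:", features)
print("label:", label)
```

Because every record carries its label, the model can learn the mapping from features to price, which is what makes this a supervised problem.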
Step 1 — Load the data
1 data_w_price = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/FileStore/tables/diamonds.csv")
2 data_w_price.cache()
3 display(data_w_price)
The CSV file for the diamonds data (diamonds.csv) is stored at /FileStore/tables.
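The two options passed to the reader matter: `header` tells Spark to treat the first line as column names, and `inferSchema` asks it to guess column types instead of reading everything as strings. The sketch below mimics that idea with the Python standard library only. It is not how Spark does it (Spark infers one type per column, while this sketch casts value by value), and the sample rows and the `infer_value` helper are assumptions made for illustration.

```python
import csv
import io

# Illustrative sample only; the real file has more columns and rows.
SAMPLE_CSV = """carat,cut,color,price
0.23,Ideal,E,326
0.21,Premium,E,326
"""


def infer_value(text):
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


# DictReader uses the first row as column names, like header=true.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = [{name: infer_value(value) for name, value in row.items()} for row in reader]

for row in rows:
    print(row)
```

Without type inference, `carat` and `price` would stay as strings and could not be fed to a regression model directly.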
Step 2 — Train the model
1 from pyspark.ml.regression import LinearRegression, LinearRegressionSummary
2 from pyspark.ml.feature import VectorAssembler
3 from pyspark.ml.evaluation import RegressionEvaluator
4 from pyspark.ml import Pipeline
5 from pyspark.ml.feature import OneHotEncoder, StringIndexer
6 indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(data_w_price) for column in list(set(data_w_price.columns)-set(['date']))]
7 pipeline = Pipeline(stages=indexers)
8 data_w_price_ir = pipeline.fit(data_w_price).transform(data_w_price)
9 train, test = data_w_price_ir.randomSplit([0.75, 0.25])
10 feature_cols = ['cut_index', 'clarity_index', 'color_index', 'carat', 'depth', 'x', 'y', 'z']
11 assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
12 lr = LinearRegression(labelCol='price')
13 pipeline = Pipeline(stages=[assembler, lr])
14 model = pipeline.fit(train)
15 predictions = model.transform(test)
16 evaluator = RegressionEvaluator(labelCol='price', predictionCol='prediction')
17 print('RMSE:', evaluator.evaluate(predictions, {evaluator.metricName: "rmse"}))
18 print('R-squared:', evaluator.evaluate(predictions, {evaluator.metricName: "r2"}))
Let us look at these lines a little more closely.
1–5 : These lines import the relevant packages for processing
6–8 : These lines prepare the data for processing. For example, the cut of a diamond is given as Premium, Good, Fair, etc. We can’t use these strings directly while developing a model, so they are converted into 0, 1, 2, etc. However, this is just one example. There are various other methods, and they are listed in the Spark ML documentation on feature transformation. Some examples are OneHotEncoder, MinMaxScaler and Bucketizer. The choice really depends on the use case at hand.
9 : Create 75%/25% partition for train and test dataset
10–14 : Define the input features and label, then fit the Linear Regression model
15–16 : Perform the test and evaluate the model
17–18 : Print the output
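The string-to-number conversion described for lines 6–8 can be sketched in plain Python. By default, Spark's StringIndexer orders labels by descending frequency, so the most common value gets index 0 (Spark emits the indices as doubles such as 0.0). The tie-break rule, the sample values and the name `fit_string_index` below are assumptions of this sketch, not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an integer, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption
    of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Made-up sample of the 'cut' column.
cuts = ["Ideal", "Premium", "Ideal", "Good", "Premium", "Ideal", "Fair"]

cut_to_index = fit_string_index(cuts)
print(cut_to_index)
# {'Ideal': 0, 'Premium': 1, 'Fair': 2, 'Good': 3}

print([cut_to_index[cut] for cut in cuts])
# [0, 1, 0, 3, 1, 0, 2]
```

Note that the resulting numbers carry no natural order: 'Good' is not "three times" 'Premium'. A linear model will still treat them as magnitudes, which is one reason an encoder such as OneHotEncoder is often applied after indexing.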
The R² that I got for this model is 0.8655518325698464. That means the model explains roughly 87% of the variance in price on the test set; note that R² is not an accuracy figure. Is this a good indicator? Of course not. Is this the only indicator? Again, of course not. However, for this discussion let us keep moving along; we will discuss these points in subsequent posts.
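For readers who want to see what the two printed metrics measure, here is a minimal sketch that computes RMSE and R² by hand. The four prices and predictions are made up for the example and have nothing to do with the actual model output.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error, in price units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only.
actual = [300.0, 500.0, 700.0, 900.0]
predicted = [350.0, 450.0, 750.0, 850.0]

print("RMSE:", rmse(actual, predicted))            # 50.0
print("R-squared:", r_squared(actual, predicted))  # 0.95
```

R² compares the model against always predicting the mean price, so 0.95 here means the model removes 95% of the squared error of that baseline. RMSE is in the same unit as the label, which makes it easier to judge against real prices.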
Step 3 — Load and prepare the data for prediction
1 data_wo_price = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/FileStore/tables/diamonds_wo_price.csv")
2 data_wo_price.cache()
3 display(data_wo_price)
4 indexers_wo_price = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(data_wo_price) for column in list(set(data_wo_price.columns)-set(['date']))]
5 pipeline = Pipeline(stages=indexers_wo_price)
6 data_wo_price_ir = pipeline.fit(data_wo_price).transform(data_wo_price)
7 data_wo_price_ir.show()
Lines 3 and 7 can be ignored; they just show the records.
1–2 : Load the dataset to be predicted; in our case it is diamonds_wo_price.csv
4–6 : Prepare the data as in Step 2 and fit the pipeline.
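One caveat about fitting new indexers on the prediction dataset: because the index assigned to each label depends on how often it appears, a freshly fitted indexer may give the same label a different number than the one the model was trained with. The sketch below shows the effect in plain Python, with made-up data and an assumed alphabetical tie-break. In Spark, a common way to avoid this is to keep the fitted indexers (or the fitted pipeline model) from Step 2 and call `transform` on the new data with them.

```python
from collections import Counter


def fit_index(values):
    """Frequency-ordered label-to-index mapping (illustrative helper)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Made-up 'cut' values for the training data and for the new data.
training_cuts = ["Ideal", "Ideal", "Ideal", "Premium", "Premium", "Good"]
new_cuts = ["Good", "Good", "Premium", "Ideal"]

mapping_from_training = fit_index(training_cuts)
mapping_from_new_data = fit_index(new_cuts)

print(mapping_from_training)  # {'Ideal': 0, 'Premium': 1, 'Good': 2}
print(mapping_from_new_data)  # {'Good': 0, 'Ideal': 1, 'Premium': 2}

# Reusing the training mapping keeps the encoding consistent with the model.
consistent_codes = [mapping_from_training[cut] for cut in new_cuts]
print(consistent_codes)  # [2, 2, 1, 0]
```

With the refitted mapping, 'Good' would be encoded as 0, which the trained model reads as 'Ideal'. That kind of mismatch can quietly distort predictions.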
Step 4 — Predict the prices
1 pred = model.transform(data_wo_price_ir)
2 df = pred.select('carat', 'cut', 'carat_index', 'cut_index', 'features', 'prediction')
3 display(df)
The final step is to predict the output.
1–2 : The model transforms the records to predict the prices, and we select the columns of interest.
3 : Display the records; here is a sample.
I have pasted a screenshot of the output of the code in Step 4. One can clearly observe that the prediction is sometimes negative, which would mean giving away the diamond and paying the customer: clearly absurd. This model should be improved and tweaked, and I recognize that there are many opportunities to do so.
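Why would a linear model predict a negative price? A straight line fitted to prices that grow faster than linearly with carat tends to have a negative intercept, so small stones fall below zero. The sketch below reproduces this with a one-feature least-squares fit on made-up numbers, and then shows one possible mitigation, fitting on the logarithm of the price. This is only one option among several and not necessarily the fix the next part will use.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    return slope * x + intercept


# Made-up data: price grows faster than linearly with carat.
carats = [0.5, 1.0, 1.5, 2.0]
prices = [1500.0, 5000.0, 10000.0, 16000.0]

slope, intercept = fit_line(carats, prices)
raw_prediction = predict(slope, intercept, 0.23)
print("linear fit on price:", raw_prediction)  # negative for this small stone

# Possible mitigation: fit on log(price), then exponentiate the prediction.
log_slope, log_intercept = fit_line(carats, [math.log(p) for p in prices])
log_prediction = math.exp(predict(log_slope, log_intercept, 0.23))
print("linear fit on log(price):", log_prediction)  # exp(...) is always positive
```

The log-target version cannot produce a negative price by construction. Whether it also predicts better is a separate question that needs to be checked on held-out data.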
Conclusion
In ML Use Case — Part 1, we saw why we need machine learning use cases. Given the hype around machine learning, it is easy to get overwhelmed and think that it is not something for this department or organization. However, platforms such as Databricks and libraries such as scikit-learn and MLflow make implementation much easier, as shown in this article.
I am not denying that there is much more depth to this subject; after all, it has been around for more than 40 years. We will dig a bit deeper into this in the next part.