Machine Learning: Supervised vs Unsupervised (specifically clustering) Learning.
- zenophilip
- Mar 30, 2021
Updated: Jun 15, 2021
An interesting and useful supervised learning problem in the O&G space is the prediction of lithology from well logs. This post presents a case study of such an exercise, with 6 lithology target classes (dolomite, granite, granitewash, limestone, sandstone and shale; Fig 1) and 4 predictor features: the log values of GR, Caliper, NPHI and DPHI.

There were 1000 records (instances) of lithology-labeled data. This is a classification problem, and an 80/20 train-test split was used: 80% of the records to train the ML model and 20% to test it. The models used were kNN and Decision Tree (DT). After comparison and tuning, the DT model was selected as the best performer, although the difference between kNN and DT was not very significant in this case. Orange software was used for this study and the workflow is shown in figure 2.
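
The study itself was built from Orange widgets, so the following is only a minimal sketch of the same idea (an 80/20 split followed by a kNN classifier) in plain Python. The records are synthetic stand-ins generated for illustration, not the well-log data from the study, and the cluster centres, the two features and `k=5` are all assumptions.

```python
import random
from collections import Counter
from math import dist

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out `test_fraction` of them for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def knn_predict(train, features, k=5):
    """Predict the majority label among the k nearest training records."""
    nearest = sorted(train, key=lambda rec: dist(rec[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Synthetic (GR, NPHI) values per lithology: made up for this sketch only.
rng = random.Random(42)
centers = {"shale": (120.0, 0.30), "sandstone": (40.0, 0.20), "limestone": (20.0, 0.05)}
records = [
    ((rng.gauss(gr, 5.0), rng.gauss(nphi, 0.02)), label)
    for label, (gr, nphi) in centers.items()
    for _ in range(100)
]

train, test = train_test_split(records)  # 240 training and 60 test records
accuracy = sum(knn_predict(train, x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Because kNN is distance-based, GR dominates the distance in this sketch; with real logs one would typically rescale the features before fitting.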

From the Test and Score widget, the scores on the train and test data sets across all classes were quite high: approx. 0.98 for both accuracy and F1 score on the training data set and approx. 0.95 for both on the test data set. The confusion matrix comparing the predictions and actual values across the 6 lithology classes for the test data set is shown in figure 3. Even with a fairly small data set and only 4 predictor variables (GR, Caliper, NPHI and DPHI), the ML algorithm was able to distinguish fairly accurately between the 6 lithologies.
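
To make the metrics concrete, here is a small sketch of how accuracy, a confusion matrix and an F1 score can be computed from actual and predicted labels. The six labels below are invented for illustration, and macro-averaging the per-class F1 is an assumption; the averaging that Orange applies may differ.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of records whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def macro_f1(actual, predicted):
    """Unweighted mean of the per-class F1 scores."""
    counts = confusion_counts(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    scores = []
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(other, label)] for other in labels if other != label)
        fn = sum(counts[(label, other)] for other in labels if other != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: one shale record is misclassified as sandstone.
actual = ["shale", "shale", "shale", "sandstone", "sandstone", "limestone"]
predicted = ["shale", "shale", "sandstone", "sandstone", "sandstone", "limestone"]

print(confusion_counts(actual, predicted))
print(accuracy(actual, predicted), macro_f1(actual, predicted))
```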

However, if for some reason the lithology labels were unavailable and one used a clustering approach, specifically k-means clustering, then the results are vastly different. k-means itself takes the number of clusters as an input; here that number was selected with the silhouette score, which measures how close each record is to its own cluster relative to the nearest other cluster, evaluated as a function of the number of clusters. For this case the optimum number of clusters turns out to be 6 (Fig 4), which is quite encouraging since we had 6 lithology classes.
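
For a single record the silhouette is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below computes the mean silhouette in plain Python on four made-up points; it is an illustration of the score, not the Orange implementation. In practice one would run k-means for several values of k and keep the k with the highest score.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated pairs of points (hypothetical values).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across the pairs
print(good, bad)
```

A labeling that matches the natural groups scores close to 1, while one that cuts across them scores below 0.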

However, if we show the lithology distributions within the clusters, it is seen that the lithologies within the clusters are quite varied (Fig 5), even though the same feature variables (GR, NPHI, DPHI and Caliper) were used as predictors. Centroids and distances were computed using these very features in the feature space.
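
One way to quantify how mixed the clusters are is to cross-tabulate cluster membership against the known lithology and compute the cluster purity, i.e. the fraction of records that belong to the majority lithology of their cluster. The post does not report such a number; the sketch below uses invented cluster assignments purely to show the calculation.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_ids, lithologies):
    """Count the lithologies that fall inside each cluster."""
    table = defaultdict(Counter)
    for cluster_id, lithology in zip(cluster_ids, lithologies):
        table[cluster_id][lithology] += 1
    return table

def purity(cluster_ids, lithologies):
    """Fraction of records in the majority lithology of their cluster."""
    table = cluster_composition(cluster_ids, lithologies)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(cluster_ids)

# Hypothetical assignments: each cluster mixes two lithologies.
cluster_ids = [0, 0, 0, 1, 1, 1]
lithologies = ["shale", "shale", "sandstone", "sandstone", "limestone", "limestone"]

for cluster_id, counts in sorted(cluster_composition(cluster_ids, lithologies).items()):
    print(cluster_id, dict(counts))
print(purity(cluster_ids, lithologies))
```

A purity of 1.0 would mean every cluster holds a single lithology; lower values correspond to the mixed clusters described above.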

In summary, one has to be careful with unsupervised learning (specifically clustering): if there are “hidden” labels that better explain the data, every attempt should be made to identify those labels instead of relying on a blind clustering algorithm. In many cases those labels may not be immediately obvious, but as one gains domain expertise on the data, hopefully the “labels” become clearer.


