%matplotlib inline
import didactic_datamining as ddm
#dataset = ddm.create_dataset(npoints=10, minvalue=0, maxvalue=10)
#ddm.print_dataset(dataset)
a) Apply k-means to the dataset shown in the table and figure below, using K=2 and the initial centroids c1=P2 and c2=P5. Explain what happens at each iteration (10 points).
b) Discuss why k-means terminates (3 points).
c) Identify another pair of initial centroids that leads to the same clustering obtained in a) (2 points).
dataset = [[5, 4],
[3, 2],
[6, 2],
[2, 3],
[8, 9],
[4, 0],
[3, 7],
[7, 9],
[4, 5],
[3, 3]]
kmeans = ddm.DidatticKMeans(K=2, centroid_indexs=(2, 5), dist=ddm.euclidean_distance)
kmeans.fit(dataset, step_by_step=False)
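The iterations asked for in a) can be traced by hand with a minimal plain-Python sketch of Lloyd's algorithm. This is not the `ddm` implementation. It assumes that P2 and P5 are the second and fifth rows of the table (1-based), i.e. (3, 2) and (8, 9); how `centroid_indexs` is interpreted by `ddm` is not verified here. The function name `kmeans_sketch` is hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(5, 4), (3, 2), (6, 2), (2, 3), (8, 9),
          (4, 0), (3, 7), (7, 9), (4, 5), (3, 3)]


def kmeans_sketch(points, centroids, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, iterations run)."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid
        # (ties are broken in favour of the lowest centroid index).
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Termination: the assignment did not change, so the centroids
        # would not move either.
        if new_labels == labels:
            return labels, centroids, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids, max_iter


# Assumed initial centroids: P2 = (3, 2) and P5 = (8, 9), 1-based rows.
labels, centroids, n_iter = kmeans_sketch(points, [(3, 2), (8, 9)])
print(labels)     # [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(centroids)  # [(3.75, 3.25), (7.5, 9.0)]
print(n_iter)     # 2
```

Under these assumptions the second pass reproduces the assignment of the first one, which is the stopping condition discussed in b).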
a) Apply the DBSCAN algorithm to the dataset shown in the table and figure below, with radius eps=1.8 and minPts=2 (one neighbor plus the point itself), and for each point specify whether it is a core point, a border point, or noise (10 points).
b) Indicate the composition of the clusters obtained (2 points).
c) Add the minimum number of points needed to turn the noise points into border points (3 points).
dataset = [[5, 4],
[3, 2],
[6, 2],
[2, 3],
[8, 9],
[4, 0],
[3, 7],
[7, 9],
[4, 5],
[3, 3]]
dbscan = ddm.DidatticDbscan(eps=1.8, min_pts=2)  # eps matches the radius stated in the exercise
dbscan.fit(dataset, step_by_step=False)
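The core/border/noise labelling in a) and the clusters in b) can be checked with a minimal plain-Python DBSCAN sketch. This is not the `ddm` implementation; `dbscan_sketch` is a hypothetical name, and the sketch assumes a point's neighborhood includes the point itself and every point at distance at most eps. Because all coordinates are integers, squared distances are integers too, so any eps from about 1.74 up to (but excluding) 2 yields the same neighborhoods on this dataset.

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(5, 4), (3, 2), (6, 2), (2, 3), (8, 9),
          (4, 0), (3, 7), (7, 9), (4, 5), (3, 3)]


def dbscan_sketch(points, eps, min_pts):
    """Returns (kinds, labels); labels[i] is None for noise points."""
    n = len(points)
    # Neighborhood of i: all points within eps, including i itself.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] is not None:
            continue
        # Grow a new cluster from an unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()  # j is always a core point
            for q in neighbors[j]:
                if labels[q] is None:
                    labels[q] = cluster
                    if core[q]:  # only core points keep expanding
                        stack.append(q)
        cluster += 1
    kinds = [
        "core" if core[i] else "border" if labels[i] is not None else "noise"
        for i in range(n)
    ]
    return kinds, labels


kinds, labels = dbscan_sketch(points, eps=1.8, min_pts=2)
for idx, (p, kind, lab) in enumerate(zip(points, kinds, labels), start=1):
    print(f"P{idx} {p}: {kind}, cluster={lab}")
```

With minPts=2 every point that has at least one neighbor is already a core point, so under these assumptions the dataset produces only core and noise points.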
a) Use the training dataset below to build a decision tree for the variable “CHURN”, using the misclassification rate as the splitting criterion and expanding the nodes of the tree until no split provides a gain (18 points).
b) Provide the confusion matrix and evaluate the accuracy, precision, recall and F1-measure of the tree with respect to the test set AND the training set. You MUST provide the formulas for accuracy, precision, recall and F1-measure (7 points).
import pandas as pd
dataset_df = pd.read_csv('dataset_compito_dm_20170405.csv', skipinitialspace=True, delimiter=',')
dataset_df
test_df = pd.read_csv('testset_compito_dm_20170405.csv', skipinitialspace=True, delimiter=',')
test_df
tree = ddm.DidatticClassificationTree(fun=ddm.error_rate, fun_name='misc rate',
                                      min_samples_split=2, min_samples_leaf=1,
                                      step_by_step=False)
tree.fit(dataset_df, target='Churn')
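The stopping rule in a) ("until no split provides a gain") rests on the gain of a split measured with the misclassification rate. The sketch below shows one common way to compute it; it is not taken from `ddm`, the function names are hypothetical, and the label lists are made-up illustrations rather than rows of the exam dataset.

```python
from collections import Counter


def misclassification_rate(labels):
    """1 minus the relative frequency of the majority class."""
    if not labels:
        return 0.0
    return 1 - max(Counter(labels).values()) / len(labels)


def split_gain(parent, children):
    """Parent error minus the size-weighted error of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * misclassification_rate(c) for c in children)
    return misclassification_rate(parent) - weighted


# Hypothetical node with 4 "Y" and 6 "N" records: error = 1 - 6/10 = 0.4
parent = ["Y"] * 4 + ["N"] * 6

# A split that separates the classes fairly well: positive gain.
gain_good = split_gain(parent, [["Y", "Y", "Y", "N"],
                                ["Y", "N", "N", "N", "N", "N"]])

# A split whose children both keep "N" as majority class: zero gain.
gain_none = split_gain(parent, [["Y", "N", "N"],
                                ["Y", "Y", "Y", "N", "N", "N", "N"]])

print(round(gain_good, 3), round(gain_none, 3))
```

The second split illustrates why the misclassification rate can report no gain even when the children have different class proportions: the majority class, and hence the number of errors, is unchanged.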
prediction = tree.predict(test_df)
test_df['Predicted'] = prediction
test_df
tree.evaluate(test_df)
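The formulas required in b) can be written in terms of the confusion matrix counts TP, FP, FN and TN:

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)

A minimal sketch of these formulas follows. It does not reproduce what `tree.evaluate` reports; the function names are hypothetical, the label lists are made up for illustration, and treating "Y" as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive):
    """Returns (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1; undefined ratios become 0.0."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical true and predicted labels, not the exam data.
y_true = ["Y", "Y", "Y", "N", "N", "N", "N", "N"]
y_pred = ["Y", "Y", "N", "Y", "N", "N", "N", "N"]

counts = confusion_counts(y_true, y_pred, positive="Y")
accuracy, precision, recall, f1 = classification_metrics(*counts)
print(counts)  # (2, 1, 1, 4)
print(accuracy, precision, recall, f1)
```

The same two functions can be applied to the training set predictions to obtain the second set of measures requested in b).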