Today I am going to get started with machine learning, using the Python programming language.
Prerequisites:
Download Python: https://www.python.org/
After installing Python, we need to install scikit-learn.
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms.
It is built on NumPy, SciPy, and matplotlib, so we need to install those before installing scikit-learn.
To install them, run:
python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
Note: ipython, jupyter, and pandas will be used later.
Now install scikit-learn with the command below:
pip install -U scikit-learn
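A quick way to confirm the installation worked is to import the libraries and print their versions. This is only a minimal sketch; the `versions` dictionary is a name made up for this example.

```python
# Minimal check that the libraries installed above can be imported.
import numpy
import scipy
import sklearn

# "versions" is just a helper name for this example.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(name, version)
```

If any import fails, the corresponding install command did not complete for the Python interpreter you are running.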
Basics of ML: We work with data sets that we split into two parts, training data and test data. A common ratio is 80:20 (the code in this post uses 70:30). The model is trained on the training data. After training, the test data is passed to the model, and its predictions are compared with the known answers.
To make machine learning easier to understand, I decided to work through a problem statement.
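As a rough illustration of an 80:20 split, here is a small sketch using scikit-learn's `train_test_split`. The toy arrays and the `random_state` value are assumptions made up for this example, not part of the Iris problem.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each, and a 0/1 label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# test_size=0.2 holds back 20% of the rows for testing.
# random_state is an arbitrary seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 80 20
```

The rows are shuffled before splitting, so the test set is a random 20% rather than the last 20 rows.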
Challenge 1:
Problem: Iris Flower Classification
Description:
This is a simple problem, the HELLO WORLD of machine learning. We need to classify iris flowers based on the length and width of their sepals and petals. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
The data set contains the following attributes:
1). sepal length in cm
2). sepal width in cm
3). petal length in cm
4). petal width in cm
5). class:
– Iris Setosa
– Iris Versicolour
– Iris Virginica
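The description above can be checked against the copy of the data set bundled with scikit-learn. This is a small sketch; `counts` is a name chosen just for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 rows (flowers) and 4 columns (the measurements listed above).
print(iris.data.shape)

# Count how many rows belong to each class label (0, 1, 2)
# and pair each count with the class name.
counts = {
    str(name): int(n)
    for name, n in zip(iris.target_names, np.bincount(iris.target))
}
print(counts)
```

Note that scikit-learn spells the second class `versicolor`.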
Algorithm used: Random forest
Random Forest is a supervised learning algorithm. It builds many decision trees and combines their predictions. It works as follows:
- Select random samples from a given dataset.
- Construct a decision tree for each sample and get a prediction result from each decision tree.
- Perform a vote for each predicted result.
- Select the prediction result with the most votes as the final prediction.
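The four steps above can be sketched by hand with plain decision trees. This is only an illustration of the idea, not how scikit-learn implements it internally: `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes. The tree count, the seeds, and the helper `majority_vote` are assumptions made up for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
n_trees = 25                    # arbitrary number of trees
trees = []
for i in range(n_trees):
    # Step 1: draw a random sample of the training rows, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: build one decision tree on that sample.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: every tree votes on every test row; shape is (n_trees, n_test).
all_votes = np.array([tree.predict(X_test) for tree in trees])


def majority_vote(votes):
    """Return the class label (0, 1 or 2) that received the most votes."""
    return np.bincount(votes, minlength=3).argmax()


# Step 4: for each test row (each column), keep the most common vote.
y_pred = np.apply_along_axis(majority_vote, 0, all_votes)

accuracy = (y_pred == y_test).mean()
print("Accuracy:", accuracy)
```

Because each tree sees a slightly different sample, individual trees make different mistakes, and the vote tends to cancel them out.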
Coding :
#Import scikit-learn dataset library which contains the iris data set
from sklearn import datasets
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier
# Import train_test_split function
from sklearn.model_selection import train_test_split
#Import metrics for calculating accuracy
from sklearn import metrics
# for Dataframes creation
import pandas as pd
#Load dataset
iris = datasets.load_iris()
# print the label species(setosa, versicolor,virginica)
print(iris.target_names)
# print the names of the four features
print(iris.feature_names)
# Create a DataFrame with one column per feature plus the species label
data=pd.DataFrame({
'sepal length':iris.data[:,0],
'sepal width':iris.data[:,1],
'petal length':iris.data[:,2],
'petal width':iris.data[:,3],
'species':iris.target
})
# Show the first five rows
print(data.head())
# Separate the features (X) from the labels (y)
X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y=data['species']
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test
# Create a Random Forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)
# Train the model using the training set
clf.fit(X_train,y_train)
# Predict the species for the test set
y_pred=clf.predict(X_test)
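# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
# Predict the species of a new flower: sepal length 3, sepal width 5, petal length 4, petal width 2
# (wrapped in a DataFrame so the column names match the training data)
print(clf.predict(pd.DataFrame([[3, 5, 4, 2]], columns=X.columns)))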
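The final prediction above is printed as an integer class label. To show the species name instead, the label can be used as an index into `iris.target_names`. The sketch below is self-contained and trains on the full data set for brevity; the `random_state` value and the variable names `sample`, `label` and `species` are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# random_state is an arbitrary seed so repeated runs build the same forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# Same measurements as in the post:
# sepal length 3, sepal width 5, petal length 4, petal width 2.
sample = [[3, 5, 4, 2]]

label = clf.predict(sample)[0]      # integer class label: 0, 1 or 2
species = iris.target_names[label]  # matching species name
print(label, species)
```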
Output