Many data scientists and machine learning practitioners prefer Python or R for their analytics work.
This section will introduce your project, the dataset you are analysing and its business context, and introduce and justify your choice of Analytics Technique and your two Analytics Environments. This section will critically evaluate current knowledge relating to your chosen Analytics Technique: its purpose, an explanation of how it works, why it is used, its capabilities and limitations, and the contribution that the technique could make in your chosen organisational context. You will also need to develop and justify the framework that you will use in the next section to critically evaluate your two environments.
This section will be based on practical applications of the chosen Analytics Technique in your two chosen environments, using your chosen source(s) of big data. It will compare and contrast the ease of use of the two environments, the outputs they produce, and the effectiveness of the analytics function in each. This section will justify all the conclusions and recommendations that you present in the final section.
Answer:
Features
The important features and advantages of Python are (Antony et al. 2017):
It is a very expressive language; its syntax is largely made up of meaningful English-like phrases.
Python is an open-source language, and the packages it incorporates are also open source.
It has a large standard library.
GUI programming is comparatively easy.
Applications
Machine Learning
Data Science
Features
The features of R are:
It provides a GUI facility.
Applications
Online data mining
Face and Tag detection
Machine Learning
Machine Learning is a set of algorithms to which the programmer assigns an objective and which then learn how to meet that objective from data. The program is not written out explicitly; instead the algorithm is built so that it is capable of learning from its environment without an explicit definition of each case (Prakash, 2015). Machine Learning brings together several ingredients: data, algorithms, analysis tools and a platform on which to execute them. When the analysis itself is carried out, statistics is indispensable, since statistical calculations allow the algorithm to analyse the data in depth. Two statistical concepts used throughout this report are listed below, and a short illustrative sketch follows the list:
Correlation
Regression
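As a minimal illustration of learning from data rather than from explicit rules (a sketch only, assuming scikit-learn is installed; the numbers are invented purely for this example):
from sklearn.tree import DecisionTreeClassifier

# invented training examples: [hours studied, hours slept] -> pass (1) / fail (0)
X = [[1, 4], [2, 5], [3, 6], [7, 7], [8, 6], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

model = DecisionTreeClassifier()
model.fit(X, y)                          # the dividing rule is learned from the data, not hand-coded

print(model.predict([[2, 6], [8, 7]]))   # expected output: [0 1]
The classifier infers the rule separating the two groups from the labelled examples, which is the essence of the "no explicit programming" idea described above.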
Data analysis with Python:
import numpy as np
import pandas as pd
file = pd.read_csv("<path>.csv")   # fruit dataset containing 'Color Score' and 'Purification' columns
col = file['Color Score']
puri = file['Purification']
Data analysis with R:
library(tidyverse)
fruits <- read_csv("<path>.csv")   # the same fruit dataset
smaller <- fruits %>%
  filter(`Color Score` < 0.5)
Classification
Classification is a technique, or rather a set of algorithms, for assigning the records of a dataset to classes. Several algorithms are available, such as Logistic Regression, Naive Bayes, Gradient Descent, K-Nearest Neighbours, Decision Trees and Random Forests, and they can be implemented in Python as well as R. Classification separates the data with respect to some chosen parameter (Moshfeq et al. 2017). It is a form of supervised learning, because the class labels in the training data are already known to us; a parameter is chosen to separate and segment the data as required by the dataset. The code structure is shown below.
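Before the fuller crime-dataset script below, here is a minimal sketch of one of the classifiers named above, Gaussian Naive Bayes, assuming scikit-learn and an invented feature matrix:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# invented two-feature samples with two class labels (0 and 1)
X = [[5.0, 1.2], [4.8, 1.0], [5.1, 1.1], [7.0, 3.2], [6.8, 3.0], [7.2, 3.4]]
y = [0, 0, 0, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

clf = GaussianNB()
clf.fit(X_train, y_train)              # supervised: the model learns from labelled training data
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))    # proportion of held-out rows classified correctly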
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

link = "<path>.csv"                       # crime dataset (placeholder path kept from the original)
file = pd.read_csv(link)

collist = file.columns.tolist()
hd1 = np.array(file[collist[0]])
hd1u = np.unique(hd1)
print(hd1u)
hd2 = np.array(file[collist[1]])
hd2u = np.unique(hd2)                     # unique crime categories, in sorted order

# one list of row indices per crime category
Burglary, CriminalDamage, Drugs, FraudorForgery = [], [], [], []
OtherNotifiableOffences, SexualOffences, TheftandHandling = [], [], []

for i in range(len(hd2)):
    if hd2u[0] == hd2[i]:
        Burglary.append(i)
    elif hd2u[1] == hd2[i]:
        CriminalDamage.append(i)
    elif hd2u[2] == hd2[i]:
        Drugs.append(i)
    elif hd2u[3] == hd2[i]:
        FraudorForgery.append(i)
    elif hd2u[4] == hd2[i]:
        OtherNotifiableOffences.append(i)
    elif hd2u[6] == hd2[i]:
        SexualOffences.append(i)
    elif hd2u[7] == hd2[i]:
        TheftandHandling.append(i)

print("Burglary:\n", Burglary, "\n")
print("CriminalDamage:\n", CriminalDamage, "\n")
print("SexualOffences:\n", SexualOffences, "\n")
print("TheftandHandling:\n", TheftandHandling, "\n")

# concatenate the monthly counts (columns 3 onward) of the Drugs rows and plot them
list1 = file.iloc[Drugs[0]].tolist()[3:]
for i in range(1, len(Drugs)):
    list1 = list1 + file.iloc[Drugs[i]].tolist()[3:]
plt.plot(list1)
plt.grid()
plt.show()
Using R:
library(caret)
library(naivebayes)
mydata <- read.csv("<path>.csv", stringsAsFactors = TRUE)   # dataset containing prog, ses, science and socst
trainIndex <- createDataPartition(mydata$prog, p = 0.7, list = FALSE)
train <- mydata[trainIndex, ]
test <- mydata[-trainIndex, ]
print(table(mydata$prog))
newNB <- naive_bayes(prog ~ ses + science + socst, usekernel = TRUE, data = train)
Association Rules
Using Python
import itertools

"""prompt user to enter support and confidence values in percent"""
support = float(input("Enter minimum support (%): "))
confidence = float(input("Enter minimum confidence (%): "))

"""read the transaction file: one transaction per line, items separated by commas"""
D = []                  # list of all transactions
transactions = 0        # total number of transactions contained in the file
with open("<path>.csv") as f:
    for line in f:
        T = line.strip().split(",")
        D.append(T)
        transactions += 1

"""count the candidate 1-itemsets"""
C1 = {}
for T in D:
    for word in T:
        if word not in C1:
            C1[word] = 1
        else:
            count = C1[word]
            C1[word] = count + 1

print("-----------------------------------------------------------------")
print("--------------------CANDIDATE 1-ITEMSET-------------------------")
L1 = []
for key in C1:
    if (100 * C1[key] / transactions) >= support:
        L1.append([key])
print(L1)
print("-----------------------------------------------------------------")

def apriori_gen(Lk_1, k):
    """join frequent (k-1)-itemsets that share their first k-2 items to build candidate k-itemsets"""
    Ck = []
    for list1 in Lk_1:
        for list2 in Lk_1:
            if list1[:k - 2] == list2[:k - 2] and list1[k - 2] < list2[k - 2]:
                c = list1[:k - 2] + [list1[k - 2], list2[k - 2]]
                if c not in Ck:
                    Ck.append(c)
    return Ck

def findsubsets(S, m):
    """function to compute 'm' element subsets of a set S"""
    return [sorted(s) for s in itertools.combinations(S, m)]

"""iteratively build the frequent k-itemsets until no further candidates survive"""
frequent = {1: L1}
k = 2
Lk_1 = []
for item in L1:
    Lk_1.append(item)
while Lk_1:
    Ck = apriori_gen(Lk_1, k)
    # print("-------------------------CANDIDATE %d-ITEMSET---------------------" % k)
    # print("Ck: %s" % Ck)
    Lk = []
    for c in Ck:
        s = set(c)
        count = 0
        for T in D:
            if s.issubset(set(T)):
                count += 1
        if (100 * count / transactions) >= support:
            c.sort()
            Lk.append(c)
    if Lk:
        print("------------------------------------------------------------------")
        for l in Lk:
            print(l)
        frequent[k] = Lk
    Lk_1 = Lk
    k += 1

def generate_association_rules():
    """generate_association_rules function to mine and print all the association rules with given support and confidence value"""
    print("RULES \t SUPPORT \t CONFIDENCE")
    print("--------------------------------------------------------")
    for length, itemsets in frequent.items():
        if length < 2:
            continue
        for itemset in itemsets:
            inc1 = 0                                 # support count of the whole itemset
            for T in D:
                if set(itemset).issubset(set(T)):
                    inc1 += 1
            for m in range(1, length):
                for s in findsubsets(itemset, m):
                    inc2 = 0                         # support count of the antecedent subset
                    for T in D:
                        if set(s).issubset(set(T)):
                            inc2 += 1
                    conf = 100 * inc1 / inc2
                    supp = 100 * inc1 / transactions
                    if conf >= confidence:
                        consequent = [index for index in itemset if index not in s]
                        print("%s -> %s \t %.2f%% \t %.2f%%" % (s, consequent, supp, conf))

generate_association_rules()
Using R
library(arules)
market <- read.transactions("<path>.csv", sep = ",")   # market-basket transactions
class(market)
inspect(head(market, 3))
itemFrequencyPlot(market, topN = 10, type = "absolute", main = "Item Frequency")
Output
Remaining order_item: 29662716
Item pairs: 30622410
Correlation
Correlation is one of the most widely used statistical concepts. This part introduces the definitions and intuition behind correlation and illustrates how to calculate it using the Python pandas library (Han et al. 2017). Correlation is closely connected with association: it measures the degree to which one variable is related to another. For business purposes it is important to maintain a clear picture of the relationship between customer types and the items they buy, and correlation deals with exactly this kind of question, where statistics is essential.
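To make the idea of "degree of similarity" concrete, here is a minimal sketch (invented numbers, assuming NumPy) that computes the Pearson correlation coefficient from its definition, the covariance divided by the product of the standard deviations, and checks it against np.corrcoef:
import numpy as np

# invented paired observations, e.g. items bought vs. money spent
x = np.array([2, 4, 5, 7, 9], dtype=float)
y = np.array([10, 18, 24, 33, 40], dtype=float)

# Pearson r = cov(x, y) / (std(x) * std(y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (x.std() * y.std())

print(r)                        # close to 1: the two series rise together
print(np.corrcoef(x, y)[0, 1])  # same value from the library routine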
Using Python
import pandas as pd
import matplotlib.pyplot as plt

mpg_data = pd.read_csv("<path>.csv")                   # auto-mpg dataset
print(mpg_data['mpg'].corr(mpg_data['weight']))        # pairwise correlation of two columns

cols = ['weight', 'horsepower', 'acceleration']
colors = ['b', 'g', 'r']
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax = ax.flatten()
for j, i in enumerate(ax):
    i.scatter(mpg_data[cols[j]], mpg_data['mpg'], alpha=0.5, color=colors[j])
    i.set_ylabel('MPG')
plt.show()
Using R
library(ggpubr)
my_data <- read.csv(file.choose())
ggscatter(my_data, x = "wt", y = "mpg",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Weight", ylab = "Miles per gallon")
res <- cor.test(my_data$wt, my_data$mpg,
                method = "pearson")
res
Regression
Regression is the technique, or algorithm, used to relate the input variables to a target variable. The example shown here concerns a linear model. Let us see the example.
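As a minimal sketch of a linear model (assuming scikit-learn; the data points are invented for illustration), the model learns an intercept and a slope relating the input to the target:
import numpy as np
from sklearn.linear_model import LinearRegression

# invented input (e.g. advertising spend) and target (e.g. sales)
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 10.4])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)   # fitted intercept and slope
print(model.predict([[6.0]]))          # prediction for a new input value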
Using Python
import matplotlib.pyplot as plt

# monthly counts (columns 3 onward) for two of the crime categories,
# reusing the index lists built in the classification example above
list1 = file.iloc[FraudorForgery[0]].tolist()[3:]
list4 = list1 + file.iloc[CriminalDamage[0]].tolist()[3:]

plt.figure(figsize=(30, 10))
plt.plot(list4, "c", label='Criminal Damage (2016-2018)')
plt.xlabel("Type of Fraud")
plt.ylabel("Fraud Count")
plt.legend()
plt.show()
Using R
par(mfrow = c(2, 2))   # divide the graph area into a 2 x 2 grid
plot(density(cars$speed), main = "Density Plot: Speed", ylab = "Frequency", sub = paste("Skewness:", round(e1071::skewness(cars$speed), 2)))   # density plot for 'speed'
plot(density(cars$dist), main = "Density Plot: Distance", ylab = "Frequency", sub = paste("Skewness:", round(e1071::skewness(cars$dist), 2)))   # density plot for 'dist'
boxplot(cars$speed, main = "Speed", sub = paste("Outlier rows: ", boxplot.stats(cars$speed)$out))   # box plot for 'speed'
boxplot(cars$dist, main = "Distance", sub = paste("Outlier rows: ", boxplot.stats(cars$dist)$out))   # box plot for 'dist'
Clustering is an unsupervised learning technique in which we have to segment data whose groupings are not known to us in advance. It is typically used for tasks such as web filtering or stock market analysis, where the structure of the data may be unknown.
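Before the step-by-step implementation below, a minimal sketch using scikit-learn's KMeans on invented 2-D points shows the end result the technique aims for: every point is assigned to one of k groups without any labels being supplied.
import numpy as np
from sklearn.cluster import KMeans

# invented 2-D points forming two loose groups
X = np.array([[1, 2], [1.5, 1.8], [2, 2.2],
              [8, 8], [8.5, 9], [9, 8.5]])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the learned centroids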
Using Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')   # Jupyter notebook only

#-------------------------------Creating Data Frame-----------------------------------
df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})
plt.scatter(df['x'], df['y'])
plt.show()

#--------------------------Creating & Assigning Centroid------------------------------
k = 3
np.random.seed(200)
centroids = {
    i + 1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}
colmap = {1: 'r', 2: 'g', 3: 'b'}

def assignment(df, centroids):
    # assign every point to its nearest centroid (Euclidean distance)
    for i in centroids.keys():
        df['distance_from_{}'.format(i)] = np.sqrt(
            (df['x'] - centroids[i][0]) ** 2
            + (df['y'] - centroids[i][1]) ** 2
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    df['closest'] = df['closest'].map(lambda x: int(x.replace('distance_from_', '')))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

df = assignment(df, centroids)
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.show()

def update(centroids):
    # move each centroid to the mean of the points currently assigned to it
    old_centroids = {i: centroids[i][:] for i in centroids.keys()}
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
        dx = (centroids[i][0] - old_centroids[i][0]) * 0.75   # shift of the centroid, useful for drawing arrows
        dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    return centroids

#--------------------------------Plot updated seeds in cluster form--------------------
centroids = update(centroids)
df = assignment(df, centroids)
plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()
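A single assignment and update step is rarely enough. A minimal sketch of the usual stopping rule, reusing the df, centroids, assignment and update names from the code above, repeats both steps until no point changes cluster:
# repeat the assignment and update steps until the cluster memberships stop changing
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break   # no point changed cluster, so the algorithm has converged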
Using R
library(lubridate)
library(ggmap)
data14 <- read.csv("<path>.csv")                      # Uber pickup records
data14$Date.Time <- mdy_hms(data14$Date.Time)
data14$Year <- factor(year(data14$Date.Time))
data14$Minute <- factor(minute(data14$Date.Time))
data14$Second <- factor(second(data14$Date.Time))
clusters <- kmeans(data14[, c("Lat", "Lon")], 5)      # five clusters over the pickup coordinates
data14$Borough <- as.factor(clusters$cluster)
NYCMap <- get_map("New York", zoom = 10)
ggmap(NYCMap) + geom_point(aes(x = Lon, y = Lat, colour = as.factor(Borough)), data = data14) +
  ggtitle("NYC Boroughs using KMean")
Anomaly Detection
Using Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

def estimateGaussian(dataset):
    # mean vector and covariance matrix of the training data
    return np.mean(dataset, axis=0), np.cov(dataset.T)

def multivariateGaussian(dataset, mu, sigma):
    # density of each row under the fitted multivariate Gaussian
    return multivariate_normal(mean=mu, cov=sigma).pdf(dataset)

def selectThresholdByCV(probs, gt):
    # choose the density threshold epsilon that maximises F1 on labelled data
    best_f1, best_epsilon = 0, 0
    stepsize = (max(probs) - min(probs)) / 1000
    epsilons = np.arange(min(probs), max(probs), stepsize)
    for epsilon in np.nditer(epsilons):
        f1 = f1_score(gt, (probs < epsilon).astype(int))
        if f1 > best_f1:
            best_f1, best_epsilon = f1, float(epsilon)
    return best_f1, best_epsilon

tr_data = np.loadtxt("<path>.csv", delimiter=",")     # latency / throughput readings
n_dim = tr_data.shape[1]
mu, sigma = estimateGaussian(tr_data)
p = multivariateGaussian(tr_data, mu, sigma)

plt.figure()
plt.scatter(tr_data[:, 0], tr_data[:, 1], marker="x")
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (mb/s)")
plt.show()
Reference list
Gruden, G., Giunti, S., Barutta, F., Chaturvedi, N., Witte, D.R., Tricarico, M., Fuller, J.H., Perin, P.C. and Bruno, G., 2012. QTc interval prolongation is independently associated with severe hypoglycemic attacks in type 1 diabetes from the EURODIAB IDDM complications study. Diabetes care, 35(1), pp.125-127.
Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(3), pp.90-95.
Kwon, O. and Sim, J.M., 2013. Effects of data set features on the performances of classification algorithms. Expert Systems with Applications, 40(5), pp.1847-1857.
Han, R., John, L.K. and Zhan, J., 2018. Benchmarking big data systems: A review. IEEE Transactions on Services Computing, 11(3), pp.580-597.
Salaken, S.M., Khosravi, A., Nguyen, T. and Nahavandi, S., 2017. Extreme learning machine based transfer learning algorithms: A survey. Neurocomputing, 267, pp.516-524.
Christensen, T.F., Lewinsky, I., Kristensen, L.E., Randlov, J., Poulsen, J.U., Eldrup, E., Pater, C., Hejlesen, O.K. and Struijk, J.J., 2007, September. QT Interval prolongation during rapid fall in blood glucose in type I diabetes. In Computers in Cardiology, 2007 (pp. 345-348). IEEE.
Dincer, C., Akpolat, G. and Zeydan, E., 2017, May. Security issues of big data applications served by mobile operators. In Signal Processing and Communications Applications Conference (SIU), 2017 25th (pp. 1-4). IEEE.
Volk, M., Bosse, S. and Turowski, K., 2017, July. Providing Clarity on Big Data Technologies: A Structured Literature Review. In Business Informatics (CBI), 2017 IEEE 19th Conference on (Vol. 1, pp. 388-397). IEEE.