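Kaggle is a popular site in the data mining and machine learning community. It regularly hosts competitions, which makes it a good place to build hands-on experience in both fields. The walkthrough below uses the CIFAR-10 - Object Recognition in Images competition as an introductory example.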
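1. Download and extract the data
Download the data directly. The archives are usually large, so extraction takes some time. On a Mac, 7za x vps12.7z extracts the archive.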
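2. Read the data
The data usually comes in CSV format and can be loaded with pandas:
import pandas as pd
df = pd.read_csv("data/train.csv")
pandas is a widely used Python library; for details see: Python Scientific Computing (Part 2)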
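As a minimal sketch of what the read step produces, the snippet below parses a small in-memory CSV instead of a file on disk. The id and label column names match the ones used later in this post; the sample rows and the load_labels helper are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows; a real label file is read from disk the same way.
SAMPLE_CSV = "id,label\n1,frog\n2,truck\n3,truck\n"


def load_labels(source):
    """Read a CSV with 'id' and 'label' columns into a DataFrame."""
    return pd.read_csv(source)


df = load_labels(io.StringIO(SAMPLE_CSV))
print(df.shape)            # number of rows and columns
print(df.label.tolist())   # the label column as a plain Python list
```

Passing a path such as "trainLabels.csv" to load_labels instead of the StringIO object would read the real file.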
3. Implement the algorithm
Use a nearest neighbor classifier (k-nearest neighbors with k = 1) to classify the images.
import numpy as np
import pandas as pd
# scipy.misc.imread was removed in SciPy 1.2; imageio's imread is used here instead
from imageio import imread

class NearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        """ X is N x D where each row is an example. y is 1-dimensional of size N """
        # the nearest neighbor classifier simply remembers all the training data
        self.Xtr = X
        self.ytr = np.asarray(y)

    def predict(self, X):
        """ X is N x D where each row is an example we wish to predict a label for """
        num_test = X.shape[0]
        # make sure that the output type matches the label type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        # loop over all test rows
        for i in range(num_test):
            # find the nearest training image to the i'th test image
            # using the L1 distance (sum of absolute value differences)
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)  # index of the smallest distance
            Ypred[i] = self.ytr[min_index]  # predict the label of the nearest example
        return Ypred
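To make the L1 nearest neighbor rule concrete, here is a small self-contained sketch on toy 2-D points. The function name l1_nearest_labels and the toy data are made up for illustration, and the distances are computed with broadcasting rather than the per-row loop used in the class above.

```python
import numpy as np


def l1_nearest_labels(X_train, y_train, X_test):
    """For each test row, return the label of the closest training row (L1 distance)."""
    # Broadcasting yields an array of shape (num_test, num_train, D).
    diffs = np.abs(X_test[:, None, :] - X_train[None, :, :])
    distances = diffs.sum(axis=2)        # shape (num_test, num_train)
    nearest = distances.argmin(axis=1)   # index of the closest training row
    return y_train[nearest]


# Toy data: three training points with labels, three points to classify.
X_train = np.array([[0, 0], [10, 10], [20, 0]])
y_train = np.array(["cat", "dog", "ship"])
X_test = np.array([[1, 2], [18, 1], [9, 12]])

pred = l1_nearest_labels(X_train, y_train, X_test)
print(pred)
```

The broadcast version holds num_test x num_train x D values in memory at once, which is fine for toy data but heavy for hundreds of 3072-dimensional images; the per-row loop keeps memory use low at the cost of speed.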
4. Run the algorithm and save the results in CSV format
if __name__ == '__main__':
    nearestNeighbor = NearestNeighbor()
    trainSize = 500
    testSize = 500
    imgMatrix = np.ones((trainSize, 32*32*3))
    testImag = np.ones((testSize, 32*32*3))
    # load the first trainSize training images, one flattened image per row
    for i in range(trainSize):
        print(i)
        img = imread('train/%d.png' % (i+1))
        img_row = img.reshape(1, 32*32*3)
        imgMatrix[i] = img_row
    df = pd.read_csv("trainLabels.csv")
    # keep only the labels that belong to the loaded images
    nearestNeighbor.train(imgMatrix, df.label[:trainSize])
    # the next testSize images of the training set (501..1000) serve as test data
    for j in range(testSize):
        print(j)
        img = imread('train/%d.png' % (j+trainSize+1))
        img_row = img.reshape(1, 32*32*3)
        testImag[j] = img_row
    yPred = nearestNeighbor.predict(testImag)
    # write the predictions as two columns: id, label
    output = pd.DataFrame({'id': np.arange(1, testSize+1), 'label': yPred})
    output.to_csv('output.csv', index=False)
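The flattening step in the loops above can be checked without any image files. The sketch below builds a few random 32 x 32 x 3 arrays as stand-ins for decoded PNGs and stacks them into a matrix with one row per image; flatten_images and the fake images are assumptions for illustration only.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 32, 32, 3   # CIFAR-10 image shape


def flatten_images(images):
    """Stack H x W x C images into an N x (H*W*C) float matrix, one row per image."""
    matrix = np.empty((len(images), HEIGHT * WIDTH * CHANNELS))
    for i, img in enumerate(images):
        matrix[i] = img.reshape(-1)   # 32*32*3 = 3072 values per row
    return matrix


# Random uint8 arrays standing in for images returned by imread.
rng = np.random.default_rng(0)
fake_images = [
    rng.integers(0, 256, size=(HEIGHT, WIDTH, CHANNELS), dtype=np.uint8)
    for _ in range(4)
]

matrix = flatten_images(fake_images)
print(matrix.shape)
```

Storing the rows as floats rather than uint8 matters for the L1 distance: subtracting unsigned 8-bit values directly would wrap around instead of going negative.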
5. Upload the resulting CSV file to Kaggle and wait for the score
With the nearest neighbor algorithm the accuracy is not high, only about 30%.
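Because the script in step 4 takes its 500 test images from train/501.png to train/1000.png, their true labels are also present in trainLabels.csv (assuming the file lists ids in order), so accuracy can be estimated locally as well. A minimal sketch, where accuracy is a hypothetical helper and the label lists are made up:

```python
import numpy as np


def accuracy(predicted, actual):
    """Return the fraction of positions where predicted and actual labels agree."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same shape")
    return float(np.mean(predicted == actual))


# Made-up labels: three of the four predictions are correct.
predicted = ["frog", "truck", "cat", "deer"]
actual = ["frog", "truck", "dog", "deer"]

score = accuracy(predicted, actual)
print(score)
```

With the variables from step 4, the call would look like accuracy(yPred, df.label[trainSize:trainSize + testSize]).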