# Getting Started with Kaggle

Posted by jjx on February 24, 2017

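Kaggle is a popular site in the data mining and machine learning community. It regularly hosts competitions, which makes it a good place to build hands-on skills in both fields. This post uses the CIFAR-10 - Object Recognition in Images competition as an example to walk through getting started on Kaggle.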

1. Download and extract the data

2. Load the data

    import pandas as pd
    df = pd.read_csv("data/train.csv")


pandas is a widely used Python library. For details, see the post Python科学计算(二) (Python Scientific Computing, Part 2).
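
As a small illustration of what `pd.read_csv` returns, the sketch below parses an in-memory CSV rather than the real file. The `id` column, the three sample rows, and the name `df_demo` are assumptions made for this example; only the `label` column is actually referenced later in this post.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/train.csv: the column names and rows
# are assumptions for illustration, not the real competition file.
sample_csv = io.StringIO("id,label\n1,frog\n2,truck\n3,deer\n")

df_demo = pd.read_csv(sample_csv)

# Each CSV column becomes a Series, reachable by attribute or by key.
labels = df_demo.label
print(df_demo.shape)
print(labels.tolist())
```

Passing a file path such as `"data/train.csv"` instead of the `StringIO` object should behave the same way, provided the file has a header row.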

3. Define the algorithm

    import numpy as np
    import pandas as pd
    # Note: the scipy.misc image helpers were removed from later SciPy releases;
    # SciPy's deprecation notice points to imageio.imread as the replacement.
    from scipy.misc import imread, imsave, imresize

    class NearestNeighbor(object):
        def __init__(self):
            pass

        def train(self, X, y):
            """ X is N x D where each row is an example. y is 1-dimensional of size N """
            # the nearest neighbor classifier simply remembers all the training data
            self.Xtr = X
            self.ytr = y

        def predict(self, X):
            """ X is N x D where each row is an example we wish to predict a label for """
            num_test = X.shape[0]
            # make sure that the output type matches the label type
            Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

            # loop over all test rows
            for i in range(num_test):
                # find the nearest training image to the i'th test image
                # using the L1 distance (sum of absolute value differences)
                distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
                min_index = np.argmin(distances)  # index with the smallest distance
                Ypred[i] = self.ytr[min_index]  # predict the label of the nearest example

            return Ypred
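
To make the L1 nearest-neighbor rule concrete, here is a minimal sketch on a toy dataset. The helper `predict_l1` and the arrays `X_train`, `y_train`, and `X_test` are hypothetical names invented for this example. The helper mirrors the loop in `predict` above, but it is a standalone function, not the class itself.

```python
import numpy as np

def predict_l1(X_train, y_train, X_test):
    """Label each test row with the label of its nearest training row (L1 distance)."""
    y_pred = np.zeros(X_test.shape[0], dtype=y_train.dtype)
    for i in range(X_test.shape[0]):
        # Broadcasting subtracts the test row from every training row at once.
        distances = np.sum(np.abs(X_train - X_test[i, :]), axis=1)
        y_pred[i] = y_train[np.argmin(distances)]
    return y_pred

# Toy data (assumed for illustration): three 2-D training points, two test points.
X_train = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
y_train = np.array([0, 1, 2])
X_test = np.array([[1.0, 1.0], [9.0, 8.0]])

y_pred = predict_l1(X_train, y_train, X_test)
print(y_pred)
```

For the first test point the L1 distances to the three training rows are 2, 18, and 10, so it takes the label of row 0. The second test point is closest to row 1, with a distance of 3.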


4. Run the algorithm, then save the results in CSV format

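    if __name__ == '__main__':
        nearestNeighbor = NearestNeighbor()
        trainSize = 500
        testSize = 500
        imgMatrix = np.ones((trainSize, 32*32*3))
        testImag = np.ones((testSize, 32*32*3))
        for i in range(trainSize):
            print(i)
            # Assumption: the original listing never defines img. The path below
            # (images named 1.png, 2.png, ... under data/train/) is a guess.
            img = imread("data/train/%d.png" % (i + 1))
            # flatten the 32 x 32 x 3 image into a single row of 3072 values
            img_row = img.reshape(1, 32*32*3)
            imgMatrix[i] = img_row

        nearestNeighbor.train(imgMatrix, df.label)

        for j in range(testSize):
            print(j)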