Data Ingestion - AI
Normalize using sklearn
# See nice summaries here
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
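As a minimal sketch of what the preprocessing module offers, the snippet below applies two common scalers to a made-up feature matrix. The array X and its values are assumptions for illustration only; MinMaxScaler and StandardScaler are the real sklearn classes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# MinMaxScaler rescales each column to the range 0-1
X_minmax = MinMaxScaler().fit_transform(X)

# StandardScaler shifts each column to mean 0 and scales it to unit variance
X_standard = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

With a real train/test split, the scaler would normally be fitted on the training data only and then applied to the test data with transform.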
Normalize 8-bit (0-255) pixel image data
# If pixel values run from 0 to 255 and none of them are outliers, a simple linear rescale of the numpy.ndarray is enough
# Casting to float32 and dividing each value by 255 normalizes the data to the range 0-1
X_train = X_train.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.
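To check the effect of the rescale, here is a small sketch on a made-up array. The images array is a hypothetical stand-in for real image data, not part of any dataset.

```python
import numpy as np

# Hypothetical stand-in for image data: 2 grayscale images of 3x3 uint8 pixels
images = np.array([[[0, 128, 255], [64, 32, 16], [255, 0, 1]],
                   [[255, 255, 255], [0, 0, 0], [10, 20, 30]]], dtype=np.uint8)

# Cast first: dividing a uint8 array directly would produce float64,
# which uses twice the memory of float32
scaled = images.astype('float32') / 255.

print(scaled.dtype, scaled.min(), scaled.max())
```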
Load MNIST train and test data
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
One Hot Encode Outputs
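from keras.utils import to_categorical
# Older Keras releases expose the same function as keras.utils.np_utils.to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# After encoding, each label is a row with one column per class
num_classes = y_test.shape[1]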
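To see what one hot encoding produces without needing Keras, the idea can be sketched in plain numpy. The labels array is a made-up example, and indexing an identity matrix is an illustrative stand-in here, not how Keras is necessarily implemented.

```python
import numpy as np

# Hypothetical integer labels for a 3-class problem
labels = np.array([0, 2, 1, 2])

# Row i of the identity matrix is the one-hot vector for class i
num_classes = labels.max() + 1
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot)
print(one_hot.shape)
```

Each row has a single 1 in the column of its class, so taking argmax along axis 1 recovers the original integer labels.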
Flatten Data
from keras.layers import Flatten
# When Flatten is the first layer of a model, it needs an input shape that matches our data.
# If each sample is a 2D array, the flattened length is the size of the first dimension
# multiplied by the size of the second dimension. Flatten works this out for us
# if we use it like so
model.add(Flatten(input_shape=(x_shape, y_shape)))
# x_shape and y_shape here are the two dimensions of a single sample.
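# Take for example the MNIST dataset, where each image is 28x28
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# X_train has shape (samples, rows, columns), so axis 1 is the height and axis 2 is the width
img_height = X_train.shape[1]
img_width = X_train.shape[2]
model.add(Flatten(input_shape=(img_height, img_width)))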
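What flattening does to the array shape can be sketched in plain numpy. The batch array below is a hypothetical stand-in with MNIST-like dimensions, and reshape is used only to illustrate the shape change that the Flatten layer performs inside a model.

```python
import numpy as np

# Hypothetical batch with MNIST-like dimensions: 5 images of 28x28 pixels
batch = np.zeros((5, 28, 28), dtype='float32')

# Keep the sample axis and collapse everything else into one axis
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)   # (5, 784)
```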
Split data into train/test groups
from sklearn.model_selection import train_test_split
# X values are the feature set and y values are the label data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
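A minimal sketch of the split on made-up data follows. The X and y arrays are assumptions for illustration; random_state and stratify are optional train_test_split parameters that are often useful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, two balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# random_state makes the split repeatable; stratify=y keeps the class ratio in both groups
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```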
Reshape Data For CNN
# There are 2 options that I currently know of:
# The easy way: let the model add the channel axis
from keras.layers import Reshape
model.add(Reshape((28, 28, 1), input_shape=(28, 28)))
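# The less easy way: reshape the arrays before training
# Each image is 2D, so the data array is 3D (samples, height, width) and we reshape it to 4D (samples, height, width, channels)
# With the default channels_last format, keras expects the last axis of each sample (the 4th dimension of the array shape) to be the color channel
# config is assumed here to be a settings object that holds the image dimensions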
X_train = X_train.reshape(X_train.shape[0], config.img_width, config.img_height, 1)
X_test = X_test.reshape(X_test.shape[0], config.img_width, config.img_height, 1)
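The same reshape can be sketched in plain numpy without hard-coding the image dimensions. The batch array is a hypothetical stand-in for grayscale image data; np.expand_dims is one alternative to spelling out every dimension in reshape.

```python
import numpy as np

# Hypothetical batch with MNIST-like dimensions: 5 grayscale images of 28x28 pixels
batch = np.zeros((5, 28, 28), dtype='float32')

# Append a channel axis of size 1 at the end of the shape
with_channel = np.expand_dims(batch, axis=-1)

print(with_channel.shape)   # (5, 28, 28, 1)
```

This gives the same result as batch.reshape(batch.shape[0], 28, 28, 1) and keeps working if the image size changes.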