For an assignment, I have to extract data from a CSV file using NumPy. The file contains multiple rows, but the first row contains the labels and looks like
label, pixel1, pixel2, pixel3, ..., pixel784 - this one should be ignored.
The following rows contain a label in the first cell (some integer between 1-10, I believe), and the next 784 cells contain the actual pixel number values. These numbers have to be reshaped to be 28x28 arrays.
The function should return 2 np.array types, one with the labels and one with the images, and the output should look like this:
(27455, 28, 28)
(27455,)
(7172, 28, 28)
(7172,)
So far, this is what I have. I have managed to get the pixel values into 28x28 arrays (I think), but I am not sure where to go from there. The assignment suggests using np.array().astype, since I need to turn the values into floats.
I have never worked with arrays in NumPy, so I am not sure how to work with them. Am I doing the first part correctly? How do I return the images and labels?
(Please stay within the constraints of the assignment when you reply. I am trying to understand all the concepts and suggestions, and I don't want to be overwhelmed by other possible solutions since I am already struggling with this. Thanks!)
def get_data(filename):
    # You will need to write code that will read the file passed
    # into this function. The first line contains the column headers
    # so you should ignore it
    # Each successive line contains 785 comma separated values between 0 and 255
    # The first value is the label
    # The rest are the pixel values for that picture
    # The function will return 2 np.array types. One with all the labels
    # One with all the images
    #
    # Tips:
    # If you read a full line (as 'row') then row[0] has the label
    # and row[1:785] has the 784 pixel values
    # Take a look at np.array_split to turn the 784 pixels into 28x28
    # You are reading in strings, but need the values to be floats
    # Check out np.array().astype for a conversion
    with open(filename, 'r') as training_file:
        # Your code starts here
        #training_file.readline()
        csv_reader = csv.reader(training_file)
        header=next(csv_reader)
        if header != None:
            for row in csv_reader:
                images=np.array_split(row[1:],28)
        # Your code ends here
    return images, labels
path_sign_mnist_train = f"{getcwd()}/../tmp2/sign_mnist_train.csv"
path_sign_mnist_test = f"{getcwd()}/../tmp2/sign_mnist_test.csv"
training_images, training_labels = get_data(path_sign_mnist_train)
testing_images, testing_labels = get_data(path_sign_mnist_test)
# Keep these
print(training_images.shape)
print(training_labels.shape)
print(testing_images.shape)
print(testing_labels.shape)
# Their output should be:
# (27455, 28, 28)
# (27455,)
# (7172, 28, 28)
# (7172,)
row is a list of strings, which you will first need to convert to a numeric type. I don't really understand why array_split was suggested, and you can also avoid mucking about with astype by converting each element directly to a float: image = np.fromiter(map(float, row[1:]), dtype=float, count=784).reshape(28, 28). Keep in mind that you'll need to keep a list of all images, rather than overwriting images each time you read a row.

np.genfromtxt can load a nice csv, that is, one with a consistent number of columns per row. It's easiest with floats, but it is also possible with a mix of column types (producing a 1d structured array). In your case you might want to use usecols to control which columns you load. You can load different sets of columns with separate calls. Read its docs for more details.
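
To tie these suggestions together, here is a minimal sketch of how get_data might be completed. It stays close to the assignment's hints by using csv.reader and np.array().astype rather than np.fromiter, and it assumes every data row holds exactly 785 comma-separated numeric values. The second function, get_data_genfromtxt, is a hypothetical name used only for illustration; it shows the np.genfromtxt route, loading the whole file once and slicing it instead of using usecols.

```python
import csv

import numpy as np


def get_data(filename):
    """Read a CSV of label + 784 pixels per row into (images, labels)."""
    images = []  # collects one 28x28 float array per data row
    labels = []  # collects one label string per data row
    with open(filename, "r", newline="") as training_file:
        csv_reader = csv.reader(training_file)
        next(csv_reader)  # skip the header row
        for row in csv_reader:
            labels.append(row[0])
            # row[1:785] is a list of 784 strings: convert to floats,
            # then reshape the flat vector into a 28x28 image.
            pixels = np.array(row[1:785]).astype(float)
            images.append(pixels.reshape(28, 28))
    # Stacking the list of 28x28 arrays gives shape (n_rows, 28, 28).
    return np.array(images), np.array(labels).astype(float)


def get_data_genfromtxt(filename):
    """Hypothetical alternative: load everything with np.genfromtxt."""
    # skip_header=1 drops the label row; atleast_2d covers a one-row file.
    data = np.atleast_2d(np.genfromtxt(filename, delimiter=",", skip_header=1))
    labels = data[:, 0]
    images = data[:, 1:].reshape(-1, 28, 28)
    return images, labels
```

The important change from the code in the question is that each row's image and label are appended to lists, so nothing is overwritten, and the lists are converted to arrays only once at the end. With the files from the question, both functions should return arrays whose shapes are (n, 28, 28) and (n,), where n is the number of data rows.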