How to extract a NumPy array from a CSV file?

Question

For an assignment, I have to extract data from a CSV file using NumPy. The file contains multiple rows; the first row contains the column labels and looks like label, pixel1, pixel2, pixel3, ..., pixel784, and it should be ignored. Each following row contains a label in the first cell (some integer between 1 and 10, I believe), and the next 784 cells contain the actual pixel values. These values have to be reshaped into 28x28 arrays.

The function should return 2 np.array types, one with the labels and one with the images, and the output should look like this:

(27455, 28, 28)
(27455,)
(7172, 28, 28)
(7172,)

So far, this is what I have. I have managed to get the pixel values into 28x28 arrays (I think), but I am not sure how to go from there. The assignment suggests using np.array().astype, as I need to turn the values into floats.

I have never worked with arrays in NumPy, so I am not sure how to work with them. Am I doing the first part correctly? How do I return the images and labels?

(Please stay within the constraints of the assignment when you reply. I am trying to understand all the concepts and suggestions, and I don't want to be overwhelmed by having to look into other possible solutions, since I am already struggling with this. Thanks!)

def get_data(filename):
  # You will need to write code that will read the file passed
  # into this function. The first line contains the column headers
  # so you should ignore it
  # Each successive line contains 785 comma separated values between 0 and 255
  # The first value is the label
  # The rest are the pixel values for that picture
  # The function will return 2 np.array types. One with all the labels
  # One with all the images
    #
  # Tips: 
  # If you read a full line (as 'row') then row[0] has the label
  # and row[1:785] has the 784 pixel values
  # Take a look at np.array_split to turn the 784 pixels into 28x28
  # You are reading in strings, but need the values to be floats
  # Check out np.array().astype for a conversion
    with open(filename, 'r') as training_file:
      # Your code starts here
        #training_file.readline()
        csv_reader = csv.reader(training_file)
        header=next(csv_reader)

        if header != None:

            for row in csv_reader:
                images=np.array_split(row[1:],28)
      # Your code ends here
    return images, labels

path_sign_mnist_train = f"{getcwd()}/../tmp2/sign_mnist_train.csv"
path_sign_mnist_test = f"{getcwd()}/../tmp2/sign_mnist_test.csv"
training_images, training_labels = get_data(path_sign_mnist_train)
testing_images, testing_labels = get_data(path_sign_mnist_test)

# Keep these
print(training_images.shape)
print(training_labels.shape)
print(testing_images.shape)
print(testing_labels.shape)

# Their output should be:
# (27455, 28, 28)
# (27455,)
# (7172, 28, 28)
# (7172,)

row is a list of strings, which you will first need to convert into some numeric type. I don't really understand why it was suggested that you use array_split, and you can also avoid mucking about with astype by converting each element directly to a float: image = np.fromiter(map(float, row[1:]), dtype=float, count=784).reshape(28, 28). Keep in mind you'll need to keep a list of all images, rather than overwriting images each time you read a row.
– bnaecker, Commented Sep 3, 2020 at 23:40
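A minimal sketch of what this comment describes, assuming a tiny in-memory CSV (the `sample` data below is made up for illustration and stands in for the real file). Note that np.fromiter needs an explicit dtype, and that each image is appended to a list instead of overwriting a single variable:

```python
import csv
import io

import numpy as np

# Hypothetical two-row CSV: a header line, then a label plus 784 pixel values per row.
header = "label," + ",".join(f"pixel{i}" for i in range(1, 785))
rows = [
    "3," + ",".join(["0"] * 784),
    "7," + ",".join(["255"] * 784),
]
sample = io.StringIO("\n".join([header] + rows))

images, labels = [], []
reader = csv.reader(sample)
next(reader)  # skip the header row
for row in reader:
    # Convert the 784 pixel strings to floats, then shape them as 28x28.
    image = np.fromiter(map(float, row[1:]), dtype=float, count=784).reshape(28, 28)
    images.append(image)          # accumulate, do not overwrite
    labels.append(float(row[0]))  # first cell is the label

images = np.array(images)
labels = np.array(labels)
print(images.shape)  # (2, 28, 28)
print(labels.shape)  # (2,)
```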
np.genfromtxt can load a nice csv - one with a consistent number of columns per row. It's easiest with floats, but possible also with a mix of column types (producing a 1d structured array). In your case you might want to use usecols to control which columns you load. You can load different sets with separate calls. Read its docs for more details.
– hpaulj, Commented Sep 3, 2020 at 23:41
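For reference, a rough sketch of the np.genfromtxt route, again on a made-up in-memory CSV (`sample` is an assumption, not the assignment's file). Rather than separate usecols calls, this version loads everything once as floats and slices off the label column afterwards:

```python
import io

import numpy as np

# Hypothetical two-row CSV with the same layout as the assignment file.
header = "label," + ",".join(f"pixel{i}" for i in range(1, 785))
rows = [
    "3," + ",".join(["0"] * 784),
    "7," + ",".join(["255"] * 784),
]
sample = io.StringIO("\n".join([header] + rows))

# skip_header=1 drops the column-name row; values are parsed as floats by default.
data = np.genfromtxt(sample, delimiter=",", skip_header=1)

labels = data[:, 0]                       # first column holds the labels
images = data[:, 1:].reshape(-1, 28, 28)  # remaining 784 columns per row become 28x28

print(images.shape)  # (2, 28, 28)
print(labels.shape)  # (2,)
```

This goes outside the assignment's suggested functions, so it is probably best treated as a cross-check rather than the expected solution.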
Yeah, I don't understand why either, but I was trying to understand how to use those functions. If you read the original question, I asked to please not suggest other ways of doing it, since I am already struggling with the original instructions and want to understand how to make it work that way. Thanks though!
– alpablo20, Commented Sep 4, 2020 at 22:36

Diogo Silva · Accepted Answer · 2020-09-04 02:46:24Z

I think this will do the trick:

import csv

import numpy as np


def get_data(filename):
    # You will need to write code that will read the file passed
    # into this function. The first line contains the column headers
    # so you should ignore it
    # Each successive line contains 785 comma separated values between 0 and 255
    # The first value is the label
    # The rest are the pixel values for that picture
    # The function will return 2 np.array types. One with all the labels
    # One with all the images
    #
    # Tips:
    # If you read a full line (as 'row') then row[0] has the label
    # and row[1:785] has the 784 pixel values
    # Take a look at np.array_split to turn the 784 pixels into 28x28
    # You are reading in strings, but need the values to be floats
    # Check out np.array().astype for a conversion
    with open(filename, "r") as training_file:
        # Your code starts here
        # training_file.readline()
        csv_reader = csv.reader(training_file)
        next(csv_reader, None)  # skip the header row; the None default avoids an error on an empty file
        images = []
        labels = []
        for row in csv_reader:
            images.append(np.array(row[1:]).reshape(28, 28))
            labels.append(row[0])

    images = np.array(images).astype(np.float32)
    labels = np.array(labels).astype(np.float32)
    # Your code ends here
    return images, labels
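The assignment's tips mention np.array_split, while the answer above uses reshape. As a small sketch on a made-up row (the `row` list is hypothetical, built only for illustration), the two appear to give the same 28x28 result once the strings are converted with astype:

```python
import numpy as np

# Hypothetical row as csv.reader would yield it: a label string, then 784 pixel strings.
row = ["5"] + [str(i % 256) for i in range(784)]

# np.array_split cuts the 784 values into 28 equal chunks of 28 strings each.
chunks = np.array_split(row[1:], 28)
image_split = np.array(chunks).astype(np.float32)

# reshape does the same regrouping in a single step.
image_reshape = np.array(row[1:]).reshape(28, 28).astype(np.float32)

print(len(chunks))                                   # 28
print(image_split.shape)                             # (28, 28)
print(np.array_equal(image_split, image_reshape))    # True
```

Either form can be appended to the images list inside the loop; the final np.array(images) call then stacks them into shape (number of rows, 28, 28).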
