How to create an array of NA or Null values in Python?

Question

This is easy to do in R, and I am wondering whether it is straightforward in Python and I am just missing something. How do you create a vector of NaN values or Null values in Python? I am trying to do this using the np.full function.

R Code:

vec <- vector("character", 15)
vec[1:15] <- NA
vec

Python Code

unknowns = np.full(shape = 5, fill_value = ???, dtype = 'str')

'''test if fill value worked or not'''

np.random.seed(1177)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])

example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

This should lead to 5 counts of unknown in the value counts total. Ideally I would like to know how to write this fill_value for NaN and Null and know whether it differs for variable types. I have tried np.nan with and without the string data type. I have tried None and Null with and without quotes. I cannot think of anything else to try and I am starting to wonder if it is possible. Thank you in advance and I apologize if this question is already addressed and for my lack of knowledge in this area.

There are data typing issues here. You can create an array of np.nan, but that's a floating point value. You can create an array of empty strings, if that solves the problem. You can't put None in a string array. All elements in a numpy array must be the same type.
– Tim Roberts, Commented Nov 29, 2022 at 0:01
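To make the typing point in this comment concrete, here is a minimal sketch of what np.full produces for a few candidate fill values. The variable names are only illustrative and are not from the question.

```python
import numpy as np

# np.nan is a float, so the array gets a floating point dtype
nan_arr = np.full(shape=5, fill_value=np.nan)

# None is not a number or string, so numpy falls back to the generic object dtype
none_arr = np.full(shape=5, fill_value=None)

# An empty string fits in a fixed-width unicode string array
empty_str_arr = np.full(shape=5, fill_value="", dtype=str)

for name, arr in [("nan", nan_arr), ("None", none_arr), ("empty string", empty_str_arr)]:
    print(name, arr.dtype, arr)
```

Only the object array can hold None; the string array can only hold strings, which is why an empty string is the closest thing to "missing" there.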

Pearl · Accepted Answer · 2022-11-29 02:41:41Z

5

You can use either None or np.nan to create an array of only missing values in Python, like so:

np.full(shape=5, fill_value=None)
np.full(shape=5, fill_value=np.nan)

Back to your example, this works just fine:

import numpy as np
import pandas as pd

unknowns = np.full(shape=5, fill_value=None)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])
example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

Lastly, this line is inefficient: example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

Try to avoid explicit loops and list comprehensions when using pandas.

On large data, this may run faster: example['transformed'] = example.categories.apply(lambda s: s if s else 'unknown'). Note that it relies on truthiness, so it replaces None and empty strings but not np.nan, which is truthy.
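The same replacement can also be written with pandas' built-in fillna, which handles both None and np.nan without a Python-level lambda. This is a minimal sketch on a small hypothetical DataFrame, not the question's random data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with both kinds of missing value
example = pd.DataFrame({"categories": ["web", None, "software", np.nan, "biotech"]})

# fillna replaces every value that pd.isna() would flag as missing
example["transformed"] = example["categories"].fillna("unknown")

print(example["transformed"].value_counts())
```

Because fillna uses the same missing-value logic as pd.isna, it matches the intent of the original list comprehension more closely than a truthiness check.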

edited Nov 29, 2022 at 2:41

Pearl

answered Nov 29, 2022 at 0:09

govordovsky

4 Comments

Pearl Over a year ago

Thank you so much! I figured it was a stupid question, with some combination of things I was not trying. I am going to delete it since it is so dumb. But I really appreciate your help.

Tim Roberts Over a year ago

You don't have to delete it. It's possible this question might arise again in the future.

Mercury Over a year ago

@Pearl no need to delete! I would also encourage you to consider accepting this (or another) answer if it solves your problem, thereby marking this as a solved question.

Pearl Over a year ago

@govordovsky Thanks for your comment on efficiency too. I was specifically running it through "isna" logic, but your coding knowledge is really appreciated. Maybe someone will find that helpful too.

Alexander L. Hayes · Accepted Answer · 2022-11-29 00:12:08Z

0

There is a typing problem here.

If you're working in numpy, an array's dtype is fixed once it is initialized. Assigning np.nan to an array of strings converts the value to the string 'nan' and truncates it to the array's fixed string width:

import numpy as np

v1 = np.array(['a', 'b', 'c'])
v1[0] = np.nan
# v1 = array(['n', 'b', 'c'], dtype='<U1')

v2 = np.array(['ab', 'cd', 'ef'])
v2[0] = np.nan
# v2 = array(['na', 'cd', 'ef'], dtype='<U2')

v3 = np.array(['abc', 'def', 'ghi'])
v3[0] = np.nan
# v3 = array(['nan', 'def', 'ghi'], dtype='<U3')
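One way to avoid this coercion, sketched here as an assumption rather than something this answer proposes, is to give the array the generic object dtype so that each element keeps its own Python type.

```python
import numpy as np

# dtype=object stores references to Python objects instead of fixed-width strings
flexible = np.array(["abc", "def", "ghi"], dtype=object)
flexible[0] = np.nan

# The first element is still a float nan, not the string 'nan'
print(flexible, type(flexible[0]))
```

The trade-off is that object arrays lose most of numpy's vectorized string and numeric speedups.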

However, if you're working with pandas, as in the second half of the question, there's a separate mechanism for handling missing data:

import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})
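As a short sketch of how such a column can then be inspected and filled, assuming the goal is the 'unknown' label from the question:

```python
import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})

# isna() flags the missing entry
mask = df["x"].isna()

# fillna() substitutes a label for it
filled = df["x"].fillna("unknown")

print(mask.tolist())
print(filled.tolist())
```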

edited Nov 29, 2022 at 0:12

answered Nov 29, 2022 at 0:00

Alexander L. Hayes

mozway · Accepted Answer · 2022-11-29 01:23:12Z

0

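A simple way to create a Series containing only missing values in pandas (the dtype is passed explicitly, because the default dtype when no data is given depends on the pandas version):

s = pd.Series(index=range(15), dtype='float64')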

Output:

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
dtype: float64

Or, with a string dtype:

s = pd.Series(index=range(15), dtype='string')

Output:

0     <NA>
1     <NA>
2     <NA>
3     <NA>
4     <NA>
5     <NA>
6     <NA>
7     <NA>
8     <NA>
9     <NA>
10    <NA>
11    <NA>
12    <NA>
13    <NA>
14    <NA>
dtype: string
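Applied to the question's example, the string-dtype Series can stand in for the np.full call. This is a sketch under assumptions: the seeded generator, pd.concat and fillna are choices made here for illustration and are not part of this answer.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random categories are reproducible
rng = np.random.default_rng(1177)
known = pd.Series(
    rng.choice(["web", "software", "hardware", "biotech"], size=15),
    dtype="string",
)

# Five missing values with the same string dtype
unknowns = pd.Series(index=range(5), dtype="string")

categories = pd.concat([known, unknowns], ignore_index=True)
example = pd.DataFrame({"categories": categories})
example["transformed"] = example["categories"].fillna("unknown")

print(example["transformed"].value_counts())
```

The value counts should include 5 entries labelled 'unknown', one for each missing value appended.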

answered Nov 29, 2022 at 1:23

mozway
