This is easy to do in R, and I am wondering whether it is straightforward in Python and I am just missing something: how do you create a vector of NaN or null values in Python? I am trying to do this with the np.full function.

R Code:

vec <- vector("character", 15)
vec[1:15] <- NA
vec

Python Code:

unknowns = np.full(shape = 5, fill_value = ???, dtype = 'str')

# Test whether the fill value worked

np.random.seed(1177)  # seed NumPy's generator, which np.random.choice uses
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])

example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

This should lead to 5 counts of unknown in the value counts total. Ideally I would like to know how to write this fill_value for NaN and Null and know whether it differs for variable types. I have tried np.nan with and without the string data type. I have tried None and Null with and without quotes. I cannot think of anything else to try and I am starting to wonder if it is possible. Thank you in advance and I apologize if this question is already addressed and for my lack of knowledge in this area.

    There are data typing issues here. You can create an array of np.nan, but that's a floating point value. You can create an array of empty strings, if that solves the problem. You can't put None in a string array. All elements in a numpy array must be the same type. Commented Nov 29, 2022 at 0:01

3 Answers


You can use either None or np.nan to create an array of just missing values in Python:

np.full(shape=5, fill_value=None)
np.full(shape=5, fill_value=np.nan)
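
As a quick check of what these two calls return, here is a minimal sketch (the names none_vec and nan_vec are illustrative, not from the original answer). With no dtype argument, NumPy infers the dtype from the fill value, so the two arrays end up with different types, yet pd.isna treats both as missing:

```python
import numpy as np
import pandas as pd

# No dtype is given, so NumPy infers it from the fill value.
none_vec = np.full(shape=5, fill_value=None)
nan_vec = np.full(shape=5, fill_value=np.nan)

print(none_vec.dtype)  # object
print(nan_vec.dtype)   # float64

# pandas recognises both None and NaN as missing values.
print(pd.isna(none_vec).all(), pd.isna(nan_vec).all())
```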

Back to your example, this works just fine:

import numpy as np
import pandas as pd

unknowns = np.full(shape=5, fill_value=None)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])
example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

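Lastly, this line works, but it is not idiomatic pandas: example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

With pandas, prefer built-in vectorized methods over loops and list comprehensions.

For example: example['transformed'] = example['categories'].fillna('unknown')

An apply with a truthiness test, such as example.categories.apply(lambda s: s if s else 'unknown'), handles None, but it still calls a Python function for every element, it misses np.nan (which is truthy), and it also replaces empty strings.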
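
One vectorized alternative, offered here as a sketch rather than as part of the original answer, is Series.fillna, which replaces both None and NaN in one call. The generator rng and the array name known are assumptions made for illustration; the seed value is the one from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data: 15 known categories followed by 5 missing values.
rng = np.random.default_rng(1177)  # seed value borrowed from the question
known = rng.choice(['web', 'software', 'hardware', 'biotech'], size=15, replace=True)
unknowns = np.full(shape=5, fill_value=None)
example = pd.DataFrame({'categories': np.concatenate([known, unknowns])})

# fillna replaces every missing entry (None or NaN) without a Python-level loop.
example['transformed'] = example['categories'].fillna('unknown')

print(example['transformed'].value_counts())
```

Unlike a truthiness test, fillna leaves empty strings untouched and only replaces values that pandas considers missing.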


4 Comments

Thank you so much! I figured it was a stupid question, with some combination of things I was not trying. I am going to delete it since it is so dumb. But I really appreciate your help.
You don't have to delete it. It's possible this question might arise again in the future.
@Pearl no need to delete! I would also encourage you to consider accepting this (or another) answer if it solves your problem, thereby marking this as a solved question.
@govordovsky Thanks for your comment on efficiency too. I was specifically running it through "isna" logic, but your coding knowledge is really appreciated. Maybe someone will find that helpful too.

There is a typing problem here.

A NumPy array has a single, fixed dtype once it is created. Assigning np.nan to an array that was created from strings does not store a missing value: the value is converted to the string 'nan' and truncated to the array's string width:

import numpy as np

v1 = np.array(['a', 'b', 'c'])
v1[0] = np.nan
# v1 = array(['n', 'b', 'c'], dtype='<U1')

v2 = np.array(['ab', 'cd', 'ef'])
v2[0] = np.nan
# v2 = array(['na', 'cd', 'ef'], dtype='<U2')

v3 = np.array(['abc', 'def', 'ghi'])
v3[0] = np.nan
# v3 = array(['nan', 'def', 'ghi'], dtype='<U3')
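
If strings and a real NaN need to live in the same NumPy array, one option (a sketch, not part of the original answer) is to create the array with dtype=object, so each element keeps its own Python type:

```python
import numpy as np
import pandas as pd

# An object array stores references to arbitrary Python objects,
# so np.nan is kept as a float instead of being turned into text.
v = np.array(['abc', 'def', 'ghi'], dtype=object)
v[0] = np.nan

print(v)
print(pd.isna(v))
```

The trade-off is that object arrays lose most of NumPy's speed advantages, since operations fall back to per-element Python calls.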

However, since the second half of the question uses pandas, note that pandas has its own marker for missing data, pd.NA:

import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})
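
To illustrate how pd.NA behaves in a text column, here is a small sketch that assumes the pandas "string" extension dtype; the names s and filled are illustrative:

```python
import pandas as pd

# With the "string" extension dtype, missing entries are stored as pd.NA.
s = pd.Series([pd.NA, "Hello", "World"], dtype="string")

print(s.isna())

# Missing entries can then be replaced in a single call.
filled = s.fillna("unknown")
print(filled.tolist())
```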


A simple way to create a Series filled with missing values in pandas:

s = pd.Series(index=range(15))

Output:

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
dtype: float64

Or, with a string dtype:

s = pd.Series(index=range(15), dtype='string')

Output:

0     <NA>
1     <NA>
2     <NA>
3     <NA>
4     <NA>
5     <NA>
6     <NA>
7     <NA>
8     <NA>
9     <NA>
10    <NA>
11    <NA>
12    <NA>
13    <NA>
14    <NA>
dtype: string
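
Applied to the example in the question, an all-missing string Series can be appended to the known categories with pd.concat. This is a sketch under assumptions: the names rng, known, missing, and categories are illustrative, and the seed is the one used in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1177)  # seed value borrowed from the question
known = pd.Series(
    rng.choice(['web', 'software', 'hardware', 'biotech'], size=15, replace=True),
    dtype='string',
)

# Five missing values, built the same way as the all-<NA> Series above.
missing = pd.Series(index=range(5), dtype='string')

# ignore_index=True renumbers the combined Series from 0 to 19.
categories = pd.concat([known, missing], ignore_index=True)

print(categories.isna().sum())
print(categories.fillna('unknown').value_counts())
```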
