Python Tidbits

The Mysterious Behaviour of Int

When working with datatypes in Python I keep getting surprised by their intricate dynamic nature. In other programming languages we often have to declare the precision of the integer types we use, and there is a wide variety of ranges to choose between (the C names in the second column are the typical ones; their exact widths can vary by platform):

Type    Also known as       From                        To
int8    char                -128                        127
uint8   unsigned char       0                           255
int16   short               -32,768                     32,767
uint16  unsigned short      0                           65,535
int32   long                -2,147,483,648              2,147,483,647
uint32  unsigned long       0                           4,294,967,295
int64   long long           -9,223,372,036,854,775,808  9,223,372,036,854,775,807
uint64  unsigned long long  0                           18,446,744,073,709,551,615
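
The ranges in the table follow directly from the number of bits. As a rough sketch (the helper name int_range is made up for this illustration), they can be recomputed like this:

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return the (lowest, highest) value an integer of this width can hold."""
    if signed:
        # One bit is used for the sign, so the magnitude gets bits - 1.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


for bits in (8, 16, 32, 64):
    for signed in (True, False):
        name = f"{'' if signed else 'u'}int{bits}"
        low, high = int_range(bits, signed)
        print(f"{name:<7}{low:>28,}{high:>28,}")
```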

But what happens when we use the int type in Python? I originally guessed that it was simply an alias for one of the above, but that turned out to be wrong in general. Let’s show the implications with a short, and very real, example.

When dealing with Twitter data we often work with the IDs of tweets, as the Twitter terms of use state that we’re not allowed to share tweets directly, but instead we can share the IDs, from which they can be “rehydrated” (unless the user deleted them).

Here are some examples of tweet IDs:

  • 1496894936372813825
  • 1378982003966685186
  • 1321053468723941376

Since all of these are merely integers, it would feel natural to deal with them as such in Python. Sometimes we receive these IDs from REST APIs, which output string data, so we might find ourselves writing out the following piece of code:

>>> import numpy as np
>>> tweet_ids = get_tweet_ids_from_twitter()
>>> tweet_ids = np.asarray(tweet_ids, dtype=int)
>>> rehydrate_tweets(tweet_ids=tweet_ids)

Happy days, we got some integer tweet IDs! Let’s ship this to production, what could go wrong?

This piece of code works fine on Unix-based operating systems. Indeed, on those systems we will see the following:

>>> tweet_ids
array([1496894936372813825, 1378982003966685186, 1321053468723941376])

All good. But on a Windows machine (no matter if the Windows distribution is 32-bit or 64-bit), at least with NumPy versions before 2.0, we will suddenly see the following:

>>> tweet_ids
array([-2070601727, -1843974142,  1821806592])
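
The negative numbers are not random. A minimal sketch in plain Python, assuming the conversion simply keeps the low 32 bits of each ID and reads them as a signed value, reproduces the output above exactly. The helper name wrap_to_int32 is made up for this illustration:

```python
def wrap_to_int32(value: int) -> int:
    """Keep the low 32 bits of value and interpret them as a signed integer."""
    low_bits = value & 0xFFFFFFFF  # discard everything above bit 31
    if low_bits >= 2 ** 31:        # top bit set means negative in two's complement
        return low_bits - 2 ** 32
    return low_bits


tweet_ids = [1496894936372813825, 1378982003966685186, 1321053468723941376]
wrapped = [wrap_to_int32(tweet_id) for tweet_id in tweet_ids]
print(wrapped)  # [-2070601727, -1843974142, 1821806592]
```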

Oh dear! Suddenly our script is trying to rehydrate negative tweet IDs and we face some very obscure error messages. The culprit is that dtype=int does not name a fixed precision: NumPy picks a platform default, which on Windows was the 32-bit C long, so our IDs silently wrapped around. We can fix this by being specific about the precision we want. We can accomplish this using the np.int64 type, where our code snippet above would now be written as:

>>> import numpy as np
>>> tweet_ids = get_tweet_ids_from_twitter()
>>> tweet_ids = np.asarray(tweet_ids, dtype=np.int64)
>>> rehydrate_tweets(tweet_ids=tweet_ids)

And hooray, this will now work on Windows as well!
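
NumPy versions before 2.0 map dtype=int to the platform's C long, so one way to tell in advance whether a machine is affected is to ask how wide that type is. The sketch below uses only the standard library, and the variable names are just for illustration:

```python
import ctypes

# Width of the C "long" type on this machine, in bits.
c_long_bits = ctypes.sizeof(ctypes.c_long) * 8
print(f"C long is {c_long_bits} bits wide on this platform")

largest_tweet_id = 1496894936372813825
# A signed integer of that width tops out at 2**(bits - 1) - 1.
fits_in_c_long = largest_tweet_id <= 2 ** (c_long_bits - 1) - 1
print(f"Tweet ID fits in a C long: {fits_in_c_long}")
```

On Windows this typically reports 32 bits, while 64-bit Linux and macOS report 64. Spelling out np.int64 sidesteps the question entirely, which is why it is the safer habit.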

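As a little bonus: Python2 had the same platform split in its native int type, which was a C long under the hood and therefore topped out at 2,147,483,647 on Windows. The built-in int function did not wrap around, though. It quietly promoted values that did not fit to the separate arbitrary-precision long type, which is why the result carries an L suffix on a Windows machine with Python2:

>>> int("1496894936372813825")
1496894936372813825L

Thankfully, Python3 merged the two types into a single arbitrary-precision int. Phew!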
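
To see the Python3 behaviour for ourselves, here is a small check (the variable names are only for illustration). The built-in int parses the ID exactly and keeps growing past the 64-bit limit instead of wrapping:

```python
tweet_id = int("1496894936372813825")
print(tweet_id)               # 1496894936372813825
print(tweet_id.bit_length())  # 61, so it would never fit in 32 bits

# Going past the largest uint64 value just yields a bigger int.
beyond_uint64 = 18_446_744_073_709_551_615 + 1
print(beyond_uint64)               # 18446744073709551616
print(beyond_uint64.bit_length())  # 65
```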