Python Tidbits
The Mysterious Behaviour of Int
Posted on July 19, 2022
When working with datatypes in Python I keep getting surprised by their intricate dynamic nature. When working with integer values in other programming languages we often have to declare the precision of the types that we are using. In the case of integers, we have a wide variety of ranges to choose between:
| Type | Also known as | From | To |
|---|---|---|---|
| int8 | char | -128 | 127 |
| uint8 | unsigned char | 0 | 255 |
| int16 | short | -32,768 | 32,767 |
| uint16 | unsigned short | 0 | 65,535 |
| int32 | long | -2,147,483,648 | 2,147,483,647 |
| uint32 | unsigned long | 0 | 4,294,967,295 |
| int64 | long long | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
| uint64 | unsigned long long | 0 | 18,446,744,073,709,551,615 |
But what is then happening when we in Python are using the int type? I originally guessed that it was simply used as an alias for one of the above, but that turned out to be wrong in general. Let’s show the implications of this with this short, and very real, example.
When dealing with Twitter data we often work with the IDs of tweets, as the Twitter terms of use states that we’re not allowed to share tweets directly, but instead we can share the IDs, from which they can be “rehydrated” (unless the user deleted them).
Here are some examples of tweet IDs:
149689493637281382513789820039666851861321053468723941376
Since all of these are merely integers, it would feel natural to deal with them as such in Python. Sometimes we receive these IDs from REST APIs, which output string data, so we might find ourselves writing out the following piece of code:
>>> import numpy as np
>>> tweet_ids = get_tweet_ids_from_twitter()
>>> tweet_ids = np.asarray(tweet_ids, dtype=int)
>>> rehydrate_tweets(tweet_ids=tweet_ids)
Happy days, we got some integer tweet IDs! Let’s ship this to production, what could go wrong?
As I mentioned above, this piece of code will work on Unix-based operating systems. Indeed, on those systems we will see the following:
>>> tweet_ids
array([1496894936372813825, 1378982003966685186, 1321053468723941376])
All good. But on any Windows machine (no matter if the Windows distribution is 32-bit or 64-bit), we will suddenly see the following:
>>> tweet_ids
array([-2070601727, -1843974142, 1821806592])
Oh dear! Suddenly our script is trying to rehydrate negative tweet IDs and we face some very obscure error messages. We can fix this if we don’t allow Python to dynamically type, and instead be more specific in our typing. We can be accomplish this using the np.int64 type, where our code snippet above would now be written as:
>>> import numpy as np
>>> tweet_ids = get_tweet_ids_from_twitter()
>>> tweet_ids = np.asarray(tweet_ids, dtype=np.int64)
>>> rehydrate_tweets(tweet_ids=tweet_ids)
And hooray, this will now work on Windows as well!
As a little bonus, it turns out that even the native int function behaves in this way in Python2, so that we get the following on a Windows machine with Python2:
>>> int("1496894936372813825") = -2070601727
Thankfully, this has now been changed in Python3. Phew!