Data Type Objects, dtype

dtype

The data type object 'dtype' is an instance of the numpy.dtype class and can be created with numpy.dtype. So far, our examples of numpy arrays have used only fundamental numeric data types like 'int' and 'float', and these arrays contained solely homogeneous data. dtype objects are constructed from combinations of fundamental data types. With the aid of dtype we can create "Structured Arrays", also known as "Record Arrays". Structured arrays give us the ability to have a different data type per column, similar to the structure of Excel or CSV documents. This makes it possible to define data like the one in the following table with dtype:

| Country | Population Density | Area | Population |
|------------|----------:|----:|----:|
| Netherlands | 393 | 41526 | 16,928,800 |
| Belgium | 337 | 30510 | 11,007,020 |
| United Kingdom | 256 | 243610 | 62,262,000 |
| Germany | 233 | 357021 | 81,799,600 |
| Liechtenstein | 205 | 160 | 32,842 |
| Italy | 192 | 301230 | 59,715,625 |
| Switzerland | 177 | 41290 | 7,301,994 |
| Luxembourg | 173 | 2586 | 512,000 |
| France | 111 | 547030 | 63,601,002 |
| Austria | 97 | 83858 | 8,169,929 |
| Greece | 81 | 131940 | 11,606,813 |
| Ireland | 65 | 70280 | 4,581,269 |
| Sweden | 20 | 449964 | 9,515,744 |
| Finland | 16 | 338424 | 5,410,233 |
| Norway | 13 | 385252 | 5,033,675 |

Before we start with a complex data structure like the previous data, we want to introduce dtype in a very simple example. We define an int16 data type and call this type i16. (We have to admit that this is not a nice name, but we use it only here!) The elements of the list 'lst' are turned into i16 types to create the two-dimensional array A. Note that the conversion cuts off the fractional part of each number.
import numpy as np

i16 = np.dtype(np.int16)
print(i16)

lst = [ [3.4, 8.7, 9.9],
[1.1, -7.8, -0.7],
[4.1, 12.3, 4.8] ]

A = np.array(lst, dtype=i16)

print(A)

int16
[[ 3  8  9]
[ 1 -7  0]
[ 4 12  4]]


We introduced a new name for a basic data type in the previous example. This has nothing to do with the structured arrays, which we mentioned in the introduction of this chapter of our dtype tutorial.
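To see what the conversion in the example above does to the values, the following minimal sketch inspects the dtype object and compares the truncating conversion with explicit rounding. The names `values`, `truncated` and `rounded` are ours and serve only as an illustration.

```python
import numpy as np

i16 = np.dtype(np.int16)
# a dtype object describes how each element is stored
print(i16.name, i16.itemsize)   # name of the type and bytes per element

values = [3.4, 8.7, -7.8, -0.7]

# converting floats to an integer dtype cuts off the fractional part
truncated = np.array(values).astype(i16)
print(truncated)

# rounding first yields the nearest integers instead
rounded = np.rint(values).astype(i16)
print(rounded)
```

If the nearest integer is what you need, rounding before the conversion is usually the safer choice.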

Structured Arrays

ndarrays are homogeneous data objects, i.e. all elements of an array have to be of the same data type. The data type dtype, on the other hand, allows us to define separate data types for each column.

Now we will take the first step towards implementing the table with European countries and the information on population, area and population density. We create a structured array with the 'density' column. The data type is defined as np.dtype([('density', np.int32)]). We assign this data type to the variable 'dt' for the sake of convenience and use it in the array definition, which contains the first three densities.

import numpy as np

dt = np.dtype([('density', np.int32)])

x = np.array([(393,), (337,), (256,)],
dtype=dt)

print(x)

print("\nThe internal representation:")
print(repr(x))

[(393,) (337,) (256,)]

The internal representation:
array([(393,), (337,), (256,)],
dtype=[('density', '<i4')])


We can access the content of the density column by indexing x with the key 'density'. It looks like accessing a dictionary in Python:

print(x['density'])

[393 337 256]
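The dictionary-like access can be explored a bit further. The following short sketch, which reuses the dtype from above, shows how to list the field names, how to pick a single record, and that a column obtained by field name is a view onto the original array.

```python
import numpy as np

dt = np.dtype([('density', np.int32)])
x = np.array([(393,), (337,), (256,)], dtype=dt)

print(dt.names)          # names of all fields
print(x.shape)           # one dimension with three records
print(x[0])              # a single record
print(x[0]['density'])   # one field of that record

# a field access is a view: writing through it changes x itself
x['density'][0] = 400
print(x)
```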


You may wonder why we used 'np.int32' in our definition while the internal representation shows '<i4'. In a dtype definition we can either use the type directly (e.g. np.int32) or a string (e.g. 'i4'). So, we could have defined our dtype like this as well:

dt = np.dtype([('density', 'i4')])
x = np.array([(393,), (337,), (256,)],
dtype=dt)
print(x)

[(393,) (337,) (256,)]


The 'i' means integer and the 4 means 4 bytes. What about the less-than sign in front of i4 in the result? We could have written '<i4' in our definition as well. We can prefix a type with the '<' and '>' sign. '<' means that the encoding will be little-endian and '>' means that the encoding will be big-endian. No prefix means that we get the native byte ordering. We demonstrate this in the following by defining a double-precision floating-point number in various orderings:

# little-endian ordering
dt = np.dtype('<d')
print(dt.name, dt.byteorder, dt.itemsize)

# big-endian ordering
dt = np.dtype('>d')
print(dt.name, dt.byteorder, dt.itemsize)

# native byte ordering
dt = np.dtype('d')
print(dt.name, dt.byteorder, dt.itemsize)

float64 = 8
float64 > 8
float64 = 8


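The equal character '=' stands for 'native byte ordering', which is determined by the processor architecture of the machine rather than by the operating system. In our case this means 'little-endian', which is also why the explicitly little-endian type '<d' is reported with '=' as well.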
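What the byte order means in memory can be made visible with a small sketch. It stores the number 1 as a 4-byte integer in both orderings and prints the raw bytes; the names `little` and `big` are chosen by us for illustration.

```python
import sys
import numpy as np

print(sys.byteorder)   # byte order of the machine running the code

little = np.array([1], dtype='<i4')
big = np.array([1], dtype='>i4')

# same value, but the four bytes are stored in opposite order
print(little.tobytes())
print(big.tobytes())

# the values compare equal regardless of the byte order
print(little[0] == big[0])
```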

Another thing in our density array might be confusing. We defined the array with a list containing one-tuples. So you may ask yourself whether it is possible to use tuples and lists interchangeably. This is not possible. The tuples are used to define the records, in our case consisting solely of a density, and the list is the 'container' for the records or, in other words, the lists are recursed upon. The tuples define the atomic elements of the structure and the lists the dimensions.
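The difference between tuples and lists can be checked with a small experiment. The sketch below uses a two-field dtype of our own choosing; the behaviour for the nested lists is what we observe with the NumPy versions known to us and should be read as an illustration, not as a guarantee.

```python
import numpy as np

dt = np.dtype([('density', 'i4'), ('area', 'i4')])

# tuples are records: two records with two fields each
records = np.array([(393, 41526), (337, 30510)], dtype=dt)
print(records.shape)

# inner lists are read as a further dimension; every single number
# is then turned into a record of its own, which is rarely intended
nested = np.array([[393, 41526], [337, 30510]], dtype=dt)
print(nested.shape)
print(nested)
```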

Now we will add the country name, the area and the population number to our data type:

dt = np.dtype([('country', 'S20'), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])
population_table = np.array([
('Netherlands', 393, 41526, 16928800),
('Belgium', 337, 30510, 11007020),
('United Kingdom', 256, 243610, 62262000),
('Germany', 233, 357021, 81799600),
('Liechtenstein', 205, 160, 32842),
('Italy', 192, 301230, 59715625),
('Switzerland', 177, 41290, 7301994),
('Luxembourg', 173, 2586, 512000),
('France', 111, 547030, 63601002),
('Austria', 97, 83858, 8169929),
('Greece', 81, 131940, 11606813),
('Ireland', 65, 70280, 4581269),
('Sweden', 20, 449964, 9515744),
('Finland', 16, 338424, 5410233),
('Norway', 13, 385252, 5033675)],
dtype=dt)
print(population_table[:4])

[(b'Netherlands', 393,  41526, 16928800)
(b'Belgium', 337,  30510, 11007020)
(b'United Kingdom', 256, 243610, 62262000)
(b'Germany', 233, 357021, 81799600)]


We can access every column individually:

print(population_table['density'])
print(population_table['country'])
print(population_table['area'][2:5])

[393 337 256 233 205 192 177 173 111  97  81  65  20  16  13]
[b'Netherlands' b'Belgium' b'United Kingdom' b'Germany' b'Liechtenstein'
b'Italy' b'Switzerland' b'Luxembourg' b'France' b'Austria' b'Greece'
b'Ireland' b'Sweden' b'Finland' b'Norway']
[243610 357021    160]
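Since every column behaves like an ordinary array, columns can also be used to filter and sort whole records. The following sketch works on a shortened version of the table; the names `table`, `dense` and `by_area` are ours.

```python
import numpy as np

dt = np.dtype([('country', 'S20'), ('density', 'i4'),
               ('area', 'i4'), ('population', 'i4')])
table = np.array([('Netherlands', 393, 41526, 16928800),
                  ('Belgium', 337, 30510, 11007020),
                  ('Liechtenstein', 205, 160, 32842),
                  ('Norway', 13, 385252, 5033675)], dtype=dt)

# a boolean mask built from one column selects whole records
dense = table[table['density'] > 200]
print(dense['country'])

# sort the records by the values of a single field
by_area = np.sort(table, order='area')
print(by_area['country'])

# arithmetic on a column works as on any ordinary array
print(table['population'].sum())
```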


Unicode Strings in Array

Some may have noticed that the strings in our previous array are prefixed with a lower case "b". This means that we created byte strings with the definition "('country', 'S20')". To get unicode strings we replace this with the definition "('country', np.str_, 20)". (The alias np.unicode, which was formerly used for this purpose, has been removed from recent NumPy versions.) We will redefine our population table now:

dt = np.dtype([('country', np.str_, 20),
('density', 'i4'),
('area', 'i4'),
('population', 'i4')])
population_table = np.array([
('Netherlands', 393, 41526, 16928800),
('Belgium', 337, 30510, 11007020),
('United Kingdom', 256, 243610, 62262000),
('Germany', 233, 357021, 81799600),
('Liechtenstein', 205, 160, 32842),
('Italy', 192, 301230, 59715625),
('Switzerland', 177, 41290, 7301994),
('Luxembourg', 173, 2586, 512000),
('France', 111, 547030, 63601002),
('Austria', 97, 83858, 8169929),
('Greece', 81, 131940, 11606813),
('Ireland', 65, 70280, 4581269),
('Sweden', 20, 449964, 9515744),
('Finland', 16, 338424, 5410233),
('Norway', 13, 385252, 5033675)],
dtype=dt)
print(population_table[:4])

[('Netherlands', 393,  41526, 16928800) ('Belgium', 337,  30510, 11007020)
('United Kingdom', 256, 243610, 62262000)
('Germany', 233, 357021, 81799600)]
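The two string types differ in memory usage as well. The following sketch compares them on a small list of names and shows how an existing array of byte strings can be decoded afterwards; the variable names are ours.

```python
import numpy as np

names = ['Netherlands', 'Belgium', 'Germany']

as_bytes = np.array(names, dtype='S20')   # byte strings
as_text = np.array(names, dtype='U20')    # unicode strings

# 'S20' reserves 20 bytes per entry, 'U20' reserves 4 bytes per character
print(as_bytes.dtype.itemsize, as_text.dtype.itemsize)

# byte strings can be decoded into unicode strings afterwards
decoded = np.char.decode(as_bytes, 'utf-8')
print(decoded)
```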


Input and Output of Structured Arrays

In most applications it will be necessary to save the data from a program into a file. We will write our previously created array to a file with the function savetxt. You will find a detailed introduction to this topic in our chapter "Reading and Writing Data Files".

np.savetxt("population_table.csv",
population_table,
fmt="%s;%d;%d;%d",
delimiter=";")


It is highly probable that you will need to read in the previously written file at a later date. This can be achieved with the function genfromtxt.

dt = np.dtype([('country', np.str_, 20), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])

x = np.genfromtxt("population_table.csv",
dtype=dt,
delimiter=";")

print(x)

[('Netherlands', 393,  41526, 16928800) ('Belgium', 337,  30510, 11007020)
('United Kingdom', 256, 243610, 62262000)
('Germany', 233, 357021, 81799600)
('Liechtenstein', 205,    160,    32842) ('Italy', 192, 301230, 59715625)
('Switzerland', 177,  41290,  7301994)
('Luxembourg', 173,   2586,   512000) ('France', 111, 547030, 63601002)
('Austria',  97,  83858,  8169929) ('Greece',  81, 131940, 11606813)
('Ireland',  65,  70280,  4581269) ('Sweden',  20, 449964,  9515744)
('Finland',  16, 338424,  5410233) ('Norway',  13, 385252,  5033675)]
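To convince ourselves that nothing gets lost between writing and reading, we can compare the data column by column. The sketch below uses a shortened table and writes into a temporary directory; the names `path` and `loaded` as well as the explicit encoding="utf-8" argument are our choices, not part of the example above.

```python
import os
import tempfile
import numpy as np

dt = np.dtype([('country', 'U20'), ('density', 'i4'),
               ('area', 'i4'), ('population', 'i4')])
table = np.array([('Netherlands', 393, 41526, 16928800),
                  ('Belgium', 337, 30510, 11007020),
                  ('United Kingdom', 256, 243610, 62262000)], dtype=dt)

# write to a temporary directory so that no existing file is touched
path = os.path.join(tempfile.mkdtemp(), "population_table.csv")
np.savetxt(path, table, fmt="%s;%d;%d;%d", encoding="utf-8")

# read the file back with the same dtype and delimiter
loaded = np.genfromtxt(path, dtype=dt, delimiter=";", encoding="utf-8")

# every column should survive the round trip unchanged
for name in dt.names:
    print(name, (loaded[name] == table[name]).all())
```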


There is also a function "loadtxt", but it can be more difficult to use because, depending on the NumPy version, it hands the strings over as byte strings.

To overcome this problem, we can use loadtxt with a converter function for the first column:

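dt = np.dtype([('country', np.str_, 20), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])

# decode byte strings; values that are already str are left untouched
x = np.loadtxt("population_table.csv",
               dtype=dt,
               converters={0: lambda s: s.decode('utf-8') if isinstance(s, bytes) else s},
               delimiter=";")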

print(x)

[('Netherlands', 393,  41526, 16928800) ('Belgium', 337,  30510, 11007020)
('United Kingdom', 256, 243610, 62262000)
('Germany', 233, 357021, 81799600)
('Liechtenstein', 205,    160,    32842) ('Italy', 192, 301230, 59715625)
('Switzerland', 177,  41290,  7301994)
('Luxembourg', 173,   2586,   512000) ('France', 111, 547030, 63601002)
('Austria',  97,  83858,  8169929) ('Greece',  81, 131940, 11606813)
('Ireland',  65,  70280,  4581269) ('Sweden',  20, 449964,  9515744)
('Finland',  16, 338424,  5410233) ('Norway',  13, 385252,  5033675)]


Exercises:

Before you go on, you may want to take some time for a few exercises to deepen your understanding of the material covered so far.

1. Exercise:
Define a structured array with two columns. The first column contains the product ID, which can be defined as an int32. The second column shall contain the price for the product. How can you print out the column with the product IDs, the first row and the price for the third article of this structured array?

2. Exercise:
Figure out a data type definition for time records with entries for hours, minutes and seconds.

Solutions:

Solution to the first exercise:

import numpy as np

mytype = [('productID', np.int32), ('price', np.float64)]

stock = np.array([(34765, 603.76),
(45765, 439.93),
(99661, 344.19),
(12129, 129.39)], dtype=mytype)

print(stock[1])
print(stock["productID"])
print(stock[2]["price"])
print(stock)

(45765,  439.93)
[34765 45765 99661 12129]
344.19
[(34765,  603.76) (45765,  439.93) (99661,  344.19) (12129,  129.39)]


Solution to the second exercise:

time_type = np.dtype( [('h', int), ('min', int), ('sec', int)])

times = np.array([(11, 38, 5),
(14, 56, 0),
(3, 9, 1)], dtype=time_type)
print(times)
print(times[0])
# reset the first time record:
times[0] = (11, 42, 17)
print(times[0])

[(11, 38, 5) (14, 56, 0) ( 3,  9, 1)]
(11, 38, 5)
(11, 42, 17)
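The solution above uses the default integer type for every field. As a possible variation, and only as a sketch, the same record can be stored far more compactly with 1-byte unsigned integers ('u1'), because hours, minutes and seconds never exceed 255. The names `wide`, `compact` and `seconds` are ours.

```python
import numpy as np

# the default integer type, typically 8 bytes per field
wide = np.dtype([('h', int), ('min', int), ('sec', int)])
# unsigned 1-byte integers are sufficient for values below 256
compact = np.dtype([('h', 'u1'), ('min', 'u1'), ('sec', 'u1')])

print(wide.itemsize, compact.itemsize)   # bytes per record

times = np.array([(11, 38, 5), (14, 56, 0), (3, 9, 1)], dtype=compact)

# seconds since midnight; convert first to avoid overflow in 'u1'
seconds = (times['h'].astype(int) * 3600
           + times['min'].astype(int) * 60
           + times['sec'])
print(seconds)
```

The price of the compact layout is that arithmetic on the fields can overflow, which is why the sketch converts to a larger integer type before multiplying.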


A more Complex Example:

We will increase the complexity of our previous example by adding temperatures to the records.

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])

times = np.array( [((11, 42, 17), 20.8), ((13, 19, 3), 23.2) ], dtype=time_type)
print(times)
print(times['time'])
print(times['time']['h'])
print(times['temperature'])

[((11, 42, 17),  20.8) ((13, 19,  3),  23.2)]
[(11, 42, 17) (13, 19,  3)]
[11 13]
[ 20.8  23.2]
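Nested fields can be used for selections just like flat ones. The following sketch repeats the definition from above and then filters the records by the hour and averages the temperatures; the name `afternoon` is ours.

```python
import numpy as np

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])
times = np.array([((11, 42, 17), 20.8), ((13, 19, 3), 23.2)],
                 dtype=time_type)

# the nested field has a structured dtype of its own
print(time_type['time'].names)

# select all records measured at 12 o'clock or later
afternoon = times[times['time']['h'] >= 12]
print(afternoon)

# average of the temperature column
print(times['temperature'].mean())
```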


Exercise

This exercise should be closer to real-life examples. Usually, we have to create or get the data for our structured array from some database or file. We will use the list which we created in our chapter on file I/O, "File Management". The list has been saved with the aid of pickle.dump in the file cities_and_times.pkl.

So the first task consists in unpickling our data:

import pickle

# pickle files have to be opened in binary mode
with open("cities_and_times.pkl", "rb") as fh:
    cities_and_times = pickle.load(fh)
print(cities_and_times[:30])

[('Amsterdam', 'Sun', (8, 52)), ('Anchorage', 'Sat', (23, 52)), ('Ankara', 'Sun', (10, 52)), ('Athens', 'Sun', (9, 52)), ('Atlanta', 'Sun', (2, 52)), ('Auckland', 'Sun', (20, 52)), ('Barcelona', 'Sun', (8, 52)), ('Beirut', 'Sun', (9, 52)), ('Berlin', 'Sun', (8, 52)), ('Boston', 'Sun', (2, 52)), ('Brasilia', 'Sun', (5, 52)), ('Brussels', 'Sun', (8, 52)), ('Bucharest', 'Sun', (9, 52)), ('Budapest', 'Sun', (8, 52)), ('Cairo', 'Sun', (9, 52)), ('Calgary', 'Sun', (1, 52)), ('Cape Town', 'Sun', (9, 52)), ('Casablanca', 'Sun', (7, 52)), ('Chicago', 'Sun', (1, 52)), ('Columbus', 'Sun', (2, 52)), ('Copenhagen', 'Sun', (8, 52)), ('Dallas', 'Sun', (1, 52)), ('Denver', 'Sun', (1, 52)), ('Detroit', 'Sun', (2, 52)), ('Dubai', 'Sun', (11, 52)), ('Dublin', 'Sun', (7, 52)), ('Edmonton', 'Sun', (1, 52)), ('Frankfurt', 'Sun', (8, 52)), ('Halifax', 'Sun', (3, 52)), ('Helsinki', 'Sun', (9, 52))]
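If the file cities_and_times.pkl is not at hand, a file with the same layout can be produced for experimenting. The following sketch pickles a hypothetical three-entry sample, taken from the beginning of the output above, into a temporary directory and reads it back; the file name cities_and_times_sample.pkl and the variable names are ours.

```python
import os
import pickle
import tempfile

# hypothetical sample in the same layout: (city, day, (hour, minute))
sample = [('Amsterdam', 'Sun', (8, 52)),
          ('Anchorage', 'Sat', (23, 52)),
          ('Ankara', 'Sun', (10, 52))]

path = os.path.join(tempfile.mkdtemp(), "cities_and_times_sample.pkl")

# "wb" and "rb": pickle files have to be opened in binary mode
with open(path, "wb") as fh:
    pickle.dump(sample, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == sample)
```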


Turning our data into a structured array:

time_type = np.dtype([('city', 'U30'), ('day', 'U3'), ('time', [('h', int), ('min', int)])])

times = np.array( cities_and_times , dtype=time_type)
print(times['time'])
print(times['city'])
x = times[27]
x[0]

[( 8, 52) (23, 52) (10, 52) ( 9, 52) ( 2, 52) (20, 52) ( 8, 52) ( 9, 52)
( 8, 52) ( 2, 52) ( 5, 52) ( 8, 52) ( 9, 52) ( 8, 52) ( 9, 52) ( 1, 52)
( 9, 52) ( 7, 52) ( 1, 52) ( 2, 52) ( 8, 52) ( 1, 52) ( 1, 52) ( 2, 52)
(11, 52) ( 7, 52) ( 1, 52) ( 8, 52) ( 3, 52) ( 9, 52) ( 1, 52) ( 2, 52)
(10, 52) ( 9, 52) ( 9, 52) (13, 37) (10, 52) ( 0, 52) ( 7, 52) ( 7, 52)
( 0, 52) ( 8, 52) (18, 52) ( 2, 52) ( 1, 52) ( 2, 52) (10, 52) ( 1, 52)
( 2, 52) ( 8, 52) ( 2, 52) ( 8, 52) ( 2, 52) ( 0, 52) ( 8, 52) ( 7, 52)
(10, 52) ( 8, 52) ( 1, 52) ( 0, 52) ( 1, 52) ( 4, 52) ( 0, 52) (15, 52)
(15, 52) ( 8, 52) (18, 52) ( 5, 52) (16, 52) ( 2, 52) ( 0, 52) ( 8, 52)
( 8, 52) ( 2, 52) ( 1, 52) ( 8, 52)]
['Amsterdam' 'Anchorage' 'Ankara' 'Athens' 'Atlanta' 'Auckland' 'Barcelona'
'Beirut' 'Berlin' 'Boston' 'Brasilia' 'Brussels' 'Bucharest' 'Budapest'
'Cairo' 'Calgary' 'Cape Town' 'Casablanca' 'Chicago' 'Columbus'
'Copenhagen' 'Dallas' 'Denver' 'Detroit' 'Dubai' 'Dublin' 'Edmonton'
'Frankfurt' 'Halifax' 'Helsinki' 'Houston' 'Indianapolis' 'Istanbul'
'Jerusalem' 'Johannesburg' 'Kathmandu' 'Kuwait City' 'Las Vegas' 'Lisbon'
'London' 'Los Angeles' 'Madrid' 'Melbourne' 'Miami' 'Minneapolis'
'Montreal' 'Moscow' 'New Orleans' 'New York' 'Oslo' 'Ottawa' 'Paris'

'Frankfurt'