Numerical & Scientific Computing with Python: Data Type Objects, dtype

## Data Type Objects, dtype

### dtype

The data type object 'dtype' is an instance of the numpy.dtype class. It can be created with numpy.dtype.

So far, we have used only fundamental numeric data types like 'int' and 'float' in our examples of numpy arrays. These numpy arrays contained solely homogeneous data. dtype objects are constructed from combinations of fundamental data types. With the aid of dtype we are able to create "structured arrays", also known as "record arrays". Structured arrays give us the ability to have a different data type per column, which is similar to the structure of an Excel sheet. So, we can define data like that in the following table:

| Country | Population Density (per km²) | Area (km²) | Population |
|---|---|---|---|
| Netherlands | 393 | 41526 | 16,928,800 |
| Belgium | 337 | 30510 | 11,007,020 |
| United Kingdom | 256 | 243610 | 62,262,000 |
| Germany | 233 | 357021 | 81,799,600 |
| Liechtenstein | 205 | 160 | 32,842 |
| Italy | 192 | 301230 | 59,715,625 |
| Switzerland | 177 | 41290 | 7,301,994 |
| Luxembourg | 173 | 2586 | 512,000 |
| France | 111 | 547030 | 63,601,002 |
| Austria | 97 | 83858 | 8,169,929 |
| Greece | 81 | 131940 | 11,606,813 |
| Ireland | 65 | 70280 | 4,581,269 |
| Sweden | 20 | 449964 | 9,515,744 |
| Finland | 16 | 338424 | 5,410,233 |
| Norway | 13 | 385252 | 5,033,675 |

Before we start with complex data like the table above, we want to introduce dtype with a very simple example. We define an int16 data type and call this type i16. (We have to admit that this is not a nice name, but we use it only here!) The elements of the list 'lst' are converted to i16 to create the two-dimensional array A. Note that the conversion simply drops the fractional part of each float.

import numpy as np
i16 = np.dtype(np.int16)
print(i16)
lst = [ [3.4, 8.7, 9.9],
        [1.1, -7.8, -0.7],
        [4.1, 12.3, 4.8] ]
A = np.array(lst, dtype=i16)
print(A)

int16
[[ 3  8  9]
 [ 1 -7  0]
 [ 4 12  4]]


All we did in the previous example was to introduce a new name for a basic data type. This has nothing to do with the structured arrays that we mentioned in the introduction of this chapter of our dtype tutorial.
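The following short sketch illustrates both points from the example above: that i16 is just another name for np.int16, and that converting floats to an integer type cuts off the fractional part instead of rounding. The variable names (values, truncated, rounded) are chosen only for this illustration.

```python
import numpy as np

i16 = np.dtype(np.int16)

# The dtype object compares equal to the type it was built from
# and to the equivalent string code 'i2' (integer, 2 bytes).
same_as_type = (i16 == np.int16)
same_as_code = (i16 == np.dtype('i2'))
print(same_as_type, same_as_code)

# Converting floats to an integer dtype drops the fractional part
# (truncation toward zero); it does not round.
values = np.array([3.4, 8.7, -7.8, -0.7])
truncated = values.astype(i16)

# If rounding is wanted, it has to be done explicitly first.
rounded = np.rint(values).astype(i16)

print(truncated)
print(rounded)
```

If rounded values are what you need, round explicitly before converting, as in the last step.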

Now we will take the first step towards implementing the table with European countries and the information on population, area and population density. We create a structured array with the 'density' column. The data type is defined as np.dtype([('density', np.int32)]). We assign this data type to the variable 'dt' for the sake of convenience. We use this data type in the array definition, in which we use the first three densities.

import numpy as np
dt = np.dtype([('density', np.int32)])
x = np.array([(393,), (337,), (256,)],
dtype=dt)
print(x)
print("\nThe internal representation:")
print(repr(x))

[(393,) (337,) (256,)]

The internal representation:
array([(393,), (337,), (256,)],
      dtype=[('density', '<i4')])


We can access the content of the density column by indexing x with the key 'density'. It looks like accessing a dictionary in Python:

print(x['density'])

[393 337 256]


You may wonder why we used 'np.int32' in our definition while the internal representation shows '<i4'. In a dtype definition we can either use the type directly (e.g. np.int32) or use a string (e.g. 'i4'). So, we could have defined our dtype like this as well:

dt = np.dtype([('density', 'i4')])
x = np.array([(393,), (337,), (256,)],
dtype=dt)
print(x)

[(393,) (337,) (256,)]
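As a small check, sketched here only for illustration, we can compare dtype objects created from a type with those created from a string code. The loop additionally prints the kind character, the size in bytes and the name of a few codes; the selection of codes is our own choice.

```python
import numpy as np

# The type and the string code describe the same dtype.
int_match = np.dtype('i4') == np.dtype(np.int32)
short_match = np.dtype('i2') == np.dtype(np.int16)
float_match = np.dtype('f8') == np.dtype(np.float64)
print(int_match, short_match, float_match)

# kind: one-letter category, itemsize: bytes per element
for code in ('i4', 'f8', 'U5'):
    dt = np.dtype(code)
    print(code, dt.kind, dt.itemsize, dt.name)
```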


The 'i' means integer and the 4 means 4 bytes. What about the less-than sign in front of i4 in the result? We could have written '<i4' in our definition as well. A type can be prefixed with a '<' or '>' sign: '<' means that the encoding will be little-endian and '>' means that it will be big-endian. No prefix means that we get the native byte ordering. We demonstrate this in the following by defining a double-precision floating-point number:

dt = np.dtype('<d')
print(dt.name, dt.byteorder, dt.itemsize)
dt = np.dtype('>d')
print(dt.name, dt.byteorder, dt.itemsize)
dt = np.dtype('d')
print(dt.name, dt.byteorder, dt.itemsize)

float64 = 8
float64 > 8
float64 = 8


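The equal character '=' stands for 'native byte ordering', which is determined by the processor architecture of the machine rather than by the operating system. In our case this means 'little-endian', which is what our Linux computer uses, as do most current desktop and laptop processors.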
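To make the difference between the two byte orders visible, the following sketch stores the same number once as '<i4' and once as '>i4' and looks at the raw bytes. It also asks Python for the native byte order of the machine via sys.byteorder. The variable names are chosen for this illustration only.

```python
import sys
import numpy as np

little = np.array([1], dtype='<i4')
big = np.array([1], dtype='>i4')

# Same number, different byte layout in memory.
print(little.tobytes())   # b'\x01\x00\x00\x00'
print(big.tobytes())      # b'\x00\x00\x00\x01'

# The value itself is identical in both arrays.
print(little[0] == big[0])

# The machine's native byte order, as reported by Python.
print(sys.byteorder)

# NumPy reports '=' for the explicit prefix that matches the machine.
native_code = '<' if sys.byteorder == 'little' else '>'
native_flag = np.dtype(native_code + 'i4').byteorder
print(native_flag)
```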

Another thing in our density array might be confusing. We defined the array with a list containing one-tuples. So you may ask yourself whether it is possible to use tuples and lists interchangeably. This is not possible. The tuples are used to define the records, in our case consisting solely of a density, and the list is the 'container' for the records, or in other words 'the lists are recursed upon'. The tuples define the atomic elements of the structure and the lists the dimensions.
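The difference between tuples and lists can be seen in a small experiment. This is a sketch based on how current NumPy versions behave: with inner lists, NumPy treats the nesting as an additional dimension and turns every single number into a record of its own. The two-field data type used here is an assumption made for the illustration.

```python
import numpy as np

dt = np.dtype([('density', 'i4'), ('area', 'i4')])

# Tuples are records: two records with two fields each.
records = np.array([(393, 41526), (337, 30510)], dtype=dt)

# Inner lists are read as a further dimension, so the result
# is a 2 x 2 array of records, not two records.
nested = np.array([[393, 41526], [337, 30510]], dtype=dt)

print(records.shape)
print(nested.shape)
print(records['area'])
```

The first array has the intended shape of two records, the second one does not, which is why the records themselves should always be written as tuples.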

Now we will add the country name, the area and the population number to our data type:

dt = np.dtype([('country', 'S20'), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])
x = np.array([('Netherlands', 393, 41526, 16928800),
              ('Belgium', 337, 30510, 11007020),
              ('United Kingdom', 256, 243610, 62262000),
              ('Germany', 233, 357021, 81799600),
              ('Liechtenstein', 205, 160, 32842),
              ('Italy', 192, 301230, 59715625),
              ('Switzerland', 177, 41290, 7301994),
              ('Luxembourg', 173, 2586, 512000),
              ('France', 111, 547030, 63601002),
              ('Austria', 97, 83858, 8169929),
              ('Greece', 81, 131940, 11606813),
              ('Ireland', 65, 70280, 4581269),
              ('Sweden', 20, 449964, 9515744),
              ('Finland', 16, 338424, 5410233),
              ('Norway', 13, 385252, 5033675)],
             dtype=dt)
print(x[:4])

[(b'Netherlands', 393, 41526, 16928800) (b'Belgium', 337, 30510, 11007020)
(b'United Kingdom', 256, 243610, 62262000)
(b'Germany', 233, 357021, 81799600)]
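The b prefix in the output above shows that a field declared as 'S20' holds byte strings rather than ordinary Python strings. If text strings are preferred, a Unicode code such as 'U20' can be used instead. The following sketch compares both on a shortened version of the data; the names bytes_dt and text_dt are our own.

```python
import numpy as np

bytes_dt = np.dtype([('country', 'S20'), ('density', 'i4')])
text_dt = np.dtype([('country', 'U20'), ('density', 'i4')])

rows = [('Netherlands', 393), ('Belgium', 337)]
as_bytes = np.array(rows, dtype=bytes_dt)
as_text = np.array(rows, dtype=text_dt)

print(as_bytes['country'][0])   # a bytes object
print(as_text['country'][0])    # a str object

# 'S20' reserves 20 bytes per name, 'U20' reserves 4 bytes per character.
print(bytes_dt['country'].itemsize, text_dt['country'].itemsize)
```

The price for the more convenient 'U' type is memory: every character occupies four bytes.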


We can access every column individually:

print(x['density'])
print(x['country'])
print(x['area'][2:5])

[393 337 256 233 205 192 177 173 111  97  81  65  20  16  13]
[b'Netherlands' b'Belgium' b'United Kingdom' b'Germany' b'Liechtenstein'
b'Italy' b'Switzerland' b'Luxembourg' b'France' b'Austria' b'Greece'
b'Ireland' b'Sweden' b'Finland' b'Norway']
[243610 357021    160]
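Because every column is an ordinary numpy array, columns can also be used to select records. The sketch below works on four rows of the table only, to keep it short. It filters the records with a boolean mask, looks up the country with the largest area, and selects two columns at once by passing a list of field names.

```python
import numpy as np

dt = np.dtype([('country', 'S20'), ('density', 'i4'),
               ('area', 'i4'), ('population', 'i4')])
x = np.array([('Netherlands', 393, 41526, 16928800),
              ('Belgium', 337, 30510, 11007020),
              ('Liechtenstein', 205, 160, 32842),
              ('Norway', 13, 385252, 5033675)], dtype=dt)

# Boolean mask built from one column, applied to the whole records.
dense = x[x['density'] > 200]
print(dense['country'])

# Index of the largest value in one column, used to look up another.
largest = x['country'][np.argmax(x['area'])]
print(largest)

# Several columns at once: pass a list of field names.
print(x[['country', 'area']][0])
```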


### Exercise:

Before you go on, you may want to take the time to do an exercise to deepen your understanding of what we have covered so far.

Figure out a data type definition for time records with entries for hours, minutes and seconds. A possible solution looks like this:

time_type = np.dtype( [('h', int), ('min', int), ('sec', int)])
times = np.array([(11, 38, 5),
(14, 56, 0),
(3, 9, 1)], dtype=time_type)
print(times)
print(times[0])
# reset the first time record:
times[0] = (11, 42, 17)
print(times[0])

[(11, 38, 5) (14, 56, 0) (3, 9, 1)]
(11, 38, 5)
(11, 42, 17)
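Besides replacing a whole record, we can change a single field of a record and calculate with complete columns. The following sketch, which reuses the time_type from the solution above, converts each time into seconds since midnight. The name seconds is chosen for this illustration.

```python
import numpy as np

time_type = np.dtype([('h', int), ('min', int), ('sec', int)])
times = np.array([(11, 38, 5),
                  (14, 56, 0),
                  (3, 9, 1)], dtype=time_type)

# Change only the seconds of the first record.
times['sec'][0] = 30
print(times[0])

# Whole columns take part in ordinary array arithmetic.
seconds = times['h'] * 3600 + times['min'] * 60 + times['sec']
print(seconds)
```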


### A more Complex Example:

We will increase the complexity of our previous example by adding temperatures to the records.

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])
times = np.array( [((11, 42, 17), 20.8), ((13, 19, 3), 23.2) ], dtype=time_type)
print(times)
print(times['time'])
print(times['time']['h'])
print(times['temperature'])

[((11, 42, 17), 20.8) ((13, 19, 3), 23.2)]
[(11, 42, 17) (13, 19, 3)]
[11 13]
[ 20.8  23.2]
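The nested structure can also be inspected through the dtype object itself. The sketch below prints the field names of the outer and the inner data type and then selects the records with a temperature above a threshold. The threshold of 21.0 is an arbitrary value chosen for the illustration.

```python
import numpy as np

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])
times = np.array([((11, 42, 17), 20.8), ((13, 19, 3), 23.2)],
                 dtype=time_type)

# Field names of the outer and of the nested data type.
print(time_type.names)
print(time_type['time'].names)

# Select records by temperature, then look at their times.
warm = times[times['temperature'] > 21.0]
print(warm['time'])
```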


### Exercise

This exercise should be closer to real-life examples. Usually, we have to create or get the data for our structured array from some database or file. We will use the list which we created in our chapter on file I/O, "File Management". The list has been saved with the aid of pickle.dump in the file cities_and_times.pkl.

So the first task consists in unpickling our data:

import pickle
# load the list that was saved with pickle.dump
with open("cities_and_times.pkl", "rb") as fh:
    cities_and_times = pickle.load(fh)
print(cities_and_times[:30])

[('Amsterdam', 'Sun', (8, 52)), ('Anchorage', 'Sat', (23, 52)), ('Ankara', 'Sun', (10, 52)), ('Athens', 'Sun', (9, 52)), ('Atlanta', 'Sun', (2, 52)), ('Auckland', 'Sun', (20, 52)), ('Barcelona', 'Sun', (8, 52)), ('Beirut', 'Sun', (9, 52)), ('Berlin', 'Sun', (8, 52)), ('Boston', 'Sun', (2, 52)), ('Brasilia', 'Sun', (5, 52)), ('Brussels', 'Sun', (8, 52)), ('Bucharest', 'Sun', (9, 52)), ('Budapest', 'Sun', (8, 52)), ('Cairo', 'Sun', (9, 52)), ('Calgary', 'Sun', (1, 52)), ('Cape Town', 'Sun', (9, 52)), ('Casablanca', 'Sun', (7, 52)), ('Chicago', 'Sun', (1, 52)), ('Columbus', 'Sun', (2, 52)), ('Copenhagen', 'Sun', (8, 52)), ('Dallas', 'Sun', (1, 52)), ('Denver', 'Sun', (1, 52)), ('Detroit', 'Sun', (2, 52)), ('Dubai', 'Sun', (11, 52)), ('Dublin', 'Sun', (7, 52)), ('Edmonton', 'Sun', (1, 52)), ('Frankfurt', 'Sun', (8, 52)), ('Halifax', 'Sun', (3, 52)), ('Helsinki', 'Sun', (9, 52))]


Turning our data into a structured array:

time_type = np.dtype([('city', 'U30'), ('day', 'U3'), ('time', [('h', int), ('min', int)])])
times = np.array( cities_and_times , dtype=time_type)
print(times['time'])
print(times['city'])
x = times[27]
x[0]

[(8, 52) (23, 52) (10, 52) (9, 52) (2, 52) (20, 52) (8, 52) (9, 52) (8, 52)
(2, 52) (5, 52) (8, 52) (9, 52) (8, 52) (9, 52) (1, 52) (9, 52) (7, 52)
(1, 52) (2, 52) (8, 52) (1, 52) (1, 52) (2, 52) (11, 52) (7, 52) (1, 52)
(8, 52) (3, 52) (9, 52) (1, 52) (2, 52) (10, 52) (9, 52) (9, 52) (13, 37)
(10, 52) (0, 52) (7, 52) (7, 52) (0, 52) (8, 52) (18, 52) (2, 52) (1, 52)
(2, 52) (10, 52) (1, 52) (2, 52) (8, 52) (2, 52) (8, 52) (2, 52) (0, 52)
(8, 52) (7, 52) (10, 52) (8, 52) (1, 52) (0, 52) (1, 52) (4, 52) (0, 52)
(15, 52) (15, 52) (8, 52) (18, 52) (5, 52) (16, 52) (2, 52) (0, 52)
(8, 52) (8, 52) (2, 52) (1, 52) (8, 52)]
['Amsterdam' 'Anchorage' 'Ankara' 'Athens' 'Atlanta' 'Auckland' 'Barcelona'
'Beirut' 'Berlin' 'Boston' 'Brasilia' 'Brussels' 'Bucharest' 'Budapest'
'Cairo' 'Calgary' 'Cape Town' 'Casablanca' 'Chicago' 'Columbus'
'Copenhagen' 'Dallas' 'Denver' 'Detroit' 'Dubai' 'Dublin' 'Edmonton'
'Frankfurt' 'Halifax' 'Helsinki' 'Houston' 'Indianapolis' 'Istanbul'
'Jerusalem' 'Johannesburg' 'Kathmandu' 'Kuwait City' 'Las Vegas' 'Lisbon'
'London' 'Los Angeles' 'Madrid' 'Melbourne' 'Miami' 'Minneapolis'
'Montreal' 'Moscow' 'New Orleans' 'New York' 'Oslo' 'Ottawa' 'Paris'

'Frankfurt'
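Once the data is in a structured array, typical questions can be answered with a few lines. Since the pickle file may not be at hand, the following sketch uses only four of the entries printed above. It selects the cities where it is still Saturday and orders the cities by the hour of their local time.

```python
import numpy as np

time_type = np.dtype([('city', 'U30'), ('day', 'U3'),
                      ('time', [('h', int), ('min', int)])])
sample = [('Amsterdam', 'Sun', (8, 52)),
          ('Anchorage', 'Sat', (23, 52)),
          ('Ankara', 'Sun', (10, 52)),
          ('Atlanta', 'Sun', (2, 52))]
times = np.array(sample, dtype=time_type)

# Cities where it is still Saturday.
saturday = times[times['day'] == 'Sat']['city']
print(saturday)

# Order the cities by the hour of their local time.
order = np.argsort(times['time']['h'])
by_hour = times['city'][order]
print(by_hour)
```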