There are two ways to load data:
- Load Data using NumPy Array
- Load Data using TensorFlow Data Pipeline
Load Data using NumPy Array
We can hard-code data into a NumPy array, or we can load data from an Excel (xls or xlsx) or CSV file into a Pandas DataFrame that is later converted into a NumPy array; a sketch of the CSV route follows the example below. You can use this method if your dataset is not very big, say less than 10 gigabytes, so that the data fits into memory.
```python
## Numpy to pandas
import numpy as np
import pandas as pd

h = [[1, 2], [3, 4]]
df_h = pd.DataFrame(h)
print('Data Frame:', df_h)

## Pandas to numpy
df_h_n = np.array(df_h)
print('Numpy array:', df_h_n)
```
The output of the above code will be:

```
Data Frame:    0  1
0  1  2
1  3  4
Numpy array: [[1 2]
 [3 4]]
```
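For the CSV route, a minimal sketch could look like the following; the file name data.csv is hypothetical, and DataFrame.to_numpy() assumes pandas 0.24 or newer (np.array(df) works on older versions):

```python
import pandas as pd

df = pd.read_csv('data.csv')  # load the CSV into a DataFrame
data = df.to_numpy()          # convert the DataFrame to a NumPy array
print(data.shape, data.dtype)
```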
Load Data using TensorFlow Data Pipeline
TensorFlow has a built-in API, tf.data, that helps you load the data, perform operations on it, and feed the machine learning algorithm easily. This method works best when you have a large dataset. For instance, image datasets are known to be huge and do not fit into memory; the data pipeline manages the memory by itself. Without it, a dataset of 50 gigabytes on a computer with only 16 gigabytes of memory would crash the machine.
In these circumstances, you need to build a TensorFlow pipeline. The pipeline loads the data in batches, or small chunks. Each batch is pushed through the pipeline and made ready for training. Building a pipeline is an excellent solution because it permits you to use parallel computing: TensorFlow can prepare batches across multiple CPU cores while the model trains. This speeds up computation and permits the training of powerful neural networks. A sketch of such a batched pipeline follows.
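As a minimal sketch (not from the tutorial; the array shape and batch size are assumptions), a pipeline that feeds a large array in chunks could look like this:

```python
import numpy as np
import tensorflow as tf

# pretend this array is too large to feed to the model in one go
features = np.random.sample((1000, 2))

dataset = tf.data.Dataset.from_tensor_slices(features)
dataset = dataset.batch(32)     # deliver 32 rows at a time
dataset = dataset.prefetch(1)   # prepare the next batch while the current one trains
```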
The steps to create a TensorFlow data pipeline are as follows:
- Create the Data
```python
import numpy as np
import tensorflow as tf

x_input = np.random.sample((1, 2))
print(x_input)
```
In the above code, we generate a 1x2 array of random values with np.random.sample(), which draws from a uniform distribution over [0, 1).
- Create the Placeholder
```python
x = tf.placeholder(tf.float32, shape=[1, 2], name='X')
```
We create the placeholder with tf.placeholder(); it will be fed with the NumPy data when the session runs.
- Define the Dataset Method
```python
dataset = tf.data.Dataset.from_tensor_slices(x)
```
We define the dataset with tf.data.Dataset.from_tensor_slices(), which slices the input tensor along its first dimension, as the sketch below illustrates.
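A small illustration (not from the tutorial) of this slicing behavior: a (3, 2) array becomes a dataset of three elements, each of shape (2,).

```python
import numpy as np
import tensorflow as tf

arr = np.array([[1, 2], [3, 4], [5, 6]])
ds = tf.data.Dataset.from_tensor_slices(arr)
print(ds.output_shapes)  # (2,) -- each element is one row of the array
```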
- Create the Pipeline
```python
iterator = dataset.make_initializable_iterator()
get_next = iterator.get_next()
```
In the above code, we initialize the pipeline through which the data will flow. We create an iterator with make_initializable_iterator() and name it iterator. We then call this iterator to supply the next batch of data via get_next(), and name this step get_next. Note that in this example there is only one batch of data, with only two values.
- Execute the Operation
```python
with tf.Session() as sess:
    # feed the placeholder with data
    sess.run(iterator.initializer, feed_dict={x: x_input})
    print(sess.run(get_next))  # output [ 0.52374458  0.71968478]
```
In the above code, we start a session and run the iterator's initializer, passing the values generated by NumPy through feed_dict. These two values populate the placeholder x. Then we run get_next to print the result.
Source Code
```python
import numpy as np
import tensorflow as tf

x_input = np.random.sample((1, 2))
print(x_input)

# using a placeholder
x = tf.placeholder(tf.float32, shape=[1, 2], name='X')
dataset = tf.data.Dataset.from_tensor_slices(x)
iterator = dataset.make_initializable_iterator()
get_next = iterator.get_next()

with tf.Session() as sess:
    # feed the placeholder with data
    sess.run(iterator.initializer, feed_dict={x: x_input})
    print(sess.run(get_next))
```
The output of the above code will be:

```
[[0.87908525 0.80727791]]
[0.87908524 0.8072779 ]
```

The first line is the NumPy array printed directly (64-bit floats); the second is the same data returned through the pipeline after the cast to tf.float32, hence the slightly different rounding.
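Note that tf.placeholder, make_initializable_iterator, and tf.Session belong to the TensorFlow 1.x API and were removed from the default namespace in TensorFlow 2.x (they survive under tf.compat.v1). As a rough sketch, the equivalent in TensorFlow 2.x reduces to eager iteration over the dataset, with no placeholder or session needed:

```python
import numpy as np
import tensorflow as tf

x_input = np.random.sample((1, 2))
dataset = tf.data.Dataset.from_tensor_slices(x_input.astype('float32'))

for batch in dataset:     # each element is one slice along the first axis
    print(batch.numpy())  # e.g. [0.52374458 0.71968478]
```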