TensorFlow for Beginners

There are two ways to load data:

  1. Load Data using NumPy Array
  2. Load Data using TensorFlow Data Pipeline

 

Load Data using NumPy Array

We can hard-code data into a NumPy array, or we can load data from an Excel (xls or xlsx) or CSV file into a Pandas DataFrame, which is then converted into a NumPy array. Use this method when the dataset is small enough to fit into memory. As a rough guide, that means less than 10 gigabytes, depending on how much RAM your machine has.

## Numpy to pandas    
import numpy as np    
import pandas as pd  
  
h = [[1,2],[3,4]]     
df_h = pd.DataFrame(h)    
print('Data Frame:', df_h)    
    
## Pandas to numpy    
df_h_n = np.array(df_h)    
print('Numpy array:', df_h_n)

 

The output of the above code will be:

Data Frame:    0  1
0  1  2
1  3  4
Numpy array: [[1 2]
 [3 4]]
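The section above also mentions loading data from a CSV file, which the example does not show. Here is a minimal sketch of that path. The CSV text and its column names (height, weight) are made up for illustration, and io.StringIO stands in for a real file so the snippet runs on its own. With an actual file you would pass its path to pd.read_csv, or use pd.read_excel for xls or xlsx files.

```python
import io

import pandas as pd

# Hypothetical CSV content used only for illustration;
# in practice, pass a file path to pd.read_csv instead.
csv_text = "height,weight\n1.5,50\n1.8,80\n1.7,65\n"

# Load the CSV into a Pandas DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Convert the DataFrame into a NumPy array
data = df.to_numpy()

print(data.shape)   # (3, 2)
print(data.dtype)   # float64
print(data.nbytes)  # 48 bytes held in memory by the array
```

The nbytes attribute is a quick way to check how much memory an array occupies, which helps when deciding whether a dataset is small enough for this method.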

 

Load Data using TensorFlow Data Pipeline

TensorFlow has a built-in API (tf.data) that helps you load the data, apply operations to it, and feed it to the machine learning algorithm. This method works best when the dataset is too large to fit into memory. Image datasets, for instance, are often huge. If you have a dataset of 50 gigabytes and your computer has only 16 gigabytes of memory, trying to load everything at once will exhaust the memory and the program will likely crash. The data pipeline avoids this by managing memory itself.

In these circumstances, you need to build a TensorFlow pipeline. The pipeline loads the data in batches, or small chunks. Each batch is pushed through the pipeline and made ready for training. Building a pipeline is an excellent solution because it lets you use parallel computing: TensorFlow can spread the work across multiple CPU cores. This speeds up computation and makes it practical to train powerful neural networks.
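To make the idea of batches concrete, here is a small sketch. It assumes TensorFlow 2.4 or later, where datasets can be iterated directly and tf.data.AUTOTUNE is available, so it differs from the TensorFlow 1.x style used in the steps below. The toy array and the scaling step are illustrative stand-ins for a large dataset and a real preprocessing function.

```python
import numpy as np
import tensorflow as tf

# Ten samples with two features each (toy data standing in for a large dataset)
features = np.arange(20, dtype=np.float32).reshape(10, 2)

# Assumes TensorFlow 2.4+ (tf.data.AUTOTUNE, eager iteration over datasets)
dataset = (
    tf.data.Dataset.from_tensor_slices(features)
    # Illustrative preprocessing step, run on several elements in parallel
    .map(lambda row: row / 10.0, num_parallel_calls=tf.data.AUTOTUNE)
    # Group the samples into batches of 4
    .batch(4)
    # Prepare the next batch while the current one is being consumed
    .prefetch(tf.data.AUTOTUNE)
)

batch_shapes = []
for batch in dataset:
    batch_shapes.append(batch.numpy().shape)

print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

Only one batch needs to be in memory at a time. The last batch is smaller because ten samples do not divide evenly into batches of four.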

Steps to create a TensorFlow Data Pipeline:

  1. Create the Data:
    import numpy as np  
    import tensorflow as tf  
    x_input = np.random.sample((1,2))  
    print(x_input)

    In the above code, we generate two random numbers with NumPy's random number generator. The result is an array of shape (1, 2).

  2. Create the Placeholder
    x = tf.placeholder(tf.float32, shape=[1,2], name = 'X')

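    We create a placeholder with tf.placeholder(). It has the same shape as the data and receives its values when the session runs. Note that tf.placeholder() belongs to the TensorFlow 1.x API. In TensorFlow 2.x it is available only as tf.compat.v1.placeholder(), with eager execution disabled.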

  3. Define the Dataset Method
    dataset = tf.data.Dataset.from_tensor_slices(x)

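    We build the dataset from the placeholder with tf.data.Dataset.from_tensor_slices(). This method slices its input along the first dimension, so each element of the dataset is one row.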

  4. Create the Pipeline
    iterator = dataset.make_initializable_iterator()   
    get_next = iterator.get_next()

    In the above code, we set up the pipeline through which the data will flow. We create an iterator with make_initializable_iterator() and name it iterator. We then call iterator.get_next() to obtain the operation that supplies the next batch of data, and name it get_next. Note that in this example there is only one batch of data, containing only two values.

  5. Execute the Operation
    with tf.Session() as sess:  
        # feed the placeholder with data  
        sess.run(iterator.initializer, feed_dict={ x: x_input })   
        print(sess.run(get_next)) # output [ 0.52374458  0.71968478]

    In the above code, we start a session and run the iterator's initializer, passing the values generated by NumPy through feed_dict. These two values populate the placeholder x. Then we run get_next to print the result.

 

Source Code

import numpy as np  
import tensorflow as tf  
x_input = np.random.sample((1,2))  
print(x_input)  
# using a placeholder  
x = tf.placeholder(tf.float32, shape=[1,2], name = 'X')  
dataset = tf.data.Dataset.from_tensor_slices(x)  
iterator = dataset.make_initializable_iterator()   
get_next = iterator.get_next()  
with tf.Session() as sess:  
    # feed the placeholder with data  
    sess.run(iterator.initializer, feed_dict={ x: x_input })   
    print(sess.run(get_next))

 

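The output of the above code will be similar to the following. The values are random, so they change on every run. The second line differs slightly from the first because the placeholder stores the values as float32:

[[0.87908525 0.80727791]]
[0.87908524 0.8072779 ]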
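The code above relies on tf.placeholder and tf.Session, which are TensorFlow 1.x APIs. For readers on TensorFlow 2.x, the following is a rough equivalent, offered as a sketch rather than a drop-in replacement. It assumes eager execution (the TensorFlow 2.x default), so no placeholder, iterator initializer, or session is needed. The name rows is introduced here for illustration.

```python
import numpy as np
import tensorflow as tf

# Same data as before, cast to float32 to match the original placeholder dtype
x_input = np.random.sample((1, 2)).astype(np.float32)
print(x_input)

# Assumes TensorFlow 2.x with eager execution:
# the dataset is built directly from the array, without a placeholder
dataset = tf.data.Dataset.from_tensor_slices(x_input)

# Iterating over the dataset yields one row at a time
rows = [row.numpy() for row in dataset]
print(rows[0])
```

As in the original, the dataset contains a single element holding the two random values.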