Data Analysis in Python

Reading data is the first step of any analysis, so it is important to get it right. There are multiple ways of reading data in Python, but we will explore two ways of reading a dataset and converting it into a dataframe.

What is a Dataframe?

A dataframe is a tabular representation of data in rows and columns.

E.g.: Consider an office where employee data such as joining date, salary, employee ID and mail ID are saved in a table. This table has the form of a dataframe.

Emp ID   Name    Joining Date   Salary    Mail ID
123      Blob    2001-03-12     500000    bl.ob@company.com
124      crash   2000-03-22     700000    cr.ash@company.com
125      Max     2000-01-12     1000000   ma.x@company.com

To analyse data like this, you first need to read it correctly and then place it in a dataframe so that you can work with it.
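As a minimal sketch of what such a dataframe looks like in code, the employee table above can be built directly with pandas. The column names (emp_id, joining_date and so on) are my own choice for illustration and are not fixed by anything in the table.

```python
import pandas as pd

# Rows copied from the employee table above; column names are illustrative
employees = pd.DataFrame(
    {
        "emp_id": [123, 124, 125],
        "name": ["Blob", "crash", "Max"],
        "joining_date": pd.to_datetime(["2001-03-12", "2000-03-22", "2000-01-12"]),
        "salary": [500000, 700000, 1000000],
        "mail_id": ["bl.ob@company.com", "cr.ash@company.com", "ma.x@company.com"],
    }
)

# Each column can now be queried like a named series
earliest = employees.loc[employees["joining_date"].idxmin(), "name"]
print(employees.shape)   # (rows, columns)
print(earliest)          # name of the employee who joined first
```

Storing the joining date as a real datetime rather than a string is what makes comparisons such as "who joined first" work without extra parsing.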

How to read data in Python

Method 1: Using the ‘csv’ library

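import csv

# Convert a comma-separated text file into a CSV file with a header row
with open('log.txt', 'r') as in_file:
    # Strip surrounding whitespace from each line and skip blank lines
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    # newline='' stops the csv module from writing extra blank lines on Windows
    with open('log.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))
        writer.writerows(lines)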
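The snippet above writes a CSV file. To read one back with the same library, one option is csv.DictReader, sketched below. The sample text is an in-memory stand-in that I made up to match the ‘title’ and ‘intro’ header; with a real file you would pass open('log.csv', newline='') instead.

```python
import csv
import io

import pandas as pd

# In-memory stand-in for a real CSV file (contents are made up for illustration)
sample = io.StringIO("title,intro\nfirst,hello\nsecond,world\n")

reader = csv.DictReader(sample)  # the first row supplies the column names
rows = list(reader)              # each row becomes a dict of strings

# A list of dicts converts directly into a dataframe
df = pd.DataFrame(rows)
print(df)
```

Note that the csv library returns every value as a string, so numeric columns need converting afterwards.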

Method 2: Using the ‘pandas’ library

Reading a ‘.csv’ file

import pandas as pd

df = pd.read_csv('FILE_PATH.csv')
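By default read_csv infers column types, but date columns are left as plain strings unless you ask for them to be parsed. Here is a small sketch using the parse_dates parameter. The CSV text is an in-memory sample based on the employee table, with column names I chose for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for 'FILE_PATH.csv' (sample rows, illustrative column names)
csv_text = """emp_id,name,joining_date,salary
123,Blob,2001-03-12,500000
124,crash,2000-03-22,700000
"""

# parse_dates converts the listed columns to datetime while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["joining_date"])
print(df.dtypes)
```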

Reading an Excel file

Using pandas to read Excel

import pandas as pd

# Read one named sheet from the workbook into a dataframe
df = pd.read_excel('tmp.xlsx', sheet_name='Sheet3')

OR use the ‘xlrd’ library. Note that xlrd 2.0 and later reads only the legacy ‘.xls’ format, not ‘.xlsx’.

import xlrd

# Give the location of the file
loc = ("path of file")

# To open Workbook
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

# For row 0 and column 0
print(sheet.cell_value(0, 0))

When the data is too large, or to reduce memory usage

a. You can read the file in smaller pieces by passing a value to the ‘chunksize’ parameter of the ‘read_csv’ function.

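import pandas as pd

def chunk_preprocessing(chunk):
    # Placeholder: replace with your own filtering logic
    return chunk

# With chunksize set, read_csv returns an iterator of dataframes, not one dataframe.
# 10000 rows per chunk is an example value; tune it to your memory budget.
df_chunk = pd.read_csv('FILE_PATH.csv', chunksize=10000)

chunk_list = []  # append each filtered chunk here

# Each chunk is a dataframe
for chunk in df_chunk:
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

# concatenate the list into a single dataframe
df_concat = pd.concat(chunk_list)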

Once you have read all the chunks, concatenate them to use the dataset. Note that df_chunk is not a dataframe but an iterator that yields one dataframe per chunk as you loop over it.
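To see the chunked pattern run end to end, here is a small self-contained sketch. The data is generated in memory, and the filter (keep rows with an even id) and the helper name keep_even_ids are made up purely for illustration.

```python
import io

import pandas as pd

# Ten rows of made-up data standing in for a large file
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

def keep_even_ids(chunk):
    # Hypothetical filter: keep only rows whose id is even
    return chunk[chunk["id"] % 2 == 0]

# chunksize=4 yields dataframes of up to 4 rows each
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
filtered = [keep_even_ids(chunk) for chunk in reader]

# ignore_index=True renumbers the rows of the combined dataframe from 0
result = pd.concat(filtered, ignore_index=True)
print(result)
```

Because each chunk is filtered before it is stored, only the rows you keep are held in memory at the end, which is where the saving comes from.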

OR

b. You can read only the first n rows of the file and use them for an initial analysis by passing a value to the ‘nrows’ parameter.

import pandas as pd

df = pd.read_csv('FILE_PATH.csv', nrows = 9999)
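A quick way to check what nrows does is to try it on a small in-memory sample. The 100 rows below are generated for illustration only.

```python
import io

import pandas as pd

# 100 rows of made-up data standing in for a large file
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(100))

# Only the first 5 data rows are parsed; the header row is not counted
preview = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(preview)
```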

These methods cover most of what you will need to read your data into a dataframe.

Hope this helps.

~P