Data Analysis in Python

Reading data is the first step that is extremely important to understand. There are multiple way of reading data in python but we will explore 2 ways of reading dataset and converting it into a dataframe.

What is a Dataframe?

a Dataframe is a tabular representation of data in rows and columns.

Eg: Consider an office where the employees’ data like joining date, salary, employee id, mail id etc are saved in a table, this table is of the form Dataframe.

emp ID	Name	joining Date	Salary	mail id
123	Blob	2001-03-12	500000	bl.ob@company.com
124	crash	2000-03-22	700000	cr.ash@company.com
125	Max	2000-01-12	1000000	ma.x@company.com

Once you have this data, to analyse this data you need to be able to read the data correctly and then place it in a data-frame to be able to use it.

How to read data in Python

Method1: Using ‘csv’ library

import csv
​
with open('log.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('log.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))
        writer.writerows(lines)

Method2: Using ‘pandas’ library

Reading ‘.csv’

import pandas as pd

df = pd.read_csv('FILE_PATH.csv')

Reading excel file

Using pandas to read excel

import pandas as pd
pd.read_excel

df = pd.read_excel(open('tmp.xlsx', 'rb), sheet_name='Sheet3')  

OR Use a ‘xlrd’ library

import xlrd

# Give the location of the file
loc = ("path of file")

# To open Workbook
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

# For row 0 and column 0
print(sheet.cell_value(0, 0))

When data is too huge or reduce memory usage

a. You can read only a limited dataset by providing value to the ‘chunksize’ parameter provided in ‘read_csv’ function.

import pandas as pd

df = pd.read_csv('FILE_PATH.csv', chunksize = 1)
df = pd.DataFrame(df)

chunk_list = []  # append each chunk df here

# Each chunk is in df format
for chunk in df_chunk:  
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk_filter)

# concat the list into data-frame
df_concat = pd.concat(chunk_list)

Once you have read all the chunks, merge them together to use the dataset. Note that df_chunk is not a data-frame but an object for further operation in the next step

b. You can read first n lines from the dataframe and use them for an initial analysis , by passing value to the ‘nrows’ parameter

import pandas as pd

df = pd.read_csv('FILE_PATH.csv', nrows = 9999)

You will only need these majorly to read your data into a data-frame.

Hope this helps.

Share on

Twitter Facebook LinkedIn

Data Analysis: 1.1 Reading data using pandas

Data Analysis in Python

What is a Dataframe?

How to read data in Python

Method1: Using ‘csv’ library

Method2: Using ‘pandas’ library

Reading ‘.csv’

Reading excel file

When data is too huge or reduce memory usage

Share on

You may also enjoy

How Lewis Hamilton and Narayan from Mensa Brands Changed my perspective towards what i was doing

Wanting everything at once- From happiness to Frustration

The 32nd Birthday

Run Forrest Run