A Brief Look at array, reshape, and DataFrame

array

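NumPy's core data structure is an n-dimensional array object called ndarray. NumPy arrays are generally homogeneous (with the exception of one special array type, which is heterogeneous), meaning that all elements in an array must have the same type.^[1]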
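To make the homogeneity rule concrete, here is a minimal sketch. The variable names are made up for the illustration. It shows that when a list mixes integers and a float, NumPy picks one common element type for the whole array.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts every element to one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed)         # [1.  2.  3.5]
print(mixed.dtype)   # float64

# An all-integer list keeps an integer dtype
# (the exact width, e.g. int64 or int32, depends on the platform).
ints = np.array([1, 2, 3])
print(ints.dtype)
```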

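Advantages:

Contiguous memory layout: because every element of an ndarray has the same type, the elements can be stored in one contiguous block of memory. In scientific computing this lets an ndarray replace many explicit loops, so the code is much simpler than with a native Python list.
ndarray supports vectorized operations, where one expression is applied to a whole array at once.
NumPy's core is written in C, and many of its internal operations release the GIL (global interpreter lock). Array operations therefore run in compiled code instead of the Python interpreter loop, which makes them far faster than equivalent pure-Python code.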
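The points above can be sketched with a small comparison. This is only an illustration with made-up data; it shows the difference in how the code is written, not a benchmark.

```python
import numpy as np

values = list(range(1, 6))

# Pure Python: an explicit loop (here a list comprehension) visits every element.
squares_list = [v * v for v in values]

# NumPy: one vectorized expression; the element loop runs inside compiled code.
arr = np.array(values)
squares_arr = arr * arr

print(squares_list)  # [1, 4, 9, 16, 25]
print(squares_arr)   # [ 1  4  9 16 25]
```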

import numpy as np
x = np.array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])  # 1-D array with 16 elements, shape (16,)

y = np.array([[1,2,3,4],[8,6,9,5]])  # 2 rows, 4 columns

print("x elements: {}".format(x.shape[0]))
print("x shape: {}".format(x.shape))
print("y rows: {}".format(y.shape[0]))
print("y columns: {}".format(y.shape[1]))

# out:
# x elements: 16
# x shape: (16,)
# y rows: 2
# y columns: 4

reshape()

Official documentation

numpy.reshape(a, newshape, order='C')

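a: array. The data to be reshaped.^[3]
newshape: the new shape, given as an integer or a tuple of integers; for example, (2,3) means 2 rows and 3 columns. The new shape must be compatible with the original one, i.e. the product of its dimensions (rows times columns in the 2-D case) must equal the number of elements in a.
order: optional, one of {'C', 'F', 'A'}. The elements of a are read in this index order and placed into the reshaped array in the same index order. If order is not set, it defaults to 'C'.

(1) 'C' means reading and writing the elements in C-like index order: the last axis changes fastest and the first axis changes slowest. For a 2-D array, this simply means reading row by row and writing row by row.

(2) 'F' means reading and writing the elements in Fortran-like index order: the first axis changes fastest and the last axis changes slowest. For a 2-D array, this means reading column by column and writing column by column. Note that 'C' and 'F' do not take the memory layout of the underlying array into account; they refer only to the order of indexing.

(3) With 'A', the result depends on how the original array a is stored in memory: if the data is stored in Fortran order, the result is the same as with 'F'; otherwise it is the same as with 'C'. This may sound a little vague, so an example is given below.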

import numpy as np
x = np.array([[1,2,3,4],[82,63,91,52],[121,345,567,987]])

x1 = x.reshape((2,6),order='C')  # read row by row, write row by row
x2 = x.reshape((2,6),order='F')  # read column by column, write column by column
x3 = x.reshape((2,6),order='A')  # like 'F' if the original array is stored in Fortran order, otherwise like 'C'

print("x:\n{}\n".format(x))
print("x1:\n{}\n".format(x1))
print("x2:\n{}\n".format(x2))
print("x3:\n{}\n".format(x3))

# out:
# x:
# [[  1   2   3   4]
#  [ 82  63  91  52]
#  [121 345 567 987]]

# x1:
# [[  1   2   3   4  82  63]
#  [ 91  52 121 345 567 987]]

# x2:
# [[  1 121  63   3 567  52]
#  [ 82   2 345  91   4 987]]

# x3:
# [[  1   2   3   4  82  63]
#  [ 91  52 121 345 567 987]]
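In the output above, x3 equals x1 because x was created with the default C (row-major) memory layout. The following sketch shows the other branch of order='A'. It assumes np.asfortranarray is used to obtain a Fortran-ordered copy of the same data; the names xf, a_c and a_f are made up for the illustration.

```python
import numpy as np

x = np.array([[1, 2, 3, 4], [82, 63, 91, 52], [121, 345, 567, 987]])
xf = np.asfortranarray(x)   # same values, but column-major memory layout

a_c = x.reshape((2, 6), order='A')    # x is C-contiguous, so 'A' acts like 'C'
a_f = xf.reshape((2, 6), order='A')   # xf is Fortran-contiguous, so 'A' acts like 'F'

print(xf.flags['F_CONTIGUOUS'])                              # True
print(np.array_equal(a_c, x.reshape((2, 6), order='C')))     # True
print(np.array_equal(a_f, x.reshape((2, 6), order='F')))     # True
```

Note that x and xf compare equal element by element; only their memory layout differs, and 'A' is the only one of the three options that looks at it.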

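When -1 appears in the shape, there are two cases:

reshape(-1): if the original array has n elements, this returns a 1-D array with n elements, shape (n,).
reshape(-1, n): n is the number of columns and can be any positive integer that divides the element count. The -1 tells NumPy to compute the number of rows automatically from the column count, and the array is then rearranged into that new shape.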

x = np.array([[1,2,3,4],[82,63,91,52],[121,345,567,987]])
y = x.reshape(-1)

print("x:\n{}\n".format(x))
print("y:\n{}\n".format(y))  

# out:
# x:
# [[  1   2   3   4]
#  [ 82  63  91  52]
#  [121 345 567 987]]

# y:
# [  1   2   3   4  82  63  91  52 121 345 567 987]

x = np.array([[1,2,3,4],[82,63,91,52],[121,345,567,987]])
y = x.reshape(-1,2)

print("x:\n{}\n".format(x))
print("y:\n{}\n".format(y))  

# out:
# x:
# [[  1   2   3   4]
#  [ 82  63  91  52]
#  [121 345 567 987]]

# y:
# [[  1   2]
#  [  3   4]
#  [ 82  63]
#  [ 91  52]
#  [121 345]
#  [567 987]]
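As a small sketch of how the -1 placeholder is resolved, the example below uses a made-up 12-element array. It also shows what happens when no valid size exists for the inferred dimension.

```python
import numpy as np

x = np.arange(12)   # 12 elements: 0, 1, ..., 11

print(x.reshape(-1, 3).shape)   # (4, 3): 12 / 3 = 4 rows
print(x.reshape(2, -1).shape)   # (2, 6): -1 can stand for the column count too

# 12 is not divisible by 5, so no row count fits and NumPy raises ValueError.
error_message = None
try:
    x.reshape(-1, 5)
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```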

# This is 2-D data with 6 rows and 1 column, shape (6, 1)
[[ 0.08540663]
 [ 1.85038409]
 [-2.41396732]
 [ 1.39196365]
 [-0.35908504]
 [ 0.64526911]]

# This is 1-D data with 6 elements, shape (6,)
[ 0.08540663  1.85038409 -2.41396732  1.39196365 -0.35908504  0.64526911]

To turn the 2-D data above into 1-D, use reshape(-1); to turn the 1-D data into 2-D, use reshape(-1,1).

a = np.array([[ 0.08540663],[ 1.85038409],[-2.41396732],[ 1.39196365],[-0.35908504],[ 0.64526911]]) # a is 2-D
b = a.reshape(-1)  # b is 1-D
c = b.reshape(-1,1) # c is 2-D

print("a shape: {}\n".format(a.shape))
print("b: {}".format(b))
print("b shape: {}\n".format(b.shape))
print("c: {}".format(c))
print("c shape: {}".format(c.shape))

# out:
# a shape: (6, 1)

# b: [ 0.08540663  1.85038409 -2.41396732  1.39196365 -0.35908504  0.64526911]
# b shape: (6,)

# c: [[ 0.08540663]
#  [ 1.85038409]
#  [-2.41396732]
#  [ 1.39196365]
#  [-0.35908504]
#  [ 0.64526911]]
# c shape: (6, 1)
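One detail worth keeping in mind with these conversions is that reshape returns a view of the same memory whenever it can, rather than a copy. The sketch below uses made-up values and assumes a freshly created, contiguous array, which is the case where a view is returned.

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
b = a.reshape(-1)                     # shape (3,)

print(np.shares_memory(a, b))   # True: b is a view on a's data

b[0] = 99.0
print(a[0, 0])                  # 99.0, the change is visible through a

c = a.reshape(-1).copy()        # an explicit copy is independent of a
c[1] = -1.0
print(a[1, 0])                  # 2.0, a is unchanged
```

If the reshaped array will be modified and the original must stay intact, calling .copy() as above is a simple way to be safe.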

dataframe

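Pandas has two main data structures, Series and DataFrame; note the capitalization.^[2]

A Series is similar to a one-dimensional array and is close to a NumPy array. It consists of a set of data and a matching set of data labels, and the labels serve as the index.

A Series is a one-dimensional data structure, while a DataFrame is a tabular one. A DataFrame contains multiple columns, and each column can have its own data type. We can think of a DataFrame as a dictionary of Series that share one index; it has both a row index and a column index.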
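The description above can be sketched as follows. The data, labels and column names are hypothetical and chosen only for the illustration.

```python
import pandas as pd

# A Series: values plus labels that act as the index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])   # 20

# A DataFrame: several columns, each with its own dtype.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 25, 40],
    "score": [88.5, 92.0, 79.5],
})
print(df.dtypes)            # one dtype per column

# Selecting one column gives back a Series, as in a dictionary of Series.
col = df["age"]
print(type(col).__name__)   # Series

print(list(df.index))       # row index: [0, 1, 2]
print(list(df.columns))     # column index: ['name', 'age', 'score']
```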

Converting a DataFrame to an array:

df = df.values

Converting an array to a DataFrame:

import pandas as pd
df = pd.DataFrame(df)

import numpy as np
import pandas as pd
a = np.array([[1,2],[1,2]])
b = pd.DataFrame(a)
c = b.values

print(type(a))
print(type(b))
print(type(c))

# out:
# <class 'numpy.ndarray'>
# <class 'pandas.core.frame.DataFrame'>
# <class 'numpy.ndarray'>
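As a complement to .values, pandas also provides DataFrame.to_numpy(), which the pandas documentation recommends for this conversion. The sketch below uses hypothetical column names and data. It also shows that a DataFrame with mixed column types is converted to a single common dtype.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])

# The column names "x" and "y" are hypothetical labels for this example.
df = pd.DataFrame(a, columns=["x", "y"])

arr = df.to_numpy()
print(type(arr).__name__)       # ndarray
print(np.array_equal(arr, a))   # True: the round trip preserves the values

# Columns of different types collapse to one common dtype in the array.
mixed = pd.DataFrame({"n": [1, 2], "label": ["a", "b"]})
print(mixed.to_numpy().dtype)   # object
```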

Jay's Blog