Handling missing values

In [87]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

In [88]:
%matplotlib inline

In [3]:
# Create a DataFrame that contains missing values
df1 = DataFrame([[3, 5, 3], [1, 6, np.nan], ['lili', np.nan, 'pop'], [np.nan, 'a', 'b']])
df1

Out[3]:
      0    1    2
0     3    5    3
1     1    6  NaN
2  lili  NaN  pop
3   NaN    a    b

Finding missing values: df.isnull(), df.notnull(), df.info()

In [4]:
df1.isnull()  # True marks a missing value

Out[4]:
       0      1      2
0  False  False  False
1  False  False   True
2  False   True  False
3   True  False  False

In [5]:
df1.notnull()  # False marks a missing value

Out[5]:
       0      1      2
0   True   True   True
1   True   True  False
2   True  False   True
3  False   True   True

In [6]:
# Number of missing values in each column; summing once more gives the
# total for the whole DataFrame
df1.isnull().sum()

Out[6]:
0    1
1    1
2    1
dtype: int64

In [7]:
df1.isnull().sum().sum()

Out[7]:
3

In [9]:
df1.isnull().values.any()

Out[9]:
True

In [10]:
df1.info()  # info() also shows how many non-null values each column holds

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
0    3 non-null object
1    3 non-null object
2    3 non-null object
dtypes: object(3)
memory usage: 176.0+ bytes

Dropping missing values: df.dropna(), df.dropna(how='all')

In [11]:
# dropna() removes every row that contains a missing value (the whole row goes);
# with how='all' only rows that are entirely NaN are removed
df1.dropna()

Out[11]:
   0  1  2
0  3  5  3

In [17]:
df2 = DataFrame(np.arange(12).reshape(3, 4))
df2

Out[17]:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [19]:
df2.loc[2, :] = np.nan  # index 2, i.e. the third row: every column becomes missing
df2[3] = np.nan         # every value in column 3 becomes missing
df2

Out[19]:
     0    1    2   3
0  0.0  1.0  2.0 NaN
1  4.0  5.0  6.0 NaN
2  NaN  NaN  NaN NaN

In [20]:
df2.dropna(how='all')  # how='all' removes only the rows that are entirely NaN

Out[20]:
     0    1    2   3
0  0.0  1.0  2.0 NaN
1  4.0  5.0  6.0 NaN

In [21]:
# axis=1 works along columns: columns that are entirely missing are removed.
# The default is to drop rows.
df2.dropna(how='all', axis=1)

Out[21]:
     0    1    2
0  0.0  1.0  2.0
1  4.0  5.0  6.0
2  NaN  NaN  NaN

Filling missing values: df.fillna(0)

In [22]:
df2

Out[22]:
     0    1    2   3
0  0.0  1.0  2.0 NaN
1  4.0  5.0  6.0 NaN
2  NaN  NaN  NaN NaN

In [24]:
df2.fillna(0)  # fillna() replaces missing values with a constant

Out[24]:
     0    1    2    3
0  0.0  1.0  2.0  0.0
1  4.0  5.0  6.0  0.0
2  0.0  0.0  0.0  0.0

In [25]:
# Fill the missing values of column 1 with 6 and those of column 3 with 0.
# Passing a dict lets each column get its own fill value. fillna() returns a
# new object and leaves the original data unchanged.
df2.fillna({1: 6, 3: 0})

Out[25]:
     0    1    2    3
0  0.0  1.0  2.0  0.0
1  4.0  5.0  6.0  0.0
2  NaN  6.0  NaN  0.0

In [26]:
df2

Out[26]:
     0    1    2   3
0  0.0  1.0  2.0 NaN
1  4.0  5.0  6.0 NaN
2  NaN  NaN  NaN NaN

In [27]:
df2.fillna({1: 6, 3: 0}, inplace=True)  # inplace=True modifies df2 itself
df2

Out[27]:
     0    1    2    3
0  0.0  1.0  2.0  0.0
1  4.0  5.0  6.0  0.0
2  NaN  6.0  NaN  0.0

Notes on df.fillna():

1. Purpose: df.fillna() fills the missing values of a DataFrame with the values you specify.
2. Basic syntax: df.fillna(a, inplace=False), where a is a constant or a dict. If a is a constant, every missing value is filled with it. If a is a dict, the missing values in column key are filled with the value stored under that key; for example, df.fillna({0: 10, 1: 20}) fills column 0 with 10 and column 1 with 20. inplace is optional and defaults to False, which leaves the original object untouched; with inplace=True the original object is modified directly.
3. Return value: None when inplace=True; otherwise the filled data.
4. Further usage:
4.1 The method parameter: (1) method='ffill' or 'pad' fills each missing value with the previous non-missing value, as in df.fillna(method='ffill'); (2) method='bfill' or 'backfill' fills each missing value with the next non-missing value, as in df.fillna(method='bfill').
4.2 The limit and axis parameters: axis chooses the direction of the fill. With axis=0 (the default) values are propagated down each column; with axis=1 they are propagated across each row. limit caps how many consecutive missing values are filled in each gap. For example, df.fillna(method='ffill', limit=1, axis=1) fills a missing value with the nearest non-missing value to its left in the same row, and in each run of consecutive NaNs only the first one is filled.

In [28]:
df2.fillna(method='ffill')  # 'ffill' or 'pad': fill with the previous non-missing value

Out[28]:
     0    1    2    3
0  0.0  1.0  2.0  0.0
1  4.0  5.0  6.0  0.0
2  4.0  6.0  6.0  0.0

In [29]:
df2

Out[29]:
     0    1    2    3
0  0.0  1.0  2.0  0.0
1  4.0  5.0  6.0  0.0
2  NaN  6.0  NaN  0.0

In [31]:
df2[0] = df2[0].fillna(df2[0].mean())  # fill the missing values of column 0 with the column mean
df2

Out[31]:
     0    1    2    3
0  0.0  1.0  2.0  0.0
1  4.0  5.0  6.0  0.0
2  2.0  6.0  NaN  0.0

In [32]:
# The parameters of fillna() can be looked up with ?, which is also a good
# way to learn on your own
df2.fillna?

Removing duplicate data

In [33]:
data = {
    'name': ['张三', '李四', '张三', '小明'],
    'sex': ['female', 'male', 'female', 'male'],
    'year': [2001, 2002, 2001, 2002],
    'city': ['北京', '上海', '北京', '北京']
}
df1 = DataFrame(data)
df1

Out[33]:
  city name     sex  year
0   北京   张三  female  2001
1   上海   李四    male  2002
2   北京   张三  female  2001
3   北京   小明    male  2002

In [34]:
# Flags each row that duplicates an earlier one; a row counts as a duplicate
# only when every field matches
df1.duplicated()

Out[34]:
0    False
1    False
2     True
3    False
dtype: bool

In [35]:
# Drops the surplus duplicates; again, every field must match
df1.drop_duplicates()

Out[35]:
  city name     sex  year
0   北京   张三  female  2001
1   上海   李四    male  2002
3   北京   小明    male  2002

In [36]:
# Use only some columns to decide what is a duplicate; by default the first
# occurrence of each combination is kept
df1.drop_duplicates(['sex', 'year'])

Out[36]:
  city name     sex  year
0   北京   张三  female  2001
1   上海   李四    male  2002

In [39]:
df1.drop_duplicates(['sex', 'year'], keep='last')  # keep='last' keeps the last occurrence instead

Out[39]:
  city name     sex  year
2   北京   张三  female  2001
3   北京   小明    male  2002

Replacing values

In [41]:
data = {
    'name': ['张三', '李四', '王五', '小明'],
    'sex': ['female', 'male', '', 'male'],
    'year': [2001, 2003, 2001, 2002],
    'city': ['北京', '上海', '', '北京']
}
df1 = DataFrame(data)
df1

Out[41]:
  city name     sex  year
0   北京   张三  female  2001
1   上海   李四    male  2003
2        王五          2001
3   北京   小明    male  2002

In [42]:
df1.replace('', '不详')  # replace() swaps values: empty strings become '不详' ("unknown")

Out[42]:
  city name     sex  year
0   北京   张三  female  2001
1   上海   李四    male  2003
2   不详   王五      不详  2001
3   北京   小明    male  2002

In [43]:
# Several replacements given as lists: '' becomes '不详' and 2001 becomes 2002
df1.replace(['', 2001], ['不详', 2002])

Out[43]:
  city name     sex  year
0   北京   张三  female  2002
1   上海   李四    male  2003
2   不详   王五      不详  2002
3   北京   小明    male  2002

In [44]:
# The same replacements given as a dict
df1.replace({'': '不详', 2001: 2002})

Out[44]:
  city name     sex  year
0   北京   张三  female  2002
1   上海   李四    male  2003
2   不详   王五      不详  2002
3   北京   小明    male  2002

Transforming data with functions or mappings

In [45]:
data = {
    'name': ['张三', '李四', '王五', '小明'],
    'math': [79, 52, 63, 92]
}
df2 = DataFrame(data)
df2

Out[45]:
   math name
0    79   张三
1    52   李四
2    63   王五
3    92   小明

In [46]:
def f(x):  # grade a score
    if x >= 90:
        return '优秀'    # excellent
    elif 70 <= x < 90:
        return '良好'    # good
    elif 60 <= x < 70:
        return '合格'    # pass
    else:
        return '不合格'  # fail

Python's built-in map() takes a function and one or more iterables, map(function, iterable, ...). It applies the function to each element in turn and returns the results as a new iterable. The Series.map method used below works the same way on the elements of a column, but it is called on the Series and takes the function (or a mapping) as its argument.

In [48]:
df2['class'] = df2['math'].map(f)  # new column class: f applied to every element of math
df2

Out[48]:
   math name class
0    79   张三    良好
1    52   李四   不合格
2    63   王五    合格
3    92   小明    优秀

In [49]:
del df2['class']  # delete the class column
df2

Out[49]:
   math name
0    79   张三
1    52   李四
2    63   王五
3    92   小明

In [50]:
df2['class'] = df2['math'].apply(f)  # apply() runs f on each element as well
df2

Out[50]:
   math name class
0    79   张三    良好
1    52   李四   不合格
2    63   王五    合格
3    92   小明    优秀

Detecting outliers

In [2]:
df3 = DataFrame(np.arange(10), columns=['X'])
df3['Y'] = 2 * df3['X'] + 0.5
df3.iloc[9, 1] = 185  # plant one outlier in the last row
df3

Out[2]:
   X      Y
0  0    0.5
1  1    2.5
2  2    4.5
3  3    6.5
4  4    8.5
5  5   10.5
6  6   12.5
7  7   14.5
8  8   16.5
9  9  185.0

In [8]:
df3.plot(kind='scatter', x='X', y='Y')

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xa1dcef0>

Dummy variables
In mathematical modelling and machine learning, most algorithms accept only numeric input, so categorical variables have to be converted into dummy variables (a matrix of 0s and 1s). The get_dummies function does exactly that.

In [10]:
# 朝向 = orientation (东 east, 南 south, 西 west, 北 north); 价格 = price
df = DataFrame({
    '朝向': ['东', '南', '东', '西', '北'],
    '价格': [1200, 2100, 2300, 2900, 1400]
})
df

Out[10]:
     价格 朝向
0  1200   东
1  2100   南
2  2300   东
3  2900   西
4  1400   北

In [11]:
pd.get_dummies(df['朝向'])  # turn the 朝向 column into dummy variables (a 0/1 matrix)

Out[11]:
   东  北  南  西
0   1   0   0   0
1   0   0   1   0
2   1   0   0   0
3   0   0   0   1
4   0   1   0   0

In [12]:
df2 = DataFrame({
    '朝向': ['东/北', '西/南', '东', '西/北', '北'],
    '价格': [1200, 2100, 2300, 2900, 1400]
})
df2

Out[12]:
     价格   朝向
0  1200  东/北
1  2100  西/南
2  2300     东
3  2900  西/北
4  1400     北

In [16]:
# When one cell holds several categories, build the dummies with apply()
dummies = df2['朝向'].apply(lambda x: Series(x.split('/')).value_counts())
dummies

Out[16]:
    东    北    南    西
0  1.0  1.0  NaN  NaN
1  NaN  NaN  1.0  1.0
2  1.0  NaN  NaN  NaN
3  NaN  1.0  NaN  1.0
4  NaN  1.0  NaN  NaN

In [19]:
dummies = dummies.fillna(0).astype(int)
dummies

Out[19]:
   东  北  南  西
0   1   1   0   0
1   0   0   1   1
2   1   0   0   0
3   0   1   0   1
4   0   1   0   0

Merging and reshaping data

In [43]:
price = DataFrame({
    'fruit': ['apple', 'banana', 'orange'],
    'price': [23, 32, 45]
})
amount = DataFrame({
    'fruit': ['apple', 'banana', 'apple', 'apple', 'banana', 'pear'],
    'amount': [5, 3, 6, 3, 5, 7]
})

In [44]:
price

Out[44]:
    fruit  price
0   apple     23
1  banana     32
2  orange     45

In [45]:
amount

Out[45]:
   amount   fruit
0       5   apple
1       3  banana
2       6   apple
3       3   apple
4       5  banana
5       7    pear

Merging with merge

In [46]:
# merge() joins the rows of two DataFrames on one or more keys (columns)
pd.merge(amount, price)

Out[46]:
   amount   fruit  price
0       5   apple     23
1       6   apple     23
2       3   apple     23
3       3  banana     32
4       5  banana     32

In [47]:
pd.merge(amount, price, on='fruit')  # name the key explicitly

Out[47]:
   amount   fruit  price
0       5   apple     23
1       6   apple     23
2       3   apple     23
3       3  banana     32
4       5  banana     32

Common merge parameters

In [48]:
# merge() defaults to an inner join, which returns the intersection of the keys.
# The how parameter selects the join type: 'left', 'right' or 'outer'.
pd.merge(amount, price, left_on='fruit', right_on='fruit')

Out[48]:
   amount   fruit  price
0       5   apple     23
1       6   apple     23
2       3   apple     23
3       3  banana     32
4       5  banana     32

In [49]:
pd.merge(amount, price, how='left')

Out[49]:
   amount   fruit  price
0       5   apple   23.0
1       3  banana   32.0
2       6   apple   23.0
3       3   apple   23.0
4       5  banana   32.0
5       7    pear    NaN
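The four join types named above (inner, left, right, outer) can be compared side by side. The sketch below rebuilds the price and amount tables and counts the rows each join returns. It also passes indicator=True, a merge option that this walkthrough does not use, to show which side each row of the outer join came from. Treat it as an illustration rather than part of the original notebook.

```python
import pandas as pd

price = pd.DataFrame({"fruit": ["apple", "banana", "orange"],
                      "price": [23, 32, 45]})
amount = pd.DataFrame({"fruit": ["apple", "banana", "apple", "apple", "banana", "pear"],
                       "amount": [5, 3, 6, 3, 5, 7]})

# Number of rows produced by each join type on the shared key "fruit".
row_counts = {
    how: len(pd.merge(amount, price, on="fruit", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
tagged = pd.merge(amount, price, on="fruit", how="outer", indicator=True)
print(tagged[["fruit", "_merge"]])
```

Keys found on only one side (pear in amount, orange in price) are the rows that make the left, right and outer results larger than the inner one.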
In [50]:
pd.merge(amount, price, how='right')

Out[50]:
   amount   fruit  price
0     5.0   apple     23
1     6.0   apple     23
2     3.0   apple     23
3     3.0  banana     32
4     5.0  banana     32
5     NaN  orange     45

In [52]:
pd.merge(amount, price, how='outer')

Out[52]:
   amount   fruit  price
0     5.0   apple   23.0
1     6.0   apple   23.0
2     3.0   apple   23.0
3     3.0  banana   32.0
4     5.0  banana   32.0
5     7.0    pear    NaN
6     NaN  orange   45.0

In [53]:
price2 = DataFrame({
    'fruit': ['apple', 'banana', 'orange', 'apple'],
    'price': [23, 32, 45, 25]
})
amount2 = DataFrame({
    'fruit': ['apple', 'banana', 'apple', 'apple', 'banana', 'pear'],
    'amount': [5, 3, 6, 3, 5, 7]
})

In [54]:
amount2

Out[54]:
   amount   fruit
0       5   apple
1       3  banana
2       6   apple
3       3   apple
4       5  banana
5       7    pear

In [55]:
price2

Out[55]:
    fruit  price
0   apple     23
1  banana     32
2  orange     45
3   apple     25

In [57]:
# Many-to-many merge: each apple row in amount2 pairs with both apple rows in price2
pd.merge(amount2, price2)

Out[57]:
   amount   fruit  price
0       5   apple     23
1       5   apple     25
2       6   apple     23
3       6   apple     25
4       3   apple     23
5       3   apple     25
6       3  banana     32
7       5  banana     32

In [59]:
left = DataFrame({
    'key1': ['one', 'one', 'two'],
    'key2': ['a', 'b', 'a'],
    'val1': [2, 3, 4]
})
right = DataFrame({
    'key1': ['one', 'one', 'two', 'two'],
    'key2': ['a', 'a', 'a', 'b'],
    'val2': [5, 6, 7, 8]
})

In [60]:
left

Out[60]:
  key1 key2  val1
0  one    a     2
1  one    b     3
2  two    a     4

In [61]:
right

Out[61]:
  key1 key2  val2
0  one    a     5
1  one    a     6
2  two    a     7
3  two    b     8

In [62]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')  # to merge on several keys, pass a list

Out[62]:
  key1 key2  val1  val2
0  one    a   2.0   5.0
1  one    a   2.0   6.0
2  one    b   3.0   NaN
3  two    a   4.0   7.0
4  two    b   NaN   8.0

In [63]:
pd.merge(left, right, on='key1')

Out[63]:
  key1 key2_x  val1 key2_y  val2
0  one      a     2      a     5
1  one      a     2      a     6
2  one      b     3      a     5
3  one      b     3      a     6
4  two      a     4      a     7
5  two      a     4      b     8

In [64]:
# The suffixes parameter renames column names that occur on both sides
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Out[64]:
  key1 key2_left  val1 key2_right  val2
0  one         a     2          a     5
1  one         a     2          a     6
2  one         b     3          a     5
3  one         b     3          a     6
4  two         a     4          a     7
5  two         a     4          b     8

In [66]:
left2 = DataFrame({'key': ['a', 'a', 'b', 'b', 'c'], 'val1': range(5)})
right2 = DataFrame({'val2': [5, 7]}, index=['a', 'b'])

In [67]:
left2

Out[67]:
  key  val1
0   a     0
1   a     1
2   b     2
3   b     3
4   c     4

In [68]:
right2

Out[68]:
   val2
a     5
b     7

In [70]:
# When the join key lives in a DataFrame's row index, pass left_index=True or
# right_index=True to use that index as the key
pd.merge(left2, right2, left_on='key', right_index=True)

Out[70]:
  key  val1  val2
0   a     0     5
1   a     1     5
2   b     2     7
3   b     3     7

In [71]:
left3 = DataFrame({'val1': range(4)}, index=['a', 'b', 'a', 'c'])
right3 = DataFrame({'val2': [5, 7]}, index=['a', 'b'])

In [72]:
left3

Out[72]:
   val1
a     0
b     1
a     2
c     3

In [73]:
right3

Out[73]:
   val2
a     5
b     7

In [74]:
left3.join(right3, how='outer')  # join() is a quick way to merge on the index

Out[74]:
   val1  val2
a     0   5.0
a     2   5.0
b     1   7.0
c     3   NaN

Concatenating with concat

In [3]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3], index=['c', 'd'])
s3 = Series([4, 5], index=['e', 'f'])

In [4]:
# When the objects to combine share no join key, use pandas' concat()
pd.concat([s1, s2, s3])

Out[4]:
a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [5]:
# concat() works along axis=0 by default; specify the axis to concatenate column-wise
pd.concat([s1, s2, s3], axis=1)

Out[5]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  NaN  4.0
f  NaN  NaN  5.0

In [6]:
s4 = pd.concat([s1 * 10, s3])
s4

Out[6]:
a     0
b    10
e     4
f     5
dtype: int64

In [8]:
pd.concat([s1, s4], axis=1)

Out[8]:
     0   1
a  0.0   0
b  1.0  10
e  NaN   4
f  NaN   5

In [9]:
# concat() defaults to an outer join (union of the indexes); join='inner' gives an inner join
pd.concat([s1, s4], axis=1, join='inner')

Out[9]:
   0   1
a  0   0
b  1  10

In [14]:
# join_axes sets the index order of the result (this argument was removed in pandas 1.0)
pd.concat([s1, s4], axis=1, join='inner', join_axes=[['b', 'a']])

Out[14]:
   0   1
b  1  10
a  0   0

In [15]:
pd.concat([s1, s4])  # concat() offers only inner and outer joins

Out[15]:
a     0
b     1
a     0
b    10
e     4
f     5
dtype: int64

In [17]:
pd.concat([s1, s4], keys=['one', 'two'])  # keys builds a hierarchical index over the pieces

Out[17]:
one  a     0
     b     1
two  a     0
     b    10
     e     4
     f     5
dtype: int64

In [18]:
pd.concat([s1, s4], axis=1, keys=['one', 'two'])  # column-wise, the keys become the column labels

Out[18]:
   one  two
a  0.0    0
b  1.0   10
e  NaN    4
f  NaN    5

In [28]:
df1 = DataFrame({'val1': range(3)}, index=['a', 'b', 'c'])
df2 = DataFrame({'val2': [5, 7]}, index=['a', 'b'])

In [29]:
df1

Out[29]:
   val1
a     0
b     1
c     2

In [30]:
df2

Out[30]:
   val2
a     5
b     7

In [32]:
pd.concat([df1, df2], axis=1, keys=['one', 'two'])

Out[32]:
   one  two
  val1 val2
a    0  5.0
b    1  7.0
c    2  NaN

In [33]:
# A dict works too: its keys play the role of the keys argument
pd.concat({'one': df1, 'two': df2}, axis=1)

Out[33]:
   one  two
  val1 val2
a    0  5.0
b    1  7.0
c    2  NaN

In [34]:
df1 = DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = DataFrame(np.random.randn(2, 2), columns=['d', 'c'])
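The join_axes argument used earlier in this section was removed from pd.concat in pandas 1.0, so that cell raises a TypeError on current versions. A minimal sketch of one way to get the same result, assuming a newer pandas, is to concatenate first and then reorder the rows with reindex. The Series below are rebuilt from the definitions above so the snippet runs on its own.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s3 = pd.Series([4, 5], index=["e", "f"])
s4 = pd.concat([s1 * 10, s3])

# Inner join on the index, then put the rows in the wanted order explicitly.
combined = pd.concat([s1, s4], axis=1, join="inner").reindex(["b", "a"])
print(combined)
```

reindex also accepts labels that are absent from the result and fills those rows with NaN, so the label list should be checked when the exact row set matters.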
In [35]:
df1

Out[35]:
          a         b         c         d
0  0.023541  0.694903  0.515242  0.460737
1  1.326048  0.259269  0.685732  0.052237
2  0.110079  2.729854  0.503138  1.721161

In [36]:
df2

Out[36]:
          d         c
0  0.995995  0.342845
1  0.848536  1.027354

In [37]:
pd.concat([df1, df2])

Out[37]:
          a         b         c         d
0  0.023541  0.694903  0.515242  0.460737
1  1.326048  0.259269  0.685732  0.052237
2  0.110079  2.729854  0.503138  1.721161
0       NaN       NaN  0.342845  0.995995
1       NaN       NaN  1.027354  0.848536

In [38]:
# ignore_index=True discards the original indexes and renumbers the rows
pd.concat([df1, df2], ignore_index=True)

Out[38]:
          a         b         c         d
0  0.023541  0.694903  0.515242  0.460737
1  1.326048  0.259269  0.685732  0.052237
2  0.110079  2.729854  0.503138  1.721161
3       NaN       NaN  0.342845  0.995995
4       NaN       NaN  1.027354  0.848536

Merging with combine_first

In [39]:
df1 = DataFrame({'a': [3, np.nan, 6, np.nan], 'b': [np.nan, 4, 6, np.nan]})
df2 = DataFrame({'a': range(5), 'b': range(5)})

In [40]:
df1

Out[40]:
     a    b
0  3.0  NaN
1  NaN  4.0
2  6.0  6.0
3  NaN  NaN

In [41]:
df2

Out[41]:
   a  b
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

In [42]:
# When the two DataFrames share index labels, combine_first() fills the missing
# values of df1 with the values of df2 at the same position; rows that exist
# only in df2 are appended
df1.combine_first(df2)

Out[42]:
     a    b
0  3.0  0.0
1  1.0  4.0
2  6.0  6.0
3  3.0  3.0
4  4.0  4.0

Reshaping data

In [48]:
df = DataFrame(np.arange(9).reshape(3, 3),
               index=['a', 'b', 'c'],
               columns=['one', 'two', 'three'])
df.index.name = 'alph'
df.columns.name = 'number'
df

Out[48]:
number  one  two  three
alph
a         0    1      2
b         3    4      5
c         6    7      8

In [50]:
# stack() rotates the columns of a DataFrame into rows; by default reshaping
# operates on the innermost level
result = df.stack()
result

Out[50]:
alph  number
a     one       0
      two       1
      three     2
b     one       3
      two       4
      three     5
c     one       6
      two       7
      three     8
dtype: int32

In [51]:
# unstack() rotates rows back into columns, again on the innermost level by default
result.unstack()

Out[51]:
number  one  two  three
alph
a         0    1      2
b         3    4      5
c         6    7      8

In [52]:
result.unstack(0)  # unstack a specific level, given by position

Out[52]:
alph    a  b  c
number
one     0  3  6
two     1  4  7
three   2  5  8

In [53]:
result.unstack('alph')  # or given by name

Out[53]:
alph    a  b  c
number
one     0  3  6
two     1  4  7
three   2  5  8

In [54]:
df = DataFrame(np.arange(16).reshape(4, 4),
               index=[['one', 'one', 'two', 'two'], ['a', 'b', 'a', 'b']],
               columns=[['apple', 'apple', 'orange', 'orange'],
                        ['red', 'green', 'red', 'green']])
df

Out[54]:
      apple       orange
        red green    red green
one a     0     1      2     3
    b     4     5      6     7
two a     8     9     10    11
    b    12    13     14    15

In [55]:
df.stack()

Out[55]:
             apple  orange
one a green      1       3
      red        0       2
    b green      5       7
      red        4       6
two a green      9      11
      red        8      10
    b green     13      15
      red       12      14

In [56]:
df.unstack()

Out[56]:
    apple              orange
      red     green       red     green
        a   b     a   b     a   b     a   b
one     0   4     1   5     2   6     3   7
two     8  12     9  13    10  14    11  15

String processing

In [71]:
# Each entry is "name|sex" (男 male, 女 female)
data = {
    'data': ['张三|男', '李四|女', '王五|女', '小明|男'],
}
df = DataFrame(data)
df

Out[71]:
   data
0  张三|男
1  李四|女
2  王五|女
3  小明|男

In [67]:
# To split the data into two columns, the usual approach is to apply a function
result = df['data'].apply(lambda x: Series(x.split('|')))
result

Out[67]:
    0  1
0  张三  男
1  李四  女
2  王五  女
3  小明  男

In [81]:
# The str attribute of a pandas column gives direct access to string methods
new_df = df['data'].str.split('|')
new_df

Out[81]:
0    [张三, 男]
1    [李四, 女]
2    [王五, 女]
3    [小明, 男]
Name: data, dtype: object

In [82]:
df['name'] = new_df.str[0]  # str[...] also indexes into each list
df['sex'] = new_df.str[1]
df

Out[82]:
   data name sex
0  张三|男   张三   男
1  李四|女   李四   女
2  王五|女   王五   女
3  小明|男   小明   男

Regular expressions

In [83]:
df2 = DataFrame({'email': ['102345@qq.com', '342167@qq.com', '65132@qq.com']})
df2

Out[83]:
           email
0  102345@qq.com
1  342167@qq.com
2   65132@qq.com

In [84]:
df2['email'].str.findall('(.*?)@')  # capture everything before the @

Out[84]:
0    [102345]
1    [342167]
2     [65132]
Name: email, dtype: object

In [85]:
df2['QQ'] = df2['email'].str.findall('(.*?)@').str.get(0)
df2

Out[85]:
           email      QQ
0  102345@qq.com  102345
1  342167@qq.com  342167
2   65132@qq.com   65132

Worked example: the Iris dataset

In [101]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot plotting module
import matplotlib as mpl         # matplotlib plotting library
import seaborn as sns            # seaborn plotting library
%matplotlib inline

In [107]:
iris_data = pd.read_csv(open(r'H:\python数据分析\数据\iris-data.csv'))  # read the data
iris_data.head()

Out[107]:
   sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
0              5.1             3.5              1.4             0.2  Iris-setosa
1              4.9             3.0              1.4             0.2  Iris-setosa
2              4.7             3.2              1.3             0.2  Iris-setosa
3              4.6             3.1              1.5             0.2  Iris-setosa
4              5.0             3.6              1.4             0.2  Iris-setosa

Start with a quick description of the data to see whether it contains outliers.

In [108]:
iris_data.shape  # size of the data: (rows, columns)

Out[108]:
(150, 5)

In [110]:
iris_data.describe()

Out[110]:
       sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm
count       150.000000      150.000000       150.000000      145.000000
mean          5.644627        3.054667         3.758667        1.236552
std           1.312781        0.433123         1.764420        0.755058
min           0.055000        2.000000         1.000000        0.100000
25%           5.100000        2.800000         1.600000        0.400000
50%           5.700000        3.000000         4.350000        1.300000
75%           6.400000        3.300000         5.100000        1.800000
max           7.900000        4.400000         6.900000        2.500000

In [112]:
# unique() returns the distinct values of the column as an array, in order of
# first appearance (they are not sorted)
iris_data['class'].unique()

Out[112]:
array(['Iris-setosa', 'Iris-setossa', 'Iris-versicolor', 'versicolor',
       'Iris-virginica'], dtype=object)

In [115]:
# Fix the two mislabelled class names
iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
iris_data['class'].unique()

Out[115]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [118]:
sns.pairplot(iris_data, hue='class')  # scatter-plot matrix drawn with seaborn

(numpy's function_base.py, lines 748 and 749, emits two warnings here:)
RuntimeWarning: invalid value encountered in greater_equal
RuntimeWarning: invalid value encountered in less_equal

Out[118]:
<seaborn.axisgrid.PairGrid at 0x1025cac8>

In [119]:
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()

Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bfd5c0>

In [125]:
# Drop the Iris-setosa rows whose sepal width is below 2.5 cm
iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()

Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x112ce240>

In [126]:
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1.0)]

Out[126]:
    sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm            class
77            0.067             3.0              5.0             1.7  Iris-versicolor
78            0.060             2.9              4.5             1.5  Iris-versicolor
79            0.057             2.6              3.5             1.0  Iris-versicolor
80            0.055             2.4              3.8             1.1  Iris-versicolor
81            0.055             2.4              3.7             1.0  Iris-versicolor

In [127]:
# Scale these five sepal lengths by 100 so they line up with the other rows
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1.0), 'sepal_length_cm'] *= 100.0

In [128]:
iris_data.isnull().sum()

Out[128]:
sepal_length_cm    0
sepal_width_cm     0
petal_length_cm    0
petal_width_cm     5
class              0
dtype: int64

In [131]:
iris_data[iris_data['petal_width_cm'].isnull()]

Out[131]:
    sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
7               5.0             3.4              1.5             NaN  Iris-setosa
8               4.4             2.9              1.4             NaN  Iris-setosa
9               4.9             3.1              1.5             NaN  Iris-setosa
10              5.4             3.7              1.5             NaN  Iris-setosa
11              4.8             3.4              1.6             NaN  Iris-setosa

In [132]:
iris_data.dropna(inplace=True)  # drop the rows with missing values

In [133]:
# Finally, save the cleaned data
iris_data.to_csv(r'H:\python数据分析\数据\iris-clean-data.csv', index=False)

In [135]:
iris_data = pd.read_csv(open(r'H:\python数据分析\数据\iris-clean-data.csv'))
iris_data.head()

Out[135]:
   sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
0              5.1             3.5              1.4             0.2  Iris-setosa
1              4.9             3.0              1.4             0.2  Iris-setosa
2              4.7             3.2              1.3             0.2  Iris-setosa
3              4.6             3.1              1.5             0.2  Iris-setosa
4              5.0             3.6              1.4             0.2  Iris-setosa

In [136]:
iris_data.shape

Out[136]:
(144, 5)

Data exploration

In [137]:
sns.pairplot(iris_data, hue='class')  # scatter-plot matrix

Out[137]:
<seaborn.axisgrid.PairGrid at 0x113f96d8>

In [145]:
# boxplot() draws box plots; figsize sets the size of the canvas
iris_data.boxplot(column='petal_length_cm', by='class', grid=False, figsize=(6, 6))

(numpy's fromnumeric.py, line 57, emits a warning here:)
FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead

Out[145]:
<matplotlib.axes._subplots.AxesSubplot at 0x1359e668>

In [139]:
iris_data.boxplot?  # look up the help for boxplot
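The cleaning steps of the Iris example can be gathered into one reusable function. The sketch below is an illustration only: the function name clean_iris and the five sample rows are made up for this example, the column names are assumed to match those shown above, and the factor of 100 is treated as a unit correction, which the walkthrough applies but does not explain.

```python
import numpy as np
import pandas as pd

def clean_iris(df):
    """Apply the walkthrough's cleaning steps to a copy of df (hypothetical helper)."""
    df = df.copy()
    # 1. Unify the misspelled class labels.
    df["class"] = df["class"].replace({"versicolor": "Iris-versicolor",
                                       "Iris-setossa": "Iris-setosa"})
    # 2. Drop Iris-setosa rows whose sepal width is below 2.5 cm.
    df = df.loc[(df["class"] != "Iris-setosa") | (df["sepal_width_cm"] >= 2.5)].copy()
    # 3. Rescale Iris-versicolor sepal lengths below 1.0 (assumed to be a unit slip).
    too_small = (df["class"] == "Iris-versicolor") & (df["sepal_length_cm"] < 1.0)
    df.loc[too_small, "sepal_length_cm"] *= 100.0
    # 4. Drop rows that still contain missing values and renumber the rows.
    return df.dropna().reset_index(drop=True)

# Made-up rows that exercise every step; they are not taken from the real file.
sample = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.5, 0.067, 5.0, 6.3],
    "sepal_width_cm": [3.5, 2.3, 3.0, 3.4, 3.3],
    "petal_width_cm": [0.2, 0.3, 1.7, np.nan, 2.5],
    "class": ["Iris-setossa", "Iris-setosa", "versicolor",
              "Iris-setosa", "Iris-virginica"],
})

cleaned = clean_iris(sample)
print(cleaned)
```

Working on a copy keeps the raw table available for comparison, which makes it easier to check how many rows each step removed.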